
Index a document

indexfile

POST https://bezillion.com/api/v1/indexfile?login=&password=

login: Your identification code.
password: Your password.

multipart/form-data

file: Content of the document or the image in binary.
stored: Keep a copy of the text.
ocr: Extract text with the OCR.
lang: Language of the text.
psm: Page segmentation mode.
firstpage: First page to process in a PDF.
lastpage: Last page to process in a PDF.
resolution: Resolution in dpi of the image generated for each page of a PDF.
images: Directly extract just the images in a PDF.
rotate: Rotate images.
crop: Crop images. Limit the plain text read in a PDF to an area.
resize: Resize images.
negate: Invert colors.
normalize: Add contrast to the colors.
colorspace: Convert to grayscale.
unsharp: Sharpen the contours.
dots: Remove white dots.

stored: 0 - don't keep a copy of the text extracted from the document or the image within the index. Without a copy of the text, highlighting excerpts in a search is not possible.

lang - language of the text: eng, fra, deu, spa, ita or rus. Specify several languages by separating them with a +, e.g. eng+fra. NOTE: The order is important.
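
A minimal sketch of building the lang value from a list of language codes. join_langs is a hypothetical helper name, not part of the service. It keeps the codes in the order given, since the order is significant.

```shell
# Hypothetical helper: join language codes with "+", keeping the order given.
join_langs() {
    # Run in a subshell so the change to IFS does not leak;
    # "$*" joins the arguments on the first character of IFS.
    ( IFS=+; printf '%s\n' "$*" )
}

join_langs eng fra
# prints eng+fra
```

The result can be passed to curl as -F "lang=$(join_langs eng fra)".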

psm - Page Segmentation Mode: 1 - Automatic page segmentation with OSD (Orientation and Script Detection), 3 - Fully automatic page segmentation, but no OSD, 4 - Assume a single column of text of variable sizes, 6 - Assume a single uniform block of text.

ocr: 1 - force reading a PDF which contains plain text with the OCR.

Specify the extraction mode of each page of a PDF:

firstpage : number of the first page to process,
lastpage : number of the last page to process,
resolution : resolution of the image generated in dpi - 50, 75, 100, 125, 150 or 200. IMPORTANT: If a page contains only one image and no text, the image is systematically directly extracted from the document.
images : 1 - directly extract only the images.
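
The page options above can be checked before sending a request. The sketch below is only an illustration: valid_resolution and pdf_fields are hypothetical helper names, and the list of accepted resolutions is copied from the description above.

```shell
# Succeeds only for the resolutions listed in the documentation.
valid_resolution() {
    case $1 in
        50|75|100|125|150|200) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the form fields for a page range: pdf_fields FIRST LAST DPI.
# Fails with a message on an unsupported resolution.
pdf_fields() {
    valid_resolution "$3" || { echo "unsupported resolution: $3" >&2; return 1; }
    printf 'firstpage=%s\nlastpage=%s\nresolution=%s\n' "$1" "$2" "$3"
}

pdf_fields 1 3 150
```

Each printed line can be passed to curl with its own -F option.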

Activate the processing options of each image before analysis:

rotate : 180 to flip the image, 90 or -90 to rotate it to the right or to the left,
crop : limit reading the image or the text to the area specified by a width and a height separated by an x from a position specified by x and y coordinates preceded by a + in pixels for the given resolution, e.g. 640x200+50+80,
resize : resize the image by 50, 75, 125, 150 or 200 %,
negate : 1 - invert colors,
normalize : 1 - add contrast to the colors,
colorspace : 1 - convert the image to grayscale,
unsharp : 1 - sharpen the contours,
dots : 1 - remove white dots.

IMPORTANT: Image processing options are run in the above order.
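
A minimal sketch of building the value of the crop parameter. crop_geometry and pt_to_px are hypothetical helper names. pt_to_px assumes the area was measured on a PDF page in points (1/72 inch) and converts it to pixels at the resolution used for the page image.

```shell
# Convert a length in PDF points (1/72 inch) to pixels at a resolution
# in dpi, rounded to the nearest pixel: pt_to_px POINTS DPI.
pt_to_px() {
    echo $(( ($1 * $2 + 36) / 72 ))
}

# Build the crop value WIDTHxHEIGHT+X+Y from four pixel values.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

crop_geometry 640 200 50 80
# prints 640x200+50+80
```

Because the coordinates depend on the resolution, a crop area chosen at 100 dpi has to be recomputed if the resolution parameter changes.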

To understand the effects of these parameters, test them in the interface of your personal space.

The quick brown fox
jumps over
the lazy dog.
$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "file=@fox.txt" -o -
{"status":"success","data":null}
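
In a script, the response can be checked before continuing. The sketch below only tests for the success status shown in the response above. is_success is a hypothetical helper name, and the shape of an error response is not documented here, so anything else is treated as a failure. With --fail, curl also exits with a non-zero code when the server returns an HTTP error.

```shell
# Succeeds when a response body reports "status":"success".
is_success() {
    printf '%s' "$1" | grep -q '"status" *: *"success"'
}

if is_success '{"status":"success","data":null}'; then
    echo "indexed"
fi
# prints indexed
```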

Look for the term fox in your index:

The quick brown fox jumps over the lazy dog.

Get the MIME type of the file:

$ file -b --mime-type fox.odt
application/vnd.oasis.opendocument.text

Pass the MIME type of the file with the content of the document:

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "file=@fox.odt; type=application/vnd.oasis.opendocument.text" -o -
{"status":"success","data":null}

The text of the document is extracted with Tika.

fox.jpg

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "lang=eng" -F "psm=6" -F "file=@fox.jpg" -o -
{"status":"success","data":null}

The text of the image is read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language.

sushi.png

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "lang=eng+jpn+spa" -F "psm=6" -F "file=@sushi.png" -o -
{"status":"success","data":null}

The text of the image is read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English, Japanese and Spanish languages.

Look for the term pollo in your index:

I eat すし y Pollo

Try looking for the terms eat or すし:

I eat すし y Pollo

xof.jpg

This PDF contains an image which is upside down.

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "lang=eng" -F "psm=6" -F "rotate=180" -F "file=@xof.pdf" -o -
{"status":"success","data":null}

The image in the PDF is automatically extracted as is, flipped and read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language.

The PDF legal_en.pdf is the result of the Print to file function of the browser on the Legal information page of the Collaboractor website. NOTE: If you upload this PDF in the interface of your personal space, you can directly retrieve the plain text it contains without analyzing it.

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "file=@legal_en.pdf" -o -
{"status":"success","data":null}

The text is extracted from the PDF with Poppler. The layout of the text is preserved. The pages are separated by a form feed character (0x0C).
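
Since the pages are separated by form feeds, the extracted text can be split back into pages. A minimal sketch, assuming you have the extracted text on standard input, for example saved from the interface of your personal space. count_pages is a hypothetical helper name.

```shell
# Count the pages in a text read from standard input,
# using the form feed character as the record separator.
count_pages() {
    awk 'BEGIN { RS = "\f" } END { print NR }'
}

printf 'page one\fpage two\fpage three' | count_pages
# prints 3
```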

Compare the result by forcing the usage of the OCR:

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "ocr=1" -F "lang=eng" -F "psm=6" -F "resolution=100" -F "file=@legal_en.pdf" -o -
{"status":"success","data":null}

Each page in the PDF is converted to an image with Ghostscript at a resolution of 100 dpi and read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language. NOTE: If you run the same operation in the interface of your personal space, you can retrieve the image of the first page and the text read by the OCR.

Look for the term legal in your index.

invoice.png

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "lang=eng" -F "psm=6" -F "crop=200x100+50+230" -F "resize=125" -F "unsharp=1" -F "file=@invoice.png" -o -
{"status":"success","data":null}

The image is cropped around the billing information, resized to 125% with sharpened contours and read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language. NOTE: If you run the same operation in the interface of your personal space, you can check how the image is cropped.

Look for the terms +horne +"new york" in your index:

Jessie M Horne 4312 Wood Road New York, NY 10031

Download the code of the sendpost and file_mime_type functions from the iZend library. Copy the files into the directory of your application.

NOTE: See the page Call the service API for a description of the sendpost and file_mime_type functions.

Add the file indexfile.php with the following content:

require_once 'sendhttp.php';
require_once 'filemimetype.php';

Loads the code of the sendpost and file_mime_type functions.

function indexfile($login, $password, $file, $params=false) {

Defines the function indexfile. $login is your identification code. $password is your password. $file is the pathname of the file to index. $params is an associative array containing the names and the values of the optional parameters: the extraction mode of each page of a PDF, the processing options of each image before analysis, the language of the text and the page segmentation mode for the OCR, e.g. array('resize' => 125, 'psm' => 6, 'lang' => 'eng').

    $curl = 'https://bezillion.com/api/v1/indexfile' . '?' . 'login=' . urlencode($login) . '&' . 'password=' . urlencode($password);

Sets $curl to the URL of the indexfile action of the API with the identification code and the password of the user's account. $login and $password must be escaped.

    $files=array('file' => array('name' => basename($file), 'tmp_name' => $file, 'type' => file_mime_type($file)));

Prepares the list of files attached to the POST: file - the document to index with the name of the file, the pathname of the file and its MIME type.

    $response=sendpost($curl, $params, $files);

Sends the HTTP request with sendpost. The arguments login and password are already in $curl.

    if (!$response) {
        return false;
    }

If $response is false, the server is unreachable. indexfile returns false.

    if ($response[0] == 200) {
        return true;
    }

If $response[0] contains the HTTP return code 200 Ok, indexfile returns true.

    return false;
}

If $response[0] contains an HTTP return code other than 200 Ok, an execution error has occurred. indexfile returns false.
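
For comparison, the same request can be wrapped in a shell function around the curl commands shown earlier. This is only a sketch: the shell function indexfile below and the INDEXFILE_CURL variable are hypothetical, and the login and the password are assumed to contain only URL-safe characters because no escaping is done.

```shell
# Hypothetical shell counterpart of the PHP indexfile function.
# Usage: indexfile LOGIN PASSWORD FILE [name=value ...]
# Assumes LOGIN and PASSWORD need no URL escaping.
# Set INDEXFILE_CURL to another command to inspect the arguments
# instead of sending the request.
indexfile() {
    url="https://bezillion.com/api/v1/indexfile?login=$1&password=$2"
    path=$3
    shift 3
    [ -r "$path" ] || { echo "cannot read $path" >&2; return 1; }
    # Detect the MIME type; fall back to a generic type if file(1) fails.
    mime=$(file -b --mime-type "$path" 2>/dev/null) || mime=application/octet-stream
    # Turn each remaining name=value argument into a -F form field.
    for field do
        set -- "$@" -F "$field"
        shift
    done
    "${INDEXFILE_CURL:-curl}" -s --fail --show-error -X POST "$url" \
        -F "file=@$path; type=$mime" "$@" -o -
}
```

A call such as indexfile abcdef ABCDEF fox.pdf lang=eng psm=6 sends the file with the two options. The function returns the exit code of curl, which is non-zero on an HTTP error because of --fail.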

EXAMPLE

Assuming you have saved the files sendhttp.php, filemimetype.php and indexfile.php in the current directory, run PHP in interactive mode, load the indexfile function and call it with your identification code and password, the pathname of the file and the other options in an associative array:

$ php -a
php > require_once 'indexfile.php';
php > $r=indexfile('abcdef', 'ABCDEF', 'fox.pdf', array('lang' => 'eng', 'psm' => 6));
php > echo $r ? 'Ok' : 'Ko';
php > Ok
php > quit
SEE ALSO

Call the service API, Search for a document
