Upload a LibreOffice or a Windows document, a PDF or a JPG, PNG or GIF image and extract the text in it. Index the text with Solr. Find by relevance the documents corresponding to a precise or an approximate text, a language, a date, a file name or a file size. Refine a query by following suggestions. Display the list of terms indexed by a document. Find documents similar to a document. Get a full report on all the terms in your index by language with for each term, the number of documents matching it and its maximum frequency in a document.

Configure how images are extracted from a PDF and how images are prepared for the OCR (resolution, orientation, coloring, contrast, brightness, resizing, cropping, etc.) and reuse this set of parameters by program with the API.

The quick brown fox
jumps over
the lazy dog.

Search

 q

 fq

 2 • 

2022-11-28 02:00 45 en fox.txt

The quick brown fox
jumps over
the lazy dog.

2022-11-28 02:00 30,9k en fox.jpg

The quick brown fox
jumps over
the lazy dog.

Download the list of the terms which are indexed by a document. Run a search of proximity to display the list of the documents similar to a document of reference and the terms in common. Click on the caret to toggle the display of extracts.

xof.jpg

This PDF contains an image which is upside down.

$ curl -s --fail --show-error -X POST "https://bezillion.com/api/v1/indexfile?login=abcdef&password=ABCDEF" -F "lang=eng" -F "psm=6" -F "rotate=180" -F "file=@xof.pdf" -o -
{"status":"success","data":null}

The image in the PDF is automatically extracted as is, flipped and read with Tesseract in mode 6 - Assume a single uniform block of text - with the trained data for the English language.

Index a document

All functionalities are available for free in the interface of your personal space. Indexing a file by program with the API is a paid service. All the other operations, like searching for a file, are for free. See the User's Guide.

Solr is an open-source search engine platform using the Lucene library.

Lucene is a library of functions for indexing and searching text.

Tika is a toolbox which can detect and extract the text and the metadata of more than a thousand types of documents.

Tesseract is an open-source optical character recognition engine sponsored by Google since 2006.

Ghostscript is a suite of software for processing Postscript and PDF files.

Poppler provides a set of commands for extracting the pages, the text and the images of PDF files.

All communications are encrypted.

Access to your index is protected. The files you upload are inaccessible to others and the files which are processed and generated by the API are automatically deleted.

You wish to add indexing and searching documents in your web service? bezillion.com is a software developed by an editor open to sharing knowledge and code. To contact mcPaLo, click here.