https://github.com/deckerego/docidx

A document indexing daemon that populates Elasticsearch indexes with the contents and metadata of several document types, including PDFs and image scans. It powers Facile Search, but it can be reused for anything that requires search indexing of scanned documents.

elasticsearch full-text-search pdf-search scanned-documents search-engine

Host: GitHub
URL: https://github.com/deckerego/docidx
Owner: deckerego
License: mpl-2.0
Created: 2017-11-25T20:48:27.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2023-10-17T22:06:18.000Z (about 2 years ago)
Last Synced: 2024-12-20T02:07:41.097Z (10 months ago)
Topics: elasticsearch, full-text-search, pdf-search, scanned-documents, search-engine
Language: Java
Homepage:
Size: 1.63 MB
Stars: 1
Watchers: 3
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# DocIndex

DocIndex is the batch process used to feed DocMag, a front-end to Elasticsearch
that makes server-side document searching simple.

## Requirements

DocIndex can be run directly on the host OS, but running it within a Docker
container is recommended. The container is composed within the DocMag docker-compose.yml.

Usually you won't want to build and run docidx locally; instead, it is best to
run the Docker container published at: https://hub.docker.com/r/deckerego/docidx/

## Building and Testing Locally

Since docidx relies heavily on computer vision and image processing, bindings to
native libraries are heavily used. Packaged Java distributions with native libraries
are a giant pain in the butt, which is why Docker containers are used to ship things
by default. If you just want to get docidx up and running, Docker will be the easiest
way to go, but if you would like to tweak the code and run it locally, you will
need to jump through some hoops to install the native libs.

docidx uses bindings for the OpenCV and Tesseract native libraries. The OpenCV
libraries are especially version-sensitive. To install the native Tesseract libraries
on macOS you can use Homebrew, as in:

    brew install tesseract

Unfortunately, OpenCV 3.2 does not build properly under Homebrew. On macOS,
OpenCV needs to be built from source. This can be done with:

    wget https://github.com/opencv/opencv/archive/3.2.0.tar.gz
    tar xzf 3.2.0.tar.gz
    mkdir opencv-3.2.0/build
    cd opencv-3.2.0/build
    cmake .. -DBUILD_opencv_java=ON
    make
    make install

Linux distributions often package Tesseract and OpenCV 3.2. On Ubuntu (Bionic),
for example:

    apt-get install tesseract-ocr libopencv3.2-jni

After the native libraries are installed, building and testing can be performed
locally with Maven and Spring Boot:

    mvn -DargLine="-Djava.library.path=/usr/local/share/OpenCV/java/" install
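
If the build fails while loading native code, it can help to confirm that the
OpenCV JNI library is actually present on `java.library.path`. The following is
a minimal sketch, not part of docidx: the class name `NativeLibCheck` is
hypothetical, and the library name `opencv_java320` is an assumption based on
the usual naming of OpenCV 3.2.0 Java builds, so adjust it if your build
produces a different file name.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of docidx.
public class NativeLibCheck {

    // Look in each directory of a path-separator-delimited library path for
    // the platform-specific file name of the given library.
    public static Optional<Path> findLibrary(String libraryPath, String libraryName) {
        // Maps e.g. "foo" to "libfoo.so" on Linux or "libfoo.dylib" on macOS.
        String fileName = System.mapLibraryName(libraryName);
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String libraryPath = System.getProperty("java.library.path", "");
        // Assumed library name for an OpenCV 3.2.0 Java build.
        Optional<Path> found = findLibrary(libraryPath, "opencv_java320");
        System.out.println(found
                .map(p -> "Found " + p)
                .orElse("opencv_java320 not found on java.library.path"));
    }
}
```

Run it with the same `-Djava.library.path=...` value that you pass to Maven to
see whether that directory contains the library.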

If you would also like to spin up a local Elasticsearch and Kibana instance for
testing, you can deploy both with Docker configs in the `tests/` directory:

    cd tests
    docker-compose up -d

## Searching and Querying Documents

To search within your documents, use DocMag, available at https://github.com/deckerego/docmag

You could also query Elasticsearch directly using the API or Kibana's dev tools. A query sent over the API might be:

    GET /docidx/_search
    {
      "query": {
        "simple_query_string": { "query": "water bill" }
      }
    }
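
The same query can also be sent from Java. The following is a minimal sketch
using the JDK's built-in HTTP client (Java 11 or later), not code from docidx:
the class name `DocidxQuery` is hypothetical, and the base URL
`http://localhost:9200` assumes Elasticsearch is reachable on its default port.
The request uses POST because the search endpoint accepts a request body with
either GET or POST.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical example class, not part of docidx.
public class DocidxQuery {

    // Escape characters that would break a JSON string literal.
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build the simple_query_string body shown above.
    public static String buildQueryBody(String terms) {
        return "{\"query\":{\"simple_query_string\":{\"query\":\""
                + escapeJson(terms) + "\"}}}";
    }

    // Build a search request against the docidx index.
    public static HttpRequest buildRequest(String baseUrl, String terms) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/docidx/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildQueryBody(terms)))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed default; pass a different base URL as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9200";
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(buildRequest(baseUrl, "water bill"),
                            HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            System.err.println("Could not reach Elasticsearch at " + baseUrl + ": " + e);
        }
    }
}
```

The response body is the raw JSON returned by Elasticsearch; parsing the hits
is left to whichever JSON library you already use.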
