Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/deckerego/docidx
A document indexing daemon that can populate Elasticsearch indexes with the contents and metadata of a number of document types including PDF, image scans, etc. Used to power Facile Search, however can be re-used for anything that requires search indexing for scanned documents.
elasticsearch full-text-search pdf-search scanned-documents search-engine
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/deckerego/docidx
- Owner: deckerego
- License: mpl-2.0
- Created: 2017-11-25T20:48:27.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2023-10-17T22:06:18.000Z (over 1 year ago)
- Last Synced: 2024-11-08T12:44:40.346Z (2 months ago)
- Topics: elasticsearch, full-text-search, pdf-search, scanned-documents, search-engine
- Language: Java
- Homepage:
- Size: 1.63 MB
- Stars: 1
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DocIndex
DocIndex is the batch process used to feed DocMag, a front-end to Elasticsearch
that makes server-side document searching simple.

## Requirements
DocIndex can be run directly on an OS, but running it within a Docker container
is recommended. The container is composed within the DocMag docker-compose.yml.

Usually you won't want to build and run docidx locally; instead, it is best to
run the Docker container published at: https://hub.docker.com/r/deckerego/docidx/

## Building and Testing Locally
Since docidx relies heavily on computer vision and image processing, it makes
heavy use of bindings to native libraries. Packaged Java distributions with
native libraries are a giant pain in the butt, hence leveraging Docker containers
to ship things by default. If you just want to get docidx up and running, Docker
will be the easiest way to go, but if you would like to tweak the code and run it
locally you will need to jump through some hoops to install the native libs.

docidx uses bindings for the OpenCV and Tesseract native libraries. The OpenCV
libraries are especially version-sensitive. To install the native Tesseract
libraries on macOS you can use Homebrew, as in:

    brew install tesseract

Unfortunately OpenCV 3.2 does not build properly under Homebrew. For macOS,
OpenCV needs to be built from source. This can be done with:

    wget https://github.com/opencv/opencv/archive/3.2.0.tar.gz
    tar xzf 3.2.0.tar.gz
    mkdir opencv-3.2.0/build
    cd opencv-3.2.0/build
    cmake .. -DBUILD_opencv_java=ON
    make
    make install

Linux distributions often ship with Tesseract and OpenCV 3.2, such as with
Ubuntu (Bionic):

    apt-get install tesseract-ocr libopencv3.2-jni

After the native libraries are installed, building and testing can be performed
locally with Maven and Spring Boot:

    mvn -DargLine="-Djava.library.path=/usr/local/share/OpenCV/java/" install

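If the build cannot load the native bindings (Java typically reports this as an `UnsatisfiedLinkError`), it can help to confirm that the OpenCV Java library really is in the directory passed as `java.library.path`. The sketch below is only an illustration: `find_opencv_java` is a made-up helper name, and it assumes the library file name starts with `libopencv_java`, as it does for stock OpenCV 3.2 builds.

```shell
# Hypothetical helper: print the first OpenCV Java native library found in
# the given directory, or return non-zero if none is present.
# Assumes the file name starts with "libopencv_java" (e.g. libopencv_java320.so).
find_opencv_java() {
  dir="$1"
  for f in "$dir"/libopencv_java*; do
    # An unmatched glob stays as the literal pattern, so test for existence.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}
```

For example, `find_opencv_java /usr/local/share/OpenCV/java` should print a path before you run the Maven command with that same directory.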
If you would also like to spin up a local Elasticsearch and Kibana instance for
testing, you can deploy both with Docker configs in the `tests/` directory:

    cd tests
    docker-compose up -d

## Searching and Querying Documents
To search within your documents, use DocMag, available at https://github.com/deckerego/docmag

You could also query Elasticsearch directly using the API or Kibana's dev tools. A query sent over the API might be:

    GET /docidx/_search
    {
      "query": {
        "simple_query_string": { "query": "water bill" }
      }
    }

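The same query can also be sent from a shell with curl. This is a minimal sketch, assuming Elasticsearch is reachable at `http://localhost:9200` (its default HTTP port; set `ES_URL` for your own deployment) and that the index is named `docidx` as in the request above. The `build_query` and `search_docs` function names are made up for illustration.

```shell
# Assumed Elasticsearch base URL; override ES_URL for your own deployment.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a simple_query_string request body for the given search terms.
# Backslashes and double quotes are escaped so the terms stay valid JSON.
build_query() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query":{"simple_query_string":{"query":"%s"}}}' "$escaped"
}

# Send the query to the docidx index and print the raw JSON response.
search_docs() {
  curl -s -X GET "$ES_URL/docidx/_search" \
    -H 'Content-Type: application/json' \
    -d "$(build_query "$1")"
}
```

Calling `search_docs "water bill"` prints the raw JSON response; matching documents, if any, appear under `hits.hits`.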