
# Cepheus
## giles-eco-cepheus

This repository contains Cepheus, the component of the Giles Ecosystem that extracts images and embedded text from PDF files.

The Giles Ecosystem is a distributed system that runs OCR on images and extracts images and text from PDF files. The system requires the following software:

* Apache Tomcat 8
* Apache Kafka
* Apache Zookeeper (required by Apache Kafka)
* Tesseract (https://github.com/tesseract-ocr/)
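
Before deploying, it can help to confirm that the network services listed above are actually running. The following is a minimal sketch of such a check and is not part of Cepheus: the class name `PreflightCheck` is hypothetical, and it assumes every service runs on `localhost` with its stock default port (2181 for Zookeeper, 9092 for Kafka, 8080 for Tomcat). Adjust hosts and ports to match your installation. Tesseract is a command-line tool rather than a network service, so this sketch does not cover it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Giles Ecosystem code base.
class PreflightCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host could not be resolved.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: all services run locally on their stock default ports.
        Map<String, Integer> services = new LinkedHashMap<>();
        services.put("Apache Zookeeper", 2181);
        services.put("Apache Kafka", 9092);
        services.put("Apache Tomcat", 8080);

        for (Map.Entry<String, Integer> service : services.entrySet()) {
            boolean up = isReachable("localhost", service.getValue(), 1000);
            System.out.printf("%-16s port %d: %s%n",
                    service.getKey(), service.getValue(),
                    up ? "reachable" : "NOT reachable");
        }
    }
}
```

An open port only shows that something is listening there. It does not prove that the listener is the expected service or that it is configured correctly.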

The components of the Giles Ecosystem are located in the following repositories:

* Giles: https://github.com/diging/giles-eco-giles-web (user-facing component for uploading files)
* Nepomuk: https://github.com/diging/giles-eco-nepomuk (file storage)
* Cepheus: https://github.com/diging/giles-eco-cepheus (this repository)
* Cassiopeia: https://github.com/diging/giles-eco-cassiopeia (OCR using Tesseract)

The above applications depend on libraries located in the following repositories:

* https://github.com/diging/giles-eco-requests
* https://github.com/diging/giles-eco-util

A Docker Compose file that sets up the Giles Ecosystem in Docker is available for testing and evaluation purposes: https://github.com/diging/giles-eco-docker