https://github.com/mmiklavc/scalable-ocr
Scalable Optical Character Recognition with Apache NiFi and Tesseract
- Host: GitHub
- URL: https://github.com/mmiklavc/scalable-ocr
- Owner: mmiklavc
- License: apache-2.0
- Created: 2016-04-09T16:15:16.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2016-08-11T15:26:39.000Z (almost 8 years ago)
- Last Synced: 2024-04-18T02:56:51.523Z (2 months ago)
- Language: Java
- Size: 6 MB
- Stars: 30
- Watchers: 2
- Forks: 22
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-nifi - mmiklavc/scalable-ocr - Scalable OCR with Apache NiFi and Tesseract (Processors and Bundles / Mailing List Best Of)
README
# Scalable OCR
Welcome to the project.

So much of our data is represented as human-readable scans of documents.
However, reading these scans one document at a time does not scale, and
the need to ingest large numbers of PDFs or scanned documents shows up
in almost all sectors. Inevitably, these scanned documents must be
converted to text for analysis. And since dealing with unstructured data
is one of the main selling points of a platform like Hadoop, we must be
able to convert large volumes of potentially large documents into a
textual representation. We will show you how to use scalable open source
tooling (Apache NiFi and Tesseract) to convert volumes of PDFs and
ingest them into a platform that will allow you to analyze this data at
scale.

# Modules
### Core Modules
- conversion - convert multi-page PDFs to single-page TIFF files
- preprocessing - image correction for better text extraction during OCR
- extraction - OCR images and output text

### Utility
- CLI - command line tool for manual pipeline process execution
- NiFi - custom processors for exposing the core modules via NiFi, along with a workflow template

# Developers
#### Cutting a release for ocr
```bash
mvn release:prepare -Dscm-connection.url= -Dscm-developer-connection.url=
```

**Note**: The main pom assumes "scm:git:" - simply pass in the URL portion as a build parameter as shown above.
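
As a rough sketch of how the two empty placeholders might be filled in, the snippet below composes the command with both URLs set and prints it instead of running Maven. The helper name `release_prepare_cmd` is hypothetical, the repository path is the illustrative `mmiklavc/myproject.git`, and the `git@github.com:` form of the read/write URL is an assumption based on GitHub's usual SSH syntax.

```bash
#!/usr/bin/env bash
# Sketch only: builds the release:prepare command line without executing Maven.
# release_prepare_cmd is a hypothetical helper, not part of this project.

release_prepare_cmd() {
  local conn_url="$1"      # read-only URL, without the "scm:git:" prefix
  local dev_conn_url="$2"  # read/write URL, without the "scm:git:" prefix
  printf 'mvn release:prepare -Dscm-connection.url=%s -Dscm-developer-connection.url=%s\n' \
    "$conn_url" "$dev_conn_url"
}

# Illustrative values (assumptions); substitute your own repository URLs.
release_prepare_cmd \
  "git://github.com/mmiklavc/myproject.git" \
  "git@github.com:mmiklavc/myproject.git"
```

Printing the command first makes it easy to check that neither URL carries the "scm:git:" prefix before running it for real.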
Examples: [maven scm](http://maven.apache.org/scm/git.html)
1. local git - file://localhost/foo/bar/mygitrepodir
1. github connection url (readonly) - git://github.com/mmiklavc/myproject.git
1. github developer connection url (read/write) - git@github.com:mmiklavc/myproject.git

Performing the release prepare will do the following high-level steps:
1. Change pom versions from X.X-SNAPSHOT to X.X
1. Commit the new poms for the release to Git
1. Tag the release commit in Git
1. Increment poms to a new SNAPSHOT version, e.g. from X.0-SNAPSHOT to X.1-SNAPSHOT
1. Commit the updated SNAPSHOT poms

*See [Maven release prepare](http://maven.apache.org/maven-release/maven-release-plugin/examples/prepare-release.html) documentation for more detail*
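
The version changes in the steps above can be illustrated with a small sketch. Both helper functions are hypothetical and only mimic the version strings involved. The release plugin itself rewrites the poms and prompts for the actual values, so treat the last-component increment here as an assumption that matches the X.0-SNAPSHOT to X.1-SNAPSHOT example.

```bash
#!/usr/bin/env bash
# Sketch only: mimics the version strings involved in release:prepare.
# Both helpers are hypothetical; the plugin itself edits the real poms.

# Strip the -SNAPSHOT suffix, e.g. 1.0-SNAPSHOT becomes 1.0
release_version() {
  printf '%s\n' "${1%-SNAPSHOT}"
}

# Bump the last numeric component and re-add the -SNAPSHOT suffix.
next_snapshot() {
  local release="${1%-SNAPSHOT}"
  local prefix="" last="$release"
  if [[ "$release" == *.* ]]; then
    prefix="${release%.*}."   # everything up to and including the last dot
    last="${release##*.}"     # the final component
  fi
  # 10# forces base 10 so a component like "09" is not read as octal
  printf '%s%d-SNAPSHOT\n' "$prefix" "$((10#$last + 1))"
}

current="1.0-SNAPSHOT"
echo "release version: $(release_version "$current")"
echo "next snapshot:   $(next_snapshot "$current")"
```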