Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/elbakerino/abc-soup
Simple OCR API with tesseract 5 in python
https://github.com/elbakerino/abc-soup
ocr tesseract tesseract-ocr
Last synced: 25 days ago
JSON representation
Simple OCR API with tesseract 5 in python
- Host: GitHub
- URL: https://github.com/elbakerino/abc-soup
- Owner: elbakerino
- License: mit
- Created: 2023-10-12T12:32:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-23T13:06:17.000Z (3 months ago)
- Last Synced: 2024-08-23T15:08:50.357Z (3 months ago)
- Topics: ocr, tesseract, tesseract-ocr
- Language: Python
- Homepage:
- Size: 49.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ABC-Soup - an OCR API
[![Github actions Build](https://github.com/elbakerino/abc-soup/actions/workflows/blank.yml/badge.svg)](https://github.com/elbakerino/abc-soup/actions)
Simple Tesseract 5 API in python, using [tessdata_best](https://github.com/tesseract-ocr/tessdata_best/) and dockerized setup.
Endpoints:
- `GET:/info` get tesseract version and available languages
- `POST:/ocr` perform OCR on a single image
- `POST:/ocr-batch` perform OCR on multiple images
- `POST:/ocr-to-pdf` perform OCR on a single image and create a searchable PDF> Upload the file in the field `file`, the batch endpoint uses all given files. See below [JS Client example](#js-client-example).
Options for OCR endpoints:
- `optimize_images: true` applies some binarization and contrast optimizations
- `save_intermediate: false` stores intermediate image processing files to `/app/shared-assets`
- `intra_block_breaks: true` adds line breaks when text is below each other in a single block (except PDF endpoint)
- `keep_details: false` when `true` returns data per page, block and box; when `false` only the combined content per page (except PDF endpoint)
- [tesseract](https://tesseract-ocr.github.io/tessdoc) options:
- `lang` the languages used for inference, defaults to `eng+deu`
- `psm` when not none passed as `--psm` to tesseract, [see docs](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#page-segmentation-method)
- `preserve_interword_spaces` when not none passed as `-c preserve_interword_spaces=` to tesseract> Options: send options as JSON in field name `options` (serialized additionally, as multipart upload).
>
> Image optimizations: in [helpers/optimize_image.py](./src/helpers/optimize_image.py) are a few input-quality assumptions noted and (open) todos which may yield better results.For further options and mount points checkout [the docker example](#docker-example).
## Docker Example
The ready to use docker image includes models `eng` and `deu`:
```yaml
# docker-compose.yml
services:
abc-soup:
image: ghcr.io/elbakerino/abc-soup:latest
environment:
PORT: 80
APP_ENV: local
#GUN_W: 2 # control gunicorn workers
#DEFAULT_LANG: deu+eng # control the default `lang` for tesseract
volumes:
- ./shared-data:/app/shared-assets
ports:
- "8730:80"
```Models must be saved in `/usr/share/tesseract-ocr/5/tessdata` - thus included in docker image itself, see [Dockerfile](./Dockerfile), for all available best-models see [github.com/tesseract-ocr/tessdata_best](https://github.com/tesseract-ocr/tessdata_best).
> todo: find out how to configure the "init only" config values https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html#useful-parameters especially for checking out `load_system_dawg`.
## JS Client Example
Example in Javascript with extra options:
```js
const data = {
file: file, // the File object from e.g. input field
options: JSON.stringify({
keep_details: true,
lang: 'eng+deu',
psm: 4,
}),
}
const formData = new FormData()
for(const prop in data) {
formData.append(prop, data[prop])
}
// send with e.g. `XMLHttpRequest`
const xhr = new XMLHttpRequest()
xhr.open('POST', 'http://localhost:8730/ocr', true)
xhr.send(formData)
xhr.addEventListener('readystatechange', () => {
if(xhr.readyState !== 4) return
if(xhr.status === 200) {
const result = JSON.parse(xhr.response)
console.log(result.outcome)
} else {
throw new Error('OCR failed')
}
})
```## Dev Notes
Install deps for IDE support (or set up a remote interpreter):
```shell
pip install -r requirements.txt
```Start the dev-server with docker compose:
```shell
docker compose up# or go into the container:
# docker compose run --rm api bash
```- Service Home: [localhost:8730](http://localhost:8730)
- OpenAPI Docs: [localhost:8730/docs](http://localhost:8730/docs) (WIP 🚧)Notice that the image requires rebuilding when changing e.g. deps:
```shell
docker compose up --build
# or:
# docker compose build
```## See also
- [Simple end-2-end flow with potential gotchas demonstrated](https://nanonets.com/blog/ocr-with-tesseract/)
- Example documents found with image search:
- from https://www.researchgate.net/publication/293322356_Are_Your_Digital_Documents_Web_Friendly_Making_Scanned_Documents_Web_Accessible
- https://i1.rgstatic.net/publication/293322356_Are_Your_Digital_Documents_Web_Friendly_Making_Scanned_Documents_Web_Accessible/links/574846a408ae2301b0b97d32/largepreview.png
- from https://www.researchgate.net/publication/307516380_Signature_line_detection_in_scanned_documents
- https://i1.rgstatic.net/publication/307516380_Signature_line_detection_in_scanned_documents/links/5aa951ef458515178818bc87/largepreview.png## License
See [LICENSE](LICENSE).