https://github.com/glenrobson/iiif2annos
OCR the IIIF images in a manifest and generate annotations
- Host: GitHub
- URL: https://github.com/glenrobson/iiif2annos
- Owner: glenrobson
- License: mit
- Created: 2023-04-21T17:58:37.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-11T01:06:10.000Z (8 months ago)
- Last Synced: 2025-03-19T19:49:09.195Z (7 months ago)
- Topics: annotations, iiif
- Language: Python
- Homepage:
- Size: 375 KB
- Stars: 24
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# iiif2annos
Read a manifest, OCR the images, create AnnotationLists and add them to a copy of the manifest.

This tool uses the [tesseract](https://tesseract-ocr.github.io/) OCR engine. Ensure you have this installed and on your $PATH before running the tool.
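A quick way to confirm that tesseract is reachable is sketched below. This is only an illustration and not part of the tool: the helper name `tesseract_on_path` is hypothetical, and it relies only on the standard library's `shutil.which`.

```python
import shutil


def tesseract_on_path(executable: str = "tesseract") -> bool:
    """Return True if the named executable can be found on $PATH.

    Hypothetical helper for illustration; iiif2annos itself may check
    for tesseract differently, or not at all.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if tesseract_on_path():
        print("tesseract found at", shutil.which("tesseract"))
    else:
        print("tesseract not found; install it and add it to your $PATH")
```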
```
usage: ocr.py [-h] [--base-output-uri OUTPUTURI] [--lang LANG] [-c] manifest output

Read a manifest, OCR all the pages then adds the results as annotation lists

positional arguments:
  manifest              URL to Manifest file
  output                Output directory for annotation lists

options:
  -h, --help            show this help message and exit
  --base-output-uri OUTPUTURI
                        Output URI for annotations and annotation list
  --lang LANG           Language to pass to the OCR engine see: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
  -c, --confidence      Include OCR confidence value in text of the annotation?
```

This should work with both v2 and v3 manifests. For v2 manifests, AnnotationLists are created; for v3 manifests, AnnotationPages are created.
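To illustrate the v2/v3 distinction, here is a minimal sketch of how a manifest's version could be detected from its `@context` and how the matching empty container could be built. It is not taken from `ocr.py`: the function names `presentation_version` and `empty_container` are hypothetical, and the property names follow the IIIF Presentation API 2 and 3 specifications.

```python
V2_CONTEXT = "http://iiif.io/api/presentation/2/context.json"
V3_CONTEXT = "http://iiif.io/api/presentation/3/context.json"


def presentation_version(manifest: dict) -> int:
    """Guess the IIIF Presentation API version from the manifest's @context."""
    context = manifest.get("@context", [])
    # @context may be a single string or a list of contexts.
    contexts = [context] if isinstance(context, str) else list(context)
    if V3_CONTEXT in contexts:
        return 3
    if V2_CONTEXT in contexts:
        return 2
    raise ValueError("Unrecognised IIIF Presentation @context")


def empty_container(manifest: dict, uri: str) -> dict:
    """Build an empty AnnotationPage (v3) or AnnotationList (v2) for a manifest."""
    if presentation_version(manifest) == 3:
        return {
            "@context": V3_CONTEXT,
            "id": uri,
            "type": "AnnotationPage",
            "items": [],  # v3 annotations go here
        }
    return {
        "@context": V2_CONTEXT,
        "@id": uri,
        "@type": "sc:AnnotationList",
        "resources": [],  # v2 annotations go here
    }


# Example with placeholder URIs (example.org is not a real manifest host).
v3_manifest = {"@context": V3_CONTEXT, "id": "https://example.org/manifest", "type": "Manifest"}
print(empty_container(v3_manifest, "https://example.org/annos/page1.json")["type"])  # AnnotationPage
```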
## Example
```
python iiif2annos/ocr.py --lang frk --base-output-uri http://localhost:5500/newspaper https://preview.iiif.io/cookbook/update_newspaper/recipe/0068-newspaper/newspaper_issue_1-manifest.json newspaper
```

Using these blogs as a guide:
* https://nanonets.com/blog/ocr-with-tesseract/#ocr-with-pytesseract-and-opencv
* https://pypi.org/project/pytesseract/

# Testing
Unit tests are in the tests folder and can be run with:

    python -m unittest discover -s tests

To run a single test:

    python -m unittest tests.testImages.TestImage.testCanvasImageMissmatch