https://github.com/glenrobson/iiif2annos
OCR the IIIF images in a manifest and generate annotations
- Host: GitHub
- URL: https://github.com/glenrobson/iiif2annos
- Owner: glenrobson
- License: mit
- Created: 2023-04-21T17:58:37.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-06-21T10:06:51.000Z (over 1 year ago)
- Last Synced: 2024-12-06T22:13:58.802Z (about 2 months ago)
- Topics: annotations, iiif
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 23
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# iiif2annos
Read a manifest, OCR the images, create AnnotationLists and add them to a copy of the manifest.

This tool uses the [tesseract](https://tesseract-ocr.github.io/) OCR engine. Ensure it is installed and on your $PATH before running the tool.
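A quick way to confirm the tesseract requirement is met is to look the binary up on `$PATH` from Python. This is a minimal sketch and not part of iiif2annos; the helper names `find_tesseract` and `tesseract_version` are hypothetical.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path to the binary, or None if it is not on $PATH."""
    return shutil.which(executable)


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    # Some older tesseract builds print the version on stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on $PATH; install it before running ocr.py")
    else:
        print(f"found {path}: {tesseract_version(path)}")
```

If the lookup returns `None`, install tesseract (and the language data you plan to pass to `--lang`) before running `ocr.py`.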
```
usage: ocr.py [-h] [--base-output-uri OUTPUTURI] [--lang LANG] [-c] manifest output

Read a manifest, OCR all the pages then adds the results as annotation lists

positional arguments:
  manifest              URL to Manifest file
  output                Output directory for annotation lists

options:
  -h, --help            show this help message and exit
  --base-output-uri OUTPUTURI
                        Output URI for annotations and annotation list
  --lang LANG           Language to pass to the OCR engine see: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
  -c, --confidence      Include OCR confidence value in text of the annotation?
```

This should work with both v2 and v3 manifests. For v2 manifests, AnnotationLists are created; for v3 manifests, AnnotationPages are created.
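The v2/v3 distinction can be illustrated with a short sketch. It assumes the version is detected from the manifest's `@context`, which is one common approach and not necessarily how `ocr.py` does it; the function names are hypothetical. The container shapes follow the IIIF Presentation 2 (`sc:AnnotationList` with `resources`) and Presentation 3 (`AnnotationPage` with `items`) conventions.

```python
V2_CONTEXT = "http://iiif.io/api/presentation/2/context.json"
V3_CONTEXT = "http://iiif.io/api/presentation/3/context.json"


def presentation_version(manifest):
    """Guess the IIIF Presentation API version (2 or 3) from @context."""
    context = manifest.get("@context", [])
    # @context may be a single string or a list of context URIs.
    contexts = [context] if isinstance(context, str) else list(context)
    if V3_CONTEXT in contexts:
        return 3
    if V2_CONTEXT in contexts:
        return 2
    raise ValueError("Unrecognised IIIF Presentation @context")


def empty_annotation_container(version, uri):
    """Build an empty AnnotationPage (v3) or AnnotationList (v2)."""
    if version == 3:
        return {
            "@context": V3_CONTEXT,
            "id": uri,
            "type": "AnnotationPage",
            "items": [],
        }
    return {
        "@context": V2_CONTEXT,
        "@id": uri,
        "@type": "sc:AnnotationList",
        "resources": [],
    }


if __name__ == "__main__":
    manifest = {"@context": V2_CONTEXT, "@type": "sc:Manifest"}
    version = presentation_version(manifest)
    # The URI below is a placeholder for illustration only.
    print(empty_annotation_container(version, "http://localhost:5500/example/list.json"))
```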
## Example
```
python iiif2annos/ocr.py --lang frk --base-output-uri http://localhost:5500/newspaper https://preview.iiif.io/cookbook/update_newspaper/recipe/0068-newspaper/newspaper_issue_1-manifest.json newspaper
```

Using these blogs as a guide:
* https://nanonets.com/blog/ocr-with-tesseract/#ocr-with-pytesseract-and-opencv
* https://pypi.org/project/pytesseract/
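
As a rough illustration of the approach those guides describe, the sketch below turns word-level OCR results into v2-style annotations. It assumes input shaped like the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, with parallel `text`, `conf`, `left`, `top`, `width` and `height` lists. The function name, the annotation URI pattern and the `sc:painting` motivation are assumptions for illustration, not taken from `ocr.py`.

```python
def words_to_annotations(data, canvas_uri, base_uri, include_confidence=False):
    """Convert tesseract word boxes into IIIF v2 (Open Annotation) dicts."""
    annotations = []
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # conf may arrive as int, float or str; -1 marks non-word rows.
        conf = float(data["conf"][i])
        if not word or conf < 0:
            continue
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        # Optionally append the confidence, similar in spirit to the -c flag.
        chars = f"{word} (conf: {conf:.0f})" if include_confidence else word
        annotations.append({
            "@id": f"{base_uri}/anno-{i}",  # hypothetical URI pattern
            "@type": "oa:Annotation",
            "motivation": "sc:painting",  # assumed motivation
            "resource": {
                "@type": "cnt:ContentAsText",
                "format": "text/plain",
                "chars": chars,
            },
            # Target the word's region on the canvas with a media fragment.
            "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        })
    return annotations


if __name__ == "__main__":
    sample = {
        "text": ["", "Hello", "World"],
        "conf": ["-1", 96, 91],
        "left": [0, 10, 60],
        "top": [0, 20, 20],
        "width": [200, 40, 45],
        "height": [50, 12, 12],
    }
    for anno in words_to_annotations(
        sample, "https://example.org/canvas/1", "https://example.org/annos", True
    ):
        print(anno["on"], anno["resource"]["chars"])
```

The box coordinates are in pixels of the image that was OCRed, so if that image is a different size from the canvas they would need scaling before being used in the `xywh` fragment.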