https://github.com/edsu/alto-words
simplistic calculation of the ratio of dictionary words to all words in a METS Alto OCR file
- Host: GitHub
- URL: https://github.com/edsu/alto-words
- Owner: edsu
- Created: 2011-03-17T03:04:45.000Z (about 15 years ago)
- Default Branch: master
- Last Pushed: 2022-02-21T03:03:16.000Z (over 4 years ago)
- Last Synced: 2025-03-31T22:25:14.902Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 88.9 KB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Alto words
This is a simplistic demonstration of how you can calculate the
ratio of dictionary words to all words in a METS Alto OCR XML file.
A dump of the English Wiktionary serves as the source for the dictionary.
The latest dump is used because it's available
and somewhat sizable: ~2 million words.
```sh
$ make dictionary.db
```
**Downloading the dump and creating the dictionary database will take a bit of time.**
Afterwards, the script `alto_words.py` can be used to compute the ratio of dictionary words.
```sh
$ make install
$ source ./.venv/bin/activate
$ python alto_words.py example.xml
```
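To give a rough idea of what such a calculation can look like, here is a minimal sketch. It is not the code in `alto_words.py`. It assumes that words live in the `CONTENT` attribute of ALTO `String` elements, and it uses a plain Python set in place of `dictionary.db`, whose layout is not described here. The function names `alto_tokens` and `dictionary_ratio` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def alto_tokens(xml_text):
    """Yield the CONTENT attribute of every ALTO String element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare local names so any ALTO namespace version matches.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            content = elem.get("CONTENT")
            if content:
                yield content


def dictionary_ratio(xml_text, dictionary):
    """Return (dictionary words) / (all words), or 0.0 if there are no words."""
    total = 0
    known = 0
    for token in alto_tokens(xml_text):
        # Strip surrounding punctuation and lowercase before the lookup.
        word = re.sub(r"^\W+|\W+$", "", token).lower()
        if not word:
            continue
        total += 1
        if word in dictionary:
            known += 1
    return known / total if total else 0.0


# Tiny hand-written ALTO fragment with one OCR error ("qvick").
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="The"/><SP/>
    <String CONTENT="qvick"/><SP/>
    <String CONTENT="brown"/><SP/>
    <String CONTENT="fox."/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Stand-in dictionary (assumption: the real one comes from the Wiktionary dump).
words = {"the", "quick", "brown", "fox"}
ratio = dictionary_ratio(sample, words)
print(ratio)
```

A low ratio suggests noisy OCR, though proper nouns and rare words will also lower it, so the number is best read as a rough signal.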