{"id":20703810,"url":"https://github.com/tesseract-ocr/tesstrain","last_synced_at":"2025-05-16T15:05:11.310Z","repository":{"id":42040878,"uuid":"131282419","full_name":"tesseract-ocr/tesstrain","owner":"tesseract-ocr","description":"Train Tesseract LSTM with make","archived":false,"fork":false,"pushed_at":"2024-06-04T11:51:18.000Z","size":13817,"stargazers_count":648,"open_issues_count":57,"forks_count":192,"subscribers_count":26,"default_branch":"main","last_synced_at":"2024-12-21T19:33:12.724Z","etag":null,"topics":["ocr","tesseract","training"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tesseract-ocr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-27T10:30:54.000Z","updated_at":"2024-12-19T19:05:58.000Z","dependencies_parsed_at":"2022-07-15T16:37:40.702Z","dependency_job_id":"4af115dd-4172-4784-9692-b680913c19c1","html_url":"https://github.com/tesseract-ocr/tesstrain","commit_stats":{"total_commits":231,"total_committers":29,"mean_commits":"7.9655172413793105","dds":0.748917748917749,"last_synced_commit":"dba332e0aba0935430f674e8900715bea770b5b8"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tesseract-ocr%2Ftesstrain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tesseract-ocr%2Ftesstrain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tesseract-ocr%2Ftesstrain/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tesseract-ocr%2Ftesstrain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tesseract-ocr","download_url":"https://codeload.github.com/tesseract-ocr/tesstrain/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248564888,"owners_count":21125412,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","tesseract","training"],"created_at":"2024-11-17T01:09:38.305Z","updated_at":"2025-05-16T15:05:11.302Z","avatar_url":"https://github.com/tesseract-ocr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tesstrain\n\n\u003e Training workflow for Tesseract 5 as a Makefile for dependency tracking.\n\n* [Installation](#installation)\n    * [Auxiliaries](#auxiliaries)\n    * [Leptonica, Tesseract](#leptonica-tesseract)\n       * [Windows](#windows)\n    * [Python](#python)\n    * [Language data](#language-data)\n* [Usage](#usage)    \n    * [Choose the model name](#choose-the-model-name)\n    * [Provide ground truth data](#provide-ground-truth-data)\n    * [Train](#train)\n    * [Change directory assumptions](#change-directory-assumptions)\n    * [Make model files (traineddata)](#make-model-files-traineddata)\n    * [Plotting CER](#plotting-cer)\n* [License](#license)\n\n## Installation\n\n### Auxiliaries\n\nYou will need at least GNU `make` (minimal version 4.2), `wget`, `find`, `bash`, and `unzip`.\n\n### Leptonica, Tesseract\n\nYou will need a recent version (\u003e= 5.3) of tesseract built with 
the\ntraining tools and matching leptonica bindings.\n[Build](https://tesseract-ocr.github.io/tessdoc/Compiling)\n[instructions](https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation)\nand more can be found in the [Tesseract User Manual](https://tesseract-ocr.github.io/tessdoc/).\n\n#### Windows\n\n  1. Install the latest tesseract (e.g. from https://digi.bib.uni-mannheim.de/tesseract/), and make sure that tesseract is added to your PATH.\n  2. Install [Python 3](https://www.python.org/downloads/)\n  3. Install [Git for Windows](https://gitforwindows.org/), which provides many Linux utilities on Windows (e.g. `find`, `unzip`, `rm`), and put `C:\\Program Files\\Git\\usr\\bin` at the beginning of your PATH variable (temporarily, you can do this in `cmd` with `set PATH=C:\\Program Files\\Git\\usr\\bin;%PATH%`). Unfortunately, several Windows tools share a name with their Linux counterparts (`find`, `sort`) but behave differently, so they must be avoided during training.\n  4. Install winget/[Windows Package Manager](https://github.com/microsoft/winget-cli/releases/) and then run `winget install ezwinports.make` and `winget install wget` to install missing tools.\n\n### Python\n\nYou need a recent version of Python 3.x. For image processing the Python library `Pillow` is used.\nIf Pillow is not installed globally, install it from the provided requirements file with `pip install -r requirements.txt`.\n\n\n### Language data\n\nTesseract expects some configuration data (a file `radical-stroke.txt` and `*.unicharset` for all scripts) in `DATA_DIR`.\nTo fetch them:\n\n    make tesseract-langdata\n\n(While this step is only needed once and is implicitly included in the `training` target,\nyou might want to run it explicitly beforehand.)\n\n## Usage\n\n### Choose the model name\n\nChoose a name for your model. 
By convention, Tesseract stack models including\nlanguage-specific resources use (lowercase) three-letter codes defined in\n[ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) with additional\ninformation separated by underscore. E.g., `chi_tra_vert` for **tra**ditional\nChinese with **vert**ical typesetting. Language-independent (i.e. script-specific)\nmodels use the capitalized name of the script type as an identifier. E.g.,\n`Hangul_vert` for Hangul script with vertical typesetting. In the following,\nthe model name is referenced by `MODEL_NAME`.\n\n### Provide ground truth data\n\nPlace ground truth consisting of line images and transcriptions in the folder\n`data/MODEL_NAME-ground-truth`. This list of files will be split into training and\nevaluation data, the ratio is defined by the `RATIO_TRAIN` variable.\n\nImages must be TIFF and have the extension `.tif` or PNG and have the\nextension `.png`, `.bin.png`, or `.nrm.png`.\n\nTranscriptions must be single-line plain text and have the same name as the\nline image but with the image extension replaced by `.gt.txt`.\n\nThe repository contains a ZIP archive with sample ground truth, see\n[ocrd-testset.zip](./ocrd-testset.zip). 
Extract it to `./data/foo-ground-truth` and run\n`make training`.\n\n**NOTE:** If you want to generate line images for transcription from a full\npage, see tips in [issue 7](https://github.com/OCR-D/ocrd-train/issues/7) and\nin particular [@Shreeshrii's shell\nscript](https://github.com/OCR-D/ocrd-train/issues/7#issuecomment-419714852).\n\n### Train\n\nRun\n\n    make training MODEL_NAME=name-of-the-resulting-model\n\n\nwhich is a shortcut for\n\n    make unicharset lists proto-model tesseract-langdata training MODEL_NAME=name-of-the-resulting-model\n\n\nRun `make help` to see all the possible targets and variables:\n\n\u003c!-- BEGIN-EVAL -w '```' '```' -- make help --\u003e\n```\n\n  Targets\n\n    unicharset       Create unicharset\n    charfreq         Show character histogram\n    lists            Create lists of lstmf filenames for training and eval\n    training         Start training (i.e. create .checkpoint files)\n    traineddata      Create best and fast .traineddata files from each .checkpoint file\n    proto-model      Build the proto model\n    tesseract-langdata  Download stock unicharsets\n    evaluation       Evaluate .checkpoint models on eval dataset via lstmeval\n    plot             Generate train/eval error rate charts from training log\n    clean            Clean all generated files\n\n  Variables\n\n    MODEL_NAME         Name of the model to be built. Default: foo\n    START_MODEL        Name of the model to continue from (i.e. fine-tune). Default: ''\n    PROTO_MODEL        Name of the prototype model. Default: OUTPUT_DIR/MODEL_NAME.traineddata\n    WORDLIST_FILE      Optional file for dictionary DAWG. Default: OUTPUT_DIR/MODEL_NAME.wordlist\n    NUMBERS_FILE       Optional file for number patterns DAWG. Default: OUTPUT_DIR/MODEL_NAME.numbers\n    PUNC_FILE          Optional file for punctuation DAWG. Default: OUTPUT_DIR/MODEL_NAME.punc\n    DATA_DIR           Data directory for output files, proto model, start model, etc. 
Default: data\n    OUTPUT_DIR         Output directory for generated files. Default: DATA_DIR/MODEL_NAME\n    GROUND_TRUTH_DIR   Ground truth directory. Default: OUTPUT_DIR-ground-truth\n    TESSDATA_REPO      Tesseract model repo to use (_fast or _best). Default: _best\n    TESSDATA           Path to the directory containing START_MODEL.traineddata\n                       (for example tesseract-ocr/tessdata_best). Default: ./usr/share/tessdata\n    MAX_ITERATIONS     Max iterations. Default: 10000\n    EPOCHS             Set max iterations based on the number of lines for training. Default: none\n    DEBUG_INTERVAL     Debug Interval. Default:  0\n    LEARNING_RATE      Learning rate. Default: 0.0001 with START_MODEL, otherwise 0.002\n    NET_SPEC           Network specification (in VGSL) for new model from scratch. Default: [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c###]\n    FINETUNE_TYPE      Fine-tune Training Type - Impact, Plus, Layer or blank. Default: ''\n    LANG_TYPE          Language Type - Indic, RTL or blank. Default: ''\n    PSM                Page segmentation mode. Default: 13\n    RANDOM_SEED        Random seed for shuffling of the training data. Default: 0\n    RATIO_TRAIN        Ratio of train / eval training data. Default: 0.90\n    TARGET_ERROR_RATE  Stop training if the character error rate (CER in percent) gets below this value. Default: 0.01\n    LOG_FILE           File to copy training output to and read plot figures from. 
Default: OUTPUT_DIR/training.log\n```\n\n\u003c!-- END-EVAL --\u003e\n\n### Choose training regime\n\nFirst, decide what [kind of training](https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html#introduction)\nyou want.\n\n* Fine-tuning: select (and install) a `START_MODEL`\n* From scratch: specify a `NET_SPEC` (see [documentation](https://tesseract-ocr.github.io/tessdoc/tess4/VGSLSpecs.html))\n\n### Change directory assumptions\n\nTo override the default path name requirements, just set the respective variables in the above list:\n\n    make training MODEL_NAME=name-of-the-resulting-model DATA_DIR=/data GROUND_TRUTH_DIR=/data/GT\n\nIf you want to use shell variables to override the make variables (for example because\nyou are running tesstrain from a script or other makefile), then you can use the `-e` flag:\n\n    MODEL_NAME=name-of-the-resulting-model DATA_DIR=/data GROUND_TRUTH_DIR=/data/GT make -e training\n\n### Make model files (traineddata)\n\nWhen the training is finished, it will write a `traineddata` file which can be used\nfor text recognition with Tesseract. Note that this file does not include a\ndictionary. The `tesseract` executable therefore prints a warning.\n\nIt is also possible to create additional `traineddata` files from intermediate\ntraining results (the so-called checkpoints). This can even be done while the\ntraining is still running. Example:\n\n    # Add MODEL_NAME and OUTPUT_DIR like for the training.\n    make traineddata\n\nThis will create two directories `tessdata_best` and `tessdata_fast` in `OUTPUT_DIR`\nwith a best (double based) and fast (int based) model for each checkpoint.\n\nIt is also possible to create models for selected checkpoints only. 
Examples:\n\n    # Make traineddata for the checkpoint files of the last three weeks.\n    make traineddata CHECKPOINT_FILES=\"$(find data/foo -name '*.checkpoint' -mtime -21)\"\n\n    # Make traineddata for the last two checkpoint files.\n    make traineddata CHECKPOINT_FILES=\"$(ls -t data/foo/checkpoints/*.checkpoint | head -2)\"\n\n    # Make traineddata for all checkpoint files with CER better than 1 %.\n    make traineddata CHECKPOINT_FILES=\"$(ls data/foo/checkpoints/*[^1-9]0.*.checkpoint)\"\n\nAdd `MODEL_NAME` and `OUTPUT_DIR` and replace `data/foo` with the output directory if needed.\n\n### Plotting CER\n\nTraining and Evaluation Character Error Rate (CER) can be plotted using Matplotlib:\n\n    # Make OUTPUT_DIR/MODEL_FILE.plot_*.png\n    make plot\n\nAll the variables defined above apply, but there is no explicit dependency on `training`.\n\nStill, the target depends on the `LOG_FILE` captured during training (just will not trigger\ntraining itself). Besides analysing the log file, this also directly evaluates the trained models\n(for each checkpoint) on the eval dataset. The latter is also available as an independent target\n`evaluation`:\n\n    # Make OUTPUT_DIR/eval/MODEL_FILE*.*.log\n    make evaluation\n\nPlotting can even be done while training is still running, and  will depict the training status\nup to that point. 
(It can be rerun any time the `LOG_FILE` has changed or new checkpoints have been written.)\n\nAs an example, use the training data provided in [ocrd-testset.zip](./ocrd-testset.zip) to do some\ntraining and generate the plots:\n\n    unzip ocrd-testset.zip -d data/ocrd-ground-truth\n    make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000 \u0026\n    # Make data/ocrd/ocrd.plot_cer.png and plot_log.png (repeat during/after training)\n    make plot MODEL_NAME=ocrd\n\nThe resulting plot should then look like this:\n\n![ocrd.plot_cer.png](./ocrd.plot_cer.png)\n\n## License\n\nThe software is provided under the terms of the `Apache 2.0` license.\n\nSample training data provided by [Deutsches Textarchiv](https://deutschestextarchiv.de) is [in the public domain](http://creativecommons.org/publicdomain/mark/1.0/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftesseract-ocr%2Ftesstrain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftesseract-ocr%2Ftesstrain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftesseract-ocr%2Ftesstrain/lists"}