{"id":24951928,"url":"https://github.com/ocr-d/ocrd_keraslm","last_synced_at":"2025-04-10T12:51:41.026Z","repository":{"id":35424665,"uuid":"138599682","full_name":"OCR-D/ocrd_keraslm","owner":"OCR-D","description":"Simple character-based language model using keras","archived":false,"fork":false,"pushed_at":"2024-10-01T16:30:28.000Z","size":295,"stargazers_count":7,"open_issues_count":1,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-24T11:38:25.363Z","etag":null,"topics":["ocr-d"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OCR-D.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-25T13:36:36.000Z","updated_at":"2024-10-01T16:30:32.000Z","dependencies_parsed_at":"2024-02-09T13:50:32.264Z","dependency_job_id":"24cc23cf-7e0d-4bd9-8b96-50c0a37a0ffe","html_url":"https://github.com/OCR-D/ocrd_keraslm","commit_stats":{"total_commits":103,"total_committers":8,"mean_commits":12.875,"dds":0.4077669902912622,"last_synced_commit":"9e3f5a06b8efb706f8f1ac1c172fa5809ad6bab9"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_keraslm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_keraslm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_keraslm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/OCR-D%2Focrd_keraslm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OCR-D","download_url":"https://codeload.github.com/OCR-D/ocrd_keraslm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248220200,"owners_count":21067255,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr-d"],"created_at":"2025-02-03T01:32:28.502Z","updated_at":"2025-04-10T12:51:41.007Z","avatar_url":"https://github.com/OCR-D.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocrd_keraslm\n    character-level language modelling using Keras\n\n[![CircleCI](https://circleci.com/gh/OCR-D/ocrd_keraslm.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_keraslm)\n[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/keraslm.svg)](https://hub.docker.com/r/ocrd/keraslm/tags/)\n\n * [Introduction](#introduction)\n    * [Architecture](#architecture)\n    * [Modes of operation](#modes-of-operation)\n    * [Context conditioning](#context-conditioning)\n    * [Underspecification](#underspecification)\n * [Installation](#installation)\n * [Usage](#usage)\n    * [Command line interface `keraslm-rate`](#command-line-interface-keraslm-rate)\n    * [OCR-D processor interface `ocrd-keraslm-rate`](#ocr-d-processor-interface-ocrd-keraslm-rate)\n    * [Models](#models)\n * [Testing](#testing)\n\n## Introduction\n\nThis is a tool for statistical _language modelling_ (predicting text from context) with recurrent neural networks. 
It models probabilities not on the word level but the _character level_ so as to allow open vocabulary processing (avoiding morphology, historic orthography and word segmentation problems). It manages a vocabulary of mapped characters, which can be easily extended by training on more text. Above that, unmapped characters are treated with underspecification.\n\nIn addition to character sequences, (meta-data) context variables can be configured as extra input. \n\n### Architecture\n\nThe model consists of:\n\n0. an input layer: characters are represented as indexes from the vocabulary mapping, in windows of a number `length` of characters,\n1. a character embedding layer: window sequences are converted into dense vectors by looking up the indexes in an embedding weight matrix,\n2. a context embedding layer: context variables are converted into dense vectors by looking up the indexes in an embedding weight matrix, \n3. character and context vector sequences are concatenated,\n4. a number `depth` of hidden layers: each with a number `width` of hidden recurrent units of _LSTM cells_ (Long Short-term Memory) connected on top of each other,\n5. an output layer derived from the transposed character embedding matrix (weight tying): hidden activations are projected linearly to vectors of dimensionality equal to the character vocabulary size, then softmax is applied returning a probability for each possible value of the next character, respectively.\n\n![model graph depiction](model-graph.png \"graph with 1 context variable\")\n\nThe model is trained by feeding windows of text in index representation to the input layer, calculating output and comparing it to the same text shifted backward by 1 character, and represented as unit vectors (\"one-hot coding\") as target. The loss is calculated as the (unweighted) cross-entropy between target and output. 
Backpropagation yields error gradients for each layer, which are used to iteratively update the weights (stochastic gradient descent).\n\nThis is implemented in [Keras](https://keras.io) with [TensorFlow](https://www.tensorflow.org/) as the backend. It automatically uses a fast CUDA-optimized LSTM implementation (Nvidia GPU and TensorFlow installation with GPU support, see below), both in learning and in prediction phase, if available.\n\n\n### Modes of operation\n\nNotably, this model (by default) runs _statefully_, i.e. by implicitly passing hidden state from one window (batch of samples) to the next. That way, the context available for predictions can be arbitrarily long (above `length`, e.g. the complete document up to that point), or short (below `length`, e.g. at the start of a text). (However, this is a passive perspective above `length`, because errors are never back-propagated any further in time during gradient-descent training.) This is preferable to stateless mode because all characters can be output in parallel, and no partial windows need to be presented during training (which would slow training down).\n\nBesides stateful mode, the model can also be run _incrementally_, i.e. by explicitly passing hidden state from the caller. That way, multiple alternative hypotheses can be processed together. This is used for generation (sampling from the model) and alternative decoding (finding the best path through a sequence of alternatives).\n\n### Context conditioning\n\nEvery text has meta-data like time, author, text type, genre, production features (e.g. print vs typewriter vs digital born rich text, OCR version), language, structural element (e.g. title vs heading vs paragraph vs footer vs marginalia), font family (e.g. Antiqua vs Fraktur) and font shape (e.g. bold vs letter-spaced vs italic vs normal) etc. \n\nThis information (however noisy) can be very useful to facilitate stochastic modelling, since language has an extreme diversity and complexity. 
To that end, models can be conditioned on extra inputs here, termed _context variables_. The model learns to represent these high-dimensional discrete values as low-dimensional continuous vectors (embeddings), also entering the recurrent hidden layers (as a form of simple additive adaptation).\n\n### Underspecification\n\nIndex zero is reserved for unmapped characters (unseen contexts). During training, its embedding vector is regularised to occupy a center position of all mapped characters (all other contexts), and the hidden layers get to see it every now and then by random degradation. At runtime, therefore, some unknown character (some unknown context) represented as zero does not disturb follow-up predictions too much.\n\n\n## Installation\n\nRequired Ubuntu packages:\n\n* Python (``python`` or ``python3``)\n* pip (``python-pip`` or ``python3-pip``)\n* virtualenv (``python-virtualenv`` or ``python3-virtualenv``)\n\nCreate and activate a virtualenv as usual.\n\nIf you need a custom version of ``keras`` or ``tensorflow`` (like [GPU support](https://www.tensorflow.org/install/install_sources)), install them via `pip` now.\n\nThen, to install the Python dependencies and this module, run:\n```shell\nmake deps install\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements.txt\npip install -e .\n```\n\nUseful environment variables are:\n- ``TF_CPP_MIN_LOG_LEVEL`` (set to `1` to suppress most of TensorFlow's messages)\n- ``CUDA_VISIBLE_DEVICES`` (set empty to force CPU even in a GPU installation)\n\n\n## Usage\n\nThis package has two user interfaces:\n\n### Command line interface `keraslm-rate`\n\nTo be used with string arguments and plain-text files.\n\n```shell\nUsage: keraslm-rate [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n  --help  Show this message and exit.\n\nCommands:\n  train                           train a language model\n  test                            get overall perplexity from language model\n  apply                           get individual 
probabilities from language model\n  generate                        sample characters from language model\n  print-charset                   Print the mapped characters\n  prune-charset                   Delete one character from mapping\n  plot-char-embeddings-similarity\n                                  Paint a heat map of character embeddings\n  plot-context-embeddings-similarity\n                                  Paint a heat map of context embeddings\n  plot-context-embeddings-projection\n                                  Paint a 2-d PCA projection of context embeddings\n```\n\nExamples:\n```shell\nkeraslm-rate train --width 64 --depth 4 --length 256 --model model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/*.tcf.txt\nkeraslm-rate generate -m model_dta_64_4_256.h5 --number 6 \"für die Wiſſen\"\nkeraslm-rate apply -m model_dta_64_4_256.h5 \"so schädlich ist es Borkickheile zu pflanzen\"\nkeraslm-rate test -m model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/grimm_*.tcf.txt\n```\n\n### [OCR-D processor](https://github.com/OCR-D/core) interface `ocrd-keraslm-rate`\n\nTo be used with [PageXML](https://www.primaresearch.org/tools/PAGELibraries) documents in an [OCR-D](https://github.com/OCR-D/spec/) annotation workflow. Input could be anything with a textual annotation (`TextEquiv` on the given `textequiv_level`). 
The LM rater could be used for both quality control (without alternative decoding, using only each first index `TextEquiv`) and part of post-correction (with `alternative_decoding=True`, finding the best path among `TextEquiv` indexes).\n\n```shell\nUsage: ocrd-keraslm-rate [worker|server] [OPTIONS]\n\n  Rate elements of the text with a character-level LSTM language model in Keras\n\n  \u003e Rate text with the language model, either for scoring or finding the\n  \u003e best path across alternatives.\n\n  \u003e Open and deserialise PAGE input files, then iterate over the segment\n  \u003e hierarchy down to the requested `textequiv_level`, making sequences\n  \u003e of first TextEquiv objects (if `alternative_decoding` is false), or\n  \u003e of lists of all TextEquiv objects (otherwise) as a linear graph for\n  \u003e input to the LM. If the level is above glyph, then insert artificial\n  \u003e whitespace TextEquiv where implicit tokenisation rules require it.\n\n  \u003e Next, if `alternative_decoding` is false, then pass the concatenated\n  \u003e string of the page text to the LM and map the returned sequence of\n  \u003e probabilities to the substrings in the input TextEquiv. For each\n  \u003e TextEquiv, calculate the average character probability (LM score)\n  \u003e and combine that with the input confidence (OCR score) by applying\n  \u003e `lm_weight`. 
Assign the resulting probability as new confidence to\n  \u003e the TextEquiv, and ensure no other TextEquiv remain on the segment.\n  \u003e Finally, calculate the overall average LM probability,  and the\n  \u003e character and segment-level perplexity, and print it on the logger.\n\n  \u003e Otherwise (i.e with `alternative_decoding=true`), search for the\n  \u003e best paths through the input graph of the page (with TextEquiv\n  \u003e string alternatives as edges) by applying the LM successively via\n  \u003e beam search using `beam_width` (keeping a traceback of LM state\n  \u003e history at each node, passing and updating LM state explicitly). As\n  \u003e in the above trivial case without `alternative_decoding`, then\n  \u003e combine LM scores weighted by `lm_weight` with input confidence on\n  \u003e the graph's edges. Also, prune worst paths and apply LM state\n  \u003e history clustering to avoid expanding all possible combinations.\n  \u003e Finally, look into the current best overall path, traversing back to\n  \u003e the last node of the previous page's graph. Lock into that node by\n  \u003e removing all current paths that do not derive from it, and making\n  \u003e its history path the final decision for the previous page: Apply\n  \u003e that path by removing all but the chosen TextEquiv alternatives,\n  \u003e assigning the resulting confidences, and making the levels above\n  \u003e `textequiv_level` consistent with that textual result (via\n  \u003e concatenation joined by whitespace). Also, calculate the overall\n  \u003e average LM probability, and the character and segment-level\n  \u003e perplexity, and print it on the logger. 
Moreover, at the last page\n  \u003e at the end of the document, lock into the current best path\n  \u003e analogously.\n\n  \u003e Produce new output files by serialising the resulting hierarchy for\n  \u003e each page.\n\nSubcommands:\n    worker      Start a processing worker rather than do local processing\n    server      Start a processor server rather than do local processing\n\nOptions for processing:\n  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]\n  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]\n  -I, --input-file-grp USE        File group(s) used as input\n  -O, --output-file-grp USE       File group(s) used as output\n  -g, --page-id ID                Physical page ID(s) to process instead of full document []\n  --overwrite                     Remove existing output pages/images\n                                  (with \"--page-id\", remove only those)\n  --profile                       Enable profiling\n  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. 
Implies \"--profile\"\n  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string\n                                  or JSON file path\n  -P, --param-override KEY VAL    Override a single JSON object key-value pair,\n                                  taking precedence over --parameter\n  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS\n                                  If URL starts with http:// start an HTTP server there,\n                                  otherwise URL is a path to an on-demand-created unix socket\n  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]\n                                  Override log level globally [INFO]\n\nOptions for information:\n  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME\n  -L, --list-resources            List names of processor resources\n  -J, --dump-json                 Dump tool description as JSON\n  -D, --dump-module-dir           Show the 'module' resource location path for this processor\n  -h, --help                      Show this message\n  -V, --version                   Show version\n\nParameters:\n   \"model_file\" [string - REQUIRED]\n    path of h5py weight/config file for model trained with keraslm\n   \"textequiv_level\" [string - \"glyph\"]\n    PAGE XML hierarchy level to evaluate TextEquiv sequences on\n    Possible values: [\"region\", \"line\", \"word\", \"glyph\"]\n   \"alternative_decoding\" [boolean - true]\n    whether to process all TextEquiv alternatives, finding the best path\n    via beam search, and delete each non-best alternative\n   \"beam_width\" [number - 10]\n    maximum number of best partial paths to consider during search with\n    alternative_decoding\n   \"lm_weight\" [number - 0.5]\n    share of the LM scores over the input confidences\n```\n\nExamples:\n```shell\nmake deps-test # installs ocrd_tesserocr\nmake test/assets # downloads GT, imports PageXML, builds workspaces\nocrd workspace -d 
ws1 clone -a test/assets/kant_aufklaerung_1784/mets.xml\ncd ws1\nocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK\nocrd-tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-WORD -P textequiv_level word -P model Fraktur\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-GLYPH -P textequiv_level glyph -P model deu-frak\n# download Deutsches Textarchiv language model\nocrd resmgr download ocrd-keraslm-rate model_dta_full.h5\n# get confidences and perplexity:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-WORD -O OCR-D-OCR-LM-WORD -P model_file model_dta_full.h5 -P textequiv_level word -P alternative_decoding false\n# also get best path:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-GLYPH -O OCR-D-OCR-LM-GLYPH -P model_file model_dta_full.h5 -P textequiv_level glyph -P alternative_decoding true -P beam_width 10\n```\n\n### Models\n\nPretrained models will be published under [Github release assets](https://github.com/OCR-D/ocrd_keraslm/releases)\nand made visible via [OCR-D Resource Manager](https://ocr-d.de/en/models).\n\nSo far, the only published models are:\n\n- [model_dta_full.h5](https://github.com/OCR-D/ocrd_keraslm/releases/download/v0.4.3/model_dta_full.h5)  \n  This LM was configured as stateful contiguous LSTM model (2 layers, 128 hidden nodes each, window length 256),\n  and trained on the complete [Deutsches Textarchiv](https://deutsches-textarchiv.de/) fulltext (80%/20% split).  
\n  It achieves a perplexity of 2.51 on the validation subset after 4 epochs.\n\n## Testing\n\n```shell\nmake deps-test test\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements_test.txt\ntest -e test/assets || test/prepare_gt.bash test/assets\ntest -f model_dta_test.h5 || keraslm-rate train -m model_dta_test.h5 test/assets/*.txt\nkeraslm-rate test -m model_dta_test.h5 test/assets/*.txt\npython -m pytest test $(PYTEST_ARGS)\n```\n\nSet `PYTEST_ARGS=\"-s --verbose\"` to see log output (`-s`) and individual test results (`--verbose`).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focr-d%2Focrd_keraslm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Focr-d%2Focrd_keraslm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focr-d%2Focrd_keraslm/lists"}