{"id":20307773,"url":"https://github.com/hyperparticle/lemmatag","last_synced_at":"2025-04-11T15:12:31.016Z","repository":{"id":61596339,"uuid":"140996701","full_name":"Hyperparticle/LemmaTag","owner":"Hyperparticle","description":"A neural network that jointly part-of-speech tags and lemmatizes sentences, boosting accuracy for morphologically-rich languages (Czech, Arabic, etc.)","archived":false,"fork":false,"pushed_at":"2019-04-05T04:17:57.000Z","size":776,"stargazers_count":34,"open_issues_count":0,"forks_count":3,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-25T11:21:30.555Z","etag":null,"topics":["deep-learning","lemmatization","machine-learning","natural-language-processing","neural-network","nlp","pos-tagging","tensorflow"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1808.03703","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hyperparticle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-15T03:57:59.000Z","updated_at":"2024-05-26T12:12:39.000Z","dependencies_parsed_at":"2022-10-19T23:00:34.099Z","dependency_job_id":null,"html_url":"https://github.com/Hyperparticle/LemmaTag","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyperparticle%2FLemmaTag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyperparticle%2FLemmaTag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyperparticle%2FLemmaTag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/Hyperparticle%2FLemmaTag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hyperparticle","download_url":"https://codeload.github.com/Hyperparticle/LemmaTag/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248429116,"owners_count":21101785,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","lemmatization","machine-learning","natural-language-processing","neural-network","nlp","pos-tagging","tensorflow"],"created_at":"2024-11-14T17:19:06.088Z","updated_at":"2025-04-11T15:12:30.985Z","avatar_url":"https://github.com/Hyperparticle.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LemmaTag\n\n[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE) [![TensorFlow 1.12](https://img.shields.io/badge/TensorFlow-1.12-orange.svg)](https://www.tensorflow.org/install/) [![Python 3.5+](https://img.shields.io/badge/Python-3.5+-yellow.svg)](https://www.python.org/downloads/)\n\nThe following project provides a neural network architecture for [part-of-speech tagging](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb) and [lemmatizing](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/) sentences, achieving state-of-the-art (as of 2018) results for morphologically-rich languages, e.g., Czech, German, and Arabic [(Kondratyuk et al., 2018)](https://arxiv.org/abs/1808.03703).\n\n## Overview\n\nThere are two main ideas:\n\n1. 
Since part-of-speech tagging and lemmatization are related tasks, sharing the initial layers of the network is mutually beneficial. This results in higher accuracy and requires less training time.\n2. The lemmatizer can further improve its accuracy by looking at the tagger's predictions, i.e., taking the output of the tagger as an additional lemmatizer input.\n\n### Model\n\nThe model consists of 3 parts:\n\n- The **shared encoder** generates [character-level](http://colinmorris.github.io/blog/1b-words-char-embeddings) and [word-level embeddings](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/) and processes them through a [bidirectional RNN (BRNN)](https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15).\n- The **tagger decoder** generates part-of-speech tags with a [softmax classifier](https://becominghuman.ai/making-a-simple-neural-network-classification-2449da88c77e) by using the output of the shared encoder.\n- The **lemmatizer decoder** generates lemmas character-by-character with an RNN by using the outputs of the shared encoder and also the output of the tagger.\n\nThe image below provides a detailed overview of the architecture and design of the system.\n\n[![Model](images/model.png)](https://arxiv.org/abs/1808.03703 \"LemmaTag model\")\n\n- **Bottom** - Word-level encoder, with word input `w`, character inputs `c`, character states `e^c`, and combined word embedding `e^w`. Thick slanted lines denote [training dropout](https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5).\n- **Top Left** - Sentence-level encoder and tag classifier, with word-level inputs `e^w`. 
Two BRNN layers with residual connections act on the embedded words of a sentence, producing intermediate sentence contexts `o^w` and tag classification `t`.\n- **Top Right** - Lemma decoder, consisting of a [seq2seq decoder](https://medium.com/@devnag/seq2seq-the-clown-car-of-deep-learning-f88e1204dac3) with [attention](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/) on character encodings `e^c`, and with additional inputs of processed tagger features `t`, embeddings `e^w` and sentence-level contexts `o^w`, producing lemma characters `l`.\n\nFor technical details, see the paper, [\"LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs\"](https://arxiv.org/abs/1808.03703).\n\n### Morphology Tagging\n\nNot all languages are alike when it comes to part-of-speech tagging. For instance, the Czech language has over 1500 different types of tags, while English has about 50. This discrepancy is due to Czech being a morphologically-rich language, which alters the endings of its words to indicate information like case, number, and gender. English, on the other hand, relies heavily on the positioning of a word relative to other words to convey this information.\n\nThe image below shows how Czech tags are split up into several subcategories that delineate a word's [morphology](http://all-about-linguistics.group.shef.ac.uk/branches-of-linguistics/morphology/what-is-morphology/), along with the number of unique values in each subcategory.\n\n[![Tag Components](images/tag-components.png \"Czech morphology tags\")](http://ufal.mff.cuni.cz/czech-tagging/)\n\nLemmaTag takes advantage of this by also predicting each tag subcategory and feeding this information to the lemmatizer (if the subcategories exist for the language). This modification improves both tagging and lemmatizing accuracies.\n\n## Getting Started\n\n### Requirements\n\nThe code requires Python 3.5+ and TensorFlow (tested with TF 1.12).\n\n1. 
Clone the repository.\n\n```bash\ngit clone https://github.com/Hyperparticle/LemmaTag.git\ncd ./LemmaTag\n```\n\n2. Install the Python packages in `requirements.txt` if you don't have them already.\n\n```bash\npip install -r ./requirements.txt\n```\n\n### Training and Testing\n\nTo start training on a sample dataset with default parameters, run\n\n```bash\npython lemmatag.py\n```\n\nThis will save the model periodically and output the training/validation accuracy. See the [Visualize Results](#visualize-results) section for how to view the training graphs.\n\nFor a list of all supported arguments, run\n\n```bash\npython lemmatag.py --help\n```\n\n## Obtaining Datasets\n\nA wide range of datasets supporting many languages can be downloaded from [Universal Dependencies](http://universaldependencies.org/). Each dataset repo should contain `train`, `dev`, and `test` files in `conllu` tab-separated format.\n\nThe `train`, `dev`, and `test` files must be in either `conllu` or LemmaTag format. The LemmaTag format has 3 tab-separated columns: the word form, its lemma, and its part-of-speech tag. Sentences are separated by empty lines. 
See [data/sample-cs-cltt-ud-test.txt](data/sample-cs-cltt-ud-test.txt) for an example of a small Czech dataset.\n\nTo read the dataset as `conllu` files, use the `--conllu` flag and specify the dataset files with `--train`, `--dev`, and `--test`:\n\n```bash\npython lemmatag.py --conllu --train TRAIN_FILE --dev DEV_FILE --test TEST_FILE\n```\n\nAlternatively, one can convert the files beforehand:\n\n```bash\npython util/conllu_to_lemmatag.py \u003c INPUT_FILE \u003e OUTPUT_FILE\n```\n\nwhere `INPUT_FILE` and `OUTPUT_FILE` are the names of the input and output dataset files.\n\n## Visualize Results\n\nThe training metrics can be viewed with TensorBoard in the `logs` directory:\n\n```bash\ntensorboard --logdir logs\n```\n\nThen navigate to [localhost:6006](http://localhost:6006).\n\n## Credits\n\nPlease cite this project ([PDF](https://arxiv.org/pdf/1808.03703.pdf)) as\n\n\u003e Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, and Jan Hajič. 2018. \"**LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs**\". In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics.\n\n```bibtex\n@InProceedings{D18-1532,\n  author = \t\"Kondratyuk, Daniel\n\t\tand Gaven{\\v{c}}iak, Tom{\\'a}{\\v{s}}\n\t\tand Straka, Milan\n\t\tand Haji{\\v{c}}, Jan\",\n  title = \t\"LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs\",\n  booktitle = \t\"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing\",\n  year = \t\"2018\",\n  publisher = \t\"Association for Computational Linguistics\",\n  pages = \t\"4921--4928\",\n  location = \t\"Brussels, Belgium\",\n  url = \t\"http://aclweb.org/anthology/D18-1532\"\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyperparticle%2Flemmatag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyperparticle%2Flemmatag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyperparticle%2Flemmatag/lists"}