{"id":13765980,"url":"https://github.com/adbar/simplemma","last_synced_at":"2025-12-24T17:49:47.883Z","repository":{"id":38474938,"uuid":"330707034","full_name":"adbar/simplemma","owner":"adbar","description":"Simple multilingual lemmatizer for Python, especially useful for speed and efficiency","archived":false,"fork":false,"pushed_at":"2025-05-09T11:45:27.000Z","size":764076,"stargazers_count":155,"open_issues_count":14,"forks_count":12,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-05-09T12:34:56.106Z","etag":null,"topics":["corpus-tools","language-detection","language-identification","lemmatiser","lemmatization","lemmatizer","low-resource-nlp","morphological-analysis","nlp","tokenization","tokenizer","wordlist"],"latest_commit_sha":null,"homepage":"https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adbar.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.rst","contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":"docs/supported-languages.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-01-18T15:22:27.000Z","updated_at":"2025-05-09T11:41:29.000Z","dependencies_parsed_at":"2024-04-02T13:31:13.993Z","dependency_job_id":"9f733bd4-b850-4f0e-9a52-c312eaef64bc","html_url":"https://github.com/adbar/simplemma","commit_stats":{"total_commits":217,"total_committers":6,"mean_commits":"36.166666666666664","dds":"0.20276497695852536","last_synced_commit":"5f4fa16053dd52756f1ff9f072a409d91fac6b2a"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"reposi
tory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adbar%2Fsimplemma","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adbar%2Fsimplemma/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adbar%2Fsimplemma/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adbar%2Fsimplemma/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adbar","download_url":"https://codeload.github.com/adbar/simplemma/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253255932,"owners_count":21879257,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus-tools","language-detection","language-identification","lemmatiser","lemmatization","lemmatizer","low-resource-nlp","morphological-analysis","nlp","tokenization","tokenizer","wordlist"],"created_at":"2024-08-03T16:00:49.888Z","updated_at":"2025-12-24T17:49:47.837Z","avatar_url":"https://github.com/adbar.png","language":"Python","readme":"# Simplemma: a simple multilingual lemmatizer for Python\n\n[![Python package](https://img.shields.io/pypi/v/simplemma.svg)](https://pypi.python.org/pypi/simplemma)\n[![Python versions](https://img.shields.io/pypi/pyversions/simplemma.svg)](https://pypi.python.org/pypi/simplemma)\n[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/simplemma.svg)](https://codecov.io/gh/adbar/simplemma)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Reference DOI: 
10.5281/zenodo.4673264](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.4673264-brightgreen)](https://doi.org/10.5281/zenodo.4673264)\n\n\n## Purpose\n\n[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the\nprocess of grouping together the inflected forms of a word so they can\nbe analysed as a single item, identified by the word's lemma, or\ndictionary form. Unlike stemming, lemmatization outputs word units that\nare still valid linguistic forms.\n\nIn modern natural language processing (NLP), this task is often\nindirectly tackled by more complex systems encompassing a whole\nprocessing pipeline. However, there is no\nstraightforward way to address lemmatization in Python, although this\ntask can be crucial in fields such as information retrieval and NLP.\n\n*Simplemma* provides a simple and multilingual approach to looking up base\nforms, or lemmata. It may not be as powerful as full-fledged solutions,\nbut it is generic, easy to install and straightforward to use. In\nparticular, it does not need morphosyntactic information and can process\na raw series of tokens or even a text with its built-in tokenizer. 
By\ndesign it should be reasonably fast and work in a large majority of\ncases, without being perfect.\n\nWith its comparatively small footprint it is especially useful when\nspeed and simplicity matter, in low-resource contexts, for educational\npurposes, or as a baseline system for lemmatization and morphological\nanalysis.\n\nCurrently, 49 languages are partly or fully supported (see table below).\n\n\n## Installation\n\nThe current library is written in pure Python with no dependencies:\n`pip install simplemma`\n\n- `pip3` where applicable\n- `pip install -U simplemma` for updates\n- `pip install git+https://github.com/adbar/simplemma` for the cutting-edge version\n\nThe last version supporting Python 3.6 and 3.7 is `simplemma==1.0.0`.\n\n\n## Usage\n\n### Word-by-word\n\nSimplemma is used by selecting a language of interest and then applying\nthe data to a list of words.\n\n``` python\n\u003e\u003e\u003e import simplemma\n# get a word\n\u003e\u003e\u003e myword = 'masks'\n# decide which language to use and apply it on a word form\n\u003e\u003e\u003e simplemma.lemmatize(myword, lang='en')\n'mask'\n# grab a list of tokens\n\u003e\u003e\u003e mytokens = ['Hier', 'sind', 'Vaccines']\n\u003e\u003e\u003e for token in mytokens:\n...     simplemma.lemmatize(token, lang='de')\n'hier'\n'sein'\n'Vaccines'\n# list comprehensions can be faster\n\u003e\u003e\u003e [simplemma.lemmatize(t, lang='de') for t in mytokens]\n['hier', 'sein', 'Vaccines']\n```\n\n\n### Chaining languages\n\nChaining several languages can improve coverage; they are used in\nsequence:\n\n``` python\n\u003e\u003e\u003e from simplemma import lemmatize\n\u003e\u003e\u003e lemmatize('Vaccines', lang=('de', 'en'))\n'vaccine'\n\u003e\u003e\u003e lemmatize('spaghettis', lang='it')\n'spaghettis'\n\u003e\u003e\u003e lemmatize('spaghettis', lang=('it', 'fr'))\n'spaghetti'\n\u003e\u003e\u003e lemmatize('spaghetti', lang=('it', 'fr'))\n'spaghetto'\n```\n\n\n### Greedier decomposition\n\nFor certain languages a 
greedier decomposition is activated by default\nas it can be beneficial, mostly owing to a certain capacity to address\naffixes in an unsupervised way. Where it is not active by default, it can be enabled by\nsetting the `greedy` parameter to `True`.\n\nThis option also triggers a stronger reduction through an additional\niteration of the search algorithm, e.g. \\\"angekündigten\\\" →\n\\\"angekündigt\\\" (standard) → \\\"ankündigen\\\" (greedy). In some cases it\nmay be closer to stemming than to lemmatization.\n\n``` python\n# same example as before, comes to this result in one step\n\u003e\u003e\u003e simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)\n'spaghetto'\n# German case described above\n\u003e\u003e\u003e simplemma.lemmatize('angekündigten', lang='de', greedy=True)\n'ankündigen' # 2 steps: reduction to infinitive verb\n\u003e\u003e\u003e simplemma.lemmatize('angekündigten', lang='de', greedy=False)\n'angekündigt' # 1 step: reduction to past participle\n```\n\n\n### is_known()\n\nThe additional function `is_known()` checks if a given word is present\nin the language data:\n\n``` python\n\u003e\u003e\u003e from simplemma import is_known\n\u003e\u003e\u003e is_known('spaghetti', lang='it')\nTrue\n```\n\n\n### Tokenization\n\nA simple tokenization function is provided for convenience:\n\n``` python\n\u003e\u003e\u003e from simplemma import simple_tokenizer\n\u003e\u003e\u003e simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')\n['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']\n# use an iterator instead\n\u003e\u003e\u003e simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True)\n```\n\nThe functions `text_lemmatizer()` and `lemma_iterator()` chain\ntokenization and lemmatization. 
They can take `greedy` (affecting\nlemmatization) and `silent` (affecting errors and logging) as arguments:\n\n``` python\n\u003e\u003e\u003e from simplemma import text_lemmatizer\n\u003e\u003e\u003e sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.'\n\u003e\u003e\u003e text_lemmatizer(sentence, lang='pt')\n# caveat: desejo is also a noun, should be desejar here\n['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']\n# same principle, returns a generator and not a list\n\u003e\u003e\u003e from simplemma import lemma_iterator\n\u003e\u003e\u003e lemma_iterator(sentence, lang='pt')\n```\n\n\n### Caveats\n\n``` python\n# don't expect too much though\n# this diminutive form isn't in the model data\n\u003e\u003e\u003e simplemma.lemmatize('spaghettini', lang='it')\n'spaghettini' # should read 'spaghettino'\n# the algorithm cannot choose between valid alternatives yet\n\u003e\u003e\u003e simplemma.lemmatize('son', lang='es')\n'son' # valid common noun, but what about the verb form?\n```\n\nAs the focus lies on overall coverage, some short frequent words\n(typically: pronouns and conjunctions) may need post-processing; this\ngenerally concerns a few dozen tokens per language.\n\nThe current absence of morphosyntactic information is an advantage in\nterms of simplicity. However, it is also an impassable frontier regarding\nlemmatization accuracy, for example when it comes to disambiguating\nbetween past participles and adjectives derived from verbs in Germanic\nand Romance languages. In such cases, `simplemma` often does not change\nthe input words.\n\nThe greedy algorithm seldom produces invalid forms. It is designed to\nwork best in the low-frequency range, notably for compound words and\nneologisms. 
Aggressive decomposition is only useful as a general\napproach in the case of morphologically rich languages, where it can\nalso act as a linguistically motivated stemmer.\n\nBug reports via the [issues\npage](https://github.com/adbar/simplemma/issues) are welcome.\n\n\n### Language detection\n\nLanguage detection works by providing a text and a tuple `lang` consisting\nof the languages of interest. Scores between 0 and 1 are\nreturned.\n\nThe `lang_detector()` function returns a list of language codes along\nwith their corresponding scores, appending \\\"unk\\\" for unknown or\nout-of-vocabulary words. The proportion of in-vocabulary words can also be calculated with\nthe function `in_target_language()`, which returns a ratio.\n\n``` python\n# import necessary functions\n\u003e\u003e\u003e from simplemma import in_target_language, lang_detector\n# language detection\n\u003e\u003e\u003e lang_detector('\"Exoplaneta, též extrasolární planeta, je planeta obíhající kolem jiné hvězdy než kolem Slunce.\"', lang=(\"cs\", \"sk\"))\n[(\"cs\", 0.75), (\"sk\", 0.125), (\"unk\", 0.25)]\n# proportion of known words\n\u003e\u003e\u003e in_target_language(\"opera post physica posita (τὰ μετὰ τὰ φυσικά)\", lang=\"la\")\n0.5\n```\n\nThe `greedy` argument (`extensive` in past software versions) triggers\nuse of the greedier decomposition algorithm described above, thus\nextending word coverage and recall of detection at the potential cost of\nlower accuracy.\n\n\n### Advanced usage via classes\n\nThe functions described above are suitable for simple usage, but you\ncan have more control by instantiating Simplemma classes and calling\ntheir methods instead. Lemmatization is handled by the `Lemmatizer`\nclass, while language detection is handled by the `LanguageDetector`\nclass. These in turn rely on different lemmatization strategies, which\nare implementations of the `LemmatizationStrategy` protocol. 
The\n`DefaultStrategy` implementation uses a combination of different\nstrategies, one of which is `DictionaryLookupStrategy`. It looks up\ntokens in a dictionary created by a `DictionaryFactory`.\n\nFor example, it is possible to conserve RAM by limiting the number of\ncached language dictionaries (default: 8): create a custom\n`DefaultDictionaryFactory` with a specific `cache_max_size` setting,\nbuild a `DefaultStrategy` using that factory, and then create a\n`Lemmatizer` and/or a `LanguageDetector` using that strategy:\n\n``` python\n# import necessary classes\n\u003e\u003e\u003e from simplemma import LanguageDetector, Lemmatizer\n\u003e\u003e\u003e from simplemma.strategies import DefaultStrategy\n\u003e\u003e\u003e from simplemma.strategies.dictionaries import DefaultDictionaryFactory\n\n\u003e\u003e\u003e LANG_CACHE_SIZE = 5  # How many language dictionaries to keep in memory at once (max)\n\u003e\u003e\u003e dictionary_factory = DefaultDictionaryFactory(cache_max_size=LANG_CACHE_SIZE)\n\u003e\u003e\u003e lemmatization_strategy = DefaultStrategy(dictionary_factory=dictionary_factory)\n\n# lemmatize using the above customized strategy\n\u003e\u003e\u003e lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)\n\u003e\u003e\u003e lemmatizer.lemmatize('doughnuts', lang='en')\n'doughnut'\n\n# detect languages using the above customized strategy\n\u003e\u003e\u003e language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)\n\u003e\u003e\u003e language_detector.proportion_in_target_languages(\"opera post physica posita (τὰ μετὰ τὰ φυσικά)\")\n0.5\n```\n\nFor more information see the\n[extended documentation](https://adbar.github.io/simplemma/).\n\n\n### Reducing memory usage\n\nSimplemma provides an alternative solution for situations where low\nmemory usage and fast initialization time are more important than\nlemmatization and language detection performance. 
This solution uses a\n`DictionaryFactory` that employs a trie as its underlying data structure,\nrather than a Python dict.\n\nThe `TrieDictionaryFactory` reduces memory usage by an average of\n20x and initialization time by 100x, but this comes at the cost of\npotentially reducing performance by 50% or more, depending on the\nspecific usage.\n\nTo use the `TrieDictionaryFactory` you have to install Simplemma with\nthe `marisa-trie` extra dependency (available from version 1.1.0):\n\n```\npip install simplemma[marisa-trie]\n```\n\nThen you have to create a custom strategy using the\n`TrieDictionaryFactory` and use that for `Lemmatizer` and\n`LanguageDetector` instances:\n\n``` python\n\u003e\u003e\u003e from simplemma import LanguageDetector, Lemmatizer\n\u003e\u003e\u003e from simplemma.strategies import DefaultStrategy\n\u003e\u003e\u003e from simplemma.strategies.dictionaries import TrieDictionaryFactory\n\n\u003e\u003e\u003e lemmatization_strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())\n\n\u003e\u003e\u003e lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)\n\u003e\u003e\u003e lemmatizer.lemmatize('doughnuts', lang='en')\n'doughnut'\n\n\u003e\u003e\u003e language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)\n\u003e\u003e\u003e language_detector.proportion_in_target_languages(\"opera post physica posita (τὰ μετὰ τὰ φυσικά)\")\n0.5\n```\n\nWhile memory usage and initialization time when using the\n`TrieDictionaryFactory` are significantly lower compared to the\n`DefaultDictionaryFactory`, that's only true if the trie dictionaries\nare available on disk. That's not the case when using the\n`TrieDictionaryFactory` for the first time, as Simplemma only ships\nthe dictionaries as Python dicts. The trie dictionaries have to be\ngenerated once from the Python dicts. 
That happens on the fly when\nusing the `TrieDictionaryFactory` for the first time for a language and\nwill take a few seconds and use as much memory as loading the Python\ndicts for the language requires. The generated trie\ndictionaries are then cached on disk for further invocations.\n\nIf the machine running Simplemma doesn't have enough memory to\ngenerate the trie dictionaries, they can also be generated on another\ncomputer with the same CPU architecture and copied over to the cache\ndirectory.\n\n\n## Supported languages\n\nThe following languages are available, identified by their [BCP 47\nlanguage tag](https://en.wikipedia.org/wiki/IETF_language_tag), which\ntypically corresponds to the [ISO 639-1 code](\nhttps://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).\nIf no such code exists, an [ISO 639-3\ncode](https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes) is\nused instead.\n\nAvailable languages (2022-01-20):\n\n\n| Code | Language | Forms (10³) | Acc. | Comments |\n| ---- | -------- | ----------- | ---- | -------- |\n| `ast` | Asturian | 124 | \n| `bg` | Bulgarian | 204 | \n| `ca` | Catalan | 579 | \n| `cs` | Czech | 187 | 0.89 | on UD CS-PDT\n| `cy` | Welsh | 360 | \n| `da` | Danish | 554 | 0.92 | on UD DA-DDT, alternative: [lemmy](https://github.com/sorenlind/lemmy)\n| `de` | German | 675 | 0.95 | on UD DE-GSD, see also [German-NLP list](https://github.com/adbar/German-NLP#Lemmatization)\n| `el` | Greek | 181 | 0.88 | on UD EL-GDT\n| `en` | English | 131 | 0.94 | on UD EN-GUM, alternative: [LemmInflect](https://github.com/bjascob/LemmInflect)\n| `enm` | Middle English | 38\n| `es` | Spanish | 665 | 0.95 | on UD ES-GSD\n| `et` | Estonian | 119 | | low coverage\n| `fa` | Persian | 12 | | experimental\n| `fi` | Finnish | 3,199 | | see [this benchmark](https://github.com/aajanki/finnish-pos-accuracy)\n| `fr` | French | 217 | 0.94 | on UD FR-GSD\n| `ga` | Irish | 372\n| `gd` | Scottish Gaelic | 48\n| `gl` | Galician | 384\n| `gv` | Manx | 62\n| `hbs` | Serbo-Croatian | 
656 | | Croatian and Serbian lists to be added later\n| `hi` | Hindi | 58 | | experimental\n| `hu` | Hungarian | 458\n| `hy` | Armenian | 246\n| `id` | Indonesian | 17 | 0.91 | on UD ID-CSUI\n| `is` | Icelandic | 174\n| `it` | Italian | 333 | 0.93 | on UD IT-ISDT\n| `ka` | Georgian | 65\n| `la` | Latin | 843\n| `lb` | Luxembourgish | 305\n| `lt` | Lithuanian | 247\n| `lv` | Latvian | 164\n| `mk` | Macedonian | 56\n| `ms` | Malay | 14\n| `nb` | Norwegian (Bokmål) | 617\n| `nl` | Dutch | 250 | 0.92 | on UD-NL-Alpino\n| `nn` | Norwegian (Nynorsk) | 56\n| `pl` | Polish | 3,211 | 0.91 | on UD-PL-PDB\n| `pt` | Portuguese | 924 | 0.92 | on UD-PT-GSD\n| `ro` | Romanian | 311\n| `ru` | Russian | 595 | | alternative: [pymorphy2](https://github.com/kmike/pymorphy2/)\n| `se` | Northern Sámi | 113\n| `sk` | Slovak | 818 | 0.92 | on UD SK-SNK\n| `sl` | Slovene | 136\n| `sq` | Albanian | 35\n| `sv` | Swedish | 658 | | alternative: [lemmy](https://github.com/sorenlind/lemmy)\n| `sw` | Swahili | 10 | | experimental\n| `tl` | Tagalog | 32 | | experimental\n| `tr` | Turkish | 1,232 | 0.89 | on UD-TR-Boun\n| `uk` | Ukrainian | 370 | | alternative: [pymorphy2](https://github.com/kmike/pymorphy2/)\n\n\nFor languages marked as having low coverage, language-specific\nlibraries may be better suited, but Simplemma can still provide limited\nfunctionality.\n\n*Experimental* mentions indicate that the language remains untested or\nthat there could be issues with the underlying data or lemmatization\nprocess.\n\nThe scores are calculated on [Universal\nDependencies](https://universaldependencies.org/) treebanks, on single-word\ntokens (including some contractions but not merged prepositions); they\ndescribe to what extent simplemma can accurately map tokens to\ntheir lemma form. 
See the `eval/` folder of the code repository for more\ninformation.\n\nThis library is particularly relevant for the lemmatization of\nless frequent words. Its performance in this case is only incidentally\ncaptured by the benchmark above. In some languages, a fixed number of\nwords such as pronouns can be further mapped by hand to enhance\nperformance.\n\n\n## Speed\n\nThe following orders of magnitude are provided for reference only and\nwere measured on an old laptop to establish a lower bound:\n\n-   Tokenization: \\\u003e 1 million tokens/sec\n-   Lemmatization: \\\u003e 250,000 words/sec\n\nUsing the most recent Python version (e.g. with `pyenv`) can make the\npackage run faster.\n\n\n## Roadmap\n\n- [x] Add further lemmatization lists\n- [ ] Grammatical categories as option\n- [ ] Function as a meta-package?\n- [ ] Integrate optional, more complex models?\n\n\n## Credits and licenses\n\nThe software is licensed under the MIT license. For information on the\nlicenses of the linguistic information databases, see the `licenses` folder.\n\nThe surface lookups (non-greedy mode) rely on lemmatization lists derived\nfrom the following sources, listed in order of relative importance:\n\n-   [Lemmatization\n    lists](https://github.com/michmech/lemmatization-lists) by Michal\n    Měchura (Open Database License)\n-   Wiktionary entries packaged by the [Kaikki\n    project](https://kaikki.org/)\n-   [FreeLing project](https://github.com/TALP-UPC/FreeLing)\n-   [spaCy lookups\n    data](https://github.com/explosion/spacy-lookups-data)\n-   [Unimorph Project](https://unimorph.github.io/)\n-   [Wikinflection\n    corpus](https://github.com/lenakmeth/Wikinflection-Corpus) by Eleni\n    Metheniti (CC BY 4.0 License)\n\n\n## Contributions\n\nThis package was first created and published by Adrien Barbaresi.\nIt has since benefited from extensive refactoring by Juanjo Diaz (especially the new classes).\nSee the [full list of 
contributors](https://github.com/adbar/simplemma/graphs/contributors)\nto the repository.\n\nFeel free to contribute, notably by [filing\nissues](https://github.com/adbar/simplemma/issues/) for feedback, bug\nreports, or links to further lemmatization lists, rules and tests.\n\nPull requests should follow these\nconventions: code style with [black](https://github.com/psf/black), type\nhinting with [mypy](https://github.com/python/mypy), included tests with\n[pytest](https://pytest.org).\n\n\n## Other solutions\n\nSee lists: [German-NLP](https://github.com/adbar/German-NLP) and [other\nawesome-NLP lists](https://github.com/adbar/German-NLP#More-lists).\n\nFor another approach in Python see spaCy's\n[edit tree lemmatizer](https://spacy.io/api/edittreelemmatizer).\n\n\n## References\n\nTo cite this software:\n\n[![Reference DOI: 10.5281/zenodo.4673264](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.4673264-brightgreen)](https://doi.org/10.5281/zenodo.4673264)\n\nBarbaresi A. (*year*). Simplemma: a simple multilingual lemmatizer for\nPython \\[Computer software\\] (Version *version number*). Berlin,\nGermany: Berlin-Brandenburg Academy of Sciences. Available from\n\u003chttps://github.com/adbar/simplemma\u003e DOI: 10.5281/zenodo.4673264\n\nThis work draws from lexical analysis algorithms used in:\n\n-   Barbaresi, A., \u0026 Hein, K. (2017). [Data-driven identification of\n    German phrasal\n    compounds](https://hal.archives-ouvertes.fr/hal-01575651/document).\n    In International Conference on Text, Speech, and Dialogue, Springer,\n    pp. 192-200.\n-   Barbaresi, A. (2016). [An unsupervised morphological criterion for\n    discriminating similar\n    languages](https://aclanthology.org/W16-4827/). In 3rd Workshop on\n    NLP for Similar Languages, Varieties and Dialects (VarDial 2016),\n    Association for Computational Linguistics, pp. 212-220.\n-   Barbaresi, A. (2016). 
[Bootstrapped OCR error detection for a\n    less-resourced language\n    variant](https://hal.archives-ouvertes.fr/hal-01371689/document). In\n    13th Conference on Natural Language Processing (KONVENS 2016), pp.\n    21-26.\n","funding_links":[],"categories":["Tools"],"sub_categories":["Morphology"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadbar%2Fsimplemma","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadbar%2Fsimplemma","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadbar%2Fsimplemma/lists"}