{"id":26109567,"url":"https://github.com/wzbsocialsciencecenter/germalemma","last_synced_at":"2025-04-12T20:31:05.690Z","repository":{"id":21125768,"uuid":"91801391","full_name":"WZBSocialScienceCenter/germalemma","owner":"WZBSocialScienceCenter","description":"A lemmatizer for German language text","archived":false,"fork":false,"pushed_at":"2023-02-07T22:58:06.000Z","size":76,"stargazers_count":88,"open_issues_count":8,"forks_count":11,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-10T04:04:42.867Z","etag":null,"topics":["german","language-processing","lemmatization","lemmatizer","nlp","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WZBSocialScienceCenter.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-05-19T11:58:49.000Z","updated_at":"2025-03-02T14:03:46.000Z","dependencies_parsed_at":"2023-09-30T21:15:09.930Z","dependency_job_id":null,"html_url":"https://github.com/WZBSocialScienceCenter/germalemma","commit_stats":{"total_commits":32,"total_committers":1,"mean_commits":32.0,"dds":0.0,"last_synced_commit":"667bd2937e1d221d8d9dd2966f67e724f17a6e04"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fgermalemma","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fgermalemma/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fgermalemma/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WZBSocialScienceCenter%2Fgermalemma/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/WZBSocialScienceCenter","download_url":"https://codeload.github.com/WZBSocialScienceCenter/germalemma/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248629366,"owners_count":21136241,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["german","language-processing","lemmatization","lemmatizer","nlp","python"],"created_at":"2025-03-09T23:12:13.244Z","updated_at":"2025-04-12T20:31:05.669Z","avatar_url":"https://github.com/WZBSocialScienceCenter.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GermaLemma\n\nDecember 2019, Markus Konrad \u003cmarkus.konrad@wzb.eu\u003e / \u003cpost@mkonrad.net\u003e / [Berlin Social Science Center](https://www.wzb.eu/en)\n\n**This project is currently not maintained.**\n\n## A lemmatizer for German language text\n\nGermalemma lemmatizes Part-of-Speech-tagged German language words. 
To do so, it combines a large lemma dictionary (an excerpt of the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)), functions from the CLiPS \"Pattern\" package, and an algorithm to split compound words.\n\n## Installation\n\n### Easy option: Installing from PyPI via `pip`\n\nYou can install the package from [PyPI](https://pypi.org/project/germalemma/) via `pip`:\n\n```\npip install -U germalemma\n```\n\n### Alternative option: Downloading and installing from source\n\n**Only do this if you don't install germalemma via pip:**\n\nIn order to use GermaLemma, you will need to install some additional packages (see *Requirements* section below) and then download the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html). You will need to use the CONLL09 format, *not* the XML format.\nThe corpus is free to use for non-commercial purposes (see [License Agreement](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/htmlicense.html)).\n\nThen, you should convert the corpus into pickle format for faster loading by executing `germalemma/__init__.py` and passing the path to the corpus file in CONLL09 format:\n\n```\npython germalemma/__init__.py tiger_release_[...].conll09\n```\n\nThis will place a `lemmata.pickle` file in the `data` directory which is then automatically loaded.\n\n## Part-of-Speech (POS) Tagging\n\nYou will need to apply [Part-of-Speech (POS) tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) to your text before you can lemmatize its words. See [this blog post](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/) on how to do that.\n\n## Usage\n\nYou have set up GermaLemma to use the TIGER corpus (as explained above). You have tokenized your text (e.g. with NLTK). You have POS-tagged your tokens. 
Now you can use GermaLemma:\n\n```python\nfrom germalemma import GermaLemma\n\nlemmatizer = GermaLemma()\n\n# passing the word and the POS tag (\"N\" for noun)\nlemma = lemmatizer.find_lemma('Feinstaubbelastungen', 'N')\nprint(lemma)\n# -\u003e lemma is \"Feinstaubbelastung\"\n```\n\n## Valid POS tags\n\nYou can pass POS tags from the [STTS tagset](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html), but only tags starting with one of four prefixes can be processed:\n\n* 'N...' (nouns)\n* 'V...' (verbs)\n* 'ADJ...' (adjectives)\n* 'ADV...' (adverbs)\n\n**All other POS tags will result in a `ValueError`, so you should wrap the call to `find_lemma` in a *try-except block*.**\n\n## Accuracy\n\nGermaLemma's accuracy was evaluated using a sample of 696 POS-tagged and manually lemmatized words drawn from paragraphs of proceedings of the European Parliament, Goethe's \"Werther\", Kafka's \"Verwandlung\" and a news article from the website of the WZB (see samples in folder \"eval_texts\").\n\n**Under the assumption that the POS tag is correct** (only those words were selected), GermaLemma finds the correct lemma in 99.43% of cases. For comparison, *Pattern* achieved 95.11% for the same sample.\n\n## Requirements\n\n* Python 3.6 or newer\n* required package [*Pyphen*](http://pyphen.org/)\n* optional package [*PatternLite*](https://github.com/WZBSocialScienceCenter/patternlite) (This package is optional but highly recommended as it boosts the lemmatizer's accuracy.)\n\n## License\n\nApache License 2.0. 
See *LICENSE* file.\n\nThe TIGER corpus is **not** part of this repository and has to be downloaded separately under separate license conditions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwzbsocialsciencecenter%2Fgermalemma","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwzbsocialsciencecenter%2Fgermalemma","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwzbsocialsciencecenter%2Fgermalemma/lists"}