https://github.com/wzbsocialsciencecenter/germalemma
A lemmatizer for German language text
- Host: GitHub
- URL: https://github.com/wzbsocialsciencecenter/germalemma
- Owner: WZBSocialScienceCenter
- License: apache-2.0
- Created: 2017-05-19T11:58:49.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-02-07T22:58:06.000Z (over 2 years ago)
- Last Synced: 2025-04-10T04:04:42.867Z (6 months ago)
- Topics: german, language-processing, lemmatization, lemmatizer, nlp, python
- Language: Python
- Size: 74.2 KB
- Stars: 88
- Watchers: 11
- Forks: 11
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE
# GermaLemma
December 2019, Markus Konrad / [Berlin Social Science Center](https://www.wzb.eu/en)
**This project is currently not maintained.**
## A lemmatizer for German language text
GermaLemma lemmatizes Part-of-Speech-tagged German language words. To do so, it combines a large lemma dictionary (an excerpt of the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)), functions from the CLiPS "Pattern" package, and an algorithm for splitting compound words (composita).
## Installation
### Easy option: Installing from PyPI via `pip`
You can install the package from [PyPI](https://pypi.org/project/germalemma/) via `pip`:
```
pip install -U germalemma
```

### Alternative option: Downloading and installing from source
**Only do this if you don't install germalemma via pip:**
In order to use GermaLemma, you will need to install some additional packages (see *Requirements* section below) and then download the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html). You will need to use the CONLL09 format, *not* the XML format.
The corpus is free to use for non-commercial purposes (see [License Agreement](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/htmlicense.html)).

Then, convert the corpus into pickle format for faster loading by executing `germalemma/__init__.py` and passing the path to the corpus file in CONLL09 format:
```
python germalemma/__init__.py tiger_release_[...].conll09
```

This will place a `lemmata.pickle` file in the `data` directory, which is then loaded automatically.
## Part-of-Speech (POS) Tagging
You will need to apply [Part-of-Speech (POS) tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) to your text before you can lemmatize its words. See [this blog post](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/) on how to do that.
## Usage
Once you have set up GermaLemma to use the TIGER corpus (as explained above), tokenized your text (e.g. with NLTK), and POS-tagged your tokens, you can use GermaLemma:
```python
from germalemma import GermaLemma

lemmatizer = GermaLemma()

# passing the word and the POS tag ("N" for noun)
lemma = lemmatizer.find_lemma('Feinstaubbelastungen', 'N')
print(lemma)
# -> lemma is "Feinstaubbelastung"
```

## Valid POS tags
You can pass POS tags from the [STTS tagset](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html). However, only four groups of POS tags can be processed:
* 'N...' (nouns)
* 'V...' (verbs)
* 'ADJ...' (adjectives)
* 'ADV...' (adverbs)

**All other POS tags will result in a `ValueError`, so you should wrap the call to `find_lemma` in a *try-except block*.**
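
The try-except advice above could be applied to a whole tagged sentence roughly as follows. This is a minimal sketch, not part of GermaLemma: the helper name `lemmatize_tagged` is hypothetical, and it only assumes an object with the `find_lemma(word, pos_tag)` method shown in the usage section. Falling back to the unchanged token for unsupported tags is one possible design choice; you may prefer to skip such tokens instead.

```python
def lemmatize_tagged(lemmatizer, tagged_tokens):
    """Return (token, lemma) pairs for an iterable of (token, pos_tag) pairs.

    Hypothetical helper: tokens whose POS tag is not supported keep their
    original surface form as the "lemma".
    """
    result = []
    for token, pos_tag in tagged_tokens:
        try:
            lemma = lemmatizer.find_lemma(token, pos_tag)
        except ValueError:
            # Raised for POS tags outside N..., V..., ADJ..., ADV...
            lemma = token
        result.append((token, lemma))
    return result


# Usage (requires germalemma to be installed):
# from germalemma import GermaLemma
# pairs = lemmatize_tagged(GermaLemma(),
#                          [('Die', 'ART'), ('Feinstaubbelastungen', 'NN')])
```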
## Accuracy
GermaLemma's accuracy was evaluated using a sample of 696 POS tagged and manually lemmatized words from a sample of paragraphs from proceedings of the European Parliament, Goethe's "Werther", Kafka's "Verwandlung" and a news article from the website of the WZB (see samples in folder "eval_texts").
**Under the assumption that the POS tag is correct** (only those words were selected), GermaLemma finds the correct lemma in 99.43% of the cases. For comparison, *Pattern* achieved 95.11% for the same sample.
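
To put these percentages in absolute terms, the short calculation below converts them back into word counts for the 696-word sample. The counts themselves are not stated in the evaluation; they are back-computed from the reported percentages and should be read as an approximation under that assumption.

```python
SAMPLE_SIZE = 696  # number of evaluated words, as reported above


def correct_count(accuracy_percent, n=SAMPLE_SIZE):
    """Back-compute the number of correct lemmata from a reported percentage."""
    return round(accuracy_percent / 100 * n)


germalemma_correct = correct_count(99.43)
pattern_correct = correct_count(95.11)

print(germalemma_correct, SAMPLE_SIZE - germalemma_correct)  # 692 4
print(pattern_correct, SAMPLE_SIZE - pattern_correct)        # 662 34
```

Under this reading, GermaLemma missed about 4 of the 696 words, compared with about 34 for *Pattern*.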
## Requirements
* Python 3.6 or newer
* required package [*Pyphen*](http://pyphen.org/)
* optional package [*PatternLite*](https://github.com/WZBSocialScienceCenter/patternlite) (This package is optional but highly recommended as it boosts the lemmatizer's accuracy.)

## License
Apache License 2.0. See *LICENSE* file.
The TIGER corpus is **not** part of this repository and has to be downloaded separately under separate license conditions.