https://github.com/wzbsocialsciencecenter/germalemma

A lemmatizer for German language text
https://github.com/wzbsocialsciencecenter/germalemma

german language-processing lemmatization lemmatizer nlp python

Last synced: 6 months ago
JSON representation

A lemmatizer for German language text

Host: GitHub
URL: https://github.com/wzbsocialsciencecenter/germalemma
Owner: WZBSocialScienceCenter
License: apache-2.0
Created: 2017-05-19T11:58:49.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2023-02-07T22:58:06.000Z (over 2 years ago)
Last Synced: 2025-04-10T04:04:42.867Z (6 months ago)
Topics: german, language-processing, lemmatization, lemmatizer, nlp, python
Language: Python
Size: 74.2 KB
Stars: 88
Watchers: 11
Forks: 11
Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- License: LICENSE

Awesome Lists containing this project

README

          # GermaLemma

December 2019, Markus Konrad  /  / [Berlin Social Science Center](https://www.wzb.eu/en)

**This project is currently not maintained.**

## A lemmatizer for German language text

Germalemma lemmatizes Part-of-Speech-tagged German language words. To do so, it combines a large lemma dictionary (an excerpt of the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)), functions from the CLiPS "Pattern" package, and an algorithm to split composita.

## Installation

### Easy option: Installing from PyPI via `pip`

You can install the package from [PyPI](https://pypi.org/project/germalemma/) via `pip`:

```

pip install -U germalemma

```

### Alternative option: Downloading and installing from source

**Only do this if you don't install germalemma via pip:**

In order to use GermaLemma, you will need to install some additional packages (see *Requirements* section below) and then download the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html). You will need to use the CONLL09 format, *not* the XML format.

The corpus is free to use for non-commercial purposes (see [License Agreement](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/htmlicense.html)).

Then, you should convert the corpus into pickle format for faster loading by executing `germalemma/__init__.py` and passing the path to the corpus file in CONLL09 format:

```

python germalemma/__init__.py tiger_release_[...].conll09

```

This will place a `lemmata.pickle` file in the `data` directory which is then automatically loaded.

## Part-of-Speech (POS) Tagging

You will need to apply [Part-of-Speech (POS) tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) to your text before you can lemmatize its words. See [this blog post](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/) on how to do that.

## Usage

You have set up GermaLemma to use the TIGER corpus (as explained above). You have tokenized your text (e.g. with NLTK). You have POS-tagged your tokens. Now you can use GermaLemma:

```python

from germalemma import GermaLemma

lemmatizer = GermaLemma()

# passing the word and the POS tag ("N" for noun)

lemma = lemmatizer.find_lemma('Feinstaubbelastungen', 'N')

print(lemma)

# -> lemma is "Feinstaubbelastung"

```

## Valid POS tags

You can pass POS tags from the [STTS tagset](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html), however, only four POS tags can be processed:

* 'N...' (nouns)

* 'V...' (verbs)

* 'ADJ...' (adjectives)

* 'ADV...' (adverbs)

**All other POS tags will result in a `ValueError` so you should wrap the call to `find_lemma` in a *try-except block*.**

## Accuracy

GermaLemma's accuracy was evaluated using a sample of 696 POS tagged and manually lemmatized words from a sample of paragraphs from proceedings of the European Parliament, Goethe's "Werther", Kafka's "Verwandlung" and a news article from the website of the WZB (see samples in folder "eval_texts").

**Under the assumption that the POS tag is correct** (only those words were selected), GermaLemma finds the correct lemma in 99.43% of the cases. For comparison, *Pattern* achieved 95.11% for the same sample.

## Requirements

* Python 3.6 or newer

* required package [*Pyphen*](http://pyphen.org/)

* optional package [*PatternLite*](https://github.com/WZBSocialScienceCenter/patternlite) (This package is optional but highly recommended as it boosts the lemmatizer's accuracy.)

## License

Apache License 2.0. See *LICENSE* file.

The TIGER corpus is **not** part of this repository and has to be downloaded separately under separate license conditions.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/wzbsocialsciencecenter/germalemma

Awesome Lists containing this project

README