https://github.com/jfilter/german-lemmatizer

✂️ Python package (using a Docker image under the hood) to lemmatize German texts.

german lemmatization lemmatizer natural-language-processing nlp python


Host: GitHub
URL: https://github.com/jfilter/german-lemmatizer
Owner: jfilter
License: mit
Created: 2019-05-25T18:13:52.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-12-09T05:18:31.000Z (almost 3 years ago)
Last Synced: 2025-04-25T07:51:27.307Z (6 months ago)
Topics: german, lemmatization, lemmatizer, natural-language-processing, nlp, python
Language: Python
Homepage:
Size: 95.7 KB
Stars: 7
Watchers: 3
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# German Lemmatizer

A Python package (using a Docker image under the hood) to [lemmatize](https://en.wikipedia.org/wiki/Lemmatisation) German texts.

Built upon:

-   [IWNLP](https://github.com/Liebeck/spacy-iwnlp): Uses the crowd-generated token tables on [de.wiktionary](https://de.wiktionary.org/).

-   [GermaLemma](https://github.com/WZBSocialScienceCenter/germalemma): Looks up lemmas in the [TIGER Corpus](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/) and uses [Pattern](https://www.clips.uantwerpen.be/pattern) as a fallback for some rule-based lemmatizations.

It works as follows. First, [spaCy](https://spacy.io/) tags each token with its part of speech (POS). Then `German Lemmatizer` looks up lemmas in IWNLP and GermaLemma. If both tools return the same lemma, or only one of them finds a lemma, that lemma is used. If they disagree, the lemma from IWNLP is chosen. The casing of the original token is preserved where possible.
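The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the package's actual code: `choose_lemma` is a hypothetical helper, and both the fallback to the original token and the first-letter casing rule are assumptions about what "no lemma found" and "preserve the casing" mean.

```python
from typing import Optional


def choose_lemma(token: str, iwnlp: Optional[str], germalemma: Optional[str]) -> str:
    """Hypothetical helper: pick a lemma from the two lookup results."""
    # IWNLP takes precedence whenever it returns something. This covers
    # agreement, disagreement, and the case where only IWNLP finds a lemma.
    lemma = iwnlp or germalemma
    if lemma is None:
        # Assumption: keep the token unchanged if neither tool finds a lemma.
        return token
    # Assumption: "preserving casing" means matching the case of the
    # token's first letter.
    if token[:1].isupper():
        return lemma[:1].upper() + lemma[1:]
    return lemma[:1].lower() + lemma[1:]


print(choose_lemma("sang", "singen", "sangen"))  # the tools disagree, IWNLP wins
print(choose_lemma("guter", None, "gut"))        # only GermaLemma finds a lemma
```

Because IWNLP wins every disagreement, a single `iwnlp or germalemma` expression is enough to cover all three cases from the description.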

You may want to use the underlying Docker image directly: [german-lemmatizer-docker](https://github.com/jfilter/german-lemmatizer-docker)

## Installation

1. Install [Docker](https://docs.docker.com/).

2. `pip install german-lemmatizer`

## Usage

1. Read and accept the [license terms of the TIGER Corpus](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/htmlicense.html) (free to use for non-commercial purposes).

2. Make sure the Docker daemon is running.

3. Write some Python code:

```python
from german_lemmatizer import lemmatize

lemmatize(
    ['Johannes war ein guter Schüler', 'Sabiene sang zahlreiche Lieder'],
    working_dir='*',
    chunk_size=10000,
    n_jobs=1,
    escape=False,
    remove_stop=False)
```

The list of texts is split into chunks (`chunk_size`) and processed in parallel (`n_jobs`).

Enable the `escape` parameter if your text contains newlines. `remove_stop` removes stopwords as defined by spaCy.
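To illustrate how `chunk_size` and `n_jobs` interact, here is a minimal sketch of chunked, parallel processing. It is not how the package is implemented (the package runs a Docker image): `chunks`, `process_in_chunks`, and the `worker` callback are hypothetical names, and a thread pool stands in for whatever parallelism the package actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def chunks(texts: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size texts."""
    for start in range(0, len(texts), chunk_size):
        yield texts[start:start + chunk_size]


def process_in_chunks(
    texts: List[str],
    worker: Callable[[List[str]], List[str]],
    chunk_size: int = 10000,
    n_jobs: int = 1,
) -> List[str]:
    """Run worker on each chunk with up to n_jobs threads (stand-in for the real backend)."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() returns results in input order, so the output lines up with texts.
        results = pool.map(worker, chunks(texts, chunk_size))
        return [item for chunk in results for item in chunk]


# Hypothetical worker that upper-cases each text instead of lemmatizing it.
print(process_in_chunks(["a", "b", "c"], lambda c: [t.upper() for t in c], chunk_size=2, n_jobs=2))
```

With this structure, a smaller `chunk_size` produces more units of work to spread across `n_jobs`, at the cost of more per-chunk overhead.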

## License

MIT.

## Sponsoring

This work was created as part of a [project](https://github.com/jfilter/ptf) that was funded by the German [Federal Ministry of Education and Research](https://www.bmbf.de/en/index.html).
