DE-Lemma: An OpenNLP lemmatizer tool and model files trained via German treebanks
https://github.com/mawiesne/de-lemma
- Host: GitHub
- URL: https://github.com/mawiesne/de-lemma
- Owner: mawiesne
- License: apache-2.0
- Created: 2023-10-14T09:37:35.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-28T11:13:39.000Z (2 months ago)
- Last Synced: 2024-10-28T14:35:40.388Z (2 months ago)
- Topics: lemmatization, nlp, opennlp-models
- Language: Java
- Homepage:
- Size: 921 KB
- Stars: 1
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DE-Lemma
[![GitHub license](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://raw.githubusercontent.com/mawiesne/DE-Lemma/main/LICENSE)
[![Build Status](https://github.com/mawiesne/DE-Lemma/actions/workflows/maven.yml/badge.svg)](https://github.com/mawiesne/DE-Lemma/actions)
[![Contributors](https://img.shields.io/github/contributors/mawiesne/DE-Lemma)](https://github.com/mawiesne/DE-Lemma/graphs/contributors)
[![GitHub pull requests](https://img.shields.io/github/issues-pr-raw/mawiesne/DE-Lemma.svg)](https://github.com/mawiesne/DE-Lemma/pulls)

DE-Lemma (_pronounced_: de:e: le:ma:) is an object-oriented lemmatizer for German texts with a focus on the (bio)medical domain.
It is based on [Apache OpenNLP](https://github.com/apache/opennlp) and provides several pre-trained, binary Maximum-Entropy _models_ in the corresponding directory. Those were trained in October 2022 from freely available German treebanks.

## Requirements
### Build
- [Apache Maven](https://maven.apache.org) in version 3.6+

### Runtime
- Java / [OpenJDK](https://adoptium.net/de/) in version 17+
- [Apache OpenNLP](https://github.com/apache/opennlp) in version 2.1.0+
#### Notes:
- OpenNLP releases < 2.1.0 can't reliably load the lemmatizer model files of this project! This is due to [OPENNLP-1366](https://issues.apache.org/jira/browse/OPENNLP-1366), which was detected during the work on **DE-Lemma**. The bug was fixed via [PR-427](https://github.com/apache/opennlp/pull/427) and the fix is included in version 2.1.0.
- Check your classpath carefully so that no older OpenNLP version is present!
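
If in doubt, the OpenNLP version actually resolved on the classpath can be checked at runtime. The following is a minimal sketch using only standard JDK manifest metadata; the class name `ClasspathCheck` and the choice of `LemmatizerME` as lookup anchor are illustrative and not part of DE-Lemma:

```java
import opennlp.tools.lemmatizer.LemmatizerME;

public class ClasspathCheck {

    public static void main(String[] args) {
        // Reads the Implementation-Version from the manifest of the opennlp-tools jar
        // that is actually on the classpath. Any class shipped in opennlp-tools would do.
        // Note: this may be null if the jar does not carry an Implementation-Version entry.
        String version = LemmatizerME.class.getPackage().getImplementationVersion();
        System.out.println("OpenNLP on classpath: " + version);
        // DE-Lemma's models require OpenNLP 2.1.0 or newer (see OPENNLP-1366 above).
    }
}
```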
## Build
Build the project via Apache Maven.
The command for the relevant parts is `mvn clean package`.
This should download all required dependencies, which are:

1. Apache OpenNLP,
2. Apache Commons Lang3, _and_
3. slf4j + log4j2 bindings.

If you want to re-use the current, experimental version of **DE-Lemma** in your projects,
execute `mvn clean install` to install the bundled _jar_ file into your local `.m2` folder.

Note:
You have to select one or more model files and copy them over to the execution environment.
Those models must reside in the `models` directory, as the current code expects this directory name.
## Usage
For a first impression, just execute `DELemmaDemo.java`, which will, by default, load the [DE-Lemma_UD-gsd-2022-maxent.bin](models%2FDE-Lemma_UD-gsd-2022-maxent.bin)
model resource. The loaded `Lemmatizer` instance will then find the lemmas for German (non-)**inflected** nouns from the (bio)medical domain.

> [!IMPORTANT]
> For reasons of limited LFS storage, only the _DE-Lemma_UD-gsd-2022-maxent.bin_ model is included in the `models` directory when you clone this Git repository.
> You will have to download all other [model files](https://download.it.hs-heilbronn.de/de-lemma/) separately.

Once retrieved, place those model files in the `models` directory to start experimenting with them.
In the demo example, the German nouns `List.of("Ärzte", "Herzzusatztöne", ...)` will be processed.
The results are logged to standard out / the console. The output should be similar to:
```
INFO [main] OpenNLPModelServiceImpl (OpenNLPModelServiceImpl.java:50) - Importing NLP model file 'DE-Lemma_UD-gsd-2022-maxent.bin' ...
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Virus' for noun 'Viren'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Herzzusatzton' for noun 'Herzzusatztöne'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Vorhofflattern' for noun 'Vorhofflatterns'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Arzt' for noun 'Ärzte'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Klinikum' for noun 'Klinikum'.
```
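
Beyond the demo class, the models can also be used directly through the OpenNLP lemmatizer API. The following is a minimal sketch, assuming a model file in the `models` directory; the class name, the token list, and the UPOS tags are illustrative only:

```java
import java.io.File;

import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class LemmatizeSketch {

    public static void main(String[] args) throws Exception {
        // Load one of the DE-Lemma model files from the 'models' directory.
        LemmatizerModel model = new LemmatizerModel(
                new File("models/DE-Lemma_UD-gsd-2022-maxent.bin"));
        LemmatizerME lemmatizer = new LemmatizerME(model);

        // OpenNLP's lemmatizer expects tokens plus their POS tags;
        // the UPOS tags below are assumed example values for German nouns.
        String[] tokens = { "Ärzte", "Viren" };
        String[] posTags = { "NOUN", "NOUN" };

        String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("Found lemma '" + lemmas[i] + "' for noun '" + tokens[i] + "'.");
        }
    }
}
```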
## How to obtain all German model files?
The complete set of files consists of four models, as reported in the paper:

| Model name                         | Size | External download required                                                              |
|------------------------------------|------|-----------------------------------------------------------------------------------------|
| DE-Lemma_UD-gsd-2022-maxent.bin    | 861K | No                                                                                      |
| DE-Lemma_UD-hdt-2022-maxent.bin    | 14M  | [Yes](https://download.it.hs-heilbronn.de/de-lemma/DE-Lemma_UD-hdt-2022-maxent.bin)    |
| DE-Lemma_Tue-BuReg-2022-maxent.bin | 3.9M | [Yes](https://download.it.hs-heilbronn.de/de-lemma/DE-Lemma_Tue-BuReg-2022-maxent.bin) |
| DE-Lemma_Tue-Wiki-2022-maxent.bin  | 131M | [Yes](https://download.it.hs-heilbronn.de/de-lemma/DE-Lemma_Tue-Wiki-2022-maxent.bin)  |
> [!NOTE]
> All trained models were evaluated for lemma prediction performance; see **Table 3** in the paper.

## How to cite?
If you use **DE-Lemma** models or the lemmatizer code in scientific work, please cite the
[GMDS 2023](https://www.gmds2023.de/proceedings) paper as follows:

> :memo:
Wiesner M. _DE-Lemma: A Maximum-Entropy Based Lemmatizer for German Medical Text._
Studies in Health Technology and Informatics. 2023 Sep 12;**307**:189-195.
DOI: [10.3233/SHTI230712](https://doi.org/10.3233/SHTI230712),
PMID: [37697853](https://www.ncbi.nlm.nih.gov/pubmed/37697853)

## Training details
Several available treebanks (in _CoNLL-U_ or _CoNLL-X_ format) were identified
and selected as candidates for training German lemmatizer models.

The German UD treebanks, [UD-GSD and UD-HDT](https://universaldependencies.org/treebanks/de-comparison.html), are constructed from
text corpora of German newspapers and other freely available text materials.
The treebanks [TüBa-D/DP](https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-ddp/)
and [TüBa-D/W](https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-dw/)
also qualified for training lemmatizer models. Those contain information about word types, morphology, lemmas, as well as dependency relations.
TüBa-D/W represents a huge corpus: it is based on Wikipedia text material comprising 36.1 million sentences.

The training of the lemmatizer models was conducted with the open-source NLP toolkit [Apache OpenNLP](https://opennlp.apache.org).
For the generation of lemmatizer models from the smaller treebanks (UD-GSD, UD-HDT, TüBa-D/DP-political),
the OpenNLP training parameters were chosen as follows:

```
training.algorithm=maxent
training.iterations=100
training.cutoff=5
training.threads=16
language=de
use.token.end=false
sentences.per.sample=5
upos.tagset=upos
```
The training for TüBa-D/W was conducted with these parameters:
```
training.algorithm=maxent
training.iterations=20
training.cutoff=5
training.threads=4
language=de
use.token.end=false
sentences.per.sample=5
upos.tagset=upos
```
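
The `training.*` entries in these listings correspond to OpenNLP's machine-learning parameters, while the remaining keys (`language`, `use.token.end`, `sentences.per.sample`, `upos.tagset`) are options of the project-specific training program. As a hedged sketch of how the ML parameters map onto the OpenNLP training API (treebank reading is omitted; `openTreebank` is a hypothetical placeholder, not a DE-Lemma class):

```java
import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class TrainingSketch {

    // Hypothetical helper: reading LemmaSamples from a CoNLL-U / CoNLL-X treebank
    // is not shown in this sketch.
    static ObjectStream<LemmaSample> openTreebank(String path) {
        throw new UnsupportedOperationException("treebank reading omitted in this sketch");
    }

    public static void main(String[] args) throws Exception {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
        params.put(TrainingParameters.ITERATIONS_PARAM, "100"); // 20 for TüBa-D/W
        params.put(TrainingParameters.CUTOFF_PARAM, "5");
        params.put(TrainingParameters.THREADS_PARAM, "16");     // 4 for TüBa-D/W

        ObjectStream<LemmaSample> samples = openTreebank("path/to/treebank.conllu");
        LemmatizerModel model =
                LemmatizerME.train("de", samples, params, new LemmatizerFactory());
        samples.close();
        // The resulting binary model could then be persisted, e.g. via the model's serialize(...) methods.
    }
}
```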
Since the training of a lemmatizer model (LM) required between ~32 GB (UD-GSD) and
~1,100 GB (TüBa-D/W) of RAM at runtime, these tasks could not be performed on
conventional workstation hardware. Therefore, the training of each model was conducted
on the HPC environment of the [bwUniCluster](https://wiki.bwhpc.de/e/BwUniCluster2.0) during October 2022.
The execution environment of the training program was a Java Runtime
Environment (JRE), a 64-bit OpenJDK in version 8, build 292.

The resulting binary model files were persisted for evaluation and later re-use in NLP
applications with a lemmatizer component.