# DE-Lemma

[![GitHub license](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://raw.githubusercontent.com/mawiesne/DE-Lemma/main/LICENSE)
[![Build Status](https://github.com/mawiesne/DE-Lemma/actions/workflows/maven.yml/badge.svg)](https://github.com/mawiesne/DE-Lemma/actions)
[![Contributors](https://img.shields.io/github/contributors/mawiesne/DE-Lemma)](https://github.com/mawiesne/DE-Lemma/graphs/contributors)
[![GitHub pull requests](https://img.shields.io/github/issues-pr-raw/mawiesne/DE-Lemma.svg)](https://github.com/mawiesne/DE-Lemma/pulls)

DE-Lemma (_pronounced_: de:e: le:ma:) is an object-oriented lemmatizer for German texts with a focus on the (bio)medical domain.

It is based on [Apache OpenNLP](https://github.com/apache/opennlp) and provides several pre-trained, binary Maximum-Entropy _models_ in the `models` directory. These were trained in October 2022 on freely available German treebanks.

## Requirements

### Build
- [Apache Maven](https://maven.apache.org) in version 3.6+

### Runtime
- Java / [OpenJDK](https://adoptium.net/de/) in version 17+
- [Apache OpenNLP](https://github.com/apache/opennlp) in version 2.1.0+

#### Notes:
- OpenNLP releases < 2.1.0 cannot reliably load the lemmatizer model files of this project! This is due to [OpenNLP-1366](https://issues.apache.org/jira/browse/OPENNLP-1366), which was discovered during work on **DE-Lemma**. The bug was fixed via [PR-427](https://github.com/apache/opennlp/pull/427), and the fix is included in version 2.1.0.
- Check your classpath and make sure no older OpenNLP version is present (see the sketch below for a quick runtime check)!
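
A minimal sketch like the following (class name chosen here for illustration) can help verify which OpenNLP version is actually resolved at runtime; `Version.currentVersion()` reports the version shipped with the `opennlp-tools` jar found on the classpath:

```java
import opennlp.tools.util.Version;

public class OpenNLPVersionCheck {

    public static void main(String[] args) {
        // Prints the OpenNLP version of the opennlp-tools jar that is
        // actually resolved on the classpath, e.g. "2.1.0".
        System.out.println("OpenNLP on classpath: " + Version.currentVersion());
    }
}
```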

## Build
Build the project with Apache Maven by running `mvn clean package`.
This downloads all required dependencies, which are:

1. Apache OpenNLP,
2. Apache Commons Lang3, _and_
3. slf4j + log4j2 bindings.

If you want to re-use the current, experimental version of **DE-Lemma** in your projects,
execute `mvn clean install` to install the bundled _jar_ file into your local `.m2` folder.

Note:
You have to select one or more model files and copy them over to the execution environment.
Those models must reside in the `models` directory, as the current code inspects this directory name.

## Usage
For a first impression, just execute `DELemmaDemo.java`, which will, by default, load the [DE-Lemma_UD-gsd-2022-maxent.bin](models%2FDE-Lemma_UD-gsd-2022-maxent.bin)
model resource. The loaded `Lemmatizer` instance will then find the lemmas of German (non-)**inflected** nouns from the (bio)medical domain.

> [!IMPORTANT]
> Due to limited LFS storage, only the _DE-Lemma_UD-gsd-2022-maxent.bin_ model is included in the `models` directory when you clone this repository.
> You will have to download all other [model files](https://download.it.hs-heilbronn.de/de-lemma/) separately.

Once retrieved, place those model files in the `models` directory to start experimenting with them.

In the demo example, the German nouns `List.of("Ärzte", "Herzzusatztöne", ...)` will be processed.
The results are logged to stdout / the console and should look similar to:

```
INFO [main] OpenNLPModelServiceImpl (OpenNLPModelServiceImpl.java:50) - Importing NLP model file 'DE-Lemma_UD-gsd-2022-maxent.bin' ...
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Virus' for noun 'Viren'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Herzzusatzton' for noun 'Herzzusatztöne'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Vorhofflattern' for noun 'Vorhofflatterns'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Arzt' for noun 'Ärzte'.
INFO [main] DELemmaDemo (DELemmaDemo.java:30) - Found lemma 'Klinikum' for noun 'Klinikum'.
```
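
Beyond the demo, loading one of the `.bin` files with the plain OpenNLP API might look roughly like the following sketch. This is not the project's own demo code (the demo goes through `OpenNLPModelServiceImpl`); it assumes the files are standard OpenNLP `LemmatizerModel`s and that UPOS tags (e.g. `NOUN`) are supplied for the tokens, matching the `upos.tagset=upos` training setting.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class LemmatizeSketch {

    public static void main(String[] args) throws Exception {
        // Load a DE-Lemma model from the 'models' directory
        try (InputStream in = new FileInputStream("models/DE-Lemma_UD-gsd-2022-maxent.bin")) {
            LemmatizerModel model = new LemmatizerModel(in);
            LemmatizerME lemmatizer = new LemmatizerME(model);

            // Tokens and their UPOS tags (one tag per token)
            String[] tokens = {"Ärzte", "Herzzusatztöne", "Viren"};
            String[] posTags = {"NOUN", "NOUN", "NOUN"};

            String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println("Found lemma '" + lemmas[i] + "' for noun '" + tokens[i] + "'.");
            }
        }
    }
}
```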

## How to obtain all German model files?
The complete set of files consists of the four models reported in the paper:

| Model name | Size | External download required |
|---------------------------------|------|----------------------------------------------------------------------------------------|
| DE-Lemma_UD-gsd-2022-maxent.bin | 861K | No |
| DE-Lemma_UD-hdt-2022-maxent.bin | 14M | [Yes](https://download.it.hs-heilbronn.de/de-lemma/DE-Lemma_UD-hdt-2022-maxent.bin) |
| DE-Lemma_Tue-BuReg-2022-maxent.bin | 3.9M | [Yes](https://download.it.hs-heilbronn.de/de-lemma/DE-Lemma_Tue-BuReg-2022-maxent.bin) |
| DE-Lemma_Tue-Wiki-2022-maxent.bin | 131M | [Yes](https://download.it.hs-heilbronn.de/de-lemma/DE-Lemma_Tue-Wiki-2022-maxent.bin) |


> [!NOTE]
> All trained models were evaluated for lemma prediction performance, see **Table 3** in the paper.

## How to cite?
If you use **DE-Lemma** models or the lemmatizer code in scientific work, please cite the
[GMDS 2023](https://www.gmds2023.de/proceedings) paper as follows:

> :memo: Wiesner M. _DE-Lemma: A Maximum-Entropy Based Lemmatizer for German Medical Text._
> Studies in Health Technology and Informatics. 2023 Sep 12;**307**:189-195.
> DOI: [10.3233/SHTI230712](https://doi.org/10.3233/SHTI230712),
> PMID: [37697853](https://www.ncbi.nlm.nih.gov/pubmed/37697853)

## Training details
Several available treebanks (in _CoNLL-U_ or _CoNLL-X_ format) were identified
and selected as candidates for training German lemmatizer models.

The German UD-treebanks, [UD-GSD and UD-HDT](https://universaldependencies.org/treebanks/de-comparison.html), are constructed from
text corpora of German newspapers and other freely available text materials.
The treebanks [TüBa-D/DP](https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-ddp/)
and [TüBa-D/W](https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-dw/)
also qualified for training lemmatizer models. These contain information about word types, morphology, lemmas, and dependency relations.
TüBa-D/W is a huge corpus: based on Wikipedia text material, it comprises 36.1 million sentences.

The training of lemmatizer models was conducted based on the open-source NLP toolkit [Apache OpenNLP](https://opennlp.apache.org).
For the generation of lemmatizer models with smaller treebanks (UD-GSD, UD-HDT, TüBa-D/DP-political),
the OpenNLP training parameters were chosen as follows:

```
training.algorithm=maxent
training.iterations=100
training.cutoff=5
training.threads=16
language=de
use.token.end=false
sentences.per.sample=5
upos.tagset=upos
```

The training for TüBa-D/W was conducted with these parameters:
```
training.algorithm=maxent
training.iterations=20
training.cutoff=5
training.threads=4
language=de
use.token.end=false
sentences.per.sample=5
upos.tagset=upos
```
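
For orientation only, the core parameters above roughly map onto OpenNLP's `TrainingParameters` API as in the sketch below. This is **not** the project's actual training pipeline: the real training read CoNLL-U/CoNLL-X treebanks (settings such as `use.token.end`, `sentences.per.sample` and `upos.tagset` belong to that treebank reader), while this example assumes a hypothetical plain-text training file `de-lemmatizer.train` in OpenNLP's simple word/POS-tag/lemma format and an illustrative output file name.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class LemmatizerTrainingSketch {

    public static void main(String[] args) throws Exception {
        // Core training parameters, mirroring the settings listed above
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");
        params.put(TrainingParameters.THREADS_PARAM, "16");

        // Hypothetical training file in OpenNLP's plain lemmatizer format
        // (one token per line: word, POS tag, lemma; empty line between sentences)
        File trainFile = new File("de-lemmatizer.train");

        try (ObjectStream<String> lines = new PlainTextByLineStream(
                 new MarkableFileInputStreamFactory(trainFile), StandardCharsets.UTF_8);
             ObjectStream<LemmaSample> samples = new LemmaSampleStream(lines)) {

            LemmatizerModel model =
                LemmatizerME.train("de", samples, params, new LemmatizerFactory());

            // Persist the binary model for later re-use
            model.serialize(new File("DE-Lemma_custom-maxent.bin"));
        }
    }
}
```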

Since the training of a lemmatizer model required between ~32 GB (UD-GSD) and
~1,100 GB (TüBa-D/W) of RAM at runtime, these tasks could not be performed on
conventional workstation hardware. Therefore, each model was trained
in the HPC environment of the [bwUniCluster](https://wiki.bwhpc.de/e/BwUniCluster2.0) during October 2022.
The execution environment of the training program was a Java Runtime
Environment (JRE): a 64-bit OpenJDK 8 (build 292).

The resulting binary model files were persisted for evaluation and later re-use in NLP
applications with a lemmatizer component.