https://github.com/adbar/German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
https://github.com/adbar/German-NLP
computational-linguistics corpus-linguistics german-language natural-language-processing nlp text-mining
Last synced: over 1 year ago
JSON representation
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
Host: GitHub
URL: https://github.com/adbar/German-NLP
Owner: adbar
Created: 2018-06-12T10:31:03.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2024-08-28T12:31:54.000Z (almost 2 years ago)
Last Synced: 2024-08-28T13:57:10.626Z (almost 2 years ago)
Topics: computational-linguistics, corpus-linguistics, german-language, natural-language-processing, nlp, text-mining
Homepage:
Size: 144 KB
Stars: 440
Watchers: 45
Forks: 63
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: contributing.md
Awesome Lists containing this project

awesome-nlp - 德文-自然語言處理 - 開發的開放式訪問/開源/現成資源和工具列表，特別關注德語。 (自然語言處理-德文 / 函式庫)
awesome-nlp - 德文-自然語言處理 - 開發的開放式訪問/開源/現成資源和工具列表，特別關注德語。 (自然語言處理-德文 / 函式庫)
pronunciation - Awesome German NLP - Natural language processing tools for German. (Related Resources / Learning Strategies)
README

          # German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

Resources and tools which can be used either off-the-shelf or with minor adjustments and which are currently maintained are primarily chosen for this list. It is deliberately biased in terms of usability and user-friendliness.

Community support is needed to keep this list up-to-date, pull requests and suggestions are welcome! See [contributing guidelines](contributing.md).

## Table of Contents

- [Text corpora](#Text-corpora)

   - [General-purpose](#General-purpose)

   - [Historical](#Historical)

   - [Specialized](#Specialized)

   - [Word lists](#Word-lists)

   - [Data acquisition](#Data-acquisition)

   - [Lists of corpora](#Lists-of-corpora)

- [Generic resources](#Generic-resources)

   - [Frameworks](#Frameworks)

   - [Treebanks](#Treebanks)

   - [Deep learning models and transformers](#Deep-learning-models-and-transformers)

   - [Annotation](#Annotation)

   - [Standards](#Standards)

- [Linguistic processing](#Linguistic-processing)

   - [Preprocessing](#Preprocessing)

   - [Tokenization / Sentence boundary detection](#Tokenization--sentence-boundary-detection)

   - [Stemming](#Stemming)

   - [Lemmatization](#Lemmatization)

   - [Morphological analysis](#Morphological-analysis)

   - [Normalization](#Normalization)

   - [Phonology](#Phonology)

   - [POS-tagging](#POS-tagging)

   - [Syntactical parsing](#Syntactical-parsing)

   - [Named Entity Recognition](#Named-Entity-Recognition)

   - [Industry/Applications](#industryapplications)

   - [Evaluation](#Evaluation)

- [Semantic analysis](#Semantic-analysis)

   - [Datasets](#Datasets)

   - [Word embeddings and senses](#Word-embeddings-and-senses)

   - [Sentiment analysis datasets / polarity clues](#sentiment-analysis-datasets--polarity-clues)

   - [Sentiment detection](#Sentiment-detection)

   - [GermEval](#GermEval)

   - [Coreference resolution](#Coreference-resolution)

   - [Summarization and Simplification](#Summarization-and-simplification)

   - [Psycholinguistics](#Psycholinguistics)

- [Speech NLP](#Speech-NLP)

- [Machine Translation](#Machine-Translation)

- [Large Language Models](#Large-language-models)

- [Teaching resources and tutorials](#Teaching-resources-and-tutorials)

- [More lists](#More-lists)

   - [German](#German)

   - [General](#General)

   - [Comparable lists](#Comparable-lists)

   - [Larger institutional GitHub groups](#Larger-institutional-GitHub-groups)

## Text corpora

### General-purpose

* [Araneum Germanicum](http://aranea.juls.savba.sk/aranea_about/_germanicum.html)

* [CEHugeWebCorpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2638)

* [COW](http://corporafromtheweb.org/category/corpora/german/)

* [Digitales Wörterbuch der deutschen Sprache (DWDS)](https://dwds.de)

* [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (CommonCrawl)

* [IDS Corpora](http://www1.ids-mannheim.de/kl/projekte/korpora)

* [Leipzig Corpora Collection](http://wortschatz.uni-leipzig.de/en/download/)

* [SdeWaC](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/sdewac.en.html)

### Historical

* [Anselm (14th-16th centuries)](https://www.linguistics.ruhr-uni-bochum.de/anselm/access/index.en.html)

* [Austrian Newspapers (19th C. NewsEye / READ OCR training dataset)](https://github.com/UB-Mannheim/AustrianNewspapers)

* [Deutsches Textarchiv](https://deutschestextarchiv.de/)

* [Elektronische Texte (Thomas Gloning)](http://www.staff.uni-giessen.de/gloning/etexte.htm)

* [GerManC (1650-1800)](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2544)

* [German Drama Corpus (GerDraCor)](https://github.com/dracor-org/gerdracor)

* [German Novels](https://github.com/computationalstylistics/68_german_novels)

* [German Poetry Corpus (DLK)](https://github.com/thomasnikolaushaider/DLK)

* [Lesekorpus Altdeutsch (750-1050)](http://titus.uni-frankfurt.de/lea)

* [LiederCorpus](https://github.com/corpusmusic/liederCorpusAnalysis)

* [Referenzkorpus Altdeutsch (750-1050)](http://www.deutschdiachrondigital.de/)

* [Referenzkorpus Mittelhochdeutsch (1050-1350)](https://www.linguistics.rub.de/rem/)

* [Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650)](https://corpora.uni-hamburg.de/hzsk/de/islandora/object/text-corpus:ren-0.6)

* [Referenzkorpus Frühneuhochdeutsch (1350-1650)](https://www.linguistics.rub.de/ref/)

* [Thesaurus Indogermanischer Text- und Sprachmaterialien (TITUS)](http://titus.uni-frankfurt.de/indexd.htm?/texte/texte.htm)

* [Transkriptionen von Fibeln (19. Jahrhundert)](https://github.com/UB-Mannheim/Fibeln)

### Specialized

* [AGB-DE](https://github.com/DaBr01/AGB-DE)

* [arg-microtexts](http://angcl.ling.uni-potsdam.de/resources/argmicro.html)

* [auto-hMDS (multi-document summarization)](https://github.com/AIPHES/auto-hMDS)

* [DFKI MobIE](https://github.com/DFKI-NLP/MobIE)

* [DIRNDL -- (D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/dirndl.en.html)

* [Dortmunder Chat Korpus](http://www.chatkorpus.tu-dortmund.de/)

* [Feidegger (Fashion Images and Descriptions)](https://github.com/zalandoresearch/feidegger)

* [Foodblog-Korpus](https://doi.org/10.5281/zenodo.1410445)

* [Fußballlinguistik](https://fussballlinguistik.de/korpora/)

* [German EUROPARL data w/ NE annotation](https://nlpado.de/~sebastian/software/ner_german.shtml)

* [German Job Reference Corpus](https://github.com/iug-htw/GJRC)

* [German Political Speeches Corpus](http://purl.org/corpus/german-speeches)

* [German Recipes Dataset](https://www.kaggle.com/sterby/german-recipes-dataset)

* [GermaParl (Bundestag)](https://github.com/PolMine/GermaParlTEI)

* [German Parliamentary Corpus (GerParCor)](https://github.com/texttechnologylab/GerParCor)

* [German Wikipedia Text Corpus](https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus)

* [GRAIN corpus -- (G)erman-(RA)dio-(IN)terviews](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/grain.html)

* [Legal Entity Recognition](https://github.com/elenanereiss/Legal-Entity-Recognition)

* [One Million Posts Corpus](https://ofai.github.io/million-post-corpus/)

* [Open Legal Data Corpus (German laws and court decisions)](http://openlegaldata.io/research/2019/02/19/court-decision-dataset.html)

* [Pegida Facebook Comments](http://0x0a.li/wp-content/uploads/2015/01/pegida_korpus.zip)

* [Potsdam Commentary Corpus (PCC)](http://angcl.ling.uni-potsdam.de/resources/pcc.html)

* [Songkorpus](http://songkorpus.de/)

* [Survey of Corpora for Germanic Low-Resource Languages and Dialects](https://github.com/mainlp/germanic-lrl-corpora)

* [TeCoPhy: A Text Corpus of German Physics Texts]{https://zenodo.org/records/8316079}

* [Ten Thousand German News Articles Dataset](https://github.com/tblock/10kGNAD)

* [TTLab StadtWiki Corpus](https://vlo.clarin.eu/?7&fq=collection:CEDIFOR.Corpus.StadtWikis&fqType=collection:or)

* [GSM-1k-de (translated german subset of the first 1000 items of GSM8K)](https://huggingface.co/datasets/D4ve-R/gsm-1k-de)

#### Swiss German

* [ArchiMob Corpus](https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html)

* [NOAH's Corpus: Part-of-Speech Tagging for Swiss German](https://noe-eva.github.io/NOAH-Corpus/)

* [SpinningBytes Swiss German Sentiment Corpus](https://github.com/spinningbytes/SB-CH)

* [Swiss SMS Corpus](http://www.sms4science.ch/Main/WebHome)

#### Learner and Error Corpora

* [C-WEP](http://lingured.info/linguistic-resources/cwep/)

* [DysList (list of dyslexic errors)](https://github.com/Rauschii/DysListGerman)

* [Falko](https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko)

* [Litkey](https://www.linguistics.ruhr-uni-bochum.de/litkeycorpus/)

* [OpinionSpam](https://github.com/hdaSprachtechnologie/OpinionSpam)

#### Word lists

* [Analogies in German Particle Verb Meaning Shifts](http://www.ims.uni-stuttgart.de/data/pv-meaning-shift)

* [Degree of Grammaticalization for German Prepositions](https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/gramm-prepositions.html)

* [DWDS lemma list](https://www.dwds.de/lemma/list)

* [DeReWo](http://www1.ids-mannheim.de/kl/projekte/methoden/derewo.html)

* [Diachronic Usage Relatedness (DURel)](http://www.ims.uni-stuttgart.de/data/durel)

* [DiMLex (lexicon of German discourse markers)](https://github.com/discourse-lab/dimlex)

* [German Compound Database](https://www.webcorpora.org/opendata/gecodb/)

* [German derivational lexicons](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/DErivBase.html)

* [German nouns from Wiktionary](https://github.com/gambolputty/german-nouns)

* [german_stopwords](https://github.com/solariz/german_stopwords)

* [German Wiktionary Lexicon Graph](https://vlo.clarin.eu/record?10&count=2&docId=21.11105_47_0000-000B-D244-B&fq=collection:CEDIFOR.Lexicon&fqType=collection:or&index=0)

* [German word list for GNU Aspell](https://sourceforge.net/projects/germandict/files/)

* [Metaphoric Change (annotated lexemes)](http://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/metaphoric_change.html)

* [Morphological Dictionaries (DEMorphy)](https://github.com/DuyguA/german-morph-dictionaries)

* [OpenThesaurus](https://www.openthesaurus.de/about/download)

* [Stopwords German (DE)](https://github.com/stopwords-iso/stopwords-de)

* [VulGer](https://github.com/ee-2/VulGer/)

* [wiktextract](https://github.com/tatuylonen/wiktextract)

* [wiktionary-de-parser](https://github.com/gambolputty/wiktionary-de-parser)

### Data acquisition

* [bundestag](https://github.com/bundestag)

* [bundestweets](https://github.com/michi-d/bundestweets)

* [DKPro C4Corpus](https://github.com/dkpro/dkpro-c4corpus)

* [german-reddit](https://github.com/adbar/german-reddit)

* [news-crawler](https://github.com/theSoenke/news-crawler)

* [news-please](https://github.com/fhamborg/news-please)

* [pattern](https://github.com/clips/pattern/wiki/pattern-de)

  * [patternlite](https://github.com/WZBSocialScienceCenter/patternlite)

* [scrape-gutenberg-de](https://github.com/jfilter/scrape-gutenberg-de)

* [SwigSpot Schwyzertuutsch-Spotting](https://github.com/derlin/SwigSpot_Schwyzertuutsch-Spotting)

* Twitter

  * [German April 2013 Twitter Corpus](https://github.com/TScheffler/GermanTwitterApril2013)

* [trafilatura](https://github.com/adbar/trafilatura)

### Lists of corpora

* [CLARIN-D list](https://www.clarin-d.net/en/corpora)

* [Corpora at the IMS](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/index.en.html)

* [CorpusExplorer's list of corpora](https://notes.jan-oliver-ruediger.de/korpora/)

* [Korpusarchiv (IDS Mannheim)](http://www1.ids-mannheim.de/kl/projekte/korpora/archiv.html)

* [Laudatio (Long-term Access and Usage of Deeply Annotated Information)](https://www.laudatio-repository.org/)

* [Parallel corpora (see below)](#parallel-corpora)

* [Treebanks (see below)](#treebanks)

* [ZAS list](http://www.zas.gwz-berlin.de/katalog00100.html?&L=1)

## Generic resources

### Frameworks

* [AmbiverseNLU](https://github.com/ambiverse-nlu/ambiverse-nlu)

* [CLARIN-D web tools](https://www.clarin-d.net/en/analysing)

* [CorpusExplorer](http://notes.jan-oliver-ruediger.de/software/corpusexplorer-overview/)

* [DKPro Core](https://dkpro.github.io/dkpro-core)

* [DKPro Similarity](https://dkpro.github.io/dkpro-similarity)

* [DKPro Text Classification (TC)](https://dkpro.github.io/dkpro-tc)

* [DKPro Word Sense Disambiguation (WSD)](https://dkpro.github.io/dkpro-wsd)

* [flair](https://github.com/zalandoresearch/flair)

* [FreeLing](http://nlp.lsi.upc.edu/freeling/)

* [ixa pipes](http://ixa2.si.ehu.es/ixa-pipes/)

* [Mate Tools](http://hdl.handle.net/11022/1007-0000-0000-8E4E-A), webservice via [WebLicht](https://weblicht.sfs.uni-tuebingen.de/)

* [NLP-Cube](https://github.com/adobe/NLP-Cube)

* [nlptasks](https://github.com/ulf1/nlptasks)

* [spaCy](https://github.com/explosion/spaCy)

* [Sparv](https://spraakbanken.gu.se/sparv/docs/)

* [Stanford CoreNLP](https://github.com/stanfordnlp/CoreNLP)

* [textblob-de](https://github.com/markuskiller/textblob-de)

* [TextImager](https://vlo.clarin.eu/record?4&count=1&docId=21.11105_47_0000-000B-CAE6-E&index=0&q=TextImager)

### Treebanks

* [German Universal Dependency Treebank](https://github.com/UniversalDependencies/UD_German-GSD/tree/master)/[UD German GSD](https://universaldependencies.org/treebanks/de_gsd/index.html)

* [Hamburg Dependency Treebank](https://corpora.uni-hamburg.de/hzsk/de/islandora/object/treebank:hdt)

* [NEGRA](http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/)

* [TIGER Corpus](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html)

   * [SALSA (role semantic annotation)](http://www.coli.uni-saarland.de/projects/salsa/corpus/)

   * [Tiger2Dep (dependency parses)](http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/Tiger2Dep.en.html)

* [TGermaCorp (literary texts)](https://vlo.clarin.eu/record?1&count=2&docId=21.11105_47_0000-000B-D4D9-1&index=0&q=TGermaCorp)

* [TüBa-D/Z](http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html)

## Deep learning models and transformers

* [LAION LeoLM Llama v2 German Foundation Language Model 7B Parameters](https://huggingface.co/LeoLM/leo-hessianai-7b)

* [LAION  LeoLM Llama v2 German Foundation Language Model 13B Parameters](https://huggingface.co/LeoLM/leo-hessianai-13b)

* [dbmdz BERT models](https://github.com/dbmdz/berts)

* [Deepset German BERT model](https://deepset.ai/german-bert)

* [Evaluating German Transformer Language Models with Syntactic Agreement Tests](https://github.com/DFKI-NLP/gevalm)

* [German ELMo Model](https://github.com/t-systems-on-site-services-gmbh/german-elmo-model)

* [german-transformer-training](https://github.com/PhilipMay/german-transformer-training)

* [GermLM](https://github.com/tonianelope/Multilingual-BERT) (NER exploration)

* [GerPT2](https://github.com/bminixhofer/gerpt2)

* [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)

### Annotation

* [cora](https://github.com/comphist/cora)

* [CorefAnnotator](https://www.ims.uni-stuttgart.de/en/research/resources/tools/corefannotator/)

* [corpus-tools.org (HU Berlin)](http://corpus-tools.org/home/)

* [INCEpTION](https://inception-project.github.io/)

* [satzify](https://github.com/michdr/satzify)

* [TreeAnno](https://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/treeanno/)

* [WebAnno](https://webanno.github.io/webanno/)

### Standards

* [DTA Basisformat](http://www.deutschestextarchiv.de/doku/basisformat/)

* [ISO TC 37 SC 4](https://www.iso.org/committee/297592.html)

* [UIMA](http://docs.oasis-open.org/uima/v1.0/os/uima-spec-os.html)

* [UIMA CAS XMI](https://uima.apache.org/d/uimaj-current/references.html#ugr.ref.xmi)

## Linguistic processing

### Preprocessing

* [clean-text](https://github.com/jfilter/clean-text)

* [german-preprocessing](https://github.com/jfilter/german-preprocessing)

* [german_transliterate](https://github.com/repodiac/german_transliterate)

### Tokenization / Sentence boundary detection

* [Cutter](https://pub.cl.uzh.ch/wiki/public/cutter/start)

* [Datok](https://github.com/korap/datok)

* [deep-eos](https://github.com/dbmdz/deep-eos) (sentence boundary detection only)

* [FullStop](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction) (sentence boundary detection only)

* [JTok](https://github.com/DFKI-MLT/JTok)

* [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer)

* [nnsplit](https://github.com/bminixhofer/nnsplit) (sentence boundary detection only)

* [SoMaJo](https://github.com/tsproisl/SoMaJo)

* [syntok](https://github.com/fnl/syntok)

* [waste](http://kaskade.dwds.de/waste/)

* [german-abbreviations](https://github.com/jfilter/german-abbreviations) (resource)

### Stemming

* [CISTEM](https://github.com/LeonieWeissweiler/CISTEM)

* [german-go-stemmer](https://github.com/antonbaumann/german-go-stemmer)

* [Snowball](http://snowballstem.org)

### Lemmatization

* [cstlemma](https://github.com/kuhumcst/cstlemma)

* [germalemma](https://github.com/WZBSocialScienceCenter/germalemma)

* [GermaLemma++ (ensemble)](https://github.com/rubcompling/germalemmaplusplus)

* [german-lemmatizer](https://github.com/jfilter/german-lemmatizer)

* [HanTa](https://github.com/wartaal/HanTa)

* [IWNLP](https://github.com/Liebeck/IWNLP)

   * [spacy-iwnlp](https://github.com/Liebeck/spacy-iwnlp)

* [LemmaTag](https://github.com/Hyperparticle/LemmaTag)

* [simplemma](https://github.com/adbar/simplemma)

### Morphological analysis

* [CharSplit](https://github.com/dtuggener/CharSplit)

* [DEMorphy](https://github.com/DuyguA/DEMorphy)

* [dehyphen](https://github.com/jfilter/dehyphen)

* [deep-german](https://github.com/aakhundov/deep-german) (classification of nouns by genders)

* [Durm Lemmatizer](http://www.semanticsoftware.info/durm-german-lemmatizer)

* [german_compound_splitter](https://github.com/repodiac/german_compound_splitter)

* [GermanNumerus](https://github.com/ulrischa/GermanNumerus)

* [HypheNN-de](https://github.com/msiemens/HypheNN-de)

* [jwordsplitter](https://github.com/danielnaber/jwordsplitter)

* [lang-deu](https://github.com/giellalt/lang-deu)

* [Low German morphology and tools](https://github.com/giellalt/lang-nds)

* [MarMoT](http://cistern.cis.lmu.de/marmot/)

* [MOP Compound Splitter](https://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/mcs/)

* [Morphy](http://morphy.wolfganglezius.de/)

* [morphisto](https://code.google.com/archive/p/morphisto/)

* [nnsplit](https://github.com/bminixhofer/nnsplit)

* [SECOS (unsupervised compound splitter)](https://github.com/riedlma/SECOS)

* [SFST](http://www.cis.uni-muenchen.de/~schmid/tools/SFST/)

* [SMOR](http://www.cis.uni-muenchen.de/~schmid/tools/SMOR/), webservice via [WebLicht](https://weblicht.sfs.uni-tuebingen.de/)

* [timur](https://github.com/wrznr/timur)

* [zmorge](https://github.com/rsennrich/zmorge)

### Normalization

* [CAB](http://www.deutschestextarchiv.de/cab)

* [dehyphen](https://github.com/pd3f/dehyphen)

* [norma](https://github.com/comphist/norma)

* [transnormer](https://github.com/ybracke/transnormer)

### Phonology

* [gramophone](https://github.com/wrznr/gramophone)

### POS-tagging

* [clevertagger](https://github.com/rsennrich/clevertagger)

* [HanTa](https://github.com/wartaal/HanTa/)

* [hunpos](https://github.com/mivoq/hunpos)

* [LemmaTag](https://github.com/Hyperparticle/LemmaTag)

* [moot](http://kaskade.dwds.de/~jurish/projects/moot)

* [pattern.de](https://www.clips.uantwerpen.be/pages/pattern-de)

* [RFTagger](http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/), webservice via [WebLicht](https://weblicht.sfs.uni-tuebingen.de/)

  * [Java interface](http://sifnos.sfs.uni-tuebingen.de/resource/A4/rftj/)

* [RNNTagger](https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/)

* [SoMeWeTa](https://github.com/tsproisl/SoMeWeTa)

* [TnT](http://www.coli.uni-saarland.de/~thorsten/tnt/)

* [TreeTagger (including models)](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)

  * [Demo for middle high German](http://clarin05.ims.uni-stuttgart.de/mhdtt/index.html)

### Syntactical parsing

* [Berkeley Parser](https://github.com/slavpetrov/berkeleyparser)

* [BitPar](http://www.cis.uni-muenchen.de/~schmid/tools/BitPar/), webservice via [WebLicht](https://weblicht.sfs.uni-tuebingen.de/)

* [CDG](https://nats-www.informatik.uni-hamburg.de/CDG/DownloadPage)

   * [jwcdg](https://gitlab.com/nats/jwcdg)

* [IMSTrans (dependency parser)](http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/imstrans.en.html)

* [ParZu](https://github.com/rsennrich/parzu)

* [Stanford Parser](https://nlp.stanford.edu/software/lex-parser.shtml)

* [STEPS Parser](https://github.com/boschresearch/steps-parser)

### Named Entity Recognition

* [AmbiverseNLU KnowNER](https://github.com/ambiverse-nlu/ambiverse-nlu)

* [flair](https://github.com/zalandoresearch/flair)

* [GermaNER](https://github.com/tudarmstadt-lt/GermaNER)

* [GERNERMED](https://github.com/frankkramer-lab/GERNERMED)

* [historic-ner](https://github.com/dbmdz/historic-ner)

* [LSTM+CRF+FastText with models for (historic) German](https://github.com/riedlma/sequence_tagging)

* [microNER](https://uhh-lt.github.io/microNER/)

* [Named Entity Recognition (LSTM + CRF + FastText) with models for [historic] German](https://github.com/riedlma/sequence_tagging)

* [ner-corpora](https://github.com/EuropeanaNewspapers/ner-corpora)

* [NER-datasets](https://github.com/davidsbatista/NER-datasets)

* [(Faruqui & Pado 2010) Components and evaluation data](https://nlpado.de/~sebastian/software/ner_german.shtml)

* [Towards Robust Named Entity Recognition for Historic German](https://github.com/dbmdz/historic-ner)

### Misc

* [German Preprocessing](https://github.com/jfilter/german-preprocessing)

* [ICARUS (query tool)](http://hdl.handle.net/11022/1007-0000-0000-8E56-0)

   * [ICARUS2](http://hdl.handle.net/11022/1007-0000-0007-C635-E)

* [nlprule (error correction)](https://github.com/bminixhofer/nlprule)

### Text generation

* [fake_text](https://github.com/fhswf/fake_text)

* [ngen](https://github.com/Fedjmike/ngen)

* [pypolibox](https://github.com/arne-cl/pypolibox)

### Industry/Applications

* [German Decompounder for Apache Lucene / Apache Solr / Elasticsearch](https://github.com/uschindler/german-decompounder)

* [holmes-extractor](https://github.com/msg-systems/holmes-extractor)

* [LanguageTool](https://languagetool.org)

* [Plenum First Said](https://github.com/ungeschneuer/plenum_first_said)

### Evaluation

* [DKPro Statistics](https://dkpro.github.io/dkpro-statistics)

* [Evaluating Off-the-Shelf NLP Tools for German](https://github.com/rubcompling/konvens2019)

* [Evaluation of different NLP toolkits](https://github.com/goerlitz/nlp-german)

## Semantic analysis

### Datasets

* [Complex Word Identification (DE, EN, ES, FR)](https://sites.google.com/view/cwisharedtask2018/home)

* Distributional memories: [DM.de](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/dm-de.html) [TransDM.de](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/transdmde.html)

* [Distributional thesauri (includes German)](https://sourceforge.net/projects/jobimtext/files/data/models/)

* [Downloads page](https://sites.google.com/site/iggsahome/home) of the Interest Group on German Sentiment Analysis

* [Lexical Chains](https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/lexical-chains.html)

* [Logical metonymy database](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/GLMD.html)

* [schulteimwalde.de/resources.html](http://www.schulteimwalde.de/resources.html)

* [Semantic Relations in Context](https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/semreldata.html)

* [UKP Darmstadt data list](https://www.informatik.tu-darmstadt.de/ukp/research_6/data/index.en.jsp)

### Word embeddings and senses

* [disco (semantic similarity)](https://github.com/linguatools/disco)

* [GermaNet](http://www.sfs.uni-tuebingen.de/GermaNet/)

   * [pygermanet](https://github.com/wroberts/pygermanet)

* [german2vec](https://github.com/Bachfischer/german2vec)

* [GermanWordEmbeddings](https://github.com/devmount/GermanWordEmbeddings)

* [German ELMO model](https://github.com/t-systems-on-site-services-gmbh/german-elmo-model)

* [Open German WordNet](https://github.com/hdaSprachtechnologie/odenet)

* [sensegram](https://github.com/tudarmstadt-lt/sensegram)

* [SpinningBytes word embeddings (tweets)](https://www.spinningbytes.com/resources/wordembeddings/)

* [UBY Linked Lexical Resource](https://dkpro.github.io/dkpro-uby/)

* [WECHSEL (subword embeddings)](https://github.com/CPJKU/wechsel)

### Sentiment analysis datasets / polarity clues

* [Affective norms: abstractness, arousal, imageability and valence ratings](http://www.ims.uni-stuttgart.de/data/affective_norms)

* [German Sentiment Classification Model for Dialog Systems](https://github.com/oliverguhr/german-sentiment)

* [GermanPolarityClues](http://www.ulliwaltinger.de/sentiment/)

* [HeiST – Heidelberg Sentiment Treebank](http://www.cl.uni-heidelberg.de/~versley/HeiST/)

* [(Non-)Literalness Ratings for complex verbs](http://www.ims.uni-stuttgart.de/data/pv_nonlit )

* [Potsdam Twitter Sentiment Corpus (PotTS)](https://github.com/WladimirSidorenko/PotTS)

* [Sentiment dictionary for German political language](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BKBXWD)

* [Sentiment Lexicon (Univ. Zurich)](http://bics.sentimental.li/files/8614/2462/8150/german.lex)

* [SentimentWortschatz](http://wortschatz.uni-leipzig.de/en/download/)

* [SpinningBytes Swiss German Sentiment Corpus](https://github.com/spinningbytes/SB-CH)

### Sentiment detection

* [3x8emotions](https://github.com/tweedmann/3x8emotions)

* [EmotiKLUE](https://github.com/tsproisl/EmotiKLUE)

* [germansentiment: A simple python package for sentiment classification](https://github.com/oliverguhr/german-sentiment-lib)

* [LT-ABSA: Aspect-based Sentiment Analysis](https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/software/lt-absa.html)

* [sentiment-analyser](https://github.com/syzer/sentiment-analyser)

* [spacy-sentiws](https://github.com/Liebeck/spacy-sentiws)

### GermEval

*(category to improve)*

* [Official GermEval tools list](https://projects.fzai.h-da.de/iggsa/resources-tools-and-literature/)

* [GermEval 2015 data (Lexical Substitution)](https://sites.google.com/site/germeval2015/)

* [Germeval Task 2017](https://sites.google.com/view/germeval2017-absa/home)

* [GermEval-2018 data](https://github.com/uds-lsv/GermEval-2018-Data)

* [germeval-rug](https://github.com/malvinanissim/germeval-rug)

* [IWG_hatespeech_public](https://github.com/UCSM-DUE/IWG_hatespeech_public)

* [jpadillamontani/germeval2018](https://github.com/jpadillamontani/germeval2018)

* [uhh-lt/GermEval2017-Baseline](https://github.com/uhh-lt/GermEval2017-Baseline)

* [UKP embeddings for GermEval 2017](https://github.com/UKPLab/germeval2017-sentiment-detection)

### Discourse

* [Bilingual formality (T/V) corpus (EN/DE)](https://nlpado.de/~sebastian/data/tv_data.shtml)

* [Bilingual FrameNet frame embeddings (EN/DE)](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/XLFrameEmbed.html)

* [Bilingual parallel frame-semantic annotation (EN/DE)](https://nlpado.de/~sebastian/data/srl_data.shtml)

* [Coreferee](https://github.com/msg-systems/coreferee)

* [CorZu (coreference resolution)](https://github.com/dtuggener/CorZu)

* [Discourse Segmenter](https://github.com/WladimirSidorenko/DiscourseSegmenter)

* [Frame Identification](https://github.com/UKPLab/naacl18-multimodal-frame-identification)

* [German social media textual entailment dataset](https://www.cl.uni-heidelberg.de/~zeller/res/te-ger/index.mhtml)

* [HotCoref DE (coreference resolution)](https://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/HotCorefDe)

* [PropS-DE (proposition structures)](https://github.com/UKPLab/props-de)

* [Tense-mood-voice annotation system](https://github.com/aniramm/tmv-annotator)

### Summarization and Simplification

* [DEPlain](https://github.com/rstodden/DEPlain)

* [Klexikon](https://github.com/dennlinger/klexikon) (Joint Summarization and Simplification)

* [Tools and corpora for summarization of German texts](https://github.com/AIPHES)

### Psycholinguistics

* [Noun Associations for German](http://www.psycholing.es.uni-tuebingen.de/nag/index.php)

## Speech NLP

* [Archiv für gesprochenes Deutsch](http://agd.ids-mannheim.de/korpus_index.shtml)

* [BAS ressources](http://www.bas.uni-muenchen.de/Bas/BasSpeechresourceseng.html)

* [Bochumer Korpus der gesprochenen Sprache im Ruhrgebiet](https://www.ruhr-uni-bochum.de/kgsr/)

* [Database for Spoken German (IDS Mannheim)](https://dgd.ids-mannheim.de/dgd/pragdb.dgd_extern.welcome)

* [deepspeech-german](https://github.com/AASHISHAG/deepspeech-german)

* [(D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis ](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/dirndl.en.html)

* [Hamburger Zentrum für Sprachkorpora](https://corpora.uni-hamburg.de/hzsk/)

* [kaldi-tuda-de](https://github.com/uhh-lt/kaldi-tuda-de)

* [Open Speech Data Corpus](http://voxforge.org/home/forums/other-languages/german/open-speech-data-corpus-for-german)

* [Thorsten (Emotional) - Open German Voice Dataset](https://github.com/thorstenMueller/deep-learning-german-tts/#emotional-dataset-information-and-samples-microphone)

* [Thorsten (Neutral) - Open German Voice Dataset](https://github.com/thorstenMueller/deep-learning-german-tts/#samples-of-my-neutral-voice)

## Machine Translation

*(category to improve)*

* [Tensorflow NMT DE-EN](https://github.com/tensorflow/nmt)

   * [NMT English to German](https://github.com/thomasschmied/Neural_Machine_Translation_Tensorflow)

* [Unsupervised Word Segmentation for NMT](https://github.com/rsennrich/subword-nmt)

#### Parallel corpora

* [Linguatools Webcrawl German-English 2015](http://linguatools.org/tools/corpora/webcrawl-parallel-corpus-german-english-2015/)

* [MuchMore Springer Bilingual Corpus](http://muchmore.dfki.de/resources1.htm)

* [OPUS collection](http://opus.nlpl.eu/)

## Large Language Models

* [EM_German](https://github.com/jphme/EM_German)

* [German Alpaca Dataset](https://github.com/LEL-A/GerAlpacaDataCleaned)

* [German Benchmark Datasets](https://github.com/bjoernpl/GermanBenchmark)

* [German Language Models](https://github.com/malteos/german-language-models)

* [GermanRAG](https://github.com/rasdani/germanrag)

* [German Text Embedding Clustering Benchmark](https://github.com/ClimSocAna/tecb-de)

* [Swiss German Text Encoders](https://github.com/ZurichNLP/swiss-german-text-encoders)

* [Vox Populi, Vox AI](https://github.com/leahvdh/Vox-Populi-Vox-AI)

## Teaching resources and tutorials

* [bubenhofer.com/korpuslinguistik/kurs/](http://www.bubenhofer.com/korpuslinguistik/kurs/)

* [CorpusExplorer v2.0 – Seminartauglich in einem halben Tag](https://lernen-mit.jan-oliver-ruediger.de/)

* [deeplearning4nlp-tutorial](https://github.com/UKPLab/deeplearning4nlp-tutorial)

* [deutsch-nlp (text classification)](https://github.com/taylorhawks/deutsch-nlp)

* [German Text Classification Tutorial Series](https://github.com/realjanpaulus/german_text_classification_nlp)

* [Statistics for linguists (S. Vasishth)](https://github.com/vasishth/Statistics-lecture-notes-Potsdam)

* [Stilometrie](https://github.com/realjanpaulus/stylometry)

* Uni Zürich: Sprachtechnologie in den Digital Humanities – MOOC [Youtube](https://www.youtube.com/channel/UChb3Rd5vo3WEgMSy99VInaw) & [Coursera](http://www.coursera.org/learn/digital-humanities)

## More lists

### German

* [CLARIN VLO (DE+public)](https://vlo.clarin.eu/search?2&fq=languageCode:code:deu&fq=licenseType:PUB)

* [computerlinguistik.org](http://www.computerlinguistik.org/portal/portal.html?s=Ressourcen)

* [Learn German as a foreign language](https://github.com/willianpaixao/awesome-german)

* [LRE Map](http://lremap.elra.info/?&selected_facets=languageFilter_exact%3AGerman)

* [MetaShare Language Resources](http://metashare.ilsp.gr:8080/repository/search/?q=&selected_facets=languageNameFilter_exact%3AGerman)

* [Peter Kolb's list](http://www.ling.uni-potsdam.de/~kolb/nlp-tools.html)

* [Swiss German Language Processing](http://kitt.cl.uzh.ch/kitt/noah/resources)

### General

* GitHub topics [corpus-linguistics](https://github.com/topics/corpus-linguistics) & [nlp](https://github.com/topics/nlp)

* [nlp-datasets](https://github.com/niderhoff/nlp-datasets)

* [NLP-progress](https://github.com/sebastianruder/NLP-progress)

* [/r/LanguageTechnology/](https://www.reddit.com/r/LanguageTechnology/)

### Comparable lists

* [awesome-nlp](https://github.com/keon/awesome-nlp)

* [Awesome Community-Curated NLP List](https://github.com/alvations/awesome-community-curated-nlp)

* [awesome-chinese-nlp](https://github.com/crownpku/Awesome-Chinese-NLP)

* [awesome-danish](https://github.com/fnielsen/awesome-danish)

* [awesome-hungarian-nlp](https://github.com/oroszgy/awesome-hungarian-nlp)

* [awesome Information Retrieval](https://github.com/harpribot/awesome-information-retrieval)

* [Indonesian NLP](https://github.com/kmkurn/id-nlp-resource)

* [Norwegian NLP resources](https://github.com/web64/norwegian-nlp-resources)

* [awesome-nlp-polish](https://github.com/ksopyla/awesome-nlp-polish)

* [awesome-spanish-nlp](https://github.com/dav009/awesome-spanish-nlp)

* [NLP-Pandect](https://github.com/ivan-bilan/The-NLP-Pandect)

* [M. Weisser's list of NLP/Computational Linguistics Resources](http://martinweisser.org/corpora_site/comp_ling_resources.html)

* [NLP tools (Saarland University)](http://www.coli.uni-saarland.de/~csporled/page.php?id=tools)

* [W. Roberts' Computational Linguistics Links](http://amor.cms.hu-berlin.de/~robertsw/links.html)

### Larger institutional GitHub groups

* [DFKI-NLP](https://github.com/DFKI-NLP)

* [Language Technology Group, Universität Hamburg](https://github.com/uhh-lt)

* [Saarland University Spoken Language Systems Group](https://github.com/uds-lsv)

* [Ubiquitous Knowledge Processing Lab, TU Darmstadt](https://github.com/ukplab)

* [Webis](https://github.com/webis-de)

## Contributors

See the [list of contributors](https://github.com/adbar/German-NLP/graphs/contributors).

## License

[![CC-BY](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by.svg)](https://creativecommons.org/licenses/by/4.0/)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/adbar/German-NLP

Awesome Lists containing this project

README