Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-danish

A curated list of awesome resources for Danish language technology
https://github.com/fnielsen/awesome-danish

Last synced: 4 days ago

Data
- Corpora
  - Danish review dataset - Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews.
  - SemDaX - POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
  - XED - emotion annotated movie subtitles. Described in *[XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection](https://www.aclweb.org/anthology/2020.coling-main.575.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q105978198)).
  - DaN+ - annotated for nested named entities on top of the entire Danish Universal Dependencies (UD_Danish-DDT) plus 3 new web domains; also includes lexical normalization. Described in *[DaN+: Danish Nested Named Entities and Lexical Normalization](https://www.aclweb.org/anthology/2020.coling-main.583/)*.
  - Corona Dataset - Question dataset from Certainly annotated for domain and intent.
  - wiki40b/da - Cleaned-up text from Danish Wikipedia. Described in *[Wiki-40B: Multilingual Language Model Dataset](https://www.aclweb.org/anthology/2020.lrec-1.297/)* ([Scholia](https://scholia.toolforge.org/work/Q105430726)).
  - DKhate - corpus of 3,600 comments annotated for hate speech, from Twitter and Reddit as well as news comments. Described in *[Offensive Language and Hate Speech Detection for Danish](https://www.aclweb.org/anthology/2020.lrec-1.430)* ([Scholia](https://scholia.toolforge.org/work/Q66592400)).
  - DaNewsroom: A Large-scale Danish Summarisation Dataset
  - WikiANN - Named entity annotated corpus. Described in *[Cross-lingual Name Tagging and Linking for 282 Languages](https://www.aclweb.org/anthology/P17-1178.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q54488065))
  - Danish Gigaword - Collection of about 10^9 (one billion) words of Danish text. Described in *[The Danish Gigaword Corpus](https://arxiv.org/abs/2005.03521)* ([Scholia](https://scholia.toolforge.org/work/Q107060118))
  - OSCAR - Danish corpus derived from the Common Crawl corpus. Described in *[Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures](https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/9021/file/Suarez_Sagot_Romary_Asynchronous_Pipeline_for_Processing_Huge_Corpora_2019.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q85398180))
  - The Danish Parliament Corpus 2009 - 2017, v1 - Attribution 4.0 International
  - Grundtvig's Works Corpus - Attribution-NonCommercial 4.0 International.
  - DK-CLARIN Reference Corpus of General Danish
  - DanFEVER - Danish text corpus with over 6,400 claims and support. Described in *[DanFEVER: claim verification dataset for Danish](http://www.ep.liu.se/ecp/178/047/ecp2021178047.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q107060524))
  - DanNet - wordnet with usage examples. The usage examples have been used for word sense disambiguation, see *[XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation](http://wwwusers.di.uniroma1.it/~navigli/pubs/AAAI_2021_Pasinietal.pdf)*.
  - NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. ([Scholia](https://scholia.toolforge.org/work/Q57730960))
  - Danish Propbank - commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
  - Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
  - Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retellings (of Mr. Bean episodes) and argumentative text (about smoking). Possibly described in *Tekststrukturering på italiensk og dansk*.
  - Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
  - Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
- Parallel corpora
  - Europarl - parallel sentences between Danish and English from the European Parliament.
  - ITU Faroese Pairs Dataset - Faroese-Danish parallel text. Described in *[The ITU Faroese Pairs Dataset](https://arxiv.org/pdf/2206.08727.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q112673011))
  - JW300 - "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
  - OpenSubtitles2018 - Parallel corpus from movie and TV subtitles. Described in *[OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles](http://www.lrec-conf.org/proceedings/lrec2016/pdf/947_Paper.pdf)*.
  - Tatoeba - Sentences
  - WikiMatrix
- Spoken language corpora
  - DanPASS - Described in *DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus* ([Scholia](https://scholia.toolforge.org/work/Q71038312))
  - LANCHART - Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., *[The data and design of the LANCHART study](https://www.tandfonline.com/doi/pdf/10.1080/03740460903364003)* ([Scholia](https://scholia.toolforge.org/work/Q97756317)).
  - Common Voice - Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at https://commonvoice.mozilla.org/sentence-collector. Common Voice is described in *Common Voice: A Massively-Multilingual Speech Corpus* ([Scholia](https://scholia.toolforge.org/work/Q79020060)).
  - FT Speech - Described in *FT SPEECH : Danish Parliament Speech Corpus* ([Scholia](https://scholia.toolforge.org/work/Q98841513)).
  - NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
  - NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
  - NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
  - Wikimedia Commons Audio files of Danish language - Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works.
  - CoRal - Danish Conversational and Read-aloud Dataset
- Dictionaries and ontologies
  - Det Centrale Ordregister - identifier for words and their inflections with 516,017 forms (COR).
  - NST-lexical-database
  - DanNet, Danish Wordnet (v 2.2) - owl format - Danish wordnet with three-clause BSD-like license.
  - Excerpt
  - Opslagsord og ordklasser
  - Stavekontrolden - word list with 160,132 Danish words. Used, e.g., for spelling suggestions in LibreOffice. Licensed under GPL, LGPL, and MPL.
  - The Concise Danish Dictionary
  - Interactive Terminology for Europe - European Union terminology database. October 2020 version contains over 500,000 Danish terms.
  - The Danish FrameNet Lexicon
  - 1,290,000 lexemes
  - Overview over Danish lexemes in Ordia - webapp with overview of content of Wikidata lexemes based on SPARQL queries.
  - AFINN - Danish lexicons annotated for sentiment.
  - concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in *[Automatic Estimation of Lexical Concreteness in 77 Languages](http://pubman.mpdl.mpg.de/pubman/item/escidoc:2622741/component/escidoc:2622739/Thompson_Lupyan_2018.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q56219750)).
  - SAM lexicon - sentiment analysis word list extended from AFINN to 4275 lines. Described in *[Sentiment Analysis Multitool, SAM](https://raw.githubusercontent.com/lucaspuvis/SAM/master/Thesis.pdf)*.
  - Wikidata lexemes latest lexemes dump in ttl - official dump of lexeme-only part of Wikidata.
  - NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
  - Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
- Word sets
  - Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in [danlp](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#wordsim-353).
  - Four words - 100 odd-one-out sets of 4 words or phrases.
- Neural text models
  - Ælæctra - Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
  - Multilingual sentence transformers - Pre-trained multilingual sentence transformers.
  - Danish ELECTRA - Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the Transformers library.
  - daT5-summariser - Danish abstractive summarisation of news articles based on mT5-base.
  - ConvBERT - Philip Tamimi-Sarnikowski's model
  - Danish ELMo on OSCAR - (Link does not work as of December 2020)
  - mfaq - Multilingual FAQ retrieval model. Described in *[MFAQ: a Multilingual FAQ Dataset](https://arxiv.org/pdf/2109.12870)* ([Scholia](https://scholia.toolforge.org/work/Q114963815))
  - wiki40b-lm-da - language model trained on Danish from Wiki40B dataset
- Embeddings
  - cc.da.300 - fastText-trained embedding on Danish part of *Common Crawl* and Danish Wikipedia. Read more about the method in *[Learning Word Vectors for 157 Languages](https://arxiv.org/pdf/1802.06893)* ([Scholia](https://scholia.toolforge.org/work/Q49985142)).
  - wiki.da - fastText-trained embedding on Danish Wikipedia. Read more about the method in *[Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606)* ([Scholia](https://scholia.toolforge.org/work/Q28775150)).
  - NLPL word embeddings repository - Maintained by the Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020.
  - Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
  - Danish DSL and Reddit word2vec word embeddings - 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit.
- Neural speech models
  - Hugging Face - List of models for Danish automatic speech recognition.
  - Alvenir Wav2vec2 - Pretrained Danish neural model.
  - Whisper - Multilingual neural model from OpenAI.
  - xls-r-300m-danish-nst-cv9 - Pretrained Danish neural model.
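
The Danske taler entry under Corpora gives an API endpoint. The following is a minimal sketch of how its records might be read, assuming the endpoint behaves like a standard WordPress REST collection: `page` and `per_page` query parameters, and records carrying `id`, `link` and `title.rendered`. Those field names are an assumption about this particular site, and the helper names (`page_url`, `parse_speeches`, `fetch_speeches`) are hypothetical.

```python
import json
import re
from html import unescape
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the Corpora list.
API_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"


def page_url(page=1, per_page=10):
    """Build a paginated request URL (standard WordPress REST parameters)."""
    return API_URL + "?" + urlencode({"page": page, "per_page": per_page})


def strip_html(text):
    """Remove tags, then decode HTML entities, from a rendered field."""
    return unescape(re.sub(r"<[^>]+>", "", text)).strip()


def parse_speeches(payload):
    """Extract id, title and link from a JSON array of speech records.

    Assumption: records follow the usual WordPress post shape. Missing
    fields become None (id, link) or an empty string (title).
    """
    speeches = []
    for record in json.loads(payload):
        title = record.get("title", {}).get("rendered", "")
        speeches.append(
            {
                "id": record.get("id"),
                "title": strip_html(title),
                "link": record.get("link"),
            }
        )
    return speeches


def fetch_speeches(page=1, per_page=10):
    """Download and parse one page of speeches (requires network access)."""
    with urlopen(page_url(page, per_page)) as response:
        return parse_speeches(response.read().decode("utf-8"))
```

Parsing is kept separate from downloading so that `parse_speeches` can be checked against a saved response before any live requests are made.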
Tools
- Named entity recognition
  - ScandiNER - Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese.
  - flair+danlp ner-tagger - Flair NER tagger trained by the Alexandra Institute.
  - Polyglot named entity extraction
- Entity linking
  - Babelfy - Web app and service for linking words and entities.
  - DBpedia Spotlight - DBpedia-based entity linker. Described in *Improving Efficiency and Accuracy in Multilingual Entity Extraction* ([Scholia](https://scholia.toolforge.org/work/Q106526263))
- Automatic Speech Recognition
  - kaldi-sprakbanken - A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database.
- Speech Synthesis (text-to-speech)
  - Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at [TTSMP3](https://ttsmp3.com/text-to-speech/Danish/).
  - ResponsiveVoice - Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
- Fundamental processing
  - DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
  - spaCy - Python-based natural language processing package
Competitions
- Fundamental processing
  - ELEXIS Monolingual Word Sense Alignment Task - Predicting the relationship between two senses in each of several languages, including Danish.
  - OffensEval 2020 - Danish - Offensive Language Identification in Social Media competition. Described in [Offensive Language and Hate Speech Detection for Danish](https://arxiv.org/pdf/1908.04531.pdf) ([Scholia](https://scholia.toolforge.org/work/Q66592400))
Resources about resources
- Fundamental processing
  - Language Technology Resources for Danish - og Litteraturselskab
  - European Language Resources Association (ELRA) list for *Danish* - commercial licenses.
  - sprogteknologi.dk - List of Danish language resources. Compiled by the Agency for Digitisation.
  - Danish resources - Finn Årup Nielsen's PDF with pointers to Danish resources.
  - Scholia's topic aspect for *Danish*
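
Sentiment word lists such as AFINN and the SAM lexicon, listed under Dictionaries and ontologies, are typically applied by summing per-word scores over a text. The following is a minimal sketch of that idea, assuming a tab-separated `word<TAB>score` file layout. The sample entries and their scores are invented for illustration and are not taken from either lexicon, and the function names are hypothetical.

```python
import re

# Illustrative entries only: these scores are made up for this sketch and
# are NOT taken from the real AFINN or SAM word lists.
SAMPLE_LEXICON = "god\t3\ndårlig\t-3\nelsker\t3\nkedelig\t-2\n"


def load_lexicon(text):
    """Parse tab-separated 'word<TAB>score' lines into a dict."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, score = line.rsplit("\t", 1)
        lexicon[word] = int(score)
    return lexicon


def tokenize(text):
    """Lowercase and split on non-word characters (æ, ø and å are kept)."""
    return re.findall(r"\w+", text.lower())


def sentiment_score(text, lexicon):
    """Sum the scores of all tokens found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


lexicon = load_lexicon(SAMPLE_LEXICON)
print(sentiment_score("Filmen var god, men slutningen var kedelig.", lexicon))
```

This sketch matches single tokens only, so multi-word lexicon entries and negation ("ikke god") are not handled.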
