Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-danish
A curated list of awesome resources for Danish language technology
https://github.com/fnielsen/awesome-danish
-
Data
-
Corpora
- Danish review dataset - Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews.
- SemDaX - POS-tagged (adjectives, nouns and verbs only), supersense-tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
- XED - emotion annotated movie subtitles. Described in *[XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection](https://www.aclweb.org/anthology/2020.coling-main.575.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q105978198)).
- DaN+ - annotated for nested named entities on top of the entire Danish Universal Dependencies (UD_Danish-DDT) and 3 new web domains and includes lexical normalization. Described in *[DaN+: Danish Nested Named Entities and Lexical Normalization](https://www.aclweb.org/anthology/2020.coling-main.583/)*
- Corona Dataset - Question dataset from Certainly annotated for domain and intent.
- wiki40b/da - Cleaned-up text from Danish Wikipedia. Described in *[Wiki-40B: Multilingual Language Model Dataset](https://www.aclweb.org/anthology/2020.lrec-1.297/)* ([Scholia](https://scholia.toolforge.org/work/Q105430726)).
- DKhate - corpus of 3,600 hate speech items from Twitter and Reddit as well as news comments. Described in *[Offensive Language and Hate Speech Detection for Danish](https://www.aclweb.org/anthology/2020.lrec-1.430)* ([Scholia](https://scholia.toolforge.org/work/Q66592400))
- DaNewsroom: A Large-scale Danish Summarisation Dataset
- WikiANN - Named entity annotated corpus. Described in *[Cross-lingual Name Tagging and Linking for 282 Languages](https://www.aclweb.org/anthology/P17-1178.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q54488065))
- Danish Gigaword - Collection of about 10^9 (one billion) words of Danish text. Described in *[The Danish Gigaword Corpus](https://arxiv.org/abs/2005.03521)* ([Scholia](https://scholia.toolforge.org/work/Q107060118))
- OSCAR - Danish corpus derived from the Common Crawl corpus. Described in *[Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures](https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/9021/file/Suarez_Sagot_Romary_Asynchronous_Pipeline_for_Processing_Huge_Corpora_2019.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q85398180))
- The Danish Parliament Corpus 2009 - 2017, v1 - Attribution 4.0 International
- Grundtvig's Works Corpus - Attribution-NonCommercial 4.0 International.
- DK-CLARIN Reference Corpus of General Danish
- DanFEVER - Danish text corpus with over 6,400 claims and supporting evidence. Described in *[DanFEVER: claim verification dataset for Danish](http://www.ep.liu.se/ecp/178/047/ecp2021178047.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q107060524))
- DanNet - wordnet with usage examples. The usage examples have been used for word sense disambiguation, see [XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation](http://wwwusers.di.uniroma1.it/~navigli/pubs/AAAI_2021_Pasinietal.pdf).
- NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. ([Scholia](https://scholia.toolforge.org/work/Q57730960))
- Danish Propbank - commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
- Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
- Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retelling (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in *Tekststrukturering på italiensk og dansk*
- Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
- Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
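The Danske taler endpoint listed above has the shape of a WordPress REST route (`wp-json/wp/v2/...`), so it can probably be paged like any other WordPress collection. The following is a minimal sketch under that assumption: `page` and `per_page` are standard WordPress REST parameters, while the response fields (`id`, `link`, `title.rendered`) are assumed to follow the default WordPress post layout and have not been verified against this particular site. The function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"  # endpoint named in the list


def build_url(page=1, per_page=10):
    """Build a paginated request URL (page/per_page are standard WordPress REST parameters)."""
    return BASE_URL + "?" + urlencode({"page": page, "per_page": per_page})


def parse_speeches(payload):
    """Extract id, title and link from a JSON array of WordPress-style post objects.

    Assumption: each item has 'id', 'link' and 'title': {'rendered': ...},
    as in default WordPress post responses; missing fields become None.
    """
    speeches = []
    for item in json.loads(payload):
        title = item.get("title") or {}
        speeches.append({
            "id": item.get("id"),
            "title": title.get("rendered"),
            "link": item.get("link"),
        })
    return speeches


def fetch_speeches(page=1, per_page=10, timeout=30):
    """Fetch one page of speeches over the network and parse it."""
    with urlopen(build_url(page, per_page), timeout=timeout) as response:
        return parse_speeches(response.read().decode("utf-8"))
```

Calling `fetch_speeches(page=1, per_page=5)` would request the first five entries; `parse_speeches` can also be applied to a saved response, which keeps the parsing logic usable offline.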
-
Parallel corpora
- Europarl - parallel sentences between Danish and English from the European Parliament.
- ITU Faroese Pairs Dataset - Faroese-Danish parallel text. Described in *[The ITU Faroese Pairs Dataset](https://arxiv.org/pdf/2206.08727.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q112673011))
- JW300 - "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
- OpenSubtitles2018 - Parallel corpus from movie and tv subtitles. Described in *[OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles](http://www.lrec-conf.org/proceedings/lrec2016/pdf/947_Paper.pdf)*.
- Tatoeba - Sentences
- WikiMatrix
-
Spoken language corpora
- DanPASS - Described in *DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus* ([Scholia](https://scholia.toolforge.org/work/Q71038312))
- LANCHART - Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., *[The data and design of the LANCHART study](https://www.tandfonline.com/doi/pdf/10.1080/03740460903364003)* ([Scholia](https://scholia.toolforge.org/work/Q97756317)).
- Common Voice - Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at https://commonvoice.mozilla.org/sentence-collector. Common Voice is described in *Common Voice: A Massively-Multilingual Speech Corpus* ([Scholia](https://scholia.toolforge.org/work/Q79020060)).
- FT Speech - Described in *FT SPEECH : Danish Parliament Speech Corpus* ([Scholia](https://scholia.toolforge.org/work/Q98841513)).
- NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
- NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
- NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
- Wikimedia Commons Audio files of Danish language - Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works.
- CoRal - Danish Conversational and Read-aloud Dataset
-
Dictionaries and ontologies
- Det Centrale Ordregister - identifier for words and their inflections with 516,017 forms (COR).
- NST-lexical-database
- DanNet, Danish Wordnet (v 2.2) - owl format - Danish wordnet with three-clause BSD-like license.
- Excerpt
- Opslagsord og ordklasser
- Stavekontrolden - word list with 160,132 Danish words. Used, e.g., for spelling suggestion in LibreOffice. Licensed under GPL, LGPL, and MPL.
- The Concise Danish Dictionary
- Interactive Terminology for Europe - European Union terminology database. October 2020 version contains over 500,000 Danish terms.
- The Danish FrameNet Lexicon
- 1,290,000 lexemes
- Overview over Danish lexemes in Ordia - webapp with overview of content of Wikidata lexemes based on SPARQL queries.
- AFINN - Danish lexicons annotated for sentiment.
- concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in *[Automatic Estimation of Lexical Concreteness in 77 Languages](http://pubman.mpdl.mpg.de/pubman/item/escidoc:2622741/component/escidoc:2622739/Thompson_Lupyan_2018.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q56219750)).
- SAM lexicon - sentiment analysis word list extended from AFINN to 4275 lines. Described in *[Sentiment Analysis Multitool, SAM](https://raw.githubusercontent.com/lucaspuvis/SAM/master/Thesis.pdf)*.
- Wikidata lexemes latest lexemes dump in ttl - official dump of lexeme-only part of Wikidata.
- NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
- Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
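Sentiment word lists such as AFINN and the SAM lexicon above are typically applied by summing the valence of every matched word in a text. The sketch below illustrates that idea, assuming a tab-separated `word<TAB>score` layout; the four-entry lexicon and its scores are hypothetical placeholders, not values taken from either list, and multi-word entries are not handled.

```python
import re

# Hypothetical mini-lexicon for illustration only (word<TAB>integer valence).
# Real sentiment lists contain thousands of entries with their own scores.
TOY_LEXICON_TSV = "god\t3\ndårlig\t-3\nfantastisk\t4\nelendig\t-3\n"


def load_lexicon(tsv_text):
    """Parse tab-separated 'word<TAB>score' lines into a dict of lowercase word -> int."""
    lexicon = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        word, score = line.rsplit("\t", 1)
        lexicon[word.lower()] = int(score)
    return lexicon


def sentiment_score(text, lexicon):
    """Sum the valence of every single-word token found in the lexicon."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


lexicon = load_lexicon(TOY_LEXICON_TSV)
score = sentiment_score("Maden var god, men servicen var dårlig og elendig.", lexicon)
```

With the toy values above, the example sentence sums to a negative score because two negative words outweigh one positive word. A fuller implementation would also need phrase matching and negation handling.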
-
Word sets
- Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in [danlp](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#wordsim-353).
- Four words - 100 odd-one-out sets of 4 words or phrases.
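An odd-one-out set like those in Four words can be used to probe word embeddings: for each set, pick the word that is least similar on average to the other three and compare it with the gold answer. The sketch below shows one common way to do this with cosine similarity. The three-dimensional vectors are hypothetical toy values chosen for illustration; in practice they would come from a trained Danish embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(words, vectors):
    """Return the word with the lowest mean cosine similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Hypothetical toy vectors: three fruits close together, one vehicle far away.
TOY_VECTORS = {
    "æble": [0.9, 0.1, 0.0],
    "pære": [0.8, 0.2, 0.1],
    "banan": [0.85, 0.15, 0.05],
    "bil": [0.1, 0.9, 0.3],
}

guess = odd_one_out(["æble", "pære", "banan", "bil"], TOY_VECTORS)
```

Accuracy over all sets is then the fraction of sets where the guess matches the annotated odd word.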
-
Neural text models
- Ælæctra - Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
- Multilingual sentence transformers - Pre-trained multilingual sentence transformers.
- Danish ELECTRA - Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the transformers library.
- daT5-summariser - Danish abstractive summarisation of news articles based on mT5-base.
- ConvBERT - Philip Tamimi-Sarnikowski's model
- Danish ELMo on OSCAR - (Link does not work as of December 2020)
- mfaq - Multilingual FAQ retrieval model. Described in *[MFAQ: a Multilingual FAQ Dataset](https://arxiv.org/pdf/2109.12870)* ([Scholia](https://scholia.toolforge.org/work/Q114963815))
- wiki40b-lm-da - language model trained on Danish from Wiki40B dataset
-
Embeddings
- cc.da.300 (us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.da.300.bin.gz) - fastText-trained embedding on Danish part of *Common Crawl* and Danish Wikipedia. Read more about the method in *[Learning Word Vectors for 157 Languages](https://arxiv.org/pdf/1802.06893)* ([Scholia](https://scholia.toolforge.org/work/Q49985142)).
- wiki.da (us-west-1.amazonaws.com/fasttext-vectors/wiki.da.zip) - fastText-trained embedding on Danish Wikipedia. Read more about the method in *[Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606)* ([Scholia](https://scholia.toolforge.org/work/Q28775150)).
- NLPL word embeddings repository - NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020.
- Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
- Danish DSL and Reddit word2vec word embeddings - 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit.
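Several of the embeddings above are distributed, or can be exported, in the plain-text word2vec format: a header line with the vocabulary size and dimension, followed by one line per word with its vector components. The sketch below parses that layout and ranks neighbours by cosine similarity. It is a dependency-free illustration on hypothetical toy data, not a replacement for the fastText or gensim loaders, and whether a given download ships in this text format is an assumption to check per resource.

```python
import math


def parse_vec(lines):
    """Parse word2vec-style text: header 'count dim', then 'word v1 ... vd' per line."""
    iterator = iter(lines)
    _count, dim = (int(x) for x in next(iterator).split())
    vectors = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`, best first."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy data in the same layout as a real text-format vector file.
SAMPLE = ["3 2", "konge 1.0 0.0", "dronning 0.9 0.1", "bil 0.0 1.0"]
dim, vectors = parse_vec(SAMPLE)
neighbours = most_similar("konge", vectors, topn=1)
```

For a real file, an open file handle can be passed straight to `parse_vec`, since it only iterates over lines. Full-size embeddings are large, so a dedicated loader with memory mapping is usually the better choice outside of experiments.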
-
Neural speech models
- Hugging Face - List of models for Danish automatic speech recognition.
- Alvenir Wav2vec2 - Pretrained Danish neural model.
- Whisper - Multilingual neural model from OpenAI.
- xls-r-300m-danish-nst-cv9 - Pretrained Danish neural model.
-
-
Tools
-
Named entity recognition
- ScandiNER - Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese.
- flair+danlp ner-tagger - Flair NER tagger trained by the Alexandra Institute.
- Polyglot named entity extraction
-
Entity linking
- Babelfy - Web app and service for linking words and entities.
- DBpedia Spotlight - DBpedia-based entity linker. Described in *Improving Efficiency and Accuracy in Multilingual Entity Extraction* ([Scholia](https://scholia.toolforge.org/work/Q106526263))
-
Automatic Speech Recognition
- kaldi-sprakbanken - A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database.
-
Speech Synthesis (text-to-speech)
- Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at [TTSMP3](https://ttsmp3.com/text-to-speech/Danish/).
- ResponsiveVoice - Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
-
Fundamental processing
-
-
Competitions
-
Fundamental processing
- ELEXIS Monolingual Word Sense Alignment Task - Predicting the relationship between two senses in each of several languages, including Danish.
- OffensEval 2020 - Danish - Offensive Language Identification in Social Media competition. Described in [Offensive Language and Hate Speech Detection for Danish](https://arxiv.org/pdf/1908.04531.pdf) ([Scholia](https://scholia.toolforge.org/work/Q66592400))
-
-
Resources about resources
-
Fundamental processing
- Language Technology Resources for Danish - Det Danske Sprog- og Litteraturselskab
- European Language Resources Association (ELRA) list for *Danish* - commercial licenses.
- sprogteknologi.dk - List of Danish language resources. Compiled by the Agency for Digitisation.
- Danish resources - Finn Årup Nielsen's PDF with pointers to Danish resources.
- Scholia's topic aspect for *Danish*
-