awesome-danish

A curated list of awesome resources for Danish language technology
https://github.com/fnielsen/awesome-danish

  • Data

    • Corpora

      • Danish review dataset - Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews.
      • SemDaX - POS-tagged (only adjectives, nouns and verbs), supersense-tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
      • XED - emotion annotated movie subtitles. Described in *[XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection](https://www.aclweb.org/anthology/2020.coling-main.575.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q105978198)).
      • DaN+ - nested named entity annotations covering the entire Danish Universal Dependencies treebank (UD_Danish-DDT) and 3 new web domains, with lexical normalization. Described in *[DaN+: Danish Nested Named Entities and Lexical Normalization](https://www.aclweb.org/anthology/2020.coling-main.583/)*
      • Corona Dataset - Question dataset from Certainly annotated for domain and intent.
      • wiki40b/da - Cleaned-up text from Danish Wikipedia. Described in *[Wiki-40B: Multilingual Language Model Dataset](https://www.aclweb.org/anthology/2020.lrec-1.297/)*. ([Scholia](https://scholia.toolforge.org/work/Q105430726))
      • DKhate - corpus of 3,600 hate speech comments from Twitter and Reddit as well as news comments. Described in *[Offensive Language and Hate Speech Detection for Danish](https://www.aclweb.org/anthology/2020.lrec-1.430)* ([Scholia](https://scholia.toolforge.org/work/Q66592400))
      • DaNewsroom: A Large-scale Danish Summarisation Dataset
      • WikiANN - Named entity annotated corpus. Described in *[Cross-lingual Name Tagging and Linking for 282 Languages](https://www.aclweb.org/anthology/P17-1178.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q54488065))
      • Danish Gigaword - Collection of about 10^9 (one billion) words of Danish text. Described in *[The Danish Gigaword Corpus](https://arxiv.org/abs/2005.03521)* ([Scholia](https://scholia.toolforge.org/work/Q107060118))
      • OSCAR - Danish corpus derived from the Common Crawl corpus. Described in *[Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures](https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/9021/file/Suarez_Sagot_Romary_Asynchronous_Pipeline_for_Processing_Huge_Corpora_2019.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q85398180))
      • The Danish Parliament Corpus 2009 - 2017, v1 - Licensed under Attribution 4.0 International (CC BY 4.0).
      • Grundtvig's Works Corpus - Licensed under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
      • DK-CLARIN Reference Corpus of General Danish
      • DanFEVER - Danish text corpus with over 6,400 claims and supporting evidence. Described in *[DanFEVER: claim verification dataset for Danish](http://www.ep.liu.se/ecp/178/047/ecp2021178047.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q107060524))
      • DanNet - wordnet with usage examples. The usage examples have been used for word sense disambiguation, see *[XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation](http://wwwusers.di.uniroma1.it/~navigli/pubs/AAAI_2021_Pasinietal.pdf)*
      • NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. ([Scholia](https://scholia.toolforge.org/work/Q57730960))
      • Danish Propbank - commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
      • Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
      • Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retellings (of Mr. Bean episodes) and argumentative text (about smoking). Possibly described in *Tekststrukturering på italiensk og dansk*
      • Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
      • Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
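
The Danske taler endpoint listed above looks like a WordPress REST API route, so a request against it can be sketched as follows. This is a minimal sketch under assumptions: `page` and `per_page` are the standard WordPress REST pagination parameters, and the `title.rendered` and `link` fields come from the default WordPress post schema; the site may expose different fields. The helper names (`build_url`, `summarise`, `fetch_speeches`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as given in the list entry for Danske taler.
BASE_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"


def build_url(page=1, per_page=10):
    """Build a paginated request URL (page/per_page are standard WordPress REST parameters)."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return f"{BASE_URL}?{query}"


def summarise(items):
    """Extract (title, link) pairs; field names are assumed from the default WordPress post schema."""
    return [
        (item.get("title", {}).get("rendered", ""), item.get("link", ""))
        for item in items
    ]


def fetch_speeches(page=1, per_page=10, timeout=30):
    """Fetch one page of speeches from the live API (requires network access)."""
    with urllib.request.urlopen(build_url(page, per_page), timeout=timeout) as response:
        return json.load(response)


# Example use (not run here because it needs network access):
#     for title, link in summarise(fetch_speeches(per_page=5)):
#         print(title, link)
```

Keeping URL construction and field extraction separate from the network call makes the sketch easy to adapt if the actual response schema differs from the assumed one.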
    • Parallel corpora

      • Europarl - parallel sentences between Danish and English from the European Parliament.
      • ITU Faroese Pairs Dataset - Faroese-Danish parallel text. Described in *[The ITU Faroese Pairs Dataset](https://arxiv.org/pdf/2206.08727.pdf)* ([Scholia](https://scholia.toolforge.org/work/Q112673011))
      • JW300 - "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
      • OpenSubtitles2018 - Parallel corpus from movie and tv subtitles. Described in *[OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles](http://www.lrec-conf.org/proceedings/lrec2016/pdf/947_Paper.pdf)*.
      • Tatoeba - Sentences
      • WikiMatrix
    • Spoken language corpora

      • DanPASS - Described in *DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus* ([Scholia](https://scholia.toolforge.org/work/Q71038312))
      • LANCHART - Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., *[The data and design of the LANCHART study](https://www.tandfonline.com/doi/pdf/10.1080/03740460903364003)* ([Scholia](https://scholia.toolforge.org/work/Q97756317)).
      • Common Voice - Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at https://commonvoice.mozilla.org/sentence-collector. Common Voice is described in *Common Voice: A Massively-Multilingual Speech Corpus* ([Scholia](https://scholia.toolforge.org/work/Q79020060)).
      • FT Speech - Described in *FT SPEECH : Danish Parliament Speech Corpus* ([Scholia](https://scholia.toolforge.org/work/Q98841513)).
      • NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
      • NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
      • NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
      • Wikimedia Commons Audio files of Danish language - Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works.
      • CoRal - Danish Conversational and Read-aloud Dataset
    • Dictionaries and ontologies

    • Word sets

      • Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in [danlp](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#wordsim-353).
      • Four words - 100 odd-one-out sets of 4 words or phrases.
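
An odd-one-out set such as those in Four words is commonly scored against a word embedding by picking the word that is least similar, on average, to the rest of the set. The sketch below illustrates that idea; it is not taken from the Four words resource itself, and the toy 2-dimensional vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word with the lowest mean cosine similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)

    return min(words, key=mean_similarity)


# Toy vectors invented for illustration; real use would load a trained Danish embedding.
toy_vectors = {
    "æble": [0.9, 0.1],
    "pære": [0.8, 0.2],
    "banan": [0.85, 0.15],
    "bil": [0.1, 0.9],
}
print(odd_one_out(["æble", "pære", "banan", "bil"], toy_vectors))  # prints "bil"
```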
    • Neural text models

      • Ælæctra - Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
      • Multilingual sentence transformers - Pre-trained multilingual sentence transformers.
      • Danish ELECTRA - Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the transformers library.
      • daT5-summariser - Danish abstractive summarisation of news articles based on mT5-base.
      • ConvBERT - Philip Tamimi-Sarnikowski's model
      • Danish ELMo on OSCAR - (Link does not work as of December 2020)
      • mfaq - Multilingual FAQ retrieval model. Described in *[MFAQ: a Multilingual FAQ Dataset](https://arxiv.org/pdf/2109.12870)* ([Scholia](https://scholia.toolforge.org/work/Q114963815))
      • wiki40b-lm-da - language model trained on Danish from Wiki40B dataset
    • Embeddings

      • cc.da.300 (…us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.da.300.bin.gz) - fastText-trained embedding on the Danish part of *Common Crawl* and Danish Wikipedia. Read more about the method in *[Learning Word Vectors for 157 Languages](https://arxiv.org/pdf/1802.06893)* ([Scholia](https://scholia.toolforge.org/work/Q49985142)).
      • wiki.da (…us-west-1.amazonaws.com/fasttext-vectors/wiki.da.zip) - fastText-trained embedding on Danish Wikipedia. Read more about the method in *[Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606)* ([Scholia](https://scholia.toolforge.org/work/Q28775150)).
      • NLPL word embeddings repository - Maintained by the Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020.
      • Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
      • Danish DSL and Reddit word2vec word embeddings - 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit.
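
Several of the embeddings above are also distributed in the plain-text word2vec/fastText layout: a header line with vocabulary size and dimension, followed by one word and its values per line. The following is a minimal sketch of reading that layout and ranking neighbours by cosine similarity, using only the standard library. The function names are hypothetical and the three-word sample is invented; for real files a dedicated library would likely be faster.

```python
import io
import math


def read_vec(handle, limit=None):
    """Read word2vec/fastText text format: header 'count dim', then 'word v1 ... vdim' per line."""
    _, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip lines that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to the given word."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Tiny in-memory sample invented for illustration, in the same layout as a .vec file.
sample = io.StringIO("3 2\nkonge 0.9 0.1\ndronning 0.8 0.3\ncykel 0.1 0.9\n")
vectors = read_vec(sample)
print(most_similar("konge", vectors, topn=1)[0][0])  # prints "dronning"
```

For a downloaded file, the same reader could be applied with something like `open("wiki.da.vec", encoding="utf-8")`, where the file name is an assumption about what the archive contains; the `limit` argument keeps memory use manageable by loading only the first (typically most frequent) words.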
    • Neural speech models

  • Tools

    • Named entity recognition

    • Entity linking

      • Babelfy - Web app and service for linking words and entities.
      • DBpedia Spotlight - DBpedia-based entity linker. Described in *Improving Efficiency and Accuracy in Multilingual Entity Extraction* ([Scholia](https://scholia.toolforge.org/work/Q106526263))
    • Automatic Speech Recognition

      • kaldi-sprakbanken - A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database.
    • Speech Synthesis (text-to-speech)

      • Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at [TTSMP3](https://ttsmp3.com/text-to-speech/Danish/).
      • ResponsiveVoice - Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited, non-commercial use.
    • Fundamental processing

      • DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
      • spaCy - Python-based natural language processing package
  • Competitions

    • Fundamental processing

      • ELEXIS Monolingual Word Sense Alignment Task - Predicting the relationship between two senses in each of several languages, including Danish.
      • OffensEval 2020 - Danish - Offensive Language Identification in Social Media competition. Described in [Offensive Language and Hate Speech Detection for Danish](https://arxiv.org/pdf/1908.04531.pdf) ([Scholia](https://scholia.toolforge.org/work/Q66592400))
  • Resources about resources