Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-swedish-nlp

A curated list of resources for natural language processing (NLP) in Swedish
https://github.com/dkalpakchi/awesome-swedish-nlp

  • Corpora

    • Monolingual

      • CC-100 - documents extracted from [Common Crawl](https://commoncrawl.org/), automatically classified and filtered. The Swedish part is 21 GB of raw text.
      • mC4 - a colossal, cleaned version of Common Crawl's web crawl corpus (C4); the Swedish part contains about 65 GB of raw text
      • SBS - a collection of sentences from Swedish blog posts from November 2010 until September 2012, **contains scrambled sentences** -- **NOTE: links seem to be broken as of 2022-05-25**
      • Project Runeberg - copyright-free Swedish literature
      • Swedish Diachronic Corpus - text corpora covering the period from Old Swedish to the present day for various text genres
      • OSCAR
      • Polyglot's processed Swedish Wikipedia
      • Språkbanken Text - a hub page for many Swedish corpora maintained by Språkbanken Text; the monolingual corpora come from newspapers, blog posts, and literature from different years (some from as early as the 18th century). **Note that many of these corpora contain scrambled sentences**.
    • Parallel

      • OPUS - The Open Parallel Corpus, a hub for parallel datasets for many pairs of languages, including to/from Swedish.
      • SMULTRON - a parallel treebank that contains around 1000 sentences in English, German and Swedish
  • Datasets

    • Monolingual

      • Talbanken
      • LinES
      • PUD
      • Swedish-sentiment - a sentiment analysis dataset of 10000 texts with a roughly 50/50 split between positive and negative sentiments
      • SIC - a corpus of Swedish Internet texts, manually annotated with part-of-speech tags and named entities
      • SUSC - a corpus of seven novels by August Strindberg annotated with part-of-speech tags, morphological analysis, and lemmas
      • SNEC - The Strindberg National Edition Corpus, both plain text version and linguistically annotated CoNLL-U version -- **NOTE: links seem to be broken as of 2022-05-25**
      • SuperLim - a Swedish version of the GLUE benchmark
      • SUC 2.0 - annotated with part-of-speech tags, morphological analysis, and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information
      • SUC 3.0 - improved and extended SUC 2.0
      • OverLim - contains some of the GLUE and SuperGLUE tasks automatically translated to Swedish, Danish, and Norwegian (bokmål) using the OpusMT models for MarianMT; **the translation quality was not manually checked**
  • Pre-trained resources

  • Tools

    • Machine Translation models to/from Swedish

      • Granska - software for grammar checking
      • Stava - software for spell checking