awesome-swedish-nlp

A curated list of resources for natural language processing (NLP) in Swedish
https://github.com/dkalpakchi/awesome-swedish-nlp


Corpora
- Monolingual
  - CC-100 - documents extracted from [Common Crawl](https://commoncrawl.org/), automatically classified and filtered. The Swedish part is 21 GB of raw text.
  - mC4 - a colossal, cleaned version of Common Crawl's web crawl corpus (C4). The Swedish part contains about 65 GB of raw text.
  - SBS - a collection of sentences from Swedish blog posts from November 2010 until September 2012, **contains scrambled sentences** -- **NOTE: links seem to be broken as of 2022-05-25**
  - Project Runeberg - copyright-free Swedish literature
  - Swedish Diachronic Corpus - text corpora covering the time period from Old Swedish to the present day for various text genres
  - OSCAR
  - Polyglot's processed Swedish Wikipedia
  - Språkbanken Text - a hub page for many Swedish corpora maintained by Språkbanken Text. The monolingual corpora come from newspapers, blog posts, and literature from different years (some from as early as the 18th century). **Note that many of these corpora contain scrambled sentences**.
- Parallel
  - OPUS - The Open Parallel Corpus, a hub for parallel datasets for many pairs of languages, including to/from Swedish.
  - SMULTRON - a parallel treebank that contains around 1000 sentences in English, German and Swedish
Datasets
- Monolingual
  - Talbanken
  - LinES
  - PUD
  - Swedish-sentiment - a sentiment analysis dataset of 10000 texts with a roughly 50/50 split between positive and negative sentiments
  - SIC - a corpus of Swedish Internet tags, manually annotated with part-of-speech tags and named entities
  - SUSC - a corpus of seven novels by August Strindberg annotated with part-of-speech tags, morphological analysis and lemmas
  - SNEC - The Strindberg National Edition Corpus, both a plain text version and a linguistically annotated CoNLL-U version -- **NOTE: links seem to be broken as of 2022-05-25**
  - SuperLim - a Swedish version of the GLUE benchmark
  - SUC 2.0 - annotated with part-of-speech tags, morphological analysis and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information
  - SUC 3.0 - an improved and extended version of SUC 2.0
  - OverLim - contains some of the GLUE and SuperGLUE tasks automatically translated to Swedish, Danish, and Norwegian (bokmål) using the OpusMT models for MarianMT, **the translation quality was not manually checked**
Pre-trained resources
- Word embeddings
  - vecs
  - vecs
  - Diachronic embeddings
  - NLPL repository
  - vecs
  - vecs
  - Swectors - the released vectors are Word2Vec
  - Polyglot embeddings
  - vecs
- Swedish-specific Transformer models
  - this thread
  - model on HF Hub
  - model on HF Hub
  - model on HF Hub
  - model on HF Hub
  - model on HF Hub
  - model on HF Hub - **NOTE: The repository is empty as of 2022-08-23**
  - model on HF Hub
- Nordic Transformer models
  - model on HF Hub
- Multilingual Transformer models
  - mBERT - multilingual BERT by Google Research
  - mBART50 - multilingual BART by FAIR
- Dependency parsing models
  - Stanza's models - trained on UD treebanks: one on Talbanken and another on LinES
  - MaltParser
- Part of speech taggers
  - Stagger
- Machine Translation models to/from Swedish
  - models on HF Hub
Tools
- Granska - software for grammar checking
- Stava - software for spell checking
