Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-swedish-nlp
A curated list of resources for natural language processing (NLP) in Swedish
https://github.com/dkalpakchi/awesome-swedish-nlp
-
Corpora
-
Monolingual
- CC-100 - documents extracted from [Common Crawl](https://commoncrawl.org/), automatically classified and filtered. The Swedish part is 21 GB of raw text.
- mC4 - a colossal, cleaned version of Common Crawl's web crawl corpus (C4). The Swedish part contains about 65 GB of raw text.
- SBS - a collection of sentences from Swedish blog posts from November 2010 to September 2012, **contains scrambled sentences** -- **NOTE: links seem to be broken as of 2022-05-25**
- Project Runeberg - copyright-free Swedish literature
- Swedish Diachronic Corpus - text corpora covering the period from Old Swedish to the present day across various text genres
- OSCAR
- Polyglot's processed Swedish Wikipedia
- Språkbanken Text - a hub page for many Swedish corpora maintained by Språkbanken Text. The monolingual corpora come from newspapers, blog posts, and literature from different years (some from as early as the 18th century). **Note that many of these corpora contain scrambled sentences**.
-
Parallel
-
-
Datasets
-
Monolingual
- Talbanken
- LinES
- PUD
- Swedish-sentiment - a sentiment analysis dataset of 10000 texts, split roughly 50/50 between positive and negative sentiment
- SIC - a corpus of Swedish Internet tags, manually annotated with part-of-speech tags and named entities
- SUSC - a corpus of seven novels by August Strindberg, annotated with part-of-speech tags, morphological analysis, and lemmas
- SNEC - the Strindberg National Edition Corpus, available both as plain text and as a linguistically annotated CoNLL-U version -- **NOTE: links seem to be broken as of 2022-05-25**
- SuperLim - a Swedish version of the GLUE benchmark
- SUC 2.0 - annotated with part-of-speech tags, morphological analysis, and lemmas (all of which can be considered gold-standard data), as well as some structural and functional information
- SUC 3.0 - an improved and extended version of SUC 2.0
- OverLim - contains some of the GLUE and SuperGLUE tasks automatically translated into Swedish, Danish, and Norwegian (bokmål) using the OpusMT models for MarianMT; **the translation quality was not manually checked**
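Several of the datasets above are distributed in, or mention, the CoNLL-U format (for example the annotated SNEC version). As a rough illustration of what reading such a file involves, here is a minimal standard-library sketch. The function name `read_conllu` and the `SAMPLE` sentence are hypothetical: the sample is hand-written for this example and is not taken from any of the listed corpora.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Comment lines (starting with '#') are skipped, and a blank line ends
    a sentence. Multiword-token and empty-node rows (ids such as '1-2'
    or '1.1') are kept as-is; filter them out if you only want words.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed rows rather than guessing
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:  # file may not end with a blank line
        yield sentence


# Hand-written toy sentence (assumption, for illustration only).
SAMPLE = (
    "# text = Jag läser böcker.\n"
    "1\tJag\tjag\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tläser\tläsa\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tböcker\tbok\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        print([(tok["form"], tok["upos"]) for tok in sent])
```

For real files, pass the contents of the downloaded `.conllu` file as `text`; a dedicated CoNLL-U library may be a better fit for anything beyond quick inspection.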
-
-
Pre-trained resources
-
Word embeddings
- vecs
- vecs
- Diachronic embeddings
- NLPL repository
- vecs
- vecs
- Swectors - dimensional (the released vectors are Word2Vec)
- Polyglot embeddings
- vecs
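Some of the vectors listed above are released in Word2Vec format. As a hedged sketch, assuming a file in the plain-text Word2Vec layout (a header line with vocabulary size and dimensionality, then one word and its values per line), the vectors can be loaded and compared with the standard library alone. The names `load_word2vec_text`, `cosine`, and the `TOY` data are hypothetical and invented for this example; which of the listed downloads actually use this layout has to be checked per resource.

```python
import math
from typing import Dict, Iterable, List


def load_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse the plain-text Word2Vec layout into a word -> vector dict."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocab_size> <dimensions>"
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared size
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors (made up, not from any released model).
TOY = ["3 2", "kung 1.0 0.0", "drottning 0.8 0.6", "bil 0.0 1.0"]

if __name__ == "__main__":
    vecs = load_word2vec_text(TOY)
    print(cosine(vecs["kung"], vecs["drottning"]))
    print(cosine(vecs["kung"], vecs["bil"]))
```

With a real file, pass an open text file handle instead of `TOY`, for example `load_word2vec_text(open(path, encoding="utf-8"))`, where `path` is wherever the vectors were saved. Binary Word2Vec files need a different reader.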
-
Swedish-specific Transformer models
- this thread
- model on HF Hub
- model on HF Hub
- model on HF Hub
- model on HF Hub
- model on HF Hub
- model on HF Hub - **NOTE: The repository is empty as of 2022-08-23**
- model on HF Hub
-
Nordic Transformer models
-
Multilingual Transformer models
-
Dependency parsing models
- Stanza's models - trained on UD treebanks: one on Talbanken and another on LinES
- MaltParser
-
Part of speech taggers
-
Machine Translation models to/from Swedish
-
-
Tools