Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-nlp-polish

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
https://github.com/ksopyla/awesome-nlp-polish


  • Polish text datasets

    • Task-oriented datasets

      • Ermlab Opineo dataset - opineo reviews - [GDrive](https://drive.google.com/file/d/1vXqUEBjUHGGy3vV2dA7LlvBjjZlQnl0D/view?usp=sharing)
      • PolEval 2019 Task6
      • Polish CDSCorpus - The dataset for compositional distributional semantics. Polish CDSCorpus consists of 10K Polish sentence pairs which are human-annotated for semantic relatedness and entailment.
      • Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) - corpus of Polish reviews annotated with sentiment at the level of the whole text (*text*) and at the level of sentences (*sentence*) for the following domains: hotels, medicine, products and university (*reviews*)
      • Polish analogy dataset - example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation
      • PolEmo 2.0 Sentiment Analysis Dataset for CoNLL
      • Polish Music Dataset - the largest dataset with information about artists, songs and lyrics in Poland (currently hip-hop artists only).
      • NKJP - National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for [download](http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=get&target=NKJP-PodkorpusMilionowy-1.2.tar.gz) (GNU GPL v.3). Direct contact may be necessary to get the full corpus.
    • Raw texts

      • Clean Polish OSCAR - preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enums) removed; preprocessed by @Ermlab
      • Polish Wikipedia dump - regular monthly copy of Polish Wikipedia. More than 4 GB of text.
      • ParaCrawl v5 - ParaCrawl/v5/mono/pl.txt.gz) [tokenized txt corpus](https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/mono/pl.tok.gz)
      • Polish Parliamentary Corpus
      • Polish OpenSubtitles v2018 - 45.9M sentences, 287.1M Polish tokens; a collection of translated movie subtitles from [opensubtitles](http://www.opensubtitles.org/): [raw txt corpus (unpacked 7.2GB)](https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/pl.txt.gz), [tokenized txt corpus (unpacked 7.6GB)](https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/pl.tok.gz).
  • Models and Embeddings

  • Language processing tools and libraries

  • Papers, articles, blog posts

  • Contribution
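
The Polish analogy dataset entry above gives the example "Ateny Grecja Bagdad Irak" and notes that it is useful for evaluating word embeddings. A minimal sketch of how such a line might be scored follows. It assumes each line holds four whitespace-separated single-token words (the real file layout may differ), and it uses tiny invented 2-D vectors in place of real embeddings. The names toy_vectors, solve_analogy and analogy_accuracy are hypothetical and not part of any listed resource.

```python
import math

# Toy 2-D vectors invented for illustration only; real embeddings would be
# loaded from one of the Polish models and have hundreds of dimensions.
toy_vectors = {
    "Ateny": (1.0, 3.0),
    "Grecja": (1.0, 1.0),
    "Bagdad": (4.0, 3.0),
    "Irak": (4.0, 1.0),
    "Polska": (7.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(lines, vectors):
    """Fraction of 'a b c d' lines for which the predicted word equals d."""
    hits = 0
    total = 0
    for line in lines:
        a, b, c, expected = line.split()  # assumes four single-token words
        total += 1
        hits += solve_analogy(a, b, c, vectors) == expected
    return hits / total if total else 0.0


if __name__ == "__main__":
    print(solve_analogy("Ateny", "Grecja", "Bagdad", toy_vectors))
    print(analogy_accuracy(["Ateny Grecja Bagdad Irak"], toy_vectors))
```

With real embeddings the candidate set is the whole vocabulary, so a library nearest-neighbour search would normally replace the linear scan used here.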