Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-nlp-polish

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.
https://github.com/ksopyla/awesome-nlp-polish

Last synced: 2 days ago
Polish text datasets
- Task-oriented datasets
  - Ermlab Opineo dataset - Opineo reviews - [GDrive](https://drive.google.com/file/d/1vXqUEBjUHGGy3vV2dA7LlvBjjZlQnl0D/view?usp=sharing)
  - PolEval 2019 Task6
  - Polish CDSCorpus - The dataset for compositional distributional semantics. Polish CDSCorpus consists of 10K Polish sentence pairs which are human-annotated for semantic relatedness and entailment.
  - Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) - corpus of Polish reviews annotated with sentiment at the level of the whole text (*text*) and at the level of sentences (*sentence*) for the following domains: hotels, medicine, products and university (*reviews*)
  - Polish analogy dataset - example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation
  - PolEmo 2.0 Sentiment Analysis Dataset for CoNLL
  - Polish Music Dataset - the largest dataset with information about artists, songs and lyrics in Poland (currently only hip-hop artists).
  - NKJP - National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for [download](http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=get&target=NKJP-PodkorpusMilionowy-1.2.tar.gz) (GNU GPL v3). Direct contact may be necessary to get the full corpus.
- Raw texts
  - Clean Polish OSCAR - preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enumerations) removed; preprocessed by @Ermlab
  - Polish Wikipedia dump - regular monthly copy of Polish Wikipedia. More than 4 GB of text.
  - ParaCrawl v5 - ParaCrawl/v5/mono/pl.txt.gz, [tokenized txt corpus](https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/mono/pl.tok.gz)
  - Polish Parliamentary Corpus
  - Polish OpenSubtitles v2018 - 45.9M sentences, 287.1M Polish tokens; a collection of translated movie subtitles from [opensubtitles](http://www.opensubtitles.org/): [raw txt corpus (unpacked 7.2GB)](https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/pl.txt.gz), [tokenized txt corpus (unpacked 7.6GB)](https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/pl.tok.gz).
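
The Polish analogy dataset listed above uses lines of four words, such as "Ateny Grecja Bagdad Irak", to evaluate word embeddings. A minimal sketch of how such a line might be scored, assuming the common vector-offset method (the word closest to b - a + c should be d). The function names and the toy 2-d vectors are hypothetical and invented for illustration; they are not taken from the dataset or from any of the listed models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, lines):
    """Score lines of the form 'a b c d'; malformed or out-of-vocabulary lines are skipped."""
    correct = total = 0
    for line in lines:
        words = line.split()
        if len(words) != 4 or any(w not in vectors for w in words):
            continue
        a, b, c, d = words
        total += 1
        correct += solve_analogy(vectors, a, b, c) == d
    return correct / total if total else 0.0


# Toy vectors, invented for illustration only (assumption: real embeddings
# would be loaded from one of the pretrained models instead).
toy_vectors = {
    "Ateny": [1.0, 0.0],
    "Grecja": [1.0, 1.0],
    "Bagdad": [2.0, 0.0],
    "Irak": [2.0, 1.0],
    "Warszawa": [3.0, 0.0],
    "Polska": [3.0, 1.0],
}

print(solve_analogy(toy_vectors, "Ateny", "Grecja", "Bagdad"))
```

With real embeddings the candidate search runs over the whole vocabulary, so a vectorized nearest-neighbour lookup would likely replace the linear scan used here.
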
Models and Embeddings
- Other models
  - ELMo embeddings - an ELMo embedding model for Polish trained on large text corpora (KGR10).
  - Zalando Flair Polish models - contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. Two models are available: "pl-forward" and "pl-backward".
  - IPIPAN Word2vec Polish models
  - Common Crawl - vectors.md
  - FastText KGR10 Polish model binary
  - ULMFiT for TensorFlow 2.0 - this collection contains ULMFiT recurrent language models trained on Wikipedia dumps for English and Polish. The models were trained using FastAI and then exported to a TensorFlow-usable format. Code is available on [Bitbucket](https://bitbucket.org/edroneteam/tf2_ulmfit/src/master/).
  - Wrocław University of Science and Technology Word2Vec - Distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia).
Language processing tools and libraries
- Other models
  - Morfeusz - morphological analyzer. See also [Elasticsearch plugin](https://github.com/allegro/elasticsearch-analysis-morfologik)
  - spaCy for Polish - extends spaCy, a popular production-ready NLP library, to fully support the Polish language.
  - Polish abbreviations for NLTK sentence tokenizer
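
An abbreviation list matters for sentence tokenization because a period after "prof." or "ul." does not end a sentence. The sketch below illustrates the idea in plain Python with a regular expression; it does not use NLTK's API, and the abbreviation set is a small hypothetical sample chosen for illustration, not the contents of the resource listed above.

```python
import re

# Hypothetical sample for illustration; a real list would be much longer.
POLISH_ABBREVIATIONS = frozenset({"np", "tzn", "prof", "dr", "ul", "godz"})


def split_sentences(text, abbreviations=POLISH_ABBREVIATIONS):
    """Split text into sentences, ignoring periods that follow a known abbreviation."""
    sentences, start = [], 0
    # Candidate boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-ZĄĆĘŁŃÓŚŹŻ])", text):
        preceding = text[start:match.start()].split()
        last_word = preceding[-1].lower() if preceding else ""
        if text[match.start()] == "." and last_word in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Spotkanie poprowadzi prof. Kowalski. Zaczynamy o godz. 10 przy ul. Długiej."
for sentence in split_sentences(sample):
    print(sentence)
```

Without the abbreviation check, the same regular expression would also split after "prof." and "ul.", which is the failure mode an abbreviation list is meant to prevent.
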
Papers, articles, blog posts
- Other models
  - Benchmarks of some Polish NLP tools - single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization, etc.
  - TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE - complete user guide for training a RoBERTa model for Polish with Huggingface/Transformers
Contribution
- Other models
  - LinkedIn
