Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-nlp-polish

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish. Models, tools, datasets.
https://github.com/ksopyla/awesome-nlp-polish

The KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark is a set of nine evaluation tasks for Polish language understanding.
PolEval 2019 Task 6
Polish CDSCorpus - The dataset for compositional distributional semantics. Polish CDSCorpus consists of 10K Polish sentence pairs which are human-annotated for semantic relatedness and entailment.
Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) - corpus of Polish reviews annotated with sentiment at the level of the whole text (*text*) and at the level of sentences (*sentence*) for the following domains: hotels, medicine, products and university (reviews*)
Ermlab Opineo dataset - opineo reviews - [GDrive](https://drive.google.com/file/d/1vXqUEBjUHGGy3vV2dA7LlvBjjZlQnl0D/view?usp=sharing)
Polish analogy dataset - example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation
NKJP - National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for [download](http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=get&target=NKJP-PodkorpusMilionowy-1.2.tar.gz) (GNU GPL v.3). Direct contact may be necessary to get the full corpus.
PolEmo 2.0 Sentiment Analysis Dataset for CoNLL
Polish Music Dataset - the largest dataset with information about artists, songs and lyrics in Poland (currently only hip-hop artists).
Clean Polish OSCAR - preprocessed Polish OSCAR corpus with foreign (non-Polish) sentences and invalid Polish sentences (e.g. enums) removed; corpus preprocessed by @Ermlab
OSCAR or Open Super-large Crawled ALMAnaCH coRpus - a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109GB or 49GB of Polish text.
Polish Wikipedia dump - regular monthly copy of Polish Wikipedia. More than 4GB of text.
Opus - the open parallel corpus - you can select languages and download only the Polish file
Polish OpenSubtitles v2018 - 45.9M sentences, 287.1M Polish tokens, a collection of translated movie subtitles from [opensubtitles](http://www.opensubtitles.org/) [raw txt corpus (unpacked 7.2GB)](https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/pl.txt.gz) [tokenized txt corpus (unpacked 7.6GB)](https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/mono/pl.tok.gz).
ParaCrawl v5 - ParaCrawl/v5/mono/pl.txt.gz) [tokenized txt corpus](https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/mono/pl.tok.gz)
Polish Parliamentary Corpus
Polish Roberta Model - the model was trained on a corpus consisting of a Polish Wikipedia dump, Polish books and articles, and the Polish Parliamentary Corpus
PoLitBert - Polish RoBERTa model trained on Polish Wikipedia, Polish literature and OSCAR. The major assumption is that high-quality text will yield a good model.
PolBert - Polish BERT model. The model was trained with code provided in Google BERT's GitHub repository. Available in [huggingface/Transformers](https://huggingface.co/dkleczek/bert-base-polish-uncased-v1)
Allegro HerBERT - Polish BERT model trained on Polish Corpora using only MLM objective with dynamic masking of whole words.
SlavicBert - multilingual BERT model - BERT, Slavic Cased: 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600MB. There is also another SlavicBert model http://docs.deeppavlov.ai/en/master/features/models/bert.html but I have had problems converting it to PyTorch.
ELMO embeddings - A model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10).
Zalando Flair Polish models - Contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. There are two models: "pl-forward" and "pl-backward".
IPIPAN Word2vec Polish models
Wrocław University of Science and Technology Word2Vec - Distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia).
Common Crawl - vectors.md)
FastText KGR10 Polish model binary
Universal Sentence Encoder Multilingual - sentence embeddings, it covers 16 languages (including Polish)
BPEmb: Subword Embeddings, including Polish - easy to use with [Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/BYTE_PAIR_EMBEDDINGS.md)
ULMFiT for Tensorflow 2.0 - this collection contains ULMFiT recurrent language models trained on Wikipedia dumps for English and Polish. The models themselves were trained using FastAI and then exported to a TensorFlow-usable format. Code is available on [Bitbucket](https://bitbucket.org/edroneteam/tf2_ulmfit/src/master/).
Morfologik - dictionary-based morphological analyzer
Morfeusz - morphological analyzer. See also [Elasticsearch plugin](https://github.com/allegro/elasticsearch-analysis-morfologik)
Stempel - algorithmic stemmer. See also [Elasticsearch plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-stempel.html)
spaCy for Polish - extends spaCy, a popular production-ready NLP library, to fully support the Polish language.
spacy-pl by IPI PAN - integrating existing Polish language tools and resources into the spaCy pipeline
KRNNT Polish morphological tagger - KRNNT is a morphological tagger for Polish based on recurrent neural networks [Paper](http://ltc.amu.edu.pl/book2017/papers/PolEval1-6.pdf)
Stanza - NLP analysis package from Stanford University.
Duckling - library for parsing text into structured data with support for Polish
Polish abbreviations for NLTK sentence tokenizer
Benchmarks of some Polish NLP tools - single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization etc.
Polish Word Embeddings Review - Evaluation of Polish word embeddings (word2vec, fastText etc.) prepared by various research groups. Evaluation is done with a word analogy task.
Polish Sentence Evaluation - contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks
TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE - a complete user guide for training a RoBERTa model for Polish with Huggingface/Transformers
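
The analogy dataset and the word embeddings review listed above both rely on the word analogy task, where a line such as "Ateny Grecja Bagdad Irak" is read as "Ateny is to Grecja as Bagdad is to Irak". The following is a minimal sketch of how such an evaluation could work, assuming the common vector-offset approach with cosine similarity. The function names and the two-dimensional `TOY_VECTORS` are hypothetical and only for illustration; a real evaluation would load one of the listed embedding models and may use a different scoring rule.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, lines):
    """Fraction of four-word analogy lines whose last word is predicted correctly.

    Lines that do not have exactly four words, or that contain
    out-of-vocabulary words, are skipped.
    """
    hits = total = 0
    for line in lines:
        words = line.split()
        if len(words) != 4 or any(w not in vectors for w in words):
            continue
        a, b, c, expected = words
        total += 1
        hits += solve_analogy(vectors, a, b, c) == expected
    return hits / total if total else 0.0


# Hypothetical toy embeddings: first axis ~ "which country", second ~ "is a country".
TOY_VECTORS = {
    "Ateny": [1.0, 0.0],
    "Grecja": [1.0, 1.0],
    "Bagdad": [3.0, 0.0],
    "Irak": [3.0, 1.0],
    "Warszawa": [5.0, 0.0],
    "Polska": [5.0, 1.0],
}

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "Ateny", "Grecja", "Bagdad"))
    print(analogy_accuracy(TOY_VECTORS, ["Ateny Grecja Bagdad Irak"]))
```

Excluding the three query words from the candidates is a common convention in analogy evaluation; without it the nearest neighbour is often one of the query words themselves.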
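
Several corpora listed above, such as the OpenSubtitles `pl.tok.gz` file, are distributed as gzip-compressed text that unpacks to several gigabytes. Below is a minimal sketch of streaming such a file without unpacking it to disk and counting its sentences and tokens. It assumes one sentence per line and whitespace-separated tokens, which is an assumption about the file layout rather than something the listing states; `corpus_stats` is a hypothetical helper name, and the counts it produces may differ from the figures quoted by the corpus authors.

```python
import gzip


def corpus_stats(source):
    """Count non-empty lines and whitespace-separated tokens in a gzip text file.

    `source` may be a file path or a binary file-like object. The file is
    read line by line, so memory use stays small even for large corpora.
    """
    sentences = tokens = 0
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            words = line.split()
            if words:  # skip blank lines
                sentences += 1
                tokens += len(words)
    return sentences, tokens
```

After downloading the file, a call such as `corpus_stats("pl.tok.gz")` would return a `(sentences, tokens)` pair.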

Programming Languages

Python 7 Haskell 1 HTML 1 Java 1 Jupyter Notebook 1

Keywords

nlp 4 deep-learning 3 machine-learning 3 natural-language-processing 3 polish 3 polish-language 3 word2vec 2 polish-sentiment-analysis 1 language-models 1 lexicons 1 word-embedding 1 pos-tagger 1 tagger 1 artificial-intelligence 1 corenlp 1 named-entity-recognition 1 python 1 pytorch 1 universal-dependencies 1 computational-linguistics 1 fasttext 1 wordembeddings 1 sentence-embeddings 1 word-embeddings 1 roberta 1 text-corpus 1