awesome-nlp-polish
https://github.com/plubinski/awesome-nlp-polish
- SlavicBert - multilingual BERT model - BERT, Slavic Cased: 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600 MB. There is also another SlavicBert model at http://docs.deeppavlov.ai/en/master/features/models/bert.html, but I have had problems converting it to PyTorch.
- ELMo embeddings - A model of ELMo embeddings for the Polish language trained on large textual corpora (KGR10).
- Zalando Flair Polish models - Contextual string embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. There are two models: "pl-forward" and "pl-backward".
- IPIPAN Word2vec Polish models
- Wrocław University of Science and Technology Word2Vec - Distributional language models for Polish trained on different corpora (KGR10, NKJP, Wikipedia).
- Common Crawl vectors
- FastText KGR10 Polish model binary
- Universal Sentence Encoder Multilingual - sentence embeddings covering 16 languages (including Polish)
- BPEmb: Subword Embeddings (includes Polish) - easy to use with [Flair](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/BYTE_PAIR_EMBEDDINGS.md)
- Morfologik - dictionary-based morphological analyzer
- Morfeusz - morphological analyzer. See also [Elasticsearch plugin](https://github.com/allegro/elasticsearch-analysis-morfologik)
- Stempel - algorithmic stemmer. See also [Elasticsearch plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-stempel.html)
- spaCy for Polish - extends spaCy, a popular production-ready NLP library, to fully support the Polish language.
- Polish Word Embeddings Review - Evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. Evaluation is done on the word analogy task.
- Polish Sentence Evaluation - contains an evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks
- KLEJ (Kompleksowa Lista Ewaluacji Językowych) benchmark - a set of nine evaluation tasks for Polish language understanding.
- PolEval 2019 Task 6
- Polish CDSCorpus - The dataset for compositional distributional semantics. Polish CDSCorpus consists of 10K Polish sentence pairs which are human-annotated for semantic relatedness and entailment.
- Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) - corpus of Polish reviews annotated with sentiment at the level of the whole text (*text*) and at the level of sentences (*sentence*) for the following domains: hotels, medicine, products and university (reviews*)
- Ermlab Opineo dataset - Opineo reviews - [GDrive](https://drive.google.com/file/d/1vXqUEBjUHGGy3vV2dA7LlvBjjZlQnl0D/view?usp=sharing)
- OSCAR (Open Super-large Crawled ALMAnaCH coRpus) - a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Contains 109 GB of Polish text, or 49 GB in the deduplicated version.
- Polish Wikipedia dump - regular monthly copy of Polish Wikipedia. More than 4 GB of text.
- Polish analogy dataset - example: "Ateny Grecja Bagdad Irak" - useful for word embeddings evaluation
- Polish OpenSubtitles - collection of translated movie subtitles from [opensubtitles](http://www.opensubtitles.org/).
- NKJP - National Corpus of Polish. It contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. Only a small sub-corpus is available for [download](http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=get&target=NKJP-PodkorpusMilionowy-1.2.tar.gz) (GNU GPL v.3). Direct contact may be necessary to get the full corpus.
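
The word analogy task mentioned above (lines such as "Ateny Grecja Bagdad Irak") can be scored roughly as in the sketch below. It assumes the common vector-offset approach (pick the word closest to `b - a + c` by cosine similarity), which may differ from what a given evaluation in this list actually does. The vectors are tiny hand-made toy values, not real embeddings, and the names `TOY_VECTORS`, `solve_analogy` and `analogy_accuracy` are hypothetical.

```python
import math

# Toy, hand-made vectors for illustration only (NOT real embeddings).
# Dimensions: [is-capital, is-country, Greece, Iraq, Poland]
TOY_VECTORS = {
    "Ateny":    [1.0, 0.0, 1.0, 0.0, 0.0],
    "Grecja":   [0.0, 1.0, 1.0, 0.0, 0.0],
    "Bagdad":   [1.0, 0.0, 0.0, 1.0, 0.0],
    "Irak":     [0.0, 1.0, 0.0, 1.0, 0.0],
    "Warszawa": [1.0, 0.0, 0.0, 0.0, 1.0],
    "Polska":   [0.0, 1.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, lines):
    """Fraction of 'a b c d' lines where the predicted word equals d."""
    correct = total = 0
    for line in lines:
        a, b, c, expected = line.split()
        total += 1
        if solve_analogy(vectors, a, b, c) == expected:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    questions = ["Ateny Grecja Bagdad Irak", "Ateny Grecja Warszawa Polska"]
    print(solve_analogy(TOY_VECTORS, "Ateny", "Grecja", "Bagdad"))
    print(analogy_accuracy(TOY_VECTORS, questions))
```

With real embeddings, the dictionary would be replaced by vectors loaded from one of the models listed above, and lines containing out-of-vocabulary words would need to be skipped or counted as errors.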