https://github.com/natasha/corus
Links to Russian corpora + Python functions for loading and parsing
- Host: GitHub
- URL: https://github.com/natasha/corus
- Owner: natasha
- License: mit
- Created: 2019-04-26T08:00:10.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-07-24T08:53:32.000Z (over 1 year ago)
- Last Synced: 2025-01-10T17:13:00.040Z (11 days ago)
- Topics: corpora, datasets, nlp, python, russian
- Language: Jupyter Notebook
- Homepage:
- Size: 1 MB
- Stars: 288
- Watchers: 18
- Forks: 21
- Open Issues: 66
Metadata Files:
- Readme: README.md
- License: LICENSE
![CI](https://github.com/natasha/corus/actions/workflows/test.yml/badge.svg)
Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.
## Usage
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):
```bash
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
```

Use `corus` to load the data:

```python
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
```

Iterate over texts:

```python
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
```

For links to other datasets and their loaders see the Reference section.
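
Since `next(records)` works on the result of `load_lenta`, the loader behaves as an iterator, which suggests records can be consumed lazily instead of being collected into a list first. The sketch below counts topics over the first few records with `itertools.islice`. To stay runnable without the archive it uses `fake_load_lenta`, a hypothetical stand-in that yields namedtuples shaped like `LentaRecord`; with the real data one would pass `load_lenta(path)` instead.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in for the LentaRecord shown above, so the sketch runs without the dataset.
LentaRecord = namedtuple('LentaRecord', ['url', 'title', 'text', 'topic', 'tags'])


def fake_load_lenta():
    """Hypothetical replacement for load_lenta(path): yields records lazily."""
    topics = ['Россия', 'Мир', 'Россия', 'Спорт']
    for index, topic in enumerate(topics):
        yield LentaRecord(
            url='https://example.com/%d' % index,
            title='Title %d' % index,
            text='Text %d' % index,
            topic=topic,
            tags=None,
        )


def count_topics(records, limit):
    """Count topics over the first `limit` records without reading the rest."""
    return Counter(record.topic for record in islice(records, limit))


records = fake_load_lenta()  # with the real data: load_lenta(path)
topic_counts = count_topics(records, limit=3)
print(topic_counts.most_common())
```

Only the first three records are pulled from the iterator here; the remaining ones stay unread until something asks for them.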
## Documentation
Materials are in Russian:
* Corus page on natasha.github.io
* Corus section of Datafest 2020 talk

## Install

`corus` supports Python 3.5+, PyPy 3.
```bash
$ pip install corus
```

## Reference

| Dataset | API `from corus import` | Tags | Texts | Uncompressed | Description |
|---|---|---|---|---|---|
| Lenta.ru v1.0 | `load_lenta` | news | 739 351 | 1.66 Gb | `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz` |
| Lenta.ru v1.1+ | | news | 800 975 | 1.94 Gb | `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2` |
| | | fiction | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop<br>`wget http://panchenko.me/data/russe/librusec_fb2.plain.gz` |
| | | news | 1 003 869 | 3.70 Gb | `wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz` |
| Mokoron Russian Twitter Corpus | | social, sentiment | 17 633 417 | 1.86 Gb | Russian Twitter sentiment markup<br>Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| | | | 1 541 401 | 12.94 Gb | Russian Wiki dump<br>`wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2` |
| | | | 162 372 | 30.04 Mb | `wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip`<br>`unzip master.zip`<br>`mv GramEval2020-master/dataTrain train`<br>`mv GramEval2020-master/dataOpenTest dev`<br>`rm -r master.zip GramEval2020-master`<br>`wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu` |
| | | morph | 4 030 | 20.21 Mb | `wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip` |
| RusVectores SimLex-965 | | emb, sim | | | `wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv`<br>`wget https://rusvectores.org/static/testsets/ru_simlex965.tsv` |
| | | morph, web, fiction | | 489.62 Gb | Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf<br>Manually download http://bit.ly/2ZT4BY9 |
| | | ner, news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition<br>`wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip`<br>`unzip master.zip`<br>`rm master.zip` |
| | | ner, news | 97 | 455.02 Kb | Manual PER, ORG markup (no LOC)<br>Email Rinat Gareev ([email protected]) ask for dataset<br>`tar -xvf rus-ner-news-corpus.iob.tar.gz`<br>`rm rus-ner-news-corpus.iob.tar.gz` |
| | | ner, news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup<br>`wget http://www.labinform.ru/pub/named_entities/collection5.zip`<br>`unzip collection5.zip`<br>`rm collection5.zip` |
| | | ner | 203 287 | 36.15 Mb | Sentences from Wiki auto-annotated with PER, LOC, ORG tags<br>`wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2` |
| | | ner | 464 | 1.16 Mb | Markup prepared for 2019 BSNLP Shared Task<br>`wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip`<br>`wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip`<br>`unzip TRAININGDATA_BSNLP_2019_shared_task.zip`<br>`unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg`<br>`rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip` |
| | | ner, news | 1 000 | 2.96 Mb | Same as Collection5, only PER markup + normalized names<br>`wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip` |
| The Russian Drug Reaction Corpus (RuDReC) | | ner | 4 809 | 1.73 Kb | RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC.<br>`wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json` |
| | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks<br>`wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz`<br>`tar -xzvf retagged_taiga.tar.gz` |
| Arzamas | | news | 311 | 4.50 Mb | |
| Fontanka | | news | 342 683 | 786.23 Mb | |
| Interfax | | news | 46 429 | 77.55 Mb | |
| KP | | news | 45 503 | 61.79 Mb | |
| Lenta | | news | 36 446 | 95.15 Mb | |
| Taiga/N+1 | | news | 7 696 | 24.96 Mb | |
| Magazines | | | 39 890 | 2.19 Gb | |
| Subtitles | | | 19 011 | 909.08 Mb | |
| Social | | social | 1 876 442 | 648.18 Mb | |
| Proza | | fiction | 1 732 434 | 38.25 Gb | |
| Stihi | | | 9 157 686 | 12.80 Gb | |
| | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| News | | news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2` |
| Webhose | | news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2` |
| | | | | | Several news sites scraped by members of #proj_news_viz ODS project. |
| Interfax | | news | 543 961 | 1.22 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz` |
| Gazeta | | news | 865 847 | 1.63 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz` |
| Izvestia | | news | 86 601 | 307.19 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz` |
| Meduza | | news | 71 806 | 270.11 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz` |
| RIA | | news | 101 543 | 233.88 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz` |
| Russia Today | | news | 106 644 | 187.12 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz` |
| TASS | | news | 1 135 635 | 3.27 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz` |
| GSD | | morph, syntax | 5 030 | 1.01 Mb | `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu` |
| Taiga | | morph, syntax | 3 264 | 353.80 Kb | `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu` |
| PUD | | morph, syntax | 1 000 | 207.78 Kb | `wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu` |
| SynTagRus | | morph, syntax | 61 889 | 11.33 Mb | `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu` |
| General Internet-Corpus | | morph | 83 148 | 10.58 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip`<br>`unzip GIKRYA_texts_new.zip`<br>`rm GIKRYA_texts_new.zip` |
| Russian National Corpus | | morph | 98 892 | 12.71 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar`<br>`unrar x RNC_texts.rar`<br>`rm RNC_texts.rar` |
| OpenCorpora | | morph | 38 510 | 4.80 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar`<br>`unrar x OpenCorpora_Texts.rar`<br>`rm OpenCorpora_Texts.rar` |
| RUSSE Russian Semantic Relatedness | | | | | |
| HJ: Human Judgements of Word Pairs | | emb, sim | | | `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv` |
| RT: Synonyms and Hypernyms from the Thesaurus RuThes | | emb, sim | | | `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv` |
| AE: Cognitive Associations from the Sociation.org Experiment | | emb, sim | | | `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv`<br>`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv`<br>`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv` |
| Lexical Relations from the Wisdom of the Crowd (LRWC) | | emb, sim | | | `wget https://tlk.s3.yandex.net/dataset/LRWC.zip`<br>`unzip LRWC.zip`<br>`rm LRWC.zip` |
| The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) | | social | 9 515 | 2.09 Mb | This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020<br>`wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip`<br>`unzip RuADReCT.zip`<br>`rm RuADReCT.zip` |

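
Several of the `morph` and `syntax` entries above are downloaded as `.conllu` files. As a rough idea of what such a file contains, the sketch below reads the standard ten-column, tab-separated CoNLL-U layout with plain Python. `Token` and `parse_conllu` are hypothetical names used for illustration only; they are not the loaders `corus` ships, and details such as multiword tokens and empty nodes are not handled.

```python
from collections import namedtuple

# Hypothetical token type covering a subset of the ten CoNLL-U columns.
Token = namedtuple('Token', ['id', 'form', 'lemma', 'upos', 'head', 'deprel'])


def parse_conllu(lines):
    """Yield one list of tokens per sentence from CoNLL-U formatted lines."""
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith('#'):
            # Comment lines carry metadata such as sent_id or text.
            continue
        else:
            columns = line.split('\t')
            sentence.append(Token(
                id=columns[0], form=columns[1], lemma=columns[2],
                upos=columns[3], head=columns[6], deprel=columns[7],
            ))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


sample = [
    '# text = Мама мыла раму',
    '1\tМама\tмама\tNOUN\t_\t_\t2\tnsubj\t_\t_',
    '2\tмыла\tмыть\tVERB\t_\t_\t0\troot\t_\t_',
    '3\tраму\tрама\tNOUN\t_\t_\t2\tobj\t_\t_',
    '',
]
for sentence in parse_conllu(sample):
    print([(token.form, token.upos, token.deprel) for token in sentence])
```

For a downloaded file, the same generator could be fed an open file object, for example `open('ru_gsd-ud-dev.conllu', encoding='utf8')`.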
## Support
- Chat: https://t.me/natural_language_processing
- Issues: https://github.com/natasha/corus/issues
- Commercial support: https://lab.alexkuk.ru

## Add new source

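
Judging by the `load_lenta` usage earlier in this README, a source boils down to a record type plus a generator function that parses a downloaded file. The following is a minimal, self-contained sketch of that shape. `ExampleRecord`, `parse_example` and `load_example` are hypothetical names, the column names are made up, and the stdlib-only reading code is an assumption rather than the helpers `corus` uses internally.

```python
import csv
import gzip
import io
from collections import namedtuple

# Hypothetical record type; the real project defines its own record classes.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def parse_example(lines):
    """Turn an iterable of CSV lines (with a header row) into records."""
    for row in csv.DictReader(lines):
        yield ExampleRecord(row['url'], row['title'], row['text'])


def load_example(path):
    """Lazily read a gzip-compressed CSV dump, one record at a time."""
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        yield from parse_example(file)


# Quick check of the parser on in-memory data instead of a real archive.
sample = io.StringIO('url,title,text\nhttps://example.com/1,Title,Body\n')
print(next(parse_example(sample)))
```

Keeping parsing separate from file handling makes the parser easy to exercise on a few in-memory lines, while the loader stays a thin wrapper around the archive.
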
1. Implement `corus/sources/.py`
2. Add import into `corus/sources/__init__.py`
3. Add meta into `corus/sources/meta.py`
4. Add example into `docs.ipynb` (check meta table is correct)
5. Run tests (readme is updated)

## Development

Dev env
```bash
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus
```

Lint + update docs

```bash
make lint
make exec-docs
```

Release

```bash
# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
```