https://github.com/natasha/corus

Links to Russian corpora + Python functions for loading and parsing
corpora datasets nlp python russian

![CI](https://github.com/natasha/corus/actions/workflows/test.yml/badge.svg)

Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

## Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):
```bash
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
```

Use `corus` to load the data:

```python
>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
```

Iterate over texts:

```python
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...

```
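
Because `load_lenta` hands back records one at a time (the first example pulls a single record with `next`), simple statistics can be gathered without holding the whole dump in memory. A minimal sketch, assuming every record has a `topic` attribute like the `LentaRecord` shown earlier; `count_topics` and its `limit` parameter are hypothetical helpers, not part of `corus`:

```python
from collections import Counter
from itertools import islice


def count_topics(records, limit=None):
    """Count records per topic, reading at most `limit` records."""
    # islice keeps the scan lazy: nothing beyond `limit` is read.
    # limit=None means "read everything".
    sample = islice(records, limit)
    return Counter(record.topic for record in sample)
```

With the archive downloaded earlier, this could be called as `count_topics(load_lenta(path), limit=10000)` to get a quick look at the topic distribution.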

For links to other datasets and their loaders see the Reference section.

## Documentation

Materials are in Russian:

* Corus page on natasha.github.io
* Corus section of Datafest 2020 talk

## Install

`corus` supports Python 3.5+ and PyPy 3.

```bash
$ pip install corus
```

## Reference

| Dataset | API `from corus import` | Tags | Texts | Uncompressed | Description |
| --- | --- | --- | --- | --- | --- |
| **Lenta.ru** | | | | | |
| Lenta.ru v1.0 | `load_lenta` | news | 739 351 | 1.66 Gb | `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz` |
| Lenta.ru v1.1+ | `load_lenta2` | news | 800 975 | 1.94 Gb | `wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2` |
| Lib.rus.ec | `load_librusec` | fiction | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop<br>`wget http://panchenko.me/data/russe/librusec_fb2.plain.gz` |
| Rossiya Segodnya | `load_ria_raw`, `load_ria` | news | 1 003 869 | 3.70 Gb | `wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz` |
| Mokoron Russian Twitter Corpus | `load_mokoron` | social, sentiment | 17 633 417 | 1.86 Gb | Russian Twitter sentiment markup<br>Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| Wikipedia | `load_wiki` | | 1 541 401 | 12.94 Gb | Russian Wiki dump<br>`wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2` |
| GramEval2020 | `load_gramru` | | 162 372 | 30.04 Mb | `wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip`<br>`unzip master.zip`<br>`mv GramEval2020-master/dataTrain train`<br>`mv GramEval2020-master/dataOpenTest dev`<br>`rm -r master.zip GramEval2020-master`<br>`wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu` |
| OpenCorpora | `load_corpora` | morph | 4 030 | 20.21 Mb | `wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip` |
| RusVectores SimLex-965 | `load_simlex` | emb, sim | | | `wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv`<br>`wget https://rusvectores.org/static/testsets/ru_simlex965.tsv` |
| Omnia Russica | `load_omnia` | morph, web, fiction | | 489.62 Gb | Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf<br>Manually download http://bit.ly/2ZT4BY9 |
| factRuEval-2016 | `load_factru` | ner, news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition<br>`wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip`<br>`unzip master.zip`<br>`rm master.zip` |
| Gareev | `load_gareev` | ner, news | 97 | 455.02 Kb | Manual PER, ORG markup (no LOC)<br>Email Rinat Gareev ([email protected]) and ask for the dataset<br>`tar -xvf rus-ner-news-corpus.iob.tar.gz`<br>`rm rus-ner-news-corpus.iob.tar.gz` |
| Collection5 | `load_ne5` | ner, news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup<br>`wget http://www.labinform.ru/pub/named_entities/collection5.zip`<br>`unzip collection5.zip`<br>`rm collection5.zip` |
| WiNER | `load_wikiner` | ner | 203 287 | 36.15 Mb | Sentences from Wiki auto annotated with PER, LOC, ORG tags<br>`wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2` |
| BSNLP-2019 | `load_bsnlp` | ner | 464 | 1.16 Mb | Markup prepared for 2019 BSNLP Shared Task<br>`wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip`<br>`wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip`<br>`unzip TRAININGDATA_BSNLP_2019_shared_task.zip`<br>`unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg`<br>`rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip` |
| Persons-1000 | `load_persons` | ner, news | 1 000 | 2.96 Mb | Same as Collection5, only PER markup + normalized names<br>`wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip` |
| The Russian Drug Reaction Corpus (RuDReC) | `load_rudrec` | ner | 4 809 | 1.73 Kb | RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC.<br>`wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json` |
| **Taiga** | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks<br>`wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz`<br>`tar -xzvf retagged_taiga.tar.gz` |
| Arzamas | `load_taiga_arzamas` | news | 311 | 4.50 Mb | |
| Fontanka | `load_taiga_fontanka` | news | 342 683 | 786.23 Mb | |
| Interfax | `load_taiga_interfax` | news | 46 429 | 77.55 Mb | |
| KP | `load_taiga_kp` | news | 45 503 | 61.79 Mb | |
| Lenta | `load_taiga_lenta` | news | 36 446 | 95.15 Mb | |
| Taiga/N+1 | `load_taiga_nplus1` | news | 7 696 | 24.96 Mb | |
| Magazines | `load_taiga_magazines` | | 39 890 | 2.19 Gb | |
| Subtitles | `load_taiga_subtitles` | | 19 011 | 909.08 Mb | |
| Social | `load_taiga_social` | social | 1 876 442 | 648.18 Mb | |
| Proza | `load_taiga_proza` | fiction | 1 732 434 | 38.25 Gb | |
| Stihi | `load_taiga_stihi` | | 9 157 686 | 12.80 Gb | |
| **Russian NLP Datasets** | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| News | `load_buriy_news` | news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2`<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2` |
| Webhose | `load_buriy_webhose` | news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.<br>`wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2` |
| **ODS #proj_news_viz** | | | | | Several news sites scraped by members of #proj_news_viz ODS project. |
| Interfax | `load_ods_interfax` | news | 543 961 | 1.22 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz` |
| Gazeta | `load_ods_gazeta` | news | 865 847 | 1.63 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz` |
| Izvestia | `load_ods_izvestia` | news | 86 601 | 307.19 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz` |
| Meduza | `load_ods_meduza` | news | 71 806 | 270.11 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz` |
| RIA | `load_ods_ria` | news | 101 543 | 233.88 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz` |
| Russia Today | `load_ods_rt` | news | 106 644 | 187.12 Mb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz` |
| TASS | `load_ods_tass` | news | 1 135 635 | 3.27 Gb | `wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz` |
| **Universal Dependencies** | | | | | |
| GSD | `load_ud_gsd` | morph, syntax | 5 030 | 1.01 Mb | `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu` |
| Taiga | `load_ud_taiga` | morph, syntax | 3 264 | 353.80 Kb | `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu` |
| PUD | `load_ud_pud` | morph, syntax | 1 000 | 207.78 Kb | `wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu` |
| SynTagRus | `load_ud_syntag` | morph, syntax | 61 889 | 11.33 Mb | `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu`<br>`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu` |
| **morphoRuEval-2017** | | | | | |
| General Internet-Corpus | `load_morphoru_gicrya` | morph | 83 148 | 10.58 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip`<br>`unzip GIKRYA_texts_new.zip`<br>`rm GIKRYA_texts_new.zip` |
| Russian National Corpus | `load_morphoru_rnc` | morph | 98 892 | 12.71 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar`<br>`unrar x RNC_texts.rar`<br>`rm RNC_texts.rar` |
| OpenCorpora | `load_morphoru_corpora` | morph | 38 510 | 4.80 Mb | `wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar`<br>`unrar x OpenCorpora_Texts.rar`<br>`rm OpenCorpora_Texts.rar` |
| **RUSSE Russian Semantic Relatedness** | | | | | |
| HJ: Human Judgements of Word Pairs | `load_russe_hj` | emb, sim | | | `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv` |
| RT: Synonyms and Hypernyms from the Thesaurus RuThes | `load_russe_rt` | emb, sim | | | `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv` |
| AE: Cognitive Associations from the Sociation.org Experiment | `load_russe_ae` | emb, sim | | | `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv`<br>`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv`<br>`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv` |
| **Toloka Datasets** | | | | | |
| Lexical Relations from the Wisdom of the Crowd (LRWC) | `load_toloka_lrwc` | emb, sim | | | `wget https://tlk.s3.yandex.net/dataset/LRWC.zip`<br>`unzip LRWC.zip`<br>`rm LRWC.zip` |
| The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) | `load_ruadrect` | social | 9 515 | 2.09 Mb | This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020<br>`wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip`<br>`unzip RuADReCT.zip`<br>`rm RuADReCT.zip` |
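
The Uncompressed figures help to judge disk space before downloading. A small sketch that converts the size strings used in the reference (`Kb`, `Mb`, `Gb`) to bytes and sums them. `parse_size` is a hypothetical helper, not part of `corus`, and the 1024-based multipliers are an assumption, since the reference does not say whether the units are binary or decimal:

```python
# Assumed binary multipliers; switch to powers of 1000 if the units are decimal.
UNITS = {'Kb': 1024, 'Mb': 1024 ** 2, 'Gb': 1024 ** 3}


def parse_size(value):
    """Convert a string like '1.66 Gb' to a number of bytes."""
    number, unit = value.split()
    return int(float(number) * UNITS[unit])


# Sizes copied from the reference: Lenta.ru v1.0, Wikipedia, Collection5.
sizes = ['1.66 Gb', '12.94 Gb', '2.96 Mb']
total = sum(parse_size(size) for size in sizes)
print('{:.2f} Gb'.format(total / UNITS['Gb']))
```

The sketch uses `str.format` rather than f-strings so that it also runs on Python 3.5.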

## Support

- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru

## Add new source

1. Implement `corus/sources/<name>.py`
2. Add import into `corus/sources/__init__.py`
3. Add meta into `corus/sources/meta.py`
4. Add example into `docs.ipynb` (check meta table is correct)
5. Run tests (readme is updated)
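
Step 1 can be pictured roughly as follows. This is only a sketch of the general shape (a record type plus a lazy `load_*` function), assuming a gzipped CSV source. `ExampleRecord`, `load_example` and the column names are hypothetical, and the real modules in the repository may rely on the project's own helper classes instead:

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; field names are assumptions for illustration.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield one ExampleRecord per row of a gzipped CSV file."""
    # newline='' lets the csv module handle line breaks inside quoted fields.
    with gzip.open(path, mode='rt', encoding='utf-8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])
```

Yielding records one by one keeps memory use flat on multi-gigabyte dumps and matches how `load_lenta` is consumed in the Usage section.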

## Development

Dev env

```bash
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus
```

Lint + update docs

```bash
make lint
make exec-docs
```

Release

```bash
# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
```