{"id":13935770,"url":"https://github.com/SergeyShk/ruTS","last_synced_at":"2025-07-19T21:30:35.925Z","repository":{"id":36678929,"uuid":"229525642","full_name":"SergeyShk/ruTS","owner":"SergeyShk","description":"Библиотека для извлечения статистик из текстов на русском языке.","archived":false,"fork":false,"pushed_at":"2023-01-21T17:00:42.000Z","size":4335,"stargazers_count":120,"open_issues_count":1,"forks_count":21,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-19T12:54:40.777Z","etag":null,"topics":["computational-linguistics","natural-language-processing","nlp","russian-specific","text-analytics"],"latest_commit_sha":null,"homepage":"https://sergeyshk.github.io/ruTS/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SergeyShk.png","metadata":{"files":{"readme":"README.en.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-22T06:04:57.000Z","updated_at":"2025-06-26T06:06:26.000Z","dependencies_parsed_at":"2023-02-12T09:45:54.302Z","dependency_job_id":null,"html_url":"https://github.com/SergeyShk/ruTS","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/SergeyShk/ruTS","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SergeyShk%2FruTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SergeyShk%2FruTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SergeyShk%2FruTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SergeyShk%2FruTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SergeyShk","download_url
":"https://codeload.github.com/SergeyShk/ruTS/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SergeyShk%2FruTS/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266019657,"owners_count":23864916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-linguistics","natural-language-processing","nlp","russian-specific","text-analytics"],"created_at":"2024-08-07T23:02:04.939Z","updated_at":"2025-07-19T21:30:35.919Z","avatar_url":"https://github.com/SergeyShk.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Russian Texts Statistics (ruTS) [![README_RU](https://raw.githubusercontent.com/gosquared/flags/master/flags/flags/flat/24/Russia.png)](https://github.com/SergeyShk/ruTS/blob/master/README.md) ![README_EN](https://raw.githubusercontent.com/gosquared/flags/master/flags/flags/flat/24/United-Kingdom.png)\n\n![Version](https://img.shields.io/pypi/v/ruTS?logo=pypi\u0026logoColor=FFE873)\n[![Supported Python versions](https://img.shields.io/pypi/pyversions/ruts.svg?logo=python\u0026logoColor=FFE873)](https://pypi.org/project/ruts/)\n![Downloads](https://img.shields.io/pypi/dm/ruTS)\n[![Build 
Status](https://travis-ci.com/SergeyShk/ruTS.svg?branch=master)](https://travis-ci.com/SergeyShk/ruTS)\n[![codecov](https://codecov.io/gh/SergeyShk/ruTS/branch/master/graph/badge.svg)](https://codecov.io/gh/SergeyShk/ruTS)\n![Status](https://img.shields.io/pypi/status/ruts)\n[![License](https://img.shields.io/github/license/sergeyshk/ruts.svg)](LICENSE.txt)\n![Repo size](https://img.shields.io/github/repo-size/SergeyShk/ruTS)\n![Codacy grade](https://img.shields.io/codacy/grade/5e1cef0e2fa64bdc835f7bfcb7996edc.svg?logo=codacy)\n\n\u003cp align=\"center\"\u003e \n\u003cimg src=\"https://clipartart.com/images/free-tree-roots-clipart-black-and-white-2.png\"\u003e\n\u003c/p\u003e\n\nA library for extracting statistics from texts in Russian.\n\n## Installation\n\nRun the following command:\n\n```bash\n$ pip install ruts\n```\n\nDependencies:\n\n*   Python 3.8-3.10\n*   nltk\n*   pymorphy2\n*   razdel\n*   scipy\n*   spaCy\n*   numpy\n*   pandas\n*   matplotlib\n*   graphviz\n\n## Usage\n\nThe main functions are based on the [textacy](https://github.com/chartbeat-labs/textacy) statistics adapted to the Russian language. 
The library works with both raw texts and Doc objects of the [spaCy](https://github.com/explosion/spaCy) library.\n\nAn [API](https://ruts-api.herokuapp.com/docs) is available for exploring the functions.\n\n### Object extraction\n\nThe library lets you create your own tools for extracting sentences and words from a text, which can then be used to compute statistics.\n\nExample:\n\n```python\nimport re\nfrom nltk.corpus import stopwords\nfrom ruts import SentsExtractor, WordsExtractor\ntext = \"Не имей 100 рублей, а имей 100 друзей\"\nse = SentsExtractor(tokenizer=re.compile(r', '))\nse.extract(text)\n\n    ('Не имей 100 рублей', 'а имей 100 друзей')\n\nwe = WordsExtractor(use_lexemes=True, stopwords=stopwords.words('russian'), filter_nums=True, ngram_range=(1, 2))\nwe.extract(text)\n\n    ('иметь', 'рубль', 'иметь', 'друг', 'иметь_рубль', 'рубль_иметь', 'иметь_друг')\n   \nwe.get_most_common(3)\n\n    [('иметь', 2), ('рубль', 1), ('друг', 1)]\n```\n\n### Basic statistics\n\nThe library extracts the following statistics from a text:\n\n*   the number of sentences\n*   the number of words\n*   the number of unique words\n*   the number of long words\n*   the number of complex words\n*   the number of simple words\n*   the number of monosyllabic words\n*   the number of polysyllabic words\n*   the number of symbols\n*   the number of letters\n*   the number of spaces\n*   the number of syllables\n*   the number of punctuation marks\n*   word distribution by the number of letters\n*   word distribution by the number of syllables\n\nExample:\n\n```python\nfrom ruts import BasicStats\ntext = \"Существуют три вида лжи: ложь, наглая ложь и статистика\"\nbs = BasicStats(text)\nbs.get_stats()\n\n    {'c_letters': {1: 1, 3: 2, 4: 3, 6: 1, 10: 2},\n    'c_syllables': {1: 5, 2: 1, 3: 1, 4: 2},\n    'n_chars': 55,\n    'n_complex_words': 2,\n    'n_letters': 45,\n    'n_long_words': 3,\n    'n_monosyllable_words': 5,\n    'n_polysyllable_words': 4,\n    
'n_punctuations': 2,\n    'n_sents': 1,\n    'n_simple_words': 7,\n    'n_spaces': 8,\n    'n_syllables': 18,\n    'n_unique_words': 8,\n    'n_words': 9}\n\nbs.print_stats()\n\n        Статистика     | Значение \n    ------------------------------\n    Предложения         |    1     \n    Слова               |    9     \n    Уникальные слова    |    8     \n    Длинные слова       |    3     \n    Сложные слова       |    2     \n    Простые слова       |    7     \n    Односложные слова   |    5     \n    Многосложные слова  |    4     \n    Символы             |    55    \n    Буквы               |    45    \n    Пробелы             |    8     \n    Слоги               |    18\n    Знаки препинания    |    2\n```\n\n### Readability metrics\n\nThe library computes the following readability metrics:\n\n*   Flesch Reading Ease\n*   Flesch-Kincaid Grade Level\n*   Coleman-Liau Index\n*   SMOG Index\n*   Automated Readability Index\n*   LIX readability measure\n\nCoefficients for the Russian language were borrowed from the [Plain Russian Language](https://github.com/infoculture/plainrussian) project, which derives readability coefficients from a special corpus of texts with age labels.\n\nExample:\n\n```python\nfrom ruts import ReadabilityStats\ntext = \"Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать\"\nrs = ReadabilityStats(text)\nrs.get_stats()\n\n    {'automated_readability_index': 0.2941666666666656,\n    'coleman_liau_index': 0.2941666666666656,\n    'flesch_kincaid_grade': 3.4133333333333304,\n    'flesch_reading_easy': 83.16166666666666,\n    'lix': 48.333333333333336,\n    'smog_index': 0.05}\n\nrs.print_stats()\n\n                    Метрика                 | Значение \n    --------------------------------------------------\n    Тест Флеша-Кинкайда                     |   3.41   \n    Индекс удобочитаемости Флеша            |  83.16   \n    Индекс Колман-Лиау                      |   0.29   \n    Индекс 
SMOG                             |   0.05   \n    Автоматический индекс удобочитаемости   |   0.29   \n    Индекс удобочитаемости LIX              |  48.33  \n```\n\n### Lexical diversity metrics\n\nThe library allows counting the following lexical diversity metrics for a text:\n\n*   Type-Token Ratio (TTR)\n*   Root Type-Token Ratio (RTTR)\n*   Corrected Type-Token Ratio (CTTR)\n*   Herdan Type-Token Ratio (HTTR)\n*   Summer Type-Token Ratio (STTR)\n*   Mass Type-Token Ratio (MTTR)\n*   Dugast Type-Token Ratio (DTTR)\n*   Moving Average Type-Token Ratio (MATTR)\n*   Mean Segmental Type-Token Ratio (MSTTR)\n*   Measure of Textual Lexical Diversity (MTLD)\n*   Moving Average Measure of Textual Lexical Diversity (MAMTLD)\n*   Hypergeometric Distribution D (HD-D)\n*   Simpson's Diversity Index\n*   Hapax Legomena Index\n\nSome of the implementations were borrowed from the [lexical_diversity](https://github.com/kristopherkyle/lexical_diversity) project.\n\nExample:\n\n```python\nfrom ruts import DiversityStats\ntext = \"Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать\"\nds = DiversityStats(text)\nds.get_stats()\n\n    {'ttr': 0.7333333333333333,\n    'rttr': 2.840187787218772,\n    'cttr': 2.008316044185609,\n    'httr': 0.8854692840710253,\n    'sttr': 0.2500605793160845,\n    'mttr': 0.0973825075623254,\n    'dttr': 10.268784661968104,\n    'mattr': 0.7333333333333333,\n    'msttr': 0.7333333333333333,\n    'mtld': 15.0,\n    'mamtld': 11.875,\n    'hdd': -1,\n    'simpson_index': 21.0,\n    'hapax_index': 431.2334616537499}\n\nds.print_stats()\n\n                              Метрика                           | Значение \n    ----------------------------------------------------------------------\n    Type-Token Ratio (TTR)                                      |   0.92   \n    Root Type-Token Ratio (RTTR)                                |   7.17   \n    Corrected Type-Token Ratio (CTTR)                           |   5.07   \n    
Herdan Type-Token Ratio (HTTR)                              |   0.98   \n    Summer Type-Token Ratio (STTR)                              |   0.96   \n    Mass Type-Token Ratio (MTTR)                                |   0.01   \n    Dugast Type-Token Ratio (DTTR)                              |  85.82   \n    Moving Average Type-Token Ratio (MATTR)                     |   0.91   \n    Mean Segmental Type-Token Ratio (MSTTR)                     |   0.94   \n    Measure of Textual Lexical Diversity (MTLD)                 |  208.38  \n    Moving Average Measure of Textual Lexical Diversity (MTLD)  |   1.00   \n    Hypergeometric Distribution D (HD-D)                        |   0.94   \n    Индекс Симпсона                                             |  305.00  \n    Гапакс-индекс                                               | 2499.46  \n```\n\n### Morphological statistics\n\nThe library extracts the following morphological features:\n\n*   part of speech\n*   animacy\n*   aspect\n*   case\n*   gender\n*   involvement\n*   mood\n*   number\n*   person\n*   tense\n*   transitivity\n*   voice\n\nMorphological analysis is performed using [pymorphy2](https://github.com/kmike/pymorphy2). 
Descriptions of morphological features were borrowed from [OpenCorpora](http://opencorpora.org/dict.php?act=gram).\n\nExample:\n\n```python\nfrom ruts import MorphStats\ntext = \"Постарайтесь получить то, что любите, иначе придется полюбить то, что получили\"\nms = MorphStats(text)\nms.pos\n\n    ('VERB', 'INFN', 'CONJ', 'CONJ', 'VERB', 'ADVB', 'VERB', 'INFN', 'CONJ', 'CONJ', 'VERB')\n\nms.get_stats()\n\n    {'animacy': {None: 11},\n    'aspect': {None: 5, 'impf': 1, 'perf': 5},\n    'case': {None: 11},\n    'gender': {None: 11},\n    'involvement': {None: 10, 'excl': 1},\n    'mood': {None: 7, 'impr': 1, 'indc': 3},\n    'number': {None: 7, 'plur': 3, 'sing': 1},\n    'person': {None: 9, '2per': 1, '3per': 1},\n    'pos': {'ADVB': 1, 'CONJ': 4, 'INFN': 2, 'VERB': 4},\n    'tense': {None: 8, 'futr': 1, 'past': 1, 'pres': 1},\n    'transitivity': {None: 5, 'intr': 2, 'tran': 4},\n    'voice': {None: 11}}\n\nms.explain_text(filter_none=True)\n\n    (('Постарайтесь',\n        {'aspect': 'perf',\n        'involvement': 'excl',\n        'mood': 'impr',\n        'number': 'plur',\n        'pos': 'VERB',\n        'transitivity': 'intr'}),\n    ('получить', {'aspect': 'perf', 'pos': 'INFN', 'transitivity': 'tran'}),\n    ('то', {'pos': 'CONJ'}),\n    ('что', {'pos': 'CONJ'}),\n    ('любите',\n        {'aspect': 'impf',\n        'mood': 'indc',\n        'number': 'plur',\n        'person': '2per',\n        'pos': 'VERB',\n        'tense': 'pres',\n        'transitivity': 'tran'}),\n    ('иначе', {'pos': 'ADVB'}),\n    ('придется',\n        {'aspect': 'perf',\n        'mood': 'indc',\n        'number': 'sing',\n        'person': '3per',\n        'pos': 'VERB',\n        'tense': 'futr',\n        'transitivity': 'intr'}),\n    ('полюбить', {'aspect': 'perf', 'pos': 'INFN', 'transitivity': 'tran'}),\n    ('то', {'pos': 'CONJ'}),\n    ('что', {'pos': 'CONJ'}),\n    ('получили',\n        {'aspect': 'perf',\n        'mood': 'indc',\n        'number': 'plur',\n        'pos': 
'VERB',\n        'tense': 'past',\n        'transitivity': 'tran'}))\n\nms.print_stats('pos', 'tense')\n\n    ---------------Часть речи---------------\n    Глагол (личная форма)         |    4     \n    Союз                          |    4     \n    Глагол (инфинитив)            |    2     \n    Наречие                       |    1     \n\n    -----------------Время------------------\n    Неизвестно                    |    8     \n    Настоящее                     |    1     \n    Будущее                       |    1     \n    Прошедшее                     |    1 \n```\n\n### Datasets\n\nThe library provides access to several preprocessed datasets:\n\n*   sov_chrest_lit - Soviet reading books for literature classes\n*   stalin_works - the collected works of Stalin\n\nYou can work with texts alone (without title information) or with texts and their metadata. Texts can also be filtered by various criteria.\n\nExample:\n\n```python\nfrom pprint import pprint\nfrom ruts.datasets import SovChLit\nsc = SovChLit()\nsc.info\n\n    {'description': 'Корпус советских хрестоматий по литературе',\n    'url': 'https://dataverse.harvard.edu/file.xhtml?fileId=3670902\u0026version=DRAFT',\n    'Наименование': 'sov_chrest_lit'}\n\nfor i in sc.get_records(max_len=100, category='Весна', limit=1):\n    pprint(i)\n\n    {'author': 'Е. Трутнева',\n    'book': 'Родная речь. Книга для чтения в I классе начальной школы',\n    'category': 'Весна',\n    'file': PosixPath('../ruTS/ruts_data/texts/sov_chrest_lit/grade_1/155'),\n    'grade': 1,\n    'subject': 'Дождик',\n    'text': 'Дождик, дождик, поливай, будет хлеба каравай!\\n'\n            'Дождик, дождик, припусти, дай гороху подрасти!',\n    'type': 'Стихотворение',\n    'year': 1963}\n\nfor i in sc.get_texts(text_type='Басня', limit=1):\n    pprint(i)\n\n    ('— Соседка, слышала ль ты добрую молву? — вбежавши, крысе мышь сказала:\\n'\n    '— Ведь кошка, говорят, попалась в когти льву. 
Вот отдохнуть и нам пора '\n    'настала!\\n'\n    '— Не радуйся, мой свет,— ей крыса говорит в ответ,— и не надейся '\n    'по-пустому.\\n'\n    'Коль до когтей у них дойдёт, то, верно, льву не быть живому: сильнее кошки '\n    'зверя нет.')\n```\n\n### Visualization\n\nThe library visualizes text using the following plots:\n\n*   Zipf's law\n*   Literature Fingerprinting\n*   Word Tree\n\nExample:\n\n```python\nfrom collections import Counter\nfrom nltk.corpus import stopwords\nfrom ruts import WordsExtractor\nfrom ruts.datasets import SovChLit\nfrom ruts.visualizers import zipf\n\nsc = SovChLit()\ntext = '\\n'.join([text for text in sc.get_texts(limit=100)])\nwe = WordsExtractor(use_lexemes=True, stopwords=stopwords.words('russian'), filter_nums=True)\ntokens_with_count = Counter(we.extract(text))\nzipf(tokens_with_count, num_words=100, num_labels=10, log=False, show_theory=True, alpha=1.1)\n```\n\n### Components\n\nThe library provides the following classes as spaCy components:\n\n*   BasicStats\n*   DiversityStats\n*   MorphStats\n*   ReadabilityStats\n\nThe Russian-language spaCy model can be downloaded by running the command:\n\n```bash\n$ python -m spacy download ru_core_news_sm\n```\n\nExample:\n\n```python\nimport ruts\nimport spacy\nnlp = spacy.load('ru_core_news_sm')\nnlp.add_pipe('basic', last=True)\ndoc = nlp(\"Существуют три вида лжи: ложь, наглая ложь и статистика\")\ndoc._.basic.c_letters\n\n    {1: 1, 3: 2, 4: 3, 6: 1, 10: 2}\n\ndoc._.basic.get_stats()\n\n    {'c_letters': {1: 1, 3: 2, 4: 3, 6: 1, 10: 2},\n    'c_syllables': {1: 5, 2: 1, 3: 1, 4: 2},\n    'n_chars': 55,\n    'n_complex_words': 2,\n    'n_letters': 45,\n    'n_long_words': 3,\n    'n_monosyllable_words': 5,\n    'n_polysyllable_words': 4,\n    'n_punctuations': 2,\n    'n_sents': 1,\n    'n_simple_words': 7,\n    'n_spaces': 8,\n    'n_syllables': 18,\n    'n_unique_words': 8,\n    'n_words': 9}\n```\n\n## Project structure\n\n*   **docs** - project documentation\n*  
 **ruts**:\n    *   basic_stats.py - basic text statistics\n    *   components.py - spaCy components\n    *   constants.py - main constants\n    *   diversity_stats.py - lexical diversity metrics\n    *   extractors.py - tools for object extraction from a text\n    *   morph_stats.py - morphological statistics\n    *   readability_stats.py - readability metrics\n    *   utils.py - auxiliary tools\n    *   **datasets**:\n        *   dataset.py - base class for working with datasets\n        *   sov_chrest_lit.py - Soviet reading books for literature classes\n        *   stalin_works.py - the collected works of Stalin\n    *   **visualizers** - tools for text visualization:\n        *   fingerprinting.py - Literature Fingerprinting\n        *   word_tree.py - Word Tree\n        *   zipf.py - Zipf's law\n*   **tests**:\n    *   test_basic_stats.py - tests for basic text statistics\n    *   test_components.py - tests for spaCy components\n    *   test_diversity_stats.py - tests for lexical diversity metrics\n    *   test_extractors.py - tests for object extraction tools\n    *   test_morph_stats.py - tests for morphological statistics\n    *   test_readability_stats.py - tests for readability metrics\n    *   **datasets** - tests for datasets:\n        *   test_dataset.py - tests for the base class for working with datasets\n        *   test_sov_chrest_lit.py - tests for the Soviet reading books dataset\n        *   test_stalin_works.py - tests for the collected works of Stalin dataset\n    *   **visualizers** - tests for text visualization tools:\n        *   test_fingerprinting.py - tests for the Literature Fingerprinting visualization\n        *   test_word_tree.py - tests for the Word Tree visualization\n        *   test_zipf.py - tests for the Zipf's law visualization\n\n## Authors\n\n*   Sergey Shkarin (kouki.sergey@gmail.com)\n*   Ekaterina Smirnova (ekanerina@yandex.ru)\n\n## Attribution\n\nPlease use the following BibTeX entry for citing **ruTS** if 
you use it in your research or software.\nCitations are helpful for the continued development and maintenance of this library.\n\n```\n@software{ruTS,\n  author = {Sergey Shkarin},\n  title = {{ruTS, a library for statistics extraction from texts in Russian}},\n  year = 2023,\n  publisher = {Moscow},\n  url = {https://github.com/SergeyShk/ruTS}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSergeyShk%2FruTS","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSergeyShk%2FruTS","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSergeyShk%2FruTS/lists"}