{"id":13935765,"url":"https://github.com/natasha/navec","last_synced_at":"2025-04-05T13:05:28.252Z","repository":{"id":40967998,"uuid":"189425886","full_name":"natasha/navec","owner":"natasha","description":"Compact high quality word embeddings for Russian language","archived":false,"fork":false,"pushed_at":"2023-07-24T09:22:26.000Z","size":1946,"stargazers_count":166,"open_issues_count":4,"forks_count":16,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-04-27T23:37:13.261Z","etag":null,"topics":["embeddings","glove","nlp","python","quantization","russian","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/natasha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-30T14:13:18.000Z","updated_at":"2024-04-27T23:37:18.076Z","dependencies_parsed_at":"2024-04-27T23:37:15.454Z","dependency_job_id":"766d1919-97a7-46a8-b2ae-113285ecee04","html_url":"https://github.com/natasha/navec","commit_stats":{"total_commits":137,"total_committers":5,"mean_commits":27.4,"dds":0.03649635036496346,"last_synced_commit":"2e2bf10e7a1d14df611df101c8cdd01f18281158"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natasha%2Fnavec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natasha%2Fnavec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natasha%2Fnavec/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/natasha%2Fnavec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/natasha","download_url":"https://codeload.github.com/natasha/navec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247339155,"owners_count":20923014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","glove","nlp","python","quantization","russian","word2vec"],"created_at":"2024-08-07T23:02:04.734Z","updated_at":"2025-04-05T13:05:28.236Z","avatar_url":"https://github.com/natasha.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n\u003cimg src=\"https://github.com/natasha/natasha-logos/blob/master/navec.svg\"\u003e\n\n![CI](https://github.com/natasha/navec/actions/workflows/test.yml/badge.svg)\n\nNavec is a library of pretrained word embeddings for Russian language. 
It shows competitive or better results than \u003ca href=\"http://rusvectores.org\"\u003eRusVectores\u003c/a\u003e, loads ~10 times faster (~1 sec), and takes ~10 times less space (~50 MB).\n\n\u003e Navec = large Russian text datasets + vanilla GloVe + quantization\n\n## Downloads\n\nHow to read a model filename:\n```\nnavec_hudlit_v1_12B_500K_300d_100q.tar\n                 |    |    |    |\n                 |    |    |     ---- 100 dimensions after quantization\n                 |    |     --------- original vectors have 300 dimensions\n                 |     -------------- vocab size is 500 000 words + 2 for \u003cunk\u003e, \u003cpad\u003e\n                  ------------------- dataset of 12 billion tokens was used\n```\n\nCurrently two models are published:\n\u003ctable\u003e\n\n\u003ctr\u003e\n\u003cth\u003eModel\u003c/th\u003e\n\u003cth\u003eSize\u003c/th\u003e\n\u003cth\u003eDescription\u003c/th\u003e\n\u003cth\u003eSources\u003c/th\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd\u003e\n  \u003ca href=\"https://storage.yandexcloud.net/natasha-navec/packs/navec_hudlit_v1_12B_500K_300d_100q.tar\"\u003enavec_hudlit_v1_12B_500K_300d_100q.tar\u003c/a\u003e\n  \u003ca name=\"hudlit\"\u003e\u003c/a\u003e\u003ca href=\"#hudlit\"\u003e\u003ccode\u003e#\u003c/code\u003e\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd\u003e50MB\u003c/td\u003e\n\u003ctd\u003e\n  Should be used by default. Shows best results on \u003ca href=\"#evaluation\"\u003eintrinsic evaluations\u003c/a\u003e. 
Model was trained on a large corpus of Russian literature (~150GB).\n\u003c/td\u003e\n\u003ctd\u003e\n  \u003ca href=\"https://github.com/natasha/corus#load_librusec\"\u003e\u003ccode\u003elibrusec\u003c/code\u003e\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003ctr\u003e\n\u003ctd\u003e\n  \u003ca href=\"https://storage.yandexcloud.net/natasha-navec/packs/navec_news_v1_1B_250K_300d_100q.tar\"\u003enavec_news_v1_1B_250K_300d_100q.tar\u003c/a\u003e\n  \u003ca name=\"news\"\u003e\u003c/a\u003e\u003ca href=\"#news\"\u003e\u003ccode\u003e#\u003c/code\u003e\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd\u003e25MB\u003c/td\u003e\n\u003ctd\u003e\n  Try this model on news texts. It is half the size of `hudlit` but covers the same 98% of words in news articles.\n\u003c/td\u003e\n\u003ctd\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_lenta\"\u003e\u003ccode\u003elenta\u003c/code\u003e\u003c/a\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_ria\"\u003e\u003ccode\u003eria\u003c/code\u003e\u003c/a\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_taiga_fontanka\"\u003e\u003ccode\u003etaiga_fontanka\u003c/code\u003e\u003c/a\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_buriy_news\"\u003e\u003ccode\u003eburiy_news\u003c/code\u003e\u003c/a\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_buriy_webhose\"\u003e\u003ccode\u003eburiy_webhose\u003c/code\u003e\u003c/a\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_ods_gazeta\"\u003e\u003ccode\u003eods_gazeta\u003c/code\u003e\u003c/a\u003e\n  \u003ca href=\"//github.com/natasha/corus#load_ods_interfax\"\u003e\u003ccode\u003eods_interfax\u003c/code\u003e\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\n\u003c/table\u003e\n\n## Installation\n\nNavec supports Python 3.7+ and PyPy 3.\n\n```bash\n$ pip install navec\n```\n\n## Usage\n\nFirst download `hudlit` embeddings (see the table above):\n```bash\nwget 
https://storage.yandexcloud.net/natasha-navec/packs/navec_hudlit_v1_12B_500K_300d_100q.tar\n```\n\nLoad the tar archive with `Navec.load`; it takes ~1s and ~100MB of RAM:\n```python\n\u003e\u003e\u003e from navec import Navec\n\n\u003e\u003e\u003e path = 'navec_hudlit_v1_12B_500K_300d_100q.tar'\n\u003e\u003e\u003e navec = Navec.load(path)\n```\n\nThen `navec` can be used as a dict object:\n```python\n\u003e\u003e\u003e navec['навек']\narray([ 0.3955571 ,  0.11600914,  0.24605067, -0.35206917, -0.08932345,\n        0.3382279 , -0.5457616 ,  0.07472657, -0.4753835 , -0.3330848 ,\n        ...\n\n\u003e\u003e\u003e 'нааавееек' in navec\nFalse\n\n\u003e\u003e\u003e navec.get('нааавееек')\nNone\n```\n\nTo get the index of a word, use `navec.vocab`:\n```python\n\u003e\u003e\u003e navec.vocab['навек']\n225823\n\n\u003e\u003e\u003e navec.vocab.get('наааавеeeк', navec.vocab.unk_id)\n500000   # == navec.vocab['\u003cunk\u003e']\n```\n\nThere are two special words in the vocab, `\u003cunk\u003e` and `\u003cpad\u003e`:\n```python\n\u003e\u003e\u003e navec['\u003cunk\u003e']\narray([ 3.69125791e-02,  9.32818875e-02,  2.01917738e-02, ...\n\n\u003e\u003e\u003e navec['\u003cpad\u003e']\narray([0., 0., 0., 0., 0., 0., ...\n\n```\n\nTo use Navec in a PyTorch model, there is a Slovnet module:\n```python\n\u003e\u003e\u003e import torch\n\u003e\u003e\u003e from slovnet.model.emb import NavecEmbedding\n\n\u003e\u003e\u003e emb = NavecEmbedding(navec)\n\u003e\u003e\u003e input = torch.tensor([1, 2, 0])\n\u003e\u003e\u003e output = emb(input)\n\n\u003e\u003e\u003e output.shape\ntorch.Size([3, 300])\n\n\u003e\u003e\u003e output\ntensor([[ 4.2000e-01,  3.6666e-01,  1.7728e-01, -3.8719e-01, -1.0762e-01,\n          1.6954e-01, -4.6063e-01,  5.4519e-01, -2.1212e-01,  2.0965e-01,\n          1.9658e-01,  2.7807e-01, -2.3802e-01,  3.5155e-01,  1.4491e-02,\n\t\t  ...\n```\n\n## Documentation\n\nMaterials are in Russian:\n\n* \u003ca href=\"https://natasha.github.io/navec\"\u003eNavec article on 
natasha.github.io\u003c/a\u003e \n* \u003ca href=\"https://youtu.be/-7XT_U6hVvk?t=1705\"\u003eSlovnet section of Datafest 2020 talk\u003c/a\u003e\n\n## Evaluation\n\nLet's compare Navec to the top 5 RusVectores models (based on \u003ca href=\"https://github.com/natasha/corus#load_simlex\"\u003e`simlex`\u003c/a\u003e and \u003ca href=\"https://github.com/natasha/corus#load_russe_hj\"\u003e`hj`\u003c/a\u003e eval datasets). In each column the top 3 results are highlighted.\n\n* `init` — time it takes to load the model file into RAM. The `tayga_upos_skipgram_300_2_2019` word2vec binary file takes 5 seconds to load with `gensim.KeyedVectors.load_word2vec_format`. The large ~2.7 GB fastText file `tayga_none_fasttextcbow_300_10_2019` takes 8 seconds. Navec `hudlit`, with a vocab 2 times larger than the previous two, takes 1 second.\n* `get` — time it takes to query the embedding vector for a single word. Word2vec models win here: to fetch a vector they basically do `dict.get`. FastText and Navec do extra computation for every query. FastText extracts and sums word ngrams, Navec unpacks the vector from the quantization table. In practice the cost of a query to the embeddings table is small compared to all other computation in the network.\n* `disk` — model file size. It is convenient for deployment and distribution to have small models. Notice that the `hudlit` model is 4-20 times smaller with a vocab 2 times bigger.\n* `ram` — space the model takes in RAM after loading. It is convenient to have a small memory footprint to fit more computation on a single server.\n* `vocab` — number of words in the vocab, i.e. the number of embedding vectors. Since the Navec vectors table takes less space, we can have a larger vocab. 
With a 500K vocab the `hudlit` model has a ~2% OOV rate on average.\n\n\u003c!--- emb1 ---\u003e\n\u003ctable border=\"0\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etype\u003c/th\u003e\n      \u003cth\u003einit, s\u003c/th\u003e\n      \u003cth\u003eget, µs\u003c/th\u003e\n      \u003cth\u003edisk, mb\u003c/th\u003e\n      \u003cth\u003eram, mb\u003c/th\u003e\n      \u003cth\u003evocab\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003ehudlit_12B_500K_300d_100q\u003c/th\u003e\n      \u003ctd\u003enavec\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e1.1\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e21.6\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e50.6\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e95.3\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e500K\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003enews_1B_250K_300d_100q\u003c/th\u003e\n      \u003ctd\u003enavec\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.8\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e20.7\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e25.4\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e47.7\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e250K\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eruscorpora_upos_cbow_300_20_2019\u003c/th\u003e\n      \u003ctd\u003ew2v\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e3.3\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e1.4\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e220.6\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e236.1\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e189K\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      
\u003cth\u003eruwikiruscorpora_upos_skipgram_300_2_2019\u003c/th\u003e\n      \u003ctd\u003ew2v\u003c/td\u003e\n      \u003ctd\u003e5.0\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e1.5\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e290.0\u003c/td\u003e\n      \u003ctd\u003e309.4\u003c/td\u003e\n      \u003ctd\u003e248K\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003etayga_upos_skipgram_300_2_2019\u003c/th\u003e\n      \u003ctd\u003ew2v\u003c/td\u003e\n      \u003ctd\u003e5.2\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e1.4\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e290.7\u003c/td\u003e\n      \u003ctd\u003e310.9\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e249K\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003etayga_none_fasttextcbow_300_10_2019\u003c/th\u003e\n      \u003ctd\u003efasttext\u003c/td\u003e\n      \u003ctd\u003e8.0\u003c/td\u003e\n      \u003ctd\u003e13.4\u003c/td\u003e\n      \u003ctd\u003e2741.9\u003c/td\u003e\n      \u003ctd\u003e2746.9\u003c/td\u003e\n      \u003ctd\u003e192K\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003earaneum_none_fasttextcbow_300_5_2018\u003c/th\u003e\n      \u003ctd\u003efasttext\u003c/td\u003e\n      \u003ctd\u003e16.4\u003c/td\u003e\n      \u003ctd\u003e10.6\u003c/td\u003e\n      \u003ctd\u003e2752.1\u003c/td\u003e\n      \u003ctd\u003e2754.7\u003c/td\u003e\n      \u003ctd\u003e195K\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c!--- emb1 ---\u003e\n\nNow let's look at intrinsic evaluation scores. Navec `hudlit` model does not show best results on all datasets but it is usually in top 3. 
We'll use 6 datasets:\n\n* \u003ca href=\"https://github.com/natasha/corus#load_simlex\"\u003e`simlex965`\u003c/a\u003e, \u003ca href=\"https://github.com/natasha/corus#load_russe_hj\"\u003e`hj`\u003c/a\u003e — two small datasets (965 and 398 tests respectively) used by RusVectores, see the \u003ca href=\"https://arxiv.org/abs/1801.06407\"\u003etheir paper\u003c/a\u003e for more info. Metric is spearman correlation, other datasets use average precision.\n* \u003ca href=\"https://github.com/natasha/corus#load_russe_rt\"\u003e`rt`\u003c/a\u003e, \u003ca href=\"https://github.com/natasha/corus#load_russe_ae\"\u003e`ae`\u003c/a\u003e, \u003ca href=\"https://github.com/natasha/corus#load_russe_ae\"\u003e`ae2`\u003c/a\u003e — large datasets (114066, 22919, 86772 tests) from RUSSE workshop, see \u003ca href=\"https://russe.nlpub.org/downloads/\"\u003eproject description\u003c/a\u003e for more.\n* \u003ca href=\"https://github.com/natasha/corus#load_toloka_lrwc\"\u003e`lrwc`\u003c/a\u003e — relatively new dataset by Yandex.Toloka, see \u003ca href=\"https://research.yandex.com/datasets/toloka\"\u003etheir page\u003c/a\u003e.\n\n\u003c!--- emb2 ---\u003e\n\u003ctable border=\"0\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etype\u003c/th\u003e\n      \u003cth\u003esimlex\u003c/th\u003e\n      \u003cth\u003ehj\u003c/th\u003e\n      \u003cth\u003ert\u003c/th\u003e\n      \u003cth\u003eae\u003c/th\u003e\n      \u003cth\u003eae2\u003c/th\u003e\n      \u003cth\u003elrwc\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003ehudlit_12B_500K_300d_100q\u003c/th\u003e\n      \u003ctd\u003enavec\u003c/td\u003e\n      \u003ctd\u003e0.310\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.707\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.842\u003c/b\u003e\u003c/td\u003e\n      
\u003ctd\u003e\u003cb\u003e0.931\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.923\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.604\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003enews_1B_250K_300d_100q\u003c/th\u003e\n      \u003ctd\u003enavec\u003c/td\u003e\n      \u003ctd\u003e0.230\u003c/td\u003e\n      \u003ctd\u003e0.590\u003c/td\u003e\n      \u003ctd\u003e0.784\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.866\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.861\u003c/td\u003e\n      \u003ctd\u003e0.589\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eruscorpora_upos_cbow_300_20_2019\u003c/th\u003e\n      \u003ctd\u003ew2v\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.359\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.685\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.852\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.758\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.896\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.602\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eruwikiruscorpora_upos_skipgram_300_2_2019\u003c/th\u003e\n      \u003ctd\u003ew2v\u003c/td\u003e\n      \u003ctd\u003e0.321\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.723\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.817\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.801\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.860\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.629\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003etayga_upos_skipgram_300_2_2019\u003c/th\u003e\n      \u003ctd\u003ew2v\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.429\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.749\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.871\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.771\u003c/td\u003e\n      
\u003ctd\u003e\u003cb\u003e0.899\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.639\u003c/b\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003etayga_none_fasttextcbow_300_10_2019\u003c/th\u003e\n      \u003ctd\u003efasttext\u003c/td\u003e\n      \u003ctd\u003e\u003cb\u003e0.369\u003c/b\u003e\u003c/td\u003e\n      \u003ctd\u003e0.639\u003c/td\u003e\n      \u003ctd\u003e0.793\u003c/td\u003e\n      \u003ctd\u003e0.682\u003c/td\u003e\n      \u003ctd\u003e0.813\u003c/td\u003e\n      \u003ctd\u003e0.536\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003earaneum_none_fasttextcbow_300_5_2018\u003c/th\u003e\n      \u003ctd\u003efasttext\u003c/td\u003e\n      \u003ctd\u003e0.349\u003c/td\u003e\n      \u003ctd\u003e0.671\u003c/td\u003e\n      \u003ctd\u003e0.801\u003c/td\u003e\n      \u003ctd\u003e0.706\u003c/td\u003e\n      \u003ctd\u003e0.793\u003c/td\u003e\n      \u003ctd\u003e0.579\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c!--- emb2 ---\u003e\n\n## Support\n\n- Chat — https://t.me/natural_language_processing\n- Issues — https://github.com/natasha/navec/issues\n- Commercial support — https://lab.alexkuk.ru\n\n## Development\n\nDev env\n\n```bash\npython -m venv ~/.venvs/natasha-navec\nsource ~/.venvs/natasha-navec/bin/activate\n\npip install -r requirements/dev.txt\npip install -e .\n```\n\nTest + lint\n\n```bash\nmake test\n```\n\nRelease\n\n```bash\n# Update setup.py version\n\ngit commit -am 'Up version'\ngit tag v0.10.0\n\ngit push\ngit push --tags\n```\n\nNotice! All commands belows use code from `navec/train`, it is not under CI, it works only with Python 3, it is expected user is familiar with source code. 
We use Yandex Cloud Compute and Object Storage.\n\nCreate remote worker\n\nTo compute cooc (large HDD, 1 TB for librusec).\n```bash\nyc compute instance create \\\n    --name worker \\\n    --zone ru-central1-a \\\n    --network-interface subnet-name=default,nat-ip-version=ipv4 \\\n    --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804,type=network-hdd,size=1000 \\\n    --memory 8 \\\n    --cores 2 \\\n    --core-fraction 100 \\\n    --ssh-key ~/.ssh/id_rsa.pub \\\n    --folder-name default \\\n    --preemptible  # in case computation takes \u003c24h\n```\n\nTo fit embeddings (multiple cores). HDD should be \u003e cooc.bin * 3 (for shuffle + tmp)\n```bash\nyc compute instance create \\\n    --name worker \\\n    --zone ru-central1-a \\\n    --network-interface subnet-name=default,nat-ip-version=ipv4 \\\n    --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-1804,type=network-hdd,size=700 \\\n    --memory 16 \\\n    --cores 16 \\\n    --core-fraction 100 \\\n    --ssh-key ~/.ssh/id_rsa.pub  \\\n    --folder-name default \\\n    --preemptible\n```\n\nSetup machine\n```bash\nyc compute instance list --folder-name default\nssh yc-user@123.123.123.123\n\nsudo locale-gen en_US.UTF-8\nsudo timedatectl set-timezone Europe/Moscow\nsudo apt-get update\nsudo DEBIAN_FRONTEND=noninteractive apt-get install -y language-pack-ru python3-pip screen unzip git pv cmake\n\nwget https://nlp.stanford.edu/software/GloVe-1.2.zip\nunzip GloVe-1.2.zip\nrm GloVe-1.2.zip\nmv GloVe-1.2 glove\ncd glove\nmake\ncd ..\n\nexport GLOVE_DIR=~/glove/build\n\ngit clone https://github.com/natasha/navec.git\nsudo -H pip3 install -e navec\nsudo -H pip3 install -r navec/requirements/train.txt\n\nscreen\n# detach from the screen session with Ctrl+A, D\n```\n\nRemove instance\n```bash\nyc compute instance list --folder-name default\nyc compute instance delete xxxxxxxxx\n```\n\nEnv, used by `navec-train s3|vocab|cooc|emb`\n```bash\nexport S3_KEY=_XxXXXxxx_XXXxxxxXxxx\nexport 
S3_SECRET=XXxxx_XXXXXXxxxxxxXXXXxxXXx-XxxXXxxxX\nexport S3_BUCKET=XXXXXXX\nexport GLOVE_DIR=~/path/to/glove/build\n```\n\nShare text data (see corus)\n```bash\nnavec-train s3 upload librusec_fb2.plain.gz sources/librusec.gz\nnavec-train s3 upload taiga/proza_ru.zip sources/taiga_proza.zip\n\nnavec-train s3 upload ruwiki-latest-pages-articles.xml.bz2 sources/wiki.xml.bz2\n\nnavec-train s3 upload lenta-ru-news.csv.gz sources/lenta.csv.gz\nnavec-train s3 upload ria.json.gz sources/ria.json.gz\nnavec-train s3 upload taiga/Fontanka.tar.gz sources/taiga_fontanka.tar.gz\nnavec-train s3 upload buriy/news-articles-2014.tar.bz2 sources/buriy_news1.tar.bz2\nnavec-train s3 upload buriy/news-articles-2015-part1.tar.bz2 sources/buriy_news2.tar.bz2\nnavec-train s3 upload buriy/news-articles-2015-part2.tar.bz2 sources/buriy_news3.tar.bz2\nnavec-train s3 upload buriy/webhose-2016.tar.bz2 sources/buriy_webhose.tar.bz2\nnavec-train s3 upload ods/gazeta_v1.csv.zip sources/ods_gazeta.csv.zip\nnavec-train s3 upload ods/interfax_v1.csv.zip sources/ods_interfax.csv.zip\n\nnavec-train s3 download sources/librusec.gz\nnavec-train s3 download sources/taiga_proza.zip\n\nnavec-train s3 download sources/wiki.xml.bz2\n\nnavec-train s3 download sources/lenta.csv.gz\nnavec-train s3 download sources/ria.json.gz\nnavec-train s3 download sources/taiga_fontanka.tar.gz\nnavec-train s3 download sources/buriy_news1.tar.bz2\nnavec-train s3 download sources/buriy_news2.tar.bz2\nnavec-train s3 download sources/buriy_news3.tar.bz2\nnavec-train s3 download sources/buriy_webhose.tar.bz2\nnavec-train s3 download sources/ods_gazeta.csv.zip\nnavec-train s3 download sources/ods_interfax.csv.zip\n```\n\nText to tokens\n```bash\nnavec-train corpus librusec librusec.gz | pv | navec-train tokenize \u003e tokens.txt  # ~12B words\nnavec-train corpus taiga_proza taiga_proza.zip | pv | navec-train tokenize \u003e tokens.txt  # ~3B\n\nnavec-train corpus wiki wiki.xml.bz2 | pv | navec-train tokenize \u003e tokens.txt  # 
~0.5B\n\nnavec-train corpus lenta lenta.csv.gz | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus ria ria.json.gz | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus taiga_fontanka taiga_fontanka.tar.gz | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus buriy_news buriy_news1.tar.bz2 | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus buriy_news buriy_news2.tar.bz2 | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus buriy_news buriy_news3.tar.bz2 | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus buriy_webhose buriy_webhose.tar.bz2 | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus ods_gazeta ods_gazeta.csv.zip | pv | navec-train tokenize \u003e\u003e tokens.txt\nnavec-train corpus ods_interfax ods_interfax.csv.zip | pv | navec-train tokenize \u003e\u003e tokens.txt  # ~1B\n\npv tokens.txt | gzip \u003e tokens.txt.gz\nnavec-train s3 upload tokens.txt.gz librusec_tokens.txt.gz\n\nnavec-train s3 upload tokens.txt taiga_proza_tokens.txt\nnavec-train s3 upload tokens.txt news_tokens.txt\nnavec-train s3 upload tokens.txt wiki_tokens.txt\n```\n\nTokens to vocab\n```bash\npv tokens.txt \\\n\t| navec-train vocab count \\\n\t\u003e full_vocab.txt\n\npv full_vocab.txt \\\n\t| navec-train vocab quantile\n\n# librusec\n# ...\n# 0.970      325 882\n# 0.980      511 542\n# 0.990    1 122 624\n# 1.000   22 129 654\n\n# taiga_proza\n# ...\n# 0.960      229 906\n# 0.970      321 810\n# 0.980      517 647\n# 0.990    1 224 277\n# 1.000   14 302 409\n\n# wiki\n# ...\n# 0.950     380 134\n# 0.960     519 817\n# 0.970     757 561\n# 0.980   1 223 201\n# 0.990   2 422 265\n# 1.000   6 664 630\n\n# news\n# ...\n# 0.970    163 833\n# 0.980    243 903\n# 0.990    462 361\n# 1.000  3 744 070\n\n# threashold at ~0.98\n# librusec 500000\n# taiga_proza 500000\n# wiki 750000\n# news 250000\n\ncat full_vocab.txt \\\n\t| head -500000 \\\n\t| LC_ALL=C sort 
\\\n\t\u003e vocab.txt\n\nnavec-train s3 upload full_vocab.txt librusec_full_vocab.txt\nnavec-train s3 upload vocab.txt librusec_vocab.txt\n\nnavec-train s3 upload full_vocab.txt taiga_proza_full_vocab.txt\nnavec-train s3 upload vocab.txt taiga_proza_vocab.txt\n\nnavec-train s3 upload full_vocab.txt wiki_full_vocab.txt\nnavec-train s3 upload vocab.txt wiki_vocab.txt\n\nnavec-train s3 upload full_vocab.txt news_full_vocab.txt\nnavec-train s3 upload vocab.txt news_vocab.txt\n```\n\nCompute co-occurrence pairs\n```bash\n# Default limit on max number of open files is 1024, merge fails if\n# number of chunks is large\n\nulimit -n  # 1024\nsudo nano /etc/security/limits.conf\n\n* soft     nofile         65535\n* hard     nofile         65535\n\n# relogin\nulimit -n  # 65535\n\npv tokens.txt \\\n\t| navec-train cooc count vocab.txt --memory 7 --window 10 \\\n\t\u003e cooc.bin\n\n# Monitor\nls /tmp/cooc_*\ntail -c 16 cooc.bin | navec-train cooc parse\n\nnavec-train s3 upload cooc.bin librusec_cooc.bin\nnavec-train s3 upload cooc.bin taiga_proza_cooc.bin\nnavec-train s3 upload cooc.bin wiki_cooc.bin\nnavec-train s3 upload cooc.bin news_cooc.bin\n```\n\nMerge (did not give much boost compared to plain librusec, so all_vocab.txt, all_cooc.bin not used below)\n```bash\n# Braces are required: $i_vocab would expand an unset variable named i_vocab\nfor i in librusec taiga_proza wiki news; do\n\tnavec-train s3 download ${i}_vocab.txt;\n\tnavec-train s3 download ${i}_cooc.bin;\ndone\n\nnavec-train merge vocabs \\\n\tlibrusec_vocab.txt \\\n\ttaiga_proza_vocab.txt \\\n\twiki_vocab.txt \\\n\tnews_vocab.txt \\\n\t| pv \u003e vocab.txt\n\nnavec-train merge coocs vocab.txt \\\n\tlibrusec_cooc.bin:librusec_vocab.txt \\\n\ttaiga_proza_cooc.bin:taiga_proza_vocab.txt \\\n\twiki_cooc.bin:wiki_vocab.txt \\\n\tnews_cooc.bin:news_vocab.txt \\\n\t| pv \u003e cooc.bin\n\nnavec-train s3 upload vocab.txt all_vocab.txt\nnavec-train s3 upload cooc.bin all_cooc.bin\n```\n\nCompute embeddings\n```bash\nnavec-train s3 download librusec_vocab.txt vocab.txt\nnavec-train s3 download 
librusec_cooc.bin cooc.bin\n\nnavec-train s3 download wiki_vocab.txt vocab.txt\nnavec-train s3 download wiki_cooc.bin cooc.bin\n\nnavec-train s3 download news_vocab.txt vocab.txt\nnavec-train s3 download news_cooc.bin cooc.bin\n\npv cooc.bin \\\n\t| navec-train cooc shuffle --memory 15 \\\n\t\u003e shuf_cooc.bin\n\n# Search dim with best score\nfor i in 100 200 300 400 500 600;\n\tdo navec-train emb shuf_cooc.bin vocab.txt emb_${i}d.txt --dim $i --threads 10 --iterations 2;\ndone\n\n# 300 has ok score. 400, 500 are a bit better, but too heavy\nnavec-train emb shuf_cooc.bin vocab.txt emb.txt --dim 300 --threads 16 --iterations 15\n\nnavec-train s3 upload emb.txt librusec_emb.txt\nnavec-train s3 upload emb.txt wiki_emb.txt\nnavec-train s3 upload emb.txt news_emb.txt\n```\n\nQuantize\n```bash\nnavec-train s3 download librusec_emb.txt emb.txt\nnavec-train s3 download wiki_emb.txt emb.txt\nnavec-train s3 download news_emb.txt emb.txt\n\n# Search for best compression that has still ok score\nfor i in 150 100 75 60 50;\n\tdo pv emb.txt | navec-train pq fit $i --sample 100000 --iterations 15 \u003e pq_${i}q.bin;\ndone\n\n# 100 is \u003c1% worse on eval but much lighter\npv emb.txt | navec-train pq fit 100 --sample 100000 --iterations 20 \u003e pq.bin\n\nnavec-train pq pad \u003c pq.bin \u003e t; mv t pq.bin\n\nnavec-train s3 upload pq.bin librusec_pq.bin\nnavec-train s3 upload pq.bin wiki_pq.bin\nnavec-train s3 upload pq.bin news_pq.bin\n```\n\nPack\n```\nnavec-train s3 download librusec_pq.bin pq.bin\nnavec-train s3 download librusec_vocab.txt vocab.txt\n\nnavec-train s3 download news_pq.bin pq.bin\nnavec-train s3 download news_vocab.txt vocab.txt\n\nnavec-train vocab pack \u003c vocab.txt \u003e vocab.bin\n\nnavec-train pack vocab.bin pq.bin hudlit_v1_12B_500K_300d_100q\nnavec-train s3 upload navec_hudlit_v1_12B_500K_300d_100q.tar packs/navec_hudlit_v1_12B_500K_300d_100q.tar\n\nnavec-train pack vocab.bin pq.bin news_v1_1B_250K_300d_100q\nnavec-train s3 upload 
navec_news_v1_1B_250K_300d_100q.tar packs/navec_news_v1_1B_250K_300d_100q.tar\n```\n\nPublish\n```\nnavec-train s3 download packs/navec_hudlit_v1_12B_500K_300d_100q.tar\nnavec-train s3 download packs/navec_news_v1_1B_250K_300d_100q.tar\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnatasha%2Fnavec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnatasha%2Fnavec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnatasha%2Fnavec/lists"}