{"id":15014047,"url":"https://github.com/explosion/spacy-vectors-builder","last_synced_at":"2025-07-19T01:37:09.278Z","repository":{"id":57827258,"uuid":"468670482","full_name":"explosion/spacy-vectors-builder","owner":"explosion","description":"🌸 Train floret vectors","archived":false,"fork":false,"pushed_at":"2023-05-04T07:39:52.000Z","size":70,"stargazers_count":18,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-13T05:37:11.638Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-11T08:35:34.000Z","updated_at":"2024-09-16T19:33:00.000Z","dependencies_parsed_at":"2024-09-20T15:32:24.679Z","dependency_job_id":null,"html_url":"https://github.com/explosion/spacy-vectors-builder","commit_stats":{"total_commits":14,"total_committers":1,"mean_commits":14.0,"dds":0.0,"last_synced_commit":"f5247248d76e8358704e7c72c1c139229b51db7b"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/explosion/spacy-vectors-builder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-vectors-builder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-vectors-builder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-vectors-builder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-vectors-builder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/spacy-vectors-builder/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacy-vectors-builder/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265871377,"owners_count":23842026,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T19:45:06.961Z","updated_at":"2025-07-19T01:37:09.247Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"\u003c!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) --\u003e\n\n# 🪐 spaCy Project: Train fastText or floret vectors\n\nThis project downloads, extracts and preprocesses texts from a number of\nsources and trains vectors with [floret](https://github.com/explosion/floret).\n\nBy default, the project trains floret vectors for Korean for use in `md` and\n`lg` spaCy pipelines.\n\nPrerequisites:\n- linux (it may largely work on osx but this is not tested or maintained)\n- a large amount of hard drive space (e.g. ~100GB total for Korean, which has\n  15GB of data in OSCAR 21.09; for English, Russian, Chinese, Spanish, etc.\n  you would need multiple TB with the provided defaults)\n- a workstation with a good CPU, or a lot of patience\n\nAdjust the variables `n_process_tokenize` and `vector_thread` for your CPU.\n\n\u003e For a Python-only cross-platform alternative, try out the simpler\n\u003e [`pipelines/floret_wiki_oscar_vectors`](https://github.com/explosion/projects/tree/v3/pipelines/floret_wiki_oscar_vectors)\n\u003e project using Wikipedia and OSCAR 2019.\n\n## Text Sources\n\n- Wikipedia: https://dumps.wikimedia.org\n- OpenSubtitles: https://opus.nlpl.eu/OpenSubtitles-v2018.php (https://www.opensubtitles.org)\n- WMT Newscrawl: https://data.statmt.org/news-crawl/\n- OSCAR 21.09: https://oscar-corpus.com/post/oscar-v21-09/\n\nOpenSubtitles and WMT Newscrawl only contain texts for a small subset of the\nlanguages included in Wikipedia or OSCAR, so you may need to remove the\nassets and adjust/remove related steps to use a subset of the resources.\n\n### Source Requirements\n\n#### Wikipedia\n\nInstall `Wikiparsec`: https://github.com/rspeer/wikiparsec\n\nChoose a current version available at https://dumps.wikimedia.org for this\nlanguage or switch to `\"latest\"`.\n\n#### OSCAR 21.09\n\nThe dataset [`oscar-corpus/OSCAR-2109`](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)\nrequires you to:\n- create a Hugging Face Hub account\n- agree to the dataset terms to access: https://huggingface.co/datasets/oscar-corpus/OSCAR-2109\n- authenticate with `huggingface-cli login`\n\n#### OSCAR 2019\n\nAs an alternative to OSCAR 21.09, you can stream from\n[`oscar`](https://huggingface.co/datasets/oscar) without authentication.\n\n## floret Parameters\n\n[floret](https://github.com/explosion/floret) has a large number of\nparameters and it's difficult to give advice for all configurations, but the\nparameters described here are the ones that it makes sense to customize for\nany new language and to experiment with initially.\n\nBe aware that if you're using more than one thread, the results of each run\nwith fastText or floret will be slightly different.\n\n### `vector_minn` / `vector_maxn`\n\nThe minimum and maximum character n-gram lengths should be adapted for the\nlanguage and writing system. The n-grams should capture common grammatical\naffixes like English `-ing`, without making the number of n-grams per word\ntoo large. Very short n-grams aren't meaningful and very long n-grams will be\ntoo sparse and won't be useful for cases with misspellings and noise.\n\nA good rule of thumb is that `maxn` should correspond to the length of the\nlongest common affix + `1`, so for many languages with alphabets, `minn\n4`/`maxn 5` can be a good starting point, similar to `minn 5`/`maxn 5`, which\nwas shown to be a reasonable default for the [original fastText\nvectors](https://fasttext.cc/docs/en/crawl-vectors.html).\n\nFor writing systems where one character corresponds to a syllable, shorter\nn-grams are typically more suitable. For Korean, where each (normalized)\ncharacter is a syllable and most grammatical affixes are 1-2 characters,\n`minn 2`/`maxn 3` seems to perform well.\n\n### `vector_bucket_md` / `vector_bucket_lg`\n\nThe bucket size is the number of rows in the floret vector table. For\ntagging and parsing, a bucket size of 50k performs well, but larger sizes may\nstill lead to small improvements. For NER, the performance continues to\nimprove for bucket sizes up to at least 200k.\n\nIn a spaCy pipeline package, 50k 300-dim vectors are ~60MB and 200k 300-dim\nvectors are ~230MB.\n\n### `vector_hash_count`\n\nThe recommended hash count is `2`, especially for smaller bucket sizes.\n\nLarger hash counts are slower to train with floret and slightly slower in\ninference in spaCy, but may lead to slightly improved performance, especially\nwith larger bucket sizes.\n\n### `vector_epoch`\n\nYou may want to reduce the number of epochs for larger training input sizes.\n\n### `vector_min_count`\n\nYou may want to increase the minimum word count for larger training input\nsizes.\n\n### `vector_lr`\n\nYou may need to decrease the learning rate for larger training input sizes to\navoid NaN errors, see:\nhttps://fasttext.cc/docs/en/faqs.html#im-encountering-a-nan-why-could-this-be\n\n### `vector_thread`\n\nAdjust the number of threads for your CPU. With a larger number of threads,\nyou may need more epochs to reach the same performance.\n\n## Notes\n\nThe project does not currently clean up any intermediate files so that it's\npossible to resume from any point in the workflow. The overall disk space\ncould be reduced by cleaning up files after each step, keeping only the final\nfloret input text file. floret does require the input file to be on disk\nduring training.\n\nfloret always writes the full `.bin` and `.vec` files after training. These\nmay be 5GB+ each even though the final `.floret` table is much smaller.\n\n\n## 📋 project.yml\n\nThe [`project.yml`](project.yml) defines the data assets required by the\nproject, as well as the available commands and workflows. For details, see the\n[spaCy projects documentation](https://spacy.io/usage/projects).\n\n### ⏯ Commands\n\nThe following commands are defined by the project. They\ncan be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).\nCommands are only re-run if their inputs have changed.\n\n| Command | Description |\n| --- | --- |\n| `extract-wikipedia` | Convert Wikipedia XML to plain text with Wikiparsec |\n| `tokenize-wikipedia` | Tokenize Wikipedia |\n| `extract-opensubtitles` | Extract OpenSubtitles data |\n| `tokenize-opensubtitles` | Tokenize OpenSubtitles |\n| `extract-newscrawl` | Extract newscrawl data |\n| `tokenize-newscrawl` | Tokenize newscrawl |\n| `tokenize-oscar` | Tokenize and sentencize oscar dataset |\n| `create-input` | Concatenate tokenized input texts |\n| `compile-floret` | Compile floret |\n| `train-floret-vectors-md` | Train floret md vectors |\n| `train-floret-vectors-lg` | Train floret lg vectors |\n| `train-fasttext-vectors` | Train fastText vectors |\n\n### ⏭ Workflows\n\nThe following workflows are defined by the project. They\ncan be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)\nand will run the specified commands in order. Commands are only re-run if their\ninputs have changed.\n\n| Workflow | Steps |\n| --- | --- |\n| `prepare-text` | `extract-wikipedia` \u0026rarr; `tokenize-wikipedia` \u0026rarr; `extract-opensubtitles` \u0026rarr; `tokenize-opensubtitles` \u0026rarr; `extract-newscrawl` \u0026rarr; `tokenize-newscrawl` \u0026rarr; `tokenize-oscar` \u0026rarr; `create-input` |\n| `train-vectors` | `compile-floret` \u0026rarr; `train-floret-vectors-md` \u0026rarr; `train-floret-vectors-lg` |\n\n### 🗂 Assets\n\nThe following assets are defined by the project. They can\nbe fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)\nin the project directory.\n\n| File | Source | Description |\n| --- | --- | --- |\n| `software/floret` | Git |  |\n| `/scratch/vectors/downloaded/wikipedia/kowiki-20220201-pages-articles.xml.bz2` | URL |  |\n| `/scratch/vectors/downloaded/opensubtitles/ko.txt.gz` | URL |  |\n| `/scratch/vectors/downloaded/newscrawl/ko/news.2020.ko.shuffled.deduped.gz` | URL |  |\n\n\u003c!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) --\u003e\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-vectors-builder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fspacy-vectors-builder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacy-vectors-builder/lists"}