{"id":13452869,"url":"https://github.com/explosion/floret","last_synced_at":"2026-01-10T02:04:29.039Z","repository":{"id":43865889,"uuid":"406274686","full_name":"explosion/floret","owner":"explosion","description":"🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy","archived":false,"fork":true,"pushed_at":"2023-11-03T15:06:38.000Z","size":4619,"stargazers_count":300,"open_issues_count":1,"forks_count":11,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-18T13:41:16.697Z","etag":null,"topics":["fasttext","fasttext-embeddings","spacy","subword-embeddings","word-embeddings","word-vectors"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"facebookresearch/fastText","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-14T07:51:07.000Z","updated_at":"2025-01-16T12:00:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/explosion/floret","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ffloret","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ffloret/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ffloret/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Ffloret/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/floret/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235515424,"owners_count":19002481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fasttext","fasttext-embeddings","spacy","subword-embeddings","word-embeddings","word-vectors"],"created_at":"2024-07-31T08:00:25.408Z","updated_at":"2025-10-06T08:31:00.173Z","avatar_url":"https://github.com/explosion.png","language":"C++","readme":"\u003ca href=\"https://explosion.ai\"\u003e\u003cimg src=\"https://explosion.ai/assets/img/logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\u003c/a\u003e\n\n# floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy\n\nfloret is an extended version of [fastText](https://fasttext.cc) that can\nproduce word representations for any word from a compact vector table. 
It\ncombines:\n\n- fastText's subwords to provide embeddings for any word\n- Bloom embeddings (\"hashing trick\") for a compact vector table\n\nTo learn more about floret, check out our [blog post on floret vectors](https://explosion.ai/blog/floret-vectors).\n\nFor a hands-on introduction, experiment with English vectors in this example\nnotebook: [`intro_to_floret`][intro_to_floret] [![Open in\nColab][colab]][intro_to_floret_colab]\n\n[colab]: https://gistcdn.githack.com/ines/dcf354aa71a7665ae19871d7fd14a4e0/raw/461fc1f61a7bc5860f943cd4b6bcfabb8c8906e7/colab-badge.svg\n[intro_to_floret]: examples/01_intro_to_floret.ipynb\n[intro_to_floret_colab]: https://colab.research.google.com/github/explosion/floret/blob/main/examples/01_intro_to_floret.ipynb\n\n## Install floret\n\n### Build floret from source\n\n```bash\ngit clone https://github.com/explosion/floret\ncd floret\nmake\n```\n\nThis produces the main binary `floret`.\n\n### Install for Python\n\nInstall the Python wrapper with `pip`:\n\n```bash\npip install floret\n```\n\nOr install from source in developer mode:\n\n```bash\ngit clone https://github.com/explosion/floret\ncd floret\npip install -r requirements.txt\npip install --no-build-isolation --editable .\n```\n\nSee the [Python docs](python/README.md).\n\n## Usage\n\n`floret` adds two additional command line options to `fasttext`:\n\n```\n  -mode               fasttext (default) or floret (word and char ngrams hashed in buckets) [fasttext]\n  -hashCount          floret mode only: number of hashes (1-4) per word/subword [1]\n```\n\nWith `-mode floret`, the word entries are stored in the same table as the\nsubword embeddings (buckets), reducing the size of the saved vector data.\n\nWith `-hashCount 2`, each entry is stored as the sum of 2 rows in the internal\nsubwords hash table. `floret` supports 1-4 hashes per entry in the embeddings\ntable. By storing an entry in the embedding table as the sum of more than one\nrow, it is possible to greatly reduce the number of rows in the table with a\nrelatively small effect on the performance, both in terms of accuracy and\nspeed.\n\nHere's how to train CBOW embeddings with subwords as 4-grams and 5-grams, 2\nhashes per entry, and a compact table of 50K entries rather than the default of\n2M entries:\n\n```bash\nfloret cbow -dim 300 -minn 4 -maxn 5 -mode floret -hashCount 2 -bucket 50000 \\\n-input input.txt -output vectors\n```\n\nWith the `-mode floret` option, floret will save an additional vector table\nwith the file ending `.floret`. The format is very similar to `.vec` with a\nheader line followed by one line per vector. The word tokens are replaced with\nthe index of the row and the header is extended to contain all the relevant\ntraining settings needed to load this table in spaCy.\n\nTo import this vector table in [spaCy](https://spacy.io) v3.2+:\n\n```bash\nspacy init vectors LANG vectors.floret spacy_vectors_dir --mode floret\n```\n\n## How floret works\n\nIn its original implementation, fastText stores words and subwords in two\nseparate tables. The word table contains one entry per word in the vocabulary\n(typically ~1M entries) and the subwords are stored in a separate fixed-size\ntable by hashing each subword into one row in the table (default 2M entries). A\nrelatively large table is used to reduce the number of collisions between\nsubwords. 
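\nThe lookup in this two-table scheme can be sketched in a few lines of Python (a\nsimplified illustration rather than fastText's actual code: the toy vocabulary,\nthe random tables and the FNV-1a-style stand-in hash are assumptions for the\nexample):\n\n```python\nimport numpy as np\n\nDIM, MIN_N, MAX_N = 300, 3, 6       # fastText's default char ngram range is 3-6\nBUCKETS = 100000                    # illustrative; the real default table has 2M rows\nVOCAB = {'apple': 0}                # toy vocabulary mapping each known word to its row\nword_table = np.random.rand(len(VOCAB), DIM).astype('float32')\nsubword_table = np.random.rand(BUCKETS, DIM).astype('float32')\n\ndef char_ngrams(word):\n    padded = '\u003c' + word + '\u003e'     # pad with the BOW/EOW markers\n    return [padded[i:i + n]\n            for n in range(MIN_N, MAX_N + 1)\n            for i in range(len(padded) - n + 1)]\n\ndef fnv1a(s):                       # stand-in for fastText's FNV-1a-style subword hash\n    h = 2166136261\n    for b in s.encode('utf-8'):\n        h = ((h ^ b) * 16777619) \u0026 0xFFFFFFFF\n    return h\n\ndef vector(word):\n    rows = [subword_table[fnv1a(ng) % BUCKETS] for ng in char_ngrams(word)]\n    if word in VOCAB:               # known words also contribute their own row\n        rows.append(word_table[VOCAB[word]])\n    return np.mean(rows, axis=0)    # fastText averages the word and subword rows\n\nprint(vector('apple').shape)        # any string gets a vector, even unseen words\n```\n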
However, for 1M words + 2M subwords with 300-dimensional vectors of\n32-bit floats, you'd need around 3GB to store the resulting data, which is\nprohibitive for many use cases.\n\nIn addition, many libraries that import vectors only support the word table\n(`.vec`), which limits the coverage to words above a certain frequency in the\ntraining data. For languages with rich morphology, even a large vector table\nmay not provide good coverage for words seen during training and you are still\nlikely to encounter words that were not seen at all during training.\n\nIn order to store word and subword vectors in a more compact format, we turn to\nan algorithm that's been used by [spaCy](https://spacy.io) all along: Bloom\nembeddings. Bloom embeddings (also called the \"hashing trick\", or known as\n[`HashEmbed`](https://thinc.ai/docs/api-layers#hashembed) within spaCy's ML\nlibrary [thinc](https://thinc.ai)) can be used to store distinct\nrepresentations in a compact table by hashing each entry into multiple rows in\nthe table. By representing each entry as the sum of multiple rows, where it's\nunlikely that two entries will collide on multiple hashes, most entries will\nend up with a distinct representation.\n\nWith the settings `-minn 4 -maxn 5 -mode floret -hashCount 2`, the embedding\nfor the word `apple` is stored internally as the sum of 2 hashed rows for each\nof the word, 4-grams and 5-grams. The word is padded with the BOW and EOW\ncharacters `\u003c` and `\u003e`, creating the following word and subword entries:\n\n```\n\u003capple\u003e\n\u003capp\nappl\npple\nple\u003e\n\u003cappl\napple\npple\u003e\n```\n\nFor compatibility with spaCy,\n[MurmurHash](https://github.com/aappleby/smhasher) is used to hash the word and\nchar ngram strings. The final embedding for `apple` is then the sum of two rows\n(`-hashCount 2`) per word and char ngram above.\n\nWith `-mode floret`, `floret` will save an additional vector table with the\nending `.floret` alongside the usual `.bin` and `.vec` files. The format is\nvery similar to `.vec` with a header line followed by one line per entry in the\nvector table with the row index rather than a word token at the beginning of\neach line. 
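\nTo make this concrete, here is a minimal sketch of the `apple` lookup described\nabove (an illustration only: Python's built-in `hash` stands in for the seeded\nMurmurHash values floret actually derives, the table is random, and the final\naveraging over entries follows fastText's convention):\n\n```python\nimport numpy as np\n\nBUCKETS, DIM = 50000, 300            # -bucket 50000 -dim 300\nMIN_N, MAX_N, HASH_COUNT = 4, 5, 2   # -minn 4 -maxn 5 -hashCount 2\ntable = np.random.rand(BUCKETS, DIM).astype('float32')  # one shared table for words and subwords\n\ndef entries(word):\n    padded = '\u003c' + word + '\u003e'\n    ngrams = [padded[i:i + n]\n              for n in range(MIN_N, MAX_N + 1)\n              for i in range(len(padded) - n + 1)]\n    return [padded] + ngrams         # for 'apple' this reproduces the 8 entries listed above\n\ndef rows_for(entry):\n    # stand-in for the HASH_COUNT hash values computed per word/subword entry\n    return [hash((entry, i)) % BUCKETS for i in range(HASH_COUNT)]\n\ndef vector(word):\n    summed = [sum(table[r] for r in rows_for(e)) for e in entries(word)]\n    return np.mean(summed, axis=0)   # average the per-entry sums (each entry = sum of HASH_COUNT rows)\n\nprint(vector('apple').shape)         # (300,)\nprint(vector('aple').shape)          # misspelled and unseen words get vectors too\n```\n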
The header is extended to contain all the training settings required\nto use this table in another application or library like spaCy.\n\nThe header contains the space-separated settings:\n\n```none\nbucket dim minn maxn hashCount hashSeed BOW EOW\n```\n\nA demo `.floret` table with `-bucket 10 -dim 10 -minn 2 -maxn 3 -hashCount 2`:\n\n```none\n10 10 2 3 2 2166136261 \u003c \u003e\n0 -2.2611 3.9302 2.6676 -11.233 0.093715 -10.52 -9.6463 -0.11853 2.101 -0.10145\n1 -3.12 -1.7981 10.7 -6.171 4.4527 10.967 9.073 6.2056 -6.1199 -2.0402\n2 9.5689 5.6721 -8.4832 -1.2249 2.1871 -3.0264 -2.391 -5.3308 -3.2847 -4.0382\n3 3.6268 4.2759 -1.7007 1.5002 5.5266 1.8716 -12.063 0.26314 2.7645 2.4929\n4 -11.683 -7.7068 2.1102 2.214 7.2202 0.69799 3.2173 -5.382 -2.0838 5.0314\n5 -4.3024 8.0241 2.0714 -1.0174 -0.28369 1.7622 7.8797 -1.7795 6.7541 5.6703\n6 8.3574 -5.225 8.6529 8.5605 -8.9465 3.767 -5.4636 -1.4635 -0.98947 -0.58025\n7 -10.01 3.3894 -4.4487 1.1669 -11.904 6.5158 4.3681 0.79913 -6.9131 -8.687\n8 -5.4576 7.1019 -8.8259 1.7189 4.955 -8.9157 -3.8905 -0.60086 -2.1233 5.892\n9 8.0678 -4.4142 3.6236 4.5889 -2.7611 2.4455 0.67096 -4.2822 2.0875 4.6274\n```\n\nThis table can be imported into a spaCy pipeline using `spacy init vectors` in\nspaCy v3.2+ with the option `--mode floret`:\n\n```bash\nspacy init vectors LANG vectors.floret spacy_vectors_dir --mode floret\n```\n\n## Notes\n\nThe fastText and floret binary formats (`.bin`) are not compatible, so it's\nimportant to load a `.bin` file with the same program used to train it.\n\nSee the [fastText documentation](https://fasttext.cc) for details about all\nother commands and options. `floret` supports all existing `fasttext` commands\nand does not modify any `fasttext` defaults.\n\nThe original fastText README is provided below for reference.\n\n---\n\n# fastText README\n\n[fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.\n\n## Table of contents\n\n- [Resources](#resources)\n  - [Models](#models)\n  - [Supplementary data](#supplementary-data)\n  - [FAQ](#faq)\n  - [Cheatsheet](#cheatsheet)\n- [Requirements](#requirements)\n- [Building fastText](#building-fasttext)\n  - [Getting the source code](#getting-the-source-code)\n  - [Building fastText using make (preferred)](#building-fasttext-using-make-preferred)\n  - [Building fastText using cmake](#building-fasttext-using-cmake)\n  - [Building fastText for Python](#building-fasttext-for-python)\n- [Example use cases](#example-use-cases)\n  - [Word representation learning](#word-representation-learning)\n  - [Obtaining word vectors for out-of-vocabulary words](#obtaining-word-vectors-for-out-of-vocabulary-words)\n  - [Text classification](#text-classification)\n- [Full documentation](#full-documentation)\n- [References](#references)\n  - [Enriching Word Vectors with Subword Information](#enriching-word-vectors-with-subword-information)\n  - [Bag of Tricks for Efficient Text Classification](#bag-of-tricks-for-efficient-text-classification)\n  - [FastText.zip: Compressing text classification models](#fasttextzip-compressing-text-classification-models)\n\n## Resources\n\n### Models\n\n- Recent state-of-the-art [English word vectors](https://fasttext.cc/docs/en/english-vectors.html).\n- Word vectors for [157 languages trained on Wikipedia and Crawl](https://fasttext.cc/docs/en/crawl-vectors.html).\n- Models for [language identification](https://fasttext.cc/docs/en/language-identification.html#content) and [various supervised 
tasks](https://fasttext.cc/docs/en/supervised-models.html#content).\n\n### Supplementary data\n\n- The preprocessed [YFCC100M data](https://fasttext.cc/docs/en/dataset.html#content) used in [2].\n\n### FAQ\n\nYou can find [answers to frequently asked questions](https://fasttext.cc/docs/en/faqs.html#content) on our [website](https://fasttext.cc/).\n\n### Cheatsheet\n\nWe also provide a [cheatsheet](https://fasttext.cc/docs/en/cheatsheet.html#content) full of useful one-liners.\n\n## Requirements\n\nWe are continuously building and testing our library, CLI and Python bindings under various docker images using [circleci](https://circleci.com/).\n\nGenerally, **fastText** builds on modern Mac OS and Linux distributions.\nSince it uses some C++11 features, it requires a compiler with good C++11 support.\nThese include :\n\n- (g++-4.7.2 or newer) or (clang-3.3 or newer)\n\nCompilation is carried out using a Makefile, so you will need to have a working **make**.\nIf you want to use **cmake** you need at least version 2.8.9.\n\nOne of the oldest distributions we successfully built and tested the CLI under is [Debian jessie](https://www.debian.org/releases/jessie/).\n\nFor the word-similarity evaluation script you will need:\n\n- Python 2.6 or newer\n- NumPy \u0026 SciPy\n\nFor the python bindings (see the subdirectory python) you will need:\n\n- Python version 2.7 or \u003e=3.4\n- NumPy \u0026 SciPy\n- [pybind11](https://github.com/pybind/pybind11)\n\nOne of the oldest distributions we successfully built and tested the Python bindings under is [Debian jessie](https://www.debian.org/releases/jessie/).\n\nIf these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.\n\n## Building fastText\n\nWe discuss building the latest stable version of fastText.\n\n### Getting the source code\n\nYou can find our [latest stable release](https://github.com/facebookresearch/fastText/releases/latest) in the usual place.\n\nThere is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. 
You might want to use this if you are a developer or power-user.\n\n### Building fastText using make (preferred)\n\n```\n$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip\n$ unzip v0.9.2.zip\n$ cd fastText-0.9.2\n$ make\n```\n\nThis will produce object files for all the classes as well as the main binary `fasttext`.\nIf you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).\n\n### Building fastText using cmake\n\nFor now this is not part of a release, so you will need to clone the master branch.\n\n```\n$ git clone https://github.com/facebookresearch/fastText.git\n$ cd fastText\n$ mkdir build \u0026\u0026 cd build \u0026\u0026 cmake ..\n$ make \u0026\u0026 make install\n```\n\nThis will create the fasttext binary and also all relevant libraries (shared, static, PIC).\n\n### Building fastText for Python\n\nFor now this is not part of a release, so you will need to clone the master branch.\n\n```\n$ git clone https://github.com/facebookresearch/fastText.git\n$ cd fastText\n$ pip install .\n```\n\nFor further information and introduction see python/README.md\n\n## Example use cases\n\nThis library has two main use cases: word representation learning and text classification.\nThese were described in the two papers [1](#enriching-word-vectors-with-subword-information) and [2](#bag-of-tricks-for-efficient-text-classification).\n\n### Word representation learning\n\nIn order to learn word vectors, as described in [1](#enriching-word-vectors-with-subword-information), do:\n\n```\n$ ./fasttext skipgram -input data.txt -output model\n```\n\nwhere `data.txt` is a training file containing `UTF-8` encoded text.\nBy default the word vectors will take into account character n-grams from 3 to 6 characters.\nAt the end of optimization the program will save two files: `model.bin` and `model.vec`.\n`model.vec` is a text file containing the word vectors, one per line.\n`model.bin` is a binary file containing the parameters of the model along with the dictionary and all hyper parameters.\nThe binary file can be used later to compute word vectors or to restart the optimization.\n\n### Obtaining word vectors for out-of-vocabulary words\n\nThe previously trained model can be used to compute word vectors for out-of-vocabulary words.\nProvided you have a text file `queries.txt` containing words for which you want to compute vectors, use the following command:\n\n```\n$ ./fasttext print-word-vectors model.bin \u003c queries.txt\n```\n\nThis will output word vectors to the standard output, one vector per line.\nThis can also be used with pipes:\n\n```\n$ cat queries.txt | ./fasttext print-word-vectors model.bin\n```\n\nSee the provided scripts for an example. For instance, running:\n\n```\n$ ./word-vector-example.sh\n```\n\nwill compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 
2013].\n\n### Text classification\n\nThis library can also be used to train supervised text classifiers, for instance for sentiment analysis.\nIn order to train a text classifier using the method described in [2](#bag-of-tricks-for-efficient-text-classification), use:\n\n```\n$ ./fasttext supervised -input train.txt -output model\n```\n\nwhere `train.txt` is a text file containing a training sentence per line along with the labels.\nBy default, we assume that labels are words that are prefixed by the string `__label__`.\nThis will output two files: `model.bin` and `model.vec`.\nOnce the model was trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:\n\n```\n$ ./fasttext test model.bin test.txt k\n```\n\nThe argument `k` is optional, and is equal to `1` by default.\n\nIn order to obtain the k most likely labels for a piece of text, use:\n\n```\n$ ./fasttext predict model.bin test.txt k\n```\n\nor use `predict-prob` to also get the probability for each label\n\n```\n$ ./fasttext predict-prob model.bin test.txt k\n```\n\nwhere `test.txt` contains a piece of text to classify per line.\nDoing so will print to the standard output the k most likely labels for each line.\nThe argument `k` is optional, and equal to `1` by default.\nSee `classification-example.sh` for an example use case.\nIn order to reproduce results from the paper [2](#bag-of-tricks-for-efficient-text-classification), run `classification-results.sh`, this will download all the datasets and reproduce the results from Table 1.\n\nIf you want to compute vector representations of sentences or paragraphs, please use:\n\n```\n$ ./fasttext print-sentence-vectors model.bin \u003c text.txt\n```\n\nThis assumes that the `text.txt` file contains the paragraphs that you want to get vectors for.\nThe program will output one vector representation per line in the file.\n\nYou can also quantize a supervised model to reduce its memory usage with the following command:\n\n```\n$ ./fasttext quantize -output model\n```\n\nThis will create a `.ftz` file with a smaller memory footprint. All the standard functionality, like `test` or `predict` work the same way on the quantized models:\n\n```\n$ ./fasttext test model.ftz test.txt\n```\n\nThe quantization procedure follows the steps described in [3](#fasttextzip-compressing-text-classification-models). 
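\nThe same classification and quantization workflow is available through the\nPython bindings (a sketch assuming the `fasttext` pip package; the file names\nare placeholders):\n\n```python\nimport fasttext\n\n# train a baseline classifier, then compress it with quantization\nmodel = fasttext.train_supervised(input='train.txt')\nprint(model.test('test.txt', k=1))   # (N, P@1, R@1)\n\nmodel.quantize(input='train.txt', retrain=True, cutoff=100000)\nmodel.save_model('model.ftz')        # the quantized model is much smaller on disk\nprint(model.test('test.txt', k=1))   # test/predict work the same on the quantized model\n```\n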
You can\nrun the script `quantization-example.sh` for an example.\n\n## Full documentation\n\nInvoke a command without arguments to list available arguments and their default values:\n\n```\n$ ./fasttext supervised\nEmpty input or output path.\n\nThe following arguments are mandatory:\n  -input              training file path\n  -output             output file path\n\nThe following arguments are optional:\n  -verbose            verbosity level [2]\n\nThe following arguments for the dictionary are optional:\n  -minCount           minimal number of word occurrences [1]\n  -minCountLabel      minimal number of label occurrences [0]\n  -wordNgrams         max length of word ngram [1]\n  -bucket             number of buckets [2000000]\n  -minn               min length of char ngram [0]\n  -maxn               max length of char ngram [0]\n  -t                  sampling threshold [0.0001]\n  -label              labels prefix [__label__]\n\nThe following arguments for training are optional:\n  -lr                 learning rate [0.1]\n  -lrUpdateRate       change the rate of updates for the learning rate [100]\n  -dim                size of word vectors [100]\n  -ws                 size of the context window [5]\n  -epoch              number of epochs [5]\n  -neg                number of negatives sampled [5]\n  -loss               loss function {ns, hs, softmax} [softmax]\n  -thread             number of threads [12]\n  -pretrainedVectors  pretrained word vectors for supervised learning []\n  -saveOutput         whether output params should be saved [0]\n\nThe following arguments for quantization are optional:\n  -cutoff             number of words and ngrams to retain [0]\n  -retrain            finetune embeddings if a cutoff is applied [0]\n  -qnorm              quantizing the norm separately [0]\n  -qout               quantizing the classifier [0]\n  -dsub               size of each sub-vector [2]\n```\n\nDefaults may vary by mode. (Word-representation modes `skipgram` and `cbow` use a default `-minCount` of 5.)\n\n## References\n\nPlease cite [1](#enriching-word-vectors-with-subword-information) if using this code for learning word representations or [2](#bag-of-tricks-for-efficient-text-classification) if using for text classification.\n\n### Enriching Word Vectors with Subword Information\n\n[1] P. Bojanowski\\*, E. Grave\\*, A. Joulin, T. Mikolov, [_Enriching Word Vectors with Subword Information_](https://arxiv.org/abs/1607.04606)\n\n```\n@article{bojanowski2017enriching,\n  title={Enriching Word Vectors with Subword Information},\n  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},\n  journal={Transactions of the Association for Computational Linguistics},\n  volume={5},\n  year={2017},\n  issn={2307-387X},\n  pages={135--146}\n}\n```\n\n### Bag of Tricks for Efficient Text Classification\n\n[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [_Bag of Tricks for Efficient Text Classification_](https://arxiv.org/abs/1607.01759)\n\n```\n@InProceedings{joulin2017bag,\n  title={Bag of Tricks for Efficient Text Classification},\n  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},\n  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},\n  month={April},\n  year={2017},\n  publisher={Association for Computational Linguistics},\n  pages={427--431},\n}\n```\n\n### FastText.zip: Compressing text classification models\n\n[3] A. Joulin, E. 
Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [_FastText.zip: Compressing text classification models_](https://arxiv.org/abs/1612.03651)\n\n```\n@article{joulin2016fasttext,\n  title={FastText.zip: Compressing text classification models},\n  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\\'e}gou, H{\\'e}rve and Mikolov, Tomas},\n  journal={arXiv preprint arXiv:1612.03651},\n  year={2016}\n}\n```\n\n(\\* These authors contributed equally.)\n","funding_links":[],"categories":["NLP","C++"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Ffloret","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Ffloret","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Ffloret/lists"}