{"id":13577858,"url":"https://github.com/plasticityai/magnitude","last_synced_at":"2025-05-15T11:08:40.805Z","repository":{"id":46148389,"uuid":"122715432","full_name":"plasticityai/magnitude","owner":"plasticityai","description":"A fast, efficient universal vector embedding utility package.","archived":false,"fork":false,"pushed_at":"2023-08-03T00:59:57.000Z","size":74131,"stargazers_count":1644,"open_issues_count":39,"forks_count":119,"subscribers_count":37,"default_branch":"master","last_synced_at":"2025-04-14T16:58:58.253Z","etag":null,"topics":["embeddings","fast","fasttext","gensim","glove","machine-learning","machine-learning-library","memory-efficient","natural-language-processing","nlp","python","vectors","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/plasticityai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-02-24T07:28:16.000Z","updated_at":"2025-03-22T12:11:38.000Z","dependencies_parsed_at":"2022-09-24T14:50:56.446Z","dependency_job_id":"7a8dfb33-b853-45be-8fc5-4792c14402ac","html_url":"https://github.com/plasticityai/magnitude","commit_stats":null,"previous_names":[],"tags_count":139,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plasticityai%2Fmagnitude","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plasticityai%2Fmagnitude/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plasticityai%2Fmagnitude/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/plasticityai%2Fmagnitude/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/plasticityai","download_url":"https://codeload.github.com/plasticityai/magnitude/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254328385,"owners_count":22052632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","fast","fasttext","gensim","glove","machine-learning","machine-learning-library","memory-efficient","natural-language-processing","nlp","python","vectors","word-embeddings","word2vec"],"created_at":"2024-08-01T15:01:24.946Z","updated_at":"2025-05-15T11:08:40.767Z","avatar_url":"https://github.com/plasticityai.png","language":"Python","funding_links":[],"categories":["Python","Look-up large Embeddings","Misc","向量相似度搜索（ANN）"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\u003cimg src=\"https://gitlab.com/Plasticity/magnitude/raw/master/images/magnitude.png\" alt=\"magnitude\" height=\"50\"\u003e\u003c/div\u003e\n\n## \u003cdiv align=\"center\"\u003eMagnitude: a fast, simple vector embedding utility library\u003cbr /\u003e\u003cbr /\u003e[![pipeline status](https://gitlab.com/Plasticity/magnitude/badges/master/pipeline.svg)](https://gitlab.com/Plasticity/magnitude/commits/master)\u0026nbsp;\u0026nbsp;\u0026nbsp;[![Build Status](https://travis-ci.org/plasticityai/magnitude.svg?branch=master)](https://travis-ci.org/plasticityai/magnitude)\u0026nbsp;\u0026nbsp;\u0026nbsp;[![Build 
status](https://ci.appveyor.com/api/projects/status/72lwh2g7a9ddbnt2/branch/master?svg=true)](https://ci.appveyor.com/project/plasticity-admin/magnitude/branch/master)\u003cbr/\u003e[![PyPI version](https://badge.fury.io/py/pymagnitude.svg)](https://pypi.python.org/pypi/pymagnitude/)\u0026nbsp;\u0026nbsp;\u0026nbsp;[![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://gitlab.com/Plasticity/magnitude/blob/master/LICENSE.txt)\u0026nbsp;\u0026nbsp;\u0026nbsp;[![Python version](https://img.shields.io/pypi/pyversions/pymagnitude.svg)](https://pypi.python.org/pypi/pymagnitude/)\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;[![DOI](https://zenodo.org/badge/122715432.svg)](https://zenodo.org/badge/latestdoi/122715432)\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;[![arXiv](https://img.shields.io/badge/arXiv-1810.11190-%23B41A1A.svg)](https://arxiv.org/abs/1810.11190)\u003c/div\u003e\nA feature-packed Python package and vector storage file format for utilizing vector embeddings in machine learning models in a fast, efficient, and simple manner developed by [Plasticity](https://www.plasticity.ai/). It is primarily intended to be a simpler / faster alternative to [Gensim](https://radimrehurek.com/gensim/), but can be used as a generic key-vector store for domains outside NLP. It offers unique features like [out-of-vocabulary lookups](#advanced-out-of-vocabulary-keys) and [streaming of large models over HTTP](#remote-streaming-over-http). 
Published in our paper at [EMNLP 2018](http://aclweb.org/anthology/D18-2021) and available on [arXiv](https://arxiv.org/abs/1810.11190).\n\n## Table of Contents\n- [Installation](#installation)\n- [Motivation](#motivation)\n- [Benchmarks and Features](#benchmarks-and-features)\n- [Pre-converted Magnitude Formats of Popular Embeddings Models](#pre-converted-magnitude-formats-of-popular-embeddings-models)\n- [Using the Library](#using-the-library)\n    * [Constructing a Magnitude Object](#constructing-a-magnitude-object)\n    * [Querying](#querying)\n    * [Basic Out-of-Vocabulary Keys](#basic-out-of-vocabulary-keys)\n    * [Advanced Out-of-Vocabulary Keys](#advanced-out-of-vocabulary-keys)\n        + [Handling Misspellings and Typos](#handling-misspellings-and-typos)\n    * [Concatenation of Multiple Models](#concatenation-of-multiple-models)\n    * [Additional Featurization (Parts of Speech, etc.)](#additional-featurization-parts-of-speech-etc)\n    * [Using Magnitude with a ML library](#using-magnitude-with-a-ml-library)\n        + [Keras](#keras)\n        + [PyTorch](#pytorch)\n        + [TFLearn](#tflearn)\n    * [Utils](#utils)\n- [Concurrency and Parallelism](#concurrency-and-parallelism)\n- [File Format and Converter](#file-format-and-converter)\n- [Remote Loading](#remote-loading)\n- [Remote Streaming over HTTP](#remote-streaming-over-http)\n- [Other Documentation](#other-documentation)\n- [Other Languages](#other-languages)\n- [Other Programming Languages](#other-programming-languages)\n- [Other Domains](#other-domains)\n- [Contributing](#contributing)\n- [Roadmap](#roadmap)\n- [Other Notable Projects](#other-notable-projects)\n- [Citing this Repository](#citing-this-repository)\n- [LICENSE and Attribution](#license-and-attribution)\n\n## Installation\nYou can install this package with `pip`:\n```bash\npip install pymagnitude # Python 2.7\npip3 install pymagnitude # Python 3\n```\n\nGoogle Colaboratory has some dependency issues with installing Magnitude 
due to conflicting dependencies. You can use the following snippet to install Magnitude on Google Colaboratory:\n```bash\n# Install Magnitude on Google Colab\n! echo \"Installing Magnitude.... (please wait, can take a while)\"\n! (curl https://raw.githubusercontent.com/plasticityai/magnitude/master/install-colab.sh | /bin/bash 1\u003e/dev/null 2\u003e/dev/null)\n! echo \"Done installing Magnitude.\"\n```\n\n## Motivation\nVector space embedding models have become increasingly common in machine learning and traditionally have been popular for natural language processing applications. A fast, lightweight tool to consume these large vector space embedding models efficiently is lacking.\n\nThe Magnitude file format (`.magnitude`) for vector embeddings is intended to be a more efficient universal vector embedding format that allows for lazy-loading for faster cold starts in development, LRU memory caching for performance in production, multiple key queries, direct featurization to the inputs for a neural network, performant similarity calculations, and other nice-to-have features for edge cases like handling out-of-vocabulary keys or misspelled keys and concatenating multiple vector models together. It is also intended to work with large vector models that may not fit in memory.\n\nIt uses [SQLite](http://www.sqlite.org), a fast, popular embedded database, as its underlying data store. It uses indexes for fast key lookups and uses memory mapping, SIMD instructions, and spatial indexing for fast similarity search in the vector space off-disk with good memory performance even between multiple processes. 
Moreover, memory maps are cached between runs so even after closing a process, speed improvements are reaped.\n\n## Benchmarks and Features\n\n| **Metric**                                                                                                                                            | **Magnitude Light**   | **Magnitude Medium** | **Magnitude Heavy** | **Magnitude [Stream](#remote-streaming-over-http)**    |\n| ----------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------: | :------------------: | :-----------------: | :----------------------------------------------------: |\n| Initial load time                                                                                                                                     | **0.7210s**           | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | 7.7550s                                                |\n| Cold single key query                                                                                                                                 | **0.0001s**           | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | 1.6437s                                                |\n| Warm single key query \u003cbr /\u003e\u003csup\u003e*(same key as cold query)*\u003c/sup\u003e                                                                                     | **0.00004s**          | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | **0.0004s**                                            |\n| Cold multiple key query \u003cbr /\u003e\u003csup\u003e*(n=25)*\u003c/sup\u003e                                                                                                     | **0.0442s**           | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | 
━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | 1.7753s                                                |\n| Warm multiple key query \u003cbr /\u003e\u003csup\u003e*(n=25) (same keys as cold query)*\u003c/sup\u003e                                                                           | **0.00004s**          | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | **0.0001s**                                            |\n| First `most_similar` search query \u003cbr /\u003e\u003csup\u003e*(n=10) (worst case)*\u003c/sup\u003e                                                                              | 247.05s               | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | -                                                      |\n| First `most_similar` search query \u003cbr /\u003e\u003csup\u003e*(n=10) (average case) (w/ disk persistent cache)*\u003c/sup\u003e                                                 | **1.8217s**           | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | -                                                      |\n| Subsequent `most_similar` search \u003cbr /\u003e\u003csup\u003e*(n=10) (different key than first query)*\u003c/sup\u003e                                                           | **0.2434s**           | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | -                                                      |\n| Warm subsequent `most_similar` search \u003cbr /\u003e\u003csup\u003e*(n=10) (same key as first query)*\u003c/sup\u003e                                                             | **0.00004s**          | **0.00004s**         | **0.00004s**        | -                                                      |\n| First `most_similar_approx` search query \u003cbr /\u003e\u003csup\u003e*(n=10, effort=1.0) (worst case)*\u003c/sup\u003e                 
                                          | N/A                   | N/A                  | **29.610s**         | -                                                      |\n| First `most_similar_approx` search query \u003cbr /\u003e\u003csup\u003e*(n=10, effort=1.0) (average case) (w/ disk persistent cache)*\u003c/sup\u003e                              | N/A                   | N/A                  | **0.9155s**         | -                                                      |\n| Subsequent `most_similar_approx` search \u003cbr /\u003e\u003csup\u003e*(n=10, effort=1.0) (different key than first query)*\u003c/sup\u003e                                        | N/A                   | N/A                  | **0.1873s**         | -                                                      |\n| Subsequent `most_similar_approx` search \u003cbr /\u003e\u003csup\u003e*(n=10, effort=0.1) (different key than first query)*\u003c/sup\u003e                                        | N/A                   | N/A                  | **0.0199s**         | -                                                      |\n| Warm subsequent `most_similar_approx` search \u003cbr /\u003e\u003csup\u003e*(n=10, effort=1.0) (same key as first query)*\u003c/sup\u003e                                          | N/A                   | N/A                  | **0.00004s**        | -                                                      |\n| File size                                                                                                                                             | 4.21GB                | 5.29GB               | 10.74GB             | **0.00GB**                                             |\n| Process memory (RAM) utilization                                                                                                                      | **18KB**              | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | 1.71MB                         
                        |\n| Process memory (RAM) utilization after 100 key queries                                                                                                | **168KB**             | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e | 1.91MB                                                 |\n| Process memory (RAM) utilization after 100 key queries + similarity search                                                                            | **342KB**\u003csup\u003e2\u003c/sup\u003e | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e  | ━\u0026nbsp;\u003csup\u003e1\u003c/sup\u003e |                                                        |\n| Integrity checks and tests                                                                                                                            | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Universal format between word2vec (`.txt`, `.bin`), GloVe (`.txt`), fastText (`.vec`), and ELMo (`.hdf5`) with converter utility                      | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Simple, Pythonic interface                                                                                                                            | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Few dependencies                                                                                                                                      | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Support for larger than memory models                                                                                                                 | ✅            
         | ✅                    | ✅                   | ✅                                                      |\n| Lazy loading whenever possible for speed and performance                                                                                              | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Optimized for `threading` and `multiprocessing`                                                                                                       | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Bulk and multiple key lookup with padding, truncation, placeholder, and featurization support                                                         | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Concatenating multiple vector models together                                                                                                         | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Basic out-of-vocabulary key lookup \u003cbr /\u003e\u003csup\u003e(character n-gram feature hashing)\u003c/sup\u003e                                                                | ✅                     | ✅                    | ✅                   | ✅                                                      |\n| Advanced out-of-vocabulary key lookup with support for misspellings \u003cbr /\u003e\u003csup\u003e(character n-gram feature hashing to similar in-vocabulary keys)\u003c/sup\u003e | ❌                     | ✅                    | ✅                   | ✅                                                      |\n| Approximate most similar search with an [annoy](#other-notable-projects) index                                                            
            | ❌                     | ❌                    | ✅                   | ✅                                                      |\n| Built-in training for new models                                                                                                                      | ❌                     | ❌                    | ❌                   | ❌                                                      |\n\n\n\n\u003csup\u003e1: *same value as previous column*\u003c/sup\u003e\u003cbr /\u003e\n\u003csup\u003e2: *uses `mmap` to read from disk, so the OS will still allocate pages of memory when memory is available, but it can be shared between processes and isn't managed within each process for extremely large files, which is a performance win*\u003c/sup\u003e\u003cbr/\u003e\n\u003csup\u003e\\*: All [benchmarks](https://gitlab.com/Plasticity/magnitude/blob/master/tests/benchmark.py) were performed on the Google News pre-trained word vectors (`GoogleNews-vectors-negative300.bin`) with a MacBook Pro (Retina, 15-inch, Mid 2014) 2.2GHz quad-core Intel Core i7 @ 16GB RAM on SSD over an average of trials where feasible.\u003c/sup\u003e\n\n## Pre-converted Magnitude Formats of Popular Embeddings Models\n\nPopular embedding models have been pre-converted to the `.magnitude` format for immediate download and usage:\n\n| **Contributor**                                                         | **Data**                                                        | **Light**\u003cbr/\u003e\u003cbr/\u003e\u003csup\u003e(basic support for out-of-vocabulary keys)\u003c/sup\u003e                                                                                                                                                                                                                                                                                                | 
**Medium**\u003cbr/\u003e\u003ci\u003e(recommended)\u003c/i\u003e\u003cbr/\u003e\u003cbr/\u003e\u003csup\u003e(advanced support for out-of-vocabulary keys)\u003c/sup\u003e                                                                                                                                                                                                                                                                           | **Heavy**\u003cbr/\u003e\u003cbr/\u003e\u003csup\u003e(advanced support for out-of-vocabulary keys and faster `most_similar_approx`)\u003c/sup\u003e                                                                                                                                                                                                                                                                |\n| :---------------------------------------------------------------------: | :-------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:                         | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | 
:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| Google - [word2vec](https://code.google.com/archive/p/word2vec/)        | Google News 100B                                                | [300D](http://magnitude.plasticity.ai/word2vec/light/GoogleNews-vectors-negative300.magnitude)                                                                                                                                                                                                                                                                          | [300D](http://magnitude.plasticity.ai/word2vec/medium/GoogleNews-vectors-negative300.magnitude)                                                                                                                                                                                                                                                                                 | [300D](http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude)                                                                                                                                                                                                                                                                              |\n| Stanford - [GloVe](https://nlp.stanford.edu/projects/glove/)            | Wikipedia 2014 + Gigaword 5 6B                                  | 
[50D](http://magnitude.plasticity.ai/glove/light/glove.6B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/light/glove.6B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/light/glove.6B.200d.magnitude),\u0026nbsp;[300D](http://magnitude.plasticity.ai/glove/light/glove.6B.300d.magnitude)                                             | [50D](http://magnitude.plasticity.ai/glove/medium/glove.6B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/medium/glove.6B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/medium/glove.6B.200d.magnitude),\u0026nbsp;[300D](http://magnitude.plasticity.ai/glove/medium/glove.6B.300d.magnitude)                                                 | [50D](http://magnitude.plasticity.ai/glove/heavy/glove.6B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/heavy/glove.6B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/heavy/glove.6B.200d.magnitude),\u0026nbsp;[300D](http://magnitude.plasticity.ai/glove/heavy/glove.6B.300d.magnitude)                                                 |\n| Stanford - [GloVe](https://nlp.stanford.edu/projects/glove/)            | Wikipedia 2014 + Gigaword 5 6B \u003cbr /\u003e(lemmatized by Plasticity) | [50D](http://magnitude.plasticity.ai/glove/light/glove-lemmatized.6B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/light/glove-lemmatized.6B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/light/glove-lemmatized.6B.200d.magnitude),\u0026nbsp;[300D](http://magnitude.plasticity.ai/glove/light/glove-lemmatized.6B.300d.magnitude) | 
[50D](http://magnitude.plasticity.ai/glove/medium/glove-lemmatized.6B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/medium/glove-lemmatized.6B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/medium/glove-lemmatized.6B.200d.magnitude),\u0026nbsp;[300D](http://magnitude.plasticity.ai/glove/medium/glove-lemmatized.6B.300d.magnitude)     | [50D](http://magnitude.plasticity.ai/glove/heavy/glove-lemmatized.6B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/heavy/glove-lemmatized.6B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/heavy/glove-lemmatized.6B.200d.magnitude),\u0026nbsp;[300D](http://magnitude.plasticity.ai/glove/heavy/glove-lemmatized.6B.300d.magnitude)     |\n| Stanford - [GloVe](https://nlp.stanford.edu/projects/glove/)            | Common Crawl 840B                                               | [300D](http://magnitude.plasticity.ai/glove/light/glove.840B.300d.magnitude)                                                                                                                                                                                                                                                                                            | [300D](http://magnitude.plasticity.ai/glove/medium/glove.840B.300d.magnitude)                                                                                                                                                                                                                                                                                                   | [300D](http://magnitude.plasticity.ai/glove/heavy/glove.840B.300d.magnitude)                                                                                                                                                                                                                                                                                           
     |\n| Stanford - [GloVe](https://nlp.stanford.edu/projects/glove/)            | Twitter 27B                                                     | [25D](http://magnitude.plasticity.ai/glove/light/glove.twitter.27B.25d.magnitude),\u0026nbsp;[50D](http://magnitude.plasticity.ai/glove/light/glove.twitter.27B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/light/glove.twitter.27B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/light/glove.twitter.27B.200d.magnitude)           | [25D](http://magnitude.plasticity.ai/glove/medium/glove.twitter.27B.25d.magnitude),\u0026nbsp;[50D](http://magnitude.plasticity.ai/glove/medium/glove.twitter.27B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/medium/glove.twitter.27B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/medium/glove.twitter.27B.200d.magnitude)               | [25D](http://magnitude.plasticity.ai/glove/heavy/glove.twitter.27B.25d.magnitude),\u0026nbsp;[50D](http://magnitude.plasticity.ai/glove/heavy/glove.twitter.27B.50d.magnitude),\u0026nbsp;[100D](http://magnitude.plasticity.ai/glove/heavy/glove.twitter.27B.100d.magnitude),\u0026nbsp;[200D](http://magnitude.plasticity.ai/glove/heavy/glove.twitter.27B.200d.magnitude)               |\n| Facebook - [fastText](https://fasttext.cc/docs/en/english-vectors.html) | English Wikipedia 2017 16B                                      | [300D](http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M.magnitude)                                                                                                                                                                                                                                                                                       | [300D](http://magnitude.plasticity.ai/fasttext/medium/wiki-news-300d-1M.magnitude)                                                                                                                        
                                                                                                                                                                      | [300D](http://magnitude.plasticity.ai/fasttext/heavy/wiki-news-300d-1M.magnitude)                                                                                                                                                                                                                                                                                           |\n| Facebook - [fastText](https://fasttext.cc/docs/en/english-vectors.html) | English Wikipedia 2017 + subword 16B                            | [300D](http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M-subword.magnitude)                                                                                                                                                                                                                                                                               | [300D](http://magnitude.plasticity.ai/fasttext/medium/wiki-news-300d-1M-subword.magnitude)                                                                                                                                                                                                                                                                                      | [300D](http://magnitude.plasticity.ai/fasttext/heavy/wiki-news-300d-1M-subword.magnitude)                                                                                                                                                                                                                                                                                   |\n| Facebook - [fastText](https://fasttext.cc/docs/en/english-vectors.html) | Common Crawl 600B                                               | [300D](http://magnitude.plasticity.ai/fasttext/light/crawl-300d-2M.magnitude)     
                                                                                                                                                                                                                                                                                      | [300D](http://magnitude.plasticity.ai/fasttext/medium/crawl-300d-2M.magnitude)                                                                                                                                                                                                                                                                                                  | [300D](http://magnitude.plasticity.ai/fasttext/heavy/crawl-300d-2M.magnitude)                                                                                                                                                                                                                                                                                               |\n| AI2 - [AllenNLP ELMo](https://allennlp.org/elmo)                        | [ELMo Models](ELMo.md)                                          | [ELMo Models](ELMo.md)                                                                                                                                                                                                                                                                                                                                                  | [ELMo Models](ELMo.md)                                                                                                                                                                                                                                                                                                                                                          | [ELMo Models](ELMo.md)                                                                                       
                                                                                                                                                                                                                                                               |\n| Google - [BERT](https://github.com/google-research/bert)                | [Coming Soon...](#roadmap)                                      | [Coming Soon...](#roadmap)                                                                                                                                                                                                                                                                                                                                              | [Coming Soon...](#roadmap)                                                                                                                                                                                                                                                                                                                                                      | [Coming Soon...](#roadmap)                                                                                                                                                                                                                                                                                                                                                  |\n\n\nThere are instructions [below](#file-format-and-converter) for converting any `.bin`, `.txt`, `.vec`, `.hdf5` file to a `.magnitude` file.\n\n## Using the Library\n\n### Constructing a Magnitude Object\n\nYou can create a Magnitude object like so:\n```python\nfrom pymagnitude import *\nvectors = Magnitude(\"/path/to/vectors.magnitude\")\n```\n\nIf needed, and included for convenience, you can also open a `.bin`, `.txt`, `.vec`, `.hdf5` file directly with Magnitude. 
This is, however, less efficient and very slow for large models, as it will convert the file to a `.magnitude` file in a temporary directory on the first run. The temporary directory is not guaranteed to persist and does not persist when your computer reboots. You should typically [pre-convert `.bin`, `.txt`, `.vec`, `.hdf5` files with `python -m pymagnitude.converter`](#file-format-and-converter) for faster speeds, but this feature is useful for one-off use-cases. A warning will be generated when instantiating a Magnitude object directly with a `.bin`, `.txt`, `.vec`, or `.hdf5` file. You can suppress warnings by setting the `supress_warnings` argument in the constructor to `True`.\n\n---------------\n\n* \u003csup\u003eBy default, lazy loading is enabled. You can pass in an optional `lazy_loading` argument to the constructor with the value `-1` to disable lazy-loading and pre-load all vectors into memory (a la Gensim), `0` (default) to enable lazy-loading with an unbounded in-memory LRU cache, or an integer greater than zero `X` to enable lazy-loading with an LRU cache that holds the `X` most recently used vectors in memory.\u003c/sup\u003e \n* \u003csup\u003eIf you want the data for the `most_similar` functions to be pre-loaded eagerly on initialization, set `eager` to `True`.\u003c/sup\u003e\n* \u003csup\u003eNote, even when `lazy_loading` is set to `-1` or `eager` is set to `True`, data will be pre-loaded into memory in a background thread to prevent the constructor from blocking for a few minutes for large models. If you really want blocking behavior, you can pass `True` to the `blocking` argument.\u003c/sup\u003e\n* \u003csup\u003eBy default, [unit-length normalized](https://en.wikipedia.org/wiki/Unit_vector) vectors are returned unless you are loading an ELMo model. Set the optional argument `normalized` to `False` if you wish to receive the raw non-normalized vectors instead.\u003c/sup\u003e\n* \u003csup\u003eBy default, NumPy arrays are returned for queries. 
Set the optional argument `use_numpy` to `False` if you wish to receive Python lists instead.\u003c/sup\u003e\n* \u003csup\u003eBy default, querying for keys is case-sensitive. Set the optional argument `case_insensitive` to `True` if you wish to perform case-insensitive searches.\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can include the `pad_to_length` argument which will specify the length all examples should be padded to if passing in multiple examples. Any examples that are longer than the pad length will be truncated.\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can set the `truncate_left` argument to `True` if you want the beginning of the list of keys in each example to be truncated instead of the end in case it is longer than `pad_to_length` when specified.\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can set the `pad_left` argument to `True` if you want the padding to appear at the beginning versus the end (which is the default).\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can pass in the `placeholders` argument, which will increase the dimensions of each vector by a `placeholders` amount, zero-padding those extra dimensions. This is useful if you plan to add other values and information to the vectors and want the space for that pre-allocated in the vectors for efficiency.\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can pass in the `language` argument with an [ISO 639-1 Language Code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), which, if you are using Magnitude for word vectors, will ensure the library respects stemming and other language-specific features for that language. The default is `en` for English. You can also pass in `None` if you are not using Magnitude for word vectors. 
\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can pass in the `dtype` argument which will let you control the data type of the NumPy arrays returned by Magnitude.\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can pass in the `devices` argument which will let you control the usage of GPUs when the underlying model supports GPU usage. This argument should be a list of integers, where each integer represents the GPU device number (`0`, `1`, etc.).\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can pass in the `temp_dir` argument which will let you control the location of the temporary directory Magnitude will use.\u003c/sup\u003e\n* \u003csup\u003eOptionally, you can pass in the `log` argument which will have Magnitude log progress to standard error when slow operations are taking place.\u003c/sup\u003e\n\n### Querying\n\nYou can query the total number of vectors in the file like so:\n```python\nlen(vectors)\n```\n\n---------------\n\nYou can query the dimensions of the vectors like so: \n```python\nvectors.dim\n```\n\n---------------\n\nYou can check if a key is in the vocabulary like so: \n```python\n\"cat\" in vectors\n```\n\n---------------\n\nYou can iterate through all keys and vectors like so:\n```python\nfor key, vector in vectors:\n  ...\n```\n\n---------------\n\nYou can query for the vector of a key like so: \n```python\nvectors.query(\"cat\")\n```\n\n---------------\n\nYou can index for the n-th key and vector like so:\n```python\nvectors[42]\n```\n\n---------------\n\nYou can query for the vector of multiple keys like so: \n```python\nvectors.query([\"I\", \"read\", \"a\", \"book\"])\n```\nA 2D array (keys by vectors) will be returned.\n\n---------------\n\nYou can query for the vector of multiple examples like so: \n```python\nvectors.query([[\"I\", \"read\", \"a\", \"book\"], [\"I\", \"read\", \"a\", \"magazine\"]])\n```\nA 3D array (examples by keys by vectors) will be returned. 
If `pad_to_length` is not specified, and the size of each example is uneven, they will be padded to the length of the longest example.\n\n---------------\n\nYou can index for the keys and vectors of multiple indices like so:\n```python\nvectors[:42] # slice notation\nvectors[42, 1337, 2001] # tuple notation\n```\n\n---------------\n\nYou can query the distance of two or multiple keys like so:\n```python\nvectors.distance(\"cat\", \"dog\")\nvectors.distance(\"cat\", [\"dog\", \"tiger\"])\n```\n\n---------------\n\nYou can query the similarity of two or multiple keys like so:\n```python\nvectors.similarity(\"cat\", \"dog\")\nvectors.similarity(\"cat\", [\"dog\", \"tiger\"])\n```\n\n---------------\n\nYou can query for the most similar key out of a list of keys to a given key like so:\n```python\nvectors.most_similar_to_given(\"cat\", [\"dog\", \"television\", \"laptop\"]) # dog\n```\n\n---------------\n\nYou can query for which key doesn't match the other keys in a list like so:\n```python\nvectors.doesnt_match([\"breakfast\", \"cereal\", \"dinner\", \"lunch\"]) # cereal\n```\n\n---------------\n\nYou can query for the most similar (nearest neighbors) keys like so: \n```python\nvectors.most_similar(\"cat\", topn = 100) # Most similar by key\nvectors.most_similar(vectors.query(\"cat\"), topn = 100) # Most similar by vector\n```\nOptionally, you can pass a `min_similarity` argument to `most_similar`. 
Values in the range [-1.0, 1.0] are valid.\n\n---------------\n\nYou can also query for the most similar keys giving positive and negative examples (which, incidentally, solves analogies) like so: \n```python\nvectors.most_similar(positive = [\"woman\", \"king\"], negative = [\"man\"]) # queen\n```\n\n---------------\n\nSimilar to `vectors.most_similar`, a `vectors.most_similar_cosmul` function exists that uses the 3CosMul function from [Levy and Goldberg](http://www.aclweb.org/anthology/W14-1618):\n```python\nvectors.most_similar_cosmul(positive = [\"woman\", \"king\"], negative = [\"man\"]) # queen\n```\n\n---------------\n\nYou can also query for the most similar keys using an approximate nearest neighbors index which is much faster, but doesn't guarantee the exact answer: \n```python\nvectors.most_similar_approx(\"cat\")\nvectors.most_similar_approx(positive = [\"woman\", \"king\"], negative = [\"man\"])\n```\nOptionally, you can pass an `effort` argument with a value in the range [0.0, 1.0] to the `most_similar_approx` function, which lets you trade accuracy for runtime. 
The default value for `effort` is 1.0 which will take the longest, but will give the most accurate result.\n\n---------------\n\nYou can query for all keys closer to a key than another key is like so:\n```python\nvectors.closer_than(\"cat\", \"rabbit\") # [\"dog\", ...]\n```\n\n---------------\n\nYou can access all of the underlying vectors in the model in a large `numpy.memmap` array of size (`len(vectors) x vectors.emb_dim`) like so:\n\n```python\nvectors.get_vectors_mmap()\n```\n\n---------------\n\nYou can clean up all associated resources, open files, and database connections like so:\n```python\nvectors.close()\n```\n\n### Basic Out-of-Vocabulary Keys\n\nFor word vector representations, handling out-of-vocabulary keys is important to handling new words not in the trained model, handling misspellings and typos, and making models trained on the word vector representations more robust in general.\n\nOut-of-vocabulary keys are handled by assigning them a random vector value. However, the randomness is deterministic. So if the *same* out-of-vocabulary key is encountered twice, it will be assigned the same random vector value for the sake of being able to train on those out-of-vocabulary keys. Moreover, if two out-of-vocabulary keys share similar character n-grams (\"uberx\", \"uberxl\") they will be placed close to each other even if they are both not in the vocabulary:\n\n```python\nvectors = Magnitude(\"/path/to/GoogleNews-vectors-negative300.magnitude\")\n\"uberx\" in vectors # False\n\"uberxl\" in vectors # False\nvectors.query(\"uberx\") # array([ 5.07109939e-02, -7.08248823e-02, -2.74812328e-02, ... ])\nvectors.query(\"uberxl\") # array([ 0.04734962, -0.08237578, -0.0333479, -0.00229564, ... 
])\nvectors.similarity(\"uberx\", \"uberxl\") # 0.955000000200815\n```\n\n### Advanced Out-of-Vocabulary Keys\n\nIf using a Magnitude file with advanced out-of-vocabulary support (Medium or Heavy), out-of-vocabulary keys will also be embedded close to similar keys (determined by string similarity) that *are in* the vocabulary:\n```python\nvectors = Magnitude(\"/path/to/GoogleNews-vectors-negative300.magnitude\")\n\"uberx\" in vectors # False\n\"uberification\" in vectors # False\n\"uber\" in vectors # True\nvectors.similarity(\"uberx\", \"uber\") # 0.7383483267618451\nvectors.similarity(\"uberification\", \"uber\") # 0.745452837882727\n```\n\n#### Handling Misspellings and Typos\nThis also makes Magnitude robust to a lot of spelling errors:\n```python\nvectors = Magnitude(\"/path/to/GoogleNews-vectors-negative300.magnitude\")\n\"missispi\" in vectors # False\nvectors.similarity(\"missispi\", \"mississippi\") # 0.35961736624824003\n\"discrimnatory\" in vectors # False\nvectors.similarity(\"discrimnatory\", \"discriminatory\") # 0.8309152561753461\n\"hiiiiiiiiii\" in vectors # False\nvectors.similarity(\"hiiiiiiiiii\", \"hi\") # 0.7069775034853861\n```\n\nCharacter n-grams are used to create this effect for out-of-vocabulary keys. 
The inspiration for this feature was taken from Facebook AI Research's [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf), but instead of utilizing character n-grams at train time, character n-grams are used at inference so the effect can be somewhat replicated (but not perfectly replicated) in older models that were not trained with character n-grams like word2vec and GloVe.\n\n### Concatenation of Multiple Models\nOptionally, you can combine vectors from multiple models to feed stronger information into a machine learning model like so:\n```python\nfrom pymagnitude import *\nword2vec = Magnitude(\"/path/to/GoogleNews-vectors-negative300.magnitude\")\nglove = Magnitude(\"/path/to/glove.6B.50d.magnitude\")\nvectors = Magnitude(word2vec, glove) # concatenate word2vec with glove\nvectors.query(\"cat\") # returns 350-dimensional NumPy array ('cat' from word2vec concatenated with 'cat' from glove)\nvectors.query((\"cat\", \"cats\")) # returns 350-dimensional NumPy array ('cat' from word2vec concatenated with 'cats' from glove)\n```\n\nYou can concatenate more than two vector models simply by passing more arguments to the constructor.\n\n### Additional Featurization (Parts of Speech, etc.)\nYou can automatically create vectors from additional features you may have such as parts of speech, syntax dependency information, or any other information using the `FeaturizerMagnitude` class:\n\n```python\nfrom pymagnitude import *\npos_vectors = FeaturizerMagnitude(100, namespace = \"PartsOfSpeech\")\npos_vectors.dim # 4 - number of dims automatically determined by Magnitude from 100\npos_vectors.query(\"NN\") # - array([ 0.08040417, -0.71705252,  0.61228951,  0.32322192]) \npos_vectors.query(\"JJ\") # - array([-0.11681135,  0.10259253,  0.8841201 , -0.44063763])\npos_vectors.query(\"NN\") # - array([ 0.08040417, -0.71705252,  0.61228951,  0.32322192]) (deterministic hashing so the same value is returned every time for the same 
key)\ndependency_vectors = FeaturizerMagnitude(100, namespace = \"SyntaxDependencies\")\ndependency_vectors.dim # 4 - number of dims automatically determined by Magnitude from 100\ndependency_vectors.query(\"nsubj\") # - array([-0.81043793,  0.55401352, -0.10838071,  0.15656626])\ndependency_vectors.query(\"prep\") # - array([-0.30862918, -0.44487267, -0.0054573 , -0.84071788])\n```\n\nMagnitude will use the [feature hashing trick](https://en.wikipedia.org/wiki/Feature_hashing) internally to directly use the hash of the feature value to create a unique vector for that feature value.\n\nThe first argument to `FeaturizerMagnitude` should be an approximate upper-bound on the number of values for the feature. Since there are \u003c 100 [parts of speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and \u003c 100 [syntax dependencies](http://universaldependencies.org/u/dep/all.html), we choose 100 for both in the example above. The value chosen will determine how many dimensions Magnitude will automatically assign to the particular `FeaturizerMagnitude` object to reduce the chance of a hash collision. The `namespace` argument can be any string that describes your additional feature. 
It is optional, but highly recommended.\n\nYou can then concatenate these features for use with a standard Magnitude object:\n```python\nfrom pymagnitude import *\nword2vec = Magnitude(\"/path/to/GoogleNews-vectors-negative300.magnitude\")\npos_vectors = FeaturizerMagnitude(100, namespace = \"PartsOfSpeech\")\ndependency_vectors = FeaturizerMagnitude(100, namespace = \"SyntaxDependencies\")\nvectors = Magnitude(word2vec, pos_vectors, dependency_vectors) # concatenate word2vec with pos and dependencies\nvectors.query([\n    (\"I\", \"PRP\", \"nsubj\"), \n    (\"saw\", \"VBD\", \"ROOT\"), \n    (\"a\", \"DT\", \"det\"), \n    (\"cat\", \"NN\", \"dobj\"), \n    (\".\",  \".\", \"punct\")\n  ]) # array of size 5 x (300 + 4 + 4) or 5 x 308\n\n# Or get a unique vector for every 'buffalo' in:\n# \"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo\"\n# (https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo)\nvectors.query([\n    (\"Buffalo\", \"JJ\", \"amod\"), \n    (\"buffalo\", \"NNS\", \"nsubj\"), \n    (\"Buffalo\", \"JJ\", \"amod\"), \n    (\"buffalo\", \"NNS\", \"nsubj\"), \n    (\"buffalo\",  \"VBP\", \"rcmod\"),\n    (\"buffalo\",  \"VB\", \"ROOT\"),\n    (\"Buffalo\",  \"JJ\", \"amod\"),\n    (\"buffalo\",  \"NNS\", \"dobj\")\n  ]) # array of size 8 x (300 + 4 + 4) or 8 x 308\n\n```\n\nA machine learning model, given this output, now has access to parts of speech information and syntax dependency information instead of just word vector information. In this case, this additional information can give neural networks stronger signal for semantic information and reduce the need for training data.\n\n### Using Magnitude with a ML library\nMagnitude makes it very easy to quickly build and iterate on models that need to use vector representations by taking care of a lot of pre-processing code to convert a dataset of text (or keys) into vectors. 
Moreover, it can make these models more robust to [out-of-vocabulary words](#advanced-out-of-vocabulary-keys) and [misspellings](#handling-misspellings-and-typos).\n\nThere is example code available using Magnitude to build an intent classification model for the [ATIS (Airline Travel Information Systems) dataset](https://catalog.ldc.upenn.edu/docs/LDC93S4B/corpus.html) ([Train](http://magnitude.plasticity.ai/data/atis/atis-intent-train.txt)/[Test](http://magnitude.plasticity.ai/data/atis/atis-intent-test.txt)), used for chatbots or conversational interfaces, in a few popular machine learning libraries below.\n\n#### Keras\nYou can access a guide for using Magnitude with Keras (which supports TensorFlow, Theano, CNTK) at this [Google Colaboratory Python notebook](https://colab.research.google.com/drive/1lOcAhIffLW8XC6QsKzt5T_ZqPP4Y9eS4).\n\n#### PyTorch\n*The PyTorch guide is coming soon.*\n\n#### TFLearn\n*The TFLearn guide is coming soon.*\n\n### Utils\n\nYou can use the `MagnitudeUtils` class for convenient access to functions that may be useful when creating machine learning models.\n\nYou can import MagnitudeUtils like so:\n```python\n  from pymagnitude import MagnitudeUtils\n```\n\nYou can download a Magnitude model from a remote source like so:\n```python\n  vecs = Magnitude(MagnitudeUtils.download_model('word2vec/heavy/GoogleNews-vectors-negative300'))\n```\n\nBy default, `download_model` will download files from `http://magnitude.plasticity.ai` to a `~/.magnitude` folder created automatically. If the file has already been downloaded, it will not be downloaded again. You can change the directory of the local download folder using the optional `download_dir` argument. 
You can change the domain from which models will be downloaded with the optional `remote_path` argument.\n\nYou can create a batch generator for `X` and `y` data with `batchify`, like so:\n```python\n  X = [.3, .2, .7, .8, .1]\n  y = [0, 0, 1, 1, 0]\n  batch_gen = MagnitudeUtils.batchify(X, y, 2)\n  for X_batch, y_batch in batch_gen:\n    print(X_batch, y_batch)\n  # Returns:\n  # 1st loop: X_batch = [.3, .2], y_batch = [0, 0]\n  # 2nd loop: X_batch = [.7, .8], y_batch = [1, 1]\n  # 3rd loop: X_batch = [.1], y_batch = [0]\n  # next loop: repeats infinitely...\n```\n\nYou can encode class labels to integers and back with `class_encoding`, like so:\n```python\n  add_class, class_to_int, int_to_class = MagnitudeUtils.class_encoding()\n  add_class(\"cat\") # Returns: 0\n  add_class(\"dog\") # Returns: 1\n  add_class(\"cat\") # Returns: 0\n  class_to_int(\"dog\") # Returns: 1\n  class_to_int(\"cat\") # Returns: 0\n  int_to_class(1) # Returns: \"dog\"\n  int_to_class(0) # Returns: \"cat\"\n```\n\nYou can convert categorical data with class integers to one-hot NumPy arrays with `to_categorical`, like so:\n```python\n  y = [1, 5, 2]\n  MagnitudeUtils.to_categorical(y, num_classes = 6) # num_classes is optional\n  # Returns: \n  # array([[0., 1., 0., 0., 0., 0.] \n  #       [0., 0., 0., 0., 0., 1.] \n  #       [0., 0., 1., 0., 0., 0.]])\n```\n\nYou can convert from one-hot NumPy arrays back to a 1D NumPy array of class integers with `from_categorical`, like so:\n```python\n  y_c = [[0., 1., 0., 0., 0., 0.],\n         [0., 0., 0., 0., 0., 1.]]\n  MagnitudeUtils.from_categorical(y_c)\n  # Returns: \n  # array([1., 5.])\n```\n\n## Concurrency and Parallelism\nThe library is thread safe (it uses a different connection to the underlying store per thread), is read-only, and it never writes to the file. 
Because of the light-memory usage, you can also run it in multiple processes (or use `multiprocessing`) with different address spaces without having to duplicate the data in-memory like with other libraries and without having to create a multi-process shared variable since data is read off-disk and each process keeps its own LRU memory cache. For heavier functions, like `most_similar` a shared memory mapped file is created to share memory between processes.\n\n## File Format and Converter\nThe Magnitude package uses the `.magnitude` file format instead of `.bin`, `.txt`, `.vec`, or `.hdf5` as with other vector models like word2vec, GloVe, fastText, and ELMo. There is an included command-line utility for converting word2vec, GloVe, fastText, and ELMo files to Magnitude files.\n\nYou can convert them like so:\n```bash\npython -m pymagnitude.converter -i \u003cPATH TO FILE TO BE CONVERTED\u003e -o \u003cOUTPUT PATH FOR MAGNITUDE FILE\u003e\n```\n\nThe input format will automatically be determined by the extension / the contents of the input file. You should only need to perform this conversion once for a model. After converting, the Magnitude file format is static and it will not be modified or written to make concurrent read access safe.\n\nThe flags for  `pymagnitude.converter` are specified below:\n* You can pass in the `-h` flag for help and to list all flags.\n* You can use the `-p \u003cPRECISION\u003e` flag to specify the decimal precision to retain (selecting a lower number will create smaller files). The actual underlying values are stored as integers instead of floats so this is essentially [quantization](https://www.tensorflow.org/performance/quantization) for smaller model footprints.\n* You can add an approximate nearest neighbors index to the file (increases size) with the `-a` flag which will enable the use of the `most_similar_approx` function. 
The `-t \u003cTREES\u003e` flag controls the number of trees in the approximate nearest neighbors index (higher is more accurate) when used in conjunction with the `-a` flag (if not supplied, the number of trees is automatically determined).\n* You can pass the `-s` flag to disable adding subword information to the file (which will make the file smaller), but this also disables advanced out-of-vocabulary key support.\n* If converting a model that has no vocabulary like ELMo, you can pass the `-v` flag along with the path to another Magnitude file you would like to take the vocabulary from.\n\nOptionally, you can bulk convert many files by passing an input folder and output folder instead of an input file and output file. All `.txt`, `.bin`, `.vec`, `.hdf5` files in the input folder will be converted to `.magnitude` files in the output folder. The output folder must exist before a bulk conversion operation.\n\n## Remote Loading\nYou can instruct Magnitude to download and open a model from Magnitude's remote repository instead of a local file path. The file will automatically be downloaded locally on the first run to `~/.magnitude/`, and subsequent runs will skip the download if the file already exists locally.\n\n```python\n  vecs = Magnitude('http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude') # full url\n  vecs = Magnitude('word2vec/heavy/GoogleNews-vectors-negative300') # or, use the shorthand for the url\n```\n\nFor more control over the remote download domain and local download directory, see how to use [`MagnitudeUtils.download_model`](#utils).\n\n## Remote Streaming over HTTP\n\nMagnitude models are generally large files (multiple GB) that take up a lot of disk space, even though the `.magnitude` format makes it fast to utilize the vectors. Magnitude has an option to stream these large files over HTTP. 
\nThis is explicitly different from the [remote loading feature](#remote-loading), in that the model doesn't even need to be downloaded at all. You can begin querying models immediately with no disk space used at all. \n\n\n```python\n  vecs = Magnitude('http://magnitude.plasticity.ai/word2vec/heavy/GoogleNews-vectors-negative300.magnitude', stream=True) # full url\n  vecs = Magnitude('word2vec/heavy/GoogleNews-vectors-negative300', stream=True) # or, use the shorthand for the url\n\n  vecs.query(\"king\") # Returns: the vector for \"king\" quickly, even with no local model file downloaded\n```\n\nYou can play around with a demo of this in a [Google Colaboratory Python Notebook](https://colab.research.google.com/drive/1zkPhoNM1NvbTmEk9gr0Jnt8hONrca1Fv).\n\nThis feature is extremely useful if your computing environment is resource constrained (low RAM and low disk space), you want to experiment quickly with vectors without downloading and setting up large model files, or you are training a small model.\nWhile there is some added network latency since the data is being streamed, Magnitude will still use an in-memory cache as specified by the [`lazy_loading`](#constructing-a-magnitude-object) constructor parameter. Since languages generally have a [Zipf-ian distribution](https://en.wikipedia.org/wiki/Zipf%27s_law), the network latency should largely not be an issue once the cache has been warmed by a small number of queries.\n\nThe vectors will be queried directly off a static HTTP web server using [HTTP Range Request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests) headers. All Magnitude methods support streaming, however, `most_similar` and `most_similar_approx`\nmay be slow as they are not optimized for streaming [yet](#roadmap). 
You can see how this streaming mode [performs currently in the benchmarks](#benchmarks-and-features), however, it will get faster as we [optimize it in the future](#roadmap)!\n\n## Other Documentation\nOther documentation is not available at this time. See the source file directly (it is well commented) if you need more information about a method's arguments or want to see all supported features.\n\n## Other Languages\nCurrently, we only provide English word vector models on this page pre-converted to the `.magnitude` format. You can, however, still use Magnitude with word vectors of other languages. Facebook has trained their [fastText vectors for many different languages](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). You can download the `.vec` file for any language you want and then convert it to `.magnitude` with the [converter](#file-format-and-converter).\n\n## Other Programming Languages\nCurrently, reading Magnitude files is only supported in Python, since it has become the de-facto language for machine learning. This is sufficient for most use cases. Extending the file format to other languages shouldn't be difficult as SQLite has a native C implementation and has bindings in most languages. The file format itself and the protocol for reading and searching is also fairly straightforward upon reading the source code of this repository.\n\n## Other Domains\nCurrently, natural language processing is the most popular domain that uses pre-trained vector embedding models for word vector representations. There are, however, other domains like computer vision that have started using pre-trained vector embedding models like [Deep1B](https://github.com/arbabenko/GNOIMI) for image representation. 
This library intends to stay agnostic to various domains and instead provides a generic key-vector store and interface that is useful for all domains.\n\n## Contributing\nThe main repository for this project can be found on [GitLab](https://gitlab.com/Plasticity/magnitude). The [GitHub repository](https://github.com/plasticityai/magnitude) is only a mirror. Pull requests for more tests, better error-checking, bug fixes, performance improvements, documentation, or additional utilities / functionalities are welcome on [GitLab](https://gitlab.com/Plasticity/magnitude).\n\nYou can contact us at [opensource@plasticity.ai](mailto:opensource@plasticity.ai).\n\n## Roadmap\n\n* Speed optimizations on remote streaming and exposing stream cache configuration options\n* Make `most_similar_approx` optimized for streaming\n* In addition to the \"Light\", \"Medium\", and \"Heavy\" flavors, add a \"Ludicrous\" flavor that will be of an even larger file size but removes the constraint of the initially slow `most_similar` lookups.\n* Add Google BERT support\n* Support fastText `.bin` format\n\n## Other Notable Projects\n* [spotify/annoy](https://github.com/spotify/annoy) - Powers the approximate nearest neighbors algorithm behind `most_similar_approx` in Magnitude using random-projection trees and hierarchical 2-means. 
Thanks to author [Erik Bernhardsson](https://github.com/erikbern) for helping out with some of the integration details between Magnitude and Annoy.\n\n## Citing this Repository\n\nIf you'd like to [cite our paper at EMNLP 2018](http://aclweb.org/anthology/D18-2021), you can use the following BibTeX citation:\n```latex\n@inproceedings{patel2018magnitude,\n  title={Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package},\n  author={Patel, Ajay and Sands, Alexander and Callison-Burch, Chris and Apidianaki, Marianna},\n  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},\n  pages={120--126},\n  year={2018}\n}\n```\nor follow the [Google Scholar link](https://scholar.google.com/scholar?cluster=5916903042122216495\u0026hl=en\u0026as_sdt=0,5) for other ways to cite the paper.\n\nIf you'd like to cite this repository you can use the following DOI badge: \u0026nbsp;[![DOI](https://zenodo.org/badge/122715432.svg)](https://zenodo.org/badge/latestdoi/122715432)\n\nClicking on the badge will lead to a page that will help you generate proper BibTeX citations, JSON-LD citations, and other citations.\n\n## LICENSE and Attribution\n\nThis repository is licensed under the license found [here](LICENSE.txt).\n\n“[Seismic](https://thenounproject.com/ziman.jan/collection/weather/?i=1518266)” icon by JohnnyZi from the [Noun Project](https://thenounproject.com).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplasticityai%2Fmagnitude","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplasticityai%2Fmagnitude","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplasticityai%2Fmagnitude/lists"}