{"id":13935806,"url":"https://github.com/ThoughtRiver/lmdb-embeddings","last_synced_at":"2025-07-19T21:30:42.461Z","repository":{"id":49148055,"uuid":"147229508","full_name":"ThoughtRiver/lmdb-embeddings","owner":"ThoughtRiver","description":"Fast word vectors with little memory usage in Python","archived":false,"fork":false,"pushed_at":"2021-06-26T13:12:01.000Z","size":69,"stargazers_count":416,"open_issues_count":2,"forks_count":30,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-06-04T02:20:47.905Z","etag":null,"topics":["embeddings","fasttext","gensim","glove","lmdb","magnitude","memory","speed","text","vectors","word","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ThoughtRiver.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-03T16:31:18.000Z","updated_at":"2025-05-31T01:55:00.000Z","dependencies_parsed_at":"2022-08-28T00:02:13.555Z","dependency_job_id":null,"html_url":"https://github.com/ThoughtRiver/lmdb-embeddings","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/ThoughtRiver/lmdb-embeddings","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThoughtRiver%2Flmdb-embeddings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThoughtRiver%2Flmdb-embeddings/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThoughtRiver%2Flmdb-embeddings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThoughtRiver%2Flmdb-embeddings/manifests","owner_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ThoughtRiver","download_url":"https://codeload.github.com/ThoughtRiver/lmdb-embeddings/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThoughtRiver%2Flmdb-embeddings/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266019657,"owners_count":23864916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","fasttext","gensim","glove","lmdb","magnitude","memory","speed","text","vectors","word","word2vec"],"created_at":"2024-08-07T23:02:06.558Z","updated_at":"2025-07-19T21:30:42.454Z","avatar_url":"https://github.com/ThoughtRiver.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"![lmdb-embeddings](https://socialify.git.ci/ThoughtRiver/lmdb-embeddings/image?description=1\u0026font=Raleway\u0026forks=1\u0026language=1\u0026logo=https%3A%2F%2Fuser-images.githubusercontent.com%2F10864294%2F29792093-382146cc-8c37-11e7-9e70-6f71b3d0800b.png\u0026owner=1\u0026pattern=Plus\u0026stargazers=1\u0026theme=Light)\n\n[![Build Status](https://travis-ci.org/ThoughtRiver/lmdb-embeddings.svg?branch=master)](https://travis-ci.org/ThoughtRiver/lmdb-embeddings)\n\n# LMDB Embeddings\nQuery word vectors (embeddings) very quickly with very little querying time overhead and far less memory usage than gensim or other equivalent solutions. 
This is made possible by the [Lightning Memory-Mapped Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) (LMDB).\n\nInspired by [Delft](https://github.com/kermitt2/delft). As explained in their readme, this approach makes the pre-trained embeddings immediately \"warm\" (no load time), frees memory, and allows any number of embeddings to be used simultaneously with negligible impact on runtime when using an SSD.\n\nFor instance, in a traditional approach `glove-840B` takes around 2 minutes to load and 4GB of memory. Managed with LMDB, `glove-840B` can be accessed immediately and takes only a couple of MB of memory, with a negligible impact on runtime (around 1% slower).\n\n## Installation\n```bash\npip install lmdb-embeddings\n```\n\n## Reading vectors\n```python\nfrom lmdb_embeddings.reader import LmdbEmbeddingsReader\nfrom lmdb_embeddings.exceptions import MissingWordError\n\nembeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')\n\ntry:\n    vector = embeddings.get_word_vector('google')\nexcept MissingWordError:\n    # 'google' is not in the database.\n    pass\n```\n\n## Writing vectors\nThe example below writes an LMDB vector file from a gensim model. 
Any iterator that yields word and vector pairs is supported, so if your vectors are in a different format, you only need to adapt the `iter_embeddings` function below.\n\nI will be writing a CLI to convert standard formats soon.\n\n```python\nfrom gensim.models.keyedvectors import KeyedVectors\nfrom lmdb_embeddings.writer import LmdbEmbeddingsWriter\n\n\nGOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'\nOUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'\n\n\nprint('Loading gensim model...')\ngensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary=True)\n\n\ndef iter_embeddings():\n    # gensim \u003e= 4.0 exposes the vocabulary as `key_to_index`;\n    # on older versions, iterate over `gensim_model.vocab` instead.\n    for word in gensim_model.key_to_index:\n        yield word, gensim_model[word]\n\nprint('Writing vectors to an LMDB database...')\n\nwriter = LmdbEmbeddingsWriter(iter_embeddings()).write(OUTPUT_DATABASE_FOLDER)\n\n# These vectors can now be loaded with the LmdbEmbeddingsReader.\n```\n\n## LRU Cache\nA reader with an LRU (Least Recently Used) cache is included. It keeps the embeddings for the 50,000 most recently queried words and returns the same object instead of querying the database each time. Its interface is the same as the standard reader.\nSee [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache) in the standard library.\n\n```python\nfrom lmdb_embeddings.reader import LruCachedLmdbEmbeddingsReader\nfrom lmdb_embeddings.exceptions import MissingWordError\n\nembeddings = LruCachedLmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')\n\ntry:\n    vector = embeddings.get_word_vector('google')\nexcept MissingWordError:\n    # 'google' is not in the database.\n    pass\n```\n\n## Customisation\nBy default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). 
However, it is easy to use an alternative approach - simply inject the serializer and unserializer as callables into the `LmdbEmbeddingsWriter` and `LmdbEmbeddingsReader`.\n\nA [msgpack](https://msgpack.org/index.html) serializer is included and can be used in the same way.\n\n```python\nfrom lmdb_embeddings.writer import LmdbEmbeddingsWriter\nfrom lmdb_embeddings.serializers import MsgpackSerializer\n\nwriter = LmdbEmbeddingsWriter(\n    iter_embeddings(),\n    serializer=MsgpackSerializer().serialize\n).write(OUTPUT_DATABASE_FOLDER)\n```\n\n```python\nfrom lmdb_embeddings.reader import LmdbEmbeddingsReader\nfrom lmdb_embeddings.serializers import MsgpackSerializer\n\nreader = LmdbEmbeddingsReader(\n    OUTPUT_DATABASE_FOLDER,\n    unserializer=MsgpackSerializer().unserialize\n)\n```\n\n## Running tests\n```\npytest\n```\n\n## Author\n\n- GitHub: [DomHudson](https://github.com/DomHudson)\n\n## Contributing\n\nContributions, issues and feature requests are welcome!\n\n## Show your support\n\nGive a ⭐️ if this project helped you!\n\n## License\n\nCopyright © 2019 [ThoughtRiver](https://github.com/thoughtriver). \u003cbr /\u003e\nThis project is [GPL-3.0](https://github.com/ThoughtRiver/lmdb-embeddings/blob/master/LICENSE) licensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FThoughtRiver%2Flmdb-embeddings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FThoughtRiver%2Flmdb-embeddings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FThoughtRiver%2Flmdb-embeddings/lists"}