{"id":13531934,"url":"https://github.com/facebookresearch/distributed-faiss","last_synced_at":"2025-04-01T20:30:51.259Z","repository":{"id":40302723,"uuid":"472736085","full_name":"facebookresearch/distributed-faiss","owner":"facebookresearch","description":"A library for building and serving multi-node distributed faiss indices.","archived":true,"fork":false,"pushed_at":"2023-11-01T15:59:47.000Z","size":899,"stargazers_count":262,"open_issues_count":3,"forks_count":19,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-02-23T00:18:10.449Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-03-22T11:25:33.000Z","updated_at":"2025-02-18T06:58:38.000Z","dependencies_parsed_at":"2024-01-12T04:56:37.543Z","dependency_job_id":"31cd3e2f-6326-428e-a3fe-6f58a02a59f8","html_url":"https://github.com/facebookresearch/distributed-faiss","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fdistributed-faiss","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fdistributed-faiss/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fdistributed-faiss/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2Fdistributed-faiss/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/distributed-faiss/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246709923,"owners_count":20821297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T07:01:07.006Z","updated_at":"2025-04-01T20:30:46.250Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["Python","Awesome Vector Search Engine"],"sub_categories":["Library"],"readme":"# About\nDistributed faiss index service. A lightweight library that lets you work with FAISS indexes that don't fit into the memory of a single server. It follows a simple concept: a set of index server processes running in complete isolation from each other. All coordination is done on the client side. This simplified many-to-many client-to-server architecture is flexible and is designed specifically for research projects, as opposed to more complicated solutions that mostly aim at production usage and transactionality support.\nThe data is sharded over several indexes on different servers in RAM. The search client aggregates results from different servers during retrieval. The service is model-independent and operates on supplied embeddings and metadata.\n\n### Features:\n* Multiple clients connect to all servers via RPC.\n* At indexing time: clients balance data across servers. 
The client sends the next available batch of embeddings to a server selected in a round-robin fashion.\n* The index client aggregates results from different servers during retrieval. It queries all the servers and uses a heap to find the final results.\n* The API allows sending and storing any additional metadata (e.g. raw bpe, language information, etc).\n* Launch servers with submitit.\n* Save/load the index/metadata periodically. Can restore from a stopped index state.\n* Supports several indexes at the same time (e.g. one index per language, or different versions of the same index).\n* The API tries to optimize for network bandwidth.\n* Flexible index configuration.\n\n\u003cimg src=\"design_schema.png\" width=\"900\"\u003e\n\n### Installation\n`pip install -e .`\n\n\n### Testing\n`python -m unittest discover tests`\n\nor \n```bash\npip install pytest\npytest tests\n```\n\n### Code formatting\n`black --line-length 100 .`\n\n# Usage\n## Starting the index servers\ndistributed-faiss consists of server and client parts, which are meant to be launched as separate services. 
\nThe set of server processes can be launched either through the API or with the provided launch tool, which uses the [`submitit`](https://github.com/facebookincubator/submitit) library and works on clusters managed by the SLURM job scheduling system.\n\n\n\n## Launching servers with submitit on SLURM managed clusters\nExample:\n\n```bash\npython scripts/server_launcher.py \\\n    --log-dir /logs/distr-faiss/ \\\n    --discovery-config /tmp/discover_config.txt \\\n    --save-dir $HOME/dfaiss_data \\\n    --num-servers 64 \\\n    --num-servers-per-node 32 \\\n    --timeout-min 4320 \\\n    --mem-gb 400 \\\n    --base-port 12033 \\\n    --partition dev \u0026\n```\nThis launches a job running 64 servers in the background. Clients can then read `/tmp/discover_config.txt` to discover the servers.\n\nTo view the logs (which are verbose but informative), run something like:\n`watch 'tail /logs/distr-faiss/34785924_0_log.err'`\nwhere `34785924` is the SLURM job id you were allocated.\n\n\n## Launching servers using API\nYou can run each index server process independently using the following API:\n\n```python\nserver = IndexServer(global_rank, index_storage_dir)\nserver.start_blocking(port, load_index=True)\n```\n\nThe rank of the server node is needed for reading/writing its own part of the index from/to files. Indexes are dumped to files for persistent storage. The filesystem path convention is that there is a shared folder for the entire logical index, with each server node working in its own sub-folder inside it.\n`index_storage_dir` is the default location for storing indexes. 
It can be overridden for each logical index by specifying this attribute in the index configuration object (see the client code examples below).\nWhen you start a server node on a specific machine and port, you need to write the host and port as a line to a specific file, which can later be used to start a client.\n\n\n## Client API\nEach client process is supposed to work with all the server nodes and does all the data balancing among them. Client processes can run independently of each other and work with the same set of server nodes simultaneously.\n\n```python\nindex_client = IndexClient(discovery_config)\n```\ndiscovery_config is the path to the shared FS file which was used to start the set of servers and contains all the (host, port) info needed to connect to them.\n\n## Creating an index\nEach client and server node can work with multiple logical indexes (consider them as fully separate tables in an SQL database).\nEach logical index can have its own faiss-related configuration, FS location and other parameters which affect its creation logic.\nExample of creating a simple IVF index:\n\n```python\nindex_client = IndexClient(discovery_config)\nidx_cfg = IndexCfg(\n    index_builder_type='ivf_simple',\n    dim=128,\n    train_num=10000,\n    centroids=64,\n    metric='dot',\n    nprobe=12,\n    index_storage_dir='path/to/your/index',\n)\nindex_id = 'your logic index str id'\nindex_client.create_index(index_id, idx_cfg)\n```\n\n## Index configuration\n\n`IndexCfg` has multiple attributes to set the FAISS index type.\nList of values for `index_builder_type` attribute:\n- `flat`, \n- `ivf_simple`,\n- `knnlm`, corresponds to `IndexIVFPQ`,\n- `hnswsq`, corresponds to `IndexHNSWSQ`,\n- `ivfsq`, corresponds to `IndexIVFScalarQuantizer`,\n- `ivf_gpu` is a gpu version of `IVF`.\n\nAlternatively, if `index_builder_type` is not specified, one can set `faiss_factory` just like in the FAISS API factory call `faiss.index_factory(...)`.\n\nThe following attributes define the way the index is 
created:\n- `train_num` - if specified, sets the number of samples used for index training.\n- `train_ratio` - the same as `train_num` but as a ratio of the total data size.\n\nData sent for indexing is aggregated in memory until the `train_num` threshold is exceeded. \nPlease refer to the diagram below for the server and client side interactions and steps.\n\n\u003cimg src=\"detailed_design.png\" width=\"900\"\u003e\n\n## Client side operations\nOnce the index has been created, one can send batches of numpy arrays coupled with arbitrary metadata (which should be picklable):\n\n```python\nindex.add_index_data(index_id, vector_chunk, list_of_metadata)\n```\nIndex training and creation are done asynchronously with respect to the `add_index_data()` operation, so index processing may take a long time after all the data has been sent.\nTo check whether all server nodes have finished building the index, it is recommended to use the following snippet:\n\n```python\nwhile index.get_state(index_id) != IndexState.TRAINED:\n    time.sleep(some_time)\n```\n\nOnce the index is ready, one can query it:\n```python\nscores, meta = index.search(query, 10, index_id, return_embeddings=False)  # topk=10\n```\nquery is a query vector batch as a numpy array. return_embeddings controls whether the search result vectors are returned in addition to the metadata. 
If it is set to true, the result tuple will contain the vectors as the third element.\n\n## Loading Data\nThe following two commands load a medium sized mmap into distributed-faiss in about 1 minute:\n\nFirst, launch 64 servers in the background:\n```bash\npython scripts/server_launcher.py \\\n    --log-dir /logs/distr-faiss/ \\\n    --discovery-config /tmp/discover_config.txt \\\n    --save-dir $HOME/dfaiss_data \\\n    --num-servers 64 \\\n    --num-servers-per-node 32 \\\n    --timeout-min 4320 \\\n    --mem-gb 400 \\\n    --base-port 12033 \\\n    --partition dev \u0026\n```\nOnce you receive your allocation, load in the data with\n\n```bash\npython scripts/load_data.py \\\n    --discover /tmp/discover_config.txt \\\n    --mmap $HOME/dfaiss_data/random_1000000000_768_fp16.mmap \\\n    --mmap-size 1000000000 \\\n    --dimension 768 \\\n    --dstore-fp16 \\\n    --cfg scripts/idx_cfg.json\n```\n\nModify `scripts/load_data.py` to load other data formats.\n\n# Reference\nReference to cite when using `distributed-faiss` in a research paper:\n```\n@article{DBLP:journals/corr/abs-2112-09924,\n  author    = {Aleksandra Piktus and\n               Fabio Petroni and\n               Vladimir Karpukhin and\n               Dmytro Okhonko and\n               Samuel Broscheit and\n               Gautier Izacard and\n               Patrick Lewis and\n               Barlas Oguz and\n               Edouard Grave and\n               Wen{-}tau Yih and\n               Sebastian Riedel},\n  title     = {The Web Is Your Oyster - Knowledge-Intensive {NLP} against a Very\n               Large Web Corpus},\n  journal   = {CoRR},\n  volume    = {abs/2112.09924},\n  year      = {2021},\n  url       = {https://arxiv.org/abs/2112.09924},\n  eprinttype = {arXiv},\n  eprint    = {2112.09924},\n  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},\n  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-09924.bib},\n  bibsource = {dblp computer science bibliography, 
https://dblp.org}\n}\n```\n\nYou can access the paper [here](https://arxiv.org/abs/2112.09924).\n\n# License\n`distributed-faiss` is released under the CC-BY-NC 4.0 license. See the `LICENSE` file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fdistributed-faiss","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2Fdistributed-faiss","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2Fdistributed-faiss/lists"}