{"id":13631260,"url":"https://github.com/baidu/puck","last_synced_at":"2025-05-16T10:07:09.335Z","repository":{"id":190576099,"uuid":"679064911","full_name":"baidu/puck","owner":"baidu","description":"Puck is a high-performance ANN search engine","archived":false,"fork":false,"pushed_at":"2024-11-21T10:58:21.000Z","size":5730,"stargazers_count":350,"open_issues_count":4,"forks_count":39,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-04-09T04:08:04.020Z","etag":null,"topics":["ann","benchmark","search"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baidu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-16T02:56:07.000Z","updated_at":"2025-04-06T04:03:00.000Z","dependencies_parsed_at":"2023-08-25T10:10:26.217Z","dependency_job_id":"d36c19bf-11cf-44ff-ae8b-842f67b11c1f","html_url":"https://github.com/baidu/puck","commit_stats":null,"previous_names":["baidu/puck"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2Fpuck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2Fpuck/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2Fpuck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu%2Fpuck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baidu","download_url":"https://codeload.github.com/baidu/puck/tar.gz/refs
/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254509476,"owners_count":22082891,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ann","benchmark","search"],"created_at":"2024-08-01T22:02:18.171Z","updated_at":"2025-05-16T10:07:04.326Z","avatar_url":"https://github.com/baidu.png","language":"Jupyter Notebook","funding_links":[],"categories":["Open Sources","Jupyter Notebook","Multidimensional data / Vectors"],"sub_categories":[],"readme":"## Description\nPuck is a library for approximate nearest neighbor (ANN) search.\nIn industrial deployments, limited memory, expensive compute resources, and growing database size matter as much as the recall-vs-latency tradeoff for all search applications.\nAs retrieval services grow rapidly, demand is rising for a strong recall-vs-latency tradeoff under precious but finite resources. Puck was created to meet this need.\n\nIt contains two algorithms, Puck and Tinker. \nThis project is written in C++ with wrappers for Python 3.  \nPuck is an efficient approach for large-scale datasets and had the best performance on multiple 1B datasets in the [NeurIPS'21 competition track](https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/neurips21/t1_t2/README.md#results-for-t1).\nSince then, performance of Puck has increased by 70%. 
\nPuck combines a two-layer inverted index architecture with multi-level quantization of the dataset.\nIf memory is likely to be a bottleneck, Puck is the better fit.  \nTinker is an efficient approach for smaller datasets (such as 10M or 100M vectors) and performs better than Nmslib in big-ann-benchmarks. \nTinker stores the relationships among similar points, so it needs more memory than Puck, but it also delivers better search performance than Puck. If you want better search performance and memory usage is not a concern, Tinker is the better choice.\n\n## Introduction\n\nThis project supports cosine similarity, L2 (Euclidean) and IP (Inner Product, conditioned).\nWhen two vectors are normalized, the squared L2 distance equals 2 - 2 * cos.\nIP2COS is a transform method that converts IP distance to cos distance.\nThe distance value in search results is always L2.  \n\nPuck uses compressed vectors (after PQ) instead of the original vectors, so by default the memory cost is just over 1/4 of that of the original vectors.\nAs the dataset grows, Puck's advantage becomes more obvious.  \nTinker needs to store the relationships among similar points, so by default its memory cost is higher than that of the original vectors (but lower than Nmslib's).\nPlease see [this readme](./ann-benchmarks/README.md) for more performance details and benchmarks.\n\n## Linux install\n\n### 1.The prerequisites are MKL, Python and CMake.\n**MKL**:  MKL must be installed to compile Puck. Download the MKL installation package for your operating system from the official website, and configure the installation path once the installation is complete.\nThen source the MKL component environment script, e.g. `source ${INSTALL_PATH}/mkl/latest/env/vars.sh`. 
This sets several environment variables, such as MKLROOT.\n\nhttps://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html\n\n**python**: Version higher than 3.6.0.\n\n**cmake**:  Version higher than 3.21.\n### 2.Clone this project.\n````shell\ngit clone https://github.com/baidu/puck.git\ncd puck\n````\n\n### 3.Use cmake to build this project.\n##### 3.1 Build this project\n````shell\ncmake -DCMAKE_BUILD_TYPE=Release \\\n    -DMKLROOT=${MKLROOT} \\\n    -DBLA_VENDOR=Intel10_64lp_seq \\\n    -DBLA_STATIC=ON  \\\n    -B build .\n\ncd build \u0026\u0026 make \u0026\u0026 make install\n````\n##### 3.2 Build with GTEST \nSet the conditional compilation variable WITH_TESTING.\n````shell\ncmake -DCMAKE_BUILD_TYPE=Release \\\n    -DMKLROOT=${MKLROOT} \\\n    -DBLA_VENDOR=Intel10_64lp_seq \\\n    -DBLA_STATIC=ON  \\\n    -DWITH_TESTING=ON \\\n    -B build .\n\ncd build \u0026\u0026 make \u0026\u0026 make install\n````\n\n##### 3.3 Build with Python\n\nRefer to the [Dockerfile](./ann-benchmarks/install/Dockerfile.puck_inmem)\n````shell\npython3 setup.py install \n````\n\nOutput files are saved in the build/output subdirectory by default.\n\n## How to use\nOutput files include demos of the train, build and search tools.  \nThe train and build tools are in the build/output/build_tools subdirectory.  
\nThe search demo tools are in the build/output/bin subdirectory.\n\n### 1.Format vector dataset for train and build\nVectors are stored as raw little-endian data.\nIn the .fvecs format, each vector takes 4+d*4 bytes, where d is the dimensionality of the vector.\n\n### 2.Train \u0026 build\nThe default train configuration file is \"build/output/build_tools/conf/puck_train.conf\".\nThe length of each feature vector must be set in the train configuration file (feature_dim).\n\n````shell\ncd output/build_tools\ncp YOUR_FEATURE_FILE puck_index/all_data.feat.bin\nsh script/puck_train_control.sh -t -b\n````\n\nIndex files are saved in the puck_index subdirectory by default.\n\n### 3.Search\nDuring search, the default index file path is './puck_index'.  \nFor the query file format, refer to the [demo](./tools/demo/init-feature-example).  \nSearch parameters can be modified using a configuration file; refer to the [demo](./demo/conf/puck.conf).\n\n````shell\ncd output/\nln -s build_tools/puck_index .\n./bin/search_client YOUR_QUERY_FEATURE_FILE RECALL_FILE_NAME --flagfile=conf/puck.conf\n````\n\nRecall results are stored in the file RECALL_FILE_NAME.\n\n## More Details\n[more details for puck](./docs/README.md)\n\n## Benchmark\nPlease see [this readme](./ann-benchmarks/README.md) for details.\n\nThis ann-benchmark is forked from https://github.com/harsha-simhadri/big-ann-benchmarks (2021 version).\n\nIt is run the same way as the original benchmark. We added support for faiss (IVF, IVF-Flat, HNSW), nmslib (HNSW), Puck and Tinker in the T1 track. 
We also updated algos.yaml for these methods with recommended parameters for 4 datasets (bigann-10M, bigann-100M, deep-10M, deep-100M).\n\n## Discussion\nJoin our QQ group if you are interested in this project.\n\n![QQ Group](./docs/PuckQQGroup.jpeg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaidu%2Fpuck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaidu%2Fpuck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaidu%2Fpuck/lists"}