{"id":15641240,"url":"https://github.com/rragundez/chunkdot","last_synced_at":"2025-04-04T21:05:42.122Z","repository":{"id":152394020,"uuid":"610501127","full_name":"rragundez/chunkdot","owner":"rragundez","description":"Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for calculating the K most similar items for a large number of items by chunking the item matrix representation (embeddings) and using Numba to accelerate the calculations.","archived":false,"fork":false,"pushed_at":"2024-12-28T07:22:10.000Z","size":3315,"stargazers_count":80,"open_issues_count":3,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-28T20:05:42.730Z","etag":null,"topics":["cosine-similarity","embeddings","numba","vectorization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rragundez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-06T22:36:23.000Z","updated_at":"2025-03-23T20:18:00.000Z","dependencies_parsed_at":"2025-02-22T20:11:18.359Z","dependency_job_id":"331c6c50-6b0a-4573-bc02-fe2cc82bc4dd","html_url":"https://github.com/rragundez/chunkdot","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rragundez%2Fchunkdot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rragundez%2Fchunkdot/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/rragundez%2Fchunkdot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rragundez%2Fchunkdot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rragundez","download_url":"https://codeload.github.com/rragundez/chunkdot/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247249524,"owners_count":20908212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cosine-similarity","embeddings","numba","vectorization"],"created_at":"2024-10-03T11:41:57.127Z","updated_at":"2025-04-04T21:05:42.094Z","avatar_url":"https://github.com/rragundez.png","language":"Python","readme":"# ChunkDot\n\nMulti-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. 
Appropriate for calculating the K most similar items for a large number of items by chunking the item matrix representation (embeddings) and using Numba to accelerate the calculations.\n\nUse for:\n\n- [dense embeddings](#dense-embeddings)\n- [sparse embeddings](#sparse-embeddings)\n- [similarity calculation versus other embeddings](#similarity-calculation-versus-other-embeddings)\n- [CosineSimilarityTopK scikit-learn transformer](#cosinesimilaritytopk-scikit-learn-transformer)\n\n## Related blog posts\n\n- [Cosine Similarity for 1 Trillion Pairs of Vectors](https://pub.towardsai.net/cosine-similarity-for-1-trillion-pairs-of-vectors-11f6a1ed6458)\n- [Bulk Similarity Calculations for Sparse Embeddings](https://pub.towardsai.net/scale-up-bulk-similarity-calculations-for-sparse-embeddings-fb3ecb624727)\n\n## Usage\n\n```bash\npip install -U chunkdot\n```\n\n### Dense embeddings\n\nCalculate the 50 most similar and dissimilar items for 100K items.\n\n```python\nimport numpy as np\nfrom chunkdot import cosine_similarity_top_k\n\nembeddings = np.random.randn(100000, 256)\n# using all your system's memory\ncosine_similarity_top_k(embeddings, top_k=50)\n# most dissimilar items using 20GB\ncosine_similarity_top_k(embeddings, top_k=-50, max_memory=20E9)\n```\n```\n\u003c100000x100000 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n with 5000000 stored elements in Compressed Sparse Row format\u003e\n```\n```python\n# with progress bar\ncosine_similarity_top_k(embeddings, top_k=50, show_progress=True)\n```\n```\n100%|███████████████████████████████████████████████████████████████| 129.0/129 [01:04\u003c00:00,  1.80it/s]\n\u003c100000x100000 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n  with 5000000 stored elements in Compressed Sparse Row format\u003e\n```\n\nExecution time\n```python\nfrom timeit import timeit\nimport numpy as np\nfrom chunkdot import cosine_similarity_top_k\n\nembeddings = np.random.randn(100000, 256)\ntimeit(lambda: 
cosine_similarity_top_k(embeddings, top_k=50, max_memory=20E9), number=1)\n```\n```\n58.611996899999994\n```\n\n### Sparse embeddings\n\nCalculate the 50 most similar and dissimilar items for 100K items. Items are represented by 10K-dimensional vectors in an embeddings matrix of 0.005 density.\n\n```python\nfrom scipy import sparse\nfrom chunkdot import cosine_similarity_top_k\n\nembeddings = sparse.rand(100000, 10000, density=0.005)\n# using all your system's memory\ncosine_similarity_top_k(embeddings, top_k=50)\n# most dissimilar items using 20GB\ncosine_similarity_top_k(embeddings, top_k=-50, max_memory=20E9)\n```\n```\n\u003c100000x100000 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n with 5000000 stored elements in Compressed Sparse Row format\u003e\n```\n\nExecution time\n\n```python\nfrom timeit import timeit\nfrom scipy import sparse\nfrom chunkdot import cosine_similarity_top_k\n\nembeddings = sparse.rand(100000, 10000, density=0.005)\ntimeit(lambda: cosine_similarity_top_k(embeddings, top_k=50, max_memory=20E9), number=1)\n```\n```\n51.87472256699999\n```\n\n### Similarity calculation versus other embeddings\n\nGiven 20K items, for each item, find the 10 most similar items in a collection of 10K other items.\n\n```python\nimport numpy as np\nfrom chunkdot import cosine_similarity_top_k\n\nembeddings = np.random.randn(20000, 256)\nother_embeddings = np.random.randn(10000, 256)\n\ncosine_similarity_top_k(embeddings, embeddings_right=other_embeddings, top_k=10)\n```\n```\n\u003c20000x10000 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n with 200000 stored elements in Compressed Sparse Row format\u003e\n```\n\n### CosineSimilarityTopK scikit-learn transformer\n\nGiven a pandas DataFrame with 100K rows and\n\n- 2 numerical columns\n- 2 categorical columns with 500 categories each\n\nuse scikit-learn transformers, the standard scaler for the numerical columns and the one-hot encoder for the categorical columns, to form an embeddings matrix of dimensions 100K x 1002 and then calculate the 50 most similar rows for each row.\n\n```python\nimport numpy as np\nimport pandas as pd\n\nn_rows = 100000\nn_categories = 500\ndf = pd.DataFrame(\n    {\n        \"A_numeric\": np.random.rand(n_rows),\n        \"B_numeric\": np.random.rand(n_rows),\n        \"C_categorical\": np.random.randint(n_categories, size=n_rows),\n        \"D_categorical\": np.random.randint(n_categories, size=n_rows),\n    }\n)\n```\n```python\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\n\nfrom chunkdot import CosineSimilarityTopK\n\nnumeric_features = [\"A_numeric\", \"B_numeric\"]\nnumeric_transformer = Pipeline(steps=[(\"scaler\", StandardScaler())])\n\ncategorical_features = [\"C_categorical\", \"D_categorical\"]\ncategorical_transformer = Pipeline(steps=[(\"encoder\", OneHotEncoder())])\n\npreprocessor = ColumnTransformer(\n    transformers=[\n        (\"num\", numeric_transformer, numeric_features),\n        (\"cat\", categorical_transformer, categorical_features),\n    ]\n)\n\ncos_sim = CosineSimilarityTopK(top_k=50)\n\npipe = Pipeline(steps=[(\"preprocessor\", preprocessor), (\"cos_sim\", cos_sim)])\npipe.fit_transform(df)\n```\n```\n\u003c100000x100000 sparse matrix of type '\u003cclass 'numpy.float64'\u003e'\n\twith 5000000 stored elements in Compressed Sparse Row format\u003e\n```\n\nExecution time\n```python\nfrom timeit import timeit\n\ntimeit(lambda: pipe.fit_transform(df), number=1)\n```\n```\n24.45172154181637\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frragundez%2Fchunkdot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frragundez%2Fchunkdot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frragundez%2Fchunkdot/lists"}