{"id":20547215,"url":"https://github.com/lfrati/subpair","last_synced_at":"2026-05-09T05:03:47.450Z","repository":{"id":64348153,"uuid":"570292739","full_name":"lfrati/subpair","owner":"lfrati","description":"Fast pairwise cosine distance calculation and numba accelerated evolutionary matrix subset extraction 🍐🚀","archived":false,"fork":false,"pushed_at":"2023-04-10T19:38:02.000Z","size":43,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-01T16:40:22.529Z","etag":null,"topics":["cosine-distance","cuda","numba"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lfrati.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-11-24T20:13:56.000Z","updated_at":"2022-11-25T20:58:41.000Z","dependencies_parsed_at":"2023-02-09T11:15:28.019Z","dependency_job_id":"11ae2b2a-7307-49e9-b38c-850f86962739","html_url":"https://github.com/lfrati/subpair","commit_stats":{"total_commits":21,"total_committers":2,"mean_commits":10.5,"dds":"0.47619047619047616","last_synced_commit":"2f5792f5bae0d749d5485a60ef02edbd43d0a80f"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/lfrati/subpair","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfrati%2Fsubpair","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfrati%2Fsubpair/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfrati%2Fsubpair/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/lfrati%2Fsubpair/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lfrati","download_url":"https://codeload.github.com/lfrati/subpair/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfrati%2Fsubpair/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32807861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"online","status_checked_at":"2026-05-09T02:00:06.633Z","response_time":123,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cosine-distance","cuda","numba"],"created_at":"2024-11-16T02:06:49.744Z","updated_at":"2026-05-09T05:03:47.428Z","avatar_url":"https://github.com/lfrati.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg width=\"300\" alt=\"Logo\" src=\"https://user-images.githubusercontent.com/3115640/203899211-fff1c9d8-10cd-4a84-88b5-518a591cd1e5.jpeg\"\u003e\n    \u003cp align=\"center\"\u003e/sʌb.pɛɹ/\u003c/p\u003e\n\u003c/p\u003e\n\n# SubPair  ![CI](https://github.com/lfrati/subpair/actions/workflows/test.yml/badge.svg)\n\n\u003e \"All you need is love and _evolutionary matrix subset extraction_.\" - J. Lennon\n\nPairwise cosine distance is great to easily compare many vectors. However, you can end up with a very sizeable distance matrix. 
What if you would like to find a small subset of that matrix? Let's search for it by evolution.\n\nGiven N elements and their (N,N) pairwise distance matrix, we would like to find the subset of S elements such that the sum of the elements in the corresponding (S,S) submatrix is minimal. See the example below.\n\n```\n  [0  1  2  3  4] indices \n      i  j     k    \n      │  │     │          i j k   = [1, 2, 4]\n   0  1  6  4  1                   \ni──1  0  3  1  7       i  0 3 7     \nj──6  3  0  2  3  --\u003e  j  3 0 3  --\u003e  7 + 3 + 3 = 13 👎\n   4  1  2  0  1       k  7 3 0\nk──1  7  3  1  0\n\n         i  j  k    \n         │  │  │          i j k  = [2, 3, 4]   \n   0  1  6  4  1                   \n   1  0  3  1  7       i  0 2 3     \ni──6  3  0  2  3  --\u003e  j  2 0 1  --\u003e  2 + 1 + 3 = 6 👍\nj──4  1  2  0  1       k  3 1 0\nk──1  7  3  1  0\n```\n\nThere are $\\binom{N}{S}$ possible subsets, and for N = 1024, S = 20 (as in the tests) we would have to check $\\binom{1024}{20} \\approx 5.479 \\times 10^{41}$ of them.\n\nA few too many. 
Instead we are going to use an evolutionary approach to search for it.\n\n# Installation\nThrough pip:\n\n```bash\npip install subpair\n```\nor from GitHub:\n\n```bash\ngit clone https://github.com/lfrati/subpair.git\ncd subpair\npip install -e .\n```\n\n# Example usage\n\nUsage is straightforward, since only two functions are exported: `pairwise_cosine` and `extract`.\n\n```python\n\u003e\u003e\u003e import matplotlib.pyplot as plt\n\u003e\u003e\u003e import numpy as np\n\u003e\u003e\u003e from subpair import pairwise_cosine, extract\n\u003e\u003e\u003e\n\u003e\u003e\u003e # N vectors of dimension K (N, K and S are assumed to be defined)\n\u003e\u003e\u003e X = np.random.rand(N, K).astype(np.float32)\n\u003e\u003e\u003e distances = pairwise_cosine(X) # (N,N)\n\u003e\u003e\u003e ...\n\u003e\u003e\u003e best, stats = extract(distances, P=200, S=S, K=50, M=3, O=2, its=3_000)\n100%|█████████████████████████████████| 3000/3000 [00:03\u003c00:00, 817.42it/s]\n\u003e\u003e\u003e plt.plot(stats[\"fits\"]); plt.show()\n```\n\u003cp align=\"left\"\u003e\n    \u003cimg width=\"500\" alt=\"Logo\" src=\"https://user-images.githubusercontent.com/3115640/204059389-730df61a-4e87-4023-b7c7-038b329dc6a6.png\"\u003e\n    \u003cp\u003e(We have sprinkled a few negative numbers to see if the algorithm can find them)\u003c/p\u003e\n\u003c/p\u003e\nThe options of `extract` are parameters of the evolutionary algorithm:\n\n``` \ndistances (N, N)     : pairwise distance matrix\n        P (int)      : population size\n        S (int)      : desired subset size \u003c- determines size of output\n        K (int)      : number of parents (P-K children)\n        M (int)      : number of mutations\n        O (int)      : fraction of crossovers e.g. O=2 -\u003e 1/2, O=10 -\u003e 1/10, (bigger=faster)\n```\n\n# Note\n\nThis repo contains both numpy and numba/CUDA versions of the pairwise cosine distance matrix calculation. However, numpy is already _blazingly_ fast, so the CUDA version is provided mostly for inspiration. 
Our numpy version is very similar to sklearn's [metrics.pairwise.cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) but slightly faster. Sklearn's version has some extra niceties that our simplified version lacks.\n\n```bash\n\u003e python flops.py # On MacBook Pro M1 Max\nN=513 K=2304 GOPs=1\n  sklearn: 0.01s - 109.4 GFLOPS\n    numpy: 0.00s - 162.4 GFLOPS\n\nN=1027 K=2304 GOPs=2\n  sklearn: 0.02s - 135.9 GFLOPS\n    numpy: 0.01s - 192.4 GFLOPS\n\nN=2055 K=2304 GOPs=10\n  sklearn: 0.07s - 142.9 GFLOPS\n    numpy: 0.06s - 166.0 GFLOPS\n\nN=4111 K=2304 GOPs=39\n  sklearn: 0.20s - 195.8 GFLOPS\n    numpy: 0.16s - 248.6 GFLOPS\n\nN=8223 K=2304 GOPs=156\n  sklearn: 0.61s - 255.3 GFLOPS\n    numpy: 0.54s - 289.5 GFLOPS\n\nN=16447 K=2304 GOPs=623\n  sklearn: 2.11s - 295.4 GFLOPS\n    numpy: 1.79s - 347.9 GFLOPS\n```\n\n# Todo\n- [ ] Add type info to minimize.py to allow for AOT compilation.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flfrati%2Fsubpair","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flfrati%2Fsubpair","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flfrati%2Fsubpair/lists"}