{"id":16976170,"url":"https://github.com/ashvardanian/usearch-molecules","last_synced_at":"2025-03-22T14:31:49.745Z","repository":{"id":207995202,"uuid":"651932274","full_name":"ashvardanian/usearch-molecules","owner":"ashvardanian","description":"Searching for structural similarities across billions of molecules in milliseconds","archived":false,"fork":false,"pushed_at":"2024-03-25T05:48:03.000Z","size":3615,"stargazers_count":73,"open_issues_count":2,"forks_count":6,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-15T15:49:59.746Z","etag":null,"topics":["aws-s3","bioinformatics","biology","cheminformatics","chemistry","compchem","drug-discovery","open-data","opendata","vector-search","vector-search-engine"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/usearch-molecules","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-10T14:38:40.000Z","updated_at":"2025-03-14T12:57:13.000Z","dependencies_parsed_at":"2023-12-14T10:00:27.251Z","dependency_job_id":"26054770-d851-49ab-96af-2fbc452e4818","html_url":"https://github.com/ashvardanian/usearch-molecules","commit_stats":{"total_commits":5,"total_committers":1,"mean_commits":5.0,"dds":0.0,"last_synced_commit":"754285a1aef432b81c509f85820920486b2e9d4f"},"previous_names":["ashvardanian/usearch-molecules"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fu
search-molecules","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-molecules/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-molecules/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fusearch-molecules/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/usearch-molecules/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244972272,"owners_count":20540952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-s3","bioinformatics","biology","cheminformatics","chemistry","compchem","drug-discovery","open-data","opendata","vector-search","vector-search-engine"],"created_at":"2024-10-14T01:25:11.126Z","updated_at":"2025-03-22T14:31:49.302Z","avatar_url":"https://github.com/ashvardanian.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# USearch Molecules\n\n![USearch Molecules 7B dataset thumbnail](/assets/USearchMolecules.jpg)\n\nUSearch Molecules is a large Chem-Informatics dataset of small molecules.\nIt includes __7'131'914'291 molecules__ with up to 50 \"heavy\" (non-hydrogen) atoms gathered from:\n\n- 115'034'339 molecules from the __PubChem__ dataset.\n- 977'468'301 molecules from the __GDB13__ dataset.\n- 6'039'411'651 molecules from the Enamine __REAL__ dataset.\n\nAll molecules have been encoded using `rdkit` and `cdk` to 
produce binary fingerprints (structural embeddings) of four kinds:\n\n- __MACCS__: Molecular ACCess System keys with __166__ dimensions.\n- __PubChem__: Structure Fingerprints with __881__ dimensions.\n- __ECFP4__: Extended Connectivity Fingerprint of diameter 4 with __2048__ dimensions.\n- __FCFP4__: Functional Class Fingerprint of diameter 4 with __2048__ dimensions.\n\nThose fingerprints were then indexed using [Unum's USearch](https://github.com/unum-cloud/usearch) to enable real-time search and clustering of molecular structures for drug discovery and broader chemistry.\nThe dataset is included in [AWS Open Data platform](https://registry.opendata.aws/usearch-molecules/) and is publicly available from the `s3://usearch-molecules` bucket, accessible even without AWS credentials, entirely anonymously:\n\n```sh\naws s3 ls --no-sign-request s3://usearch-molecules\n```\n\n## Dataset Structure\n\n```sh\n.\n├── data\n│   ├── pubchem\n│   │   ├── index-maccs.usearch # 18.6 GB\n│   │   ├── index-maccs-ecfp4.usearch # 46.1 GB\n│   │   └── parquet # 30 GB\n│   │       ├── 0000000000-0001000000.parquet # 265 MB\n│   │       ├── 0001000000-0002000000.parquet # 265 MB\n│   │       ├── ... \n│   │       └── 0115000000-0116000000.parquet # 177 MB\n│   ├── gdb13\n│   │   ├── index-maccs.usearch # 157.0 GB\n│   │   ├── index-maccs-ecfp4.usearch # 390.1 GB\n│   │   └── parquet # 189 GB\n│   │       ├── 0000000000-0001000000.parquet # 198 MB\n│   │       ├── 0001000000-0002000000.parquet # 198 MB\n│   │       ├── ... \n│   │       └── 0977000000-0978000000.parquet # 93 MB\n│   └── real\n│       └── parquet # 477 GB\n│           ├── 0000000000-0001000000.parquet # 262 MB\n│           ├── 0001000000-0002000000.parquet # 262 MB\n│           ├── ... 
\n│           └── 6039000000-6040000000.parquet # 108 MB\n└── README.md\n```\n\nPre-constructed search and clustering indexes for the Enamine REAL dataset are much harder to distribute and deploy.\nThose are not yet available in the bucket but are available per request.\nTo view the dataset structure, one can use Python:\n\n```sh\n  $ pip install pyarrow\n  $ python\n\u003e\u003e\u003e import pyarrow.parquet as pq\n\u003e\u003e\u003e pq.read_table('data/real/parquet/0000000000-0001000000.parquet')\n\npyarrow.Table\nsmiles: string not null\nmaccs: fixed_size_binary[21] not null\npubchem: fixed_size_binary[111] not null\necfp4: fixed_size_binary[256] not null\nfcfp4: fixed_size_binary[256] not null\n```\n\nIn a tabular form that will look like:\n\n|      | `smiles`                                                   |                                      `maccs` |                                                                                                                                                                                                                        `pubchem` |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            `ecfp4` |                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                            `fcfp4` |\n| :--- | :--------------------------------------------------------- | -------------------------------------------: | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |\n| 0    | CNCC(C)NC(=O)C1(C(C)(C)OC)CC1                              | 0x00000200000002002021227C488B9C02100615FFCC | 0x00733000000000000000000000001800000000000000000000000000000000000000001E00100000000E6CC18006020002C004000800011010000000000000000000810800000040160080001400000636008000000000000F80000000000000000000000000000000000000000000 | 
0x40000000000000000000800000002400000000000000000000000000000000000000001000000200000000000000000000800000000000000000000000000000000000000002000000000002000000000020000000000100000000000000000000000000010000000040000000000000000000020000000800000000000000000000000048000000000000000000000280200000000000000000020000000000000000000000000000000100000000000000020000000000000000000400000001000000000000000000000000000000010004000000000000000000000800000000000000000000800000000000000400000000000000000010000020000000 | 0xE0001400000000000000000000000000000000000000000200000000000000000000000000000000000000000000000000000000000000000000001000401000000000000000000000000400000000000000000000001000000000000000000000100080000004000000000000000000000000000000000000000000000000000000000000000004000800000000000000000000000000001000000200000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000000000004000000000000000001000000000000000000000080020000004000000000000000000000000000000000000000080 |\n| 1    | CN(C(=O)C1=CC2=C(F)C=C(F)C=C2N1)C1CN(C(=O)CC2=CC=CN=C2O)C1 | 0x00900000002000004011172DAC534CE55EF3EB7FFC | 0x007BB1800000000000000000000000005801600000003C400000000000000001F000001F00100800000C28C19E0C3EC4F3C99200A8033577540082802037222008D921BC6CDC0866F2C295B394710864D611C8D987BE99809E00000000000200000000000000040000000000000000 | 0x00000000000001000000800000200100000100000000000000000000000000020000000000000000040000000008002000000000000000808000000000000000000200000000000000000001000000000020000000000014000000001000200100000000014040000000000000104000000000020100400000000000000040100000110040000000880000200000000000100000000000000400000000000000000000000000000104040000080000000000000000080000000100000000000000000000000000042000000000004000020000000000014000004200200000000000000000008000002040000000000400800000000000000000004001000000 | 
0xBE800000000000000001000000000000000080000000080000000000000000000000000000000000000200000000000000000000000900000000000000010000000000010000000000020000000000000000000000000000000000200000000000000080080000000000000000000000040000008000000000002000000080000000000000400004000000000000000010000000000000000000000000000000000000400000000000000014000000000008000000000000000000000000000000000800000000000000000000000400080000000000001000400000000100000000000000000040004000000000002404000000000000000002020040003180 |\n\nI've also added a tiny sample dataset under the `data/example` directory, with only 2 shards totaling 2 million entries, with pre-constructed indexes to simplify the entry.\nThose come in handy if you want to test your application without downloading the whole dataset or visualize a few molecules using the StreamLit app.\n\n```sh\n.\n└── data\n    └── example # 1.8 GB\n        ├── index-maccs.usearch # 329 MB\n        ├── index-maccs-ecfp4.usearch # 817 MB\n        ├── parquet # 30 GB\n        │   ├── 0000000000-0001000000.parquet # 265 MB\n        │   └── 0001000000-0002000000.parquet # 265 MB\n        └── smiles # 30 GB\n            ├── 0000000000-0001000000.smi # 58 MB\n            └── 0001000000-0002000000.smi # 58 MB\n```\n\n## Usage\n\n### Exploring Dataset via Command Line Interface\n\nFirst, install NumPy, RDKit, and USearch v2, and download the dataset:\n\n```sh\npip3 install git+https://github.com/ashvardanian/usearch-molecules.git@main\nmkdir -p data/pubchem data/gdb13 data/real data/example\naws s3 sync --no-sign-request s3://usearch-molecules/data/example data/example/\n```\n\nIf you need just one of the subsets:\n\n```sh\naws s3 sync --no-sign-request s3://usearch-molecules/data/pubchem/ data/pubchem/\naws s3 sync --no-sign-request s3://usearch-molecules/data/gdb13/ data/gdb13/\naws s3 sync --no-sign-request s3://usearch-molecules/data/real/ data/real/\n```\n\nYou can immediately check if the indexes are readable:\n\n```sh\n  $ 
python\n\u003e\u003e\u003e from usearch.index import Index\n\u003e\u003e\u003e Index.metadata(\"data/pubchem/index-maccs.usearch\") # example of reading metadata\n\n{'matrix_included': True,\n 'matrix_uses_64_bit_dimensions': False,\n 'version': '2.8.10',\n 'kind_metric': \u003cMetricKind.Tanimoto: 116\u003e,\n 'kind_scalar': \u003cScalarKind.B1: 1\u003e,\n 'kind_key': \u003cScalarKind.U64: 8\u003e,\n 'kind_compressed_slot': \u003cScalarKind.U32: 9\u003e,\n 'count_present': 115627267,\n 'count_deleted': 0,\n 'dimensions': 192}\n\n\u003e\u003e\u003e Index.restore(\"data/pubchem/index-maccs-ecfp4.usearch\") # example of parsing it\n\nusearch.Index\n- config\n-- data type: ScalarKind.B1\n-- dimensions: 2240\n-- metric: MetricKind.Tanimoto\n-- connectivity: 16\n-- expansion on addition:128 candidates\n-- expansion on search: 64 candidates\n- binary\n-- uses OpenMP: 1\n-- uses SimSIMD: 1\n-- uses hardware acceleration: avx512+popcnt\n- state\n-- size: 115,627,267 vectors\n-- memory usage: 69,631,939,864 bytes\n-- max level: 4\n--- 0. 115,627,267 nodes\n--- 1. 7,148,410 nodes\n--- 2. 461,450 nodes\n--- 3. 37,714 nodes\n--- 4. 5,152 nodes\n```\n\nWith those out of the way, you can now query the downloaded files:\n\n```py\nfrom usearch_molecules.dataset import FingerprintedDataset, shape_mixed\n\ndata = FingerprintedDataset.open(\"data/example\", shape=shape_mixed)\n\n# No inspiration? 
Pick a random molecule with `data.random_smiles()`\nresults = data.search('CC(O)C(CN)=NNCC(C)(C)C', 100)\n\nresults_keys = [r[0] for r in results]\nresults_smiles = [r[1] for r in results]\nresults_scores = [r[2] for r in results]\n```\n\n### Exploring Dataset via Graphical Interface\n\nThe dataset also comes with a graphical sandbox, implemented with StreamLit and 3DMol.js, to help visualize similarities between molecules.\n\n```sh\npip install streamlit stmol ipython_genutils\nstreamlit run streamlit_app.py\n```\n\n![USearch Molecules StreamLit demo preview](/assets/USearchMoleculesStreamLitPreview.gif)\n\n## Methodology\n\n### Dataset Sources\n\nOriginal data came from:\n\n- __PubChem__: [CID-SMILES](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-SMILES.gz).gz\n- __GDB13__: [gdb13](https://zenodo.org/record/5172018/files/gdb13.tgz?download=1).tgz\n- Enamine __REAL__, split by Heavy Atom Counts:\n    - HAC 6-21: [CXSMILES](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_6_21_420M_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 22-23: [CXSMILES](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_22_23_471M_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 24: [CXSMILES](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_24_394M_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 25: [CXSMILES](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_25_557M_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 26:\n      - [CXSMILES Part 1](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_26_833M_Part_1_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n      - [CXSMILES Part 2](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_26_833M_Part_2_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 27:\n      - [CXSMILES Part 1](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_27_1.1B_Part_1_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n      - [CXSMILES Part 2](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_27_1.1B_Part_2_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 
28:\n      - [CXSMILES Part 1](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_28_1.2B_Part_1_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n      - [CXSMILES Part 2](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_28_1.2B_Part_2_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n    - HAC 29-38:\n      - [CXSMILES Part 1](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_29_38_988M_Part_1_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n      - [CXSMILES Part 2](https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_29_38_988M_Part_2_CXSMILES.cxsmiles.bz2).cxsmiles.bz2\n\n### Pre-processing Pipeline\n\n1. `prep_schedule.py`: convert and split datasets into standardized Parquet files.\n2. `prep_encode.py`: produce MACCS, PubChem, ECFP4, and FCFP4 fingerprints and index those.\n3. `prep_smiles.py`: export newline-delimited `.smi` files to simplify navigation with [StringZilla][stringzilla].\n\nEvery script is designed to work with bigger-than-memory data.\nIn other words, processing 1 TB of molecules doesn't require 1 TB of RAM.\nEverything happens in a \"gliding-window\" fashion, with computationally intensive parts split between processes and threads.\n\n```sh\npython usearch_molecules/prep_schedule.py # Prepare Parquet files\npython usearch_molecules/prep_encode.py # Build USearch indexes\npython usearch_molecules/prep_smiles.py # Export SMILES new-line delimited files to simplify serving\n```\n\nOnce completed, datasets have been uploaded to S3:\n\n```sh\naws s3 sync data/pubchem/parquet/ s3://usearch-molecules/data/pubchem/parquet/\naws s3 sync data/gdb13/parquet/ s3://usearch-molecules/data/gdb13/parquet/\naws s3 sync data/real/parquet/ s3://usearch-molecules/data/real/parquet/\n```\n\n[stringzilla]: 
https://github.com/ashvardanian/stringzilla\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fusearch-molecules","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fusearch-molecules","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fusearch-molecules/lists"}