{"id":34094558,"url":"https://github.com/gmattedi/chemicalspace","last_synced_at":"2026-04-08T15:33:39.998Z","repository":{"id":242521435,"uuid":"800582102","full_name":"gmattedi/chemicalspace","owner":"gmattedi","description":"Object-oriented Representation for Chemical Spaces","archived":false,"fork":false,"pushed_at":"2024-06-18T14:40:09.000Z","size":1662,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-05T03:36:03.080Z","etag":null,"topics":["chemical-space","cheminformatics","chemistry","computational-chemistry","machine-learning","rdkit"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gmattedi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-14T15:53:15.000Z","updated_at":"2025-11-14T01:39:29.000Z","dependencies_parsed_at":"2024-06-03T14:47:52.092Z","dependency_job_id":"5045511c-5cde-4449-86ce-2f9fd99f469c","html_url":"https://github.com/gmattedi/chemicalspace","commit_stats":null,"previous_names":["gmattedi/chemicalspace"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/gmattedi/chemicalspace","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gmattedi%2Fchemicalspace","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gmattedi%2Fchemicalspace/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gmattedi%2Fchemicalspace/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gmattedi%2Fchemicalspace/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gmattedi","download_url":"https://codeload.github.com/gmattedi/chemicalspace/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gmattedi%2Fchemicalspace/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31562689,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chemical-space","cheminformatics","chemistry","computational-chemistry","machine-learning","rdkit"],"created_at":"2025-12-14T15:03:59.046Z","updated_at":"2026-04-08T15:33:39.989Z","avatar_url":"https://github.com/gmattedi.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"![](.badges/coverage.svg)\u0026nbsp;\u0026nbsp;\n![](.badges/tests.svg)\u0026nbsp;\u0026nbsp;\n[![](https://readthedocs.org/projects/chemicalspace/badge/?version=latest)](https://chemicalspace.readthedocs.io/en/latest/?badge=latest)\n\n# ChemicalSpace\n\nAn Object-Oriented Representation for Chemical Spaces\n\n`chemicalspace` is a Python package that 
provides an object-oriented\nrepresentation for chemical spaces. It is designed to be used in conjunction\nwith the `RDKit` package, which provides the underlying cheminformatics functionality.\n\nWhile in the awesome `RDKit`, the main frame of reference is that of single molecules, here the main focus is on operations on chemical spaces.\n\n## Installation\n\nTo install `chemicalspace` you can use `pip`:\n\n```bash\npip install chemicalspace\n```\n\n# Usage\n\nThe main class in `chemicalspace` is `ChemicalSpace`.\nThe class provides a number of methods for working with chemical spaces,\nincluding reading and writing, filtering, clustering and\npicking from chemical spaces.\n\n## Basics\n\n### Initialization\n\nA `ChemicalSpace` can be initialized from SMILES strings or `RDKit` molecules.\nIt optionally takes molecule indices and scores as arguments.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nsmiles = ('CCO', 'CCN', 'CCl')\nindices = (\"mol1\", \"mol2\", \"mol3\")\nscores = (0.1, 0.2, 0.3)\n\nspace = ChemicalSpace(mols=smiles, indices=indices, scores=scores)\n\nprint(space)\n```\n\n```text\n\u003cChemicalSpace: 3 molecules | 3 indices | 3 scores\u003e\n```\n\n### Reading and Writing\n\nA `ChemicalSpace` can be read from and written to SMI and SDF files.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\n# Load from SMI file\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace.to_smi(\"outputs1.smi\")\n\n# Load from SDF file\nspace = ChemicalSpace.from_sdf(\"tests/data/inputs1.sdf\")\nspace.to_sdf(\"outputs1.sdf\")\n\nprint(space)\n```\n\n```text\n\u003cChemicalSpace: 10 molecules | 10 indices | No scores\u003e\n```\n\n### Indexing, Slicing and Masking\n\nIndexing, slicing and masking a `ChemicalSpace` object returns a new `ChemicalSpace` object.\n\n#### Indexing\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nprint(space[0])\n```\n\n```text\n\u003cChemicalSpace: 
1 molecules | 1 indices | No scores\u003e\n```\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nidx = [1, 2, 4]\n\nprint(space[idx])\n```\n\n```text\n\u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n```\n\n#### Slicing\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nprint(space[:2])\n```\n\n```text\n\u003cChemicalSpace: 2 molecules | 2 indices | No scores\u003e\n```\n\n#### Masking\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nmask = [True, False, True, False, True, False, True, False, True, False]\n\nprint(space[mask])\n```\n\n```text\n\u003cChemicalSpace: 5 molecules | 5 indices | No scores\u003e\n```\n\n### Deduplicating\n\nDeduplicating a `ChemicalSpace` object removes duplicate molecules.  \nSee [Hashing and Identity](#hashing-and-identity) for details on molecule identity.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace_twice = space + space  # 20 molecules\nspace_deduplicated = space_twice.deduplicate()  # 10 molecules\n\nprint(space_deduplicated)\n```\n\n```text\n\u003cChemicalSpace: 10 molecules | 10 indices | No scores\u003e\n```\n\n### Chunking\n\nA `ChemicalSpace` object can be chunked into smaller `ChemicalSpace` objects.   
\nThe `.chunks` method returns a generator of `ChemicalSpace` objects.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nchunks = space.chunks(chunk_size=3)\n\nfor chunk in chunks:\n    print(chunk)\n```\n\n```text\n\u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n\u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n\u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n\u003cChemicalSpace: 1 molecules | 1 indices | No scores\u003e\n```\n\n### Drawing\n\nA `ChemicalSpace` object can be rendered as a grid of molecules.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace.draw()\n```\n![draw](.media/sample.png)\n\n### Featurizing\n\n#### Features\n\nA `ChemicalSpace` object can be featurized as a `numpy` array of features.\nBy default, ECFP4/Morgan2 fingerprints are used.\nThe features are cached for subsequent calls,\nand spaces generated by a `ChemicalSpace` object (e.g. 
by slicing, masking, chunking)\ninherit the respective features.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace_slice = space[:6:2]\n\n# Default ECFP4 features\nprint(space.features.shape)\nprint(space_slice.features.shape)\n```\n\n```text\n(10, 1024)\n(3, 1024)\n```\n\n#### Custom featurizer\n\nA custom featurizer can be supplied via the `featurizer` argument.\nIt should take an `rdkit.Chem.Mol` molecule, and its numerical\nreturn value should be castable to a NumPy array (see `chemicalspace.utils.MolFeaturizerType`).\n\n```python\nfrom chemicalspace import ChemicalSpace\nfrom chemicalspace.utils import maccs_featurizer\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\", featurizer=maccs_featurizer)\nspace_slice = space[:6:2]\n\n# Custom MACCS features\nprint(space.features.shape)\nprint(space_slice.features.shape)\n```\n\n```text\n(10, 167)\n(3, 167)\n```\n\n#### Metrics\n\nA distance metric on the feature space is necessary for clustering, calculating diversity, and\nidentifying neighbors. 
By default, the `jaccard` (a.k.a. Tanimoto) distance is used.\n`ChemicalSpace` takes a `metric` string argument that specifies a `sklearn` metric.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\", metric='euclidean')\n```\n\n### Binary Operations\n\n#### Single entries\n\nSingle entries as SMILES strings or `RDKit` molecules\ncan be added to a `ChemicalSpace` object.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace.add(\"CCO\", \"mol11\")\n\nprint(space)\n```\n\n```text\n\u003cChemicalSpace: 11 molecules | 11 indices | No scores\u003e\n```\n\n#### Chemical spaces\n\nTwo `ChemicalSpace` objects can be added together.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi\")\n\nspace = space1 + space2\n\nprint(space)\n```\n\n```text\n\u003cChemicalSpace: 25 molecules | 25 indices | No scores\u003e\n```\n\nTwo spaces can also be subtracted, returning only the molecules in `space1`\nthat are not in `space2`.   
\nSee [Hashing and Identity](#hashing-and-identity) for more details.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi\")\n\nspace = space1 - space2\n\nprint(space)\n```\n\n```text\n\u003cChemicalSpace: 5 molecules | 5 indices | No scores\u003e\n```\n\n### Hashing and Identity\n\nIndividual molecules in a chemical space are hashed by their InChI Keys only (by default), or by InChI Keys and index.\nScores **do not** affect the hashing process.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nsmiles = ('CCO', 'CCN', 'CCl')\nindices = (\"mol1\", \"mol2\", \"mol3\")\n\n# Two spaces with the same molecules, and indices\n# But one space includes the indices in the hashing process\nspace_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=True)\nspace_no_indices = ChemicalSpace(mols=smiles, indices=indices, hash_indices=False)\n\nprint(space_indices == space_indices)\nprint(space_indices == space_no_indices)\nprint(space_no_indices == space_no_indices)\n```\n\n```text\nTrue\nFalse\nTrue\n```\n\n`ChemicalSpace` objects are hashed by their molecular hashes, in an **order-independent** manner.\n\n```python\nfrom rdkit import Chem\nfrom rdkit.Chem import inchi\nfrom chemicalspace import ChemicalSpace\n\nmol = Chem.MolFromSmiles(\"c1ccccc1\")\ninchi_key = inchi.MolToInchiKey(mol)\n\nspace = ChemicalSpace(mols=(mol,))\n\nassert hash(space) == hash(frozenset((inchi_key,)))\n```\n\nThe identity of a `ChemicalSpace` is evaluated on its hashed representation.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace1_again = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi.gz\")\n\nprint(space1 == space1)\nprint(space1 == space1_again)\nprint(space1 == space2)\n```\n\n```text\nTrue\nTrue\nFalse\n```\n\n### 
Copy\n\n`ChemicalSpace` supports copy and deepcopy operations.\nA deep copy fully unlinks the copied object from the original, including the `RDKit` molecules.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\n# Shallow copy\nspace_copy = space.copy()\nassert id(space.mols[0]) == id(space_copy.mols[0])\n\n# Deep copy\nspace_deepcopy = space.copy(deep=True)\nassert id(space.mols[0]) != id(space_deepcopy.mols[0])\n```\n\n## Clustering\n\n### Labels\n\nA `ChemicalSpace` can be clustered by its molecular features.\n`kmedoids`, `agglomerative-clustering`, `sphere-exclusion` and `scaffold`\nare the available clustering methods.\nRefer to the respective methods in [`chemicalspace.layers.clustering`](chemicalspace/layers/clustering.py)\nfor more details.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\ncluster_labels = space.cluster(n_clusters=3)\n\nprint(cluster_labels)\n```\n\n```text\n[0 1 2 1 1 0 0 0 0 0]\n```\n\n### Clusters\n\n`ChemicalSpace.yield_clusters` can be used to iterate over clusters as `ChemicalSpace` objects.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nclusters = space.yield_clusters(n_clusters=3)\n\nfor cluster in clusters:\n    print(cluster)\n```\n\n```text\n\u003cChemicalSpace: 6 molecules | 6 indices | No scores\u003e\n\u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n\u003cChemicalSpace: 1 molecules | 1 indices | No scores\u003e\n```\n\n### KFold Clustering\n\n`ChemicalSpace.split` can be used to iterate over train/test cluster splits for ML training.\nAt each iteration, one cluster is used as the test set and the rest as the training set.\nNote that there is no guarantee on the size of the clusters.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nfor 
train, test in space.split(n_splits=3):\n    print(train, test)\n```\n\n```text\n\u003cChemicalSpace: 4 molecules | 4 indices | No scores\u003e \u003cChemicalSpace: 6 molecules | 6 indices | No scores\u003e\n\u003cChemicalSpace: 7 molecules | 7 indices | No scores\u003e \u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n\u003cChemicalSpace: 9 molecules | 9 indices | No scores\u003e \u003cChemicalSpace: 1 molecules | 1 indices | No scores\u003e\n```\n\n## Overlap\n\n`ChemicalSpace` implements methods for calculating the overlap with another space.\n\n### Overlap\n\nThe molecules of a `ChemicalSpace` that are similar to another space can be flagged.\nThe similarity between two molecules is calculated as the Tanimoto similarity of their\nECFP4/Morgan2 fingerprints.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi.gz\")\n\n# Indices of `space1` that are similar to `space2`\noverlap = space1.find_overlap(space2, radius=0.6)\n\nprint(overlap)\n```\n\n```text\n[0 1 2 3 4]\n```\n\n### Carving\n\nThe overlap between two `ChemicalSpace` objects can be carved out from one of the objects,\nso as to ensure that the two spaces are disjoint for a given similarity radius.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace1 = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace2 = ChemicalSpace.from_smi(\"tests/data/inputs2.smi.gz\")\n\n# Carve out the overlap from `space1`\nspace1_carved = space1.carve(space2, radius=0.6)\n\nprint(space1_carved)\n```\n\n```text\n\u003cChemicalSpace: 5 molecules | 5 indices | No scores\u003e\n```\n\n## Dimensionality Reduction\n\n`ChemicalSpace` implements methods for dimensionality reduction by\n`pca`, `tsne` or `umap` projection of its features.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nproj = 
space.project(method='pca')\n\nprint(proj.shape)\n```\n\n```text\n(10, 2)\n```\n\n## Picking\n\nA subset of a `ChemicalSpace` can be picked by a number of acquisition strategies.  \nSee [`chemicalspace.layers.acquisition`](chemicalspace/layers/acquisition.py) for details.\n\n```python\nfrom chemicalspace import ChemicalSpace\nimport numpy as np\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\nspace_pick_random = space.pick(n=3, strategy='random')\nprint(space_pick_random)\n\nspace_pick_diverse = space.pick(n=3, strategy='maxmin')\nprint(space_pick_diverse)\n\nspace.scores = np.array(range(len(space)))  # Assign dummy scores\nspace_pick_greedy = space.pick(n=3, strategy='greedy')\nprint(space_pick_greedy)\n```\n\n```text\n\u003cChemicalSpace: 3 molecules | 3 indices | No scores\u003e\n\u003cChemicalSpace: 3 molecules | 3 indices | 3 scores\u003e\n```\n\n## Uniqueness and Diversity\n\n### Uniqueness\n\nThe uniqueness of a `ChemicalSpace` object can be calculated by the number of unique molecules.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\nspace_twice = space + space  # 20 molecules\nuniqueness = space_twice.uniqueness()\n\nprint(uniqueness)\n```\n\n```text\n0.5\n```\n\n### Diversity\n\nThe diversity of a `ChemicalSpace` object can be calculated as:\n\n- The average of the pairwise distance matrix\n- The normalized [Vendi score](https://arxiv.org/abs/2210.02410) of the same distance matrix.\n\nThe Vendi score can be interpreted as the effective number of molecules in the space,\nand here it is normalized by the number of molecules in the space taking values in the range `[0, 1]`.\n\n```python\nfrom chemicalspace import ChemicalSpace\n\nspace = ChemicalSpace.from_smi(\"tests/data/inputs1.smi\")\n\ndiversity_int = space.diversity(method='internal-distance')\ndiversity_vendi = space.diversity(method='vendi')\nprint(diversity_int)\nprint(diversity_vendi)\n\n# Dummy space with the 
same molecule len(space) times\nspace_redundant = ChemicalSpace(mols=tuple([space.mols[0]] * len(space)))\n\ndiversity_int_redundant = space_redundant.diversity(method='internal-distance')\ndiversity_vendi_redundant = space_redundant.diversity(method='vendi')\n\nprint(diversity_int_redundant)\nprint(diversity_vendi_redundant)\n```\n\n```text\n0.7730273985449335\n0.12200482273434754\n0.0\n0.1\n```\n\n# Advanced\n\n## Layers\n\n`ChemicalSpace` is implemented as a series of *layers* that provide the functionality of the class. As can be seen in the [source code](chemicalspace/space.py), the class simply combines the layers.\n\nIf only a subset of the functionality of `ChemicalSpace` is necessary, and lean objects are a priority, one can combine only the required layers:\n\n```python\nfrom chemicalspace.layers.clustering import ChemicalSpaceClusteringLayer\nfrom chemicalspace.layers.neighbors import ChemicalSpaceNeighborsLayer\n\n\nclass MyCustomSpace(ChemicalSpaceClusteringLayer, ChemicalSpaceNeighborsLayer):\n    pass\n\n\nspace = MyCustomSpace(mols=[\"c1ccccc1\"])\nspace\n```\n```text\n\u003cMyCustomSpace: 1 molecules | No indices | No scores\u003e\n```\n\n# Development\n\n## Installation\n\nInstall the development dependencies with `pip`:\n\n```bash\npip install -e .[dev]\n```\n\n## Hooks\n\nThe project uses `pre-commit` for code formatting, linting and testing.\nInstall the hooks with:\n\n```bash\npre-commit install\n```\n\n## Documentation\n\nThe documentation can be built by running:\n```bash\ncd docs\n./rebuild.sh\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgmattedi%2Fchemicalspace","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgmattedi%2Fchemicalspace","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgmattedi%2Fchemicalspace/lists"}