{"id":13684476,"url":"https://github.com/VlachosGroup/AIMSim","last_synced_at":"2025-04-30T21:30:50.054Z","repository":{"id":39912762,"uuid":"239363892","full_name":"VlachosGroup/AIMSim","owner":"VlachosGroup","description":"A Python toolbox to work with molecular similarity","archived":false,"fork":false,"pushed_at":"2024-08-13T19:43:14.000Z","size":11904,"stargazers_count":39,"open_issues_count":3,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-12T04:14:11.526Z","etag":null,"topics":["clustering","machine-learning"],"latest_commit_sha":null,"homepage":"https://vlachosgroup.github.io/AIMSim/README.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VlachosGroup.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-02-09T19:47:09.000Z","updated_at":"2025-03-11T05:43:25.000Z","dependencies_parsed_at":"2023-09-24T07:16:10.054Z","dependency_job_id":"b7da1b0f-50fb-40de-a8c0-3cafa78cfe9a","html_url":"https://github.com/VlachosGroup/AIMSim","commit_stats":{"total_commits":1136,"total_committers":15,"mean_commits":75.73333333333333,"dds":0.6390845070422535,"last_synced_commit":"45a978eee54b29343b917b3b141b22473061e139"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VlachosGroup%2FAIMSim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VlachosGroup%2FAIMSim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/VlachosGroup%2FAIMSim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VlachosGroup%2FAIMSim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VlachosGroup","download_url":"https://codeload.github.com/VlachosGroup/AIMSim/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251785365,"owners_count":21643464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","machine-learning"],"created_at":"2024-08-02T14:00:33.969Z","updated_at":"2025-04-30T21:30:48.404Z","avatar_url":"https://github.com/VlachosGroup.png","language":"Python","funding_links":[],"categories":["Cheminformatics"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eAIMSim README\u003c/h1\u003e \n\u003ch3 align=\"center\"\u003eVisualizing Diversity in your Molecular Dataset\u003c/h3\u003e\n\n![AIMSim Logo](interfaces/UI/AIMSim-logo.png)\n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"GitHub Repo Stars\" src=\"https://img.shields.io/github/stars/VlachosGroup/AIMSim?style=social\"\u003e\n  \u003cimg alt=\"commits since\" src=\"https://img.shields.io/github/commits-since/VlachosGroup/AIMSim/latest.svg\"\u003e\n  \u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/aimsim\"\u003e\n  \u003cimg alt=\"PyPI - License\" src=\"https://img.shields.io/github/license/VlachosGroup/AIMSim\"\u003e\n  \u003cimg alt=\"Test Status\" src=\"https://github.com/VlachosGroup/AIMSim/actions/workflows/ci.yml/badge.svg?event=schedule\"\u003e\n\u003c/p\u003e\n\nRepository 
Status: [![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](https://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/#inactive)\n\nAIMSim has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.\n\nDownloads Stats:\n - `aimsim`: [![Downloads](https://static.pepy.tech/badge/aimsim)](https://static.pepy.tech/personalized-badge/aimsim?period=total\u0026units=none\u0026left_color=grey\u0026right_color=blue\u0026left_text=Lifetime%20Downloads)\n - `aimsim_core`: [![Downloads](https://static.pepy.tech/badge/aimsim_core)](https://pepy.tech/project/aimsim_core?period=total\u0026units=none\u0026left_color=grey\u0026right_color=blue\u0026left_text=Lifetime%20Downloads)\n\n## Documentation and Tutorial\n[View our Online Documentation](https://vlachosgroup.github.io/AIMSim/) or try the _AIMSim_ comprehensive tutorial in your browser:\n\u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/VlachosGroup/AIMSim/blob/master/AIMSim-demo.ipynb\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\n\u003c/a\u003e\n\n## Purpose\n\n__Why Do We Need To Visualize Molecular Similarity / Diversity?__\n\nThere are several contexts where it is helpful to visualize the diversity of a molecular dataset:\n\n_Exploratory Experimental Synthesis_\n\nFor a chemist, synthesizing new molecules with targeted properties is often a laborious and time-consuming task.\nIn such a case, it becomes useful to check the similarity of a newly proposed (un-synthesized) molecule to the ones already synthesized.\nIf the proposed molecule is too similar to the existing repertoire of molecules, it will probably not yield enough new information /\nproperty and thus need not be synthesized. 
Thus, a chemist can avoid spending\ntime and effort synthesizing molecules not useful for the project.\n\n_Lead Optimization and Virtual Screening_\n\nThis application is the converse of exploratory synthesis: the interest is in finding molecules in a database that are structurally similar to an \"active\" molecule. In this context, \"active\" might refer to pharmacological activity (drug discovery campaigns) or desirable chemical properties (for example, to discover alternative chemicals and solvents for an application). In such a case, AIMSim helps to run virtual screenings over a molecular database and visualize the results.\n\n_Machine Learning Molecular Properties_\n\nIn the context of machine learning, visualizing the diversity of the training set gives a good idea about its information quality.\nA more diverse training dataset tends to yield a more robust model, which generalizes well to unseen data. Additionally, such a visualization can \nidentify \"clusters of similarity\" indicating the need for separately trained models for each cluster.\n\n_Substrate Scope Robustness Verification_\n\nWhen proposing a novel reaction it is essential for the practicing chemist to evaluate the transformation's tolerance of diverse functional groups and substrates (Glorius, 2013). Using `AIMSim`, one can evaluate the structural and chemical similarity across an entire substrate scope to ensure that it avoids redundant species. Below is an example similarity heatmap generated to visualize the diversity of a three-component sulfonamide coupling reaction with a substantial number of substrates (Chen, 2018).\n![Image of sulfonamide substrate scope](tests/sulfonamide-substrate-scope.png)\n\nMany of the substrates appear similar to one another and thereby redundant, but in reality the core sulfone moiety and the use of the same coupling partner when evaluating functional group tolerance account for this apparent shortcoming. 
Also of note is the region of high similarity along the diagonal where the substrates often differ by a single halide heteroatom or substitution pattern.\n\n## Installing AIMSim\nIt is recommended to install `AIMSim` in a virtual environment with [`conda`](https://docs.conda.io/en/latest/) or Python's [`venv`](https://docs.python.org/3/library/venv.html).\n### `pip`\n`AIMSim` can be installed with a single command using Python's package manager `pip`:\n`pip install aimsim`\nThis command also installs the required dependencies.\n\n\u003e [!NOTE]\n\u003e Looking to use AIMSim for descriptor calculation or extend its functionality? `AIMSim`'s core modules for creating molecules, calculating descriptors, and comparing the results are available without support for plotting or visualization in the PyPI package `aimsim_core`.\n\n### `conda`\n`AIMSim` is also available with the `conda` package manager via:\n`conda install -c conda-forge aimsim`\nThis will install all dependencies from `conda-forge`.\n\n### Note for mordred-descriptor\nAIMSim v1 provided direct support for the descriptors provided in the `mordred` package but unfortunately the original `mordred` is now abandonware.\nThe **unofficial** [`mordredcommunity`](https://github.com/JacksonBurns/mordred-community) is now used in version 2.1 and newer to deliver the same features but with support for modern Python.\n\n## Running AIMSim\n`AIMSim` is compatible with Python 3.8 to 3.12.\nStart `AIMSim` with a graphical user interface:\n\n`aimsim`\n\nStart `AIMSim` with a prepared configuration YAML file (`config.yaml`):\n\n`aimsim config.yaml`\n\n### Currently Implemented Fingerprints\n\n1. Morgan Fingerprint (Equivalent to the ECFP fingerprints)\n2. RDKit Topological Fingerprint\n3. RDKit Daylight Fingerprint\n\n_The following are available via command line use (config.yaml) only:_\n\n4. MinHash Fingerprint (see [MHFP](https://github.com/reymond-group/mhfp))\n5. 
All fingerprints available from the [ccbmlib](https://github.com/vogt-m/ccbmlib) package (_specify 'ccbmlib:descriptorname' for command line input_).\n6. All descriptors and fingerprints available from [PaDELPy](https://github.com/ecrl/padelpy), an interface to PaDEL-Descriptor. (_specify 'padelpy:descriptorname' for command line input._).\n7. All descriptors available through the [Mordred](https://github.com/mordred-descriptor/mordred) library (_specify 'mordred:descriptorname' for command line input._). To enable this option, you must install with `pip install 'aimsim[mordred]'` (see disclaimer in the Installation section above).\n\n### Currently Implemented Similarity Scores\n\n44 commonly used similarity scores are implemented in AIMSim.\nAdditional L0, L1 and L2 norm based similarities are also implemented. [View our Online Documentation](https://vlachosgroup.github.io/AIMSim/implemented_metrics.html) for a complete list of implemented similarity scores.\n\n\n### Currently Implemented Functionalities\n\n1. Measure Search: Automate the search of fingerprint and similarity metric (called a \"measure\") using the following algorithm:\n  Step 1: Select an arbitrary featurization scheme.\n  Step 2: Featurize the molecule set using the selected scheme.\n  Step 3: Choose an arbitrary similarity measure.\n  Step 4: Select each molecule’s nearest and furthest neighbors in the set using the similarity measure.\n  Step 5: Measure the correlation between a molecule’s QoI and its nearest neighbor’s QoI.\n  Step 6: Measure the correlation between a molecule’s QoI and its furthest neighbor’s QoI.\n  Step 7: Define a score which maximizes the value in Step 5 and minimizes the value in Step 6.\n  Step 8: Iterate Steps 1 – 7 to select the featurization scheme and similarity measure to maximize the result of Step 7. \n2. 
See Property Variation with Similarity: Visualize the correlation in the QoI between nearest neighbor molecules (most similar pairs in the molecule set) and between the furthest neighbor molecules (most dissimilar pairs in the molecule set). This is used to verify that the chosen measure is appropriate for the task.\n\n3. Visualize Dataset: Visualize the diversity of the molecule set in the form of a pairwise similarity density and a similarity heatmap of the molecule set. Embed the molecule set in 2D space using principal component analysis (PCA)[3], multi-dimensional scaling[4], t-SNE[5], Spectral Embedding[6], or Isomap[7].\n\n4. Compare Target Molecule to Molecule Set: Run a similarity search of a molecule against a database of molecules (molecule set). This task can be used to identify the most similar (useful in virtual screening operations) or most dissimilar (useful in applications that require high diversity such as training set design for machine learning models) molecules.\n\n5. Cluster Data: Cluster the molecule set. The following algorithms are implemented: \n\nFor arbitrary molecular features or similarity metrics with defined Euclidean distances: K-Medoids[3] and Ward[8] (hierarchical clustering).\n\nFor binary fingerprints: Complete, single and average linkage hierarchical clustering[8].\n\nThe clustered data is plotted in two dimensions using principal component analysis (PCA)[3], multi-dimensional scaling[4], or t-SNE[5].\n\n6. Outlier Detection: Using an isolation forest, check which molecules are potentially novel or are outliers according to the selected descriptor. Output can be sent directly to the command line by specifying `output` to be `terminal` or to a text file by instead providing a filename.\n\n## Contributors\n\nDeveloper: Himaghna Bhattacharjee, Vlachos Research Lab. ([LinkedIn](www.linkedin.com/in/himaghna-bhattacharjee))\n\nDeveloper: Jackson Burns, Don Watson Lab. 
([Personal Site](https://www.jacksonwarnerburns.com/))\n\n## `AIMSim` in the Literature\n - [Applications of Artificial Intelligence and Machine Learning Algorithms to Crystallization](https://doi.org/10.1021/acs.chemrev.2c00141)\n - [Recent Advances in Machine-Learning-Based Chemoinformatics: A Comprehensive Review](https://doi.org/10.3390/ijms241411488)\n\n## Developer Notes\nIssues and Pull Requests are welcomed! To propose an addition to `AIMSim`, open an issue and the developers will tag it as an _enhancement_ and start discussion.\n\n`AIMSim` includes an automated testing apparatus operated by Python's _unittest_ built-in package. To execute tests related to the core functionality of `AIMSim`, run this command:\n\n`python -m unittest discover`\n\nFull multiprocessing speedup and efficiency tests take more than 10 hours to run due to the number of replicates required. To run these tests, create a file called `.speedup-test` in the `AIMSim` directory and execute the above command as shown.\n\nTo manually build the docs, execute the following with `sphinx` and `m2r` installed and from the `/docs` directory:\n\n`m2r ../README.md | mv ../README.rst . | sphinx-apidoc -f -o . .. | make html | cp _build/html/* .`\n\nDocumentation should automatically build on push to master branch via an automated GitHub action.\n\nFor packaging on PyPI:\n\n`python -m build; twine upload dist/*`\n\nBe sure to bump the version in `__init__.py`.\n\n## Citation\nIf you use this code for scientific publications, please cite the following paper.\n\nHimaghna Bhattacharjee, Jackson Burns, Dionisios G. 
Vlachos, AIMSim: An accessible cheminformatics platform for similarity operations on chemicals datasets, Computer Physics Communications, Volume 283, 2023, 108579, ISSN 0010-4655, https://doi.org/10.1016/j.cpc.2022.108579.\n\n## License\nThis code is made available under the terms of the _MIT Open License_:\n\nCopyright (c) 2020-2027 Himaghna Bhattacharjee \u0026 Jackson Burns\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n\n\n## Works Cited\n[1] Collins, K. and Glorius, F., A robustness screen for the rapid assessment of chemical reactions. Nature Chem 5, 597–601 (2013). https://doi.org/10.1038/nchem.1669\n\n[2] Chen, Y., Murray, P.R.D., Davies, A.T., and Willis M.C., J. Am. Chem. Soc. 140 (28), 8781-8787 (2018). https://doi.org/10.1021/jacs.8b04532\n\n[3] Hastie, T., Tibshirani R. and Friedman J., The Elements of statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Springer Series in Statistics (2009).\n\n[4] Borg, I. 
and Groenen, P.J.F., Modern Multidimensional Scaling: Theory and Applications, Springer Series in Statistics (2005).\n\n[5] van der Maaten, L.J.P. and Hinton, G.E., Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605 (2008).\n\n[6] Ng, A.Y., Jordan, M.I. and Weiss, Y., On Spectral Clustering: Analysis and an algorithm. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, MIT Press (2001).\n\n[7] Tenenbaum, J.B., De Silva, V. and Langford, J.C, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500), 2319-23 (2000). https://doi.org/10.1126/science.290.5500.2319.\n\n[8] Murtagh, F. and Contreras, P., Algorithms for hierarchical clustering: an overview. WIREs Data Mining Knowl Discov (2011). https://doi.org/10.1002/widm.53\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVlachosGroup%2FAIMSim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVlachosGroup%2FAIMSim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVlachosGroup%2FAIMSim/lists"}