{"id":19706127,"url":"https://github.com/llnl/smallmoleval","last_synced_at":"2026-05-26T16:32:05.399Z","repository":{"id":66083041,"uuid":"151645084","full_name":"LLNL/SmallMolEval","owner":"LLNL","description":"Using machine learning to score potential drug candidates may offer an advantage over traditional imprecise scoring functions because the parameters and model structure can be learned from the data. However, models may lack interpretability, are often overfit to the data, and are not generalizable to drug targets and chemotypes not in the training data. Benchmark datasets are prone to artificial enrichment and analogue bias due to the overrepresentation of certain scaffolds in experimentally determined active sets. Datasets can be evaluated using spatial statistics to quantify the dataset topology and better understand potential biases. Dataset clumping comprises a combination of self-similarity of actives and separation from decoys in chemical space and is associated with overoptimistic virtual screening results. 
This code explores methods of quantifying potential biases and examines some common benchmark datasets.","archived":false,"fork":false,"pushed_at":"2019-06-19T18:44:13.000Z","size":25,"stargazers_count":3,"open_issues_count":1,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-12T05:03:43.471Z","etag":null,"topics":["machine-learning","python","statistics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LLNL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-04T22:46:28.000Z","updated_at":"2022-03-02T16:13:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"e229e78a-0bee-4baa-8b87-592e306369bd","html_url":"https://github.com/LLNL/SmallMolEval","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LLNL/SmallMolEval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2FSmallMolEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2FSmallMolEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2FSmallMolEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2FSmallMolEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LLNL","download_url":"https://codeload.github.com/LLNL/SmallMolEval/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2FSmallMolEval
/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285817006,"owners_count":27236561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-22T02:00:05.934Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","python","statistics"],"created_at":"2024-11-11T21:34:17.283Z","updated_at":"2025-11-22T16:01:27.738Z","avatar_url":"https://github.com/LLNL.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"SmallMolEval\n----------------\nUsing machine learning to score potential drug candidates may offer an advantage over traditional imprecise scoring functions because the parameters and model structure can be learned from the data. However, models may lack interpretability, are often overfit to the data, and are not generalizable to drug targets and chemotypes not in the training data. Benchmark datasets are prone to artificial enrichment and analogue bias due to the overrepresentation of certain scaffolds in experimentally determined active sets. Datasets can be evaluated using spatial statistics to quantify the dataset topology and better understand potential biases. Dataset clumping comprises a combination of self-similarity of actives and separation from decoys in chemical space and is associated with overoptimistic virtual screening results. 
This code explores methods of quantifying potential biases and examines some common benchmark datasets.\n\nDocumentation\n----------------\nFiles:\nremove_AVE_bias2.py\nSlight modification of the Atomwise script that splits the data.\n\nrun_remove_AVE_bias.py\nExample of running remove_AVE_bias2.py on the DUDE dataset.\n\nmain.py, main_activeonly.py, main.old.py\nScripts that run the MUV spatial statistics.\n\nDescriptorSets.py\nMostly contains functions used by the MUV statistics and called in the main files.\n\ngf.plot\nPlots MUV statistics.\n\nmakegraphs.py\nUses gf.plots to make whole-dataset plots.\n\nanalyze_AVE_bias.py\nUnchanged from the Atomwise original; computes the bias score and AUC of ligand-based models.\n\naveanalyze.py\nRuns analyze_AVE_bias.py over a directory whose subdirectories contain splits on different receptors.\n\nAuthors\n----------------\n\nSmallMolEval was written by Dr. Sally Ellingson.\n\nRelease\n----------------\n\nSmallMolEval is released under an MIT license. For more details see the\nNOTICE and LICENSE files.\n\n``LLNL-CODE-759342``","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllnl%2Fsmallmoleval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fllnl%2Fsmallmoleval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllnl%2Fsmallmoleval/lists"}