SmallMolEval
----------------
Using machine learning to score potential drug candidates may offer an advantage over traditional, imprecise scoring functions because the parameters and model structure can be learned from data. However, such models may lack interpretability, are often overfit, and do not generalize to drug targets and chemotypes absent from the training data. Benchmark datasets are prone to artificial enrichment and analogue bias because certain scaffolds are overrepresented in experimentally determined active sets. Spatial statistics can quantify a dataset's topology and expose these potential biases. Dataset clumping, the combination of self-similarity among actives and their separation from decoys in chemical space, is associated with overoptimistic virtual screening results. This code explores methods of quantifying potential biases and examines some common benchmark datasets.
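
The scaffold problem can be made concrete with a quick check: count the Bemis-Murcko scaffolds in an active set and see how much of the set the most common scaffold covers. The sketch below is an illustration only; it is not code from this repository, assumes RDKit is available, and uses placeholder SMILES.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_counts(smiles_list):
    """Count Bemis-Murcko scaffolds across a set of molecules."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable SMILES
        counts[MurckoScaffold.MurckoScaffoldSmiles(mol=mol)] += 1
    return counts

# Placeholder actives; in practice these come from a benchmark's active set.
actives = ["c1ccccc1CCN", "c1ccccc1CCO", "c1ccccc1CC(=O)O", "C1CCNCC1"]
counts = scaffold_counts(actives)
top_scaffold, n = counts.most_common(1)[0]
print(f"top scaffold {top_scaffold} covers {n / sum(counts.values()):.0%} of actives")
```

An active set dominated by one or two scaffolds is a warning sign of analogue bias: a model can score well simply by recognizing the scaffold.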

Documentation
----------------
Files:
remove_AVE_bias2.py
A slight modification of the Atomwise script that splits the data.

run_remove_AVE_bias.py
An example of running remove_AVE_bias2.py on the DUD-E dataset.

main.py, main_activeonly.py, main.old.py
Scripts that run the MUV spatial statistics (see the first sketch after this list).

DescriptorSets.py
Functions used by the MUV statistics and called from the main scripts.

gf.plot
Plots the MUV statistics.

makegraphs.py
Uses gf.plot to produce whole-dataset plots.

analyze_AVE_bias.py
Unmodified from the Atomwise version; computes the AVE bias score and the AUC of ligand-based models (see the second sketch after this list).

aveanalyze.py
Runs analyze_AVE_bias.py over a directory of subdirectories, each containing splits for a different receptor.
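
The spatial statistics behind MUV can be summarized by two cumulative functions: the nearest-neighbor function G(t), the fraction of actives whose nearest other active lies within distance t, and the empty-space function F(t), the fraction of decoys whose nearest active lies within t. The first sketch below sums their difference as a clumping score. It is an illustration under assumed Euclidean descriptors and random placeholder data, not this repository's implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def G_function(actives, thresholds):
    """Fraction of actives whose nearest other active lies within each threshold t."""
    d = cdist(actives, actives)
    np.fill_diagonal(d, np.inf)  # exclude each active's zero distance to itself
    nn = d.min(axis=1)
    return np.array([(nn <= t).mean() for t in thresholds])

def F_function(decoys, actives, thresholds):
    """Fraction of decoys whose nearest active lies within each threshold t."""
    nn = cdist(decoys, actives).min(axis=1)
    return np.array([(nn <= t).mean() for t in thresholds])

# Placeholder data: a tight cluster of actives far from a diffuse decoy cloud.
rng = np.random.default_rng(0)
actives = rng.normal(0.0, 0.2, size=(50, 10))
decoys = rng.normal(1.0, 0.5, size=(500, 10))

thresholds = np.linspace(0.0, 3.0, 31)
S = np.sum(G_function(actives, thresholds) - F_function(decoys, actives, thresholds))
# S well above 0 indicates self-similar actives separated from decoys (clumping).
print(f"clumping score S = {S:.2f}")
```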
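
The second sketch outlines the AVE idea that analyze_AVE_bias.py implements: for a train/validation split, the bias is B = (AA - AI) + (II - IA), where each term is the fraction of validation actives (A) or inactives (I) lying within a distance threshold of their nearest training active or inactive, averaged over thresholds. The fingerprint inputs, Tanimoto distance, and threshold grid here are assumptions for illustration, not the script's actual parameters; B near zero suggests a split that does not reward memorization.

```python
import numpy as np
from rdkit import DataStructs

def nn_distances(val_fps, train_fps):
    """Tanimoto distance from each validation fingerprint to its nearest training fingerprint."""
    return np.array([1.0 - max(DataStructs.BulkTanimotoSimilarity(fp, train_fps))
                     for fp in val_fps])

def mean_hit_fraction(dists, thresholds):
    """Fraction of validation molecules within each threshold, averaged over thresholds."""
    return float(np.mean([(dists <= t).mean() for t in thresholds]))

def ave_bias(val_act, val_inact, train_act, train_inact,
             thresholds=np.arange(0.0, 1.0, 0.01)):
    """AVE bias B = (AA - AI) + (II - IA); near 0 suggests an unbiased split."""
    aa = mean_hit_fraction(nn_distances(val_act, train_act), thresholds)
    ai = mean_hit_fraction(nn_distances(val_act, train_inact), thresholds)
    ii = mean_hit_fraction(nn_distances(val_inact, train_inact), thresholds)
    ia = mean_hit_fraction(nn_distances(val_inact, train_act), thresholds)
    return (aa - ai) + (ii - ia)
```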

Authors
----------------

SmallMolEval was written by Dr. Sally Ellingson.

Release
----------------

SmallMolEval is released under an MIT license. For more details see the
NOTICE and LICENSE files.

``LLNL-CODE-759342``