https://github.com/gitter-lab/active-learning-drug-discovery
Iterative chemical screening pipeline for drug discovery
https://github.com/gitter-lab/active-learning-drug-discovery
active-learning drug-discovery virtual-screening
Last synced: 6 months ago
JSON representation
Iterative chemical screening pipeline for drug discovery
- Host: GitHub
- URL: https://github.com/gitter-lab/active-learning-drug-discovery
- Owner: gitter-lab
- License: mit
- Created: 2018-10-24T18:02:22.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2025-06-02T17:44:40.000Z (11 months ago)
- Last Synced: 2025-06-03T07:03:51.580Z (11 months ago)
- Topics: active-learning, drug-discovery, virtual-screening
- Language: Jupyter Notebook
- Homepage:
- Size: 102 MB
- Stars: 5
- Watchers: 4
- Forks: 0
- Open Issues: 13
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Active Learning in Drug Discovery
[](https://github.com/gitter-lab/active-learning-drug-discovery/actions/workflows/test.yml)
## Installation
We recommend creating a [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) to manage the dependencies.
Assumes [Anaconda installation](https://www.anaconda.com/download/).
Clone this repository:
```
git clone https://github.com/gitter-lab/active-learning-drug-discovery.git
cd active-learning-drug-discovery
```
Setup the `active_learning_dd` conda environment using the `conda_cpu_env.yml` file:
```
conda env create -f conda_cpu_env.yml
conda activate active_learning_dd
```
If you do want GPU support, you can replace `conda_cpu_env.yml` with `conda_env.yml`.
Finally, install `active_learning_dd` with `pip`:
```
pip install -e .
```
Now check the installation is working correctly by running the sample data test:
```
cd chtc_runners
python sample_data_runner.py \
--pipeline_params_json_file=../param_configs/sample_data_config.json \
--hyperparams_json_file=../param_configs/experiment_PstP_hyperparams/sampled_hyparams/ClusterBasedWCSelector_609.json \
--iter_max=5 \
--no-precompute_dissimilarity_matrix \
--initial_dataset_file=../datasets/sample_data/training_data/iter_0.csv.gz
```
You should see the following last prompt:
```
Finished testing sample dataset. Verified that hashed selection matches stored hash.
```
## datasets
The datasets used in this study are: PriA-SSB target, 107 PubChem BioAssay targets, and PstP target.
The datasets will be uploaded to Zenodo in the near future.
The repository also contains a small dataset for testing: `datasets/sample_data/`.
## active_learning_dd
The `active_learning_dd` subdirectory contains the main codebase for the iterative batched screening components.
Consult the README in that subdirectory for details.
## param_configs
This subdirectory contains json config files for strategies and experiments used in the [thesis document](https://www.biostat.wisc.edu/~gitter/pubs/AlnammiThesis.pdf).
Consult the README in that subdirectory for details.
## analysis_notebooks
This subdirectory contains Jupyter notebooks that preprocess the datasets, debug methods, analyze the results, and produce result images.
## runner scripts
`chtc_runners/` contains runner scripts for the experiments in the [thesis document](https://www.biostat.wisc.edu/~gitter/pubs/AlnammiThesis.pdf).
`chtc_runners/simulation_runner.py` can be used as a starting template for your own runner script.
`chtc_runners/simulation_utils.py` contains helper functions for pre- and post-processing iteration selections for retrospective experiments.
Consult the README in that subdirectory for details.
## Implemented Iterative Strategies
The following are the currently implemented strategies in `active_learning_dd/next_batch_selector/` (see [thesis document](https://www.biostat.wisc.edu/~gitter/pubs/AlnammiThesis.pdf) and hyperapameter examples in `param_configs/`):
1. **ClusterBasedWeightSelector (CBWS)**: assigns exploitation-exploration weights to every cluster, splits the budget between exploit-explore, then select compounds from most exploitable clusters, followed by selecting most explorable clusters.
2. **ClusterBasedRandom**: randomly samples a cluster, then randomly samples compounds from within clusters.
3. **InstanceBasedRandom**: randomly samples compounds from the pool.
4. **ClusterBasedDissimilar**: samples clusters dissimilarly according to a dissimilarity measure which is by default fingerprint based.
5. **InstanceBasedDissimilar**: samples compounds dissimilarly from the pool.
6. **MABSelector**: Upper-Confidence-Bound (UCB) style solution from Multi-Armed Bandits (MAB).
Assigns every cluster an upper-bound estimate of the reward that is a combination of a exploitation term and an exploration term.
Samples clusters with the highest rewards.