Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tencent-ailab/drugood
OOD Dataset Curator and Benchmark for AI-aided Drug Discovery
https://github.com/tencent-ailab/drugood
Last synced: 2 days ago
JSON representation
OOD Dataset Curator and Benchmark for AI-aided Drug Discovery
- Host: GitHub
- URL: https://github.com/tencent-ailab/drugood
- Owner: tencent-ailab
- License: other
- Created: 2022-03-14T02:57:53.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-10-23T19:07:12.000Z (about 2 years ago)
- Last Synced: 2024-11-16T04:33:05.501Z (4 days ago)
- Language: Python
- Homepage:
- Size: 491 KB
- Stars: 151
- Watchers: 6
- Forks: 22
- Open Issues: 6
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
README
# :fire:DrugOOD:fire:: OOD Dataset Curator and Benchmark for AI Aided Drug Discovery
This is the official implementation of the DrugOOD project, this is the project page:
## Environment Installation
You can install the conda environment using the drugood.yaml file provided:
```shell
!git clone https://github.com/tencent-ailab/DrugOOD.git
!cd DrugOOD
!conda env create --name drugood --file=drugood.yaml
!conda activate drugood
```
Then you can go to the demo at `demo/demo.ipynb` which gives a quick practice on how to use DrugOOD.## Demo
For a quick practice on using DrugOOD for dataset curation and OOD benchmarking, one can refer to the `demo/demo.ipynb`.
## Dataset Curator
First, you need to generate the required DrugOOD dataset with our code. The dataset curator currently focusing on generating datasets from CHEMBL. It supports the following two tasks:
- Ligand Based Affinity Prediction (LBAP).
- Structure Based Affinity Prediction (SBAP).For OOD domain annotations, it supports the following 5 choices.
- Assay.
- Scaffold.
- Size.
- Protein. (only for SBAP task)
- Protein Family. (only for SBAP task)For noise annotations, it supports the following three noise levels. Datasets with different
noises are implemented by filters with different levels of strictness.- Core.
- Refined.
- General.At the same time, due to the inconvenient conversion between different measurement type (E.g. IC50, EC50, Ki, Potency), one needs to specify the measurement type when generating the dataset.
### How to Run and Reproduce the 96 Datasets?
Firstly, specifiy the path of CHEMBL database and the directory to save the data in the configuration
file: `configs/_base_/curators/lbap_defaults.py` for LBAP task or `configs/_base_/curators/sbap_defaults.py` for SBAP task.
The `source_root="YOUR_PATH/chembl_29_sqlite/chembl_29.db"` means the path to the
chembl29 sqllite file. The `target_root="data/"` specifies the folder to save the generated data.Note that you can download the original chembl29 database with sqllite format from `http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_29/chembl_29_sqlite.tar.gz`.
The built-in configuration files are located in:
`configs/curators/`. Here we provide the 96 config files to __reproduce__ the 96 datasets in our paper. Meanwhile,
you can also customize your own datasets by changing the config files.Run `tools/curate.py` to generate dataset. Here are some examples:
Generate datasets for the LBAP task, with `assay` as domain, `core` as noise
level, `IC50` as measurement type, `LBAP` as task type.:```shell
python tools/curate.py --cfg configs/curators/lbap_core_ic50_assay.py
```Generate datasets for the SBAP task, with `protein` as domain, `refined` as noise level, `EC50` as
measurement type, `SBAP` as task type.:```shell
python tools/curate.py --cfg configs/curator/sbap_refined_ec50_protein.py
```## Benchmarking SOTA OOD Algorithms
Currently we support 6 different baseline algorithms:
- ERM
- IRM
- GroupDro
- Coral
- MixUp
- DANNMeanwhile, we support various GNN backbones:
- GIN
- GCN
- Weave
- ShcNet
- GAT
- MGCN
- NF
- ATi-FPGNN
- GTransformerAnd different backbones for protein sequence modeling:
- Bert
- ProteinBert### How to Run?
Firstly, run the following command to install.
```shell
python setup.py develop
```Run the LBAP task with ERM algorithm:
```shell
python tools/train.py configs/algorithms/erm/lbap_core_ec50_assay_erm.py
```If you would like to run ERM on other datasets, change the corresponding options inside the above
config file. For example, `ann_file = 'data/lbap_core_ec50_assay.json'` specifies the input data.Similarly, run the SBAP task with ERM algorithm:
```shell
python tools/train.py configs/algorithms/erm/sbap_core_ec50_assay_erm.py
```## Reference
:smile:If you find this repo is useful, please consider to cite our paper:
```
@ARTICLE{2022arXiv220109637J,
author = {{Ji}, Yuanfeng and {Zhang}, Lu and {Wu}, Jiaxiang and {Wu}, Bingzhe and {Huang}, Long-Kai and {Xu}, Tingyang and {Rong}, Yu and {Li}, Lanqing and {Ren}, Jie and {Xue}, Ding and {Lai}, Houtim and {Xu}, Shaoyong and {Feng}, Jing and {Liu}, Wei and {Luo}, Ping and {Zhou}, Shuigeng and {Huang}, Junzhou and {Zhao}, Peilin and {Bian}, Yatao},
title = "{DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations}",
journal = {arXiv e-prints},
keywords = {Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Quantitative Biology - Quantitative Methods},
year = 2022,
month = jan,
eid = {arXiv:2201.09637},
pages = {arXiv:2201.09637},
archivePrefix = {arXiv},
eprint = {2201.09637},
primaryClass = {cs.LG}
}
```## Disclaimer
This is not an officially supported Tencent product.