https://github.com/chembl/chembl_multitask_model
Target prediction multitask neural network, with examples running it in Python, C++, Julia and JS
- Host: GitHub
- URL: https://github.com/chembl/chembl_multitask_model
- Owner: chembl
- License: mit
- Created: 2021-03-11T14:32:20.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2025-09-01T13:56:05.000Z (10 months ago)
- Last Synced: 2025-09-01T15:24:51.154Z (10 months ago)
- Topics: cheminformatics, chemistry, machine-learning
- Language: Python
- Homepage:
- Size: 88.9 MB
- Stars: 18
- Watchers: 1
- Forks: 10
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ChEMBL Multitask Neural Network model
Small and fast target prediction model trained on a panel of targets using ChEMBL data. The model can be used in off-target prediction scenarios with large collections of compounds.
- Based on the blogpost: http://chembl.blogspot.com/2019/05/multi-task-neural-network-on-chembl.html
- Model available in KNIME thanks to Greg Landrum: https://www.knime.com/blog/interactive-bioactivity-prediction-with-multitask-neural-networks
The model is exported to the ONNX format, so it can be used from any programming language that can generate fingerprints with RDKit and run an ONNX model.
# Try the model online!
The online demo uses both the RDKit JavaScript MinimalLib and ONNX.js, and is hosted on GitHub Pages: https://chembl.github.io/chembl_multitask_model
# Data Extraction
```bash
python extract_format_dataset.py --chembl_version 36 --output_dir ./chembl_36/
```
Activities in ChEMBL that meet all of the following requirements are extracted:
- activities.standard_units = 'nM'
- activities.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency')
- activities.data_validity_comment IS NULL
- activities.standard_relation IN ('=', '<')
- activities.potential_duplicate = 0 AND assays.confidence_score >= 8
- target_dictionary.target_type = 'SINGLE PROTEIN'
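The requirements above amount to a single predicate over joined activity rows. The following is a minimal sketch, assuming each row is a plain dict keyed by the column names listed; the `keep_activity` name and the dict layout are illustrative and not taken from `extract_format_dataset.py`.

```python
ALLOWED_TYPES = {"EC50", "IC50", "Ki", "Kd", "XC50", "AC50", "Potency"}
ALLOWED_RELATIONS = {"=", "<"}


def keep_activity(row):
    """Return True if a joined activity row passes every listed requirement.

    `row` is a hypothetical dict holding the activities, assays and
    target_dictionary columns named in the list above.
    """
    return (
        row["standard_units"] == "nM"
        and row["standard_type"] in ALLOWED_TYPES
        and row["data_validity_comment"] is None  # SQL NULL mapped to None
        and row["standard_relation"] in ALLOWED_RELATIONS
        and row["potential_duplicate"] == 0
        and row["confidence_score"] >= 8
        and row["target_type"] == "SINGLE PROTEIN"
    )
```

In practice these conditions would more likely be expressed as a SQL `WHERE` clause against the ChEMBL database; the Python form is only meant to make the logic explicit.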
Targets are kept only if they:
- have at least 100 active and 100 inactive compounds
- are mentioned in at least 2 publications
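The target selection can be sketched as a counting pass over labelled data. This is a rough illustration, assuming rows of `(target_id, compound_id, is_active, doc_id)` tuples and that "publications" means distinct document identifiers; the function name and tuple layout are hypothetical.

```python
from collections import defaultdict

MIN_PER_CLASS = 100    # minimum active and minimum inactive compounds
MIN_PUBLICATIONS = 2   # minimum number of distinct publications


def select_targets(rows, min_per_class=MIN_PER_CLASS, min_pubs=MIN_PUBLICATIONS):
    """Return the sorted target ids that satisfy both conditions.

    rows: iterable of (target_id, compound_id, is_active, doc_id) tuples.
    """
    actives = defaultdict(set)
    inactives = defaultdict(set)
    docs = defaultdict(set)
    for target_id, compound_id, is_active, doc_id in rows:
        # Count distinct compounds per class and distinct documents per target
        (actives if is_active else inactives)[target_id].add(compound_id)
        docs[target_id].add(doc_id)
    return sorted(
        t for t in docs
        if len(actives[t]) >= min_per_class
        and len(inactives[t]) >= min_per_class
        and len(docs[t]) >= min_pubs
    )
```

The thresholds are parameters so the same function can be tried on a small sample with lower limits.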
Compounds are labelled as active using the [IDG protein family activity thresholds](https://druggablegenome.net/IDGProteinFamilies):
- Kinases: <= 30 nM
- GPCRs: <= 100 nM
- Nuclear Receptors: <= 100 nM
- Ion Channels: <= 10 μM
- Non-IDG Family Targets: <= 1 μM
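The thresholds above can be expressed as a lookup table in nM, matching the units of the extracted activities. This is a minimal sketch; the family keys and the `is_active` helper are illustrative names, not identifiers from the repository.

```python
# Activity thresholds in nM, one per IDG protein family
THRESHOLDS_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,  # 10 μM
}
DEFAULT_THRESHOLD_NM = 1_000.0  # 1 μM for non-IDG family targets


def is_active(value_nm, family=None):
    """Label a measurement as active if it is at or below its family threshold."""
    threshold = THRESHOLDS_NM.get(family, DEFAULT_THRESHOLD_NM)
    return value_nm <= threshold
```

Any family not present in the table falls back to the non-IDG threshold of 1 μM.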
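When multiple measurements are found for the same compound-target pair, the one with the lowest concentration is selected. This intentionally biases the model toward sensitivity, since a pair counts as active if any of its measurements falls below the threshold.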
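Selecting the lowest concentration per pair can be sketched as a single pass over the measurements. The example assumes a pair means a compound-target pair and that values are already in nM; the function name and tuple layout are hypothetical.

```python
def lowest_per_pair(measurements):
    """Keep the lowest concentration for each (compound, target) pair.

    measurements: iterable of (compound_id, target_id, value_nm) tuples.
    Returns a dict mapping (compound_id, target_id) to the minimum value.
    """
    best = {}
    for compound_id, target_id, value_nm in measurements:
        key = (compound_id, target_id)
        if key not in best or value_nm < best[key]:
            best[key] = value_nm
    return best
```

Because the most potent measurement wins, a pair with one value below the threshold and several above it still ends up labelled active.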
# Model training
```bash
python train_chembl_multitask.py --chembl_version 36 --data_file ./chembl_36/mt_data_36_all.h5 --output_dir ./chembl_36/
```
# Extract kinase data and train a kinase-specific model
```bash
python extract_format_dataset.py --chembl_version 36 --protein_family kinase --output_dir ./kinase/ && python train_chembl_multitask.py --chembl_version 36 --data_file ./kinase/mt_data_36_kinase.h5 --output_dir ./kinase/
```
# Example to predict in Python using the ONNX Runtime
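```Python
import onnxruntime
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors

FP_SIZE = 1024
RADIUS = 2


def calc_morgan_fp(smiles):
    """Return the Morgan fingerprint of a SMILES string as a float32 array."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
        mol, RADIUS, nBits=FP_SIZE)
    a = np.zeros((0,), dtype=np.float32)
    # ConvertToNumpyArray resizes `a` in place to FP_SIZE elements
    DataStructs.ConvertToNumpyArray(fp, a)
    return a


def format_preds(preds, targets):
    """Pair each target name with its predicted score."""
    preds = np.concatenate(preds).ravel()
    np_preds = [(tar, pre) for tar, pre in zip(targets, preds)]
    # The source is truncated from this dtype onwards: '<f4' (float32) and
    # the remaining lines of this function are a reconstruction.
    dt = [('chembl_id', '|U20'), ('pred', '<f4')]
    np_preds = np.array(np_preds, dtype=dt)
    np_preds[::-1].sort(order='pred')  # sort by score, highest first
    return np_preds

# ... (the rest of the Python example is missing from the source)
```
# Example to predict in Julia
```julia
# ... (the start of this example is missing from the source; it is where the
# packages are loaded and `mt_chembl`, `targets` and `mol` are defined)
fp_details = Dict{String, Any}("nBits" => 1024, "radius" => 2)
mfp = get_morgan_fp(mol, fp_details)

# convert the bitstring to a 1024×1 Matrix{Float32}
mfp = map(x -> parse(Float32, string(x)), collect(mfp))
mfp = reshape(mfp, (length(mfp), 1))

# test a molecule
pred = play!(mt_chembl, mfp)
pred = collect(Iterators.flatten(pred))

# pair each target with its score and sort by score, highest first
res = tuple.(targets, pred)
res = sort(res, by=res->res[2], rev=true)
```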
# C++ REST microservice
https://github.com/eloyfelix/pistache_predictor