https://github.com/chembl/chembl_multitask_model

Target prediction multitask neural network, with examples running it in Python, C++, Julia and JS

Host: GitHub
URL: https://github.com/chembl/chembl_multitask_model
Owner: chembl
License: mit
Created: 2021-03-11T14:32:20.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2025-09-01T13:56:05.000Z (11 months ago)
Last Synced: 2025-09-01T15:24:51.154Z (11 months ago)
Topics: cheminformatics, chemistry, machine-learning
Language: Python
Homepage:
Size: 88.9 MB
Stars: 18
Watchers: 1
Forks: 10
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# ChEMBL Multitask Neural Network model

Small and fast target prediction model trained on a panel of targets using ChEMBL data. The model can be used in off-target prediction scenarios with large collections of compounds. 

- Based on the blog post: http://chembl.blogspot.com/2019/05/multi-task-neural-network-on-chembl.html

- Model available in KNIME thanks to Greg Landrum: https://www.knime.com/blog/interactive-bioactivity-prediction-with-multitask-neural-networks

The model is exported to the ONNX format, so it can be used from any programming language that can run ONNX models and generate fingerprints with RDKit.

# Try the model online!

The online demo runs in the browser using both the RDKit JavaScript MinimalLib and ONNX.js. It is hosted on GitHub Pages: https://chembl.github.io/chembl_multitask_model

# Data Extraction

```bash
python extract_format_dataset.py --chembl_version 36 --output_dir ./chembl_36/
```

Activities in ChEMBL that meet all of the following requirements are extracted:

- activities.standard_units = 'nM'

- activities.standard_type IN ('EC50', 'IC50', 'Ki', 'Kd', 'XC50', 'AC50', 'Potency')

- activities.data_validity_comment IS NULL

- activities.standard_relation IN ('=', '<')

- activities.potential_duplicate = 0 AND assays.confidence_score >= 8

- target_dictionary.target_type = 'SINGLE PROTEIN'
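
The requirements above can be read as a single predicate over one joined activity record. The following is a minimal sketch under that reading, not the code in `extract_format_dataset.py`: the function name `passes_filters` and the flat dictionary layout are assumptions made for illustration, with keys that mirror the column names listed above.

```python
ALLOWED_TYPES = {"EC50", "IC50", "Ki", "Kd", "XC50", "AC50", "Potency"}
ALLOWED_RELATIONS = {"=", "<"}


def passes_filters(row):
    """Return True if a joined activity record meets every listed requirement.

    `row` is a hypothetical flat dict combining columns from the activities,
    assays and target_dictionary tables.
    """
    return (
        row["standard_units"] == "nM"
        and row["standard_type"] in ALLOWED_TYPES
        and row["data_validity_comment"] is None
        and row["standard_relation"] in ALLOWED_RELATIONS
        and row["potential_duplicate"] == 0
        and row["confidence_score"] >= 8
        and row["target_type"] == "SINGLE PROTEIN"
    )


# Illustrative record (made-up values) that satisfies every requirement
example = {
    "standard_units": "nM",
    "standard_type": "IC50",
    "data_validity_comment": None,
    "standard_relation": "=",
    "potential_duplicate": 0,
    "confidence_score": 9,
    "target_type": "SINGLE PROTEIN",
}
print(passes_filters(example))
```

In practice these conditions would more likely live in the `WHERE` clause of a SQL query against the ChEMBL database; the Python form is only meant to make the logic explicit.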

Targets are kept only if they have:

- at least 100 active and 100 inactive compounds

- mentions in at least 2 publications
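
As a rough illustration of the target selection step, the sketch below counts distinct active compounds, inactive compounds and source documents per target. It assumes that "publications" means distinct document identifiers attached to a target's activity records; the function name `select_targets` and the tuple layout are hypothetical and not taken from the repository.

```python
from collections import defaultdict

MIN_ACTIVES = 100
MIN_INACTIVES = 100
MIN_PUBLICATIONS = 2


def select_targets(records):
    """Return the sorted IDs of targets that meet all three minimums.

    `records` is an iterable of hypothetical
    (target_id, compound_id, is_active, doc_id) tuples.
    """
    actives = defaultdict(set)
    inactives = defaultdict(set)
    docs = defaultdict(set)
    for target_id, compound_id, is_active, doc_id in records:
        bucket = actives if is_active else inactives
        bucket[target_id].add(compound_id)
        docs[target_id].add(doc_id)
    return sorted(
        t for t in docs
        if len(actives[t]) >= MIN_ACTIVES
        and len(inactives[t]) >= MIN_INACTIVES
        and len(docs[t]) >= MIN_PUBLICATIONS
    )
```

Sets are used so that a compound measured several times against the same target is only counted once.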

Compounds are labelled active or inactive using the [IDG protein family activity thresholds](https://druggablegenome.net/IDGProteinFamilies):

- Kinases: <= 30 nM

- GPCRs: <= 100 nM

- Nuclear Receptors: <= 100 nM

- Ion Channels: <= 10 μM

- Non-IDG Family Targets: <= 1 μM
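
The thresholds above amount to a lookup from protein family to a cutoff, with everything converted to nM. The sketch below shows one possible encoding; the family keys (`"kinase"`, `"gpcr"` and so on) and the function name `is_active` are assumptions for illustration, not identifiers from the repository.

```python
# Activity cutoffs in nM (10 uM = 10,000 nM; 1 uM = 1,000 nM)
THRESHOLDS_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,
}
# Cutoff applied to any target outside the four IDG families
DEFAULT_THRESHOLD_NM = 1_000.0


def is_active(value_nm, family=None):
    """Label a measurement (in nM) as active for the given protein family."""
    return value_nm <= THRESHOLDS_NM.get(family, DEFAULT_THRESHOLD_NM)


print(is_active(30.0, "kinase"))        # at the kinase cutoff
print(is_active(5_000.0, "ion_channel"))
print(is_active(5_000.0))               # falls back to the 1 uM cutoff
```

Keeping every cutoff in nM matches the `standard_units = 'nM'` requirement, so no unit conversion is needed at labelling time.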

When multiple measurements are found for the same compound-target pair, the one with the lowest concentration is selected. This intentionally biases the model toward sensitivity.
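
A minimal sketch of this selection rule, assuming measurements arrive as hypothetical `(compound_id, target_id, value_nm)` tuples; the function name `most_potent` is illustrative and not taken from the repository.

```python
def most_potent(measurements):
    """Keep the lowest concentration (in nM) for each compound-target pair."""
    best = {}
    for compound_id, target_id, value_nm in measurements:
        key = (compound_id, target_id)
        if key not in best or value_nm < best[key]:
            best[key] = value_nm
    return best


# Made-up values: cmpd_1 has two measurements against target_A
measurements = [
    ("cmpd_1", "target_A", 250.0),
    ("cmpd_1", "target_A", 40.0),
    ("cmpd_2", "target_A", 900.0),
]
print(most_potent(measurements))
```

Because the minimum is taken before any threshold is applied, a pair ends up labelled active whenever at least one of its measurements is at or below the cutoff.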

# Model training

```bash
python train_chembl_multitask.py --chembl_version 36 --data_file ./chembl_36/mt_data_36_all.h5 --output_dir ./chembl_36/
```

# Extract kinase data and train a kinase-specific model

```bash
python extract_format_dataset.py --chembl_version 36 --protein_family kinase --output_dir ./kinase/ && \
python train_chembl_multitask.py --chembl_version 36 --data_file ./kinase/mt_data_36_kinase.h5 --output_dir ./kinase/
```

# Example to predict in Python using the ONNX Runtime

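```Python
import onnxruntime
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors

FP_SIZE = 1024
RADIUS = 2

def calc_morgan_fp(smiles):
    # Morgan fingerprint (radius 2, 1024 bits) as a float32 vector
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
        mol, RADIUS, nBits=FP_SIZE)
    a = np.zeros((0,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, a)
    return a

def format_preds(preds, targets):
    # pair each target ID with its predicted score
    preds = np.concatenate(preds).ravel()
    np_preds = [(tar, pre) for tar, pre in zip(targets, preds)]
    dt = [('chembl_id', '|U20'), ('pred', '<f4')]
    # assumed completion (the '<f4' dtype and the lines below are not in this copy):
    # build a structured array sorted by descending score
    np_preds = np.array(np_preds, dtype=dt)
    return np.sort(np_preds, order='pred')[::-1]

# ...
```

# Example to predict in Julia

```julia
# ...

# "nBits" key and Dict constructor reconstructed (assumed from RDKit MinimalLib fingerprint options)
fp_details = Dict{String, Any}("nBits" => 1024, "radius" => 2)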

mfp = get_morgan_fp(mol, fp_details)

# convert the bitstring to a 1024×1 Matrix{Float32}
mfp = map(x -> parse(Float32, string(x)), collect(mfp))
mfp = reshape(mfp, (length(mfp), 1))

# test a molecule
pred = play!(mt_chembl, mfp)
pred = collect(Iterators.flatten(pred))

# pair each target with its prediction and sort by descending score
res = tuple.(targets, pred)
res = sort(res, by=res->res[2], rev=true)
```

# C++ REST microservice

https://github.com/eloyfelix/pistache_predictor
