https://github.com/adam-maz/fingerprint-based_tool_for_small_drug_screening

Here I provide a small drug screening toolkit based on RandomForestClassifier
Host: GitHub
URL: https://github.com/adam-maz/fingerprint-based_tool_for_small_drug_screening
Owner: Adam-maz
Created: 2024-12-14T10:52:24.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2024-12-15T12:11:26.000Z (about 1 month ago)
Last Synced: 2024-12-15T12:23:16.754Z (about 1 month ago)
Topics: binary-classification, chembl, fingerprints, ipython-notebook, ligand-based-drug-design, machine-learning, medicinal-chemistry, molecular-screening, random-forest-classifier
Language: Jupyter Notebook
Homepage:
Size: 1.07 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Description

This toolkit provides a machine learning pipeline for small-molecule drug screening. It includes:

1. A Jupyter Notebook for data parsing and preprocessing, model evaluation, optimization, and saving.
2. An Object-Oriented Programming (OOP) script (`ml_launcher.py`) that uses the trained estimator to make predictions locally on the user’s computer. Users can also modify the notebook to run predictions within a Google Colab environment if desired.

The model is a binary classifier that predicts bioactivity, returning:
- **1** for active compounds (Ki ≤ 50 nM).
- **0** for inactive compounds (Ki > 50 nM).
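
The labeling rule above can be sketched in a few lines. This is only an illustration of the 50 nM threshold, not the notebook's actual preprocessing code; the function name `label_activity` and the example Ki values are hypothetical.

```python
# Ki cutoff (in nM) taken from the description above.
ACTIVITY_THRESHOLD_NM = 50.0


def label_activity(ki_nm: float, threshold_nm: float = ACTIVITY_THRESHOLD_NM) -> int:
    """Return 1 (active) if Ki <= threshold, otherwise 0 (inactive)."""
    return 1 if ki_nm <= threshold_nm else 0


# Hypothetical Ki values in nM, for illustration only.
example_ki = [3.2, 50.0, 50.1, 1200.0]
labels = [label_activity(ki) for ki in example_ki]
print(labels)  # [1, 1, 0, 0]
```

Note that a compound with Ki of exactly 50 nM falls in the active class, because the threshold is inclusive.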

The example dataset targets the **5-HT7 receptor**. It demonstrates the workflow, which can be adapted to predict bioactivity for any biological target. Note that the 5-HT7 dataset is small (177 molecules after preprocessing) and somewhat imbalanced between the active and inactive classes, so predictions may be suboptimal. The toolkit should therefore be read as an example of a fingerprint-based binary classifier. For other biological targets, different models might perform better, so it is crucial to evaluate and select the most suitable estimator for each target.
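
Because the dataset is small and imbalanced, plain accuracy can look better than the model really is. The sketch below shows one way to inspect the class balance and compute balanced accuracy (the mean of per-class recall) using only the standard library. The helper names and the toy labels are hypothetical, and this is not necessarily the metric used in the notebook.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels for illustration only: six actives, two inactives.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

print(class_balance(y_true))              # {0: 0.25, 1: 0.75}
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In this toy case plain accuracy is 7/8 = 0.875, while balanced accuracy is 0.75, because half of the minority class is misclassified.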

---

# Instructions

To run the `ml_launcher.py` script, ensure the following files are in the same directory:
- `ml_launcher.py` (script for making predictions).
- `best_rfc_model.joblib` (the trained estimator).
- `example.csv` (the file containing molecules to be analyzed).
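
Before launching the script, a quick check like the following can confirm that the three files listed above sit in the same directory. This is a minimal sketch and not part of the repository; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("ml_launcher.py", "best_rfc_model.joblib", "example.csv")


def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files found.")
```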

## Requirements Installation

If necessary, install the required Python packages using the following command:

```bash
pip install pandas numpy rdkit joblib
```
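
To see whether installation is needed at all, the following sketch checks which of the four packages can already be imported. It assumes the import names match the package names used in the install command; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Import names assumed to match the pip package names above.
REQUIRED_PACKAGES = ("pandas", "numpy", "rdkit", "joblib")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install first:", " ".join(missing))
else:
    print("All required packages are importable.")
```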

Once the required packages are installed, you can run the script in your local environment.

---

# List of Files

1. **5ht7_IC50.csv** - raw dataset from ChEMBL.
2. **5ht7_Ki.csv** - raw dataset from ChEMBL.
3. **ml_notebook.ipynb** - Jupyter Notebook containing the model definition and training pipeline.
4. **best_rfc_model.joblib** - trained and saved estimator.
5. **example.csv** - example file with molecules for prediction.
6. **ml_launcher.py** - Python script for running predictions.
7. **fps_esti_schema.png** - schematic diagram.