Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/adam-maz/fingerprint-based_tool_for_small_drug_screening
Here I provide a small drug screening toolkit based on RandomForestClassifier
- Host: GitHub
- URL: https://github.com/adam-maz/fingerprint-based_tool_for_small_drug_screening
- Owner: Adam-maz
- Created: 2024-12-14T10:52:24.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-15T12:11:26.000Z (about 1 month ago)
- Last Synced: 2024-12-15T12:23:16.754Z (about 1 month ago)
- Topics: binary-classification, chembl, fingerprints, ipython-notebook, ligand-based-drug-design, machine-learning, medicinal-chemistry, molecular-screening, random-forest-classifier
- Language: Jupyter Notebook
- Homepage:
- Size: 1.07 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Description
This toolkit introduces a machine learning pipeline designed for small drug screening. It includes:
1. A Jupyter Notebook for data parsing and preprocessing, model evaluation, optimization, and saving.
2. An Object-Oriented Programming (OOP) script (`ml_launcher.py`) that uses the trained estimator to make predictions locally on the user’s computer. Users can also modify the notebook to run predictions within a Google Colab environment if desired.

The model is a binary classifier that predicts bioactivity, returning:
- **1** for active compounds (Ki ≤ 50 nM).
- **0** for inactive compounds (Ki > 50 nM).

The example dataset focuses on the **5-HT7 receptor**. It demonstrates the workflow, which can be adapted to predict bioactivity for any biological target. Notably, the 5-HT7 dataset is small (177 molecules after preprocessing) and exhibits some class imbalance (active vs. inactive), so predictions may be suboptimal. This toolkit should therefore be treated as an example of a fingerprint-based binary classifier. For other biological targets, different models might perform better, so it is crucial to evaluate and select the most suitable estimator for each target.
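The labeling rule above (Ki ≤ 50 nM is active, otherwise inactive) can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's actual preprocessing code: the function name `label_activity`, the constant `KI_THRESHOLD_NM`, and the Ki values are all hypothetical. Counting the labels afterwards is one simple way to see how imbalanced the two classes are.

```python
from collections import Counter

KI_THRESHOLD_NM = 50.0  # activity cutoff stated in this README (hypothetical constant name)


def label_activity(ki_nm: float) -> int:
    """Return 1 (active) if Ki <= 50 nM, otherwise 0 (inactive)."""
    return 1 if ki_nm <= KI_THRESHOLD_NM else 0


# Hypothetical Ki values in nM, made up for illustration only
example_ki_values = [3.2, 50.0, 51.0, 870.0, 12.5]

# Convert each Ki value into a binary activity label
labels = [label_activity(ki) for ki in example_ki_values]

# Count how many compounds fall into each class to inspect class balance
class_counts = Counter(labels)

print(labels)              # [1, 1, 0, 0, 1]
print(dict(class_counts))  # {1: 3, 0: 2}
```

Note that a compound with Ki of exactly 50 nM is labeled active, matching the "≤" in the rule.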
---
# Instructions
To run the `ml_launcher.py` script, ensure the following files are in the same directory:
- `ml_launcher.py` (script for making predictions).
- `best_rfc_model.joblib` (the trained estimator).
- `example.csv` (the file containing molecules to be analyzed).

## Requirements Installation
If necessary, install the required Python packages using the following command:
```bash
pip install pandas numpy rdkit joblib
```

Once the required packages are installed, you can run the script in your local environment.
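Before launching the script, a small pre-flight check can confirm that the three files and the four packages listed above are in place. The sketch below uses only the standard library and is not part of the repository: the helper names `missing_files` and `missing_packages` are hypothetical, and it assumes each package is imported under the same name used in the `pip install` command.

```python
from importlib.util import find_spec
from pathlib import Path

# File and package names taken from the instructions in this README
REQUIRED_FILES = ["ml_launcher.py", "best_rfc_model.joblib", "example.csv"]
REQUIRED_PACKAGES = ["pandas", "numpy", "rdkit", "joblib"]


def missing_files(directory, names=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the package names that cannot be found by the import system."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Check the current working directory and the active Python environment
    print("Missing files:", missing_files("."))
    print("Missing packages:", missing_packages())
```

If both lists come back empty, the directory and environment match the setup described here.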
---
# List of Files
1. **5ht7_IC50.csv** - raw 5-HT7 IC50 dataset from ChEMBL.
2. **5ht7_Ki.csv** - raw 5-HT7 Ki dataset from ChEMBL.
3. **ml_notebook.ipynb** - Jupyter Notebook containing the model definition and training pipeline.
4. **best_rfc_model.joblib** - trained and saved estimator.
5. **example.csv** - example file with molecules for prediction.
6. **ml_launcher.py** - Python script for running predictions.
7. **fps_esti_schema.png** - scheme.