Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/PatWalters/rd_filters
A script to run structural alerts using the RDKit and ChEMBL
https://github.com/PatWalters/rd_filters
Last synced: 3 months ago
JSON representation
A script to run structural alerts using the RDKit and ChEMBL
- Host: GitHub
- URL: https://github.com/PatWalters/rd_filters
- Owner: PatWalters
- License: mit
- Created: 2018-08-08T01:29:21.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-08-01T19:18:03.000Z (over 1 year ago)
- Last Synced: 2024-05-06T13:32:40.807Z (6 months ago)
- Language: Python
- Size: 549 KB
- Stars: 119
- Watchers: 15
- Forks: 35
- Open Issues: 6
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-small-molecule-ml - rd_filters
README
# rd_filters
This script provides a simple means of applying the
functional group filters from the ChEMBL database, as well as a number of
property filters from the RDKit, to a set
of compounds. As of ChEMBL 23, the database table **structural_alerts**
contains 8 sets of alerts. ChEMBL doesn't apper to have much in the way of
documentation on the different alert sets.|Rule Set |Number of Alerts|
|---|---:|
|BMS| 180|
|Dundee| 105|
|Glaxo| 55|
|Inpharmatica| 91|
|LINT| 57|
|MLSMR| 116|
|[PAINS](https://pubs.acs.org/doi/abs/10.1021/jm901137j)| 479|
|SureChEMBL| 166|The SMARTS patterns in a number of these alerts were not compatible with the RDKit
so I edited them. A complete list of the changes I made is in the file **Notes.txt**.## Prerequisite
* At least Python 3.6
* The RDKit, you can find installation instructions [here](https://www.rdkit.org/docs/Install.html). I'd recommend the conda route.## Installation
### Directly install from github
`pip install git+https://github.com/PatWalters/rd_filters.git`
### Local install
``` shell
git clone https://github.com/PatWalters/rd_filters
cd rd_filters
pip install .
```## Usage
The script needs 2 files to operate.
* alert_collection.csv - the set of structural alerts
* rules.json - the configuration fileThe script uses the following logic to find alert_collection.csv and rules.json.
1. Use locations specified by the "--alert" (for alerts.csv) and "--rules" (for rules.json) command line arguments.
2. Look in the current directory.
3. Look in the directory pointed to by the FILTER_RULES_DATA environment variable.I'll provide some examples below to illustrate.
That's it, at this point you should be good to go.
### Configuration files
The file **alert_collection.csv** contains alerts. You shouldn't have to mess with this unless you
want to add your own structural alerts. I think the format is pretty obvious.The file **rules.json** controls which filters and alerts are used. You can use the command
below to generate a **rules.json** with the default settings.`rd_filters template --out rules.json`
The **rules.json** file looks like this. The values for the properties are the maximum and minimum
allowed (inclusive). To set which structural alerts are used, set **true** and **false**. You can
use multiple alert sets.
Just edit the file with your [favorite text editor](https://www.gnu.org/software/emacs/).```javascript
{
"HBA": [
0,
10
],
"HBD": [
0,
5
],
"LogP": [
-5,
5
],
"MW": [
0,
500
],
"Rule_BMS": false,
"Rule_Dundee": false,
"Rule_Glaxo": false,
"Rule_Inpharmatica": true,
"Rule_LINT": false,
"Rule_MLSMR": false,
"Rule_PAINS": false,
"Rule_SureChEMBL": false,
"TPSA": [
0,
200
]
}
```#### Examples
First off, you're going to want to copy **alert_collection.csv** and
**rules.json** to a directory and set the FILTER_RULES_DATA environment
variable to point to that directory. If you are using a bash-ish shell
and the files are in /home/elvis/data that would be:`export FILTER_RULES_DATA=/home/elvis/data`
If you type
`rd_filters -h`
you'll see this:
``` shell
Usage:
rd_filters filter --in INPUT_FILE --prefix PREFIX [--rules RULES_FILE_NAME] [--alerts ALERT_FILE_NAME][--np NUM_CORES]
rd_filters template --out TEMPLATE_FILE [--rules RULES_FILE_NAME]Options:
--in INPUT_FILE input file name
--prefix PREFIX prefix for output file names
--rules RULES_FILE_NAME name of the rules JSON file
--alerts ALERTS_FILE_NAME name of the structural alerts file
--np NUM_CORES the number of cpu cores to use (default is all)
--out TEMPLATE_FILE parameter template file name
"""
```The basic operation is pretty simple. If I want to filter a file called test.smi and
I want my output files to start with "out", I could do something like this:`rd_filters filter --in test.smi --prefix out`
This will create 2 files
* **out.smi** - contains the SMILES strings and molecule names for all of the compounds
passing the filters
* **out.csv** - contains calculated property values and a listing alerts triggered
by a moleculeBy default, this script runs in parallel and uses all available processors. To
change this value, use the --np flag.`rd_filters filter --in test.smi --prefix out --np 4`
As mentioned above, alternate rules files or alerts files can be specified on
the command line.``` shell
rd_filters filter --in test.smi --prefix out --rules myrules.json
rd_filters filter --in test.smi --prefix out --alerts myalerts.csv
rd_filters filter --in test.smi --prefix out --rules myrules.json --alerts myalerts.csv
```A new default rules template file can be generated using the **template** option.
`rdfilters.py template --out myrules.json`
As always please let me know if you have questions, comments, etc.
Pat Walters, August 2018