https://github.com/brsynth/biorgroup
Systematic expansion of R-group for ChEBI molecules from the RHEA database
- Host: GitHub
- URL: https://github.com/brsynth/biorgroup
- Owner: brsynth
- License: mit
- Created: 2025-06-13T08:40:47.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-09-03T06:58:59.000Z (9 months ago)
- Last Synced: 2025-09-09T20:14:05.898Z (9 months ago)
- Topics: chebi, r-group, rhea
- Language: Jupyter Notebook
- Homepage:
- Size: 421 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# BioRGroup dataset

[https://doi.org/10.57745/V3URYA](https://doi.org/10.57745/V3URYA)
## Installation
```sh
conda env create --file recipes/worklow.yaml --name biorgroup
pip install --no-deps -e .
```
## Build dataset
### 1 - Download PubChem
```sh
python -m biorgroup.pubchem.download \
--output-pubchem-dir \
--output-pubchem-db
```
### 2 - Download Rhea
```sh
python -m biorgroup.rhea.download \
--output-rhea-dir \
--parameter-release-int
```
### 3 - R-group search
```sh
snakemake \
-p \
-j 48 \
-c 48 \
--workflow-profile template/biorgroup \
-s ./src/biorgroup/rgroup/Snakefile \
--use-conda \
--latency-wait 5 \
--rerun-incomplete \
--config input_depot_str=./src/biorgroup/rgroup input_chebi_csv=rhea-chebi-smiles.csv input_pubchem_db=pubchem.db output_dir_str=chebi parameter_search_timeout_int=10
```
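The `--config` line above packs five `key=value` pairs into one long argument list, which is easy to mistype. As a minimal sketch (not part of the repository), the same command can be assembled in Python from a dictionary; the helper name `snakemake_command` is hypothetical, and every flag and value is copied from the command above.

```python
import shlex
from typing import Dict, List


def snakemake_command(config: Dict[str, object], cores: int = 48) -> List[str]:
    """Return the argv list for the R-group search workflow shown above."""
    return [
        "snakemake", "-p",
        "-j", str(cores),
        "-c", str(cores),
        "--workflow-profile", "template/biorgroup",
        "-s", "./src/biorgroup/rgroup/Snakefile",
        "--use-conda",
        "--latency-wait", "5",
        "--rerun-incomplete",
        # Snakemake reads --config values as space-separated key=value pairs.
        "--config", *[f"{key}={value}" for key, value in config.items()],
    ]


config = {
    "input_depot_str": "./src/biorgroup/rgroup",
    "input_chebi_csv": "rhea-chebi-smiles.csv",
    "input_pubchem_db": "pubchem.db",
    "output_dir_str": "chebi",
    "parameter_search_timeout_int": 10,
}
command = snakemake_command(config)

# Print the shell-quoted command for inspection; it is not executed here.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the inputs from steps 1 and 2 are in place.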
## Dataset overview
The Snakemake workflow produces a `csv.gz` file containing the following columns:

| column name | type |
| --- | --- |
| smiles_rhea | `str` |
| chebi | `List[str]` |
| num_heavy_atoms | `int` |
| exact_mol_wt | `float` |
| core_superstructure_smiles | `List[str]` |
| core_superstructure_pubchem_cid | `List[List[str]]` |
| rgroup_extended_smiles | `List[str]` |
| rgroup_extended_pubchem_cid | `List[List[str]]` |
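To illustrate how the column types in the table map onto a loaded file, here is a minimal sketch using only the Python standard library. It rests on two assumptions not stated in this README: the list-typed columns are serialized as Python-literal text (for example `['CHEBI:15377']`), and the function name `read_biorgroup` is hypothetical. If the workflow serializes lists differently, the parsing step needs to be adapted.

```python
import ast
import csv
import gzip
from typing import Any, Dict, Iterator

# Columns documented as List[...] in the table above.
LIST_COLUMNS = (
    "chebi",
    "core_superstructure_smiles",
    "core_superstructure_pubchem_cid",
    "rgroup_extended_smiles",
    "rgroup_extended_pubchem_cid",
)


def read_biorgroup(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row, converting columns to the documented types."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            row["num_heavy_atoms"] = int(row["num_heavy_atoms"])
            row["exact_mol_wt"] = float(row["exact_mol_wt"])
            for name in LIST_COLUMNS:
                # Assumption: list cells hold Python-literal text;
                # an empty cell is treated as an empty list.
                row[name] = ast.literal_eval(row[name]) if row[name] else []
            yield row
```

A typical use would be `for row in read_biorgroup("path/to/output.csv.gz"): ...`, where the path is a placeholder for the file written under the configured output directory.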