Systematic expansion of R-group for ChEBI molecules from the RHEA database
https://github.com/brsynth/biorgroup

chebi r-group rhea


# BioRGroup dataset

![BioRGroup Logo](.github/docs/logo.jpg)
[https://doi.org/10.57745/V3URYA](https://doi.org/10.57745/V3URYA)

## Installation

```sh
conda env create --file recipes/worklow.yaml --name biorgroup
pip install --no-deps -e .
```

## Build dataset

### 1 - Download PubChem
```sh
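# <PUBCHEM_DIR> and <PUBCHEM_DB> are placeholders for your own output paths
python -m biorgroup.pubchem.download \
--output-pubchem-dir <PUBCHEM_DIR> \
--output-pubchem-db <PUBCHEM_DB>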
```

### 2 - Download Rhea
```sh
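# <RHEA_DIR> is a placeholder output path; <RELEASE> is the Rhea release number (integer)
python -m biorgroup.rhea.download \
--output-rhea-dir <RHEA_DIR> \
--parameter-release-int <RELEASE>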
```

### 3 - R-group search
```sh
snakemake \
-p \
-j 48 \
-c 48 \
--workflow-profile template/biorgroup \
-s ./src/biorgroup/rgroup/Snakefile \
--use-conda \
--latency-wait 5 \
--rerun-incomplete \
--config input_depot_str=./src/biorgroup/rgroup input_chebi_csv=rhea-chebi-smiles.csv input_pubchem_db=pubchem.db output_dir_str=chebi parameter_search_timeout_int=10
```
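
The `--config` line above packs five `key=value` pairs into one long argument. As a minimal sketch, assuming you prefer to launch the workflow from Python, the same command can be assembled from a dictionary so that individual values are easier to change. The helper name `build_snakemake_command` is hypothetical and not part of the `biorgroup` package; the flags and values are copied from the command above.

```python
import shlex


def build_snakemake_command(config, cores=48):
    """Return the snakemake invocation shown above as an argument list.

    Hypothetical helper for illustration; it is not part of biorgroup.
    """
    command = [
        "snakemake",
        "-p",
        "-j", str(cores),
        "-c", str(cores),
        "--workflow-profile", "template/biorgroup",
        "-s", "./src/biorgroup/rgroup/Snakefile",
        "--use-conda",
        "--latency-wait", "5",
        "--rerun-incomplete",
        "--config",
    ]
    # Each config entry becomes one key=value token after --config.
    command += [f"{key}={value}" for key, value in config.items()]
    return command


# Values copied from the README command; adjust the paths to your setup.
config = {
    "input_depot_str": "./src/biorgroup/rgroup",
    "input_chebi_csv": "rhea-chebi-smiles.csv",
    "input_pubchem_db": "pubchem.db",
    "output_dir_str": "chebi",
    "parameter_search_timeout_int": 10,
}

# Print a shell-quoted version of the command for inspection.
print(shlex.join(build_snakemake_command(config)))
```

The returned list can be passed to `subprocess.run(..., check=True)` once the printed command looks right.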

## Dataset overview
The Snakemake workflow produces a gzip-compressed CSV file (`csv.gz`) with the following columns:

| column name | type |
| --- | --- |
| smiles_rhea | `str` |
| chebi | `List[str]` |
| num_heavy_atoms | `int` |
| exact_mol_wt | `float` |
| core_superstructure_smiles | `List[str]` |
| core_superstructure_pubchem_cid | `List[List[str]]` |
| rgroup_extended_smiles | `List[str]` |
| rgroup_extended_pubchem_cid | `List[List[str]]` |
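
The table above can be read back with the Python standard library. The following is a minimal sketch, not the project's own loader: the function name `load_biorgroup` is hypothetical, and it assumes that list-typed columns are serialized as Python-literal strings (for example `"['CHEBI:1']"`) and that no cell is empty. If the file uses a different encoding, such as JSON or a delimiter-separated string, the parsing step needs to change accordingly.

```python
import ast
import csv
import gzip

# Columns documented as list-typed in the table above.
LIST_COLUMNS = (
    "chebi",
    "core_superstructure_smiles",
    "core_superstructure_pubchem_cid",
    "rgroup_extended_smiles",
    "rgroup_extended_pubchem_cid",
)


def load_biorgroup(path):
    """Read the gzip-compressed CSV and convert fields to their documented types.

    Hypothetical helper for illustration; it is not part of biorgroup.
    """
    rows = []
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            record["num_heavy_atoms"] = int(record["num_heavy_atoms"])
            record["exact_mol_wt"] = float(record["exact_mol_wt"])
            for column in LIST_COLUMNS:
                # ASSUMPTION: lists are stored as Python-literal strings.
                record[column] = ast.literal_eval(record[column])
            rows.append(record)
    return rows
```

Each returned row is a dictionary keyed by column name, so `rows[0]["rgroup_extended_smiles"]` would give the list of extended SMILES for the first molecule under the assumptions stated above.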