https://github.com/merck/bgc-pipeline
- Host: GitHub
- URL: https://github.com/merck/bgc-pipeline
- Owner: Merck
- License: mit
- Created: 2020-12-04T21:42:58.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2024-12-10T13:41:11.000Z (over 1 year ago)
- Last Synced: 2025-04-22T17:04:28.943Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 23.8 MB
- Stars: 11
- Watchers: 6
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
### Note!
**This repository provides data and examples that were used for development of DeepBGC and its evaluation with ClusterFinder and antiSMASH.**
**See https://github.com/Merck/deepbgc for the DeepBGC tool.**
# DeepBGC development & evaluation code
## Reproducing data
Reproduction and storage of data files is managed using [DVC](https://github.com/iterative/dvc) (development version `0.22.0`).
Each data file has a `.dvc` history file that contains the command that was used to generate the output along with md5 hashes of its dependencies.
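As a rough illustration of the hash bookkeeping described above, the sketch below computes a file's md5 digest in Python and compares it with a value copied by hand from a history file. The names `file_md5` and `matches_recorded_hash` are hypothetical and not part of this repository or of DVC. It also assumes a plain byte-level md5, which may not match what DVC records if DVC normalizes the file before hashing.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of the raw bytes of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Compare a file's md5 with a hash copied by hand from a history file."""
    return file_md5(path) == recorded_md5.strip().lower()
```

A mismatch does not necessarily mean the file is wrong; in practice `dvc status` is the authoritative way to check whether a dependency has changed.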
## Installation
- Install Python 3, ideally using conda
- Run `pip install -r requirements.txt` to download DVC and other requirements
## Downloading a file
- Run the AWS config script to generate temporary AWS credentials in `~/.aws/credentials`:
    - `generate-aws-config --account lab --insecure`
- Run `dvc pull data/path/to/file.dvc` to download the required file.
## High-level overview
### Main folders
- [bgc_detection/](bgc_detection/) all the code
- [data/](data/) all the data
    - [bacteria/](data/bacteria) 3k reference bacteria
        - [candidates/](data/bacteria/candidates) novel detected BGC candidates
    - [clusterfinder/](data/clusterfinder) ClusterFinder (Cimermancic et al.) datasets
    - [evaluation/](data/evaluation) Cross-validation, Leave-Class-Out and Bootstrap evaluation
    - [features/](data/features) Pfam2vec and other protein domain features
    - [figures/](data/figures) Paper figures
    - [mibig/](data/mibig) MIBiG BGC database samples
    - [models/](data/models) Model configurations and trained models
    - [pfam/](data/pfam) Pfam repository files
    - [training/](data/training) Negative and positive training data and t-SNE visualizations
- [notebooks/](notebooks/) Jupyter (IPython) notebooks
### Training a model
- Define a JSON config file, see [data/models/config](data/models/config) for reference.
- Run [bgc_detection/run_training.py](bgc_detection/run_training.py) with the given config and the path to the training data. See the DVC files in [data/models/trained](data/models/trained) for reference.
- The trained model is saved as a Python pickle file.
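Since the trained model is stored as a pickle file, it can be read back with Python's standard `pickle` module. The sketch below is a minimal example; the helper name `load_trained_model` is hypothetical, and the interface of the loaded object is not documented here. Unpickling generally requires the classes that define the model to be importable, so the assumption is that it is run in an environment where this repository's code and its requirements are available.

```python
import pickle


def load_trained_model(path):
    """Load a pickled model object from disk.

    Only unpickle files from a trusted source: pickle can run arbitrary
    code while loading.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)
```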
### Predicting using trained model
- Prepare a protein FASTA file, e.g. using [Prodigal](https://github.com/hyattpd/Prodigal)
(see [data/bacteria/proteins.dvc](data/bacteria/proteins.dvc) for reference) or extract it from an annotated GenBank file using [bgc_detection/preprocessing/proteins2fasta.py](bgc_detection/preprocessing/proteins2fasta.py).
- Detect protein domains using Hmmscan (see [data/bacteria/domtbl.dvc](data/bacteria/domtbl.dvc) for reference)
- Convert the Hmmscan domtbl file into a Domain CSV file using [bgc_detection/preprocessing/domtbl2csv.py](bgc_detection/preprocessing/domtbl2csv.py)
(see [data/bacteria/domains.dvc](data/bacteria/domains.dvc) for reference)
- Predict BGC domain-level probability using [bgc_detection/run_prediction.py](bgc_detection/run_prediction.py)
(see [data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc](data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc) for reference)
- Threshold and merge domain-level predictions into a BGC candidate CSV file using
[bgc_detection/candidates/threshold_candidates.py](bgc_detection/candidates/threshold_candidates.py)
(see [data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc](data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc) for reference)
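
The last step above, thresholding and merging domain-level predictions, can be sketched roughly as follows. This is not the logic of `threshold_candidates.py`, only an illustration of the general idea: consecutive domains on the same contig whose probability reaches the threshold are merged into one candidate. The function name `merge_candidates`, the `(contig_id, protein_id, probability)` row layout, the output fields and the default threshold of `0.5` are all assumptions made for this example.

```python
from itertools import groupby


def merge_candidates(rows, threshold=0.5):
    """Merge consecutive above-threshold domains into candidate regions.

    rows: iterable of (contig_id, protein_id, probability) tuples, ordered
    by position within each contig (hypothetical layout).
    Returns a list of dicts, one per candidate.
    """
    candidates = []
    # Consecutive rows form one group while they stay on the same contig
    # and on the same side of the threshold.
    keyed = groupby(rows, key=lambda row: (row[0], row[2] >= threshold))
    for (contig_id, above), group in keyed:
        if not above:
            continue
        group = list(group)
        probabilities = [prob for _, _, prob in group]
        candidates.append({
            "contig_id": contig_id,
            "first_protein": group[0][1],
            "last_protein": group[-1][1],
            "num_domains": len(group),
            "mean_probability": sum(probabilities) / len(probabilities),
        })
    return candidates
```

For example, rows with probabilities `0.1, 0.9, 0.8, 0.2` on one contig yield a single candidate spanning the second and third rows. A real pipeline would likely also carry genomic coordinates and might merge regions separated by short gaps, neither of which is handled here.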
### Bootstrap validation on 9 fully annotated genomes
See [notebooks/LabelledContigBootstrap.ipynb](notebooks/LabelledContigBootstrap.ipynb).
### Leave-Class-Out validation and cross-validation
See [data/evaluation/lco-neg-10k](data/evaluation/lco-neg-10k) (TODO).
See [data/evaluation/cv-10fold-neg-10k](data/evaluation/cv-10fold-neg-10k) (TODO).
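
For readers unfamiliar with the term, the general idea of Leave-Class-Out validation can be sketched as below: for each class, every sample of that class is held out for testing and the model is trained on the rest. This is a generic illustration and not the evaluation code of this repository; the function name `leave_class_out_splits` and the `(sample_id, class_label)` layout are hypothetical, and the handling of negative samples is not covered.

```python
def leave_class_out_splits(samples):
    """Yield (held_out_class, train_ids, test_ids) for each class label.

    samples: list of (sample_id, class_label) pairs (hypothetical layout).
    """
    classes = sorted({label for _, label in samples})
    for held_out in classes:
        # Train on every sample outside the held-out class, test on the rest.
        train_ids = [sid for sid, label in samples if label != held_out]
        test_ids = [sid for sid, label in samples if label == held_out]
        yield held_out, train_ids, test_ids
```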
### Random Forest classification
See [notebooks/CandidateClassification.ipynb](notebooks/CandidateClassification.ipynb) and [notebooks/CandidateActivityClassification.ipynb](notebooks/CandidateActivityClassification.ipynb).
### Novel BGC candidates generation
See [notebooks/NovelCandidates.ipynb](notebooks/NovelCandidates.ipynb).