{"id":20690303,"url":"https://github.com/merck/bgc-pipeline","last_synced_at":"2026-03-01T02:02:12.376Z","repository":{"id":48787507,"uuid":"318641688","full_name":"Merck/bgc-pipeline","owner":"Merck","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-10T13:41:11.000Z","size":24988,"stargazers_count":11,"open_issues_count":2,"forks_count":1,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-22T17:04:28.943Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Merck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-12-04T21:42:58.000Z","updated_at":"2025-01-15T09:55:17.000Z","dependencies_parsed_at":"2025-04-22T16:56:14.272Z","dependency_job_id":"1a4b62eb-2a07-4cc4-9729-a8d452612ccf","html_url":"https://github.com/Merck/bgc-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Merck/bgc-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fbgc-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fbgc-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fbgc-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fbgc-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Merck","download_url":"https://codeload.github.com/M
erck/bgc-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fbgc-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29958395,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T01:47:18.291Z","status":"online","status_checked_at":"2026-03-01T02:00:07.437Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T23:12:35.056Z","updated_at":"2026-03-01T02:02:12.363Z","avatar_url":"https://github.com/Merck.png","language":"Jupyter Notebook","readme":"### Note!\n\n**This repository provides data and examples that were used for development of DeepBGC and its evaluation with ClusterFinder and antiSMASH.**\n\n**See https://github.com/Merck/deepbgc for the DeepBGC tool.**\n\n### Note!\n\n# DeepBGC development \u0026 evaluation code\n\n## Reproducing data\n\nReproduction and storage of data files is managed using [DVC](https://github.com/iterative/dvc) (development version `0.22.0`). 
\nEach data file has a `.dvc` history file that contains the command that was used to generate the output along with MD5 hashes of its dependencies.\n\n## Installation\n\n- Install Python 3, ideally using conda\n- Run `pip install -r requirements.txt` to download DVC and other requirements\n\n## Downloading a file\n\n- Run the AWS config script to generate temporary AWS credentials in `~/.aws/credentials`:\n  - `generate-aws-config --account lab --insecure`\n- Run `dvc pull data/path/to/file.dvc` to download the required file.\n\n## High-level overview\n\n### Main folders\n\n- [bgc_detection/](bgc_detection/) all the code\n- [data/](data/) all the data\n    - [bacteria/](data/bacteria) 3k reference bacteria\n        - [candidates/](data/bacteria/candidates) novel detected BGC candidates\n    - [clusterfinder/](data/clusterfinder) ClusterFinder (Cimermancic et al.) datasets\n    - [evaluation/](data/evaluation) Cross-validation, Leave-Class-Out and Bootstrap evaluation\n    - [features/](data/features) Pfam2vec and other protein domain features\n    - [figures/](data/figures) Paper figures\n    - [mibig/](data/mibig) MIBiG BGC database samples\n    - [models/](data/models) Model configurations and trained models\n    - [pfam/](data/pfam) Pfam repository files\n    - [training/](data/training) Negative and positive training data and t-SNE visualizations\n- [notebooks/](notebooks/) Jupyter (IPython) notebooks\n\n\n### Training a model\n\n- Define a JSON config file, see [data/models/config](data/models/config) for reference.\n- Run [bgc_detection/run_training.py](bgc_detection/run_training.py) with the given config and the path to the training data. See DVC files in [data/models/trained](data/models/trained) for reference.\n- The trained model is output as a Python pickle file.\n\n### Predicting using trained model\n\n- Prepare a protein FASTA file, e.g. 
using [Prodigal](https://github.com/hyattpd/Prodigal) \n(see [data/bacteria/proteins.dvc](data/bacteria/proteins.dvc) for reference) or extract it from an annotated GenBank file using [bgc_detection/preprocessing/proteins2fasta.py](bgc_detection/preprocessing/proteins2fasta.py).\n- Detect protein domains using Hmmscan (see [data/bacteria/domtbl.dvc](data/bacteria/domtbl.dvc) for reference).\n- Convert the Hmmscan domtbl file into a Domain CSV file using [bgc_detection/preprocessing/domtbl2csv.py](bgc_detection/preprocessing/domtbl2csv.py) \n(see [data/bacteria/domains.dvc](data/bacteria/domains.dvc) for reference).\n- Predict BGC domain-level probability using [bgc_detection/run_prediction.py](bgc_detection/run_prediction.py) \n(see [data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc](data/bacteria/prediction/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k.dvc) for reference).\n- Threshold and merge domain-level predictions into a BGC candidate CSV file using \n[bgc_detection/candidates/threshold_candidates.py](bgc_detection/candidates/threshold_candidates.py) \n(see [data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc](data/bacteria/candidates/128lstm-100pfamdim-8pfamiter-posweighted-neg-10k-fpr2/candidates.csv.dvc) for reference).\n\n### Bootstrap validation on 9 fully annotated genomes\n\nSee [notebooks/LabelledContigBootstrap.ipynb](notebooks/LabelledContigBootstrap.ipynb).\n\n### Leave Class Out validation and Cross validation\n\nSee [data/evaluation/lco-neg-10k](data/evaluation/lco-neg-10k) (TODO).\n\nSee [data/evaluation/cv-10fold-neg-10k](data/evaluation/cv-10fold-neg-10k) (TODO).\n\n### Random Forest classification\n\nSee [notebooks/CandidateClassification.ipynb](notebooks/CandidateClassification.ipynb) and [notebooks/CandidateActivityClassification.ipynb](notebooks/CandidateActivityClassification.ipynb).\n\n### Novel BGC candidates generation\n\nSee 
[notebooks/NovelCandidates.ipynb](notebooks/NovelCandidates.ipynb).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmerck%2Fbgc-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmerck%2Fbgc-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmerck%2Fbgc-pipeline/lists"}