{"id":20690228,"url":"https://github.com/merck/deepbgc","last_synced_at":"2025-04-07T05:09:26.032Z","repository":{"id":34297015,"uuid":"162141645","full_name":"Merck/deepbgc","owner":"Merck","description":"BGC Detection and Classification Using Deep Learning","archived":false,"fork":false,"pushed_at":"2023-11-11T12:48:56.000Z","size":26579,"stargazers_count":140,"open_issues_count":34,"forks_count":27,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-31T04:07:46.450Z","etag":null,"topics":["bidirectional-lstm","biosynthetic-gene-clusters","deep-learning","deepbgc","natural-products","pfam2vec","python","synthetic-biology"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1093/nar/gkz654","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Merck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-17T14:19:43.000Z","updated_at":"2025-03-28T09:03:26.000Z","dependencies_parsed_at":"2023-01-15T06:15:37.563Z","dependency_job_id":"eaeb71f7-b88a-4902-982b-0ce156ce0da7","html_url":"https://github.com/Merck/deepbgc","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fdeepbgc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fdeepbgc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fdeepbgc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Merck%2Fdeepbgc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Merck","download_url":"https://codeload.github.com/Merck/deepbgc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247595334,"owners_count":20963943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bidirectional-lstm","biosynthetic-gene-clusters","deep-learning","deepbgc","natural-products","pfam2vec","python","synthetic-biology"],"created_at":"2024-11-16T23:12:19.571Z","updated_at":"2025-04-07T05:09:26.014Z","avatar_url":"https://github.com/Merck.png","language":"Jupyter Notebook","readme":"# DeepBGC: Biosynthetic Gene Cluster detection and classification\n\nDeepBGC detects BGCs in bacterial and fungal genomes using deep learning. \nDeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network \nand a word2vec-like vector embedding of Pfam protein domains. 
Product class and activity of detected BGCs are predicted using a Random Forest classifier.

[![BioConda Install](https://img.shields.io/conda/dn/bioconda/deepbgc.svg?style=flat&label=BioConda%20install&color=green)](https://anaconda.org/bioconda/deepbgc)
![PyPI - Downloads](https://img.shields.io/pypi/dm/deepbgc.svg?color=green&label=PyPI%20downloads)
[![PyPI license](https://img.shields.io/pypi/l/deepbgc.svg)](https://pypi.python.org/pypi/deepbgc/)
[![PyPI version](https://badge.fury.io/py/deepbgc.svg)](https://badge.fury.io/py/deepbgc)
[![CI](https://api.travis-ci.org/Merck/deepbgc.svg?branch=master)](https://travis-ci.org/Merck/deepbgc)

![DeepBGC architecture](images/deepbgc.architecture.png?raw=true "DeepBGC architecture")

## 📌 News 📌

- **DeepBGC 0.1.23**: Predicted BGCs can now be uploaded for visualization in **antiSMASH** using a JSON output file
  - Install and run DeepBGC as usual based on the instructions below
  - Upload `antismash.json` from the DeepBGC output folder using "Upload extra annotations" on the [antiSMASH](https://antismash.secondarymetabolites.org/) page
  - Predicted BGC regions and their prediction scores will be displayed alongside antiSMASH BGCs

## Publications

A deep learning genome-mining strategy for biosynthetic gene cluster prediction <br>
Geoffrey D Hannigan, David Prihoda et al., Nucleic Acids Research, gkz654, https://doi.org/10.1093/nar/gkz654

## Install using conda (recommended)

You can install DeepBGC using [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/download.html)
or one of the alternatives ([Miniconda](https://docs.conda.io/en/latest/miniconda.html),
[Miniforge](https://github.com/conda-forge/miniforge)).

Set up the Bioconda and Conda-Forge channels:

```bash
conda config --add channels bioconda
conda config --add channels conda-forge
```

Install DeepBGC using:

```bash
# Create a separate DeepBGC environment and install dependencies
conda create -n deepbgc python=3.7 hmmer prodigal

# Install DeepBGC into the environment using pip
conda activate deepbgc
pip install deepbgc

# Alternatively, install everything using conda (currently unstable due to conda conflicts)
conda install deepbgc
```


## Install dependencies manually (if conda is not available)

If you don't mind installing the HMMER and Prodigal dependencies manually, you can also install DeepBGC using pip:

- Install Python version 3.6 or 3.7 (Note: **Python 3.8 is not supported** due to the Tensorflow < 2.0 dependency)
- Install Prodigal and put the `prodigal` binary on your PATH: https://github.com/hyattpd/Prodigal/releases
- Install HMMER and put the `hmmscan` and `hmmpress` binaries on your PATH: http://hmmer.org/download.html
- Run `pip install deepbgc` to install DeepBGC; a quick check that the binaries are visible is sketched below
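If you installed the dependencies manually, you can confirm they are visible before running DeepBGC. This is a minimal, DeepBGC-independent sketch using only the Python standard library:

```python
# Minimal sketch: check that the external binaries DeepBGC invokes
# (prodigal from Prodigal, hmmscan and hmmpress from HMMER) are on the PATH.
import shutil

for binary in ["prodigal", "hmmscan", "hmmpress"]:
    path = shutil.which(binary)
    print(f"{binary}: {path or 'NOT FOUND - add it to your PATH'}")
```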
## Use DeepBGC

### Download models and Pfam database

Before you can use DeepBGC, download the trained models and Pfam database:

```bash
deepbgc download
```

You can display downloaded dependencies and models using:

```bash
deepbgc info
```

### Detection and classification

![DeepBGC pipeline](images/deepbgc.pipeline.png?raw=true "DeepBGC pipeline")

Detect and classify BGCs in a genomic sequence.
Proteins and Pfam domains are detected automatically if not already annotated (HMMER and Prodigal are needed).

```bash
# Show command help docs
deepbgc pipeline --help

# Detect and classify BGCs in mySequence.fa using the DeepBGC detector.
deepbgc pipeline mySequence.fa

# Detect and classify BGCs in mySequence.fa using a custom DeepBGC detector trained on your own data.
deepbgc pipeline --detector path/to/myDetector.pkl mySequence.fa
```

This will produce a `mySequence` directory with multiple files and a README.txt with file descriptions.

See the [Train DeepBGC on your own data](#train-deepbgc-on-your-own-data) section below for more information about training a custom detector or classifier.

#### Example output

See the [DeepBGC Example Result Notebook](https://nbviewer.jupyter.org/urls/github.com/Merck/deepbgc/releases/download/v0.1.0/DeepBGC_Example_Result.ipynb).
Data can be downloaded from the [releases page](https://github.com/Merck/deepbgc/releases).

![Detected BGC Regions](images/deepbgc.bgc.png?raw=true "Detected BGC regions")

## Train DeepBGC on your own data

You can train your own BGC detection and classification models; see `deepbgc train --help` for documentation and examples.

Training and validation data can be found in [release 0.1.0](https://github.com/Merck/deepbgc/releases/tag/v0.1.0) and [release 0.1.5](https://github.com/Merck/deepbgc/releases/tag/v0.1.5). You will need:
- Positive (BGC) training data - In most cases, this is your own BGC training set; see the "Preparing training data" section below
- Negative (non-BGC) training data - Needed for BGC detection. You can use `GeneSwap_Negatives.pfam.tsv` from https://github.com/Merck/deepbgc/releases/tag/v0.1.0
- Validation data - Needed for BGC detection. Contigs with annotated BGC and non-BGC regions. A working example can be downloaded from https://github.com/Merck/deepbgc/releases/tag/v0.1.5
- Trained Pfam2vec vectors - A "vocabulary" converting Pfam IDs to meaningful numeric vectors; you can reuse the previously trained `pfam2vec.csv` results from https://github.com/Merck/deepbgc/releases/tag/v0.1.0
- JSON configuration files - See the JSON section below

If you have any questions about using or training DeepBGC, feel free to submit an issue.

### Preparing training data

The training examples need to be prepared in Pfam TSV format, which can be generated from your sequence
using `deepbgc prepare`.

First, you will need to manually add an `in_cluster` column that contains 0 for Pfam domains outside a BGC
and 1 for Pfam domains inside a BGC. We recommend preparing a separate negative TSV and positive TSV file,
where the column is equal to all 0 or all 1 respectively.

Next, you will need to manually add a `sequence_id` column,
which identifies a contiguous sequence of Pfam domains from a single sample (BGC or negative sequence).
The samples are shuffled during training to present the model with a random order of positive and negative samples;
Pfam domains with the same `sequence_id` value will be kept together. For example, if your training set contains multiple BGCs, the `sequence_id` column should contain the BGC ID (a sketch of this labeling step follows below).

**! New in version 0.1.17 !** You can now prepare *protein* FASTA sequences into a Pfam TSV file using `deepbgc prepare --protein`.
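For illustration, here is a minimal Python sketch of the labeling step described above. The file names `positive.pfam.tsv` and `negative.pfam.tsv` are hypothetical outputs of `deepbgc prepare`, and the `sample_id` column is a hypothetical placeholder; use whatever column identifies samples in your own data:

```python
# Hypothetical sketch: add the in_cluster and sequence_id columns to
# Pfam TSV files produced by `deepbgc prepare`. File names and the
# sample_id column are assumptions, not DeepBGC conventions.
import pandas as pd

positive = pd.read_csv("positive.pfam.tsv", sep="\t")
negative = pd.read_csv("negative.pfam.tsv", sep="\t")

# 1 for Pfam domains inside a BGC, 0 for Pfam domains outside.
positive["in_cluster"] = 1
negative["in_cluster"] = 0

# Group Pfam domains from the same sample so they are shuffled together
# during training; for positive samples, the BGC ID can play this role.
positive["sequence_id"] = positive["sample_id"]
negative["sequence_id"] = negative["sample_id"]

positive.to_csv("positive.labeled.pfam.tsv", sep="\t", index=False)
negative.to_csv("negative.labeled.pfam.tsv", sep="\t", index=False)
```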
### JSON model training template files

DeepBGC uses JSON template files to define model architecture and training parameters.
All templates can be downloaded from [release 0.1.0](https://github.com/Merck/deepbgc/releases/tag/v0.1.0).

The JSON template for the DeepBGC LSTM **detector** with pfam2vec is structured as follows (the `- comments` are explanatory annotations, not valid JSON):
```
{
  "type": "KerasRNN", - Model architecture (KerasRNN/DiscreteHMM/GeneBorderHMM)
  "build_params": { - Parameters for the model architecture
    "batch_size": 16, - Number of training samples processed in parallel
    "hidden_size": 128, - Size of the vector storing the LSTM inner state
    "stateful": true - Remember the previous sequence when training the next batch
  },
  "fit_params": {
    "timesteps": 256, - Number of pfam2vec vectors trained in one batch
    "validation_size": 0, - Fraction of training data to use for validation (if validation data is not provided explicitly). Use 0.2 to hold out 20% of the data for validation.
    "verbose": 1, - Verbosity during training
    "num_epochs": 1000, - Number of passes over your training set during training. You probably want a lower number if not using early stopping on validation data.
    "early_stopping": { - Stop training early based on validation performance
      "monitor": "val_auc_roc", - Use validation AUC ROC to observe performance
      "min_delta": 0.0001, - Minimum change over the last epochs that counts as an improvement
      "patience": 20, - How many of the last epochs to check for improvement
      "mode": "max" - Stop training when the given metric stops increasing (use "min" for decreasing metrics like loss)
    },
    "shuffle": true, - Shuffle samples in each epoch. Uses the "sequence_id" field to group Pfam vectors belonging to the same sample and shuffle them together
    "optimizer": "adam", - Optimizer algorithm
    "learning_rate": 0.0001, - Learning rate
    "weighted": true - Increase the weight of the less-represented class. Will give more weight to BGC training samples if the non-BGC set is larger.
  },
  "input_params": {
    "features": [ - Array of features to use in the model, see deepbgc/features.py
      {
        "type": "ProteinBorderTransformer" - Add two binary flags for Pfam domains found at the beginning or end of a protein
      },
      {
        "type": "Pfam2VecTransformer", - Convert the pfam_id field to a pfam2vec vector using the provided pfam2vec table
        "vector_path": "#{PFAM2VEC}" - The PFAM2VEC variable is filled in using the command line argument --config
      }
    ]
  }
}
```
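Since the dash annotations above are not valid JSON, an actual config file has to be comment-free. Here is a minimal sketch of writing one from Python, using only values shown in the template; the file name `myDetector.json` is hypothetical, and the `#{PFAM2VEC}` placeholder is left for DeepBGC to fill in via `--config`:

```python
# Hypothetical sketch: write a comment-free detector config equivalent to
# the annotated template above. myDetector.json is an example name; the
# #{PFAM2VEC} placeholder is substituted by DeepBGC, not by this script.
import json

detector_config = {
    "type": "KerasRNN",
    "build_params": {"batch_size": 16, "hidden_size": 128, "stateful": True},
    "fit_params": {
        "timesteps": 256,
        "validation_size": 0,
        "verbose": 1,
        "num_epochs": 1000,
        "early_stopping": {
            "monitor": "val_auc_roc",
            "min_delta": 0.0001,
            "patience": 20,
            "mode": "max",
        },
        "shuffle": True,
        "optimizer": "adam",
        "learning_rate": 0.0001,
        "weighted": True,
    },
    "input_params": {
        "features": [
            {"type": "ProteinBorderTransformer"},
            {"type": "Pfam2VecTransformer", "vector_path": "#{PFAM2VEC}"},
        ]
    },
}

with open("myDetector.json", "w") as f:
    json.dump(detector_config, f, indent=2)
```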
The JSON template for the Random Forest **classifier** is structured as follows:
```
{
  "type": "RandomForestClassifier", - Type of classifier (RandomForestClassifier)
  "build_params": {
    "n_estimators": 100, - Number of trees in the random forest
    "random_state": 0 - Random seed, for reproducible results
  },
  "input_params": {
    "sequence_as_vector": true, - Convert each sample into a single vector
    "features": [
      {
        "type": "OneHotEncodingTransformer" - Convert each sequence of Pfams into a single binary vector (Pfam set)
      }
    ]
  }
}
```

### Using your trained model

Since version `0.1.10` you can provide a direct path to the detector or classifier model like so:
```bash
deepbgc pipeline \
    mySequence.fa \
    --detector path/to/myDetector.pkl \
    --classifier path/to/myClassifier.pkl
```