{"id":26873623,"url":"https://github.com/systems-genomics-lab/deeptaxa","last_synced_at":"2025-08-16T17:04:32.543Z","repository":{"id":285297779,"uuid":"957650138","full_name":"systems-genomics-lab/deeptaxa","owner":"systems-genomics-lab","description":"A deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences","archived":false,"fork":false,"pushed_at":"2025-05-16T15:15:17.000Z","size":152,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-20T12:11:25.385Z","etag":null,"topics":["16s-rrna","bert","cnn","convolutional-neural-networks","deep-learning","machine-learning","microbiome","python","taxonomic-classification","torch","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/systems-genomics-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-30T21:38:01.000Z","updated_at":"2025-05-16T15:15:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"e19fddaa-e8e9-481b-bfa4-0e9a4d15619f","html_url":"https://github.com/systems-genomics-lab/deeptaxa","commit_stats":null,"previous_names":["systems-genomics-lab/deeptaxa"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/systems-genomics-lab/deeptaxa","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systems-genomics-lab%2Fdeeptaxa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systems-genomics-lab%2Fdee
ptaxa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systems-genomics-lab%2Fdeeptaxa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systems-genomics-lab%2Fdeeptaxa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/systems-genomics-lab","download_url":"https://codeload.github.com/systems-genomics-lab/deeptaxa/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/systems-genomics-lab%2Fdeeptaxa/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270742043,"owners_count":24637504,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["16s-rrna","bert","cnn","convolutional-neural-networks","deep-learning","machine-learning","microbiome","python","taxonomic-classification","torch","transformers"],"created_at":"2025-03-31T09:19:41.040Z","updated_at":"2025-08-16T17:04:32.511Z","avatar_url":"https://github.com/systems-genomics-lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DeepTaxa\n\n[![License](https://img.shields.io/github/license/systems-genomics-lab/deeptaxa)](LICENSE)\n[![Last 
Commit](https://img.shields.io/github/last-commit/systems-genomics-lab/deeptaxa)](https://github.com/systems-genomics-lab/deeptaxa/commits/main)\n[![Issues](https://img.shields.io/github/issues/systems-genomics-lab/deeptaxa)](https://github.com/systems-genomics-lab/deeptaxa/issues)\n[![GitHub Stars](https://img.shields.io/github/stars/systems-genomics-lab/deeptaxa?style=social)](https://github.com/systems-genomics-lab/deeptaxa/stargazers)\n[![GitHub Forks](https://img.shields.io/github/forks/systems-genomics-lab/deeptaxa?style=social)](https://github.com/systems-genomics-lab/deeptaxa/network/members)\n\n**DeepTaxa** is a deep learning framework designed for hierarchical taxonomy classification of 16S rRNA gene sequences.\n\n---\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Key Features](#key-features)\n3. [Installation](#installation)\n   - [Dependencies](#dependencies)\n   - [Installation Steps](#installation-steps)\n4. [Data and Pre-Trained Models](#data-and-pre-trained-models)\n5. [Usage](#usage)\n   - [Quick Start](#quick-start)\n   - [Training a Model](#training-a-model)\n   - [Inspecting a Checkpoint](#inspecting-a-checkpoint)\n   - [Making Predictions](#making-predictions)\n   - [Demo](#demo)\n6. [License](#license)\n7. [Citation](#citation)\n8. [Contact](#contact)\n9. [Acknowledgements](#acknowledgements)\n\n---\n\n## Overview\n\nDeepTaxa is a deep learning framework for classifying 16S rRNA gene sequences into taxonomic hierarchies, from domain to species. DeepTaxa provides a straightforward command-line interface and flexible model options, including a hybrid CNN-BERT approach, to facilitate efficient analysis of 16S rRNA sequence datasets. 
Pre-trained models and datasets are hosted on Hugging Face to support training and prediction.\n\n### Supported Architectures\n- **CNNClassifier**: A convolutional neural network optimized for extracting local sequence features.\n- **BERTClassifier**: A BERT-based model that captures global contextual relationships within sequences.\n- **HybridCNNBERTClassifier**: A hybrid approach that combines the local features extracted by the CNN with the global context captured by BERT.\n\n---\n\n## Key Features\n\n- **Hierarchical Taxonomy Prediction**: Classifies sequences across seven taxonomic levels (domain to species) in a single pass.\n- **Multiple Model Options**: Choose from CNN, BERT, or hybrid CNN-BERT architectures based on your needs.\n- **Customizable Training**: Set hyperparameters (e.g., learning rate, batch size, epochs) via the CLI.\n- **GPU Acceleration**: Runs on CUDA-enabled GPUs for faster training and inference.\n\n---\n\n## Installation\n\nDeepTaxa requires **Python 3.10 or later**. Installing it inside a Conda environment is recommended for dependency management.\n\n### Dependencies\nDependencies are specified in [`pyproject.toml`](pyproject.toml) and will be installed automatically during setup. Key requirements include:\n- torch\n- transformers\n- pandas\n- numpy\n- tqdm\n- scikit-learn\n- biopython\n- h5py\n- optuna\n\n### Installation Steps\n1. **Clone the Repository**:\n   ```bash\n   git clone https://github.com/systems-genomics-lab/deeptaxa.git\n   cd deeptaxa\n   ```\n2. **Set Up a Conda Environment**:\n   ```bash\n   conda create --name deeptaxa_env python=3.10 -y\n   conda activate deeptaxa_env\n   ```\n3. **Install DeepTaxa and Dependencies**:\n   ```bash\n   pip install .  # Installs DeepTaxa along with dependencies from pyproject.toml\n   ```\n4. 
**Verify Installation**:\n   ```bash\n   deeptaxa --version  # Displays the installed DeepTaxa version\n   ```\n\n\u003e **Note**: For GPU support, ensure PyTorch is installed with CUDA compatibility. Refer to the [PyTorch website](https://pytorch.org/get-started/locally/) for details. You may need to install a specific PyTorch version compatible with your CUDA setup before running `pip install .`.\n\n---\n\n## Data and Pre-Trained Models\n\nTo maintain a lightweight repository, datasets and pre-trained models are hosted externally and should be stored in a separate directory (e.g., `deeptaxa-data`) outside the codebase. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory.\n\n### Directory Structure Example\nHere’s how the folders should be organized relative to each other:\n\n```\nworking_directory/\n├── deeptaxa/                  # Cloned repository folder (codebase)\n│   ├── LICENSE                # LICENSE\n│   ├── README.md              # This file\n│   ├── pyproject.toml         # Configuration file\n│   ├── deeptaxa/              # Source code subdirectory\n│   └── scripts/               # Supplementary scripts\n├── deeptaxa-data/             # External folder for datasets and models\n│   ├── greengenes/            # Subdirectory for Greengenes dataset\n│   │   ├── gg_2024_09_training.fna.gz\n│   │   ├── gg_2024_09_training.tsv.gz\n│   │   ├── gg_2024_09_testing.fna.gz\n│   │   ├── gg_2024_09_testing.tsv.gz\n│   └── models/                # Subdirectory for pre-trained models\n│       └── deeptaxa_april_2025.pt\n└── deeptaxa-outputs/          # External folder for generated outputs\n    ├── model_checkpoint.pt    # Trained model checkpoint\n    ├── predictions/           # Subdirectory for prediction outputs\n    │   ├── predictions.json\n    │   └── predictions.tsv\n    └── metrics/               # Subdirectory for exported metrics\n        └── model_description.json\n```\n\nCommands in 
the [Usage](#usage) section are run from within `deeptaxa/`, using `../` to access `deeptaxa-data/` and `deeptaxa-outputs/`.\n\n### Datasets\nDeepTaxa uses a modified and reformatted version of the [**Greengenes2**](https://ftp.microbio.me/greengenes_release/2024.09/) dataset, available on [Hugging Face 🤗](https://huggingface.co/datasets/systems-genomics-lab/greengenes).\n\n- **Available Files**:\n\n| File Name                        | Type                            | Number of Sequences | Size         |\n|----------------------------------|---------------------------------|---------------------|--------------|\n| `gg_2024_09_training.fna.gz`     | Training FASTA (sequences)      | 277,336             | ~96.4 MB     |\n| `gg_2024_09_training.tsv.gz`     | Training TSV (taxonomy labels)  | 277,336             | ~2.6 MB      |\n| `gg_2024_09_testing.fna.gz`      | Testing FASTA (sequences)       | 69,335              | ~24.1 MB     |\n| `gg_2024_09_testing.tsv.gz`      | Testing TSV (taxonomy labels)   | 69,335              | ~0.8 MB      |\n\n\n#### Download Instructions\n```bash\nmkdir -p deeptaxa-data/greengenes\ncd deeptaxa-data/greengenes\nwget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.fna.gz\nwget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.tsv.gz\nwget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.fna.gz\nwget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.tsv.gz\n```\n\n### Pre-Trained Models\nPre-trained models are available for immediate use and are hosted on [Hugging Face 🤗](https://huggingface.co/systems-genomics-lab/deeptaxa).\n\n- **Hybrid CNN-BERT Model**: [`deeptaxa_april_2025.pt`](https://huggingface.co/systems-genomics-lab/deeptaxa)\n  - A hybrid CNN-BERT model trained on the Greengenes dataset, providing high-accuracy predictions across all taxonomic levels.\n  
- Includes a `config.json` file with model metadata.\n- **License**: MIT\n\n#### Download Instructions\n```bash\nmkdir -p deeptaxa-data/models\ncd deeptaxa-data/models\nwget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt\nwget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json\n```\n\n\u003e **Note**: The `deeptaxa_april_2025.pt` file uses PyTorch’s default serialization with `pickle`. This may trigger a security warning on Hugging Face due to potential risks when loading untrusted files. Ensure you download it directly from the official repository and use it in a secure environment.\n\n---\n\n## Usage\n\nDeepTaxa offers a versatile command-line interface (`deeptaxa.cli`) for training, checkpoint inspection, and prediction tasks. All commands should be run from the `deeptaxa/` directory after installation. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory outside the codebase. Replace file paths in the examples below with your local data and output locations.\n\n### Quick Start\nTo train and predict with DeepTaxa:\n1. Install DeepTaxa from source (see [Installation](#installation)).\n2. Download the pre-trained hybrid model:\n   ```bash\n   mkdir -p ../deeptaxa-data/models  # Creates the external models folder if it doesn’t exist\n   wget -P ../deeptaxa-data/models https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt\n   ```\n   \u003e **Note**: If you only want to perform predictions with the pre-trained model, you do not need to download the Greengenes dataset files. The dataset is required only for training a new model.\n3. Create the outputs directory:\n   ```bash\n   mkdir -p ../deeptaxa-outputs  # Creates the external outputs folder if it doesn’t exist\n   ```\n4. 
Train a hybrid CNN-BERT model (optional, if you want to train your own; requires the Greengenes dataset):\n   - Download the Greengenes dataset (see [Data](#data-and-pre-trained-models)).\n   - Run the training command:\n     ```bash\n     deeptaxa train \\                         # Runs the training command\n       --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \\  # Path to training sequences\n       --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \\  # Path to training labels\n       --model-type hybrid \\                  # Specifies the hybrid CNN-BERT architecture\n       --output-dir ../deeptaxa-outputs/      # Directory to save the trained model checkpoint\n     ```\n5. Predict on your data using the pre-trained model:\n   ```bash\n   deeptaxa predict \\                         # Runs the prediction command\n     --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \\  # Path to test sequences\n     --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \\  # Path to the pre-trained model\n     --output-dir ../deeptaxa-outputs/predictions  # Directory to save prediction results\n   ```\n\n### Training a Model\nTrain a new model using the Greengenes dataset:\n```bash\ndeeptaxa train \\                           # Initiates model training\n  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \\  # Input FASTA file with sequences\n  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \\  # Taxonomy labels for training\n  --model-type hybrid \\                    # Model architecture: cnn, bert, or hybrid\n  --output-dir ../deeptaxa-outputs/ \\      # Where to save the trained model\n  --epochs 10 \\                            # Number of training epochs\n  --batch-size 16 \\                        # Batch size for training\n  --learning-rate 1e-4 \\                   # Learning rate for optimization\n  --device cuda                            # Use GPU (cuda) or 
CPU (cpu)\n```\n\n### Inspecting a Checkpoint\nExamine a pre-trained model’s metadata and performance metrics:\n```bash\ndeeptaxa describe \\                        # Describes a model checkpoint\n  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \\  # Path to the pre-trained model\n  --export-metrics ../deeptaxa-outputs/metrics/model_description.json  # Where to save metadata and metrics\n```\n\n### Making Predictions\nClassify sequences from a FASTA file using a trained model:\n```bash\ndeeptaxa predict \\                         # Generates taxonomic predictions\n  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \\  # Input FASTA file for prediction\n  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \\  # Path to the trained model checkpoint\n  --output-dir ../deeptaxa-outputs/predictions \\  # Directory to save prediction outputs\n  --top-k 3 \\                              # Number of top predictions per level\n  --tabular                                # Exports results in TSV format\n```\n\n#### Output Files\n- `../deeptaxa-outputs/predictions/predictions.json`: Detailed predictions with confidence scores and uncertainty metrics.\n- `../deeptaxa-outputs/predictions/predictions.tsv`: Tabular format for downstream analysis.\n\n\u003e **Tip**: Use `--help` with any CLI command (e.g., `python -m deeptaxa.cli train --help`) for a full list of options. 
Ensure the `deeptaxa-outputs` directory exists (e.g., `mkdir -p ../deeptaxa-outputs`) before running commands.\n\n\n### Demo\n\nFor a full demo of working with DeepTaxa, see the following notebooks:\n\n- [`deeptaxa_prediction.ipynb`](notebooks/deeptaxa_prediction.ipynb): making a taxonomic classification using the pre-trained model `deeptaxa_april_2025.pt`\n- [`deeptaxa_workflow.ipynb`](notebooks/deeptaxa_workflow.ipynb): training a fresh model, resuming training on an existing model, and making predictions\n\n---\n\n## License\n\n- **Code \u0026 Models**: [MIT License](LICENSE)\n- **Greengenes Dataset**: Modified BSD License\n\nThe Greengenes dataset used in DeepTaxa is a modified version of the Greengenes2 dataset, distributed under the terms of the Modified BSD License. For full license details, see the [dataset repository on Hugging Face](https://huggingface.co/datasets/systems-genomics-lab/greengenes).\n\n---\n\n## Citation\n\nIf DeepTaxa contributes to your research, please cite:\n```bibtex\n@software{DeepTaxa,\n  author = {{Systems Genomics Lab}},\n  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},\n  year = {2025},\n  publisher = {GitHub},\n  url = {https://github.com/systems-genomics-lab/deeptaxa},\n}\n```\n\nFor the Greengenes dataset, cite:  \nDeSantis TZ, et al. (2006). *Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB*. Applied and Environmental Microbiology. [DOI:10.1128/AEM.03006-05](https://pubmed.ncbi.nlm.nih.gov/16820507/).\n\n---\n\n## Contact\n\nTo report bugs, suggest features, or submit code, please open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues).\n\n---\n\n## Acknowledgements\n\n- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.\n- **[Ahmed A. 
El Hosseiny](https://github.com/ahmedelhosseiny)** and the High-Performance Computing Team of the [School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at the [American University in Cairo (AUC)](https://www.aucegypt.edu/) for their support and for granting access to GPU resources that enabled this work.\n- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsystems-genomics-lab%2Fdeeptaxa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsystems-genomics-lab%2Fdeeptaxa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsystems-genomics-lab%2Fdeeptaxa/lists"}