https://github.com/systems-genomics-lab/deeptaxa
A deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences
16s-rrna bert cnn convolutional-neural-networks deep-learning machine-learning microbiome python taxonomic-classification torch transformers
- Host: GitHub
- URL: https://github.com/systems-genomics-lab/deeptaxa
- Owner: systems-genomics-lab
- License: mit
- Created: 2025-03-30T21:38:01.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-05-16T15:15:17.000Z (5 months ago)
- Last Synced: 2025-06-20T12:11:25.385Z (4 months ago)
- Topics: 16s-rrna, bert, cnn, convolutional-neural-networks, deep-learning, machine-learning, microbiome, python, taxonomic-classification, torch, transformers
- Language: Python
- Homepage:
- Size: 148 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# DeepTaxa
**DeepTaxa** is a deep learning framework designed for hierarchical taxonomy classification of 16S rRNA gene sequences.
---
## Table of Contents
1. [Overview](#overview)
2. [Key Features](#key-features)
3. [Installation](#installation)
   - [Dependencies](#dependencies)
   - [Installation Steps](#installation-steps)
4. [Data and Pre-Trained Models](#data-and-pre-trained-models)
5. [Usage](#usage)
   - [Quick Start](#quick-start)
   - [Training a Model](#training-a-model)
   - [Inspecting a Checkpoint](#inspecting-a-checkpoint)
   - [Making Predictions](#making-predictions)
6. [Troubleshooting](#troubleshooting)
7. [License](#license)
8. [Citation](#citation)
9. [Contact](#contact)
10. [Acknowledgements](#acknowledgements)

---
## Overview
DeepTaxa is a deep learning framework for classifying 16S rRNA gene sequences into taxonomic hierarchies, from domain to species. It provides a straightforward command-line interface and flexible model options, including a hybrid CNN-BERT approach, to facilitate efficient analysis of 16S rRNA sequence datasets. Pre-trained models and datasets are hosted on Hugging Face to support taxonomy classification, training, and prediction.
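To make the "domain to species" hierarchy concrete, here is a minimal sketch that splits a lineage string into the seven ranks. It assumes Greengenes-style rank prefixes (`d__`, `p__`, ..., `s__`) separated by semicolons; the exact label format DeepTaxa expects is not specified here, and `parse_lineage` and the example lineage are purely illustrative.

```python
# Seven ranks, from domain to species, as described in this README.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Assumption: Greengenes-style rank prefixes. Check your own label files.
PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def parse_lineage(lineage: str) -> dict[str, str]:
    """Split a semicolon-separated lineage into a rank -> name mapping."""
    parts = [part.strip() for part in lineage.split(";") if part.strip()]
    result = {rank: "" for rank in RANKS}  # missing ranks stay empty
    for part in parts:
        for rank, prefix in zip(RANKS, PREFIXES):
            if part.startswith(prefix):
                result[rank] = part[len(prefix):]
                break
    return result


# Illustrative lineage only, not taken from the dataset.
example = (
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; "
    "s__Escherichia coli"
)
print(parse_lineage(example)["genus"])
```

A flat mapping like this keeps each rank addressable by name, which is convenient when comparing predictions level by level.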
### Supported Architectures
- **CNNClassifier**: A convolutional neural network optimized for extracting local sequence features.
- **BERTClassifier**: A BERT-based model that captures global contextual relationships within sequences.
- **HybridCNNBERTClassifier**: A hybrid approach combining CNN and BERT for superior accuracy and robustness.

---
## Key Features
- **Hierarchical Taxonomy Prediction**: Classifies sequences across seven taxonomic levels in a single pass.
- **Multiple Model Options**: Choose from CNN, BERT, or hybrid CNN-BERT architectures based on your needs.
- **Customizable Training**: Fine-tune hyperparameters (e.g., learning rate, batch size, epochs) via the CLI.
- **GPU Acceleration**: Seamlessly integrates with CUDA-enabled GPUs for faster training and inference.

---
## Installation
DeepTaxa requires **Python 3.10 or later**. Installing it in a Conda environment is recommended for dependency management.
### Dependencies
Dependencies are specified in [`pyproject.toml`](pyproject.toml) and will be installed automatically during setup. Key requirements include:
- torch
- transformers
- pandas
- numpy
- tqdm
- scikit-learn
- biopython
- h5py
- optuna

### Installation Steps
1. **Clone the Repository**:
   ```bash
   git clone https://github.com/systems-genomics-lab/deeptaxa.git
   cd deeptaxa
   ```
2. **Set Up a Conda Environment**:
   ```bash
   conda create --name deeptaxa_env python=3.10 -y
   conda activate deeptaxa_env
   ```
3. **Install DeepTaxa and Dependencies**:
   ```bash
   pip install .  # Installs DeepTaxa along with dependencies from pyproject.toml
   ```
4. **Verify Installation**:
   ```bash
   deeptaxa --version  # Displays the installed DeepTaxa version
   ```

> **Note**: For GPU support, ensure PyTorch is installed with CUDA compatibility. Refer to the [PyTorch website](https://pytorch.org/get-started/locally/) for details. You may need to install a specific PyTorch version compatible with your CUDA setup before running `pip install .`.
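
A quick way to check whether the installed PyTorch build can see a GPU is sketched below. The helper name `pick_device` is hypothetical; it only wraps `torch.cuda.is_available()` and falls back to `"cpu"` when PyTorch is missing or no CUDA device is visible.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build sees a GPU, else "cpu"."""
    try:
        import torch  # listed among the DeepTaxa dependencies
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The returned string matches the two values shown for the `--device` option in the training example (`cuda` or `cpu`).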
---
## Data and Pre-Trained Models
To maintain a lightweight repository, datasets and pre-trained models are hosted externally and should be stored in a separate directory (e.g., `deeptaxa-data`) outside the codebase. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory.
### Directory Structure Example
Here’s how the folders should be organized relative to each other:

```
working_directory/
├── deeptaxa/                      # Cloned repository folder (codebase)
│   ├── LICENSE                    # LICENSE
│   ├── README.md                  # This file
│   ├── pyproject.toml             # Configuration file
│   ├── deeptaxa/                  # Source code subdirectory
│   └── scripts/                   # Supplementary scripts
├── deeptaxa-data/                 # External folder for datasets and models
│   ├── greengenes/                # Subdirectory for Greengenes dataset
│   │   ├── gg_2024_09_training.fna.gz
│   │   ├── gg_2024_09_training.tsv.gz
│   │   ├── gg_2024_09_testing.fna.gz
│   │   └── gg_2024_09_testing.tsv.gz
│   └── models/                    # Subdirectory for pre-trained models
│       └── deeptaxa_april_2025.pt
└── deeptaxa-outputs/              # External folder for generated outputs
    ├── model_checkpoint.pt        # Trained model checkpoint
    ├── predictions/               # Subdirectory for prediction outputs
    │   ├── predictions.json
    │   └── predictions.tsv
    └── metrics/                   # Subdirectory for exported metrics
        └── model_description.json
```

Commands in the [Usage](#usage) section are run from within `deeptaxa/`, using `../` to access `deeptaxa-data/` and `deeptaxa-outputs/`.
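The external folders from the tree above can also be created from Python. This is a minimal sketch using only `pathlib`; `create_layout` is a hypothetical helper, not part of DeepTaxa.

```python
from pathlib import Path


def create_layout(working_dir: Path) -> list[Path]:
    """Create the data and output folders from the directory tree above."""
    folders = [
        working_dir / "deeptaxa-data" / "greengenes",
        working_dir / "deeptaxa-data" / "models",
        working_dir / "deeptaxa-outputs" / "predictions",
        working_dir / "deeptaxa-outputs" / "metrics",
    ]
    for folder in folders:
        # parents/exist_ok give the same behaviour as `mkdir -p`
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

When run from inside the cloned `deeptaxa/` folder, `create_layout(Path(".."))` would place the folders next to the repository, matching the `../` paths used in the commands.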
### Datasets
DeepTaxa uses the [**Greengenes2**](https://ftp.microbio.me/greengenes_release/2024.09/) database (release 2024.09), which has been modified, reformatted, and made available on [Hugging Face 🤗](https://huggingface.co/datasets/systems-genomics-lab/greengenes).

- **Available Files**:

| File Name                        | Type                            | Number of Sequences | Size         |
|----------------------------------|---------------------------------|---------------------|--------------|
| `gg_2024_09_training.fna.gz`     | Training FASTA (sequences)      | 277,336             | ~96.4 MB     |
| `gg_2024_09_training.tsv.gz`     | Training TSV (taxonomy labels)  | 277,336             | ~2.6 MB      |
| `gg_2024_09_testing.fna.gz`      | Testing FASTA (sequences)       | 69,335              | ~24.1 MB     |
| `gg_2024_09_testing.tsv.gz`      | Testing TSV (taxonomy labels)   | 69,335              | ~0.8 MB      |

#### Download Instructions
```bash
# Run from working_directory/ (the parent of the cloned deeptaxa/ folder)
mkdir -p deeptaxa-data/greengenes
cd deeptaxa-data/greengenes
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.tsv.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.tsv.gz
```

### Pre-Trained Models
Pre-trained models are available for immediate use and hosted on [Hugging Face 🤗](https://huggingface.co/systems-genomics-lab/deeptaxa).

- **Hybrid CNN-BERT Model**: [`deeptaxa_april_2025.pt`](https://huggingface.co/systems-genomics-lab/deeptaxa)
  - A hybrid CNN-BERT model trained on the Greengenes dataset, providing high-accuracy predictions across all taxonomic levels.
  - Includes a `config.json` file with model metadata.
  - **License**: MIT

#### Download Instructions
```bash
# Run from working_directory/ (the parent of the cloned deeptaxa/ folder)
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json
```

> **Note**: The `deeptaxa_april_2025.pt` file uses PyTorch’s default serialization with `pickle`. This may trigger a security warning on Hugging Face due to potential risks when loading untrusted files. Ensure you download it directly from the official repository and use it in a secure environment.
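
One cautious habit with pickle-based checkpoints is to compare the downloaded file's checksum against one published by the hosting page, if one is listed there. The sketch below computes a SHA-256 digest with the standard library; `sha256_of` is a hypothetical helper, and no reference checksum is given in this README, so the expected value has to come from the official source.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of(Path("deeptaxa-data/models/deeptaxa_april_2025.pt"))` returns a hex string that can be compared by eye or in a script before the file is loaded.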
---
## Usage
DeepTaxa offers a versatile command-line interface (`deeptaxa.cli`) for training, checkpoint inspection, and prediction tasks. All commands should be run from the `deeptaxa/` directory after installation. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory outside the codebase. Replace file paths in the examples below with your local data and output locations.
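Before pointing the commands at your own files, it can help to confirm that a FASTA file contains the number of sequences you expect (the dataset table lists 277,336 training and 69,335 testing sequences). The sketch below counts header lines in a plain or gzipped FASTA file using only the standard library; `count_fasta_records` is a hypothetical helper, not a DeepTaxa function.

```python
import gzip
from pathlib import Path


def count_fasta_records(path: Path) -> int:
    """Count FASTA header lines (those starting with '>') in a plain or gzipped file."""
    # Pick the opener from the file extension; both accept text mode "rt".
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Reading line by line keeps memory use flat, so the same function works on the full training file as well as on small test inputs.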
### Quick Start
To train and predict with DeepTaxa:
1. Install DeepTaxa from source (see [Installation](#installation)).
2. Download the pre-trained hybrid model (run from within `deeptaxa/`, like the other commands in this section):
   ```bash
   mkdir -p ../deeptaxa-data/models
   wget -P ../deeptaxa-data/models https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
   ```
   > **Note**: If you only want to perform predictions with the pre-trained model, you do not need to download the Greengenes dataset files. The dataset is required only for training a new model.
3. Create the outputs directory:
   ```bash
   mkdir -p ../deeptaxa-outputs  # Creates the external outputs folder if it doesn’t exist
   ```
4. Train a hybrid CNN-BERT model (optional; requires the Greengenes dataset):
   - Download the Greengenes dataset (see [Data and Pre-Trained Models](#data-and-pre-trained-models)).
   - Run the training command:
     ```bash
     # Train the hybrid CNN-BERT architecture on the training sequences and labels.
     # The trained model checkpoint is saved under --output-dir.
     deeptaxa train \
       --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
       --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
       --model-type hybrid \
       --output-dir ../deeptaxa-outputs/
     ```
5. Predict on your data using the pre-trained model:
   ```bash
   # Classify sequences with the pre-trained checkpoint; results go to --output-dir.
   # The Greengenes test file is used here; replace it with your own FASTA file.
   deeptaxa predict \
     --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
     --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
     --output-dir ../deeptaxa-outputs/predictions
   ```

### Training a Model
Train a new model using the Greengenes dataset:
```bash
# --fasta-file / --taxonomy-file: input sequences and their taxonomy labels
# --model-type: cnn, bert, or hybrid
# --output-dir: where the trained model is saved
# --epochs, --batch-size, --learning-rate: training hyperparameters
# --device: cuda (GPU) or cpu
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/ \
  --epochs 10 \
  --batch-size 16 \
  --learning-rate 1e-4 \
  --device cuda
```

### Inspecting a Checkpoint
Examine a pre-trained model’s metadata and performance metrics:
```bash
# Describe a model checkpoint and export its metadata and metrics as JSON.
deeptaxa describe \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --export-metrics ../deeptaxa-outputs/metrics/model_description.json
```

### Making Predictions
Classify sequences from a FASTA file using a trained model:
```bash
# --fasta-file: input FASTA file for prediction
# --checkpoint: trained model checkpoint
# --output-dir: directory for prediction outputs
# --top-k 3: number of top predictions per level
# --tabular: export results in TSV format
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions \
  --top-k 3 \
  --tabular
```

#### Output Files
- `../deeptaxa-outputs/predictions/predictions.json`: Detailed predictions with confidence scores and uncertainty metrics.
- `../deeptaxa-outputs/predictions/predictions.tsv`: Tabular format for downstream analysis.

> **Tip**: Use `--help` with any CLI command (e.g., `python -m deeptaxa.cli train --help`) for a full list of options. Ensure the `deeptaxa-outputs` directory exists (e.g., `mkdir -p ../deeptaxa-outputs`) before running commands.
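
For downstream analysis, the tabular output can be loaded with the standard library. The column layout of `predictions.tsv` is not documented in this README, so the sketch below assumes only that the file is tab-separated with a header row and returns whatever column names it finds; `read_predictions` is a hypothetical helper.

```python
import csv
from pathlib import Path


def read_predictions(path: Path) -> tuple[list[str], list[dict[str, str]]]:
    """Return (column names, rows) from a tab-separated file with a header row."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = list(reader)  # each row maps column name -> cell text
        return list(reader.fieldnames or []), rows
```

Inspecting the returned column names first is a safe way to learn the actual schema before writing any analysis code against it.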
### Demo
For a full demo of working with DeepTaxa, see the following notebooks:
- [`deeptaxa_prediction.ipynb`](notebooks/deeptaxa_prediction.ipynb): making a taxonomic classification using the pre-trained model `deeptaxa_april_2025.pt`
- [`deeptaxa_workflow.ipynb`](notebooks/deeptaxa_workflow.ipynb): training a fresh model, resuming training on an existing model, and making predictions

---
## License
- **Code & Models**: [MIT License](LICENSE)
- **Greengenes Dataset**: Modified BSD License

The Greengenes dataset used in DeepTaxa is a modified version of the Greengenes2 dataset, distributed under the terms of the Modified BSD License. For full license details, see the [dataset repository on Hugging Face](https://huggingface.co/datasets/systems-genomics-lab/greengenes).
---
## Citation
If DeepTaxa contributes to your research, please cite:
```bibtex
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```

For the Greengenes dataset, cite:

DeSantis TZ, et al. (2006). *Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB*. Applied and Environmental Microbiology. [DOI:10.1128/AEM.03006-05](https://pubmed.ncbi.nlm.nih.gov/16820507/).

---
## Contact
To report bugs, suggest features, or submit code, please open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues).
---
## Acknowledgements
- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
- **[Ahmed A. El Hosseiny](https://github.com/ahmedelhosseiny)** and the High-Performance Computing Team of the [School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at the [American University in Cairo (AUC)](https://www.aucegypt.edu/) for their support and for granting access to GPU resources that enabled this work.
- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
---