# DeepTaxa

[![License](https://img.shields.io/github/license/systems-genomics-lab/deeptaxa)](LICENSE)
[![Last Commit](https://img.shields.io/github/last-commit/systems-genomics-lab/deeptaxa)](https://github.com/systems-genomics-lab/deeptaxa/commits/main)
[![Issues](https://img.shields.io/github/issues/systems-genomics-lab/deeptaxa)](https://github.com/systems-genomics-lab/deeptaxa/issues)
[![GitHub Stars](https://img.shields.io/github/stars/systems-genomics-lab/deeptaxa?style=social)](https://github.com/systems-genomics-lab/deeptaxa/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/systems-genomics-lab/deeptaxa?style=social)](https://github.com/systems-genomics-lab/deeptaxa/network/members)

**DeepTaxa** is a deep learning framework designed for hierarchical taxonomy classification of 16S rRNA gene sequences.

---

## Table of Contents

1. [Overview](#overview)
2. [Key Features](#key-features)
3. [Installation](#installation)
   - [Dependencies](#dependencies)
   - [Installation Steps](#installation-steps)
4. [Data and Pre-Trained Models](#data-and-pre-trained-models)
5. [Usage](#usage)
   - [Quick Start](#quick-start)
   - [Training a Model](#training-a-model)
   - [Inspecting a Checkpoint](#inspecting-a-checkpoint)
   - [Making Predictions](#making-predictions)
   - [Demo](#demo)
6. [License](#license)
7. [Citation](#citation)
8. [Contact](#contact)
9. [Acknowledgements](#acknowledgements)

---

## Overview

DeepTaxa is a deep learning framework for classifying 16S rRNA gene sequences into taxonomic hierarchies, from domain to species. It provides a command-line interface and several model options, including a hybrid CNN-BERT approach, for analyzing 16S rRNA sequence datasets. Pre-trained models and datasets are hosted on Hugging Face to support training and prediction.

### Supported Architectures
- **CNNClassifier**: A convolutional neural network optimized for extracting local sequence features.
- **BERTClassifier**: A BERT-based model that captures global contextual relationships within sequences.
- **HybridCNNBERTClassifier**: A hybrid approach that combines the CNN's local feature extraction with BERT's global context modeling. The released pre-trained model uses this architecture.
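
As a rough illustration of the kind of input a convolutional model consumes, the sketch below one-hot encodes a DNA sequence in plain Python. This is a common preprocessing convention and an assumption here: this README does not describe how DeepTaxa encodes or tokenizes sequences, and `one_hot_encode` is a hypothetical helper, not part of the DeepTaxa API.

```python
# Hypothetical helper: one-hot encode a DNA sequence.
# This is NOT DeepTaxa's actual preprocessing, only a common convention.
BASES = "ACGT"


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one 4-element row per base, in A, C, G, T order.

    Ambiguous or unknown symbols (e.g. N) become an all-zero row.
    RNA input is accepted by treating U as T.
    """
    rows = []
    for base in sequence.upper().replace("U", "T"):
        row = [0, 0, 0, 0]
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


# Each row marks which base occurs at that position.
for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```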

---

## Key Features

- **Hierarchical Taxonomy Prediction**: Classifies sequences across seven taxonomic levels in a single pass.
- **Multiple Model Options**: Choose from CNN, BERT, or hybrid CNN-BERT architectures based on your needs.
- **Customizable Training**: Fine-tune hyperparameters (e.g., learning rate, batch size, epochs) via the CLI.
- **GPU Acceleration**: Seamlessly integrates with CUDA-enabled GPUs for faster training and inference.
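
To make the seven levels concrete, the following sketch splits a Greengenes-style lineage string into its ranks. It assumes the conventional `d__; p__; c__; o__; f__; g__; s__` prefix format; the exact column layout of the DeepTaxa taxonomy TSV files is not described here, so treat `parse_lineage` and the example lineage as illustrative only.

```python
# Seven ranks, domain to species, keyed by the Greengenes-style rank prefix.
RANKS = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_lineage(lineage: str) -> dict[str, str]:
    """Split a 'd__...; p__...; ...' string into a rank -> name mapping.

    Ranks that are missing or empty (e.g. 's__') map to an empty string.
    """
    parsed = {rank: "" for rank in RANKS.values()}
    for field in lineage.split(";"):
        prefix, separator, name = field.strip().partition("__")
        if separator and prefix in RANKS:
            parsed[RANKS[prefix]] = name.strip()
    return parsed


# Illustrative lineage, not taken from the DeepTaxa dataset files.
example = (
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; "
    "s__Escherichia coli"
)
for rank, name in parse_lineage(example).items():
    print(f"{rank:8s} {name}")
```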

---

## Installation

DeepTaxa requires **Python 3.10 or later**. Installing it in a Conda environment is recommended for dependency management.

### Dependencies
Dependencies are specified in [`pyproject.toml`](pyproject.toml) and will be installed automatically during setup. Key requirements include:
- torch
- transformers
- pandas
- numpy
- tqdm
- scikit-learn
- biopython
- h5py
- optuna

### Installation Steps
1. **Clone the Repository**:
```bash
git clone https://github.com/systems-genomics-lab/deeptaxa.git
cd deeptaxa
```
2. **Set Up a Conda Environment**:
```bash
conda create --name deeptaxa_env python=3.10 -y
conda activate deeptaxa_env
```
3. **Install DeepTaxa and Dependencies**:
```bash
pip install . # Installs DeepTaxa along with dependencies from pyproject.toml
```
4. **Verify Installation**:
```bash
deeptaxa --version # Displays the installed DeepTaxa version
```

> **Note**: For GPU support, ensure PyTorch is installed with CUDA compatibility. Refer to the [PyTorch website](https://pytorch.org/get-started/locally/) for details. You may need to install a specific PyTorch version compatible with your CUDA setup before running `pip install .`.
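
One way to confirm that the installed PyTorch build can see a GPU is the small check below. `pick_device` is a hypothetical helper, not part of DeepTaxa; it uses `torch.cuda.is_available()` and falls back to `cpu` when PyTorch is absent or no CUDA device is visible. The result can be passed to the `--device` option shown in the [Usage](#usage) section.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' if PyTorch is installed and sees a CUDA GPU, else 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"--device {pick_device()}")
```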

---

## Data and Pre-Trained Models

To maintain a lightweight repository, datasets and pre-trained models are hosted externally and should be stored in a separate directory (e.g., `deeptaxa-data`) outside the codebase. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory.

### Directory Structure Example
Here’s how the folders should be organized relative to each other:

```
working_directory/
├── deeptaxa/                      # Cloned repository folder (codebase)
│   ├── LICENSE                    # License file
│   ├── README.md                  # This file
│   ├── pyproject.toml             # Configuration file
│   ├── deeptaxa/                  # Source code subdirectory
│   └── scripts/                   # Supplementary scripts
├── deeptaxa-data/                 # External folder for datasets and models
│   ├── greengenes/                # Subdirectory for Greengenes dataset
│   │   ├── gg_2024_09_training.fna.gz
│   │   ├── gg_2024_09_training.tsv.gz
│   │   ├── gg_2024_09_testing.fna.gz
│   │   └── gg_2024_09_testing.tsv.gz
│   └── models/                    # Subdirectory for pre-trained models
│       └── deeptaxa_april_2025.pt
└── deeptaxa-outputs/              # External folder for generated outputs
    ├── model_checkpoint.pt        # Trained model checkpoint
    ├── predictions/               # Subdirectory for prediction outputs
    │   ├── predictions.json
    │   └── predictions.tsv
    └── metrics/                   # Subdirectory for exported metrics
        └── model_description.json
```

Commands in the [Usage](#usage) section are run from within `deeptaxa/`, using `../` to access `deeptaxa-data/` and `deeptaxa-outputs/`.
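
The data and output folders can also be created programmatically. The sketch below builds the external directories from the layout above with `pathlib`; `create_layout` is a hypothetical helper and only creates empty folders, it does not download anything.

```python
from pathlib import Path

# External folders from the layout above, relative to working_directory/.
SUBDIRECTORIES = [
    "deeptaxa-data/greengenes",
    "deeptaxa-data/models",
    "deeptaxa-outputs/predictions",
    "deeptaxa-outputs/metrics",
]


def create_layout(working_directory: Path) -> list[Path]:
    """Create the data and output folders; existing folders are left untouched."""
    created = []
    for subdirectory in SUBDIRECTORIES:
        path = working_directory / subdirectory
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Example, when run from inside the cloned deeptaxa/ folder:
# create_layout(Path(".."))
```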

### Datasets
DeepTaxa uses the [**Greengenes2**](https://ftp.microbio.me/greengenes_release/2024.09/) database (release 2024.09), which has been modified, reformatted, and made available on [Hugging Face 🤗](https://huggingface.co/datasets/systems-genomics-lab/greengenes).

- **Available Files**:

| File Name | Type | Number of Sequences | Size |
|----------------------------------|---------------------------------|---------------------|--------------|
| `gg_2024_09_training.fna.gz` | Training FASTA (sequences) | 277,336 | ~96.4 MB |
| `gg_2024_09_training.tsv.gz` | Training TSV (taxonomy labels) | 277,336 | ~2.6 MB |
| `gg_2024_09_testing.fna.gz` | Testing FASTA (sequences) | 69,335 | ~24.1 MB |
| `gg_2024_09_testing.tsv.gz` | Testing TSV (taxonomy labels) | 69,335 | ~0.8 MB |

#### Download Instructions
```bash
# Run from working_directory/ (the parent of the cloned deeptaxa/ folder)
mkdir -p deeptaxa-data/greengenes
cd deeptaxa-data/greengenes
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_training.tsv.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.fna.gz
wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/gg_2024_09_testing.tsv.gz
```
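
After downloading, the sequence counts from the table above offer a simple sanity check. The sketch below counts FASTA header lines in a gzip-compressed file using only the standard library; `count_fasta_records` is a hypothetical helper and assumes one `>` header line per sequence.

```python
import gzip

# Sequence counts taken from the "Available Files" table.
EXPECTED_COUNTS = {
    "gg_2024_09_training.fna.gz": 277_336,
    "gg_2024_09_testing.fna.gz": 69_335,
}


def count_fasta_records(path: str) -> int:
    """Count sequences in a gzip-compressed FASTA file by counting '>' headers."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example, when run from deeptaxa-data/greengenes/:
# for name, expected in EXPECTED_COUNTS.items():
#     print(name, count_fasta_records(name) == expected)
```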

### Pre-Trained Models
Pre-trained models are available for immediate use and hosted on [Hugging Face 🤗](https://huggingface.co/systems-genomics-lab/deeptaxa).

- **Hybrid CNN-BERT Model**: [`deeptaxa_april_2025.pt`](https://huggingface.co/systems-genomics-lab/deeptaxa)
  - A hybrid CNN-BERT model trained on the Greengenes dataset, providing high-accuracy predictions across all taxonomic levels.
  - Includes a `config.json` file with model metadata.
  - **License**: MIT

#### Download Instructions
```bash
# Run from working_directory/ (the parent of the cloned deeptaxa/ folder)
mkdir -p deeptaxa-data/models
cd deeptaxa-data/models
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json
```

> **Note**: The `deeptaxa_april_2025.pt` file uses PyTorch’s default serialization with `pickle`. This may trigger a security warning on Hugging Face due to potential risks when loading untrusted files. Ensure you download it directly from the official repository and use it in a secure environment.
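
Because a pickled checkpoint can execute code when loaded, it can be worth recording a checksum of the file you downloaded so that later copies can be compared against it. The sketch below computes a SHA-256 digest with the standard library. This README does not publish a reference checksum, so the value is only meaningful when compared with a hash you recorded earlier or obtained from a source you trust; `sha256_of_file` is a hypothetical helper.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example, when run from deeptaxa-data/models/:
# print(sha256_of_file("deeptaxa_april_2025.pt"))
```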

---

## Usage

DeepTaxa offers a versatile command-line interface (`deeptaxa.cli`) for training, checkpoint inspection, and prediction tasks. All commands should be run from the `deeptaxa/` directory after installation. Outputs such as model checkpoints, predictions, and metrics should be stored in a dedicated `deeptaxa-outputs` directory outside the codebase. Replace file paths in the examples below with your local data and output locations.

### Quick Start
To train and predict with DeepTaxa:
1. Install DeepTaxa from source (see [Installation](#installation)).
2. Download the pre-trained hybrid model into `../deeptaxa-data/models` (run this and the following steps from `deeptaxa/`):
```bash
mkdir -p ../deeptaxa-data/models
wget -P ../deeptaxa-data/models https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
```
> **Note**: If you only want to perform predictions with the pre-trained model, you do not need to download the Greengenes dataset files. The dataset is required only for training a new model.
3. Create the outputs directory:
```bash
# Creates the external outputs folder if it doesn't exist
mkdir -p ../deeptaxa-outputs
```
4. Train a hybrid CNN-BERT model (optional, if you want to train your own; requires the Greengenes dataset):
   - Download the Greengenes dataset (see [Data](#data-and-pre-trained-models)).
   - Run the training command:
```bash
# Train the hybrid CNN-BERT architecture on the training sequences and labels;
# the trained model checkpoint is saved to --output-dir.
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/
```
5. Predict using the pre-trained model (the example uses the Greengenes test sequences; replace the FASTA path with your own file):
```bash
# Classify the input sequences with the pre-trained checkpoint;
# prediction results are saved to --output-dir.
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions
```

### Training a Model
Train a new model using the Greengenes dataset:
```bash
# --model-type accepts cnn, bert, or hybrid; --device accepts cuda (GPU) or cpu.
# The trained model is saved to --output-dir.
deeptaxa train \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
  --model-type hybrid \
  --output-dir ../deeptaxa-outputs/ \
  --epochs 10 \
  --batch-size 16 \
  --learning-rate 1e-4 \
  --device cuda
```
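
When running several training configurations, it can be convenient to assemble the command in Python rather than editing the shell line by hand. The sketch below builds the argument list using only the options shown above; `build_train_command` is a hypothetical helper, not part of DeepTaxa, and the file names in the example are placeholders.

```python
import shlex

MODEL_TYPES = {"cnn", "bert", "hybrid"}


def build_train_command(
    fasta_file: str,
    taxonomy_file: str,
    output_dir: str,
    model_type: str = "hybrid",
    epochs: int = 10,
    batch_size: int = 16,
    learning_rate: float = 1e-4,
    device: str = "cuda",
) -> list[str]:
    """Return the `deeptaxa train` invocation as an argument list."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(MODEL_TYPES)}")
    return [
        "deeptaxa", "train",
        "--fasta-file", fasta_file,
        "--taxonomy-file", taxonomy_file,
        "--model-type", model_type,
        "--output-dir", output_dir,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--learning-rate", str(learning_rate),
        "--device", device,
    ]


command = build_train_command(
    "train.fna.gz", "train.tsv.gz", "outputs/", model_type="cnn", device="cpu"
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting issues with paths that contain spaces.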

### Inspecting a Checkpoint
Examine a pre-trained model’s metadata and performance metrics:
```bash
# Describe a model checkpoint and export its metadata and metrics as JSON
deeptaxa describe \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --export-metrics ../deeptaxa-outputs/metrics/model_description.json
```

### Making Predictions
Classify sequences from a FASTA file using a trained model:
```bash
# --top-k: number of top predictions per level; --tabular: exports results in TSV format.
# Prediction outputs are saved to --output-dir.
deeptaxa predict \
  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
  --checkpoint ../deeptaxa-data/models/deeptaxa_april_2025.pt \
  --output-dir ../deeptaxa-outputs/predictions \
  --top-k 3 \
  --tabular
```
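
To illustrate what "top predictions per level" means, the sketch below ranks candidate labels at one taxonomic level by a softmax over raw scores and keeps the best `k`. This is a generic illustration in plain Python. Whether DeepTaxa derives its confidence scores this way is an assumption, and the labels and scores are made up.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(labels: list[str], logits: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(
        zip(labels, softmax(logits)), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]


# Made-up genus-level scores for a single sequence.
genera = ["Escherichia", "Salmonella", "Klebsiella", "Bacillus"]
scores = [4.0, 2.5, 1.0, -1.0]
for label, probability in top_k(genera, scores, k=3):
    print(f"{label}: {probability:.3f}")
```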

#### Output Files
- `../deeptaxa-outputs/predictions/predictions.json`: Detailed predictions with confidence scores and uncertainty metrics.
- `../deeptaxa-outputs/predictions/predictions.tsv`: Tabular format for downstream analysis.

> **Tip**: Use `--help` with any CLI command (e.g., `python -m deeptaxa.cli train --help`) for a full list of options. Ensure the `deeptaxa-outputs` directory exists (e.g., `mkdir -p ../deeptaxa-outputs`) before running commands.
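
For a first look at `predictions.tsv`, the sketch below prints the header and the first few rows with the standard `csv` module. The column names are not documented in this README, so the code deliberately makes no assumptions about them; `preview_tsv` is a hypothetical helper.

```python
import csv


def preview_tsv(path: str, max_rows: int = 5) -> tuple[list[str], list[list[str]]]:
    """Return the header and up to max_rows data rows of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


# Example, when run from deeptaxa/:
# header, rows = preview_tsv("../deeptaxa-outputs/predictions/predictions.tsv")
# print(header)
# for row in rows:
#     print(row)
```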

### Demo

For a full demo of working with DeepTaxa, see the following notebooks:

- [`deeptaxa_prediction.ipynb`](notebooks/deeptaxa_prediction.ipynb): making a taxonomic classification using the pre-trained model `deeptaxa_april_2025.pt`
- [`deeptaxa_workflow.ipynb`](notebooks/deeptaxa_workflow.ipynb): training a fresh model, resuming training on an existing model, and making predictions

---

## License

- **Code & Models**: [MIT License](LICENSE)
- **Greengenes Dataset**: Modified BSD License

The Greengenes dataset used in DeepTaxa is a modified version of the Greengenes2 dataset, distributed under the terms of the Modified BSD License. For full license details, see the [dataset repository on Hugging Face](https://huggingface.co/datasets/systems-genomics-lab/greengenes).

---

## Citation

If DeepTaxa contributes to your research, please cite:
```bibtex
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```

For the Greengenes dataset, cite:
DeSantis TZ, et al. (2006). *Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB*. Applied and Environmental Microbiology. [DOI:10.1128/AEM.03006-05](https://pubmed.ncbi.nlm.nih.gov/16820507/).

---

## Contact

To report bugs, suggest features, or submit code, please open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues).

---

## Acknowledgements

- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
- **[Ahmed A. El Hosseiny](https://github.com/ahmedelhosseiny)** and the High-Performance Computing Team of the [School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at the [American University in Cairo (AUC)](https://www.aucegypt.edu/) for their support and for granting access to GPU resources that enabled this work.
- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
---