An open API service indexing awesome lists of open source software.

https://github.com/instadeepai/nucleotide-transformer

๐Ÿงฌ Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
https://github.com/instadeepai/nucleotide-transformer

deep-learning dna genomics language-model nucleotide transformer

Last synced: 8 months ago
JSON representation

๐Ÿงฌ Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Awesome Lists containing this project

README

          


InstaDeep AI for Genomics Logo

AI Foundation Models for Genomics


A hub for InstaDeep's cutting-edge deep learning models and research for genomics, originating from the Nucleotide Transformer and its evolutions.


License: CC BY-NC-SA 4.0


Python 3.8

Jax 0.3.25+

Hugging Face Models

---

## ๐ŸŽฏ Our Focus: Advancing Genomics with AI

Welcome to the InstaDeep AI for Genomics repository! This is where we feature our collection of transformer-based genomic language models and innovative downstream applications. Our work in the genomics space began with **The Nucleotide Transformer**, developed in collaboration with Nvidia and TUM and trained on Cambridge-1, and has expanded to include projects like the **Agro Nucleotide Transformer** (in collaboration with Google, trained on TPU-v4 accelerators), **SegmentNT**, and **ChatNT**.

Our mission is to provide the scientific community with powerful, reproducible, and accessible tools to unlock new insights from biological sequences. This repository serves as the central place for sharing our models, inference code, pre-trained weights, and research contributions in the genomics domain, with explorations into future areas like single-cell transcriptomics.

We are thrilled to open-source these works and provide the community with access to the code and pre-trained weights for our diverse set of genomics language models and segmentation models.

## โœจ Featured Models & Research Evolutions

This section highlights the key models and research directions from our team. Each entry provides a brief overview and links to detailed documentation, publications, and resources. *(Detailed code examples, setup for specific models, and in-depth figures are now located in their respective documentation pages within the `./docs` folder.)*

---

### ๐Ÿงฌ The Nucleotide Transformer (NT)

Our foundational language models leverage DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide range of species. These models provide extremely accurate molecular phenotype prediction compared to existing methods. *This family includes multiple variants (e.g., 500M_human_ref, 2B5_1000G, NT-v2 series) which are detailed further in the specific documentation.*

* **Keywords:** Foundational Model, Genomics, DNA/RNA, Pre-trained, Sequence Embeddings, Phenotype Prediction
* โžก๏ธ **[Model Details, Variants & Usage](./docs/nucleotide_transformer.md)**
* ๐Ÿ“œ **[Read the Paper (Nature Methods 2025)](https://www.nature.com/articles/s41592-024-02523-z)**
* ๐Ÿค— **[Hugging Face Collection](https://huggingface.co/collections/InstaDeepAI/nucleotide-transformer-65099cdde13ff96230f2e592)**
* ๐Ÿš€ **Fine-tuning Notebooks (HF): ([LoRA](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) and [regular](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb))**

---

### ๐ŸŒพ Agro Nucleotide Transformer (AgroNT)

A novel foundational large language model trained on reference genomes from 48 plant species, with a predominant focus on crop species. AgroNT demonstrates state-of-the-art performance across several prediction tasks ranging from regulatory features, RNA processing, and gene expression in plants.

* **Keywords:** Plant Genomics, Foundational Model, Crop Science, Gene Expression, Agriculture AI
* โžก๏ธ **[Model Details & Usage](./docs/agro_nucleotide_transformer.md)**
* ๐Ÿ“œ **[Read the Paper (Communications Biology 2024)](https://www.nature.com/articles/s42003-024-06465-2)**
* ๐Ÿค— **[Hugging Face Collection](https://huggingface.co/collections/InstaDeepAI/agro-nucleotide-transformer-65b25c077cd0069ad6f6d344)**

---

### ๐Ÿงฉ SegmentNT (& family: SegmentEnformer, SegmentBorzoi)

Segmentation models using transformer backbones (Nucleotide Transformers, Enformer, Borzoi) for predicting genomic elements at single-nucleotide resolution. SegmentNT, for instance, predicts 14 different classes of human genomic elements in sequences up to 30kb (generalizing to 50kbp) and demonstrates superior performance.

* **Keywords:** Genome Segmentation, Single-Nucleotide Resolution, Genomic Elements, U-Net, Enformer, Borzoi
* โžก๏ธ **[Model Details & Usage](./docs/segment_nt.md)** (Covers SegmentNT, SegmentEnformer, SegmentBorzoi)
* ๐Ÿ“œ **[Read the Paper (bioRxiv preprint)](https://www.biorxiv.org/content/10.1101/2024.03.14.584712v1)**
* ๐Ÿค— **[Hugging Face Collection](https://huggingface.co/collections/InstaDeepAI/segmentnt-65eb4941c57808b4a3fe1319)**
* ๐Ÿš€ **[SegmentNT Inference Notebook (HF)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/InstaDeepAI/segment_nt/blob/main/inference_segment_nt.ipynb)**

---

### ๐Ÿ’ฌ ChatNT

A multimodal conversational agent designed with a deep understanding of DNA biological sequences, enabling interactive exploration and analysis of genomic data through natural language.

* **Keywords:** Conversational AI, Multimodal, DNA Analysis, Genomics Chatbot, Interactive Biology
* โžก๏ธ **[Model Details & Usage](./docs/chat_nt.md)**
* ๐Ÿ“œ **[Read the Paper (Nature Machine Intelligence 2025)](https://www.nature.com/articles/s42256-025-01047-1)**
* ๐Ÿค— **[ChatNT on Hugging Face](https://huggingface.co/InstaDeepAI/ChatNT)**
* ๐Ÿš€ **[ChatNT Inference Notebook (Jax)](./notebooks/chat_nt/inference.ipynb)**

---

### 3๏ธโƒฃ Codon-NT (Exploring 3-mer Tokenization)

A Nucleotide Transformer model variant trained on 3-mers (codons). This work investigates alternative tokenization strategies for genomic language models and their impact on downstream performance and interpretability.

* **Keywords:** Genomics, Language Model, Codon, Tokenization, 3-mers, Nucleotide Transformer Variant
* โžก๏ธ **[Model Details & Usage](./docs/codon_nt.md)**
* ๐Ÿ“œ **[Read the Paper (Bioinformatics 2024)](https://academic.oup.com/bioinformatics/article/40/9/btae529/7745814)**
* ๐Ÿค— **[Hugging Face Link](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-3mer-multi-species)**

---

### ๐Ÿงฌ Isoformer

A model designed for learning isoform-aware embeddings directly from RNA-seq data, enabling a deeper understanding of transcript-specific expression and regulation.

* **Keywords:** RNA-seq, Transcriptomics, Isoforms, Gene Expression, Embeddings
* โžก๏ธ **[Model Details & Usage](./docs/isoformer.md)**
* ๐Ÿ“œ **[Read the Paper (NeurIPS 2024)](https://papers.nips.cc/paper_files/paper/2024/file/8f6b3692297e49e5d5c91ba00281379c-Paper-Conference.pdf)**
* ๐Ÿค— **[Hugging Face Link](https://huggingface.co/InstaDeepAI/isoformer)**
* ๐Ÿš€ **[Isoformer Inference Notebook (HF)](./notebooks/isoformer/inference.ipynb)**

---

### ๐Ÿ”ฌ sCT (single-Cell Transformer)

Our foundational transformer model for single-cell and spatial transcriptomics data. sCT aims to learn rich representations from complex, high-dimensional single-cell datasets to improve various downstream analytical tasks.

* **Keywords:** Single-cell RNA-seq, Spatial Transcriptomics, Foundational Model, Transformer, Gene Expression
* โžก๏ธ **[Model Details & Usage](./docs/sct.md)**
* ๐Ÿ“œ **[Read the Paper (OpenReview preprint)](https://openreview.net/forum?id=VdX9tL3VXH)**
* ๐Ÿค— **[Hugging Face Link](https://huggingface.co/InstaDeepAI/sCellTransformer)**
* ๐Ÿš€ **[sCT Inference Notebook (HF)](./notebooks/sct/inference_sCT_pytorch_example.ipynb)**

---

### ๐Ÿงช BulkRNABert

BulkRNABert is a transformer-based, encoder-only foundation model designed for bulk RNA-seq data. It learns biologically meaningful representations from large-scale transcriptomic profiles.

* **Keywords:** Bulk RNA-seq, Foundational Model, Transformer, Cancer prognosis
* โžก๏ธ **[Model Details & Usage](./docs/bulk_rna_bert.md)**
* ๐Ÿ“œ **[Read the Paper (Machine Learning for Health 2024)](https://proceedings.mlr.press/v259/gelard25a.html)**
* ๐Ÿค— **[Hugging Face Link](https://huggingface.co/InstaDeepAI/BulkRNABert)**
* ๐Ÿš€ **[BulkRNABert Inference Notebook (HF)](notebooks/bulk_rna_bert/inference_bulkrnabert_pytorch_example.ipynb)**

---

### ๐Ÿ”— MOJO (Multi-Omics JOint representation)

MOJO is a multimodal model designed learn embeddings of multi-omics data. It integrates bulk RNA-seq and DNA methylation data to generate powerful joint representations tailored for cancer-type classification and survival analysis.

* **Keywords:** Bulk RNA-seq, DNA Methylation, Foundational Model, Transformer, Multimodal, Cancer prognosis
* โžก๏ธ **[Model Details & Usage](./docs/mojo.md)**
* ๐Ÿ“œ **[Read the Paper (ICML Workshop on Generative AI and Biology 2025)](https://www.biorxiv.org/content/10.1101/2025.06.25.661237v1)**
* ๐Ÿค— **[Hugging Face Link](https://huggingface.co/InstaDeepAI/MOJO)**
* ๐Ÿš€ **[MOJO Inference Notebook (HF)](./notebooks/mojo/inference_mojo_pytorch_example.ipynb)**

---

## ๐Ÿ’ก Why Choose InstaDeep's Genomic Models?

* **Built on Strong Foundations:** Leveraging large-scale pre-training and diverse genomic datasets.
* **Cutting-Edge Research:** Incorporating the latest advancements in deep learning for biological sequence analysis.
* **High Performance:** Designed and validated to achieve state-of-the-art results on challenging genomic tasks.
* **Open and Accessible:** We provide pre-trained weights, usage examples, and aim for easy integration into research workflows.
* **Collaborative Spirit:** Developed with leading academic and industry partners.
* **Focused Expertise:** Created by a dedicated team specializing in AI for genomics at InstaDeep.

## ๐Ÿš€ Getting Started

To begin using models from this repository:

1. **Clone the repository:**
```bash
git clone https://github.com/instadeepai/nucleotide-transformer.git
cd nucleotide-transformer
```
2. **Set up your environment (virtual environment recommended):**
```bash
python -m venv .venv
source .venv/bin/activate # On Windows use `source .venv\Scripts\activate`
```
3. **Install the package and dependencies:**
```bash
pip install . # Installs the local package
# Or, for a general requirements file if you have one:
# pip install -r requirements.txt
```

For detailed instructions on individual models, including specific dependencies, downloading pre-trained weights, and Python usage examples, please refer to their dedicated documentation pages linked in the "Featured Models & Research Evolutions" section above (e.g., `./docs/nucleotide_transformer.md`).

## ๐Ÿค Community & Support

* **Questions & Bug Reports:** Please use the [GitHub Issues](https://github.com/instadeepai/nucleotide-transformer/issues) page.
* **Discussions:** For broader discussions or questions, please use the [GitHub Discussions](https://github.com/instadeepai/nucleotide-transformer/discussions) tab (if enabled).
* **Stay Updated:** Follow InstaDeep's official channels for announcements on new model releases and research updates.