https://github.com/manchesterbioinference/mrna_llm
Using LLMs to understand mRNA features
- Host: GitHub
- URL: https://github.com/manchesterbioinference/mrna_llm
- Owner: ManchesterBioinference
- Created: 2025-05-19T09:08:11.000Z (about 1 year ago)
- Default Branch: translationEfficiency
- Last Pushed: 2026-05-29T13:59:52.000Z (15 days ago)
- Last Synced: 2026-05-29T15:15:57.078Z (14 days ago)
- Language: HTML
- Size: 103 MB
- Stars: 0
- Watchers: 15
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# mRNA Ribosome Density Prediction
This project predicts mRNA ribosome density (also known as translation efficiency, TE) and decay rates in *Drosophila melanogaster* using a fine-tuned RNA language model. It combines sequence information from the 5' and 3' UTRs with additional features such as codon usage and RNA secondary-structure stability.
## Publication
- **Preprint**: A preprint describing this work is available on bioRxiv: [10.64898/2025.12.04.692303v1](https://www.biorxiv.org/content/10.64898/2025.12.04.692303v1)
## Project Overview
- **Goal**: Predict ribosome density and mRNA decay from sequence data.
- **Model**: GenaLM Fly (BERT-based), with pretraining extended on 5' and 3' UTR pairs and then fine-tuned with a regression head.
- **Features**:
  - 5' & 3' UTR sequences
  - Codon usage metrics
  - Minimum Free Energy (MFE) from RNA folding (LinearFold)
  - GC content and sequence length
- **Pipeline**: Managed by DVC for reproducibility, covering data download, preprocessing, feature extraction, and model training.
- **Decay analysis**: Detailed decay-rate analyses, including notebooks and results, are kept on the `decay` branch of this repository.
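
To illustrate the simpler sequence-derived features listed above (GC content, sequence length, and codon usage), here is a minimal Python sketch. The function names and the output keys are hypothetical and are not taken from `scripts/`; the repository's actual feature extraction may differ. MFE is left out because it requires running LinearFold.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a sequence; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each complete codon, reading from position 0.

    Assumption: the coding sequence is already in frame; a trailing
    partial codon is ignored, and U is treated as T.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def sequence_features(utr5: str, cds: str, utr3: str) -> dict:
    """Collect per-transcript features (hypothetical key names)."""
    return {
        "utr5_length": len(utr5),
        "utr3_length": len(utr3),
        "utr5_gc": gc_content(utr5),
        "utr3_gc": gc_content(utr3),
        "codon_freq": codon_frequencies(cds),
    }


# Toy sequences, for illustration only (not real Drosophila transcripts).
features = sequence_features(utr5="ACGU", cds="AUGGCCGCC", utr3="GGCA")
```

Relative codon frequency is only one possible codon usage metric; the source does not say which metrics the pipeline computes.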
## Installation Requirements
Apptainer, conda, and DVC must be installed on your system and available on your `PATH`.
- [Apptainer installation guide](https://apptainer.org/docs/user/latest/quick_start.html#installation)
- [Conda installation guide](https://www.anaconda.com/docs/getting-started/miniconda/install)
- [DVC installation guide](https://dvc.org/doc/install)
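
Before running the pipeline, it can help to confirm that all three tools are discoverable. The sketch below is not part of the repository. It assumes the executables are named `apptainer`, `conda`, and `dvc`, and uses the standard-library `shutil.which` to look them up on `PATH`.

```python
import shutil

# Assumed executable names for the three prerequisites.
REQUIRED_TOOLS = ("apptainer", "conda", "dvc")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def prerequisites_report(tools=REQUIRED_TOOLS):
    """Build a one-line, human-readable summary of the PATH check."""
    missing = missing_tools(tools)
    if missing:
        return "Not found on PATH: " + ", ".join(missing)
    return "All required tools found on PATH."
```

Calling `print(prerequisites_report())` in the shell session you plan to use for `dvc repro` shows whether anything still needs installing or activating.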
## Usage
The DVC pipeline builds the required conda environment from the provided `environment.yaml`.
To reproduce the pipeline, run the following command:
``` {bash}
dvc repro
```
## Repository Structure
- `dvc.yaml`: Pipeline definition.
- `params.yaml`: Configuration parameters.
- `scripts/`: Source code for data processing and training.
- `notebooks/`: Exploratory analysis and visualization.