https://github.com/man4ish/bioinfo-lora-finetuning
Fine-tune a lightweight LLM (TinyLlama-1.1B) for bioinformatics instruction-response tasks using LoRA. Compare model performance before and after fine-tuning on domain-specific prompts. Fully compatible with Apple M1/M4 (Metal GPU) and demonstrates end-to-end fine-tuning, inference, and adapter merging.
bioinformatics deep-learning fine-tuning instruction-tuning llm lora machine-learning natural-language-processing python tinyllama transformers
- Host: GitHub
- URL: https://github.com/man4ish/bioinfo-lora-finetuning
- Owner: man4ish
- Created: 2025-11-01T02:24:57.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-11-01T03:36:11.000Z (8 months ago)
- Last Synced: 2025-11-01T04:18:44.449Z (8 months ago)
- Topics: bioinformatics, deep-learning, fine-tuning, instruction-tuning, llm, lora, machine-learning, natural-language-processing, python, tinyllama, transformers
- Language: Python
- Homepage:
- Size: 11.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Bioinformatics LLM Fine-Tuning with LoRA
This repository demonstrates fine-tuning a lightweight LLM (TinyLlama-1.1B) on bioinformatics instruction-response prompts using Low-Rank Adaptation (LoRA). The goal is to adapt a general instruction-tuned model to answer domain-specific questions and to compare its outputs before and after fine-tuning.
---
## Objective
- Fine-tune TinyLlama for bioinformatics instructions
- Run locally on Apple Silicon (M4 Max, MPS backend)
- Compare outputs before and after LoRA fine-tuning
---
## Repository Structure
```
bioinfo-lora-finetuning/
├── data/
│   └── bioinfo_train.jsonl          # Instruction-response dataset
├── src/
│   ├── lora_train.py                # LoRA fine-tuning script
│   ├── lora_infer_before.py         # Baseline inference
│   ├── lora_infer_after.py          # Inference using LoRA adapter
│   └── merge_lora.py                # Merge adapter into base model
├── results/
│   ├── sample_outputs_before.txt
│   ├── sample_outputs_after.txt
│   └── merged-model/
├── requirements.txt
└── README.md
```
---
## Environment Setup
1. Clone the repository and create a virtual environment:
```bash
git clone https://github.com/man4ish/bioinfo-lora-finetuning
cd bioinfo-lora-finetuning
python -m venv venv
source venv/bin/activate
```
2. Install dependencies:
```bash
pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
```
Note: On Apple Silicon Macs (e.g. M4), if `sentencepiece` fails to build from source, tell pip to prefer a prebuilt wheel:
```bash
pip install sentencepiece --prefer-binary
```
---
## Dataset
* `data/bioinfo_train.jsonl` contains bioinformatics instruction-response pairs in JSON Lines format:
```json
{"instruction": "Explain what a FASTQ file is.", "output": "A FASTQ file stores sequencing reads with quality scores..."}
{"instruction": "What is SNP annotation?", "output": "SNP annotation links single-nucleotide polymorphisms to genes and predicts functional impacts."}
```
* You can expand this dataset to hundreds of examples for improved results.
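
Before training on an expanded dataset, it can help to check that every line parses and carries both fields. The following is a minimal sketch using only the standard library; the helper names `load_pairs` and `format_example` and the `### Instruction` / `### Response` template are assumptions for illustration, and `src/lora_train.py` may format prompts differently.

```python
import json


def load_pairs(lines):
    """Parse JSON Lines text into validated instruction/output records."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for key in ("instruction", "output"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"line {line_no}: missing or empty '{key}'")
        records.append(record)
    return records


def format_example(record):
    # Hypothetical prompt template; the training script may use another one.
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


# In practice, pass an open file handle instead of this in-memory sample:
#   with open("data/bioinfo_train.jsonl") as f:
#       records = load_pairs(f)
sample = [
    '{"instruction": "What is SNP annotation?", "output": "SNP annotation '
    'links single-nucleotide polymorphisms to genes and predicts functional impacts."}',
]
records = load_pairs(sample)
print(format_example(records[0]))
```

Failing early with the offending line number is usually cheaper than discovering a malformed record partway through a training run.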
---
## Fine-Tuning
```bash
python src/lora_train.py \
  --dataset_path data/bioinfo_train.jsonl \
  --epochs 3 \
  --batch_size 2 \
  --gradient_accumulation 4 \
  --lr 2e-4 \
  --output_dir results/lora-adapter
```
* Saves the LoRA adapter to `./results/lora-adapter`
* Uses the MPS (Metal) GPU backend when it is available
* Training logs should show the loss decreasing over steps
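
To get a feel for what the flags above imply, the sketch below works out the effective batch size and the approximate number of optimizer steps. It assumes that `--gradient_accumulation` is the number of micro-batches accumulated per optimizer step, and the dataset size of 100 examples is hypothetical. Frameworks differ in how they treat a trailing partial accumulation group, so treat the step counts as estimates.

```python
import math


def training_steps(num_examples, batch_size=2, gradient_accumulation=4, epochs=3):
    """Estimate optimizer steps for a run with gradient accumulation."""
    # Each optimizer step sees batch_size * gradient_accumulation examples.
    effective_batch = batch_size * gradient_accumulation
    # Number of forward/backward passes (micro-batches) per epoch.
    micro_batches = math.ceil(num_examples / batch_size)
    # Assumption: a trailing partial group still triggers an optimizer step.
    steps_per_epoch = math.ceil(micro_batches / gradient_accumulation)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# Hypothetical dataset of 100 instruction-response pairs.
print(training_steps(100))
# {'effective_batch': 8, 'steps_per_epoch': 13, 'total_steps': 39}
```

With the defaults shown in the command, each weight update averages gradients over 8 examples while only 2 need to fit in memory at once.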
---
## Inference
### Baseline (Before Fine-Tuning)
```bash
python src/lora_infer_before.py
```
### Fine-Tuned Model (After LoRA)
```bash
python src/lora_infer_after.py
```
Example comparison:
| Prompt | Baseline | LoRA Fine-Tuned |
| -------------- | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| Explain FASTQ | “A text file containing sequences.” | “A FASTQ file stores sequencing reads with quality scores, used in genome sequencing and bioinformatics pipelines.” |
| SNP annotation | “A process in genomics.” | “SNP annotation links single-nucleotide polymorphisms to genes and predicts functional impact.” |
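
A minimal sketch of how the two inference scripts might differ, assuming the Hugging Face `transformers` and `peft` APIs. The base model ID `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, the function name `generate_answer`, and its parameters are assumptions for illustration; the repository's scripts may be organized differently.

```python
def generate_answer(
    prompt,
    adapter_dir=None,
    base_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed base checkpoint
    max_new_tokens=128,
):
    """Generate a completion from the base model, optionally with a LoRA adapter."""
    # Imported lazily so that defining this function does not require
    # the heavy dependencies to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)

    if adapter_dir is not None:
        # Wrap the base model with the trained LoRA weights.
        from peft import PeftModel

        model = PeftModel.from_pretrained(model, adapter_dir)

    model = model.to(device).eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Example (downloads the base model on first use):
# print(generate_answer("Explain what a FASTQ file is."))
# print(generate_answer("Explain what a FASTQ file is.",
#                       adapter_dir="results/lora-adapter"))
```

Greedy decoding (`do_sample=False`) keeps the before/after comparison repeatable, since any difference in output then comes from the adapter rather than from sampling noise.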
---
## Merge LoRA Adapter
To create a standalone fine-tuned model:
```bash
python src/merge_lora.py
```
* Output: `./results/merged-model/`
* Load without PEFT adapters:
```python
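import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Use the Metal (MPS) backend when present, otherwise fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("./results/merged-model").to(device)
tokenizer = AutoTokenizer.from_pretrained("./results/merged-model")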
```
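
For orientation, the merge step can be sketched with PEFT's `merge_and_unload`, which folds the low-rank update into the base weights and returns a plain `transformers` model. This is a sketch under assumptions, not the contents of `src/merge_lora.py`: the function name `merge_adapter`, its parameters, and the base model ID `TinyLlama/TinyLlama-1.1B-Chat-v1.0` are hypothetical.

```python
def merge_adapter(
    adapter_dir,
    output_dir,
    base_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed base checkpoint
):
    """Fold a LoRA adapter into its base model and save a standalone copy."""
    # Imported lazily so that defining this function does not require
    # the heavy dependencies to be installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Adds the scaled low-rank update to each adapted weight matrix and
    # strips the adapter wrappers, leaving an ordinary model.
    merged = model.merge_and_unload()

    merged.save_pretrained(output_dir)
    # Save the tokenizer alongside so the directory loads on its own.
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
    return output_dir


# Example (paths taken from this README):
# merge_adapter("results/lora-adapter", "results/merged-model")
```

Merging trades flexibility for simplicity: the result no longer needs `peft` at load time, but the adapter can no longer be swapped out or disabled.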
---
## Results
* Training time: ~1 min per epoch on an M4 Max
* Training loss: decreased from ~10 to ~0.3
* Output improvement: answers become domain-specific and use bioinformatics terminology
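
One rough way to put a number on the last point is to count how many domain terms each answer contains. The sketch below does this for the FASTQ example from the comparison table; the keyword list `DOMAIN_TERMS` and the helper `domain_term_hits` are hypothetical, and a keyword count is only a crude proxy for answer quality.

```python
import re

# Hypothetical keyword list; extend it to match the topics in your dataset.
DOMAIN_TERMS = {
    "fastq", "sequencing", "reads", "quality",
    "genome", "snp", "polymorphisms", "genes",
}


def domain_term_hits(text, terms=DOMAIN_TERMS):
    """Return the sorted domain terms that appear as whole words in text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(words & set(terms))


baseline = "A text file containing sequences."
finetuned = (
    "A FASTQ file stores sequencing reads with quality scores, "
    "used in genome sequencing and bioinformatics pipelines."
)

print(domain_term_hits(baseline))   # []
print(domain_term_hits(finetuned))  # ['fastq', 'genome', 'quality', 'reads', 'sequencing']
```

The same function could be run over `results/sample_outputs_before.txt` and `results/sample_outputs_after.txt` to compare the two sets of answers in aggregate.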