https://github.com/varunramagiri/llm-finetuning-toolkit

# 🧬 LLM Fine-Tuning Toolkit

> **Parameter-efficient fine-tuning (PEFT) for GPT, BERT, T5, LLaMA & Mistral**: LoRA · QLoRA · MLflow · SageMaker · Vertex AI · One-command deployment

[![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=flat-square&logo=python)](https://python.org)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.2+-EE4C2C?style=flat-square&logo=pytorch)](https://pytorch.org)
[![HuggingFace](https://img.shields.io/badge/🤗%20Transformers-4.40+-FFD21E?style=flat-square)](https://huggingface.co/transformers)
[![MLflow](https://img.shields.io/badge/MLflow-2.12+-0194E2?style=flat-square)](https://mlflow.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow?style=flat-square)](LICENSE)

---

## 🎯 Overview

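Production-ready toolkit for fine-tuning large language models on domain-specific enterprise datasets using **LoRA (Low-Rank Adaptation)** and **QLoRA (4-bit quantized LoRA)**. For a 7B model, QLoRA brings GPU memory from roughly 56 GB for full fine-tuning down to roughly 8 GB while training only about 0.06% of the parameters (see the GPU memory table below). The goal is to do this without sacrificing model quality.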

Built from fine-tuning work across **insurance policy summarization** (Nationwide), **clinical NLP** (CVS Health), and **document classification** tasks. Includes training, evaluation, experiment tracking, and cloud deployment, all scriptable via CLI or Python API.

**Supported tasks:** Summarization · Classification · QA · NER · Instruction following · Chat fine-tuning

**Supported models:** GPT-2 · BERT · RoBERTa · T5 · FLAN-T5 · LLaMA 2/3 · Mistral 7B · Falcon

---

## 📁 Folder Structure

```
llm-finetuning-toolkit/
├── src/
│   ├── training/
│   │   ├── trainer.py              # Core PEFT trainer class
│   │   ├── lora_config.py          # LoRA: rank, alpha, target modules
│   │   ├── qlora_config.py         # QLoRA: 4-bit NF4, double quant
│   │   └── callbacks.py            # Early stopping, checkpoint saving
│   ├── data/
│   │   ├── dataset_loader.py       # HuggingFace Hub + custom JSONL
│   │   ├── preprocessor.py         # Tokenization, padding, truncation
│   │   ├── augmentation.py         # Back-translation, synonym swap
│   │   └── quality_filter.py       # Dedupe, length filter, toxicity
│   ├── evaluation/
│   │   ├── metrics.py              # BLEU, ROUGE, F1, BERTScore
│   │   ├── bias_detector.py        # Fairness metrics across demographics
│   │   ├── safety_eval.py          # ToxiGen, BBQ safety benchmarks
│   │   └── benchmark.py            # Throughput, latency, memory profiling
│   ├── inference/
│   │   ├── predictor.py            # Single inference with adapter loading
│   │   ├── batch_predictor.py      # Async batch inference (vLLM)
│   │   └── quantizer.py            # Post-training quant (ONNX / TensorRT)
│   └── deployment/
│       ├── sagemaker_deploy.py     # AWS SageMaker real-time endpoint
│       ├── vertex_deploy.py        # GCP Vertex AI endpoint
│       ├── azure_deploy.py         # Azure ML managed endpoint
│       └── vllm_server.py          # Local vLLM OpenAI-compatible server
├── configs/
│   ├── lora_bert_classification.yaml
│   ├── qlora_llama2_7b_instruct.yaml
│   ├── lora_t5_summarization.yaml
│   └── qlora_mistral_7b_chat.yaml
├── scripts/
│   ├── train.py                    # CLI: python scripts/train.py --config ...
│   ├── evaluate.py                 # CLI: python scripts/evaluate.py ...
│   ├── merge_adapters.py           # Merge LoRA weights → base model
│   ├── push_to_hub.py              # Push fine-tuned model to HF Hub
│   └── export_onnx.py              # Export to ONNX for deployment
├── notebooks/
│   ├── 01_lora_finetuning_walkthrough.ipynb
│   ├── 02_qlora_llama2_on_single_gpu.ipynb
│   ├── 03_evaluation_and_bias_analysis.ipynb
│   └── 04_inference_optimization.ipynb
├── tests/
│   ├── test_trainer.py
│   ├── test_data_pipeline.py
│   └── test_inference.py
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── model_eval.yml          # Automated eval on new checkpoints
├── requirements.txt
├── pyproject.toml
└── README.md
```

---

## ⚡ Quick Start

### Python API: QLoRA fine-tuning on LLaMA 2 (single GPU, 8 GB VRAM)

```python
from src.training.trainer import PEFTTrainer
from src.training.qlora_config import QLoRAConfig

config = QLoRAConfig(
    base_model="meta-llama/Llama-2-7b-hf",
    dataset_path="data/insurance_policies.jsonl",
    task="text-generation",
    # LoRA adapter hyperparameters
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # 4-bit NF4 quantization of the frozen base model
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    # Training schedule
    max_seq_length=2048,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    # Outputs and experiment tracking
    output_dir="outputs/llama2-insurance-finetuned",
    mlflow_experiment="llama2-qlora-insurance-v1",
)

trainer = PEFTTrainer(config)
trainer.train()
trainer.evaluate()
trainer.save_adapter("outputs/llama2-insurance-adapter")
```
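The batch and LoRA settings in the config above interact in ways that are easy to misread, so here is a minimal sketch of the arithmetic. The helper names `training_schedule` and `lora_scaling` are hypothetical and not part of the toolkit, and the 10,000-example dataset size is made up for illustration.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Return the effective batch size and optimizer-step counts."""
    # Gradients are accumulated over several micro-batches before each update.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Rounding up counts a final partial batch as one step; trainers differ here.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


def lora_scaling(lora_alpha, lora_r):
    """LoRA multiplies the low-rank update by alpha / r."""
    return lora_alpha / lora_r


# Batch size, accumulation, epochs, alpha and r come from the Quick Start config.
schedule = training_schedule(10_000, per_device_batch_size=4,
                             grad_accum_steps=4, num_epochs=3)
print(schedule)
print(lora_scaling(lora_alpha=32, lora_r=16))
```

With these values each optimizer update sees 16 sequences, and the adapter update is scaled by 2.0. When `lora_r` changes, keeping the `lora_alpha / lora_r` ratio fixed is a common way to keep the update magnitude comparable.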

### CLI

```bash
# Train
python scripts/train.py --config configs/qlora_llama2_7b_instruct.yaml

# Evaluate on test set
python scripts/evaluate.py \
  --model-path outputs/llama2-insurance-finetuned \
  --dataset data/test.jsonl \
  --metrics rouge,bleu,bertscore

# Merge LoRA adapters into base model (for deployment)
python scripts/merge_adapters.py \
  --base meta-llama/Llama-2-7b-hf \
  --adapter outputs/llama2-insurance-adapter \
  --output outputs/llama2-insurance-merged

# Deploy to SageMaker
python src/deployment/sagemaker_deploy.py \
  --model-path outputs/llama2-insurance-merged \
  --instance-type ml.g5.xlarge \
  --endpoint-name llama2-insurance-prod
```
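The merge step folds the adapter into the base weights so the deployed model needs no adapter at inference time. The sketch below illustrates the standard LoRA merge rule, `W' = W + (alpha / r) * B @ A`, on a toy 2x2 layer in plain Python. It is not the code in `scripts/merge_adapters.py`; the function names and all numbers are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(weight, x):
    """Multiply a matrix by a vector given as a flat list."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy layer with a rank-1 adapter; every value here is made up.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, shape (out, in)
lora_a = [[0.5, -0.5]]         # shape (r, in)
lora_b = [[1.0], [2.0]]        # shape (out, r)
alpha, r = 2, 1
x = [3.0, 1.0]

merged = merge_lora(w, lora_a, lora_b, alpha, r)

# Unmerged path: base output plus the scaled adapter output.
adapter_out = matvec(lora_b, matvec(lora_a, x))
unmerged = [base + (alpha / r) * extra
            for base, extra in zip(matvec(w, x), adapter_out)]

print(merged)
print(matvec(merged, x), unmerged)
```

Both paths give the same output for the same input, which is why a merged checkpoint can be served like any ordinary model.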

---

## 📊 GPU Memory: LoRA vs QLoRA vs Full Fine-Tuning

| Technique | Model | GPU VRAM | Trainable Params |
|---|---|---|---|
| Full fine-tuning | LLaMA-7B | ~56 GB | 7B (100%) |
| LoRA (r=16) | LLaMA-7B | ~18 GB | ~4M (0.06%) |
| QLoRA 4-bit (r=16) | LLaMA-7B | **~8 GB** ✅ | ~4M (0.06%) |
| QLoRA 4-bit (r=16) | LLaMA-13B | **~12 GB** ✅ | ~6M (0.05%) |
| QLoRA 4-bit (r=64) | Mistral-7B | **~10 GB** ✅ | ~16M (0.23%) |

---

## 🔬 Experiment Tracking (MLflow)

All runs are auto-logged with the following metrics: `train/loss`, `eval/loss`, `eval/rouge1`, `eval/rouge2`, `eval/rougeL`, `eval/bleu`, `eval/bertscore`, `train/learning_rate`, GPU utilization, and peak memory.

```bash
mlflow ui --port 5000
```
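
To give a feel for what a metric such as `eval/rouge1` measures, here is a simplified sketch of ROUGE-1 F1 as clipped unigram overlap. It is an illustration only: the function name is hypothetical, it uses whitespace tokenization with no stemming, and it is not the scorer the toolkit uses, so its values will not match a full ROUGE implementation exactly.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example sentences.
reference = "the policy covers water damage"
candidate = "the policy covers fire damage"
print(round(rouge1_f1(reference, candidate), 2))
```

Four of the five tokens match in each direction, so precision and recall are both 0.8 and so is the F1 score.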

---

## 🚀 Deployment Options

| Target | Latency | Throughput | Cost |
|---|---|---|---|
| vLLM local server | ~50ms | High | GPU instance |
| AWS SageMaker real-time endpoint | ~120ms | Auto-scaled | Pay-per-use |
| GCP Vertex AI | ~130ms | Auto-scaled | Pay-per-use |
| ONNX + TensorRT | ~20ms | Very high | GPU instance |
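
For the local vLLM option, the server exposes an OpenAI-compatible HTTP API, so a client needs nothing beyond the standard library. The sketch below builds a request for the `/v1/completions` route. The base URL, the model name, and the helper names are assumptions for illustration; check how `vllm_server.py` is configured before relying on them.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             max_tokens=128):
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,          # must match the model name the server was started with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,      # deterministic output for repeatable checks
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=30):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Building the request does not contact the server; complete() does.
request = build_completion_request(
    "Summarize the coverage limits in this policy.",
    model="outputs/llama2-insurance-merged",  # assumed served model name
)
print(request.full_url)
```

Calling `complete(request)` against a running server returns the generated text. Timing that call over a batch of prompts is one way to check the latency figures in the table on your own hardware.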

---

## 📄 License

MIT. See [LICENSE](LICENSE).