https://github.com/varunramagiri/llm-finetuning-toolkit
LoRA/QLoRA fine-tuning for GPT, BERT, T5, LLaMA and Mistral: 8GB VRAM, MLflow tracking, one-command SageMaker and Vertex AI deployment
bert huggingface llama llm-finetuning lora machine-learning mlflow nlp peft python pytorch qlora transformers
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/varunramagiri/llm-finetuning-toolkit
- Owner: varunramagiri
- License: mit
- Created: 2026-05-23T14:39:59.000Z (25 days ago)
- Default Branch: main
- Last Pushed: 2026-05-23T14:42:48.000Z (25 days ago)
- Last Synced: 2026-05-23T16:25:05.283Z (25 days ago)
- Topics: bert, huggingface, llama, llm-finetuning, lora, machine-learning, mlflow, nlp, peft, python, pytorch, qlora, transformers
- Size: 9.77 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# LLM Fine-Tuning Toolkit
> **Parameter-efficient fine-tuning (PEFT) for GPT, BERT, T5, LLaMA & Mistral**: LoRA · QLoRA · MLflow · SageMaker · Vertex AI · One-command deployment
---
## Overview
A production-ready toolkit for fine-tuning large language models on domain-specific enterprise datasets using **LoRA (Low-Rank Adaptation)** and **QLoRA (4-bit quantized LoRA)**. Both cut GPU memory requirements dramatically (from roughly 56 GB for full fine-tuning of a 7B model to about 8 GB with QLoRA, per the GPU memory table below) without sacrificing model quality.
Built from fine-tuning work across **insurance policy summarization** (Nationwide), **clinical NLP** (CVS Health), and **document classification** tasks. It includes training, evaluation, experiment tracking, and cloud deployment, all scriptable via the CLI or the Python API.
**Supported tasks:** Summarization · Classification · QA · NER · Instruction following · Chat fine-tuning
**Supported models:** GPT-2 · BERT · RoBERTa · T5 · FLAN-T5 · LLaMA 2/3 · Mistral 7B · Falcon
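
For readers new to LoRA, the low-rank update named in the overview can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not the toolkit's own code: the helper names (`matvec`, `lora_forward`) and the tiny dimensions are assumptions chosen for readability. The frozen weight `W` stays untouched, and only the small matrices `A` and `B` would receive gradients.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank path through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 8, 2, 4  # toy sizes; real models use d in the thousands
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero init
x = [1.0] * d

y = lora_forward(W, A, B, x, alpha, r)
full_params = d * d          # parameters a full fine-tune would update
lora_params = r * d + d * r  # parameters LoRA updates instead
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the base model's output exactly; training then moves only `A` and `B`, which is where the memory savings come from.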
---
## Folder Structure
```
llm-finetuning-toolkit/
├── src/
│   ├── training/
│   │   ├── trainer.py              # Core PEFT trainer class
│   │   ├── lora_config.py          # LoRA: rank, alpha, target modules
│   │   ├── qlora_config.py         # QLoRA: 4-bit NF4, double quant
│   │   └── callbacks.py            # Early stopping, checkpoint saving
│   ├── data/
│   │   ├── dataset_loader.py       # HuggingFace Hub + custom JSONL
│   │   ├── preprocessor.py         # Tokenization, padding, truncation
│   │   ├── augmentation.py         # Back-translation, synonym swap
│   │   └── quality_filter.py       # Dedupe, length filter, toxicity
│   ├── evaluation/
│   │   ├── metrics.py              # BLEU, ROUGE, F1, BERTScore
│   │   ├── bias_detector.py        # Fairness metrics across demographics
│   │   ├── safety_eval.py          # ToxiGen, BBQ safety benchmarks
│   │   └── benchmark.py            # Throughput, latency, memory profiling
│   ├── inference/
│   │   ├── predictor.py            # Single inference with adapter loading
│   │   ├── batch_predictor.py      # Async batch inference (vLLM)
│   │   └── quantizer.py            # Post-training quant (ONNX / TensorRT)
│   └── deployment/
│       ├── sagemaker_deploy.py     # AWS SageMaker real-time endpoint
│       ├── vertex_deploy.py        # GCP Vertex AI endpoint
│       ├── azure_deploy.py         # Azure ML managed endpoint
│       └── vllm_server.py          # Local vLLM OpenAI-compatible server
├── configs/
│   ├── lora_bert_classification.yaml
│   ├── qlora_llama2_7b_instruct.yaml
│   ├── lora_t5_summarization.yaml
│   └── qlora_mistral_7b_chat.yaml
├── scripts/
│   ├── train.py                    # CLI: python scripts/train.py --config ...
│   ├── evaluate.py                 # CLI: python scripts/evaluate.py ...
│   ├── merge_adapters.py           # Merge LoRA weights → base model
│   ├── push_to_hub.py              # Push fine-tuned model to HF Hub
│   └── export_onnx.py              # Export to ONNX for deployment
├── notebooks/
│   ├── 01_lora_finetuning_walkthrough.ipynb
│   ├── 02_qlora_llama2_on_single_gpu.ipynb
│   ├── 03_evaluation_and_bias_analysis.ipynb
│   └── 04_inference_optimization.ipynb
├── tests/
│   ├── test_trainer.py
│   ├── test_data_pipeline.py
│   └── test_inference.py
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── model_eval.yml          # Automated eval on new checkpoints
├── requirements.txt
├── pyproject.toml
└── README.md
```
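
The tree lists `quality_filter.py` as handling deduplication and length filtering of JSONL data. A minimal sketch of what such a filter might look like is shown here, using only the standard library. The function name `filter_records`, the `text` field, and the thresholds are all hypothetical, and toxicity filtering is left out because it would need an external classifier.

```python
import json

def filter_records(lines, min_chars=20, max_chars=4000, text_key="text"):
    """Yield parsed JSONL records that pass length and duplicate checks.

    Hypothetical helper: the field name and thresholds are assumptions.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = str(record.get(text_key, "")).strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # too short or too long
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # duplicate after normalisation
        seen.add(key)
        yield record

sample = [
    '{"text": "Policy covers water damage from burst pipes."}',
    '{"text": "policy covers  water damage from burst pipes."}',
    '{"text": "Too short"}',
    '{"text": "Claims must be filed within thirty days of the incident."}',
]
kept = list(filter_records(sample))
print(len(kept))
```

Here the second record is dropped as a near-verbatim duplicate and the third for being under the minimum length. Exact-match deduplication after normalisation is the simplest option; fuzzy methods such as MinHash catch more but cost more.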
---
## Quick Start
### Python API: QLoRA fine-tuning on LLaMA 2 (single GPU, 8 GB VRAM)
```python
# Run from the repository root so the `src` package is importable.
from src.training.trainer import PEFTTrainer
from src.training.qlora_config import QLoRAConfig

config = QLoRAConfig(
    base_model="meta-llama/Llama-2-7b-hf",
    dataset_path="data/insurance_policies.jsonl",
    task="text-generation",
    # LoRA adapter hyperparameters
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # 4-bit NF4 quantization of the frozen base model
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    # Training schedule (4 x 4 = 16 sequences per optimizer step)
    max_seq_length=2048,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    # Outputs and experiment tracking
    output_dir="outputs/llama2-insurance-finetuned",
    mlflow_experiment="llama2-qlora-insurance-v1",
)

trainer = PEFTTrainer(config)
trainer.train()
trainer.evaluate()
trainer.save_adapter("outputs/llama2-insurance-adapter")
```
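
A few quantities follow directly from the hyperparameters in the Quick Start config, and working them out can help when adjusting it for a different GPU. The sketch below only does arithmetic on those values. `num_devices` and `optimizer_steps` are hypothetical names, the `alpha / r` scaling is the standard LoRA convention (assumed, not confirmed, to be what the toolkit uses), and real trainers may round step counts slightly differently.

```python
import math

# Values copied from the Quick Start config
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_seq_length = 2048
lora_r, lora_alpha = 16, 32
num_devices = 1  # assumption: single GPU, as in the section heading

# Sequences contributing to each optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Upper bound on tokens per update (sequences are at most max_seq_length long)
max_tokens_per_step = effective_batch * max_seq_length
# Factor applied to the low-rank update under the usual alpha / r convention
lora_scale = lora_alpha / lora_r

def optimizer_steps(num_examples, epochs=3):
    """Optimizer updates for num_examples, rounding up the last partial batch."""
    return math.ceil(num_examples / effective_batch) * epochs

print(effective_batch, max_tokens_per_step, lora_scale)  # 16 32768 2.0
print(optimizer_steps(10_000))                           # 1875
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch at 16 and lowers peak activation memory, at the cost of more forward and backward passes per update.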
### CLI
```bash
# Train
python scripts/train.py --config configs/qlora_llama2_7b_instruct.yaml

# Evaluate on test set
python scripts/evaluate.py \
  --model-path outputs/llama2-insurance-finetuned \
  --dataset data/test.jsonl \
  --metrics rouge,bleu,bertscore

# Merge LoRA adapters into base model (for deployment)
python scripts/merge_adapters.py \
  --base meta-llama/Llama-2-7b-hf \
  --adapter outputs/llama2-insurance-adapter \
  --output outputs/llama2-insurance-merged

# Deploy to SageMaker
python src/deployment/sagemaker_deploy.py \
  --model-path outputs/llama2-insurance-merged \
  --instance-type ml.g5.xlarge \
  --endpoint-name llama2-insurance-prod
```
---
## GPU Memory: LoRA vs QLoRA vs Full Fine-Tuning
| Technique | Model | GPU VRAM | Trainable Params |
|---|---|---|---|
| Full fine-tuning | LLaMA-7B | ~56 GB | 7B (100%) |
| LoRA (r=16) | LLaMA-7B | ~18 GB | ~4M (0.06%) |
| QLoRA 4-bit (r=16) | LLaMA-7B | **~8 GB** ✅ | ~4M (0.06%) |
| QLoRA 4-bit (r=16) | LLaMA-13B | **~12 GB** ✅ | ~6M (0.05%) |
| QLoRA 4-bit (r=64) | Mistral-7B | **~10 GB** ✅ | ~16M (0.23%) |
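
The table does not say which modules the adapters target, but its trainable-parameter column can be reproduced under one assumption: LoRA wraps a single square `hidden x hidden` projection (for example the attention query projection) in every layer. The sketch below checks that arithmetic and also estimates the memory taken by the base weights alone. The hidden sizes, layer counts, and total parameter counts are the publicly documented values for these models; `lora_params` and `weight_gb` are hypothetical helper names.

```python
def lora_params(hidden_size, num_layers, rank, modules_per_layer=1):
    """Trainable parameters when LoRA wraps square hidden x hidden projections."""
    per_module = rank * hidden_size + hidden_size * rank  # A (r x d) plus B (d x r)
    return per_module * modules_per_layer * num_layers

def weight_gb(total_params, bits):
    """Memory for the base weights alone, in GB (10**9 bytes)."""
    return total_params * bits / 8 / 1e9

# name: (hidden size, layers, approximate total parameters, LoRA rank from the table)
models = {
    "LLaMA-7B": (4096, 32, 6.74e9, 16),
    "LLaMA-13B": (5120, 40, 13.0e9, 16),
    "Mistral-7B": (4096, 32, 7.24e9, 64),
}

for name, (hidden, layers, total, rank) in models.items():
    trainable = lora_params(hidden, layers, rank)
    print(f"{name}: {trainable:,} trainable ({trainable / total:.2%}), "
          f"4-bit weights ~{weight_gb(total, 4):.1f} GB, "
          f"16-bit weights ~{weight_gb(total, 16):.1f} GB")
```

Under this assumption the counts come out at about 4.2M, 6.6M, and 16.8M, matching the table. Targeting two square projections per layer would double them, so the figures should be read as specific to one adapter placement. The gap between the weight-only estimate and the VRAM column is taken up by activations, optimizer state, and runtime overhead, which this sketch does not model.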
---
## Experiment Tracking (MLflow)
All runs are auto-logged to MLflow with the following metrics: `train/loss`, `eval/loss`, `eval/rouge1`, `eval/rouge2`, `eval/rougeL`, `eval/bleu`, `eval/bertscore`, `train/learning_rate`, GPU utilization, and peak memory. To browse them, start the MLflow UI:
```bash
mlflow ui --port 5000
```
---
## Deployment Options
| Target | Latency | Throughput | Cost |
|---|---|---|---|
| vLLM local server | ~50ms | High | GPU instance |
| AWS SageMaker RT | ~120ms | Auto-scaled | Pay-per-use |
| GCP Vertex AI | ~130ms | Auto-scaled | Pay-per-use |
| ONNX + TensorRT | ~20ms | Very high | GPU instance |
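
The latency figures above depend on model size, prompt length, and hardware, none of which the table specifies, so it is worth measuring them for your own deployment. The following is a minimal, standard-library sketch of such a measurement. `benchmark` and `fake_predict` are hypothetical names, and `fake_predict` merely sleeps for about a millisecond in place of a real endpoint call, which you would swap in.

```python
import math
import statistics
import time

def benchmark(predict, payloads, warmup=2):
    """Time predict() per payload; return latency stats (ms) and throughput."""
    for payload in payloads[:warmup]:
        predict(payload)  # warm-up calls are not timed
    latencies_ms = []
    started = time.perf_counter()
    for payload in payloads:
        t0 = time.perf_counter()
        predict(payload)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed_s = time.perf_counter() - started
    latencies_ms.sort()
    p95_index = max(0, math.ceil(0.95 * len(latencies_ms)) - 1)  # nearest-rank p95
    return {
        "requests": len(latencies_ms),
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "throughput_rps": len(latencies_ms) / elapsed_s,
    }

def fake_predict(prompt):
    """Stand-in for a real endpoint call; sleeps instead of running a model."""
    time.sleep(0.001)
    return prompt.upper()

stats = benchmark(fake_predict, ["summarize this policy"] * 20)
print({key: round(value, 2) for key, value in stats.items()})
```

This measures sequential, single-client latency. Auto-scaled endpoints and batching servers such as vLLM usually need concurrent requests to show their real throughput.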
---
## License
MIT. See [LICENSE](LICENSE).