https://github.com/ntphuc149/viag

ViAG: A Novel Framework for Fine-tuning Answer Generation models utilizing Encoder-Decoder and Decoder-only Transformer architectures
- Host: GitHub
- URL: https://github.com/ntphuc149/viag
- Owner: ntphuc149
- License: MIT
- Created: 2025-05-03T14:15:27.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-05-26T03:51:35.000Z (5 months ago)
- Last Synced: 2025-06-06T02:01:43.219Z (4 months ago)
- Topics: answer-generation, bart, bartpho, bertscore, bleu-score, decoder-only, encoder-decoder, fine-tuning, instruction-tuning, llama, llm, meteor, plms, qlora, question-answering, qwen, rouge, t5, vit5
- Language: Python
- Homepage:
- Size: 110 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ViAG - Vietnamese Answer Generation
ViAG (Vietnamese Answer Generation) is a project for fine-tuning encoder-decoder and decoder-only models on Vietnamese question-answering tasks. It provides tools for training, evaluating, and deploying models that generate answers to questions in Vietnamese.
## Features
- Fine-tune pre-trained encoder-decoder models (like ViT5) for answer generation
- Support for local CSV datasets
- Comprehensive evaluation metrics (ROUGE, BLEU, METEOR, BERTScore)
- Command-line interface for training and evaluation
- Weights & Biases integration for experiment tracking
- Modular and extensible codebase

## Project Structure
```
ViAG/
├── configs/              # Configuration files
├── datasets/             # Data files
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── src/                  # Source code
│   ├── data/             # Data loading and preprocessing
│   ├── models/           # Model configuration and training
│   ├── evaluation/       # Evaluation metrics and utilities
│   └── utils/            # Helper functions and constants
├── scripts/              # Training and evaluation scripts
├── models/               # Directory for saved models
├── notebooks/            # Jupyter notebooks for exploration
├── outputs/              # Training outputs and logs
├── requirements.txt      # Project dependencies
└── README.md             # Project documentation
```

## Installation
1. Clone the repository:
```bash
git clone https://github.com/ntphuc149/ViAG.git
cd ViAG
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Install the Vietnamese spaCy model:
```bash
pip install https://gitlab.com/trungtv/vi_spacy/-/raw/master/packages/vi_core_news_lg-3.6.0/dist/vi_core_news_lg-3.6.0.tar.gz
```

4. Create a `.env` file with your API keys (optional):
```bash
HF_TOKEN=your_huggingface_token
WANDB_API_KEY=your_wandb_api_key
```

## Data Format
The expected data format is a CSV file with the following columns:
- `context`: The context passage
- `question`: The question to be answered
- `answer`: The target answer to be generated

## Usage
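Before training, it can help to confirm that a dataset file actually contains the three required columns. A minimal stdlib-only sketch (hypothetical helper, not part of the project):

```python
import csv

REQUIRED_COLUMNS = {"context", "question", "answer"}

def check_qa_csv(path: str) -> int:
    """Verify the CSV has context/question/answer columns.

    Returns the number of rows in which all three fields are non-empty.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Missing columns: {sorted(missing)}")
        return sum(1 for row in reader if all(row[c].strip() for c in REQUIRED_COLUMNS))
```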
### Training
Train a model using the command-line interface:
```bash
python scripts/train.py \
--train_data datasets/train.csv \
--val_data datasets/val.csv \
--test_data datasets/test.csv \
--model_name VietAI/vit5-base \
--output_dir outputs/experiment1 \
--num_epochs 5 \
--batch_size 2 \
--learning_rate 3e-5 \
--use_wandb
```

For more options, run:
```bash
python scripts/train.py --help
```

### Evaluation
Evaluate a trained model:
```bash
python scripts/run_evaluate.py \
--test_data datasets/test.csv \
--model_path outputs/experiment1 \
--output_dir outputs/evaluation1 \
--batch_size 1
```

For more options, run:
```bash
python scripts/run_evaluate.py --help
```

## Configuration
You can customize the training process using a JSON configuration file:
```json
{
"model": {
"name": "vinai/bartpho-syllable-base",
"max_input_length": 1024,
"max_target_length": 256
},
"training": {
"num_epochs": 5,
"learning_rate": 3e-5,
"batch_size": 2,
"gradient_accumulation_steps": 16
},
"data": {
"train_path": "datasets/train.csv",
"val_path": "datasets/val.csv",
"test_path": "datasets/test.csv"
}
}
```

Then use it with:
```bash
python scripts/train.py --config configs/my_config.json
```

## Metrics
The project uses the following metrics to evaluate answer quality:
- `ROUGE-1`, `ROUGE-2`, `ROUGE-L`, `ROUGE-L-SUM`: Measures n-gram overlap between generated and reference answers
- `BLEU-1`, `BLEU-2`, `BLEU-3`, `BLEU-4`: Measures precision of n-grams in generated answers
- `METEOR`: Measures unigram alignment between generated and reference answers
- `BERTScore`: Measures semantic similarity using BERT embeddings

## Models
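As a toy illustration of the unigram-overlap idea behind ROUGE-1 above (this is not the project's evaluation code, which uses standard metric libraries):

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Toy ROUGE-1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f1": f1}
```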
The project currently supports the following models:
- `VietAI/vit5-base`
- `VietAI/vit5-large`
- `vinai/bartpho-syllable`
- `vinai/bartpho-syllable-base`
- Other encoder-decoder models compatible with the Hugging Face Transformers library

## LLM Instruction Fine-tuning (New Feature)
ViAG now supports instruction fine-tuning for Large Language Models (LLMs) using the QLoRA technique. This allows you to fine-tune models such as Qwen, Llama, and Mistral on Vietnamese QA tasks with limited GPU memory.
### Features
- **QLoRA Integration**: 4-bit quantization with LoRA for memory-efficient training
- **Multiple Instruction Formats**: Support for ChatML, Alpaca, Vicuna, Llama, and custom templates
- **Flexible Workflow**: Separate training, inference, and evaluation phases for long-running jobs
- **Automatic Data Splitting**: Split a single dataset into train/val/test with customizable ratios

### Quick Start
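As background, the automatic 8:1:1 split mentioned above can be sketched like this (hypothetical helper; the actual script exposes `--train_ratio`, `--val_ratio`, and `--test_ratio`):

```python
import random

def split_dataset(rows, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle rows and split them into train/val/test.

    The test split receives whatever remains after train and val.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for reproducibility
    n_train = int(len(rows) * train_ratio)
    n_val = int(len(rows) * val_ratio)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]
```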
#### 1. Full Pipeline (Train + Infer + Eval)
```bash
python scripts/train_llm.py \
--do_train --do_infer --do_eval \
--data_path Truong-Phuc/ViBidLQA \
--model_name Qwen/Qwen2-0.5B \
--instruction_template chatml \
--output_dir outputs/qwen2-vibidlqa
```

#### 2. Separate Phases (for Kaggle/Colab sessions)
**Phase 1: Training (~11 hours)**
```bash
python scripts/train_llm.py \
--do_train \
--data_path data/train.csv \
--model_name Qwen/Qwen2-0.5B \
--num_epochs 10 \
--output_dir outputs/qwen2-checkpoint
```

**Phase 2: Inference (~1 hour)**
```bash
python scripts/train_llm.py \
--do_infer \
--test_data data/test.csv \
--checkpoint_path outputs/qwen2-checkpoint \
--predictions_file outputs/predictions.csv
```

**Phase 3: Evaluation (~10 minutes)**
```bash
python scripts/train_llm.py \
--do_eval \
--predictions_file outputs/predictions.csv \
--metrics_file outputs/metrics.json
```

### Instruction Templates
The framework supports multiple instruction formats:
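For instance, the default ChatML template (shown next) renders one QA pair as a single training string. A minimal sketch of that rendering (the project's actual formatting code may differ):

```python
def to_chatml(system_prompt: str, context: str, question: str, answer: str) -> str:
    """Render one QA example as a ChatML-formatted training string."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{context}\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n{answer}<|im_end|>"
    )
```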
#### ChatML (default)
```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{context}
{question}<|im_end|>
<|im_start|>assistant
{answer}<|im_end|>
```

#### Alpaca
```
Below is an instruction that describes a task...

### Instruction:
{question}

### Input:
{context}

### Response:
{answer}
```

#### Custom Template
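A custom template is essentially a format string with `{context}`, `{question}`, and `{answer}` placeholders; filling one could work like this (sketch, not the project's implementation):

```python
def fill_template(template: str, context: str, question: str, answer: str) -> str:
    """Substitute the three placeholders into a user-supplied template."""
    return template.format(context=context, question=question, answer=answer)
```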
```bash
python scripts/train_llm.py \
--instruction_template custom \
--custom_template "Context: {context}\nQuestion: {question}\nAnswer: {answer}"
```

### Configuration
You can use a JSON configuration file:
```bash
python scripts/train_llm.py --config configs/llm_config.json
```

Example configuration:
```json
{
"model": {
"name": "Qwen/Qwen2-0.5B",
"instruction_template": "chatml"
},
"lora": {
"r": 16,
"alpha": 32,
"dropout": 0.05
},
"training": {
"num_epochs": 5,
"batch_size": 1,
"gradient_accumulation_steps": 16
}
}
```

### Supported Models
- Qwen/Qwen2 series (0.5B, 1.5B, 7B)
- SeaLLMs/SeaLLMs-v3 series (1.5B, 7B)
- meta-llama/Llama-2 series
- mistralai/Mistral series
- Any other causal LM compatible with Transformers

### Advanced Options
```bash
python scripts/train_llm.py --help
```

Key parameters:
- `--train_ratio`, `--val_ratio`, `--test_ratio`: Data split ratios (default: 8:1:1)
- `--lora_r`: LoRA rank (default: 16)
- `--learning_rate`: Learning rate (default: 3e-5)
- `--max_new_tokens`: Max tokens to generate (default: 512)
- `--use_wandb`: Enable W&B logging

## License
This project is licensed under the MIT License - see the LICENSE file for details.