{"id":28378125,"url":"https://github.com/ntphuc149/viag","last_synced_at":"2025-06-26T21:31:39.929Z","repository":{"id":291273719,"uuid":"977140449","full_name":"ntphuc149/ViAG","owner":"ntphuc149","description":"ViAG: A Novel Framework for Fine-tuning Answer Generation models ultilizing Encoder-Decoder and Decoder-only Transformers's architecture","archived":false,"fork":false,"pushed_at":"2025-05-26T03:51:35.000Z","size":113,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-06T02:01:43.219Z","etag":null,"topics":["answer-generation","bart","bartpho","bertscore","bleu-score","decoder-only","encoder-decoder","fine-tuning","instruction-tuning","llama","llm","meteor","plms","qlora","question-answering","qwen","rouge","t5","vit5"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ntphuc149.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-03T14:15:27.000Z","updated_at":"2025-05-26T03:51:38.000Z","dependencies_parsed_at":"2025-05-03T15:28:31.968Z","dependency_job_id":"d18b963b-727f-4ff8-8b7f-d683adf2e77f","html_url":"https://github.com/ntphuc149/ViAG","commit_stats":null,"previous_names":["ntphuc149/viag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ntphuc149/ViAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViAG/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ntphuc149","download_url":"https://codeload.github.com/ntphuc149/ViAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViAG/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262145149,"owners_count":23265877,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["answer-generation","bart","bartpho","bertscore","bleu-score","decoder-only","encoder-decoder","fine-tuning","instruction-tuning","llama","llm","meteor","plms","qlora","question-answering","qwen","rouge","t5","vit5"],"created_at":"2025-05-30T02:00:27.318Z","updated_at":"2025-06-26T21:31:39.923Z","avatar_url":"https://github.com/ntphuc149.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ViAG - Vietnamese Answer Generation\r\n\r\nViAG (Vietnamese Answer Generation) is a project that fine-tunes encoder-decoder models on Vietnamese question-answering tasks. 
This project provides tools for training, evaluating, and deploying models that can generate answers to questions in Vietnamese.\r\n\r\n## Features\r\n\r\n- Fine-tune pre-trained encoder-decoder models (like ViT5) for answer generation\r\n- Support for local CSV datasets\r\n- Comprehensive evaluation metrics (ROUGE, BLEU, METEOR, BERTScore)\r\n- Command-line interface for training and evaluation\r\n- Weights \u0026 Biases integration for experiment tracking\r\n- Modular and extensible codebase\r\n\r\n## Project Structure\r\n\r\n```markdown\r\nViAG/\r\n├── configs/              # Configuration files\r\n├── datasets/             # Data files\r\n│   ├── train.csv\r\n│   ├── val.csv\r\n│   └── test.csv\r\n├── src/                  # Source code\r\n│   ├── data/             # Data loading and preprocessing\r\n│   ├── models/           # Model configuration and training\r\n│   ├── evaluation/       # Evaluation metrics and utilities\r\n│   └── utils/            # Helper functions and constants\r\n├── scripts/              # Training and evaluation scripts\r\n├── models/               # Directory for saved models\r\n├── notebooks/            # Jupyter notebooks for exploration\r\n├── outputs/              # Training outputs and logs\r\n├── requirements.txt      # Project dependencies\r\n└── README.md             # Project documentation\r\n```\r\n\r\n## Installation\r\n\r\n1. Clone the repository:\r\n\r\n```bash\r\ngit clone https://github.com/ntphuc149/ViAG.git\r\ncd ViAG\r\n```\r\n\r\n2. Install dependencies:\r\n\r\n```bash\r\npip install -r requirements.txt\r\n```\r\n\r\n3. Install the Vietnamese SpaCy model:\r\n\r\n```\r\npip install https://gitlab.com/trungtv/vi_spacy/-/raw/master/packages/vi_core_news_lg-3.6.0/dist/vi_core_news_lg-3.6.0.tar.gz\r\n```\r\n\r\n4. 
Create a `.env` file with your API keys (optional):\r\n\r\n```bash\r\nHF_TOKEN=your_huggingface_token\r\nWANDB_API_KEY=your_wandb_api_key\r\n```\r\n\r\n## Data Format\r\n\r\nThe expected data format is a CSV file with the following columns:\r\n\r\n- `context`: The context passage\r\n- `question`: The question to be answered\r\n- `answer`: The target generative answer\r\n\r\n## Usage\r\n\r\n### Training\r\n\r\nTrain a model using the command-line interface:\r\n\r\n```bash\r\npython scripts/train.py \\\r\n    --train_data datasets/train.csv \\\r\n    --val_data datasets/val.csv \\\r\n    --test_data datasets/test.csv \\\r\n    --model_name VietAI/vit5-base \\\r\n    --output_dir outputs/experiment1 \\\r\n    --num_epochs 5 \\\r\n    --batch_size 2 \\\r\n    --learning_rate 3e-5 \\\r\n    --use_wandb\r\n```\r\n\r\nFor more options, run:\r\n\r\n```bash\r\npython scripts/train.py --help\r\n```\r\n\r\n### Evaluation\r\n\r\nEvaluate a trained model:\r\n\r\n```bash\r\npython scripts/run_evaluate.py \\\r\n    --test_data datasets/test.csv \\\r\n    --model_path outputs/experiment1 \\\r\n    --output_dir outputs/evaluation1 \\\r\n    --batch_size 1\r\n```\r\n\r\nFor more options, run:\r\n\r\n```bash\r\npython scripts/run_evaluate.py --help\r\n```\r\n\r\n## Configuration\r\n\r\nYou can customize the training process using a JSON configuration file:\r\n\r\n```json\r\n{\r\n  \"model\": {\r\n    \"name\": \"vinai/bartpho-syllable-base\",\r\n    \"max_input_length\": 1024,\r\n    \"max_target_length\": 256\r\n  },\r\n  \"training\": {\r\n    \"num_epochs\": 5,\r\n    \"learning_rate\": 3e-5,\r\n    \"batch_size\": 2,\r\n    \"gradient_accumulation_steps\": 16\r\n  },\r\n  \"data\": {\r\n    \"train_path\": \"datasets/train.csv\",\r\n    \"val_path\": \"datasets/val.csv\",\r\n    \"test_path\": \"datasets/test.csv\"\r\n  }\r\n}\r\n```\r\n\r\nThen use it with:\r\n\r\n```bash\r\npython scripts/train.py --config configs/my_config.json\r\n```\r\n\r\n## Metrics\r\n\r\nThe project uses 
the following metrics to evaluate answer quality:\r\n\r\n- `ROUGE-1`, `ROUGE-2`, `ROUGE-L`, `ROUGE-L-SUM`: Measure n-gram overlap (ROUGE-1, ROUGE-2) and longest-common-subsequence overlap (ROUGE-L, ROUGE-L-SUM) between generated and reference answers\r\n- `BLEU-1`, `BLEU-2`, `BLEU-3`, `BLEU-4`: Measure n-gram precision of generated answers against reference answers\r\n- `METEOR`: Measures unigram alignment between generated and reference answers\r\n- `BERTScore`: Measures semantic similarity using BERT embeddings\r\n\r\n## Models\r\n\r\nThe project currently supports the following models:\r\n\r\n- `VietAI/vit5-base`\r\n- `VietAI/vit5-large`\r\n- `vinai/bartpho-syllable`\r\n- `vinai/bartpho-syllable-base`\r\n- Other encoder-decoder models compatible with the Hugging Face Transformers library\r\n\r\n## LLM Instruction Fine-tuning (New Feature)\r\n\r\nViAG now supports instruction fine-tuning for Large Language Models (LLMs) using the QLoRA technique. This allows you to fine-tune models like Qwen, Llama, and Mistral on Vietnamese QA tasks with limited GPU memory.\r\n\r\n### Features\r\n\r\n- **QLoRA Integration**: 4-bit quantization with LoRA for memory-efficient training\r\n- **Multiple Instruction Formats**: Support for ChatML, Alpaca, Vicuna, Llama, and custom templates\r\n- **Flexible Workflow**: Separate training, inference, and evaluation phases for long-running jobs\r\n- **Automatic Data Splitting**: Split a single dataset into train/val/test sets with customizable ratios\r\n\r\n### Quick Start\r\n\r\n#### 1. Full Pipeline (Train + Infer + Eval)\r\n\r\n```bash\r\npython scripts/train_llm.py \\\r\n    --do_train --do_infer --do_eval \\\r\n    --data_path Truong-Phuc/ViBidLQA \\\r\n    --model_name Qwen/Qwen2-0.5B \\\r\n    --instruction_template chatml \\\r\n    --output_dir outputs/qwen2-vibidlqa\r\n```\r\n\r\n#### 2. 
Separate Phases (for Kaggle/Colab sessions)\r\n\r\n**Phase 1: Training (~11 hours)**\r\n```bash\r\npython scripts/train_llm.py \\\r\n    --do_train \\\r\n    --data_path data/train.csv \\\r\n    --model_name Qwen/Qwen2-0.5B \\\r\n    --num_epochs 10 \\\r\n    --output_dir outputs/qwen2-checkpoint\r\n```\r\n\r\n**Phase 2: Inference (~1 hour)**\r\n```bash\r\npython scripts/train_llm.py \\\r\n    --do_infer \\\r\n    --test_data data/test.csv \\\r\n    --checkpoint_path outputs/qwen2-checkpoint \\\r\n    --predictions_file outputs/predictions.csv\r\n```\r\n\r\n**Phase 3: Evaluation (~10 minutes)**\r\n```bash\r\npython scripts/train_llm.py \\\r\n    --do_eval \\\r\n    --predictions_file outputs/predictions.csv \\\r\n    --metrics_file outputs/metrics.json\r\n```\r\n\r\n### Instruction Templates\r\n\r\nThe framework supports multiple instruction formats:\r\n\r\n#### ChatML (default)\r\n```\r\n\u003c|im_start|\u003esystem\r\n{system_prompt}\u003c|im_end|\u003e\r\n\u003c|im_start|\u003euser\r\n{context}\r\n{question}\u003c|im_end|\u003e\r\n\u003c|im_start|\u003eassistant\r\n{answer}\u003c|im_end|\u003e\r\n```\r\n\r\n#### Alpaca\r\n```\r\nBelow is an instruction that describes a task...\r\n\r\n### Instruction:\r\n{question}\r\n\r\n### Input:\r\n{context}\r\n\r\n### Response:\r\n{answer}\r\n```\r\n\r\n#### Custom Template\r\n```bash\r\npython scripts/train_llm.py \\\r\n    --instruction_template custom \\\r\n    --custom_template \"Context: {context}\\nQuestion: {question}\\nAnswer: {answer}\"\r\n```\r\n\r\n### Configuration\r\n\r\nYou can use a JSON configuration file:\r\n\r\n```bash\r\npython scripts/train_llm.py --config configs/llm_config.json\r\n```\r\n\r\nExample configuration:\r\n```json\r\n{\r\n  \"model\": {\r\n    \"name\": \"Qwen/Qwen2-0.5B\",\r\n    \"instruction_template\": \"chatml\"\r\n  },\r\n  \"lora\": {\r\n    \"r\": 16,\r\n    \"alpha\": 32,\r\n    \"dropout\": 0.05\r\n  },\r\n  \"training\": {\r\n    \"num_epochs\": 5,\r\n    \"batch_size\": 1,\r\n    
\"gradient_accumulation_steps\": 16\r\n  }\r\n}\r\n```\r\n\r\n### Supported Models\r\n\r\n- Qwen/Qwen2 series (0.5B, 1.5B, 7B)\r\n- SeaLLMs/SeaLLMs-v3 series (1.5B, 7B)\r\n- meta-llama/Llama-2 series\r\n- mistralai/Mistral series\r\n- Any other causal LM compatible with Transformers\r\n\r\n### Advanced Options\r\n\r\n```bash\r\npython scripts/train_llm.py --help\r\n```\r\n\r\nKey parameters:\r\n- `--train_ratio`, `--val_ratio`, `--test_ratio`: Data split ratios (default: 8:1:1)\r\n- `--lora_r`: LoRA rank (default: 16)\r\n- `--learning_rate`: Learning rate (default: 3e-5)\r\n- `--max_new_tokens`: Max tokens to generate (default: 512)\r\n- `--use_wandb`: Enable W\u0026B logging\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License - see the LICENSE file for details.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fntphuc149%2Fviag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fntphuc149%2Fviag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fntphuc149%2Fviag/lists"}