{"id":28532515,"url":"https://github.com/ntphuc149/viir","last_synced_at":"2026-03-09T02:32:15.182Z","repository":{"id":293581159,"uuid":"980722701","full_name":"ntphuc149/ViIR","owner":"ntphuc149","description":"ViIR: The Unified Framework for Fine-tuning Vietnamese Information Retrieval Models with Various Tuning Strategies.","archived":false,"fork":false,"pushed_at":"2025-05-25T13:24:14.000Z","size":90,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-30T20:45:57.723Z","etag":null,"topics":["baseline","bi-encoder","fine-tuning","hard-negative-mining","information-retrieval","plms","positive-pair","vietnamese"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ntphuc149.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-09T15:47:50.000Z","updated_at":"2025-05-25T14:01:37.000Z","dependencies_parsed_at":"2025-05-16T04:19:49.569Z","dependency_job_id":"5fbc32fa-21e8-4a41-acdf-8b278678ceee","html_url":"https://github.com/ntphuc149/ViIR","commit_stats":null,"previous_names":["ntphuc149/viir"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ntphuc149/ViIR","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViIR","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViIR/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViIR/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViIR/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ntphuc149","download_url":"https://codeload.github.com/ntphuc149/ViIR/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ntphuc149%2FViIR/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30280854,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T02:23:26.802Z","status":"ssl_error","status_checked_at":"2026-03-09T02:22:46.175Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["baseline","bi-encoder","fine-tuning","hard-negative-mining","information-retrieval","plms","positive-pair","vietnamese"],"created_at":"2025-06-09T15:38:26.583Z","updated_at":"2026-03-09T02:32:15.159Z","avatar_url":"https://github.com/ntphuc149.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ViIR: Information Retrieval Fine-tuning Framework for Vietnamese Documents\n\nFine-tuning sentence transformer models for Vietnamese information retrieval using any custom dataset.\n\n## Overview\n\nViIR is a flexible framework for fine-tuning Bi-Encoder and other transformer models for Vietnamese information retrieval tasks. 
The framework supports three main fine-tuning strategies:\n\n1. **Baseline**: Using pre-trained models without fine-tuning for benchmarking\n2. **Positive-pair Tuning**: Fine-tuning with query-document positive pairs\n3. **Hard Negative Tuning**: Advanced fine-tuning with hard negatives for improved discrimination\n\nThis framework is designed to work with any Vietnamese dataset that contains query-document pairs, making it adaptable for legal documents, news articles, medical information, and more.\n\n## Installation\n\n```bash\n# Clone repository\ngit clone https://github.com/ntphuc149/ViIR.git\ncd ViIR\n\n# Install package and dependencies\npip install -e .\n```\n\n## Requirements\n\n- Python 3.8+\n- PyTorch 1.10+\n- Transformers 4.16+\n- Sentence-transformers 2.2.0+\n- Scikit-learn 1.0.0+\n- Pandas \u0026 NumPy\n- Tqdm\n- PyYAML\n\n## Usage\n\n### 1. Data Preprocessing\n\nThe framework expects your dataset to have at least the following columns:\n- `question` or `query`: The search query text\n- `context` or `document`: The document text\n- (Optional) `abstractive_answer`: The ground truth answer\n\n```bash\npython scripts/preprocess.py --input /path/to/your_dataset.csv --output data/processed/\n```\n\n### 2. 
Model Training\n\nThe framework supports three training strategies that can be easily selected through configuration files:\n\n```bash\n# Baseline (no fine-tuning)\npython scripts/train.py --config viir/config/baseline.yaml\n\n# Positive-pair Tuning\npython scripts/train.py --config viir/config/positive_pair.yaml\n\n# Hard Negative Tuning\npython scripts/train.py --config viir/config/hard_negative.yaml\n```\n\nYou can directly specify model, batch size, learning rate and other parameters via command line:\n\n```bash\n# Using PhoBERT model with custom hyperparameters\npython scripts/train.py --config viir/config/hard_negative.yaml \\\n                        --model_name vinai/phobert-base \\\n                        --batch_size 32 \\\n                        --learning_rate 2e-5 \\\n                        --epochs 5\n```\n\n### 3. Model Evaluation\n\nEvaluate your trained model with standard IR metrics including NDCG, MRR, precision, and recall:\n\n```bash\npython scripts/evaluate.py --model_path output/model/ --data_dir data/processed/ --split test\n```\n\nFor comprehensive evaluation across all splits:\n\n```bash\npython scripts/evaluate.py --model_path output/model/ --data_dir data/processed/ --split all\n```\n\n### 4. 
Running the Complete Pipeline\n\nFor convenience, you can run the entire pipeline in one command:\n\n```bash\n# Using the run.py script\npython run.py --input /path/to/your_dataset.csv --strategy hard_negative\n```\n\nSwitch between strategies and models:\n\n```bash\n# Using baseline strategy with default XLM-RoBERTa\npython run.py --input /path/to/your_dataset.csv --strategy baseline\n\n# Using positive-pair strategy with PhoBERT\npython run.py --input /path/to/your_dataset.csv \\\n              --strategy positive_pair \\\n              --model_name vinai/phobert-base\n\n# Using hard negative strategy with custom hyperparameters\npython run.py --input /path/to/your_dataset.csv \\\n              --strategy hard_negative \\\n              --model_name vinai/phobert-base-v2 \\\n              --batch_size 16 \\\n              --learning_rate 3e-5 \\\n              --epochs 3\n```\n\n## Project Structure\n\n```\nViIR/\n├── viir/                  # Main package directory\n│   ├── __init__.py        # Package initialization\n│   ├── config/            # Configuration files\n│   ├── data/              # Data processing modules\n│   ├── trainers/          # Training strategy implementations\n│   ├── utils/             # Utility functions\n│   ├── evaluation/        # Evaluation tools\n│   └── main.py            # Main module\n├── scripts/               # CLI scripts\n│   ├── preprocess.py      # Data preprocessing\n│   ├── train.py           # Model training\n│   └── evaluate.py        # Model evaluation\n├── run.py                 # Convenience script for running the pipeline\n├── setup.py               # Package setup\n└── README.md              # This file\n```\n\n## Customization\n\n### Using Different Models\n\nYou can use any model from the Hugging Face hub either by specifying it in the command line or by changing the `model.name` parameter in the configuration files:\n\n#### Command line method:\n```bash\npython run.py --input your_data.csv --strategy 
hard_negative --model_name vinai/phobert-base\n```\n\n#### Configuration file method:\n```yaml\nmodel:\n  name: \"vinai/phobert-base\"  # Or any other Vietnamese language model\n  max_seq_length: 512\n  trust_remote_code: true\n```\n\n### Supported Vietnamese Models\n\nThe framework has been tested with the following Vietnamese models:\n- `FacebookAI/xlm-roberta-base` (default)\n- `FacebookAI/xlm-roberta-large` \n- `vinai/phobert-base-v2`\n- `vinai/phobert-large`\n- And other models compatible with Sentence Transformers\n\n### Custom Dataset Format\n\nIf your dataset has a different format, you can modify the `viir/data/processor.py` file to handle your specific data structure.\n\n## Citation\n\nIf you use this framework in your research or applications, please cite:\n\n```\n@misc{viir,\n  author = {Truong-Phuc Nguyen},\n  title = {ViIR: The Unified Framework for Fine-tuning Vietnamese Information Retrieval Models with Various Tuning Strategies},\n  year = {2025},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/ntphuc149/ViIR}}\n}\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fntphuc149%2Fviir","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fntphuc149%2Fviir","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fntphuc149%2Fviir/lists"}