https://github.com/hitz-zentroa/latxa-instruct
Latxa-Instruct: Basque Instruction-Tuned Models and Evaluation Arena
- Host: GitHub
- URL: https://github.com/hitz-zentroa/latxa-instruct
- Owner: hitz-zentroa
- License: mit
- Created: 2025-05-23T10:27:09.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-06-11T13:06:25.000Z (4 months ago)
- Last Synced: 2025-06-11T14:37:12.389Z (4 months ago)
- Language: Jupyter Notebook
- Size: 1.82 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Latxa-Instruct: Basque Instruct Models and Evaluation Arena

**Latxa-Instruct** is an open-source project for reproducible training, evaluation, and benchmarking of Basque instruction-tuned language models. It provides:
- **Open-source Basque LLMs**: Instruction-tuned models based on Llama 3.1 (8B and 70B parameters)
- **Synthetic instruction datasets**: Large-scale generated instruction-response pairs in both Basque and English for reproducible training
- **Human preference dataset**: The first preference dataset in Basque with real user prompts, model responses, and 12,890 preference annotations from 1,285 participants
- **Complete training pipeline**: End-to-end scripts for data preprocessing, model training, and instruction-tuning on the CINECA Leonardo supercomputer
- **Benchmark evaluation framework**: Benchmarking scripts and results across 29 datasets in Basque, English, and Spanish
- **Human evaluation framework**: Full frontend and backend implementation for community-driven human evaluation of language models

All models, datasets, and evaluation tools are released under open licenses.
- 📒 Blog Post: TBA
- 📖 Paper: [Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque](https://arxiv.org/abs/2506.07597)
- 🤗 Models and Data: [HiTZ/Latxa-Instruct](https://huggingface.co/collections/HiTZ/latxa-instruct-682f356091452b0028380804)

---
# Datasets
> 🚧 **Note**: The codebase is currently a work in progress. In the meantime, all datasets are available on Hugging Face.
- **Corpus:** Basque corpus of 4.3M documents
  - https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1
- **Instructions:** [Magpie](https://github.com/magpie-align/magpie)-generated Basque/English instructions
  - https://huggingface.co/datasets/HiTZ/Magpie-Llama-3.1-8B-Instruct-Filtered-1M
  - https://huggingface.co/datasets/HiTZ/Magpie-Llama-3.1-8B-Instruct-Filtered-translated-1M
- **Preferences:** Human preferences on model outputs from 21 different LLMs, in response to real Basque prompts
  - https://huggingface.co/datasets/HiTZ/ebaluatoia

# Model training
> 🚧 **Note**: The codebase is currently a work in progress and will require adaptation to your specific environment and setup.

Training is performed on the CINECA Leonardo cluster using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) (a Hugging Face-based framework).
See [model_training/README.md](model_training/README.md) for full instructions, environment setup, and troubleshooting.

### Quicksteps
1. **Prepare the data**

   Preprocess instruction datasets and Basque corpora using scripts in [model_training/](model_training/). For example:

   ```bash
   sbatch model_training/train_scripts/prepare_data.sh
   ```

   This will tokenize and cache datasets as specified in the YAML configs.

2. **Train the models**

   Launch training with the provided SLURM scripts:

   ```bash
   sbatch model_training/train_scripts/Latxa-Llama-3.1-70B-Instruct-exp_2_101.sh
   ```

3. **Merge and transfer checkpoints**

   After training, merge distributed checkpoints and transfer to the target server using `merge_weights.sh` and `rsync_weights.sh`.

### Training Configs
- All configs are in [model_training/train_configs/](model_training/train_configs/).
- Datasets are in JSONL format, with user/assistant conversations.
- Example config: [exp_1_010_fixed.yaml](model_training/train_configs/exp_1_010_fixed.yaml)

# Evaluation
Our evaluation framework combines two complementary approaches to provide comprehensive model assessment: benchmark-based evaluation and human preferences.
See [evaluation/README.md](evaluation/README.md) for full instructions, environment setup, and troubleshooting.

### Static Benchmarks
- **29 benchmarks** across Basque, English, and Spanish
- **6 categories**: reading comprehension, common sense, linguistic proficiency, knowledge, math & reasoning, and bias
- Automated evaluation using [LM Evaluation Harness](https://github.com/juletx/lm-evaluation-harness/tree/basque-bench)
- Full configuration files and results included

### Human Evaluation: _Ebaluatoia_ arena
- **Community-driven evaluation** with 1,285 Basque speakers
- **12,890 preference annotations** collected over 14 days
- **Multi-dimensional assessment**: content quality, linguistic quality, and overall preference
- **21 models evaluated** including our variants, GPT-4o, and Claude 3.5 Sonnet
- Complete arena platform implementation provided

---
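As an illustration of how pairwise preference annotations like those collected in the arena can be summarised, the following is a minimal sketch that computes a per-model win rate. The record layout `(model_a, model_b, winner)`, the winner labels `"a"`, `"b"`, and `"tie"`, and the model names are all hypothetical and are not taken from the released dataset; counting a tie as half a win is likewise an assumption, and this is not necessarily the ranking method used in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (model_a, model_b, winner),
# where winner is "a", "b", or "tie". Not the released dataset schema.
Annotation = Tuple[str, str, str]


def win_rates(annotations: Iterable[Annotation]) -> Dict[str, float]:
    """Return each model's share of wins, counting a tie as half a win."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in annotations:
        if winner not in ("a", "b", "tie"):
            raise ValueError(f"unknown winner label: {winner!r}")
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1.0
        elif winner == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


# Toy data with made-up model names, for illustration only.
toy = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "tie"),
    ("model_y", "model_z", "b"),
]
rates = win_rates(toy)
for model, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {rate:.2f}")
```

The same function could be applied separately to each assessment dimension (content quality, linguistic quality, overall preference) by passing in the annotations for that dimension.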
# A note on model variants and their nomenclature
The paper and the codebase use different naming conventions: descriptive names in the paper for clarity, and a compact format in the code for simplicity.
### Paper Nomenclature
In the paper, we use descriptive names like:
- $INST_{EN} C_{EU} I_{EN}$ = Llama 3.1 Instruct + Basque Corpus + English Instructions
- $BASE_{EU} C_{EU} I_{EN+EU}$ = Latxa Base + Basque Corpus + English & Basque Instructions

### Codebase Nomenclature
In the actual code, models, and evaluation results, we use a simplified format:
`exp_{BACKBONE}_{CORPUS_EU}{INST_EU}{INST_EN}`
where:
- **BACKBONE:**
  - `0` = $BASE_{EN}$ (Llama 3.1 8B)
  - `1` = $BASE_{EU}$ (Latxa 3.1 8B)
  - `2` = $INST_{EN}$ (Llama 3.1 8B Instruct)
- **CORPUS_EU / INST_EU / INST_EN:**
  - `0` = not used
  - `1` = used

**Examples:**
- `exp_1_101` = $BASE_{EU}$ + Basque Corpus + English Instructions
- `exp_2_111` = $INST_{EN}$ + Basque Corpus + Basque Instructions + English Instructions

### Model Repositories
All 8B model variants are available as branches in the Hugging Face repository:
- Main model: https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct
- Example variant: https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct/tree/exp_0_010

---
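The codebase naming scheme and the branch layout described above can be decoded mechanically. The sketch below is illustrative only: `parse_variant`, `branch_url`, and the `Variant` class are hypothetical helpers that are not part of the repository, and the code only handles plain names of the form `exp_{BACKBONE}_{CORPUS_EU}{INST_EU}{INST_EN}` (suffixed names such as `exp_1_010_fixed` are rejected). It does not check whether a given branch actually exists on the Hub.

```python
import re
from dataclasses import dataclass

# Backbone codes as listed in the nomenclature section above.
BACKBONES = {
    "0": "Llama 3.1 8B (BASE_EN)",
    "1": "Latxa 3.1 8B (BASE_EU)",
    "2": "Llama 3.1 8B Instruct (INST_EN)",
}

# Main 8B model repository; variants are described as branches of it.
HF_REPO = "https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct"


@dataclass(frozen=True)
class Variant:
    backbone: str
    corpus_eu: bool  # Basque corpus used
    inst_eu: bool    # Basque instructions used
    inst_en: bool    # English instructions used


def parse_variant(name: str) -> Variant:
    """Decode a name such as 'exp_1_101' into its components."""
    match = re.fullmatch(r"exp_([012])_([01])([01])([01])", name)
    if match is None:
        raise ValueError(f"not a valid variant name: {name!r}")
    backbone, corpus_eu, inst_eu, inst_en = match.groups()
    return Variant(
        backbone=BACKBONES[backbone],
        corpus_eu=corpus_eu == "1",
        inst_eu=inst_eu == "1",
        inst_en=inst_en == "1",
    )


def branch_url(name: str) -> str:
    """Build the branch URL for a variant (existence is not verified)."""
    parse_variant(name)  # validate the name before building the URL
    return f"{HF_REPO}/tree/{name}"


print(parse_variant("exp_1_101"))
print(branch_url("exp_0_010"))
```

For example, `parse_variant("exp_1_101")` yields the Latxa 3.1 8B backbone with the Basque corpus and English instructions enabled and Basque instructions disabled, matching the first example in the list above.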
# Citation
If you use Latxa-Instruct, please cite:
```bibtex
@misc{sainz2025instructing,
title = {Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque},
author = {Oscar Sainz and Naiara Perez and Julen Etxaniz and Joseba Fernandez de Landa and Itziar Aldabe and Iker García-Ferrero and Aimar Zabala and Ekhi Azurmendi and German Rigau and Eneko Agirre and Mikel Artetxe and Aitor Soroa},
year = {2025},
eprint = {2506.07597},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
```

---
# Acknowledgements
This work has been partially supported by the Basque Government (IKER-GAITU project), the Ministerio para la Transformación Digital y de la Función Pública (EU – NextGenerationEU,
2022/TL22/00215335), and trained on the Leonardo supercomputer at CINECA under EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.