https://github.com/epfml/fineweb2-hq

Code for the paper "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection"

Host: GitHub
URL: https://github.com/epfml/fineweb2-hq
Owner: epfml
License: apache-2.0
Created: 2025-02-13T21:04:09.000Z
Default Branch: main
Last Pushed: 2025-05-16T08:35:08.000Z
Last Synced: 2025-07-21T11:15:32.736Z
Language: Python
Homepage:
Size: 42 KB
Stars: 3
Watchers: 4
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

This is the codebase for our paper [*Enhancing Multilingual LLM Pretraining with Model-Based Data Selection*](https://arxiv.org/abs/2502.10361).

**Abstract:**
> Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.

**Figure**: Pretraining benchmark performance (average accuracy) measured on Chinese (CMMLU), German (MMLU), and French (MMLU), while training for 119B tokens, comparing the baseline FineWeb-2 dataset against data filtered using our FastText (*FT*) and Transformer Multi-Layer Perceptron (*MLP*) embedding-based filtering methods trained on our data mixture *MKC⁺*. When using our approaches, the data retention rates are set to 10%.

We release the dataset resulting from our best approach (*MLP MKC⁺* with a 10% retention rate) for 20 languages as [FineWeb2-HQ](https://huggingface.co/datasets/epfml/FineWeb2-HQ) on HuggingFace.

In addition, we release the FineWeb2 dataset with XLM-RoBERTa embeddings, which can be used for multilingual research, as [FineWeb2-embedded](https://huggingface.co/datasets/epfml/FineWeb2-embedded) on HuggingFace.

# Quickstart

The codebase relies on the [`datatrove`](https://github.com/huggingface/datatrove) library.

The example below creates the *MLP MKC⁺* dataset for French (`fra_Latn`) with a 10% retention rate.

Create a conda environment and install the package:

```bash
conda create -n env python=3.10
conda activate env
pip install -e .
```

Create the *MKC⁺* dataset:

```bash
cd data
python generate_dataset.py --output-dir ./datasets/ --language-mapping ../assets/language_mapping.csv --fineweb2-path /path/to/fineweb2/data/
```

Compute the embeddings for the generated dataset:
```bash
python compute_embeddings.py --reader-type jsonl --input-dir ./datasets/fra_Latn/train_80.jsonl --output-dir ./datasets-embedded/fra_Latn/train_80
python compute_embeddings.py --reader-type jsonl --input-dir ./datasets/fra_Latn/valid_10.jsonl --output-dir ./datasets-embedded/fra_Latn/valid_10
python compute_embeddings.py --reader-type jsonl --input-dir ./datasets/fra_Latn/test_10.jsonl --output-dir ./datasets-embedded/fra_Latn/test_10
```
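
The three `compute_embeddings.py` invocations above differ only in the split name, so they can also be generated from a small driver script. The following is a minimal sketch, not part of the repository: the helper `embedding_commands` and the `SPLITS` tuple are hypothetical names, and the flags are copied verbatim from the commands above. It only prints the commands; running them is left as a commented-out step.

```python
import shlex

# Split names taken from the file names used in the commands above.
SPLITS = ("train_80", "valid_10", "test_10")


def embedding_commands(language, datasets_dir="./datasets",
                       output_dir="./datasets-embedded"):
    """Build one compute_embeddings.py command per split (hypothetical helper)."""
    commands = []
    for split in SPLITS:
        commands.append([
            "python", "compute_embeddings.py",
            "--reader-type", "jsonl",
            "--input-dir", f"{datasets_dir}/{language}/{split}.jsonl",
            "--output-dir", f"{output_dir}/{language}/{split}",
        ])
    return commands


for cmd in embedding_commands("fra_Latn"):
    # Print the shell-quoted command line for inspection.
    print(shlex.join(cmd))
    # To execute instead of printing, one could call:
    # subprocess.run(cmd, check=True)
```

Swapping `"fra_Latn"` for another language code would produce the equivalent commands for that language, assuming the generated dataset follows the same directory layout.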

Train the *MLP* model on the *MKC⁺* dataset:
```bash
python train_mlp.py --dataset-dir ./datasets-embedded/fra_Latn/ --output-path ./models/fra_Latn.pt
```

Compute the embeddings for the FineWeb2 dataset (or use the [FineWeb2-embedded](https://huggingface.co/datasets/epfml/FineWeb2-embedded) dataset):
```bash
python compute_embeddings.py --input-dir /path/to/fineweb2/data/fra_Latn/train --output-dir ./fineweb2-embedded/fra_Latn
```

Run the filtering:
```bash
python filter_mlp.py --input-dir ./fineweb2-embedded/fra_Latn --classifier-path ./models/fra_Latn.pt --output-dir ./fineweb2-hq/fra_Latn --retention-rate 0.1
```

The resulting dataset will be saved in the `fineweb2-hq` folder.
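
To illustrate what a retention rate means, here is a minimal sketch of score-based selection: documents are ranked by classifier score and only the top fraction is kept. This is an assumption about the general idea, not the actual logic of `filter_mlp.py`, which may compute its threshold differently (for example, over a sharded dataset). The function `select_top_fraction` and the example scores are hypothetical.

```python
def select_top_fraction(scores, retention_rate):
    """Return the indices of the highest-scoring documents.

    Keeps roughly `retention_rate` of the documents (at least one),
    and returns the kept indices in their original order.
    """
    if not 0.0 < retention_rate <= 1.0:
        raise ValueError("retention_rate must be in (0, 1]")
    # Number of documents to keep, rounded to the nearest integer.
    keep = max(1, round(len(scores) * retention_rate))
    # Rank document indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Made-up classifier scores for ten documents (illustrative only).
scores = [0.91, 0.12, 0.55, 0.78, 0.05, 0.33, 0.67, 0.49, 0.21, 0.84]
kept = select_top_fraction(scores, retention_rate=0.1)
print(kept)  # indices of the documents that survive filtering
```

With ten documents and a retention rate of 0.1, a single document (the highest-scoring one) is kept.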

To train and evaluate an LLM on the data, we provide configs for [`nanotron`](https://github.com/huggingface/nanotron) and [`lighteval`](https://github.com/huggingface/lighteval) in the `training` and `evaluation` folders.

# Citation information

```bibtex
@article{messmer2025multilingdatacomp,
  title={Enhancing Multilingual LLM Pretraining with Model-Based Data Selection},
  author={Bettina Messmer and Vinko Sabolčec and Martin Jaggi},
  journal={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2502.10361},
}
```
