{"id":29630929,"url":"https://github.com/epfml/fineweb2-hq","last_synced_at":"2026-03-07T18:31:37.527Z","repository":{"id":279414298,"uuid":"932396330","full_name":"epfml/fineweb2-hq","owner":"epfml","description":"Code for the paper \"Enhancing Multilingual LLM Pretraining with Model-Based Data Selection\"","archived":false,"fork":false,"pushed_at":"2025-05-16T08:35:08.000Z","size":43,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-07-21T11:15:32.736Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-13T21:04:09.000Z","updated_at":"2025-07-15T11:36:52.000Z","dependencies_parsed_at":"2025-02-25T13:24:06.580Z","dependency_job_id":"4d134862-42d9-4d15-ae30-0d3bc3cdc9d0","html_url":"https://github.com/epfml/fineweb2-hq","commit_stats":null,"previous_names":["epfml/fineweb2-hq"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/epfml/fineweb2-hq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Ffineweb2-hq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Ffineweb2-hq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Ffineweb2-hq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Ffineweb2-hq/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/fineweb2-hq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Ffineweb2-hq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30226246,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-07T18:12:09.766Z","status":"ssl_error","status_checked_at":"2026-03-07T18:11:58.786Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-21T11:07:13.910Z","updated_at":"2026-03-07T18:31:37.505Z","avatar_url":"https://github.com/epfml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Enhancing Multilingual LLM Pretraining with Model-Based Data Selection\n\nThis is the codebase for our paper [*Enhancing Multilingual LLM Pretraining with Model-Based Data Selection*](https://arxiv.org/abs/2502.10361).\n\n**Abstract:**\n\u003e Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. 
To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks and mitigating the curse of multilinguality. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.\n\n\u003cimg src=\"assets/agg_score_plot.svg\" width=\"600\"\u003e\n\n**Figure**: Pretraining benchmark performance (average accuracy) measured on Chinese (CMMLU), German (MMLU), and French (MMLU), while training for 119B tokens, comparing the baseline FineWeb-2 dataset against data filtered using our FastText (*FT*) and Transformer Multi-Layer Perceptron (*MLP*) embedding-based filtering methods trained on our data mixture *MKC\u003csup\u003e+\u003c/sup\u003e*. 
When using our approaches, the data retention rates are set to 10%.\n\nWe release the dataset resulting from our best approach (*MLP MKC\u003csup\u003e+\u003c/sup\u003e* with a 10% retention rate) for 20 languages as [FineWeb2-HQ](https://huggingface.co/datasets/epfml/FineWeb2-HQ) on HuggingFace.\n\nIn addition, we release the FineWeb2 dataset with XLM-RoBERTa embeddings, which can be used for multilingual research, as [FineWeb2-embedded](https://huggingface.co/datasets/epfml/FineWeb2-embedded) on HuggingFace.\n\n# Quickstart\n\nThe codebase relies on the [`datatrove`](https://github.com/huggingface/datatrove) library.\n\nWe provide an example of the *MLP MKC\u003csup\u003e+\u003c/sup\u003e* dataset creation with a 10% retention rate for the French language (`fra_Latn`).\n\nCreate a conda environment and install the package: \n\n```bash\nconda create -n env python=3.10\nconda activate env\npip install -e .\n```\n\nCreate the *MKC\u003csup\u003e+\u003c/sup\u003e* dataset:\n\n```bash\ncd data\npython generate_dataset.py --output-dir ./datasets/ --language-mapping ../assets/language_mapping.csv --fineweb2-path /path/to/fineweb2/data/\n```\n\nCompute the embeddings for the generated dataset:\n```bash\npython compute_embeddings.py --reader-type jsonl --input-dir ./datasets/fra_Latn/train_80.jsonl --output-dir ./datasets-embedded/fra_Latn/train_80\npython compute_embeddings.py --reader-type jsonl --input-dir ./datasets/fra_Latn/valid_10.jsonl --output-dir ./datasets-embedded/fra_Latn/valid_10\npython compute_embeddings.py --reader-type jsonl --input-dir ./datasets/fra_Latn/test_10.jsonl --output-dir ./datasets-embedded/fra_Latn/test_10\n```\n\nTrain the *MLP* model on *MKC\u003csup\u003e+\u003c/sup\u003e* dataset:\n```bash\npython train_mlp.py --dataset-dir ./datasets-embedded/fra_Latn/ --output-path ./models/fra_Latn.pt\n```\n\nCompute the embeddings for the FineWeb2 dataset (or use the [FineWeb2-embedded](https://huggingface.co/datasets/epfml/FineWeb2-embedded) 
dataset):\n```bash\npython compute_embeddings.py --input-dir /path/to/fineweb2/data/fra_Latn/train --output-dir ./fineweb2-embedded/fra_Latn\n```\n\nRun the filtering:\n```bash\npython filter_mlp.py --input-dir ./fineweb2-embedded/fra_Latn --classifier-path ./models/fra_Latn.pt --output-dir ./fineweb2-hq/fra_Latn --retention-rate 0.1\n```\n\nThe resulting dataset will be saved in the `fineweb2-hq` folder.\n\nTo train and evaluate an LLM on the data, we provide configs for [`nanotron`](https://github.com/huggingface/nanotron) and [`lighteval`](https://github.com/huggingface/lighteval) in the `training` and `evaluation` folders.\n\n# Citation information\n\n```\n@article{messmer2025multilingdatacomp,\n  title={Enhancing Multilingual LLM Pretraining with Model-Based Data Selection},\n  author={Bettina Messmer and Vinko Sabolčec and Martin Jaggi},\n  journal={arXiv},\n  year={2025},\n  url={https://arxiv.org/abs/2502.10361},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Ffineweb2-hq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepfml%2Ffineweb2-hq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Ffineweb2-hq/lists"}