{"id":17728135,"url":"https://github.com/gair-nlp/prox","last_synced_at":"2025-04-05T00:08:05.001Z","repository":{"id":258301298,"uuid":"854580116","full_name":"GAIR-NLP/ProX","owner":"GAIR-NLP","description":"Offical Repo for \"Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale\"","archived":false,"fork":false,"pushed_at":"2024-10-16T14:32:29.000Z","size":16518,"stargazers_count":169,"open_issues_count":2,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-10-18T11:05:29.675Z","etag":null,"topics":["continual","continual-pre-training","data-centric-ai","data-quality","llama","llm","mistral","neural-symbolic","pre-training"],"latest_commit_sha":null,"homepage":"https://gair-nlp.github.io/ProX/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GAIR-NLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-09T12:33:23.000Z","updated_at":"2024-10-18T08:54:01.000Z","dependencies_parsed_at":"2024-10-25T23:32:04.247Z","dependency_job_id":null,"html_url":"https://github.com/GAIR-NLP/ProX","commit_stats":null,"previous_names":["gair-nlp/prox"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FProX","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FProX/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GAIR-NLP%2FProX/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/GAIR-NLP%2FProX/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GAIR-NLP","download_url":"https://codeload.github.com/GAIR-NLP/ProX/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247266564,"owners_count":20910836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["continual","continual-pre-training","data-centric-ai","data-quality","llama","llm","mistral","neural-symbolic","pre-training"],"created_at":"2024-10-25T19:05:37.326Z","updated_at":"2025-04-05T00:08:04.979Z","avatar_url":"https://github.com/GAIR-NLP.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./static/images/prox-logo.png\"\u003e\n\u003c/p\u003e\n\u003ca href=\"https://huggingface.co/gair-prox\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Models\" src=\"https://img.shields.io/badge/🤗-HuggingFace Repo-blue\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://arxiv.org/abs/2409.17115\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Paper\" src=\"https://img.shields.io/badge/📑-Paper-blue\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://gair-nlp.github.io/ProX/\" target=\"_blank\"\u003e\n\u003cimg alt=\"Project Page\" src=\"https://img.shields.io/badge/🧪-Project Page-blue\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://opensource.org/license/apache-2-0\" target=\"_blank\"\u003e\n    \u003cimg alt=\"License: apache-2-0\" 
src=\"https://img.shields.io/github/license/saltstack/salt\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/GAIR-NLP/ProX\" target=\"_blank\"\u003e\n    \u003cimg alt=\"GitHub Stars\" src=\"https://img.shields.io/github/stars/GAIR-NLP/ProX?style=social\" /\u003e\n\u003c/a\u003e\n\u003ca href=\"https://github.com/GAIR-NLP/ProX/issues\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Open Issues\" src=\"https://img.shields.io/github/issues-raw/GAIR-NLP/ProX\" /\u003e\n\u003c/a\u003e\n\n## 🔥 News\n\n- **[17 February, 2025]:** 🎉 We release **[DCLM-pro](https://huggingface.co/datasets/gair-prox/DCLM-pro)**, a further cleaned version of [DCLM-baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet) containing \u003e 500B tokens ready for pre-training. Preliminary experiments show that models trained on DCLM-pro achieve a **\u003e1.5%** average performance gain within 50B tokens.\n- **[10 October, 2024]:** 🎉 We release the codebase for large scale data refining, together with the refining models on 🤗 Hugging Face: [Prox-Refining-LMs](https://huggingface.co/collections/gair-prox/prox-refining-models-6707cf820a16d830fbf434dd).\n- **[19 September, 2024]:** 🎉 We open-sourced the [pre-training corpus](https://huggingface.co/collections/gair-prox/prox-dataset-66e81c9d560911b836bb3704) curated by our ProX framework, containing a \u003e 100B-token high-quality general-domain corpus and a ~5B-token high-quality math corpus, together with models ([ProX](https://huggingface.co/collections/gair-prox/prox-general-models-65f1674f0607712c4d6eec76) and [ProXMath](https://huggingface.co/collections/gair-prox/prox-math-models-66e92c3e5d54b27612286eb9)) trained on these data.\n\n## Table of Contents\n\n- [Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale](#programming-every-example-lifting-pre-training-data-quality-like-experts-at-scale)\n  - [🔥 News](#-news)\n  - [Table of Contents](#table-of-contents)\n  - [🚀 
Introduction](#-introduction)\n  - [Setup](#setup)\n  - [Large Scale Data Refining](#large-scale-data-refining)\n  - [Training on ProX curated data](#training-on-prox-curated-data)\n  - [Evaluation](#evaluation)\n    - [General Evaluation](#general-evaluation)\n    - [Math Evaluation](#math-evaluation)\n  - [Development](#development)\n  - [Projects Using ProX](#projects-using-prox)\n  - [Citation](#citation)\n  - [Acknowledgements](#acknowledgements)\n\n## 🚀 Introduction\n\n🫐 **ProX** is an LM-based data refinement framework that improves the quality of data used to pre-train large language models. Instead of relying on human experts to write rules, ProX treats data refinement as a programming task: a model generates a program for each data example, and executing that program cleans and improves the example. This makes per-example refinement feasible at large scale.\n\n![alt text](static/images/prox-intro.png)\n\nCurrently, 🫐 ProX-curated data go through two levels of programming and execution, doc-level and chunk-level:\n![alt text](static/images/prox-framework.png)\n\n**Key Features**:\n\n- Better Performance: Models trained with ProX-refined data perform over 2% better than those trained with raw data or data curated by rule-based methods.\n- Domain Flexibility: 🫐 ProX works well across different domains, boosting accuracy by up to 20% in tasks like math, without needing special manual adjustments.\n- Efficient and Scalable: Even small models (as few as 0.3B parameters) can refine data effectively, similar to human experts, saving resources compared to LLM-based data synthesis.\n- Cost-Effective: In general, 🫐 ProX can significantly reduce training compute while maintaining strong results.\n\n## Setup\n\nFirst, install all the libraries listed in `requirements.txt`:\n\n```bash\ngit clone https://github.com/GAIR-NLP/ProX.git prox\ncd prox\nconda create -n prox python=3.10\nconda activate prox\npip install -r requirements.txt\n```\n\nFor acceleration, install flash-attention together with some fused kernels:\n\n\u003cdetails\u003e\n\u003csummary\u003eClick me\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\npip install flash-attn 
--no-build-isolation\n# this part is quite similar to the TinyLlama repo\n# you can also refer to its detailed guide at: https://github.com/jzhang38/TinyLlama/blob/main/PRETRAIN.md\ngit clone https://github.com/Dao-AILab/flash-attention.git\ncd flash-attention\ncd csrc/rotary \u0026\u0026 pip install .\ncd ../layer_norm \u0026\u0026 pip install .\ncd ../xentropy \u0026\u0026 pip install .\n# go back to the directory that contains the clone before removing it\ncd ../../.. \u0026\u0026 rm -rf flash-attention\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\nThen, install lighteval \u0026 math-eval for evaluation:\n\n\u003cdetails\u003e\n\u003csummary\u003e\n\u003cb\u003elighteval\u003c/b\u003e\n\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\nconda create -n lmeval python=3.10\nconda activate lmeval\ngit clone https://github.com/huggingface/lighteval.git\ncd lighteval\npip install -e .\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\n\u003cb\u003emath-eval\u003c/b\u003e\n\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\ngit clone https://github.com/GAIR-NLP/math-evaluation-harness.git\ncd math-evaluation-harness\nconda create -n math_eval python=3.10\nconda activate math_eval\npip install -r requirements.txt\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n## Large Scale Data Refining\n\nIf you want to refine your own data with ProX, please make sure you set up a new environment. 
\n\n```bash\n# create a new conda env\nconda create -n refining python=3.10\nconda activate refining\n# install requirements\npip install -r refining_requirements.txt\n```\n\nWe released 2 families of refining models:\n\n- WebRefining-LM: for general web domain, including [web-doc-refining-lm](https://huggingface.co/gair-prox/web-doc-refining-lm) and [web-chunk-refining-lm](https://huggingface.co/gair-prox/web-chunk-refining-lm)\n- MathRefining-LM: for math domain, including [math-doc-refining-lm](https://huggingface.co/gair-prox/math-doc-refining-lm) and [math-chunk-refining-lm](https://huggingface.co/gair-prox/math-chunk-refining-lm)\n\nYou can refer to the following example slurm scripts to refine large scale pre-training data.\n\n```bash\n# 1. doc-level refining\nsbatch scripts/data_gen/example_doc_refining.sh\n\n# 2. chunk-level refining\nsbatch scripts/data_gen/example_chunk_refining.sh\n```\n\n## Training on ProX curated data\n\nWe provide a high-quality general-domain corpus of over 100B tokens and a high-quality math corpus of ~5B tokens. You can directly train your own model on these data.\n\nHere is an example that downloads and tokenizes 🫐 ProX data, trains a model on it with litgpt, and finishes with a thorough evaluation.\nFeel free to modify the scripts to fit your own needs.\n\nThe first step is to set up your environment variables:\n\n```bash\n# 1. using setup_personal_env and setup_common_env\nsource setup_personal_env.sh\nsource setup_common_env.sh\n```\n\nThen download and tokenize the data:\n\n```bash\n# 2. download the data, e.g., RedPajama-pro\npython scripts/data_download/hf_download.py \\\n    --dataset_name gair-prox/RedPajama-pro\n\n# 3. 
tokenize the data\nexport PYTHONPATH=$PYTHONPATH:$TINYLM_WORK_DIR/train\npython -m train.data_tokenize.prepare_web \\\n    --source_path $RAW_DATA_DIR/gair-prox/RedPajama-pro \\\n    --tokenizer_path $TINYLM_WORK_DIR/vocab_files/llama_hf \\\n    --destination_path $TOKENIZE_DATA_DIR/RedPajama-pro/llama \\\n    --split train \\\n    --percentage 1.0\n```\n\nYou should see many \".bin\" files in the destination path. You can then train a model on the tokenized data.\n\nWe run the training script using slurm:\n\n```bash\n# 4. train / convert / evaluate using slurm + multiple nodes\nsbatch scripts/train/tlm/pt_tlm_xs_redpj_prox.sh\n```\n\nYou can also run the training script on a single local node 👇\n\n\u003cdetails\u003e\n\u003csummary\u003eclick me\u003c/summary\u003e\n\u003cp\u003e\n\n```bash\n# 4.1 train locally\ncd train\nexport PYTHONPATH=$PYTHONPATH:$TINYLM_WORK_DIR/train\npython -m pretrain.tinyllama \\\n    --config_path $TINYLM_WORK_DIR/configs/general/\u003cyour_config\u003e.yaml\n\n# 4.2 convert to HF model\n#   --litgpt_model_dir: the model dir you want to convert under ${PT_MODEL_OUTPUT_DIR}\n#   --hf_model_dir: the model dir you want to save under ${HF_MODEL_OUTPUT_DIR}\n#   --save_token_interval: the interval to save checkpoints, e.g., you can assume 1024 * 2048 * 500 is approx. 1B tokens\n#   --arch_name: the model architecture name in train/lit_gpt/config.py\n# (comments are kept above the command: a comment after a trailing backslash breaks the line continuation)\npython -m scripts.weight_conversion.batch_model_conversion \\\n    --litgpt_model_dir pt_llama_0_3b_redpj_25B_prox \\\n    --hf_model_dir pt_llama_0_3b_redpj_25B_prox \\\n    --save_token_interval 1 \\\n    --arch_name tiny_LLaMA_0_3b\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n## Evaluation\n\n### General Evaluation\n\nWe evaluate the model using lighteval across 10 standard tasks:\n\n- ARC (ARC-Easy, ARC-Challenge)\n- CommonsenseQA\n- Hellaswag\n- MMLU\n- OpenbookQA\n- PIQA\n- SocialIQA\n- WinoGrande\n- SciQ\n\nThe sbatch script above already includes the evaluation step. If you are not using slurm, you can run the evaluation script directly:\n\n```bash\n# 5. 
evaluate the model\n# we provide scripts for general evaluation\n# e.g., you only want to eval last checkpoint named as `25B`\n# you can simply remove `--model_step_list 25` to evaluate all checkpoints\npython -m scripts.eval.base_evaluation \\\n    --hf_model_dir pt_llama_0_3b_redpj_25B_prox \\\n    --task_impl lighteval \\\n    --task_set fineweb \\\n    --model_step_list 25\n```\n\n### Math Evaluation\n\nFor math evaluation, you can refer to the following script, after you have installed math-eval and converted the model to HF format:\n\n```bash\n# alter the work dir and activate the conda env\ncd math-evaluation-harness\nconda activate math_eval\n\n# eval on all benchmarks\nbash auto_dir_run.sh ${your_model_folder_path}\n\n# summarize all results of all intermediate ckpts in your_model_folder_path\npython gather_results.py --do_all_ckpts --dir_path outputs/${your_model_folder_path}\n```\n\n## Development\n\nCurrently, we release the following code and data:\n\n- [✅] [Data](https://huggingface.co/collections/gair-prox/prox-dataset-66e81c9d560911b836bb3704)\n- [✅] [Training Code](./train)\n- [✅] [Evaluation Scripts](./scripts/eval)\n- [✅] [Large Scale Data Refining](./scripts/data_gen)\n- [✅] [Refining Model Weights](https://huggingface.co/collections/gair-prox/prox-refining-models-6707cf820a16d830fbf434dd)\n- [🚧] ...\n\n## Projects Using ProX\n\n- [Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs](https://sea-sailor.github.io/blog/sailor2/)\n- [YuLan-Mini: An Open Data-efficient Language Model](https://arxiv.org/abs/2412.17743)\n- [Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach](https://arxiv.org/abs/2502.05171)\n\n\n## Citation\n\nPlease cite 🫐 ProX paper if you find our work helpful:\n\n```bibtex\n@article{zhou2024programming,\n  title={Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale},\n  author={Zhou, Fan and Wang, Zengzhi and Liu, Qian and Li, Junlong and Liu, Pengfei},\n  
journal={arXiv preprint arXiv:2409.17115},\n  year={2024}\n}\n```\n\n## Acknowledgements\n\nWe thank the following projects that provide great help for this work:\n\n- [🧠 TinyLlama](https://github.com/jzhang38/TinyLlama)\n- [🔥 FlashAttention](https://github.com/Dao-AILab/flash-attention)\n- [🧩 DataTrove](https://github.com/huggingface/datatrove)\n- [🔍 LightEval](https://github.com/huggingface/lighteval)\n- [🧮 MathEval](https://github.com/GAIR-NLP/math-evaluation-harness)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgair-nlp%2Fprox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgair-nlp%2Fprox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgair-nlp%2Fprox/lists"}