DataComp for Language Models
https://github.com/mlfoundations/dclm
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/mlfoundations/dclm
- Owner: mlfoundations
- License: mit
- Created: 2024-05-27T17:24:17.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-10-23T23:56:15.000Z (about 2 months ago)
- Last Synced: 2024-10-24T13:31:46.471Z (about 2 months ago)
- Language: HTML
- Size: 48.8 MB
- Stars: 1,140
- Watchers: 38
- Forks: 103
- Open Issues: 14
Metadata Files:
- Readme: README.md
- Contributing: contributing.md
- License: LICENSE
Awesome Lists containing this project
- ai-game-devtools - DCLM
- StarryDivineSky - mlfoundations/dclm - DataComp-LM (DCLM) is a comprehensive framework for building and training large language models (LLMs) with diverse datasets. It provides a standardized corpus of over 300T unfiltered tokens from CommonCrawl, effective pretraining recipes based on the open_lm framework, and an extensive suite of more than 50 evaluations. The repository offers tools and guidelines for processing raw data, tokenizing, shuffling, training models, and evaluating their performance. DCLM enables researchers to experiment with various dataset construction strategies across different compute scales (from 411M to 7B parameter models). Baseline experiments show significant improvements in model performance through optimized dataset design. DCLM has already enabled the creation of several high-quality datasets that perform well across scales and outperform all open datasets. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# DataComp-LM (DCLM)
## Table of Contents
- [Introduction](#introduction)
- [Leaderboard](#leaderboard)
- [Getting Started](#getting-started)
- [Selecting Raw Sources](#selecting-raw-sources)
- [Processing the Data](#processing-the-data)
- [Deduplication](#deduplication)
- [Tokenize and Shuffle](#tokenize-and-shuffle)
- [Model Training](#model-training)
- [Evaluation](#evaluation)
- [Submission](#submission)
- [Contributing](#contributing)
- [Downloading Artifacts](#downloading-artifacts)
- [Datasets](#datasets)
- [Pretrained Models](#pretrained-models)
- [How to Cite Us](#how-to-cite-us)
- [License](#license)

## Introduction
[DataComp-LM (DCLM)](https://datacomp.ai/dclm/) is a comprehensive framework designed for building and training large language models (LLMs) with diverse datasets. It offers a standardized corpus of over 300T unfiltered tokens from CommonCrawl, effective pretraining recipes based on the open_lm framework, and an extensive suite of over 50 evaluations. This repository provides tools and guidelines for processing raw data, tokenizing, shuffling, training models, and evaluating their performance.
DCLM enables researchers to experiment with various dataset construction strategies across different compute scales, from 411M to 7B parameter models. Our baseline experiments show significant improvements in model performance through optimized dataset design.
Already, DCLM has enabled the creation of several high-quality datasets that perform well across scales and outperform all open datasets.
![Accuracy vs compute tradeoff](assets/acc_vs_flops-1.png)
Developing datasets for better models that are cheaper to train. Using DataComp-LM, we develop a high-quality dataset, DCLM-BASELINE, which we use to train models with strong compute-performance tradeoffs. We compare on both a Core set of tasks (left) and on MMLU 5-shot (right). DCLM-BASELINE (orange) shows favorable performance relative to both closed-source models (crosses) and other open-source datasets and models (circles).

**Submission workflow**:
* **(A)** A participant chooses a scale, where larger scales reflect more target training tokens and/or model parameters.
The smallest scale is 400m-1x, a 400M parameter model trained compute-optimally (1x), and the largest scale is 7B-2x, a 7B parameter model trained with twice the tokens required for compute optimality (roughly 20 tokens per parameter is compute-optimal, so 7B-2x corresponds to about 6.9B × 20 × 2 ≈ 276B training tokens, as in the training table below).
* **(B)** A participant filters a pool of data (filtering track) or mixes data of their own (bring your own data track) to create a dataset.
* **(C)** Using the curated dataset, a participant trains a language model, with standardized training code and scale-specific hyperparameters, which is then
* **(D)** evaluated on 53 downstream tasks to judge dataset quality.
![Workflow](assets/workflow_dclm.png)

For more details, please refer to our [paper](https://arxiv.org/abs/2406.11794).
## Leaderboard
The DCLM [leaderboard](https://datacomp.ai/dclm/leaderboard) showcases the performance of models trained on various scales and datasets. The leaderboard is updated regularly with the latest submissions from the community.
Below are comparisons of our models with others in the 7B regime.
| Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---------------|--------|--------|---------------|----------|----------|----------|
| **Open weights, closed datasets** | | | | | | |
| Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ✗ | 57.5 | **71.9** | 50.5 |
| Llama3 | 8B | 15T | ✗ | 57.6 | 66.2 | 46.3 |
| Gemma | 8B | 6T | ✗ | 57.8 | 64.3 | 44.6 |
| Phi-3 | 7B | ? | ✗ | **61.0** | 69.9 | **57.9** |
| **Open weights, open datasets** | | | | | | |
| Falcon | 7B | 1T | ✓ | 44.1 | 27.4 | 25.1 |
| OLMo-1.7 | 7B | 2.1T | ✓ | 47.0 | 54.0 | 34.2 |
| MAP-Neo | 7B | 4.5T | ✓ | **50.2** | **57.1** | **40.4** |
| **Models we trained** | | | | | | |
| FineWeb edu | 7B | 0.14T | ✓ | 38.7 | 26.3 | 22.1 |
| FineWeb edu | 7B | 0.28T | ✓ | 41.9 | 37.3 | 24.5 |
| **DCLM-BASELINE** | 7B | 0.14T | ✓ | 44.1 | 38.3 | 25.0 |
| **DCLM-BASELINE** | 7B | 0.28T | ✓ | 48.9 | 50.8 | 31.8 |
| **DCLM-BASELINE** | 7B | 2.6T | ✓ | **57.1** | **63.7** | **45.4** |

## Getting Started
To get started with DCLM, follow these steps:

1. **Clone the repository**:
```bash
git clone https://github.com/mlfoundations/DCLM.git
cd DCLM
```

2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
Before installing the dependencies, make sure cmake, build-essential, and g++ are installed, e.g., by running:
```bash
apt install cmake build-essential
apt install g++-9
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
```
To download additional models and data needed for baseline reproduction, run:
```bash
python setup.py install
```

3. **Set up your environment**:
DCLM uses AWS for storage (and optionally as a compute backend) and Ray for distributed processing.
Ensure you have the necessary environment variables and configurations for AWS and Ray clusters.
We recommend the use of Python 3.10 with DCLM.
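As a rough, non-authoritative sketch, the snippet below shows the kind of shell environment these steps assume; the credential values and region are placeholders for your own setup, not prescribed settings.

```bash
# Hedged sketch of environment setup (values are placeholders; adjust the
# region to wherever your data and clusters live).
export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export AWS_DEFAULT_REGION=us-west-2

python --version   # we recommend Python 3.10
```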
## Selecting Raw Sources

If you are creating a new source:

- Ensure your data is stored in JSONL format (ideally compressed with zstandard).
- Key names should be consistent with those in [here](baselines/core/constants.py).
- Create a reference JSON in [exp_data/datasets/raw_sources](exp_data/datasets/raw_sources) (a hypothetical sketch appears at the end of this section).

If you are selecting a raw source for downstream processing:
- Identify the raw source you intend to use, which corresponds to a dataset reference (i.e., a JSON in [raw_sources](exp_data/datasets/raw_sources)).
- The reference JSON contains the URL to the actual data and other metadata used as input for downstream processing.
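For illustration only, registering a new raw source might look roughly like the following; the file name and field names here are hypothetical, so copy an existing reference JSON in [exp_data/datasets/raw_sources](exp_data/datasets/raw_sources) and follow its actual schema.

```bash
# Hypothetical sketch of a raw-source reference JSON; the field names are
# illustrative only and do not reflect the exact DCLM schema.
mkdir -p exp_data/datasets/raw_sources
cat > exp_data/datasets/raw_sources/my_source.json << 'EOF'
{
  "name": "my_source",
  "dataset_url": "s3://my-bucket/raw/my_source/",
  "notes": "JSONL shards compressed with zstandard"
}
EOF
```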
## Processing the Data

To process raw data, follow these steps:

1. **Define a set of processing steps**:
Create a pipeline config YAML file specifying the operations.
See our [reproduction of C4](baselines/baselines_configs/c4.yaml) for an example.
Further details on defining a pipeline can be found [here](baselines/README.md).

2. **Set up a Ray cluster**:
The data processing script relies on Ray for distributed processing of data. This cluster can be launched either on a single node (for small-scale data processing) or using AWS EC2 instances.

To launch a local cluster, use the following command:
```bash
ray start --head --port 6379
```

To launch a cluster using AWS EC2 instances, use the following:
```bash
ray up <your_cluster_config>
```
where `<your_cluster_config>` is a cluster configuration script that depends on your specific use case. We invite the reader to go over the [Ray documentation](https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-cli.html) for details on how to create this config file.

**Important**: When using EC2 instances, make sure to tear down your cluster after your job finishes, so as to not incur unnecessary costs!
A sample config file can be seen here (make sure to adapt to your needs):
```yaml
cluster_name: test-processing
max_workers: 2
upscaling_speed: 1.0

available_node_types:
    ray.head.default:
        resources: {}
        node_config:
            ImageId: ami-0c5cce1d70efb41f5
            InstanceType: i4i.4xlarge
            IamInstanceProfile:
                # Replace 000000000000 with your IAM account 12-digit ID
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler-v1
    ray.worker.default:
        min_workers: 2
        max_workers: 2
        node_config:
            ImageId: ami-0c5cce1d70efb41f5
            InstanceType: i4i.4xlarge
            IamInstanceProfile:
                # Replace 000000000000 with your IAM account 12-digit ID
                Arn: arn:aws:iam::000000000000:instance-profile/ray-autoscaler-v1

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    cache_stopped_nodes: False

setup_commands:
    - sudo mkfs -t xfs /dev/nvme1n1
    - sudo mount /dev/nvme1n1 /tmp
    - sudo chown -R $USER /tmp
    - sudo chmod -R 777 /tmp
    - wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -O miniconda.sh
    - bash ~/miniconda.sh -f -b -p /tmp/miniconda3/
    - echo 'export PATH="/tmp/miniconda3/bin/:$PATH"' >> ~/.bashrc
    # Include your AWS CREDS here
    - echo 'export AWS_ACCESS_KEY_ID=' >> ~/.bashrc
    - echo 'export AWS_SECRET_ACCESS_KEY=' >> ~/.bashrc
    - pip install --upgrade pip setuptools wheel
    - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
    - pip install boto3==1.26.90
    - pip install s3fs==2022.11.0
    - pip install psutil
    - pip install pysimdjson
    - pip install pyarrow
    - git clone https://github.com/mlfoundations/dclm.git
    - pip install -r dclm/requirements.txt
    - cd dclm && python3 setup.py install
```

3. **Run the processing script**:
To run the processing script, in the case of a local cluster, simply run the following command:
```bash
python3 ray_processing/process.py --source_ref_paths <source_ref_paths> --readable_name <readable_name> --output_dir <output_dir> --config_path <config_path> --source_name <source_name>
```

When using EC2 instances, you need to connect to the cluster and then launch the command:
```bash
# In your local terminal
ray attach <your_cluster_config>

# Inside the cluster EC2 instance
cd dclm
export PYTHONPATH=$(pwd)
python3 ray_processing/process.py --source_ref_paths <source_ref_paths> --readable_name <readable_name> --output_dir <output_dir> --config_path <config_path> --source_name <source_name>
```

4. **Monitor and tear down**:
You can track the progress of data processing via the `global_stats.jsonl` file in the output directory. After the job finishes, you can tear down your cluster via `ray stop` (in the local cluster case) or `ray down <your_cluster_config>` (in the AWS EC2 case). **THIS IS VERY IMPORTANT TO NOT INCUR ADDITIONAL COSTS WHEN USING EC2!**
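As a minimal sketch (assuming the output directory lives on S3 and reusing the placeholders from above):

```bash
# Hedged sketch: check processing progress, then tear down the EC2 cluster.
aws s3 cp <output_dir>/global_stats.jsonl - | tail -n 3   # peek at the latest stats
ray down <your_cluster_config>                            # avoid idle EC2 costs
```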
## Deduplication

To deduplicate the raw text as we have done in DCLM-Baseline, use the tools provided in the [dedup](dedup/) subdirectory. Here we include several Rust tools for deduplication, but we recommend using BFF, located in [dedup/bff](dedup/bff). Specific instructions to run deduplication are contained in the readme of each directory containing the Rust tools.

We note that the code in [dedup](dedup/) specifically refers to inter-document fuzzy deduplication, i.e., identifying near-duplicates across documents in the corpus. Tooling built in Ray to identify exact content and URL duplicates is contained in [ray_processing/dedup_jsonl.py](ray_processing/dedup_jsonl.py) (but we do not use this form of dedup in DCLM-Baseline).
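As a minimal sketch, assuming a standard Cargo layout and that the package/binary is named `bff` (see the [dedup/bff](dedup/bff) README for the actual run flags):

```bash
# Hedged sketch: build the BFF deduplication tool and list its options.
cd dedup/bff
cargo build --release
./target/release/bff --help   # binary name assumes the Cargo package is "bff"
```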
## Tokenize and Shuffle
After processing the raw text, you should convert it into tokenized datasets and perform shuffling for training:

1. **Set up a Ray cluster**:
Set up a Ray cluster in the same way as the processing step.

2. **Run the tokenize and shuffle script**:
```bash
python ray_processing/tokenize_shuffle.py --source_ref_paths <source_ref_paths> --readable_name <readable_name> --output <output_dir> --content_key text --do_sample --default_dataset_yaml <default_dataset_yaml>
```

3. **Tear down**:
Tear down the Ray cluster as in the processing step.

The `tokenize_shuffle.py` script creates a dataset in `webdataset` format, along with a `manifest.jsonl` file. This file is required by the training script, and it contains information on the number of sequences inside each shard of the dataset. If needed, this manifest file can also be created manually, via the following command:
```bash
python -m open_lm.utils.make_wds_manifest --data-dir <data_dir>
```

## Model Training
To train a model using the tokenized dataset:

1. **Run the training script**:
```bash
torchrun --nproc-per-node 8 -m training.train --scale <scale> --logs <logs_dir> [--remote-sync <remote_sync>] [--chinchilla-multiplier <chinchilla_multiplier>] [--clean-exp] [--report-to-wandb]
```
You can expect the following training times per track:

| Scale | Model parameters | Train tokens | Train FLOPs | Train H100 hours | Pool size |
|--------|------------------|--------------|-------------|------------------|-----------|
| 400M-1x| 412M | 8.2B | 2.0e19 | 26 | 137B |
| 1B-1x | 1.4B | 28B | 2.4e20 | 240 | 1.64T |
| 1B-5x | 1.4B | 138B | 1.2e21 | 1200 | 8.20T |
| 7B-1x | 6.9B | 138B | 5.7e21 | 3700 | 7.85T |
| 7B-2x | 6.9B | 276B | 1.1e22 | 7300 | 15.7T |

2. **Monitor and manage your training jobs**:
Use Slurm sbatch scripts or SageMaker for running experiments on various compute infrastructures; a sketch of such an sbatch wrapper is shown below.
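As a non-authoritative sketch, a Slurm wrapper around the training command above might look like the following; the resource requests and the `<scale>`/`<logs_dir>` placeholders are illustrative and should be adapted to your cluster.

```bash
#!/bin/bash
# Hedged sketch of a Slurm sbatch wrapper for the training command above.
# Resource requests and placeholders are illustrative, not prescribed values.
#SBATCH --job-name=dclm-train
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00

torchrun --nproc-per-node 8 -m training.train \
    --scale <scale> --logs <logs_dir> --report-to-wandb
```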
## Evaluation

Evaluate trained models using the following methods:

1. **Preferred Method**:
```bash
python tools/eval_expdb.py --start_idx 0 --end_idx 3 --filters name=<name_filter> --prefix_replacement <prefix_replacement> --num_gpus 8 --output_dir <output_dir> --eval_yaml <eval_yaml>
```

2. **Direct Evaluation**:
```bash
torchrun --nproc_per_node <num_gpus> eval/eval_openlm_ckpt.py --checkpoint <checkpoint> --eval-yaml <eval_yaml> --config <config> --model <model> --output-file <output_file>
```

## Submission
When you have finished training and evaluating your model, a model eval JSON file will have been generated in [exp_data/evals](exp_data/evals).
You can now open a pull request to the main repository to share your results with the team and submit it to the leaderboard.
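A minimal sketch of that workflow (the branch and file names here are your own):

```bash
# Hedged sketch: submit your eval results via a pull request.
git checkout -b my-dclm-submission
git add exp_data/evals/<your_eval_json>
git commit -m "Add evaluation results for <your_run_name>"
git push origin my-dclm-submission
# then open a pull request against mlfoundations/dclm on GitHub
```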
## Contributing

We welcome contributions to improve the DCLM framework. Please follow our [contributing guide](contributing.md) for submitting pull requests and reporting issues.

## Downloading Artifacts
### Datasets
We provide multiple datasets, both as starting points for each of the competition scales and as the results of our processing pipeline.
- The dataset pools for the competition stages are available at HuggingFace, with different repositories for the [400m-1x](https://huggingface.co/datasets/mlfoundations/dclm-pool-400m-1x), [1b-1x](https://huggingface.co/datasets/mlfoundations/dclm-pool-1b-1x), [1b-5x](https://huggingface.co/datasets/mlfoundations/dclm-pool-1b-5x), [7b-1x](https://huggingface.co/datasets/mlfoundations/dclm-pool-7b-1x) and [7b-2x](https://huggingface.co/datasets/mlfoundations/dclm-pool-7b-2x) scales. All these pools contain raw data and can be processed with the steps outlined above. All of these are subsets of our entire raw pool, [DCLM-pool](https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/index.html), which is available via the CommonCrawl S3 bucket.
- Our final processed dataset, DCLM-Baseline, is available on Huggingface in both [zstd compressed jsonl](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) and [parquet](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet) formats (see the download sketch after this list). The former version is also available on the CommonCrawl S3 bucket, accessed via the instructions [here](https://data.commoncrawl.org/contrib/datacomp/DCLM-baseline/index.html).
- We also provide a version of our dataset that performs all the steps of our preprocessing except the final one (namely, the fastText filtering). This version, called DCLM-RefinedWeb, is also available on the CommonCrawl S3 bucket, with instructions available [here](https://data.commoncrawl.org/contrib/datacomp/DCLM-refinedweb/index.html).
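As a minimal sketch, the DCLM-Baseline dataset can be fetched from Hugging Face with `huggingface-cli`; the local directory name is arbitrary, and for a dataset this large you may want to add `--include` patterns to download only a subset.

```bash
# Hedged sketch: download the DCLM-Baseline dataset from Hugging Face.
pip install -U huggingface_hub
huggingface-cli download mlfoundations/dclm-baseline-1.0 \
    --repo-type dataset --local-dir dclm-baseline-1.0
```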
### Pretrained Models
We provide links to models pretrained using our dataset via the DCLM collection on Huggingface, found [here](https://huggingface.co/collections/mlfoundations/dclm-669938432ef5162d0d0bc14b). These models can be downloaded and evaluated using the OpenLM library.
## How to Cite Us
If you use our dataset or models in your research, please cite us as follows:
```bibtex
@article{li2024datacomplm,
title={DataComp-LM: In search of the next generation of training sets for language models},
author={Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton and Marianna Nezhurina and Amro Abbas and Cheng-Yu Hsieh and Dhruba Ghosh and Josh Gardner and Maciej Kilian and Hanlin Zhang and Rulin Shao and Sarah Pratt and Sunny Sanyal and Gabriel Ilharco and Giannis Daras and Kalyani Marathe and Aaron Gokaslan and Jieyu Zhang and Khyathi Chandu and Thao Nguyen and Igor Vasiljevic and Sham Kakade and Shuran Song and Sujay Sanghavi and Fartash Faghri and Sewoong Oh and Luke Zettlemoyer and Kyle Lo and Alaaeldin El-Nouby and Hadi Pouransari and Alexander Toshev and Stephanie Wang and Dirk Groeneveld and Luca Soldaini and Pang Wei Koh and Jenia Jitsev and Thomas Kollar and Alexandros G. Dimakis and Yair Carmon and Achal Dave and Ludwig Schmidt and Vaishaal Shankar},
year={2024},
journal={arXiv preprint arXiv:2406.11794}
}
```
When using the DCLM evaluation suite, please make sure to cite all the original evaluation papers ([evaluation_bibtex](bib/evalutaion.bib)). When using DCLM for training, please make sure to cite the main training framework dependencies as well ([training_bibtex](bib/training.bib)).
## License
This project is licensed under the MIT License. See the [license](LICENSE.txt) file for details.