Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/locuslab/scaling_laws_data_filtering
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/locuslab/scaling_laws_data_filtering
- Owner: locuslab
- Created: 2024-04-09T17:43:44.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-09T20:18:16.000Z (9 months ago)
- Last Synced: 2024-04-18T01:57:50.534Z (9 months ago)
- Language: Python
- Size: 12.7 KB
- Stars: 36
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Scaling Laws for Data Filtering
### Registering data buckets
The buckets should be registered in the following file: `all_paths_128.py`
This file contains the following information:
- `path`: The path to the data file that has the evaluation results for a model trained on that dataset.
- `samples_per_epoch_dict`: The number of samples per epoch for the corresponding dataset.
- `match_with_dict`: Whether the evaluation is done at a fixed epoch interval or at a fixed sample interval.
- `subsample_every_dict`: Use this if you want to take the average of every `k` evaluations. This is usually only useful when the evaluation is done at a fixed sample interval.

### Estimating data bucket parameters
This step involves estimating the scaling parameters for each bucket of interest.
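The exact functional form lives in `objective.py`; as a rough sketch (the parameter names and the specific power-law form here are illustrative, not the repo's exact parameterization), each bucket's loss curve can be modeled as a power law in the number of training samples, decaying toward an irreducible loss `d`:

```python
import numpy as np

def bucket_loss(n, a, b, d):
    """Illustrative per-bucket scaling law: loss decays as a power law
    in the number of training samples n, down to an irreducible loss d.
    The coefficients (a, b) are what the grid search estimates."""
    return b * np.power(n, -a) + d

# Evaluate the sketch at a few sample counts.
samples = np.array([1e6, 1e7, 1e8])
losses = bucket_loss(samples, a=0.15, b=5.0, d=0.1)
```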
### Grid search to find the bucket scaling parameters
Grid search is performed to find the best scaling parameters for each bucket, using `grid_search.py`; the objective minimized in the grid search is defined in `objective.py`. We chose grid search because of the instabilities we observed with SciPy-based optimization methods.
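A minimal sketch of the grid-search step, assuming a hypothetical power-law form with irreducible loss `d` (the real objective and parameter grids live in `objective.py` and `grid_search.py`; names and ranges here are illustrative):

```python
import numpy as np

def power_law(n, a, b, d):
    # Hypothetical scaling-law form; the repo's real objective is in objective.py.
    return b * np.power(n, -a) + d

def grid_search(n, y, a_upper=0.02, d=0.1, steps=101):
    """Sweep a over [~0, a_upper] and b over a fixed range, keeping the
    pair with the lowest squared error against the observed losses y."""
    best = (None, None, np.inf)
    for a in np.linspace(1e-4, a_upper, steps):
        for b in np.linspace(0.1, 10.0, steps):
            err = np.sum((power_law(n, a, b, d) - y) ** 2)
            if err < best[2]:
                best = (a, b, err)
    return best

# Fit synthetic evaluation results generated from known parameters.
n = np.array([1e5, 1e6, 1e7, 1e8])
y = power_law(n, a=0.015, b=2.0, d=0.1)
a_hat, b_hat, _ = grid_search(n, y)
```

Grid search trades precision for robustness here: every candidate is evaluated, so there is no optimizer state to diverge, which is the motivation stated above for avoiding SciPy-based optimizers.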
### Objective Functions
This file implements scaling laws based on FADU, and also those inspired by the work on Scaling Data Constrained Language Models.
- `func_effective_utility`: This is the function that uses the effective utility formulation as proposed in our work.
- `func_effective_data`: This is the function that uses the formulation of effective data from Scaling Data Constrained Language Models.

```
python process_128_grid.py --a_upper 0.02 --objective effective_utility --d 0.1
```
Here `a_upper` gives an upper limit to the grid search for `a`, and `d` is the irreducible loss. Refer to `ablations/finding_a.py` if you want to jointly minimize `a` across the pools.
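The repo's exact implementations of these objectives are in `objective.py`; as a rough sketch of the effective-data idea from Scaling Data Constrained Language Models (parameter names here are illustrative), repeated data contributes less than fresh data, with diminishing returns:

```python
import math

def effective_data(unique, repeats, r_star):
    """Illustrative effective-data formulation (after Scaling Data
    Constrained Language Models): each additional repetition of the
    unique pool adds less value, saturating at a rate set by r_star."""
    return unique + unique * r_star * (1.0 - math.exp(-repeats / r_star))

# Zero repeats: only the unique pool counts. Many repeats: the value
# saturates at (1 + r_star) times the unique pool.
fresh = effective_data(1e6, 0, r_star=5.0)
many = effective_data(1e6, 100, r_star=5.0)
```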
Copy the obtained scaling parameters to the `results/parameter_values.py` file, and give it an appropriate key name.

### Finding best bucket combination
```
python estimate_best_pool.py --key given_key_name
```
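A minimal sketch of what this selection step might do, assuming fitted `(a, b, d)` triples per bucket under the same hypothetical power-law form as above (bucket names and values are made up; the real logic lives in `estimate_best_pool.py`):

```python
def predicted_loss(n, params):
    # Hypothetical power-law form with irreducible loss d.
    a, b, d = params
    return b * n ** (-a) + d

# Hypothetical fitted scaling parameters for three data buckets,
# keyed the way results/parameter_values.py keys them.
fitted = {
    "bucket_a": (0.020, 2.0, 0.1),
    "bucket_b": (0.015, 1.8, 0.1),
    "bucket_c": (0.010, 2.5, 0.1),
}

budget = 1e8  # target number of training samples
best_bucket = min(fitted, key=lambda k: predicted_loss(budget, fitted[k]))
```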