Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/locuslab/scaling_laws_data_filtering
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/locuslab/scaling_laws_data_filtering
- Owner: locuslab
- Created: 2024-04-09T17:43:44.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-04-09T20:18:16.000Z (9 months ago)
- Last Synced: 2024-04-18T01:57:50.534Z (9 months ago)
- Language: Python
- Size: 12.7 KB
- Stars: 36
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Scaling Laws for Data Filtering
### Registering data buckets
The buckets should be registered in the following file: `all_paths_128.py`
This file contains the following information:
- `path`: The path to the data file that has the evaluation results for a model trained on that dataset.
- `samples_per_epoch_dict`: The number of samples per epoch for the corresponding dataset.
- `match_with_dict`: Whether the evaluation is done at a fixed epoch interval or at a fixed sample interval.
- `subsample_every_dict`: Use this if you want to take the average of every `k` evaluations. This is usually only useful when the evaluation is done at a fixed sample interval.

### Estimating data bucket parameters
This step involves estimating the scaling parameters for each bucket of interest.
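The exact functional form lives in `objective.py`; as a rough sketch (the parameter names and the specific power-law form here are illustrative, not the repo's exact parameterization), each bucket's loss curve can be modeled as a power law in the number of training samples, decaying toward an irreducible loss `d`:

```python
import numpy as np

def bucket_loss(n, a, b, d):
    """Illustrative per-bucket scaling law: loss decays as a power law
    in the number of training samples n, down to an irreducible loss d.
    The coefficients (a, b) are what the grid search estimates."""
    return b * np.power(n, -a) + d

# Evaluate the sketch at a few sample counts.
samples = np.array([1e6, 1e7, 1e8])
losses = bucket_loss(samples, a=0.15, b=5.0, d=0.1)
```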
### Grid search to find the bucket scaling parameters
Grid search is performed to find the best scaling parameters for each bucket, using `grid_search.py`; the objective minimized in the grid search is defined in `objective.py`. We chose grid search because of the instabilities we observed with SciPy-based optimization methods.
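A minimal sketch of the grid-search step, assuming a hypothetical power-law form with irreducible loss `d` (the real objective and parameter grids live in `objective.py` and `grid_search.py`; names and ranges here are illustrative):

```python
import numpy as np

def power_law(n, a, b, d):
    # Hypothetical scaling-law form; the repo's real objective is in objective.py.
    return b * np.power(n, -a) + d

def grid_search(n, y, a_upper=0.02, d=0.1, steps=101):
    """Sweep a over [~0, a_upper] and b over a fixed range, keeping the
    pair with the lowest squared error against the observed losses y."""
    best = (None, None, np.inf)
    for a in np.linspace(1e-4, a_upper, steps):
        for b in np.linspace(0.1, 10.0, steps):
            err = np.sum((power_law(n, a, b, d) - y) ** 2)
            if err < best[2]:
                best = (a, b, err)
    return best

# Fit synthetic evaluation results generated from known parameters.
n = np.array([1e5, 1e6, 1e7, 1e8])
y = power_law(n, a=0.015, b=2.0, d=0.1)
a_hat, b_hat, _ = grid_search(n, y)
```

Grid search trades precision for robustness here: every candidate is evaluated, so there is no optimizer state to diverge, which is the motivation stated above for avoiding SciPy-based optimizers.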
### Objective Functions
This file implements scaling laws based on FADU, and also those inspired by the work on Scaling Data Constrained Language Models.
- `func_effective_utility`: This is the function that uses the effective utility formulation as proposed in our work.
- `func_effective_data`: This is the function that uses the formulation of effective data from Scaling Data Constrained Language Models.

```
python process_128_grid.py --a_upper 0.02 --objective effective_utility --d 0.1
```
Here `a_upper` gives an upper limit to the grid search for `a`, and `d` is the irreducible loss. Refer to `ablations/finding_a.py` if you want to jointly minimize `a` across the pools.
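The repo's exact implementations of these objectives are in `objective.py`; as a rough sketch of the effective-data idea from Scaling Data Constrained Language Models (parameter names here are illustrative), repeated data contributes less than fresh data, with diminishing returns:

```python
import math

def effective_data(unique, repeats, r_star):
    """Illustrative effective-data formulation (after Scaling Data
    Constrained Language Models): each additional repetition of the
    unique pool adds less value, saturating at a rate set by r_star."""
    return unique + unique * r_star * (1.0 - math.exp(-repeats / r_star))

# Zero repeats: only the unique pool counts. Many repeats: the value
# saturates at (1 + r_star) times the unique pool.
fresh = effective_data(1e6, 0, r_star=5.0)
many = effective_data(1e6, 100, r_star=5.0)
```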
Copy the obtained scaling parameters to the `results/parameter_values.py` file, and give it an appropriate key name.

### Finding best bucket combination
```
python estimate_best_pool.py --key given_key_name
```
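A minimal sketch of what this selection step might do, assuming fitted `(a, b, d)` triples per bucket under the same hypothetical power-law form as above (bucket names and values are made up; the real logic lives in `estimate_best_pool.py`):

```python
def predicted_loss(n, params):
    # Hypothetical power-law form with irreducible loss d.
    a, b, d = params
    return b * n ** (-a) + d

# Hypothetical fitted scaling parameters for three data buckets,
# keyed the way results/parameter_values.py keys them.
fitted = {
    "bucket_a": (0.020, 2.0, 0.1),
    "bucket_b": (0.015, 1.8, 0.1),
    "bucket_c": (0.010, 2.5, 0.1),
}

budget = 1e8  # target number of training samples
best_bucket = min(fitted, key=lambda k: predicted_loss(budget, fitted[k]))
```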