https://github.com/WashimNeupane/datacooker
Curates a mix of datasets for llm training.
- Host: GitHub
- URL: https://github.com/WashimNeupane/datacooker
- Owner: WashimNeupane
- License: MIT
- Created: 2024-03-22T11:31:52.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-04-12T10:39:30.000Z (9 months ago)
- Last Synced: 2024-04-12T17:17:54.669Z (9 months ago)
- Language: Python
- Size: 41 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Datacooker - Curates a mix of datasets for llm training. (Building / Datasets)
README
# datacooker
Curates a mix of datasets for LLM training, based on the Dolma library (see https://arxiv.org/abs/2402.00159).

## Dataset Subset Downloader
This script allows you to download a subset of a Hugging Face dataset based on a specified language (e.g., English). It utilizes the Dolma library for processing Wikipedia data and filtering the dataset during the loading process.
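The language filtering described above can be sketched as a small generator that is applied to a streamed dataset, so that only matching records are materialized. This is a minimal illustration, not the script's actual implementation; the `language` field name and the dataset identifiers in the comment are assumptions to adapt to your dataset's schema.

```python
from itertools import islice


def take_language_subset(records, lang="en", limit=1000):
    """Yield up to `limit` records whose 'language' field equals `lang`.

    The 'language' field name is a hypothetical example; adjust it to
    match the schema of the dataset you are loading.
    """
    matching = (r for r in records if r.get("language") == lang)
    yield from islice(matching, limit)


# With the Hugging Face `datasets` library, this can be applied to a
# streamed dataset so only the filtered subset is downloaded, e.g.:
#
#   from datasets import load_dataset
#   ds = load_dataset("some/dataset", split="train", streaming=True)
#   subset = list(take_language_subset(ds, lang="en", limit=1000))
```

Streaming keeps memory use flat because records are filtered as they arrive rather than after a full download.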
## Requirements
- [wikiextractor](https://github.com/santhoshtr/wikiextractor.git)
- requests
- smart_open
- tqdm
- dolma
- datasets

You can install these dependencies using pip:
```bash
pip install -r requirements.txt
```

## Dolma Scripts to Run
First, you need to extract Wikipedia data and create a Wikipedia mix using Dolma:
```bash
python make_wikipedia.py \
--output data/wikipedia \
--date {latestDate in YYYYMMDD format} \
--lang simple \
--processes 16 \
--overwrite
```

Then, use the Dolma script to mix the Wikipedia data:
```bash
dolma -c config/wikipeida-mix.yaml mix --processes 16
```

You can subsample with a probability p between 0 and 1, or oversample with p > 1; for oversampling, the effective per-document probability is 1/p. Run the script as follows:
```bash
python sampling.py -s 'data/dir/location1/*.gz' 'data/dir/location2/*.gz' -p 1.65 -d data/mixed -n 16
```

## Notes
For private datasets, you may need to log in via the Hugging Face CLI before downloading.
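The sub-/oversampling rule used by the mixing step above can be sketched as follows. This is a generic illustration of probability-weighted sampling under the stated semantics (keep with probability p when p < 1, emit extra copies when p > 1), not necessarily the exact rule `sampling.py` implements.

```python
import math
import random


def sample_documents(docs, p, rng=None):
    """Sub- or oversample a document stream.

    p in (0, 1]: keep each document with probability p (subsampling).
    p > 1:       emit floor(p) copies of each document, plus one extra
                 copy with probability p - floor(p) (oversampling).
    """
    rng = rng or random.Random(0)  # seeded for reproducible mixes
    for doc in docs:
        if p <= 1:
            if rng.random() < p:
                yield doc
        else:
            whole = math.floor(p)
            extra = 1 if rng.random() < p - whole else 0
            for _ in range(whole + extra):
                yield doc
```

For example, `p=1.65` yields each document either once or twice, averaging 1.65 copies over the stream, which matches the `-p 1.65` invocation shown above.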