https://github.com/eleutherai/pile-cc-filtering
The code used to filter CC data for The Pile
- Host: GitHub
- URL: https://github.com/eleutherai/pile-cc-filtering
- Owner: EleutherAI
- License: MIT
- Created: 2020-10-18T18:19:23.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-10-22T16:26:49.000Z (over 5 years ago)
- Last Synced: 2025-03-28T05:25:01.821Z (about 1 year ago)
- Language: Python
- Size: 4.88 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pile-cc-filtering
This repository contains the code used to filter Common Crawl (CC) data for Pile v1.
The procedure is largely based on the quality filtering described in the GPT-3 paper: we train a fastText classifier to distinguish Pile data from CC data.
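For context, the GPT-3 paper keeps a CC document when a Pareto-distributed random draw exceeds one minus the classifier score, with shape parameter 9. The snippet below is a minimal sketch of that published rule, not necessarily what this repository implements; the function name `keep_document` and its parameters are hypothetical.

```python
import random


def keep_document(score, alpha=9.0, rng=random):
    """Return True if a document with classifier score in [0, 1] is kept.

    Sketch of the rule described in the GPT-3 paper:
        pareto(alpha) > 1 - score
    Higher-scoring documents are kept more often, but low-scoring
    documents still have a small chance of passing.
    """
    # random.paretovariate samples a classical Pareto (values >= 1).
    # Subtracting 1 gives the Lomax form that numpy.random.pareto samples.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score
```

With this rule the filter is stochastic rather than a hard threshold, so some documents that the classifier scores poorly still make it into the output.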
## Usage
1. Get fasttext training data from [The Pile](https://github.com/EleutherAI/The-Pile) (see instructions there)
2. Get fasttext training data from CC (downloaded using [cc_downloader](https://github.com/leogao2/commoncrawl_downloader)) by running:
```
python make_cc_fasttext.py path/to/cc_data
```
3. Concatenate the two training data files into a single file called `fasttext_training.txt`
4. Run:
```
python train_fasttext.py
```
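
Steps 3 and 4 can be sketched in Python as follows. This is an illustration under assumptions, not the contents of `train_fasttext.py`: the helper names, the labels `pile` and `cc`, and the model path `cc_filter.bin` are hypothetical, and the scripts in this repository may already emit labelled lines. What is standard is the fastText supervised format (one example per line, prefixed with `__label__<name>`) and the `fasttext.train_supervised` call.

```python
from pathlib import Path


def to_fasttext_line(text, label):
    """Format one document as a fastText supervised training line."""
    # fastText reads one example per line, so collapse newlines and
    # repeated whitespace inside the document into single spaces.
    return f"__label__{label} {' '.join(text.split())}"


def concatenate_training_files(inputs, output="fasttext_training.txt"):
    """Concatenate training files line by line and return the line count."""
    lines = []
    for path in inputs:
        with open(path, encoding="utf-8") as f:
            # Skip blank lines; strip the newline so that a file without a
            # trailing newline cannot merge with the next file's first line.
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    Path(output).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def train_classifier(training_file="fasttext_training.txt",
                     model_path="cc_filter.bin"):
    """Train a supervised fastText model with default hyperparameters."""
    import fasttext  # third-party package: pip install fasttext

    model = fasttext.train_supervised(input=training_file)
    model.save_model(model_path)
    return model
```

Reading the inputs line by line instead of joining raw bytes avoids a common pitfall with plain concatenation: if the first file lacks a trailing newline, its last example would otherwise be fused with the first example of the second file.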