https://github.com/eleutherai/pile-cc-filtering

The code used to filter CC data for The Pile

# pile-cc-filtering

This repository contains the code used to filter Common Crawl (CC) data for Pile v1.

The procedure is largely based on the filtering described in the GPT-3 paper: we train a fastText classifier to distinguish Pile data from CC data.
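For context, the GPT-3 paper's quality filter keeps a document when a Pareto-distributed noise draw exceeds one minus the classifier score, with a shape parameter of 9. The sketch below illustrates that rule using only the Python standard library. The function name `keep_document` is hypothetical, and whether this repository applies the same rule or the same `alpha` is an assumption, not something this README states.

```python
import random
from typing import Optional

ALPHA = 9.0  # shape parameter reported in the GPT-3 paper; assumed here, not taken from this repo


def keep_document(score: float, alpha: float = ALPHA,
                  rng: Optional[random.Random] = None) -> bool:
    """Decide whether to keep a document given its classifier score.

    `score` is assumed to be the classifier's probability, in [0, 1],
    that the document resembles the high-quality (Pile-like) class.
    """
    if rng is None:
        rng = random.Random()
    # random.paretovariate samples a classical Pareto (values >= 1);
    # subtracting 1 gives the Lomax form that np.random.pareto returns.
    noise = rng.paretovariate(alpha) - 1.0
    # High-scoring documents are almost always kept; low-scoring ones
    # are kept only occasionally, so some of them still get through.
    return noise > 1.0 - score
```

Because the comparison is randomized rather than a hard threshold, a small fraction of low-scoring documents survives the filter.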

## Usage

1. Get fasttext training data from [The Pile](https://github.com/EleutherAI/The-Pile) (see instructions there)
2. Get fasttext training data from CC (downloaded using [cc_downloader](https://github.com/leogao2/commoncrawl_downloader)) by running:
```
python make_cc_fasttext.py path/to/cc_data
```
3. Concatenate the training data files from steps 1 and 2 into a single file named `fasttext_training.txt`
4. Run:
```
python train_fasttext.py
```
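Step 3 can be done with any tool that joins text files. As one possible approach, the following minimal Python sketch merges the files and shuffles the lines so that the two classes are interleaved. The helper name `build_training_file` and the shuffling step are assumptions for illustration and are not part of this repository.

```python
import random


def build_training_file(sources, out_path="fasttext_training.txt", seed=0):
    """Merge several fastText training files into one shuffled file.

    `sources` is a list of paths to text files with one training example
    per line. Returns the number of lines written.
    """
    lines = []
    for src in sources:
        with open(src, encoding="utf-8") as f:
            # Keep one example per line and drop blank lines.
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    # Optional: shuffle so neither class forms one contiguous block.
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return len(lines)
```

A call such as `build_training_file(["pile_fasttext.txt", "cc_fasttext.txt"])` would write the merged file, where both input names are placeholders for whatever steps 1 and 2 produced. The sketch holds every line in memory, so very large inputs may call for a streaming approach instead.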