https://github.com/eleutherai/pile_dedupe
Pile Deduplication Code
- Host: GitHub
- URL: https://github.com/eleutherai/pile_dedupe
- Owner: EleutherAI
- License: MIT
- Created: 2023-05-15T06:35:13.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-05-15T06:41:26.000Z (about 3 years ago)
- Last Synced: 2025-03-28T05:31:30.498Z (about 1 year ago)
- Language: Python
- Size: 16.6 KB
- Stars: 17
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Pile Dedupe
## Prerequisites
Download the [Pile distribution](https://the-eye.eu/public/AI/pile/). The relevant files are in the `train` directory.
## Install
```
git clone https://github.com/EleutherAI/pile_dedupe.git
cd pile_dedupe
pip install -r requirements.txt
# Replace PILE_LOCATION with the path to the downloaded train directory
ln -s PILE_LOCATION pile
```
## Usage
| Step | Overview | Details |
| ---- | ------- | ---------|
| 1 | Prerequisites | Download the Pile. |
| 2 | Install | Clone the repo, install the requirements, and symlink `pile` to the downloaded `train` directory. |
| 3 | Generate Minhashes | `python generate_minhashes.py --process_count PROCESS_COUNT` One process per logical core is recommended. |
| 4 | Verify Minhashes (Optional) | `python working_with_minhashes.py` |
| 5 | Dedupe Pile | `python dedupe.py --lsh_threshold LSH_THRESHOLD` The default `lsh_threshold` (0.5) is fairly safe to keep if you don't mind slightly more aggressive deduplication. |
| 6 | Inspect duplicates | `python working_with_duplicates.py --inspect_duplicates` |
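
For readers unfamiliar with the terms in steps 3 and 5, here is a minimal, standard-library sketch of the general MinHash idea: each document is reduced to a fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity that a threshold such as 0.5 is compared against. This is an illustration only, not the code in `generate_minhashes.py` or `dedupe.py`; the shingle size, the signature length, the helper names, and the sample documents are all assumptions made for the example, and a real LSH index avoids the pairwise comparison shown here.

```python
import hashlib

NUM_PERM = 128   # signature length (illustrative choice, not the repo's setting)
THRESHOLD = 0.5  # mirrors the default lsh_threshold mentioned in step 5


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Made-up sample documents: a and b differ in one word, c is unrelated.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated text about training large language models on data"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))

print("a vs b flagged as duplicates:", estimated_jaccard(sig_a, sig_b) >= THRESHOLD)
print("a vs c flagged as duplicates:", estimated_jaccard(sig_a, sig_c) >= THRESHOLD)
```

Lowering the threshold flags more pairs as duplicates, which is the "extra dedupe" trade-off noted in step 5.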
## I'm Done - Give Me A Generator
```python
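from yield_deduped_pile import yield_deduped_pile

pile_directory = "pile"  # symlink created during install
duplicates_directory = "pile_duplicates"

# The function is a generator, so iterate over it to receive documents.
for document in yield_deduped_pile(pile_directory, duplicates_directory):
    ...  # process each deduplicated document here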
```
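
Because a generator produces items lazily, a few documents can be inspected without reading the whole dataset. The sketch below shows the pattern with `itertools.islice`, using a hypothetical stand-in named `fake_deduped_pile` so that it runs without the Pile on disk. It assumes `yield_deduped_pile` yields one document per iteration, which the README does not spell out.

```python
from itertools import islice


def fake_deduped_pile():
    """Hypothetical stand-in for yield_deduped_pile: yields one document at a time."""
    for i in range(1_000_000):
        yield f"document {i}"


# islice stops after three items, so the remaining documents are never produced.
first_three = list(islice(fake_deduped_pile(), 3))
print(first_three)
```

Under that assumption, swapping `fake_deduped_pile()` for `yield_deduped_pile(pile_directory, duplicates_directory)` should give the same kind of preview.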
## Further Documentation
Each source file begins with a description of what it does.