https://github.com/ggcr/codecurator
A Rust tool for curating and processing GitHub repos as datasets 🦀
https://github.com/ggcr/codecurator
datasets github rust
Last synced: about 1 month ago
JSON representation
A Rust tool for curating and processing GitHub repos as datasets 🦀
- Host: GitHub
- URL: https://github.com/ggcr/codecurator
- Owner: ggcr
- Created: 2025-05-01T16:24:57.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-01T10:20:38.000Z (12 months ago)
- Last Synced: 2025-07-01T11:31:17.490Z (12 months ago)
- Topics: datasets, github, rust
- Language: Rust
- Homepage:
- Size: 240 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# CodeCurator
An end-to-end tool for curating GitHub repositories into structured code datasets.
- **Fast parallel processing** - Download and extract with configurable workers
- **Smart filtering** - Only processes programming files using GitHub Linguist
- **GPT-2 tokenization** - Ready-to-use token counts for ML workflows
- **Efficient caching** - Uses ETags to avoid re-downloading unchanged repos
Perfect for curating training data, running code analysis, or creating repository archives.
### Installation
```bash
cargo install --path .
```
### Usage
Create an input file with one GitHub repository per line:
```jsonl
"microsoft/vscode"
"vercel/next.js"
"tensorflow/tensorflow"
"bitcoin/bitcoin"
"rust-lang/rust"
"kubernetes/kubernetes"
"facebook/react"
"docker/compose"
"ansible/ansible"
"elastic/elasticsearch"
```
**Download repositories:**
```bash
codecurator download ./configs/repos.jsonl
```
This creates ZIP files in `./zip/repos/`. Downloads from `main` branch first, falls back to `master` if needed.
**Extract and process:**
```bash
codecurator extract ./configs/repos.jsonl --languages Python Rust Verilog
```
Processes all programming files from `./zip/repos/`, tokenizes content, and outputs structured data to `./jsonl/repos/`.
**Deduplication:**
```bash
codecurator dedupe ./configs/repos.jsonl
```
Hashes the contents of all files and deduplicates them. Stores the final data to `/dedup/` by default.
**Statistics:**
```bash
$ bash stats/count_records.sh ./jsonl/
Total records: 110645
$ bash stats/count_tokens.sh ./dedup/
Total tokens: 346574283
```
