https://github.com/pjox/cc-downloader
A polite and user-friendly downloader for Common Crawl data
- Host: GitHub
- URL: https://github.com/pjox/cc-downloader
- Owner: commoncrawl
- License: apache-2.0
- Created: 2024-06-10T15:40:01.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-03-13T22:34:35.000Z (2 months ago)
- Last Synced: 2025-03-16T16:02:06.073Z (2 months ago)
- Topics: commoncrawl, downloader, rust
- Language: Rust
- Homepage:
- Size: 125 KB
- Stars: 34
- Watchers: 7
- Forks: 1
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-APACHE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
# CC-Downloader
This is an experimental, polite downloader for Common Crawl data, written in `rust`. This tool is intended for use outside of AWS.
## Todo
- [ ] Add Python bindings
- [ ] Add more tests
- [ ] Handle unrecoverable errors

## Installation
For now, the only supported way to install the tool is to use `cargo`. For this you need to have `rust` installed. You can install `rust` by following the instructions on the [official website](https://www.rust-lang.org/tools/install).
After installing `rust`, `cc-downloader` can be installed with the following command:
```bash
cargo install cc-downloader
```
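Once installed, you can check that the binary is on your `PATH` (a quick sanity check rather than an explicit step from the README):

```bash
# Print the installed version; the -V/--version flag is listed in the help output below
cc-downloader -V
```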

## Usage

```text
➜ cc-downloader -h
A polite and user-friendly downloader for Common Crawl data.

Usage: cc-downloader [COMMAND]

Commands:
  download-paths  Download paths for a given crawl
  download        Download files from a crawl
  help            Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

------
➜ cc-downloader download-paths -h
Download paths for a given crawl

Usage: cc-downloader download-paths

Arguments:
  Crawl reference, e.g. CC-MAIN-2021-04
  Data type [possible values: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table]
  Destination folder

Options:
  -h, --help  Print help

------
➜ cc-downloader download -h
Download files from a crawl

Usage: cc-downloader download [OPTIONS]

Arguments:
  Path file
  Destination folder

Options:
  -f, --files-only  Download files without the folder structure. This only works for WARC/WET/WAT files
  -n, --numbered    Enumerate output files for compatibility with Ungoliant Pipeline. This only works for WET files
  -t, --threads     Number of threads to use [default: 10]
  -r, --retries     Maximum number of retries per file [default: 1000]
  -p, --progress    Print progress
  -h, --help        Print help
```
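Putting the two subcommands together, a typical session might look like the sketch below. The crawl reference and data types come from the help text above; the destination folders and the `wet.paths.gz` file name are illustrative placeholders rather than names prescribed by the README.

```bash
# 1. Fetch the list of WET path files for a crawl (the paths/ folder is a placeholder)
cc-downloader download-paths CC-MAIN-2021-04 wet paths/

# 2. Download the files listed in the resulting path file, printing progress
#    (the exact path-file name depends on what download-paths produced)
cc-downloader download paths/wet.paths.gz data/ --progress
```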

## Number of threads

The number of threads can be set using the `-t` flag. The default value is 10. It is advised to keep the default to avoid being blocked by the server: if you make too many requests in a short period of time, you will start receiving `403` errors, which are unrecoverable and cannot be retried by the downloader.
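If you do change the thread count, `-t` is passed per `download` invocation and can be combined with the other flags shown above. A minimal sketch (the path file and destination folder are again placeholders):

```bash
# Explicitly request the default 10 threads and cap retries at 500 per file
cc-downloader download paths/wet.paths.gz data/ -t 10 -r 500
```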