{"id":28626096,"url":"https://github.com/commoncrawl/cc-downloader","last_synced_at":"2025-06-12T08:40:54.273Z","repository":{"id":244571244,"uuid":"813166962","full_name":"commoncrawl/cc-downloader","owner":"commoncrawl","description":"A polite and user-friendly downloader for Common Crawl data","archived":false,"fork":false,"pushed_at":"2025-05-07T14:48:36.000Z","size":137,"stargazers_count":47,"open_issues_count":1,"forks_count":2,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-06-08T22:13:51.215Z","etag":null,"topics":["commoncrawl","downloader","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE-APACHE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-06-10T15:40:01.000Z","updated_at":"2025-06-07T20:30:01.000Z","dependencies_parsed_at":"2024-06-17T18:48:10.310Z","dependency_job_id":"c4f411a1-dc21-45eb-b6b9-05e919edd540","html_url":"https://github.com/commoncrawl/cc-downloader","commit_stats":null,"previous_names":["pjox/cc-downloader","commoncrawl/cc-downloader"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-downloader","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-downloader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-downloader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-downloader
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-downloader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-downloader/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-downloader/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259430800,"owners_count":22856354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["commoncrawl","downloader","rust"],"created_at":"2025-06-12T08:40:53.662Z","updated_at":"2025-06-12T08:40:54.266Z","avatar_url":"https://github.com/commoncrawl.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"# CC-Downloader\n\nThis is an experimental polite downloader for Common Crawl data written in `rust`. This tool is intended for use outside of AWS.\n\n## Todo\n\n- [ ] Add Python bindings\n- [ ] Add more tests\n- [ ] Handle unrecoverable errors\n\n## Installation\n\nYou can install `cc-downloader` via our pre-built binaries, or by compiling it from source.\n\n### Pre-built binaries\n\nYou can find our pre-built binaries on our [GitHub releases page](https://github.com/commoncrawl/cc-downloader/releases). They are available for `Linux`, `macOS`, and `Windows`, in `x86_64` and `aarch64` architectures (Windows is only supported in `x86_64`). 
To use them, select and download the correct binary for your system.\n\n```bash\nwget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].[COMPRESSION-FORMAT]\n```\n\nAfter downloading it, please verify the checksum of the binary. You can find the checksum file in the same location as the binary. The checksum is generated using `sha512sum`. You can verify it by running the following commands:\n\n```bash\nwget https://github.com/commoncrawl/cc-downloader/releases/download/[VERSION]/cc-downloader-[VERSION]-[ARCH]-[OS].sha512\nsha512sum -c cc-downloader-[VERSION]-[ARCH]-[OS].sha512\n```\n\nIf the checksum is valid, which will be indicated by an `OK` message, you can proceed to extract the binary. For `tar.gz` files you can use the following command:\n\n```bash\ntar -xzf cc-downloader-[VERSION]-[ARCH]-[OS].tar.gz\n```\n\nFor `zip` files you can use the following command:\n\n```bash\nunzip cc-downloader-[VERSION]-[ARCH]-[OS].zip\n```\n\nThis will extract the binary, the licenses and the readme file **in the current folder**. After extracting the binary, you can run it by executing the following command:\n\n```bash\n./cc-downloader\n```\n\nIf you want to use the binary from anywhere, you can move it to a folder in your `PATH`. For more information on how to do this, please refer to the documentation of your operating system. For example, on `Linux` and `macOS` you can move it to `~/.bin`:\n\n```bash\nmv cc-downloader ~/.bin\n```\n\nThen add the following line to your `~/.bashrc` or `~/.zshrc` file:\n\n```bash\nexport PATH=$PATH:~/.bin\n```\n\nThen run the following command to apply the changes:\n\n```bash\nsource ~/.bashrc\n```\n\nor\n\n```bash\nsource ~/.zshrc\n```\n\nYou can now run the binary from anywhere. If you want to update the binary, you can repeat the process and download the new version. Make sure to replace the binary that is stored in the folder that you added to your `PATH`. 
If you want to remove the binary, you can simply delete it from this folder.\n\n### Compiling from source\n\nFor this you need to have `rust` installed. You can install `rust` by following the instructions on the [official website](https://www.rust-lang.org/tools/install).\n\nOr by running the following command:\n\n```bash\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\n```\n\nEven if you already have a system-wide installation of `rust`, we recommend the linked installation method. A system-wide installation and a user installation can co-exist without any problems.\n\nWhen compiling from source, please make sure you have the latest version of `rust` installed by running the following command:\n\n```bash\nrustup update\n```\n\nNow you can install the `cc-downloader` tool by running the following command:\n\n```bash\ncargo install cc-downloader\n```\n\n## Usage\n\n```text\n➜ cc-downloader -h\nA polite and user-friendly downloader for Common Crawl data.\n\nUsage: cc-downloader [COMMAND]\n\nCommands:\n  download-paths  Download paths for a given crawl\n  download        Download files from a crawl\n  help            Print this message or the help of the given subcommand(s)\n\nOptions:\n  -h, --help     Print help\n  -V, --version  Print version\n\n------\n\n➜ cc-downloader download-paths -h\nDownload paths for a given crawl\n\nUsage: cc-downloader download-paths \u003cCRAWL\u003e \u003cSUBSET\u003e \u003cDESTINATION\u003e\n\nArguments:\n  \u003cCRAWL\u003e        Crawl reference, e.g. 
CC-MAIN-2021-04 or CC-NEWS-2025-01\n  \u003cSUBSET\u003e       Data type [possible values: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table]\n  \u003cDESTINATION\u003e  Destination folder\n\nOptions:\n  -h, --help  Print help\n------\n\n➜ cc-downloader download -h\nDownload files from a crawl\n\nUsage: cc-downloader download [OPTIONS] \u003cPATHS\u003e \u003cDESTINATION\u003e\n\nArguments:\n  \u003cPATHS\u003e        Path file\n  \u003cDESTINATION\u003e  Destination folder\n\nOptions:\n  -f, --files-only                      Download files without the folder structure. This only works for WARC/WET/WAT files\n  -n, --numbered                        Enumerate output files for compatibility with Ungoliant Pipeline. This only works for WET files\n  -t, --threads \u003cNUMBER OF THREADS\u003e     Number of threads to use [default: 10]\n  -r, --retries \u003cMAX RETRIES PER FILE\u003e  Maximum number of retries per file [default: 1000]\n  -p, --progress                        Print progress\n  -h, --help                            Print help\n```\n\n## Number of threads\n\nThe number of threads can be set using the `-t` flag. The default value is 10. It is advised to use the default value to avoid being blocked by the server. If you make too many requests in a short period of time, you will start receiving `403` errors which are unrecoverable and cannot be retried by the downloader.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-downloader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-downloader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-downloader/lists"}