https://github.com/thunderpoot/cc-getpage
Lightweight Python utility for retrieving individual pages from the Common Crawl archives.
- Host: GitHub
- URL: https://github.com/thunderpoot/cc-getpage
- Owner: thunderpoot
- License: mit
- Created: 2025-03-02T12:21:48.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-02T12:30:12.000Z (over 1 year ago)
- Last Synced: 2025-10-26T16:35:36.493Z (8 months ago)
- Topics: common-crawl, common-crawl-data, common-crawl-python, common-crawl-with-python, commoncrawl
- Language: Python
- Homepage: https://commoncrawl.org/
- Size: 1.72 MB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# cc-getpage
`cc-getpage` is a lightweight Python utility for retrieving individual pages from the [Common Crawl](https://commoncrawl.org) archive. It looks up a URL in Common Crawl's index and downloads only the corresponding WARC file segment.
For **bulk downloads** or **entire snapshots**, please use the official [`cc-downloader`](https://github.com/commoncrawl/cc-downloader) program.
## Features
- Fetches specific web pages from Common Crawl archives
- Automatically probes crawls to find which ones contain your URL
- Supports manual or automatic crawl selection
- Displays archived versions of a URL for selection
- Downloads only the necessary WARC segment
- Includes automatic retries with backoff
- `--viewpage` option to get a Common Crawl viewer URL instead of downloading
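To illustrate how downloading only the necessary WARC segment can work, here is a minimal sketch. It is not taken from `cc-getpage.py`. It assumes an index record that carries `filename`, `offset`, and `length` fields, as Common Crawl's CDX index returns them, and that the data is served from `https://data.commoncrawl.org/`. The function names and the retry parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Assumed public data host for Common Crawl files.
DATA_HOST = "https://data.commoncrawl.org/"


def warc_range_request(record):
    """Turn one index record into (url, headers) for a ranged GET."""
    offset = int(record["offset"])   # the index returns numbers as strings
    length = int(record["length"])
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    headers = {"Range": f"bytes={offset}-{last_byte}"}
    return DATA_HOST + record["filename"], headers


def backoff_delays(retries=4, base=1.0, factor=2.0):
    """Seconds to wait before each retry: base, base*factor, and so on."""
    return [base * factor ** i for i in range(retries)]


def fetch_segment(record, retries=4):
    """Download the compressed WARC record, retrying on URL errors."""
    url, headers = warc_range_request(record)
    request = urllib.request.Request(url, headers=headers)
    last_error = None
    # One immediate attempt, then one attempt after each backoff delay.
    for delay in [0.0] + backoff_delays(retries):
        time.sleep(delay)
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except urllib.error.URLError as error:
            last_error = error
    raise last_error
```

Because each WARC record is compressed on its own, the bytes returned by such a ranged request can usually be decompressed without the rest of the file.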
## Usage
```sh
python cc-getpage.py [--viewpage] <URL> [CRAWL-ID]
```
### Options
| Option | Description |
|---|---|
| `--viewpage` | Print a Common Crawl viewer URL instead of downloading the WARC segment |
| `--version` | Show the program version |

If `CRAWL-ID` is omitted, the program probes all available crawls to find which ones contain the given URL. Probing is rate-limited to be polite to the index server, so it may take a while. Press `Ctrl+C` to stop early and work with whatever matches have been found so far.
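The probing behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the program's actual code. `probe_crawls`, `has_url`, and `pause` are hypothetical names, and `has_url` stands in for whatever index query decides whether a crawl contains the URL.

```python
import time


def probe_crawls(crawl_ids, has_url, pause=1.0, sleep=time.sleep):
    """Check each crawl in turn, pausing between index requests.

    Ctrl+C raises KeyboardInterrupt in Python; catching it here lets the
    caller keep the matches collected before the interruption.
    """
    matches = []
    try:
        for crawl_id in crawl_ids:
            if has_url(crawl_id):
                matches.append(crawl_id)
            sleep(pause)  # fixed delay to stay polite to the index server
    except KeyboardInterrupt:
        pass  # stop early and fall through with partial results
    return matches
```

Passing `sleep` as a parameter is a design choice for this sketch only: it makes the pause easy to replace in tests without waiting in real time.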
## Contribute
Pull requests are welcome. Feel free to improve features or fix bugs.
## License
This project is licensed under the **MIT License**.
## Contact
For support or questions, visit [Common Crawl](https://commoncrawl.org/contact-us) or open an issue on GitHub. You're also welcome to join our [Discord server](https://discord.gg/njaVFh7avF) or [Google Group](https://groups.google.com/g/common-crawl).