https://github.com/thunderpoot/cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.




Host: GitHub
URL: https://github.com/thunderpoot/cc-getpage
Owner: thunderpoot
License: mit
Created: 2025-03-02T12:21:48.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-02T12:30:12.000Z (over 1 year ago)
Last Synced: 2025-10-26T16:35:36.493Z (9 months ago)
Topics: common-crawl, common-crawl-data, common-crawl-python, common-crawl-with-python, commoncrawl
Language: Python
Homepage: https://commoncrawl.org/
Size: 1.72 MB
Stars: 6
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


![Masthead Image](mast.png)

# cc-getpage

`cc-getpage` is a lightweight Python utility for retrieving individual pages from the [Common Crawl](https://commoncrawl.org) archive. It looks up a page in Common Crawl's index and downloads only the corresponding WARC file segment.
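
The lookup-then-ranged-download flow can be sketched roughly as follows. This is an illustration, not the code in `cc-getpage.py`: the helper names `index_query_url` and `segment_request` are hypothetical, and the sketch assumes the public index server at `index.commoncrawl.org`, the data host at `data.commoncrawl.org`, and index records that carry `filename`, `offset`, and `length` fields.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Assumed public endpoints; check the Common Crawl docs before relying on them.
INDEX_SERVER = "https://index.commoncrawl.org"
DATA_SERVER = "https://data.commoncrawl.org"


def index_query_url(crawl_id: str, page_url: str) -> str:
    """Build an index query for one crawl, asking for JSON lines (hypothetical helper)."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{query}"


def segment_request(record: dict) -> tuple[str, dict]:
    """Turn one index record into a download URL plus a Range header (hypothetical helper).

    The record's offset and length arrive as strings. HTTP byte ranges
    are inclusive at both ends, hence the `- 1`.
    """
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    url = f"{DATA_SERVER}/{record['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}
```

Sending the returned headers with a GET request to the returned URL should yield just the bytes of that one record rather than the whole WARC file.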

For **bulk downloads** or **entire snapshots**, please use the official [`cc-downloader`](https://github.com/commoncrawl/cc-downloader) program.

## Features

- Fetches specific web pages from Common Crawl archives
- Automatically probes crawls to find which ones contain your URL
- Supports manual or automatic crawl selection
- Displays archived versions of a URL for selection
- Downloads only the necessary WARC segment
- Includes automatic retries with backoff
- `--viewpage` option to get a Common Crawl viewer URL instead of downloading
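
As an illustration of the retry feature listed above, here is a minimal sketch of retrying with backoff. The README does not say which backoff schedule the tool uses, so the doubling delay is an assumption, and `with_retries` and its parameters are hypothetical names rather than part of `cc-getpage`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying on network-style errors with a doubling delay.

    OSError is caught because urllib's URLError and most socket errors
    derive from it. The exponential schedule is an assumption.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.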

## Usage

```sh
python cc-getpage.py [--viewpage] [CRAWL-ID]
```

### Options

| Option | Description |
|---|---|
| `--viewpage` | Print a Common Crawl viewer URL instead of downloading the WARC segment |
| `--version` | Show the program version |

If `CRAWL-ID` is omitted, the program will probe all available crawls to find which ones contain the given URL. This is rate-limited to be polite to the index server, so it may take a while. Press `Ctrl+C` to stop early and work with whatever matches have been found so far.
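
The probing behaviour described above can be sketched as a loop that pauses between index queries and treats `Ctrl+C` as a signal to stop and keep partial results. This is a sketch under assumptions, not the program's actual implementation: `probe_crawls`, the `has_url` callback, and the fixed one-second delay are all hypothetical.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def probe_crawls(
    crawl_ids: Iterable[str],
    has_url: Callable[[str], bool],
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Check crawls one at a time and return the IDs that contain the URL.

    `has_url` stands in for a function that queries the index for one crawl.
    """
    matches: list[str] = []
    try:
        for crawl_id in crawl_ids:
            if has_url(crawl_id):
                matches.append(crawl_id)
            sleep(delay)  # fixed pause so the index server is not hammered
    except KeyboardInterrupt:
        pass  # Ctrl+C: stop early and keep whatever matched so far
    return matches
```

Catching `KeyboardInterrupt` around the whole loop means an interrupt during either the query or the pause ends probing cleanly.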

## Contribute
Pull requests are welcome. Feel free to improve features or fix bugs.

## License
This project is licensed under the **MIT License**.

## Contact
For support or questions, visit [Common Crawl](https://commoncrawl.org/contact-us) or open an issue on GitHub. You're also welcome to join our [Discord server](https://discord.gg/njaVFh7avF) or [Google Group](https://groups.google.com/g/common-crawl).
