https://github.com/commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/commoncrawl/cc-notebooks
- Owner: commoncrawl
- License: apache-2.0
- Created: 2019-07-19T11:38:10.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2025-04-01T16:54:24.000Z (6 months ago)
- Last Synced: 2025-04-01T17:51:34.590Z (6 months ago)
- Topics: aws-athena, common-crawl, commoncrawl, jupyter-notebook, webarchiving, webgraph-framework
- Language: Jupyter Notebook
- Homepage:
- Size: 3.01 MB
- Stars: 51
- Watchers: 17
- Forks: 9
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Jupyter Notebooks to Analyze Common Crawl Data
* analyzing data using the [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)
  - blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: [net-blocking-iran-cc-main-2019-47.ipynb](./cc-index-table/net-blocking-iran-cc-main-2019-47.ipynb)
  - total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the `.edu` top-level domain: [cc-main-2013-2019-metrics.ipynb](./cc-index-table/cc-main-2013-2019-metrics.ipynb)
  - correlations between character sets and languages: [correlation-language-charset.ipynb](./cc-index-table/correlation-language-charset.ipynb)
* analyze the Common Crawl webgraph data sets and interactively explore the graphs: [cc-webgraph-statistics](./cc-webgraph-statistics/)
* how to explore WARC files by [running a notebook on AWS EMR](./cc-emr-notebook/cluster_setup.md)
* [truncated record payloads in WARC files](./warc-truncation/):
  - verify that all truncated payloads are annotated by the [WARC-Truncated header](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated)
  - which MIME types are most affected by truncation? Aggregations using the columnar index.
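
The truncation checks listed above can be illustrated with a minimal, standard-library-only sketch. It is not the code used in the notebooks: the function names (`parse_warc_headers`, `truncation_reason`, `truncated_by_mime`) and the sample records are hypothetical, and the parser assumes unfolded header lines separated by CRLF. It reads the `WARC-Truncated` header from a record's header block and counts truncated records per MIME type.

```python
from collections import Counter


def parse_warc_headers(block: bytes) -> dict:
    """Parse one WARC record header block into a dict with lower-cased names.

    Simplification (assumption): header lines are not folded across lines.
    """
    lines = block.decode("utf-8", errors="replace").split("\r\n")
    headers = {}
    for line in lines[1:]:  # skip the version line, e.g. "WARC/1.1"
        if not line:
            break  # an empty line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def truncation_reason(block: bytes):
    """Return the WARC-Truncated value (e.g. "length"), or None if absent."""
    return parse_warc_headers(block).get("warc-truncated")


def truncated_by_mime(records):
    """Count truncated records per MIME type.

    `records` is an iterable of (mime_type, header_block) pairs; this shape
    is made up for the example.
    """
    counts = Counter()
    for mime_type, block in records:
        if truncation_reason(block) is not None:
            counts[mime_type] += 1
    return counts


# Hypothetical sample data, not taken from any crawl.
TRUNCATED = b"WARC/1.1\r\nWARC-Type: response\r\nWARC-Truncated: length\r\n\r\n"
COMPLETE = b"WARC/1.1\r\nWARC-Type: response\r\n\r\n"
SAMPLE = [
    ("application/pdf", TRUNCATED),
    ("text/html", COMPLETE),
    ("application/pdf", TRUNCATED),
]

if __name__ == "__main__":
    print(truncated_by_mime(SAMPLE).most_common())
```

For real WARC files, a dedicated WARC parsing library is a better fit than this hand-rolled header parser, and aggregations over a whole crawl are more practical on the columnar index than on individual records.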