https://github.com/commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework

Host: GitHub
URL: https://github.com/commoncrawl/cc-notebooks
Owner: commoncrawl
License: apache-2.0
Created: 2019-07-19T11:38:10.000Z (about 6 years ago)
Default Branch: main
Last Pushed: 2025-04-01T16:54:24.000Z (6 months ago)
Last Synced: 2025-04-01T17:51:34.590Z (6 months ago)
Topics: aws-athena, common-crawl, commoncrawl, jupyter-notebook, webarchiving, webgraph-framework
Language: Jupyter Notebook
Homepage:
Size: 3.01 MB
Stars: 51
Watchers: 17
Forks: 9
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Jupyter Notebooks to Analyze Common Crawl Data

* analyzing data using the [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)
  - blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: [net-blocking-iran-cc-main-2019-47.ipynb](./cc-index-table/net-blocking-iran-cc-main-2019-47.ipynb)
  - total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the `.edu` top-level domain: [cc-main-2013-2019-metrics.ipynb](./cc-index-table/cc-main-2013-2019-metrics.ipynb)
  - correlations between character sets and languages: [correlation-language-charset.ipynb](./cc-index-table/correlation-language-charset.ipynb)
* analyze the Common Crawl webgraph data sets and interactively explore the graphs: [cc-webgraph-statistics](./cc-webgraph-statistics/)
* how to explore WARC files by [running a notebook on AWS EMR](./cc-emr-notebook/cluster_setup.md)
* [truncated record payloads in WARC files](./warc-truncation/):
  - verify that all truncated payloads are annotated by the [WARC-Truncated header](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated)
  - find out which MIME types are most affected by truncation, using aggregations over the columnar index
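
To illustrate the first truncation check, here is a minimal sketch of reading the `WARC-Truncated` header from the header block of a single WARC record. It is not taken from the notebooks: the function names `parse_warc_headers` and `truncation_reason` are hypothetical, the sample record is made up, and only the Python standard library is used. Continuation lines in headers are not handled.

```python
from typing import Dict, Optional


def parse_warc_headers(block: str) -> Dict[str, str]:
    """Parse the header block of one WARC record into a dict (hypothetical helper).

    Header names are lower-cased because WARC field names are case-insensitive.
    """
    headers: Dict[str, str] = {}
    lines = block.strip().splitlines()
    # The first line is the version line (e.g. "WARC/1.0"), so skip it.
    for line in lines[1:]:
        if not line.strip():
            break  # an empty line ends the header block
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def truncation_reason(block: str) -> Optional[str]:
    """Return the WARC-Truncated value, or None if the header is absent."""
    return parse_warc_headers(block).get("warc-truncated")


# Made-up sample record headers, for illustration only.
example_record = "\r\n".join([
    "WARC/1.0",
    "WARC-Type: response",
    "WARC-Target-URI: https://example.com/large-file.pdf",
    "WARC-Date: 2019-11-20T00:00:00Z",
    "WARC-Truncated: length",
    "Content-Length: 1048576",
    "",
])

untruncated_record = "\r\n".join([
    "WARC/1.0",
    "WARC-Type: response",
    "WARC-Target-URI: https://example.com/",
    "Content-Length: 512",
    "",
])

print(truncation_reason(example_record))      # length
print(truncation_reason(untruncated_record))  # None
```

For real WARC files, a dedicated WARC parsing library is likely a better fit than hand-rolled header parsing; the sketch only shows what the annotation looks like at the record level.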
