Projects in Awesome Lists by commoncrawl
A curated list of projects in awesome lists by commoncrawl.
https://github.com/commoncrawl/commoncrawl
Common Crawl support library to access 2008-2012 crawl archives (ARC files)
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
common-crawl commoncrawl pyspark spark sparksql warc-files wat-files wet
Last synced: 12 Jun 2025
https://github.com/commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
common-crawl commoncrawl statistics
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-index-table
Index Common Crawl archives in tabular format
apache-parquet aws-athena columnar-storage commoncrawl spark sql
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-webgraph
Tools to construct and process Common Crawl webgraphs
centrality-measures common-crawl commoncrawl pagerank webgraph webgraph-framework
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-downloader
A polite and user-friendly downloader for Common Crawl data
Last synced: 12 Jun 2025
https://github.com/pjox/cc-downloader
A polite and user-friendly downloader for Common Crawl data
Last synced: 12 Jun 2025
https://github.com/commoncrawl/web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
crawling dataset language-detection
Last synced: 02 Feb 2026
https://github.com/commoncrawl/cc-citations
Scientific articles using or citing Common Crawl data
Last synced: 12 Jun 2025
https://github.com/commoncrawl/whirlwind-python
A whirlwind tour of Common Crawl's data using Python
Last synced: 12 Jun 2025
https://github.com/commoncrawl/language-detection-cld2
Natural language detection, Java bindings for CLD2
language-detection language-identification natural-language
Last synced: 12 Jun 2025
https://github.com/commoncrawl/presentations
A collection of public presentations from the Common Crawl Foundation
Last synced: 24 Feb 2026
https://github.com/commoncrawl/cc-webgraph-statistics
Statistics of Common Crawl monthly Web Graphs
Last synced: 07 Oct 2025
https://github.com/commoncrawl/ml-opt-out-experiments
A series of experiments into ML opt-out protocols
Last synced: 03 Oct 2025
https://github.com/commoncrawl/cc-host-index
Tools for working with the host index
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-nutch-example
Apache Nutch example project to archive content in WARC files
Last synced: 12 Jun 2025
https://github.com/commoncrawl/wac2025-webgraph-workshop
Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025
Last synced: 12 Jun 2025
https://github.com/commoncrawl/whirlwind-java
A whirlwind tour of Common Crawl's data using Java
Last synced: 18 Jun 2026
https://github.com/commoncrawl/cc-web-graph-neo4j
Instructions and code for using the Common Crawl Web Graph in Neo4j format
Last synced: 03 Apr 2026
https://github.com/commoncrawl/cc-legal
Repository for legal documentation at the Common Crawl Foundation
Last synced: 02 Feb 2026
https://github.com/commoncrawl/cc-monitoring
Code that monitors Common Crawl infrastructure
Last synced: 12 Jun 2025
https://github.com/commoncrawl/ccf-eot-seeds-2024
Common Crawl's contribution of seeds to the End of Term Archive 2024
Last synced: 31 Jan 2026
https://github.com/commoncrawl/wac2025-cc-annotator-poster
A proof of concept pipeline for WARC annotation
Last synced: 12 Jun 2025
https://github.com/commoncrawl/discussions
For discussions and collaboration among all those who use or seek to use Common Crawl data
Last synced: 30 Jan 2026
https://github.com/commoncrawl/arabic-seed-processing
Turning 30,000 Arabic domains into a better crawl
Last synced: 18 Jun 2026
https://github.com/commoncrawl/eot2020-host-index
Tools to work with the preliminary End of Term Archive host index
Last synced: 03 Apr 2026
https://github.com/commoncrawl/cc-index-annotations
Example code to join an annotation to a host or URL index
Last synced: 12 Jun 2025
https://github.com/commoncrawl/cc-warcinfo-index-builder
Code to build an index that maps warcinfo-id to crawl / warc
Last synced: 12 Jun 2025
https://github.com/commoncrawl/arc2warc-conversion
Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format
arc arc-files warc warc-files warc-format webarchive webarchiving
Last synced: 16 Feb 2026
https://github.com/commoncrawl/cc-host-index-media
Media files used in the README.d of cc-host-index
Last synced: 30 Jan 2026
https://github.com/commoncrawl/ccf-git-github-filesystem-unicode-test
Test files to diagnose git and filesystem problems with unicode normalization
Last synced: 27 Oct 2025