An open API service indexing awesome lists of open source software.

Projects in Awesome Lists by commoncrawl

A curated list of projects in awesome lists by commoncrawl.

https://github.com/commoncrawl/commoncrawl

Common Crawl support library to access 2008-2012 crawl archives (ARC files)

archived inactive

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

common-crawl commoncrawl pyspark spark sparksql warc-files wat-files wet

Last synced: 12 Jun 2025

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

common-crawl commoncrawl statistics

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

apache-parquet aws-athena columnar-storage commoncrawl spark sql

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-webgraph

Tools to construct and process Common Crawl webgraphs

centrality-measures common-crawl commoncrawl pagerank webgraph webgraph-framework

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

commoncrawl downloader rust

Last synced: 12 Jun 2025

https://github.com/pjox/cc-downloader

A polite and user-friendly downloader for Common Crawl data

commoncrawl downloader rust

Last synced: 12 Jun 2025

https://github.com/commoncrawl/web-languages

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

crawling dataset language-detection

Last synced: 02 Feb 2026

https://github.com/commoncrawl/cc-citations

Scientific articles using or citing Common Crawl data

bibliography bibtex opendata

Last synced: 12 Jun 2025

https://github.com/commoncrawl/whirlwind-python

A whirlwind tour of Common Crawl's data using Python

archive python tutorial warc

Last synced: 12 Jun 2025

https://github.com/commoncrawl/language-detection-cld2

Natural language detection, Java bindings for CLD2

language-detection language-identification natural-language

Last synced: 12 Jun 2025

https://github.com/commoncrawl/presentations

A collection of public presentations from the Common Crawl Foundation

Last synced: 24 Feb 2026

https://github.com/commoncrawl/cc-webgraph-statistics

Statistics of Common Crawl monthly Web Graphs

Last synced: 07 Oct 2025

https://github.com/commoncrawl/cc-vec

Last synced: 19 Jan 2026

https://github.com/commoncrawl/ml-opt-out-experiments

A series of experiments into ML opt-out protocols

Last synced: 03 Oct 2025

https://github.com/commoncrawl/cc-host-index

Tools for working with the host index

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-nutch-example

Apache Nutch example project to archive content in WARC files

Last synced: 12 Jun 2025

https://github.com/commoncrawl/wac2025-webgraph-workshop

Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025

Last synced: 12 Jun 2025

https://github.com/commoncrawl/whirlwind-java

A whirlwind tour of Common Crawl's data using Java

Last synced: 18 Jun 2026

https://github.com/commoncrawl/cc-web-graph-neo4j

Instructions and code for using the Common Crawl Web Graph in Neo4j format

Last synced: 03 Apr 2026

https://github.com/commoncrawl/cc-legal

Repository for legal documentation at the Common Crawl Foundation

Last synced: 02 Feb 2026

https://github.com/commoncrawl/cc-monitoring

Code that monitors Common Crawl infrastructure

Last synced: 12 Jun 2025

https://github.com/commoncrawl/ccf-eot-seeds-2024

Common Crawl's contribution of seeds to the End of Term Archive 2024

Last synced: 31 Jan 2026

https://github.com/commoncrawl/wac2025-cc-annotator-poster

A proof of concept pipeline for WARC annotation

Last synced: 12 Jun 2025

https://github.com/commoncrawl/discussions

For discussions and collaboration among all those who use or seek to use Common Crawl data

Last synced: 30 Jan 2026

https://github.com/commoncrawl/arabic-seed-processing

Turning 30,000 Arabic domains into a better crawl

Last synced: 18 Jun 2026

https://github.com/commoncrawl/eot2020-host-index

Tools to work with the preliminary End of Term Archive host index

Last synced: 03 Apr 2026

https://github.com/commoncrawl/cc-index-annotations

Example code to join an annotation to a host or URL index

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-warcinfo-index-builder

Code to build an index that maps warcinfo-id to crawl / warc

Last synced: 12 Jun 2025

https://github.com/commoncrawl/arc2warc-conversion

Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format

arc arc-files warc warc-files warc-format webarchive webarchiving

Last synced: 16 Feb 2026

https://github.com/commoncrawl/cc-host-index-media

Media files used in the README.md of cc-host-index

Last synced: 30 Jan 2026

https://github.com/commoncrawl/ccf-git-github-filesystem-unicode-test

Test files to diagnose git and filesystem problems with unicode normalization

Last synced: 27 Oct 2025