Projects in Awesome Lists tagged with commoncrawl

https://github.com/fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works

cc-news ccnews commoncrawl crawler data-gathering elasticsearch extract-articles extract-information extractor json news news-archive news-articles news-crawler news-extractor news-scraper news-websites nlp python roberta

Last synced: 13 May 2025

https://github.com/commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

common-crawl commoncrawl pyspark spark sparksql warc-files wat-files wet

Last synced: 12 Jun 2025

https://github.com/flairnlp/fundus

A very simple news crawler with a funny name

cc-news commoncrawl corpus corpus-tools crawler datasets image-classification image-extraction news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping

Last synced: 08 Jan 2026

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

common-crawl commoncrawl statistics

Last synced: 12 Jun 2025

https://github.com/karust/gogetcrawl

Extract web archive data using Wayback Machine and Common Crawl

commoncrawl concurrency crawler golang wayback-machine webarchive

Last synced: 15 Jan 2026

https://github.com/oscar-project/ungoliant

:spider: The pipeline for the OSCAR corpus

common-crawl commoncrawl corpus-linguistics crawler fasttext language-classification nlp oscar

Last synced: 03 Apr 2025

https://github.com/cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

cdx cdx-api commoncrawl python warc web-archives web-archiving

Last synced: 14 Dec 2025

https://github.com/cloudtracer/paskto

Paskto - Passive Web Scanner

commoncrawl internet-of-things internetarchive nikto osint passive-vulnerability-scanner scanner

Last synced: 13 May 2025

https://github.com/commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

apache-parquet aws-athena columnar-storage commoncrawl spark sql

Last synced: 12 Jun 2025

https://github.com/shjwudp/c4-dataset-script

Inspired by Google's C4, a series of Colossal Clean data-cleaning scripts focused on Common Crawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

commoncrawl dataset massivetext nlp python spark

Last synced: 27 Jul 2025

https://github.com/commoncrawl/cc-webgraph

Tools to construct and process Common Crawl webgraphs

centrality-measures common-crawl commoncrawl pagerank webgraph webgraph-framework

Last synced: 12 Jun 2025

https://github.com/centic9/commoncrawldocumentdownload

A small tool that uses the Common Crawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

cdx-files commoncrawl java mime-types warc

Last synced: 07 Apr 2025

https://github.com/commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework

Last synced: 12 Jun 2025

https://github.com/commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

commoncrawl downloader rust

Last synced: 12 Jun 2025

https://github.com/pjox/cc-downloader

A polite and user-friendly downloader for Common Crawl data

commoncrawl downloader rust

Last synced: 12 Jun 2025

https://github.com/rix4uni/uforall

uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl

alienvault bugbounty commoncrawl crawler osint recon reconnaissance urlscan wayback

Last synced: 15 Apr 2025

https://github.com/generals-space/site-mirror-go

From [Gitee](https://gitee.com/generals-space/site-mirror-go): a general-purpose crawler, site-cloning tool, and whole-site downloader

commoncrawl crawler mirror spider

Last synced: 14 Jan 2026

https://github.com/ahcm/tantivy_warc_indexer

Builds a Tantivy index from Common Crawl warc.wet files

commoncrawl index search tantivy

Last synced: 06 Aug 2025

https://github.com/thunderpoot/cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

common-crawl common-crawl-data common-crawl-python common-crawl-with-python commoncrawl

Last synced: 22 Feb 2026

https://github.com/code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

common-crawl commoncrawl warc

Last synced: 07 Oct 2025

https://github.com/preciz/common_crawl

Work with Common Crawl data from Elixir.

commoncrawl elixir

Last synced: 10 Apr 2025

https://github.com/atharvbyadav/ghostpath

👻 GhostPath — A powerful modular reconnaissance toolkit built for hackers, OSINT professionals & bug bounty hunters — passive + active recon in a sleek CLI shell. Discover subdomains, probe paths, mine archives and hunt certificates — all from one interactive terminal interface.

bug-bounty cli-tool commoncrawl cybersecurity ethical-hacking ghostpath hacking historical-data information-gathering infosec osint passive-recon reconnaissance web-recon

Last synced: 30 Jul 2025
