Projects in Awesome Lists tagged with commoncrawl
A curated list of projects in awesome lists tagged with commoncrawl.
https://github.com/fhamborg/news-please
news-please - an integrated web crawler and information extractor for news that just works
cc-news ccnews commoncrawl crawler data-gathering elasticsearch extract-articles extract-information extractor json news news-archive news-articles news-crawler news-extractor news-scraper news-websites nlp python roberta
Last synced: 13 May 2025
https://github.com/flairnlp/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus corpus-tools crawler datasets image-classification image-extraction news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 14 May 2025
https://github.com/flairNLP/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus crawler news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 04 Mar 2025
https://github.com/commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler
Last synced: 10 May 2025
https://github.com/oscar-project/ungoliant
🕷️ The pipeline for the OSCAR corpus
common-crawl commoncrawl corpus-linguistics crawler fasttext language-classification nlp oscar
Last synced: 03 Apr 2025
https://github.com/cocrawler/cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
cdx cdx-api commoncrawl python warc web-archives web-archiving
Last synced: 07 Apr 2025
https://github.com/cloudtracer/paskto
Paskto - Passive Web Scanner
commoncrawl internet-of-things internetarchive nikto osint passive-vulnerability-scanner scanner
Last synced: 13 May 2025
https://github.com/karust/gogetcrawl
Extract web archive data using Wayback Machine and Common Crawl
commoncrawl concurrency crawler golang wayback-machine webarchive
Last synced: 06 Apr 2025
https://github.com/shjwudp/c4-dataset-script
Inspired by Google's C4, a series of Colossal Clean data-cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.
commoncrawl dataset massivetext nlp python spark
Last synced: 02 Dec 2024
https://github.com/commoncrawl/cc-index-table
Index Common Crawl archives in tabular format
apache-parquet aws-athena columnar-storage commoncrawl spark sql
Last synced: 25 Nov 2024
https://github.com/centic9/commoncrawldocumentdownload
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.
cdx-files commoncrawl java mime-types warc
Last synced: 07 Apr 2025
https://github.com/commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework
Last synced: 09 Dec 2024
https://github.com/rix4uni/uforall
uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl
alienvault bugbounty commoncrawl crawler osint recon reconnaissance urlscan wayback
Last synced: 15 Apr 2025
https://github.com/pjox/cc-downloader
A polite and user-friendly downloader for Common Crawl data
Last synced: 18 Mar 2025
https://github.com/ahcm/tantivy_warc_indexer
Builds a tantivy index from Common Crawl warc.wet files
commoncrawl index search tantivy
Last synced: 09 Dec 2024
https://github.com/code402/warc-benchmark
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
Last synced: 22 Apr 2025
https://github.com/preciz/common_crawl
Work with Common Crawl data from Elixir.
Last synced: 10 Apr 2025
https://github.com/thunderpoot/cc-getpage
Lightweight Python utility for retrieving individual pages from the Common Crawl archives.
common-crawl common-crawl-data common-crawl-python common-crawl-with-python commoncrawl
Last synced: 12 Mar 2025