Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with common-crawl
A curated list of projects in awesome lists tagged with common-crawl.
https://github.com/ashvardanian/StringZilla
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc. 🦖
beautifulsoup common-crawl csv dataset html information-retrieval json laion ndjson parser pattern-recognition simd sorting-algorithms string string-manipulation string-matching string-parsing string-search substring
Last synced: 31 Jul 2024
https://github.com/commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler
Last synced: 03 Aug 2024
https://github.com/oscar-project/ungoliant
🕷️ The pipeline for the OSCAR corpus
common-crawl commoncrawl corpus-linguistics crawler fasttext language-classification nlp oscar
Last synced: 01 Aug 2024
https://github.com/commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
aws-athena common-crawl commoncrawl jupyter-notebook webarchiving webgraph-framework
Last synced: 17 Aug 2024