An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with crawling

A curated list of projects in awesome lists tagged with crawling.

https://github.com/scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

crawler crawling framework hacktoberfest python scraping web-scraping web-scraping-python

Last synced: 16 Jan 2026

https://github.com/gocolly/colly

Elegant Scraper and Crawler Framework for Golang

crawler crawling framework go golang scraper scraping spider

Last synced: 12 May 2025

https://github.com/apifytech/apify-js

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 06 Jul 2025

https://github.com/apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 03 Nov 2025

https://github.com/codelucas/newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:

crawler crawling news news-aggregator python scraper

Last synced: 12 May 2025

https://github.com/apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping

Last synced: 06 Mar 2026

https://github.com/D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 13 May 2025

https://github.com/hakluke/hakrawler

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

bugbounty crawling hacking osint pentesting recon reconnaissance

Last synced: 14 May 2025

https://github.com/apache/nutch

Apache Nutch is an extensible and scalable web crawler

apache crawling hadoop java nutch web-crawler

Last synced: 13 May 2025

https://github.com/d4vinci/scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 15 Feb 2026

https://github.com/zorlan/skycaiji

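蓝天采集器 (skycaiji) is a free, open-source crawler system. Data can be collected just by pointing and clicking to edit rules. It runs locally, on shared hosting, or on cloud servers, can collect almost any type of web page, integrates seamlessly with all kinds of CMS site builders, publishes data in real time without login, and runs fully automatically with no manual intervention. Among web big-data collection software, it is a fully cross-platform cloud crawler system.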

crawler crawling php spider webcrawler

Last synced: 14 May 2025

https://github.com/edoardottt/cariddi

Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more

bugbounty crawler crawling endpoint-discovery endpoints go golang hacktoberfest infosec osint penetration-testing pentesting recon reconnaissance redteam scraper secret-keys secrets-detection security security-tools

Last synced: 14 May 2025

https://github.com/natescarlet/holiday-cn

📅🇨🇳 Chinese statutory public holiday data, automatically scraped daily from State Council announcements

china crawling data holiday natural-language-processing

Last synced: 14 May 2025

https://github.com/NateScarlet/holiday-cn

📅🇨🇳 Chinese statutory public holiday data, automatically scraped daily from State Council announcements

china crawling data holiday natural-language-processing

Last synced: 26 Mar 2025

https://github.com/roach-php/core

The complete web scraping toolkit for PHP.

crawling php web-scraping

Last synced: 13 May 2025

https://github.com/lorey/mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples

crawler crawler-python crawling extraction-engine html machine-learning scraper scraping

Last synced: 15 May 2025

https://github.com/elixir-crawly/crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider

Last synced: 11 Dec 2025

https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler crawling wacz warc web-archiving web-crawler webrecorder

Last synced: 10 Feb 2026

https://github.com/clemfromspace/scrapy-selenium

Scrapy middleware to handle JavaScript pages using Selenium

crawling scrapy selenium

Last synced: 14 May 2025

https://github.com/iawia002/Lulu

[Unmaintained] A simple and clean video/music/image downloader 👾

crawler crawling downloader python python3 scraper scraping video

Last synced: 22 Jul 2025

https://github.com/rebrowser/rebrowser-patches

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on demand.

automation bot bot-detection chrome chromedriver cloudflare crawler crawling datadome headless headless-chrome playwright puppeteer puppeteer-extra rebrowser scraping selenium stealth web-scraping webdriver

Last synced: 14 May 2025

https://github.com/slotix/dataflowkit

Extract structured data from websites. Website scraping.

cdp chrome-fetcher crawling extract-data go golang golang-library headless scraper scraping scraping-websites

Last synced: 16 Jan 2026

https://github.com/essandess/isp-data-pollution

ISP Data Pollution to Protect Private Browsing History with Obfuscation

crawling data data-analytics obfuscation privacy privacy-enhancing-technologies web

Last synced: 29 Dec 2025

https://github.com/scrapinghub/spidermon

Scrapy extension for monitoring spider execution.

crawling hacktoberfest monitoring monitoring-tool scraping scrapinghub spiders testing

Last synced: 14 May 2025

https://github.com/l4rm4nd/linkedindumper

Python 3 script to dump/scrape/extract company employees from the LinkedIn API

crawling employees extracting linkedin osint python3 scraping spider

Last synced: 18 Apr 2026

https://github.com/florents-tselai/warcdb

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 04 Apr 2025

https://github.com/Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

cli crawling database sqlite warc web-archiving web-data

Last synced: 08 Apr 2025

https://github.com/rivermont/spidy

The simple, easy to use command line web crawler.

crawler crawling python python3 web-crawler web-spider

Last synced: 16 Jan 2026

https://github.com/alephdata/memorious

Lightweight web scraping toolkit for documents and structured data.

crawling scraping scraping-framework

Last synced: 12 Apr 2025

https://github.com/infinilabs/crawler

🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider

Last synced: 11 Apr 2026

https://github.com/marshalx/telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Last synced: 16 May 2025

https://github.com/MarshalX/telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Last synced: 15 May 2025

https://github.com/ai-robots-txt/ai.robots.txt

A list of AI agents and robots to block.

ai crawlers crawling privacy

Last synced: 28 Mar 2025

https://github.com/roach-php/laravel

Laravel adapter for Roach, the complete web scraping toolkit for PHP.

crawling laravel php web-scraping

Last synced: 11 Apr 2025

https://github.com/antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

crawler crawling framework golang scraping web-crawler web-spider

Last synced: 14 Mar 2025

https://github.com/amerkurev/scrapper

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

crawler crawler-python crawling headless readability scraper scraping web-parsers web-parsing web-scraping

Last synced: 08 May 2025

https://github.com/a3h1nt/grawler

Grawler is a tool written in PHP with a web interface that automates the task of using Google dorks, scrapes the results, and stores them in a file.

algorithm-schema automation crawling curl google-dorks grawler osint osint-tool php proxy scraping xampp

Last synced: 09 Apr 2025

https://github.com/A3h1nt/Grawler

Grawler is a tool written in PHP with a web interface that automates the task of using Google dorks, scrapes the results, and stores them in a file.

algorithm-schema automation crawling curl google-dorks grawler osint osint-tool php proxy scraping xampp

Last synced: 11 Jul 2025

https://github.com/18520339/facebook-data-extraction

Experience in effectively fetching Facebook data by querying the Graph API with an account-based token and operating undetectable scraping bots to extract client- and server-side rendered content

automation browser-fingerprinting crawling facebook facebook-graph-api proxy scraping selenium tor-network

Last synced: 03 Apr 2025

https://github.com/mehmetozkaya/dotnetcrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping

Last synced: 11 May 2025

https://github.com/mehmetozkaya/DotnetCrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping

Last synced: 18 Apr 2025

https://github.com/N0taN3rd/Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 06 Apr 2025

https://github.com/n0tan3rd/squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 13 Sep 2025

https://github.com/dimkouv/massivedl

Download a large list of files concurrently

crawling download-manager downloader golang

Last synced: 15 Jan 2026

https://github.com/karthikuj/sasori

Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.

automation crawler crawling dast dynamic endpoint-discovery infosec puppeteer scraping security

Last synced: 15 Aug 2025

https://github.com/janreges/siteone-crawler

SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).

analyzer crawler crawling performance qa quality-assessment security seo seotools stress-testing swoole testing website

Last synced: 18 Mar 2026

https://github.com/unblocked-web/double-agent

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

crawling puppeteer scraping scrapy secret-agent

Last synced: 08 Apr 2025

https://github.com/alash3al/scraply

Scraply, a simple DOM scraper to fetch information from any HTML-based website

crawler crawling dom golang scraper scrapers scraping-websites scrapy server

Last synced: 28 Apr 2025

https://github.com/ihandmine/aioscpy

An asyncio + aiolibs crawler framework that imitates Scrapy

aiohttp asyncio crawling framework loguru python3 scrapy scrapy-redis

Last synced: 14 Jan 2026

https://github.com/archiveteam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 04 Apr 2025

https://github.com/maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

crawling duckdb parquet warc web-archiving

Last synced: 16 May 2025

https://github.com/ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 18 Jul 2025

https://github.com/kreuzberg-dev/kreuzcrawl

High-performance web crawling engine with bindings for 11 languages

crawling csharp elixir ffi golang java mcp php python ruby rust typescript wasm web-crawler web-scraping

Last synced: 07 Jun 2026

https://github.com/fcavallarin/burp-dom-scanner

Burp Suite's extension to scan and crawl Single Page Applications

crawling dom scanning single-page-applications xss xss-detection

Last synced: 17 Mar 2025

https://github.com/creekorful/bathyscaphe

Fast, highly configurable, cloud native dark web crawler.

architecture crawler crawling elasticsearch golang hidden-services kibana tor web-crawler

Last synced: 17 Mar 2025

https://github.com/alexfazio/devdocs-to-llm

Turn any developer documentation into a GPT

crawler crawling firecrawl scraper scraping

Last synced: 08 Mar 2026

https://github.com/datawizard1337/ARGUS

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9

crawling python scraping scrapy scrapyd webcrawling webscraping

Last synced: 20 Mar 2025

https://github.com/carlosplanchon/spidercreator

Automated web scraping spider generation using Browser Use and LLMs. Streamline the creation of Playwright-based spiders with minimal manual coding. Ideal for large enterprises with recurring data extraction needs.

ai automation browser-use crawling llm low-code no-code python rpa scraping spider vibe-coding

Last synced: 15 Sep 2025

https://github.com/jroakes/tech-seo-crawler

Build a small, 3-domain internet using GitHub Pages and Wikipedia, and construct a crawler to crawl, render, and index it.

crawling github-pages rendering seo wikipedia

Last synced: 12 Apr 2025

https://github.com/TransparencyToolkit/Harvester

Web crawling and document processing through a usable interface.

api crawling document interface ocr osint web

Last synced: 13 May 2025

https://github.com/archivebox/abx-dl

⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...

ai-scraping archivebox chrome cli cli-tool crawling curl downloader gallery-dl headless http-client internet-archiving playwright puppeteer scraping wget youtube-dl yt-dlp

Last synced: 15 Mar 2026

https://github.com/howie6879/talospider

talospider - A simple, lightweight scraping micro-framework

crawler crawling python spider web-spider

Last synced: 25 Oct 2025

https://github.com/scrapinghub/learn.scrapinghub.com

Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB

crawling learning python scraping scrapy tutorial

Last synced: 08 Jul 2025

https://github.com/swader/diffbot-php-client

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library

ai artificial-intelligence bot crawl crawling diffbot machine-learning nlp php scrape scraped-data scraper scraping

Last synced: 21 Aug 2025

https://github.com/OlafZhang/bilib

A Python scraping library that integrates several native Bilibili APIs and combines them with crawling techniques

anime bilibili-api crawling danmaku

Last synced: 16 Mar 2025

https://github.com/pzaino/thecrowler

A Content Discovery and Development Platform. Empowering Cybersecurity, AI, Marketing, and Finance professionals and researchers to discover, analyze, and interact with the web in all its dimensions.

automation blue-team-tool content-detection content-discovery crawler crawling cyber-security cybersecurity cybersecurity-tools data-collection data-science distributed-systems golang indexer indexing reconnaissance red-team-tools scraping search-engine vulnerability-detection

Last synced: 06 Feb 2026

https://github.com/lorey/socials

👨‍👩‍👦 Social account detection and extraction in Python, e.g. for crawling/scraping.

crawling facebook instagram linkedin python scraping social-network

Last synced: 30 Dec 2025

https://github.com/mawrkus/jason-the-miner

⛏ A versatile Web scraper for Node.js

crawler crawling javascript scraper scraping web-scraper

Last synced: 08 Apr 2025