Projects in Awesome Lists tagged with crawling
A curated list of projects in awesome lists tagged with crawling.
https://github.com/scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
crawler crawling framework hacktoberfest python scraping web-scraping web-scraping-python
Last synced: 16 Jan 2026
https://github.com/apifytech/apify-js
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping
Last synced: 06 Jul 2025
https://github.com/apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping
Last synced: 03 Nov 2025
https://github.com/codelucas/newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
crawler crawling news news-aggregator python scraper
Last synced: 12 May 2025
https://github.com/go-rod/rod
A Chrome DevTools Protocol driver for web automation and scraping.
automation cdp chrome-devtools chrome-devtools-protocol chrome-headless crawling devtools devtools-protocol go golang gorod headless rod scraper testing web web-scraping
Last synced: 15 May 2025
https://github.com/montferret/ferret
Declarative web scraping
cdp chrome cli crawler crawling data-mining dsl go golang library query-language scraper scraping scraping-websites tool
Last synced: 01 May 2026
https://github.com/MontFerret/ferret
Declarative web scraping
cdp chrome cli crawler crawling data-mining dsl go golang hacktoberfest library query-language scraper scraping scraping-websites tool
Last synced: 13 Mar 2025
https://github.com/apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping
Last synced: 06 Mar 2026
https://github.com/D4Vinci/Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 13 May 2025
https://github.com/hakluke/hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
bugbounty crawling hacking osint pentesting recon reconnaissance
Last synced: 14 May 2025
https://github.com/hardkoded/puppeteer-sharp
Headless Chrome .NET API
automation chrome chromium crawler crawling csharp e2e e2e-testing puppeteer webautomation
Last synced: 13 May 2025
https://github.com/apache/nutch
Apache Nutch is an extensible and scalable web crawler
apache crawling hadoop java nutch web-crawler
Last synced: 13 May 2025
https://github.com/d4vinci/scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 15 Feb 2026
https://github.com/lorien/grab
Web Scraping Framework
asynchronous crawler crawling framework http-client network pycurl python python-library python3 scraping spider urllib3 web-scraping
Last synced: 14 May 2025
https://github.com/zorlan/skycaiji
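SkyCaiji (蓝天采集器) is a free, open-source crawler system. Data can be collected just by pointing and clicking to edit rules. It runs locally, on a virtual host, or on a cloud server, can collect almost any type of web page, integrates seamlessly with all kinds of CMS site builders, and publishes data in real time without logging in, fully automated with no manual intervention. It is a fully cross-platform cloud crawler system for web big-data collection.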
crawler crawling php spider webcrawler
Last synced: 14 May 2025
https://github.com/edoardottt/cariddi
Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
bugbounty crawler crawling endpoint-discovery endpoints go golang hacktoberfest infosec osint penetration-testing pentesting recon reconnaissance redteam scraper secret-keys secrets-detection security security-tools
Last synced: 14 May 2025
https://github.com/natescarlet/holiday-cn
📅🇨🇳 Chinese statutory public holiday data, automatically scraped daily from State Council announcements
china crawling data holiday natural-language-processing
Last synced: 14 May 2025
https://github.com/NateScarlet/holiday-cn
📅🇨🇳 Chinese statutory public holiday data, automatically scraped daily from State Council announcements
china crawling data holiday natural-language-processing
Last synced: 26 Mar 2025
https://github.com/roach-php/core
The complete web scraping toolkit for PHP.
Last synced: 13 May 2025
https://github.com/lorey/mlscraper
🤖 Scrape data from HTML websites automatically by just providing examples
crawler crawler-python crawling extraction-engine html machine-learning scraper scraping
Last synced: 15 May 2025
https://github.com/elixir-crawly/crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider
Last synced: 11 Dec 2025
https://github.com/webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
crawler crawling wacz warc web-archiving web-crawler webrecorder
Last synced: 10 Feb 2026
https://github.com/clemfromspace/scrapy-selenium
Scrapy middleware to handle javascript pages using selenium
Last synced: 14 May 2025
https://github.com/scrapinghub/scrapyrt
HTTP API for Scrapy spiders
crawler crawling hacktoberfest hacktoberfest2021 python scraper scrapy twisted webcrawler webcrawling
Last synced: 15 May 2025
https://github.com/iawia002/Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
crawler crawling downloader python python3 scraper scraping video
Last synced: 22 Jul 2025
https://github.com/morvanzhou/easy-scraping-tutorial
Simple but useful Python web scraping tutorial code.
asyncio beautifulsoup crawler crawling distributed-scraper regex requests scraping scrapy urllib
Last synced: 16 May 2025
https://github.com/MorvanZhou/easy-scraping-tutorial
Simple but useful Python web scraping tutorial code.
asyncio beautifulsoup crawler crawling distributed-scraper regex requests scraping scrapy urllib
Last synced: 07 Sep 2025
https://github.com/rebrowser/rebrowser-patches
Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on demand.
automation bot bot-detection chrome chromedriver cloudflare crawler crawling datadome headless headless-chrome playwright puppeteer puppeteer-extra rebrowser scraping selenium stealth web-scraping webdriver
Last synced: 14 May 2025
https://github.com/slotix/dataflowkit
Extract structured data from web sites. Web sites scraping.
cdp chrome-fetcher crawling extract-data go golang golang-library headless scraper scraping scraping-websites
Last synced: 16 Jan 2026
https://github.com/josephlimtech/linkedin-profile-scraper-api
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.
crawler crawling expressjs json linkedin linkedin-bot linkedin-crawler linkedin-profile linkedin-profile-scraper linkedin-scraper linkedin-scraping nodejs profile-data puppeteer scraper scrapers scraping scraping-websites spider website-scraper
Last synced: 04 Apr 2025
https://github.com/essandess/isp-data-pollution
ISP Data Pollution to Protect Private Browsing History with Obfuscation
crawling data data-analytics obfuscation privacy privacy-enhancing-technologies web
Last synced: 29 Dec 2025
https://github.com/mishakorzik/adminhack
today we will hack the admin panel of the site.
admin-finder admin-hack admin-panel admin-website-hack allhackingtools cpanel cpanl-finder crawling directory-bruteforce hacking-tool kali-linux linux termux termux-hacking termux-tool website website-hacking website-hacking-methods websitehacking
Last synced: 16 May 2025
https://github.com/scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.
crawling hacktoberfest monitoring monitoring-tool scraping scrapinghub spiders testing
Last synced: 14 May 2025
https://github.com/zhuyingda/webster
a reliable high-level web crawling & scraping framework for Node.js.
automation-test automation-ui chromium crawler crawling headless-chrome javascript javascript-framework nodejs nodejs-framework puppeteer scraping-framework spider
Last synced: 15 May 2025
https://github.com/crawljax/crawljax
Crawljax
crawler crawling dom dynamic event-driven-crawling javascript test-generation web-analysis web-testing
Last synced: 16 May 2025
https://github.com/l4rm4nd/linkedindumper
Python 3 script to dump/scrape/extract company employees from LinkedIn API
crawling employees extracting linkedin osint python3 scraping spider
Last synced: 18 Apr 2026
https://github.com/scrapfly/scrapfly-scrapers
Scalable Python web scraping scripts for 40+ popular domains
antibot automation captcha-bypass crawler crawling crawling-python datascraping proxies python python-scraper scraper scraping scraping-python spider twitter-scraper web-crawler web-scraping web-scraping-python webscraper webscraping
Last synced: 11 Apr 2025
https://github.com/florents-tselai/warcdb
WarcDB: Web crawl data as SQLite databases.
cli crawling database sqlite warc web-archiving web-data
Last synced: 04 Apr 2025
https://github.com/Florents-Tselai/WarcDB
WarcDB: Web crawl data as SQLite databases.
cli crawling database sqlite warc web-archiving web-data
Last synced: 08 Apr 2025
https://github.com/mhmdiaa/second-order
Second-order subdomain takeover scanner
crawler crawling infosec mapping penetration-testing penetration-testing-tools pentesting recon reconnaissance security security-tools web-application-security wordlist wordlist-generator
Last synced: 05 Apr 2025
https://github.com/xorbit01/webpalm
🕸️ Crawl in the web network
crawler crawling data data-science datamining go golang hack mining osint redteam spider tool
Last synced: 15 Dec 2025
https://github.com/XORbit01/webpalm
🕸️ Crawl in the web network
crawler crawling data data-science datamining go golang hack mining osint redteam spider tool
Last synced: 14 Apr 2025
https://github.com/crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping
Last synced: 15 May 2025
https://github.com/rivermont/spidy
The simple, easy to use command line web crawler.
crawler crawling python python3 web-crawler web-spider
Last synced: 16 Jan 2026
https://github.com/alephdata/memorious
Lightweight web scraping toolkit for documents and structured data.
crawling scraping scraping-framework
Last synced: 12 Apr 2025
https://github.com/infinilabs/crawler
🕷️ An easy-to-use spider written in Golang (previously named GOPA).
crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider
Last synced: 11 Apr 2026
https://github.com/marshalx/telegram-crawler
🕷 Automatically detect changes made to the official Telegram sites, clients and servers.
crawler crawling crawling-python parser telegram telegram-org telegram-updates
Last synced: 16 May 2025
https://github.com/MarshalX/telegram-crawler
🕷 Automatically detect changes made to the official Telegram sites, clients and servers.
crawler crawling crawling-python parser telegram telegram-org telegram-updates
Last synced: 15 May 2025
https://github.com/mustafadalga/instagram-bot
An Instagram bot developed using the Selenium Framework
automation automation-selenium bot bulk-comments bulk-unfollow crawler crawling download-stories instagram instagram-api instagram-bot instagram-downloader instagram-without-api mass-liking python python3 selenium selenium-framework selenium-python selenium-webdriver
Last synced: 02 Oct 2025
https://github.com/ai-robots-txt/ai.robots.txt
A list of AI agents and robots to block.
Last synced: 28 Mar 2025
https://github.com/roach-php/laravel
Laravel adapter for Roach, the complete web scraping toolkit for PHP.
crawling laravel php web-scraping
Last synced: 11 Apr 2025
https://github.com/antchfx/antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
crawler crawling framework golang scraping web-crawler web-spider
Last synced: 14 Mar 2025
https://github.com/amerkurev/scrapper
Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
crawler crawler-python crawling headless readability scraper scraping web-parsers web-parsing web-scraping
Last synced: 08 May 2025
https://github.com/mishakorzik/infect
Create your virus in Termux!
allhackingtools crawling hacking-tool infect infection linux termux termux-hacking termux-tool virus virus-termux viruses
Last synced: 09 May 2025
https://github.com/a3h1nt/grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
algorithm-schema automation crawling curl google-dorks grawler osint osint-tool php proxy scraping xampp
Last synced: 09 Apr 2025
https://github.com/A3h1nt/Grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
algorithm-schema automation crawling curl google-dorks grawler osint osint-tool php proxy scraping xampp
Last synced: 11 Jul 2025
https://github.com/google/corpuscrawler
Crawler for linguistic corpora
corpus-builder corpus-linguistics crawling linguistics minority-language
Last synced: 14 Mar 2025
https://github.com/18520339/facebook-data-extraction
Experience for effectively fetching Facebook data by Querying Graph API with Account-based Token and Operating undetectable scraping Bots to extract Client/Server-side Rendered content
automation browser-fingerprinting crawling facebook facebook-graph-api proxy scraping selenium tor-network
Last synced: 03 Apr 2025
https://github.com/csharp-leaf/Leaf.xNet
HTTP Library. Improved original xNet.
capsolver captcha-solving cookies crawling csharp http http-client https parser proxy-client scraping
Last synced: 12 Apr 2025
https://github.com/mehmetozkaya/dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping
Last synced: 11 May 2025
https://github.com/mehmetozkaya/DotnetCrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping
Last synced: 18 Apr 2025
https://github.com/N0taN3rd/Squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 06 Apr 2025
https://github.com/n0tan3rd/squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 13 Sep 2025
https://github.com/dimkouv/massivedl
Download a large list of files concurrently
crawling download-manager downloader golang
Last synced: 15 Jan 2026
https://github.com/karthikuj/sasori
Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.
automation crawler crawling dast dynamic endpoint-discovery infosec puppeteer scraping security
Last synced: 15 Aug 2025
https://github.com/janreges/siteone-crawler
SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).
analyzer crawler crawling performance qa quality-assessment security seo seotools stress-testing swoole testing website
Last synced: 18 Mar 2026
https://github.com/unblocked-web/double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
crawling puppeteer scraping scrapy secret-agent
Last synced: 08 Apr 2025
https://github.com/alash3al/scraply
Scraply, a simple DOM scraper to fetch information from any HTML-based website
crawler crawling dom golang scraper scrapers scraping-websites scrapy server
Last synced: 28 Apr 2025
https://github.com/ihandmine/aioscpy
An asyncio + aiolibs crawler that imitates the Scrapy framework
aiohttp asyncio crawling framework loguru python3 scrapy scrapy-redis
Last synced: 14 Jan 2026
https://github.com/archiveteam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd
Last synced: 04 Apr 2025
https://github.com/SimFin/pdf-crawler
SimFin's open source PDF crawler
crawler crawling geckodriver pdf pdf-crawler puppeteer python selenium-webdriver
Last synced: 07 Apr 2025
https://github.com/simfin/pdf-crawler
SimFin's open source PDF crawler
crawler crawling geckodriver pdf pdf-crawler puppeteer python selenium-webdriver
Last synced: 28 Oct 2025
https://github.com/antoinevastel/bots-zoo
bot crawler crawling playwright puppeteer scraper scraping selenium user-agent useragent
Last synced: 16 Aug 2025
https://github.com/maxcountryman/warc-parquet
🗄️ A simple CLI for converting WARC to Parquet.
crawling duckdb parquet warc web-archiving
Last synced: 16 May 2025
https://github.com/ArchiveTeam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd
Last synced: 18 Jul 2025
https://github.com/kreuzberg-dev/kreuzcrawl
High-performance web crawling engine with bindings for 11 languages
crawling csharp elixir ffi golang java mcp php python ruby rust typescript wasm web-crawler web-scraping
Last synced: 07 Jun 2026
https://github.com/fcavallarin/burp-dom-scanner
Burp Suite's extension to scan and crawl Single Page Applications
crawling dom scanning single-page-applications xss xss-detection
Last synced: 17 Mar 2025
https://github.com/usc-isi-i2/dig-etl-engine
Download DIG to run on your laptop or server.
crawling etl-framework etl-pipeline information-extraction information-visualization search-engine
Last synced: 04 Aug 2025
https://github.com/creekorful/bathyscaphe
Fast, highly configurable, cloud native dark web crawler.
architecture crawler crawling elasticsearch golang hidden-services kibana tor web-crawler
Last synced: 17 Mar 2025
https://github.com/datawizard1337/ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
crawling python scraping scrapy scrapyd webcrawling webscraping
Last synced: 20 Mar 2025
https://github.com/carlosplanchon/spidercreator
Automated web scraping spider generation using Browser Use and LLMs. Streamline the creation of Playwright-based spiders with minimal manual coding. Ideal for large enterprises with recurring data extraction needs.
ai automation browser-use crawling llm low-code no-code python rpa scraping spider vibe-coding
Last synced: 15 Sep 2025
https://github.com/jroakes/tech-seo-crawler
Build a small, 3-domain internet using GitHub Pages and Wikipedia and construct a crawler to crawl, render, and index.
crawling github-pages rendering seo wikipedia
Last synced: 12 Apr 2025
https://github.com/shurco/goClone
🌱 goClone - clone websites in seconds
cloner cloning crawler crawling go goclone golang hacktoberfest scraping scraping-websites scrapper website-cloner website-scraper wp2static
Last synced: 05 May 2025
https://github.com/archivebox/abx-dl
⬇️ A simple all-in-one CLI tool to download EVERYTHING from a URL (like youtube-dl/yt-dlp, forum-dl, gallery-dl, simpler ArchiveBox). 🎭 Uses headless Chrome to get HTML, JS, CSS, images/video/audio/subtitles, PDFs, screenshots, article text, git repos, and more...
ai-scraping archivebox chrome cli cli-tool crawling curl downloader gallery-dl headless http-client internet-archiving playwright puppeteer scraping wget youtube-dl yt-dlp
Last synced: 15 Mar 2026
https://github.com/afuntw/python-crawling-tutorial
Python crawling tutorial
crawling ipynb-jupyter-notebook python
Last synced: 28 Oct 2025
https://github.com/howie6879/talospider
talospider - A simple,lightweight scraping micro-framework
crawler crawling python spider web-spider
Last synced: 25 Oct 2025
https://github.com/swader/diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
ai artificial-intelligence bot crawl crawling diffbot machine-learning nlp php scrape scraped-data scraper scraping
Last synced: 21 Aug 2025
https://github.com/OlafZhang/bilib
A Python crawling library that integrates multiple native Bilibili APIs and combines them with scraping techniques
anime bilibili-api crawling danmaku
Last synced: 16 Mar 2025
https://github.com/pzaino/thecrowler
A Content Discovery and Development Platform. Empowering Cybersecurity, AI, Marketing, and Finance professionals and researchers to discover, analyze, and interact with the web in all its dimensions.
automation blue-team-tool content-detection content-discovery crawler crawling cyber-security cybersecurity cybersecurity-tools data-collection data-science distributed-systems golang indexer indexing reconnaissance red-team-tools scraping search-engine vulnerability-detection
Last synced: 06 Feb 2026
https://github.com/lorey/socials
👨‍👩‍👦 Social account detection and extraction in Python, e.g. for crawling/scraping.
crawling facebook instagram linkedin python scraping social-network
Last synced: 30 Dec 2025
https://github.com/Conso1eCowb0y/Deepminer
Deep web crawler and search engine
crawler crawling dark-web data-mining deepminer deepweb github hacking onion osint python-web-scraper python3 search-engine security security-tools spider the-onion-router tor tor-network webcrawler
Last synced: 20 Apr 2025
https://github.com/mawrkus/jason-the-miner
⛏ A versatile Web scraper for Node.js
crawler crawling javascript scraper scraping web-scraper
Last synced: 08 Apr 2025