An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with web-crawler

A curated list of projects in awesome lists tagged with web-crawler.

https://github.com/mendableai/firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

ai ai-scraping crawler data html-to-markdown llm markdown rag scraper scraping web-crawler webscraping

Last synced: 12 May 2025

https://github.com/apifytech/apify-js

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 06 Jul 2025

https://github.com/apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 03 Nov 2025

https://github.com/crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

crawlab crawler crawling-tasks docker go platform scrapy scrapyd-ui spider spiders-management web-crawler webcrawler webspider

Last synced: 14 May 2025

https://github.com/ssssssss-team/spider-flow

A next-generation crawler platform that defines crawl workflows graphically, so crawlers can be built without writing code.

crawler jsoup spider spider-flow web-crawler web-spider webcrawler webspider xpath

Last synced: 14 May 2025

https://github.com/adithya-s-k/omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

ingestion-api ocr omniparser parse-server parser-library vision-transformer web-crawler whisper-api

Last synced: 13 May 2025

https://github.com/firecrawl/firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai javascript-rendering llm-tools mcp mcp-server model-context-protocol search-api web-crawler web-scraping

Last synced: 07 Apr 2026

https://github.com/apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping

Last synced: 06 Mar 2026

https://github.com/apache/nutch

Apache Nutch is an extensible and scalable web crawler

apache crawling hadoop java nutch web-crawler

Last synced: 13 May 2025

https://github.com/sjdirect/abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

abot abot-nuget c-sharp crawler cross-platform csharp csharp-library javascript-renderer netcore netcore2 netcore3 netsta netstandard20 netstandard21 parsing pluggable spider spiders unit-testing web-crawler

Last synced: 13 May 2025

https://github.com/xianhu/pspider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

crawler multi-threading multiprocessing proxies python python-spider spider web-crawler web-spider

Last synced: 15 May 2025

https://github.com/xianhu/PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

crawler multi-threading multiprocessing proxies python python-spider spider web-crawler web-spider

Last synced: 25 Mar 2025

https://github.com/microlinkhq/browserless

The headless Chrome/Chromium driver on top of Puppeteer. Take screenshots, generate PDFs, extract text and HTML with a production-ready API.

automation browser-automation chromium lighthouse pdf-generation screenshot web-crawler web-scraping

Last synced: 06 Mar 2026

https://github.com/Algebra-FUN/WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs.

book-downloader selenium web-crawler weread

Last synced: 17 Oct 2025

https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler crawling wacz warc web-archiving web-crawler webrecorder

Last synced: 10 Feb 2026

https://github.com/apache/stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

apache-storm crawler distributed java stormcrawler web-crawler

Last synced: 13 Feb 2026

https://github.com/algebra-fun/wereadscan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs.

book-downloader selenium web-crawler weread

Last synced: 15 May 2025

https://github.com/apache/incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

apache-storm crawler distributed java stormcrawler web-crawler

Last synced: 12 Apr 2025

https://github.com/platonai/pulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed AI-RPA.

ai-agents ai-crawler ai-rpa ai-scrarper crawler rpa scraper scraping web-crawler web-scraping

Last synced: 01 Apr 2025

https://github.com/postmodern/spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

crawler ruby scraper spider spider-links web web-crawler web-scraper web-scraping web-spider

Last synced: 13 May 2025

https://github.com/gildas-lormeau/single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

archiving cli crawler deno dockerfile nodejs scraping-websites single-file web-archiving web-crawler web-scraper web-scraping

Last synced: 15 May 2025

https://github.com/MarginaliaSearch/MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

alt-search indexer internet-search language-processing no-ai-used no-cloud search-engine small-web web-crawler

Last synced: 05 Apr 2025

https://github.com/PhialsBasement/LibreCrawl

Free desktop SEO crawler - open source alternative to Screaming Frog and similar tools. Crawl websites, analyze links, extract SEO data, and export results without subscription fees. Fully customizable and extensible!

desktop-app flask free open-source python seo seo-analysis web-crawler website-auditing

Last synced: 06 May 2026

https://github.com/0xMassi/webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

ai ai-agents ai-scraping cli crawler data-extraction html-to-markdown llm markdown mcp mcp-server rust scraper self-hosted tls-fingerprinting web-crawler web-extraction web-scraper web-scraping webscraping

Last synced: 04 Apr 2026

https://github.com/USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

big-data distributed-systems information-retrieval nutch search search-engine solr spark tika web-crawler

Last synced: 25 Mar 2025

https://github.com/internetarchive/Zeno

State-of-the-art web crawler 🔱

archiving web-crawler zeno

Last synced: 05 May 2026

https://github.com/brendonboshell/supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

crawler distributed-crawler robot sitemap web-crawler

Last synced: 12 Jan 2026

https://github.com/internetarchive/zeno

State-of-the-art web crawler 🔱

archiving web-crawler zeno

Last synced: 27 Jan 2026

https://github.com/rivermont/spidy

The simple, easy to use command line web crawler.

crawler crawling python python3 web-crawler web-spider

Last synced: 16 Jan 2026

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 12 Jun 2025

https://github.com/infinilabs/crawler

🕷️ An easy-to-use spider written in Golang (previously named GOPA).

crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider

Last synced: 11 Apr 2026

https://github.com/yields/ant

A web crawler for Go

go golang scraper spider web-crawler

Last synced: 16 May 2025

https://github.com/lucasxlu/LagouJob

Data Analysis & Mining for lagou.com

data-analysis data-mining lagou machine-learning nlp python3 web-crawler

Last synced: 18 Jul 2025

https://github.com/antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

crawler crawling framework golang scraping web-crawler web-spider

Last synced: 14 Mar 2025

https://github.com/crawler-commons/crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

java library open-source robots-txt robotstxt sitemaps web-crawler

Last synced: 06 Mar 2026

https://github.com/turnersoftware/infinitycrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 21 Jun 2025

https://github.com/TurnerSoftware/InfinityCrawler

A simple but powerful web crawler library for .NET

crawler robots-txt spider web-crawler web-crawling

Last synced: 25 Mar 2025

https://github.com/crawlab-team/crawlab-lite

Lite version of Crawlab, a lightweight crawler management platform.

crawlab crawler crawler-management crawling-tasks platform scrapy scrapy-ui scrapyd scrapyd-ui spider web-crawler

Last synced: 28 Jan 2026

https://github.com/mendableai/firecrawl-app-examples

🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.

ai ai-scraping data examples html-to-markdown llm markdown rag scrapers templates web-crawler

Last synced: 13 Apr 2025

https://github.com/elliotxx/zhihu-crawler-people

A simple distributed crawler for Zhihu, with data analysis

crawler python python-crawler spider web-crawler web-spider

Last synced: 13 Apr 2025

https://github.com/gosom/scrapemate

Golang Crawling and scraping framework

crawler go go-framework golang scraper spider web-crawler web-scraping

Last synced: 31 Jan 2026

https://github.com/Hecate2/Ignareo-ISML-auto-voter

Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)

asyncio chtholly concurrency distributed gevent high-performance http ignareo isml microservice python sukamoka sukasuka tiat web-crawler web-spider

Last synced: 11 Apr 2025

https://github.com/norconex/crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

collector-fs collector-http crawler crawlers filesystem-crawler flexible java search-engine web-crawler

Last synced: 05 May 2026

https://github.com/madi-s/lead-generation

A Python script that empowers people with no programming background to generate robust leads at scale. This repo will collect various versatile techniques used in lead generation.

chromedriver lead-generation leads leadscanner parser playwright python scraper web-crawler

Last synced: 06 Jul 2025

https://github.com/abaykan/CrawlBox

Easy way to brute-force web directory.

admin-finder crawler python web-crawler wordlist

Last synced: 26 Mar 2025

https://github.com/mazzzystar/proxy

A simple tool for fetching usable proxies from several websites.

proxies proxy-list proxypool web-crawler

Last synced: 09 Jan 2026

https://github.com/hominee/dyer

Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.

crawler rust rust-programming-language spider web-crawler web-framework web-scraping

Last synced: 11 Mar 2026

https://github.com/pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

ai-assistant ai-tools cheerio content-extraction crawler duckduckgo duckduckgo-search google-search mcp mcp-server web-content web-crawler web-scraper web-scraping web-search web-search-agent

Last synced: 06 Mar 2026

https://github.com/kreuzberg-dev/kreuzcrawl

High-performance web crawling engine with bindings for 11 languages

crawling csharp elixir ffi golang java mcp php python ruby rust typescript wasm web-crawler web-scraping

Last synced: 24 May 2026

https://github.com/creekorful/bathyscaphe

Fast, highly configurable, cloud native dark web crawler.

architecture crawler crawling elasticsearch golang hidden-services kibana tor web-crawler

Last synced: 17 Mar 2025

https://github.com/viveckh/lilhomie

A Machine Learning Project implemented from scratch which involves web scraping, data engineering, exploratory data analysis and machine learning to predict housing prices in New York Tri-State Area.

data-engineering eda housing-price-analysis housing-price-prediction machine-learning machine-learning-projects predictions random-forest-regressor scrapy-crawler spiders trulia web-crawler

Last synced: 09 Sep 2025

https://github.com/tech-engine/goscrapy

GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.

data-extraction go-scrapy golang goscraper scrapy spider web-crawler webscraper webscrapping

Last synced: 18 Jan 2026

https://github.com/redcode-labs/unchain

A tool to find redirection chains in multiple URLs

golang reconnaissance redirection url url-redirection web-crawler

Last synced: 07 Apr 2025

https://github.com/redcode-labs/UnChain

A tool to find redirection chains in multiple URLs

golang reconnaissance redirection url url-redirection web-crawler

Last synced: 11 Jul 2025

https://github.com/mattdeitke/cvpr2019

Displays all the 2019 CVPR Accepted Papers in a way that they are easy to parse.

computer-vision cvpr2019 imagemagick lda python web-crawler web-crawler-python

Last synced: 13 Apr 2025

https://github.com/us/crw

Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

ai ai-agents crawler data-extraction docker firecrawl firecrawl-alternative html-to-markdown llm markdown mcp mcp-server rust scraping-api self-hosted tavily-alternative web-crawler web-scraper web-scraping web-search-api

Last synced: 09 May 2026

https://github.com/scrapegraphai/scrapegraph-py

Official Python SDK for the ScrapeGraph AI API. Smart scraping, search, crawling, markdownify, agentic browser automation, scheduled jobs, and structured data extraction

api json-schema python scrapegraph scraping sdk-js sdk-nodejs sdk-python web-crawler web-scraping web-scraping-python

Last synced: 21 Apr 2026

https://github.com/devopsgroup-io/siteshooter

📷 Automate full website screenshots and PDF generation with multiple viewport support.

pdf-generation phantomjs salesforce screenshot seo sitemap web-crawler

Last synced: 13 Apr 2025

https://github.com/abo123456789/leek

Distributed task redisqueue (the simplest Python distributed function scheduling framework)

distribute-crawler kafka leek producer-consumer queue-tasks redis redisqueue sqlite3 thread-pool web-crawler

Last synced: 10 Mar 2026

https://github.com/cheng-lin-li/market-trend-prediction

A project for a course on building knowledge graphs. It leverages historical stock prices and integrates social media listening from customers to predict the market trend of the Dow Jones Industrial Average (DJIA).

djia dow-jones-industrial-average facebook facebook-crawler jupyter knowledge-graph knowledge-graph-course lstm market-trend-prediction prediction python rnn semantic-web social-media-mining twitter twitter-crawler web-crawler yahoo-finance-api

Last synced: 03 May 2025

https://github.com/scrapegraphai/scrapegraph-sdk

🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.

api scrapegraph scraping sdk-js sdk-nodejs sdk-python web-crawler web-scraping

Last synced: 26 Jun 2025

https://github.com/shenfe/puppeteer-service

🎠 Run headless Chrome (aka Puppeteer) as a service.

headless-chrome puppeteer puppeteer-service web-crawler

Last synced: 14 Jun 2025

https://github.com/threenine/stop-web-crawlers-api

Stop Web Crawlers update API

web-crawler

Last synced: 13 Apr 2025

https://github.com/spk/maman

Rust Web Crawler saving pages on Redis

crawler http spider web web-crawler

Last synced: 07 Oct 2025

https://github.com/spk/validate-website

Web crawler for checking the validity of your documents.

html validator web-crawler

Last synced: 23 Apr 2025

https://github.com/laurentvv/crawl4ai-mcp

Web crawling tool that integrates with AI assistants via the MCP

ai-tools crawl4ai mcp python3 web-crawler

Last synced: 24 Apr 2026

https://github.com/debugtalk/webcrawler

A web crawler based on requests-html, mainly intended for URL validation testing.

crawler requests-html web-crawler weblink

Last synced: 15 Apr 2025

https://github.com/leafrock/spiderx

A simple web-crawler development framework based on .Net Core.

csharp dotnetcore spider web-crawler

Last synced: 19 Apr 2025

https://github.com/scrapegraphai/scrapegraph-mcp

ScrapeGraph MCP Server

web-crawler

Last synced: 10 Jul 2025

https://github.com/sergio11/eclipserecon

🌑 EclipseRecon is a personal project developed during my cybersecurity learning journey 🛡️. It helps practice web reconnaissance 🌐 by identifying subdomains 🧩, site structures 🧭, and vulnerabilities 🐞 in a controlled environment 🧪.

blue-team bug-bounty cybersecurity ethical-hacking information-gathering owasp penetration-testing reconnaissance red-team scan-tools security security-analysis security-reporting security-tools subdomain-scanner vulnerability vulnerability-scanner web-application-security web-crawler web-security

Last synced: 06 Sep 2025

https://github.com/calebwin/frequent

A utility for crawling websites and building frequency lists of words

frequency-lists python web-crawler web-crawler-python word-frequency

Last synced: 09 Apr 2025

https://github.com/bartozzz/crawlerr

A simple and fully customizable web crawler/spider for Node.js with server-side DOM. Comes with elegant and hell-simple APIs.

crawler jsdom nodejs scraper spider web-crawler

Last synced: 23 Apr 2025

https://github.com/HHN/crawler4j

Open Source Web Crawler for Java - A fork of yasserg/crawler4j

crawler crawler4j java spider web-crawler web-spider

Last synced: 05 Oct 2025

https://github.com/waynechang65/ptt-crawler

ptt-crawler is a web crawler module designed to scrape data from Ptt.

api crawl crawler javascript nodejs ptt scrape scraper scraping spider typescript web-crawler webcrawler

Last synced: 08 Oct 2025

https://github.com/biraj21/web-wanderer

A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.

data-extraction multithreading python web-crawler webcrawler

Last synced: 12 Jan 2026

https://github.com/hmarzban/pipe2time.ir

Web crawler for Time.ir that retrieves JSON files; a Jalali, Qamari, and Miladi JSON calendar API.

calendar events ics jalali json-api miladi nodejs shamsi-calendar web-crawler

Last synced: 25 Jul 2025