Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2024-12-25 00:05:56 UTC
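As a rough illustration of what "systematically browses" means in the definition above, here is a minimal breadth-first crawl sketch. It is not taken from any project listed on this page. The `FAKE_WEB` dictionary, the `example.test` URLs, and the `crawl` and `LinkExtractor` names are all hypothetical, and the "web" is an in-memory stand-in so the sketch runs without network access. A real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" (URL -> HTML) so the sketch needs no network.
FAKE_WEB = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.test/b": '<a href="/">home</a>',
    "https://example.test/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch=FAKE_WEB.get, max_pages=100):
    """Visit pages breadth-first from `seed`; return URLs in visit order.

    `fetch` is an assumed callable that returns HTML or None for a URL.
    """
    seen = {seed}          # URLs already queued, to avoid revisiting
    queue = deque([seed])  # frontier of URLs waiting to be fetched
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:   # unknown page: skip it
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


if __name__ == "__main__":
    for visited in crawl("https://example.test/"):
        print(visited)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and swapping the `fetch` argument for a real HTTP client is the main change a networked version would need.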
https://github.com/scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
crawler crawling framework hacktoberfest python scraping web-scraping web-scraping-python
Last synced: 23 Dec 2024
https://github.com/naibowang/easyspider
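A visual no-code/code-free web crawler/spider. EasySpider (易采集) is a visual tool for browser automation testing, data collection, and web crawling that lets you design and run crawler tasks graphically without writing code. Also known as ServiceWrapper, an intelligent service-wrapping system for web applications.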
batch-processing batch-script code-free crawler data-collection frontend gui html input-parameters layman parameters robotics rpa scraper spider visual visualization visualprogramming web www
Last synced: 23 Dec 2024
https://github.com/NaiboWang/EasySpider
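A visual no-code/code-free web crawler/spider. EasySpider (易采集) is a visual tool for browser automation testing, data collection, and web crawling that lets you design and run crawler tasks graphically without writing code. Also known as ServiceWrapper, an intelligent service-wrapping system for web applications.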
batch-processing batch-script code-free crawler data-collection frontend gui html input-parameters layman parameters robotics rpa scraper spider visual visualization visualprogramming web www
Last synced: 27 Oct 2024
https://github.com/mendableai/firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
ai ai-scraping crawler data html-to-markdown llm markdown rag scraper scraping web-crawler webscraping
Last synced: 23 Dec 2024
https://github.com/binux/pyspider
A Powerful Spider(Web Crawler) System in Python.
Last synced: 29 Sep 2024
https://github.com/apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping
Last synced: 23 Dec 2024
https://github.com/shengqiangzhang/examples-of-web-crawlers
Some interesting, beginner-friendly examples of Python crawlers, mainly targeting sites such as Taobao, Tmall, WeChat, WeChat Reading (WeRead), Douban, and QQ.
agent-pool crawler example fund multithreading pyquery python selenium spider stock taobao tmall wechat wechat-report wereader
Last synced: 24 Dec 2024
https://github.com/codelucas/newspaper
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
crawler crawling news news-aggregator python scraper
Last synced: 23 Dec 2024
https://github.com/code4craft/webmagic
A scalable web crawler framework for Java.
crawler framework java scraping
Last synced: 20 Dec 2024
https://github.com/crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
crawlab crawler crawling-tasks docker go platform scrapy scrapyd-ui spider spiders-management web-crawler webcrawler webspider
Last synced: 24 Dec 2024
https://github.com/s0md3v/photon
Incredibly fast crawler designed for OSINT.
crawler information-gathering osint python spider
Last synced: 23 Dec 2024
https://github.com/s0md3v/Photon
Incredibly fast crawler designed for OSINT.
crawler information-gathering osint python spider
Last synced: 28 Oct 2024
https://github.com/ssssssss-team/spider-flow
A new-generation crawler platform: define crawler workflows graphically and build crawlers without writing code.
crawler jsoup spider spider-flow web-crawler web-spider webcrawler webspider xpath
Last synced: 25 Dec 2024
https://github.com/evil0ctal/douyin_tiktok_download_api
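🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls and online batch parsing and downloading.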
api async crawler douyin douyin-api douyin-scraper douyin-tiktok-api douyin-tiktok-download fastapi no-watermark online-parsing python pywebio scraper spider tiktok tiktok-api tiktok-scraper tiktok-signature web-scraping
Last synced: 23 Dec 2024
https://github.com/guyueyingmu/avbook
Adult video (AV) management system with crawlers for avmoo, javbus, and javlibrary; an online Japanese adult video library and magnet link database.
adult adult-video avmoo crawler database guzzlehttp javbus javlibrary laravel magnet magnet-link scraper spider
Last synced: 24 Dec 2024
https://github.com/Evil0ctal/Douyin_TikTok_Download_API
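🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls and online batch parsing and downloading.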
api async crawler douyin douyin-api douyin-scraper douyin-tiktok-api douyin-tiktok-download fastapi no-watermark online-parsing python pywebio scraper spider tiktok tiktok-api tiktok-scraper tiktok-signature web-scraping
Last synced: 29 Oct 2024
https://github.com/projectdiscovery/katana
A next-generation crawling and spidering framework.
cli crawler gocrawler headless spider-framework web-spider
Last synced: 24 Dec 2024
https://github.com/bda-research/node-crawler
Web Crawler/Spider for NodeJS + server-side jQuery ;-)
cheerio crawler extract-data javascript jquery nodejs spider
Last synced: 23 Dec 2024
https://github.com/alirezamika/autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
ai artificial-intelligence automation crawler machine-learning python scrape scraper scraping web-scraping webautomation webscraping
Last synced: 23 Dec 2024
https://github.com/montferret/ferret
Declarative web scraping
cdp chrome cli crawler crawling data-mining dsl go golang hacktoberfest library query-language scraper scraping scraping-websites tool
Last synced: 23 Dec 2024
https://github.com/MontFerret/ferret
Declarative web scraping
cdp chrome cli crawler crawling data-mining dsl go golang hacktoberfest library query-language scraper scraping scraping-websites tool
Last synced: 25 Oct 2024
https://github.com/rmax/scrapy-redis
Redis-based components for Scrapy.
crawler distributed redis scrapy
Last synced: 23 Dec 2024
https://github.com/spiderclub/haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
crawler distributed high-availability ipproxy redis scheduler scrapy spider
Last synced: 25 Dec 2024
https://github.com/SpiderClub/haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
crawler distributed high-availability ipproxy redis scheduler scrapy spider
Last synced: 29 Oct 2024
https://github.com/apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping
Last synced: 23 Dec 2024
https://github.com/dropsdevopsorg/ecommercecrawlers
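Hands-on 🐍 crawlers 🕷 for a variety of websites and e-commerce data. Includes 🕸: Taobao products, WeChat official accounts, Dianping, Qichacha, recruitment sites, Xianyu, Ali tasks, Cnblogs, Weibo, Baidu Tieba, Douban Movies, Baotu, Quanjing, Douban Music, a provincial drug administration, Sohu News, machine-learning text collection, FOFA asset collection, Autohome, the National Bureau of Statistics, Baidu keyword index counts, spider pan-directories, Toutiao, Douban film reviews, Ctrip, the Xiaomi app store, Anjuke, and Tujia homestays ❤️❤️❤️. WeChat crawler showcase project: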
alitask baidu baidu-tieba baotu boss crawler ctrip dazhong-spider douban-movie douban-music fofa lagou python3 quanjing scrapy sohu taobao-spider wechat xianyu zhilianzhaopin
Last synced: 26 Dec 2024
https://github.com/DropsDevopsOrg/ECommerceCrawlers
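Hands-on 🐍 crawlers 🕷 for a variety of websites and e-commerce data. Includes 🕸: Taobao products, WeChat official accounts, Dianping, Qichacha, recruitment sites, Xianyu, Ali tasks, Cnblogs, Weibo, Baidu Tieba, Douban Movies, Baotu, Quanjing, Douban Music, a provincial drug administration, Sohu News, machine-learning text collection, FOFA asset collection, Autohome, the National Bureau of Statistics, Baidu keyword index counts, spider pan-directories, Toutiao, Douban film reviews, Ctrip, the Xiaomi app store, Anjuke, and Tujia homestays ❤️❤️❤️. WeChat crawler showcase project: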
alitask baidu baidu-tieba baotu boss crawler ctrip dazhong-spider douban-movie douban-music fofa lagou python3 quanjing scrapy sohu taobao-spider wechat xianyu zhilianzhaopin
Last synced: 26 Oct 2024
https://github.com/myreader-io/mygptreader
A community-driven way to read and chat with AI bots - powered by chatGPT.
ai chatgpt crawler daily-news embedding gpt-35-turbo hot-news openai prompt reader scraper slack-bot
Last synced: 26 Dec 2024
https://github.com/madawei2699/myGPTReader
A community-driven way to read and chat with AI bots - powered by chatGPT.
ai chatgpt crawler daily-news embedding gpt-35-turbo hot-news openai prompt reader scraper slack-bot
Last synced: 28 Oct 2024
https://github.com/madawei2699/mygptreader
A community-driven way to read and chat with AI bots - powered by chatGPT.
ai chatgpt crawler daily-news embedding gpt-35-turbo hot-news openai prompt reader scraper slack-bot
Last synced: 15 Oct 2024
https://github.com/niespodd/browser-fingerprinting
Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot systems 👻 and get around browser fingerprinting scripts 🕵️ when scraping the web?
automation bot bot-detection browser-fingerprinting chromedriver chromium chromium-browser crawler detection fingerprinting puppeteer recaptcha scraper spider stealth web webscraping
Last synced: 24 Dec 2024
https://github.com/dotnetcore/dotnetspider
DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling and scraping framework.
crawler cross-platform csharp distributed dotnetcore
Last synced: 24 Dec 2024
https://github.com/imwildcat/scylla
Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era
crawler proxy-pool python python3 scylla
Last synced: 24 Dec 2024
https://github.com/dotnetcore/DotnetSpider
DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling and scraping framework.
crawler cross-platform csharp distributed dotnetcore
Last synced: 27 Oct 2024
https://github.com/imWildCat/scylla
Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era
crawler proxy-pool python python3 scylla
Last synced: 29 Oct 2024
https://github.com/constverum/proxybroker
Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
anonymity anonymous crawler http-proxy privacy proxies proxy proxy-checker proxy-list proxy-server proxypool socks
Last synced: 24 Dec 2024
https://github.com/zu1k/proxypool
Automatically crawls proxy nodes on the public internet, de-duplicates and tests for usability and then provides a list of nodes
Last synced: 26 Sep 2024
https://github.com/arachni/arachni
Web Application Security Scanner Framework
analysis arachni audit crawler detection dom hack hacking javascript modular penetration-testing ruby scanner scanners security-audit sql-injection vulnerability-detection web-application xss
Last synced: 22 Dec 2024
https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China
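A collection of legal cases in China involving web crawlers. This project gathers news, materials, and laws and regulations related to lawsuits and violations involving crawler developers in mainland China. It aims to help crawler practitioners working in mainland China understand the relevant laws and avoid crossing data-compliance red lines. [AD] Chinese knowledge graph portal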
Last synced: 01 Nov 2024
https://github.com/Arachni/arachni
Web Application Security Scanner Framework
analysis arachni audit crawler detection dom hack hacking javascript modular penetration-testing ruby scanner scanners security-audit sql-injection vulnerability-detection web-application xss
Last synced: 03 Nov 2024
https://github.com/constverum/ProxyBroker
Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
anonymity anonymous crawler http-proxy privacy proxies proxy proxy-checker proxy-list proxy-server proxypool socks
Last synced: 24 Oct 2024
https://github.com/dataabc/weibo-crawler
Sina Weibo crawler: uses Python to scrape Sina Weibo data and download Weibo images and videos.
Last synced: 25 Dec 2024
https://github.com/hardkoded/puppeteer-sharp
Headless Chrome .NET API
automation chrome chromium crawler crawling csharp e2e e2e-testing puppeteer webautomation
Last synced: 24 Dec 2024
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 26 Oct 2024
https://github.com/tuhinshubhra/red_hawk
All-in-one tool for Information Gathering, Vulnerability Scanning and Crawling. A must-have tool for all penetration testers
admin-scanner backups-finder cloudflare-detection cms-detector crawler domain-authority-scanner geo-ip http-header information-gathering mx-lookup page-authority-scanner reverse-ip-scan scanner sql-scanner sql-vulnerability-scannig subdomain-scanner subnet-lookup whois-lookup wordpress wordpress-scanner
Last synced: 24 Dec 2024
https://github.com/Tuhinshubhra/RED_HAWK
All-in-one tool for Information Gathering, Vulnerability Scanning and Crawling. A must-have tool for all penetration testers
admin-scanner backups-finder cloudflare-detection cms-detector crawler domain-authority-scanner geo-ip http-header information-gathering mx-lookup page-authority-scanner reverse-ip-scan scanner sql-scanner sql-vulnerability-scannig subdomain-scanner subnet-lookup whois-lookup wordpress wordpress-scanner
Last synced: 30 Oct 2024
https://github.com/dedsecinside/torbot
Dark Web OSINT Tool
algorithm crawler dark-web dedsec-inside deepweb go hacking hacktoberfest osint projects psnappz python python-web-crawler python3 security security-tools spider tor tor-network torbot
Last synced: 25 Dec 2024
https://github.com/DedSecInside/TorBot
Dark Web OSINT Tool
algorithm crawler dark-web dedsec-inside deepweb go hacking hacktoberfest osint projects psnappz python python-web-crawler python3 security security-tools spider tor tor-network torbot
Last synced: 02 Nov 2024
https://github.com/Qianlitp/crawlergo
A powerful browser crawler for web vulnerability scanners
arsenal blackhat chrome-devtools chromedp crawler crawlergo golang headless headless-chrome vulnerability-scanner web-vulnerability-scanners
Last synced: 05 Nov 2024
https://github.com/kanasimi/work_crawler
Comic and novel downloader. Supported sites: 腾讯漫画 大角虫漫画 有妖气 咪咕 SF漫画 哦漫画 看漫画 漫画柜 汗汗酷漫 動漫伊甸園 快看漫画 微博动漫 733动漫网 大古漫画网 漫画DB 無限動漫 動漫狂 卡推漫画 动漫之家 动漫屋 古风漫画网 36漫画网 亲亲漫画网 乙女漫画 webtoons 咚漫 ニコニコ静画 ComicWalker ヤングエースUP モアイ pixivコミック サイコミ;アルファポリス カクヨム ハーメルン 小説家になろう 起点中文网 八一中文网 顶点小说 落霞小说网 努努书坊 笔趣阁. Exports to EPUB.
cejs comic-downloader comics crawler download-comic downloader ebook epub manga manga-downloader narou novel-downloader novels webcomics
Last synced: 27 Dec 2024
https://github.com/jae-jae/querylist
🕷 The elegant, progressive PHP crawler framework!
crawler querylist scraper spider
Last synced: 23 Dec 2024
https://github.com/nikolait/googlescraper
A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
crawler python scraping search-engine search-engine-optimization search-engines
Last synced: 20 Dec 2024
https://github.com/NikolaiT/GoogleScraper
A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
crawler python scraping search-engine search-engine-optimization search-engines
Last synced: 25 Oct 2024
https://github.com/jae-jae/QueryList
🕷 The elegant, progressive PHP crawler framework!
crawler querylist scraper spider
Last synced: 25 Oct 2024
https://github.com/boris-code/feapder
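🚀🚀🚀 feapder is an easy-to-use, powerful Python crawler framework. It has four built-in crawler types (AirSpider, Spider, TaskSpider, and BatchSpider) for different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The companion crawler management system, feaplat, provides convenient deployment and scheduling.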
crawler feapder feaplat python scrapy spider
Last synced: 25 Dec 2024
https://github.com/Boris-code/feapder
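🚀🚀🚀 feapder is an easy-to-use, powerful Python crawler framework. It has four built-in crawler types (AirSpider, Spider, TaskSpider, and BatchSpider) for different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The companion crawler management system, feaplat, provides convenient deployment and scheduling.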
crawler feapder feaplat python scrapy spider
Last synced: 31 Oct 2024
https://github.com/spatie/crawler
An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
concurrency crawler guzzle php
Last synced: 23 Dec 2024
https://github.com/lorien/grab
Web Scraping Framework
asynchronous crawler crawling framework http-client network pycurl python python-library python3 scraping spider urllib3 web-scraping
Last synced: 25 Dec 2024
https://github.com/facundoolano/google-play-scraper
Node.js scraper to get data from Google Play
api crawler google-play nodejs scraper
Last synced: 24 Dec 2024
https://github.com/thewhiteh4t/FinalRecon
All In One Web Recon
crawler directory-search dns-enumeration headers javascript-crawler pentest-tool pentesting pentesting-tools port-scanning python3 reconnaissance ssl-certificate subdomain-enumeration traceroute web-penetration-testing web-reconnaissance webpentest whois
Last synced: 01 Nov 2024
https://github.com/thewhiteh4t/finalrecon
All In One Web Recon
crawler directory-search dns-enumeration headers javascript-crawler pentest-tool pentesting pentesting-tools port-scanning python3 reconnaissance ssl-certificate subdomain-enumeration traceroute web-penetration-testing web-reconnaissance webpentest whois
Last synced: 20 Dec 2024
https://github.com/sjdirect/abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
abot abot-nuget c-sharp crawler cross-platform csharp csharp-library javascript-renderer netcore netcore2 netcore3 netsta netstandard20 netstandard21 parsing pluggable spider spiders unit-testing web-crawler
Last synced: 25 Dec 2024
https://github.com/fhamborg/news-please
news-please - an integrated web crawler and information extractor for news that just works
cc-news ccnews commoncrawl crawler data-gathering elasticsearch extract-articles extract-information extractor json news news-archive news-articles news-crawler news-extractor news-scraper news-websites nlp python roberta
Last synced: 24 Dec 2024
https://github.com/puerkitobio/gocrawl
Polite, slim and concurrent web crawler.
Last synced: 20 Dec 2024
https://github.com/PuerkitoBio/gocrawl
Polite, slim and concurrent web crawler.
Last synced: 29 Oct 2024
https://github.com/jaybizzle/crawler-detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
bots crawler detect hacktoberfest php spider user-agent
Last synced: 23 Dec 2024
https://github.com/rendora/rendora
dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites
angular chrome-devtools chrome-headless crawler dynamic-rendering go golang javascript puppeteer react reactjs seo seo-optimization server-side-rendering spa ssr vue vuejs
Last synced: 21 Dec 2024
https://github.com/JayBizzle/Crawler-Detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
bots crawler detect hacktoberfest php spider user-agent
Last synced: 03 Nov 2024
https://github.com/blankerl/dxy-covid-19-crawler
COVID-19/2019-nCoV Realtime Infection Crawler and API
2019-ncov crawler realtime-api
Last synced: 20 Dec 2024
https://github.com/zorlan/skycaiji
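SkyCaiji (蓝天采集器) is a free, open-source crawler system. You collect data simply by pointing and clicking to edit rules. It runs locally, on shared hosting, or on cloud servers, can scrape almost any type of web page, integrates seamlessly with a wide range of CMS site builders, publishes data in real time without login, and is fully automated with no manual intervention. It is a fully cross-platform, cloud-based crawler system for large-scale web data collection.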
crawler crawling php spider webcrawler
Last synced: 20 Dec 2024
https://github.com/anouarbensaad/vulnx
vulnx 🕷️ is an intelligent bot that can perform automatic shell injection and help researchers detect security vulnerabilities in CMS systems. It can perform quick CMS security detection, information collection (including subdomain names, IP address, country information, organizational information, time zone, etc.), and vulnerability scanning.
auto-exploiter bot cloudflare-detection cms-detector crawler detects-vulnerabilities dorks exploits hacking information-gathering pentest security-tools shell-injection subdomains-gathering vulnerability vulnerability-assessment vulnerability-detection vulnerability-exploit website-vulnerability-scanner wp-scanner
Last synced: 20 Dec 2024
https://github.com/sqzw-x/mdcx
Movie metadata scraper
crawler emby jav-scraper jellyfin metadata movie-crawler movie-metadata movie-scrapper movies python scraper
Last synced: 26 Dec 2024
https://github.com/xianhu/pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
crawler multi-threading multiprocessing proxies python python-spider spider web-crawler web-spider
Last synced: 21 Dec 2024
https://github.com/hu17889/go_spider
[Crawler framework (Golang)] An awesome concurrent crawler (spider) framework for Go. The crawler is flexible and modular. It can easily be extended into a customized crawler, or you can use only the default crawl components.
crawler go pipeline schedule spider
Last synced: 29 Oct 2024
https://github.com/xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
crawler multi-threading multiprocessing proxies python python-spider spider web-crawler web-spider
Last synced: 29 Oct 2024
https://github.com/nekmo/dirhunt
Find web directories without bruteforce
crawler dirscanner pentesting python security security-tools websec without-bruteforce
Last synced: 26 Dec 2024
https://github.com/Nekmo/dirhunt
Find web directories without bruteforce
crawler dirscanner pentesting python security security-tools websec without-bruteforce
Last synced: 31 Oct 2024