Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
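
As a rough illustration of what "systematically browses" means, the sketch below does a breadth-first traversal that visits each page once and follows links in discovery order. It is a minimal sketch, not how any project listed here works: the `crawl` function, the `get_links` callback, and the in-memory `TOY_WEB` link graph are all hypothetical names made up for this example, and no network access is involved.

```python
from collections import deque


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl: visit each page once, in discovery order.

    `get_links` is a hypothetical callback that returns the outgoing
    links of a page; a real crawler would fetch and parse HTML here.
    """
    seen = {start}          # pages already queued, to avoid revisiting
    queue = deque([start])  # frontier of pages still to visit
    order = []              # pages in the order they were visited
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web": page name -> outgoing links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

visited = crawl("a", lambda url: TOY_WEB.get(url, []))
print(visited)
```

The `seen` set is what keeps the traversal from looping on pages that link back to each other, and `max_pages` bounds the crawl so it always terminates.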

https://github.com/scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

crawler crawling framework hacktoberfest python scraping web-scraping web-scraping-python

Last synced: 23 Dec 2024

https://github.com/naibowang/easyspider

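A visual no-code/code-free web crawler/spider. EasySpider (易采集): a visual browser automation testing, data collection, and crawler tool that lets you design and run crawler tasks graphically without writing code. Alias: ServiceWrapper, an intelligent service-wrapping system for web applications.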

batch-processing batch-script code-free crawler data-collection frontend gui html input-parameters layman parameters robotics rpa scraper spider visual visualization visualprogramming web www

Last synced: 23 Dec 2024

https://github.com/NaiboWang/EasySpider

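A visual no-code/code-free web crawler/spider. EasySpider (易采集): a visual browser automation testing, data collection, and crawler tool that lets you design and run crawler tasks graphically without writing code. Alias: ServiceWrapper, an intelligent service-wrapping system for web applications.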

batch-processing batch-script code-free crawler data-collection frontend gui html input-parameters layman parameters robotics rpa scraper spider visual visualization visualprogramming web www

Last synced: 27 Oct 2024

https://github.com/iawia002/lux

👾 Fast and simple video download library and CLI tool written in Go

bilibili crawler download downloader go golang iqiyi qq scraper tumblr video youku youtube

Last synced: 23 Dec 2024

https://github.com/iawia002/annie

👾 Fast and simple video download library and CLI tool written in Go

bilibili crawler download downloader go golang iqiyi qq scraper tumblr video youku youtube

Last synced: 10 Nov 2024

https://github.com/gocolly/colly

Elegant Scraper and Crawler Framework for Golang

crawler crawling framework go golang scraper scraping spider

Last synced: 23 Dec 2024

https://github.com/jhao104/proxy_pool

Python ProxyPool for web spider

crawler http proxy redis spider

Last synced: 23 Dec 2024

https://github.com/mendableai/firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

ai ai-scraping crawler data html-to-markdown llm markdown rag scraper scraping web-crawler webscraping

Last synced: 23 Dec 2024

https://github.com/binux/pyspider

A Powerful Spider(Web Crawler) System in Python.

crawler python

Last synced: 29 Sep 2024

https://github.com/apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation crawler crawling headless headless-chrome javascript nodejs npm playwright puppeteer scraper scraping typescript web-crawler web-crawling web-scraping

Last synced: 23 Dec 2024

https://github.com/shengqiangzhang/examples-of-web-crawlers

Some interesting examples of Python crawlers that are friendly to beginners, mainly crawling sites such as Taobao, Tmall, WeChat, WeChat Reading, Douban, and QQ.

agent-pool crawler example fund multithreading pyquery python selenium spider stock taobao tmall wechat wechat-report wereader

Last synced: 24 Dec 2024

https://github.com/codelucas/newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:

crawler crawling news news-aggregator python scraper

Last synced: 23 Dec 2024

https://github.com/code4craft/webmagic

A scalable web crawler framework for Java.

crawler framework java scraping

Last synced: 20 Dec 2024

https://github.com/crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

crawlab crawler crawling-tasks docker go platform scrapy scrapyd-ui spider spiders-management web-crawler webcrawler webspider

Last synced: 24 Dec 2024

https://github.com/s0md3v/photon

Incredibly fast crawler designed for OSINT.

crawler information-gathering osint python spider

Last synced: 23 Dec 2024

https://github.com/s0md3v/Photon

Incredibly fast crawler designed for OSINT.

crawler information-gathering osint python spider

Last synced: 28 Oct 2024

https://github.com/ssssssss-team/spider-flow

A new-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.

crawler jsoup spider spider-flow web-crawler web-spider webcrawler webspider xpath

Last synced: 25 Dec 2024

https://github.com/injetlee/python

Python scripts: simulated Zhihu login, crawlers, Excel operations, WeChat official accounts, and remote power-on.

crawler excel python wechat

Last synced: 25 Dec 2024

https://github.com/evil0ctal/douyin_tiktok_download_api

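🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili that supports API calls and online batch parsing and downloading.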

api async crawler douyin douyin-api douyin-scraper douyin-tiktok-api douyin-tiktok-download fastapi no-watermark online-parsing python pywebio scraper spider tiktok tiktok-api tiktok-scraper tiktok-signature web-scraping

Last synced: 23 Dec 2024

https://github.com/guyueyingmu/avbook

AV movie management system; crawlers for avmoo, javbus, and javlibrary; online AV film library; AV magnet link database. Japanese Adult Video Library, Adult Video Magnet Links - Japanese Adult Video Database

adult adult-video avmoo crawler database guzzlehttp javbus javlibrary laravel magnet magnet-link scraper spider

Last synced: 24 Dec 2024

https://github.com/Evil0ctal/Douyin_TikTok_Download_API

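🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili that supports API calls and online batch parsing and downloading.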

api async crawler douyin douyin-api douyin-scraper douyin-tiktok-api douyin-tiktok-download fastapi no-watermark online-parsing python pywebio scraper spider tiktok tiktok-api tiktok-scraper tiktok-signature web-scraping

Last synced: 29 Oct 2024

https://github.com/projectdiscovery/katana

A next-generation crawling and spidering framework.

cli crawler gocrawler headless spider-framework web-spider

Last synced: 24 Dec 2024

https://github.com/bda-research/node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)

cheerio crawler extract-data javascript jquery nodejs spider

Last synced: 23 Dec 2024

https://github.com/chyroc/wechatsogou

A crawler interface for WeChat official accounts, based on Sogou WeChat search.

crawler pypi python scrapy sogou wechat

Last synced: 24 Dec 2024

https://github.com/Chyroc/WechatSogou

A crawler interface for WeChat official accounts, based on Sogou WeChat search.

crawler pypi python scrapy sogou wechat

Last synced: 19 Nov 2024

https://github.com/chyroc/WechatSogou

A crawler interface for WeChat official accounts, based on Sogou WeChat search.

crawler pypi python scrapy sogou wechat

Last synced: 31 Oct 2024

https://github.com/rmax/scrapy-redis

Redis-based components for Scrapy.

crawler distributed redis scrapy

Last synced: 23 Dec 2024

https://github.com/SpiderClub/haipproxy

💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis

crawler distributed high-availability ipproxy redis scheduler scrapy spider

Last synced: 29 Oct 2024

https://github.com/spiderclub/haipproxy

💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis

crawler distributed high-availability ipproxy redis scheduler scrapy spider

Last synced: 25 Dec 2024

https://github.com/apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping

Last synced: 23 Dec 2024

https://github.com/dropsdevopsorg/ecommercecrawlers

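Hands-on 🐍 crawlers for a variety of websites and e-commerce data 🕷. Includes 🕸: Taobao products, WeChat official accounts, Dianping, Qichacha, job sites, Xianyu, Ali tasks, Cnblogs, Weibo, Baidu Tieba, Douban Movies, Baotu, Quanjing, Douban Music, a provincial drug administration, Sohu News, machine-learning text collection, FOFA asset collection, Autohome, the National Bureau of Statistics, Baidu keyword index counts, spider pan-directories, Toutiao, Douban film reviews, Ctrip, the Xiaomi App Store, Anjuke, and Tujia homestays ❤️❤️❤️. WeChat crawler showcase project: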

alitask baidu baidu-tieba baotu boss crawler ctrip dazhong-spider douban-movie douban-music fofa lagou python3 quanjing scrapy sohu taobao-spider wechat xianyu zhilianzhaopin

Last synced: 26 Dec 2024

https://github.com/DropsDevopsOrg/ECommerceCrawlers

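Hands-on 🐍 crawlers for a variety of websites and e-commerce data 🕷. Includes 🕸: Taobao products, WeChat official accounts, Dianping, Qichacha, job sites, Xianyu, Ali tasks, Cnblogs, Weibo, Baidu Tieba, Douban Movies, Baotu, Quanjing, Douban Music, a provincial drug administration, Sohu News, machine-learning text collection, FOFA asset collection, Autohome, the National Bureau of Statistics, Baidu keyword index counts, spider pan-directories, Toutiao, Douban film reviews, Ctrip, the Xiaomi App Store, Anjuke, and Tujia homestays ❤️❤️❤️. WeChat crawler showcase project: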

alitask baidu baidu-tieba baotu boss crawler ctrip dazhong-spider douban-movie douban-music fofa lagou python3 quanjing scrapy sohu taobao-spider wechat xianyu zhilianzhaopin

Last synced: 26 Oct 2024

https://github.com/myreader-io/mygptreader

A community-driven way to read and chat with AI bots - powered by chatGPT.

ai chatgpt crawler daily-news embedding gpt-35-turbo hot-news openai prompt reader scraper slack-bot

Last synced: 26 Dec 2024

https://github.com/madawei2699/myGPTReader

A community-driven way to read and chat with AI bots - powered by chatGPT.

ai chatgpt crawler daily-news embedding gpt-35-turbo hot-news openai prompt reader scraper slack-bot

Last synced: 28 Oct 2024

https://github.com/madawei2699/mygptreader

A community-driven way to read and chat with AI bots - powered by chatGPT.

ai chatgpt crawler daily-news embedding gpt-35-turbo hot-news openai prompt reader scraper slack-bot

Last synced: 15 Oct 2024

https://github.com/niespodd/browser-fingerprinting

Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

automation bot bot-detection browser-fingerprinting chromedriver chromium chromium-browser crawler detection fingerprinting puppeteer recaptcha scraper spider stealth web webscraping

Last synced: 24 Dec 2024

https://github.com/dotnetcore/dotnetspider

DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.

crawler cross-platform csharp distributed dotnetcore

Last synced: 24 Dec 2024

https://github.com/imwildcat/scylla

Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era

crawler proxy-pool python python3 scylla

Last synced: 24 Dec 2024

https://github.com/dotnetcore/DotnetSpider

DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.

crawler cross-platform csharp distributed dotnetcore

Last synced: 27 Oct 2024

https://github.com/imWildCat/scylla

Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era

crawler proxy-pool python python3 scylla

Last synced: 29 Oct 2024

https://github.com/constverum/proxybroker

Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭

anonymity anonymous crawler http-proxy privacy proxies proxy proxy-checker proxy-list proxy-server proxypool socks

Last synced: 24 Dec 2024

https://github.com/zu1k/proxypool

Automatically crawls proxy nodes on the public internet, de-duplicates and tests for usability and then provides a list of nodes

crawler proxypool

Last synced: 26 Sep 2024

https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China

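A collection of legal cases in China involving web crawlers. This project compiles news, materials, laws, and regulations related to lawsuits and violations involving crawler developers in mainland China. It aims to help crawler practitioners working in mainland China understand the relevant laws and avoid crossing data-compliance red lines. [AD] Chinese knowledge graph portal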

china crawler law

Last synced: 01 Nov 2024

https://github.com/constverum/ProxyBroker

Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭

anonymity anonymous crawler http-proxy privacy proxies proxy proxy-checker proxy-list proxy-server proxypool socks

Last synced: 24 Oct 2024

https://github.com/dataabc/weibo-crawler

Sina Weibo crawler: uses Python to crawl Sina Weibo data and download Weibo images and videos.

crawler weibo weibo-spider

Last synced: 25 Dec 2024

https://github.com/elliotgao2/toapi

Every web site provides APIs.

api crawler flask html json python spider toapi web

Last synced: 20 Dec 2024

https://github.com/wkunzhi/python3-spider

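Python crawlers in practice: simulated login for major websites, including but not limited to slider CAPTCHA verification, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star ❤️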

crawl crawler dianping geek meituan pyppeteer python scrapy scrapy-crawler selenium spider splash taobao

Last synced: 20 Dec 2024

https://github.com/jumper2014/lianjia-beike-spider

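Housing price crawler for Lianjia and Beike that collects housing price data (residential communities, second-hand homes, rentals, new homes) for 21 major Chinese cities, including Beijing, Shanghai, Guangzhou, and Shenzhen. Stable, reliable, and fast. Supports CSV, MySQL, MongoDB, Excel, and JSON storage; supports Python 2 and 3; displays data in charts; richly commented. Stars are appreciated. For learning and reference only; do not use commercially; use at your own risk.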

beike crawler house lianjia spider

Last synced: 25 Dec 2024

https://github.com/kanasimi/work_crawler

Download comics and novels. A novel and comic downloader for: 腾讯漫画 大角虫漫画 有妖气 咪咕 SF漫画 哦漫画 看漫画 漫画柜 汗汗酷漫 動漫伊甸園 快看漫画 微博动漫 733动漫网 大古漫画网 漫画DB 無限動漫 動漫狂 卡推漫画 动漫之家 动漫屋 古风漫画网 36漫画网 亲亲漫画网 乙女漫画 webtoons 咚漫 ニコニコ静画 ComicWalker ヤングエースUP モアイ pixivコミック サイコミ; アルファポリス カクヨム ハーメルン 小説家になろう 起点中文网 八一中文网 顶点小说 落霞小说网 努努书坊 笔趣阁 → epub.

cejs comic-downloader comics crawler download-comic downloader ebook epub manga manga-downloader narou novel-downloader novels webcomics

Last synced: 20 Dec 2024

https://github.com/wkunzhi/Python3-Spider

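Python crawlers in practice: simulated login for major websites, including but not limited to slider CAPTCHA verification, Pinduoduo, Meituan, Baidu, bilibili, Dianping, and Taobao. If you like it, please star ❤️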

crawl crawler dianping geek meituan pyppeteer python scrapy scrapy-crawler selenium spider splash taobao

Last synced: 19 Nov 2024

https://github.com/jae-jae/querylist

🕷 The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.

crawler querylist scraper spider

Last synced: 23 Dec 2024

https://github.com/geziyor/geziyor

Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.

crawler go scraper scraping spider

Last synced: 23 Dec 2024

https://github.com/nikolait/googlescraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.

crawler python scraping search-engine search-engine-optimization search-engines

Last synced: 20 Dec 2024

https://github.com/NikolaiT/GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.

crawler python scraping search-engine search-engine-optimization search-engines

Last synced: 25 Oct 2024

https://github.com/jae-jae/QueryList

🕷 The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.

crawler querylist scraper spider

Last synced: 25 Oct 2024

https://github.com/Boris-code/feapder

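🚀🚀🚀 feapder is an easy-to-use, powerful Python crawler framework. It has four built-in spiders (AirSpider, Spider, TaskSpider, and BatchSpider) for different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The crawler management system feaplat provides convenient deployment and scheduling for it.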

crawler feapder feaplat python scrapy spider

Last synced: 31 Oct 2024

https://github.com/boris-code/feapder

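🚀🚀🚀 feapder is an easy-to-use, powerful Python crawler framework. It has four built-in spiders (AirSpider, Spider, TaskSpider, and BatchSpider) for different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The crawler management system feaplat provides convenient deployment and scheduling for it.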

crawler feapder feaplat python scrapy spider

Last synced: 25 Dec 2024

https://github.com/jaeles-project/gospider

Gospider - Fast web spider written in Go

bugbounty crawler go gospider spider

Last synced: 26 Dec 2024

https://github.com/spatie/crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

concurrency crawler guzzle php

Last synced: 23 Dec 2024

https://github.com/xtuhcy/gecco

Easy-to-use lightweight web crawler

crawler dynamic fastjson gecco java jsoup

Last synced: 26 Dec 2024

https://github.com/ngc660sec/ngcbot

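A WeChat bot based on a ✨ hook mechanism. Supports 🌱 scheduled security news pushes (FreeBuf, Xianzhi, Anquanke, QiAnXin attack-defense community), 👯 KFC copywriting, ⚡ ICP filing lookup, ⚡ phone number location lookup, ⚡ WHOIS lookup, 🎉 horoscope lookup, ⚡ weather lookup, 🌱 a slacking-off calendar, ⚡ ThreatBook threat intelligence lookup, 🐛 videos of beautiful women, ⚡ pictures of beautiful women, and a 👯 help menu. 📫 Supports a points system, ⚡ automatic group invites, ⚡ ad detection, 🌱 automatic mass messaging, and 👯 AI replies. 😄 Highly customizable and easy for beginners to pick up!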

bot crawler security wei-xin weixin wxbot

Last synced: 26 Dec 2024

https://github.com/facundoolano/google-play-scraper

Node.js scraper to get data from Google Play

api crawler google-play nodejs scraper

Last synced: 24 Dec 2024

https://github.com/sjdirect/abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

abot abot-nuget c-sharp crawler cross-platform csharp csharp-library javascript-renderer netcore netcore2 netcore3 netsta netstandard20 netstandard21 parsing pluggable spider spiders unit-testing web-crawler

Last synced: 25 Dec 2024

https://github.com/ngc660sec/NGCBot

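A WeChat bot based on a ✨ hook mechanism. Supports 🌱 scheduled security news pushes (FreeBuf, Xianzhi, Anquanke, QiAnXin attack-defense community), 👯 KFC copywriting, ⚡ ICP filing lookup, ⚡ phone number location lookup, ⚡ WHOIS lookup, 🎉 horoscope lookup, ⚡ weather lookup, 🌱 a slacking-off calendar, ⚡ ThreatBook threat intelligence lookup, 🐛 videos of beautiful women, ⚡ pictures of beautiful women, and a 👯 help menu. 📫 Supports a points system, ⚡ automatic group invites, ⚡ ad detection, 🌱 automatic mass messaging, and 👯 AI replies. 😄 Highly customizable and easy for beginners to pick up!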

bot crawler security wei-xin weixin wxbot

Last synced: 29 Oct 2024

https://github.com/puerkitobio/gocrawl

Polite, slim and concurrent web crawler.

crawler robots-txt

Last synced: 20 Dec 2024

https://github.com/elliotgao2/gain

Web crawling framework based on asyncio.

aiohttp asyncio crawler python spider uvloop

Last synced: 22 Dec 2024

https://github.com/PuerkitoBio/gocrawl

Polite, slim and concurrent web crawler.

crawler robots-txt

Last synced: 29 Oct 2024

https://github.com/jaybizzle/crawler-detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

bots crawler detect hacktoberfest php spider user-agent

Last synced: 23 Dec 2024

https://github.com/rendora/rendora

Dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern JavaScript websites

angular chrome-devtools chrome-headless crawler dynamic-rendering go golang javascript puppeteer react reactjs seo seo-optimization server-side-rendering spa ssr vue vuejs

Last synced: 21 Dec 2024

https://github.com/JayBizzle/Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

bots crawler detect hacktoberfest php spider user-agent

Last synced: 03 Nov 2024

https://github.com/blankerl/dxy-covid-19-crawler

COVID-19/2019-nCoV Realtime Infection Crawler and API

2019-ncov crawler realtime-api

Last synced: 20 Dec 2024

https://github.com/zorlan/skycaiji

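SkyCaiji (蓝天采集器) is a free, open-source crawler system. Data can be collected simply by pointing, clicking, and editing rules. It runs locally, on virtual hosts, or on cloud servers, can collect almost any type of web page, integrates seamlessly with various CMS site builders, publishes data in real time without login, and is fully automated with no manual intervention. It is a fully cross-platform cloud crawler system for large-scale web data collection.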

crawler crawling php spider webcrawler

Last synced: 20 Dec 2024

https://github.com/anouarbensaad/vulnx

vulnx 🕷️ is an intelligent bot that can perform automatic shell injection and help researchers detect security vulnerabilities in CMS systems. It can perform quick CMS security detection, information collection (including subdomain names, IP address, country information, organizational information, time zone, etc.), and vulnerability scanning.

auto-exploiter bot cloudflare-detection cms-detector crawler detects-vulnerabilities dorks exploits hacking information-gathering pentest security-tools shell-injection subdomains-gathering vulnerability vulnerability-assessment vulnerability-detection vulnerability-exploit website-vulnerability-scanner wp-scanner

Last synced: 20 Dec 2024

https://github.com/xianhu/pspider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

crawler multi-threading multiprocessing proxies python python-spider spider web-crawler web-spider

Last synced: 21 Dec 2024

https://github.com/hu17889/go_spider

[Crawler framework (Golang)] An awesome concurrent crawler (spider) framework for Go. The crawler is flexible and modular. It can easily be extended into a customized crawler, or you can use only the default crawl components.

crawler go pipeline schedule spider

Last synced: 29 Oct 2024

https://github.com/xianhu/PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

crawler multi-threading multiprocessing proxies python python-spider spider web-crawler web-spider

Last synced: 29 Oct 2024