Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2024-12-25 00:05:56 UTC
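The "systematically browses" part of the definition above can be sketched as a breadth-first traversal over links. This is only a minimal illustration, not how any project listed below works: the `crawl` function, the `LinkParser` class, and the in-memory `PAGES` dictionary are all hypothetical names introduced for this example, and the "web" here is a small dictionary so the sketch runs without network access. A real crawler would also need to respect robots.txt, rate limits, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch(url) is an assumed callable returning HTML text, or None
    when the page cannot be retrieved. Returns URLs in visit order.
    """
    seen = {start_url}          # URLs already queued, to avoid revisits
    queue = deque([start_url])  # frontier of URLs still to fetch
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be fetched
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/b#top">B</a> <a href="/">home</a>',
    "https://example.test/b": '<a href="/missing">gone</a>',
}

order = crawl("https://example.test/", PAGES.get)
print(order)
```

Passing `fetch` as a parameter keeps the traversal logic separate from the network layer, so the same loop could be driven by an HTTP client in place of the dictionary lookup.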
https://github.com/howie6879/ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
aiohttp asyncio asyncio-spider crawler crawling-framework middlewares python python-ruia ruia spider uvloop
Last synced: 19 Dec 2024
https://github.com/lixi5338619/lxspider
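A collection of crawler examples, including but not limited to: Taobao, JD.com, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQIYI, Ctrip, 12306, 58, Sohu, various indexes, VIP (Weipu) and Wanfang, Zlibraty, Oalib, novels, tender sites, procurement sites, Xiaohongshu, Dianping, Twitter, Maimai, Zhihu.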
12306 andrioid crawler douban douyin douyinsignature kuaishou meituan pdd signature taobao toutiao twitter wechat weibo weixin xiaohongshu xiecheng youku
Last synced: 21 Dec 2024
https://github.com/YoongiKim/AutoCrawler
Google, Naver multiprocess image web crawler (Selenium)
bigdata chromedriver crawler customizable deep-learning google image-crawler multiprocess python selenium thread
Last synced: 09 Nov 2024
https://github.com/yoongikim/autocrawler
Google, Naver multiprocess image web crawler (Selenium)
bigdata chromedriver crawler customizable deep-learning google image-crawler multiprocess python selenium thread
Last synced: 20 Dec 2024
https://github.com/extractus/article-extractor
Extracts the main article from a given URL with Node.js
article article-extractor article-parser crawler extract nodejs readability scraper
Last synced: 24 Dec 2024
https://github.com/edoardottt/cariddi
Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
bugbounty crawler crawling endpoint-discovery endpoints go golang hacktoberfest infosec osint penetration-testing pentesting recon reconnaissance redteam scraper secret-keys secrets-detection security security-tools
Last synced: 26 Dec 2024
https://github.com/coder-hxl/x-crawl
Flexible Node.js AI-assisted crawler library
ai ai-crawl chromium crawl crawler fingerprint flexible javascript multifunction nodejs puppeteer spider typescript
Last synced: 25 Dec 2024
https://github.com/github/lightcrawler
Crawl a website and run it through Google lighthouse
chrome crawler google-lighthouse
Last synced: 26 Sep 2024
https://github.com/diskoverdata/diskover-community
Diskover Community Edition - Open source file indexer, file search engine and data management and analytics powered by Elasticsearch
crawler disk-space disk-space-analyzer disk-usage duplicate-files duplicatefilefinder elasticsearch file-indexing file-manager file-tagging filesystem filesystem-analysis filesystem-indexer filesystem-visualization metadata php python storage storage-analytics web-application
Last synced: 19 Dec 2024
https://github.com/teamnewpipe/newpipeextractor
NewPipe's core library for extracting data from streaming sites
bandcamp crawler extractor mediaccc newpipe peertube scraper soundcloud youtube
Last synced: 26 Dec 2024
https://github.com/imthaghost/goclone
Website Cloner - Utilizes powerful Go routines to clone websites to your computer within seconds.
cloning crawler go golang website-cloner website-scraper
Last synced: 21 Dec 2024
https://github.com/TeamNewPipe/NewPipeExtractor
NewPipe's core library for extracting data from streaming sites
bandcamp crawler extractor mediaccc newpipe peertube scraper soundcloud youtube
Last synced: 26 Nov 2024
https://github.com/aandyprogram/scrawler
🏳️‍🌈 Media downloader from any sites, including Twitter, Reddit, Instagram, Threads, Facebook, OnlyFans, YouTube, Pinterest, PornHub, XHamster, XVIDEOS, ThisVid etc.
crawler download downloader gay image instagram lgbt manager media onlyfans photo pictures pornhub reddit thisvid twitter video xhamster xvideo youtube
Last synced: 19 Dec 2024
https://github.com/u3c3/bt-btt
Introduction to the magnet-link site U3C3, with domain updates
adult avmoo bittorrent bt btsow crawler download jav javbus javlibrary magnet magnet-link nyaa porn rarbg spider sukebei tracker u3c3
Last synced: 03 Dec 2024
https://github.com/LeonardoCardoso/SwiftLinkPreview
It makes a preview from a URL, grabbing all the information such as title, relevant texts and images.
carthage cocoapods crawler flow ios macos preview regular-expressions relevant-texts swift swift-package-manager tvos url watchos website
Last synced: 09 Dec 2024
https://github.com/leonardocardoso/swiftlinkpreview
It makes a preview from a URL, grabbing all the information such as title, relevant texts and images.
carthage cocoapods crawler flow ios macos preview regular-expressions relevant-texts swift swift-package-manager tvos url watchos website
Last synced: 14 Dec 2024
https://github.com/dadoonet/fscrawler
Elasticsearch File System Crawler (FS Crawler)
crawler elasticsearch java tika
Last synced: 25 Dec 2024
https://github.com/openwpm/OpenWPM
A web privacy measurement framework
crawler firefox privacy python3
Last synced: 31 Oct 2024
https://github.com/openwpm/openwpm
A web privacy measurement framework
crawler firefox privacy python3
Last synced: 26 Dec 2024
https://github.com/lorey/mlscraper
🤖 Scrape data from HTML websites automatically by just providing examples
crawler crawler-python crawling extraction-engine html machine-learning scraper scraping
Last synced: 20 Dec 2024
https://github.com/felipecsl/wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Last synced: 19 Dec 2024
https://github.com/Adyzng/jd-autobuy
Python crawler for JD.com: automatic login and online flash purchasing of products
crawler jingdong python scraper
Last synced: 19 Nov 2024
https://github.com/AAndyProgram/SCrawler
🏳️‍🌈 Media downloader from any sites, including Twitter, Reddit, Instagram, Threads, Facebook, OnlyFans, YouTube, Pinterest, PornHub, XHamster, XVIDEOS, ThisVid etc.
crawler download downloader gay image instagram lgbt manager media onlyfans photo pictures pornhub reddit thisvid twitter video xhamster xvideo youtube
Last synced: 06 Nov 2024
https://github.com/srx-2000/spider_collection
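Python crawlers. Currently included: NetEase Cloud Music song crawler, Bilibili video crawler, Zhihu Q&A crawler, wallpaper crawler, xvideos video crawler, audiobook crawler, Weibo crawler, Anjuke listing crawler with data visualization, Bilibili video cover extractor, IP proxy pool wrapper, Zhihu million-user crawler with data analysis, GitHub user crawler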
Last synced: 21 Dec 2024
https://github.com/kiddyuchina/beanbun
Beanbun is a multi-process web crawler framework written in PHP, with good openness and high extensibility, based on Workerman.
Last synced: 20 Dec 2024
https://github.com/kiddyuchina/Beanbun
Beanbun is a multi-process web crawler framework written in PHP, with good openness and high extensibility, based on Workerman.
Last synced: 01 Nov 2024
https://github.com/kkoooqq/fakebrowser
🤖 Fake fingerprints to bypass anti-bot systems. Simulate mouse and keyboard operations to make behavior like a real person.
anti-bot-detection anti-fingerprinting automation bot browser-fingerprint cheat crawler fake headless puppeteer puppeteer-extra puppeteer-extra-plugin scrapy spoof stealth
Last synced: 21 Dec 2024
https://github.com/instapy/instagram-profilecrawl
📝 Quickly crawl the information (e.g. followers, tags, etc.) of an Instagram profile.
automation crawler information instagram instapy python python-script selenium simple
Last synced: 21 Dec 2024
https://github.com/seveniruby/AppCrawler
An Appium-based tool for automatically traversing apps
appium appium-app crawler diff scala xpath
Last synced: 08 Nov 2024
https://github.com/InstaPy/instagram-profilecrawl
📝 Quickly crawl the information (e.g. followers, tags, etc.) of an Instagram profile.
automation crawler information instagram instapy python python-script selenium simple
Last synced: 02 Nov 2024
https://github.com/dwisiswant0/go-dork
The fastest dork scanner written in Go.
bing-dorks bugbounty bugbounty-tool crawler dork-scanner dorking golang google-dorking google-dorks infosec security shodan-dorks vulnerability-scanners
Last synced: 22 Dec 2024
https://github.com/the-robot/sqliv
massive SQL injection vulnerability scanner
crawler multiprocessing reverse-ip-scan scanner scanning sql-injection sqli sqli-vulnerability-scanner
Last synced: 28 Oct 2024
https://github.com/0xinfection/xsrfprobe
The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.
audit crafted-tokens crawler csrf csrf-attacks csrf-poc csrf-scanner csrf-tokens spider token-generation xsrf
Last synced: 24 Dec 2024
https://github.com/0xInfection/XSRFProbe
The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.
audit crafted-tokens crawler csrf csrf-attacks csrf-poc csrf-scanner csrf-tokens spider token-generation xsrf
Last synced: 28 Oct 2024
https://github.com/yutto-dev/bilili
:beers: bilibili video (including bangumi) and danmaku downloader
bilibili crawler danmaku download downloader multithread python3 requests spider subtitle video
Last synced: 14 Oct 2024
https://github.com/vifreefly/kimuraframework
Kimurai is a modern web scraping framework written in Ruby which works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows you to scrape and interact with JavaScript-rendered websites
crawler headless-chrome kimurai scraper scrapy
Last synced: 25 Dec 2024
https://github.com/elixir-crawly/crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider
Last synced: 20 Dec 2024
https://github.com/oltarasenko/crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider
Last synced: 27 Nov 2024
https://github.com/xisuo67/xhs-spider
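Xiaohongshu data collection and batch download tool for site images and video resources; a data collection tool with a polished interface (batch download, video extraction, images, watermark removal, etc.). Telegram: https://t.me/+ZtLSwuIKTo44MDY1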
crawler csharp downloader wpf wpf-notifyicon wpf-ui
Last synced: 20 Dec 2024
https://github.com/pea3nut/pxer
A tool for pixiv.net. A Pixiv crawler that anyone can use
add-on batch crawler pixiv tampermonkey userscript
Last synced: 23 Dec 2024
https://github.com/pea3nut/Pxer
A tool for pixiv.net. A Pixiv crawler that anyone can use
add-on batch crawler pixiv tampermonkey userscript
Last synced: 03 Nov 2024
https://github.com/codelibs/fess
Fess is a very powerful and easily deployable Enterprise Search Server.
crawler elasticsearch enterprise-search full-text-search fulltext-search java lucene search search-engine
Last synced: 21 Dec 2024
https://github.com/fredwu/crawler
A high performance web crawler / scraper in Elixir.
crawler elixir files offline scraper scraper-engine spider
Last synced: 26 Oct 2024
https://github.com/johanneszab/TumblThree
A Tumblr Blog Backup Application
backup crawler csharp downloader hidden-tumblr internationalization mef mvvm password-protected-tumblr safe-mode-tumblr tumblr tumblr-blog tumblr-likes tumblr-search tumblr-tags windows wpf
Last synced: 13 Nov 2024
https://github.com/johanneszab/tumblthree
A Tumblr Blog Backup Application
backup crawler csharp downloader hidden-tumblr internationalization mef mvvm password-protected-tumblr safe-mode-tumblr tumblr tumblr-blog tumblr-likes tumblr-search tumblr-tags windows wpf
Last synced: 27 Sep 2024
https://github.com/wycm/zhihu-crawler
zhihu-crawler is a high-performance, distributed crawler project based on Java that supports a free HTTP proxy pool and horizontal scaling
Last synced: 12 Nov 2024
https://github.com/skywalkerdarren/chatweb
ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points.
ai chatgpt crawler docx embedding faiss gpt gpt-35-turbo news-extractor newspaper openai pdf pgvector postgresql vector-database
Last synced: 09 Nov 2024
https://github.com/SkywalkerDarren/chatWeb
ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points.
ai chatgpt crawler docx embedding faiss gpt gpt-35-turbo news-extractor newspaper openai pdf pgvector postgresql vector-database
Last synced: 01 Nov 2024
https://github.com/apache/incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
apache-storm crawler distributed java stormcrawler web-crawler
Last synced: 25 Oct 2024
https://github.com/hellock/icrawler
A multi-thread crawler framework with many builtin image crawlers provided.
bing-image crawler flickr-api google-images python scrapy spider
Last synced: 25 Dec 2024
https://github.com/spider-rs/spider
The fastest web crawler written in Rust. Maintained by @a11ywatch.
ai-scraping crawler headless-chrome indexer llm-crawler rust scraping spider web-crawler
Last synced: 22 Dec 2024
https://github.com/scrapinghub/scrapyrt
HTTP API for Scrapy spiders
crawler crawling hacktoberfest hacktoberfest2021 python scraper scrapy twisted webcrawler webcrawling
Last synced: 22 Dec 2024
https://github.com/Zeal-L/BiliBili-Manga-Downloader
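An easy-to-use Bilibili Manga downloader with a graphical interface: keyword search for manga, QR-code login, a workaround for downloading chapters that have not been unlocked, multithreaded downloads, multiple save formats, local manga management, and one-click update checks!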
bilibili bilibili-download comic-downloader comics crawler downloader gui manga manga-downloader pyside6 python3
Last synced: 27 Oct 2024
https://github.com/iawia002/Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
crawler crawling downloader python python3 scraper scraping video
Last synced: 29 Nov 2024
https://github.com/soskek/bookcorpus
Crawl BookCorpus
bookcorpus corpus crawler nlp scraper
Last synced: 25 Dec 2024
https://github.com/postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
crawler ruby scraper spider spider-links web web-crawler web-scraper web-scraping web-spider
Last synced: 19 Dec 2024
https://github.com/DataHenHQ/till
DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.
crawler man-in-the-middle mitm proxy-server scraper scraping web-scraping
Last synced: 26 Oct 2024
https://github.com/skrapeit/skrape.it
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
crawler dom hacktoberfest html-parser integration-testing jsoup kotlin kotlin-dsl parse scraper skrape system-testing test-automation testing
Last synced: 25 Dec 2024
https://github.com/morvanzhou/easy-scraping-tutorial
Simple but useful Python web scraping tutorial code.
asyncio beautifulsoup crawler crawling distributed-scraper regex requests scraping scrapy urllib
Last synced: 22 Dec 2024
https://github.com/jomingyu/google-play-scraper
Google play scraper for Python inspired by <facundoolano/google-play-scraper>
crawler google-play hacktoberfest hacktoberfest2023 python scraper
Last synced: 22 Dec 2024
https://github.com/PuerkitoBio/fetchbot
A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
Last synced: 29 Oct 2024
https://github.com/wspl/creeper
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)
crawler cross-platform framework golang language script spider
Last synced: 29 Oct 2024
https://github.com/kimmeen/weibo-analyst
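Toolbox for analyzing Chinese-language social media (Weibo) comments. Features: 1. Weibo comment crawling; 2. word segmentation and keyword extraction; 3. word clouds and word-frequency statistics; 4. sentiment analysis; 5. topic clustering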
crawler lda sentiment-analysis weibo word-clouds
Last synced: 21 Dec 2024
https://github.com/Foair/course-crawler
🎓 MOOC course downloader for China University MOOC, XuetangX, NetEase Cloud Classroom, CNMOOC, and iCourse.
cnmooc course crawler icourse163 mooc netease python3 requests study tsinghua university-course xuetangx
Last synced: 26 Oct 2024
https://github.com/fanyong920/jvppeteer
Headless Chrome for Java (Java crawler)
chrome chrome-headless crawler java jvppeteer puppeteer scraper
Last synced: 19 Dec 2024
https://github.com/fffonion/xehentai
Doujinshi downloader
crawler json-rpc python xehentai
Last synced: 24 Dec 2024
https://github.com/xuxueli/xxl-crawler
A distributed web crawler framework (XXL-CRAWLER).
crawler distributed flexible java object-oriented spider web xxl-crawler
Last synced: 22 Dec 2024
https://github.com/polyrabbit/hacker-news-digest
:newspaper: Let ChatGPT Summarize Hacker News for You
chatgpt chatgpt-api crawler data-extraction extract-summaries hacker-news hacker-news-digest hacker-news-reader machine-learning news-aggregator openai openai-api python rss spider
Last synced: 21 Dec 2024
https://github.com/gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
archiving cli crawler deno dockerfile nodejs scraping-websites single-file web-archiving web-crawler web-scraper web-scraping
Last synced: 20 Dec 2024
https://github.com/stangirard/seo-audits-toolkit
SEO & Security Audit for Websites. Lighthouse & Security Headers crawler, Sitemap/Keywords/Images Extractor, Summarizer, etc ...
analysis audits crawler dashboard extractor headers internal-links lighthouse link-extractor python securityheader seo seo-tools serp summarizer
Last synced: 21 Dec 2024
https://github.com/python3webspider/douyin
API of DouYin for Humans, used to crawl popular videos and music
Last synced: 24 Dec 2024
https://github.com/Kharacternyk/dotcommon
What do people have in their dotfiles?
Last synced: 31 Oct 2024
https://github.com/fengzhizi715/netdiscovery
NetDiscovery is a general-purpose crawler framework/middleware built on Vert.x, RxJava 2, and other frameworks.
coroutines crawler disruptor dsl htmlunit kafka kotlin lettuce middleware redis rxjava2 selenium spider vertx3
Last synced: 21 Dec 2024
https://github.com/fengzhizi715/NetDiscovery
NetDiscovery is a general-purpose crawler framework/middleware built on Vert.x, RxJava 2, and other frameworks.
coroutines crawler disruptor dsl htmlunit kafka kotlin lettuce middleware redis rxjava2 selenium spider vertx3
Last synced: 12 Nov 2024
https://github.com/lixi5338619/lxbook
Code repository for the book 《爬虫逆向进阶实战》 (Advanced Crawler Reverse Engineering in Practice)
android-resever crawler frida java javascript python smali spiders unidbg xposed
Last synced: 20 Dec 2024
https://github.com/StanGirard/seo-audits-toolkit
SEO & Security Audit for Websites. Lighthouse & Security Headers crawler, Sitemap/Keywords/Images Extractor, Summarizer, etc ...
analysis audits crawler dashboard extractor headers internal-links lighthouse link-extractor python securityheader seo seo-tools serp summarizer
Last synced: 29 Oct 2024
https://github.com/rndinfosecguy/Scavenger
Crawler (Bot) searching for credential leaks on paste sites.
bot crawler credentials leaks osint paste pastebin python
Last synced: 27 Oct 2024
https://github.com/linkedtales/scrapedin
LinkedIn Scraper (currently working 2020)
crawler linkedin linkedin-scraper scraper
Last synced: 19 Nov 2024
https://github.com/3nock/spidersuite
Advanced web security spider/crawler
bugbounty cplusplus crawler gui information-gathering osint-tool pentest qt5 recon security-tools spider web-spider webcrawler
Last synced: 12 Oct 2024
https://github.com/speed/newcrawler
Free Web Scraping Tool with Java
crawler docker scraping spider
Last synced: 03 Nov 2024
https://github.com/jsrei/js-cookie-monitor-debugger-hook
A JS cookie reverse-engineering tool: visual monitoring of JS cookie changes, plus JS cookie hooks for setting conditional breakpoints
crawler js-reverse red-team reverse-engineering userscript web-security-research
Last synced: 21 Dec 2024
https://github.com/josephlimtech/linkedin-profile-scraper-api
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.
crawler crawling expressjs json linkedin linkedin-bot linkedin-crawler linkedin-profile linkedin-profile-scraper linkedin-scraper linkedin-scraping nodejs profile-data puppeteer scraper scrapers scraping scraping-websites spider website-scraper
Last synced: 25 Dec 2024
https://github.com/setvisible/ArrowDL
ArrowDL (Arrow Downloader) is a download manager for Windows, MacOS and Linux
batch-download crawler download download-manager libtorrent magnet-link mass-downloader mozilla-firefox nativeclient picture-download qt stream-downloader streaming torrent-client torrent-downloader video-downloader web-engine webextensions youtube-dl youtube-downloader
Last synced: 26 Oct 2024
https://github.com/TumblThreeApp/TumblThree
A Tumblr and Twitter Blog Backup Application
backup blog-backup c-sharp crawler csharp dotnet downloader mvvm tumblr tumblr-backup tumblr-backup-application tumblr-blog tumblr-like tumblr-search twitter twitter-backup twitter-backup-application twitter-blog windows wpf
Last synced: 28 Oct 2024
https://github.com/webrecorder/browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
crawler crawling wacz warc web-archiving web-crawler webrecorder
Last synced: 25 Dec 2024
https://github.com/rajatomar788/pywebcopy
Locally saves webpages to your hard disk with images, css, js & links as is.
archive-tool crawler html html-parser mirror python web webpage
Last synced: 20 Nov 2024