Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

https://github.com/howie6879/ruia

Async Python 3.6+ web scraping micro-framework based on asyncio

aiohttp asyncio asyncio-spider crawler crawling-framework middlewares python python-ruia ruia spider uvloop

Last synced: 19 Dec 2024

https://github.com/lixi5338619/lxspider

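A collection of crawler examples, including but not limited to: Taobao, JD.com, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQIYI, Ctrip, 12306, 58, Sohu, various indexes, CQVIP and Wanfang, Zlibrary, Oalib, novel sites, bidding sites, procurement sites, Xiaohongshu, Dianping, Twitter, Maimai, and Zhihu.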

12306 andrioid crawler douban douyin douyinsignature kuaishou meituan pdd signature taobao toutiao twitter wechat weibo weixin xiaohongshu xiecheng youku

Last synced: 21 Dec 2024

https://github.com/jsrei/ast-hook-for-js-re

Browser memory roaming solution (exploratory, work in progress...)

crawler js-reverse

Last synced: 21 Dec 2024

https://github.com/extractus/article-extractor

Extract the main article from a given URL with Node.js

article article-extractor article-parser crawler extract nodejs readability scraper

Last synced: 24 Dec 2024

https://github.com/edoardottt/cariddi

Takes a list of domains, crawls URLs, and scans for endpoints, secrets, API keys, file extensions, tokens, and more

bugbounty crawler crawling endpoint-discovery endpoints go golang hacktoberfest infosec osint penetration-testing pentesting recon reconnaissance redteam scraper secret-keys secrets-detection security security-tools

Last synced: 19 Dec 2024

https://github.com/github/lightcrawler

Crawl a website and run it through Google Lighthouse

chrome crawler google-lighthouse

Last synced: 26 Sep 2024

https://github.com/archiveteam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving crawl crawler spider warc

Last synced: 19 Dec 2024

https://github.com/teamnewpipe/newpipeextractor

NewPipe's core library for extracting data from streaming sites

bandcamp crawler extractor mediaccc newpipe peertube scraper soundcloud youtube

Last synced: 19 Dec 2024

https://github.com/imthaghost/goclone

Website Cloner - Utilizes powerful Go routines to clone websites to your computer within seconds.

cloning crawler go golang website-cloner website-scraper

Last synced: 21 Dec 2024

https://github.com/aandyprogram/scrawler

🏳️‍🌈 Media downloader for any site, including Twitter, Reddit, Instagram, Threads, Facebook, OnlyFans, YouTube, Pinterest, PornHub, XHamster, XVIDEOS, ThisVid, etc.

crawler download downloader gay image instagram lgbt manager media onlyfans photo pictures pornhub reddit thisvid twitter video xhamster xvideo youtube

Last synced: 19 Dec 2024

https://github.com/LeonardoCardoso/SwiftLinkPreview

It makes a preview from a URL, grabbing all the information such as title, relevant texts, and images.

carthage cocoapods crawler flow ios macos preview regular-expressions relevant-texts swift swift-package-manager tvos url watchos website

Last synced: 09 Dec 2024

https://github.com/dadoonet/fscrawler

Elasticsearch File System Crawler (FS Crawler)

crawler elasticsearch java tika

Last synced: 25 Dec 2024

https://github.com/openwpm/openwpm

A web privacy measurement framework

crawler firefox privacy python3

Last synced: 19 Dec 2024

https://github.com/lorey/mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples

crawler crawler-python crawling extraction-engine html machine-learning scraper scraping

Last synced: 20 Dec 2024

https://github.com/felipecsl/wombat

Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

crawler dsl ruby scraper

Last synced: 19 Dec 2024

https://github.com/Adyzng/jd-autobuy

Python crawler for JD.com: automatic login and online flash-purchasing of products

crawler jingdong python scraper

Last synced: 19 Nov 2024

https://github.com/srx-2000/spider_collection

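Python crawlers. Current collection: NetEase Cloud Music song crawler, Bilibili video crawler, Zhihu Q&A crawler, wallpaper crawler, xvideos video crawler, audiobook crawler, Weibo crawler, Anjuke listing crawler with data visualization, Bilibili video cover extractor, IP proxy pool wrapper, Zhihu million-user crawler with data analysis, GitHub user crawler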

crawler python spider

Last synced: 21 Dec 2024

https://github.com/kiddyuchina/beanbun

Beanbun is a multi-process web crawler framework written in PHP, offering good openness and high extensibility, and built on Workerman.

beanbun crawler php spider

Last synced: 20 Dec 2024

https://github.com/kkoooqq/fakebrowser

🤖 Fake fingerprints to bypass anti-bot systems. Simulate mouse and keyboard operations to make behavior like a real person.

anti-bot-detection anti-fingerprinting automation bot browser-fingerprint cheat crawler fake headless puppeteer puppeteer-extra puppeteer-extra-plugin scrapy spoof stealth

Last synced: 21 Dec 2024

https://github.com/instapy/instagram-profilecrawl

📝 Quickly crawl the information (e.g. followers, tags, etc.) of an Instagram profile.

automation crawler information instagram instapy python python-script selenium simple

Last synced: 21 Dec 2024

https://github.com/darbra/sperm

A roundup of excellent reverse-engineering articles I have read, well worth a look

crawl crawler frida spider unidbg

Last synced: 05 Dec 2024

https://github.com/seveniruby/AppCrawler

An Appium-based tool for automatically traversing apps

appium appium-app crawler diff scala xpath

Last synced: 08 Nov 2024

https://github.com/dixudx/tumblr-crawler

Easily download all the photos/videos from specified Tumblr blogs.

crawler photos python tumblr videos

Last synced: 19 Dec 2024

https://github.com/0xinfection/xsrfprobe

The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.

audit crafted-tokens crawler csrf csrf-attacks csrf-poc csrf-scanner csrf-tokens spider token-generation xsrf

Last synced: 24 Dec 2024

https://github.com/chenjiandongx/mzitu

👧 Crawler for model photo-set galleries (part 2)

crawler meizi

Last synced: 24 Dec 2024

https://github.com/yutto-dev/bilili

🍻 Bilibili video (including bangumi) and danmaku downloader

bilibili crawler danmaku download downloader multithread python3 requests spider subtitle video

Last synced: 14 Oct 2024

https://github.com/vifreefly/kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets you scrape and interact with JavaScript-rendered websites

crawler headless-chrome kimurai scraper scrapy

Last synced: 18 Dec 2024

https://github.com/elixir-crawly/crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider

Last synced: 20 Dec 2024

https://github.com/oltarasenko/crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider

Last synced: 27 Nov 2024

https://github.com/xisuo67/xhs-spider

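Xiaohongshu data collection and batch download tool for site images and video resources; a polished-looking data collection tool (batch download, video extraction, images, watermark removal, etc.). Telegram: https://t.me/+ZtLSwuIKTo44MDY1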

crawler csharp downloader wpf wpf-notifyicon wpf-ui

Last synced: 20 Dec 2024

https://github.com/pea3nut/pxer

A tool for pixiv.net. A Pixiv crawler that anyone can use.

add-on batch crawler pixiv tampermonkey userscript

Last synced: 23 Dec 2024

https://github.com/codelibs/fess

Fess is a very powerful and easily deployable enterprise search server.

crawler elasticsearch enterprise-search full-text-search fulltext-search java lucene search search-engine

Last synced: 21 Dec 2024

https://github.com/fredwu/crawler

A high performance web crawler / scraper in Elixir.

crawler elixir files offline scraper scraper-engine spider

Last synced: 26 Oct 2024

https://github.com/wycm/zhihu-crawler

zhihu-crawler is a high-performance, distributed crawler project based on Java, with support for a free HTTP proxy pool and horizontal scaling

crawler java spider zhihu

Last synced: 12 Nov 2024

https://github.com/skywalkerdarren/chatweb

ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points.

ai chatgpt crawler docx embedding faiss gpt gpt-35-turbo news-extractor newspaper openai pdf pgvector postgresql vector-database

Last synced: 09 Nov 2024

https://github.com/Le0nsec/SecCrawler

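A crawler and push-notification program that helps security researchers get daily security digests. It currently crawls the Xianzhi community, Anquanke, Seebug Paper, Tiaotiaotang, the Qi An Xin attack-and-defense community, the Lengjiao community, and lab blogs from NSFOCUS, Tencent Xuanwu, Topsec, 360, and others. Continuously updated.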

anquanke bot crawler security seebug xianzhi

Last synced: 05 Nov 2024

https://github.com/apache/incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

apache-storm crawler distributed java stormcrawler web-crawler

Last synced: 25 Oct 2024

https://github.com/hellock/icrawler

A multi-threaded crawler framework with many built-in image crawlers.

bing-image crawler flickr-api google-images python scrapy spider

Last synced: 18 Dec 2024

https://github.com/spider-rs/spider

The fastest web crawler written in Rust. Maintained by @a11ywatch.

ai-scraping crawler headless-chrome indexer llm-crawler rust scraping spider web-crawler

Last synced: 22 Dec 2024

https://github.com/Zeal-L/BiliBili-Manga-Downloader

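A handy Bilibili Manga downloader with a graphical interface: keyword search for manga, QR-code login, a workaround for downloading locked chapters, multithreaded downloads, multiple save formats, local manga management, and one-click update checks.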

bilibili bilibili-download comic-downloader comics crawler downloader gui manga manga-downloader pyside6 python3

Last synced: 27 Oct 2024

https://github.com/iawia002/Lulu

[Unmaintained] A simple and clean video/music/image downloader 👾

crawler crawling downloader python python3 scraper scraping video

Last synced: 29 Nov 2024

https://github.com/soskek/bookcorpus

Crawl BookCorpus

bookcorpus corpus crawler nlp scraper

Last synced: 25 Dec 2024

https://github.com/postmodern/spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

crawler ruby scraper spider spider-links web web-crawler web-scraper web-scraping web-spider

Last synced: 19 Dec 2024

https://github.com/DataHenHQ/till

DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.

crawler man-in-the-middle mitm proxy-server scraper scraping web-scraping

Last synced: 26 Oct 2024

https://github.com/kong36088/BaiduImageSpider

A super lightweight Baidu Images crawler

baidu crawler python3 spider

Last synced: 29 Oct 2024

https://github.com/jomingyu/google-play-scraper

Google Play scraper for Python, inspired by <facundoolano/google-play-scraper>

crawler google-play hacktoberfest hacktoberfest2023 python scraper

Last synced: 22 Dec 2024

https://github.com/PuerkitoBio/fetchbot

A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

crawler robots-txt

Last synced: 29 Oct 2024

https://github.com/wspl/creeper

🐾 Creeper - The Next Generation Crawler Framework (Go)

crawler cross-platform framework golang language script spider

Last synced: 29 Oct 2024

https://github.com/kimmeen/weibo-analyst

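Toolbox for analyzing Chinese social media (Weibo) comments. Features: 1. Weibo comment crawling; 2. word segmentation and keyword extraction; 3. word clouds and word-frequency statistics; 4. sentiment analysis; 5. topic clustering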

crawler lda sentiment-analysis weibo word-clouds

Last synced: 21 Dec 2024

https://github.com/Foair/course-crawler

🎓 Course downloader for the MOOC platforms icourse163 (Chinese University MOOC), XuetangX, NetEase Cloud Classroom, CNMOOC, and iCourse.

cnmooc course crawler icourse163 mooc netease python3 requests study tsinghua university-course xuetangx

Last synced: 26 Oct 2024

https://github.com/fanyong920/jvppeteer

Headless Chrome for Java (Java crawler)

chrome chrome-headless crawler java jvppeteer puppeteer scraper

Last synced: 19 Dec 2024

https://github.com/fffonion/xehentai

Doujinshi downloader

crawler json-rpc python xehentai

Last synced: 24 Dec 2024

https://github.com/xuxueli/xxl-crawler

A distributed web crawler framework (XXL-CRAWLER).

crawler distributed flexible java object-oriented spider web xxl-crawler

Last synced: 22 Dec 2024

https://github.com/ma6254/FictionDown

Novel download | novel crawling | Qidian | Biquge | export to Markdown | export to TXT | convert to EPUB | ad filtering | automatic proofreading

biquge crawler fiction golang novels qidian spider

Last synced: 13 Nov 2024

https://github.com/gildas-lormeau/single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

archiving cli crawler deno dockerfile nodejs scraping-websites single-file web-archiving web-crawler web-scraper web-scraping

Last synced: 20 Dec 2024

https://github.com/stangirard/seo-audits-toolkit

SEO & Security Audit for Websites. Lighthouse & Security Headers crawler, Sitemap/Keywords/Images Extractor, Summarizer, etc.

analysis audits crawler dashboard extractor headers internal-links lighthouse link-extractor python securityheader seo seo-tools serp summarizer

Last synced: 21 Dec 2024

https://github.com/python3webspider/douyin

API of DouYin for Humans, used to crawl popular videos and music

crawler douyin spider videos

Last synced: 24 Dec 2024

https://github.com/fengzhizi715/NetDiscovery

NetDiscovery is a general-purpose crawler framework/middleware built on Vert.x, RxJava 2, and other frameworks.

coroutines crawler disruptor dsl htmlunit kafka kotlin lettuce middleware redis rxjava2 selenium spider vertx3

Last synced: 12 Nov 2024

https://github.com/Kharacternyk/dotcommon

What do people have in their dotfiles?

crawler dotfiles

Last synced: 31 Oct 2024

https://github.com/lixi5338619/lxbook

Code repository for the book 《爬虫逆向进阶实战》 (Advanced Crawler Reverse Engineering in Practice)

android-resever crawler frida java javascript python smali spiders unidbg xposed

Last synced: 20 Dec 2024

https://github.com/rndinfosecguy/Scavenger

Crawler (Bot) searching for credential leaks on paste sites.

bot crawler credentials leaks osint paste pastebin python

Last synced: 27 Oct 2024

https://github.com/linkedtales/scrapedin

LinkedIn Scraper (currently working 2020)

crawler linkedin linkedin-scraper scraper

Last synced: 19 Nov 2024

https://github.com/speed/newcrawler

Free Web Scraping Tool with Java

crawler docker scraping spider

Last synced: 03 Nov 2024

https://github.com/jsrei/js-cookie-monitor-debugger-hook

A handy tool for reverse-engineering JS cookies: a visual monitor for JS cookie changes, plus JS cookie hooks for setting conditional breakpoints

crawler js-reverse red-team reverse-engineering userscript web-security-research

Last synced: 21 Dec 2024

https://github.com/webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler crawling wacz warc web-archiving web-crawler webrecorder

Last synced: 25 Dec 2024

https://github.com/rajatomar788/pywebcopy

Saves webpages locally to your hard disk with images, CSS, JS, and links as is.

archive-tool crawler html html-parser mirror python web webpage

Last synced: 20 Nov 2024