Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2024-11-15 00:06:24 UTC
https://github.com/yuanxu-li/html-table-extractor
Extract data from HTML tables
beautifulsoup crawler extract-data html html-table scraping table
Last synced: 06 Nov 2024
https://github.com/ArchiveTeam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd
Last synced: 06 Aug 2024
https://github.com/kcubeterm/achoz
Search through all your personal data efficiently like web search.
crawler document-search filesearch search-engine websearch
Last synced: 07 Nov 2024
https://github.com/feiskyer/scrapy-examples
Some Scrapy and web.py examples
Last synced: 02 Nov 2024
https://github.com/samber/the-great-gpt-firewall
🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs
agent anthropic blocklist censorship crawler firewall genai generative-ai gpt gpt-4 llm openai robots-txt user-agent
Last synced: 09 Nov 2024
https://github.com/crawlzone/crawlzone
Crawlzone is a fast asynchronous internet crawling framework for PHP.
automated-testing crawler crawling-framework middleware php web-scraping web-search
Last synced: 29 Oct 2024
https://github.com/tzw0745/tumblr-crawler-cli
A high-speed, highly customizable Tumblr download tool.
cli-app crawler python tumblr tumblr-downloader
Last synced: 05 Aug 2024
https://github.com/zhang2333/light-crawler
a simplified directed customizable website crawler
Last synced: 14 Nov 2024
https://github.com/lexiestleszek/scrapegpt
ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.
crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper
Last synced: 14 Nov 2024
https://github.com/melroy89/metacritic_api
PHP Metacritic API - Mirror from my GitLab
api crawler data metacritic parser php scores scraper webscraping
Last synced: 09 Nov 2024
https://github.com/aufzayed/HydraRecon
All In One, Fast, Easy Recon Tool
bugbounty bugbounty-tool bugbountytips crawler hacking hacking-tools information-gathering open-source-intelligence osnit pentest pentest-tools pentesting python recon recon-tools
Last synced: 03 Aug 2024
https://github.com/mzollin/qr-pirate
crawl QR-codes from search engines and look for bitcoin private keys
bitcoin bitcoin-wallet crawler cryptocurrency private-key python qr-code qrcode qrcode-reader
Last synced: 11 Oct 2024
https://github.com/trudi-group/ipfs-crawler
A crawler for the IPFS network, with the code for our paper (https://arxiv.org/abs/2002.07747). Also holds scripts to evaluate the obtained data and produce plots similar to those in the paper.
crawler ipfs ipfs-network kademlia-dht libp2p
Last synced: 15 Nov 2024
https://github.com/liameno/librengine
Privacy Web Search Engine (not meta, own crawler)
cpp crawler encryption frontend privacy robots-txt rsa search-engine self-hosted spider websearch websearchengine
Last synced: 11 Nov 2024
https://github.com/LexiestLeszek/scrapeGPT
ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.
crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper
Last synced: 06 Nov 2024
https://github.com/saltyshiomix/nest-crawler
An easy-to-use crawling and scraping module for NestJS
crawler nestjs nodejs scraper typescript
Last synced: 27 Oct 2024
https://github.com/absingh31/tor_spider
Python project to crawl and scrape the lesser-known deep web, also called the dark web. Just provide the onion link and get started.
crawler file-manager ioc python3 scraper scraping socks stem tor tor-config tor-spider
Last synced: 03 Aug 2024
https://github.com/schollz/crawdad
Cross-platform persistent and distributed web crawler :crab:
Last synced: 08 Nov 2024
https://github.com/shurco/goClone
🌱 goClone - clone websites in seconds
cloner cloning crawler crawling go goclone golang hacktoberfest scraping scraping-websites scrapper website-cloner website-scraper wp2static
Last synced: 13 Nov 2024
https://github.com/cho45/chemrtron
A document viewer; fuzzy match incremental search.
crawler document-viewer electron increment javascript
Last synced: 31 Oct 2024
https://github.com/fernandod1/instagram-downloader
Instagram user's photos and videos downloader. Download all media files from any username. Working 2022!
crawler crawling-python instagram instagram-downloader instagram-feed instagram-photos instagram-scraper python scrap scraper scraping scraping-python scraping-tool scraping-websites
Last synced: 12 Nov 2024
https://github.com/mmerian/phpcrawl
Copy of http://phpcrawl.cuab.de/ for use with Composer
Last synced: 07 Nov 2024
https://github.com/johanneszab/tumbltwo
TumblTwo, an Improved Fork of TumblOne, a Tumblr Downloader.
crawler downloader photos ripper tumblr tumblr-blog tumblr-downloader videos
Last synced: 15 Nov 2024
https://github.com/dannyben/snapcrawl
Crawl a website and take screenshots
capture crawler gem ruby screenshot
Last synced: 15 Nov 2024
https://github.com/fengzhizi715/piccrawler
An image crawler built with RxJava2 and Java 8 features
crawler java-8 parallel rxjava2
Last synced: 09 Nov 2024
https://github.com/drkostas/jobapplicationbot
A bot that automatically sends emails in response to new ads posted under any desired xe.gr search URL.
bot crawler email-sender python scraper
Last synced: 28 Oct 2024
https://github.com/bajins/tool-gin
A project built on the go-gin framework to reduce repetitive tasks, such as downloading tools
crawler gin gin-gonic golang key keygen mobaxterm-keygen navicat nginx-conf nginx-configuration python3 registry-workshop scraper shell spider xftp xmanager xshell
Last synced: 15 Oct 2024
https://github.com/lobehub/chat-plugin-web-crawler
🧩 / 🕸 WebsiteCrawler - This plugin automatically crawls the main content of a specified URL webpage and uses it as context input.
ai chatgpt crawler function-calling lobe-chat lobe-chat-plugin openai
Last synced: 01 Nov 2024
https://github.com/howie6879/talospider
talospider - A simple, lightweight scraping micro-framework
crawler crawling python spider web-spider
Last synced: 09 Nov 2024
https://github.com/pablouser1/tikscraperphp
Wrapper for TikTok API
crawler php scraper scraping tiktok tiktok-api wrapper
Last synced: 11 Nov 2024
https://github.com/roccomuso/price-monitoring
Node.js price monitoring library, leveraging the power of x-ray and nightmare.
alert comparison crawler javascript monitoring nodejs price-tracker
Last synced: 28 Oct 2024
https://github.com/eliashaeussler/cache-warmup
🔥 PHP library to warm up caches of URLs located in XML sitemaps
cache-warmup crawler php xml-sitemap
Last synced: 01 Nov 2024
https://github.com/hfreire/browser-as-a-service
A web browser :earth_americas: hosted as a service, to render your JavaScript web pages as HTML
browser browser-as-a-service crawler docker github-actions javascript puppeteer rest-api scraper server webcrawler
Last synced: 26 Oct 2024
https://github.com/jaymon/wishlist
Read an Amazon wishlist programmatically with Python
amazon amazon-wishlist api crawler python scraper
Last synced: 27 Oct 2024
https://github.com/he426100/alipay-crawler
Alipay bill crawler
alipay crawler selenium selenium-ide selenium-php selenium-webdriver
Last synced: 06 Nov 2024
https://github.com/findopendata/findopendata
A search engine for Open Data
crawler dataset-search opendata
Last synced: 05 Aug 2024
https://github.com/x-way/crawlerdetect
Golang module to detect bots and crawlers via the user agent
bot-detection crawler crawler-detection detect go spider user-agent
Last synced: 14 Nov 2024
https://github.com/d4vinci/scrapling
Lightning-Fast, Adaptive Web Scraping for Python
automation crawler crawling crawling-python css dom-manipulation hacktoberfest lxml playwright python python3 scraping selectors selenium stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 14 Nov 2024
https://github.com/farishijazi/rarbgcli
RARBG command line interface for scraping the rarbg.to torrent search engine
crawler rarbg rarbg-torrentapi torrent torrents torrents-crawler
Last synced: 27 Oct 2024
https://github.com/a11ywatch/crawler
gRPC web crawler turbo charged for performance
a11ywatch crawler grpc scraper
Last synced: 13 Oct 2024
https://github.com/valerebron/usetube
Search and get data from YouTube; no Google account needed
crawler typescript video youtube youtube-api
Last synced: 07 Nov 2024
https://github.com/goldarowana/douyin-crawler
Douyin crawler. Crawls a user's posts and likes through a mobile proxy.
crawler douyin douyin-download java vertx
Last synced: 09 Oct 2024
https://github.com/forsti0506/a11y-sitechecker
Automatic accessibility checker with website crawling + screenshots for easy use
accessibility accessibility-criteria accessibility-testing axe crawler hacktoberfest open-source puppeteer typescript typescript-library
Last synced: 31 Oct 2024
https://github.com/sachaarbonel/scrapy.dart
Scrapy, a fast high-level web crawling & scraping framework for dart and Flutter
Last synced: 28 Oct 2024
https://github.com/ReddyyZ/URLBrute-Py
Tool to brute-force website subdomains and directories.
brute-force bruteforcer crawler dir-scanner dirscanner dirsearch sub-domain-enumeration sub-domain-scanner
Last synced: 04 Aug 2024
https://github.com/murat/tors
⏬ Yet another torrent searching application for your command line
crawler ruby-gem torrent-downloader torrent-search-engine
Last synced: 28 Oct 2024
https://github.com/spider-rs/spider-py
Spider ported to Python
crawler headless-chrome python scraper spider web-crawler
Last synced: 05 Nov 2024
https://github.com/soruly/anilist-crawler
Crawl data from anilist API and store in MariaDB.
Last synced: 27 Oct 2024
https://github.com/liangWenPeng/scrapy-admin
A django admin site for scrapy
Last synced: 17 Aug 2024
https://github.com/mike442144/seenreq
Generate an object for testing whether a request has already been sent, where the request is one made with Mikeal's request library.
crawler duplicates-removed post request spider url
Last synced: 27 Oct 2024
https://github.com/mawrkus/jason-the-miner
⛏ A versatile Web scraper for Node.js
crawler crawling javascript scraper scraping web-scraper
Last synced: 13 Nov 2024
https://github.com/golang-collection/go-crawler-distributed
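A distributed crawler project. It supports custom page parsers for secondary development, uses a microservice architecture overall, and sends messages asynchronously through a message queue. Frameworks used include redigo, gorm, goquery, easyjson, viper, amqp, zap, and go-micro. Deployment is containerized with Docker, and the intermediate crawler nodes support horizontal scaling.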
crawler docker elasticsearch go go-micro gocrawler microservice rabbitmq
Last synced: 04 Aug 2024
https://github.com/spk/maman
Rust Web Crawler saving pages on Redis
crawler http spider web web-crawler
Last synced: 01 Nov 2024
https://github.com/Conso1eCowb0y/Deepminer
Deep web crawler and search engine
crawler crawling dark-web data-mining deepminer deepweb github hacking onion osint python-web-scraper python3 search-engine security security-tools spider the-onion-router tor tor-network webcrawler
Last synced: 09 Nov 2024
https://github.com/riquellopes/fii
API for retrieving information about FII
crawler investiment mongodb nodejs
Last synced: 31 Oct 2024
https://github.com/healeycodes/broken-link-crawler
:robot: Python bot that crawls your website looking for dead stuff
Last synced: 22 Oct 2024
https://github.com/healeycodes/Broken-Link-Crawler
:robot: Python bot that crawls your website looking for dead stuff
Last synced: 26 Sep 2024
https://github.com/taseikyo/crawler
:snake:A collection of simple Python crawlers.
baidu-tieba bilibili bing crawler douban pixiv python-crawler python3 youku
Last synced: 13 Nov 2024
https://github.com/elboletaire/php-crawler
:spider: A simple crawler (spider) written in PHP just for fun, with zero dependencies
Last synced: 31 Oct 2024
https://github.com/axetroy/crawler
Crawler framework for Node.js
Last synced: 27 Oct 2024
https://github.com/niespodd/webrtc-local-ip-leak
Oh no, stop this. You can see my local IP address 😲! Use `foundation` attribute against CRC32 lookup table to reveal local IP address of a Chrome/Chromium visitor.
automation bot bot-detection crawler spider stealth webrtc
Last synced: 09 Nov 2024
https://github.com/charlespikachu/seleniumlogin
Login some website using selenium.
crawler selenium selenium-webdriver spider taobao
Last synced: 09 Oct 2024
https://github.com/ryuchen/deadpool
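A crawler application built on Celery as its main framework. Crawl tasks can be added flexibly and multiple sites can be crawled at the same time. All components natively support concurrency at scale and distributed operation, and combined with Celery's native distributed task calls this achieves large-scale concurrency.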
celery crawler deadpool python3 spider taobao taobao-spider tmall tmall-spider
Last synced: 28 Oct 2024
https://github.com/ronin-rb/ronin-web
ronin-web is a collection of useful web helper methods and commands.
cli crawler hacktoberfest helpers html proxy-server ronin-rb ruby server spider web xml
Last synced: 04 Nov 2024
https://github.com/mirusu400/pinterest-infinite-crawler
An infinite Pinterest crawler/scraper. Crawl images with infinite scroll!
crawler hacktoberfest pinterest pinterest-downloader python scraper scraping selenium
Last synced: 06 Nov 2024
https://github.com/p0dalirius/robotstester
This Python script can enumerate all URLs present in robots.txt files, and test whether they can be accessed or not.
bugbounty crawler pentesting python robots tool
Last synced: 29 Oct 2024
https://github.com/VAllens/CrawlerSamples
Sample crawler console apps using Puppeteer and AngleSharp, written in C# 7.1 and built with .NET Core.
anglesharp chsarp crawler dotnetcore headless headless-browsers headless-chrome headless-chromium puppeteer
Last synced: 13 Nov 2024
https://github.com/mrxujiang/crawel
A fun crawler platform built with Apify, Node, and React
apify crawler node puppeteer react react-hooks umi umi3
Last synced: 07 Nov 2024
https://github.com/maicius/universityrecruitment-ssurvey
Using serious data to answer the question: what kinds of companies recruit at what kinds of universities?
analysis beautifulsoup crawler data redis university
Last synced: 11 Nov 2024
https://github.com/jonaslejon/lolcrawler
Headless web crawler for bugbounty and penetration-testing/redteaming
bugbounty crawler docker penetration-testing penetration-testing-tools redteam redteam-tools redteaming
Last synced: 04 Aug 2024
https://github.com/0xhjk/x12306
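12306 ticket lookup assistant: query all stations along a route with one click, board first and pay the fare difference later, for a more worry-free trip.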
12306 12306buyticket 12306helper 12306qiang-piao crawler fk12306 helper reqeusts spider ticket train x12306
Last synced: 14 Nov 2024
https://github.com/kylemocode/medium-stat-box
Practical pinned gist that shows your latest Medium stats 📌
awesome-pinned-gists crawler github-action github-gists medium-stats
Last synced: 02 Nov 2024
https://github.com/hackfengJam/ArticleSpider
Crawling zhihu, jobbole, lagou by Scrapy, and using Elasticsearch+Django to build a Search Engine website --- README_zh.md (including: implementation roadmap, distributed-crawler and coping with anti-crawling strategies).
crawler distributed-systems django elasticsearch scrapy
Last synced: 31 Oct 2024
https://github.com/heyingcai/cetty
An event-dispatch-based crawler framework
crawler event-dispatcher gather spider
Last synced: 13 Nov 2024
https://github.com/jfreegman/toxcrawler
A Tox DHT network crawler
crawler dht dht-network tox toxcore
Last synced: 08 Nov 2024
https://github.com/haxzie-xx/instagram-downloader
Node.js/Express app to retrieve Instagram video/image download URLs
crawler downloader express instagram instagram-scraper nodejs
Last synced: 27 Oct 2024
https://github.com/wenyalintw/google-patents-scraper
Automatically download all PDF files of search results and their patent families found on Google Patents.
crawler google-patents patent patents pdf scraper scraping scrapy web-scraping
Last synced: 11 Nov 2024
https://github.com/apocelipes/schannel-qt5
A GUI client of schannel powered by therecipe/qt and golang
client-side crawler go golang goqt linux qcharts qt5
Last synced: 09 Nov 2024
https://github.com/VeliovGroup/spiderable-middleware
🤖 Prerendering for JavaScript-powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites built on top of front-end JavaScript frameworks
crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable
Last synced: 04 Aug 2024
https://github.com/veliovgroup/spiderable-middleware
🤖 Prerendering for JavaScript-powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites built on top of front-end JavaScript frameworks
crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable
Last synced: 14 Oct 2024