Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2024-12-25 00:05:56 UTC
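The "systematically browses" part of the definition above can be illustrated with a minimal breadth-first sketch. It is not taken from any project listed here. The `crawl` function, the `LinkParser` class, and the in-memory `PAGES` dictionary (with its example.test URLs) are hypothetical names introduced only for illustration, and the `fetch` callable stands in for a real HTTP client.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` maps a URL to HTML text, or None if missing."""
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # frontier of URLs still to fetch
    visited = []                # URLs fetched successfully, in visit order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Assumption: a tiny in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

A real crawler would typically also honor robots.txt, rate-limit its requests, and persist the frontier; those concerns are left out of this sketch.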
https://github.com/wuchunfu/ipproxypool
An IP proxy pool implemented in Golang. Technologies involved: go, gorm, proxy, proxypool, ip, crawler, mysql, viper, cobra
crawler go ip proxy proxy-server proxypool
Last synced: 19 Dec 2024
https://github.com/antoinevastel/bots-zoo
bot crawler crawling playwright puppeteer scraper scraping selenium user-agent useragent
Last synced: 16 Nov 2024
https://github.com/patrickschur/pappet
A command-line tool to crawl websites using puppeteer.
cli crawler pdf puppeteer screenshot
Last synced: 06 Nov 2024
https://github.com/ArchiveTeam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd
Last synced: 25 Nov 2024
https://github.com/kostas-pa/LFITester
LFITester is a Python3 program that automates the detection and exploitation of Local File Inclusion (LFI) vulnerabilities on a server.
bugbounty crawler cybersecurity enumeration exploitation fuzzing hacking lfi lfi-detection lfi-exploitation lfi-vulnerability penetration-testing penetration-testing-tools pentest-tool pentesting python web-hacking webhacking
Last synced: 21 Nov 2024
https://github.com/zhaow-de/rotating-tor-http-proxy
A multi-arch image that provides one HTTP proxy endpoint with many concurrent tunnels to the Tor network.
amd64 arm64 armv6 armv7 crawler docker-image dockerhub-image haproxy multi-platform privoxy-tor proxy tor
Last synced: 24 Nov 2024
https://github.com/hueristiq/xcrawl3r
A command-line interface (CLI) based utility to recursively crawl webpages. It is designed to systematically browse webpages' URLs and follow links to discover linked webpages' URLs.
bug-bounty bug-bounty-tools contentdiscovery crawler ethical-hacking ethical-hacking-tools go golang penetration-testing penetration-testing-tools reconnaissance red-teaming red-teaming-tools web-security
Last synced: 24 Dec 2024
https://github.com/medcl/gopa-abandoned
GOPA, a spider written in Go. (NOTE: this project moved to https://github.com/infinitbyte/gopa)
crawler golang lightweight spider
Last synced: 19 Nov 2024
https://github.com/foo-git/rewe-discounts
Grabs current REWE discounts and saves them in a Markdown file.
Last synced: 24 Dec 2024
https://github.com/creekorful/bathyscaphe
Fast, highly configurable, cloud native dark web crawler.
architecture crawler crawling elasticsearch golang hidden-services kibana tor web-crawler
Last synced: 27 Oct 2024
https://github.com/kurogai/deepweb-scappering
Discover hidden deepweb pages
crawler deepweb hacking hacking-tool internet kali python3 scappering scapre tor tor-network
Last synced: 27 Oct 2024
https://github.com/jefferyhus/es6-crawler-detect
:spider: This is an ES6 adaptation of the original PHP library CrawlerDetect. It helps you detect bots, crawlers, and spiders via the user agent.
bots crawler detection es6-javascript spider
Last synced: 22 Dec 2024
https://github.com/JefferyHus/es6-crawler-detect
:spider: This is an ES6 adaptation of the original PHP library CrawlerDetect. It helps you detect bots, crawlers, and spiders via the user agent.
bots crawler detection es6-javascript spider
Last synced: 11 Nov 2024
https://github.com/nietaki/crawlie
A simple Elixir library for writing decently-performing crawlers with minimum effort.
crawler elixir elixir-library genstage
Last synced: 18 Nov 2024
https://github.com/aminehorseman/images-web-crawler
This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). It can crawl the web, download images, rename, resize, and convert the images, and merge folders.
crawler dataset dataset-creation flickr-api google-images-crawler google-images-downloader image-classification image-dataset image-processing images machine-learning
Last synced: 15 Nov 2024
https://github.com/Randark-JMT/Bilibili_manga_download
A Bilibili Manga download tool with a graphical interface.
bilibili crawler downloader pyside6 python python3 qt spider
Last synced: 27 Oct 2024
https://github.com/tobecrazy/seleniumdemo
Selenium automation test framework
container crawler docker docker-compose jenkins maven pip python selenium selenium-grid selenium-webdriver snapshot
Last synced: 19 Dec 2024
https://github.com/randark-jmt/bilibili_manga_download
A Bilibili Manga download tool with a graphical interface.
bilibili crawler downloader pyside6 python python3 qt spider
Last synced: 20 Nov 2024
https://github.com/crawlab-team/webspot
An intelligent web service to automatically detect web content and extract information from it.
Last synced: 17 Nov 2024
https://github.com/yuanxu-li/html-table-extractor
Extract data from HTML tables.
beautifulsoup crawler extract-data html html-table scraping table
Last synced: 06 Nov 2024
https://github.com/boris-code/feaplat
A crawler management system with cluster support and elastic scaling. Supports running feapder, scrapy, selenium, playwright, and other frameworks and scripts.
crawler feapder feaplat spider
Last synced: 05 Dec 2024
https://github.com/tensojka/instastories-backup
Backup your friends' Instagram Stories forever and get to keep them even after 24 hours.
backup crawler instagram instagram-stories python python-3-6 python3
Last synced: 22 Dec 2024
https://github.com/samber/the-great-gpt-firewall
🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs
agent anthropic blocklist censorship crawler firewall genai generative-ai gpt gpt-4 llm openai robots-txt user-agent
Last synced: 20 Dec 2024
https://github.com/kcubeterm/achoz
Search through all your personal data efficiently like web search.
crawler document-search filesearch search-engine websearch
Last synced: 19 Dec 2024
https://github.com/crawlzone/crawlzone
Crawlzone is a fast asynchronous internet crawling framework for PHP.
automated-testing crawler crawling-framework middleware php web-scraping web-search
Last synced: 29 Oct 2024
https://github.com/feiskyer/scrapy-examples
Some scrapy and web.py examples
Last synced: 02 Nov 2024
https://github.com/tzw0745/tumblr-crawler-cli
Tumblr download tool with high speed and customization.
cli-app crawler python tumblr tumblr-downloader
Last synced: 22 Nov 2024
https://github.com/zhang2333/light-crawler
A simplified, directed, customizable website crawler
Last synced: 24 Nov 2024
https://github.com/lucasayres/python-tools
A collection of Python tools, scripts and utilities to make your life easier.
automation codes collection crawler functions geolocation helper libs pdf python qrcode recipes scripts speech sqlalchemy tips tools tricks unzip utilities
Last synced: 19 Nov 2024
https://github.com/get-set-fetch/extension
web scraping extension
browser crawler extension indexeddb javascript npm scraper
Last synced: 23 Dec 2024
https://github.com/lexiestleszek/scrapegpt
ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.
crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper
Last synced: 14 Nov 2024
https://github.com/melroy89/metacritic_api
PHP Metacritic API - Mirror from my GitLab
api crawler data metacritic parser php scores scraper webscraping
Last synced: 12 Dec 2024
https://github.com/aufzayed/HydraRecon
All In One, Fast, Easy Recon Tool
bugbounty bugbounty-tool bugbountytips crawler hacking hacking-tools information-gathering open-source-intelligence osnit pentest pentest-tools pentesting python recon recon-tools
Last synced: 16 Nov 2024
https://github.com/trudi-group/ipfs-crawler
A crawler for the IPFS network, code for our paper (https://arxiv.org/abs/2002.07747). Also holds scripts to evaluate the obtained data and make similar plots as in the paper.
crawler ipfs ipfs-network kademlia-dht libp2p
Last synced: 24 Dec 2024
https://github.com/mzollin/qr-pirate
Crawls QR codes from search engines and looks for Bitcoin private keys
bitcoin bitcoin-wallet crawler cryptocurrency private-key python qr-code qrcode qrcode-reader
Last synced: 11 Oct 2024
https://github.com/howie6879/hproxy
hproxy - An asynchronous IP proxy pool for crawlers that aims to make getting a proxy as convenient as possible.
asyncio crawler crawlers hproxy proxy proxy-pool proxy-spider sanic schedule
Last synced: 19 Nov 2024
https://github.com/usernam3/shopify-app-store-scraper
Crawler behind the Shopify App Marketplace dataset
crawler dataset-creation shopify
Last synced: 25 Dec 2024
https://github.com/liameno/librengine
Privacy Web Search Engine (not meta, own crawler)
cpp crawler encryption frontend privacy robots-txt rsa search-engine self-hosted spider websearch websearchengine
Last synced: 11 Nov 2024
https://github.com/LexiestLeszek/scrapeGPT
ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.
crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper
Last synced: 06 Nov 2024
https://github.com/aziz0x48/xsmtp
xSMTP 🦟 Lightning-fast, multithreaded SMTP scanner targeting open-relay and unsecured servers in multiple network ranges.
bot crawler exploit exploit-scanner multithreading networking pentest-tool pentesting pentesting-tools portscan portscanner python python-exploits scanner-web security security-tools smtp smtp-cracker
Last synced: 16 Dec 2024
https://github.com/saltyshiomix/nest-crawler
An easy-to-use crawling and scraping module for NestJS
crawler nestjs nodejs scraper typescript
Last synced: 27 Oct 2024
https://github.com/nekolr/slime
🍰 A visual crawler management platform
crawler spider visual-crawler websocket
Last synced: 19 Nov 2024
https://github.com/absingh31/tor_spider
A Python project to crawl and scrape the lesser-known deep web (or dark web). Just provide the onion link to get started.
crawler file-manager ioc python3 scraper scraping socks stem tor tor-config tor-spider
Last synced: 17 Nov 2024
https://github.com/schollz/crawdad
Cross-platform persistent and distributed web crawler :crab:
Last synced: 08 Nov 2024
https://github.com/minicloudsky/eastmoney
Python requests + Django + Node.js Koa + MySQL to crawl Eastmoney fund and stock data for data analysis and visualization.
crawler database django eastmoney financial-analysis financial-data metabase mysql nodejs python vue vuejs
Last synced: 20 Nov 2024
https://github.com/cho45/chemrtron
A document viewer; fuzzy match incremental search.
crawler document-viewer electron increment javascript
Last synced: 31 Oct 2024
https://github.com/shurco/goClone
🌱 goClone - clone websites in seconds
cloner cloning crawler crawling go goclone golang hacktoberfest scraping scraping-websites scrapper website-cloner website-scraper wp2static
Last synced: 13 Nov 2024
https://github.com/dannyben/snapcrawl
Crawl a website and take screenshots
capture crawler gem ruby screenshot
Last synced: 19 Dec 2024
https://github.com/fernandod1/instagram-downloader
Instagram user's photos and videos downloader. Download all media files from any username. Working 2022!
crawler crawling-python instagram instagram-downloader instagram-feed instagram-photos instagram-scraper python scrap scraper scraping scraping-python scraping-tool scraping-websites
Last synced: 12 Nov 2024
https://github.com/johanneszab/tumbltwo
TumblTwo, an Improved Fork of TumblOne, a Tumblr Downloader.
crawler downloader photos ripper tumblr tumblr-blog tumblr-downloader videos
Last synced: 15 Nov 2024
https://github.com/mmerian/phpcrawl
Copy of http://phpcrawl.cuab.de/ for use with Composer
Last synced: 18 Dec 2024
https://github.com/x-way/crawlerdetect
Golang module to detect bots and crawlers via the user agent
bot-detection crawler crawler-detection detect go spider user-agent
Last synced: 25 Dec 2024
https://github.com/bajins/tool-gin
A project built on the go-gin framework to reduce repetitive tasks, such as downloading certain tools
crawler gin gin-gonic golang key keygen mobaxterm-keygen navicat nginx-conf nginx-configuration python3 registry-workshop scraper shell spider xftp xmanager xshell
Last synced: 15 Oct 2024
https://github.com/drkostas/jobapplicationbot
A bot that automatically sends emails in response to new ads posted in any desired xe.gr search URL.
bot crawler email-sender python scraper
Last synced: 18 Nov 2024
https://github.com/fengzhizi715/piccrawler
An image crawler built with RxJava2 and Java 8 features
crawler java-8 parallel rxjava2
Last synced: 09 Nov 2024
https://github.com/lobehub/chat-plugin-web-crawler
🧩 / 🕸 WebsiteCrawler - This plugin automatically crawls the main content of a specified URL webpage and uses it as context input.
ai chatgpt crawler function-calling lobe-chat lobe-chat-plugin openai
Last synced: 01 Nov 2024
https://github.com/howie6879/talospider
talospider - A simple, lightweight scraping micro-framework
crawler crawling python spider web-spider
Last synced: 09 Nov 2024
https://github.com/spider-rs/spider-py
Spider ported to Python
crawler headless-chrome python scraper spider web-crawler
Last synced: 23 Dec 2024
https://github.com/eliashaeussler/cache-warmup
🔥 PHP library to warm up caches of URLs located in XML sitemaps
cache-warmup crawler php xml-sitemap
Last synced: 20 Dec 2024
https://github.com/roccomuso/price-monitoring
Node.js price monitoring library, leveraging the power of x-ray and nightmare.
alert comparison crawler javascript monitoring nodejs price-tracker
Last synced: 28 Oct 2024
https://github.com/pablouser1/tikscraperphp
Wrapper for TikTok API
crawler php scraper scraping tiktok tiktok-api wrapper
Last synced: 18 Nov 2024
https://github.com/jaymon/wishlist
Read an Amazon wishlist programmatically with Python
amazon amazon-wishlist api crawler python scraper
Last synced: 27 Oct 2024
https://github.com/he426100/alipay-crawler
An Alipay bill crawler
alipay crawler selenium selenium-ide selenium-php selenium-webdriver
Last synced: 06 Nov 2024
https://github.com/findopendata/findopendata
A search engine for Open Data
crawler dataset-search opendata
Last synced: 22 Nov 2024
https://github.com/hfreire/browser-as-a-service
A web browser :earth_americas: hosted as a service, to render your JavaScript web pages as HTML
browser browser-as-a-service crawler docker github-actions javascript puppeteer rest-api scraper server webcrawler
Last synced: 17 Nov 2024
https://github.com/a11ywatch/crawler
gRPC web crawler turbo charged for performance
a11ywatch crawler grpc scraper
Last synced: 21 Dec 2024
https://github.com/D4Vinci/Scrapling
Lightning-Fast, Adaptive Web Scraping for Python
automation crawler crawling crawling-python css dom-manipulation hacktoberfest lxml playwright python python3 scraping selectors selenium stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 18 Nov 2024
https://github.com/twtrubiks/facebook-messenger-bot-tutorial
facebook-messenger-bot-tutorial use Python Django
bot crawler django facebook-messenger-bot ngrok ptt python tutorial webhooks
Last synced: 16 Nov 2024
https://github.com/d4vinci/scrapling
Lightning-Fast, Adaptive Web Scraping for Python
automation crawler crawling crawling-python css dom-manipulation hacktoberfest lxml playwright python python3 scraping selectors selenium stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 22 Dec 2024
https://github.com/farishijazi/rarbgcli
RARBG command line interface for scraping the rarbg.to torrent search engine
crawler rarbg rarbg-torrentapi torrent torrents torrents-crawler
Last synced: 27 Oct 2024
https://github.com/sachaarbonel/scrapy.dart
Scrapy, a fast high-level web crawling & scraping framework for dart and Flutter
Last synced: 20 Dec 2024
https://github.com/forsti0506/a11y-sitechecker
Automatic accessibility checker with website crawling + screenshots for easy use
accessibility accessibility-criteria accessibility-testing axe crawler hacktoberfest open-source puppeteer typescript typescript-library
Last synced: 22 Nov 2024
https://github.com/valerebron/usetube
Search and get data from YouTube; no Google account needed
crawler typescript video youtube youtube-api
Last synced: 07 Nov 2024
https://github.com/goldarowana/douyin-crawler
A Douyin crawler. Crawls a user's posts and likes through a mobile phone proxy
crawler douyin douyin-download java vertx
Last synced: 09 Oct 2024
https://github.com/ReddyyZ/URLBrute-Py
Tool to brute-force website subdomains and directories.
brute-force bruteforcer crawler dir-scanner dirscanner dirsearch sub-domain-enumeration sub-domain-scanner
Last synced: 21 Nov 2024
https://github.com/murat/tors
⏬ Yet another torrent searching application for your command line
crawler ruby-gem torrent-downloader torrent-search-engine
Last synced: 28 Oct 2024
https://github.com/joaopauloaramuni/python
Repo Python
crawler python scraping scrapy
Last synced: 21 Dec 2024
https://github.com/soruly/anilist-crawler
Crawl data from anilist API and store in MariaDB.
Last synced: 27 Oct 2024
https://github.com/harborzeng/crawler_jd_what_worthy_buying
Crawls all reviews of a JD.com product and uses sentiment analysis to judge whether the product is worth buying
Last synced: 17 Nov 2024
https://github.com/mariot/chan-downloader
CLI to download all images/webms in a 4chan thread
4chan 4chan-downloader crawler scraper
Last synced: 02 Dec 2024
https://github.com/mawrkus/jason-the-miner
⛏ A versatile Web scraper for Node.js
crawler crawling javascript scraper scraping web-scraper
Last synced: 13 Nov 2024
https://github.com/liangWenPeng/scrapy-admin
A Django admin site for Scrapy
Last synced: 09 Dec 2024
https://github.com/mike442144/seenreq
Generates an object for testing whether a request has been sent, where request is Mikeal's request.
crawler duplicates-removed post request spider url
Last synced: 27 Oct 2024
https://github.com/golang-collection/go-crawler-distributed
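A distributed crawler project. It supports custom page parsers for secondary development, uses a microservice architecture overall, and sends messages asynchronously through a message queue. Frameworks used include redigo, gorm, goquery, easyjson, viper, amqp, zap, and go-micro. Deployment is containerized with Docker, and the intermediate crawler nodes support horizontal scaling.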
crawler docker elasticsearch go go-micro gocrawler microservice rabbitmq
Last synced: 20 Nov 2024