Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

https://github.com/roccomuso/is-google

Verify that a request is from Google crawlers using Google's DNS verification steps

bot check crawler dns google ip js nodejs verify

Last synced: 27 Oct 2024

https://github.com/ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 06 Aug 2024

https://github.com/lrlna/puppeteer-walker

a puppeteer walker 🕷 🕸

chrome crawler headless puppeteer spider walker

Last synced: 27 Oct 2024

https://github.com/kcubeterm/achoz

Search through all your personal data efficiently like web search.

crawler document-search filesearch search-engine websearch

Last synced: 07 Nov 2024

https://github.com/feiskyer/scrapy-examples

Some Scrapy and web.py examples

crawler python scrapy

Last synced: 02 Nov 2024

https://github.com/samber/the-great-gpt-firewall

🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs

agent anthropic blocklist censorship crawler firewall genai generative-ai gpt gpt-4 llm openai robots-txt user-agent

Last synced: 09 Nov 2024

https://github.com/jannchie/simpyder

Ultra-fast asynchronous coroutine-based Python crawler

crawler python spider

Last synced: 27 Oct 2024

https://github.com/crawlzone/crawlzone

Crawlzone is a fast asynchronous internet crawling framework for PHP.

automated-testing crawler crawling-framework middleware php web-scraping web-search

Last synced: 29 Oct 2024

https://github.com/tzw0745/tumblr-crawler-cli

High-speed, highly customizable Tumblr download tool.

cli-app crawler python tumblr tumblr-downloader

Last synced: 05 Aug 2024

https://github.com/zhang2333/light-crawler

a simplified directed customizable website crawler

crawler node-js

Last synced: 14 Nov 2024

https://github.com/lexiestleszek/scrapegpt

ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.

crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper

Last synced: 14 Nov 2024

https://github.com/melroy89/metacritic_api

PHP Metacritic API - Mirror from my GitLab

api crawler data metacritic parser php scores scraper webscraping

Last synced: 09 Nov 2024

https://github.com/jhao104/spider

python crawler spider

crawler python spider

Last synced: 28 Oct 2024

https://github.com/trudi-group/ipfs-crawler

A crawler for the IPFS network, code for our paper (https://arxiv.org/abs/2002.07747). Also holds scripts to evaluate the obtained data and make similar plots as in the paper.

crawler ipfs ipfs-network kademlia-dht libp2p

Last synced: 15 Nov 2024

https://github.com/mzollin/qr-pirate

crawl QR-codes from search engines and look for bitcoin private keys

bitcoin bitcoin-wallet crawler cryptocurrency private-key python qr-code qrcode qrcode-reader

Last synced: 11 Oct 2024

https://github.com/hijkzzz/dht-crawler

A DHT Crawler based on Goroutine

crawler dht golang

Last synced: 12 Nov 2024

https://github.com/alexfazio/devdocs-to-llm

Turn any developer documentation into a GPT

crawler crawling firecrawl scraper scraping

Last synced: 27 Oct 2024

https://github.com/lin-jun-xiang/chatgpt-line-bot

🤖Free ChatGPT Line Bot with Horoscope, Music Broadcast, Google Image Search...

chatbot chatgpt craw crawler cron gpt gpt-3 gpt4free linebot replit scraper

Last synced: 09 Nov 2024

https://github.com/saltyshiomix/nest-crawler

An easy-to-use crawling and scraping module for NestJS

crawler nestjs nodejs scraper typescript

Last synced: 27 Oct 2024

https://github.com/absingh31/tor_spider

Python project to crawl and scrape the lesser-known deep web (also called the dark web). Just provide the onion link and get started.

crawler file-manager ioc python3 scraper scraping socks stem tor tor-config tor-spider

Last synced: 03 Aug 2024

https://github.com/schollz/crawdad

Cross-platform persistent and distributed web crawler 🦀

crawler golang redis web

Last synced: 08 Nov 2024

https://github.com/cho45/chemrtron

A document viewer; fuzzy match incremental search.

crawler document-viewer electron increment javascript

Last synced: 31 Oct 2024

https://github.com/dannyben/snapcrawl

Crawl a website and take screenshots

capture crawler gem ruby screenshot

Last synced: 15 Nov 2024

https://github.com/mmerian/phpcrawl

Copy of http://phpcrawl.cuab.de/ for using with composer

composer crawler php phpcrawl

Last synced: 07 Nov 2024

https://github.com/johanneszab/tumbltwo

TumblTwo, an Improved Fork of TumblOne, a Tumblr Downloader.

crawler downloader photos ripper tumblr tumblr-blog tumblr-downloader videos

Last synced: 15 Nov 2024

https://github.com/drkostas/jobapplicationbot

A bot that automatically sends emails to new ads posted in any desired xe.gr search url.

bot crawler email-sender python scraper

Last synced: 28 Oct 2024

https://github.com/fengzhizi715/piccrawler

An image crawler built with RxJava2 and Java 8 features

crawler java-8 parallel rxjava2

Last synced: 09 Nov 2024

https://github.com/lobehub/chat-plugin-web-crawler

🧩 / 🕸 WebsiteCrawler - This plugin automatically crawls the main content of a specified URL webpage and uses it as context input.

ai chatgpt crawler function-calling lobe-chat lobe-chat-plugin openai

Last synced: 01 Nov 2024

https://github.com/nicholaskajoh/devsearch

A web search engine built with Python which uses TF-IDF and PageRank to sort search results.

crawler flask mongodb pagerank python scrapy search search-engine spider tf-idf

Last synced: 11 Nov 2024

https://github.com/howie6879/talospider

talospider - A simple, lightweight scraping micro-framework

crawler crawling python spider web-spider

Last synced: 09 Nov 2024

https://github.com/roccomuso/price-monitoring

Node.js price monitoring library, leveraging the power of x-ray and nightmare.

alert comparison crawler javascript monitoring nodejs price-tracker

Last synced: 28 Oct 2024

https://github.com/eliashaeussler/cache-warmup

🔥 PHP library to warm up caches of URLs located in XML sitemaps

cache-warmup crawler php xml-sitemap

Last synced: 01 Nov 2024

https://github.com/hfreire/browser-as-a-service

A web browser 🌎 hosted as a service, to render your JavaScript web pages as HTML

browser browser-as-a-service crawler docker github-actions javascript puppeteer rest-api scraper server webcrawler

Last synced: 26 Oct 2024

https://github.com/jaymon/wishlist

Read an Amazon wishlist programmatically with Python

amazon amazon-wishlist api crawler python scraper

Last synced: 27 Oct 2024

https://github.com/findopendata/findopendata

A search engine for Open Data

crawler dataset-search opendata

Last synced: 05 Aug 2024

https://github.com/x-way/crawlerdetect

Golang module to detect bots and crawlers via the user agent

bot-detection crawler crawler-detection detect go spider user-agent

Last synced: 14 Nov 2024

https://github.com/farishijazi/rarbgcli

RARBG command line interface for scraping the rarbg.to torrent search engine

crawler rarbg rarbg-torrentapi torrent torrents torrents-crawler

Last synced: 27 Oct 2024

https://github.com/valerebron/usetube

Search and get data from YouTube; no Google account needed

crawler typescript video youtube youtube-api

Last synced: 07 Nov 2024

https://github.com/a11ywatch/crawler

gRPC web crawler turbocharged for performance

a11ywatch crawler grpc scraper

Last synced: 13 Oct 2024

https://github.com/goldarowana/douyin-crawler

Douyin crawler. Crawls a user's posts and likes through a mobile proxy.

crawler douyin douyin-download java vertx

Last synced: 09 Oct 2024

https://github.com/ReedD/crawler

Chromium / Puppeteer site crawler

bot chromium crawler puppeteer redis scraper

Last synced: 25 Oct 2024

https://github.com/forsti0506/a11y-sitechecker

Automatic accessibility checker with website crawling + screenshots for easy use

accessibility accessibility-criteria accessibility-testing axe crawler hacktoberfest open-source puppeteer typescript typescript-library

Last synced: 31 Oct 2024

https://github.com/sachaarbonel/scrapy.dart

Scrapy, a fast high-level web crawling & scraping framework for Dart and Flutter

crawler dart scrapy

Last synced: 28 Oct 2024

https://github.com/zhangyunhao116/mini-spider

A simple, practical crawler tool: create your own crawler program in just four steps!

crawler python spider

Last synced: 15 Nov 2024

https://github.com/murat/tors

⏬ Yet another torrent searching application for your command line

crawler ruby-gem torrent-downloader torrent-search-engine

Last synced: 28 Oct 2024

https://github.com/soruly/anilist-crawler

Crawl data from anilist API and store in MariaDB.

anilist anime crawler

Last synced: 27 Oct 2024

https://github.com/liangWenPeng/scrapy-admin

A django admin site for scrapy

crawler scrapy scrapyd spider

Last synced: 17 Aug 2024

https://github.com/mike442144/seenreq

Generates an object for testing whether a request has already been sent; "request" here refers to Mikeal's request library.

crawler duplicates-removed post request spider url

Last synced: 27 Oct 2024

https://github.com/jin10086/copyheaders

Conveniently copy request headers from the browser

crawler python tools

Last synced: 27 Oct 2024

https://github.com/mawrkus/jason-the-miner

⛏ A versatile Web scraper for Node.js

crawler crawling javascript scraper scraping web-scraper

Last synced: 13 Nov 2024

https://github.com/golang-collection/go-crawler-distributed

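A distributed crawler project. It supports custom page parsers for secondary development, uses a microservice architecture throughout, and sends messages asynchronously through a message queue. Frameworks used include redigo, gorm, goquery, easyjson, viper, amqp, zap, and go-micro. It is deployed in containers with Docker, and the intermediate crawler nodes support horizontal scaling.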

crawler docker elasticsearch go go-micro gocrawler microservice rabbitmq

Last synced: 04 Aug 2024

https://github.com/spk/maman

Rust Web Crawler saving pages on Redis

crawler http spider web web-crawler

Last synced: 01 Nov 2024

https://github.com/riquellopes/fii

API for retrieving information about FII

crawler investiment mongodb nodejs

Last synced: 31 Oct 2024

https://github.com/healeycodes/broken-link-crawler

🤖 Python bot that crawls your website looking for dead stuff

bot crawler python

Last synced: 22 Oct 2024

https://github.com/taseikyo/crawler

🐍 A collection of simple Python crawlers.

baidu-tieba bilibili bing crawler douban pixiv python-crawler python3 youku

Last synced: 13 Nov 2024

https://github.com/elboletaire/php-crawler

🕷 A simple crawler (spider) written in PHP just for fun, with zero dependencies

crawler php spider

Last synced: 31 Oct 2024

https://github.com/axetroy/crawler

Crawler framework for Node.js

crawler nodejs

Last synced: 27 Oct 2024

https://github.com/kant2002/ncrawler

Web Crawler written in C#

crawler scrapper

Last synced: 22 Oct 2024

https://github.com/niespodd/webrtc-local-ip-leak

Oh no, stop this. You can see my local IP address 😲! Use `foundation` attribute against CRC32 lookup table to reveal local IP address of a Chrome/Chromium visitor.

automation bot bot-detection crawler spider stealth webrtc

Last synced: 09 Nov 2024

https://github.com/charlespikachu/seleniumlogin

Log in to some websites using Selenium.

crawler selenium selenium-webdriver spider taobao

Last synced: 09 Oct 2024

https://github.com/ryuchen/deadpool

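This project is a crawler application built on Celery as its main framework. Crawler tasks can be added flexibly, and crawls of multiple sites can run at the same time. All components natively support concurrency at scale and distributed operation; combined with Celery's native distributed task dispatch, this enables large-scale concurrency.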

celery crawler deadpool python3 spider taobao taobao-spider tmall tmall-spider

Last synced: 28 Oct 2024

https://github.com/ronin-rb/ronin-web

ronin-web is a collection of useful web helper methods and commands.

cli crawler hacktoberfest helpers html proxy-server ronin-rb ruby server spider web xml

Last synced: 04 Nov 2024

https://github.com/mirusu400/pinterest-infinite-crawler

An infinite Pinterest crawler/scraper. Crawl images with infinite scroll!

crawler hacktoberfest pinterest pinterest-downloader python scraper scraping selenium

Last synced: 06 Nov 2024

https://github.com/p0dalirius/robotstester

This Python script can enumerate all URLs present in robots.txt files, and test whether they can be accessed or not.

bugbounty crawler pentesting python robots tool

Last synced: 29 Oct 2024

https://github.com/VAllens/CrawlerSamples

Sample crawler console apps using Puppeteer and AngleSharp, written in C# 7.1 and built with .NET Core.

anglesharp chsarp crawler dotnetcore headless headless-browsers headless-chrome headless-chromium puppeteer

Last synced: 13 Nov 2024

https://github.com/mrxujiang/crawel

A fun crawler platform built with Apify, Node, and React

apify crawler node puppeteer react react-hooks umi umi3

Last synced: 07 Nov 2024

https://github.com/maicius/universityrecruitment-ssurvey

Using serious data to answer the question: what kinds of companies recruit at what kinds of universities?

analysis beautifulsoup crawler data redis university

Last synced: 11 Nov 2024

https://github.com/jonaslejon/lolcrawler

Headless web crawler for bugbounty and penetration-testing/redteaming

bugbounty crawler docker penetration-testing penetration-testing-tools redteam redteam-tools redteaming

Last synced: 04 Aug 2024

https://github.com/xiantang/spider

web crawler

crawler python3

Last synced: 08 Nov 2024

https://github.com/himself65/luogucrawler

A Python crawler for scraping various information from Luogu

crawler python python3

Last synced: 01 Oct 2024

https://github.com/0xhjk/x12306

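12306 ticket-checking assistant: query all stations along the route with one click, board first and pay the fare difference later, for a more worry-free trip.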

12306 12306buyticket 12306helper 12306qiang-piao crawler fk12306 helper reqeusts spider ticket train x12306

Last synced: 14 Nov 2024

https://github.com/bin-huang/nodespider

[DEPRECATED] Simple, flexible, delightful web crawler/spider package

async crawl crawler node pipeline promise spider web

Last synced: 27 Oct 2024

https://github.com/kylemocode/medium-stat-box

Practical pinned gist that shows your latest Medium status 📌

awesome-pinned-gists crawler github-action github-gists medium-stats

Last synced: 02 Nov 2024

https://github.com/hackfengJam/ArticleSpider

Crawling zhihu, jobbole, lagou by Scrapy, and using Elasticsearch+Django to build a Search Engine website --- README_zh.md (including: implementation roadmap, distributed-crawler and coping with anti-crawling strategies).

crawler distributed-systems django elasticsearch scrapy

Last synced: 31 Oct 2024

https://github.com/heyingcai/cetty

An event-dispatch-based crawler framework

crawler event-dispatcher gather spider

Last synced: 13 Nov 2024

https://github.com/scrapy-plugins/scrapy-zyte-api

Zyte API integration for Scrapy

crawler plugin proxy scraping scrapy

Last synced: 12 Nov 2024

https://github.com/xfgryujk/taobaoanalysis

A project for practicing NLP by analyzing Taobao reviews

crawler nlp taobao

Last synced: 08 Nov 2024

https://github.com/jfreegman/toxcrawler

A Tox DHT network crawler

crawler dht dht-network tox toxcore

Last synced: 08 Nov 2024

https://github.com/haxzie-xx/instagram-downloader

Node.js/Express app to retrieve Instagram video/image download URLs

crawler downloader express instagram instagram-scraper nodejs

Last synced: 27 Oct 2024

https://github.com/wenyalintw/google-patents-scraper

Automatically download all PDF files of search results and their patent families found on Google Patents.

crawler google-patents patent patents pdf scraper scraping scrapy web-scraping

Last synced: 11 Nov 2024

https://github.com/gamemann/bestbuy-parser

A personal tool using Python's Scrapy framework to scrape Best Buy's product pages for RTX 3080 TIs and notify if available/not sold out.

3080 automation best bestbuy bot buy crawler parser python python3 rtx scrapy ti

Last synced: 27 Oct 2024

https://github.com/apocelipes/schannel-qt5

A GUI client of schannel powered by therecipe/qt and golang

client-side crawler go golang goqt linux qcharts qt5

Last synced: 09 Nov 2024

https://github.com/VeliovGroup/spiderable-middleware

🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites based on top of front-end JavaScript frameworks

crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable

Last synced: 04 Aug 2024
