Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

https://github.com/farishijazi/rarbgcli

RARBG command line interface for scraping the rarbg.to torrent search engine

crawler rarbg rarbg-torrentapi torrent torrents torrents-crawler

Last synced: 27 Oct 2024

https://github.com/LexiestLeszek/scrapeGPT

ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.

crawler huggingface large-language-models llm ollama proxy rag retrieval-augmented-generation robots-txt scraper telegram-bot website-scraper

Last synced: 01 Aug 2024

https://github.com/goldarowana/douyin-crawler

Douyin crawler. Crawls a user's posts and likes through a mobile-phone proxy.

crawler douyin douyin-download java vertx

Last synced: 09 Oct 2024

https://github.com/sachaarbonel/scrapy.dart

Scrapy, a fast high-level web crawling & scraping framework for Dart and Flutter

crawler dart scrapy

Last synced: 28 Oct 2024

https://github.com/a11ywatch/crawler

gRPC web crawler turbo charged for performance

a11ywatch crawler grpc scraper

Last synced: 13 Oct 2024

https://github.com/forsti0506/a11y-sitechecker

Automatic accessibility checker with website crawling + screenshots for easy use

accessibility accessibility-criteria accessibility-testing axe crawler hacktoberfest open-source puppeteer typescript typescript-library

Last synced: 31 Oct 2024

https://github.com/valerebron/usetube

Search and get data from YouTube; no Google account needed

crawler typescript video youtube youtube-api

Last synced: 14 Oct 2024

https://github.com/ReedD/crawler

Chromium / Puppeteer site crawler

bot chromium crawler puppeteer redis scraper

Last synced: 25 Oct 2024

https://github.com/murat/tors

⏬ Yet another torrent searching application for your command line

crawler ruby-gem torrent-downloader torrent-search-engine

Last synced: 28 Oct 2024

https://github.com/soruly/anilist-crawler

Crawl data from anilist API and store in MariaDB.

anilist anime crawler

Last synced: 27 Oct 2024

https://github.com/mike442144/seenreq

Generates an object for testing whether a request has already been sent; "request" here refers to Mikeal's request library.

crawler duplicates-removed post request spider url

Last synced: 27 Oct 2024

https://github.com/jin10086/copyheaders

Conveniently copy request headers from the browser

crawler python tools

Last synced: 27 Oct 2024

https://github.com/liangWenPeng/scrapy-admin

A django admin site for scrapy

crawler scrapy scrapyd spider

Last synced: 17 Aug 2024

https://github.com/golang-collection/go-crawler-distributed

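A distributed crawler project. It supports custom page parsers for secondary development, uses a microservice architecture overall, and sends messages asynchronously through a message queue. Frameworks used include redigo, gorm, goquery, easyjson, viper, amqp, zap, and go-micro. Deployment is containerized with Docker, and the intermediate crawler nodes support horizontal scaling.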

crawler docker elasticsearch go go-micro gocrawler microservice rabbitmq

Last synced: 04 Aug 2024

https://github.com/spk/maman

Rust Web Crawler saving pages on Redis

crawler http spider web web-crawler

Last synced: 01 Nov 2024

https://github.com/riquellopes/fii

API for retrieving information about FII

crawler investiment mongodb nodejs

Last synced: 31 Oct 2024

https://github.com/healeycodes/Broken-Link-Crawler

:robot: Python bot that crawls your website looking for dead stuff

bot crawler python

Last synced: 26 Sep 2024

https://github.com/healeycodes/broken-link-crawler

:robot: Python bot that crawls your website looking for dead stuff

bot crawler python

Last synced: 22 Oct 2024

https://github.com/axetroy/crawler

Crawler framework for Node.js

crawler nodejs

Last synced: 27 Oct 2024

https://github.com/elboletaire/php-crawler

:spider: A simple crawler (spider) written in PHP just for fun, with zero dependencies

crawler php spider

Last synced: 31 Oct 2024

https://github.com/kant2002/ncrawler

Web Crawler written in C#

crawler scrapper

Last synced: 22 Oct 2024

https://github.com/charlespikachu/seleniumlogin

Log in to some websites using Selenium.

crawler selenium selenium-webdriver spider taobao

Last synced: 09 Oct 2024

https://github.com/ronin-rb/ronin-web

ronin-web is a collection of useful web helper methods and commands.

cli crawler hacktoberfest helpers html proxy-server ronin-rb ruby server spider web xml

Last synced: 04 Nov 2024

https://github.com/ryuchen/deadpool

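This project is a crawler application built on Celery as its core framework. Crawler tasks can be added flexibly, and crawls of multiple sites can run at the same time. All components natively support large-scale concurrency and distribution, which, combined with Celery's native distributed invocation, enables large-scale concurrency.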

celery crawler deadpool python3 spider taobao taobao-spider tmall tmall-spider

Last synced: 28 Oct 2024

https://github.com/himself65/luogucrawler

A Python crawler for scraping various information from Luogu

crawler python python3

Last synced: 01 Oct 2024

https://github.com/p0dalirius/robotstester

This Python script can enumerate all URLs present in robots.txt files, and test whether they can be accessed or not.

bugbounty crawler pentesting python robots tool

Last synced: 29 Oct 2024

https://github.com/mrxujiang/crawel

A somewhat interesting crawler platform built on Apify, Node, and React

apify crawler node puppeteer react react-hooks umi umi3

Last synced: 14 Oct 2024

https://github.com/jonaslejon/lolcrawler

Headless web crawler for bugbounty and penetration-testing/redteaming

bugbounty crawler docker penetration-testing penetration-testing-tools redteam redteam-tools redteaming

Last synced: 04 Aug 2024

https://github.com/bin-huang/nodespider

[DEPRECATED] Simple, flexible, delightful web crawler/spider package

async crawl crawler node pipeline promise spider web

Last synced: 27 Oct 2024

https://github.com/0xhjk/x12306

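12306 ticket-lookup assistant. Queries every station along the route in one click (board first, pay for the rest of the trip afterwards) to make travel less of a hassle.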

12306 12306buyticket 12306helper 12306qiang-piao crawler fk12306 helper reqeusts spider ticket train x12306

Last synced: 31 Oct 2024

https://github.com/kylemocode/medium-stat-box

Practical pinned gist that shows your latest Medium status 📌

awesome-pinned-gists crawler github-action github-gists medium-stats

Last synced: 02 Nov 2024

https://github.com/hackfengJam/ArticleSpider

Crawling zhihu, jobbole, lagou by Scrapy, and using Elasticsearch+Django to build a Search Engine website --- README_zh.md (including: implementation roadmap, distributed-crawler and coping with anti-crawling strategies).

crawler distributed-systems django elasticsearch scrapy

Last synced: 31 Oct 2024

https://github.com/xiantang/spider

web crawler

crawler python3

Last synced: 15 Oct 2024

https://github.com/haxzie-xx/instagram-downloader

Node.js/Express app to retrieve Instagram video/image download URLs

crawler downloader express instagram instagram-scraper nodejs

Last synced: 27 Oct 2024

https://github.com/apocelipes/schannel-qt5

A GUI client of schannel powered by therecipe/qt and golang

client-side crawler go golang goqt linux qcharts qt5

Last synced: 23 Oct 2024

https://github.com/gamemann/bestbuy-parser

A personal tool using Python's Scrapy framework to scrape Best Buy's product pages for RTX 3080 TIs and notify if available/not sold out.

3080 automation best bestbuy bot buy crawler parser python python3 rtx scrapy ti

Last synced: 27 Oct 2024

https://github.com/VeliovGroup/spiderable-middleware

🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites based on top of front-end JavaScript frameworks

crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable

Last synced: 04 Aug 2024

https://github.com/jfreegman/toxcrawler

A Tox DHT network crawler

crawler dht dht-network tox toxcore

Last synced: 15 Oct 2024

https://github.com/veliovgroup/spiderable-middleware

🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites based on top of front-end JavaScript frameworks

crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable

Last synced: 14 Oct 2024

https://github.com/ph-7/crawling-emails

Very simple bash script to crawl email addresses from a specific website.

bash crawler email email-scraper scrape scrape-email scraper scraping shell wget

Last synced: 28 Oct 2024

https://github.com/code4everything/visual-spider

Try our brand-new desktop productivity tool RunFlow: https://myrest.top/myflow

crawler crawler4j-java java-8 java8 javafx javafx-application spider visualization

Last synced: 29 Sep 2024

https://github.com/gomjellie/pysaint

[deprecated] u-SAINT Python client

crawler sap soongsil unofficial

Last synced: 28 Oct 2024

https://github.com/debugtalk/webcrawler

A web crawler based on requests-html, mainly intended for URL validation testing.

crawler requests-html web-crawler weblink

Last synced: 16 Oct 2024

https://github.com/fanhuaandluomu/sina_spider

Sina Weibo crawler: login, keyword-based post search, and post monitoring

crawler python-2 sina-spider

Last synced: 12 Oct 2024

https://github.com/mamal72/iranian-calendar-events

Fetch Iranian calendar events (Jalali, Hijri and Gregorian) from time.ir website

crawler events iranian jalali jalali-calendar persian

Last synced: 02 Nov 2024

https://github.com/k1low/utsusemi

A tool to generate a static website by crawling the original site.

api aws aws-lambda crawler s3-website serverless serverless-framework

Last synced: 17 Oct 2024

https://github.com/k1LoW/utsusemi

A tool to generate a static website by crawling the original site.

api aws aws-lambda crawler s3-website serverless serverless-framework

Last synced: 04 Aug 2024

https://github.com/pykong/pypergrabber

Fetches PubMed article IDs (PMIDs) from email inbox, then crawls PubMed, Google Scholar and Sci-Hub for respective PDF files.

crawler email-inbox google-scholar pdf pmid pubmed python sci-hub scraper

Last synced: 16 Oct 2024

https://github.com/riptl/ytpriv

YT metadata exporter

big-data crawler csv datascience json video youtube

Last synced: 03 Aug 2024

https://github.com/alex-page/get-site-urls

🔗 Get all of the URLs from a website.

crawler sitemap-generator urls

Last synced: 27 Oct 2024

https://github.com/ERap320/CrowLeer

Powerful C++ web crawler based on libcurl

cli crawler crawling download

Last synced: 03 Aug 2024

https://github.com/novemberde/serverless-crawler-demo

Serverless Architecture Crawler demo

aws crawler demo handson serverless

Last synced: 04 Aug 2024

https://github.com/bartozzz/crawlerr

A simple and fully customizable web crawler/spider for Node.js with server-side DOM. Comes with elegant and hell-simple APIs.

crawler jsdom nodejs scraper spider web-crawler

Last synced: 20 Oct 2024

https://github.com/mattwang44/uspto-patft-web-crawler

Crawler for fetching information on US patents and bulk-downloading PDFs

crawler patent patent-crawler pyqt5 python3 uspto

Last synced: 02 Oct 2024

https://github.com/italia/publiccode-crawler

publiccode.yml crawler for the Open Source software catalog of Developers Italia

crawler developers-italia hacktoberfest publiccode publiccodeyml

Last synced: 02 Aug 2024

https://github.com/alessandrodd/googleplay_api

Google Play Unofficial Python 3 API Library

android crawler googleplay googleplay-api playstore

Last synced: 27 Oct 2024

https://github.com/kagami/tistore

:camera: Tistory photo grabber

crawler cross-platform electron tistory

Last synced: 22 Oct 2024

https://github.com/ysh329/douban-crawler

Scrapes Douban group information (groups, users, posts).

crawler douban douban-crawler

Last synced: 23 Oct 2024

https://github.com/feng19/spider_man

SpiderMan, a fast high-level web crawling & scraping framework for Elixir, based on Broadway.

crawler data-mining elixir erlang framework spider

Last synced: 29 Oct 2024

https://github.com/xiongwilee/techweekly

A highly configurable tool for emailing tech weekly newsletters

crawler nodejs techweekly

Last synced: 18 Oct 2024

https://github.com/alanshaw/libp2p-dht-scrape-aas

🧹 A libp2p DHT scraper as a service allowing anyone to collect, consume and use to generate useful reports & visualisations.

crawler dht kademlia libp2p p2p scraper

Last synced: 21 Oct 2024

https://github.com/rzo1/crawler4j

Open Source Web Crawler for Java - A maintained fork of yasserg/crawler4j

crawler crawler4j java spider web-crawler web-spider

Last synced: 29 Sep 2024

https://github.com/Actomaton/ActoCrawler

🕸️ Swift Concurrency-powered crawler engine on top of Actomaton.

crawler swift

Last synced: 09 Aug 2024

https://github.com/yokawasa/scrapy-azuresearch-crawler-samples

Scrapy as a Web Crawler for Azure Search Samples

azure azure-search crawler python python3 scrapy search

Last synced: 30 Oct 2024

https://github.com/capjamesg/indieweb-search

Source code for the IndieWeb search engine.

crawler indieweb search search-engine

Last synced: 03 Aug 2024

https://github.com/mendableai/firecrawl-py

Crawl and convert any website into clean markdown

ai crawler llm python scraper

Last synced: 13 Aug 2024

https://github.com/asing1001/movierater

A useful website for finding movie ratings in Chinese and English by crawling Yahoo, Ptt, and IMDB.

apollo-client chai crawler graphql material-ui mocha mongodb movies nodejs reactjs redis server-side-rendering service-worker sinon typescript

Last synced: 14 Oct 2024

https://github.com/nvk681/gumo

A crawler that extracts data from a dynamic webpage. Written in Node.js.

crawler elasticsearch neo4j nodejs

Last synced: 11 Oct 2024

https://github.com/ruedigervoigt/exoskeleton

A Python framework to build polite, but tenacious crawlers / scrapers with a MariaDB backend

crawler crawling-framework database machine-learning mariadb network python python-3 scraping

Last synced: 15 Oct 2024

https://github.com/RuedigerVoigt/exoskeleton

A Python framework to build polite, but tenacious crawlers / scrapers with a MariaDB backend

crawler crawling-framework database machine-learning mariadb network python python-3 scraping

Last synced: 01 Aug 2024

https://github.com/waynechang65/ptt-crawler

ptt-crawler is a web crawler module designed to scrape data from Ptt.

crawler javascript nodejs ptt scraper scraping spider web-crawler webcrawler

Last synced: 19 Oct 2024

https://github.com/petehouston/udemy-crawler

Crawls Udemy course info and saves it in JSON format.

crawler crawling node node-cli udemy udemy-api udemy-crawl

Last synced: 23 Oct 2024

https://github.com/tower1229/crawler

Nodejs crawler for cnbeta.com

crawler nodejs

Last synced: 14 Oct 2024

https://github.com/p0dalirius/crawlersuseragents

Python script to check whether there are any differences in an application's responses when the request comes from a search engine's crawler.

bugbounty crawler crawlers pentest request tool user-agent web

Last synced: 29 Oct 2024

https://github.com/loomisloud/onion-crawler

Tor website crawler (specific for Alphabay at the time)

crawler onion parser python tor

Last synced: 03 Aug 2024

https://github.com/enijkamp/supermonkey

A crawler for automated Android UI testing.

ai android crawler

Last synced: 22 Oct 2024

https://github.com/alinebastos/crawler

Web Crawler created with Node.js and Puppeteer

crawler fs javascript nodejs puppeteer scraping

Last synced: 05 Nov 2024

https://github.com/PadishahIII/SecretScraper

SecretScraper is a web scraper that crawls through target websites, scrapes HTTP responses, and extracts secret information via regular expressions.

crawler cyper hyperscan pentest-tool pentesting python sensitivity-analysis webscraper

Last synced: 13 Aug 2024

https://github.com/paambaati/websight

🕷 A simple but *really* fast crawler built with Node.js & TypeScript

coding-challenge crawler interview-questions javascript monzo nodejs typescript

Last synced: 15 Oct 2024

https://github.com/lixi5338619/lxparse

A library for parsing links on list pages and extracting content from detail pages

crawler htmlparse python

Last synced: 05 Nov 2024

https://github.com/Knovour/json-web-crawler

Use JSON to list all elements (with CSS 3 and jQuery selectors) that you want to crawl.

crawler javascript jquery json web-crawler

Last synced: 03 Aug 2024

https://github.com/vignif/crawler-google-scholar

This bot crawls and downloads statistics and pictures of researchers from Google Scholar.

crawler downloading-statistics google-scholar indexes statistics

Last synced: 01 Aug 2024

https://github.com/pourmand1376/persiancrawler

Open source crawler for Persian websites.

crawler machine-learning news python scrapy tasnim text-classification

Last synced: 11 Oct 2024