Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
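The "systematically browses" part of that definition can be illustrated with a minimal sketch: a breadth-first traversal that fetches a page, extracts its links, and queues any it has not seen. This is not taken from any project listed here. The `crawl` function, the `fetch` callable, the `LinkExtractor` class, and the `example.test` pages are all hypothetical names chosen for illustration, and an in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited.

    `fetch` is a hypothetical callable that takes a URL and returns HTML
    text, or None when the page cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it, keep crawling
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# In-memory stand-in for the web so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/missing">gone</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

A real crawler would replace `PAGES.get` with an HTTP client and would typically also honor robots.txt, rate limits, and a domain allow-list, none of which this sketch attempts.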

https://github.com/healeycodes/broken-link-crawler

:robot: Python bot that crawls your website looking for dead stuff

bot crawler python

Last synced: 10 Dec 2024

https://github.com/spk/maman

Rust Web Crawler saving pages on Redis

crawler http spider web web-crawler

Last synced: 01 Nov 2024

https://github.com/riquellopes/fii

API for retrieving information about FII

crawler investiment mongodb nodejs

Last synced: 31 Oct 2024

https://github.com/healeycodes/Broken-Link-Crawler

:robot: Python bot that crawls your website looking for dead stuff

bot crawler python

Last synced: 26 Sep 2024

https://github.com/xiaoxiunique/x-kit

A tool for scraping and analyzing X (Twitter) user data and tweets.

crawler kols twitter x

Last synced: 27 Dec 2024

https://github.com/jonaslejon/lolcrawler

Headless web crawler for bugbounty and penetration-testing/redteaming

bugbounty crawler docker penetration-testing penetration-testing-tools redteam redteam-tools redteaming

Last synced: 21 Nov 2024

https://github.com/pzaino/thecrowler

A Content Discovery and Development Platform. Empowering Cybersecurity, AI, Marketing, and Finance professionals and researchers to discover, analyze, and interact with the web in all its dimensions.

automation content-detection content-discovery crawler crawling cyber-security cybersecurity cybersecurity-tools golang indexer indexing reconnaissance scraping search-engine vulnerability-detection

Last synced: 03 Dec 2024

https://github.com/elboletaire/php-crawler

:spider: A simple crawler (spider) written in PHP just for fun, with zero dependencies

crawler php spider

Last synced: 31 Oct 2024

https://github.com/axetroy/crawler

Crawler framework for Node.js

crawler nodejs

Last synced: 27 Oct 2024

https://github.com/kant2002/ncrawler

Web Crawler written in C#

crawler scrapper

Last synced: 22 Oct 2024

https://github.com/p0dalirius/robotstester

This Python script can enumerate all URLs present in robots.txt files, and test whether they can be accessed or not.

bugbounty crawler pentesting python robots tool

Last synced: 30 Dec 2024

https://github.com/niespodd/webrtc-local-ip-leak

Oh no, stop this. You can see my local IP address 😲! Use `foundation` attribute against CRC32 lookup table to reveal local IP address of a Chrome/Chromium visitor.

automation bot bot-detection crawler spider stealth webrtc

Last synced: 09 Nov 2024

https://github.com/taseikyo/crawler

:snake:A collection of simple Python crawlers.

baidu-tieba bilibili bing crawler douban pixiv python-crawler python3 youku

Last synced: 13 Nov 2024

https://github.com/ronin-rb/ronin-web

ronin-web is a collection of useful web helper methods and commands.

cli crawler hacktoberfest helpers html proxy-server ronin-rb ruby server spider web xml

Last synced: 04 Nov 2024

https://github.com/charlespikachu/seleniumlogin

Login some website using selenium.

crawler selenium selenium-webdriver spider taobao

Last synced: 09 Oct 2024

https://github.com/ryuchen/deadpool

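A crawler application built on Celery as its core framework. Crawl tasks can be added flexibly and multiple sites can be crawled at the same time. All components natively support large-scale concurrency and distribution, and Celery's native distributed task calls make large-scale concurrent crawling possible.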

celery crawler deadpool python3 spider taobao taobao-spider tmall tmall-spider

Last synced: 28 Oct 2024

https://github.com/kkomelin/insecres

A console tool that finds insecure resources on HTTPS sites

crawler finder https security

Last synced: 25 Nov 2024

https://github.com/himself65/luogucrawler

A Python crawler for scraping various information from Luogu

crawler python python3

Last synced: 01 Oct 2024

https://github.com/mirusu400/pinterest-infinite-crawler

An infinite Pinterest crawler/scraper. Crawl images with infinite scroll!

crawler hacktoberfest pinterest pinterest-downloader python scraper scraping selenium

Last synced: 06 Nov 2024

https://github.com/maicius/universityrecruitment-ssurvey

Using serious data to answer the question: what kinds of companies recruit at what kinds of universities?

analysis beautifulsoup crawler data redis university

Last synced: 11 Nov 2024

https://github.com/xiantang/spider

web crawler

crawler python3

Last synced: 08 Nov 2024

https://github.com/0xhjk/x12306

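12306 ticket-lookup assistant: query every station along a route with one click, board first and pay for the rest of the trip afterwards, for more worry-free travel.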

12306 12306buyticket 12306helper 12306qiang-piao crawler fk12306 helper reqeusts spider ticket train x12306

Last synced: 25 Dec 2024

https://github.com/mrxujiang/crawel

A fairly interesting crawler platform built with Apify+node+react

apify crawler node puppeteer react react-hooks umi umi3

Last synced: 07 Nov 2024

https://github.com/VAllens/CrawlerSamples

Sample Puppeteer+AngleSharp crawler console apps, written in C# 7.1 and built with .NET Core.

anglesharp chsarp crawler dotnetcore headless headless-browsers headless-chrome headless-chromium puppeteer

Last synced: 13 Nov 2024

https://github.com/bin-huang/nodespider

[DEPRECATED] Simple, flexible, delightful web crawler/spider package

async crawl crawler node pipeline promise spider web

Last synced: 06 Dec 2024

https://github.com/iljan/narr

Download audio tracks from Netflix to sample your favorite shows

chrome-devtools-protocol cli crawler downloader music

Last synced: 02 Dec 2024

https://github.com/migalabs/armiarma

Armiarma is a Libp2p open-network crawler with a current focus on Ethereum's CL network

crawler ethereum libp2p monitoring

Last synced: 19 Dec 2024

https://github.com/kylemocode/medium-stat-box

Practical pinned gist that shows your latest Medium stats 📌

awesome-pinned-gists crawler github-action github-gists medium-stats

Last synced: 12 Dec 2024

https://github.com/hackfengJam/ArticleSpider

Crawling zhihu, jobbole, lagou by Scrapy, and using Elasticsearch+Django to build a Search Engine website --- README_zh.md (including: implementation roadmap, distributed-crawler and coping with anti-crawling strategies).

crawler distributed-systems django elasticsearch scrapy

Last synced: 31 Oct 2024

https://github.com/twtrubiks/auto_crawler_ptt_beauty_image

Automatically crawl PTT Beauty board images using Python schedule

beauty crawler heroku image ptt python schedule tutorial

Last synced: 16 Nov 2024

https://github.com/scrapy-plugins/scrapy-zyte-api

Zyte API integration for Scrapy

crawler plugin proxy scraping scrapy

Last synced: 28 Dec 2024

https://github.com/heyingcai/cetty

An event-dispatch-based crawler framework

crawler event-dispatcher gather spider

Last synced: 13 Nov 2024

https://github.com/apocelipes/schannel-qt5

A GUI client of schannel powered by therecipe/qt and golang

client-side crawler go golang goqt linux qcharts qt5

Last synced: 09 Nov 2024

https://github.com/wenyalintw/google-patents-scraper

Automatically download all PDF files of search results and their patent families found on Google Patents.

crawler google-patents patent patents pdf scraper scraping scrapy web-scraping

Last synced: 11 Nov 2024

https://github.com/jfreegman/toxcrawler

A Tox DHT network crawler

crawler dht dht-network tox toxcore

Last synced: 08 Nov 2024

https://github.com/xfgryujk/taobaoanalysis

An NLP practice project that analyzes Taobao reviews

crawler nlp taobao

Last synced: 08 Nov 2024

https://github.com/gamemann/bestbuy-parser

A personal tool using Python's Scrapy framework to scrape Best Buy's product pages for RTX 3080 TIs and notify if available/not sold out.

3080 automation best bestbuy bot buy crawler parser python python3 rtx scrapy ti

Last synced: 27 Oct 2024

https://github.com/haxzie-xx/instagram-downloader

Node.js/Express app to retrieve Instagram video/image download URLs

crawler downloader express instagram instagram-scraper nodejs

Last synced: 27 Oct 2024

https://github.com/VeliovGroup/spiderable-middleware

🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites based on top of front-end JavaScript frameworks

crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable

Last synced: 18 Nov 2024

https://github.com/subins2000/phpwebcrawler

A Web Crawler Created in PHP

crawler php

Last synced: 13 Nov 2024

https://github.com/andreaskoch/gargantua

The fast website crawler

command-line crawler golang xml-sitemap

Last synced: 16 Nov 2024

https://github.com/veliovgroup/spiderable-middleware

🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), SPAs (Single Page Applications), and other websites based on top of front-end JavaScript frameworks

crawler meteor meteor-package middleware nodejs npm npm-package seo seo-optimization spiderable

Last synced: 14 Oct 2024

https://github.com/code4everything/visual-spider

Try our brand-new desktop productivity tool RunFlow: https://myrest.top/myflow

crawler crawler4j-java java-8 java8 javafx javafx-application spider visualization

Last synced: 29 Sep 2024

https://github.com/miry/medup

Download all content from Medium and Dev.to to local folder

cli crawler devto json markdown medium sync tool

Last synced: 06 Nov 2024

https://github.com/deptagency/octopus

Recursive and multi-threaded broken link checker

broken checker crawler links

Last synced: 20 Nov 2024

https://github.com/a252937166/toutiaocrawler

Example crawler for Toutiao accounts

crawler toutiao

Last synced: 21 Nov 2024

https://github.com/ph-7/crawling-emails

Very simple bash script to crawl email addresses from a specific website.

bash crawler email email-scraper scrape scrape-email scraper scraping shell wget

Last synced: 28 Oct 2024

https://github.com/gomjellie/pysaint

[deprecated] u-SAINT Python client

crawler sap soongsil unofficial

Last synced: 25 Dec 2024

https://github.com/fanhuaandluomu/sina_spider

Sina Weibo crawler: login, keyword search of posts, and post monitoring

crawler python-2 sina-spider

Last synced: 22 Dec 2024

https://github.com/debugtalk/webcrawler

A web crawler based on requests-html, mainly intended for URL validation testing.

crawler requests-html web-crawler weblink

Last synced: 08 Nov 2024

https://github.com/mamal72/iranian-calendar-events

Fetch Iranian calendar events (Jalali, Hijri and Gregorian) from time.ir website

crawler events iranian jalali jalali-calendar persian

Last synced: 02 Nov 2024

https://github.com/juzeon/advanced-php-crawler

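Crawler for Sina Blog articles and the wenku8 light novel library. It can fetch and save images and build e-books in one click. A handy tool for Kindle readers!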

calibre crawler gitbook kindle php sina

Last synced: 10 Nov 2024

https://github.com/pykong/pypergrabber

Fetches PubMed article IDs (PMIDs) from email inbox, then crawls PubMed, Google Scholar and Sci-Hub for respective PDF files.

crawler email-inbox google-scholar pdf pmid pubmed python sci-hub scraper

Last synced: 08 Nov 2024

https://github.com/mendableai/firecrawl-py

Crawl and convert any website into clean markdown

ai crawler llm python scraper

Last synced: 08 Nov 2024

https://github.com/k1low/utsusemi

A tool to generate a static website by crawling the original site.

api aws aws-lambda crawler s3-website serverless serverless-framework

Last synced: 17 Oct 2024

https://github.com/k1LoW/utsusemi

A tool to generate a static website by crawling the original site.

api aws aws-lambda crawler s3-website serverless serverless-framework

Last synced: 20 Nov 2024

https://github.com/mjavadhpour/telegram-member-inviter

Crawling client's groups and channels to invite their members to a target group.

crawler python python3 robot telegram telegram-client telethon

Last synced: 16 Nov 2024

https://github.com/howie6879/php-google

Google search results crawler, get google search results that you need - php

crawler google-search php-google

Last synced: 19 Nov 2024

https://github.com/simionrobert/bitinsight

:earth_africa: Bittorrent Network Overview through Infohash Indexing, Metadata and IP visualisations of the DHT network

bep51 bittorrent crawler dht elasticsearch infohash javascript nodejs torrent

Last synced: 23 Dec 2024

https://github.com/fedebotu/iclr2023-openreviewdata

Crawl & Visualize ICLR 2023 Data from OpenReview

crawler dataset iclr iclr2023 openreview peer-review review scraper

Last synced: 06 Nov 2024

https://github.com/codelibs/fess-crawler

Web/FileSystem Crawler Library

crawler java

Last synced: 27 Dec 2024

https://github.com/zenrows/scaling-to-distributed-crawling

Repository for the Mastering Web Scraping in Python: Scaling to Distributed Crawling blogpost with the final code.

crawler crawling distributed python python3 scraping spider

Last synced: 16 Nov 2024

https://github.com/riptl/ytpriv

YT metadata exporter

big-data crawler csv datascience json video youtube

Last synced: 16 Nov 2024

https://github.com/italia/publiccode-crawler

publiccode.yml crawler for the Open Source software catalog of Developers Italia

crawler developers-italia hacktoberfest publiccode publiccodeyml

Last synced: 10 Nov 2024

https://github.com/alehkot/job-funnel-ts

Automated tool for scraping job postings into .xlsx files, inspired by Job Funnel.

crawler hacktoberfest jobs typescript

Last synced: 06 Dec 2024

https://github.com/alex-page/get-site-urls

🔗 Get all of the URLs from a website.

crawler sitemap-generator urls

Last synced: 27 Oct 2024

https://github.com/tychozzz/article_crawler

✨ Article Crawler is a package used to crawl articles with Markdown format from a specific webpage and store them locally in HTML / Markdown formats.

article crawler html markdown pypi python

Last synced: 12 Nov 2024

https://github.com/ERap320/CrowLeer

Powerful C++ web crawler based on libcurl

cli crawler crawling download

Last synced: 16 Nov 2024

https://github.com/novemberde/serverless-crawler-demo

Serverless Architecture Crawler demo

aws crawler demo handson serverless

Last synced: 10 Nov 2024

https://github.com/dachcom-digital/pimcore-lucene-search

Pimcore Website Indexer (powered by Zend Search Lucene)

crawler lucene lucenesearch pimcore

Last synced: 14 Nov 2024

https://github.com/nicolasmure/crawlerdetectbundle

A Symfony bundle for the Crawler-Detect library (detects bots/crawlers/spiders via the user agent)

bot bundle crawler php symfony

Last synced: 16 Nov 2024

https://github.com/jurooravec/crawlee-one

Professional scrapers that provide full control to the users. Crawlee One builds on top of Crawlee and Apify and extends them with features for robust and highly configurable web scrapers.

actor apify crawlee crawler framework scraper scraping web

Last synced: 13 Nov 2024

https://github.com/bartozzz/crawlerr

A simple and fully customizable web crawler/spider for Node.js with server-side DOM. Comes with elegant and hell-simple APIs.

crawler jsdom nodejs scraper spider web-crawler

Last synced: 08 Nov 2024

https://github.com/mattwang44/uspto-patft-web-crawler

Crawler for fetching information of US Patents and PDF bulk download

crawler patent patent-crawler pyqt5 python3 uspto

Last synced: 02 Oct 2024

https://github.com/weihanli/proxycrawler

Proxy crawler service that scrapes proxy IPs and saves them to Redis, topshelf+Quartz.Net+redis

crawler proxy proxy-ip redis

Last synced: 27 Nov 2024

https://github.com/owenliang/dht

A DHT crawler

bencode crawler dht

Last synced: 22 Nov 2024

https://github.com/kagami/tistore

:camera: Tistory photo grabber

crawler cross-platform electron tistory

Last synced: 13 Dec 2024

https://github.com/qibinlou/faceplusplus-stars-library-images-crawler

Crawler and image collection for the Face++ starlib annotated celebrity portrait set, for face recognition training

crawler faceplusplus image-recognition images traning

Last synced: 21 Nov 2024

https://github.com/alessandrodd/googleplay_api

Google Play Unofficial Python 3 API Library

android crawler googleplay googleplay-api playstore

Last synced: 27 Oct 2024

https://github.com/bitxx/pholcus

Fixes and improvements to the Go-based henrylee2cn/pholcus crawler framework, made to suit the author's own needs

crawler golang pholcus

Last synced: 21 Nov 2024

https://github.com/ysh329/douban-crawler

Scrapes information from Douban groups (groups, users, posts).

crawler douban douban-crawler

Last synced: 23 Oct 2024

https://github.com/o8e/soccer-scrape

:page_with_curl: Scrape football data from Bet365

bet365 betting crawler es6 football javascript puppeteer scraper soccer

Last synced: 13 Nov 2024

https://github.com/wwwwwydev/crawlist

A universal solution for web crawling lists

crawl crawler crawler-python python reptile

Last synced: 12 Nov 2024

https://github.com/xiongwilee/techweekly

A highly configurable tool for sending tech weekly newsletters by email

crawler nodejs techweekly

Last synced: 08 Nov 2024

https://github.com/feng19/spider_man

SpiderMan, a fast, high-level web crawling & scraping framework for Elixir, based on Broadway.

crawler data-mining elixir erlang framework spider

Last synced: 29 Oct 2024