Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

https://github.com/chushuai/wscan

Wscan is a web security scanner dedicated to making web security accessible to everyone.

cel-go chromedp crawler headless martian passive-vulnerability-scanner poc sql-injection subdomains testwaf vulnerability-scanner waf webscan wscan xss

Last synced: 04 Aug 2024

https://github.com/dirtyfilthy/freshonions-torscraper

Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion

crawler darknet hidden-services onion scraper spider tor

Last synced: 01 Aug 2024

https://github.com/stanzhai/Html2Article

Extracts the main article text from HTML web pages

article content crawler html spider topic

Last synced: 04 Aug 2024

https://github.com/ChenZixinn/spider_reverse

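Crawler reverse-engineering case studies. Completed so far: TLS fingerprinting | Ruishu (瑞数) | ZKH (震坤行) | NetEase Yidun (网易易盾) | WeChat mini-program decompilation and reverse engineering (百达星系) | Tonghuashun (同花顺) | RPC decryption | Jiasule (加速乐) | GeeTest slider CAPTCHA (极验) | 巨量算数 | Boss Zhipin (Boss直聘) | Qichacha (企查查) | China Minmetals (中国五矿) | QQ Music | Industrial Policy Big Data Platform (产业政策大数据平台) | Qizhidao (企知道) | Xueqiu (雪球网, acw_sc__v2) | 1688 | Qimai Data (七麦数据) | whggzy | 企名科技 | mohurd | 艺恩数据 | OKLink (欧科云链)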

crawler python requests spider

Last synced: 31 Oct 2024

https://github.com/yhy0/Jie

Jie is a comprehensive security assessment and exploitation tool built for web applications. Its features cover vulnerability scanning, information gathering, and exploitation, making it a useful toolkit for both security professionals and penetration testers. (expectations)

apollo-exp crawler jie scan scanner security-copilot shiro-exp vul vulnerability vulnerability-detection vulnerability-exploitation vulnerability-scanners

Last synced: 10 Sep 2024

https://github.com/shaohua0116/ICLR2020-OpenReviewData

A script that crawls metadata from the ICLR OpenReview webpage, with tutorials on installing and using Selenium and ChromeDriver on Ubuntu.

conference crawler data-analysis iclr iclr2020 machine-learning visualization

Last synced: 07 Aug 2024

https://github.com/AndyTheFactory/newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.

articles articles-data crawler datasets-preparation news newspaper3k python requests scraper scraping

Last synced: 26 Oct 2024

https://github.com/tasos-py/Search-Engines-Scraper

Search Google, Bing, Yahoo, and other search engines with Python

bing crawler google python scraper search-engine yahoo

Last synced: 04 Aug 2024

https://github.com/lixi5338619/lxbook

Code repository for the book 《爬虫逆向进阶实战》 (Advanced Crawler Reverse Engineering in Practice)

android-resever crawler frida java javascript python smali spiders unidbg xposed

Last synced: 05 Nov 2024

https://github.com/gadfly0x/signature_algorithm

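Request-signing and encryption algorithms for various apps, mini programs, and websites. Currently covers: Ziroom (自如), Xiaohongshu (小红书), Danke Apartment (蛋壳公寓), Luckin Coffee (瑞幸咖啡), and Bangkok Airways (bangkokair)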

crawler reverse-engineering spider

Last synced: 02 Aug 2024

https://github.com/roniemartinez/dude

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

async beautifulsoup4 crawler css framework lxml parsel playwright python scraper scraping selenium sync web-scraping webscraping xpath

Last synced: 11 Oct 2024

https://github.com/heqin-zhu/music-recover

🎵 Converts cache files to MP3 files

crawler mp3 python regex

Last synced: 06 Aug 2024

https://github.com/platonai/PulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.

crawler data-mining data-science rpa scraper scraping web-automation web-crawler web-mining web-scraping web-sql

Last synced: 05 Nov 2024

https://github.com/lgraubner/sitemap-generator

Easily create XML sitemaps for your website.

crawler google seo sitemap sitemap-generator xml-sitemap

Last synced: 08 Aug 2024

https://github.com/cyubuchen/free_proxy_website

A collection of websites offering free SOCKS/HTTPS/HTTP proxies

crawler free-proxy-list ip proxy proxy-checker spider

Last synced: 03 Aug 2024

https://github.com/shaohua0116/ICLR2019-OpenReviewData

A script that crawls metadata from the ICLR OpenReview webpage, with tutorials on installing and using Selenium and ChromeDriver on Ubuntu.

crawler crawling-python openreview tutorial

Last synced: 07 Aug 2024

https://github.com/brendonboshell/supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

crawler distributed-crawler robot sitemap web-crawler

Last synced: 25 Oct 2024

https://github.com/microsoft/ghcrawler

Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

crawler data github github-api github-webhooks ospo

Last synced: 25 Sep 2024

https://github.com/elvisyjlin/media-scraper

Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok

crawler instagram pixiv reddit scraper tiktok tumblr twitter

Last synced: 04 Nov 2024

https://github.com/scrapy-plugins/scrapy-crawlera

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy

crawler crawler-detection plugin proxy scraping scrapy

Last synced: 05 Sep 2024

https://github.com/scrapy-plugins/scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy

crawler crawler-detection plugin proxy scraping scrapy

Last synced: 26 Oct 2024

https://github.com/xiyuan-fengyu/ppspider

A web spider framework built on puppeteer. It provides flexible task-queue management and task scheduling via decorators, convenient data storage (nedb / mongodb), and support for data visualization and user interaction.

angular cheerio crawler headless mongodb nedb node node-spider nodejs nodejs-spider proxy puppeteer spider task-queue task-scheduling typescript

Last synced: 10 Oct 2024

https://github.com/rivermont/spidy

The simple, easy to use command line web crawler.

crawler crawling python python3 web-crawler web-spider

Last synced: 29 Oct 2024

https://github.com/jackluson/chinese-fund-crawler

Crawling, aggregation, and analysis of data on Chinese off-exchange (OTC) funds

crawler fund morningstar

Last synced: 02 Aug 2024

https://github.com/snakem982/Pandora-Box

A Simple Mihomo GUI.

crawler gui linux mac mihomo windows

Last synced: 28 Oct 2024

https://github.com/Josue87/EmailFinder

Search emails from a domain through search engines

crawler osint

Last synced: 02 Aug 2024

https://github.com/infinilabs/crawler

🕷️ An easy-to-use spider written in Golang (previously named GOPA).

crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider

Last synced: 04 Aug 2024

https://github.com/dennis-tra/nebula

🌌 A network agnostic DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.

cid crawler filecoin golang hacktoberfest ipfs libp2p

Last synced: 01 Nov 2024

https://github.com/TikHubIO/TikHub-API-Python-SDK

High-performance asynchronous API for Douyin (抖音), TikTok, Xiaohongshu (小红书), Kuaishou (快手), Weibo (微博), Instagram, YouTube, and Twitter (X), plus a CAPTCHA solver and temporary mail.

api captcha-solver crawler data-api douyin douyin-tiktok-api instagram kuaishou netease-cloud-music private-api scrapy tiktok twitter weibo xiaohongshu xiguashipin

Last synced: 29 Oct 2024

https://github.com/jaybizzle/laravel-crawler-detect

A Laravel wrapper for CrawlerDetect - the web crawler detection library

bot crawler detect laravel php spider

Last synced: 29 Oct 2024

https://github.com/lgraubner/sitemap-generator-cli

Creates an XML-Sitemap by crawling a given site.

cli crawler google seo sitemap xml-sitemap

Last synced: 02 Aug 2024

https://github.com/snakem982/pandora-box

A Simple Mihomo GUI.

crawler gui linux mac mihomo windows

Last synced: 09 Oct 2024

https://github.com/GraySilver/wencai

A wencai crawler: a Pythonic toolkit for the strategy backtesting interface of i问财 (iWencai).

crawler finance pandas quant quantitative-finance tushare wencai

Last synced: 30 Oct 2024

https://github.com/mikemeliz/torcrawl.py

Crawl and extract (regular or onion) webpages through the Tor network

crawler extractor onion osint python tor

Last synced: 03 Aug 2024

https://github.com/devanshbatham/Gorecon

Gorecon is an all-in-one reconnaissance tool, a.k.a. a Swiss Army knife for reconnaissance: a tool that every pentester or bug hunter might want to add to their arsenal

admin-panel-finder backups-finder cmsdetecter configurationfiles crawler directory-bruteforce dns dnsrecon email-hunter geo-ip nameserver recon reconaissance reverse-dns scanner subdomain-enumeration subdomain-scanner subnet-lookup whois-lookup wordpress-scanner

Last synced: 04 Nov 2024

https://github.com/eight04/comiccrawler

An image crawler written in Python.

cli crawler gui image-crawler python tkinter

Last synced: 30 Oct 2024

https://github.com/Jasonnor/th-music-video-generator

Touhou Project random music video generator/player that crawls images and videos from websites to generate music videos.

crawler javascript music-video touhou web

Last synced: 02 Aug 2024

https://github.com/eight04/ComicCrawler

An image crawler written in Python.

cli crawler gui image-crawler python tkinter

Last synced: 15 Aug 2024

https://github.com/marshalx/telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Last synced: 30 Oct 2024

https://github.com/algolia/algoliasearch-netlify

Official Algolia Plugin for Netlify. Index your website to Algolia when deploying your project to Netlify with the Algolia Crawler

algolia algolia-crawler algoliasearch crawler jamstack netlify netlify-plugin search

Last synced: 12 Oct 2024

https://github.com/glaucocustodio/tanakai

Tanakai is a modern web scraping framework written in Ruby. A fork of Kimurai.

chrome-headless crawler kimurai scraper scrapy webscraping

Last synced: 31 Oct 2024

https://github.com/antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

crawler crawling framework golang scraping web-crawler web-spider

Last synced: 26 Oct 2024

https://github.com/hezhizheng/go-movies

A Golang spider / crawler for movies

colly crawler docker fasthttp go gocolly golang movies redis spider

Last synced: 30 Oct 2024

https://github.com/chenjiandongx/Github-spider

A crawler for analyzing GitHub repositories and users

crawler github scrapy

Last synced: 02 Aug 2024

https://github.com/zrashwani/arachnid

Crawl all unique internal links found on a given website and extract SEO-related information. Supports JavaScript-based sites

crawler php scraping seo

Last synced: 29 Oct 2024

https://github.com/xyntax/filesensor

A crawler-based tool for dynamically detecting sensitive files

crawler fuzzing pentesting scrapy

Last synced: 31 Oct 2024

https://github.com/zntfdr/Selenops

A Swift Web Crawler 🕷

command-line-tool crawler scripting swift web

Last synced: 06 Aug 2024

https://github.com/dwisiswant0/galer

A fast tool to fetch URLs from HTML attributes by crawl-in.

crawler devtool extractor galer go golang spider url-extractor url-parser waybackurls

Last synced: 28 Oct 2024

https://github.com/zntfdr/selenops

A Swift Web Crawler 🕷

command-line-tool crawler scripting swift web

Last synced: 31 Oct 2024

https://github.com/commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler

Last synced: 03 Aug 2024

https://github.com/myvyang/chromium_for_spider

A dynamic crawler for web vulnerability scanners

chromium crawler puppeteer security spider

Last synced: 04 Aug 2024

https://github.com/cwjokaka/ok_ip_proxy_pool

🍿 Crawler proxy IP pool in Python 🍟 A reasonably OK IP proxy pool

aiohttp async beautifulsoup4 crawler flask http ip pool proxy proxypool py python python3 spider sqlite

Last synced: 12 Oct 2024

https://github.com/kong36088/ZhihuSpider

A multi-threaded Zhihu user crawler based on Python 3

crawler multi-threading python python3 spider zhihu

Last synced: 07 Aug 2024

https://github.com/MarshalX/telegram-crawler

🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Last synced: 04 Aug 2024

https://github.com/ScottSloan/Bili23-Downloader

Download Bilibili videos, anime series, movies, documentaries, and other resources

bilibili crawler linux macos python videodownloader windows wxpython

Last synced: 27 Oct 2024

https://github.com/lgh06/web-page-monitor

Website page change monitor with alerts when pages are updated or changed.

change-alert change-detection change-monitor crawler monitor website-change-monitor website-monitoring

Last synced: 04 Aug 2024

https://github.com/R4yGM/dorkscout

DorkScout - Golang tool to automate Google dork scans against the entire internet or specific targets

bug-bounty crawler ghdb golang google-dorks osint scraper security

Last synced: 04 Aug 2024

https://github.com/redco/goose-parser

Universal scraping tool, which allows you to extract data using multiple environments

browser crawler docker goose jsdom nodejs parser parsing phantomjs scraper scraping

Last synced: 05 Nov 2024

https://github.com/spatie/robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers

crawler php robots-txt

Last synced: 03 Nov 2024

https://github.com/icy/google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

bash cookie crawler curl google ownership wget

Last synced: 14 Oct 2024

https://github.com/MikeMeliz/TorCrawl.py

Crawl and extract (regular or onion) webpages through the Tor network

crawler extractor onion osint python tor

Last synced: 01 Aug 2024

https://github.com/crypto-crawler/crypto-crawler-rs

A rock-solid cryptocurrency crawler library.

crawler cryptocurrency websocket

Last synced: 28 Oct 2024

https://github.com/mgleon08/instagram-crawler

Crawl Instagram photos, posts and videos for download.

crawler gem instagram instagram-crawler instagram-scraper ruby rubygems scraper

Last synced: 14 Aug 2024

https://github.com/macacajs/NoSmoke

A cross-platform UI crawler that scans view trees, then generates and executes UI test cases.

android crawler ios macaca smoke-tests test-automation webdriver

Last synced: 01 Aug 2024

https://github.com/webysther/packagist-mirror

📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer

composer composer-packages crawler mirror packagist packagist-mirror php

Last synced: 03 Nov 2024

https://github.com/Webysther/packagist-mirror

📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer

composer composer-packages crawler mirror packagist packagist-mirror php

Last synced: 02 Nov 2024

https://github.com/elliotxx/zhihu-crawler-people

A simple distributed crawler for Zhihu, plus data analysis

crawler python python-crawler spider web-crawler web-spider

Last synced: 31 Oct 2024

https://github.com/vormkracht10/laravel-seo-scanner

Scan your Laravel application routes for SEO improvement suggestions.

crawler laravel laravel-framework laravel-seo laravel-seo-scanner scanner seo seo-optimization seo-tools seotools

Last synced: 11 Oct 2024

https://github.com/codesofun/web-bee

🐝 Web vertical crawler framework for fun

crawler framework java java-8 webbee

Last synced: 12 Oct 2024

https://github.com/Josue87/MetaFinder

Search for documents in a domain through search engines (Google, Bing and Baidu) in order to extract their metadata.

crawler metadata osint

Last synced: 04 Aug 2024

https://github.com/AnyISalIn/zhihu_fun

A Selenium-based Zhihu keyword crawler

crawler python python3 selenium zhihu

Last synced: 30 Oct 2024

https://github.com/cocrawler/cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

aiohttp aiohttp-client async-python concurrency crawler pluggable-modules python3 screenshot warc

Last synced: 29 Oct 2024

https://github.com/viasite/site-audit-seo

Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx

audit cli crawl-site crawler lighthouse puppeteer scraper seo seo-audit seo-site-audit site-audit xlsx

Last synced: 01 Aug 2024

https://github.com/mehmetozkaya/dotnetcrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible to suit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping

Last synced: 27 Oct 2024

https://github.com/rebrowser/rebrowser-patches

Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on demand.

automation bot bot-detection chrome chromedriver cloudflare crawler crawling datadome headless headless-chrome playwright puppeteer puppeteer-extra rebrowser scraping selenium stealth web-scraping webdriver

Last synced: 10 Oct 2024

https://github.com/Jiramew/spoon

🥄 A package for building specific Proxy Pool for different Sites.

crawler distributed ip proxies proxy proxy-provider proxypool python redis spider spoon

Last synced: 01 Aug 2024

https://github.com/amerkurev/scrapper

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

crawler crawling docker headless readability scraper web-parsers web-parsing web-scraping

Last synced: 01 Nov 2024

https://github.com/n0tan3rd/squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 27 Oct 2024

https://github.com/guilhermecgs/ir

A project that automatically calculates income tax (Imposto de Renda) on Bovespa trades. Tags: canal eletronico do investidor, CEI, selenium, bovespa, IRPF, IR, imposto de renda, finance, yahoo finance, acao, fii, etf, python, crawler, webscraping, calculadora ir

acoes b3 bovespa calculadora-ir canal-eletronico-investidor cei crawler etf fii finance imposto-de-renda irpf webscraping

Last synced: 02 Aug 2024

https://github.com/N0taN3rd/Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 01 Aug 2024

https://github.com/fanhuaandluomu/pkulaw_spider

Crawls the PKULaw (北大法宝) website http://www.pkulaw.cn/Case/

ai crawler law python-2 spider

Last synced: 12 Oct 2024

https://github.com/chenjiandongx/soksaccounts

🔥 Shadowsocks account crawler

crawler shadowsocks

Last synced: 03 Aug 2024

https://github.com/cytopia/urlbuster

Powerful mutable web directory fuzzer to bruteforce existing and/or hidden files or directories.

brute-force bruteforce bruteforce-attacks crawler cytopia-sec url-bruteforcer

Last synced: 31 Oct 2024

https://github.com/nfx/slrp

rotating open proxy multiplexer

crawler golang proxy proxy-checker proxy-list proxy-pool proxy-server

Last synced: 04 Aug 2024