Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2024-11-05 00:06:41 UTC
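To make the definition above concrete, here is a minimal sketch of the systematic browsing a crawler performs: a breadth-first traversal that fetches a page, extracts its links, and queues any it has not yet seen. It is an illustration only and does not describe how any project listed below works. `LinkParser`, `crawl`, `fetch`, and the `example.test` pages are hypothetical names, and the "site" is an in-memory dictionary, so the example makes no network requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch(url) is an assumed callable returning the page's HTML,
    or None when the page is unavailable.
    Returns the URLs in the order they were successfully visited.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory site standing in for real HTTP responses.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}

visited = crawl("http://example.test/", PAGES.get)
print(visited)
```

A real crawler would replace `PAGES.get` with an HTTP fetcher and would typically also honor robots.txt and rate limits, as several of the projects below advertise.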
https://github.com/abhisharma404/vault
Swiss Army knife for hackers
crawler fuzzing hacking hacking-tool information-gathering lfi networking offensive-security osint pentesting port-scanner python rfi scanner scrapy security sqlite ssl-inspection vault xss-vulnerability
Last synced: 03 Nov 2024
https://github.com/jaeksoft/opensearchserver
Open-source Enterprise Grade Search Engine Software
crawler custom-search enterprise indexing java lucene ocr opensearchserver search search-engine synonyms webcrawler webcrawling
Last synced: 29 Oct 2024
https://github.com/AlexMathew/scrapple
A framework for creating semi-automatic web content extractors
beautifulsoup crawler css-selector extractor lxml python scrapers scraping scrapy selector selector-expression tutorial web-scraper web-scraping xpath-expression
Last synced: 31 Oct 2024
https://github.com/chushuai/wscan
Wscan is a web security scanner dedicated to making web security accessible to everyone.
cel-go chromedp crawler headless martian passive-vulnerability-scanner poc sql-injection subdomains testwaf vulnerability-scanner waf webscan wscan xss
Last synced: 04 Aug 2024
https://github.com/dirtyfilthy/freshonions-torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
crawler darknet hidden-services onion scraper spider tor
Last synced: 01 Aug 2024
https://github.com/ChenZixinn/spider_reverse
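Crawler reverse-engineering case studies. Completed: TLS fingerprinting | Ruishu (瑞数) | ZKH (震坤行) | NetEase Yidun (网易易盾) | WeChat mini-program decompilation and reverse engineering (百达星系) | Tonghuashun (同花顺) | RPC decryption | Jiasule (加速乐) | GeeTest slider CAPTCHA | 巨量算数 | Boss Zhipin (Boss直聘) | Qichacha (企查查) | China Minmetals (中国五矿) | QQ Music | Industrial Policy Big Data Platform (产业政策大数据平台) | Qizhidao (企知道) | Xueqiu (acw_sc__v2) | 1688 | Qimai Data (七麦数据) | whggzy | 企名科技 | mohurd | Endata (艺恩数据) | OKLink (欧科云链)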
crawler python requests spider
Last synced: 31 Oct 2024
https://github.com/yhy0/Jie
Jie stands out as a comprehensive security assessment and exploitation tool meticulously crafted for web applications. Its robust suite of features encompasses vulnerability scanning, information gathering, and exploitation, elevating it to an indispensable toolkit for both security professionals and penetration testers. (expectations)
apollo-exp crawler jie scan scanner security-copilot shiro-exp vul vulnerability vulnerability-detection vulnerability-exploitation vulnerability-scanners
Last synced: 10 Sep 2024
https://github.com/shaohua0116/ICLR2020-OpenReviewData
Script that crawls metadata from the ICLR OpenReview webpage. Tutorials on installing and using Selenium and ChromeDriver on Ubuntu.
conference crawler data-analysis iclr iclr2020 machine-learning visualization
Last synced: 07 Aug 2024
https://github.com/AndyTheFactory/newspaper4k
📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
articles articles-data crawler datasets-preparation news newspaper3k python requests scraper scraping
Last synced: 26 Oct 2024
https://github.com/tasos-py/Search-Engines-Scraper
Search Google, Bing, Yahoo, and other search engines with Python
bing crawler google python scraper search-engine yahoo
Last synced: 04 Aug 2024
https://github.com/gadfly0x/signature_algorithm
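Request-signing and encryption algorithms for various apps, mini-programs, and websites. Currently covers: Ziroom (自如), Xiaohongshu (小红书), Danke Apartment (蛋壳公寓), Luckin Coffee (瑞幸咖啡), and Bangkok Airways (bangkokair).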
crawler reverse-engineering spider
Last synced: 02 Aug 2024
https://github.com/roniemartinez/dude
dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
async beautifulsoup4 crawler css framework lxml parsel playwright python scraper scraping selenium sync web-scraping webscraping xpath
Last synced: 11 Oct 2024
https://github.com/platonai/PulsarRPA
Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.
crawler data-mining data-science rpa scraper scraping web-automation web-crawler web-mining web-scraping web-sql
Last synced: 05 Nov 2024
https://github.com/lgraubner/sitemap-generator
Easily create XML sitemaps for your website.
crawler google seo sitemap sitemap-generator xml-sitemap
Last synced: 08 Aug 2024
https://github.com/cyubuchen/free_proxy_website
A collection of websites for obtaining free SOCKS/HTTPS/HTTP proxies
crawler free-proxy-list ip proxy proxy-checker spider
Last synced: 03 Aug 2024
https://github.com/shaohua0116/ICLR2019-OpenReviewData
Script that crawls metadata from the ICLR OpenReview webpage. Tutorials on installing and using Selenium and ChromeDriver on Ubuntu.
crawler crawling-python openreview tutorial
Last synced: 07 Aug 2024
https://github.com/brendonboshell/supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
crawler distributed-crawler robot sitemap web-crawler
Last synced: 25 Oct 2024
https://github.com/microsoft/ghcrawler
Crawl GitHub APIs and store the discovered orgs, repos, commits, ...
crawler data github github-api github-webhooks ospo
Last synced: 25 Sep 2024
https://github.com/mhmdiaa/second-order
Second-order subdomain takeover scanner
crawler crawling infosec mapping penetration-testing penetration-testing-tools pentesting recon reconnaissance security security-tools web-application-security wordlist wordlist-generator
Last synced: 03 Nov 2024
https://github.com/scrapy-plugins/scrapy-crawlera
Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
crawler crawler-detection plugin proxy scraping scrapy
Last synced: 05 Sep 2024
https://github.com/scrapy-plugins/scrapy-zyte-smartproxy
Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
crawler crawler-detection plugin proxy scraping scrapy
Last synced: 26 Oct 2024
https://github.com/salimk/Rcrawler
An R web crawler and scraper
crawler crawlers r rpackage scraper webcrawler webscraper webscraping webscrapping
Last synced: 25 Oct 2024
https://github.com/Malwarize/webpalm
🕸️ Crawl in the web network
crawler crawling data data-science datamining go golang hack mining osint redteam spider tool
Last synced: 01 Aug 2024
https://github.com/crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping
Last synced: 25 Oct 2024
https://github.com/xiyuan-fengyu/ppspider
A web spider framework built on Puppeteer. Provides flexible task-queue management and task scheduling via decorators, convenient data storage (nedb/MongoDB), and support for data visualization and user interaction.
angular cheerio crawler headless mongodb nedb node node-spider nodejs nodejs-spider proxy puppeteer spider task-queue task-scheduling typescript
Last synced: 10 Oct 2024
https://github.com/rivermont/spidy
The simple, easy to use command line web crawler.
crawler crawling python python3 web-crawler web-spider
Last synced: 29 Oct 2024
https://github.com/dmi3kno/polite
Be nice on the web
crawler memoise r r-package rate-limiter robotstxt rstats rvest scraper webscraping
Last synced: 25 Oct 2024
https://github.com/Josue87/EmailFinder
Search emails from a domain through search engines
Last synced: 02 Aug 2024
https://github.com/infinilabs/crawler
🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)
crawler crawling elasticsearch lightweight scraping spider web-crawler web-scraping web-spider
Last synced: 04 Aug 2024
https://github.com/dennis-tra/nebula
🌌 A network agnostic DHT crawler, monitor, and measurement tool that exposes timely information about DHT networks.
cid crawler filecoin golang hacktoberfest ipfs libp2p
Last synced: 01 Nov 2024
https://github.com/TikHubIO/TikHub-API-Python-SDK
High-performance asynchronous API for Douyin (抖音), TikTok, Xiaohongshu (小红书), Kuaishou (快手), Weibo (微博), Instagram, YouTube, and Twitter (X), plus a CAPTCHA solver and temporary mail.
api captcha-solver crawler data-api douyin douyin-tiktok-api instagram kuaishou netease-cloud-music private-api scrapy tiktok twitter weibo xiaohongshu xiguashipin
Last synced: 29 Oct 2024
https://github.com/lgraubner/sitemap-generator-cli
Creates an XML-Sitemap by crawling a given site.
cli crawler google seo sitemap xml-sitemap
Last synced: 02 Aug 2024
https://github.com/krypton-byte/tiktok-downloader
TikTok Downloader/Scraper using requests & bs4
asynchronous asyncio beautifulsoup bs4 crawler downloader flask krypton-byte lightweight nowm python python3 requests tiktok watermark web without
Last synced: 29 Oct 2024
https://github.com/mustafadalga/instagram-bot
An Instagram bot developed using the Selenium Framework
automation automation-selenium bot bulk-comments bulk-unfollow crawler crawling download-stories instagram instagram-api instagram-bot instagram-downloader instagram-without-api mass-liking python python3 selenium selenium-framework selenium-python selenium-webdriver
Last synced: 28 Sep 2024
https://github.com/GraySilver/wencai
A wencai crawler: a Pythonic toolkit for the strategy backtesting API of iWencai (i问财).
crawler finance pandas quant quantitative-finance tushare wencai
Last synced: 30 Oct 2024
https://github.com/devanshbatham/Gorecon
Gorecon is an all-in-one reconnaissance tool, a.k.a. a Swiss Army knife for reconnaissance: a tool that every pentester or bug hunter might want to add to their arsenal.
admin-panel-finder backups-finder cmsdetecter configurationfiles crawler directory-bruteforce dns dnsrecon email-hunter geo-ip nameserver recon reconaissance reverse-dns scanner subdomain-enumeration subdomain-scanner subnet-lookup whois-lookup wordpress-scanner
Last synced: 04 Nov 2024
https://github.com/eight04/comiccrawler
An image crawler written in Python.
cli crawler gui image-crawler python tkinter
Last synced: 30 Oct 2024
https://github.com/Jasonnor/th-music-video-generator
Touhou Project random music video generator/player, crawling image and video from websites to generate MV.
crawler javascript music-video touhou web
Last synced: 02 Aug 2024
https://github.com/eight04/ComicCrawler
An image crawler written in Python.
cli crawler gui image-crawler python tkinter
Last synced: 15 Aug 2024
https://github.com/marshalx/telegram-crawler
🕷 Automatically detect changes made to the official Telegram sites, clients and servers.
crawler crawling crawling-python parser telegram telegram-org telegram-updates
Last synced: 30 Oct 2024
https://github.com/algolia/algoliasearch-netlify
Official Algolia Plugin for Netlify. Index your website to Algolia when deploying your project to Netlify with the Algolia Crawler
algolia algolia-crawler algoliasearch crawler jamstack netlify netlify-plugin search
Last synced: 12 Oct 2024
https://github.com/glaucocustodio/tanakai
Tanakai is a modern web scraping framework written in Ruby. A fork of Kimurai.
chrome-headless crawler kimurai scraper scrapy webscraping
Last synced: 31 Oct 2024
https://github.com/antchfx/antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
crawler crawling framework golang scraping web-crawler web-spider
Last synced: 26 Oct 2024
https://github.com/zrashwani/arachnid
Crawl all unique internal links found on a given website and extract SEO-related information. Supports JavaScript-based sites.
Last synced: 29 Oct 2024
https://github.com/xyntax/filesensor
Crawler-based tool for dynamically detecting sensitive files
crawler fuzzing pentesting scrapy
Last synced: 31 Oct 2024
https://github.com/zntfdr/Selenops
A Swift Web Crawler 🕷
command-line-tool crawler scripting swift web
Last synced: 06 Aug 2024
https://github.com/dwisiswant0/galer
A fast tool to fetch URLs from HTML attributes by crawl-in.
crawler devtool extractor galer go golang spider url-extractor url-parser waybackurls
Last synced: 28 Oct 2024
https://github.com/zntfdr/selenops
A Swift Web Crawler 🕷
command-line-tool crawler scripting swift web
Last synced: 31 Oct 2024
https://github.com/commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
apache-storm common-crawl commoncrawl crawler news storm-crawler warc web-crawler
Last synced: 03 Aug 2024
https://github.com/s0rg/crawley
The unix-way web crawler
cli crawler go golang golang-application pentest pentest-tool pentesting unix-way web-crawler web-scraping web-spider
Last synced: 02 Nov 2024
https://github.com/kong36088/ZhihuSpider
A multi-threaded Zhihu user crawler, based on Python 3
crawler multi-threading python python3 spider zhihu
Last synced: 07 Aug 2024
https://github.com/MarshalX/telegram-crawler
🕷 Automatically detect changes made to the official Telegram sites, clients and servers.
crawler crawling crawling-python parser telegram telegram-org telegram-updates
Last synced: 04 Aug 2024
https://github.com/ScottSloan/Bili23-Downloader
Download Bilibili videos, anime series, movies, documentaries, and other resources
bilibili crawler linux macos python videodownloader windows wxpython
Last synced: 27 Oct 2024
https://github.com/lgh06/web-page-monitor
Website page change monitor with update alerts.
change-alert change-detection change-monitor crawler monitor website-change-monitor website-monitoring
Last synced: 04 Aug 2024
https://github.com/dwisiswant0/gf-secrets
Secret and/or credential patterns used for gf.
alienvault-otx bugbounty crawler gau gf gitleaks infosec open-threat-exchange secrets-detection trufflehog trufflehog3 wayback wayback-machine waybackurl
Last synced: 28 Oct 2024
https://github.com/R4yGM/dorkscout
DorkScout - Golang tool to automate Google dork scans against the entire internet or specific targets
bug-bounty crawler ghdb golang google-dorks osint scraper security
Last synced: 04 Aug 2024
https://github.com/kirralabs/indonesian-NLP-resources
Data resources for Indonesian-language NLP
corpus corpus-linguistics crawler dataset dependency-parser indonesian indonesian-language named-entity-recognition nlp parallel-corpus pos-tagging sentiment-analysis
Last synced: 01 Aug 2024
https://github.com/spatie/robots-txt
Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
Last synced: 03 Nov 2024
https://github.com/crypto-crawler/crypto-crawler-rs
A rock-solid cryptocurrency crawler library.
crawler cryptocurrency websocket
Last synced: 28 Oct 2024
https://github.com/mgleon08/instagram-crawler
Crawl instagram photos, posts and videos for download.
crawler gem instagram instagram-crawler instagram-scraper ruby rubygems scraper
Last synced: 14 Aug 2024
https://github.com/macacajs/NoSmoke
A cross-platform UI crawler that scans view trees, then generates and executes UI test cases.
android crawler ios macaca smoke-tests test-automation webdriver
Last synced: 01 Aug 2024
https://github.com/webysther/packagist-mirror
📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer
composer composer-packages crawler mirror packagist packagist-mirror php
Last synced: 03 Nov 2024
https://github.com/ovnrain/javbus-api
A self-hosted JavBus API service
adults api api-server crawler docker javbus magnet nodejs spider typescript vercel vercel-deployment
Last synced: 04 Aug 2024
https://github.com/Webysther/packagist-mirror
📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer
composer composer-packages crawler mirror packagist packagist-mirror php
Last synced: 02 Nov 2024
https://github.com/0xsha/ChainWalker
Rapid Smart Contract Crawler
blockchain crawler dataset evm-bytecode geth security smart-contracts web3
Last synced: 04 Aug 2024
https://github.com/elliotxx/zhihu-crawler-people
A simple distributed crawler for Zhihu, plus data analysis
crawler python python-crawler spider web-crawler web-spider
Last synced: 31 Oct 2024
https://github.com/vormkracht10/laravel-seo-scanner
Scan your Laravel application routes for SEO improvement suggestions.
crawler laravel laravel-framework laravel-seo laravel-seo-scanner scanner seo seo-optimization seo-tools seotools
Last synced: 11 Oct 2024
https://github.com/Josue87/MetaFinder
Search for documents in a domain through search engines (Google, Bing, and Baidu). The objective is to extract metadata.
Last synced: 04 Aug 2024
https://github.com/cocrawler/cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
aiohttp aiohttp-client async-python concurrency crawler pluggable-modules python3 screenshot warc
Last synced: 29 Oct 2024
https://github.com/viasite/site-audit-seo
Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx
audit cli crawl-site crawler lighthouse puppeteer scraper seo seo-audit seo-site-audit site-audit xlsx
Last synced: 01 Aug 2024
https://github.com/mehmetozkaya/dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible to fit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping
Last synced: 27 Oct 2024
https://github.com/rebrowser/rebrowser-patches
Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on demand.
automation bot bot-detection chrome chromedriver cloudflare crawler crawling datadome headless headless-chrome playwright puppeteer puppeteer-extra rebrowser scraping selenium stealth web-scraping webdriver
Last synced: 10 Oct 2024
https://github.com/Jiramew/spoon
🥄 A package for building specific Proxy Pool for different Sites.
crawler distributed ip proxies proxy proxy-provider proxypool python redis spider spoon
Last synced: 01 Aug 2024
https://github.com/amerkurev/scrapper
Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
crawler crawling docker headless readability scraper web-parsers web-parsing web-scraping
Last synced: 01 Nov 2024
https://github.com/n0tan3rd/squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 27 Oct 2024
https://github.com/guilhermecgs/ir
Project for automatically calculating income tax (Imposto de Renda) on Bovespa trades. Tags: canal eletronico do investidor, CEI, selenium, bovespa, IRPF, IR, imposto de renda, finance, yahoo finance, acao, fii, etf, python, crawler, webscraping, calculadora ir
acoes b3 bovespa calculadora-ir canal-eletronico-investidor cei crawler etf fii finance imposto-de-renda irpf webscraping
Last synced: 02 Aug 2024
https://github.com/N0taN3rd/Squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 01 Aug 2024
https://github.com/cytopia/urlbuster
Powerful mutable web directory fuzzer to bruteforce existing and/or hidden files or directories.
brute-force bruteforce bruteforce-attacks crawler cytopia-sec url-bruteforcer
Last synced: 31 Oct 2024
https://github.com/nfx/slrp
rotating open proxy multiplexer
crawler golang proxy proxy-checker proxy-list proxy-pool proxy-server
Last synced: 04 Aug 2024
https://github.com/oscar-project/ungoliant
🕷️ The pipeline for the OSCAR corpus
common-crawl commoncrawl corpus-linguistics crawler fasttext language-classification nlp oscar
Last synced: 04 Nov 2024