Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Crawler
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
- GitHub: https://github.com/topics/crawler
- Wikipedia: https://en.wikipedia.org/wiki/Web_crawler
- Last updated: 2024-12-25 00:05:56 UTC
https://github.com/kirralabs/indonesian-NLP-resources
Data resources for Indonesian-language NLP
corpus corpus-linguistics crawler dataset dependency-parser indonesian indonesian-language named-entity-recognition nlp parallel-corpus pos-tagging sentiment-analysis
Last synced: 08 Nov 2024
https://github.com/ovnrain/javbus-api
A self-hosted JavBus API service
adults api api-server crawler docker javbus magnet nodejs spider typescript vercel vercel-deployment
Last synced: 20 Dec 2024
https://github.com/zhaotianff/csharpcrawler
C# crawler sample programs for anyone who wants to learn the basics of web crawling. More crawler-related material will be added over time.
Last synced: 18 Dec 2024
https://github.com/spatie/robots-txt
Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
Last synced: 22 Dec 2024
https://github.com/tufayellus/linkedin-scraper
A LinkedIn scraper that scrapes up to 1k LinkedIn profiles (due to LinkedIn's limit) from company profile links and saves their e-mail addresses if available. (Actively maintained; if anything doesn't work, open an issue in the repo.)
crawler digital-marketing email-marketing email-scraper leads linkedin linkedin-bot linkedin-gui linkedin-scraper linkedin-scraper-gui scrape-email scrape-emails scraper scraper-engine
Last synced: 21 Dec 2024
https://github.com/6677-ai/tap4-ai-crawler
The crawler open-sourced by tap4.ai
aitoolkit aitools crawler crawler-engine crawler-python
Last synced: 23 Dec 2024
https://github.com/linkedtales/scrapedin-linkedin-crawler
Crawler for LinkedIn full profiles 2019
crawler linkedin linkedin-crawler
Last synced: 06 Nov 2024
https://github.com/crypto-crawler/crypto-crawler-rs
A rock-solid cryptocurrency crawler library.
crawler cryptocurrency websocket
Last synced: 28 Oct 2024
https://github.com/vormkracht10/laravel-seo-scanner
Scan your Laravel application routes for SEO improvement suggestions.
crawler laravel laravel-framework laravel-seo laravel-seo-scanner scanner seo seo-optimization seo-tools seotools
Last synced: 21 Dec 2024
https://github.com/jsrei/crawler-js-hook-framework-public
A toolkit of hooks for JavaScript reverse engineering; some of the tools are open-sourced here.
Last synced: 16 Nov 2024
https://github.com/crawlab-team/crawlab-lite
Lite version of Crawlab, a lightweight crawler management platform.
crawlab crawler crawler-management crawling-tasks platform scrapy scrapy-ui scrapyd scrapyd-ui spider web-crawler
Last synced: 17 Nov 2024
https://github.com/macacajs/NoSmoke
A cross-platform UI crawler that scans view trees, then generates and executes UI test cases.
android crawler ios macaca smoke-tests test-automation webdriver
Last synced: 08 Nov 2024
https://github.com/mgleon08/instagram-crawler
Crawl instagram photos, posts and videos for download.
crawler gem instagram instagram-crawler instagram-scraper ruby rubygems scraper
Last synced: 05 Dec 2024
https://github.com/webysther/packagist-mirror
📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer
composer composer-packages crawler mirror packagist packagist-mirror php
Last synced: 03 Nov 2024
https://github.com/Josue87/MetaFinder
Search for documents in a domain through Search Engines (Google, Bing and Baidu). The objective is to extract metadata
Last synced: 21 Nov 2024
https://github.com/0xsha/ChainWalker
Rapid Smart Contract Crawler
blockchain crawler dataset evm-bytecode geth security smart-contracts web3
Last synced: 21 Nov 2024
https://github.com/elliotxx/zhihu-crawler-people
A simple distributed crawler for zhihu && data analysis
crawler python python-crawler spider web-crawler web-spider
Last synced: 18 Dec 2024
https://github.com/Webysther/packagist-mirror
📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer
composer composer-packages crawler mirror packagist packagist-mirror php
Last synced: 02 Nov 2024
https://github.com/subins2000/search
An Open Source Search Engine
crawler php search search-engine
Last synced: 25 Dec 2024
https://github.com/evil0ctal/fast-powerful-whisper-ai-services-api
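⚡ A high-performance asynchronous API for automatic speech recognition (ASR) and translation. There is no need to purchase the Whisper API: it runs inference with a locally hosted Whisper model, supports multi-GPU concurrency, and is designed for distributed deployment. It also includes built-in crawlers for social media platforms such as TikTok and Douyin, enabling seamless media processing across multiple social platforms and providing a powerful, scalable solution for automated processing of media content data.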
asr crawler douyin-api fastapi faster-whisper openai-whisper speech-recognition speech-to-text speech-to-text-api tiktok-analytics tiktok-api tiktok-crawler video-analysis whisper-ai whisper-api whisperbot
Last synced: 23 Dec 2024
https://github.com/cocrawler/cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
aiohttp aiohttp-client async-python concurrency crawler pluggable-modules python3 screenshot warc
Last synced: 29 Oct 2024
https://github.com/viasite/site-audit-seo
Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx
audit cli crawl-site crawler lighthouse puppeteer scraper seo seo-audit seo-site-audit site-audit xlsx
Last synced: 06 Nov 2024
https://github.com/mehmetozkaya/DotnetCrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping
Last synced: 09 Nov 2024
https://github.com/mehmetozkaya/dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping
Last synced: 17 Nov 2024
https://github.com/Jiramew/spoon
🥄 A package for building specific Proxy Pool for different Sites.
crawler distributed ip proxies proxy proxy-provider proxypool python redis spider spoon
Last synced: 06 Nov 2024
https://github.com/norconex/crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
collector-fs collector-http crawler crawlers filesystem-crawler flexible java search-engine web-crawler
Last synced: 25 Dec 2024
https://github.com/nfx/slrp
rotating open proxy multiplexer
crawler golang proxy proxy-checker proxy-list proxy-pool proxy-server
Last synced: 21 Dec 2024
https://github.com/amerkurev/scrapper
Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.
crawler crawling docker headless readability scraper web-parsers web-parsing web-scraping
Last synced: 21 Dec 2024
https://github.com/N0taN3rd/Squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 05 Nov 2024
https://github.com/n0tan3rd/squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving
Last synced: 27 Oct 2024
https://github.com/guilhermecgs/ir
Project for automatically calculating income tax (Imposto de Renda) on Bovespa trades. Tags: canal eletronico do investidor, CEI, selenium, bovespa, IRPF, IR, imposto de renda, finance, yahoo finance, acao, fii, etf, python, crawler, webscraping, calculadora ir
acoes b3 bovespa calculadora-ir canal-eletronico-investidor cei crawler etf fii finance imposto-de-renda irpf webscraping
Last synced: 11 Nov 2024
https://github.com/stulzq/HttpCode.Core
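Simple, easy to use, and efficient: an opinionated open-source .NET HTTP request framework. It can be used to build crawlers, make API requests, and more.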
crawler httpcode httpmock httprequest net-core net-standard
Last synced: 13 Nov 2024
https://github.com/beb7/gflare-tk
Open-Source Python Based SEO Web Crawler
crawler python robots-txt scraper seo seo-crawler tkinter
Last synced: 14 Nov 2024
https://github.com/cytopia/urlbuster
Powerful mutable web directory fuzzer to bruteforce existing and/or hidden files or directories.
brute-force bruteforce bruteforce-attacks crawler cytopia-sec url-bruteforcer
Last synced: 25 Dec 2024
https://github.com/oscar-project/ungoliant
:spider: The pipeline for the OSCAR corpus
common-crawl commoncrawl corpus-linguistics crawler fasttext language-classification nlp oscar
Last synced: 04 Nov 2024
https://github.com/tijme/not-your-average-web-crawler
A web crawler (for bug hunting) that gathers more than you can imagine.
bug-bounty callbacks crawler custom get post python request scanner scraper security spider vulnerability
Last synced: 23 Dec 2024
https://github.com/luohaha/jlitespider
A lite distributed Java spider framework :-)
crawler distributed distributed-systems rabbitmq spider
Last synced: 18 Nov 2024
https://github.com/Liu233w/acm-statistics
An online tool (crawler) to analyze users' performance in online judges (coding competition websites). Supported OJ: POJ, HDU, HYSBZ, CodeForces, UVA, ICPC Live Archive, FZU, SPOJ, Timus (URAL), LeetCode_CN, CSU, LibreOJ, 洛谷, 牛客OJ, Lutece (UESTC), AtCoder, AIZU, CodeChef, El Judge, BNUOJ, Codewars, UOJ, NBUT, 51Nod, DMOJ, VJudge
acm-icpc codechef-api codeforces-api crawler csharp docker javascript nodejs spoj-api vue
Last synced: 07 Nov 2024
https://github.com/liu233w/acm-statistics
An online tool (crawler) to analyze users' performance in online judges (coding competition websites). Supported OJ: POJ, HDU, HYSBZ, CodeForces, UVA, ICPC Live Archive, FZU, SPOJ, Timus (URAL), LeetCode_CN, CSU, LibreOJ, 洛谷, 牛客OJ, Lutece (UESTC), AtCoder, AIZU, CodeChef, El Judge, BNUOJ, Codewars, UOJ, NBUT, 51Nod, DMOJ, VJudge
acm-icpc codechef-api codeforces-api crawler csharp docker javascript nodejs spoj-api vue
Last synced: 21 Dec 2024
https://github.com/seart-group/ghs
GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them
bootstrap crawler csv-export dataset-generation docker-compose git github java-17 json-export mining-software-repositories msr mysql platform repository search-engine spring-boot spring-boot-application spring-boot-server sql-dump xml-export
Last synced: 22 Dec 2024
https://github.com/bartdag/pylinkvalidator
pylinkvalidator is a standalone and pure python link validator and crawler that traverses a web site and reports errors (e.g., 500 and 404 errors) encountered.
crawler link-checker networking python
Last synced: 24 Dec 2024
https://github.com/aliakhtari78/spotifyscraper
Spotify Scraper to extract all the information from Spotify and download MP3s with the song's cover art
album-title crawler free infromation preview-mp3 python python3 scraper spotfiy spotify-crawler spotify-downloader spotify-scraper spotify-scraping spotify-songs spotify-web-player webscraper webscraping
Last synced: 20 Dec 2024
https://github.com/nuhmanpk/webscrapper
Simple and powerful all-in-one Telegram bot to scrape/crawl web pages using Requests, html5lib, and BeautifulSoup
beautifulsoup4 crawler crawler-engine crawler-python hacktoberfest hacktoberfest-accepted hacktoberfest2023 pyrogram pyrogram-bot requests scraper scraping selenium telegram telegram-bot web-scraping webscraping webscrapper webscrapping webscrapping-python
Last synced: 20 Dec 2024
https://github.com/janreges/siteone-crawler
SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).
analyzer crawler crawling performance qa quality-assessment security seo seotools stress-testing swoole testing website
Last synced: 25 Oct 2024
https://github.com/abaykan/CrawlBox
An easy way to brute-force web directories.
admin-finder crawler python web-crawler wordlist
Last synced: 30 Oct 2024
https://github.com/twiny/spidy
Domain names collector - Crawl websites and collect domain names along with their availability status.
backlinks crawler domain expired-domain golang scraper seotools spider
Last synced: 17 Dec 2024
https://github.com/karust/gogetcrawl
Extract web archive data using Wayback Machine and Common Crawl
commoncrawl concurrency crawler golang wayback-machine webarchive
Last synced: 05 Nov 2024
https://github.com/nuhmanpk/WebScrapper
Simple and powerful all-in-one Telegram bot to scrape/crawl web pages using Requests, html5lib, and BeautifulSoup
beautifulsoup4 crawler crawler-engine crawler-python hacktoberfest hacktoberfest-accepted hacktoberfest2023 pyrogram pyrogram-bot requests scraper scraping selenium telegram telegram-bot web-scraping webscraping webscrapper webscrapping webscrapping-python
Last synced: 29 Nov 2024
https://github.com/moranzcw/Zhihu-Spider
A multithreaded Python crawler that collects information from Zhihu user profile pages.
crawler jupyter-notebook matplotlib python requests zhihu-spider
Last synced: 31 Oct 2024
https://github.com/hominee/dyer
Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
crawler rust rust-programming-language spider web-crawler web-framework web-scraping
Last synced: 06 Nov 2024
https://github.com/tgiles/auto-lighthouse
A utility package for automating lighthouse reporting
audits auto-lighthouse crawler lighthouse-reports robots simplecrawler
Last synced: 19 Dec 2024
https://github.com/teal33t/poopak
POOPAK - TOR Hidden Service Crawler
crawler dark-web darknet deepweb docker flask hidden-services mongo osint redis tor tor-network
Last synced: 21 Dec 2024
https://github.com/TGiles/auto-lighthouse
A utility package for automating lighthouse reporting
audits auto-lighthouse crawler lighthouse-reports robots simplecrawler
Last synced: 05 Nov 2024
https://github.com/karthikuj/sasori
Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.
automation crawler crawling dast dynamic endpoint-discovery infosec puppeteer scraping security
Last synced: 20 Dec 2024
https://github.com/duckduckgo/tracker-radar-collector
🕸 Modular, multithreaded, puppeteer-based crawler
crawler puppeteer tracker-radar
Last synced: 19 Dec 2024
https://github.com/lincanbin/sina-weibo-album-downloader
Multithreaded download of all HD photos from someone's Sina Weibo album.
Last synced: 16 Nov 2024
https://github.com/JakePartusch/lumberjack
An automated website accessibility scanner and cli
a11y accessibility axe cli crawler lumberjack
Last synced: 18 Nov 2024
https://github.com/luckylittle/blinkist-m4a-downloader
Grabs all of the audio files from all of the Blinkist books
audiobooks blinkist books crawler data-archiving data-mining data-processing go golang scraper spider
Last synced: 11 Nov 2024
https://github.com/alash3al/scraply
Scraply is a simple DOM scraper that fetches information from any HTML-based website
crawler crawling dom golang scraper scrapers scraping-websites scrapy server
Last synced: 29 Nov 2024
https://github.com/jakepartusch/lumberjack
An automated website accessibility scanner and cli
a11y accessibility axe cli crawler lumberjack
Last synced: 27 Oct 2024
https://github.com/wx-chevalier/sentinel-crawler
Xenomorph Crawler, a concise, declarative, and observable distributed crawler (Node / Go / Java / Rust) for Web, RDB, and OS; it can also act as a monitor (with Prometheus) or ETL for infrastructure :dizzy: Multi-language executor, distributed crawler
crawler etl koa2 monitor nodejs react wx-code
Last synced: 20 Dec 2024
https://github.com/greengerong/prerender-java
Java framework for Prerender
angular1 crawler java prerender prerendered-page seo
Last synced: 25 Dec 2024
https://github.com/nasa-jpl-memex/memex-explorer
Viewers for statistics and dashboarding of Domain Search Engine data
ache anaconda apache crawler dashboard domain-discovery memex-explorer miniconda nutch tika
Last synced: 25 Nov 2024
https://github.com/duyet/pricetrack
Price tracker that monitors products and alerts you when prices drop. Supports tiki.vn, shopee, lotte.vn, ... Built with Firebase: https://pricetrack.web.app
api crawler cronjob-scheduler firebase firebase-auth firebase-functions firebase-hosting firestore redash shopee shopee-api tiki tracking
Last synced: 19 Dec 2024
https://github.com/mazzzystar/baiducrawler
Sample of using proxies to crawl baidu search results.
Last synced: 11 Nov 2024
https://github.com/ethereum/node-crawler
Attempts to crawl the Ethereum network of valid Ethereum execution nodes and visualizes them in a nice web dashboard.
Last synced: 18 Dec 2024
https://github.com/SimFin/pdf-crawler
SimFin's open source PDF crawler
crawler crawling geckodriver pdf pdf-crawler puppeteer python selenium-webdriver
Last synced: 06 Nov 2024
https://github.com/simfin/pdf-crawler
SimFin's open source PDF crawler
crawler crawling geckodriver pdf pdf-crawler puppeteer python selenium-webdriver
Last synced: 11 Oct 2024
https://github.com/hardikvasa/webb
Python: An all-in-one Web Crawler, Web Parser and Web Scraping library!
crawl-pages crawler python-library
Last synced: 24 Dec 2024
https://github.com/SeaQL/starfish-ql
✴️ An experimental graph database
crates-io crawler database graph hacktoberfest network rust sql visualization
Last synced: 11 Nov 2024
https://github.com/schollz/linkcrawler
Cross-platform persistent and distributed web crawler :link:
Last synced: 08 Nov 2024
https://github.com/pavlovtech/WebReaper
Web scraper, crawler and parser in C#. Designed as simple, declarative and scalable web scraping solution.
crawler datamining parser parsing scraper scraping scraping-api scraping-data scraping-tool scraping-web scraping-websites webcrawler webscraping
Last synced: 06 Nov 2024
https://github.com/maxvalue/terpene-profile-parser-for-cannabis-strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
analysis aromatherapy bioinformatics biological-data biological-data-analysis cannabis cannabis-strains crawler data-science database health plants python python-3 scrapy terpene-profile terpenes web-crawler web-crawler-python web-crawling
Last synced: 17 Nov 2024
https://github.com/zytedata/zyte-smartproxy-headless-proxy
A complimentary proxy to help use SPM with headless browsers
Last synced: 11 Nov 2024
https://github.com/archiveteam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd
Last synced: 21 Dec 2024
https://github.com/ducdev/aliexscrape
Get Aliexpress product details in JSON
aliexpress aliexpress-api aliexpress-crawler aliexpress-scraper aliexpress-spider crawler dropship dropshipping hacktoberfest hacktoberfest19 hacktoberfest2019 json scraper spider
Last synced: 17 Nov 2024
https://github.com/wuchunfu/ipproxypool
An IP proxy pool implemented in Golang. Technologies involved: go, gorm, proxy, proxypool, ip, crawler, mysql, viper, cobra
crawler go ip proxy proxy-server proxypool
Last synced: 19 Dec 2024