Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
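The "systematic browsing" described above can be illustrated with a minimal breadth-first sketch. This is not taken from any project listed here. The `crawl` function, the `LinkExtractor` class, the `fetch` callback and the in-memory `PAGES` site are all hypothetical names chosen for illustration. A real crawler would also need network fetching, politeness delays and robots.txt handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first, returning URLs in the order they were fetched.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    returns the HTML for a URL, or None if the page is unavailable.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # never queue a URL twice
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.test/b": '<a href="/">home</a>',
    "https://example.test/c": "",
}

if __name__ == "__main__":
    for page in crawl("https://example.test/", PAGES.get):
        print(page)
```

The `seen` set is what keeps the traversal from looping when pages link back to each other, and swapping `PAGES.get` for a function that performs HTTP requests would turn the sketch into a network crawler.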

https://github.com/redco/goose-parser

Universal scraping tool, which allows you to extract data using multiple environments

browser crawler docker goose jsdom nodejs parser parsing phantomjs scraper scraping

Last synced: 24 Dec 2024

https://github.com/spatie/robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers

crawler php robots-txt

Last synced: 22 Dec 2024

https://github.com/zhaotianff/csharpcrawler

C# crawler sample programs for anyone who wants to learn the basics of web crawling. More crawler-related material will be added over time.

crawler csharp wpf

Last synced: 26 Dec 2024

https://github.com/gaussic/weibo_wordcloud

Scrapes Weibo data by keyword, then generates a word cloud.

crawler keyword search weibo wordcloud

Last synced: 19 Dec 2024

https://github.com/tufayellus/linkedin-scraper

A LinkedIn scraper that scrapes up to 1k LinkedIn profiles (due to LinkedIn's limit) from company profile links and saves their e-mail addresses if available. (Actively maintained; if anything doesn't work, open an issue in the repo.)

crawler digital-marketing email-marketing email-scraper leads linkedin linkedin-bot linkedin-gui linkedin-scraper linkedin-scraper-gui scrape-email scrape-emails scraper scraper-engine

Last synced: 21 Dec 2024

https://github.com/icy/google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

bash cookie crawler curl google ownership wget

Last synced: 25 Dec 2024

https://github.com/6677-ai/tap4-ai-crawler

The crawler open-sourced by tap4.ai

aitoolkit aitools crawler crawler-engine crawler-python

Last synced: 23 Dec 2024

https://github.com/linkedtales/scrapedin-linkedin-crawler

Crawler for LinkedIn full profiles 2019

crawler linkedin linkedin-crawler

Last synced: 06 Nov 2024

https://github.com/crypto-crawler/crypto-crawler-rs

A rock-solid cryptocurrency crawler library.

crawler cryptocurrency websocket

Last synced: 28 Oct 2024

https://github.com/vormkracht10/laravel-seo-scanner

Scan your Laravel application routes for SEO improvement suggestions.

crawler laravel laravel-framework laravel-seo laravel-seo-scanner scanner seo seo-optimization seo-tools seotools

Last synced: 21 Dec 2024

https://github.com/jsrei/crawler-js-hook-framework-public

A toolkit of hooks for JavaScript reverse engineering; some of the tools are open-sourced here.

crawler

Last synced: 16 Nov 2024

https://github.com/crawlab-team/crawlab-lite

Lite version of Crawlab, the crawler management platform.

crawlab crawler crawler-management crawling-tasks platform scrapy scrapy-ui scrapyd scrapyd-ui spider web-crawler

Last synced: 17 Nov 2024

https://github.com/macacajs/NoSmoke

A cross-platform UI crawler that scans view trees, then generates and executes UI test cases.

android crawler ios macaca smoke-tests test-automation webdriver

Last synced: 08 Nov 2024

https://github.com/mgleon08/instagram-crawler

Crawl Instagram photos, posts and videos for download.

crawler gem instagram instagram-crawler instagram-scraper ruby rubygems scraper

Last synced: 05 Dec 2024

https://github.com/webysther/packagist-mirror

📦✂️📋📦 Create a mirror of packagist.org metadata for use locally with composer

composer composer-packages crawler mirror packagist packagist-mirror php

Last synced: 03 Nov 2024

https://github.com/Josue87/MetaFinder

Search for documents in a domain through search engines (Google, Bing and Baidu). The objective is to extract metadata.

crawler metadata osint

Last synced: 21 Nov 2024

https://github.com/elliotxx/zhihu-crawler-people

A simple distributed crawler for Zhihu, plus data analysis

crawler python python-crawler spider web-crawler web-spider

Last synced: 26 Dec 2024

https://github.com/subins2000/search

An Open Source Search Engine

crawler php search search-engine

Last synced: 25 Dec 2024

https://github.com/codesofun/web-bee

🐝 Web vertical crawler framework for fun

crawler framework java java-8 webbee

Last synced: 26 Dec 2024

https://github.com/ma63d/leetcode-spider

Use Node.js to crawl the source code of your own LeetCode solutions.

algorithm co crawler leetcode nodejs

Last synced: 25 Dec 2024

https://github.com/AnyISalIn/zhihu_fun

A Selenium-based keyword crawler for Zhihu.

crawler python python3 selenium zhihu

Last synced: 30 Oct 2024

https://github.com/evil0ctal/fast-powerful-whisper-ai-services-api

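⚡ A high-performance asynchronous API for automatic speech recognition (ASR) and translation. No need to purchase the Whisper API: it runs inference with a locally hosted Whisper model, supports multi-GPU concurrency, and is designed for distributed deployment. It also includes built-in crawlers for social media platforms such as TikTok and Douyin, enabling seamless media processing across multiple social platforms and providing a powerful, scalable solution for automated processing of media content data.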

asr crawler douyin-api fastapi faster-whisper openai-whisper speech-recognition speech-to-text speech-to-text-api tiktok-analytics tiktok-api tiktok-crawler video-analysis whisper-ai whisper-api whisperbot

Last synced: 23 Dec 2024

https://github.com/cocrawler/cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

aiohttp aiohttp-client async-python concurrency crawler pluggable-modules python3 screenshot warc

Last synced: 29 Oct 2024

https://github.com/viasite/site-audit-seo

Web service and CLI tool for SEO site audits: crawl a site, run Lighthouse on all pages, and view public reports in the browser. Also outputs to console, JSON, CSV, and XLSX.

audit cli crawl-site crawler lighthouse puppeteer scraper seo seo-audit seo-site-audit site-audit xlsx

Last synced: 06 Nov 2024

https://github.com/mehmetozkaya/dotnetcrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible to your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

crawler crawling csharp ddd-architecture dotnetcore entity-framework-core htmlagilitypack scraping scrapy scrapy-crawler webcrawler webcrawler-htmlagilitypack webcrawling webscraper webscraping

Last synced: 27 Dec 2024

https://github.com/Jiramew/spoon

🥄 A package for building specific proxy pools for different sites.

crawler distributed ip proxies proxy proxy-provider proxypool python redis spider spoon

Last synced: 06 Nov 2024

https://github.com/norconex/crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

collector-fs collector-http crawler crawlers filesystem-crawler flexible java search-engine web-crawler

Last synced: 25 Dec 2024

https://github.com/nfx/slrp

rotating open proxy multiplexer

crawler golang proxy proxy-checker proxy-list proxy-pool proxy-server

Last synced: 21 Dec 2024

https://github.com/bytebuff/scrapingoutsourcing

ScrapingOutsourcing focuses on sharing crawler code, aiming to add a new one each week.

appium crawler docker requests scrapy spider

Last synced: 10 Nov 2024

https://github.com/amerkurev/scrapper

Web scraper with a simple REST API living in Docker and using a Headless browser and Readability.js for parsing.

crawler crawling docker headless readability scraper web-parsers web-parsing web-scraping

Last synced: 21 Dec 2024

https://github.com/N0taN3rd/Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

browser-automation chrome chrome-headless crawler crawling headless-chrome high-fidelity-preservation puppeteer webarchives webarchiving

Last synced: 05 Nov 2024

https://github.com/guilhermecgs/ir

Project for automatically calculating income tax (Imposto de Renda) on Bovespa trades. Tags: canal eletronico do investidor, CEI, selenium, bovespa, IRPF, IR, imposto de renda, finance, yahoo finance, acao, fii, etf, python, crawler, webscraping, calculadora ir

acoes b3 bovespa calculadora-ir canal-eletronico-investidor cei crawler etf fii finance imposto-de-renda irpf webscraping

Last synced: 11 Nov 2024

https://github.com/fanhuaandluomu/pkulaw_spider

Crawls the PKULaw (北大法宝) site: http://www.pkulaw.cn/Case/

ai crawler law python-2 spider

Last synced: 22 Dec 2024

https://github.com/stulzq/HttpCode.Core

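Simple, easy to use, and efficient: an opinionated open-source .NET HTTP request framework. It can be used to build crawlers, make API requests, and more.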

crawler httpcode httpmock httprequest net-core net-standard

Last synced: 13 Nov 2024

https://github.com/zhangbohan/fun_crawler

Crawl some pictures for fun

crawler meizitu python spider

Last synced: 19 Dec 2024

https://github.com/chenjiandongx/soksaccounts

🔥 Shadowsocks account crawler

crawler shadowsocks

Last synced: 09 Nov 2024

https://github.com/beb7/gflare-tk

Open-Source Python Based SEO Web Crawler

crawler python robots-txt scraper seo seo-crawler tkinter

Last synced: 14 Nov 2024

https://github.com/cytopia/urlbuster

Powerful mutable web directory fuzzer to bruteforce existing and/or hidden files or directories.

brute-force bruteforce bruteforce-attacks crawler cytopia-sec url-bruteforcer

Last synced: 25 Dec 2024

https://github.com/vinaygopinath/ngMeta

Dynamic meta tags in your AngularJS single page application

angularjs crawler meta-tags opengraph seo ui-router

Last synced: 25 Nov 2024

https://github.com/tijme/not-your-average-web-crawler

A web crawler (for bug hunting) that gathers more than you can imagine.

bug-bounty callbacks crawler custom get post python request scanner scraper security spider vulnerability

Last synced: 23 Dec 2024

https://github.com/luohaha/jlitespider

A lite distributed Java spider framework :-)

crawler distributed distributed-systems rabbitmq spider

Last synced: 18 Nov 2024

https://github.com/jin10086/pachong

Some crawler code

crawler python2

Last synced: 22 Dec 2024

https://github.com/Liu233w/acm-statistics

An online tool (crawler) to analyze users' performance in online judges (coding competition websites). Supported OJ: POJ, HDU, HYSBZ, CodeForces, UVA, ICPC Live Archive, FZU, SPOJ, Timus (URAL), LeetCode_CN, CSU, LibreOJ, 洛谷, 牛客OJ, Lutece (UESTC), AtCoder, AIZU, CodeChef, El Judge, BNUOJ, Codewars, UOJ, NBUT, 51Nod, DMOJ, VJudge

acm-icpc codechef-api codeforces-api crawler csharp docker javascript nodejs spoj-api vue

Last synced: 07 Nov 2024

https://github.com/clarketm/s3recon

Amazon S3 bucket finder and crawler.

crawler finder python recon s3 s3-bucket

Last synced: 25 Dec 2024

https://github.com/bartdag/pylinkvalidator

pylinkvalidator is a standalone and pure python link validator and crawler that traverses a web site and reports errors (e.g., 500 and 404 errors) encountered.

crawler link-checker networking python

Last synced: 24 Dec 2024

https://github.com/abaykan/CrawlBox

An easy way to brute-force web directories.

admin-finder crawler python web-crawler wordlist

Last synced: 30 Oct 2024

https://github.com/twiny/spidy

Domain names collector - Crawl websites and collect domain names along with their availability status.

backlinks crawler domain expired-domain golang scraper seotools spider

Last synced: 17 Dec 2024

https://github.com/janreges/siteone-crawler

SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).

analyzer crawler crawling performance qa quality-assessment security seo seotools stress-testing swoole testing website

Last synced: 25 Oct 2024

https://github.com/egoist/taki

Take a snapshot of any website.

crawler prerender snapshot

Last synced: 24 Dec 2024

https://github.com/karust/gogetcrawl

Extract web archive data using Wayback Machine and Common Crawl

commoncrawl concurrency crawler golang wayback-machine webarchive

Last synced: 05 Nov 2024

https://github.com/JarryShaw/darc

Darkweb Crawler Project

crawler darkweb

Last synced: 30 Oct 2024

https://github.com/moranzcw/Zhihu-Spider

A multithreaded Python crawler that fetches profile page information for Zhihu users.

crawler jupyter-notebook matplotlib python requests zhihu-spider

Last synced: 31 Oct 2024

https://github.com/algolia/npm-search

🗿 npm ↔️ Algolia replication tool ⛷️ 🐌 🛰️

algolia couchdb crawler npm search sync yarn

Last synced: 08 Nov 2024

https://github.com/hominee/dyer

Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.

crawler rust rust-programming-language spider web-crawler web-framework web-scraping

Last synced: 06 Nov 2024

https://github.com/tgiles/auto-lighthouse

A utility package for automating lighthouse reporting

audits auto-lighthouse crawler lighthouse-reports robots simplecrawler

Last synced: 19 Dec 2024

https://github.com/karthikuj/sasori

Sasori is a dynamic web crawler powered by Puppeteer, designed for lightning-fast endpoint discovery.

automation crawler crawling dast dynamic endpoint-discovery infosec puppeteer scraping security

Last synced: 20 Dec 2024

https://github.com/lincanbin/sina-weibo-album-downloader

Multithreaded download of all HD photos/pictures from someone's Sina Weibo album.

crawler python weibo

Last synced: 16 Nov 2024

https://github.com/duckduckgo/tracker-radar-collector

🕸 Modular, multithreaded, puppeteer-based crawler

crawler puppeteer tracker-radar

Last synced: 19 Dec 2024

https://github.com/jakepartusch/lumberjack

An automated website accessibility scanner and CLI

a11y accessibility axe cli crawler lumberjack

Last synced: 27 Oct 2024

https://github.com/alash3al/scraply

Scraply is a simple DOM scraper for fetching information from any HTML-based website

crawler crawling dom golang scraper scrapers scraping-websites scrapy server

Last synced: 29 Nov 2024

https://github.com/storyicon/graphquery

GraphQuery is a query language and execution engine tied to any backend service.

crawler css graph html jsonpath query regexp sql xml xpath

Last synced: 05 Nov 2024

https://github.com/WuLC/GoogleImagesDownloader

Enlarge a training dataset by searching Google for images with specified keywords and downloading the results

crawler google image keyword selenium

Last synced: 07 Nov 2024

https://github.com/wx-chevalier/sentinel-crawler

Xenomorph Crawler, a concise, declarative and observable distributed crawler (Node / Go / Java / Rust) for Web, RDB, and OS; it can also act as a monitor (with Prometheus) or ETL for infrastructure 💫 Multi-language executor, distributed crawler

crawler etl koa2 monitor nodejs react wx-code

Last synced: 20 Dec 2024

https://github.com/nasa-jpl-memex/memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data

ache anaconda apache crawler dashboard domain-discovery memex-explorer miniconda nutch tika

Last synced: 25 Nov 2024

https://github.com/duyet/pricetrack

Price tracker that monitors product prices and alerts you when they drop. Supports tiki.vn, shopee, lotte.vn, ... Built with Firebase: https://pricetrack.web.app

api crawler cronjob-scheduler firebase firebase-auth firebase-functions firebase-hosting firestore redash shopee shopee-api tiki tracking

Last synced: 19 Dec 2024

https://github.com/glouw/andvaranaut

A dungeon crawler

crawl crawler dungeon

Last synced: 10 Nov 2024

https://github.com/mazzzystar/baiducrawler

Sample of using proxies to crawl Baidu search results.

baidu crawler proxies proxy

Last synced: 11 Nov 2024

https://github.com/ethereum/node-crawler

Attempts to crawl the Ethereum network of valid Ethereum execution nodes and visualizes them in a nice web dashboard.

crawler ethereum

Last synced: 25 Dec 2024

https://github.com/hardikvasa/webb

Python: An all-in-one Web Crawler, Web Parser and Web Scraping library!

crawl-pages crawler python-library

Last synced: 24 Dec 2024

https://github.com/SeaQL/starfish-ql

✴️ An experimental graph database

crates-io crawler database graph hacktoberfest network rust sql visualization

Last synced: 11 Nov 2024

https://github.com/schollz/linkcrawler

Cross-platform persistent and distributed web crawler 🔗

crawler hyperlinks web

Last synced: 08 Nov 2024

https://github.com/pavlovtech/WebReaper

Web scraper, crawler and parser in C#. Designed as a simple, declarative and scalable web scraping solution.

crawler datamining parser parsing scraper scraping scraping-api scraping-data scraping-tool scraping-web scraping-websites webcrawler webscraping

Last synced: 06 Nov 2024

https://github.com/lixi5338619/asyncpy

A lightweight asynchronous coroutine-based web crawler framework built with asyncio and aiohttp.

aiohttp asyncio asyncpy crawler python scrapy

Last synced: 26 Dec 2024

https://github.com/brantou/crawler

Crawlers, HTTP proxies, and simulated login!

crawler python scrapy

Last synced: 13 Nov 2024

https://github.com/zytedata/zyte-smartproxy-headless-proxy

A complementary proxy that helps you use SPM with headless browsers

crawler proxy scraping

Last synced: 11 Nov 2024

https://github.com/archiveteam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

archiveteam archiving crawl crawler crawlers crawling downloader ftp lua scraper scraping spider warc webarchiving wget wget-lua zstd

Last synced: 21 Dec 2024