{"id":13394633,"url":"https://github.com/BruceDone/awesome-crawler","last_synced_at":"2025-03-13T20:31:36.333Z","repository":{"id":37396635,"uuid":"70459317","full_name":"BruceDone/awesome-crawler","owner":"BruceDone","description":"A collection of awesome web crawler,spider in different languages","archived":false,"fork":false,"pushed_at":"2024-05-17T07:25:25.000Z","size":76,"stargazers_count":6156,"open_issues_count":31,"forks_count":677,"subscribers_count":198,"default_branch":"master","last_synced_at":"2024-05-19T22:00:58.473Z","etag":null,"topics":["awesome","crawler","node-crawler","scraper","spider","web-crawler","web-scraper"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BruceDone.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-10T06:42:00.000Z","updated_at":"2024-05-29T05:07:49.215Z","dependencies_parsed_at":"2024-04-13T22:55:24.937Z","dependency_job_id":"fceabc74-e72a-420e-82ec-f2581005ab8f","html_url":"https://github.com/BruceDone/awesome-crawler","commit_stats":{"total_commits":70,"total_committers":29,"mean_commits":2.413793103448276,"dds":0.5285714285714286,"last_synced_commit":"5b6f40dab0518171d9b3c7adec31a257eeab64eb"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BruceDone%2Fawesome-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BruceDone%2Fawesome-crawler/tags","releases_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/BruceDone%2Fawesome-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BruceDone%2Fawesome-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BruceDone","download_url":"https://codeload.github.com/BruceDone/awesome-crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243125540,"owners_count":20240276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awesome","crawler","node-crawler","scraper","spider","web-crawler","web-scraper"],"created_at":"2024-07-30T17:01:26.304Z","updated_at":"2025-03-13T20:31:36.307Z","avatar_url":"https://github.com/BruceDone.png","language":null,"funding_links":[],"categories":["Others","HarmonyOS","miscellaneous","Related Awesome Lists","Others (1002)","优质的 Github 库","Other awesome list","Other Lists","DevOps Utilities","后端","📦 Legacy \u0026 Inactive Projects","Python","Awesome Lists","Crawler"],"sub_categories":["Windows Manager","TeX Lists"],"readme":"# Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)\nA collection of awesome web crawler,spider and resources in different languages.\n\n## Contents\n\n- [Python](#python)\n- [Java](#java)\n- [C#](#c)\n- [JavaScript](#javascript)\n- [PHP](#php)\n- [C++](#c-1)\n- [C](#c-2)\n- [Ruby](#ruby)\n- [Rust](#rust)\n- [R](#r)\n- [Erlang](#erlang)\n- [Perl](#perl)\n- [Go](#go)\n- [Scala](#scala)\n\n## Python \n* 
[Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.\n    * [django-dynamic-scraper](https://github.com/holgerd77/django-dynamic-scraper) - Creating Scrapy scrapers via the Django admin interface.\n    * [Scrapy-Redis](https://github.com/rolando/scrapy-redis) - Redis-based components for Scrapy.\n    * [scrapy-cluster](https://github.com/istresearch/scrapy-cluster) - Uses Redis and Kafka to create a distributed on-demand scraping cluster.\n    * [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.\n* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.\n* [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.\n* [cola](https://github.com/chineking/cola) - A distributed crawling framework.\n* [Demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.\n* [Scrapely](https://github.com/scrapy/scrapely) - A pure-python HTML screen-scraping library.\n* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.\n* [you-get](https://github.com/soimort/you-get) - Dumb downloader that scrapes the web.\n* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.\n* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.\n* [crawley](https://github.com/jmg/crawley) - Pythonic crawling/scraping framework based on non-blocking I/O operations.\n* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.\n* [MSpider](https://github.com/manning23/MSpider) - A simple, easy spider using gevent and JS rendering. 
\n* [brownant](https://github.com/douban/brownant) - A lightweight web data extracting framework.\n* [PSpider](https://github.com/xianhu/PSpider) - A simple spider framework in Python 3.\n* [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.\n* [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful web crawler.\n* [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler. \n* [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3.\n* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio. \n\n## Java\n* [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.\n* [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environments.\n    * [anthelion](https://github.com/yahoo/anthelion) - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.\n* [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.\n* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.\n* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.\n* [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. 
The crawlers can index everything.\n* [Gecco](https://github.com/xtuhcy/gecco) - An easy-to-use, lightweight web crawler.\n* [WebCollector](https://github.com/CrawlScript/WebCollector) - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.\n* [Webmagic](https://github.com/code4craft/webmagic) - A scalable crawler framework.\n* [Spiderman](https://git.oschina.net/l-weiwei/spiderman) - A scalable, extensible, multi-threaded web crawler.\n    * [Spiderman2](http://git.oschina.net/l-weiwei/Spiderman2) - A distributed web crawler framework that supports JS rendering.\n* [Heritrix3](https://github.com/internetarchive/heritrix3) - Extensible, web-scale, archival-quality web crawler project.\n* [SeimiCrawler](https://github.com/zhegexiaohuozi/SeimiCrawler) - An agile, distributed crawler framework.\n* [StormCrawler](http://github.com/DigitalPebble/storm-crawler/) - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.\n* [Spark-Crawler](https://github.com/USCDataScience/sparkler) - Evolving Apache Nutch to run on Spark.\n* [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.\n* [spider-flow](https://github.com/ssssssss-team/spider-flow) - A visual spider framework; it's so good that you don't need to write any code to crawl a website.\n* [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a standalone application or be embedded into Java applications.\n\n\n## C# \n* [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5 version. 
It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.\n* [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - Simple spider based on multithreading and regular expressions.\n* [DotnetSpider](https://github.com/zlzforever/DotnetSpider) - A cross-platform, lightweight spider developed in C#.\n* [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.\n* [Hawk](https://github.com/ferventdesert/Hawk) - Advanced Crawler and ETL tool written in C#/WPF.\n* [SkyScraper](https://github.com/JonCanning/SkyScraper) - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.\n* [Infinity Crawler](https://github.com/TurnerSoftware/InfinityCrawler) - A simple but powerful web crawler library in C#.\n\n## JavaScript\n* [scraperjs](https://github.com/ruipgil/scraperjs) - A complete and versatile web scraper.\n* [scrape-it](https://github.com/IonicaBizau/scrape-it) - A Node.js scraper for humans.\n* [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event driven web crawler.\n* [node-crawler](https://github.com/bda-research/node-crawler) - Node-crawler has a clean, simple API.\n* [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.js; both HTTP and HTTPS are supported.\n* [webster](https://github.com/zhuyingda/webster) - A reliable web crawling framework which can scrape Ajax and JS-rendered content in a web page.\n* [x-ray](https://github.com/lapwinglabs/x-ray) - Web scraper with pagination and crawler support.\n* [node-osmosis](https://github.com/rchipka/node-osmosis) - HTML/XML parser and web scraper for Node.js.\n* [web-scraper-chrome-extension](https://github.com/martinsbalodis/web-scraper-chrome-extension) - Web data extraction tool implemented as a Chrome extension.\n* [supercrawler](https://github.com/brendonboshell/supercrawler) - Define custom handlers to parse content. 
Obeys robots.txt, rate limits and concurrency limits. \n* [headless-chrome-crawler](https://github.com/yujiosaka/headless-chrome-crawler) - Headless Chrome crawls with jQuery support\n* [Squidwarc](https://github.com/n0tan3rd/squidwarc) - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head\n* [crawlee](https://github.com/apify/crawlee) - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. \n\n\n## PHP\n* [Goutte](https://github.com/FriendsOfPHP/Goutte) - A screen scraping and web crawling library for PHP.\n    * [laravel-goutte](https://github.com/dweidner/laravel-goutte) - Laravel 5 Facade for Goutte.\n* [dom-crawler](https://github.com/symfony/dom-crawler) - The DomCrawler component eases DOM navigation for HTML and XML documents.\n* [QueryList](https://github.com/jae-jae/QueryList) - The progressive PHP crawler framework.\n* [pspider](https://github.com/hightman/pspider) - Parallel web crawler written in PHP.\n* [php-spider](https://github.com/mvdbos/php-spider) - A configurable and extensible PHP web spider.\n* [spatie/crawler](https://github.com/spatie/crawler) - An easy to use, powerful crawler implemented in PHP. 
Can execute JavaScript.\n* [crawlzone/crawlzone](https://github.com/crawlzone/crawlzone) - Crawlzone is a fast asynchronous internet crawling framework for PHP.\n* [PHPScraper](https://github.com/spekulatius/PHPScraper) - PHPScraper is a scraper \u0026 crawler built for simplicity.\n\n## C++\n* [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.\n\n## C\n* [httrack](https://github.com/xroche/httrack) - Copy websites to your computer.\n\n## Ruby\n* [Nokogiri](https://github.com/sparklemotion/nokogiri) - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.\n* [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping. Just add CSS (or do more).\n* [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.\n* [RubyRetriever](https://github.com/joenorton/rubyretriever) - RubyRetriever is a Web Crawler, Scraper \u0026 File Harvester.\n* [Spidr](https://github.com/postmodern/spidr) - Spider a site, multiple domains, certain links or infinitely.\n* [Cobweb](https://github.com/stewartmckee/cobweb) - Web crawler with very flexible crawling options, standalone or using sidekiq.\n* [mechanize](https://github.com/sparklemotion/mechanize) - Automated web interaction \u0026 crawling.\n\n## Rust\n* [spider](https://github.com/spider-rs/spider) - The fastest web crawler and indexer.\n* [crawler](https://github.com/a11ywatch/crawler) - A gRPC web indexer turbocharged for performance.\n\n## R\n* [rvest](https://github.com/hadley/rvest) - Simple web scraping for R.\n\n## Erlang \n* [ebot](https://github.com/matteoredaelli/ebot) - A scalable, distributed, and highly configurable web crawler.\n\n## Perl\n* [web-scraper](https://github.com/miyagawa/web-scraper) - Web Scraping Toolkit using HTML and CSS Selectors 
or XPath expressions.\n\n## Go\n* [pholcus](https://github.com/henrylee2cn/pholcus) - A distributed, high-concurrency, and powerful web crawler.\n* [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.\n* [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.\n* [go_spider](https://github.com/hu17889/go_spider) - An awesome Go concurrent crawler (spider) framework. \n* [dht](https://github.com/shiyanhui/dht) - BitTorrent DHT Protocol \u0026\u0026 DHT Spider.\n* [ants-go](https://github.com/wcong/ants-go) - An open source, distributed, RESTful crawler engine in Go.\n* [scrape](https://github.com/yhat/scrape) - A simple, higher level interface for Go web scraping.\n* [creeper](https://github.com/wspl/creeper) - The Next Generation Crawler Framework (Go).\n* [colly](https://github.com/asciimoo/colly) - Fast and Elegant Scraping Framework for Gophers.\n* [ferret](https://github.com/MontFerret/ferret) - Declarative web scraping.\n* [Dataflow kit](https://github.com/slotix/dataflowkit) - Extract structured data from web pages. Website scraping.\n* [Hakrawler](https://github.com/hakluke/hakrawler) - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.\n\n\n## Scala\n* [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.\n* [scrala](https://github.com/gaocegege/scrala) - Scala crawler (spider) framework, inspired by Scrapy.\n* [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBruceDone%2Fawesome-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBruceDone%2Fawesome-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBruceDone%2Fawesome-crawler/lists"}