Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-crawlers


https://github.com/fs-hao/awesome-crawlers

  • Scrapy - 08-25 | A fast high-level screen scraping and web crawling framework. |
  • you-get - 08-25 | Dumb downloader that scrapes the web. |
  • colly - 08-25 | Fast and Elegant Scraping Framework for Gophers. |
  • pyspider - 08-25 | A powerful spider system. |
  • newspaper - 08-24 | News, full-text, and article metadata extraction in Python 3 |
  • Webmagic - 08-24 | A scalable crawler framework. |
  • Goutte - 08-24 | A screen scraping and web crawling library for PHP. |
  • portia - 08-24 | Visual scraping for Scrapy. |
  • crawlee - 08-25 | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. |
  • spider-flow - 08-25 | A visual spider framework; it's so good that you don't need to write any code to crawl a website. |
  • pholcus - 08-24 | A distributed, high concurrency and powerful web crawler. |
  • node-crawler - 08-24 | Node-crawler has a clean, simple API. |
  • Nokogiri - 08-24 | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support. |
  • x-ray - 08-19 | Web scraper with pagination and crawler support. |
  • ferret - 08-23 | Declarative web scraping. |
  • headless-chrome-crawler - 08-24 | Headless Chrome crawls with jQuery support |
  • Scrapy-Redis - 08-25 | Redis-based components for Scrapy. |
  • MechanicalSoup - 08-24 | A Python library for automating interaction with websites. |
  • Crawler4j - 08-24 | Simple and lightweight web crawler. |
  • mechanize - 08-21 | Automated web interaction & crawling. |
  • node-osmosis - 08-23 | HTML/XML parser and web scraper for Node.js. |
  • scrape-it - 08-19 | A Node.js scraper for humans. |
  • Hakrawler - 08-24 | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
  • dom-crawler - 08-23 | The DomCrawler component eases DOM navigation for HTML and XML documents. |
  • DotnetSpider - 08-24 | A cross-platform, lightweight spider developed in C#. |
  • scraperjs - 08-15 | A complete and versatile web scraper. |
  • RoboBrowser - 08-23 | A simple, Pythonic library for browsing the web without a standalone web browser. |
  • distribute_crawler - 08-23 | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider. |
  • Hawk - 08-25 | Advanced Crawler and ETL tool written in C#/WPF. |
  • WebCollector - 08-17 | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes. |
  • anthelion - 08-07 | A plugin for Apache Nutch to crawl semantic annotations within HTML pages. |
  • dht - 08-24 | BitTorrent DHT Protocol && DHT Spider. |
  • httrack - 08-24 | Copy websites to your computer. |
  • QueryList - 08-07 | The progressive PHP crawler framework. |
  • Heritrix3 - 08-25 | Extensible, web-scale, archival-quality web crawler project. |
  • Gecco - 08-18 | An easy-to-use, lightweight web crawler |
  • spatie/crawler - 08-24 | An easy to use, powerful crawler implemented in PHP. Can execute JavaScript. |
  • simplecrawler - 08-24 | Event driven web crawler. |
  • Abot - 08-24 | C# web crawler built for speed and flexibility. |
  • Gain - 08-11 | Web crawling framework based on asyncio for everyone. |
  • gocrawl - 08-23 | Polite, slim and concurrent web crawler. |
  • SeimiCrawler - 08-24 | An agile, distributed crawler framework. |
  • Scrapely - 08-20 | A pure-python HTML screen-scraping library. |
  • go_spider - 08-18 | An awesome Go concurrent crawler (spider) framework. |
  • PSpider - 08-24 | A simple spider frame in Python3. |
  • aspider - 08-14 | An async web scraping micro-framework based on asyncio. |
  • upton - 08-20 | A batteries-included framework for easy web-scraping. Just add CSS (or do more). |
  • scrape - 08-22 | A simple, higher level interface for Go web scraping. |
  • open-source-search-engine - 08-23 | A distributed open source search engine and spider/crawler written in C/C++. |
  • cola - 07-29 | A distributed crawling framework. |
  • rvest - 08-23 | Simple web scraping for R. |
  • php-spider - 08-24 | A configurable and extensible PHP web spider. |
  • wombat - 08-18 | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages. |
  • web-scraper-chrome-extension - 08-22 | Web data extraction tool implemented as a Chrome extension. |
  • scrapy-cluster - 08-23 | Uses Redis and Kafka to create a distributed on demand scraping cluster. |
  • django-dynamic-scraper - 08-25 | Creating Scrapy scrapers via the Django admin interface. |
  • sukhoi - 07-01 | Minimalist and powerful Web Crawler. |
  • StormCrawler - 08-22 | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm |
  • creeper - 08-23 | The Next Generation Crawler Framework (Go). |
  • fetchbot - 08-24 | A simple and flexible web crawler that follows the robots.txt policies and crawl delays. |
  • Spidr - 07-06 | Spider a site, multiple domains, certain links or infinitely. |
  • Dataflow kit - 08-24 | Extract structured data from web pages. Website scraping. |
  • webster - 08-13 | A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page. |
  • laravel-goutte - 08-10 | Laravel 5 Facade for Goutte. |
  • ACHE Crawler - 08-17 | An easy to use web crawler for domain-specific search. |
  • PHPScraper - 08-23 | PHPScraper is a scraper & crawler built for simplicity. |
  • Spark-Crawler - 08-25 | Evolving Apache Nutch to run on Spark. |
  • ants-go - 03-13 | An open source, distributed, RESTful crawler engine in Golang. |
  • supercrawler - 08-09 | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. |
  • MSpider - 05-31 | A simple, easy spider using gevent and JS rendering. |
  • ebot - 07-05 | A scalable, distributed and highly configurable web crawler. |
  • spidy - 08-03 | The simple, easy to use command line web crawler. |
  • spider - 08-22 | The fastest web crawler and indexer. |
  • pspider - 05-22 | Parallel web crawler written in PHP. |
  • js-crawler - 08-06 | Web crawler for Node.js; both HTTP and HTTPS are supported. |
  • Cobweb - 01-02 | Web crawler with very flexible crawling options, standalone or using sidekiq. |
  • Infinity Crawler - 08-15 | A simple but powerful web crawler library in C#. |
  • webBee - 07-27 | A DFS web spider. |
  • crawley - 08-10 | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. |
  • CoCrawler - 08-03 | A versatile web crawler built using modern tools and concurrency. |
  • Squidwarc - 05-13 | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
  • brownant - 05-05 | A lightweight web data extracting framework. |
  • crawler - 07-15 | Scala DSL for web crawling. |
  • RubyRetriever - 08-06 | RubyRetriever is a Web Crawler, Scraper & File Harvester. |
  • scrala - 03-15 | Scala crawler (spider) framework, inspired by Scrapy. |
  • Demiurge - 06-02 | PyQuery-based scraping micro-framework. |
  • web-scraper - 05-27 | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions. |
  • crawlzone/crawlzone - 08-04 | Crawlzone is a fast asynchronous internet crawling framework for PHP. |
  • ferrit - 11-02 | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra. |
  • SkyScraper - 03-05 | An asynchronous web scraper / web crawler using async / await and Reactive Extensions. |
  • Apache Nutch - -- | Highly extensible, highly scalable web crawler for production environment. |
  • JSoup - -- | Scrapes, parses, manipulates and cleans HTML. |
  • Open Search Server - -- | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything. |
  • Spiderman - -- | A scalable, extensible, multi-threaded web crawler. |
  • feedparser - -- | Universal feed parser. |