Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-crawlers


https://github.com/fs-hao/awesome-crawlers

Last synced: about 24 hours ago

  • All

    • Scrapy - 08-25 | A fast high-level screen scraping and web crawling framework. |
    • you-get - 08-25 | Dumb downloader that scrapes the web. |
    • pyspider - 08-25 | A powerful spider system. |
    • newspaper - 08-24 | News, full-text, and article metadata extraction in Python 3 |
    • Webmagic - 08-24 | A scalable crawler framework. |
    • Goutte - 08-24 | A screen scraping and web crawling library for PHP. |
    • portia - 08-24 | Visual scraping for Scrapy. |
    • crawlee - 08-25 | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. |
    • spider-flow - 08-25 | A visual spider framework; it's so good that you don't need to write any code to crawl a website. |
    • node-crawler - 08-24 | Node-crawler has a clean, simple API. |
    • Nokogiri - 08-24 | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support. |
    • ferret - 08-23 | Declarative web scraping. |
    • headless-chrome-crawler - 08-24 | Headless Chrome crawls with jQuery support |
    • Scrapy-Redis - 08-25 | Redis-based components for Scrapy. |
    • Crawler4j - 08-24 | Simple and lightweight web crawler. |
    • mechanize - 08-21 | Automated web interaction & crawling. |
    • node-osmosis - 08-23 | HTML/XML parser and web scraper for Node.js. |
    • scrape-it - 08-19 | A Node.js scraper for humans. |
    • Hakrawler - 08-24 | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
    • dom-crawler - 08-23 | The DomCrawler component eases DOM navigation for HTML and XML documents. |
    • scraperjs - 08-15 | A complete and versatile web scraper. |
    • RoboBrowser - 08-23 | A simple, Pythonic library for browsing the web without a standalone web browser. |
    • distribute_crawler - 08-23 | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider. |
    • Hawk - 08-25 | Advanced Crawler and ETL tool written in C#/WPF. |
    • WebCollector - 08-17 | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes. |
    • anthelion - 08-07 | A plugin for Apache Nutch to crawl semantic annotations within HTML pages. |
    • dht - 08-24 | BitTorrent DHT Protocol && DHT Spider. |
    • httrack - 08-24 | Copy websites to your computer. |
    • QueryList - 08-07 | The progressive PHP crawler framework. |
    • Heritrix3 - 08-25 | Extensible, web-scale, archival-quality web crawler project. |
    • Gecco - 08-18 | An easy-to-use, lightweight web crawler. |
    • spatie/crawler - 08-24 | An easy to use, powerful crawler implemented in PHP. Can execute Javascript. |
    • Abot - 08-24 | C# web crawler built for speed and flexibility. |
    • gocrawl - 08-23 | Polite, slim and concurrent web crawler. |
    • SeimiCrawler - 08-24 | An agile, distributed crawler framework. |
    • Scrapely - 08-20 | A pure-python HTML screen-scraping library. |
    • go_spider - 08-18 | An awesome Go concurrent Crawler(spider) framework. |
    • PSpider - 08-24 | A simple spider frame in Python3. |
    • aspider - 08-14 | An async web scraping micro-framework based on asyncio. |
    • upton - 08-20 | A batteries-included framework for easy web scraping. Just add CSS (or do more). |
    • scrape - 08-22 | A simple, higher level interface for Go web scraping. |
    • open-source-search-engine - 08-23 | A distributed open source search engine and spider/crawler written in C/C++. |
    • cola - 07-29 | A distributed crawling framework. |
    • rvest - 08-23 | Simple web scraping for R. |
    • php-spider - 08-24 | A configurable and extensible PHP web spider. |
    • wombat - 08-18 | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages. |
    • web-scraper-chrome-extension - 08-22 | Web data extraction tool implemented as chrome extension. |
    • scrapy-cluster - 08-23 | Uses Redis and Kafka to create a distributed on demand scraping cluster. |
    • django-dynamic-scraper - 08-25 | Creating Scrapy scrapers via the Django admin interface. |
    • sukhoi - 07-01 | Minimalist and powerful Web Crawler. |
    • creeper - 08-23 | The Next Generation Crawler Framework (Go). |
    • fetchbot - 08-24 | A simple and flexible web crawler that follows the robots.txt policies and crawl delays. |
    • Spidr - 07-06 | Spider a site, multiple domains, certain links or infinitely. |
    • Dataflow kit - 08-24 | Extract structured data from web pages. Website scraping. |
    • webster - 08-13 | A reliable web crawling framework which can scrape ajax and js rendered content in a web page. |
    • laravel-goutte - 08-10 | Laravel 5 Facade for Goutte. |
    • ACHE Crawler - 08-17 | An easy to use web crawler for domain-specific search. |
    • PHPScraper - 08-23 | PHPScraper is a scraper & crawler built for simplicity. |
    • Spark-Crawler - 08-25 | Evolving Apache Nutch to run on Spark. |
    • ants-go - 03-13 | An open source, distributed, RESTful crawler engine in Go. |
    • supercrawler - 08-09 | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. |
    • MSpider - 05-31 | A simple, easy spider using gevent and JS rendering. |
    • ebot - 07-05 | A scalable, distributed, and highly configurable web crawler. |
    • spidy - 08-03 | The simple, easy to use command line web crawler. |
    • spider - 08-22 | The fastest web crawler and indexer. |
    • pspider - 05-22 | Parallel web crawler written in PHP. |
    • js-crawler - 08-06 | Web crawler for Node.js; both HTTP and HTTPS are supported. |
    • Cobweb - 01-02 | Web crawler with very flexible crawling options, standalone or using sidekiq. |
    • Infinity Crawler - 08-15 | A simple but powerful web crawler library in C#. |
    • webBee - 07-27 | A DFS web spider. |
    • crawley - 08-10 | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. |
    • CoCrawler - 08-03 | A versatile web crawler built using modern tools and concurrency. |
    • Squidwarc - 05-13 | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
    • brownant - 05-05 | A lightweight web data extracting framework. |
    • crawler - 07-15 | Scala DSL for web crawling. |
    • RubyRetriever - 08-06 | RubyRetriever is a Web Crawler, Scraper & File Harvester. |
    • scrala - 03-15 | Scala crawler(spider) framework, inspired by scrapy. |
    • Demiurge - 06-02 | PyQuery-based scraping micro-framework. |
    • web-scraper - 05-27 | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions. |
    • crawlzone/crawlzone - 08-04 | Crawlzone is a fast asynchronous internet crawling framework for PHP. |
    • ferrit - 11-02 | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra. |
    • SkyScraper - 03-05 | An asynchronous web scraper / web crawler using async / await and Reactive Extensions. |
    • Apache Nutch | Highly extensible, highly scalable web crawler for production environments. |
    • JSoup | Scrapes, parses, manipulates and cleans HTML. |
    • Open Search Server | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything. |
    • Spiderman | A scalable, extensible, multi-threaded web crawler. |
    • pholcus - 08-24 | A distributed, high concurrency and powerful web crawler. |
    • x-ray - 08-19 | Web scraper with pagination and crawler support. |
    • MechanicalSoup - 08-24 | A Python library for automating interaction with websites. |
    • DotnetSpider - 08-24 | A cross-platform, lightweight spider developed in C#. |
    • simplecrawler - 08-24 | Event driven web crawler. |
    • StormCrawler - 08-22 | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm |
    • Gain - 08-11 | Web crawling framework based on asyncio for everyone. |