
awesome-crawler

A collection of awesome web crawlers and spiders in different languages
https://github.com/BruceDone/awesome-crawler


  • Java

    • websphinx - Website-Specific Processors for HTML information extraction.
    • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
    • Spiderman - A scalable, extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework with support for JS rendering.
    • ACHE Crawler - An easy to use web crawler for domain-specific search.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
    • Crawler4j - Simple and lightweight web crawler.
    • Gecco - An easy-to-use, lightweight web crawler.
    • WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
    • Webmagic - A scalable crawler framework.
    • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
    • SeimiCrawler - An agile, distributed crawler framework.
    • Spark-Crawler - Evolving Apache Nutch to run on Spark.
    • webBee - A DFS web spider.
    • spider-flow - A visual spider framework; it's so good that you don't need to write any code to crawl a website.
    • Norconex Web Crawler - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). Can be used as a standalone application or embedded into Java applications.
  • C#

    • SimpleCrawler - Simple spider based on multithreading and regular expressions.
    • ccrawler - Built in C# 3.5. Contains a simple web content categorizer, which can distinguish between web pages based on their content.
    • Abot - C# web crawler built for speed and flexibility.
    • Hawk - Advanced Crawler and ETL tool written in C#/WPF.
    • SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
    • Infinity Crawler - A simple but powerful web crawler library in C#.
  • Python

    • Scrapy - A fast high-level screen scraping and web crawling framework.
    • Scrapy-Redis - Redis-based components for Scrapy.
    • cola - A distributed crawling framework.
    • you-get - Dumb downloader that scrapes the web.
    • MechanicalSoup - A Python library for automating interaction with websites.
    • portia - Visual scraping for Scrapy.
    • crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
    • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
    • MSpider - A simple, easy spider using gevent and JS rendering.
    • brownant - A lightweight web data extracting framework.
    • PSpider - A simple spider frame in Python3.
    • Gain - Web crawling framework based on asyncio for everyone.
    • sukhoi - Minimalist and powerful Web Crawler.
    • spidy - The simple, easy to use command line web crawler.
    • newspaper - News, full-text, and article metadata extraction in Python 3.
    • django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
    • scrapy-cluster - Uses Redis and Kafka to create a distributed on demand scraping cluster.
    • distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
    • pyspider - A powerful spider system.
    • CoCrawler - A versatile web crawler built using modern tools and concurrency.
    • Demiurge - PyQuery-based scraping micro-framework.
    • Scrapely - A pure-python HTML screen-scraping library.
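The frameworks above all automate the same core loop: fetch a page, extract its links, and follow them. As a standard-library-only sketch of that link-extraction step (the `LinkExtractor` class and example URLs are illustrative, not from any listed project; real crawls should use one of the frameworks above for robots.txt handling, throttling, and retries):

```python
# Toy link extractor using only the Python standard library.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))


html = '<a href="/docs">Docs</a> <a href="https://example.org/about">About</a>'
parser = LinkExtractor("https://example.com")
parser.feed(html)
print(parser.links)
# → ['https://example.com/docs', 'https://example.org/about']
```

A full crawler would push each extracted link onto a queue of URLs to fetch next, deduplicating as it goes; the frameworks listed above add politeness, concurrency, and persistence on top of this loop.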
  • JavaScript

    • scraperjs - A complete and versatile web scraper.
    • scrape-it - A Node.js scraper for humans.
    • simplecrawler - Event driven web crawler.
    • node-crawler - Node-crawler has a clean, simple API.
    • js-crawler - Web crawler for Node.JS, both HTTP and HTTPS are supported.
    • webster - A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page.
    • x-ray - Web scraper with pagination and crawler support.
    • node-osmosis - HTML/XML parser and web scraper for Node.js.
    • web-scraper-chrome-extension - Web data extraction tool implemented as a Chrome extension.
    • supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
    • headless-chrome-crawler - Headless Chrome crawls with jQuery support.
    • Squidwarc - High-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head.
    • crawlee - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
  • PHP

    • Goutte - A screen scraping and web crawling library for PHP.
    • laravel-goutte - Laravel 5 Facade for Goutte.
    • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
    • QueryList - The progressive PHP crawler framework.
    • pspider - Parallel web crawler written in PHP.
    • php-spider - A configurable and extensible PHP web spider.
    • spatie/crawler - An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.
    • crawlzone/crawlzone - Crawlzone is a fast asynchronous internet crawling framework for PHP.
    • PHPScraper - PHPScraper is a scraper & crawler built for simplicity.
  • C++

  • C

    • httrack - Copy websites to your computer.
  • Ruby

    • Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
    • upton - A batteries-included framework for easy web scraping. Just add CSS (or do more).
    • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
    • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
    • Spidr - Spider a site, multiple domains, certain links or infinitely.
    • Cobweb - Web crawler with very flexible crawling options, standalone or using sidekiq.
    • mechanize - Automated web interaction & crawling.
  • Rust

    • spider - The fastest web crawler and indexer.
    • crawler - A gRPC web indexer turbo charged for performance.
  • Erlang

    • ebot - A scalable, distributed and highly configurable web crawler.
  • Perl

    • web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
  • Go

    • pholcus - A distributed, high concurrency and powerful web crawler.
    • gocrawl - Polite, slim and concurrent web crawler.
    • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
    • go_spider - An awesome Go concurrent Crawler(spider) framework.
    • dht - BitTorrent DHT Protocol && DHT Spider.
    • ants-go - An open-source, distributed, RESTful crawler engine in Golang.
    • scrape - A simple, higher level interface for Go web scraping.
    • creeper - The Next Generation Crawler Framework (Go).
    • ferret - Declarative web scraping.
    • Dataflow kit - Extract structured data from web pages; website scraping.
    • Hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
  • Scala

    • crawler - Scala DSL for web crawling.
    • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.