Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-crawler

A collection of awesome web crawlers and spiders in different languages.
https://github.com/BruceDone/awesome-crawler

Last synced: 2 days ago

  • Python

    • cola - A distributed crawling framework.
    • feedparser - Universal feed parser.
    • MechanicalSoup - A Python library for automating interaction with websites.
    • Gain - Web crawling framework based on asyncio for everyone.
    • aspider - An async web scraping micro-framework based on asyncio.
    • Scrapy - A fast high-level screen scraping and web crawling framework.
    • django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
    • scrapy-cluster - Uses Redis and Kafka to create a distributed, on-demand scraping cluster.
    • distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
    • pyspider - A powerful spider system.
    • CoCrawler - A versatile web crawler built using modern tools and concurrency.
    • Demiurge - PyQuery-based scraping micro-framework.
    • Scrapely - A pure-python HTML screen-scraping library.
    • Scrapy-Redis - Redis-based components for Scrapy.
  • Java

    • Apache Nutch - Highly extensible, highly scalable web crawler for production environments.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
    • JSoup - Scrapes, parses, manipulates and cleans HTML.
    • websphinx - Website-Specific Processors for HTML information extraction.
    • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
    • Spiderman - A scalable, extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework with support for JavaScript rendering.
  • C#

    • SimpleCrawler - A simple spider based on multithreading and regular expressions.
    • DotnetSpider - A cross-platform, lightweight spider developed in C#.
    • ccrawler - Built with C# 3.5. It includes a simple web content categorizer extension that can distinguish between web pages based on their content.
  • JavaScript

    • x-ray - Web scraper with pagination and crawler support.
  • R

    • rvest - Simple web scraping for R.
  • Go

    • colly - Fast and Elegant Scraping Framework for Gophers.