awesome-crawler

A collection of awesome web crawlers and spiders in different languages.
https://github.com/BruceDone/awesome-crawler

Last synced: 6 days ago

Python
- cola - A distributed crawling framework.
- feedparser - Universal feed parser.
- MechanicalSoup - A Python library for automating interaction with websites.
- Gain - Web crawling framework based on asyncio for everyone.
- aspider - An async web scraping micro-framework based on asyncio.
- Scrapy - A fast high-level screen scraping and web crawling framework.
- django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
- scrapy-cluster - Uses Redis and Kafka to create a distributed on demand scraping cluster.
- distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
- pyspider - A powerful spider system.
- CoCrawler - A versatile web crawler built using modern tools and concurrency.
- Demiurge - PyQuery-based scraping micro-framework.
- Scrapely - A pure-python HTML screen-scraping library.
- Scrapy-Redis - Redis-based components for Scrapy.
Java
- Apache Nutch - Highly extensible, highly scalable web crawler for production environments.
- anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
- JSoup - Scrapes, parses, manipulates and cleans HTML.
- websphinx - Website-Specific Processors for HTML information extraction.
- Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- Spiderman - A scalable, extensible, multi-threaded web crawler.
- Spiderman2 - A distributed web crawler framework that supports JavaScript rendering.
C#
- SimpleCrawler - A simple spider based on multithreading and regular expressions.
- DotnetSpider - A cross-platform, lightweight spider developed in C#.
- ccrawler - Built with C# 3.5. It contains a simple web content categorizer extension, which can distinguish between web pages based on their content.
JavaScript
- x-ray - Web scraper with pagination and crawler support.
R
- rvest - Simple web scraping for R.
Go
- colly - Fast and Elegant Scraping Framework for Gophers.

