https://github.com/cassidoo/scrapers

# Scrapers
A list of scrapers from around the web.

Find your way around with the [Table of Contents](#table-of-contents). It lists every entry, with easy navigation to each one's pros and cons and a link to its website.

Please contribute by adding links, adding pros/cons, titles, or anything else you think would be helpful!
Please help maintain alphabetical order.

## Table Of Contents
- [Apifier](#apifier) [(link)](http://apifier.com)
- [Beautiful Soup](#beautiful-soup) [(link)](https://www.crummy.com/software/BeautifulSoup/)
- [Browse AI](#browse-ai) [(link)](https://www.browseai.com/)
- [Cheerio](#cheerio) [(link)](https://cheerio.js.org/)
- [Clearbit](#clearbit) [(link)](http://clearbit.com)
- [Common Crawl](#common-crawl) [(link)](https://commoncrawl.org/)
- [Crawly](#crawly) [(link)](http://crawly.diffbot.com/)
- [Dexi.io](#dexiio) [(link)](https://dexi.io/)
- [Diffbot](#diffbot) [(link)](http://diffbot.com)
- [Diggernaut](#diggernaut) [(link)](https://www.diggernaut.com/)
- [eLink](#elink) [(link)](http://elink.club)
- [EliteProxySwitcher](#eliteproxyswitcher) [(link)](http://www.eliteproxyswitcher.com/)
- [Email Hunter](#email-hunter) [(link)](http://emailhunter.co)
- [FiveFilters](#fivefilters) [(link)](http://fivefilters.org/)
- [FMiner](#fminer) [(link)](http://www.fminer.com/)
- [FullContact](#fullcontact) [(link)](http://fullcontact.com)
- [Grabby](#grabby) [(link)](http://grabby.io)
- [HrefScrap](#hrefscrap) [(link)](https://github.com/theIYD/HrefScrapper)
- [Import.io](#importio) [(link)](http://import.io)
- [Kimonolabs](#kimonolabs) [(link)](http://kimonolabs.com)
- [lxml](#lxml) [(link)](http://lxml.de/)
- [Mozenda](#mozenda) [(link)](http://mozenda.com)
- [Morph.io](#morphio) [(link)](https://morph.io/)
- [Node-Crawler](#node-crawler) [(link)](http://nodecrawler.org)
- [Nutch](#nutch) [(link)](http://nutch.apache.org/)
- [Outwit Hub](#outwit-hub) [(link)](http://www.outwit.com/products/hub/)
- [Octoparse](#octoparse) [(link)](http://www.octoparse.com/)
- [rvest](#rvest) [(link)](https://github.com/hadley/rvest)
- [scrape-it](#scrape-it) [(link)](https://github.com/IonicaBizau/scrape-it)
- [Scraper.AI](#scraperai) [(link)](https://scraper.ai)
- [ScraperAPI](#scraperapi) [(link)](https://scraperapi.com)
- [ScraperWiki](#scraperwiki) [(link)](https://scraperwiki.com/)
- [ScrapingAnt](#scrapingant) [(link)](http://scrapingant.com)
- [Scrapinghub](#scrapinghub) [(link)](http://scrapinghub.com)
- [Scrapper](#scrapper) [(link)](https://github.com/amerkurev/scrapper)
- [Screen Scraper](#screen-scraper) [(link)](http://community.screen-scraper.com/)
- [Toofr](#toofr) [(link)](http://toofr.com)
- [UBot Studio](#ubot-studio) [(link)](http://www.ubotstudio.com/index7)
- [UiPath](#uipath) [(link)](http://www.uipath.com/)
- [Venom](#venom) [(link)](https://venom.preferred.ai/)
- [Web Robots](#web-robots) [(link)](https://webrobots.io)
- [Web Scraper](#web-scraper) [(link)](http://webscraper.io/)
- [WrapAPI](#wrapapi) [(link)](https://wrapapi.com)
- [X-Ray](#x-ray) [(link)](https://github.com/lapwinglabs/x-ray)
- [ZenRows](#zenrows) [(link)](https://www.zenrows.com/)

### [Apifier](http://apifier.com)

**Description**: Cloud-based scraper for JavaScript.

**Applicable Language(s)**
- JavaScript

---

### [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)

**Description**: A Python library for parsing and navigating documents from the
Web. It allows searching the HTML tree to find specific tags.

**Applicable Language(s)**
- Python
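
To illustrate what "searching the HTML tree" looks like in practice, here is a minimal sketch using Beautiful Soup's `find_all` with Python's built-in `html.parser`. The sample markup, the `HTML` constant, and the `extract_links` helper are assumptions made up for this example; they are not part of the library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
HTML = """
<html><body>
  <h1>Scrapers</h1>
  <ul>
    <li><a href="https://example.com/a">First</a></li>
    <li><a href="https://example.com/b">Second</a></li>
  </ul>
</body></html>
"""


def extract_links(markup: str) -> list[tuple[str, str]]:
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


if __name__ == "__main__":
    for text, href in extract_links(HTML):
        print(f"{text}: {href}")
```

In a real scraper, the markup would come from an HTTP response rather than a string constant.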

---

### [Browse AI](https://www.browseai.com/)

**Description**: Browse AI is a cloud-based SaaS that lets you extract and monitor structured data from any website, with no code, through a click-and-extract interface. It also comes with a REST API, webhooks, and native integrations with tools like Google Sheets.

**Applicable Language(s)**
- C
- Clojure
- C#
- Go
- Java
- Node
- Objective-C
- OCaml
- PHP
- Python
- Ruby
- Shell
- Swift

---

### [Cheerio](https://cheerio.js.org/)

**Description**: Fast, flexible, and lean implementation of core jQuery designed for the server.

**Applicable Language(s)**
- JavaScript

---

### [Clearbit](http://clearbit.com)

**Description**: Service for looking up company and people information.

**Applicable Language(s)**

---

### [Common Crawl](https://commoncrawl.org/)

**Description**: Open dataset of crawled websites.

**Applicable Language(s)**

---

### [Crawly](http://crawly.diffbot.com/)

**Description**: Automatic service that turns a website into structured data in the form of JSON or CSV.

**Applicable Language(s)**

---

### [Dexi.io](https://dexi.io/)

**Description**: Website data extraction using a visual programming language.

**Applicable Language(s)**

---

### [Diffbot](http://diffbot.com)

**Description**: Automated tool for extracting structured information from
pages, crawling websites, and turning a website into an API.

**Applicable Language(s)**

---

### [Diggernaut](https://www.diggernaut.com/)

**Description**: Cloud-based web scraping platform.

**Applicable Language(s)**
- SML
- JavaScript

**Pros**
- Scrapers can be built using a visual tool and a scraping meta-language
- Can execute JS snippets inside a scraper
- Supports Selenium (optionally) and OCR
- Automated data validation and export to any text-based format
- Can run scrapers manually or on a schedule in the cloud, or compile and run them locally
- Full automation using API and integrations with other APIs

**Cons**
- Currently in beta
- Doesn't support PDF parsing yet

---

### [eLink](http://elink.club)

**Description**: Tool to mine LinkedIn profiles based on keywords.

**Applicable Language(s)**

---

### [EliteProxySwitcher](http://www.eliteproxyswitcher.com/)

**Description**: Local software that can download a proxy list and let users choose which one to use.

**Applicable Language(s)**

---

### [Email Hunter](http://emailhunter.co)

**Description**: API to find e-mail addresses for a given domain name.

**Applicable Language(s)**

---

### [FiveFilters](http://fivefilters.org/)

**Description**: Provides various website extraction and transformation tools,
such as Full-Text RSS and Term Extraction, as services.

**Applicable Language(s)**

---

### [FMiner](http://www.fminer.com/)

**Description**: Local software for web scraping using a recorder and a visual programming language.

**Applicable Language(s)**

---

### [FullContact](http://fullcontact.com)

**Description**: API to retrieve more information on a person.

**Applicable Language(s)**

---

### [Grabby](http://grabby.io)

**Description**: Service that searches a website for e-mails.

**Applicable Language(s)**

---

### [HrefScrap](https://github.com/theIYD/HrefScrapper)

**Description**: A Chrome extension that scrapes all the hrefs from a web page.

**Applicable Language(s)**

---

### [Import.io](http://import.io)

**Description**: Automated tool to extract structured information from websites.

**Applicable Language(s)**

---

### [Kimonolabs](http://kimonolabs.com)

**Description**: Kimono was acquired by Palantir. This was a cloud-based
service for turning websites into structured APIs. Now they offer a desktop-based
alternative for continuing to use their tools.

**Applicable Language(s)**

---

### [lxml](http://lxml.de/)

**Description**: lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

**Pros**
- Incredibly fast (see: [Python HTML Parser Performance](http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html))

**Applicable Language(s)**
- Python
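
As a rough illustration of processing HTML with lxml, the sketch below parses a small table with `lxml.html.fromstring` and selects rows with XPath. The sample markup, the `PAGE` constant, and the `extract_rows` helper are assumptions invented for this example.

```python
from lxml import html

# Hypothetical sample page, used only for illustration.
PAGE = """
<html><body>
  <table id="tools">
    <tr><th>Name</th><th>Language</th></tr>
    <tr><td>lxml</td><td>Python</td></tr>
    <tr><td>rvest</td><td>R</td></tr>
  </table>
</body></html>
"""


def extract_rows(markup: str) -> list[list[str]]:
    """Return the text of each data row (rows containing <td> cells)."""
    tree = html.fromstring(markup)
    rows = []
    # "tr[td]" skips the header row, which only contains <th> cells.
    for tr in tree.xpath("//table[@id='tools']//tr[td]"):
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
    return rows


if __name__ == "__main__":
    for name, language in extract_rows(PAGE):
        print(f"{name} ({language})")
```

XPath is only one of the selection styles lxml offers; it is used here because it keeps the row filter to a single expression.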

---

### [Mozenda](http://mozenda.com)

**Description**: Extract structured information from HTML, PDF, Excel, and Word by clicking on document elements.

**Applicable Language(s)**

---

### [Morph.io](https://morph.io/)

**Description**: Based on ScraperWiki, run scrapers in Python, Ruby, R, Perl or Node.js.

**Applicable Language(s)**
- Node.js
- Perl
- Python
- R
- Ruby

---

### [Node-Crawler](http://nodecrawler.org)

**Description**: Web Crawler/Spider for NodeJS + server-side jQuery

**Applicable Language(s)**
- Node.js

---

### [Nutch](http://nutch.apache.org/)

**Description**: Web crawler that can be combined with the Hadoop ecosystem to
run in a cluster.

**Applicable Language(s)**

---

### [Outwit Hub](http://www.outwit.com/products/hub/)

**Description**: Application that can extract information from a website and
turn it into structured data (CSV, Excel, etc.).

**Applicable Language(s)**

---

### [Octoparse](http://www.octoparse.com)

**Description**: A free web scraping tool for extracting web page data into several structured file formats.

**Applicable Language(s)**

---

### [rvest](https://github.com/hadley/rvest)

**Description**: R package to scrape information from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, and is inspired by libraries like Beautiful Soup.

**Applicable Language(s)**
- R

---

### [scrape-it](https://github.com/IonicaBizau/scrape-it)

**Description**: A Node.js scraper for humans.

**Applicable Language(s)**
- JavaScript (Node.js)

---

### [Scraper.AI](https://scraper.AI)

**Description**: Scraper.AI is an automated scraping SaaS that makes extracting data from any webpage as simple as clicking and selecting what you want. With a few clicks you can gather thousands of records.

Best of all, changes to the selections are monitored as often as you want. Updates are pushed to a consumable API for you to build on.

**Applicable Language(s)**
- Any, through a JSON API and (optional) webhook

---

### [ScraperAPI](https://scraperapi.com)

**Description**: ScraperAPI is a tool for developers building web scrapers. It handles proxies, browsers, and CAPTCHAs so developers can get the raw HTML from any website with a simple API call.

It’s the ultimate web scraping service for developers, with special pools of proxies for ecommerce price scraping, search engine scraping, social media scraping, sneaker scraping, ticket scraping and more.

**Applicable Language(s)**
- Python
- NodeJS
- PHP
- Ruby
- Java

---

### [ScraperWiki](https://scraperwiki.com/)

**Description**: Write a scraper in the browser and run it on their cloud-based
service. This is used by many news organisations.

**Applicable Language(s)**

---

### [ScrapingAnt](https://scrapingant.com)

**Description**: ScrapingAnt is a headless Chrome scraping API and free checked-proxy service. ScrapingAnt supports JavaScript rendering, premium rotating proxies, and CAPTCHA-avoidance tools. Free plans are available.

**Applicable Language(s)**
- Any, through a JSON API

---

### [Scrapinghub](http://scrapinghub.com)

**Description**: Scraper cloud hosting as a service. Allows developers to
deploy their own scrapers on their platform and benefit from their existing
infrastructure.

**Applicable Language(s)**

---

### [Scrapper](https://github.com/amerkurev/scrapper)

**Description**: Scrapper is a powerful web scraping tool with a built-in headless browser and Read mode for parsing. It has a simple and beautiful web interface, a REST API, and can search for news links on websites. Other features include stealth mode, caching results, page screenshots, proxy support, and full customization. Scrapper is delivered as a Docker image and is free to use.

**Applicable Language(s)**
- Any, through a JSON API

---

### [Screen Scraper](http://community.screen-scraper.com/)

**Description**: Local tool for scraping websites.

**Applicable Language(s)**

---

### [Toofr](http://toofr.com)

**Description**: Service for looking up business e-mails.

**Applicable Language(s)**

---

### [UBot Studio](http://www.ubotstudio.com/index7)

**Description**: Web automation software using a visual programming language
and recorder.

**Applicable Language(s)**

---

### [UiPath](http://www.uipath.com/)

**Description**: Visual tool for GUI automation by recording.

**Applicable Language(s)**

---

### [Venom](https://venom.preferred.ai)

**Description**: Venom is an open source focused crawler for the Deep Web.

**Features**
- Multi-threaded
- Structured crawling
- Page Validation
- Automatic Retries
- Proxy support

**Applicable Language(s)**
- Java

---

### [Web Robots](https://webrobots.io)

**Description**: Data as a Service platform for web scraping.

**Pros**
- Scraping dynamic, JavaScript-heavy websites
- Login and form fill on websites
- Data normalization and validation
- Data uploads

**Cons**
- Currently in beta
- Possible payment model in the future

**Applicable Language(s)**

---

### [Web Scraper](http://webscraper.io/)

**Description**: Extension that downloads websites and turns them into
structured data. Data is selected by element or by specialised selectors (e.g.,
for tables).

**Applicable Language(s)**

---

### [WrapAPI](https://wrapapi.com)

**Description**: Turn a website into an API. The structure of the data is defined by clicking elements or regular expressions.

**Applicable Language(s)**

---

### [X-Ray](https://github.com/lapwinglabs/x-ray)

**Description**: NPM module for scraping structured data via jQuery-like selectors.

**Applicable Language(s)**
- JavaScript (Node.js)

---

### [ZenRows](https://www.zenrows.com/)

**Description**: Web scraping API and proxy server designed to bypass anti-bot solutions while offering JavaScript rendering, rotating proxies, and geotargeting.

**Applicable Language(s)**
- Any, using an API or proxy
- JavaScript (Node.js SDK available)