awesome-web-scraper

A collection of awesome web scrapers and crawlers.
https://github.com/duyet/awesome-web-scraper

  • Java

    • websphinx - Website-Specific Processors for HTML INformation eXtraction.
    • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
    • crawler4j - Open source web crawler for Java that provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.
  • C#

    • ccrawler - Built in C# 3.5. It includes a simple web content categorizer extension that can separate web pages based on their content.
  • PHP

    • PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.
    • Crawler - A library for Rapid Web Crawler and Scraper Development.
    • DiDOM - Simple and fast HTML parser.
    • simple_html_dom - Just a Simple HTML DOM library fork.
    • Goutte - Goutte, a simple PHP Web Scraper.
  • C/C++

    • HTTrack - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
  • Python

    • Scrapegraph-ai - An open source library for web scraping using AI.
    • extractnet - Machine-learning-based content and metadata extraction framework for Python.
    • gdom - gdom, DOM Traversing and Scraping using GraphQL.
    • trafilatura - Library and command-line tool to extract metadata, main text, and comments.
    • scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
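
    As a quick illustration of the Scrapy entry above, here is a minimal spider sketch, roughly along the lines of Scrapy's tutorial pattern; the target site (quotes.toscrape.com) and the CSS selectors are assumptions chosen purely for the example.

```python
# Minimal Scrapy spider sketch (assumes Scrapy is installed; the target URL
# and selectors below are illustrative only).
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

    Save it as quotes_spider.py and run `scrapy runspider quotes_spider.py -o quotes.json` to collect the scraped items as JSON.
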
  • Node.js

    • jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.
    • lightcrawler - Crawl a website and run it through Google Lighthouse.
    • puppeteer - Headless Chrome Node API https://pptr.dev.
    • Phantomjs - Scriptable Headless WebKit.
    • node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
    • node-simplecrawler - Flexible event-driven crawler for Node.
    • spider - Programmable spidering of web sites with node.js and jQuery.
    • slimerjs - A PhantomJS-like tool running Gecko.
    • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
    • zombie - Insanely fast, full-stack, headless browser testing using node.js.
    • xray - The next web scraper. See through the `<html>` noise.
  • Ruby

    • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • Go

    • gocrawl - Polite, slim and concurrent web crawler.
    • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • Rust

    • scraper - HTML parsing and querying with CSS selectors.
    • reqwest - An ergonomic, batteries-included HTTP Client for Rust.
  • Erlang

    • ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB or Riak), an AMQP message broker (RabbitMQ), Webmachine, and MochiWeb.