Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/duyet/awesome-web-scraper
A collection of awesome web scrapers and crawlers.
List: awesome-web-scraper
awesome awesome-list goutte phantomjs php scrapy slimerjs spider storage web-crawler web-scraper
Last synced: 24 days ago
JSON representation
A collection of awesome web scrapers and crawlers.
- Host: GitHub
- URL: https://github.com/duyet/awesome-web-scraper
- Owner: duyet
- License: mit
- Created: 2016-02-09T17:18:04.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2024-04-04T13:12:43.000Z (9 months ago)
- Last Synced: 2024-05-21T01:11:19.794Z (7 months ago)
- Topics: awesome, awesome-list, goutte, phantomjs, php, scrapy, slimerjs, spider, storage, web-crawler, web-scraper
- Size: 48.8 KB
- Stars: 240
- Watchers: 11
- Forks: 46
- Open Issues: 37
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
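The fields above are also exposed as JSON via the "JSON representation" link. Below is a minimal Python sketch for fetching and inspecting them with `requests`; the endpoint path and field names are assumptions for illustration, not confirmed by this page.

```python
# Sketch: fetch the JSON representation of this project from the ecosyste.ms
# awesome index. NOTE: the URL below is an assumed/illustrative endpoint; use
# the real "JSON representation" link from the page above instead.
import requests

JSON_URL = (
    "https://awesome.ecosyste.ms/api/v1/projects"
    "/github.com%2Fduyet%2Fawesome-web-scraper"  # hypothetical project id
)

response = requests.get(JSON_URL, timeout=30)
response.raise_for_status()
project = response.json()

# Print a few of the fields listed above, if the response includes them.
for field in ("url", "description", "stars", "forks", "last_synced_at"):
    print(f"{field}: {project.get(field)}")
```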
Awesome Lists containing this project
- awesome-open-source-marketing - duyetdev/awesome-web-scraper
- ultimate-awesome - awesome-web-scraper - A collection of awesome web scrapers and crawlers. (Other Lists / Monkey C Lists)
README
# Awesome Web Scraper [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A collection of awesome web scrapers and crawlers.
## Java
* [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML INformation eXtraction.
* [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
* [crawler4j](https://github.com/yasserg/crawler4j) - Open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.

## C/C++
* [HTTrack](http://www.httrack.com/) - Free and open-source website copier that downloads entire sites for offline browsing.

## C#
* [ccrawler](https://code.google.com/archive/p/ccrawler/) - Built with C# 3.5. Includes a simple web content categorizer extension that can distinguish between web pages based on their content.

## Erlang
* [ebot](https://github.com/matteoredaelli/ebot) - Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), Webmachine and MochiWeb.

## Python
* [scrapy](https://github.com/scrapy/scrapy) - Scrapy, a fast high-level web crawling & scraping framework for Python (see the minimal spider sketch after this list).
* [gdom](https://github.com/syrusakbary/gdom) - gdom, DOM Traversing and Scraping using GraphQL.
* [trafilatura](https://github.com/adbar/trafilatura) - Library and command-line tool to extract metadata, main text, and comments.
* [extractnet](https://github.com/currentsapi/extractnet) - Machine-learning-based content and metadata extraction framework for Python.
* [Scrapegraph-ai](https://github.com/VinciGit00/Scrapegraph-ai) - An open source library for AI-assisted web scraping.
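Most of the Python tools above share the same fetch, parse, and follow loop. As a minimal sketch (not part of the original list), here is a small Scrapy spider run against Scrapy's public practice site; the spider name, URL, and CSS selectors are illustrative placeholders.

```python
# Minimal Scrapy spider sketch. The spider name, demo site, and CSS selectors
# are illustrative (quotes.toscrape.com is Scrapy's public practice site),
# not taken from any project in the list above.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if present.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as `quotes_spider.py`, it can be run without a full Scrapy project via `scrapy runspider quotes_spider.py -o quotes.json`.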

## PHP
* [Goutte](https://github.com/FriendsOfPHP/Goutte) - Goutte, a simple PHP Web Scraper.
* [DiDOM](https://github.com/Imangazaliev/DiDOM) - Simple and fast HTML parser.
* [simple_html_dom](https://github.com/samacs/simple_html_dom) - Just a Simple HTML DOM library fork.
* [PHPCrawl](http://phpcrawl.cuab.de/) - A PHP framework for crawling/spidering websites.
* [Crawler](https://www.crwlr.software/packages/crawler) - A library for rapid web crawler and scraper development.

## Node.js
* [puppeteer](https://github.com/GoogleChrome/puppeteer) - Headless Chrome Node API https://pptr.dev.
* [Phantomjs](https://github.com/ariya/phantomjs) - Scriptable Headless WebKit.
* [node-crawler](https://github.com/bda-research/node-crawler) - Web Crawler/Spider for NodeJS + server-side jQuery.
* [node-simplecrawler](https://github.com/simplecrawler/simplecrawler) - Flexible event driven crawler for node.
* [spider](https://github.com/mikeal/spider) - Programmable spidering of web sites with node.js and jQuery.
* [slimerjs](https://github.com/laurentj/slimerjs) - A PhantomJS-like tool running Gecko.
* [casperjs](https://github.com/casperjs/casperjs) - Navigation scripting & testing utility for PhantomJS and SlimerJS.
* [zombie](https://github.com/assaf/zombie) - Insanely fast, full-stack, headless browser testing using node.js.
* [nightmare](https://github.com/segmentio/nightmare) - A high-level wrapper for PhantomJS that lets you automate browser tasks.
* [jsdom](https://github.com/jsdom/jsdom) - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.
* [xray](https://github.com/matthewmueller/x-ray) - The next web scraper. See through the `<html>` noise.
* [lightcrawler](https://github.com/github/lightcrawler) - Crawl a website and run it through Google Lighthouse.

## Ruby
* [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

## Go
* [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
* [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows robots.txt policies and crawl delays.

## Rust
* [scraper](https://github.com/causal-agent/scraper) - HTML parsing and querying with CSS selectors.
* [reqwest](https://github.com/seanmonstar/reqwest) - An ergonomic, batteries-included HTTP client for Rust.

---------------------
## License
[MIT](LICENSE)

## Contributing
Please, read the [Contribution Guidelines](https://github.com/duyetdev/awesome-web-scraper/blob/master/CONTRIBUTING.md) before submitting your suggestion.
Feel free to [open an issue](https://github.com/duyetdev/awesome-web-scraper/issues) or [create a pull request](https://github.com/duyetdev/awesome-web-scraper/pulls) with your additions.