awesome-web-scraper
A collection of awesome web scrapers and crawlers.
https://github.com/duyet/awesome-web-scraper
-
Java
- websphinx - Website-Specific Processors for HTML INformation eXtraction.
- Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- crawler4j - Open-source web crawler for Java that provides a simple interface for crawling the web. With it, you can set up a multi-threaded web crawler in a few minutes.
-
C#
- ccrawler - Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.
-
PHP
- PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.
- Crawler - A library for Rapid Web Crawler and Scraper Development.
- DiDOM - Simple and fast HTML parser.
- simple_html_dom - Just a Simple HTML DOM library fork.
- Goutte - Goutte, a simple PHP Web Scraper.
-
Contributing
- Contribution Guidelines
- Open an issue or create a pull request with your additions.
-
C/C++
- HTTrack - Website copier and offline browser utility that downloads a site to a local directory.
-
Python
- Scrapegraph-ai - An open-source library for AI-assisted scraping.
- extractnet - Machine-learning-based content & metadata extraction framework for Python.
- gdom - gdom, DOM Traversing and Scraping using GraphQL.
- trafilatura - Library and command-line tool to extract metadata, main text, and comments.
- scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
-
Nodejs
- jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with node.js
- lightcrawler - Crawl a website and run it through Google Lighthouse.
- puppeteer - Headless Chrome Node API https://pptr.dev.
- Phantomjs - Scriptable Headless WebKit.
- node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
- node-simplecrawler - Flexible event driven crawler for node.
- spider - Programmable spidering of web sites with node.js and jQuery.
- slimerjs - A PhantomJS-like tool running Gecko.
- casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
- zombie - Insanely fast, full-stack, headless browser testing using node.js.
- xray - The next web scraper. See through the `<html>` noise.
-
Ruby
- wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
-
Go
-
Rust
-
Erlang
- ebot - Open-source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.