Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/harryf/node-soupselect
Port of Simon Willison's Soup Select (for BeautifulSoup) to node.js and node-htmlparser
- Host: GitHub
- URL: https://github.com/harryf/node-soupselect
- Owner: harryf
- Created: 2010-09-10T22:43:49.000Z (about 14 years ago)
- Default Branch: master
- Last Pushed: 2018-03-14T14:54:49.000Z (over 6 years ago)
- Last Synced: 2024-10-11T12:19:38.605Z (about 1 month ago)
- Language: HTML
- Homepage: http://github.com/harryf/node-soupselect
- Size: 59.6 KB
- Stars: 244
- Watchers: 11
- Forks: 28
- Open Issues: 12
Metadata Files:
- Readme: README.md
README
node-soupselect
---------------

A port of Simon Willison's [soupselect](http://code.google.com/p/soupselect/) for use with node.js and node-htmlparser.
    $ npm install soupselect

Minimal example:

    var select = require('soupselect').select;

    // dom provided by htmlparser...
    select(dom, "#main a.article").forEach(function(element) {
        // ...
    });

Wanted a friendly way to scrape HTML using node.js. Tried using [jsdom](http://github.com/tmpvar/jsdom), prompted by [this article](http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs), but unfortunately [jsdom](http://github.com/tmpvar/jsdom) takes a strict view of lax HTML, making it unusable for scraping the kind of soup found in real-world web pages. Luckily, [htmlparser](http://github.com/tautologistics/node-htmlparser/) is more forgiving. More details on this can be found [here](http://www.reddit.com/r/node/comments/dm0tz/nodesoupselect_for_scraping_html_with_css/c118r23).
A complete example, including fetching the HTML:

    var select = require('soupselect').select,
        htmlparser = require("htmlparser"),
        http = require('http'),
        sys = require('sys');

    // fetch some HTML...
    var host = 'www.reddit.com';
    var client = http.createClient(80, host);
    var request = client.request('GET', '/', {'host': host});

    request.on('response', function (response) {
        response.setEncoding('utf8');

        var body = "";
        response.on('data', function (chunk) {
            body = body + chunk;
        });

        response.on('end', function() {

            // now we have the whole body, parse it and select the nodes we want...
            var handler = new htmlparser.DefaultHandler(function(err, dom) {
                if (err) {
                    console.error("Error: " + err);
                } else {

                    // soupselect happening here...
                    var titles = select(dom, 'a.title');

                    sys.puts("Top stories from reddit");
                    titles.forEach(function(title) {
                        sys.puts("- " + title.children[0].raw + " [" + title.attribs.href + "]\n");
                    });
                }
            });

            var parser = new htmlparser.Parser(handler);
            parser.parseComplete(body);
        });
    });

    request.end();

Notes:
* Requires node-htmlparser > 1.6.2 & node.js 2+
* Calls to select are synchronous - not worth trying to make it asynchronous IMO given the use case (see the sketch below)
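To illustrate the synchronous point above: select returns a plain array as soon as it is called, so its result can be used inline with no callbacks. A minimal sketch, with an illustrative HTML fragment:

    var select = require('soupselect').select,
        htmlparser = require('htmlparser');

    var handler = new htmlparser.DefaultHandler(function(err, dom) {
        if (err) {
            return console.error("Error: " + err);
        }

        // select returns its matches immediately as an ordinary array...
        var titles = select(dom, 'a.title');

        // ...so the result can be mapped, filtered, etc. right away
        var hrefs = titles.map(function(title) { return title.attribs.href; });
        console.log(hrefs.join(", "));
    });

    new htmlparser.Parser(handler).parseComplete(
        '<a class="title" href="/story/1">One</a>' +
        '<a class="title" href="/story/2">Two</a>');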