https://github.com/harryf/node-soupselect

Port of Simon Willison's Soup Select (for BeautifulSoup) to node.js and node-htmlparser

Host: GitHub
URL: https://github.com/harryf/node-soupselect
Owner: harryf
Created: 2010-09-10T22:43:49.000Z (about 14 years ago)
Default Branch: master
Last Pushed: 2018-03-14T14:54:49.000Z (over 6 years ago)
Last Synced: 2024-10-11T12:19:38.605Z (about 1 month ago)
Language: HTML
Homepage: http://github.com/harryf/node-soupselect
Size: 59.6 KB
Stars: 244
Watchers: 11
Forks: 28
Open Issues: 12
Metadata Files:
- Readme: README.md

README

node-soupselect
---------------

A port of Simon Willison's [soupselect](http://code.google.com/p/soupselect/) for use with node.js and node-htmlparser.

    $ npm install soupselect

Minimal example...

    var select = require('soupselect').select;

    // dom provided by htmlparser...
    select(dom, "#main a.article").forEach(function(element) {
        // ...
    });
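
To make the `dom` argument less abstract, here is a minimal sketch of the kind of tree a selector is matched against. The tree is hand-built and only assumed to mimic htmlparser's output shape (`type`, `name`, `attribs`, `children`, with `raw` on text nodes). `findByTagAndClass` is a hypothetical helper written for this illustration; it is not part of soupselect and is not how the library is implemented.

```javascript
// Hypothetical hand-built tree mimicking the shape of htmlparser's DOM output
// (objects with type, name, attribs, children; text nodes carry raw/data).
var dom = [
    { type: 'tag', name: 'div', attribs: { id: 'main' }, children: [
        { type: 'tag', name: 'a', attribs: { 'class': 'article', href: '/one' },
          children: [ { type: 'text', raw: 'First', data: 'First' } ] },
        { type: 'tag', name: 'a', attribs: { href: '/two' },
          children: [ { type: 'text', raw: 'Second', data: 'Second' } ] }
    ] }
];

// Illustrative helper (not part of soupselect): collect every element with
// the given tag name and class by walking the tree depth-first.
function findByTagAndClass(nodes, tagName, className) {
    var found = [];
    (nodes || []).forEach(function (node) {
        if (node.type !== 'tag') { return; }
        var classes = ((node.attribs && node.attribs['class']) || '').split(/\s+/);
        if (node.name === tagName && classes.indexOf(className) !== -1) {
            found.push(node);
        }
        // Descend into children so nested matches are found too.
        found = found.concat(findByTagAndClass(node.children, tagName, className));
    });
    return found;
}

var articles = findByTagAndClass(dom, 'a', 'article');
articles.forEach(function (element) {
    console.log(element.children[0].raw + ' [' + element.attribs.href + ']');
});
```

With soupselect itself, the equivalent query would be a single `select(dom, 'a.article')` call; the helper only shows what such a query has to walk.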

I wanted a friendly way to scrape HTML using node.js. I tried [jsdom](http://github.com/tmpvar/jsdom), prompted by [this article](http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs), but unfortunately jsdom takes a strict view of lax HTML, which makes it unusable for scraping the kind of soup found in real-world web pages. Luckily [htmlparser](http://github.com/tautologistics/node-htmlparser/) is more forgiving. More details can be found [here](http://www.reddit.com/r/node/comments/dm0tz/nodesoupselect_for_scraping_html_with_css/c118r23).

A complete example, including fetching the HTML:

    var select = require('soupselect').select,
        htmlparser = require("htmlparser"),
        http = require('http');

    // fetch some HTML...
    var host = 'www.reddit.com';
    var request = http.get({host: host, path: '/'}, function (response) {
        response.setEncoding('utf8');

        var body = "";
        response.on('data', function (chunk) {
            body = body + chunk;
        });

        response.on('end', function() {
            // now we have the whole body, parse it and select the nodes we want...
            var handler = new htmlparser.DefaultHandler(function(err, dom) {
                if (err) {
                    console.error("Error: " + err);
                } else {
                    // soupselect happening here...
                    var titles = select(dom, 'a.title');

                    console.log("Top stories from reddit");
                    titles.forEach(function(title) {
                        console.log("- " + title.children[0].raw + " [" + title.attribs.href + "]\n");
                    });
                }
            });

            var parser = new htmlparser.Parser(handler);
            parser.parseComplete(body);
        });
    });

    // report network failures instead of crashing on an unhandled 'error' event
    request.on('error', function (err) {
        console.error("Request failed: " + err.message);
    });
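
The loop above reads `title.children[0].raw`, which assumes that the first child of every matched link is a text node. A minimal sketch of a more defensive alternative follows, assuming the same node shape (`type`, `raw`, `children`). `textOf` is a hypothetical helper name, not something provided by soupselect or htmlparser, and the `title` object is hand-built for illustration.

```javascript
// Hypothetical helper: concatenate the raw text of a node and all of its
// descendants, returning '' for missing nodes or elements without children.
function textOf(node) {
    if (!node) { return ''; }
    if (node.type === 'text') { return node.raw; }
    return (node.children || []).map(textOf).join('');
}

// Hand-built element in the assumed htmlparser shape: <a><b>Hello </b>world</a>
var title = {
    type: 'tag', name: 'a', attribs: { 'class': 'title', href: '/r/node' },
    children: [
        { type: 'tag', name: 'b', children: [
            { type: 'text', raw: 'Hello ', data: 'Hello ' }
        ] },
        { type: 'text', raw: 'world', data: 'world' }
    ]
};

console.log('- ' + textOf(title) + ' [' + title.attribs.href + ']');
```

Inside the `forEach` callback, `textOf(title)` could stand in for `title.children[0].raw` when links may be empty or contain nested markup.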

Notes:

* Requires node-htmlparser > 1.6.2 & node.js 2+

* Calls to select are synchronous - not worth trying to make it asynchronous IMO given the use case
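
To illustrate the note on synchronous calls: the result of a synchronous selector function is available on the very next line, and wrapping it in a Promise only defers the same blocking work. The sketch below uses `fakeSelect`, a stand-in written for this example so the snippet runs without any packages installed, and `selectLater`, a hypothetical wrapper; neither is part of soupselect.

```javascript
// Stand-in for a synchronous selector function (NOT soupselect's select):
// it simply returns the top-level nodes whose tag name equals the selector.
function fakeSelect(dom, selector) {
    return dom.filter(function (node) { return node.name === selector; });
}

// Hypothetical wrapper: runs a synchronous selector function on a later
// tick and exposes the result through a Promise.
function selectLater(selectFn, dom, selector) {
    return new Promise(function (resolve, reject) {
        setImmediate(function () {
            try {
                resolve(selectFn(dom, selector));
            } catch (err) {
                reject(err);
            }
        });
    });
}

var dom = [{ type: 'tag', name: 'a' }, { type: 'tag', name: 'p' }];

// Synchronous use: the matches are ready immediately.
var direct = fakeSelect(dom, 'a');
console.log('direct matches: ' + direct.length);

// Deferred use: same work, just scheduled later.
var pending = selectLater(fakeSelect, dom, 'a');
pending.then(function (links) {
    console.log('deferred matches: ' + links.length);
});
```

The wrapper does not make the matching itself non-blocking, which is in line with the note that an asynchronous API is probably not worth it for this use case.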