Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/mateogianolio/domp

Web scraping, crawling and DOM tree manipulation for Node.js.
https://github.com/mateogianolio/domp

Last synced: 3 months ago
JSON representation

Web scraping, crawling and DOM tree manipulation for Node.js.

Host: GitHub
URL: https://github.com/mateogianolio/domp
Owner: mateogianolio
License: mit
Created: 2016-02-18T21:46:29.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2016-02-20T13:32:40.000Z (over 8 years ago)
Last Synced: 2024-06-28T18:47:41.732Z (5 months ago)
Language: JavaScript
Homepage:
Size: 941 KB
Stars: 14
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

Awesome Lists containing this project

README

        # domp

Web scraping, crawling and DOM tree manipulation for Node.js. Uses [htmlparser2](https://github.com/fb55/htmlparser2) for HTML parsing and [robots-txt](https://github.com/Woorank/robots-txt) for `robots.txt` checking.

```bash

$ npm install domp

```

```javascript

var domp = require('domp');

```

### Usage

#### [Get single page (`examples/single.js`)](https://github.com/mateogianolio/domp/blob/master/examples/single.js)

```javascript

domp(url, function(dom) {

  console.log(...dom.map(node => node.name));

  // html head meta title script ...

});

```

#### [Get multiple pages (`examples/multiple.js`)](https://github.com/mateogianolio/domp/blob/master/examples/multiple.js)

You can scrape an `Array` of urls by

1. providing a callback:

  ```javascript

  domp(urls, function(dom) {

    // called twice

  })

  ```

2. looping through an iterator

  ```javascript

  for (var page of domp(urls))

    page.then(function (dom) {

      // resolved

    }, function (error) {

      // rejected

    });

  ```

#### [Crawling (`examples/crawl.js`)](https://github.com/mateogianolio/domp/blob/master/examples/crawl.js)

```javascript

function resolve(next) {

  return function (dom) {

    var title = dom.find('title').next().value,

        links = [...dom.filter(node => node.href && node.href.indexOf('http') === 0)];

    // get random link

    var link = links[Math.floor(Math.random() * links.length)];

    console.log(title.text);

    console.log(link.href);

    // submit link(s) to be scraped next

    next(link.href);

  };

}

domp.crawl('https://en.wikipedia.org', function(requests, next) {

  for (var request of requests)

    request.then(resolve(next));

});

```

### DOM Tree traversal

Standard traversal using `for ... of`:

```javascript

for (var node of dom)

  console.log(node);

```

Sibling (children with same parent) traversal using `for ... of`:

```javascript

for (var sibling of node.siblings)

  console.log(sibling);

```

Tag name traversal using `for ... of` and `find(name)`:

```javascript

for (var node of dom.find('p'))

  console.log(node);

```

### DOM Manipulation

DOM nodes (see `node.js`) implement mapping similar to what we're used to from `Array.prototype.map`, but instead of returning an `Array` it returns an `Iterable`. The `Iterable` can either be unpacked into an `Array` using the spread operator (`...`) or be used as a normal iterator.

```javascript

var names = dom.map(node => node.name);

names = [...names];

// names = ['html', 'head', 'meta', 'title', ...]

for (var name of names)

  console.log(name);

// html

// head

// ...

```

Filtering works pretty much the same (returns `Iterable`):

```javascript

// get all 'p' tags

var paragraphs = dom.filter(node => node.name === 'p');

// traverse

for (var p of paragraphs)

  console.log(p);

```

There's also the short `find(name)` that can be used to find tag names in the tree:

```javascript

for (var node in dom.find('p'))

  console.log(node);

```