Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

Host: GitHub
URL: https://github.com/N0taN3rd/node-warc
Owner: N0taN3rd
License: mit
Created: 2017-05-21T06:00:43.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2023-01-03T16:43:54.000Z (over 1 year ago)
Last Synced: 2024-04-15T07:39:53.941Z (2 months ago)
Topics: chrome-remote-interface, pupeteer, warc, warc-files, web-archives, web-archiving, webarchive, webarchiving
Language: JavaScript
Homepage:
Size: 7.99 MB
Stars: 91
Watchers: 9
Forks: 23
Open Issues: 24
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Lists

awesome-web-archiving - node-warc - Parse WARC files or create WARC files using either [Electron](https://electron.atom.io/) or [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) (Node.js). *(Stable)* (Tools & Software / WARC I/O Libraries)
awesome-stars - N0taN3rd/node-warc - Parse And Create Web ARChive (WARC) files with node.js (others)

README

# node-warc

Parse Web Archive (WARC) files or create WARC files using:

 - [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface)
 - [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra)
 - [Puppeteer](https://github.com/GoogleChrome/puppeteer)
 - [Electron](https://electron.atom.io/)
 - [request](https://github.com/request/request)

Run `npm install node-warc` or `yarn add node-warc` to get started.

[![npm Package](https://img.shields.io/npm/v/node-warc.svg?style=flat-square)](https://www.npmjs.com/package/node-warc)

## Documentation

Full documentation is available at [n0tan3rd.github.io/node-warc](https://n0tan3rd.github.io/node-warc/).

## Parsing

### Using async iteration

**Requires node 10 or greater**

```js
const fs = require('fs')
const zlib = require('zlib')
// recordIterator is only exported if async iteration on readable streams is available
const { recordIterator } = require('node-warc')

async function iterateRecords (warcStream) {
  for await (const record of recordIterator(warcStream)) {
    console.log(record)
  }
}

iterateRecords(
  fs.createReadStream('').pipe(zlib.createGunzip())
).then(() => {
  console.log('done')
})
```

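Or using one of the parsers:

```js
const { AutoWARCParser } = require('node-warc')

// for await is only valid inside an async function in CommonJS modules
;(async () => {
  for await (const record of new AutoWARCParser('')) {
    console.log(record)
  }
})()
```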

### Using Stream Transform

```js
const fs = require('fs')
const { WARCStreamTransform } = require('node-warc')

fs
  .createReadStream('')
  .pipe(new WARCStreamTransform())
  .on('data', record => {
    console.log(record)
  })
```

### Both ``.warc`` and ``.warc.gz``

```js
const { AutoWARCParser } = require('node-warc')

const parser = new AutoWARCParser('')
parser.on('record', record => { console.log(record) })
parser.on('done', () => { console.log('finished') })
parser.on('error', error => { console.error(error) })
parser.start()
```
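
How `AutoWARCParser` tells the two formats apart is not described in this README. For code that has to make the same decision on its own, one generic option is to look for the two-byte gzip magic number (`0x1f 0x8b`) at the start of the data. The sketch below illustrates that technique only; `isGzipped` is a hypothetical helper and is not claimed to match the library's behavior.

```js
const zlib = require('zlib')

// Sketch only: detect gzip data by its two-byte magic number (0x1f 0x8b).
// This is a generic technique, not necessarily what AutoWARCParser does.
function isGzipped (buffer) {
  return buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b
}

// A gzipped buffer starts with the magic number, plain WARC text does not.
console.log(isGzipped(zlib.gzipSync(Buffer.from('WARC/1.0\r\n')))) // true
console.log(isGzipped(Buffer.from('WARC/1.0\r\n'))) // false
```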

### Only gzip'd warc files

```js
const { WARCGzParser } = require('node-warc')

const parser = new WARCGzParser('')
parser.on('record', record => { console.log(record) })
parser.on('done', () => { console.log('finished') })
parser.on('error', error => { console.error(error) })
parser.start()
```

### Only non gzip'd warc files

```js
// import WARCParser (not WARCGzParser) for uncompressed WARC files
const { WARCParser } = require('node-warc')

const parser = new WARCParser('')
parser.on('record', record => { console.log(record) })
parser.on('done', () => { console.log('finished') })
parser.on('error', error => { console.error(error) })
parser.start()
```
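
The event-based parsers above can also be wrapped in a Promise when a script needs every record before moving on. The following is a minimal sketch and not part of node-warc: `collectRecords` is a hypothetical helper, and it assumes only what the examples above show, namely that a parser emits `record`, `done`, and `error` events and begins reading when `start()` is called.

```js
// Hypothetical helper (not part of node-warc): gather every parsed record
// into an array. It assumes only the parser interface shown above, i.e.
// 'record', 'done' and 'error' events plus a start() method.
function collectRecords (parser) {
  return new Promise((resolve, reject) => {
    const records = []
    parser.on('record', record => { records.push(record) })
    parser.on('done', () => { resolve(records) })
    parser.on('error', reject)
    parser.start()
  })
}
```

Any of the parsers constructed above could be passed in, for example `const records = await collectRecords(parser)`. Because the helper buffers every record in memory, the plain event or async-iteration forms are likely a better fit for large WARC files.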

## WARC Creation 

### Environment

* `NODEWARC_WRITE_GZIPPED` - enable writing gzipped records to WARC outputs.
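
Since the variable changes the output format, a script may want its file name to reflect it. The sketch below shows one way to do that. `warcPathFor` is a hypothetical helper, and it assumes that any non-empty value enables gzipped output; this README does not state which values the library accepts, so check the library source for the exact rule.

```js
// Hypothetical helper (not part of node-warc): choose an output file name that
// matches whether gzipped output was requested through the environment.
// Assumption: any non-empty value of NODEWARC_WRITE_GZIPPED enables gzip.
function warcPathFor (baseName, env = process.env) {
  const gzipped = Boolean(env.NODEWARC_WRITE_GZIPPED)
  return gzipped ? `${baseName}.warc.gz` : `${baseName}.warc`
}

// The variable has to be set before the WARC is written, for example on the
// command line that launches the script.
console.log(warcPathFor('myWARC'))
```

The result could then be passed as `warcPath` in the `warcOpts` shown in the examples below.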

### Examples

#### Using [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface)

```js
const CRI = require('chrome-remote-interface')
const { RemoteChromeWARCWriter, RemoteChromeCapturer } = require('node-warc')

;(async () => {
  const client = await CRI()
  await Promise.all([
    client.Page.enable(),
    client.Network.enable(),
  ])
  const cap = new RemoteChromeCapturer(client.Network)
  cap.startCapturing()
  await client.Page.navigate({ url: 'http://example.com' })
  // actual code should wait for a better stopping condition, e.g. network idle
  await client.Page.loadEventFired()
  const warcGen = new RemoteChromeWARCWriter()
  await warcGen.generateWARC(cap, client.Network, {
    warcOpts: {
      warcPath: 'myWARC.warc'
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome pywb collection'
    }
  })
  await client.close()
})()
```
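
The comment in the example above asks for a better stopping condition than the load event. One possible shape for a network-idle wait is sketched here. `waitForNetworkIdle` is a hypothetical helper, not part of node-warc or chrome-remote-interface, and the 500 ms default is an arbitrary choice. It assumes `client` is an EventEmitter that emits the DevTools Protocol events `Network.requestWillBeSent`, `Network.loadingFinished`, and `Network.loadingFailed` with a `requestId` field, as a chrome-remote-interface client does.

```js
// Hypothetical helper: resolve once no tracked request has been in flight
// for `idleMs` milliseconds. Only requests that start after the listeners
// are attached are tracked.
function waitForNetworkIdle (client, idleMs = 500) {
  return new Promise(resolve => {
    const inflight = new Set()
    let timer = null

    const finish = () => {
      client.removeListener('Network.requestWillBeSent', onStart)
      client.removeListener('Network.loadingFinished', onEnd)
      client.removeListener('Network.loadingFailed', onEnd)
      resolve()
    }
    // (Re)start the idle countdown, but only when nothing is in flight.
    const armTimer = () => {
      clearTimeout(timer)
      if (inflight.size === 0) timer = setTimeout(finish, idleMs)
    }
    const onStart = ({ requestId }) => {
      inflight.add(requestId)
      clearTimeout(timer)
    }
    const onEnd = ({ requestId }) => {
      inflight.delete(requestId)
      armTimer()
    }

    client.on('Network.requestWillBeSent', onStart)
    client.on('Network.loadingFinished', onEnd)
    client.on('Network.loadingFailed', onEnd)
    armTimer() // nothing tracked yet, so start counting immediately
  })
}
```

One option is to `await waitForNetworkIdle(client)` after `client.Page.loadEventFired()`. Requests that began before the helper attached its listeners are not counted, so treat it as a starting point, not a complete solution.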

#### Using [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra) 

```js
const { CRIExtra, Events, Page } = require('chrome-remote-interface-extra')
const { CRIExtraWARCGenerator, CRIExtraCapturer } = require('node-warc')

;(async () => {
  let client
  try {
    // connect to endpoint
    client = await CRIExtra({ host: 'localhost', port: 9222 })
    const page = await Page.create(client)
    const cap = new CRIExtraCapturer(page, Events.Page.Request)
    cap.startCapturing()
    await page.goto('https://example.com', { waitUntil: 'networkIdle' })
    const warcGen = new CRIExtraWARCGenerator()
    await warcGen.generateWARC(cap, {
      warcOpts: {
        warcPath: 'myWARC.warc'
      },
      winfo: {
        description: 'I created a warc!',
        isPartOf: 'My awesome pywb collection'
      }
    })
  } catch (err) {
    console.error(err)
  } finally {
    if (client) {
      await client.close()
    }
  }
})()
```

#### Using [Puppeteer](https://github.com/GoogleChrome/puppeteer)

```js
const puppeteer = require('puppeteer')
const { Events } = require('puppeteer')
const { PuppeteerWARCGenerator, PuppeteerCapturer } = require('node-warc')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  const cap = new PuppeteerCapturer(page, Events.Page.Request)
  cap.startCapturing()
  await page.goto('http://example.com', { waitUntil: 'networkidle0' })
  const warcGen = new PuppeteerWARCGenerator()
  await warcGen.generateWARC(cap, {
    warcOpts: {
      warcPath: 'myWARC.warc'
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome pywb collection'
    }
  })
  await page.close()
  await browser.close()
})()
```

#### Note

The `generateWARC` method used in the preceding examples is a helper function that keeps the WARC generation process simple. See its implementation for a full example of WARC generation using node-warc, or see one of the crawler implementations provided by [Squidwarc](https://github.com/N0taN3rd/Squidwarc/tree/master/lib/crawler).