{"id":14156107,"url":"https://github.com/N0taN3rd/node-warc","last_synced_at":"2025-08-06T02:31:43.308Z","repository":{"id":21229861,"uuid":"91936471","full_name":"N0taN3rd/node-warc","owner":"N0taN3rd","description":"Parse And Create Web ARChive  (WARC) files with node.js","archived":false,"fork":false,"pushed_at":"2023-01-03T16:43:54.000Z","size":8374,"stargazers_count":91,"open_issues_count":24,"forks_count":23,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-04-15T07:39:53.941Z","etag":null,"topics":["chrome-remote-interface","pupeteer","warc","warc-files","web-archives","web-archiving","webarchive","webarchiving"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/N0taN3rd.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-21T06:00:43.000Z","updated_at":"2024-04-14T20:35:50.000Z","dependencies_parsed_at":"2023-01-12T03:45:20.759Z","dependency_job_id":null,"html_url":"https://github.com/N0taN3rd/node-warc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/N0taN3rd%2Fnode-warc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/N0taN3rd%2Fnode-warc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/N0taN3rd%2Fnode-warc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/N0taN3rd%2Fnode-warc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/N0taN3rd","download_url":"https://codeload.github.com/
N0taN3rd/node-warc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228829051,"owners_count":17978142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chrome-remote-interface","pupeteer","warc","warc-files","web-archives","web-archiving","webarchive","webarchiving"],"created_at":"2024-08-17T08:05:13.433Z","updated_at":"2024-12-09T03:31:03.118Z","avatar_url":"https://github.com/N0taN3rd.png","language":"JavaScript","funding_links":[],"categories":["Tools \u0026 Software","others"],"sub_categories":["WARC I/O Libraries"],"readme":"# node-warc\nParse Web Archive (WARC) files or create WARC files using \n - [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface)\n - [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra) \n - [Puppeteer](https://github.com/GoogleChrome/puppeteer)\n - [Electron](https://electron.atom.io/)\n - [request](https://github.com/request/request)\n\n\nRun `npm install node-warc` or `yarn add node-warc` to get started\n\n[![npm Package](https://img.shields.io/npm/v/node-warc.svg?style=flat-square)](https://www.npmjs.com/package/node-warc)\n\n## Documentation\nFull documentation available at [n0tan3rd.github.io/node-warc](https://n0tan3rd.github.io/node-warc/)\n\n## Parsing\n\n### Using async iteration\n**Requires node 10 or greater**\n```js\nconst fs = require('fs')\nconst zlib = require('zlib')\n// recordIterator only exported if async iteration on readable streams is available\nconst { recordIterator } = 
require('node-warc')\n\nasync function iterateRecords (warcStream) {\n  for await (const record of recordIterator(warcStream)) {\n    console.log(record)\n  }\n}\n\niterateRecords(\n  fs.createReadStream('\u003cpath-to-gzipd-warcfile\u003e').pipe(zlib.createGunzip())\n).then(() =\u003e {\n  console.log('done')\n})\n```\n\nOr using one of the parsers\n```js\nconst { AutoWARCParser } = require('node-warc')\n\n// must run inside an async function\nfor await (const record of new AutoWARCParser('\u003cpath-to-warcfile\u003e')) {\n    console.log(record)\n}\n```\n\n### Using Stream Transform\n```js\nconst fs = require('fs')\nconst { WARCStreamTransform } = require('node-warc')\n\nfs\n  .createReadStream('\u003cpath-to-warcfile\u003e')\n  .pipe(new WARCStreamTransform())\n  .on('data', record =\u003e {\n    console.log(record)\n  })\n```\n\n### Both ``.warc`` and ``.warc.gz``\n```js\nconst { AutoWARCParser } = require('node-warc')\n\nconst parser = new AutoWARCParser('\u003cpath-to-warcfile\u003e')\nparser.on('record', record =\u003e { console.log(record) })\nparser.on('done', () =\u003e { console.log('finished') })\nparser.on('error', error =\u003e { console.error(error) })\nparser.start()\n```\n\n### Only gzip'd warc files\n```js\nconst { WARCGzParser } = require('node-warc')\n\nconst parser = new WARCGzParser('\u003cpath-to-gzipd-warcfile\u003e')\nparser.on('record', record =\u003e { console.log(record) })\nparser.on('done', () =\u003e { console.log('finished') })\nparser.on('error', error =\u003e { console.error(error) })\nparser.start()\n```\n\n### Only non gzip'd warc files\n```js\nconst { WARCParser } = require('node-warc')\n\nconst parser = new WARCParser('\u003cpath-to-warcfile\u003e')\nparser.on('record', record =\u003e { console.log(record) })\nparser.on('done', () =\u003e { console.log('finished') })\nparser.on('error', error =\u003e { console.error(error) })\nparser.start()\n```\n\n## WARC Creation \n\n### Environment\n* `NODEWARC_WRITE_GZIPPED` - enable writing gzipped records to WARC outputs.\n\n### Examples\n\n#### Using 
[chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface)\n\n```js\nconst CRI = require('chrome-remote-interface')\nconst { RemoteChromeWARCWriter, RemoteChromeCapturer } = require('node-warc')\n\n;(async () =\u003e {\n  const client = await CRI()\n  await Promise.all([\n    client.Page.enable(),\n    client.Network.enable(),\n  ])\n  const cap = new RemoteChromeCapturer(client.Network)\n  cap.startCapturing()\n  await client.Page.navigate({ url: 'http://example.com' });\n  // actual code should wait for a better stopping condition, eg. network idle\n  await client.Page.loadEventFired()\n  const warcGen = new RemoteChromeWARCWriter()\n  await warcGen.generateWARC(cap, client.Network, {\n    warcOpts: {\n      warcPath: 'myWARC.warc'\n    },\n    winfo: {\n      description: 'I created a warc!',\n      isPartOf: 'My awesome pywb collection'\n    }\n  })\n  await client.close()\n})()\n```\n\n#### Using [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra) \n```js\nconst { CRIExtra, Events, Page } = require('chrome-remote-interface-extra')\nconst { CRIExtraWARCGenerator, CRIExtraCapturer } = require('node-warc')\n\n;(async () =\u003e {\n  let client\n  try {\n    // connect to endpoint\n    client = await CRIExtra({ host: 'localhost', port: 9222 })\n    const page = await Page.create(client)\n    const cap = new CRIExtraCapturer(page, Events.Page.Request)\n    cap.startCapturing()\n    await page.goto('https://example.com', { waitUntil: 'networkIdle' })\n    const warcGen = new CRIExtraWARCGenerator()\n    await warcGen.generateWARC(cap, {\n      warcOpts: {\n        warcPath: 'myWARC.warc'\n      },\n      winfo: {\n        description: 'I created a warc!',\n        isPartOf: 'My awesome pywb collection'\n      }\n    })\n  } catch (err) {\n    console.error(err)\n  } finally {\n    if (client) {\n      await client.close()\n    }\n  }\n})()\n```\n\n#### Using 
[Puppeteer](https://github.com/GoogleChrome/puppeteer)\n```js\nconst puppeteer = require('puppeteer')\nconst { Events } = require('puppeteer')\nconst { PuppeteerWARCGenerator, PuppeteerCapturer } = require('node-warc')\n\n;(async () =\u003e {\n  const browser = await puppeteer.launch()\n  const page = await browser.newPage()\n  const cap = new PuppeteerCapturer(page, Events.Page.Request)\n  cap.startCapturing()\n  await page.goto('http://example.com', { waitUntil: 'networkidle0' })\n  const warcGen = new PuppeteerWARCGenerator()\n  await warcGen.generateWARC(cap, {\n    warcOpts: {\n      warcPath: 'myWARC.warc'\n    },\n    winfo: {\n      description: 'I created a warc!',\n      isPartOf: 'My awesome pywb collection'\n    }\n  })\n  await page.close()\n  await browser.close()\n})()\n```\n\n#### Note\nThe `generateWARC` method used in the preceding examples is a helper function that \nsimplifies the WARC generation process. See its implementation for a full example \nof WARC generation using node-warc.\n\nOr see one of the crawler implementations provided by [Squidwarc](https://github.com/N0taN3rd/Squidwarc/tree/master/lib/crawler).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FN0taN3rd%2Fnode-warc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FN0taN3rd%2Fnode-warc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FN0taN3rd%2Fnode-warc/lists"}