{"id":16255227,"url":"https://github.com/jdesboeufs/plunger","last_synced_at":"2025-06-20T00:05:46.531Z","repository":{"id":34563995,"uuid":"38509830","full_name":"jdesboeufs/plunger","owner":"jdesboeufs","description":"Powerful link analyzer","archived":false,"fork":false,"pushed_at":"2023-08-31T21:28:43.000Z","size":1449,"stargazers_count":4,"open_issues_count":5,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-02T01:29:02.492Z","etag":null,"topics":["analysis","archives","atom","crawling","http","indexof","inspector","nodejs"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jdesboeufs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2015-07-03T20:40:21.000Z","updated_at":"2023-08-31T20:03:08.000Z","dependencies_parsed_at":"2023-08-31T22:10:39.982Z","dependency_job_id":null,"html_url":"https://github.com/jdesboeufs/plunger","commit_stats":null,"previous_names":["jdesboeufs/plunger","geodatagouv/plunger"],"tags_count":27,"template":false,"template_full_name":null,"purl":"pkg:github/jdesboeufs/plunger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdesboeufs%2Fplunger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdesboeufs%2Fplunger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdesboeufs%2Fplunger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdesboeufs%2Fplunger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jdesboeufs","download_url":"https:
//codeload.github.com/jdesboeufs/plunger/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdesboeufs%2Fplunger/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260183301,"owners_count":22971204,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","archives","atom","crawling","http","indexof","inspector","nodejs"],"created_at":"2024-10-10T15:28:57.118Z","updated_at":"2025-06-20T00:05:41.518Z","avatar_url":"https://github.com/jdesboeufs.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# plunger\n\n\u003e Powerful link analyzer\n\n[![npm version](https://badgen.net/npm/v/plunger)](https://www.npmjs.com/package/plunger)\n[![XO code style](https://badgen.net/badge/code%20style/XO/cyan)](https://github.com/xojs/xo)\n\n`plunger` analyzes a URL or a local path and recursively builds a tree of the files it contains or links to. It can skip files that haven’t changed since the last check, based on either their modification date or a specific ETag. All of it is configurable.\n\nThe analyzed files are downloaded to temporary locations on your system. 
It’s up to you to process them and to clean up those locations afterwards.\n\n## Usage\n\n`plunger` can be used in your project or as a CLI (with limited configuration support).\n\n### Requirements\n\n* [Node.js](https://nodejs.org) \u003e= 18.0\n* [unar](https://theunarchiver.com/command-line)\n\n\n### Installation\n\nYou can add `plunger` to your project by running:\n\n```shell\n$ npm install --save plunger\n```\n\nOr if you’re using Yarn:\n\n```shell\n$ yarn add plunger\n```\n\n## Analyzing\n\nTwo types of resources can be identified: files and containers. Containers either contain other resources or link to other resources.\n\nThere are two types of analyzers: `http` and `path`.\nA URL first goes through the `http` analyzers, which determine whether the resource is an `http` container; files are then downloaded and go through the `path` analyzers, which identify `path` containers.\n\nSupported container types:\n- Directories (`path`)\n- Archives (`path`)\n- *Index of* pages (`http`)\n- Atom feeds (`http`)\n\nEverything that is not matched as a container will be a file. 
Containers expose an array of child resources, which can be either containers or files.\n\nA tree is then built recursively, following that principle.\n\n## API\n\n`plunger` only exposes one function: `analyzeLocation(location, options)`.\n\nThis function builds a complete tree of all the items found at `location`.\n\nFor example, analyzing `http://example.org/index.html` would yield something like the following:\n\n```js\n{ inputType: 'http',\n  url: 'http://example.org/index.html',\n  statusCode: 200,\n  redirectUrls: [],\n  finalUrl: 'http://example.org/index.html',\n  etag: '\"359670651+gzip\"',\n  lastModified: 'Fri, 09 Aug 2013 23:54:35 GMT',\n  cacheControl: 'max-age=604800',\n  fileName: 'index.html',\n  fileTypes:\n   [ { ext: 'html', mime: 'text/html', source: 'http:content-type' },\n     { ext: 'html', mime: 'text/html', source: 'http:filename' },\n     { ext: 'html', mime: 'text/html', source: 'path:filename' } ],\n  temporary: '/var/folders/wb/4xx5dj9j0r12lym3mgxhj0l00000gn/T/plunger_nk443a',\n  fileSize: 1270,\n  digest: 'sha384-bo7Rewmo/VHAS0xEa1JGwfNQAKfP42gfnoF9DM3grWq+0TT4ygQ+4P4NJLNBFBI/',\n  path: '/var/folders/wb/4xx5dj9j0r12lym3mgxhj0l00000gn/T/plunger_nk443a/index.html',\n  type: 'file',\n  analyzed: true }\n```\n\n- `location` is a string: either a path on your filesystem or a URL.\n- `options` is an object of options:\n\n| option          | default value | type    | description |\n|-----------------|---------------|---------|----------|\n| etag            | `null`        | String  | Will be set to the `If-None-Match` HTTP header |\n| lastModified    | `null`        | String or Date | `location`’s last modification date, will be set to the `If-Modified-Since` HTTP header |\n| userAgent       | plunger/1.0   | String  | User agent, will be set to the `User-Agent` HTTP header |\n| timeout         | `{connection: 2000, activity: 4000, download: 0}` | Object | See timeouts section |\n| cache           | `null` | Object | 
See caching section |\n| maxDownloadSize | 100 * 1024 * 1024 | Number | Max size, in bytes, before the download of a file is interrupted |\n| digestAlgorithm | sha384        | String  | Algorithm which file digests are computed with |\n| extractArchives | `true`        | Boolean | Disable to stop extracting archives |\n| indexOfMatches  | `[/Directory of/, /Index of/, /Listing of/]` | RegExp[] | Array of regular expressions used to match *Index of*-type pages |\n| logger          | defaultLogger based on `debug` | Object  | Define a logger with a `log(event, token)` method |\n\nIt returns a `Promise` to the root tree node.\n\n#### Timeouts\n\nThere are 3 configurable timeouts:\n\n- `connection`: timeout before an HTTP/HTTPS connection can be established, defaults to `2000ms`.\n- `activity`: timeout between 2 data chunks received from the server, defaults to `4000ms`.\n- `download`: timeout for the whole file to be downloaded, defaults to `0` (disabled).\n\nAll timeouts can be disabled by setting them to 0.\n\n#### Caching\n\n##### URL Cache\n\nIt is possible to pass callbacks that retrieve information about previous URL checks in order to avoid unnecessary downloads. This is done using the `cache.getUrlCache(token)` and `cache.setUrlCache(token)` options of `analyzeLocation()`.\n\n`cache.getUrlCache` should return an object of options that will override the options passed to `analyzeLocation()`. It is typically useful to set a `lastModified` and an `etag` property.\n\n`cache.setUrlCache` is where you save information about an analyzed URL to your own cache.\n\n##### File Cache\n\nYou can also pass a callback to match a file’s digest against a database, in order to stop processing the file if it hasn’t changed. 
For example, it makes sense to skip extracting an archive and analyzing its contents if the archive hasn’t changed.\n\nThis is done using the `cache.getFileCache(token)` option of `analyzeLocation()`.\n\n`cache.getFileCache` should return a `Boolean` indicating whether the file is in the cache. Return `true` to stop further analysis.\n\n\n##### Example\n\n```js\nasync function getUrlCache(token) {\n  const cache = await db.getByUrl(token.url)\n\n  console.log(cache ? 'HIT' : 'MISS', token.url)\n\n  // On a miss there is nothing to override (assumes an empty object is accepted)\n  if (!cache) {\n    return {}\n  }\n\n  return {\n    etag: cache.etag,\n    lastModified: cache.lastModified\n  }\n}\n```\n\n```js\nasync function setUrlCache(token) {\n  const urls = [...token.redirectUrls, token.finalUrl]\n\n  for (const url of urls) {\n    await db.create({\n      url,\n      etag: token.etag,\n      lastModified: token.lastModified\n    })\n    console.log('SAVE', url)\n  }\n}\n```\n\n```js\nasync function getFileCache(token) {\n  const cache = await db.findFileFromToken(token) // Magic\n\n  // Guard against a missing entry before comparing digests\n  return cache ? cache.digest === token.digest : false\n}\n```\n\n```js\nconst {analyzeLocation} = require('plunger')\n\n// Run inside an async function: CommonJS has no top-level await\nconst tree = await analyzeLocation('http://example.com', {\n  cache: {\n    getUrlCache,\n    setUrlCache,\n    getFileCache\n  }\n})\n```\n\n##### Options\n\nAll three cache functions are also called with the options passed to `analyzeLocation`, which lets you use the logger in them, for example.\n\n```js\nasync function getUrlCache(token, options) {\n  options.logger.log('check for url cache', token)\n}\n```\n\n#### Example usage:\n\n```js\nconst {analyzeLocation} = require('plunger')\n\n// Run inside an async function: CommonJS has no top-level await\nconst tree = await analyzeLocation('http://example.org/', {\n  digestAlgorithm: 'md5' // No fear\n})\n\nconsole.log(tree.digest)\n```\n\n## License\n\nMIT\n\n## Miscellaneous\n\n```\n    ╚⊙ ⊙╝\n  ╚═(███)═╝\n ╚═(███)═╝\n╚═(███)═╝\n ╚═(███)═╝\n  ╚═(███)═╝\n   
╚═(███)═╝\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdesboeufs%2Fplunger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjdesboeufs%2Fplunger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdesboeufs%2Fplunger/lists"}