https://github.com/laurengarcia/url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.

Last synced: 18 days ago

Host: GitHub
URL: https://github.com/laurengarcia/url-metadata
Owner: laurengarcia
License: mit
Created: 2016-10-21T14:18:38.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2024-07-27T21:08:34.000Z (9 months ago)
Last Synced: 2024-12-20T09:30:19.071Z (4 months ago)
Language: JavaScript
Homepage: https://www.npmjs.com/package/url-metadata
Size: 164 KB
Stars: 169
Watchers: 7
Forks: 44
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Roadmap: ROADMAP.md

Awesome Lists containing this project

my-awesome-list - url-metadata

README

# url-metadata

Request a url and scrape the metadata from its HTML using Node.js or the browser. Has an alternate mode that lets you pass in your own `Response` object as well (see `Options`).

Includes:

- meta tags

- favicons

- citations, per the Google Scholar spec

- [Open Graph Protocol (og:) Tags](http://ogp.me/)

- [Twitter Card Tags](https://developer.twitter.com/en/docs/twitter-for-websites/cards/overview/markup)

- [JSON-LD](https://moz.com/blog/json-ld-for-beginners)

- h1-h6 tags

- img tags

- automatic charset detection & decoding (optional)

- the full response body as a string of html (optional)

v5.0.0+ protects against:

- infinite redirect loops

- SSRF attacks via `request-filtering-agent` in Node.js v18+ environments (custom options available)

More details in the `Returns` section below.

To report a bug or request a feature please open an issue or pull request in [GitHub](https://github.com/laurengarcia/url-metadata). Please read the `Troubleshooting` section below *before* filing a bug.

## Usage

Works with Node.js versions `>=18.0.0` or in the browser when bundled with Webpack or Parcel (see `/example-typescript`). Under the hood, this package does some post-request processing on top of the `fetch` API. If you don't have access to `fetch` in your target environment, use the previous version `2.5.0`, which relies on the (now-deprecated) `request` module.

Install in your project:

```
npm install url-metadata --save
```

In your project file:

```javascript
const urlMetadata = require('url-metadata');

// `await` needs an async context in CommonJS, so wrap the call in an async function
(async () => {
  try {
    const url = 'https://www.npmjs.com/package/url-metadata';
    const metadata = await urlMetadata(url);
    console.log(metadata);
  } catch (err) {
    console.log(err);
  }
})();
```

### Options & Defaults

The default options are the values below. To override the default options, pass in a second options argument.

```javascript

const options = {
  // custom request headers
  requestHeaders: {
    'User-Agent': 'url-metadata',
    'From': 'example@example.com'
  },

  // to prevent SSRF attacks, this default option blocks requests
  // to private network & reserved IP addresses
  // supported in Node.js v18+; other envs ignore silently
  // https://www.npmjs.com/package/request-filtering-agent
  requestFilteringAgentOptions: undefined,

  // `fetch` API cache setting for request
  cache: 'no-cache',

  // `fetch` mode (ex: 'cors', 'same-origin', etc)
  mode: 'cors',

  // maximum redirects in request chain, defaults to 10
  maxRedirects: 10,

  // fetch timeout in milliseconds, default is 10 seconds
  timeout: 10000,

  // charset to decode response with (ex: 'auto', 'utf-8', 'EUC-JP')
  // defaults to auto-detect in `Content-Type` header or meta tag
  // if none found, default `auto` option falls back to `utf-8`
  // override by passing in charset here (ex: 'windows-1251'):
  decode: 'auto',

  // number of characters to truncate description to
  descriptionLength: 750,

  // force image urls in selected tags to use https,
  // valid for images & favicons with full paths
  ensureSecureImageRequest: true,

  // return raw response body as string
  includeResponseBody: false,

  // alternate use-case: pass in `Response` object here to be parsed
  // see example below
  parseResponseObject: undefined
};

// Basic usage
// (`await` requires an async function or an ES module)
try {
  const url = 'https://www.npmjs.com/package/url-metadata';
  const metadata = await urlMetadata(url, options);
  console.log(metadata);
} catch (err) {
  console.log(err);
}

// Alternate use-case: parse a Response object instead
try {
  // fetch the url in your own code
  const response = await fetch('https://www.npmjs.com/package/url-metadata');
  // ... do other stuff with it...
  // pass the `response` object to be parsed for its metadata
  const metadata = await urlMetadata(null, { parseResponseObject: response });
  console.log(metadata);
} catch (err) {
  console.log(err);
}

// Similarly, if you have a string of html you can create
// a response object and pass the html string into it.
const html = `
  <head>
    <title>Metadata page</title>
  </head>
  <body>
    <h1>Metadata page</h1>
  </body>
`;

const response = new Response(html, {
  headers: {
    'Content-Type': 'text/html'
  }
});

const metadata = await urlMetadata(null, { parseResponseObject: response });
console.log(metadata);

```

### Returns

Returns a promise resolved with an object. Note that the `url` field returned will be the last hop in the request chain. If you pass in a url from a url shortener you'll get back the final destination as the `url`.
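
As a minimal sketch of how the last-hop behavior might be used, the helper below compares the requested url with the returned `url` field to tell whether the request was redirected. The name `wasRedirected` and the sample object are illustrative assumptions, not part of this package; in real code the object would come from `await urlMetadata(url)`.

```javascript
// Hypothetical helper (not part of url-metadata): reports whether the
// final `url` in the returned metadata differs from the one requested.
function wasRedirected(requestedUrl, metadata) {
  // Normalize both values through the WHATWG URL parser so that trivial
  // differences (such as a missing trailing slash) are not counted.
  const requested = new URL(requestedUrl).href;
  const finalUrl = new URL(metadata.url).href;
  return requested !== finalUrl;
}

// Sample object shaped like the returned metadata (values are made up).
const sampleMetadata = { url: 'https://example.com/articles/42', title: 'Example' };

console.log(wasRedirected('https://sho.rt/abc', sampleMetadata)); // true
console.log(wasRedirected('https://example.com/articles/42', sampleMetadata)); // false
```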

The returned `metadata` object consists of key/value pairs that are all strings, with a few exceptions:

- `favicons` returns an array of objects containing key/value pairs (strings)

- `jsonld` returns an array of objects

- all meta tags that begin with `citation_` (ex: `citation_author`) return with string keys and values that are arrays of strings. This conforms to the [Google Scholar spec](https://www.google.com/intl/en/scholar/inclusion.html#indexing), which allows multiple citation meta tags with different content values. So if the html contains:

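```
<meta name="citation_author" content="Arlitsch, Kenning">
<meta name="citation_author" content="OBrien, Patrick">
```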

... this module will return:

```

'citation_author': ["Arlitsch, Kenning", "OBrien, Patrick"],

```
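
Because most fields are strings while `citation_` fields are arrays, consuming code may want a single access path. The sketch below shows one possible approach; `citationValues` and the sample object are hypothetical names used only for illustration and are not provided by this package.

```javascript
// Hypothetical helper (not part of url-metadata): returns the values of a
// `citation_*` field as an array, or an empty array when it is absent.
function citationValues(metadata, field) {
  const value = metadata[`citation_${field}`];
  if (value === undefined || value === null) return [];
  // The README states these fields are arrays; wrapping a non-array value
  // is a defensive assumption, not documented behavior.
  return Array.isArray(value) ? value : [value];
}

// Sample object shaped like the returned metadata (values are made up,
// apart from the author names taken from the example above).
const paper = {
  title: 'Example paper',
  citation_author: ['Arlitsch, Kenning', 'OBrien, Patrick']
};

console.log(citationValues(paper, 'author').join('; '));
console.log(citationValues(paper, 'doi').length);
```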

A basic template for the returned metadata object can be found in `lib/metadata-fields.js`. Any additional meta tags found on the page are appended as new fields to the object.

### Troubleshooting

**Issue:** `Response status code 0` or `CORS errors`. The `fetch` request failed at either the network or protocol level. Possible causes:

- CORS errors. Try changing the mode option (ex: `cors`, `same-origin`, etc) or setting the `Access-Control-Allow-Origin` header on the server response from the url you are requesting if you have access to it.

- Trying to access an `https` resource that has an invalid certificate, or trying to access an `http` resource from a page with an `https` origin.

- A browser plugin such as an ad-blocker or privacy protector.

**Issue:** `fetch is not defined`. This error is thrown in a Node.js or browser environment that doesn't have the `fetch` method available. Try upgrading your environment (Node.js version `>=18.0.0`), or use an earlier version of this package (version 2.5.0).
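
If you would rather fail early with a clearer message, a feature check along these lines may help. This is only a sketch: `hasFetch` is a hypothetical helper, not something this package exports, and it assumes `fetch` is expected on the global scope.

```javascript
// Hypothetical helper (not part of url-metadata): true when the given
// scope (the global object by default) exposes a `fetch` function.
function hasFetch(scope = globalThis) {
  return typeof scope.fetch === 'function';
}

if (!hasFetch()) {
  console.warn(
    'No global fetch found: upgrade to Node.js >=18.0.0 or use url-metadata 2.5.0'
  );
}
```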

**Issue:** Request returns `404`, `403` errors or a CAPTCHA form. Your request may have been blocked by the server because it suspects you are a bot or scraper. Check [this list](https://dev.to/princepeterhansen/7-ways-to-avoid-getting-blocked-or-blacklisted-when-web-scraping-45ii) to ensure you're not triggering a block.
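
One adjustment sometimes worth trying is sending more descriptive request headers through the `requestHeaders` option. The sketch below merges custom headers into an options object without mutating it. The helper `withRequestHeaders`, the `User-Agent` string, and the `Accept-Language` value are illustrative assumptions, and whether any header change lifts a block depends entirely on the server.

```javascript
// Hypothetical helper (not part of url-metadata): returns a copy of
// `baseOptions` with `headers` merged into its `requestHeaders`.
function withRequestHeaders(baseOptions, headers) {
  return {
    ...baseOptions,
    requestHeaders: { ...(baseOptions.requestHeaders || {}), ...headers }
  };
}

// Subset of the default options listed in "Options & Defaults".
const baseOptions = {
  requestHeaders: { 'User-Agent': 'url-metadata', From: 'example@example.com' },
  timeout: 10000
};

// Example header values are placeholders; substitute your own.
const customOptions = withRequestHeaders(baseOptions, {
  'User-Agent': 'my-app/1.0 (+https://example.com/bot-info)',
  'Accept-Language': 'en-US,en;q=0.9'
});

console.log(customOptions.requestHeaders);
```

The resulting object would then be passed as the second argument, as in `urlMetadata(url, customOptions)`.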
