{"id":13401145,"url":"https://github.com/microlinkhq/metascraper","last_synced_at":"2025-05-13T15:09:12.205Z","repository":{"id":37406159,"uuid":"59617593","full_name":"microlinkhq/metascraper","owner":"microlinkhq","description":"Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.","archived":false,"fork":false,"pushed_at":"2025-04-27T21:12:06.000Z","size":29264,"stargazers_count":2441,"open_issues_count":12,"forks_count":174,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-05-06T11:58:01.586Z","etag":null,"topics":["metadata","parse","scrape"],"latest_commit_sha":null,"homepage":"https://metascraper.js.org","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microlinkhq.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"custom":"https://microlink.io/#pricing"}},"created_at":"2016-05-25T00:15:58.000Z","updated_at":"2025-05-06T04:01:57.000Z","dependencies_parsed_at":"2022-07-08T17:48:02.332Z","dependency_job_id":"508b33be-04b0-424f-81df-78d4c33f00db","html_url":"https://github.com/microlinkhq/metascraper","commit_stats":{"total_commits":1819,"total_committers":37,"mean_commits":49.16216216216216,"dds":0.07806487080813629,"last_synced_commit":"79bc161f7f2c1eab1c405a14e4dcca47befc61dc"},"previous_names":[],"tags_count":519,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microlinkhq%2Fmetascraper","tags_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/microlinkhq%2Fmetascraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microlinkhq%2Fmetascraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microlinkhq%2Fmetascraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microlinkhq","download_url":"https://codeload.github.com/microlinkhq/metascraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252707162,"owners_count":21791490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["metadata","parse","scrape"],"created_at":"2024-07-30T19:00:59.184Z","updated_at":"2025-05-13T15:09:07.185Z","avatar_url":"https://github.com/microlinkhq.png","language":"HTML","readme":"\u003ch1 align=\"center\"\u003e\n  \u003cbr\u003e\n  \u003cimg style=\"width: 500px; margin:3rem 0 1.5rem;\" src=\"https://metascraper.js.org/static/logo-banner.png\" alt=\"metascraper\"\u003e\n  \u003cbr\u003e\n  \u003cbr\u003e\n\u003c/h1\u003e\n\n![Last version](https://img.shields.io/github/tag/microlinkhq/metascraper.svg?style=flat-square)\n[![Coverage Status](https://img.shields.io/coveralls/microlinkhq/metascraper.svg?style=flat-square)](https://coveralls.io/github/microlinkhq/metascraper)\n[![NPM Status](https://img.shields.io/npm/dm/metascraper.svg?style=flat-square)](https://www.npmjs.org/package/metascraper)\n\n\u003e A library to easily get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and 
more.\n\n## What is it\n\nThe **metascraper** library allows you to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.\n\nIt follows a few principles:\n\n- Have a high accuracy for online articles by default.\n- Make it simple to add new rules or override existing ones.\n- Don't restrict rules to CSS selectors or text accessors.\n\n## Getting started\n\nLet's extract accurate information from the following website:\n\n![](https://i.imgur.com/jZl0Uej.png)\n\nFirst, **metascraper** expects you to provide the HTML markup behind the target URL.\n\nThere are multiple ways to get the HTML markup. In our case, we are going to run a programmatic headless browser to simulate real user navigation, so the data obtained will be close to a real-world example.\n\n```js\nconst getHTML = require('html-get')\n\n/**\n * `browserless` will be passed to `html-get`\n * as driver for getting the rendered HTML.\n */\nconst browserless = require('browserless')()\n\nconst getContent = async url =\u003e {\n  // create a browser context inside the main Chromium process\n  const browserContext = browserless.createContext()\n  const promise = getHTML(url, { getBrowserless: () =\u003e browserContext })\n  // close browser resources before returning the result\n  promise.then(() =\u003e browserContext).then(browser =\u003e browser.destroyContext())\n  return promise\n}\n\n/**\n * `metascraper` is a collection of tiny packages,\n * so you can just use what you actually need.\n */\nconst metascraper = require('metascraper')([\n  require('metascraper-author')(),\n  require('metascraper-date')(),\n  require('metascraper-description')(),\n  require('metascraper-image')(),\n  require('metascraper-logo')(),\n  require('metascraper-clearbit')(),\n  require('metascraper-publisher')(),\n  require('metascraper-title')(),\n  require('metascraper-url')()\n])\n\n/**\n * The main logic\n */\ngetContent('https://microlink.io')\n  
.then(metascraper)\n  .then(metadata =\u003e console.log(metadata))\n  .then(browserless.close)\n  .then(process.exit)\n```\n\nThe output will be something like:\n\n```json\n{\n  \"author\": \"Microlink HQ\",\n  \"date\": \"2022-07-10T22:53:04.856Z\",\n  \"description\": \"Enter a URL, receive information. Normalize metadata. Get HTML markup. Take a screenshot. Identify tech stack. Generate a PDF. Automate web scraping. Run Lighthouse\",\n  \"image\": \"https://cdn.microlink.io/logo/banner.jpeg\",\n  \"logo\": \"https://cdn.microlink.io/logo/trim.png\",\n  \"publisher\": \"Microlink\",\n  \"title\": \"Turns websites into data — Microlink\",\n  \"url\": \"https://microlink.io/\"\n}\n```\n\n## What data it detects\n\n\u003e **Note**: Custom metadata detection can be defined using a [rule bundle](#rules-bundles).\n\nHere is an example of the metadata that **metascraper** can detect:\n\n- `audio` — e.g. \u003csmall\u003e*ht\u003cspan\u003etps://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3*\u003c/small\u003e\u003cbr/\u003e\n  An audio URL that best represents the article.\n\n- `author` — e.g. \u003csmall\u003e*Noah Kulwin*\u003c/small\u003e\u003cbr/\u003e\n  A human-readable representation of the author's name.\n\n- `date` — e.g. \u003csmall\u003e*2016-05-27T00:00:00.000Z*\u003c/small\u003e\u003cbr/\u003e\n  An [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) representation of the date the article was published.\n\n- `description` — e.g. \u003csmall\u003e*Venture capitalists are raising money at the fastest rate...*\u003c/small\u003e\u003cbr/\u003e\n  The publisher's chosen description of the article.\n\n- `video` — e.g. \u003csmall\u003e*ht\u003cspan\u003etps://assets.entrepreneur.com/content/preview.mp4*\u003c/small\u003e\u003cbr/\u003e\n  A video URL that best represents the article.\n\n- `image` — e.g. 
\u003csmall\u003e*ht\u003cspan\u003etps://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg*\u003c/small\u003e\u003cbr/\u003e\n  An image URL that best represents the article.\n\n- `lang` — e.g. \u003csmall\u003e*en*\u003c/small\u003e\u003cbr/\u003e\n  An [ISO 639-1](https://en.wikipedia.org/wiki/ISO_639-1) representation of the URL content language.\n\n- `logo` — e.g. \u003csmall\u003e*ht\u003cspan\u003etps://entrepreneur.com/favicon180x180.png*\u003c/small\u003e\u003cbr/\u003e\n  An image URL that best represents the publisher brand.\n\n- `publisher` — e.g. \u003csmall\u003e*Fast Company*\u003c/small\u003e\u003cbr/\u003e\n  A human-readable representation of the publisher's name.\n\n- `title` — e.g. \u003csmall\u003e*Meet Wall Street's New A.I. Sheriffs*\u003c/small\u003e\u003cbr/\u003e\n  The publisher's chosen title of the article.\n\n- `url` — e.g. \u003csmall\u003e*ht\u003cspan\u003etp://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion*\u003c/small\u003e\u003cbr/\u003e\n  The URL of the article.\n\n## How it works\n\n**metascraper** is built out of rules bundles.\n\nIt was designed to be easy to adapt. You can compose your own transformation pipeline using existing rules or write your own.\n\nA rules bundle is a collection of HTML selectors for a specific property. When you load the library, it implicitly loads the [core rules](#core-rules).\n\nEach set of rules loads a set of selectors in order to get a specific value.\n\nThese rules are sorted by priority: the first rule that successfully resolves the value stops the remaining rules for that property from running. 
Rules are sorted intentionally from specific to more generic.\n\nRules work as fallbacks for one another:\n\n- If the first rule fails, it falls back to the second rule.\n- If the second rule fails, it falls back to the third rule.\n- etc.\n\n**metascraper** does this until a rule resolves the value or all the rules have been tried.\n\n## Importing rules\n\n**metascraper** exports a constructor that needs to be initialized with the collection of rules to load:\n\n```js\nconst metascraper = require('metascraper')([\n  require('metascraper-author')(),\n  require('metascraper-date')(),\n  require('metascraper-description')(),\n  require('metascraper-image')(),\n  require('metascraper-logo')(),\n  require('metascraper-clearbit')(),\n  require('metascraper-publisher')(),\n  require('metascraper-title')(),\n  require('metascraper-url')()\n])\n```\n\nAgain, the order in which rules are loaded is important: only the first rule that resolves the value will be applied.\n\nUse the first parameter to pass custom options specific to each rules bundle:\n\n```js\nconst metascraper = require('metascraper')([\n  require('metascraper-clearbit')({\n    size: 256,\n    format: 'jpg'\n  })\n])\n```\n\n## Rules bundles\n\n?\u003e Can't find the rules bundle that you want? 
Let's [open an issue](https://github.com/microlinkhq/metascraper/issues/new) to create it.\n\n### Official\n\n\u003e Rules bundles maintained by metascraper maintainers.\n\n**Core essential**\n\n- [metascraper-audio](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-audio) – Get audio property from HTML markup.\n- [metascraper-author](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-author) – Get author property from HTML markup.\n- [metascraper-date](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-date) – Get date property from HTML markup.\n- [metascraper-description](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-description) – Get description property from HTML markup.\n- [metascraper-feed](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-feed) – Get RSS/Atom feed URL from HTML markup.\n- [metascraper-image](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-image) – Get image property from HTML markup.\n- [metascraper-iframe](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-iframe) – Get iframe for embedding content for the supported providers.\n- [metascraper-lang](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-lang) – Get lang property from HTML markup.\n- [metascraper-logo](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-logo) – Get logo property from HTML markup.\n- [metascraper-logo-favicon](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-logo-favicon) – Metascraper logo favicon fallback.\n- [metascraper-media-provider](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-media-provider) – Get specific video provider url (Facebook/Twitter/Vimeo/etc).\n- [metascraper-publisher](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-publisher) – 
Get publisher property from HTML markup.\n- [metascraper-readability](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-readability) – A Mozilla readability connector for metascraper.\n- [metascraper-title](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-title) – Get title property from HTML markup.\n- [metascraper-url](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-url) – Get url property from HTML markup.\n- [metascraper-video](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-video) – Get video property from HTML markup.\n\n**Vendor specific**\n\n- [metascraper-amazon](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-amazon) – Metascraper integration with Amazon.\n- [metascraper-clearbit](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-clearbit) – Metascraper integration with Clearbit Logo API.\n- [metascraper-instagram](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-instagram) –  Metascraper integration for Instagram.\n- [metascraper-manifest](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-manifest) –  Metascraper integration for detecting PWA Web app [manifests](https://developer.mozilla.org/en-US/docs/Web/Manifest).\n- [metascraper-soundcloud](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-soundcloud) – Metascraper integration with SoundCloud.\n- [metascraper-telegram](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-telegram) – Metascraper integration with Telegram.\n- [metascraper-uol](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-uol) – Metascraper integration for uol.com URLs.\n- [metascraper-spotify](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-spotify) – Metascraper integration with Spotify.\n- 
[metascraper-x](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-x) – Metascraper integration with x.com.\n- [metascraper-youtube](https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper-youtube) – Metascraper integration with YouTube.\n\n### Community\n\n\u003e Rules bundles maintained by individual users.\n\n- [metascraper-address](https://github.com/goodhood-eu/metascraper-address) – Get schema.org formatted address.\n- [metascraper-shopping](https://github.com/samirrayani/metascraper-shopping) – Get product information from HTML markup on merchant websites.\n\nSee [CONTRIBUTING](/CONTRIBUTING.md) for adding your own module!\n\n## API\n\n### constructor(rules)\n\nCreate a new **metascraper** instance, explicitly declaring the rules bundles to use.\n\n#### rules\n\nType: `Array`\n\nThe collection of rules bundles to load.\n\n### metascraper(options)\n\nCall the instance to extract content based on the rules bundles provided to the constructor.\n\n#### options\n\n##### url\n\n*Required*\u003cbr\u003e\nType: `String`\n\nThe URL associated with the HTML markup.\n\nIt is used to resolve relative links that can be present in the HTML markup.\n\nIt can also be used as a fallback field for different rules.\n\n##### html\n\nType: `String`\n\nThe HTML markup for extracting the content.\n\n##### htmlDom\n\nType: `object`\n\nThe DOM representation of the HTML markup. When it's not provided, it's derived from the `html` parameter.\n\n#### rules\n\nType: `Array`\n\nYou can pass additional rules to add at execution time. 
\n\nThese rules will be merged with your loaded [rules](#rules) at the beginning.\n\n#### validateUrl\n\nType: `boolean`\u003cbr\u003e\nDefault: `true`\n\nEnsure the provided URL is valid according to the [WHATWG URL](https://nodejs.org/api/url.html#url_the_whatwg_url_api) API.\n\n## Environment Variables\n\n#### METASCRAPER_RE2\n\nType: `boolean`\u003cbr\u003e\nDefault: `true`\n\nIt attempts to load re2 to use instead of RegExp.\n\n## Benchmark\n\nTo give you an idea of how accurate **metascraper** is, here is a comparison of similar libraries:\n\n| Library   | [metascraper](https://www.npmjs.com/package/metascraper) | [html-metadata](https://www.npmjs.com/package/html-metadata) | [node-metainspector](https://www.npmjs.com/package/node-metainspector) | [open-graph-scraper](https://www.npmjs.com/package/open-graph-scraper) | [unfluff](https://www.npmjs.com/package/unfluff) |\n|:----------|:-----------------------------------------------------------|:---------------------------------------------------------------|:-------------------------------------------------------------------------|:-------------------------------------------------------------------------|:---------------------------------------------------|\n| Correct   | **95.54%**                                                 | **74.56%**                                                     | **61.16%**                                                               | **66.52%**                                                               | **70.90%**                                         |\n| Incorrect | 1.79%                                                      | 1.79%                                                          | 0.89%                                                                    | 6.70%                                                                    | 10.27%                                             |\n| Missed    | 2.68%                                                      | 
23.67%                                                         | 37.95%                                                                   | 26.34%                                                                   | 8.95%                                              |\n\nA big part of the reason for **metascraper**'s higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.\n\n**metascraper**'s default settings are targeted specifically at parsing online articles, which is why it's able to be more highly-tuned than the other libraries for that purpose.\n\nIf you're interested in the breakdown by individual pieces of metadata, check out the [full comparison summary](/bench), or dive into the [raw result data for each library](/bench/results).\n\n## License\n\n**metascraper** © [Microlink](https://microlink.io), released under the [MIT](https://github.com/microlinkhq/metascraper/blob/master/LICENSE.md) License.\u003cbr\u003e\nAuthored and maintained by [Microlink](https://microlink.io) with help from [contributors](https://github.com/microlinkhq/metascraper/contributors).\n\n\u003e [microlink.io](https://microlink.io) · GitHub [microlinkhq](https://github.com/microlinkhq) · X [@microlinkhq](https://x.com/microlinkhq)\n","funding_links":["https://microlink.io/#pricing"],"categories":["HTML","Repository","others"],"sub_categories":["HTTP"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrolinkhq%2Fmetascraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrolinkhq%2Fmetascraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrolinkhq%2Fmetascraper/lists"}