{"id":13484993,"url":"https://github.com/BetaHuhn/metadata-scraper","last_synced_at":"2025-03-27T17:30:44.602Z","repository":{"id":36969232,"uuid":"312401277","full_name":"BetaHuhn/metadata-scraper","owner":"BetaHuhn","description":"🏷️ A JavaScript library for scraping/parsing metadata from a web page.","archived":false,"fork":false,"pushed_at":"2025-03-25T03:35:45.000Z","size":1048,"stargazers_count":119,"open_issues_count":5,"forks_count":18,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-25T13:02:06.062Z","etag":null,"topics":["html-scraper","javascript-library","meta-tags","metadata","metadata-extraction","metatags","open-graph","page","parser","typescript"],"latest_commit_sha":null,"homepage":"https://mxis.ch","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BetaHuhn.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"ko_fi":"betahuhn","github":"betahuhn"}},"created_at":"2020-11-12T21:34:00.000Z","updated_at":"2025-03-11T02:19:38.000Z","dependencies_parsed_at":"2024-01-13T19:20:09.771Z","dependency_job_id":"cfdee3c1-837d-40c7-b02c-87688697d406","html_url":"https://github.com/BetaHuhn/metadata-scraper","commit_stats":{"total_commits":281,"total_committers":7,"mean_commits":"40.142857142857146","dds":0.5266903914590748,"last_synced_commit":"16661155773bf87679d80b94e0c980831fc368cb"},"previous_names":[],"tags_count":64,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BetaHuhn%2Fm
etadata-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BetaHuhn%2Fmetadata-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BetaHuhn%2Fmetadata-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BetaHuhn%2Fmetadata-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BetaHuhn","download_url":"https://codeload.github.com/BetaHuhn/metadata-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245467620,"owners_count":20620215,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["html-scraper","javascript-library","meta-tags","metadata","metadata-extraction","metatags","open-graph","page","parser","typescript"],"created_at":"2024-07-31T17:01:41.901Z","updated_at":"2025-03-27T17:30:44.135Z","avatar_url":"https://github.com/BetaHuhn.png","language":"TypeScript","funding_links":["https://ko-fi.com/betahuhn","https://github.com/sponsors/betahuhn","https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick\u0026hosted_button_id=394RTSBEEEFEE"],"categories":["TypeScript"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\t\n# metadata-scraper\n\n[![GitHub](https://img.shields.io/github/license/mashape/apistatus.svg)](https://github.com/BetaHuhn/metadata-scraper/blob/master/LICENSE) ![David](https://img.shields.io/david/betahuhn/metadata-scraper) [![npm](https://img.shields.io/npm/v/metadata-scraper)](https://www.npmjs.com/package/metadata-scraper)\n\nA Javascript 
library for scraping/parsing metadata from a web page.\n\n\u003c/div\u003e\n\n## 👋 Introduction\n\n[metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) is a Javascript library which scrapes/parses metadata from web pages. You only need to supply it with a URL or an HTML string and it will use different rules to find the most relevant metadata like:\n\n- Title\n- Description\n- Favicons/Images\n- Language\n- Keywords\n- Author\n- and more (full list [below](#-all-metadata))\n\n## 🚀 Get started\n\nInstall [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) via npm:\n```shell\nnpm install metadata-scraper\n```\n\n## 📚 Usage\n\nImport `metadata-scraper` and pass it a URL or options object:\n\n```js\nconst getMetaData = require('metadata-scraper')\n\nconst url = 'https://github.com/BetaHuhn/metadata-scraper'\n\ngetMetaData(url).then((data) =\u003e {\n\tconsole.log(data)\n})\n```\n\nOr with `async`/`await`:\n\n```js\nconst getMetaData = require('metadata-scraper')\n\nasync function run() {\n\tconst url = 'https://github.com/BetaHuhn/metadata-scraper'\n\tconst data = await getMetaData(url)\n\tconsole.log(data)\n}\n\nrun()\n```\n\nThis will return:\n\n```js\n{\n\ttitle: 'BetaHuhn/metadata-scraper',\n\tdescription: 'A Javascript library for scraping/parsing metadata from a web page.',\n\tlanguage: 'en',\n\turl: 'https://github.com/BetaHuhn/metadata-scraper',\n\tprovider: 'GitHub',\n\ttwitter: '@github',\n\timage: 'https://avatars1.githubusercontent.com/u/51766171?s=400\u0026v=4',\n\ticon: 'https://github.githubassets.com/favicons/favicon.svg'\n}\n```\n\nYou can see a list of all metadata which [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) tries to scrape [below](#-all-metadata).\n\n## ⚙️ Configuration\n\nYou can change the behaviour of [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) by passing an options object:\n\n```js\nconst getMetaData = require('metadata-scraper')\n\nconst options = {\n\turl: 
'https://github.com/BetaHuhn/metadata-scraper', // URL of web page\n\tmaxRedirects: 0, // Maximum number of redirects to follow (default: 5)\n\tua: 'MyApp', // Specify User-Agent header\n\tlang: 'de-CH', // Specify Accept-Language header\n\ttimeout: 1000, // Request timeout in milliseconds (default: 10000ms)\n\tforceImageHttps: false, // Force all image URLs to use https (default: true)\n\tcustomRules: {} // more info below\n}\n\ngetMetaData(options).then((data) =\u003e {\n\tconsole.log(data)\n})\n```\n\nYou can specify the URL by either passing it as the first parameter, or by setting it in the options object.\n\n## 📖 Examples\n\nHere are some examples of how to use [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper):\n\n### Basic\n\nPass a URL as the first parameter and [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) automatically scrapes it and returns everything it finds:\n\n```js\nconst getMetaData = require('metadata-scraper')\nconst data = await getMetaData('https://github.com/BetaHuhn/metadata-scraper')\n```\n\nExample file located at [examples/basic.js](/examples/basic.js).\n\n---\n\n### HTML String\n\nIf you already have an HTML string and don't want [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) to make an HTTP request, specify it in the options object:\n\n```js\nconst getMetaData = require('metadata-scraper')\n\nconst html = `\n\t\u003cmeta name=\"og:title\" content=\"Example\"\u003e\n\t\u003cmeta name=\"og:description\" content=\"This is an example.\"\u003e\n`\n\nconst options = {\n\thtml: html,\n\turl: 'https://example.com' // Optional URL to make relative image paths absolute\n}\n\nconst data = await getMetaData(options)\n```\n\nExample file located at [examples/html.js](/examples/html.js).\n\n---\n\n### Custom Rules\n\nLook at the `rules.ts` file in the `src` directory to see all rules which will be used.\n\nYou can expand [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) easily by 
specifying custom rules:\n\n```js\nconst getMetaData = require('metadata-scraper')\n\nconst options = {\n\turl: 'https://github.com/BetaHuhn/metadata-scraper',\n\tcustomRules: {\n\t\tname: {\n\t\t\trules: [\n\t\t\t\t[ 'meta[name=\"customName\"][content]', (element) =\u003e element.getAttribute('content') ]\n\t\t\t],\n\t\t\tprocessor: (text) =\u003e text.toLowerCase()\n\t\t}\n\t}\n}\n\nconst data = await getMetaData(options)\n```\n\n`customRules` needs to contain one or more objects, where the key (`name` above) will identify the value in the returned data.\n\nYou can then specify one or more rules in the `rules` array.\n\nIn each rule, the first item is the query which gets inserted into the browser's `querySelector` function, and the second item is a function which gets passed the HTML element:\n\n```js\n[ 'querySelector', (element) =\u003e element.innerText ]\n```\n\nYou can also specify a `processor` function which will process/transform the result of one of the matched rules:\n\n```js\n{\n\tprocessor: (text) =\u003e text.toLowerCase()\n}\n```\n\nIf you find a useful rule, let me know and I will add it (or create a PR yourself).\n\nExample file located at [examples/custom.js](/examples/custom.js).\n\n## 📇 All metadata\n\nHere's what [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) currently tries to scrape:\n\n```js\n{\n\ttitle: 'Title of page or article',\n\tdescription: 'Description of page or article',\n\tlanguage: 'Language of page or article',\n\ttype: 'Page type',\n\turl: 'URL of page',\n\tprovider: 'Page provider',\n\tkeywords: ['array', 'of', 'keywords'],\n\tsection: 'Section/Category of page',\n\tauthor: 'Article author',\n\tpublished: 1605221765, // Date the article was published\n\tmodified: 1605221765, // Date the article was modified\n\trobots: ['array', 'for', 'robots'],\n\tcopyright: 'Page copyright',\n\temail: 'Contact email',\n\ttwitter: 'Twitter handle',\n\tfacebook: 'Facebook account id',\n\timage: 'Image URL',\n\ticon: 'Favicon 
URL',\n\tvideo: 'Video URL',\n\taudio: 'Audio URL'\n}\n```\n\nIf you find a useful metatag, let me know and I will add it (or create a PR yourself).\n\n## 💻 Development\n\nIssues and PRs are very welcome!\n\nPlease check out the [contributing guide](CONTRIBUTING.md) before you start.\n\nThis project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). To see differences with previous versions refer to the [CHANGELOG](CHANGELOG.md).\n\n## ❔ About\n\nThis library was developed by me ([@betahuhn](https://github.com/BetaHuhn)) in my free time. If you want to support me:\n\n[![Donate via PayPal](https://img.shields.io/badge/paypal-donate-009cde.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick\u0026hosted_button_id=394RTSBEEEFEE)\n\n### Credits\n\nThis library is based on Mozilla's [page-metadata-parser](https://github.com/mozilla/page-metadata-parser). I converted it to TypeScript, implemented a few new features, and added more rules.\n\n## License\n\nCopyright 2020 Maximilian Schiller\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBetaHuhn%2Fmetadata-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBetaHuhn%2Fmetadata-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBetaHuhn%2Fmetadata-scraper/lists"}