{"id":36503138,"url":"https://github.com/sc10ntech/site-metadata-extractor","last_synced_at":"2026-01-12T02:26:07.861Z","repository":{"id":64590440,"uuid":"202796165","full_name":"sc10ntech/site-metadata-extractor","owner":"sc10ntech","description":"Cleans and extracts a web resource's metadata","archived":false,"fork":false,"pushed_at":"2025-10-28T21:03:03.000Z","size":2925,"stargazers_count":2,"open_issues_count":13,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-11-27T10:37:07.153Z","etag":null,"topics":["extractor","metadata","metadata-extraction","opengraph","webpage-extractor"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sc10ntech.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-16T20:36:51.000Z","updated_at":"2025-07-27T22:27:52.000Z","dependencies_parsed_at":"2023-12-29T00:25:41.605Z","dependency_job_id":"fc3b2fdb-1324-4637-9c09-a9105bc022f2","html_url":"https://github.com/sc10ntech/site-metadata-extractor","commit_stats":{"total_commits":153,"total_committers":6,"mean_commits":25.5,"dds":0.5294117647058824,"last_synced_commit":"0826a2d98b70dd4d566dbb4295833583eeaadf30"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/sc10ntech/site-metadata-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sc10ntech%2Fsite-metadata-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sc10ntech%2Fsi
te-metadata-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sc10ntech%2Fsite-metadata-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sc10ntech%2Fsite-metadata-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sc10ntech","download_url":"https://codeload.github.com/sc10ntech/site-metadata-extractor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sc10ntech%2Fsite-metadata-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28332401,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T00:36:25.062Z","status":"online","status_checked_at":"2026-01-12T02:00:08.677Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extractor","metadata","metadata-extraction","opengraph","webpage-extractor"],"created_at":"2026-01-12T02:26:07.420Z","updated_at":"2026-01-12T02:26:07.856Z","avatar_url":"https://github.com/sc10ntech.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Site Metadata Extractor\n\nCleans and extracts a web(site) resource's metadata.\n\nMetadata extraction fields currently supported:\n\n| Name                     | Data Type      |\n| ------------------------ | -------------- |\n| author                   | array (jsonb)  |\n| 
canonical_url            | string         |\n| copyright                | string         |\n| date (publish date)      | date           |\n| description              | text           |\n| favicon                  | text           |\n| image (primary/og image) | text           |\n| jsonld (structured data) | object (jsonb) |\n| keywords                 | array (jsonb)  |\n| lang                     | string         |\n| locale                   | string         |\n| origin                   | string         |\n| publisher                | string         |\n| site_name                | string         |\n| tags                     | array (jsonb)  |\n| title                    | string         |\n| type                     | string         |\n| truncated_text           | text           |\n| status                   | string         |\n| videos                   | array (jsonb)  |\n| links                    | array (jsonb)  |\n\n## Install\n\nNPM:\n\n```bash\n$ npm install site-metadata-extractor --save\n```\n\nYarn:\n\n```bash\n$ yarn add site-metadata-extractor\n```\n\n## Usage\n\nFeed in raw markup from a webpage to get the extracted metadata fields.\n\n**From `.html` file:**\n\n```js\nimport fs from \"fs\";\nimport path from \"path\";\nimport siteMetadataExtractor from \"site-metadata-extractor\";\n\nconst getMetadataFromFile = (filename) =\u003e {\n  const filepath = path.resolve(__dirname, `../data/${filename}.html`);\n  const markup = fs.readFileSync(filepath).toString();\n  // feel free to use localhost as the second parameter for testing\n  const metadata = siteMetadataExtractor(markup, \"YOUR_SITE_ORIGIN_HERE\");\n  return metadata;\n};\n\ngetMetadataFromFile(\"example\");\n```\n\n**From a server request:**\n\n```js\nimport axios from 'axios';\nimport siteMetadataExtractor from 'site-metadata-extractor';\n\nconst processSite = async (url) =\u003e {\n  return axios.get(url)\n    .then(res =\u003e {\n      const { headers } = res;\n      const contentType = 
headers['content-type'];\n      if (contentType.includes('text/html')) {\n        return {\n          body: res.data,\n          url\n        };\n      } else {\n        return {};\n      }\n    })\n    .catch(err =\u003e {\n      console.log(err);\n    });\n};\n\nprocessSite('https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/')\n  .then((data) =\u003e {\n    // ...\n    // pass the raw markup (data.body), not the wrapper object\n    siteMetadataExtractor(data.body, \"https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/\", \"en\");\n    // ...\n  });\n```\n\n## Development\n\n1. Run: `git clone https://github.com/sc10ntech/site-metadata-extractor.git`\n2. Change into project directory and install deps: `cd site-metadata-extractor \u0026\u0026 npm i`\n\n## Credits \u0026 Disclaimer\n\nsite-metadata-extractor was inspired by, and tries to be the spiritual successor to, [node-unfluff](https://github.com/ageitgey/node-unfluff).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsc10ntech%2Fsite-metadata-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsc10ntech%2Fsite-metadata-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsc10ntech%2Fsite-metadata-extractor/lists"}