{"id":13668245,"url":"https://github.com/indix/web-auto-extractor","last_synced_at":"2025-04-06T10:10:50.523Z","repository":{"id":8852005,"uuid":"58750222","full_name":"indix/web-auto-extractor","owner":"indix","description":"Automatically extracts structured information from webpages","archived":false,"fork":false,"pushed_at":"2022-06-23T15:31:36.000Z","size":111,"stargazers_count":108,"open_issues_count":9,"forks_count":30,"subscribers_count":46,"default_branch":"master","last_synced_at":"2024-10-29T15:39:36.820Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/indix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-05-13T15:08:13.000Z","updated_at":"2024-08-03T19:52:20.000Z","dependencies_parsed_at":"2022-07-29T08:49:26.617Z","dependency_job_id":null,"html_url":"https://github.com/indix/web-auto-extractor","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fweb-auto-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fweb-auto-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fweb-auto-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/indix%2Fweb-auto-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/indix","download_url":"https://codeload.github.com/indix/web-auto-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https
://github.com","kind":"github","repositories_count":247464220,"owners_count":20942970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T08:00:26.511Z","updated_at":"2025-04-06T10:10:50.502Z","avatar_url":"https://github.com/indix.png","language":"JavaScript","funding_links":[],"categories":["[Meta Extract](https://github.com/johnmurch/awesome-seo-scripts/tree/master/meta-extract)"],"sub_categories":[],"readme":"# Web Auto Extractor\n[![Build Status](https://travis-ci.org/indix/web-auto-extractor.svg?branch=master)](https://travis-ci.org/indix/web-auto-extractor)\n\nParse semantically structured information from any HTML webpage.\n\nSupported formats:-\n- Encodings that support [Schema.org](http://schema.org/) vocabularies:-\n  - Microdata\n  - RDFa-lite\n  - JSON-LD\n- Random Meta tags\n\nPopularly, many websites mark up their webpages with Schema.org vocabularies for better SEO. This library helps you parse that information to JSON.\n\n**[Demo](https://tonicdev.com/npm/web-auto-extractor)** it on tonicdev\n\n## Installation\n`npm install web-auto-extractor`\n\n## [Usage](#usage)\n\n```js\n// IF CommonJS\nvar WAE = require('web-auto-extractor').default\n// IF ES6\nimport WAE from 'web-auto-extractor'\n\nvar parsed = WAE().parse(sampleHTML)\n\n```\n\nLet's use the following text as the `sampleHTML` in our example. 
It uses Schema.org vocabularies to structure Product information and is encoded in `microdata` format.\n\n#### [Input](#input)\n```html\n\u003cdiv itemscope itemtype=\"http://schema.org/Product\"\u003e\n  \u003cspan itemprop=\"brand\"\u003eACME\u003c/span\u003e\n  \u003cspan itemprop=\"name\"\u003eExecutive Anvil\u003c/span\u003e\n  \u003cimg itemprop=\"image\" src=\"anvil_executive.jpg\" alt=\"Executive Anvil logo\" /\u003e\n  \u003cspan itemprop=\"description\"\u003eSleeker than ACME's Classic Anvil, the\n    Executive Anvil is perfect for the business traveler\n    looking for something to drop from a height.\n  \u003c/span\u003e\n  Product #: \u003cspan itemprop=\"mpn\"\u003e925872\u003c/span\u003e\n  \u003cspan itemprop=\"aggregateRating\" itemscope itemtype=\"http://schema.org/AggregateRating\"\u003e\n    \u003cspan itemprop=\"ratingValue\"\u003e4.4\u003c/span\u003e stars, based on \u003cspan itemprop=\"reviewCount\"\u003e89\n      \u003c/span\u003e reviews\n  \u003c/span\u003e\n\n  \u003cspan itemprop=\"offers\" itemscope itemtype=\"http://schema.org/Offer\"\u003e\n    Regular price: $179.99\n    \u003cmeta itemprop=\"priceCurrency\" content=\"USD\" /\u003e\n    $\u003cspan itemprop=\"price\"\u003e119.99\u003c/span\u003e\n    (Sale ends \u003ctime itemprop=\"priceValidUntil\" datetime=\"2020-11-05\"\u003e\n      5 November!\u003c/time\u003e)\n    Available from: \u003cspan itemprop=\"seller\" itemscope itemtype=\"http://schema.org/Organization\"\u003e\n                      \u003cspan itemprop=\"name\"\u003eExecutive Objects\u003c/span\u003e\n                    \u003c/span\u003e\n    Condition: \u003clink itemprop=\"itemCondition\" href=\"http://schema.org/UsedCondition\"/\u003ePreviously owned,\n      in excellent condition\n    \u003clink itemprop=\"availability\" href=\"http://schema.org/InStock\"/\u003eIn stock! 
Order now!\u003c/span\u003e\n  \u003c/span\u003e\n\u003c/div\u003e\n```\n\n#### [Output](#output)\n\nOur `parsed` object should look like -\n\n```json\n{\n  \"microdata\": {\n    \"Product\": [\n      {\n        \"@context\": \"http://schema.org/\",\n        \"@type\": \"Product\",\n        \"brand\": \"ACME\",\n        \"name\": \"Executive Anvil\",\n        \"image\": \"anvil_executive.jpg\",\n        \"description\": \"Sleeker than ACME's Classic Anvil, the\\n    Executive Anvil is perfect for the business traveler\\n    looking for something to drop from a height.\",\n        \"mpn\": \"925872\",\n        \"aggregateRating\": {\n          \"@context\": \"http://schema.org/\",\n          \"@type\": \"AggregateRating\",\n          \"ratingValue\": \"4.4\",\n          \"reviewCount\": \"89\"\n        },\n        \"offers\": {\n          \"@context\": \"http://schema.org/\",\n          \"@type\": \"Offer\",\n          \"priceCurrency\": \"USD\",\n          \"price\": \"119.99\",\n          \"priceValidUntil\": \"5 November!\",\n          \"seller\": {\n            \"@context\": \"http://schema.org/\",\n            \"@type\": \"Organization\",\n            \"name\": \"Executive Objects\"\n          },\n          \"itemCondition\": \"http://schema.org/UsedCondition\",\n          \"availability\": \"http://schema.org/InStock\"\n        }\n      }\n    ]\n  },\n  \"rdfa\": {},\n  \"jsonld\": {},\n  \"metatags\": {\n    \"priceCurrency\": [\n      \"USD\",\n      \"USD\"\n    ]\n  }\n}\n```\n\nThe `parsed` object includes four objects - `microdata`, `rdfa`, `jsonld` and `metatags`. Since the above HTML does not have any information encoded in `rdfa` and `jsonld`, those two objects are empty.\n\n## Caveat\n\nI wouldn't call it a caveat but rather the parser is strict by design. 
It might not parse as expected if the HTML isn't encoded correctly, which can make the parser look broken.\n\nFor example, take the following HTML snippet.\n\n```html\n\u003cdiv itemscope itemtype=\"http://schema.org/Movie\"\u003e\n  \u003ch1 itemprop=\"name\"\u003eGhostbusters\u003c/h1\u003e\n  \u003cdiv itemprop=\"productionCompany\" itemscope itemtype=\"http://schema.org/Organization\"\u003eBlack Rhino\u003c/div\u003e\n  \u003cdiv itemprop=\"countryOfOrigin\" itemscope itemtype=\"http://schema.org/Country\"\u003e\n    Country: \u003cspan itemprop=\"name\" content=\"USA\"\u003eUnited States\u003c/span\u003e\u003cp\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n```\n\nThe problem here is that the `productionCompany` `itemprop`, which has the `itemtype` `Organization`, has no child `itemprop` (in this case, `name`).\n\nThe parser assumes every `itemtype` contains an `itemprop`, or, in the case of `rdfa`, that every `typeof` contains a `property`. So the `\"Black Rhino\"` information is lost.\n\nIt would be nice to fix this by adding a `non-strict` mode for parsing this information. PRs are welcome.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Findix%2Fweb-auto-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Findix%2Fweb-auto-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Findix%2Fweb-auto-extractor/lists"}