{"id":13545696,"url":"https://github.com/scrapinghub/extruct","last_synced_at":"2025-05-13T22:03:52.945Z","repository":{"id":40347429,"uuid":"44965223","full_name":"scrapinghub/extruct","owner":"scrapinghub","description":"Extract embedded metadata from HTML markup","archived":false,"fork":false,"pushed_at":"2025-03-24T11:06:18.000Z","size":1022,"stargazers_count":900,"open_issues_count":53,"forks_count":117,"subscribers_count":109,"default_branch":"master","last_synced_at":"2025-04-15T00:42:09.893Z","etag":null,"topics":["hacktoberfest","json-ld","microdata","microformats","opengraph","rdfa","semantic-web"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapinghub.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2015-10-26T11:51:21.000Z","updated_at":"2025-04-12T08:24:20.000Z","dependencies_parsed_at":"2024-04-16T15:02:02.260Z","dependency_job_id":"3ddb30b2-23cf-4d54-956a-ba8f9d29ee7e","html_url":"https://github.com/scrapinghub/extruct","commit_stats":{"total_commits":402,"total_committers":33,"mean_commits":"12.181818181818182","dds":0.8109452736318408,"last_synced_commit":"9453e4348e06509e7716ada3de585cddfe2ad0d4"},"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fextruct","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fextruct/tags","releases_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/scrapinghub%2Fextruct/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Fextruct/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapinghub","download_url":"https://codeload.github.com/scrapinghub/extruct/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251336807,"owners_count":21573264,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hacktoberfest","json-ld","microdata","microformats","opengraph","rdfa","semantic-web"],"created_at":"2024-08-01T11:01:10.008Z","updated_at":"2025-04-28T15:24:09.547Z","avatar_url":"https://github.com/scrapinghub.png","language":"Python","funding_links":[],"categories":["Python","hacktoberfest","Web Scraping \u0026 Crawling"],"sub_categories":[],"readme":"=======\nextruct\n=======\n\n.. image:: https://github.com/scrapinghub/extruct/workflows/build/badge.svg?branch=master\n    :target: https://github.com/scrapinghub/extruct/actions\n    :alt: Build Status\n\n.. image:: https://img.shields.io/codecov/c/github/scrapinghub/extruct/master.svg?maxAge=2592000\n    :target: https://codecov.io/gh/scrapinghub/extruct\n    :alt: Coverage report\n\n.. 
image:: https://img.shields.io/pypi/v/extruct.svg\n   :target: https://pypi.python.org/pypi/extruct\n   :alt: PyPI Version\n\n\n*extruct* is a library for extracting embedded metadata from HTML markup.\n\nCurrently, *extruct* supports:\n\n- `W3C's HTML Microdata`_\n- `embedded JSON-LD`_\n- `Microformat`_ via `mf2py`_\n- `Facebook's Open Graph`_\n- (experimental) `RDFa`_ via `rdflib`_\n- `Dublin Core Metadata (DC-HTML-2003)`_\n\n.. _W3C's HTML Microdata: http://www.w3.org/TR/microdata/\n.. _embedded JSON-LD: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents\n.. _RDFa: https://www.w3.org/TR/html-rdfa/\n.. _rdflib: https://pypi.python.org/pypi/rdflib/\n.. _Microformat: http://microformats.org/wiki/Main_Page\n.. _mf2py: https://github.com/microformats/mf2py\n.. _Facebook's Open Graph: http://ogp.me/\n.. _Dublin Core Metadata (DC-HTML-2003): https://www.dublincore.org/specifications/dublin-core/dcq-html/2003-11-30/\n\nThe microdata algorithm is a revisit of `this Scrapinghub blog post`_ showing how to use EXSLT extensions.\n\n.. 
_this Scrapinghub blog post: http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/\n\n\nInstallation\n------------\n\n::\n\n    pip install extruct\n\n\nUsage\n-----\n\nAll-in-one extraction\n+++++++++++++++++++++\n\nThe simplest way to use extruct is to call\n``extruct.extract(htmlstring, base_url=base_url)``\nwith an HTML string and an optional base URL.\n\nLet's try this on a webpage that uses all the syntaxes supported (RDFa with `ogp`_).\n\nFirst fetch the HTML using python-requests and then feed the response body to ``extruct``::\n\n  \u003e\u003e\u003e import extruct\n  \u003e\u003e\u003e import requests\n  \u003e\u003e\u003e import pprint\n  \u003e\u003e\u003e from w3lib.html import get_base_url\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n  \u003e\u003e\u003e r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')\n  \u003e\u003e\u003e base_url = get_base_url(r.text, r.url)\n  \u003e\u003e\u003e data = extruct.extract(r.text, base_url=base_url)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e pp.pprint(data)\n  { 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',\n                                        'content': 'What is Open Graph Protocol '\n                                                   'and why you need it? Learn to '\n                                                   'implement Open Graph Protocol '\n                                                   'for Facebook on your website. 
'\n                                                   'Open Graph Protocol Meta Tags.',\n                                        'name': 'description'}],\n                        'namespaces': {},\n                        'terms': []}],\n\n  'json-ld': [ { '@context': 'https://schema.org',\n                   '@id': '#organization',\n                   '@type': 'Organization',\n                   'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',\n                   'name': 'Optimize Smart',\n                   'sameAs': [ 'https://www.facebook.com/optimizesmart/',\n                               'https://uk.linkedin.com/in/analyticsnerd',\n                               'https://www.youtube.com/user/optimizesmart',\n                               'https://twitter.com/analyticsnerd'],\n                   'url': 'https://www.optimizesmart.com/'}],\n    'microdata': [ { 'properties': {'headline': ''},\n                     'type': 'http://schema.org/WPHeader'}],\n    'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],\n                                                       'name': [ 'Open Graph '\n                                                                 'Protocol for '\n                                                                 'Facebook '\n                                                                 'explained with '\n                                                                 'examples\\n'\n                                                                 '\\n'\n                                                                 'Specialized '\n                                                                 'Tracking\\n'\n                                                                 '\\n'\n                                                                 '\\n'\n                                                                 (...)\n                           
                                      'Follow '\n                                                                 '@analyticsnerd\\n'\n                                                                 '!function(d,s,id){var '\n                                                                 \"js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, \"\n                                                                 \"'script', \"\n                                                                 \"'twitter-wjs');\"]},\n                                       'type': ['h-entry']}],\n                       'properties': { 'name': [ 'Open Graph Protocol for '\n                                                 'Facebook explained with '\n                                                 'examples\\n'\n                                                 (...)\n                                                 'Follow @analyticsnerd\\n'\n                                                 '!function(d,s,id){var '\n                                                 \"js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, \"\n                                                 \"'script', 'twitter-wjs');\"]},\n                       'type': ['h-feed']}],\n    'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},\n                     'properties': [ ('og:locale', 'en_US'),\n                                     ('og:type', 'article'),\n                                     ( 'og:title',\n                                       'Open Graph Protocol for Facebook '\n                                       'explained with examples'),\n              
                       ( 'og:description',\n                                       'What is Open Graph Protocol and why you '\n                                       'need it? Learn to implement Open Graph '\n                                       'Protocol for Facebook on your website. '\n                                       'Open Graph Protocol Meta Tags.'),\n                                     ( 'og:url',\n                                       'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),\n                                     ('og:site_name', 'Optimize Smart'),\n                                     ( 'og:updated_time',\n                                       '2018-03-09T16:26:35+00:00'),\n                                     ( 'og:image',\n                                       'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),\n                                     ( 'og:image:secure_url',\n                                       'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],\n    'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',\n                'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},\n              { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',\n                'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],\n                'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],\n                'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],\n                'article:section': [{'@value': 'Specialized Tracking'}],\n                'http://ogp.me/ns#description': [ { '@value': 'What is Open '\n                                                              'Graph Protocol '\n                                                              'and why you need '\n              
                                                'it? Learn to '\n                                                              'implement Open '\n                                                              'Graph Protocol '\n                                                              'for Facebook on '\n                                                              'your website. '\n                                                              'Open Graph '\n                                                              'Protocol Meta '\n                                                              'Tags.'}],\n                'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],\n                'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],\n                'http://ogp.me/ns#locale': [{'@value': 'en_US'}],\n                'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],\n                'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '\n                                                        'Facebook explained with '\n                                                        'examples'}],\n                'http://ogp.me/ns#type': [{'@value': 'article'}],\n                'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],\n                'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],\n                'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}\n\nSelect syntaxes\n+++++++++++++++\nIt is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. 
If no list is passed all syntaxes will be extracted and returned::\n\n  \u003e\u003e\u003e r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')\n  \u003e\u003e\u003e base_url = get_base_url(r.text, r.url)\n  \u003e\u003e\u003e data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e pp.pprint(data)\n  { 'microdata': [],\n    'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',\n                                    'fb': 'http://www.facebook.com/2008/fbml',\n                                    'og': 'http://ogp.me/ns#'},\n                     'properties': [ ('fb:app_id', '308540029359'),\n                                     ('og:site_name', 'Songkick'),\n                                     ('og:type', 'songkick-concerts:artist'),\n                                     ('og:title', 'Elysian Fields'),\n                                     ( 'og:description',\n                                       'Find out when Elysian Fields is next '\n                                       'playing live near you. 
List of all '\n                                       'Elysian Fields tour dates and concerts.'),\n                                     ( 'og:url',\n                                       'https://www.songkick.com/artists/236156-elysian-fields'),\n                                     ( 'og:image',\n                                       'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],\n    'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',\n                'al:ios:app_name': [{'@value': 'Songkick Concerts'}],\n                'al:ios:app_store_id': [{'@value': '438690886'}],\n                'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],\n                'http://ogp.me/ns#description': [ { '@value': 'Find out when '\n                                                              'Elysian Fields is '\n                                                              'next playing live '\n                                                              'near you. 
List of '\n                                                              'all Elysian '\n                                                              'Fields tour dates '\n                                                              'and concerts.'}],\n                'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],\n                'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],\n                'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],\n                'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],\n                'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],\n                'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}\n\nAlternatively, if you have already parsed the HTML before calling extruct, you can pass the tree instead of the HTML string: ::\n\n  \u003e\u003e\u003e # using the request from the previous example\n  \u003e\u003e\u003e base_url = get_base_url(r.text, r.url)\n  \u003e\u003e\u003e from extruct.utils import parse_html\n  \u003e\u003e\u003e tree = parse_html(r.text)\n  \u003e\u003e\u003e data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])\n\nThe microformat syntax doesn't support the HTML tree, so you need to pass an HTML string for it.\n\nUniform\n+++++++\nAnother option is to make the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes uniform, converting it to the following structure: ::\n\n    {'@context': 'http://example.com',\n                 '@type': 'example_type',\n                 /* All the other properties as keys here */\n                 }\n\nTo do so, set ``uniform=True`` when calling ``extract``; it is ``False`` by default for backward compatibility. 
Here is the same example as before, but with ``uniform`` set to ``True``: ::\n\n  \u003e\u003e\u003e r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')\n  \u003e\u003e\u003e base_url = get_base_url(r.text, r.url)\n  \u003e\u003e\u003e data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e pp.pprint(data)\n  { 'microdata': [],\n    'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',\n                                 'fb': 'http://www.facebook.com/2008/fbml',\n                                 'og': 'http://ogp.me/ns#'},\n                   '@type': 'songkick-concerts:artist',\n                   'fb:app_id': '308540029359',\n                   'og:description': 'Find out when Elysian Fields is next '\n                                     'playing live near you. List of all '\n                                     'Elysian Fields tour dates and concerts.',\n                   'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',\n                   'og:site_name': 'Songkick',\n                   'og:title': 'Elysian Fields',\n                   'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],\n    'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',\n                'al:ios:app_name': [{'@value': 'Songkick Concerts'}],\n                'al:ios:app_store_id': [{'@value': '438690886'}],\n                'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],\n                'http://ogp.me/ns#description': [ { '@value': 'Find out when '\n                                                              'Elysian Fields is '\n                                                              'next playing live '\n                                                              'near you. 
List of '\n                                                              'all Elysian '\n                                                              'Fields tour dates '\n                                                              'and concerts.'}],\n                'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],\n                'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],\n                'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],\n                'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],\n                'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],\n                'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}\n\nNB: the rdfa structure is not made uniform yet.\n\nReturning HTML node\n+++++++++++++++++++\n\nIt is also possible to get a reference to the HTML node for every extracted metadata item.\nThis feature is supported only by the microdata syntax.\n\nTo use it, set the ``return_html_node`` option of the ``extract`` method to ``True``.\nAs a result, an additional key \"htmlNode\" will be included in the result for every\nitem. 
Each node is of ``lxml.etree.Element`` type: ::\n\n  \u003e\u003e\u003e r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')\n  \u003e\u003e\u003e base_url = get_base_url(r.text, r.url)\n  \u003e\u003e\u003e data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e pp.pprint(data)\n  { 'microdata': [ { 'htmlNode': \u003cElement div at 0x7f10f8e6d3b8\u003e,\n                     'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\\n'\n                                                    'Not your thin sticky pad, '\n                                                    'No-Muv is truly the best!',\n                                     'image': ['', ''],\n                                     'name': ['No-Muv', 'No-Muv'],\n                                     'offers': [ { 'htmlNode': \u003cElement div at 0x7f10f8e6d138\u003e,\n                                                   'properties': { 'availability': 'http://schema.org/InStock',\n                                                                   'price': 'Price:  '\n                                                                            '$45'},\n                                                   'type': 'http://schema.org/Offer'},\n                                                 { 'htmlNode': \u003cElement div at 0x7f10f8e60f48\u003e,\n                                                   'properties': { 'availability': 'http://schema.org/InStock',\n                                                                   'price': '(Select '\n                                                                            'Size/Shape '\n                                                                            'for '\n                                                                            'Pricing)'},\n                                                   'type': 'http://schema.org/Offer'}],\n                                     
'ratingValue': ['5.00', '5.00']},\n                     'type': 'http://schema.org/Product'}]}\n\nSingle extractors\n-----------------\n\nYou can also use each extractor individually. See below.\n\nMicrodata extraction\n++++++++++++++++++++\n::\n\n  \u003e\u003e\u003e import pprint\n  \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e from extruct.w3cmicrodata import MicrodataExtractor\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e # example from http://www.w3.org/TR/microdata/#associating-names-with-items\n  \u003e\u003e\u003e html = \"\"\"\u003c!DOCTYPE HTML\u003e\n  ... \u003chtml\u003e\n  ...  \u003chead\u003e\n  ...   \u003ctitle\u003ePhoto gallery\u003c/title\u003e\n  ...  \u003c/head\u003e\n  ...  \u003cbody\u003e\n  ...   \u003ch1\u003eMy photos\u003c/h1\u003e\n  ...   \u003cfigure itemscope itemtype=\"http://n.whatwg.org/work\" itemref=\"licenses\"\u003e\n  ...    \u003cimg itemprop=\"work\" src=\"images/house.jpeg\" alt=\"A white house, boarded up, sits in a forest.\"\u003e\n  ...    \u003cfigcaption itemprop=\"title\"\u003eThe house I found.\u003c/figcaption\u003e\n  ...   \u003c/figure\u003e\n  ...   \u003cfigure itemscope itemtype=\"http://n.whatwg.org/work\" itemref=\"licenses\"\u003e\n  ...    \u003cimg itemprop=\"work\" src=\"images/mailbox.jpeg\" alt=\"Outside the house is a mailbox. It has a leaflet inside.\"\u003e\n  ...    \u003cfigcaption itemprop=\"title\"\u003eThe mailbox.\u003c/figcaption\u003e\n  ...   \u003c/figure\u003e\n  ...   \u003cfooter\u003e\n  ...    \u003cp id=\"licenses\"\u003eAll images licensed under the \u003ca itemprop=\"license\"\n  ...    href=\"http://www.opensource.org/licenses/mit-license.php\"\u003eMIT\n  ...    license\u003c/a\u003e.\u003c/p\u003e\n  ...   \u003c/footer\u003e\n  ...  \u003c/body\u003e\n  ... 
\u003c/html\u003e\"\"\"\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e mde = MicrodataExtractor()\n  \u003e\u003e\u003e data = mde.extract(html)\n  \u003e\u003e\u003e pp.pprint(data)\n  [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',\n                   'title': 'The house I found.',\n                   'work': 'http://www.example.com/images/house.jpeg'},\n    'type': 'http://n.whatwg.org/work'},\n   {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',\n                   'title': 'The mailbox.',\n                   'work': 'http://www.example.com/images/mailbox.jpeg'},\n    'type': 'http://n.whatwg.org/work'}]\n\nJSON-LD extraction\n++++++++++++++++++\n::\n\n  \u003e\u003e\u003e import pprint\n  \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e from extruct.jsonld import JsonLdExtractor\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e html = \"\"\"\u003c!DOCTYPE HTML\u003e\n  ... \u003chtml\u003e\n  ...  \u003chead\u003e\n  ...   \u003ctitle\u003eSome Person Page\u003c/title\u003e\n  ...  \u003c/head\u003e\n  ...  \u003cbody\u003e\n  ...   \u003ch1\u003eThis guys\u003c/h1\u003e\n  ...     \u003cscript type=\"application/ld+json\"\u003e\n  ...     {\n  ...       \"@context\": \"http://schema.org\",\n  ...       \"@type\": \"Person\",\n  ...       \"name\": \"John Doe\",\n  ...       \"jobTitle\": \"Graduate research assistant\",\n  ...       \"affiliation\": \"University of Dreams\",\n  ...       \"additionalName\": \"Johnny\",\n  ...       \"url\": \"http://www.example.com\",\n  ...       \"address\": {\n  ...         \"@type\": \"PostalAddress\",\n  ...         \"streetAddress\": \"1234 Peach Drive\",\n  ...         \"addressLocality\": \"Wonderland\",\n  ...         \"addressRegion\": \"Georgia\"\n  ...       }\n  ...     }\n  ...     \u003c/script\u003e\n  ...  \u003c/body\u003e\n  ... 
\u003c/html\u003e\"\"\"\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e jslde = JsonLdExtractor()\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e data = jslde.extract(html)\n  \u003e\u003e\u003e pp.pprint(data)\n  [{'@context': 'http://schema.org',\n    '@type': 'Person',\n    'additionalName': 'Johnny',\n    'address': {'@type': 'PostalAddress',\n                'addressLocality': 'Wonderland',\n                'addressRegion': 'Georgia',\n                'streetAddress': '1234 Peach Drive'},\n    'affiliation': 'University of Dreams',\n    'jobTitle': 'Graduate research assistant',\n    'name': 'John Doe',\n    'url': 'http://www.example.com'}]\n\n\nRDFa extraction (experimental)\n++++++++++++++++++++++++++++++\n\n::\n\n  \u003e\u003e\u003e import pprint\n  \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n  \u003e\u003e\u003e from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available\n  INFO:rdflib:RDFLib Version: 4.2.1\n  /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.\n    'parsers will not be available.')\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e html = \"\"\"\u003chtml\u003e\n  ...  \u003chead\u003e\n  ...    ...\n  ...  \u003c/head\u003e\n  ...  \u003cbody prefix=\"dc: http://purl.org/dc/terms/ schema: http://schema.org/\"\u003e\n  ...    \u003cdiv resource=\"/alice/posts/trouble_with_bob\" typeof=\"schema:BlogPosting\"\u003e\n  ...       \u003ch2 property=\"dc:title\"\u003eThe trouble with Bob\u003c/h2\u003e\n  ...       ...\n  ...       \u003ch3 property=\"dc:creator schema:creator\" resource=\"#me\"\u003eAlice\u003c/h3\u003e\n  ...       \u003cdiv property=\"schema:articleBody\"\u003e\n  ...         \u003cp\u003eThe trouble with Bob is that he takes much better photos than I do:\u003c/p\u003e\n  ...       \u003c/div\u003e\n  ...      ...\n  ...    
\u003c/div\u003e\n  ...  \u003c/body\u003e\n  ... \u003c/html\u003e\n  ... \"\"\"\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e rdfae = RDFaExtractor()\n  \u003e\u003e\u003e pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))\n  [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',\n    '@type': ['http://schema.org/BlogPosting'],\n    'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],\n    'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],\n    'http://schema.org/articleBody': [{'@value': '\\n'\n                                                 '        The trouble with Bob '\n                                                 'is that he takes much better '\n                                                 'photos than I do:\\n'\n                                                 '      '}],\n    'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]\n\nYou'll get a list of expanded JSON-LD nodes.\n\n\nOpen Graph extraction\n++++++++++++++++++++++++++++++\n\n::\n\n  \u003e\u003e\u003e import pprint\n  \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e from extruct.opengraph import OpenGraphExtractor\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e html = \"\"\"\u003c!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"\u003e\n  ... \u003chtml xmlns=\"https://www.w3.org/1999/xhtml\" xmlns:og=\"https://ogp.me/ns#\" xmlns:fb=\"https://www.facebook.com/2008/fbml\"\u003e\n  ...  \u003chead\u003e\n  ...   \u003ctitle\u003eHimanshu's Open Graph Protocol\u003c/title\u003e\n  ...   \u003cmeta http-equiv=\"Content-Type\" content=\"text/html;charset=WINDOWS-1252\" /\u003e\n  ...   \u003cmeta http-equiv=\"Content-Language\" content=\"en-us\" /\u003e\n  ...   \u003clink rel=\"stylesheet\" type=\"text/css\" href=\"event-education.css\" /\u003e\n  ...   
\u003cmeta name=\"verify-v1\" content=\"so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=\" \u003e\n  ...   \u003cmeta property=\"og:title\" content=\"Himanshu's Open Graph Protocol\"/\u003e\n  ...   \u003cmeta property=\"og:type\" content=\"article\"/\u003e\n  ...   \u003cmeta property=\"og:url\" content=\"https://www.eventeducation.com/test.php\"/\u003e\n  ...   \u003cmeta property=\"og:image\" content=\"https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg\"/\u003e\n  ...   \u003cmeta property=\"fb:admins\" content=\"himanshu160\"/\u003e\n  ...   \u003cmeta property=\"og:site_name\" content=\"Event Education\"/\u003e\n  ...   \u003cmeta property=\"og:description\" content=\"Event Education provides free courses on event planning and management to event professionals worldwide.\"/\u003e\n  ...  \u003c/head\u003e\n  ...  \u003cbody\u003e\n  ...   \u003cdiv id=\"fb-root\"\u003e\u003c/div\u003e\n  ...   \u003cscript\u003e(function(d, s, id) {\n  ...               var js, fjs = d.getElementsByTagName(s)[0];\n  ...               if (d.getElementById(id)) return;\n  ...                  js = d.createElement(s); js.id = id;\n  ...                  js.src = \"//connect.facebook.net/en_US/all.js#xfbml=1\u0026appId=501839739845103\";\n  ...                  fjs.parentNode.insertBefore(js, fjs);\n  ...                  }(document, 'script', 'facebook-jssdk'));\u003c/script\u003e\n  ...  \u003c/body\u003e\n  ... 
\u003c/html\u003e\"\"\"\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e opengraphe = OpenGraphExtractor()\n  \u003e\u003e\u003e pp.pprint(opengraphe.extract(html))\n  [{\"namespace\": {\n        \"og\": \"http://ogp.me/ns#\"\n    },\n    \"properties\": [\n        [\n            \"og:title\",\n            \"Himanshu's Open Graph Protocol\"\n        ],\n        [\n            \"og:type\",\n            \"article\"\n        ],\n        [\n            \"og:url\",\n            \"https://www.eventeducation.com/test.php\"\n        ],\n        [\n            \"og:image\",\n            \"https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg\"\n        ],\n        [\n            \"og:site_name\",\n            \"Event Education\"\n        ],\n        [\n            \"og:description\",\n            \"Event Education provides free courses on event planning and management to event professionals worldwide.\"\n        ]\n      ]\n   }]\n\n\nMicroformat extraction\n++++++++++++++++++++++++++++++\n\n::\n\n  \u003e\u003e\u003e import pprint\n  \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e from extruct.microformat import MicroformatExtractor\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e html = \"\"\"\u003c!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"\u003e\n  ... \u003chtml xmlns=\"https://www.w3.org/1999/xhtml\" xmlns:og=\"https://ogp.me/ns#\" xmlns:fb=\"https://www.facebook.com/2008/fbml\"\u003e\n  ...  \u003chead\u003e\n  ...   \u003ctitle\u003eHimanshu's Open Graph Protocol\u003c/title\u003e\n  ...   \u003cmeta http-equiv=\"Content-Type\" content=\"text/html;charset=WINDOWS-1252\" /\u003e\n  ...   \u003cmeta http-equiv=\"Content-Language\" content=\"en-us\" /\u003e\n  ...   \u003clink rel=\"stylesheet\" type=\"text/css\" href=\"event-education.css\" /\u003e\n  ...   
\u003cmeta name=\"verify-v1\" content=\"so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=\" \u003e\n  ...   \u003cmeta property=\"og:title\" content=\"Himanshu's Open Graph Protocol\"/\u003e\n  ...   \u003carticle class=\"h-entry\"\u003e\n  ...    \u003ch1 class=\"p-name\"\u003eMicroformats are amazing\u003c/h1\u003e\n  ...    \u003cp\u003ePublished by \u003ca class=\"p-author h-card\" href=\"http://example.com\"\u003eW. Developer\u003c/a\u003e\n  ...       on \u003ctime class=\"dt-published\" datetime=\"2013-06-13 12:00:00\"\u003e13\u003csup\u003eth\u003c/sup\u003e June 2013\u003c/time\u003e\u003c/p\u003e\n  ...    \u003cp class=\"p-summary\"\u003eIn which I extoll the virtues of using microformats.\u003c/p\u003e\n  ...    \u003cdiv class=\"e-content\"\u003e\n  ...     \u003cp\u003eBlah blah blah\u003c/p\u003e\n  ...    \u003c/div\u003e\n  ...   \u003c/article\u003e\n  ...  \u003c/head\u003e\n  ...  \u003cbody\u003e\u003c/body\u003e\n  ... \u003c/html\u003e\"\"\"\n  \u003e\u003e\u003e\n  \u003e\u003e\u003e microformate = MicroformatExtractor()\n  \u003e\u003e\u003e data = microformate.extract(html)\n  \u003e\u003e\u003e pp.pprint(data)\n  [{\"type\": [\n        \"h-entry\"\n    ],\n    \"properties\": {\n        \"name\": [\n            \"Microformats are amazing\"\n        ],\n        \"author\": [\n            {\n                \"type\": [\n                    \"h-card\"\n                ],\n                \"properties\": {\n                    \"name\": [\n                        \"W. Developer\"\n                    ],\n                    \"url\": [\n                        \"http://example.com\"\n                    ]\n                },\n                \"value\": \"W. 
Developer\"\n            }\n        ],\n        \"published\": [\n            \"2013-06-13 12:00:00\"\n        ],\n        \"summary\": [\n            \"In which I extoll the virtues of using microformats.\"\n        ],\n        \"content\": [\n            {\n                \"html\": \"\\n\u003cp\u003eBlah blah blah\u003c/p\u003e\\n\",\n                \"value\": \"\\nBlah blah blah\\n\"\n            }\n        ]\n      }\n   }]\n\nDublinCore extraction\n++++++++++++++++++++++++++++++\n::\n\n    \u003e\u003e\u003e import pprint\n    \u003e\u003e\u003e pp = pprint.PrettyPrinter(indent=2)\n    \u003e\u003e\u003e from extruct.dublincore import DublinCoreExtractor\n    \u003e\u003e\u003e html = '''\u003chead profile=\"http://dublincore.org/documents/dcq-html/\"\u003e\n    ... \u003ctitle\u003eExpressing Dublin Core in HTML/XHTML meta and link elements\u003c/title\u003e\n    ... \u003clink rel=\"schema.DC\" href=\"http://purl.org/dc/elements/1.1/\" /\u003e\n    ... \u003clink rel=\"schema.DCTERMS\" href=\"http://purl.org/dc/terms/\" /\u003e\n    ...\n    ...\n    ... \u003cmeta name=\"DC.title\" lang=\"en\" content=\"Expressing Dublin Core\n    ... in HTML/XHTML meta and link elements\" /\u003e\n    ... \u003cmeta name=\"DC.creator\" content=\"Andy Powell, UKOLN, University of Bath\" /\u003e\n    ... \u003cmeta name=\"DCTERMS.issued\" scheme=\"DCTERMS.W3CDTF\" content=\"2003-11-01\" /\u003e\n    ... \u003cmeta name=\"DC.identifier\" scheme=\"DCTERMS.URI\"\n    ... content=\"http://dublincore.org/documents/dcq-html/\" /\u003e\n    ... \u003clink rel=\"DCTERMS.replaces\" hreflang=\"en\"\n    ... href=\"http://dublincore.org/documents/2000/08/15/dcq-html/\" /\u003e\n    ... \u003cmeta name=\"DCTERMS.abstract\" content=\"This document describes how\n    ... qualified Dublin Core metadata can be encoded\n    ... in HTML/XHTML \u0026lt;meta\u0026gt; elements\" /\u003e\n    ... \u003cmeta name=\"DC.format\" scheme=\"DCTERMS.IMT\" content=\"text/html\" /\u003e\n    ... 
\u003cmeta name=\"DC.type\" scheme=\"DCTERMS.DCMIType\" content=\"Text\" /\u003e\n    ... \u003cmeta name=\"DC.Date.modified\" content=\"2001-07-18\" /\u003e\n    ... \u003cmeta name=\"DCTERMS.modified\" content=\"2001-07-18\" /\u003e'''\n    \u003e\u003e\u003e dublinlde = DublinCoreExtractor()\n    \u003e\u003e\u003e data = dublinlde.extract(html)\n    \u003e\u003e\u003e pp.pprint(data)\n    [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',\n                        'content': 'Expressing Dublin Core\\n'\n                                   'in HTML/XHTML meta and link elements',\n                        'lang': 'en',\n                        'name': 'DC.title'},\n                      { 'URI': 'http://purl.org/dc/elements/1.1/creator',\n                        'content': 'Andy Powell, UKOLN, University of Bath',\n                        'name': 'DC.creator'},\n                      { 'URI': 'http://purl.org/dc/elements/1.1/identifier',\n                        'content': 'http://dublincore.org/documents/dcq-html/',\n                        'name': 'DC.identifier',\n                        'scheme': 'DCTERMS.URI'},\n                      { 'URI': 'http://purl.org/dc/elements/1.1/format',\n                        'content': 'text/html',\n                        'name': 'DC.format',\n                        'scheme': 'DCTERMS.IMT'},\n                      { 'URI': 'http://purl.org/dc/elements/1.1/type',\n                        'content': 'Text',\n                        'name': 'DC.type',\n                        'scheme': 'DCTERMS.DCMIType'}],\n        'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',\n                        'DCTERMS': 'http://purl.org/dc/terms/'},\n        'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',\n                     'content': '2003-11-01',\n                     'name': 'DCTERMS.issued',\n                     'scheme': 'DCTERMS.W3CDTF'},\n                   { 'URI': 'http://purl.org/dc/terms/abstract',\n   
                  'content': 'This document describes how\\n'\n                                'qualified Dublin Core metadata can be encoded\\n'\n                                'in HTML/XHTML \u003cmeta\u003e elements',\n                     'name': 'DCTERMS.abstract'},\n                   { 'URI': 'http://purl.org/dc/terms/modified',\n                     'content': '2001-07-18',\n                     'name': 'DC.Date.modified'},\n                   { 'URI': 'http://purl.org/dc/terms/modified',\n                     'content': '2001-07-18',\n                     'name': 'DCTERMS.modified'},\n                   { 'URI': 'http://purl.org/dc/terms/replaces',\n                     'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',\n                     'hreflang': 'en',\n                     'rel': 'DCTERMS.replaces'}]}]\n\n\n\nCommand Line Tool\n-----------------\n\n*extruct* provides a command line tool that allows you to fetch a page and\nextract the metadata from it directly from the command line.\n\nDependencies\n++++++++++++\n\nThe command line tool depends on ``requests``, which is not installed by default\nwhen you install **extruct**. In order to use the command line tool, you can\ninstall **extruct** with the `cli` extra requirements::\n\n    pip install 'extruct[cli]'\n\n\nUsage\n+++++\n\n::\n\n    extruct \"http://example.com\"\n\nDownloads \"http://example.com\" and outputs the Microdata, JSON-LD and RDFa, Open Graph\nand Microformat metadata to `stdout`.\n\nSupported Parameters\n++++++++++++++++++++\n\nBy default, the command line tool will try to extract all the supported\nmetadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph\nand Microformat). 
If you want to restrict the output to just one or a subset of\nthose, pass their names as a space-separated list to the ``--syntaxes`` argument.\n\nFor example, this command extracts only Microdata and JSON-LD metadata from\n\"http://example.com\"::\n\n    extruct \"http://example.com\" --syntaxes microdata json-ld\n\nNB: the syntax names passed must be among these: microdata, json-ld, rdfa, opengraph, microformat\n\nDevelopment version\n-------------------\n\n::\n\n    mkvirtualenv extruct\n    pip install -r requirements-dev.txt\n\n\nTests\n-----\n\nRun tests in the current environment::\n\n    py.test tests\n\n\nUse tox_ to run tests with different Python versions::\n\n    tox\n\n\n.. _tox: https://testrun.org/tox/latest/\n.. _ogp: https://ogp.me/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Fextruct","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapinghub%2Fextruct","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Fextruct/lists"}