https://github.com/scrapinghub/extruct

Extract embedded metadata from HTML markup
Host: GitHub
URL: https://github.com/scrapinghub/extruct
Owner: scrapinghub
License: bsd-3-clause
Created: 2015-10-26T11:51:21.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2025-03-24T11:06:18.000Z (over 1 year ago)
Last Synced: 2025-04-15T00:42:09.893Z (over 1 year ago)
Topics: hacktoberfest, json-ld, microdata, microformats, opengraph, rdfa, semantic-web
Language: Python
Homepage:
Size: 998 KB
Stars: 900
Watchers: 109
Forks: 117
Open Issues: 53
Metadata Files:
- Readme: README.rst
- Changelog: HISTORY.rst
- License: LICENSE
- Authors: AUTHORS

=======
extruct
=======

.. image:: https://github.com/scrapinghub/extruct/workflows/build/badge.svg?branch=master
    :target: https://github.com/scrapinghub/extruct/actions
    :alt: Build Status

.. image:: https://img.shields.io/codecov/c/github/scrapinghub/extruct/master.svg?maxAge=2592000
    :target: https://codecov.io/gh/scrapinghub/extruct
    :alt: Coverage report

.. image:: https://img.shields.io/pypi/v/extruct.svg
   :target: https://pypi.python.org/pypi/extruct
   :alt: PyPI Version

*extruct* is a library for extracting embedded metadata from HTML markup.

Currently, *extruct* supports:

- `W3C's HTML Microdata`_

- `embedded JSON-LD`_

- `Microformat`_ via `mf2py`_

- `Facebook's Open Graph`_

- (experimental) `RDFa`_ via `rdflib`_

- `Dublin Core Metadata (DC-HTML-2003)`_

.. _W3C's HTML Microdata: http://www.w3.org/TR/microdata/

.. _embedded JSON-LD: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents

.. _RDFa: https://www.w3.org/TR/html-rdfa/

.. _rdflib: https://pypi.python.org/pypi/rdflib/

.. _Microformat: http://microformats.org/wiki/Main_Page

.. _mf2py: https://github.com/microformats/mf2py

.. _Facebook's Open Graph: http://ogp.me/

.. _Dublin Core Metadata (DC-HTML-2003): https://www.dublincore.org/specifications/dublin-core/dcq-html/2003-11-30/

The microdata algorithm is a revisit of `this Scrapinghub blog post`_ showing how to use EXSLT extensions.

.. _this Scrapinghub blog post: http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/

Installation
------------

::

    pip install extruct

Usage
-----

All-in-one extraction
+++++++++++++++++++++

The simplest way to use extruct is to call
``extruct.extract(htmlstring, base_url=base_url)``
with an HTML string and an optional base URL.

Let's try this on a webpage that uses all the syntaxes supported (RDFa with `ogp`_).

First fetch the HTML using python-requests and then feed the response body to ``extruct``::

  >>> import extruct

  >>> import requests

  >>> import pprint

  >>> from w3lib.html import get_base_url

  >>>

  >>> pp = pprint.PrettyPrinter(indent=2)

  >>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')

  >>> base_url = get_base_url(r.text, r.url)

  >>> data = extruct.extract(r.text, base_url=base_url)

  >>>

  >>> pp.pprint(data)

  { 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',

                                        'content': 'What is Open Graph Protocol '

                                                   'and why you need it? Learn to '

                                                   'implement Open Graph Protocol '

                                                   'for Facebook on your website. '

                                                   'Open Graph Protocol Meta Tags.',

                                        'name': 'description'}],

                        'namespaces': {},

                        'terms': []}],

    'json-ld': [ { '@context': 'https://schema.org',

                   '@id': '#organization',

                   '@type': 'Organization',

                   'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',

                   'name': 'Optimize Smart',

                   'sameAs': [ 'https://www.facebook.com/optimizesmart/',

                               'https://uk.linkedin.com/in/analyticsnerd',

                               'https://www.youtube.com/user/optimizesmart',

                               'https://twitter.com/analyticsnerd'],

                   'url': 'https://www.optimizesmart.com/'}],

    'microdata': [ { 'properties': {'headline': ''},

                     'type': 'http://schema.org/WPHeader'}],

    'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],

                                                       'name': [ 'Open Graph '

                                                                 'Protocol for '

                                                                 'Facebook '

                                                                 'explained with '

                                                                 'examples\n'

                                                                 '\n'

                                                                 'Specialized '

                                                                 'Tracking\n'

                                                                 '\n'

                                                                 '\n'

                                                                 (...)

                                                                 'Follow '

                                                                 '@analyticsnerd\n'

                                                                 '!function(d,s,id){var '

                                                                 "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "

                                                                 "'script', "

                                                                 "'twitter-wjs');"]},

                                       'type': ['h-entry']}],

                       'properties': { 'name': [ 'Open Graph Protocol for '

                                                 'Facebook explained with '

                                                 'examples\n'

                                                 (...)

                                                 'Follow @analyticsnerd\n'

                                                 '!function(d,s,id){var '

                                                 "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "

                                                 "'script', 'twitter-wjs');"]},

                       'type': ['h-feed']}],

    'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},

                     'properties': [ ('og:locale', 'en_US'),

                                     ('og:type', 'article'),

                                     ( 'og:title',

                                       'Open Graph Protocol for Facebook '

                                       'explained with examples'),

                                     ( 'og:description',

                                       'What is Open Graph Protocol and why you '

                                       'need it? Learn to implement Open Graph '

                                       'Protocol for Facebook on your website. '

                                       'Open Graph Protocol Meta Tags.'),

                                     ( 'og:url',

                                       'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),

                                     ('og:site_name', 'Optimize Smart'),

                                     ( 'og:updated_time',

                                       '2018-03-09T16:26:35+00:00'),

                                     ( 'og:image',

                                       'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),

                                     ( 'og:image:secure_url',

                                       'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],

    'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',

                'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},

              { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',

                'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],

                'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],

                'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],

                'article:section': [{'@value': 'Specialized Tracking'}],

                'http://ogp.me/ns#description': [ { '@value': 'What is Open '

                                                              'Graph Protocol '

                                                              'and why you need '

                                                              'it? Learn to '

                                                              'implement Open '

                                                              'Graph Protocol '

                                                              'for Facebook on '

                                                              'your website. '

                                                              'Open Graph '

                                                              'Protocol Meta '

                                                              'Tags.'}],

                'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],

                'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],

                'http://ogp.me/ns#locale': [{'@value': 'en_US'}],

                'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],

                'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '

                                                        'Facebook explained with '

                                                        'examples'}],

                'http://ogp.me/ns#type': [{'@value': 'article'}],

                'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],

                'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],

                'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}
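The result is a plain dict keyed by syntax name, with a list of items under each key. As a minimal sketch of working with that shape (the helper name ``count_items`` and the hand-written ``data`` stand-in are illustrative assumptions, not part of extruct), the number of items found per syntax can be tallied like this:

```python
def count_items(data):
    """Return {syntax_name: number_of_items} for an extract()-style result."""
    return {syntax: len(items) for syntax, items in data.items()}


# Hand-written stand-in shaped like the result above (most fields omitted,
# and 'opengraph' left empty on purpose to show a syntax with no items).
data = {
    "dublincore": [{"elements": [], "namespaces": {}, "terms": []}],
    "json-ld": [{"@type": "Organization", "name": "Optimize Smart"}],
    "microdata": [{"type": "http://schema.org/WPHeader",
                   "properties": {"headline": ""}}],
    "opengraph": [],
}

counts = count_items(data)
# Names of the syntaxes that produced at least one item.
non_empty = sorted(name for name, n in counts.items() if n)
print(counts)
print(non_empty)
```

A syntax that was requested but not found on the page still appears as a key with an empty list, as ``'microdata': []`` does in the examples in this README.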

Select syntaxes
+++++++++++++++

You can select which syntaxes to extract by passing a list of the desired ones through the ``syntaxes`` argument. Valid values are ``'microdata'``, ``'json-ld'``, ``'opengraph'``, ``'microformat'``, ``'rdfa'`` and ``'dublincore'``. If no list is passed, all syntaxes are extracted and returned::

  >>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')

  >>> base_url = get_base_url(r.text, r.url)

  >>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])

  >>>

  >>> pp.pprint(data)

  { 'microdata': [],

    'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',

                                    'fb': 'http://www.facebook.com/2008/fbml',

                                    'og': 'http://ogp.me/ns#'},

                     'properties': [ ('fb:app_id', '308540029359'),

                                     ('og:site_name', 'Songkick'),

                                     ('og:type', 'songkick-concerts:artist'),

                                     ('og:title', 'Elysian Fields'),

                                     ( 'og:description',

                                       'Find out when Elysian Fields is next '

                                       'playing live near you. List of all '

                                       'Elysian Fields tour dates and concerts.'),

                                     ( 'og:url',

                                       'https://www.songkick.com/artists/236156-elysian-fields'),

                                     ( 'og:image',

                                       'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],

    'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',

                'al:ios:app_name': [{'@value': 'Songkick Concerts'}],

                'al:ios:app_store_id': [{'@value': '438690886'}],

                'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],

                'http://ogp.me/ns#description': [ { '@value': 'Find out when '

                                                              'Elysian Fields is '

                                                              'next playing live '

                                                              'near you. List of '

                                                              'all Elysian '

                                                              'Fields tour dates '

                                                              'and concerts.'}],

                'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],

                'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],

                'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],

                'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],

                'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],

                'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Alternatively, if you already parsed the HTML before calling extruct, you can use the tree instead of the HTML string: ::

  >>> # using the request from the previous example

  >>> base_url = get_base_url(r.text, r.url)

  >>> from extruct.utils import parse_html

  >>> tree = parse_html(r.text)

  >>> data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])

The microformat extractor doesn't support an HTML tree, so you need to pass it an HTML string.
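In the non-uniform output shown above, each ``opengraph`` item stores its properties as a list of ``(name, value)`` pairs, so a property such as ``og:image`` may appear more than once. The following is a small sketch of grouping those pairs by name; the helper ``og_properties_to_dict`` is a made-up name, and the second ``og:image`` entry in the sample is hypothetical, added only to show a repeated property:

```python
from collections import defaultdict


def og_properties_to_dict(og_item):
    """Group (name, value) pairs by name, keeping every value for repeated names."""
    grouped = defaultdict(list)
    for name, value in og_item["properties"]:
        grouped[name].append(value)
    return dict(grouped)


# Shortened opengraph item in the same layout as the output above.
og_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:site_name", "Songkick"),
        ("og:title", "Elysian Fields"),
        ("og:image", "http://example.com/a.jpg"),  # hypothetical value
        ("og:image", "http://example.com/b.jpg"),  # hypothetical value
    ],
}

props = og_properties_to_dict(og_item)
print(props["og:title"])
print(props["og:image"])
```

Keeping a list per name avoids silently dropping repeated properties, at the cost of having to index ``[0]`` for single-valued ones.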

Uniform
+++++++

Another option is to normalize the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes into the following uniform structure: ::

    {'@context': 'http://example.com',
     '@type': 'example_type',
     /* all other properties as keys here */
     }

To do so, set ``uniform=True`` when calling ``extract``; it is ``False`` by default for backward compatibility. Here is the same example as before, but with ``uniform`` set to ``True``: ::

  >>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')

  >>> base_url = get_base_url(r.text, r.url)

  >>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)

  >>>

  >>> pp.pprint(data)

  { 'microdata': [],

    'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',

                                 'fb': 'http://www.facebook.com/2008/fbml',

                                 'og': 'http://ogp.me/ns#'},

                   '@type': 'songkick-concerts:artist',

                   'fb:app_id': '308540029359',

                   'og:description': 'Find out when Elysian Fields is next '

                                     'playing live near you. List of all '

                                     'Elysian Fields tour dates and concerts.',

                   'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',

                   'og:site_name': 'Songkick',

                   'og:title': 'Elysian Fields',

                   'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],

    'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',

                'al:ios:app_name': [{'@value': 'Songkick Concerts'}],

                'al:ios:app_store_id': [{'@value': '438690886'}],

                'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],

                'http://ogp.me/ns#description': [ { '@value': 'Find out when '

                                                              'Elysian Fields is '

                                                              'next playing live '

                                                              'near you. List of '

                                                              'all Elysian '

                                                              'Fields tour dates '

                                                              'and concerts.'}],

                'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],

                'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],

                'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],

                'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],

                'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],

                'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB: the rdfa structure is not uniformed yet.
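To make the shape of the uniform output more concrete, here is a rough sketch of the kind of transformation involved for a single opengraph item. It is inferred only from the two outputs shown in this README and is not extruct's actual implementation; ``uniform_og_sketch`` is a made-up name, and keeping only the first value of a repeated property is an assumption of this sketch:

```python
def uniform_og_sketch(og_item):
    """Flatten one non-uniform opengraph item into the uniform layout."""
    # The namespace mapping becomes '@context'.
    flat = {"@context": og_item["namespace"]}
    for name, value in og_item["properties"]:
        if name == "og:type":
            # 'og:type' is reported as '@type' in the uniform output.
            flat["@type"] = value
        elif name not in flat:
            # Assumption: keep the first value of a repeated property.
            flat[name] = value
    return flat


og_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:site_name", "Songkick"),
        ("og:type", "songkick-concerts:artist"),
        ("og:title", "Elysian Fields"),
    ],
}

uniform_item = uniform_og_sketch(og_item)
print(uniform_item)
```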

Returning HTML node
+++++++++++++++++++

It is also possible to get a reference to the HTML node for every extracted metadata item.
This feature is supported only by the microdata syntax.

To use it, set the ``return_html_node`` option of the ``extract`` method to ``True``.
As a result, an additional key ``htmlNode`` will be included in the result for every
item. Each node is of ``lxml.etree.Element`` type: ::

  >>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')

  >>> base_url = get_base_url(r.text, r.url)

  >>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)

  >>>

  >>> pp.pprint(data)

  { 'microdata': [ { 'htmlNode': ,

                     'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'

                                                    'Not your thin sticky pad, '

                                                    'No-Muv is truly the best!',

                                     'image': ['', ''],

                                     'name': ['No-Muv', 'No-Muv'],

                                     'offers': [ { 'htmlNode': ,

                                                   'properties': { 'availability': 'http://schema.org/InStock',

                                                                   'price': 'Price:  '

                                                                            '$45'},

                                                   'type': 'http://schema.org/Offer'},

                                                 { 'htmlNode': ,

                                                   'properties': { 'availability': 'http://schema.org/InStock',

                                                                   'price': '(Select '

                                                                            'Size/Shape '

                                                                            'for '

                                                                            'Pricing)'},

                                                   'type': 'http://schema.org/Offer'}],

                                     'ratingValue': ['5.00', '5.00']},

                     'type': 'http://schema.org/Product'}]}
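Because the ``htmlNode`` values are element objects rather than plain data, a result that contains them cannot be passed to ``json.dumps`` as is. One possible way around this, sketched below under the assumption that the nodes are no longer needed at serialization time, is to drop the ``htmlNode`` keys recursively first. ``strip_html_nodes`` is a hypothetical helper, and ``object()`` merely stands in for a real lxml element:

```python
import json


def strip_html_nodes(value):
    """Recursively drop every 'htmlNode' key from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_html_nodes(v) for k, v in value.items() if k != "htmlNode"}
    if isinstance(value, list):
        return [strip_html_nodes(v) for v in value]
    return value


node = object()  # stand-in for an lxml element; not JSON-serializable either

# Shortened microdata items in the same layout as the output above.
items = [
    {
        "htmlNode": node,
        "type": "http://schema.org/Product",
        "properties": {
            "name": "No-Muv",
            "offers": [
                {
                    "htmlNode": node,
                    "type": "http://schema.org/Offer",
                    "properties": {"price": "$45"},
                }
            ],
        },
    }
]

clean = strip_html_nodes(items)
serialized = json.dumps(clean)
print(serialized)
```

The helper returns new containers and leaves ``items`` untouched, so the nodes stay available for further processing.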

Single extractors
-----------------

You can also use each extractor individually. See below.

Microdata extraction
++++++++++++++++++++

::

  >>> import pprint

  >>> pp = pprint.PrettyPrinter(indent=2)

  >>>

  >>> from extruct.w3cmicrodata import MicrodataExtractor

  >>>

  >>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items

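  >>> html = """<!DOCTYPE HTML>
  ... <html>
  ...  <head>
  ...   <title>Photo gallery</title>
  ...  </head>
  ...  <body>
  ...   <h1>My photos</h1>
  ...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
  ...    <img itemprop="work" src="http://www.example.com/images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
  ...    <figcaption itemprop="title">The house I found.</figcaption>
  ...   </figure>
  ...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
  ...    <img itemprop="work" src="http://www.example.com/images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
  ...    <figcaption itemprop="title">The mailbox.</figcaption>
  ...   </figure>
  ...   <footer>
  ...    <p id="licenses">All images licensed under the <a itemprop="license" href="http://www.opensource.org/licenses/mit-license.php">MIT
  ...    license</a>.</p>
  ...   </footer>
  ...  </body>
  ... </html>"""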

  >>>

  >>> mde = MicrodataExtractor()

  >>> data = mde.extract(html)

  >>> pp.pprint(data)

  [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',

                   'title': 'The house I found.',

                   'work': 'http://www.example.com/images/house.jpeg'},

    'type': 'http://n.whatwg.org/work'},

   {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',

                   'title': 'The mailbox.',

                   'work': 'http://www.example.com/images/mailbox.jpeg'},

    'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction
++++++++++++++++++

::

  >>> import pprint

  >>> pp = pprint.PrettyPrinter(indent=2)

  >>>

  >>> from extruct.jsonld import JsonLdExtractor

  >>>

  >>> html = """<!DOCTYPE HTML>
  ... <html>
  ...  <head>
  ...   <title>Some Person Page</title>
  ...  </head>
  ...  <body>
  ...   <h1>This guys</h1>
  ...     <script type="application/ld+json">
  ...     {
  ...       "@context": "http://schema.org",
  ...       "@type": "Person",
  ...       "name": "John Doe",
  ...       "jobTitle": "Graduate research assistant",
  ...       "affiliation": "University of Dreams",
  ...       "additionalName": "Johnny",
  ...       "url": "http://www.example.com",
  ...       "address": {
  ...         "@type": "PostalAddress",
  ...         "streetAddress": "1234 Peach Drive",
  ...         "addressLocality": "Wonderland",
  ...         "addressRegion": "Georgia"
  ...       }
  ...     }
  ...     </script>
  ...  </body>
  ... </html>"""

  >>>

  >>> jslde = JsonLdExtractor()

  >>>

  >>> data = jslde.extract(html)

  >>> pp.pprint(data)

  [{'@context': 'http://schema.org',

    '@type': 'Person',

    'additionalName': 'Johnny',

    'address': {'@type': 'PostalAddress',

                'addressLocality': 'Wonderland',

                'addressRegion': 'Georgia',

                'streetAddress': '1234 Peach Drive'},

    'affiliation': 'University of Dreams',

    'jobTitle': 'Graduate research assistant',

    'name': 'John Doe',

    'url': 'http://www.example.com'}]
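The extractor returns a list of decoded JSON-LD objects, so selecting the ones you care about is ordinary list filtering. The sketch below assumes that ``@type`` may be either a string or a list of strings (JSON-LD allows both); ``items_of_type`` is a hypothetical helper and the second and third sample items are invented for illustration:

```python
def items_of_type(items, wanted):
    """Return the JSON-LD items whose '@type' is, or contains, `wanted`."""
    matches = []
    for item in items:
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]  # normalize a single type to a list
        if wanted in types:
            matches.append(item)
    return matches


items = [
    {"@context": "http://schema.org", "@type": "Person", "name": "John Doe"},
    # Hypothetical extra items, not taken from the example above.
    {"@context": "http://schema.org",
     "@type": ["Organization", "Corporation"], "name": "Example Corp"},
    {"@context": "http://schema.org", "name": "Untyped item"},
]

people = items_of_type(items, "Person")
orgs = items_of_type(items, "Organization")
print([p["name"] for p in people])
print([o["name"] for o in orgs])
```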

RDFa extraction (experimental)
++++++++++++++++++++++++++++++

::

  >>> import pprint

  >>> pp = pprint.PrettyPrinter(indent=2)

  >>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available

  INFO:rdflib:RDFLib Version: 4.2.1

  /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.

    'parsers will not be available.')

  >>>

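  >>> html = """<html>
  ...  <head>
  ...    ...
  ...  </head>
  ...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
  ...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
  ...       <h2 property="dc:title">The trouble with Bob</h2>
  ...       ...
  ...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
  ...       <div property="schema:articleBody">
  ...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
  ...       </div>
  ...      ...
  ...    </div>
  ...  </body>
  ... </html>
  ... """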

  >>>

  >>> rdfae = RDFaExtractor()

  >>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))

  [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',

    '@type': ['http://schema.org/BlogPosting'],

    'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],

    'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],

    'http://schema.org/articleBody': [{'@value': '\n'

                                                 '        The trouble with Bob '

                                                 'is that he takes much better '

                                                 'photos than I do:\n'

                                                 '      '}],

    'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.
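In this expanded form every property maps to a list of objects that carry either an ``@value`` (a literal) or an ``@id`` (a reference). If plain values are easier to work with, a node can be flattened along the following lines. This is only a sketch: ``simplify_node`` is a made-up helper, and it deliberately ignores other keys an expanded value object may carry, such as ``@language`` or ``@type``:

```python
def simplify_node(node):
    """Replace each {'@value': x} or {'@id': x} wrapper with the bare x."""
    simple = {}
    for key, values in node.items():
        if key in ("@id", "@type"):
            # '@id' is a string and '@type' a list of strings already.
            simple[key] = values
        else:
            simple[key] = [
                v["@value"] if "@value" in v else v.get("@id") for v in values
            ]
    return simple


# Shortened node taken from the output above.
node = {
    "@id": "http://www.example.com/alice/posts/trouble_with_bob",
    "@type": ["http://schema.org/BlogPosting"],
    "http://purl.org/dc/terms/title": [{"@value": "The trouble with Bob"}],
    "http://purl.org/dc/terms/creator": [
        {"@id": "http://www.example.com/index.html#me"}
    ],
}

simple = simplify_node(node)
print(simple["http://purl.org/dc/terms/title"])
print(simple["http://purl.org/dc/terms/creator"])
```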

Open Graph extraction
++++++++++++++++++++++++++++++

::

  >>> import pprint

  >>> pp = pprint.PrettyPrinter(indent=2)

  >>>

  >>> from extruct.opengraph import OpenGraphExtractor

  >>>

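  >>> html = """<!DOCTYPE HTML>
  ... <html>
  ...  <head>
  ...   <title>Himanshu's Open Graph Protocol</title>
  ...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
  ...   <meta property="og:type" content="article"/>
  ...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
  ...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
  ...   <meta property="og:site_name" content="Event Education"/>
  ...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
  ...  </head>
  ...  <body>
  ...   <script>(function(d, s, id) {
  ...               var js, fjs = d.getElementsByTagName(s)[0];
  ...               if (d.getElementById(id)) return;
  ...                  js = d.createElement(s); js.id = id;
  ...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
  ...                  fjs.parentNode.insertBefore(js, fjs);
  ...                  }(document, 'script', 'facebook-jssdk'));</script>
  ...  </body>
  ... </html>"""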

  >>>

  >>> opengraphe = OpenGraphExtractor()

  >>> pp.pprint(opengraphe.extract(html))

  [{"namespace": {

        "og": "http://ogp.me/ns#"

    },

    "properties": [

        [

            "og:title",

            "Himanshu's Open Graph Protocol"

        ],

        [

            "og:type",

            "article"

        ],

        [

            "og:url",

            "https://www.eventeducation.com/test.php"

        ],

        [

            "og:image",

            "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"

        ],

        [

            "og:site_name",

            "Event Education"

        ],

        [

            "og:description",

            "Event Education provides free courses on event planning and management to event professionals worldwide."

        ]

      ]

   }]

Microformat extraction
++++++++++++++++++++++++++++++

::

  >>> import pprint

  >>> pp = pprint.PrettyPrinter(indent=2)

  >>>

  >>> from extruct.microformat import MicroformatExtractor

  >>>

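  >>> html = """<!DOCTYPE HTML>
  ... <html>
  ...  <head>
  ...   <title>Himanshu's Open Graph Protocol</title>
  ...  </head>
  ...  <body>
  ...   <article class="h-entry">
  ...    <h1 class="p-name">Microformats are amazing</h1>
  ...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
  ...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
  ...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
  ...    <div class="e-content">
  ...     <p>Blah blah blah</p>
  ...    </div>
  ...   </article>
  ...  </body>
  ... </html>"""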

  >>>

  >>> microformate = MicroformatExtractor()

  >>> data = microformate.extract(html)

  >>> pp.pprint(data)

  [{"type": [

        "h-entry"

    ],

    "properties": {

        "name": [

            "Microformats are amazing"

        ],

        "author": [

            {

                "type": [

                    "h-card"

                ],

                "properties": {

                    "name": [

                        "W. Developer"

                    ],

                    "url": [

                        "http://example.com"

                    ]

                },

                "value": "W. Developer"

            }

        ],

        "published": [

            "2013-06-13 12:00:00"

        ],

        "summary": [

            "In which I extoll the virtues of using microformats."

        ],

        "content": [

            {

                "html": "\n<p>Blah blah blah</p>\n",

                "value": "\nBlah blah blah\n"

            }

        ]

      }

   }]
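Every microformat property is returned as a list, and nested items (such as the ``h-card`` author above) are dicts with their own ``type``, ``properties`` and ``value``. When only one value per property is needed, something like the following sketch may help; ``first_values`` is a hypothetical helper, and preferring the ``value`` string of a nested item is a choice made for this example:

```python
def first_values(mf_item):
    """Map each property name to its first value (mf2 properties are lists)."""
    result = {}
    for name, values in mf_item["properties"].items():
        if not values:
            continue  # skip properties with an empty list
        first = values[0]
        if isinstance(first, dict):
            # Nested item or e-* content: fall back to its plain-text 'value'.
            first = first.get("value", first)
        result[name] = first
    return result


# Shortened item in the same layout as the output above.
mf_item = {
    "type": ["h-entry"],
    "properties": {
        "name": ["Microformats are amazing"],
        "author": [
            {
                "type": ["h-card"],
                "properties": {"name": ["W. Developer"],
                               "url": ["http://example.com"]},
                "value": "W. Developer",
            }
        ],
        "published": ["2013-06-13 12:00:00"],
    },
}

summary = first_values(mf_item)
print(summary)
```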

DublinCore extraction
++++++++++++++++++++++++++++++

::

    >>> import pprint

    >>> pp = pprint.PrettyPrinter(indent=2)

    >>> from extruct.dublincore import DublinCoreExtractor

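    >>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
    ... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
    ... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
    ... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
    ...
    ...
    ... <meta name="DC.title" lang="en" content="Expressing Dublin Core
    ... in HTML/XHTML meta and link elements" />
    ... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
    ... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
    ... <meta name="DC.identifier" scheme="DCTERMS.URI"
    ... content="http://dublincore.org/documents/dcq-html/" />
    ... <link rel="DCTERMS.replaces" hreflang="en"
    ... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
    ... <meta name="DCTERMS.abstract" content="This document describes how
    ... qualified Dublin Core metadata can be encoded
    ... in HTML/XHTML &lt;meta&gt; elements" />
    ... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
    ... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
    ... <meta name="DC.Date.modified" content="2001-07-18" />
    ... <meta name="DCTERMS.modified" content="2001-07-18" />
    ... </head>'''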

    >>> dublinlde = DublinCoreExtractor()

    >>> data = dublinlde.extract(html)

    >>> pp.pprint(data)

    [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',

                        'content': 'Expressing Dublin Core\n'

                                   'in HTML/XHTML meta and link elements',

                        'lang': 'en',

                        'name': 'DC.title'},

                      { 'URI': 'http://purl.org/dc/elements/1.1/creator',

                        'content': 'Andy Powell, UKOLN, University of Bath',

                        'name': 'DC.creator'},

                      { 'URI': 'http://purl.org/dc/elements/1.1/identifier',

                        'content': 'http://dublincore.org/documents/dcq-html/',

                        'name': 'DC.identifier',

                        'scheme': 'DCTERMS.URI'},

                      { 'URI': 'http://purl.org/dc/elements/1.1/format',

                        'content': 'text/html',

                        'name': 'DC.format',

                        'scheme': 'DCTERMS.IMT'},

                      { 'URI': 'http://purl.org/dc/elements/1.1/type',

                        'content': 'Text',

                        'name': 'DC.type',

                        'scheme': 'DCTERMS.DCMIType'}],

        'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',

                        'DCTERMS': 'http://purl.org/dc/terms/'},

        'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',

                     'content': '2003-11-01',

                     'name': 'DCTERMS.issued',

                     'scheme': 'DCTERMS.W3CDTF'},

                   { 'URI': 'http://purl.org/dc/terms/abstract',

                     'content': 'This document describes how\n'

                                'qualified Dublin Core metadata can be encoded\n'

                                'in HTML/XHTML <meta> elements',

                     'name': 'DCTERMS.abstract'},

                   { 'URI': 'http://purl.org/dc/terms/modified',

                     'content': '2001-07-18',

                     'name': 'DC.Date.modified'},

                   { 'URI': 'http://purl.org/dc/terms/modified',

                     'content': '2001-07-18',

                     'name': 'DCTERMS.modified'},

                   { 'URI': 'http://purl.org/dc/terms/replaces',

                     'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',

                     'hreflang': 'en',

                     'rel': 'DCTERMS.replaces'}]}]
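Each Dublin Core item separates ``elements`` from ``terms``, and entries that come from ``link`` tags carry ``rel`` and ``href`` instead of ``name`` and ``content``. As a sketch of collapsing one item into a simple lookup table (``dublincore_to_dict`` is a hypothetical helper, and the sample below is a shortened copy of the output above):

```python
def dublincore_to_dict(dc_item):
    """Map each Dublin Core name (or rel) to the list of its values."""
    flat = {}
    for entry in dc_item["elements"] + dc_item["terms"]:
        # <meta> entries use name/content, <link> entries use rel/href.
        key = entry.get("name") or entry.get("rel")
        value = entry.get("content") or entry.get("href")
        if key is not None:
            flat.setdefault(key, []).append(value)
    return flat


dc_item = {
    "elements": [
        {"URI": "http://purl.org/dc/elements/1.1/creator",
         "content": "Andy Powell, UKOLN, University of Bath",
         "name": "DC.creator"},
        {"URI": "http://purl.org/dc/elements/1.1/format",
         "content": "text/html",
         "name": "DC.format",
         "scheme": "DCTERMS.IMT"},
    ],
    "namespaces": {"DC": "http://purl.org/dc/elements/1.1/",
                   "DCTERMS": "http://purl.org/dc/terms/"},
    "terms": [
        {"URI": "http://purl.org/dc/terms/issued",
         "content": "2003-11-01",
         "name": "DCTERMS.issued",
         "scheme": "DCTERMS.W3CDTF"},
        {"URI": "http://purl.org/dc/terms/replaces",
         "href": "http://dublincore.org/documents/2000/08/15/dcq-html/",
         "hreflang": "en",
         "rel": "DCTERMS.replaces"},
    ],
}

flat = dublincore_to_dict(dc_item)
print(flat["DC.creator"])
print(flat["DCTERMS.replaces"])
```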

Command Line Tool
-----------------

*extruct* provides a command line tool that allows you to fetch a page and
extract the metadata from it directly from the command line.

Dependencies
++++++++++++

The command line tool depends on ``requests``, which is not installed by default
when you install **extruct**. In order to use the command line tool, you can
install **extruct** with the ``cli`` extra requirements::

    pip install 'extruct[cli]'

Usage
+++++

::

    extruct "http://example.com"

This downloads ``http://example.com`` and writes the Microdata, JSON-LD, RDFa, Open Graph
and Microformat metadata to ``stdout``.

Supported Parameters
++++++++++++++++++++

By default, the command line tool tries to extract all the supported
metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph
and Microformat). To restrict the output to one format or a subset of
them, pass their names through the ``--syntaxes`` argument.

For example, this command extracts only Microdata and JSON-LD metadata from
``http://example.com``::

    extruct "http://example.com" --syntaxes microdata json-ld

NB: the syntax names passed must be among these: microdata, json-ld, rdfa, opengraph, microformat.

Development version
-------------------

::

    mkvirtualenv extruct

    pip install -r requirements-dev.txt

Tests
-----

Run tests in current environment::

    py.test tests

Use tox_ to run tests with different Python versions::

    tox

.. _tox: https://testrun.org/tox/latest/

.. _ogp: https://ogp.me/