extruct

extruct is a library for extracting embedded metadata from HTML markup.

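Currently, extruct supports the following syntaxes: Microdata, JSON-LD, Open Graph, Microformat, RDFa (experimental) and Dublin Core.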

The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.

Installation

pip install extruct

Usage

All-in-one extraction

The simplest way to use extruct is to call extruct.extract(htmlstring, base_url=base_url) with an HTML string and an optional base URL.

Let's try this on a webpage that uses all the syntaxes supported (RDFa with ogp).

First fetch the HTML using python-requests and then feed the response body to extruct:

>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                      'content': 'What is Open Graph Protocol '
                                                 'and why you need it? Learn to '
                                                 'implement Open Graph Protocol '
                                                 'for Facebook on your website. '
                                                 'Open Graph Protocol Meta Tags.',
                                      'name': 'description'}],
                      'namespaces': {},
                      'terms': []}],
  'json-ld': [ { '@context': 'https://schema.org',
                 '@id': '#organization',
                 '@type': 'Organization',
                 'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                 'name': 'Optimize Smart',
                 'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                             'https://uk.linkedin.com/in/analyticsnerd',
                             'https://www.youtube.com/user/optimizesmart',
                             'https://twitter.com/analyticsnerd'],
                 'url': 'https://www.optimizesmart.com/'}],
  'microdata': [ { 'properties': {'headline': ''},
                   'type': 'http://schema.org/WPHeader'}],
  'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
                                                     'name': [ 'Open Graph '
                                                               'Protocol for '
                                                               'Facebook '
                                                               'explained with '
                                                               'examples\n'
                                                               '\n'
                                                               'Specialized '
                                                               'Tracking\n'
                                                               '\n'
                                                               '\n'
                                                               (...)
                                                               'Follow '
                                                               '@analyticsnerd\n'
                                                               '!function(d,s,id){var '
                                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                               "'script', "
                                                               "'twitter-wjs');"]},
                                     'type': ['h-entry']}],
                     'properties': { 'name': [ 'Open Graph Protocol for '
                                               'Facebook explained with '
                                               'examples\n'
                                               (...)
                                               'Follow @analyticsnerd\n'
                                               '!function(d,s,id){var '
                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                               "'script', 'twitter-wjs');"]},
                     'type': ['h-feed']}],
  'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                   'properties': [ ('og:locale', 'en_US'),
                                   ('og:type', 'article'),
                                   ( 'og:title',
                                     'Open Graph Protocol for Facebook '
                                     'explained with examples'),
                                   ( 'og:description',
                                     'What is Open Graph Protocol and why you '
                                     'need it? Learn to implement Open Graph '
                                     'Protocol for Facebook on your website. '
                                     'Open Graph Protocol Meta Tags.'),
                                   ( 'og:url',
                                     'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                   ('og:site_name', 'Optimize Smart'),
                                   ( 'og:updated_time',
                                     '2018-03-09T16:26:35+00:00'),
                                   ( 'og:image',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                   ( 'og:image:secure_url',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
  'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
              'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
            { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
              'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
              'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
              'article:section': [{'@value': 'Specialized Tracking'}],
              'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                            'Graph Protocol '
                                                            'and why you need '
                                                            'it? Learn to '
                                                            'implement Open '
                                                            'Graph Protocol '
                                                            'for Facebook on '
                                                            'your website. '
                                                            'Open Graph '
                                                            'Protocol Meta '
                                                            'Tags.'}],
              'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
              'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                      'Facebook explained with '
                                                      'examples'}],
              'http://ogp.me/ns#type': [{'@value': 'article'}],
              'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
              'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}
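
The 'opengraph' entry above stores its properties as a list of (name, value) pairs, which is awkward for lookups. The following is a minimal sketch of a lookup helper; og_properties_to_dict is a hypothetical name, not part of extruct, and the sample data is an abridged copy of the output above. Values are grouped in lists because a property name may appear more than once.

```python
from collections import defaultdict


def og_properties_to_dict(og_item):
    """Group the (name, value) pairs of one 'opengraph' item by name.

    Hypothetical helper for illustration; not provided by extruct.
    """
    grouped = defaultdict(list)
    for name, value in og_item["properties"]:
        grouped[name].append(value)
    return dict(grouped)


# Abridged copy of the 'opengraph' entry shown in the output above.
sample = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:locale", "en_US"),
        ("og:type", "article"),
        ("og:site_name", "Optimize Smart"),
    ],
}

props = og_properties_to_dict(sample)
print(props["og:type"])
```

With real data, the same helper would be applied to each item of data['opengraph'].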

Select syntaxes

You can select which syntaxes to extract by passing a list of their names in the syntaxes argument. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed, all syntaxes are extracted and returned:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                  'fb': 'http://www.facebook.com/2008/fbml',
                                  'og': 'http://ogp.me/ns#'},
                   'properties': [ ('fb:app_id', '308540029359'),
                                   ('og:site_name', 'Songkick'),
                                   ('og:type', 'songkick-concerts:artist'),
                                   ('og:title', 'Elysian Fields'),
                                   ( 'og:description',
                                     'Find out when Elysian Fields is next '
                                     'playing live near you. List of all '
                                     'Elysian Fields tour dates and concerts.'),
                                   ( 'og:url',
                                     'https://www.songkick.com/artists/236156-elysian-fields'),
                                   ( 'og:image',
                                     'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
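
A typo in a syntax name is easy to make. The sketch below checks a list against the valid values listed above before it is passed as the syntaxes argument. check_syntaxes and VALID_SYNTAXES are hypothetical names introduced for this example; they are not part of extruct, and the sketch says nothing about how extruct itself reacts to unknown names.

```python
# Valid values as listed in this README (assumed complete for this sketch).
VALID_SYNTAXES = (
    "microdata",
    "json-ld",
    "opengraph",
    "microformat",
    "rdfa",
    "dublincore",
)


def check_syntaxes(requested):
    """Return the requested names as a list, or raise on unknown names.

    Hypothetical helper for illustration; not provided by extruct.
    """
    unknown = [name for name in requested if name not in VALID_SYNTAXES]
    if unknown:
        raise ValueError("unknown syntaxes: " + ", ".join(unknown))
    return list(requested)


syntaxes = check_syntaxes(["microdata", "opengraph", "rdfa"])
```

The checked list can then be passed on, for example as extruct.extract(r.text, base_url, syntaxes=syntaxes).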

Uniform

Another option is to convert the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following uniform structure:

{'@context': 'http://example.com',
 '@type': 'example_type',
 # all other properties as keys here
 }

To do so, set uniform=True when calling extract. It is False by default for backward compatibility. Here is the same example as before, but with uniform set to True:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                               'fb': 'http://www.facebook.com/2008/fbml',
                               'og': 'http://ogp.me/ns#'},
                 '@type': 'songkick-concerts:artist',
                 'fb:app_id': '308540029359',
                 'og:description': 'Find out when Elysian Fields is next '
                                   'playing live near you. List of all '
                                   'Elysian Fields tour dates and concerts.',
                 'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                 'og:site_name': 'Songkick',
                 'og:title': 'Elysian Fields',
                 'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB: the rdfa structure is not uniformed yet.
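
To make the mapping concrete, here is a rough sketch of how one non-uniform 'opengraph' item relates to its uniform counterpart in the two Songkick outputs above. It is written from those outputs only and is not extruct's implementation; uniform_opengraph is a hypothetical name, and keeping the first value when a property name repeats is a choice of this sketch.

```python
def uniform_opengraph(og_item):
    """Map {'namespace': ..., 'properties': [...]} to the uniform shape.

    Illustration only; not extruct's implementation.
    """
    out = {"@context": og_item["namespace"]}
    for name, value in og_item["properties"]:
        if name == "og:type":
            # In the outputs above, og:type ends up under '@type'.
            out["@type"] = value
        else:
            # Assumption of this sketch: the first value wins on repeats.
            out.setdefault(name, value)
    return out


# Abridged copy of the non-uniform 'opengraph' entry shown earlier.
og_item = {
    "namespace": {
        "concerts": "http://ogp.me/ns/fb/songkick-concerts#",
        "fb": "http://www.facebook.com/2008/fbml",
        "og": "http://ogp.me/ns#",
    },
    "properties": [
        ("fb:app_id", "308540029359"),
        ("og:site_name", "Songkick"),
        ("og:type", "songkick-concerts:artist"),
        ("og:title", "Elysian Fields"),
    ],
}

uniformed = uniform_opengraph(og_item)
```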

Returning HTML node

It is also possible to get a reference to the HTML node of every extracted metadata item. This feature is supported only by the microdata syntax.

To use it, set the return_html_node option of the extract method to True. As a result, an additional key "htmlNode" is included in the result for every item. Each node is of lxml.etree.Element type:

>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                   'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                  'Not your thin sticky pad, '
                                                  'No-Muv is truly the best!',
                                   'image': ['', ''],
                                   'name': ['No-Muv', 'No-Muv'],
                                   'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': 'Price:  '
                                                                          '$45'},
                                                 'type': 'http://schema.org/Offer'},
                                               { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': '(Select '
                                                                          'Size/Shape '
                                                                          'for '
                                                                          'Pricing)'},
                                                 'type': 'http://schema.org/Offer'}],
                                   'ratingValue': ['5.00', '5.00']},
                   'type': 'http://schema.org/Product'}]}
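
Element objects cannot be serialized by json.dumps, so a result that contains htmlNode keys has to be cleaned before it is stored as JSON. The following sketch removes those keys recursively; drop_html_nodes is a hypothetical helper, not part of extruct, and object() stands in for the lxml elements so that the example runs without lxml.

```python
import json


def drop_html_nodes(value):
    """Return a copy of value with every 'htmlNode' key removed, recursively.

    Hypothetical helper for illustration; not provided by extruct.
    """
    if isinstance(value, dict):
        return {k: drop_html_nodes(v) for k, v in value.items() if k != "htmlNode"}
    if isinstance(value, list):
        return [drop_html_nodes(v) for v in value]
    return value


# Abridged item shaped like the output above; object() replaces the elements.
item = {
    "htmlNode": object(),
    "type": "http://schema.org/Product",
    "properties": {
        "name": ["No-Muv", "No-Muv"],
        "offers": [
            {
                "htmlNode": object(),
                "type": "http://schema.org/Offer",
                "properties": {"availability": "http://schema.org/InStock"},
            }
        ],
    },
}

clean = drop_html_nodes(item)
serialized = json.dumps(clean, sort_keys=True)
```

The original item is left untouched, so the nodes remain available for further processing.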

Single extractors

You can also use each extractor individually. See below.

Microdata extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html, base_url='http://www.example.com/index.html')
>>> pp.pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]
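
A page may embed several JSON-LD objects, and in JSON-LD the '@type' value can be either a string or a list of strings. The sketch below selects items of one type from a list shaped like the output above; items_of_type is a hypothetical helper, not part of extruct.

```python
def items_of_type(items, wanted):
    """Return the JSON-LD items whose '@type' equals or contains wanted.

    Hypothetical helper for illustration; not provided by extruct.
    """
    matches = []
    for item in items:
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]  # normalize the single-string form
        if wanted in types:
            matches.append(item)
    return matches


# Abridged copy of the extractor output shown above.
data = [
    {
        "@context": "http://schema.org",
        "@type": "Person",
        "name": "John Doe",
        "address": {"@type": "PostalAddress", "addressLocality": "Wonderland"},
    }
]

people = items_of_type(data, "Person")
```

Only top-level items are inspected; nested objects such as the PostalAddress above are not matched by this sketch.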

RDFa extraction (experimental)

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.
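
In the expanded form, every property maps to a list of objects that carry either an '@value' (a literal) or an '@id' (a reference). The following sketch flattens one node into plain lists of strings; simplify_node is a hypothetical helper, not part of extruct, and it assumes nodes shaped like the output above.

```python
def simplify_node(node):
    """Replace {'@value': x} and {'@id': x} wrappers by x for each property.

    Hypothetical helper for illustration; not provided by extruct.
    """
    simple = {}
    for key, values in node.items():
        if key == "@id":
            simple[key] = values  # the node's own identifier, a string
        elif key == "@type":
            simple[key] = list(values)  # already a list of strings
        else:
            simple[key] = [v.get("@value", v.get("@id")) for v in values]
    return simple


# Abridged copy of the node shown in the output above.
node = {
    "@id": "http://www.example.com/alice/posts/trouble_with_bob",
    "@type": ["http://schema.org/BlogPosting"],
    "http://purl.org/dc/terms/creator": [{"@id": "http://www.example.com/index.html#me"}],
    "http://purl.org/dc/terms/title": [{"@value": "The trouble with Bob"}],
}

flat = simplify_node(node)
```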

Open Graph extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <meta property="og:type" content="article"/>
...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
...   <meta property="fb:admins" content="himanshu160"/>
...   <meta property="og:site_name" content="Event Education"/>
...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
...  </head>
...  <body>
...   <div id="fb-root"></div>
...   <script>(function(d, s, id) {
...               var js, fjs = d.getElementsByTagName(s)[0];
...               if (d.getElementById(id)) return;
...                  js = d.createElement(s); js.id = id;
...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
...                  fjs.parentNode.insertBefore(js, fjs);
...                  }(document, 'script', 'facebook-jssdk'));</script>
...  </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
      "og": "http://ogp.me/ns#"
  },
  "properties": [
      [
          "og:title",
          "Himanshu's Open Graph Protocol"
      ],
      [
          "og:type",
          "article"
      ],
      [
          "og:url",
          "https://www.eventeducation.com/test.php"
      ],
      [
          "og:image",
          "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
      ],
      [
          "og:site_name",
          "Event Education"
      ],
      [
          "og:description",
          "Event Education provides free courses on event planning and management to event professionals worldwide."
      ]
    ]
 }]

Microformat extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <article class="h-entry">
...    <h1 class="p-name">Microformats are amazing</h1>
...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
...    <div class="e-content">
...     <p>Blah blah blah</p>
...    </div>
...   </article>
...  </head>
...  <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
      "h-entry"
  ],
  "properties": {
      "name": [
          "Microformats are amazing"
      ],
      "author": [
          {
              "type": [
                  "h-card"
              ],
              "properties": {
                  "name": [
                      "W. Developer"
                  ],
                  "url": [
                      "http://example.com"
                  ]
              },
              "value": "W. Developer"
          }
      ],
      "published": [
          "2013-06-13 12:00:00"
      ],
      "summary": [
          "In which I extoll the virtues of using microformats."
      ],
      "content": [
          {
              "html": "\n<p>Blah blah blah</p>\n",
              "value": "\nBlah blah blah\n"
          }
      ]
    }
 }]

DublinCore extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML &lt;meta&gt; elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                    'content': 'Expressing Dublin Core\n'
                               'in HTML/XHTML meta and link elements',
                    'lang': 'en',
                    'name': 'DC.title'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                    'content': 'Andy Powell, UKOLN, University of Bath',
                    'name': 'DC.creator'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                    'content': 'http://dublincore.org/documents/dcq-html/',
                    'name': 'DC.identifier',
                    'scheme': 'DCTERMS.URI'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/format',
                    'content': 'text/html',
                    'name': 'DC.format',
                    'scheme': 'DCTERMS.IMT'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/type',
                    'content': 'Text',
                    'name': 'DC.type',
                    'scheme': 'DCTERMS.DCMIType'}],
    'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                    'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                 'content': '2003-11-01',
                 'name': 'DCTERMS.issued',
                 'scheme': 'DCTERMS.W3CDTF'},
               { 'URI': 'http://purl.org/dc/terms/abstract',
                 'content': 'This document describes how\n'
                            'qualified Dublin Core metadata can be encoded\n'
                            'in HTML/XHTML <meta> elements',
                 'name': 'DCTERMS.abstract'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DC.Date.modified'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DCTERMS.modified'},
               { 'URI': 'http://purl.org/dc/terms/replaces',
                 'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                 'hreflang': 'en',
                 'rel': 'DCTERMS.replaces'}]}]
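
The output separates 'elements' from 'terms', and entries coming from link tags carry 'href' instead of 'content'. The sketch below indexes both lists by URI; dublincore_by_uri is a hypothetical helper, not part of extruct, and the sample is an abridged copy of the output above.

```python
def dublincore_by_uri(dc_item):
    """Index the 'elements' and 'terms' of one Dublin Core item by URI.

    Hypothetical helper for illustration; not provided by extruct.
    """
    by_uri = {}
    for entry in dc_item["elements"] + dc_item["terms"]:
        # <meta> entries carry 'content', <link> entries carry 'href'.
        value = entry.get("content", entry.get("href"))
        by_uri.setdefault(entry["URI"], []).append(value)
    return by_uri


# Abridged copy of the extractor output shown above.
dc_item = {
    "elements": [
        {"URI": "http://purl.org/dc/elements/1.1/creator",
         "content": "Andy Powell, UKOLN, University of Bath",
         "name": "DC.creator"},
    ],
    "namespaces": {"DC": "http://purl.org/dc/elements/1.1/",
                   "DCTERMS": "http://purl.org/dc/terms/"},
    "terms": [
        {"URI": "http://purl.org/dc/terms/modified",
         "content": "2001-07-18", "name": "DC.Date.modified"},
        {"URI": "http://purl.org/dc/terms/modified",
         "content": "2001-07-18", "name": "DCTERMS.modified"},
        {"URI": "http://purl.org/dc/terms/replaces",
         "href": "http://dublincore.org/documents/2000/08/15/dcq-html/",
         "hreflang": "en", "rel": "DCTERMS.replaces"},
    ],
}

index = dublincore_by_uri(dc_item)
```

Values are kept in lists because several entries can share one URI, as the two "modified" terms above do.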

Command Line Tool

extruct provides a command line tool that allows you to fetch a page and extract the metadata from it directly from the command line.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install extruct[cli]

Usage

extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD, RDFa, Open Graph and Microformat metadata to stdout.

Supported Parameters

By default, the command line tool tries to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). To restrict the output to one or a subset of those, pass their names through the --syntaxes argument.

For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com":

extruct "http://example.com" --syntaxes microdata json-ld

NB: the syntax names passed must be among these: microdata, json-ld, rdfa, opengraph, microformat.

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in the current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

[MRG+1] Add dublincore metadata
This PR adds the Dublin Core schema to extruct. To implement parsing I used this document as a guide: http://dublincore.org/documents/dcq-html/ (especially section 3, which explains how a DC consumer should act).

More references:

https://en.wikipedia.org/wiki/Dublin_Core

http://dublincore.org/2012/06/14/dcterms

http://dublincore.org/2012/06/14/dcelements

http://www.hipertexto.info/documentos/dublin_core.htm

Fixes #10
opened by joaquingx 15
Add support for rdflib 5.0.0

This changeset adds support for rdflib 5.0.0 to the extruct library.

Users of extruct currently often encounter issue #131 (module not found) when they upgrade to rdflib 5.0.0; this is due to the removal of the pyRdfa parser plugin as noted here.

Fortunately, the standalone pyRdfa3 library added support to register as an rdflib plugin in https://github.com/RDFLib/pyrdfa3/pull/26.

What this means is that when pyRdfa3 is installed in the Python environment, it should restore RDFa parsing functionality for users of rdflib 5.0.0 or greater, and resolve the ModuleNotFoundError.

The currently published PyPI package pyRdfa3, version 3.5.3, includes this functionality via rdf.plugins.parser in entry_points.txt.
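One way to check whether any installed distribution registers parsers under the rdf.plugins.parser entry point group is to query the installed package metadata. The sketch below uses only the standard library (importlib.metadata, Python 3.8+). The helper name rdf_parser_plugins is hypothetical, and the names it returns depend entirely on what is installed in the environment.

```python
from importlib import metadata


def rdf_parser_plugins(group="rdf.plugins.parser"):
    """Return the sorted names of entry points registered in `group`."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):
        # Python 3.10+: selectable entry points
        selected = entry_points.select(group=group)
    else:
        # Python 3.8 / 3.9: plain dict keyed by group name
        selected = entry_points.get(group, [])
    return sorted(ep.name for ep in selected)


# An empty list means no installed package registers an rdflib parser plugin.
print(rdf_parser_plugins())
```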

NB: This release isn't currently tagged in the source repo, but I think pyRdfa3 v3.5.3 is from around commit https://github.com/RDFLib/pyrdfa3/commit/1562b3f6d6d6749e462ef6780ebf8125127c7b6a or https://github.com/RDFLib/pyrdfa3/commit/b623cdd3853f8c314c71186b7af2e12ab74f37b9 based on looking at some diffs.

NB: A note of caution for review/usage is that full Python3 support has been added to pyRdfa3 from version 4.0.0 (in https://github.com/RDFLib/pyrdfa3/pull/34) -- but from inspection of the Python-code-related changes, extruct does not appear to depend at all on the affected Python methods (notably copyErrors and rdf_from_sources which are both provided for CGI-related context in the top-level processURI method).

May fix #131 (alternative approach).

opened by jayaddison 12
Add og and microformat extraction
Progress so far:

Added opengraph and microformat extraction

Unified parsing for rdfa, microdata, opengraph and jsonld. Microformat relies on an external lib that uses bs4 for parsing, so it is not as easy to integrate.
opened by Kebniss 11
Idea: migrate from {lxml, html-text} to {html5lib, bleach}
This suggestion is derived from a specific use case where it'd be desirable to build pure-Python application containers for openculinary/crawler.

Currently a dependency on lxml is an obstacle to this, since lxml uses libxml and libxslt, and C bindings for those must be provided and compiled at build-time.

There are some practical issues to resolve:

Whitespace-related tests for W3C microdata are failing

Node-ordering-related tests for RDFA parsing are failing; this should be resolved once an upstream release of rdflib that includes https://github.com/RDFLib/rdflib/pull/1133 is available (in short, the default memory-backed RDF triple-store was not guaranteed order-preserving)

There are questions about the impact of the migration and whether it's worthwhile:

What is the performance impact of the migration, particularly for any customers using extruct for high volumes of data?

What are the comparative safety properties of bleach and bleach-extras compared to lxml's text cleaning functionality?

In short, it may not be worthwhile, but I spent a bit of time looking into the practicality of this and figured it's worth providing that even though it's work-in-progress.
opened by jayaddison 10
Removed rdflib-jsonld as a dependency

As of yesterday's rdflib 6.0.1 release, the rdflib-jsonld dependency is now a part of rdflib proper. This PR just removes it from the requirements.

Tests run fine on my machine (Python 3.8.3, Ubuntu); let me know if there's anything else I can do to help merge.

opened by BryceStevenWilley 9
Fix incorrectly formatted description property
The following has been done in this PR:

Fixed issue https://github.com/scrapinghub/extruct/issues/113 with incorrectly formatted description property.

Added a new test case with the website for which the issue occurred.

Fixed old test cases (the usage of html_text gets rid of weird new lines).

I had to set a minimum version because six 1.10.0 causes installation errors on Python 3.4.

The fix was based on the code pushed by @kmike in this PR: https://github.com/scrapinghub/extruct/pull/114. Thanks @kmike!
opened by jakubwasikowski 9
Reverse priorities for repeated properties in uniform format for opengraph

Some pages contain a duplicated definition of some properties like "og:image". See the following pages: https://nerdist.com/article/star-wars-cast-reylo-episode-ix/ https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/

Extruct's default behaviour seems to be to keep the last one, whereas Facebook's default behaviour seems to be to keep the first one, according to the results of its developer tool (see https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fnerdist.com%2Farticle%2Fstar-wars-cast-reylo-episode-ix%2F for an example).

Extruct should mimic Facebook's behaviour, so this PR reverses the priorities when flattening OpenGraph properties.
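The difference between the two policies can be sketched in a few lines. This is an illustration only, not extruct's code: the flatten_opengraph helper, its keep parameter and the sample values are all hypothetical.

```python
def flatten_opengraph(properties, keep="first"):
    """Collapse (name, value) pairs into a dict, resolving repeated names.

    keep="first" mirrors the behaviour attributed to Facebook above;
    keep="last" mirrors the previous behaviour described in this PR.
    """
    flat = {}
    for name, value in properties:
        if keep == "first":
            flat.setdefault(name, value)  # ignore later duplicates
        else:
            flat[name] = value            # later duplicates overwrite
    return flat


props = [
    ("og:title", "Example"),
    ("og:image", "first.jpg"),
    ("og:image", "second.jpg"),
]
print(flatten_opengraph(props)["og:image"])               # first.jpg
print(flatten_opengraph(props, keep="last")["og:image"])  # second.jpg
```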

opened by ivanprado 9
Move code from __init__ module to _extruct

The namespace would be cleaner if we keep only the imports in __init__. Not naming the new module extruct.extruct as this would interfere with extruct.extract autocompletion.

(another take instead of #77, but I still could not make git show the diff for the move; the only changes are several extra white spaces and an indentation fix).

opened by lopuhin 9
ImportError: No module named 'rdflib.plugins.parsers.pyRdfa'

This issue occurs when we install the warcio package:

from rdflib.plugins.parsers.pyRdfa import pyRdfa as PyRdfa, Options, logger as pyrdfa_logger
ImportError: No module named 'rdflib.plugins.parsers.pyRdfa'

opened by svnshikhil 8

Avoid including itemprop from child itemscopes when using itemref

Extruct incorrectly handles elements referenced by itemref when the referenced element also contains other items defined with itemscope. For example:

<html>
<body>
<div id="product" itemscope itemtype="http://schema.org/Product" itemref="other-product-properties">
    <span itemprop="name">Executive Anvil</span>
    <img itemprop="image" src="img1.jpg"/>
</div>
<div id="other-product-properties">
    <img itemprop="image" src="img2.jpg"/>
    <div itemscope itemtype="http://schema.org/Product" itemprop="related_products">
        <span itemprop="name">REL PROD 1</span>
        <img itemprop="image" src="rel-prod-1.jpg">
    </div>
</div>
</body>
</html>

Is extracted as (see the duplicate name and the extra image):

[{'@context': 'http://schema.org',
  '@type': 'Product',
  'image': ['img2.jpg', 'rel-prod-1.jpg', 'img1.jpg'],
  'name': ['REL PROD 1', 'Executive Anvil'],
  'related_products': {'@type': 'Product',
                       'image': 'rel-prod-1.jpg',
                       'name': 'REL PROD 1'}}]

This PR patches the code to solve this problem, so that the extracted result is:

[{'@context': 'http://schema.org',
  '@type': 'Product',
  'image': ['img2.jpg', 'img1.jpg'],
  'name': 'Executive Anvil',
  'related_products': {'@type': 'Product',
                       'image': 'rel-prod-1.jpg',
                       'name': 'REL PROD 1'}}]

I patched the function _extract_property_refs() so that properties in a different scope than the referenced element are skipped. I have run the tests and they pass.
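The scoping rule can be illustrated with a small standalone sketch. It is not the patched _extract_property_refs(): it uses xml.etree.ElementTree on a well-formed copy of the referenced element from the example above (itemscope written as itemscope="" and img tags self-closed), and the helper own_itemprops is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the referenced element from the example above.
MARKUP = """
<div id="other-product-properties">
    <img itemprop="image" src="img2.jpg"/>
    <div itemscope="" itemtype="http://schema.org/Product" itemprop="related_products">
        <span itemprop="name">REL PROD 1</span>
        <img itemprop="image" src="rel-prod-1.jpg"/>
    </div>
</div>
"""


def own_itemprops(element):
    """Yield itemprop elements that belong to `element`'s own scope."""
    for child in element:
        if "itemprop" in child.attrib:
            yield child
        # A child with itemscope starts a new item: its descendants'
        # properties belong to that item, so do not descend into it.
        if "itemscope" not in child.attrib:
            yield from own_itemprops(child)


root = ET.fromstring(MARKUP)
names = [el.get("itemprop") for el in own_itemprops(root)]
print(names)  # ['image', 'related_products']
```

The nested name and image are not reported for the outer item, which matches the corrected output shown above.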

opened by ivanprado 8

RDFa ordering not preserved on duplicated properties

When a property is repeated (e.g. on a page with multiple images annotated as og:image), RDFa returns it as a list but does not preserve order. Preserving order is important, as the first image is usually the most important. An example of a page where this happens:

https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/

This seems difficult to solve in extruct, as the problem appears to be in the PyRdfa library; it even occurs in the online service: https://www.w3.org/2012/pyRdfa/Overview.html#distill_by_uri+with_options

Related to https://github.com/scrapinghub/extruct/pull/115 (I created an xfail test for that in this PR)
bug

opened by ivanprado 7
[suggestion] adding type hints?
I see this package supports python 2.7 so the type hints would have to be the comment variety:

https://web.archive.org/web/20220213164145/https://mypy.readthedocs.io/en/stable/cheat_sheet.html

Essentially:

def add(a, b):
    # type: (int, int) -> int
    return a + b

instead of:

def add(a: int, b: int) -> int:
    return a + b

edit: another option would be some type stub files .pyi
opened by sbdchd 7
error extracting json-ld for validated json

When trying to extract this:

from url : https://www.sollan.co.il/product/%D7%9E%D7%A6%D7%91%D7%A8-80-%D7%90%D7%9E%D7%A4%D7%A8-%D7%95%D7%95%D7%A8%D7%98%D7%94-agm-%D7%A1%D7%90%D7%A8%D7%98-%D7%A1%D7%98%D7%95%D7%A4-80ah-stop-start-%D7%95%D7%A8%D7%98%D7%94/

I get this error: Extra data: line 40 column 1 (char 1211)
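"Extra data" is the message Python's json module raises when further content follows the first complete JSON value. The sketch below reproduces that error with a made-up payload and shows how json.JSONDecoder.raw_decode can read consecutive values. Whether this is what happens on the page above is an assumption, since the script content is not shown here; decode_all is a hypothetical helper.

```python
import json

# Made-up payload: two JSON objects back to back in one string.
payload = '{"@type": "Product", "name": "Battery"}\n{"@type": "Offer"}'

try:
    json.loads(payload)
    error = None
except json.JSONDecodeError as exc:
    error = str(exc)  # message starts with "Extra data"


def decode_all(text):
    """Decode every consecutive JSON value found in `text`."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # raw_decode does not skip leading whitespace
            continue
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


print(error)
print(decode_all(payload))
```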

I validated the schema in multiple places, including: https://validator.schema.org/#url=https%3A%2F%2Fwww.sollan.co.il%2Fproduct%2F%25D7%259E%25D7%25A6%25D7%2591%25D7%25A8-80-%25D7%2590%25D7%259E%25D7%25A4%25D7%25A8-%25D7%2595%25D7%2595%25D7%25A8%25D7%2598%25D7%2594-agm-%25D7%25A1%25D7%2590%25D7%25A8%25D7%2598-%25D7%25A1%25D7%2598%25D7%2595%25D7%25A4-80ah-stop-start-%25D7%2595%25D7%25A8%25D7%2598%25D7%2594%2F

opened by rmizrahigit 0
changed the opengraph meta data extraction to incorporate the html body.

#192 Added support for meta tags outside of the HTML head by changing the function extract_items() in class openClassExtractor. Also added a test case named opengraph_test_2, which uses the HTML of https://www.youtube.com/c/Freecodecamp, where the meta tags are also present in the body of the HTML; the function correctly identifies and parses all of the tags.

opened by frostrot 1

Added twitter card functionality

I have added the Twitter card functionality, so extruct now extracts the namespaces and properties of Twitter cards. I have also added 3 test cases.

This was a needed feature in issue #179

For example the following now works:

>>> extruct.extract('<!doctype html><html><head><meta name="twitter:card" content="summary">')
{'microdata': [],
 'json-ld': [],
 'opengraph': [],
 'microformat': [],
 'rdfa': [],
 'dublincore': [{'namespaces': {}, 'elements': [], 'terms': []}],
 'twittercard': [{'namespace': {'twitter': 'https://dev.twitter.com/cards#'}, 'properties': [('twitter:card', 'summary')]}]}
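For readers who want to see the idea without extruct, the following standard-library sketch collects twitter:* meta tags into (name, content) pairs. It is an illustration, not the extractor added in this PR; the class name TwitterCardParser is hypothetical.

```python
from html.parser import HTMLParser


class TwitterCardParser(HTMLParser):
    """Collect <meta name="twitter:..."> tags as (name, content) pairs."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("twitter:"):
            self.properties.append((name, attributes.get("content")))


parser = TwitterCardParser()
parser.feed('<!doctype html><html><head><meta name="twitter:card" content="summary">')
parser.close()
print(parser.properties)  # [('twitter:card', 'summary')]
```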

opened by blackhat-7 1

Solves issue #171

As @wjdp suggested in issue #171, an apostrophe in the channel's name causes the JSONDecodeError: json.loads() fails when the script contains hex escapes such as "\x27". So I created a function in the JSON-LD extractor that replaces the hex escapes with the characters they represent before passing the text to json.loads().
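A minimal sketch of this kind of pre-processing is shown below. It is not the function from this PR: instead of substituting the literal character, it rewrites each \xNN as the equivalent JSON escape \u00NN, which stays valid even when the character is a double quote. The name loads_with_hex_escapes and the sample string are hypothetical, and the simple regex would misread an escaped backslash followed by xNN.

```python
import json
import re

# Matches JavaScript-style hex escapes such as \x27, which JSON does not allow.
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def loads_with_hex_escapes(text):
    """Parse JSON after rewriting \\xNN escapes as \\u00NN (a sketch)."""
    fixed = HEX_ESCAPE.sub(lambda match: "\\u00" + match.group(1), text)
    return json.loads(fixed)


raw = r'{"name": "Joe\x27s Channel"}'

try:
    json.loads(raw)
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False  # plain json.loads rejects the \x27 escape

print(plain_ok)
print(loads_with_hex_escapes(raw))
```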

opened by AbhinavSE 1
LD+JSON outside HTML element

Hi all.

If there is ld+json outside the html element (html.head.body.html.ld+json), the parser returns an empty list.

Firefox and the W3C validator say: Stray start tag "script". So it is clear that the site's HTML document structure is at fault.

But maybe someone can apply a fix until the webmaster fixes this.

Example site: https://www.spinneyslebanon.com/mevgal-bio-feta-cheese-200g.html

opened by bar24 1