A Python module to parse the Open Graph Protocol

Erik Rivera

Last update: Nov 12, 2022

Overview

OpenGraph is a Python module for parsing the Open Graph Protocol. You can read more about the specification at http://ogp.me/

Installation

$ pip install opengraph

Features

Use it as a Python dict
Input and parsing from a specific URL
Input and parsing from previously extracted HTML
HTML output
JSON output

Usage

From a URL

>>> import opengraph
>>> video = opengraph.OpenGraph(url="http://www.youtube.com/watch?v=q3ixBmDzylQ")
>>> video.is_valid()
True
>>> for x,y in video.items():
...     print("%-15s => %s" % (x, y))
...
site_name       => YouTube
description     => Eric Clapton and Paul McCartney perform George Harrison's "While My Guitar Gently Weeps" at the...
title           => While My Guitar Gently Weeps
url             => http://www.youtube.com/watch?v=q3ixBmDzylQ
image           => http://i2.ytimg.com/vi/q3ixBmDzylQ/default.jpg
video:type      => application/x-shockwave-flash
video:height    => 224
video           => http://www.youtube.com/v/q3ixBmDzylQ?version=3&autohide=1
video:width     => 398
type            => video

From HTML

>>> HTML = """
... <html xmlns:og="http://ogp.me/ns#">
... <head>
... <title>The Rock (1996)</title>
... <meta property="og:title" content="The Rock" />
... <meta property="og:type" content="movie" />
... <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
... <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
... </head>
... </html>
... """
>>> movie = opengraph.OpenGraph() # or you can instantiate as follows: opengraph.OpenGraph(html=HTML)
>>> movie.parser(HTML)
>>> movie.is_valid()
True

Generate JSON or HTML

>>> ogp = opengraph.OpenGraph("http://ogp.me/")
>>> print(ogp.to_json())
{"image:type": "image/png", "title": "Open Graph protocol", "url": "http://ogp.me/", "image": "http://ogp.me/logo.png", "scrape": false, "_url": "http://ogp.me/", "image:height": "300", "type": "website", "image:width": "300", "description": "The Open Graph protocol enables any web page to become a rich object in a social graph."}
>>> print(ogp.to_html())

<meta property="og:image:type" content="image/png" />
<meta property="og:title" content="Open Graph protocol" />
<meta property="og:url" content="http://ogp.me/" />
<meta property="og:image" content="http://ogp.me/logo.png" />
<meta property="og:scrape" content="False" />
<meta property="og:_url" content="http://ogp.me/" />
<meta property="og:image:height" content="300" />
<meta property="og:type" content="website" />
<meta property="og:image:width" content="300" />
<meta property="og:description" content="The Open Graph protocol enables any web page to become a rich object in a social graph." />
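Both output formats are simple to reproduce by hand. The following is a stand-alone sketch, not the module's own code: `to_json` and `to_html` are hypothetical free functions that take a plain dict, and the HTML version escapes attribute values as a precaution.

```python
import json
from html import escape


def to_json(props):
    """Serialize Open Graph properties to a JSON object string."""
    return json.dumps(props)


def to_html(props):
    """Render each property as an og: <meta> tag, one tag per line."""
    return "\n".join(
        '<meta property="og:%s" content="%s" />' % (escape(key), escape(str(value)))
        for key, value in props.items()
    )


props = {"title": "Open Graph protocol", "url": "http://ogp.me/", "type": "website"}
print(to_json(props))
print(to_html(props))
```

Escaping matters as soon as a value contains a double quote, which would otherwise end the `content` attribute early.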

Comments

fix exception when encountering invalid og tag

Pages from some websites, such as nytimes, may contain invalid og tags, which cause an exception. This fix checks that the tag uses the right attribute before reading its value.
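The kind of check described here can be sketched without any parser at all. In this illustration, `og_pairs` is a hypothetical helper that receives one attribute dict per `<meta>` tag, and the `value=` attribute in the sample data is an invented example of an invalid tag, not one taken from a real page.

```python
def og_pairs(meta_tags):
    """Yield (name, value) for og: tags that carry a content attribute.

    meta_tags is an iterable of attribute dicts, one per <meta> tag.
    """
    for attrs in meta_tags:
        prop = attrs.get("property") or ""
        if not prop.startswith("og:"):
            continue
        # An invalid tag may lack content=; skip it instead of raising KeyError.
        if "content" not in attrs:
            continue
        yield prop[3:], attrs["content"]


tags = [
    {"property": "og:title", "content": "The Rock"},
    {"property": "og:image", "value": "rock.jpg"},  # invalid: no content attribute
    {"name": "description", "content": "not an og tag"},
]
print(dict(og_pairs(tags)))  # {'title': 'The Rock'}
```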

opened by syshen 1

specify a dummy user agent

Some websites inspect the headers urllib sends and block requests from scripts to avoid 'unnecessary' load on their servers, which results in a 403 error when fetching data. This PR adds a dummy user agent.
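As a rough illustration of the idea, a request with an explicit User-Agent can be built as follows. This sketch uses Python 3's `urllib.request` rather than the module's own fetching code; `USER_AGENT`, `build_request` and `fetch_html` are names chosen for this example, and the header value is a placeholder.

```python
from urllib.request import Request, urlopen

# Placeholder value; any descriptive string can be used.
USER_AGENT = "Mozilla/5.0 (compatible; opengraph-example)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that sends an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Fetch the page body as bytes using the custom User-Agent."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


# Request stores header names capitalized, hence "User-agent" here.
req = build_request("http://ogp.me/")
print(req.get_header("User-agent"))
```

The bytes returned by `fetch_html` could then be passed to `opengraph.OpenGraph(html=...)` as in the "From HTML" example.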

opened by radfaz 0

Fix malformed json error message

This is a valid Python string, but the json is incorrect. See json spec:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes [...]

For "simplicity" I'm creating the Python string in single quotes, arguing that the codebase already uses a mixed double- and single quoting style, anyway.

Update: Probably not necessary if https://github.com/erikriver/opengraph/pull/4/files gets merged.
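The difference between a valid Python string and valid JSON is easy to check with the standard `json` module. In the sketch below the message text is invented for illustration and is not the module's actual error string; `is_valid_json` is a hypothetical helper.

```python
import json

# A valid Python string, but not valid JSON: JSON strings need double quotes.
bad = "{'error': 'json module not available'}"

# The same message as a single-quoted Python string wrapping double-quoted JSON.
good = '{"error": "json module not available"}'


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json(bad), is_valid_json(good))  # False True
```

Building the message with `json.dumps({"error": "..."})` instead of a hand-written literal would avoid the quoting question entirely.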

opened by norpol 0

Added option to scrape page for attributes in case og meta elements are not present

To make this work similarly to Facebook's robot, I've added a 'scrape' parameter to the OpenGraph class. When it is True, the document's body is scraped for fallback values in case the required og meta attributes are not present.
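A fallback of this kind might look like the sketch below, which fills a missing `og:title` from the page's `<title>` element. It is written against the standard library's `html.parser` rather than the module's internals; `TitleFallbackParser` and `parse` are hypothetical names, and only the title fallback is shown.

```python
from html.parser import HTMLParser


class TitleFallbackParser(HTMLParser):
    """Collect og: properties and remember the <title> text as a fallback."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.page_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.og[prop[3:]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.page_title += data


def parse(html, scrape=False):
    """Return og properties; with scrape=True, fill a missing title from <title>."""
    parser = TitleFallbackParser()
    parser.feed(html)
    parser.close()
    result = dict(parser.og)
    title = parser.page_title.strip()
    if scrape and "title" not in result and title:
        result["title"] = title
    return result


page = (
    "<html><head><title>The Rock (1996)</title>"
    '<meta property="og:type" content="movie" /></head></html>'
)
print(parse(page))
print(parse(page, scrape=True))
```

An explicit `og:title` always wins; the scraped value is used only when the tag is absent.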

opened by jjdelc 0

Make it possible to specify the parser for BeautifulSoup4

If lxml is installed, BeautifulSoup4 picks it as the default parser, so it would be better to let the caller specify the parser explicitly. https://github.com/erikriver/opengraph/blob/e2322563004c923a4c1ce136733a44efe5fc8caa/opengraph/opengraph.py#L63 No parser is passed to BeautifulSoup at this line, so the default is used.

Depending on the environment, this can cause problems such as the one reported in https://github.com/erikriver/opengraph/issues/37

As a solution, I think it would be a good idea to add a new argument for selecting the parser, alongside the existing arguments here: https://github.com/erikriver/opengraph/blob/e2322563004c923a4c1ce136733a44efe5fc8caa/opengraph/opengraph.py#L28

opened by fumiya5863 0

docs: fix simple typo, parsung -> parsing

There is a small typo in README.rst.

Should read parsing rather than parsung.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 0

Metadata not in head but in the body

Hi,

I am having an issue with getting the metadata using opengraph_py3, urllib and bs4.

In the parser method you are only checking the <head>, but it looks like <meta> tags are sometimes in the body. Any ideas how I can fix this? Is it due to the User-Agent?

urllib3 1.23
opengraph-py3 0.71
beautifulsoup4 4.6.0

import re
import opengraph_py3 as opengraph
import urllib.request
from bs4 import BeautifulSoup

raw = urllib.request.FancyURLopener().open("https://youtu.be/DQwU_kU4pUg")
html = raw.read()
soap = BeautifulSoup(html, 'html.parser')

# This is the same code as in `parser`
soap.html.head.findAll(property=re.compile(r'^og'))
# []

soap.html.body.findAll(property=re.compile(r'^og'))
# [<meta content="YouTube" property="og:site_na....]

opened by ThePavolC 0
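One way around the problem reported above is to look for og tags anywhere in the document instead of only under `<head>`. The sketch below shows the idea with the standard library's `html.parser`; `OgMetaCollector` and `collect_og` are hypothetical names, and this is not a patch for the module itself. With BeautifulSoup, the equivalent would presumably be to call `findAll` on the whole document rather than on `soap.html.head`.

```python
from html.parser import HTMLParser


class OgMetaCollector(HTMLParser):
    """Collect og: <meta> tags wherever they appear in the document."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            self.found[prop[3:]] = attrs["content"]


def collect_og(html):
    """Return a dict of og properties found in head or body."""
    collector = OgMetaCollector()
    collector.feed(html)
    collector.close()
    return collector.found


page = """
<html><head><title>Example</title></head>
<body><meta property="og:site_name" content="YouTube"></body></html>
"""
print(collect_og(page))  # {'site_name': 'YouTube'}
```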

According to http://ogp.me/ description is not required.

Also fix the example.

Please note the example from the readme was not working

>>> HTML = """ ... ... The Rock (1996) ... ... ... ... ... ... ... ... ... ... """
>>> import opengraph
>>> movie = opengraph.OpenGraph()
>>> movie.parser(HTML)
>>> movie.is_valid()
False
>>> movie.required_attrs
['title', 'type', 'image', 'url', 'description']
>>> movie.required_attrs.pop(-1)
'description'
>>> movie.is_valid()
True

opened by LeResKP 0
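The validity check discussed in this PR amounts to testing for a fixed set of required properties. The sketch below is independent of the module: `REQUIRED_ATTRS`, `missing_attrs` and `is_valid` are names chosen for this example, and the required set follows http://ogp.me/, which lists title, type, image and url.

```python
# Required properties for every page according to http://ogp.me/.
REQUIRED_ATTRS = ("title", "type", "image", "url")


def missing_attrs(props, required=REQUIRED_ATTRS):
    """Return the required properties that are absent or empty in props."""
    return [name for name in required if not props.get(name)]


def is_valid(props, required=REQUIRED_ATTRS):
    """Return True when every required property has a value."""
    return not missing_attrs(props, required)


movie = {
    "title": "The Rock",
    "type": "movie",
    "url": "http://www.imdb.com/title/tt0117500/",
    "image": "http://ia.media-imdb.com/images/rock.jpg",
}
print(is_valid(movie))  # True
print(missing_attrs(movie, REQUIRED_ATTRS + ("description",)))  # ['description']
```

Passing the required set as an argument keeps the stricter check available to callers who do want a description.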

Not working in Python 3

It works when I run with Python 2, but when I run with Python 3 I get the following error.

Traceback (most recent call last):
  File "og.py", line 9, in <module>
    import opengraph
  File "/usr/local/lib/python3.5/dist-packages/opengraph/__init__.py", line 1, in <module>
    from opengraph import OpenGraph

opened by Zerokami 3

How to set custom User Agent?

Udemy.com is blocking the default User Agent of opengraph.

I'm getting

urllib2.HTTPError: HTTP Error 403: Unauthorized

How do I set a custom user agent for the OpenGraph module?

As a workaround I have created a custom getter using the requests module

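import requests
from opengraph import OpenGraph

try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2


def custom_get_img_from_link(link):
    """Fetch link with a custom User-Agent and return its og:image URL, or None."""
    # headers = {"User-Agent": get_random_UA()}
    headers = {"User-Agent": "My bot"}
    r = requests.get(link, headers=headers)

    parsed_uri = urlparse(link)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    # `parser` is presumably a replacement parser function defined elsewhere (not shown)
    OpenGraph.parser = parser
    OpenGraph.scrape = True  # workaround for some subtle bug in opengraph

    page = OpenGraph(html=r.content)

    if page.is_valid():
        image_url = page.get('image', None)

        # Resolve relative image paths against the site root
        if image_url and not image_url.startswith('http'):
            image_url = urljoin(domain, image_url)

        return image_url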

opened by Zerokami 0

