A Python module to parse the Open Graph protocol

Overview

OpenGraph is a Python module for parsing the Open Graph protocol; you can read more about the specification at http://ogp.me/.

Installation

$ pip install opengraph

Features

  • Use it as a Python dict
  • Input and parsing from a specific URL
  • Input and parsing from previously extracted HTML
  • HTML output
  • JSON output

Usage

From a URL

>>> import opengraph
>>> video = opengraph.OpenGraph(url="http://www.youtube.com/watch?v=q3ixBmDzylQ")
>>> video.is_valid()
True
>>> for x, y in video.items():
...     print("%-15s => %s" % (x, y))
...
site_name       => YouTube
description     => Eric Clapton and Paul McCartney perform George Harrison's "While My Guitar Gently Weeps" at the...
title           => While My Guitar Gently Weeps
url             => http://www.youtube.com/watch?v=q3ixBmDzylQ
image           => http://i2.ytimg.com/vi/q3ixBmDzylQ/default.jpg
video:type      => application/x-shockwave-flash
video:height    => 224
video           => http://www.youtube.com/v/q3ixBmDzylQ?version=3&autohide=1
video:width     => 398
type            => video
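
Because OpenGraph behaves like a Python dict (see Features), values can also be read by key; a minimal example based on the output above:

>>> video['title']
'While My Guitar Gently Weeps'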

From HTML

>>> HTML = """
... <html xmlns:og="http://ogp.me/ns#">
... <head>
... <title>The Rock (1996)</title>
... <meta property="og:title" content="The Rock" />
... <meta property="og:type" content="movie" />
... <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
... <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
... </head>
... </html>
... """
>>> movie = opengraph.OpenGraph() # or instantiate directly with opengraph.OpenGraph(html=HTML)
>>> movie.parser(HTML)
>>> movie.is_valid()
True

Generate JSON or HTML

>>> ogp = opengraph.OpenGraph("http://ogp.me/")
>>> print(ogp.to_json())
{"image:type": "image/png", "title": "Open Graph protocol", "url": "http://ogp.me/", "image": "http://ogp.me/logo.png", "scrape": false, "_url": "http://ogp.me/", "image:height": "300", "type": "website", "image:width": "300", "description": "The Open Graph protocol enables any web page to become a rich object in a social graph."}
>>> print(ogp.to_html())

<meta property="og:image:type" content="image/png" />
<meta property="og:title" content="Open Graph protocol" />
<meta property="og:url" content="http://ogp.me/" />
<meta property="og:image" content="http://ogp.me/logo.png" />
<meta property="og:scrape" content="False" />
<meta property="og:_url" content="http://ogp.me/" />
<meta property="og:image:height" content="300" />
<meta property="og:type" content="website" />
<meta property="og:image:width" content="300" />
<meta property="og:description" content="The Open Graph protocol enables any web page to become a rich object in a social graph." />

Comments
  • fix exception when encountering invalid og tag

    Pages from some websites, such as nytimes.com, may contain invalid og tags, which cause an exception. This fix checks that a tag has the right attribute before reading its value.
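
    A minimal sketch of the described guard (the dict and HTML here are illustrative, not the project's exact code):

    import re
    from bs4 import BeautifulSoup

    html = '<head><meta property="og:title" content="The Rock" /><meta property="og:broken" /></head>'
    data = {}
    for og in BeautifulSoup(html, 'html.parser').findAll(property=re.compile(r'^og')):
        if og.has_attr('content'):      # skip malformed og tags without a content attribute
            data[og['property'][3:]] = og['content']
    # data == {'title': 'The Rock'}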

    opened by syshen 1
  • specify a dummy user agent

    Some websites block access from scripts, to avoid 'unnecessary' usage of their servers, by inspecting the headers urllib sends; this results in a 403 error when fetching data. This PR adds a dummy User-Agent.
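
    A sketch of the idea with Python 3's urllib (the User-Agent string is an assumption):

    import urllib.request

    req = urllib.request.Request(
        "http://ogp.me/",
        headers={"User-Agent": "Mozilla/5.0 (compatible; opengraph)"},
    )
    html = urllib.request.urlopen(req).read()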

    opened by radfaz 0
  • Fix malformed json error message

    This is a valid Python string, but the JSON is incorrect. See the JSON spec:

    A string is a sequence of zero or more Unicode characters, wrapped in double quotes [...]

    For "simplicity" I'm creating the Python string in single quotes, arguing that the codebase already uses a mixed double- and single quoting style, anyway.

    Update: Probably not necessary if https://github.com/erikriver/opengraph/pull/4/files gets merged.

    opened by norpol 0
  • Added option to scrape page for attributes in case og meta elements are not present

    To work similarly to Facebook's robot, I've added a 'scrape' parameter to the OpenGraph class; when True, it will scrape the document's body for fallback values in case the required og meta attributes are not present.
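
    Hypothetical usage of the added parameter (the URL is illustrative; `scrape` defaults to False, as the JSON output above shows):

    import opengraph

    # With scrape=True, missing og values fall back to what can be
    # scraped from the document body.
    page = opengraph.OpenGraph(url="http://example.com/", scrape=True)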

    opened by jjdelc 0
  • Make it possible to specify the parser for BeautifulSoup4

    If you have lxml installed, BeautifulSoup4 will pick lxml as its default parser, so it would be better to be able to specify the parser explicitly depending on the situation. At https://github.com/erikriver/opengraph/blob/e2322563004c923a4c1ce136733a44efe5fc8caa/opengraph/opengraph.py#L63 no parser is specified, so the default is used.

    Depending on the environment, this can lead to issues such as https://github.com/erikriver/opengraph/issues/37

    As a solution, I think it would be a good idea to add a parser argument that can be selected here: https://github.com/erikriver/opengraph/blob/e2322563004c923a4c1ce136733a44efe5fc8caa/opengraph/opengraph.py#L28
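
    A sketch of the BeautifulSoup behavior behind the proposal (the HTML is illustrative):

    from bs4 import BeautifulSoup

    html = "<html><head><title>t</title></head></html>"
    # Naming the parser explicitly avoids surprises when lxml is installed,
    # since BeautifulSoup otherwise prefers lxml over html.parser by itself.
    soup = BeautifulSoup(html, "html.parser")   # or "lxml", "html5lib"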

    opened by fumiya5863 0
  • docs: fix simple typo, parsung -> parsing

    There is a small typo in README.rst.

    Should read parsing rather than parsung.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • Metadata not in head but in the body

    Hi,

    I am having an issue with getting the metadata using opengraph_py3, urllib and bs4.

    In the parser method you are only checking the <head>, but it looks like <meta> tags are sometimes in the body. Any ideas how I can fix this? Is it due to the User-Agent?

    • urllib3 1.23
    • opengraph-py3 0.71
    • beautifulsoup4 4.6.0
    import re
    import opengraph_py3 as opengraph
    import urllib.request
    from bs4 import BeautifulSoup

    raw = urllib.request.FancyURLopener().open("https://youtu.be/DQwU_kU4pUg")
    html = raw.read()
    soup = BeautifulSoup(html, 'html.parser')

    # This is the same code as in `parser`
    soup.html.head.findAll(property=re.compile(r'^og'))
    # []

    soup.html.body.findAll(property=re.compile(r'^og'))
    # [<meta content="YouTube" property="og:site_na....]
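
    A possible workaround (my assumption, not the library's fix) is to search the whole document instead of only <head>, reusing `soup` from the snippet above:

    soup.findAll("meta", property=re.compile(r'^og'))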
    
    opened by ThePavolC 0
  • According to http://ogp.me/ description is not required.

    Also fix the example.

    Please note the example from the readme was not working:

    >>> HTML = """
    ... <html xmlns:og="http://ogp.me/ns#">
    ... <head>
    ... <title>The Rock (1996)</title>
    ... <meta property="og:title" content="The Rock" />
    ... <meta property="og:type" content="movie" />
    ... <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
    ... <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
    ... </head>
    ... </html>
    ... """
    >>> import opengraph
    >>> movie = opengraph.OpenGraph()
    >>> movie.parser(HTML)
    >>> movie.is_valid()
    False
    >>> movie.required_attrs
    ['title', 'type', 'image', 'url', 'description']
    >>> movie.required_attrs.pop(-1)
    'description'
    >>> movie.is_valid()
    True

    opened by LeResKP 0
  • Not working in Python 3

    It works when I run with Python 2, but when I run with Python 3 I get the following error:

    Traceback (most recent call last):
      File "og.py", line 9, in <module>
        import opengraph
      File "/usr/local/lib/python3.5/dist-packages/opengraph/__init__.py", line 1, in <module>
        from opengraph import OpenGraph
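
    A likely cause (my assumption from the truncated traceback): opengraph/__init__.py uses a Python 2-style implicit relative import, which Python 3 rejects; the import would need to be explicit:

    # opengraph/__init__.py
    from .opengraph import OpenGraph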
    
    opened by Zerokami 3
  • How to set custom User Agent?

    Udemy.com is blocking the default User-Agent of opengraph. I'm getting:

    urllib2.HTTPError: HTTP Error 403: Unauthorized

    How do I set a custom User-Agent for the OpenGraph module?

    As a workaround I have created a custom getter using the requests module:

    import requests
    from urllib.parse import urljoin, urlparse

    from opengraph import OpenGraph

    def custom_get_img_from_link(link):
        """Fetch `link` with a custom User-Agent and return its og:image URL."""
        # headers = {"User-Agent": get_random_UA()}
        headers = {"User-Agent": "My bot"}
        r = requests.get(link, headers=headers)

        parsed_uri = urlparse(link)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

        # OpenGraph.parser = parser  # custom parser, defined elsewhere
        OpenGraph.scrape = True  # workaround for some subtle bug in opengraph

        page = OpenGraph(html=r.content)

        if page.is_valid():
            image_url = page.get('image', None)

            # Resolve relative image URLs against the page's domain.
            if image_url and not image_url.startswith('http'):
                image_url = urljoin(domain, page['image'])

            return image_url
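
    Usage is then straightforward (the URL is illustrative):

    image = custom_get_img_from_link("http://www.imdb.com/title/tt0117500/")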
    
    
    opened by Zerokami 0
Owner

Erik Rivera