a small library for extracting rich content from urls

Overview

http://media.charlesleifer.com/blog/photos/micawber-logo-0.png

A small library for extracting rich content from urls.

what does it do?

micawber supplies a few methods for retrieving rich metadata about a variety of links, such as links to youtube videos. micawber also provides functions for parsing blocks of text and html and replacing links to videos with rich embedded content.

examples

here is a quick example:

import micawber

# load up rules for some default providers, such as youtube and flickr
providers = micawber.bootstrap_basic()

providers.request('http://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following dictionary:
{
    'author_name': 'pascalbrax',
    'author_url': u'http://www.youtube.com/user/pascalbrax'
    'height': 344,
    'html': u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>',
    'provider_name': 'YouTube',
    'provider_url': 'http://www.youtube.com/',
    'title': 'Future Crew - Second Reality demo - HD',
    'type': u'video',
    'thumbnail_height': 360,
    'thumbnail_url': u'http://i2.ytimg.com/vi/54XHDUOHuzU/hqdefault.jpg',
    'thumbnail_width': 480,
    'url': 'http://www.youtube.com/watch?v=54XHDUOHuzU',
    'width': 459,
    'version': '1.0',
}

providers.parse_text('this is a test:\nhttp://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following string:
this is a test:
<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>

providers.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')

# returns the following html:
<p><iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&amp;feature=oembed" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>
Comments
  • CSP headers

    CSP headers

    Hi! I'm using Flask but this will be usefull for Django and others Will be supernice to have a feature that accumulates in a per request cache or something which services has been used and correct the content security policy header to include this services as accepted origins

    Otherwise the embedded object will not load blocked by the browser and it is not acceptable to allow any origin but only those needed

    Thanks a lot!

    opened by Garito 9
  • HTML parser doesn't deal with &

    HTML parser doesn't deal with &

    Suppose you've got the following content:

    Testing
    
    http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc
    

    (Note: the link itself is not valid due to mangled IDs (it was a private album))

    Rendering this content as follows will not work:

    {{post.body|linebreaksbr|oembed_html}}
    

    The reason is that the "&" has been escaped and turned into "&amp". The HTML parser over at https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L144 does recognize & extract the URL, but it does not unescape &amp. Hence, &amp is fed to embed.ly... resulting in a 404 over there.

    opened by pennersr 8
  • 'IOError: [Errno 11] Resource temporarily unavailable' with Peewee sample blog app

    'IOError: [Errno 11] Resource temporarily unavailable' with Peewee sample blog app

    I get the error shown below when I run the Peewee sample blog app from here: https://github.com/coleifer/peewee/tree/master/examples/blog

    Specifically this happens when Micawber tries to display a post with links that need converting to embeds (e.g. a YouTube video link).

    I've been able to reproduce this reliably with different links (e.g. Vimeo links instead of YouTube) and different browsers. It doesn't always happen immediately, but if you click around to view the posts with embeds, then return to the index page, then view posts again, the error appears and the page is either unavailable or shows the page with no CSS. Errors in the console show that files failed to load: Failed to load resource: net::ERR_SOCKET_NOT_CONNECTED

    This is in a Python 2.7.10 virtualenv on Ubuntu 15.10 running the Flask dev server.

    Interestingly, running it in a Python 3.4 virtualenv works without issues. But it would be great to have a fix for Python 2.

    Exception happened during processing of request from ('127.0.0.1', 33044)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
        self.process_request(request, client_address)
      File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
        self.finish_request(request, client_address)
      File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
        self.RequestHandlerClass(request, client_address, self)
      File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
        self.handle()
      File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 216, in handle
        rv = BaseHTTPRequestHandler.handle(self)
      File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
        self.handle_one_request()
      File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 247, in handle_one_request
        self.raw_requestline = self.rfile.readline()
    IOError: [Errno 11] Resource temporarily unavailable
    
    opened by keybits 7
  • parse_html overhead

    parse_html overhead

    >>> import micawber
    >>> providers = micawber.bootstrap_basic()
    >>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
    u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'
    

    What is html and body tags ? i do not need it.

    >>> micawber.parse_text('http://www.youtube.com/watch?v=54XHDUOHuzU', providers)
    u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen></iframe>'
    >>> micawber.parse_text('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
    u'<p><a href="http://www.youtube.com/watch?v=54XHDUOHuzU" title="Future Crew - Second Reality demo - HD">Future Crew - Second Reality demo - HD</a></p>'
    

    I don't want link, i want iframe, etc, as in docs, even i have other tags in text.

    I use bs4, but why it is not in docs as dependency?

    ps. Python 2.7.3 (default, Mar 13 2014, 11:03:55)

    opened by LennyLip 7
  • 500px and bootstrap_embedly

    500px and bootstrap_embedly

    In [3]: requests.get('http://api.embed.ly/1/oembed?url=https%3A%2F%2Fiso.500px.com%2Fguest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following%2F&maxwidth=500').json()
    Out[3]: 
    {u'author_name': u'DL Cade',
     u'author_url': u'https://iso.500px.com/author/dl/',
     u'description': u"One of December's talented 500px Guest Curators was photographer Joel (Julius) Tjintjelaar , and he fully embraced the real purpose of the Editors' Choice section: to unveil photos and photographers that might not have made the Popular page for one reason or another... but probably should have.",
     u'provider_name': u'500px',
     u'provider_url': u'https://iso.500px.com',
     u'thumbnail_height': 1000,
     u'thumbnail_url': u'https://isocdn.500px.org/wp-content/uploads/2014/12/julius-1500x1000.jpg',
     u'thumbnail_width': 1500,
     u'title': u'Guest Curator Joel (Julius) Tjintjelaar Reveals Three Photographers that Should Have a Larger Following',
     u'type': u'link',
     u'url': u'https://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/',
     u'version': u'1.0'}
    
    In [4]: bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
    ---------------------------------------------------------------------------
    ProviderNotFoundException                 Traceback (most recent call last)
    <ipython-input-4-aca3a4c8cf6f> in <module>()
    ----> 1 bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
    
    /tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
         91                 self.cache.set(key, data)
         92             return data
    ---> 93         return fn(self, url, **params)
         94     return inner
         95 
    
    /tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
        132         if provider:
        133             return provider.request(url, **params)
    --> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
        135 
        136 
    
    ProviderNotFoundException: Provider not found for "http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/"
    
    opened by ad-m 7
  • Youtube Playlists

    Youtube Playlists

    I'm not quite sure where the fault for this lies, but here seems a good start.

    Embedding a youtube playlist using embed.ly directly works okay: http://embed.ly/code?url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLE2714DC8F2BA092D (literally an example playlist heh)

    Running it thorough micawber doesn't embed anything using the URL: https://www.youtube.com/playlist?list=PLE2714DC8F2BA092D - using the embed URL of https://www.youtube.com/embed/videoseries?list=PLE2714DC8F2BA092D results in the first video in the series being embedded but no playlist controls.

    opened by kieranhogg 6
  • feature request: add media.ccc.de integration

    feature request: add media.ccc.de integration

    Hi! Falsely reported to nikola (to add more features), I'm now reporting this here as a feature request: It would be great to integrate videos/ streams from https://media.ccc.de into this library.

    The service is run by the German hacker association Chaos Computer Club (CCC), which hosts annual events itself and lends streaming expertise to many external events via its Video Operation Center (VOC).

    The streaming service is a valuable source of information on many different topics and I think it would be an awesome addition!

    If you have pointers on where I can add it (I assume somewhere in providers.py), I might be able to do a pull request myself. I wouldn't call myself a Python expert though :-)

    opened by dvzrv 5
  • Option to convert only single-line links

    Option to convert only single-line links

    The solution in issue #29 does not fix the problem of having provider links inside of markup (or Markdown, for that matter) somewhere within the line. The resulting code is still mangled, in particular when micawber is combined with a Markdown renderer (Misaka).

    In this custom filter I took parts of parsers.py, essentially removing the else: line = parse_text_full .. block, which I think should be optional.

    In the linked commit I also add my own line-parser for performance reasons, but I'll soon learn to register my own provider and fix that ;-)

    opened by loleg 4
  • bootstrap_basic raw strings / escapes

    bootstrap_basic raw strings / escapes

    I noticed that a lot of the regular expression patterns in bootstrap_basic don't escape dots (match all). This means that a fair number of these patterns will match more than intended.

    In addition most patterns aren't marked as raw strings and therefore contain invalid escape sequences. This isn't noticeable directly, but could cause issues in a future python version.

    For an example of the latter:

    python -W always -c '"https://\S*?soundcloud.com/\S+"' <string>:1: DeprecationWarning: invalid escape sequence \S

    opened by jaap3 4
  • Packaging: examples conflict with flasgger

    Packaging: examples conflict with flasgger

    There is a file conflict between flasgger and micawber, because both install files into the too generic path name examples. For reference, please see this Arch Linux bug.

    As a solution, micawber and flasgger should either not install these examples at all, or if required into a unique directory (e.g. micawber-examples) or another system directory (e.g. on Linux: /usr/share/doc/python-micawber/examples, which is usually done by the packagers).

    I will remove them for now to resolve the file conflict.

    opened by dvzrv 4
  • performance suggestion

    performance suggestion

    I'm considering migrating to micawber from a custom oembed consumer, and wanted to suggest a performance improvement that I am willing to generate a PR for.

    I'd like to extend the ProviderRegistry with a secondary internal register that nests providers under domain names.

    this would allow users to optionally avoid a regex match against every provider and only test the domain.

    some light tests on a quick mockup showed the lookups to run in 30% the time -- including the overhead of parsing the domain name from a url, but about 5% of the time if you have the domain already.

    we would be using this on a high volume indexer, so this performance is a need.

    opened by jvanasco 4
Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

Sheryar 10 Aug 7, 2022
Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a bot

Aliexpress to telegram post Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a b

Fernando 6 Dec 6, 2022
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 6, 2023
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage >>> import lassie >>> lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

Xavier Grangier 3.8k Jan 2, 2023
API to parse tibia.com content into python objects.

Tibia.py An API to parse Tibia.com content into object oriented data. No fetching is done by this module, you must provide the html content. Features:

Allan Galarza 25 Oct 31, 2022
Newsscraper - A simple Python 3 module to get crypto or news articles and their content from various RSS feeds.

NewsScraper A simple Python 3 module to get crypto or news articles and their content from various RSS feeds. ?? Installation Clone the repo locally.

Rokas 3 Jan 2, 2022
A Python library for automating interaction with websites.

Home page https://mechanicalsoup.readthedocs.io/ Overview A Python library for automating interaction with websites. MechanicalSoup automatically stor

null 4.3k Jan 7, 2023
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

Max Humber 692 Dec 22, 2022
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

Chip Huyen 2.1k Jan 6, 2023
An utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line!

Social Media Scraper An utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line! Go to the website » Vie

null 2 Aug 3, 2022
An helper library to scrape data from TikTok in one line, using the Influencer Hunters APIs.

TikTok Scraper An utility library to scrape data from TikTok hassle-free Go to the website » View Demo · Report Bug · Request Feature About The Projec

null 6 Jan 8, 2023
Here I provide the source code for doing web scraping using the python library, it is Selenium.

Here I provide the source code for doing web scraping using the python library, it is Selenium.

M Khaidar 1 Nov 13, 2021
Webservice wrapper for hhursev/recipe-scrapers (python library to scrape recipes from websites)

recipe-scrapers-webservice This is a wrapper for hhursev/recipe-scrapers which provides the api as a webservice, to be consumed as a microservice by o

null 1 Jul 9, 2022
Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit forms.

Dan Claudiu Pop 79 Nov 27, 2022
a small library for extracting rich content from urls

A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety o

Charles Leifer 588 Dec 27, 2022
Rich is a Python library for rich text and beautiful formatting in the terminal.

Rich 中文 readme • lengua española readme • Läs på svenska Rich is a Python library for rich text and beautiful formatting in the terminal. The Rich API

Will McGugan 41.4k Jan 2, 2023
Rich is a Python library for rich text and beautiful formatting in the terminal.

Rich 中文 readme • lengua española readme • Läs på svenska Rich is a Python library for rich text and beautiful formatting in the terminal. The Rich API

Will McGugan 41.5k Jan 7, 2023