a small library for extracting rich content from urls

Overview

http://media.charlesleifer.com/blog/photos/micawber-logo-0.png

A small library for extracting rich content from urls.

what does it do?

micawber supplies a few methods for retrieving rich metadata about a variety of links, such as links to YouTube videos. It also provides functions for parsing blocks of text and HTML and replacing links to videos with rich embedded content.

examples

here is a quick example:

import micawber

# load up rules for some default providers, such as youtube and flickr
providers = micawber.bootstrap_basic()

providers.request('http://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following dictionary:
{
    'author_name': 'pascalbrax',
    'author_url': u'http://www.youtube.com/user/pascalbrax',
    'height': 344,
    'html': u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>',
    'provider_name': 'YouTube',
    'provider_url': 'http://www.youtube.com/',
    'title': 'Future Crew - Second Reality demo - HD',
    'type': u'video',
    'thumbnail_height': 360,
    'thumbnail_url': u'http://i2.ytimg.com/vi/54XHDUOHuzU/hqdefault.jpg',
    'thumbnail_width': 480,
    'url': 'http://www.youtube.com/watch?v=54XHDUOHuzU',
    'width': 459,
    'version': '1.0',
}

providers.parse_text('this is a test:\nhttp://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following string:
this is a test:
<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>

providers.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')

# returns the following html:
<p><iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&amp;feature=oembed" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>
Comments
  • CSP headers

    Hi! I'm using Flask, but this would be useful for Django and other frameworks too. It would be great to have a feature that accumulates, in a per-request cache or similar, which services have been used, and then adjusts the Content-Security-Policy header to include those services as accepted origins.

    Otherwise the browser blocks the embedded object and it does not load. Allowing any origin is not acceptable; only the origins that are actually needed should be allowed.

    Thanks a lot!
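
A rough sketch of the request above, assuming the embed markup comes from the `html` key of the oEmbed response shown in the README example. The helpers `embed_origins` and `csp_frame_src` are hypothetical and not part of micawber; the sketch only collects the origins of `src` attributes and joins them into a `frame-src` directive.

```python
import re
from urllib.parse import urlsplit

# Hypothetical helpers for illustration; micawber does not provide these.
SRC_RE = re.compile(r'\bsrc="([^"]+)"')

def embed_origins(embed_html):
    """Return the set of scheme://host origins referenced by src attributes."""
    origins = set()
    for src in SRC_RE.findall(embed_html):
        parts = urlsplit(src)
        if parts.scheme and parts.netloc:
            origins.add('%s://%s' % (parts.scheme, parts.netloc))
    return origins

def csp_frame_src(origins):
    """Build a frame-src directive that allows only the collected origins."""
    return 'frame-src ' + ' '.join(["'self'"] + sorted(origins))

html = ('<iframe width="459" height="344" '
        'src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" '
        'frameborder="0" allowfullscreen></iframe>')
seen = embed_origins(html)          # accumulate this per request
header_value = csp_frame_src(seen)  # value for the Content-Security-Policy header
print(header_value)
```

In a web framework, the set of origins could be accumulated while rendering and written to the response header in an after-request hook; that wiring is left out here.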

    opened by Garito 9
  • HTML parser doesn't deal with &

    Suppose you've got the following content:

    Testing
    
    http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc
    

    (Note: the link itself is not valid due to mangled IDs (it was a private album))

    Rendering this content as follows will not work:

    {{post.body|linebreaksbr|oembed_html}}
    

    The reason is that the "&" has been escaped and turned into "&amp;". The HTML parser over at https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L144 does recognize and extract the URL, but it does not unescape "&amp;". Hence, "&amp;" is fed to embed.ly, resulting in a 404 over there.
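
One possible workaround, sketched under the assumption that the URL was pulled out of already-escaped HTML: unescape it before handing it to a provider. `html.unescape` is in the Python 3 standard library; the function `clean_extracted_url` is hypothetical and is not how micawber's parser is written.

```python
from html import unescape

def clean_extracted_url(url):
    """Turn HTML entities such as &amp; back into literal characters."""
    return unescape(url)

escaped = ('http://picasaweb.google.com/lh/sredir'
           '?uname=test&amp;target=ALBUM&amp;id=123&amp;authkey=abc')
cleaned = clean_extracted_url(escaped)
print(cleaned)
```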

    opened by pennersr 8
  • 'IOError: [Errno 11] Resource temporarily unavailable' with Peewee sample blog app

    I get the error shown below when I run the Peewee sample blog app from here: https://github.com/coleifer/peewee/tree/master/examples/blog

    Specifically this happens when Micawber tries to display a post with links that need converting to embeds (e.g. a YouTube video link).

    I've been able to reproduce this reliably with different links (e.g. Vimeo links instead of YouTube) and different browsers. It doesn't always happen immediately, but if you click around to view the posts with embeds, then return to the index page, then view posts again, the error appears and the page is either unavailable or shows the page with no CSS. Errors in the console show that files failed to load: Failed to load resource: net::ERR_SOCKET_NOT_CONNECTED

    This is in a Python 2.7.10 virtualenv on Ubuntu 15.10 running the Flask dev server.

    Interestingly, running it in a Python 3.4 virtualenv works without issues. But it would be great to have a fix for Python 2.

    Exception happened during processing of request from ('127.0.0.1', 33044)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
        self.process_request(request, client_address)
      File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
        self.finish_request(request, client_address)
      File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
        self.RequestHandlerClass(request, client_address, self)
      File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
        self.handle()
      File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 216, in handle
        rv = BaseHTTPRequestHandler.handle(self)
      File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
        self.handle_one_request()
      File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 247, in handle_one_request
        self.raw_requestline = self.rfile.readline()
    IOError: [Errno 11] Resource temporarily unavailable
    
    opened by keybits 7
  • parse_html overhead

    >>> import micawber
    >>> providers = micawber.bootstrap_basic()
    >>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
    u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'
    

    What are the html and body tags for? I do not need them.

    >>> micawber.parse_text('http://www.youtube.com/watch?v=54XHDUOHuzU', providers)
    u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen></iframe>'
    >>> micawber.parse_text('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
    u'<p><a href="http://www.youtube.com/watch?v=54XHDUOHuzU" title="Future Crew - Second Reality demo - HD">Future Crew - Second Reality demo - HD</a></p>'
    

    I don't want a link; I want the iframe, as in the docs, even when the text contains other tags.

    I use bs4, but why is it not listed in the docs as a dependency?

    ps. Python 2.7.3 (default, Mar 13 2014, 11:03:55)

    opened by LennyLip 7
  • 500px and bootstrap_embedly

    In [3]: requests.get('http://api.embed.ly/1/oembed?url=https%3A%2F%2Fiso.500px.com%2Fguest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following%2F&maxwidth=500').json()
    Out[3]: 
    {u'author_name': u'DL Cade',
     u'author_url': u'https://iso.500px.com/author/dl/',
     u'description': u"One of December's talented 500px Guest Curators was photographer Joel (Julius) Tjintjelaar , and he fully embraced the real purpose of the Editors' Choice section: to unveil photos and photographers that might not have made the Popular page for one reason or another... but probably should have.",
     u'provider_name': u'500px',
     u'provider_url': u'https://iso.500px.com',
     u'thumbnail_height': 1000,
     u'thumbnail_url': u'https://isocdn.500px.org/wp-content/uploads/2014/12/julius-1500x1000.jpg',
     u'thumbnail_width': 1500,
     u'title': u'Guest Curator Joel (Julius) Tjintjelaar Reveals Three Photographers that Should Have a Larger Following',
     u'type': u'link',
     u'url': u'https://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/',
     u'version': u'1.0'}
    
    In [4]: bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
    ---------------------------------------------------------------------------
    ProviderNotFoundException                 Traceback (most recent call last)
    <ipython-input-4-aca3a4c8cf6f> in <module>()
    ----> 1 bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
    
    /tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
         91                 self.cache.set(key, data)
         92             return data
    ---> 93         return fn(self, url, **params)
         94     return inner
         95 
    
    /tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
        132         if provider:
        133             return provider.request(url, **params)
    --> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
        135 
        136 
    
    ProviderNotFoundException: Provider not found for "http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/"
    
    opened by ad-m 7
  • Youtube Playlists

    I'm not quite sure where the fault for this lies, but here seems a good start.

    Embedding a youtube playlist using embed.ly directly works okay: http://embed.ly/code?url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLE2714DC8F2BA092D (literally an example playlist heh)

    Running it through micawber doesn't embed anything using the URL https://www.youtube.com/playlist?list=PLE2714DC8F2BA092D. Using the embed URL https://www.youtube.com/embed/videoseries?list=PLE2714DC8F2BA092D results in the first video in the series being embedded, but with no playlist controls.

    opened by kieranhogg 6
  • feature request: add media.ccc.de integration

    Hi! I first reported this to nikola by mistake (as a request to add more features), so I'm now reporting it here as a feature request: it would be great to integrate videos and streams from https://media.ccc.de into this library.

    The service is run by the German hacker association Chaos Computer Club (CCC), which hosts annual events itself and lends streaming expertise to many external events via its Video Operation Center (VOC).

    The streaming service is a valuable source of information on many different topics and I think it would be an awesome addition!

    If you have pointers on where I can add it (I assume somewhere in providers.py), I might be able to do a pull request myself. I wouldn't call myself a Python expert though :-)

    opened by dvzrv 5
  • Option to convert only single-line links

    The solution in issue #29 does not fix the problem of having provider links inside of markup (or Markdown, for that matter) somewhere within the line. The resulting code is still mangled, in particular when micawber is combined with a Markdown renderer (Misaka).

    In this custom filter I took parts of parsers.py, essentially removing the else: line = parse_text_full .. block, which I think should be optional.

    In the linked commit I also add my own line-parser for performance reasons, but I'll soon learn to register my own provider and fix that ;-)
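
A minimal sketch of the "single-line links only" behaviour described above, assuming a caller-supplied `lookup` function that returns embed HTML for a URL or `None`. Both `convert_standalone_links` and `fake_lookup` are hypothetical names used for illustration; this is not the code from the linked commit or from parsers.py.

```python
import re

# A line counts as "standalone" when it contains nothing but a single URL.
STANDALONE_URL = re.compile(r'^\s*(https?://\S+)\s*$')

def convert_standalone_links(text, lookup):
    """Replace only lines that consist of one URL; leave inline links alone."""
    out = []
    for line in text.splitlines():
        match = STANDALONE_URL.match(line)
        embed = lookup(match.group(1)) if match else None
        out.append(embed if embed is not None else line)
    return '\n'.join(out)

def fake_lookup(url):
    # Stand-in for a real provider request; illustration only.
    if 'youtube.com/watch' in url:
        return '<iframe src="%s"></iframe>' % url
    return None

sample = ('see [this](http://www.youtube.com/watch?v=54XHDUOHuzU) video\n'
          'http://www.youtube.com/watch?v=54XHDUOHuzU')
result = convert_standalone_links(sample, fake_lookup)
print(result)
```

The Markdown link on the first line is left untouched, so a later Markdown renderer still sees valid markup.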

    opened by loleg 4
  • bootstrap_basic raw strings / escapes

    I noticed that a lot of the regular expression patterns in bootstrap_basic don't escape dots, so each dot matches any character. This means that a fair number of these patterns will match more than intended.

    In addition, most patterns aren't marked as raw strings and therefore contain invalid escape sequences. This isn't directly noticeable, but it could cause issues in a future Python version.

    For an example of the latter:

    python -W always -c '"https://\S*?soundcloud.com/\S+"'
    <string>:1: DeprecationWarning: invalid escape sequence \S
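
To illustrate both points, here is a small sketch using the soundcloud pattern from the warning above. The test URL is made up; the "strict" pattern is one possible correction, not necessarily the one adopted in micawber.

```python
import re

# Raw strings (r'...') keep "\S" intact and avoid the invalid-escape warning.
loose = re.compile(r'https://\S*?soundcloud.com/\S+')    # "." matches any character
strict = re.compile(r'https://\S*?soundcloud\.com/\S+')  # "\." matches a literal dot

bogus = 'https://soundcloudXcom/some-track'   # made-up URL, not a real host
real = 'https://soundcloud.com/some-track'

loose_hits_bogus = bool(loose.match(bogus))
strict_hits_bogus = bool(strict.match(bogus))
strict_hits_real = bool(strict.match(real))
print(loose_hits_bogus, strict_hits_bogus, strict_hits_real)
```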

    opened by jaap3 4
  • Packaging: examples conflict with flasgger

    There is a file conflict between flasgger and micawber, because both install files into the overly generic path name "examples". For reference, please see this Arch Linux bug.

    As a solution, micawber and flasgger should either not install these examples at all, or if required into a unique directory (e.g. micawber-examples) or another system directory (e.g. on Linux: /usr/share/doc/python-micawber/examples, which is usually done by the packagers).

    I will remove them for now to resolve the file conflict.

    opened by dvzrv 4
  • performance suggestion

    I'm considering migrating to micawber from a custom oembed consumer, and wanted to suggest a performance improvement that I am willing to generate a PR for.

    I'd like to extend the ProviderRegistry with a secondary internal register that nests providers under domain names.

    This would allow users to optionally avoid a regex match against every provider and test only the providers registered for the URL's domain.

    Some light tests on a quick mockup showed the lookups running in about 30% of the time, including the overhead of parsing the domain name from a URL, and in about 5% of the time if you already have the domain.

    We would be using this on a high-volume indexer, so this performance is a need.
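
A minimal sketch of the proposed secondary register, with the caveat that `DomainIndexedRegistry` and its methods are hypothetical and do not reflect ProviderRegistry's actual interface. Providers are plain strings here to keep the example self-contained.

```python
import re
from urllib.parse import urlsplit

class DomainIndexedRegistry:
    """Hypothetical registry that buckets (pattern, provider) pairs by domain."""

    def __init__(self):
        self._by_domain = {}

    def register(self, domain, pattern, provider):
        self._by_domain.setdefault(domain, []).append((re.compile(pattern), provider))

    def provider_for_url(self, url, domain=None):
        # Parsing the domain is skipped when the caller already knows it.
        if domain is None:
            domain = urlsplit(url).hostname or ''
        # Only the patterns registered for this domain are tried.
        for regex, provider in self._by_domain.get(domain, ()):
            if regex.match(url):
                return provider
        return None

registry = DomainIndexedRegistry()
registry.register('www.youtube.com', r'https?://www\.youtube\.com/watch\S+', 'youtube-provider')
registry.register('vimeo.com', r'https?://vimeo\.com/\S+', 'vimeo-provider')

found = registry.provider_for_url('http://www.youtube.com/watch?v=54XHDUOHuzU')
missing = registry.provider_for_url('http://example.com/page')
print(found, missing)
```

Patterns that span several domains or use wildcards would still need a fallback to the full regex scan, which this sketch omits.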

    opened by jvanasco 4
Zotero2Readwise - A Python Library to retrieve annotations and notes from Zotero and upload them to your Readwise

Zotero ➡️ Readwise zotero2readwise is a Python library that retrieves all Zotero

Essi Alizadeh 49 Dec 20, 2022
Rich is a Python library for rich text and beautiful formatting in the terminal.

Rich 中文 readme • lengua española readme • Läs på svenska Rich is a Python library for rich text and beautiful formatting in the terminal. The Rich API

Will McGugan 41.4k Jan 2, 2023
Rich is a Python library for rich text and beautiful formatting in the terminal.

The Rich API makes it easy to add color and style to terminal output. Rich can also render pretty tables, progress bars, markdown, syntax highlighted source code, tracebacks, and more — out of the box.

Will McGugan 41.4k Jan 3, 2023
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

img2dataset Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine. Also supports

Romain Beaumont 1.4k Jan 1, 2023
Fast pattern fetcher, Takes a URLs list and outputs the URLs which contains the parameters according to the specified pattern.

Fast Pattern Fetcher (fpf) Coded with <3 by HS Devansh Raghav Fast Pattern Fetcher, Takes a URLs list and outputs the URLs which contains the paramete

whoami security 5 Feb 20, 2022
Snscrape-jsonl-urls-extractor - Extracts urls from jsonl produced by snscrape

snscrape-jsonl-urls-extractor extracts urls from jsonl produced by snscrape Usag

null 1 Feb 26, 2022
A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 7, 2022
Rich.tui is a TUI (Text User Interface) framework for Python using Rich as a renderer.

rich.tui Rich.tui is a TUI (Text User Interface) framework for Python using Rich as a renderer. The end goal is to be able to rapidly create rich term

Will McGugan 17.1k Jan 4, 2023
A Discord Rich Presence App to set your own custom rich presence.

discord-rich-presence A Discord Rich Presence App to set your own custom rich presence. #BUILDS Ready to use package are available inside "finalpackag

null 1 Nov 22, 2021
Pytest-rich - Pytest + rich integration (proof of concept)

pytest-rich Leverage rich for richer test session output. This plugin is not pub

Bruno Oliveira 170 Dec 2, 2022
Filtering user-generated video content (SberZvukTechDays)

Filtering user-generated video content(SberZvukTechDays) Table of contents General info Team members Technologies Setup Result General info This is a

Roman 6 Apr 6, 2022
Python script for changing the SSH banner content with other content

Banner-changer-py Python script for changing the SSH banner content with other content. The Script will take the content of a specified file range and

null 2 Nov 23, 2021
Small-File-Explorer - I coded a small file explorer with several options

Petit explorateur de fichier / Small file explorer Pour la première option (création de répertoire) / For the first option (creation of a directory) e

Xerox 1 Jan 3, 2022
pyglet is a cross-platform windowing and multimedia library for Python, for developing games and other visually rich applications.

pyglet pyglet is a cross-platform windowing and multimedia library for Python, intended for developing games and other visually rich applications. It

null 1.3k Jan 1, 2023
Helpful functions for use alongside the rich Python library.

Rich Tools A python package with helpful functions for use alongside with the rich python library. The current features are: Convert a Pandas

Avi Perl 14 Oct 14, 2022
RichWatch is wrapper around AWS Cloud Watch to display beautiful logs with help of Python library Rich.

RichWatch is TUI (Textual User Interface) for AWS Cloud Watch. It formats and pretty prints Cloud Watch's logs so they are much more readable. Because

null 21 Jul 25, 2022
A tool for extracting plain text from Wikipedia dumps

WikiExtractor WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The tool is written in Python and requ

Giuseppe Attardi 3.2k Dec 31, 2022
A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control o

Maxim 32 Jul 24, 2022