a small library for extracting rich content from urls

Charles Leifer

Last update: Dec 27, 2022

Overview

A small library for extracting rich content from urls.

what does it do?

micawber supplies a few methods for retrieving rich metadata about a variety of links, such as links to youtube videos. micawber also provides functions for parsing blocks of text and html and replacing links to videos with rich embedded content.

examples

here is a quick example:

import micawber

# load up rules for some default providers, such as youtube and flickr
providers = micawber.bootstrap_basic()

providers.request('http://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following dictionary:
{
    'author_name': 'pascalbrax',
    'author_url': u'http://www.youtube.com/user/pascalbrax',
    'height': 344,
    'html': u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>',
    'provider_name': 'YouTube',
    'provider_url': 'http://www.youtube.com/',
    'title': 'Future Crew - Second Reality demo - HD',
    'type': u'video',
    'thumbnail_height': 360,
    'thumbnail_url': u'http://i2.ytimg.com/vi/54XHDUOHuzU/hqdefault.jpg',
    'thumbnail_width': 480,
    'url': 'http://www.youtube.com/watch?v=54XHDUOHuzU',
    'width': 459,
    'version': '1.0',
}

providers.parse_text('this is a test:\nhttp://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following string:
this is a test:
<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>

providers.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')

# returns the following html:
<p><iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&amp;feature=oembed" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>

Comments

CSP headers

Hi! I'm using Flask, but this would be useful for Django and other frameworks too. It would be great to have a feature that records, in a per-request cache or similar, which services have been used, so that the Content-Security-Policy header can be corrected to include those services as accepted origins.

Otherwise the browser blocks the embedded object and it does not load. Allowing every origin is not acceptable; only the origins actually needed should be allowed.

Thanks a lot!
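
One way the request above could be approached outside the library, as a minimal sketch: scan the HTML that parse_html or parse_text returned, collect the origin of each iframe src, and build a frame-src directive from them. The names frame_origins and csp_header are hypothetical helpers, not part of micawber, and only iframe embeds are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _IframeOriginCollector(HTMLParser):
    """Collect the origin (scheme://host) of every iframe src attribute."""

    def __init__(self):
        super().__init__()
        self.origins = set()

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if src:
            parts = urlsplit(src)
            if parts.scheme and parts.netloc:
                self.origins.add(f"{parts.scheme}://{parts.netloc}")


def frame_origins(rendered_html):
    """Return the set of origins used by iframes in already-rendered HTML."""
    collector = _IframeOriginCollector()
    collector.feed(rendered_html)
    collector.close()
    return collector.origins


def csp_header(origins):
    """Build a frame-src directive; 'self' keeps same-origin frames working."""
    return "frame-src " + " ".join(["'self'"] + sorted(origins))


rendered = ('<p><iframe width="459" height="344" '
            'src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&amp;feature=oembed" '
            'frameborder="0" allowfullscreen></iframe></p>')
print(csp_header(frame_origins(rendered)))
# frame-src 'self' http://www.youtube.com
```

In a Flask view the resulting string could be sent as the Content-Security-Policy response header for that request. Embeds that load scripts or images would need matching script-src or img-src entries as well.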

opened by Garito 9

HTML parser doesn't deal with &

Suppose you've got the following content:

Testing http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc

(Note: the link itself is not valid due to mangled IDs (it was a private album))

Rendering this content as follows will not work:

{{post.body|linebreaksbr|oembed_html}}

The reason is that the "&" has been escaped and turned into "&amp;". The HTML parser over at https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L144 does recognize and extract the URL, but it does not unescape "&amp;". Hence "&amp;" is fed to embed.ly, resulting in a 404 over there.
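
A minimal sketch of the missing step, assuming the URL is taken from already-escaped HTML text: unescape entities with the standard library before handing the URL to a provider. The helper name url_from_html_text is hypothetical and this is not micawber's actual fix.

```python
from html import unescape


def url_from_html_text(escaped_url):
    """Turn a URL as it appears in escaped HTML text back into a plain URL."""
    return unescape(escaped_url)


escaped = ("http://picasaweb.google.com/lh/sredir"
           "?uname=test&amp;target=ALBUM&amp;id=123&amp;authkey=abc")
print(url_from_html_text(escaped))
# http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc
```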

opened by pennersr 8

'IOError: [Errno 11] Resource temporarily unavailable' with Peewee sample blog app

I get the error shown below when I run the Peewee sample blog app from here: https://github.com/coleifer/peewee/tree/master/examples/blog

Specifically this happens when Micawber tries to display a post with links that need converting to embeds (e.g. a YouTube video link).

I've been able to reproduce this reliably with different links (e.g. Vimeo links instead of YouTube) and different browsers. It doesn't always happen immediately, but if you click around to view the posts with embeds, then return to the index page, then view posts again, the error appears and the page is either unavailable or shows the page with no CSS. Errors in the console show that files failed to load: Failed to load resource: net::ERR_SOCKET_NOT_CONNECTED

This is in a Python 2.7.10 virtualenv on Ubuntu 15.10 running the Flask dev server.

Interestingly, running it in a Python 3.4 virtualenv works without issues. But it would be great to have a fix for Python 2.

Exception happened during processing of request from ('127.0.0.1', 33044)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 216, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 247, in handle_one_request
    self.raw_requestline = self.rfile.readline()
IOError: [Errno 11] Resource temporarily unavailable

opened by keybits 7

parse_html overhead

>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'

Why are the html and body tags there? I do not need them.

>>> micawber.parse_text('http://www.youtube.com/watch?v=54XHDUOHuzU', providers)
u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen></iframe>'
>>> micawber.parse_text('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<p><a href="http://www.youtube.com/watch?v=54XHDUOHuzU" title="Future Crew - Second Reality demo - HD">Future Crew - Second Reality demo - HD</a></p>'

I don't want a link; I want the iframe, as in the docs, even when the text contains other tags.

I use bs4, but why is it not listed in the docs as a dependency?

ps. Python 2.7.3 (default, Mar 13 2014, 11:03:55)
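
If the extra html and body wrappers shown in the first output cannot be avoided at the parser level, one workaround is to strip them from the returned string. This is only a sketch with a hypothetical helper, strip_document_wrappers, and it assumes the fragment is not meant to contain html or body tags of its own.

```python
import re

# Matches opening and closing <html> / <body> tags, with or without attributes.
_WRAPPER_TAGS = re.compile(r"</?(?:html|body)(?:\s[^>]*)?>", re.IGNORECASE)


def strip_document_wrappers(markup):
    """Remove html/body wrapper tags while leaving all other markup intact."""
    return _WRAPPER_TAGS.sub("", markup)


wrapped = ('<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" '
           'height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" '
           'width="459"></iframe></body></html></p></body></html>')
print(strip_document_wrappers(wrapped))
```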

opened by LennyLip 7

500px and bootstrap_embedly

In [3]: requests.get('http://api.embed.ly/1/oembed?url=https%3A%2F%2Fiso.500px.com%2Fguest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following%2F&maxwidth=500').json()
Out[3]: 
{u'author_name': u'DL Cade',
 u'author_url': u'https://iso.500px.com/author/dl/',
 u'description': u"One of December's talented 500px Guest Curators was photographer Joel (Julius) Tjintjelaar , and he fully embraced the real purpose of the Editors' Choice section: to unveil photos and photographers that might not have made the Popular page for one reason or another... but probably should have.",
 u'provider_name': u'500px',
 u'provider_url': u'https://iso.500px.com',
 u'thumbnail_height': 1000,
 u'thumbnail_url': u'https://isocdn.500px.org/wp-content/uploads/2014/12/julius-1500x1000.jpg',
 u'thumbnail_width': 1500,
 u'title': u'Guest Curator Joel (Julius) Tjintjelaar Reveals Three Photographers that Should Have a Larger Following',
 u'type': u'link',
 u'url': u'https://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/',
 u'version': u'1.0'}

In [4]: bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
---------------------------------------------------------------------------
ProviderNotFoundException                 Traceback (most recent call last)
<ipython-input-4-aca3a4c8cf6f> in <module>()
----> 1 bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')

/tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
     91                 self.cache.set(key, data)
     92             return data
---> 93         return fn(self, url, **params)
     94     return inner
     95 

/tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
    132         if provider:
    133             return provider.request(url, **params)
--> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
    135 
    136 

ProviderNotFoundException: Provider not found for "http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/"

opened by ad-m 7

Youtube Playlists

I'm not quite sure where the fault for this lies, but here seems a good start.

Embedding a youtube playlist using embed.ly directly works okay: http://embed.ly/code?url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLE2714DC8F2BA092D (literally an example playlist heh)

Running it through micawber with the URL https://www.youtube.com/playlist?list=PLE2714DC8F2BA092D doesn't embed anything. Using the embed URL https://www.youtube.com/embed/videoseries?list=PLE2714DC8F2BA092D results in the first video in the series being embedded, but with no playlist controls.

opened by kieranhogg 6

feature request: add media.ccc.de integration

Hi! I first reported this, mistakenly, to nikola (to add more features). I'm now reporting it here as a feature request: it would be great to integrate videos and streams from https://media.ccc.de into this library.

The service is run by the German hacker association Chaos Computer Club (CCC), which hosts annual events itself and lends streaming expertise to many external events via its Video Operation Center (VOC).

The streaming service is a valuable source of information on many different topics and I think it would be an awesome addition!

If you have pointers on where I can add it (I assume somewhere in providers.py), I might be able to do a pull request myself. I wouldn't call myself a Python expert though :-)

opened by dvzrv 5

Option to convert only single-line links

The solution in issue #29 does not fix the problem of having provider links inside of markup (or Markdown, for that matter) somewhere within the line. The resulting code is still mangled, in particular when micawber is combined with a Markdown renderer (Misaka).

In this custom filter I took parts of parsers.py, essentially removing the else: line = parse_text_full .. block, which I think should be optional.

In the linked commit I also add my own line-parser for performance reasons, but I'll soon learn to register my own provider and fix that ;-)
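
The behaviour asked for here can be sketched independently of parsers.py: convert a line only when the whole line is a single URL, and pass every other line through untouched. The function embed_standalone_links and its embed callback are hypothetical; in practice the callback might return the 'html' value of providers.request(url).

```python
import re

# A line qualifies only if it contains one URL and nothing else.
URL_ONLY_LINE = re.compile(r"^\s*(https?://\S+)\s*$")


def embed_standalone_links(text, embed):
    """Replace lines consisting solely of a URL with embed(url)."""
    converted = []
    for line in text.splitlines():
        match = URL_ONLY_LINE.match(line)
        converted.append(embed(match.group(1)) if match else line)
    return "\n".join(converted)


def fake_embed(url):
    # Stand-in for a real provider lookup, so the sketch runs offline.
    return "<embed:%s>" % url


source = "see [this](http://example.com/a) inline\nhttp://example.com/b\ndone"
print(embed_standalone_links(source, fake_embed))
# see [this](http://example.com/a) inline
# <embed:http://example.com/b>
# done
```

Links inside Markdown or HTML markup stay as they were, so a later Markdown renderer sees them unmangled.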

opened by loleg 4

bootstrap_basic raw strings / escapes

I noticed that a lot of the regular expression patterns in bootstrap_basic don't escape dots (match all). This means that a fair number of these patterns will match more than intended.

In addition, most patterns aren't marked as raw strings and therefore contain invalid escape sequences. This isn't noticeable directly, but could cause issues in a future Python version.

For an example of the latter:

python -W always -c '"https://\S*?soundcloud.com/\S+"'
<string>:1: DeprecationWarning: invalid escape sequence \S
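
Both points can be illustrated with the soundcloud pattern from the warning above. In this small sketch the strict pattern is a suggested rewrite, not what bootstrap_basic ships, and the lookalike URL is made up for the demonstration.

```python
import re

# Same pattern as in the warning, written with doubled backslashes so this
# snippet itself triggers no escape-sequence warning. The dot is unescaped.
loose = re.compile("https://\\S*?soundcloud.com/\\S+")

# Suggested form: raw string, and the dot escaped so it matches only ".".
strict = re.compile(r"https://\S*?soundcloud\.com/\S+")

genuine = "https://soundcloud.com/artist/track"
lookalike = "https://soundcloudXcom/track"  # made-up URL, not a real host

print(bool(loose.match(genuine)), bool(strict.match(genuine)))      # True True
print(bool(loose.match(lookalike)), bool(strict.match(lookalike)))  # True False
```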

opened by jaap3 4

Packaging: examples conflict with flasgger

There is a file conflict between flasgger and micawber, because both install files into the too generic path name examples. For reference, please see this Arch Linux bug.

As a solution, micawber and flasgger should either not install these examples at all, or if required into a unique directory (e.g. micawber-examples) or another system directory (e.g. on Linux: /usr/share/doc/python-micawber/examples, which is usually done by the packagers).

I will remove them for now to resolve the file conflict.

opened by dvzrv 4

performance suggestion

I'm considering migrating to micawber from a custom oembed consumer, and wanted to suggest a performance improvement that I am willing to generate a PR for.

I'd like to extend the ProviderRegistry with a secondary internal register that nests providers under domain names.

This would allow users to optionally avoid a regex match against every provider and test only the providers registered for the URL's domain.

Some light tests on a quick mockup showed lookups running in about 30% of the time, including the overhead of parsing the domain name from a URL, and in about 5% of the time if you already have the domain.

We would be using this on a high-volume indexer, so this performance matters to us.
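
A rough sketch of the idea, not the proposed PR: providers are bucketed by host name, so a lookup only runs the regexes registered for that host. DomainIndexedRegistry and its methods are hypothetical names, providers are plain strings here to keep the example self-contained, and host names are matched exactly (no subdomain handling).

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


class DomainIndexedRegistry:
    """Hypothetical registry that nests (regex, provider) pairs under a domain."""

    def __init__(self):
        self._by_domain = defaultdict(list)

    def register(self, domain, pattern, provider):
        self._by_domain[domain.lower()].append((re.compile(pattern), provider))

    def provider_for_url(self, url, domain=None):
        # Callers that already know the domain can pass it and skip URL parsing.
        if domain is None:
            domain = urlsplit(url).hostname or ""
        # .get avoids creating empty buckets for unknown domains.
        for regex, provider in self._by_domain.get(domain.lower(), ()):
            if regex.match(url):
                return provider
        return None


registry = DomainIndexedRegistry()
registry.register("www.youtube.com", r"https?://www\.youtube\.com/watch\S*", "youtube")
registry.register("vimeo.com", r"https?://vimeo\.com/\S+", "vimeo")

print(registry.provider_for_url("http://www.youtube.com/watch?v=54XHDUOHuzU"))
# youtube
print(registry.provider_for_url("https://vimeo.com/123", domain="vimeo.com"))
# vimeo
```

Patterns that span several hosts would still need a fallback to the full regex scan, which is presumably why the suggestion describes the domain index as a secondary, optional register.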

opened by jvanasco 4

Owner

Charles Leifer

GitHub http://micawber.readthedocs.org/
