Fast Python port of arc90's readability tool, updated to match the latest readability.js!



Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of arc90's readability project.


Installation is easy with pip; just run:

$ pip install readability-lxml


>>> import requests
>>> from readability import Document

>>> response = requests.get('')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="">More information...</a></p>\n</div>

Change Log

  • 0.8.1 Fixed processing of non-ASCII HTML documents via regexps.
  • 0.8 Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added Videos loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords


This code is licensed under the Apache License 2.0.

  • 0.2.4 uninstallable .egg uploaded to pypi

    The latest package isn't installable from pypi as it's a .egg. Previous versions appear to have been .zip files.

    I've always just uploaded with sdist upload. I'm not sure how this was set up.

    opened by mitechie 11
  • Differences with Goose

    Hi, can I ask what are the differences with python-goose?

    Or, said in another way, why did you decide to resurrect python-readability, instead of investing in Goose?

    It's a genuine question, I'm evaluating content extraction frameworks, and trying to decide which one to use. So far I prefer Goose, but I'm trying to understand if I missed something. Thank you in advance!

    opened by 0x0ece 9
  • Resolved problem with title.text being None

    I’ve gotten this error:

    Traceback (most recent call last):
      File "bin/fetcher", line 46, in <module>
      File "/home/ferret/html-fx/htmlfx/", line 347, in run
        blacklist, default_extractor).listen()
      File "/home/ferret/html-fx/htmlfx/", line 259, in listen
        item = self.process(feed_item)
      File "/home/ferret/html-fx/htmlfx/", line 232, in process
        'title': gist.title,
      File "/home/ferret/html-fx/src/utilofies/utilofies/", line 97, in __get__
        obj.__dict__[self.__name__] = self.func(obj)
      File "/home/ferret/html-fx/htmlfx/", line 85, in title
        return self.readability.title()
      File "/home/ferret/.buildout/eggs/readability_lxml-", line 136, in title
        return get_title(self._html(True))
      File "/home/ferret/.buildout/eggs/readability_lxml-", line 46, in get_title
        if title is None or len(title.text) == 0:
    TypeError: object of type 'NoneType' has no len()

    Do you think my change addresses it properly?

    opened by Telofy 8
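The traceback above comes from calling `len()` on a `title.text` that is `None`, which is what an empty `<title>` element reports. A minimal sketch of a None-safe check follows. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the helper name `title_text` and its empty-string fallback are assumptions made for illustration, not the fix that was actually merged.

```python
import xml.etree.ElementTree as ET


def title_text(doc):
    """Return the stripped <title> text, or "" when it is missing or empty.

    Hypothetical helper: the name and the "" fallback are illustrative only.
    """
    title = doc.find(".//title")
    # An empty element such as <title/> has text None, so test truthiness
    # instead of calling len() on it.
    if title is None or not title.text:
        return ""
    return title.text.strip()


empty = ET.fromstring("<html><head><title/></head><body/></html>")
named = ET.fromstring("<html><head><title> Example Domain </title></head></html>")
print(repr(title_text(empty)))
print(repr(title_text(named)))
```

Testing truthiness covers both the missing element and the element with no text in one condition.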
  • Crash when parsing articles with invalid link "http://["

    It seems readability crashes on links with an extra "[" in the url. Here's an example:

    <a href="http://[" title="">Raging Bull</a>

    Here's the stacktrace:

    Traceback (most recent call last):
      File "", line 14, in <module>
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\", line 399, in execute_from_command_line
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\", line 392, in execute
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\", line 242, in run_from_argv
        self.execute(*args, **options.__dict__)
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\", line 285, in execute
        output = self.handle(*args, **options)
      File "C:\Emils\Projects\presskoll\presskoll\webhook\management\commands\", line 69, in handle
        parse_article(rawarticle, overwrite=options["overwrite"], DEBUG=DEBUG)
      File "C:\Emils\Projects\presskoll\presskoll\webhook\", line 59, in parse_article
        title, body = title_and_body_from_article(rawarticle)
      File "C:\Emils\Projects\presskoll\presskoll\webhook\", line 393, in title_and_body_from_article
        doc = document._html(True)
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\readability\", line 119, in _html
        self.html = self._parse(self.input)
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\readability\", line 127, in _parse
        doc.make_links_absolute(base_href, resolve_base_href=True)
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\", line 340, in make_links_absolute
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\", line 469, in rewrite_links
        new_link = link_repl_func(link.strip())
      File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\", line 335, in link_repl
        return urljoin(base_url, href)
      File "C:\Program Files\Python27\Lib\", line 260, in urljoin
        urlparse(url, bscheme, allow_fragments)
      File "C:\Program Files\Python27\Lib\", line 142, in urlparse
        tuple = urlsplit(url, scheme, allow_fragments)
      File "C:\Program Files\Python27\Lib\", line 190, in urlsplit
        raise ValueError("Invalid IPv6 URL")
    ValueError: Invalid IPv6 URL

    I suggest that this error be caught and discarded.

    opened by EmilStenstrom 7
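The suggestion above, catching the `ValueError` and discarding the bad link, can be sketched with the standard library alone. The traceback is from Python 2.7's `urlparse` module; this sketch assumes Python 3's `urllib.parse`, which raises the same error for `"http://["`. The wrapper name `safe_urljoin` and the choice to return `None` are hypothetical, not part of lxml or readability.

```python
from urllib.parse import urljoin


def safe_urljoin(base_url, href):
    """Join href onto base_url, returning None when href cannot be parsed.

    Hypothetical wrapper for illustration; callers decide what to do with None.
    """
    try:
        return urljoin(base_url, href)
    except ValueError:
        # Raised for malformed links such as "http://[" ("Invalid IPv6 URL").
        return None


links = ["story.html", "http://["]
resolved = [safe_urljoin("http://example.com/news/", link) for link in links]
print(resolved)
```

Returning `None` keeps the decision about the broken link (drop it, keep the raw value) with the caller.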
  • No Title for most articles

    Is there a known problem where most articles on the internet come back without titles? When I try "python -m readability.readability -u " on popular news sites I don't get any headings.

    opened by gevezex 7
  • fixed encoding problem I saw when trying to parse this site: https://…

    The current version without the patch does this:

    $ python
    Python 3.8.2 (v3.8.2:7b3ab5921f, Feb 24 2020, 17:52:18)
    [Clang 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from readability import Document
    >>> import requests
    >>> r = requests.get('')
    >>> d = Document(r.content)
    >>> d.title()
    '美油又崩盘!主力合约一度暴跌42%,不到12美元!中石油紧急开会_凤凰网财经_凤凰网'
    >>> d.summary()
    error getting summary:
    Traceback (most recent call last):
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 196, in summary
        self.transform_misused_divs_into_paragraphs()
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 427, in transform_misused_divs_into_paragraphs
        str_(b''.join(map(tostring, list(elem))))):
      File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
      File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
      File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
    lxml.etree.SerialisationError: IO_ENCODER
    Traceback (most recent call last):
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 196, in summary
        self.transform_misused_divs_into_paragraphs()
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 427, in transform_misused_divs_into_paragraphs
        str_(b''.join(map(tostring, list(elem))))):
      File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
      File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
      File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
    lxml.etree.SerialisationError: IO_ENCODER

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "", line 1, in
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 237, in summary
        raise_with_traceback(Unparseable, sys.exc_info()[2], str_(e))
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/compat/", line 6, in raise_with_traceback
        raise exc_type(*args, **kwargs).with_traceback(traceback)
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 196, in summary
        self.transform_misused_divs_into_paragraphs()
      File "/Users/rosariom/pyvirtualenvs/general_env/lib/python3.8/site-packages/readability/", line 427, in transform_misused_divs_into_paragraphs
        str_(b''.join(map(tostring, list(elem))))):
      File "src/lxml/etree.pyx", line 3435, in lxml.etree.tostring
      File "src/lxml/serializer.pxi", line 139, in lxml.etree._tostring
      File "src/lxml/serializer.pxi", line 199, in lxml.etree._raiseSerialisationError
    readability.readability.Unparseable: IO_ENCODER

    with my patch it works like this:

    $ python
    Python 3.8.2 (v3.8.2:7b3ab5921f, Feb 24 2020, 17:52:18)
    [Clang 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from readability import Document
    >>> import requests
    >>> r = requests.get('')
    >>> d = Document(r.content)
    >>> d.title()
    '美油又崩盘!主力合约一度暴跌42%,不到12美元!中石油紧急开会_凤凰网财经_凤凰网'
    >>> d.summary()
    '

    中国基金报 泰勒 吴羽 赵婷

    5月合约收于-37.63美元/桶,但这恐怕还不是最致命的一天,正如瑞穗期货主管Bob Yawger指出,EIA数据并没有显示当前原油库存已经达到最大储能,这次暴跌的主要原因是交易者急于平仓。

    4月20日,中石油集团公司党组召开会议。会议指出,当前的挑战前所未有,要充分估计困难、风险和不确定性,切实增强紧迫感,扎实推进提质增效专项行动。低油价使公司大而不强矛盾凸显, “水落石出”,要按照中央要求,不失时机地推进深化改革。会议还提出,要聚焦“两利三率”指标体系,经营上灵活应对,努力把疫情和油价影响控制在最低限度。

    2、中石化党组召开会议 :做好较长时间应对外部环境变化的思想准备和工作准备

    上海期货交易所子公司上海国际能源交易中心已于2020年4月21日发布《关于同意中石油燃料油有限责任公司湛江仓储分公司增加原油期货启用库容的公告》(上能公告 [2020] 18 号)。同意中石油燃料油有限责任公司位于广东省湛江市霞山区友谊路1号湛江港二区栈桥南吹填区的原油期货指定交割仓库存放点启用库容由40万立方米增加至50万立方米,核定库容按70万立方米执行。

    “末日效应” 无需过度恐慌

    美东时间2:30,WTI 5月原油期货结算收跌55.90美元,跌幅305.97%,报-37.63美元/桶,历史上首次收于负值。WTI 6月原油期货收跌4.60美元,跌幅18.0%,刷新收盘历史低点至20.43美元/桶。

    中国基金报记者 吴羽

    油价信息服务部首席石油分析师汤姆•克洛扎(Tom Kloza)表示:“随着期货的到期以及程序交易等的出现,价格可能会波动。” “今天,这个生态系统运转异常。”

    美国能源信息署(US Energy Information Administration)估计,2019年全年平均价格为每加仑2.60美元,每加仑汽油的平均运输和营销成本约为39美分。炼油成本和利润平均增加了34美分。

    《刑法》第二百六十七条 【抢夺罪】抢夺公私财物,数额较大的,或者多次抢夺的,处三年以下有期徒刑、拘役或者管制,并处或者单处罚金;数额巨大或者有其他严重情节的,处三年以上十年以下有期徒刑,并处罚金;数额特别巨大或者有其他特别严重情节的,处十年以上有期徒刑或者无期徒刑,并处罚金或者没收财产。

    opened by rosariom 6
  • Fix #99 - Let external user to decide the option handle_failures

    This PR is a candidate fix for issue #99, an exception raised by python-readability when it mishandles an invalid IPv6 address. The key idea is to let external users of python-readability set the option handle_failures='discard' (currently not exposed to them) to work around issue #99.

    opened by johnklee 6
  • I can't extract content from this Chinese article

    When I use readability on this article I am unable to extract any content. The encoding is gb2312, but I've converted it to unicode and the summary is still empty. The html elements don't have informative ids/classes, is there any way readability could handle documents like it?

    opened by nathanathan 6
  • Unparseable: local variable 'enc' referenced before assignment

    Hi there!

    Extracting doesn’t work anymore when you predecode the strings. This looks pretty trivial though. enc could be initialized with None, unless that would cause any problems in other parts of the code.

    By the way, I would discourage the use of the old chardet library. The range of encodings it can detect is very limited and it’s slow on top. I’ve found cchardet to be a lot better, but really there is the excellent UnicodeDammit library in BeautifulSoup that first tries to extract various explicit encoding specifications and then falls back on such implicit methods. Thanks to their latest refactoring, I could even remove a number of ugly hacks I needed to use the older version.

    /home/telofy/.buildout/eggs/readability_lxml- in summary(self, html_partial)
        152             ruthless = True
        153             while True:
    --> 154                 self._html(True)
        155                 for i in self.tags(self.html, 'script', 'style'):
        156                     i.drop_tree()
    /home/telofy/.buildout/eggs/readability_lxml- in _html(self, force)
        117     def _html(self, force=False):
        118         if force or self.html is None:
    --> 119             self.html = self._parse(self.input)
        120         return self.html
    /home/telofy/.buildout/eggs/readability_lxml- in _parse(self, input)
        122     def _parse(self, input):
    --> 123         doc, self.encoding = build_doc(input)
        124         doc = html_cleaner.clean_html(doc)
        125         base_href = self.options.get('url', None)
    /home/telofy/.buildout/eggs/readability_lxml- in build_doc(page)
         15         page_unicode = page.decode(enc, 'replace')
         16     doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
    ---> 17     return doc, enc
         19 def js_re(src, pattern, flags, repl):
    Unparseable: local variable 'enc' referenced before assignment
    opened by Telofy 6
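The failure above happens because `enc` is only assigned on the branch that decodes bytes, so pre-decoded input reaches `return doc, enc` with no value bound. A minimal sketch of the suggested "initialize with None" approach follows. The function name `decode_page` and the fixed `fallback_encoding` parameter are assumptions for illustration; the real `build_doc` detects the encoding and also builds the lxml document.

```python
def decode_page(page, fallback_encoding="utf-8"):
    """Return (text, encoding); encoding is None for already-decoded input.

    Hypothetical stand-in for the decoding step of build_doc.
    """
    enc = None  # bound on every path, so the return below can never fail
    if isinstance(page, bytes):
        enc = fallback_encoding
        page = page.decode(enc, "replace")
    return page, enc


print(decode_page("héllo"))
print(decode_page("héllo".encode("utf-8")))
```

Callers can then treat an encoding of `None` as "the input was already text".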
  • Require older lxml version for OSX compatibility

    Currently, the package does not work on OSX, if installed directly via pip, because "pip install lxml" does not currently work on OSX.

    I don't know how many programmer-hours have been lost due to frustrations in installing lxml on OSX, due to OSX shipping out-dated libxml libraries. StackOverflow is littered with questions about it, and even the solutions that worked for other people didn't work for me.

    What did work for me to use this package was to simply install an older lxml. If you write a requirements.txt in the root of this package, and write the lxml dependency as e.g.


    The package will work successfully on OSX. It works up to lxml 2.3.5, and doesn't work with lxml 3.0. I'm not sure what the oldest version your package works with is; I haven't tested that, sorry.

    P.S Thanks for your work on this package. If this wasn't here, I'd have probably written it myself.

    P.P.S. After spending hours on this stupid, stupid problem, let me just say: **** developing on OSX sucks.

    opened by syllog1sm 6
  • Makeover for the README

    • [x] move to ~~markdown~~ restructured text
    • [x] bring a simple, working example

    View it here.

    I chopped out some stuff, shout and I'll add what you want back in.

    opened by decentral1se 5
  • isProbablyReaderable


    How difficult would it be to implement isProbablyReaderable(doc, options) (from

    This would make it possible to check whether a webpage is actually interesting or relevant for scraping, and so save time.

    Would this be hard to implement? I could also try working on it.

    opened by Uzay-G 3
  • Problems with

    Take this page, for example:

    • doc.summary() returns only the main text, the first 3 paragraphs, but completely skips the SELECTED READING section.

    Or, take this page:

    • here, on the contrary, doc.summary() returns only the SELECTED READING section, but skips the SPECIAL SECTION :)

    Would be great to find some solution.

    opened by 097115 0
  • <p> wrongly inserted before <i> or <b>

    When parsing a simple text such as " my emphasis sentence", Document.summary() inserts a paragraph <p> before the opening <i> or <b>.

    This seems to happen mostly when the text is not already in a <p> but in a


    Example :

    opened by ploum 0
  • Does not handle github pages

    Document.summary() of github pages is always:

    "You can’t perform that action at this time"

    This doesn’t happen with other forges (GitLab, Gitea, …)

    opened by ploum 0
  • .text may guess the encoding incorrectly

    Steps to reproduce:

    import requests
    from readability import Document
    response = requests.get('')

    However, if we use .content:


    everything will be just fine.

    Maybe updating README.rst is worth a shot :)

    opened by 097115 4
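The `.text` versus `.content` difference reported above can be illustrated without a network call. The sketch below assumes a page that declares `gb2312` in a `<meta>` tag while the HTTP headers carry no charset, a case where requests commonly falls back to ISO-8859-1 for `.text`. The sample HTML and variable names are invented for illustration.

```python
declared = "gb2312"
# Raw bytes as a server would send them; this is what response.content holds.
html_bytes = (
    '<html><head><meta charset="gb2312"><title>中文标题</title></head></html>'
).encode(declared)

# Roughly what a wrong guess produces: the same bytes read with the wrong codec.
wrong_guess = html_bytes.decode("iso-8859-1")
# What the page author intended, using the charset the page itself declares.
as_declared = html_bytes.decode(declared)

print("中文标题" in wrong_guess)
print("中文标题" in as_declared)
```

Passing the raw bytes, as in `Document(response.content)`, leaves the encoding decision to the parser, which can see the declaration inside the page.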
  • Error when using positive_keywords (or negative_keywords) argument with python >= 3.7

    Got the following error:

      File "/usr/local/lib/python3.8/site-packages/readability/", line 138, in __init__
        self.positive_keywords = compile_pattern(positive_keywords)
      File "/usr/local/lib/python3.8/site-packages/readability/", line 80, in compile_pattern
        elif isinstance(elements, re._pattern_type):
    AttributeError: module 're' has no attribute '_pattern_type'

    Looks like re._pattern_type has been removed in 3.7 but there's an easy fix:

    opened by nbtravis 1
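The fix mentioned in the last report is not shown. One possible approach is to derive the compiled-pattern type from `re.compile` itself instead of relying on the private `re._pattern_type`. The sketch below is a simplified reimplementation written for illustration: only the name `compile_pattern` comes from the traceback, and the accepted input forms (comma-separated string, list, or tuple) are assumptions.

```python
import re

# Works on any Python 3 version without touching private attributes.
PATTERN_TYPE = type(re.compile(""))


def compile_pattern(elements):
    """Turn keywords into one compiled alternation, or None if there are none."""
    if not elements:
        return None
    if isinstance(elements, PATTERN_TYPE):
        return elements  # already compiled, pass through unchanged
    if isinstance(elements, str):
        elements = elements.split(",")  # assumed comma-separated keyword string
    if isinstance(elements, (list, tuple)):
        return re.compile("|".join(re.escape(x.strip()) for x in elements))
    raise TypeError("unsupported keyword type: %r" % type(elements))


pattern = compile_pattern("news, article")
print(bool(pattern.search("main-article")))
```

On Python 3.7 and later the same type is also exposed publicly as `re.Pattern`.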