Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
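
As a rough illustration of working with that default tree, the sketch below looks up the paragraph from the previous example. It assumes the parser's default behaviour of placing HTML elements in the XHTML namespace, so tag names carry a namespace prefix; the XHTML constant is just a local convenience name.

```python
import html5lib

# Parse a fragment-like string; html5lib fills in <html>, <head> and <body>.
document = html5lib.parse("<p>Hello World!")

# Assumption: elements are namespaced by default, so searches need the prefix.
XHTML = "{http://www.w3.org/1999/xhtml}"
paragraph = document.find(f".//{XHTML}p")

print(document.tag)
print(paragraph.text)
```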

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with urllib2 (Python 2), pass the charset from the HTTP headers to html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib

with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), pass the charset from the HTTP headers to html5lib as follows:

from urllib.request import urlopen
import html5lib

with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())
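
On Python 3 the object returned by f.info() is an email-style message, and get_content_charset() reads the charset parameter of its Content-Type header. The following sketch reproduces that lookup with the standard library alone, without a network request; the helper name charset_from_content_type is hypothetical and not part of html5lib.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset, the result is None, and passing transport_encoding=None leaves html5lib to determine the encoding by other means.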

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
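
A small sketch contrasting the two modes, assuming the ParseError exception exposed by html5lib.html5parser and the errors list kept on the parser object. The input has no doctype, which counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, so at least one parse error is reported

# Default (lenient) mode: errors are collected on the parser, not raised.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
error_count = len(lenient.errors)

# Strict mode: the first parse error raises ParseError.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(markup)
    raised = False
except ParseError:
    raised = True

print(error_count, raised)
```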

When you're instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
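
To show what the "dom" treebuilder hands back, here is a brief sketch that reads the paragraph text through the standard xml.dom.minidom API. It assumes the result is a minidom Document, as in the example above.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# getElementsByTagName is the regular minidom lookup method.
paragraphs = minidom_document.getElementsByTagName("p")
text = paragraphs[0].firstChild.data

print(len(paragraphs), text)
```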

More documentation is available at https://html5lib.readthedocs.io/.

Installation

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

$ pip install html5lib

The goal is to support a (non-strict) superset of the versions that pip supports.

Optional Dependencies

The following third-party libraries may be used for additional functionality:

  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);
  • genshi has a treewalker (but not builder); and
  • chardet can be used as a fallback when character encoding cannot be determined.
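
One way to check which of these optional libraries are importable in the current environment is sketched below. It uses only the standard library; the function name available_optional_dependencies is hypothetical and not something html5lib provides.

```python
from importlib.util import find_spec

# Top-level module names of the optional dependencies listed above.
OPTIONAL_MODULES = ("lxml", "genshi", "chardet")


def available_optional_dependencies(names=OPTIONAL_MODULES):
    """Map each module name to True if it can be imported, else False."""
    return {name: find_spec(name) is not None for name in names}


print(available_optional_dependencies())
```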

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the pytest and mock libraries and can be run using the py.test command in the root directory.

Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

There's a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.

Issues
  • fix version numbering

    I just spent 4 hours with another engineer trying to figure out why our builds weren't producing the same results.

    Turns out I had html5lib "0.9999999" and he had "0.999999999" which was nearly impossible to spot on the pip list readout. >.<

    http://semver.org/#why-use-semantic-versioning

    I'm going to guess that you'll say html5lib isn't ready for production so your versioning scheme is appropriate, but I'm still filing this bug because I'm steamed about losing so much development time. :/

    opened by phillyc 52
  • Do not directly use isolated surrogates in unicode literals

    Jython does not support isolated surrogates in unicode, including in unicode literals. This has been reported in https://github.com/html5lib/html5lib-python/issues/2. This bug is critical for Jython because html5lib is a vendored library for pip, and this is blocking pip from running on Jython.

    For platforms besides Jython, this pull request allows for these surrogates to be constructed in literals, but through an additional step of indirection. For Jython itself, Jython's normal decode of literals will ensure that such invalid unicode strings cannot be constructed from any source.

    To run this on Jython:

    1. Install https://bitbucket.org/jimbaker/jython-socket-reboot, following these instructions: https://wiki.python.org/jython/JythonDeveloperGuide
    2. Use this branch of pip to install nosetests, etc.: https://github.com/jimbaker/pip Note that tox is not yet supported - because we need to get pip running first! :)

    Note that in the dev build, you will find executables in dist/bin, such as dist/bin/jython or dist/bin/pip

    The jython-socket-reboot branch is nearly complete for merging into Jython; it is a major component of Jython 2.7.0 beta 3. (I'm a core dev of Jython.)

    opened by jimbaker 28
  • New release on PyPI?

    Since this library is vendored by pip, which is itself vendored in CPython's ensurepip module, the fact that there's no release that includes PR #403 is blocking the removal of the deprecated abstract base classes from the collections module, long advertised as happening in Python 3.8, see https://github.com/python/cpython/pull/10596

    Normally I am the last person to ask for releases (I assume maintainers have good reasons for not doing so, and I usually don't like being on the receiving end of such requests), but the deadline for the 3.8 beta release is coming up very soon, so if there's any way to expedite an html5lib release it would really help that effort, and it would avoid stickier solutions like patching html5lib directly in pip.

    opened by pganssle 27
  • Add code coverage reports

    opened by gsnedders 18
  • Update Travis to use tox, and add Appveyor CI

    Update tox.ini to utilise requirements-test.txt, and run pylint. Require lxml 3.6.0 on Windows as it has wheels available for 2.6-3.4.

    Also enables coverage for PyPy on Travis.

    opened by jayvdb 17
  • WIP More general fix for #127 with addinfourl

    Do not merge yet, this is incomplete.

    As discussed in #134 I changed this to avoid .read(0) entirely and pass a first chunk to HTMLUnicodeInputStream and HTMLBinaryInputStream, but then I get lost in their implementation and I don’t know what to do with that first chunk.

    Surprisingly, most tests still passed because .seek(0) is used in some places. I added a failing test with a non-seekable input.

    opened by SimonSapin 14
  • How could html5lib be made faster?

    html5lib is nice, but it's pretty slow. On a fairly large test file, lxml took 50ms and html5lib took 5 seconds, which is 100 times slower.

    Are there any particularly slow parts of html5lib that could be optimized? Would compiling it with Cython help?

    opened by tbodt 13
  • Implement inhead-noscript context, add script parameter to parse

    48 failing tests now pass. #script-on and #script-off parameters are now considered in testing.

    opened by neumond 11
  • test not running when building on opensuse build server

    Hi

    I'm trying to package html5lib for openSUSE, but I'm running into the following error when trying to run the tests:

    [ 113s] File "/home/abuild/rpmbuild/BUILD/html5lib-0.999999999/html5lib/tests/tokenizertotree.py", line 10, in <module>
    [ 113s] from . import test_tokenizer
    [ 113s] ImportError: cannot import name 'test_tokenizer'

    I can't see that file in the tar-ball or in git over here. Building from the tar-ball on pypi using python 3.5.1.

    opened by arunpersaud 11
  • Re-enable codecov

    This also removes dependence on ./requirements-install.sh and flake8-run.sh

    opened by jayvdb 11
  • Drop support for EOL Python 2.7 and 3.5

    The readme says:

    The goal is to support a (non-strict) superset of the versions that pip supports.

    https://github.com/html5lib/html5lib-python#installation

    pip supports:

    CPython 3.6, 3.7, 3.8, 3.9 and latest PyPy3.

    https://pip.pypa.io/en/stable/installation/#compatibility

    Therefore this PR drops support for EOL 2.7 and 3.5 and upgrades to modern syntax using pyupgrade. There will be more cleanup possible, but I didn't want this PR to get too big.

    Similarly, if you'd prefer a smaller PR, like dropping support and removing six, but keeping the older syntax for another PR, just let me know!

    opened by hugovk 0
  • Google Group: Access Error

    At the bottom of the README there is a link to: http://groups.google.com/group/html5lib-discuss

    But this page gives me an access error.

    opened by guettli 0
  • Add position information for text nodes

    Would it be possible to add position information, i.e. line+column to text nodes? Or, at least make this information available to the tree builder? I implemented a very minimal proof of concept to add the information to each token and pass that along to the dom tree builder and obtain the following result:

    import html5lib
    
    html = '<div>&amp;<p>b<span>c</span></p> cab</div>'
    
    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    
    doc = parser.parse(html)
    def parse(n):
        for c in n.childNodes:
            if hasattr(c, 'sourcepos'):
                print(c.sourcepos, c)
            parse(c)
    
    parse(doc)
    
    None <DOM Element: head at 0x10bbed0d0>
    None <DOM Element: body at 0x10bbed1f0>
    (1, 5) <DOM Element: div at 0x10bbfb790>
    (1, 10) <DOM Text node "'&'">
    (1, 13) <DOM Element: p at 0x10bbfb820>
    (1, 14) <DOM Text node "'b'">
    (1, 20) <DOM Element: span at 0x10bbfb8b0>
    (1, 21) <DOM Text node "'c'">
    (1, 33) <DOM Text node "' '">
    (1, 36) <DOM Text node "'cab'">
    

    I would be willing to implement it.

    opened by corynezin 0
  • consider making html5lib.tokenizer public

    Hello,

    In version https://github.com/html5lib/html5lib-python/releases/tag/0.999999999, html5lib.tokenizer was made private.

    The wpull project (https://github.com/ArchiveTeam/wpull) uses this library. If we ever migrated to the 1.x versions, this would hurt the application: instead of just tokenizing a web page (see https://github.com/ArchiveTeam/wpull/blob/a4ff4a93f613ce18ad3c515aa3d4f5848a88b98c/wpull/document/htmlparse/html5lib_.py), we would have to use full tree parsing, which is slower and uses more RAM.

    Is there any reason this was made private when the 1.x branch was released?

    opened by mgrandi 0
  • 1.1: test suite is failing

    + /usr/bin/python3 -m pytest -k 'not test_encoding'
    =========================================================================== test session starts ============================================================================
    platform linux -- Python 3.8.3, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
    rootdir: /home/tkloczko/rpmbuild/BUILD/html5lib-1.1, configfile: pytest.ini
    plugins: flaky-3.6.1, forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, cov-2.10.1, xdist-2.2.0, asyncio-0.14.0, hypothesis-6.0.2, expect-1.1.0
    collected 0 items / 1 error
    
    ================================================================================== ERRORS ==================================================================================
    ______________________________________________________________________ ERROR collecting test session _______________________________________________________________________
    Direct construction of TokenizerFile has been deprecated, please use TokenizerFile.from_parent.
    See https://docs.pytest.org/en/stable/deprecations.html#node-construction-changed-to-node-from-parent for more details.
    ============================================================================= warnings summary =============================================================================
    ../../../../../usr/lib/python3.8/site-packages/_pytest/config/__init__.py:1183
      /usr/lib/python3.8/site-packages/_pytest/config/__init__.py:1183: PytestDeprecationWarning: The --strict option is deprecated, use --strict-markers instead.
        self.issue_config_time_warning(
    
    -- Docs: https://docs.pytest.org/en/stable/warnings.html
    ========================================================================= short test summary info ==========================================================================
    ERROR
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ======================================================================= 1 warning, 1 error in 0.25s ========================================================================
    
    opened by kloczek 6
  • AttributeError: module 'html5lib.treebuilders' has no attribute '_base' (Python3.9)

    I'm not a programmer in any sense of the word, so I apologize up front if this is out of place. That said, I was trying to resurrect a project from another researcher by switching it from Python 2.7 to Python 3.9. There was lots of cleanup, but I stumbled upon this error when executing MyProject:

    Traceback (most recent call last):
      File "/home/user/MyProject.py", line 25, in <module>
        import argparse,requests,sys,os,threading,bs4,warnings,random
      File "/usr/local/lib/python3.9/dist-packages/bs4/__init__.py", line 30, in <module>
        from .builder import builder_registry, ParserRejectedMarkup
      File "/usr/local/lib/python3.9/dist-packages/bs4/builder/__init__.py", line 314, in <module>
        from . import _html5lib
      File "/usr/local/lib/python3.9/dist-packages/bs4/builder/_html5lib.py", line 70, in <module>
        class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
    AttributeError: module 'html5lib.treebuilders' has no attribute '_base'

    The project's libraries include:

    • bs4 (BeautifulSoup4 version 4.4.1)
    • html5lib (version 1.1)

    First, I tried removing and reinstalling python3-bs4 and python3-html5lib, but it didn't resolve the issue. However, I had two successes in resolving the error, but I'm not fully sure what kind of impact this may introduce to other applications on my system in the future:

    Resolution 1: I modified "_base" to "base" throughout file "/usr/local/lib/python3.9/dist-packages/bs4/builder/_html5lib.py"

    Resolution 2: Downgraded the html5lib package to version 1.0b8 (python3.9 -m pip install --upgrade html5lib==1.0b8)

    opened by bindnera 2
  • Use Python built-in str.lower in preference to asciiUpper2Lower character table translation

    This change uses Python's built-in str.lower to perform string lowercasing (typically used during case-insensitive element/attribute name comparison), and provides a small parsing performance benefit on CPython 3.9.1.

    Benchmark command used: python benchmarks/bench_html.py parse --rigorous

    Before

    .........................................
    html_parse_etree: Mean +- std dev: 208 ms +- 14 ms
    

    After

    .........................................
    html_parse_etree: Mean +- std dev: 200 ms +- 12 ms
    

    This follows on from some conversation in #520 regarding comparison operations and lowercasing.

    opened by jayaddison 0
  • Move away from Travis CI

    Probably to GitHub Actions?

    While we're rewriting the config, we may well want to migrate to cibuildwheel at the same time, especially before #445 happens (when we'll then need the infra to build non-universal wheels for releases cross-platform).

    opened by gsnedders 2
  • Compile html5lib with Cython

    Use Cython to make the parser quicker; see #445. This builds on top of #272.

    This is a long way from ready to land, but shows potential. We probably also want to split this up so many of the earlier changes land first if they make performance sense without Cython.

    The change to attribute representation especially might be of interest to #521 (cc @jayaddison).

    There's also some API changes towards the end of the branch which we may well want to delay landing even beyond the rest of the Cython stuff.

    opened by gsnedders 2
  • Tree Construction: do not use phase-specific handling for implied end-tag tokens emitted during InBody phase

    The InBodyPhase parser phase creates and processes implied close-tag tokens for a small number of stopNames elements.

    This changeset updates the phase's logic so that processing of these implied tokens remains within the InBody parser phase.

    This fixes #111 by avoiding a call to the phase-specific InTablePhase.endTagOther to handle the implied end-tag, which as a side-effect disables the insertFromTable flag. That flag must remain True in order for repositioning of fostered elements relative to a <table> tag to be performed correctly.

    opened by jayaddison 0
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Mozilla 2.2k Oct 20, 2021
The awesome document factory

The Awesome Document Factory WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous s

Kozea 4.6k Oct 22, 2021
A jquery-like library for python

pyquery: a jquery-like library for python pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jq

Gael Pasgrimaud 2k Oct 22, 2021
The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

null 2k Oct 26, 2021
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

null 1.8k Oct 22, 2021
Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": >>> print(json.dumps(xmltod

Martín Blech 4.6k Oct 23, 2021
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

The Pallets Projects 427 Oct 11, 2021
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 463 Oct 16, 2021
Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Christian Stefanescu 527 Oct 19, 2021