The lxml XML toolkit for Python


What is lxml?

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory friendly, just so you know.

For an introduction and further documentation, see doc/main.txt.

For installation information, see INSTALL.txt.

For the issue tracker, see https://bugs.launchpad.net/lxml

Support the project

lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.

Most people who use lxml do so because they like using it. You can show us that you like it by blogging about your experience with it and linking to the project website.

If you are using lxml for your work and feel like giving a bit of your own benefit back to support the project, consider sending us money through GitHub Sponsors, Tidelift or PayPal that we can use to buy us free time for the maintenance of this great library, to fix bugs in the software, review and integrate code contributions, to improve its features and documentation, or to just take a deep breath and have a cup of tea every once in a while. Please read the Legal Notice below, at the bottom of this page. Thank you for your support.

Support lxml through GitHub Sponsors, via a Tidelift subscription, or via PayPal (Donate to the lxml project).

Please contact Stefan Behnel for other ways to support the lxml project, as well as commercial consulting, customisations and trainings on lxml and fast Python XML processing.

Travis-CI and AppVeyor support the lxml project with their build and CI servers. Jetbrains supports the lxml project by donating free licenses of their PyCharm IDE. Another supporter of the lxml project is COLOGNE Webdesign.

Project income report

  • Total project income in 2019: EUR 717.52 (59.79 € / month)
    • Tidelift: EUR 360.30
    • PayPal: EUR 157.22
    • other: EUR 200.00

Legal Notice for Donations

Any donation that you make to the lxml project is voluntary and is not a fee for any services, goods, or advantages. By making a donation to the lxml project, you acknowledge that we have the right to use the money you donate in any lawful way and for any lawful purpose we see fit and we are not obligated to disclose the way and purpose to any party unless required by applicable law. Although lxml is free software, to the best of our knowledge the lxml project does not have any tax exempt status. The lxml project is neither a registered non-profit corporation nor a registered charity in any country. Your donation may or may not be tax-deductible; please consult your tax advisor in this matter. We will not publish or disclose your name and/or e-mail address without your consent, unless required by applicable law. Your donation is non-refundable.

Issues
  • Introduce a multi os travis build that builds OSX wheels

    Improvements could be made to build the manylinux wheels as well, but that is a big change on its own. When a tag is created, the wheel will be pushed as a GitHub release.

    This is unlikely to be ready to merge in its current state, but I think it's at a point where some feedback would be helpful.

    Example build Example "release"

    Here is me sanity testing the 2.7 wheel, similar to how the Makefile does it:

    (venv) ~/Downloads $ pip install lxml-3.7.3-cp27-cp27m-macosx_10_11_x86_64.whl
    Processing ./lxml-3.7.3-cp27-cp27m-macosx_10_11_x86_64.whl
    Installing collected packages: lxml
    Successfully installed lxml-3.7.3
    (venv) ~/Downloads $ python
    Python 2.7.13 (default, Apr  5 2017, 22:17:22)
    [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.etree
    >>> import lxml.objectify
    >>>
    

    And the 2.6 wheel:

    (venv) ~/Downloads $ pip install lxml-3.7.3-cp26-cp26m-macosx_10_11_x86_64.whl
    DEPRECATION: Python 2.6 is no longer supported by the Python core team, please upgrade your Python. A future version of pip will drop support for Python 2.6
    Processing ./lxml-3.7.3-cp26-cp26m-macosx_10_11_x86_64.whl
    Installing collected packages: lxml
    Successfully installed lxml-3.7.3
    (venv) ~/Downloads $ python
    Python 2.6.9 (unknown, Apr 30 2017, 08:22:20)
    [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.etree
    >>> import lxml.objectify
    >>>
    

    Some notes:

    • I don't bother building all the Python 3 builds for OSX, as they would all fail the same way as the allowed_failure build that I do run. I filed a bug here; I can help with that bug but am not sure what to do next on it.
    • The Linux wheels that get built are likely not useful, as they are not manylinux builds. I experimented with building manylinux wheels here, and while I think it's possible, I abandoned it (for now) when I hit an FTP error I had no idea how to deal with. Here is a failing build and the abandoned branch.
    opened by Bachmann1234 31
  • Add support for ucrt binaries on Windows

    Hi,

    This PR is a first big step towards resolving #1326096. I went through the pain of recompiling libiconv, libxml2 and libxslt with Visual Studio 2015/ucrt to get binaries that can be used to build a Python 3.5 wheel.

    This PR makes sure that the ucrt binaries are downloaded when we are on Python 3.5. I documented the actual compilation of the binaries in a reproducible manner at https://github.com/mhils/libxml2-win-binaries. After merging this PR and installing the Visual C++ Build Tools (or Visual Studio), a Python 3.5 x86 Windows wheel can be built as follows:

    > "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat"
    > python3 setup.py bdist_wheel --static-deps
    

    I successfully tested the resulting wheel on a clean Win7 VM, where it worked fine.

    If I could ask for a favor: it would be great if you could upload a Python 3.5 Windows wheel to PyPI as soon as possible (feel free to take the wheel linked above or compile your own). We're currently migrating @mitmproxy to Python 3, and lxml is the only dependency that still holds back a pip3 install mitmproxy.

    Thanks! Max

    opened by mhils 19
  • repair attribute mis-interpretation in ElementTreeContentHandler

    Regarding https://bugs.launchpad.net/lxml/+bug/1136509, this is a proposed fix for the issue.

    The first part of the fix just rewrites the attributes in startElement to have keys of the form (namespace, key).

    At first, I set the namespace to None, but I had a problem with that. It appears that even namespaced attributes, like the "xmlns:xsi" in my test document, are also passed to startElement, I suppose because the owning tag doesn't have a namespace. So in this case I'm splitting on the colon and passing the two tokens to startElementNS, but I'm not sure whether this approach is correct. In any case, I added two tests; if you can show what should happen in the tests, that would at least make the correct behavior apparent here.
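
A minimal sketch of the key rewriting described above, in plain Python and independent of lxml. The helper name `to_ns_keys` is hypothetical, and using the part before the colon as the first element of the key is only one reading of "passing in the two tokens"; the actual patch may resolve prefixes differently.

```python
def to_ns_keys(attrs):
    """Rewrite plain SAX attribute names into (namespace, name) keys.

    Hypothetical helper: unprefixed names get None as their namespace,
    prefixed names are split on the first colon into two tokens.
    """
    rewritten = {}
    for qname, value in attrs.items():
        prefix, colon, local = qname.partition(":")
        if colon:
            rewritten[(prefix, local)] = value
        else:
            rewritten[(None, qname)] = value
    return rewritten


attrs = {"id": "a1", "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"}
print(to_ns_keys(attrs))
```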

    opened by zzzeek 13
  • Removed the PyPy special cases (for PyPy 4.1)

    PyPy trunk (and the future PyPy 4.1) now contains https://bitbucket.org/pypy/pypy/commits/3144c72295ae, which improves cpyext compatibility. It removes the need for these few hacks (which never fully worked, as discussed on pypy-dev).

    opened by arigo 13
  • Do not blindly copy all of the namespaces when tostring():ing a subtree.

    When using a subtree of a document, do not simply copy all of the namespaces from all of the parents down. Only copy those that are actually used within the subtree, because copying all namespaces bloats the subtree with information it should not have.

    This might seem harmless in the average case, but it causes problems when serializing the XML, specifically with C14N serialization, which according to the specification retains all ns declarations on the root-level element. So if this tostring() call inserts all parent namespace declarations into the new root element, we unnecessarily bloat the ns declarations on this new top-level element.

    Having said this, I am not confident this is the best code for doing this; feel free to point me in the direction of better code.
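
To illustrate what "namespaces actually used within the subtree" means, here is a rough sketch built on the standard library's xml.etree.ElementTree rather than lxml, so it does not show the patch itself. The function name `used_namespaces` and the sample document are made up for this example.

```python
import xml.etree.ElementTree as ET


def used_namespaces(elem):
    """Collect the namespace URIs of all tags and attributes below elem."""
    uris = set()
    for node in elem.iter():
        for name in [node.tag, *node.attrib]:
            # ElementTree stores qualified names as "{uri}local".
            if isinstance(name, str) and name.startswith("{"):
                uris.add(name[1:].split("}", 1)[0])
    return uris


doc = ET.fromstring(
    '<root xmlns:a="urn:a" xmlns:b="urn:b" xmlns:c="urn:c">'
    "<a:part><b:item/></a:part><c:other/></root>"
)
part = doc[0]  # the <a:part> subtree
print(sorted(used_namespaces(part)))
```

Here the root declares three namespaces, but the `a:part` subtree only needs two of them; a serializer that declares only those avoids carrying `urn:c` into the extracted fragment.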

    opened by Pelleplutt 11
  • Improve detection of the libxml2 and libxslt libraries

    This patch improves detection of the libxml2 and libxslt libraries by cleaning up some of the overly-complex build system.

    The patch also improves support for using pkg-config if available.

    opened by hughmcmaster 10
  • Replace ftplib with urllib to pick up ftp_proxy when building lxml with STATIC_DEPS=true

    I'm not sure whether this pull request goes in the right direction, but I've made a small change to overcome the following problem. It would be great if you could merge this PR or provide feedback on it. Thanks in advance.

    Problem

    I want to set up CI for a Python project that uses lxml and other modules. To keep preparation simple in CI, all dependencies are listed in requirements.txt and are installed with pip install -r requirements.txt. Because the project targets Windows, I've also set STATIC_DEPS=true in the CI job.

    However, the lxml build fails on CI servers because they have no direct access to the Internet. Other modules can be installed through a proxy server (via the http_proxy and https_proxy environment variables), but lxml cannot, because it uses ftplib, which ignores the ftp_proxy environment variable.

    Downloading lxml-3.5.0.tar.gz (3.8MB)
    Downloading from URL https://pypi.python.org/packages/source/l/lxml/lxml-3.5.0.tar.gz#md5=9f0c5f1eb43ff44d5455dab4b4efbe73 (from https://pypi.python.org/simple/lxml/)
    Running setup.py (path:C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setup.py) egg_info for package lxml
      Running command python setup.py egg_info
      Building lxml version 3.5.0.
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setup.py", line 233, in <module>
          **setup_extra_options()
        File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setup.py", line 144, in setup_extra_options
          STATIC_CFLAGS, STATIC_BINARIES)
        File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setupinfo.py", line 55, in ext_modules
          OPTION_DOWNLOAD_DIR, static_include_dirs, static_library_dirs)
        File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\buildlibxml.py", line 86, in get_prebuilt_libxml2xslt
          libs = download_and_extract_zlatkovic_binaries(download_dir)
        File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\buildlibxml.py", line 34, in download_and_extract_zlatkovic_binaries
          for fn in ftp_listdir(url):
        File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\buildlibxml.py", line 106, in ftp_listdir
          server = ftplib.FTP(netloc)
        File "C:\Python34\lib\ftplib.py", line 118, in __init__
          self.connect(host)
        File "C:\Python34\lib\ftplib.py", line 153, in connect
          source_address=self.source_address)
        File "C:\Python34\lib\socket.py", line 512, in create_connection
          raise err
        File "C:\Python34\lib\socket.py", line 503, in create_connection
          sock.connect(sa)
      TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    

    Notes

    • I tested this change with Python 2.7.10 and 3.4.3, with and without a Squid proxy server.
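
The difference the PR relies on can be seen in a small sketch: urllib's proxy lookup reads the `*_proxy` environment variables, whereas `ftplib.FTP(host)` connects to the host directly. The proxy address below is a placeholder, and the snippet only inspects the lookup; it does not open a connection.

```python
import os
import urllib.request

# Placeholder proxy address, set here only for the demonstration.
os.environ["ftp_proxy"] = "http://proxy.example.invalid:3128"

# urllib collects proxy settings per URL scheme from the environment.
proxies = urllib.request.getproxies()
print(proxies.get("ftp"))
```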
    opened by sakurai-youhei 10
  • Adds a `smart_prefix` option to XPath evaluations to overcome a counter-intuitive design flaw

    Namespaces are one honking great idea -- let's do more of those!

    Using XPath to locate elements is quite cumbersome when it comes to documents that have a default namespace:

    >>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
    >>> root.nsmap
    {'x': 'http://www.w3.org/2000/svg', None: 'http://www.tei-c.org/ns/1.0'}
    >>> root.xpath('./text/body')
    []
    >>> root.xpath('./text/body', namespaces=root.nsmap)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
        root.xpath('./text/body', namespaces=root.nsmap)
      File "src/lxml/lxml.etree.pyx", line 1584, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:59349)
        evaluator = XPathElementEvaluator(self, namespaces=namespaces,
      File "src/lxml/xpath.pxi", line 261, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:170589)
      File "src/lxml/xpath.pxi", line 133, in lxml.etree._XPathEvaluatorBase.__init__ (src/lxml/lxml.etree.c:168702)
      File "src/lxml/xpath.pxi", line 57, in lxml.etree._XPathContext.__init__ (src/lxml/lxml.etree.c:167658)
        _BaseContext.__init__(self, namespaces, extensions, error_log, enable_regexp,
      File "src/lxml/extensions.pxi", line 84, in lxml.etree._BaseContext.__init__ (src/lxml/lxml.etree.c:156529)
        if namespaces:
    TypeError: empty namespace prefix is not supported in XPath
    

    This is a well-documented issue (also here) and is commonly solved by manipulating the namespace mapping with an ad-hoc prefix (which loses the information about what the default namespace was, unless it is preserved) and adding that prefix to XPath expressions. (Another hack; the stdlib is affected as well, with some insights.)

    But this solution doesn't play well with generalised code like adapter classes, where it becomes tedious and error-prone because XPath expressions are not always identical (did I mention they are counter-intuitive to type?), and keeping track of namespace mappings across loosely coupled code elements introduces boilerplate.

    Ultimately, the interplay of document namespaces and XPath expressions is anything but Pythonic and complicated rather than complex, though
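
The ad-hoc prefix workaround can be sketched as follows. To keep the example runnable without lxml, it uses the standard library's xml.etree.ElementTree and its limited path syntax; the same prefix-to-URI mapping is what one would pass as `namespaces=` to lxml's `xpath()`. The helper `with_adhoc_prefix` and the prefix `d` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:x="http://www.w3.org/2000/svg">'
    "<text><body><x:svg /></body></text></TEI>"
)


def with_adhoc_prefix(nsmap, prefix="d"):
    """Replace the None key (default namespace) with an ad-hoc prefix.

    Hypothetical helper; it assumes the prefix is not already in use.
    """
    return {(prefix if key is None else key): uri for key, uri in nsmap.items()}


# A mapping shaped like the root.nsmap shown in the session above.
nsmap = {None: "http://www.tei-c.org/ns/1.0", "x": "http://www.w3.org/2000/svg"}
namespaces = with_adhoc_prefix(nsmap)

root = ET.fromstring(DOC)
unprefixed = root.findall("./text/body")              # matches nothing
prefixed = root.findall("./d:text/d:body", namespaces)
print(unprefixed, [el.tag for el in prefixed])
```

Note that every step of the path has to carry the prefix, which is the tedium the proposal wants to hide.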

    There should be one-- and preferably only one --obvious way to do it.

    The root of this issue is a flaw in the XPath 1.0 spec, which libxml2 follows in its implementation:

    A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.

    XML namespaces, by contrast, do have a notion of an unaliased default namespace:

    If the attribute name matches DefaultAttName, then the namespace name in the attribute value is that of the default namespace in the scope of the element to which the declaration is attached.

    XPath 2.0 did eventually fix this:

    A QName in a name test is resolved into an expanded QName using the statically known namespaces in the expression context. It is a static error [err:XPST0081] if the QName has a prefix that does not correspond to any statically known namespace. An unprefixed QName, when used as a name test on an axis whose principal node kind is element, has the namespace URI of the default element/type namespace in the expression context; otherwise, it has no namespace URI.

    There's no XPath 2.0 implementation with Python bindings around (well, there is one for XQilla that returns raw strings and is far from lxml's capabilities), and it is very unlikely that one will be implemented, as the extension as a whole is large and probably no one needs it outside the XQuery/XSLT scene. libxml2 didn't intend to ten years ago, but hey, looking for a thesis to write?

    Thus I propose to backport that bug fix from XPath 2.0 to lxml's XPath interfaces as an opt-in smart_prefix option, without adopting the whole standard, as

    practicality beats purity.

    Behind the scenes the ad-hoc prefix 'solution' described above is applied, but completely hidden from the client code.
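
As a rough idea of what such a hidden rewrite could look like, the sketch below prefixes the unprefixed name tests of a simple location path. It is not the PR's implementation: the function name `add_default_prefix` and the prefix `d` are invented, and the approach deliberately ignores predicates, axes and function calls, which is exactly where the description says things get hackish.

```python
import re

# A step that is a bare element name: no prefix, no "@", no "()", no "*".
_BARE_NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def add_default_prefix(path, prefix="d"):
    """Prefix every bare name step in a simple, predicate-free path."""
    return "/".join(
        f"{prefix}:{step}" if _BARE_NAME.fullmatch(step) else step
        for step in path.split("/")
    )


print(add_default_prefix("./text/body"))
print(add_default_prefix("//body/x:svg"))
```

The rewritten expression would then be evaluated with a namespace mapping in which the same prefix is bound to the document's default namespace.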

    This pull request demonstrates the design and isn't completed yet, at least these issues still need to be addressed:

    • documentation
    • predicates are handled in a rather hackish way, and I doubt that this works with more complex predicates
      • I'd appreciate test proposals with practical examples of such predicates
      • support for predicates with the smart_prefix option could be dropped altogether; finer-grained selection is possible in Python and is probably common usage
    • should this even be the default behavior, with opt-out? AFAICT it wouldn't break any code, as supplying a namespace map with a default namespace (mapped to None) is currently invalid
      • I'd keep it out of XSLT anyway
    • should result elements from such queries have a property that stores the option, so that later calls to .xpath() on these elements behave the same if no smart_prefix option is provided?
    • can regex.h be used directly from Cython? (That's not specific to this PR, though.)

    By the way, this is the first time I have used Cython, and my C usage was long ago; I'm happy about any feedback for improvements.

    Now, let's have some fun:

    >>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
    >>> root.nsmap
    {None: 'http://www.tei-c.org/ns/1.0', 'x': 'http://www.w3.org/2000/svg'}
    >>> root.xpath('./text/body', namespaces=root.nsmap, smart_prefix=True)
    [<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
    >>> root.xpath('./text/body', smart_prefix=True)
    [<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
    

    (Oh, the inplace build option fails on my local machine without a helpful message. Does anyone have a hint on that?)

    invalid 
    opened by funkyfuture 10
  • Fix LP:#1074996

    In Python 3, urlopen expects a byte string for the POST data. This patch encodes the data as UTF-8 before transmission. It is essentially a hack, since a proper fix would let the user specify an encoding of their choice.
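
A minimal sketch of the encoding step, using only the standard library. The URL and form fields are placeholders, and the request is only constructed, not sent; passing the prepared request to `urlopen` would perform the POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"q": "lxml", "lang": "en"}  # placeholder form data

# urlencode() returns str; Python 3 wants bytes for the request body.
body = urlencode(fields).encode("utf-8")

request = Request("http://example.invalid/submit", data=body)
print(type(request.data).__name__, request.get_method())
```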

    opened by phanimahesh 9
  • Add .appveyor.yml

    This PR adds an AppVeyor configuration that auto-builds lxml wheels for Python 2.7 and 3.5 on Windows: https://ci.appveyor.com/project/mhils/lxml/build/1.0.25

    If you click on the individual jobs, you can go to the artifacts tab and download the wheels from AppVeyor. I tested both the Python 2.7 and 3.5 binaries and they all seem to work :smiley:

    opened by mhils 9
  • Include `libiconv` headers in the static wheels

    Since lxml's C-API cannot be compiled without the libiconv headers, include them in the static wheels.

    Fixes https://bugs.launchpad.net/lxml/+bug/1939031

    opened by balsa-sarenac 2
  • Fix index out of range when <body> or <head> is missing

    If an HTML file that doesn't contain a <body> or <head> part is parsed, it can lead to an index out of range error when you try getattr(lxmlElement, "body")

    opened by FVolral 1
  • Clarify cleaner doc on setting `comments=False`

    Maybe the actual way to go is to remove this rule you seem not to be sure about. The sure thing is that the effect is quite unexpected for the library user: "I cannot disable comments removing".

    opened by JocelynDelalande 2
  • 'too many bodies' should not include 0

    Changed the check for the number of bodies to < 2, as the error 'too many bodies' can currently be raised when len(bodies) == 0.
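
The intended distinction between "none" and "too many" can be sketched like this. The helper `single_body` is hypothetical and stands in for whatever code performs the check; only the `< 2` condition comes from the description above.

```python
def single_body(bodies):
    """Return the only body, or None if there is none.

    Hypothetical helper: only two or more bodies count as 'too many'.
    """
    if not len(bodies) < 2:
        raise ValueError("too many bodies: %r" % (bodies,))
    return bodies[0] if bodies else None


print(single_body([]), single_body(["<body>"]))
```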

    opened by calebbarr 1
  • Disable resolve_entities by default w/ one billion laughs expansion bug.

    Facebook recently issued its largest bug bounty to remove this security bug.

    https://www.facebook.com/BugBounty/posts/778897822124446?stream_ref=10

    This may result in breaking changes but should be considered an important security enhancement.

    Test fix included.
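
Some back-of-the-envelope arithmetic shows why entity expansion matters. The sketch assumes the classic shape of the "billion laughs" document, in which a three-character entity is referenced ten times by the next entity, and so on for nine levels; the function name is made up and nothing is parsed here.

```python
def expanded_size(levels, fanout, payload):
    """Characters produced when nested entities are fully expanded."""
    return len(payload) * fanout ** levels


# Assumed classic layout: nine levels, ten references each, payload "lol".
size = expanded_size(levels=9, fanout=10, payload="lol")
print(f"{size:,} characters")
```

A document of well under a kilobyte therefore expands to roughly three billion characters if the parser resolves every entity, which is the motivation for turning off resolve_entities by default.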

    opened by rogerhu 8
A jquery-like library for python

pyquery: a jquery-like library for python pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jq

Gael Pasgrimaud 2k Oct 22, 2021
Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Christian Stefanescu 527 Oct 19, 2021
Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all majo

null 935 Oct 15, 2021
Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™ This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When us

Python Software Foundation 12.2k Oct 23, 2021
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 463 Oct 16, 2021
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

The Pallets Projects 427 Oct 11, 2021
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

null 1.8k Oct 22, 2021
The awesome document factory

The Awesome Document Factory WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous s

Kozea 4.6k Oct 22, 2021
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Mozilla 2.2k Oct 20, 2021