What is lxml?

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory friendly, just so you know.

For an introduction and further documentation, see doc/main.txt.

For installation information, see INSTALL.txt.

For the issue tracker, see https://bugs.launchpad.net/lxml

Support the project

lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.

Most people who use lxml do so because they like using it. You can show us that you like it by blogging about your experience with it and linking to the project website.

If you are using lxml for your work and feel like giving a bit of your own benefit back to support the project, consider sending us money through GitHub Sponsors, Tidelift or PayPal that we can use to buy us free time for the maintenance of this great library, to fix bugs in the software, review and integrate code contributions, to improve its features and documentation, or to just take a deep breath and have a cup of tea every once in a while. Please read the Legal Notice below, at the bottom of this page. Thank you for your support.

Support lxml through GitHub Sponsors, via a Tidelift subscription, or via PayPal.

Please contact Stefan Behnel for other ways to support the lxml project, as well as commercial consulting, customisations and trainings on lxml and fast Python XML processing.

Travis-CI and AppVeyor support the lxml project with their build and CI servers. JetBrains supports the lxml project by donating free licenses of their PyCharm IDE. Another supporter of the lxml project is COLOGNE Webdesign.

Project income report

Total project income in 2019: EUR 717.52 (59.79 € / month)
- Tidelift: EUR 360.30
- PayPal: EUR 157.22
- other: EUR 200.00

Legal Notice for Donations

Any donation that you make to the lxml project is voluntary and is not a fee for any services, goods, or advantages. By making a donation to the lxml project, you acknowledge that we have the right to use the money you donate in any lawful way and for any lawful purpose we see fit and we are not obligated to disclose the way and purpose to any party unless required by applicable law. Although lxml is free software, to the best of our knowledge the lxml project does not have any tax exempt status. The lxml project is neither a registered non-profit corporation nor a registered charity in any country. Your donation may or may not be tax-deductible; please consult your tax advisor in this matter. We will not publish or disclose your name and/or e-mail address without your consent, unless required by applicable law. Your donation is non-refundable.

Namespaces are one honking great idea -- let's do more of those!

Using XPath to locate elements is quite cumbersome when it comes to documents that have a default namespace:

>>> from lxml import etree
>>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
>>> root.nsmap
{'x': 'http://www.w3.org/2000/svg', None: 'http://www.tei-c.org/ns/1.0'}
>>> root.xpath('./text/body')
[]
>>> root.xpath('./text/body', namespaces=root.nsmap)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    root.xpath('./text/body', namespaces=root.nsmap)
  File "src/lxml/lxml.etree.pyx", line 1584, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:59349)
    evaluator = XPathElementEvaluator(self, namespaces=namespaces,
  File "src/lxml/xpath.pxi", line 261, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:170589)
  File "src/lxml/xpath.pxi", line 133, in lxml.etree._XPathEvaluatorBase.__init__ (src/lxml/lxml.etree.c:168702)
  File "src/lxml/xpath.pxi", line 57, in lxml.etree._XPathContext.__init__ (src/lxml/lxml.etree.c:167658)
    _BaseContext.__init__(self, namespaces, extensions, error_log, enable_regexp,
  File "src/lxml/extensions.pxi", line 84, in lxml.etree._BaseContext.__init__ (src/lxml/lxml.etree.c:156529)
    if namespaces:
TypeError: empty namespace prefix is not supported in XPath

This is a well-documented issue (also here) and is commonly solved by replacing the default namespace in the namespace mapping with an ad-hoc prefix - which loses the information about what the default namespace was unless that is preserved separately - and adding that prefix to the XPath expressions. (another hack, stdlib as well with some insights)
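
As a minimal sketch of that workaround, the mapping step could look like the following. The helper name `with_adhoc_prefix` and the prefix `tei` are made up for illustration and are not part of lxml; only the `namespaces=` argument of `.xpath()` mentioned in the comment is existing API.

```python
def with_adhoc_prefix(nsmap, prefix="default"):
    """Return a copy of nsmap with the default namespace bound to `prefix`.

    Hypothetical helper: lxml's .xpath() rejects a None key, so the
    default namespace has to be given an explicit, made-up prefix.
    """
    if prefix in nsmap:
        raise ValueError("prefix %r is already in use" % prefix)
    # Keep all explicitly prefixed namespaces as they are.
    fixed = {key: uri for key, uri in nsmap.items() if key is not None}
    # Re-bind the default namespace, if there is one.
    if None in nsmap:
        fixed[prefix] = nsmap[None]
    return fixed


# The same mapping that root.nsmap reports in the session above.
nsmap = {None: "http://www.tei-c.org/ns/1.0", "x": "http://www.w3.org/2000/svg"}
namespaces = with_adhoc_prefix(nsmap, "tei")

# With lxml, every step in the default namespace then needs the prefix:
#   root.xpath('./tei:text/tei:body', namespaces=namespaces)
```

Note that the returned mapping no longer records which namespace was the default one, which is the loss of information mentioned above.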

But this solution doesn't play well with generalised code such as adapter classes, where it becomes tedious and error-prone because XPath expressions are not always identical (did I mention they are counter-intuitive to type?) and keeping track of namespace mappings across loosely coupled code elements introduces boilerplate.

Ultimately, the interplay of document namespaces and XPath expressions is anything but Pythonic, and complicated rather than complex, though

There should be one-- and preferably only one --obvious way to do it.

The root of this issue is a flaw in the XPath 1.0 spec, which libxml2 follows in its implementation:

A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.

Meanwhile, XML namespaces do have a notion of an unaliased default namespace:

If the attribute name matches DefaultAttName, then the namespace name in the attribute value is that of the default namespace in the scope of the element to which the declaration is attached.

XPath 2.0 did eventually fix this:

A QName in a name test is resolved into an expanded QName using the statically known namespaces in the expression context. It is a static error [err:XPST0081] if the QName has a prefix that does not correspond to any statically known namespace. An unprefixed QName, when used as a name test on an axis whose principal node kind is element, has the namespace URI of the default element/type namespace in the expression context; otherwise, it has no namespace URI.
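
To make the difference between the two quoted rules concrete, here is a small model of how a name test on an element axis is expanded. It is an illustration only, not libxml2 or lxml code; the function name `expand_name_test` and its parameters are assumptions.

```python
def expand_name_test(qname, namespaces, default_element_ns=None):
    """Expand a name test on an element axis to (namespace URI, local name).

    `namespaces` maps prefixes to URIs (the expression context).
    With default_element_ns=None this models the XPath 1.0 rule: an
    unprefixed name has no namespace. Passing a URI models the XPath 2.0
    rule: an unprefixed name gets the default element/type namespace.
    """
    prefix, sep, local = qname.rpartition(":")
    if sep:
        # Both versions treat an undeclared prefix as an error.
        if prefix not in namespaces:
            raise KeyError("undeclared namespace prefix: %r" % prefix)
        return namespaces[prefix], local
    return default_element_ns, local


TEI = "http://www.tei-c.org/ns/1.0"
context = {"x": "http://www.w3.org/2000/svg"}

xpath1_name = expand_name_test("body", context)
xpath2_name = expand_name_test("body", context, default_element_ns=TEI)
svg_name = expand_name_test("x:svg", context)
```

Under the 1.0 rule `body` resolves to a name without a namespace, so it cannot match the TEI `body` element; under the 2.0 rule it resolves into the TEI namespace.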

There's no XPath 2.0 implementation with Python bindings around (well, there is one for XQilla that returns raw strings and is far from lxml's capabilities), and it is very unlikely that one will be implemented, as the extension as a whole is large and probably no one needs it outside the XQuery/XSLT scene. libxml2 didn't intend to implement it ten years ago, but hey, looking for a thesis to write?

Thus I propose to backport that bug fix from XPath 2.0 to lxml's XPath interfaces as an opt-in smart_prefix option, without considering the whole standard, as

practicality beats purity.

Behind the scenes the ad-hoc prefix 'solution' described above is applied, but completely hidden from the client code.
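
A rough sketch of what such a hidden rewrite could look like for simple location paths follows. This is not the code from this pull request; the function `add_prefix` and the prefix `tei` are hypothetical, and steps with predicates or function calls are deliberately left untouched.

```python
import re

# A plain, unprefixed element name (simplified to ASCII-style NCNames).
_PLAIN_NAME = re.compile(r"^[A-Za-z_][\w.\-]*$")


def add_prefix(path, prefix):
    """Prefix every plain name step of a simple location path.

    Steps such as '.', '..', '*', '@attr', 'x:svg', 'text()' or
    'body[1]' do not match the pattern and are passed through as-is.
    """
    steps = []
    for step in path.split("/"):
        if _PLAIN_NAME.match(step):
            step = "%s:%s" % (prefix, step)
        steps.append(step)
    return "/".join(steps)


rewritten = add_prefix("./text/body", "tei")
mixed = add_prefix(".//body/x:svg/@id", "tei")
```

The rewritten expression would then be evaluated with a namespace mapping in which the ad-hoc prefix is bound to the document's default namespace.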

This pull request demonstrates the design and isn't complete yet; at least these issues still need to be addressed:

- documentation
- predicates are handled in a rather hackish way, and I doubt that it works with more complex predicates
  - I'd appreciate test proposals with practical examples of such predicates
  - support for predicates with the smart_prefix option could be dropped altogether; finer-grained selection is possible in Python and is probably common usage
- should this even be the default behavior with opt-out? AFAICT it wouldn't break any code, as supplying a namespace map with a default namespace (mapped to None) is currently invalid
  - I'd keep it out of XSLT anyway
- should result elements from such queries have a property that stores the option, so that later calls to .xpath() on these elements behave the same if no smart_prefix option is provided?
- can regex.h be used directly from Cython? (that's not specific to this PR, though)

BTW, this is the first time I have used Cython, and my C usage was long ago; I'm happy about any feedback for improvements.

Now, let's have some fun:

>>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
>>> root.nsmap
{None: 'http://www.tei-c.org/ns/1.0', 'x': 'http://www.w3.org/2000/svg'}
>>> root.xpath('./text/body', namespaces=root.nsmap, smart_prefix=True)
[<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
>>> root.xpath('./text/body', smart_prefix=True)
[<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]

(Oh, the inplace build option fails on my local machine without a helpful message. Does anyone have a hint on that?)

invalid

Introduce a multi os travis build that builds OSX wheels

Improvements could be made to build the manylinux wheels as well, but this is a big change on its own. When a tag is created, the wheel will be pushed as a GitHub release.

So this is unlikely to be ready to merge in its current state, but I think it's at a point where some feedback would be helpful.

Example build Example "release"

Here is me sanity testing the 2.7 wheel similarly to how the makefile does

$ pip install lxml-3.7.3-cp27-cp27m-macosx_10_11_x86_64.whl
Processing ./lxml-3.7.3-cp27-cp27m-macosx_10_11_x86_64.whl
Installing collected packages: lxml
Successfully installed lxml-3.7.3
$ python
Python 2.7.13 (default, Apr  5 2017, 22:17:22)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> import lxml.objectify
>>>

and 2.6

$ pip install lxml-3.7.3-cp26-cp26m-macosx_10_11_x86_64.whl
DEPRECATION: Python 2.6 is no longer supported by the Python core team, please upgrade your Python. A future version of pip will drop support for Python 2.6
Processing ./lxml-3.7.3-cp26-cp26m-macosx_10_11_x86_64.whl
Installing collected packages: lxml
Successfully installed lxml-3.7.3
$ python
Python 2.6.9 (unknown, Apr 30 2017, 08:22:20)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> import lxml.objectify
>>>

Some notes:

I don't bother building all the Python 3 builds for OSX, as they would all fail the same way as the one allowed_failure job that I do build. Filed bug here. I can help with that bug, but I'm not sure what to do next on it.
The Linux wheels that get built are likely not useful, as they are not manylinux builds. I experimented with building manylinux wheels here, and while I think it's possible, I abandoned it (for now) when I hit an FTP error that I had no idea how to deal with. Here is a failing build and the abandoned branch.

opened by Bachmann1234 31

AppVeyor CI: Add Python 3.11 jobs

AppVeyor deployed new Windows images with Python 3.11 support (https://github.com/appveyor/ci/issues/3844), which means we can use it to build Python 3.11 Windows wheels for lxml. This PR adds three Python 3.11 jobs to the matrix, for the x86, x86-64 and arm64 platforms.

Part of Bug #1977998. Partly replaces #355.

I tested the jobs on my branch, and the workflow passes.

After this PR, I would suggest backporting it to the 4.9 maintenance branch and releasing a new 4.9.2 version that includes these Python 3.11 Windows wheels.

opened by EwoutH 23
Add support for ucrt binaries on Windows
Hi,

This PR is a first big step towards resolving #1326096. I went through the pain to recompile libiconv, libxml2 and libxslt with Visual Studio 2015/ucrt to have binaries that can be used to build a Python 3.5 wheel.

This PR makes sure that the ucrt binaries are downloaded when we are on Python 3.5. I documented the actual compilation of the binaries in a reproducible manner at https://github.com/mhils/libxml2-win-binaries. After merging this PR and installing the Visual C++ Build Tools (or Visual Studio), a Python 3.5 x86 Windows wheel can be built as follows:

> "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat"
> python3 setup.py bdist_wheel --static-deps

I successfully tested the resulting wheel on a clean Win7 VM, where it worked fine.

If I could ask for a favor, it would be great if you could upload a Python 3.5 Windows wheel to PyPI as soon as possible (feel free to take the wheel linked above or compile your own). We're currently migrating @mitmproxy to Python 3, and lxml is currently the only dependency that holds a pip3 install mitmproxy back.

Thanks! Max
opened by mhils 19
Fix inheritance in lxml.html

As the old comment / FIXME from 8132c755adad4a75ba855d985dd257493bccc7fd notes, the mixin should come first for the inheritance to be correct (the left-most class is the first in the MRO, at least if no diamond inheritance is involved).

Also fix the odd super in HtmlMixin likely stemming from the incorrect MRO.

Fixes the inheritance order of all HTML* base classes, though it probably doesn't matter for anything other than HtmlElement.
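
The effect of the base class order on the MRO can be shown with a toy example. The class names below are placeholders and are not the lxml.html classes.

```python
class Base:
    def describe(self):
        return "base"


class Mixin:
    def describe(self):
        # Cooperates with whatever comes next in the MRO.
        return "mixin+" + super().describe()


class MixinLast(Base, Mixin):
    """Base comes first, so Base.describe() shadows the mixin."""


class MixinFirst(Mixin, Base):
    """The mixin comes first and can delegate to Base via super()."""


mixin_last_mro = [cls.__name__ for cls in MixinLast.__mro__]
mixin_first_mro = [cls.__name__ for cls in MixinFirst.__mro__]
mixin_last_result = MixinLast().describe()
mixin_first_result = MixinFirst().describe()
```

With the mixin listed last, its override is never reached; with the mixin listed first, a plain `super()` call in the mixin reaches the base class as intended.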

opened by xmo-odoo 14
Removed the PyPy special cases (for PyPy 4.1)

PyPy trunk (and the future PyPy 4.1) now contains https://bitbucket.org/pypy/pypy/commits/3144c72295ae, which improves the cpyext compatibility. It removes the need for these few hacks (which never fully worked, as discussed on pypy-dev).

opened by arigo 13
repair attribute mis-interpretation in ElementTreeContentHandler

Regarding https://bugs.launchpad.net/lxml/+bug/1136509, this is a proposed fix for the issue.

The first part of the fix just rewrites the attributes in startElement to have keys of the form (namespace, key).

At first, I set namespace to None, but I had a problem with that. It appears that even namespaced attributes, like the "xmlns:xsi" in my test document, are also passed to startElement, I suppose because the owning tag doesn't have a namespace. So in this case I'm splitting on the colon and passing the two tokens to startElementNS, but I'm not sure if this approach is correct. In any case, I added two tests; if you can show what should happen in the tests, that would at least make the correct behavior apparent here.

opened by zzzeek 13
Add Dependabot configuration for GitHub Actions updates

Add a Dependabot configuration that checks once a week if the GitHub Actions are still using the latest version. If not, it opens a PR to update them.

It will actually open very few PRs, since we only have major versions specified (like v3), so it will only open a PR when a new major version (like v4) is released.

This will basically automate the majority of PRs like #356.

See Keeping your actions up to date with Dependabot.

opened by EwoutH 11
GHA wheel CI: Update images, used actions and Python version
A bit of maintenance on the GitHub Actions wheel CI:

Update the used Ubuntu and macOS images to the latest versions, and enable the Windows run

Update the used actions to their latest versions

Use Python 3.10 to build wheels

Add Python 3.11 run

Part of Bug #1977998.
opened by EwoutH 11
Do not blindly copy all of the namespaces when tostring():ing a subtree.

When using a subtree of a document, do not simply copy all of the namespaces from all of the parents down. Only copy those that we actually use within the subtree, because copying all namespaces will bloat the subtree with information it should not have.

This might seem harmless in the average case, but it causes problems when serializing the XML, specifically with C14N serialization, which according to the specification retains all ns declarations on the root level element. So if this tostring() execution inserts all parent namespace declarations into the new root element, we unnecessarily bloat the ns declarations on this new toplevel element.

Having said this, I am not confident this is the best code for doing this; feel free to point me in the direction of better code.
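
The idea of "only the namespaces actually used within the subtree" can be sketched in plain Python with the standard library's ElementTree, whose tag and attribute names use the same `{uri}local` form as lxml. The helper `used_namespaces` is illustrative only and is not the C-level code of this patch.

```python
import xml.etree.ElementTree as ET


def used_namespaces(elem):
    """Collect the namespace URIs used by tags and attributes in a subtree."""
    uris = set()
    for node in elem.iter():
        # Element tag plus all attribute names of this node.
        for name in [node.tag, *node.attrib]:
            # Namespaced names look like '{uri}local'.
            if isinstance(name, str) and name.startswith("{"):
                uris.add(name[1:].split("}", 1)[0])
    return uris


doc = ET.fromstring(
    '<root xmlns:a="urn:a" xmlns:b="urn:b" xmlns:c="urn:c">'
    '<a:child><b:leaf a:id="1"/></a:child>'
    '<c:other/>'
    '</root>'
)
subtree = doc[0]  # the <a:child> element

subtree_uris = used_namespaces(subtree)
document_uris = used_namespaces(doc)
```

Here the subtree only needs declarations for `urn:a` and `urn:b`, while blindly copying the parents' declarations would also carry `urn:c` along.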

opened by Pelleplutt 11
Improve detection of the libxml2 and libxslt libraries

This patch improves detection of the libxml2 and libxslt libraries by cleaning up some of the overly-complex build system.

The patch also improves support for using pkg-config if available.

opened by hughmcmaster 10
Adds a `smart_prefix` option to XPath evaluations to overcome a counter-intuitive design flaw
opened by funkyfuture 10
AppVeyor CI: Update to Visual Studio 2022 image

This PR updates the AppVeyor configuration to use the Visual Studio 2022 image.

Update 2022-11-07: The Python 3.11 part of this PR has moved to a new PR, #360.

opened by EwoutH 13
Xpath with namespace and position
I noticed that it is not possible to use elem.find or elem.findall with an xpath that contains position indices if the method is called with the namespaces argument. This behavior has also been reported in Bug #1873886.

It appears that during the tokenization of the xpath, the numbers are treated as tags, i.e. they are concatenated with the default namespace (during function calls with namespaces). This results in a wrong path imo. For example:

>>> from lxml import etree
>>> doc = etree.XML("""
... <foo xmlns="http://example.com/foo">
...   <bar>baz</bar>
... </foo>""")
>>> path = "./bar[1]"
>>> doc.find(path, namespaces={None:"http://example.com/foo"})
None

The target element is not found here because the path that is used is effectively: ./{http://example.com/foo}bar[{http://example.com/foo}1]

Changes:

I added a check during the tokenization of the xpath to determine whether the processed tag is a number to avoid concatenation with the namespace.
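
The kind of check described can be sketched with a toy tokenizer. This is not lxml's actual path tokenizer; the pattern and the function `qualify_path` are simplified assumptions that only cover name steps, attribute tests, string literals and positional predicates (function calls such as text() are not handled).

```python
import re

_TOKEN = re.compile(
    r"""
    ( '[^']*' | "[^"]*" | //? | \.\. | [.*\[\]@=] )   # string literals and operators
    | ( (?:\{[^}]+\})? [^/\[\]@=\s]+ )                # names ({uri}name allowed) and numbers
    """,
    re.VERBOSE,
)


def qualify_path(path, default_ns):
    """Put unprefixed tag names into default_ns, leaving numbers alone."""
    out = []
    prev_op = None
    for op, name in _TOKEN.findall(path):
        if name:
            is_plain_tag = (
                prev_op != "@"                 # attribute names have no default namespace
                and not name.isdigit()         # position indices are not tags
                and not name.startswith("{")   # already in {uri}name form
                and ":" not in name            # explicitly prefixed
            )
            if is_plain_tag:
                name = "{%s}%s" % (default_ns, name)
        out.append(op or name)
        prev_op = op
    return "".join(out)


NS = "http://example.com/foo"
fixed_path = qualify_path("./bar[1]", NS)
with_attribute = qualify_path("./bar[@id='x'][2]", NS)
```

Without the `isdigit()` check, the index in `./bar[1]` would be turned into `{http://example.com/foo}1`, which is the broken path shown above.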
opened by knit-bee 1

Try and preserve the structure of the html during a diff

There exists a bug in the current htmldiff code, whereby the generated diff changes the structure of the HTML (notice that the <div id="middle"> appears at the beginning instead of the middle):

>>> from lxml.html import diff
>>> a = "<div id='first'>some old text</div><div id='last'>more old text</div>"
>>> b = "<div id='first'>some old text</div><div id='middle'>and new text</div><div id='last'>more old text</div>"
>>> diff.htmldiff(a, b)
('<div id="middle"> <div id="first"><ins>some old text</ins></div><ins>and new</ins> <del>some old</del> text</div><div id="last">more old text</div>')
>>>

This patchset is an attempt to fix that issue.

opened by lonetwin 0

Validate that host_whitelist is not a string

An attacker can use https:///evil.com to make a malformed "hostless" URL that would have a netloc == '' -- which is "in" any string. Strings are not documented to be allowed in this config variable anyhow, so just raise a type error if someone passes in a string by accident.

(This is a breaking change for people who didn't follow the documented types, but shouldn't affect anyone else.)
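
The underlying problem can be reproduced with the standard library alone. The function `check_host_whitelist` below is a hypothetical stand-in for the validation this PR adds, not the actual lxml code.

```python
from urllib.parse import urlsplit


def check_host_whitelist(host_whitelist):
    """Reject a plain string, where `in` would be a substring test."""
    if isinstance(host_whitelist, str):
        raise TypeError(
            "host_whitelist must be a collection of host names, not a string"
        )
    return host_whitelist


# The "hostless" URL has an empty network location.
netloc = urlsplit("https:///evil.com").netloc

# Substring test: the empty string is contained in every string.
matches_string = netloc in "example.com"

# Membership test: the empty string is not an element of the tuple.
matches_tuple = netloc in ("example.com",)
```

So a whitelist that was accidentally passed as a string would accept the malformed URL, while a tuple, list or set of host names would not.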

New test fails on current master.

opened by timmc 1
Don't parse hostname from netloc manually; rely on urlsplit's result

This manual parsing of netloc can be fooled by use of a userinfo component. SplitResult already has a hostname property.

New test test_host_whitelist_sneaky_userinfo fails on master.
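
A short illustration of how a userinfo component can mislead manual parsing follows. Splitting netloc on the first colon is an assumption about what such manual parsing might look like, and the host names are placeholders.

```python
from urllib.parse import urlsplit

# "trusted.example:" is the userinfo part (user name plus empty password);
# the host that is actually contacted comes after the "@".
url = "http://trusted.example:@evil.example/video"
parts = urlsplit(url)

# Naive approach (assumed for illustration): cut off a presumed ":port".
manual_host = parts.netloc.split(":", 1)[0]

# The hostname property strips userinfo and port correctly.
real_host = parts.hostname
```

A whitelist check against `manual_host` would see the trusted name, while `parts.hostname` reports the host the URL really points to.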

opened by timmc 1