Namespaces are one honking great idea -- let's do more of those!
Using XPath to locate elements is quite cumbersome when it comes to documents that have a default namespace:
>>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
>>> root.nsmap
{'x': 'http://www.w3.org/2000/svg', None: 'http://www.tei-c.org/ns/1.0'}
>>> root.xpath('./text/body')
[]
>>> root.xpath('./text/body', namespaces=root.nsmap)
Traceback (most recent call last):
File "<input>", line 1, in <module>
root.xpath('./text/body', namespaces=root.nsmap)
File "src/lxml/lxml.etree.pyx", line 1584, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:59349)
evaluator = XPathElementEvaluator(self, namespaces=namespaces,
File "src/lxml/xpath.pxi", line 261, in lxml.etree.XPathElementEvaluator.__init__ (src/lxml/lxml.etree.c:170589)
File "src/lxml/xpath.pxi", line 133, in lxml.etree._XPathEvaluatorBase.__init__ (src/lxml/lxml.etree.c:168702)
File "src/lxml/xpath.pxi", line 57, in lxml.etree._XPathContext.__init__ (src/lxml/lxml.etree.c:167658)
_BaseContext.__init__(self, namespaces, extensions, error_log, enable_regexp,
File "src/lxml/extensions.pxi", line 84, in lxml.etree._BaseContext.__init__ (src/lxml/lxml.etree.c:156529)
if namespaces:
TypeError: empty namespace prefix is not supported in XPath
This is a well-documented issue (also here) and is commonly solved by adding an ad-hoc prefix for the default namespace to the namespace mapping and using that prefix in XPath expressions. Doing so loses the information about which namespace was the default unless it is preserved separately. (another hack, stdlib as well with some insights)
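For illustration, the common workaround might look like the following minimal sketch. The prefix `tei` is an arbitrary choice made for this example; any prefix not already present in the mapping would do.

```python
from lxml import etree

TEI_DOC = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:x="http://www.w3.org/2000/svg">'
    '<text><body><x:svg /></body></text></TEI>'
)
root = etree.fromstring(TEI_DOC)

# Copy the element's namespace map without the default namespace (key None),
# then re-add that namespace under an ad-hoc prefix. 'tei' is an assumption
# made for this example, not something the document dictates.
nsmap = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}
nsmap['tei'] = root.nsmap[None]

# Every unprefixed step has to be rewritten by hand to use that prefix.
result = root.xpath('./tei:text/tei:body', namespaces=nsmap)
print([element.tag for element in result])
```

Note that the rewritten mapping no longer records that the TEI namespace was the default one.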
But this solution doesn't generalise well to code such as adapter classes: it becomes tedious and error-prone because XPath expressions are not always identical (did i mention they are counter-intuitive to type?), and keeping track of namespace mappings across loosely coupled parts of the code introduces boilerplate.
Ultimately, the interplay of document namespaces and XPath expressions is anything but Pythonic, and complicated rather than complex, even though
There should be one-- and preferably only one --obvious way to do it.
The root of this issue is a flaw in the XPath 1.0 spec, which libxml2 follows in its implementation:
A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.
XML namespaces, on the other hand, do have a notion of an unprefixed default namespace:
If the attribute name matches DefaultAttName, then the namespace name in the attribute value is that of the default namespace in the scope of the element to which the declaration is attached.
XPath 2.0 did eventually fix this:
A QName in a name test is resolved into an expanded QName using the statically known namespaces in the expression context. It is a static error [err:XPST0081] if the QName has a prefix that does not correspond to any statically known namespace. An unprefixed QName, when used as a name test on an axis whose principal node kind is element, has the namespace URI of the default element/type namespace in the expression context; otherwise, it has no namespace URI.
There's no XPath 2.0 implementation with Python bindings around (well, there is one to XQilla that returns raw strings and is far off lxml's capabilities), and it is very unlikely that one will be implemented, as the extension as a whole is large and probably no one needs it outside the XQuery/XSLT scene. libxml2 didn't intend to ten years ago, but hey, looking for a thesis to write?
Thus I propose to backport that bug fix from XPath 2.0 to lxml's XPath interfaces as an opt-in smart_prefix option, without considering the whole standard, as
practicality beats purity.
Behind the scenes the ad-hoc prefix 'solution' described above is applied, but completely hidden from the client code.
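To make the idea concrete, here is a minimal pure-Python sketch of such a rewrite. It is not the Cython code in this pull request: the names `pick_prefix` and `apply_default_namespace` are hypothetical, and the sketch only handles plain slash-separated paths, rejecting predicates outright and leaving steps with explicit axes untouched.

```python
import re

# A simplified pattern for an unprefixed element name (no colon allowed).
_NCNAME = re.compile(r'^[A-Za-z_][\w.-]*$')


def pick_prefix(nsmap, base='default'):
    """Return a prefix that is not yet used as a key in nsmap."""
    prefix = base
    counter = 0
    while prefix in nsmap:
        counter += 1
        prefix = f'{base}{counter}'
    return prefix


def apply_default_namespace(expression, nsmap):
    """Prefix unprefixed element steps of a simple path expression.

    Returns the rewritten expression and a namespace map in which the
    default namespace (key None) is replaced by a generated prefix.
    """
    if None not in nsmap:
        return expression, dict(nsmap)
    if '[' in expression:
        raise ValueError('predicates are not handled by this sketch')

    prefix = pick_prefix(nsmap)
    steps = []
    for step in expression.split('/'):
        # Only bare names are rewritten; '.', '..', '*', '@attr',
        # 'x:name' and node tests like 'text()' do not match the pattern.
        if _NCNAME.match(step):
            step = f'{prefix}:{step}'
        steps.append(step)

    new_nsmap = {key: uri for key, uri in nsmap.items() if key is not None}
    new_nsmap[prefix] = nsmap[None]
    return '/'.join(steps), new_nsmap


nsmap = {None: 'http://www.tei-c.org/ns/1.0', 'x': 'http://www.w3.org/2000/svg'}
expression, rewritten_nsmap = apply_default_namespace('./text/body/x:svg', nsmap)
print(expression)
```

The returned pair could then be handed to lxml as `root.xpath(expression, namespaces=rewritten_nsmap)`, while the caller's original mapping, including its `None` key, stays untouched. Attribute steps are deliberately left unprefixed, in line with the XPath 2.0 rule quoted above.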
This pull request demonstrates the design and isn't complete yet. At least these issues still need to be addressed:
- documentation
- predicates are handled in a rather hackish way and i doubt that this works with more complex predicates
- i'd appreciate test proposals with practical examples of such
- support for predicates with the smart_prefix option could be dropped altogether; finer-grained selection is possible in Python and is probably common usage
- should this even be the default behavior with opt-out? afaict it wouldn't break any code as supplying a namespace map with a default namespace (mapped to None) is currently invalid
- i'd keep it out of XSLT anyway
- should result elements from such queries have a property that stores the option? so later calls on .xpath() of these elements would behave the same if no smart_prefix option is provided
- can regex.h be used directly from Cython? but that's not specific to this here
btw, this is the first time i used Cython and my C usage was long ago, so i'm happy about any feedback for improvements.
Now, let's have some fun:
>>> root = etree.fromstring('<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:x="http://www.w3.org/2000/svg"><text><body><x:svg /></body></text></TEI>')
>>> root.nsmap
{None: 'http://www.tei-c.org/ns/1.0', 'x': 'http://www.w3.org/2000/svg'}
>>> root.xpath('./text/body', namespaces=root.nsmap, smart_prefix=True)
[<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
>>> root.xpath('./text/body', smart_prefix=True)
[<Element {http://www.tei-c.org/ns/1.0}body at 0x7f8e655587c8>]
(oh, the inplace build option fails on my local machine without a helpful message. does anyone have a hint on that?)