I implemented several enhancements:
- The ability to create dictionaries of elements based on known index-key elements.
- The ability to force lists for certain tags.
- The ability to strip namespaces from results.
- The ability to receive data in new data structures that separate the XML attributes from the data.
- The ability to parse ElementTree data into a dictionary.
- The ability to use new parsing classes that keep track of default options.
- The ability to use an iterator in streaming mode.
- Whitespace stripping now applies to data in streaming mode.
While implementing the enhancements, I took care not to disturb the default behavior. Therefore, none of the changes should affect existing users.
Also, all the changes have unit tests that cover them. (The code coverage is > 90%. Most of the misses are in lines of code that are meant to handle variations on the Element/ElementTree objects, depending on which library created them.)
Index Keys
Imagine you have this input:
<servers>
  <server>
    <name>server1</name>
    <os>Linux</os>
  </server>
  <server>
    <name>server2</name>
    <os>Windows</os>
  </server>
</servers>
In this case, it might be helpful to have the 'servers' dictionary keyed by the server name. You can now do this using the index_keys option. With this option, the value of each named tag is "promoted" to be the key for its enclosing subtree.
So, for example, the index_keys=('name',) option will produce this data structure:
{u'servers': {u'server1': {u'name': u'server1',
                           u'os': u'Linux'},
              u'server2': {u'name': u'server2',
                           u'os': u'Windows'}}}
But what if you need to keep the "server" tag because it is intermixed with other tags? In that case, you can turn off the "index_keys_compress" option by setting index_keys_compress=False.
For example:
>>> xmltodict.parse("""
... <devices>
... <server>
... <name>server1</name>
... <os>Linux</os>
... </server>
... <server>
... <name>server2</name>
... <os>Windows</os>
... </server>
... <workstation>
... <name>host1</name>
... <os>Linux</os>
... </workstation>
... <workstation>
... <name>host2</name>
... <os>Windows</os>
... </workstation>
... </devices>
... """, new_style=True, index_keys=('name',), index_keys_compress=False).prettyprint(width=2)
{u'devices': {u'server': {u'server1': {u'name': u'server1',
u'os': u'Linux'},
u'server2': {u'name': u'server2',
u'os': u'Windows'}},
u'workstation': {u'host1': {u'name': u'host1',
u'os': u'Linux'},
u'host2': {u'name': u'host2',
u'os': u'Windows'}}}}
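The promotion described above can be sketched with the standard library alone. This is only an illustration of the idea, not the code in this pull request: the helper name to_indexed_dict and its compress argument are hypothetical, it is written for Python 3, it ignores XML attributes, and repeated tags without an index key simply overwrite each other.

```python
import xml.etree.ElementTree as ET


def to_indexed_dict(element, index_keys=("name",), compress=True):
    """Hypothetical helper: convert an Element to nested dicts, promoting index keys."""
    children = list(element)
    if not children:
        # Leaf node: return its stripped text.
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = to_indexed_dict(child, index_keys, compress)
        # Look for an index-key tag directly inside this child's subtree.
        key = None
        if isinstance(value, dict):
            for candidate in index_keys:
                if isinstance(value.get(candidate), str):
                    key = value[candidate]
                    break
        if key is None:
            result[child.tag] = value                        # nothing to promote
        elif compress:
            result[key] = value                              # key replaces the tag
        else:
            result.setdefault(child.tag, {})[key] = value    # key nested under the tag
    return result


root = ET.fromstring(
    "<servers>"
    "<server><name>server1</name><os>Linux</os></server>"
    "<server><name>server2</name><os>Windows</os></server>"
    "</servers>"
)
compressed = to_indexed_dict(root)
uncompressed = to_indexed_dict(root, compress=False)
```

Here compressed is keyed directly by server1 and server2, while uncompressed keeps the intermediate server level, mirroring the two outputs shown above.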
Force Lists
Sometimes, you have a node that may have one or more items. Rather than testing for both a list and single item, you can simplify your code by having the xmltodict parser always create a list for you.
For example, compare these outputs:
>>> xmltodict.parse("""
... <servers>
... <server>
... <name>server1</name>
... <os>Linux</os>
... </server>
... </servers>
... """, new_style=True).prettyprint(width=2)
{u'servers': {u'server': {u'name': u'server1',
u'os': u'Linux'}}}
>>> xmltodict.parse("""
... <servers>
... <server>
... <name>server1</name>
... <os>Linux</os>
... </server>
... <server>
... <name>server2</name>
... <os>Windows</os>
... </server>
... </servers>
... """, new_style=True).prettyprint(width=2)
{u'servers': {u'server': [{u'name': u'server1',
u'os': u'Linux'},
{u'name': u'server2',
u'os': u'Windows'}]}}
In the first case rv['servers']['server'] points to a single item. In the second case, rv['servers']['server'] points to a list.
You can force this to always be a list by setting the "force_list" parameter:
>>> xmltodict.parse("""
... <servers>
... <server>
... <name>server1</name>
... <os>Linux</os>
... </server>
... </servers>
... """, new_style=True, force_list=('server',)).prettyprint(width=2)
{u'servers': {u'server': [{u'name': u'server1',
u'os': u'Linux'}]}}
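For comparison, this is the kind of normalization callers usually write themselves when force_list is not available. It is a sketch for Python 3; the helper as_list is hypothetical, and plain dicts stand in for the parser's return values.

```python
def as_list(value):
    """Hypothetical helper: wrap a single item in a list; pass lists through."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Plain dicts standing in for the two parse results shown above.
one = {"servers": {"server": {"name": "server1", "os": "Linux"}}}
many = {"servers": {"server": [{"name": "server1", "os": "Linux"},
                               {"name": "server2", "os": "Windows"}]}}

# The same loop now works for both shapes.
names_one = [s["name"] for s in as_list(one["servers"]["server"])]
names_many = [s["name"] for s in as_list(many["servers"]["server"])]
```

With force_list=('server',), the parser returns the list form in both cases, so the wrapper is no longer needed.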
Strip Namespaces
XML namespaces are an essential part of node and attribute names. Even so, there are times when the namespaces are already well known and are merely extra information that a user can safely ignore (and will try to). In these cases, you can set the "strip_namespace" option to strip them.
For example:
>>> xmltodict.parse("""
... <servers xmlns="http://a.com/" xmlns:b="http://b.com/">
... <b:server>
... <name>test</name>
... </b:server>
... </servers>
... """, new_style=True, strip_namespace=True).prettyprint(width=2)
{u'servers': {u'server': {u'name': u'test'}}}
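The effect can be illustrated with the standard library's ElementTree, which stores a namespaced tag as '{uri}localname'. This is a sketch for Python 3 and an assumption about how to show the idea; the helper local_name is hypothetical and is not the implementation behind strip_namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Hypothetical helper: drop a leading '{uri}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(
    '<servers xmlns="http://a.com/" xmlns:b="http://b.com/">'
    "<b:server><name>test</name></b:server>"
    "</servers>"
)

# ElementTree expands prefixes into full URIs ...
qualified = [el.tag for el in root.iter()]
# ... and stripping them leaves only the local names.
stripped = [local_name(el.tag) for el in root.iter()]
```

Stripping is only safe when local names do not collide across namespaces, which is the "already well known" situation described above.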
New Classes
One of the difficulties in dealing with XML data in Python is representing the richness of the XML data (including, especially, the dual layers of attributes and data) while creating the simplest data structure possible. I'm sure many people have tried. I tried again.
The fundamental premise here is this: users who care about an XML attribute know to go looking for it. So, it is most important to present the main data in a simple format, and it is sufficient to provide one or more methods for users to find XML attributes.
The three data structures are:
- XMLCDATANode: This "quacks" like a string/unicode. (For example, XMLCDATANode("a") == "a" will evaluate to true.)
- XMLDictNode: This "quacks" like a dict or OrderedDict.
- XMLListNode: This "quacks" like a list.
These data structures have some extra methods to deal with attributes:
- has_xml_attrs(): Returns True if there are XML attributes; False otherwise.
- get_xml_attr(name[, default]): Returns the value of the XML attribute if it exists. Otherwise, it will return the default, if given, or raise a KeyError.
- set_xml_attr(name, value): Sets the value of the XML attribute.
- delete_xml_attr(name): Deletes the XML attribute. Raises a KeyError if the XML attribute does not exist.
- get_xml_attrs(): Returns the dictionary of XML attributes.
These data structures also implement a prettyprint() method which takes the same options as pprint() (except, of course, for the object to be printed). The prettyprint() method prints out the data only and does not show the XML attributes. This decision was made for readability purposes. The repr() method shows both.
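To make the "quacks like a string" idea concrete, here is a minimal sketch of a str subclass that carries attributes on the side. The class AttrString is hypothetical and written for Python 3; it only imitates the shape of XMLCDATANode and is not the class added in this pull request.

```python
from collections import OrderedDict


class AttrString(str):
    """Hypothetical stand-in for XMLCDATANode: a str that also holds XML attributes."""

    def __new__(cls, value, xml_attrs=None):
        obj = super().__new__(cls, value)
        obj.xml_attrs = OrderedDict(xml_attrs or {})
        return obj

    def has_xml_attrs(self):
        return bool(self.xml_attrs)

    def get_xml_attr(self, name, *default):
        # Return the attribute, else the default if one was given, else raise.
        if name in self.xml_attrs:
            return self.xml_attrs[name]
        if default:
            return default[0]
        raise KeyError(name)


node = AttrString("server1", {"coolness": "high"})
same_as_plain_string = (node == "server1")   # compares like an ordinary str
upper = node.upper()                         # ordinary str methods still work
coolness = node.get_xml_attr("coolness")
missing = node.get_xml_attr("darkness", "@@NOTFOUND@@")
```

Because the attributes live outside the string value, code that only wants the data never has to know they exist.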
I've already shown some examples of the new classes above. Here's another example:
>>> rv = xmltodict.parse("""
... <servers>
... <server coolness="high">
... <name>server1</name>
... </server>
... </servers>
... """, new_style=True)
>>> repr(rv)
"XMLDictNode(xml_attrs=OrderedDict(), value=OrderedDict([(u'servers', XMLDictNode(xml_attrs=OrderedDict(), value=OrderedDict([(u'server', XMLDictNode(xml_attrs=OrderedDict([(u'coolness', u'high')]), value=OrderedDict([(u'name', XMLCDATANode(xml_attrs=OrderedDict(), value=u'server1'))])))])))]))"
>>> rv.prettyprint(width=2)
{u'servers': {u'server': {u'name': u'server1'}}}
>>> rv.has_xml_attrs()
False
>>> rv['servers'].has_xml_attrs()
False
>>> rv['servers']['server'].has_xml_attrs()
True
>>> rv['servers']['server'].get_xml_attrs()
OrderedDict([(u'coolness', u'high')])
>>> rv['servers']['server'].get_xml_attr('coolness')
u'high'
>>> rv['servers']['server'].get_xml_attr('darkness', '@@NOTFOUND@@')
'@@NOTFOUND@@'
>>> rv['servers']['server']['name']
XMLCDATANode(xml_attrs=OrderedDict(), value=u'server1')
>>> rv['servers']['server']['name'] == 'server1'
True
>>> rv['servers']['server'].set_xml_attr('coolness', 'low')
>>> rv['servers'].set_xml_attr('length', '1')
>>> rv['servers'].set_xml_attr('delete_me', True)
>>> rv['servers'].delete_xml_attr('delete_me')
>>> xmltodict.unparse(rv)
u'<?xml version="1.0" encoding="utf-8"?>\n<servers length="1"><server coolness="low"><name>server1</name></server></servers>'
Parsing ElementTree Data
Sometimes, a user may use a library that returns an Element or ElementTree. In those cases, it is useful to convert the object into an easy-to-use dictionary without first serializing it to text. The parse_lxml() method does this. (It was originally intended for lxml; hence the name. However, it should work with ElementTree as well. Indeed, I did much of my testing with cElementTree.)
The parse_lxml() method should take the same options as the parse() method.
Example:
>>> xml = etree.XML("<a><b>data</b></a>")
>>> xmltodict.parse_lxml(xml, new_style=True).prettyprint()
{'a': {'b': u'data'}}
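Conceptually, converting an Element without going through text is a recursive walk over its children. The following Python 3 sketch shows that walk using only the standard library; the function element_to_dict is hypothetical, ignores XML attributes and mixed content, and is not the body of parse_lxml().

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Hypothetical helper: recursively convert an Element's children to a dict."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag turns the entry into a list of values.
            existing = result[child.tag]
            if not isinstance(existing, list):
                existing = result[child.tag] = [existing]
            existing.append(value)
        else:
            result[child.tag] = value
    return result


root = ET.XML("<a><b>data</b></a>")
doc = {root.tag: element_to_dict(root)}

repeated = ET.XML("<a><b>1</b><b>2</b></a>")
doc_repeated = {repeated.tag: element_to_dict(repeated)}
```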
Parsing Classes
Two new classes hold parsing defaults. They can be overridden on each invocation.
- The Parser() class is used for parsing XML text.
- The LXMLParser() class is used for parsing ElementTree objects.
Example:
>>> parser = xmltodict.Parser(new_style=True, index_keys=('name',))
>>> parser("<a><b><name>item1</name></b></a>").prettyprint()
{u'a': {u'item1': {u'name': u'item1'}}}
>>> parser("<a><b><name>item1</name></b></a>", index_keys=()).prettyprint()
{u'a': {u'b': {u'name': u'item1'}}}
>>> parser("<a><b><name>item1</name></b></a>", index_keys_compress=False).prettyprint()
{u'a': {u'b': {u'item1': {u'name': u'item1'}}}}
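The pattern behind these classes (defaults stored once, overridable per call) can be sketched as follows. The names DefaultsParser and fake_parse are hypothetical, the code targets Python 3, and fake_parse only echoes its arguments so that the option merging is visible.

```python
class DefaultsParser:
    """Hypothetical sketch: hold default keyword options and merge per-call overrides."""

    def __init__(self, parse_func, **defaults):
        self._parse = parse_func
        self._defaults = defaults

    def __call__(self, data, **overrides):
        # Copy the defaults so that one call's overrides never leak into the next.
        options = dict(self._defaults)
        options.update(overrides)
        return self._parse(data, **options)


def fake_parse(data, **options):
    """Stand-in for a real parse function: return what it was called with."""
    return data, options


parser = DefaultsParser(fake_parse, new_style=True, index_keys=("name",))
first = parser("<a/>")
second = parser("<a/>", index_keys=())
third = parser("<a/>")
```

The third call sees the original defaults again, which matches the behavior in the session above where each override applies to a single invocation.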
Iterators/Generators
In streaming mode, you can now use an iterator/generator to loop through the list of matching items. This will be done with incremental parsing of the input file (however, see note below about Jython). The input file is processed 1KB at a time and then each matching node is returned on a subsequent iteration. Once all the matching nodes from the first 1KB are returned, the next 1KB is read (and so on).
(Note: I'm not sure why, but the Travis CI test shows that Jython is failing the unit test that checks to make sure that the parsing really is done incrementally. I need to do more examination to determine whether this is a true failure, or a false failure due to a flaw in the test.)
If the generator argument evaluates to True and item_depth is non-zero, the parser will return an iterator. On each iteration, the code will return the next (path, item) tuple at the item_depth level. These are the same items (in the same format) that would be passed to the callback function; however, they are returned at each iteration.
Two corner cases: If generator is True and item_depth is zero, the code will return a single-item list with an empty path and the full document. If generator is True and item_callback is also set, the item_callback will be executed for each iteration prior to the iterator's return.
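The chunked read-then-yield loop described above can be approximated with the standard library's XMLPullParser. This is a Python 3 sketch under stated assumptions: the generator iter_items is hypothetical, it yields a simplified (tag, text) pair rather than the full (path, item) tuple, and it is not the implementation in this pull request.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(stream, item_depth, chunk_size=1024):
    """Hypothetical sketch: yield (tag, text) for each element closed at item_depth."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
        # Hand back everything completed by this chunk before reading the next one.
        for event, element in parser.read_events():
            if event == "start":
                depth += 1
            else:
                if depth == item_depth:
                    yield element.tag, (element.text or "").strip()
                depth -= 1
    parser.close()


xml = '<a prop="x">\n  <b>1</b>\n  <b>2</b>\n</a>'
items = list(iter_items(io.StringIO(xml), item_depth=2))
# A tiny chunk size forces many feed() calls and gives the same result.
items_small_chunks = list(iter_items(io.StringIO(xml), item_depth=2, chunk_size=4))
```

Because the function is a generator, no input is read until the caller asks for the first item, and later chunks are read only when the items from earlier chunks have been consumed.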
Example:
>>> xml = """\
... <a prop="x">
... <b>1</b>
... <b>2</b>
... </a>"""
>>> for (path, item) in xmltodict.parse(xml, generator=True, item_depth=2):
... print 'path:%s item:%s' % (path, item)
...
path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2
Whitespace Stripping Enhancement
In streaming mode, whitespace stripping now also applies to the items themselves. Previously, if the item at the item_depth was a CDATA node, whitespace stripping was not applied before the item was sent to the callback function. Now, whitespace stripping takes effect before the call to the callback function (or the return of a value from the iterator).
(Whitespace stripping is still controlled by the strip_whitespace argument.)
Example of the previous behavior:
>>> print xml2
<a>
<b>
<c>data1</c>
<d>data2</d>
<c>data3</c>
</b>
<b>
<c>data4</c>
<d>data5</d>
<c>data6</c>
</b>
</a>
>>> for (stack, value) in xmltodict.parse(xml2, generator=True, item_depth=3, item_callback=cb):
... print "\tValue: %r" % value
...
Value: u'\n \n data1'
Value: u'\n data2'
Value: u'\n data3'
Value: u'\n \n data4'
Value: u'\n data5'
Value: u'\n data6'
Example of the new behavior:
>>> for (stack, value) in xmltodict.parse(xml2, generator=True, item_depth=3):
... print "\tValue: %r" % value
...
Value: u'data1'
Value: u'data2'
Value: u'data3'
Value: u'data4'
Value: u'data5'
Value: u'data6'