Python module that makes working with XML feel like you are working with JSON

Martín Blech

Last update: Jan 4, 2023


Overview

xmltodict

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute", 
        "and": {
            "many": [
                "elements", 
                "more elements"
            ]
        }, 
        "plus": {
            "@a": "complex", 
            "#text": "element as well"
        }
    }
}
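
Once parsed, the result is an ordinary nested mapping, so the document above can be read back with plain key lookups. The following is a minimal sketch that assumes the default `@` attribute prefix and `#text` key; the variable names are only illustrative.

```python
import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""")

root = doc["mydocument"]
has_attr = root["@has"]            # attributes are stored under "@<name>"
many = root["and"]["many"]         # a repeated tag becomes a list
plus_text = root["plus"]["#text"]  # text next to attributes lives under "#text"

print(has_attr, many, plus_text)
```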

Namespace support

By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:

>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True

It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:

>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True

Streaming mode

xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
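
For readers without the Discogs dump at hand, the same streaming pattern can be tried on a small in-memory document. This is only a sketch: the XML snippet and the `names` list are made up for illustration, and the callback signature follows the example above.

```python
import xmltodict

XML = """
<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
</artists>
"""

names = []

def handle_artist(path, artist):
    # path is the list of (tag, attributes) pairs leading to this item
    names.append(artist["name"])
    return True  # returning a falsy value would abort parsing

# In streaming mode the items go to the callback and parse() returns None.
result = xmltodict.parse(XML, item_depth=2, item_callback=handle_artist)
print(names)
```

Because each item is handed to the callback and then dropped, memory use stays roughly proportional to one item rather than to the whole file.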

It can also be used from the command line to pipe objects to a script like this:

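import sys, marshal
while True:
    try:
        # marshal needs a binary stream; on Python 3 that is sys.stdin.buffer
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break  # the pipe is exhausted
    print(article['title'])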

$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ bunzip2 -c enwiki-pages-articles.xml.bz2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ gunzip -c enwiki.dicts.gz | script1.py
$ gunzip -c enwiki.dicts.gz | script2.py
...
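
A consuming script has to keep reading marshalled `(path, item)` tuples until the pipe runs dry. Here is a minimal sketch of such a reader; `iter_items` is a hypothetical helper name, and an in-memory buffer stands in for the pipe so the snippet runs on its own.

```python
import io
import marshal

def iter_items(stream):
    """Yield marshalled objects from a binary stream until it is exhausted."""
    while True:
        try:
            yield marshal.load(stream)
        except EOFError:
            return

# Stand-in for the pipe: two (path, item) tuples marshalled back to back.
fake_pipe = io.BytesIO(
    marshal.dumps(([("page", None)], {"title": "Anarchism"}))
    + marshal.dumps(([("page", None)], {"title": "Autism"}))
)

titles = [item["title"] for _path, item in iter_items(fake_pipe)]
print(titles)
```

In a real script the stream would be `sys.stdin.buffer` instead of the `BytesIO` object.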

Roundtripping

You can also convert in the other direction, using the unparse() function:

>>> mydict = {
...     'response': {
...             'status': 'good',
...             'last_updated': '2014-02-16T23:10:12Z',
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
	<status>good</status>
	<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>

Text values for nodes can be specified with the cdata_key key in the Python dict, while node attributes can be specified by prefixing the key name with attr_prefix. The default value for attr_prefix is @ and the default value for cdata_key is #text.

>>> import xmltodict
>>> 
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
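
A quick way to see that these conventions are symmetric is to unparse a dict and parse the result again. This is a small sketch with the default prefixes; `full_document=False` is used only to leave out the XML declaration.

```python
import xmltodict

original = {"text": {"@color": "red", "#text": "This is a test"}}

# full_document=False omits the <?xml ...?> declaration
xml = xmltodict.unparse(original, full_document=False)
restored = xmltodict.parse(xml)

print(xml)
print(restored == original)
```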

Lists that are specified under a key in a dictionary use the key as a tag for each item. But if a list does not have a parent key, for example if a list exists inside another list, it does not have a tag to use and the items are converted to a string as shown in the example below. To give tags to nested lists, use the expand_iter keyword argument to provide a tag as demonstrated below. Note that using expand_iter will break roundtripping.

>>> mydict = {
...     "line": {
...         "points": [
...             [1, 5],
...             [2, 6],
...         ]
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<line>
        <points>[1, 5]</points>
        <points>[2, 6]</points>
</line>
>>> print(xmltodict.unparse(mydict, pretty=True, expand_iter="coord"))
<?xml version="1.0" encoding="utf-8"?>
<line>
        <points>
                <coord>1</coord>
                <coord>5</coord>
        </points>
        <points>
                <coord>2</coord>
                <coord>6</coord>
        </points>
</line>

Ok, how do I get it?

Using pypi

You just need to

$ pip install xmltodict

RPM-based distro (Fedora, RHEL, …)

There is an official Fedora package for xmltodict.

$ sudo yum install python-xmltodict

Arch Linux

There is an official Arch Linux package for xmltodict.

$ sudo pacman -S python-xmltodict

Debian-based distro (Debian, Ubuntu, …)

There is an official Debian package for xmltodict.

$ sudo apt install python-xmltodict

FreeBSD

There is an official FreeBSD port for xmltodict.

$ pkg install py36-xmltodict

openSUSE/SLE (SLE 15, Leap 15, Tumbleweed)

There is an official openSUSE package for xmltodict.

# Python2
$ zypper in python2-xmltodict

# Python3
$ zypper in python3-xmltodict

Comments

xml containing 1 child

Consider the following code

xml = """<?xml version="1.0" encoding="utf-8" ?>
<root>
    <children>
        <child>
            <name>A</name>
        </child>
    </children>
</root>"""

xmltodict.parse(xml)['root']['children']['child']

Wouldn't you expect to have an iterable object even when there is only 1 child?

wontfix

opened by ocZio 31
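
For readers who hit the same question: xmltodict releases that accept the `force_list` argument to `parse()` can be told to always wrap selected tags in a list. A minimal sketch, assuming such a release is installed:

```python
import xmltodict

xml = """<?xml version="1.0" encoding="utf-8" ?>
<root>
    <children>
        <child>
            <name>A</name>
        </child>
    </children>
</root>"""

# force_list names the tags that should always be parsed as lists
children = xmltodict.parse(xml, force_list=("child",))["root"]["children"]["child"]
names = [child["name"] for child in children]
print(names)
```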

Several Enhancements
I implemented several enhancements:

The ability to create dictionaries of elements based on known index-key elements.

The ability to force lists for certain tags.

The ability to strip namespaces from results.

The ability to receive data in new data structures that separate the XML attributes from the data.

The ability to parse ElementTree data into a dictionary.

The ability to use new parsing classes that keep track of default options.

The ability to use an iterator in streaming mode.

Whitespace stripping now applies to data in streaming mode.

While implementing the enhancements, I took care not to disturb the default behavior. Therefore, none of the changes should impact existing users.

Also, all the changes have unit tests that cover them. (The code coverage is > 90%. Most of the misses are in lines of code that are meant to handle variations on the Element/ElementTree objects, depending on which library created them.)

Index Keys

Imagine you have this input:

<servers>
  <server>
    <name>server1</name>
    <os>Linux</os>
  </server>
  <server>
    <name>server2</name>
    <os>Windows</os>
  </server>
</servers>

In this case, it might be helpful to have the 'servers' dictionary keyed off of the server name. You can now do this using the index_keys option. With this option, the named tags will be "promoted" to be the key for their subtree.

So, for example, the index_keys=('name',) option will produce this data structure:

{u'servers': {u'server1': {u'name': u'server1',
                           u'os': u'Linux'},
              u'server2': {u'name': u'server2',
                           u'os': u'Windows'}}}
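
The effect of promoting a tag to a key can be imitated in plain Python on an already parsed list. The sketch below is not the implementation from this pull request; `index_by` is a hypothetical helper that only illustrates the resulting shape.

```python
def index_by(items, key):
    """Return a dict that maps each item's `key` value to the item itself."""
    return {item[key]: item for item in items}

# The list a parser would normally produce for two <server> elements
servers = [
    {"name": "server1", "os": "Linux"},
    {"name": "server2", "os": "Windows"},
]

by_name = index_by(servers, "name")
print(by_name["server2"]["os"])
```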

But, what if you need the "server" tag because it is intermixed with other tags? In that case, you can turn off the "index_keys_compress" option.

For example:

>>> xmltodict.parse("""
... <devices>
...   <server>
...     <name>server1</name>
...     <os>Linux</os>
...   </server>
...   <server>
...     <name>server2</name>
...     <os>Windows</os>
...   </server>
...   <workstation>
...     <name>host1</name>
...     <os>Linux</os>
...   </workstation>
...   <workstation>
...     <name>host2</name>
...     <os>Windows</os>
...   </workstation>
... </devices>
... """, new_style=True, index_keys=('name',), index_keys_compress=False).prettyprint(width=2)
{u'devices': {u'server': {u'server1': {u'name': u'server1',
                                       u'os': u'Linux'},
                          u'server2': {u'name': u'server2',
                                       u'os': u'Windows'}},
              u'workstation': {u'host1': {u'name': u'host1',
                                          u'os': u'Linux'},
                               u'host2': {u'name': u'host2',
                                          u'os': u'Windows'}}}}

Force Lists

Sometimes, you have a node that may have one or more items. Rather than testing for both a list and single item, you can simplify your code by having the xmltodict parser always create a list for you.

For example, compare these outputs:

>>> xmltodict.parse("""
... <servers>
...   <server>
...     <name>server1</name>
...     <os>Linux</os>
...   </server>
... </servers>
... """, new_style=True).prettyprint(width=2)
{u'servers': {u'server': {u'name': u'server1',
                          u'os': u'Linux'}}}
>>> xmltodict.parse("""
... <servers>
...   <server>
...     <name>server1</name>
...     <os>Linux</os>
...   </server>
...   <server>
...     <name>server2</name>
...     <os>Windows</os>
...   </server>
... </servers>
... """, new_style=True).prettyprint(width=2)
{u'servers': {u'server': [{u'name': u'server1',
                           u'os': u'Linux'},
                          {u'name': u'server2',
                           u'os': u'Windows'}]}}

In the first case rv['servers']['server'] points to a single item. In the second case, rv['servers']['server'] points to a list.

You can force this to always be a list by setting the "force_list" parameter:

>>> xmltodict.parse("""
... <servers>
...   <server>
...     <name>server1</name>
...     <os>Linux</os>
...   </server>
... </servers>
... """, new_style=True, force_list=('server',)).prettyprint(width=2)
{u'servers': {u'server': [{u'name': u'server1',
                           u'os': u'Linux'}]}}

Strip Namespaces

Let me start by granting the truth that XML namespaces are an essential part of node and attribute names. Now, having said that, there are times when the namespaces are already well-known and are merely extra information that a user can (and will try to) safely ignore. In these cases, you can set the "strip_namespace" option to strip namespaces.

For example:

>>> xmltodict.parse("""
... <servers xmlns="http://a.com/" xmlns:b="http://b.com/">
...   <b:server>
...     <name>test</name>
...   </b:server>
... </servers>
... """, new_style=True, strip_namespace=True).prettyprint(width=2)
{u'servers': {u'server': {u'name': u'test'}}}

New Classes

One of the difficulties in dealing with XML data in Python is representing the richness of the XML data (including, especially, the dual layers of attributes and data) while creating the simplest data structure possible. I'm sure many people have tried. I tried again.

The fundamental premise here is this: if a user cares about an XML attribute, he/she knows to go looking for it. So, it is most important to present the main data in a simple format, and it is sufficient to provide one or more methods for users to find XML attributes.

The three data structures are:

XMLCDATANode: This "quacks" like a string/unicode. (For example, XMLCDATANode("a") == "a" will evaluate to true.)

XMLDictNode: This "quacks" like a dict or OrderedDict.

XMLListNode: This "quacks" like a list.

These data structures have some extra methods to deal with attributes:

has_xml_attrs(): Returns True if there are XML attributes; False otherwise.

get_xml_attr(name[, default]): Returns the value of the XML attribute if it exists. Otherwise, it will return the default, if given, or raise a KeyError.

set_xml_attr(name, value): Sets the value of the XML attribute.

delete_xml_attr(name): Deletes the XML attribute. Raises a KeyError if the XML attribute does not exist.

get_xml_attrs(): Returns the dictionary of XML attributes.

These data structures also implement a prettyprint() method which takes the same options as pprint() (except, of course, for the object to be printed). The prettyprint() method prints out the data only and does not show the XML attributes. This decision was made for readability purposes. The repr() method shows both.

I've already shown some examples of the new classes above. Here's another example:

>>> rv = xmltodict.parse("""
... <servers>
...   <server coolness="high">
...     <name>server1</name>
...   </server>
... </servers>
... """, new_style=True)
>>> repr(rv)
"XMLDictNode(xml_attrs=OrderedDict(), value=OrderedDict([(u'servers', XMLDictNode(xml_attrs=OrderedDict(), value=OrderedDict([(u'server', XMLDictNode(xml_attrs=OrderedDict([(u'coolness', u'high')]), value=OrderedDict([(u'name', XMLCDATANode(xml_attrs=OrderedDict(), value=u'server1'))])))])))]))"
>>> rv.prettyprint(width=2)
{u'servers': {u'server': {u'name': u'server1'}}}
>>> rv.has_xml_attrs()
False
>>> rv['servers'].has_xml_attrs()
False
>>> rv['servers']['server'].has_xml_attrs()
True
>>> rv['servers']['server'].get_xml_attrs()
OrderedDict([(u'coolness', u'high')])
>>> rv['servers']['server'].get_xml_attr('coolness')
u'high'
>>> rv['servers']['server'].get_xml_attr('darkness', '@@NOTFOUND@@')
'@@NOTFOUND@@'
>>> rv['servers']['server']['name']
XMLCDATANode(xml_attrs=OrderedDict(), value=u'server1')
>>> rv['servers']['server']['name'] == 'server1'
True
>>> rv['servers']['server'].set_xml_attr('coolness', 'low')
>>> rv['servers'].set_xml_attr('length', '1')
>>> rv['servers'].set_xml_attr('delete_me', True)
>>> rv['servers'].delete_xml_attr('delete_me')
>>> xmltodict.unparse(rv)
u'<?xml version="1.0" encoding="utf-8"?>\n<servers length="1"><server coolness="low"><name>server1</name></server></servers>'

Parsing ElementTree Data

Sometimes, a user may use a library that returns an Element or ElementTree. In those cases, it would be useful to be able to convert it into an easy to use dictionary without having to first convert it to text. In those cases, the user can use the parse_lxml() method. (This was originally intended for lxml; hence, the name. However, it should work with ElementTree, as well. Indeed, I did much of my testing with cElementTree.)

The parse_lxml() method should take the same options as the parse() method.

Example:

>>> xml = etree.XML("<a><b>data</b></a>")
>>> xmltodict.parse_lxml(xml, new_style=True).prettyprint()
{'a': {'b': u'data'}}

Parsing Classes

Two new classes hold parsing defaults. They can be overridden on each invocation.

The Parser() class is used for parsing XML text.

The LXMLParser() class is used for parsing ElementTree objects.

Example:

>>> parser = xmltodict.Parser(new_style=True, index_keys=('name',))
>>> parser("<a><b><name>item1</name></b></a>").prettyprint()
{u'a': {u'item1': {u'name': u'item1'}}}
>>> parser("<a><b><name>item1</name></b></a>", index_keys=()).prettyprint()
{u'a': {u'b': {u'name': u'item1'}}}
>>> parser("<a><b><name>item1</name></b></a>", index_keys_compress=False).prettyprint()
{u'a': {u'b': {u'item1': {u'name': u'item1'}}}}

Iterators/Generators

In streaming mode, you can now use an iterator/generator to loop through the list of matching items. This will be done with incremental parsing of the input file (however, see note below about Jython). The input file is processed 1KB at a time and then each matching node is returned on a subsequent iteration. Once all the matching nodes from the first 1KB are returned, the next 1KB is read (and so on).

(Note: I'm not sure why, but the Travis CI test shows that Jython is failing the unit test that checks to make sure that the parsing really is done incrementally. I need to do more examination to determine whether this is a true failure, or a false failure due to a flaw in the test.)

If the generator argument evaluates to True and item_depth is non-zero, the parser will return an iterator. On each iteration, the code will return the next (path, item) tuple at the item_depth level. These are the same items (in the same format) that would be passed to the callback function; however, they are returned at each iteration.

Two corner cases: If generator is True and item_depth is zero, the code will return a single-item list with an empty path and the full document. If generator is True and item_callback is also set, the item_callback will be executed for each iteration prior to the iterator's return.

Example:

>>> xml = """\
... <a prop="x">
...   <b>1</b>
...   <b>2</b>
... </a>"""
>>> for (path, item) in xmltodict.parse(xml, generator=True, item_depth=2):
...     print 'path:%s item:%s' % (path, item)
...
path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2

Whitespace Stripping Enhancement

Whitespace stripping now applies in streaming mode. Previously, if the item at the item_depth was a CDATA node, whitespace stripping was not applied prior to the item being sent to the callback function. Now, whitespace stripping takes effect prior to the call to the callback function (or return of a value from the iterator).

(Whitespace stripping is still controlled by the strip_whitespace argument.)

Example of the previous behavior:

>>> print xml2
<a>
<c>data1</c>
<d>data2</d>
<c>data3</c>
<c>data4</c>
<d>data5</d>
<c>data6</c>
</a>
>>> for (stack, value) in xmltodict.parse(xml2, generator=True, item_depth=3, item_callback=cb):
...     print "\tValue: %r" % value
...
Value: u'\n \n data1'
Value: u'\n data2'
Value: u'\n data3'
Value: u'\n \n data4'
Value: u'\n data5'
Value: u'\n data6'

Example of the new behavior:

>>> for (stack, value) in xmltodict.parse(xml2, generator=True, item_depth=3):
...     print "\tValue: %r" % value
...
Value: u'data1'
Value: u'data2'
Value: u'data3'
Value: u'data4'
Value: u'data5'
Value: u'data6'
opened by jonlooney 13
Optional attributes and unknown children count
What's wrong?

1. Optional attributes

If I have a tag which can look like this:

<sometag>blablabla</sometag>

or like this:

<sometag attr="123">blablabla</sometag>

In the first case parse result will look like this:

{ 'sometag': 'blablabla' }

and in the second case it will look like:

{ 'sometag': { '@attr': '123', '#text': 'blablabla' } }

So if I want to get its text content I can't just do something like:

something = parse_result['sometag']['#text']

I have to write such ugly things:

something = parse_result['sometag']
something = something['#text'] if type(something) is OrderedDict else something

2. Unknown children count

If I have a tag which can look like this:

<parent>
  <child>some text</child>
  <child>other text</child>
</parent>

And I don't know the exact number of children (there could be only one child!), so I can't iterate over the children like this:

for child in parse_result['parent']['child']:
    # some code

because a list is used only if there is more than one child. So in this case I also have to perform some ugly type checking, like this:

children = parse_result['parent']['child']
children = children if type(children) is list else [children]
for child in children:
    # some code

Suggestion

I suggest adding a special mode (triggered by an optional argument passed to the parse function). In this mode it would always use a dictionary to describe a tag and always use a list to describe child tags.
opened by vovanz 10
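
Until such a mode is available, the two type checks described in this issue can be wrapped in small helpers so they are written only once. This is a sketch in plain Python; `text_of` and `as_list` are hypothetical names, and `#text` is assumed to be the cdata key in use.

```python
def text_of(node, cdata_key="#text"):
    """Return the text of a parsed element, with or without attributes."""
    if isinstance(node, dict):
        return node.get(cdata_key)
    return node

def as_list(value):
    """Return value as a list: [] for None, [value] for a single item."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# The two shapes of <sometag> described in this issue
plain = {"sometag": "blablabla"}
with_attr = {"sometag": {"@attr": "123", "#text": "blablabla"}}
print(text_of(plain["sometag"]), text_of(with_attr["sometag"]))

# One child versus several children
print(as_list("some text"), as_list(["some text", "other text"]))
```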
quoted string value is quoted again

Hi! I just found this solution for translating XML to a dict and it is amazing!! Thanks a lot! A tiny problem that I encountered is that a value like: <lp>"/"</lp> is translated to "lp": "\"/\""

Is there any way to not quote the string and just use it as is? Thanks a lot!

opened by adriansev 9
Only OrderedDicts are returned
This may just be a documentation issue, but when I run: (Python 2.7 OS X)

foo = xmltodict.parse("""<?xml version="1.0" ?>
<person>
  <name>john</name>
  <age>20</age>
</person>""")
print foo

I get:

Output: OrderedDict([(u'person', OrderedDict([(u'name', u'john'), (u'age', u'20')]))])

In a nested XML document, this is making it hard for me to turn this into JSON.
invalid
opened by pjakobsen 9
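
For context on the label: an OrderedDict is still a dict, so the standard json module serializes the parse result directly, nesting included. A minimal sketch using the document from this issue:

```python
import json
import xmltodict

doc = xmltodict.parse("""<?xml version="1.0" ?>
<person>
  <name>john</name>
  <age>20</age>
</person>""")

# json.dumps accepts dict subclasses such as OrderedDict at every level
as_json = json.dumps(doc)
print(as_json)
```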
json to xml with "self-closing tags"

Hi experts, is there a way to convert from JSON to XML with self-closing tags? Example: my JSON is defined as below:

arr = [{"@name": "transactionId", "@value": "1234", "@type": "u32"},
       {"@name": "numTransactions", "@value": "1", "@type": "u32"}]

with xmltodict.unparse(), the generated XML has these lines.

But I need the self-closing tags, like this

Looking forward to the experts' suggestions.

regards

opened by Kamakshilk 8
unparse handles lists incorrectly?

My Python object looks like so:

{'Response': {'@ErrorCode': '00',
              'Versions': [{'Version': {'@Updated': u'2013-10-23T18:29:11',
                                        'Basic': {'@MD5': u'a7674c694607b169e57593a4952ea26f'}}},
                           {'Version': {'@Updated': u'2013-10-23T18:55:53',
                                        'Basic': {'@MD5': u'b50001ee638f7df058d2c5f9157c6e8a'}}}]}}

The resulting XML from 'unparse' puts an endtag for Versions after the first Version, then starts it again before the second list item.

Seems that "Versions" shouldn't be ended after the first "Version" object?

opened by jillh510 8

parse-unparse does not roundtrip on mixed content model

to simply reproduce:

import xmltodict
mix = xmltodict.parse('<mix>before <nested>inside</nested> after</mix>')
xmltodict.unparse(mix, full_document=False)

'<mix><nested>inside</nested>before  after</mix>'

the before-after text gets somehow joined into one '#text' node

opened by marc-portier 7

Child Order Not Maintained with Different Tags
If there is a generic doc that has 4 elements, but one in the middle has a different tag name, that order is not persisted in the round trip

>>> xml = """<doc><el>1</el><el>2</el><el1>3</el1><el>4</el></doc>"""
>>> d = xmltodict.parse(xml)
>>> round_trip = xmltodict.unparse(d)
>>> print(round_trip)
'<doc><el>1</el><el>2</el><el>4</el><el1>3</el1></doc>'

As you can see the el1 got moved to the end.
wontfix
opened by mcrowson 7
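
Looking at the intermediate dict shows where the order goes: all `el` values are gathered into one list under a single key, so their position relative to `el1` is no longer recorded. A short sketch of that inspection:

```python
import xmltodict

d = xmltodict.parse("<doc><el>1</el><el>2</el><el1>3</el1><el>4</el></doc>")

el_values = d["doc"]["el"]    # every <el> is collected under one key
el1_value = d["doc"]["el1"]   # <el1> gets its own key
key_order = list(d["doc"])    # keys appear in order of first occurrence

print(el_values, el1_value, key_order)
```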
Moved the data from string to list

Moving the data from string to list causes a significant speed and memory improvement noticed mostly in large XML files. Appending to a list is more efficient than reconstructing a string.

The speed improvement I saw was up to 1000x on XML files weighing 150 MB or more.

opened by bharel 7
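
The general technique behind this change can be sketched in a few lines: collect the text chunks in a list and join them once, instead of rebuilding a string for every chunk. This is an illustration only, not the code from the pull request, and `collect_text` is a hypothetical name.

```python
def collect_text(chunks):
    """Accumulate text chunks in a list and join them once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)  # appending does not copy the earlier chunks
    return "".join(parts)

# Character data often arrives from the parser in several pieces
text = collect_text(["before ", "middle", " after"])
print(text)
```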
Latest version is not x.x.x :-)

Hi, sorry for this issue, but I'm looking to package xmltodict for Debian and I have a problem with the latest tag. Could you change v0.9 to v0.9.0? Thanks!

opened by sbadia 7
something is wrong with html escaping
Code:

import xmltodict

test_xml = """<?xml version="1.0" encoding="utf-8"?>
<a> afafafa – </a>
"""
print(test_xml)
print(xmltodict.unparse(xmltodict.parse(test_xml)))

Output:

<?xml version="1.0" encoding="utf-8"?>
<a> afafafa – </a>

<?xml version="1.0" encoding="utf-8"?>
<a>afafafa –</a>

Basically, the HTML-encoded en dash is unescaped when parsed but is not escaped when unparsed, generating a different result than expected. (The expectation was that the output of unparse would be the same as the original test_xml string.)

P.S. Thank you for the lib, I found it very handy for my needs.
opened by ignis32 0

dict to xml - how to discard the parent tag repeating for conversion from list

Hi everyone.

From this dictionary:

feed = {
    'feed': {
        'reviewer_images': [
            {
                'reviewer_image': {
                    'url': "http://google.com"
                }
            },
            {
                'reviewer_image': {
                    'url': "http://github.com"
                }
            }
        ]
    }
}

I have got:

<?xml version="1.0" encoding="utf-8"?>
<feed>
        <reviewer_images>
                <reviewer_image>
                        <url>http://google.com</url>
                </reviewer_image>
        </reviewer_images>
        <reviewer_images>
                <reviewer_image>
                        <url>http://github.com</url>
                </reviewer_image>
        </reviewer_images>
</feed>

Is it possible to get several reviewer_image elements inside a single reviewer_images tag?

<?xml version="1.0" encoding="utf-8"?>
<feed>
        <reviewer_images>
                <reviewer_image>
                      <url>http://google.com</url>
                </reviewer_image>
                <reviewer_image>
                      <url>http://github.com</url>
                </reviewer_image>
        </reviewer_images>
</feed>

opened by sergei-sss 1
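
One way to get that shape is to move the list one level down, so that it sits under the `reviewer_image` key instead of the `reviewer_images` key. This follows from the rule that a list uses its parent key as the tag for each item. A sketch with the data from this issue:

```python
import xmltodict

feed = {
    "feed": {
        "reviewer_images": {
            # the list now belongs to the tag that should repeat
            "reviewer_image": [
                {"url": "http://google.com"},
                {"url": "http://github.com"},
            ]
        }
    }
}

xml = xmltodict.unparse(feed, full_document=False)
print(xml)
```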

OSError: [WinError 233] No process at the other end of the pipeline

 in exit
    print(msg)
  File "C:\QGB\Anaconda3\lib\site-packages\colorama\ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "C:\QGB\Anaconda3\lib\site-packages\colorama\ansitowin32.py", line 170, in write
    self.write_and_convert(text)
  File "C:\QGB\Anaconda3\lib\site-packages\colorama\ansitowin32.py", line 198, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "C:\QGB\Anaconda3\lib\site-packages\colorama\ansitowin32.py", line 203, in write_plain_text
    self.wrapped.write(text[start:end])
OSError: [WinError 233] No process is on the other end of the pipe.

opened by QGB 0

Leading spaces in values are automatically stripped?

I ran into a problem parsing this file with xmltodict: https://archive.org/download/janus-34-scan-zapman/janus-34-scan-zapman_files.xml

The value of 'original' has its leading space stripped: it should be ' JANUS 34_Scan Zapman_chocr.html.gz', but it is turned into 'JANUS 34_Scan Zapman_chocr.html.gz'

This is probably caused by the commit from this issue: https://github.com/martinblech/xmltodict/issues/15

Given the above commit, it is not clear to me if there is any way to keep spaces inside an element in XML. Is there a way to disable this behaviour?

Here's the relevant part from the file linked above:

<file name=" JANUS 34_Scan Zapman_hocr.html" source="derivative">
<hocr_char_to_word_module_version>1.1.0</hocr_char_to_word_module_version>
<hocr_char_to_word_hocr_version>1.1.15</hocr_char_to_word_hocr_version>
<ocr_parameters>-l fra</ocr_parameters>
<ocr_module_version>0.0.18</ocr_module_version>
<ocr_detected_script>Latin</ocr_detected_script>
<ocr_detected_script_conf>0.4311</ocr_detected_script_conf>
<ocr_detected_lang>fr</ocr_detected_lang>
<ocr_detected_lang_conf>1.0000</ocr_detected_lang_conf>
<format>hOCR</format>
<original> JANUS 34_Scan Zapman_chocr.html.gz</original>
<mtime>1664638619</mtime>
<size>2140105</size>
<md5>1596964e7b6e5aee5e6faedc6d3cb47b</md5>
<crc32>b0c6226b</crc32>
<sha1>07eca05572e97b5abb66fcba4252956ada5f7b10</sha1>
</file>

opened by MerlijnWajer 3
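
A possible workaround is the `strip_whitespace` argument of `parse()`, which is on by default. The sketch below uses only the affected element; on a full document, turning it off may also keep the whitespace-only text between child elements, so treat this as a starting point rather than a complete fix.

```python
import xmltodict

xml = "<original> JANUS 34_Scan Zapman_chocr.html.gz</original>"

stripped = xmltodict.parse(xml)["original"]                      # default behaviour
kept = xmltodict.parse(xml, strip_whitespace=False)["original"]  # whitespace preserved

print(repr(stripped))
print(repr(kept))
```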

Unparse indent not included in 0.13.0 release?

Hello. Thanks a lot for this great module.

I'm using version 0.13.0 from PyPI and it doesn't include support for unparse indent with an integer, from this PR: https://github.com/martinblech/xmltodict/pull/222

I'm not sure what the timeline in the release and the PR merge was but I'm guessing it will come in the next release then?

Just wanted to make sure. Thank you.

def test_unparse_indent():
    """
    pip freeze | grep xmltodict
    xmltodict==0.13.0
    """
    import xmltodict
    import logging
    log = logging.getLogger(__name__)
    d = xmltodict.parse(
    """<a>
        <b>
        <!-- b comment -->
        <c>
            <!-- c comment -->
            1
        </c>
        <d>2</d>
        </b>
    </a>
    """
    )
    log.debug(xmltodict.unparse(d, pretty=True, indent="  "))
    """
    <?xml version="1.0" encoding="utf-8"?>
    <a>
      <b>
        <c>1</c>
        <d>2</d>
      </b>
    </a>
    """
    log.debug(xmltodict.unparse(d, pretty=True, indent=2))
    """TypeError: decoding to str: need a bytes-like object, int found"""

opened by AlbertoV5 0

Owner

Martín Blech

GitHub

