Converts XML to Python objects

untangle

Documentation

  • Converts XML to a Python object.
  • Siblings with similar names are grouped into a list.
  • Children can be accessed with parent.child, attributes with element['attribute'].
  • You can call the parse() method with a filename, a URL or an XML string.
  • Substitutes -, . and : with _ in names: <foobar><foo-bar/></foobar> can be accessed with foobar.foo_bar, <foo.bar.baz/> with foo_bar_baz, and <foo:bar><foo:baz/></foo:bar> with foo_bar.foo_baz.
  • Works with Python 2.7, 3.4, 3.5, 3.6, 3.7 and 3.8, and with PyPy.
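
The name substitution rule can be illustrated with a few lines of standard-library Python. This is only a sketch of the rule as described in the list above; the helper name to_attribute_name is hypothetical and is not part of untangle.

```python
import re


def to_attribute_name(tag_name):
    """Replace '-', '.' and ':' with '_' so a tag name can be used as an attribute name."""
    return re.sub(r"[-.:]", "_", tag_name)


for tag in ("foo-bar", "foo.bar.baz", "foo:baz"):
    print(tag, "->", to_attribute_name(tag))
```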

Installation

With pip:

pip install untangle

With conda:

conda install -c conda-forge untangle

Conda feedstock maintained by @htenkanen. Issues and questions about conda-forge packaging or installation can be raised here.

Usage

(See and run examples.py or this blog post: Read XML painlessly for more info)

import untangle
obj = untangle.parse(resource)

resource can be:

  • a URL
  • a filename
  • an XML string

Running the above code and passing this XML:

<?xml version="1.0"?>
<root>
	<child name="child1"/>
</root>

allows it to be navigated from the untangled object like this:

obj.root.child['name'] # u'child1'
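
To show how this style of access can be built on top of xml.sax, here is a minimal standard-library sketch. It is not untangle's implementation: the Node and Builder classes and the parse_string function are hypothetical names chosen for illustration, and the sketch handles XML strings only.

```python
import xml.sax


class Node:
    """Hypothetical stand-in for an untangled element (not untangle's own class)."""

    def __init__(self, name=None, attributes=None):
        self._name = name
        self._attributes = dict(attributes or {})
        self._children = []
        self.cdata = ""

    def __getattr__(self, key):
        # Only called when normal lookup fails, i.e. for child tag names.
        if key.startswith("_"):
            raise AttributeError(key)
        matches = [c for c in self._children if c._name == key]
        if not matches:
            raise AttributeError("Unknown key <%s>" % key)
        # A single match is returned directly, several siblings as a list.
        return matches[0] if len(matches) == 1 else matches

    def __getitem__(self, key):
        # element['attribute'] reads an XML attribute.
        return self._attributes[key]


class Builder(xml.sax.ContentHandler):
    """SAX handler that builds a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node()
        self._stack = [self.root]

    def startElement(self, name, attrs):
        safe_name = name.replace("-", "_").replace(".", "_").replace(":", "_")
        node = Node(safe_name, dict(attrs))
        self._stack[-1]._children.append(node)
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        self._stack[-1].cdata += content


def parse_string(xml_text):
    """Parse an XML string and return the synthetic root Node."""
    handler = Builder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.root


obj = parse_string('<?xml version="1.0"?><root><child name="child1"/></root>')
print(obj.root.child["name"])  # child1
```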

Changelog

see CHANGELOG.md

Issues
  • Add parse_raw() to handle large XML strings

    In Windows environments the path length limitation causes a path too long exception to be raised from parse(). To combat this, I've written a parse_raw() function that accepts an XML string and will not attempt to estimate what type of data it was given. This seemed like a better choice over adding a type parameter to parse() because of the several options it could be given: filepath, url, xml string, stream, or some value to indicate that it needs to be figured out (like None).

    This has been confirmed to work on Windows 7 with Python 3.4.3, using a document longer than 300 characters.

    opened by turt2live 14
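
The idea behind the parse_raw() proposal above, always treating the input as XML text and never guessing, can be sketched with the standard library. The signature used here (XML text plus a SAX handler) is an assumption made for illustration and may differ from what the pull request implements; TagCounter is a toy handler.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Toy handler: counts start tags so the example has something to show."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def parse_raw(xml_text, handler):
    """Hypothetical helper: always treat the input as XML text.

    No os.path.exists() or URL check is performed, so the length of the
    string never reaches a filesystem call.
    """
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


long_xml = "<root>" + "<item/>" * 10000 + "</root>"
print(parse_raw(long_xml, TagCounter()).count)  # 10001
```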
  • can't dir() the parsed object

    I'm not sure if this is a SAX problem. If it is, just close it. But here's the problem:

    >>> filename = 'posts-july-2012.xml'
    >>> from untangle import parse
    >>> o = parse(filename)
    >>> dir(o)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/untangle.py", line 66, in __getattr__
        raise IndexError('Unknown key <%s>' % key)
    IndexError: Unknown key <__dir__>
    
    opened by iffy 7
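
The traceback above shows __getattr__ raising IndexError for __dir__. One way an element class could cooperate with dir() is to raise AttributeError for unknown names and to define __dir__ explicitly. The class below is a hypothetical sketch, not untangle's code.

```python
class Element(object):
    """Hypothetical element class, not untangle's implementation."""

    def __init__(self, name, children=()):
        self._name = name
        self._children = list(children)

    def __getattr__(self, key):
        # Look through the children for a matching tag name.
        for child in self.__dict__.get("_children", []):
            if child._name == key:
                return child
        # AttributeError is the signal that attribute-lookup machinery expects.
        raise AttributeError("Unknown key <%s>" % key)

    def __dir__(self):
        # Expose child tag names so dir() is useful for exploration.
        return sorted(child._name for child in self._children)


root = Element("root", [Element("post"), Element("author")])
print(dir(root))  # ['author', 'post']
```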
  • Implemented the Python attribute protocol

    The behavior regarding attributes was unexpected.

    Now these work:

    hasattr(element, 'aname')
    getattr(element, 'aname', [])
    
    opened by apalala 7
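
In Python 3, the built-in hasattr() and the three-argument form of getattr() treat only AttributeError as "attribute missing"; any other exception propagates to the caller. The two toy classes below (hypothetical, not taken from untangle) show the difference that the exception type makes.

```python
class IndexErrorNode:
    """Raises IndexError for unknown attributes."""

    def __getattr__(self, key):
        raise IndexError("Unknown key <%s>" % key)


class AttributeErrorNode:
    """Raises AttributeError for unknown attributes."""

    def __getattr__(self, key):
        raise AttributeError("Unknown key <%s>" % key)


node = AttributeErrorNode()
print(hasattr(node, "aname"))      # False
print(getattr(node, "aname", []))  # []

try:
    hasattr(IndexErrorNode(), "aname")
except IndexError as exc:
    # hasattr() does not swallow IndexError, so it reaches the caller.
    print("hasattr() let the IndexError escape:", exc)
```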
  • accessing a tag's name

    I wanted to get the name of a tag, and I could do it by writing tag._name. The underscore suggests that it's a private variable, but getting a tag's name is a common task, so there should be a public way to do it. tag.name seemed like the intuitive solution, but it didn't work.

    opened by jabbalaci 7
  • pypi version out of date

    The version of untangle in pypi is not up to date with this repo.

    opened by mikeshultz 6
  • How do I access deeply nested children?

    (This issue was initially in a comment for another issue, and I thought it might be best to start a new issue instead of hijacking an old one. Apologies for the confusion.)

    I have an XML file with deeply nested elements, and they are all under higher level elements with the same name. An example:

    <MeasurementRecords attrib="something">
        <HistoryRecords>
            <ValueItemId>100_0000100004_3788_Resource-0.customId_WSx Data Precip Type</ValueItemId>
            <List>
                <HistoryRecord>
                    <Value>60</Value>
                    <State>Valid</State>
                    <TimeStamp>2016-04-20T12:40:00Z</TimeStamp>
                </HistoryRecord>
            </List>
        </HistoryRecords>
        <HistoryRecords>
            <ValueItemId>100_0000100004_3788_Resource-0.customId_Specific Enthalpy (INS)</ValueItemId>
            <List>
                <HistoryRecord>
                    <Value>33</Value>
                    <State>Valid</State>
                    <TimeStamp>2016-04-20T12:40:00Z</TimeStamp>
                </HistoryRecord>
            </List>
        </HistoryRecords>
    

    How do I access the <Value> of the Specific Enthalpy element? From other examples I assume I should loop through all the HistoryRecords elements. But when I do that, it appears the children are NOT in the object. My attempt so far:

    for HistoryRecord in RSPobj.MeasurementRecords.HistoryRecords:
        if HistoryRecord.ValueItemId.cdata == "100_0000100004_3788_Resource-0.customId_Specific Enthalpy (INS)":
            pprint(HistoryRecord.ValueItemId)
    

    Gives me:

    $ python parseRSPXMLfiles.py
    Element(name = ValueItemId, attributes = {}, cdata = 100_0000100004_3788_Resource-0.customId_Specific Enthalpy (INS))
    

    Where are all the children?

    I was expecting to be able to do something like this:

    pprint(HistoryRecord.ValueItemId.List.HistoryRecord.Value)
    

    But that gives me this error:

    Traceback (most recent call last):
      File "parseRSPXMLfiles.py", line 17, in <module>
        pprint(HistoryRecord.ValueItemId.List.HistoryRecord.Value)
      File "/usr/lib/python2.7/site-packages/untangle.py", line 66, in __getattr__
        raise IndexError('Unknown key <%s>' % key)
    IndexError: Unknown key <List>
    

    FYI, this:

    pprint(dir(HistoryRecord.ValueItemId))
    

    Results in [] being printed.

    opened by jakehawkes 5
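
In the XML quoted in the question above, <List> is a sibling of <ValueItemId>, not a child of it, which would explain the "Unknown key <List>" error. The relationship is shown below with the standard library's xml.etree.ElementTree rather than untangle; the IDs are shortened for readability and find_value is a hypothetical helper. With untangle, the analogous path would presumably start from the loop variable itself (HistoryRecord.List.HistoryRecord.Value.cdata), though that is an assumption based on the access rules in the documentation above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the structure from the report; the IDs are abbreviated.
XML = """
<MeasurementRecords attrib="something">
    <HistoryRecords>
        <ValueItemId>Precip Type</ValueItemId>
        <List>
            <HistoryRecord>
                <Value>60</Value>
            </HistoryRecord>
        </List>
    </HistoryRecords>
    <HistoryRecords>
        <ValueItemId>Specific Enthalpy (INS)</ValueItemId>
        <List>
            <HistoryRecord>
                <Value>33</Value>
            </HistoryRecord>
        </List>
    </HistoryRecords>
</MeasurementRecords>
"""


def find_value(xml_text, item_id):
    """Return the first <Value> text under the HistoryRecords block with this ID."""
    root = ET.fromstring(xml_text)
    for records in root.findall("HistoryRecords"):
        # ValueItemId holds only text; List is its sibling, so read it from
        # the enclosing HistoryRecords element.
        if records.findtext("ValueItemId") == item_id:
            return records.findtext("List/HistoryRecord/Value")
    return None


print(find_value(XML, "Specific Enthalpy (INS)"))  # 33
```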
  • Make untangle installable from conda-forge?

    Hi @stchris!

    Thanks for creating this very useful package! I use it myself in here: https://github.com/HTenkanen/transx2gtfs

    I was just creating a conda-forge recipe for my library, and realized that untangle is not currently available from conda.

    Would you be okay with the idea of adding untangle to conda-forge? conda-forge is strict in requiring that all dependencies of a library also come from conda-forge (this is how the reliability of the whole system is ensured). Hence, any library that depends on untangle (such as mine) cannot easily be published on conda-forge.

    If you are open to this idea, I am happy to help with writing a conda-forge recipe for untangle. The process does not require any modifications to the current repo; it is done by forking the "conda-forge/staged-recipes" repo.

    More information about the whole conda-forge process can be found here: https://conda-forge.org/

    opened by HTenkanen 5
  • UnicodeEncodeError

    Hi! Cool project. I was looking for something like this. I came across a bug.

    I have some xml with unicode chars:

    <?xml version="1.0" encoding="UTF-8"?>
    <page>
        <menu>
        <name>Привет мир</name>
        <items>
            <item>
                <name>Пункт 1</name>
                <url>http://example1.com</url>
            </item>
            <item>
                <name>Пункт 2</name>
                <url>http://example2.com</url>
            </item>
        </items>
        </menu>
    </page>
    
    
    >>> obj = untangle.parse("1.xml")
    >>> obj.page.menu.name
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 46-51: ordinal not in range(128)
    
    opened by un1t 5
  • Using in class

    I may be missing something, but following your example of using untangle does not work inside a class. For example:

    import untangle

    class Foo:
        def load(self):
            # Parse the file when load() is called
            xmldoc = untangle.parse('test.xml')
    

    I have tried assigning it to a class variable but was unable to. Any help would be appreciated.

    opened by excellentingenuity 5
  • Optional xml.sax features

    I'm using untangle to parse XML files and it's great. However, I'm operating offline, and by default xml.sax tries to load external entities such as DTDs. Loading external entities is a controllable parser "feature."

    This PR adds the ability to pass xml.sax parser features as extra arguments to parse(), so for example

    untangle.parse(my_xml, feature_external_ges=False)
    

    becomes

    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    

    parse() raises AttributeError if a nonexistent feature is requested.

    opened by ransford 5
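
The mapping described in the pull request above, from keyword arguments to xml.sax parser features, can be sketched as follows. make_configured_parser is a hypothetical helper written for illustration; it is not the code from the pull request.

```python
import xml.sax
import xml.sax.handler


def make_configured_parser(**features):
    """Hypothetical helper: map keyword arguments to xml.sax parser features."""
    parser = xml.sax.make_parser()
    for name, enabled in features.items():
        # getattr raises AttributeError for names xml.sax.handler does not define.
        feature = getattr(xml.sax.handler, name)
        parser.setFeature(feature, enabled)
    return parser


parser = make_configured_parser(feature_external_ges=False)
print(parser.getFeature(xml.sax.handler.feature_external_ges))  # False
```

Looking the name up with getattr() means a misspelled feature fails early with AttributeError, which matches the behaviour the pull request describes for parse().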
  • Updating pip & Suggestion for better doc in parse method

    I just read the documentation for the parse method and thought this suggestion would make it more readable.

    opened by domi877 1
  • Workaround Path/XML string too long on Windows OS

    Provides a workaround for #45 (and #52?), where passing a long XML string to os.path.exists() fails on Windows.

    For reference, the string must be at least 2^15 (32768) characters long, at least on Windows 10, to trigger the issue!

    opened by Liam-Deacon 1
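
One possible shape for such a workaround is to decide what the input is before any filesystem call is made. The sketch below uses a hypothetical classify_source helper and a simple "starts with <" heuristic; it is an assumption about how the problem could be avoided, not the change made in the pull request.

```python
import os


def classify_source(source):
    """Hypothetical dispatcher: decide how to treat the argument to a parse() call."""
    text = source.lstrip()
    if text.startswith("<"):
        # Looks like markup: never hand it to the filesystem.
        return "xml"
    if text.startswith(("http://", "https://")):
        return "url"
    try:
        if os.path.exists(source):
            return "file"
    except (OSError, ValueError):
        # Some platforms reject over-long or otherwise invalid paths outright.
        pass
    return "unknown"


print(classify_source("<root>" + "<a/>" * 20000 + "</root>"))  # xml
print(classify_source("https://example.com/feed.xml"))         # url
```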
  • Fix simple typo: addded -> added

    Closes #70

    opened by timgates42 1
  • Fix simple typo: addded -> added

    There is a small typo in CHANGELOG.md. Should read added rather than addded.

    opened by timgates42 0
  • Untangle fails on windows with xml string

    When an XML string is passed on Windows, untangle raises an error saying that the XML string is too long to be a file location, and the Python run crashes.

    opened by ethanrucinski 1
  • Support Element truthiness on Python 3

    Python 3 changed __nonzero__ to __bool__ (https://portingguide.readthedocs.io/en/latest/core-obj-misc.html#customizing-truthiness-bool). This change makes __bool__ the standard implementation on the class, for Python 3, and provides a __nonzero__ alias for Python 2 compatibility.

    Before this change, all Element instances were considered false on Python 3.

    opened by davidjb 2
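
The alias pattern described in the pull request above looks roughly like this. The Element class here is a hypothetical stand-in, and the rule that "an element without a tag name is falsy" is an assumption chosen only to give the example something to test.

```python
class Element(object):
    """Hypothetical element; 'empty' here simply means it has no tag name."""

    def __init__(self, name=None):
        self._name = name

    def __bool__(self):
        # Python 3 calls __bool__ for truth tests.
        return self._name is not None

    # Python 2 looks for __nonzero__ instead, so alias it.
    __nonzero__ = __bool__


print(bool(Element("child")))  # True
print(bool(Element()))         # False
```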
  • TypeError: 'Element' object does not support item assignment

    I am able to read the XML, but I wanted to change an attribute value and append the result as a new object. This raises an item assignment error.

    Sample data:

    <TBASettings>
     <Tenant urlTag="nmdcdemo" id="9001" tbaStatus="true" nonValidatedExtensions=".jpg"/>
    </TBASettings>
    

    I wanted to change it and append a new object, like this:

    <TBASettings>
     <Tenant urlTag="nmdcdemo" id="9001" tbaStatus="true" nonValidatedExtensions=".jpg"/>
     <Tenant urlTag="nmdcdemo2" id="9002" tbaStatus="true" nonValidatedExtensions=".jpg"/>
    </TBASettings>
    
    unclear 
    opened by thehayat 1
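
The error above indicates that the parsed Element does not support item assignment. One alternative for this kind of edit-and-append task is the standard library's xml.etree.ElementTree, sketched below with the sample data from the report. This swaps in a different library and is not a statement about what untangle can or cannot be made to do.

```python
import copy
import xml.etree.ElementTree as ET

SAMPLE = """<TBASettings>
 <Tenant urlTag="nmdcdemo" id="9001" tbaStatus="true" nonValidatedExtensions=".jpg"/>
</TBASettings>"""

root = ET.fromstring(SAMPLE)

# Copy the existing Tenant, change the attributes that differ, and append it.
new_tenant = copy.deepcopy(root.find("Tenant"))
new_tenant.set("urlTag", "nmdcdemo2")
new_tenant.set("id", "9002")
root.append(new_tenant)

print(ET.tostring(root, encoding="unicode"))
```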
  • Basic example doesn't work

    Using the latest version of untangle as of today:

    import untangle
    obj = untangle.parse(b)
    
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "...\eb-virt\lib\site-packages\untangle.py", line 180, in parse
        parser.parse(filename)
      File "...\python36\Lib\xml\sax\expatreader.py", line 105, in parse
        source = saxutils.prepare_input_source(source)
      File "...\python36\Lib\xml\sax\saxutils.py", line 348, in prepare_input_source
        if isinstance(f.read(0), str):
    TypeError: 'NoneType' object is not callable
    
    windows 
    opened by Jay-Dai 3
  • Unexpected results when comparing two untangle.Element objects with the == operator

    Recently I had to check whether an item in a list of untangle.Element objects was inside a second list, using the in operator. It turns out the result was very misleading.

    Note: I know I shouldn't initialise an untangle.Element object like this (in my real-life code I am using untangle.parse('file.xml'), which creates a list of various untangle.Element objects), but this is the smallest syntactically correct code I could supply for this illustration.

    import untangle
    a = untangle.Element('a', '1')
    b = untangle.Element('b', '1')
    listA = [a, b]
    c = untangle.Element('c', '1')
    print(c in listA)
    

    This prints True, but should print False as it does in:

    a = object()
    b = object()
    listA = [a, b]
    c = object()
    print(c in listA)
    

    So, since the in operator uses == to compare items, I thought it could be a problem with how == is being implemented in the Element class, which I think I confirmed by running:

    import untangle
    a = untangle.Element('a', '1')
    b = untangle.Element('b', '1')
    print(a == b)
    

    This prints out True, but should print False, as this other code does:

    a = object()
    b = object()
    print(a == b)
    

    Using Python 3.6.8 and untangle 1.1.1. @adijbr contributed to these tests and bug report. Thanks!

    bug 
    opened by amorimlb 2
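
One mechanism that would produce the result reported above is an __eq__ that compares only the element's character data with the other operand. Whether untangle 1.1.1 does exactly this is an assumption, not something the report confirms. The two hypothetical classes below reproduce the symptom and show a stricter comparison.

```python
class LooseElement:
    """Hypothetical class whose __eq__ compares only cdata with the other operand."""

    def __init__(self, name, cdata=""):
        self._name = name
        self.cdata = cdata

    def __eq__(self, other):
        # '' == other falls back to other.__eq__(''), which compares two
        # empty cdata strings and therefore returns True.
        return self.cdata == other


class StrictElement(LooseElement):
    """Same data, but equality requires matching type, name and cdata."""

    def __eq__(self, other):
        if not isinstance(other, StrictElement):
            return NotImplemented
        return (self._name, self.cdata) == (other._name, other.cdata)


print(LooseElement("a") == LooseElement("b"))    # True
print(StrictElement("a") == StrictElement("b"))  # False
print(StrictElement("c") in [StrictElement("a"), StrictElement("b")])  # False
```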
  • xml.sax package is vulnerable to XML External Entities (XXE) injection

    I'd like to mention that the xml.sax package is vulnerable to XML External Entities (XXE) injection.

    feature_external_ges is True by default ("true: Include all external general (text) entities"), which means the parser is vulnerable to XXE injection.

    A possible solution would be to set it to False at untangle.py#L185, for example:

    from xml.sax import make_parser, SAXException
    from xml.sax.handler import feature_external_ges
    
    parser = make_parser()
    parser.setFeature(feature_external_ges, False)
    

    Please check:

    • https://docs.python.org/3/library/xml.html#xml-vulnerabilities
    • https://pypi.org/project/defusedxml/#python-xml-libraries
    • https://www.owasp.org/index.php/Top_10-2017_A4-XML_External_Entities_(XXE)
    • https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet
    bug 
    opened by morenopc 1
Owner
Christian Stefanescu
Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all majo

null 935 Oct 15, 2021
A jquery-like library for python

pyquery: a jquery-like library for python pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jq

Gael Pasgrimaud 2k Oct 22, 2021
The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

null 2k Oct 26, 2021
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

The Pallets Projects 427 Oct 11, 2021
Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™ This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When us

Python Software Foundation 12.2k Oct 23, 2021
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

null 1.8k Oct 22, 2021
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 463 Oct 16, 2021
The awesome document factory

The Awesome Document Factory WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous s

Kozea 4.6k Oct 22, 2021
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Mozilla 2.2k Oct 20, 2021