A Python library that simplifies the extraction of datasets from XML content.

James Spurin

Last update: Dec 30, 2022

Related tags

Documentation xmldataset

Overview

xmldataset: simple xml parsing 🗃️

Documentation: https://xmldataset.readthedocs.io

XML is a simple markup format. Although the format is simple, extracting the data of interest is often more complicated than it needs to be.

xmldataset addresses this through an easy-to-use plaintext declaration that follows the structure of the XML document. The declaration is indented to match the XML structure, and the data we are interested in is tagged against a dataset.

Features

Handles missing data in the XML structure: if a value is missing from the XML, it is not populated in the dataset
Handles both XML elements and attributes using the plaintext collection schema (attributes are depicted as a sublevel of an element)
Easy to rename XML attributes/elements during processing to meet your requirements
Inline manipulation of XML content through the process mechanism
Dispatch mechanism: datasets can be dispatched after every N instances, allowing asynchronous processing

Comments

A tool for showing the tree structure would be nice.

As the title says, a method for showing the tree would be nice, since you could use it to create blank profiles. This would save effort, because you would not need to copy and paste the XML string to get a blank profile string ready.

It could look like this:

xml = """<?xml version="1.0"?>
<colleagues>
    <colleague>
        <title>The Boss</title>
        <phone>+1 202-663-9108</phone>
        <email>boss@the_company.com</email>
    </colleague>
    <colleague>
        <title>Admin Assistant</title>
        <phone>+1 347-999-5454</phone>
        <email>admin@the_company.com</email>
    </colleague>
    <colleague>
        <title>Minion</title>
        <phone>+1 792-123-4109</phone>
        <email>minion@the_company.com</email>
    </colleague>
</colleagues>"""

show_tree(xml) =>

"""
colleagues
    colleague
        title
        phone
        email"""

This could also open up possibilities for automation. Imagine offering a default profile when no profile is passed in, built by taking this tree and appending 'dataset:parent' to every child. Just an idea.
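The idea above can be sketched with the standard library's ElementTree. This is only an illustration, not code from xmldataset: the name `show_tree` comes from the request, while the `tag_leaves` option, the `indent` parameter and the shortened `sample` document are assumptions made for this example. Attributes and XML namespaces are not handled.

```python
import xml.etree.ElementTree as ET


def show_tree(xml_text, tag_leaves=False, indent="    "):
    """Return the element structure of xml_text as an indented outline."""
    lines = []

    def walk(elements, depth, parent_tag):
        # Group same-named siblings so repeated records appear only once.
        grouped = {}
        for element in elements:
            grouped.setdefault(element.tag, []).append(element)
        for tag, same_tag in grouped.items():
            children = [child for element in same_tag for child in element]
            line = indent * depth + tag
            # Optionally tag leaves against a dataset named after the parent.
            if tag_leaves and not children and parent_tag is not None:
                line += " = dataset:" + parent_tag
            lines.append(line)
            walk(children, depth + 1, tag)

    walk([ET.fromstring(xml_text)], 0, None)
    return "\n".join(lines)


# Shortened, hypothetical variant of the colleagues document
sample = """<?xml version="1.0"?>
<colleagues>
    <colleague><title>The Boss</title><email>boss@the_company.com</email></colleague>
    <colleague><title>Minion</title><phone>+1 792-123-4109</phone></colleague>
</colleagues>"""

print(show_tree(sample))                   # blank outline
print(show_tree(sample, tag_leaves=True))  # leaves tagged as "dataset:<parent>"
```

Merging same-named siblings means a child that appears in only some records (here `phone` and `email`) still shows up in the outline.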

enhancement help wanted

opened by KleinerNull 9

Added a function profile_gen to generate profile formatted output from any given xml. Fixes #5.
This is a fix for issue #5.

Note: This is an initial version. I also wrote a test function. The code works and the test passes. Following the agile pattern of making it work and then making it better, I have made it work and pass the tests. I am sure there is a lot of room for improvement, or even a total redesign. I am open to suggestions and advice.

λ python -m xmldataset

Parse using profile output :

{'title_and_author': [{'title': "XML Developer's Guide", 'author': 'Gambardella, Matthew'}, {'title': 'Midnight Rain', 'author': 'Ralls, Kim'}, {'title': 'Maeve Ascendant', 'author': 'Corets, Eva'}, {'title': "Oberon's Legacy", 'author': 'Corets, Eva'}, {'title': 'The Sundered Grail', 'author': 'Corets, Eva'}, {'title': 'Lover Birds', 'author': 'Randall, Cynthia'}, {'title': 'Splish Splash', 'author': 'Thurman, Paula'}, {'title': 'Creepy Crawlies', 'author': 'Knorr, Stefan'}, {'title': 'Paradox Lost', 'author': 'Kress, Peter'}, {'title': 'Microsoft .NET: The Programming Bible', 'author': "O'Brien, Tim"}, {'title': 'MSXML3: A Comprehensive Guide', 'author': "O'Brien, Tim"}, {'title': 'Visual Studio 7: A Comprehensive Guide', 'author': 'Galos, Mike'}]}

Profile Gen output :

catalog lowest specificbefore specificvalue book optionalexternal externaldata author title genre price publish_date description specificafter specificvalue
opened by tinkerbotfoo 7

Read Value and Attribute

I'm trying to figure out how to read the following XML (abridged):

xml = """
<document>
    <amount currency="USD">123.00</amount>
</document>
"""

I'm trying to get the amount and the currency into the same dataset but I am not successful. This is what I have tried:

import xmldataset

profile = """
document
    amount = dataset:records
        currency = dataset:records
"""

xmldataset.parse_using_profile(xml, profile)
>>> {'records': [{'amount': '123.00'}]}

With the approach above the currency attribute is ignored. I tried working around this as follows:

profile = """
document
    amount
        currency = dataset:records
    amount = dataset:records
"""

xmldataset.parse_using_profile(xml, profile)
>>> {'records': [{'currency': 'USD', 'amount': '123.00'}, {'currency': 'USD', 'amount': '123.00'}]}

Unfortunately this doesn't work either: Now I get duplicate records. In my actual code the records differ slightly (I have a more complex profile).

Is there a way to get both the attribute of an element and the element's value into a dataset?
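For comparison, the record being asked for can be built directly with the standard library's ElementTree. This is a different technique from an xmldataset profile and does not answer whether a profile can express it; the `records` layout simply mirrors the output shown above.

```python
import xml.etree.ElementTree as ET

xml = """
<document>
    <amount currency="USD">123.00</amount>
</document>
"""

root = ET.fromstring(xml)

# One record per <amount>: the element text plus its currency attribute
records = [
    {"amount": amount.text, "currency": amount.get("currency")}
    for amount in root.iter("amount")
]

print({"records": records})
```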
opened by href 4

How to get multiple values from external value ?

Consider the following profile. I need to get both the probe name and the probe id from the ancestor level into the data. However, only one value (the probe id in this case) is propagated; the probe name is ignored. Is this a known limitation, or am I getting the profile syntax wrong?

profile= """
prtg
    sensortree
        nodes
            group
                probenode
                    name = external_dataset:probenode,name:probenode_name
                    id = external_dataset:probenode,name:probenode_id
                    device    
                        name    = dataset:nodes
                        tags    = dataset:nodes
                        host    = dataset:nodes
                        active  = dataset:nodes,name:active_status
                        __EXTERNAL_VALUE__ = probenode:probenode_name:nodes
                        __EXTERNAL_VALUE__ = probenode:probenode_id:nodes
"""

enhancement help wanted

opened by tinkerbotfoo 4

Reading output into a pandas dataframe
The output can be loaded into pandas with the from_records method. An example of this in the documentation would be useful.

result = xmldataset.parse_using_profile(xml, profile)
df = pd.DataFrame.from_records(result['...'])
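A slightly fuller sketch of the same idea follows. To keep it independent of any particular XML file, `result` is a hand-written stand-in with the same shape as the parse output shown earlier on this page (a dataset name mapped to a list of dictionaries); in practice it would come from `parse_using_profile`.

```python
import pandas as pd

# Stand-in for the dictionary returned by xmldataset.parse_using_profile
result = {
    "title_and_author": [
        {"title": "XML Developer's Guide", "author": "Gambardella, Matthew"},
        {"title": "Midnight Rain", "author": "Ralls, Kim"},
    ]
}

# Each dataset is a list of dictionaries, so it maps onto one DataFrame
df = pd.DataFrame.from_records(result["title_and_author"])
print(df)
```

Each dataset name in the result would become its own DataFrame, with the record keys as columns.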
enhancement

opened by keluc 4

Bug or intended?

I tested differently structured XML trees and found something unexpected. The code is in this notebook.

My problem is in cell 5.

Does that have something to do with XML itself and the underlying ElementTree implementation? So, is it a bug or intended?
question

opened by KleinerNull 3

Duplicate results

When I try to read XMLs containing text nodes with attributes, the library outputs the same list twice.

For example, if I slightly modify your sample code like this:

import xmldataset
import pprint

ppsetup = pprint.PrettyPrinter(indent=4)
pp = ppsetup.pprint

xml = """<?xml version="1.0"?>
<colleagues>
    <colleague title="The Boss">John Smith</colleague>
    <colleague title="Admin Assistant">Jane Doe</colleague>
    <colleague title="Minion">Anne Other</colleague>
</colleagues>"""

And I try to parse the content:

profile = """
colleagues
    colleague
        __EXTERNAL_VALUE__ = colleagues:colleague:employees
        title = dataset:employees
    colleague = external_dataset:colleagues
        """

pp(xmldataset.parse_using_profile(xml, profile))

The result is

{   'employees': [   {'colleague': 'John Smith', 'title': 'The Boss'},
                     {'colleague': 'Jane Doe', 'title': 'Admin Assistant'},
                     {'colleague': 'Anne Other', 'title': 'Minion'},
                     {'colleague': 'John Smith', 'title': 'The Boss'},
                     {'colleague': 'Jane Doe', 'title': 'Admin Assistant'},
                     {'colleague': 'Anne Other', 'title': 'Minion'}]}

As you can see, each colleague is repeated twice. Is it a bug, or am I doing something wrong in the declaration of the profile or in the parsing?
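Until the cause is known, the repeated records could be filtered out after parsing. The helper below is a hypothetical workaround, not part of xmldataset, and it would also drop records that are legitimately identical.

```python
def drop_duplicate_records(records):
    """Return records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        # A sorted tuple of items is hashable and ignores key order
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# The duplicated output reported above
employees = [
    {"colleague": "John Smith", "title": "The Boss"},
    {"colleague": "Jane Doe", "title": "Admin Assistant"},
    {"colleague": "Anne Other", "title": "Minion"},
    {"colleague": "John Smith", "title": "The Boss"},
    {"colleague": "Jane Doe", "title": "Admin Assistant"},
    {"colleague": "Anne Other", "title": "Minion"},
]

print(drop_duplicate_records(employees))
```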

I'm working on Ubuntu 16.04 using Python 3.5.2.

opened by basaldella 2

Added example 12 to docs to demonstrate using multiple external values. fixes #9

I added this example, based on the test_external_data_before function, so that someone reading the docs can understand this functionality. Please feel free to edit or modify it as you see fit.

enhancement

opened by tinkerbotfoo 2

Identifier for unique datasets

This would be an interesting idea. I have prepared a demo notebook with a nasty XML response from Amazon to parse ;)

As you can see, the dataset fulfillment_order is unique, but it is still saved in a list with just one element. It would be nice to be able to mark a dataset as unique with an identifier, so that it is stored as a plain dictionary instead of in a list.

UNIQUE_DATASET is my proposal for the name of such an identifier.

Alternatively, an option to flatten all one-element lists in the output would also be interesting.
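The flattening alternative can be approximated today as a post-processing step. The function below is a hypothetical helper, not an xmldataset feature, and the sample data is invented apart from the `fulfillment_order` dataset name.

```python
def flatten_single_item_datasets(result):
    """Replace every one-element dataset list with its single record."""
    flattened = {}
    for name, records in result.items():
        if isinstance(records, list) and len(records) == 1:
            flattened[name] = records[0]
        else:
            flattened[name] = records
    return flattened


# Invented example data in the shape of a parse result
parsed = {
    "fulfillment_order": [{"status": "COMPLETE"}],
    "items": [{"sku": "A"}, {"sku": "B"}],
}

print(flatten_single_item_datasets(parsed))
```

One drawback compared with an explicit identifier is that the output type then depends on the data: a dataset that happens to hold one record becomes a dictionary, while the same dataset with two records stays a list.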
enhancement help wanted

opened by KleinerNull 2

Needs refactoring

I was bored and looked into your code. I don't think it is a good idea to put all of the code in __init__.py.

I'd suggest splitting the _XMLDataset and parse_using_profile code into separate submodules, which would make the code easier to extend and change in the future.

Once you have committed your current changes, I could refactor it for you; I just don't want to cause merge problems now.

opened by KleinerNull 1

Generate lists when multiple children for the same parent

In the following XML test script, I would like to have multiple phone numbers for each colleague. How would I go about getting the phone numbers for each colleague as a list in the extracted Python dictionary?

import xmldataset
import pprint

# Setup Pretty Printing
ppsetup = pprint.PrettyPrinter(indent=4)
pp = ppsetup.pprint


xml = """<?xml version="1.0"?>
<colleagues>
    <colleague>
        <title>The Boss</title>
        <phones>
            <phone>+1 202-663-9108</phone>
            <phone>+1 202-663-9107</phone>
        </phones>
        <email>boss@the_company.com</email>
    </colleague>
    <colleague>
        <title>Admin Assistant</title>
        <phones>
            <phone>+1 347-999-5454</phone>
            <phone>+1 347-999-5455</phone>
        </phones>
        <email>admin@the_company.com</email>
    </colleague>
    <colleague>
        <title>Minion</title>
        <phones>
            <phone>+1 792-123-4109</phone>
            <phone>+1 792-123-4110</phone>
        </phones>
        <email>minion@the_company.com</email>
    </colleague>
</colleagues>"""


# xmldataset declaration
profile = """
colleagues
    colleague
        title = dataset:colleagues
        email = dataset:colleagues
        phones
            phone = dataset:colleagues

        """

# Print the output
pp(xmldataset.parse_using_profile(xml, profile))

current output:

                      {   'email': 'boss@the_company.com',
                          'phone': '+1 202-663-9107',
                          'title': 'Admin Assistant'},
                      {'phone': '+1 347-999-5454'},
                      {   'email': 'admin@the_company.com',
                          'phone': '+1 347-999-5455',
                          'title': 'Minion'},
                      {'phone': '+1 792-123-4109'},
                      {   'email': 'minion@the_company.com',
                          'phone': '+1 792-123-4110'}]}

expected output:

{   'colleagues': [   {   'title': 'The Boss',
                          'email': 'boss@the_company.com',
                          'phone': ['+1 202-663-9107', '+1 202-663-9108']},
                      {   'title': 'Admin Assistant',
                          'email': 'admin@the_company.com',
                          'phone': ['+1 347-999-5455', '+1 347-999-5454']},
                      {   'title': 'Minion',
                          'email': 'minion@the_company.com',
                          'phone': ['+1 792-123-4110', '+1 792-123-4109']}]}
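The expected shape can be produced with the standard library's ElementTree. This is a different technique from an xmldataset profile and is shown only to make the target structure concrete; the function name and the shortened sample document are assumptions for this sketch. Phone numbers are returned in document order.

```python
import xml.etree.ElementTree as ET


def colleagues_with_phone_lists(xml_text):
    """Build one record per <colleague>, collecting all phones into a list."""
    root = ET.fromstring(xml_text)
    records = []
    for colleague in root.iter("colleague"):
        records.append({
            "title": colleague.findtext("title"),
            "email": colleague.findtext("email"),
            "phone": [phone.text for phone in colleague.findall("phones/phone")],
        })
    return {"colleagues": records}


# Shortened version of the document from the question
sample_xml = """<?xml version="1.0"?>
<colleagues>
    <colleague>
        <title>The Boss</title>
        <phones>
            <phone>+1 202-663-9108</phone>
            <phone>+1 202-663-9107</phone>
        </phones>
        <email>boss@the_company.com</email>
    </colleague>
</colleagues>"""

print(colleagues_with_phone_lists(sample_xml))
```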

opened by nishtala 0

Owner

James Spurin

Cloud Engineer / Software Developer | Kubernetes (CKA / CKAD) | Technical Author (Dive Into Ansible) | DevOps | Automation

GitHub https://xmldataset.readthedocs.io
