A Python library that simplifies the extraction of datasets from XML content.

Overview

xmldataset: simple xml parsing 🗃️

XML is a simple markup format. Whilst simple, extracting data of interest is often more complicated than it needs to be.

xmldataset addresses this with an easy-to-use plaintext declaration that follows the structure of the XML document. The declaration is indented to match the XML structure, and each piece of data of interest is tagged against a dataset.
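
As a rough illustration of such a declaration, the fragment below mirrors the profile syntax that appears in the issue reports further down this page (`name = dataset:<dataset name>`), applied to a `colleagues` document whose `colleague` elements each hold `title`, `phone` and `email` children. It is a sketch assembled from those reports rather than an excerpt from the official documentation.

```text
colleagues
    colleague
        title = dataset:colleagues
        phone = dataset:colleagues
        email = dataset:colleagues
```

In the reports on this page, a declaration like this is passed together with the XML string to `xmldataset.parse_using_profile(xml, profile)`.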

Features

  • Handles missing data in the XML structure: if a value is missing from the XML, it is not populated in the dataset
  • Handles both XML Elements and Attributes using the plaintext collection schema (attributes are depicted as a sublevel of an element)
  • Easy to rename XML attributes/elements during processing to meet your requirements
  • Inline manipulation of XML content through the process mechanism
  • Dispatch mechanism that allows datasets to be dispatched after every N instances, enabling asynchronous processing
Comments
  • A tool for showing the tree structure would be nice.

    As the title says, a method that shows the tree would be nice, so you could use it to create blank profiles. This would save effort, since you would not need to copy and paste the XML string to get a blank profile string ready.

    It could look like this:

    xml = """<?xml version="1.0"?>
    <colleagues>
        <colleague>
            <title>The Boss</title>
            <phone>+1 202-663-9108</phone>
            <email>boss@the_company.com</email>
        </colleague>
        <colleague>
            <title>Admin Assistant</title>
            <phone>+1 347-999-5454</phone>
            <email>admin@the_company.com</email>
        </colleague>
        <colleague>
            <title>Minion</title>
            <phone>+1 792-123-4109</phone>
            <email>minion@the_company.com</email>
        </colleague>
    </colleagues>"""
    
    show_tree(xml) =>
    
    """
    colleagues
        colleague
            title
            phone
            email"""
    

    This could also open up possibilities for automation. Imagine offering a default profile when no profile is passed in, built by taking this tree and appending 'dataset:parent' to every child. Just an idea.

    enhancement help wanted 
    opened by KleinerNull 9
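
The proposed helper can be sketched with the standard library alone. The version below is an assumption about what `show_tree` might look like, not the function that was later added to xmldataset: it merges repeated siblings so each element name is listed once, ignores attributes, and leaves namespaced tags in ElementTree's `{uri}tag` form.

```python
import xml.etree.ElementTree as ET


def _merge(element, node):
    """Record the child tags of element in node, merging repeated siblings."""
    for child in element:
        _merge(child, node.setdefault(child.tag, {}))


def show_tree(xml_text, indent="    "):
    """Return the element names of xml_text as an indented outline (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    structure = {}
    _merge(root, structure)

    lines = [root.tag]

    def render(node, depth):
        for tag, children in node.items():
            lines.append(indent * depth + tag)
            render(children, depth + 1)

    render(structure, 1)
    return "\n".join(lines)


xml = """<colleagues>
    <colleague>
        <title>The Boss</title>
        <phone>+1 202-663-9108</phone>
    </colleague>
    <colleague>
        <title>Minion</title>
        <email>minion@the_company.com</email>
    </colleague>
</colleagues>"""

outline = show_tree(xml)
print(outline)
```

Merging siblings means an element that appears in only some records (here `email`) still shows up in the outline, which matters if the outline is used as the starting point for a profile.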
  • Added a function profile_gen to generate profile formatted output from any given xml. Fixes #5.

    This is a fix for issue #5.

    Note: This is an initial version. I also wrote a test function. The code works and the test passes, following the agile pattern of making it work first and then making it better. I am sure there is a lot of room for improvement, or even a total redesign. I am open to suggestions and advice.

    λ python -m xmldataset

    Parse using profile output:

    {'title_and_author': [{'title': "XML Developer's Guide", 'author': 'Gambardella, Matthew'}, {'title': 'Midnight Rain', 'author': 'Ralls, Kim'}, {'title': 'Maeve Ascendant', 'author': 'Corets, Eva'}, {'title': "Oberon's Legacy", 'author': 'Corets, Eva'}, {'title': 'The Sundered Grail', 'author': 'Corets, Eva'}, {'title': 'Lover Birds', 'author': 'Randall, Cynthia'}, {'title': 'Splish Splash', 'author': 'Thurman, Paula'}, {'title': 'Creepy Crawlies', 'author': 'Knorr, Stefan'}, {'title': 'Paradox Lost', 'author': 'Kress, Peter'}, {'title': 'Microsoft .NET: The Programming Bible', 'author': "O'Brien, Tim"}, {'title': 'MSXML3: A Comprehensive Guide', 'author': "O'Brien, Tim"}, {'title': 'Visual Studio 7: A Comprehensive Guide', 'author': 'Galos, Mike'}]}

    Profile Gen output :

    catalog
            lowest
                    specificbefore
                            specificvalue
                    book
                            optionalexternal
                                    externaldata
                            author
                            title
                            genre
                            price
                            publish_date
                            description
                    specificafter
                            specificvalue
    
    opened by tinkerbotfoo 7
  • Read Value and Attribute

    I'm trying to figure out how to read the following xml (abridged):

    xml = """
    <document>
         <amount currency="USD">123.00</amount>
    </document>
    """
    

    I'm trying to get the amount and the currency into the same dataset but I am not successful. This is what I have tried:

    import xmldataset
    
    profile = """
    document
        amount = dataset:records
            currency = dataset:records
    """
    
    xmldataset.parse_using_profile(xml, profile)
    >>> {'records': [{'amount': '123.00'}]}
    

    With the approach above the currency attribute is ignored. I tried working around this as follows:

    profile = """
    document
        amount
            currency = dataset:records
        amount = dataset:records
    """
    
    xmldataset.parse_using_profile(xml, profile)
    >>> {'records': [{'currency': 'USD', 'amount': '123.00'}, {'currency': 'USD', 'amount': '123.00'}]}
    

    Unfortunately this doesn't work either: Now I get duplicate records. In my actual code the records differ slightly (I have a more complex profile).

    Is there a way get both the attribute of an element and the elements value into a dataset?

    opened by href 4
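
For comparison, here is how the same record can be built directly with the standard library's `xml.etree.ElementTree`. This is a fallback sketch outside xmldataset; it does not show whether or how a profile can express the same thing.

```python
import xml.etree.ElementTree as ET

xml = """
<document>
    <amount currency="USD">123.00</amount>
</document>
"""

root = ET.fromstring(xml.strip())

# One record per <amount>, combining the element text with its attribute.
records = [
    {"amount": amount.text, "currency": amount.get("currency")}
    for amount in root.iter("amount")
]
print(records)
```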
  • How to get multiple values from external value ?

    Consider the following profile. I need to get both the probe name and the probe id from the ancestor level into the data. However, only one value (the probe id in this case) gets propagated; the probe name is ignored. Is this a known limitation, or am I getting the profile syntax wrong?

    profile= """
    prtg
        sensortree
            nodes
                group
                    probenode
                        name = external_dataset:probenode,name:probenode_name
                        id = external_dataset:probenode,name:probenode_id
                        device    
                            name    = dataset:nodes
                            tags    = dataset:nodes
                            host    = dataset:nodes
                            active  = dataset:nodes,name:active_status
                            __EXTERNAL_VALUE__ = probenode:probenode_name:nodes
                            __EXTERNAL_VALUE__ = probenode:probenode_id:nodes
    """    
    
    enhancement help wanted 
    opened by tinkerbotfoo 4
  • Reading output into a pandas dataframe

    The output can be loaded into pandas with the from_records method. An example like this in the documentation would be useful.

    import pandas as pd
    import xmldataset

    result = xmldataset.parse_using_profile(xml, profile)
    df = pd.DataFrame.from_records(result['...'])
    
    enhancement 
    opened by keluc 4
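
A small self-contained sketch of that idea follows. The `result` dictionary is hand-written sample data in the shape shown elsewhere on this page (a dataset name mapped to a list of records), not real library output, and the second record deliberately lacks a `phone` key to show how a value that was never populated surfaces in the frame.

```python
import pandas as pd

# Hand-written stand-in for the parser output: dataset name -> list of records.
result = {
    "colleagues": [
        {"title": "The Boss", "phone": "+1 202-663-9108", "email": "boss@the_company.com"},
        {"title": "Minion", "email": "minion@the_company.com"},
    ]
}

# Each record becomes a row; keys absent from a record become NaN.
df = pd.DataFrame.from_records(result["colleagues"])
print(df)
```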
  • Bug or intended?

    I've tested differently structured XML trees and found something unexpected. The code is in this notebook.

    My problem is in cell 5.

    Does this have something to do with XML itself and the underlying ElementTree implementation? So, bug or intended?

    question 
    opened by KleinerNull 3
  • Duplicate results

    When I try to read XMLs containing text nodes with attributes, the library outputs the same list twice.

    For example, if I slightly modify your sample code like this:

    import xmldataset
    import pprint
    
    ppsetup = pprint.PrettyPrinter(indent=4)
    pp = ppsetup.pprint
    
    xml = """<?xml version="1.0"?>
    <colleagues>
        <colleague title="The Boss">John Smith</colleague>
        <colleague title="Admin Assistant">Jane Doe</colleague>
        <colleague title="Minion">Anne Other</colleague>
    </colleagues>"""
    

    And I try to parse the content:

    profile = """
    colleagues
        colleague
            __EXTERNAL_VALUE__ = colleagues:colleague:employees
            title = dataset:employees
        colleague = external_dataset:colleagues
            """
    
    pp(xmldataset.parse_using_profile(xml, profile))
    

    The result is

    {   'employees': [   {'colleague': 'John Smith', 'title': 'The Boss'},
                         {'colleague': 'Jane Doe', 'title': 'Admin Assistant'},
                         {'colleague': 'Anne Other', 'title': 'Minion'},
                         {'colleague': 'John Smith', 'title': 'The Boss'},
                         {'colleague': 'Jane Doe', 'title': 'Admin Assistant'},
                         {'colleague': 'Anne Other', 'title': 'Minion'}]}
    

    As you can see, each colleague is repeated twice. Is it a bug, or am I doing something wrong in the declaration of the profile or in the parsing?

    I'm working on Ubuntu 16.04 using Python 3.5.2.

    opened by basaldella 2
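
Until the cause of the repetition is understood, one possible workaround is to drop repeated records after parsing. The helper below is a hypothetical post-processing step, not part of xmldataset; it assumes every value in a record is hashable (plain strings, as in the output above) and keeps the first occurrence of each record.

```python
def drop_duplicate_records(records):
    """Return records with exact repeats removed, preserving first-seen order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # order-independent fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# The duplicated output reported above.
result = {
    "employees": [
        {"colleague": "John Smith", "title": "The Boss"},
        {"colleague": "Jane Doe", "title": "Admin Assistant"},
        {"colleague": "Anne Other", "title": "Minion"},
        {"colleague": "John Smith", "title": "The Boss"},
        {"colleague": "Jane Doe", "title": "Admin Assistant"},
        {"colleague": "Anne Other", "title": "Minion"},
    ]
}

deduped = {name: drop_duplicate_records(rows) for name, rows in result.items()}
print(deduped)
```

Note that this also collapses records that are genuinely repeated in the XML, so it only suits data where identical records are never meaningful.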
  • Added example 12 to docs to demonstrate using multiple external values. fixes #9

    I added this example, based on the test_external_data_before function, so that someone reading the docs can understand this functionality. Please feel free to edit or modify it as you see fit.

    enhancement 
    opened by tinkerbotfoo 2
  • Identifier for unique datasets

    This would be an interesting idea. I have prepared a demo notebook with a nasty XML response from Amazon to parse ;)

    As you can see, the dataset fulfillment_order is unique, but it is still saved in a list with just one element. It would be nice to be able to mark a dataset as unique with an identifier, so that it is stored as a normal dictionary instead of a list.

    UNIQUE_DATASET is my proposal for the name of such an identifier.

    Alternatively, an option to flatten all one-element lists in the output would also be interesting.

    enhancement help wanted 
    opened by KleinerNull 2
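
Both variants of this request can be approximated as a post-processing step. The function below is a hypothetical sketch, not an xmldataset feature: called with `unique`, it unwraps only the named datasets (the identifier idea); called without it, it unwraps every one-element list (the flattening idea). The sample `result` is invented for illustration.

```python
def unwrap_single_datasets(result, unique=None):
    """Replace one-record datasets with the record itself.

    unique: optional collection of dataset names to unwrap; when omitted,
    every dataset holding exactly one record is unwrapped.
    """
    flattened = {}
    for name, records in result.items():
        wanted = unique is None or name in unique
        if wanted and len(records) == 1:
            flattened[name] = records[0]
        else:
            flattened[name] = records
    return flattened


# Invented sample data in the usual "dataset name -> list of records" shape.
result = {
    "fulfillment_order": [{"status": "COMPLETE"}],
    "items": [{"sku": "A"}, {"sku": "B"}],
    "shipments": [{"id": "1"}],
}

only_named = unwrap_single_datasets(result, unique={"fulfillment_order"})
everything = unwrap_single_datasets(result)
print(only_named)
print(everything)
```

Naming the unique datasets explicitly keeps the output shape stable; unwrapping every one-element list makes the shape depend on how many records a given document happens to contain.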
  • Needs refactoring

    I was bored and looked into your code. I think it is not a good idea to put all of the code in __init__.py.

    I'd suggest splitting the _XMLDataset and parse_using_profile code into separate submodules, so the code is easier to extend and change in the future.

    Once you have committed your current changes I would refactor it for you; I just don't want to cause merge conflicts now.

    opened by KleinerNull 1
  • Generate lists when multiple children for the same parent

    For the following XML test script, I would like to have multiple phone numbers for each colleague. How would I go about getting the phone numbers for each colleague as a list in the extracted Python dictionary?

    import xmldataset
    import pprint
    
    # Setup Pretty Printing
    ppsetup = pprint.PrettyPrinter(indent=4)
    pp = ppsetup.pprint
    
    
    xml = """<?xml version="1.0"?>
    <colleagues>
        <colleague>
            <title>The Boss</title>
            <phones>
                <phone>+1 202-663-9108</phone>
                <phone>+1 202-663-9107</phone>
            </phones>
            <email>boss@the_company.com</email>
        </colleague>
        <colleague>
            <title>Admin Assistant</title>
            <phones>
                <phone>+1 347-999-5454</phone>
                <phone>+1 347-999-5455</phone>
            </phones>
            <email>admin@the_company.com</email>
        </colleague>
        <colleague>
            <title>Minion</title>
            <phones>
                <phone>+1 792-123-4109</phone>
                <phone>+1 792-123-4110</phone>
            </phones>
            <email>minion@the_company.com</email>
        </colleague>
    </colleagues>"""
    
    
    # xmldataset declaration
    profile = """
    colleagues
        colleague
            title = dataset:colleagues
            email = dataset:colleagues
            phones
                phone = dataset:colleagues
    
            """
    
    # Print the output
    pp(xmldataset.parse_using_profile(xml, profile))
    
    

    current output:

                          {   'email': 'boss@the_company.com',
                              'phone': '+1 202-663-9107',
                              'title': 'Admin Assistant'},
                          {'phone': '+1 347-999-5454'},
                          {   'email': 'admin@the_company.com',
                              'phone': '+1 347-999-5455',
                              'title': 'Minion'},
                          {'phone': '+1 792-123-4109'},
                          {   'email': 'minion@the_company.com',
                              'phone': '+1 792-123-4110'}]}

    expected output:

    {   'colleagues': [   {   'title': 'The Boss',
                              'email': 'boss@the_company.com',
                              'phone': ['+1 202-663-9107', '+1 202-663-9108']},
                          {   'title': 'Admin Assistant',
                              'email': 'admin@the_company.com',
                              'phone': ['+1 347-999-5455', '+1 347-999-5454']},
                          {   'title': 'Minion',
                              'email': 'minion@the_company.com',
                              'phone': ['+1 792-123-4110', '+1 792-123-4109']}]}

    opened by nishtala 0
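
The nested shape asked for here can be produced directly with the standard library, as sketched below for one colleague from the sample document. This is a workaround outside xmldataset, offered only to make the target structure concrete; phone numbers come out in document order.

```python
import xml.etree.ElementTree as ET

xml = """<colleagues>
    <colleague>
        <title>The Boss</title>
        <phones>
            <phone>+1 202-663-9108</phone>
            <phone>+1 202-663-9107</phone>
        </phones>
        <email>boss@the_company.com</email>
    </colleague>
</colleagues>"""

root = ET.fromstring(xml)

# One record per <colleague>; all nested <phone> values are gathered into a list.
colleagues = [
    {
        "title": colleague.findtext("title"),
        "email": colleague.findtext("email"),
        "phone": [phone.text for phone in colleague.findall("phones/phone")],
    }
    for colleague in root.findall("colleague")
]
print({"colleagues": colleagues})
```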
Owner
James Spurin
Cloud Engineer / Software Developer | Kubernetes (CKA / CKAD) | Technical Author (Dive Into Ansible) | DevOps | Automation