Python reader for Linked Data in HDF5 files


h5ld: HDF5 Linked Data

Linked Data are becoming more popular for user-created metadata in HDF5 files. This Python package provides readers for HDF5-based formats that carry such metadata. The entire linked data content is read in one operation and made available as an rdflib graph object.

Currently supported:

Installation

pip install git+https://github.com/HDFGroup/h5ld@{LABEL}

where {LABEL} is either master or a tag label.

Requirements:

  • Python >= 3.7
  • h5py >= 3.3.0
  • rdflib >= 5.0.0

License

This software is open source. See the LICENSE file in the repository for details.

Quick Start

This package can be used either as a command-line tool or programmatically. On the command line, the package dumps the linked data of an input HDF5 file into several popular RDF formats supported by the rdflib package. For example:

python -m h5ld -f json-ld -o output.json INPUT.h5

will dump the input file's RDF data to the file output.json in JSON-LD format. Omitting the output file prints the same content to standard output so it can be piped into another command-line tool (sketched below). A full description of the options is available from:

python -m h5ld --help
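
For instance, here is a minimal sketch of capturing or piping that standard output (assuming the Turtle serializer is available; the output file name and the grep pattern are purely illustrative):

# redirect the dumped RDF to a file
python -m h5ld -f turtle INPUT.h5 > metadata.ttl
# or pipe it straight into another command-line tool
python -m h5ld -f turtle INPUT.h5 | grep "Dataset"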

There is also a programmatic interface for integration into Python applications. Each h5ld reader provides the following methods and attributes (a combined sketch follows the list):

  • File format name.

    print(f"Input file format is: {reader.name}")
  • Short name (usually an acronym) of the file format.

    print(f"File format acronym: {reader.short_name}")
  • Check if the reader is the right choice for the input file.

    with h5py.File("input.h5", mode="r") as f:
        if reader.verify_format(f):
            pass  # Do something...
        else:
            print("Sorry but not the right h5ld reader.")
  • Check if there is linked data content in the input HDF5 file. Optionally, print an appropriate description of the data.

    with h5py.File("input.h5", mode="r") as f:
        reader.check_ld(f, report=True)
  • Read linked data and export it to a destination in the requested RDF format.

    with h5py.File("input.h5", mode="r") as f:
        reader(f).dump_ld("output.json", format="json-ld")
  • Read linked data and return either an rdflib.Graph or rdflib.ConjunctiveGraph object.

    with h5py.File("input.h5", mode="r") as f:
        graph = reader(f).get_ld()
  • A Python dictionary with the reader's namespace prefixes and their IRIs.

    with h5py.File("input.h5", mode="r") as f:
        rdr = reader(f)
        namespaces = rdr.namespaces
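
Putting these pieces together, here is a minimal end-to-end sketch. The reader class name below is a placeholder (the concrete class to import depends on the file format you are working with); the method calls follow the interface described above:

import h5py

# Placeholder: substitute the h5ld reader class for your file format
from h5ld import SomeReader as reader

with h5py.File("input.h5", mode="r") as f:
    # Make sure this reader understands the input file
    if not reader.verify_format(f):
        raise SystemExit("Sorry but not the right h5ld reader.")

    # Optionally report the linked data content found in the file
    reader.check_ld(f, report=True)

    # Read everything into an rdflib graph object...
    graph = reader(f).get_ld()
    print(f"{reader.short_name}: {len(graph)} triples read")

    # ...or export it directly in the requested RDF format
    reader(f).dump_ld("output.json", format="json-ld")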