Python Package for DataHerb: create, search, and load datasets.

DataHerb

Last update: Feb 11, 2022

Related tags

Data Analysis python data data-mining database dataset data-analysis

Overview

The Python Package for DataHerb

A DataHerb Core Service to Create and Load Datasets.

Install

pip install dataherb

Documentation: dataherb.github.io/dataherb-python

The DataHerb Command-Line Tool

Requires Python 3

The DataHerb CLI provides tools to create dataset metadata, validate metadata, search for datasets in a flora, and download datasets.

Search and Download

Search by keyword

dataherb search covid19
# Shows the minimal metadata

Search by dataherb id

dataherb search -i covid19_eu_data
# Shows the full metadata

Download dataset by dataherb id

dataherb download covid19_eu_data
# Downloads this dataset: http://dataherb.io/flora/covid19_eu_data
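
If you want to drive the same commands from a script, a minimal sketch could look like the following. It only assembles the argument lists shown above (`dataherb search <keyword>`, `dataherb search -i <id>`, `dataherb download <id>`). The helper names `search_command` and `download_command` are hypothetical and not part of the dataherb package.

```python
import shlex
from typing import List, Optional


def search_command(keyword: Optional[str] = None,
                   herb_id: Optional[str] = None) -> List[str]:
    """Build the argument list for `dataherb search` (hypothetical helper)."""
    # Exactly one of keyword / herb_id must be given.
    if (keyword is None) == (herb_id is None):
        raise ValueError("pass exactly one of keyword or herb_id")
    if herb_id is not None:
        # Search by dataherb id, as in `dataherb search -i covid19_eu_data`.
        return ["dataherb", "search", "-i", herb_id]
    # Search by keyword, as in `dataherb search covid19`.
    return ["dataherb", "search", keyword]


def download_command(herb_id: str) -> List[str]:
    """Build the argument list for `dataherb download` (hypothetical helper)."""
    return ["dataherb", "download", herb_id]


if __name__ == "__main__":
    commands = [
        search_command(keyword="covid19"),
        search_command(herb_id="covid19_eu_data"),
        download_command("covid19_eu_data"),
    ]
    for cmd in commands:
        # Print the shell form; pass `cmd` to subprocess.run(cmd, check=True)
        # to execute it once dataherb is installed.
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting issues if an id or keyword ever contains spaces.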

Create Dataset Using Command Line Tool

We provide a template for dataset creation.

Within a dataset folder where the data files are located, use the following command line tool to create the metadata template.

dataherb create

Upload dataset to remote

Within the dataset folder, run

dataherb upload

UI for all the datasets in a flora

dataherb serve

Use DataHerb in Your Code

Load Data into DataFrame

# Load the packages
import pandas as pd
from dataherb.flora import Flora

# Initialize Flora service
# The Flora service holds all the dataset metadata
use_flora = "path/to/my/flora.json"
dataherb = Flora(flora=use_flora)

# Search datasets with keyword(s)
geo_datasets = dataherb.search("geo")
print(geo_datasets)

# Get a specific file from a dataset and load as DataFrame
tz_df = pd.read_csv(
  dataherb.herb(
      "geonames_timezone"
  ).get_resource(
      "dataset/geonames_timezone.csv"
  )
)
print(tz_df)

The DataHerb Project

What is DataHerb

DataHerb is an open-source data discovery and management tool.

A DataHerb or Herb is a dataset. A dataset comes with the data files, and the metadata of the data files.
A Herb Resource or Resource is a data file in the DataHerb.
A Flora is the combination of all the DataHerbs.
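
To make the relationship between these three terms concrete, here is a toy model of them. This is only an illustration under assumptions: the classes `Resource`, `Herb`, and `ToyFlora` and their fields are hypothetical and are not the classes in the dataherb package, and the search is a plain substring match rather than whatever matching the package uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """A single data file inside a herb (hypothetical model)."""
    path: str


@dataclass
class Herb:
    """A dataset: metadata plus its data files (hypothetical model)."""
    id: str
    description: str = ""
    resources: List[Resource] = field(default_factory=list)


@dataclass
class ToyFlora:
    """The collection of all herbs, indexed by id (hypothetical model)."""
    herbs: Dict[str, Herb] = field(default_factory=dict)

    def add(self, herb: Herb) -> None:
        # Ids must be unique within a flora.
        if herb.id in self.herbs:
            raise ValueError(f"duplicate herb id: {herb.id!r}")
        self.herbs[herb.id] = herb

    def search(self, keyword: str) -> List[Herb]:
        # Case-insensitive substring match on id and description.
        kw = keyword.lower()
        return [
            h for h in self.herbs.values()
            if kw in h.id.lower() or kw in h.description.lower()
        ]


def example_flora() -> ToyFlora:
    """Build a small flora using ids that appear in this README."""
    flora = ToyFlora()
    flora.add(Herb(
        id="geonames_timezone",
        description="Timezone data",
        resources=[Resource("dataset/geonames_timezone.csv")],
    ))
    flora.add(Herb(id="covid19_eu_data", description="COVID-19 data"))
    return flora


if __name__ == "__main__":
    for herb in example_flora().search("geo"):
        print(herb.id, [r.path for r in herb.resources])
```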

In many data projects, finding the right datasets to enhance your data is one of the most time-consuming parts. DataHerb adds flavor to your data project. By creating metadata and managing the datasets systematically, locating a dataset becomes much easier.

Currently, dataherb supports syncing datasets between local storage and S3 or git. Each dataset can have its own remote location.

What is DataHerb Flora

We designed the following workflow to share and index open datasets.

The repo dataherb-flora is a demo flora that lists some datasets; it is showcased on the website https://dataherb.github.io. At this moment, the whole system is being renovated.

Development

Create a conda environment.
Install requirements: pip install -r requirements.txt

Documentation

The source of the documentation for this package is located at docs.

References and Acknowledgement

dataherb uses datapackage at its core. datapackage is a Python library for the data-package standard. The core schema of a dataset is essentially the data-package standard.

Comments

would you like to take a look at our api?

I came across this repo and found it very similar to our API, though much more mature. https://github.com/Glacier-Ice/data-sci-api

we have problems in creating a standard of dataset collection and API documentation for end-users

is there a way we can collaborate?

opened by Stockard 4
Format search results for better ux

The current search result shows too much information. It would be good to format the result in a way that is easier to read and makes the id easy to find when needed.
enhancement

opened by emptymalei 1
use rapidfuzz instead of fuzzywuzzy

FuzzyWuzzy is GPLv2 licensed, which would force you to license the whole project under GPLv2. I had the same problem on one of my projects, so I wrote rapidfuzz, which implements the same algorithm but is based on a version of fuzzywuzzy that was MIT licensed and is therefore MIT licensed as well, so it can be used here without forcing a license change. As a nice bonus, it is fully implemented in C++ and comes with a few algorithmic improvements that make it faster than FuzzyWuzzy.

opened by maxbachmann 1
Use One File for Each Herb in Flora
Is it better to have one file for each herb in flora?

Situation

Currently, the flora is defined in a single json file.

It becomes hard to read. This does not fit the human-readable principle.

It becomes hard to manage. We are currently storing everything in one big file. When we have a problem, the whole flora becomes unusable.

Solution

Use separate files for herbs.

Simply Copy dataherb.json

Copy dataherb.json to workdir/{id}/dataherb.json or {id}.json will work.

Using folders allows us to put in more files. For example, we can take the datapackage content out to make it more manageable.

Build the flora from all these files.

[x] Implement this new structure.

Ready for a Demo repo of flora

In this way, we can put up a repo for open datasets easily and allow users to add more easily.

Possible creation process

Create package directly on GitHub by uploading the dataherb.json file.

But there should be a validation process to avoid duplicate id.

[ ] Setup a demo repo as demo flora.
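
The per-herb layout and the duplicate-id check discussed in this issue can be sketched roughly as follows. This is an illustration under assumptions, not the package's implementation: it assumes the `workdir/{id}/dataherb.json` layout mentioned above and assumes each `dataherb.json` carries a top-level `"id"` key. The function name `build_flora` is hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def build_flora(workdir: Path) -> Dict[str, dict]:
    """Collect every <workdir>/<folder>/dataherb.json into one index.

    Raises ValueError if two metadata files declare the same id.
    """
    flora: Dict[str, dict] = {}
    # sorted() keeps the result deterministic across platforms.
    for meta_file in sorted(workdir.glob("*/dataherb.json")):
        with meta_file.open(encoding="utf-8") as f:
            meta = json.load(f)
        herb_id = meta["id"]  # assumed metadata key
        if herb_id in flora:
            raise ValueError(f"duplicate herb id {herb_id!r} in {meta_file}")
        flora[herb_id] = meta
    return flora


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for herb_id in ("covid19_eu_data", "geonames_timezone"):
            (root / herb_id).mkdir()
            (root / herb_id / "dataherb.json").write_text(
                json.dumps({"id": herb_id}), encoding="utf-8"
            )
        print(sorted(build_flora(root)))
```

Because each herb lives in its own file, a malformed file affects only that herb's entry and can be reported by path, instead of making a single large flora file unreadable.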

enhancement
opened by emptymalei 0
Overhaul: New Core Management, Local Indexing Webpage, Flexible Flora Database
This is a completely new era of Dataherb.

New Stuff

Supporting S3 as source

Serve whole flora as webpages with search

User config for flora

Multiple flora on one machine

We also redesigned the core.
opened by emptymalei 0
Add dataset using the URL of a remote repo
We don't only upload datasets; we might also want to load datasets from a remote.

Here we propose to add the option to add datasets using the URL.

Build a Herb from remote data

Option to add metadata only or download everything.

Adding metadata only will only add data to the flora

Thus we cannot find the dataset folder with the corresponding id.

This can be used to decide if a dataset is metadata only or fully downloaded.
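
The folder-based check proposed here could be sketched as below. It assumes that a fully downloaded dataset lives in a folder named after its id under some working directory, and that the flora's known ids are available as a collection of strings. The function name `herb_state` and the three state labels are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def herb_state(flora_ids: Iterable[str], workdir: Path, herb_id: str) -> str:
    """Classify a herb as 'unknown', 'metadata-only', or 'downloaded'."""
    if herb_id not in set(flora_ids):
        # The flora has no metadata for this id at all.
        return "unknown"
    if (workdir / herb_id).is_dir():
        # Metadata exists and a dataset folder with the same id is present.
        return "downloaded"
    # Metadata exists but no dataset folder was found.
    return "metadata-only"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "covid19_eu_data").mkdir()
        ids = ["covid19_eu_data", "geonames_timezone"]
        for herb_id in ids + ["not_in_flora"]:
            print(herb_id, herb_state(ids, workdir, herb_id))
```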
opened by emptymalei 0
Sync Flora Metafolder
Managing flora using command line

Version control of the flora is not really hard. We just get into the folder and use git.

But it would be much easier if we can simply run dataherb sync flora

Approaches:

Invoking command line: ref cookiecutter

enhancement
opened by emptymalei 0

Releases(0.1.6)

0.1.6(Feb 10, 2022)
Fixed

Command line tool dataherb configure -l now only opens the folder.

Command line tool dataherb download now also displays where the dataset is downloaded to. This makes it easier for the user to find the downloaded dataset.

Source code(tar.gz)
Source code(zip)
0.1.5(Aug 12, 2021)

Using Dedicated Folders for Herbs

In previous versions, we could only use a single file to host all the flora metadata. It becomes unmanageable and hard to read as the number of herbs grows. (#14)

In this version, we introduce a new structure for the flora metadata. Each herb is getting its own folder! This structure makes the flora easier to read and manage by hand. It is also better for version-controlling your flora.

(🌱 Best wishes to your herbs in their own pots. )
Source code(tar.gz)
Source code(zip)
0.1.4(Aug 7, 2021)
Added

🎉 Better search result formatting in terminal (See docs for a screenshot.)

📺 Show config using dataherb configure --show

Changed

Better config management. Config has been promoted to a class.

Source code(tar.gz)
Source code(zip)
0.1.3(Aug 7, 2021)
Added

Server to serve flora as a website

Configuration system

Remove herb from flora

Add herb to flora

and more

Source code(tar.gz)
Source code(zip)
0.0.5(Mar 14, 2020)
Now we can use

dataherb validate

to validate the metadata file.
Source code(tar.gz)
Source code(zip)
0.0.3(Feb 23, 2020)

The dataherb command line tool now automatically finds the data files and generates part of the metadata based on the files. CSV files are automatically parsed.
Source code(tar.gz)
Source code(zip)
0.0.2(Feb 16, 2020)

Source code(tar.gz)
Source code(zip)

Owner

DataHerb

Get datasets in a blink of an eye | Experimenting with simple modular small dataset discovery

GitHub https://dataherb.github.io/dataherb-python

