An easy-to-use feature store

ByteHub AI

Last update: Dec 9, 2022

Related tags

Data Analysis data-science machine-learning timeseries pandas data-engineering forecasting machinelearning dask feature-engineering machinelearning-python feature-store featurestore bytehub-cloud

Overview

ByteHub

An easy-to-use feature store.

💾 What is a feature store?

A feature store is a data storage system for data science and machine learning. It can store both raw data and transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The ByteHub Feature Store is designed to:

Be simple to use, with a Pandas-like API;
Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
Be optimised for timeseries operations, making it well suited to applications such as finance, energy, and forecasting; and
Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

🦉 Features

Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
Feature engineering steps can be implemented as transforms. These are saved within the feature store, and allow for simple, reusable preparation of raw data.
Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
Simple APIs to retrieve timeseries dataframes for training, or a dictionary of the most recent feature values, which can be used for inference.
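
The time-travel and latest-value ideas in the list above can be illustrated without ByteHub itself. The following is a minimal pure-Python sketch, not ByteHub's implementation: the record layout and the helper names `as_of` and `latest_values` are assumptions made purely for illustration. Each record carries both the time it describes and the time it was created, so a query can ignore anything created after a chosen cutoff.

```python
from datetime import datetime

# Hypothetical records: (time the value refers to, time it was created, value).
# A forecast for the same target time may be revised as new information arrives.
records = [
    (datetime(2021, 1, 3), datetime(2021, 1, 1), 10.0),  # first forecast for 3 Jan
    (datetime(2021, 1, 3), datetime(2021, 1, 2), 12.0),  # revised forecast for 3 Jan
    (datetime(2021, 1, 4), datetime(2021, 1, 2), 20.0),  # forecast for 4 Jan
]


def as_of(records, cutoff):
    """Return {time: value} using only records created at or before `cutoff`."""
    latest = {}  # time -> (created_time, value)
    for time, created, value in records:
        if created > cutoff:
            continue  # this record did not exist yet at the cutoff
        if time not in latest or created > latest[time][0]:
            latest[time] = (created, value)
    return {time: value for time, (_, value) in latest.items()}


def latest_values(features):
    """Return {feature name: most recent value} from lists of (time, value) pairs."""
    return {
        name: max(series, key=lambda pair: pair[0])[1]
        for name, series in features.items()
        if series
    }
```

Calling `as_of(records, datetime(2021, 1, 1))` sees only the first forecast, while a cutoff one day later picks up the revision. This is the property that makes time travel useful when backtesting forecasts.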

Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

📖 Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

🚀 Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. Namespaces can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)

Create a feature inside this namespace which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.
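
To make the idea of resampling concrete, here is a small standard-library sketch that rebuilds the tutorial series (one integer per day from 2020-01-01 to 2021-02-09) and downsamples it to monthly means. It does not call ByteHub; the `monthly_mean` helper is a hypothetical name used only for this illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Rebuild the tutorial series: one integer per day, both end dates included.
start, end = date(2020, 1, 1), date(2021, 2, 9)
n_days = (end - start).days + 1
series = [(start + timedelta(days=i), i) for i in range(n_days)]


def monthly_mean(series):
    """Downsample (day, value) pairs to {(year, month): mean value}."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in buckets.items()}


monthly = monthly_mean(series)
print(len(series))         # number of daily rows
print(monthly[(2020, 1)])  # mean of the January 2020 values
```

A feature store that resamples on load saves each training script from repeating this kind of grouping logic.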

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # This transform function receives dataframe input, and defines a transform operation
    return df ** 2  # Square the input
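
For readers unfamiliar with decorator-based registration, the sketch below shows one way such a pattern could work. `MiniStore` is a toy class invented for this illustration; it is not ByteHub's implementation, and it operates on plain lists rather than dataframes. The decorator stores the function as the recipe for a derived feature, and the recipe is evaluated from its source features when the derived feature is loaded.

```python
class MiniStore:
    """Toy stand-in for a feature store; NOT ByteHub's implementation."""

    def __init__(self):
        self.raw = {}         # feature name -> list of stored values
        self.transforms = {}  # feature name -> (source feature names, function)

    def save(self, name, values):
        self.raw[name] = list(values)

    def transform(self, name, from_features):
        """Decorator that registers `func` as the recipe for feature `name`."""
        def register(func):
            self.transforms[name] = (list(from_features), func)
            return func
        return register

    def load(self, name):
        """Return stored values, or compute a transform from its sources."""
        if name in self.raw:
            return self.raw[name]
        sources, func = self.transforms[name]
        return func(*(self.load(source) for source in sources))


store = MiniStore()
store.save('tutorial/numbers', range(5))


@store.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(values):
    return [v ** 2 for v in values]


print(store.load('tutorial/squared'))
```

In this sketch only the recipe is stored, so the derived feature always reflects the current raw data. Whether ByteHub evaluates transforms at load time or at save time is not stated here, so treat that detail as an assumption.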

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
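
As a sanity check on the query result, the expected values can be worked out independently with the standard library. This sketch assumes that `from_date` and `to_date` are both inclusive and that the transform squares only the value column; neither is stated above. The `expected_row` helper is a hypothetical name.

```python
from datetime import date

first_day = date(2020, 1, 1)  # first timestamp saved in the tutorial


def expected_row(day):
    """Return (tutorial/numbers, tutorial/squared) expected for a given day."""
    number = (day - first_day).days  # the series counts days since first_day
    return number, number ** 2


query_start, query_end = date(2021, 1, 1), date(2021, 1, 31)
n_rows = (query_end - query_start).days + 1  # assumes inclusive bounds
print(n_rows, expected_row(query_start), expected_row(query_end))
```

Because 2020 is a leap year, the first row of the query should carry the number 366 rather than 365, which is an easy off-by-one to miss when eyeballing the output.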

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

🐾 Roadmap

Tasks to automate updates to features using orchestration tools like Airflow

Comments

Error when using google cloud storage as a backend

Hi, I am trying to get my setup working on Google Cloud. When I try saving the dataframe to Cloud Storage using fs.save_dataframe, I run into an error like the one shown below:

_call non-retriable exception: Disallowed unicode characters present in object name ''tutorial/feature/append-dataframe/partition=npartitions=1

I confirmed that writing a dataframe to Google Cloud Storage works as expected when using .to_parquet and supplying the path gs://{bucketname}/{foldername}.

opened by sharabhshukla 4

Frequently append data or row to feature dataframes indexed on time

Hi,

I have a use case where a time series dataframe needs to be appended with the latest values every 5 minutes. Is there a method or an elegant way of appending a time series value/row to a feature dataframe every few minutes? One way would be to load the entire dataframe from the feature store into memory each time, append a row, and then rewrite the entire dataframe. But I was wondering whether there is a method that just appends the row or otherwise updates the feature in place?

opened by sharabhshukla 2
