flexible time-series processing & feature extraction

PreDiCT.IDLab

Last update: Dec 28, 2022

Related tags

Machine Learning python processing data-science library time-series feature-extraction

Overview

tsflex is a toolkit for flexible time-series processing & feature extraction, making few assumptions about input data.

Useful links

Installation

If you are using pip, just execute the following command:

pip install tsflex

Why tsflex? ✨

- flexible:
  - handles multivariate time series
  - versatile function support
    => integrates natively with many packages for processing (e.g., scipy.signal) & feature extraction (e.g., numpy, scipy.stats)
  - feature extraction handles multiple strides & window sizes
- efficient view-based operations
  => extremely low memory peak & fast execution times (see benchmarks)
- maintains the time index of the data
- makes little to no assumptions about the time-series data
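
To make the "view-based operations" point concrete, here is a minimal NumPy sketch of strided windows that share memory with the original array. It only illustrates the general idea and is not claimed to be tsflex's internal implementation; the array contents and the window and stride sizes are made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = np.arange(10, dtype=float)   # stand-in for the values of one series
window, stride = 4, 2                 # sizes in samples (illustrative)

# All windows of length `window`, then keep every `stride`-th one.
windows = sliding_window_view(values, window)[::stride]

# The result is a view on `values`: no window data is copied.
shares_memory = np.shares_memory(values, windows)

# Any reduction can then run over the last axis.
window_means = windows.mean(axis=-1)
```

Because the windows are views, the memory cost stays close to that of the input array, however much the windows overlap.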

Usage

tsflex is built to be intuitive, so we encourage you to copy-paste this code and toy with some parameters!

Series processing

import pandas as pd; import scipy.signal as ssig; import numpy as np
from tsflex.processing import SeriesProcessor, SeriesPipeline

# 1. -------- Get your time-indexed data --------
# Data contains 3 columns; ["ACC_x", "ACC_y", "ACC_z"]
url = "https://github.com/predict-idlab/tsflex/raw/main/examples/data/empatica/acc.parquet"
data = pd.read_parquet(url).set_index("timestamp")

# 2 -------- Construct your processing pipeline --------
processing_pipe = SeriesPipeline(
    processors=[
        SeriesProcessor(function=np.abs, series_names=["ACC_x", "ACC_y", "ACC_z"]),
        SeriesProcessor(ssig.medfilt, ["ACC_x", "ACC_y", "ACC_z"], kernel_size=5)  # (with kwargs!)
    ]
)
# -- 2.1. Append processing steps to your processing pipeline
processing_pipe.append(SeriesProcessor(ssig.detrend, ["ACC_x", "ACC_y", "ACC_z"]))

# 3 -------- Process the data --------
processing_pipe.process(data=data)

Feature extraction

import pandas as pd; import scipy.stats as ss; import numpy as np
from tsflex.features import FeatureDescriptor, FeatureCollection, NumpyFuncWrapper

# 1. -------- Get your time-indexed data --------
# Data contains 1 column; ["TMP"]
url = "https://github.com/predict-idlab/tsflex/raw/main/examples/data/empatica/tmp.parquet"
data = pd.read_parquet(url).set_index("timestamp")

# 2 -------- Construct your feature collection --------
fc = FeatureCollection(
    feature_descriptors=[
        FeatureDescriptor(
            function=NumpyFuncWrapper(func=ss.skew, output_names="skew"),
            series_name="TMP", 
            window="5min",  # Use 5 minutes 
            stride="2.5min",  # With steps of 2.5 minutes
        )
    ]
)
# -- 2.1. Add features to your feature collection
fc.add(FeatureDescriptor(np.min, "TMP", '2.5min', '2.5min'))

# 3 -------- Calculate features --------
fc.calculate(data=data)
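
To show what a window of "5min" with a stride of "2.5min" means on a time index, the following is a rough pandas-only sketch. The helper `strided_features` is hypothetical and not part of tsflex; labelling each row with the window end is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

def strided_features(series, func, window, stride):
    """Hypothetical helper: apply `func` to time windows of `series`."""
    window, stride = pd.Timedelta(window), pd.Timedelta(stride)
    start, last = series.index[0], series.index[-1]
    rows = {}
    while start + window <= last:
        # Select [start, start + window); label the row with the window end.
        segment = series[(series.index >= start) & (series.index < start + window)]
        rows[start + window] = func(segment.to_numpy())
        start += stride
    return pd.Series(rows, name=series.name)

index = pd.date_range("2022-01-01", periods=16, freq="min")   # 00:00 .. 00:15
tmp = pd.Series(np.arange(16, dtype=float), index=index, name="TMP")
features = strided_features(tmp, np.min, window="5min", stride="2.5min")
```

Each output row covers five minutes of data and consecutive rows start 2.5 minutes apart, so neighbouring windows overlap by half.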

Scikit-learn integration

TODO

👤 Jonas Van Der Donckt, Jeroen Van Der Donckt, Emiel Deprost

Comments

:sparkles: vectorized feature function support
TODOs:

[x] Update docs with new arguments & explain vectorized feature function behavior

[x] Benchmark runtime of vectorized functions

[x] Benchmark memory peak of vectorized functions

[x] Add proper checks, with appropriate error msg, for equally sampled data assumption

[x] Add tests
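
As a rough illustration of what "vectorized" means for a feature function, the sketch below compares one call per window with a single call over a 2D array of windows. It uses plain NumPy with made-up values. Stacking windows into one array requires all windows to have the same length, which is presumably why the equally sampled data check appears in the list above.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
window = 2

# Non-overlapping windows stacked as rows; only possible for equal-length windows.
windows = values.reshape(-1, window)                  # shape (3, 2)

# Non-vectorized: the function is called once per window.
looped = np.array([np.max(w) for w in windows])

# Vectorized: one call handles all windows, reducing over the last axis.
vectorized = np.max(windows, axis=-1)
```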
opened by jvdd 9
:recycle: refactor indexing + :scissors: decouple stride & window + :sparkles: support segment idxs
:recycle: Refactor indexing

:bug: fix bug with vectorized=True for strided rolling

[x] vectorized support for single feature windows

[x] vectorized support for empty feature windows

[x] test the above

:recycle: refactor the strided window segmentation (& indexing)

[x] add include_final_window argument to FeatureCollection .calculate (& StridedRolling)

[x] update docs

[x] test the above

[ ] remove support for TimeSequenceStridedRolling => Decided to not do this in this PR. We will leave this for another PR.

:see_no_evil: undo breaking change from #62 (to make code backwards compatible)

[x] revert the default window_idx argument in FeatureCollection.calculate to "end"

:robot: extend test matrix & update dependencies

[x] Add Python 3.10 to the test matrix

[x] Update dependencies (& remove locked statsmodels dependency - https://github.com/blue-yonder/tsfresh/issues/897)

:scissors: Decouple stride

We (@jonasvdd, @emield12, and @jvdd) believe that the stride should not be tightly coupled to a FeatureDescriptor. Therefore, to make tsflex more flexible (:wink:), we make the stride argument optional for FeatureDescriptor and MultipleFeatureDescriptors and add the option to pass your stride(s) to the FeatureCollection.calculate method.
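
The idea of decoupling can be sketched in plain Python. The `Descriptor` class and `calculate` function below are hypothetical stand-ins, not the tsflex classes, and the rule that a stride passed at call time takes precedence is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Descriptor:
    """Hypothetical stand-in for a feature descriptor: stride is optional."""
    function: Callable[[Sequence[float]], float]
    window: int
    stride: Optional[int] = None

def calculate(values, descriptor, stride=None):
    """Hypothetical calculate: a stride passed here overrides the descriptor's."""
    step = stride if stride is not None else descriptor.stride
    if step is None:
        raise ValueError("stride must be set on the descriptor or passed to calculate")
    starts = range(0, len(values) - descriptor.window + 1, step)
    return [descriptor.function(values[s:s + descriptor.window]) for s in starts]

values = [3, 1, 4, 1, 5, 9, 2, 6]
fd = Descriptor(function=min, window=4)      # no stride at definition time
dense = calculate(values, fd, stride=1)      # one output per sample step
sparse = calculate(values, fd, stride=4)     # non-overlapping windows
```

The same descriptor serves both calls, so one feature definition can be reused with different strides.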

:eyes: externally visible changes

[x] make stride optional in FeatureDescriptor

[x] add stride argument to FeatureCollection.calculate

[x] FeatureDescriptor / FeatureCollection.calculate should accept multiple strides

[x] StridedRolling should accept multiple strides

[x] test the above

[x] remove stride from output column name

[x] update reduce method

[x] update reduce method tests

[x] update the logging to handle multiple strides

[x] extend logging tests

:package: internal changes

[x] change stride -> strides: which is either a list of stride sizes (float or pd.Timedelta) or None (in StridedRolling and StridedRollingFactory)

[x] identify feature descriptors (FD) based on their window - output names

[x] set intersection after StridedRolling search sorted (TODO: this needs to be optimized)

:sparkles: Support setpoints

[x] support setpoints

[x] test the above => Note: we allow setpoints of different timezones, as their np.datetime64 conversion allows comparison.

[ ] trim range if not in data => Decided to not do this (as this is a somewhat ambiguous operation). As using segment indexes is already an advanced operation, it is the user's responsibility to either trim the segment indexes or make their features robust.
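
A minimal sketch of segmenting by user-provided start and end timestamps, using NumPy's `searchsorted` on the datetime64 values. The segment boundaries are made up, and the last segment deliberately runs past the data to show that nothing is trimmed; this is not claimed to be how tsflex implements it.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-01", periods=10, freq="min")
series = pd.Series(np.arange(10, dtype=float), index=index)

# Hypothetical segment boundaries; the last segment ends after the data does.
segment_starts = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 00:04", "2022-01-01 00:08"])
segment_ends = pd.to_datetime(["2022-01-01 00:03", "2022-01-01 00:06", "2022-01-01 00:20"])

# searchsorted maps each timestamp to a position in the sorted index.
timestamps = series.index.to_numpy()
start_pos = np.searchsorted(timestamps, segment_starts.to_numpy(), side="left")
end_pos = np.searchsorted(timestamps, segment_ends.to_numpy(), side="left")

# Each segment covers [start, end); the feature here is a plain sum.
values = series.to_numpy()
segment_sums = [float(values[s:e].sum()) for s, e in zip(start_pos, end_pos)]
```

The third segment only finds two samples, so a feature function used this way has to cope with segments that are shorter than requested.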

:see_no_evil: other stuff

:bug: fix bug with features.logger that does not handle numeric window & strides

[x] improve parsing for window and stride values in _parse_logging_execution_to_df

[x] test the above

:eyes: other minor stuff

[x] support offline data load

[x] warn the user with a RuntimeWarning when the index of the data (passed to FeatureCollection.calculate) is not monotonically increasing

[x] test the above
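
The monotonic-index warning from the last item can be sketched as follows. `check_monotonic_index` is a hypothetical helper and the warning text is made up; only `is_monotonic_increasing` and `warnings.warn` are real pandas and standard-library calls.

```python
import warnings
import pandas as pd

def check_monotonic_index(data):
    """Hypothetical check: warn when the index is not sorted in increasing order."""
    if not data.index.is_monotonic_increasing:
        warnings.warn("The index of the data is not monotonically increasing.", RuntimeWarning)

index = pd.to_datetime(["2022-01-01 00:02", "2022-01-01 00:00", "2022-01-01 00:01"])
unsorted = pd.Series([2.0, 0.0, 1.0], index=index)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_monotonic_index(unsorted)               # warns
    check_monotonic_index(unsorted.sort_index())  # silent after sorting

runtime_warnings = [w for w in caught if issubclass(w.category, RuntimeWarning)]
```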
opened by jvdd 6
[MRG] Remove deprecated closed argument in pd.date_range

The closed argument was set to None which is the default, removing it should thus not have any impact. Since Pandas 1.4.0 this argument has been deprecated in favour of the inclusive argument so you get a lot of warnings when running the code. The default argument to inclusive is "both" which has the same behaviour as the current code. I thus see no need to add it. https://pandas.pydata.org/docs/reference/api/pandas.date_range.html
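
A quick way to check the claim that the default keeps both endpoints is shown below; the dates are arbitrary.

```python
import pandas as pd

start, end = "2022-01-01", "2022-01-04"

# Neither `closed` nor `inclusive` is passed, so the default applies.
days = pd.date_range(start=start, end=end, freq="D")

first_included = days[0] == pd.Timestamp(start)
last_included = days[-1] == pd.Timestamp(end)
```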

opened by jeroenboeye 4
:bug: fix bug with bound_method + :sparkles: new integrations
This PR handles

(1) :bug: a bug in the bound_method + sequence based strided rolling

[x] check if .loc induces memory peak

[x] agree on what behavior is preferred for segmentation indexing

Agreed behavior:

make window_idx="begin" default instead of "end"

sequences should be segmented into n segments if there are exactly n segments possible (e.g., window=2, stride=2 => 5 segments on sequence of length 10)

we keep the current behavior of the end-index of the segments.

[x] extend tests
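
The agreed segment count (window=2, stride=2 gives 5 segments on a sequence of length 10) follows from simple arithmetic. The helpers below are hypothetical and only illustrate that rule.

```python
def n_segments(length, window, stride):
    """Number of full windows that fit in a sequence of `length` samples."""
    if length < window:
        return 0
    return (length - window) // stride + 1

def segment_bounds(length, window, stride):
    """(start, end) sample positions of each full window, end exclusive."""
    return [(i * stride, i * stride + window) for i in range(n_segments(length, window, stride))]

count = n_segments(10, window=2, stride=2)
bounds = segment_bounds(10, window=2, stride=2)
```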

(2) :sparkles: extends integration with other feature extraction packages

[x] add and test catch22 integration wrapper

[x] :x: add and test scikit-learn transformer wrapper

Decided to not do this (for now): we currently have an ambiguous implementation -> more details in #56

(3): :zap: faster irregular data check

(4): :fire: add kaggle TPSAPR2022 notebook to ML examples
opened by jvdd 3
Get features for each line

Hello,

I would like to generate features for each observation of my time series and not only window by window.

Does this possibility exist in tsflex, and do you know how to do it?

Thanks in advance
question

opened by IKetchup 3
:ambulance: Fix windows bug
This PR

Adds Windows & MacOS testing to the matrix

Fix a bug on Windows (PermissionError: [WinError 32]), which occurs when a logging file handler is not properly unregistered and closed

Disable multiprocessing on Windows, see #51
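
The file-handler issue can be sketched with the standard library alone. The logger name and the helper `remove_file_handlers` are hypothetical; the point is that a handler should be both removed from the logger and closed before the log file is deleted or reused.

```python
import logging
import os
import tempfile

def remove_file_handlers(logger):
    """Detach and close every FileHandler so the log file is released."""
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            logger.removeHandler(handler)
            handler.close()

logger = logging.getLogger("example_feature_logger")  # hypothetical logger name
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "features.log")

logger.addHandler(logging.FileHandler(log_path))
logger.warning("feature calculation finished")

remove_file_handlers(logger)
os.remove(log_path)   # on Windows this raises PermissionError if the file is still open
os.rmdir(log_dir)
file_still_exists = os.path.exists(log_path)
```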
opened by jvdd 1
:pushpin: Update dependencies
Changes;

Overall dependencies are updated

The statsmodels dependency is pinned to 0.12.2 to avoid tsfresh errors. See https://github.com/blue-yonder/tsfresh/issues/897
opened by jvdd 1
time-series batch / whole-series feature calculation
Objectives:

Functionality

[ ] convenient way to extract features over the whole, unsegmented data (see also #67)

[ ] Discuss + decide together with @jvdd @mbignotti what option seems best to serve this functionality (regarding end user perspective)

Available options:

introduce a new method to the FeatureCollection (as done here):

advantages ✅

Explicit method definition, less confusion for end-users

disadvantages :x:

A new method is introduced / less uniform interface to perform computation

Perform unsegmented feature computation when the window and/or stride arguments are NOT set.

advantages ✅

more homogenous interface

disadvantages :x:

somewhat more implicit

Code example:

# NOTE: window and stride parameters are omitted.
fc = FeatureCollection(
    FeatureDescriptor(
        function=np.mean,
        series_name="Value",
    )
)
# Uses the whole (unsegmented) series of `data` to
# calculate the features. The method remains the same.
fc.calculate(data=df, return_df=True)

Perform unsegmented feature computation when all windows are set to -1

a combination of (2.) and (3.)

For now, this is done by introducing the calculate_unsegmented method to the FeatureCollection.
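
Independent of which option is chosen, what "unsegmented" computation means can be sketched without tsflex: every function is applied once to all samples of a series. The function mapping and the output naming below are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-06-09", periods=100, freq="min")
df = pd.DataFrame({"Value": np.arange(100, dtype=float)}, index=index)

# Hypothetical mapping of output names to functions.
functions = {"mean": np.mean, "min": np.min, "max": np.max}

# One value per (series, function) pair, computed over all samples at once.
whole_series_features = {
    f"{column}__{name}": float(func(df[column].to_numpy()))
    for column in df.columns
    for name, func in functions.items()
}
```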

Bug fixes

[ ] fix the window_idx="end" and (window-size > data-range) bug
opened by jonasvdd 0
Window and stride arguments are making it harder to use the package. feature_collection.reduce example

First of all, this package is awesome. The community that deals with time series data needed to up its game, and tsflex has everything it takes to be the main library.

However, here are a few specific suggestions:

Remove "windows" and "strides" arguments altogether for feature extraction: It does seem a bit excessive but hear me out. They are good arguments but not fundamental for feature extraction. They could be used in data preparation, Alteryx has a library called "compose" (https://github.com/alteryx/compose) just for the purpose of creating multiple time frame windows. Once the "window" is ready, just select the functions. I propose tsflex main function (feature_collection.calculate) just use time series data and a list of functions for feature extraction, no window or strides.

Explaining further: The way I view the implementation of the essentials would be only this: feature_collection.calculate(time_series_df, functions). If any of the columns of the time series had any data type other than int, float, it could simply raise an error or ignore the column.

Window and stride also make feature_collection.reduce function hard to use: After feature selection and having selected a few columns of the many created using tsflex I use the reduce that gives me the functions for transformation/extraction. The problem is that the naming convention includes window and strides (e.g: Open__mean__w=233500_s=233500) which means I have to have a time series with the same characteristics/size, which often doesn't happen. I use the arguments windows and strides like the following:

simple_feats = MultipleFeatureDescriptors(
    functions=tsfresh_settings_wrapper(settings),
    series_names="Open",
    windows=len(stock_data) - 1,
    strides=len(stock_data) - 1,
)
feature_collection = FeatureCollection(simple_feats)
features_df = feature_collection.calculate(
    stock_full, return_df=True, show_progress=True, approve_sparsity=True
)

I use this because I need to process the whole dataset.
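
For readers who want to work with the column naming mentioned above, a name shaped like Open__mean__w=233500_s=233500 can be split into its parts with a regular expression. This is a sketch based only on that single example; the exact naming may differ between tsflex versions, and `parse_feature_name` is a hypothetical helper.

```python
import re

# Pattern for names shaped like the example "Open__mean__w=233500_s=233500".
FEATURE_NAME = re.compile(
    r"^(?P<series>.+?)__(?P<feature>.+)__w=(?P<window>[^_]+)_s=(?P<stride>.+)$"
)

def parse_feature_name(name):
    """Split a feature column name into its parts; returns None if it does not match."""
    match = FEATURE_NAME.match(name)
    return match.groupdict() if match else None

parts = parse_feature_name("Open__mean__w=233500_s=233500")
```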

Anyway, I hope this is helpful.
enhancement question

opened by arturdaraujo 2

Question: Feature extraction on time series batch

Hello, First of all, I would like to thank you for the really nice library. I think it is much more straight forward and at the same time flexible, compared to similar libraries. I have a use case where sometimes I need to compute features in a rolling fashion, for which the window parameter of the FeatureDescriptor object is really helpful, and some other times I need to compute features on time series batches. That is, the window parameter equals the length of the entire time series. However, I'm having a few issues with the latter case. Here is an example:

import numpy as np
import pandas as pd
from tsflex.features import FeatureDescriptor, FeatureCollection

series = np.random.rand(100)
ts_index = pd.date_range(start="2022-06-09 00:00:00", periods=len(series), freq="min")
df = pd.DataFrame({"Value": series}, index=ts_index)

fc = FeatureCollection(
    FeatureDescriptor(
        function = np.mean,
        series_name="Value",
        window=len(df),
        stride=1
    )
)

fc.calculate(data=df, return_df=True)

If I run the code above, I get the following error (personal info is hidden):

Traceback (most recent call last):
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 394, in calculate
    calculated_feature_list = [self._executor(idx) for idx in idxs]
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 394, in <listcomp>
    calculated_feature_list = [self._executor(idx) for idx in idxs]
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 208, in _executor
    stroll, function = get_stroll_func(idx)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 245, in get_stroll_function
    stroll = StridedRollingFactory.get_segmenter(**stroll_arg_dict)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling_factory.py", line 75, in get_segmenter
    return TimeIndexSampleStridedRolling(data, window, stride, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling.py", line 495, in __init__
    super().__init__(series_list, window, stride, *args, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling.py", line 373, in __init__
    super().__init__(data, window, stride, *args, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/tsflex/features/segmenter/strided_rolling.py", line 147, in __init__
    if np.ptp(container.end_indexes - container.start_indexes) != 0:
  File "<__array_function__ internals>", line 180, in ptp
  File "/****/*****/*****/***/***/***/python3.8/site-packages/numpy/core/fromnumeric.py", line 2667, in ptp
    return _methods._ptp(a, axis=axis, out=out, **kwargs)
  File "/****/*****/*****/***/***/***/python3.8/site-packages/numpy/core/_methods.py", line 278, in _ptp
    umr_maximum(a, axis, None, out, keepdims),
ValueError: zero-size array to reduction operation maximum which has no identity
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/***/*****/****/****/***/***/python3.8/site-packages/tsflex/features/feature_collection.py", line 418, in calculate
    raise RuntimeError(
RuntimeError: Feature Extraction halted due to error while extracting one (or multiple) feature(s)! See stack trace above.

If I specify window=len(df) - 1, it works, but then, of course, the last data point is not used in the calculation.

Am I doing something wrong? Is there a way to achieve the required behaviour?
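
One plausible reading of the error, offered as an assumption rather than a confirmed diagnosis: if a window of n samples on a time index is treated as a time span of n sample periods, then 100 samples at one-minute spacing need a 100-minute span, while the first and last timestamps are only 99 minutes apart. The arithmetic can be checked directly:

```python
import numpy as np
import pandas as pd

series = np.random.rand(100)
ts_index = pd.date_range(start="2022-06-09 00:00:00", periods=len(series), freq="min")

sample_period = ts_index[1] - ts_index[0]     # 1 minute between samples
covered_span = ts_index[-1] - ts_index[0]     # first to last timestamp

# Assumption: a window of n samples is treated as a time span of n * sample_period.
full_window = len(series) * sample_period
shorter_window = (len(series) - 1) * sample_period

full_window_fits = full_window <= covered_span
shorter_window_fits = shorter_window <= covered_span
```

Under that assumption no full window fits, which would leave an empty array of window indexes and match the "zero-size array" message in the traceback.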

Thanks a lot!

Environment: python==3.8.13 numpy==1.22.4 pandas==1.4.2 tsflex==0.2.3.7.7

bug

opened by mbignotti 11

improve `get_processor_logs`
Current version:

new features:

add the columns:

- duration %: can be calculated directly from the duration column, so it does not need to be stored within the logs themselves
- output_names: the output names of the adjusted / newly created series. Can help to improve the function recall?
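
Deriving the percentage from the duration column could look like the sketch below. The log table is made up and its column names are assumptions; only the pandas operations are real.

```python
import pandas as pd

# Hypothetical log table: one row per processing step, with its duration.
logs = pd.DataFrame(
    {
        "function": ["abs", "medfilt", "detrend"],
        "duration": pd.to_timedelta([1.0, 2.0, 1.0], unit="s"),
    }
)

# Share of the total runtime per step, derived from the duration column alone.
logs["duration %"] = (100 * logs["duration"] / logs["duration"].sum()).round(2)
```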

enhancement
opened by jonasvdd 0
Improve jargon / logic of window position
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling

Maybe we can use the same naming convention, or even re-use their underlying logic (i.e., don't reinvent the wheel).
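
For reference, this is how pandas itself names the window position in `resample`: the `label` parameter decides whether a bin is labelled with its left or right edge. Treating this as the counterpart of a begin/end window position is an analogy made here, not something confirmed in this issue.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-01-01 00:00", periods=10, freq="min")
series = pd.Series(np.arange(10, dtype=float), index=index)

# Same 5-minute bins, labelled with the bin start or the bin end.
labelled_at_start = series.resample("5min", label="left").mean()
labelled_at_end = series.resample("5min", label="right").mean()
```

Both results hold the same values; only the timestamps attached to them differ.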
opened by jonasvdd 0

Releases(v0.2.3)

v0.2.3(Nov 16, 2021)

❗ See also: tsflex v0.2.2 which is even more 🔥 than this one

New features

💚 Next to the tsfresh integrations, tsflex's feature extraction now fully integrates with seglearn and tsfel ⬇️

from seglearn.feature_functions import base_features
from tsfel.feature_extraction import get_features_by_domain

from tsflex.features import FeatureCollection, MultipleFeatureDescriptors
from tsflex.features.integrations import seglearn_feature_dict_wrapper, tsfel_feature_dict_wrapper
from tsflex.utils.data import load_empatica_data

# Load sequence-indexed data (in this case a time-index)
df_tmp, df_acc = load_empatica_data(['tmp', 'acc'])

# Construct your feature extraction configuration & extract features
fc = FeatureCollection(
    MultipleFeatureDescriptors(
        functions=[
            *seglearn_feature_dict_wrapper(base_features()),
            *tsfel_feature_dict_wrapper(get_features_by_domain('statistical')),
        ],
        series_names=["TMP", "ACC_x", "ACC_y"],
        windows=["5min", "15min"],
        strides="5min"
    )
)

fc.calculate(data=[df_tmp, df_acc], return_df=True)

Changes

🎉 The feature DataFrame returned by FeatureCollection.calculate now has a deterministic column order, see #40


v0.2.2(Nov 12, 2021)
New features

🔥 Now also supports feature-extraction on numeric-index data (and thus not only time-based data)

💚 Seamless integration with tsfresh, check out the example below:

from tsfresh.feature_extraction import MinimalFCParameters
import scipy.stats as ss

from tsflex.features import FeatureCollection, MultipleFeatureDescriptors
from tsflex.features.integrations import tsfresh_settings_wrapper
from tsflex.utils.data import load_empatica_data

# Load sequence-indexed data (in this case a time-index)
df_tmp, df_acc = load_empatica_data(['tmp', 'acc'])

# Construct your feature extraction configuration & extract features
fc = FeatureCollection(
    MultipleFeatureDescriptors(
        functions=tsfresh_settings_wrapper(MinimalFCParameters()) + [ss.skew],
        series_names=["TMP", "ACC_x", "ACC_y"],
        windows=["5min", "15min"],
        strides="5min"
    )
)

fc.calculate(data=[df_tmp, df_acc], return_df=True)

⚡ Optimized strided-rolling feature-extraction, see the newly generated benchmark ⬇️

Added FeatureCollection.reduce() which comes in really handy when feature selection is performed in your machine-learning pipeline

🐻 chunk_data() now also supports DataFrame-dicts as input, which can be more convenient when having DataFrames with a lot of columns for which you want to specify the sample-frequencies.

🌻 SeriesPipeline is now more compose-like as it now accepts SeriesPipeline instances

Changes

🧵 Changed pathos ➡️ multiprocess as multiprocessing back-end

🔧 Moved the bound_method argument to FeatureCollection.calculate()

📝 Rewrote strided-rolling back-end in a more OO manner (introduced the segmenter module), which complies with our roadmap of providing more segmenting functionality


Owner

PreDiCT.IDLab

Repositories of the IDLab PreDiCT group

GitHub https://predict-idlab.github.io/tsflex/

Automatic extraction of relevant features from time series:

tsfresh This repository contains the TSFRESH python package. The abbreviation stands for "Time Series Feature extraction based on scalable hypothesis

7k Jan 6, 2023

neurodsp is a collection of approaches for applying digital signal processing to neural time series

neurodsp is a collection of approaches for applying digital signal processing to neural time series, including algorithms that have been proposed for the analysis of neural time series. It also includes simulation tools for generating plausible simulations of neural time series.

224 Dec 2, 2022

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Jan 5, 2023

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Prophet: Automatic Forecasting Procedure Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends ar

15.4k Jan 7, 2023

Open source time series library for Python

PyFlux PyFlux is an open source time series library for Python. The library has a good array of modern time series models, as well as a flexible array

2k Jan 2, 2023

A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

6k Jan 6, 2023

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

1.3k Dec 22, 2022

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Dec 29, 2022

Probabilistic time series modeling in Python

GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

3.3k Jan 3, 2023

A python library for easy manipulation and forecasting of time series.

Time Series Made Easy in Python darts is a python library for easy manipulation and forecasting of time series. It contains a variety of models, from

5.2k Jan 4, 2023

STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

2.5k Jan 6, 2023

A Python package for time series classification

pyts: a Python package for time series classification pyts is a Python package for time series classification. It aims to make time series classificat

1.4k Jan 1, 2023

Time series forecasting with PyTorch

Our article on Towards Data Science introduces the package and provides background information. Pytorch Forecasting aims to ease state-of-the-art time

2.5k Jan 2, 2023

Python module for machine learning time series:

seglearn Seglearn is a python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extr

536 Dec 29, 2022

Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

519 Jan 3, 2023

A Python toolkit for rule-based/unsupervised anomaly detection in time series

AtsPy: Automated Time Series Models in Python (by @firmai)

A python library for Bayesian time series modeling

An open-source library of algorithms to analyse time series in GPU and CPU.