Python module for machine learning time series:

David Burns

Last update: Dec 29, 2022

Overview

seglearn

Seglearn is a Python package for machine learning on time series or sequences. It provides an integrated pipeline for segmentation, feature extraction, feature processing, and a final estimator. Seglearn provides a flexible approach to multivariate time series and related contextual (meta) data for classification, regression, and forecasting problems. Support and examples are provided for learning time series with classical machine learning and deep learning models. It is compatible with scikit-learn.

Documentation

Installation documentation, API documentation, and examples can be found in the documentation.

Dependencies

seglearn is tested to work under Python 3.5. The dependency requirements are based on the last scikit-learn release:

scipy(>=0.17.0)
numpy(>=1.11.0)
scikit-learn(>=0.21.3)

Additionally, to run the examples, you need:

matplotlib(>=2.0.0)
keras (>=2.1.4) for the neural network examples
pandas

In order to run the test cases, you need:

pytest

The neural network examples were tested on keras using the tensorflow-gpu backend, which is recommended.

Installation

seglearn is currently available on PyPI and can be installed via pip:

pip install -U seglearn

or if you use python3:

pip3 install -U seglearn

If you prefer, you can clone the repository and install it from source. Use the following commands to get a copy from GitHub and install all dependencies:

git clone https://github.com/dmbee/seglearn.git
cd seglearn
pip install .

Or install using pip and GitHub:

pip install -U git+https://github.com/dmbee/seglearn.git

Testing

After installation, you can use pytest to run the test suite from seglearn's root directory:

pytest

Change Log

Version history can be viewed in the Change Log.

Development

The development of this scikit-learn-contrib project is in line with that of the scikit-learn community. Therefore, you can refer to their Development Guide.

Please submit new pull requests on the dev branch with unit tests and an example to demonstrate any new functionality or API changes.

Citing seglearn

If you use seglearn in a scientific publication, we would appreciate citations to the following paper:

@article{arXiv:1803.08118,
author  = {David Burns and Cari Whyne},
title   = {Seglearn: A Python Package for Learning Sequences and Time Series},
journal = {arXiv},
year    = {2018},
url     = {https://arxiv.org/abs/1803.08118}
}

If you use the seglearn test data in a scientific publication, we would appreciate citations to the following paper:

@article{arXiv:1802.01489,
author  = {David Burns and Nathan Leung and Michael Hardisty and Cari Whyne and Patrick Henry and Stewart McLachlin},
title   = {Shoulder Physiotherapy Exercise Recognition: Machine Learning the Inertial Signals from a Smartwatch},
journal = {arXiv},
year    = {2018},
url     = {https://arxiv.org/abs/1802.01489}
}

Comments

Resampling with imbalanced-learn samplers

Hi David,

I added the patch_sampler(imblearn_sampler_class) function which can be used to derive a dynamically created (and picklable) sampler class compatible with Pype. The derived class implements a transform method which returns the data unchanged. The fit_transform method calls the fit_resample method of the imbalanced-learn sampler which resamples the data. These steps are important to ensure that resampling only applies to training data but not to test data (the example shows that Pype.fit calls the fit_transform method, whereas score calls the transform method).
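The split between fit_transform (resample) and transform (pass through) described above can be sketched as follows. This is only an illustration of the idea, not the patch_sampler implementation: the class name PassThroughSampler is hypothetical, and a naive random over-sampler written in NumPy stands in for an imbalanced-learn sampler's fit_resample.

```python
import numpy as np


class PassThroughSampler:
    """Hypothetical wrapper: resample in fit_transform, pass through in transform."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_transform(self, X, y):
        # Training path: naive random over-sampling so every class reaches the
        # size of the largest class (stand-in for a sampler's fit_resample).
        rng = np.random.default_rng(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        target = counts.max()
        idx = []
        for c, n in zip(classes, counts):
            members = np.flatnonzero(y == c)
            extra = rng.choice(members, size=target - n, replace=True)
            idx.append(np.concatenate([members, extra]))
        idx = np.concatenate(idx)
        return X[idx], y[idx]

    def transform(self, X, y):
        # Test/scoring path: the data is returned unchanged.
        return np.asarray(X), np.asarray(y)


X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])
sampler = PassThroughSampler()
X_train, y_train = sampler.fit_transform(X, y)  # classes balanced
X_test, y_test = sampler.transform(X, y)        # untouched
```

Because only fit_transform changes the number of samples, a pipeline that calls transform when scoring never resamples test data.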

Cheers, Matthias

opened by qtux 14
Reverse Transform of Segment.

Hello, I have an issue: because of the segmentation process, the shape of the predictions is not the same as the shape of the original labels. Do you have any function to convert the shape of the predictions back to the shape before segmentation?

opened by ninfueng 12
transform: Add step param to segment transforms

Hi David,

I implemented the possibility to use a step size instead of the overlap percentage when segmenting. It has proven to be quite useful for me :).
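To illustrate how a step size relates to an overlap fraction, here is a small sketch. The helper segment_starts is hypothetical, and the conversion step = width * (1 - overlap), truncated to an integer of at least 1, is an assumption made for this example rather than a statement about the merged code.

```python
def segment_starts(n_samples, width, overlap=0.5, step=None):
    """Return the start index of each sliding window (hypothetical helper)."""
    if step is None:
        # Assumed conversion from an overlap fraction to a step in samples.
        step = max(1, int(width * (1.0 - overlap)))
    return list(range(0, n_samples - width + 1, step))


by_overlap = segment_starts(10, width=4, overlap=0.5)  # step of 2 samples
by_step = segment_starts(10, width=4, step=3)          # step given directly
```

A step of 3 with a width of 4 cannot be written as a "round" overlap percentage, which is one situation where specifying the step directly is more convenient.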

Cheers, Matthias

opened by qtux 8

What the segment class is used for

Hello, I can't understand the usage of the Segment class: in what cases do I need to use this transform, and how does it help? I also couldn't find an example of how to incorporate contextual variables. When I run it on toy data, it is very unclear what happened, since X is unchanged but y was reduced to a single value:

import numpy as np
from seglearn.transform import Segment

# Single multivariate time series with 3 samples of 4 variables
X = [np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])]
# Time series target
y = [np.array([True, False, False])]
print("X: ", X)
print("y: ", y)
# A window of width 3 covers the whole 3-sample series, so one segment results
segment = Segment(width=3, overlap=1)
X, y, _ = segment.fit_transform(X, y)
print('After segmentation:')
print("X:", X)
print("y: ", y)

X : [array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])]
y : [array([ True, False, False])]
After segmentation:
X: [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]]
y:  [False]
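The output above is consistent with a sliding window in which each segment takes the target of its last sample. The sketch below reproduces that behaviour in plain NumPy; sliding_segments is a hypothetical helper written for illustration, not seglearn's implementation, and the "last sample" rule is an assumption inferred from the output.

```python
import numpy as np


def sliding_segments(x, y, width, step):
    """Cut one series into windows; label each window by its last sample."""
    starts = range(0, len(x) - width + 1, step)
    segments = np.stack([x[s:s + width] for s in starts])
    targets = np.array([y[s + width - 1] for s in starts])
    return segments, targets


# Toy data from the question: 3 samples, width 3 -> exactly one segment.
x = np.arange(12).reshape(3, 4)
y = np.array([True, False, False])
X_seg, y_seg = sliding_segments(x, y, width=3, step=1)

# A longer series makes the effect visible: 6 samples -> 4 segments.
x_long = np.arange(24).reshape(6, 4)
y_long = np.array([0, 0, 0, 1, 1, 1])
X_long, y_long_seg = sliding_segments(x_long, y_long, width=3, step=1)
```

With only 3 samples and a width of 3, the single segment is the whole series, so X looks unchanged apart from the extra leading dimension and y collapses to one value.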

opened by orko19 7

Passing data to temporal_split and other functions
Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15). The last column is the target, so I created two DataFrames, one for X (1017, 14) and one for y (1017). I tried to pass those values to temporal_split but I always get an error no matter what I do (passing the DataFrames, passing them as lists). For example, passing them as lists gives:

KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000],\n dtype='int64', length=1001)] are in the [columns]"

If, on the other hand, I pass them as df I get:

AttributeError: 'DataFrame' object has no attribute 'ts_data'

The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train). I tried putting the date column in the DataFrame as well as in the index, but the error is still there. What's wrong?

Info of the Dataframe:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017 entries, 896 to 1912
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   date          1017 non-null   datetime64[ns]
 1   id            1017 non-null   object
 2   price         1017 non-null   float64
 3   month         1017 non-null   int64
 4   year          1017 non-null   int64
 5   event_name_1  1017 non-null   int64
 6   event_type_1  1017 non-null   int64
 7   event_name_2  1017 non-null   int64
 8   event_type_2  1017 non-null   int64
 9   snap_CA       1017 non-null   int64
 10  dow           1017 non-null   int64
 11  is_weekend    1017 non-null   int64
 12  is_holiday    1017 non-null   int64
dtypes: datetime64[ns](1), float64(1), int64(10), object(1)
memory usage: 111.2+ KB

I tried to use it with a date column, with a date index, and as a list. The same goes for y: I tried to use it as a Series, as a DataFrame with a date column or date index, and as a list, both with and without the date column. As you can see, there are no NaN values.
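One thing that may be worth trying is the list-of-arrays layout used in the toy Segment example earlier on this page, where X is a list holding one 2D array per time series and y is a list holding one target array per series. The sketch below only shows that conversion with pandas and NumPy. The column names are a small hypothetical subset, the values are made up, and whether this resolves the errors above has not been verified against seglearn.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the DataFrame described above.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "price": [1.0, 1.5, 1.2, 1.8, 2.0],
    "month": [1, 1, 1, 1, 1],
    "target": [10.0, 11.0, 9.5, 12.0, 13.0],
})

# Keep only numeric feature columns; dates and string ids are dropped here.
feature_cols = ["price", "month"]

# One time series -> a list containing a single (n_samples, n_features) array.
X = [df[feature_cols].to_numpy(dtype=float)]
# Matching list containing a single (n_samples,) target array.
y = [df["target"].to_numpy(dtype=float)]
```

Several independent series would each become their own array in the list, for example by splitting the DataFrame on an id column first.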
opened by adalseno 7
Function transformer

Hi David,

as promised in #13, here is the generic FunctionTransformer for applying functions to Xt. Resampling Xt is disallowed and y not changed (and not provided to the supplied function).

Cheers, Matthias

opened by qtux 7
Postprocessing

New feature: ReconstructTs

This should go in a new postprocessing module and reconstruct time series target labels from predictions on segments, mapped back to the original data samples.

This could be implemented using interpolation (nearest neighbor) for categorical targets and any other interpolation method for continuous targets.

I don't think this can be integrated in the current pipeline at the moment. Another option would be to design another pipeline class that has this implemented as its last step.
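A minimal sketch of the nearest-neighbor idea for categorical targets is shown below. The function reconstruct_labels and the choice to attribute each segment prediction to the centre of its window are assumptions made for illustration; they are not part of seglearn.

```python
import numpy as np


def reconstruct_labels(seg_pred, n_samples, width, step):
    """Map per-segment predictions back to per-sample labels (nearest segment)."""
    seg_pred = np.asarray(seg_pred)
    # Time index each prediction is attributed to (assumed: window centre).
    centres = np.arange(len(seg_pred)) * step + (width - 1) / 2.0
    t = np.arange(n_samples)
    # For every original sample, pick the segment with the closest centre.
    nearest = np.abs(t[:, None] - centres[None, :]).argmin(axis=1)
    return seg_pred[nearest]


# Three segments of width 4 and step 2 over an 8-sample series.
labels = reconstruct_labels([0, 0, 1], n_samples=8, width=4, step=2)
```

For continuous targets, np.interp over the same centres would be one way to interpolate linearly instead of taking the nearest value.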
enhancement

opened by dmbee 4
Add FunctionXYTransformer
Hi David,

I added a generic transformer which allows one to arbitrarily transform the data (Xt and yt) before or after segmentation using a custom function. Useful examples which come to mind are:

selecting specific parts of data (e.g. column filter)

applying filter functions (e.g. bandpass filter EMG data)

I added a safeguard against unintentional sub- or oversampling of the data, as this would affect the training data (which is fine) and the test data (which is not fine). The safeguard can be deactivated to allow for legitimate use cases similar to segmentation or interpolation of the data.

Could you please verify whether my reasoning is correct? I would remove the last two commits if it is not.
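The safeguard described above can be sketched as a length check around the user-supplied function. The helper apply_xy_function and its allow_resampling flag are hypothetical names chosen for this example and do not mirror the actual FunctionXYTransformer signature.

```python
import numpy as np


def apply_xy_function(func, Xt, yt, allow_resampling=False):
    """Apply func to (Xt, yt); reject changes in the number of samples."""
    Xn, yn = func(Xt, yt)
    if not allow_resampling and (len(Xn) != len(Xt) or len(yn) != len(yt)):
        raise ValueError("function changed the number of samples")
    return Xn, yn


Xt = np.arange(20.0).reshape(5, 4)
yt = np.array([0, 0, 1, 1, 1])

# Column filter: keeps all samples, so the check passes.
X_cols, y_cols = apply_xy_function(lambda X, y: (X[:, :2], y), Xt, yt)

# Sub-sampling: blocked unless the safeguard is switched off.
try:
    apply_xy_function(lambda X, y: (X[::2], y[::2]), Xt, yt)
    guard_message = None
except ValueError as exc:
    guard_message = str(exc)

X_sub, y_sub = apply_xy_function(
    lambda X, y: (X[::2], y[::2]), Xt, yt, allow_resampling=True
)
```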

Cheers, Matthias
opened by qtux 4
Column transformer for segmented data

Hi David,

I wrote a simple wrapper to use the sklearn ColumnTransformer on segmented data which is kind of useful when dealing with heterogeneous (multivariate) time series data. I've taken a look into supporting contextual data but did not find an easy way to make the current code work with the TS_Data class. Maybe copying and adapting the whole ColumnTransformer code instead of patching some parts of it could lead to a proper solution to support both. Nevertheless, I hope you find the SegmentedColumnTransformer to be useful.
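One way to apply column-wise transforms to segmented data is to flatten the segments to 2D, run scikit-learn's ColumnTransformer, and reshape back. The sketch below shows that idea only; it assumes segments shaped (n_segments, width, n_variables) and is not the SegmentedColumnTransformer from the pull request.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Assumed layout: 4 segments, 5 samples each, 3 variables.
segments = rng.normal(size=(4, 5, 3))
n_seg, width, n_vars = segments.shape

# Flatten time into the sample axis so each variable is one column.
flat = segments.reshape(n_seg * width, n_vars)

# Scale variables 0 and 1; pass variable 2 through unchanged.
ct = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1])],
    remainder="passthrough",
)
flat_out = ct.fit_transform(flat)

# Restore the segment structure (transformed columns first, then passthrough).
transformed = flat_out.reshape(n_seg, width, -1)
```

Note that ColumnTransformer places the explicitly transformed columns first and the passthrough columns last, so the column order can change when the selected columns are not the leading ones.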

Cheers, Matthias

opened by qtux 4

Pype broken with scikit-learn 0.24

When using Pype with scikit-learn version 0.24 I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/seglearn/pipe.py", line 59, in __init__
    super(Pype, self).__init__(steps, memory)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 74, in inner_f
    return f(**kwargs)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 118, in __init__
    self._validate_steps()
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 157, in _validate_steps
    self._validate_names(names)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 70, in _validate_names
    invalid_names = set(names).intersection(self.get_params(deep=False))
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 137, in get_params
    return self._get_params('steps', deep=deep)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 29, in _get_params
    out = super().get_params(deep=deep)
  File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py", line 195, in get_params
    value = getattr(self, key)
AttributeError: 'Pype' object has no attribute 'scorer'

Example to reproduce the error:

python=3.8
scikit-learn=0.24
seglearn=1.2.1

from seglearn.transform import SegmentX
from seglearn.pipe import Pype

pipe = Pype([('segment', SegmentX())])   # will crash on creation

From a quick look at seglearn's source code (https://github.com/dmbee/seglearn/blob/9000eee4a1edd7dbb0bc4c3b63970c1b71c77a31/seglearn/pipe.py#L58-L64), one solution could be to move the call to the superclass __init__ to the end:

def __init__(self, steps, scorer=None, memory=None):
    self.scorer = scorer
    self.N_train = None
    self.N_test = None
    self.N_fit = None
    self.history = None
    super(Pype, self).__init__(steps, memory)
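The reason the ordering matters can be reproduced without scikit-learn. In the sketch below, Base is a simplified stand-in for a parent class whose constructor reads every constructor parameter as an attribute (as get_params does in the traceback above); Base, Broken and Fixed are hypothetical classes written for this illustration.

```python
import inspect


class Base:
    def __init__(self, steps):
        self.steps = steps
        self.get_params()  # validation that reads every constructor parameter

    def get_params(self):
        # Look up each parameter of the (sub)class constructor as an attribute.
        sig = inspect.signature(type(self).__init__)
        names = [p for p in sig.parameters if p != "self"]
        return {name: getattr(self, name) for name in names}


class Broken(Base):
    def __init__(self, steps, scorer=None):
        super().__init__(steps)  # get_params runs before scorer exists
        self.scorer = scorer


class Fixed(Base):
    def __init__(self, steps, scorer=None):
        self.scorer = scorer     # set attributes first
        super().__init__(steps)  # then let the parent validate


try:
    Broken([])
    broken_error = None
except AttributeError as exc:
    broken_error = str(exc)

fixed_params = Fixed([]).get_params()
```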

opened by pwoller 3

will it work for multivariate time series prediction both regression and classification
Great code, thanks. Could you clarify whether it will work for multivariate time series prediction, both regression and classification:

1. where all values are continuous, or
2. where the values are a mixture of continuous and categorical, for example 2 dimensions with continuous values and 3 dimensions with categorical values?

   color   weight  gender  height  age
1  black   56      m       160     34
2  white   77      f       170     54
3  yellow  87      m       167     43
4  white   55      m       198     72
5  white   88      f       176     32
opened by Sandy4321 3
Pype and Pipeline version incompatible

Pype version: 1.2.2
Sklearn version: 1.0.1
Python version: 3.8.10

Problem: __init__() takes 2 positional arguments but 3 were given

Solution: change to super(Pype, self).__init__(steps, memory=memory) in seglearn/pipe.py
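The error message is what Python reports when a keyword-only parameter is passed positionally. The sketch below reproduces it with a stand-in class; ParentPipeline is hypothetical and only assumes, as the reported fix suggests, that memory became keyword-only in the parent constructor.

```python
class ParentPipeline:
    # Stand-in for a parent class whose 'memory' parameter is keyword-only.
    def __init__(self, steps, *, memory=None):
        self.steps = steps
        self.memory = memory


try:
    ParentPipeline([], None)  # positional 'memory' -> TypeError
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

ok = ParentPipeline([], memory=None)  # keyword form works
```

Passing memory by keyword works whether or not the parameter is keyword-only, which makes the keyword form the safer call across library versions.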

opened by MyRespect 1
Dataframe of multiple multivariate time series

I have z different time series of different lengths. Each time series has a different number of timestamped time points, and each time point has m features and an observed float outcome. My aim is to model a regressor (given m features, what is the outcome?). I trained a regressor that ignores the temporal dimension of the dataset (training on all data points using the m features to predict the outcome), but it gave poor results. (Multiple multivariate time series with different lengths and sampling frequencies.)

My aim is to add a temporal dimension to each time point (such as adding new features in a rolling fashion: for each time point, the mean of past feature values, the standard deviation of past feature values, etc.). I could not find any example of adding new features to a data frame of multiple multivariate time series with different lengths and sampling frequencies. Can you help me?
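Outside of seglearn, one common way to add such rolling features per series is a pandas groupby combined with shift and rolling. The sketch below assumes a long-format DataFrame with a hypothetical series_id column identifying each series. It uses a fixed window counted in rows, so irregular sampling frequencies are not handled here.

```python
import pandas as pd

df = pd.DataFrame({
    "series_id": ["a", "a", "a", "a", "b", "b", "b"],
    "t": [0, 1, 2, 3, 0, 1, 2],
    "f1": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0],
})
df = df.sort_values(["series_id", "t"])

grouped = df.groupby("series_id")["f1"]

# shift(1) excludes the current row, so only past values enter each window.
df["f1_past_mean"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
df["f1_past_std"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=2).std()
)
```

The first row of each series has no past values, so its rolling features are NaN and need to be dropped or imputed before fitting a regressor.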

opened by quancore 1
Question about data representation
How can I work with seglearn if I have the data representation presented here?

I have two cases. In the first case I have a variable that is time dependent so I would like to extract features from the previous values in order to build the X matrix.

2011-01-01 01:00:00    1.073392
2011-01-01 02:00:00    0.274406
2011-01-01 03:00:00    1.446233
2011-01-01 04:00:00   -0.035727

In the second case I have the same problem, but with one or more additional dependent (and time-dependent) variables that I want to use to predict the third one.
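For the first case, building X from previous values amounts to a lag matrix. The sketch below does this in plain NumPy with the four values shown above; the choice of two lags is arbitrary and the code is an illustration, not seglearn's API.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = np.array([1.073392, 0.274406, 1.446233, -0.035727])
n_lags = 2  # arbitrary choice for this example

# Each window holds n_lags past values followed by the value to predict.
windows = sliding_window_view(values, n_lags + 1)
X = windows[:, :-1]  # previous values as features
y = windows[:, -1]   # next value as target
```

For the second case, the lagged columns of the additional variables could be stacked next to X with np.hstack, one block per variable.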
opened by chkoar 5
RollingSplit

Implement data splitter with rolling splits similar to sklearn.model_selection.TimeSeriesSplit but with compatibility for data sets with more than a single time series and contextual data.
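A rough sketch of what rolling splits over several series could look like is given below. The generator rolling_splits is hypothetical: it splits every series into n_splits + 1 equal folds and, for each split, trains on the leading folds and tests on the next one, loosely following the expanding-window idea of TimeSeriesSplit.

```python
def rolling_splits(lengths, n_splits):
    """Yield (train, test) lists of (start, stop) ranges, one per series."""
    for k in range(1, n_splits + 1):
        train, test = [], []
        for n in lengths:
            fold = n // (n_splits + 1)
            train.append((0, k * fold))              # everything seen so far
            test.append((k * fold, (k + 1) * fold))  # the next fold
        yield train, test


# Two series with 12 and 8 samples, three rolling splits.
splits = list(rolling_splits([12, 8], n_splits=3))
```

Because the ranges are produced per series, contextual data attached to a series can be carried along unchanged into both the training and the test part of each split.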
enhancement help wanted

opened by dmbee 1

Owner

David Burns

Orthopaedic Surgery Resident
PhD Candidate, Biomedical Engineering
Sunnybrook Research Institute
University of Toronto, Canada

GitHub https://dmbee.github.io/seglearn/
