Python module for machine learning time series

Overview

seglearn

Seglearn is a Python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extraction, feature processing, and a final estimator. Seglearn provides a flexible approach to multivariate time series and related contextual (meta) data for classification, regression, and forecasting problems. Support and examples are provided for learning time series with classical machine learning and deep learning models. It is compatible with scikit-learn.
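
To make the segmentation and feature extraction steps concrete, here is a minimal sketch in plain Python. It is not seglearn's implementation: the helper names `sliding_windows` and `window_features`, the window width, the step, and the chosen features are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def sliding_windows(series, width, step):
    """Return fixed-width windows taken every `step` samples.

    A tail that does not fill a whole window is dropped.
    """
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]


def window_features(window):
    """Hypothetical feature set: mean, population std, min and max of one window."""
    return [mean(window), pstdev(window), min(window), max(window)]


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
segments = sliding_windows(signal, width=4, step=2)
features = [window_features(w) for w in segments]
```

Each row of `features` describes one segment and could then be passed to a scikit-learn style estimator.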

Documentation

Installation instructions, API documentation, and examples can be found in the documentation.

Dependencies

seglearn is tested to work under Python 3.5. The dependency requirements are based on the last scikit-learn release:

  • scipy (>=0.17.0)
  • numpy (>=1.11.0)
  • scikit-learn (>=0.21.3)
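
A quick way to compare an environment against these minimums is sketched below. It assumes Python 3.8 or newer (for `importlib.metadata`); the helper names `as_tuple` and `check_minimums` are hypothetical, and the version parsing is deliberately simple (it only looks at the leading digits of each dotted part).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency list above.
MINIMUMS = {"scipy": "0.17.0", "numpy": "1.11.0", "scikit-learn": "0.21.3"}


def as_tuple(version_string):
    """Turn '1.11.0' into (1, 11, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_minimums(minimums=MINIMUMS):
    """Return a short status string for every required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = as_tuple(installed) >= as_tuple(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'below ' + minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_minimums().items():
        print(package, status)
```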

Additionally, to run the examples, you need:

  • matplotlib (>=2.0.0)
  • keras (>=2.1.4) for the neural network examples
  • pandas

In order to run the test cases, you need:

  • pytest

The neural network examples were tested on keras using the tensorflow-gpu backend, which is recommended.

Installation

seglearn is currently available on PyPI and can be installed via pip:

pip install -U seglearn

or if you use python3:

pip3 install -U seglearn

If you prefer, you can clone the repository and install from source. Use the following commands to get a copy from GitHub and install seglearn together with its dependencies:

git clone https://github.com/dmbee/seglearn.git
cd seglearn
pip install .

Or install using pip and GitHub:

pip install -U git+https://github.com/dmbee/seglearn.git

Testing

After installation, you can use pytest to run the test suite from seglearn's root directory:

pytest

Change Log

Version history can be viewed in the Change Log.

Development

Development of this scikit-learn-contrib project follows the practices of the scikit-learn community, so you can refer to their Development Guide.

Please submit new pull requests on the dev branch with unit tests and an example to demonstrate any new functionality or API changes.

Citing seglearn

If you use seglearn in a scientific publication, we would appreciate citations to the following paper:

@article{arXiv:1803.08118,
author  = {David Burns and Cari Whyne},
title   = {Seglearn: A Python Package for Learning Sequences and Time Series},
journal = {arXiv},
year    = {2018},
url     = {https://arxiv.org/abs/1803.08118}
}

If you use the seglearn test data in a scientific publication, we would appreciate citations to the following paper:

@article{arXiv:1802.01489,
author  = {David Burns and Nathan Leung and Michael Hardisty and Cari Whyne and Patrick Henry and Stewart McLachlin},
title   = {Shoulder Physiotherapy Exercise Recognition: Machine Learning the Inertial Signals from a Smartwatch},
journal = {arXiv},
year    = {2018},
url     = {https://arxiv.org/abs/1802.01489}
}
Comments
  • Resampling with imbalanced-learn samplers

    Hi David,

    I added the patch_sampler(imblearn_sampler_class) function, which can be used to derive a dynamically created (and picklable) sampler class compatible with Pype. The derived class implements a transform method which returns the data unchanged. The fit_transform method calls the fit_resample method of the imbalanced-learn sampler, which resamples the data. These steps are important to ensure that resampling applies only to training data and not to test data (the example shows that Pype.fit calls the fit_transform method, whereas score calls the transform method).

    Cheers, Matthias

    opened by qtux 14
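
The split between a resampling fit_transform and a pass-through transform can be illustrated with a small plain-Python sketch. The class below is hypothetical and does not use imbalanced-learn or seglearn; it only mirrors the behaviour described in this pull request, with naive random oversampling standing in for a real sampler.

```python
import random
from collections import Counter


class PassThroughOversampler:
    """Hypothetical sampler: resamples in fit_transform, is a no-op in transform."""

    def __init__(self, seed=0):
        self.seed = seed

    def fit_transform(self, X, y):
        # Training path: duplicate random samples of the smaller classes
        # until every class is as frequent as the largest one.
        rng = random.Random(self.seed)
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, count in counts.items():
            pool = [x for x, lab in zip(X, y) if lab == label]
            for _ in range(target - count):
                X_out.append(rng.choice(pool))
                y_out.append(label)
        return X_out, y_out

    def transform(self, X, y):
        # Scoring / prediction path: the data is returned unchanged.
        return X, y


sampler = PassThroughOversampler()
X_train, y_train = [[0], [1], [2], [3]], [0, 0, 0, 1]
X_res, y_res = sampler.fit_transform(X_train, y_train)   # balanced copy
X_same, y_same = sampler.transform(X_train, y_train)     # untouched
```
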
  • Reverse Transform of Segment.

    Hello, I have an issue: the shape of the predictions is not the same as the shape of the original labels because of the segmentation process. Do you have a function to convert the predictions back to the shape before segmentation?

    opened by ninfueng 12
  • transform: Add step param to segment transforms

    Hi David,

    I implemented the possibility to use a step size instead of the overlap percentage when segmenting. It has proven to be quite useful for me :).

    Cheers, Matthias

    opened by qtux 8
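
The relationship between an overlap fraction and a step size can be sketched as follows. This is an illustration only: the function name `segment_starts` and the conversion `step = max(1, int(width * (1 - overlap)))` are assumptions, not taken from this pull request.

```python
def segment_starts(n_samples, width, overlap=None, step=None):
    """Return the start index of every full-width segment.

    Either `step` (in samples) or `overlap` (fraction of the width shared by
    neighbouring segments) must be given; `step` wins if both are set.
    """
    if step is None:
        if overlap is None:
            raise ValueError("give either overlap or step")
        # Assumed conversion from overlap fraction to step size in samples.
        step = max(1, int(width * (1.0 - overlap)))
    return list(range(0, n_samples - width + 1, step))


by_overlap = segment_starts(10, width=4, overlap=0.5)  # step of 2 samples
by_step = segment_starts(10, width=4, step=3)          # step chosen directly
```

A step size makes the spacing explicit and allows spacings that no round overlap fraction expresses, such as a step of 3 for a width of 4.
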
  • What the segment class is used for

    Hello, I can't understand the usage of the Segment class. In what cases do I need to use this transform, and how does it help? I also couldn't find an example of how to incorporate contextual variables. When I run it on toy data, it is very unclear what happened, since X is unchanged but y was reduced to a single value:

    import numpy as np
    from seglearn.transform import Segment

    # Single multivariate time series with 3 samples of 4 variables
    X = [np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])]
    # Time series target (one value per sample)
    y = [np.array([True, False, False])]
    print("X:", X)
    print("y:", y)
    # Window width equals the series length, so only one segment fits
    segment = Segment(width=3, overlap=1)
    X, y, _ = segment.fit_transform(X, y)
    print('After segmentation:')
    print("X:", X)
    print("y:", y)
    
    X : [array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])]
    y : [array([ True, False, False])]
    After segmentation:
    X: [[[ 0  1  2  3]
      [ 4  5  6  7]
      [ 8  9 10 11]]]
    y:  [False]
    
    opened by orko19 7
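
One way to read the output above: a window of width 3 fits exactly once into a series of 3 samples, so X becomes a single segment holding all the data, and the per-sample target has to be reduced to one value per segment. The plain-Python sketch below reproduces that result under the assumption that the segment target is taken from the last sample of each window (which is consistent with the printed `[False]`); `segment_xy` is a hypothetical helper, not seglearn code.

```python
def segment_xy(x_series, y_series, width, step, y_func=lambda window: window[-1]):
    """Cut one series and its per-sample target into aligned windows.

    `y_func` reduces the targets inside a window to one value per segment;
    taking the last value is an assumption made for this illustration.
    """
    starts = range(0, len(x_series) - width + 1, step)
    x_segments = [x_series[s:s + width] for s in starts]
    y_segments = [y_func(y_series[s:s + width]) for s in starts]
    return x_segments, y_segments


x = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
y = [True, False, False]

one_x, one_y = segment_xy(x, y, width=3, step=1)   # a single segment
two_x, two_y = segment_xy(x, y, width=2, step=1)   # two overlapping segments
```

With a width shorter than the series, several segments appear and each one gets its own target value.
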
  • Passing data to temporal_split and other functions

    Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15). The last column is the target, so I created two dfs, one for X (1017, 14) and one for y (1017). I tried to pass those values to temporal_split, but I always get an error no matter what I do (passing the dfs, passing them as lists). For example, passing them as lists gives:

    KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000],\n dtype='int64', length=1001)] are in the [columns]"

    If, on the other hand, I pass them as df I get:

    AttributeError: 'DataFrame' object has no attribute 'ts_data'

    The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train). I tried to put the date column in the df as well as in the index, but the error is still there. What's wrong?

    Info of the Dataframe:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1017 entries, 896 to 1912
    Data columns (total 13 columns):
     #   Column        Non-Null Count  Dtype         
    ---  ------        --------------  -----         
     0   date          1017 non-null   datetime64[ns]
     1   id            1017 non-null   object        
     2   price         1017 non-null   float64       
     3   month         1017 non-null   int64         
     4   year          1017 non-null   int64         
     5   event_name_1  1017 non-null   int64         
     6   event_type_1  1017 non-null   int64         
     7   event_name_2  1017 non-null   int64         
     8   event_type_2  1017 non-null   int64         
     9   snap_CA       1017 non-null   int64         
     10  dow           1017 non-null   int64         
     11  is_weekend    1017 non-null   int64         
     12  is_holiday    1017 non-null   int64         
    dtypes: datetime64[ns](1), float64(1), int64(10), object(1)
    memory usage: 111.2+ KB
    

    I tried to use it with a date column, with a date index, and as a list. The same for y: I tried to use it as a Series, as a DataFrame with a date column or date index, and as a list, both with and without the date column. As you can see, there are no NaN values.

    opened by adalseno 7
  • Function transformer

    Hi David,

    As promised in #13, here is the generic FunctionTransformer for applying functions to Xt. Resampling Xt is disallowed, and y is not changed (and not provided to the supplied function).

    Cheers, Matthias

    opened by qtux 7
  • Postprocessing

    New feature: ReconstructTs

    This should go in a new postprocessing module and reconstruct time series target labels from predictions on segments, mapped back to the original data samples.

    This could be implemented using nearest-neighbor interpolation for categorical targets and any interpolation method for continuous targets.

    I don't think this can be integrated in the current pipeline at the moment. Another option would be to design another pipeline class that has this implemented as its last step.

    enhancement 
    opened by dmbee 4
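
As a rough illustration of the nearest-neighbor idea for categorical targets, the sketch below assigns every original sample the prediction of the segment whose centre is closest. It is an assumption about how such a reconstruction could work, not an existing seglearn function; `reconstruct_labels` and its parameters are hypothetical.

```python
def reconstruct_labels(segment_preds, n_samples, width, step):
    """Map one prediction per segment back to one label per original sample.

    Segment i is assumed to cover samples [i * step, i * step + width), so its
    centre sits at i * step + (width - 1) / 2. Ties go to the earlier segment.
    """
    centres = [i * step + (width - 1) / 2 for i in range(len(segment_preds))]
    labels = []
    for t in range(n_samples):
        nearest = min(range(len(centres)), key=lambda i: abs(centres[i] - t))
        labels.append(segment_preds[nearest])
    return labels


# Three segments of width 4 taken every 2 samples from a series of 8 samples.
per_sample = reconstruct_labels(["a", "b", "c"], n_samples=8, width=4, step=2)
```

The result has the same length as the original series, which is the shape asked for in the "Reverse Transform of Segment" question.
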
  • Add FunctionXYTransformer

    Hi David,

    I added a generic transformer which allows one to arbitrarily transform the data (Xt and yt) before or after segmentation using a custom function. Useful examples which come to mind are:

    • selecting specific parts of data (e.g. column filter)
    • applying filter functions (e.g. bandpass filter EMG data)

    I added a safeguard against unintentional sub- or oversampling of the data, as this would affect the training data (which is fine) and the test data (which is not fine). The safeguard can be deactivated to allow for legitimate use cases similar to segmentation or interpolation of the data.

    Could you please verify whether my reasoning is correct? I would remove the last two commits if it is not.

    Cheers, Matthias

    opened by qtux 4
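
The safeguard described in this pull request can be pictured with a short plain-Python sketch. The function `apply_xy_function` and its `allow_resampling` flag are hypothetical names chosen for illustration; the actual transformer may check different things.

```python
def apply_xy_function(func, X, y, allow_resampling=False):
    """Apply `func(X, y)` and refuse results that change the sample counts.

    X is a list of time series (each a list of rows), y the matching targets.
    """
    X_new, y_new = func(X, y)
    if not allow_resampling:
        before = [len(series) for series in X]
        after = [len(series) for series in X_new]
        if before != after:
            raise ValueError("function changed the number of samples per series")
    return X_new, y_new


def keep_first_column(X, y):
    # Column filter: sample counts stay the same, so the safeguard passes.
    return [[row[:1] for row in series] for series in X], y


def every_other_sample(X, y):
    # Subsampling: sample counts change, so the safeguard rejects it.
    return [series[::2] for series in X], [target[::2] for target in y]


X = [[[1, 2], [3, 4]], [[5, 6], [7, 8], [9, 10]]]
y = [[0, 1], [1, 0, 1]]
X_cols, y_cols = apply_xy_function(keep_first_column, X, y)
X_sub, y_sub = apply_xy_function(every_other_sample, X, y, allow_resampling=True)
```
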
  • Column transformer for segmented data

    Hi David,

    I wrote a simple wrapper to use the sklearn ColumnTransformer on segmented data which is kind of useful when dealing with heterogeneous (multivariate) time series data. I've taken a look into supporting contextual data but did not find an easy way to make the current code work with the TS_Data class. Maybe copying and adapting the whole ColumnTransformer code instead of patching some parts of it could lead to a proper solution to support both. Nevertheless, I hope you find the SegmentedColumnTransformer to be useful.

    Cheers, Matthias

    opened by qtux 4
  • Pype broken with scikit-learn 0.24

    When using Pype with scikit-learn version 0.24 I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/seglearn/pipe.py", line 59, in __init__
        super(Pype, self).__init__(steps, memory)
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py", line 74, in inner_f
        return f(**kwargs)
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 118, in __init__
        self._validate_steps()
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 157, in _validate_steps
        self._validate_names(names)
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 70, in _validate_names
        invalid_names = set(names).intersection(self.get_params(deep=False))
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/pipeline.py", line 137, in get_params
        return self._get_params('steps', deep=deep)
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 29, in _get_params
        out = super().get_params(deep=deep)
      File "/home/wllr/miniconda3/envs/py38/lib/python3.8/site-packages/sklearn/base.py", line 195, in get_params
        value = getattr(self, key)
    AttributeError: 'Pype' object has no attribute 'scorer'
    

    Example to reproduce the error, using:

    • python=3.8
    • scikit-learn=0.24
    • seglearn=1.2.1

    from seglearn.transform import SegmentX
    from seglearn.pipe import Pype
    
    pipe = Pype([('segment', SegmentX())])   # will crash on creation
    

    From a quick look at seglearn's source code (https://github.com/dmbee/seglearn/blob/9000eee4a1edd7dbb0bc4c3b63970c1b71c77a31/seglearn/pipe.py#L58-L64), one solution could be to move the call to the parent class's __init__ to the end, so that all attributes exist before the parent validates its parameters:

    def __init__(self, steps, scorer=None, memory=None):
        self.scorer = scorer
        self.N_train = None
        self.N_test = None
        self.N_fit = None
        self.history = None
        super(Pype, self).__init__(steps, memory)
    
    opened by pwoller 3
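
The failure mode in the traceback can be reproduced without scikit-learn. In the sketch below, `MiniBase` and `MiniPipeline` are hypothetical stand-ins that only imitate the relevant behaviour: parameters are read back with `getattr` for every argument of `__init__`, and the parent constructor triggers that lookup during validation.

```python
import inspect


class MiniBase:
    """Hypothetical stand-in for an estimator base class (not scikit-learn)."""

    def get_params(self):
        # Read back one attribute per constructor argument of the concrete class.
        names = [p for p in inspect.signature(type(self).__init__).parameters
                 if p != "self"]
        return {name: getattr(self, name) for name in names}


class MiniPipeline(MiniBase):
    def __init__(self, steps, memory=None):
        self.steps = steps
        self.memory = memory
        self.get_params()  # validation step that looks up every parameter


class BrokenPype(MiniPipeline):
    def __init__(self, steps, scorer=None, memory=None):
        super().__init__(steps, memory)  # runs get_params before scorer exists
        self.scorer = scorer


class FixedPype(MiniPipeline):
    def __init__(self, steps, scorer=None, memory=None):
        self.scorer = scorer             # set own attributes first
        super().__init__(steps, memory=memory)


try:
    BrokenPype([])
except AttributeError as error:
    broken_message = str(error)

fixed_params = FixedPype([]).get_params()
```
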
  • Will it work for multivariate time series prediction, both regression and classification

    Great code, thanks. Could you clarify whether it will work for multivariate time series prediction, for both regression and classification:

    1. where all values are continuous values, or
    2. where the values are a mixture of continuous and categorical values, for example 2 dimensions with continuous values and 3 dimensions with categorical values?

         color   weight  gender  height  age
    1    black   56      m       160     34
    2    white   77      f       170     54
    3    yellow  87      m       167     43
    4    white   55      m       198     72
    5    white   88      f       176     32

    opened by Sandy4321 3
  • Pype and Pipeline version incompatible

    • Pype version: 1.2.2
    • Sklearn version: 1.0.1
    • Python version: 3.8.10

    Problem: __init__() takes 2 positional arguments but 3 were given

    Solution: change to super(Pype, self).__init__(steps, memory=memory) in seglearn/pipe.py

    opened by MyRespect 1
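
This error message is what Python raises when an argument that the parent declares as keyword-only is passed positionally. The classes below are hypothetical and only demonstrate that mechanism; they are not the seglearn or scikit-learn classes.

```python
class KeywordOnlyPipeline:
    """Hypothetical parent whose `memory` argument is keyword-only."""

    def __init__(self, steps, *, memory=None):
        self.steps = steps
        self.memory = memory


class PositionalChild(KeywordOnlyPipeline):
    def __init__(self, steps, memory=None):
        super().__init__(steps, memory)          # memory passed positionally


class KeywordChild(KeywordOnlyPipeline):
    def __init__(self, steps, memory=None):
        super().__init__(steps, memory=memory)   # memory passed by keyword


try:
    PositionalChild([])
except TypeError as error:
    positional_message = str(error)

keyword_child = KeywordChild([])
```
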
  • Dataframe of multiple multivariate time series

    I have z different time series with different lengths. Each time series has a different number of time points with timestamps, and each time point has m different features and an observed float outcome. My aim is to model a regressor (given the m features, what is the outcome?). I trained a regressor that omits the temporal dimension of the dataset (training on all data points using the m features and predicting the outcome), but it gave poor results. (The data are multiple multivariate time series with different lengths and sampling frequencies.)

    My aim is to add a temporal dimension for each time point (adding new features in a rolling fashion: for each time point, the mean of past feature values, the std of past feature values, etc.). I could not find any example of adding new features to a data frame of multiple multivariate time series with different lengths and sampling frequencies. Can you help me?

    opened by quancore 1
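
The rolling features asked about here can be sketched in plain Python. The example is a simplification and not a seglearn recipe: it handles one feature per series, ignores timestamps and sampling frequency, and the helper `add_past_features` is a hypothetical name.

```python
from statistics import mean, pstdev


def add_past_features(series, window):
    """For each time point, attach the mean and std of up to `window` past values.

    Only values strictly before the current time point are used, so the
    features never leak the current observation. The first point of a series
    has no history and gets None for both statistics.
    """
    rows = []
    for t, value in enumerate(series):
        past = series[max(0, t - window):t]
        if past:
            rows.append((value, mean(past), pstdev(past)))
        else:
            rows.append((value, None, None))
    return rows


# Series of different lengths are processed independently of each other.
datasets = {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0]}
features = {name: add_past_features(values, window=2)
            for name, values in datasets.items()}
```
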
  • Question about data representation

    How can I work with seglearn if I have a data representation that is presented here.

    I have two cases. In the first case I have a variable that is time dependent so I would like to extract features from the previous values in order to build the X matrix.

    2011-01-01 01:00:00    1.073392
    2011-01-01 02:00:00    0.274406
    2011-01-01 03:00:00    1.446233
    2011-01-01 04:00:00   -0.035727
    

    In the second case I have the same problem, but with one or more additional time-dependent variables that I want to use to predict the third one.

    opened by chkoar 5
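
For the first case, building an X matrix from previous values amounts to a lag embedding. The sketch below shows that idea in plain Python using the four values quoted above; `lag_matrix` is a hypothetical helper and the choice of two lags is arbitrary.

```python
def lag_matrix(values, n_lags):
    """Build (X, y) where each row of X holds the `n_lags` values preceding y."""
    X = [values[t - n_lags:t] for t in range(n_lags, len(values))]
    y = [values[t] for t in range(n_lags, len(values))]
    return X, y


values = [1.073392, 0.274406, 1.446233, -0.035727]
X, y = lag_matrix(values, n_lags=2)
```

For the second case, the same construction could be applied to each additional variable and the resulting lag columns placed side by side.
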
  • RollingSplit

    Implement data splitter with rolling splits similar to sklearn.model_selection.TimeSeriesSplit but with compatibility for data sets with more than a single time series and contextual data.

    enhancement help wanted 
    opened by dmbee 1
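
One possible shape for such a splitter is sketched below in plain Python. It is only an illustration of the idea, not a proposed implementation: `rolling_splits` is a hypothetical name, each series is cut into `n_splits + 1` equal folds with any remainder at the end dropped, and contextual data is not handled.

```python
def rolling_splits(series_lengths, n_splits):
    """Yield one split at a time as a list of (train_range, test_range) per series.

    In split k, every series trains on its first k folds and tests on the
    fold that follows, so test data always lies after the training data.
    """
    for n in series_lengths:
        if n // (n_splits + 1) == 0:
            raise ValueError("series too short for the requested number of splits")
    for k in range(1, n_splits + 1):
        split = []
        for n in series_lengths:
            fold = n // (n_splits + 1)
            split.append((range(0, k * fold), range(k * fold, (k + 1) * fold)))
        yield split


# Two series with 8 and 4 samples, three rolling splits.
splits = list(rolling_splits([8, 4], n_splits=3))
```
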
Owner
David Burns
Orthopaedic Surgery Resident; PhD Candidate, Biomedical Engineering; Sunnybrook Research Institute, University of Toronto, Canada