A machine learning toolkit dedicated to time-series data

Overview

tslearn

The machine learning toolkit for time series analysis in Python


Section              Description
Installation         Installing the dependencies and tslearn
Getting started      A quick introduction on how to use tslearn
Available features   An extensive overview of tslearn's functionalities
Documentation        A link to our API reference and a gallery of examples
Contributing         A guide for heroes willing to contribute
Citation             A citation for tslearn for scholarly articles

Installation

tslearn can be installed in several ways:

  • PyPI: python -m pip install tslearn
  • Conda: conda install -c conda-forge tslearn
  • Git: python -m pip install https://github.com/tslearn-team/tslearn/archive/master.zip

In order for the installation to be successful, the required dependencies must be installed. For a more detailed guide on how to install tslearn, please see the Documentation.

Getting started

1. Getting the data in the right format

tslearn expects a time series dataset to be formatted as a 3D numpy array of shape (n_ts, max_sz, d). The three dimensions correspond to the number of time series, the number of measurements per time series and the number of dimensions, respectively. Different solutions exist to get the data in the right format; one of them is the to_time_series_dataset utility used below.

tslearn also supports variable-length time series: in the example below, the third series has one more measurement than the other two.

>>> from tslearn.utils import to_time_series_dataset
>>> my_first_time_series = [1, 3, 4, 2]
>>> my_second_time_series = [1, 2, 4, 2]
>>> my_third_time_series = [1, 2, 4, 2, 2]
>>> X = to_time_series_dataset([my_first_time_series,
...                             my_second_time_series,
...                             my_third_time_series])
>>> y = [0, 1, 1]
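
To make the (n_ts, max_sz, d) layout concrete, here is a minimal pure-Python sketch of what padding variable-length series to a common length looks like. The helpers pad_dataset and shape are hypothetical names written for this illustration; they use nested lists rather than numpy arrays and are not tslearn's implementation. The NaN padding mirrors the nan entries visible in the scaled output of step 2.

```python
import math

def pad_dataset(series_list):
    """Pad univariate series with NaN to a common length and nest them
    as (n_ts, max_sz, d) with d = 1 (illustrative helper, not tslearn code)."""
    max_sz = max(len(s) for s in series_list)
    return [
        [[float(v)] for v in s] + [[math.nan] for _ in range(max_sz - len(s))]
        for s in series_list
    ]

def shape(dataset):
    """Return (n_ts, max_sz, d) for a nested-list dataset."""
    return (len(dataset), len(dataset[0]), len(dataset[0][0]))

padded = pad_dataset([[1, 3, 4, 2], [1, 2, 4, 2], [1, 2, 4, 2, 2]])
print(shape(padded))   # (3, 5, 1)
print(padded[2][-1])   # [2.0]
```

The two shorter series end with a NaN entry, while the longest one keeps all five of its measurements.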

2. Data preprocessing and transformations

tslearn provides several optional utilities to preprocess the data. To help algorithms converge, you can scale the time series. To speed up training, you can resample the data or apply a piecewise transformation.

>>> from tslearn.preprocessing import TimeSeriesScalerMinMax
>>> X_scaled = TimeSeriesScalerMinMax().fit_transform(X)
>>> print(X_scaled)
[[[0.] [0.667] [1.] [0.333] [nan]]
 [[0.] [0.333] [1.] [0.333] [nan]]
 [[0.] [0.333] [1.] [0.333] [0.333]]]

3. Training a model

After getting the data in the right format, a model can be trained. Depending on the use case, tslearn supports different tasks: classification, clustering and regression. For an extensive overview of possibilities, check out our gallery of examples.

>>> from tslearn.neighbors import KNeighborsTimeSeriesClassifier
>>> knn = KNeighborsTimeSeriesClassifier(n_neighbors=1)
>>> knn.fit(X_scaled, y)
>>> print(knn.predict(X_scaled))
[0 1 1]

As can be seen, the models in tslearn follow the same API as those of the well-known scikit-learn. Moreover, they are fully compatible with it, so scikit-learn utilities such as hyper-parameter tuning and pipelines can be used with them.
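
For readers who want to see what a 1-nearest-neighbor time series classifier does under the hood, here is a pure-Python sketch that assumes Dynamic Time Warping (DTW) as the metric. The functions dtw and predict_1nn are written for this illustration (squared point-wise costs, square root of the accumulated cost); they are not tslearn's implementation and operate on the unscaled series.

```python
import math

def dtw(a, b):
    """DTW distance between two univariate series via dynamic programming."""
    n, m = len(a), len(b)
    # acc[i][j] holds the cheapest accumulated cost aligning a[:i] with b[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return math.sqrt(acc[n][m])

def predict_1nn(train_X, train_y, query):
    """Return the label of the training series closest to `query` under DTW."""
    distances = [dtw(query, x) for x in train_X]
    return train_y[distances.index(min(distances))]

train_X = [[1, 3, 4, 2], [1, 2, 4, 2], [1, 2, 4, 2, 2]]
train_y = [0, 1, 1]
print([predict_1nn(train_X, train_y, x) for x in train_X])  # [0, 1, 1]
print(dtw([1, 2, 4, 2], [1, 2, 4, 2, 2]))                   # 0.0
```

Note that DTW compares series of different lengths directly: the second and third series are at distance 0 because the extra trailing 2 can be aligned with the previous 2.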

4. More analyses

tslearn also supports other types of analysis. Examples include calculating barycenters of a group of time series or calculating the distances between time series using a variety of distance metrics.
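
As a rough illustration of what a barycenter is, the sketch below computes the simplest variant: the point-wise (Euclidean) mean of equal-length univariate series. The helper is written for this example and is not tslearn's code; barycenters under elastic metrics such as DTW need an alignment step that this sketch does not attempt.

```python
def euclidean_barycenter(series_list):
    """Point-wise mean of equal-length univariate series (illustrative helper)."""
    length = len(series_list[0])
    if any(len(s) != length for s in series_list):
        raise ValueError("all series must have the same length")
    # zip(*series_list) groups the values observed at each time step.
    return [sum(values) / len(series_list) for values in zip(*series_list)]

center = euclidean_barycenter([[1, 3, 4, 2], [1, 2, 4, 2]])
print(center)  # [1.0, 2.5, 4.0, 2.0]
```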

Available features

data               processing   clustering         classification         regression      metrics
UCR Datasets       Scaling      TimeSeriesKMeans   KNN Classifier         KNN Regressor   Dynamic Time Warping
Generators         Piecewise    KShape             TimeSeriesSVC          TimeSeriesSVR   Global Alignment Kernel
Conversion(1, 2)   -            KernelKmeans       ShapeletModel          MLP             Barycenters
-                  -            -                  Early Classification   -               Matrix Profile

Documentation

The documentation is hosted at readthedocs. It includes an API reference, a gallery of examples and a user guide.

Contributing

If you would like to contribute to tslearn, please have a look at our contribution guidelines. A list of interesting TODOs can be found here. If you want other ML methods for time series to be added to this TODO list, do not hesitate to open an issue!

Referencing tslearn

If you use tslearn in a scientific publication, we would appreciate citations:

@article{JMLR:v21:20-091,
  author  = {Romain Tavenard and Johann Faouzi and Gilles Vandewiele and 
             Felix Divo and Guillaume Androz and Chester Holtz and 
             Marie Payne and Roman Yurchak and Marc Ru{\ss}wurm and 
             Kushal Kolar and Eli Woods},
  title   = {Tslearn, A Machine Learning Toolkit for Time Series Data},
  journal = {Journal of Machine Learning Research},
  year    = {2020},
  volume  = {21},
  number  = {118},
  pages   = {1-6},
  url     = {http://jmlr.org/papers/v21/20-091.html}
}

Acknowledgments

The authors would like to thank Mathieu Blondel for providing code for Kernel k-means and Soft-DTW.

Issues
  • [MRG] Flow to test for sklearn compatibility

    Hello,

    This PR makes it possible to automatically test whether all tslearn estimators comply with the checks required by sklearn, so that they can be used with its utilities such as GridSearchCV, Pipeline, ... The code to do this is currently located in tslearn/testing_utils.py, but should be moved to tslearn/testing when available.

    I also included an example demonstrating how GlobalGAKMeans can now be used with an sklearn pipeline, in tslearn/docs/examples/plot_gakkmeans_sklearn.

    All feedback is more than welcome!

    Kind regards, Gilles

    opened by GillesVandewiele 162
  • [WIP] Save models to hdf5 and other formats

    Hi,

    I thought it would be useful to save the KShape model without pickling. I implemented a simple to_hdf5() method for saving a KShape model to an hdf5 file and from_hdf5() for reloading it so that predictions can be done with the model.

    Changes to the KShape class:

    • the class attribute "model_attrs" is a list of attributes that are sufficient to describe the model.
    • to_dict() method packages the model attributes and params to a dict.
    • to_hdf5() and from_hdf5() can be used to save/load the model to/from hdf5 files.
    • put instance attributes in constructor

    An hdftools module is added to handle saving a dict of numpy arrays to an hdf file.

    Usage:

    ks.to_hdf5('/path/to/file.h5')
    model = KShape.from_hdf5('path/to/file.h5')
    
    opened by kushalkolar 37
  • [MRG] Adding SAX+MINDIST to KNN

    This PR contains the following changes:

    • 'sax' is now a valid metric for KNN:
    knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='sax')
    
    • Added BaseEstimator to classes in preprocessing module so that they can be used within a Pipeline (errors were raised when using TimeSeriesScalerMeanVariance)

    • Fixed a bug in kneighbors method which would always return [0] as nearest neighbor for every sample.

    knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='dtw')
    knn.fit(X_train, y_train)
    _, ind = knn.kneighbors(X_test)
    # ind would be filled with 0's
    
    • Slightly changed the code of kneighbors so that its result is consistent with sklearn. There was a small difference in breaking ties (tslearn would pick the largest index while sklearn would pick the smallest index). Now the following two snippets are equivalent:
    knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='dtw')
    knn.fit(X_train, y_train)
    _, ind = knn.kneighbors(X_test)
    
    knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='precomputed')
    all_X = numpy.vstack((X_train, X_test))
    distances = pairwise_distances(all_X, metric=dtw)
    X_train = distances[:len(X_train), :len(X_train)]
    X_test = distances[len(X_train):, :len(X_train)]
    knn.fit(X_train, y_train)
    _, ind = knn.kneighbors(X_test)
    
    # both ind vectors are now equal (while that was not the case before this PR)
    

    Some remarks:

    • I am inexperienced with numba; adding an njit decorator to cdist_sax did not work immediately, so I could use some help with that.
    opened by GillesVandewiele 37
  • [MRG] Shapelet Support Tensorflow 2

    Made a few changes to support Tensorflow 2 and remove Keras as a separate dependency. I'm just testing out tslearn and am not sure if these changes are wanted. No offense will be taken if these don't get included. :)

    Have a great day, I'm excited to see what tslearn has to offer.

    opened by page1 34
  • Replace implicit imports with explicit imports

    Fixes #134

    As title says, the implicit imports are replaced with explicit imports in test_estimators.py. It was a bit hard to find some of them from scikit-learn. Let's see if it improves code coverage.

    opened by johannfaouzi 27
  • [MRG] Accept variable-length time series for some pairs metrics/estimators

    This is an attempt to make it possible to use estimators with metrics like DTW on variable-length time series.

    The first attempt here is to make DTW/soft-DTW usable for kNN estimators on variable-length time series.

    The test I ran is:

    from tslearn.neighbors import KNeighborsTimeSeriesClassifier
    from tslearn.utils import to_time_series_dataset
    
    
    X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])
    y = [0, 0, 1]
    
    clf = KNeighborsTimeSeriesClassifier(metric="dtw",
                                         n_neighbors=1,
                                         metric_params={"global_constraint": "sakoe_chiba"})
    clf.fit(X, y)
    print("---", clf._ts_fit)
    print(clf.predict(X))
    

    First, we have to think about whether the hack I introduced is a good way to reach our goal and second, once we have chosen a way to proceed, we will have to:

    • do the same for other estimators (all those that accept dtw, soft-dtw, gak as metrics, ideally)
    • find a way to hack sklearn k-fold variants, since there are some checks for all-finite entries in the datasets there which fail for variable-length time series, if I remember correctly

    @GillesVandewiele since you recently worked on making the estimators sklearn-compatible, could you review this PR?

    opened by rtavenar 27
  • Make binary wheels for all platforms

    Making binary wheels and uploading them to PyPI would make it possible to pip install tslearn without needing a compiler or Cython.

    Usually this requires quite a bit of work, see e.g. https://github.com/MacPython/scikit-learn-wheels/. However there is a shortcut with https://github.com/regro/conda-press that might allow generating wheels from conda-forge builds. I have not used it yet personally, but it could be worth a try.

    opened by rth 23
  • kNN using SAX+MINDIST

    When using this class, what are the available "metric" parameters that can be used? Only "dtw"? Any recommendation if I want to use Euclidean or, for example, the SAX distance when using this classifier on a dataset with a SAX representation?

    new feature 
    opened by ManuelMonteiro24 22
  • [WIP] Fix sklearn import deprecation warnings

    This PR fixes the deprecation warnings that are raised when importing certain (now private) modules from sklearn.

    Private API

    Many things will move to a private API in the new sklearn version. Their module name will change and have a leading underscore. e.g. sklearn.neighbors.base becomes sklearn.neighbors._base. Unfortunately, these new module names will cause a crash in environments with older sklearn versions.

    The proposed fix for all deprecation warnings is the following:

    try:
        from sklearn.neighbors._base import KNeighborsMixin
    except ImportError:
        from sklearn.neighbors.base import KNeighborsMixin
    
    opened by GillesVandewiele 20
  • Add initial guess as centroid

    Following issue #58, here is a proposal to improve clustering (only for the KShape method for now) by letting the user choose an initial guess for the centroids. This guess is a numpy array of ints giving the indices of the samples to be used as centroids instead of a random vector.

    opened by gandroz 19
  • Scalable matrix profile

    Is your feature request related to a problem? Please describe. tslearn has a matrix profile module that relies on a naive implementation. Based on a discussion with @seanlaw in #126 we could maybe consider having STUMPY as an optional dependency for this matrixprofile module in order to benefit from their scalable implementations.

    Describe the solution you'd like That would require to improve on the existing MatrixProfile class by allowing to pick an implementation (using parameters passed at __init__ time) and the _transform(...) method should call the correct function

    One additional thing to check is how stumpy deals with:

    • [x] variable-length time series
    • [ ] multidimensional time series

    I will probably not have time to work on it. If anyone is interested in giving a hand on this, feel free to say so.

    new feature good first issue 
    opened by rtavenar 18
  • [Bugfix] Fixes an off-by-one error in kshape. (Issue #385)

    Problem pointed out in issue #385. The original paper uses MATLAB-style indexing, i.e. starting on 1, as opposed to Python-style indexing starting on 0. Therefore there was a mismatch between tslearn and line 6, Algorithm 1 of that paper.

    It should be noted that this indexing issue seems to be accounted for in this PR.

    opened by temcomp 0
  • Can we use GPU and PySpark to improve on clustering time for TimeSeriesKMeans.

    Dear Dev Team,

    @ecederstrand @rth @rflamary @apachaves @felixdivo

    Can we use GPU and PySpark to improve clustering time for TimeSeriesKMeans? I tried using n_jobs for parallel processing in Databricks, but the time taken for clustering is the same on an 8-CPU and a 32-CPU machine, so it clearly doesn't help.

    Can you please suggest what can be the best approach to reduce the time matrix.

    Thanks, Ishwar Sukheja

    new feature 
    opened by sukhejai 2
  • [Error] ModuleNotFoundError: No module named 'tslearn.metrics.cysax'

    Describe the bug I got an error when I was trying to run pytest on one of the unit tests. It cannot find the module tslearn.metrics.cysax

    To Reproduce If it helps, I did the following step:

    # in my conda environment
    # in tslearn directory
    
    pip install -r requirement.txt
    python setup.py install 
    pytest tslearn/tests/test_metrics.py
    

    Expected behavior The file does exist, so there shouldn't have been a problem! I also checked out the setup.cfg but couldn't figure out the cause.

    Environment (please complete the following information):

    • OS: Win10
    • tslearn version: 0.5.2

    Additional context I am providing the error below.

    $ pytest tslearn/tests/test_metrics.py
    ============================= test session starts =============================
    platform win32 -- Python 3.8.5, pytest-7.1.2, pluggy-1.0.0
    rootdir: E:\+Machine_Learning_Journey\contributions\tslearn, configfile: setup.cfg
    plugins: anyio-3.5.0
    collected 0 items / 1 error
    
    =================================== ERRORS ====================================
    _______________ ERROR collecting tslearn/tests/test_metrics.py ________________
    ImportError while importing test module 'E:\+Machine_Learning_Journey\contributions\tslearn\tslearn\tests\test_metrics.py'.
    Hint: make sure your test modules/packages have valid Python names.
    Traceback:
    C:\Users\nimas\anaconda3\lib\importlib\__init__.py:127: in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
    tslearn\tests\test_metrics.py:4: in <module>
        import tslearn.metrics
    tslearn\metrics\__init__.py:18: in <module>
        from .sax import cdist_sax
    tslearn\metrics\sax.py:2: in <module>
        from .cysax import cydist_sax
    E   ModuleNotFoundError: No module named 'tslearn.metrics.cysax'
    ============================== warnings summary ===============================
    <frozen importlib._bootstrap>:219
      <frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 80 from C header, got 88 from PyObject
    
    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    =========================== short test summary info ===========================
    ERROR tslearn/tests/test_metrics.py
    !!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
    ========================= 1 warning, 1 error in 1.72s =========================
    
    
    bug 
    opened by NimaSarajpoor 1
  • Feature Importance/Influence in Multivariate Time Series Clustering

    Is there a way to determine the importance of each features in multivariate time series for the decision of the clustering? For example, feature x has the most influence in cluster y.

    My time series is modeled as (n_ts, ts_length, n_dim) with n_dim as the number of features.

    new feature 
    opened by ajanadj 1
  • Remove redundant second prange in inner for-loop

    According to the numba documentation, loop serialization occurs when there is more than one prange in nested loops. In these cases, only the outermost prange is executed in parallel. This PR just cleans up the code a little bit.


    NOTE: this is neither a bug nor a feature request. So, I couldn't open an issue as I couldn't find any proper option there.

    opened by NimaSarajpoor 19
Releases(v0.5.2)
Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis.

Kats, a kit to analyze time series data, a lightweight, easy-to-use, generalizable, and extendable framework to perform time series analysis, from understanding the key statistics and characteristics, detecting change points and anomalies, to forecasting future trends.

Facebook Research 3.9k Aug 5, 2022
Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

SDK: Overview of the Kubeflow pipelines service Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on

Kubeflow 2.9k Aug 8, 2022
A data preprocessing package for time series data. Design for machine learning and deep learning.

Allen Chiang 144 Aug 5, 2022
A Python toolkit for rule-based/unsupervised anomaly detection in time series

Anomaly Detection Toolkit (ADTK) Anomaly Detection Toolkit (ADTK) is a Python package for unsupervised / rule-based time series anomaly detection. As

Arundo Analytics 844 Aug 10, 2022
A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

The Alan Turing Institute 5.6k Aug 13, 2022
Python module for machine learning time series:

seglearn Seglearn is a python package for machine learning time series or sequences. It provides an integrated pipeline for segmentation, feature extr

David Burns 522 Jul 31, 2022
Merlion: A Machine Learning Framework for Time Series Intelligence

Merlion is a Python library for time series intelligence. It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance. I

Salesforce 2.6k Aug 3, 2022
Empyrial is a Python-based open-source quantitative investment library dedicated to financial institutions and retail investors

By Investors, For Investors. Want to read this in Chinese? Click here Empyrial is a Python-based open-source quantitative investment library dedicated

Santosh 577 Aug 7, 2022
A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

Davis E. King 11.3k Aug 6, 2022
ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

Broad Institute 50 Aug 1, 2022
Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Prophet: Automatic Forecasting Procedure Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends ar

Facebook 14.8k Aug 9, 2022
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

TD Ameritrade 2.3k Aug 5, 2022
Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

AutoViz and Auto_ViML 466 Aug 8, 2022
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

Aaron Zuspan 73 Jul 30, 2022
MaD GUI is a basis for graphical annotation and computational analysis of time series data.

MaD GUI Machine Learning and Data Analytics Graphical User Interface MaD GUI is a basis for graphical annotation and computational analysis of time se

Machine Learning and Data Analytics Lab FAU 6 Aug 9, 2022
PyPOTS - A Python Toolbox for Data Mining on Partially-Observed Time Series

A python toolbox/library for data mining on partially-observed time series, supporting tasks of forecasting/imputation/classification/clustering on incomplete multivariate time series with missing values.

Wenjie Du 82 Aug 3, 2022
Examples and code for the Practical Machine Learning workshop series

Practical Machine Learning Workshop Series Practical Machine Learning for Quantitative Finance Post conference workshop at the WBS Spring Conference D

CompatibL 21 Jun 25, 2022
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 17 Jul 22, 2022
Model Validation Toolkit is a collection of tools to assist with validating machine learning models prior to deploying them to production and monitoring them after deployment to production.

FINRA 23 Jul 27, 2022