
STUMPY Logo

STUMPY

STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of time series data mining tasks such as:

  • pattern/motif (approximately repeated subsequences within a longer time series) discovery
  • anomaly/novelty (discord) discovery
  • shapelet discovery
  • semantic segmentation
  • streaming (on-line) data
  • fast approximate matrix profiles
  • time series chains (temporally ordered set of subsequence patterns)
  • and more ...

Whether you are an academic, data scientist, software developer, or time series enthusiast, STUMPY is straightforward to install and our goal is to allow you to get to your time series insights faster. See documentation for more information.

How to use STUMPY

Please see our API documentation for a complete list of available functions and see our informative tutorials for more comprehensive example use cases. Below, you will find code snippets that quickly demonstrate how to use STUMPY.

Typical usage (1-dimensional time series data) with STUMP:

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stump(your_time_series, m=window_size)
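If you are curious what `stump` computes under the hood, here is a brute-force sketch of the z-normalized matrix profile in plain NumPy. This is an illustrative O(n²) toy for intuition only (it assumes no constant subsequences and ignores the optimizations and edge cases that STUMPY handles), not STUMPY's implementation:

```python
import numpy as np

def naive_matrix_profile(T, m):
    """Brute force: for each subsequence, the z-normalized Euclidean
    distance to its nearest non-trivial neighbor (and that neighbor's index)."""
    n = len(T) - m + 1
    # z-normalize every length-m subsequence (assumes no constant windows)
    S = np.array([T[i:i + m] for i in range(n)])
    S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
    P = np.full(n, np.inf)
    I = np.full(n, -1)
    excl = int(np.ceil(m / 4))  # exclusion zone to skip trivial self-matches
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= excl:
                continue
            d = np.linalg.norm(S[i] - S[j])
            if d < P[i]:
                P[i], I[i] = d, j
    return P, I
```

The smallest value in `P` marks the best motif pair; the largest marks the top discord. `stumpy.stump` returns the same information (profile values and indices) orders of magnitude faster.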

Distributed usage for 1-dimensional time series data with Dask Distributed via STUMPED:

import stumpy
import numpy as np
from dask.distributed import Client

if __name__ == "__main__":
    dask_client = Client()

    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stumped(dask_client, your_time_series, m=window_size)

GPU usage for 1-dimensional time series data with GPU-STUMP:

import stumpy
import numpy as np
from numba import cuda

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern
    all_gpu_devices = [device.id for device in cuda.list_devices()]  # Get a list of all available GPU devices

    matrix_profile = stumpy.gpu_stump(your_time_series, m=window_size, device_id=all_gpu_devices)

Multi-dimensional time series data with MSTUMP:

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(3, 1000)  # Each row represents data from a different dimension while each column represents data from the same point in time
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile, matrix_profile_indices = stumpy.mstump(your_time_series, m=window_size)

Distributed multi-dimensional time series data analysis with Dask Distributed MSTUMPED:

import stumpy
import numpy as np
from dask.distributed import Client

if __name__ == "__main__":
    dask_client = Client()

    your_time_series = np.random.rand(3, 1000)  # Each row represents data from a different dimension while each column represents data from the same point in time
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile, matrix_profile_indices = stumpy.mstumped(dask_client, your_time_series, m=window_size)

Time Series Chains with Anchored Time Series Chains (ATSC):

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stump(your_time_series, m=window_size)

    left_matrix_profile_index = matrix_profile[:, 2]
    right_matrix_profile_index = matrix_profile[:, 3]
    idx = 10  # Subsequence index for which to retrieve the anchored time series chain

    anchored_chain = stumpy.atsc(left_matrix_profile_index, right_matrix_profile_index, idx)

    all_chain_set, longest_unanchored_chain = stumpy.allc(left_matrix_profile_index, right_matrix_profile_index)
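Conceptually, an anchored chain is built by repeatedly hopping from a subsequence to its right nearest neighbor, as long as that neighbor's left nearest neighbor points back (a bidirectionally linked pair). A hedged sketch of that walk, using the left/right index columns from above (the helper name `follow_chain` is our own; use `stumpy.atsc` in practice):

```python
import numpy as np

def follow_chain(left_index, right_index, j):
    """Walk the anchored chain starting at j: hop to the right nearest
    neighbor while that neighbor's left nearest neighbor points back.
    A value of -1 means 'no neighbor' and terminates the walk."""
    chain = [j]
    while right_index[j] >= 0 and left_index[right_index[j]] == j:
        j = right_index[j]
        chain.append(j)
    return np.array(chain)
```

Each hop in the chain is a pattern drifting forward in time, which is why chains capture gradually evolving behavior rather than exact repeats.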

Semantic Segmentation with Fast Low-cost Unipotent Semantic Segmentation (FLUSS):

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stump(your_time_series, m=window_size)

    subseq_len = 50
    correct_arc_curve, regime_locations = stumpy.fluss(
        matrix_profile[:, 1],
        L=subseq_len,
        n_regimes=2,
        excl_factor=1,
    )
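For intuition, FLUSS is built on an "arc curve": each subsequence draws an arc to its nearest neighbor, and locations crossed by few arcs are likely regime boundaries. A naive count of arc crossings (our own illustrative helper, not the STUMPY API):

```python
import numpy as np

def naive_arc_curve(mp_index):
    """For each location, count how many nearest-neighbor arcs
    (i -> mp_index[i]) pass strictly over it."""
    n = len(mp_index)
    arc_curve = np.zeros(n, dtype=int)
    for i, j in enumerate(mp_index):
        lo, hi = sorted((i, int(j)))
        arc_curve[lo + 1:hi] += 1  # the arc crosses everything strictly between
    return arc_curve
```

FLUSS additionally corrects this raw count for the number of crossings expected by chance, then reports the locations of the minima as the regime changes.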

Dependencies

Supported Python and NumPy versions are determined according to the NEP 29 deprecation policy.

Where to get it

Conda install (preferred):

conda install -c conda-forge stumpy

PyPI install, presuming you have numpy, scipy, and numba installed:

python -m pip install stumpy

To install stumpy from source, see the instructions in the documentation.

Documentation

In order to fully understand and appreciate the underlying algorithms and applications, it is imperative that you read the original publications. For a more detailed example of how to use STUMPY please consult the latest documentation or explore the following tutorials:

  1. The Matrix Profile
  2. STUMPY Basics
  3. Time Series Chains
  4. Semantic Segmentation

Performance

We tested the performance of computing the exact matrix profile using the Numba JIT compiled version of the code on randomly generated time series data with various lengths (i.e., np.random.rand(n)) along with different CPU and GPU hardware resources.

STUMPY Performance Plot

The raw results are displayed in the table below as Hours:Minutes:Seconds.Milliseconds and with a constant window size of m = 50. Note that these reported runtimes include the time that it takes to move the data from the host to all of the GPU device(s). You may need to scroll to the right side of the table in order to see all of the runtimes.
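If you want to compare runtimes from the table numerically, the Hours:Minutes:Seconds.Milliseconds strings can be converted with a small helper (ours, not part of STUMPY):

```python
def runtime_to_seconds(runtime):
    """Convert an 'Hours:Minutes:Seconds.Milliseconds' string
    (hours may exceed 24) to total seconds."""
    hours, minutes, seconds = runtime.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
```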

i n = 2^i GPU-STOMP STUMP.2 STUMP.16 STUMPED.128 STUMPED.256 GPU-STUMP.1 GPU-STUMP.2 GPU-STUMP.DGX1 GPU-STUMP.DGX2
6 64 00:00:10.00 00:00:00.00 00:00:00.00 00:00:05.77 00:00:06.08 00:00:00.03 00:00:01.63 NaN NaN
7 128 00:00:10.00 00:00:00.00 00:00:00.00 00:00:05.93 00:00:07.29 00:00:00.04 00:00:01.66 NaN NaN
8 256 00:00:10.00 00:00:00.00 00:00:00.01 00:00:05.95 00:00:07.59 00:00:00.08 00:00:01.69 00:00:06.68 00:00:25.68
9 512 00:00:10.00 00:00:00.00 00:00:00.02 00:00:05.97 00:00:07.47 00:00:00.13 00:00:01.66 00:00:06.59 00:00:27.66
10 1024 00:00:10.00 00:00:00.02 00:00:00.04 00:00:05.69 00:00:07.64 00:00:00.24 00:00:01.72 00:00:06.70 00:00:30.49
11 2048 NaN 00:00:00.05 00:00:00.09 00:00:05.60 00:00:07.83 00:00:00.53 00:00:01.88 00:00:06.87 00:00:31.09
12 4096 NaN 00:00:00.22 00:00:00.19 00:00:06.26 00:00:07.90 00:00:01.04 00:00:02.19 00:00:06.91 00:00:33.93
13 8192 NaN 00:00:00.50 00:00:00.41 00:00:06.29 00:00:07.73 00:00:01.97 00:00:02.49 00:00:06.61 00:00:33.81
14 16384 NaN 00:00:01.79 00:00:00.99 00:00:06.24 00:00:08.18 00:00:03.69 00:00:03.29 00:00:07.36 00:00:35.23
15 32768 NaN 00:00:06.17 00:00:02.39 00:00:06.48 00:00:08.29 00:00:07.45 00:00:04.93 00:00:07.02 00:00:36.09
16 65536 NaN 00:00:22.94 00:00:06.42 00:00:07.33 00:00:09.01 00:00:14.89 00:00:08.12 00:00:08.10 00:00:36.54
17 131072 00:00:10.00 00:01:29.27 00:00:19.52 00:00:09.75 00:00:10.53 00:00:29.97 00:00:15.42 00:00:09.45 00:00:37.33
18 262144 00:00:18.00 00:05:56.50 00:01:08.44 00:00:33.38 00:00:24.07 00:00:59.62 00:00:27.41 00:00:13.18 00:00:39.30
19 524288 00:00:46.00 00:25:34.58 00:03:56.82 00:01:35.27 00:03:43.66 00:01:56.67 00:00:54.05 00:00:19.65 00:00:41.45
20 1048576 00:02:30.00 01:51:13.43 00:19:54.75 00:04:37.15 00:03:01.16 00:05:06.48 00:02:24.73 00:00:32.95 00:00:46.14
21 2097152 00:09:15.00 09:25:47.64 03:05:07.64 00:13:36.51 00:08:47.47 00:20:27.94 00:09:41.43 00:01:06.51 00:01:02.67
22 4194304 NaN 36:12:23.74 10:37:51.21 00:55:44.43 00:32:06.70 01:21:12.33 00:38:30.86 00:04:03.26 00:02:23.47
23 8388608 NaN 143:16:09.94 38:42:51.42 03:33:30.53 02:00:49.37 05:11:44.45 02:33:14.60 00:15:46.26 00:08:03.76
24 16777216 NaN NaN NaN 14:39:11.99 07:13:47.12 20:43:03.80 09:48:43.42 01:00:24.06 00:29:07.84
NaN 17729800 09:16:12.00 NaN NaN 15:31:31.75 07:18:42.54 23:09:22.43 10:54:08.64 01:07:35.39 00:32:51.55
25 33554432 NaN NaN NaN 56:03:46.81 26:27:41.29 83:29:21.06 39:17:43.82 03:59:32.79 01:54:56.52
26 67108864 NaN NaN NaN 211:17:37.60 106:40:17.17 328:58:04.68 157:18:30.50 15:42:15.94 07:18:52.91
NaN 100000000 291:07:12.00 NaN NaN NaN 234:51:35.39 NaN NaN 35:03:44.61 16:22:40.81
27 134217728 NaN NaN NaN NaN NaN NaN NaN 64:41:55.09 29:13:48.12

Hardware Resources

GPU-STOMP: These results are reproduced from the original Matrix Profile II paper - NVIDIA Tesla K80 (contains 2 GPUs) - and serve as the performance benchmark to compare against.

STUMP.2: stumpy.stump executed with 2 CPUs in Total - 2x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors parallelized with Numba on a single server without Dask.

STUMP.16: stumpy.stump executed with 16 CPUs in Total - 16x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors parallelized with Numba on a single server without Dask.

STUMPED.128: stumpy.stumped executed with 128 CPUs in Total - 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 16 servers, parallelized with Numba, and distributed with Dask Distributed.

STUMPED.256: stumpy.stumped executed with 256 CPUs in Total - 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 32 servers, parallelized with Numba, and distributed with Dask Distributed.

GPU-STUMP.1: stumpy.gpu_stump executed with 1x NVIDIA GeForce GTX 1080 Ti GPU, 512 threads per block, 200W power limit, compiled to CUDA with Numba, and parallelized with Python multiprocessing

GPU-STUMP.2: stumpy.gpu_stump executed with 2x NVIDIA GeForce GTX 1080 Ti GPU, 512 threads per block, 200W power limit, compiled to CUDA with Numba, and parallelized with Python multiprocessing

GPU-STUMP.DGX1: stumpy.gpu_stump executed with 8x NVIDIA Tesla V100, 512 threads per block, compiled to CUDA with Numba, and parallelized with Python multiprocessing

GPU-STUMP.DGX2: stumpy.gpu_stump executed with 16x NVIDIA Tesla V100, 512 threads per block, compiled to CUDA with Numba, and parallelized with Python multiprocessing

Running Tests

Tests are located in the tests directory and are run with PyTest; coverage.py is required for code coverage analysis. Tests can be executed with:

./test.sh

Python Version

STUMPY supports Python 3.7+ and, due to the use of unicode variable names/identifiers, is not compatible with Python 2.x. Given its small set of dependencies, STUMPY may work on older versions of Python, but this is beyond the scope of our support and we strongly recommend that you upgrade to the most recent version of Python.

Getting Help

First, please check the discussions and issues on GitHub to see if your question has already been answered there. If no solution is available, feel free to open a new discussion or issue and the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

S.M. Law, (2019). STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining. Journal of Open Source Software, 4(39), 1504.
@article{law2019stumpy,
  title={{STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining}},
  author={Law, Sean M.},
  journal={{The Journal of Open Source Software}},
  volume={4},
  number={39},
  pages={1504},
  year={2019}
}

References

Yeh, Chin-Chia Michael, et al. (2016) Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords, and Shapelets. ICDM:1317-1322. Link

Zhu, Yan, et al. (2016) Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. ICDM:739-748. Link

Yeh, Chin-Chia Michael, et al. (2017) Matrix Profile VI: Meaningful Multidimensional Motif Discovery. ICDM:565-574. Link

Zhu, Yan, et al. (2017) Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. ICDM:695-704. Link

Gharghabi, Shaghayegh, et al. (2017) Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels. ICDM:117-126. Link

Zhu, Yan, et al. (2017) Exploiting a Novel Algorithm and GPUs to Break the Ten Quadrillion Pairwise Comparisons Barrier for Time Series Motifs and Joins. KAIS:203-236. Link

Zhu, Yan, et al. (2018) Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds. ICDM:837-846. Link

Yeh, Chin-Chia Michael, et al. (2018) Time Series Joins, Motifs, Discords and Shapelets: a Unifying View that Exploits the Matrix Profile. Data Min Knowl Disc:83-123. Link

Gharghabi, Shaghayegh, et al. (2018) "Matrix Profile XII: MPdist: A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios." ICDM:965-970. Link

Zimmerman, Zachary, et al. (2019) Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond. SoCC '19:74-86. Link

Akbarinia, Reza, and Betrand Cloez. (2019) Efficient Matrix Profile Computation Using Different Distance Functions. arXiv:1901.05708. Link

Kamgar, Kaveh, et al. (2019) Matrix Profile XV: Exploiting Time Series Consensus Motifs to Find Structure in Time Series Sets. ICDM:1156-1161. Link

License & Trademark

STUMPY
Copyright 2019 TD Ameritrade. Released under the terms of the 3-Clause BSD license.
STUMPY is a trademark of TD Ameritrade IP Company, Inc. All rights reserved.