Time Series Cross-Validation -- an extension for scikit-learn

Wenjie Zheng

Last update: Jan 1, 2023

Related tags

Deep Learning data-science machine-learning time-series cross-validation model-selection hyperparameter-optimization tuning-parameters backtesting

Overview

TSCV: Time Series Cross-Validation

This repository is a scikit-learn extension for time series cross-validation. It introduces gaps between the training set and the test set, which mitigates the temporal dependence of time series and prevents information leakage.

Installation

pip install tscv

conda install -c conda-forge tscv

Usage

This extension defines 3 cross-validator classes and 1 function:

GapLeavePOut
GapKFold
GapRollForward
gap_train_test_split

The three classes can all be passed, as the cv argument, to scikit-learn functions such as cross_validate, cross_val_score, and cross_val_predict, just like the native cross-validator classes.

The one function is an alternative to the train_test_split function in scikit-learn.

Examples

The following example uses GapKFold instead of KFold as the cross-validator.

import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score
from tscv import GapKFold

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# use GapKFold as the cross-validator
cv = GapKFold(n_splits=5, gap_before=5, gap_after=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)

The following example uses gap_train_test_split to split the data set into the training set and the test set.

import numpy as np
from tscv import gap_train_test_split

X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = gap_train_test_split(X, y, test_size=2, gap_size=2)
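
As a rough illustration of what a gapped split looks like, the following sketch rebuilds such a split by hand with NumPy indexing. The helper name `manual_gap_split` is hypothetical and is not part of tscv; it assumes the training block comes first, followed by the gap, then the test block.

```python
import numpy as np


def manual_gap_split(n_samples, test_size, gap_size):
    """Hypothetical helper (not tscv code): index-based train/gap/test split."""
    n_train = n_samples - gap_size - test_size
    if n_train <= 0:
        raise ValueError("test_size + gap_size must be smaller than n_samples")
    train_idx = np.arange(n_train)                       # earliest samples
    test_idx = np.arange(n_train + gap_size, n_samples)  # latest samples
    return train_idx, test_idx


X, y = np.arange(20).reshape((10, 2)), np.arange(10)
train_idx, test_idx = manual_gap_split(len(X), test_size=2, gap_size=2)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(y_train)  # [0 1 2 3 4 5]
print(y_test)   # [8 9]
```

Samples 6 and 7 fall in the gap and are used by neither set.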

Contributing

Report bugs in the issue tracker
Express your use cases in the issue tracker

Documentation

tscv.readthedocs.io

Acknowledgments

I would like to thank Jeffrey Racine and Christoph Bergmeir for the helpful discussion.

License

BSD-3-Clause

Citation

Wenjie Zheng. (2021). Time Series Cross-Validation (TSCV): an extension for scikit-learn. Zenodo. http://doi.org/10.5281/zenodo.4707309

@software{zheng_2021_4707309,
  title={{Time Series Cross-Validation (TSCV): an extension for scikit-learn}},
  author={Zheng, Wenjie},
  month={april},
  year={2021},
  publisher={Zenodo},
  doi={10.5281/zenodo.4707309},
  url={http://doi.org/10.5281/zenodo.4707309}
}

Comments

Make it work with cross_val_predict
Is it possible to somehow make the CV work with the cross_val_predict function? For example, if I try:

cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
cross_val_predict(estimator=SGDClassifier(), X=X_sample, y=y_bin_sample, cv=cv, n_jobs=6)

it returns an error

ValueError: cross_val_predict only works for partitions

but I would like to have predictions so I can make a confusion matrix and other statistics.

Is it possible to make it work with your cross-validators?
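
One possible workaround is to loop over the splits manually and collect the out-of-sample predictions yourself, because cross_val_predict requires the test sets to form a partition of all samples, which walk-forward schemes do not. The sketch below assumes scikit-learn 0.24 or later and uses its built-in TimeSeriesSplit (with its gap parameter) as a stand-in for GapWalkForward so that it runs without tscv; the toy data and LogisticRegression are placeholders as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Toy, time-ordered data (placeholder for X_sample / y_bin_sample).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.arange(12) % 2

# Stand-in splitter: 3 folds, 2 test samples each, 1 sample dropped as a gap.
cv = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)

preds = np.full(len(y), -1)  # -1 marks samples that never land in a test set
for train_idx, test_idx in cv.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# Indices that actually received an out-of-sample prediction.
covered = np.flatnonzero(preds != -1)
print(covered)
```

The covered predictions, together with `y[covered]`, can then be passed to `sklearn.metrics.confusion_matrix`.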
opened by MislavSag 8
Documentation

Documentation and examples do not address the splitting of data set into training and test sets.

If using one of the cross-validators, does the data set need to be sorted in time order? Is there a way to designate a datetime column so the class understands on what basis to sequentially split data?
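
As far as the examples in the README show, the splitters operate on row positions rather than on timestamps, so a reasonable precaution is to sort the rows chronologically before splitting. The following sketch uses only NumPy; the `timestamps` array is made-up data, and nothing here is a tscv feature.

```python
import numpy as np

# Made-up, unsorted observations with their timestamps.
timestamps = np.array(
    ["2021-03-01", "2021-01-01", "2021-02-01", "2021-04-01"],
    dtype="datetime64[D]",
)
X = np.array([[3.0], [1.0], [2.0], [4.0]])
y = np.array([30, 10, 20, 40])

# Reorder rows chronologically; a stable sort keeps ties in their original order.
order = np.argsort(timestamps, kind="stable")
X_sorted, y_sorted = X[order], y[order]

print(y_sorted)  # [10 20 30 40]
```

After this step, `X_sorted` and `y_sorted` can be handed to any position-based cross-validator.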

opened by mksamelson 3
split.py depends on deprecated / newly private method `_safe_indexing` in scikit-learn 0.24.0

Just flagging a minor issue:

We found this after poetry update-ing our dependencies, inadvertently bumping scikit-learn to 0.24.0. This broke code we have that uses tscv

relevant scikit-learn source-code from version 0.23.0: https://github.com/scikit-learn/scikit-learn/blob/0.23.0/sklearn/utils/__init__.py#L274-L275

The method has been made private in scikit-learn 0.24.0: https://github.com/scikit-learn/scikit-learn/blob/0.24.0/sklearn/utils/__init__.py#L271

I did not investigate further, we pinned scikit-learn to 0.23.0 and that's OK for now, but some refactoring may be in order to move off the private method.
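
One way to avoid depending on the private helper is a small local indexing function. The sketch below is hypothetical (the name `take_rows` comes from neither tscv nor scikit-learn) and only covers NumPy arrays, plain sequences, and pandas-like objects exposing `.iloc`; it has not been checked against everything `_safe_indexing` supports.

```python
import numpy as np


def take_rows(X, indices):
    """Hypothetical stand-in for row selection by integer positions."""
    if hasattr(X, "iloc"):          # pandas-like containers
        return X.iloc[indices]
    if isinstance(X, np.ndarray):   # NumPy fancy indexing
        return X[indices]
    return [X[i] for i in indices]  # plain Python sequences


print(take_rows(np.arange(10) * 10, np.array([0, 2, 4])))  # [ 0 20 40]
print(take_rows(["a", "b", "c", "d"], [1, 3]))             # ['b', 'd']
```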

opened by rob-sokolowski 3
Error when importing TSCV GapWalkForward

I had been using TSCV GapWalkForward successfully with Python 3.7.

Suddenly I am getting the following error:

ImportError Traceback (most recent call last) in
     41 #Modeling
     42
---> 43 from tscv import GapWalkForward
     44 from sklearn.utils import shuffle
     45 from sklearn.model_selection import KFold

~\Anaconda3\envs\py37\lib\site-packages\tscv\__init__.py in
----> 1 from .split import GapCrossValidator
      2 from .split import GapLeavePOut
      3 from .split import GapKFold
      4 from .split import GapWalkForward
      5 from .split import gap_train_test_split

~\Anaconda3\envs\py37\lib\site-packages\tscv\split.py in
      7
      8 import numpy as np
----> 9 from sklearn.utils import indexable, safe_indexing
     10 from sklearn.utils.validation import _num_samples
     11 from sklearn.base import _pprint

ImportError: cannot import name 'safe_indexing' from 'sklearn.utils'

Any insight? I get this when simply importing GapWalkForward.

opened by mksamelson 2
GapWalkForward Issue with Scikit-learn 0.24.1

When I upgrade to Scikit-learn 0.24.1 I get an issue:

cannot import name 'safe_indexing' from 'sklearn.utils'

This appears to be a change within scikit-learn as indicated here:

https://stackoverflow.com/questions/65602076/yellowbrick-importerror-cannot-import-name-safe-indexing-from-sklearn-utils

No issue using scikit-learn 0.23.2

opened by mksamelson 2
Release 0.0.4 for GridSearch compat

Would it be possible to issue a new release on PyPI to include the latest changes from this commit which aligns the get_n_splits method signature with the abstract method signature required by GridSearchCV?

opened by wderose 2
Warning once is not enough

https://github.com/WenjieZ/TSCV/blob/f8b832fab1dca0e2d2d46029308c2d06eef8b858/tscv/split.py#L253

This warning should appear for every occurrence. Use standard output instead.
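
For context, Python's default warning filter reports a given warning only once per call location. A caller who wants every occurrence can change the filter with the standard `warnings` module, as in this sketch; `split_with_warning` and its message are made-up stand-ins, not the code at the linked line.

```python
import warnings


def split_with_warning():
    """Made-up stand-in for a function that warns on each call."""
    warnings.warn("placeholder warning message", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # report every occurrence, not just the first
    split_with_warning()
    split_with_warning()

print(len(caught))  # 2
```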

opened by WenjieZ 1
Retrained version of GapWalkForward: GapRollForward

The current implementation is based on legacy K-Fold cross-validation, which requires an explicit value for the n_splits parameter. It puts the burden of calculating the desired value of n_splits on the user.

A better implementation should allow the user to instantiate a GapWalkForward class without specifying the value for n_splits. Instead, it can deduce the right value from the other inputs.

It is theoretically desirable to keep both channels of kickstarting a GapWalkForward class. In practice, however, it is hard to maintain both within a single class. Therefore, I have decided to ~~deprecate the n_splits channel~~ implement a new class dubbed GapRollForward in v0.1.0 -- the version after the next.
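
To illustrate the idea of deriving the number of splits from the other inputs, here is a minimal sketch. The parameter names (`min_train_size`, `gap_size`, `test_size`) and the formula are assumptions chosen for illustration, not the signature or logic of GapRollForward.

```python
def infer_n_splits(n_samples, min_train_size, gap_size, test_size):
    """Hypothetical: how many test windows fit after the first training block."""
    usable = n_samples - min_train_size - gap_size
    if usable < test_size:
        raise ValueError("not enough samples for even one split")
    return usable // test_size


def roll_forward(n_samples, min_train_size, gap_size, test_size):
    """Yield (train, test) index lists with an expanding training window."""
    n_splits = infer_n_splits(n_samples, min_train_size, gap_size, test_size)
    for k in range(n_splits):
        test_start = min_train_size + gap_size + k * test_size
        train = list(range(test_start - gap_size))  # everything before the gap
        test = list(range(test_start, test_start + test_size))
        yield train, test


folds = list(roll_forward(n_samples=10, min_train_size=3, gap_size=1, test_size=2))
for train, test in folds:
    print("TRAIN:", train, "TEST:", test)
```

With these inputs the sketch produces three folds without the caller ever naming `n_splits`.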

opened by WenjieZ 1
Changed GapWalkForward.get_n_splits to match abstract method signature

Now works with GridSearchCV. Otherwise using GapWalkForward as the cross validation class passed to GridSearchCV will fail with "TypeError: get_n_splits() takes 1 positional argument but 4 were given."
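
The error arises because GridSearchCV calls `get_n_splits(X, y, groups)` on the cross-validator. The toy classes below (both hypothetical, and deliberately free of any scikit-learn import) reproduce the mismatch and show the compatible signature.

```python
class NarrowSplitter:
    """Hypothetical splitter whose get_n_splits accepts no data arguments."""

    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self):
        return self.n_splits


class CompatibleSplitter(NarrowSplitter):
    """Same splitter, with the (X, y, groups) signature the caller expects."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


X, y, groups = [[0], [1]], [0, 1], None

try:
    NarrowSplitter(3).get_n_splits(X, y, groups)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

n = CompatibleSplitter(3).get_n_splits(X, y, groups)
print(n)  # 3
```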

opened by lawsonmcw 1

Import error with latest sklearn version

Hi guys, this issue occurred after the upgrade to 1.1.3

ImportError: cannot import name '_pprint' from 'sklearn.base'

/.venv/lib/python3.10/site-packages/tscv/_split.py:19 in <module>

   16 import numpy as np
   17 from sklearn.utils import indexable
   18 from sklearn.utils.validation import _num_samples, check_consistent_length
❱  19 from sklearn.base import _pprint
   20 from sklearn.utils import _safe_indexing

Could you please fix it?

Kind regards, Jim

opened by teneon 1

Consistently use the test sets as reference for `gap_before` and `gap_after`
There are two ways of defining a derived cross-validator. One is to redefine _iter_test_indices or _iter_test_masks (test viewpoint), and the other is to redefine _iter_train_masks or _iter_train_indices (train viewpoint).

Currently, these two methods assign different semantic meanings to the parameters gap_before and gap_after. The test viewpoint uses the test sets as the reference:

train gap_before test gap_after train

The train viewpoint uses the training sets as the reference:

test gap_before train gap_after test

This diverged behavior is ~~not intended~~ inappropriate. The package should insist on the test viewpoint, and hence this PR. It will be enforced in v0.2.

I don't think this issue has touched any users, for the derived classes in this package use _iter_test_indices exclusively (test viewpoint). No users have reported this issue either. If you suspect that you have been affected by it, please reply to this PR.
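
Under the test viewpoint described above, the training set can be derived from the test indices alone: drop the test samples together with `gap_before` samples before and `gap_after` samples after each of them. A minimal NumPy sketch follows; the function name `train_from_test` is hypothetical and this is not the package's code.

```python
import numpy as np


def train_from_test(n_samples, test_idx, gap_before=0, gap_after=0):
    """Hypothetical: training indices = everything outside test sets and gaps."""
    excluded = np.zeros(n_samples, dtype=bool)
    for i in np.asarray(test_idx):
        lo = max(int(i) - gap_before, 0)
        hi = min(int(i) + gap_after + 1, n_samples)
        excluded[lo:hi] = True  # the test sample plus its surrounding gap
    return np.flatnonzero(~excluded)


print(train_from_test(10, [2, 3], gap_before=1, gap_after=1))        # [0 5 6 7 8 9]
print(train_from_test(10, [2, 3, 6, 7], gap_before=1, gap_after=1))  # [0 9]
```

Because the rule is applied per test sample, it works for contiguous and non-contiguous test sets alike.
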
opened by WenjieZ 1

time boost in folds generation

With contiguous test sets:

cv_orig = GapKFold(n_splits=5, gap_before=1, gap_after=1)

for train_index, test_index in cv_orig.split(np.arange(10)):
    print("TRAIN:", train_index, "TEST:", test_index)


... TRAIN: [3 4 5 6 7 8 9] TEST: [0 1]
... TRAIN: [0 5 6 7 8 9] TEST: [2 3]
... TRAIN: [0 1 2 7 8 9] TEST: [4 5]
... TRAIN: [0 1 2 3 4 9] TEST: [6 7]
... TRAIN: [0 1 2 3 4 5 6] TEST: [8 9]

cv_opt = GapKFold(n_splits=5, gap_before=1, gap_after=1)

for train_index, test_index in cv_opt.split(np.arange(10)):
    print("TRAIN:", train_index, "TEST:", test_index)


... TRAIN: [3 4 5 6 7 8 9] TEST: [0 1]
... TRAIN: [0 5 6 7 8 9] TEST: [2 3]
... TRAIN: [0 1 2 7 8 9] TEST: [4 5]
... TRAIN: [0 1 2 3 4 9] TEST: [6 7]
... TRAIN: [0 1 2 3 4 5 6] TEST: [8 9]

%%timeit
folds = list(cv_orig.split(np.arange(10000)))


... 1.21 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
folds = list(cv_opt.split(np.arange(10000)))


... 4.74 ms ± 44.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With non-contiguous test sets:

cv_orig = _XXX_(_xxx_, gap_before=1, gap_after=1)

for train_index, test_index in cv_orig.split(np.arange(10)):
    print("TRAIN:", train_index, "TEST:", test_index)


... TRAIN: [5 6 7 8 9] TEST: [0 1 2 3]
... TRAIN: [7 8 9] TEST: [0 1 4 5]
... TRAIN: [3 4 9] TEST: [0 1 6 7]
... TRAIN: [3 4 5 6] TEST: [0 1 8 9]
... TRAIN: [0 7 8 9] TEST: [2 3 4 5]
... TRAIN: [0 9] TEST: [2 3 6 7]
... TRAIN: [0 5 6] TEST: [2 3 8 9]
... TRAIN: [0 1 2 9] TEST: [4 5 6 7]
... TRAIN: [0 1 2] TEST: [4 5 8 9]
... TRAIN: [0 1 2 3 4] TEST: [6 7 8 9]

cv_opt = _XXX_(_xxx_, gap_before=1, gap_after=1)

for train_index, test_index in cv_opt.split(np.arange(10)):
    print("TRAIN:", train_index, "TEST:", test_index)


... TRAIN: [5 6 7 8 9] TEST: [0 1 2 3]
... TRAIN: [7 8 9] TEST: [0 1 4 5]
... TRAIN: [3 4 9] TEST: [0 1 6 7]
... TRAIN: [3 4 5 6] TEST: [0 1 8 9]
... TRAIN: [0 7 8 9] TEST: [2 3 4 5]
... TRAIN: [0 9] TEST: [2 3 6 7]
... TRAIN: [0 5 6] TEST: [2 3 8 9]
... TRAIN: [0 1 2 9] TEST: [4 5 6 7]
... TRAIN: [0 1 2] TEST: [4 5 8 9]
... TRAIN: [0 1 2 3 4] TEST: [6 7 8 9]

%%timeit
folds = list(cv_orig.split(np.arange(10000)))

... 1.23 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
folds = list(cv_opt.split(np.arange(10000)))

... 4.78 ms ± 49.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

opened by aldder 3

add CombinatorialGapKFold

This PR adds an implementation of Combinatorial Cross-Validation with purging and embargoing, as described in the book "Advances in Financial Machine Learning" by Marcos López de Prado.

explaining video: https://www.youtube.com/watch?v=hDQssGntmFA
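
A rough sketch of the combinatorial idea, under simplifying assumptions: cut the samples into contiguous groups, use every combination of `k` groups as a test set, and drop a gap on both sides of each test group. The function name and parameters are hypothetical, and the symmetric gap is a simplification of purging and embargoing rather than the PR's implementation.

```python
from itertools import combinations

import numpy as np


def combinatorial_gap_folds(n_samples, n_groups, k, gap_before=0, gap_after=0):
    """Hypothetical generator of (train, test) index arrays."""
    groups = np.array_split(np.arange(n_samples), n_groups)
    for combo in combinations(range(n_groups), k):
        test = np.concatenate([groups[g] for g in combo])
        keep = np.ones(n_samples, dtype=bool)
        for g in combo:
            lo = max(int(groups[g][0]) - gap_before, 0)
            hi = min(int(groups[g][-1]) + gap_after + 1, n_samples)
            keep[lo:hi] = False  # remove the test group and its gaps
        yield np.flatnonzero(keep), test


folds = list(combinatorial_gap_folds(10, n_groups=5, k=2, gap_before=1, gap_after=1))
print(len(folds))  # 10
train, test = folds[1]
print(train, test)  # [7 8 9] [0 1 4 5]
```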

opened by aldder 3
Implement Rep-Holdout

Thank you for this repository and the implemented CV-methods; especially GapRollForward. I was looking for exactly this package.

I was wondering if you are interested in implementing another CV-Method for time series, called Rep-Holdout. It is used in this evaluation paper (https://arxiv.org/abs/1905.11744) and has good performance compared to all other CV-methods - some of which you have implemented here.

As I understand it, it is somewhat like sklearn.model_selection.TimeSeriesSplit but with a randomized selection of all possible folds. Here is the description from the paper as an image:

The authors provided code in R, but it is written very differently from how it would need to look in Python. I adapted your functions to implement it in Python, but I am not the best coder and it really only serves my purpose of tuning a specific model. Seeing as the performance of Rep-Holdout is good and, to me at least, it makes sense for time series cross-validation, maybe you are interested in adding this function to your package?

opened by georgeblck 8
Intuition on setting number of gaps

If for example, I have data without gaps, when and why would I still create a break between my train and validation? I have seen the argument for setting gaps when the period that needs to be predicted may be N days after the train. Are there other reasons? And if so, what is the intuition on knowing how many gaps to include before/after the training set?
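
One heuristic that may help (offered here as an assumption, not as guidance from the package author) is to choose a gap at least as long as the lag at which the series' autocorrelation becomes negligible, so that the training and test samples closest to each other are only weakly dependent. The sketch below estimates that lag with NumPy on synthetic data; the 0.1 threshold and the helper name `first_weak_lag` are arbitrary choices.

```python
import numpy as np


def first_weak_lag(x, threshold=0.1, max_lag=50):
    """Hypothetical: first lag whose sample autocorrelation is below threshold."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    for lag in range(1, max_lag + 1):
        acf = np.dot(x[:-lag], x[lag:]) / denom
        if abs(acf) < threshold:
            return lag
    return max_lag


# Synthetic series: a 5-point moving average of white noise, so neighbouring
# values are correlated but values 5 or more steps apart are not.
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
series = np.convolve(noise, np.ones(5) / 5, mode="valid")

gap = first_weak_lag(series)
print(gap)
```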

opened by tyokota 0

Releases(v0.1.2)

v0.1.2(Apr 21, 2021)

Updated the setup file.
v0.1.1(Apr 16, 2021)

Minor changes
v0.1.0(Apr 15, 2021)
Add the new, more flexible and thus powerful GapRollForward to replace GapWalkForward.

v0.0.5(Mar 29, 2021)
The release solves the Scikit-Learn v0.24 compatibility issue as well as implements the following enhancements:

Make 0 training size possible in GapWalkForward.

Overlapping the test set in GapWalkForward via the rollback_size parameter.

Improve the user experience of gap_train_test_split.

Add a deprecation message to GapWalkForward.

v0.0.5-rc1(Mar 18, 2021)
The release solves the Scikit-Learn v0.24 compatibility issue as well as implements the following enhancements:

Make 0 training size possible in GapWalkForward.

Overlapping the test set in GapWalkForward via the rollback_size parameter.

Improve the user experience of gap_train_test_split.

v0.0.4(Dec 8, 2019)

Changed GapWalkForward.get_n_splits to match abstract method signature.

It aligns the get_n_splits method signature with the abstract method signature required by GridSearchCV.
v0.0.3(May 19, 2019)

v0.0.2(May 15, 2019)

v0.0.1(May 14, 2019)


Owner

Wenjie Zheng

Statistical Learning Solution Expert

GitHub https://tscv.readthedocs.io
