mlforecast
Scalable machine learning-based time series forecasting.
Install
PyPI
pip install mlforecast
Optional dependencies
If you want more functionality you can instead use pip install mlforecast[extra1,extra2,...]. The current extra dependencies are:
- aws: adds the functionality to use S3 as the storage in the CLI.
- cli: includes the validations necessary to use the CLI.
- distributed: installs dask to perform distributed training. Note that you'll also need to install either LightGBM or XGBoost.
For example, if you want to perform distributed training through the CLI using S3 as your storage, you'll need all three extras, which you can get using: pip install mlforecast[aws,cli,distributed].
conda-forge
conda install -c conda-forge mlforecast
Note that this installation comes with the required dependencies for the local interface. If you want to:
- Use S3 as storage:
conda install -c conda-forge s3path
- Perform distributed training:
conda install -c conda-forge dask
and either LightGBM or XGBoost.
How to use
The following provides a very basic overview, for a more detailed description see the documentation.
Programmatic API
Store your time series in a pandas dataframe with an index named unique_id that identifies each time series, a column ds that contains the datestamps and a column y with the values.
from mlforecast.utils import generate_daily_series
series = generate_daily_series(20)
series.head()
| unique_id | ds | y |
|---|---|---|
| id_00 | 2000-01-01 00:00:00 | 0.264447 |
| id_00 | 2000-01-02 00:00:00 | 1.28402 |
| id_00 | 2000-01-03 00:00:00 | 2.4628 |
| id_00 | 2000-01-04 00:00:00 | 3.03552 |
| id_00 | 2000-01-05 00:00:00 | 4.04356 |
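If you're building the input yourself instead of generating it, this is a minimal sketch of a frame in the expected format (series_manual is just an illustrative name), assuming only pandas:

import pandas as pd

# index named unique_id identifies each series; ds holds the datestamps and y the values
series_manual = pd.DataFrame(
    {
        'ds': pd.date_range('2000-01-01', periods=3, freq='D'),
        'y': [0.1, 1.2, 2.3],
    },
    index=pd.Index(['id_00'] * 3, name='unique_id'),
)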
Then create a TimeSeries object with the features that you want to use. These include lags, transformations on the lags and date features. The lag transformations are defined as numba-jitted functions that transform an array; if they take additional arguments you supply a tuple (transform_func, arg1, arg2, ...).
from mlforecast.core import TimeSeries
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean
ts = TimeSeries(
lags=[7, 14],
lag_transforms={
1: [expanding_mean],
7: [(rolling_mean, 7), (rolling_mean, 14)]
},
date_features=['dayofweek', 'month']
)
ts
TimeSeries(freq=<Day>, transforms=['lag-7', 'lag-14', 'expanding_mean_lag-1', 'rolling_mean_lag-7_window_size-7', 'rolling_mean_lag-7_window_size-14'], date_features=['dayofweek', 'month'], num_threads=8)
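If you need a transform that isn't in window_ops, the same tuple convention applies to your own numba-jitted functions. A minimal sketch, where diff_over_previous is a hypothetical transform (not part of the library):

import numpy as np
from numba import njit

from mlforecast.core import TimeSeries

@njit
def diff_over_previous(x, offset=1):
    # difference with the value `offset` steps back; NaN where there's no history
    out = np.full_like(x, np.nan)
    out[offset:] = x[offset:] - x[:-offset]
    return out

# extra arguments go after the function inside the tuple
ts_custom = TimeSeries(
    lags=[7],
    lag_transforms={1: [(diff_over_previous, 2)]},
)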
Next define a model. If you want to use the local interface this can be any regressor that follows the scikit-learn API. For distributed training there are LGBMForecast and XGBForecast.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=0)
Now instantiate your forecast object with the model and the time series. There are two types of forecasters: Forecast, which is local, and DistributedForecast, which performs the whole process in a distributed way.
from mlforecast.forecast import Forecast
fcst = Forecast(model, ts)
To compute the features and train the model on them, call .fit on your Forecast object.
fcst.fit(series)
Forecast(model=RandomForestRegressor(random_state=0), ts=TimeSeries(freq=<Day>, transforms=['lag-7', 'lag-14', 'expanding_mean_lag-1', 'rolling_mean_lag-7_window_size-7', 'rolling_mean_lag-7_window_size-14'], date_features=['dayofweek', 'month'], num_threads=8))
To get the forecasts for the next 14 days, call .predict(14) on the forecaster. This will update the target with each prediction and recompute the features to get the next one.
predictions = fcst.predict(14)
predictions.head()
| unique_id | ds | y_pred |
|---|---|---|
| id_00 | 2000-08-10 00:00:00 | 5.24484 |
| id_00 | 2000-08-11 00:00:00 | 6.25861 |
| id_00 | 2000-08-12 00:00:00 | 0.225484 |
| id_00 | 2000-08-13 00:00:00 | 1.22896 |
| id_00 | 2000-08-14 00:00:00 | 2.30246 |
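As a rough conceptual sketch of that recursive strategy (this is not mlforecast's actual implementation, and make_features is a hypothetical helper):

def recursive_forecast(model, history, horizon, make_features):
    # make_features is hypothetical: it turns the values seen so far into
    # the feature vector the model was trained on (lags, rolling stats, ...)
    preds = []
    for _ in range(horizon):
        yhat = model.predict([make_features(history)])[0]
        preds.append(yhat)
        history = history + [yhat]  # the prediction becomes the newest target value
    return preds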
CLI
If you're looking to compute quick baselines, want to avoid some boilerplate or just prefer using CLIs, you can use the mlforecast binary with a configuration file like the following:
!cat sample_configs/local.yaml
data:
prefix: data
input: train
output: outputs
format: parquet
features:
freq: D
lags: [7, 14]
lag_transforms:
1:
- expanding_mean
7:
- rolling_mean:
window_size: 7
- rolling_mean:
window_size: 14
date_features: ["dayofweek", "month", "year"]
num_threads: 2
backtest:
n_windows: 2
window_size: 7
forecast:
horizon: 7
local:
model:
name: sklearn.ensemble.RandomForestRegressor
params:
n_estimators: 10
max_depth: 7
The configuration is validated using FlowConfig.
This configuration will use the data in data.prefix/data.input to train and write the results to data.prefix/data.output, both in data.format.
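As a quick sketch of how those keys resolve (assuming PyYAML is available; this just inspects the file, the CLI does this for you):

import yaml

with open('sample_configs/local.yaml') as f:
    cfg = yaml.safe_load(f)
# the CLI will read data/train and write to data/outputs, both as parquet
print(cfg['data'])  # {'prefix': 'data', 'input': 'train', 'output': 'outputs', 'format': 'parquet'}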
from pathlib import Path

data_path = Path('data')
data_path.mkdir()
series.to_parquet(data_path/'train')
!mlforecast sample_configs/local.yaml
Split 1 MSE: 0.0251
Split 2 MSE: 0.0180
list((data_path/'outputs').iterdir())
[PosixPath('data/outputs/valid_1.parquet'),
PosixPath('data/outputs/valid_0.parquet'),
PosixPath('data/outputs/forecast.parquet')]
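The outputs are regular parquet files, so you can read them back with pandas for further analysis, for example:

import pandas as pd

forecast = pd.read_parquet(data_path / 'outputs' / 'forecast.parquet')
valid_0 = pd.read_parquet(data_path / 'outputs' / 'valid_0.parquet')  # one of the backtest windows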