PyEmits is a Python package for easy manipulation of time-series data. Time-series data is very common in real life, for example in:
- Engineering
- FSI (Financial Services Industry)
- FMCG (Fast-Moving Consumer Goods)
A data scientist's work consists of:
- forecasting
- prediction/simulation
- data preparation
- cleansing
- anomaly detection
- descriptive data analysis/exploratory data analysis
Each new business unit ends up rebuilding the same wheels again and again:
- data pipeline
  - extraction
  - transformation
    - cleansing
    - feature engineering
    - remove outliers
  - AI landing for prediction, forecasting
  - write it back to database
- ml framework
  - multiple model training
  - multiple model prediction
  - kfold validation
  - anomaly detection
  - forecasting
  - deep learning model in easy way
  - ensemble modelling
- exploratory data analysis
  - descriptive data analysis
- ...
That's why I created this project, and also for fun.
This project is under active development and free to use (Apache 2.0). Contributions that extend its features are welcome.
Install
pip install pyemits
Feature highlights
- Easy training
import numpy as np
from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
X = np.random.randint(1, 100, size=(1000, 10))
y = np.random.randint(1, 100, size=(1000, 1))
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
- Accepts a neural network as a model
import numpy as np
from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))
keras_lstm_model = KerasWrapper.from_simple_lstm_model((10, 10), 4)
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()
You can also keep full flexibility by passing a customized model:
import numpy as np
from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))
from keras.layers import Dense, Dropout, LSTM
from keras import Sequential
model = Sequential()
model.add(LSTM(128,
               activation='softmax',
               input_shape=(10, 10),
               ))
model.add(Dropout(0.1))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
keras_lstm_model = KerasWrapper(model, nickname='LSTM')
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()
Or attach the model definition through the algorithm config:
import numpy as np
from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
from pyemits.common.config_model import KerasSequentialConfig
X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))
from keras.layers import Dense, Dropout, LSTM
from keras import Sequential
keras_lstm_model = KerasWrapper(nickname='LSTM')
config = KerasSequentialConfig(layer=[LSTM(128,
                                           activation='softmax',
                                           input_shape=(10, 10),
                                           ),
                                      Dropout(0.1),
                                      Dense(4)],
                               compile=dict(loss='mse', optimizer='adam', metrics=['mse']))
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model],
                     [config],
                     raw_data_model,
                     {'fit_config': [dict(epochs=10, batch_size=32)]})
trainer.fit()
PyTorch and MXNet support is under development; leave me a message if you want to contribute.
- MultiOutput training
import numpy as np
from pyemits.core.ml.regression.trainer import RegressionDataModel, MultiOutputRegTrainer
from pyemits.core.preprocessing.splitting import SlidingWindowSplitter
X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))
# For auto-regressive style MultiOutput training, set ravel=True.
# Set ravel=False when using an LSTM, which supports multi-dimensional input.
splitter = SlidingWindowSplitter(24, 24, ravel=True)
X, y = splitter.split(X, y)
raw_data_model = RegressionDataModel(X,y)
trainer = MultiOutputRegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
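To make the windowing step more concrete, here is a minimal NumPy sketch of a sliding-window split. It is not PyEmits' implementation: the helper name `sliding_window_split`, the reading of the two window arguments as "past steps used as input" and "future steps to predict", and the interpretation of `ravel=True` as flattening each window into one row are all assumptions made for illustration.

```python
import numpy as np


def sliding_window_split(series, n_in, n_out, ravel=True):
    """Hypothetical helper: cut a series into (past window, future window) pairs."""
    series = np.asarray(series)
    if series.ndim == 1:
        series = series.reshape(-1, 1)  # treat a 1-D series as a single feature
    n_samples = len(series) - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    # each sample pairs n_in consecutive steps with the n_out steps that follow
    X = np.stack([series[i:i + n_in] for i in range(n_samples)])
    y = np.stack([series[i + n_in:i + n_in + n_out] for i in range(n_samples)])
    if ravel:
        # flatten each window so tabular models receive 2-D input
        X = X.reshape(n_samples, -1)
        y = y.reshape(n_samples, -1)
    return X, y


series = np.arange(100)
X_win, y_win = sliding_window_split(series, n_in=24, n_out=24, ravel=True)
print(X_win.shape, y_win.shape)
```

With 100 points and two 24-step windows, this sketch yields 53 samples. Passing `ravel=False` keeps the trailing feature axis, which is the shape a sequence model such as an LSTM would typically expect.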
- Parallel training
  - provides fast training by running jobs in parallel
  - uses RegTrainer as its base, with parallel execution added
import numpy as np
from pyemits.core.ml.regression.trainer import RegressionDataModel, ParallelRegTrainer
X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))
raw_data_model = RegressionDataModel(X,y)
trainer = ParallelRegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()
Or you can use RegTrainer for multiple models, but they are not trained in parallel:
import numpy as np
from pyemits.core.ml.regression.trainer import RegressionDataModel, RegTrainer
X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))
raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()
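The difference between sequential and parallel training can be pictured with the standard library alone. The sketch below is not how `ParallelRegTrainer` works internally (its internals are not described here); `fit_model` is a hypothetical stand-in for one model's training routine, and threads are used only to keep the example small.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_model(name):
    """Hypothetical stand-in for training one model; sleeps instead of fitting."""
    time.sleep(0.2)
    return f"{name} fitted"


model_names = ["XGBoost", "LightGBM"]

# Sequential: each model waits for the previous one to finish.
start = time.perf_counter()
sequential_results = [fit_model(name) for name in model_names]
sequential_seconds = time.perf_counter() - start

# Parallel: every model is submitted at once; map() keeps the input order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(model_names)) as pool:
    parallel_results = list(pool.map(fit_model, model_names))
parallel_seconds = time.perf_counter() - start

print(sequential_results, f"{sequential_seconds:.2f}s")
print(parallel_results, f"{parallel_seconds:.2f}s")
```

Both runs return the same results in the same order; only the wall-clock time differs. For CPU-bound pure-Python work, a process pool is usually a better fit than threads.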
- KFold training
  - KFoldConfig is a global config and applies to all models
import numpy as np
from pyemits.core.ml.regression.trainer import RegressionDataModel, KFoldCVTrainer
from pyemits.common.config_model import KFoldConfig
X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))
raw_data_model = RegressionDataModel(X,y)
trainer = KFoldCVTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model, {'kfold_config':KFoldConfig(n_splits=10)})
trainer.fit()
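For readers unfamiliar with k-fold cross-validation, the sketch below shows how `n_splits` partitions sample indices so that each fold serves once as the validation set. `kfold_indices` is a hypothetical helper written for this illustration; it says nothing about how `KFoldCVTrainer` splits data internally (for instance, whether it shuffles).

```python
def kfold_indices(n_samples, n_splits):
    """Hypothetical helper: yield (train_idx, valid_idx) for contiguous folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds take one additional sample
        stop = start + base + (1 if fold < extra else 0)
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every sample lands in exactly one validation fold, so each model is trained `n_splits` times and scored on data it has not seen in that round.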
- Easy prediction
import numpy as np
from pyemits.core.ml.regression.trainer import RegressionDataModel, RegTrainer
from pyemits.core.ml.regression.predictor import RegPredictor
X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))
raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()
predictor = RegPredictor(trainer.clf_models, 'RegTrainer')
predictor.predict(RegressionDataModel(X))
- Forecast at scale
  - see the example notebook: forecast at scale.ipynb
- Data Model
from pyemits.common.data_model import RegressionDataModel
import numpy as np
X = np.random.randint(1, 100, size=(1000,10,10))
y = np.random.randint(1, 100, size=(1000, 1))
data_model = RegressionDataModel(X, y)
data_model._update_variable('X_shape', (1000,10,10))
data_model.X_shape
data_model.add_meta_data('X_shape', (1000,10,10))
data_model.meta_data
- Anomaly detection (under development)
  - see module: anomaly detection
  - Kalman filter
- Evaluation (under development)
  - see module: evaluation
  - backtesting
  - model evaluation
- Ensemble (under development)
  - blending
  - stacking
  - voting
  - by combo package
    - moa
    - aom
    - average
    - median
    - maximization
- IO
  - db connection
  - local
- dashboard ???
- other miscellaneous feature
  - continuous evaluation
  - aggregation
  - dimensional reduction
  - data profile (intensive data overview)
- to be confirmed
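The simpler combination rules named under Ensemble (average, median, maximization) can be sketched with the standard library. This is an illustration only, not the combo package's API or PyEmits' planned implementation; `combine_predictions` and the sample numbers are made up for the example.

```python
from statistics import mean, median


def combine_predictions(predictions, method="average"):
    """Hypothetical helper: merge per-model predictions sample by sample.

    `predictions` is a list with one equal-length list of numbers per model.
    """
    combiners = {"average": mean, "median": median, "maximization": max}
    if method not in combiners:
        raise ValueError(f"unknown method: {method}")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must predict the same number of samples")
    combine = combiners[method]
    # zip(*predictions) groups the models' values for each sample
    return [combine(values) for values in zip(*predictions)]


model_outputs = [
    [10.0, 20.0, 30.0],  # model A (made-up numbers)
    [12.0, 18.0, 33.0],  # model B
    [14.0, 25.0, 27.0],  # model C
]
for method in ("average", "median", "maximization"):
    print(method, combine_predictions(model_outputs, method))
```

Averaging and the median smooth out a single model's error, while maximization keeps the largest value per sample, which is more common when combining anomaly scores than regression outputs.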
References
The following libraries gave me ideas and insight:
- Greykite
  - changepoint detection
  - model summary
  - seasonality
- pytorch-forecasting
- darts
- pyaf
- orbit
- Kats/Prophet by Facebook
- sktime
- GluonTS
- tslearn
- pyts
- luminaire
- tods
- autots
- pyodds
- scikit-hts