PyEmits, a python package for easy manipulation in time-series data.

Overview

Project Icon

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life.

  • Engineering
  • FSI industry (Financial Services Industry)
  • FMCG (Fast Moving Consumer Good)

Data scientist's work consists of:

  • forecasting
  • prediction/simulation
  • data prepration
  • cleansing
  • anomaly detection
  • descriptive data analysis/exploratory data analysis

each new business unit shall build the following wheels again and again

  1. data pipeline
    1. extraction
    2. transformation
      1. cleansing
      2. feature engineering
      3. remove outliers
      4. AI landing for prediction, forecasting
    3. write it back to database
  2. ml framework
    1. multiple model training
    2. multiple model prediction
    3. kfold validation
    4. anomaly detection
    5. forecasting
    6. deep learning model in easy way
    7. ensemble modelling
  3. exploratory data analysis
    1. descriptive data analysis
    2. ...

That's why I create this project, also for fun. haha

This project is under active development, free to use (Apache 2.0) I am happy to see anyone can contribute for more advancement on features

Install

pip install pyemits

Features highlight

  1. Easy training
import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel

X = np.random.randint(1, 100, size=(1000, 10))
y = np.random.randint(1, 100, size=(1000, 1))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
  1. Accept neural network as model
import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

keras_lstm_model = KerasWrapper.from_simple_lstm_model((10, 10), 4)
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

also keep flexibility on customized model

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

model = Sequential()
model.add(LSTM(128,
               activation='softmax',
               input_shape=(10, 10),
               ))
model.add(Dropout(0.1))
model.add(Dense(4))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

keras_lstm_model = KerasWrapper(model, nickname='LSTM')
raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model], [None], raw_data_model)
trainer.fit()

or attach it in algo config

import numpy as np

from pyemits.core.ml.regression.trainer import RegTrainer, RegressionDataModel
from pyemits.core.ml.regression.nn import KerasWrapper
from pyemits.common.config_model import KerasSequentialConfig

X = np.random.randint(1, 100, size=(1000, 10, 10))
y = np.random.randint(1, 100, size=(1000, 4))

from keras.layers import Dense, Dropout, LSTM
from keras import Sequential

keras_lstm_model = KerasWrapper(nickname='LSTM')
config = KerasSequentialConfig(layer=[LSTM(128,
                                           activation='softmax',
                                           input_shape=(10, 10),
                                           ),
                                      Dropout(0.1),
                                      Dense(4)],
                               compile=dict(loss='mse', optimizer='adam', metrics=['mse']))

raw_data_model = RegressionDataModel(X, y)
trainer = RegTrainer([keras_lstm_model],
                     [config],
                     raw_data_model, 
                     {'fit_config' : [dict(epochs=10, batch_size=32)]})
trainer.fit()

PyTorch, MXNet under development you can leave me a message if you want to contribute

  1. MultiOutput training
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, MultiOutputRegTrainer
from pyemits.core.preprocessing.splitting import SlidingWindowSplitter

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

# when use auto-regressive like MultiOutput, pls set ravel = True
# ravel = False, when you are using LSTM which support multiple dimension
splitter = SlidingWindowSplitter(24,24,ravel=True)
X, y = splitter.split(X, y)

raw_data_model = RegressionDataModel(X,y)
trainer = MultiOutputRegTrainer(['XGBoost'], [None], raw_data_model)
trainer.fit()
  1. Parallel training
    • provide fast training using parallel job
    • use RegTrainer as base, but add Parallel running
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel, ParallelRegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = ParallelRegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

or you can use RegTrainer for multiple model, but it is not in Parallel job

import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel,  RegTrainer

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()
  1. KFold training
    • KFoldConfig is global config, will apply to all
import numpy as np 

from pyemits.core.ml.regression.trainer import RegressionDataModel,  KFoldCVTrainer
from pyemits.common.config_model import KFoldConfig

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = KFoldCVTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model, {'kfold_config':KFoldConfig(n_splits=10)})
trainer.fit()
  1. Easy prediction
import numpy as np 
from pyemits.core.ml.regression.trainer import RegressionDataModel,  RegTrainer
from pyemits.core.ml.regression.predictor import RegPredictor

X = np.random.randint(1, 100, size=(10000, 1))
y = np.random.randint(1, 100, size=(10000, 1))

raw_data_model = RegressionDataModel(X,y)
trainer = RegTrainer(['XGBoost', 'LightGBM'], [None, None], raw_data_model)
trainer.fit()

predictor = RegPredictor(trainer.clf_models, 'RegTrainer')
predictor.predict(RegressionDataModel(X))
  1. Forecast at scale
  2. Data Model
from pyemits.common.data_model import RegressionDataModel
import numpy as np
X = np.random.randint(1, 100, size=(1000,10,10))
y = np.random.randint(1, 100, size=(1000, 1))

data_model = RegressionDataModel(X, y)

data_model._update_variable('X_shape', (1000,10,10))
data_model.X_shape

data_model.add_meta_data('X_shape', (1000,10,10))
data_model.meta_data
  1. Anomaly detection (under development)
  2. Evaluation (under development)
    • see module: evaluation
    • backtesting
    • model evaluation
  3. Ensemble (under development)
    • blending
    • stacking
    • voting
    • by combo package
      • moa
      • aom
      • average
      • median
      • maximization
  4. IO
    • db connection
    • local
  5. dashboard ???
  6. other miscellaneous feature
    • continuous evaluation
    • aggregation
    • dimensional reduction
    • data profile (intensive data overview)
  7. to be confirmed

References

the following libraries gave me some idea/insight

  1. greykit
    1. changepoint detection
    2. model summary
    3. seaonality
  2. pytorch-forecasting
  3. darts
  4. pyaf
  5. orbit
  6. kats/prophets by facebook
  7. sktime
  8. gluon ts
  9. tslearn
  10. pyts
  11. luminaries
  12. tods
  13. autots
  14. pyodds
  15. scikit-hts
Owner
Thompson
Data Analyst, Scientist, Engineer, Research and Development
Thompson
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms

MatrixProfile MatrixProfile is a Python 3 library, brought to you by the Matrix Profile Foundation, for mining time series data. The Matrix Profile is

Matrix Profile Foundation 216 Nov 29, 2021
First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

dbt-osmosis First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we wan

Alexander Butler 21 Nov 13, 2021
:truck: Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

To launch a live notebook server to test optimus using binder or Colab, click on one of the following badges: Optimus is the missing framework to prof

Iron 1.1k Dec 2, 2021
GWpy is a collaboration-driven Python package providing tools for studying data from ground-based gravitational-wave detectors

GWpy is a collaboration-driven Python package providing tools for studying data from ground-based gravitational-wave detectors. GWpy provides a user-f

GWpy 321 Nov 25, 2021
Python package for processing UC module spectral data.

UC Module Python Package How To Install clone repo. cd UC-module pip install . How to Use uc.module.UC(measurment=str, dark=str, reference=str, heade

Nicolai Haaber Junge 1 Oct 20, 2021
VevestaX is an open source Python package for ML Engineers and Data Scientists.

VevestaX Track failed and successful experiments as well as features. VevestaX is an open source Python package for ML Engineers and Data Scientists.

Vevesta 6 Nov 26, 2021
nrgpy is the Python package for processing NRG Data Files

nrgpy nrgpy is the Python package for processing NRG Data Files Website and source: https://github.com/nrgpy/nrgpy Documentation: https://nrgpy.github

NRG Tech Services 17 Oct 4, 2021
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 2.9k Dec 2, 2021
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

null 168 Nov 22, 2021
A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

null 40 Nov 27, 2021
small package with utility functions for analyzing (fly) calcium imaging data

fly2p Tools for analyzing two-photon (2p) imaging data collected with Vidrio Scanimage software and micromanger. Loading scanimage data relies on scan

Hannah Haberkern 1 Nov 21, 2021
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

null 1 Nov 25, 2021
Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

Jacob Schreiber 2.8k Nov 29, 2021
🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

???? ??. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python and HoloViz Panel.

Marc Skov Madsen 48 Nov 16, 2021
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

Tuplex 736 Nov 23, 2021
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Amazon Web Services - Labs 2.3k Dec 2, 2021
A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 727 Nov 23, 2021
Statistical package in Python based on Pandas

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. F

Raphael Vallat 851 Dec 2, 2021
A Python package for the mathematical modeling of infectious diseases via compartmental models

A Python package for the mathematical modeling of infectious diseases via compartmental models. Originally designed for epidemiologists, epispot can be adapted for almost any type of modeling scenario.

epispot 10 Nov 25, 2021