Convoys is a simple library that fits a few statistical models useful for modeling time-lagged conversions.

Overview

Convoys

Convoys is a simple library that fits a few statistical models useful for modeling time-lagged conversions. There is a lot more info if you head over to the documentation. You can also take a look at this blog post about Convoys.

Installation

The easiest way right now is to install the latest version from PyPI:

pip install convoys

More info

Convoys was built by Erik Bernhardsson and is released under the MIT license.

Comments
  • Rename the cdf method to predict to be more in line with scikit-learn

    The cdf name was an attempt to make it more similar to the probability distributions in scipy.stats but in retrospect that was confusing since the convoys models do not represent probability distributions – they are more like regression models that output something between 0 and 1.

    This deprecates the old method names but keeps them for the time being.

    EDIT: also splits the prediction method into predict vs predict_ci along the lines of https://martinfowler.com/bliki/FlagArgument.html
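A rename like this is often handled by keeping the old name as a thin wrapper that warns and forwards to the new one. The sketch below shows that pattern on a hypothetical `ExampleModel` class; the class, its placeholder curve, and the interval width are assumptions for illustration, not Convoys code.

```python
import warnings


class ExampleModel:
    """Hypothetical stand-in for a fitted model (not the Convoys class)."""

    def predict(self, t):
        # Placeholder curve: a value between 0 and 1 for t >= 0.
        return 1.0 - 0.5 ** t

    def predict_ci(self, t, ci=0.95):
        # A separate method instead of a flag argument on predict().
        # The band below is a placeholder, purely illustrative.
        y = self.predict(t)
        half_width = 0.1 * ci
        return y, max(0.0, y - half_width), min(1.0, y + half_width)

    def cdf(self, t):
        # Deprecated alias kept so existing callers keep working.
        warnings.warn("cdf() is deprecated; use predict()",
                      DeprecationWarning, stacklevel=2)
        return self.predict(t)


model = ExampleModel()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = model.cdf(1.0)

new_value = model.predict(1.0)
point, lower, upper = model.predict_ci(1.0)
```

Callers of the old name get the same result plus a `DeprecationWarning`, which gives them time to migrate before the alias is removed.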

    opened by erikbern 8
  • Fit a Beta distribution to the c parameter

    Instead of using bootstrapping to estimate uncertainty of c, just fit a Beta distribution directly

    This is probably 10-100x faster although a few more lines of math (lots of gammaln)

    Will do the same thing for Weibull and Gamma and then remove the bootstrapping (and the old non-beta models). Will also remove a few other things like sharing parameters etc.
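The "lots of gammaln" remark presumably refers to the log of the Beta normalizing constant. As a minimal sketch of where those terms come from, here is the Beta log-density written with `math.lgamma`, the standard-library counterpart of `scipy.special.gammaln`. The function name `beta_logpdf` is an assumption for illustration and is not taken from the Convoys source.

```python
import math


def beta_logpdf(c, a, b):
    """Log-density of Beta(a, b) at c, for 0 < c < 1 and a, b > 0."""
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(c) + (b - 1.0) * math.log(1.0 - c) - log_norm


# Beta(1, 1) is uniform on (0, 1), so its log-density is 0 everywhere.
uniform_value = beta_logpdf(0.3, 1.0, 1.0)

# Beta(2, 2) has density 6 * c * (1 - c), which is 1.5 at c = 0.5.
peak_value = beta_logpdf(0.5, 2.0, 2.0)
```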

    opened by erikbern 8
  • n_iterations needs to be an int or generalized gamma model breaks.

    Hey big fan of the project.

    A NumPy call to np.empty in emcee has been failing since NumPy 1.11 (I think) because n_iterations is a float.

    np.ceil() is used to calculate the number of iterations, and it returns a float. Casting the result to an int fixes Generalized Gamma as of NumPy version 1.21.2.
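The failure and the fix can be reproduced in isolation with NumPy alone. In this sketch the helper name `to_iteration_count` and the sample sizes are made up for illustration; only the behavior of `np.ceil` and `np.empty` is taken from the report above.

```python
import numpy as np


def to_iteration_count(n_samples, n_walkers):
    # np.ceil returns a numpy.float64, so cast before using it as a size.
    return int(np.ceil(n_samples / n_walkers))


# Passing the float straight through fails when it is used as an array shape.
float_iterations = np.ceil(1005 / 10)
try:
    np.empty((float_iterations, 10, 3))
    float_shape_fails = False
except TypeError:
    float_shape_fails = True

# With the cast, allocating the chain works as expected.
n_iterations = to_iteration_count(1005, 10)
chain = np.empty((n_iterations, 10, 3))
```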

    opened by Will-So 7
  • weibull regression (MAP and bayesian version)

    Implemented a regression model using pymc3, so fully Bayesian.

    The good news is that it was very easy to implement (once I discovered the pymc3.Potential class). The bad news is this is slow as hell. It barely works for more than a few thousand datapoints – whereas the current models happily fit 100k datapoints in a few seconds.

    I'm tempted to revert to my original idea of just fitting this using MAP and then computing the Hessian of the MAP to get a normal approximation of the posterior distribution. Seems a bit janky but I think it will be a pretty good approximation in practice, and probably ~100x faster.
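The idea of fitting the MAP and reading a normal approximation off the curvature (a Laplace approximation) can be sketched on a much simpler model than Weibull regression. The example below assumes a single conversion rate c with a flat prior and k conversions out of n trials; the function name and the data are hypothetical.

```python
import math


def fit_map_with_normal_approx(k, n, steps=20):
    """MAP estimate of a conversion rate c (flat prior, 0 < k < n) and the
    variance of the normal approximation of the posterior at the MAP."""
    c = 0.5
    for _ in range(steps):
        # Gradient and second derivative of k*log(c) + (n-k)*log(1-c).
        grad = k / c - (n - k) / (1.0 - c)
        hess = -k / c ** 2 - (n - k) / (1.0 - c) ** 2
        # Newton step, kept strictly inside (0, 1).
        c = min(max(c - grad / hess, 1e-6), 1.0 - 1e-6)
    # The posterior variance is approximated by the negative inverse Hessian.
    hess = -k / c ** 2 - (n - k) / (1.0 - c) ** 2
    return c, -1.0 / hess


c_map, c_var = fit_map_with_normal_approx(k=30, n=100)
c_sd = math.sqrt(c_var)
```

For this toy model the result matches the textbook values c = k/n and variance c(1-c)/n. With several parameters the same recipe uses the full Hessian matrix and its inverse as the covariance.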

    opened by erikbern 6
  • Minibatches, dropout, fix Gamma, easier learning algorithm

    This changes a bunch of stuff

    • Minibatches for faster convergence (although it makes every epoch a lot slower). This also gets rid of the checks for nan etc
    • Dropout rather than regression. Dropout on the inputs.
    • Fit an intercept to the regression model as a separate parameter just so that it's not affected by dropout
    • Turns out tf.igamma has a bug in the gradient wrt a (I'll open an issue with tensorflow later). Rather than setting it using gradient descent, do a minor perturbation after each step
    • Once gamma was fixed, turns out we can have a much simpler learning rate scheme, not save snapshots etc.
    opened by erikbern 5
  • Use more accurate gamma derivatives

    I too have struggled with gamma derivatives! This PR is my current best effort, so I can share it with convoys.

    1. Use an analytical solution for the derivative w.r.t. the x variable
    2. Use an O(h^4) finite difference method w.r.t. the a (k) variable. This has much better stability in the second derivative too.

    I've also included gammaincc, as gammaincc=1-gammainc and you have 1-gammainc in your code, so this is a possible small convenience.
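Both ingredients can be illustrated with the standard library. This is a sketch, not the code from the PR: the analytic x-derivative of the regularized lower incomplete gamma function is x^(a-1) e^(-x) / Γ(a), and the five-point central stencils below are one common O(h^4) finite difference scheme. Because the standard library has no incomplete gamma function, the stencils are exercised on `math.lgamma` as a stand-in; the function names and the step size `h` are assumptions.

```python
import math


def gammainc_dx(a, x):
    """Analytic d/dx of the regularized lower incomplete gamma P(a, x)."""
    return math.exp((a - 1.0) * math.log(x) - x - math.lgamma(a))


def first_derivative(f, x, h=1e-2):
    # Five-point central stencil, truncation error O(h^4).
    return (-f(x + 2 * h) + 8 * f(x + h) - 8 * f(x - h) + f(x - 2 * h)) / (12 * h)


def second_derivative(f, x, h=1e-2):
    # Five-point central stencil for f'', truncation error O(h^4).
    return (-f(x + 2 * h) + 16 * f(x + h) - 30 * f(x)
            + 16 * f(x - h) - f(x - 2 * h)) / (12 * h * h)


# Stand-in check: derivatives of log-gamma at 1 are known in closed form
# (minus the Euler-Mascheroni constant, and pi^2 / 6).
digamma_at_1 = first_derivative(math.lgamma, 1.0)
trigamma_at_1 = second_derivative(math.lgamma, 1.0)

# For a = 1, P(1, x) = 1 - exp(-x), so the x-derivative is exp(-x).
slope_at_2 = gammainc_dx(1.0, 2.0)
```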

    opened by CamDavidsonPilon 4
  • prevent regression.py from producing garbage output in jupyter notebooks

    PR for #93

    Check if user is calling GeneralizedGamma within a Jupyter Notebook and suppress sys.stdout.flush() if true. Still need to think about how to replicate the carriage return so we can see the updating as the model fits.
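One common heuristic for this kind of check is to look at the IPython shell class, which is `ZMQInteractiveShell` under a Jupyter kernel. The sketch below is an assumption about how such a check could look, not the code from this PR, and `report_progress` is a hypothetical helper.

```python
import sys


def running_in_notebook():
    """Heuristic: True when the code runs under a Jupyter kernel."""
    try:
        # get_ipython is only defined when running under IPython.
        shell_name = get_ipython().__class__.__name__
    except NameError:
        return False
    return shell_name == "ZMQInteractiveShell"


def report_progress(message):
    # The carriage return redraws the same line in a terminal.
    sys.stdout.write("\r" + message)
    if not running_in_notebook():
        sys.stdout.flush()


in_notebook = running_in_notebook()
```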

    Current output: [image]

    PR output: [image]

    opened by AlephNotation 4
  • fix numerical issues with the nonparametric model

    Found a few issues

    • Something with the covariance matrix not being positive semidefinite, causing numpy.random.multivariate_normal to complain and generate bogus values. This is despite the eigenvalues all being positive
    • A few of the variances were super large, causing enormous values of z that expit can't handle. Clipping fixes that problem
    • It seems like a bias is introduced, maybe due to floating point rounding. With n=1000 it ends up having a fairly substantial upwards bias early in the distribution, whereas this disappears for n=100. Not quite sure what's up.

    After these changes, Weibull estimation of synthetic Weibull data lines up fairly well with the nonparametric estimation
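The clipping fix mentioned in the list above can be sketched in plain Python. A naive logistic function overflows for very negative inputs, while clipping z first keeps it well defined. The function name and the bound of 500 are assumptions chosen for illustration; the PR may clip at a different value and works on arrays rather than scalars.

```python
import math


def clipped_expit(z, bound=500.0):
    """Logistic function 1 / (1 + exp(-z)) with z clipped to [-bound, bound]."""
    z = max(-bound, min(bound, z))
    return 1.0 / (1.0 + math.exp(-z))


# Without clipping, math.exp overflows for a very negative z.
try:
    1.0 / (1.0 + math.exp(1e6))
    naive_overflows = False
except OverflowError:
    naive_overflows = True

low = clipped_expit(-1e6)
mid = clipped_expit(0.0)
high = clipped_expit(1e6)
```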

    opened by erikbern 4
  • Decoupling visualization from models

    Hey there! We, the Buffer data team, recently discovered this awesome package, and we're starting to use it in different analyses.

    We're used to doing most of the plotting with R. I've started to work on getting the data back from the Matplotlib figure, but it seems like a hack, and I was wondering if you've thought about decoupling the plotting from the modelling.

    Prophet, from Facebook, does a great job at that and it'll return a DataFrame with the required data to plot. The same prophet library will also have a default .plot function that uses Matplotlib. That helps users use other plotting frameworks.
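To make the request concrete, here is a rough sketch of the separation being described: one function that returns plain rows and leaves plotting to whatever tool consumes them. Everything here is hypothetical, including the function name and the row layout, and the rate is a naive fraction that ignores censoring, which the Convoys models are designed to handle properly.

```python
def conversion_table(groups, converted, times, grid):
    """Return one row per (group, t) with the share converted by time t.

    Naive on purpose: unconverted items simply count as not converted.
    """
    rows = []
    for group in sorted(set(groups)):
        members = [i for i, g in enumerate(groups) if g == group]
        for t in grid:
            n_converted = sum(1 for i in members if converted[i] and times[i] <= t)
            rows.append({"group": group, "t": t, "rate": n_converted / len(members)})
    return rows


rows = conversion_table(
    groups=["a", "a", "b", "b"],
    converted=[True, False, True, True],
    times=[1.0, 5.0, 2.0, 3.0],
    grid=[0, 2, 4],
)
rates = [row["rate"] for row in rows]
```

Rows in this shape can be loaded into a DataFrame, written to CSV for R, or passed to any plotting library.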

    I'm happy to help with the coding if I can figure out how to better do the decoupling. Let me know if you have any questions too. :smile:

    Thanks for open sourcing such a helpful library!

    PS: We've also found that using a large group size will result in a confusing legend in the final plot. This one can probably be fixed using the proper Matplotlib arguments though. This example shows weeks in one of our plots:

    2020-03-24_16:28:35_183x348

    opened by davidgasquez 3
  • vectorizing time diff calc for speed boost

    Throwing the time diff calculation in get_arrays into an apply function for vectorization. It seemed to make it run 5x faster when I tested it locally. Not too different in terms of code; I'm guessing the T_raw.append calls were what took the longest.
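For comparison, the same kind of calculation can be done without any per-row Python calls by subtracting `datetime64` arrays directly. This is a different technique from the `apply` approach in the PR. The variable names and dates are made up, and the convention that unconverted rows are measured up to an observation time is an assumption about what the arrays should contain.

```python
import numpy as np

created = np.array(["2020-01-01", "2020-01-05"], dtype="datetime64[s]")
converted = np.array(["2020-01-03", "NaT"], dtype="datetime64[s]")
now = np.datetime64("2020-02-01", "s")  # assumed observation time

# B: whether each row converted (NaT marks "not yet converted").
B = ~np.isnat(converted)

# T: days until conversion, or days observed so far when not converted.
end = np.where(B, converted, now)
T = (end - created) / np.timedelta64(1, "D")
```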

    opened by stphnma 3
  • Dumb mistake + Switch Gamma to Powell

    this was some debugging thing i accidentally committed

    Also simplified the Gamma stuff by switching to Powell which is a lot slower but seems a lot simpler and doesn't seem to suffer from weird ass nan behavior.

    opened by erikbern 3
  • Example dos does not work

    Steps to reproduce:

    • clone the repo
    • python -m venv venv and source venv/bin/activate
    • pip install convoys==0.2.1
    • python examples/dob_violations.py

    Stacktrace:

    File "examples/dob_violations.py", line 50, in <module>
      run()
    File "examples/dob_violations.py", line 25, in run
      convoys.plotting.plot_cohorts(G, B, T, model=model, ci=0.95,
    File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/plotting.py", line 62, in plot_cohorts
      m.fit(G, B, T)
    File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/multi.py", line 31, in fit
      self.base_model.fit(X, B, T)
    File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/regression.py", line 269, in fit
      for i, _ in enumerate(sampler.sample(p0, iterations=n_iterations)):
    File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/emcee/ensemble.py", line 379, in sample
      self.backend.grow(iterations, state.blobs)
    File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/emcee/backends/backend.py", line 175, in grow
      a = np.empty((i, self.nwalkers, self.ndim), dtype=self.dtype)
    TypeError: 'numpy.float64' object cannot be interpreted as an integer

    opened by jacopotagliabue 0
  • Numpy type error occurs when setting mcmc==True

    I am running into a NumPy type error (see below) whenever I set the parameters mcmc==True or ci==0.95 in any Convoys model. This is consistent across data sources and even occurs when using the example data sets. If I remove these parameters, the code runs as expected with no errors. Is this something anyone else has come across? Any help is much appreciated!

    TypeError: 'numpy.float64' object cannot be interpreted as an integer
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-160-acdf74229cef> in <module>
    ----> 1 model_test.fit(X,B,T)
    
    ~/opt/anaconda3/lib/python3.8/site-packages/convoys/multi.py in fit(self, G, B, T)
         29         for i, group in enumerate(G):
         30             X[i,group] = 1
    ---> 31         self.base_model.fit(X, B, T)
         32 
         33     def _get_x(self, group):
    
    ~/opt/anaconda3/lib/python3.8/site-packages/convoys/regression.py in fit(self, X, B, T, W)
        267                     ' %d walkers [' % n_walkers,
        268                     progressbar.AdaptiveETA(), ']'])
    --> 269             for i, _ in enumerate(sampler.sample(p0, iterations=n_iterations)):
        270                 bar.update(i+1)
        271             result['samples'] = sampler.chain[:, n_burnin:, :] \
    
    ~/opt/anaconda3/lib/python3.8/site-packages/emcee/ensemble.py in sample(self, initial_state, log_prob0, rstate0, blobs0, iterations, tune, skip_initial_state_check, thin_by, thin, store, progress, progress_kwargs)
        377             checkpoint_step = thin_by
        378             if store:
    --> 379                 self.backend.grow(iterations, state.blobs)
        380 
        381         # Set up a wrapper around the relevant model functions
    
    ~/opt/anaconda3/lib/python3.8/site-packages/emcee/backends/backend.py in grow(self, ngrow, blobs)
        173         self._check_blobs(blobs)
        174         i = ngrow - (len(self.chain) - self.iteration)
    --> 175         a = np.empty((i, self.nwalkers, self.ndim), dtype=self.dtype)
        176         self.chain = np.concatenate((self.chain, a), axis=0)
        177         a = np.empty((i, self.nwalkers), dtype=self.dtype)
    
    TypeError: 'numpy.float64' object cannot be interpreted as an integer
    
    opened by 7cb15 0
  • GeneralizedGamma's stdout flushing going crazy in Jupyter Labs

    Calling fit on GeneralizedGamma within a Jupyter Labs notebook results in a lot of whitespace in the output. I think this is because sys.stdout.flush() does not play well with Labs.

    opened by AlephNotation 0