Pytorch implementations of Bayes By Backprop, MC Dropout, SGLD, the Local Reparametrization Trick, KF-Laplace, SG-HMC and more

Bayesian Neural Networks

License: MIT | Python 2.7+ | PyTorch 1.0

PyTorch implementations for the following approximate inference methods:

  • Bayes by Backprop (BBP)
  • Bayes by Backprop with the Local Reparametrisation Trick
  • MC Dropout
  • Stochastic Gradient Langevin Dynamics (SGLD)
  • Preconditioned SGLD (pSGLD)
  • Kronecker-Factorised Laplace approximation
  • Stochastic Gradient Hamiltonian Monte Carlo (scale-adapted)

We also provide code for:

  • Bootstrap MAP Ensembles

Prerequisites

  • PyTorch
  • Numpy
  • Matplotlib

The project is written in Python 2.7 and PyTorch 1.0.1. If CUDA is available, it will be used automatically. The models can also run on CPU as they are not excessively big.

Usage

Structure

Regression experiments

We carried out homoscedastic and heteroscedastic regression experiments on toy datasets, generated with a Gaussian Process ground truth, as well as on real data (six UCI datasets).

notebooks/regression/(ModelName)_(ExperimentType).ipynb: Contains experiments using (ModelName) on (ExperimentType), i.e. homoscedastic/heteroscedastic. The heteroscedastic notebooks contain both toy and UCI dataset experiments for a given (ModelName).

We also provide Google Colab notebooks. This means that you can run on a GPU (for free!). No modifications required - all dependencies and datasets are added from within the notebooks - except for selecting Runtime -> Change runtime type -> Hardware accelerator -> GPU.

MNIST classification experiments

train_(ModelName)_(Dataset).py: Trains (ModelName) on (Dataset). Training metrics and model weights will be saved to the specified directories.

src/: General utilities and model definitions.

notebooks/classification: An assortment of notebooks for model training, evaluation and digit-rotation uncertainty experiments. They also allow for weight-distribution plotting, weight pruning and loading pre-trained models for experimentation.

Bayes by Backprop (BBP)

(https://arxiv.org/abs/1505.05424)

Colab notebooks with regression models: BBP homoscedastic / heteroscedastic

Train a model on MNIST:

python train_BayesByBackprop_MNIST.py [--model [MODEL]] [--prior_sig [PRIOR_SIG]] [--epochs [EPOCHS]] [--lr [LR]] [--n_samples [N_SAMPLES]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

For an explanation of the script's arguments:

python train_BayesByBackprop_MNIST.py -h

Best results are obtained with a Laplace prior.

Local Reparametrisation Trick

(https://arxiv.org/abs/1506.02557)

Bayes by Backprop inference where the mean and variance of activations are calculated in closed form. Activations are sampled instead of weights. This makes the variance of the Monte Carlo ELBO estimator scale as 1/M, where M is the minibatch size, whereas sampling weights scales as (M-1)/M. The KL divergence between Gaussians can also be computed in closed form, further reducing variance. Each epoch is computed faster, and convergence is quicker.
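As a rough sketch (assuming a fully factorised Gaussian posterior with parameters w_mu, w_std, b_mu, b_std; not necessarily the exact layer in src/), a local-reparametrisation forward pass for a linear layer looks like this:

```python
import torch
import torch.nn.functional as F

def local_reparam_linear(x, w_mu, w_std, b_mu, b_std):
    # Closed-form mean and variance of the pre-activations under a factorised Gaussian posterior
    act_mu = F.linear(x, w_mu, b_mu)
    act_var = F.linear(x ** 2, w_std ** 2, b_std ** 2)
    # Sample the activations directly: one noise draw per activation, not per weight
    eps = torch.randn_like(act_mu)
    return act_mu + act_var.sqrt() * eps
```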

Train a model on MNIST:

python train_BayesByBackprop_MNIST.py --model Local_Reparam [--prior_sig [PRIOR_SIG]] [--epochs [EPOCHS]] [--lr [LR]] [--n_samples [N_SAMPLES]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

MC Dropout

(https://arxiv.org/abs/1506.02142)

A fixed dropout rate of 0.5 is set.

Colab notebooks with regression models: MC Dropout homoscedastic / heteroscedastic

Train a model on MNIST:

python train_MCDropout_MNIST.py [--weight_decay [WEIGHT_DECAY]] [--epochs [EPOCHS]] [--lr [LR]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

For an explanation of the script's arguments:

python train_MCDropout_MNIST.py -h

Stochastic Gradient Langevin Dynamics (SGLD)

(https://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf)

In order to converge to the true posterior over w, the learning rate should be annealed according to the Robbins-Monro conditions. In practice, we use a fixed learning rate.
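A minimal sketch of a single SGLD step with a fixed learning rate and a Gaussian prior (a hypothetical helper, not the repo's optimiser; assumes the minibatch gradients are of the mean negative log-likelihood):

```python
import torch

def sgld_step(params, nll_grads, lr, prior_sig, dataset_size):
    # One SGLD update (Welling & Teh): half a step along the scaled log-posterior
    # gradient plus injected Gaussian noise with variance equal to the learning rate.
    with torch.no_grad():
        for p, g in zip(params, nll_grads):
            grad_log_post = -dataset_size * g - p / prior_sig ** 2
            noise = torch.randn_like(p) * lr ** 0.5
            p.add_(0.5 * lr * grad_log_post + noise)
```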

Colab notebooks with regression models: SGLD homoscedastic / heteroscedastic

Train a model on MNIST:

python train_SGLD_MNIST.py [--use_preconditioning [USE_PRECONDITIONING]] [--prior_sig [PRIOR_SIG]] [--epochs [EPOCHS]] [--lr [LR]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

For an explanation of the script's arguments:

python train_SGLD_MNIST.py -h

pSGLD

(https://arxiv.org/abs/1512.07666)

SGLD with RMSprop preconditioning. A higher learning rate should be used than for vanilla SGLD.
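A sketch of the corresponding preconditioned update, ignoring the small curvature-correction term from the paper (variable names are illustrative):

```python
import torch

def psgld_step(p, grad_log_post, v, lr, alpha=0.99, eps=1e-5):
    # RMSprop-style running average of squared gradients
    v.mul_(alpha).add_((1 - alpha) * grad_log_post ** 2)
    precond = 1.0 / (v.sqrt() + eps)
    # Precondition both the gradient step and the injected noise
    noise = torch.randn_like(p) * (lr * precond).sqrt()
    p.add_(0.5 * lr * precond * grad_log_post + noise)
```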

Train a model on MNIST:

python train_SGLD_MNIST.py --use_preconditioning True [--prior_sig [PRIOR_SIG]] [--epochs [EPOCHS]] [--lr [LR]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

Bootstrap MAP Ensemble

Multiple networks are trained on subsamples of the dataset.
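For illustration, drawing the bootstrap resamples and combining the members' predictions might look like this (hypothetical helper names):

```python
import torch

def bootstrap_indices(n_data, n_nets, subsample=1.0, seed=0):
    # One resample (with replacement) of size subsample * n_data per ensemble member
    g = torch.Generator().manual_seed(seed)
    n_sub = int(subsample * n_data)
    return [torch.randint(0, n_data, (n_sub,), generator=g) for _ in range(n_nets)]

def ensemble_predict(nets, x):
    # The ensemble's predictive distribution is the average of the members' softmax outputs
    return torch.stack([net(x).softmax(dim=-1) for net in nets]).mean(0)
```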

Colab notebooks with regression models: MAP Ensemble homoscedastic / heteroscedastic

Train an ensemble on MNIST:

python train_Bootrap_Ensemble_MNIST.py [--weight_decay [WEIGHT_DECAY]] [--subsample [SUBSAMPLE]] [--n_nets [N_NETS]] [--epochs [EPOCHS]] [--lr [LR]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

For an explanation of the script's arguments:

python train_Bootrap_Ensemble_MNIST.py -h

Kronecker-Factorised Laplace

(https://openreview.net/pdf?id=Skdvd2xAZ)

Train a MAP network and then calculate a second-order Taylor series approximation to the curvature around a mode of the posterior. A block-diagonal Hessian approximation is used, where only intra-layer dependencies are accounted for. The Hessian is further approximated as the Kronecker product of the expectations of a single datapoint's Hessian factors. Approximating the Hessian can take a while; fortunately, it only needs to be done once.

Train a MAP network on MNIST and approximate Hessian:

python train_KFLaplace_MNIST.py [--weight_decay [WEIGHT_DECAY]] [--hessian_diag_sig [HESSIAN_DIAG_SIG]] [--epochs [EPOCHS]] [--lr [LR]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

For an explanation of the script's arguments:

python train_KFLaplace_MNIST.py -h

Note that we save the unscaled and uninverted Hessian factors. This will allow for computationally cheap changes to the prior at inference time as the Hessian will not need to be re-computed. Inference will require inverting the approximated Hessian factors and sampling from a matrix normal distribution. This is shown in notebooks/KFAC_Laplace_MNIST.ipynb
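For intuition, sampling a layer's weights from the resulting matrix-normal posterior might look roughly as follows (hypothetical tensor names; the damping and inversion details in the notebook may differ):

```python
import torch

def sample_kf_laplace_weights(w_map, h_in, h_out, prior_prec=1.0, n_samples=10):
    # Regularise each Kronecker factor with (part of) the prior precision before inverting
    A = h_in + prior_prec ** 0.5 * torch.eye(h_in.shape[0])    # input-side factor
    G = h_out + prior_prec ** 0.5 * torch.eye(h_out.shape[0])  # output-side factor
    # Cholesky square roots of the covariance factors (inverses of the precision factors)
    La = torch.cholesky(torch.inverse(A))
    Lg = torch.cholesky(torch.inverse(G))
    z = torch.randn(n_samples, *w_map.shape)                   # w_map has shape (out_dim, in_dim)
    return w_map + Lg @ z @ La.transpose(-1, -2)               # matrix-normal samples
```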

Stochastic Gradient Hamiltonian Monte Carlo

(https://arxiv.org/abs/1402.4102)

We implement the scale-adapted version of this algorithm, proposed here to find hyperparameters automatically during burn-in. We place a Gaussian prior over network weights and a Gamma hyperprior over the Gaussian's precision.
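For reference, a single step of vanilla SGHMC (Chen et al., 2014) looks like the sketch below; the scale-adapted variant used here additionally estimates the gradient noise and resamples the hyperparameters during burn-in:

```python
import torch

def sghmc_step(p, v, grad_log_post, lr, friction=0.05):
    # Langevin-style noise whose scale is tied to the friction term
    noise = torch.randn_like(p) * (2 * friction * lr) ** 0.5
    # Momentum update with friction, then a position update
    v.mul_(1 - friction).add_(lr * grad_log_post + noise)
    p.add_(v)
```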

Run the SG-HMC-SA burn-in and sampler, saving weights to the specified directory:

python train_SGHMC_MNIST.py [--epochs [EPOCHS]] [--sample_freq [SAMPLE_FREQ]] [--burn_in [BURN_IN]] [--lr [LR]] [--models_dir [MODELS_DIR]] [--results_dir [RESULTS_DIR]]

For an explanation of the script's arguments:

python train_SGHMC_MNIST.py -h

Approximate Inference in Neural Networks

MAP inference provides a point estimate of parameter values. When provided with out-of-distribution inputs, such as rotated digits, these models tend to make wrong predictions with high confidence.

Uncertainty Decomposition

We can measure the uncertainty in our models' predictions through the predictive entropy. We can decompose this term in order to distinguish between two types of uncertainty. Uncertainty caused by noise in the data, or aleatoric uncertainty, can be quantified as the expected entropy of the model's predictions. Model uncertainty, or epistemic uncertainty, can be measured as the difference between the total entropy and the aleatoric entropy.
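A minimal sketch of this decomposition from MC samples of the softmax output (the sample-first layout of probs is our assumption about how predictions are stored):

```python
import torch

def decompose_uncertainty(probs, eps=1e-10):
    # probs: (n_mc_samples, batch_size, n_classes) softmax outputs, one slice per weight sample
    mean_probs = probs.mean(dim=0)
    total = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)       # predictive (total) entropy
    aleatoric = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0) # expected entropy of predictions
    epistemic = total - aleatoric                                      # mutual information term
    return total, aleatoric, epistemic
```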

Results

Homoscedastic Regression

Toy homoscedastic regression task. Data is generated by a GP with an RBF kernel (l = 1, σn = 0.3). We use a single-output FC network with one hidden layer of 200 ReLU units to predict the regression mean μ(x). A fixed log σ is learnt separately.
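As an illustration, toy data of this kind can be drawn as follows (the input grid, seed and jitter are our assumptions, not necessarily those used for the plots):

```python
import torch

def gp_toy_data(n=128, lengthscale=1.0, noise_std=0.3, seed=0):
    torch.manual_seed(seed)
    x = torch.linspace(-4, 4, n).unsqueeze(1)
    # RBF (squared-exponential) kernel with a small jitter for numerical stability
    K = torch.exp(-0.5 * (x - x.t()) ** 2 / lengthscale ** 2) + 1e-6 * torch.eye(n)
    f = torch.cholesky(K) @ torch.randn(n, 1)      # one GP function draw
    y = f + noise_std * torch.randn(n, 1)          # homoscedastic observation noise
    return x, y
```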

Heteroscedastic Regression

Same scenario as previous section but log σ(x) is predicted from the input.

Toy heteroscedastic regression task. Data is generated by a GP with an RBF kernel (l = 1, σn = 0.3 · |x + 2|). We use a two-head network with 200 ReLU units to predict the regression mean μ(x) and log-standard deviation log σ(x).

Regression on UCI datasets

We performed heteroscedastic regression on six UCI datasets (housing, concrete, energy efficiency, power plant, red wine and yacht), using 10-fold cross validation. All of these experiments are contained in the heteroscedastic notebooks. Note that results depend heavily on hyperparameter selection. The plots below show log-likelihoods and RMSEs on the train (semi-transparent colour) and test (solid colour) sets. Circles and error bars correspond to the 10-fold cross-validation means and standard deviations respectively.

MNIST Classification

W is marginalised with 100 samples of the weights for all models except MAP, where only one set of weights is used.

| MNIST Test | MAP | MAP Ensemble | BBP Gaussian | BBP GMM | BBP Laplace | BBP Local Reparam | MC Dropout | SGLD | pSGLD |
|------------|-----|--------------|--------------|---------|-------------|-------------------|------------|------|-------|
| Log Like   | -572.9 | -496.54 | -1100.29 | -1008.28 | -892.85 | -1086.43 | -435.458 | -828.29 | -661.25 |
| Error %    | 1.58 | 1.53 | 2.60 | 2.38 | 2.28 | 2.61 | 1.37 | 1.76 | 1.76 |

MNIST test results for the methods under consideration. Extensive hyperparameter tuning has not been performed. We approximate the posterior predictive distribution with 100 MC samples. We use an FC network with two 1200-unit ReLU layers. If unspecified, the prior is Gaussian with std = 0.1. pSGLD uses RMSprop preconditioning.
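Concretely, the posterior predictive is approximated by averaging softmax outputs over stochastic forward passes, along the lines of the sketch below (assuming the model samples weights or dropout masks on every call):

```python
import torch

def posterior_predictive(model, x, n_samples=100):
    # Average class probabilities over n_samples stochastic forward passes;
    # for the MAP network, n_samples=1 and the forward pass is deterministic.
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0)
```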

The original paper for Bayes by Backprop reports around 1% error on MNIST. We find that this result is attainable only if the approximate posterior variances are initialised to be very small (BBP Gauss 2). In this scenario, the distributions over weights resemble deltas, giving good predictive performance but bad uncertainty estimates. However, when initialising the variances to match the prior (BBP Gauss 1), we obtain the results above. The training curves for both of these hyperparameter configurations are shown below:

MNIST Uncertainty

Total, aleatoric and epistemic uncertainties obtained when creating OOD samples by augmenting the MNIST test set with rotations:

Total and epistemic uncertainties obtained by testing our models, which have been trained on MNIST, on the KMNIST dataset:

Adversarial robustness

Total, aleatoric and epistemic uncertainties obtained when feeding our models adversarial samples (FGSM).

Weight Distributions

Histograms of weights sampled from each model trained on MNIST. We draw 10 samples of w for each model.

Weight Pruning

#TODO

Comments
  • UCI log likelihood calculation

    Hi. Great Repository! I was having trouble recreating the results from some of the papers which you implemented and I found your examples to be more complete than the original author repositories which is very helpful.

    I have a question regarding your calculation of the test set log likelihood of the UCI datasets...

    In this file (https://github.com/JavierAntoran/Bayesian-Neural-Networks/blob/master/notebooks/regression/mc_dropout_hetero.ipynb) you use the flow...

    - make tensor of (samples, mu) and (samples, log_sigma)
    - use means.mean() as mu and (means.var() + mean(sigma) ** 2) ** 0.5 as sigma
    - calculate sum of all log probability / data instances
    

    Then as a last step you add log(y_sigma) which is the standard deviation of the y values before they were normalized. Why do you do this and where does the necessity to do this come from?

    The original MC Dropout and Concrete Dropout repositories use some form of the logsumexp trick to calculate the log probabilities, but so far I have been unable to recreate the UCI results using their method. I get much closer with your method, but I cannot justify the last step to myself.

    Reference

    Concrete Dropout

    MC Dropout

    opened by jeffwillette 5
  • Posterior Predictive Distribution

    Hi,

    I’m trying to approximate the posterior predictive distribution that corresponds to the MC Dropout and Bayes By Backprop neural networks (which I see you say is possible in the section MNIST classification in README). I’m new to Python, so I’m having a little trouble figuring out how exactly you do this/ what part of the code carries this out?

    I tried to go about it by playing around with your function get_weight_samples, but noticed that this gives me the same weights each time I train the same network. For example, when training the MC Dropout network, I assumed the output of get_weight_samples would change as in theory different nodes are dropped during training each time. My confusion here makes me think that I maybe have misinterpreted what this function is supposed to be doing.

    Any clarification would be greatly appreciated! I’m sorry if this wasn’t the right place to post a question of this nature – new to Github and still learning the ropes.

    question 
    opened by CBird210 5
  • about the sampled output of bayes by backprop

    Hi, in BBB, we sample the outputs several times (5, for example) when testing. This is due to the nature of MC. But how do we use these 5 outputs for one sample? Averaging them? Summing them? Or just taking the best one as this sample's output vector? Is there any theory about how to use these sampled results? Thanks!

    opened by ShellingFord221 5
  • about Gaussian prior

    Hi, in bbp_homoscedastic.ipynb, it seems that you choose a single Gaussian prior rather than a scale-mixture prior. I think that a mixture-of-Gaussians prior can better model the real distribution of the weights w. Thanks!

    opened by ShellingFord221 4
  • about local reparametrisation trick

    Hi, in local reparametrisation trick, when computing output of convolution layer, we use alpha * mu^2 to replace sigma^2. But when computing KL, should we also use alpha * mu^2 to replace the weight's variance? Why or why not? Thanks!

    opened by ShellingFord221 3
  • Scaling langevin noise

    Hi @JavierAntoran @stratisMarkou ,

    I am currently using SGLD in another project. I have some example code, where the noise added to the parameters is scaled by the learning rate:

    noise = torch.randn(n.size()) * param_noise_sigma * lr.

    I believe this is wrong, and so I scaled the noise with the square root of the learning rate:

    noise = torch.randn(n.size()) * np.sqrt(lr)

    This works well. However, you scale the learning rate by

    1/np.sqrt(sigma)

    If I do this, the noise gets way too big and the network does not learn. Also, in the initial paper by Welling et al. it seems to me that the first approach is right, though I already found out that the variance of the noise scales with lr**2, which is why the square root is necessary.

    I hope you can explain to me, why the ()^-1 is necessary. Thank you in advance!

    question 
    opened by maltetoelle 2
  • bbp_hetero uncertainties plot

    For the bbp_hetero notebook, should the plot be done as:

    plt.fill_between(np.linspace(-5, 5, 200) * x_std + x_mean, means - aleatoric, means + aleatoric, color=c[0], alpha=0.3, label='Aleatoric')
    plt.fill_between(np.linspace(-5, 5, 200) * x_std + x_mean, means - epistemic, means + epistemic, color=c[1], alpha=0.3, label='Epistemic')

    At the moment, the boundaries are mixed.

    Thanks for the attention so far, Celso

    bug 
    opened by caxelrud 2
  • May I ask the code about calculating the Hessian in the Kronecker-Factorised Laplace methods?

    def softmax_CE_preact_hessian(last_layer_acts):
        side = last_layer_acts.shape[1]
        I = torch.eye(side).type(torch.ByteTensor)
        # for i != j: H = -ai * aj -- Note that these are activations not pre-activations
        Hl = -last_layer_acts.unsqueeze(1) * last_layer_acts.unsqueeze(2)
        # for i == j: H = ai * (1 - ai)
        Hl[:, I] = last_layer_acts * (1 - last_layer_acts)
        return Hl

    This function calculates the hessian for the first layer's activations. Why can the hessian be obtained by this process?

    for i != j H = -ai * aj -- Note that these are activations not pre-activations

    for i == j H = ai * (1 - ai)

    Is this process shown in the paper (https://openreview.net/pdf?id=Skdvd2xAZ)?

    Looking forward to your reply! Really thank you!

    question 
    opened by git595463210 1
  • Looking for the implementation of "Getting a CLUE"

    Looking for the implementation of "Getting a GlUE"

    Hi @JavierAntoran, my name is Hao. I'm looking for a valid method for uncertainty explanations, i.e. decomposing the uncertainty of a prediction back onto the features, and I found your paper "Getting a CLUE: A Method for Explaining Uncertainty Estimates" fantastic. Do you have an implementation of CLUE in this repo, or where can I find it? I ask because I see a reference to this GitHub repository in the paper. I'm researching the use of ICU data for sepsis prediction; I'd really appreciate your help.

    Thanks in Advance Hao

    opened by SuperHao-Wu 1
  • About calculating the logarithm of the Gaussian mixture model

    Hi! First of all, thank you for providing your code! I think there is an error in the class "spike_slab_2GMM" in the "priors.py" file:

    normalised_like = self.pi1 + torch.exp(N1_ll - max_loglike) + self.pi2 + torch.exp(N2_ll - max_loglike)

    After calculation and verification, I think it should be changed to:

    normalised_like = self.pi1 * torch.exp(N1_ll - max_loglike) + self.pi2 * torch.exp(N2_ll - max_loglike)

    I'm sorry to disturb you in your busy schedule~

    bug question 
    opened by MrHhorse 1
  • Small typos in two files

    Your code has been helpful for me. However, it seems there are some mistakes in the files train_BayesByBackprop_MNIST.py and train_Bootrap_Ensemble_MNIST.py, at lines 219 and 210 respectively. Those in the other files are right.

    Looking forward to your reply.

    opened by lan-qing 1
  • Question: log_gaussian_loss function used in MC Dropout and SGLD

    Firstly, thank you for all these great notebooks, they've been very helpful in building a better understanding of these methods.

    I am wondering where the function log_gaussian_loss originates from? I'm struggling to find reference to it in the literature, though very likely looking in the wrong places.

    In MC dropout it seems that one output neuron is for the prediction and another that feeds into this loss function, and I'm struggling to get a simpler version working whereby there's one output neuron and a different loss function. Where does this technique originate from?

    Thanks again

    question 
    opened by acse-aol21 1
  • [Question][MC dropout] drop hidden units for each data point or for each mini-batch data?

    Hi JavierAntoran,

    Thanks for the wonderful code; it has been really helpful for my work in a related area. I'd like to ask a question about MC dropout. In BBB with local reparameterization, the activation values are sampled for each data point instead of directly sampling a weight distribution, to reduce the computational complexity. So, in MC dropout, should we follow a similar procedure, e.g. drop out hidden units for each data point in the training or testing phase? I notice that your MC dropout model seems to use the same dropout mask for a mini-batch of data, and the default batch size is 128. Should I change the batch size to 1 to achieve the goal of dropping hidden units for each data point?

    Looking forward to your reply. Thanks a lot

    question 
    opened by alwaysuu 1
  • [Question] BBB vs BBB w/ Local Reparameterization

    Hi @JavierAntoran @stratisMarkou,

    First of all, thanks for making all of this code available - it's been great to look through!

    I'm currently spending some time working through Weight Uncertainty in Neural Networks in order to implement Bayes-by-Backprop. I was struggling to understand the difference between your implementation of Bayes-by-Backprop and Bayes-by-Backprop with Local Reparameterization.

    I was under the impression that the local reparameterization was the following:

    https://github.com/JavierAntoran/Bayesian-Neural-Networks/blob/022b9cedb69be3fecc83d4b0efe4b5a848119c2a/src/Bayes_By_Backprop/model.py#L58-L66

    However this same approach is used in both methods.

    The main difference I see in the code you've implemented is the calculation of the KL Divergence in closed form in the Local Reparameterization version of the code due to the use of a Gaussian prior / posterior distribution.

    I was wondering if my understanding of the local reparameterization method was wrong, or if I had simply misunderstood the code?

    Any guidance would be much appreciated!

    good first issue question 
    opened by danielkelshaw 4
  • Panda not reading the first row

    Minor bug. Instead of:

    data = pd.read_csv('housing.data', header=0, delimiter="\s+").values

    Use:

    data = pd.read_csv('housing.data', header=None, delimiter="\s+").values

    bug 
    opened by caxelrud 1