A PyTorch implementation of L-BFGS.

Overview

PyTorch-LBFGS: A PyTorch Implementation of L-BFGS

Authors: Hao-Jun Michael Shi (Northwestern University) and Dheevatsa Mudigere (Facebook)

What is it?

PyTorch-LBFGS is a modular implementation of L-BFGS, a popular quasi-Newton method, for PyTorch. It is compatible with many recent algorithmic advances for improving and stabilizing stochastic quasi-Newton methods, and it addresses many of the deficiencies of the existing PyTorch L-BFGS implementation. It is designed to provide maximal flexibility to researchers and practitioners in the design and implementation of stochastic quasi-Newton methods for training neural networks.

Main Features

  1. Compatible with multi-batch and full-overlap L-BFGS
  2. Line searches including (stochastic) Armijo backtracking line search (with or without cubic interpolation) and weak Wolfe line search for automatic steplength (or learning rate) selection
  3. Powell damping with more sophisticated curvature pair rejection or damping criterion for constructing the quasi-Newton matrix

Getting Started

To use the L-BFGS optimizer module, simply add /functions/LBFGS.py to your current path and use

from LBFGS import LBFGS, FullBatchLBFGS

to import the L-BFGS or full-batch L-BFGS optimizer, respectively.

Alternatively, you can add LBFGS.py into torch.optim on your local PyTorch installation. To do this, simply add LBFGS.py to /path/to/site-packages/torch/optim, then modify /path/to/site-packages/torch/optim/__init__.py to include the lines from .LBFGS import LBFGS, FullBatchLBFGS and del LBFGS, FullBatchLBFGS. After restarting your Python kernel, you will be able to use PyTorch-LBFGS's LBFGS optimizer like any other optimizer in PyTorch.

To see how full-batch, full-overlap, or multi-batch L-BFGS may be easily implemented with a fixed steplength, Armijo backtracking line search, or Wolfe line search, please see the example codes provided in the /examples/ folder.
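
For instance, a minimal instantiation might look like the following. The constructor arguments shown (lr, history_size, line_search, debug) are the ones used in the examples and the issue threads below; treat this as an illustrative sketch rather than a complete argument list.

import torch
from LBFGS import LBFGS, FullBatchLBFGS  # assumes functions/LBFGS.py is on the path

model = torch.nn.Linear(10, 1)  # toy model for illustration

# stochastic variant (multi-batch or full-overlap L-BFGS)
optimizer = LBFGS(model.parameters(), lr=1.0, history_size=10, line_search='Wolfe')

# deterministic variant; its step(options) wraps the two-loop recursion,
# step, and curvature update internally
full_batch_optimizer = FullBatchLBFGS(model.parameters(), lr=1.0, history_size=10,
                                      line_search='Wolfe', debug=True)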

Understanding the Main Features

We give a brief overview of (L-)BFGS and each of the main features of the optimization algorithm.

0. Quasi-Newton Methods

Quasi-Newton methods build an approximation to the Hessian $B_k \approx \nabla^2 f(x_k)$ to apply a Newton-like algorithm $x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)$. To do this, they solve for a matrix that satisfies the secant condition $B_{k+1} s_k = y_k$. L-BFGS is one particular optimization algorithm in the family of quasi-Newton methods that approximates the BFGS algorithm using limited memory. Whereas BFGS requires storing a dense matrix, L-BFGS only requires storing 5-20 vectors to approximate the matrix implicitly and constructs the matrix-vector product on-the-fly via a two-loop recursion.
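
For reference, the textbook two-loop recursion can be sketched as follows. This is the generic algorithm rather than the module's internal implementation: s_list and y_list hold the stored curvature pairs from oldest to newest, and gamma is the initial scaling $H_k^0 = \gamma I$.

import torch

def two_loop_recursion(grad, s_list, y_list, gamma):
    # Implicitly computes H_k @ grad from the stored curvature pairs (s_i, y_i),
    # where H_k is the L-BFGS inverse Hessian approximation and H_k^0 = gamma * I.
    rhos = [1.0 / torch.dot(y, s) for s, y in zip(s_list, y_list)]
    q = grad.clone()
    alphas = []
    # first loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * torch.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    r = gamma * q  # apply the initial inverse Hessian approximation
    # second loop: oldest pair to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * torch.dot(y, r)
        r += (alpha - beta) * s
    return r  # approximates H_k @ grad; pass -grad to obtain a quasi-Newton direction

In the module, this role is played by the two_loop_recursion(vec) function described below, which uses the optimizer's internally stored history.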

In the deterministic or full-batch setting, L-BFGS constructs an approximation to the Hessian by collecting curvature pairs $(s_k, y_k)$ defined by differences in consecutive gradients and iterates, i.e. $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. In our implementation, the curvature pairs are updated after an optimization step is taken (which yields $x_{k+1}$).

Note that other popular optimization methods for deep learning, such as Adam, construct diagonal scalings, whereas L-BFGS constructs a positive definite matrix for scaling the (stochastic) gradient direction.

There are three components to using this algorithm:

  1. two_loop_recursion(vec): Applies the L-BFGS two-loop recursion to construct the vector $H_k v$, i.e. the inverse Hessian approximation applied to vec.
  2. step(p_k, g_Ok, g_Sk=None, options={}): Takes a step $x_{k+1} = x_k + t_k p_k$ and stores $g_{O_k} = \nabla F_{O_k}(x_k)$ for constructing the next curvature pair. In addition, g_Sk is provided to store $\nabla F_{S_k}(x_k)$ for Powell damping or the curvature pair rejection criterion. (If it is not specified, then $g_{S_k} = g_{O_k}$.) options passes necessary parameters or callable functions to the line search.
  3. curvature_update(flat_grad, eps=0.2, damping=True): Updates the L-BFGS matrix by computing the curvature pair $(s_k, y_k)$ using flat_grad and the stored $g_{O_k}$, then checks the Powell damping criterion to possibly reject or modify the curvature pair.
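
Putting the three calls together, a hedged sketch of a single full-overlap iteration is given below. The toy model, the flat_grad helper, and the option keys are illustrative; the scripts in /examples/ remain the authoritative reference for the exact call sequence.

import torch
from LBFGS import LBFGS  # assumes functions/LBFGS.py is on the path

# toy problem (illustrative)
model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
X_Ok, y_Ok = torch.randn(128, 5), torch.randn(128, 1)  # current sample O_k

optimizer = LBFGS(model.parameters(), lr=1.0, history_size=10, line_search='Armijo')

def flat_grad():
    return torch.cat([param.grad.view(-1) for param in model.parameters()])

def closure():
    optimizer.zero_grad()
    return loss_fn(model(X_Ok), y_Ok)

# gradient over O_k at the current iterate x_k
loss = closure()
loss.backward()
g_Ok = flat_grad()

# 1. construct the search direction via the two-loop recursion
p = optimizer.two_loop_recursion(-g_Ok)

# 2. take a line-search step; g_Ok is stored for the next curvature pair
options = {'closure': closure, 'current_loss': loss}
optimizer.step(p, g_Ok, options=options)

# 3. recompute the gradient over the same sample at the new iterate x_{k+1}
#    and update the curvature pairs (with Powell damping enabled here)
loss = closure()
loss.backward()
optimizer.curvature_update(flat_grad(), eps=0.2, damping=True)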

If one is interested in using full-batch or deterministic L-BFGS, one can call the FullBatchLBFGS optimizer. The step(options) function for FullBatchLBFGS wraps the two-loop recursion, curvature updating, and step functions from LBFGS to simplify the implementation in this case.
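
For the full-batch wrapper, a training loop takes the shape below. This sketch is adapted from the GP regression example in the issue threads further down the page, so the option keys and the eight values unpacked from step follow that example rather than a separately documented API.

import torch
from LBFGS import FullBatchLBFGS  # assumes functions/LBFGS.py is on the path

# toy full-batch problem (illustrative)
model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
X, y = torch.randn(256, 5), torch.randn(256, 1)

optimizer = FullBatchLBFGS(model.parameters(), lr=1.0, line_search='Wolfe')

def closure():
    optimizer.zero_grad()
    return loss_fn(model(X), y)

loss = closure()
loss.backward()

for i in range(20):
    options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
    # returns (as unpacked in the examples): objective, gradient, steplength, backtracks,
    # closure evaluations, gradient evaluations, descent-direction flag, line-search failure flag
    loss, grad, lr, backtracks, clos_evals, grad_evals, desc_dir, fail = optimizer.step(options)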

Using quasi-Newton methods in the noisy regime requires more work. We will describe below some of the key features of our implementation that will help stabilize L-BFGS when used in conjunction with stochastic gradients.

1. Stable Quasi-Newton Updating

The key to applying quasi-Newton updating in the noisy setting is to require consistency in the gradient difference in order to prevent differencing noise.

We provide examples of two approaches for doing this:

  1. Full-Overlap: This approach simply requires us to evaluate the gradient on a sample twice, at both the current and next iterate, hence introducing the additional cost of a forward and backward pass over the sample at each iteration, depending on the line search that is used. In particular, given a sample $S_k$, we obtain $y_k$ by computing $y_k = \nabla F_{S_k}(x_{k+1}) - \nabla F_{S_k}(x_k)$.

[Figure: the full-overlap approach]

  2. Multi-Batch: This approach uses the difference between the gradients over the overlap $O_k = S_k \cap S_{k+1}$ between two consecutive samples, hence not requiring any additional cost for curvature pair updating, but it incurs sampling bias. This approach also suffers from being generally more tedious to code, although it is more efficient. Note that each sample $S_k$ is the union of the overlap from the previous iteration, an additional set of samples, and the overlap with the next iteration, i.e. $S_k = O_{k-1} \cup N_k \cup O_k$. Given two consecutive samples $S_k$ and $S_{k+1}$, we obtain $y_k$ by computing $y_k = \nabla F_{O_k}(x_{k+1}) - \nabla F_{O_k}(x_k)$. In multi_batch_lbfgs_example.py, the variable g_Ok denotes $\nabla F_{O_k}(x_k)$, the gradient over the current overlap at the current iterate, and the variable g_Ok_prev represents $\nabla F_{O_{k-1}}(x_k)$, the gradient over the previous overlap at the current iterate.

[Figure: the multi-batch approach]

The code is designed to allow for both of these approaches by delegating control of the samples and the gradients passed to the optimizer to the user. Whereas the existing PyTorch L-BFGS module runs L-BFGS on a fixed sample (possibly full-batch) for a set number of iterations or until convergence, this implementation permits sampling a new mini-batch stochastic gradient at each iteration and is hence amenable to stochastic quasi-Newton methods; it follows the design of other PyTorch optimizers, where one step is equivalent to a single iteration of the algorithm.
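
To make the multi-batch bookkeeping concrete, a hedged sketch of how one might form a sample $S_k = O_{k-1} \cup N_k \cup O_k$ and the gradients g_Ok and g_Ok_prev is given below. The index handling is purely illustrative; multi_batch_lbfgs_example.py is the reference implementation.

import torch

# toy data and model (illustrative)
model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
X, y = torch.randn(1000, 5), torch.randn(1000, 1)

batch_size, overlap_size = 256, 64

def flat_grad(idx):
    # flat gradient of the loss over the subsample selected by idx
    loss = loss_fn(model(X[idx]), y[idx])
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

# sample S_k: reuse the previous overlap O_{k-1}, draw a fresh set N_k,
# and set aside an overlap O_k to share with the next sample S_{k+1}
O_prev = torch.randperm(1000)[:overlap_size]
N_k = torch.randperm(1000)[:batch_size - 2 * overlap_size]
O_k = torch.randperm(1000)[:overlap_size]
S_k = torch.cat([O_prev, N_k, O_k])

g_Sk = flat_grad(S_k)           # gradient over the full sample, used for the step
g_Ok = flat_grad(O_k)           # gradient over the overlap with S_{k+1}, at x_k
g_Ok_prev = flat_grad(O_prev)   # gradient over the overlap with S_{k-1}, at x_k
# the curvature pair for iteration k-1 uses g_Ok_prev minus the stored
# gradient over O_{k-1} evaluated at the previous iterate x_{k-1}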

2. Line Searches

Deterministic quasi-Newton methods, particularly BFGS and L-BFGS, have traditionally been coupled with line searches that automatically determine a good steplength (or learning rate) and exploit these well-constructed search directions. Although these line searches have been crucial to the success of quasi-Newton algorithms in deterministic nonlinear optimization, the power of line searches in machine learning has generally been overlooked due to concerns regarding computational cost. To overcome these issues, stochastic or probabilistic line searches have been developed to determine steplengths in the noisy setting.

We provide four basic (stochastic) line searches that may be used in conjunction with L-BFGS in the step function:

  1. (Stochastic) Armijo Backtracking Line Search: Ensures that the Armijo or sufficient decrease condition is satisfied on the function evaluated by the closure() function by backtracking from each trial point by multiplying by a constant factor less than 1.
  2. (Stochastic) Armijo Backtracking Line Search with Cubic Interpolation: Similar to the basic backtracking line search but utilizes a quadratic or cubic interpolation to determine the next trial. This is based on Mark Schmidt's minFunc MATLAB code.
  3. (Stochastic) Weak Wolfe Line Search: Based on Michael Overton's weak Wolfe line search implementation in MATLAB, ensures that both the sufficient decrease condition and curvature condition are satisfied on the function evaluated by the closure() function by performing a bisection search.
  4. (Stochastic) Weak Wolfe Line Search with Cubic Interpolation: Similar to the weak Wolfe line search but utilizes quadratic interpolation to determine the next trial when backtracking.

Note: For quasi-Newton algorithms, the weak Wolfe line search, although immensely simple, gives similar performance to the strong Wolfe line search, a more complex line search algorithm that utilizes a bracketing and zoom phase, for smooth, nonlinear optimization. In the nonsmooth setting, the weak Wolfe line search (not the strong Wolfe line search) is essential for quasi-Newton algorithms. For these reasons, we only implement a weak Wolfe line search here.

One may also use a constant steplength provided by the user, as in the original PyTorch implementation. See https://en.wikipedia.org/wiki/Wolfe_conditions for more detail on the sufficient decrease and curvature conditions.

To use these, when defining the optimizer, the user can specify the line search by setting line_search to Armijo, Wolfe, or None. The user must then define the options (typically a closure for reevaluating the model and loss) passed to the step function to perform the line search. The lr parameter defines the initial steplength in the line search algorithm.

We also provide an inplace toggle in the options to determine whether or not the variables are updated in-place in the line searches. In-place updating is faster but less numerically accurate than storing the current iterate and reloading it after each trial in the line search.
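
For example, a Wolfe line search with interpolation and with in-place updates disabled might be configured as shown below. The option keys mirror those used in the examples and issue threads on this page, and closure, loss, and model are assumed to be defined as in the earlier sketches.

optimizer = FullBatchLBFGS(model.parameters(), lr=1.0, line_search='Wolfe')

options = {
    'closure': closure,      # re-evaluates the model and loss at trial points
    'current_loss': loss,    # loss at the current iterate
    'eta': 2,                # line search parameter, as set in the example scripts
    'max_ls': 25,            # maximum number of line search iterations
    'interpolate': True,     # enable cubic/quadratic interpolation
    'inplace': False,        # store and reload the iterate instead of updating in-place
}
obj, grad, lr, backtracks, clos_evals, grad_evals, desc_dir, fail = optimizer.step(options)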

3. Curvature Pair Rejection Criterion and Powell Damping

Another key consideration for L-BFGS is determining when the history used in constructing the L-BFGS matrix should be updated. In particular, one needs to ensure that the matrix remains positive definite. Existing implementations of L-BFGS have generally checked if $y_k^T s_k > \epsilon$ or $y_k^T s_k > \epsilon \|s_k\|^2$ (for some small positive constant $\epsilon$), rejecting the curvature pair if the condition is not satisfied. However, both of these approaches suffer from a lack of scale-invariance with respect to the objective function and reject curvature pairs when the algorithm is converging close to the solution.

Rather than doing this, we propose using the Powell damping condition described in Nocedal and Wright (2006) as the rejection criterion, which ensures that $s_k^T y_k \ge \epsilon\, s_k^T B_k s_k$. Alternatively, one can modify the definition of $y_k$ to ensure that the condition explicitly holds by applying Powell damping to the gradient difference. This has been found to be useful for the stochastic nonconvex setting.
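
Concretely, the damped gradient difference of Nocedal and Wright (2006) can be written as follows, with $B_k$ the current Hessian approximation and $\epsilon$ the damping constant (0.2 in the standard statement):

\tilde{y}_k = \theta_k y_k + (1 - \theta_k) B_k s_k,
\qquad
\theta_k =
\begin{cases}
1, & \text{if } s_k^T y_k \ge \epsilon\, s_k^T B_k s_k, \\
\dfrac{(1 - \epsilon)\, s_k^T B_k s_k}{s_k^T B_k s_k - s_k^T y_k}, & \text{otherwise},
\end{cases}

so that the damped pair satisfies $s_k^T \tilde{y}_k \ge \epsilon\, s_k^T B_k s_k > 0$ and the updated matrix remains positive definite.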

One can perform curvature pair rejection by setting damping=False or apply Powell damping by simply setting damping=True in the step function. Powell damping is not applied by default.

Which variant of stochastic L-BFGS should I use?

By default, the algorithm uses a (stochastic) Wolfe line search without Powell damping. We recommend implementing this in conjunction with the full-overlap approach with a sufficiently large batch size (say, 2048, 4096, or 8192) as this is easiest to implement and leads to the most stable performance. If one uses an Armijo backtracking line search or fixed steplength, we suggest incorporating Powell damping to prevent skipping curvature updates. Since stochastic quasi-Newton methods are still an active research area, this is by no means the final algorithm. We encourage users to try other variants of stochastic L-BFGS to see what works well.

Recent Changes

We've added the following minor features:

  • Full-Batch L-BFGS wrapper.
  • Option for in-place updates.
  • Quadratic interpolation in Wolfe line search backtracking.

To Do

In maintaining this module, we are working to add the following features:

  • Additional initializations of the L-BFGS matrix aside from the Barzilai-Borwein scaling.
  • Wrappers for specific optimizers developed in various papers.
  • Using Hessian-vector products for computing curvature pairs.
  • More sophisticated stochastic line searches.
  • Easy parallelization of L-BFGS methods.

Acknowledgements

Thanks to Raghu Bollapragada, Jorge Nocedal, and Yuchen Xie for feedback on the details of this implementation, and Kenjy Demeester, Jaroslav Fowkes, and Dominique Orban for help on installing CUTEst and its Python interface for testing the implementation. Many thanks to Jacob Gardner, Geoff Pleiss, and Ben Letham for feedback and help on additional testing on Gaussian processes in GPyTorch.

References

  1. Berahas, Albert S., Jorge Nocedal, and Martin Takác. "A Multi-Batch L-BFGS Method for Machine Learning." Advances in Neural Information Processing Systems. 2016.
  2. Bollapragada, Raghu, et al. "A Progressive Batching L-BFGS Method for Machine Learning." International Conference on Machine Learning. 2018.
  3. Lewis, Adrian S., and Michael L. Overton. "Nonsmooth Optimization via Quasi-Newton Methods." Mathematical Programming 141.1-2 (2013): 135-163.
  4. Liu, Dong C., and Jorge Nocedal. "On the Limited Memory BFGS Method for Large Scale Optimization." Mathematical Programming 45.1-3 (1989): 503-528.
  5. Nocedal, Jorge. "Updating Quasi-Newton Matrices With Limited Storage." Mathematics of Computation 35.151 (1980): 773-782.
  6. Nocedal, Jorge, and Stephen J. Wright. "Numerical Optimization." Springer New York, 2006.
  7. Schmidt, Mark. "minFunc: Unconstrained Differentiable Multivariate Optimization in Matlab." Software available at http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html (2005).
  8. Schraudolph, Nicol N., Jin Yu, and Simon Günter. "A Stochastic Quasi-Newton Method for Online Convex Optimization." Artificial Intelligence and Statistics. 2007.
  9. Wang, Xiao, et al. "Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization." SIAM Journal on Optimization 27.2 (2017): 927-956.

Feedback, Questions, or Suggestions?

We are looking for feedback on the code! If you'd like to share your experience with using this module, please let us know here: https://discuss.pytorch.org/t/feedback-on-pytorch-l-bfgs-implementation/26678.

For more technical issues, questions, or suggestions, please use the Issues tab on the Github repository. Any suggestions on improving the modules are always welcome!

Comments
  • Line search variants as function

    https://github.com/hjmshi/PyTorch-LBFGS/blob/6677e1f22c54b78f1481b517fd9e823484935de6/functions/LBFGS.py#L424

    Instead of lengthy if-else sections, maybe it would be better to have the different line search options as independent functions? I remember we had talked about this, but not sure what we decided and why :)

    opened by dmudiger 9
  • Line search failing frequently

    In my use-case of full batch L-BFGS, I'm often running into issues where the line search is failing. But at the same time I don't want to make max_ls too large, or else the optimizer spends a lot of time trying to find a tiny step size that satisfies the Wolfe conditions, which ends up not affecting the loss function much at all and wasting computation time.

    I've tried playing around with the Wolfe condition constants c1 and c2 but it hasn't helped.

    Is there some way to force the optimizer to still take a step at the end of its line-search even if it can't find a step-size that satisfies the Wolfe conditions?

    opened by KeAWang 5
  • Error due to .view operation on gradients

    I am only running on CPU right now, but will move on to powerful GPUs once I get it to work on CPU. I am using pytorch 1.6.0.

    I have a class for iteratively solving an inverse problem. This class uses your LBFGS optimizer, specifically, with the following parameters - self.optimizer = FullBatchLBFGS(self.model.parameters(), lr=1, history_size=10, line_search='Wolfe', debug=True)

    I use the following code for optimization based on your examples:

    options = {'closure': closure, 'current_loss': self.loss_fn(self.model(), data), 'eta': 2, 'max_ls': 100, 'interpolate': True, 'inplace': False}
    obj, grad, lr, backtracks, clos_evals, grad_evals, desc_dir, fail = self.optimizer.step(options=options)

    However, I get the following error:

    File "/Users/mohan3/Desktop/Devs/Phase_Img/Code/phasetorch/phasetorch/pret.py", line 59, in run
      obj, grad, lr, backtracks, clos_evals, grad_evals, desc_dir, fail = self.optimizer.step(
    File "/Users/mohan3/Desktop/Devs/Phase_Img/Code/phasetorch/phasetorch/LBFGS.py", line 1107, in step
      return self._step(p, grad, options=options)
    File "/Users/mohan3/Desktop/Devs/Phase_Img/Code/phasetorch/phasetorch/LBFGS.py", line 872, in _step
      g_new = self._gather_flat_grad()
    File "/Users/mohan3/Desktop/Devs/Phase_Img/Code/phasetorch/phasetorch/LBFGS.py", line 250, in _gather_flat_grad
      view = p.grad.data.view(-1)
    RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

    It seems like there is a problem with using .view within the file LBFGS.py. The error message suggests to try using reshape. Is it possible to re-implement this method using torch.reshape? Just replacing the .view with torch.reshape seems to result in a new error message "Not a descent direction!".

    opened by adityamnk 4
  • Slight change in network when no step is taken in line search due to numerical error

    I've been observing another interesting issue: there are many occasions where the steplength should be 0 and the network parameters should not have been changed as a result of the line search failing (i.e. reaching the maximum number of line search steps). I believe that this is because of the way I had implemented the line search - it tracks the current and previous steplength and directly modifies the network parameters to save memory. Of course, the way to reset the parameters then is to add the update with steplength -t (which is what is done). Interestingly though, I observe that there are slight changes in the loss, which is most likely due to numerical error. Is this an issue we should be concerned about or is this something we can ignore?

    opened by hjmshi 4
  • [minor] Use torchvision for data ?

    https://github.com/hjmshi/PyTorch-LBFGS/blob/6677e1f22c54b78f1481b517fd9e823484935de6/examples/full_batch_lbfgs_example.py#L42

    Use torchvision or something similar for data, instead of having to have keras (+TF/Theano backend) for just the data loading :)

    opened by dmudiger 4
  • Some confusion about the code.

    When I use the Powell damping to hold the PSD property of the metric matrix, I find the Bs update are written as "Bs.copy_(g_Sk.mul(-t))". Is it right?

    opened by yangorwell 2
  • Ceres Like API?

    Hi all,

    I have a problem statement, where I am optimizing over parameters T, with several losses L. For speed I actually already implemented the Jacobian computation of dLdT.

    In Ceres, I just needed to pass the residuals, current parameters and dLdT and ceres would just go ahead and apply Gauss Newton with Line Search or Trust Region and optimize it for me.

    Do you have an example, or a mechanism where I can pass dLdT, residuals and T and you handle the optimization? I want to use this because Ceres is too slow (Takes almost 100ms per iteration) for 200 parameters and 200 Losses. (dLdT Jacobian is a matrix 200x200)

    opened by soulslicer 2
  • Why do parameters still change if the linesearch fails?

    Using the example code for GP regression (updated to use the master branch of gpytorch) and manually setting max_ls=1 and lr=0.01 to force linesearch failure:

    import math
    import torch
    import gpytorch
    from matplotlib import pyplot as plt
    
    sys.path.append('../../../PyTorch-LBFGS/functions/')
    from LBFGS import FullBatchLBFGS
    
    # Training data is 11 points in [0,1] inclusive regularly spaced
    train_x = torch.linspace(0, 1, 100)
    # True function is sin(2*pi*x) with Gaussian noise
    train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2
    
    # We will use the simplest form of GP model, exact inference
    class ExactGPModel(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood):
            super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ConstantMean()
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        
        def forward(self, x):
            mean_x = self.mean_module(x)
            covar_x = self.covar_module(x)
            return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
    
    # initialize likelihood and model
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = ExactGPModel(train_x, train_y, likelihood)
    
    # Find optimal model hyperparameters
    model.train()
    likelihood.train()
    
    # Use full-batch L-BFGS optimizer
    optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)
    
    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    
    # define closure
    def closure():
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        return loss
    
    loss = closure()
    loss.backward()
    
    training_iter = 10
    for i in range(training_iter):
    
        # perform step and update curvature
        options = {'closure': closure, 'current_loss': loss, 'max_ls': 1}
        loss, _, lr, _, F_eval, G_eval, _, _ = optimizer.step(options)
    
        print('Iter %d/%d - Loss: %.16f - LR: %.3f - Func Evals: %0.0f - Grad Evals: %0.0f - Raw-Lengthscale: %.16f - Raw_Noise: %.16f' % (
            i + 1, training_iter, loss.item(), lr, F_eval, G_eval,
            model.covar_module.base_kernel.raw_lengthscale.item(),
            model.likelihood.raw_noise.item()
            ))
    
    

    I get the following output

    Iter 1/10 - Loss: 0.9470608234405518 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000004039440 - Raw_Noise: 0.0000000014626897 - Fail: True
    Iter 2/10 - Loss: 0.9470605254173279 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000004354150 - Raw_Noise: 0.0000000029325762 - Fail: True
    Iter 3/10 - Loss: 0.9470604658126831 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000005869125 - Raw_Noise: 0.0000000007250947 - Fail: True
    Iter 4/10 - Loss: 0.9470607042312622 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000006170258 - Raw_Noise: 0.0000000007257976 - Fail: True
    Iter 5/10 - Loss: 0.9470604062080383 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000004235302 - Raw_Noise: 0.0000000022039979 - Fail: True
    Iter 6/10 - Loss: 0.9470605254173279 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000002037677 - Raw_Noise: 0.0000000036952374 - Fail: True
    Iter 7/10 - Loss: 0.9470604062080383 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000005822218 - Raw_Noise: 0.0000000052020050 - Fail: True
    Iter 8/10 - Loss: 0.9470604062080383 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000008010645 - Raw_Noise: 0.0000000066766215 - Fail: True
    Iter 9/10 - Loss: 0.9470604658126831 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000005968881 - Raw_Noise: 0.0000000059220691 - Fail: True
    Iter 10/10 - Loss: 0.9470604658126831 - LR: 0.000 - Func Evals: 3 - Grad Evals: 2 - Raw-Lengthscale: -0.0000000002212988 - Raw_Noise: 0.0000000066797297 - Fail: True
    

    Clearly the linesearch is failing and the lr is set to 0. But why are the parameters still changing?

    opened by KeAWang 2
  • How to cite this implementation?

    If we would like to refer to this implementation of FullBatchLBFGS in a publication, is there a specific paper that we should cite? If so, would it be possible for the README to include a bibtex for this purpose?

    opened by KeAWang 1
  • Applying LBFGS to multidimensional data (w.r.t. output variable)

    Hello,

    I tried to apply your LBFGS algorithm to a different test data set. This test data set comprises a two-dimensional input space and a two-dimensional output space. The data are generated as follows:

    # Generate Training Data
    n = 20
    train_x = torch.zeros(pow(n, 2), 2)
    for i in range(n):
        for j in range(n):
            # Each coordinate varies from 0 to 1 in n=100 steps
            train_x[i * n + j][0] = float(i) / (n-1)
            train_x[i * n + j][1] = float(j) / (n-1)
    
    train_y_1 = (torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01))/4
    train_y_2 = torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01)
    
    train_y = torch.stack([train_y_1, train_y_2], -1)
    
    test_x = torch.rand((n, len(train_x.shape)))
    test_y_1 = (torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01))/4
    test_y_2 = torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01)
    test_y = torch.stack([test_y_1, test_y_2], -1)
    
    train_x = train_x.unsqueeze(0).repeat(2, 1, 1)
    train_y = train_y.transpose(-2, -1)
    

    The remaining code remained almost the same as in your GPR example:

    class ExactGPModel(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood):
            super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ConstantMean()
            self.covar_module = gpytorch.kernels.MaternKernel(nu=1.5)
        def forward(self, x):
            mean_x = self.mean_module(x)
            covar_x = self.covar_module(x)
            return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
    
    likelihood = gpytorch.likelihoods.GaussianLikelihood(num_tasks=2)
    model = ExactGPModel(train_x, train_y, likelihood)
    model.train()
    likelihood.train()
    optimizer = FullBatchLBFGS(model.parameters())
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    
    def closure():
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        return loss
    
    loss = closure()
    loss.backward()
    
    training_iter = 10
    for i in range(training_iter):
        # perform step and update curvature
        options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
        loss, _, lr, _, F_eval, G_eval, _, _ = optimizer.step(options)
        print(
            'Iter %d/%d - Loss: %.3f - LR: %.3f - Func Evals: %0.0f - Grad Evals: %0.0f - Log-Lengthscale: %.3f - Log_Noise: %.3f' % (
                i + 1, training_iter, loss.item(), lr, F_eval, G_eval,
                model.covar_module.base_kernel.log_lengthscale.item(),
                model.likelihood.log_noise.item()
            ))
    

    Unfortunately I get a problem with the two-dimensional output:

    File "D:/Programmieren/convolutional_gaussian_process_regression/test_LBFGS.py", line 90, in loss.backward() File "D:\ProgramData\Anaconda3\lib\site-packages\torch\tensor.py", line 102, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph) File "D:\ProgramData\Anaconda3\lib\site-packages\torch\autograd_init_.py", line 84, in backward grad_tensors = make_grads(tensors, grad_tensors) File "D:\ProgramData\Anaconda3\lib\site-packages\torch\autograd_init.py", line 28, in _make_grads raise RuntimeError("grad can be implicitly created only for scalar outputs") RuntimeError: grad can be implicitly created only for scalar outputs

    I hope you can give me some feedback on how I can make your package applicable to multidimensional problems.

    THX Lazloo

    opened by Lazloo 1
  •  (yet) Another stochastic L-BFGS implementation for interest

    Hi Michael,

    I wanted to share a stochastic first order optimizer idea and implementation for L-BFGS, in case you might find it interesting or even practical in your research (congrats also on your PhD & new role at fb!)

    The idea behind the algorithm is pretty simple in brief:

    • when optimizing a stochastic empirical loss function, use bootstrap sampling over mini-batches of data to estimate the function value, the variance, and sampling bias
    • run a first-order solver using a line search that treats the variance estimates from the bootstrapped function/gradient evaluations as numerical uncertainties, such that the line search routine looks for step sizes that satisfy the Wolfe conditions within the uncertainty level

    I implemented the idea above in the following repository using tensorflow: https://github.com/mbhynes/boots

    I admittedly haven't run particularly thorough experiments on this, but I did use a small convnet as a toy problem, and it looks like the bootstrapped L-BFGS algorithm can have the same or better convergence than ADAM up to a point, measured by the decrease in training loss vs the # of function/gradient evaluations [i.e. data points accessed], e.g. here

    There's no formal convergence criterion implemented, and empirically it looks like the optimizer can get within a radius of a minimum, but then just start oscillating a bit without making substantive progress.

    The real practical downside to all this is that neither tensorflow nor pytorch are particularly efficient at evaluating per-example gradients... so unfortunately even if the trace above looks sorta neat, in reality you have to wait quite a while to get it from bootstrapping, compared to ADAM (which I've just taken as the ~best off-the-shelf algo) since the boots implementation above doesn't have an efficient way of evaluating the jacobian matrix. (It wouldn't be quite so bad in a map-reduce framework where per-example gradients have to be materialized mind you...)

    Anyway, hope it's of interest, and all the best!

    opened by mbhynes 0
  • 8-bit L-BFGS

    Hello @hjmshi ,

    Recently I stumbled upon the Bits&Bytes wrapper. Since then I can't help but wonder whether we could adapt this idea for second-order methods, namely 8-bit L-BFGS, i.e., L-BFGS using an 8-bit state instead of a 32-bit one.

    I was wondering what your initial thoughts on it are?

    opened by Demirrr 0
  • How to avoid "Line search failed"?

    How to avoid"Line search failed"?

    I'm trying FullBatchLBFGS with a Wolfe line search on a fitting task for a small dataset.

    1. Levenberg-Marquardt is giving me fairly good accuracy. Can I expect L-BFGS to provide similar accuracy to LM? Say, for a synthesized dataset, y = 1/x.
    2. I'm getting lots of "Line search failed; curvature pair update skipped". How could I avoid this?
    3. I noticed that the result is stochastic. Where does the randomness come from, since I'm using full-batch training (no shuffling)?

    Thanks!

    opened by duhd1993 3
  • RuntimeError: Expected object of type torch.FloatTensor but found type torch.DoubleTensor for argument #3 'other'

    The error occurs in functions/LBFGS.py, line 854. I think this error comes from t: it becomes a double-precision number as the optimization converges. I tried

    if(F_new > F_k + (c1*t*gtd).float()):
    

    or

    if(F_new.double() > F_k.double() + (c1*t*gtd)):
    

    However, if I do so, the loss will not decrease any more.

    opened by jihaonew 2
  • Trying to accelerate LBFGS

    Thanks for this implementation. Recently, I've been working on a package that uses machine learning for chemistry problems, where I use pytorch to train some models. I have been able to perform distributed training using a library called dask, which accelerated the training phase.

    When I use first-order optimization algorithms such as Adam, I can get up to 3 optimization steps per second (but those algorithms converge slowly compared to second-order ones). When using LBFGS, I get just 1 optimization step every 7 seconds for the same number of parameters. I am interested in using a dask client to make some parts of your LBFGS implementation work in a distributed manner so that each optimization step is faster. I started reading the code, and have a very brief idea of the LBFGS algorithm. However, I wondered if you could give me some hints about the parts of the module that could be computed independently and therefore distributed?

    I would appreciate your thoughts on this.

    opened by muammar 1