higher is a PyTorch library that lets users obtain higher-order gradients over losses spanning training loops rather than individual training steps.

Overview

higher is a library providing support for higher-order optimization, e.g. through unrolled first-order optimization loops, of "meta" aspects of these loops. It provides tools for turning existing torch.nn.Module instances "stateless", meaning that changes to their parameters can be tracked and gradients with respect to intermediate parameters can be taken. It also provides a suite of differentiable optimizers to facilitate the implementation of various meta-learning approaches.

Full documentation is available at https://higher.readthedocs.io/en/latest/.

Requirements and Installation

  • Python version >= 3.5
  • PyTorch version >= 1.3

To install higher from PyPI:

pip install higher

To install higher from source:

git clone git@github.com:facebookresearch/higher.git
cd higher
pip install .

Alternatively, python setup.py install will do the same thing.

Citation

If you use higher in your research and find it helpful, please consider citing the following paper:

@article{grefenstette2019generalized,
  title={Generalized Inner Loop Meta-Learning},
  author={Grefenstette, Edward and Amos, Brandon and Yarats, Denis and Htut, Phu Mon and Molchanov, Artem and Meier, Franziska and Kiela, Douwe and Cho, Kyunghyun and Chintala, Soumith},
  journal={arXiv preprint arXiv:1910.01727},
  year={2019}
}

Use case

Your needs

You have a model with parameters P, where P[t] denotes the parameters at update timestep t. You want to update the model through k steps of optimization, and compute gradients through the optimization process, i.e. compute torch.autograd.grad(P[k], P[0]) or obtain gradients that depend on this gradient pathway existing.
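
To make the target concrete, here is a minimal sketch in plain PyTorch (it does not use higher) of differentiating through k hand-written SGD steps on a single scalar parameter. The quadratic loss, the constants, and the variable names are assumptions chosen so the result can be checked by hand: each step multiplies the parameter by (1 - lr * a), so the gradient of P[k] with respect to P[0] should equal (1 - lr * a) ** k.

```python
import torch

lr, a, k = 0.1, 2.0, 3  # illustrative values, not taken from the library

p0 = torch.tensor(1.5, requires_grad=True)  # P[0]
p = p0
for _ in range(k):
    loss = 0.5 * a * p ** 2
    # create_graph=True keeps the update itself differentiable.
    (g,) = torch.autograd.grad(loss, p, create_graph=True)
    p = p - lr * g  # out-of-place update: P[t+1] becomes a new graph node

(dpk_dp0,) = torch.autograd.grad(p, p0)  # d P[k] / d P[0]
expected = (1 - lr * a) ** k
print(dpk_dp0.item(), expected)
```

This only works because the update is written by hand and out of place; the obstacles described next are what make it hard with existing modules and optimizers.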

Your obstacles

You are using some existing code for your model, so the parameters are stateful, preventing you from forming a graph with P[t] as nodes. Even if you roll your own solution, you want to use optimization techniques beyond normal SGD, and torch.optim optimizers don't let you optimize "through" them.

Your solution

Good news: higher has got you covered! Using our growing set of tools and utility functions, you can backpropagate through an unbounded number of model update steps for all your meta-learning needs. This library includes:

  • Helper functions for monkey-patching torch.nn modules to make them functional (non-stateful), i.e. feed their parameters as an extra argument during the forward pass.
  • Classes implementing differentiable versions of torch.optim.Adam (and SGD), designed to track or branch out from the state of a "normal" Adam instance.
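
As a rough illustration of what "functional" means in the first item, the sketch below writes the forward pass of a single linear layer so that the weights are supplied by the caller. It uses only plain PyTorch; `functional_linear` is a hypothetical helper written for this example and is not how higher implements its patching.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Linear(3, 2)
x = torch.randn(4, 3)

def functional_linear(x, params):
    """Forward pass of a linear layer with weights supplied by the caller."""
    weight, bias = params
    return F.linear(x, weight, bias)

# With the layer's own parameters, both forms agree.
params = list(layer.parameters())  # [weight, bias]
same = torch.allclose(layer(x), functional_linear(x, params))

# Updated ("fast") weights are ordinary non-leaf tensors, so the graph
# linking them to the original parameters is preserved.
loss = functional_linear(x, params).pow(2).mean()
grads = torch.autograd.grad(loss, params, create_graph=True)
fast_params = [p - 0.1 * g for p, g in zip(params, grads)]
out = functional_linear(x, fast_params)
```

Because `fast_params` never overwrite the layer's stored parameters, a later loss computed from `out` can still be differentiated with respect to the original weights.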

Example Usage

Say your training code looks like this:

model = MyModel()
opt = torch.optim.Adam(model.parameters())

for xs, ys in data:
    opt.zero_grad()
    logits = model(xs)
    loss = loss_function(logits, ys)
    loss.backward()
    opt.step()

To turn this into a differentiable version, the following changes should be introduced:

model = MyModel()
opt = torch.optim.Adam(model.parameters())

# When you want to branch from the current state of your model and unroll
# optimization, follow this example. This context manager gets a snapshot of the
# current version of the model and optimizer at the point where you want to
# start unrolling and create a functional version `fmodel` which executes the
# forward pass of `model` with implicit fast weights which can be read by doing
# `fmodel.parameters()`, and a differentiable optimizer `diffopt` which ensures
# that at each step, gradient of `fmodel.parameters()` with regard to initial
# fast weights `fmodel.parameters(time=0)` (or any other part of the unrolled
# model history) is defined.

with higher.innerloop_ctx(model, opt) as (fmodel, diffopt):
    for xs, ys in data:
        logits = fmodel(xs)  # modified `params` can also be passed as a kwarg
        loss = loss_function(logits, ys)  # no need to call loss.backward()
        diffopt.step(loss)  # note that `step` must take `loss` as an argument!
        # The line above gets P[t+1] from P[t] and loss[t]. `step` also returns
        # these new parameters, as an alternative to getting them from
        # `fmodel.fast_params` or `fmodel.parameters()` after calling
        # `diffopt.step`.

        # At this point, or at any point in the iteration, you can take the
        # gradient of `fmodel.parameters()` (or equivalently
        # `fmodel.fast_params`) w.r.t. `fmodel.parameters(time=0)` (equivalently
        # `fmodel.init_fast_params`). i.e. `fast_params` will always have
        # `grad_fn` as an attribute, and be part of the gradient tape.

    # At the end of your inner loop you can obtain these e.g. ...
    grad_of_grads = torch.autograd.grad(
        meta_loss_fn(fmodel.parameters()), fmodel.parameters(time=0))

Beware that when you unroll your optimization like this for k steps, all gradients and all activations of your model at each step are kept in memory, meaning the memory footprint of your model is k times greater.

Adding your own optimizers

It is possible to use optimizers other than those found in torch.optim. A differentiable version must be implemented first. This can be done by subclassing higher.optim.DifferentiableOptimizer and overriding the _update method, following the arguments of the original. Assuming the logic of the optimizer being added follows the logic of those found in torch.optim, the steps to follow are more or less:

  1. Remove the following code (no support for closures).
    loss = None
    if closure is not None:
        loss = closure()
    
  2. Replace
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data
    
    with
    zipped = zip(self.param_groups, grouped_grads)
    for group_idx, (group, grads) in enumerate(zipped):
        for p_idx, (p, g) in enumerate(zip(group['params'], grads)):
          if g is None:
              continue
    
  3. Replace state = self.state[p] with state = self.state[group_idx][p_idx].
  4. Replace any in-place op with a non in-place op, e.g. t.add_(a, x).mul_(y) should become t = t.add(a, x).mul(y) (note the assignment). Be careful to also track where dictionaries are being implicitly updated by such ops, e.g. if there is code of the form:
    p = state['k']
    ...
    p.add_(a, x)
    
    in the original optimizer, this code should be converted to
    p = state['k']
    ...
    state['k'] = p = p.add(a, x)
    
    to ensure the corresponding dictionary is updated.
  5. Except where used for shape inference, replace instances of t.data with t for all t.
  6. Be sure to update group['params'][p_idx] for each p_idx in need of update (those ignored will yield the original parameters in the fast weight collection). The latest fast weights will be returned by the inherited step function.
  7. Importantly, you need to register your new differentiable optimizer with higher using higher.register_optim to ensure that it is recognized as an option by the library's methods. You can do this at any point after the definition of an optimizer, and before any higher code involving that optimizer is called. For example, if you have implemented MyDiffOpt as a differentiable version of some optimizer MyOpt, register it by adding the line higher.register_optim(MyOpt, MyDiffOpt) after the classes are defined.
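
The in-place to out-of-place conversion in steps 2 to 6 can be sketched on its own, outside of any class. The function below is a hypothetical, stand-alone SGD-with-momentum update in plain PyTorch; it is not a higher.optim.DifferentiableOptimizer subclass, and its name, signature, and state layout are assumptions made for illustration. It uses the current t.add(x, alpha=a) spelling rather than the older t.add(a, x) overload shown above.

```python
import torch

def momentum_step(params, grads, state, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update written without in-place ops."""
    new_params = []
    for p_idx, (p, g) in enumerate(zip(params, grads)):
        if g is None:
            new_params.append(p)  # untouched parameters pass through
            continue
        buf = state.get(p_idx)
        # Out-of-place, and the state entry is reassigned explicitly.
        state[p_idx] = buf = g if buf is None else buf.mul(momentum).add(g)
        new_params.append(p.add(buf, alpha=-lr))
    return new_params

p0 = torch.tensor(1.0, requires_grad=True)
params, state = [p0], {}
for _ in range(2):
    loss = 0.5 * params[0] ** 2
    grads = torch.autograd.grad(loss, params, create_graph=True)
    params = momentum_step(params, grads, state)

# Two steps on this quadratic give p2 = (1 - 2*lr - lr*momentum + lr**2) * p0.
(dp2_dp0,) = torch.autograd.grad(params[0], p0)
```

Had the momentum buffer been updated with mul_ and add_, the values would be the same but the dependence of the second step on the first would not be recorded in a form autograd can differentiate through.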

You can find examples of how to test for gradient correctness using finite difference methods in tests/test_optim.py. Please note that some stability tricks may be needed to avoid NaNs in the gradients. See the higher.optim.DifferentiableAdam implementation for examples of mitigation strategies: identify operations that yield exploding gradients, typically those taking the square roots of moving averages (which are initially zero), and register a backward hook using x.register_hook on the inputs x to those functions, using the helper function _get_mask_closure from higher.optim.
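
The failure mode and the hook-based mitigation can be reproduced in a few lines of plain PyTorch. The sketch below does not use _get_mask_closure, whose exact signature is not shown here; `sqrt_with_masked_grad` is a hypothetical stand-in that zeroes the incoming gradient wherever the input to the square root is exactly zero.

```python
import torch

def sqrt_with_masked_grad(v):
    """sqrt(v) whose backward pass ignores entries where v is exactly zero."""
    mask = v.detach() == 0
    if v.requires_grad:
        # The hook rewrites the gradient arriving at v during backward.
        v.register_hook(lambda grad: grad.masked_fill(mask, 0.0))
    return v.sqrt()

g = torch.tensor([0.0, 2.0], requires_grad=True)

# Plain sqrt: d sqrt(v)/dv is inf at v = 0, and inf * 0 gives nan upstream.
(plain,) = torch.autograd.grad((g * g).sqrt().sum(), g)

# Same computation with the hook in place.
(masked,) = torch.autograd.grad(sqrt_with_masked_grad(g * g).sum(), g)
print(plain, masked)
```

Whether zero is the right replacement value depends on the optimizer; the point of the sketch is only that the hook intercepts the gradient before the infinite factor can propagate.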

Related Projects

The following papers and codebases reference or directly use higher:

Is yours missing? Raise an issue or add it via a pull request!

Release Notes

See the changelog for release notes.

Known/Possible Issues

  • See the issues tracker for an up-to-date list.
  • No support (or planned support) for torch.nn.DataParallel at this time. This would require a rewrite of DataParallel. Please raise an issue on the pytorch issue tracker if this matters to you.
  • Some of the adaptive gradient-style differentiable optimizers may be unstable and yield NaNs when taking higher-order gradients. Some tricks have been used to mitigate this risk. Please raise an issue if these are not sufficient in practice.
  • Second-order gradients may not work with some CUDNN modules (mostly RNNs). From PyTorch v1.3 onwards, wrapping the code where models are used with higher using the following context manager should solve the issue:
with torch.backends.cudnn.flags(enabled=False):
    # Your meta-learning code here...

License

higher is released under Apache License Version 2.0.

Thanks

Thanks to Adam Paszke whose gist was the source of inspiration (and starting point) for our method for monkey patching arbitrary torch.nn modules.

Thanks to the many interns, researchers, and engineers who helped road-test early versions of this library.

Comments
  • example of trainable optimizer?

    I suggest a full example that implements the optimizer, but a trainable step size could be a good example too...

    https://discuss.pytorch.org/t/implement-a-meta-trainable-step-size/70396
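
For readers wondering what a meta-trainable step size involves, here is a minimal sketch in plain PyTorch, without higher. The scalar model, the quadratic losses, and the constants are assumptions picked so the numbers can be verified by hand; it is not an example taken from the library.

```python
import torch

a = 2.0
p0 = torch.tensor(1.0, requires_grad=True)   # model parameter
lr = torch.tensor(0.1, requires_grad=True)   # step size treated as a meta-parameter
meta_opt = torch.optim.SGD([lr], lr=0.01)

# Inner step: one differentiable SGD update that depends on lr.
inner_loss = 0.5 * a * p0 ** 2
(g,) = torch.autograd.grad(inner_loss, p0, create_graph=True)
p1 = p0 - lr * g

# Outer step: the loss after the update is differentiated w.r.t. lr.
outer_loss = 0.5 * p1 ** 2
meta_opt.zero_grad()
outer_loss.backward()
lr_grad = lr.grad.clone()  # equals p1 * (-g) for this toy problem
meta_opt.step()
```

Here the gradient with respect to the step size is negative, so the meta-optimizer increases it; a real example would repeat this over several inner steps and batches.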

    help wanted good first issue 
    opened by renesax14 43
  • Higher leak memory with track_higher_grad = off.

    If I understand correctly, with track_higher_grad = off, training with higher should be exactly the same as training without it (automatic differentiation-wise), except that the imperative update on a tensor is replaced with creating a new tensor and moving the reference to it. However, I found that it still uses more and more memory, especially on diffopt.step(loss). I think the problem is that PyTorch can't distinguish between a normal tensor operation and a weight update, as they are all functional operations on tensors. Hence, when you do a functional update, additional graph is still created, forcing the old weights to stay in memory. More specifically, maybe you should call detach() before https://github.com/facebookresearch/higher/blob/master/higher/optim.py#L249 ?

    Below is a gist that reproduces the memory leak: https://gist.github.com/MarisaKirisame/e4a48617dbe25eee94f08ab9f7c49d99

    opened by MarisaKirisame 21
  • Question about second-order gradients for GRU

    Hi, I was trying to use the package for obtaining second-order gradients through the optimization process of a model with GRU units each followed by a linear layer. However, when I check torch.autograd.grad(loss, learner_fmodel.parameters(time=0)) I get the error RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.. With allow_unused=True, I see that the gradients with respect to the GRU parameters are None whereas the gradient with respect to the linear layer has values. I was wondering if this is indeed supported for GRU?

    opened by Nithin-Holla 14
  • Improving documentation for copy_initial_weights

    I suggest improving the English used for the documentation in the following:

    copy_initial_weights – if true, the weights of the patched module are copied to form the initial weights of the patched module, and thus are not part of the gradient tape when unrolling the patched module. If this is set to False, the actual module weights will be the initial weights of the patched module. This is useful when doing MAML, for example.

    For example, "the weights of the patched module are copied to form the initial weights of the patched module" doesn't make sense to me because when the context manager is initiated a patched module does not exist yet. So it is unclear what we are copying from and to where (and why copying is something we want to do).

    Also, "unrolling the patched module" does not make sense to me. We usually unroll a computation graph caused by a for loop. A patched module is just a neural net that has been modified by this library. Unrolling is ambiguous.

    Also, there isn't a technical definition for "gradient tape".

    Also, when describing what false is, saying that it's useful for MAML isn't actually useful because it doesn't even hint why it's useful for MAML.

    Overall, it's impossible to use the context manager because it's unclear what that flag is supposed to be doing (and it seems critical, which makes it weird that it has default values).


    Related:

    • gitissue: https://github.com/facebookresearch/higher/issues/30
    • SO: https://stackoverflow.com/questions/60311183/what-does-the-copy-initial-weights-documentation-mean-in-the-higher-library-for
    • What does copy_initial_weights do in the higher library? https://discuss.pytorch.org/t/what-does-copy-initial-weights-do-in-the-higher-library/70384
    • Why does MAML need copy_initial_weights=False? https://discuss.pytorch.org/t/why-does-maml-need-copy-initial-weights-false/70387
    question 
    opened by renesax14 13
  • Memory leak when performing inner loop on a copy

    When running the following script, memory usage seems to continually increase (not all iterations, but many of them in the beginning). Explicitly decrementing the reference counts for the functional model/differentiable optimizer seems to help, but not completely fix the issue. Script for reproducing attached. Monitoring GPU memory usage with watch -n 0.1 nvidia-smi. Running PyTorch version 1.3.0, torchvision 0.4.1, higher version 0.1.3@59537fa. Let me know if any other information would be helpful.

    Code to reproduce the issue:

    import torch, torchvision
    import higher
    import time
    from copy import deepcopy
    
    
    print(torch.__version__, torchvision.__version__)
    
    inner_loop_copy = True # Setting this to False gives no memory usage increase
    
    device = torch.device('cuda:0')
    model = torchvision.models.resnet18().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-5)
    
    for idx in range(100):
        print(idx)
        if inner_loop_copy:
            model_ = deepcopy(model)
            opt_ = torch.optim.SGD(model_.parameters(), lr=1e-5)
        else:
            model_ = model
            opt_ = opt
    
        with higher.innerloop_ctx(model_, opt_) as (fm, do):
            pass
    
        # del fm, do # Uncommenting this helps, but doesn't completely eliminate the memory usage increase
    
    bug help wanted 
    opened by eric-mitchell 9
  • Computational graph not retained for BERT

    I'm trying to implement first-order version of ProtoMAML (https://arxiv.org/pdf/1903.03096.pdf) for a sequence labelling task. If I use BERT as encoder, I run into this error at the line diffopt.step: RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.. Instead, if I use an LSTM as an encoder, then it runs successfully. Perhaps the graph is purged somehow since BERT is a large model?

    Here is a self-contained code to replicate the issue. The issue occurs on both CPU and GPU. On line 117, you can specify the encoder as bert or lstm. It requires the transformers library from HuggingFace to run.

    import higher
    import torch
    
    from torch import nn, optim
    from transformers import BertModel, BertTokenizer
    from torch.nn import functional as F
    
    
    class BaseModel(nn.Module):
    
        def __init__(self, encoder, max_length, device):
            super(BaseModel, self).__init__()
            self.max_length = max_length
            self.device = device
            if encoder == 'bert':
                self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
                self.encoder = BertModel.from_pretrained('bert-base-uncased')
                self.encoder.pooler.dense.weight.requires_grad = False
                self.encoder.pooler.dense.bias.requires_grad = False
            elif encoder == 'lstm':
                self.encoder = nn.LSTM(batch_first=True, input_size=32, hidden_size=768)
            self.linear = nn.Linear(768, 192)
            self.to(self.device)
    
        def encode_text(self, text):
            if isinstance(self.encoder, BertModel):
                encode_result = self.tokenizer.batch_encode_plus(text, return_token_type_ids=False, max_length=self.max_length,
                                                                 pad_to_max_length=True, return_tensors='pt')
                for key in encode_result:
                    encode_result[key] = encode_result[key].to(self.device)
                return encode_result
            elif isinstance(self.encoder, nn.LSTM):
                return torch.randn((len(text), 32, 32), device=self.device)
    
        def forward(self, inputs):
            if isinstance(self.encoder, BertModel):
                out, _ = self.encoder(inputs['input_ids'], attention_mask=inputs['attention_mask'])
            elif isinstance(self.encoder, nn.LSTM):
                out, _ = self.encoder(inputs)
            out = out[:, 1:-1, :]
            out = self.linear(out)
            return out
    
    
    class ProtoMAML:
    
        def __init__(self, device, encoder):
            self.output_layer_weight = None
            self.output_layer_bias = None
            self.learner = BaseModel(encoder=encoder, max_length=32, device=device)
            self.inner_optimizer = optim.SGD([p for p in self.learner.parameters() if p.requires_grad], lr=0.001)
            self.loss_fn = nn.CrossEntropyLoss()
            self.output_lr = 0.001
            self.device = device
            self.updates = 5
    
        def output_layer(self, input, weight, bias):
            return F.linear(input, self.output_layer_weight + weight, self.output_layer_bias + bias)
    
        def initialize_with_proto_weights(self, support_repr, support_label, n_classes):
            prototypes = self.build_prototypes(support_repr, support_label, n_classes)
            weight = 2 * prototypes
            bias = -torch.norm(prototypes, dim=1) ** 2
            self.output_layer_weight = torch.zeros_like(weight, requires_grad=True)
            self.output_layer_bias = torch.zeros_like(bias, requires_grad=True)
            return weight, bias
    
        def build_prototypes(self, data_repr, data_label, num_outputs):
            n_dim = data_repr.shape[2]
            data_repr = data_repr.view(-1, n_dim)
            data_label = data_label.view(-1)
    
            prototypes = torch.zeros((num_outputs, n_dim), device=self.device)
    
            for c in range(num_outputs):
                idx = torch.nonzero(data_label == c).view(-1)
                if idx.nelement() != 0:
                    prototypes[c] = torch.mean(data_repr[idx], dim=0)
    
            return prototypes
    
        def initialize_output_layer(self, n_classes):
            self.output_layer_weight = torch.randn((n_classes, 768), requires_grad=True)
            self.output_layer_bias = torch.randn(n_classes, requires_grad=True)
    
        def train(self, support_text, labels, n_classes, n_iter):
    
            for itr in range(n_iter):
                print('Iteration ', itr)
    
                self.learner.zero_grad()
    
                self.initialize_output_layer(n_classes)
                x = self.learner.encode_text(support_text)
                y = labels.to(device)
                output_repr = self.learner(x)
                init_weights, init_bias = self.initialize_with_proto_weights(output_repr, y, n_classes)
    
                with higher.innerloop_ctx(self.learner, self.inner_optimizer,
                                          copy_initial_weights=False,
                                          track_higher_grads=False) as (flearner, diffopt):
    
                    for i in range(self.updates):
                        output = flearner(x)
                        output = self.output_layer(output, init_weights, init_bias)
                        output = output.view(output.size()[0] * output.size()[1], -1)
                        loss = self.loss_fn(output, y)
                        output_weight_grad, output_bias_grad = torch.autograd.grad(loss, [self.output_layer_weight, self.output_layer_bias],
                                                                                   retain_graph=True)
                        self.output_layer_weight = self.output_layer_weight - self.output_lr * output_weight_grad
                        self.output_layer_bias = self.output_layer_bias - self.output_lr * output_bias_grad
                        diffopt.step(loss)
    
    
    if __name__ == '__main__':
        
        encoder = 'bert'  # or 'lstm'
    
        support_text = [['This is a support text']] * 64
        labels = torch.randint(0, 10, (64 * 30, ))
        n_classes = 10
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        model = ProtoMAML(device=device, encoder=encoder)
        model.train(support_text, labels, n_classes, n_iter=10)
    
    opened by Nithin-Holla 8
  • First Order MAML?

    Hi, Thanks for this very useful piece of software!

    I was wondering if there's an easy way to implement first-order MAML with higher. I currently have an implementation of (full) MAML with higher, but wanted to compare it with just first order MAML.

    EDIT: It appears that torch.autograd.grad(query_loss, fmodel.parameters()) would give the gradients corresponding to FOMAML and torch.autograd.grad(query_loss, fmodel.parameters(time=0)) can be used to get MAML gradient?
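
The distinction raised in the edit can be checked on a toy problem in plain PyTorch. The sketch below does not use higher, so it does not confirm what fmodel.parameters() and fmodel.parameters(time=0) return; it only shows the underlying calculus for one inner step on a scalar, with constants and losses assumed for illustration.

```python
import torch

lr, a, c = 0.1, 2.0, 1.0  # illustrative constants

p0 = torch.tensor(3.0, requires_grad=True)
support_loss = 0.5 * a * p0 ** 2
(g,) = torch.autograd.grad(support_loss, p0, create_graph=True)
p1 = p0 - lr * g                      # adapted parameter
query_loss = 0.5 * (p1 - c) ** 2

# Gradient at the adapted parameter: ignores how p1 depends on p0.
(first_order,) = torch.autograd.grad(query_loss, p1, retain_graph=True)
# Gradient at the initial parameter: includes the factor d p1 / d p0.
(full,) = torch.autograd.grad(query_loss, p0)
print(first_order.item(), full.item())
```

The two values differ by exactly the factor (1 - lr * a), which is the second-order term that a first-order method drops.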

    opened by MurtyShikhar 8
  • Why not accumulate loss and then take derivative in MAML?

    Why do you not do this:

    def inner_loop2():
        n_inner_iter = 5
        inner_opt = torch.optim.SGD(net.parameters(), lr=1e-1)
    
        qry_losses = []
        qry_accs = []
        meta_opt.zero_grad()
        meta_loss = 0
        for i in range(task_num):
            with higher.innerloop_ctx(
                net, inner_opt, copy_initial_weights=False
            ) as (fnet, diffopt):
                # Optimize the likelihood of the support set by taking
                # gradient steps w.r.t. the model's parameters.
                # This adapts the model's meta-parameters to the task.
                # higher is able to automatically keep copies of
                # your network's parameters as they are being updated.
                for _ in range(n_inner_iter):
                    spt_logits = fnet(x_spt[i])
                    spt_loss = F.cross_entropy(spt_logits, y_spt[i])
                    diffopt.step(spt_loss)
    
                # The final set of adapted parameters will induce some
                # final loss and accuracy on the query dataset.
                # These will be used to update the model's meta-parameters.
                qry_logits = fnet(x_qry[i])
                qry_loss = F.cross_entropy(qry_logits, y_qry[i])
                qry_losses.append(qry_loss.detach())
                qry_acc = (qry_logits.argmax(
                    dim=1) == y_qry[i]).sum().item() / querysz
                qry_accs.append(qry_acc)
    
                # Update the model's meta-parameters to optimize the query
                # losses across all of the tasks sampled in this batch.
                # This unrolls through the gradient steps.
                #qry_loss.backward()
                meta_loss += qry_loss
    
        qry_losses = sum(qry_losses) / task_num
        qry_losses.backward()
        meta_opt.step()
        qry_accs = 100. * sum(qry_accs) / task_num
        i = epoch + float(batch_idx) / n_train_iter
        iter_time = time.time() - start_time
    

    instead of what you have:

    def inner_loop1():
        n_inner_iter = 5
        inner_opt = torch.optim.SGD(net.parameters(), lr=1e-1)
    
        qry_losses = []
        qry_accs = []
        meta_opt.zero_grad()
        for i in range(task_num):
            with higher.innerloop_ctx(
                net, inner_opt, copy_initial_weights=False
            ) as (fnet, diffopt):
                # Optimize the likelihood of the support set by taking
                # gradient steps w.r.t. the model's parameters.
                # This adapts the model's meta-parameters to the task.
                # higher is able to automatically keep copies of
                # your network's parameters as they are being updated.
                for _ in range(n_inner_iter):
                    spt_logits = fnet(x_spt[i])
                    spt_loss = F.cross_entropy(spt_logits, y_spt[i])
                    diffopt.step(spt_loss)
    
                # The final set of adapted parameters will induce some
                # final loss and accuracy on the query dataset.
                # These will be used to update the model's meta-parameters.
                qry_logits = fnet(x_qry[i])
                qry_loss = F.cross_entropy(qry_logits, y_qry[i])
                qry_losses.append(qry_loss.detach())
                qry_acc = (qry_logits.argmax(
                    dim=1) == y_qry[i]).sum().item() / querysz
                qry_accs.append(qry_acc)
    
                # Update the model's meta-parameters to optimize the query
                # losses across all of the tasks sampled in this batch.
                # This unrolls through the gradient steps.
                qry_loss.backward()
    
        meta_opt.step()
        qry_losses = sum(qry_losses) / task_num
        qry_accs = 100. * sum(qry_accs) / task_num
        i = epoch + float(batch_idx) / n_train_iter
        iter_time = time.time() - start_time
    

    https://github.com/facebookresearch/higher/blob/e45c1a059e39a16fa016d37bc15397824c65547c/examples/maml-omniglot.py#L130


    https://stackoverflow.com/questions/62394411/why-not-accumulate-query-loss-and-then-take-derivative-in-maml-with-pytorch-and
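
As a small check of the arithmetic behind this question, the sketch below compares the two patterns on a toy problem in plain PyTorch. The loss function and data are assumptions made for illustration. Calling backward() once per task accumulates into .grad, which gives the same result as summing the non-detached losses and calling backward() once; dividing the sum by the number of tasks would only rescale the gradient.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
tasks = [
    torch.tensor([0.5, 1.0]),
    torch.tensor([2.0, 3.0]),
    torch.tensor([-1.0, 0.25]),
]

def task_loss(w, x):
    return (w * x).sum() ** 2

# Variant 1: backward per task; gradients accumulate in w.grad.
for x in tasks:
    task_loss(w, x).backward()
per_task = w.grad.clone()

# Variant 2: sum the losses, then a single backward.
w.grad = None
total = sum(task_loss(w, x) for x in tasks)
total.backward()
summed = w.grad.clone()
```

The practical difference is memory: a per-task backward() lets each task's graph be freed before the next task is unrolled, whereas the summed variant keeps every graph alive until the final call. Note also that the sum has to be taken over losses that still carry a graph, since detached losses cannot be backpropagated.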

    question 
    opened by renesax14 8
  • _cudnn_rnn_backward is not implemented

    Hi, I have the following error in my code, even though I am using torch 1.3: RuntimeError: derivative for _cudnn_rnn_backward is not implemented. I know that it is a PyTorch-related error. I am wondering in which version of PyTorch it has been resolved!

    Solved by using the following code: with torch.backends.cudnn.flags(enabled=False):

    opened by nooralahzadeh 8
  • Question about a module that does not require grad

    Thanks for the wonderful library!

    I have one question about how to use higher.

    I want to train the network with second order derivative,

    but some parts of network (such as nn.Embedding ) are frozen.

    something like this

    net = Net()
    param = filter(lambda x: x.requires_grad, net.parameters())
    inner_opt = torch.optim.SGD(param, lr=1e-1)
    
    with higher.innerloop_ctx(net, inner_opt, copy_initial_weights=False) as (fnet, diffopt):
      logits = net(batch_x)
      loss = loss_fnt(logits, batch_y)
      diffopt.step(loss)
                  
    

    but I got the error message "RuntimeError: One of the differentiated Tensors does not require grad"

    bug 
    opened by seanie12 7
  • Relationship between the weights of a model and the weights of its functional version

    I am trying to implement the Meta Pseudo Labels paper (https://arxiv.org/pdf/2003.10580v2.pdf), which in summary trains a student model (modelB) with a teacher model (modelA). Model A in turn tracks the student's progress by evaluating it on a validation set and then updates itself through a second-order gradient update all the way back. So something like this using higher:

    modelA_optim = optim.Adam(modelA.parameters(), lr=0.01)
    modelB_optim = optim.Adam(modelB.parameters(), lr=0.01)
    
    for n in range(n_iter):
        with higher.innerloop_ctx(modelB, modelB_optim, copy_initial_weights=False) as (fmodel, diffopt):
            pseudo_labels = modelA(X_train)
    
            # Train model B on pseudo labels
            logits = fmodel(X_train)
            loss1 = CrossEntropy(logits, pseudo_labels)
            diffopt.step(loss1)
    
            # Update model A with model B's loss on a validation set
            logits_b = fmodel(X_val)
            loss2 = CrossEntropy(logits_b, y_val)
    
            # Also update model A with real labels
            logits_a = modelA(X_val)
            loss3 = CrossEntropy(logits_a, y_val)
            lossA = loss2+ loss3
            lossA.backward()
    
        modelA_optim.step()
        new_params = dict(fmodel.named_parameters())
        for name, params in modelB.named_parameters():
            params.data.copy_(new_params[name])
    

    The last three lines of the code are where I have a problem. Because I want the fmodel's final values on, say, iteration 0 to be the starting point of the new fmodel at iteration 1, I explicitly copy the param values from fmodel to modelB at the end of each iteration. Although this runs without errors, when evaluated on a test set, the performance of this modelB is no better than random.

    Does anyone know what I am doing wrong in here?

    question 
    opened by pratikgujjar 6
  • grad clip correctness

    Hello, I'm using higher for bilevel optimization and need to do gradient clipping in the lower level problem. I know that this can be done with a callback on the step() method, but when I take the gradient with respect to the lower level optimization, the computed gradient doesn't match with the finite difference results. I double checked that the higher version of optimizer gives the same optimization trajectory as their vanilla versions, so the callback gradient clip should be correct. In addition, the implemented finite difference check passes when gradient clipping is disabled.

    Here is a fairly minimal example: bilevel_example_checkgrad.zip, which is adapted from https://github.com/vis-opt-group/IAPTT-GM/blob/main/experiment/Numerical.py. Can someone see if this is correct? Many thanks!

    opened by Whalefishin 0
  • Using higher for hyperparameter optimization

    Hi,

    I believe that higher could be used for hyperparameter optimization using bilevel programming. I have attempted to adapt the given meta-learning example for bilevel programming. However, I am somewhat unsure as to whether I have done it correctly. Here is a general structure of what I have done:

    # Get optimizers
    inner_optim = torch.optim.Adam(params=model.parameters(), lr=args.learning_rate)
    outer_optim = torch.optim.Adam(params=hp.parameters(), lr=args.learning_rate)
    
    # Training loop
    num_inner_iter = args.inner_loop
    for epoch in range(args.epochs):
        outer_optim.zero_grad()
        with higher.innerloop_ctx(
            model=model,
            opt=inner_optim,
            copy_initial_weights=False,
            track_higher_grads=False,
        ) as (fmodel, diffopt):
            for _ in range(num_inner_iter):
                # Forward pass
                train_out = fmodel(transformed_features, hp)
                train_loss = custom_loss(predicted=train_out, actual=train_labels)
                diffopt.step(train_loss)
    
            val_out = fmodel(transformed_features_val, hp)
            val_loss = custom_loss(predicted=val_out, actual=val_labels)
            val_loss.backward()
        outer_optim.step()
    

    Does the above look correct? Or am I misunderstanding something?
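
For readers comparing against this structure: a hyperparameter gradient obtained by bilevel optimization has to flow through the inner updates, and in higher that path is only recorded when track_higher_grads is left at its default of True (False is intended for test-time use). The sketch below is a pure-Python illustration on a hypothetical scalar problem, not the code from this issue; the losses, `lam`, and the helper name `unrolled_hypergradient` are assumptions made for the example. It differentiates a validation loss with respect to a regularization strength through five unrolled SGD steps and compares the result with a finite-difference estimate.

```python
def unrolled_hypergradient(lam, w0=0.0, lr=0.1, steps=5, a=2.0, b=1.0):
    """Return (val_loss, d val_loss / d lam) after `steps` unrolled SGD steps.

    Inner objective: train_loss(w, lam) = 0.5 * (w - a)**2 + 0.5 * lam * w**2
    Outer objective: val_loss(w)        = 0.5 * (w - b)**2
    """
    w, dw_dlam = w0, 0.0
    for _ in range(steps):
        grad_w = (w - a) + lam * w
        # Differentiate the update w - lr * grad_w with respect to lam,
        # using the values of w and dw_dlam from before the update.
        dw_dlam = dw_dlam - lr * ((1.0 + lam) * dw_dlam + w)
        w = w - lr * grad_w
    val_loss = 0.5 * (w - b) ** 2
    # val_loss depends on lam only through the unrolled inner updates.
    return val_loss, (w - b) * dw_dlam


lam = 0.3
eps = 1e-6
_, hypergrad = unrolled_hypergradient(lam)
fd = (unrolled_hypergradient(lam + eps)[0]
      - unrolled_hypergradient(lam - eps)[0]) / (2 * eps)
print(hypergrad, fd)
```

If the dependence of the final weights on the hyperparameter were not tracked (the analogue of dropping `dw_dlam` here), the outer gradient would only see whatever direct dependence the validation loss has on the hyperparameter.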

    opened by aruniyer 1
  • Non scalar loss

    Hi! I'm training a network with two separate heads (something like Hydranet). How should I deal with non-scalar losses? With the standard PyTorch backward pass, I just feed backward() with grad_tensors, which the documentation describes as:

    The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of corresponding tensors.

    loss_seq = [loss_head_1, loss_head_2]
    grad_seq = [torch.tensor(1.0).cuda(device) for _ in range(len(loss_seq))]
    torch.autograd.backward(loss_seq, grad_seq)
    

    Is it possible to handle this scenario with higher? What should I pass to the diffopt.step()? Is it enough to invoke diffopt.step(loss_seq)?

    Thanks for help in advance!

    opened by janglinko-dac 1
  • Initialize Differentiable Optimizer with non-leaf tensor

    Hello, I used higher a while ago and, if I remember correctly, you could create a differentiable optimizer starting from a normal one. Now I need to optimize non-leaf tensors (the weights of my model g are generated by another model f). The problem is that, apparently, I cannot optimize them because they are not leaf tensors.

    Technically I could generate new leaf tensors starting from them but then I wouldn't be able to backpropagate back to the model f that generated them.

    Does anyone have a solution? Thanks

    opened by andrearosasco 1
  • higher for dpt architectures?

    Hi,

    Thanks for the great library! Does higher support a dpt-based module passed to the higher.innerloop_ctx? I'm getting the following error:

      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/contextlib.py", line 113, in __enter__
        return next(self.gen)
      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/site-packages/higher/__init__.py", line 85, in innerloop_ctx
        fmodel = monkeypatch(
      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/site-packages/higher/patch.py", line 542, in monkeypatch
        fmodule = make_functional(module, encapsulator=encapsulator)
      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/site-packages/higher/patch.py", line 435, in make_functional
        _, fmodule, MonkeyPatched = _make_functional(module, params_box, 0)
      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/site-packages/higher/patch.py", line 348, in _make_functional
        child_params_offset, fchild, _ = _make_functional(
      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/site-packages/higher/patch.py", line 218, in _make_functional
        class MonkeyPatched(_ModuleType, _MonkeyPatchBase):  # type: ignore
      File "/home/rbachman/miniconda/envs/py38/lib/python3.8/abc.py", line 85, in __new__
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
    TypeError: Cannot create a consistent method resolution
    order (MRO) for bases Module, _MonkeyPatchBase
    

    Thank you for your response!
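
The message "for bases Module, _MonkeyPatchBase" suggests, as an assumption not confirmed in this thread, that one of the submodules is a bare Module instance rather than an instance of a subclass, and that the patch base class itself derives from Module. Python cannot linearize a class whose base list names a class before one of its own subclasses. The sketch below reproduces the same TypeError in plain Python with hypothetical stand-in classes; none of these names belong to higher or PyTorch.

```python
class Module:
    """Stand-in for a framework base class (hypothetical, for illustration)."""


class PatchBase(Module):
    """Stand-in for a patch base class that itself derives from Module."""


class Custom(Module):
    """Stand-in for an ordinary user-defined subclass of Module."""


def try_patch(module_type):
    """Build a class with bases (module_type, PatchBase); return it or the error."""
    try:
        return type("MonkeyPatched", (module_type, PatchBase), {})
    except TypeError as exc:
        return exc


# A proper subclass works: MRO is MonkeyPatched, Custom, PatchBase, Module, object.
ok = try_patch(Custom)

# The bare base class fails: Module is listed before its own subclass PatchBase.
err = try_patch(Module)

print(ok.__mro__)
print(repr(err))
```

Under that assumption, a place to look is any container in the model that is constructed directly from the bare base class instead of from a named subclass.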

    opened by Ainaz99 2
  • Support learning rate optimization in the outer loop

    This pull request allows optimizing the learning rate used in the inner loop with higher. To do this, we avoid copying all the elements of param_groups of the optimizer instance passed to DifferentiableOptimizer. See also an example here.

    CLA Signed 
    opened by MichaelKonobeev 3
Owner
Facebook Research