[ICLR 2021] Is Attention Better Than Matrix Decomposition?

Gsunshine

Last update: Dec 29, 2022

Overview

Enjoy-Hamburger 🍔

Official implementation of Hamburger, Is Attention Better Than Matrix Decomposition? (ICLR 2021)

Under construction.

Introduction

This repo provides the official implementation of Hamburger for further research. We sincerely hope that this paper offers some inspiration about the attention mechanism, especially how low-rankness and optimization-driven methods can help model the so-called global information in deep learning.

We formulate global context modeling as a low-rank completion problem and show that the optimization algorithms for solving it can guide the design of global information blocks. The paper then proposes a series of Hamburgers, which use optimization algorithms for matrix decompositions (MDs) to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs perform favorably against self-attention, the popular global context module, provided that the gradients back-propagated through the MDs are handled carefully.
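To make the "factorize, then reconstruct a low-rank embedding" idea concrete, here is a minimal NumPy sketch using non-negative matrix factorization (NMF) with multiplicative updates. It is an illustration under assumptions, not the repository's code: the function name `nmf_reconstruct`, the rank, the step count, and the toy data are all hypothetical, and the linear layers and skip connection that surround the decomposition in a full Hamburger block are omitted.

```python
import numpy as np

def nmf_reconstruct(x, rank=8, steps=6, eps=1e-6, seed=0):
    """Approximate a non-negative matrix x (d x n) by a product D @ C of rank <= `rank`.

    Hypothetical helper for illustration only. D holds the bases and C the
    codes; both stay non-negative under the multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    d, n = x.shape
    D = rng.random((d, rank))   # bases (dictionary)
    C = rng.random((rank, n))   # codes (coefficients)
    for _ in range(steps):
        # eps guards against division by zero in both updates.
        C *= (D.T @ x) / (D.T @ D @ C + eps)
        D *= (x @ C.T) / (D @ C @ C.T + eps)
    return D @ C

# Toy input: a non-negative rank-4 matrix plus a little non-negative noise.
rng = np.random.default_rng(1)
clean = rng.random((32, 4)) @ rng.random((4, 100))
noisy = clean + 0.01 * rng.random((32, 100))

x_hat = nmf_reconstruct(noisy, rank=4, steps=200)
rel_err = np.linalg.norm(noisy - x_hat) / np.linalg.norm(noisy)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The reconstruction `x_hat` has rank at most 4 by construction, so whatever the four bases cannot express is discarded. NMF requires a non-negative input, so in a network the features would need to be made non-negative (for example with a ReLU) before this step.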

We are working on some exciting topics. Please wait for our new papers!

Enjoy Hamburger, please!

Organization

This section introduces the organization of this repo.

We strongly recommend reading the blog (coming soon) as a supplement to the paper!

blog.
- Some random thoughts about Hamburger and beyond.
- Possible directions based on Hamburger.
- FAQ.
seg.
- We provide the PyTorch implementation of Hamburger (V1) in the paper and an enhanced version (V2) flavored with Cheese. Some experimental features are included in V2+.
- We release the codebase for systematic research on the PASCAL VOC dataset, including the two-stage training on the trainaug and trainval datasets and the MSFlip test.
- We offer three checkpoints of HamNet: one scores 85.90+ (with the test server link), while the other two score 85.80+ (with the test server link 1 and link 2). You can reproduce the test results using the checkpoints combined with the MSFlip test code.
- Statistics about HamNet that might ease further research.
gan.
- Official implementation of Hamburger in TensorFlow.
- Data preprocessing code for using ImageNet in tensorflow-datasets. (Possibly useful if you hope to run the JAX code of BYOL or other ImageNet training code with the Cloud TPUs.)
- Training and evaluation protocol of HamGAN on the ImageNet.
- Checkpoints of HamGAN-strong and HamGAN-baby.

TODO:

README doc for HamGAN.
PyTorch Hamburger with less encapsulation.
Suggestions for using and further developing Hamburger.
Blog in both English and Chinese.
~~We also consider adding a collection of popular context modules to this repo.~~ It depends on the time. No Guarantee. Perhaps GuGu 🕊️ (which means standing someone up).

Citation

If you find our work interesting or helpful to your research, please consider citing Hamburger. :)

@inproceedings{
    ham,
    title={Is Attention Better Than Matrix Decomposition?},
    author={Zhengyang Geng and Meng-Hao Guo and Hongxu Chen and Xia Li and Ke Wei and Zhouchen Lin},
    booktitle={International Conference on Learning Representations},
    year={2021},
}

Contact

Feel free to contact me if you have additional questions or are interested in collaborating. Please drop me an email at [email protected]. Find me on Twitter. Thank you!

Responses to recent emails may be delayed until March 26th due to the ICLR deadlines. I am sorry about that, but people are always deadline-driven. QAQ

Acknowledgments

Our research is supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). The TFRC program has been a nice and joyful experience. Thank you!

We would like to sincerely thank EMANet, PyTorch-Encoding, YLG, and TF-GAN for their awesome released code.

Comments

Difference between the code and Eq(13) in the paper about the gradient calculation

https://github.com/Gsunshine/Enjoy-Hamburger/blob/d9b51f6f197486df68c6e059e396520680157c08/seg_mm/mmseg/models/decode_heads/ham_head.py#L45

According to the paper, shouldn't the gradient of the MDs be the one-step gradient? However, the NMF code does not apply torch.no_grad() to the local_inference of NMF and _MatrixDecomposition2DBase. Could you please explain this difference?
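For readers unfamiliar with the term, a one-step gradient can be sketched in PyTorch as follows: all but the last iteration of the decomposition run under `torch.no_grad()`, so autograd records only the final update. This is a minimal illustration under assumptions, not the code in `ham_head.py`; the names `nmf_step` and `nmf_one_step_grad` are hypothetical, as are the rank and step count.

```python
import torch

def nmf_step(x, D, C, eps=1e-6):
    """One multiplicative update of codes C and bases D for x ~ D @ C."""
    C = C * (D.t() @ x) / (D.t() @ D @ C + eps)
    D = D * (x @ C.t()) / (D @ C @ C.t() + eps)
    return D, C

def nmf_one_step_grad(x, rank=4, steps=6):
    """Run `steps` updates; only the last one is tracked by autograd."""
    d, n = x.shape
    g = torch.Generator().manual_seed(0)
    D = torch.rand(d, rank, generator=g)
    C = torch.rand(rank, n, generator=g)
    with torch.no_grad():
        # These iterations are treated as constants in the backward pass.
        for _ in range(steps - 1):
            D, C = nmf_step(x, D, C)
    # The gradient with respect to x flows through this final step only.
    D, C = nmf_step(x, D, C)
    return D @ C

x = torch.rand(16, 50, requires_grad=True)
y = nmf_one_step_grad(x)
y.sum().backward()
print(x.grad.shape)
```

Removing the `torch.no_grad()` context would instead back-propagate through every iteration, which is the behavior the question above is asking about.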

opened by Magic-Ha 5
Default process group has not been initialized

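I can run the model with train.sh. Since I only have a single GPU, I set GPUS=1. I want to understand the model's parameters in more depth, so I tried to debug it. However, even with --gpus 1 and norm_cfg = dict(type='BN', requires_grad=True), I still get the error "Default process group has not been initialized, please make sure to call init_process_group." Is there any way to solve this?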

opened by m828 4
KeyError: 'van_tiny is not in the models registry'

Hi there, I set up the environment in Docker with torch=1.11.0, cuda=11.3, and mmcv-full=1.5.0. When I ran the code, I got the traceback shown in the picture. The config file "hamenet_light_van_tiny_512x1024_160k_cityscapes.py" is one I modified myself, based on the ade20k one. The error message seems to be unrelated to the data, so the dataset shouldn't be the problem. Could you please give me any suggestions? Thanks!

opened by Shawn207 4
Applying Hamburger to other models makes the training collapse soon

I've been working on applying Hamburger to other detection models (specifically, Mask2Former and SparseInst), mainly by inserting Hamburger after the neck to align the multi-scale features, but the training always collapses after only a few iterations because of NaN outputs. Given that the training recipe is rather general and further reducing the lr does not help, I guess this indicates that the gradient propagation is unstable? (P.S. Applying @torch.no_grad() to local_inference() does not help either.) So I'm wondering what the intrinsic cause is. Have you ever met similar cases? Do you have any suggestions for a fix?

Any idea would be appreciated.

opened by npurson 2
BatchNormalization

Hello,

I have a question regarding the batch normalization. Is there a reason why you choose such a small momentum 3e-4 for the batch normalization?

Thank you in advance.
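As background for this question, the sketch below shows what a momentum of 3e-4 means for the running statistics, assuming the PyTorch convention `running = (1 - momentum) * running + momentum * batch_stat`. The helper names are hypothetical, and the sketch does not explain why the authors chose this value.

```python
import math

def ema_update(running, batch_stat, momentum):
    """One running-statistic update under the PyTorch BatchNorm convention."""
    return (1.0 - momentum) * running + momentum * batch_stat

def steps_to_fraction(momentum, fraction=0.5):
    """Updates needed for the running stat to move `fraction` of the way
    toward a constant batch statistic."""
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - momentum))

half_life_default = steps_to_fraction(0.1)   # PyTorch's default momentum
half_life_small = steps_to_fraction(3e-4)    # the value asked about

print(half_life_default, half_life_small)
```

With momentum 0.1 the running statistic moves halfway to a new value within a handful of updates, while with 3e-4 it takes a few thousand, so the running mean and variance average over a much longer training window.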

opened by SnowdenLee 1
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
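The check described above can be sketched as follows. This is an illustrative version, not necessarily the exact patch from the pull request: the helper names `is_within_directory` and `safe_extract` are assumptions, and the check covers member paths only, not symlink targets.

```python
import io
import os
import tarfile
import tempfile

def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory

def safe_extract(tar, path="."):
    """Extract all members, refusing archives that would escape `path`."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in tar member: {member.name}")
    tar.extractall(path)

def _make_tar(names):
    """Build a small in-memory archive for the demo below."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = b"hello"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

with tempfile.TemporaryDirectory() as tmp:
    # A benign archive extracts normally.
    with tarfile.open(fileobj=_make_tar(["ok.txt"])) as tar:
        safe_extract(tar, tmp)
    extracted_ok = os.path.exists(os.path.join(tmp, "ok.txt"))

    # An archive with a "../" member is rejected before anything is written.
    try:
        with tarfile.open(fileobj=_make_tar(["../evil.txt"])) as tar:
            safe_extract(tar, tmp)
        blocked = False
    except ValueError:
        blocked = True
```

Recent Python versions also accept a `filter` argument to `extractall()` that rejects unsafe members, which may be preferable where available.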

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

opened by TrellixVulnTeam 0
Bump tensorflow from 2.2.0 to 2.4.0 in /gan
Bumps tensorflow from 2.2.0 to 2.4.0.

Release notes

Sourced from tensorflow's releases.

TensorFlow 2.4.0

Release 2.4.0

Major Features and Improvements

tf.distribute introduces experimental support for asynchronous training of models via the tf.distribute.experimental.ParameterServerStrategy API. Please see the tutorial to learn more.

MultiWorkerMirroredStrategy is now a stable API and is no longer considered experimental. Some of the major improvements involve handling peer failure and many bug fixes. Please check out the detailed tutorial on Multi-worker training with Keras.

Introduces experimental support for a new module named tf.experimental.numpy which is a NumPy-compatible API for writing TF programs. See the detailed guide to learn more. Additional details below.

Adds Support for TensorFloat-32 on Ampere based GPUs. TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere based GPUs and is enabled by default.

A major refactoring of the internals of the Keras Functional API has been completed, that should improve the reliability, stability, and performance of constructing Functional models.

Keras mixed precision API tf.keras.mixed_precision is no longer experimental and allows the use of 16-bit floating point formats during training, improving performance by up to 3x on GPUs and 60% on TPUs. Please see below for additional details.

TensorFlow Profiler now supports profiling MultiWorkerMirroredStrategy and tracing multiple workers using the sampling mode API.

TFLite Profiler for Android is available. See the detailed guide to learn more.

TensorFlow pip packages are now built with CUDA11 and cuDNN 8.0.2.

Breaking Changes

TF Core:

Certain float32 ops run in lower precision on Ampere based GPUs, including matmuls and convolutions, due to the use of TensorFloat-32. Specifically, inputs to such ops are rounded from 23 bits of precision to 10 bits of precision. This is unlikely to cause issues in practice for deep learning models. In some cases, TensorFloat-32 is also used for complex64 ops. TensorFloat-32 can be disabled by running tf.config.experimental.enable_tensor_float_32_execution(False).

The byte layout for string tensors across the C-API has been updated to match TF Core/C++; i.e., a contiguous array of tensorflow::tstring/TF_TStrings.

C-API functions TF_StringDecode, TF_StringEncode, and TF_StringEncodedSize are no longer relevant and have been removed; see core/platform/ctstring.h for string access/modification in C.

tensorflow.python, tensorflow.core and tensorflow.compiler modules are now hidden. These modules are not part of TensorFlow public API.

tf.raw_ops.Max and tf.raw_ops.Min no longer accept inputs of type tf.complex64 or tf.complex128, because the behavior of these ops is not well defined for complex types.

XLA:CPU and XLA:GPU devices are no longer registered by default. Use TF_XLA_FLAGS=--tf_xla_enable_xla_devices if you really need them, but this flag will eventually be removed in subsequent releases.

tf.keras:

The steps_per_execution argument in model.compile() is no longer experimental; if you were passing experimental_steps_per_execution, rename it to steps_per_execution in your code. This argument controls the number of batches to run during each tf.function call when calling model.fit(). Running multiple batches inside a single tf.function call can greatly improve performance on TPUs or small models with a large Python overhead.

A major refactoring of the internals of the Keras Functional API may affect code that is relying on certain internal details:

Code that uses isinstance(x, tf.Tensor) instead of tf.is_tensor when checking Keras symbolic inputs/outputs should switch to using tf.is_tensor.

Code that is overly dependent on the exact names attached to symbolic tensors (e.g. assumes there will be ":0" at the end of the inputs, treats names as unique identifiers instead of using tensor.ref(), etc.) may break.

Code that uses full path for get_concrete_function to trace Keras symbolic inputs directly should switch to building matching tf.TensorSpecs directly and tracing the TensorSpec objects.

Code that relies on the exact number and names of the op layers that TensorFlow operations were converted into may have changed.

Code that uses tf.map_fn/tf.cond/tf.while_loop/control flow as op layers and happens to work before TF 2.4. These will explicitly be unsupported now. Converting these ops to Functional API op layers was unreliable before TF 2.4, and prone to erroring incomprehensibly or being silently buggy.

Code that directly asserts on a Keras symbolic value in cases where ops like tf.rank used to return a static or symbolic value depending on if the input had a fully static shape or not. Now these ops always return symbolic values.

Code already susceptible to leaking tensors outside of graphs becomes slightly more likely to do so now.

Code that tries directly getting gradients with respect to symbolic Keras inputs/outputs. Use GradientTape on the actual Tensors passed to the already-constructed model instead.

Code that requires very tricky shape manipulation via converted op layers in order to work, where the Keras symbolic shape inference proves insufficient.

Code that tries manually walking a tf.keras.Model layer by layer and assumes layers only ever have one positional argument. This assumption doesn't hold true before TF 2.4 either, but is more likely to cause issues now.

... (truncated)

Commits

582c8d2 Merge pull request #44220 from tensorflow-jenkins/relnotes-2.4.0rc0-18048

c16387f Update RELEASE.md

4cf406c Update RELEASE.md

3f35ef2 Update RELEASE.md

3647e8e Update RELEASE.md

281c7d5 Update RELEASE.md

91ec75f Update RELEASE.md

ed5ad82 Update RELEASE.md

1267bba Update RELEASE.md

13a4067 Update RELEASE.md

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
