Hi authors,
I have reproduced all the results using your code. Most of them are consistent with the reported numbers, except for Swin Transformer. Below are some of my results (reported results in brackets):
Trained with 8 GPUs (A100):
CIFAR10: 75.00 (59.47), CIFAR100: 52.26 (53.28), SVHN: 38.10 (71.60)
Trained with 4 GPUs:
CIFAR10: 81.91 (59.47), CIFAR100: 62.30 (53.28), SVHN: 91.29 (71.60)
From the results above, it seems that the batch size affects Swin a lot. All the reproduced results are comparable with ViT (e.g. ViT on CIFAR10 with 8 GPUs: 77.00 (71.70)). Do you have any idea what the reason might be?
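For context on why I suspect batch size: under data-parallel training, the effective batch size doubles going from 4 to 8 GPUs unless the per-GPU batch size or learning rate is rescaled. A minimal sketch of that arithmetic and the linear LR scaling rule (the per-GPU batch size of 128 and base LR of 1e-3 are hypothetical placeholders, not values from your repo):

```python
# Sketch of effective batch size under data-parallel training and the
# linear learning-rate scaling rule. The concrete numbers (128 per-GPU
# batch, 1e-3 base LR) are assumptions for illustration only.

def effective_batch(per_gpu_batch: int, num_gpus: int) -> int:
    """Total samples consumed per optimizer step across all GPUs."""
    return per_gpu_batch * num_gpus

def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: scale LR proportionally to batch size."""
    return base_lr * batch / base_batch

b4 = effective_batch(128, 4)  # 512 samples per step on 4 GPUs
b8 = effective_batch(128, 8)  # 1024 samples per step on 8 GPUs
print(b4, b8)                       # -> 512 1024
print(scaled_lr(1e-3, b4, b8))      # -> 0.002 (LR doubled for 8 GPUs)
```

If the training script fixes the per-GPU batch size and keeps the LR constant, the 8-GPU run effectively trains with twice the batch at the same LR, which could plausibly explain the gap I see between the 4-GPU and 8-GPU Swin numbers.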