I am trying to train Video Swin Transformer on the HMDB51 dataset. I wrote a configuration file based on https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/tsn/tsn_r50_1x1x8_50e_hmdb51_imagenet_rgb.py and the README files, but training fails with an error.
Here is my configuration file.
```python
_base_ = [
    '../../_base_/models/swin/swin_base.py', '../../_base_/default_runtime.py'
]
model = dict(
    backbone=dict(patch_size=(2, 4, 4), drop_path_rate=0.3),
    test_cfg=dict(max_testing_views=4),
    cls_head=dict(num_classes=174))

# dataset settings
dataset_type = 'RawframeDataset'
data_root = 'data/hmdb51/rawframes'
data_root_val = 'data/hmdb51/rawframes'
ann_file_train = 'data/hmdb51/hmdb51_train_split_1_rawframes.txt'
ann_file_val = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt'
ann_file_test = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=8,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=256),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=8,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=256),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=2,
    workers_per_gpu=1,
    val_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
    test_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=data_root,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=data_root_val,
        pipeline=val_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=data_root_val,
        pipeline=test_pipeline))
evaluation = dict(
    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])
# optimizer
optimizer = dict(
    type='AdamW',
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.),
            'backbone': dict(lr_mult=0.1)
        }))
# learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=2.5)
total_epochs = 30
# runtime settings
checkpoint_config = dict(interval=1)
work_dir = './work_dirs/hmdb51_swin_base_patch244_window877.py'
find_unused_parameters = False
# do not use mmdet version fp16
fp16 = None
optimizer_config = dict(
    type='DistOptimizerHook',
    update_interval=8,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
```
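For reference, the merged config is echoed in the training log below, and it can also be inspected directly with MMCV's `Config` API. A minimal sketch of how I checked what the runner actually receives (MMCV 1.3.3, same environment as the log; the config path is the one I train with below):

```python
# Minimal sketch: load the merged config and print the model type
# plus the steps of the final train pipeline.
from mmcv import Config

cfg = Config.fromfile(
    'configs/recognition/swin/swin_base_patch244_window877_hmdb51.py')
print(cfg.model.type)  # -> 'Recognizer3D' (matches the dumped config in the log)
for step in cfg.data.train.pipeline:
    print(step['type'])  # ends with FormatShape / Collect / ToTensor
```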
I tried to start training with `python tools/train.py configs/recognition/swin/swin_base_patch244_window877_hmdb51.py`, but it failed with the following error log:

```
root@c265ec69239e:/home/Video-Swin-Transformer# python tools/train.py configs/recognition/swin/swin_base_patch244_window877_hmdb51.py
2022-07-07 05:46:41,569 - mmaction - INFO - Environment info:
sys.platform: linux
Python: 3.7.7 (default, Mar 26 2020, 15:48:22) [GCC 7.3.0]
CUDA available: True
GPU 0: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.0
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.8.2
OpenCV: 3.4.2
MMCV: 1.3.3
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: not available
MMAction2: 0.15.0+db018fb
2022-07-07 05:46:41,570 - mmaction - INFO - Distributed training: False
2022-07-07 05:46:42,085 - mmaction - INFO - Config: model = dict(
type='Recognizer3D',
backbone=dict(
type='SwinTransformer3D',
patch_size=(2, 4, 4),
embed_dim=128,
depths=[2, 2, 18, 2],
num_heads=[4, 8, 16, 32],
window_size=(8, 7, 7),
mlp_ratio=4.0,
qkv_bias=True,
qk_scale=None,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.3,
patch_norm=True),
cls_head=dict(
type='I3DHead',
in_channels=1024,
num_classes=174,
spatial_type='avg',
dropout_ratio=0.5),
test_cfg=dict(average_clips='prob', max_testing_views=4))
checkpoint_config = dict(interval=1)
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
dataset_type = 'RawframeDataset'
data_root = 'data/hmdb51/rawframes'
data_root_val = 'data/hmdb51/rawframes'
ann_file_train = 'data/hmdb51/hmdb51_train_split_1_rawframes.txt'
ann_file_val = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt'
ann_file_test = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8),
dict(type='RawFrameDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_bgr=False),
dict(type='FormatShape', input_format='NCHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
dict(
type='SampleFrames',
clip_len=1,
frame_interval=1,
num_clips=8,
test_mode=True),
dict(type='RawFrameDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=256),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_bgr=False),
dict(type='FormatShape', input_format='NCHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
dict(
type='SampleFrames',
clip_len=1,
frame_interval=1,
num_clips=8,
test_mode=True),
dict(type='RawFrameDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=256),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_bgr=False),
dict(type='FormatShape', input_format='NCHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]
data = dict(
videos_per_gpu=2,
workers_per_gpu=1,
val_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
test_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
train=dict(
type='RawframeDataset',
ann_file='data/hmdb51/hmdb51_train_split_1_rawframes.txt',
data_prefix='data/hmdb51/rawframes',
pipeline=[
dict(
type='SampleFrames', clip_len=1, frame_interval=1,
num_clips=8),
dict(type='RawFrameDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_bgr=False),
dict(type='FormatShape', input_format='NCHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs', 'label'])
]),
val=dict(
type='RawframeDataset',
ann_file='data/hmdb51/hmdb51_val_split_1_rawframes.txt',
data_prefix='data/hmdb51/rawframes',
pipeline=[
dict(
type='SampleFrames',
clip_len=1,
frame_interval=1,
num_clips=8,
test_mode=True),
dict(type='RawFrameDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=256),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_bgr=False),
dict(type='FormatShape', input_format='NCHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]),
test=dict(
type='RawframeDataset',
ann_file='data/hmdb51/hmdb51_val_split_1_rawframes.txt',
data_prefix='data/hmdb51/rawframes',
pipeline=[
dict(
type='SampleFrames',
clip_len=1,
frame_interval=1,
num_clips=8,
test_mode=True),
dict(type='RawFrameDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=256),
dict(type='Flip', flip_ratio=0),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_bgr=False),
dict(type='FormatShape', input_format='NCHW'),
dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
dict(type='ToTensor', keys=['imgs'])
]))
evaluation = dict(
interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])
optimizer = dict(
type='AdamW',
lr=0.001,
betas=(0.9, 0.999),
weight_decay=0.05,
paramwise_cfg=dict(
custom_keys=dict(
absolute_pos_embed=dict(decay_mult=0.0),
relative_position_bias_table=dict(decay_mult=0.0),
norm=dict(decay_mult=0.0),
backbone=dict(lr_mult=0.1))))
lr_config = dict(
policy='CosineAnnealing',
min_lr=0,
warmup='linear',
warmup_by_epoch=True,
warmup_iters=2.5)
total_epochs = 30
work_dir = './work_dirs/hmdb51_swin_base_patch244_window877.py'
find_unused_parameters = False
gpu_ids = range(0, 1)
omnisource = False
module_hooks = []
2022-07-07 05:46:49,300 - mmaction - INFO - Start running, host: root@c265ec69239e, work_dir: /home/Video-Swin-Transformer/work_dirs/hmdb51_swin_base_patch244_window877.py
2022-07-07 05:46:49,300 - mmaction - INFO - workflow: [('train', 1)], max: 30 epochs
Traceback (most recent call last):
  File "tools/train.py", line 200, in <module>
    main()
  File "tools/train.py", line 196, in main
    meta=meta)
  File "/home/Video-Swin-Transformer/mmaction/apis/train.py", line 195, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 294, in train_step
    losses = self(imgs, label, return_loss=True, **aux_info)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 256, in forward
    return self.forward_train(imgs, label, **kwargs)
  File "/home/Video-Swin-Transformer/mmaction/models/recognizers/recognizer3d.py", line 19, in forward_train
    x = self.extract_feat(imgs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 95, in new_func
    return old_func(*args, **kwargs)
  File "/home/Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 157, in extract_feat
    x = self.backbone(imgs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 652, in forward
    x = self.patch_embed(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 441, in forward
    _, _, D, H, W = x.size()
ValueError: not enough values to unpack (expected 5, got 4)
```
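If I am reading the traceback right, `swin_transformer.py` line 441 unpacks five dimensions from the input (`_, _, D, H, W = x.size()`), i.e. the 3D patch embedding expects an `(N, C, D, H, W)` video tensor, while my pipeline ends with `FormatShape(input_format='NCHW')`, which would produce 4D image batches. A minimal sketch of what I think is happening (plain PyTorch, not the actual model code):

```python
import torch

# 5-D video batch (N, C, D, H, W): what a 3D backbone expects.
x_video = torch.randn(2, 3, 8, 224, 224)
_, _, D, H, W = x_video.size()  # unpacks fine

# 4-D image batch (N, C, H, W): what FormatShape(input_format='NCHW') yields.
x_image = torch.randn(16, 3, 224, 224)
_, _, D, H, W = x_image.size()
# ValueError: not enough values to unpack (expected 5, got 4)
```

My guess is that the TSN-style pipeline I copied is meant for a 2D recognizer, and that for `Recognizer3D` I should be sampling multi-frame clips and using `input_format='NCTHW'` as the Kinetics Swin configs appear to do, but I am not sure whether that is the right fix.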
Can somebody please help me figure out what is going wrong?