This is an official implementation of "Video Swin Transformer".

Overview

Video Swin Transformer


By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer is initially described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off than previous approaches that compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
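
To make the locality idea concrete, below is a minimal, illustrative sketch (not the repository code) of a 3D window partition that restricts self-attention to non-overlapping spatio-temporal windows. The (B, D, H, W, C) layout and window sizes that divide the feature map evenly are assumptions made for the example.

import torch

def window_partition_3d(x, window_size):
    # Split a (B, D, H, W, C) feature map into non-overlapping 3D windows;
    # attention is then computed only among the tokens inside each window.
    B, D, H, W, C = x.shape
    wd, wh, ww = window_size
    x = x.view(B, D // wd, wd, H // wh, wh, W // ww, ww, C)
    # -> (num_windows * B, tokens_per_window, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wd * wh * ww, C)

# Example: an 8-frame 56x56 feature map with 96 channels and an (8, 7, 7) window
# yields 64 windows of 8*7*7 = 392 tokens each.
feat = torch.randn(1, 8, 56, 56, 96)
print(window_partition_3d(feat, (8, 7, 7)).shape)  # torch.Size([64, 392, 96])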


Results and Models

Kinetics 400

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | ImageNet-1K | 30ep | 224 | 78.8 | 93.6 | 28M | 87.9G | config | github/baidu |
| Swin-S | ImageNet-1K | 30ep | 224 | 80.6 | 94.5 | 50M | 165.9G | config | github/baidu |
| Swin-B | ImageNet-1K | 30ep | 224 | 80.6 | 94.6 | 88M | 281.6G | config | github/baidu |
| Swin-B | ImageNet-22K | 30ep | 224 | 82.7 | 95.5 | 88M | 281.6G | config | github/baidu |

Kinetics 600

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | ImageNet-22K | 30ep | 224 | 84.0 | 96.5 | 88M | 281.6G | config | github/baidu |

Something-Something V2

| Backbone | Pretrain | Lr Schd | spatial crop | acc@1 | acc@5 | #params | FLOPs | config | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | Kinetics 400 | 60ep | 224 | 69.6 | 92.7 | 89M | 320.6G | config | github/baidu |

Usage

Installation

Please refer to install.md for installation.

We also provide Docker files for cuda10.1 and cuda11.0 (with corresponding image URLs) for convenient usage.

Data Preparation

Please refer to data_preparation.md for general guidance on data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy
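
If you prefer running inference from Python, the mmaction2 high-level API can also be used. The sketch below is only an illustration: the checkpoint filename and the demo/label-map paths are assumptions you need to adapt to your setup, and the exact API may differ across mmaction2 versions.

from mmaction.apis import init_recognizer, inference_recognizer

# Hypothetical local paths; adjust to where you downloaded the config/checkpoint.
config = 'configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py'
checkpoint = 'swin_tiny_patch244_window877_kinetics400_1k.pth'

model = init_recognizer(config, checkpoint, device='cuda:0')
# Returns the top-scoring (label, score) pairs for the input video.
results = inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')
print(results)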

Training

To train a video recognition model with pre-trained image models (for the Kinetics-400 and Kinetics-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model on the Kinetics-400 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> 

To train a video recognizer with pre-trained video models (for the Something-Something v2 dataset), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model on the SSv2 dataset with 8 GPUs, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint enables activation checkpointing to save GPU memory. Please refer to this page for more details.
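
For context, activation (gradient) checkpointing discards intermediate activations in the forward pass and recomputes them during backward, trading extra compute for lower memory. The snippet below is a generic PyTorch illustration of that trade-off, not the backbone's actual code.

import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `layer` are not stored; they are recomputed when backward() runs.
y = checkpoint(layer, x)
y.sum().backward()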

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)
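
If you remove the apex-based block above, one possible alternative is mmcv's native mixed-precision hook. Treat the snippet below as a sketch to verify against your installed mmcv/mmaction2 versions, not an official recommendation from this repo:

# use mmcv's native fp16 hook instead of apex (sketch; check your mmcv version)
fp16 = dict(loss_scale=512.)
optimizer_config = dict(grad_clip=None)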

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Other Links

Image Classification: See Swin Transformer for Image Classification.

Object Detection: See Swin Transformer for Object Detection.

Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Self-Supervised Learning: See MoBY with Swin Transformer.

Comments
  • KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

    When I run python tools/train.py configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py, the following error occurs:

    Traceback (most recent call last):
      File "tools/train.py", line 199, in <module>
        main()
      File "tools/train.py", line 154, in main
        model = build_model(
      File "/home/pytorch/lib/python3/site-packages/mmaction/models/builder.py", line 70, in build_model
        return build_localizer(cfg)
      File "/home/pytorch/lib/python3/site-packages/mmaction/models/builder.py", line 62, in build_localizer
        return LOCALIZERS.build(cfg)
      File "/home/pytorch/lib/python3/site-packages/mmcv/utils/registry.py", line 210, in build
        return self.build_func(*args, **kwargs, registry=self)
      File "/home/pytorch/lib/python3/site-packages/mmcv/cnn/builder.py", line 26, in build_model_from_cfg
        return build_from_cfg(cfg, registry, default_args)
      File "/home/pytorch/lib/python3/site-packages/mmcv/utils/registry.py", line 54, in build_from_cfg
        raise type(e)(f'{obj_cls.__name__}: {e}')
    KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

    How to solve it?

    opened by Note-Liu 9
  • How to train/test on custom dataset like HMDB51

    I was trying to train swin transformer on HMDB51 dataset.

    I referred to the configuration file "https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/tsn/tsn_r50_1x1x8_50e_hmdb51_imagenet_rgb.py" and the "README" files, but what I got was an error message.

    Here is my configuration file:

    _base_ = ['../../_base_/models/swin/swin_base.py', '../../_base_/default_runtime.py']
    model = dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.3), test_cfg=dict(max_testing_views=4), cls_head=dict(num_classes=174))

    # dataset settings
    dataset_type = 'RawframeDataset'
    data_root = 'data/hmdb51/rawframes'
    data_root_val = 'data/hmdb51/rawframes'
    ann_file_train = 'data/hmdb51/hmdb51_train_split_1_rawframes.txt'
    ann_file_val = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt'
    ann_file_test = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt'
    img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
    train_pipeline = [ dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='RandomResizedCrop'), dict(type='Resize', scale=(224, 224), keep_ratio=False), dict(type='Flip', flip_ratio=0.5), dict(type='Normalize', **img_norm_cfg), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs', 'label']) ]
    val_pipeline = [ dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8, test_mode=True), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='CenterCrop', crop_size=256), dict(type='Flip', flip_ratio=0), dict(type='Normalize', **img_norm_cfg), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs']) ]
    test_pipeline = [ dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8, test_mode=True), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='CenterCrop', crop_size=256), dict(type='Flip', flip_ratio=0), dict(type='Normalize', **img_norm_cfg), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs']) ]
    data = dict(
        videos_per_gpu=2,
        workers_per_gpu=1,
        val_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
        test_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1),
        train=dict(type=dataset_type, ann_file=ann_file_train, data_prefix=data_root, pipeline=train_pipeline),
        val=dict(type=dataset_type, ann_file=ann_file_val, data_prefix=data_root_val, pipeline=val_pipeline),
        test=dict(type=dataset_type, ann_file=ann_file_test, data_prefix=data_root_val, pipeline=test_pipeline))
    evaluation = dict(interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])

    # optimizer
    optimizer = dict(type='AdamW', lr=1e-3, betas=(0.9, 0.999), weight_decay=0.05,
        paramwise_cfg=dict(custom_keys={'absolute_pos_embed': dict(decay_mult=0.), 'relative_position_bias_table': dict(decay_mult=0.), 'norm': dict(decay_mult=0.), 'backbone': dict(lr_mult=0.1)}))

    # learning policy
    lr_config = dict(policy='CosineAnnealing', min_lr=0, warmup='linear', warmup_by_epoch=True, warmup_iters=2.5)
    total_epochs = 30

    # runtime settings
    checkpoint_config = dict(interval=1)
    work_dir = './work_dirs/hmdb51_swin_base_patch244_window877.py'
    find_unused_parameters = False

    # do not use mmdet version fp16
    fp16 = None
    optimizer_config = dict(
        type="DistOptimizerHook",
        update_interval=8,
        grad_clip=None,
        coalesce=True,
        bucket_size_mb=-1,
        use_fp16=True,
    )

    I tried to run "python tools/train.py configs/recognition/swin/swin_base_patch244_window877_hmdb51.py", but it failed.

    The following error log is given:

    root@c265ec69239e:/home/Video-Swin-Transformer# python tools/train.py configs/recognition/swin/swin_base_patch244_window877_hmdb51.py
    2022-07-07 05:46:41,569 - mmaction - INFO - Environment info:

    sys.platform: linux Python: 3.7.7 (default, Mar 26 2020, 15:48:22) [GCC 7.3.0] CUDA available: True GPU 0: Tesla V100-PCIE-32GB CUDA_HOME: /usr/local/cuda NVCC: Build cuda_11.0_bu.TC445_37.28845127_0 GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609 PyTorch: 1.7.1 PyTorch compiling details: PyTorch built with:

    • GCC 7.3
    • C++ Version: 201402
    • Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications
    • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
    • OpenMP 201511 (a.k.a. OpenMP 4.5)
    • NNPACK is enabled
    • CPU capability usage: AVX2
    • CUDA Runtime 11.0
    • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
    • CuDNN 8.0.5
    • Magma 2.5.2
    • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

    TorchVision: 0.8.2 OpenCV: 3.4.2 MMCV: 1.3.3 MMCV Compiler: GCC 5.4 MMCV CUDA Compiler: not available MMAction2: 0.15.0+db018fb

    2022-07-07 05:46:41,570 - mmaction - INFO - Distributed training: False 2022-07-07 05:46:42,085 - mmaction - INFO - Config: model = dict( type='Recognizer3D', backbone=dict( type='SwinTransformer3D', patch_size=(2, 4, 4), embed_dim=128, depths=[2, 2, 18, 2], num_heads=[4, 8, 16, 32], window_size=(8, 7, 7), mlp_ratio=4.0, qkv_bias=True, qk_scale=None, drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.3, patch_norm=True), cls_head=dict( type='I3DHead', in_channels=1024, num_classes=174, spatial_type='avg', dropout_ratio=0.5), test_cfg=dict(average_clips='prob', max_testing_views=4)) checkpoint_config = dict(interval=1) log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')]) dist_params = dict(backend='nccl') log_level = 'INFO' load_from = None resume_from = None workflow = [('train', 1)] dataset_type = 'RawframeDataset' data_root = 'data/hmdb51/rawframes' data_root_val = 'data/hmdb51/rawframes' ann_file_train = 'data/hmdb51/hmdb51_train_split_1_rawframes.txt' ann_file_val = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt' ann_file_test = 'data/hmdb51/hmdb51_val_split_1_rawframes.txt' img_norm_cfg = dict( mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False) train_pipeline = [ dict(type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='RandomResizedCrop'), dict(type='Resize', scale=(224, 224), keep_ratio=False), dict(type='Flip', flip_ratio=0.5), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs', 'label']) ] val_pipeline = [ dict( type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8, test_mode=True), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='CenterCrop', crop_size=256), dict(type='Flip', flip_ratio=0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs']) ] test_pipeline = [ dict( type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8, test_mode=True), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='CenterCrop', crop_size=256), dict(type='Flip', flip_ratio=0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs']) ] data = dict( videos_per_gpu=2, workers_per_gpu=1, val_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1), test_dataloader=dict(videos_per_gpu=1, workers_per_gpu=1), train=dict( type='RawframeDataset', ann_file='data/hmdb51/hmdb51_train_split_1_rawframes.txt', data_prefix='data/hmdb51/rawframes', pipeline=[ dict( type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='RandomResizedCrop'), dict(type='Resize', scale=(224, 224), keep_ratio=False), dict(type='Flip', flip_ratio=0.5), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs', 'label']) 
]), val=dict( type='RawframeDataset', ann_file='data/hmdb51/hmdb51_val_split_1_rawframes.txt', data_prefix='data/hmdb51/rawframes', pipeline=[ dict( type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8, test_mode=True), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='CenterCrop', crop_size=256), dict(type='Flip', flip_ratio=0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs']) ]), test=dict( type='RawframeDataset', ann_file='data/hmdb51/hmdb51_val_split_1_rawframes.txt', data_prefix='data/hmdb51/rawframes', pipeline=[ dict( type='SampleFrames', clip_len=1, frame_interval=1, num_clips=8, test_mode=True), dict(type='RawFrameDecode'), dict(type='Resize', scale=(-1, 256)), dict(type='CenterCrop', crop_size=256), dict(type='Flip', flip_ratio=0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False), dict(type='FormatShape', input_format='NCHW'), dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]), dict(type='ToTensor', keys=['imgs']) ])) evaluation = dict( interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy']) optimizer = dict( type='AdamW', lr=0.001, betas=(0.9, 0.999), weight_decay=0.05, paramwise_cfg=dict( custom_keys=dict( absolute_pos_embed=dict(decay_mult=0.0), relative_position_bias_table=dict(decay_mult=0.0), norm=dict(decay_mult=0.0), backbone=dict(lr_mult=0.1)))) lr_config = dict( policy='CosineAnnealing', min_lr=0, warmup='linear', warmup_by_epoch=True, warmup_iters=2.5) total_epochs = 30 work_dir = './work_dirs/hmdb51_swin_base_patch244_window877.py' find_unused_parameters = False gpu_ids = range(0, 1) omnisource = False module_hooks = []

    2022-07-07 05:46:49,300 - mmaction - INFO - Start running, host: root@c265ec69239e, work_dir: /home/Video-Swin-Transformer/work_dirs/hmdb51_swin_base_patch244_window877.py 2022-07-07 05:46:49,300 - mmaction - INFO - workflow: [('train', 1)], max: 30 epochs Traceback (most recent call last): File "tools/train.py", line 200, in main() File "tools/train.py", line 196, in main meta=meta) File "/home/Video-Swin-Transformer/mmaction/apis/train.py", line 195, in train_model runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs) File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run epoch_runner(data_loaders[i], **kwargs) File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train self.run_iter(data_batch, train_mode=True, **kwargs) File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter **kwargs) File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step return self.module.train_step(*inputs[0], **kwargs[0]) File "/home/Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 294, in train_step losses = self(imgs, label, return_loss=True, **aux_info) File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/home/Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 256, in forward return self.forward_train(imgs, label, **kwargs) File "/home/Video-Swin-Transformer/mmaction/models/recognizers/recognizer3d.py", line 19, in forward_train x = self.extract_feat(imgs) File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 95, in new_func return old_func(*args, **kwargs) File "/home/Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 157, in extract_feat x = self.backbone(imgs) File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/home/Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 652, in forward x = self.patch_embed(x) File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/home/Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 441, in forward _, _, D, H, W = x.size() ValueError: not enough values to unpack (expected 5, got 4)

    PLEASE SOMEBODY HELP ME!!!

    opened by Lee-daeho 6
  • Where can I find the <PRETRAIN_MODEL>?

    Hi, thanks for this fascinating work! I want to follow the instructions bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments] to run the program, but I don't know where I can find the pretrain model. So, I need some help, thanks all of you!

    opened by wsh-nie 3
  • About the input image shape

    I get the image shape from the dataloader, and the shape is torch.Size([1, 12, 3, 32, 224, 224]). I know it is [batch_size, ?, RGB, frames, H, W], but I don't know why a single video has "12" there. Does anyone know what that dimension is?

    In ./tools/test.py:

    data_loader = build_dataloader(dataset, **dataloader_setting)
    for i in data_loader:
        print("img shape: ", i['imgs'].size())

    Thank you.

    opened by Chen-Bo-Yang 2
  • About learning rate and batch size

    Are the original configuration files all meant for 8-GPU setups?

    Specifically, is the base_22k model on 8 GPUs with a batch size of 8 and a learning rate of 3e-4? For others, is the model on 8 GPUs with a batch size of 8 and a learning rate of 1e-3?

    opened by geek12138 2
  • the training iteration is abnormally large

    I used 4 GPUs (2080 Ti) to train Swin-S with the config swin_small_patch244_window877_kinetics400_1k.py. The dataset I used is HACS (about 500k videos). The following is some of my training log:

    [screenshot of the training log]

    I find that the number of training iterations is abnormally large given my config (dataset size 500k, batch_size 8), which leads to a very long training time. Is that normal?

    opened by bolin-chen 2
  • Inaccessible Download Links

    The download links for the Kinetics 400 pretrained models are on pan.baidu.com. Many people are not able to download these at all because you need to create an account (with a phone number) to download files from that site. If you are in Germany or the UK, like me, it is not possible to create an account to download these. Please host them somewhere else to make them available to the general public.

    opened by RaivoKoot 2
  • About the 3D relative position bias

    In the subsection "3D relative position bias" of your paper, a bias is added in the self-attention computation. I don't fully understand it.

    According to your description, Q, K, V are all matrices with P*M^2 rows and d columns, so QK^T will be a square matrix with P*M^2 rows and P*M^2 columns. To make the summation valid, the 3D relative position bias B should also be a square matrix with P*M^2 rows and P*M^2 columns. So how are the values in B set? Specifically, how is the entry B(i,j) set? I can't see the link between B and the bias table described in the paper. (A sketch of the usual construction follows after this entry.)

    opened by TangMinLeo 2
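
    A hedged sketch of the usual Swin-style construction, extended to a (P, M, M) window, may clarify the question above: every ordered pair of window tokens shares one learnable scalar per head, looked up from a small table by their relative (temporal, height, width) offset. The code follows the public 2D Swin implementation and is illustrative only, not copied from this repository.

    import torch

    P, M, num_heads = 2, 3, 4                          # tiny window just for illustration
    coords = torch.stack(torch.meshgrid(
        torch.arange(P), torch.arange(M), torch.arange(M), indexing='ij'))   # (3, P, M, M)
    coords = coords.flatten(1)                         # (3, N) with N = P*M*M tokens
    rel = coords[:, :, None] - coords[:, None, :]      # (3, N, N) pairwise offsets
    rel = rel.permute(1, 2, 0).contiguous()
    rel[..., 0] += P - 1                               # shift offsets to be non-negative
    rel[..., 1] += M - 1
    rel[..., 2] += M - 1
    rel[..., 0] *= (2 * M - 1) * (2 * M - 1)           # flatten (dt, dh, dw) into one index
    rel[..., 1] *= 2 * M - 1
    index = rel.sum(-1)                                # (N, N) lookup indices

    # Learnable in the real model; zeros here just to show the shapes.
    table = torch.zeros((2 * P - 1) * (2 * M - 1) * (2 * M - 1), num_heads)
    B = table[index.view(-1)].view(P * M * M, P * M * M, num_heads).permute(2, 0, 1)
    print(B.shape)   # (num_heads, P*M*M, P*M*M), broadcast-added to QK^T / sqrt(d) per head
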
  • AttributeError: Module tools/data/kinetics/label_map_k400.txt not found

    I got to the end of the installation and ran the check script to verify that the installation was done correctly.

    In the open_mmlab environment, the very last line of the script, inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt'), raised an error: AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

    How to proceed?

    Thank You Tom

    opened by minertom 1
  • The model's behavior is different from the picture in the paper.

    Hello. Thank you for providing a good paper with a good code.

    I had a question while experimenting with video swin transformer.

    Input size: (1, 3, 8, 384, 384), SwinTransformer3D(patch_size=(2,4,4), all other settings at their defaults).

    The output size per layer was measured in the forward section. The result is:

    after layer 1, output shape: torch.Size([1, 192, 4, 48, 48])
    after layer 2, output shape: torch.Size([1, 384, 4, 24, 24])
    after layer 3, output shape: torch.Size([1, 768, 4, 12, 12])
    after layer 4, output shape: torch.Size([1, 768, 4, 12, 12])

    As the paper illustrates:

    after layer 1, output shape: torch.Size([1, 96, 4, 96, 96])
    after layer 2, output shape: torch.Size([1, 192, 4, 48, 48])
    after layer 3, output shape: torch.Size([1, 384, 4, 24, 24])
    after layer 4, output shape: torch.Size([1, 768, 4, 12, 12])

    I think this is right.

    I know it's hard work, but can I ask you to check it out?

    opened by junsang7777 1
  • How long does it take to train an epoch with SWIN-B?

    I used Swin-B to train on the EPIC-KITCHENS dataset, but it takes me almost 27 hours for one training epoch (mixed precision was already applied). I used 4 V100 GPUs, batch_size=8. Is this a normal training time?

    opened by Christinepan881 1
  • Can't export ONNX transformer

    The command python3 tools/deployment/pytorch2onnx.py configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py swin_tiny_patch244_window877_kinetics400_1k.pth outputs this error:

    /opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2966.)
      return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
    Use load_from_local loader
    Traceback (most recent call last):
      File "tools/deployment/pytorch2onnx.py", line 163, in <module>
        pytorch2onnx(
      File "tools/deployment/pytorch2onnx.py", line 67, in pytorch2onnx
        torch.onnx.export(
      File "/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py", line 479, in export
        _export(
      File "/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py", line 1411, in _export
        graph, params_dict, torch_out = _model_to_graph(
      File "/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py", line 1050, in _model_to_graph
        graph, params, torch_out, module = _create_jit_graph(model, args)
      File "/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py", line 925, in _create_jit_graph
        graph, torch_out = _trace_and_get_graph_from_model(model, args)
      File "/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py", line 833, in _trace_and_get_graph_from_model
        trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
      File "/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py", line 1175, in _get_trace_graph
        outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py", line 127, in forward
        graph, out = torch._C._create_graph_by_tracing(
      File "/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py", line 118, in wrapper
        outs.append(self.inner(*trace_inputs))
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1174, in _slow_forward
        result = self.forward(*input, **kwargs)
    TypeError: forward_dummy() got multiple values for argument 'softmax'
    
    opened by hyperfraise 0
  • Swin-L pretrain

    Dear researchers,

    Thank you for this very nice piece of work.

    Can you also provide the weights of the Swin-L model described in the paper?

    Best regards,

    opened by wanghao15536870732 0
  • Error: av_read_frame failed with 1094995529

    Hello, I ran into a problem while using the code. Following the inference code, I built a data_loader to load video files, but when processing one particular video I got this error: decord._ffi.base.DECORDError: [16:51:53] /io/decord/src/video/video_reader.cc:432: Error: av_read_frame failed with 1094995529

    The error seems to come from mmaction/datasets/pipelines/loading.py", line 966, in __call__ container = decord.VideoReader(file_obj, num_threads=self.num_threads). I tested the video that triggered the error: it plays back normally and also loads fine when using decord.VideoReader on its own, so this is quite strange. I wonder if you could take a look when you have time. Thanks!

    opened by DWCTOD 0
  • Embeddings

    Hello, I want to know how I can get the output tensor of an intermediate layer, such as the last FC layer. When using the OutputHook function, how should the name of the layer be passed?

    inference_recognizer(model, video, labels, outputs='fc1')?

    opened by spunknic 0
  • Is there any plan to release the video swin transformer code and pre-trained models of swin transformer V2?

    Hello, I have noticed that the swin transformer V2 paper has been published. There are experiments on video action classification in the paper, and the results are better than those in V1. Is there any plan to release video swin transformer code and pre-trained models based on V2?

    Your work is very valuable and helpful to me. I look forward to your reply. Thank you very much!

    opened by githubcvcv 0
Owner
Swin Transformer
This organization maintains repositories built on Swin Transformers. The pretrained models are located at https://github.com/microsoft/Swin-Transformer