[CVPR 2022 Oral] MixFormer: End-to-End Tracking with Iterative Mixed Attention

Overview

MixFormer

The official implementation of the CVPR 2022 paper MixFormer: End-to-End Tracking with Iterative Mixed Attention

[Models and Raw results] (Google Drive) [Models and Raw results] (Baidu Drive, code: hmuv)

[Figure: MixFormer framework]

News

[Mar 21, 2022]

  • MixFormer is accepted to CVPR 2022.
  • We release the code, models, and raw results.

[Mar 29, 2022]

  • Our paper is selected for an oral presentation.

Highlights

New transformer tracking framework

MixFormer is composed of a target-search Mixed Attention Module (MAM) based backbone and a simple corner head, yielding a compact tracking pipeline without an explicit integration module.
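
For intuition, here is a minimal, conceptual PyTorch sketch of the asymmetric mixed attention idea. This is our own illustration, not the repo's implementation; the tensor layout (batch, heads, tokens, dim) and the function name are assumptions:

import torch
import torch.nn.functional as F

def asymmetric_mixed_attention(q_t, k_t, v_t, q_s, k_s, v_s):
    # conceptual sketch (not repo code): target queries attend only to target
    # keys, while search queries attend to target + search keys, so feature
    # extraction and target-search mixing happen in one attention operation
    scale = q_t.shape[-1] ** -0.5
    # target branch: self-attention over target tokens only
    a_t = F.softmax(torch.einsum('bhld,bhtd->bhlt', q_t, k_t) * scale, dim=-1)
    x_t = torch.einsum('bhlt,bhtd->bhld', a_t, v_t)
    # search branch: attention over the concatenated target and search tokens
    k = torch.cat([k_t, k_s], dim=2)
    v = torch.cat([v_t, v_s], dim=2)
    a_s = F.softmax(torch.einsum('bhld,bhtd->bhlt', q_s, k) * scale, dim=-1)
    x_s = torch.einsum('bhlt,bhtd->bhld', a_s, v)
    return x_t, x_s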

End-to-end, positional-embedding-free, multi-feature-aggregation-free

MixFormer is an end-to-end tracking framework without post-processing. Compared with other transformer trackers, MixFormer does not use positional embeddings, attention masks, or multi-layer feature aggregation strategies.

Strong performance

| Tracker | VOT2020 (EAO) | LaSOT (NP) | GOT-10K (AO) | TrackingNet (NP) |
| --- | --- | --- | --- | --- |
| MixFormer | 0.555 | 79.9 | 70.7 | 88.9 |
| ToMP101* (CVPR 2022) | - | 79.2 | - | 86.4 |
| SBT-large* (CVPR 2022) | 0.529 | - | 70.4 | - |
| SwinTrack* (arXiv 2021) | - | 78.6 | 69.4 | 88.2 |
| Sim-L/14* (arXiv 2022) | - | 79.7 | 69.8 | 87.4 |
| STARK (ICCV 2021) | 0.505 | 77.0 | 68.8 | 86.9 |
| KeepTrack (ICCV 2021) | - | 77.2 | - | - |
| TransT (CVPR 2021) | 0.495 | 73.8 | 67.1 | 86.7 |
| TrDiMP (CVPR 2021) | - | - | 67.1 | 83.3 |
| Siam R-CNN (CVPR 2020) | - | 72.2 | 64.9 | 85.4 |
| TREG (arXiv 2021) | - | 74.1 | 66.8 | 83.8 |

Install the environment

Use Anaconda:

conda create -n mixformer python=3.6
conda activate mixformer
bash install_pytorch17.sh

Data Preparation

Put the tracking datasets in ./data. The layout should look like this (a small sanity-check helper follows the tree):

${MixFormer_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- train2017
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST
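
As a quick sanity check before training, a small helper of our own (hypothetical, not part of the repo) can verify that the expected directories exist:

import os

# hypothetical helper (not part of the repo): verify the dataset layout above
EXPECTED = {
    'lasot': [],                                # sequence folders (airplane/, ...)
    'got10k': ['train', 'val', 'test'],
    'coco': ['annotations', 'train2017'],
    'trackingnet': ['TRAIN_0', 'TEST'],
}

def check_layout(data_root='./data'):
    for dataset, subdirs in EXPECTED.items():
        root = os.path.join(data_root, dataset)
        if not os.path.isdir(root):
            print('missing dataset directory:', root)
            continue
        for sub in subdirs:
            if not os.path.isdir(os.path.join(root, sub)):
                print('missing subdirectory:', os.path.join(root, sub))

if __name__ == '__main__':
    check_layout()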

Set project paths

Run the following command to set the paths for this project:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir .

After running this command, you can also modify the paths by editing these two files (a sketch of their typical contents follows):

lib/train/admin/local.py  # paths for training
lib/test/evaluation/local.py  # paths for testing
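
The generated files follow the PyTracking/STARK convention that this codebase builds on. As a rough sketch of what to expect (the exact class and attribute names below are assumptions; defer to the generated file for the real fields):

# sketch of a PyTracking-style lib/test/evaluation/local.py; the names here
# are illustrative assumptions, so check the file created by the command above
from lib.test.evaluation.environment import EnvSettings

def local_env_settings():
    settings = EnvSettings()
    settings.save_dir = '.'                        # results and checkpoints
    settings.lasot_path = './data/lasot'           # per-dataset roots
    settings.got10k_path = './data/got10k'
    settings.trackingnet_path = './data/trackingnet'
    return settings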

Train MixFormer

Train with multiple GPUs using DDP. More details of the training settings can be found in tracking/train_mixformer.sh; an example of the underlying command is shown after the snippet below.

# MixFormer
bash tracking/train_mixformer.sh
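
For reference, the script wraps direct calls to tracking/train.py. For example, a user report in the comments below invokes the stage-2 (online/SPM) training as follows (paths and GPU count are placeholders):

python tracking/train.py --script mixformer_online --config baseline \
    --save_dir <save_dir> --mode multiple --nproc_per_node <num_gpus> \
    --stage1_model <checkpoint_from_stage1>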

Test and evaluate MixFormer on benchmarks

  • LaSOT/GOT10k-test/TrackingNet/OTB100/UAV123. More details of the test settings can be found in tracking/test_mixformer.sh; an example direct invocation of tracking/test.py is shown after this list.
bash tracking/test_mixformer.sh
  • VOT2020
    Before evaluating "MixFormer+AR" on VOT2020, please install some extra packages following external/AR/README.md. The VOT toolkit is also required to evaluate our tracker; to download and install it, you can follow this tutorial. For convenience, you can use our example VOT toolkit workspaces under external/vot20/ by setting trackers.ini.
cd external/vot20/<workspace_dir>
vot evaluate --workspace . MixFormerPython
# generating analysis results
vot analysis --workspace . --nocache
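
For reference, the test script wraps direct calls to tracking/test.py. A user report in the comments below invokes it as follows (the checkpoint name and search_area_scale are example values):

python tracking/test.py mixformer_online baseline --dataset lasot --threads 32 \
    --num_gpus 2 --params__model <checkpoint>.pth.tar --params__search_area_scale 4.5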

Run MixFormer on your own video

bash tracking/run_video_demo.sh

Compute FLOPs/Params and test speed

bash tracking/profile_mixformer.sh

Visualize attention maps

bash tracking/vis_mixformer_attn.sh

[Figure: attention map visualizations]

Model Zoo and raw results

The trained models and the raw tracking results are provided in [Models and Raw results] (Google Drive) or [Models and Raw results] (Baidu Drive, code: hmuv).

Contact

Yutao Cui: [email protected]

Cheng Jiang: [email protected]

Acknowledgments

  • Thanks to the PyTracking library and the STARK library, which helped us quickly implement our ideas.
  • We use the implementation of CvT from the official CvT repo.
Comments
  • multi-layer feature aggregation strategy and long-term tracking

    Thanks for sharing the excellent work! I have two small questions. First, as mentioned in your paper, the multi-layer feature aggregation strategy is commonly used in other trackers (e.g., SiamRPN++, STARK). The one in SiamRPN++ is understandable, but the one in STARK is confusing: STARK seems to only use the last stride-16 features for prediction. I would like to know the main difference between MixFormer and STARK in this regard. Second, have you tested MixFormer on the VOT long-term dataset? STARK performs well on long-term tracking, and it feels like MixFormer could work even better.

    opened by zzzmm1 8
  • test my pretrained model

    I ran the following command to test my own trained model: python tracking/test.py mixformer_online baseline --dataset lasot --threads 32 --num_gpus 2 --params__model MixFormer_ep0180.pth.tar --params__search_area_scale 4.55. It raised: 'Then try to run again.'.format(env_file)) RuntimeError: YOU HAVE NOT SETUP YOUR local.py!!! After applying a fix I saw in another issue [image], the local.py error no longer appears, but the following error occurs [image].

    I would like to know how to run test.py correctly to test my own trained model.

    opened by JAYCHOU2020 6
  • Is this a typo?

    In line 751, it should be named online_template, not template. Or am I just misunderstanding? https://github.com/MCG-NJU/MixFormer/blob/0c2663d3afbce0da138d5b42bc7f28667d077ba3/lib/models/mixformer/mixformer.py#L746-L756

    opened by laisimiao 6
  • Abnormal results from online-stage training

    Hello author,

    When I reproduce the results on GOT-10k, the stage-1 inference looks normal when visualized, but the stage-2 (online) inference is strange, as follows.

    Taking GOT-10K-TEST-00132 as an example (the other sequences behave the same), the tracker outputs 10*10 boxes throughout:

    1068 476 117 178 1108 548 10 10 1112 552 10 10 1116 556 10 10 1120 560 10 10 1123 564 10 10 1127 567 10 10 1131 571 10 10 1134 575 10 10 1138 579 10 10 1142 583 10 10 1146 587 10 10 1149 590 10 10 1153 594 10 10 1156 598 10 10 1160 602 10 10 1164 606 10 10 1168 610 10 10 1171 614 10 10 1175 617 10 10 1179 621 10 10 1182 625 10 10 1186 629 10 10 1190 633 10 10 1194 637 10 10 1197 640 10 10 1201 644 10 10 1205 648 10 10 1209 652 10 10 1212 656 10 10 1216 660 10 10 1220 663 10 10 1224 667 10 10 1228 671 10 10 1231 675 10 10 1235 679 10 10 1239 682 10 10 1243 686 10 10 1247 690 10 10 1250 694 10 10 1254 698 10 10 1258 702 10 10 1261 705 10 10 1265 709 10 10 1269 713 10 10 1273 717 10 10 1277 721 10 10 1280 725 10 10 1284 728 10 10 1288 732 10 10 1292 736 10 10 1295 740 10 10 1299 744 10 10 1303 748 10 10 1307 751 10 10 1310 755 10 10 1314 759 10 10 1318 763 10 10 1322 767 10 10 1325 770 10 10 1329 774 10 10 1333 778 10 10 1337 782 10 10 1341 786 10 10 1344 789 10 10 1348 793 10 10 1352 797 10 10 1355 801 10 10 1359 805 10 10 1363 809 10 10 1367 812 10 10 1370 816 10 10 1374 820 10 10 1378 824 10 10 1382 828 10 10 1385 832 10 10 1389 835 10 10 1393 839 10 10 1396 843 10 10 1400 847 10 10 1404 850 10 10

    opened by congjianting 4
  • repeat tracker initialize?

    First of all, thanks for your clean and high-quality code. But in https://github.com/MCG-NJU/MixFormer/blob/219bd14704ec217919c3b1eb310940769546c2d6/external/AR/pytracking/VOT2020_super_only_mask_384_HP/mixformer_alpha_seg_class.py#L32-L43 I find tracker.initialize called twice. I think initialize is just a setup step (not an online-update step), so why do we need to call it twice?

    opened by laisimiao 4
  • An error was encountered while testing

    Thank you for your outstanding work. When I run your code, there is an error:

    {'model': 'mixformer_online_22k.pth.tar', 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 1} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384 Online size is: 3 Update interval is: 6 max score decay = 1.0 Error while processing rearrange-reduction pattern "b (h w) c -> b c h w". Input tensor shape: torch.Size([1, 1, 2048, 64]). Additional info: {'h': 32, 'w': 32}. Expected 3 dimensions, got 4 Done

    How to solve this problem?
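
    For what it is worth, the dimension-count error can be reproduced with einops directly; the pattern expects a 3-D (b, h*w, c) tensor, so the 4-D input in the log fails before anything else is checked (a minimal repro of our own, not a fix):

    import torch
    from einops import rearrange

    x_ok = torch.randn(1, 1024, 64)      # (b, h*w, c) with h = w = 32
    y = rearrange(x_ok, 'b (h w) c -> b c h w', h=32, w=32)  # ok: (1, 64, 32, 32)

    x_bad = torch.randn(1, 1, 2048, 64)  # 4-D tensor, as in the log above
    # rearrange(x_bad, 'b (h w) c -> b c h w', h=32, w=32)
    # -> einops error: "Expected 3 dimensions, got 4"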

    opened by DLRook1e 4
  • Can not compile Precise RoI Pooling library

    {'model': 'mixformer_online_22k.pth.tar', 'update_interval': 25, 'online_sizes': 3, 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 0} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 Warning: Pretrained CVT weights are not loaded head channel: 384 Online size is: 3 Update interval is: 25 max score decay = 1.0 Using C:\Users\210\AppData\Local\torch_extensions\torch_extensions\Cache as PyTorch extensions root... C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py:274: UserWarning: Error checking compiler version for cl: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte warnings.warn('Error checking compiler version for {}: {}'.format(compiler, error)) Detected CUDA files, patching ldflags Emitting ninja build file C:\Users\210\AppData\Local\torch_extensions\torch_extensions\Cache_prroi_pooling\build.ninja... Building extension module _prroi_pooling... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 1.10.2 Loading extension module _prroi_pooling... 
Traceback (most recent call last): File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 33, in _import_prroi_pooling verbose=True File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 980, in load keep_intermediates=keep_intermediates) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 1196, in _jit_compile return _import_module_from_library(name, build_directory, is_python_module) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 1543, in _import_module_from_library file, path, description = imp.find_module(module_name, [path]) File "C:\Users\210\anaconda3\envs\mixformer1\lib\imp.py", line 297, in find_module raise ImportError(_ERR_MSG.format(name), name=name) ImportError: No module named '_prroi_pooling'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "tracking/video_demo.py", line 53, in main() File "tracking/video_demo.py", line 49, in main args.save_results, tracker_params=tracker_params) File "tracking/video_demo.py", line 21, in run_video tracker.run_video(videofilepath=videofile, optional_box=optional_box, debug=debug, save_results=save_results) File "tracking..\lib\test\evaluation\tracker.py", line 228, in run_video out = tracker.track(frame) File "tracking..\lib\test\tracker\mixformer_online.py", line 135, in track out_dict, _ = self.network.forward_test(search, run_score_head=True) File "tracking..\lib\models\mixformer\mixformer_online.py", line 850, in forward_test out, outputs_coord_new = self.forward_head(search, template, run_score_head, gt_bboxes) File "tracking..\lib\models\mixformer\mixformer_online.py", line 875, in forward_head out_dict.update({'pred_scores': self.score_branch(search, template, gt_bboxes).view(-1)}) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "tracking..\lib\models\mixformer\mixformer_online.py", line 798, in forward search_box_feat = rearrange(self.search_prroipool(search_feat, target_roi), 'b c h w -> b (h w) c') File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\prroi_pool.py", line 28, in forward return prroi_pool2d(features, rois, self.pooled_height, self.pooled_width, self.spatial_scale) File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 44, in forward _prroi_pooling = _import_prroi_pooling() File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 36, in _import_prroi_pooling raise ImportError('Can not compile Precise RoI Pooling library.') ImportError: Can not compile Precise RoI Pooling library.

    Please help me! Thanks very much

    opened by wyanb 4
  • About Score Prediction Module (SPM) and MixFormer-1k

    Hi, thanks for your work. I have some questions about your paper:

    1. Have you ever tried using the score prediction module of STARK (an MLP) instead of the SPM proposed in your paper? I am curious about the performance difference between the SPM and a plain MLP.
    2. The MixFormer-1k model seems to be trained on all datasets, not just GOT-10k, which differs from your paper (it would be unreasonable for MixFormer-1k to outperform MixFormer-GOT if MixFormer-1k were also trained on GOT-10k only). Is it fair to use it for comparison on the GOT-10k test set?
    opened by hhhAlan 4
  • How to train SPM in stage2?

    Thank you for your excellent work. I have some questions about the training process of SPM.

    I encounter a problem when I use the script in train_mixformer.sh to train the SPM module: python tracking/train.py --script mixformer_online --config baseline --save_dir /mysavepath --mode multiple --nproc_per_node 1 --stage1_model <my latest checkpoint trained in the first stage>

    But the logs suggest that the program has loaded the wrong checkpoint, because there are many missing keys:

    missing keys: ['score_branch.score_token', 'score_branch.score_head.layers.0.weight', 'score_branch.score_head.layers.0.bias', 'score_branch.score_head.layers.1.weight', 'score_branch.score_head.layers.1.bias', 'score_branch.score_head.layers.2.weight', 'score_branch.score_head.layers.2.bias', 'score_branch.proj_q.0.weight', 'score_branch.proj_q.0.bias', 'score_branch.proj_q.1.weight', 'score_branch.proj_q.1.bias', 'score_branch.proj_k.0.weight', 'score_branch.proj_k.0.bias', 'score_branch.proj_k.1.weight', 'score_branch.proj_k.1.bias', 'score_branch.proj_v.0.weight', 'score_branch.proj_v.0.bias', 'score_branch.proj_v.1.weight', 'score_branch.proj_v.1.bias', 'score_branch.proj.0.weight', 'score_branch.proj.0.bias', 'score_branch.proj.1.weight', 'score_branch.proj.1.bias', 'score_branch.norm1.weight', 'score_branch.norm1.bias', 'score_branch.norm2.0.weight', 'score_branch.norm2.0.bias', 'score_branch.norm2.1.weight', 'score_branch.norm2.1.bias'] unexpected keys: [] Loading pretrained mixformer weights done.

    I am really confused about how to train the SPM module correctly.

    I would appreciate it if you could give me some advice.

    The whole log is shown below:

    error logs.txt

    opened by Lich-King000 3
  • About the updated MixedAttention operation

    Thank you for open-sourcing such excellent work!

    In the original version before the update, Q, K, and V were split into template, online-template, and search parts, and attention for the template and the online template was computed separately, as shown below:

    # template attention
    k1 = torch.cat([k_t, k_ot], dim=2)
    v1 = torch.cat([v_t, v_ot], dim=2)
    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_t, k1]) * self.scale
    attn = F.softmax(attn_score, dim=-1)
    attn = self.attn_drop(attn)
    x_t = torch.einsum('bhlt,bhtv->bhlv', [attn, v1])
    x_t = rearrange(x_t, 'b h t d -> b t (h d)')

    # online template attention
    k2 = torch.cat([k_t, k_ot], dim=2)
    v2 = torch.cat([v_t, v_ot], dim=2)
    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_ot, k2]) * self.scale
    attn = F.softmax(attn_score, dim=-1)
    attn = self.attn_drop(attn)
    x_ot = torch.einsum('bhlt,bhtv->bhlv', [attn, v2])
    x_ot = rearrange(x_ot, 'b h t d -> b t (h d)')

    In particular, attn_score is computed from q_t with k1 and from q_ot with k2 (both k1 and k2 are the template keys concatenated with the online-template keys):

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_t, k1]) * self.scale

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_ot, k2]) * self.scale

    In the updated version, the template and the online template are merged and attention is executed jointly from the start, as shown below:

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_mt, k_mt]) * self.scale

    I would like to ask: are the two ways of computing attention for the template and the online template, before and after the update, equivalent?
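
    A quick numerical check (our own sketch with random tensors, not repo code) suggests they are equivalent: softmax normalizes over the key dimension independently for each query row, and k1 = k2 = k_mt, so attending with the merged queries reproduces the two separate attentions row for row (dropout aside, which is random in either case):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    b, h, n_t, n_ot, d = 2, 3, 4, 4, 8   # batch, heads, token counts, head dim
    q_t, q_ot = torch.randn(b, h, n_t, d), torch.randn(b, h, n_ot, d)
    k_t, k_ot = torch.randn(b, h, n_t, d), torch.randn(b, h, n_ot, d)
    v_t, v_ot = torch.randn(b, h, n_t, d), torch.randn(b, h, n_ot, d)
    scale = d ** -0.5

    def attn(q, k, v):
        a = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q, k) * scale, dim=-1)
        return torch.einsum('bhlt,bhtv->bhlv', a, v)

    k_cat = torch.cat([k_t, k_ot], dim=2)   # k1 == k2 == k_mt
    v_cat = torch.cat([v_t, v_ot], dim=2)
    # old version: two separate attentions over the same concatenated keys/values
    x_sep = torch.cat([attn(q_t, k_cat, v_cat), attn(q_ot, k_cat, v_cat)], dim=2)
    # new version: merge the queries first, then run one attention
    x_mt = attn(torch.cat([q_t, q_ot], dim=2), k_cat, v_cat)
    print(torch.allclose(x_sep, x_mt, atol=1e-6))  # True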

    opened by s9021025292140 3
  • How do you validate a module's effectiveness?

    Hi, I am a beginner. I would like to ask: in your experiments, do you validate each new module by training and testing on GOT-10k alone, or by training and testing on the full set of datasets (TrackingNet, COCO, GOT-10k, LaSOT)?

    Recently I found that my model performs very well when trained on GOT-10k only (AO above 0.74), but underperforms when trained on the full datasets (only 68-69 on LaSOT). I am quite confused by this.

    opened by RelayZ 2
  • How to run MixFormer on my own video?

    MixFormer$ bash tracking/run_video_demo.sh {'model': '../MixFormer/models/mixformer_online_22k.pth.tar', 'update_interval': 25, 'online_sizes': 3, 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 0} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '/home/roxign-usr/Badminton/MixFormer/CvT_pretrain_weights/CvT-21-384x384-IN-22k.pth', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 200, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 missing keys: [] unexpected keys: ['stage2.cls_token'] Loading pretrained CVT done. 
head channel: 384 Traceback (most recent call last): File "tracking/video_demo.py", line 53, in main() File "tracking/video_demo.py", line 49, in main args.save_results, tracker_params=tracker_params) File "tracking/video_demo.py", line 21, in run_video tracker.run_video(videofilepath=videofile, optional_box=optional_box, debug=debug, save_results=save_results) File "tracking/../lib/test/evaluation/tracker.py", line 175, in run_video tracker = self.create_tracker(params) File "tracking/../lib/test/evaluation/tracker.py", line 65, in create_tracker tracker = self.tracker_class(params, self.dataset_name) File "tracking/../lib/test/tracker/mixformer_online.py", line 17, in init network = build_mixformer_online_score(params.cfg, train=False) File "tracking/../lib/models/mixformer/mixformer_online.py", line 907, in build_mixformer_online_score box_head = build_box_head(cfg) # a simple corner head File "tracking/../lib/models/mixformer/head.py", line 135, in build_box_head feat_sz=feat_sz, stride=stride, freeze_bn=freeze_bn) File "tracking/../lib/models/mixformer/head.py", line 50, in init .view((self.feat_sz * self.feat_sz,)).float().cuda() RuntimeError: CUDA error: out of memory CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.


    I encountered some problems when running this command; it raises the RuntimeError shown above.

    Do I need any other settings to run this?

    Thanks!

    opened by HJ1722 0
  • Frames with the target missing during training

    Hello author,

    During training, if the target is missing in a few images of a training sequence, do those images still participate in training?

    Taking GOT-10k as an example, I see the dataset applies the following processing:

    def get_sequence_info(self, seq_id):
        seq_path = self._get_sequence_path(seq_id)
        bbox = self._read_bb_anno(seq_path)
    
        valid = (bbox[:, 2] > 0) & (bbox[:, 3] > 0)
        visible, visible_ratio = self._read_target_visible(seq_path)
        visible = visible & valid.byte()
    
        return {'bbox': bbox, 'valid': valid, 'visible': visible, 'visible_ratio': visible_ratio}
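
    For context, here is a sketch of our own (not repo code) of how PyTracking-style samplers typically consume this 'visible' flag: training frames are drawn only from indices where the target is present, so target-absent images are simply never sampled:

    import torch

    # illustration (not repo code): sample frames only where 'visible' is nonzero
    visible = torch.tensor([1, 1, 0, 0, 1, 1], dtype=torch.uint8)
    valid_ids = torch.nonzero(visible).squeeze(1).tolist()
    print(valid_ids)  # [0, 1, 4, 5]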
    
    opened by congjianting 0
Owner

Multimedia Computing Group, Nanjing University