[CVPR 2022 Oral] MixFormer: End-to-End Tracking with Iterative Mixed Attention

Overview

MixFormer

The official implementation of the CVPR 2022 paper MixFormer: End-to-End Tracking with Iterative Mixed Attention

[Models and Raw results] (Google Drive) [Models and Raw results] (Baidu Drive, extraction code: hmuv)

(Figure: MixFormer framework)

News

[Mar 21, 2022]

  • MixFormer is accepted by CVPR 2022.
  • We release the code, models, and raw results.

[Mar 29, 2022]

  • Our paper is selected for an oral presentation.

Highlights

New transformer tracking framework

MixFormer is composed of a backbone based on the target-search Mixed Attention Module (MAM) and a simple corner head, yielding a compact tracking pipeline without an explicit integration module.
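
For intuition, here is a minimal single-head sketch of the mixed-attention idea, assuming identity projections and simplified shapes (illustrative only; the repo's MAM uses learned per-stage projections and multiple heads):

import torch
import torch.nn.functional as F

def mixed_attention(template, online_template, search, scale):
    # Illustrative: tokens have shape (batch, num_tokens, dim), single head.
    tokens = torch.cat([template, online_template, search], dim=1)
    q = k = v = tokens  # a real module applies learned q/k/v projections here
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    mixed = attn @ v  # feature extraction and target-search fusion happen jointly
    t_len = template.shape[1] + online_template.shape[1]
    return mixed[:, :t_len], mixed[:, t_len:]  # split into target / search parts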

End-to-end, Positional-embedding-free, multi-feature-aggregation-free

MixFormer is an end-to-end tracking framework without post-processing. Compared with other transformer trackers, MixFormer does not use positional embeddings, attention masks, or multi-layer feature aggregation strategies.

Strong performance

| Tracker | VOT2020 (EAO) | LaSOT (NP) | GOT-10k (AO) | TrackingNet (NP) |
|---|---|---|---|---|
| MixFormer | 0.555 | 79.9 | 70.7 | 88.9 |
| ToMP101* (CVPR 2022) | - | 79.2 | - | 86.4 |
| SBT-large* (CVPR 2022) | 0.529 | - | 70.4 | - |
| SwinTrack* (arXiv 2021) | - | 78.6 | 69.4 | 88.2 |
| Sim-L/14* (arXiv 2022) | - | 79.7 | 69.8 | 87.4 |
| STARK (ICCV 2021) | 0.505 | 77.0 | 68.8 | 86.9 |
| KeepTrack (ICCV 2021) | - | 77.2 | - | - |
| TransT (CVPR 2021) | 0.495 | 73.8 | 67.1 | 86.7 |
| TrDiMP (CVPR 2021) | - | - | 67.1 | 83.3 |
| Siam R-CNN (CVPR 2020) | - | 72.2 | 64.9 | 85.4 |
| TREG (arXiv 2021) | - | 74.1 | 66.8 | 83.8 |

Install the environment

Use Anaconda:

conda create -n mixformer python=3.6
conda activate mixformer
bash install_pytorch17.sh

Data Preparation

Put the tracking datasets in ./data. It should look like:

${MixFormer_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- train2017
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST

Set project paths

Run the following command to set the paths for this project:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir .

After running this command, you can also modify the paths by editing these two files:

lib/train/admin/local.py  # paths about training
lib/test/evaluation/local.py  # paths about testing
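
For reference, the generated files follow the PyTracking convention of a settings object holding dataset and output paths. A hypothetical sketch of lib/test/evaluation/local.py (the actual contents are produced by create_default_local_file.py, and the field names may differ):

# Hypothetical sketch; check the generated file for the real field names.
from lib.test.evaluation.environment import EnvSettings

def local_env_settings():
    settings = EnvSettings()
    settings.results_path = './test/tracking_results'  # raw result files go here
    settings.lasot_path = './data/lasot'
    settings.got10k_path = './data/got10k'
    settings.trackingnet_path = './data/trackingnet'
    return settings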

Train MixFormer

We train with multiple GPUs using DDP. More details of the training settings can be found in tracking/train_mixformer.sh.

# MixFormer
bash tracking/train_mixformer.sh

Test and evaluate MixFormer on benchmarks

  • LaSOT/GOT10k-test/TrackingNet/OTB100/UAV123. More details of the test settings can be found in tracking/test_mixformer.sh:
bash tracking/test_mixformer.sh
  • VOT2020
    Before evaluating "MixFormer+AR" on VOT2020, please install the extra packages following external/AR/README.md. The VOT toolkit is also required to evaluate our tracker; to download and install it, you can follow this tutorial. For convenience, you can use our example VOT toolkit workspaces under external/vot20/ by configuring trackers.ini.
cd external/vot20/<workspace_dir>
vot evaluate --workspace . MixFormerPython
# generating analysis results
vot analysis --workspace . --nocache

Run MixFormer on your own video

bash tracking/run_video_demo.sh

Compute FLOPs/Params and test speed

bash tracking/profile_mixformer.sh

Visualize attention maps

bash tracking/vis_mixformer_attn.sh

(Figure: visualized attention maps)

Model Zoo and raw results

The trained models and the raw tracking results are provided in [Models and Raw results] (Google Drive) or [Models and Raw results] (Baidu Drive, extraction code: hmuv).

Contact

Yutao Cui: [email protected]

Cheng Jiang: [email protected]

Acknowledgments

  • Thanks to the PyTracking and STARK libraries, which helped us quickly implement our ideas.
  • We use the implementation of CvT from its official repo.

Issues
  • Is this a typo?

    In line 751, shouldn't it be named online_template rather than template, or am I misunderstanding? https://github.com/MCG-NJU/MixFormer/blob/0c2663d3afbce0da138d5b42bc7f28667d077ba3/lib/models/mixformer/mixformer.py#L746-L756

    opened by laisimiao 6
  • repeat tracker initialize?

    First of all, thanks for your clean and high-quality code. But in https://github.com/MCG-NJU/MixFormer/blob/219bd14704ec217919c3b1eb310940769546c2d6/external/AR/pytracking/VOT2020_super_only_mask_384_HP/mixformer_alpha_seg_class.py#L32-L43 I find tracker.initialize called twice. I think initialize is just a setup step (not an online update step), so why do we need to call it twice?

    opened by laisimiao 4
  • An error was encountered while testing

    Thank you for your outstanding work. While reproducing your code, I encountered an error:

    {'model': 'mixformer_online_22k.pth.tar', 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 1} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384 Online size is: 3 Update interval is: 6 max score decay = 1.0 Error while processing rearrange-reduction pattern "b (h w) c -> b c h w". Input tensor shape: torch.Size([1, 1, 2048, 64]). Additional info: {'h': 32, 'w': 32}. Expected 3 dimensions, got 4 Done

    How to solve this problem?

    opened by DLRook1e 4
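
    For reference, the einops failure above can be reproduced in isolation: the pattern 'b (h w) c -> b c h w' expects a 3-D tensor, so a 4-D input with an extra singleton dimension raises exactly this "Expected 3 dimensions, got 4" error. A minimal sketch with consistent sizes (not the project's actual fix):

    import torch
    from einops import rearrange

    x = torch.randn(1, 1, 1024, 64)  # 4-D: an extra singleton dimension up front
    # rearrange(x, 'b (h w) c -> b c h w', h=32, w=32)  # fails: expected 3 dims, got 4
    y = rearrange(x.squeeze(1), 'b (h w) c -> b c h w', h=32, w=32)  # 3-D input works
    print(y.shape)  # torch.Size([1, 64, 32, 32])
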
  • Can not compile Precise RoI Pooling library

    {'model': 'mixformer_online_22k.pth.tar', 'update_interval': 25, 'online_sizes': 3, 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 0} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 Warning: Pretrained CVT weights are not loaded head channel: 384 Online size is: 3 Update interval is: 25 max score decay = 1.0 Using C:\Users\210\AppData\Local\torch_extensions\torch_extensions\Cache as PyTorch extensions root... C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py:274: UserWarning: Error checking compiler version for cl: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte warnings.warn('Error checking compiler version for {}: {}'.format(compiler, error)) Detected CUDA files, patching ldflags Emitting ninja build file C:\Users\210\AppData\Local\torch_extensions\torch_extensions\Cache_prroi_pooling\build.ninja... Building extension module _prroi_pooling... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 1.10.2 Loading extension module _prroi_pooling... 
Traceback (most recent call last): File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 33, in _import_prroi_pooling verbose=True File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 980, in load keep_intermediates=keep_intermediates) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 1196, in _jit_compile return _import_module_from_library(name, build_directory, is_python_module) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 1543, in _import_module_from_library file, path, description = imp.find_module(module_name, [path]) File "C:\Users\210\anaconda3\envs\mixformer1\lib\imp.py", line 297, in find_module raise ImportError(_ERR_MSG.format(name), name=name) ImportError: No module named '_prroi_pooling'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "tracking/video_demo.py", line 53, in main() File "tracking/video_demo.py", line 49, in main args.save_results, tracker_params=tracker_params) File "tracking/video_demo.py", line 21, in run_video tracker.run_video(videofilepath=videofile, optional_box=optional_box, debug=debug, save_results=save_results) File "tracking..\lib\test\evaluation\tracker.py", line 228, in run_video out = tracker.track(frame) File "tracking..\lib\test\tracker\mixformer_online.py", line 135, in track out_dict, _ = self.network.forward_test(search, run_score_head=True) File "tracking..\lib\models\mixformer\mixformer_online.py", line 850, in forward_test out, outputs_coord_new = self.forward_head(search, template, run_score_head, gt_bboxes) File "tracking..\lib\models\mixformer\mixformer_online.py", line 875, in forward_head out_dict.update({'pred_scores': self.score_branch(search, template, gt_bboxes).view(-1)}) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "tracking..\lib\models\mixformer\mixformer_online.py", line 798, in forward search_box_feat = rearrange(self.search_prroipool(search_feat, target_roi), 'b c h w -> b (h w) c') File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\prroi_pool.py", line 28, in forward return prroi_pool2d(features, rois, self.pooled_height, self.pooled_width, self.spatial_scale) File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 44, in forward _prroi_pooling = _import_prroi_pooling() File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 36, in _import_prroi_pooling raise ImportError('Can not compile Precise RoI Pooling library.') ImportError: Can not compile Precise RoI Pooling library.

    Please help me! Thanks very much

    opened by wyanb 4
  • About Score Prediction Module (SPM) and MixFormer-1k

    Hi, thanks for your work. I have some questions about your paper:

    1. Have you ever tried to use the score prediction module of STARK (MLP) instead of the SPM proposed in your paper? I am curious about the performance difference between SPM and using MLP directly.
    2. The MixFormer-1k model seems to be trained with all datasets, not just GOT10k, which differs from your paper (it would be unreasonable for MixFormer-1k to outperform MixFormer-GOT if MixFormer-1k were also trained only on GOT10k). Is it fair to use it for comparison on the GOT10k test set?
    opened by hhhAlan 4
  • How to train SPM in stage2?

    Thank you for your excellent work. I have some questions about the training process of SPM.

    I encounter a problem when I use the script in train_mixformer.sh to train the SPM module: python tracking/train.py --script mixformer_online --config baseline --save_dir /mysavepath --mode multiple --nproc_per_node 1 --stage1_model <my latest checkpoint trained in the first stage>

    But the logs suggest that the program has loaded the wrong checkpoint, because there are so many missing keys:

    missing keys: ['score_branch.score_token', 'score_branch.score_head.layers.0.weight', 'score_branch.score_head.layers.0.bias', 'score_branch.score_head.layers.1.weight', 'score_branch.score_head.layers.1.bias', 'score_branch.score_head.layers.2.weight', 'score_branch.score_head.layers.2.bias', 'score_branch.proj_q.0.weight', 'score_branch.proj_q.0.bias', 'score_branch.proj_q.1.weight', 'score_branch.proj_q.1.bias', 'score_branch.proj_k.0.weight', 'score_branch.proj_k.0.bias', 'score_branch.proj_k.1.weight', 'score_branch.proj_k.1.bias', 'score_branch.proj_v.0.weight', 'score_branch.proj_v.0.bias', 'score_branch.proj_v.1.weight', 'score_branch.proj_v.1.bias', 'score_branch.proj.0.weight', 'score_branch.proj.0.bias', 'score_branch.proj.1.weight', 'score_branch.proj.1.bias', 'score_branch.norm1.weight', 'score_branch.norm1.bias', 'score_branch.norm2.0.weight', 'score_branch.norm2.0.bias', 'score_branch.norm2.1.weight', 'score_branch.norm2.1.bias'] unexpected keys: [] Loading pretrained mixformer weights done.

    I am really confused about how to train the SPM module correctly.

    I would appreciate it if you could give me some advice.

    The whole log is shown below:

    error logs.txt

    opened by Lich-King000 3
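
    For context, missing keys like these are PyTorch's standard report when a checkpoint is loaded with strict=False into a model that adds new modules (here, the stage-2 score branch). A toy illustration with hypothetical stand-in modules, not the repo's code:

    import torch.nn as nn

    stage1 = nn.Sequential(nn.Linear(4, 4))                    # no score branch yet
    stage2 = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))   # adds a new head

    result = stage2.load_state_dict(stage1.state_dict(), strict=False)
    print(result.missing_keys)     # ['1.weight', '1.bias'], analogous to score_branch.*
    print(result.unexpected_keys)  # []
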
  • About update MixedAttention operation

    Thank you for open-sourcing such excellent work!

    The original version before the update separated Q, K, and V into template, online-template, and search-region parts, and attention for the template and the online template was computed separately, as shown in the following:

    # template attention
    k1 = torch.cat([k_t, k_ot], dim=2)
    v1 = torch.cat([v_t, v_ot], dim=2)
    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_t, k1]) * self.scale
    attn = F.softmax(attn_score, dim=-1)
    attn = self.attn_drop(attn)
    x_t = torch.einsum('bhlt,bhtv->bhlv', [attn, v1])
    x_t = rearrange(x_t, 'b h t d -> b t (h d)')

    # online template attention
    k2 = torch.cat([k_t, k_ot], dim=2)
    v2 = torch.cat([v_t, v_ot], dim=2)
    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_ot, k2]) * self.scale
    attn = F.softmax(attn_score, dim=-1)
    attn = self.attn_drop(attn)
    x_ot = torch.einsum('bhlt,bhtv->bhlv', [attn, v2])
    x_ot = rearrange(x_ot, 'b h t d -> b t (h d)')

    In particular, attn_score is computed from q_t with k1 and from q_ot with k2 (both k1 and k2 are the template keys concatenated with the online-template keys):

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_t, k1]) * self.scale

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_ot, k2]) * self.scale

    The updated version instead merges the template and the online template and performs attention on them together from the start, as shown below:

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_mt, k_mt]) * self.scale

    I would like to ask: are the two ways of computing template and online-template attention, before and after the update, equivalent?

    opened by s9021025292140 3
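
    For reference, a quick numerical check (illustrative; dropout and output reshaping omitted) suggests the two forms are indeed equivalent: each query token's attention row is normalized independently over the same concatenated keys, so stacking q_t and q_ot into q_mt reproduces the separate results:

    import torch
    import torch.nn.functional as F

    b, h, l, d = 2, 4, 8, 16
    q_t, q_ot = torch.randn(b, h, l, d), torch.randn(b, h, l, d)
    k_t, k_ot = torch.randn(b, h, l, d), torch.randn(b, h, l, d)
    v_t, v_ot = torch.randn(b, h, l, d), torch.randn(b, h, l, d)
    scale = d ** -0.5
    k_mt, v_mt = torch.cat([k_t, k_ot], dim=2), torch.cat([v_t, v_ot], dim=2)

    # pre-update: separate attention for template and online template
    x_t = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q_t, k_mt) * scale, dim=-1) @ v_mt
    x_ot = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q_ot, k_mt) * scale, dim=-1) @ v_mt

    # post-update: merged queries attend to the same concatenated keys
    q_mt = torch.cat([q_t, q_ot], dim=2)
    x_mt = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q_mt, k_mt) * scale, dim=-1) @ v_mt

    print(torch.allclose(torch.cat([x_t, x_ot], dim=2), x_mt, atol=1e-6))  # True
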
  • Do you get stuck on the first dataset when you run it? Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384 It's just like this, it's been running all night

    When I tested it, it got stuck at the point below, and it was still the same after running all night. How can I solve this? Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384

    opened by 123wfl 2
  • Where do you do the bbox coors normalization?

    I find that you use RandomHorizontalFlip_Norm here: https://github.com/MCG-NJU/MixFormer/blob/0c2663d3afbce0da138d5b42bc7f28667d077ba3/lib/train/base_functions.py#L83-L85

    But I can't see where the bbox coordinates are normalized after loading from the raw dataset.

    opened by laisimiao 1
  • About multi-template testing

    Hello! I found that in the multi-template test, the test strategy is not the same as in the two-template case: the template and the search region compute attention separately, which differs from the training strategy, where the k and v values in the MAM module are concatenated. Will this make any difference?

    opened by davidyang180 1
  • Data sampling manner

    Hello! As for the data sampling manner, I found that the code uses the causal sampling manner instead of the trident sampling manner, like STARK. Is there any difference in the results?

    opened by davidyang180 1
  • MixFormer trained on GOT10k without pretrained weights seems to collapse?

    Hi, we ran the MixFormer experiments without pretrained CvT weights on GOT10k, using the default configuration. The results show that after 200 epochs, the AO on the GOT10k test set is only 0.096. It is not clear where the problem occurred. Do you have any advice?

    opened by zorrocai 0
  • "trident_pro" sample mode

    Hi, why can template_frame_ids_extra be invisible (line 316) when the sample mode is set to "trident_pro"?

    https://github.com/MCG-NJU/MixFormer/blob/90a6a9c9a9c874f56904796bab1ddf158948d4e3/lib/train/data/sampler.py#L300-L325

    opened by kongbia 1
  • Can I get guideline path?

    I want to test the model, but I got this error:

    RuntimeError: YOU HAVE NOT SETUP YOUR local.py!!!

    If I only want to test the pretrained model, do I still need to set the local.py paths?

    opened by jjuun0 1
  • The code of the Mixed Attention Module

    I'm a little confused about the details of the MAM implementation.

    In def forward_test() of the class Attention() in lib/models/mixformer/mixformer_online.py, there seems to be only one attention calculation, where q, k, and v are obtained as q = rearrange(search, 'b c h w -> b (h w) c').contiguous(), k = torch.cat([self.t_k, self.ot_k, k], dim=1), and v = torch.cat([self.t_v, self.ot_v, v], dim=1). However, Figure 2 of the paper contains one multi-head attention function and two attention operations, which do not directly correspond to the code.

    I'm guessing you used a more convenient formulation in your implementation. Sorry for my limited understanding of the code; please explain the details so that I can understand better.

    opened by EavanLi 1
  • multi-layer feature aggregation strategy and long-term tracking

    Thanks for sharing this excellent work! I have two small questions. First, as mentioned in your paper, the multi-layer feature aggregation strategy is commonly used in other trackers (e.g., SiamRPN++, STARK). The one in SiamRPN++ is understandable, but the one in STARK is confusing, since STARK seems to use only the last stride-16 features for prediction. What is the main difference between MixFormer and STARK in this regard? Second, have you tested MixFormer on the VOT long-term dataset? STARK performs well on long-term tracking, and it feels like MixFormer could work even better.

    opened by zzzmm1 8