[CVPR 2022 Oral] MixFormer: End-to-End Tracking with Iterative Mixed Attention

Overview

MixFormer

The official implementation of the CVPR 2022 paper MixFormer: End-to-End Tracking with Iterative Mixed Attention

[Models and Raw results] (Google Drive) [Models and Raw results] (Baidu Drive, extraction code: hmuv)

(Figure: MixFormer framework)

News

[Mar 21, 2022]

  • MixFormer is accepted by CVPR 2022.
  • We release the code, models, and raw results.

[Mar 29, 2022]

  • Our paper is selected for an oral presentation.

Highlights

New transformer tracking framework

MixFormer is composed of a backbone based on the target-search Mixed Attention Module (MAM) and a simple corner head, yielding a compact tracking pipeline without an explicit integration module.
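
For intuition, here is a minimal single-head sketch of the mixed-attention idea, assuming identity projections and simplified shapes (illustrative only; the repo's MAM uses learned per-stage projections and multiple heads):

import torch
import torch.nn.functional as F

def mixed_attention(template, online_template, search, scale):
    # Illustrative: tokens have shape (batch, num_tokens, dim), single head.
    tokens = torch.cat([template, online_template, search], dim=1)
    q = k = v = tokens  # a real module applies learned q/k/v projections here
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    mixed = attn @ v  # feature extraction and target-search fusion happen jointly
    t_len = template.shape[1] + online_template.shape[1]
    return mixed[:, :t_len], mixed[:, t_len:]  # split into target / search parts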

End-to-end, Positional-embedding-free, multi-feature-aggregation-free

MixFormer is an end-to-end tracking framework without post-processing. Compared with other transformer trackers, MixFormer does not use positional embeddings, attention masks, or multi-layer feature aggregation strategies.

Strong performance

| Tracker | VOT2020 (EAO) | LaSOT (NP) | GOT-10k (AO) | TrackingNet (NP) |
|---|---|---|---|---|
| MixFormer | 0.555 | 79.9 | 70.7 | 88.9 |
| ToMP101* (CVPR 2022) | - | 79.2 | - | 86.4 |
| SBT-large* (CVPR 2022) | 0.529 | - | 70.4 | - |
| SwinTrack* (arXiv 2021) | - | 78.6 | 69.4 | 88.2 |
| Sim-L/14* (arXiv 2022) | - | 79.7 | 69.8 | 87.4 |
| STARK (ICCV 2021) | 0.505 | 77.0 | 68.8 | 86.9 |
| KeepTrack (ICCV 2021) | - | 77.2 | - | - |
| TransT (CVPR 2021) | 0.495 | 73.8 | 67.1 | 86.7 |
| TrDiMP (CVPR 2021) | - | - | 67.1 | 83.3 |
| Siam R-CNN (CVPR 2020) | - | 72.2 | 64.9 | 85.4 |
| TREG (arXiv 2021) | - | 74.1 | 66.8 | 83.8 |

Install the environment

Use Anaconda:

conda create -n mixformer python=3.6
conda activate mixformer
bash install_pytorch17.sh

Data Preparation

Put the tracking datasets in ./data. It should look like:

${MixFormer_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- train2017
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST

Set project paths

Run the following command to set the paths for this project:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir .

After running this command, you can also modify the paths by editing these two files:

lib/train/admin/local.py  # paths about training
lib/test/evaluation/local.py  # paths about testing
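
For reference, the generated files follow the PyTracking convention of a settings object holding dataset and output paths. A hypothetical sketch of lib/test/evaluation/local.py (the actual contents are produced by create_default_local_file.py, and the field names may differ):

# Hypothetical sketch; check the generated file for the real field names.
from lib.test.evaluation.environment import EnvSettings

def local_env_settings():
    settings = EnvSettings()
    settings.results_path = './test/tracking_results'  # raw result files go here
    settings.lasot_path = './data/lasot'
    settings.got10k_path = './data/got10k'
    settings.trackingnet_path = './data/trackingnet'
    return settings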

Train MixFormer

We train with multiple GPUs using DDP. More details of the training settings can be found in tracking/train_mixformer.sh.

# MixFormer
bash tracking/train_mixformer.sh

Test and evaluate MixFormer on benchmarks

  • LaSOT/GOT10k-test/TrackingNet/OTB100/UAV123. More details of the test settings can be found in tracking/test_mixformer.sh:
bash tracking/test_mixformer.sh
  • VOT2020
    Before evaluating "MixFormer+AR" on VOT2020, please install the extra packages following external/AR/README.md. The VOT toolkit is also required to evaluate our tracker; to download and install it, you can follow this tutorial. For convenience, you can use our example VOT toolkit workspaces under external/vot20/ by configuring trackers.ini.
cd external/vot20/<workspace_dir>
vot evaluate --workspace . MixFormerPython
# generating analysis results
vot analysis --workspace . --nocache

Run MixFormer on your own video

bash tracking/run_video_demo.sh

Compute FLOPs/Params and test speed

bash tracking/profile_mixformer.sh

Visualize attention maps

bash tracking/vis_mixformer_attn.sh

(Figure: visualized attention maps)

Model Zoo and raw results

The trained models and the raw tracking results are provided in [Models and Raw results] (Google Drive) or [Models and Raw results] (Baidu Drive, extraction code: hmuv).

Contact

Yutao Cui: [email protected]

Cheng Jiang: [email protected]

Acknowledgments

  • Thanks to the PyTracking and STARK libraries, which helped us quickly implement our ideas.
  • We use the implementation of CvT from its official repo.

Issues
  • Is this a typo?

    In line 751, shouldn't it be named online_template rather than template, or am I misunderstanding? https://github.com/MCG-NJU/MixFormer/blob/0c2663d3afbce0da138d5b42bc7f28667d077ba3/lib/models/mixformer/mixformer.py#L746-L756

    opened by laisimiao 6
  • repeat tracker initialize?

    First of all, thanks for your clean and high-quality code. But in https://github.com/MCG-NJU/MixFormer/blob/219bd14704ec217919c3b1eb310940769546c2d6/external/AR/pytracking/VOT2020_super_only_mask_384_HP/mixformer_alpha_seg_class.py#L32-L43 I find tracker.initialize called twice. I think initialize is just a setup step (not an online update step), so why do we need to call it twice?

    opened by laisimiao 4
  • An error was encountered while testing

    Thank you for your outstanding work. While reproducing your code, I encountered an error:

    {'model': 'mixformer_online_22k.pth.tar', 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 1} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384 Online size is: 3 Update interval is: 6 max score decay = 1.0 Error while processing rearrange-reduction pattern "b (h w) c -> b c h w". Input tensor shape: torch.Size([1, 1, 2048, 64]). Additional info: {'h': 32, 'w': 32}. Expected 3 dimensions, got 4 Done

    How to solve this problem?

    opened by DLRook1e 4
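
    For reference, the einops failure above can be reproduced in isolation: the pattern 'b (h w) c -> b c h w' expects a 3-D tensor, so a 4-D input with an extra singleton dimension raises exactly this "Expected 3 dimensions, got 4" error. A minimal sketch with consistent sizes (not the project's actual fix):

    import torch
    from einops import rearrange

    x = torch.randn(1, 1, 1024, 64)  # 4-D: an extra singleton dimension up front
    # rearrange(x, 'b (h w) c -> b c h w', h=32, w=32)  # fails: expected 3 dims, got 4
    y = rearrange(x.squeeze(1), 'b (h w) c -> b c h w', h=32, w=32)  # 3-D input works
    print(y.shape)  # torch.Size([1, 64, 32, 32])
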
  • Can not compile Precise RoI Pooling library

    {'model': 'mixformer_online_22k.pth.tar', 'update_interval': 25, 'online_sizes': 3, 'search_area_scale': 4.5, 'max_score_decay': 1.0, 'vis_attn': 0} test config: {'MODEL': {'HEAD_TYPE': 'CORNER', 'HIDDEN_DIM': 384, 'NUM_OBJECT_QUERIES': 1, 'POSITION_EMBEDDING': 'sine', 'PREDICT_MASK': False, 'BACKBONE': {'PRETRAINED': True, 'PRETRAINED_PATH': '', 'INIT': 'trunc_norm', 'NUM_STAGES': 3, 'PATCH_SIZE': [7, 3, 3], 'PATCH_STRIDE': [4, 2, 2], 'PATCH_PADDING': [2, 1, 1], 'DIM_EMBED': [64, 192, 384], 'NUM_HEADS': [1, 3, 6], 'DEPTH': [1, 4, 16], 'MLP_RATIO': [4.0, 4.0, 4.0], 'ATTN_DROP_RATE': [0.0, 0.0, 0.0], 'DROP_RATE': [0.0, 0.0, 0.0], 'DROP_PATH_RATE': [0.0, 0.0, 0.1], 'QKV_BIAS': [True, True, True], 'CLS_TOKEN': [False, False, False], 'POS_EMBED': [False, False, False], 'QKV_PROJ_METHOD': ['dw_bn', 'dw_bn', 'dw_bn'], 'KERNEL_QKV': [3, 3, 3], 'PADDING_KV': [1, 1, 1], 'STRIDE_KV': [2, 2, 2], 'PADDING_Q': [1, 1, 1], 'STRIDE_Q': [1, 1, 1], 'FREEZE_BN': True}, 'PRETRAINED_STAGE1': True, 'NLAYER_HEAD': 3, 'HEAD_FREEZE_BN': True}, 'TRAIN': {'TRAIN_SCORE': True, 'SCORE_WEIGHT': 1.0, 'LR': 0.0001, 'WEIGHT_DECAY': 0.0001, 'EPOCH': 30, 'LR_DROP_EPOCH': 20, 'BATCH_SIZE': 32, 'NUM_WORKER': 8, 'OPTIMIZER': 'ADAMW', 'BACKBONE_MULTIPLIER': 0.1, 'GIOU_WEIGHT': 2.0, 'L1_WEIGHT': 5.0, 'DEEP_SUPERVISION': False, 'FREEZE_STAGE0': False, 'PRINT_INTERVAL': 50, 'VAL_EPOCH_INTERVAL': 5, 'GRAD_CLIP_NORM': 0.1, 'SCHEDULER': {'TYPE': 'step', 'DECAY_RATE': 0.1}}, 'DATA': {'SAMPLER_MODE': 'trident_pro', 'MEAN': [0.485, 0.456, 0.406], 'STD': [0.229, 0.224, 0.225], 'MAX_SAMPLE_INTERVAL': [200], 'TRAIN': {'DATASETS_NAME': ['GOT10K_vottrain', 'LASOT', 'COCO17', 'TRACKINGNET'], 'DATASETS_RATIO': [1, 1, 1, 1], 'SAMPLE_PER_EPOCH': 60000}, 'VAL': {'DATASETS_NAME': ['GOT10K_votval'], 'DATASETS_RATIO': [1], 'SAMPLE_PER_EPOCH': 10000}, 'SEARCH': {'SIZE': 320, 'FACTOR': 5.0, 'CENTER_JITTER': 4.5, 'SCALE_JITTER': 0.5}, 'TEMPLATE': {'SIZE': 128, 'FACTOR': 2.0, 'NUMBER': 2, 'CENTER_JITTER': 0, 'SCALE_JITTER': 0}}, 'TEST': {'TEMPLATE_FACTOR': 2.0, 'TEMPLATE_SIZE': 128, 'SEARCH_FACTOR': 5.0, 'SEARCH_SIZE': 320, 'EPOCH': 40, 'UPDATE_INTERVALS': {'LASOT': [200], 'GOT10K_TEST': [10], 'TRACKINGNET': [25], 'VOT20': [10], 'VOT20LT': [200], 'OTB': [6], 'UAV': [200]}, 'ONLINE_SIZES': {'LASOT': [2], 'GOT10K_TEST': [2], 'TRACKINGNET': [1], 'VOT20': [5], 'VOT20LT': [3], 'OTB': [3], 'UAV': [1]}}} search_area_scale: 4.5 Warning: Pretrained CVT weights are not loaded head channel: 384 Online size is: 3 Update interval is: 25 max score decay = 1.0 Using C:\Users\210\AppData\Local\torch_extensions\torch_extensions\Cache as PyTorch extensions root... C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py:274: UserWarning: Error checking compiler version for cl: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte warnings.warn('Error checking compiler version for {}: {}'.format(compiler, error)) Detected CUDA files, patching ldflags Emitting ninja build file C:\Users\210\AppData\Local\torch_extensions\torch_extensions\Cache_prroi_pooling\build.ninja... Building extension module _prroi_pooling... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 1.10.2 Loading extension module _prroi_pooling... 
Traceback (most recent call last): File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 33, in _import_prroi_pooling verbose=True File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 980, in load keep_intermediates=keep_intermediates) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 1196, in _jit_compile return _import_module_from_library(name, build_directory, is_python_module) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\utils\cpp_extension.py", line 1543, in _import_module_from_library file, path, description = imp.find_module(module_name, [path]) File "C:\Users\210\anaconda3\envs\mixformer1\lib\imp.py", line 297, in find_module raise ImportError(_ERR_MSG.format(name), name=name) ImportError: No module named '_prroi_pooling'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "tracking/video_demo.py", line 53, in main() File "tracking/video_demo.py", line 49, in main args.save_results, tracker_params=tracker_params) File "tracking/video_demo.py", line 21, in run_video tracker.run_video(videofilepath=videofile, optional_box=optional_box, debug=debug, save_results=save_results) File "tracking..\lib\test\evaluation\tracker.py", line 228, in run_video out = tracker.track(frame) File "tracking..\lib\test\tracker\mixformer_online.py", line 135, in track out_dict, _ = self.network.forward_test(search, run_score_head=True) File "tracking..\lib\models\mixformer\mixformer_online.py", line 850, in forward_test out, outputs_coord_new = self.forward_head(search, template, run_score_head, gt_bboxes) File "tracking..\lib\models\mixformer\mixformer_online.py", line 875, in forward_head out_dict.update({'pred_scores': self.score_branch(search, template, gt_bboxes).view(-1)}) File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "tracking..\lib\models\mixformer\mixformer_online.py", line 798, in forward search_box_feat = rearrange(self.search_prroipool(search_feat, target_roi), 'b c h w -> b (h w) c') File "C:\Users\210\anaconda3\envs\mixformer1\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\prroi_pool.py", line 28, in forward return prroi_pool2d(features, rois, self.pooled_height, self.pooled_width, self.spatial_scale) File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 44, in forward _prroi_pooling = _import_prroi_pooling() File "tracking..\external\PreciseRoIPooling\pytorch\prroi_pool\functional.py", line 36, in _import_prroi_pooling raise ImportError('Can not compile Precise RoI Pooling library.') ImportError: Can not compile Precise RoI Pooling library.

    Please help me! Thanks very much

    opened by wyanb 4
  • About Score Prediction Module (SPM) and MixFormer-1k

    Hi, thanks for your work. I have some questions about your paper:

    1. Have you ever tried to use the score prediction module of STARK (MLP) instead of the SPM proposed in your paper? I am curious about the performance difference between SPM and using MLP directly.
    2. The MixFormer-1k model seems to be trained with all datasets, not just GOT10k, which differs from your paper (it would be unreasonable for MixFormer-1k to outperform MixFormer-GOT if MixFormer-1k were also trained only on GOT10k). Is it fair to use it for comparison on the GOT10k test set?
    opened by hhhAlan 4
  • How to train SPM in stage2?

    Thank you for your excellent work. I have some questions about the training process of SPM.

    I encounter a problem when I use the script in train_mixformer.sh to train the SPM module: python tracking/train.py --script mixformer_online --config baseline --save_dir /mysavepath --mode multiple --nproc_per_node 1 --stage1_model <my latest checkpoint trained in the first stage>

    But the logs suggest that the program has loaded the wrong checkpoint, because there are so many missing keys:

    missing keys: ['score_branch.score_token', 'score_branch.score_head.layers.0.weight', 'score_branch.score_head.layers.0.bias', 'score_branch.score_head.layers.1.weight', 'score_branch.score_head.layers.1.bias', 'score_branch.score_head.layers.2.weight', 'score_branch.score_head.layers.2.bias', 'score_branch.proj_q.0.weight', 'score_branch.proj_q.0.bias', 'score_branch.proj_q.1.weight', 'score_branch.proj_q.1.bias', 'score_branch.proj_k.0.weight', 'score_branch.proj_k.0.bias', 'score_branch.proj_k.1.weight', 'score_branch.proj_k.1.bias', 'score_branch.proj_v.0.weight', 'score_branch.proj_v.0.bias', 'score_branch.proj_v.1.weight', 'score_branch.proj_v.1.bias', 'score_branch.proj.0.weight', 'score_branch.proj.0.bias', 'score_branch.proj.1.weight', 'score_branch.proj.1.bias', 'score_branch.norm1.weight', 'score_branch.norm1.bias', 'score_branch.norm2.0.weight', 'score_branch.norm2.0.bias', 'score_branch.norm2.1.weight', 'score_branch.norm2.1.bias'] unexpected keys: [] Loading pretrained mixformer weights done.

    I am really confused about how to train the SPM module correctly.

    I would appreciate it if you could give me some advice.

    The whole log is shown below:

    error logs.txt

    opened by Lich-King000 3
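
    For context, missing keys like these are PyTorch's standard report when a checkpoint is loaded with strict=False into a model that adds new modules (here, the stage-2 score branch). A toy illustration with hypothetical stand-in modules, not the repo's code:

    import torch.nn as nn

    stage1 = nn.Sequential(nn.Linear(4, 4))                    # no score branch yet
    stage2 = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))   # adds a new head

    result = stage2.load_state_dict(stage1.state_dict(), strict=False)
    print(result.missing_keys)     # ['1.weight', '1.bias'], analogous to score_branch.*
    print(result.unexpected_keys)  # []
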
  • About update MixedAttention operation

    Thank you for open-sourcing such excellent work!

    The original version before the update separated Q, K, and V into template, online-template, and search-region parts, and attention for the template and the online template was computed separately, as shown in the following:

    # template attention
    k1 = torch.cat([k_t, k_ot], dim=2)
    v1 = torch.cat([v_t, v_ot], dim=2)
    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_t, k1]) * self.scale
    attn = F.softmax(attn_score, dim=-1)
    attn = self.attn_drop(attn)
    x_t = torch.einsum('bhlt,bhtv->bhlv', [attn, v1])
    x_t = rearrange(x_t, 'b h t d -> b t (h d)')

    # online template attention
    k2 = torch.cat([k_t, k_ot], dim=2)
    v2 = torch.cat([v_t, v_ot], dim=2)
    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_ot, k2]) * self.scale
    attn = F.softmax(attn_score, dim=-1)
    attn = self.attn_drop(attn)
    x_ot = torch.einsum('bhlt,bhtv->bhlv', [attn, v2])
    x_ot = rearrange(x_ot, 'b h t d -> b t (h d)')

    In particular, attn_score is computed from q_t with k1 and from q_ot with k2 (both k1 and k2 are the template keys concatenated with the online-template keys):

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_t, k1]) * self.scale

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_ot, k2]) * self.scale

    The updated version instead merges the template and the online template and performs attention on them together from the start, as shown below:

    attn_score = torch.einsum('bhlk,bhtk->bhlt', [q_mt, k_mt]) * self.scale

    I would like to ask: are the two ways of computing template and online-template attention, before and after the update, equivalent?

    opened by s9021025292140 3
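
    For reference, a quick numerical check (illustrative; dropout and output reshaping omitted) suggests the two forms are indeed equivalent: each query token's attention row is normalized independently over the same concatenated keys, so stacking q_t and q_ot into q_mt reproduces the separate results:

    import torch
    import torch.nn.functional as F

    b, h, l, d = 2, 4, 8, 16
    q_t, q_ot = torch.randn(b, h, l, d), torch.randn(b, h, l, d)
    k_t, k_ot = torch.randn(b, h, l, d), torch.randn(b, h, l, d)
    v_t, v_ot = torch.randn(b, h, l, d), torch.randn(b, h, l, d)
    scale = d ** -0.5
    k_mt, v_mt = torch.cat([k_t, k_ot], dim=2), torch.cat([v_t, v_ot], dim=2)

    # pre-update: separate attention for template and online template
    x_t = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q_t, k_mt) * scale, dim=-1) @ v_mt
    x_ot = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q_ot, k_mt) * scale, dim=-1) @ v_mt

    # post-update: merged queries attend to the same concatenated keys
    q_mt = torch.cat([q_t, q_ot], dim=2)
    x_mt = F.softmax(torch.einsum('bhlk,bhtk->bhlt', q_mt, k_mt) * scale, dim=-1) @ v_mt

    print(torch.allclose(torch.cat([x_t, x_ot], dim=2), x_mt, atol=1e-6))  # True
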
  • Do you get stuck on the first dataset when you run it? Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384 It's just like this, it's been running all night

    When I tested it, it got stuck at the point below, and it was still the same after running all night. How can I solve this? Evaluating 1 trackers on 1 sequences Tracker: mixformer_online baseline None , Sequence: Basketball Warning: Pretrained CVT weights are not loaded head channel: 384

    opened by 123wfl 2
  • Where do you do the bbox coors normalization?

    I find that you use RandomHorizontalFlip_Norm here: https://github.com/MCG-NJU/MixFormer/blob/0c2663d3afbce0da138d5b42bc7f28667d077ba3/lib/train/base_functions.py#L83-L85

    But I can't see where the bbox coordinates are normalized after loading from the raw dataset.

    opened by laisimiao 1
  • About multi-template testing

    Hello! I found that in the multi-template test, the test strategy is not the same as in the two-template case: the template and the search region compute attention separately, which differs from the training strategy, where the k and v values in the MAM module are concatenated. Will this make any difference?

    opened by davidyang180 1
  • Data sampling manner

    Hello! As for the data sampling manner, I found that the code uses the causal sampling manner instead of the trident sampling manner, like STARK. Is there any difference in the results?

    opened by davidyang180 1
  • MixFormer trained on GOT10k without pretrained weights seems to collapse?

    Hi, we ran the MixFormer experiments without pretrained CvT weights on GOT10k, using the default configuration. The results show that after 200 epochs, the AO on the GOT10k test set is only 0.096. It is not clear where the problem occurred. Do you have any advice?

    opened by zorrocai 0
  • "trident_pro" sample mode

    Hi, why can template_frame_ids_extra be invisible (line 316) when the sample mode is set to "trident_pro"?

    https://github.com/MCG-NJU/MixFormer/blob/90a6a9c9a9c874f56904796bab1ddf158948d4e3/lib/train/data/sampler.py#L300-L325

    opened by kongbia 1
  • Can I get guideline path?

    I want to test the model, but I got this error:

    RuntimeError: YOU HAVE NOT SETUP YOUR local.py!!!

    If I only want to test the pretrained model, do I still need to set the local.py paths?

    opened by jjuun0 1
  • The code of the Mixed Attention Module

    I'm a little confused about the details of the MAM implementation.

    In def forward_test() of the class Attention() in lib/models/mixformer/mixformer_online.py, there seems to be only one attention calculation, where q, k, and v are obtained as q = rearrange(search, 'b c h w -> b (h w) c').contiguous(), k = torch.cat([self.t_k, self.ot_k, k], dim=1), and v = torch.cat([self.t_v, self.ot_v, v], dim=1). However, Figure 2 of the paper contains one multi-head attention function and two attention operations, which do not directly correspond to the code.

    I'm guessing you used a more convenient formulation in your implementation. Sorry for my limited understanding of the code; please explain the details so that I can understand better.

    opened by EavanLi 1
  • multi-layer feature aggregation strategy and long-term tracking

    Thanks for sharing this excellent work! I have two small questions. First, as mentioned in your paper, the multi-layer feature aggregation strategy is commonly used in other trackers (e.g., SiamRPN++, STARK). The one in SiamRPN++ is understandable, but the one in STARK is confusing, since STARK seems to use only the last stride-16 features for prediction. What is the main difference between MixFormer and STARK in this regard? Second, have you tested MixFormer on the VOT long-term dataset? STARK performs well on long-term tracking, and it feels like MixFormer could work even better.

    opened by zzzmm1 8