AOT (Associating Objects with Transformers) in PyTorch

Overview

A modular reference PyTorch implementation of Associating Objects with Transformers for Video Object Segmentation (NeurIPS 2021). [paper]


Highlights

  • High performance: up to 85.5% (R50-AOTL) on YouTube-VOS 2018 and 82.1% (SwinB-AOTL) on DAVIS-2017 Test-dev under standard settings.
  • High efficiency: up to 51 FPS (AOTT) on DAVIS-2017 (480p), even with 10 objects, and 41 FPS on YouTube-VOS (1.3x480p). AOT can process multiple objects (no more than a pre-defined number, 10 by default) as efficiently as a single object. This project also supports inferring any number of objects in a video by automatically separating them into groups and aggregating the results.
  • Multi-GPU training and inference
  • Mixed precision training and inference
  • Test-time augmentation: multi-scale and flipping augmentations are supported (a generic aggregation sketch follows this list).
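
A common way to aggregate multi-scale and flipping test-time augmentation for segmentation is to run the model at several scales and with horizontal flips, resize the logits back to the input resolution, and average them. The sketch below illustrates that generic idea and is not this repository's implementation; `model` stands for any callable mapping an image batch to per-pixel logits.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def tta_logits(model, image, scales=(0.75, 1.0, 1.25), flip=True):
        # image: (B, 3, H, W); returns logits averaged over scales and flips.
        h, w = image.shape[-2:]
        total, count = 0, 0
        for s in scales:
            scaled = F.interpolate(image, size=(int(h * s), int(w * s)),
                                   mode='bilinear', align_corners=False)
            variants = [scaled, torch.flip(scaled, dims=[-1])] if flip else [scaled]
            for i, x in enumerate(variants):
                logits = model(x)
                if i == 1:  # undo the horizontal flip on the prediction
                    logits = torch.flip(logits, dims=[-1])
                total = total + F.interpolate(logits, size=(h, w),
                                              mode='bilinear', align_corners=False)
                count += 1
        return total / count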

TODO

  • Code documentation
  • Demo tool
  • Adding your own dataset

Requirements

  • Python3
  • pytorch >= 1.7.0 and torchvision
  • opencv-python
  • Pillow

Optional (for better efficiency):

  • Pytorch Correlation (we recommend installing it from source instead of via pip)
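
As a quick, optional sanity check (not part of the repository), the snippet below verifies that the required packages are importable, that the PyTorch version meets the minimum, and whether the optional correlation extension is available.

    # Hedged environment check for the requirements listed above.
    import torch
    import torchvision
    import cv2
    import PIL

    major, minor = (int(v) for v in torch.__version__.split('+')[0].split('.')[:2])
    assert (major, minor) >= (1, 7), 'AOT expects pytorch >= 1.7.0'
    print('torch', torch.__version__, '| torchvision', torchvision.__version__,
          '| opencv', cv2.__version__, '| Pillow', PIL.__version__)

    try:
        import spatial_correlation_sampler  # optional Pytorch Correlation extension
        print('Pytorch Correlation is available')
    except ImportError:
        print('Pytorch Correlation is not installed (optional, improves efficiency)')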

Demo

Coming

Model Zoo and Results

Pre-trained models and corresponding results reproduced by this project can be found in MODEL_ZOO.md.

Getting Started

  1. Prepare datasets:

    Please follow the instructions below to prepare each dataset in its corresponding folder.

    • Static

      datasets/Static: pre-training dataset with static images. Guidance on preparing it can be found in AFB-URR.

    • YouTube-VOS

      A commonly-used large-scale VOS dataset.

      datasets/YTB/2019: version 2019, download link. train is required for training. valid (6fps) and valid_all_frames (30fps, optional) are used for evaluation.

      datasets/YTB/2018: version 2018, download link. Only valid (6fps) and valid_all_frames (30fps, optional) are required for this project and used for evaluation.

    • DAVIS

      A commonly-used small-scale VOS dataset.

      datasets/DAVIS: TrainVal (480p) contains both the training and validation splits. Test-Dev (480p) contains the Test-dev split. The full-resolution version is also supported for training and evaluation but is not required.

  2. Prepare ImageNet pre-trained encoders

    Select and download the checkpoints below into pretrain_models:

    The current default training configs are not optimized for encoders larger than ResNet-50. If you want to use a larger encoder, we recommend early-stopping the main-training stage at 80,000 iterations (100,000 by default) to avoid over-fitting on the seen classes of YouTube-VOS.
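
    For example, the MobileNet-V2 checkpoint referenced by the default configs (mobilenet_v2-b0353104.pth) can be fetched from torchvision's model zoo as sketched below. This is a hedged convenience snippet, not part of the repository; larger encoders (e.g. ResNet-50, Swin-B) have their own checkpoints, which should be placed in the same folder.

        # Sketch: download ImageNet-pretrained MobileNet-V2 weights into pretrain_models/.
        import os
        import torch

        os.makedirs('pretrain_models', exist_ok=True)
        url = 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth'
        state_dict = torch.hub.load_state_dict_from_url(url, model_dir='pretrain_models')
        print('Saved', len(state_dict), 'tensors to pretrain_models/')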

  3. Training and Evaluation

    The example script trains AOTT in two stages using 4 GPUs and automatic mixed precision (--amp). The first stage is a pre-training stage on the Static dataset, and the second stage is the main-training stage, which uses both the YouTube-VOS 2019 train split and the DAVIS-2017 train split, resulting in a model that generalizes to different domains (YouTube-VOS and DAVIS) and different frame rates (6fps, 24fps, and 30fps).

    Notably, you can use only the YouTube-VOS 2019 train split in the second stage by changing pre_ytb_dav to pre_ytb, which leads to better YouTube-VOS performance on unseen classes. Besides, if you want to skip the first stage, you can start training from the ytb stage, but performance will drop by about 1-2% absolute.

    After training finishes, the example script evaluates the model on YouTube-VOS and DAVIS, and the results are packed into Zip files. For calculating scores, please use the official YouTube-VOS servers (2018 server and 2019 server) and the official DAVIS toolkit. A sketch of the two-stage schedule is shown below.
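
    The following is a minimal sketch of that two-stage schedule driven from Python, for illustration only. The flag names --exp_name, --stage, --model, and --gpu_num are assumptions about tools/train.py; check `python tools/train.py --help`, or simply run the provided example script (e.g. train_eval.sh), which wires the same stages together.

        # Sketch: run pre-training and main-training with mixed precision (--amp).
        import subprocess

        def train(stage):
            subprocess.run(['python', 'tools/train.py', '--amp',
                            '--exp_name', 'aott_example', '--model', 'aott',
                            '--stage', stage, '--gpu_num', '4'], check=True)

        train('pre')          # stage 1: pre-training on the Static dataset
        train('pre_ytb_dav')  # stage 2: main training on YouTube-VOS 2019 + DAVIS-2017
        # Evaluation and result zipping follow the same pattern; see the example
        # script for the exact flags this project uses.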

Adding your own dataset

Coming

Troubleshooting

Waiting

Citations

Please consider citing the related paper(s) in your publications if it helps your research.

@inproceedings{yang2021aot,
  title={Associating Objects with Transformers for Video Object Segmentation},
  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

License

This project is released under the BSD-3-Clause license. See LICENSE for additional details.

Comments
  • Are you training with COCO?

    Hello! When I use COCO as one of the Static datasets to train the pre stage, I find that it reduces the accuracy of the pre-trained model tested on DAVIS-2017. The pre-trained model trained for 20,000 iterations on MSRA10K, PASCAL-S, PASCAL-VOC, and ECSSD achieves nearly 59% IoU on DAVIS17-val, but after adding COCO, even after 100,000 iterations, the IoU is only 48%. What do you think might have caused this? Are COCO's annotations themselves not particularly accurate? And are you training with COCO? Thank you very much!

    opened by MUVGuan 9
  • About the implementation of multi-head attention in DeAOT

    Hello, I have a question after reading your great work DeAOT. In the ablation study on head number, you compare multi-head and single-head attention in DeAOT. The common implementation of multi-head attention is to reshape the Query (of shape HW×batch_size×C, taking the Query as an example) so that its channel dimension C is split into num_head groups of C/num_head, giving HW×batch_size×num_head×(C/num_head). This implementation keeps the computational complexity the same as the single-head case. However, the ablation on head number shows that multi-head attention significantly reduces speed. So what kind of multi-head attention implementation is used in DeAOT? Is it the one described above?
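
    The snippet below is a minimal sketch (not from the thread) of the standard channel-splitting multi-head attention the question describes; splitting C into num_head groups of C/num_head keeps the attention FLOPs the same as the single-head case.

        import torch

        HW, B, C, num_head = 900, 2, 256, 8
        query = torch.randn(HW, B, C)
        key = torch.randn(HW, B, C)

        # (HW, B, C) -> (HW, B, num_head, C // num_head)
        q = query.view(HW, B, num_head, C // num_head)
        k = key.view(HW, B, num_head, C // num_head)

        # Per-head attention logits: (B, num_head, HW, HW)
        attn = torch.einsum('qbhc,kbhc->bhqk', q, k) / (C // num_head) ** 0.5
        print(attn.shape)  # torch.Size([2, 8, 900, 900])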

    opened by MUVGuan 8
  • questions about inference fps

    Thanks for making the code available! I ran into some questions while testing the pretrained model. I only get about 29 FPS when testing the PRE_YTB_DAV pretrained AOTS model on DAVIS-2017, which should reach 40 FPS according to the paper. However, the test J&F-mean matches the result posted in the model zoo (0.820575).

    I did not modify the default test config in aots.py except for directories such as the dataset path. Do I need to modify something in train_eval.sh?

    My device: 2x Tesla V100 SXM2 32GB, Driver Version: 450.51.06, CUDA Version: 11.0, pytorch==1.7.0, torchvision==0.8.1, spatial-correlation-sampler==0.3.0

    Exp alldataset_AOTS:
    {
        "DATASETS": [
            "youtubevos",
            "davis2017"
        ],
        "DATA_DAVIS_REPEAT": 5,
        "DATA_DYNAMIC_MERGE_PROB": 0.3,
        "DATA_MAX_CROP_STEPS": 10,
        "DATA_MAX_SCALE_FACTOR": 1.3,
        "DATA_MIN_SCALE_FACTOR": 0.7,
        "DATA_RANDOMCROP": [
            465,
            465
        ],
        "DATA_RANDOMFLIP": 0.5,
        "DATA_RANDOM_GAP_DAVIS": 12,
        "DATA_RANDOM_GAP_YTB": 3,
        "DATA_RANDOM_REVERSE_SEQ": true,
        "DATA_SEQ_LEN": 5,
        "DATA_SHORT_EDGE_LEN": 480,
        "DATA_WORKERS": 8,
        "DIR_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ckpt",
        "DIR_DATA": "./datasets",
        "DIR_DAVIS": "/workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval",
        "DIR_EMA_CKPT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/ema_ckpt",
        "DIR_EVALUATION": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/eval",
        "DIR_IMG_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/img",
        "DIR_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log",
        "DIR_RESULT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV",
        "DIR_ROOT": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022",
        "DIR_STATIC": "/yexin/vos_related_source/vos_exper_dataset/unify_pretrain_dataset",
        "DIR_TB_LOG": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE_YTB_DAV/log/tensorboard",
        "DIR_YTB": "/yexin/vos_related_source/vos_exper_dataset/dataset/Youtube",
        "DIST_BACKEND": "nccl",
        "DIST_ENABLE": true,
        "DIST_START_GPU": 0,
        "DIST_URL": "tcp://127.0.0.1:13241",
        "EXP_NAME": "alldataset_AOTS",
        "MODEL_ALIGN_CORNERS": true,
        "MODEL_ATT_HEADS": 8,
        "MODEL_DECODER_INTERMEDIATE_LSTT": true,
        "MODEL_ENCODER": "mobilenetv2",
        "MODEL_ENCODER_DIM": [
            24,
            32,
            96,
            1280
        ],
        "MODEL_ENCODER_EMBEDDING_DIM": 256,
        "MODEL_ENCODER_PRETRAIN": "./pretrain_models/mobilenet_v2-b0353104.pth",
        "MODEL_ENGINE": "aotengine",
        "MODEL_EPSILON": 1e-05,
        "MODEL_FREEZE_BACKBONE": false,
        "MODEL_FREEZE_BN": true,
        "MODEL_LSTT_NUM": 2,
        "MODEL_MAX_OBJ_NUM": 10,
        "MODEL_NAME": "AOTS",
        "MODEL_SELF_HEADS": 8,
        "MODEL_USE_PREV_PROB": false,
        "MODEL_VOS": "aot",
        "PRETRAIN": true,
        "PRETRAIN_FULL": true,
        "PRETRAIN_MODEL": "/yexin/vos_related_source/experiments2022/10-17AM-on-January-27-2022/alldataset_AOTS/PRE/ema_ckpt/save_step_100000.pth",
        "STAGE_NAME": "PRE_YTB_DAV",
        "TEST_CKPT_PATH": "./AOTS_PRE_YTB_DAV.pth",
        "TEST_CKPT_STEP": null,
        "TEST_DATASET": "davis2017",
        "TEST_DATASET_FULL_RESOLUTION": false,
        "TEST_DATASET_SPLIT": "val",
        "TEST_EMA": true,
        "TEST_FLIP": false,
        "TEST_FRAME_LOG": false,
        "TEST_GPU_ID": 0,
        "TEST_GPU_NUM": 2,
        "TEST_LONG_TERM_MEM_GAP": 9999,
        "TEST_MAX_SIZE": 1040.0,
        "TEST_MIN_SIZE": null,
        "TEST_MULTISCALE": [
            1.0
        ],
        "TEST_WORKERS": 4,
        "TRAIN_AUTO_RESUME": true,
        "TRAIN_AUX_LOSS_RATIO": 1.0,
        "TRAIN_AUX_LOSS_WEIGHT": 1.0,
        "TRAIN_BATCH_SIZE": 16,
        "TRAIN_CLIP_GRAD_NORM": 5.0,
        "TRAIN_DATASET_FULL_RESOLUTION": false,
        "TRAIN_EMA_RATIO": 0.1,
        "TRAIN_ENABLE_PREV_FRAME": false,
        "TRAIN_ENCODER_FREEZE_AT": 2,
        "TRAIN_GPUS": 4,
        "TRAIN_HARD_MINING_RATIO": 0.5,
        "TRAIN_IMG_LOG": true,
        "TRAIN_LOG_STEP": 50,
        "TRAIN_LONG_TERM_MEM_GAP": 9999,
        "TRAIN_LR": 0.0002,
        "TRAIN_LR_COSINE_DECAY": false,
        "TRAIN_LR_ENCODER_RATIO": 0.1,
        "TRAIN_LR_MIN": 2e-05,
        "TRAIN_LR_POWER": 0.9,
        "TRAIN_LR_RESTART": 1,
        "TRAIN_LR_UPDATE_STEP": 1,
        "TRAIN_LR_WARM_UP_RATIO": 0.05,
        "TRAIN_LSTT_DROPPATH": 0.1,
        "TRAIN_LSTT_DROPPATH_LST": false,
        "TRAIN_LSTT_DROPPATH_SCALING": false,
        "TRAIN_LSTT_EMB_DROPOUT": 0.0,
        "TRAIN_LSTT_ID_DROPOUT": 0.0,
        "TRAIN_LSTT_LT_DROPOUT": 0.0,
        "TRAIN_LSTT_ST_DROPOUT": 0.0,
        "TRAIN_MAX_KEEP_CKPT": 8,
        "TRAIN_OPT": "adamw",
        "TRAIN_RESUME": false,
        "TRAIN_RESUME_CKPT": null,
        "TRAIN_RESUME_STEP": 0,
        "TRAIN_SAVE_STEP": 1000,
        "TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [
            "patch_wise_id_bank"
        ],
        "TRAIN_SEQ_TRAINING_START_RATIO": 0.5,
        "TRAIN_SGD_MOMENTUM": 0.9,
        "TRAIN_START_STEP": 0,
        "TRAIN_TBLOG": true,
        "TRAIN_TBLOG_STEP": 50,
        "TRAIN_TOP_K_PERCENT_PIXELS": 0.15,
        "TRAIN_TOTAL_STEPS": 100000,
        "TRAIN_WEIGHT_DECAY": 0.07,
        "TRAIN_WEIGHT_DECAY_EXCLUSIVE": {},
        "TRAIN_WEIGHT_DECAY_EXEMPTION": [
            "absolute_pos_embed",
            "relative_position_bias_table",
            "relative_emb_v",
            "conv_out"
        ]
    }
    Use GPU 0 for evaluating.
    Use GPU 1 for evaluating.
    Build VOS model.
    Load checkpoint from ./AOTS_PRE_YTB_DAV.pth
    Process dataset...
    /workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
    Eval alldataset_AOTS on davis2017 val:
    Done!
    /workdir/xiecunhuang/VOS/dataset/DAVIS/2017/trainval/JPEGImages/480p
    GPU 0 - Processing Seq bike-packing [1/30]:
    GPU 1 - Processing Seq blackswan [2/30]:
    GPU 1 - Seq blackswan - FPS: 29.29. All-Frame FPS: 29.29, All-Seq FPS: 29.29, Max Mem: 0.53G
    GPU 1 - Processing Seq breakdance [4/30]:
    GPU 0 - Seq bike-packing - FPS: 28.99. All-Frame FPS: 28.99, All-Seq FPS: 28.99, Max Mem: 0.58G
    GPU 0 - Processing Seq bmx-trees [3/30]:
    GPU 1 - Seq breakdance - FPS: 29.57. All-Frame FPS: 29.46, All-Seq FPS: 29.43, Max Mem: 0.53G
    GPU 1 - Processing Seq camel [5/30]:
    GPU 0 - Seq bmx-trees - FPS: 29.16. All-Frame FPS: 29.08, All-Seq FPS: 29.08, Max Mem: 0.58G
    GPU 0 - Processing Seq car-roundabout [6/30]:
    GPU 1 - Seq camel - FPS: 30.71. All-Frame FPS: 29.95, All-Seq FPS: 29.84, Max Mem: 0.53G
    GPU 1 - Processing Seq car-shadow [7/30]:
    GPU 0 - Seq car-roundabout - FPS: 29.29. All-Frame FPS: 29.15, All-Seq FPS: 29.15, Max Mem: 0.58G
    GPU 0 - Processing Seq cows [8/30]:
    GPU 1 - Seq car-shadow - FPS: 30.62. All-Frame FPS: 30.05, All-Seq FPS: 30.03, Max Mem: 0.53G
    GPU 1 - Processing Seq dance-twirl [9/30]:
    GPU 0 - Seq cows - FPS: 27.67. All-Frame FPS: 28.67, All-Seq FPS: 28.76, Max Mem: 0.58G
    GPU 0 - Processing Seq dog [10/30]:
    GPU 1 - Seq dance-twirl - FPS: 25.66. All-Frame FPS: 28.80, All-Seq FPS: 29.04, Max Mem: 0.53G
    GPU 1 - Processing Seq dogs-jump [11/30]:
    GPU 0 - Seq dog - FPS: 28.28. All-Frame FPS: 28.60, All-Seq FPS: 28.67, Max Mem: 0.58G
    GPU 0 - Processing Seq drift-chicane [12/30]:
    GPU 1 - Seq dogs-jump - FPS: 27.15. All-Frame FPS: 28.52, All-Seq FPS: 28.71, Max Mem: 0.53G
    GPU 1 - Processing Seq drift-straight [13/30]:
    GPU 0 - Seq drift-chicane - FPS: 28.43. All-Frame FPS: 28.58, All-Seq FPS: 28.63, Max Mem: 0.58G
    GPU 0 - Processing Seq goat [14/30]:
    GPU 1 - Seq drift-straight - FPS: 29.75. All-Frame FPS: 28.65, All-Seq FPS: 28.85, Max Mem: 0.53G
    GPU 1 - Processing Seq gold-fish [15/30]:
    GPU 0 - Seq goat - FPS: 27.72. All-Frame FPS: 28.43, All-Seq FPS: 28.49, Max Mem: 0.58G
    GPU 0 - Processing Seq horsejump-high [16/30]:
    GPU 1 - Seq gold-fish - FPS: 28.61. All-Frame FPS: 28.64, All-Seq FPS: 28.82, Max Mem: 0.53G
    GPU 1 - Processing Seq india [17/30]:
    GPU 0 - Seq horsejump-high - FPS: 28.93. All-Frame FPS: 28.47, All-Seq FPS: 28.55, Max Mem: 0.58G
    GPU 0 - Processing Seq judo [18/30]:
    GPU 0 - Seq judo - FPS: 31.24. All-Frame FPS: 28.61, All-Seq FPS: 28.82, Max Mem: 0.58G
    GPU 0 - Processing Seq lab-coat [20/30]:
    GPU 1 - Seq india - FPS: 28.42. All-Frame FPS: 28.61, All-Seq FPS: 28.78, Max Mem: 0.53G
    GPU 1 - Processing Seq kite-surf [19/30]:
    GPU 0 - Seq lab-coat - FPS: 29.81. All-Frame FPS: 28.69, All-Seq FPS: 28.92, Max Mem: 0.58G
    GPU 0 - Processing Seq libby [21/30]:
    GPU 1 - Seq kite-surf - FPS: 30.69. All-Frame FPS: 28.76, All-Seq FPS: 28.96, Max Mem: 0.53G
    GPU 1 - Processing Seq loading [22/30]:
    GPU 0 - Seq libby - FPS: 31.08. All-Frame FPS: 28.85, All-Seq FPS: 29.10, Max Mem: 0.58G
    GPU 0 - Processing Seq mbike-trick [23/30]:
    GPU 1 - Seq loading - FPS: 31.06. All-Frame FPS: 28.90, All-Seq FPS: 29.14, Max Mem: 0.53G
    GPU 1 - Processing Seq motocross-jump [24/30]:
    GPU 1 - Seq motocross-jump - FPS: 31.09. All-Frame FPS: 29.01, All-Seq FPS: 29.29, Max Mem: 0.53G
    GPU 1 - Processing Seq parkour [26/30]:
    GPU 0 - Seq mbike-trick - FPS: 27.85. All-Frame FPS: 28.74, All-Seq FPS: 28.99, Max Mem: 0.58G
    GPU 0 - Processing Seq paragliding-launch [25/30]:
    GPU 1 - Seq parkour - FPS: 28.09. All-Frame FPS: 28.90, All-Seq FPS: 29.20, Max Mem: 0.53G
    GPU 1 - Processing Seq pigs [27/30]:
    GPU 0 - Seq paragliding-launch - FPS: 29.78. All-Frame FPS: 28.84, All-Seq FPS: 29.05, Max Mem: 0.58G
    GPU 0 - Processing Seq scooter-black [28/30]:
    GPU 0 - Seq scooter-black - FPS: 30.13. All-Frame FPS: 28.89, All-Seq FPS: 29.13, Max Mem: 0.58G
    GPU 0 - Processing Seq soapbox [30/30]:
    GPU 1 - Seq pigs - FPS: 30.28. All-Frame FPS: 29.01, All-Seq FPS: 29.27, Max Mem: 0.53G
    GPU 1 - Processing Seq shooting [29/30]:
    GPU 1 - Seq shooting - FPS: 28.25. All-Frame FPS: 28.98, All-Seq FPS: 29.20, Max Mem: 0.65G
    Finished the evaluation on GPU 1.
    GPU 0 - Seq soapbox - FPS: 29.63. All-Frame FPS: 28.96, All-Seq FPS: 29.16, Max Mem: 0.58G
    Finished the evaluation on GPU 0.
    GPU [0, 1] - All-Frame FPS: 28.97, All-Seq FPS: 29.18, Max Mem: 0.65G
    
    opened by xinyeCH 7
  • About the LSTT module

    When I read the source code of networks/engines/aot_engine.py, I found that only the value was updated when the memory was updated in the update_short_term_memory method. Is there any consideration for not updating the key here?

    opened by lsy-dot 4
  • What's the meaning of "squeeze_idx" in eval_datasets.py?

    Hello! When I read the code in eval_datasets.py, I have a question about the function "read_label" of class VOSTest. Could you please tell me the meaning of "squeeze_idx"? When I debug the code, I find that "__getitem__" uses "obj_idx" as "squeeze_idx" in "read_label", but there seems to be no difference between the input label and "squeezed_label". So what is the effect of "squeeze_idx", and under what circumstances is it used? Thank you!

    opened by MUVGuan 4
  • About the training time.

    Thanks for your great job!
    As stated on the main page, each of the two training stages takes about 0.6 days. However, I train the AOTT model with only the default config, and it takes almost 1.5 days for DAVIS. I wonder whether I made some mistake in the training process. My environment: pytorch 1.7.1, CUDA 10.2, GPU: 4x Tesla V100

    The config:

import os
import importlib


class DefaultEngineConfig():
    def __init__(self, exp_name='default', model='AOTT'):
        model_cfg = importlib.import_module('configs.models.' +
                                            model).ModelConfig()
        self.__dict__.update(model_cfg.__dict__)  # add model config

        self.EXP_NAME = exp_name + '_' + self.MODEL_NAME
    
        self.STAGE_NAME = 'default'
    
        self.DATASETS = ['youtubevos']
        self.DATA_WORKERS = 12
        self.DATA_RANDOMCROP = (465,
                                465) if self.MODEL_ALIGN_CORNERS else (464,
                                                                       464)
        self.DATA_RANDOMFLIP = 0.5
        self.DATA_MAX_CROP_STEPS = 10
        self.DATA_SHORT_EDGE_LEN = 480
        self.DATA_MIN_SCALE_FACTOR = 0.7
        self.DATA_MAX_SCALE_FACTOR = 1.3
        self.DATA_RANDOM_REVERSE_SEQ = True
        self.DATA_SEQ_LEN = 5
        self.DATA_DAVIS_REPEAT = 5
        self.DATA_RANDOM_GAP_DAVIS = 12  # max frame interval between two sampled frames for DAVIS (24fps)
        self.DATA_RANDOM_GAP_YTB = 3  # max frame interval between two sampled frames for YouTube-VOS (6fps)
        self.DATA_DYNAMIC_MERGE_PROB = 0.3
    
        self.PRETRAIN = True
        self.PRETRAIN_FULL = False  # if False, load encoder only
        self.PRETRAIN_MODEL = ''
    
        self.TRAIN_TOTAL_STEPS = 100000
        self.TRAIN_START_STEP = 0
        self.TRAIN_WEIGHT_DECAY = 0.07
        self.TRAIN_WEIGHT_DECAY_EXCLUSIVE = {
            # 'encoder.': 0.01
        }
        self.TRAIN_WEIGHT_DECAY_EXEMPTION = [
            'absolute_pos_embed', 'relative_position_bias_table',
            'relative_emb_v', 'conv_out'
        ]
        self.TRAIN_LR = 2e-4
        self.TRAIN_LR_MIN = 2e-5 if 'mobilenetv2' in self.MODEL_ENCODER else 1e-5
        self.TRAIN_LR_POWER = 0.9
        self.TRAIN_LR_ENCODER_RATIO = 0.1
        self.TRAIN_LR_WARM_UP_RATIO = 0.05
        self.TRAIN_LR_COSINE_DECAY = False
        self.TRAIN_LR_RESTART = 1
        self.TRAIN_LR_UPDATE_STEP = 1
        self.TRAIN_AUX_LOSS_WEIGHT = 1.0
        self.TRAIN_AUX_LOSS_RATIO = 1.0
        self.TRAIN_OPT = 'adamw'
        self.TRAIN_SGD_MOMENTUM = 0.9
        self.TRAIN_GPUS = 4
        self.TRAIN_BATCH_SIZE = 16
        self.TRAIN_TBLOG = False
        self.TRAIN_TBLOG_STEP = 50
        self.TRAIN_LOG_STEP = 20
        self.TRAIN_IMG_LOG = True
        self.TRAIN_TOP_K_PERCENT_PIXELS = 0.15
        self.TRAIN_SEQ_TRAINING_FREEZE_PARAMS = ['patch_wise_id_bank']
        self.TRAIN_SEQ_TRAINING_START_RATIO = 0.5
        self.TRAIN_HARD_MINING_RATIO = 0.5
        self.TRAIN_EMA_RATIO = 0.1
        self.TRAIN_CLIP_GRAD_NORM = 5.
        self.TRAIN_SAVE_STEP = 1000
        self.TRAIN_MAX_KEEP_CKPT = 8
        self.TRAIN_RESUME = False
        self.TRAIN_RESUME_CKPT = None
        self.TRAIN_RESUME_STEP = 0
        self.TRAIN_AUTO_RESUME = True
        self.TRAIN_DATASET_FULL_RESOLUTION = False
        self.TRAIN_ENABLE_PREV_FRAME = False
        self.TRAIN_ENCODER_FREEZE_AT = 2
        self.TRAIN_LSTT_EMB_DROPOUT = 0.
        self.TRAIN_LSTT_ID_DROPOUT = 0.
        self.TRAIN_LSTT_DROPPATH = 0.1
        self.TRAIN_LSTT_DROPPATH_SCALING = False
        self.TRAIN_LSTT_DROPPATH_LST = False
        self.TRAIN_LSTT_LT_DROPOUT = 0.
        self.TRAIN_LSTT_ST_DROPOUT = 0.
    
        self.TEST_GPU_ID = 0
        self.TEST_GPU_NUM = 1
        self.TEST_FRAME_LOG = False
        self.TEST_DATASET = 'youtubevos'
        self.TEST_DATASET_FULL_RESOLUTION = False
        self.TEST_DATASET_SPLIT = 'val'
        self.TEST_CKPT_PATH = None
        # if "None", evaluate the latest checkpoint.
        self.TEST_CKPT_STEP = None
        self.TEST_FLIP = False
        self.TEST_MULTISCALE = [1]
        self.TEST_MIN_SIZE = None
        self.TEST_MAX_SIZE = 800 * 1.3
        self.TEST_WORKERS = 4
    
        # GPU distribution
        self.DIST_ENABLE = True
        self.DIST_BACKEND = "nccl"  # "gloo"
        self.DIST_URL = "tcp://127.0.0.1:13241"
        self.DIST_START_GPU = 0
    
    def init_dir(self):
        self.DIR_DATA = './datasets'
        self.DIR_DAVIS = os.path.join(self.DIR_DATA, 'DAVIS')
        self.DIR_YTB = os.path.join(self.DIR_DATA, 'YTB')
        self.DIR_STATIC = os.path.join(self.DIR_DATA, 'Static')
    
        self.DIR_ROOT = './results'
    
        self.DIR_RESULT = os.path.join(self.DIR_ROOT, 'result', self.EXP_NAME,
                                       self.STAGE_NAME)
        self.DIR_CKPT = os.path.join(self.DIR_RESULT, 'ckpt')
        self.DIR_EMA_CKPT = os.path.join(self.DIR_RESULT, 'ema_ckpt')
        self.DIR_LOG = os.path.join(self.DIR_RESULT, 'log')
        self.DIR_TB_LOG = os.path.join(self.DIR_RESULT, 'log', 'tensorboard')
        self.DIR_IMG_LOG = os.path.join(self.DIR_RESULT, 'log', 'img')
        self.DIR_EVALUATION = os.path.join(self.DIR_RESULT, 'eval')
    
        for path in [
                self.DIR_RESULT, self.DIR_CKPT, self.DIR_EMA_CKPT,
                self.DIR_LOG, self.DIR_EVALUATION, self.DIR_IMG_LOG,
                self.DIR_TB_LOG
        ]:
            if not os.path.isdir(path):
                try:
                    os.makedirs(path)
                except Exception as inst:
                    print(inst)
                    print('Failed to make dir: {}.'.format(path))
    
    opened by king-zark 4
  • What's the difference between the result of evaluation on Davis17 in the paper and ModelZoo?

    Hello! I find that the J&F-mean of AOTT (Y) on the DAVIS-2017 validation set in Table 1(b) of the paper is 78.2, while the J&F-mean of AOTT in the Model Zoo is 79.2. Could you please explain the difference?

    opened by MUVGuan 3
  • Problems about main-train ytb

    PyTorch 1.8, torchvision 0.9.0, CUDA 10.1

    When training the ytb stage, it reports this error.

    Here is the full traceback:

    (torch18) cwc@imc-Z9PE-D8-WS:~/aot-benchmark-main/tools$ python train.py Exp _AOTT: { "DATASETS": [ "youtubevos" ], "DATA_DAVIS_REPEAT": 5, "DATA_DYNAMIC_MERGE_PROB": 0.3, "DATA_MAX_CROP_STEPS": 10, "DATA_MAX_SCALE_FACTOR": 1.3, "DATA_MIN_SCALE_FACTOR": 0.7, "DATA_RANDOMCROP": [ 465, 465 ], "DATA_RANDOMFLIP": 0.5, "DATA_RANDOM_GAP_DAVIS": 12, "DATA_RANDOM_GAP_YTB": 3, "DATA_RANDOM_REVERSE_SEQ": true, "DATA_SEQ_LEN": 5, "DATA_SHORT_EDGE_LEN": 480, "DATA_WORKERS": 8, "DIR_CKPT": "./results/result/_AOTT/YTB/ckpt", "DIR_DAVIS": "/DATACENTER/1/ysl/Datasets/DAVIS/2017", "DIR_EMA_CKPT": "./results/result/_AOTT/YTB/ema_ckpt", "DIR_EVALUATION": "./results/result/_AOTT/YTB/eval", "DIR_IMG_LOG": "./results/result/_AOTT/YTB/log/img", "DIR_LOG": "./results/result/_AOTT/YTB/log", "DIR_RESULT": "./results/result/_AOTT/YTB", "DIR_ROOT": "./results", "DIR_STATIC": "/DATACENTER/1/Datasets/static", "DIR_TB_LOG": "./results/result/_AOTT/YTB/log/tensorboard", "DIR_YTB": "/DATACENTER/1/ysl/Datasets/YoutubeVOS", "DIST_BACKEND": "nccl", "DIST_ENABLE": true, "DIST_START_GPU": 1, "DIST_URL": "tcp://127.0.0.1:12311", "EXP_NAME": "_AOTT", "MODEL_ALIGN_CORNERS": true, "MODEL_ATT_HEADS": 8, "MODEL_DECODER_INTERMEDIATE_LSTT": true, "MODEL_ENCODER": "mobilenetv2", "MODEL_ENCODER_DIM": [ 24, 32, 96, 1280 ], "MODEL_ENCODER_EMBEDDING_DIM": 256, "MODEL_ENCODER_PRETRAIN": "/home/cwc/aot-benchmark-main/pretrain_models/mobilenet_v2-b0353104.pth", "MODEL_ENGINE": "aotengine", "MODEL_EPSILON": 1e-05, "MODEL_FREEZE_BACKBONE": false, "MODEL_FREEZE_BN": true, "MODEL_LSTT_NUM": 1, "MODEL_MAX_OBJ_NUM": 10, "MODEL_NAME": "AOTT", "MODEL_SELF_HEADS": 8, "MODEL_USE_PREV_PROB": false, "MODEL_VOS": "aot", "PRETRAIN": true, "PRETRAIN_FULL": false, "PRETRAIN_MODEL": "", "STAGE_NAME": "YTB", "TEST_CKPT_PATH": null, "TEST_CKPT_STEP": null, "TEST_DATASET": "youtubevos", "TEST_DATASET_FULL_RESOLUTION": false, "TEST_DATASET_SPLIT": "val", "TEST_FLIP": false, "TEST_FRAME_LOG": false, "TEST_GPU_ID": 1, "TEST_GPU_NUM": 1, "TEST_LONG_TERM_MEM_GAP": 9999, "TEST_MAX_SIZE": 1040.0, "TEST_MIN_SIZE": null, "TEST_MULTISCALE": [ 1 ], "TEST_WORKERS": 4, "TRAIN_AUTO_RESUME": true, "TRAIN_AUX_LOSS_RATIO": 1.0, "TRAIN_AUX_LOSS_WEIGHT": 1.0, "TRAIN_BATCH_SIZE": 4, "TRAIN_CLIP_GRAD_NORM": 5.0, "TRAIN_DATASET_FULL_RESOLUTION": false, "TRAIN_EMA_RATIO": 0.1, "TRAIN_ENABLE_PREV_FRAME": false, "TRAIN_ENCODER_FREEZE_AT": 2, "TRAIN_GPUS": 2, "TRAIN_HARD_MINING_RATIO": 0.5, "TRAIN_IMG_LOG": true, "TRAIN_LOG_STEP": 20, "TRAIN_LONG_TERM_MEM_GAP": 9999, "TRAIN_LR": 0.0002, "TRAIN_LR_COSINE_DECAY": false, "TRAIN_LR_ENCODER_RATIO": 0.1, "TRAIN_LR_MIN": 2e-05, "TRAIN_LR_POWER": 0.9, "TRAIN_LR_RESTART": 1, "TRAIN_LR_UPDATE_STEP": 1, "TRAIN_LR_WARM_UP_RATIO": 0.05, "TRAIN_LSTT_DROPPATH": 0.1, "TRAIN_LSTT_DROPPATH_LST": false, "TRAIN_LSTT_DROPPATH_SCALING": false, "TRAIN_LSTT_EMB_DROPOUT": 0.0, "TRAIN_LSTT_ID_DROPOUT": 0.0, "TRAIN_LSTT_LT_DROPOUT": 0.0, "TRAIN_LSTT_ST_DROPOUT": 0.0, "TRAIN_MAX_KEEP_CKPT": 8, "TRAIN_OPT": "adamw", "TRAIN_RESUME": false, "TRAIN_RESUME_CKPT": null, "TRAIN_RESUME_STEP": 0, "TRAIN_SAVE_STEP": 1000, "TRAIN_SEQ_TRAINING_FREEZE_PARAMS": [ "patch_wise_id_bank" ], "TRAIN_SEQ_TRAINING_START_RATIO": 0.5, "TRAIN_SGD_MOMENTUM": 0.9, "TRAIN_START_STEP": 0, "TRAIN_TBLOG": false, "TRAIN_TBLOG_STEP": 50, "TRAIN_TOP_K_PERCENT_PIXELS": 0.15, "TRAIN_TOTAL_STEPS": 100000, "TRAIN_WEIGHT_DECAY": 0.07, "TRAIN_WEIGHT_DECAY_EXCLUSIVE": {}, "TRAIN_WEIGHT_DECAY_EXEMPTION": [ "absolute_pos_embed", "relative_position_bias_table", "relative_emb_v", "conv_out" ] } Use GPU 1 
for training VOS. Build VOS model. Use GPU 2 for training VOS. Use Frozen BN in Encoder! Total Param: 5.73M Build optimizer. Total Param: 5.73M Process dataset... Short object: 721bb6f2cb-3 Short object: 721bb6f2cb-3 Short object: d177e9878a-2 Short object: d177e9878a-3 Short object: d177e9878a-2 Short object: d177e9878a-3 Short object: f36483c824-2 Short object: f9bd1fabf5-4 Short object: f36483c824-2 Video Num: 3471 X 1 Done! Short object: f9bd1fabf5-4 Video Num: 3471 X 1 Remove ['features.0.1.num_batches_tracked', 'features.1.conv.0.1.num_batches_tracked', 'features.1.conv.2.num_batches_tracked', 'features.2.conv.0.1.num_batches_tracked', 'features.2.conv.1.1.num_batches_tracked', 'features.2.conv.3.num_batches_tracked', 'features.3.conv.0.1.num_batches_tracked', 'features.3.conv.1.1.num_batches_tracked', 'features.3.conv.3.num_batches_tracked', 'features.4.conv.0.1.num_batches_tracked', 'features.4.conv.1.1.num_batches_tracked', 'features.4.conv.3.num_batches_tracked', 'features.5.conv.0.1.num_batches_tracked', 'features.5.conv.1.1.num_batches_tracked', 'features.5.conv.3.num_batches_tracked', 'features.6.conv.0.1.num_batches_tracked', 'features.6.conv.1.1.num_batches_tracked', 'features.6.conv.3.num_batches_tracked', 'features.7.conv.0.1.num_batches_tracked', 'features.7.conv.1.1.num_batches_tracked', 'features.7.conv.3.num_batches_tracked', 'features.8.conv.0.1.num_batches_tracked', 'features.8.conv.1.1.num_batches_tracked', 'features.8.conv.3.num_batches_tracked', 'features.9.conv.0.1.num_batches_tracked', 'features.9.conv.1.1.num_batches_tracked', 'features.9.conv.3.num_batches_tracked', 'features.10.conv.0.1.num_batches_tracked', 'features.10.conv.1.1.num_batches_tracked', 'features.10.conv.3.num_batches_tracked', 'features.11.conv.0.1.num_batches_tracked', 'features.11.conv.1.1.num_batches_tracked', 'features.11.conv.3.num_batches_tracked', 'features.12.conv.0.1.num_batches_tracked', 'features.12.conv.1.1.num_batches_tracked', 'features.12.conv.3.num_batches_tracked', 'features.13.conv.0.1.num_batches_tracked', 'features.13.conv.1.1.num_batches_tracked', 'features.13.conv.3.num_batches_tracked', 'features.14.conv.0.1.num_batches_tracked', 'features.14.conv.1.1.num_batches_tracked', 'features.14.conv.3.num_batches_tracked', 'features.15.conv.0.1.num_batches_tracked', 'features.15.conv.1.1.num_batches_tracked', 'features.15.conv.3.num_batches_tracked', 'features.16.conv.0.1.num_batches_tracked', 'features.16.conv.1.1.num_batches_tracked', 'features.16.conv.3.num_batches_tracked', 'features.17.conv.0.1.num_batches_tracked', 'features.17.conv.1.1.num_batches_tracked', 'features.17.conv.3.num_batches_tracked', 'features.18.1.num_batches_tracked', 'classifier.1.weight', 'classifier.1.bias'] from pretrained model. Load pretrained backbone model from . Start training: step------------------------------ 0 step------------------------------ 0 [W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator()) [W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. 
This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator()) Traceback (most recent call last): File "train.py", line 80, in main() File "train.py", line 76, in main mp.spawn(main_worker, nprocs=cfg.TRAIN_GPUS, args=(cfg, args.amp)) File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method='spawn') File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes while not context.join(): File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join raise ProcessRaisedException(msg, error_index, failed_process.pid) torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/home/cwc/aot-benchmark-main/tools/train.py", line 18, in main_worker
        trainer.sequential_training()
      File "/home/cwc/aot-benchmark-main/tools/../networks/managers/trainer.py", line 456, in sequential_training
        loss.backward()
      File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [900, 2, 256]], which is output 0 of AddBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

    opened by 1359347500cwc 3
  • The size of the image input to Swin Transformer

    I notice that the Swin Transformer uses pre-trained parameters. I would like to ask whether the image size input to the Swin Transformer is also 224×224?

    opened by sly1220 3
  • how real_label can convert rgb mask image to one-channel id mask?

    Hi, thanks for this wonderful work! I would like to ask about the following code in demo.py:

        def read_label(self, label_name, squeeze_idx=None):
            label_path = os.path.join(self.label_root, self.seq_name, label_name)
            label = Image.open(label_path)
            label = np.array(label, dtype=np.uint8)
            if self.single_obj:
                label = (label > 0).astype(np.uint8)
            elif squeeze_idx is not None:
                squeezed_label = label * 0
                for idx in range(len(squeeze_idx)):
                    obj_id = squeeze_idx[idx]
                    if obj_id == 0:
                        continue
                    mask = label == obj_id
                    squeezed_label += (mask * idx).astype(np.uint8)
                label = squeezed_label
            return label
    

    Why is it that the RGB mask only needs to go through the code below

    label = Image.open(label_path)
    label = np.array(label, dtype=np.uint8)
    

    and gets converted to a one-channel ID mask? How should I encode a mask image to achieve this effect? Thanks!
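
    The DAVIS/YouTube-VOS annotation PNGs are stored in palette ('P') mode, so np.array(Image.open(path)) already yields per-pixel object indices rather than RGB triples; an RGB mask must first be mapped to such an index image. A minimal sketch (an illustration, not code from the repository) of writing a compatible mask:

        import numpy as np
        from PIL import Image

        # (H, W) array of object ids: 0 = background, 1/2 = objects.
        ids = np.zeros((480, 854), dtype=np.uint8)
        ids[100:200, 100:300] = 1
        ids[250:400, 400:700] = 2

        mask = Image.fromarray(ids, mode='P')
        mask.putpalette([0, 0, 0, 255, 0, 0, 0, 255, 0])  # ids 0/1/2 -> black/red/green
        mask.save('example_mask.png')

        back = np.array(Image.open('example_mask.png'), dtype=np.uint8)
        print(np.unique(back))  # [0 1 2]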

    opened by LilyDaytoy 2
  • Does AOTL use multiple frames in long term memory during the training phase?

    Hello! Does AOTL use multiple frames in long-term memory during the training phase? Or does AOTL just use the first frame in long-term memory during training and multiple frames during evaluation?

    opened by MUVGuan 2