A full-fledged version of Pix2Seq

Overview

Stable-Pix2Seq

What it is. This is a full-fledged version of Pix2Seq. Compared with unofficial-pix2seq, Stable-Pix2Seq contains most of the tricks mentioned in Pix2Seq, such as sequence augmentation, batch repetition, warmup, linear learning rate decay, and beam search (to be added later).

Differences from Pix2Seq. In sequence augmentation, we only add random bounding boxes, whereas the original paper also mixes in virtual boxes built from ground truth plus noise. Pix2Seq also uses input sequence dropout to regularize the training process.
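
As a rough illustration of the random-box variant described above, the sketch below pads a target sequence with randomly drawn boxes. The function names, the `NOISE_CLASS` id, the bin count, and the box and token layout are all assumptions made for illustration; they are not taken from this repository.

```python
import random

NUM_BINS = 1000   # assumed number of coordinate bins (hypothetical)
NOISE_CLASS = 91  # assumed id for the "noise" class (hypothetical)


def quantize(value, num_bins=NUM_BINS):
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    return min(int(value * num_bins), num_bins - 1)


def random_box(rng):
    """Draw a random box (x_min, y_min, x_max, y_max) in normalized coordinates."""
    x1, x2 = sorted(rng.random() for _ in range(2))
    y1, y2 = sorted(rng.random() for _ in range(2))
    return (x1, y1, x2, y2)


def augment_with_random_boxes(objects, max_objects, rng=None):
    """Pad a list of (box, class_id) pairs with random boxes labeled as noise."""
    rng = rng or random.Random()
    padded = list(objects)
    while len(padded) < max_objects:
        padded.append((random_box(rng), NOISE_CLASS))
    # Flatten into tokens: four quantized coordinates followed by the class id.
    tokens = []
    for box, class_id in padded:
        tokens.extend(quantize(c) for c in box)
        tokens.append(class_id)
    return tokens
```

Calling `augment_with_random_boxes(objects, max_objects=100)` would give every image a sequence of the same length, with real objects first and random boxes after them.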

Usage - Object detection

There are no extra compiled components in Stable-Pix2Seq and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install dependencies via conda. First, clone the repository locally:

git clone https://github.com/gaopengcuhk/Stable-Pix2Seq.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it; you should now be able to train and evaluate detection models.

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
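
Before launching a long run, it can help to confirm that the layout above is in place. The following is a minimal sketch; the helper name `missing_coco_dirs` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Subdirectories expected under the COCO root, as listed above.
EXPECTED_SUBDIRS = ("annotations", "train2017", "val2017")


def missing_coco_dirs(coco_root):
    """Return the expected subdirectories that are absent under coco_root."""
    root = Path(coco_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_coco_dirs("path/to/coco")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("COCO layout looks complete.")
```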

Training

To train the baseline Stable-Pix2Seq on a single node with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --master_port=3141 --nproc_per_node 8 --use_env main.py --coco_path ./coco/ --batch_size 4 --lr 0.0005

A single epoch takes 50 minutes on 8 V100 GPUs, so 300 epochs of training take around 10 days on a single machine with 8 V100 cards.

Why is it slower than DETR and Unofficial-Pix2Seq? Stable-Pix2Seq uses batch repeat, which doubles the training time. In addition, Stable-Pix2Seq uses an image resolution of 1333, while the time reported for Unofficial-Pix2Seq was measured at a low resolution of 512.
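
Batch repeat can be pictured as each dataset item yielding two independently augmented views, which the collate step then joins into one batch. The sketch below shows that idea in plain Python; the class and function names are hypothetical, and the actual code in `datasets/coco.py` and `util/misc.py` may differ in its details.

```python
class RepeatedAugmentationDataset:
    """Wrap a list of samples and return two augmented views per item."""

    def __init__(self, samples, transform):
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        # With a random transform, the two calls produce two different views.
        return self.transform(sample), self.transform(sample)


def collate_repeated(batch):
    """Concatenate all first views and all second views into one flat batch."""
    first = [pair[0] for pair in batch]
    second = [pair[1] for pair in batch]
    return first + second


def effective_batch_size(num_gpus, per_gpu_batch, repeats=2):
    """Number of augmented samples that contribute to one optimizer step."""
    return num_gpus * per_gpu_batch * repeats
```

With 8 GPUs and `--batch_size 4`, `effective_batch_size(8, 4)` evaluates to 64, and each step processes twice as many images as it would without repetition.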

We train Stable-Pix2Seq with AdamW, setting the learning rate with a linear warmup and decay schedule. Because of batch repeat, the real batch size is 64 (8 GPUs × batch size 4 × 2 repeats). Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have min size 800 and max size 1333. The transformer is trained with dropout of 0.1, and the whole model is trained with grad clip of 0.1.
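
To make the "min size 800 and max size 1333" rule concrete, the sketch below computes a target size that scales the shorter side to 800 unless that would push the longer side past 1333. The function name `resized_size` and the use of `round` are assumptions; the repository's transform may round differently and may sample other scales during training.

```python
def resized_size(width, height, min_size=800, max_size=1333):
    """Return (new_width, new_height) preserving the aspect ratio."""
    short_side = min(width, height)
    long_side = max(width, height)
    # First try to bring the shorter side to min_size.
    scale = min_size / short_side
    # If the longer side would overflow, cap it at max_size instead.
    if long_side * scale > max_size:
        scale = max_size / long_side
    return round(width * scale), round(height * scale)
```

For example, a 640×480 image would be resized so its shorter side is 800, while a very wide 1000×300 image would be limited by the 1333 cap on its longer side.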

Please use the learning rate 0.0005 with caution. It is tested on batch 198.
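
A linear warmup followed by linear decay can be sketched as a function of the training step. The warmup length, the start and end values, and the function name `lr_at_step` below are assumptions chosen for illustration; only the peak value 0.0005 comes from the command above.

```python
def lr_at_step(step, total_steps, warmup_steps,
               peak_lr=0.0005, start_lr=0.0, end_lr=0.0):
    """Learning rate with linear warmup to peak_lr, then linear decay to end_lr."""
    if step < warmup_steps:
        # Warmup: interpolate from start_lr up to peak_lr.
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    # Decay: interpolate from peak_lr down to end_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + (end_lr - peak_lr) * min(progress, 1.0)
```

When the batch size changes, one common adjustment is to scale `peak_lr` with it, but whether that holds for this model is something to verify experimentally.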

Evaluation

To evaluate Stable-Pix2Seq R50 on COCO val5k with multiple GPUs, run:

python -m torch.distributed.launch --master_port=3142 --nproc_per_node 8 --use_env main.py --coco_path ./coco/ --batch_size 4 --eval --resume checkpoint.pth

Acknowledgement

DETR

Comments
  • Why do we want to create two samples in `get_item`

    I'm trying to understand the following code:

    https://github.com/gaopengcuhk/Stable-Pix2Seq/blob/12587302a2b697e2be8c131452e466a3f45c8c3e/datasets/coco.py#L23-L31

    This part is also different from the code in DETR. I'm wondering what the design principle behind transforming two samples is.

    As far as I can see, the collate function just concatenates them together:

    https://github.com/gaopengcuhk/Stable-Pix2Seq/blob/12587302a2b697e2be8c131452e466a3f45c8c3e/util/misc.py#L268-L271

    opened by allanj 1
  • How to use this for panoptic segmentation

    Can you share the command line for training panoptic segmentation? I am using the command: python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --coco_path ... --coco_panoptic_path ... --masks but I get some errors.

    opened by byrsongyuxin 1
  • Return values in the coco.py __get_item__ method

    Why do the return values here contain two images and targets? Shouldn't it simply be:

        if self._transforms is not None:
            img, target = self._transforms(img, target)
        return img, target

    opened by RishabhMaheshwary 0
  • Extract token embedding

    I want to extract the token embedding as shown in figure 11 of the paper.

    However, looking at the code, I see that the tokens are predicted by feeding the output feature map to an MLP whose last layer has dimension 2003 (maybe the number of tokens). Hence, the model does not actually learn a token embedding, and we can't get the learned token embedding.

    Am I missing something?

    opened by elituan 0
  • CUDA Out-of-memory using V100

    I'm using V100 GPUs for experiments, but I still run out of memory in the middle of the training process. I'm not sure what the reason is at the moment.

    
    Namespace(aux_loss=True, backbone='resnet50', batch_size=4, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='./coco2017/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0005, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='./output', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
    Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/tiger/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
    100%|██████████| 97.8M/97.8M [00:09<00:00, 10.3MB/s]
    number of params: 36104659
    loading annotations into memory...
    Done (t=13.57s)
    creating index...
    index created!
    loading annotations into memory...
    Done (t=0.44s)
    creating index...
    index created!
    Start training
    Epoch: [0]  [   0/3696]  eta: 2:32:25  lr: 0.000100  loss: 7.6000 (7.6000)  at: 7.6000 (7.6000)  at_unscaled: 7.6000 (7.6000)  time: 2.4743  data: 0.5030  max mem: 14737
    Epoch: [0]  [  10/3696]  eta: 0:59:14  lr: 0.000100  loss: 7.5261 (7.5307)  at: 7.5261 (7.5307)  at_unscaled: 7.5261 (7.5307)  time: 0.9643  data: 0.0806  max mem: 25656
    Epoch: [0]  [  20/3696]  eta: 0:56:49  lr: 0.000100  loss: 7.4746 (7.4774)  at: 7.4746 (7.4774)  at_unscaled: 7.4746 (7.4774)  time: 0.8501  data: 0.0390  max mem: 25656
    Epoch: [0]  [  30/3696]  eta: 0:54:22  lr: 0.000100  loss: 7.3449 (7.4215)  at: 7.3449 (7.4215)  at_unscaled: 7.3449 (7.4215)  time: 0.8489  data: 0.0374  max mem: 25656
    Epoch: [0]  [  40/3696]  eta: 0:54:59  lr: 0.000100  loss: 7.2054 (7.3429)  at: 7.2054 (7.3429)  at_unscaled: 7.2054 (7.3429)  time: 0.8761  data: 0.0356  max mem: 25656
    Epoch: [0]  [  50/3696]  eta: 0:53:30  lr: 0.000100  loss: 7.0288 (7.2657)  at: 7.0288 (7.2657)  at_unscaled: 7.0288 (7.2657)  time: 0.8662  data: 0.0362  max mem: 25656
    Epoch: [0]  [  60/3696]  eta: 0:53:44  lr: 0.000100  loss: 6.8423 (7.1774)  at: 6.8423 (7.1774)  at_unscaled: 6.8423 (7.1774)  time: 0.8553  data: 0.0368  max mem: 26623
    Epoch: [0]  [  70/3696]  eta: 0:53:36  lr: 0.000100  loss: 6.6867 (7.0967)  at: 6.6867 (7.0967)  at_unscaled: 6.6867 (7.0967)  time: 0.9036  data: 0.0359  max mem: 26623
    Epoch: [0]  [  80/3696]  eta: 0:52:42  lr: 0.000100  loss: 6.5043 (7.0184)  at: 6.5043 (7.0184)  at_unscaled: 6.5043 (7.0184)  time: 0.8368  data: 0.0351  max mem: 26623
    Epoch: [0]  [  90/3696]  eta: 0:52:17  lr: 0.000100  loss: 6.4531 (6.9577)  at: 6.4531 (6.9577)  at_unscaled: 6.4531 (6.9577)  time: 0.8094  data: 0.0362  max mem: 26623
    Epoch: [0]  [ 100/3696]  eta: 0:51:33  lr: 0.000100  loss: 6.4151 (6.8982)  at: 6.4151 (6.8982)  at_unscaled: 6.4151 (6.8982)  time: 0.8019  data: 0.0386  max mem: 26623
    Epoch: [0]  [ 110/3696]  eta: 0:51:10  lr: 0.000100  loss: 6.3319 (6.8437)  at: 6.3319 (6.8437)  at_unscaled: 6.3319 (6.8437)  time: 0.7937  data: 0.0392  max mem: 26623
    Epoch: [0]  [ 120/3696]  eta: 0:50:56  lr: 0.000100  loss: 6.2714 (6.7969)  at: 6.2714 (6.7969)  at_unscaled: 6.2714 (6.7969)  time: 0.8268  data: 0.0377  max mem: 26623
    Epoch: [0]  [ 130/3696]  eta: 0:50:36  lr: 0.000100  loss: 6.2584 (6.7519)  at: 6.2584 (6.7519)  at_unscaled: 6.2584 (6.7519)  time: 0.8254  data: 0.0372  max mem: 26623
    Epoch: [0]  [ 140/3696]  eta: 0:50:25  lr: 0.000100  loss: 6.2035 (6.7111)  at: 6.2035 (6.7111)  at_unscaled: 6.2035 (6.7111)  time: 0.8266  data: 0.0372  max mem: 29528
    Epoch: [0]  [ 150/3696]  eta: 0:49:55  lr: 0.000100  loss: 6.1476 (6.6716)  at: 6.1476 (6.6716)  at_unscaled: 6.1476 (6.6716)  time: 0.8011  data: 0.0375  max mem: 29528
    Epoch: [0]  [ 160/3696]  eta: 0:49:27  lr: 0.000100  loss: 6.0711 (6.6330)  at: 6.0711 (6.6330)  at_unscaled: 6.0711 (6.6330)  time: 0.7585  data: 0.0372  max mem: 29528
    Epoch: [0]  [ 170/3696]  eta: 0:49:10  lr: 0.000100  loss: 6.0247 (6.5969)  at: 6.0247 (6.5969)  at_unscaled: 6.0247 (6.5969)  time: 0.7769  data: 0.0358  max mem: 29528
    Epoch: [0]  [ 180/3696]  eta: 0:49:27  lr: 0.000100  loss: 5.9822 (6.5631)  at: 5.9822 (6.5631)  at_unscaled: 5.9822 (6.5631)  time: 0.8812  data: 0.0361  max mem: 29528
    Epoch: [0]  [ 190/3696]  eta: 0:49:06  lr: 0.000100  loss: 5.9351 (6.5278)  at: 5.9351 (6.5278)  at_unscaled: 5.9351 (6.5278)  time: 0.8712  data: 0.0371  max mem: 29528
    Epoch: [0]  [ 200/3696]  eta: 0:48:45  lr: 0.000100  loss: 5.8904 (6.4953)  at: 5.8904 (6.4953)  at_unscaled: 5.8904 (6.4953)  time: 0.7744  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 210/3696]  eta: 0:48:35  lr: 0.000100  loss: 5.8645 (6.4635)  at: 5.8645 (6.4635)  at_unscaled: 5.8645 (6.4635)  time: 0.7968  data: 0.0348  max mem: 29528
    Epoch: [0]  [ 220/3696]  eta: 0:48:17  lr: 0.000100  loss: 5.8032 (6.4343)  at: 5.8032 (6.4343)  at_unscaled: 5.8032 (6.4343)  time: 0.7998  data: 0.0354  max mem: 29528
    Epoch: [0]  [ 230/3696]  eta: 0:47:58  lr: 0.000100  loss: 5.7949 (6.4067)  at: 5.7949 (6.4067)  at_unscaled: 5.7949 (6.4067)  time: 0.7687  data: 0.0362  max mem: 29528
    Epoch: [0]  [ 240/3696]  eta: 0:47:45  lr: 0.000100  loss: 5.7568 (6.3776)  at: 5.7568 (6.3776)  at_unscaled: 5.7568 (6.3776)  time: 0.7808  data: 0.0371  max mem: 29528
    Epoch: [0]  [ 250/3696]  eta: 0:47:30  lr: 0.000100  loss: 5.7063 (6.3502)  at: 5.7063 (6.3502)  at_unscaled: 5.7063 (6.3502)  time: 0.7889  data: 0.0366  max mem: 29528
    Epoch: [0]  [ 260/3696]  eta: 0:47:11  lr: 0.000100  loss: 5.6821 (6.3225)  at: 5.6821 (6.3225)  at_unscaled: 5.6821 (6.3225)  time: 0.7617  data: 0.0362  max mem: 29528
    Epoch: [0]  [ 270/3696]  eta: 0:47:00  lr: 0.000100  loss: 5.6091 (6.2965)  at: 5.6091 (6.2965)  at_unscaled: 5.6091 (6.2965)  time: 0.7725  data: 0.0366  max mem: 29528
    Epoch: [0]  [ 280/3696]  eta: 0:46:48  lr: 0.000100  loss: 5.6024 (6.2713)  at: 5.6024 (6.2713)  at_unscaled: 5.6024 (6.2713)  time: 0.7982  data: 0.0366  max mem: 29528
    Epoch: [0]  [ 290/3696]  eta: 0:46:48  lr: 0.000100  loss: 5.5578 (6.2455)  at: 5.5578 (6.2455)  at_unscaled: 5.5578 (6.2455)  time: 0.8433  data: 0.0370  max mem: 29528
    Epoch: [0]  [ 300/3696]  eta: 0:46:36  lr: 0.000100  loss: 5.5396 (6.2221)  at: 5.5396 (6.2221)  at_unscaled: 5.5396 (6.2221)  time: 0.8398  data: 0.0373  max mem: 29528
    Epoch: [0]  [ 310/3696]  eta: 0:46:23  lr: 0.000100  loss: 5.5059 (6.1994)  at: 5.5059 (6.1994)  at_unscaled: 5.5059 (6.1994)  time: 0.7842  data: 0.0374  max mem: 29528
    Epoch: [0]  [ 320/3696]  eta: 0:46:12  lr: 0.000100  loss: 5.4888 (6.1767)  at: 5.4888 (6.1767)  at_unscaled: 5.4888 (6.1767)  time: 0.7882  data: 0.0370  max mem: 29528
    Epoch: [0]  [ 330/3696]  eta: 0:45:58  lr: 0.000100  loss: 5.4756 (6.1560)  at: 5.4756 (6.1560)  at_unscaled: 5.4756 (6.1560)  time: 0.7820  data: 0.0365  max mem: 29528
    Epoch: [0]  [ 340/3696]  eta: 0:45:49  lr: 0.000100  loss: 5.4458 (6.1354)  at: 5.4458 (6.1354)  at_unscaled: 5.4458 (6.1354)  time: 0.7886  data: 0.0363  max mem: 29528
    Epoch: [0]  [ 350/3696]  eta: 0:45:42  lr: 0.000100  loss: 5.4504 (6.1157)  at: 5.4504 (6.1157)  at_unscaled: 5.4504 (6.1157)  time: 0.8230  data: 0.0364  max mem: 29528
    Epoch: [0]  [ 360/3696]  eta: 0:45:34  lr: 0.000100  loss: 5.4683 (6.0973)  at: 5.4683 (6.0973)  at_unscaled: 5.4683 (6.0973)  time: 0.8292  data: 0.0370  max mem: 29528
    Epoch: [0]  [ 370/3696]  eta: 0:45:30  lr: 0.000100  loss: 5.4665 (6.0802)  at: 5.4665 (6.0802)  at_unscaled: 5.4665 (6.0802)  time: 0.8410  data: 0.0357  max mem: 29528
    Epoch: [0]  [ 380/3696]  eta: 0:45:22  lr: 0.000100  loss: 5.4943 (6.0647)  at: 5.4943 (6.0647)  at_unscaled: 5.4943 (6.0647)  time: 0.8443  data: 0.0360  max mem: 29528
    Epoch: [0]  [ 390/3696]  eta: 0:45:13  lr: 0.000100  loss: 5.4801 (6.0489)  at: 5.4801 (6.0489)  at_unscaled: 5.4801 (6.0489)  time: 0.8209  data: 0.0371  max mem: 29528
    Epoch: [0]  [ 400/3696]  eta: 0:45:14  lr: 0.000100  loss: 5.4442 (6.0338)  at: 5.4442 (6.0338)  at_unscaled: 5.4442 (6.0338)  time: 0.8706  data: 0.0372  max mem: 29528
    Epoch: [0]  [ 410/3696]  eta: 0:45:03  lr: 0.000100  loss: 5.4351 (6.0182)  at: 5.4351 (6.0182)  at_unscaled: 5.4351 (6.0182)  time: 0.8613  data: 0.0376  max mem: 29528
    Epoch: [0]  [ 420/3696]  eta: 0:44:50  lr: 0.000100  loss: 5.3845 (6.0028)  at: 5.3845 (6.0028)  at_unscaled: 5.3845 (6.0028)  time: 0.7759  data: 0.0373  max mem: 29528
    Epoch: [0]  [ 430/3696]  eta: 0:45:03  lr: 0.000100  loss: 5.3922 (5.9884)  at: 5.3922 (5.9884)  at_unscaled: 5.3922 (5.9884)  time: 0.9318  data: 0.0361  max mem: 29528
    Epoch: [0]  [ 440/3696]  eta: 0:44:50  lr: 0.000100  loss: 5.4115 (5.9759)  at: 5.4115 (5.9759)  at_unscaled: 5.4115 (5.9759)  time: 0.9331  data: 0.0361  max mem: 29528
    Epoch: [0]  [ 450/3696]  eta: 0:44:43  lr: 0.000100  loss: 5.4180 (5.9631)  at: 5.4180 (5.9631)  at_unscaled: 5.4180 (5.9631)  time: 0.8017  data: 0.0359  max mem: 29528
    Epoch: [0]  [ 460/3696]  eta: 0:44:29  lr: 0.000100  loss: 5.3881 (5.9501)  at: 5.3881 (5.9501)  at_unscaled: 5.3881 (5.9501)  time: 0.7948  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 470/3696]  eta: 0:44:18  lr: 0.000100  loss: 5.3906 (5.9391)  at: 5.3906 (5.9391)  at_unscaled: 5.3906 (5.9391)  time: 0.7668  data: 0.0371  max mem: 29528
    Epoch: [0]  [ 480/3696]  eta: 0:44:10  lr: 0.000100  loss: 5.3906 (5.9277)  at: 5.3906 (5.9277)  at_unscaled: 5.3906 (5.9277)  time: 0.8013  data: 0.0390  max mem: 29528
    Epoch: [0]  [ 490/3696]  eta: 0:44:03  lr: 0.000100  loss: 5.4143 (5.9179)  at: 5.4143 (5.9179)  at_unscaled: 5.4143 (5.9179)  time: 0.8300  data: 0.0391  max mem: 29528
    Epoch: [0]  [ 500/3696]  eta: 0:43:54  lr: 0.000100  loss: 5.4093 (5.9075)  at: 5.4093 (5.9075)  at_unscaled: 5.4093 (5.9075)  time: 0.8303  data: 0.0378  max mem: 29528
    Epoch: [0]  [ 510/3696]  eta: 0:43:43  lr: 0.000100  loss: 5.3890 (5.8972)  at: 5.3890 (5.8972)  at_unscaled: 5.3890 (5.8972)  time: 0.7958  data: 0.0367  max mem: 29528
    Epoch: [0]  [ 520/3696]  eta: 0:43:31  lr: 0.000100  loss: 5.3959 (5.8872)  at: 5.3959 (5.8872)  at_unscaled: 5.3959 (5.8872)  time: 0.7730  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 530/3696]  eta: 0:43:22  lr: 0.000100  loss: 5.3743 (5.8775)  at: 5.3743 (5.8775)  at_unscaled: 5.3743 (5.8775)  time: 0.7915  data: 0.0358  max mem: 29528
    Epoch: [0]  [ 540/3696]  eta: 0:43:12  lr: 0.000100  loss: 5.3725 (5.8675)  at: 5.3725 (5.8675)  at_unscaled: 5.3725 (5.8675)  time: 0.8013  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 550/3696]  eta: 0:43:02  lr: 0.000100  loss: 5.3403 (5.8580)  at: 5.3403 (5.8580)  at_unscaled: 5.3403 (5.8580)  time: 0.7922  data: 0.0349  max mem: 29528
    Epoch: [0]  [ 560/3696]  eta: 0:42:52  lr: 0.000100  loss: 5.3460 (5.8494)  at: 5.3460 (5.8494)  at_unscaled: 5.3460 (5.8494)  time: 0.7893  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 570/3696]  eta: 0:42:43  lr: 0.000100  loss: 5.3509 (5.8408)  at: 5.3509 (5.8408)  at_unscaled: 5.3509 (5.8408)  time: 0.7901  data: 0.0359  max mem: 29528
    Epoch: [0]  [ 580/3696]  eta: 0:42:31  lr: 0.000100  loss: 5.3509 (5.8328)  at: 5.3509 (5.8328)  at_unscaled: 5.3509 (5.8328)  time: 0.7762  data: 0.0358  max mem: 29528
    Epoch: [0]  [ 590/3696]  eta: 0:42:22  lr: 0.000100  loss: 5.3572 (5.8243)  at: 5.3572 (5.8243)  at_unscaled: 5.3572 (5.8243)  time: 0.7785  data: 0.0351  max mem: 29528
    Epoch: [0]  [ 600/3696]  eta: 0:42:11  lr: 0.000100  loss: 5.3541 (5.8163)  at: 5.3541 (5.8163)  at_unscaled: 5.3541 (5.8163)  time: 0.7857  data: 0.0343  max mem: 29528
    Epoch: [0]  [ 610/3696]  eta: 0:41:59  lr: 0.000100  loss: 5.3445 (5.8085)  at: 5.3445 (5.8085)  at_unscaled: 5.3445 (5.8085)  time: 0.7585  data: 0.0351  max mem: 29528
    Epoch: [0]  [ 620/3696]  eta: 0:41:54  lr: 0.000100  loss: 5.3499 (5.8015)  at: 5.3499 (5.8015)  at_unscaled: 5.3499 (5.8015)  time: 0.8055  data: 0.0354  max mem: 29528
    Epoch: [0]  [ 630/3696]  eta: 0:41:42  lr: 0.000100  loss: 5.3499 (5.7940)  at: 5.3499 (5.7940)  at_unscaled: 5.3499 (5.7940)  time: 0.8031  data: 0.0343  max mem: 29528
    Epoch: [0]  [ 640/3696]  eta: 0:41:31  lr: 0.000100  loss: 5.3273 (5.7865)  at: 5.3273 (5.7865)  at_unscaled: 5.3273 (5.7865)  time: 0.7553  data: 0.0356  max mem: 29528
    Epoch: [0]  [ 650/3696]  eta: 0:41:22  lr: 0.000100  loss: 5.3314 (5.7792)  at: 5.3314 (5.7792)  at_unscaled: 5.3314 (5.7792)  time: 0.7825  data: 0.0378  max mem: 29528
    Epoch: [0]  [ 660/3696]  eta: 0:41:16  lr: 0.000100  loss: 5.3259 (5.7719)  at: 5.3259 (5.7719)  at_unscaled: 5.3259 (5.7719)  time: 0.8199  data: 0.0371  max mem: 29528
    Epoch: [0]  [ 670/3696]  eta: 0:41:06  lr: 0.000100  loss: 5.2930 (5.7651)  at: 5.2930 (5.7651)  at_unscaled: 5.2930 (5.7651)  time: 0.8170  data: 0.0351  max mem: 29528
    Epoch: [0]  [ 680/3696]  eta: 0:40:57  lr: 0.000100  loss: 5.2930 (5.7582)  at: 5.2930 (5.7582)  at_unscaled: 5.2930 (5.7582)  time: 0.7851  data: 0.0354  max mem: 29528
    Epoch: [0]  [ 690/3696]  eta: 0:40:49  lr: 0.000100  loss: 5.2727 (5.7514)  at: 5.2727 (5.7514)  at_unscaled: 5.2727 (5.7514)  time: 0.8068  data: 0.0353  max mem: 29528
    Epoch: [0]  [ 700/3696]  eta: 0:40:41  lr: 0.000100  loss: 5.2917 (5.7451)  at: 5.2917 (5.7451)  at_unscaled: 5.2917 (5.7451)  time: 0.8184  data: 0.0348  max mem: 29528
    Epoch: [0]  [ 710/3696]  eta: 0:40:31  lr: 0.000100  loss: 5.2949 (5.7387)  at: 5.2949 (5.7387)  at_unscaled: 5.2949 (5.7387)  time: 0.7904  data: 0.0358  max mem: 29528
    Epoch: [0]  [ 720/3696]  eta: 0:40:21  lr: 0.000100  loss: 5.2874 (5.7325)  at: 5.2874 (5.7325)  at_unscaled: 5.2874 (5.7325)  time: 0.7719  data: 0.0376  max mem: 29528
    Epoch: [0]  [ 730/3696]  eta: 0:40:10  lr: 0.000100  loss: 5.2801 (5.7262)  at: 5.2801 (5.7262)  at_unscaled: 5.2801 (5.7262)  time: 0.7581  data: 0.0372  max mem: 29528
    Epoch: [0]  [ 740/3696]  eta: 0:40:02  lr: 0.000100  loss: 5.2634 (5.7196)  at: 5.2634 (5.7196)  at_unscaled: 5.2634 (5.7196)  time: 0.7769  data: 0.0357  max mem: 29528
    Epoch: [0]  [ 750/3696]  eta: 0:39:53  lr: 0.000100  loss: 5.2367 (5.7135)  at: 5.2367 (5.7135)  at_unscaled: 5.2367 (5.7135)  time: 0.8039  data: 0.0365  max mem: 29528
    Epoch: [0]  [ 760/3696]  eta: 0:39:43  lr: 0.000100  loss: 5.2874 (5.7082)  at: 5.2874 (5.7082)  at_unscaled: 5.2874 (5.7082)  time: 0.7800  data: 0.0367  max mem: 29528
    Epoch: [0]  [ 770/3696]  eta: 0:39:33  lr: 0.000100  loss: 5.2954 (5.7024)  at: 5.2954 (5.7024)  at_unscaled: 5.2954 (5.7024)  time: 0.7681  data: 0.0356  max mem: 29528
    Epoch: [0]  [ 780/3696]  eta: 0:39:23  lr: 0.000100  loss: 5.3127 (5.6975)  at: 5.3127 (5.6975)  at_unscaled: 5.3127 (5.6975)  time: 0.7632  data: 0.0361  max mem: 29528
    Epoch: [0]  [ 790/3696]  eta: 0:39:14  lr: 0.000100  loss: 5.3130 (5.6919)  at: 5.3130 (5.6919)  at_unscaled: 5.3130 (5.6919)  time: 0.7715  data: 0.0359  max mem: 29528
    Epoch: [0]  [ 800/3696]  eta: 0:39:06  lr: 0.000100  loss: 5.2498 (5.6860)  at: 5.2498 (5.6860)  at_unscaled: 5.2498 (5.6860)  time: 0.7954  data: 0.0369  max mem: 29528
    Epoch: [0]  [ 810/3696]  eta: 0:38:58  lr: 0.000100  loss: 5.2336 (5.6804)  at: 5.2336 (5.6804)  at_unscaled: 5.2336 (5.6804)  time: 0.8095  data: 0.0380  max mem: 29528
    Epoch: [0]  [ 820/3696]  eta: 0:38:50  lr: 0.000100  loss: 5.2354 (5.6755)  at: 5.2354 (5.6755)  at_unscaled: 5.2354 (5.6755)  time: 0.8130  data: 0.0356  max mem: 29528
    Epoch: [0]  [ 830/3696]  eta: 0:38:39  lr: 0.000100  loss: 5.2691 (5.6704)  at: 5.2691 (5.6704)  at_unscaled: 5.2691 (5.6704)  time: 0.7757  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 840/3696]  eta: 0:38:31  lr: 0.000100  loss: 5.2588 (5.6653)  at: 5.2588 (5.6653)  at_unscaled: 5.2588 (5.6653)  time: 0.7692  data: 0.0369  max mem: 29528
    Epoch: [0]  [ 850/3696]  eta: 0:38:23  lr: 0.000100  loss: 5.2564 (5.6606)  at: 5.2564 (5.6606)  at_unscaled: 5.2564 (5.6606)  time: 0.8133  data: 0.0363  max mem: 29528
    Epoch: [0]  [ 860/3696]  eta: 0:38:15  lr: 0.000100  loss: 5.2448 (5.6556)  at: 5.2448 (5.6556)  at_unscaled: 5.2448 (5.6556)  time: 0.8129  data: 0.0352  max mem: 29528
    Epoch: [0]  [ 870/3696]  eta: 0:38:05  lr: 0.000100  loss: 5.2326 (5.6506)  at: 5.2326 (5.6506)  at_unscaled: 5.2326 (5.6506)  time: 0.7795  data: 0.0351  max mem: 29528
    Epoch: [0]  [ 880/3696]  eta: 0:37:56  lr: 0.000100  loss: 5.2049 (5.6456)  at: 5.2049 (5.6456)  at_unscaled: 5.2049 (5.6456)  time: 0.7750  data: 0.0364  max mem: 29528
    Epoch: [0]  [ 890/3696]  eta: 0:37:47  lr: 0.000100  loss: 5.2049 (5.6407)  at: 5.2049 (5.6407)  at_unscaled: 5.2049 (5.6407)  time: 0.7812  data: 0.0367  max mem: 29528
    Epoch: [0]  [ 900/3696]  eta: 0:37:37  lr: 0.000100  loss: 5.1690 (5.6354)  at: 5.1690 (5.6354)  at_unscaled: 5.1690 (5.6354)  time: 0.7607  data: 0.0348  max mem: 29528
    Epoch: [0]  [ 910/3696]  eta: 0:37:31  lr: 0.000100  loss: 5.1836 (5.6309)  at: 5.1836 (5.6309)  at_unscaled: 5.1836 (5.6309)  time: 0.8035  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 920/3696]  eta: 0:37:22  lr: 0.000100  loss: 5.2129 (5.6261)  at: 5.2129 (5.6261)  at_unscaled: 5.2129 (5.6261)  time: 0.8221  data: 0.0381  max mem: 29528
    Epoch: [0]  [ 930/3696]  eta: 0:37:13  lr: 0.000100  loss: 5.1586 (5.6210)  at: 5.1586 (5.6210)  at_unscaled: 5.1586 (5.6210)  time: 0.7758  data: 0.0377  max mem: 29528
    Epoch: [0]  [ 940/3696]  eta: 0:37:05  lr: 0.000100  loss: 5.1586 (5.6162)  at: 5.1586 (5.6162)  at_unscaled: 5.1586 (5.6162)  time: 0.7975  data: 0.0355  max mem: 29528
    Epoch: [0]  [ 950/3696]  eta: 0:36:56  lr: 0.000100  loss: 5.1713 (5.6120)  at: 5.1713 (5.6120)  at_unscaled: 5.1713 (5.6120)  time: 0.7970  data: 0.0358  max mem: 29528
    Epoch: [0]  [ 960/3696]  eta: 0:36:47  lr: 0.000100  loss: 5.1839 (5.6077)  at: 5.1839 (5.6077)  at_unscaled: 5.1839 (5.6077)  time: 0.7714  data: 0.0367  max mem: 29528
    Epoch: [0]  [ 970/3696]  eta: 0:36:38  lr: 0.000100  loss: 5.1800 (5.6036)  at: 5.1800 (5.6036)  at_unscaled: 5.1800 (5.6036)  time: 0.7812  data: 0.0363  max mem: 29528
    Epoch: [0]  [ 980/3696]  eta: 0:36:30  lr: 0.000100  loss: 5.2028 (5.5995)  at: 5.2028 (5.5995)  at_unscaled: 5.2028 (5.5995)  time: 0.7996  data: 0.0349  max mem: 29528
    Epoch: [0]  [ 990/3696]  eta: 0:36:23  lr: 0.000100  loss: 5.2028 (5.5954)  at: 5.2028 (5.5954)  at_unscaled: 5.2028 (5.5954)  time: 0.8110  data: 0.0353  max mem: 29528
    Epoch: [0]  [1000/3696]  eta: 0:36:14  lr: 0.000100  loss: 5.1880 (5.5914)  at: 5.1880 (5.5914)  at_unscaled: 5.1880 (5.5914)  time: 0.7950  data: 0.0369  max mem: 29528
    Epoch: [0]  [1010/3696]  eta: 0:36:04  lr: 0.000100  loss: 5.1773 (5.5870)  at: 5.1773 (5.5870)  at_unscaled: 5.1773 (5.5870)  time: 0.7645  data: 0.0368  max mem: 29528
    Epoch: [0]  [1020/3696]  eta: 0:35:57  lr: 0.000100  loss: 5.2493 (5.5836)  at: 5.2493 (5.5836)  at_unscaled: 5.2493 (5.5836)  time: 0.7915  data: 0.0360  max mem: 29528
    Epoch: [0]  [1030/3696]  eta: 0:35:49  lr: 0.000100  loss: 5.1982 (5.5793)  at: 5.1982 (5.5793)  at_unscaled: 5.1982 (5.5793)  time: 0.8164  data: 0.0363  max mem: 29528
    Epoch: [0]  [1040/3696]  eta: 0:35:41  lr: 0.000100  loss: 5.1446 (5.5754)  at: 5.1446 (5.5754)  at_unscaled: 5.1446 (5.5754)  time: 0.8053  data: 0.0375  max mem: 29528
    Epoch: [0]  [1050/3696]  eta: 0:35:31  lr: 0.000100  loss: 5.1319 (5.5714)  at: 5.1319 (5.5714)  at_unscaled: 5.1319 (5.5714)  time: 0.7766  data: 0.0359  max mem: 29528
    Epoch: [0]  [1060/3696]  eta: 0:35:22  lr: 0.000100  loss: 5.2017 (5.5679)  at: 5.2017 (5.5679)  at_unscaled: 5.2017 (5.5679)  time: 0.7481  data: 0.0365  max mem: 29528
    Epoch: [0]  [1070/3696]  eta: 0:35:13  lr: 0.000100  loss: 5.2017 (5.5642)  at: 5.2017 (5.5642)  at_unscaled: 5.2017 (5.5642)  time: 0.7754  data: 0.0387  max mem: 29528
    Epoch: [0]  [1080/3696]  eta: 0:35:03  lr: 0.000100  loss: 5.1192 (5.5603)  at: 5.1192 (5.5603)  at_unscaled: 5.1192 (5.5603)  time: 0.7605  data: 0.0383  max mem: 29528
    Epoch: [0]  [1090/3696]  eta: 0:34:56  lr: 0.000100  loss: 5.1105 (5.5560)  at: 5.1105 (5.5560)  at_unscaled: 5.1105 (5.5560)  time: 0.7700  data: 0.0379  max mem: 29528
    Epoch: [0]  [1100/3696]  eta: 0:34:47  lr: 0.000100  loss: 5.1321 (5.5524)  at: 5.1321 (5.5524)  at_unscaled: 5.1321 (5.5524)  time: 0.8007  data: 0.0380  max mem: 29528
    Epoch: [0]  [1110/3696]  eta: 0:34:39  lr: 0.000100  loss: 5.1603 (5.5489)  at: 5.1603 (5.5489)  at_unscaled: 5.1603 (5.5489)  time: 0.7850  data: 0.0382  max mem: 29528
    Epoch: [0]  [1120/3696]  eta: 0:34:30  lr: 0.000100  loss: 5.1443 (5.5452)  at: 5.1443 (5.5452)  at_unscaled: 5.1443 (5.5452)  time: 0.7765  data: 0.0383  max mem: 29528
    Epoch: [0]  [1130/3696]  eta: 0:34:21  lr: 0.000100  loss: 5.1185 (5.5413)  at: 5.1185 (5.5413)  at_unscaled: 5.1185 (5.5413)  time: 0.7790  data: 0.0372  max mem: 29528
    Epoch: [0]  [1140/3696]  eta: 0:34:13  lr: 0.000100  loss: 5.0800 (5.5374)  at: 5.0800 (5.5374)  at_unscaled: 5.0800 (5.5374)  time: 0.7986  data: 0.0356  max mem: 29528
    Epoch: [0]  [1150/3696]  eta: 0:34:04  lr: 0.000100  loss: 5.1101 (5.5337)  at: 5.1101 (5.5337)  at_unscaled: 5.1101 (5.5337)  time: 0.7654  data: 0.0345  max mem: 29528
    Epoch: [0]  [1160/3696]  eta: 0:33:56  lr: 0.000100  loss: 5.1744 (5.5307)  at: 5.1744 (5.5307)  at_unscaled: 5.1744 (5.5307)  time: 0.7695  data: 0.0344  max mem: 29528
    Epoch: [0]  [1170/3696]  eta: 0:33:47  lr: 0.000100  loss: 5.1829 (5.5277)  at: 5.1829 (5.5277)  at_unscaled: 5.1829 (5.5277)  time: 0.7968  data: 0.0362  max mem: 29528
    Epoch: [0]  [1180/3696]  eta: 0:33:40  lr: 0.000100  loss: 5.1845 (5.5246)  at: 5.1845 (5.5246)  at_unscaled: 5.1845 (5.5246)  time: 0.8120  data: 0.0374  max mem: 29528
    Epoch: [0]  [1190/3696]  eta: 0:33:32  lr: 0.000100  loss: 5.1798 (5.5216)  at: 5.1798 (5.5216)  at_unscaled: 5.1798 (5.5216)  time: 0.8169  data: 0.0371  max mem: 29528
    Epoch: [0]  [1200/3696]  eta: 0:33:23  lr: 0.000100  loss: 5.1929 (5.5188)  at: 5.1929 (5.5188)  at_unscaled: 5.1929 (5.5188)  time: 0.7739  data: 0.0361  max mem: 29528
    Epoch: [0]  [1210/3696]  eta: 0:33:16  lr: 0.000100  loss: 5.1929 (5.5158)  at: 5.1929 (5.5158)  at_unscaled: 5.1929 (5.5158)  time: 0.7985  data: 0.0340  max mem: 29528
    Epoch: [0]  [1220/3696]  eta: 0:33:07  lr: 0.000100  loss: 5.1322 (5.5126)  at: 5.1322 (5.5126)  at_unscaled: 5.1322 (5.5126)  time: 0.8027  data: 0.0350  max mem: 29528
    Epoch: [0]  [1230/3696]  eta: 0:32:59  lr: 0.000100  loss: 5.1595 (5.5096)  at: 5.1595 (5.5096)  at_unscaled: 5.1595 (5.5096)  time: 0.7881  data: 0.0374  max mem: 29528
    Epoch: [0]  [1240/3696]  eta: 0:32:50  lr: 0.000100  loss: 5.1620 (5.5067)  at: 5.1620 (5.5067)  at_unscaled: 5.1620 (5.5067)  time: 0.7849  data: 0.0365  max mem: 29528
    Epoch: [0]  [1250/3696]  eta: 0:32:42  lr: 0.000100  loss: 5.1620 (5.5038)  at: 5.1620 (5.5038)  at_unscaled: 5.1620 (5.5038)  time: 0.7893  data: 0.0357  max mem: 29528
    Epoch: [0]  [1260/3696]  eta: 0:32:34  lr: 0.000100  loss: 5.1245 (5.5005)  at: 5.1245 (5.5005)  at_unscaled: 5.1245 (5.5005)  time: 0.8002  data: 0.0359  max mem: 29528
    Epoch: [0]  [1270/3696]  eta: 0:32:26  lr: 0.000100  loss: 5.1023 (5.4975)  at: 5.1023 (5.4975)  at_unscaled: 5.1023 (5.4975)  time: 0.8015  data: 0.0362  max mem: 29528
    Epoch: [0]  [1280/3696]  eta: 0:32:17  lr: 0.000100  loss: 5.1132 (5.4946)  at: 5.1132 (5.4946)  at_unscaled: 5.1132 (5.4946)  time: 0.7906  data: 0.0349  max mem: 29528
    Epoch: [0]  [1290/3696]  eta: 0:32:09  lr: 0.000100  loss: 5.1292 (5.4918)  at: 5.1292 (5.4918)  at_unscaled: 5.1292 (5.4918)  time: 0.7743  data: 0.0334  max mem: 29528
    Epoch: [0]  [1300/3696]  eta: 0:32:01  lr: 0.000100  loss: 5.1292 (5.4890)  at: 5.1292 (5.4890)  at_unscaled: 5.1292 (5.4890)  time: 0.7875  data: 0.0339  max mem: 29528
    Epoch: [0]  [1310/3696]  eta: 0:31:54  lr: 0.000100  loss: 5.1232 (5.4863)  at: 5.1232 (5.4863)  at_unscaled: 5.1232 (5.4863)  time: 0.8117  data: 0.0343  max mem: 29528
    Epoch: [0]  [1320/3696]  eta: 0:31:45  lr: 0.000100  loss: 5.1016 (5.4832)  at: 5.1016 (5.4832)  at_unscaled: 5.1016 (5.4832)  time: 0.8161  data: 0.0341  max mem: 29528
    Epoch: [0]  [1330/3696]  eta: 0:31:38  lr: 0.000100  loss: 5.0905 (5.4805)  at: 5.0905 (5.4805)  at_unscaled: 5.0905 (5.4805)  time: 0.8149  data: 0.0343  max mem: 29528
    Traceback (most recent call last):
      File "main.py", line 257, in <module>
        main(args)
      File "main.py", line 207, in main
        args.clip_max_norm, learning_rate_schedule)
      File "/opt/tiger/intro/Stable-Pix2Seq/engine.py", line 98, in train_one_epoch
        losses.backward()
      File "/home/tiger/.local/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/tiger/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; 213.75 MiB free; 29.95 GiB reserved in total by PyTorch)
    Traceback (most recent call last):
      File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
        main()
      File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main.py', '--coco_path', './coco2017/', '--batch_size', '4', '--lr', '0.0005', '--output_dir', './output']' returned non-zero exit status 1.
    Killing subprocess 5627
    Killing subprocess 5628
    Killing subprocess 5629
    Killing subprocess 5630
    Killing subprocess 5631
    Killing subprocess 5632
    Killing subprocess 5633
    
    opened by allanj 1
  • CUDA out of memory during training

    I was training Stable-Pix2Seq, and everything went fine until the 3rd training epoch. I wonder if there is an accumulation operation somewhere, or some tensors or variables that should have been deleted.

    opened by yshMars 0
  • about settings of learning rate

    In the original paper, the learning rate was set to 3e-3 and weight decay to 5e-2. Why do you use a learning rate of 1e-5 and weight decay of 1e-4 in the code? BTW, can you give the NLL loss when the model converges, just for reference? Thanks!

    opened by Yongxin-Zhu 0
Owner
peng gao
Young Scientist at Shanghai AI Lab
Replication of Pix2Seq with Pretrained Model

Pretrained-Pix2Seq We provide the pre-trained model of Pix2Seq. This version contains new data augmentation. The model is trained for 300 epochs and c

peng gao 51 Nov 22, 2022
Implementation of Pix2Seq in PyTorch

pix2seq-pytorch: Implementation of the Pix2Seq paper. Different from the paper: image input size 1280, bin size 1280, LambdaLR scheduler used instead of Linear

Tony Shin 9 Dec 15, 2022
Full body anonymization - Realistic Full-Body Anonymization with Surface-Guided GANs

Realistic Full-Body Anonymization with Surface-Guided GANs This is the official

Håkon Hukkelås 30 Nov 18, 2022
The PyTorch improved version of TPAMI 2017 paper: Face Alignment in Full Pose Range: A 3D Total Solution.

Face Alignment in Full Pose Range: A 3D Total Solution By Jianzhu Guo. [Updates] 2020.8.30: The pre-trained model and code of ECCV-20 are made public

Jianzhu Guo 3.4k Jan 2, 2023
A PaddlePaddle version of Neural Renderer, refer to its PyTorch version

Neural 3D Mesh Renderer in PaddlePaddle A PaddlePaddle version of Neural Renderer, refer to its PyTorch version Install Run: pip install neural-rende

AgentMaker 13 Jul 12, 2022
Code for "Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search"

Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search This is an implementation for our paper Contextual Non-Loca

Tencent YouTu Research 50 Dec 3, 2022
PyTorch implementation of "A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement."

FullSubNet This Git repository for the official PyTorch implementation of "A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech E

郝翔 357 Jan 4, 2023
NeRViS: Neural Re-rendering for Full-frame Video Stabilization

Neural Re-rendering for Full-frame Video Stabilization

Yu-Lun Liu 9 Jun 17, 2022
Neural Re-rendering for Full-frame Video Stabilization

NeRViS: Neural Re-rendering for Full-frame Video Stabilization Project Page | Video | Paper | Google Colab Setup Setup environment for [Yu and Ramamoo

Yu-Lun Liu 9 Jun 17, 2022
Puzzle-CAM: Improved localization via matching partial and full features.

Puzzle-CAM The official implementation of "Puzzle-CAM: Improved localization via matching partial and full features".

Sanghyun Jo 150 Nov 14, 2022
[CVPR2021 Oral] FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation.

FFB6D This is the official source code for the CVPR2021 Oral work, FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. (Arxiv) Tab

Yisheng (Ethan) He 201 Dec 28, 2022
Hybrid Neural Fusion for Full-frame Video Stabilization

FuSta: Hybrid Neural Fusion for Full-frame Video Stabilization Project Page | Video | Paper | Google Colab Setup Setup environment for [Yu and Ramamoo

Yu-Lun Liu 430 Jan 4, 2023
UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning This is the official PyTorch implementation for UniMoCo pape

dddzg 49 Jan 2, 2023
Providing the solutions for high-frequency trading (HFT) strategies using data science approaches (Machine Learning) on Full Orderbook Tick Data.

Modeling High-Frequency Limit Order Book Dynamics Using Machine Learning Framework to capture the dynamics of high-frequency limit order books. Overvi

Chang-Shu Chung 1.3k Jan 7, 2023
Code for NAACL 2021 full paper "Efficient Attentions for Long Document Summarization"

LongDocSum Code for NAACL 2021 paper "Efficient Attentions for Long Document Summarization" This repository contains data and models needed to reprodu

null 56 Jan 2, 2023
CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation

CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation (CVPR 2021, oral presentation) CoCosNet v2: Full-Resolution Correspondence

Microsoft 308 Dec 7, 2022