A full-fledged version of Pix2Seq

What it is. This is a full-fledged version of Pix2Seq. Compared with Unofficial-Pix2Seq, Stable-Pix2Seq contains most of the tricks described in the Pix2Seq paper, such as sequence augmentation, batch repetition, warmup, a linearly decaying learning rate, and beam search (to be added later).

Differences from Pix2Seq. In sequence augmentation, we only add random bounding boxes, while the original paper also mixes in virtual boxes made from ground-truth boxes plus noise. Pix2Seq also uses input sequence dropout to regularize the training process.
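The random-box variant of sequence augmentation described above can be sketched as follows. This is a minimal illustration, not the repository's code: the token layout (four quantized coordinates followed by one class token), the number of bins, and the `noise_class` id are all assumptions made for the example.

```python
import random

def quantize(value, num_bins):
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    return min(int(value * num_bins), num_bins - 1)

def random_box(rng):
    """Sample a random normalized box (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1."""
    x0, x1 = sorted((rng.random(), rng.random()))
    y0, y1 = sorted((rng.random(), rng.random()))
    return x0, y0, x1, y1

def augment_sequence(real_objects, max_objects, num_bins, noise_class, rng):
    """Pad a list of (box, class_id) objects with random noise boxes.

    Returns a flat token list with 4 coordinate tokens and 1 class token
    per object. The layout and noise_class are hypothetical choices.
    """
    objects = list(real_objects)
    while len(objects) < max_objects:
        objects.append((random_box(rng), noise_class))
    tokens = []
    for box, class_id in objects:
        tokens.extend(quantize(c, num_bins) for c in box)
        tokens.append(class_id)
    return tokens

rng = random.Random(0)
tokens = augment_sequence([((0.25, 0.5, 0.75, 1.0), 3)], 4, 1000, 91, rng)
```

Here one real object is padded to four objects, so the sequence holds 20 tokens. Mixing in jittered copies of ground-truth boxes, as the original paper does, would be a separate step that this sketch leaves out.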

Usage - Object detection

There are no extra compiled components in Stable-Pix2Seq and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install the dependencies via conda. First, clone the repository locally:

git clone https://github.com/gaopengcuhk/Stable-Pix2Seq.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it; you should now be ready to train and evaluate detection models.
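To confirm that the installation steps worked, a small check like the one below may help. It is only a sketch: the helper name `missing_packages` is made up for this example, and it tests whether each package can be found, not whether its version meets the minimums listed above.

```python
import importlib.util

# Import names of the dependencies installed in the steps above.
REQUIRED = ("torch", "torchvision", "scipy", "pycocotools")

def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages()
print("missing:", ", ".join(missing) if missing else "none")
```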

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
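Before launching a long run, it can be worth checking that the dataset directory matches this layout. The snippet below is a hedged sketch; `missing_coco_dirs` is a hypothetical helper, and it checks only the three directory names listed above, not the files inside them.

```python
from pathlib import Path

# Sub-directories expected under the path passed as --coco_path.
EXPECTED_DIRS = ("annotations", "train2017", "val2017")

def missing_coco_dirs(coco_path):
    """Return the expected sub-directories that are absent under coco_path."""
    root = Path(coco_path)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]

print(missing_coco_dirs("./coco/"))
```

An empty list means all three directories were found.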


To train the baseline Stable-Pix2Seq on a single node with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --master_port=3141 --nproc_per_node 8 --use_env main.py --coco_path ./coco/ --batch_size 4 --lr 0.0005

A single epoch takes 50 minutes on 8 V100 GPUs, so 300 epochs of training take around 10 days on a single machine with 8 V100 cards.
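The 10-day estimate follows directly from the per-epoch time. The short calculation below spells it out, using only the two figures quoted above.

```python
minutes_per_epoch = 50
epochs = 300

total_hours = minutes_per_epoch * epochs / 60  # 15000 minutes = 250 hours
total_days = total_hours / 24                  # a little over 10 days

print(f"{total_hours:.0f} hours, about {total_days:.1f} days")
```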

Why is it slower than DETR and Unofficial-Pix2Seq? Stable-Pix2Seq uses batch repetition, which doubles the training time. In addition, Stable-Pix2Seq uses an image resolution of 1333, while the time reported for Unofficial-Pix2Seq was measured at the lower resolution of 512.
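Repeating each sample in the batch can be pictured as the dataset returning two independently augmented views of every sample, with the collate step flattening them into one batch twice as large. The sketch below uses plain Python lists as stand-in images. The function names, the flip augmentation, and the concatenation order are assumptions for illustration, not the repository's actual code.

```python
import random

def augment(image, rng):
    """Stand-in augmentation: randomly flip a 1-D 'image' (a list of pixels)."""
    return image[::-1] if rng.random() < 0.5 else list(image)

def get_item(image, target, rng):
    """Return two independently augmented views of the same sample."""
    return (augment(image, rng), target), (augment(image, rng), target)

def collate_repeated(samples):
    """Flatten a list of view pairs, so N samples give a batch of 2N entries."""
    images, targets = [], []
    for view_a, view_b in samples:
        for img, tgt in (view_a, view_b):
            images.append(img)
            targets.append(tgt)
    return images, targets

rng = random.Random(0)
samples = [get_item([1, 2, 3], {"id": i}, rng) for i in range(4)]
images, targets = collate_repeated(samples)
print(len(images))  # twice the number of samples drawn
```

Because every sample passes through the model twice per step, the cost per epoch roughly doubles, which matches the slowdown described above.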

We train Stable-Pix2Seq with AdamW, setting the learning rate with a linear warmup and decay schedule. Due to batch repetition, the real batch size is 64. Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have min size 800 and max size 1333. The transformer is trained with dropout of 0.1, and the whole model is trained with grad clip of 0.1.
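The numbers in this paragraph can be made concrete with a small sketch. It shows one possible linear warmup and decay multiplier, the batch-size arithmetic (8 GPUs, `--batch_size 4`, and an assumed repeat factor of 2), and one common way to apply the 800/1333 size limits. The warmup length, the function names, and the rounding are assumptions; the repository may implement these steps differently.

```python
def lr_factor(step, warmup_steps, total_steps):
    """Multiplier on the base learning rate.

    Rises linearly from 0 to 1 during warmup, then decays linearly to 0.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

def resized_size(width, height, min_size=800, max_size=1333):
    """Scale the shorter side to min_size unless the longer side would exceed max_size."""
    scale = min_size / min(width, height)
    if max(width, height) * scale > max_size:
        scale = max_size / max(width, height)
    return round(width * scale), round(height * scale)

# Effective batch size: GPUs x per-GPU batch x repeat factor (assumed to be 2).
num_gpus, per_gpu_batch, repeats = 8, 4, 2
effective_batch = num_gpus * per_gpu_batch * repeats

print(effective_batch, lr_factor(5, 10, 100), resized_size(640, 480))
```

In PyTorch, a function like `lr_factor` could be bound to fixed step counts with `functools.partial` and passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.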

Please use the learning rate 0.0005 with caution. It was tested with a batch size of 198.


To evaluate Stable-Pix2Seq R50 on COCO val5k with multiple GPUs, run:

python -m torch.distributed.launch --master_port=3142 --nproc_per_node 8 --use_env main.py --coco_path ./coco/ --batch_size 4 --eval --resume checkpoint.pth



  • Why do we want to create two samples in `get_item`


    I'm trying to understand the following code:


    This part is also different from the code in DETR. I'm wondering what's the design principle of transforming two samples.

    As far as I can see, the collate function just concatenates them together.


    opened by allanj 1
  • How to use this for panoptic segmentation


    Can you share the command line for training panoptic segmentation? I am using the command `python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --coco_path ... --coco_panoptic_path ... --masks`, but there are some errors.

    opened by byrsongyuxin 1
  • Return values in the coco.py __get_item__ method


    Why do the return values here contain two images and two targets? Shouldn't it simply be `if self._transforms is not None: img, target = self._transforms(img, target)` followed by `return img, target`?

    opened by RishabhMaheshwary 0
  • Extract token embedding


    I want to extract the token embedding as shown in figure 11 of the paper.

    However, when looking at the code, I see that the tokens are predicted by feeding the output feature map to an MLP whose last layer has dimension 2003 (maybe the number of tokens). Hence, the model does not actually learn a token embedding, and we can't get the learned token embedding.

    Am I missing something?

    opened by elituan 0
  • CUDA Out-of-memory using V100


    I'm using a V100 for experiments, but it still runs out of memory in the middle of the training process. I'm not sure what the reason is at the moment.

    opened by allanj 1
  • CUDA out of memory during training


    I was training 'Stable Pix2Seq', and everything went fine until the 3rd training epoch. I wonder whether there is an accumulating operation, or some tensors or variables that should have been deleted.

    opened by yshMars 0
  • About the learning rate settings


    In the original paper, the learning rate was set to 3e-3 and weight decay to 5e-2. Why do you use a learning rate of 1e-5 and weight decay of 1e-4 in the code? BTW, can you give the NLL loss when the model converges, just for reference? Thanks!

    opened by Yongxin-Zhu 0
peng gao
Young Scientist at Shanghai AI Lab
