Training PSPNet in TensorFlow. Reproduces the performance reported in the paper.

Overview

A training reproduction of PSPNet.

(Updated 2021/04/09. The authors of PSPNet have provided a PyTorch implementation of PSPNet and of their new work, with support for sync batch norm; see https://github.com/hszhao/semseg.)

(Updated 2019/02/26. A major change of the code structure. For the previous version, check out v0.9: https://github.com/holyseven/PSPNet-TF-Reproduce/tree/v0.9.)

This is an implementation of PSPNet (from training to test) in pure TensorFlow (tested with TF1.12, Python 3).

  • Supported Backbones: ResNet-V1-50, ResNet-V1-101 and other ResNet-V1s can be easily added.
  • Supported Databases: ADE20K, SBD (Augmented Pascal VOC) and Cityscapes.
  • Supported Modes: training, validation and inference with multi-scale inputs.
  • More things: L2-SP regularization and a sync batch normalization implementation.

L2-SP Regularization

L2-SP regularization is a variant of L2 regularization. Instead of penalizing the distance to the origin as L2 does, L2-SP takes the pre-trained model as the reference point, i.e., the penalty is ||w - w0||^2, where w0 denotes the pre-trained weights. Simple but effective. More details about L2-SP can be found in the paper and the code.
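
To make the idea concrete, below is a minimal sketch of an L2-SP penalty in TF1-style code. It is only an illustration: the function, its arguments, and the rule of regularizing only variables whose names contain 'weights' are assumptions of this sketch, not the repo's actual implementation (see the code for that).

import tensorflow as tf

def l2_sp_penalty(wd_sp, wd_new, pretrained_values):
    # pretrained_values: dict mapping variable names to the pre-trained numpy
    # arrays (w0). Variables found there are pulled toward w0; new layers
    # (e.g. the PSP module and the classifier) fall back to plain L2.
    terms = []
    for v in tf.trainable_variables():
        if 'weights' not in v.name:
            continue  # assumption: only conv/fc weights are regularized
        w0 = pretrained_values.get(v.op.name)
        if w0 is not None:
            terms.append(wd_sp * tf.nn.l2_loss(v - tf.constant(w0)))  # (w - w0)^2 term
        else:
            terms.append(wd_new * tf.nn.l2_loss(v))  # plain L2 for new layers
    return tf.add_n(terms)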

If you find L2-SP useful for your research (not limited to image segmentation), please consider citing our work:

@inproceedings{li2018explicit,
  author    = {Li, Xuhong and Grandvalet, Yves and Davoine, Franck},
  title     = {Explicit Inductive Bias for Transfer Learning with Convolutional Networks},
  booktitle = {International Conference on Machine Learning (ICML)},
  pages     = {2830--2839},
  year      = {2018}
}

Sync Batch Norm

In image segmentation, the batch size is usually limited. A small batch size makes the gradients unstable and harms performance, especially for batch normalization layers. Multi-GPU settings do not help by default, because the statistics in a batch normalization layer are computed independently within each GPU. More discussion can be found here and here.

This repo resolves the problem in pure Python and pure TensorFlow by simply taking a list of per-GPU tensors as input. The main idea is in model/utils_mg.py.
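
As a rough sketch of the idea only (not the actual code in model/utils_mg.py, which handles more cases such as updating the moving averages), the statistics can be gathered over the whole list of per-GPU inputs so that every GPU normalizes with the same mean and variance:

import tensorflow as tf

def gather_bn_stats(list_input):
    # list_input: one NHWC tensor per GPU. Compute per-GPU first and second
    # moments, then average them so the batch statistics cover all GPUs
    # (assumes equal per-GPU batch sizes).
    means = [tf.reduce_mean(x, axis=[0, 1, 2]) for x in list_input]
    sq_means = [tf.reduce_mean(tf.square(x), axis=[0, 1, 2]) for x in list_input]
    mean = tf.add_n(means) / len(means)
    var = tf.add_n(sq_means) / len(sq_means) - tf.square(mean)
    return mean, var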

I do not know whether this is the first implementation of sync batch norm in TensorFlow, but there is already an implementation in PyTorch, along with some applications.

Update: there is another implementation that uses NCCL to gather statistics across GPUs; see tensorpack. However, TF1.1 does not support passing gradients through nccl_all_reduce. In addition, ppc64le with TF1.10, CUDA 9.0 and NCCL 1.3.5 was not able to run that code; I have no idea why and do not want to spend much time on it. Maybe NCCL2 can solve this.

Results

Numerical Results

  • Random scaling for all databases; random rotation for SBD.
  • Entries are mIoU, reported as SS/MS (single-scale / multi-scale test) on the validation set.
  • Welcome to correct and fill in the table.

Database                                    Backbone     L2           L2-SP
Cityscapes (train set: 3K)                  ResNet-50    76.9/?       77.9/?
                                            ResNet-101   77.9/?       78.6/?
Cityscapes (coarse + train set: 20K + 3K)   ResNet-50
                                            ResNet-101   80.0/80.9    80.1/81.2*
SBD                                         ResNet-50    76.5/?       76.6/?
                                            ResNet-101   77.5/79.2    78.5/79.9
ADE20K                                      ResNet-50    41.92/43.09
                                            ResNet-101   42.80/?

*This model gets 80.3 on the Cityscapes test set (1525 images) without post-processing methods.

Qualitative Results on Cityscapes

Devil Details

Training and Evaluation

Download the databases with the links: ADE20K, SBD (Augmented Pascal VOC) and Cityscapes.

Prepare the Cityscapes database by generating the *labelTrainIds.png images with createTrainIdLabelImgs, and then either change the code in database/reader.py or move undesired images to another directory.
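
For reference, one possible way to run it, assuming the cityscapesScripts package is installed and CITYSCAPES_DATASET points to the dataset root (check that tool's documentation for the exact usage):

CITYSCAPES_DATASET=/path/to/cityscapes python -m cityscapesscripts.preparation.createTrainIdLabelImgs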

Download pretrained models.

cd z_pretrained_weights
sh download_resnet_v1_101.sh

A script for training ResNet-50 on ADE20K, reaching around 41.92 mIoU (with single-scale test):

python ./run.py --network 'resnet_v1_50' --visible_gpus '0,1' --reader_method 'queue' --lrn_rate 0.01 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.001 --database 'ADE' --subsets_for_training 'train' --batch_size 8 --train_image_size 480 --snapshot 30000 --train_max_iter 90000 --test_image_size 480 --random_rotate 0 --fine_tune_filename './z_pretrained_weights/resnet_v1_50.ckpt'
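
PSPNet is trained with the "poly" learning-rate policy (exposed in this repo through flags such as --lrn_rate and --poly_lr). The snippet below is only a rough illustration of that schedule, not the repo's code; power=0.9 is the value from the paper and an assumption here.

import tensorflow as tf

def poly_lr(base_lr, global_step, max_iter, power=0.9):
    # "poly" policy: lr = base_lr * (1 - iter / max_iter)^power.
    ratio = 1.0 - tf.cast(global_step, tf.float32) / float(max_iter)
    return base_lr * tf.pow(ratio, power)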

Test and Infer

Test with multi-scale inputs (set batch_size as large as memory allows to speed things up).

python predict.py --visible_gpus '0' --network 'resnet_v1_101' --database 'ADE' --weights_ckpt './log/ADE/PSP-resnet_v1_101-gpu_num2-batch_size8-lrn_rate0.01-random_scale1-random_rotate1-480-60000-train-1-0.0001-0.001-0-0-1-1/snapshot/model.ckpt-60000' --test_subset 'val' --test_image_size 480 --batch_size 8 --ms 1 --mirror 1

Infer on one image (with multi-scale inputs).

python demo_infer.py --database 'Cityscapes' --network 'resnet_v1_101' --weights_ckpt './log/Cityscapes/old/model.ckpt-50000' --test_image_size 864 --batch_size 4 --ms 1

Uncertainties in Training Details:

  1. (Cityscapes only) Should the finely labeled data be involved in the first training stage?
  2. (Cityscapes only) Should the (base) learning rate be reduced in the second training stage?
  3. Should the logits be resized to the original size before computing the loss?
  4. Should new layers receive a larger learning rate?
  5. About the weird padding behavior of tf.image.resize_images(): should align_corners=True be set? (See the small example after this list.)
  6. What is the optimal decay hyperparameter for the statistics of the batch normalization layers? (0.9, 0.95, 0.9997)
  7. There may be more, and it is unclear how much these little changes affect the results ...
  8. Discussions are welcome!
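
A small, self-contained illustration of point 5 (toy values, not code from this repo): with align_corners=True the corner pixels of the input and output grids coincide, while the default align_corners=False shifts the sampling grid.

import tensorflow as tf

x = tf.constant([[[[0.0], [1.0], [2.0], [3.0]]]])  # shape (1, 1, 4, 1)
a = tf.image.resize_images(x, [1, 7], method=tf.image.ResizeMethod.BILINEAR,
                           align_corners=False)
b = tf.image.resize_images(x, [1, 7], method=tf.image.ResizeMethod.BILINEAR,
                           align_corners=True)
with tf.Session() as sess:
    print(sess.run(tf.squeeze(a)))  # roughly [0. 0.57 1.14 1.71 2.29 2.86 3.] -- uneven spacing
    print(sess.run(tf.squeeze(b)))  # [0. 0.5 1. 1.5 2. 2.5 3.] -- corners aligned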

Change Log

26 February, 2019

  • Code structure: on-the-fly evaluation during training.
  • Code structure: wrapping of the model.
  • Added tf.data support, but the queue-based reader is faster.
  • Print results using python utils.py in the experiment_manager directory.
  • The default environment is Python 3 and TF1.12. OpenCV is needed for predict.py and demo_infer.py.
  • The previous version becomes a branch of this repo, named v0.9.

External links

Pyramid Scene Parsing Network: paper and official GitHub.

Comments
  • Does loss regularization always improve accuracy?

    I trained ResNet-50 on Cityscapes with one GPU three times, differing only in the weight-decay strategy; the other parameters are the same as in your example. With L2-SP regularization, the precision on the val set is 69.93 mIoU; with L2 regularization, 69.68 mIoU; with no regularization, 72.15 mIoU. So my question is: does loss regularization really work? Why does the highest accuracy occur with no regularization? Or how can I change the hyper-parameters to improve the accuracy of L2-SP regularization?

    opened by zdluffy 13
  • About training on fine + coarse data set

    Hi, thanks for your nice work! Recently, I have been following this PSPNet work. I am curious about how you trained on the fine + coarse data set. Here are my ideas:

    1. Split training into 2 steps: train on the coarse or fine data set first, and then on the other.
    2. Mix the fine and coarse data sets and train only once.
    opened by xiongzhanblake 13
  • pre-training weight and 'auxiliary loss operations'

    Recently, I have been learning about semantic segmentation, and I was glad to find your repository pspnet-tf-fiction on GitHub. This is a particularly good reproduction. I have a few questions for you:

    1. How do you import the pre-trained weights of resnet_v1_101 into the model? I think the code is initialized with 'he' initialization.
    2. What is the role of the 'auxiliary loss operations' in the model? When I use resnet_v1_50.ckpt, it reports an error. When I filter it out, the output loss becomes larger or smaller. I have not learned enough about this part, and there is a lot to learn from you. Your help is very important to me, and I am looking forward to your reply.
    opened by hydxqing 11
  • Model failing to load

    The model is not loading; it gives an error. Below is the error.

    Tensor name "resnet_v1_50/block1/unit_1/bottleneck_v1/conv1/BatchNorm/beta" not found in checkpoint files /PSPNET/PSPNet-TF-Reproduce/z_pretrained_weights/resnet_v1_101.ckpt [[node save/RestoreV2 (defined at inference.py:71) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

    What could be the problem? I downloaded the model with the .sh file given in the code.

    opened by gulzainali98 9
  • loss or weight norm is nan. Training Stopped!

    < Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/gamma:0 >
    < Finetuning Process: not import resnet_v1_101/block4/unit_4/weights:0 >
    < Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/beta:0 >
    < Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/gamma:0 >
    < Finetuning Process: not import resnet_v1_101/logits/weights:0 >
    < Finetuning Process: not import resnet_v1_101/logits/biases:0 >
    < Succesfully loaded fine-tune model from /nfs/private/PSPNet-TF-Reproduce-master/z_pretrained_weights/resnet_v1_101.ckpt. >

    < training process begins >

    loss or weight norm is nan. Training Stopped!

    opened by horizonheart 9
  • Num of classes of ADE20k, 150 or 151?

    Thank you for reproducing PSPNet. I see that you set num_classes = 150 instead of 151, while 0 represents the 'other' class. I think this may cause a small difference during training and testing.

    opened by REFunction 8
  • Memory Cost Increase

    I tried the sync BN provided by your code and found that the memory cost increases tremendously. My experimental environment includes 4 Titan X GPUs, which can fit 5 batches per GPU when using DeepLabv3+ as the solution, while I can only fit 1 batch per GPU after adopting sync BN. The running time also increases.

    Could this be caused by the absence of NVLink on my servers?

    opened by tabrisweapon 6
  • Memory cost too much

    Hi, I am trying to train PSPNet on Cityscapes (only the train set), but after several iterations I always get ResourceExhaustedError. I tried to reduce the batch_size and the train_size from 864 to 512, and I use 4 TITAN Xp GPUs (12G). Could you give me some advice? Thank you so much!

    python ./run.py --network 'resnet_v1_101' --visible_gpus '0,1,2,3' --reader_method 'queue' --batch_size 4 --poly_lr 1 --lrn_rate 0.01 --momentum 0.9 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.001 --database 'Cityscapes' --subsets_for_training 'train' --train_image_size 512 --snapshot 10000 --train_max_iter 50000 --test_image_size 512 --random_rotate 0 --fine_tune_filename './z_pretrained_weights/resnet_v1_101.ckpt'

    opened by zhiyuli3 5
  • Training customised data set. Getting loss or weight norm is nan. Training Stopped!

    I have converted my dataset to ADE-format annotation images.

    I have only two classes, so the annotation images contain only the pixel values 1 and 2; all remaining pixels are 0.

    I have used this command:

    python ./run.py --network 'resnet_v1_50' --visible_gpus '0,1' --reader_method 'queue' --lrn_rate 0.0001 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.001 --database 'ADE' --subsets_for_training 'train' --batch_size 2 --train_image_size 480 --snapshot 30000 --train_max_iter 90000 --test_image_size 480 --random_rotate 0 --fine_tune_filename './z_pretrained_weights/resnet_v1_50.ckpt'

    After some iterations (650), I get the following error:

    loss or weight norm is nan. Training Stopped!

    I have seen issue #15, but it did not work for me.

    I think my dataset representation may be wrong, e.g., the number of classes.

    Is there any way to check whether my custom dataset is in the correct ADE format?

    Please help me out. Thanks!

    opened by lakshmankanakala 4
  • Loss or weight nan error on ADE dataset

    I tried to train on the ADE dataset, but I still met the error reported in #15. There are two differences from the example script (3.b):

    1. I used --batch_size 2 --gpu_num 4 because of GPU memory limitations, but I decreased the --lrn_rate to 0.00001 as suggested in #15.

    2. I used the resnet_v1_101 network and resnet_v1_101.ckpt as the pretrained model.

    My TensorFlow version is 1.8.0. Any idea about this error? Thanks!

    opened by lcybuzz 4
  • Shouldn't we use moving_mean and moving_var in training mode instead of batch mean and var?

                if 'train' in stats_mode:
                    xn = tf.nn.batch_normalization(
                        list_input[i], mean, var, beta, gamma, bn_epsilon)
                    if tf.get_variable_scope().reuse or 'gather' not in stats_mode:
                        list_output.append(xn)
                    else:
                        # gather stats and it is the main gpu device.
                        xn = update_bn_ema(xn, mean, var, moving_mean, moving_var, bn_ema)
                        list_output.append(xn)
                else:
                    xn = tf.nn.batch_normalization(
                        list_input[i], moving_mean, moving_var, beta, gamma, bn_epsilon)
                    list_output.append(xn)
    
    opened by jonhe88 4
  • Performance issue in /database (by P3)

    Hello! I've found a performance issue in /reader.py: dataset.batch(batch_size) (here) should be called before .map(_training_data_preprocess, num_parallel_calls=batch_size) (here), which could make your program more efficient.

    Here is the tensorflow document to support it.

    Besides, you need to check whether the function _training_data_preprocess called in .map(_training_data_preprocess, num_parallel_calls=batch_size) is affected, to make the changed code work properly. For example, if _training_data_preprocess needed data with shape (x, y, z) as its input before the fix, it would require data with shape (batch_size, x, y, z) after the fix.

    Looking forward to your reply. By the way, I would be very glad to create a PR to fix it if you are too busy.

    opened by DLPerf 1
  • Binary segmentation on a custom dataset

    Hi, I am new to segmentation and trying to use this repo to train the network from scratch on a custom dataset with only 2 classes (binary segmentation). Could you please guide me as to where I need to make changes in the code? Also, is it possible to train if a dataset has images with 4 channels or 1 channel instead of 3? Any help would be highly appreciated. Thanks!!

    opened by cspearl 7