Development Kit for MIT Scene Parsing Benchmark

[NEW!] Our PyTorch implementation is released in the following repository:

https://github.com/hangzhaomit/semantic-segmentation-pytorch

Introduction

Table of contents:

  • Overview of scene parsing benchmark
  • Benchmark details
    1. Image list and annotations
    2. Submission format
    3. Evaluation routines
  • Pretrained models

Please open an issue for questions, comments, and bug reports.

Overview of Scene Parsing Benchmark

The goal of this benchmark is to segment and parse an image into different image regions associated with semantic categories, such as sky, road, person, and bed. It is similar to the semantic segmentation tasks in the COCO and PASCAL datasets, but the data is more scene-centric and covers a more diverse range of object categories. The data for this benchmark comes from the ADE20K dataset (the full dataset will be released after the benchmark), which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the benchmark data is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included in the benchmark for evaluation, comprising stuff categories such as sky, road, and grass, and discrete objects such as person, car, and bed. Note that the distribution of objects across the images is non-uniform, mimicking the natural occurrence of objects in daily scenes.

The webpage of the benchmark is at http://sceneparsing.csail.mit.edu, where you can download the data.

Benchmark details

Data

There are three subsets of data: training, validation, and testing. The training set contains 20,210 images and the validation set contains 2,000 images. The testing set contains 2,000 images and will be released in mid-August. Each image in the training and validation sets has an annotation mask indicating the label of each pixel in the image.

After untarring the data file (please download it from http://sceneparsing.csail.mit.edu), the directory structure should be similar to the following:

the training images:

images/training/ADE_train_00000001.jpg
images/training/ADE_train_00000002.jpg
    ...
images/training/ADE_train_00020210.jpg

the corresponding annotation masks for the training images:

annotations/training/ADE_train_00000001.png
annotations/training/ADE_train_00000002.png
    ...
annotations/training/ADE_train_00020210.png

the validation images:

images/validation/ADE_val_00000001.jpg
images/validation/ADE_val_00000002.jpg
    ...
images/validation/ADE_val_00002000.jpg

the corresponding annotation masks for the validation images:

annotations/validation/ADE_val_00000001.png
annotations/validation/ADE_val_00000002.png
    ...
annotations/validation/ADE_val_00002000.png

The testing images will be released in a separate file in mid-August. The directory structure will be: images/testing/ADE_test_00000001.jpg ...

Note: annotation masks contain labels ranging from 0 to 150, where 0 refers to "other objects". Those pixels are not considered in the evaluation.

objectInfo150.txt contains information about the 150 semantic categories, including their indices, pixel ratios, and names.
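
As an illustration, here is a minimal Python sketch (using Pillow and NumPy) for loading an image/mask pair and looking up class names; the exact column layout of objectInfo150.txt is an assumption here, so treat the parsing as a sketch rather than a reference implementation.

    import numpy as np
    from PIL import Image

    # Load a training image together with its annotation mask
    # (paths follow the directory layout shown above).
    image = Image.open('images/training/ADE_train_00000001.jpg')
    mask = np.array(Image.open('annotations/training/ADE_train_00000001.png'))

    # Labels range from 0 to 150; 0 marks "other objects" and is ignored in evaluation.
    print(np.unique(mask))

    # objectInfo150.txt is assumed to be a tab-separated table whose first column
    # is the class index and whose last column is the class name.
    names = {}
    with open('objectInfo150.txt') as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.rstrip('\n').split('\t')
            names[int(fields[0])] = fields[-1]
    print(list(names.items())[:3])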

Submission format to the evaluation server

To evaluate an algorithm on the test set of the benchmark (link: http://sceneparsing.csail.mit.edu/eval/), participants are required to upload to the evaluation server a zip file containing the predicted annotation masks for the given testing images. Each predicted annotation mask should have the same name as the corresponding testing image, with the filename extension png instead of jpg. For example, the predicted annotation mask for ADE_test_00000001.jpg should be named ADE_test_00000001.png.

Participants should check the zip file to make sure it can be decompressed correctly.
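
A minimal Python sketch of one way to package the predictions, assuming they have already been saved as PNG files under a hypothetical predictions/ directory:

    import glob
    import os
    import zipfile

    # Collect the predicted masks named after the test images and zip them up.
    with zipfile.ZipFile('submission.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
        for png in sorted(glob.glob('predictions/ADE_test_*.png')):
            zf.write(png, arcname=os.path.basename(png))

    # Sanity check that the archive decompresses correctly before uploading.
    with zipfile.ZipFile('submission.zip') as zf:
        bad = zf.testzip()
        assert bad is None, 'corrupted member: %s' % bad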

Interclass similarity

Some of the semantic classes in this dataset show a degree of visual and semantic similarity to one another. To quantify these similarities, we include a matrix in human_semantic_similarity.mat, which contains human-perceived similarities between the 150 categories and can be used when training segmentation models. demoSimilarity.m shows how to use this file.
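
For Python users, a minimal sketch of inspecting the matrix with SciPy; the variable name stored inside the .mat file is an assumption, so the available keys are listed first.

    import scipy.io as sio

    data = sio.loadmat('human_semantic_similarity.mat')
    # List the variables stored in the file (skip MATLAB metadata entries).
    print([k for k in data if not k.startswith('__')])

    # Assuming one of the entries is a 150x150 human-perceived similarity matrix,
    # e.g. a hypothetical key 'similarity':
    # sim = data['similarity']
    # print(sim.shape)  # expected: (150, 150)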

Evaluation routines

The performance of the segmentation algorithms will be evaluated by the mean of (1) the pixel-wise accuracy over all labeled pixels, and (2) the IoU (intersection over union) averaged over all 150 semantic categories.

Intersection over Union = (true positives) / (true positives + false positives + false negatives)
Pixel-wise Accuracy = (correctly classified pixels) / (labeled pixels)
Final score = (Pixel-wise Accuracy + mean(Intersection over Union)) / 2
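
A minimal Python sketch of these metrics for a single prediction/label pair is shown below. The official accounting in demoEvaluation.m accumulates intersections and unions over the whole set before averaging over the 150 classes, so treat this only as an illustration; pred and gt are integer label maps in the range 0-150, and pixels labeled 0 are ignored.

    import numpy as np

    def evaluate(pred, gt, num_classes=150):
        """Pixel accuracy, mean IoU, and final score for one prediction/label pair."""
        valid = gt > 0                                   # ignore pixels labeled 0
        pixel_acc = np.sum((pred == gt) & valid) / np.sum(valid)

        ious = []
        for c in range(1, num_classes + 1):
            pred_c = (pred == c) & valid
            gt_c = gt == c
            union = np.sum(pred_c | gt_c)
            if union > 0:                                # skip classes absent from both
                ious.append(np.sum(pred_c & gt_c) / union)
        mean_iou = float(np.mean(ious))

        return pixel_acc, mean_iou, (pixel_acc + mean_iou) / 2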

Demo code

In demoEvaluation.m, we have included our implementation of the standard evaluation metrics (pixel-wise accuracy and IoU) for the benchmark. As mentioned before, we ignore pixels labeled with 0's.

Please change the paths at the beginning of the code accordingly to evaluate your own results. If it runs correctly, you should see output similar to:

Mean IoU over 150 classes: 0.1000
Pixel-wise Accuracy: 100.00%

In this case, we will take (0.1+1.0)/2=0.55 as your final score.

We have also provided demoVisualization.m, which helps you to visualize individual image results.
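
If you prefer Python, a rough equivalent for colorizing a predicted mask with the palette shipped in color150.mat is sketched below; the variable name 'colors' inside the .mat file is an assumption, and the prediction path is hypothetical.

    import numpy as np
    import scipy.io as sio
    from PIL import Image

    colors = sio.loadmat('color150.mat')['colors']       # assumed: 150 x 3 uint8 palette
    pred = np.array(Image.open('ADE_val_00000001.png'))  # hypothetical predicted mask

    vis = np.zeros((pred.shape[0], pred.shape[1], 3), dtype=np.uint8)
    for c in range(1, 151):
        vis[pred == c] = colors[c - 1]                   # label 0 stays black
    Image.fromarray(vis).save('ADE_val_00000001_color.png')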

Training code

We provide training code for three popular frameworks: Caffe, Torch7, and PyTorch (https://github.com/CSAILVision/sceneparsing/tree/master/trainingCode). You might need to modify the paths and the data loader code accordingly to get everything running on your own machine.

Pre-trained models

We release the pre-trained models for scene parsing at http://sceneparsing.csail.mit.edu/model/. The demo code, along with the model download links, is at https://github.com/CSAILVision/sceneparsing/blob/master/demoSegmentation.m. The models may be used for research only. Details of how the models were trained are given in the reference below. The performance of the models on the validation set of MIT SceneParse150 is as follows:

[Table: pre-trained model performance on the validation set]

The qualitative results of the models are below:

[Figure: qualitative prediction results]

Reference

If you find this scene parsing benchmark, the data, or the pre-trained models useful, please cite the following papers:

Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. (http://people.csail.mit.edu/bzhou/publication/scene-parse-camera-ready.pdf)

@inproceedings{zhou2017scene,
    title={Scene Parsing through ADE20K Dataset},
    author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    year={2017}
}

Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442. (https://arxiv.org/pdf/1608.05442.pdf)

@article{zhou2016semantic,
  title={Semantic understanding of scenes through the ade20k dataset},
  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  journal={arXiv preprint arXiv:1608.05442},
  year={2016}
}

Comments
  • Cannot reproduce DilatedNet result using provided solver, data_layer and training network

    I tried to reproduce the result with the provided data layer, training network, and solver parameters, but failed to produce results even close to the provided DilatedNet model [http://sceneparsing.csail.mit.edu/model/DilatedNet_iter_120000.caffemodel].

    A test run on the validation images with the model at 120,000 iterations gives me the following stats:

    • iteration 120000 overall accuracy 0.71458910259
    • iteration 120000 mean accuracy 0.321233999994
    • iteration 120000 mean IU 0.243954227299
    • iteration 120000 fwavacc 0.567641521075

    However, the reported baseline performance is (73.6, 44.6, 32.3, 60.1).

    I wonder what is going wrong and what I should do to get a matching result. [The training images are resized to 384x384 and mirrored to match the authors' setting.]

    opened by hexiang-hu 8
  • Model does not learn during training (very high CE)

    Hi, I trained the model using the given code twice: once with re-scaled images of size 384 by 384 (bicubic for images and nearest for annotations), and once without scaling. I trained for around 150,000 iterations. But in both cases, when I run inference on the validation images with the trained snapshot weights, I get blank images. Also, during training the cross entropy is very high the whole time (~600000) and doesn't seem to come down at all. So did you use the same settings given in solver_FCN, specifically the base_lr of 1e-10? Are there any other tricks needed to train the model? Right now the predictions are completely blank, and with such a high CE it's obvious that the model is not learning anything.

    Note: with the pre-trained weights you have provided, I can reproduce your results and get 71.95% pixel accuracy using the FCN model; just the training part does not seem to work. I also tried initialising all the layers before fc6 with the pretrained VGG-16 weights. Any pointers are highly appreciated. Thanks!

    opened by himsR 4
  • Reference for Cascade-SegNet?

    As we all know, SegNet is publicly available. However, Cascade-SegNet and Cascade-DilatedNet are both reported as state-of-the-art. Can someone please explain what the difference is between SegNet and Cascade-SegNet?

    opened by pengpaiSH 4
  • How to create more training data with some extra classes

    Hi, I have been trying to create more training data for the objects that are not accurately detected, in order to improve the segmentation, but I can't figure out which color encoding to use to obtain a single-channel mask like yours. I have tried passing annotations colored with the encoding given in the color150.mat file, but the color encoding used in the original ADE20K dataset annotations is different. In the images below, the floor color encodings are different: the green floor gives the required single-channel input, but the brown floor gives only a black mask. Can anyone tell me how to get the correct color encoding to pass through https://github.com/CSAILVision/sceneparsing/blob/master/convertFromADE/convertFromADE.m to get annotations?

    [attached images: ADE_train_00000196_seg]

    opened by sauravbandral 3
  • Mirrored Data File Download?

    Hi,

    This isn't exactly related to the code, but I cannot download the train/val data from the website. I'm getting interruptions and corrupted zip files. Is there a mirror for the data?

    Thanks

    opened by sdeck51 3
  • What's your method to get Image and AnnoImage both to be 384*384?

    Using the pre-trained model, I can't obtain the pixel-wise accuracy posted here. I wonder whether resizing the annotation images may hurt the result.

    opened by AIML 3
  • How to calculate the probability that each pixel belongs to each class?

    After I ran the code, I output the array "imPred" after line 66 in "demoSegmentation.m" (% imPred = net.forward({im_inp});). The imPred{1} array is 384 * 384 * 151, so I expected to get the probability that each pixel belongs to each class, for instance 0.8, 0.53, 0.01, etc., i.e. values between 0 and 1.

    However, the numbers I got from imPred{1} were like -1.2331, 3.0104, -0.7758, 10.1961, etc., so I was wondering if these numbers can be converted to probabilities, and how to convert them?

    Thank you.

    opened by rita5022 3
  • Can't reproduce test results using pre-trained Dilated model

    Using the provided DilatedNet_iter_120000.caffemodel model and the demoSegmentation.m script, I am unable to reproduce the qualitative test results posted at the bottom of the README file (see https://github.com/CSAILVision/sceneparsing#pre-trained-models-on-going). Here are two examples I got using the DilatedNet caffe model.

    Are the released models exactly the same as the ones you are using?

    opened by ieted 3
  • Which method does the online demo use? FCN, SegNet, DilatedNet...?

    I have tried several images with the online segmentation demo and, surprisingly, it works quite well! I would like to ask which method it uses: FCN, SegNet, DilatedNet, or an ensemble with XXX?

    opened by pengpaiSH 3
  • ADE20k classes

    So I have a question about the ADE20k itself.

    I read all the seg-masks from the training set (~15k files) and counted the number of unique class values. I got 2231 unique values, where the highest value is 3144. This makes no sense, as the number of classes is supposed to be 150.

    I'm using this code to load the *_seg.png files in Python (adapted from the Matlab code on the dataset site):

    import numpy as np
    from PIL import Image

    mask = np.array(Image.open(mask_path), dtype=np.uint16)  # RGB segmentation mask
    R, G, B = mask[:, :, 0], mask[:, :, 1], mask[:, :, 2]
    class_mask = R // 10 * 256 + G                            # decode class index from R and G
    
    opened by AAnoosheh 1
  • Failing to train models

    I've been having a bit more trouble than I had bargained for with these models that are intended to work out of the box, specifically with the AdeSegDataLayer. I think I almost have it, but I am getting the following error:

    I0623 00:19:42.193922 27604 layer_factory.hpp:77] Creating layer data
    I0623 00:19:42.639711 27604 net.cpp:100] Creating Layer data
    I0623 00:19:42.639730 27604 net.cpp:408] data -> data
    I0623 00:19:42.639760 27604 net.cpp:408] data -> label
    I0623 00:19:43.050173 27604 net.cpp:150] Setting up data
    I0623 00:19:43.050217 27604 net.cpp:157] Top shape: 1 3 1944 2592 (15116544)
    I0623 00:19:43.050227 27604 net.cpp:157] Top shape: 1 1 1944 2592 3 (15116544)
    I0623 00:19:43.050235 27604 net.cpp:165] Memory required for data: 120932352
    I0623 00:19:43.050258 27604 layer_factory.hpp:77] Creating layer data_data_0_split
    I0623 00:19:43.050281 27604 net.cpp:100] Creating Layer data_data_0_split
    I0623 00:19:43.050292 27604 net.cpp:434] data_data_0_split <- data
    I0623 00:19:43.050312 27604 net.cpp:408] data_data_0_split -> data_data_0_split_0
    I0623 00:19:43.050330 27604 net.cpp:408] data_data_0_split -> data_data_0_split_1
    I0623 00:19:43.050793 27604 net.cpp:150] Setting up data_data_0_split
    I0623 00:19:43.050809 27604 net.cpp:157] Top shape: 1 3 1944 2592 (15116544)
    I0623 00:19:43.050817 27604 net.cpp:157] Top shape: 1 3 1944 2592 (15116544)
    I0623 00:19:43.050822 27604 net.cpp:165] Memory required for data: 241864704
    I0623 00:19:43.050829 27604 layer_factory.hpp:77] Creating layer conv1_1
    I0623 00:19:43.050853 27604 net.cpp:100] Creating Layer conv1_1
    I0623 00:19:43.050859 27604 net.cpp:434] conv1_1 <- data_data_0_split_0
    I0623 00:19:43.050871 27604 net.cpp:408] conv1_1 -> conv1_1
    I0623 00:19:43.700464 27604 net.cpp:150] Setting up conv1_1
    I0623 00:19:43.700536 27604 net.cpp:157] Top shape: 1 64 2142 2790 (382475520)
    I0623 00:19:43.700549 27604 net.cpp:165] Memory required for data: 1771766784
    I0623 00:19:43.700593 27604 layer_factory.hpp:77] Creating layer relu1_1
    I0623 00:19:43.700616 27604 net.cpp:100] Creating Layer relu1_1
    I0623 00:19:43.700634 27604 net.cpp:434] relu1_1 <- conv1_1
    I0623 00:19:43.700644 27604 net.cpp:395] relu1_1 -> conv1_1 (in-place)
    I0623 00:19:43.701800 27604 net.cpp:150] Setting up relu1_1
    I0623 00:19:43.701817 27604 net.cpp:157] Top shape: 1 64 2142 2790 (382475520)
    I0623 00:19:43.701825 27604 net.cpp:165] Memory required for data: 3301668864
    I0623 00:19:43.701958 27604 layer_factory.hpp:77] Creating layer conv1_2
    I0623 00:19:43.701982 27604 net.cpp:100] Creating Layer conv1_2
    I0623 00:19:43.701988 27604 net.cpp:434] conv1_2 <- conv1_1
    I0623 00:19:43.702000 27604 net.cpp:408] conv1_2 -> conv1_2
    F0623 00:19:43.704733 27604 blob.cpp:34] Check failed: shape[i] <= 2147483647 / count_ (2790 vs. 1740) blob size exceeds INT_MAX
    *** Check failure stack trace: ***
        @     0x7f0da310bdaa  (unknown)
        @     0x7f0da310bce4  (unknown)
        @     0x7f0da310b6e6  (unknown)
        @     0x7f0da310e687  (unknown)
        @     0x7f0da3794b5e  caffe::Blob<>::Reshape()
        @     0x7f0da37e81d6  caffe::BaseConvolutionLayer<>::Reshape()
        @     0x7f0da37b618f  caffe::CuDNNConvolutionLayer<>::Reshape()
        @     0x7f0da375ec7c  caffe::Net<>::Init()
        @     0x7f0da375faf5  caffe::Net<>::Net()
        @     0x7f0da379bb9a  caffe::Solver<>::InitTrainNet()
        @     0x7f0da379cc9c  caffe::Solver<>::Init()
        @     0x7f0da379cfca  caffe::Solver<>::Solver()
        @     0x7f0da377d2b3  caffe::Creator_AdamSolver<>()
        @           0x40f4ae  caffe::SolverRegistry<>::CreateSolver()
        @           0x408504  train()
        @           0x405e6c  main
        @     0x7f0da1966f45  (unknown)
        @           0x406773  (unknown)
        @              (nil)  (unknown)
    Aborted (core dumped)
    
    
    

    I noticed that in the AdeSegDataLayer there doesn't appear to be any place to resize the data, but everywhere on the project page and in the evaluation scripts it looks as though the data is supposed to be 384x384. Could that be the cause? If so, why isn't that in the data layer, and more importantly, can you suggest a change to my data layer [attached] to do that resize properly? (I could shrink the smaller height dimension to 384 and then crop the width, or shrink the width to 384 and pad the height... which is what you did?)

    ade_layers.py.zip

    opened by balloch 1
  • How were the numbers under 'Ratio', 'Train', and 'Val' calculated in objectInfo150.txt?

    I am trying to understand where 'Ratio', 'Train', and 'Val' come from in the objectInfo150.txt file.

    Presumably 'Ratio' is the pixel ratio of each category over all the images. I tried to reproduce the number for the 'wall' category by 1) counting the number of pixels labelled '1' in each image, dividing by the total number of pixels in the image, and then averaging over the number of images in the training/validation set separately/altogether; 2) similar to 1), but averaging over the sum of the total number of pixels in all images. Neither approach successfully reproduces the number (around 0.1 off from 0.1576).

    I guess the numbers under 'Train' and 'Val' are the instance counts for each category? For this I simply counted whether the category 'wall' is present in each image of the training and validation sets. Since 'wall' is a stuff category, I guess it is sufficient to just check existence. But the numbers also don't match (11588 vs. 11664, 1167 vs. 1172).

    Where does my understanding go wrong? Thanks a lot!

    opened by YellowPig-zp 0
  • Scene names as classification labels

    Hi,

    I'd like to know if you have the scene names for the test data, so that I can evaluate my trained model with them. As far as I know, the original dataset (released here) has scene names for each training/validation image, but the test data that can be downloaded from the above site doesn't contain scene names.

    I think the original dataset is novel because it has a wide variety of segmentation labels and each image also belongs to a scene. Therefore, it would be great if you could provide the scene names as classification labels for each test image. Of course, it would be enough if we could evaluate classification performance on the test data (i.e., you wouldn't need to make the scene names of the test data public).

    If you could kindly consider it, I would be grateful. Also, if this post is not appropriate here, please close it. Thanks,

    opened by kotetsu-n 0
  • Lost Model Files - Hosted Elsewhere?

    The model files which were referenced by the project (previously hosted at http://sceneparsing.csail.mit.edu/ and at http://sceneparsing.csail.mit.edu/model/pytorch/) are gone. Is there another place to find these files?

    opened by saifrahmed 0
  • Class to Color Correspondence

    Hi, I wish to relabel the indoor images of the ADE20K dataset from 150 class labels into a smaller number of categories, e.g. floor, wall, furniture, person, stairs, etc. I am having a hard time finding a file that provides the class-to-color correspondence. I would be very grateful if anyone could help me with that so I can proceed with relabelling. Thanks.

    opened by umer-rasheed 2