Code repository for our paper "Learning to Generate Scene Graph from Natural Language Supervision" in ICCV 2021

Overview

Scene Graph Generation from Natural Language Supervision

This repository includes the Pytorch code for our paper "Learning to Generate Scene Graph from Natural Language Supervision" accepted in ICCV 2021.

overview figure

Top (our setting): Our goal is learning to generate localized scene graphs from image-text pairs. Once trained, our model takes an image and its detected objects as inputs and outputs the image scene graph. Bottom (our results): A comparison of results from our method and state-of-the-art (SOTA) with varying levels of supervision.

Contents

  1. Overview
  2. Qualitative Results
  3. Installation
  4. Data
  5. Metrics
  6. Pretrained Object Detector
  7. Pretrained Scene Graph Generation Models
  8. Model Training
  9. Model Evaluation
  10. Acknowledgement
  11. Reference

Overview

Learning from image-text data has demonstrated recent success for many recognition tasks, yet is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph. To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves 30% relative gain over a latest method trained with human-annotated unlocalized scene graphs. Our model also shows strong results for weakly and fully supervised scene graph generation. In addition, we explore an open-vocabulary setting for detecting scene graphs, and present the first result for open-set scene graph generation.

Qualitative Results

Our generated scene graphs learned from image descriptions

overview figure

Partial visualization of Figure 3 in our paper: Our model trained by image-sentence pairs produces scene graphs with a high quality (e.g. "man-on-motorcycle" and "man-wearing-helmet" in first example). More comparison with other models trained by stronger supervision (e.g. unlocalized scene graph labels, localized scene graph labels) can be viewed in the Figure 3 of paper.

Our generated scene graphs in open-set and closed-set settings

overview figure

Figure 4 in our paper: We explored open-set setting where the categories of target concepts (objects and predicates) are unknown during training. Compared to our closed-set model, our open-set model detects more concepts outside the evaluation dataset, Visual Genome (e.g. "swinge", "mouse", "keyboard"). Our results suggest an exciting avenue of large-scale training of open-set scene graph generation using image captioning dataset such as Conceptual Caption.

Installation

Check INSTALL.md for installation instructions.

Data

Check DATASET.md for instructions of data downloading.

Metrics

Explanation of metrics in this toolkit are given in METRICS.md

Pretrained Object Detector

In this project, we primarily use the detector Faster RCNN pretrained on Open Images dataset. To use this repo, you don't need to run this detector. You can directly download the extracted detection features, as the instruction in DATASET.md. If you're interested in this detector, the pretrained model can be found in TensorFlow 1 Detection Model Zoo: faster_rcnn_inception_resnet_v2_atrous_oidv4.

Additionally, to compare with previous fully supervised models, we also use the detector pretrained by Scene-Graph-Benchmark. You can download this Faster R-CNN model and extract all the files to the directory checkpoints/pretrained_faster_rcnn.

Pretrained Scene Graph Generation Models

Our pretrained SGG models can be downloaded on Google Drive. The details of these models can be found in Model Training section below. After downloading, please put all the folders to the directory checkpoints/. More pretrained models will be released. Stay tuned!

Model Training

To train our scene graph generation models, run the script

bash train.sh MODEL_TYPE

where MODEL_TYPE specifies the training supervision, the training dataset and the scene graph generation model. See details below.

  1. Language supervised models: trained by image-text pairs

    • Language_CC-COCO_Uniter: train our Transformer-based model on Conceptual Caption (CC) and COCO Caption (COCO) datasets
    • Language_*_Uniter: train our Transformer-based model on single dataset. * represents the dataset name and can be CC, COCO, and VG
    • Language_OpensetCOCO_Uniter: train our Transformer-based model on COCO dataset in open-set setting
    • Language_CC-COCO_MotifNet: train Motif-Net model with language supervision from CC and COCO datasets
  2. Weakly supervised models: trained by unlocalized scene graph labels

    • Weakly_Uniter: train our Transformer-based model
  3. Fully supervised models: trained by localized scene graph labels

    • Sup_Uniter: train our Transformer-based model
    • Sup_OnlineDetector_Uniter: train our Transformer-based model by using the object detector from Scene-Graph-Benchmark.

You can set CUDA_VISIBLE_DEVICES in train.sh to specify which GPUs are used for model training (e.g., the default script uses 2 GPUs).

Model Evaluation

To evaluate the trained scene graph generation model, you can reuse the commands in train.sh by simply changing WSVL.SKIP_TRAIN to True and setting OUTPUT_DIR as the path to your trained model. One example can be found in test.sh and just run bash test.sh.

Acknowledgement

This repository was built based on Scene-Graph-Benchmark for scene graph generation and UNITER for image-text representation learning.

We specially would like to thank Pengchuan Zhang for providing the object detector pretrained on Objects365 dataset which was used in our ablation study.

Reference

If you are using our code, please consider citing our paper.

@inproceedings{zhong2021SGGfromNLS,
  title={Learning to Generate Scene Graph from Natural Language Supervision},
  author={Zhong, Yiwu and Shi, Jing and Yang, Jianwei and Xu, Chenliang and Li, Yin},
  booktitle={ICCV},
  year={2021}
}
Comments
  • The object features from the pretrained detector on OpenImages

    The object features from the pretrained detector on OpenImages

    Noticed that The pretrained detector on OpenImages you applied is based on Tensorflow. However, this repo is based on pytorch. Are the object features directly applicable?

    opened by Kenneth-Wong 5
  • Problems while loading checkpoints

    Problems while loading checkpoints

    After downloading Language_CC-COCO_MotifNet on Google Drive, I change the content of last_checkpoint to the path model_final.pth stored.

    Then, while running the test.sh, error occurs.

    RuntimeError: Error(s) in loading state_dict for GeneralizedRCNN: size mismatch for roi_heads.relation.predictor.out_oobj.2.weight: copying a param with shape torch.Size([4274, 768]) from checkpoint, the shape in current model is torch.Size([149, 768]). size mismatch for roi_heads.relation.predictor.out_oobj.2.bias: copying a param with shape torch.Size([4274]) from checkpoint, the shape in current model is torch.Size([149]). size mismatch for roi_heads.relation.predictor.out_rel.2.weight: copying a param with shape torch.Size([678, 768]) from checkpoint, the shape in current model is torch.Size([66, 768]). size mismatch for roi_heads.relation.predictor.out_rel.2.bias: copying a param with shape torch.Size([678]) from checkpoint, the shape in current model is torch.Size([66]).

    The config.yml in Language_CC-COCO_MotifNet does not match the checkpoint or something goes wrong?

    Thank you for your time!

    opened by guikunchen 4
  • Run the SGG model on a set of custom images for inference

    Run the SGG model on a set of custom images for inference

    Hi, I have been able to run the object detector - pretrained faster rcnn and obtain the object detections. However, I am not sure on how to get the scene graph predictions using any of the pretrained scene graph generation models - ex: Language_CC-COCO_Uniter models on the custom images. Please let me know the steps to do this

    According to my understanding of the paper, the proposed model does not need the image caption during inference. However, I would like to see the results by passing (image + caption) as input during inference. Please let me know how this can be done.

    Thanks in advance. Looking forward for your response.

    opened by prateksha 3
  • can you upload the code to get the obj_norn and predicate_norn from caption for vg/coco/cc dataset?

    can you upload the code to get the obj_norn and predicate_norn from caption for vg/coco/cc dataset?

    Dear sir, I have seen the detailed process, but I have not seen the code of getting obj_norn and predicate_norn from caption.Would you like to upload the code of this part.

    opened by ZHUXUHAN 2
  • AttributeError: module 'torchvision' has no attribute 'datasets'

    AttributeError: module 'torchvision' has no attribute 'datasets'

    File "/home/xxx/icme/sgg_from_nls/maskrcnn_benchmark/data/datasets/coco.py", line 39, in class COCODataset(torchvision.datasets.coco.CocoDetection): AttributeError: module 'torchvision' has no attribute 'datasets'

    I dont know why it cannot import cocodataset

    opened by PrimoWW 1
  • The value of  'loss_rel' is nan

    The value of 'loss_rel' is nan

    Hi, thanks for your released project. I have tried to run your code in the setting of 'Weakly_Uniter', but why does the value of 'loss_rel' turn to be 'nan' after 500 iters?

    opened by xcppy 1
  • Would you release the code for the pseudo label generation and assignment?

    Would you release the code for the pseudo label generation and assignment?

    Hi!

    Thanks for your interesting work and nice codebase on weakly supervised SGG. Your high-quality codebase and comments make me understand your proposed method concretely.

    The pseude label generation and assignment is a much different design from previous works of weakly SGG. The generated labels have been provided here. However, I'm wondering about the technical details of seudo label generation and assignment in your work. Do you have a plan to release the preprocess code for the pseudo label generation?

    I'm looking forward to your reply.

    opened by Scarecrow0 6
Owner
Yiwu Zhong
Ph.D. Student of Computer Science in University of Wisconsin-Madison
Yiwu Zhong
Pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

TRAnsformer Routing Networks (TRAR) This is an official implementation for ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visu

Ren Tianhe 49 Nov 10, 2022
This is the official pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on VQA Task

?? ERASOR (RA-L'21 with ICRA Option) Official page of "ERASOR: Egocentric Ratio of Pseudo Occupancy-based Dynamic Object Removal for Static 3D Point C

Hyungtae Lim 225 Dec 29, 2022
PyTorch implementation of our ICCV 2021 paper, Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents.

PyTorch implementation of our ICCV 2021 paper, Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents.

Saim Wani 4 May 8, 2022
Official code for our ICCV paper: "From Continuity to Editability: Inverting GANs with Consecutive Images"

GANInversion_with_ConsecutiveImgs Official code for our ICCV paper: "From Continuity to Editability: Inverting GANs with Consecutive Images" https://a

QingyangXu 38 Dec 7, 2022
Official Repository for the ICCV 2021 paper "PixelSynth: Generating a 3D-Consistent Experience from a Single Image"

PixelSynth: Generating a 3D-Consistent Experience from a Single Image (ICCV 2021) Chris Rockwell, David F. Fouhey, and Justin Johnson [Project Website

Chris Rockwell 95 Nov 22, 2022
Repository of our paper 'Refer-it-in-RGBD' in CVPR 2021

Refer-it-in-RGBD This is the repository of our paper 'Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images' in CVPR 2021 Pape

Haolin Liu 34 Nov 7, 2022
PyTorch implementation of our Adam-NSCL algorithm from our CVPR2021 (oral) paper "Training Networks in Null Space for Continual Learning"

Adam-NSCL This is a PyTorch implementation of Adam-NSCL algorithm for continual learning from our CVPR2021 (oral) paper: Title: Training Networks in N

Shipeng Wang 34 Dec 21, 2022
Code for ICCV 2021 paper "Distilling Holistic Knowledge with Graph Neural Networks"

HKD Code for ICCV 2021 paper "Distilling Holistic Knowledge with Graph Neural Networks" cifia-100 result The implementation of compared methods are ba

Wang Yucheng 30 Dec 18, 2022
code for ICCV 2021 paper 'Generalized Source-free Domain Adaptation'

G-SFDA Code (based on pytorch 1.3) for our ICCV 2021 paper 'Generalized Source-free Domain Adaptation'. [project] [paper]. Dataset preparing Download

Shiqi Yang 84 Dec 26, 2022
Code for ICCV 2021 paper: ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators..

ARAPReg Code for ICCV 2021 paper: ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators.. Installation The cod

Bo Sun 132 Nov 28, 2022
Code for the ICCV 2021 paper "Pixel Difference Networks for Efficient Edge Detection" (Oral).

Pixel Difference Convolution This repository contains the PyTorch implementation for "Pixel Difference Networks for Efficient Edge Detection" by Zhuo

Alex 236 Dec 21, 2022
Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization

Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization 0. Environment Environment: python 3.6 and cuda 10

Haitao Yang 62 Dec 30, 2022
Code for the paper "Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds" (ICCV 2021)

Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds This is the official code implementation for the paper "Spatio-temporal Se

Hesper 63 Jan 5, 2023
Code release for ICCV 2021 paper "Anticipative Video Transformer"

Anticipative Video Transformer Ranked first in the Action Anticipation task of the CVPR 2021 EPIC-Kitchens Challenge! (entry: AVT-FB-UT) [project page

Facebook Research 123 Dec 13, 2022
Official code release for ICCV 2021 paper SNARF: Differentiable Forward Skinning for Animating Non-rigid Neural Implicit Shapes.

Official code release for ICCV 2021 paper SNARF: Differentiable Forward Skinning for Animating Non-rigid Neural Implicit Shapes.

null 235 Dec 26, 2022
Demo code for ICCV 2021 paper "Sensor-Guided Optical Flow"

Sensor-Guided Optical Flow Demo code for "Sensor-Guided Optical Flow", ICCV 2021 This code is provided to replicate results with flow hints obtained f

null 10 Mar 16, 2022
Code for ICCV 2021 paper "HuMoR: 3D Human Motion Model for Robust Pose Estimation"

Code for ICCV 2021 paper "HuMoR: 3D Human Motion Model for Robust Pose Estimation"

Davis Rempe 367 Dec 24, 2022
Code for the ICCV 2021 Workshop paper: A Unified Efficient Pyramid Transformer for Semantic Segmentation.

Unified-EPT Code for the ICCV 2021 Workshop paper: A Unified Efficient Pyramid Transformer for Semantic Segmentation. Installation Linux, CUDA>=10.0,

null 29 Aug 23, 2022
Starter code for the ICCV 2021 paper, 'Detecting Invisible People'

Detecting Invisible People [ICCV 2021 Paper] [Website] Tarasha Khurana, Achal Dave, Deva Ramanan Introduction This repository contains code for Detect

Tarasha Khurana 28 Sep 16, 2022