Code accompanying the paper Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs (Chen et al., CVPR 2020, Oral).

Overview

This repository contains the PyTorch implementation of our paper Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs (CVPR 2020).

Overview of ASG2Caption Model

Prerequisites

Python 3 and PyTorch 1.3.

# clone the repository
git clone https://github.com/cshizhe/asg2cap.git
cd asg2cap
# clone caption evaluation codes
git clone https://github.com/cshizhe/eval_cap.git
export PYTHONPATH=$(pwd):${PYTHONPATH}

Training & Inference

cd controlimcap/driver

# supported caption models: [node, node.role,
# rgcn, rgcn.flow, rgcn.memory, rgcn.flow.memory]
# see our paper for details on each variant
mtype=rgcn.flow.memory 

# setup config files
# you should modify data paths in configs/prepare_*_imgsg_config.py
python configs/prepare_coco_imgsg_config.py $mtype
resdir='' # copy the output string of the previous step

# training
python asg2caption.py $resdir/model.json $resdir/path.json $mtype --eval_loss --is_train --num_workers 8

# inference
python asg2caption.py $resdir/model.json $resdir/path.json $mtype --eval_set tst --num_workers 8

Datasets

Annotations

Annotations for the MSCOCO and VisualGenome datasets can be downloaded from Google Drive.

  • (Image, ASG, Caption) annotations: regionfiles/image_id.json
    JSON format:
    {
        "region_id": {
            "objects": [
                {
                    "object_id": int,
                    "name": str,
                    "attributes": [str],
                    "x": int,
                    "y": int,
                    "w": int,
                    "h": int
                }],
            "relationships": [
                {
                    "relationship_id": int,
                    "subject_id": int,
                    "object_id": int,
                    "name": str
                }],
            "phrase": str
        }
    }
  • vocabularies: int2word.npy: [word], word2int.json: {word: int}

  • data splits: public_split directory, containing trn_names.npy, val_names.npy, tst_names.npy
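As a rough illustration of the region annotation format above, the following sketch parses one region entry and lays it out as a small directed graph with object, attribute, and relationship nodes. The sample data, the `region_to_graph` helper, and the node/edge layout are assumptions made for illustration; they are not the repository's reader code. Labels are kept on the nodes only to make the output readable.

```python
import json

# Hypothetical sample that follows the schema above; all values are made up.
SAMPLE = """
{
  "1001": {
    "objects": [
      {"object_id": 1, "name": "dog", "attributes": ["brown"],
       "x": 10, "y": 20, "w": 50, "h": 40},
      {"object_id": 2, "name": "frisbee", "attributes": [],
       "x": 70, "y": 15, "w": 20, "h": 10}
    ],
    "relationships": [
      {"relationship_id": 7, "subject_id": 1, "object_id": 2, "name": "catching"}
    ],
    "phrase": "a brown dog catching a frisbee"
  }
}
"""


def region_to_graph(region):
    """Turn one region entry into (nodes, edges).

    Nodes are (kind, label) tuples with kind in
    {"object", "attribute", "relationship"}.
    Edges are directed (src_index, dst_index) pairs:
    object -> attribute, subject -> relationship, relationship -> object.
    This layout is an illustrative assumption.
    """
    nodes, edges = [], []
    obj_index = {}  # object_id -> position of the object node in `nodes`
    for obj in region["objects"]:
        obj_index[obj["object_id"]] = len(nodes)
        nodes.append(("object", obj["name"]))
        for attr in obj.get("attributes", []):
            edges.append((obj_index[obj["object_id"]], len(nodes)))
            nodes.append(("attribute", attr))
    for rel in region.get("relationships", []):
        rel_idx = len(nodes)
        nodes.append(("relationship", rel["name"]))
        edges.append((obj_index[rel["subject_id"]], rel_idx))
        edges.append((rel_idx, obj_index[rel["object_id"]]))
    return nodes, edges


regions = json.loads(SAMPLE)
for region_id, region in regions.items():
    nodes, edges = region_to_graph(region)
    print(region_id, region["phrase"])
    print(nodes)
    print(edges)
```

For a real file, replace `json.loads(SAMPLE)` with `json.load` on an opened regionfiles/image_id.json file.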

Features

Features for MSCOCO and VisualGenome datasets are available at BaiduNetdisk (code: 6q32).

We also provide pretrained models and code to extract features for new images.

format: npy array, shape=(num_fts, dim_ft), with rows in the same order as the data_split names

format: hdf5 files, "image_id".jpg.hdf5

key: 'image_id'.jpg

attrs: {"image_w": int, "image_h": int, "boxes": 4d array (x1, y1, x2, y2)}
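The annotations store boxes as (x, y, w, h), while the feature files store them as (x1, y1, x2, y2), and the npy features are ordered by the data split names. The sketch below shows one way to bridge the two conventions. It assumes that x, y is the top-left corner of the box; the helper names (`xywh_to_xyxy`, `iou`, `build_row_index`) and the sample names are hypothetical.

```python
def xywh_to_xyxy(obj):
    """Convert an annotation box (x, y, w, h) to corner form (x1, y1, x2, y2).

    Assumes (x, y) is the top-left corner of the box.
    """
    return (obj["x"], obj["y"], obj["x"] + obj["w"], obj["y"] + obj["h"])


def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_row_index(split_names):
    """Map each name in a data split to its row in the npy feature array."""
    return {name: row for row, name in enumerate(split_names)}


# Hypothetical values, for illustration only.
annotated = xywh_to_xyxy({"x": 10, "y": 20, "w": 50, "h": 40})
detected = (35, 20, 85, 60)
print(annotated)                 # (10, 20, 60, 60)
print(iou(annotated, detected))  # one third: half of each box overlaps

row_of = build_row_index(["img_a", "img_b", "img_c"])
print(row_of["img_b"])           # 1
```

A name that is missing from the split list raises a KeyError on lookup, so checking `name in row_of` before indexing the feature array can make mismatches between annotations and features easier to spot.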

Result Visualization

Examples

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@article{chen2020say,
  title={Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs},
  author={Chen, Shizhe and Jin, Qin and Wang, Peng and Wu, Qi},
  journal={CVPR},
  year={2020}
}

License

MIT License

Comments
  • Hi, Where is the supplementary material?

    The paper says "The details of automatic ASG generation are provided in the supplementary material.", but I can't find it. Please tell me where it is. Thanks!

    opened by jxylon 1
  • How to compute Div-n?

    Hi, thanks for the awesome work! I'm trying to use your method as a comparison for my own work, and I am confused about the calculation of n-gram diversity (Div-n). It is defined in the paper as "the ratio of distinct n-grams to the total number of words in the best 5 sampled captions".

    My questions are:

    1. Does it mean that, for each image, you use 5 different ASGs to obtain 5 captions, calculate a Div-n score over these captions, and then average the Div-n scores over all images in the test set to get the final Div-n score?
    2. How to obtain the best captions?
    3. Which dataset split do you use for evaluating the n-gram diversity?
    4. Would you mind providing me your implementation of the Div-n score?

    I would be very happy if you can answer the above questions so that I can make a fair comparison to your work.

    Best regards

    opened by bearcatt 0
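The definition quoted in the question above can be read literally as follows. This is only a sketch of that one sentence: the whitespace tokenization, the lowercasing, and the `div_n` helper name are assumptions, and it does not settle how the best 5 captions are chosen or how scores are aggregated over a test set.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def div_n(captions, n):
    """Ratio of distinct n-grams to the total number of words in `captions`.

    `captions` is the set of captions for one image (for example 5 of them).
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    distinct = set()
    total_words = 0
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        distinct.update(ngrams(tokens, n))
    return len(distinct) / total_words if total_words else 0.0


# Hypothetical captions for a single image.
captions = ["a dog runs", "a dog jumps"]
print(div_n(captions, 1))  # 4 distinct unigrams over 6 words
print(div_n(captions, 2))  # 3 distinct bigrams over 6 words -> 0.5
```

A dataset-level score under this reading would be the mean of `div_n` over images, which is one of the interpretations the question asks about.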
  • After running the training command, a KeyError is raised. Please help me with this error

    After I downloaded the target data file, it was split into 4 files: objrels1 to objrels4. The target file path is: anaconda3/envs/asg2cap/controlimcap/driver/configs/ControllableImageCaption/VisualGenome/ordered_feature/SA/X_101_32x8d/

    The command and its output are:

    python asg2caption.py $resdir/model.json $resdir/path.json $mtype --eval_loss --is_train --num_workers 8
    2022-04-17 22:16:55,677 mp_encoder: ft_embed.weight, shape=torch.Size([512, 2560]), num:1310720
    2022-04-17 22:16:55,678 mp_encoder: ft_embed.bias, shape=torch.Size([512]), num:512
    2022-04-17 22:16:55,678 attn_encoder: attr_order_embeds, shape=torch.Size([20, 2048]), num:40960
    2022-04-17 22:16:55,678 attn_encoder: layers.0.loop_weight, shape=torch.Size([2048, 512]), num:1048576
    2022-04-17 22:16:55,678 attn_encoder: layers.0.weight, shape=torch.Size([6, 2048, 512]), num:6291456
    2022-04-17 22:16:55,678 attn_encoder: layers.1.loop_weight, shape=torch.Size([512, 512]), num:262144
    2022-04-17 22:16:55,678 attn_encoder: layers.1.weight, shape=torch.Size([6, 512, 512]), num:1572864
    2022-04-17 22:16:55,678 attn_encoder: node_embedding.weight, shape=torch.Size([3, 2048]), num:6144
    2022-04-17 22:16:55,678 decoder: embedding.we.weight, shape=torch.Size([11123, 512]), num:5694976
    2022-04-17 22:16:55,678 decoder: attn_lstm.weight_ih, shape=torch.Size([2048, 1536]), num:3145728
    2022-04-17 22:16:55,678 decoder: attn_lstm.weight_hh, shape=torch.Size([2048, 512]), num:1048576
    2022-04-17 22:16:55,679 decoder: attn_lstm.bias_ih, shape=torch.Size([2048]), num:2048
    2022-04-17 22:16:55,679 decoder: attn_lstm.bias_hh, shape=torch.Size([2048]), num:2048
    2022-04-17 22:16:55,679 decoder: lang_lstm.weight_ih, shape=torch.Size([2048, 1024]), num:2097152
    2022-04-17 22:16:55,679 decoder: lang_lstm.weight_hh, shape=torch.Size([2048, 512]), num:1048576
    2022-04-17 22:16:55,679 decoder: lang_lstm.bias_ih, shape=torch.Size([2048]), num:2048
    2022-04-17 22:16:55,679 decoder: lang_lstm.bias_hh, shape=torch.Size([2048]), num:2048
    2022-04-17 22:16:55,679 decoder: attn.linear_query.weight, shape=torch.Size([512, 512]), num:262144
    2022-04-17 22:16:55,679 decoder: attn.linear_query.bias, shape=torch.Size([512]), num:512
    2022-04-17 22:16:55,679 decoder: attn.attn_w.weight, shape=torch.Size([1, 512]), num:512
    2022-04-17 22:16:55,679 decoder: attn_linear_context.weight, shape=torch.Size([512, 512]), num:262144
    2022-04-17 22:16:55,679 decoder: address_layer.0.weight, shape=torch.Size([512, 1024]), num:524288
    2022-04-17 22:16:55,679 decoder: address_layer.0.bias, shape=torch.Size([512]), num:512
    2022-04-17 22:16:55,679 decoder: address_layer.2.weight, shape=torch.Size([4, 512]), num:2048
    2022-04-17 22:16:55,679 decoder: address_layer.2.bias, shape=torch.Size([4]), num:4
    2022-04-17 22:16:55,679 decoder: memory_update_layer.0.weight, shape=torch.Size([512, 1024]), num:524288
    2022-04-17 22:16:55,680 decoder: memory_update_layer.0.bias, shape=torch.Size([512]), num:512
    2022-04-17 22:16:55,680 decoder: memory_update_layer.2.weight, shape=torch.Size([1024, 512]), num:524288
    2022-04-17 22:16:55,680 decoder: memory_update_layer.2.bias, shape=torch.Size([1024]), num:1024
    2022-04-17 22:16:55,680 decoder: sentinal_layer.0.weight, shape=torch.Size([512, 512]), num:262144
    2022-04-17 22:16:55,680 decoder: sentinal_layer.0.bias, shape=torch.Size([512]), num:512
    2022-04-17 22:16:55,680 decoder: sentinal_layer.2.weight, shape=torch.Size([1, 512]), num:512
    2022-04-17 22:16:55,680 decoder: sentinal_layer.2.bias, shape=torch.Size([1]), num:1
    2022-04-17 22:16:55,680 num params 33, num weights 25942021
    2022-04-17 22:16:55,680 trainable: num params 32, num weights 25901061
    2022-04-17 22:17:52,931 mp_fts (96738, 2048)
    2022-04-17 22:17:53,020 num_data 3397459
    /home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
      warnings.warn(_create_warning_msg(
    2022-04-17 22:17:55,916 mp_fts (4925, 2048)
    2022-04-17 22:17:55,921 num_data 172290
    Traceback (most recent call last):
      File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/asg2caption.py", line 146, in <module>
        main()
      File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/asg2caption.py", line 94, in main
        _model.train(trn_reader, val_reader, path_cfg.model_dir, path_cfg.log_dir,
      File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/framework/modelbase.py", line 191, in train
        metrics = self.validate(val_reader)
      File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/caption/models/captionbase.py", line 66, in validate
        for batch_data in val_reader:
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
        data = self._next_data()
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
        return self._process_data(data)
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
        data.reraise()
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
        raise exception
    KeyError: Caught KeyError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/readers/imgsgreader.py", line 398, in __getitem__
        'mp_fts': self.mp_fts[self.img_id_to_ftidx_name[image_id][0]],
    KeyError: '2334484'

    opened by MrLianSYSU 1
  • RuntimeError: CUDA error: device-side assert triggered

    Could anyone tell me how to modify the code?

    The detailed error information is as follows:

    /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexTypedexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [106,0,0] Assertion indexValue >=xValue < src.sizes[dim] failed. THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/generic/THCTensorScatterline=67 error=710 : device-side assert triggered

    opened by meiling-fdu 0
  • Could you please tell me the details about how to generate ASG ?

    Hello, Shizhe. Thanks for your great work, I am interested in it! I checked the supplementary materials related to ASG, but I still have some doubts about the implementation details, and I don't know how to generate ASGs if I change the dataset. Could you share the relevant code for automatic ASG generation? Thank you very much.

    opened by grape0803 1
Owner
Shizhe Chen