After downloading the target data file, I found it had been split into 4 files: objrels1 to objrels4.
The target file path is:
anaconda3/envs/asg2cap/controlimcap/driver/configs/ControllableImageCaption/VisualGenome/ordered_feature/SA/X_101_32x8d/
The error is as follows:
python asg2caption.py $resdir/model.json $resdir/path.json $mtype --eval_loss --is_train --num_workers 8
2022-04-17 22:16:55,677 mp_encoder: ft_embed.weight, shape=torch.Size([512, 2560]), num:1310720
2022-04-17 22:16:55,678 mp_encoder: ft_embed.bias, shape=torch.Size([512]), num:512
2022-04-17 22:16:55,678 attn_encoder: attr_order_embeds, shape=torch.Size([20, 2048]), num:40960
2022-04-17 22:16:55,678 attn_encoder: layers.0.loop_weight, shape=torch.Size([2048, 512]), num:1048576
2022-04-17 22:16:55,678 attn_encoder: layers.0.weight, shape=torch.Size([6, 2048, 512]), num:6291456
2022-04-17 22:16:55,678 attn_encoder: layers.1.loop_weight, shape=torch.Size([512, 512]), num:262144
2022-04-17 22:16:55,678 attn_encoder: layers.1.weight, shape=torch.Size([6, 512, 512]), num:1572864
2022-04-17 22:16:55,678 attn_encoder: node_embedding.weight, shape=torch.Size([3, 2048]), num:6144
2022-04-17 22:16:55,678 decoder: embedding.we.weight, shape=torch.Size([11123, 512]), num:5694976
2022-04-17 22:16:55,678 decoder: attn_lstm.weight_ih, shape=torch.Size([2048, 1536]), num:3145728
2022-04-17 22:16:55,678 decoder: attn_lstm.weight_hh, shape=torch.Size([2048, 512]), num:1048576
2022-04-17 22:16:55,679 decoder: attn_lstm.bias_ih, shape=torch.Size([2048]), num:2048
2022-04-17 22:16:55,679 decoder: attn_lstm.bias_hh, shape=torch.Size([2048]), num:2048
2022-04-17 22:16:55,679 decoder: lang_lstm.weight_ih, shape=torch.Size([2048, 1024]), num:2097152
2022-04-17 22:16:55,679 decoder: lang_lstm.weight_hh, shape=torch.Size([2048, 512]), num:1048576
2022-04-17 22:16:55,679 decoder: lang_lstm.bias_ih, shape=torch.Size([2048]), num:2048
2022-04-17 22:16:55,679 decoder: lang_lstm.bias_hh, shape=torch.Size([2048]), num:2048
2022-04-17 22:16:55,679 decoder: attn.linear_query.weight, shape=torch.Size([512, 512]), num:262144
2022-04-17 22:16:55,679 decoder: attn.linear_query.bias, shape=torch.Size([512]), num:512
2022-04-17 22:16:55,679 decoder: attn.attn_w.weight, shape=torch.Size([1, 512]), num:512
2022-04-17 22:16:55,679 decoder: attn_linear_context.weight, shape=torch.Size([512, 512]), num:262144
2022-04-17 22:16:55,679 decoder: address_layer.0.weight, shape=torch.Size([512, 1024]), num:524288
2022-04-17 22:16:55,679 decoder: address_layer.0.bias, shape=torch.Size([512]), num:512
2022-04-17 22:16:55,679 decoder: address_layer.2.weight, shape=torch.Size([4, 512]), num:2048
2022-04-17 22:16:55,679 decoder: address_layer.2.bias, shape=torch.Size([4]), num:4
2022-04-17 22:16:55,679 decoder: memory_update_layer.0.weight, shape=torch.Size([512, 1024]), num:524288
2022-04-17 22:16:55,680 decoder: memory_update_layer.0.bias, shape=torch.Size([512]), num:512
2022-04-17 22:16:55,680 decoder: memory_update_layer.2.weight, shape=torch.Size([1024, 512]), num:524288
2022-04-17 22:16:55,680 decoder: memory_update_layer.2.bias, shape=torch.Size([1024]), num:1024
2022-04-17 22:16:55,680 decoder: sentinal_layer.0.weight, shape=torch.Size([512, 512]), num:262144
2022-04-17 22:16:55,680 decoder: sentinal_layer.0.bias, shape=torch.Size([512]), num:512
2022-04-17 22:16:55,680 decoder: sentinal_layer.2.weight, shape=torch.Size([1, 512]), num:512
2022-04-17 22:16:55,680 decoder: sentinal_layer.2.bias, shape=torch.Size([1]), num:1
2022-04-17 22:16:55,680 num params 33, num weights 25942021
2022-04-17 22:16:55,680 trainable: num params 32, num weights 25901061
2022-04-17 22:17:52,931 mp_fts (96738, 2048)
2022-04-17 22:17:53,020 num_data 3397459
/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:487: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
2022-04-17 22:17:55,916 mp_fts (4925, 2048)
2022-04-17 22:17:55,921 num_data 172290
Traceback (most recent call last):
File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/asg2caption.py", line 146, in <module>
main()
File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/asg2caption.py", line 94, in main
_model.train(trn_reader, val_reader, path_cfg.model_dir, path_cfg.log_dir,
File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/framework/modelbase.py", line 191, in train
metrics = self.validate(val_reader)
File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/driver/caption/models/captionbase.py", line 66, in validate
for batch_data in val_reader:
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/lianjunliang/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/lianjunliang/anaconda3/envs/asg2cap/controlimcap/readers/imgsgreader.py", line 398, in __getitem__
'mp_fts': self.mp_fts[self.img_id_to_ftidx_name[image_id][0]],
KeyError: '2334484'
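From the failing line in `imgsgreader.py`, the `KeyError` means image id `'2334484'` is absent from `img_id_to_ftidx_name` after loading the split feature files, so the features for some annotated images may be missing from the merged data. A minimal sketch of the check I used to list the missing ids (the mapping contents below are hypothetical stand-ins, not the repo's actual data):

```python
# img_id_to_ftidx_name maps an image id string to (row index into mp_fts,
# feature file name); the entries here are toy placeholders.
img_id_to_ftidx_name = {
    '2334483': (0, 'objrels1'),
    '2334485': (1, 'objrels2'),
}

def missing_image_ids(required_ids, mapping):
    """Return every required image id that the merged mapping lacks."""
    return sorted(i for i in required_ids if i not in mapping)

# '2334484' (the id from the traceback) is absent from the toy mapping:
print(missing_image_ids(['2334483', '2334484', '2334485'], img_id_to_ftidx_name))
# → ['2334484']
```

If the list is non-empty for the real data, the split files were probably not downloaded or merged completely, so re-checking objrels1 to objrels4 would be the next step.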