WeakVRD-Captioning - Implementation of the paper "Improving Image Captioning with Better Use of Caption"

Overview

Paper "Improving image captioning with better use of captions"

@inproceedings{shi2020improving,
  title={Improving Image Captioning with Better Use of Caption},
  author={Shi, Zhan and Zhou, Xu and Qiu, Xipeng and Zhu, Xiaodan},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  pages={7454--7464},
  year={2020}
}

Requirements

python 2.7.15

torch 1.0.1

The exact conda environment is specified in ezs.yml.

Note: you also need to download the coco-caption and cider folders into this directory for evaluation.
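For example (assuming the standard repositories used by related captioning codebases; verify these are the forks this repo expects):

git clone https://github.com/tylin/coco-caption.git coco-caption

git clone https://github.com/ruotianluo/cider.git cider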

Data Files and Models

Files: add the files from the data directory on Google Drive or [Baidu Netdisk](https://pan.baidu.com/s/1ddtfdlwD65cm4JmVu6GF3w) (extraction code: 39pa) to the data directory here. See data/README for more details.

Models: add the log directory from Google Drive or [Baidu Netdisk](https://pan.baidu.com/s/1ddtfdlwD65cm4JmVu6GF3w) (extraction code: 39pa) here.

Scripts

MLE training:

python train.py --gpus 0 --id experiment-mle

RL training:

python train.py --gpus 0 --id experiment-rl --learning_rate 2e-5 --resume_from experiment-mle --resume_from_best True --self_critical_after 0 --max_epochs 60 --learning_rate_decay_start -1 --scheduled_sampling_start -1 --reduce_on_plateau
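For orientation, self-critical (RL) training rewards a sampled caption by how much its CIDEr score beats the model's own greedy decode. A minimal sketch of that reward computation (the repo's get_self_critical_reward in misc/rewards_graph.py takes more arguments; the names below are illustrative):

```python
import numpy as np

def self_critical_reward(cider_scorer, gts, sampled_res, greedy_res):
    # CiderD's compute_score returns (corpus_score, per-image scores).
    _, sampled_scores = cider_scorer.compute_score(gts, sampled_res)
    _, greedy_scores = cider_scorer.compute_score(gts, greedy_res)
    # Positive reward when the sampled caption beats the greedy baseline.
    return np.array(sampled_scores) - np.array(greedy_scores)
```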

Evaluate your own model or load a trained model:

python eval.py --gpus 0 --resume_from experiment-mle

and

python eval.py --gpus 0 --resume_from experiment-rl

Acknowledgement

This code is based on Ruotian Luo's brilliant image captioning repo ruotianluo/self-critical.pytorch. We use the detected bounding boxes/categories/features provided by Bottom-Up (peteanderson80/bottom-up-attention) and SGAE (yangxuntu/SGAE). Many thanks for their work!

Comments
  • About the generation of the CGVRG

    Hello, I recently read your paper. The idea of training a relation detection network with the captions and images from COCO is very novel, and it improves accuracy considerably over papers that use a generic VRG detector, so I would like to study this part of the implementation carefully. However, this repository seems to contain only the code for the later decoder stage. Could you open-source the code for training the relation detection network with weak supervision? Many thanks!

    opened by llylll 16
  • A question about constructing the visual relation graph

    Hello, I would like to ask about constructing the visual relation graph. As the paper says, an image passed through the object detection network yields n regions, and pairing them gives n(n-1) candidate pairs, each of which goes through predicate classification. If every pair of regions yields a predicted predicate, the resulting visual relation graph will be very cluttered. Is a pair of regions judged to be related only when the predicted predicate's probability exceeds some threshold?

    opened by n9705 11
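    For readers with the same question: one common way to keep the graph sparse is exactly the thresholding the question describes. A hedged sketch (hypothetical names; the repo may prune differently):

    ```python
    import torch

    def prune_relations(pair_logits, threshold=0.5):
        """Keep only region pairs whose top predicate probability clears a threshold.

        pair_logits: dict mapping (i, j) region-index pairs to predicate logit tensors.
        Returns (subject, predicate, object) edges for the visual relation graph.
        """
        edges = []
        for (i, j), logits in pair_logits.items():
            probs = torch.softmax(logits, dim=-1)
            score, pred = probs.max(dim=-1)
            # Drop low-confidence pairs so the graph stays sparse.
            if score.item() >= threshold:
                edges.append((i, pred.item(), j))
        return edges
    ```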
  • Code: Weakly supervised multi-instance training

    Can you please provide the code for weakly supervised multi-instance training, or, if you have already uploaded it, a link to it?

    Thank you

    opened by Monikshah 2
  • Confirming Training Setting Information

    Hi, Gitsamshi! Thanks for your work and the kind replies in previous issues. I just want to confirm some simple information about the training setting, as I didn't see it in the paper/repo.

    1. I am running on a single 1080Ti with batch size 64; it consumes around 10 GB of GPU memory, and each epoch takes ~1 hour in the XE step. Does this match your training time/memory usage?
    2. I didn't use the default batch size (128) since it would run out of GPU memory. Would this change affect the performance a lot? I found that when training a model under this setting, the performance reaches a plateau (CIDEr: 1.01) early, after about 3-4 epochs. Does this mean I did something wrong?
    3. Would you be so generous as to share some advice or tricks for reproducing the performance in the paper?
    opened by tjuwyh 2
  • Question about the Data used

    Hi, thanks for your great work; I have a few questions.

    1. How did you get the object labels for the detected regions from the Bottom-Up model? They do not seem to be included in the official repo.
    2. How did you implement the weakly supervised multi-instance learning described in the paper? I couldn't figure out where the corresponding computation happens in this code. By the way, I'm looking forward to the release of the data, and I wish to follow this work afterwards. Thanks a lot!
    opened by tjuwyh 2
  • wrela

    wrela consists of [subject object predicate_label].

    Can you please let me know which file predicate_label refers to? I can't find any file that has the labels of the predicates apart from coco_dict.json, which contains the indices of the predicates. I think indices and labels are different things.

    Thank you.

    opened by Monikshah 1
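    If coco_dict.json indeed stores only string-to-index mappings, the usual way to recover a predicate string is to invert that mapping. A hypothetical sketch (the key name 'predicate_to_ix' is an assumption; check the actual JSON keys):

    ```python
    import json

    with open('data/coco_dict.json') as f:
        coco_dict = json.load(f)

    # Assumed structure: {"predicate_to_ix": {"on": 1, "holding": 2, ...}}.
    # Invert it so the integer predicate_label in a wrela triple maps back to text.
    ix_to_predicate = {ix: word for word, ix in coco_dict['predicate_to_ix'].items()}

    subj, obj, predicate_label = 12, 7, 3  # one illustrative wrela triple
    print(ix_to_predicate[predicate_label])
    ```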
  • 关于论文相关问题

    关于论文相关问题

    您好,我最近拜读了您的论文《Improving Image Captioning with Better Use of Captions》,我非常喜欢您的工作,但对Qualitative Analysis部分的实验有一些疑惑。比如Figure7的图一,场景图中出现了street,但左边的图片目标检测的结果似乎没有street这个物体,请问这是怎么回事呢?

    opened by Linjz1 1
  • RL training, KeyError

    When I used RL training, there was an error. How can I fix it?

    Traceback (most recent call last):
      File "train.py", line 171, in <module>
        train(opt)
      File "train.py", line 102, in train
        reward = get_self_critical_reward(decoder, core_args, vrg_data, fc_feats, att_feats, weak_relas, att_masks, data, gen_result, opt)
      File "/peng/pyx/WeakVRD-Captioning-master/misc/rewards_graph.py", line 60, in get_self_critical_reward
        _, cider_scores = CiderD_scorer.compute_score(gts, res)
      File "cider/pyciderevalcap/ciderD/ciderD.py", line 48, in compute_score
        (score, scores) = cider_scorer.compute_score(self._df)
      File "cider/pyciderevalcap/ciderD/ciderD_scorer.py", line 199, in compute_score
        score = self.compute_cider(df_mode)
      File "cider/pyciderevalcap/ciderD/ciderD_scorer.py", line 173, in compute_cider
        vec, norm, length = counts2vec(test)
      File "cider/pyciderevalcap/ciderD/ciderD_scorer.py", line 122, in counts2vec
        df = np.log(max(1.0, self.document_frequency[ngram]))
    KeyError: ('5452', '1660')

    opened by userpei 1
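    (A note for readers hitting the same KeyError: it generally means an n-gram from the captions is missing from the precomputed CIDEr document-frequency cache, i.e. the cached df file was built from a different vocabulary or data split than the one being trained on; rebuilding the document-frequency statistics for the current dataset is the usual fix.)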
  • Meeting unknown error when training with RL

    Hi, Gitsamshi! I ran into this problem when I tried to train my model with RL, and I cannot figure out why it happens. [screenshot]

    Since I follow your data setting exactly and use the released data files, I don't think it's an issue with the label categories (as some blogs suggested). Do you have any idea why this problem occurs or how to fix it?

    opened by tjuwyh 1
  • About the results

    Hello, I ran the code and it generated an eval_results folder containing a test_test.json file. Are the captions in this JSON file generated by the model? For the same image id, the caption in test_test.json is identical to the one in captions_val2014.json. How can I see the captions the model generated?

    opened by lee-geng 0
  • cider evaluation

    Hi Shi,

    Thanks for your open-source codebase. It helps me a lot.

    I want to ask why you use CiderD from the cider package rather than coco-caption for CIDEr evaluation.

    opened by fortunechen 0
  • RL training error

    When I tried CIDEr-optimized training, I hit this error: "gen_result, sample_logprobs, core_args = decoder(vrg_data, fc_feats, att_feats, weak_relas, att_masks, opt={'sample_max': 0, 'return_core_args': True}, mode='sample') ValueError: need more than 2 values to unpack". decoder() only returns gen_result and sample_logprobs, not core_args.

    I would like to ask: which step determines the return values of decoder()?

    opened by HN246 0
  • Positive Bag Negative Bag

    I am trying to reproduce your model for weakly supervised multi-instance learning, and I am a bit confused about the formation of the positive and negative bags. The paper says that, for a predicate r associated with an object region pair, the region pair will be labeled as a positive bag if the predicate r is in the caption S. My question: the predicates are extracted from the triplets, and the triplets are extracted from the caption, so the predicate will always be present in the caption.

    How do you label the positive and negative bags? Can you please help me understand this?

    Thank you very much.

    opened by Monikshah 24
Owner
    all is classification
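    For readers puzzling over the same point, a generic reading of the multi-instance setup (a sketch of standard MIL classification, not necessarily the authors' exact code): for each image, the region pairs form a bag per candidate predicate; the bag is positive for a predicate that occurs in the caption's parsed triplets and negative for predicates that do not, so negative bags come from predicates outside that image's caption. A minimal sketch with a max-pooled bag score:

    ```python
    import torch
    import torch.nn.functional as F

    def mil_loss(pair_scores, bag_labels):
        """Max-pooled multi-instance loss (a generic sketch, hypothetical names).

        pair_scores: (num_pairs, num_predicates) logits, one row per region pair.
        bag_labels:  (num_predicates,) 1.0 if the predicate appears in the image's
                     caption triplets (positive bag), else 0.0 (negative bag).
        """
        # A bag scores as its best instance: if any region pair expresses the
        # predicate, the whole bag should score high.
        bag_logits, _ = pair_scores.max(dim=0)
        return F.binary_cross_entropy_with_logits(bag_logits, bag_labels)
    ```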