The code repository for the EMNLP 2021 paper "Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization".

CAiRE

Last update: Jan 7, 2023

Overview

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

[Paper] accepted at EMNLP 2021:

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization, by Tiezheng Yu *, Wenliang Dai *, Zihan Liu, Pascale Fung.

Paper Abstract

Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs' powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information? In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our visual guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.

If your work is inspired by our paper or code, please cite it. Thanks!

TODO

Evaluation

We release the generated summaries from different models in ./evaluation/results. All the evaluation metrics can be computed following ./evaluation/README.md.

Prepare dataset

The dataset is available from the How2 dataset GitHub repository. We recommend choosing option 1 there: download a pre-packaged version.

Run fine-tuning

Make a directory for saving the Lightning logs: mkdir lightning_logs
An example of running the BART text-only model: ./scripts/Bart_text_only.sh
An example of running the BART multimodal model: ./scripts/Bart_multimodal.sh

Run inference

An example of running inference with the BART multimodal model: ./scripts/test_Bart_multimodal.sh

Comments

Question about the data

Thank you for your amazing work. I am trying to reproduce it, but an error occurs when I try to train the multimodal model. The error seems to be related to loading the extracted video features. How2 provides several files for video features. Which file did you use, or did you preprocess these features yourselves?

opened by Fr0zenCrane 1
Question about the code

Hi, thanks for your awesome work! I have a question about the code: https://github.com/HLTCHKUST/VG-GPLMs/blob/d27a485c3da4869a6a9f2ff902a9c64129b778b1/src/data_preprocess/data_builder.py#L30 Why did you ignore the first word in the src and tgt? Looking forward to your reply.

opened by haruhi-sudo 1
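
For readers following the question above, here is a minimal sketch of what skipping the first whitespace-separated token of each line looks like. It is not the repository's `data_builder.py`. The function name `drop_first_token` and the example line are hypothetical, and the idea that the first token is an identifier (as in Kaldi-style text files) is only an assumption that the thread does not confirm.

```python
def drop_first_token(line):
    """Return the line without its first whitespace-separated token."""
    tokens = line.strip().split()
    # tokens[1:] is an empty list when the line has zero or one token.
    return " ".join(tokens[1:])


# Hypothetical line: an identifier followed by the tokenized text.
example = "vid_0001 hello and welcome to this video"
print(drop_first_token(example))  # hello and welcome to this video
```

If the first token in the How2 files is real text rather than an identifier, this step would discard a word, so it is worth inspecting a few lines of the data before relying on it.
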
about version of transformers

Hello,

I tried to run your code and got an error saying that 'force_bos_token_to_be_generated' does not exist in BartConfig.

I think it is because of the version of transformers I used.

I used transformers version 4.11.0, and it works well with the text-only BART model. The error occurred when I ran the multimodal task.

Could you let me know which version of transformers you used in your work?

I would also like to know why you used './dataset/sum_devtest/tran.tok.txt' instead of './dataset/sum_train/tran.tok.txt' for your 'train_src_path'.

Thank you

opened by jyson24 1
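
Until the exact transformers version is known, one possible workaround for the missing attribute is to read it with a default value instead of accessing it directly. The sketch below only illustrates that pattern. The helper `read_force_bos` and the stand-in config objects are hypothetical, and the attribute name `forced_bos_token_id` for newer releases is an assumption that should be checked against the installed version (for example with `pip show transformers`).

```python
from types import SimpleNamespace


def read_force_bos(config, default=False):
    """Read force_bos_token_to_be_generated if the config has it, else return default."""
    return getattr(config, "force_bos_token_to_be_generated", default)


# Stand-ins for config objects (hypothetical, not real BartConfig instances).
old_style = SimpleNamespace(force_bos_token_to_be_generated=True)
new_style = SimpleNamespace(forced_bos_token_id=0)  # assumed newer-style attribute

print(read_force_bos(old_style))  # True
print(read_force_bos(new_style))  # False
```

This keeps the code from crashing on a config without the attribute, but it does not guarantee identical generation behavior across versions. Pinning the version the authors used remains the safer route once it is known.
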
Unsupported operation when running t5_multimodal model

I'm running t5-multimodal model on our computer. However, I received the following information after loading the data:

    Validation sanity check: 0it [00:00, ?it/s]
    Validation sanity check: 0%| | 0/10 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/nas02/homes/yang22-1000062/VG-GPLMs/./src/run.py", line 155, in <module>
        trainer.fit(model, train_dataloader=summary_data.train_loader, val_dataloaders=summary_data.val_loader)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
        self.dispatch()
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in dispatch
        self.accelerator.start_training(self)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
        self.training_type_plugin.start_training(trainer)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
        self._results = trainer.run_train()
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 606, in run_train
        self.run_sanity_check(self.lightning_module)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 855, in run_sanity_check
        _, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in run_evaluation
        output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 164, in evaluation_step
        output = self.trainer.accelerator.validation_step(args)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 177, in validation_step
        return self.training_type_plugin.validation_step(*args)
      File "/home/nas02home/conda/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 297, in validation_step
        return self.model(*args, **kwargs)
      File "/home/nas02home/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/nas02home/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1002, in forward
        self._sync_buffers()
      File "/home/nas02home/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1585, in _sync_buffers
        self._sync_module_buffers(authoritative_rank)
      File "/home/nas02home/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1589, in _sync_module_buffers
        self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
      File "/home/nas02home/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1610, in _default_broadcast_coalesced
        self._distributed_broadcast_coalesced(
      File "/home/nas02home/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1526, in _distributed_broadcast_coalesced
        dist._broadcast_coalesced(
    RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.

For other models, the program can run smoothly. How can I solve it?

opened by chlorane 0
`image_len` is not used?

https://github.com/HLTCHKUST/VG-GPLMs/blob/ecd40e8d884123666a339ef9d2968b178610b898/src/models/modeling_bart.py#L749

It looks like `image_len` is not used when calculating the attention. Is that intended?

opened by FutureWithoutEnding 5
How to get the "noise image feature"?

In the code, the noise image feature is loaded from a file. How can I generate it?

The paper says it is generated from a uniform distribution from 0 to 3. Why 0 and 3, and how were these two values chosen? Why use a uniform distribution?

Can you show me some code for this?

opened by FutureWithoutEnding 2
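
As a rough illustration of the sampling described in the question above (values drawn from a uniform distribution between 0 and 3), the sketch below builds a noise feature matrix using only the Python standard library. It is not the authors' code. The function name `make_noise_features`, the shape arguments, and the seed are all hypothetical, and the file format the repository expects for the noise feature is not stated in this thread, so saving to disk is left out.

```python
import random


def make_noise_features(num_frames, feat_dim, low=0.0, high=3.0, seed=0):
    """Return a num_frames x feat_dim list of floats sampled uniformly from [low, high]."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [
        [rng.uniform(low, high) for _ in range(feat_dim)]
        for _ in range(num_frames)
    ]


# Small hypothetical shape; real video features would be much larger.
noise = make_noise_features(num_frames=4, feat_dim=8)
print(len(noise), len(noise[0]))
```

With NumPy available, `numpy.random.uniform(0, 3, size=(num_frames, feat_dim))` would produce an equivalent array. In either case the shape should match that of the real video features being replaced.
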

Owner

CAiRE

This repository contains the PyTorch implementation of the paper STaCK: Sentence Ordering with Temporal Commonsense Knowledge appearing at EMNLP 2021.

Code and data for the EMNLP 2021 paper "Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts". Coming soon!

Code for EMNLP 2021 paper Contrastive Out-of-Distribution Detection for Pretrained Transformers.

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

This repo is the code release of EMNLP 2021 conference paper "Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories".

Code for our EMNLP 2021 paper “Heterogeneous Graph Neural Networks for Keyphrase Generation”

Code for our paper Aspect Sentiment Quad Prediction as Paraphrase Generation in EMNLP 2021.

Implementation for the EMNLP 2021 paper "Interactive Machine Comprehension with Dynamic Knowledge Graphs".

Related resources for our EMNLP 2021 paper

Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Pytorch implementation of paper "Efficient Nearest Neighbor Language Models" (EMNLP 2021)

EMNLP 2021 paper Models and Datasets for Cross-Lingual Summarisation.

Code and data for "Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning" (EMNLP 2021).

Code and data to accompany the camera-ready version of "Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation" in EMNLP 2021

Code to reproduce the experiments in the paper "Transformer Based Multi-Source Domain Adaptation" (EMNLP 2020)

Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Loop Story Generation"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"