Language Models Can See: Plugging Visual Controls in Text Generation
Authors: Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier
This repository contains the code, models, and other related resources for our paper [Language Models Can See: Plugging Visual Controls in Text Generation](https://arxiv.org/abs/2205.02655).
Catalogue:
- 1. Introduction
- 2. News
- 3. Citation
- 4. Environment Setup
- 5. Zero-Shot Image Captioning
- 6. Visually Grounded Story Generation
- 7. Contact
- 8. MAGIC Elsewhere
1. Introduction:
Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process can be guided by modalities beyond text, such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging visual controls into the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework that directly combines an off-the-shelf LM (i.e., GPT-2) with an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called the magic score, which encourages the generated result to be semantically related to a given image while remaining coherent with the previously generated context. Notably, the proposed decoding scheme involves no gradient updates and is therefore computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27x decoding speedup. MAGIC is a flexible framework that is theoretically compatible with any text generation task incorporating image grounding. In our experiments, we show that it can also perform visually grounded story generation given both an image and a text prompt.
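To make the decoding scheme concrete, here is a minimal, illustrative sketch of MAGIC-style decoding, not the repository's actual implementation: at each step, the LM's top-k next-token candidates are re-ranked by adding a weighted CLIP image-text relevance term to the LM log-probability. The checkpoint names (`gpt2`, `openai/clip-vit-base-patch32`), the hyperparameters `k` and `beta`, and the simplified score (which omits the degeneration penalty used in the full method) are assumptions for illustration only.

```python
# Illustrative sketch of MAGIC-style decoding, NOT the repo's implementation.
# Assumptions: vanilla "gpt2" and "openai/clip-vit-base-patch32" checkpoints,
# hyperparameters k and beta, and a simplified score that omits the
# degeneration penalty of the full method.
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def magic_decode(prompt: str, image: Image.Image, max_new_tokens: int = 16,
                 k: int = 5, beta: float = 2.0) -> str:
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    img_feat = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # (1, d)
    for _ in range(max_new_tokens):
        probs = lm(ids).logits[0, -1].softmax(-1)
        top_p, top_i = probs.topk(k)  # candidate pool of size k
        # Decode each candidate continuation and score it against the image.
        texts = [lm_tok.decode(torch.cat([ids[0], tok.view(1)])) for tok in top_i]
        txt_in = clip_proc(text=texts, return_tensors="pt",
                           padding=True, truncation=True)
        txt_feat = clip.get_text_features(**txt_in)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)  # (k, d)
        clip_sim = (txt_feat @ img_feat.T).squeeze(-1)  # (k,) image relevance
        # Simplified magic score: LM log-prob plus weighted CLIP relevance.
        scores = top_p.log() + beta * clip_sim.softmax(-1).log()
        best = top_i[scores.argmax()]
        ids = torch.cat([ids, best.view(1, 1)], dim=-1)
    return lm_tok.decode(ids[0])

# Hypothetical usage: caption = magic_decode("A photo of", Image.open("dog.jpg"))
```

In this sketch, `beta` balances image relevance against fluency; setting `beta = 0` recovers plain greedy selection over the LM distribution, with no visual control.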
2. News:
- [2022/05/06] MAGIC is publicly released!
3. Citation:
If you find our paper and resources useful, please star this repository and cite our papers. Thanks!
```bibtex
@article{DBLP:journals/corr/abs-2205-02655,
  author     = {Yixuan Su and Tian Lan and Yahui Liu and Fangyu Liu and
                Dani Yogatama and Yan Wang and Lingpeng Kong and Nigel Collier},
  title      = {Language Models Can See: Plugging Visual Controls in Text Generation},
  journal    = {CoRR},
  volume     = {abs/2205.02655},
  year       = {2022},
  url        = {https://doi.org/10.48550/arXiv.2205.02655},
  doi        = {10.48550/arXiv.2205.02655},
  eprinttype = {arXiv},
  eprint     = {2205.02655},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2205-02655.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}

@article{DBLP:journals/corr/abs-2202-06417,
  author     = {Yixuan Su and Tian Lan and Yan Wang and Dani Yogatama and
                Lingpeng Kong and Nigel Collier},
  title      = {A Contrastive Framework for Neural Text Generation},
  journal    = {CoRR},
  volume     = {abs/2202.06417},
  year       = {2022},
  url        = {https://arxiv.org/abs/2202.06417},
  eprinttype = {arXiv},
  eprint     = {2202.06417},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2202-06417.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
4. Environment Setup:
Python version: 3.8

```bash
pip3 install -r requirements.txt
```
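To verify the installation, here is a quick sanity check, assuming the pinned requirements include PyTorch and Hugging Face transformers (which the code in this repo relies on):

```python
# Optional sanity check; assumes requirements.txt pins torch and transformers.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```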