Multilingual Image Captioning

Overview

Multilingual Image Captioning

Authors: Bhavitvya Malik, Gunjan Chhablani
Demo Link: https://huggingface.co/spaces/flax-community/multilingual-image-captioning

GitHub Repository for Multilingual Image Captioning task created during HuggingFace JAX/Flax community week. Multilingual Image Captioning addresses the challenge of caption generation for an image in a multilingual setting. Here, we fuse CLIP Vision transformer into mBART50 and perform training on translated version of Conceptual-12M dataset.

  • Our models are present in the models directory. We have combined CLIP Vision+mBART-50 in the model repository.
  • Our training scripts are:
    • run.sh for pre-training.

Usage:

  • Install the requirements.txt file in your environment.
  • To run a training script, just use your command line:
./run.sh
  • Inference (You will need to clone the model from here as well):
from torchvision.io import read_image
import numpy as  np
import os, wget
from transformers import CLIPProcessor, MBart50TokenizerFast
from model.flax_clip_vision_mbart.modeling_clip_vision_mbart import FlaxCLIPVisionMBartForConditionalGeneration
img = wget("http://images.cocodataset.org/val2017/000000397133.jpg")
img = read_image(img) # reading image
clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
clip_outputs = clip_processor(images=img)
clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1,2,0) # Need to transpose images as model expected channel last images.
tokenizer = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50"')
model = FlaxCLIPVisionBertForMaskedLM.from_pretrained('flax-community/clip-vit-base-patch32_mbart-large-50')
output_ids = model.generate(batch["pixel_values"], forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"], num_beams=4, max_length=64).sequences  # "es_XX is the language code in which you want the translation
# en_XX: English, fr_XX: French, es_XX: Spanish, de_DE: Deutsch
output_string = tokenizer.batch_decode(output_ids.reshape(-1, 64), skip_special_tokens=True, max_length=64)
output_string  # Un restaurante u otro lugar para comer en el Hotel

Table of Contents

Introduction and Motivation

This project is focused on Mutilingual Image Captioning, which has attracted an increasing amount of attention in the last decade due to its potential applications. Most of the existing datasets and models on this task work with English-only image-text pairs. It is a challenging task to generate captions with proper linguistics properties in different languages as it requires an advanced level of image understanding. Our intention here is to provide a Proof-of-Concept with our CLIP Vision + mBART-50 model baseline which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports for four languages - English, French, German, and Spanish.

Novel Contributions

Our novel contributions include:

Methodology

Pre-training

We follow an encoder-decoder approach for image captioning, where the image encoder is the CLIP Vision model (a ViT transformer). The pre-training task is image-to-text generation. We take the input tokens and shift them using an token towards right in order to create the inputs for our model, while the original input tokens become labels. The model is trained on the dataset. in an end-to-end fashion.

Dataset The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and then broken images are removed which gives us about 10M images. To save time, we use 2.5M of these image-text pairs. Then we use the MarianMT Helsinki-NLP/opus-mt-{src}-{tgt} checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping approximately 2.5M examples of each language.

Model The model is shown in the image above. We create a custom model in Flax which integerates the CLIP Vision model as an encoder inside mBART model. We also use custom configs and modules in order to accomodate for these changes, and allow loading from mBART and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the shifted token ids are fed to the mBART decoder. We use the facebook/mbart-large-50 and openai/clip-vit-base-patch32 checkpoints for mBART and CLIP Vision models, respectively. All our code is available on GitHub.

Our model reached eval loss of ~2.6 around ~70K steps. Here are the BLEU scores (out of 1) for different languages:

Language BLEU-1 BLEU-2 BLEU-3 BLEU-4
English 0.13083 0.08887 0.06681 0.04899
Spanish 0.15981 0.09858 0.06918 0.04776
German 0.14234 0.09817 0.07405 0.0515
French 0.13021 0.08862 0.06598 0.04647

Challenges and Technical Difficulties

Training image captioning that too multilingual was a difficult task and we faced challenges at almost every point of this process.

  • Dataset: Our initial plan was to translate Conceptual 12M using mTranslate or Yandex but they turned out to be too slow even with multiprocessing. Not having proper translation could lead to poor performance of the trained image-caption model. We translated the whole dataset using MBart50 for all languages which took around 3-4 days. Further on, we realised that mBART captions were not that good and model was not converging because of that which lead us to re-translate our captions with Marian

  • We prepared the model and config classes for our model from scratch, basing it on CLIP model based on ViT-B/32 Image Transformer and mBART50 implementations in FLAX. The CLIP embeddings were to be used inside the mBART50 embeddings class, which was the major challenge here.

  • RAM issues: Loading and training 10M image-caption dataset led to huge amount of RAM consumption on TPU (~200GB in the first few steps) because of which we had to optimize the script, use less data, and use less num_workers in order to avoid this issue. This also caused our training to slow down.

  • We were only able to get around 2-3 days of training time on TPUs due to aformentioned challenges. We were unable to perform hyperparameter tuning.

Limitations and Bias

Limitations

  • Our model has a major limitation in that the training data provided was limited to a sequence length of 64 tokens. Hence, it doesn not perform very well with longer sequence lengths. Sometimes, it yields up empty captions. We are working on it as of this writing by doubling the maximum sequence length of translation and training.
  • The dataset has all Person type named entites masked as . While that is good for biases as we explain below, the dataset contains too many tags and the model results in sometimes for Person-related images.
  • Our captions are sometimes generic. Stating what is present in the image instead of generation well-formed and convoluted captions. Despite the training, the BLEU scores we achieve are not very great, which could be a reason for this. With higher BLEU scores, we can expect less-generic models.
  • English captions are sometimes better than other languages. This can be due to the fact that we limit sequence length of other languages to 64 (and now 128) while English text works fine. This could also be due to poor-quality translations which we wish to address in our next attempt.

Biases

  • Due to the gender, racial, color and stereotypical biases in data, person identification by an image captioning model suffers. Also, the gender-activity bias, owing to the word-by-word prediction, influences other words in the caption prediction, resulting in the well-known problem of label bias.

  • One of the reasons why we chose Conceptual 12M over COCO captioning dataset for training our Multi-lingual Image Captioning model was that in former all named entities of type Person were substituted by a special token . Because of this, the gendered terms in our captions became quite infrequent. We'll present a few captions from our model to analyse how our model performed on different images on which different pre-trained image captioning model usually gives gender prediction biases.

Conclusion, Future Work, and Social Impact

Conclusion

In this project, we presented Proof-of-Concept with our CLIP Vision + mBART-50 model baseline which leverages a multilingual checkpoint with pre-trained image encoders in four languages - English, French, German, and Spanish. Our models achieve a BLEU-1 score of around 0.14 which is decent considering the amount of training time we could get and how challenging multilingual training is.

Future Scope

We hope to improve this project in the future by using:

  • Superior translation model: Translation has a very huge impact on how the end model would perform. Better translators (for e.g. Google Translate API) and language specific seq2seq models for translation are able to generate better data, both for high-resource and low-resource languages.
  • Checking translation quality: Inspecting quality of translated data is as important as the translation model itself. For this we'll either require native speakers to manually inspect a sample of translated data or devise some unsupervised translation quality metrics for the same.
  • More data: Currently we are using only 2.5M images of Conceptual 12M for image captioning. We plan to include other datasets like Conceptual Captions 3M, subset of YFCC100M dataset etc.
  • Low resource languages: With better translation tools we also wish to train our model in low resource languages which would further democratize the image captioning solution and help people realise the potential of language systems.
  • Accessibility: Making the model deployable on hand-held devices to make it more accessible. Currently, our model is too large to fit on mobile/edge devices because of which not many will be able to access it. However, our final goal is ensure everyone can access it without any computation barriers. Hopefully we'll be able to support TFLite for our model as well in future.
  • More models: We can combine several decoders with the CLIP-Vision encoder to get multilingual mdoels. We also wish to work with Marian models for language-specific captioning models, especially for low-resource languages.
  • Better training: We wish to experiment more with hyperparameters, optimizers, and learning rate schedulers to make the training work better. Our validation curve, as of now, plateaus in a very few epochs and we wish to address this issue.

Social Impact

Our initial plan was to include 4 high-resource and 4 low-resource languages (Marathi, Bengali, Urdu, Telegu) in our training data. However, the existing translations do not perform as well and we would have received poor labels, not to mention, with a longer training time.

Being able to automatically describe the content of an image using properly formed sentences in any language is a challenging task, but it could have great impact by helping visually impaired people better understand their surroundings.

A slightly (not-so) long term use case would definitely be, explaining what happens in a video, frame by frame. One more recent use-case for the same can be generating surgical instructions. Since our model is multi-lingual which means the instructions will not be just limited to regions where English is spoken but those instructions can be perused in regions where Spanish, French and German are spoken as well. Further if we extend this project to low-resource languages then its impact can be manifold.

References

Papers

@inproceedings{NIPS2017_3f5ee243,
 author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, \L ukasz and Polosukhin, Illia},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett},
 pages = {},
 publisher = {Curran Associates, Inc.},
 title = {Attention is All you Need},
 url = {https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf},
 volume = {30},
 year = {2017}
}
@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
    pages = "38--45"
}
@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}
@InProceedings{mariannmt,
  title     = {Marian: Fast Neural Machine Translation in {C++}},
  author    = {Junczys-Dowmunt, Marcin and Grundkiewicz, Roman and
               Dwojak, Tomasz and Hoang, Hieu and Heafield, Kenneth and
               Neckermann, Tom and Seide, Frank and Germann, Ulrich and
               Fikri Aji, Alham and Bogoychev, Nikolay and
               Martins, Andr\'{e} F. T. and Birch, Alexandra},
  booktitle = {Proceedings of ACL 2018, System Demonstrations},
  pages     = {116--121},
  publisher = {Association for Computational Linguistics},
  year      = {2018},
  month     = {July},
  address   = {Melbourne, Australia},
  url       = {http://www.aclweb.org/anthology/P18-4020}
}
@article{liu2020multilingual,
    title={Multilingual Denoising Pre-training for Neural Machine Translation},
    author={Yinhan Liu and Jiatao Gu and Naman Goyal and Xian Li and Sergey Edunov and Marjan Ghazvininejad and Mike Lewis and Luke Zettlemoyer},
    year={2020},
    eprint={2001.08210},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision},
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Useful Links

Acknowledgements

We'd like to thank Abheesht Sharma for helping in the discussions in the initial phases. Luke Melas helped us get the cleaned CC-12M data on our TPU-VMs and we are very grateful to him.

This project would not be possible without the help of Patrick and Suraj who met with us and helped us review our approach and guided us throughout the project. We especially thank Patrick for going out of the way and allowing us extra TPU time so that we could work on this project.

Last but not the least, we thank the Google Team for helping answer our queries on the Slack channel, and for providing us TPU-VMs.

You might also like...
Image Captioning using CNN ,LSTM and Attention

Image Captioning using CNN ,LSTM and Attention This is a deeplearning model which tries to summarize an image into a text . Installation Install this

 Image Captioning on google cloud platform based on iot
Image Captioning on google cloud platform based on iot

Image-Captioning-on-google-cloud-platform-based-on-iot - Image Captioning on google cloud platform based on iot

Syntax-Aware Action Targeting for Video Captioning

Syntax-Aware Action Targeting for Video Captioning Code for SAAT from "Syntax-Aware Action Targeting for Video Captioning" (Accepted to CVPR 2020). Th

Codes for paper "Towards Diverse Paragraph Captioning for Untrimmed Videos". CVPR 2021

Towards Diverse Paragraph Captioning for Untrimmed Videos This repository contains PyTorch implementation of our paper Towards Diverse Paragraph Capti

Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.
Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.

WSDEC This is the official repo for our NeurIPS paper Weakly Supervised Dense Event Captioning in Videos. Description Repo directories ./: global conf

Videocaptioning.pytorch - A simple implementation of video captioning

pytorch implementation of video captioning recommend installing pytorch and pyth

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video
Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

Meta Language-Specific Layers in Multilingual Language Models

Meta Language-Specific Layers in Multilingual Language Models This repo contains the source codes for our paper On Negative Interference in Multilingu

Repository for XLM-T, a framework for evaluating multilingual language models on Twitter data

This is the XLM-T repository, which includes data, code and pre-trained multilingual language models for Twitter. XLM-T - A Multilingual Language Mode

Comments
  • Tips about changing Decoder model

    Tips about changing Decoder model

    hi @gchhablani

    i'm doing similar jax project but with monolingual language instead of multilingual

    i'm planning to use gpt2 as decoder instead of bart

    can you give any tips where should i change it?

    i notice that FlaxBartDecoder return cross attention but Gpt2 only return hidden_states and attention. should that be concern?

    thank you before

    sorry if this no the right place to put question like this

    opened by acul3 4
  • Add translation script for Conceptual Captions dataset

    Add translation script for Conceptual Captions dataset

    @gchhablani, this is a rough draft of the translation script. Review it before you run it on the instance provided by HF.

    Install the mtranslate library: pip install mtranslate

    If you are in the folder containing the tsv file, run the following command:

    python translate_cc_captions.py --dataset_store_path ./ --no_of_train_samples 1000 --no_of_val_samples 100
    

    Check the lang_list argument if you want to translate it to other languages. The default is French ("fr").

    opened by abheesht17 0
  • inference error

    inference error

    Dear @gchhablani , I'm trying run Inference.py to get the caption of the image. However, I got this error: flax.errors.ScopeParamShapeError: Inconsistent shapes between value and initializer for parameter "kernel" in "/model/encoder/vision_model/embeddings/patch_embedding": (32, 32, 3, 768), (32, 32, 224, 768). (https://flax.readthedocs.io/en/latest/flax.errors.html#flax.errors.ScopeParamShapeError) could you please give me some clue about it?

    opened by wentaoxandry 0
Owner
Gunjan Chhablani
Computer Science Graduate from BITS Pilani, Goa || Deep Learning || NLP, CV, Biomedical Image Analysis, Multimodal Learning ||
Gunjan Chhablani
Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)

Diverse Image Captioning with Context-Object Split Latent Spaces This repository is the PyTorch implementation of the paper: Diverse Image Captioning

Visual Inference Lab @TU Darmstadt 34 Nov 21, 2022
Semi-Autoregressive Transformer for Image Captioning

Semi-Autoregressive Transformer for Image Captioning Requirements Python 3.6 Pytorch 1.6 Prepare data Please use git clone --recurse-submodules to clo

YE Zhou 23 Dec 9, 2022
improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

null 310 Dec 28, 2022
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

VisualGPT Our Paper VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning Main Architecture of Our VisualGPT Downloa

Vision CAIR Research Group, KAUST 140 Dec 28, 2022
Unofficial pytorch implementation for Self-critical Sequence Training for Image Captioning. and others.

An Image Captioning codebase This is a codebase for image captioning research. It supports: Self critical training from Self-critical Sequence Trainin

Ruotian(RT) Luo 906 Jan 3, 2023
An unreferenced image captioning metric (ACL-21)

UMIC This repository provides an unferenced image captioning metric from our ACL 2021 paper UMIC: An Unreferenced Metric for Image Captioning via Cont

hwanheelee 14 Nov 20, 2022
Image Captioning using CNN and Transformers

Image-Captioning Keras/Tensorflow Image Captioning application using CNN and Transformer as encoder/decoder. In particulary, the architecture consists

null 24 Dec 28, 2022
Optimized code based on M2 for faster image captioning training

Transformer Captioning This repository contains the code for Transformer-based image captioning. Based on meshed-memory-transformer, we further optimi

lyricpoem 16 Dec 16, 2022
An Image Captioning codebase

An Image Captioning codebase This is a codebase for image captioning research. It supports: Self critical training from Self-critical Sequence Trainin

Ruotian(RT) Luo 1.1k Oct 18, 2021
A transformer-based method for Healthcare Image Captioning in Vietnamese

vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese This repo GitHub contains our solution for vieCap4H

Doanh B C 4 May 5, 2022