A collection of models for image - text generation in ACM MM 2021.

Multimedia Research

Last update: Oct 30, 2022

Related tags

Text Data & NLP generate-it

Overview

Bi-directional Image and Text Generation

UMT-BITG (image & text generator)

Unifying Multimodal Transformer for Bi-directional Image and Text Generation,
Yupan Huang, Bei Liu, Yutong Lu, in ACM MM 2021 (Industrial Track).

UMT-DBITG (diverse image & text generator)

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation,
Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu, in ACM MM 2021 (Video and Demo Track).

Poster or slides are available in the assets folder by visiting OneDrive.

Data & Pre-trained Models

Download preprocessed data and our pre-trained models by visiting OneDrive. We suggest following our data structures, which is consistent with the paths in config.py. You may need to modify the root_path in config.py. In addition, please following the instructions to prepare some other data:

Download grid features in path data/grid_features provided by X-LXMERT or follow feature extraction to extract these features.

wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_train_grid8.h5 -P data/grid_features
wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_valid_grid8.h5 -P data/grid_features
wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_test_grid8.h5 -P data/grid_features

For text-to-image evaluation on MSCOCO dataset, we need the real images to calculate the FID metric. For UMT-DBITG, we use MSCOCO karpathy split, which has been included in the OneDrive folder (images/imgs_karpathy). For UMT-BITG, please download MSCOCO validation set in path images/coco_val2014.

Citation

If you like our paper or code, please generously cite us:

@inproceedings{huang2021unifying,
  author    = {Yupan Huang and Bei Liu and Yutong Lu},
  title     = {Unifying Multimodal Transformer for Bi-directional Image and Text Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

@inproceedings{huang2021diverse,
  author    = {Yupan Huang and Bei Liu and Jianlong Fu and Yutong Lu},
  title     = {A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

Acknowledgement

Our code is based on LaBERT and X-LXMERT. Our evaluation code is from pytorch-fid and inception_score. We sincerely thank them for their contributions!

Feel free to open issues or email to me for help to use this code. Any feedback is welcome!

You might also like...

Code for Text Prior Guided Scene Text Image Super-Resolution

82 Dec 26, 2022

A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

wav2vec-toolkit A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models This repository accompanies the

29 Oct 23, 2022

A collection of GNN-based fake news detection models.

This repo includes the Pytorch-Geometric implementation of a series of Graph Neural Network (GNN) based fake news detection models. All GNN models are implemented and evaluated under the User Preference-aware Fake News Detection (UPFD) framework. The fake news detection problem is instantiated as a graph classification task under the UPFD framework.

251 Jan 1, 2023

This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

37 Dec 4, 2022

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

105 Jan 3, 2023

PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

605 Jan 2, 2023

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.3k Jan 7, 2023

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

211 Dec 28, 2022

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.1k Feb 17, 2021

Comments

"it-generator" project requirements issues
@HYPJUDY I am unable to setup the requirements for the project using the "requirements.txt" on my Windows machine as it produces the following error:

I did find a way around this using PyTorch previous versions:

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

There is additionally an issue with conflicting tensorboard versions. I commented out the tensorboard in the "requirements.txt" for now:

Also I believe the yacs and apex libraries are required, as they are also not in the requirements and produce errors when running the sample_images.py file.
opened by GoelBiju 2
Inference using own datasets

Hello, I'm interested in your work!

I want to know what should i do if i want to infer the model by using my own datasets.

Should i build the file like dataset_coco.json, and re-extract cluster and grid features?

Thank you!

opened by victorup 1

a question about image mask

In train.py(103-110)

103: # for image
104: _visual_mask = torch.zeros((batch_size, visual_token_num), dtype=torch.float32, device=device)
105: # need to mask token content in selected_idx for prediction/generation
106: num_masks = random.randint(max(1, int(0.1 * visual_token_num)), visual_token_num)
107: selected_idx = random.sample(range(visual_token_num), num_masks)
108: _visual_mask[:, selected_idx] = 1
109: mask_position = (_visual_mask == 1).to(torch.long).view(-1)
110: mask_position = mask_position.nonzero().squeeze()

I think '_visual_mask = 1' means the model can see it, '_visual_mask = 0' is the opposite. The above codes randomly sample mask position, which selects which grid(8*8) the model can see(_visual_mask=1). The position that really needs to be masked is the position where the _visual_mask is equal to 0. So the code on line 109 should be changed to mask_position = (_visual_mask == 0).to(torch.long).view(-1) is this right?

opened by aaronma2020 0

Owner

Multimedia Research

Multimedia Research at Microsoft Research Asia

GitHub

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

3.2k Dec 31, 2022

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

66 Dec 26, 2022

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

740 Dec 24, 2022

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

6.4k Jan 1, 2023

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

4.8k Feb 18, 2021

A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN

artificial intelligence cosmic love and attention fire in the sky a pyramid made of ice a lonely house in the woods marriage in the mountains lantern

2.3k Jan 1, 2023

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU，一个中文文本分类、序列标注工具包，支持中文长文本、短文本的多类、多标签分类任务，支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

186 Dec 24, 2022

Text-Summarization-using-NLP - Text Summarization using NLP to fetch BBC News Article and summarize its text and also it includes custom article Summarization

Text-Summarization-using-NLP Text Summarization using NLP to fetch BBC News Arti

21 Aug 6, 2022

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

20 Jul 11, 2022

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is designed based on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning.

612 Jan 4, 2023

A collection of models for image - text generation in ACM MM 2021.

Related tags

Overview

Bi-directional Image and Text Generation

UMT-BITG (image & text generator)

UMT-DBITG (diverse image & text generator)

Data & Pre-trained Models

Citation

Acknowledgement

You might also like...

Code for Text Prior Guided Scene Text Image Super-Resolution

A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

A collection of GNN-based fake news detection models.

This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Comments

"it-generator" project requirements issues

Inference using own datasets

a question about image mask

Owner

Multimedia Research

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Unsupervised text tokenizer for Neural Network-based text generation.

Unsupervised text tokenizer for Neural Network-based text generation.

A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Text-Summarization-using-NLP - Text Summarization using NLP to fetch BBC News Article and summarize its text and also it includes custom article Summarization

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks