(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Last update: Dec 4, 2022

Overview

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Mingchen Zhuge*, Dehong Gao*, Deng-Ping Fan#, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, Ling Shao.

[Paper][Chinese Version][Video][Poster][MSRA_Slide][News1][News2][MSRA_Talking][机器之心_Talking]

Introduction

We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches of different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework. It attains state-of-the-art results by large margins on four downstream tasks: text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% absolute improvement), category recognition (ACC: 3.28% absolute improvement), and fashion captioning (BLEU-4: 1.2 absolute improvement). We validate the efficiency of Kaleido-BERT on a wide range of e-commerce websites, demonstrating its broader potential in real-world applications.
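To make "patches of different scales" concrete, here is a minimal sketch that cuts an image into uniform grids at several scales. The grid sizes 1x1 to 5x5, the uniform split, and the names `grid_boxes` and `multi_scale_boxes` are assumptions for illustration only; the patch generator in this repository may crop patches differently.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_boxes(width: int, height: int, n: int) -> List[Box]:
    """Split a width x height image into an n x n grid of patch boxes."""
    boxes = []
    for row in range(n):
        for col in range(n):
            left = col * width // n
            top = row * height // n
            right = (col + 1) * width // n
            bottom = (row + 1) * height // n
            boxes.append((left, top, right, bottom))
    return boxes


def multi_scale_boxes(
    width: int, height: int, scales: Sequence[int] = (1, 2, 3, 4, 5)
) -> Dict[int, List[Box]]:
    """Build one grid per scale (the scale list is an assumed default)."""
    return {n: grid_boxes(width, height, n) for n in scales}


pyramid = multi_scale_boxes(240, 240)
for n, boxes in pyramid.items():
    print(f"{n}x{n} grid: {len(boxes)} patches")
```

Each box can then be passed to an image library's crop function, and a different self-supervised task can be applied to the patches of each scale.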

Notes

Code will be released on 2021/4/16.
This is the TensorFlow implementation built on Alibaba/EasyTransfer. We will also release a PyTorch version built on Huggingface/Transformers in the future.
If you have trouble downloading these datasets, edit /dataset/get_pretrain_data.sh, /dataset/get_finetune_data.sh, and /dataset/get_retrieve_data.sh, and comment out the wget file links you do not need. This will not affect the following steps.
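As a rough illustration of commenting out selected wget lines, the sketch below prefixes matching lines of a download script with `#`. The helper name `comment_out_wget`, the example script contents, and the `example.com` URLs are all hypothetical; the real download scripts list different links.

```python
from typing import Iterable


def comment_out_wget(script_text: str, skip_keywords: Iterable[str]) -> str:
    """Comment out wget lines that contain any of the given keywords."""
    keywords = list(skip_keywords)
    result = []
    for line in script_text.splitlines():
        is_wget = line.lstrip().startswith("wget ")
        if is_wget and any(key in line for key in keywords):
            result.append("# " + line)  # disable this download
        else:
            result.append(line)  # keep every other line unchanged
    return "\n".join(result) + "\n"


# Hypothetical script contents, used only to demonstrate the helper.
example = """#!/bin/sh
wget https://example.com/data/train_part1.tar
wget https://example.com/data/train_part2.tar
wget https://example.com/data/valid.tar
"""

patched = comment_out_wget(example, ["train_part2"])
print(patched)
```

To apply this to a real script, read the file into a string, pass it through the helper, and write the result back; keeping a backup copy of the original script is advisable.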

Get started

Clone this code

git clone git@github.com:mczhuge/Kaleido-BERT.git
cd Kaleido-BERT

Environment setup (details can be found in conda_env.info)

conda create --name kaleidobert --file conda_env.info
conda activate kaleidobert
conda install tensorflow==1.15.0
pip install boto3 tqdm tensorflow_datasets --index-url=https://mirrors.aliyun.com/pypi/simple/
pip install sentencepiece==0.1.92 sklearn --index-url=https://mirrors.aliyun.com/pypi/simple/
pip install joblib==0.14.1
python setup.py develop

Download pretrained dependency

cd Kaleido-BERT/scripts/checkpoint
sh get_checkpoint.sh

Finetune

#Download finetune datasets

cd Kaleido-BERT/scripts/dataset
sh get_finetune_dataset.sh
sh get_retrieve_dataset.sh

#Testing CAT/SUB

cd Kaleido-BERT/scripts
sh run_cat.sh
sh run_subcat.sh

#Testing TIR/ITR

cd Kaleido-BERT/scripts
sh run_i2t.sh
sh run_t2i.sh
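
The Introduction reports retrieval quality as R@1. As a minimal sketch of how recall at rank K is commonly computed when each query has a single correct match, consider the following. The function names and the toy rankings are hypothetical and are not taken from the evaluation code behind run_i2t.sh and run_t2i.sh.

```python
from typing import Dict, List


def hit_at_k(ranked_ids: List[str], relevant_id: str, k: int) -> float:
    """Return 1.0 if the correct item appears in the top-k results."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def recall_at_k(
    rankings: Dict[str, List[str]], ground_truth: Dict[str, str], k: int = 1
) -> float:
    """Average the per-query hits over all queries."""
    hits = [hit_at_k(rankings[q], ground_truth[q], k) for q in ground_truth]
    return sum(hits) / len(hits)


# Toy data: each query maps to candidates sorted by descending score.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a", "c"],
    "q3": ["c", "a", "b"],
    "q4": ["a", "c", "b"],
}
ground_truth = {"q1": "a", "q2": "a", "q3": "c", "q4": "b"}

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, ground_truth, k):.2f}")
```

In this toy example two of the four queries rank the correct item first, so R@1 is 0.50.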

Pre-training

#Download pre-training datasets

cd Kaleido-BERT/scripts/dataset
sh get_prtrain_dataset.sh

#Remove existing checkpoint
rm -rf Kaleido-BERT/checkpoint/pretrained

#Run pre-training
cd Kaleido-BERT/scripts/
sh run_pretrain.sh

Acknowledgement

Thanks to the Alibaba ICBU Search Team and the Alibaba PAI Team for technical support.

Citing Kaleido-BERT

@inproceedings{Zhuge2021KaleidoBERT,
  title={Kaleido-BERT: Vision-Language Pre-training on Fashion Domain},
  author={Zhuge, Mingchen and Gao, Dehong and Fan, Deng-Ping and Jin, Linbo and Chen, Ben and Zhou, Haoming and Qiu, Minghui and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={},
  year={2021}
}

Contact

Mingchen Zhuge (email: mczhuge@cug.edu.cn | wechat: tjpxiaoming)
Deng-Ping Fan (email: denpfan@gmail.com)
Dehong Gao (email: dehong.gdh@alibaba-inc.com)

Feel free to contact us if you have additional questions.

Comments

Regarding testing with other data

What's the best way to test with my custom image set? The t2i and i2t data look like mostly text and vectors. Is there a script to get my data into that format?

opened by shaheenkdr 9

Problem with the third step: Download Dependency

Thank you for sharing such great work.

When I run sh get_checkpoint.sh, I get the following error:

Resolving icbu-ensa-sc.oss-cn-zhangjiakou.aliyuncs.com (icbu-ensa-sc.oss-cn-zhangjiakou.aliyuncs.com)... 47.92.17.218
Connecting to icbu-ensa-sc.oss-cn-zhangjiakou.aliyuncs.com (icbu-ensa-sc.oss-cn-zhangjiakou.aliyuncs.com)|47.92.17.218|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden.

And when I open the link directly, I get the following error:

This XML file does not appear to have any style information associated with it. The document tree is shown below. AccessDenied You have no right to access this object because of bucket acl. 607C319BB6DA383338EC6AFD icbu-ensa-sc.oss-cn-zhangjiakou.aliyuncs.com

Could you provide a solution?

opened by tangyuhao2016 5

Finetuning of Kaleido-BERT for Fashion Captioning

Thanks for sharing this interesting work. Would you please share how Kaleido-BERT was fine-tuned on the captioning task? Did you use a separate decoder for generation, or only the Kaleido-BERT encoder?

opened by gourango01 2

How to generate input_schema format data?

Hi, I find your work very interesting, and it is aligned with my project requirements. I want to fine-tune it on a custom dataset, where I have raw images and text with labels. The task is similar to "Category/SubCategory Recognition". How do I get the data into the input_schema format? Please share the code if you have any.

opened by Nidhi-kumari 1

Fashion Captioning using Kaleido-BERT and Fashion-BERT

Hi, I have gone through your code. Very interesting work. Could you please explain the input used to calculate the MLM logits for caption generation? I have tried inputs in the following formats:

1. image_feature,[SEP], [MASK],[PAD]...[PAD]
2. image_feature,[CLS], [MASK],[PAD]....[PAD]
3. [CLS], [MASK],[PAD]...[PAD],[SEP],image_feature; this will be in a loop.

Which one is the correct format? Thanks!

opened by Surabhi-Kumari 1

Finetuning of Kaleido-BERT for Fashion Captioning Update

#6 During fine-tuning on the image captioning task, did you use any pre-training tasks (e.g., AKPM, TIM, and AMLM) along with the fashion captioning task, i.e., given an image (a sequence of image patches generated by the "Kaleido Patch Generator"), predict the corresponding caption?

opened by gourango01 1

Problem with the second step

Thanks for sharing such great work. When I run "pip install boto3 tqdm tensorflow_datasets --index-url=https://mirrors.aliyun.com/pypi/simple/", something goes wrong, as shown below:

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting boto3
Downloading https://mirrors.aliyun.com/pypi/packages/f1/99/43e5571005c792284276986eabd956699fac65d283df409b1482ca8722d8/boto3-1.17.67-py2.py3-none-any.whl (131kB)
|████████████████████████████████| 133kB 5.5MB/s
Collecting tqdm
Downloading https://mirrors.aliyun.com/pypi/packages/72/8a/34efae5cf9924328a8f34eeb2fdaae14c011462d9f0e3fcded48e1266d1c/tqdm-4.60.0-py2.py3-none-any.whl (75kB)
|████████████████████████████████| 81kB 14.7MB/s
Collecting tensorflow_datasets
Downloading https://mirrors.aliyun.com/pypi/packages/fe/52/9b9f6312cfa29c39445d22a3ba45f6279db1937de9df93c9fb65dcf0e42a/tensorflow-datasets-3.2.1.tar.gz (2.9MB)
|████████████████████████████████| 2.9MB 30.2MB/s
Collecting jmespath<1.0.0,>=0.7.1
Downloading https://mirrors.aliyun.com/pypi/packages/07/cb/5f001272b6faeb23c1c9e0acc04d48eaaf5c862c17709d20e3469c6e0139/jmespath-0.10.0-py2.py3-none-any.whl
ERROR: Could not find a version that satisfies the requirement botocore<1.21.0,>=1.20.67 (from boto3) (from versions: 0.4.1, 0.4.2, 0....)
ERROR: No matching distribution found for botocore<1.21.0,>=1.20.67 (from boto3)

I'm confused about this. Could you provide a solution?

PS: Here is my environment:
OS: Ubuntu 16.04.6
env: set up as you described

opened by zhangxj59 0

Some questions about the model proposed in the paper

1. In Section 3.3 (Attention-based Alignment Generator), for the step Generated Tokens --> Attention Map: does "token" refer only to nouns, or to all words (including prepositions, verbs, etc.) in the generated and raw text?

2. In Section 3.3 (Attention-based Alignment Generator), for the step Attention Map --> Patch: how is the attention map produced by the SAT model aligned with the Kaleido patches? Is it done by calculating the KL divergence between the attention map and the patches, or by some other method?

opened by tangyuhao2016 0

Owner

Master's student in Computer Science at China University of Geosciences.
