MUGE Multimodal Retrieval Baseline
This repo is built on the open_clip project, with modifications to adapt it to the Chinese Multimodal Retrieval task.
Requirements and Installation
This repo has been successfully tested in the following environment:
- python == 3.6.4
- pytorch == 1.7.1
- CUDA Version == 10.2
To install the requirements, run the following command:
pip install -r requirements.txt
For other CUDA versions (9.2, 10.1, 11.0), please refer to this guide on the official PyTorch website and edit requirements.txt to install compatible versions of torch and torchvision.
Getting Started
Assume the downloaded dataset and pretrained weights are placed under the directory ${DATAPATH}. The following experiments are performed on a single server with 8 V100-16G GPUs.
Prepare CLIP and BERT Weights
In this repo, we build a CLIP model and use pretrained OpenAI ViT-B-16 (download) and Chinese RoBERTa (ymcui's project, download) weights to initialize the image and text encoders, respectively.
For the ViT-B-16 weights, run the following command to convert the checkpoint from a JIT model to a state_dict:
python src/preprocess/transform_openai_pretrain_weights.py \
--raw-ckpt-path ${DATAPATH}/ViT-B-16.pt \
--new-ckpt-path ${DATAPATH}/ViT-B-16.state_dict.pt
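For reference, the conversion essentially loads the TorchScript archive and re-saves only its weights. Below is a minimal sketch of that idea (illustrative only; the bundled script may handle key renaming or dtype casting differently):
# Minimal sketch of a JIT-to-state_dict conversion (illustrative; the actual
# transform_openai_pretrain_weights.py may differ in details).
import torch

jit_model = torch.jit.load("ViT-B-16.pt", map_location="cpu")  # TorchScript archive released by OpenAI
state_dict = jit_model.state_dict()                            # plain tensor weights
torch.save(state_dict, "ViT-B-16.state_dict.pt")               # checkpoint loadable by an eager-mode model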
For the RoBERTa weights, unzip the downloaded zip file and place pytorch_model.bin under ${DATAPATH}.
Prepare the Transformed Images
The images need to be transformed before they are fed into the CLIP model. Since transforming them on the fly during training and inference is slow, we perform the image transformation ahead of the experiment.
python src/preprocess/transform_images.py \
--data_dir ${DATAPATH} \
--image_resolution 224
The transformed image dataset takes around 100 GB of disk space.
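For a rough idea of what the preprocessing involves, the sketch below applies a standard CLIP-style transform (resize, center crop, tensor conversion, normalization with the CLIP statistics) to a single image. File names, the .npz key, and the exact transform details are illustrative only:
# Illustrative CLIP-style preprocessing at resolution 224; the actual
# transform_images.py and its .npz layout may differ.
import numpy as np
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),   # CLIP image mean
                         (0.26862954, 0.26130258, 0.27577711)), # CLIP image std
])

img = preprocess(Image.open("example.jpg").convert("RGB"))      # (3, 224, 224) tensor
np.savez("example_imgs.224.npz", example=img.numpy())           # hypothetical key / file name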
Training
export PYTHONPATH="$PYTHONPATH:$PWD/src"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -u src/training/main.py \
--save-frequency 1 \
--train-data="${DATAPATH}/train_queries.jsonl" \
--train-img="${DATAPATH}/train_imgs.224.npz" \
--val-data="${DATAPATH}/valid_queries.jsonl" \
--val-img="${DATAPATH}/valid_imgs.224.npz" \
--clip-weight-path="${DATAPATH}/ViT-B-16.state_dict.pt" \
--bert-weight-path="${DATAPATH}/pytorch_model.bin" \
--warmup 500 \
--batch-size=32 \
--lr=8e-5 \
--wd=0.001 \
--epochs=10 \
--model ViT-B-16
Training takes a few hours. The log and checkpoint files will be saved under the logs directory.
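Under the hood, training optimizes a CLIP-style bidirectional contrastive (InfoNCE) loss between query and image features. A minimal sketch of this objective, assuming already-encoded feature batches (the actual loss and temperature handling live in the training code and may differ in detail):
# Minimal sketch of a CLIP-style contrastive objective (illustrative).
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, logit_scale):
    # L2-normalize both modalities so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix, scaled by a learned temperature.
    logits_per_image = logit_scale * image_feats @ text_feats.t()
    logits_per_text = logits_per_image.t()

    # Matched image-text pairs lie on the diagonal.
    labels = torch.arange(image_feats.size(0), device=image_feats.device)
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2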
Inference and Evaluation
Run the following command to compute image and query features using the trained CLIP model:
# only supports single-GPU inference
export CUDA_VISIBLE_DEVICES=0
python -u src/eval/extract_features.py \
--extract-image-feats \
--extract-text-feats \
--image-data="${DATAPATH}/test_imgs.224.npz" \
--text-data="${DATAPATH}/test_queries.jsonl" \
--img-batch-size=32 \
--text-batch-size=32 \
--resume="logs/${experiment_name}/checkpoints/epoch_5.pt" \
--model ViT-B-16
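This produces one JSONL file of image features and one of query features. Each line typically holds one item's id and feature vector; the exact schema is defined by extract_features.py, so the field names in the sketch below are hypothetical:
# Illustrative writer for a feature JSONL file; the real field names and
# id keys used by extract_features.py may differ.
import json

def dump_features(ids, feats, path):
    """ids: list of image/query ids; feats: list of float feature vectors."""
    with open(path, "w", encoding="utf-8") as f:
        for item_id, feat in zip(ids, feats):
            record = {"id": item_id, "feature": list(feat)}  # hypothetical keys
            f.write(json.dumps(record, ensure_ascii=False) + "\n")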
After obtaining the test features, run the following command to perform a kNN search and generate the top-10 prediction JSONL file:
python -u src/eval/make_topk_predictions.py \
--image-feats="${DATAPATH}/test_imgs.224.img_feat.jsonl" \
--text-feats="${DATAPATH}/test_queries.txt_feat.jsonl" \
--top-k=10 \
--eval-batch-size=32768 \
--output="${DATAPATH}/test_predictions.jsonl"
The JSONL file can be submitted to the MUGE challenge site. The evaluated model is expected to reach a mean recall of around 50. We believe the baseline can easily be tuned and improved to achieve a much better score :)
We also provide an evaluation script to compute the model's mean recall on the validation set. Run the following command:
python src/eval/evaluation.py valid_predictions.jsonl valid_queries.jsonl output.json
The score will be saved in output.json. The script is identical to the one used by the MUGE evaluation server.
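Mean recall here is the average of Recall@1, Recall@5, and Recall@10 over all queries. A minimal sketch of the metric, assuming each query's ground-truth item ids and a ranked prediction list are available (the official evaluation.py additionally parses the JSONL inputs and may define recall over multiple ground truths slightly differently):
# Illustrative mean-recall computation: average of R@1, R@5 and R@10.
def mean_recall(ground_truths, predictions, ks=(1, 5, 10)):
    """ground_truths: list of sets of relevant item ids, one set per query.
    predictions: list of ranked item-id lists, one list per query."""
    recalls = []
    for k in ks:
        hits = [len(set(pred[:k]) & gt) / len(gt)
                for gt, pred in zip(ground_truths, predictions)]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / len(recalls)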
Reference
@inproceedings{M6,
author = {Junyang Lin and
Rui Men and
An Yang and
Chang Zhou and
Ming Ding and
Yichang Zhang and
Peng Wang and
Ang Wang and
Le Jiang and
Xianyan Jia and
Jie Zhang and
Jianwei Zhang and
Xu Zou and
Zhikang Li and
Xiaodong Deng and
Jie Liu and
Jinbao Xue and
Huiling Zhou and
Jianxin Ma and
Jin Yu and
Yong Li and
Wei Lin and
Jingren Zhou and
Jie Tang and
Hongxia Yang},
title = {{M6:} {A} Chinese Multimodal Pretrainer},
year = {2021},
booktitle = {Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \& Data Mining},
pages = {3251--3261},
numpages = {11},
location = {Virtual Event, Singapore},
}
@article{M6-T,
author = {An Yang and
Junyang Lin and
Rui Men and
Chang Zhou and
Le Jiang and
Xianyan Jia and
Ang Wang and
Jie Zhang and
Jiamang Wang and
Yong Li and
Di Zhang and
Wei Lin and
Lin Qu and
Jingren Zhou and
Hongxia Yang},
title = {{M6-T:} Exploring Sparse Expert Models and Beyond},
journal = {CoRR},
volume = {abs/2105.15082},
year = {2021}
}
@software{ilharco_gabriel_2021_5143773,
author = {Ilharco, Gabriel and
Wortsman, Mitchell and
Carlini, Nicholas and
Taori, Rohan and
Dave, Achal and
Shankar, Vaishaal and
Namkoong, Hongseok and
Miller, John and
Hajishirzi, Hannaneh and
Farhadi, Ali and
Schmidt, Ludwig},
title = {OpenCLIP},
month = jul,
year = 2021,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {0.1},
doi = {10.5281/zenodo.5143773},
url = {https://doi.org/10.5281/zenodo.5143773}
}
@inproceedings{Radford2021LearningTV,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
booktitle={ICML},
year={2021}
}