Image-retrieval-baseline - MUGE Multimodal Retrieval Baseline

Overview

This repo is implemented based on the open_clip project, with modifications to adapt it to the Chinese Multimodal Retrieval task.

Requirements and Installation

This repo has been successfully tested in the following environment:

  • python == 3.6.4
  • pytorch == 1.7.1
  • CUDA Version == 10.2

To install the requirements, run the following command:

pip install -r requirements.txt

For other CUDA versions (9.2, 10.1, 11.0), please refer to this guide on the official PyTorch website and edit requirements.txt to install compatible versions of torch and torchvision.

Getting Started

Assume the downloaded dataset and pretrained weights are placed under the directory ${DATAPATH}. The following experiments are performed on a single server with 8 V100-16G GPUs.

Prepare CLIP and BERT Weights

In this repo, we build a CLIP model and use pretrained OpenAI ViT-B-16 (download) and Chinese RoBERTa (from ymcui's project, download) weights to initialize the image side and text side, respectively.

For the ViT-B-16 weights, run the following command to convert the checkpoint from a JIT model to a state_dict:

python src/preprocess/transform_openai_pretrain_weights.py \
    --raw-ckpt-path ${DATAPATH}/ViT-B-16.pt \
    --new-ckpt-path ${DATAPATH}/ViT-B-16.state_dict.pt
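
For reference, the conversion essentially loads the TorchScript (JIT) checkpoint and re-saves its parameters as a plain state_dict. Below is a minimal sketch of the idea, not the repo's exact script; the file paths are placeholders:

import torch

# Load the TorchScript model released by OpenAI and keep it on CPU.
jit_model = torch.jit.load("ViT-B-16.pt", map_location="cpu")
# Extract the plain parameter tensors and save them as a regular checkpoint.
torch.save(jit_model.state_dict(), "ViT-B-16.state_dict.pt")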

For the RoBERTa weights, unzip the downloaded zip file and place pytorch_model.bin under ${DATAPATH}.

Prepare the Transformed Images

The images need to be transformed before being fed into the CLIP model. Since online transformation during training and inference is slow, we transform the images ahead of time:

python src/preprocess/transform_images.py \
    --data_dir ${DATAPATH} \
    --image_resolution 224

The transformed image dataset takes around 100 GB of disk space.
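
For orientation, the offline transformation can be thought of as resizing every image to the target resolution and storing the resulting arrays in a single .npz file keyed by image id; the exact keys and encoding are defined by transform_images.py, so the following is only a hedged sketch:

import numpy as np
from PIL import Image

def transform_images(id_to_path, resolution=224):
    # Resize each image to resolution x resolution RGB and collect the raw arrays.
    arrays = {}
    for img_id, path in id_to_path.items():
        img = Image.open(path).convert("RGB").resize((resolution, resolution))
        arrays[str(img_id)] = np.asarray(img, dtype=np.uint8)
    return arrays

# Example usage: np.savez_compressed("train_imgs.224.npz", **transform_images(id_to_path))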

Training

export PYTHONPATH="$PYTHONPATH:$PWD/src"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -u src/training/main.py \
    --save-frequency 1 \
    --train-data="${DATAPATH}/train_queries.jsonl"  \
    --train-img="${DATAPATH}/train_imgs.224.npz"  \
    --val-data="${DATAPATH}/valid_queries.jsonl"  \
    --val-img="${DATAPATH}/valid_imgs.224.npz"  \
    --clip-weight-path="${DATAPATH}/ViT-B-16.state_dict.pt" \
    --bert-weight-path="${DATAPATH}/pytorch_model.bin" \
    --warmup 500 \
    --batch-size=32 \
    --lr=8e-5 \
    --wd=0.001 \
    --epochs=10 \
    --model ViT-B-16

Training takes a few hours. Logs and checkpoint files are saved under the logs directory.
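
For context, training optimizes the standard CLIP-style symmetric contrastive loss between matched query-image pairs (the repo's get_loss additionally gathers features across GPUs to enlarge the effective contrastive batch). Below is a minimal single-GPU sketch of the objective, not the exact implementation:

import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, logit_scale):
    # Both feature tensors are assumed L2-normalized, shape [batch, dim].
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    # The i-th image matches the i-th text.
    labels = torch.arange(image_features.size(0), device=image_features.device)
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2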

Inference and Evaluation

Run the following command to compute image and query features using the trained CLIP model:

# only supports single-GPU inference
export CUDA_VISIBLE_DEVICES=0

python -u src/eval/extract_features.py \
    --extract-image-feats \
    --extract-text-feats \
    --image-data="${DATAPATH}/test_imgs.224.npz" \
    --text-data="${DATAPATH}/test_queries.jsonl" \
    --img-batch-size=32 \
    --text-batch-size=32 \
    --resume="logs/${experiment_name}/checkpoints/epoch_5.pt" \
    --model ViT-B-16
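
Conceptually, the extraction step encodes every image and query with the trained two-tower model and L2-normalizes the embeddings so that dot products become cosine similarities. A hedged sketch of the per-batch computation (model construction, data loading, and checkpoint loading are omitted; encode_image/encode_text follow the CLIP-style interface and may differ from the repo's exact code):

import torch

@torch.no_grad()
def encode_batch(model, image_batch, text_batch):
    image_feats = model.encode_image(image_batch)   # [B, dim]
    text_feats = model.encode_text(text_batch)      # [B, dim]
    # Normalize so that retrieval can use plain dot products.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return image_feats, text_feats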

After obtaining the test features, run the following command to perform kNN search and generate the top-10 prediction jsonl file:

python -u src/eval/make_topk_predictions.py \
    --image-feats="${DATAPATH}/test_imgs.224.img_feat.jsonl" \
    --text-feats="${DATAPATH}/test_queries.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/test_predictions.jsonl"
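
The kNN step itself is a similarity search: each query feature is scored against all image features and the ids of the 10 highest-scoring images are kept. A small in-memory sketch of the idea (the actual script batches this and reads the features from the jsonl files):

import torch

def topk_predictions(text_feats, image_feats, image_ids, k=10):
    # text_feats: [num_queries, dim], image_feats: [num_images, dim], both L2-normalized.
    sims = text_feats @ image_feats.t()             # cosine similarities
    topk = sims.topk(k, dim=1).indices              # [num_queries, k]
    return [[image_ids[j] for j in row.tolist()] for row in topk]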

The jsonl file can be submitted to the MUGE challenge site. The evaluated model is expected to achieve a mean recall of around 50. We believe the baseline can be easily tuned and improved to reach a much better score :)

We also provide an evaluation script to compute the model's mean recall on the validation set. Run the following command:

python src/eval/evaluation.py valid_predictions.jsonl valid_queries.jsonl output.json

The scores will be saved to output.json. The script is the same as the one used by the MUGE evaluation server.
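
Mean recall here is assumed to be the average of Recall@1, Recall@5, and Recall@10 over all queries; please refer to the official script for the authoritative definition. A hedged sketch of that metric:

def mean_recall(predictions, ground_truth, ks=(1, 5, 10)):
    # predictions: {query_id: ranked list of image ids}
    # ground_truth: {query_id: set of relevant image ids}
    recalls = []
    for k in ks:
        total = 0.0
        for qid, ranked in predictions.items():
            relevant = ground_truth[qid]
            total += len(set(ranked[:k]) & relevant) / max(len(relevant), 1)
        recalls.append(total / len(predictions))
    return sum(recalls) / len(recalls)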

Reference

@inproceedings{M6,
  author    = {Junyang Lin and
               Rui Men and
               An Yang and
               Chang Zhou and
               Ming Ding and
               Yichang Zhang and
               Peng Wang and
               Ang Wang and
               Le Jiang and
               Xianyan Jia and
               Jie Zhang and
               Jianwei Zhang and
               Xu Zou and
               Zhikang Li and
               Xiaodong Deng and
               Jie Liu and
               Jinbao Xue and
               Huiling Zhou and
               Jianxin Ma and
               Jin Yu and
               Yong Li and
               Wei Lin and
               Jingren Zhou and
               Jie Tang and
               Hongxia Yang},
  title     = {{M6:} {A} Chinese Multimodal Pretrainer},
  year      = {2021},
  booktitle = {Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery {\&} Data Mining},
  pages     = {3251--3261},
  numpages  = {11},
  location  = {Virtual Event, Singapore},
}

@article{M6-T,
  author    = {An Yang and
               Junyang Lin and
               Rui Men and
               Chang Zhou and
               Le Jiang and
               Xianyan Jia and
               Ang Wang and
               Jie Zhang and
               Jiamang Wang and
               Yong Li and
               Di Zhang and
               Wei Lin and
               Lin Qu and
               Jingren Zhou and
               Hongxia Yang},
  title     = {{M6-T:} Exploring Sparse Expert Models and Beyond},
  journal   = {CoRR},
  volume    = {abs/2105.15082},
  year      = {2021}
}

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}
Comments
  • Train loss rises after falling in the first epoch

    Hello, I trained with the same configuration as the README (8 GPUs, batch size 32 per GPU), and the train loss starts to rise in the second half of epoch 0. What could be the problem?

    The log is here: https://drive.google.com/file/d/1ozN-kRNY7Ujvwt5cH7nwMNnSpXqx1jlc/view?usp=sharing

    opened by liangzimei 2
  • A question about get_loss

    Hello, and thanks for open-sourcing such nice work! I have a tentative question and look forward to your reply:

    In the get_loss function, each process gathers its own image-text features together with the image-text features from all other processes before computing the loss. This makes the gradients computed independently by each process exactly the same, which seems unnecessary to me (I would appreciate a clarification). My thought is: why not let each process compute the loss only between its own image features and text features (i.e. maximize their similarity)? Since DDP aggregates and averages the gradients across all processes during loss.backward, that already guarantees every process updates its parameters with the same gradients.

    opened by Alex20172023 0
  • A question about the get_loss function

    I understand that the code block below gathers the features from different GPUs and then computes the contrastive loss, but why not use gathered_image_features and gathered_text_features directly?

    gathered_image_features = [
        torch.zeros_like(image_features) for _ in range(world_size)
    ]
    gathered_text_features = [
        torch.zeros_like(text_features) for _ in range(world_size)
    ]
    dist.all_gather(gathered_image_features, image_features)
    dist.all_gather(gathered_text_features, text_features)
    all_image_features = torch.cat(
        [image_features]
        + gathered_image_features[:rank]
        + gathered_image_features[rank + 1 :]
    )
    all_text_features = torch.cat(
        [text_features]
        + gathered_text_features[:rank]
        + gathered_text_features[rank + 1 :]
    )

    opened by huangyangke 0
  • Validation does not converge when fine-tuning chinese-clip

    Hello, I found a chinese-clip model whose code architecture is similar to this one, and tried to train on MUGE with it. The training loss decreases normally and the training accuracy improves, but the validation accuracy never changes. I used a batch size of 128 on a single GPU with lr 5e-5, and no matter how I adjusted the lr, the result barely changed. Have you encountered a similar problem?

    opened by ronghonghui 1