HCQ: Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval


1. Introduction

This repository provides the code for our paper at TheWebConf 2022:

Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval. Jinpeng Wang, Bin Chen, Dongliang Liao, Ziyun Zeng, Gongfu Li, Shu-Tao Xia, Jin Xu. [arXiv].

Our proposed Hybrid Contrastive Quantization (HCQ) is the first quantization learning method for cross-view (e.g., text-to-video) retrieval, which learns both coarse-grained and fine-grained quantizations with transformers. Experiments on MSRVTT, LSMDC and ActivityNet Captions datasets demonstrate that it can achieve competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation.

The following sections explain how to use this repository step by step. 🤗

2. Preparation

git clone https://github.com/gimpong/WWW22-HCQ.git

2.1 Requirements

python 3.7.4
gensim 4.1.2
h5py 3.6.0
numpy 1.17.3
pandas 1.2.3
pytorch-warmup 0.0.4
scikit-learn 0.23.0
scipy 1.6.1
tensorboardX 2.4.1
torch 1.6.0+cu101
transformers 3.1.0

cd WWW22-HCQ
# Install the requirements
pip install -r requirements.txt

We conduct each training experiment on a single NVIDIA® Tesla® V100 GPU (32 GB).

2.2 Download the features

Before running the code, we need to download the datasets and arrange them properly in the "data" directory. We use the video features provided by the authors of MMT (Multi-Modal Transformer). These features can be downloaded by running the following commands:

# Create and move to the WWW22-HCQ/data directory
mkdir -p data && cd data
# Download the video features
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/MSRVTT.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/activity-net.tar.gz
wget http://pascal.inrialpes.fr/data2/vgabeur/video-features/LSMDC.tar.gz
# Extract the video features
tar -xvf MSRVTT.tar.gz
tar -xvf activity-net.tar.gz
tar -xvf LSMDC.tar.gz
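
Before extracting, it can help to confirm that all three archives were actually downloaded. The following is a minimal sketch, not part of the repository: the helper name `missing_archives` is hypothetical, and the only facts it relies on are the archive file names from the wget commands above.

```python
from pathlib import Path
from typing import List

# Archive names taken from the wget commands above.
ARCHIVES = ["MSRVTT.tar.gz", "activity-net.tar.gz", "LSMDC.tar.gz"]


def missing_archives(data_dir: str) -> List[str]:
    """Return the archive names that are not present in data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in ARCHIVES if not (root / name).is_file()]


# Usage (from WWW22-HCQ/): an empty list means every archive is in place.
# print(missing_archives("data"))
```

The check only looks at file presence; it does not verify that a download completed without truncation.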

3. Training and Evaluation

3.1 Training from scratch

Let us take "training HCQ on MSRVTT dataset ('1k-A' split)" as an example:

# working directory: WWW22-HCQ/
python -m train --config configs/HCQ_MSRVTT_1kA.json

Expected results:

MSRVTT_jsfusion_test:
 t2v_metrics/R1/final_eval: 25.9
 t2v_metrics/R5/final_eval: 54.8
 t2v_metrics/R10/final_eval: 69.0
 t2v_metrics/R50/final_eval: 88.8
 t2v_metrics/MedR/final_eval: 5.0
 t2v_metrics/MeanR/final_eval: 28.062
 t2v_metrics/geometric_mean_R1-R5-R10/final_eval: 46.09386629981193
 v2t_metrics/R1/final_eval: 26.3
 v2t_metrics/R5/final_eval: 57.0
 v2t_metrics/R10/final_eval: 70.1
 v2t_metrics/R50/final_eval: 90.0
 v2t_metrics/MedR/final_eval: 4.0
 v2t_metrics/MeanR/final_eval: 25.1535
 v2t_metrics/geometric_mean_R1-R5-R10/final_eval: 47.18995255588879
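
The geometric-mean entries above can be recomputed from the three recall values, which is a quick sanity check when comparing your own run against the expected results. The sketch below is illustrative only: `parse_metrics` and `geometric_mean_recall` are hypothetical helpers, and the assumption that result lines follow the `name: value` layout shown above applies to this excerpt, not necessarily to the full log files.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse 'name: value' lines into a dict; lines without a numeric value are skipped."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(": ")
        if not sep:
            continue  # e.g. the 'MSRVTT_jsfusion_test:' header line
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics


def geometric_mean_recall(r1: float, r5: float, r10: float) -> float:
    """Geometric mean of Recall@1, Recall@5 and Recall@10."""
    return (r1 * r5 * r10) ** (1.0 / 3.0)


sample = """\
MSRVTT_jsfusion_test:
 t2v_metrics/R1/final_eval: 25.9
 t2v_metrics/R5/final_eval: 54.8
 t2v_metrics/R10/final_eval: 69.0
"""

metrics = parse_metrics(sample)
gm = geometric_mean_recall(
    metrics["t2v_metrics/R1/final_eval"],
    metrics["t2v_metrics/R5/final_eval"],
    metrics["t2v_metrics/R10/final_eval"],
)
print(round(gm, 2))  # 46.09
```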

After training, a folder with the same name as the configuration json file (e.g., "HCQ_MSRVTT_1kA") will be generated under WWW22-HCQ/exps/, which contains the model checkpoints, logs, tensorboard files, and so on.

To reproduce the other experiments, see the following tables and replace the config json path in the training command with the one you need.
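
When reproducing several rows in sequence, the training command can be generated from the config names instead of being retyped. This is a minimal sketch under stated assumptions: `train_command` and `run_experiments` are hypothetical helpers that are not part of the repository, and the only command-line form used is the `python -m train --config ...` call shown earlier. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from typing import Iterable, List


def train_command(config_name: str) -> List[str]:
    """Build the training command for one config (name given without '.json')."""
    return [sys.executable, "-m", "train",
            "--config", "configs/{}.json".format(config_name)]


def run_experiments(config_names: Iterable[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, run them one after another."""
    for name in config_names:
        cmd = train_command(name)
        print(" ".join(cmd))
        if not dry_run:
            # Must be launched from the WWW22-HCQ/ working directory.
            subprocess.run(cmd, check=True)


# Dry run: only prints the two commands.
run_experiments(["HCQ_MSRVTT_1kA", "HCQ_LSMDC"])
```

Pass `dry_run=False` to actually launch the runs; `check=True` stops the loop at the first failing experiment.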

3.1.1 Main results of HCQ (reported in Tables 1-3 in our paper)

Model	Dataset (+split)	Config json	Log	Text-to-Video Recall@1	Text-to-Video Recall@5	Text-to-Video Recall@10	Text-to-Video Recall@50	Text-to-Video median rank	Text-to-Video mean rank	Text-to-Video geometric mean of Recall@{1,5,10}	Video-to-Text Recall@1	Video-to-Text Recall@5	Video-to-Text Recall@10	Video-to-Text Recall@50	Video-to-Text median rank	Video-to-Text mean rank	Video-to-Text geometric mean of Recall@{1,5,10}
HCQ	MSRVTT (1k-A)	HCQ_MSRVTT_1kA.json	HCQ_MSRVTT_1kA.txt	25.90	54.80	69.00	88.80	5	28.06	46.09	26.30	57.00	70.10	90.00	4	25.15	47.19
	MSRVTT (1k-B)	HCQ_MSRVTT_1kB.json	HCQ_MSRVTT_1kB.txt	22.50	51.50	65.90	86.10	5	33.65	42.43	23.70	52.20	66.90	88.10	5	29.30	43.58
	MSRVTT (Full)	HCQ_MSRVTT_full.json	HCQ_MSRVTT_full.txt	15.15	38.53	51.00	81.34	10	46.22	30.99	18.26	44.88	59.06	87.16	7	30.96	36.45
	LSMDC	HCQ_LSMDC.json	HCQ_LSMDC.txt	14.50	33.60	43.10	68.20	18.5	75.95	27.59	13.70	33.20	42.80	66.10	17	74.28	26.90
	ActivityNet Captions	HCQ_ActivityNet.json	HCQ_ActivityNet.txt	22.19	53.69	70.12	91.21	5	30.71	43.72	23.00	54.85	70.14	91.38	5	29.08	44.56

3.1.2 Results of Hybrid Contrastive Transformer (HCT), Dual Transformer (DT) + DCMH, and DT + JPQ (reported in Table 4 in our paper)

Model	Dataset (+split)	Config json	Log	Text-to-Video Recall@1	Text-to-Video Recall@5	Text-to-Video Recall@10	Text-to-Video Recall@50	Text-to-Video median rank	Text-to-Video mean rank	Text-to-Video geometric mean of Recall@{1,5,10}	Video-to-Text Recall@1	Video-to-Text Recall@5	Video-to-Text Recall@10	Video-to-Text Recall@50	Video-to-Text median rank	Video-to-Text mean rank	Video-to-Text geometric mean of Recall@{1,5,10}
HCT	MSRVTT (1k-A)	HCT_MSRVTT_1kA.json	HCT_MSRVTT_1kA.txt	27.80	58.00	70.00	89.50	4	26.79	48.33	27.30	57.80	72.10	90.60	4	24.38	48.46
	MSRVTT (1k-B)	HCT_MSRVTT_1kB.json	HCT_MSRVTT_1kB.txt	25.70	53.70	67.30	88.30	5	31.09	45.29	24.70	55.50	68.70	88.80	4	25.54	45.50
	MSRVTT (Full)	HCT_MSRVTT_full.json	HCT_MSRVTT_full.txt	16.76	41.87	55.79	82.44	8	44.33	33.95	21.64	50.57	63.88	87.66	5	29.56	41.19
	LSMDC	HCT_LSMDC.json	HCT_LSMDC.txt	16.40	34.10	43.10	69.10	17	72.39	28.89	14.10	33.70	41.40	67.40	18	73.54	26.99
	ActivityNet Captions	HCT_ActivityNet.json	HCT_ActivityNet.txt	23.12	54.95	71.14	92.64	5	24.82	44.88	22.94	55.81	70.84	92.29	4	25.35	44.93
DT+DCMH	MSRVTT (1k-A)	DCMH_MSRVTT_1kA.json	DCMH_MSRVTT_1kA.txt	19.00	48.40	62.20	85.30	6	32.40	38.53	20.00	50.20	63.30	84.90	5.5	31.69	39.91
	MSRVTT (1k-B)	DCMH_MSRVTT_1kB.json	DCMH_MSRVTT_1kB.txt	15.80	41.30	57.70	83.30	8	40.42	33.52	16.60	44.10	58.10	84.10	7	37.17	34.91
	MSRVTT (Full)	DCMH_MSRVTT_full.json	DCMH_MSRVTT_full.txt	8.46	28.16	41.51	73.48	15.75	67.90	21.46	9.57	31.30	46.62	78.13	12	55.30	24.08
	LSMDC	DCMH_LSMDC.json	DCMH_LSMDC.txt	10.00	25.80	36.00	66.30	22	75.84	21.02	9.60	25.80	36.40	65.40	22.75	78.37	20.81
	ActivityNet Captions	DCMH_ActivityNet.json	DCMH_ActivityNet.txt	12.34	38.40	55.62	84.62	8.5	63.41	29.76	12.45	39.19	55.52	84.58	8.5	65.43	30.03
DT+JPQ	MSRVTT (1k-A)	JPQ_MSRVTT_1kA.json	JPQ_MSRVTT_1kA.txt	18.90	46.80	60.80	87.90	6	29.12	37.75	18.20	47.40	63.20	87.80	6	26.63	37.92
	MSRVTT (1k-B)	JPQ_MSRVTT_1kB.json	JPQ_MSRVTT_1kB.txt	14.90	42.50	57.70	86.90	7	33.05	33.18	15.30	43.50	59.10	88.30	7	27.79	34.01
	MSRVTT (Full)	JPQ_MSRVTT_full.json	JPQ_MSRVTT_full.txt	9.30	30.00	43.44	77.49	14	50.00	22.97	11.44	36.29	51.30	82.84	10	37.00	27.72
	LSMDC	JPQ_LSMDC.json	JPQ_LSMDC.txt	9.50	23.40	34.30	63.10	25	80.27	19.68	7.80	22.80	32.80	62.50	27	79.98	18.00
	ActivityNet Captions	JPQ_ActivityNet.json	JPQ_ActivityNet.txt	17.10	46.43	62.38	90.05	6	28.09	36.73	17.67	46.88	62.94	90.14	6	28.21	37.36

3.1.3 Results of HCQ under different hyper-parameters (reported in Figure 6 in our paper)

Experimental subject	Dataset (+split)	Setting	Config json	Log	Text-to-Video Recall@1	Text-to-Video Recall@5	Text-to-Video Recall@10	Text-to-Video Recall@50	Text-to-Video median rank	Text-to-Video mean rank	Text-to-Video geometric mean of Recall@{1,5,10}	Video-to-Text Recall@1	Video-to-Text Recall@5	Video-to-Text Recall@10	Video-to-Text Recall@50	Video-to-Text median rank	Video-to-Text mean rank	Video-to-Text geometric mean of Recall@{1,5,10}
L: the number of active cluster(s) in GhostVLAD	MSRVTT (1k-A)	1	HCQ_MSRVTT_1kA_L1.json	HCQ_MSRVTT_1kA_L1.txt	25.10	54.10	67.30	89.10	5	28.21	45.04	22.70	55.10	67.90	89.90	4	25.35	43.96
		3	HCQ_MSRVTT_1kA_L3.json	HCQ_MSRVTT_1kA_L3.txt	25.70	52.90	66.90	89.30	5	28.39	44.97	26.70	55.00	68.50	90.50	4	24.20	46.51
		7 (default)	HCQ_MSRVTT_1kA.json	HCQ_MSRVTT_1kA.txt	25.90	54.80	69.00	88.80	5	28.06	46.09	26.30	57.00	70.10	90.00	4	25.15	47.19
		15	HCQ_MSRVTT_1kA_L15.json	HCQ_MSRVTT_1kA_L15.txt	24.20	54.40	68.10	88.70	5	27.15	44.76	23.60	55.00	69.40	90.60	4	22.79	44.83
		31	HCQ_MSRVTT_1kA_L31.json	HCQ_MSRVTT_1kA_L31.txt	26.20	54.50	67.90	88.00	5	27.57	45.94	25.00	55.60	69.10	90.00	4	24.38	45.80
	MSRVTT (1k-B)	1	HCQ_MSRVTT_1kB_L1.json	HCQ_MSRVTT_1kB_L1.txt	22.40	51.70	64.10	87.50	5	30.79	42.03	21.90	52.50	65.90	88.10	5	27.49	42.32
		3	HCQ_MSRVTT_1kB_L3.json	HCQ_MSRVTT_1kB_L3.txt	23.10	50.60	65.40	87.90	5	31.43	42.44	22.90	51.70	66.50	88.30	5	26.82	42.86
		7 (default)	HCQ_MSRVTT_1kB.json	HCQ_MSRVTT_1kB.txt	22.50	51.50	65.90	86.10	5	33.65	42.43	23.70	52.20	66.90	88.10	5	29.30	43.58
		15	HCQ_MSRVTT_1kB_L15.json	HCQ_MSRVTT_1kB_L15.txt	22.20	51.50	64.30	87.20	5	30.98	41.89	22.00	52.40	65.50	87.90	5	26.35	42.27
		31	HCQ_MSRVTT_1kB_L31.json	HCQ_MSRVTT_1kB_L31.txt	23.30	50.40	64.30	86.80	5	34.97	42.27	22.70	53.50	65.20	88.10	5	29.55	42.94
	MSRVTT (Full)	1	HCQ_MSRVTT_full_L1.json	HCQ_MSRVTT_full_L1.txt	14.31	38.63	52.24	80.94	10	44.35	30.68	17.32	44.98	59.60	86.89	7	31.44	35.95
		3	HCQ_MSRVTT_full_L3.json	HCQ_MSRVTT_full_L3.txt	14.45	39.16	51.84	80.80	10	45.37	30.84	17.56	46.19	60.37	86.82	6	31.24	36.58
		7 (default)	HCQ_MSRVTT_full.json	HCQ_MSRVTT_full.txt	15.15	38.53	51.00	81.34	10	46.22	30.99	18.26	44.88	59.06	87.16	7	30.96	36.45
		15	HCQ_MSRVTT_full_L15.json	HCQ_MSRVTT_full_L15.txt	14.01	37.53	51.47	81.74	10	41.04	30.02	16.19	44.08	59.80	86.99	7	29.87	34.94
		31	HCQ_MSRVTT_full_L31.json	HCQ_MSRVTT_full_L31.txt	14.48	38.56	52.64	81.61	9	43.41	30.86	18.09	45.99	59.67	87.22	7	30.54	36.75
	LSMDC	1	HCQ_LSMDC_L1.json	HCQ_LSMDC_L1.txt	14.40	31.50	42.50	68.50	17	73.09	26.81	13.00	30.60	40.50	68.10	19	71.16	25.26
		3	HCQ_LSMDC_L3.json	HCQ_LSMDC_L3.txt	14.00	33.80	44.10	68.30	17	73.91	27.53	12.90	32.80	42.80	68.50	17	71.74	26.26
		7 (default)	HCQ_LSMDC.json	HCQ_LSMDC.txt	14.50	33.60	43.10	68.20	18.5	75.95	27.59	13.70	33.20	42.80	66.10	17	74.28	26.90
		15	HCQ_LSMDC_L15.json	HCQ_LSMDC_L15.txt	14.10	32.60	41.90	69.80	17	71.28	26.81	13.10	31.40	40.70	68.30	18	71.21	25.58
		31	HCQ_LSMDC_L31.json	HCQ_LSMDC_L31.txt	12.80	31.90	41.90	68.30	17	72.03	25.77	12.50	32.20	42.00	67.20	17	72.26	25.66
	ActivityNet Captions	1	HCQ_ActivityNet_L1.json	HCQ_ActivityNet_L1.txt	19.77	50.54	65.77	89.06	5	33.26	40.35	20.03	51.33	66.36	89.40	5	32.14	40.86
		3	HCQ_ActivityNet_L3.json	HCQ_ActivityNet_L3.txt	20.95	52.21	68.35	90.54	5	30.22	42.13	20.72	53.10	68.70	90.50	5	29.18	42.28
		7 (default)	HCQ_ActivityNet.json	HCQ_ActivityNet.txt	22.19	53.69	70.12	91.21	5	30.71	43.72	23.00	54.85	70.14	91.38	5	29.08	44.56
		15	HCQ_ActivityNet_L15.json	HCQ_ActivityNet_L15.txt	21.33	52.15	68.07	90.16	5	30.00	42.31	22.07	52.92	68.31	90.46	5	29.26	43.05
		31	HCQ_ActivityNet_L31.json	HCQ_ActivityNet_L31.txt	20.56	52.45	69.07	89.91	5	31.39	42.07	21.66	52.96	68.60	90.81	5	29.67	42.85
M: the number of sub-codebooks in each quantization module	MSRVTT (1k-A)	8	HCQ_MSRVTT_1kA_M8.json	HCQ_MSRVTT_1kA_M8.txt	23.00	52.00	65.00	87.00	5	32.93	42.68	21.40	52.40	65.50	88.20	5	30.19	41.88
		16	HCQ_MSRVTT_1kA_M16.json	HCQ_MSRVTT_1kA_M16.txt	23.40	53.40	68.10	88.00	5	30.89	43.98	23.00	55.30	68.60	89.60	4	26.62	44.35
		32 (default)	HCQ_MSRVTT_1kA.json	HCQ_MSRVTT_1kA.txt	25.90	54.80	69.00	88.80	5	28.06	46.09	26.30	57.00	70.10	90.00	4	25.15	47.19
		64	HCQ_MSRVTT_1kA_M64.json	HCQ_MSRVTT_1kA_M64.txt	27.20	56.80	69.10	89.30	4	26.93	47.44	26.10	58.10	71.40	90.70	4	23.82	47.66
	MSRVTT (1k-B)	8	HCQ_MSRVTT_1kB_M8.json	HCQ_MSRVTT_1kB_M8.txt	20.10	47.00	60.60	84.10	6.75	37.97	38.54	18.90	47.90	63.10	86.40	6	36.00	38.51
		16	HCQ_MSRVTT_1kB_M16.json	HCQ_MSRVTT_1kB_M16.txt	22.50	49.50	62.70	85.90	6	33.82	41.18	21.10	52.10	65.60	87.10	5	32.43	41.62
		32 (default)	HCQ_MSRVTT_1kB.json	HCQ_MSRVTT_1kB.txt	22.50	51.50	65.90	86.10	5	33.65	42.43	23.70	52.20	66.90	88.10	5	29.30	43.58
		64	HCQ_MSRVTT_1kB_M64.json	HCQ_MSRVTT_1kB_M64.txt	24.50	51.60	66.20	87.70	5	31.31	43.74	23.60	54.30	67.40	88.80	4.75	27.56	44.20
	MSRVTT (Full)	8	HCQ_MSRVTT_full_M8.json	HCQ_MSRVTT_full_M8.txt	11.61	33.44	46.86	75.82	12	62.06	26.30	11.91	36.99	51.77	82.31	10	44.63	28.36
		16	HCQ_MSRVTT_full_M16.json	HCQ_MSRVTT_full_M16.txt	12.81	36.45	50.17	79.06	10	52.58	28.61	14.55	41.07	55.85	84.75	8	37.39	32.20
		32 (default)	HCQ_MSRVTT_full.json	HCQ_MSRVTT_full.txt	15.15	38.53	51.00	81.34	10	46.22	30.99	18.26	44.88	59.06	87.16	7	30.96	36.45
		64	HCQ_MSRVTT_full_M64.json	HCQ_MSRVTT_full_M64.txt	16.02	40.97	54.25	83.01	8	40.48	32.90	19.16	48.26	62.94	88.70	6	26.65	38.76
	LSMDC	8	HCQ_LSMDC_M8.json	HCQ_LSMDC_M8.txt	12.60	29.00	38.60	64.30	22	84.53	24.16	10.40	29.20	39.10	64.20	21	78.32	22.81
		16	HCQ_LSMDC_M16.json	HCQ_LSMDC_M16.txt	13.20	31.10	39.40	66.50	19	79.15	25.29	12.70	31.60	39.90	65.30	21	77.42	25.21
		32 (default)	HCQ_LSMDC.json	HCQ_LSMDC.txt	14.50	33.60	43.10	68.20	18.5	75.95	27.59	13.70	33.20	42.80	66.10	17	74.28	26.90
		64	HCQ_LSMDC_M64.json	HCQ_LSMDC_M64.txt	14.80	33.00	43.60	69.10	16	72.80	27.72	14.10	32.30	40.80	67.40	19	72.64	26.49
	ActivityNet Captions	8	HCQ_ActivityNet_M8.json	HCQ_ActivityNet_M8.txt	18.77	48.44	65.08	88.75	6	39.86	38.97	18.63	48.69	65.24	89.30	6	38.20	38.97
		16	HCQ_ActivityNet_M16.json	HCQ_ActivityNet_M16.txt	20.56	51.86	67.93	89.89	5	35.07	41.68	20.68	52.10	68.09	90.44	5	32.72	41.87
		32 (default)	HCQ_ActivityNet.json	HCQ_ActivityNet.txt	22.19	53.69	70.12	91.21	5	30.71	43.72	23.00	54.85	70.14	91.38	5	29.08	44.56
		64	HCQ_ActivityNet_M64.json	HCQ_ActivityNet_M64.txt	22.96	54.59	70.80	91.80	5	26.29	44.60	23.61	55.28	70.80	92.03	4	25.74	45.21
Batch size	MSRVTT (1k-A)	16	HCQ_MSRVTT_1kA_bs16.json	HCQ_MSRVTT_1kA_bs16.txt	24.20	53.40	67.40	89.90	5	25.86	44.33	23.60	54.10	67.60	89.60	4	22.96	44.19
		32	HCQ_MSRVTT_1kA_bs32.json	HCQ_MSRVTT_1kA_bs32.txt	24.20	54.00	67.20	89.90	5	27.50	44.45	24.00	54.30	66.90	90.10	4	25.09	44.34
		64	HCQ_MSRVTT_1kA_bs64.json	HCQ_MSRVTT_1kA_bs64.txt	26.20	55.90	67.90	88.70	4	26.67	46.33	25.50	55.80	69.00	89.90	4	23.37	46.13
		128 (default)	HCQ_MSRVTT_1kA.json	HCQ_MSRVTT_1kA.txt	25.90	54.80	69.00	88.80	5	28.06	46.09	26.30	57.00	70.10	90.00	4	25.15	47.19
		256	HCQ_MSRVTT_1kA_bs256.json	HCQ_MSRVTT_1kA_bs256.txt	25.50	55.30	67.50	89.20	4	26.80	45.66	26.00	55.80	68.70	90.50	4	23.47	46.36
	MSRVTT (1k-B)	16	HCQ_MSRVTT_1kB_bs16.json	HCQ_MSRVTT_1kB_bs16.txt	22.00	49.40	64.50	87.60	6	31.45	41.23	18.50	51.80	66.20	89.60	5	26.30	39.88
		32	HCQ_MSRVTT_1kB_bs32.json	HCQ_MSRVTT_1kB_bs32.txt	22.60	49.20	65.10	87.10	6	32.03	41.68	21.40	52.30	65.90	88.20	5	28.20	41.94
		64	HCQ_MSRVTT_1kB_bs64.json	HCQ_MSRVTT_1kB_bs64.txt	23.60	50.70	64.60	86.60	5	33.26	42.60	21.10	51.60	64.60	89.00	5	28.00	41.28
		128 (default)	HCQ_MSRVTT_1kB.json	HCQ_MSRVTT_1kB.txt	22.50	51.50	65.90	86.10	5	33.65	42.43	23.70	52.20	66.90	88.10	5	29.30	43.58
		256	HCQ_MSRVTT_1kB_bs256.json	HCQ_MSRVTT_1kB_bs256.txt	22.50	50.20	63.80	87.00	5	30.96	41.61	21.30	52.40	65.90	88.30	5	27.50	41.90
	MSRVTT (Full)	16	HCQ_MSRVTT_full_bs16.json	HCQ_MSRVTT_full_bs16.txt	13.08	37.96	52.91	82.04	9	41.76	29.72	15.95	42.44	57.59	86.09	8	31.76	33.91
		32	HCQ_MSRVTT_full_bs32.json	HCQ_MSRVTT_full_bs32.txt	13.75	38.39	52.37	80.80	10	45.51	30.24	16.39	44.58	58.86	86.29	7	32.54	35.04
		64	HCQ_MSRVTT_full_bs64.json	HCQ_MSRVTT_full_bs64.txt	14.65	39.20	52.98	82.27	9	44.13	31.22	17.69	46.59	61.10	87.83	6	31.56	36.93
		128 (default)	HCQ_MSRVTT_full.json	HCQ_MSRVTT_full.txt	15.15	38.53	51.00	81.34	10	46.22	30.99	18.26	44.88	59.06	87.16	7	30.96	36.45
		256	HCQ_MSRVTT_full_bs256.json	HCQ_MSRVTT_full_bs256.txt	14.21	39.06	52.47	82.81	9	40.74	30.77	16.92	46.15	59.70	87.63	7	28.24	35.99
	LSMDC	16	HCQ_LSMDC_bs16.json	HCQ_LSMDC_bs16.txt	12.30	29.70	39.40	65.30	21	82.64	24.32	10.70	28.30	38.90	65.60	23	80.80	22.75
		32	HCQ_LSMDC_bs32.json	HCQ_LSMDC_bs32.txt	12.30	30.00	38.70	66.30	20	79.95	24.26	12.10	28.70	39.10	63.50	23	80.79	23.86
		64	HCQ_LSMDC_bs64.json	HCQ_LSMDC_bs64.txt	13.40	31.90	41.00	66.20	17	75.98	25.98	13.40	31.50	40.00	66.20	20	73.14	25.65
		128 (default)	HCQ_LSMDC.json	HCQ_LSMDC.txt	14.50	33.60	43.10	68.20	18.5	75.95	27.59	13.70	33.20	42.80	66.10	17	74.28	26.90
		256	HCQ_LSMDC_bs256.json	HCQ_LSMDC_bs256.txt	14.30	34.80	43.60	69.30	16	74.04	27.89	14.30	33.50	42.50	67.70	16	71.84	27.31
	ActivityNet Captions	16	HCQ_ActivityNet_bs16.json	HCQ_ActivityNet_bs16.txt	21.31	52.55	70.59	92.19	5	27.31	42.92	22.25	53.18	70.41	92.33	5	26.57	43.68
		32 (default)	HCQ_ActivityNet.json	HCQ_ActivityNet.txt	22.19	53.69	70.12	91.21	5	30.71	43.72	23.00	54.85	70.14	91.38	5	29.08	44.56
		64	HCQ_ActivityNet_bs64.json	HCQ_ActivityNet_bs64.txt	20.62	51.60	66.91	88.94	5	33.61	41.45	20.58	51.64	67.76	89.40	5	31.52	41.61
		128	HCQ_ActivityNet_bs128.json	HCQ_ActivityNet_bs128.txt	19.36	48.61	64.86	88.41	6	35.38	39.37	19.22	49.68	66.04	89.12	6	33.15	39.80
τ: the temperature factor in contrastive learning loss (Eq.(13))	MSRVTT (1k-A)	0.03	HCQ_MSRVTT_1kA_t0.03.json	HCQ_MSRVTT_1kA_t0.03.txt	24.90	56.50	68.80	88.80	4	26.95	45.91	25.10	53.90	69.10	89.70	4	24.91	45.39
		0.05	HCQ_MSRVTT_1kA.json	HCQ_MSRVTT_1kA.txt	25.90	54.80	69.00	88.80	5	28.06	46.09	26.30	57.00	70.10	90.00	4	25.15	47.19
		0.07	HCQ_MSRVTT_1kA_t0.07.json	HCQ_MSRVTT_1kA_t0.07.txt	25.40	52.80	67.50	88.60	5	30.40	44.90	25.90	57.00	68.00	90.00	4	27.78	46.48
		0.1	HCQ_MSRVTT_1kA_t0.1.json	HCQ_MSRVTT_1kA_t0.1.txt	23.90	52.10	66.20	87.10	5	32.74	43.52	22.50	54.00	67.10	87.70	5	31.09	43.36
		0.12	HCQ_MSRVTT_1kA_t0.12.json	HCQ_MSRVTT_1kA_t0.12.txt	22.60	49.60	65.00	87.90	6	34.53	41.77	21.20	50.80	65.10	87.30	5	33.46	41.23
		0.15	HCQ_MSRVTT_1kA_t0.15.json	HCQ_MSRVTT_1kA_t0.15.txt	18.20	44.50	60.20	86.80	7	36.74	36.53	16.50	46.80	61.40	85.80	6	35.20	36.19
	MSRVTT (1k-B)	0.03	HCQ_MSRVTT_1kB_t0.03.json	HCQ_MSRVTT_1kB_t0.03.txt	23.10	51.90	63.40	88.20	5	30.89	42.36	22.90	51.70	65.60	88.10	5	25.72	42.67
		0.05	HCQ_MSRVTT_1kB.json	HCQ_MSRVTT_1kB.txt	22.50	51.50	65.90	86.10	5	33.65	42.43	23.70	52.20	66.90	88.10	5	29.30	43.58
		0.07	HCQ_MSRVTT_1kB_t0.07.json	HCQ_MSRVTT_1kB_t0.07.txt	23.90	49.90	63.50	86.70	6	34.78	42.31	22.70	52.10	65.30	87.40	5	32.91	42.59
		0.1	HCQ_MSRVTT_1kB_t0.1.json	HCQ_MSRVTT_1kB_t0.1.txt	19.90	50.70	63.80	86.80	5	35.51	40.08	19.90	50.70	65.00	87.20	5	34.81	40.33
		0.12	HCQ_MSRVTT_1kB_t0.12.json	HCQ_MSRVTT_1kB_t0.12.txt	19.00	46.30	61.00	86.40	7	35.89	37.72	18.30	48.20	61.30	86.60	6	35.56	37.81
		0.15	HCQ_MSRVTT_1kB_t0.15.json	HCQ_MSRVTT_1kB_t0.15.txt	15.60	43.20	56.70	84.50	8	40.02	33.68	14.70	44.20	57.90	85.80	7	39.38	33.51
	MSRVTT (Full)	0.03	HCQ_MSRVTT_full_t0.03.json	HCQ_MSRVTT_full_t0.03.txt	14.11	38.29	50.77	80.00	10	45.90	30.16	16.32	45.45	59.80	86.86	7	31.64	35.40
		0.05	HCQ_MSRVTT_full.json	HCQ_MSRVTT_full.txt	15.15	38.53	51.00	81.34	10	46.22	30.99	18.26	44.88	59.06	87.16	7	30.96	36.45
		0.07	HCQ_MSRVTT_full_t0.07.json	HCQ_MSRVTT_full_t0.07.txt	14.15	37.89	51.17	81.30	10	46.22	30.16	16.72	43.18	58.09	85.95	8	33.70	34.75
		0.1	HCQ_MSRVTT_full_t0.1.json	HCQ_MSRVTT_full_t0.1.txt	13.58	36.56	49.06	80.43	11	49.80	28.99	14.35	39.13	53.65	84.15	9	39.70	31.11
		0.12	HCQ_MSRVTT_full_t0.12.json	HCQ_MSRVTT_full_t0.12.txt	12.31	34.25	49.13	79.50	11	50.45	27.46	12.24	35.65	50.64	82.98	10	44.35	28.06
		0.15	HCQ_MSRVTT_full_t0.15.json	HCQ_MSRVTT_full_t0.15.txt	10.10	30.64	43.88	76.79	14	55.40	23.86	9.16	29.90	45.69	79.00	13	53.01	23.22
	LSMDC	0.03	HCQ_LSMDC_t0.03.json	HCQ_LSMDC_t0.03.txt	14.90	32.00	42.50	66.20	18	76.14	27.26	12.90	31.80	40.80	66.80	20	72.31	25.58
		0.05	HCQ_LSMDC.json	HCQ_LSMDC.txt	14.50	33.60	43.10	68.20	18.5	75.95	27.59	13.70	33.20	42.80	66.10	17	74.28	26.90
		0.07	HCQ_LSMDC_t0.07.json	HCQ_LSMDC_t0.07.txt	12.80	32.30	43.40	67.70	17	75.92	26.18	12.80	32.70	42.90	67.30	17	76.30	26.19
		0.1	HCQ_LSMDC_t0.1.json	HCQ_LSMDC_t0.1.txt	12.50	30.10	40.80	66.90	18	81.02	24.85	11.80	29.00	40.30	64.20	19	82.29	23.98
		0.12	HCQ_LSMDC_t0.12.json	HCQ_LSMDC_t0.12.txt	12.00	28.10	38.80	66.40	20	81.93	23.56	11.90	27.60	39.60	64.80	20	84.15	23.52
		0.15	HCQ_LSMDC_t0.15.json	HCQ_LSMDC_t0.15.txt	10.70	26.10	36.00	64.90	23	82.81	21.58	9.10	24.00	35.10	62.80	25	88.27	19.72
	ActivityNet Captions	0.03	HCQ_ActivityNet_t0.03.json	HCQ_ActivityNet_t0.03.txt	22.15	52.78	68.58	91.38	5	26.42	43.12	21.74	52.47	68.70	91.38	5	26.65	42.79
		0.05	HCQ_ActivityNet.json	HCQ_ActivityNet.txt	21.96	53.30	68.99	90.89	5	29.67	43.23	21.94	52.94	69.21	90.69	5	29.12	43.16
		0.07	HCQ_ActivityNet_t0.07.json	HCQ_ActivityNet_t0.07.txt	22.19	53.69	70.12	91.21	5	30.71	43.72	23.00	54.85	70.14	91.38	5	29.08	44.56
		0.1	HCQ_ActivityNet_t0.1.json	HCQ_ActivityNet_t0.1.txt	22.11	52.08	68.23	91.34	5	28.34	42.83	21.72	53.33	69.60	91.60	5	27.19	43.20
		0.12	HCQ_ActivityNet_t0.12.json	HCQ_ActivityNet_t0.12.txt	19.20	50.52	67.99	91.95	5	30.12	40.40	20.09	51.66	68.23	91.89	5	29.16	41.37
		0.15	HCQ_ActivityNet_t0.15.json	HCQ_ActivityNet_t0.15.txt	17.00	47.14	65.49	91.42	6	31.43	37.44	18.59	48.81	65.30	91.84	6	32.65	38.99

3.1.4 Results of HCQ with different kinds of text encoders ("1k-A" split) (reported in Table 5 in our paper)

Model	Text Encoder	Config json	Log	Text-to-Video Recall@1	Text-to-Video Recall@5	Text-to-Video Recall@10	Text-to-Video Recall@50	Text-to-Video median rank	Text-to-Video mean rank	Text-to-Video geometric mean of Recall@{1,5,10}	Video-to-Text Recall@1	Video-to-Text Recall@5	Video-to-Text Recall@10	Video-to-Text Recall@50	Video-to-Text median rank	Video-to-Text mean rank	Video-to-Text geometric mean of Recall@{1,5,10}
HCQ	BERT-base (default)	HCQ_MSRVTT_1kA.json	HCQ_MSRVTT_1kA.txt	25.90	54.80	69.00	88.80	5	28.06	46.09	26.30	57.00	70.10	90.00	4	25.15	47.19
	BERT-large	HCQ_MSRVTT_1kA_bert-large.json	HCQ_MSRVTT_1kA_bert-large.txt	27.40	57.70	70.70	89.60	4	27.09	48.17	26.20	59.00	71.80	89.50	4	25.47	48.06
	DistilBERT-base	HCQ_MSRVTT_1kA_distilbert-base.json	HCQ_MSRVTT_1kA_distilbert-base.txt	25.40	54.20	67.30	89.80	4	27.00	45.25	26.30	56.40	69.00	90.10	4	24.22	46.78
	RoBERTa-base	HCQ_MSRVTT_1kA_roberta-base.json	HCQ_MSRVTT_1kA_roberta-base.txt	25.50	54.70	67.80	89.20	5	27.04	45.56	24.50	55.00	69.00	90.20	4	23.80	45.30
	RoBERTa-large	HCQ_MSRVTT_1kA_roberta-large.json	HCQ_MSRVTT_1kA_roberta-large.txt	28.00	55.40	68.50	88.10	4	30.67	47.36	27.00	59.00	68.40	88.50	4	27.41	47.76
	XLNet-base	HCQ_MSRVTT_1kA_xlnet-base.json	HCQ_MSRVTT_1kA_xlnet-base.txt	25.80	56.20	68.70	87.50	5	28.35	46.36	24.60	55.50	69.00	88.40	4	25.59	45.50
	XLNet-large	HCQ_MSRVTT_1kA_xlnet-large.json	HCQ_MSRVTT_1kA_xlnet-large.txt	25.00	53.00	66.60	88.20	5	27.59	44.52	25.30	54.50	68.00	89.10	4	23.69	45.43

If you are running experiments on a platform with enough RAM and want to accelerate training, you can load the whole dataset into RAM with the following modification:

# WWW22-HCQ/base/base_dataset.py:L170
               load_in_ram=True, # change from 'False' to 'True'

3.2 Evaluation from checkpoint

A trained model can be evaluated from a checkpoint without retraining, using the following command:

python -m train --config configs/HCQ_MSRVTT_1kA.json --only_eval --load_checkpoint HCQ_MSRVTT_1kA.pth

We provide the checkpoint of HCQ_MSRVTT_1kA.json as an example. You can download this file (~1.6 GB) from Google Drive and put it in the working directory (WWW22-HCQ/).

3.3 Evaluation for post-compression methods

Take the evaluation on the MSRVTT dataset ("1k-A" split) as an example. First, we need to train an HCT model.

# working directory: WWW22-HCQ/
python -m train --config configs/HCT_MSRVTT_1kA.json

Then, run get_embed.py and pass the path of the HCT checkpoint to the script:

python -m get_embed configs/HCT_MSRVTT_1kA.json --only_eval --load_checkpoint HCT_MSRVTT_1kA/trained_model.pth

After that, the embedding file embeddings.h5 will be written under WWW22-HCQ/exps/HCT_MSRVTT_1kA/. Run compress_embed.py to get the results:

# compress embeddings with LSH
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type LSH
# compress embeddings with PQ
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type PQ
# compress embeddings with OPQ
python -m compress_embed --path ./exps/HCT_MSRVTT_1kA/embeddings.h5 --type OPQ
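
To give an intuition for what post-compression does, here is a small, generic sketch of random-hyperplane LSH in pure Python. It is not the code in compress_embed.py, whose internals are not described here; the function names, the 4-dimensional toy vectors, and the 16-bit code length are all assumptions chosen for illustration. Each embedding is reduced to a short binary code, and similarity is then estimated with the Hamming distance between codes.

```python
import random
from typing import List, Sequence


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(vec: Sequence[float], planes: List[List[float]]) -> List[int]:
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0
            for plane in planes]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, n_bits=16)
query = [0.9, 0.1, -0.3, 0.5]
near = [0.8, 0.2, -0.2, 0.5]      # points in a similar direction to query
far = [-0.9, -0.1, 0.3, -0.5]     # exactly the opposite direction

print(hamming(lsh_code(query, planes), lsh_code(near, planes)))
print(hamming(lsh_code(query, planes), lsh_code(far, planes)))  # 16: every bit flips
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes and so receive similar codes, which is why Hamming distance on the codes can stand in for similarity on the original embeddings.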

4. References

If you find this code useful or use the toolkit in your work, please consider citing:

@inproceedings{wang22hcq,
  author={Wang, Jinpeng and Chen, Bin and Liao, Dongliang and Zeng, Ziyun and Li, Gongfu and Xia, Shu-Tao and Xu, Jin},
  title={Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval},
  booktitle={Proceedings of the Web Conference 2022},
  doi={10.1145/3485447.3512022}
}

5. Acknowledgements

Our code is based on the implementations of nanopq, Multi-Modal Transformer, Collaborative Experts, Transformers, and Mixture of Embedding Experts.

6. Contact

If you have any questions, you can raise an issue or email Jinpeng Wang ([email protected]). We will reply as soon as possible.
