CLIP4CMR
A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval
The original data and pre-computed CLIP features are available here. The train.pkl and test.pkl files contain raw image pixel features and text token-ID features, while clip_train.pkl and clip_test.pkl contain 1024-dimensional CLIP image and text features.
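Below is a minimal sketch of how the pre-computed CLIP features might be loaded and inspected with Python's pickle module. The dictionary keys used here (image, text, label) are assumptions for illustration only; check the actual contents of the files to confirm their structure.

```python
import pickle

# Load the pre-extracted CLIP features (a sketch; file must be downloaded first).
with open("clip_train.pkl", "rb") as f:
    data = pickle.load(f)

# Inspect the top-level structure to see the real keys/fields.
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))

# Hypothetical keys -- adjust to match the actual pickle layout:
# image_feats = data["image"]   # expected shape: (N, 1024) CLIP image features
# text_feats  = data["text"]    # expected shape: (N, 1024) CLIP text features
# labels      = data["label"]   # expected shape: (N,) class labels for supervised retrieval
```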