PyTorch Code for the paper "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives"

Overview

Improving Visual-Semantic Embeddings with Hard Negatives

Code for the image-caption retrieval methods from "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives", F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, Proceedings of the British Machine Vision Conference (BMVC), 2018. (BMVC Spotlight)

Dependencies

We recommend using Anaconda to manage the required packages. NLTK also needs the Punkt sentence tokenizer, which can be downloaded with:

import nltk
nltk.download('punkt')  # fetches the Punkt sentence tokenizer without the interactive downloader

Download data

Download the dataset files and pre-trained models. We use splits produced by Andrej Karpathy. The precomputed image features are from here and here. To use full image encoders, download the images from their original sources here, here and here.

wget http://www.cs.toronto.edu/~faghri/vsepp/vocab.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/data.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/runs.tar

We refer to the path of the extracted files for data.tar as $DATA_PATH and the files for runs.tar as $RUN_PATH. Extract vocab.tar to the ./vocab directory.

Update: The vocabulary was originally built using all sets (including test set captions). Please see issue #29 for details. Please consider not using test-set captions if building on this project.

Evaluate pre-trained models

python -c "\
from vocab import Vocabulary
import evaluation
evaluation.evalrank('$RUN_PATH/coco_vse++/model_best.pth.tar', data_path='$DATA_PATH', split='test')"

To do cross-validation on MSCOCO, pass fold5=True with a model trained using --data_name coco.
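
For example, a sketch mirroring the command above (the checkpoint path is only illustrative and should point to a model trained with --data_name coco):

python -c "\
from vocab import Vocabulary
import evaluation
evaluation.evalrank('$RUN_PATH/coco_vse++/model_best.pth.tar', data_path='$DATA_PATH', split='test', fold5=True)"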

Training new models

Run train.py:

python train.py --data_path "$DATA_PATH" --data_name coco_precomp \
    --logger_name runs/coco_vse++ --max_violation

Arguments used to train pre-trained models:

Method    Arguments
VSE0      --no_imgnorm
VSE++     --max_violation
Order0    --measure order --use_abs --margin .05 --learning_rate .001
Order++   --measure order --max_violation
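
For example, the Order++ configuration from the table corresponds to a command along these lines (the --logger_name path is arbitrary and chosen here only for illustration):

python train.py --data_path "$DATA_PATH" --data_name coco_precomp \
    --logger_name runs/coco_order++ --measure order --max_violation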

Reference

If you found this code useful, please cite the following paper:

@inproceedings{faghri2018vse++,
  title     = {VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
  author    = {Faghri, Fartash and Fleet, David J and Kiros, Jamie Ryan and Fidler, Sanja},
  booktitle = {Proceedings of the British Machine Vision Conference ({BMVC})},
  url       = {https://github.com/fartashf/vsepp},
  year      = {2018}
}

License

Apache License 2.0

Comments
  • problem in finetune

    I use the original Flickr30k dataset rather than precomputed features, with the pretrained ResNet-50 model and finetune set to True. However, I get poor recall: R@1, R@5, and R@10 are nearly zero, and I can't find the reason for the bad result. How do you set the parameters when you use ResNet with finetuning? Thank you.

    opened by 136823xuewei 6
  • Reproducing results

    Hi,

    First of all, thanks for sharing this great work!

    I'm having difficulties reproducing the results from the paper as a baseline. I will talk about experiment #3.15 in this issue: VSE++ (ResNet), Flickr30k.

    From what I get from the paper, the config is the following:

    • 30 epochs
    • Load images from disk, no precomputed features?
    • lower the lr after 15 epochs.
    • lr goes from 0.0002 -> 0.00002

    My question is: is the image encoder here trained end-to-end or not? In other words, is ResNet152 only used as a fixed feature extractor, or is it optimized?

    According to your documentation, VSE++ (and therefore I assume 3.14) can be reproduced by only using the --max_violation flag, but I get (way) lower results. Do I need the --finetune flag as well?

    Thanks, Maurits

    opened by MauritsBleeker 5
  • Finish validation without running model.train()

    The code at lines 176-177 of train.py:

    if model.Eiters % opt.val_step == 0:  
        validate(opt, val_loader, model)
    

    will run validation and call model.val_start(), which puts the batch-norm and dropout layers of img_encoder and txt_encoder into evaluation mode.

    However, model.train_start() is not called again until the next epoch, so the dropout and batch-norm layers are only active for the first val_step steps of each epoch...

    Is this a bug that needs to be fixed?
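
    A minimal fix along the lines the comment suggests (a sketch, not an official patch) would be to restore training mode right after validation:

    if model.Eiters % opt.val_step == 0:
        validate(opt, val_loader, model)
        model.train_start()  # re-enable dropout and batch-norm updates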

    opened by Wangzhpp 5
  • pass enforce_sorted to False

    This is a PR as much as it is a question. According to the documentation on pack_padded_sequence, by default it assumes that its input comes sorted from the longest sentence to the shortest:

            enforce_sorted (bool, optional): if ``True``, the input is expected to
                contain sequences sorted by length in a decreasing order. If
                ``False``, the input will get sorted unconditionally. Default: ``True``.
    

    I cannot see where this is guaranteed at data loading time to ensure that every batch is properly processed; could this be an issue? I have reproduced your results, but I have seen differences when embedding sentences alone versus in a batch, and what I observed is that the embeddings in a batch seemed to be mixed up.
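
    For context (not part of the PR), a minimal standalone sketch of how enforce_sorted=False lets pack_padded_sequence take an unsorted batch and return outputs in the original order:

        import torch
        from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

        # A batch of two padded sequences that is NOT sorted by length (2, then 3).
        padded = torch.randn(2, 3, 8)              # (batch, max_len, features)
        lengths = torch.tensor([2, 3])

        # enforce_sorted=False sorts internally and remembers the permutation,
        # so the data loader does not have to sort the batch itself.
        packed = pack_padded_sequence(padded, lengths, batch_first=True,
                                      enforce_sorted=False)

        rnn = torch.nn.GRU(input_size=8, hidden_size=4, batch_first=True)
        output, _ = rnn(packed)

        # pad_packed_sequence restores the original batch order.
        unpacked, out_lengths = pad_packed_sequence(output, batch_first=True)
        print(unpacked.shape, out_lengths)         # torch.Size([2, 3, 4]) tensor([2, 3])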

    opened by JoanFM 4
  • How to get coco_train_ids.npy?

    I want to train "coco" instead of "coco_precomp", but the file indicated by "download.sh":

    http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip

    doesn't contain the coco_train/val/test_ids.npy files, and I get: No such file or directory: vsepp/data/coco/annotations/coco_train_ids.npy

    So where can I get these files in order to train "coco"?

    Thanks in advance.

    opened by MaAo 4
  • single caption query

    This code works quite well. Thanks for sharing it. I'm wondering, do you have any code snippets showing how one might use a trained VSE++ model to create a caption query from text (i.e. a string), submit it to the model to get a single caption embedding, and then search for matching images that have also been mapped to the joint space with the same model? It's easy to do the comparison once numpy arrays for the caption and image embeddings in the joint space are created, but it's not clear how to use your model with a brand-new caption query, or simply with a set of CNN image features that are not part of some complete COCO/Flickr/etc. train or test set with corresponding caption/image pairs. Thanks for any tips. I'd prefer not to rewrite everything if you already have some additional tools for this.
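
    Once both sides are embedded in the joint space, the search step the comment describes is only a similarity ranking; a minimal sketch (assuming the embeddings are already L2-normalized numpy arrays, which is the default in this code unless --no_imgnorm is used):

        import numpy as np

        def rank_images(query_emb, img_embs, top_k=5):
            """Rank images for a single caption embedding.

            query_emb: (d,) L2-normalized caption embedding
            img_embs:  (num_images, d) L2-normalized image embeddings
            """
            sims = img_embs @ query_emb     # cosine similarity reduces to a dot product here
            order = np.argsort(-sims)       # best matches first
            return order[:top_k], sims[order[:top_k]]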

    opened by wingz1 4
  • Patch for Python3 compatibility

    diff --git a/data.py b/data.py
    index 913ea16..520c38d 100644
    --- a/data.py
    +++ b/data.py
    @@ -221,13 +221,12 @@ class PrecompDataset(data.Dataset):
         def __getitem__(self, index):
             # handle the image redundancy
             img_id = index/self.im_div
    -        image = torch.Tensor(self.images[img_id])
    +        image = torch.Tensor(self.images[int(img_id)])
             caption = self.captions[index]
             vocab = self.vocab
     
             # Convert caption (string) to word ids.
    -        tokens = nltk.tokenize.word_tokenize(
    -            str(caption).lower().decode('utf-8'))
    +        tokens = nltk.tokenize.word_tokenize(str(caption).lower())
             caption = []
             caption.append(vocab('<start>'))
             caption.extend([vocab(token) for token in tokens])
    diff --git a/evaluation.py b/evaluation.py
    index 7e5da4e..9171f85 100644
    --- a/evaluation.py
    +++ b/evaluation.py
    @@ -57,7 +57,7 @@ class LogCollector(object):
             """Concatenate the meters in one log line
             """
             s = ''
    -        for i, (k, v) in enumerate(self.meters.iteritems()):
    +        for i, (k, v) in enumerate(self.meters.items()):
                 if i > 0:
                     s += '  '
                 s += k + ' ' + str(v)
    @@ -66,7 +66,7 @@ class LogCollector(object):
         def tb_log(self, tb_logger, prefix='', step=None):
             """Log using tensorboard
             """
    -        for k, v in self.meters.iteritems():
    +        for k, v in self.meters.items():
                 tb_logger.log_value(prefix + k, v.val, step=step)
     
     
    
    help wanted 
    opened by cdluminate 4
  • evaluation problems

        def t2i(images, captions, npts=None, measure='cosine', return_ranks=False):
            """
            Text->Images (Image Search)
            Images: (5N, K) matrix of images
            Captions: (5N, K) matrix of captions
            """
            if npts is None:
                npts = int(images.shape[0] / 5)
            ims = numpy.array([images[i] for i in range(0, len(images), 5)])

    Why divide by 5?

    opened by chirstinaFan 4
  • questions on dataset construction

    Hi. Thanks for your code. 1- May I ask why you are including the start and end tokens when constructing the caption? Since you only want to encode the caption, there is no need for them. As far as I know, the start and end tokens are only needed when predicting text (such as image captioning, neural machine translation, etc.). But in your case, you just want to encode. Or does it have to do with how the evaluation metric is calculated?

    2- I also have a question about the data loader. In this part:

            if self.images.shape[0] != self.length:
                self.im_div = 5
            else:
                self.im_div = 1
            # the development set for coco is large and so validation would be slow
            if data_split == 'dev':
                self.length = 5000
    

    I understand that for the training and test splits, you are replicating each image 5 times (the number of captions per image). However, for the 'dev' split (validation during training), you are specifying only 5000. For Flickr30k, that would still be correct (since we have 1000 validation images * 5). But for COCO, the actual validation set with the replication is 25K, and you are loading only a portion of it. The data loader generates indices according to the length of the dataset specified in __len__, so for the COCO dev set it will generate 5000 indices, and with images[i//5] this will retrieve only 1000 of the original COCO validation images. So my question is: is it right to do this? What if the other samples are better? This would lead to a low validation score when it should be high.
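
    To make the indexing concrete (illustrative only, not code from the repo), this is how im_div maps caption indices to image indices in that loader:

        im_div = 5                           # five captions per image
        for index in range(10):
            print(index, index // im_div)    # captions 0-4 -> image 0, 5-9 -> image 1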

    opened by muaz1994 3
  • How to calculate the scores on MSCOCO 1k test images?

    I have read some image-text matching papers but I still have no idea about how to select 1k images from the 5k test images, how can I get the ID list of the 1k test images?

    opened by jamiechoi1995 3
  • meanr and rsum seem inversely correlated

    Hi @fartashf,

    Great code base, very easy to work with! I had two quick questions regarding evaluation metrics:

    • I noticed that meanr increases as rsum increases and was wondering if you have an explanation for this? (See the plots below.) I should mention that these results are with a few modifications to your code: specifically, I used GloVe embeddings that are kept frozen (not backpropagated into).

    • Also, I was wondering what the reason was for choosing rsum for model selection instead of meanr?

    [Screenshots: plots of meanr and rsum over training iterations]

    Thanks!

    opened by BigRedT 3
  • Question about your model?

    I have a question about your model. I have images containing pill objects and a list of texts that are the pill names corresponding to each image. I want to map each pill object to its name. Because the number of pill classes is very large, I only have some prior information about the classes. I have trained a Faster R-CNN to detect the pills, and now I must map them to their names. Is your model relevant to this problem? Thank you so much.

    opened by ThorPham 0