Implementation of the Object Relation Transformer for Image Captioning


Object Relation Transformer

This is a PyTorch implementation of the Object Relation Transformer published in NeurIPS 2019. You can find the paper here. This repository is largely based on code from Ruotian Luo's Self-critical Sequence Training for Image Captioning GitHub repo, which can be found here.

The primary additions are as follows:

  • Relation transformer model
  • Script to create reports for runs on MSCOCO

Requirements


  • Python 2.7 (because there is no coco-caption version for Python 3)
  • PyTorch 0.4+ (along with torchvision)
  • h5py
  • scikit-image
  • typing
  • pyemd
  • gensim
  • cider (already added as a submodule). See .gitmodules and clone the referenced repo into the object_relation_transformer folder.
  • The coco-caption library, which is used for generating different evaluation metrics. To set it up, clone the repo into the object_relation_transformer folder. Make sure to keep the cloned repo folder name as coco-caption and also to run the script from within that repo.

Data Preparation

Download ResNet101 weights for feature extraction

Download the file resnet101.pth from here. Copy the weights to a folder imagenet_weights within the data folder:

mkdir data/imagenet_weights
cp /path/to/downloaded/weights/resnet101.pth data/imagenet_weights

Download and preprocess the COCO captions

Download the preprocessed COCO captions from Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it in to data/. This file provides preprocessed captions and also standard train-val-test splits.

Then run:

$ python scripts/ --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

This script will map all words that occur <= 5 times to a special UNK token, and create a vocabulary for all the remaining words. The image information and vocabulary are dumped into data/cocotalk.json and discretized caption data are dumped into data/cocotalk_label.h5.
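The thresholding step described here can be illustrated with a small sketch. This is not the preprocessing script itself: the function name build_vocab and its arguments are hypothetical, and captions are assumed to be already tokenized into lists of words.

```python
from collections import Counter


def build_vocab(captions, count_threshold=5, unk_token="UNK"):
    """Hypothetical helper: replace rare words with unk_token.

    captions is a list of token lists. Words that occur
    <= count_threshold times are mapped to unk_token.
    """
    counts = Counter(word for caption in captions for word in caption)
    # Keep only words that occur strictly more often than the threshold.
    vocab = sorted(w for w, n in counts.items() if n > count_threshold)
    keep = set(vocab)
    # Add the UNK token only if at least one word was actually dropped.
    if len(keep) < len(counts):
        vocab.append(unk_token)
    encoded = [[w if w in keep else unk_token for w in caption]
               for caption in captions]
    return vocab, encoded


captions = [["a", "dog", "runs"], ["a", "cat", "sits"], ["a", "dog", "sits"]]
vocab, encoded = build_vocab(captions, count_threshold=1)
print(vocab)       # ['a', 'dog', 'sits', 'UNK']
print(encoded[1])  # ['a', 'UNK', 'sits']
```

With the threshold of 5 mentioned above, the same logic would apply to the full set of COCO captions.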

Next run:

$ python scripts/ --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train

This will preprocess the dataset and get the cache for calculating cider score.
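CIDEr weights n-grams by TF-IDF, so a cache of this kind plausibly stores how many training images contain each n-gram in their reference captions. The sketch below shows that idea only; the function names are hypothetical and the actual format of the cached file is not described here.

```python
from collections import Counter


def ngrams(tokens, max_n=4):
    """Return the set of all n-grams (as tuples) of length 1..max_n."""
    return set(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def document_frequency(references_per_image, max_n=4):
    """Count, for each n-gram, the number of images whose references contain it.

    references_per_image is a list with one entry per image; each entry
    is a list of tokenized reference captions for that image.
    """
    df = Counter()
    for references in references_per_image:
        seen = set()
        for tokens in references:
            seen |= ngrams(tokens, max_n)
        # Each image contributes at most 1 to the count of an n-gram.
        df.update(seen)
    return df


refs = [[["a", "dog"], ["a", "cat"]], [["a", "bird"]]]
df = document_frequency(refs)
print(df[("a",)])        # 2
print(df[("a", "dog")])  # 1
```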

Download the COCO dataset and pre-extract the image features

Download the COCO images from the MSCOCO website. We need 2014 training images and 2014 validation images. You should put the train2014/ and val2014/ folders in the same directory, denoted as $IMAGE_ROOT:

mv 262993_z.jpg $IMAGE_ROOT/train2014/COCO_train2014_000000167126.jpg

The last two commands are needed to address an issue with a corrupted image in the MSCOCO dataset (see here). The prepro script will fail otherwise.

Then run:

$ python scripts/ --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT

This script extracts the ResNet101 features (both fc feature and last conv feature) of each image. The features are saved in data/cocotalk_fc and data/cocotalk_att, and the resulting files are about 200GB. Running this script may take a day or more, depending on hardware.

(Check the prepro scripts for more options, like other ResNet models or other attention sizes.)

Download the Bottom-up features

Download the pre-extracted features from here. For the paper, the adaptive features were used.

Do the following:

mkdir data/bu_data; cd data/bu_data

The .zip file is around 22 GB. Then return to the base directory and run:

python scripts/ --output_dir data/cocobu

This will create data/cocobu_fc, data/cocobu_att and data/cocobu_box.

Generate the relative bounding box coordinates for the Relation Transformer

Run the following:

python scripts/ --input_json data/dataset_coco.json --input_box_dir data/cocobu_box --output_dir data/cocobu_box_relative --image_root $IMAGE_ROOT

This should take a couple hours or so, depending on hardware.
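The exact output layout of this script is not described here. As a rough illustration of what "relative" coordinates could mean, the sketch below normalizes a pixel-space box by the image size and appends the fraction of the image it covers. The function name relative_box and the (x1, y1, x2, y2) box layout are assumptions made for this example.

```python
from __future__ import division  # true division under Python 2.7 as well


def relative_box(box, image_width, image_height):
    """Hypothetical helper: normalize a pixel box by the image size.

    box is assumed to be (x1, y1, x2, y2) in pixels. Returns the
    normalized corners followed by the fraction of the image area covered.
    """
    x1, y1, x2, y2 = box
    rx1, ry1 = x1 / image_width, y1 / image_height
    rx2, ry2 = x2 / image_width, y2 / image_height
    area_fraction = (rx2 - rx1) * (ry2 - ry1)
    return (rx1, ry1, rx2, ry2, area_fraction)


print(relative_box((160, 120, 480, 360), 640, 480))
# (0.25, 0.25, 0.75, 0.75, 0.25)
```

Normalizing this way makes box geometry comparable across images of different resolutions, which is presumably why the image root is passed to the script.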

Model Training and Evaluation

Standard cross-entropy loss training

python --id relation_transformer_bu --caption_model relation_transformer --input_json data/cocotalk.json --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --input_label_h5 data/cocotalk_label.h5 --checkpoint_path log_relation_transformer_bu --noamopt --noamopt_warmup 10000 --label_smoothing 0.0 --batch_size 15 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --max_epochs 30 --use_box 1

The train script will dump checkpoints into the folder specified by --checkpoint_path (default = save/). We only save the best-performing checkpoint on validation and the latest checkpoint to save disk space.

To resume training, set the --start_from option to the folder that contains infos.pkl and model.pth (usually you can just set --start_from and --checkpoint_path to the same folder).

If you have tensorflow, the loss histories are automatically dumped into --checkpoint_path, and can be visualized using tensorboard.

The current command uses scheduled sampling. You can also set scheduled_sampling_start to -1 to disable it.
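For readers unfamiliar with scheduled sampling: during training, the model is fed its own previous prediction instead of the ground-truth word with some probability, and that probability usually grows as training progresses. The sketch below shows one simple step-wise schedule. Only the start epoch corresponds to an option named above; increase_every, increase_prob and max_prob and their values are illustrative assumptions, not documented defaults.

```python
def scheduled_sampling_prob(epoch, start=0, increase_every=5,
                            increase_prob=0.05, max_prob=0.25):
    """Hypothetical schedule: probability of feeding the model its own output.

    A negative start disables scheduled sampling entirely.
    """
    if start < 0 or epoch < start:
        return 0.0
    # Number of completed "increase" periods since the start epoch.
    steps = (epoch - start) // increase_every
    return min(increase_prob * steps, max_prob)


for epoch in (0, 5, 10, 30):
    print(epoch, scheduled_sampling_prob(epoch))
```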

If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the --language_eval 1 option, but don't forget to download the coco-caption code into the coco-caption directory.

For more options, see

The above training script should achieve a CIDEr-D score of about 115.
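The --noamopt and --noamopt_warmup options in the training command suggest the warmup schedule commonly used for Transformers, in which the learning rate rises linearly for a number of steps and then decays with the inverse square root of the step. The sketch below shows that standard formula. It assumes the model size equals --input_encoding_size (512) and uses a scale factor of 1.0; neither is confirmed by this README.

```python
def noam_learning_rate(step, model_size=512, warmup=10000, factor=1.0):
    """Standard 'Noam' schedule; factor and model_size are assumptions here."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return factor * model_size ** -0.5 * min(step ** -0.5,
                                             step * warmup ** -1.5)


for step in (1000, 10000, 40000):
    print(step, noam_learning_rate(step))
```

Under these assumptions the rate peaks at the end of warmup (step 10000) and halves by step 40000.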

Self-critical RL training

After training using cross-entropy loss, additional self-critical training produces significant gains in CIDEr-D score.
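As a reminder of the idea behind self-critical training: a sampled caption is rewarded by how much its score (for example CIDEr-D) exceeds the score of the greedily decoded caption for the same image, and that advantage weights the log-probability of the sample. The sketch below is a minimal pure-Python illustration with hypothetical names; it is not the training code from this repository.

```python
def self_critical_loss(sample_log_probs, sample_scores, greedy_scores):
    """Hypothetical sketch of the self-critical policy-gradient loss.

    sample_log_probs: one list of token log-probabilities per sampled caption.
    sample_scores / greedy_scores: metric scores of the sampled and greedily
    decoded captions for the same images.
    """
    total = 0.0
    for log_probs, sampled, greedy in zip(sample_log_probs,
                                          sample_scores, greedy_scores):
        # The greedy caption acts as the baseline for the sampled one.
        advantage = sampled - greedy
        total += -advantage * sum(log_probs)
    return total / len(sample_scores)


loss = self_critical_loss([[-1.0, -0.5]], [1.2], [1.0])
print(loss)
```

A sample that beats the greedy baseline gets a positive advantage, so minimizing the loss raises its probability; a sample that does worse is pushed down.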

First, copy the model that was pretrained with cross-entropy loss. (Copying the model is not mandatory; it just serves as a backup.)

$ bash scripts/ relation_transformer_bu relation_transformer_bu_rl


python --id relation_transformer_bu_rl --caption_model relation_transformer --input_json data/cocotalk.json --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_label_h5 data/cocotalk_label.h5 --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --checkpoint_path log_relation_transformer_bu_rl --label_smoothing 0.0 --batch_size 10 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --start_from log_relation_transformer_bu_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --max_epochs 60 --use_box 1

The above training script should achieve a CIDEr-D score of about 128.

Evaluate on Karpathy's test split

To evaluate the cross-entropy model, run:

python --dump_images 0 --num_images 5000 --model log_relation_transformer_bu/model.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --use_box 1 --language_eval 1

and for cross-entropy+RL run:

python --dump_images 0 --num_images 5000 --model log_relation_transformer_bu_rl/model.pth --infos_path log_relation_transformer_bu_rl/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --language_eval 1


Visualize caption predictions

Place all your images of interest into a folder, e.g. images, and run the eval script:

$ python --dump_images 1 --num_images 10 --model log_relation_transformer_bu/model.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root $IMAGE_ROOT --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5  --input_fc_dir data/cocobu_fc --input_att_dir data/cocobu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative

This tells the eval script to run up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing batch_size. Use --num_images -1 to process all images. The eval script will create a vis.json file inside the vis folder, which can then be visualized with the provided HTML interface:

$ cd vis
$ python -m SimpleHTTPServer

Now visit localhost:8000 in your browser and you should see your predicted captions.

Generate reports from runs on MSCOCO

The script can be used in order to generate HTML reports containing results from different runs. Please see the script for specific usage examples.

The script takes as input one or more pickle files containing results from runs on the MSCOCO dataset. It reads in the pickle files and creates a set of HTML files with tables and graphs generated from the different captioning evaluation metrics, as well as the generated image captions and corresponding metrics for individual images.

If more than one pickle file with results is provided as input, the script will also generate a report containing a comparison between the metrics generated by each pair of methods.

Model Zoo and Results

The table below presents links to our pre-trained models, as well as results from our paper on the Karpathy test split. Similar results should be obtained by running the respective commands in As learning rate scheduling was not fully optimized, these values should only serve as a reference/expectation rather than what can be achieved with additional tuning.

The models are Copyright Verizon Media, licensed under the terms of the CC-BY-4.0 license. See associated license file.

Algorithm                                                           CIDEr-D  SPICE  BLEU-1  BLEU-4  METEOR  ROUGE-L
Up-Down + LSTM *                                                    106.6    19.9   75.6    32.9    26.5    55.4
Up-Down + Transformer                                               111.0    20.9   75.0    32.8    27.5    55.6
Up-Down + Object Relation Transformer                               112.6    20.8   75.6    33.5    27.6    56.0
Up-Down + Object Relation Transformer + Beamsize 2                  115.4    21.2   76.6    35.5    28.0    56.6
Up-Down + Object Relation Transformer + Self-Critical + Beamsize 5  128.3    22.6   80.5    38.6    28.7    58.4

* Note that the pre-trained Up-Down + LSTM model above produces slightly better results than reported, as it came from a different training run. We kept the older LSTM results in the table above for consistency with our paper.

Comparative Analysis

In addition, in the paper we also present a head-to-head comparison of the Object Relation Transformer against the "Up-Down + Transformer" model. (Results from the latter model are also included in the table above). In the paper, we refer to this latter model as "Baseline Transformer", as it does not make use of geometry in its attention definition. The idea of the head-to-head comparison is to better understand the improvement obtained by adding geometric attention to the Transformer, both quantitatively and qualitatively. The comparison consists of a set of evaluation metrics computed for each model on a per-image basis, as well as aggregated over all images. It includes the results of paired t-tests, which test for statistically significant differences between the evaluation metrics resulting from each of the models. This comparison can be generated by running the commands in The commands first run the two aforementioned models on the MSCOCO test set and then generate the corresponding report containing the complete comparative analysis.
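To make the paired t-test mentioned above concrete, the sketch below computes the test statistic from per-image scores of two models. It only illustrates the statistic; the function name is hypothetical, the example scores are made up, and the report script may compute the test differently (for instance with a statistics library that also returns a p-value).

```python
from __future__ import division  # true division under Python 2.7 as well

import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-image scores.

    Assumes both lists are aligned by image and that the differences
    are not all identical (otherwise the variance is zero).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the per-image differences.
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1


# Made-up per-image scores for two models, for illustration only.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [0.5, 1.0, 2.5, 3.0]
t, dof = paired_t_statistic(model_a, model_b)
print(t, dof)
```

Pairing by image removes the large variation in difficulty between images, so the test is sensitive to consistent differences between the two models.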


If you find this repo useful, please consider citing (no obligation at all):

  title={Image Captioning: Transforming Objects into Words},
  author={Herdade, Simao and Kappeler, Armin and Boakye, Kofi and Soares, Joao},
  journal={arXiv preprint arXiv:1906.05963},

Of course, please cite the original paper of models you are using (you can find references in the model files).


Please refer to the file for information about how to get involved. We welcome issues, questions, and pull requests.

Please be aware that we (the maintainers) are currently busy with other projects, so it may take some days before we are able to get back to you. We do not foresee big changes to this repository going forward.


Kofi Boakye:

Simao Herdade:

Joao Soares:


This project is licensed under the terms of the MIT open source license. Please refer to LICENSE for the full terms.


Thanks to Ruotian Luo for the original code.

  • Dimension error for geometric and appearance features in Relation Encoding

    Thanks for sharing the code. It is solidly organized and compactly written.

    I have two questions based on my results from running it.

    1. At line 454, the comparison produces a boolean tensor, which cannot be added in the following line. I changed seq_mask = ( > 0) to seq_mask = ( > 0).type(torch.int8), and the error went away.

    2. In the function box_attention at line 236:

    # multiplying log of geometric weights by feature weights
    w_mn = torch.log(torch.clamp(w_g, min = 1e-6)) + w_a

    the dimensions of the geometric and appearance features do not match, so I get the following error:

    RuntimeError: The size of tensor a (54) must match the size of tensor b (50) at non-singleton dimension 3

    I have spent quite a while trying to figure out what is going on, but have made no progress so far. I am not sure whether it depends on my environment (I think not) or whether it is a typo in the code.

    • torch 0.4.1
    • torchvision 0.2.1
    • 4 x Tesla V100-SXM2 Driver Version: 410.104 CUDA Version: 10.0

    Any input will be appreciated. Jian

    opened by Jian-Xi 1
  • Update


    Yahoo no longer issues CLAs. The OSPO is going through an exercise to remove references to

    I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

    opened by retlawrose 0
  • changes to run make bottom up features data with python 3

    I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

    opened by gsrivas4 0
  • clipping boxes to support multi-gpu training

    To support multi-GPU training with the model, we need to add clip_att(boxes, att_masks) to the feature preprocessing. This is necessary because, when a batch is split across multiple GPUs, the maximum number of boxes in each batch split will in general be different.

    fixes #8

    I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

    opened by simaoh 0
  • Add model zoo

    • Added links to models in the README and included CC-BY license
    • Fixed eval sample commands to use model-best.pth instead of model.pth
    • Added legacy_extra_skip parameter for compatibility with older RL model

    Closes #1

    I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

    opened by jvbsoares 0
  • Error when processing my image folder

    The eval script supports evaluation on a user-supplied batch of images:

    # For evaluation on a folder of images:
    parser.add_argument('--image_folder', type=str, default='', help='If this is nonempty then will predict on the images in this folder path')
    parser.add_argument('--image_root', type=str, default='', help='In case the image paths have to be preprended with a root path to an image folder')

    I put some images into the images folder and ran the script:

    python3 --dump_images 1 --num_images 10 --model log_relation_transformer_bu_rl/model-best.pth --infos_path log_relation_transformer_bu_rl/infos_relation_transformer_bu-best.pkl --image_folder images --language_eval 0

    When doing so I get the error:

    Traceback (most recent call last):
      File "", line 175, in
        loss, split_predictions, lang_stats = eval_utils.eval_split(model, crit, loader,
      File "/home/docet/Projects/Pic2Text/object_relation_transformer-master/", line 134, in eval_split
        boxes_data= data['boxes'][np.arange(loader.batch_size) * loader.seq_per_img]
    KeyError: 'boxes'

    How can I solve it?

    opened by KyriaAnnwyn 2
  • Minor formating changes

    I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

    opened by Arun-George-Zachariah 0
  • win10 can not run the project

    Python 2.7 (because there is no coco-caption version for Python 3) PyTorch 0.4+ (along with torchvision)

    This does not work because there is no PyTorch 0.4+ available for Python 2.7.

    opened by jiajunhua 2
  • Evaluate on COCO test split

    When I try to evaluate the model on the COCO test split (about 60,000 images) with the command "python --dump_images 0 --num_images 5000 --model log_relation_transformer_bu/model-best.pth --infos_path log_relation_transformer_bu/infos_relation_transformer_bu-best.pkl --image_root ./data/coco2014/ --input_json data/cocotest.json --input_label_h5 data/cocotalk_label.h5 --input_fc_dir data/cocotest_bu_fc --input_att_dir data/cocotest_bu_att --input_box_dir data/cocobu_box --input_rel_box_dir data/cocobu_box_relative --use_box 1 --language_eval 1", I get the error "Bad file descriptor". How can I evaluate the model on the COCO test split?

    opened by HN123-123 1
  • self-critical training [duration and memory occupation]


    Thank you very much for your great work and the nice code!

    I have a question regarding the additional self-critical training. I am not sure whether there is an issue with it, but could you please tell me how much memory self-critical training should consume? I keep running into CUDA out-of-memory errors with 3 GPUs, and I can see that self-critical training is very memory-hungry. I would therefore like to hear from the authors of the paper how much memory this extra training required in the original experiments. Was the code optimized in any way to handle this issue?

    Best, Nikolai.

    opened by nilinykh 1