A unified framework to jointly model images, text, and human attention traces.

Meta Research

Last update: Oct 24, 2022

Related tags

Deep Learning connect-caption-and-trace

Overview

connect-caption-and-trace

This repository contains the reference code for our paper Connecting What to Say With Where to Look by Modeling Human Attention Traces (CVPR2021).

Requirements

Python 3
PyTorch 1.5+ (along with torchvision)
coco-caption (Remember to follow initialization steps in coco-caption/README.md)

Prepare data

Our experiments cover all four datasets included in Localized Narratives: COCO2017, Flickr30k, Open Images and ADE20k. For each dataset, we need four things: (1) json file containing image info and word tokens. (DATASET_LN.json) (2) h5 file containing caption labels (DATASET_LN_label.h5) (3) The trace labels extracted from Localized Narratives (DATASET_LN_trace_box/) (4) json file for coco-caption evaluation (captions_DATASET_LN_test.json) (5) Image features (with bounding boxes) extracted by a Mask-RCNN pretrained on Visual Genome.

You can download (1--4) from here: (make a folder named data and put (1--3) in it, and put (4) under coco-caption/annotaions/)

To get (5), you can use Detectron2. First, install Detectron2, then follow Prepare COCO-style annotations for Visual Genome (We use the pre-trained Resnet101-C4 model provided there). After that you can utilize tools/extract_feats.py in Detectron2 to extract features. Finally, run scripts/prepare_feats_boxes_from_npz.py in this repo to prepare features and bounding boxes in seperate folders for training.

For COCO dataest you can also directly use the features provided by Peter Anderson here. The performance is almost the same (with around 0.2% difference.)

Training

The dataset can be chosen from the four datasets. The --task can be chosen from trace, caption, c_joint_t and pred_both. The --eval_task can be chosen from trace, caption, and pred_both.

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 0 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

python tools/train.py --language_eval 0 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Flickr30k: training of controlled caption generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_flk30k  --caption_model transformer --input_json data/flk30k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/flk30k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/flk30k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task caption --eval_task caption --dataset_choice=flk30k

ADE20k: training of controlled trace generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_ade20k  --caption_model transformer --input_json data/ade20k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/ade20k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/ade20k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task trace --eval_task trace --dataset_choice=ade20k

Evaluating

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on trace generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task trace --dataset_choice=coco

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

python tools/train.py --language_eval 1 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Acknowledgements

Some components of this repo were built from Ruotian Luo's ImageCaptioning.pytorch.

Comments

No module named 'captioning.data'

I couldn't find the custom dataloader in a source codes. The error occurred in the https://github.com/facebookresearch/connect-caption-and-trace/blob/d015988cdca81afb22508742107658b2574ebf09/tools/train.py#L20

opened by Napkin-DL 1
Adding Code of Conduct file

This is pull request was created automatically because we noticed your project was missing a Code of Conduct file.

Code of Conduct files facilitate respectful and constructive communities by establishing expected behaviors for project contributors.

This PR was crafted with love by Facebook's Open Source Team.
CLA Signed

opened by facebook-github-bot 0
Adding Contributing file

This is pull request was created automatically because we noticed your project was missing a Contributing file.

CONTRIBUTING files explain how a developer can contribute to the project - which you should actively encourage.

This PR was crafted with love by Facebook's Open Source Team.
CLA Signed

opened by facebook-github-bot 0
About the storage location of the results (like the generated caption and trace)

Hi, @zihangm,

Thanks for your excellent work! I have successfully run the code, but I can't find where those results are stored. After the training, just one new folder called "eval_result" appears, which only contains the image caption of the eval part.

I would appreciate it if you could share with me the right way to use this code.

Thanks, yuhu

opened by yuhufeng 2
Some questions about dataset
Hi, I'm interested in the data file ‘(2) h5 file containing caption labels (DATASET_LN_label.h5)’ & ‘The trace labels extracted from Localized Narratives (DATASET_LN_trace_box/)’.

How are these data files generated？

What is fc_feat in the model's input?

which image features provided by Peter Anderson are suitable for this task?
opened by harukaza 0
box_feats, trace_feats dimension size 5

Hi,

I was attempting to reproduce the model and I had two questions. I saw that the box_feats (which corresponds to the bounding box of object proposals) and trace_feats (corresponding to bounding box of traces) has 5 dimensions.

Could you elaborate on what each dimension means? Specifically what is the 5th dimension? What does this value refer to?

Also, is the bounding box expressed in terms of width and height or secondary x,y coordinates, i.e: (x, y, w, h, ?) or (x1, y1, x2, y2, ?).

Thank you!

opened by dondongwon 3
More details for creating dense word-to-box alignment.

Hi, @zihangm,

Thanks for your nice work!

I am curious about more details for creating dense word-to-box alignment in Section3.1 in your paper. I have compared your released coco_LN_trace_box data with the original released LN dataset annotations, and found that the numbers of trace segments of one specific image are not the same. For example, considering the image(id: 322944) in coco_val split, the number of trace segments in your released data is 13 while in the original released data the number is 18. So I wonder whether you took some extra rules for filtering or merging the original trace segments for better alignment in your data preprocessing?

Since I can't find related preprocessing code in the repo, I will appreciate it if you can share some experience.

Thanks, Jianjie

opened by jianjieluo 2
Detectron2 Preprocessing

Hi, I'm having trouble following the steps for (5) Image features (with bounding boxes) extracted by a Mask-RCNN pretrained on Visual Genome. The step: Prepare COCO-style annotations for Visual Genome.

Could you elaborate on these steps? I think this is due to the depreciation of the repository that was linked, with the newer installation not allowing for the same steps.

I would really appreciate detailed preprocessing instructions, or if you could provide the preprocessed features directly, so it would be possible to recreate the results, that would be amazing as well.

Thank you for the amazing work!

opened by dondongwon 1

A unified framework to jointly model images, text, and human attention traces.

Related tags

Overview

connect-caption-and-trace

Requirements

Prepare data

Training

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

Flickr30k: training of controlled caption generation alone (N=1 layer)

ADE20k: training of controlled trace generation alone (N=1 layer)

Evaluating

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on trace generation)

Open image: training of generating caption and trace at the same time (N=1 layers, evaluated on predicting both)

Acknowledgements

Comments

Owner

Meta Research

3DMV jointly combines RGB color and geometric information to perform 3D semantic segmentation of RGB-D scans.

Implementation of STAM (Space Time Attention Model), a pure and simple attention model that reaches SOTA for video classification

Python scripts for performing 3D human pose estimation using the Mobile Human Pose model in ONNX.

Unified tracking framework with a single appearance model

This repository is the offical Pytorch implementation of ContextPose: Context Modeling in 3D Human Pose Estimation: A Unified Perspective (CVPR 2021).

Reimplementation of the paper `Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (ACL2020)`

The official TensorFlow implementation of the paper Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition

A PyTorch Implementation of the Luna: Linear Unified Nested Attention

UMT is a unified and flexible framework which can handle different input modality combinations, and output video moment retrieval and/or highlight detection results.

UMEC: Unified Model and Embedding Compression for Efficient Recommendation Systems

Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors, CVPR 2021

Official PyTorch code for Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021)

Official PyTorch code for Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021)

HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement

Code repo for EMNLP21 paper "Zero-Shot Information Extraction as a Unified Text-to-Triple Translation"

CRLT: A Unified Contrastive Learning Toolkit for Unsupervised Text Representation Learning

[CVPR2021] UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles

Human Action Controller - A human action controller running on different platforms.

StyleGAN-Human: A Data-Centric Odyssey of Human Generation