A task-agnostic vision-language architecture as a step towards General Purpose Vision

Towards General Purpose Vision Systems

By Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem

Overview

Welcome to the official code base for GPV-I - a general purpose vision-language architecture that can learn and perform any task that requires bounding boxes or text prediction. We demonstrate the effectiveness of GPV-I by jointly training it on VQA, Captioning, Localization, and Classification tasks and achieving favorable performance in comparison to specialized single-task models.

Available on Arxiv: https://arxiv.org/abs/2104.00743

Project Page: https://prior.allenai.org/projects/gpv

Demo: https://vision-explorer.allenai.org/general_purpose_vision

BibTex:

@article{Gupta2021GPV,
  title={Towards General Purpose Vision Systems},
  author={Tanmay Gupta and A. Kamath and Aniruddha Kembhavi and Derek Hoiem},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00743}
}

Clone repository

git clone --recurse-submodules git@github.com:allenai/gpv-1.git

Install dependencies

Create conda environment

conda create -n gpv python=3.6 -y
conda activate gpv

Install libraries

bash setup_conda_env.sh

Paths

Decide the following paths:

  • <data_dir>: This is the directory where images and annotations will be saved
  • <output_dir>: This is where outputs of various experiments will be saved including model checkpoints, visualization, inference and evaluation results

<data_dir> and <output_dir> refer to these absolute paths in the instructions below.

Download data

To study generalization of concepts across skills, we created a new split of COCO annotations - COCO-SCE. To download the original split, our new split, and DETR checkpoints pretrained on both splits, run the following:

bash setup_data.sh <data_dir>

Note - If you intend to run experiments only on COCO-SCE, you can skip downloading COCO test images and save time and disk space by setting download_coco_test_images=False in setup_data.sh

Download model

Model   Split      Download
GPV     COCO       Link
GPV     COCO-SCE   Link

To use any of these models, download them into <output_dir>/<exp_name>/ckpts directory as follows:

wget <link> -P <output_dir>/<exp_name>/ckpts/

<exp_name> could be any directory name of your choice such as gpv_coco or gpv_coco_sce.
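
If you prefer to script the download, the wget command above can be sketched in Python as follows. This is a minimal sketch, not part of the code base: the helper names ckpt_dir and download_ckpt are hypothetical, and the link you pass in must be one of the download links from the table above.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ckpt_dir(output_dir: str, exp_name: str) -> Path:
    """Return <output_dir>/<exp_name>/ckpts, the directory checkpoints go into."""
    return Path(output_dir) / exp_name / "ckpts"


def download_ckpt(link: str, output_dir: str, exp_name: str) -> Path:
    """Download `link` into the ckpts directory, keeping the remote file name.

    Hypothetical helper that mirrors `wget <link> -P <output_dir>/<exp_name>/ckpts/`.
    """
    target_dir = ckpt_dir(output_dir, exp_name)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    target = target_dir / link.rstrip("/").rsplit("/", 1)[-1]
    urlretrieve(link, str(target))
    return target


# Example: where a checkpoint for the experiment "gpv_coco" would be stored.
print(ckpt_dir("/abs/path/to/output_dir", "gpv_coco"))
```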

Test the model interactively

We provide easy-to-use interactive IPython notebooks where you may provide an image and a natural language task description and visualize the model's outputs, namely bounding boxes for relevant image regions and a text answer. Note that while some tasks might expect only one of the output modalities, the model always outputs both. For example, the model outputs relevant regions during captioning and text during localization. These auxiliary outputs may be unsolicited but often provide useful and diagnostic information.

We provide the following notebooks:

  • inference.ipynb: This demonstrates inference for GPV-1 using greedy inference for text decoding as used in all experiments in our paper.
  • inference_beam_search.ipynb: Post-submission, we implemented beam search! Setting the beam size to 1 recovers greedy inference. Beam search also allows sampling multiple high-ranking text outputs, which is especially useful for tasks with multiple plausible outputs such as captioning.
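
To illustrate how the two decoding modes relate, here is a toy sketch of beam search over a made-up next-token table. It is not GPV-1's decoder: toy_next_token_probs, its probabilities, and the beam_search function are hypothetical stand-ins chosen only to show that a beam size of 1 reduces to greedy decoding, while a larger beam returns several ranked outputs.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"
Tokens = Tuple[str, ...]


def toy_next_token_probs(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical stand-in for a text decoder: next-token probabilities."""
    table = {
        (): {"a": 0.6, "two": 0.4},
        ("a",): {"dog": 0.55, "cat": 0.45},
        ("two",): {"dogs": 0.9, "cats": 0.1},
    }
    # Any prefix not in the table ends the sequence.
    return table.get(prefix, {EOS: 1.0})


def beam_search(
    next_probs: Callable[[Tokens], Dict[str, float]],
    beam_size: int,
    max_len: int = 5,
) -> List[Tuple[Tokens, float]]:
    """Return up to beam_size finished (tokens, log_prob) pairs, best first."""
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]
    finished: List[Tuple[Tokens, float]] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (tokens + (tok,), score + math.log(p))
            for tokens, score in beams
            for tok, p in next_probs(tokens).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the beam_size best; finished ones leave the beam.
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_size]


greedy = beam_search(toy_next_token_probs, beam_size=1)
ranked = beam_search(toy_next_token_probs, beam_size=2)
print(greedy[0][0])             # the greedy output
print([t for t, _ in ranked])   # two ranked outputs from the wider beam
```

In this toy table, greedy decoding commits to the locally best first token, whereas the wider beam also keeps the alternative and ends up ranking a different sequence first.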

We also provide equivalent .py scripts to run inference on a single image and task description pair. To run these scripts, update output_dir, ckpt, inputs.img, and inputs.query in configs/exp/gpv_inference_cmdline.yaml.

For inference with beam search run:

python -m inference_beam_search beam_size=5

For greedy decoding either set beam_size to 1 in the previous command or run the following:

python -m inference

Train model

We provide scripts for training GPV on one or more of the following tasks:

  • CocoClassification
  • CocoVqa
  • CocoDetection (referred to as the Localization task in the paper)
  • CocoCaptioning

Training and monitoring GPV-1 involves the following steps:

  • Step 1: Update the configs/exp/gpv.yaml file. Here are the key parameters to consider (the ones marked with a star will be set later in Step 3):

    • num_gpus_per_node (set to 4 if you have 24GB GPUs, 2 for 48GB, and 1 for 80GB)
    • dist_url
    • output_dir *
    • data_dir *
    • model.pretr_detr *
  • Step 2: Decide the dataset or combination of supported datasets to train the model. This is specified through one of the files in configs/learning_datasets. For instance, all.yaml trains on all 4 tasks, cap_vqa.yaml trains on CocoCaptioning & CocoVqa, and cap.yaml trains only on CocoCaptioning. If you don't see the dataset combination you need, you may add one by modifying a copy of all.yaml. We refer to the name of the chosen yaml file, without the extension, as <learning_datasets>.

  • Step 3: Launch training as follows:

    bash exp/gpv/scripts/train.sh <learning_datasets> <data_split> <exp_name> <output_dir> <data_dir>
    

    Note that training comprises 2 sub-steps. First, the model is trained for training.frozen_epochs epochs (set in configs/exp/gpv.yaml) with DETR weights frozen. Then the model is finetuned end-to-end for a total of training.num_epochs epochs. train.sh executes both sub-steps sequentially. model.pretr_detr is selected automatically in train.sh based on <data_split>.

  • Step 4: Visualize loss, metrics, and learning rate on tensorboard:

    tensorboard --logdir=<output_dir> --bind_all
    
  • Step 5: Predictions are visualized on a small set of train and validation set samples every few thousand iterations (training.vis_step). These are available at <output_dir>/<exp_name>/training_visualizations
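
Steps 1 and 3 can be sketched in Python as below. This is an illustrative sketch only: gpus_per_node and train_command are hypothetical helpers, the GPU-memory mapping simply restates the guidance in Step 1, and the example value gpv_split for <data_split> is an assumption borrowed from the <split> values listed in the Evaluation section.

```python
import shlex
from typing import List


def gpus_per_node(gpu_memory_gb: int) -> int:
    """Suggested num_gpus_per_node for a given GPU memory size (per Step 1)."""
    recommended = {24: 4, 48: 2, 80: 1}
    if gpu_memory_gb not in recommended:
        raise ValueError(
            f"No guidance for {gpu_memory_gb}GB GPUs; known sizes: {sorted(recommended)}"
        )
    return recommended[gpu_memory_gb]


def train_command(
    learning_datasets: str,
    data_split: str,
    exp_name: str,
    output_dir: str,
    data_dir: str,
) -> List[str]:
    """Assemble the Step 3 command line as an argument list."""
    return [
        "bash",
        "exp/gpv/scripts/train.sh",
        learning_datasets,
        data_split,
        exp_name,
        output_dir,
        data_dir,
    ]


# Example values only; "gpv_split" as <data_split> is an assumption.
cmd = train_command("all", "gpv_split", "gpv_coco_sce", "/abs/output_dir", "/abs/data_dir")
print(gpus_per_node(24))
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The argument list can be passed to subprocess.run(cmd, check=True) from the repository root if you want to launch training from Python.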

Evaluation

We provide evaluation code for the following tasks:

  • CocoClassification
  • CocoVqa
  • CocoDetection (referred to as the Localization task in the paper)
  • CocoCaptioning
  • RefCocop

Run the following command to evaluate on one task or a set of tasks:

bash exp/gpv/scripts/eval.sh <exp_name> <task_name> <subset> <split> <output_dir> <data_dir>
  • <exp_name>: name of the experiment directory (<output_dir>/<exp_name>) where the model to be evaluated lives.
  • <task_name>: set to all to evaluate on all 5 tasks, all_but_refexp to evaluate on all tasks except RefCocop, or the name of a single task to evaluate only on that task.
  • <subset>: set to train or val for COCO (no test since COCO test annotations are hidden) and train, val, or test for COCO-SCE.
  • <split>: set to original_split (COCO) or gpv_split (COCO-SCE). This flag is unused for RefCocop.

Predictions and metrics are saved at <output_dir>/<exp_name>/eval.
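
The argument constraints above can be checked before launching a long evaluation run. The following is a minimal sketch under stated assumptions: check_eval_args and eval_command are hypothetical helpers, not part of the code base, and because the valid <subset> values for RefCocop are not listed here, the sketch skips the subset and split check for that task.

```python
from typing import List

TASKS = {
    "CocoClassification", "CocoVqa", "CocoDetection", "CocoCaptioning", "RefCocop",
}
TASK_GROUPS = {"all", "all_but_refexp"}

# Subsets allowed for each split, as described in the argument list above.
VALID_SUBSETS = {
    "original_split": {"train", "val"},      # COCO: test annotations are hidden
    "gpv_split": {"train", "val", "test"},   # COCO-SCE
}


def check_eval_args(task_name: str, subset: str, split: str) -> None:
    """Raise ValueError if the arguments contradict the documented options."""
    if task_name not in TASKS | TASK_GROUPS:
        raise ValueError(f"Unknown task_name: {task_name!r}")
    if task_name == "RefCocop":
        return  # <split> is unused for RefCocop; its subsets are not checked here
    if split not in VALID_SUBSETS:
        raise ValueError(f"Unknown split: {split!r}")
    if subset not in VALID_SUBSETS[split]:
        raise ValueError(f"subset {subset!r} is not available for {split!r}")


def eval_command(
    exp_name: str, task_name: str, subset: str, split: str,
    output_dir: str, data_dir: str,
) -> List[str]:
    """Validate the arguments and assemble the eval.sh command line."""
    check_eval_args(task_name, subset, split)
    return [
        "bash", "exp/gpv/scripts/eval.sh",
        exp_name, task_name, subset, split, output_dir, data_dir,
    ]


print(eval_command("gpv_coco_sce", "CocoVqa", "test", "gpv_split", "/abs/out", "/abs/data"))
```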

If you wish to evaluate captioning or VQA performance on the COCO test images, we provide scripts to generate predictions in the format expected by their respective official evaluation servers (Captioning eval server, VQA eval server). You may run these as follows:

bash exp/gpv/scripts/eval_<cap/vqa>_test.sh <exp_name> <subset> <output_dir> <data_dir>

<subset> may be test or testdev for VQA and val or test for Captioning.

Finetune GPV-1

GPV-1 can be finetuned on your data. To evaluate GPV-1's learning efficiency and extent of catastrophic forgetting, we provide scripts to finetune GPV on RefCocop. These scripts may also be used as an example of finetuning GPV on your own data.

To finetune pretrained GPV-1 on RefCocop, run the following:

bash exp/gpv/scripts/ft_gpv.sh <ckpt> <train_perc> <output_dir> <data_dir>
  • <ckpt>: absolute path of the GPV-1 checkpoint that you want to initialize the training with
  • <train_perc>: percentage of the full RefCocop training set to use for learning. Supported values are 1, 2, 5, 10, 25, 50, 75, and 100. These subsampled subsets can be found in <data_dir>/learning_phase_data/refcocop/

The evaluation script described in the previous section works for RefCocop evaluation as well.

A note on GPU memory requirements

  • The current hyperparameters are chosen for training GPV-1 with a batch size of 120 samples. This leads to significant GPU memory and compute requirements during training (e.g. 5-7 days of training on four 24GB GPUs).
  • While training is memory intensive, evaluation is easily run on a single GPU (you may further reduce batch size for evaluation using eval.batch_size flag in gpv.yaml file if working with low memory GPUs).
  • It may be possible to trade off GPU memory against training time by reducing the training batch size using the training.batch_size flag. However, this might require tuning the hyperparameters to achieve competitive performance.
  • Finally, if working with COCO-like data or when finetuning from a pretrained GPV-1 checkpoint, you might be able to get good performance with low GPU memory requirements by freezing the DETR backbone (training.freeze=True) and only training the remaining modules.
Comments
  • File path error when evaluating on RefCOCOp

    I got an OSError: No such file or directory when evaluating on RefCOCOp, which seems to be caused by an inconsistency between line 37 and line 164. I solved this problem by changing lines 37 and 38 to the following code:

        boxes_h5py = h5py.File(os.path.join(
            eval_dir,f'{cfg.eval.task}_{cfg.task_configs.data_split}_{cfg.eval.subset}_boxes.h5py'),'w')
    

    Command to reproduce the error:

    # Set up env and prepare data following guidance, here I set the `download_coco_test_images` to False
    
    # Evaluate
    bash exp/gpv/scripts/eval.sh gpv_coco_sce RefCocop val '' ${work_dir}/GPV/exp_output ${work_dir}/GPV
    
    
    opened by yiranyyu 0
Owner
AI2