Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Download links and PyTorch implementation of "Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision", ICCV 2021.

Xiaoshi Wu, Hadar Averbuch-Elor, Jin Sun, Noah Snavely. ICCV 2021.

Project Page | Paper


The WikiScenes Dataset

  1. Image and Textual Descriptions: WikiScenes contains 63K captioned images of 99 cathedrals. We provide two versions for download:

    • Low-res version used in our experiments (maximum width set to 200px, aspect ratio preserved): (1.9GB .zip file)
    • Higher-res version (maximum longer dimension set to 1200px, aspect ratio preserved): (19.4GB .zip file)

    Licenses for the images are provided here: (LicenseInfo.json file)

    Data Structure

    WikiScenes is organized recursively, following the tree structure in Wikimedia. Each semantic category (e.g. cathedral) contains the following recursive structure:

    ----0 (e.g., "milano cathedral duomo milan milano italy italia")
    --------0 (e.g., "Exterior of the Duomo (Milan)")
    ----------------0 (e.g., "Duomo (Milan) in art - exterior")
    ----------------1
    ----------------...
    ----------------K0-0
    ----------------category.json
    ----------------pictures (contains all pictures in current hierarchy level)
    --------1
    --------...
    --------K0
    --------category.json
    --------pictures (contains all pictures in current hierarchy level)
    ----1
    ----2
    ----...
    ----N
    ----category.json
    

    category.json is a dictionary of the following format:

    {
        "max_index": SUB-DIR-NUMBER
        "pairs" :    {
                        CATEGORY-NAME: SUB-DIR-NAME
                    }
        "pictures" : {
                        PICTURE-NAME: {
                                            "caption": CAPTION-DATA,
                                            "url": URL-DATA,
                                            "properties": PROPERTIES
                                    }
                    }
    }
    

    where:

    1. SUB-DIR-NUMBER is the total number of subcategories
    2. CATEGORY-NAME is the name of the category (e.g., "milano cathedral duomo milan milano italy italia")
    3. SUB-DIR-NAME is the name of the sub-folder (e.g., "0")
    4. PICTURE-NAME is the name of the jpg file located within the pictures folder
    5. CAPTION-DATA contains the caption, and URL-DATA contains the URL from which the image was scraped.
    6. PROPERTIES is a list of properties pre-computed for the image-caption pair (e.g., the estimated language of the caption).
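
    For illustration, here is a minimal Python sketch of one way to walk this structure and collect image-caption pairs. The dataset root path below is a placeholder, and the traversal is based only on the layout described above rather than on code from this repository:

    import json
    import os
    from itertools import islice

    def iter_wikiscenes(root):
        """Recursively walk a WikiScenes category folder, yielding
        (image_path, caption) pairs from each level's category.json."""
        cat_file = os.path.join(root, "category.json")
        if os.path.isfile(cat_file):
            with open(cat_file, "r", encoding="utf-8") as f:
                category = json.load(f)
            # Pictures at this level live in the sibling "pictures" folder.
            for name, info in category.get("pictures", {}).items():
                yield os.path.join(root, "pictures", name), info["caption"]
        # Recurse into the numbered sub-category folders ("0", "1", ...).
        for entry in sorted(os.listdir(root)):
            sub = os.path.join(root, entry)
            if entry.isdigit() and os.path.isdir(sub):
                yield from iter_wikiscenes(sub)

    # Example: print the first few image/caption pairs.
    for path, caption in islice(iter_wikiscenes("data/WikiScenes"), 5):
        print(path, "->", caption)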
  2. Keypoint correspondences: We also provide keypoint correspondences between pixels of images from the same landmark: (982MB .zip file)

    Data Structure

     {
         "image_id" : {
                         "kp_id": (x, y),
                     }
     }
    

    where:

    1. image_id is the id of the image.
    2. kp_id is the id of the keypoint; keypoint ids are unique across the whole dataset.
    3. (x, y) is the location of the keypoint in this image.
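
    Since keypoint ids are unique across the dataset, pixel correspondences between two images of the same landmark can be recovered by intersecting their keypoint ids. A minimal sketch (the file name is a placeholder for the unzipped correspondence file):

    import json

    with open("correspondences.json", "r", encoding="utf-8") as f:
        corr = json.load(f)  # {image_id: {kp_id: [x, y], ...}, ...}

    def matches(image_a, image_b):
        """Return ((xa, ya), (xb, yb)) pairs for keypoints seen in both images."""
        kps_a, kps_b = corr[image_a], corr[image_b]
        shared = kps_a.keys() & kps_b.keys()
        return [(tuple(kps_a[k]), tuple(kps_b[k])) for k in shared]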
  3. COLMAP reconstructions: We provide the full 3D models used for computing keypoint correspondences: (1GB .zip file)

    To view these models, download and install COLMAP. The reconstructions are organized by landmarks. Each landmark folder contains all the reconstructions associated with that landmark. Each reconstruction contains 3 files:

    1. points3d.txt that contains one line of data for each 3D point associated with the reconstruction. The format for each point is: POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX).
    2. images.txt that contains two lines of data for each image associated with the reconstruction. The format of the first line is: IMAGE_ID, QW, QX, QY, QZ, TX, TY, TZ, CAMERA_ID, NAME. The format of the second line is: POINTS2D[] as (X, Y, POINT3D_ID)
    3. cameras.txt that contains one line of data for each camera associated with the reconstruction according to the following format: CAMERA_ID, MODEL, WIDTH, HEIGHT, PARAMS[]

    Please refer to COLMAP's tutorial for further instructions on how to view these reconstructions.
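
    These files follow COLMAP's standard text export. As a rough illustration, the sketch below reads only the 3D points of one reconstruction; COLMAP's read_write_model.py script (in its scripts/python folder) is the more complete way to load full models:

    def read_points3d(path):
        """Parse points3d.txt into {point_id: (xyz, rgb)}."""
        points = {}
        with open(path, "r") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue  # skip header comments and blank lines
                elems = line.split()
                point_id = int(elems[0])
                xyz = tuple(map(float, elems[1:4]))
                rgb = tuple(map(int, elems[4:7]))
                # elems[7] is the reprojection error; the remaining entries
                # are the track as (IMAGE_ID, POINT2D_IDX) pairs.
                points[point_id] = (xyz, rgb)
        return points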

  4. Companion datasets for additional landmark categories: We provide download links for additional category types:

    Synagogues

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

    Mosques

    Images and captions (PENDING .zip file), correspondences (PENDING .zip file), reconstructions (PENDING .zip file)

Reproducing Results

  1. Minimum requirements. This project was originally developed with Python 3.6, PyTorch 1.0, and CUDA 9.0. Training requires at least one Titan X GPU (12GB memory).

  2. Setup your Python environment. Clone the repository and install the dependencies:

    conda create -n <environment_name> --file requirements.txt -c conda-forge/label/cf202003
    conda activate <environment_name>
    conda install scikit-learn=0.21
    pip install opencv-python
    
  3. Download the dataset. Download the data as detailed above, unzip it, and place it as follows: the images and textual descriptions in <project>/data/, and the correspondence file in <project>.

  4. Download pre-trained models. Download the initial weights (pre-trained on ImageNet) for the backbone model and place them in <project>/models/weights/ (a quick loading check is sketched after this list).

    Backbone Initial Weights Comments
    ResNet50 resnet50-19c8e357.pth PyTorch official model
  5. Train on the WikiScenes dataset. See the instructions below. Note that the first run always takes longer because of pre-processing; some computations are cached afterwards.
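
As a quick sanity check that the backbone weights from step 4 are in place, they can be loaded into torchvision's standard ResNet-50 (this is only a check; the training scripts load the weights themselves):

    import torch
    from torchvision.models import resnet50

    # resnet50-19c8e357.pth is the official torchvision ImageNet checkpoint,
    # so it loads directly into the stock architecture.
    model = resnet50()
    state_dict = torch.load("models/weights/resnet50-19c8e357.pth", map_location="cpu")
    model.load_state_dict(state_dict)
    print("Backbone weights loaded OK")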

Training, Inference and Evaluation

The directory launch contains template bash scripts for training, inference and evaluation.

Training. For each run, you need to specify two variables, EXP and RUN_ID. Running EXP=wiki RUN_ID=v01 ./launch/run_wikiscenes_resnet50.sh will create a directory ./logs/wikiscenes_corr/wiki/ with TensorBoard events, and save snapshots in ./snapshots/wikiscenes_corr/wiki/v01.

Inference.

If you want to run inference with our pre-trained model, create the following directory and place the model there:

    mkdir -p ./snapshots/wikiscenes_corr/final/ours

Download our validation set and unzip it:

    unzip val_seg.zip

Run sh ./launch/infer_val_wikiscenes.sh to predict masks. You can find the predicted masks in ./logs/masks.

If you want to evaluate your own models, you will also need to specify:

  • EXP and RUN_ID, the values you used for training;
  • OUTPUT_DIR, the path where the masks will be saved;
  • SNAPSHOT, the model suffix in the format e000Xs0.000.

Evaluation. To compute IoU of the masks, run sh ./launch/eval_seg.sh.
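
For reference, the metric is the intersection-over-union between predicted and ground-truth masks, averaged over classes. A minimal NumPy sketch of the computation (not the repository's evaluation script):

    import numpy as np

    def mean_iou(pred, gt, num_classes, ignore_index=255):
        """Mean IoU between two integer label maps of the same shape."""
        valid = gt != ignore_index
        ious = []
        for c in range(num_classes):
            p, g = (pred == c) & valid, (gt == c) & valid
            union = np.logical_or(p, g).sum()
            if union == 0:
                continue  # class absent from both masks
            ious.append(np.logical_and(p, g).sum() / union)
        return float(np.mean(ious)) if ious else 0.0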

Pre-trained model

For testing, we provide our pre-trained ResNet50 model:

Backbone Link
ResNet50 model_enc_e024Xs-0.800.pth (157M)

Datasheet

We provide a datasheet for our dataset here.

License

The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images—sometimes with restrictions, e.g. requiring attribution and with copyleft. We provide full license text and attribution for all images, make no modifications to any, and release these images under their original licenses. The associated captions are provided as a part of unstructured text in Wikimedia Commons, with rights to the original writers under the CC BY-SA 3.0 license. We modify these (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset under a CC BY-NC-SA 4.0 license.

Citation

@inproceedings{Wu2021Towers,
 title={Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision},
 author={Wu, Xiaoshi and Averbuch-Elor, Hadar and Sun, Jin and Snavely, Noah},
 booktitle={ICCV},
 year={2021}
}

Acknowledgement

Our code is based on the implementation of Single-Stage Semantic Segmentation from Image Labels.
