[ICCV2021] 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds

Overview

3DVG-Transformer

This repository is for the ICCV 2021 paper "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds".

Our method "3DVG-Transformer+" ranked 1st on the ScanRefer benchmark (2021/3 - 2021/11) and won the ScanRefer Challenge at the CVPR2021 1st Workshop on Language for 3D Scenes.

🌟 3DVG-Transformer+ achieves results comparable to those of papers published at CVPR2022. 🌟

image-Model

Introduction

Visual grounding on 3D point clouds is an emerging vision and language task that benefits various applications in understanding the 3D visual world. By formulating this task as a grounding-by-detection problem, many recent works focus on how to exploit more powerful detectors and more comprehensive language features, but two questions are not fully studied yet: (1) how to model complex relations for generating context-aware object proposals, and (2) how to leverage proposal relations to distinguish the true target object from similar proposals. Inspired by the well-known transformer architecture, we propose a relation-aware visual grounding method on 3D point clouds, named 3DVG-Transformer, to fully utilize the contextual clues for relation-enhanced proposal generation and cross-modal proposal disambiguation. These are enabled by a newly designed coordinate-guided contextual aggregation (CCA) module in the object proposal generation stage and a multiplex attention (MA) module in the cross-modal feature fusion stage. With the aid of two proposed feature augmentation strategies to alleviate overfitting, we validate that our 3DVG-Transformer outperforms the state-of-the-art methods by a large margin on two point cloud-based visual grounding datasets, ScanRefer and Nr3D/Sr3D from ReferIt3D, especially for complex scenarios containing multiple objects of the same category.

Dataset & Setup

Data preparation

This codebase is built on the original ScanRefer codebase. Please refer to ScanRefer for more details on data preprocessing.

  1. Download the ScanRefer dataset and unzip it under data/.
  2. Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
  3. Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (Please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.
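As a quick sanity check of the layout described above, the following minimal sketch lists the scene folders it finds. The helper name `list_scene_dirs` and the exact name pattern are assumptions inferred from the `scene0000_00` example; they are not part of this codebase.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from the example name scene0000_00.
SCENE_PATTERN = re.compile(r"^scene\d{4}_\d{2}$")


def list_scene_dirs(scans_root):
    """Return the sorted names of sub-folders that look like ScanNet scenes."""
    root = Path(scans_root)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and SCENE_PATTERN.match(p.name)
    )


if __name__ == "__main__":
    scenes = list_scene_dirs("data/scannet/scans")
    print(f"found {len(scenes)} scene folders")
```

An empty result usually means that scans/ was not placed (or linked) under data/scannet/.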

  4. Pre-process ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB of free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py

After this step, you can check if the processed scene data is valid by running:

python visualize.py --scene_id scene0000_00
  5. (Optional) Pre-process the multiview features from ENet.
python scripts/project_multiview_features.py --maxpool

Setup

The code is tested on Ubuntu 16.04 LTS & 18.04 LTS with PyTorch 1.2.0 and CUDA 10.0 installed.

For newer versions of PyTorch (>=1.3.0), please refer to the original ScanRefer codebase for the pointnet2 packages.

For older versions of PyTorch (<=1.2.0), you could use other PointNet++ implementations.

conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch

Install the necessary packages listed in requirements.txt:

pip install -r requirements.txt

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install

Before moving on to the next step, please don't forget to set CONF.PATH.BASE in lib/config.py to the project root path.
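The idea behind this setting can be sketched as follows. This is only an illustration: the real lib/config.py is not reproduced here, `SimpleNamespace` is a stand-in for whatever container the file uses, and every attribute other than `CONF.PATH.BASE` is a hypothetical name that mirrors the data/ and outputs/ folders mentioned in this README.

```python
import os
from types import SimpleNamespace

# Stand-in for the CONF object defined in lib/config.py (assumption).
CONF = SimpleNamespace(PATH=SimpleNamespace())

# Placeholder value: replace it with the absolute path of your checkout.
CONF.PATH.BASE = "/path/to/3DVG-Transformer"

# Hypothetical derived paths, all resolved relative to the project root.
CONF.PATH.DATA = os.path.join(CONF.PATH.BASE, "data")
CONF.PATH.SCANNET = os.path.join(CONF.PATH.DATA, "scannet")
CONF.PATH.OUTPUT = os.path.join(CONF.PATH.BASE, "outputs")

print(CONF.PATH.SCANNET)
```

Because the other paths are derived from the root, a wrong root typically shows up later as missing data files rather than as an error in the config itself.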

Usage

Training

To train the 3DVG-Transformer model with multiview features:

python scripts/ScanRefer_train.py --use_multiview --use_normal --batch_size 8 --epoch 200 --lr 0.002 --coslr --tag 3dvg-trans+

Settings:
  • XYZ: --use_normal
  • XYZ+RGB: --use_color --use_normal
  • XYZ+Multiview: --use_multiview --use_normal

For more training options (like using preprocessed multiview features), please run python scripts/ScanRefer_train.py -h.
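The mapping from input setting to flags can be captured in a small helper. This is a minimal sketch: `SETTING_FLAGS` and `build_train_command` are hypothetical names, and the default hyperparameters are simply copied from the training command above.

```python
# Input settings and their flags, as listed in this README.
SETTING_FLAGS = {
    "xyz": ["--use_normal"],
    "xyz+rgb": ["--use_color", "--use_normal"],
    "xyz+multiview": ["--use_multiview", "--use_normal"],
}


def build_train_command(setting, tag, batch_size=8, epoch=200, lr=0.002):
    """Assemble the training command for one input setting (hypothetical helper)."""
    flags = SETTING_FLAGS[setting.lower()]
    return (
        ["python", "scripts/ScanRefer_train.py"]
        + flags
        + ["--batch_size", str(batch_size), "--epoch", str(epoch),
           "--lr", str(lr), "--coslr", "--tag", tag]
    )


print(" ".join(build_train_command("XYZ+Multiview", "3dvg-trans+")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with tags such as 3dvg-trans+.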

Evaluation

To evaluate the trained models, please find the folder under outputs/ and run:

python scripts/ScanRefer_eval.py --folder <folder_name> --reference --use_multiview --no_nms --force --repeat 5 --lang_num_max 1

Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json.
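To avoid mismatched flags, the feature flags can be read back from info.json before evaluation. The sketch below assumes that info.json stores each training argument as a top-level key with a boolean value (for example "use_multiview": true); that layout is an assumption, so please check your own info.json first. `feature_flags_from_info` is a hypothetical helper.

```python
import json
from pathlib import Path

# Feature flags that appear in the commands of this README.
FEATURE_FLAGS = ("use_color", "use_normal", "use_multiview")


def feature_flags_from_info(info_path):
    """Return the feature flags that info.json records as true.

    Assumption: each training argument is a top-level key with a boolean value.
    """
    info = json.loads(Path(info_path).read_text())
    return [f"--{name}" for name in FEATURE_FLAGS if info.get(name) is True]
```

For example, `feature_flags_from_info("outputs/<folder_name>/info.json")` would return the flags to append to the evaluation command.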

Note that the results generated by ScanRefer_eval.py may be slightly lower than the test results reported during training. The main reason is that the test results fluctuate because we do not use a fixed test seed, and the maximum value is the one reported during training.
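The gap between the two numbers follows from comparing a maximum with a typical run. A minimal sketch of summarizing repeated evaluations (for example the runs produced with --repeat 5) is given below; `summarize_scores` is a hypothetical helper and is not part of ScanRefer_eval.py.

```python
from statistics import mean, stdev


def summarize_scores(scores):
    """Summarize accuracies from repeated evaluations of the same checkpoint."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    return {
        "mean": mean(scores),
        # The sample standard deviation needs at least two runs.
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "max": max(scores),
    }
```

Whenever the runs differ, "max" is strictly larger than "mean", which is why a maximum tracked during training tends to exceed a later evaluation.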

Benchmark Challenge

Note that each user may submit the test set results of each method only twice, and the ScanRefer benchmark blocks updates to a method's test set results for two weeks after a test set submission.

After finishing training the model, please download the benchmark data and put the unzipped ScanRefer_filtered_test.json under data/. Then, you can run the following script to generate predictions:

python benchmark/predict.py --folder <folder_name> --use_color

Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json. The generated predictions are stored in outputs/<folder_name>/pred.json. For submitting the predictions, please compress the pred.json as a .zip or .7z file and follow the instructions to upload your results.
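The compression step can also be done from Python with the standard library. In this minimal sketch, `zip_predictions` is a hypothetical helper, and placing pred.json at the top level of the archive is an assumption; please follow the benchmark instructions if they require a different layout.

```python
import zipfile
from pathlib import Path


def zip_predictions(folder):
    """Compress <folder>/pred.json into <folder>/pred.zip and return the archive path."""
    folder = Path(folder)
    pred = folder / "pred.json"
    if not pred.is_file():
        raise FileNotFoundError(f"{pred} not found; run benchmark/predict.py first")
    archive = folder / "pred.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store the file without its parent folders (assumed layout).
        zf.write(pred, arcname="pred.json")
    return archive
```

Calling `zip_predictions("outputs/<folder_name>")` writes pred.zip next to pred.json.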

Visualization

image-Visualization

To visualize the localization results predicted by the trained model in a specific scene, please find the corresponding timestamped folder under outputs/ and run:

python scripts/visualize.py --folder <folder_name> --scene_id <scene_id> --use_color

Note that the flags must match the ones set before training. The training information is stored in outputs/<folder_name>/info.json. The output .ply files will be stored under outputs/<folder_name>/vis/<scene_id>/.

In our next version, the heatmap visualization code will be open-sourced in the 3DJCG (CVPR2022, Oral) codebase.

The generated .ply or .obj files could be visualized in software such as MeshLab.

Results

image-Results

Settings:
  • 3D Only (XYZ+RGB): --use_color --use_normal
  • 2D+3D (XYZ+Multiview): --use_multiview --use_normal

Validation Set

| Methods | Publication | Modality | Unique Acc@0.25 | Unique Acc@0.5 | Multiple Acc@0.25 | Multiple Acc@0.5 | Overall Acc@0.25 | Overall Acc@0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCRC | CVPR16 | 2D | 24.03 | 9.22 | 17.77 | 5.97 | 18.70 | 6.45 |
| One-Stage | ICCV19 | 2D | 29.32 | 22.82 | 18.72 | 6.49 | 20.38 | 9.04 |
| ScanRefer | ECCV2020 | 3D | 67.64 | 46.19 | 32.06 | 21.26 | 38.97 | 26.10 |
| TGNN | AAAI2021 | 3D | 68.61 | 56.80 | 29.84 | 23.18 | 37.37 | 29.70 |
| InstanceRefer | ICCV2021 | 3D | 77.45 | 66.83 | 31.27 | 24.77 | 40.23 | 32.93 |
| SAT | ICCV2021 | 3D | 73.21 | 50.83 | 37.64 | 25.16 | 44.54 | 30.14 |
| 3DVG-Transformer (Ours) | ICCV2021 | 3D | 77.16 | 58.47 | 38.38 | 28.70 | 45.90 | 34.47 |
| BEAUTY-DETR | - | 3D | - | - | - | - | 46.40 | - |
| 3DJCG | CVPR2022 | 3D | 78.75 | 61.30 | 40.13 | 30.08 | 47.62 | 36.14 |
| 3D-SPS | CVPR2022 | 3D | 81.63 | 64.77 | 39.48 | 29.61 | 47.65 | 36.43 |
| ScanRefer | ECCV2020 | 2D + 3D | 76.33 | 53.51 | 32.73 | 21.11 | 41.19 | 27.40 |
| TGNN | AAAI2021 | 2D + 3D | 68.61 | 56.80 | 29.84 | 23.18 | 37.37 | 29.70 |
| InstanceRefer | ICCV2021 | 2D + 3D | 75.72 | 64.66 | 29.41 | 22.99 | 38.40 | 31.08 |
| 3DVG-Transformer (Ours) | ICCV2021 | 2D + 3D | 81.93 | 60.64 | 39.30 | 28.42 | 47.57 | 34.67 |
| 3DVG-Transformer+ (Ours, this codebase) | - | 2D + 3D | 83.25 | 61.95 | 41.20 | 30.29 | 49.36 | 36.43 |
| MVT-3DVG | CVPR2022 | 2D + 3D | 77.67 | 66.45 | 31.92 | 25.26 | 40.80 | 33.26 |
| 3DJCG | CVPR2022 | 2D + 3D | 83.47 | 64.34 | 41.39 | 30.82 | 49.56 | 37.33 |
| 3D-SPS | CVPR2022 | 2D + 3D | 84.12 | 66.72 | 40.32 | 29.82 | 48.82 | 36.98 |

Online Benchmark

| Methods | Modality | Unique Acc@0.25 | Unique Acc@0.5 | Multiple Acc@0.25 | Multiple Acc@0.5 | Overall Acc@0.25 | Overall Acc@0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ScanRefer | 2D + 3D | 68.59 | 43.53 | 34.88 | 20.97 | 42.44 | 26.03 |
| TGNN | 2D + 3D | 68.34 | 58.94 | 33.12 | 25.26 | 41.02 | 32.81 |
| InstanceRefer | 2D + 3D | 77.82 | 66.69 | 34.57 | 26.88 | 44.27 | 35.80 |
| 3DVG-Transformer (Ours) | 2D + 3D | 75.76 | 55.15 | 42.24 | 29.33 | 49.76 | 35.12 |
| 3DVG-Transformer+ (Ours) | 2D + 3D | 77.33 | 57.87 | 43.70 | 31.02 | 51.24 | 37.04 |
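
The accuracy columns above are thresholded on 3D IoU: a prediction counts as correct when the IoU between the predicted and the ground-truth box reaches 0.25 or 0.5. The sketch below illustrates the metric for axis-aligned boxes given as min/max corners. It is an illustration under that assumed box format, not the evaluation code of this repository; `box_iou_3d` and `acc_at_iou` are hypothetical names.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def acc_at_iou(pred_boxes, gt_boxes, threshold):
    """Percentage of predictions whose IoU with the paired ground truth reaches the threshold."""
    if len(pred_boxes) != len(gt_boxes) or not gt_boxes:
        raise ValueError("need equally sized, non-empty box lists")
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```

For instance, a unit cube shifted by half its width along one axis overlaps the original with IoU 1/3, so it counts as a hit at 0.25 but not at 0.5.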

Changelog

2022/04: Update Readme.md.

2022/04: Release the code of 3DVG-Transformer.

2021/07: 3DVG-Transformer is accepted at ICCV 2021.

2021/06: 3DVG-Transformer+ won the ScanRefer Challenge in the CVPR2021 1st Workshop on Language for 3D Scenes.

2021/04: 3DVG-Transformer+ achieves 1st place in ScanRefer Leaderboard.

Citation

If you use this code in your work, please cite 3DVG-Transformer and the original ScanRefer paper:

@inproceedings{zhao2021_3DVG_Transformer,
    title={{3DVG-Transformer}: Relation modeling for visual grounding on point clouds},
    author={Zhao, Lichen and Cai, Daigang and Sheng, Lu and Xu, Dong},
    booktitle={ICCV},
    pages={2928--2937},
    year={2021}
}

@article{chen2020scanrefer,
    title={{ScanRefer}: 3D Object Localization in RGB-D Scans using Natural Language},
    author={Chen, Dave Zhenyu and Chang, Angel X and Nie{\ss}ner, Matthias},
    pages={202--221},
    journal={ECCV},
    year={2020}
}

Acknowledgement

We would like to thank facebookresearch/votenet for the 3D object detection codebase and erikwijmans/Pointnet2_PyTorch for the CUDA accelerated PointNet++ implementation.

For further acceleration, you could use KD-Tree to accelerate the PointNet++ process.

License

This repository is released under the MIT License (see the LICENSE file for details).
