[CVPR 2021] "Multimodal Motion Prediction with Stacked Transformers": official code implementation and project page.

Overview

mmTransformer

Introduction

  • This repo is the official implementation of mmTransformer in PyTorch. Currently, the core code of mmTransformer lives in a commercial project, so we provide the inference code of the model with six trajectory proposals for your reference.

  • For more information, please refer to our paper Multimodal Motion Prediction with Stacked Transformers (CVPR 2021). [Paper] [Webpage]

img

Set up your virtual environment

  • Initialize virtual environment:

    conda create -n mmTrans python=3.7
    
  • Install the Argoverse API. Please refer to this page.

  • Install PyTorch. The latest code is tested on Ubuntu 16.04, CUDA 11.1, PyTorch 1.8, and Python 3.7. (Note that torch >= 1.5.0 is required for testing with the pretrained model; see the sanity check at the end of this list.)

    pip install torch==1.8.0+cu111\
          torchvision==0.9.0+cu111\
          torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
    
  • For the other requirements, please install them with the following command:

    pip install -r requirement.txt
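
  • Since the pretrained model requires torch >= 1.5.0 and a CUDA build, you can sanity-check the installation with the short snippet below (a convenience check only, not part of the repo):

    import torch
    import torchvision

    # demo.pt was tested with PyTorch 1.8 / CUDA 11.1; any torch >= 1.5.0 should load it.
    print("torch:", torch.__version__)
    print("torchvision:", torchvision.__version__)
    print("CUDA available:", torch.cuda.is_available())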
    

Preparation

Download the code, model and data

  1. Clone this repo from GitHub.

     git clone https://github.com/decisionforce/mmTransformer.git
    
  2. Download the pretrained model and data [here] (map.pkl for Python 3.7 is available [here]) and save them to ./models and ./interm_data respectively.

     cd mmTransformer
     mkdir models
     mkdir interm_data
    
  3. Finally, your directory structure should look something like this:

     mmTransformer
     ├── models
     │   └── demo.pt
     └── interm_data
         ├── argoverse_info_val.pkl
         └── map.pkl
    

Preprocess the dataset

Alternatively, you can process the data from scratch using the following commands.

  1. Download the Argoverse dataset and create a symbolic link to the ./data folder, or use the following commands.

     cd path/to/mmtransformer/root
     mkdir data
     cd data
     wget https://s3.amazonaws.com/argoai-argoverse/forecasting_val_v1.1.tar.gz 
     tar -zxvf  forecasting_val_v1.1.tar.gz
    
  2. Then extract the agent and map information from raw data via Argoverse API:

     python -m lib.dataset.argoverse_convertor ./config/demo.py
    
  3. Finally, your directory structure should look like the one illustrated above.

Format of processed data in ‘argoverse_info_val.pkl’:

img

Format of map information in ‘map.pkl’:

img
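
To check that the two pickles were downloaded or generated correctly, you can print their top-level structure with a small script like the one below (the exact keys depend on ./config/demo.py, so treat this purely as an inspection sketch; it assumes the files sit in ./interm_data):

    import pickle

    # Print the top-level structure of the preprocessed data and the map cache.
    for path in ('./interm_data/argoverse_info_val.pkl', './interm_data/map.pkl'):
        with open(path, 'rb') as f:
            obj = pickle.load(f)
        if isinstance(obj, dict):
            print(path, 'dict with keys:', list(obj.keys())[:5])
        elif isinstance(obj, (list, tuple)):
            print(path, type(obj).__name__, 'of length', len(obj))
        else:
            print(path, type(obj).__name__)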

Run the mmTransformer

For testing:

python Evaluation.py ./config/demo.py --model-name demo

Results

Here we showcase the expected results on the validation set:

| Metric   | Expected results | Results in paper |
|----------|------------------|------------------|
| minADE   | 0.709            | 0.713            |
| minFDE   | 1.081            | 1.153            |
| MR (K=6) | 10.2             | 10.6             |
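
For reference, these are the standard Argoverse K=6 metrics. The sketch below shows how they are defined for a single scene in plain NumPy (this is not the repo's evaluation code; the 2.0 m miss threshold is the usual Argoverse choice):

    import numpy as np

    def min_ade_fde_mr(preds, gt, miss_threshold=2.0):
        """minADE / minFDE / miss indicator for one scene.

        preds: (K, T, 2) predicted trajectories; gt: (T, 2) ground truth.
        """
        dist = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T) per-step errors
        min_ade = dist.mean(axis=1).min()                 # best average error over the K proposals
        min_fde = dist[:, -1].min()                       # best final-step error
        return min_ade, min_fde, float(min_fde > miss_threshold)

    # Toy shapes only: K=6 proposals, 30 future steps, 2D coordinates.
    rng = np.random.default_rng(0)
    print(min_ade_fde_mr(rng.normal(size=(6, 30, 2)), rng.normal(size=(30, 2))))

At the dataset level these per-scene values are averaged, and the averaged miss indicator becomes the miss rate (MR).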

TODO

  • We are going to open source our visualization tools and a demo result. (TBD)

Contact us

If you have any issues with the code, please contact us at this email: [email protected]

Citation

If you find our work useful for your research, please consider citing the paper:

@article{liu2021multimodal,
  title={Multimodal Motion Prediction with Stacked Transformers},
  author={Liu, Yicheng and Zhang, Jinghuai and Fang, Liangji and Jiang, Qinhong and Zhou, Bolei},
  journal={Computer Vision and Pattern Recognition},
  year={2021}
}
Comments
  • Questions about decoder input and positional encoding

    Hi,

    1. On page 4, it is said that 'The decoder inputs are the trajectory proposals, which are initialized by a set of learnable positional encoding'. And on page 9, it is said that 'The decoder receives proposals (randomly initialized), positional encoding of proposals, as well as encoder memory...'. So, what is the input of the first decoder layer? Is it the randomly initialized proposals plus the learnable positional encoding? And what is the initialization distribution?
    2. On page 9, it is said that 'In encoder, spatial positional encoding are added to the queries and keys at each MHSA layer'. Is the positional encoding in the encoder fixed or learnable? Is this positional encoding used in the motion extractor, map aggregator, and social constructor, or only in one of them? Thank you.
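
    For readers with the same question: one common way to realize "proposals initialized by a set of learnable positional encodings" is DETR-style learned query embeddings fed to a transformer decoder. The sketch below only illustrates that generic idea; it is not the authors' implementation, and the sizes in it are assumptions.

        import torch
        import torch.nn as nn

        K, D = 6, 128  # number of trajectory proposals, hidden size (assumed)

        # Learnable proposal queries, analogous to DETR object queries.
        # nn.Embedding initializes its weights from a standard normal distribution.
        proposal_queries = nn.Embedding(K, D)
        decoder_layer = nn.TransformerDecoderLayer(d_model=D, nhead=8)
        decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

        memory = torch.randn(50, 1, D)              # fake encoder memory: (S, batch, D)
        tgt = proposal_queries.weight.unsqueeze(1)  # (K, 1, D) decoder input
        out = decoder(tgt, memory)                  # (K, 1, D) updated proposals
        print(out.shape)
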
    opened by panda2020-sky 7
  • details about the embedding dimension

    Could you provide the embedding dimension of each step in the motion aggregator and map extractor (with VectorNet)? I haven't found them or a corresponding reference in the Implementation Details in the Appendix. Are they the same as the hidden state (128)?

    opened by Yisten 7
  • Some questions about dataloading and model

    Hi, congrats on the nice work and thank you for the quick replies on other issues and sharing the data preprocessing repo, which really helps me a lot. I have some questions about the VectorNet and training.

    (a) I wonder if you use the subgraph implemented in this repo? I am new to GNNs and torch_geometric; I wonder if I can just implement the model with both torch and torch_geometric?

    (b) How many epochs do you train for a single experiment? My implementation cannot get good results (2.0+ minADE and 5.0+ minFDE), and I found it takes many more epochs for my model to overfit a small subset of the data compared to other non-transformer-based models. I wonder if my implementation has some bugs or my training process is wrong.

    opened by L4zyy 6
  • Some questions about visualization

    Hi, many thanks for the quick replies on other issues and sharing the data preprocessing repo. I have some questions about the visualization part.

    1. The demo video looks really impressive, but I think for the Argoverse forecasting data the total length is only 5 seconds, while in the demo video each scene looks like it lasts about 30 seconds. So I'm wondering whether you are using the Argoverse forecasting data for visualization?

    2. For the Argoverse data I think it is relatively easy to get and visualize the map with the help of the API, but the forecasting data itself does not provide information such as bounding box size, orientation, etc. So how do you get that information for visualization?

    3. I saw that the core code is implemented in a commercial project, so it may be difficult to release to the public, but I'm wondering if it is possible to release the code for visualization?

    Many thanks!

    opened by lyk1993 4
  • Some questions about the paper

    Hello,

    Congrats on the nice work! It is not clear to me what happens to the other agents. It seems that you treat all agents similarly with the same network. (a) What happens in the motion extractor? Do you feed all histories and then update proposals for all vehicles? (b) Is the scene normalized for each agent, or do you keep it normalized for the target agent?

    opened by MohammadHossein-Bahari 4
  • Several questions about the implementation and paper

    Thanks for your great work and the inference code! Here are several questions about this work; it would be very helpful if you could give me some hints.

    • In the paper, you mentioned that "parallel trajectory proposals can integrate the information from the encoder independently, allowing each single proposal to carry disentangled modality information" (page 3). How should the term "disentangled" be understood? Does this mean that proposals will focus on different modalities automatically? I tried to visualize the distribution of endpoints generated by different proposals, just like in Fig. 5, and the result is shown below. The problems are:
      1. The endpoint distribution is not spatially disentangled, which is different from Fig. 5. Here, endpoints from different proposals heavily overlap. Can I assert that the proposed RTS makes the prediction spatially disentangled? If so, how should "each single proposal to carry disentangled modality information" be understood?
      2. It seems that only a few proposals are used in most cases -- {0,1,2,3} are used while {4,5} are always low-confidence. Is this unbalanced phenomenon also caused by the vanilla training strategy?

    Filtering the points with confidences lower than the uniform probability (1/K): endpt_f. Without filtering: endpt_wof.

    • In the paper, you mentioned that "we only utilize the decoder of social constructor to update the proposals for target vehicles, instead of all vehicles, in pursuit of higher efficiency" (page 4). However, it seems that the decoder of the social layer (social_dec) is not used in the released code, and social_mem is simply unsqueezed and concatenated with social_out. Is this change intentional? If so, why?

    • It seems that the ablative results on the order of the transformers are missing. Tab. 2 shows the effectiveness of each module but does not show how the order of the modules influences the prediction results.

    Thanks. Always happy to hear from you!

    opened by MasterIzumi 3
  • The 'map.pkl' file was saved under Python 3.8

    As the title says: when executing 'val_dataset = ArgoverseDataset(validation_cfg)', I get 'ValueError: unsupported pickle protocol: 5'. This looks like a Python version mismatch from when the file was pickled, yet the code instructions require the development environment to stay at Python 3.7.
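
    One possible workaround on Python 3.7 (untested here, just a suggestion) is the pickle5 backport, which can read protocol-5 files:

        # pip install pickle5   (backport of the protocol-5 pickle module for Python 3.7)
        import pickle5 as pickle

        with open('./interm_data/map.pkl', 'rb') as f:
            map_info = pickle.load(f)
        print(type(map_info))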

    opened by Gengmaosi 2
  • Is the final result on the leaderboard trained on both the train and val datasets?

    Hi! I have always wondered whether the models with competitive results use the train dataset only. When I tried to submit results trained on both the train and validation datasets, there was a drop in performance. :(

    opened by shouldnotfail 1
  • RuntimeError: CUDA error:

    When I tried to run this code with the command below, I got this error.

    • command
    python Evaluation.py ./config/demo.py --model-name demo
    
    • error
    gpu number:1
    model loaded from ./models/demo.pt
    Successfully Loaded model: ./models/demo.pt
    Finished Initialization in 15.365s!!!
      0%|                                                                                                                                                                                                                                                            | 0/1234 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "Evaluation.py", line 77, in <module>
        out = model(data)
      File "/home/usaywook/anaconda3/envs/mmTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/mmTransformer.py", line 150, in forward
        social_mask, lane_enc, lane_mask)
      File "/home/usaywook/anaconda3/envs/mmTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_version/stacked_transformer.py", line 128, in forward
        lane_mem = self.lane_enc(self.lane_emb(lane_enc), lane_mask) # (batch size, max_lane_num, 128)
      File "/home/usaywook/anaconda3/envs/mmTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_utils.py", line 49, in forward
        x = layer(x, x_mask)
      File "/home/usaywook/anaconda3/envs/mmTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_utils.py", line 69, in forward
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
      File "/home/usaywook/anaconda3/envs/mmTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_utils.py", line 208, in forward
        return x + self.dropout(sublayer(self.norm(x)))
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_utils.py", line 69, in <lambda>
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
      File "/home/usaywook/anaconda3/envs/mmTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_utils.py", line 170, in forward
        query, key, value, mask=mask, dropout=self.dropout)
      File "/media/usaywook/Samsung_T5/tmp/mmTransformer/lib/models/TF_utils.py", line 227, in attention
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
    

    If I remove .cuda in line 61 and line 75 of this code, the error goes away. However, then I cannot use the GPU to run this code.

    Moreover, in this repository I cannot find the loss function that handles the multimodal trajectories. Could you share the code for the loss function used in the original paper?

    opened by Usaywook 0
  • The ckpt has a social decoder, but the code does not.

    Thanks for your code. I ran into several questions while reproducing this paper, and I hope you can help me resolve them.

    • The code does not have a social decoder, but the ckpt you provided does. I'm interested in how you trained this ckpt.
    • What classification loss do you choose for the six trajectory proposals, cross-entropy or KL, and what is the target?
    • What is the loss weight between the reg loss and the cls loss? Thanks for your project again, and I really hope to get your reply.
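
    For context, a common baseline for scoring K trajectory proposals (purely illustrative, and not necessarily what the paper uses) is a winner-takes-all scheme: the classification target is the proposal closest to the ground-truth endpoint, scored with cross-entropy, and only that proposal is regressed:

        import torch
        import torch.nn.functional as F

        K, T = 6, 30
        pred_traj = torch.randn(K, T, 2)   # K predicted trajectories
        conf_logits = torch.randn(K)       # confidence logits for the K proposals
        gt_traj = torch.randn(T, 2)        # ground-truth future trajectory

        # Winner-takes-all target: the proposal with the smallest final displacement error.
        fde = (pred_traj[:, -1] - gt_traj[-1]).norm(dim=-1)
        target = fde.argmin()

        cls_loss = F.cross_entropy(conf_logits.unsqueeze(0), target.unsqueeze(0))
        reg_loss = F.smooth_l1_loss(pred_traj[target], gt_traj)  # regress only the winner
        loss = reg_loss + cls_loss  # the relative weight is a hyperparameter
        print(loss.item())
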
    opened by xushilin1 1
  • demo error: No such file or directory

    When I tried to run the demo with python -m lib.dataset.argoverse_convertor ./config/demo.py, I got this error: FileNotFoundError: [Errno 2] No such file or directory: '/home/mmTransformer/argoverseapi/map_files/pruned_argoverse_PIT_10314_vector_map.xml'

    I have no idea what is going wrong. Could anyone give me a hand? Thanks!

    opened by fgqile 1
  • Are all agents involved in calculating the loss? How long does one epoch take in training?

    I have three questions:
    1. I guess you only use the target agent, without the other agents, when calculating the loss and back-propagating, because you only generate one theta value per sample scene; if not, please give me more details. If you do use all agents in the loss, you regard every agent as a target, and then your data preprocessing code needs to be modified for training.
    2. How long does one epoch take in training, how many GPUs did you use in the experiments, and which type of GPU?
    3. How much improvement did the data augmentation give?

    opened by fengsky401 0
  • K-means normalization method in line 435

    Another question: in part D of the Appendix, what exactly is the normalization described in line 435 of your paper? I cannot figure out which part of the paper line 435 refers to, so....

    opened by YouSonicAI 0
Owner
DeciForce: Crossroads of Machine Perception and Autonomy
Research on Unifying Machine Perception and Autonomy in Zhou Group