YouRefIt: Embodied Reference Understanding with Language and Gesture

Overview

by Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao, Yixin Zhu, Song-Chun Zhu and Siyuan Huang

The IEEE International Conference on Computer Vision (ICCV), 2021

Introduction

We study the machine's understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. To tackle this problem, we introduce YouRefIt, a new crowd-sourced, real-world dataset of embodied reference.

For more details, please refer to our paper.

Checklist

  • Image ERU
  • Video ERU

Installation

The code was tested with the following environment: Ubuntu 18.04/20.04, Python 3.7/3.8, PyTorch 1.9.1. To install, run

    git clone https://github.com/yixchen/YouRefIt_ERU
    pip install -r requirements.txt
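
To confirm your environment matches before training, a quick version check can help; env_check.py below is a hypothetical helper and not part of this repo:

    # env_check.py -- hypothetical helper: confirm the interpreter and PyTorch
    # versions match the tested environment.
    import sys
    import torch

    print(f"python  : {sys.version.split()[0]}")   # tested with 3.7 / 3.8
    print(f"pytorch : {torch.__version__}")        # tested with 1.9.1
    print(f"cuda ok : {torch.cuda.is_available()}")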

Dataset

Download the YouRefIt dataset from the Dataset Request Page and place it under ./ln_data.

Model weights

  • Yolov3: download the pretrained model and place the file in ./saved_models by running
    sh saved_models/yolov3_weights.sh
    
  • More pretrained models are available on Google Drive and should also be placed in ./saved_models.

Make sure to put the files in the following structure:

|-- ROOT
|   |-- ln_data
|   |   |-- yourefit
|   |   |   |-- images
|   |   |   |-- paf
|   |   |   |-- saliency
|   |-- saved_models
|   |   |-- final_model_full.tar
|   |   |-- final_resc.tar
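
To catch path mistakes early, the expected layout can be verified from ROOT; check_layout.py below is a hypothetical helper and not part of this repo:

    # check_layout.py -- hypothetical helper: verify the dataset and weights
    # sit where the training/evaluation scripts expect them. Run from ROOT.
    from pathlib import Path

    EXPECTED = [
        "ln_data/yourefit/images",
        "ln_data/yourefit/paf",
        "ln_data/yourefit/saliency",
        "saved_models/final_model_full.tar",
        "saved_models/final_resc.tar",
    ]

    for rel in EXPECTED:
        print(f"{'ok' if Path(rel).exists() else 'MISSING':7s} {rel}")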

Training

To train the model, run the following under the main folder:

    python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id

Evaluation

To evaluate the model, run the following under the main folder. Use the --test flag to enter test mode:

    python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id \
        --resume saved_models/model.pth.tar \
        --test

Evaluate Image ERU on our released model

To evaluate our full model with PAF and saliency features, run

    python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id \
        --resume saved_models/final_model_full.tar --use_paf --use_sal --large --test

To evaluate the baseline model that takes only images as input, run

    python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id \
        --resume saved_models/final_resc.tar --large --test

To evaluate the inference results on the test set at different IoU thresholds, change the result path accordingly and run

    python evaluate_results.py
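
For intuition, accuracy at an IoU threshold counts a prediction as correct when its box overlaps the ground-truth box by at least that ratio. The sketch below illustrates the computation; it is an illustrative stand-in, not the repo's evaluate_results.py:

    # iou_sketch.py -- minimal sketch of box IoU and thresholded accuracy, for
    # intuition only; evaluate_results.py in this repo is the actual reference.
    def box_iou(a, b):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def accuracy_at(preds, gts, thresh):
        """Fraction of predicted boxes whose IoU with ground truth >= thresh."""
        hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
        return hits / len(gts)

    if __name__ == "__main__":
        # toy example at common detection-style thresholds
        pred, gt = [(10, 10, 50, 50)], [(12, 8, 52, 48)]
        for t in (0.25, 0.50, 0.75):
            print(f"acc@{t}: {accuracy_at(pred, gt, t):.2f}")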

Citation

@inproceedings{chen2021yourefit,
  title={YouRefIt: Embodied Reference Understanding with Language and Gesture},
  author={Chen, Yixin and Li, Qing and Kong, Deqian and Kei, Yik Lun and Zhu, Song-Chun and Gao, Tao and Zhu, Yixin and Huang, Siyuan},
  booktitle={The IEEE International Conference on Computer Vision (ICCV)},
  year={2021}
}

Acknowledgement

Our code is built on ReSC and we thank the authors for their hard work.
