img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation

Overview

License: CC BY-NC 4.0

Figure 1: We estimate the 6DoF rigid transformation of a 3D face (rendered in silver), aligning it with even the tiniest faces, without face detection or facial landmark localization. Our estimated 3D face locations are rendered by descending distances from the camera, for coherent visualization.

Paper details

Vítor Albiero, Xingyu Chen, Xi Yin, Guan Pang, Tal Hassner, "img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation," arXiv:2012.07791, Dec. 2020

Abstract

We propose real-time, six degrees of freedom (6DoF), 3D face pose estimation without face detection or landmark localization. We observe that estimating the 6DoF rigid transformation of a face is a simpler problem than facial landmark detection, often used for 3D face alignment. In addition, 6DoF offers more information than face bounding box labels. We leverage these observations to make multiple contributions: (a) We describe an easily trained, efficient, Faster R-CNN-based model which regresses 6DoF pose for all faces in the photo, without preliminary face detection. (b) We explain how pose is converted and kept consistent between the input photo and arbitrary crops created while training and evaluating our model. (c) Finally, we show how face poses can replace detection bounding box training labels. Tests on AFLW2000-3D and BIWI show that our method runs in real time and outperforms state of the art (SotA) face pose estimators. Remarkably, our method also surpasses SotA models of comparable complexity on the WIDER FACE detection benchmark, despite not being optimized on bounding box labels.

Citation

If you use any part of our code or data, please cite our paper.

@article{albiero2020img2pose,
  title={img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation},
  author={Albiero, Vítor and Chen, Xingyu and Yin, Xi and Pang, Guan and Hassner, Tal},
  journal={arXiv preprint arXiv:2012.07791},
  year={2020}
}

Installation

Install dependencies with Python 3.

pip install -r requirements.txt

Install the renderer, which is used to visualize predictions. The renderer implementation is forked from here.

cd Sim3DR
sh build_sim3dr.sh

Training

Prepare WIDER FACE dataset

First, download our annotations as instructed in Annotations.

Download WIDER FACE dataset and extract to datasets/WIDER_Face.

Then, to create the train and validation files (LMDB), run the following scripts.

python3 convert_json_list_to_lmdb.py
--json_list ./annotations/WIDER_train_annotations.txt
--dataset_path ./datasets/WIDER_Face/WIDER_train/images/
--dest ./datasets/lmdb/
--train

This first script will generate an LMDB dataset containing the training images along with their annotations. It will also output pose mean and standard deviation files, which are used for training and testing.

python3 convert_json_list_to_lmdb.py 
--json_list ./annotations/WIDER_val_annotations.txt 
--dataset_path ./datasets/WIDER_Face/WIDER_val/images/ 
--dest ./datasets/lmdb

This second script will create an LMDB containing the validation images along with their annotations.
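The pose mean and standard deviation saved by the first script are used to normalize the 6DoF pose targets so that every component is regressed on a comparable scale. Below is a minimal, hedged sketch of how such statistics are typically applied; it only assumes the file names produced by the conversion command above and is not the exact training code.

import numpy as np

# Load the statistics written by convert_json_list_to_lmdb.py (paths as above).
pose_mean = np.load("./datasets/lmdb/WIDER_train_annotations_pose_mean.npy")
pose_stddev = np.load("./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy")

# Example 6DoF pose: three rotation components followed by three translation components.
pose = np.array([0.013, 0.002, -0.001, 0.167, -7.140, 53.448])

normalized = (pose - pose_mean) / pose_stddev     # scale used as the regression target
recovered = normalized * pose_stddev + pose_mean  # inverse mapping after prediction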

Train

Once the LMDB train/val files are created, to start training simply run the script below.

CUDA_VISIBLE_DEVICES=0 python3 train.py
--pose_mean ./datasets/lmdb/WIDER_train_annotations_pose_mean.npy
--pose_stddev ./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy
--workspace ./workspace/
--train_source ./datasets/lmdb/WIDER_train_annotations.lmdb
--val_source ./datasets/lmdb/WIDER_val_annotations.lmdb
--prefix trial_1
--batch_size 2
--lr_plateau
--early_stop
--random_flip
--random_crop
--max_size 1400

For now, only single-GPU training has been tested. Distributed training is only partially implemented; PRs are welcome.

Testing

To evaluate with the pretrained model, download the model from the Model Zoo and extract it to the main folder. It will create a folder called models, which contains the model weights and the pose mean and standard deviation that were used for training.

If evaluating with your own trained model, change the pose mean and standard deviation to the ones used during its training.

Visualizing trained model

To visualize a trained model on the WIDER FACE validation set run the notebook visualize_trained_model_predictions.

WIDER FACE dataset evaluation

If you haven't already, download the WIDER FACE dataset and extract it to datasets/WIDER_Face.

python3 evaluation/evaluate_wider.py 
--dataset_path datasets/WIDER_Face/WIDER_val/images/
--dataset_list datasets/WIDER_Face/wider_face_split/wider_face_val_bbx_gt.txt
--pretrained_path models/img2pose_v1.pth
--output_path results/WIDER_FACE/Val/

To check mAP and plot curves, download the eval tools and point to results/WIDER_FACE/Val.

AFLW2000-3D dataset evaluation

Download the AFLW2000-3D dataset and unzip to datasets/AFLW2000.

Run the notebook aflw_2000_3d_evaluation.

BIWI dataset evaluation

Download the BIWI dataset and unzip to datasets/BIWI.

Run the notebook biwi_evaluation.

Testing on your own images

Run the notebook test_own_images.
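If you prefer a quick script instead of the notebook, the sketch below follows the same general flow: build the model with the training pose statistics, load the pretrained weights, and call predict on a transformed image. The constructor arguments, the load_model helper, and the file names under models/ are assumptions taken from the example notebooks and the v1 Model Zoo release, so treat this as an illustrative sketch rather than the canonical API.

# Minimal sketch following test_own_images; paths and argument values are
# assumptions based on the example notebooks and may need adjusting.
import numpy as np
from PIL import Image
from torchvision import transforms

from img2pose import img2poseModel
from model_loader import load_model

transform = transforms.Compose([transforms.ToTensor()])

pose_mean = np.load("./models/WIDER_train_pose_mean_v1.npy")
pose_stddev = np.load("./models/WIDER_train_pose_stddev_v1.npy")
threed_points = np.load("./pose_references/reference_3d_68_points_trans.npy")

# ResNet-18 backbone with 600/1400 min/max input sizes (values used in the notebooks).
img2pose_model = img2poseModel(
    18, 600, 1400,
    pose_mean=pose_mean, pose_stddev=pose_stddev,
    threed_68_points=threed_points,
)
load_model(
    img2pose_model.fpn_model, "./models/img2pose_v1.pth",
    cpu_mode=str(img2pose_model.device) == "cpu", model_only=True,
)
img2pose_model.evaluate()

img = Image.open("my_image.jpg").convert("RGB")    # hypothetical input image
res = img2pose_model.predict([transform(img)])[0]  # one result dict per input image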

Output customization

For every face detected, the model outputs the following by default (see the reading sketch after this list):

  • Pose: pitch, yaw, roll, horizontal translation, vertical translation, and scale
  • Projected bounding boxes: left, top, right, bottom
  • Face scores: 0 to 1
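As a hedged illustration of consuming these outputs, the snippet below continues the prediction sketch from the previous section; the result keys boxes, scores, and dofs are the ones used in the example notebooks and are an assumption about your version.

# Hypothetical continuation: keep confident detections and unpack each face.
threshold = 0.9  # example score threshold

for box, score, pose in zip(res["boxes"], res["scores"], res["dofs"]):
    if score.item() < threshold:
        continue
    left, top, right, bottom = box.tolist()  # projected bounding box
    rx, ry, rz, tx, ty, tz = pose.tolist()   # 6DoF pose, in the order listed above
    print(f"score={score.item():.2f}, box=({left:.0f}, {top:.0f}, {right:.0f}, {bottom:.0f})")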

Since the projected bounding box without expansion ends at the start of the forehead, we provide a way of expanding the forehead region individually, along with the default x and y expansion.

To customize the size of the projected bounding boxes, change any of the bounding box expansion variables when creating the model, as shown below (a complete example can be seen in visualize_trained_model_predictions).

# how much to expand in width
bbox_x_factor = 1.1
# how much to expand in height
bbox_y_factor = 1.1
# how much to expand in the forehead
expand_forehead = 0.3

img2pose_model = img2poseModel(
    ...,    
    bbox_x_factor=bbox_x_factor,
    bbox_y_factor=bbox_y_factor,
    expand_forehead=expand_forehead,
)

Align faces

To align the detected faces, call the function below, passing the reference points, the image with the faces to align, and the poses output by img2pose. The function returns a list of PIL images containing one aligned face per given pose.

import numpy as np

from utils.pose_operations import align_faces

# load reference points
threed_points = np.load("pose_references/reference_3d_5_points_trans.npy")

aligned_faces = align_faces(threed_points, img, poses)
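Each returned PIL image can then be used directly; for example, a small usage sketch that saves every aligned face to disk (the output file names are arbitrary):

# Save each aligned face returned by align_faces (one PIL image per pose).
for i, face in enumerate(aligned_faces):
    face.save(f"aligned_face_{i}.jpg")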

Resources

Model Zoo

Annotations

Data Zoo

License

Check the license for details.

Comments
  • Question on fine-tuning for face pose evaluation

    As your paper states, for face pose evaluation you fine-tune the model on the 300W-LP dataset. However, I cannot find the corresponding code in your repository; did I miss it?

    question 
    opened by FunkyKoki 14
  • Question about the RPN scales

    Thanks again for your work.

    I wonder whether the RPN scales change the performance much. The original code you provided uses 5 scales in total; have you ever tried 3 scales?

    question 
    opened by FunkyKoki 9
  • How to extract camera extrinsics?

    I want to use img2pose to extract the camera pose (for the purpose of using it as the input for NeRF). On page 3 of the paper it is stated that this can be obtained from the 6DoF pose h by "standard means," but I'm struggling to figure out how this is done. I'm especially struggling with how to determine the t of the [R|t] matrix; I did manage to extract the R.

    In short: How can one obtain a camera pose from the output 6DoF pose of this model?

    Edit: My specific use case is an input video of a single talking head; I would like to get a camera pose determined by the head pose for each frame; i.e. interpret head movement as camera movement instead.

    question 
    opened by RobinRenggli 7
  • Drawing axis based on yaw, pitch, roll

    Hi,

    I am trying to render the (x, y, z) axes based on the network output instead of using the provided renderer.

    Currently using this code:

    import cv2
    import numpy as np

    def draw_axis(img, euler_angle, center, size=80, thickness=3,
                  angle_const=np.pi / 180, copy=False):
        if copy:
            img = img.copy()
    
        euler_angle *= angle_const
        sin_pitch, sin_yaw, sin_roll = np.sin(euler_angle)
        cos_pitch, cos_yaw, cos_roll = np.cos(euler_angle)
    
        axis = np.array([
            [cos_yaw * cos_roll,
             cos_pitch * sin_roll + cos_roll * sin_pitch * sin_yaw],
            [-cos_yaw * sin_roll,
             cos_pitch * cos_roll - sin_pitch * sin_yaw * sin_roll],
            [sin_yaw,
             -cos_yaw * sin_pitch]
        ])
        axis *= size
        axis += center
    
        axis = axis.astype(int)
        print('axis', axis)
    
        tp_center = tuple(center.astype(int))
    
        cv2.line(img, tp_center, tuple(axis[0]), (0, 0, 255), thickness)
        cv2.line(img, tp_center, tuple(axis[1]), (0, 255, 0), thickness)
        cv2.line(img, tp_center, tuple(axis[2]), (255, 0, 0), thickness)
    
        return img
    

    According to the readme, the poses variable contains 6 values per detected face:

          [
            0.013399183198461687,
            0.0015700862562677677,
            -0.0008193041494016704,
            0.1667461395263672,
            -7.139801979064941,
            53.44799041748047
          ]
    

    Is it true that the first three are pitch, yaw, and roll, and the other three are horizontal translation, vertical translation, and scale?

    Based on the provided rendering code, I get a pretty static axis always pointing in the same direction, but the face mask shows the orientation.

    question 
    opened by vladimirmujagic 7
  • Question about 300W-LP labels acquirements

    Thanks for your work and help. I now have a lightweight model that uses MobileNetV3-small as the backbone. This lightweight model achieves the same pose evaluation performance on AFLW2000 as your model (both without fine-tuning on 300W-LP).

    Now I am focused on fine-tuning. In your paper, you said:

    Training pose rotation labels are obtained by converting the 300W-LP ground-truth Euler angles to rotation vectors, and pose translation labels are created using the ground-truth landmarks, using standard means.

    I open this issue to confirm several things, and I will be very grateful if you can help.

    Here are my questions:

    1. How did you define the face bounding box for each image in 300W-LP, since two or more faces can be detected in some images? Did you just use the one bounding box that has landmark annotations? How did you get the bounding box? Did you use a face detector, like InsightFace?
    2. Since only one face in each image of 300W-LP is annotated with 68 points, can I directly use the labeled landmarks to make the JSON files, and choose to use self.threed_68_points in the code here to generate the lmdb file?

    That's all. Thank you so much.

    question 
    opened by FunkyKoki 6
  • focal length setting

    Great work! I have a question on focal length.

    In your paper, the pose conversion methods are explained at the end.

    But I don't understand this statement, because a larger focal length means a smaller field of view.

    I just wonder why this is a "zoom-out" operation instead of a "zoom-in" operation.

    question 
    opened by gravitychen 6
  • Questions related to the prediction values and rendered results

    Dear Vítor Albiero,

    Thanks for your helpful comments in the previous git issues. It was great help in understanding the paper, img2pose.

    I have additional questions.

    1. What is the definition of the proposed method's output (i.e. the img2pose prediction value)? According to equation (2) in your paper, the 6D vector h_i consists of Euler angles and a 3D face translation vector. You also told me that pose_pred contains rotation vectors, not Euler angles (the two can be easily converted). I understand your comment, and I also cannot find the code where the rotation vector is converted to Euler angles [2]. So, to fully understand your paper, I politely ask again: is it correct that the whole proposed method (i.e. network and post-processing) produces the global pose h_{i}^{img}, consisting of a 3D face translation vector and a 3D rotation vector (not Euler angles), for the given entire image, not for an image crop B as defined in Appendix A?

    2. Questions related to the rendered results. Please refer to the images in [3] below. Note that the image is 27_Spa_Spa_27_32.jpg from the WIDER FACE training dataset; in other words, the model may already have used this image in the training phase. I get the values for the boxes, labels, and dofs from the lmdb dataset obtained by following your guide. I think the values are obtained correctly because both functions, random_crop and random_clip, are turned off. However, the result in [3]-b is a little odd, which confuses me. If I read the GT values correctly, it is impressive that the model generalizes well from the numerous other GTs and produces nice results, even though the GT used for training may be inaccurate, as far as you can see. For reference, the rendering results using prediction and GT values for the other images obtained in the same way were fine, unlike [3], although I did not attach a picture.

      • Is it correct that the GT dof values in the lmdb dataset correspond to h_{i}^{img*} for the given entire image (i.e. the whole image) in Fig. 4?
      • Which factor do you think determines the size of the face? It is difficult to understand why the rotation vector would determine the size of the face. Could you share your explanation for this?

    [1] https://github.com/vitoralbiero/img2pose/issues/27#issuecomment-804506944
    [2] In inference mode (i.e. test_own_images.ipynb), the proposed network applies the transform module, transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std), at the end. However, the transform does not include any rotation-to-Euler conversion.
    [3] Rendered results for the image 27_Spa_Spa_27_32.jpg: a) the rendered result of img2pose; b) the rendered result using the GT values obtained in train.py with the additional code below:

    data_dict = targets[0]
    boxes = data_dict['boxes'].numpy().tolist()
    labels = data_dict['labels'].numpy().tolist()
    dofs = data_dict['dofs'].numpy().tolist()
    question 
    opened by vujadeyoon 6
  • Question about the 3D points (pose reference)

    Thanks again for your contribution.

    I have another question: how did you get the pose reference files, i.e. reference_3d_5_points_trans.npy, reference_3d_68_points_trans.npy, triangles.npy, and vertices_trans.npy?

    question 
    opened by FunkyKoki 5
  • Not able to evaluate images using the file test_own_images.ipynb

    First of all, thank you for your work. I am trying to run the "test_own_images.ipynb" evaluation, but I am getting the error "the size of tensor a must match the size of tensor b at non-singleton dimension 0" at the line "res = img2pose_model.predict([transform(img)])[0]". I am trying to use images from the CASIA WebFace and MultiPIE datasets. Please let me know how I should solve this.

    question 
    opened by 97jay 5
  • How to get mAP and plot curves in python scripts, instead of `eval tools`?

    Hello, I'm not quite familiar with the eval tools. How can I get mAP and plot curves with Python scripts, instead of the eval tools, once I have results/WIDER_FACE/Val?

    question 
    opened by KindleHe 4
  • How to train on my own dataset?

    Hello! Thanks for your project! I would like to know how to train on my own dataset. I notice that you use the five landmarks to generate 6DoF pose labels by standard means. Can you share the code for these methods? Thanks for your work; I am waiting for your reply!

    question 
    opened by LonglongaaaGo 4
  • A Question about fine-tuning

    I'm focusing on the fine-tuning of img2pose. I followed the steps you suggested in the GitHub issues to fine-tune the model "img2pose_v1.pth" you provided. However, the head pose estimation was worse after fine-tuning than before. My steps are as follows:

    1. Use the 300W-LP annotations "300W_LP_annotations_train.txt" you provided on GitHub, and download the 300W-LP dataset from the official website.
    2. Use json_loader_300wlp.py from your code to create "300W_LP_annotations_train.lmdb", "300W_LP_annotations_train_pose_mean.npy", and "300W_LP_annotations_train_pose_stddev.npy" by executing "convert_json_list_to_lmdb.py".
    3. In models.py, change rpn_batch_size_per_image to 2 (proposals) and box_detections_per_img to 4 (head samples).
    4. Use "300W_LP_annotations_train.lmdb" to train without augmentations, using "300W_LP_annotations_train_pose_mean.npy" and "300W_LP_annotations_train_pose_stddev.npy" as pose_mean and pose_stddev.
    5. lr = 0.001 for 2 epochs.
    6. Other parameters are set as follows: "--pose_mean", "./datasets/lmdb/WIDER_train_annotations_pose_mean.npy", "--pose_stddev", "./datasets/lmdb/WIDER_train_annotations_pose_stddev.npy", "--pretrained_path", "./models/img2pose_v1.pth", "--workspace", "./workspace/", "--train_source", "./datasets/lmdb/300W_LP_annotations_train.lmdb", "--prefix", "trial_1", "--batch_size", "2", "--max_size", "1400".

    However, after 2 epochs the head pose estimation was worse, and the MAE on AFLW2000 is "Yaw: 20.656 Pitch: 17.178 Roll: 13.957 MAE: 17.264; H. Trans.: 0.179 V. Trans.: 0.363 Scale: 1.465 MAE: 0.669". I don't know which step went wrong. I would appreciate it if you could help me.

    opened by FengWei2000 0