Info and sample codes for "NTU RGB+D Action Recognition Dataset"

Overview

"NTU RGB+D" Action Recognition Dataset

"NTU RGB+D 120" Action Recognition Dataset

"NTU RGB+D" is a large-scale dataset for human action recognition. It is introduced in our CVPR 2016 paper [PDF].

"NTU RGB+D 120" is the extended version of the "NTU RGB+D" dataset. It is introduced in our TPAMI 2020 paper [PDF].

For any queries regarding the datasets, please contact the first author of the paper.

How to download the datasets

The full datasets can be downloaded via:

https://rose1.ntu.edu.sg/dataset/actionRecognition/

If you only need the skeleton data, you can also obtain it via:

https://drive.google.com/open?id=1CUZnBtYwifVXS21yVg62T-vrPVayso5H

https://drive.google.com/open?id=1tEbuaEqMxAV7dNc4fqu1O4M7mC6CJ50w

Structures of the datasets

"NTU RGB+D" and NTU RGB+D 120" datasets contain 56,880 and 114,480 action samples, respectively. Both datasets include 4 different modalities of data for each sample:

  • RGB videos
  • depth map sequences
  • 3D skeletal data
  • infrared (IR) videos

Video samples were captured concurrently by three Microsoft Kinect V2 cameras. The RGB videos have a resolution of 1920×1080, the depth maps and IR videos are both 512×424, and the 3D skeletal data contains the 3D locations of 25 major body joints in each frame.

Each file/folder name in both datasets is in the format of SsssCcccPpppRrrrAaaa (e.g., S001C002P003R002A013), in which sss is the setup number, ccc is the camera ID, ppp is the performer (subject) ID, rrr is the replication number (1 or 2), and aaa is the action class label.
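As an illustration, here is a minimal Python sketch of parsing this naming scheme; the field meanings follow the description above, and the dictionary keys are just illustrative:

    import re

    def parse_ntu_name(name):
        """Parse a sample name such as 'S001C002P003R002A013' into its fields."""
        m = re.match(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})$", name)
        if m is None:
            raise ValueError("not a valid NTU sample name: " + name)
        setup, camera, performer, replication, action = map(int, m.groups())
        return {
            "setup": setup,              # S001-S017 (NTU RGB+D) or S001-S032 (NTU RGB+D 120)
            "camera": camera,            # camera ID
            "performer": performer,      # subject ID
            "replication": replication,  # 1 or 2
            "action": action,            # action class label
        }

    print(parse_ntu_name("S001C002P003R002A013"))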

The "NTU RGB+D" dataset includes the files/folders with setup numbers between S001 and S017, while the "NTU RGB+D 120" dataset includes the files/folders with setup numbers between S001 and S032.

For more details about the setups, camera IDs, etc., please refer to the "NTU RGB+D" dataset paper and the "NTU RGB+D 120" dataset paper.

Samples with missing skeletons

302 samples in the "NTU RGB+D" dataset and 535 samples in the "NTU RGB+D 120" dataset have missing or incomplete skeleton data. If you are working on skeleton-based analysis, please ignore these files in your training and testing procedures.
The list of these samples for the "NTU RGB+D" dataset is provided here.
The list of these samples for the "NTU RGB+D 120" dataset is provided here.
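For convenience, here is a minimal Python sketch of skipping these samples when collecting skeleton files; the list file name below is a placeholder, and the list is assumed to contain one sample name per line:

    from pathlib import Path

    def load_missing(list_path="missing_skeletons.txt"):
        """Read the missing-skeleton sample names (one per line) into a set."""
        return {line.strip() for line in Path(list_path).read_text().splitlines() if line.strip()}

    def usable_skeleton_files(skeleton_dir, list_path="missing_skeletons.txt"):
        """Return the .skeleton files whose samples are not in the missing list."""
        missing = load_missing(list_path)
        return [p for p in sorted(Path(skeleton_dir).glob("*.skeleton")) if p.stem not in missing]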

Sample codes

We have provided some MATLAB codes here to demonstrate how to read the skeleton files, map them to other modalities (RGB, depth, and IR frames), and visualize the skeleton data. The codes are suitable for both "NTU RGB+D" and "NTU RGB+D 120".
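For reference, below is a minimal Python sketch of a .skeleton reader that mirrors the field layout of the provided read_skeleton_file.m (frame count, then per-frame body count, a body-info line, a joint count, and one 12-value line per joint); treat it as a sketch rather than an official parser:

    def read_skeleton_file(path):
        """Parse a .skeleton file into a list of frames; each frame is a list of body dicts."""
        with open(path) as f:
            tokens = iter(f.read().split())
        frames = []
        for _ in range(int(next(tokens))):            # number of frames
            bodies = []
            for _ in range(int(next(tokens))):        # number of tracked bodies in this frame
                body = {
                    "bodyID": next(tokens),
                    # clippedEdges, hand confidences/states, isRestricted, leanX, leanY, trackingState
                    "info": [float(next(tokens)) for _ in range(9)],
                    "joints": [],
                }
                for _ in range(int(next(tokens))):    # number of joints (25)
                    v = [float(next(tokens)) for _ in range(12)]
                    body["joints"].append({
                        "xyz": v[0:3],           # 3D camera-space location
                        "depth_xy": v[3:5],      # location in the depth/IR frame
                        "color_xy": v[5:7],      # location in the RGB frame
                        "orientation": v[7:11],  # joint orientation quaternion
                        "tracking_state": v[11],
                    })
                bodies.append(body)
            frames.append(bodies)
        return frames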

Action Classes

"NTU RGB+D" dataset contains 60 action classes, and "NTU RGB+D 120" dataset contains 120 action classes. The actions in these two datasets are listed below. Note that actions labelled from A1 to A60 are in "NTU RGB+D", while actions labelled from A1 to A120 are in "NTU RGB+D 120".

  • A1. drink water.
  • A2. eat meal/snack.
  • A3. brushing teeth.
  • A4. brushing hair.
  • A5. drop.
  • A6. pickup.
  • A7. throw.
  • A8. sitting down.
  • A9. standing up (from sitting position).
  • A10. clapping.
  • A11. reading.
  • A12. writing.
  • A13. tear up paper.
  • A14. wear jacket.
  • A15. take off jacket.
  • A16. wear a shoe.
  • A17. take off a shoe.
  • A18. wear on glasses.
  • A19. take off glasses.
  • A20. put on a hat/cap.
  • A21. take off a hat/cap.
  • A22. cheer up.
  • A23. hand waving.
  • A24. kicking something.
  • A25. reach into pocket.
  • A26. hopping (one foot jumping).
  • A27. jump up.
  • A28. make a phone call/answer phone.
  • A29. playing with phone/tablet.
  • A30. typing on a keyboard.
  • A31. pointing to something with finger.
  • A32. taking a selfie.
  • A33. check time (from watch).
  • A34. rub two hands together.
  • A35. nod head/bow.
  • A36. shake head.
  • A37. wipe face.
  • A38. salute.
  • A39. put the palms together.
  • A40. cross hands in front (say stop).
  • A41. sneeze/cough.
  • A42. staggering.
  • A43. falling.
  • A44. touch head (headache).
  • A45. touch chest (stomachache/heart pain).
  • A46. touch back (backache).
  • A47. touch neck (neckache).
  • A48. nausea or vomiting condition.
  • A49. use a fan (with hand or paper)/feeling warm.
  • A50. punching/slapping other person.
  • A51. kicking other person.
  • A52. pushing other person.
  • A53. pat on back of other person.
  • A54. point finger at the other person.
  • A55. hugging other person.
  • A56. giving something to other person.
  • A57. touch other person's pocket.
  • A58. handshaking.
  • A59. walking towards each other.
  • A60. walking apart from each other.
  • A61. put on headphone.
  • A62. take off headphone.
  • A63. shoot at the basket.
  • A64. bounce ball.
  • A65. tennis bat swing.
  • A66. juggling table tennis balls.
  • A67. hush (quiet).
  • A68. flick hair.
  • A69. thumb up.
  • A70. thumb down.
  • A71. make ok sign.
  • A72. make victory sign.
  • A73. staple book.
  • A74. counting money.
  • A75. cutting nails.
  • A76. cutting paper (using scissors).
  • A77. snapping fingers.
  • A78. open bottle.
  • A79. sniff (smell).
  • A80. squat down.
  • A81. toss a coin.
  • A82. fold paper.
  • A83. ball up paper.
  • A84. play magic cube.
  • A85. apply cream on face.
  • A86. apply cream on hand back.
  • A87. put on bag.
  • A88. take off bag.
  • A89. put something into a bag.
  • A90. take something out of a bag.
  • A91. open a box.
  • A92. move heavy objects.
  • A93. shake fist.
  • A94. throw up cap/hat.
  • A95. hands up (both hands).
  • A96. cross arms.
  • A97. arm circles.
  • A98. arm swings.
  • A99. running on the spot.
  • A100. butt kicks (kick backward).
  • A101. cross toe touch.
  • A102. side kick.
  • A103. yawn.
  • A104. stretch oneself.
  • A105. blow nose.
  • A106. hit other person with something.
  • A107. wield knife towards other person.
  • A108. knock over other person (hit with body).
  • A109. grab other person’s stuff.
  • A110. shoot at other person with a gun.
  • A111. step on foot.
  • A112. high-five.
  • A113. cheers and drink.
  • A114. carry something with other person.
  • A115. take a photo of other person.
  • A116. follow other person.
  • A117. whisper in other person’s ear.
  • A118. exchange things with other person.
  • A119. support somebody with hand.
  • A120. finger-guessing game (playing rock-paper-scissors).

Evaluation Protocol of One-Shot Action Recognition on "NTU RGB+D 120"

In "NTU RGB+D 120" dataset paper, we introduced the one-shot recognition setting, in which "NTU RGB+D 120" dataset is split to two parts: auxiliary set and one-shot evaluation set. Auxiliary set contains 100 classes, and all samples of these classes can be used for learning. Evaluation set consists of 20 novel classes, and one sample from each novel class is picked as the exemplar, while all the remaining samples of these classes are used to test the recognition performance.

Evaluation set. 20 novel classes, namely, A1, A7, A13, A19, A25, A31, A37, A43, A49, A55, A61, A67, A73, A79, A85, A91, A97, A103, A109, A115. The following 20 samples are the exemplars:
(01)S001C003P008R001A001, (02)S001C003P008R001A007, (03)S001C003P008R001A013, (04)S001C003P008R001A019, (05)S001C003P008R001A025, (06)S001C003P008R001A031, (07)S001C003P008R001A037, (08)S001C003P008R001A043, (09)S001C003P008R001A049, (10)S001C003P008R001A055, (11)S018C003P008R001A061, (12)S018C003P008R001A067, (13)S018C003P008R001A073, (14)S018C003P008R001A079, (15)S018C003P008R001A085, (16)S018C003P008R001A091, (17)S018C003P008R001A097, (18)S018C003P008R001A103, (19)S018C003P008R001A109, (20)S018C003P008R001A115.

Auxiliary set. 100 classes (the remaining 100 classes of "NTU RGB+D 120", excluding the 20 classes in the evaluation set).
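A minimal Python sketch of producing this split from a list of sample names, using the novel class indices and exemplar names listed above:

    EVAL_CLASSES = set(range(1, 116, 6))  # A1, A7, ..., A115
    EXEMPLARS = {
        "S001C003P008R001A001", "S001C003P008R001A007", "S001C003P008R001A013",
        "S001C003P008R001A019", "S001C003P008R001A025", "S001C003P008R001A031",
        "S001C003P008R001A037", "S001C003P008R001A043", "S001C003P008R001A049",
        "S001C003P008R001A055", "S018C003P008R001A061", "S018C003P008R001A067",
        "S018C003P008R001A073", "S018C003P008R001A079", "S018C003P008R001A085",
        "S018C003P008R001A091", "S018C003P008R001A097", "S018C003P008R001A103",
        "S018C003P008R001A109", "S018C003P008R001A115",
    }

    def one_shot_split(sample_names):
        """Split sample names into (auxiliary, exemplars, evaluation) lists."""
        auxiliary, exemplars, evaluation = [], [], []
        for name in sample_names:
            action = int(name[-3:])          # the Aaaa suffix
            if action not in EVAL_CLASSES:
                auxiliary.append(name)       # 100 auxiliary classes: all samples usable for learning
            elif name in EXEMPLARS:
                exemplars.append(name)       # one exemplar per novel class
            else:
                evaluation.append(name)      # remaining samples of the 20 novel classes: testing
        return auxiliary, exemplars, evaluation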

Citation

To cite our datasets, please use the following BibTeX records:

@inproceedings{shahroudy2016ntu,
  title={NTU RGB+D: A large scale dataset for 3D human activity analysis},
  author={Shahroudy, Amir and Liu, Jun and Ng, Tian-Tsong and Wang, Gang},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={1010--1019},
  year={2016}
}

@article{liu2020ntu,
  title={NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding},
  author={Liu, Jun and Shahroudy, Amir and Perez, Mauricio and Wang, Gang and Duan, Ling-Yu and Kot, Alex C},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={42},
  number={10},
  pages={2684--2701},
  year={2020}
}

Mailing List

If you are interested in receiving news, updates, and future events about this dataset, please subscribe to the Google group of the dataset at: https://groups.google.com/d/forum/ntu-rgbd. If you cannot access the group's page, please email me and I will add your email to the list. You can find my email in the paper; I'm the first author :)

3D Human Activity Analysis Challenge

We organized the action recognition challenge "Large Scale 3D Human Activity Analysis Challenge in Depth Videos" based on the "NTU RGB+D" dataset in ACCV 2016. Details about this challenge can be found here.

Reported results on "NTU RGB+D" benchmark and "NTU RGB+D 120" benchmark

Coming soon :)

FAQs

(1) When I sent the request for the dataset, I received '500 - Internal server error'.
This error happens from time to time and is a technical problem with the hosting server. Please contact us via email so that we can follow it up. The best person to email is the technician of our lab: Chai Ooy Mei ([email protected]). Please CC us ([email protected], [email protected]) in your email so that we can follow it up and ensure you get access as soon as possible.

(2) I want to align RGB and depth frames. Is there any camera calibration data recorded?
Unfortunately, no camera calibration info was recorded. However, one applicable solution is to use the skeletal data. For each video sample, the skeletal data includes a large number of body joints and their precise locations in both the RGB and depth frames, so each sample gives you a large number of correspondences. Keep in mind that the cameras were fixed during each setup (Sxxx in the file names means the sample is from setup xxx). So for each camera in each setup you have a huge number of correspondences between the RGB and depth frames (and also between the three sensors!). Finding a transformation between the cameras is then as easy as solving a linear system with a lot of known points!
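As an illustration of that idea, here is a minimal Python sketch that fits an affine depth-to-RGB mapping from per-joint correspondences of one camera in one setup; it assumes skeleton frames in the dictionary form produced by the reader sketched in the "Sample codes" section above (an illustrative helper, not part of the official release):

    import numpy as np

    def fit_depth_to_color(frames):
        """Least-squares affine map from (depthX, depthY) to (colorX, colorY)."""
        depth_pts, color_pts = [], []
        for bodies in frames:
            for body in bodies:
                for joint in body["joints"]:
                    depth_pts.append(joint["depth_xy"])
                    color_pts.append(joint["color_xy"])
        D = np.asarray(depth_pts)                   # (N, 2) depth-frame coordinates
        C = np.asarray(color_pts)                   # (N, 2) RGB-frame coordinates
        A = np.hstack([D, np.ones((len(D), 1))])    # homogeneous depth coordinates
        M, *_ = np.linalg.lstsq(A, C, rcond=None)   # (3, 2) affine transform
        return M

    def depth_to_color(M, x, y):
        """Map one depth-frame point to RGB-frame coordinates."""
        return np.array([x, y, 1.0]) @ M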

(3) There are some extra values recorded for each skeletal joint, like orientation, lean, etc. What do they mean?
In almost all applications, the 3D locations of the joints are enough. We tried to keep everything generated by the SDK, so we recorded it all. For more info about the meaning of those extra values, you can read this: https://medium.com/@lisajamhoury/understanding-kinect-v2-joints-and-coordinate-system-4f4b90b9df16

(4) What are the masked depth maps on the download page?
The main purpose of providing masked depth maps was to have a smaller-sized version of the original depth maps. We used the positions of the body skeletons to find the regions of interest in the depth maps, copied the depth values for those regions from the original depth maps, and set the depth of the other regions to zero. This helped to achieve much better frame-wise compression ratios.

(5) Why are the individual and mutual actions considered together? Isn't it better to separate them in the evaluations?
Having these classes of human actions together is part of our dataset design, to cope with more realistic scenarios of human action analysis. Therefore, the ideal evaluation should not use any prior info about the type of the action.

(6) How did you handle the variable number of subjects (one or two) in the input of the network?
Our inputs initially include two sets of joints (for two skeletons). When we observed just one, the second set was filled with zeros. When we observed two or more, we decided which one is the main subject and which one is the second by measuring the amount of motion of their joints. Also, some of the detected skeletons are noise, like tables and seats. You can eliminate them by filtering out the skeletons whose joints do not have a reasonable ratio of Y spread to X spread.

(7) How did you choose the main actor in the preprocessing step?
We used a heuristic method. It's very simple (but not necessarily correct for all samples): we take the variance of the X, Y, and Z values of all the joints, add them up, and take the body with the higher value as the main subject.
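A minimal Python sketch of the two heuristics described in FAQs (6) and (7); the (frames, 25, 3) joint arrays and the spread-ratio threshold are assumptions made for illustration:

    import numpy as np

    def is_plausible_body(joint_xyz, min_ratio=0.8):
        """joint_xyz: (T, 25, 3) array of one body's joints; reject flat, wide 'skeletons' (tables, seats)."""
        x_spread = joint_xyz[..., 0].max() - joint_xyz[..., 0].min()
        y_spread = joint_xyz[..., 1].max() - joint_xyz[..., 1].min()
        return x_spread > 0 and (y_spread / x_spread) > min_ratio

    def motion_score(joint_xyz):
        """Sum of the variances of the X, Y, and Z values of all joints."""
        return joint_xyz.reshape(-1, 3).var(axis=0).sum()

    def pick_main_and_second(bodies):
        """bodies: list of (T, 25, 3) arrays; return (main subject, second subject or zeros)."""
        bodies = [b for b in bodies if is_plausible_body(b)]
        bodies = sorted(bodies, key=motion_score, reverse=True)
        main = bodies[0]
        second = bodies[1] if len(bodies) > 1 else np.zeros_like(main)
        return main, second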

(8) How important is the skeleton normalization step described in the experimental setup section?
In the extension of our experiments, we found that the normalization is not vital. You can skip the normalization step and it should work fine. The network is supposed to learn how to normalize the data by itself.

(9) The provided MATLAB code cannot read .avi files on my Linux machine.
Most probably it's a missing codec problem. I used this solution, and it worked on my machine. Hopefully it helps you too.

Comments
  • Can't finish download request form

    Hi, I'm trying to apply for "NTU RGB+D 120" dataset download authorization, but the system fails to complete the operation; it seems that a database error occurs. Is there any other form to request access?

    Cheers.

    opened by danilobcardoso 5
  • New datasets split

    Hi, after downloading the new extended dataset, I found that the subjects from the previous 60 classes also performed the new 60 action classes, but the new subjects did not perform the previous 60 classes. So when the dataset is divided (e.g., cross-setup evaluation), is there any gap when making predictions for the new subjects?

    opened by VSunN 5
  • the access to the dataset

    Hello, thanks for your favorable work. I cannot send a request for access to the dataset on your website, because it shows '500 - Internal server error'. Could you possibly help me get the Action Recognition Dataset 3D skeletons (body joints), which is 5.8 GB? I am an undergraduate student in the UK and would like to use it for research. My email address is [email protected]

    opened by Xingyu-Jin 5
  • depth and rgb registration

    Hi, is there any code to register the RGB and depth images? They have different aspect ratios. Is there any code to get the common areas in both images?

    opened by malreddysid 5
  • Error in Estimated Skeleton Data

    Sample: "S001C001P001R001A027" Action: Jump Up

    In the RGB video, there is one person performing the action, but there are joint values for two persons in the skeleton data (the file named S001C001P001R001A027.skeleton). This is also the case for a few other samples, such as S001C001P001R001A028, S001C001P001R001A024, S001C001P001R001A026, S001C001P001R001A029.

    P.S. These are the cases that I have found so far while examining the data.

    Is this an error? If yes, is there any way to eliminate it?

    opened by sksenthilkumar 4
  • get original data from normalized data

    It would be great if I could know how to get the original projectable coordinates from the normalized x, y coordinates. In other words, it would be helpful if you could provide the code used to normalize the x, y coordinates.

    opened by worthlessFella 3
  • What does each parameter mean in raw data?

    I appreciate your work very much. There are some questions that puzzle me. In the raw data and in "read_skeleton_file.m", there are some parameters that I can't really understand:
    body.clipedEdges, body.handLeftConfidence, body.handLeftState, body.handRightConfidence, body.handRightState, body.isResticted, body.leanX, body.leanY, joint.orientationW, joint.orientationX, joint.orientationY, joint.orientationZ. What does each parameter mean? And how can we get these parameters from the sensor? Looking forward to your reply!

    opened by XiaRongjie 3
  • queries about nturgb+d_depth_masked data

    Hi, when I imread your masked depth data (e.g. MDepth-00000001.png), the displayed image only shows content in the top left corner. Could you tell me why? Thanks very much!

    opened by bangligit 3
  • Question regarding action span

    Is there any information at all on when the action begins and ends? Or has anyone come up with a heuristic to determine which parts of the clip contain the action?

    I am asking this because, from what I observed, in most clips the action does not begin immediately, but rather about 1/3 to 1/2 into the clip (these numbers came from about 10 clips, so I cannot say anything about their significance). Having the idle frames counted towards the action might add noise to the data.

    Does the dataset provide any info regarding when the action happens, or do I need to resort to heuristics along the lines of what I mentioned above?

    opened by usamahjundia 2
  • Intrinsic parameters of Kinect

    Hi, thank you for providing this awesome dataset! I want to re-project the depth images to 3D point clouds. How do I get the intrinsic parameters of each Kinect?

    Thanks in advance for considering!

    opened by ryohachiuma 2
  • question about skeleton data

    Hi, I'm interested in human action recognition using skeleton data. When I look into the data, I see some misleading joint positions due to the side view of the camera. For example, the right hand is occluded, but the Kinect estimates a woman drinking water with two hands. So I'm curious whether it is okay to feed this kind of misleading joint data into an LSTM model.

    opened by techjjun 2
  • Download NTURGB-D from the google colab

    Hi, I was trying to download both datasets from Google Colaboratory using the command !gdown "<drive-id>&confirm=t", but I am getting the following error:

    Access denied with the following error:

        Cannot retrieve the public link of the file. You may need to change
        the permission to 'Anyone with the link', or have had many accesses.

    You may still be able to access the file from the browser.

    It would be great to be able to download both datasets directly from Google Colab. Thanks.

    opened by alirezadizaji 0
  • About the data volume of nturgbd120-X-Sub

    Hello, I would like to know the exact number of training samples and test samples in the nturgbd120 X-Sub split. I'm a little afraid that I've got it wrong.

    opened by lizaowo 0
  • Data Normalization

    Hi, I have trained a model on your normalized data and need to feed it my own video data. I have a skeleton file with non-normalized 3D joint coordinates; can you provide the code to normalize them consistently with yours? Thank you.

    opened by amcorGit 4
  • Action recognition for a particular class category

    Let's say I want a fall detection solution that recognizes the action from skeleton information. Since the program doesn't consider any other action classes, is it better to train the model, say PoseC3D, with one class category? Other similar actions like sitting down or standing up (from sitting position) are among the class categories as well; should I also take those video samples and train the network with 3 action classes (falling, standing and sitting)? Thanks.

    opened by bit-scientist 0
  • train_data_joint.npy

    I've downloaded the Kinetics dataset multiple times and run the preprocessing, but I never get a train_data_joint.npy file, so I can't run the subsequent training. Where can I get this file? If it is generated by preprocessing.py, is there a reason why I don't get one after running that .py?

    opened by Djmcflush 0
Owner
Amir Shahroudy
Research Engineer, Sweden.