Depth-Based 3D Hand Pose Estimation
This task builds on the BigHand2.2M dataset and follows a format similar to the HANDS 2017 challenge. Hands appear in both third-person and egocentric viewpoints. No objects are present in this task.
- Training set: Contains images from 5 different subjects. Some hand articulations and viewpoints are strategically excluded.
- Test set: Contains images from 10 different subjects. 5 subjects overlap with the training set. Exhaustive coverage of viewpoints and articulations.
The following performance scores (each reported as mean joint error) will be evaluated:
- Interpolation (INTERP.): performance on test samples that have shape, viewpoints and articulations present in the training set.
- Extrapolation:
  - Total (EXTRAP.): performance on test samples that have hand shapes, viewpoints and articulations not present in the training set.
  - Shape (SHAPE): performance on test samples that have hand shapes not present in the training set. Viewpoints and articulations are present in the training set.
  - Articulation (ARTIC.): performance on test samples that have articulations not present in the training set. Shapes and viewpoints are present in the training set.
  - Viewpoint (VIEWP.): performance on test samples that have viewpoints not present in the training set. Shapes and articulations are present in the training set. Viewpoint is defined by the elevation and azimuth angles of the hand with respect to the camera; both angles are analyzed independently.
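The two quantities above can be sketched in code. The mean joint error follows directly from the metric's name (mean Euclidean distance over joints and samples); the viewpoint helper uses one common axis convention (camera z forward, y up), which is an assumption here, not the challenge's official definition.

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean joint error.

    pred, gt: arrays of shape (num_samples, num_joints, 3) holding
    3D joint locations in camera coordinates.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def viewpoint_angles(direction):
    """Azimuth and elevation (degrees) of a hand direction vector
    relative to the camera. Axis convention (z forward, y up) is an
    assumption for illustration only.
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    azimuth = np.degrees(np.arctan2(d[0], d[2]))
    elevation = np.degrees(np.arcsin(d[1]))
    return azimuth, elevation
```

For example, predictions offset from ground truth by 1 unit along x for every joint yield a mean joint error of exactly 1, and a direction pointing straight down the camera axis gives zero azimuth and elevation.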
- Images are captured with an Intel RealSense SR300 camera at 640 × 480 pixel resolution.
- Use of training data from the HANDS 2017 challenge is not allowed, as some images may overlap with the test set.
- Use of other labelled datasets (either real or synthetic) is not allowed. Use of a fitted MANO model for synthesizing data is encouraged. Use of external unlabelled data is allowed (e.g. for self-supervised and unsupervised methods).
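Since the task operates on 640 × 480 depth maps, a typical preprocessing step is back-projecting depth pixels into camera-space 3D points with a pinhole model. A minimal sketch, assuming depth in millimetres and placeholder intrinsics (the actual SR300 calibration values are not given here):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map to camera-space 3D points.

    depth: (H, W) array of depth values (e.g. mm); 0 = missing.
    fx, fy, cx, cy: pinhole intrinsics; the values depend on the
    specific camera calibration and are assumptions here.
    Returns an (H, W, 3) array of (x, y, z) points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

With the principal point at the image centre, the centre pixel maps to a point on the optical axis (x = y = 0, z = depth), which is a quick sanity check for the intrinsics.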