Towards understanding human actions out of context with the Mimetics dataset

This blog article describes our recent work on human action recognition, specifically on understanding actions out of context thanks to our new Mimetics dataset.

Progress in human action recognition …

Action recognition methods have recently obtained impressive results thanks to large-scale datasets such as Kinetics400, which contains more than 300,000 video clips covering 400 classes. Methods based on deep neural networks with spatio-temporal convolutions [1,2] reach outstanding performance, with approximately 75% top-1 accuracy and 95% top-5 accuracy. The resulting models transfer well to other video recognition tasks and datasets; however, the explanation behind such high performance remains unclear.

… thanks to context biases (scene, object)

We observed that most actions can actually be recognized from their context alone. For instance, videos in which the person has been masked out, as shown in the figures above, can still be accurately classified, without even seeing the person! We retrained a model on these masked videos and obtained a top-1 accuracy of 65% on Kinetics! This shows how much context can be leveraged by spatio-temporal CNNs designed for human action recognition.
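To make this concrete, here is a minimal sketch of how such masking could be done, assuming per-frame person bounding boxes from an off-the-shelf detector; the function below is our illustration, not the exact procedure used in our experiments:

    import numpy as np

    def mask_out_persons(frames, person_boxes, fill_value=114):
        """Gray out every detected person region in a video clip.

        frames:       (T, H, W, 3) uint8 array of video frames
        person_boxes: per-frame lists of (x1, y1, x2, y2) integer boxes,
                      assumed to come from an off-the-shelf person detector
        """
        masked = frames.copy()
        for t, boxes in enumerate(person_boxes):
            for x1, y1, x2, y2 in boxes:
                masked[t, y1:y2, x1:x2, :] = fill_value  # hide the actor
        return masked  # the 3D CNN is then retrained on these masked clips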

Out-of-context human action

While this contextual information is certainly useful for predicting human actions, it’s not sufficient to truly understand what’s happening in a scene. People have a more complete understanding of human actions and can recognize them even without any context, object or scene, the most obvious example being mimes.

To understand actions in out-of-context scenarios, i.e. when objects and scenes are absent or misleading, action recognition can only rely on the body language captured by human pose and motion. While 3D action recognition (i.e. action recognition from skeleton data) has been well studied in the community, its application has been limited to constrained scenarios where accurate ground-truth body poses can be acquired through a motion capture system. In this work, we propose to compare the performance of pose-based action recognition methods and spatio-temporal CNNs in out-of-context scenarios.

To analyse the understanding of human actions in out-of-context scenarios, we introduce the Mimetics dataset, which contains 713 YouTube video clips of mimed actions for a subset of 50 classes of the Kinetics400 dataset. Methods trained on Kinetics can be evaluated on Mimetics to see how well they perform out of context.
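As a rough sketch of this protocol, the snippet below computes top-k accuracy for a Kinetics-trained model on Mimetics clips; the function and its details (e.g. whether scores are restricted to the 50 overlapping classes) are our illustration, not the official evaluation code:

    import numpy as np

    def topk_accuracy(scores, labels, k=1):
        """Top-k accuracy of a Kinetics-trained model on Mimetics clips.

        scores: (N, 400) array of class scores over the Kinetics400 vocabulary
        labels: (N,) ground-truth class indices (Mimetics labels form a
                50-class subset of Kinetics400)
        """
        topk = np.argsort(-scores, axis=1)[:, :k]     # k highest-scoring classes
        hits = (topk == labels[:, None]).any(axis=1)  # is the label among them?
        return hits.mean()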

Pose-based action recognition in the wild

As a 3D action recognition baseline, we developed an intuitive pipeline in which 3D poses are first extracted using LCR-Net++ [3], before applying a state-of-the-art skeleton-based action recognition method [4].
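Schematically, this explicit pipeline chains the two stages as below; pose_estimator and skeleton_classifier are hypothetical placeholders standing in for LCR-Net++ [3] and the skeleton-based classifier [4], not their released APIs:

    import numpy as np

    def explicit_pose_pipeline(frames, pose_estimator, skeleton_classifier):
        # Stage 1: estimate a (J, 3) array of 3D joints for each frame
        poses = np.stack([pose_estimator(f) for f in frames])  # (T, J, 3)
        # Stage 2: classify the whole skeleton sequence into an action
        return skeleton_classifier(poses)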

This method, based on explicit body keypoint localisation, might suffer from noisy 3D pose estimates in the wild (due to abrupt camera motion, motion blur, ambiguous poses, etc.), so we also benchmark an approach where poses are not explicitly estimated. To do so, we take the features learned for the pose estimation task and transfer them to an action recognition pipeline. We denote this approach as the implicit pose representation.
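The implicit variant simply skips the keypoint decoding step and classifies pose features over time; again a sketch with placeholder callables, not a released API:

    import numpy as np

    def implicit_pose_pipeline(frames, pose_backbone, temporal_classifier):
        # Reuse mid-level features of the pose network, one vector per frame
        feats = np.stack([pose_backbone(f) for f in frames])  # (T, D)
        # Classify the feature sequence without ever decoding joints
        return temporal_classifier(feats)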

Overview of results

  • Implicit pose representation performs better than explicit body keypoint representations for action recognition in the wild.
  • Recognizing out-of-context actions while training on Kinetics is hard: the RGB 3D CNN obtains 8.6% top-1 accuracy, the Flow 3D CNN 11.8% and the implicit pose representation 14.3%. In the left example below, the fact that the piano is covered by a tablecloth makes the RGB model confuse it with a massage table, explaining the “massage back” prediction, while the correct prediction can be made using optical flow or implicit poses.

[Example predictions. Left: implicit poses: playing piano, RGB: massage back, Flow: playing piano. Right: implicit poses: shooting goal, RGB: playing badminton, Flow: dancing ballet]

  • One reason for the low performance is that mime artists tend to exaggerate the movements.
  • 3D CNN models fail when the object and/or scene are not relevant, while implicit poses are less sensitive to this aspect.
  • 3D CNNs have less difficulty with actions where no object, or only a small object, is involved: the model then focuses more on the humans than on the context.
References

[1] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.

[2] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, 2019.

[3] Grégory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net++: Multi-person 2D and 3D pose detection in natural images. IEEE Trans. PAMI, 2019.

[4] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.