This blog article describes our recent work on human action recognition, specifically on understanding actions out of context, made possible by our new Mimetics dataset.
Progress in human action recognition …
Action recognition methods have recently obtained impressive results thanks to large-scale datasets such as Kinetics400, which contains more than 300,000 video clips and covers 400 classes. Methods based on deep neural networks with spatio-temporal convolutions [1,2] reach outstanding performance, with approximately 75% top-1 accuracy and 95% top-5 accuracy. The resulting models transfer well to other video recognition tasks and datasets; however, the explanation behind such high performance remains unclear.
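For readers who want to experiment with such a model, torchvision ships an R3D-18 video backbone pretrained on Kinetics400. Below is a minimal sketch of classifying a clip with it; the clip shape and the `pretrained=True` flag reflect torchvision's defaults (newer versions use a `weights=` argument instead), and the random tensor is just a stand-in for real video frames:

```python
import torch
from torchvision.models.video import r3d_18

# Load an R3D-18 backbone pretrained on Kinetics400
# (newer torchvision versions use `weights=` instead of `pretrained=`).
model = r3d_18(pretrained=True).eval()

# A video clip as a float tensor: (batch, channels, frames, height, width).
clip = torch.rand(1, 3, 16, 112, 112)

with torch.no_grad():
    logits = model(clip)           # (1, 400): one score per Kinetics400 class
    top5 = logits.topk(5).indices  # indices of the five most likely classes
print(top5)
```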
… thanks to context biases (scene, object)
We observed that most actions can actually be recognized from their context only. For example, videos where the person has been masked out, as shown in the figures above, can still be accurately classified, without even seeing the person! We retrained a model on these masked videos and obtained a top-1 accuracy of 65% on Kinetics! This shows how much context can be leveraged by spatio-temporal CNNs designed for human action recognition.
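As a rough illustration of this masking experiment, here is a sketch that blanks out person regions in each frame before the clip is fed to a classifier. The `person_boxes` input is a placeholder for the output of whatever person detector you use; the box format and fill value are assumptions for illustration, not our exact protocol:

```python
import numpy as np

def mask_persons(frames, person_boxes, fill=0):
    """Blank out person regions in a video clip.

    frames: (T, H, W, 3) uint8 array of video frames.
    person_boxes: list of per-frame box lists, each box as (x1, y1, x2, y2)
                  in pixel coordinates (e.g. from an off-the-shelf detector).
    fill: value written into the masked region.
    """
    masked = frames.copy()
    for t, boxes in enumerate(person_boxes):
        for (x1, y1, x2, y2) in boxes:
            masked[t, y1:y2, x1:x2, :] = fill
    return masked

# Example: mask a single static box over a 16-frame clip.
clip = np.random.randint(0, 255, (16, 112, 112, 3), dtype=np.uint8)
boxes = [[(30, 20, 80, 100)]] * 16
masked_clip = mask_persons(clip, boxes)
```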
Out-of-context human action
While this contextual information is certainly useful for predicting human actions, it is not sufficient to truly understand what is happening in a scene. People have a more complete understanding of human actions and can even recognize them without any context, object, or scene: the most obvious example being mimes.
To understand actions in out-of-context scenarios, i.e., when objects and scenes are absent or misleading, action recognition can only rely on body language, captured by human pose and motion. While 3D action recognition (i.e., action recognition from skeleton data) has been well studied in the community, its application has been limited to constrained scenarios where accurate ground-truth body poses can be acquired through a motion capture system. In this work, we propose to compare the performance of pose-based action recognition methods and spatio-temporal CNNs in out-of-context scenarios.
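To make the comparison concrete, a pose-based recognizer consumes a sequence of body keypoints rather than pixels. Below is a minimal sketch of such a model: a generic temporal convolution over flattened joint coordinates, chosen purely for illustration and not the specific architecture used in our work (the 17-joint COCO layout and layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class PoseActionClassifier(nn.Module):
    """Classify actions from a sequence of 2D body keypoints.

    Input: (batch, frames, joints, 2) keypoint coordinates.
    """
    def __init__(self, num_joints=17, num_classes=400, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            # Treat flattened joint coordinates as channels over time.
            nn.Conv1d(num_joints * 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the temporal dimension
        )
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, poses):
        b, t, j, c = poses.shape
        x = poses.reshape(b, t, j * c).transpose(1, 2)  # (B, J*2, T)
        return self.head(self.net(x).squeeze(-1))

# Example: 17 COCO-style joints over a 32-frame clip.
model = PoseActionClassifier()
logits = model(torch.rand(2, 32, 17, 2))  # (2, 400)
```

Because such a model never sees the scene or the objects, its predictions depend only on body pose and motion, which is exactly the signal that remains in out-of-context scenarios.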