Task and challenges
Action recognition in videos requires processing both spatial and temporal information. Although CNNs have been successful at modeling spatial information, they have been less effective at modeling temporal information. Current state-of-the-art techniques use two-stream architectures based on 3D CNNs and trained on a large dataset: one stream processes appearance information from RGB frames while the other processes motion information from optical flow. However, computing optical flow adds latency, which limits its use in real-time applications. You can see in the figure below that the Flow and RGB+Flow streams are as much as ~70 times slower than an RGB stream. Yet RGB+Flow is more accurate than RGB alone, which suggests that 3D convolutions applied to RGB frames do not, on their own, effectively leverage motion information.
In our CVPR paper, we therefore investigate whether optical flow computation can be avoided at test time and propose two training strategies: MERS (Motion Emulating RGB Stream) and MARS (Motion Augmented RGB Stream). Both strategies are based on the principles of distillation and learning under privileged information.
MERS: Motion Emulating RGB Stream
For MERS, we first train a network to mimic the features of the Flow stream, using a standard 3D CNN architecture with RGB frames as input and no explicit optical flow computation. Training minimizes the L2 distance between the features of MERS and those of the Flow stream. We can illustrate the training steps as follows:
MERS achieves similar performance to the Flow stream but is significantly faster.
MERS shows that explicit optical flow computation is not needed at test time. Can we go even further by enhancing RGB features with motion information?
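To make the MERS objective concrete, here is a minimal sketch of the feature-matching loss in plain Python. It is an illustration under our own assumptions, not the released implementation: the function name `l2_feature_loss` is hypothetical, features are represented as flat lists of floats, and we use the squared L2 distance, which is a common way to implement an L2 feature loss.

```python
def l2_feature_loss(mers_features, flow_features):
    """Squared L2 distance between two equally sized feature vectors.

    Hypothetical helper: in practice the features would come from the
    RGB-input network (MERS) and the Flow stream.
    """
    if len(mers_features) != len(flow_features):
        raise ValueError("feature vectors must have the same length")
    return sum((m - f) ** 2 for m, f in zip(mers_features, flow_features))


# Toy example with made-up 3-dimensional features.
flow_features = [0.5, -1.0, 2.0]   # target features from the Flow stream
mers_features = [0.0, -1.0, 1.0]   # features predicted from RGB frames
loss = l2_feature_loss(mers_features, flow_features)
print(loss)  # 1.25
```

The Flow stream only provides targets in this sketch. Driving the loss toward zero pushes the RGB-input network to reproduce flow features without computing optical flow itself.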
MARS: Motion Augmented RGB Stream
Instead of only mimicking flow features, MARS also leverages appearance information by minimizing a cross-entropy loss in addition to the L2 feature loss. MARS is trained by the following steps:
The results show that MARS is ~140 times faster than RGB+Flow and ~2 times faster than MERS+RGB while matching the accuracy of RGB+Flow, demonstrating that it is the most effective way to combine motion and appearance in a single stream at test time.
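The combined MARS objective described above can be sketched in the same way. Again, this is a simplified illustration rather than the actual training code: the function names are hypothetical, the example handles a single sample, and the weight `alpha` that balances the two terms is an assumption introduced here for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross entropy for one sample, computed via log-sum-exp."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def squared_l2(a, b):
    """Squared L2 distance between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mars_loss(logits, label, mars_features, flow_features, alpha=1.0):
    """Classification loss plus weighted feature-matching loss.

    `alpha` is a hypothetical balancing weight, not a value from the text.
    """
    appearance_term = cross_entropy(logits, label)
    motion_term = squared_l2(mars_features, flow_features)
    return appearance_term + alpha * motion_term


# Toy example: 3 action classes and made-up 2-dimensional features.
total = mars_loss(
    logits=[2.0, 0.5, -1.0],
    label=0,
    mars_features=[0.2, 0.8],
    flow_features=[0.0, 1.0],
    alpha=1.0,
)
print(total)
```

The cross-entropy term ties the features to the action labels (appearance), while the feature term keeps them close to the Flow stream (motion). Setting `alpha` to zero would reduce this sketch to ordinary RGB classification training.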