Task and challenges
Recognizing actions in videos requires processing both spatial and temporal information. Although CNNs have been successful at modeling spatial information, they have been less effective at modeling temporal information. Current state-of-the-art techniques use two-stream architectures based on 3D CNNs trained on a large dataset: one stream processes appearance information from RGB frames, while the other handles motion information from optical flow. However, computing optical flow adds latency, which limits the use of these methods in real-time applications. The figure below shows that the Flow and RGB+Flow streams are as much as ~70 times slower than the RGB stream. At the same time, RGB+Flow is more accurate than RGB alone, which suggests that 3D convolutions applied to RGB frames alone do not fully capture motion information.
In our CVPR paper, we therefore investigate whether optical flow computation can be avoided at test time and propose two training strategies: MERS (Motion Emulating RGB Stream) and MARS (Motion Augmented RGB Stream). Both strategies build on the principles of distillation and learning under privileged information, with optical flow used during training but not at test time.
MERS: Motion Emulating RGB Stream
For MERS, we first train a network to mimic the features of the Flow stream. It uses a standard 3D CNN architecture with RGB frames as its only input, so it needs no explicit optical flow computation to run. Training minimizes the L2 distance between the features of MERS and those of the Flow stream. We can illustrate the training steps as follows:
MERS achieves similar performance to the Flow stream but is significantly faster.
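The MERS objective described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation from the paper: the function name `feature_l2_loss` is hypothetical, the features are represented as flat lists of floats, and the choice of the squared distance (rather than its mean or square root) is an assumption. We also assume that the Flow stream acts as a fixed teacher whose features are simply given as targets.

```python
def feature_l2_loss(mers_feats, flow_feats):
    """Squared L2 distance between MERS features and Flow-stream features.

    Both arguments are flat sequences of floats of the same length.
    flow_feats is treated as a fixed target (assumption: the Flow
    stream is not updated while MERS is trained).
    """
    if len(mers_feats) != len(flow_feats):
        raise ValueError("feature vectors must have the same length")
    return sum((m - f) ** 2 for m, f in zip(mers_feats, flow_feats))


# Toy example: features that MERS computes from RGB frames versus
# features the Flow stream computes from optical flow for the same clip.
mers_feats = [0.5, 1.0, -2.0]
flow_feats = [0.0, 1.0, -1.0]
loss = feature_l2_loss(mers_feats, flow_feats)
```

The loss is zero only when the RGB-based features match the flow-based features exactly, which is what pushes MERS to emulate the Flow stream without ever seeing optical flow as input.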
MERS shows that explicit optical flow computation is not needed at test time. Can we go even further by enhancing RGB features with motion information?
MARS: Motion Augmented RGB Stream
Instead of only mimicking flow features, MARS additionally leverages appearance information: it minimizes a cross-entropy loss in addition to the L2 feature loss. MARS is trained by the following steps:
The results show that MARS is ~140 times faster than RGB+Flow and ~2 times faster than MERS+RGB, while matching the accuracy of RGB+Flow. This demonstrates that MARS is the most effective of these approaches at combining motion and appearance in a single stream at test time.
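To make the MARS objective concrete, here is a minimal sketch of a loss that combines cross entropy on the action labels with the L2 feature loss against the Flow stream. The function names and the weighting factor `alpha` are assumptions introduced for illustration; the text above only states that both terms are minimized together, not how they are balanced.

```python
import math


def cross_entropy(logits, label):
    """Cross entropy of a softmax over `logits` for the true class `label`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def feature_l2(rgb_feats, flow_feats):
    """Squared L2 distance between two equal-length feature vectors."""
    if len(rgb_feats) != len(flow_feats):
        raise ValueError("feature vectors must have the same length")
    return sum((r - f) ** 2 for r, f in zip(rgb_feats, flow_feats))


def mars_loss(logits, label, rgb_feats, flow_feats, alpha=1.0):
    """Classification loss plus a weighted feature-matching loss.

    alpha is a hypothetical hyperparameter that trades off appearance
    (cross entropy) against motion emulation (L2 feature loss).
    """
    return cross_entropy(logits, label) + alpha * feature_l2(rgb_feats, flow_feats)


# Toy example with 3 action classes and 2-dimensional features.
total = mars_loss(
    logits=[2.0, 0.5, -1.0],
    label=0,
    rgb_feats=[0.2, 0.8],
    flow_feats=[0.0, 1.0],
    alpha=0.5,
)
```

Setting `alpha` to zero in this sketch reduces it to ordinary RGB classification training, while a very large `alpha` makes the feature-matching term dominate, which is closer to the MERS objective.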