MARS: Motion-Augmented RGB Stream for Action Recognition

A training strategy based on distillation that leverages both motion and appearance information, achieving state-of-the-art results with a 3D CNN that takes only RGB frames as input, without explicit optical flow computation.

This blog post presents our CVPR’19 paper “MARS: Motion-Augmented RGB Stream for Action Recognition”, carried out with the Thoth team at Inria. The code and trained models are available here.

[Figure: MARS cartwheel animation and overview figure]

 

Task and challenges

Action recognition in videos requires processing both spatial and temporal information. Although CNNs have been very successful at modeling spatial information, their ability to model temporal information has been more limited. Current state-of-the-art methods use 3D CNN based two-stream architectures, trained on large datasets, in which one stream processes appearance information from RGB frames while the other handles motion information from optical flow. However, computing optical flow introduces latency when recognizing videos, which obviously limits its use in real-time applications. As the figure below shows, the Flow and RGB+Flow streams are up to ~70 times slower than an RGB stream alone. Yet RGB+Flow is more accurate than RGB, which indicates that 3D convolutions on RGB frames alone cannot effectively leverage motion information.

[Figure: 3D two-stream architecture]

In our CVPR paper, we therefore investigate whether optical flow computation can be avoided at test time and propose two training strategies: MERS (Motion Emulating RGB Stream) and MARS (Motion Augmented RGB Stream). Both strategies are based on the principles of distillation and learning under privileged information.

 

[Figure: MERS training procedure]

MERS: Motion Emulating RGB Stream

For MERS, we first train a network to mimic the features of the Flow stream, using a standard 3D CNN architecture with RGB frames as input and no explicit optical flow computation. Training minimizes the L2 distance between the features of MERS and those of the Flow stream (a minimal code sketch follows the list below). The training steps are:

  • Train the flow stream to classify actions using the standard cross entropy loss and freeze its weights.
  • Train MERS to minimize the L2 distance between the features of its penultimate fully connected layer and the features at the same layer of the frozen Flow stream.
  • Train the last fully connected layer of MERS to classify actions using cross entropy loss.
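
To make the second step concrete, here is a minimal PyTorch-style sketch of the feature-matching loss. It assumes two 3D CNNs (for example the 3D ResNeXt-101 backbone used in our released code) exposing a hypothetical `penultimate_features` method that returns the activations of the penultimate fully connected layer; the names `mers_net`, `flow_net`, `rgb_clip` and `flow_clip` are illustrative, not identifiers from the official repository.

```python
import torch
import torch.nn.functional as F

def mers_feature_loss(mers_net, flow_net, rgb_clip, flow_clip):
    """Feature-matching loss for step 2: make MERS (RGB input) mimic the
    penultimate-layer features of the frozen Flow stream."""
    with torch.no_grad():                           # Flow stream is frozen (step 1)
        target_feat = flow_net.penultimate_features(flow_clip)
    pred_feat = mers_net.penultimate_features(rgb_clip)
    # Mean squared error over the feature vectors, i.e. a scaled squared L2 distance
    return F.mse_loss(pred_feat, target_feat)

# Typical use inside a training loop (only the MERS weights are optimized):
#   loss = mers_feature_loss(mers_net, flow_net, rgb_clip, flow_clip)
#   loss.backward(); optimizer.step()
```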

MERS achieves similar performance to the Flow stream but is significantly faster.

[Figure: MERS vs. Flow accuracy and inference time]

MERS shows that explicit optical flow computation is not needed at test time. Can we go even further by enhancing RGB features with motion information?

 

[Figure: MARS training procedure]

MARS: Motion Augmented RGB Stream

Instead of only mimicking flow features, MARS additionally leverages appearance information by minimizing a cross-entropy loss in addition to the L2 feature loss (a sketch of the combined loss follows the list below). MARS is trained with the following steps:

  • Train the flow stream to classify actions using the standard cross entropy loss and freeze its weights.
  • Train MARS to minimize a weighted combination of two terms: the L2 distance between the features of its penultimate fully connected layer and the features at the same layer of the frozen Flow stream, and the cross-entropy classification loss.
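
As a rough illustration of the combined objective, the sketch below adds the cross-entropy term to the same feature-matching loss as before. The weighting factor `alpha` and the `classifier` / `penultimate_features` methods are illustrative assumptions; see the released code for the actual loss weights and network definitions.

```python
import torch
import torch.nn.functional as F

def mars_loss(mars_net, flow_net, rgb_clip, flow_clip, labels, alpha=1.0):
    """Weighted combination of the L2 feature-matching loss (motion) and the
    cross-entropy classification loss (appearance). `alpha` is illustrative."""
    with torch.no_grad():                                   # frozen Flow stream
        flow_feat = flow_net.penultimate_features(flow_clip)
    feat = mars_net.penultimate_features(rgb_clip)
    logits = mars_net.classifier(feat)                      # last fully connected layer
    feature_term = F.mse_loss(feat, flow_feat)              # mimic motion features
    class_term = F.cross_entropy(logits, labels)            # keep the appearance/classification signal
    return class_term + alpha * feature_term
```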

The results show that MARS is ~140 times faster than RGB+Flow and ~2 times faster than MERS+RGB, while matching the accuracy of RGB+Flow. This clearly demonstrates that MARS is the most effective way to combine motion and appearance in a single stream at test time.

[Figure: MARS accuracy and inference time comparisons]
