MARS: Motion-augmented RGB stream for action recognition

Published by Philippe Weinzaepfel on 7 June 2019

Nieves CRASTO, Philippe Weinzaepfel

A training strategy based on distillation that leverages both motion and appearance information, achieving state-of-the-art results using a 3D CNN with only RGB frames as input and without explicit optical flow computation.

This blog post presents our CVPR’19 paper, “MARS: Motion-Augmented RGB Stream for Action Recognition”, written in collaboration with the Thoth team at Inria. The code and trained models are available here.

Task and challenges

Recognizing actions in videos requires processing both spatial and temporal information. CNNs have been successful at modeling spatial information, but they have been less effective at modeling temporal information. Current state-of-the-art techniques use two-stream architectures based on 3D CNNs trained on a large dataset: one stream processes appearance information from RGB frames, while the other processes motion information from optical flow. However, computing optical flow adds latency, which limits the use of these methods in real-time applications. As the accompanying figure shows, the Flow and RGB+Flow streams are as much as ~70 times slower than the RGB stream. Yet RGB+Flow is more accurate than RGB alone, which suggests that 3D convolutions applied to RGB frames do not, by themselves, fully exploit motion information.

In our CVPR paper, we therefore investigate whether optical flow computation can be avoided at test time and propose two training strategies: MERS (Motion Emulating RGB Stream) and MARS (Motion Augmented RGB Stream). Both strategies are based on the principles of distillation and learning under privileged information.
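To make the RGB+Flow combination concrete, here is a minimal sketch of late fusion between two streams. It assumes that each stream outputs one score (logit) per action class and that the two streams are combined by averaging their softmax probabilities. That fusion rule is an assumption made for illustration, and the function names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_streams(rgb_logits, flow_logits):
    """Average the class probabilities of the RGB and Flow streams.

    Averaging is an assumed fusion rule, used here only as an example.
    """
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(r + f) / 2 for r, f in zip(rgb_probs, flow_probs)]

def predict(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Made-up scores for one clip and three action classes.
rgb_logits = [2.0, 1.0, 0.0]    # appearance alone favours class 0
flow_logits = [0.0, 3.0, 0.0]   # motion strongly favours class 1

fused = fuse_two_streams(rgb_logits, flow_logits)
print(predict(softmax(rgb_logits)), predict(fused))
```

In this toy case the fused prediction differs from the RGB-only prediction because the Flow stream is confident about a different class. Producing `flow_logits` is the expensive part in practice, since it requires computing optical flow for every clip first.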

MERS: Motion Emulating RGB Stream

For MERS, we train a standard 3D CNN that takes RGB frames as input to mimic the features of the Flow stream, so that no explicit optical flow computation is needed at test time. Training minimizes the L2 distance between the features of MERS and those of the Flow stream. The training steps are as follows:

1. Train the Flow stream to classify actions using the standard cross-entropy loss and freeze its weights.
2. Train MERS to minimize the difference between its features at the penultimate fully connected layer and the features at the same layer of the frozen Flow stream.
3. Train the last fully connected layer of MERS to classify actions using the cross-entropy loss.
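The feature-matching objective used in the second step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the released implementation: features are represented as lists of floats, the squared L2 distance is averaged over the batch (the averaging is our assumption), and the function name `l2_feature_loss` is hypothetical.

```python
def l2_feature_loss(mers_feats, flow_feats):
    """Squared L2 distance between two batches of feature vectors,
    averaged over the batch (the averaging is an assumption)."""
    if len(mers_feats) != len(flow_feats):
        raise ValueError("both batches must contain the same number of clips")
    total = 0.0
    for student, teacher in zip(mers_feats, flow_feats):
        if len(student) != len(teacher):
            raise ValueError("feature vectors must have the same dimension")
        total += sum((s - t) ** 2 for s, t in zip(student, teacher))
    return total / len(mers_feats)

# Toy 2-D features for a batch of two clips (made-up numbers).
flow_feats = [[1.0, 2.0], [0.0, 1.0]]  # frozen Flow stream: treated as constants
mers_feats = [[1.0, 0.0], [0.0, 1.0]]  # MERS, computed from RGB frames only

loss = l2_feature_loss(mers_feats, flow_feats)
print(loss)  # 2.0
```

Because the Flow stream is frozen, its features act as fixed targets: only the MERS weights would be updated to reduce this loss. Optical flow is still needed during training to produce those targets, but not at test time.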

MERS achieves similar performance to the Flow stream but is significantly faster.

MERS shows that explicit optical flow computation is not needed at test time. Can we go even further by enhancing RGB features with motion information?

MARS: Motion Augmented RGB Stream

Instead of only mimicking Flow features, MARS additionally leverages appearance information by minimizing the cross-entropy loss together with the L2 feature loss. MARS is trained in two steps:

1. Train the Flow stream to classify actions using the standard cross-entropy loss and freeze its weights.
2. Train MARS to minimize a weighted combination of two terms: the cross-entropy loss, and the difference between its features at the penultimate fully connected layer and the features at the same layer of the frozen Flow stream.

The results show that MARS is ~140 times faster than RGB+Flow and ~2 times faster than MERS+RGB, and that it matches the accuracy of RGB+Flow. This clearly demonstrates that it is the most effective at combining motion and appearance in a single stream at test time.
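The weighted combination used to train MARS can be sketched as follows for a single clip. This is a minimal illustration, not the released code: the weight is exposed as a hypothetical parameter `alpha` whose value is not given in this post, the function names are ours, and all numbers are made up.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one clip, computed from raw class scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def squared_l2(a, b):
    """Squared L2 distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mars_loss(logits, label, mars_feats, flow_feats, alpha):
    """Classification loss plus alpha times the feature-matching loss.

    `alpha` is a hypothetical name for the weighting hyperparameter.
    """
    return cross_entropy(logits, label) + alpha * squared_l2(mars_feats, flow_feats)

# Made-up values for one clip with two classes and 2-D features.
logits = [0.0, 0.0]        # MARS class scores
label = 0                  # ground-truth action
mars_feats = [1.0, 2.0]    # MARS features, from RGB frames
flow_feats = [1.0, 0.0]    # features of the frozen Flow stream

total = mars_loss(logits, label, mars_feats, flow_feats, alpha=0.5)
print(total)
```

Setting `alpha` to zero in this sketch leaves only the cross-entropy term, which is ordinary training of an RGB stream. Increasing it puts more weight on matching the Flow features, moving the objective closer to the one used for MERS.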
