Fine-grained action retrieval through multiple parts-of-speech embeddings

Published by Diane Larlus on 2 September 2019

Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 October-2 November, 2019

@InProceedings{wray2019fgar,
  author    = {Wray, Michael and Larlus, Diane and Csurka, Gabriela and Damen, Dima},
  title     = {Fine-Grained Action Retrieval through Multiple Parts-of-Speech Embeddings},
  booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2019}
}

Abstract

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space, that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
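The pipeline described above (one cross-modal space per PoS tag, feeding an integrated space, all trained with a combined loss) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the mapping functions, how PoS outputs are combined, or the form of the losses. The linear maps, the concatenation step, the hinge-style ranking loss, the two-tag set, and all dimensions and names (`POS_TAGS`, `embed`, `ranking_loss`, `total_loss`, `retrieve`) are assumptions made for the example, and the features are random.

```python
import numpy as np

rng = np.random.default_rng(0)

POS_TAGS = ["verb", "noun"]                       # assumed tag set, for illustration only
D_VIDEO, D_TEXT, D_POS, D_JOINT = 16, 12, 8, 10   # hypothetical feature sizes


def l2_normalise(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def init_params():
    """One video map and one text map per PoS tag, plus maps into the integrated space."""
    params = {"pos": {}, "joint": {}}
    for tag in POS_TAGS:
        params["pos"][tag] = {
            "video": rng.normal(size=(D_VIDEO, D_POS)),
            "text": rng.normal(size=(D_TEXT, D_POS)),
        }
    joint_in = D_POS * len(POS_TAGS)
    params["joint"]["video"] = rng.normal(size=(joint_in, D_JOINT))
    params["joint"]["text"] = rng.normal(size=(joint_in, D_JOINT))
    return params


def embed(params, video_feats, text_feats_by_pos):
    """Return per-PoS embeddings and integrated embeddings for both modalities."""
    pos_out = {}
    for tag in POS_TAGS:
        pos_out[tag] = {
            "video": l2_normalise(video_feats @ params["pos"][tag]["video"]),
            "text": l2_normalise(text_feats_by_pos[tag] @ params["pos"][tag]["text"]),
        }
    # Assumption: PoS outputs are combined by concatenation before the integrated space.
    video_cat = np.concatenate([pos_out[t]["video"] for t in POS_TAGS], axis=1)
    text_cat = np.concatenate([pos_out[t]["text"] for t in POS_TAGS], axis=1)
    joint = {
        "video": l2_normalise(video_cat @ params["joint"]["video"]),
        "text": l2_normalise(text_cat @ params["joint"]["text"]),
    }
    return pos_out, joint


def ranking_loss(video_emb, text_emb, margin=0.2):
    """Hinge loss: matched pair i should outscore mismatched pairs in both directions."""
    sim = video_emb @ text_emb.T                 # (N, N) similarity matrix
    pos = np.diag(sim)[:, None]                  # similarity of each matched pair
    v2t = np.maximum(0.0, margin + sim - pos)    # video i against every text
    t2v = np.maximum(0.0, margin + sim - pos.T)  # text j against every video
    mask = 1.0 - np.eye(sim.shape[0])            # ignore the matched pairs themselves
    return float(((v2t + t2v) * mask).sum() / mask.sum())


def total_loss(pos_out, joint):
    """One reading of the combined objective: per-PoS terms plus an integrated-space term."""
    pos_aware = sum(ranking_loss(pos_out[t]["video"], pos_out[t]["text"]) for t in POS_TAGS)
    pos_agnostic = ranking_loss(joint["video"], joint["text"])
    return pos_aware + pos_agnostic


def retrieve(query_emb, gallery_emb):
    """Rank gallery items for each query by cosine similarity, best first."""
    return np.argsort(-(query_emb @ gallery_emb.T), axis=1)


# Toy usage with random features standing in for real video and caption features.
videos = rng.normal(size=(5, D_VIDEO))
texts = {tag: rng.normal(size=(5, D_TEXT)) for tag in POS_TAGS}
params = init_params()
pos_out, joint = embed(params, videos, texts)
loss = total_loss(pos_out, joint)
text_to_video = retrieve(joint["text"], joint["video"])
video_to_text = retrieve(joint["video"], joint["text"])
```

Because retrieval happens only in the integrated space, the same `retrieve` call serves both directions by swapping query and gallery. In a real system the parameters would be optimised by gradient descent on `total_loss`; that step is omitted here.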

We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.
