Fine-grained action retrieval made possible thanks to a new annotated dataset and part-of-speech embeddings.
Using words to search for a picture or vice versa is one instance of ‘cross-modal visual search’.
Methods for this kind of search generally transform an image or a text into a representation vector within a space that’s shared across all modalities (text, image, video). We call this space an embedding space, and it’s built in such a way that you can directly compare the representation of a text query and the representation of an image within that space to decide whether or not they’re relevant to each other. The transformations which build the representations in the embedding space are learned from data and are called embedding functions.
Below you can see the two functions represented by green and blue arrows that, respectively, embed text and images, making sure that, in the embedding space, all the items related to cats are close to each other and far away from all the items related to dogs.
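To make this a little more concrete, here’s a minimal sketch in PyTorch of what a pair of such embedding functions could look like. Everything below (the toy encoders, the feature dimensions, the random inputs) is a placeholder for illustration, not the actual model from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbedder(nn.Module):
    """Toy text branch: projects a text feature vector into the shared space."""
    def __init__(self, in_dim: int, dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # L2-normalize so a dot product between embeddings is a cosine similarity
        return F.normalize(self.proj(x), dim=-1)

class VideoEmbedder(nn.Module):
    """Toy video branch: projects pre-extracted clip features into the same space."""
    def __init__(self, in_dim: int, dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

# Retrieval: embed the query and the gallery, then rank clips by similarity
text_encoder = TextEmbedder(in_dim=300)
video_encoder = VideoEmbedder(in_dim=2048)
query = text_encoder(torch.rand(1, 300))           # one text query
gallery = video_encoder(torch.rand(50, 2048))      # 50 candidate video clips
scores = query @ gallery.T                         # cosine similarities
ranking = scores.argsort(dim=-1, descending=True)  # most relevant clips first
```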
Up to now, such an embedding space had been used for the task of action retrieval, but only at a fairly coarse level. Without a good annotated dataset it wasn’t possible to retrieve very specific, fine-grained items. This changed with the appearance of the EPIC-Kitchens dataset, which considers the fine-grained aspect of actions in the context of cooking (see the video clip). This allowed us to tackle the new task of ‘fine-grained action retrieval’, where we want to query and retrieve short videos that show very specific actions, e.g. someone ‘chopping a leek’, without also retrieving a video of someone ‘slicing an onion’.
Fine-grained action retrieval departs from existing action recognition work in the following ways:
- Prior work on action retrieval focused on coarse actions. Going to a finer level makes video retrieval more realistic: people usually know exactly what they’re looking for and don’t want the results cluttered with irrelevant clips. We also handle a larger number of actions.
- Prior work on fine-grained action recognition focused on classification, not retrieval. Here, the action description associated with a video clip is free-form text, which means we’d also like to be able to retrieve new actions we’ve never seen before, simply by extrapolating from what we have seen. For instance, if we’ve seen a video of someone cutting a tomato and another one of someone washing an apple, we’d like to be able to retrieve a video of someone cutting an apple.
To do this, we had to ask ourselves how to change or improve the standard embedding-based approach described earlier so that it could be applied to fine-grained action retrieval.
We began by considering that, quite simply, an action involves an actor, an act and a list of objects related to the act. Based on this, the idea is to extract similar types of information from the text by using a parser, whose job is to decompose a sentence into parts of speech such as verbs, nouns or adjectives. We then use this information to learn multiple specialized embedding spaces. For instance, an embedding space built solely from verbs focuses on the act and ignores the objects: it will consider cutting apples, carrots or onions as identical and try to embed them as close as possible to each other, regardless of the object being cut. Similarly, an embedding space specialized for nouns will consider that peeling, cutting or washing an apple are similar.
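To give an idea of what this parsing step looks like, here’s a toy example using the off-the-shelf spaCy parser (a choice made here purely for illustration; it isn’t necessarily the parser used in the paper):

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def split_parts_of_speech(caption: str):
    """Split an action description into its verb and noun components."""
    doc = nlp(caption)
    verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
    nouns = [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    return verbs, nouns

print(split_parts_of_speech("chop the leek into small pieces"))
# e.g. (['chop'], ['leek', 'piece'])
```

The verbs then feed the embedding space that focuses on the act, while the nouns feed the one that focuses on the objects.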
We then build a final embedding space which summarizes the individual part-of-speech spaces and learns to combine them in the best way. This final embedding space is the one used to perform fine-grained action retrieval. How it’s constructed can be illustrated like this:
Intuitively, splitting the text into multiple part-of-speech embedding spaces produces complementary views of the data. It also injects prior information that comes for free with the text. We think it also helps generalize across fine-grained actions involving the same “act” (such as cutting x, y or z) or the same objects (for instance, all activities involving an onion: frying, slicing, etc.).
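Here’s a rough sketch of the text side of this idea, with one head per part of speech and a final head that learns to combine them. The layers and dimensions are placeholders, the video branch (not shown) would mirror this structure, and the paper’s actual embedding functions are more involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoSTextEmbedder(nn.Module):
    """Toy text branch: one specialized head per part of speech plus a combined head."""
    def __init__(self, word_dim: int = 300, dim: int = 256):
        super().__init__()
        self.verb_head = nn.Linear(word_dim, dim)    # 'act' space: ignores the objects
        self.noun_head = nn.Linear(word_dim, dim)    # 'object' space: ignores the act
        self.final_head = nn.Linear(2 * dim, dim)    # learns to combine the two views

    def forward(self, verb_vec, noun_vec):
        z_verb = F.normalize(self.verb_head(verb_vec), dim=-1)
        z_noun = F.normalize(self.noun_head(noun_vec), dim=-1)
        z_final = F.normalize(self.final_head(torch.cat([z_verb, z_noun], dim=-1)), dim=-1)
        return z_verb, z_noun, z_final  # one embedding per space; z_final is used for retrieval
```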
To learn these multiple embedding spaces, we propose an architecture composed of several parts: an input branch for the text, an input branch for the video sequences, and multiple embedding functions. These embedding functions are trained with a combination of several part-of-speech aware losses and one part-of-speech agnostic loss. The architecture is illustrated below and there are more details in the paper, such as the nature of the different embedding functions and the training strategy. You’ll also find extensive experiments that show the benefit of the method compared to several baselines on two tasks: the first is the fine-grained video retrieval we talked about earlier, on the EPIC-Kitchens dataset, and the second is general video retrieval on the MSR-VTT dataset.
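To make the training objective a bit more concrete, here’s a simplified sketch in which every space, whether part-of-speech aware or agnostic, is trained with a standard triplet ranking loss. The loss, sampling strategy and weights below are placeholders; see the paper for the actual formulation:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """Standard triplet ranking loss on cosine similarities (inputs are L2-normalized)."""
    pos_sim = (anchor * positive).sum(-1)
    neg_sim = (anchor * negative).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

def total_loss(text_emb, video_emb, neg_video_emb, weights=(1.0, 1.0, 1.0)):
    """
    Each argument is a dict with one embedding per space, e.g.
    {'verb': ..., 'noun': ..., 'final': ...}. The 'verb' and 'noun' terms play the
    role of part-of-speech aware losses; the 'final' term is the agnostic one.
    """
    loss = 0.0
    for w, space in zip(weights, ("verb", "noun", "final")):
        loss = loss + w * triplet_loss(text_emb[space], video_emb[space], neg_video_emb[space])
    return loss
```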