28th August 2019
Title: "Visual, semantic and cross-modal search".
Abstract: Visual search can be formulated as a ranking problem where the goal is to order a collection of images by decreasing similarity to the query. Recent deep models for image retrieval have outperformed traditional methods by leveraging ranking-tailored loss functions such as the contrastive loss or the triplet loss. Yet, these losses do not optimize the global ranking. In the first part of this presentation, we will see how one can directly optimize the global mean average precision by leveraging recent advances in listwise loss formulations. In the second part, the presentation will move beyond instance-level search and consider the task of semantic image search in complex scenes, where the goal is to retrieve images that share the same semantics as the query image. Despite this task being more subjective and more complex, one can show that semantic ranking of visual scenes is performed consistently across a pool of human annotators, and that suitable embedding spaces can be learnt for semantic retrieval. The last part will focus on cross-modal search. More specifically, we will consider the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved by learning a shared embedding space into which either modality can be embedded. In this last part we will show how to enrich the embedding space by disentangling parts-of-speech (PoS) in the accompanying captions.
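To make the contrast concrete, here is a minimal sketch (hypothetical illustration, not the speaker's code) of the standard triplet loss alongside the average precision of a ranked list. The triplet loss only enforces a local margin between one positive and one negative, whereas mean average precision, the metric that listwise formulations optimize directly, depends on the entire ranking:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss: push the positive at least `margin` closer to
    the anchor than the negative. A purely local constraint."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def average_precision(ranked_labels):
    """Average precision of a ranked list of binary relevance labels.
    mAP is the mean of this quantity over all queries; it depends on
    the positions of every relevant item in the global ranking."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)
```

For example, `average_precision([1, 0, 1])` evaluates the ranking as a whole (here 5/6), while a triplet loss of zero only certifies one margin constraint and says nothing about where other relevant images land in the list.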