Abstract: There is huge value in enabling machines to understand and interpret, through vision, written information in the unconstrained conditions of the world around us. At the same time, our visual interpretation capacity is jointly acquired with the linguistic structures we use to describe the world – it would be desirable for machines to be able to learn in a similar way.
My research group at the Computer Vision Centre focuses on the design of computational models at the meeting point between vision and language that efficiently exploit available textual information to solve computer vision challenges. We investigate new technologies to give machines the capacity to read, as well as methods that enable computer vision models to learn by properly exploiting textual information, in or about images, and to use natural language interfaces to interact with humans.
In this talk I will discuss recent research in the group on modelling the interplay between visual and textual information for computer vision applications. I will focus on our recent work on image captioning and visual question answering, and during the presentation I will also touch upon scene text recognition methods, cross-modal image retrieval, joint visual-textual embeddings, semantic retrieval and self-supervised learning.
Date: 26th April 2019