A new approach to image search uses images returned by traditional search methods as nodes in a graph neural network through which similarity signals are propagated, achieving improved ranking in cross-modal retrieval.
If you’re like most people, you probably search for an image on the web by typing the words you think best describe what you’re looking for into a search engine. After browsing through the results, you might try to narrow them down using different words. As you go through this process, you likely don’t consider the many complicated steps that are being taken by the search engine behind the scenes in its endeavour to provide you with an image that matches your query.
Yet searching for an image from text is neither simple nor straightforward. The search engine must take words (often loaded with meaning) and try to match them with representations of images based on pixels. This is difficult because the words in a text query generally have a meaning that is impossible to directly compare or match with the pixels that constitute an image. This issue is referred to as the ‘semantic gap’. To complicate matters further, if you’re anything like me, you might use queries that are only indirectly connected to the image you’re looking for. This problem is known as the ‘intent gap’, which describes the difficulty inherent in trying to understand the thoughts and intentions of a user.
The semantic and intent gaps are best illustrated with examples. Recently, I used a search engine to determine which landmarks and buildings I might be interested in visiting during a stay in an unfamiliar city. One rather tedious approach would have been to first carry out an extensive search for a list of points of interest and then query them individually. Instead, I used a broad query—the name of the city—so that I could quickly browse through images of the points of interest and decide which ones I’d like to visit.
I adopted a similar strategy this morning when I discovered that some of the zucchini in my vegetable patch were looking a bit weird. I had no idea of the cause and wanted to get an idea of what the problem might be, as well as to find advice on what to do. Because I didn’t want to describe the symptoms (I didn’t know which were the most important ones and felt lazy), I instead decided to use an image search as a guide by simply querying ‘diseased zucchini’. I was then able to compare the symptoms I could see on the plant to those shown in the results and, from there, navigate to a website that described the most likely cause.
In both examples, I wasn’t trying to find (or use) words that precisely described my needs. Instead, I used a vague, high-level query in the hope that—with the range of results provided, and the speed with which one can grasp the content of an image—I would quickly find the information I needed.
Trying to bridge the semantic gap has always been a challenge. Fortunately, advances in technology (in particular, deep learning) have helped make substantial progress in this area. Specific neural architectures mean that a machine is now better able to understand the content of an image and is therefore capable of extracting higher-level semantic information from pixels. Machines can even automatically caption an image, or supply answers to an open-ended question about an image (1). Despite all this progress, though, the semantic gap has not been fully bridged—primarily because words are not able to describe everything that an image communicates.
Conversely, addressing the intent gap in search is considered one of the most difficult tasks in information retrieval. Indeed, the inability to know what a person is thinking when they enter words as a query would appear to be an insurmountable problem. But by exploiting the user context (e.g. their location, device, and the time of their query) and their previous search history, this problem can be partially alleviated. In particular, using huge collections of logs that contain both the queries and the clicked objects associated with those queries, machine learning techniques can be used to uncover the subtle relationship between the query, its context and the desired result.
In our project, we build on recent state-of-the-art developments—namely ‘text as proxy’, ‘joint embeddings’, and ‘wisdom of the crowd’—to address both the semantic gap and the intent gap. We go beyond what has been achieved by these three approaches by introducing a graph neural network (GNN) that successfully improves image search.
In the early era of web image search, ‘text as a proxy’ was the primary strategy adopted by most search engines. Because images on the web are usually embedded in web pages, they are generally surrounded by text (including the ‘alt text’ of the image) and structured metadata. This information can be used to match images with queries. It’s for this reason that an image with its associated bundle of information is referred to as a ‘multimodal object’. If we’re lucky, an image caption or title within this bundle will accurately describe the content of the image and this text can be relied on for the retrieval task.
More generally, the idea is to index an image not by its visual content but by its associated text. Existing mono-modal text-based retrieval methods can then be applied to rank the objects by relevance score. Additional metrics—such as image quality (determined by neural models that are able to quantify an image’s aesthetics), the PageRank (2) of the associated web page and how new, or ‘fresh’, the image is—can also be used to determine the relevance score of an image. Based on a training set that contains queries and multimodal objects that have been manually annotated for relevance, a machine can learn how to combine all of these so-called weak relevance signals (including the relevance score derived from the text-retrieval engine) to obtain an even more accurate ranking.
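To make this concrete, here is a minimal sketch (not the production ranker) of a pointwise model that learns to weight a handful of weak relevance signals; the feature values, labels and learning-rate settings are invented for illustration.

```python
import numpy as np

# Hypothetical weak relevance signals for the candidate images of one query:
# columns = [text retrieval score, aesthetic quality, PageRank, freshness].
X = np.array([
    [2.3, 0.7, 0.12, 0.9],
    [1.1, 0.9, 0.45, 0.2],
    [0.8, 0.4, 0.05, 1.0],
])
y = np.array([1, 1, 0])  # manual relevance labels from the training set

# A minimal pointwise ranker: logistic regression fit by gradient descent.
w = np.zeros(X.shape[1])
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted relevance probability
    w -= 0.1 * X.T @ (p - y) / len(y)  # gradient step on the log loss

scores = X @ w              # combined relevance score per image
ranking = np.argsort(-scores)  # re-ranked result list
```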
Although this approach succeeds in integrating a broad set of elements, some weaknesses remain. First, we can’t be sure that the text used as a proxy for the image is indeed a good summary of its content, or of all the ways that this content could be expressed in a query. These pieces of text are often written by everyday users (i.e. user-generated content) in a specific context, where there’s no obligation for precision or even relevance between the choice of image and the text (or vice versa). Indeed, images and text found in the same page can provide complementary but quite different kinds of information that are not necessarily strongly associated. Second, it is universally accepted that describing everything about an image with words is impossible, particularly in terms of the emotion that it arouses or the aesthetic perceptions, which remain very individual. Finally, any approach that uses a manually annotated dataset entails the costs of those annotations (which are generally commissioned or purchased).
The ‘joint embeddings’ family of methods is driven mainly by scientists from the computer vision community and aims to find common representations for images and pieces of text, such as phrases or sentences. In a nutshell, the goal is to design ‘projection’ functions that transform objects from their original spaces—RGB (red, green, blue) pixels for images, and words or sub-words for text—into a new common space, where images and text can then be matched together. These projected objects are high-dimensional numerical vectors called ‘embeddings’.
From these projections, the text describing an image (i.e. one of its perfect captions) can be expected to have an embedding that lies very close, in vector space, to the embedding of its corresponding image. If such a mapping can be built from the data, the user's query text can then be projected and used to look for the images that lie closest to it in this new common space. Such images are considered the most relevant.
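As a rough illustration of the idea (with toy projection heads standing in for real vision and language encoders, and random features in place of an actual collection), retrieval in the joint space can be sketched as follows.

```python
import torch
import torch.nn.functional as F

# Toy projection heads; in practice these sit on top of deep backbones.
img_proj = torch.nn.Linear(2048, 256)   # e.g. on top of ResNet features
txt_proj = torch.nn.Linear(768, 256)    # e.g. on top of a sentence encoder

image_feats = torch.randn(1000, 2048)   # candidate image descriptors
query_feat = torch.randn(1, 768)        # the user's text query

# Project both modalities into the shared space and L2-normalise.
img_emb = F.normalize(img_proj(image_feats), dim=-1)
qry_emb = F.normalize(txt_proj(query_feat), dim=-1)

# Relevance = cosine similarity in the joint space; keep the 10 closest images.
sims = qry_emb @ img_emb.T
top10 = sims.topk(10, dim=-1).indices
```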
Of course, these projection functions are highly complex and typically rely on deep neural architectures. For a machine to learn them requires training data, consisting of a universal collection of thousands (if not millions) of images annotated with one or more sentences that perfectly describe their content. Additionally, for the learning to be feasible and for generalization purposes, the descriptions need to be ‘clean’. In other words, they must be purely factual and observational and free of any subjective aspects (emotions, beauty, etc.) or named entities (proper nouns). Such requirements make training data exceedingly difficult to collect.
The main weaknesses of this approach are the high cost of the human annotations required and the sheer impossibility of covering all possible concepts and domains of human knowledge. Additionally, the maintenance of huge, annotated collections of images becomes increasingly difficult over time because new concepts and entities appear every day. For these reasons, joint embeddings can only partially address the semantic gap on the annotation side. The same coverage problem applies to the concepts and words that appear in queries: the likelihood of a mismatch between the (often poorly expressed) user query and the training data is pretty high, even before considering the challenges posed by the intent gap!
Who better to tell if an image is relevant to a query than users themselves? This ‘wisdom of the crowd’ approach captures several subjective, time-varying and contextualized factors that a human annotator would never be able to figure out. As an example, consider the query words ‘brown bear’ for an image search. It’s highly likely that the images clicked on in the result will indeed be brown bears. However, these images will be diverse in content (i.e. some will be real brown bears whilst others will represent toys). Additionally, the most attractive images are the ones generally clicked on. Based on a very large collection of image search logs containing several million time-stamped queries covering very diverse topics and their associated clicked images, though, it is relatively easy to characterize images with textual queries. With this information in mind, one of the two previous approaches can then be applied. For instance, in text as a proxy, if the images are indexed with their frequently associated queries, a simple mono-modal text search algorithm can be used to find images that are relevant to a new query. Alternatively, a new joint-embedding model can be trained based on huge crowd-annotated training collections of image-text pairs.
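As a toy illustration of the first option, the sketch below builds a 'crowd caption' for each image from a hypothetical click log; the queries and image identifiers are invented, and a real system would of course work at a very different scale.

```python
from collections import Counter, defaultdict

# Hypothetical search log entries: (query text, image id that was clicked).
log = [("brown bear", "img1"), ("grizzly", "img1"), ("brown bear", "img2"),
       ("teddy bear", "img3"), ("brown bear", "img1")]

# Index each image by the queries that most often led to a click on it;
# this crowd-sourced text can then be searched like any other text field.
crowd_text = defaultdict(Counter)
for query, image in log:
    crowd_text[image][query] += 1

top_queries = {img: [q for q, _ in c.most_common(3)]
               for img, c in crowd_text.items()}
# e.g. top_queries["img1"] == ["brown bear", "grizzly"]
```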
Unfortunately, the ‘wisdom’ in the wisdom of the crowd is not, in fact, all that wise. It’s well known that ‘click signals’ are both noisy and biased. Among the biases that exist, the ‘position bias’—i.e. the likelihood that images are clicked is greater for those placed higher in the list or grid of a page, irrespective of their relevance—is a relatively straightforward one to understand and mitigate. But there are several other kinds of biases that must be considered too, including the trust bias, layout bias and self-selection bias. Removing bias and noise is a major challenge in exploiting click information. Another difficulty arises from the number of new, freshly published images that are not yet linked to any queries (or are linked to only a few) and that have little text representation. For these images, the wisdom of the crowd would appear to be of little use.
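One common way to mitigate position bias (not necessarily the one used in any particular production system) is inverse propensity weighting: clicks at ranks that users rarely examine are up-weighted. The click log and propensity values below are invented for illustration.

```python
# Hypothetical click log: (query id, image id, rank at which it was shown, clicked?).
log = [
    ("q1", "imgA", 1, 1),
    ("q1", "imgB", 4, 0),
    ("q1", "imgC", 7, 1),
]

# Examination propensities per rank (assumed known, e.g. estimated from
# randomisation experiments); lower positions are examined less often.
propensity = {1: 0.9, 4: 0.45, 7: 0.2}

# Inverse-propensity-weighted click counts de-bias the position effect:
# a click at rank 7 counts for more than a click at rank 1.
debiased = {}
for _, img, rank, clicked in log:
    debiased[img] = debiased.get(img, 0.0) + clicked / propensity[rank]
```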
An obvious strategy for overcoming this issue is to assume that if a new image is visually close to a popular image (i.e. one with a high number of clicks for the given query), then the new image is likely to be relevant and popular too. Propagating these kinds of ‘weak’ relevance signals within a network of visually linked images comes quite naturally, and is precisely what inspired us in our work. We generalize the propagation mechanism in a more principled way using the graph formalism and, in particular, graph convolution networks (GCNs). GCNs constitute a general framework that learns how to propagate signals in a graph to solve a task, and they are the core method we have developed for image search.
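Before the learned version, here is a hand-tuned sketch of that intuition: relevance scores diffuse along visual-similarity edges, so a new, clickless image inherits some of the popularity of its look-alike neighbours. The tiny graph and the damping factor are invented for illustration; our GCN learns the propagation mechanism from data instead.

```python
import numpy as np

# Adjacency of a tiny graph of visually linked images, weighted by similarity.
W = np.array([
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.6],
    [0.0, 0.6, 0.0],
])
# Initial weak relevance, e.g. de-biased click popularity for the query;
# the third image is new and has no clicks yet.
r = np.array([1.0, 0.4, 0.0])

# Row-normalise the similarities and repeatedly mix each node's score
# with its neighbours' scores.
P = W / W.sum(axis=1, keepdims=True)
for _ in range(10):
    r = 0.5 * r + 0.5 * P @ r   # hand-tuned damping factor of 0.5
```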
Our idea for improving image search is a simple one: to combine the best features of each of these three approaches. We wanted to take advantage of the abundant and easily available text descriptions of images (captions, surrounding text and associated queries) even if they are somewhat noisy, incomplete or occasionally unrelated to the image content. We also wanted to exploit other kinds of weak relevance signals, such as the PageRank of the image web page, the unbiased popularity of the image and its freshness. Finally, we wanted to leverage recent computer vision models, such as ResNet (3), that can extract very powerful semantic representations of images that are especially useful when comparing two images.
Our method initially relies on the text as a proxy strategy. Given a user’s text query, we build an initial set of retrieved images from a standard text-based search engine using the associated noisy textual descriptions. It’s worth pointing out that observing the set as a whole can reveal some interesting patterns and provide hints on how these multimodal objects should be sorted. The next steps consist of considering the whole set of objects, analysing how their weak relevance signals are distributed and, finally, learning to propagate them so that a more accurate global relevance measure can be computed for every object.
Next, we adopt a graph formalism to build a graph from this initial set. The nodes in our graph are the multimodal objects (i.e. the images with their associated text) that are returned by the search engine. Nodes have several features, such as the textual relevance score (computed by the text search engine), the PageRank of the page, the freshness of the image and its de-biased popularity. The edges in the graph are defined by the visual nearest-neighbour relationship between objects. In other words, we compute the visual similarity between every image in the set and retain the edges between images that are the most similar. Each edge has a weight, which is determined by the visual similarity between the images at its two end points.
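A minimal sketch of this graph-construction step, assuming cosine similarity over visual descriptors and an illustrative choice of k, might look like this.

```python
import numpy as np

def build_knn_graph(img_emb, k=5):
    """Connect each retrieved image to its k visually closest neighbours.

    img_emb: (n, d) array of visual descriptors (e.g. ResNet features).
    Returns a list of weighted edges (i, j, cosine similarity).
    """
    x = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)           # no self-loops
    edges = []
    for i in range(len(x)):
        for j in np.argsort(-sims[i])[:k]:    # k nearest neighbours of i
            edges.append((i, int(j), float(sims[i, j])))
    return edges
```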
Once the graph has been built, we apply a GCN that propagates the weak relevance signals of the nodes through the graph. More precisely, these signals are first transformed locally and then transferred to the neighbours following a message-passing scheme (4). This procedure can be repeated multiple times, corresponding to multistep (or multi-hop) propagation in the graph. The important thing here is that the message-passing equations use the edge weights (i.e. the visual similarity between nodes). The algorithm allows these weights to be modulated, or fine-tuned, so that refined visual similarities more suitable for the retrieval task at hand can be derived automatically. The transformation and feature-aggregation functions have parameters that are learned during a training phase. More precisely, the training stage exploits a manually annotated collection of queries and multimodal objects, and the parameters of the model are learned to produce global relevance scores for the nodes after propagation. We indirectly maximize standard information retrieval metrics, such as Precision@10 (5) and normalized discounted cumulative gain (6), by using ‘proxy’ objectives (i.e. pointwise differentiable objective functions).
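The sketch below shows one weighted message-passing layer in the spirit of this scheme; it is a simplification for illustration, not our exact architecture.

```python
import torch
import torch.nn.functional as F

class WeightedMessagePassing(torch.nn.Module):
    """One propagation step: transform node features locally, then aggregate
    them from neighbours using the (visual-similarity) edge weights."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin_self = torch.nn.Linear(dim_in, dim_out)
        self.lin_neigh = torch.nn.Linear(dim_in, dim_out)

    def forward(self, h, adj):
        # h:   (n, dim_in) node features (the weak relevance signals)
        # adj: (n, n) matrix of edge weights (0 where there is no edge)
        neigh = adj @ self.lin_neigh(h)           # weighted sum of messages
        return F.relu(self.lin_self(h) + neigh)   # combine with own features

# Stacking a couple of such layers and reading out one scalar per node gives
# a global relevance score; training minimises a pointwise proxy loss
# against the manual relevance annotations.
```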
Let’s return to our example of an image search for ‘brown bear’. Although the images that are returned from this query may contain a brown bear, the textual similarity to the query, the image quality or the PageRank of their source website could be quite different. By learning to exchange these relevance signals through the image modality, the GNN can provide better rankings.
Another advantage of this approach is that the image ranking is fully contextualized. The score of an image for a query depends on all the other objects in the initial retrieved set, so it is relative rather than absolute. In other words, the GNN learns to propagate or transfer relevance signals between different images; instead of learning a conceptual representation of an image to compare directly with the text (as in the joint embedding approach), similarity signals are propagated. The model architecture is depicted in Video 1.
To compute image similarities, we rely on pretrained ResNet descriptors (3) and learn a generalized dot product (also known as ‘symmetric bilinear form’) to adapt these pretrained image similarities to our web image search task. The model is very compact, with roughly 3–10K parameters, depending on the number of cross-modal convolution layers. Any feature can be plugged in, and the visual, textual, structural and popularity features are aggregated in a fully differentiable model. See Figure 1 for an illustration of the global architecture of our approach.
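One compact way to parameterize such a generalized dot product, consistent with a parameter budget of a few thousand, is a learned diagonal re-weighting of the descriptor dimensions (a special case of a symmetric bilinear form). Treat the sketch below as illustrative; the exact parameterization we use may differ.

```python
import torch

class DiagonalBilinearSimilarity(torch.nn.Module):
    """Compact symmetric bilinear form s(x, y) = x^T diag(w) y that
    re-weights the dimensions of pretrained visual descriptors."""

    def __init__(self, dim):
        super().__init__()
        self.w = torch.nn.Parameter(torch.ones(dim))  # start from a plain dot product

    def forward(self, x, y):
        # x: (n, dim), y: (m, dim) -> (n, m) matrix of adapted similarities
        return (x * self.w) @ y.T

sim = DiagonalBilinearSimilarity(2048)                    # e.g. ResNet descriptors
scores = sim(torch.randn(4, 2048), torch.randn(6, 2048))  # pairwise similarities
```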
Although the graph structure is implicit in image search, we have found GNNs to be very powerful models that can generalize some popular heuristics in information retrieval, such as pseudo relevance feedback (7). Additionally, our model can be used in other cross-modal search scenarios (i.e. text to image) and, being end to end, can learn how to optimally combine the different pieces of information.
We evaluated our approach and compared it to other state-of-the-art methods (8) on two datasets, i.e. the MediaEval 2017 dataset (9) and a proprietary NAVER Search dataset. The datasets include hundreds of queries and a collection of images whose relevance with respect to the queries is manually annotated. For these, our method achieved an improvement of around 2% in both Precision@20 and mean average precision (mAP) metrics, which in practice can lead to a significant increase in click-through rate (10). In more recent experiments with a larger NAVER Image Web Search dataset (corresponding to a sample of 1000 queries), the improvement is even larger: 12% in mAP. This leads us to conclude that the gain is much higher when we use refined visual similarities and web-based data. For more details, please see our ECIR20 paper (11) and code (12).
The fundamental challenges we face when designing a multimodal search engine stem from the semantic and intent gaps: the limits of what a user can express in words as a text query, and the difficulty of surmising the visual content that they're expecting to receive as a result.
Our approach seeks to combine the strengths of three well-known methods (i.e. text as a proxy, joint embeddings and the wisdom of the crowd) by modelling the whole set of image results as a graph whose nodes are the images with their associated text and whose edges encode the visual similarity between end points. Expressing the inputs of our problem as a graph, with weak relevance signals attached to each node, enables us to benefit from recent advances in GNNs—namely, signal propagation mechanisms that are learned from the data itself. These mechanisms allow us to turn a set of unrelated, incomplete and noisy pieces of relevance information into a globally consistent set of strong relevance signals by considering all of these pieces synergistically.
Our approach opens up new avenues for solving cross- or multimodal problems, as the methods and techniques that we describe can be easily extended to compare other mixtures of media beyond text and image, such as speech, video and motion.