By dissecting the matching process of the recent ColBERT [2] model, we make a step towards unveiling the ranking properties of BERT-based ranking models. This post is an introduction to our 2021 ECIR paper “A White Box Analysis of ColBERT” [1].
By analysing two key ingredients of good search models – term importance and term matching patterns – we show that ColBERT implicitly learns a notion of term importance that correlates with IDF, and that it tends to favor exact matches for these important terms.
In the past two years, we’ve witnessed the takeover of large, pre-trained language models in many language-related tasks. Information Retrieval (IR) is no exception: while the field has long been rather hermetic to “neural breakthroughs”, and term-based approaches like BM25 remain hard to beat, BERT-based ranking models have shifted this paradigm (starting from [3]), significantly outperforming previous approaches on datasets such as MSMARCO [3] and Robust04 [4, 5, 6].
While a lot of effort has been put into designing variants of BERT-based ranking models, aiming for better performance or efficiency, much less has been done towards understanding why and how such models perform so well in IR (even though there has been a lot of work on analysing BERT in NLP – see the recent survey [7]).
Since its beginnings, IR has been driven by heuristics such as term importance (“some terms are more discriminative than others for ranking”, known as the IDF effect). In practice, it is unclear whether and how such heuristics are learned by ranking models; in fact, previous neural rankers owe part of their success to explicitly including them in the models, e.g. [8].
Previous work from last year [9] investigated whether IR axioms are respected by transformer-based models, by means of diagnostic datasets. Such axioms define properties that a good model should fulfil, for instance “the more occurrences of a query term a document contains, the higher its retrieval score”. A deeper analysis was recently proposed in [10], studying the effect of properties like word order or fluency.
Although these analyses give a global view of certain characteristics of the models, they don’t shed light on how ranking is conducted internally. Instead of investigating whether BERT-based ranking models behave like standard ones, we move towards understanding how they manage to improve over standard baselines.
For this study, we chose to focus on ColBERT [2] for two reasons: it performs competitively while remaining efficient, and its late-interaction design exposes the contribution of each query term, which makes the matching process easy to inspect.
The model is illustrated in Figure 1: for each query term, we seek the most similar term in the given document, in terms of cosine similarity between BERT contextual embeddings. The final score of the document is simply the sum of these maximum similarities.
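As a minimal sketch of this scoring rule (assuming pre-computed, L2-normalised term embeddings; this is not the official ColBERT implementation):

```python
import torch

def colbert_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim scoring sketch: q_emb is (n_query_terms, dim), d_emb is (n_doc_terms, dim),
    both L2-normalised so that the dot product equals the cosine similarity."""
    sim = q_emb @ d_emb.T          # cosine similarity of every query term with every doc term
    max_sim, _ = sim.max(dim=1)    # best-matching document term for each query term
    return max_sim.sum()           # document score = sum of the per-term maxima
```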
We trained the model on the MSMARCO passage dataset and use the passage retrieval tasks from TREC-DL 2019 and 2020 (400 test queries in total) for the analysis. To assess the role of fine-tuning in the properties we study, we also consider a ColBERT model that has not been fine-tuned on relevance but is simply initialized from a pre-trained BERT checkpoint.
We are interested in how the model attributes scores to documents for each query term: we suspect that some terms matter more than others, and that this should be reflected in the scores. To study this, we place ourselves in a re-ranking scenario, where the model has to re-rank a set of documents Sq provided by a first-stage ranker, typically BM25 (in our case, |Sq| ≤ 1000). For a given query and for each of its terms, we can then analyse the distribution of scores over all the documents in Sq. An even more interesting view is obtained by splitting this into two distributions: scores that come from an exact match with the query term (i.e. the max-sim is obtained for the same term in the document), and scores that come from a soft match (i.e. the max-sim is obtained for a different term). Some examples are given in Figures 2, 3 and 4:
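Concretely, for a single document the per-term contribution can be labelled as an exact or a soft match as follows (a sketch building on the previous snippet; `query_tokens`, `doc_tokens` and the `dists` accumulator are illustrative names, and looping over all documents in Sq yields the two score distributions per query term):

```python
import torch
from collections import defaultdict

def split_exact_soft(query_tokens, doc_tokens, q_emb, d_emb, dists=None):
    """Accumulate, for each query term, the max-sim scores coming from exact matches
    (same word piece found in the document) and from soft matches (a different word piece)."""
    dists = defaultdict(list) if dists is None else dists
    sim = q_emb @ d_emb.T
    max_sim, argmax = sim.max(dim=1)
    for i, q_tok in enumerate(query_tokens):
        kind = "exact" if doc_tokens[int(argmax[i])] == q_tok else "soft"
        dists[(q_tok, kind)].append(max_sim[i].item())
    return dists
```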
We start to notice a pattern here…
Although at first sight all of this may appear natural, the observation is actually not so straightforward. Our work focuses on quantifying this behavior from an IR perspective, as well as providing some hints as to its actual causes.
In the examples above, some terms appear to be more important than others. Before analysing the role of such terms, we first wanted to check whether the model captures a notion of term importance at all and, if so, how it relates to standard definitions like IDF.
For ColBERT – and other BERT-based models – it is not easy to measure the importance of a term, because it depends on both the document and query contexts. We therefore resort to an indirect measure, by considering two rankings of Sq: the one obtained with the full ColBERT score, and the one obtained when the contribution of the query term under study is removed from the score.
We now have two rankings that we want to compare. Intuitively, if the term is important, removing its contribution should disturb the ranking, while removing a “useless” term should not change it much, i.e. the two lists should stay similar. To compare two ranked lists of documents, we report the AP-correlation τ-AP [11]: two similar lists have a τ-AP close to 1. Note that the way we define importance does not rely on relevance annotations; other measures would be possible, e.g. computing the delta of some IR metric between the two lists. One advantage of our definition is that it can use virtually any set of queries, without annotation.
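For illustration, a minimal τ-AP implementation could look as follows (a sketch, assuming both lists rank the same set of documents, best first; one natural way to turn this into an importance score is, e.g., 1 − τ-AP between the original and the term-ablated ranking, an illustrative convention rather than necessarily the exact quantity plotted in the paper):

```python
def tau_ap(ranking_a, ranking_b):
    """AP-correlation [11] between two rankings of the same documents (best first).
    Returns a value in [-1, 1]; 1 means the two lists are identical."""
    pos_b = {doc: r for r, doc in enumerate(ranking_b)}
    n = len(ranking_a)
    if n < 2:
        return 1.0
    total = 0.0
    for i in range(1, n):  # items at ranks 2..n of ranking_a
        doc_i = ranking_a[i]
        # documents ranked above doc_i in ranking_a that are also above it in ranking_b
        c_i = sum(1 for doc_j in ranking_a[:i] if pos_b[doc_j] < pos_b[doc_i])
        total += c_i / i
    return 2.0 * total / (n - 1) - 1.0
```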
In Figure 5, we plot ColBERT term importance against IDF. We observe a moderate correlation between the two, showing that ColBERT implicitly captures a notion of term importance. The correlation is not perfect, notably because the model can learn term importance directly and thus correct for the shortcomings of IDF, which is itself an imperfect proxy for such importance.
A second point we wanted to investigate is the interplay of exact and soft match patterns. It is well established that exact matches (the same term appearing in query and document) remain a critical component of IR systems, but relying solely on exact matching leads to the so-called vocabulary mismatch problem. Competitive IR models therefore also rely on softer notions of matching (allowing, for instance, synonyms to match), and the right balance between lexical and semantic matching has to be found. It is thus interesting to check how (Col)BERT deals with the two aspects, and especially whether exact matching remains a key component of these transformer-based models. We observed in the previous examples that some terms seem to focus more on exact matching, and that when they do, their contributions tend to be higher on average.

In Figure 6, we compute, for each query term, the difference ΔES between the mean of its exact-match score distribution and the mean of its soft-match distribution (recall the two distributions per query term in the previous examples), and plot it against IDF. A high ΔES indicates that the model favors an exact match for this term, as it learns to widen the gap (on average) between exact and soft scores. We see a moderate positive correlation between terms focusing on exact matches and IDF. Interestingly, this effect is already present before fine-tuning, but is reinforced by fine-tuning on relevance signals.
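Reusing the per-term score distributions built in the earlier sketch (the hypothetical `dists` dictionary), ΔES for one query term can be computed as:

```python
import numpy as np

def delta_es(dists, q_tok):
    """ΔES for one query term: mean of its exact-match scores minus mean of its
    soft-match scores, over the documents of the re-ranking set."""
    exact = dists.get((q_tok, "exact"), [])
    soft = dists.get((q_tok, "soft"), [])
    if not exact or not soft:
        return float("nan")  # undefined if one of the two distributions is empty
    return float(np.mean(exact) - np.mean(soft))
```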
Figure 7 below gives some example queries with the ΔES of each term. For terms that intuitively seem important, ΔES is higher, meaning that the model promotes a stricter notion of matching for them. For instance, the first query refers to a coronary problem, so the words related to it have a higher ΔES.
The previous analysis shows that ColBERT relies on exact matching for certain terms, but not how. Our hypothesis is that the contextual embeddings of such terms vary little across contexts: the cosine similarity between the query term and the same term in a document is then close to 1, and ColBERT will tend to select it. On the contrary, terms carrying less “information” (e.g. stopwords, but not only) are more influenced by their context – they act as a sort of reservoir encoding the concepts of the whole sequence – and their contextual embeddings are likely to vary a lot. To check this hypothesis, we perform a spectral analysis of contextual term embeddings collected on a subset of the collection, restricting the analysis to terms occurring in queries. We apply a singular value decomposition (SVD) to the matrix of contextual embeddings gathered in the corpus for each term (one matrix per term), and look at the relative magnitude of the singular values λ1, …, λd (d = embedding dimension). Intuitively, if λ1 is far greater than the others, the embeddings of this term tend to point in the same direction in the embedding space, which promotes exact matches by design of ColBERT. Figure 8 confirms our intuition: as (word-piece-level) IDF increases, the ratio of λ1 with respect to the other singular values increases.
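A minimal sketch of this spectral analysis (one matrix of contextual embeddings per query term, collected over a subset of the collection; the exact ratio reported in the paper may be normalised differently):

```python
import numpy as np

def first_singular_ratio(term_embeddings: np.ndarray) -> float:
    """term_embeddings: (n_occurrences, dim) contextual embeddings of a single word-piece term.
    Returns the share of the spectrum carried by the first singular value; values close to 1
    mean the embeddings of this term all point in roughly the same direction."""
    s = np.linalg.svd(term_embeddings, compute_uv=False)  # singular values, largest first
    return float(s[0] / s.sum())
```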
Interestingly, this effect is reinforced when the model is fine-tuned on relevance data. In particular, the embeddings of low-IDF words tend to point in a different direction at each occurrence, showing that what they capture is more about their context. Figure 9 below gives some example queries, showing what some of their terms match in a sample of 15 documents under the ColBERT mechanism, which further supports our intuition:
To summarize, ColBERT (implicitly) learns a notion of term importance that correlates with IDF, and for these important terms it tends to favor exact matches – a behavior rooted in the fact that their contextual embeddings vary little across contexts.
There obviously remains a lot to do, either by analyzing other models, or by extending our analysis of ColBERT to first-stage ranking, where retrieval axioms might be more critical. Check out our paper for further details and reach out to us!