**By dissecting the matching process of the recent ColBERT [2] model, we take a step towards unveiling the ranking properties of BERT-based ranking models. This post is an introduction to our 2021 ECIR paper “A White Box Analysis of ColBERT” [1].**

By analysing two key ingredients for good search models – term importance and term matching patterns – we show that:

- (Col)BERT indeed captures a notion of **term importance**
- **Exact match** remains a key component of the model, especially for *important* terms
- Exact match is promoted in ColBERT for terms with high Inverse Document Frequency (IDF), for which contextual embeddings tend to point in the same direction in the embedding space

In the past two years, we’ve witnessed the takeover of large, pre-trained language models in many language-related tasks. Information Retrieval (IR) is no exception to the rule: while the field has been rather hermetic to “neural breakthroughs”, and term-based approaches like BM25 remain hard to beat, BERT-based ranking models have shifted this paradigm (starting from [3]), showing that large neural models significantly outperform previous approaches on various datasets like MSMARCO [3] and Robust04 [4, 5, 6].

While a lot of effort has been put into designing variants of BERT-based ranking models, aiming for either better performance or efficiency, not much has been done towards understanding **why/how** such models perform so well in IR (even though there’s been a lot of work in analysing BERT in NLP – see the recent survey [7]).

Since its beginning, IR has been driven by heuristics, such as term importance (“*some terms are more discriminative than others for ranking*”, known as the IDF effect). In practice, it is unclear whether and how such heuristics are learned by ranking models; in fact, previous neural rankers owe their success to explicitly including them in the models (e.g. [8]).

Work from last year [9] investigated whether IR axioms are respected – or not – by transformer-based models by means of diagnostic datasets. Such axioms define properties that a good model *should* fulfil, for instance “*the more occurrences of a query term a document has, the higher its retrieval score*”. A deeper analysis was recently proposed in [10], studying the effect of different properties like word order or fluency.

Although these analyses give a global view of certain characteristics of the models, they don’t shed light on *how ranking is conducted internally by the models*. Instead of investigating whether BERT ranking models behave like standard ones, we go towards understanding **how** they manage to improve over standard baselines.

For this study, we chose to focus on ColBERT [2], for two reasons:

- By delaying interaction between query and document to the very end – query and document are encoded **independently** – it manages to keep performance on par with the vanilla approach while drastically reducing computation (both offline and online)
- Its structure is similar to standard BOW models (sum over query terms of some similarity score): it makes the analysis easier as the **contribution of each term for ranking is explicit**

The model is detailed in Figure 1: for each query term, we seek the most similar term in the given document, in terms of cosine similarity between BERT embeddings. The final score is just the sum of these maximum similarities.
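The scoring rule described above can be illustrated with a minimal NumPy sketch. It assumes the query and document token embeddings have already been produced by some encoder; the function names `colbert_term_scores` and `colbert_score` are hypothetical and not taken from the ColBERT codebase.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so that dot products equal cosine similarities.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def colbert_term_scores(query_emb, doc_emb):
    """Per-query-term contribution: the maximum cosine similarity over document terms.

    query_emb: array-like of shape (n_query_terms, dim)
    doc_emb:   array-like of shape (n_doc_terms, dim)
    Returns (contributions, index of the best-matching document term per query term).
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    d = l2_normalize(np.asarray(doc_emb, dtype=float))
    sim = q @ d.T  # (n_query_terms, n_doc_terms) cosine similarity matrix
    return sim.max(axis=1), sim.argmax(axis=1)

def colbert_score(query_emb, doc_emb):
    # The document score is the sum of the per-term maximum similarities.
    contributions, _ = colbert_term_scores(query_emb, doc_emb)
    return float(contributions.sum())
```

Because the score is a plain sum, the vector returned by `colbert_term_scores` directly gives the contribution of each query term, which is what makes the per-term analysis possible.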

We trained the model on the MSMARCO passage dataset, and consider for the analysis the passage retrieval tasks from TREC-DL 2019 and 2020 (400 test queries in total). To control for the role of fine-tuning on the properties we looked at, we also consider a ColBERT model that has *not* been fine-tuned on relevance but just initialized from a pre-trained BERT checkpoint.

What we’re interested in is how the model attributes scores to documents for each query term. We suspect that some terms might be more *important*, and this should be reflected in the scores. To check this, we place ourselves in a re-ranking scenario, where the model has to re-rank a set of documents **S**_{q} provided by a first ranker, typically BM25 (in our case, |**S**_{q}| ≤ 1000). Now, for a given query and for each term of this query, we can analyse the distribution of scores for all the documents in **S**_{q}. An even more interesting view can be obtained by considering two distributions: scores that come from an **exact match** with the query term (i.e. the maxsim is obtained for the same term), and scores that come from a **soft match** (i.e. the maxsim is obtained for a term in the document that is not the query term). Some examples are given in Figures 2, 3 and 4:

We start to notice a pattern here…

- Some terms that seem important in the query tend to focus on exact match: for instance “*kitchen*” and “*cabinets*” in Figure 2, or “*michigan*” and “*county*” in Figure 3. For these terms, similarity scores tend to be higher with respect to other query terms AND with respect to the soft-case, hence having a larger *contribution* in the ranking score (remember that ColBERT is just a sum of weights over query terms).
- Inversely, some terms carrying less content, like “for” in Figure 2, tend to focus more on soft-matching. In this case, similarity ranges generally look similar between exact and soft.

*Note: for some of them, it is likely they don’t appear in the document, hence always doing soft matches by default.*
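To make the exact/soft distinction concrete, here is a small sketch that labels each query term's best match for one document. It assumes a precomputed query-by-document similarity matrix (one row per query token) and token strings for both sides; the function name `match_types` and the toy values are made up for illustration.

```python
def match_types(query_tokens, doc_tokens, sim):
    """Label the best match of every query token as 'exact' or 'soft'.

    sim[i][j] is the similarity between query token i and document token j.
    Returns a list of (query_token, max_similarity, kind) tuples.
    """
    results = []
    for q_tok, row in zip(query_tokens, sim):
        # Position of the document token that reaches the maximum similarity.
        best = max(range(len(doc_tokens)), key=row.__getitem__)
        # Exact match: the maximum is reached on the same token as the query token.
        kind = "exact" if doc_tokens[best] == q_tok else "soft"
        results.append((q_tok, row[best], kind))
    return results

# Toy example with invented similarity values.
labels = match_types(
    ["kitchen", "for"],
    ["kitchen", "cabinets", "with"],
    [[0.95, 0.60, 0.10],
     [0.20, 0.30, 0.50]],
)
```

Collecting these tuples over all documents in the re-ranking set yields, for each query term, the two score distributions discussed above. In this toy case "for" never occurs in the document, so it can only produce a soft match.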

Although at first sight all of this may appear natural, the observation is actually not so straightforward. Our work focused on quantifying this information under the IR prism, as well as providing some hints to the actual reasons for this behavior.

In the examples above, we have the impression that some terms are more *important*. Before analysing the role of such terms, we first wanted to check if the model captures a notion of term importance, and if so, how it relates to standard definitions like IDF.

For ColBERT – and other BERT-based models – it’s not easy to measure the importance of terms, because it depends on both the document and query contexts. That’s why we have to resort to indirect means by considering:

- the ranking obtained by ColBERT
- the ranking obtained by ColBERT when the corresponding term *contribution* is masked (i.e. when we remove from the ColBERT sum all the contributions of the subwords that compose the word)

*Note: the word is not masked in the input; we solely mask its final contribution to the score, not its influence on other terms.*
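A minimal sketch of how these two rankings could be produced, assuming the per-term contributions (one value per query subword) have already been computed for each candidate document. The function name `rank_documents`, the document ids and the numbers are hypothetical.

```python
def rank_documents(contributions, masked_positions=()):
    """Rank documents by the sum of per-term contributions.

    contributions: dict mapping doc id -> list of contributions, one per query subword.
    masked_positions: indices of the query subwords whose contribution is dropped
        from the sum (the input itself is untouched, only the final sum changes).
    """
    masked = set(masked_positions)
    scores = {
        doc: sum(c for i, c in enumerate(contribs) if i not in masked)
        for doc, contribs in contributions.items()
    }
    # Highest score first; ties are broken by doc id to keep the output deterministic.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Invented contributions for a three-subword query and three documents.
contributions = {
    "d1": [0.9, 0.2, 0.5],
    "d2": [0.3, 0.6, 0.6],
    "d3": [0.5, 0.1, 0.4],
}
full_ranking = rank_documents(contributions)
masked_ranking = rank_documents(contributions, masked_positions=[0])
```

Here, masking the first subword moves `d2` above `d1`, which suggests that this term carries weight in the original ranking.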

We now have two rankings which we want to compare. Intuitively, if the term is important in some sense, removing its contribution should disturb the ranking, while a “useless” term should not impact the ranking much, i.e. the two lists should be similar. To compare two ranked lists of documents, we choose to report the AP-correlation *τ-AP* [11]: two similar lists have a *τ-AP* close to 1.

*Note: the way we define importance does not rely on relevance annotation. Other types of measures could be possible, e.g. by computing the delta of some IR metric between the two lists. One advantage here is that we can use virtually any set of queries without annotation.*
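For readers unfamiliar with the AP-correlation, the following sketch implements the asymmetric coefficient as defined in [11]. Which of the two lists plays the role of the reference is an assumption here (the post does not say), and the function name `tau_ap` is ours.

```python
def tau_ap(ranking, reference):
    """AP rank correlation of `ranking` with respect to `reference`.

    Both arguments are lists containing the same items, best first.
    Returns 1.0 for identical lists and -1.0 for fully reversed lists.
    """
    n = len(ranking)
    if n < 2:
        raise ValueError("need at least two items")
    if len(set(ranking)) != n or set(ranking) != set(reference):
        raise ValueError("both lists must contain the same distinct items")
    ref_pos = {item: pos for pos, item in enumerate(reference)}
    total = 0.0
    for i in range(1, n):
        item = ranking[i]
        # Items placed above `item` in `ranking` that are also above it in `reference`.
        correct = sum(1 for above in ranking[:i] if ref_pos[above] < ref_pos[item])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0
```

Unlike Kendall's τ, the coefficient penalizes disagreements near the top of the list more heavily: `tau_ap(["b", "a", "c"], ["a", "b", "c"])` gives 0.0, while `tau_ap(["a", "c", "b"], ["a", "b", "c"])` gives 0.5.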

In Figure 5, we plot ColBERT term importance against IDF. We observe a (moderate) correlation between the two, showing that ColBERT implicitly captures a notion of term importance. Note that the correlation is not perfect, in particular because the model is able to learn term importance and correct the shortcomings of IDF, which is itself an imperfect measure of such importance.

A second point we wanted to investigate is the issue of exact and soft match patterns. It’s well established that exact matches (same term in query and document) remain a critical component of IR systems. But relying solely on exact matching leads to the so-called vocabulary mismatch problem. Competitive IR models therefore generally also rely on softer notions of matching (for instance, to match synonyms), and there’s a need to find the right balance between lexical and semantic matching. This makes it of interest to check how (Col)BERT deals with the two aspects, and especially to understand whether *exact matching* remains a key component of these transformer-based models.

We observed in the previous examples that some terms seem to focus more on exact matching and that, when they do, their contributions tend to be higher (on average). In Figure 6, we compute Δ_{ES}, the difference between the mean of the exact distribution and the mean of the soft distribution, for each query term (*remember the two distributions for each query term in the previous examples*), and plot it against IDF. A high delta tends to indicate that the model favors an exact match for this term, as the model learns to widen the gap (on average) between exact and soft scores for this term. We can see there is a (moderate) positive correlation between terms focusing on exact matches and IDF. Interestingly, this effect is already present before fine-tuning, but is reinforced when fine-tuning on relevance signals.
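As a rough illustration of the quantity plotted against IDF, Δ_{ES} can be computed from a flat list of (term, score, match kind) records gathered over the candidate documents. The record layout, the function name `delta_es` and the choice to skip terms that lack one of the two distributions are assumptions of this sketch.

```python
from collections import defaultdict
from statistics import mean

def delta_es(records):
    """Mean exact-match score minus mean soft-match score, per query term.

    records: iterable of (term, score, kind) with kind in {"exact", "soft"}.
    Terms without at least one score of each kind are skipped (assumption).
    """
    exact, soft = defaultdict(list), defaultdict(list)
    for term, score, kind in records:
        (exact if kind == "exact" else soft)[term].append(score)
    return {
        term: mean(exact[term]) - mean(soft[term])
        for term in exact
        if term in soft
    }

# Invented scores: "for" has no exact match, so it is left out of the result.
deltas = delta_es([
    ("kitchen", 0.9, "exact"),
    ("kitchen", 0.8, "exact"),
    ("kitchen", 0.5, "soft"),
    ("for", 0.4, "soft"),
])
```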

Δ_{ES} for each term. We can see that for terms that *intuitively* seem important, Δ_{ES} is higher, meaning that here, the model promotes a stricter notion of match. For instance, the first query refers to a coronary problem, so words related to that have a higher delta.

The previous analysis shows that ColBERT relies on exact matching for certain terms, but it doesn’t tell us how. Our hypothesis is that contextual embeddings for such terms tend not to vary much, so the cosine similarity between the query term and the document term would be close to 1, and ColBERT will tend to select this term. On the contrary, terms carrying less “information” (e.g. stopwords, but not only) are more influenced by their context, and will act as some sort of reservoir to encode concepts of the sequence, so their contextual embeddings would likely vary a lot.

To check this hypothesis, we perform a spectral analysis of contextual term embeddings collected on a subset of the collection, restricting the analysis to terms occurring in queries. We apply a singular value decomposition (SVD) to each matrix composed of the contextual embeddings in the corpus for a given term (so one matrix per term), and look at the relative magnitude of the singular values λ_{1},…,λ_{d} (d = embedding dimension). Intuitively, if the magnitude of λ_{1} is far greater than the others, it means that the embeddings (for this term) tend to point in the same direction in the embedding space, promoting exact matches because of the ColBERT design. In Figure 8, we have confirmation of our intuition: as (word piece level) IDF increases, the ratio of λ_{1} w.r.t. the other singular values increases.
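The spectral check can be sketched in a few lines of NumPy. The exact ratio used in the paper is not spelled out in this post, so the sketch assumes λ_{1} divided by the sum of all singular values, computed on the raw (uncentered) embedding matrix of one term; `first_singular_ratio` is a hypothetical name.

```python
import numpy as np

def first_singular_ratio(embeddings):
    """Share of the largest singular value among all singular values.

    embeddings: array-like of shape (n_occurrences, dim), one row per
        contextual embedding of the same term collected in the corpus.
    A value close to 1 means the rows point in nearly the same direction.
    """
    x = np.asarray(embeddings, dtype=float)
    # Singular values only, returned in descending order.
    singular_values = np.linalg.svd(x, compute_uv=False)
    return float(singular_values[0] / singular_values.sum())

# Rows sharing one direction versus rows spread over orthogonal directions.
aligned = first_singular_ratio([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
spread = first_singular_ratio([[1.0, 0.0], [0.0, 1.0]])
```

In the first toy matrix all rows are multiples of the same vector, so the ratio is 1; in the second, the two rows are orthogonal and the ratio drops to 0.5.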

- ColBERT (implicitly) learns a notion of term importance that correlates with IDF
- Exact matching remains a key component, especially for terms with high IDF
- Embeddings for terms with high IDF tend to point in the same direction in the embedding space, thus promoting exact matching due to the ColBERT design

There obviously remains a lot to do, either by analyzing other models, or by extending our analysis of ColBERT to first stage ranking, where retrieval axioms might be more critical. Check out our paper for further details and reach out to us!

1. A White Box Analysis of ColBERT, Thibault Formal, Benjamin Piwowarski and Stéphane Clinchant, European Conference on Information Retrieval (ECIR), (online), 28 March – 1 April, 2021. Received the Best Short Paper Award.
2. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, Omar Khattab and Matei Zaharia, Conference on Research and Development in Information Retrieval (SIGIR), virtual event, July, 2020.
3. Passage Re-ranking with BERT, Rodrigo Nogueira and Kyunghyun Cho, 2019.
4. Deeper Text Understanding for IR with Contextual Neural Language Modeling, Zhuyun Dai and Jamie Callan, Conference on Research and Development in Information Retrieval (SIGIR), Paris, France, 21-25 July, 2019.
5. CEDR: Contextualized Embeddings for Document Ranking, Sean MacAvaney, Andrew Yates, Arman Cohan and Nazli Goharian, Conference on Research and Development in Information Retrieval (SIGIR), Paris, France, 21-25 July, 2019.
6. Document Ranking with a Pretrained Sequence-to-Sequence Model, Rodrigo Nogueira, Zhiying Jiang and Jimmy Lin, 2020.
7. A Primer in BERTology: What we know about how BERT works, Anna Rogers, Olga Kovaleva and Anna Rumshisky, Transactions of the Association for Computational Linguistics (TACL), 2020.
8. A Deep Relevance Matching Model for Ad-hoc Retrieval, Jiafeng Guo, Yixing Fan, Qingyao Ai and W. Bruce Croft, Conference on Information and Knowledge Management (CIKM), Indianapolis, USA, 24-28 October, 2016.
9. Diagnosing BERT with Retrieval Heuristics, Arthur Câmara and Claudia Hauff, European Conference on Information Retrieval (ECIR), virtual event Portugal, 14-17 April, 2020.
10. ABNIRML: Analyzing the Behavior of Neural IR Models, Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey and Arman Cohan, 2020.
11. A New Rank Correlation Coefficient for Information Retrieval, Emine Yilmaz, Javed A. Aslam and Stephen Robertson, Conference on Research and Development in Information Retrieval (SIGIR), Singapore, 20-24 April, 2008.
