A white box analysis of ColBERT - Naver Labs Europe

By dissecting the matching process of the recent ColBERT [2] model, we take a step towards unveiling the ranking properties of BERT-based ranking models. This post is an introduction to our ECIR 2021 paper “A White Box Analysis of ColBERT” [1].

By analysing two key ingredients for good search models – term importance and term matching patterns – we show that: 

  • (Col)BERT indeed captures a notion of term importance
  • Exact match remains a key component of the model, especially for important terms 
  • Exact match is promoted in ColBERT for terms with high Inverse Document Frequency (IDF), for which contextual embeddings tend to point in the same direction in the embedding space


In the past two years, we’ve witnessed the takeover of large, pre-trained language models in many language-related tasks. Information Retrieval (IR) is no exception to the rule: while the field has long been rather hermetic to “neural breakthroughs”, and term-based approaches like BM25 remain hard to beat, BERT-based ranking models have shifted this paradigm (starting from [3]), showing that large neural models significantly outperform previous approaches on various datasets like MSMARCO [3] and Robust04 [4, 5, 6].

While a lot of effort has been put into designing variants of BERT-based ranking models, aiming for either better performance or efficiency, not much has been done towards understanding why/how such models perform so well in IR (even though there’s been a lot of work in analysing BERT in NLP – see the recent survey [7]).

Since its beginning, IR has been driven by heuristics, such as term importance (“some terms are more discriminative than others for ranking”, known as the IDF effect). In practice, it is unclear whether and how such heuristics are learned by ranking models; in fact, previous neural rankers owe their success to explicitly including them in the model, e.g. [8].

Previous work from last year [9] investigated whether IR axioms are respected – or not – by transformer-based models by means of diagnostic datasets. Such axioms define properties that a good model should fulfil, for instance “the more occurrences of a query term a document has, the higher its retrieval score”. A deeper analysis was recently proposed in [10], studying the effect of different properties like word order or fluency.

Although these analyses give a global view of certain characteristics of the models, they don’t shed light on how ranking is conducted internally by the models. Instead of investigating whether BERT ranking models behave like standard ones, we move towards understanding how they manage to improve over standard baselines.

For this study, we chose to focus on ColBERT [2], for two reasons:

  • By delaying interaction between query and document to the very end – query and document are encoded independently – it manages to keep performance on par with the vanilla approach while drastically reducing computation (both offline and online)
  • Its structure is similar to standard BOW models (sum over query terms of some similarity score): it makes the analysis easier as the contribution of each term for ranking is explicit

The model is detailed in Figure 1: for each query term, we seek the most similar term in the given document, in terms of cosine similarity between BERT embeddings (the “maxsim”). The final score is just a sum of these weights over query terms.

Figure 1. ColBERT model description.
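
To make the scoring rule concrete, here is a minimal NumPy sketch of the maxsim-and-sum step. It assumes the contextual embeddings have already been computed, and it leaves out everything else in the actual model (the BERT encoder, projection layers, special tokens). The function names and the toy vectors are ours, chosen purely for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def colbert_score(query_emb, doc_emb):
    """Late-interaction score for one query-document pair.

    query_emb: (n_query_tokens, dim) array, doc_emb: (n_doc_tokens, dim) array.
    Returns the total score, the per-query-token maxsim values, and the index
    of the document token that produced each maxsim.
    """
    sim = l2_normalize(query_emb) @ l2_normalize(doc_emb).T  # cosine matrix
    best_doc_token = sim.argmax(axis=1)  # most similar document token per query token
    per_term = sim.max(axis=1)           # one maxsim value per query token
    return per_term.sum(), per_term, best_doc_token

# Toy 2-dimensional embeddings (made up, not produced by BERT).
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_emb = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
score, per_term, best_doc_token = colbert_score(query_emb, doc_emb)
print(score, per_term, best_doc_token)
```

Keeping the per-term values and the argmax indices around, rather than only the total, is what makes the term-level analysis in the rest of this post possible.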

We trained the model on the MSMARCO passage dataset, and consider for the analysis the passage retrieval tasks from TREC-DL 2019 and 2020 (400 test queries in total). To control for the role of fine-tuning on the properties we looked at, we also consider a ColBERT model that has not been fine-tuned on relevance but just initialized from a pre-trained BERT checkpoint.

What we’re interested in is how the model attributes scores to documents for each query term. We suspect that some terms might be more important, and this should be reflected in the scores. To study this, we place ourselves in a re-ranking scenario, where the model has to re-rank a set of documents Sq provided by a first ranker, typically BM25 (in our case, |Sq| <= 1000). Now, for a given query and for each term of this query, we can analyse the distribution of scores over all the documents in Sq. An even more interesting view can be obtained by considering two distributions: scores that come from an exact match with the query term (i.e. the maxsim is obtained for the same term), and scores that come from a soft match (i.e. the maxsim is obtained for a term in the document that is not the query term). Some examples are given in Figures 2, 3 and 4:

Figure 2. Distributions of ColBERT query term scores for exact (left) and soft (right) matches on the set of documents to re-rank. QUERY: average price for kitchen cabinets installation.
Figure 3. Distributions of ColBERT query term scores for exact (left) and soft (right) matches on the set of documents to re-rank. QUERY: what county is dexter michigan in.
Figure 4. Distributions of ColBERT query term scores for exact (left) and soft (right) matches on the set of documents to re-rank. QUERY: how often to button quail lay eggs.
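
The split into exact and soft distributions shown in these figures can be sketched as follows. This is an illustrative sketch, not our analysis code: it assumes that, for every document in Sq, we already have the document tokens, the maxsim value of each query token, and the index of the document token that achieved it. The function name, the input layout and the toy numbers are all hypothetical.

```python
from collections import defaultdict

def split_exact_soft(query_tokens, docs):
    """Group per-term maxsim scores into exact-match and soft-match lists.

    docs: iterable of (doc_tokens, per_term_scores, best_doc_token) triples,
    one triple per document in the set to re-rank.
    A match is "exact" when the best-matching document token is the same
    string as the query token, and "soft" otherwise.
    """
    exact = defaultdict(list)
    soft = defaultdict(list)
    for doc_tokens, scores, best in docs:
        for q_tok, score, j in zip(query_tokens, scores, best):
            bucket = exact if doc_tokens[j] == q_tok else soft
            bucket[q_tok].append(score)
    return exact, soft

# Two toy documents with made-up scores.
query_tokens = ["kitchen", "for"]
docs = [
    (["kitchen", "cabinets"], [0.9, 0.4], [0, 1]),
    (["for", "sale"], [0.3, 0.8], [1, 0]),
]
exact, soft = split_exact_soft(query_tokens, docs)
print(dict(exact))
print(dict(soft))
```

Note that this simple version pools the scores of a query token that appears more than once in the query.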

We start to notice a pattern here… 

  • Some terms that seem important in the query tend to focus on exact match: for instance “kitchen” and “cabinets” in Figure 2, or “michigan” and “county” in Figure 3. For these terms, similarity scores tend to be higher both with respect to other query terms and with respect to the soft case, hence they contribute more to the ranking score (remember that ColBERT is just a sum of weights over query terms).
  • Conversely, some terms carrying less content, like “for” in Figure 2, tend to focus more on soft matching. In this case, similarity ranges generally look similar between exact and soft. *Note: some of these terms probably don’t appear in the document at all, so they can only produce soft matches.

Although at first sight all of this may appear natural, the observation is actually not so straightforward. Our work focused on quantifying this information under the IR prism, as well as providing some hints to the actual reasons for this behavior. 

Term importance

In the examples above, we have the impression that some terms are more important. Before analysing the role of such terms, we first wanted to check if the model captures a notion of term importance, and if so, how it relates to standard definitions like IDF. 

For ColBERT – and other BERT-based models – it’s not easy to measure the importance of terms, because it depends on both the document and query contexts. That’s why we have to resort to indirect means by considering:

  • the ranking obtained by ColBERT
  • the ranking obtained by ColBERT when the corresponding term contribution is masked (i.e. when we remove from the ColBERT sum all the contributions of the subwords that compose the word) *Note: the word is not masked in the input; we only mask its final contribution to the score, not its influence on other terms

We now have two rankings which we want to compare. Intuitively one would think that if the term is important in some sense, then removing its contribution should disturb the ranking, while a “useless” term should not impact the ranking so much, i.e. the two lists should be similar. To compare two ranked lists of documents, we choose to report the AP-correlation τ-AP [11]: two similar lists have a τ-AP close to 1. *Note: the way we define importance does not rely on relevance annotation. Other types of measures could be possible, e.g. by computing the delta of some IR metrics between the two lists. One advantage here is that we can use virtually any set of queries without annotation. 
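
The comparison can be sketched in a few lines of plain Python. The `tau_ap` function below follows the definition of the AP rank correlation in [11]; treating the full ColBERT ranking as the reference list is our assumption for this sketch, as are the helper name `rank_documents`, the input layout and the toy scores.

```python
def rank_documents(per_term_scores, masked=()):
    """Rank documents by the sum of their per-query-token scores.

    per_term_scores: {doc_id: [score of query token 0, score of token 1, ...]}
    masked: positions of the query tokens whose contribution is dropped.
    """
    totals = {
        doc: sum(s for k, s in enumerate(scores) if k not in masked)
        for doc, scores in per_term_scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

def tau_ap(reference, candidate):
    """AP rank correlation between two rankings of the same items (best first)."""
    n = len(candidate)
    if n < 2 or set(reference) != set(candidate):
        raise ValueError("need two rankings of the same items, at least 2 long")
    ref_rank = {doc: r for r, doc in enumerate(reference)}
    total = 0.0
    for i in range(1, n):
        # Items placed above position i that the reference also places above it.
        correct = sum(ref_rank[d] < ref_rank[candidate[i]] for d in candidate[:i])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

# Made-up per-term scores for a two-token query and three documents.
scores = {"d1": [0.9, 0.2], "d2": [0.5, 0.5], "d3": [0.1, 0.6]}
full = rank_documents(scores)
without_first = rank_documents(scores, masked={0})
without_second = rank_documents(scores, masked={1})
print(tau_ap(full, without_first), tau_ap(full, without_second))
```

In this toy case, dropping the first token reverses the ranking (τ-AP of -1) while dropping the second leaves it untouched (τ-AP of 1), so the first token is the one the ranking depends on.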


Figure 5. ColBERT term importance (as computed using τ-AP) with respect to IDF (standard term importance).

In Figure 5, we plot ColBERT term importance with respect to IDF. We observe a (moderate) correlation between the two, showing that ColBERT implicitly captures a notion of term importance. Note that the correlation is not perfect, notably because the model is able to learn term importance and correct the shortcomings of IDF, which is an imperfect measure of such importance.

Exact/Soft match patterns

A second point we wanted to investigate is the issue of exact and soft match patterns. It’s well established that exact matches (same term in query and document) remain a critical component of IR systems. But relying solely on exact matching leads to the so-called vocabulary mismatch problem. Competitive IR models therefore generally also rely on softer notions of matching (allowing, for instance, synonyms to match), and there’s a need to find the right balance between lexical and semantic matching. This makes it of interest to check how (Col)BERT deals with the two aspects, and especially to understand whether exact matching remains a key component of these transformer-based models.

We observed in the previous examples that some terms seem to focus more on exact matching and that, when they do, their contributions tend to be higher (on average). In Figure 6, we compute for each query term the difference ΔES between the mean of its exact distribution and the mean of its soft distribution (remember the two distributions for each query term in the previous examples), and plot it against IDF. A high ΔES tends to indicate that the model favors an exact match for this term, as the model learns to widen the gap (on average) between exact and soft scores for this term.

We can see there is a (moderate) positive correlation between terms focusing on exact matches and IDF. Interestingly, this effect is already present before fine-tuning, but is reinforced when fine-tuning on relevance signals.


Figure 6. ΔES with respect to IDF. We observe a moderate correlation, showing that the less frequent a term is, the more it is likely to be matched exactly.
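
As a small illustration of the quantity plotted here, ΔES for a term is simply the mean of its exact-match scores minus the mean of its soft-match scores. The sketch below assumes the scores have been collected into two dictionaries mapping each query term to a list of scores; the function name and the numbers are made up for the example and are not taken from our experiments.

```python
from statistics import mean

def delta_es(exact, soft):
    """Mean exact-match score minus mean soft-match score, per query term.

    Terms that lack either kind of match are skipped, since the
    difference is undefined for them.
    """
    return {
        term: mean(exact[term]) - mean(soft[term])
        for term in exact
        if exact[term] and soft.get(term)
    }

# Made-up score lists for three query terms.
exact_scores = {"kitchen": [0.9, 0.8], "for": [0.5]}
soft_scores = {"kitchen": [0.4, 0.5], "for": [0.5, 0.4], "price": [0.6]}
deltas = delta_es(exact_scores, soft_scores)
print(deltas)
```

Here “price” is left out because it never matched exactly, and “kitchen” ends up with a much larger gap than “for”.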

We give below in Figure 7 some examples of queries, with the ΔES for each term. We can see that for terms that intuitively seem important, ΔES is higher, meaning that here, the model promotes a stricter notion of match. For instance, the first query refers to a coronary problem, so words related to that have a higher delta.


Figure 7. Sample of queries with ΔES for each term.

Contextual embeddings variation

The previous analysis shows that ColBERT relies on exact matching for certain terms, but it doesn’t tell us how. Our hypothesis is that contextual embeddings for such terms tend not to vary much, so the cosine similarity between the query term and the same term in the document would be close to 1, and ColBERT will tend to select this term. On the contrary, terms carrying less “information” (e.g. stopwords, but not only) are more influenced by their context, and act as some sort of reservoir to encode concepts of the sequence, so their contextual embeddings are likely to vary a lot.

To check this hypothesis, we perform a spectral analysis of contextual term embeddings collected on a subset of the collection, restricting the analysis to terms occurring in queries. We apply a singular value decomposition (SVD) to each matrix composed of the contextual embeddings in the corpus for a given term (so one matrix per term), and look at the relative magnitude of the singular values λ1,…,λd (d = embedding dimension). Intuitively, if the magnitude of λ1 is far greater than the others, it means that the embeddings (for this term) tend to point in the same direction in the embedding space, promoting exact matches because of the ColBERT design. Figure 8 confirms our intuition: as (word piece level) IDF increases, the ratio of λ1 with respect to the other singular values increases.

Figure 8. Ratio of the first eigenvalue to the sum of the eigenvalues with respect to IDF (subword level). The less frequent the term is, the higher the ratio is, showing that all contextualized embeddings for a rare term are concentrated in the same direction.
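
The spectral measure can be sketched with NumPy as below. We assume here that the ratio is taken as the first singular value over the sum of all singular values of the (occurrences × dimension) embedding matrix of one term; the exact normalisation, the function name and the toy matrices are assumptions of this sketch, not a description of our code.

```python
import numpy as np

def first_singular_ratio(embeddings):
    """Ratio of the largest singular value to the sum of all singular values.

    embeddings: (n_occurrences, dim) array holding the contextual embeddings
    of a single term collected over the corpus.
    """
    singular_values = np.linalg.svd(embeddings, compute_uv=False)  # sorted, largest first
    return singular_values[0] / singular_values.sum()

# Toy case 1: every occurrence points in the same direction (ratio close to 1).
aligned = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0])
# Toy case 2: occurrences spread evenly over 4 orthogonal directions (ratio 1/4).
spread = np.eye(4)
print(first_singular_ratio(aligned), first_singular_ratio(spread))
```

A ratio close to 1 corresponds to the situation described for rare terms, where all occurrences share one direction, while a low ratio corresponds to embeddings that change direction with context.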

Of interest is the fact that the effect is reinforced when the model is fine-tuned on relevance data. In particular, words with a low IDF tend to point in a different direction each time, showing that what they capture is more about their context. We give some examples of queries in Figure 9 below, including what some of their terms match in a sample of 15 documents with respect to the ColBERT mechanism, which reinforces our intuition:

Figure 9. Sample of matched terms, for query terms with different IDF values.


  • ColBERT (implicitly) learns a notion of term importance that correlates with IDF

  • Exact matching remains a key component, especially for terms with high IDF
  • Embeddings for terms with high IDF tend to point in the same direction in the embedding space, thus promoting exact matching due to the ColBERT design 

There obviously remains a lot to do, either by analyzing other models, or by extending our analysis of ColBERT to first stage ranking, where retrieval axioms might be more critical. Check out our paper for further details and reach out to us! 


  1. A white box analysis of ColBERT, Thibault Formal, Benjamin Piwowarski and Stéphane Clinchant, European Conference on Information Retrieval (ECIR), Lucca, Italy, 28 March – 1 April, 2021 (to appear).
  2. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, Omar Khattab and Matei Zaharia, Conference on Research and Development in Information Retrieval (SIGIR), virtual event, China, July, 2020.
  3. Passage Re-ranking with BERT, Rodrigo Nogueira and Kyunghyun Cho, 2019.
  4. Deeper Text Understanding for IR with Contextual Neural Language Modeling, Zhuyun Dai and Jamie Callan, Conference on Research and Development in Information Retrieval (SIGIR), Paris, France, 21-25 July, 2019.
  5. CEDR: Contextualized Embeddings for Document Ranking, Sean MacAvaney, Andrew Yates, Arman Cohan and Nazli Goharian, Conference on Research and Development in Information Retrieval (SIGIR), Paris, France, 21-25 July, 2019.
  6. Document Ranking with a Pretrained Sequence-to-Sequence Model, Rodrigo Nogueira, Zhiying Jiang and Jimmy Lin, 2020.
  7. A Primer in BERTology: What we know about how BERT works, Anna Rogers, Olga Kovaleva and Anna Rumshisky, Transactions of the Association for Computational Linguistics (TACL), 2020.
  8. A Deep Relevance Matching Model for Ad-hoc Retrieval, Jiafeng Guo, Yixing Fan, Qingyao Ai and W. Bruce Croft, Conference on Information and Knowledge Management (CIKM), Indianapolis, USA, 24-28 October, 2016.
  9. Diagnosing BERT with Retrieval Heuristics, Arthur Câmara and Claudia Hauff, European Conference on Information Retrieval (ECIR), virtual event, Portugal, 14-17 April, 2020.
  10. ABNIRML: Analyzing the Behavior of Neural IR Models, Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey and Arman Cohan, 2020.
  11. A New Rank Correlation Coefficient for Information Retrieval, Emine Yilmaz, Javed A. Aslam and Stephen Robertson, Conference on Research and Development in Information Retrieval (SIGIR), Singapore, 20-24 April, 2008.