BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

Published by Hervé Déjean on 11 November 2024

David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant

Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, Florida, 12-16 November, 2024


Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve a large number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets.

Post by author Stéphane Clinchant, November 6, 2024 (LinkedIn)

When we started working on RAG in 2023, we read many research papers and still had no clear idea of what the good practices were, or simply what a good baseline to build on would be. RAG experimental setups are fragmented, so comparing results across papers amounts to comparing apples and oranges.

That’s why, with David Rau, Shuai Wang, Hervé Déjean, Nadezhda Chirkova, Thibault Formal and Vassilina Nikoulina, we built BERGEN, an open source library to ease the reproducibility of RAG experiments, together with a set of recommendations for strong baselines in RAG.

The problems with metrics

Papers report a variety of metrics to evaluate their models: Exact Match (EM), i.e. the produced string is exactly the expected answer, and Match (M), i.e. the expected answer is contained in the generated string, although some recent papers now call the latter accuracy. Others prefer F1 or precision. However, it has already been reported (Kamalloo, E. et al.) that some of these lexical metrics correlate poorly with human judgements.
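To make the difference between these lexical metrics concrete, here is a minimal sketch in Python. The normalization step (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice, not necessarily what BERGEN does, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this normalization is a common QA convention, not taken
    from the post itself.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """EM: the produced string is exactly the expected answer."""
    return normalize(prediction) == normalize(reference)


def match(prediction: str, reference: str) -> bool:
    """Match: the expected answer is contained in the generated string."""
    ref = normalize(reference)
    # An empty reference would trivially be "contained" in anything.
    return bool(ref) and ref in normalize(prediction)


def precision_recall_f1(prediction: str, reference: str) -> tuple:
    """Token-level precision, recall and F1 between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A verbose but correct answer: EM fails, Match and recall succeed.
prediction = "The capital of France is Paris."
reference = "Paris"
print(exact_match(prediction, reference))
print(match(prediction, reference))
print(precision_recall_f1(prediction, reference))
```

The example shows why EM tends to penalize LLMs that answer in full sentences: the answer is right, but only Match and recall give it full credit.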

We use an LLM-as-a-Judge based on SOLAR-10.7B from Upstage, referred to as LLMeval. We then compute its correlation with GPT-4 as a judge. LLMeval closely aligns with GPT-4’s evaluation, followed by Match and Recall, making them the most effective non-commercial metrics for (zero-shot) RAG evaluation among the ones tested.
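The comparison of candidate metrics against a reference judge can be sketched as follows. This is only an illustration: the post does not say which correlation coefficient was used, so Pearson correlation is an assumption here, and the per-example scores are made-up toy values, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def rank_metrics(reference_scores, candidate_scores):
    """Sort candidate metrics by their correlation with the reference judge."""
    correlations = {
        name: pearson(scores, reference_scores)
        for name, scores in candidate_scores.items()
    }
    return sorted(correlations.items(), key=lambda item: item[1], reverse=True)


# Toy per-example judgements (1 = answer judged correct). Illustrative only.
reference_judge = [1, 0, 1, 1, 0, 0]
candidates = {
    "exact_match": [0, 0, 1, 0, 0, 0],
    "match": [1, 0, 1, 0, 0, 0],
}
for name, corr in rank_metrics(reference_judge, candidates):
    print(f"{name}: {corr:.3f}")
```

In practice one would replace the toy lists with per-example scores from each metric and from the reference judge on the same set of generated answers.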

Which datasets are interesting to benchmark RAG systems?

Now that we have selected a metric, we can look at different QA datasets. We examine more than 10 datasets and compare performance with and without RAG for different LLMs. The results suggest that ASQA, HotpotQA, NQ, TriviaQA and POPQA are the most interesting ones, while some KILT datasets such as ELI5 and WoW are not relevant for RAG.

RAG is also about Retrieval, right?

It’s surprising to see how many RAG papers have a poor retrieval setup, using outdated retrievers and no reranker at all. The following graph simply shows the relation between retrieval performance and generation performance to convince you to use a strong retrieval setup (before reinventing reranking with your big LLMs).

Better Baselines

Thanks to our 500+ experiments, we identify a strong baseline for academic RAG datasets (all the results are in the paper):

top_k documents: 5
First Stage Retrieval: splade-v3
Reranker: DeBERTa-v3
LLM: SOLAR 10.7B from Upstage
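The shape of such a pipeline (first-stage retrieval, reranking, then passing the top_k documents to the LLM) can be sketched as below. This is not BERGEN's API: the two scoring functions are toy stand-ins for splade-v3 and DeBERTa-v3, every name is hypothetical, and the candidate pool size of 50 is an assumption. Only top_k = 5 comes from the recommendation above.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared terms (stand-in for a real retriever)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy reranker: Jaccard similarity of term sets (stand-in for a cross-encoder)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def retrieve_then_rerank(
    query: str,
    collection: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    n_candidates: int = 50,  # assumption: size of the candidate pool
    top_k: int = 5,          # from the recommended baseline
) -> List[str]:
    """Cheap scoring over the whole collection, finer scoring over candidates."""
    candidates = sorted(
        collection, key=lambda doc: first_stage(query, doc), reverse=True
    )[:n_candidates]
    reranked = sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)
    return reranked[:top_k]


def build_prompt(query: str, documents: List[str]) -> str:
    """Concatenate the retrieved documents and the question into an LLM prompt."""
    context = "\n".join(
        f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1)
    )
    return f"{context}\n\nQuestion: {query}\nAnswer:"


collection = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
    "bananas are yellow",
]
query = "what is the capital of france"
top_docs = retrieve_then_rerank(
    query, collection, term_overlap, jaccard, n_candidates=3, top_k=2
)
print(build_prompt(query, top_docs))
```

Keeping the two scorers as interchangeable callables mirrors the idea of swapping retrievers and rerankers while holding the rest of the pipeline fixed.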

Furthermore, we also add multilingual RAG datasets and experiments (cf. https://arxiv.org/abs/2407.01463), for which we recommend:

Retrieval & Reranker: BGE-m3
LLM: Command-R from Cohere combined with language-specific prompts
Collection: multilingual Wiki

Conclusion

While more application-oriented libraries for RAG exist, such as LangChain, the situation for academic RAG is more complex: it is hard to compare papers and to know the current state of the art on existing datasets. We hope that BERGEN can help build strong baselines and better RAG systems!
