Existing fine-grained evaluation approaches for retrieval-augmented generation (RAG), such as RAGAS, ARES, or RagChecker, can provide insights into the performance of individual steps of the RAG pipeline, but cannot detect failures beyond a predefined set of error types. At the same time, the simple approach of manually inspecting a small subset of predictions helps discover arbitrary error types, but requires human effort and lacks scalability. In this work, we test the possibility of automating per-example inspection of predictions using modern large language models. We prompt GPT-4o to analyse RAG failure cases in an open-ended fashion and then ask it to output a structured, summarized report of the discovered error types. In our case study, we find that automatically generated error reports capture the main error types listed in the human expert’s reports and that the automatic per-example analysis is correct in 80% of cases. At the same time, we notice occasional issues with correctly interpreting particular failure cases, over-generalization of error types during the report summarization step, and some variability in the reports when the analysis is rerun.
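
The sketch below illustrates one way the two-step workflow described above could be implemented with the OpenAI Python SDK: each failure case is first analysed with an open-ended prompt, and the per-example analyses are then aggregated into a structured error-type report. The prompt wording, the `FailureCase` fields, and the report format are illustrative assumptions, not the exact setup used in this work.

```python
# Minimal sketch of the two-step analysis, assuming the OpenAI Python SDK.
# Prompts, FailureCase fields, and the report format are illustrative
# assumptions, not the exact configuration used in this work.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()


@dataclass
class FailureCase:
    question: str
    retrieved_context: str
    model_answer: str
    reference_answer: str


def analyse_case(case: FailureCase) -> str:
    """Step 1: open-ended, per-example analysis of a single RAG failure."""
    prompt = (
        "You are analysing a failure of a retrieval-augmented generation system.\n"
        f"Question: {case.question}\n"
        f"Retrieved context: {case.retrieved_context}\n"
        f"Model answer: {case.model_answer}\n"
        f"Reference answer: {case.reference_answer}\n"
        "Explain in free form why the answer is wrong and which pipeline step failed."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def summarize_report(analyses: list[str]) -> str:
    """Step 2: aggregate per-example analyses into a structured error-type report."""
    joined = "\n\n".join(f"Case {i + 1}:\n{a}" for i, a in enumerate(analyses))
    prompt = (
        "Below are free-form analyses of individual RAG failures.\n"
        "Group them into a structured report listing the discovered error types, "
        "a short description of each, and the cases in which each type occurs.\n\n"
        f"{joined}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```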