Existing fine-grained evaluation approaches for retrieval-augmented generation (RAG), such as RAGAS, ARES, or RagChecker, can provide insights into the performance of individual steps of the RAG pipeline, but cannot detect failures beyond a predefined set of error types. At the same time, the simple approach of manually inspecting a small subset of predictions helps discover arbitrary error types, but requires human effort and lacks scalability. In this work, we test the possibility of automating per-example inspection of predictions using modern large language models. We prompt GPT-4o to analyse RAG failure cases in an open-ended fashion and then ask it to output a structured, summarized report of the discovered error types. In our case study, we find that automatically generated error reports capture the main error types listed in the human expert’s reports and that the automatic per-example analysis is correct in 80% of cases. At the same time, we notice occasional issues with correctly interpreting particular failure cases, over-generalization of error types during the report summarization step, and some variability in the reports when the analysis is rerun.
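
The sketch below illustrates one way the two-step workflow described above could be implemented with the OpenAI Python SDK: each failure case is first analysed with an open-ended prompt, and the per-example analyses are then aggregated into a structured error-type report. The prompt wording, the `FailureCase` fields, and the report format are illustrative assumptions, not the exact setup used in this work.

```python
# Minimal sketch of the two-step analysis, assuming the OpenAI Python SDK.
# Prompts, FailureCase fields, and the report format are illustrative
# assumptions, not the exact configuration used in this work.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()


@dataclass
class FailureCase:
    question: str
    retrieved_context: str
    model_answer: str
    reference_answer: str


def analyse_case(case: FailureCase) -> str:
    """Step 1: open-ended, per-example analysis of a single RAG failure."""
    prompt = (
        "You are analysing a failure of a retrieval-augmented generation system.\n"
        f"Question: {case.question}\n"
        f"Retrieved context: {case.retrieved_context}\n"
        f"Model answer: {case.model_answer}\n"
        f"Reference answer: {case.reference_answer}\n"
        "Explain in free form why the answer is wrong and which pipeline step failed."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def summarize_report(analyses: list[str]) -> str:
    """Step 2: aggregate per-example analyses into a structured error-type report."""
    joined = "\n\n".join(f"Case {i + 1}:\n{a}" for i, a in enumerate(analyses))
    prompt = (
        "Below are free-form analyses of individual RAG failures.\n"
        "Group them into a structured report listing the discovered error types, "
        "a short description of each, and the cases in which each type occurs.\n\n"
        f"{joined}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```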