Large Language Models (LLMs) for Robotics

Large Language Models (LLMs) are advanced AI models built on neural architectures such as transformers and trained on enormous amounts of text data to predict subsequent words from a given prompt. These general-purpose language generators exhibit several key strengths: they possess extensive world knowledge, including common sense; can learn new tasks in context; demonstrate reasoning and planning capabilities; support communication in multiple languages, including programming languages; can use tools such as data retrieval and API calls; follow instructions accurately when properly fine-tuned; and provide interpretable natural language outputs. These attributes make LLMs highly versatile and powerful across a wide range of applications.
Visual Language Models (VLMs) extend these capabilities with visual understanding, making it possible to ground the textual knowledge encoded in LLMs and bringing them closer to the physical world. VLMs enable multimodal reasoning, which opens up a range of interesting applications: we have worked on guiding image generation with language instructions (1,2), providing language-based explanations of complex visual scenes (3,4), and leveraging the reasoning capabilities of LLMs to develop new robotic skills (14).

We’re using our expertise in LLMs to support the development and deployment of robotic services in large organizations and buildings, for example by enhancing access to these services through dedicated chatbots or agents. We believe LLMs can accelerate the creation of new missions, ideally specified in natural language, bridging the gap between end-users and robotic hardware. However, LLMs often lack robustness and may produce inaccurate or harmful outputs, which poses a significant challenge in robotic applications where reliability is crucial. The key scientific challenges we’re currently addressing in this area are summarised below.

To improve contextual accuracy, we’ve been enhancing retrieval-augmented generation (RAG). This approach also reduces costs, since smaller models with stronger retrieval capabilities can match the performance of much larger ones (5,6,13).
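To illustrate the basic RAG pattern the paragraph above refers to, here is a minimal, self-contained sketch: a toy word-overlap retriever stands in for a real dense retriever, the corpus is hard-coded, and all names (`score`, `retrieve`, `build_prompt`) are illustrative rather than part of any actual system. In practice the assembled prompt would be sent to a language model.

```python
def score(query, passage):
    """Toy relevance score: number of words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query, corpus, k=2):
    """Return the k passages most relevant to the query."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query, passages):
    """Assemble the augmented prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Illustrative mini-corpus of facts about robotic services in a building.
corpus = [
    "The cleaning robot operates on floors 1 to 3.",
    "The delivery robot can be booked through the reception desk.",
    "The cafeteria is open from 8am to 3pm.",
]

query = "How do I book the delivery robot?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

The retriever narrows the model's input to the passages most likely to contain the answer, which is what lets a smaller model compete with a larger one: the relevant knowledge arrives in the context rather than having to be memorised in the weights. Context pruning and compression (5,6,13) push this further by shrinking the retrieved context itself.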
We address the lack of quality guarantees in LLM outputs, which can undermine user trust and is especially critical when deploying robot-assisted services (7,8). We also tackle a practical yet challenging setting in which only a small number of preference annotations can be collected per user, with the goal of aligning the model to that individual user. This is a problem we define as Personalized Preference Alignment (9,10).
Recently we’ve been investigating LLMs for reasoning, exploring how standard tuning methods such as reinforcement learning affect response diversity (11); maintaining diversity is crucial for solving complex reasoning tasks. Additionally, we explore LLM-powered chatbots from an HCI perspective, including tools to support alignment and evaluation in deployment contexts (12).
Related publications
1: Bridging Environments and Language with Rendering Functions and Vision-Language Models, ICML 2024
2: PoseEmbroider: towards a 3D, visual, semantic-aware human pose representation, ECCV 2024
3: What could go wrong? Discovering and describing failure modes in computer vision, arXiv 2024
4: Weatherproofing retrieval for localization with generative AI and geometric consistency, ICLR 2024
5: Provence: efficient and robust context pruning for retrieval-augmented generation, ICLR 2025
6: PISCO: Pretty simple compression for retrieval-augmented generation, ACL 2025
7: Guaranteed Generation from Large Language Models, ICLR 2025
8: Compositional preference models for aligning LMs, ICLR 2024
9: FaST: Feature-aware Sampling and Tuning for personalized preference alignment with limited data, EMNLP 2025
10: Drift: Decoding-time personalized alignments with implicit user preferences, EMNLP 2025
11: Whatever remains must be true: filtering drives reasoning in LLMs, shaping diversity, ICLR 2026
12: Surfacing Governing Principles for Chatbots: A Workbench and Comparative Study, CHI 2026
13: XProvence: zero-cost multilingual context pruning for retrieval-augmented generation, ECIR 2026
14: Robust Skills, Brittle Grounding: Diagnosing Restricted Generalization in Vision-Language Action Policies via Multi-Object Picking, arXiv:2602.24143
