Large Language Models (LLMs) for Robotics
Large Language Models (LLMs) are advanced AI models built on neural architectures such as transformers and trained on enormous amounts of text data to predict subsequent words from a given prompt. These general-purpose language generators exhibit several key strengths: they possess extensive world knowledge, including common sense; can learn new tasks in context; demonstrate reasoning and planning capabilities; support communication in multiple languages, including programming languages; use tools such as data retrieval and API calls; follow instructions accurately when properly fine-tuned; and provide interpretable natural language outputs. These attributes make LLMs highly versatile and powerful across a wide range of applications. Vision-Language Models (VLMs) extend these capabilities with visual understanding, making it possible to ground the textual knowledge encoded in LLMs and bringing them closer to the physical world. VLMs enable multimodal reasoning, which opens up several interesting applications: guiding image generation with language instructions (1,2), providing language-based explanations of complex visual scenes (3,4), and leveraging the reasoning capabilities of LLMs to develop new robotics skills.
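As a rough illustration of that next-word-prediction mechanism, the sketch below uses the Hugging Face transformers library with a small GPT-2 checkpoint. The checkpoint and the robotics-flavoured prompt are arbitrary choices for this example, not the models or missions discussed in this article.

```python
# Minimal sketch of autoregressive next-token prediction.
# GPT-2 and the prompt are illustrative placeholders only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To deliver a package to the third floor, the robot should first"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: the model repeatedly predicts the most likely next token
# given the prompt and everything generated so far.
outputs = model.generate(**inputs, max_new_tokens=25, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```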
We’re using our expertise in LLMs specifically to support the development and deployment of robotic services in large organizations and buildings. LLMs provide a valuable opportunity to broaden access to robotic services through dedicated chatbots, and we believe they can accelerate the creation of new missions, ideally expressed in natural language, bridging the gap between end-users and robotic hardware. However, LLMs often lack robustness and can produce inaccurate or harmful information, a significant challenge in robotic applications where reliability is crucial. To address this, we’re enhancing retrieval-augmented generation (RAG) to improve contextual accuracy. This approach also reduces costs, as smaller models with better retrieval capabilities can match the performance of much larger ones (5). We’re also addressing the lack of quality guarantees (6,7,8) in LLM outputs, which can erode user trust, something especially important when scaling robot-assisted services. Finally, since LLMs can hold outdated knowledge, we’re implementing scalable and sustainable processes to keep them continuously updated, ensuring they always provide relevant and accurate information (9,10).
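To make the RAG idea concrete, here is a minimal, self-contained sketch of the pattern: retrieve the most relevant documents for a user query and prepend them to the prompt so the model answers from that context. The toy document store and lexical scoring function are placeholders (a real system would use dense retrieval; see BERGEN, ref. 5, for the benchmarking library mentioned above).

```python
"""Minimal retrieval-augmented generation (RAG) sketch, illustrative only."""
from collections import Counter

# Toy knowledge base: facility documents a robot-service chatbot might consult.
DOCUMENTS = [
    "Meeting room B12 is on the 2nd floor, next to the east elevator.",
    "Delivery robots are charged overnight in the basement docking area.",
    "Visitors must be badged at the front desk before a robot escort starts.",
]

def score(query: str, doc: str) -> float:
    """Crude lexical-overlap score; real systems use dense embeddings."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values()) / max(len(query.split()), 1)

def retrieve(query: str, k: int = 2) -> list:
    """Return the k documents most relevant to the query."""
    return sorted(DOCUMENTS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Ground the LLM in retrieved context to reduce hallucinations."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (
        "Answer using only the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

if __name__ == "__main__":
    # The resulting prompt would then be sent to an LLM for generation.
    print(build_prompt("Where is meeting room B12?"))
```

Because the answer is grounded in retrieved documents rather than the model's parameters alone, a smaller model can stay accurate and the knowledge base can be updated without retraining.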
Related publications
1: Bridging Environments and Language with Rendering Functions and Vision-Language Models, ICML 2024
2: PoseEmbroider: towards a 3D, visual, semantic-aware human pose representation, ECCV 2024
3: What could go wrong? Discovering and describing failure modes in computer vision, arXiv 2024
4: Weatherproofing retrieval for localization with generative AI and geometric consistency, ICLR 2024
5: BERGEN: A Benchmarking Library for Retrieval-Augmented Generation, EMNLP 2024
6: Compositional preference models for aligning LMs, ICLR 2024
7: disco: a toolkit for Distributional Control of Generative Models, ACL 2023
8: Aligning Language Models with Preferences through f-divergence Minimization, ICML 2023
9: Key ingredients for effective zero-shot cross-lingual knowledge transfer in generative tasks, NAACL 2024
10: BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting, ACL 2023