Our Text2Control method controls the Humanoid from textual instructions. It finds goals from text using vision-language models and reaches these goals with a policy trained by goal-conditioned reinforcement learning (GCRL). Try it yourself with the interactive demo below!
Theo Cachet, Christopher Dance, Olivier Sigaud
International Conference on Machine Learning (ICML), Vienna, Austria, 21-27 July 2024
TL;DR: We present a new method ‘Text2Control’ for enabling agents to perform new tasks specified with natural language. The method involves inferring a goal from text using vision-language models (VLMs), then reaching that goal with a goal-conditioned agent. Our approach outperforms multitask reinforcement learning (MTRL) baselines in zero-shot generalization to new tasks.
Language is a convenient medium for humans to specify tasks. Therefore, we aim to create language-conditioned agents (LCAs) capable of performing diverse tasks specified with natural language. This requires a way to ground language, that is, to link language with an agent’s observations and actions. A natural approach to grounding language is to gather textual annotations of an environment [1, 2, 3], but collecting human annotations is costly. Recently, vision-language models (VLMs) have emerged as a promising alternative approach to grounding language.
VLMs, such as CLIP [4], measure how well an image goes with a text by computing a VLM score. This score is given by the dot product between an encoding of the image, called the image embedding, and an encoding of the text, called the text embedding. To infer the relationship between a text and the state of a robot and the objects with which it may interact, one may compute VLM scores based on rendered images of a simulated environment. As such images only depend on the configuration (i.e., position rather than velocity components of the state), we call such scores configuration-text scores. Similarly, we call the embeddings of such images configuration embeddings. These configuration-text scores have been used by previous works [5, 6, 7, 8, 9] as language-conditioned reward functions, enabling the training of LCAs with reinforcement learning (RL).
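To make this scoring concrete, here is a minimal sketch of computing a configuration-text score with the open-source CLIP model via Hugging Face transformers; the rendered image file and the prompt are placeholders, not artifacts from the paper.

```python
# Minimal sketch: score how well a text matches a rendered image with CLIP.
# Assumes a rendered frame of the simulated environment saved as "render.png".
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("render.png")                # rendered configuration
text = "a humanoid standing next to a cube"     # task description (placeholder)

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# L2-normalize, then the dot product is the configuration-text score.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score = (image_emb * text_emb).sum(dim=-1)
print(score.item())
```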
If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but it requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, we introduce a novel decomposition of the problem of building an LCA: first, find a configuration that has a high configuration-text score for text describing a task; then use a (pre-trained) goal-conditioned policy to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably the use of distilled models and the evaluation of configurations from multiple viewpoints.
To evaluate our approach, we needed an environment in which we could describe a vast array of tasks with natural language. Therefore, we built the Humanoid-plus-Cube environment, which extends the Humanoid environment from OpenAI Gym with a cube, allowing for humanoid-cube interactions. We compare the performance of various LCAs on 256 textual instructions in this environment.
Our main contributions are:
1: Distilling rendering functions and VLMs allows us to assess configurations up to 40,000 times faster. It also provides gradients from configurations to configuration embeddings, which can be used to sample datasets of diverse configurations (see the distillation sketch after this list).
2: Using multiple viewpoints to assess configurations robustifies their evaluation and mitigates ambiguities inherent in a single 2D view, such as occlusions.
3: Our LCA, based on the decomposition into VLM-based goal generation and goal reaching, outperforms MTRL baselines in zero-shot generalization on 210 out of 256 tasks.
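As a hypothetical illustration of the distillation idea, the sketch below trains a small network to approximate the slow pipeline (render the configuration, then encode the image with the VLM). The helpers `sample_configurations` and `render_and_embed` are assumed stand-ins for the simulator and VLM, and the dimensions are illustrative rather than the paper's exact setup.

```python
# Hypothetical sketch: distill (rendering functions ∘ VLM image encoder)
# into a small MLP that maps configurations directly to embeddings.
import torch
import torch.nn as nn

CONFIG_DIM, EMB_DIM = 32, 512  # illustrative sizes, not taken from the paper

distilled = nn.Sequential(
    nn.Linear(CONFIG_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, EMB_DIM),
)
optimizer = torch.optim.Adam(distilled.parameters(), lr=1e-4)

for step in range(100_000):
    configs = sample_configurations(256)     # assumed: valid simulator configs
    with torch.no_grad():
        targets = render_and_embed(configs)  # assumed: slow render + VLM encode,
                                             # returning L2-normalized embeddings
    preds = distilled(configs)
    preds = preds / preds.norm(dim=-1, keepdim=True)
    loss = 1.0 - (preds * targets).sum(dim=-1).mean()  # mean cosine distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, `distilled` is fully differentiable with respect to the configuration, which is what enables the gradient-based dataset sampling described below.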
Our method, summarized in Figure 1, consists of four stages, sketched in code after the list below:
Precomputation:
1: Sample a dataset of diverse configurations
2: Precompute the configuration embeddings with a VLM
Inference:
3: Retrieve the configuration with the highest score for a given text from the dataset
4: Reach that configuration with a pretrained goal-conditioned policy
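Before detailing each stage, here is a hedged end-to-end sketch of the pipeline; every function name is illustrative rather than taken from the paper's code.

```python
# Illustrative end-to-end pipeline; all helpers are hypothetical names.

# Precomputation (done once, offline)
configs = sample_diverse_configurations(n=2_500_000)   # stage 1
config_embs = embed_configurations(configs)            # stage 2 (VLM)

# Inference (per textual instruction)
def run_instruction(text, env):
    text_emb = embed_text(text)                        # stage 3: retrieval
    goal = configs[(config_embs @ text_emb).argmax()]
    obs = env.reset()
    done = False
    while not done:                                    # stage 4: goal reaching
        action = goal_conditioned_policy(obs, goal)
        obs, done = env.step(action)
    return obs
```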
1] Configuration dataset sampling. In this first step, we sample a dataset of 2.5 million diverse configurations. We explore several dataset-sampling methods. The best method generates a dataset of configurations with diverse configuration embeddings, called the embedding-diversity dataset. It is constructed by optimizing configurations to minimize the dot product of the closest configuration-embedding pairs. Figure 2 below illustrates this optimization process: starting from similar configurations, we obtain configurations that are sufficiently diverse that humans would describe them with different texts.
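The sketch below conveys this optimization under stated assumptions: it reuses the differentiable `distilled` model and `CONFIG_DIM` from the distillation sketch above, so configurations can be updated by gradient descent to push apart the most similar embedding pairs. Batch size and learning rate are illustrative.

```python
import torch

# Start from a batch of (possibly similar) configurations and optimize them
# so that the closest configuration-embedding pairs are pushed apart.
configs = torch.randn(4096, CONFIG_DIM, requires_grad=True)
optimizer = torch.optim.Adam([configs], lr=1e-2)

for step in range(1_000):
    embs = distilled(configs)                    # differentiable surrogate
    embs = embs / embs.norm(dim=-1, keepdim=True)
    sims = embs @ embs.T                         # pairwise dot products
    sims = sims - 2.0 * torch.eye(len(configs))  # mask out self-similarity
    loss = sims.max(dim=1).values.mean()         # similarity of closest pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```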
2] Configuration-embedding precomputation. This second step precomputes the embedding of every configuration in the dataset. As illustrated in Figure 3 below, these configuration embeddings are obtained by rendering images of the configurations and then encoding these images with a VLM encoder. When multiple rendering functions are used, we simply average the different embeddings.
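A minimal sketch of this step, assuming `render` and `clip_image_encoder` are wrappers around the simulator renderer and the VLM image encoder (both hypothetical names):

```python
import torch

VIEWS = ["front", "left", "right"]

def configuration_embedding(config):
    embs = []
    for view in VIEWS:
        image = render(config, viewpoint=view)   # assumed renderer wrapper
        emb = clip_image_encoder(image)          # assumed VLM encoder wrapper
        embs.append(emb / emb.norm())
    return torch.stack(embs).mean(dim=0)         # average over viewpoints

# Precompute for the whole dataset (done once, offline; batched in practice).
config_embs = torch.stack([configuration_embedding(c) for c in dataset])
```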
3] Configuration retrieval. The dot product between a configuration embedding and a text embedding is called the configuration-text score. This score measures how well the text describes the configuration. We use this score to retrieve from the dataset the configurations that best match a given text. Thanks to the precomputation steps, this process is fast, taking 15 milliseconds to compute the text embedding and 13 milliseconds to score all configurations in the dataset and retrieve the highest-scoring one.
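Because the configuration embeddings are precomputed, retrieval reduces to one matrix-vector product and an argmax. A sketch, with `clip_text_encoder` an assumed wrapper around the VLM text encoder:

```python
import torch

def retrieve_goal(text, configs, config_embs):
    text_emb = clip_text_encoder(text)        # ~15 ms
    text_emb = text_emb / text_emb.norm()
    scores = config_embs @ text_emb           # configuration-text scores
    return configs[scores.argmax()]           # ~13 ms over 2.5M configurations
```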
We enhance configuration-text evaluation by assessing configurations from multiple viewpoints using multiple rendering functions (front, left and right views). As shown in Figure 5, evaluating configurations from multiple viewpoints mitigates ambiguities inherent in a single 2D image, such as occlusions, distance ambiguities and stability issues.
4] Goal reaching. To reach goals, we compare two methods: one that learns to reach configurations and one that learns to reach configuration embeddings. The method that reaches configurations has the advantage that it allows us to visualize the agent’s precise goal. However, the goal configurations may be difficult to reach and may contain irrelevant information for a given task. On the other hand, the method that reaches configuration embeddings provides less control over the final configurations but allows the agent to aim for more stable configurations and to focus only on the task-relevant part of the configurations.
4.1 Reaching configurations. We use a goal-conditioned policy trained with proximal policy optimization (PPO) to learn to reach configurations. It is trained by randomly sampling goal configurations from the embedding-diversity dataset. We use the time difference of the Euclidean distance between the current and goal configurations as reward:
$$r(s, a) = \lVert c(s) - g \rVert - \lVert c(f(s, a)) - g \rVert$$
where $g$ is the goal configuration, the function $c$ maps states to configurations, and the function $f$ gives the next state.
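In code, this reward might look as follows; `c` is an assumed environment-specific helper that extracts the configuration (position components) from a state:

```python
import numpy as np

def config_reward(state, next_state, goal_config):
    # c(.) extracts the configuration (position components) from a state;
    # an assumed helper that depends on the environment's state layout.
    d_now = np.linalg.norm(c(state) - goal_config)
    d_next = np.linalg.norm(c(next_state) - goal_config)
    return d_now - d_next  # positive when the agent moves closer to the goal
```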
4.2 Reaching embeddings. We use a goal-conditioned policy trained with proximal policy optimization (PPO) to learn to reach configuration embeddings. It is trained by randomly sampling goal configuration embeddings from the embedding-diversity dataset. We use the time difference of the cosine similarity between the current and goal configuration embeddings as reward:
$$r(s, a) = \cos\big(\hat{e}(c(f(s, a))), e_g\big) - \cos\big(\hat{e}(c(s)), e_g\big)$$
with $e_g$ the goal configuration embedding and $\hat{e}$ the distilled model, approximating the composition of the rendering functions with the VLM image encoder, which is up to 40,000 times faster to compute than the original composition.
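A sketch of this reward, reusing the distilled network (`e_hat` here) and the configuration extractor `c` assumed in the earlier sketches:

```python
import torch.nn.functional as F

def embedding_reward(state, next_state, goal_emb, e_hat, c):
    # e_hat: distilled configuration -> embedding network (assumed, as above);
    # c: extracts the configuration from a state (assumed helper).
    sim_now = F.cosine_similarity(e_hat(c(state)), goal_emb, dim=-1)
    sim_next = F.cosine_similarity(e_hat(c(next_state)), goal_emb, dim=-1)
    return sim_next - sim_now  # positive when moving toward the goal embedding
```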
[1] Stepputtis et al., Language-conditioned imitation learning for robot manipulation tasks, NeurIPS, 2020.
[2] Fu et al., From language to goals: Inverse reinforcement learning for vision-based instruction following, arXiv, 2019.
[3] Colas et al., Language-conditioned goal generation: a new approach to language grounding for RL, arXiv, 2020.
[4] Radford et al., Learning transferable visual models from natural language supervision, ICML, 2021.
[5] Mahmoudieh et al., Zero-shot reward specification via grounded natural language, ICML, 2022.
[6] Fan et al., MineDojo: Building open-ended embodied agents with internet-scale knowledge, NeurIPS, 2022.
[7] Rocamonde et al., Vision-language models are zero-shot reward models for reinforcement learning, arXiv, 2023.
[8] Baumli et al., Vision-language models as a source of rewards, arXiv, 2023.
[9] Adeniji et al., Language reward modulation for pretraining reinforcement learning, arXiv, 2023.
@inproceedings{cachet2024bridging,
  title={Bridging Environments and Language with Rendering Functions and Vision-Language Models},
  author={Theo Cachet and Christopher R Dance and Olivier Sigaud},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=ZrM67ZZ5vj}
}