Our Text2Control method controls the Humanoid from textual instructions. It finds goals from text using vision-language models and reaches these goals with a policy trained by goal-conditioned reinforcement learning (GCRL). Try it yourself with the interactive demo below!
Theo Cachet, Christopher Dance, Olivier Sigaud
International Conference on Machine Learning (ICML), Vienna, Austria, 21-27 July 2024
TL;DR: We present a new method ‘Text2Control’ for enabling agents to perform new tasks specified with natural language. The method involves inferring a goal from text using vision-language models (VLMs), then reaching that goal with a goal-conditioned agent. Our approach outperforms multitask reinforcement learning (MTRL) baselines in zero-shot generalization to new tasks.
Language is a convenient medium for humans to specify tasks. Therefore, we aim to create language-conditioned agents (LCAs) capable of performing diverse tasks specified with natural language. This requires a way to ground language, that is, to link language with an agent’s observations and actions. A natural approach to grounding language is to gather textual annotations of an environment [1, 2, 3], but collecting human annotations is costly. Recently, vision-language models (VLMs) have emerged as a promising alternative approach to grounding language.
VLMs, such as CLIP [4], measure how well an image goes with a text by computing a VLM score. This score is given by the dot product between an encoding of the image, called the image embedding, and an encoding of the text, called the text embedding. To infer the relationship between a text and the state of a robot and the objects with which it may interact, one may compute VLM scores based on rendered images of a simulated environment. As such images only depend on the configuration (i.e., position rather than velocity components of the state), we call such scores configuration-text scores. Similarly, we call the embeddings of such images configuration embeddings. These configuration-text scores have been used by previous works [5, 6, 7, 8, 9] as language-conditioned reward functions, enabling the training of LCAs with reinforcement learning (RL).
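To make this scoring concrete, here is a minimal sketch of computing a configuration-text score with the open-source CLIP model via Hugging Face transformers; the rendered image file and the prompt are placeholders, not artifacts from the paper.

```python
# Minimal sketch: score how well a text matches a rendered image with CLIP.
# Assumes a rendered frame of the simulated environment saved as "render.png".
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("render.png")                # rendered configuration
text = "a humanoid standing next to a cube"     # task description (placeholder)

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# L2-normalize, then the dot product is the configuration-text score.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score = (image_emb * text_emb).sum(dim=-1)
print(score.item())
```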
If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but it requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, we introduce a novel decomposition of the problem of building an LCA: first, find a configuration that has a high configuration-text score for text describing a task; then use a (pre-trained) goal-conditioned policy to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably the use of distilled models and the evaluation of configurations from multiple viewpoints.
To evaluate our approach, we needed an environment in which we could describe a vast array of tasks with natural language. Therefore, we built the Humanoid-plus-Cube environment, which extends the Humanoid environment from OpenAI Gym with a cube, allowing for humanoid-cube interactions. We compare the performance of various LCAs on 256 textual instructions in this environment.
Our main contributions are:
1: Distilling rendering functions and VLMs allows us to assess configurations up to 40,000 times faster. It also provides gradients from configurations to configuration embeddings, which can be used to sample datasets of diverse configurations (see the distillation sketch after this list).
2: Using multiple viewpoints to assess configurations robustifies their evaluation and mitigates ambiguities inherent in a single 2D view, such as occlusions.
3: Our LCA, based on the decomposition into VLM-based goal generation and goal reaching, outperforms MTRL baselines in zero-shot generalization on 210 out of 256 tasks.
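As a hypothetical illustration of the distillation idea, the sketch below trains a small network to approximate the slow pipeline (render the configuration, then encode the image with the VLM). The helpers `sample_configurations` and `render_and_embed` are assumed stand-ins for the simulator and VLM, and the dimensions are illustrative rather than the paper's exact setup.

```python
# Hypothetical sketch: distill (rendering functions ∘ VLM image encoder)
# into a small MLP that maps configurations directly to embeddings.
import torch
import torch.nn as nn

CONFIG_DIM, EMB_DIM = 32, 512  # illustrative sizes, not taken from the paper

distilled = nn.Sequential(
    nn.Linear(CONFIG_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, EMB_DIM),
)
optimizer = torch.optim.Adam(distilled.parameters(), lr=1e-4)

for step in range(100_000):
    configs = sample_configurations(256)     # assumed: valid simulator configs
    with torch.no_grad():
        targets = render_and_embed(configs)  # assumed: slow render + VLM encode,
                                             # returning L2-normalized embeddings
    preds = distilled(configs)
    preds = preds / preds.norm(dim=-1, keepdim=True)
    loss = 1.0 - (preds * targets).sum(dim=-1).mean()  # mean cosine distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, `distilled` is fully differentiable with respect to the configuration, which is what enables the gradient-based dataset sampling described below.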
Our method, summarized in Figure 1, consists of four stages, sketched in code after the list below:
Precomputation:
1: Sample a dataset of diverse configurations
2: Precompute the configuration embeddings with a VLM
Inference:
3: Retrieve the configuration with the highest score for a given text from the dataset
4: Reach that configuration with a pretrained goal-conditioned policy
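Before detailing each stage, here is a hedged end-to-end sketch of the pipeline; every function name is illustrative rather than taken from the paper's code.

```python
# Illustrative end-to-end pipeline; all helpers are hypothetical names.

# Precomputation (done once, offline)
configs = sample_diverse_configurations(n=2_500_000)   # stage 1
config_embs = embed_configurations(configs)            # stage 2 (VLM)

# Inference (per textual instruction)
def run_instruction(text, env):
    text_emb = embed_text(text)                        # stage 3: retrieval
    goal = configs[(config_embs @ text_emb).argmax()]
    obs = env.reset()
    done = False
    while not done:                                    # stage 4: goal reaching
        action = goal_conditioned_policy(obs, goal)
        obs, done = env.step(action)
    return obs
```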
1] Configuration dataset sampling. In this first step, we sample a dataset of 2.5 million diverse configurations. We explore several dataset-sampling methods. The best method generates a dataset of configurations with diverse configuration embeddings, called the embedding-diversity dataset. It is constructed by optimizing configurations to minimize the dot product of the closest configuration-embedding pairs. Figure 2 below illustrates this optimization process: starting from similar configurations, we obtain configurations that are sufficiently diverse that humans would describe them with different texts.
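The sketch below conveys this optimization under stated assumptions: it reuses the differentiable `distilled` model and `CONFIG_DIM` from the distillation sketch above, so configurations can be updated by gradient descent to push apart the most similar embedding pairs. Batch size and learning rate are illustrative.

```python
import torch

# Start from a batch of (possibly similar) configurations and optimize them
# so that the closest configuration-embedding pairs are pushed apart.
configs = torch.randn(4096, CONFIG_DIM, requires_grad=True)
optimizer = torch.optim.Adam([configs], lr=1e-2)

for step in range(1_000):
    embs = distilled(configs)                    # differentiable surrogate
    embs = embs / embs.norm(dim=-1, keepdim=True)
    sims = embs @ embs.T                         # pairwise dot products
    sims = sims - 2.0 * torch.eye(len(configs))  # mask out self-similarity
    loss = sims.max(dim=1).values.mean()         # similarity of closest pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```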
2] Configuration-embedding precomputation. This second step precomputes the embedding of every configuration in the dataset. As illustrated in Figure 3 below, these configuration embeddings are obtained by rendering images of the configurations and then encoding these images with a VLM encoder. When multiple rendering functions are used, we simply average the different embeddings.
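A minimal sketch of this step, assuming `render` and `clip_image_encoder` are wrappers around the simulator renderer and the VLM image encoder (both hypothetical names):

```python
import torch

VIEWS = ["front", "left", "right"]

def configuration_embedding(config):
    embs = []
    for view in VIEWS:
        image = render(config, viewpoint=view)   # assumed renderer wrapper
        emb = clip_image_encoder(image)          # assumed VLM encoder wrapper
        embs.append(emb / emb.norm())
    return torch.stack(embs).mean(dim=0)         # average over viewpoints

# Precompute for the whole dataset (done once, offline; batched in practice).
config_embs = torch.stack([configuration_embedding(c) for c in dataset])
```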
3] Configuration retrieval. The dot product between a configuration embedding and a text embedding is called the configuration-text score. This score measures how well the text describes the configuration. We use this score to retrieve from the dataset the configurations that best match a given text. Thanks to the precomputation steps, this process is fast, taking 15 milliseconds to compute the text embedding and 13 milliseconds to score all configurations in the dataset and retrieve the highest-scoring one.
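Because the configuration embeddings are precomputed, retrieval reduces to one matrix-vector product and an argmax. A sketch, with `clip_text_encoder` an assumed wrapper around the VLM text encoder:

```python
import torch

def retrieve_goal(text, configs, config_embs):
    text_emb = clip_text_encoder(text)        # ~15 ms
    text_emb = text_emb / text_emb.norm()
    scores = config_embs @ text_emb           # configuration-text scores
    return configs[scores.argmax()]           # ~13 ms over 2.5M configurations
```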
We enhance configuration-text evaluation by assessing configurations from multiple viewpoints using multiple rendering functions (front, left and right views). As shown in Figure 5, evaluating configurations from multiple viewpoints mitigates ambiguities inherent in a single 2D image, such as occlusions, distance ambiguities and stability issues.
4] Goal reaching. To reach goals, we compare two methods: one that learns to reach configurations and one that learns to reach configuration embeddings. The method that reaches configurations has the advantage that it allows us to visualize the agent’s precise goal. However, the goal configurations may be difficult to reach and may contain irrelevant information for a given task. On the other hand, the method that reaches configuration embeddings provides less control over the final configurations but allows the agent to aim for more stable configurations and to focus only on the task-relevant part of the configurations.
4.1 Reaching configurations. We use a goal-conditioned policy trained with proximal policy optimization (PPO) to learn to reach configurations. It is trained by randomly sampling goal configurations from the embedding-diversity dataset. We use the time difference of the Euclidean distance between the current and goal configurations as reward:
$$r(s, a) = \lVert c(s) - g \rVert - \lVert c(f(s, a)) - g \rVert$$
where $g$ is the goal configuration, the function $c$ maps states to configurations, and the function $f$ gives the next state.
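In code, this reward might look as follows; `c` is an assumed environment-specific helper that extracts the configuration (position components) from a state:

```python
import numpy as np

def config_reward(state, next_state, goal_config):
    # c(.) extracts the configuration (position components) from a state;
    # an assumed helper that depends on the environment's state layout.
    d_now = np.linalg.norm(c(state) - goal_config)
    d_next = np.linalg.norm(c(next_state) - goal_config)
    return d_now - d_next  # positive when the agent moves closer to the goal
```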
4.2 Reaching embeddings. We use a goal-conditioned policy trained with proximal policy optimization (PPO) to learn to reach configuration embeddings. It is trained by randomly sampling goal configuration embeddings from the embedding-diversity dataset. We use the time difference of the cosine similarity between the current and goal configuration embeddings as reward:
$$r(s, a) = \cos\big(\hat{e}(c(f(s, a))), e_g\big) - \cos\big(\hat{e}(c(s)), e_g\big)$$
with $e_g$ the goal configuration embedding and $\hat{e}$ the distilled model, approximating the composition of the rendering functions with the VLM image encoder, which is up to 40,000 times faster to compute than the original composition.
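A sketch of this reward, reusing the distilled network (`e_hat` here) and the configuration extractor `c` assumed in the earlier sketches:

```python
import torch.nn.functional as F

def embedding_reward(state, next_state, goal_emb, e_hat, c):
    # e_hat: distilled configuration -> embedding network (assumed, as above);
    # c: extracts the configuration from a state (assumed helper).
    sim_now = F.cosine_similarity(e_hat(c(state)), goal_emb, dim=-1)
    sim_next = F.cosine_similarity(e_hat(c(next_state)), goal_emb, dim=-1)
    return sim_next - sim_now  # positive when moving toward the goal embedding
```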
[1] Stepputtis et al., Language-conditioned imitation learning for robot manipulation tasks, NeurIPS, 2020.
[2] Fu et al., From language to goals: Inverse reinforcement learning for vision-based instruction following, arXiv, 2019.
[3] Colas et al., Language-conditioned goal generation: a new approach to language grounding for RL, arXiv, 2020.
[4] Radford et al., Learning transferable visual models from natural language supervision, ICML, 2021.
[5] Mahmoudieh et al., Zero-shot reward specification via grounded natural language, ICML, 2022.
[6] Fan et al., MineDojo: Building open-ended embodied agents with internet-scale knowledge, NeurIPS, 2022.
[7] Rocamonde et al., Vision-language models are zero-shot reward models for reinforcement learning, arXiv, 2023.
[8] Baumli et al., Vision-language models as a source of rewards, arXiv, 2023.
[9] Adeniji et al., Language reward modulation for pretraining reinforcement learning, arXiv, 2023.
@inproceedings{cachet2024bridging,
  title={Bridging Environments and Language with Rendering Functions and Vision-Language Models},
  author={Theo Cachet and Christopher R Dance and Olivier Sigaud},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=ZrM67ZZ5vj}
}