Gregory Rogez | 2020
Estimating the pose and shape of a human hand from an image or video, including the kinematic configuration of the fingers, is a popular research topic in computer vision. Increased interest in hand-pose-estimation algorithms has been fuelled by the growing number of applications—from video games to augmented and virtual reality (AR and VR)—that use this technology to enable the manipulation of virtual objects and interaction with virtual scenes. The success of recent VR (e.g. the Oculus Rift) and AR (e.g. HoloLens) devices was made possible by these types of algorithms.
Hand pose is more challenging to estimate during object manipulation than during simpler interactions. This is because, when viewed from a single angle, hands are often partially hidden behind the object they’re manipulating, especially when the object is large. To achieve high accuracy, such estimation should consider not only the pose and shape of the hands, but also the contact points that the hands make with the object and possibly even the forces applied during manipulation.
Research in this area has enabled the estimation of grasp poses (1), but has so far neglected the prediction of grasp types and affordances (defined as ‘opportunities of interaction’ in the scene), thus limiting applicability. Producing a more functional description could help in the development of biomedical technology that assists people. For example, the automated parsing of object manipulation could help in the long-term monitoring of rehabilitation patients handling everyday objects. Another practical application of understanding hands in action would be to enhance ‘imitation learning’, or ‘learning by demonstration’, approaches in robotics, where a robot learns a task by observing humans performing certain actions or object manipulations. Grasp prediction is one of the most important problems in robot manipulation: it involves predicting where a robot’s hand, or ‘gripper’ (typically controlled by 6 degrees of freedom, DoF), should be moved to pick up an object. Predicting affordances is an active research area (2) that combines the domains of robotics and computer vision, but ‘human grasp affordances’ have not yet been incorporated into these predictions.
Our work takes another step forward by using the geometry within images to predict related human hand poses, even when there are no hands in the scene. In other words, we’ve built technology capable of hallucinating human hands! We focused on a new, previously unexplored problem: predicting how a human would naturally grasp one or several objects, based on a single colour (RGB) image of those objects. Predicting natural human grasp affordances (examples shown in Video 1) has enormous potential not only in AR and prosthetic design but also in robotics, which is our main motivation. Indeed, when faced with one or several objects, our technology would enable an autonomous robot to predict how a human would grasp and manipulate these objects and then adapt the corresponding human hand configurations to its own gripper.
We’re also interested in predicting natural interactions. In other words, we want not only to predict all grasps that are physically possible, but also those that a human would naturally make in the context of an object. As a simple example, suppose we consider picking up a potted plant: a human would grasp the pot to avoid damaging the plant, whereas a robot might simply grasp the stem of the plant, killing it in the process. To give another example, you probably wouldn’t want a robot to grasp your laptop by the screen unless you have particularly good insurance coverage!
To predict which grasps are feasible, we need to understand the semantic content of the image (in terms of both the object and its surroundings), the geometric structures within it and all of the potential interactions between the objects in the image and a hand-shaped model. This is where GanHand comes in.
GanHand is a multi-task architecture that, given only one input image, can: 1) estimate the 3D shape and pose of the objects; 2) predict the best grasp type, according to a taxonomy of 33 grasp classes; and 3) refine the hand configuration, given by the grasp class, through an optimization of the 51 parameters of a hand model. The latter process involves maximizing the number of contact points between the object and the hand-shaped model while minimizing interpenetration, to ensure realistic predictions. Our generative model is also stochastic, which means it can predict several grasps per object. This seems appropriate when you look around and think about the many possible ways you could grasp the different objects you see.
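To make the refinement step more concrete, below is a minimal sketch of what such a contact-versus-interpenetration objective could look like. Everything here is illustrative: the function names, the weights, the contact threshold and the assumed `object_sdf` signed-distance function are our own assumptions for this sketch, not the exact formulation used in GanHand.

```python
import torch

def grasp_refinement_loss(hand_verts, object_points, object_sdf,
                          contact_weight=1.0, penetration_weight=10.0,
                          contact_thresh=0.005):
    """Illustrative objective for refining a hand pose around an object.

    hand_verts:    (V, 3) vertices of the posed hand model.
    object_points: (P, 3) points sampled on the object surface.
    object_sdf:    assumed callable mapping (V, 3) points to signed
                   distances (negative inside the object).
    """
    # Distance from every hand vertex to the closest object surface point.
    dists = torch.cdist(hand_verts, object_points)   # (V, P)
    nearest = dists.min(dim=1).values                # (V,)

    # Contact term: pull hand vertices towards the surface (within a few mm).
    contact_loss = torch.clamp(nearest - contact_thresh, min=0.0).mean()

    # Interpenetration term: penalise vertices that end up inside the object.
    sdf_vals = object_sdf(hand_verts)                # (V,)
    penetration_loss = torch.clamp(-sdf_vals, min=0.0).mean()

    return contact_weight * contact_loss + penetration_weight * penetration_loss
```

In practice, the variables being optimized would be the 51 hand-model parameters (global pose plus finger articulation), with the hand vertices produced by a differentiable hand model so that gradients of a loss like this one can flow back to those parameters.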
It’s worth pointing out that the GanHand architecture is a GAN (generative adversarial network)-type architecture. GANs are considered by many to be one of the biggest breakthroughs in the history of AI. In such architectures, two systems—two neural networks—compete against each other: a generator learns to generate plausible data, while a discriminator learns to distinguish the generator’s fake data from real data. This is the same technology that enables the creation of deepfakes (videos in which the faces of often famous people are transposed onto the faces of other people in unrelated footage) and videos of freakishly realistic people who don’t actually exist.
In our case, we applied the GAN methodology to generate hand–object interactions. Here, the discriminator’s mission is to force our generator to predict realistic interactions, i.e. hand–object configurations that could pass for real data. For full details of our mathematical approach, see our CVPR 2020 oral paper (3).
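For readers who prefer code to equations, here is a minimal, hypothetical sketch of the adversarial part of such a setup: a small discriminator scores hand–object configurations, and the generator is trained to make its predicted configurations indistinguishable from annotated ones. The network sizes, the flattened 128-dimensional configuration encoding and the standard non-saturating GAN losses are assumptions for illustration, not the exact networks or losses of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionDiscriminator(nn.Module):
    """Toy discriminator over a flattened hand-object configuration
    (e.g. hand parameters concatenated with an object encoding)."""
    def __init__(self, config_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.net(x)  # real/fake logit

def discriminator_step(disc, real_cfg, fake_cfg):
    # The discriminator learns to separate annotated grasps from generated ones.
    real_logit = disc(real_cfg)
    fake_logit = disc(fake_cfg.detach())
    return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) +
            F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

def generator_adversarial_loss(disc, fake_cfg):
    # The generator is rewarded when its grasps look real to the discriminator.
    fake_logit = disc(fake_cfg)
    return F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
```

In the full system, an adversarial term of this kind would be combined with the object-reconstruction, grasp-classification and contact/interpenetration objectives described above.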
As is often the case, data was a critical part of our research. To train our model, we needed a lot of data: thousands of images of scenes showing multiple objects, with annotations of realistic human grasps of these objects. Unfortunately for us, such data didn’t exist, probably because we were tackling a new problem. To address this, we built a dedicated dataset—the YCB-Affordance dataset—that contains more than 133,000 images of 21 objects from the YCB-Video dataset. We annotated these images with more than 28 million plausible 3D human grasps, making it the largest dataset of human grasp affordances in real scenes to date. The grasps were defined following the 33-class taxonomy of Feix and colleagues (4) in a semi-automatic manner, which entailed quite a bit of sweat: the GraspIt simulator wasn’t able to automatically find certain complex grasps (e.g. those that require abducted thumbs), so human annotation was required to complete the dataset.
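As a purely hypothetical illustration of how one grasp annotation in such a dataset might be represented, the sketch below bundles together the quantities mentioned above (an RGB frame, one of the 21 YCB objects and its pose, one of the 33 grasp classes, and the 51 hand-model parameters); all field names and values are made up for this example.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspAnnotation:
    """Hypothetical record for one annotated grasp in a YCB-Affordance-style dataset."""
    image_path: str          # RGB frame from a multi-object scene
    object_id: int           # which of the 21 YCB objects is being grasped
    object_pose: np.ndarray  # (4, 4) object pose in the camera frame
    grasp_class: int         # index into the 33-class Feix taxonomy
    hand_params: np.ndarray  # (51,) hand-model parameters (global pose + articulation)

def load_dummy_annotation():
    # Placeholder values purely for illustration.
    return GraspAnnotation(
        image_path="scene_0001/frame_000123.png",
        object_id=3,
        object_pose=np.eye(4),
        grasp_class=7,
        hand_params=np.zeros(51),
    )
```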
To test the ability of our model, we performed a thorough evaluation with synthetic and real images. Our results showed that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact, as shown in the examples in Video 2. GanHand achieves a higher percentage of graspable objects and a higher accuracy in predicted grasp types compared to our baseline (a pre-trained ResNet-50 model).
All the code used for training, the data and a pre-trained model will be made available on the project web site.
Our next challenge is to use GanHand technology with a real robot. With Aalto University and IRI, NAVER LABS Europe is focusing on applying this functionality to a simple robot gripper with three fingers and 11 DoF. We’re currently annotating realistic grasps for this gripper in relation to a group of objects. Then, we’ll use this data to train a model that predicts how to grasp each object in a scene from a single RGB image.
Acknowledgements. This work was done in collaboration with our colleagues from the Institut de Robòtica i Informàtica Industrial (IRI) in Barcelona, Spain. IRI is a joint research centre of the Spanish National Research Council (CSIC) and the Technical University of Catalonia (UPC), and one of our collaborators in the NAVER Global AI R&D Belt.
References