Estimating the pose and shape of a human hand from an image or video, including the kinematic configuration of the fingers, is a popular research topic in computer vision. Interest in hand-pose-estimation algorithms has been fuelled by the growing number of applications, from video games to augmented and virtual reality (AR and VR), that use this technology to enable the manipulation of virtual objects and interaction with virtual scenes. The success of recent VR (e.g. the Oculus Rift) and AR (e.g. HoloLens) devices was made possible by these types of algorithms.
Estimating the pose of hands is more challenging during object manipulation than during simpler interactions. This is because, when viewed from a single angle, hands are often partially hidden behind the object that they’re manipulating, especially when the object is large. To achieve high accuracy, such estimation should consider not only the pose and shape of the hands, but also the contact points that the hands are making with the object and possibly even the forces that are applied during manipulation.
Research in this area has enabled the estimation of grasp poses (1), but has so far neglected the prediction of grasp types and affordances (defined as ‘opportunities of interaction’ in the scene), thus limiting applicability. Producing a more functional description could help in the development of biomedical technology that assists people. For example, the automated parsing of object manipulation could help in the long-term monitoring of rehabilitation patients handling everyday objects. Another practical application of understanding hands in action would be to enhance ‘imitation learning’ or ‘learning by demonstration’ approaches for robotics, where a robot learns a task by observing humans performing certain actions or object manipulations. Grasp prediction is one of the most important problems in robot manipulation. It involves predicting where a robot’s hand, or ‘gripper’ (typically controlled by 6 degrees of freedom, DoF), should be moved to pick up an object. Predicting affordances is an active research area (2) that combines the domains of robotics and computer vision, but ‘human grasp affordances’ have not yet been incorporated in these predictions.
Our work takes another step forward by using the geometry within images to predict related human hand poses, even when there are no hands in the scene. In other words, we’ve built technology capable of hallucinating human hands! We focused on a new, previously unexplored problem: predicting how a human would naturally grasp one or several objects, based on a single colour (RGB) image of those objects. Predicting natural human grasp affordances (examples shown in Video 1) has enormous potential not only in AR and prosthetic design but also in robotics, which is our main motivation. Indeed, when faced with one or several objects, our technology would enable an autonomous robot to predict how a human would grasp and manipulate these objects and then adapt the corresponding human hand configurations to its own gripper.
We’re also interested in predicting natural interactions. In other words, we want not only to predict all grasps that are physically possible, but also those that a human would naturally make in the context of an object. As a simple example, suppose we consider picking up a potted plant: a human would grasp the pot to avoid damaging the plant, whereas a robot might simply grasp the stem of the plant, killing it in the process. To give another example, you probably wouldn’t want a robot to grasp your laptop by the screen unless you have particularly good insurance coverage!
To predict which grasps are feasible, we need to understand the semantic content of the image (in terms of both the object and its surroundings), the geometric structures within it and all of the potential interactions between the objects in the image and a hand-shaped model. This is where GanHand comes in.
GanHand is a multi-task architecture that, given only one input image, can: 1) estimate the 3D shape and pose of the objects; 2) predict the best grasp type, according to a taxonomy of 33 grasp classes; and 3) refine the hand configuration, given by the grasp class, through an optimization of the 51 parameters of a hand model. This last step involves maximizing the number of contact points between the object and the hand-shaped model while minimizing interpenetration, to ensure realistic predictions. Our generative model is also stochastic, which means it can predict several grasps per object. This seems appropriate when you look around and think about the many possible ways you could grasp the different objects you see.
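To make the trade-off in the refinement step more concrete, here is a minimal Python sketch of a cost that rewards contact points and penalises interpenetration. It is an illustration only, not the objective used in GanHand: the object is assumed to be a sphere so that its signed distance is easy to compute, the hand is reduced to a handful of 3D points, and the names `sphere_signed_distance`, `grasp_cost`, `contact_tol` and `penetration_weight` (along with their default values, in metres) are hypothetical.

```python
import numpy as np


def sphere_signed_distance(points, center, radius):
    """Signed distance to a sphere surface: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=1) - radius


def grasp_cost(hand_points, center, radius, contact_tol=0.005, penetration_weight=100.0):
    """Toy cost (assumption, not the paper's loss): lower is better.

    Points within `contact_tol` of the surface count as contacts (rewarded);
    points deeper inside the object than `contact_tol` add a penetration penalty.
    """
    d = sphere_signed_distance(hand_points, center, radius)
    n_contacts = int(np.sum(np.abs(d) <= contact_tol))
    penetration = float(np.sum(np.clip(-d - contact_tol, 0.0, None)))
    cost = penetration_weight * penetration - n_contacts
    return cost, n_contacts, penetration


# Hypothetical object: a ball of radius 5 cm centred at the origin.
center = np.zeros(3)
radius = 0.05

# Three points resting on the surface and one far away from the object.
good_hand = np.array([[0.05, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, 0.052], [0.2, 0.0, 0.0]])
# Two points buried inside the ball and two far away from it.
bad_hand = np.array([[0.02, 0.0, 0.0], [0.0, 0.01, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, 0.3]])

good = grasp_cost(good_hand, center, radius)
bad = grasp_cost(bad_hand, center, radius)
print("good hand (cost, contacts, penetration):", good)
print("bad hand  (cost, contacts, penetration):", bad)
```

In this toy setting the hand that touches the surface without entering the object gets the lower cost. A real system would evaluate such terms on full hand and object meshes and minimise them with respect to the hand-model parameters rather than comparing two fixed point sets.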
It’s worth pointing out that GanHand is a GAN (generative adversarial network) type of architecture. GANs are considered by many to be one of the biggest breakthroughs in the history of AI. In such architectures, two neural networks compete against each other: a generator learns to generate plausible data, while a discriminator learns to distinguish the generator’s fake data from real data. This is the same technology that enables the creation of deepfakes (videos in which the faces of often famous people are transposed onto the faces of other people in unrelated footage) and videos of freakishly realistic people who don’t actually exist.
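The competition between the two networks can be written as a pair of loss functions. The sketch below uses the standard GAN formulation (with the commonly used non-saturating generator loss) and operates directly on discriminator outputs, i.e. probabilities that a sample is real. It is meant as a generic illustration and is not necessarily the adversarial loss used to train GanHand; the function names and the `eps` clipping constant are assumptions.

```python
import numpy as np


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Standard GAN discriminator loss.

    d_real: discriminator outputs (probability of "real") on real samples.
    d_fake: discriminator outputs on generated samples.
    The loss is low when real samples score near 1 and fakes near 0.
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: low when fakes are scored as real."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_fake)))


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
# A discriminator that separates them well scores reals high and fakes low.
sharp = discriminator_loss(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
print("discriminator loss, confused vs sharp:", confused, sharp)

# The generator is better off when its samples fool the discriminator.
print("generator loss, fooling vs failing:",
      generator_loss(np.array([0.9])), generator_loss(np.array([0.1])))
```

During training the two losses are minimised in alternation, so each network improves against the current version of the other.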
In our case, we applied the GAN methodology to generate hand–object interactions. Here, the discriminator’s mission is to force our generator to predict realistic interactions, i.e. hand–object configurations that could pass for real data. For full details of our mathematical approach, see our CVPR 2020 oral paper (3).
As is often the case, data was a critical part of our research. To train our model, we needed a lot of data: thousands of images of scenes showing multiple objects, with annotations of realistic human grasps of these objects. Unfortunately for us, such data didn’t exist, probably because we were tackling a new problem. To address this, we built a dedicated dataset, the YCB-Affordance dataset, that contains more than 133,000 images of 21 objects from the YCB-Video dataset. We annotated these images with more than 28 million plausible 3D human grasps, making it the largest dataset of human grasp affordances in real scenes to date. The grasps were defined in a semi-automatic manner following the 33-class taxonomy of Feix and colleagues (4), which entailed quite a bit of sweat: the GraspIt simulator wasn’t able to automatically find certain complex grasps (e.g. those that require abducted thumbs), so human annotation was required to complete the dataset.
To test the ability of our model, we performed a thorough evaluation with synthetic and real images. Our results showed that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact, as shown in the examples in Video 2. GanHand achieves a higher percentage of graspable objects and a higher accuracy in predicted grasp types compared to our baseline (a pre-trained ResNet-50 model).
All the code used for training, the data and a pre-trained model will be made available on the project website.
Our next challenge is to use GanHand technology with a real robot. With Aalto University and IRI, NAVER LABS Europe is focusing on applying this functionality to a simple robot gripper with three fingers and 11 DoFs. We’re currently annotating realistic grasps for this gripper in relation to a group of objects. Then, we’ll use this data to train a model that predicts how to grasp each object in a scene from a single RGB image.
Acknowledgements. This work was done in collaboration with our colleagues from the Institut de Robòtica i Informàtica Industrial (IRI) in Barcelona, Spain. IRI is a Joint Research Center of the Spanish Council for Scientific Research (CSIC) and the Technical University of Catalonia (UPC), and one of our collaborators in the NAVER Global AI R&D belt.