Human Centric Computer Vision
Learning visual models that can understand and predict human behaviour from images or videos.
Research in human understanding, modelling and analysis in natural scenes is about learning data-efficient visual models that understand and predict human behaviour, in our case for safer human-robot interaction (HRI) and learning from demonstration. We focus on 2D/3D human pose estimation (estimating the articulated body joints), 3D shape reconstruction, and activity recognition and prediction in natural scenes, with the associated challenges of large variations in visual appearance (due to camera viewpoint, lighting conditions, clothing, morphology, age/gender…) and the difficulty of producing and annotating the required data. We work on low-level models of the 3D shape and pose of the body and its parts, as well as on models of higher-level activities and interactions, for example with objects, other humans or robotic agents. We work with visual data such as single images, monocular sequences of frames or live video streams. The research can be roughly divided along the four lines below, although a number of models and algorithms overlap (perception/understanding or understanding/control).
Perceive
In the field of perception we’ve worked on representing the human body, first as a few 3D keypoints with DOPE back in 2020, then as a denser 3D shape representation with PoseBERT. PoseBERT is a generic transformer module that can be plugged into any video-based model, with inputs ranging from 3D skeleton keypoints to the rotations of a 3D parametric model such as SMPL, for the full body or even just the hands. More recent work has added finer-grained detail, including hair and clothing, using a multi-camera framework, and we should soon be able to obtain this from a single camera. These different stages are illustrated in Video 1 below.
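For readers curious about what a “generic transformer module” means in practice, here is a minimal PyTorch-style sketch (our own illustrative code, not the actual PoseBERT implementation; all names and sizes are placeholders): a temporal transformer that takes noisy per-frame pose vectors from any image-based estimator and returns a temporally refined sequence of the same shape.

```python
import torch
import torch.nn as nn

class TemporalPoseTransformer(nn.Module):
    """Schematic stand-in for a PoseBERT-like module: it takes a sequence of
    per-frame pose vectors (e.g. 3D keypoints or SMPL rotations flattened to
    `pose_dim`) and outputs a temporally refined sequence of the same shape."""

    def __init__(self, pose_dim: int = 72, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)            # one token per frame
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, pose_dim)              # back to pose space

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, time, pose_dim) noisy per-frame estimates from any image-based model
        tokens = self.embed(poses)
        refined = self.encoder(tokens)
        return self.head(refined)

# Plugging it on top of any per-frame estimator (random placeholders here):
per_frame_estimates = torch.randn(1, 16, 72)    # 16 frames of SMPL-like pose vectors
smoothed = TemporalPoseTransformer()(per_frame_estimates)
print(smoothed.shape)                           # torch.Size([1, 16, 72])
```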
Our Multi-HMR model goes a step further: it detects humans directly in camera space and exploits visual cues, such as the positions of the eyes or the fingers, to estimate expressive whole-body meshes for multiple people, all from a single image and in a single pipeline. You can try a demo and access the code here! We will soon be adding hair and clothing to Multi-HMR.
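To make the “single image in, one expressive mesh per person out” idea concrete, here is a schematic sketch of what a per-person prediction could contain (field names and tensor sizes are ours for illustration, not the actual Multi-HMR API):

```python
from dataclasses import dataclass
import torch

@dataclass
class PersonMesh:
    """Illustrative per-person output of a single-shot, whole-body recovery pass."""
    translation: torch.Tensor  # (3,)   position of the person in camera space
    pose: torch.Tensor         # (J, 3) body, hand and face rotations (axis-angle)
    shape: torch.Tensor        # (10,)  body shape coefficients
    vertices: torch.Tensor     # (V, 3) posed mesh vertices in camera space

# One forward pass over one image yields a list with one entry per detected person,
# e.g. two people here (J and V below are placeholder sizes):
people = [
    PersonMesh(torch.zeros(3), torch.zeros(52, 3), torch.zeros(10), torch.zeros(6890, 3))
    for _ in range(2)
]
print(len(people), people[0].vertices.shape)  # 2 torch.Size([6890, 3])
```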
As part of this research in perception we created a 4D human motion dataset in collaboration with our partner Inria using their Kinovis platform. This dataset, called 4D Human Outfits, has been released to the broader research community.
Understand
By accurately analysing the posture and gestures of a person, given sufficient training data, we can make predictions about what they will do next. This action prediction is of particular interest to us in the field of HRI and autonomous robots. Video 2 illustrates the PoseGPT algorithm, inspired by GPT, which compresses human motion into quantized latent sequences. Based on the input on the far left, PoseGPT makes a number of predictions which, ordered from left to right after the 3D keypoint output, are ‘person could carry on stretching’, ‘person could jump’ or ‘person could move away’. Purposer further extends this approach by adding environmental and contextual constraints. Related work in understanding includes PoseBERT (mentioned above), Mimetics (action recognition out of context) and GanHand (predicting how a robot can pick up objects), as well as PoseFix and PoseScript, which are described in the Control section below.
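The two ingredients behind this kind of model, discretising motion into tokens and then sampling several plausible continuations, can be sketched in a few lines of PyTorch (an illustrative toy, not the PoseGPT code; the codebook, the causal model and all sizes below are placeholders):

```python
import torch
import torch.nn.functional as F

def quantize(motion_feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour lookup turning continuous motion features into discrete
    token ids: the 'quantized latent sequence' idea, stripped to its core."""
    # motion_feats: (T, D), codebook: (K, D)  ->  (T,) token indices
    return torch.cdist(motion_feats, codebook).argmin(dim=-1)

@torch.no_grad()
def sample_futures(model, prefix: torch.Tensor, n_new: int, n_samples: int) -> torch.Tensor:
    """Autoregressively sample several plausible continuations of a token prefix.
    `model` is any causal network returning next-token logits of shape (B, T, K)."""
    seqs = prefix.unsqueeze(0).repeat(n_samples, 1)           # (n_samples, T)
    for _ in range(n_new):
        logits = model(seqs)[:, -1]                           # logits for the next token
        next_tok = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        seqs = torch.cat([seqs, next_tok], dim=1)
    return seqs  # each row would be decoded back to motion by the quantizer's decoder

# Toy run with random features and a random stand-in 'model':
tokens = quantize(torch.randn(16, 64), codebook=torch.randn(512, 64))
futures = sample_futures(lambda s: torch.randn(s.size(0), s.size(1), 512),
                         prefix=tokens[:8], n_new=8, n_samples=3)
print(futures.shape)  # torch.Size([3, 16]): three different imagined futures
```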
Control
Vision-based control is of interest for many applications in VR/AR but also in teaching robots. PoseScript uses a natural-language text description to generate body poses from a learned model. In Video 3 below you can see that, when we don’t give many details, the model generates many different body poses corresponding to the description; as more details about the arms and the legs are given, the generated poses start to look the same because there are fewer remaining possibilities. Applications of natural-language descriptions include retrieval, i.e. finding poses in large datasets, and synthetic pose generation for AR/VR and training. PoseScript also comes with a dataset, code and even a demo!
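This diversity-versus-specificity behaviour can be pictured with a small sampling sketch (our own, with a placeholder decoder, not the PoseScript model): the same caption embedding is paired with several random latents, and the spread of the decoded poses reflects how constraining the description is.

```python
import torch

@torch.no_grad()
def sample_poses(decoder, text_embedding: torch.Tensor,
                 n_samples: int = 8, latent_dim: int = 32) -> torch.Tensor:
    """Draw several poses for one caption by pairing different random latents with
    the same text condition. `decoder` is any network mapping
    (latent, text_embedding) -> pose parameters; all names here are illustrative."""
    z = torch.randn(n_samples, latent_dim)
    text = text_embedding.expand(n_samples, -1)
    return decoder(z, text)

def pose_spread(poses: torch.Tensor) -> float:
    """Mean per-dimension standard deviation over the samples: a rough proxy for
    how many distinct poses the description still allows (lower = more constrained)."""
    return poses.std(dim=0).mean().item()

# Toy run with a random stand-in decoder and a random caption embedding:
decoder = lambda z, text: torch.randn(z.size(0), 72)   # 72-D pose vector, e.g. SMPL rotations
print(pose_spread(sample_poses(decoder, torch.randn(1, 512))))
```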
PoseFix extends PoseScript to correct a 3D human pose using natural language. We show the potential of this dataset on two tasks: text-based pose editing, where a corrected 3D body pose is generated from a query pose and a text modifier, and correctional text generation, where the instructions are generated from the differences between two body poses.
In PoseEmbroider, we combine 3D poses, pictures of people and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities.
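“Trained in a retrieval fashion” essentially means pulling the embeddings of matching poses, images and captions together while pushing mismatched ones apart. A common objective for this is the symmetric InfoNCE loss sketched below (a simplified stand-in written by us, not the exact PoseEmbroider objective):

```python
import torch
import torch.nn.functional as F

def infonce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings: matching pairs
    (same row index) are pulled together, all other pairs are pushed apart."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))          # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical embeddings of the same people seen as 3D poses, images and captions:
batch, dim = 8, 256
pose_emb, image_emb, text_emb = (torch.randn(batch, dim) for _ in range(3))
loss = infonce(pose_emb, image_emb) + infonce(pose_emb, text_emb) + infonce(image_emb, text_emb)
print(loss.item())
```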
Render
We’ve developed the first algorithm that, given a single image of a person, can generate novel views of them. Monocular Neural Human Renderer (MonoNHR) first removes the background (the parts that are not human) from the image, then moves the camera. In the example shown in Video 4 on the left, we don’t move the camera far from its original location and the results are pretty good. If we move the camera around the person, the algorithm has to invent what it doesn’t see, as shown in Video 5 on the right, which is a difficult task since we only have a single image. For now the results are a bit blurry, but we’re working on improving them.
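“Moving the camera around the person” amounts to asking the renderer for new camera poses on an orbit around the subject. Here is a small NumPy sketch of how such viewpoints could be generated (our illustration only; the neural renderer itself is not shown):

```python
import numpy as np

def orbit_cameras(radius: float, n_views: int, height: float = 0.0):
    """Place cameras on a circle around a subject assumed to sit at the origin,
    each one looking at the subject. Returns (rotation, translation) pairs in a
    common world-to-camera convention."""
    extrinsics = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False):
        cam_pos = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = -cam_pos / np.linalg.norm(cam_pos)       # camera looks at the origin
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        R = np.stack([right, up, forward])                 # rows: camera x, y, z axes
        t = -R @ cam_pos
        extrinsics.append((R, t))
    return extrinsics

# Eight viewpoints on a 2.5 m circle around the person; each one would be fed
# to the renderer to synthesise a novel view from the single input image.
views = orbit_cameras(radius=2.5, n_views=8)
print(len(views), views[0][0].shape)  # 8 (3, 3)
```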
Applications in robotics
There are lots of applications for this work, but our focus at NAVER LABS is on robotics. We use our research to help a robot navigate smoothly and safely in an environment full of people, to help it identify whether a person is paying attention to it or not, and even to predict whether they intend to interact with it.
Recovering the detailed 3D shape of a hand interacting with objects can be used to teach a robot how to grasp and manipulate an object, where the robot needs to understand the contact points between the hand and the object in great detail. It could also be useful in the context of human-robot collaboration, where a person hands an object over to a robot: the robot needs to detect the hand precisely to avoid hurting the person when taking the object.
Learn more about GanHand: estimating the pose of a hand to enable human-like robot manipulation
Learn more about Leveraging MoCap data for human mesh recovery
Related Publications
- PoseEmbroider: towards a 3D, visual, semantic-aware human pose representation, ECCV 2024
- Multi-HMR: multi-person whole-body human mesh recovery in a single shot, ECCV 2024
- CroCoMan: cross-view and cross-pose completion for 3D human understanding, CVPR 2024
- Purposer: putting human motion generation in context, 3DV 2024
- PoseFix: correcting 3D human poses with natural language, ICCV 2023
- PoseBERT: a generic transformer module for temporal 3D human modeling, TPAMI, December 2022
- PoseScript: 3D human poses from natural language, ECCV 2022
- Multi-finger grasping like humans, IROS 2022
- PoseGPT: quantization-based 3D human motion generation and forecasting, ECCV 2022
- MonoNHR: Monocular Neural Human Renderer, 3DV 2022
- Leveraging MoCap data for human mesh recovery, 3DV 2021
- Multi-FinGAN: generative coarse-to-fine sampling of multi-finger grasps, ICRA 2021
- Mimetics: towards understanding human action out of context, IJCV 2021
- SMPLy benchmarking 3D human pose estimation in the wild, 3DV 2020
- DOPE: distillation of part experts for whole-body 3D pose estimation in the wild, ECCV 2020
- Measuring generalisation to unseen viewpoints, articulations, shapes and objects for 3D hand pose estimation under hand-object interaction, ECCV 2020 (based on the HANDS 2019 challenge)
- GanHand: predicting human grasp affordance in multi-object scenes, CVPR 2020
- Moulding humans: non-parametric 3D human shape estimation from single images, ICCV 2019