Human Centric Computer Vision
Learning visual models that can understand and predict human behaviour from images or videos.
Research in human understanding, modelling and analysis in natural scenes is about learning data-efficient visual models that understand and predict human behaviour, in our case for safer human-robot interaction (HRI) and learning from demonstration. We focus on 2D/3D human pose estimation (estimating the articulated body joints), 3D shape reconstruction, and activity recognition and prediction in natural scenes, with the associated challenges of large variations in visual appearance (due to camera viewpoint, lighting conditions, clothing, morphology, age/gender…) and the difficulty of producing and annotating the required data. We work on low-level models of the 3D shape and pose of the body and its parts, as well as on models of higher-level activities and interactions, for example with objects, other humans or robotic agents. We work with visual data such as single images, monocular sequences of frames or live video streams. The research can be roughly divided along the four lines below, although a number of models and algorithms overlap (perception/understanding or understanding/control).
Perceive
In the field of perception we’ve worked on representing the human body, first as a few 3D keypoints with DOPE back in 2020, then as a denser 3D shape representation with PoseBERT. PoseBERT is a generic transformer module that can be plugged into any video-based model, with inputs ranging from 3D skeleton keypoints to the rotations of a 3D parametric model such as SMPL, for the full body or even just the hands. More recent work has added finer-grained detail, including hair and clothing, using a multi-camera framework, and we should soon be able to obtain this from a single camera. These different stages are illustrated in Video 1 below.
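For readers curious about what a “generic transformer module” means in practice, here is a minimal PyTorch-style sketch (our own illustrative code, not the actual PoseBERT implementation; all names and sizes are placeholders): a temporal transformer that takes noisy per-frame pose vectors from any image-based estimator and returns a temporally refined sequence of the same shape.

```python
import torch
import torch.nn as nn

class TemporalPoseTransformer(nn.Module):
    """Schematic stand-in for a PoseBERT-like module: it takes a sequence of
    per-frame pose vectors (e.g. 3D keypoints or SMPL rotations flattened to
    `pose_dim`) and outputs a temporally refined sequence of the same shape."""

    def __init__(self, pose_dim: int = 72, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)            # one token per frame
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, pose_dim)              # back to pose space

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, time, pose_dim) noisy per-frame estimates from any image-based model
        tokens = self.embed(poses)
        refined = self.encoder(tokens)
        return self.head(refined)

# Plugging it on top of any per-frame estimator (random placeholders here):
per_frame_estimates = torch.randn(1, 16, 72)    # 16 frames of SMPL-like pose vectors
smoothed = TemporalPoseTransformer()(per_frame_estimates)
print(smoothed.shape)                           # torch.Size([1, 16, 72])
```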
Our Multi-HMR model goes a step further: it detects humans directly in camera space and exploits visual cues, such as the positions of the eyes or the fingers, to estimate expressive whole-body meshes for multiple people, all from a single image and in a single pipeline. You can try a demo and access the code here! We will soon be adding hair and clothing to Multi-HMR.
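To make the “single image in, one expressive mesh per person out” idea concrete, here is a schematic sketch of what a per-person prediction could contain (field names and tensor sizes are ours for illustration, not the actual Multi-HMR API):

```python
from dataclasses import dataclass
import torch

@dataclass
class PersonMesh:
    """Illustrative per-person output of a single-shot, whole-body recovery pass."""
    translation: torch.Tensor  # (3,)   position of the person in camera space
    pose: torch.Tensor         # (J, 3) body, hand and face rotations (axis-angle)
    shape: torch.Tensor        # (10,)  body shape coefficients
    vertices: torch.Tensor     # (V, 3) posed mesh vertices in camera space

# One forward pass over one image yields a list with one entry per detected person,
# e.g. two people here (J and V below are placeholder sizes):
people = [
    PersonMesh(torch.zeros(3), torch.zeros(52, 3), torch.zeros(10), torch.zeros(6890, 3))
    for _ in range(2)
]
print(len(people), people[0].vertices.shape)  # 2 torch.Size([6890, 3])
```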
As part of this research in perception we created a 4D human motion dataset in collaboration with our partner Inria using their Kinovis platform. This dataset, called 4D Human Outfits, has been released to the broader research community.
Understand
By accurately analysing the posture and gestures of a person, given sufficient training data, we can make predictions about what they will do next. This action prediction is of particular interest to us in the field of HRI and autonomous robots. Video 2 illustrates the PoseGPT algorithm, inspired by GPT, which compresses human motion into quantized latent sequences. Based on the input on the far left, PoseGPT makes a number of predictions which, ordered from left to right after the 3D keypoint output, are ‘person could carry on stretching’, ‘person could jump’ or ‘person could move away’. Purposer further extends this approach by adding environmental and contextual constraints. Related work in understanding includes PoseBERT (mentioned above), Mimetics (action recognition out of context) and GanHand (predicting how a robot can pick up objects), as well as PoseFix and PoseScript, which are described in the Control section below.
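The two ingredients behind this kind of model, discretising motion into tokens and then sampling several plausible continuations, can be sketched in a few lines of PyTorch (an illustrative toy, not the PoseGPT code; the codebook, the causal model and all sizes below are placeholders):

```python
import torch
import torch.nn.functional as F

def quantize(motion_feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour lookup turning continuous motion features into discrete
    token ids: the 'quantized latent sequence' idea, stripped to its core."""
    # motion_feats: (T, D), codebook: (K, D)  ->  (T,) token indices
    return torch.cdist(motion_feats, codebook).argmin(dim=-1)

@torch.no_grad()
def sample_futures(model, prefix: torch.Tensor, n_new: int, n_samples: int) -> torch.Tensor:
    """Autoregressively sample several plausible continuations of a token prefix.
    `model` is any causal network returning next-token logits of shape (B, T, K)."""
    seqs = prefix.unsqueeze(0).repeat(n_samples, 1)           # (n_samples, T)
    for _ in range(n_new):
        logits = model(seqs)[:, -1]                           # logits for the next token
        next_tok = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        seqs = torch.cat([seqs, next_tok], dim=1)
    return seqs  # each row would be decoded back to motion by the quantizer's decoder

# Toy run with random features and a random stand-in 'model':
tokens = quantize(torch.randn(16, 64), codebook=torch.randn(512, 64))
futures = sample_futures(lambda s: torch.randn(s.size(0), s.size(1), 512),
                         prefix=tokens[:8], n_new=8, n_samples=3)
print(futures.shape)  # torch.Size([3, 16]): three different imagined futures
```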
Control
Vision-based control is of interest for many applications in VR/AR but also in teaching robots. PoseScript uses a natural-language text description to generate body poses from a learned model. In Video 3 below you can see that, when we don’t give many details, the model generates many different body poses corresponding to the description; as more details about the arms and the legs are given, the generated poses start to look the same because there are fewer remaining possibilities. Applications of natural-language descriptions include retrieval, i.e. finding poses in large datasets, and synthetic pose generation for AR/VR and training. PoseScript also comes with a dataset, code and even a demo!
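This diversity-versus-specificity behaviour can be pictured with a small sampling sketch (our own, with a placeholder decoder, not the PoseScript model): the same caption embedding is paired with several random latents, and the spread of the decoded poses reflects how constraining the description is.

```python
import torch

@torch.no_grad()
def sample_poses(decoder, text_embedding: torch.Tensor,
                 n_samples: int = 8, latent_dim: int = 32) -> torch.Tensor:
    """Draw several poses for one caption by pairing different random latents with
    the same text condition. `decoder` is any network mapping
    (latent, text_embedding) -> pose parameters; all names here are illustrative."""
    z = torch.randn(n_samples, latent_dim)
    text = text_embedding.expand(n_samples, -1)
    return decoder(z, text)

def pose_spread(poses: torch.Tensor) -> float:
    """Mean per-dimension standard deviation over the samples: a rough proxy for
    how many distinct poses the description still allows (lower = more constrained)."""
    return poses.std(dim=0).mean().item()

# Toy run with a random stand-in decoder and a random caption embedding:
decoder = lambda z, text: torch.randn(z.size(0), 72)   # 72-D pose vector, e.g. SMPL rotations
print(pose_spread(sample_poses(decoder, torch.randn(1, 512))))
```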
PoseFix extends PoseScript to correct a 3D human pose using natural language. We show the potential of this dataset on two tasks: text-based pose editing, where a corrected 3D body pose is generated from a query pose and a text modifier, and correctional text generation, where the instructions are generated from the differences between two body poses.
In PoseEmbroider, we combine 3D poses, pictures of people and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities.
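“Trained in a retrieval fashion” essentially means pulling the embeddings of matching poses, images and captions together while pushing mismatched ones apart. A common objective for this is the symmetric InfoNCE loss sketched below (a simplified stand-in written by us, not the exact PoseEmbroider objective):

```python
import torch
import torch.nn.functional as F

def infonce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of embeddings: matching pairs
    (same row index) are pulled together, all other pairs are pushed apart."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))          # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical embeddings of the same people seen as 3D poses, images and captions:
batch, dim = 8, 256
pose_emb, image_emb, text_emb = (torch.randn(batch, dim) for _ in range(3))
loss = infonce(pose_emb, image_emb) + infonce(pose_emb, text_emb) + infonce(image_emb, text_emb)
print(loss.item())
```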
Render
We’ve developed the first algorithm that, given a single image of a person, can generate novel views of them. Monocular Neural Human Renderer (MonoNHR) first removes the background (the parts that are not human) from the image, then moves the camera. In the example shown in Video 4 on the left, we don’t move the camera far from its original location and the results are pretty good. If we move the camera around the person, the algorithm has to invent what it doesn’t see, as shown in Video 5 on the right, which is a difficult task since we only have a single image. For now the results are a bit blurry, but we’re working on improving them.
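“Moving the camera around the person” amounts to asking the renderer for new camera poses on an orbit around the subject. Here is a small NumPy sketch of how such viewpoints could be generated (our illustration only; the neural renderer itself is not shown):

```python
import numpy as np

def orbit_cameras(radius: float, n_views: int, height: float = 0.0):
    """Place cameras on a circle around a subject assumed to sit at the origin,
    each one looking at the subject. Returns (rotation, translation) pairs in a
    common world-to-camera convention."""
    extrinsics = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False):
        cam_pos = np.array([radius * np.cos(theta), height, radius * np.sin(theta)])
        forward = -cam_pos / np.linalg.norm(cam_pos)       # camera looks at the origin
        right = np.cross(np.array([0.0, 1.0, 0.0]), forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        R = np.stack([right, up, forward])                 # rows: camera x, y, z axes
        t = -R @ cam_pos
        extrinsics.append((R, t))
    return extrinsics

# Eight viewpoints on a 2.5 m circle around the person; each one would be fed
# to the renderer to synthesise a novel view from the single input image.
views = orbit_cameras(radius=2.5, n_views=8)
print(len(views), views[0][0].shape)  # 8 (3, 3)
```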
Applications in robotics
There are lots of applications for this work, but our focus at NAVER LABS is on robotics. We use our research to help a robot navigate smoothly and safely in an environment full of people, to help it identify whether a person is paying attention to it or not, and even to predict whether they intend to interact with it.
Recovering the detailed 3D shape of a hand interacting with objects can be used to teach a robot how to grasp and manipulate an object, where the robot needs to understand the contact points between the hand and the object in great detail. It could also be useful in the context of human-robot collaboration, where a person hands an object over to a robot: the robot needs to detect the hand precisely to avoid hurting the person when taking the object.
Learn more about GanHand: estimating the pose of a hand to enable human-like robot manipulation
Learn more about Leveraging MoCap data for human mesh recovery
Related Publications
- PoseEmbroider: towards a 3D, visual, semantic-aware human pose representation, ECCV 2024
- Multi-HMR: multi-person whole-body human mesh recovery in a single shot, ECCV 2024
- CroCoMan: cross-view and cross-pose completion for 3D human understanding, CVPR 2024
- Purposer: putting human motion generation in context, 3DV 2024
- PoseFix: correcting 3D human poses with natural language, ICCV 2023
- PoseBERT: a generic transformer module for temporal 3D human modeling, TPAMI, December 2022
- PoseScript: 3D human poses from natural language, ECCV 2022
- Multi-finger grasping like humans, IROS 2022
- PoseGPT: quantization-based 3D human motion generation and forecasting, ECCV 2022
- MonoNHR: Monocular Neural Human Renderer, 3DV 2022
- Leveraging MoCap data for human mesh recovery, 3DV 2021
- Multi-FinGAN: generative coarse-to-fine sampling of multi-finger grasps, ICRA 2021
- Mimetics: towards understanding human action out of context, IJCV 2021
- SMPLy benchmarking 3D human pose estimation in the wild, 3DV 2020
- DOPE: distillation of part experts for whole-body 3D pose estimation in the wild, ECCV 2020
- Measuring generalisation to unseen viewpoints, articulations, shapes and objects for 3D hand pose estimation under hand-object interaction, ECCV 2020 (based on the HANDS 2019 challenge)
- GanHand: predicting human grasp affordance in multi-object scenes, CVPR 2020
- Moulding humans: non-parametric 3D human shape estimation from single images, ICCV 2019