Human-centric computer vision

Learning visual models that can understand and predict human behaviour from images or videos.


Research in human understanding, modelling and analysis in natural scenes is about learning data-efficient visual models that understand and predict human behaviour for, in our case, safer human-robot interaction (HRI) and learning from demonstration. We focus on 2D/3D human pose estimation (estimating the articulated body joints), 3D shape reconstruction, and activity recognition and prediction in natural scenes, with the associated challenges of large variations in visual appearance (due to camera viewpoint, lighting conditions, clothing, morphology, age/gender…) and the difficulty of producing and annotating the required data.

We work on low-level models of the 3D shape and pose of the body and its parts, as well as models of higher-level activities and interactions, for example with objects, other humans or robotic agents. We work with visual data such as single images, monocular sequences of frames or live video streams. The research can be roughly divided along the four lines below, although a number of models and algorithms overlap (perception/understanding or understanding/control).


In the field of perception we’ve worked on representing the human body, first as a few 3D keypoints with DOPE back in 2020, then as a denser 3D shape representation with PoseBERT. PoseBERT is a generic transformer module that can be plugged into any video-based model, with inputs that range from 3D skeleton keypoints to the rotations of a 3D parametric model, either the SMPL full body or even just the hands. More recent work has added finer-grained detail, including hair and clothing, using a multi-camera framework, and we should soon be able to obtain this from a single camera. These different stages are illustrated in Video 1 below.
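To give a feel for the kind of interface such a plug-in temporal module exposes, here is a minimal, purely illustrative sketch. The function name `fill_masked_frames` and the weighting scheme are our own inventions for this example: the real PoseBERT is a learned transformer, whereas this toy simply fills masked frames with a distance-weighted average of the observed ones.

```python
import math

def fill_masked_frames(poses, tau=2.0):
    """Toy stand-in for a PoseBERT-style temporal module: given a
    sequence of per-frame pose vectors where some frames are missing
    (None), predict each missing frame as a weight-normalised average
    of the observed frames, with weights decaying with temporal
    distance. (The real model is a trained transformer; this only
    mimics the input/output interface.)"""
    observed = [(t, p) for t, p in enumerate(poses) if p is not None]
    out = []
    for t, p in enumerate(poses):
        if p is not None:
            out.append(p)
            continue
        # attention-like weights: closer observed frames count more
        w = [math.exp(-abs(t - s) / tau) for s, _ in observed]
        z = sum(w)
        dim = len(observed[0][1])
        out.append([sum(wi * q[d] for wi, (_, q) in zip(w, observed)) / z
                    for d in range(dim)])
    return out

# a 5-frame sequence of 2-D "poses" with frame 2 masked out
seq = [[0.0, 0.0], [1.0, 1.0], None, [3.0, 3.0], [4.0, 4.0]]
filled = fill_masked_frames(seq)
```

The same masked-frame idea scales from 2-D toy vectors to full skeleton keypoints or SMPL rotation parameters; only the per-frame vector changes.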

Our Multi-HMR model goes a step further by detecting humans in camera space. It also extracts visual cues, such as the position of the eyes or fingers, to estimate expressive meshes for multiple humans, captured from a single image and processed in a single pipeline. You can try a demo and access the code here! We will soon be adding hair and clothing to Multi-HMR.

As part of this research in perception we created a 4D human motion dataset in collaboration with our partner Inria using their Kinovis platform. This dataset, called 4D Human Outfits, has been released to the broader research community.

Video 1: On the far left, the original input; second from left, the 3D keypoints produced by PoseBERT; second from right, the dense full-body 3D representation; and on the far right, a dense 3D representation with more detail such as clothes and hair.


By accurately analysing a person’s position and gestures, given sufficient training data, we can predict what they will do next. This action prediction is of particular interest to us in the field of HRI and autonomous robots. Video 2 illustrates the PoseGPT algorithm, inspired by GPT, which compresses human motion into quantized latent sequences. Based on the input on the far left, PoseGPT makes a number of predictions which, ordered from left to right after the 3D keypoint output, are: ‘person could carry on stretching’, ‘person could jump’ or ‘person could move away’. Purposer further extends this approach by adding environmental and contextual constraints. Related work in understanding includes PoseBERT (mentioned above), Mimetics (action recognition out of context) and GanHand (predicting how a robot will pick up objects), as well as PoseFix and PoseScript, which are described in the Control section below.

Video 2: PoseGPT makes predictions on the ‘next move’ of the human based on the input on the far left.
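The two ingredients of this approach, quantizing continuous motion into discrete tokens and then predicting the next token, can be sketched in miniature. Everything here is a hypothetical toy: the hand-made `codebook` stands in for PoseGPT’s learned quantizer, and bigram counting stands in for its GPT-style autoregressive transformer.

```python
def quantize(pose, codebook):
    """VQ step: map a continuous pose vector to the index of its
    nearest codebook entry, turning motion into discrete tokens.
    (PoseGPT learns its codebook; this one is hand-made.)"""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(pose, codebook[i]))

def next_token_probs(token_seq, vocab_size):
    """Toy autoregressive predictor: bigram counts over the token
    sequence stand in for the GPT-style transformer."""
    last = token_seq[-1]
    counts = [0] * vocab_size
    for a, b in zip(token_seq, token_seq[1:]):
        if a == last:
            counts[b] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# 2-D "poses" and a 3-entry codebook
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
motion = [[0.1, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 0.1], [1.1, -0.1]]
tokens = [quantize(p, codebook) for p in motion]
probs = next_token_probs(tokens, len(codebook))
```

Sampling from `probs` instead of taking the argmax is what lets a model of this kind propose several plausible futures, as in the stretching/jumping/moving-away predictions above.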


Vision-based control is of interest for many applications in VR/AR, but also in teaching robots. PoseScript uses a natural-language text description to generate body poses from a learned model. In Video 3 below you can see that, if we don’t give many details, the model generates many different body poses corresponding to the description; as more details about the arms and the legs are given, the generated poses converge because fewer poses satisfy the description. Applications of natural-language descriptions include retrieval of poses from large datasets and synthetic pose generation for AR/VR and training. PoseScript also comes with a dataset, code and even a demo!
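The narrowing effect of a richer description is easy to see even without a learned model. The following sketch is an assumption-laden simplification: the real PoseScript embeds text and poses in a joint space, whereas this toy `retrieve` function just does keyword matching over a made-up dataset, which is enough to show why more detail means fewer candidates.

```python
def retrieve(query, dataset):
    """Toy text-to-pose retrieval: keep the poses whose description
    contains every word of the query. (PoseScript uses learned joint
    text/pose embeddings, not keyword matching.)"""
    words = query.lower().split()
    return [name for name, desc in dataset
            if all(w in desc.lower().split() for w in words)]

# hypothetical miniature pose dataset
dataset = [
    ("pose_a", "standing with both arms raised"),
    ("pose_b", "standing with left arm raised"),
    ("pose_c", "sitting with both arms raised"),
]
vague = retrieve("arms raised", dataset)            # several matches
precise = retrieve("standing arms raised", dataset)  # fewer matches
```

Each added constraint word removes candidates, which mirrors the behaviour seen in Video 3 as the text box is made more precise.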

PoseFix extends PoseScript to correct a 3D human pose using natural language. We show the potential of this dataset on text-based pose editing: generating a corrected 3D body pose given a query pose and a text modifier. For correctional text generation, the instructions are generated from the differences between two body poses.

Video 3: As the text is edited in the text box to become more precise, the number of poses generated by PoseScript is reduced.
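The correctional-text direction can be illustrated with a deliberately crude sketch. The function name, joint format and threshold below are all hypothetical; the real PoseFix models are learned, while this toy only compares the vertical position of each joint between two poses and emits an instruction when the difference is large enough.

```python
def describe_correction(source, target, joints, thresh=0.1):
    """Toy correctional text generation: compare two poses joint by
    joint (here only the vertical coordinate) and emit an instruction
    for each joint that moved noticeably between source and target."""
    instructions = []
    for name in joints:
        dy = target[name][1] - source[name][1]
        if dy > thresh:
            instructions.append(f"raise the {name}")
        elif dy < -thresh:
            instructions.append(f"lower the {name}")
    return instructions

# hypothetical (x, y) joint positions for two poses
src = {"left hand": (0.3, 0.9), "right hand": (0.3, 1.4)}
tgt = {"left hand": (0.3, 1.5), "right hand": (0.3, 1.35)}
text = describe_correction(src, tgt, ["left hand", "right hand"])
```

Run the comparison in the other direction and the same machinery produces the inverse instruction, which is the pose-editing side of the task.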


We’ve developed the first algorithm that, given a single image of a person, can generate novel views of them. The Monocular Neural Human Renderer (MonoNHR) first removes the background (the parts that are not human) from the image, then moves the camera. In the example shown on the left in Video 4 below, we don’t move the camera far from its original location and the results are quite good. If we move the camera around the person, the algorithm invents what it doesn’t see, as shown on the right in Video 5; this is a difficult task since we only have a single image. For now the results are a bit blurry, but we’re working on improving them.

Video 4: Human rendering by MonoNHR from a single image and only slight camera movement.

Video 5: Human rendering by MonoNHR where the camera is moved around the human and the algorithm invents what it does not see from the original single image.
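The geometric half of "moving the camera" can be written down directly; the hard, learned half of MonoNHR is hallucinating the unseen texture, which no short snippet can reproduce. The sketch below is only the viewpoint change: the function `project`, the focal length and the camera distance are illustrative choices, not values from the paper.

```python
import math

def project(points, yaw, f=1.0, cam_dist=3.0):
    """Rotate a person-centred 3-D point set by `yaw` radians about
    the vertical axis (equivalently, move the camera around the
    person), then apply a pinhole projection to 2-D. This shows how a
    new viewpoint is obtained, not how unseen texture is inferred."""
    c, s = math.cos(yaw), math.sin(yaw)
    out = []
    for x, y, z in points:
        xr, zr = c * x + s * z, -s * x + c * z   # rotate the scene
        zc = zr + cam_dist                       # push in front of camera
        out.append((f * xr / zc, f * y / zc))    # pinhole projection
    return out

# two hypothetical body points: one on the rotation axis, one off it
body = [(0.0, 1.0, 0.0), (0.2, 0.0, 0.1)]
front = project(body, 0.0)
side = project(body, math.pi / 2)
```

Points on the rotation axis project to the same place from any yaw, while off-axis points shift, which is exactly the parallax a novel-view renderer has to fill in with invented appearance.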

Applications in robotics

There are many applications for this work, but our focus at NAVER LABS is on robotics. We use our research to help a robot navigate smoothly and safely in an environment full of people, to help it identify whether a person is paying attention to it, and even to predict whether they intend to interact with it.

Video 6: Multi-HMR on an autonomous delivery robot in the NAVER office building 1784. The model detects employees and how far away they are.

Recovering the detailed 3D shape of a hand interacting with objects can be used to teach a robot how to grasp and manipulate an object, where the robot needs to understand the contact points between the hand and the object in great detail. It can also be useful in human-robot collaboration, where a person hands an object over to a robot: the robot needs to detect the hand precisely to avoid hurting the person when taking the object.
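Once hand and object shapes are recovered, the contact test itself is plain geometry. The sketch below is a minimal, assumption-laden illustration (the function name, point lists and threshold are ours): real grasp pipelines reason over full meshes and surface normals, while this toy just flags hand vertices that lie within a distance threshold of some object vertex.

```python
def contact_points(hand_pts, obj_pts, thresh=0.01):
    """Toy contact detection: a hand vertex counts as 'in contact'
    when it lies within `thresh` of at least one object vertex.
    Real systems work on dense reconstructed meshes; this only
    illustrates the underlying geometric test."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    t2 = thresh ** 2  # compare squared distances, avoiding sqrt
    return [i for i, h in enumerate(hand_pts)
            if any(d2(h, o) <= t2 for o in obj_pts)]

# hypothetical hand vertices (metres) and a one-vertex "object"
hand = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (0.105, 0.0, 0.0)]
obj = [(0.1, 0.0, 0.0)]
touching = contact_points(hand, obj)
```

A map of which fingertip vertices are in contact is the kind of signal a grasp planner needs both for manipulation and for safe human-to-robot handover.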


Learn more about GanHand: estimating the pose of a hand to enable human-like robot manipulation

