Vision foundation models for human understanding

Understanding humans in the real world is a core challenge in computer vision. It requires models that can interpret pose, motion, intention, social cues and interactions across diverse environments and demographics, all under the uncertainty of real-world conditions. Yet most existing approaches (such as YOLO or ViTPose) are designed for narrow, task-specific objectives: they struggle to generalize and require retraining when conditions change.
We’re developing a foundation model for human understanding that provides a unified 3D representation of people and supports perception, localization and higher-level reasoning in open-world settings. By learning a comprehensive and adaptable model of human form and behavior, this research aims to advance general-purpose visual understanding and provide a robust basis for downstream applications, in particular the deployment of versatile robot assistants.
Anny: a foundational representation of the human body
At the core of our foundation model for human understanding lies a principled representation of the human body: Anny, a new parametric body model (1). Rather than being learned from large-scale 3D scans, which are difficult to acquire, Anny is built on anthropometric knowledge, making it interpretable, controllable and consistent across ages, from children to adults. It is open-source under the Apache 2.0 license and designed to serve as an interpretable, extensible standard for human representation. By providing a structured, semantically meaningful parameterization of body shape and pose, Anny offers a stable foundation upon which additional capabilities can be built. This explicit representation allows robots not only to perceive humans but to reason about them, supporting downstream tasks such as navigation in crowded spaces, interaction planning, safety-aware motion and multi-agent orchestration, while maintaining a shared, unified understanding of the human body across applications.
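To make the idea of a "structured, semantically meaningful parameterization" concrete, here is a minimal sketch of how a parametric body model maps interpretable coefficients to a 3D mesh via shape blendshapes and linear blend skinning. All names, array sizes and the skinning layout are illustrative assumptions for this example, not Anny's actual implementation.

```python
# Toy parametric body model: interpretable parameters -> 3D vertices.
# Dimensions and assets are random placeholders, NOT Anny's real data.
import numpy as np

rng = np.random.default_rng(0)
N_VERTS, N_SHAPE, N_JOINTS = 100, 10, 4            # toy sizes

template = rng.normal(size=(N_VERTS, 3))           # mean body mesh
shape_dirs = rng.normal(size=(N_VERTS, 3, N_SHAPE))  # shape blendshapes
skin_weights = rng.random((N_VERTS, N_JOINTS))     # per-vertex joint weights
skin_weights /= skin_weights.sum(axis=1, keepdims=True)

def body_mesh(betas, joint_transforms):
    """betas: (N_SHAPE,) shape coefficients.
    joint_transforms: (N_JOINTS, 4, 4) rigid transforms from the pose."""
    # 1. Shape: linearly deform the template mesh.
    v = template + shape_dirs @ betas                        # (N_VERTS, 3)
    # 2. Pose: linear blend skinning over the joint transforms.
    v_h = np.concatenate([v, np.ones((N_VERTS, 1))], axis=1)  # homogeneous
    per_joint = np.einsum('jab,vb->vja', joint_transforms, v_h)
    return np.einsum('vj,vja->va', skin_weights, per_joint)[:, :3]

betas = np.zeros(N_SHAPE)                          # mean shape
rest_pose = np.tile(np.eye(4), (N_JOINTS, 1, 1))   # identity transforms
verts = body_mesh(betas, rest_pose)                # reproduces the template
```

Because every input has an explicit meaning (shape coefficients, joint transforms), downstream components can edit or reason about a person by manipulating a few parameters instead of raw geometry.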
Video 1: Anny, built on MakeHuman assets, models people of different gender, age and body type and supports multiple rigs and mesh resolutions.
Towards a unique feedforward model for 3D human understanding
Building on Anny as a structured and interpretable human representation, we’re developing feedforward models (Multi-HMR (2), CondiMen (3)) that predict multi-person 3D pose, shape and localization in a single stage. The goal is to move beyond fragmented pipelines and enable a fast, unified system that detects people, estimates their expressive body pose and shape, and places them in the camera coordinate system. A one-stage approach is critical for real-world robotics, where perception must be both reliable and real-time. By outputting structured 3D humans rather than sparse keypoints (ViTPose) or 2D detections (YOLO), the model delivers actionable information for downstream tasks. Such a representation supports robot navigation in dynamic environments, safe motion planning around people, human–robot interaction and assistant behaviors that require awareness of posture, distance, orientation and social context. In short, our models transform raw visual input into a grounded, semantically meaningful understanding of humans that robots can use to plan and adapt.
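The following sketch illustrates why a structured 3D output is more actionable than 2D boxes or keypoints: a robot can read off metric quantities directly. The `Human3D` fields and the bearing computation are assumptions made for this example, not the actual Multi-HMR output format.

```python
# Illustrative per-person output of a one-stage 3D human model, and how
# a robot might consume it. Field names are hypothetical, not the real API.
from dataclasses import dataclass
import math

@dataclass
class Human3D:
    shape: list          # parametric body-shape coefficients
    pose: list           # body pose parameters (e.g. joint rotations)
    translation: tuple   # (x, y, z) position in the camera frame, meters

def navigation_cues(person: Human3D):
    """Turn a structured 3D detection into actionable quantities."""
    x, y, z = person.translation
    distance = math.sqrt(x * x + y * y + z * z)   # metric range to the person
    bearing = math.degrees(math.atan2(x, z))      # angle off the camera axis
    return distance, bearing

person = Human3D(shape=[0.0] * 10, pose=[0.0] * 66, translation=(1.0, 0.0, 2.0))
dist, bearing = navigation_cues(person)
print(f"{dist:.2f} m at {bearing:.1f} deg")       # prints "2.24 m at 26.6 deg"
```

With 2D detections alone, neither the metric distance nor the person's placement in the camera frame would be available without extra pipeline stages.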
Video 2: Our Multi-Human Mesh Recovery (Multi-HMR) model running on an autonomous robot at NAVER LABS Europe. On the left, the input video; on the right, the same video with Multi-HMR applied. Multi-HMR detects humans, reconstructs their 3D meshes and enables online tracking in videos based on predicted per-human features.
Building the Anny ecosystem
Data is essential to making the foundation model for human understanding effective at scale. Our feedforward models predict structured Anny parameters, which requires large amounts of annotated 3D human data across ages, shapes, interactions and real-world conditions. To address this, we’re expanding the Anny ecosystem along two lines. The first is high-quality synthetic data, such as Anny-One, which provides photorealistic images with perfect ground truth. The second is reliable pseudo-ground truth on real-world images, obtained for example from the output of HAMSt3R (4), a method that generates dense 3D reconstructions of humans and scenes.
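One simple way to combine these two data sources during training is to trust synthetic labels fully while down-weighting pseudo-ground truth by an estimated confidence. The weighting scheme and threshold below are illustrative assumptions, not the actual training recipe.

```python
# Hedged sketch: mixing perfect synthetic labels with pseudo-ground truth.
# The confidence weighting and 0.5 cutoff are hypothetical choices.
def sample_weight(label):
    """label: dict with 'source' in {'synthetic', 'pseudo'} and, for
    pseudo-labels, a 'confidence' score in [0, 1]."""
    if label["source"] == "synthetic":
        return 1.0                       # perfect ground truth: full weight
    conf = label["confidence"]
    return conf if conf >= 0.5 else 0.0  # drop very unreliable pseudo-labels

batch = [
    {"source": "synthetic"},
    {"source": "pseudo", "confidence": 0.9},
    {"source": "pseudo", "confidence": 0.3},   # too unreliable: ignored
]
weights = [sample_weight(label) for label in batch]
print(weights)   # prints [1.0, 0.9, 0.0]
```

Schemes like this let the perfect synthetic supervision anchor training while noisy real-world labels contribute in proportion to their reliability.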
By combining principled representations (with Anny), efficient perception (our one-stage foundation models), and controlled data generation, we aim to create a scalable framework that continuously improves 3D human understanding in real-world conditions.

Human understanding for real-world robotics at scale
Our research on human-centric computer vision is connected to NAVER’s broader vision of service robotics and embodied AI deployed at scale. Whether in data centers, logistics environments, public spaces or future assistive applications, robots must operate around people in a safe and intelligent way. By combining a principled human representation (Anny), fast feedforward multi-person 3D perception and scalable data generation, we’re working on providing a unified perceptual backbone that can power a wide range of robotic systems. Our ambition is not only to advance 3D human understanding in computer vision, but to make it reliable, adaptable and deployable in the real world to bring robots closer to becoming helpful and human-aware assistants.
Related Publications (and links to code and data)
1: Human mesh modeling for Anny body, arXiv 2025
2: Multi-HMR: multi-person whole-body human mesh recovery in a single shot, ECCV 2024
3: CondiMen: conditional multi-person mesh recovery, CVPR Workshops 2025
4: HAMSt3R: Human-aware Multi-view stereo 3D Reconstruction, ICCV 2025
