Panoptic segmentation of 3D scenes, which consists of isolating object instances in a dense 3D reconstruction of a scene, is challenging given only unposed images. Existing approaches typically extract 2D panoptic segmentations for each image using an off-the-shelf model, before optimizing an implicit geometric representation (often NeRF-based) that integrates and fuses the 2D panoptic constraints. Not only does this require camera parameters and costly test-time optimization for each scene, but we also argue that performing panoptic segmentation in 2D, when the problem at hand is fundamentally 3D and multi-view, is likely suboptimal. In this work, we instead propose a simple, unified and integrated approach. Our novel network, named PanSt3R, jointly predicts the 3D geometry and panoptic segmentation in a single forward pass, without any test-time optimization. PanSt3R builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, which we endow with semantic knowledge and panoptic segmentation capabilities. We additionally revisit the standard post-processing mask-merging procedure and introduce a more principled approach. Overall, the proposed PanSt3R is simple, fast and scalable. We conduct extensive experiments on multiple benchmarks and show that our method yields state-of-the-art results while being orders of magnitude faster.