Talk at IMAGINE Seminar: Language-Grounded Understanding of 3D Shapes via Foundation Models

I gave an invited talk at the IMAGINE seminar, LIGM, École des Ponts ParisTech. [slides]

Abstract

Understanding 3D shapes at a fine-grained, part level has traditionally required expensive manual annotations and category-specific training, limiting scalability to new object types. In this talk, I will present two complementary approaches that leverage the rich knowledge embedded in Multi-Modal Large Language Models (MLLMs) to achieve localized 3D understanding without 3D-specific supervision.

First, I will introduce ZeroKey, a zero-shot method for 3D keypoint detection that demonstrates, for the first time, that pixel-level annotations from recent MLLMs can be exploited to both extract and semantically name salient keypoints on 3D models, without any ground-truth labels. By reasoning at the point level through language grounding, ZeroKey achieves performance competitive with fully supervised methods despite requiring no 3D keypoint annotations during training.
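To make the general recipe concrete, here is a minimal sketch of the kind of pipeline this idea suggests, not ZeroKey's actual implementation: render the shape from a few views, ask an MLLM to point at and name salient keypoints in pixel space, and lift those pixels back onto the 3D surface. The function names (`query_mllm`, `backproject`), the camera conventions, and the prompt are illustrative assumptions.

```python
# Hedged sketch: MLLM-driven zero-shot 3D keypoint extraction via back-projection.
# `query_mllm` is a hypothetical stand-in for any MLLM that returns named pixel
# coordinates; rendering and aggregation details are omitted.

import numpy as np

def query_mllm(image: np.ndarray, prompt: str) -> list[dict]:
    """Placeholder for an MLLM call returning [{'name': str, 'uv': (u, v)}, ...]."""
    raise NotImplementedError("plug in an MLLM of your choice here")

def backproject(uv, depth_map, K, cam_to_world):
    """Lift a pixel (u, v) with known depth into world coordinates (pinhole model)."""
    u, v = uv
    z = depth_map[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (cam_to_world @ p_cam)[:3]

def zero_shot_keypoints(views):
    """views: list of dicts with 'image', 'depth', 'K', 'cam_to_world' per rendering."""
    keypoints = []
    for view in views:
        for kp in query_mllm(view["image"], "Mark and name the salient keypoints."):
            p3d = backproject(kp["uv"], view["depth"], view["K"], view["cam_to_world"])
            keypoints.append({"name": kp["name"], "xyz": p3d})
    return keypoints  # cross-view aggregation / deduplication omitted
```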

Second, I will present PatchAlign3D, an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Unlike prior multi-view pipelines that require expensive inference over multiple renderings and heavy prompt engineering, PatchAlign3D operates in a single feed-forward pass. Through a two-stage pre-training approach—first distilling dense 2D features from DINOv2 into 3D patches, then aligning these embeddings with part-level text descriptions via contrastive learning—the model achieves state-of-the-art zero-shot 3D part segmentation across multiple benchmarks.
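The two pre-training objectives can be sketched as follows; the shapes, names, and the symmetric InfoNCE formulation here are my assumptions for illustration, not the exact losses used by PatchAlign3D.

```python
# Hedged sketch of the two-stage objective: (1) distill dense 2D DINOv2 features
# into 3D patch features, (2) contrastively align patch embeddings with
# part-level text embeddings, CLIP-style.

import torch
import torch.nn.functional as F

def distillation_loss(patch_feats_3d: torch.Tensor,
                      dino_feats_2d: torch.Tensor) -> torch.Tensor:
    """Stage 1: pull each 3D patch feature toward the 2D DINOv2 feature of the
    image region it projects to. Both tensors: (num_patches, dim)."""
    return 1.0 - F.cosine_similarity(patch_feats_3d, dino_feats_2d, dim=-1).mean()

def patch_text_contrastive_loss(patch_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Stage 2: symmetric contrastive loss between patch embeddings and their
    paired part-level text embeddings. Both tensors: (batch, dim)."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = patch_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```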

Together, these works open new avenues for cross-modal learning and demonstrate the surprising effectiveness of foundation models in bridging the gap between 2D vision-language understanding and localized 3D shape reasoning.