Embodied VLM
Perception for agents and robotics
I am a Principal Research Scientist and Tech Lead at NVIDIA Research, where I work with the LPR team led by Jan Kautz. My current research focuses on embodied foundation models, efficient transformer architectures, and spatial reasoning.
Before joining NVIDIA, I earned my Ph.D. in the VLLAB at UC Merced, advised by Ming-Hsuan Yang. I have been fortunate to receive the Baidu Graduate Fellowship, the NVIDIA Pioneering Research Award, and the EECS Rising Stars recognition.
Current Focus
Research Themes
I work on making multimodal systems more grounded, more efficient, and more capable in open-world environments.
Embodied VLM
Building multimodal systems that can perceive, reason, and act in 3D environments for navigation and embodied decision making.
Efficient Models
Designing token-efficient architectures and attention mechanisms that preserve detail without paying the full compute cost.
Spatial Intelligence
Connecting images, language, and geometry so models can reason about structure, localization, and relationships across views.
SR-3D unifies single-view 2D and multi-view 3D representations for flexible region prompting and grounded spatial reasoning.
DAM generates detailed localized captions for user-specified regions in images and videos, preserving both local detail and global context.
TEVA improves high-resolution image understanding by dynamically selecting detail-rich regions while keeping token usage efficient.
GSPN is a fast vision attention module that accelerates generic vision foundation models for high-resolution input images.
NaVILA is a two-level framework that combines vision-language-action (VLA) models with locomotion skills for navigation. It generates high-level language commands, while a real-time locomotion policy handles obstacle avoidance.
Frontier VLMs with efficient training and inference.
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
SpatialRGPT is a grounded model for reasoning about spatial relationships in images.
3D Gaussian Splatting without COLMAP preprocessing.
TUVF is a method for learning generalizable texture UV radiance fields.
ODISE (Open-vocabulary DIffusion-based panoptic SEgmentation) unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.