Sifei Liu

I am a Principal Research Scientist and Tech Lead at NVIDIA Research, where I work with the LPR team led by Jan Kautz. My current research focuses on embodied foundation models, efficient transformer architectures, and spatial reasoning.

Before joining NVIDIA, I earned my Ph.D. at the VLLAB at UC Merced, advised by Ming-Hsuan Yang. I have been fortunate to receive the Baidu Graduate Fellowship, the NVIDIA Pioneering Research Award, and recognition as a Rising Star in EECS.

Current Focus

Building grounded multimodal systems that can reason, scale, and act in open-world settings.

Embodied foundation models · Efficient transformers · Spatial reasoning

Research Themes

What I work on

I work on making multimodal systems more grounded, more efficient, and more capable in open-world environments.

Embodied VLM

Perception for agents and robotics

Building multimodal systems that can perceive, reason, and act in 3D environments for navigation and embodied decision making.

Efficient Models

Transformer efficiency at high resolution

Designing token-efficient architectures and attention mechanisms that preserve detail without paying the full compute cost.

Spatial Intelligence

Grounded multimodal understanding

Connecting images, language, and geometry so models can reason about structure, localization, and relationships across views.

News

  • Jan 2026
    We released SR-3D, a 3D-aware region-prompted VLM for grounded spatial reasoning across views and scenes.
  • Oct 2025
    Our ICCV 2025 paper Describe Anything introduces DAM for detailed localized image and video captioning.
  • Oct 2025
    Our ICCV 2025 paper Token-Efficient VLM presents an efficient VLM for high-resolution visual understanding.
  • Mar 2025
SpatialRGPT was demoed at GTC 2025 as part of Agentic AI for Physical Operations!
  • Feb 2025
We released GSPN, a fast vision attention module that accelerates Stable Diffusion inference by 84x. Stay tuned for more details!
  • Feb 2025
Five papers were accepted to CVPR 2025! Stay tuned for more updates!
  • Jan 2025
We released NaVILA, a navigation agent that follows language instructions to navigate 3D environments.
  • Dec 2024
    We presented CosAE at NeurIPS 2024! Stay tuned for code release.
  • Oct 2024
We released the SpatialRGPT code, datasets, and models! Try out the demos!

Selected Publications

3D Aware Region Prompted Vision Language Model
ICLR 2026

SR-3D unifies single-view 2D and multi-view 3D representations for flexible region prompting and grounded spatial reasoning.

Describe Anything: Detailed Localized Image and Video Captioning
ICCV 2025

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui

DAM generates detailed localized captions for user-specified regions in images and videos, preserving both local detail and global context.

Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal
ICCV 2025

Yitong Jiang, Jingwei Gu, Tianfan Xue, Ka Chun Cheung, Pavlo Molchanov, Hongxu (Danny) Yin, Sifei Liu

TEVA improves high-resolution image understanding by dynamically selecting detail-rich regions while keeping token usage efficient.

Parallel Sequence Modeling via Generalized Spatial Propagation Network
CVPR 2025

GSPN is a fast vision attention module that accelerates generic vision foundation models for high-resolution input images.

NaVILA: Legged Robot Vision-Language-Action Model for Navigation
arXiv 2025

NaVILA is a two-level framework that combines VLAs with locomotion skills for navigation. It generates high-level language-based commands, while a real-time locomotion policy ensures obstacle avoidance.

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
NeurIPS 2024

SpatialRGPT augments vision language models with region-level prompts and depth cues, enabling grounded reasoning about both qualitative and metric spatial relationships in images.

COLMAP-Free 3D Gaussian Splatting
CVPR 2024

Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang

Novel-view synthesis with 3D Gaussian Splatting trained directly from video, without precomputed COLMAP camera poses.

TUVF: Learning Generalizable Texture UV Radiance Fields
arXiv 2023

TUVF learns textures in a canonical UV space rather than on a specific 3D shape, enabling disentangled and transferable texture synthesis across instances of a category.

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
CVPR 2023

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.