Embodied VLM
Perception for agents and robotics
I am a Principal Research Scientist and Tech Lead at NVIDIA Research, where I work with the LPR team led by Jan Kautz. My current research focuses on embodied foundation models, efficient transformer architectures, and spatial reasoning.
Before joining NVIDIA, I earned my Ph.D. in the VLLAB at UC Merced, advised by Ming-Hsuan Yang. I have been fortunate to receive the Baidu Graduate Fellowship, the NVIDIA Pioneering Research Award, and the EECS Rising Stars recognition.
Current Focus
Research Themes
I work on making multimodal systems more grounded, more efficient, and more capable in open-world environments.
Embodied VLM
Building multimodal systems that can perceive, reason, and act in 3D environments for navigation and embodied decision making.
Efficient Models
Designing token-efficient architectures and attention mechanisms that preserve detail without paying the full compute cost.
Spatial Intelligence
Connecting images, language, and geometry so models can reason about structure, localization, and relationships across views.
SR-3D unifies single-view 2D and multi-view 3D representations for flexible region prompting and grounded spatial reasoning.
DAM generates detailed localized captions for user-specified regions in images and videos, preserving both local detail and global context.
TEVA improves high-resolution image understanding by dynamically selecting detail-rich regions while keeping token usage efficient.
GSPN is a fast vision attention module that accelerates generic vision foundation models for high-resolution input images.
NaVILA is a two-level framework that combines vision-language-action (VLA) models with locomotion skills for navigation. It generates high-level language commands, while a real-time locomotion policy handles obstacle avoidance.
Frontier VLMs with efficient training and inference.
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
SpatialRGPT is a grounded model for reasoning about spatial relationships in images.
3D Gaussian Splatting without COLMAP preprocessing.
TUVF is a method for learning generalizable texture UV radiance fields.
ODISE (Open-vocabulary DIffusion-based panoptic SEgmentation) unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.