GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language-Navigation

Xinhai Li, Xiaotao Zhang, Yuehao Huang, Jiankun Dong, Tianhang Wang, Sunyao Zhou, Yunzi Wu, Chenguo Sun, Yunfei Ge, Qizhen Weng, Chi Zhang, Chenjia Bai^✉, Xuelong Li^✉

January, 2025

Image credit: Chenjia Bai

Abstract

Embodied navigation connects intelligent agents with the physical world and serves as a fundamental capability for achieving general robotic intelligence. However, the availability and quality of navigation data have long limited the generalization ability of Vision-and-Language Navigation (VLN) systems and their capacity to handle long-horizon tasks. To bridge this gap, we curate diverse 3D scene resources and develop an automated pipeline for large-scale navigation data generation, resulting in the GN-Matrix dataset. Building upon a 3D Gaussian Splatting (3DGS) rendering engine, we further introduce a high-fidelity simulation platform that supports interactive roaming and collision-aware navigation. Furthermore, we propose GN-Bench, the first benchmark to provide BEV-based evaluation, incorporating dynamic 3DGS avatars to establish new standards for human-robot interaction evaluation. To fully leverage the interactive nature of the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states and recovery behaviors beyond idealized offline demonstrations, breaking the narrow expert-centric distribution inherited from supervision and providing a better starting point for downstream RL exploration. This unified VLN paradigm seamlessly integrates both map-based and map-free settings, enabling a broad spectrum of tasks including instruction following, human following, and goal navigation. To the best of our knowledge, this work is the first to formalize high-fidelity 3DGS-rendered Bird’s-Eye View (BEV) representations as a compact and efficient memory mechanism, thereby unlocking the latent spatial reasoning capabilities of Vision-and-Language Models (VLMs). The dual-system architecture of GN0 ensures both flexibility and high-efficiency deployment in real-world scenarios.
Extensive evaluations on GN-Bench, encompassing both quantitative and qualitative analyses, demonstrate that the proposed GN0 significantly outperforms existing state-of-the-art VLN methods. Overall, this work, GN-Matrix, presents a unified framework spanning data, simulation, and learning, offering new insights for advancing embodied navigation in both academic research and industrial applications.

Type

Conference

Publication

Under Review. 2026