SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

NeurIPS 2024


Hang Yin1*   Xiuwei Xu1*†   Zhenyu Wu2   Jie Zhou1   Jiwen Lu1‡

1Tsinghua University  2Beijing University of Posts and Telecommunications


Paper (arXiv)  |  Code (GitHub)  |  Chinese-language explainer (Zhihu)


Abstract


In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks enough scene context for in-depth reasoning. To better preserve information about the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms in an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt that helps the LLM reason about the goal location from the scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism that gives the object navigation framework the ability to correct perception errors. We conduct extensive experiments on the MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while its decision process remains explainable. To the best of our knowledge, SG-Nav is the first zero-shot method to achieve higher performance than supervised object navigation methods on the challenging MP3D benchmark.
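To make the idea of an "LLM-friendly" hierarchical graph concrete, here is a minimal Python sketch of a scene graph with object, group and room nodes that can be flattened into prompt text. It is an illustration only, not the authors' implementation: the `Node` and `SceneGraph` classes, the `to_prompt` method, the relation names and the text layout are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    level: str   # assumed levels: "object", "group" or "room"
    label: str


@dataclass
class SceneGraph:
    """Hypothetical container; not the data structure used in SG-Nav."""
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, relation, dst_id)

    def add_node(self, node_id, level, label):
        self.nodes[node_id] = Node(node_id, level, label)

    def add_edge(self, src, relation, dst):
        # Refuse edges whose endpoints have not been observed yet.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before connecting them")
        self.edges.append((src, relation, dst))

    def to_prompt(self):
        """Flatten the graph into text: node lists per level, then edges."""
        lines = []
        for level in ("room", "group", "object"):
            labels = sorted(n.label for n in self.nodes.values() if n.level == level)
            if labels:
                lines.append(f"{level}s: " + ", ".join(labels))
        for src, rel, dst in self.edges:
            lines.append(f"{self.nodes[src].label} --{rel}--> {self.nodes[dst].label}")
        return "\n".join(lines)


graph = SceneGraph()
graph.add_node("r0", "room", "living room")
graph.add_node("g0", "group", "sofa area")
graph.add_node("o0", "object", "sofa")
graph.add_node("o1", "object", "coffee table")
graph.add_edge("r0", "contains", "g0")
graph.add_edge("g0", "contains", "o0")
graph.add_edge("g0", "contains", "o1")
graph.add_edge("o1", "in front of", "o0")
prompt_text = graph.to_prompt()
print(prompt_text)
```

Listing nodes from room down to object before the edges is one simple way to expose the hierarchy to a text-only model; the prompt format actually used by SG-Nav may differ.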


Approach


Overall framework of our approach. We construct a hierarchical 3D scene graph and an occupancy map online. At each step, we divide the scene graph into several subgraphs, each of which is given to the LLM with a hierarchical chain-of-thought prompt for structural understanding of the scene context. We interpolate the probability score of each subgraph to the frontiers and select the frontier with the highest score for exploration. This decision is also explainable, since it can be summarized from the reasoning process of the LLM. With the scene graph representation, we further design a re-perception mechanism, which helps the agent abandon false-positive goal objects through continuous credibility judgement.
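The frontier selection step can be sketched as follows. The text above does not specify the interpolation rule, so this example assumes a normalized inverse-distance weighting of subgraph scores on a 2D map; the function names, the `eps` parameter and the use of subgraph centers are hypothetical choices for illustration, not the formula used in SG-Nav.

```python
import math


def score_frontiers(frontiers, subgraphs, eps=1e-6):
    """Interpolate subgraph scores onto frontiers (assumed rule).

    frontiers: list of (x, y) frontier positions on the occupancy map.
    subgraphs: list of ((x, y), prob) pairs, where (x, y) is an assumed
               subgraph center and prob is the LLM-assigned score.
    """
    scores = []
    for fx, fy in frontiers:
        weighted_sum = 0.0
        weight_total = 0.0
        for (sx, sy), prob in subgraphs:
            # Closer subgraphs contribute more; eps avoids division by zero.
            w = 1.0 / (math.hypot(fx - sx, fy - sy) + eps)
            weighted_sum += w * prob
            weight_total += w
        scores.append(weighted_sum / weight_total if weight_total > 0 else 0.0)
    return scores


def select_frontier(frontiers, subgraphs):
    """Return the frontier with the highest interpolated score."""
    if not frontiers:
        raise ValueError("no frontiers left to explore")
    scores = score_frontiers(frontiers, subgraphs)
    best = max(range(len(frontiers)), key=scores.__getitem__)
    return frontiers[best], scores[best]


frontiers = [(0.0, 0.0), (10.0, 0.0)]
subgraphs = [((1.0, 0.0), 0.9), ((9.0, 0.0), 0.2)]
best_frontier, best_score = select_frontier(frontiers, subgraphs)
print(best_frontier, round(best_score, 2))
```

In this toy setup the frontier next to the high-scoring subgraph wins, which matches the intent of steering exploration toward regions the LLM considers likely to contain the goal.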


Experiments


We evaluate our method on MP3D, HM3D and RoboTHOR.


Object-goal navigation results on MP3D, HM3D and RoboTHOR. We compare the Success Rate (SR) and Success weighted by Path Length (SPL) of state-of-the-art methods in different settings.
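For readers unfamiliar with the two metrics, the sketch below computes them under their standard definitions: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The episode dictionaries and their keys are hypothetical, the numbers are made up for illustration and are not results from the paper, and shortest-path lengths are assumed to be positive.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            # Ratio is 1 for an optimal path and shrinks for longer ones.
            total += e["shortest"] / max(e["path"], e["shortest"])
    return total / len(episodes)


# Toy episodes (illustrative values only); lengths are in meters.
episodes = [
    {"success": True, "shortest": 5.0, "path": 10.0},
    {"success": False, "shortest": 4.0, "path": 12.0},
    {"success": True, "shortest": 6.0, "path": 6.0},
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
print(round(sr_value, 3), round(spl_value, 3))
```

Because failed episodes contribute zero and inefficient paths are discounted, SPL can never exceed SR on the same set of episodes.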


Left: Per-category Success Rate on MP3D. Right: Time cost of connecting n edges during online 3D scene graph construction.

Bibtex


@article{yin2024sgnav,
  title={SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation},
  author={Hang Yin and Xiuwei Xu and Zhenyu Wu and Jie Zhou and Jiwen Lu},
  journal={arXiv preprint arXiv:2410.08189},
  year={2024}
}


© Hang Yin | Last update: Oct. 8, 2024