Hang Yin1*
Xiuwei Xu1*†
Zhenyu Wu2
Jie Zhou1
Jiwen Lu1‡
1Tsinghua University 2Beijing University of Posts and Telecommunications
Paper (arXiv)
Code (GitHub)
Chinese explainer (Zhihu)
In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt the LLM with the text of spatially close objects, which lacks enough scene context for in-depth reasoning. To better preserve information about the environment and fully exploit the reasoning ability of the LLM, we propose to represent the observed scene with a 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms in an LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt that helps the LLM reason about the goal location according to scene context by traversing the nodes and edges. Moreover, benefiting from the scene graph representation, we further design a re-perception mechanism that gives the object navigation framework the ability to correct perception errors. We conduct extensive experiments in the MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while keeping the decision process explainable. To the best of our knowledge, SG-Nav is the first zero-shot method to achieve higher performance than supervised object navigation methods on the challenging MP3D benchmark.
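To make the idea of an LLM-friendly scene graph more concrete, the following is a minimal sketch of a graph with object, group and room nodes that is serialized into a text prompt. The class names, the edge relations and the prompt wording are all hypothetical and chosen for illustration; they are not taken from the SG-Nav code release.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    level: str  # one of "object", "group", "room"


@dataclass
class SceneGraph:
    """Toy hierarchical scene graph (illustrative, not the authors' data structure)."""
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, name, level):
        self.nodes[name] = Node(name, level)

    def add_edge(self, source, relation, target):
        # Both endpoints must already exist so the prompt never mentions unknown nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before connecting them")
        self.edges.append((source, relation, target))

    def to_prompt(self, goal):
        """Serialize the graph level by level, from rooms down to objects."""
        lines = [f"Goal object: {goal}"]
        for level in ("room", "group", "object"):
            names = sorted(n.name for n in self.nodes.values() if n.level == level)
            lines.append(f"{level.capitalize()} nodes: {', '.join(names)}")
        lines.append("Edges:")
        lines.extend(f"- {s} {r} {t}" for s, r, t in self.edges)
        # Hypothetical question; the real prompt design is described in the paper.
        lines.append("Question: how likely is the goal to be near these nodes?")
        return "\n".join(lines)


graph = SceneGraph()
graph.add_node("living room", "room")
graph.add_node("seating area", "group")
graph.add_node("sofa", "object")
graph.add_node("coffee table", "object")
graph.add_edge("seating area", "is in", "living room")
graph.add_edge("sofa", "belongs to", "seating area")
graph.add_edge("coffee table", "is next to", "sofa")
prompt = graph.to_prompt("tv")
```

Listing rooms before groups and objects mirrors the coarse-to-fine order in which a hierarchical chain-of-thought prompt would traverse the graph.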
Overall framework of our approach. We construct a hierarchical 3D scene graph and an occupancy map online. At each step, we divide the scene graph into several subgraphs, each of which is passed to the LLM with a hierarchical chain-of-thought prompt for structural understanding of the scene context. We interpolate the probability score of each subgraph to the frontiers and select the frontier with the highest score for exploration. This decision is also explainable by summarizing the reasoning process of the LLM. With the scene graph representation, we further design a re-perception mechanism, which helps the agent give up false-positive goal objects through continuous credibility judgement.
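The two decision steps in this caption can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes inverse-distance weighting as the interpolation scheme, and it assumes that re-perception accepts a detected goal once accumulated credibility passes a threshold within a fixed number of observations. The function names, the threshold and the step limit are hypothetical.

```python
import math


def score_frontiers(frontiers, subgraphs, eps=1e-6):
    """Spread subgraph scores onto frontiers.

    frontiers: list of (x, y) positions.
    subgraphs: list of ((x, y), prob), where prob is the LLM-assigned score.
    Assumption: each subgraph contributes prob / distance to each frontier.
    """
    scores = []
    for fx, fy in frontiers:
        total = 0.0
        for (sx, sy), prob in subgraphs:
            dist = math.hypot(fx - sx, fy - sy)
            total += prob / (dist + eps)  # eps avoids division by zero
        scores.append(total)
    return scores


def select_frontier(frontiers, subgraphs):
    """Return the frontier with the highest interpolated score, and that score."""
    scores = score_frontiers(frontiers, subgraphs)
    best = max(range(len(frontiers)), key=scores.__getitem__)
    return frontiers[best], scores[best]


def verify_goal(credibilities, accept_threshold=2.0, max_steps=5):
    """Accept a detected goal only if credibility accumulates fast enough.

    credibilities: per-observation credibility values for the candidate goal.
    Returns True to keep the goal, False to give it up as a false positive.
    Threshold and step limit are made-up values for illustration.
    """
    total = 0.0
    for step, value in enumerate(credibilities, start=1):
        total += value
        if total >= accept_threshold:
            return True
        if step >= max_steps:
            break
    return False


frontiers = [(0.0, 0.0), (10.0, 0.0)]
subgraphs = [((1.0, 0.0), 0.9), ((9.0, 0.0), 0.1)]
best_frontier, best_score = select_frontier(frontiers, subgraphs)
```

In this toy setup the frontier next to the high-probability subgraph wins, and a candidate goal that is only weakly confirmed over several observations is abandoned rather than approached.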
We evaluate our method on MP3D, HM3D and RoboTHOR.
Object-goal navigation results on MP3D, HM3D and RoboTHOR. We compare the Success Rate (SR) and Success weighted by Path Length (SPL) of state-of-the-art methods in different settings.
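For readers unfamiliar with the two metrics, the sketch below computes SR and SPL from a list of episodes using the standard definitions: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator times the ratio of the shortest path length to the longer of the taken and shortest path lengths. The dictionary keys and the sample episodes are made up for illustration and are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each episode needs: success (bool), shortest (shortest path length, > 0)
    and taken (length of the path the agent actually followed).
    Failed episodes contribute zero.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Made-up episodes, for illustration only.
episodes = [
    {"success": True, "shortest": 5.0, "taken": 10.0},   # contributes 0.5
    {"success": True, "shortest": 4.0, "taken": 4.0},    # contributes 1.0
    {"success": False, "shortest": 6.0, "taken": 20.0},  # contributes 0.0
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Because SPL penalizes detours, it can never exceed SR on the same set of episodes.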
Left: Per-category Success Rate on MP3D. Right: Time cost of connecting n edges during online 3D scene graph construction.