
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning

Published 4 Jun 2025 in cs.CV | (2506.04034v1)

Abstract: Object referring aims to detect all objects in an image that match a given natural language description. We argue that a robust object referring model should be grounded, meaning its predictions should be both explainable and faithful to the visual content. Specifically, it should satisfy two key properties: 1) Verifiable, by producing interpretable reasoning that justifies its predictions and clearly links them to visual evidence; and 2) Trustworthy, by learning to abstain when no object in the image satisfies the given expression. However, most methods treat referring as a direct bounding box prediction task, offering limited interpretability and struggling to reject expressions with no matching object. In this work, we propose Rex-Thinker, a model that formulates object referring as an explicit CoT reasoning task. Given a referring expression, we first identify all candidate object instances corresponding to the referred object category. Rex-Thinker then performs step-by-step reasoning over each candidate to assess whether it matches the given expression, before making a final prediction. To support this paradigm, we construct a large-scale CoT-style referring dataset named HumanRef-CoT by prompting GPT-4o on the HumanRef dataset. Each reasoning trace follows a structured planning, action, and summarization format, enabling the model to learn decomposed, interpretable reasoning over object candidates. We then train Rex-Thinker in two stages: a cold-start supervised fine-tuning phase to teach the model how to perform structured reasoning, followed by GRPO-based reinforcement learning to improve accuracy and generalization. Experiments show that our approach outperforms standard baselines in both precision and interpretability on in-domain evaluation, while also demonstrating improved ability to reject hallucinated outputs and strong generalization in out-of-domain settings.

Summary

  • The paper introduces Rex-Thinker, a novel framework using Chain-of-Thought reasoning for grounded object referring.
  • It features a structured three-stage process—planning, action, and summarization—that boosts prediction accuracy and interpretability.
  • The model leverages a two-stage training method with supervised fine-tuning and reinforcement learning, achieving robust out-of-domain results.

Introduction

The research paper titled "Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning" (2506.04034) presents a novel framework, Rex-Thinker, that aims to address the limitations of existing object referring models by introducing grounded reasoning and interpretability. Object referring involves detecting all objects in an image that match a natural language description. Traditional approaches predict bounding boxes directly, offering limited interpretability and struggling to reject expressions with no matching object. Rex-Thinker tackles these issues by formulating object referring as a Chain-of-Thought (CoT) reasoning task, improving both the accuracy and reliability of predictions.

Methodology

Rex-Thinker begins by identifying all candidate instances of the object category named in the referring expression. The model then evaluates each candidate through a structured Chain-of-Thought reasoning process consisting of three stages: planning, action, and summarization (Figure 1).

Figure 1: An example of Rex-Thinker for object referring with CoT reasoning of planning (task decomposition), action (evaluating each candidate), and summarization (final decision). Each step is grounded in a specific hint box (as denoted in the left image), enabling interpretable predictions.

The planning stage involves decomposing the referring expression into subgoals. During the action stage, the model assesses each candidate object against these subgoals. Finally, the summarization stage aggregates intermediate decisions to produce the final prediction. This process is crafted to maximize interpretability by grounding each reasoning step within specific regions of the image (Figure 2).

Figure 2: Overview of the proposed CoT reasoning referring data engine. We prompt GPT-4o to generate a three-step CoT reasoning process, including planning, action, and summarization.
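As an illustration, the per-candidate planning / action / summarization loop could be sketched as follows. This is a deliberately simplified toy, not the model's implementation: the `Candidate` class, the attribute sets, and the subgoal matching are all hypothetical stand-ins for the visual reasoning the actual model performs.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    box: tuple       # (x1, y1, x2, y2) hint box for this candidate
    attributes: set  # mocked visual attributes; the real model reasons over pixels

def refer(candidates, subgoals):
    """Toy sketch of the three-stage CoT loop.

    planning:      the expression has already been decomposed into `subgoals`
    action:        check every candidate against every subgoal
    summarization: keep candidates satisfying all subgoals; an empty
                   result means the model should abstain (no match).
    """
    trace, matches = [], []
    for i, cand in enumerate(candidates):  # action stage, one step per candidate
        ok = all(g in cand.attributes for g in subgoals)
        trace.append(f"candidate {i}: {'match' if ok else 'no match'}")
        if ok:
            matches.append(cand.box)
    verdict = matches if matches else "abstain"  # summarization stage
    return verdict, trace

people = [
    Candidate((10, 20, 50, 120), {"person", "red shirt"}),
    Candidate((60, 25, 95, 118), {"person", "blue shirt"}),
]
result, trace = refer(people, {"person", "red shirt"})
# only the first hint box satisfies both subgoals
```

The abstain branch mirrors the "trustworthy" property from the abstract: when no candidate survives the action stage, the model refuses rather than hallucinating a box.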

Dataset Construction

To implement this framework, the researchers created a large-scale dataset named HumanRef-CoT using GPT-4o prompts on the HumanRef dataset. The dataset contains over 90,000 annotated samples structured in the CoT format, enabling explicit training for step-by-step reasoning over object candidates.
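A data engine of this kind needs two pieces: a prompt that asks for the three-section trace, and a filter that discards malformed generations. The sketch below shows one plausible shape for both; the prompt template and the validation rule are assumptions for illustration, not the paper's actual prompts.

```python
import re

REQUIRED_SECTIONS = ("planning", "action", "summarization")

def build_prompt(expression, hint_boxes):
    """Format one annotation into a prompt requesting a three-step CoT trace.
    (Hypothetical template, not the exact one used to build HumanRef-CoT.)"""
    boxes = "; ".join(f"box {i}: {b}" for i, b in enumerate(hint_boxes))
    return (
        f"Referring expression: {expression}\n"
        f"Candidate hint boxes: {boxes}\n"
        "Reply with three sections labelled planning:, action:, summarization:."
    )

def is_valid_trace(text):
    """Accept a generated trace only if all three sections appear, in order."""
    spans = [m.start()
             for s in REQUIRED_SECTIONS
             for m in [re.search(rf"^{s}:", text, re.M | re.I)] if m]
    return len(spans) == 3 and spans == sorted(spans)
```

Filtering on structure before training keeps the supervised traces consistent with the planning, action, summarization format the model is meant to imitate.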

Training Approach

The training of Rex-Thinker is divided into two stages. The first stage is cold-start supervised fine-tuning, where the model learns to perform structured reasoning in the defined CoT format. The second stage employs reinforcement learning via Group Relative Policy Optimization (GRPO), which enhances accuracy and generalization capabilities (Figure 3).

Figure 3: Overview of the Rex-Thinker architecture and our two-stage training methods.
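The defining feature of GRPO is that it scores each sampled response relative to the other responses in its group, rather than against a learned value function. A minimal sketch of that group-relative advantage computation (the full objective also includes a clipped probability ratio and a KL penalty, omitted here; the reward values are made up):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: each sampled answer in a
    group is normalised against the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# e.g. four rollouts for one referring query, each rewarded
# by box accuracy and adherence to the CoT output format
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centred within the group, above-average rollouts are reinforced and below-average ones suppressed without needing a separate critic network.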

Experimental Results

Rex-Thinker demonstrates superior performance over baseline models, achieving improved precision and interpretability in in-domain evaluations. Its ability to reject hallucinated outputs and to generalize effectively in out-of-domain settings reflects the practical value of adopting CoT reasoning for object referring tasks (Figure 4).

Figure 4: The out-of-domain result. We use Rex-Thinker-GRPO trained on HumanRef-CoT to infer an unseen category (i.e., fish), demonstrating strong generalization. Boxes in the image denote hints.

Conclusion

In summary, Rex-Thinker introduces a significant advancement in object referring by fostering a grounded and interpretable approach through Chain-of-Thought reasoning. This paradigm enhances the robustness and trustworthiness of object referring systems, paving the way for future advancements in AI interpretability and application flexibility, especially in settings requiring critical decision-making with visual inputs. The approach hints at broader implications for AI development, emphasizing transparency and reasoning in complex, multimodal tasks.
