DocDancer: Open-Source DocQA Framework
- DocDancer is an open-source agentic framework for Document QA that formalizes information seeking with explicit tool-driven exploration.
- It employs a structured document parser and iterative Search/Read actions to synthesize evidence across multimodal and long documents.
- Its two-phase synthetic QA generation pipeline boosts data efficiency and performance, setting new benchmarks against closed-source models.
DocDancer is an open-source agentic framework for Document Question Answering (DocQA), designed to address the limitations of existing agents in tool utilization, reliance on closed-source models, and constrained generalization. By formalizing DocQA as an information-seeking problem, DocDancer leverages explicit tool-driven document exploration, a purpose-built data synthesis regime, and end-to-end supervised fine-tuning to set new data efficiency and performance benchmarks for long, multimodal document understanding (Zhang et al., 8 Jan 2026).
1. Motivation and Formalization of Agentic Document QA
Traditional DocQA pipelines typically adopt one of three paradigms: (a) optical character recognition (OCR) followed by LLMs, which are brittle to layout and discard visual cues; (b) retrieval-augmented generation (RAG), which often fails due to single-shot retrieval and lack of multi-step reasoning; or (c) prompt-based agents, which rely on closed-source models and have constrained or non-learned tool-use behaviors.
DocDancer reconceptualizes DocQA as iterative agentic information seeking over a structured document environment. Let 𝒟 be a source document (typically PDF) parsed into a structured outline with visual nodes. The agent operates with two actions—Search and Read—from a toolset 𝒜 = {Search, Read}. At each timestep t, it issues an action a_t ∈ 𝒜, guided by an internal thought τ_t, observes tool feedback o_t, and accumulates an interaction history H_t = (τ_1, a_1, o_1, …, τ_t, a_t, o_t). The agent's policy π_θ is learned to maximize the likelihood over trajectories, max_θ Σ_t log π_θ(τ_t, a_t | c, H_{t−1}), where c is the task context. The objective is end-to-end optimization of θ such that, through Search and Read actions, the agent locates, synthesizes, and composes evidence to answer document-based questions.
2. Agent Framework and System Architecture
DocDancer is instantiated as a single-agent system built atop open-source LLM backbones: Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507. The core processing pipeline comprises:
- Document Parsing: MinerU2.5 extracts high-precision layout and semantics into an XML outline with 17 element types. Headings are clustered hierarchically into fine-grained sections. Images and charts are captioned automatically using a multimodal model and integrated into the node structure.
- Tool Suite:
- Search(keywords): Full-text keyword search over the document outline, returning matching section ids, page numbers, and contextual snippets.
- Read(section_ids, goal): Aggregates all text, images, tables, and section screenshots, then summarizes goal-relevant content using a multimodal LLM.
- Agentic Loop (ReAct-inspired): At each step, the agent formulates a subgoal (thought τ_t), selects and executes a tool action (a_t), incorporates the resulting observation (o_t), and appends all three to its working history H_t.
The workflow is strictly iterative, with explicit subgoal formulation and tool selection at each step.
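The Search/Read loop can be sketched in Python. The outline format, tool signatures, and the trivial keyword policy below are illustrative assumptions, not DocDancer's actual API:

```python
# Minimal sketch of a Search/Read agentic loop over a parsed outline.
# OUTLINE, search(), read(), and the stopping rule are placeholders.

OUTLINE = {
    "sec1": "Netflix FY2015 advertising expense was disclosed in marketing notes.",
    "sec2": "Total revenues for FY2015 appear in the income statement.",
}

def search(keywords):
    """Keyword search over the outline; returns (section_id, snippet) hits."""
    return [(sid, text) for sid, text in OUTLINE.items()
            if any(kw.lower() in text.lower() for kw in keywords)]

def read(section_ids, goal):
    """Aggregate section content relevant to the goal (a real system would
    summarize text, tables, and screenshots with a multimodal LLM)."""
    return " ".join(OUTLINE[sid] for sid in section_ids)

def agent_loop(question, max_steps=5):
    history = []
    for _ in range(max_steps):
        # Thought: derive search keywords from the question (trivial policy).
        keywords = question.split()
        hits = search(keywords)
        history.append(("Search", keywords, hits))
        if hits:
            evidence = read([sid for sid, _ in hits], goal=question)
            history.append(("Read", [sid for sid, _ in hits], evidence))
            return evidence, history  # compose the answer from this evidence
    return None, history

evidence, history = agent_loop("advertising expense FY2015")
```

In the real system each "thought" is generated by the fine-tuned LLM rather than extracted heuristically, and the loop terminates when the model emits a final answer instead of a tool call.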
3. Exploration-then-Synthesis Data Pipeline
To mitigate the scarcity of high-quality DocQA training data, DocDancer employs a two-phase synthetic QA generation pipeline:
- Exploration Stage: An agent interacts with document 𝒟, sampling a trajectory ξ = ((i_1, u_1, y_1), …, (i_T, u_T, y_T)), where i_t is the explicit intent, u_t the tool action, and y_t the resulting observation. Sampling is controlled (max depth 15–20 steps, with strongly guided prompts) to enforce multi-page, multimodal, multi-hop evidence collection.
- Synthesis Stage: A secondary model transforms each trajectory-document pair (ξ, 𝒟) into QA pairs (q, a), forming a candidate set. High-quality pairs are filtered through rejection sampling with a strong open-source model M_t.
This regime yields a training set that explicitly encodes reasoning chains and evidence-gathering strategies.
```
for each doc D in corpus:
    ξ ← explore(D)                                # collect (i, u, y) steps
    QA_candidates ← synthesize(ξ, D)
    QA_final ← reject_sample(QA_candidates, M_t)
    add QA_final to training_set
```
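A runnable Python analogue of this skeleton, with stubbed explore/synthesize/judge functions (all names and heuristics here are placeholders; a real pipeline would call the agent, a QA-synthesis model, and the strong judge model M_t respectively):

```python
# Runnable analogue of the exploration-then-synthesis pipeline skeleton.
# explore(), synthesize(), and judge() are illustrative stubs.

def explore(doc):
    # Return a trajectory of (intent, tool_action, observation) steps.
    return [("find revenue", "Search(['revenue'])", f"hit in {doc}")]

def synthesize(trajectory, doc):
    # Turn a trajectory-document pair into candidate QA pairs.
    return [{"q": f"What does {doc} report?", "a": obs}
            for _, _, obs in trajectory]

def judge(qa):
    # Stand-in for rejection sampling with a strong model:
    # keep only answerable, well-grounded pairs.
    return bool(qa["a"])

def build_training_set(corpus):
    training_set = []
    for doc in corpus:
        trajectory = explore(doc)
        candidates = synthesize(trajectory, doc)
        training_set.extend(qa for qa in candidates if judge(qa))
    return training_set

data = build_training_set(["10-K.pdf", "annual-report.pdf"])
```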
4. Training Objectives and Optimization
Final models are supervised-fine-tuned on the synthetic trajectories, where each token belongs to a thought, action, or observation segment. Following FireAct, the loss excludes observation tokens to avoid overfitting to external tool outputs: L(θ) = −Σ_t 1[x_t ∉ observation] · log π_θ(x_t | c, x_{<t}), where c denotes the task context (document outline plus question).
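The observation-masked objective can be illustrated with a simple token-role mask (the tokens, roles, and uniform log-probabilities below are synthetic; a real implementation applies the mask at the tokenizer level inside the SFT loss):

```python
# Sketch of FireAct-style loss masking: supervise thought/action tokens,
# zero out the loss on observation tokens returned by tools.

import math

def masked_nll(tokens, roles, token_logprobs):
    """Sum negative log-likelihood over non-observation tokens only."""
    loss = 0.0
    for tok, role, lp in zip(tokens, roles, token_logprobs):
        if role != "observation":  # exclude tool outputs from the loss
            loss -= lp
    return loss

tokens = ["<think>", "search", "</think>", "Search(...)", "page 3 says ..."]
roles  = ["thought", "thought", "thought", "action",      "observation"]
logps  = [math.log(0.5)] * 5

loss = masked_nll(tokens, roles, logps)
# Only 4 of 5 tokens contribute: the observation token is masked out.
```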
5. Benchmarking, Results, and System Comparison
DocDancer evaluation covers two multimodal, long-context benchmarks:
- MMLongBench-Doc: 135 documents, average 47.5 pages, 1,091 questions (33% cross-page, multimodal).
- DocBench: 229 documents, 1,082 questions across five domains and four question types.
Metrics include accuracy (acc), span-based F₁, and LLM-as-Judge (LasJ, using GPT-4.1/GPT-4o) scoring. Baselines are drawn from VLMs (GPT-4o, Gemini-2.5), OCR pipelines (Tesseract, fitz + GPT-4/Gemini), advanced RAG systems (VisRAG, Colpali, M3DocRAG), and prompt-based agents (Doc-React, MDocAgent, MACT, SimpleDoc, DocLens, DocAgent).
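Span-based F1 is presumably the standard token-overlap metric; a minimal sketch under that assumption (whitespace tokenization and lowercasing only, where real evaluators typically add punctuation and article stripping):

```python
# Token-overlap F1 between a predicted and a gold answer span.
from collections import Counter

def span_f1(prediction, gold):
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    # Multiset intersection counts each shared token at most min(freqs) times.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

score = span_f1("about 1.8 percent of sales", "1.8 percent")
# precision = 2/5, recall = 2/2
```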
| Model | MMLongBench-Doc Acc | MMLongBench-Doc LasJ | DocBench LasJ | Params |
|---|---|---|---|---|
| DocDancer (4B-ft) | 48.4 | 59.4 | 79.8 | 4B |
| DocDancer (30B-A3B-ft) | 54.4 | 65.3 | 81.2 | 30B |
| GPT-5.2 (closed) | 57.0 | 67.6 | 85.5 | – |
| Human | – | – | 81.2 | – |
In a financial QA case study ("What is advertising expense to sales ratio of Netflix in FY 2015?"), DocDancer retrieves and synthesizes evidence from multiple document sections, arriving at an accurate ratio. By contrast, the best open-source QA baseline confuses advertising and marketing expenses, producing an erroneous answer.
6. Analysis, Ablations, and Future Directions
Ablation studies demonstrate that MinerU2.5-based outlines confer a 2–3 point performance advantage over alternative parsing stacks (AdobePDF + DocXChain + PyMuPDF). The simplified two-tool interface (Search/Read) also outperforms the more complex five-tool setup used in DocAgent.
Testing Read-tool variants (Qwen3-VL-235B versus Gemini-3-Pro) reveals only marginal differences (~0.2 points of accuracy), evidencing robustness to the choice of multimodal summarizer. Synthetic QA generation outperforms equivalently sized open-source QA data by 3–5 points across all metrics and document domains, with especially strong gains on structurally complex documents.
Limitations include restriction to two open-source model backbones, exclusive reliance on supervised fine-tuning (no agentic RL), and dataset scale capped at 5,000 synthetic trajectories. Scaling to larger corpora and model families remains open for exploration.
A plausible implication is that agentic design with explicit tool reasoning and data synthesis pipelines may yield generalizable improvements for document-centric information retrieval and QA. DocDancer’s modular agentic framework, efficient synthetic data pipeline, and strong open-source baseline establish a new standard for agentic DocQA research (Zhang et al., 8 Jan 2026).