
daVinci-Agency: Data-Driven Agentic AI

Updated 4 February 2026
  • daVinci-Agency is a data-centric framework designed to structure long-horizon workflows for agentic AI in contexts like software engineering and robotic surgery.
  • It employs progressive task decomposition, long-term consistency, and verifiable refinement to train LLM agents using authentic workflow logs such as GitHub PR chains and surgical voice commands.
  • The framework demonstrates improved performance measures, data-efficiency, and reliable multi-stage reasoning, paving the way for scalable, tool-augmented AI systems.

The term daVinci-Agency refers to data-centric frameworks and system architectures for agentic AI grounded in the mining, structuring, and leveraging of long-horizon human workflows, and more broadly to orchestration platforms for complex, tool-augmented LLM agents. The concept is exemplified by two main research instantiations: (1) daVinci-Agency for long-horizon software engineering agent data curation and training (Jiang et al., 2 Feb 2026), and (2) the Surgical Agent Orchestration Platform for multimodal, voice-driven interaction in robotic surgery (Park et al., 10 Nov 2025). Both address persistent challenges in scaling agentic LLMs to multi-stage, causally coherent, and verifiable workflows that tightly couple reasoning, action-selection, and environment evolution.

1. Foundations: Agency in AI and Long-Horizon Workflows

Agency in artificial intelligence, in both functional and phenomenal terms (Das, 9 Feb 2025), requires the persistent pursuit of goals through sequence-dependent action selection, adaptive state maintenance, and iterative refinement. In practical agentic settings—such as building, testing, and deploying nontrivial software features across interdependent commits, or orchestrating surgical interactions with multimodal data—agentic systems must solve the problems of:

  • Progressive Decomposition: Breaking a global objective into subgoals aligned with underlying system structure.
  • Long-Term Functional Consistency: Maintaining code or world-state invariants across workflow stages, resisting drift and error accumulation.
  • Iterative Refinement: Diagnosing, correcting, and verifying solutions over real error/failure distributions.

Existing supervised or synthetic datasets are inadequate for these challenges because they typically encode short horizons, lack authentic failure trajectories, or require prohibitive annotation at scale (Jiang et al., 2 Feb 2026). The daVinci-Agency paradigm exploits real-world, granular workflow logs (e.g., GitHub PR chains, surgical command sequences), which natively encode the necessary supervision signals for high-fidelity, long-horizon agent learning.

2. Data Mining and Structuring for Agentic Supervision

In the daVinci-Agency framework for software engineering, training supervision is mined from real GitHub Pull Request (PR) chains, formalized as sequences $C = \{\mathrm{pr}_1, \dots, \mathrm{pr}_k\}$, where each PR consists of human discussion and ground-truth patches, with explicit semantic reference relations ensuring topological ordering along authentic development paths (Jiang et al., 2 Feb 2026). This yields trajectories $\tau = \{(o_j, m_j, t_j)\}_{j=0}^{N}$, where each step comprises an observable state ($o_j$), a reasoning message ($m_j$), and a tool action ($t_j$); the trajectory transitions the codebase from $S_{\mathrm{init}}$ to $S_{\mathrm{target}}$ in response to a natural-language query $q$.
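The PR-chain and trajectory structure above can be sketched with simple dataclasses. The field names and the topological-sort helper are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    """One unit pr_i in a chain C = {pr_1, ..., pr_k}."""
    pr_id: int
    discussion: str          # human discussion text
    patch: str               # ground-truth patch
    references: list[int] = field(default_factory=list)  # semantic links to earlier PRs

@dataclass
class Step:
    """One step (o_j, m_j, t_j) of a trajectory tau."""
    observation: str         # o_j: observable repo/environment state
    message: str             # m_j: the agent's reasoning message
    tool_action: str         # t_j: tool call that transitions the state

def topological_order(chain: list[PullRequest]) -> list[PullRequest]:
    """Order PRs so each appears after every PR it references
    (Kahn-style sort over the semantic reference relation)."""
    by_id = {pr.pr_id: pr for pr in chain}
    indegree = {pr.pr_id: len(pr.references) for pr in chain}
    dependents: dict[int, list[int]] = {pr.pr_id: [] for pr in chain}
    for pr in chain:
        for ref in pr.references:
            dependents[ref].append(pr.pr_id)
    ready = [i for i, d in indegree.items() if d == 0]
    order = []
    while ready:
        i = ready.pop()
        order.append(by_id[i])
        for j in dependents[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    return order
```

Ordering PRs this way reproduces the "authentic development path" property: no training unit ever depends on a later one.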

For surgical interaction, the data structure centers around voice command logs and multimodal memory states, with systematized extraction of command–response pairs and the corresponding system states (e.g., video context, overlay parameters, anatomical render views) (Park et al., 10 Nov 2025).

Table 1: Key Data Mining Dimensions in daVinci-Agency

| Domain | Long-horizon Unit | Authenticity Mechanism |
|---|---|---|
| Software Engineering | Chain of PRs | Real PR histories, topological links |
| Robotic Surgery | Voice-command sequence | Real-time operator utterances, multimodal agent responses |

3. Interlocking Mechanisms for Robust Agentic Learning

The daVinci-Agency approach operationalizes long-horizon agentic learning through three interlocking supervisory mechanisms (Jiang et al., 2 Feb 2026):

  1. Progressive Task Decomposition via Commits/Commands: Each workflow step (PR or command) is distilled into a sub-query $q_i$ (via LLM-based query construction or correction prompts), isolating the intent, functional site, and logic shift without leaking the ground-truth solution. This enforces explicit planning of subgoals within the overarching workflow.
  2. Long-Term Consistency through State Carry-Over: All workflow units are tied to a unified objective; each patch or action accumulates, inducing state transitions that are scrutinized by downstream test/verifiability checks. Misaligned edits or invalid actions propagate errors, enforcing persistent stateful reasoning.
  3. Verifiable Refinement from Authentic Bug-Fix/Correction Trajectories: Explicitly labeled bug-fixes and semantic corrections provide naturally occurring negative (failed) and positive (fixed) supervision. Automatic evaluators (e.g., GLM-4.6, Gemma3) implement rejection sampling, enforcing semantic alignment with ground truth and providing up to three auto-feedback rounds per refinement stage.
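The verifiable-refinement mechanism (item 3) can be sketched as a rejection-sampling loop with bounded feedback rounds. Here `generate_patch` and `evaluate` are stand-ins for the LLM generator and automatic evaluator (GLM-4.6 / Gemma3 in the paper); their signatures are assumptions for illustration:

```python
# Sketch of the verifiable-refinement loop: an automatic evaluator accepts or
# rejects a candidate, with up to three auto-feedback rounds per stage.
MAX_FEEDBACK_ROUNDS = 3

def refine(query, generate_patch, evaluate):
    """Return (patch, accepted) after at most MAX_FEEDBACK_ROUNDS rounds.

    generate_patch(query, feedback) -> candidate patch
    evaluate(patch)                 -> (accepted: bool, feedback: str)
    """
    feedback = None
    patch = None
    for _ in range(MAX_FEEDBACK_ROUNDS):
        patch = generate_patch(query, feedback)
        accepted, feedback = evaluate(patch)
        if accepted:  # rejection sampling: keep only semantically aligned patches
            return patch, True
    return patch, False
```

Rejected candidates are not discarded silently: the evaluator's feedback string is fed back into the next generation attempt, mirroring how authentic bug-fix trajectories pair failed and fixed states.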

4. Model Architectures and Training Protocols

The agentic data mined by daVinci-Agency is used to fine-tune unified LLMs for both code-related and multimodal tool-use tasks. For software workflows, the primary model is GLM-4.6 (4.6B parameters), but the protocol generalizes to larger and smaller architectures (Qwen3-32B, 30B MoE, 8B) (Jiang et al., 2 Feb 2026). Supervised fine-tuning is performed using frameworks such as Slime v3, leveraging high-fidelity multi-PR trajectories (average: 85k tokens, 116 tool calls). Key training settings include batch size 64, 2 epochs, learning rate 2e–6 (cosine decay), 10% warmup, Adam optimizer, and parallel sequence processing. Fine-tuning is scaffolded by execution environments (SII-CLI, mini-swe-agent) to capture interaction diversity.
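The reported schedule (peak learning rate 2e-6, cosine decay, 10% warmup) can be sketched as a step-to-learning-rate function. This helper is a generic reconstruction of that schedule, not code from the Slime v3 framework:

```python
import math

# Illustrative sketch of the reported fine-tuning schedule:
# peak lr 2e-6, 10% linear warmup, cosine decay afterwards.
PEAK_LR = 2e-6
WARMUP_FRAC = 0.10

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup over the first 10% of steps, then cosine decay to ~0."""
    warmup_steps = max(1, int(WARMUP_FRAC * total_steps))
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    # progress in [0, 1] over the decay phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))
```

With batch size 64 and 2 epochs over 239 trajectories, the total step count is small, which is why the warmup fraction rather than a fixed warmup step count matters here.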

In surgical agent orchestration, hierarchical models are constructed with a workflow orchestrator agent (LLM-driven, e.g., Gemma3:27b-it-qat) controlling three subordinate task-specific LLM agents—Information Retrieval, Image Viewer, Anatomy Rendering—each operating over distinct action grammars and state variables (Park et al., 10 Nov 2025). Speech-to-text is handled by Whisper-medium, voice activation by Silero-VAD, and domain-specific correction by LLMs using tailored prompt engineering.
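The hierarchical pattern above, one orchestrator routing commands to three subordinate agents, can be sketched as follows. The keyword heuristic is an illustrative stand-in for the LLM-driven orchestrator decision (Gemma3:27b-it-qat in the paper), and the agent stubs are hypothetical:

```python
# Minimal sketch of the orchestration pattern: a corrected voice command is
# routed to one of three subordinate agents, each with its own action grammar.
AGENTS = {
    "information_retrieval": lambda cmd: f"IR answered: {cmd}",
    "image_viewer":          lambda cmd: f"IV adjusted view: {cmd}",
    "anatomy_rendering":     lambda cmd: f"AR rendered: {cmd}",
}

def route(command: str) -> str:
    """Pick a subordinate agent for the command (keyword heuristic standing
    in for the real LLM decision step) and return its response."""
    text = command.lower()
    if any(w in text for w in ("show", "zoom", "overlay", "view")):
        agent = "image_viewer"
    elif any(w in text for w in ("render", "anatomy", "3d")):
        agent = "anatomy_rendering"
    else:
        agent = "information_retrieval"
    return AGENTS[agent](command)
```

In the real system the routing decision is itself an LLM call over the corrected transcript, which is why the evaluation below separates agent-decision accuracy from downstream execution accuracy.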

5. Evaluation Methodologies and Empirical Outcomes

Software Agentic Workflows: Benchmarks include SWE-bench Verified (feature-level engineering), Toolathlon (tool-use density), DS-1000 (file-level coding), τ²-Bench, and SciCode-MP. Fine-tuning on only 239 daVinci-Agency multi-PR trajectories yielded:

  • Composite benchmark score improvement from 0.441 to 0.475 (a 7.7% relative gain).
  • 47% relative gain on Toolathlon (0.157→0.231).
  • Data-efficiency: daVinci-Agency outperforms large synthetic/manual datasets (e.g., SWE-Smith 66k samples, avg 0.373; CC-Bench 260 samples, avg 0.436) (Jiang et al., 2 Feb 2026).
  • Scaling laws: longer trajectories (85k vs. 59k tokens) improve SWE-bench and τ²-Bench by up to +8%; increased allowed tool calls further widen gains over baselines.

Surgical Agent Orchestration: Evaluated using the Multi-level Orchestration Evaluation Metric (MOEM), capturing both stage-level accuracy and multi-pass workflow-level success:

  • 240 commands, stratified by structure, type, and expression category.
  • Stage-level: STT correction boosts raw transcription accuracy from 75% to ~95%; command correction/reasoning >96%; agent decision (action determination) ~93–95%; overlay execution 100%.
  • Workflow-level success rate (SR): strict = 65.8% (STT is the bottleneck), single-pass = 89.2%, multi-pass = 95.8%. Per agent: Information Retrieval ≈ 99%, Image Viewer ≈ 95%, Anatomy Rendering ≈ 94% (Park et al., 10 Nov 2025).
  • Robustness verified across multiple voice types; composite/paraphrase command structures remain most error-prone.
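The distinction between strict/single-pass and multi-pass workflow-level success can be made concrete with a small scoring sketch. Stage outcomes are represented as booleans; the MOEM metric itself is defined in the paper, and this is only an illustrative model of the pass semantics:

```python
# Illustrative sketch: a workflow run succeeds when every stage succeeds;
# multi-pass evaluation allows up to `passes` attempts per workflow.
def workflow_success(runs: list[list[bool]], passes: int = 1) -> bool:
    """True if any of the first `passes` runs has all stages succeed."""
    return any(all(stages) for stages in runs[:passes])

def success_rate(workflows: list[list[list[bool]]], passes: int = 1) -> float:
    """Fraction of workflows that succeed within the allowed passes."""
    return sum(workflow_success(r, passes) for r in workflows) / len(workflows)
```

Under this model, the jump from 89.2% (single-pass) to 95.8% (multi-pass) reflects workflows that fail on the first attempt, typically at the STT stage, but succeed on retry.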

6. Limitations and Prospective Directions

Current Constraints:

  • Software engineering PR chains are capped at length five; longer chains yield lower self-distillation rates (Jiang et al., 2 Feb 2026).
  • Reliance on single-model evaluators for rejection sampling (GLM-4.6) or command correction (Gemma3) may limit coverage and introduce bias; integrating ensemble or stronger judgment models is suggested.
  • Domain and language generality are bounded by the scope of mined repositories (nine repos) and domain vocabularies. Underrepresented domains include mobile development and large-scale microservices.

Proposed Advances:

  • Extend chain length via improved self-rollout or trace mining.
  • Generalize data synthesis to infrastructure as code and DevOps pipelines.
  • Incorporate reinforcement-style fine-tuning with built-in evaluators as model-internal reward signals.
  • Expand the orchestration framework for surgical agents to additional intraoperative functions and multi-language support.
  • Address cumulative LLM latency in practical deployment through proactive caching and further domain-specific adaptation.

7. Context within Agency Measurement and Broader AI Research

The daVinci-Agency paradigm sits at the intersection of functionalist and phenomenal agency monitoring (Das, 9 Feb 2025), providing high-fidelity data for goal-directed, tool-using agents as well as generalizable scaffolding for meta-cognitive and intrinsic causal power evaluations. With its principled approach to mining authentic, causally coherent human work trajectories—and model-agnostic data-efficiency—the framework constitutes a scalable, high-yield paradigm for advancing long-horizon, persistent agentic intelligence under real-world constraints (Jiang et al., 2 Feb 2026, Park et al., 10 Nov 2025).
