VDC-Agent-19K: Captioning & VDC Control

Updated 25 November 2025
  • VDC-Agent-19K is a dual-use framework featuring a self-evolving video captioning dataset for multimodal models and a decentralized control method for robotics.
  • It employs direct preference optimization with curriculum-based fine-tuning to significantly enhance caption accuracy and VDC benchmark performance.
  • Its decentralized control network uses localized observer-controller pairs to enable scalable, high-frequency control across 19,000 degrees of freedom.

VDC-Agent-19K refers to two distinct but independently influential frameworks in academic literature: (1) an agentic, self-evolving dataset for video detailed captioning in multimodal LLMs (MLLMs), and (2) a scalable network of decentralized controller-observer agents for high-dimensional virtual decomposition control (VDC) of robotic mechanisms. Each system is characterized by an explicit agentic architecture designed for self-improvement and large-scale scalability, and both are rigorously grounded in theory and empirical evaluation.

1. Agentic Construction in Video Detailed Captioning

VDC-Agent-19K, in the context of video captioning, is a curated dataset comprising 18,886 preference tuples automatically generated by an MLLM through agentic self-reflection. The construction process forms a closed loop: an agent receives an unlabeled video $x$, generates an initial caption $y_0$ conditioned on prompt $p_0$, and receives quality guidance (score $s_t$ and suggestions $g_t$) according to a set of textual principles $R$ encompassing object coverage, action discrimination, camera behavior, and scene attributes. The prompt is then refined iteratively:

  • If $s_t \geq \lambda = 90$, the loop halts.
  • For $t = 1, \ldots, T$ (with $T = 4$):
    • If $s_t \geq s_{t-1}$, the prompt is updated with self-improvement instructions.
    • If $s_t < s_{t-1}$, a self-reflection update path is triggered, using the previous chain-of-thought $\text{CoT}_{t-1}$ to diagnose and amend errors.
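A minimal sketch of this refinement loop, assuming hypothetical `generate` and `judge` callables standing in for the MLLM caption pass and its scoring pass (the prompt-update strings are illustrative, not the paper's actual instructions):

```python
def refine_caption(video, p0, generate, judge, T=4, lam=90):
    """Run up to T refinement rounds; return the trajectory [(y_t, s_t), ...]."""
    prompt = p0
    caption = generate(video, prompt)
    score, suggestions, cot = judge(video, caption)
    trajectory = [(caption, score)]
    for t in range(1, T + 1):
        if score >= lam:  # s_t >= lambda: quality threshold reached, halt
            break
        if t == 1 or score >= trajectory[-2][1]:
            # score held or improved: plain self-improvement update
            prompt += "\nRefine the caption using these suggestions: " + suggestions
        else:
            # score regressed: self-reflection using the previous chain-of-thought
            prompt += "\nDiagnose the error in this reasoning and fix it:\n" + cot
        caption = generate(video, prompt)
        score, suggestions, cot = judge(video, caption)
        trajectory.append((caption, score))
    return trajectory
```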

Each caption trajectory $P(x,d)$ for video $x$ and VDC dimension $d \in \{\text{camera}, \text{short}, \text{background}, \text{main object}, \text{detailed}\}$ takes the form $\{(y_0, s_0), (y_1, s_1), \ldots, (y_{T_v}, s_{T_v})\}$ with $T_v \leq T$.

Raw data is filtered to discard (1) immediately correct generations ($s_0 \geq \lambda$) and (2) invalid outputs (malformed JSON), producing a clean set of 18,886 trajectories. For each trajectory, the system extracts the best and worst captions $(y^+, s^+)$ and $(y^-, s^-)$, forming a preference tuple $(x, y^+, y^-, \Delta s)$ with gap $\Delta s = s^+ - s^-$. This structure enables preference learning via Direct Preference Optimization (DPO).
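The filtering and tuple-extraction step can be sketched as follows; the record layout and helper name are illustrative, not from the paper:

```python
LAMBDA = 90  # quality threshold lambda from the construction loop

def to_preference_tuple(video, trajectory):
    """Turn a trajectory [(caption, score), ...] into (x, y+, y-, delta_s),
    or return None if the trajectory is filtered out."""
    if trajectory[0][1] >= LAMBDA:  # immediately correct: no preference signal
        return None
    best = max(trajectory, key=lambda ys: ys[1])   # (y+, s+)
    worst = min(trajectory, key=lambda ys: ys[1])  # (y-, s-)
    delta_s = best[1] - worst[1]
    return (video, best[0], worst[0], delta_s)
```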

2. Properties and Structure of the VDC-Agent-19K Dataset

The final dataset comprises 18,886 tuples, with approximately 3,777 samples per VDC dimension. Caption length averages 25–30 tokens, and the vocabulary contains ≈42,000 unique word types, yielding significant lexical diversity (mean BERTScore distance ≈0.75). Score statistics show $s^-$ typically ranges from 50 to 85 and $s^+$ from 90 to 100, with $\Delta s$ covering 5–50 (mean ≈20). Qualitative analysis reveals that high-scoring captions reliably describe camera motion, objects, and temporal aspects, while low-scoring outputs frequently omit important context or hallucinate events not present in the video.

3. Direct Preference Optimization and Curriculum-based Fine-tuning

VDC-Agent-19K is used to fine-tune MLLMs such as Qwen2.5-VL-7B-Instruct via Direct Preference Optimization (DPO). The loss is defined as:

$$L_\text{DPO} = -\log P(y^+ \succ y^- \mid x), \quad \text{where}$$

$$P(y^+ \succ y^- \mid x) = \sigma\left(\beta \left[ \log \pi_\theta(y^+ \mid x) - \log \pi_\theta(y^- \mid x) - \left( \log \pi_\text{ref}(y^+ \mid x) - \log \pi_\text{ref}(y^- \mid x) \right) \right] \right)$$

Here, $\beta$ is a scaling parameter ($\beta \approx 0.1$–$0.5$), and $\pi_\theta$, $\pi_\text{ref}$ denote the fine-tuned and frozen reference model policies, respectively. Curriculum-based batch streaming orders preference tuples by descending $\Delta s$ (easy to hard), so that high-confidence pairs precede more ambiguous ones. Fine-tuning uses LoRA adaptation on the LLM backbone (rank = 16, $\alpha$ = 32, dropout = 0.1), batch size 16, across 3 epochs on 4×A800 GPUs, with a cosine learning-rate schedule starting at $5\times10^{-5}$ and 10% warmup.
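A numeric sketch of the per-pair DPO loss and the curriculum ordering, using plain floats for sequence log-probabilities (a real implementation would obtain these from the policy and the frozen reference model):

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigma(beta * [(log pi(y+) - log pi(y-)) - (log pi_ref(y+) - log pi_ref(y-))])."""
    margin = (logp_pos - logp_neg) - (ref_logp_pos - ref_logp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def curriculum_order(tuples):
    """Stream preference tuples by descending score gap: easy to hard."""
    return sorted(tuples, key=lambda t: -t[3])  # t = (x, y+, y-, delta_s)
```

When the policy and reference agree on both captions the margin is zero and the loss sits at $-\log 0.5 \approx 0.693$; the gradient then pushes the policy to widen its log-probability gap in favor of $y^+$.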

4. Empirical Performance and Benchmark Impact

VDC-Agent-7B, trained using the VDC-Agent-19K dataset and curriculum DPO, achieves state-of-the-art results on the VDC benchmark. Performance metrics are as follows:

| Model | Accuracy | VDCscore |
|---|---|---|
| Qwen2.5-VL-7B-Instruct (baseline) | 43.95% | 2.23 |
| VDC-Agent-7B (fine-tuned) | 49.08% | 2.50 |

Dimension-wise improvements over baseline include:

  • Camera: +7.91% accuracy / +0.49 VDCscore
  • Background: +7.83% / +0.40
  • Main Object: +4.23% / +0.18
  • Detailed: +3.96% / +0.20
  • Short: +1.73% / +0.08

These results demonstrate a +5.13% boost in overall accuracy and a +0.27 increment in VDCscore, at inference cost comparable to the base model (Wang et al., 24 Nov 2025).

5. VDC-Agent-19K in Decentralized Virtual Decomposition Control

Distinguished from the data-centric MLLM framework, the term VDC-Agent-19K also denotes a blueprint for scaling decentralized observer-based virtual decomposition control to mechanisms with on the order of 19,000 degrees of freedom (2002.01292). In this scheme, a large open-chain manipulator is virtually partitioned into $2n$ subsystems: $n$ rigid-link and $n$ joint agents, each operating a local controller and observer pair.

The communication protocol is strictly local, relying on force/moment transfers at Virtual Cutting Points (VCPs) and decoupled ordinary differential equations (ODEs) per agent, with interconnection through Virtual Power Flows (VPFs)—quantities that cancel in global Lyapunov analysis. The explicit observer and controller equations for each link and joint are as follows:

  • Link-$i$ observer (6D):

$$\dot{\hat{P}}_{B_i} = \hat{V}_{B_i} - M_{B_i}^{-1} L_{B_i} (\hat{P}_{B_i} - P_{B_i})$$

$$M_{B_i} \dot{\hat{Z}}_{B_i} = F_{B_i}^* - C_{B_i}(\hat{\omega}_{B_i}) \hat{V}_{B_i} - G_{B_i}$$

  • Joint-$i$ observer (1D):

$$\dot{\hat{q}}_i = z_i - L_i(\hat{q}_i - q_i)$$

$$I_{m,i}\dot{z}_i = \tau_i - \tau_{a,i} - f_{c,i}(\dot{\hat{q}}_i) - \ell_i(\hat{q}_i - q_i)$$

Controller laws and Lyapunov-based global convergence proofs scale independently of $n$. By instantiating this pattern 19,000 times (for both links and joints), a VDC-Agent-19K network can coordinate systems at unprecedented scale with guaranteed semiglobal exponential convergence.
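A minimal simulation sketch of the 1-D joint observer above, integrated with an explicit Euler step; the gains $L_i$, $\ell_i$, the motor inertia, and the friction model are illustrative placeholders, not values from the cited paper:

```python
def joint_observer_step(q_hat, z, q, tau, tau_a, dt,
                        I_m=1.0, L=50.0, ell=200.0, f_c=lambda v: 0.1 * v):
    """One explicit-Euler step of the joint-i observer ODEs:
    q_hat' = z - L (q_hat - q)
    I_m z' = tau - tau_a - f_c(q_hat') - ell (q_hat - q)"""
    q_hat_dot = z - L * (q_hat - q)
    z_dot = (tau - tau_a - f_c(q_hat_dot) - ell * (q_hat - q)) / I_m
    return q_hat + dt * q_hat_dot, z + dt * z_dot
```

With these placeholder gains and a constant true position, the estimate $\hat{q}_i$ converges to $q_i$, illustrating the local observer's role before any controller is layered on top.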

6. Scalability and Practical Considerations in Large-Scale VDC-Agent Networks

Each of the 38,000 subsystems in a VDC-Agent-19K network (for a 19,000-DoF chain) independently integrates its observer and controller ODEs and requires data only from immediate neighbors. State updates consist of matrix-vector operations and one-step ODE integration. Parallelization is straightforward, enabling hardware implementation across multi-core CPU, GPU, or FPGA clusters at effective control rates exceeding 1 kHz.
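One synchronous control tick across the agent chain might be organized as below; the `Agent` interface (`vcp_output`, `step`) is a hypothetical stand-in for the local observer-controller pair, with `vcp_output` playing the role of the force/moment data exchanged at the Virtual Cutting Points:

```python
def network_tick(agents, dt):
    """Advance every agent one step using only immediate-neighbor VCP data."""
    # Snapshot all VCP outputs first so every update sees the same instant.
    outputs = [a.vcp_output() for a in agents]
    for i, a in enumerate(agents):
        left = outputs[i - 1] if i > 0 else None       # chain boundary: no neighbor
        right = outputs[i + 1] if i < len(agents) - 1 else None
        a.step(dt, (left, right))
```

Because each `step` touches only local state and two neighbor values, the inner loop partitions trivially across cores or hardware blocks.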

Key engineering considerations include:

  • Gain selection per analytic bounds to maintain passivity and avoid instability.
  • Incremental gain adaptation to balance convergence speed against noise tolerance.
  • Robustness to model uncertainty via passivity property and possible adaptive updates.
  • Straightforward inclusion of contact/environmental subsystems within the VDC formalism.
  • Fixed-step numerical integration (e.g., 4th-order Runge–Kutta for joints, Lie-group-aware schemes for links), with precomputation of repeated transforms and exploitation of sparsity in inertial and Coriolis matrices.
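The fixed-step classical 4th-order Runge-Kutta scheme mentioned above, in generic scalar form (a standard textbook integrator, not code from the cited work):

```python
def rk4_step(f, t, x, dt):
    """One classical RK4 step for x' = f(t, x) with fixed step dt."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, x + dt / 2 * k1)
    k3 = f(t + dt / 2, x + dt / 2 * k2)
    k4 = f(t + dt, x + dt * k3)
    return x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

A fixed step keeps every agent's update cost constant per tick, which is what makes the >1 kHz control-rate budgeting predictable.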

All stability and performance guarantees remain valid at this scale due to analytic cancellation of VPFs, yielding a theoretically rigorous and computationally tractable path to extreme-scale chain-type robot and simulation control.


VDC-Agent-19K thus designates: (a) a large-scale, self-constructed video preference dataset critical for preference-based multimodal model training (Wang et al., 24 Nov 2025), and (b) a blueprint for massively scalable, decentralized, passivity-based observer-controller networks in virtual decomposition control (2002.01292). Both leverage agentic principles to attain performance and stability unattainable with monolithic, centrally engineered strategies.
