Replay-Based Telemetry Synthesis
- Replay-based telemetry synthesis is the process of combining deterministic replay of instructional prefixes with real-time tracing to generate high-fidelity execution trajectories.
- It supports agent-centric synthesis by converting large-scale web tutorials into structured, multimodal data, enabling scalable training and debugging of UI agents.
- Prefix-based tracing in concurrent systems offers deterministic replay with subsequent free-form tracing, yielding actionable performance metrics and reproducibility.
Replay-based telemetry synthesis is the process of constructing high-fidelity sequential data of agent or program execution by combining automatic replay of observed (or instructional) prefixes with real-time recording of subsequent behavior. This paradigm enables scalable, semi-automated generation of labeled multimodal trajectory data for data-driven methods, as well as controllable reproduction and analysis of concurrent systems. It underpins recent advances in training GUI agents with web tutorials and rigorous tracing of message-passing programs. Notably, approaches such as AgentTrek operationalize replay-based telemetry synthesis for web environments by guiding agent behavior with externally sourced instructional trajectories, while prefix-based tracing supports customizable instrumentation for concurrency debugging and reproducibility.
1. Key Conceptual Foundations
Replay-based telemetry synthesis centers around the execution of target agents or programs in environments that allow a phase of deterministic replay—guided by a reference or instructional prefix—followed by an unconstrained recording (tracing) phase. The foundational principle is to split the execution trace into a controlled prefix (for reproducibility or data bootstrapping) and a post-prefix suffix, which is recorded for downstream analysis or learning.
In controlled concurrency settings (e.g., Erlang-style message-passing systems), this is formalized as prefix-based tracing, wherein program instrumentation enforces adherence to a supplied “partial log” prefix during execution and logs all subsequent actions as they occur nondeterministically. In agent-centric data synthesis contexts, replay is guided instead by stepwise instructions parsed from textual tutorials or previously collected expert demonstrations, enabling high-throughput synthetic data generation for web or GUI agents (Xu et al., 2024, González-Abril et al., 2021).
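The prefix/suffix split described above can be sketched in a few lines of Python (a minimal illustration; function and action names are hypothetical, not taken from either cited system):

```python
def replay_then_trace(program, prefix_log):
    """Run `program` (an iterator of actions), checking that its first
    decisions match `prefix_log` (replay mode), then recording whatever
    happens afterwards (trace mode). Returns the full execution trace."""
    trace = list(prefix_log)  # the replayed prefix is part of the trace
    for step, action in enumerate(program):
        if step < len(prefix_log):
            # Replay mode: execution must reproduce the logged action.
            assert action == prefix_log[step], "divergence from prefix"
        else:
            # Trace mode: record free-running, nondeterministic behavior.
            trace.append(action)
    return trace

# e.g. replay_then_trace(iter(["spawn", "send", "recv"]), ["spawn"])
# reproduces "spawn" deterministically, then records "send" and "recv".
```

An empty `prefix_log` degenerates to pure tracing, and a complete log to pure deterministic replay, mirroring the spectrum described in the text.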
2. Methodologies: Agent-Centric Synthesis and Concurrency Tracing
Replay-based telemetry synthesis methodologies vary by target domain:
A. Agent Trajectory Synthesis via Guided Replay
- Tutorial Harvesting & Filtering: Textual web tutorials—sourced at scale (e.g., from the RedPajama 20.8B URL corpus)—undergo multi-stage filtering using rule-based keyword matching, LLM (GPT-4o mini) binary classification (F1 ≈ 0.89), and statistical FastText models (test F1 ≈ 0.895), resulting in millions of tutorial-like entries with high recall and precision (Xu et al., 2024).
- Text-to-Task Transformation: Tutorials are parsed and standardized by LLMs into structured task specifications (JSON schema) defining target platform, application object, URLs, prerequisites, instructions, and expected postconditions, at a processing cost of $0.89 per 1,000 tutorials.
- Guided Replay Execution: Visual-LLM (VLM) agents (e.g., Qwen2-VL) execute tasks within instrumented environments (e.g., BrowserGym with Chromium and Playwright drivers), synthesizing multimodal observations (screenshots, AXTree snapshots) and action records matched against tutorial-derived steps. Actions may be API-based or pixel-level, depending on the agent modality.
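A structured task specification of the kind produced by the text-to-task step might look as follows (the field names and values are a plausible sketch, not the exact AgentTrek schema):

```python
import json

# Hypothetical task specification distilled from one web tutorial;
# fields approximate the categories named in the text: platform,
# application object, URL, prerequisites, instructions, postconditions.
task_spec = {
    "platform": "web",
    "application": "google-calendar",
    "url": "https://calendar.google.com",
    "prerequisites": ["user is logged in"],
    "instructions": [
        "Click the 'Create' button",
        "Enter the event title",
        "Click 'Save'",
    ],
    "expected_postconditions": ["a new event appears on the calendar"],
}

print(json.dumps(task_spec, indent=2))
```

The guided-replay agent then executes the `instructions` list step by step, checking its observations against the `expected_postconditions`.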
B. Prefix-Based Tracing in Message-Passing Concurrency
- Scheduler Injection: Programs are instrumented by rewriting concurrent primitives (spawn, send, receive) to interact synchronously or asynchronously with a central scheduler (sched), which enforces a per-process action log/prefix (“replay mode”) and, upon its exhaustion, records subsequent actions (“trace mode”) (González-Abril et al., 2021).
- One-Pass “Replay-then-Trace” Discipline: Each process exactly reproduces its log-specified prefix and, upon completing it, switches to free tracing. The full execution trace is collected in a single run, encompassing both the replayed prefix and the nondeterministic suffix.
- Pseudocode Realization: Algorithmic recipes define the handling of action scheduling, message delivery buffering, and trace log maintenance, supporting reproducibility and analysis at controllable levels of granularity.
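The scheduler discipline above can be sketched as a toy, single-threaded simulation (the actual instrumentation of González-Abril et al. targets Erlang-style message passing; class and method names here are illustrative):

```python
from collections import deque

class Scheduler:
    """Toy central scheduler: each process must first replay its logged
    prefix, then switches to trace mode and records further actions."""

    def __init__(self, prefix_logs):
        # prefix_logs: {pid: list of actions to replay, in order}
        self.prefix = {p: deque(log) for p, log in prefix_logs.items()}
        self.trace = {p: [] for p in prefix_logs}

    def step(self, pid, action):
        pending = self.prefix.get(pid)
        if pending:  # replay mode: enforce the next logged action
            expected = pending.popleft()
            if action != expected:
                raise RuntimeError(f"{pid} diverged: {action!r} != {expected!r}")
        # Both modes contribute to the single-run trace.
        self.trace.setdefault(pid, []).append(action)
        return action
```

A full log yields deterministic replay, a partial log yields mixed-mode operation, and an empty log yields pure tracing, matching the granularity spectrum described in the text.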
3. Data Specifications and Evaluation Metrics
Replay-based telemetry synthesis produces multi-resolution agent trajectories or execution logs:
- AgentTrek Specification: Each data point comprises JSON task metadata, sequences of screenshots, AXTree (accessibility tree) snapshots, chain-of-thought inner reasoning snippets, stepwise actions (API calls or pixel coordinates), and Playwright traces (DOM/network events). Actions and observations are organized temporally, supporting multimodal and highly granular learning (Xu et al., 2024).
- Concurrency Trace Specification: Traces are dictionary-mappings from symbolic process references to action sequences, recording spawn, send, deliver, and receive events, with full or partial logs corresponding to different replay granularity (González-Abril et al., 2021).
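Such a trace, in dictionary form, might look as follows (process names and message payloads are illustrative):

```python
# Hypothetical concurrency trace: symbolic process references mapped to
# ordered action sequences of spawn/send/deliver/receive events.
trace = {
    "P0": [("spawn", "P1"), ("send", "P1", "ping")],
    "P1": [("deliver", "ping"), ("receive", "ping"), ("send", "P0", "pong")],
}

# A partial log keeps only a prefix of each process's sequence; replaying
# it pins down the recorded interleaving up to that point, after which
# execution proceeds (and is traced) nondeterministically.
partial_log = {pid: actions[:1] for pid, actions in trace.items()}
```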
- Performance Metrics:
- Success Rate (SR): the fraction of replayed tasks whose expected postconditions are judged satisfied.
- Evaluator Accuracy: the agreement rate of the automatic (VLM-based) evaluator with human judgments of task success.
- Operation F1 (OpF1): step-level F1 between predicted and ground-truth actions.
- Cost-Efficiency (AgentTrek): $C_{\rm traj} \approx \$0.551$ per high-quality trace, where the per-1,000-tutorial cost components are $C_{\rm tag} = \$0.886$ (task tagging), $C_{\rm replay} = \$215.359$ (guided replay), and $C_{\rm eval} = \$3.104$ (evaluation); roughly \$0.55 per trace, contrasting with \$10–\$100 per human-annotated trajectory.
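As one concrete reading of OpF1 (step-level F1 over predicted versus ground-truth actions; the exact matching rule used in the benchmark may differ, so this is a hedged sketch):

```python
from collections import Counter

def op_f1(predicted, reference):
    """F1 over multisets of actions: one common way to score a predicted
    operation sequence against a ground-truth sequence."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Matching one of two actions yields F1 = 0.5:
# op_f1(["click A", "type x"], ["click A", "type y"]) -> 0.5
```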
4. Empirical Results
| Method/Model | WebArena SR | ScreenSpot Avg % | Mind2Web Cross-Task SR |
|---|---|---|---|
| GPT-4o baseline | 13.10% | 10.1 | – |
| Qwen2.5-7B w/ AgentTrek | 10.46% | – | – |
| Qwen2.5-32B w/ AgentTrek | 16.26% | – | – |
| Qwen2-VL-7B | – | 30.7 | – |
| Qwen2-VL-7B w/ AgentTrek | – | 67.4 | 40.9% |
| +AgentTrek +Mind2Web | – | – | 55.7% |
This table summarizes key success rates for various models and ablations as reported in the AgentTrek experiments.
5. Strengths, Limitations, and Future Directions
Replay-based telemetry synthesis frameworks demonstrate several strengths:
- Scalability: Fully automated pipelines from large-scale web-sourced tutorials to verified execution traces enable orders-of-magnitude data growth with minimal human effort.
- Cost Efficiency: Automated synthesis reduces data costs by approximately 20–100× relative to manual annotation (Xu et al., 2024).
- Diversity and Multimodality: Generated traces reflect realistic, semantically rich user tasks, incorporating multiple observation and action modalities.
- Controllability: Partial-log replay in concurrency supports targeted reproducibility and debugging, unifying tracing and replay as special cases (González-Abril et al., 2021).
Notable limitations include a dependency on the quality and currency of available tutorials (impacting replay success), considerable token costs for VLM-based replay (approximately 8,000 tokens per step for GPT-4o), residual model navigation errors, and scope currently limited to web environments (for agent-centric synthesis). For prefix-based tracing, the centralized scheduler can become a bottleneck and imposes additional synchronization latency.
Possible extensions involve adaptation to desktop/mobile domains via targeted tutorial harvesting, integration of real-time tutorial updating and retrieval for UI drift adaptation, active learning with evaluator feedback, hybrid policy architectures, and inclusion of closed-loop or human-in-the-loop evaluations (Xu et al., 2024).
6. Context within Program Analysis and Data Synthesis
Replay-based telemetry synthesis generalizes the dichotomy between record/replay and pure tracing found in systems research. In prefix-based tracing frameworks, program executions range from fully deterministic replay (given a complete log) to open-ended tracing (empty log), with arbitrary partial prefixes enabling mixed-mode operation. In agent trajectory synthesis, replay of instructional trajectories facilitates both imitation and scalable reward-based supervision, with guided replay providing an avenue for verifying generalization, robustness, and benchmarking in digital environments. This duality underpins broad applications in debugging, learning, and evaluation for both software systems and AI agents (Xu et al., 2024, González-Abril et al., 2021).