TEA-Bench Benchmark Frameworks
- TEA-Bench is a suite of benchmarking frameworks that assess end-to-end system fidelity across emotional support dialogues, TEE security, table extraction, and tensor network computations.
- It employs interactive tool invocation, rigorous process-level metrics, and hardware-sensitive evaluations to capture real-world performance.
- Its comprehensive design bridges subcomponent accuracy with holistic application challenges, driving reliable AI and secure systems research.
TEA-Bench denotes several distinct, rigorously designed benchmarking frameworks, each targeting a high-value problem class—tool-enhanced emotional support dialogues, transparent execution approaches for confidential compute, table extraction from complex scientific documents, and high-performance tensor network computations. Across these domains, TEA-Bench shares a unifying principle of process-level, end-to-end evaluation that captures real-world requirements beyond technical subtask accuracy.
1. TEA-Bench: Overview and Context
The term "TEA-Bench" has been independently adopted in at least four major research lines:
- Tool-Enhanced Emotional Support Dialogue Agent Benchmark: An interactive, process-level benchmark for evaluating how LLMs use external tools (e.g., map, Wikipedia, Reddit) to ground emotional support conversations, reducing hallucination and improving actionable guidance (Sui et al., 26 Jan 2026).
- Transparent Execution Approaches Benchmark: A comprehensive framework for the experimental performance assessment of Trusted Execution Environment (TEE) solutions (Intel SGX, AMD SEV, Intel TDX), evaluating real-world application workloads under realistic adversary models (Coppolino et al., 2024).
- Table Extraction Assessment Benchmark: An end-to-end evaluation protocol for scientific table extraction from PDFs, combining classic detection and structure recognition with metrics for model calibration, robustness, and uncertainty on heterogeneous datasets (Soric et al., 20 Nov 2025).
- Quantum Red TEA Benchmark: A performance benchmark for tensor network algorithms implemented across CPU, GPU, and TPU backends, quantifying hardware and toolkit choices for variational quantum ground-state search (Jaschke et al., 2024).
Each TEA-Bench instantiation advances its field by shifting from subcomponent accuracy to holistic, system-level process fidelity, integrating multi-modal real-world context, calibrated evaluation, and generalization stress-testing.
2. Tool-Enhanced Emotional Support Dialogue: Motivation, Design, and Novelty
TEA-Bench for emotional support conversation (ESC) systems addresses a critical shortcoming of prior emotional support dialogue benchmarks, which focus narrowly on affective validation in text-only settings. These traditional approaches neither penalize hallucinated advice nor reward trustworthy, context-sensitive instrumental support. TEA-Bench introduces:
- Interactive, Multi-Turn Dialogue Simulation: Agents converse with a user simulator (action-oriented or emotion-oriented), with each turn optionally invoking any of 31 external tools via Model Context Protocol (MCP).
- Rich, Grounded Scenarios: 81 scenarios with spatiotemporal context derived via LLMs and Map APIs, verified for realism and supporting meaningful factual queries (e.g., location-specific weather).
- Process-Level Metrics: Joint evaluation of empathy, fluency, recommendation quality, user acceptance, and factual grounding at the episode level, not just single responses.
- Hallucination Detection: An explicit module flags ungrounded claims, triggering user doubt and penalizing agents that hallucinate facts (Sui et al., 26 Jan 2026).
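The per-turn agent loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the tool registry and the `pick_tool` policy are hypothetical stubs standing in for the 31 MCP tools and the agent's own tool-selection logic.

```python
# Minimal sketch of a tool-augmented dialogue turn (hypothetical names;
# the actual TEA-Bench harness invokes tools via the Model Context Protocol).
from typing import Callable, Optional

# Stub registry standing in for the benchmark's 31 external tools.
TOOLS: dict[str, Callable[[str], str]] = {
    "wikipedia": lambda q: f"[wiki summary for '{q}']",
    "weather":   lambda q: f"[forecast for '{q}']",
}

def agent_turn(user_msg: str, pick_tool: Callable[[str], Optional[str]]) -> str:
    """One dialogue turn: optionally call a tool, then reply (<= 30 words)."""
    tool_name = pick_tool(user_msg)          # agent decides when/how to use tools
    evidence = TOOLS[tool_name](user_msg) if tool_name in TOOLS else ""
    reply = f"I hear you. {evidence}".strip()
    return " ".join(reply.split()[:30])      # enforce the 30-word cap

# Trivial policy: ground weather questions, otherwise reply with empathy only.
reply = agent_turn("Will it rain in Lyon tomorrow?",
                   lambda m: "weather" if "rain" in m else None)
print(reply)
```

The coupling to user reactions (doubt on ungrounded claims) happens in the simulator, outside this loop.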
Key distinction: Unlike static datasets, TEA-Bench tightly couples agent decision-making (when/how to use tools) to user reactions and long-range trust-building.
3. Experimental Methodology and Metrics
3.1 Emotional Support Dialogue
- Agent Workflow Per Turn:
- Optionally invoke tools (e.g., Wikipedia summary, weather API, Reddit search).
- Integrate tool responses.
- Generate a concise (≤30 words), empathetic reply.
- User Modeling: Simulated users either demand practical advice (action-oriented) or require validation before accepting suggestions (emotion-oriented).
- Factuality Metrics: For each dialogue with agent turns, the hallucination rate is computed as HallucRate = N_halluc / N_claim, where N_claim is the number of turns containing factual claims and N_halluc is the number of those turns containing hallucinated claims. Rates are macro-averaged across dialogues for process-level evaluation.
- TEA Scores: Diversity, Fluency, Humanoid (human-likeness), Information, Effectiveness, each scored 0–4 and normalized.
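The macro-averaged factuality metric amounts to averaging per-dialogue rates with equal weight per dialogue; a minimal sketch (variable names are illustrative, not the benchmark's own):

```python
# Per-dialogue hallucination rate, macro-averaged across dialogues.
def hallucination_rate(n_halluc: int, n_claim: int) -> float:
    """Rate = hallucinated turns / turns containing factual claims."""
    return n_halluc / n_claim if n_claim else 0.0

def macro_average(dialogues: list[tuple[int, int]]) -> float:
    """Average per-dialogue rates, weighting each dialogue equally."""
    rates = [hallucination_rate(h, c) for h, c in dialogues]
    return sum(rates) / len(rates)

# (n_halluc, n_claim) per dialogue
print(macro_average([(1, 4), (0, 3), (2, 5)]))  # (0.25 + 0.0 + 0.4) / 3
```

Macro-averaging prevents long dialogues from dominating the score, which matches the benchmark's process-level (episode-level) framing.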
3.2 Trusted Execution Environment Benchmark
- Workloads: TensorFlow, PyTorch (CPU-intensive); Redis, Vault (memory); NGINX, NodeJS (I/O).
- Metrics: Wall-clock execution time, throughput, 95th-percentile latency, CPU/memory overhead, and normalized costs across SGX, SEV, TDX.
- Threat Model: Attacker controls OS/hypervisor; TEE hardware trusted; attestation channels assumed authentic.
- Analysis: Near-native performance for VM-based TEEs on memory/I/O workloads, moderate overhead for SGX, with strong security trade-offs (Coppolino et al., 2024).
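The overhead percentages reported for the TEE workloads reduce to a simple normalization against the native (non-TEE) baseline; a sketch of that computation (the function name is illustrative):

```python
def overhead_pct(native: float, tee: float) -> float:
    """Relative overhead of a TEE run vs. the native baseline, in percent.

    Works for any scalar cost metric measured in both settings
    (wall-clock time, p95 latency, memory footprint).
    """
    return 100.0 * (tee - native) / native

# e.g. a run taking 10.3 s under a TEE vs. 10.0 s natively -> +3%
print(f"{overhead_pct(10.0, 10.3):+.0f}%")
```

Normalizing against a CPU-matched native baseline, as the study stresses, is what makes overheads comparable across SGX, SEV, and TDX machines with different hardware.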
3.3 Table Extraction Assessment
- Datasets: PubTables-Test (biomedical, homogeneous), Table-arXiv (heterogeneous LaTeX, all ArXiv domains), Table-BRGM (geological reports, complex layouts).
- Subtasks: Table Detection (box), Table Structure Recognition (rows/cols, spans), Table Content Recognition (text).
- Metrics: Precision, recall, F1 (at multiple IoU thresholds), expected metrics integrating over IoU, Average Precision (AP), calibration measure (D-ECE), and structure/content similarity (TEDS, GriTS) (Soric et al., 20 Nov 2025).
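The IoU-thresholded detection metrics above all build on box intersection-over-union; a minimal sketch for axis-aligned boxes `(x1, y1, x2, y2)`:

```python
# Intersection-over-union between two axis-aligned boxes, the building block
# behind the thresholded precision/recall/F1 and AP figures listed above.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# A predicted table box vs. ground truth: this pair matches at an IoU
# threshold of 0.25 but not at 0.5.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(score, 3))  # 50 / 150 -> 0.333
```

Sweeping the match threshold (and integrating over it, as the "expected metrics" do) is what separates loose localization from tight, structure-ready detections.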
3.4 Quantum Red TEA Benchmark
- Algorithm: Variational ground-state energy minimization for 2D Ising model, using binary tree-tensor-networks (TTN).
- Evaluation Knobs: Backend selection, mixed precision, RG-tensor skipping, CPU threading, GPU/TPU acceleration, shape tiling, block sparsity.
- Results: Best CPU tuning achieves ~34× speedup; the GPU yields an additional 2.76× over the best CPU configuration; block-sparse symmetry gives up to 2× on CPU for bond dimensions up to 64 (Jaschke et al., 2024).
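The mixed-precision knob can be illustrated on a single tensor contraction. This is a stand-in sketch, not the benchmark's code (which runs full binary tree-tensor-network sweeps over PyTorch/NumPy/CuPy/JAX backends): it contracts one tensor pair in float64 and float32 and checks the precision loss.

```python
import numpy as np

# Illustrative precision knob: one tensor contraction in float64 vs. float32.
# float32 halves memory traffic, the dominant cost on many backends.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64, 8))
b = rng.standard_normal((64, 8, 64))

hi = np.einsum("ijk,jkl->il", a, b)                    # full precision
lo = np.einsum("ijk,jkl->il", a.astype(np.float32),
               b.astype(np.float32))                   # reduced precision

# Relative error stays small for well-conditioned contractions.
print(np.max(np.abs(hi - lo)) / np.max(np.abs(hi)) < 1e-4)
```

For variational ground-state search, such per-contraction errors are tolerable because the outer optimization loop corrects them, which is why mixed precision is a viable speedup lever.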
4. Main Results and Comparative Analysis
4.1 Emotional Support Dialogue
| Model Tier | TEA Gain (Tools) | HallucRate Drop | Tool Call Efficiency |
|---|---|---|---|
| Strong (GPT-4o, etc.) | +3–5 | 20–50% | 1–2 calls/dialogue |
| Mid (Qwen3-32B) | +1–3 | 20–50% | 3–5 calls/dialogue |
| Weak (Qwen3-8B) | Negligible | 20–50% | <1 call/dialogue |
- Strong models achieve the highest per-call reduction in hallucination and empathy improvement, reflecting skill in judicious tool selection and integration.
- Supervised fine-tuning on high-quality TEA-Dialog data increases in-distribution performance but generalizes poorly, with out-of-domain hallucinations rising beyond baseline levels (Sui et al., 26 Jan 2026).
4.2 TEE Workload Benchmark
| Workload | TDX Overhead | SEV Overhead | Gramine-SGX Overhead | Occlum-SGX Overhead |
|---|---|---|---|---|
| PyTorch | +8% | +48% | +5% | +15% |
| TensorFlow | –2% | +15% | +12% | +500% |
| Redis | +3% | +25% | +65% | +85% |
| Vault | +5% | +22% | +55% | +90% |
| NGINX | +29% | +32% | +68% | +120% |
| NodeJS | +35% | +38% | n/a | n/a |
- VM-based TEEs (TDX, SEV) deliver low overhead on memory/I/O workloads and enable "lift-and-shift" legacy application deployment.
- Process-based TEEs (SGX+Gramine/Occlum) achieve best-in-class protection for CPU-bound tasks at moderate overhead but scale poorly under heavy I/O load (Coppolino et al., 2024).
4.3 Table Extraction
- Modern detector-based methods (DETR, TATR) reach >0.9 AP for detection on homogeneous styles; on heterogeneous pages, AP drops below 0.8, with structure and content F1 rarely exceeding 0.7.
- Content recognition remains limited by OCR/token-alignment, especially for mathematical, multilingual, or borderless tables.
- Model calibration (low D-ECE) is critical for interpretable outputs; off-the-shelf heuristics and LVLMs underperform on nonstandard layouts (Soric et al., 20 Nov 2025).
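The calibration criterion can be made concrete with a generic binned expected calibration error (ECE); the D-ECE used in the paper is a detection-specific variant, but the binning idea is the same (this sketch is illustrative, not the benchmark's implementation):

```python
import numpy as np

# Generic binned expected calibration error: the weighted gap between mean
# confidence and empirical accuracy within each confidence bin.
def ece(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap   # weight by fraction of samples in bin
    return total

# Overconfident detector: a 0.95-confidence prediction that is wrong
# dominates the calibration error.
print(ece([0.92, 0.81, 0.95, 0.55], [1, 1, 0, 1]))
```

A low value means reported confidences can be read as probabilities, which is what makes extraction outputs interpretable for downstream filtering.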
4.4 Quantum Red TEA
- PyTorch backend outperforms NumPy/CuPy and JAX on tested CPUs/GPUs.
- Mixed precision, RG-skipping, and thread tuning yield up to 34× CPU speedups; A100 GPU provides ~2.7× extra over CPU.
- Block-sparse (Z2) symmetry doubles efficiency on CPU, suggesting further gains for larger bond dimensions and parallel block updates (Jaschke et al., 2024).
5. Analysis, Guidelines, and Future Directions
- Emotional Support Agents: Tool augmentation confers significant hallucination reduction and trustworthiness for high-capacity LLMs, but requires precise invocation and empathetic linguistic integration. SFT on narrow data can degrade performance under distribution shift. Recommendations emphasize process-level, scenario-rich benchmarking and robust generalization (Sui et al., 26 Jan 2026).
- TEE Workloads: VM-based TEEs offer near-native performance and easier deployment for I/O/memory bound workloads; normalizing CPU baselines is essential for fair comparison. SGX remains preferred for high-assurance compute, but scalability is limited (Coppolino et al., 2024).
- Table Extraction: End-to-end extraction remains unsolved for complex scientific documents; gains hinge on advances in calibration, OCR integration, and training on diversified, representative data (Soric et al., 20 Nov 2025).
- Quantum Tensor Networks: Hardware-aware backend selection, hybrid precision, and algorithmic shortcut exploitation are critical; best practices are provided via open APIs. Larger, block-sparse workloads and parallelization represent promising avenues (Jaschke et al., 2024).
6. Significance and Ongoing Development
TEA-Bench frameworks shift benchmarking from isolated subcomponent accuracy to end-to-end, contextually grounded, process-level rigor. This approach surfaces hidden deficiencies (e.g., hallucination, scalability collapse, overfitting to narrow data distributions), enabling more trustworthy, generalizable ML and systems research. Releases of code, datasets (e.g., TEA-Dialog), and APIs accelerate the translation of these benchmarks into community standards (Sui et al., 26 Jan 2026, Coppolino et al., 2024, Soric et al., 20 Nov 2025, Jaschke et al., 2024).
TEA-Bench benchmarks are expected to continue evolving, with anticipated focus on adversarial robustness, human-in-the-loop evaluation, longitudinal trust dynamics, and adaptation to emerging hardware and interaction modalities.