
Unsupervised Metrics for Objective-Driven Interactions

Updated 10 November 2025
  • The paper introduces unsupervised metrics based on distributional divergence, clustering, and completion detection to evaluate multi-agent and human-agent interactions without reliance on ground-truth labels.
  • Unsupervised metrics for objective-driven interactions are evaluation methods that quantify goal achievement, interaction structure, and uncertainty using mathematical constructs like Procrustes distance and response trees.
  • They offer actionable insights for robust system evaluation, skill discovery, and real-time intervention in domains such as social robotics, reinforcement learning, and language agent analysis.

Unsupervised metrics for objective-driven interactions encompass a diverse family of evaluation methods designed to assess multi-agent, agent–human, or system–environment interactions without ground-truth labels or explicit human annotation. These metrics are integral across language agent evaluation, social robotics, reinforcement learning, multiparty behavioral analysis, representation learning, and generative feature engineering. They focus on goal identification, completion, agent responsibility, interaction structure, and feature consistency, capturing properties crucial for downstream utility and system robustness.

1. Key Metric Families and Definitions

Unsupervised metrics for objective-driven interactions are deeply heterogeneous but can be grouped around several core principles and mathematical constructs:

  1. Distributional Divergence Metrics: Matching the output distribution of system behaviors to reference or empirical distributions, exemplified by the Output Distribution Matching (ODM) cost, which is minimized when the model’s predictive or generative output shares the same marginal distribution as the ground-truth outcomes (Sutskever et al., 2015). These are applicable to labeling, structured prediction, and trajectory matching.
  2. Clustering and Goal Inference: Identification of user goals or latent intents based on interaction data, leveraging embedding models and LLM-guided summarization with cluster merging governed by both statistical similarity and LLM semantic labeling (Soroka et al., 4 Nov 2025). These workflows infer interpretable clusters of objective types without labels or manual taxonomies.
  3. Task Completion Metrics: Use of fine-tuned models to probabilistically predict whether a given multi-turn conversation achieves the user’s implicit or explicit goal, operationalizing "completion" as high likelihood of an end-of-interaction marker in the generated sequence by a model trained to recognize termination (Soroka et al., 4 Nov 2025).
  4. Uncertainty Quantification: Construction of response trees through controlled expansion of high-probability branches in the LLM output distribution, providing metrics such as leaf node count (plausible divergent completions) and maximum log-probability to diagnose when the model faces epistemic uncertainty during interaction (Soroka et al., 4 Nov 2025).
  5. Interaction Structure Metrics: Quantification of agent–agent or agent–environment interaction structure using geometry-aware or information-theoretic metrics, including Procrustes distance on paired agent trajectories (Guha et al., 2020), local dependency graphs (conditional mutual information across state factors) (Wang et al., 2024), and metric-aware abstraction for state-space coverage (Park et al., 2023).
  6. Conflict and Responsibility Attribution: Metrics to quantify the intensity of "conflict potentials" in agent interactions and to partition responsibility for conflict resolution between agents—superseding kinematic or distance-based surrogates that ignore asymmetric cooperation (Wenzel et al., 16 Sep 2025).
  7. Evaluation of Representation and Pretext Mismatch: Rigorous measurement of representational mismatch and objective function misalignment between unsupervised pretext tasks and desired downstream objectives, extending protocol beyond final-model reporting to trajectory-aligned and normalized error metrics (Stuhr et al., 2020).
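
As a concrete illustration of the first family, the sketch below scores marginal alignment with a smoothed KL divergence between label frequencies. This is an assumed, simplified instantiation; the ODM cost of Sutskever et al. (2015) is defined more generally, so treat this estimator as illustrative only.

```python
from collections import Counter
import math

def odm_cost(model_outputs, reference_labels, eps=1e-9):
    """Illustrative ODM-style cost: KL divergence between the marginal
    label distribution of model outputs and the reference (ground-truth)
    marginal, with add-eps smoothing to avoid log(0)."""
    labels = sorted(set(model_outputs) | set(reference_labels))
    n_m, n_r = len(model_outputs), len(reference_labels)
    p = Counter(model_outputs)      # model marginal counts
    q = Counter(reference_labels)   # reference marginal counts
    cost = 0.0
    for y in labels:
        py = (p[y] + eps) / (n_m + eps * len(labels))
        qy = (q[y] + eps) / (n_r + eps * len(labels))
        cost += py * math.log(py / qy)
    return cost

# Identical marginals yield (near-)zero cost; mismatched marginals do not.
same = odm_cost(["a", "b", "a", "b"], ["b", "a", "b", "a"])
diff = odm_cost(["a", "a", "a", "a"], ["a", "b", "a", "b"])
```

Note that the cost only compares marginals: the two lists in the first call disagree element-wise yet score zero, which is exactly the label-free property the metric family exploits.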

2. Principal Methodologies

| Metric Class | Input Data | Objective |
|---|---|---|
| Distributional | Output sequences, labels | Marginal alignment (f-divergence) |
| Clustering/Intent | Turn-level transcripts | Goal labeling, cluster stability |
| Completion Detection | Dialogue histories | Probabilistic end-of-task |
| Uncertainty Tree | Model token log-probs | Support size, entropy |
| Geometric Structure | Trajectories | Shape/distance invariance |
| Responsibility | Positions + velocities | Agentwise “blame” partition |
| Feature Consistency | Feature matrices | Local distributional coherence |

Goal inference proceeds in three phases:

  • Unsupervised summarization of each interaction using a compact LLM,
  • Embedding and k-means overclustering (with a conservatively large $k_1$), and
  • Semantic merging of clusters based on embedding similarity and LLM judgment.

Completion labeling is achieved with a fine-tuned LLM (LLaMA3.2-8B + LoRA) that predicts whether an end-tag follows a conversation prompt; this model is trained on explicitly end-tokenized dialogues and achieves F1 near or surpassing that of much larger LLM judges (e.g., 0.94 on the WebShop task).
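
A minimal sketch of the overcluster-then-merge phases above, assuming precomputed embeddings. Plain Lloyd's k-means and a cosine-similarity merge stand in for the paper's embedding model and LLM merge judgment, and the deterministic initialization is purely illustrative:

```python
import numpy as np

def overcluster_then_merge(embeddings, k1=8, merge_threshold=0.95):
    """Sketch of phases 2-3: k-means overclustering with a conservative
    (large) k1, then merging clusters whose centroids are highly similar.
    Cosine similarity stands in for the LLM semantic-merge judgment."""
    X = np.asarray(embeddings, dtype=float)
    k1 = min(k1, len(X))
    # Deterministic spread initialization (illustrative, not k-means++).
    centers = X[np.linspace(0, len(X) - 1, k1).astype(int)].copy()
    for _ in range(100):  # Lloyd's iterations
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        new = np.array([X[assign == j].mean(0) if (assign == j).any()
                        else centers[j] for j in range(k1)])
        if np.allclose(new, centers):
            break
        centers = new
    # Merge near-duplicate centroids via union-find on cosine similarity.
    unit = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    parent = list(range(k1))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(k1):
        for j in range(i + 1, k1):
            if unit[i] @ unit[j] >= merge_threshold:
                parent[find(j)] = find(i)
    return np.array([find(a) for a in assign])
```

Overclustering first and merging second trades a cheap, stable partition for the harder problem of choosing the true cluster count up front, which is the design rationale the source describes.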

The uncertainty quantification metric utilizes response trees: for each prefix $p$, nodes branch on tokens satisfying $P(\text{token} \mid p) \ge \alpha$, and metrics (number of leaves, maximum log-probability) are derived to gauge multi-modal or uncertain agent response spaces.
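
The tree construction can be sketched as follows, with `step_logprobs` a hypothetical stand-in for a model call returning next-token log-probabilities; the threshold and depth limit are illustrative:

```python
import math

def expand_response_tree(step_logprobs, alpha=0.2, max_depth=3):
    """Sketch of the response-tree uncertainty metric: expand only
    branches whose next-token probability is >= alpha, then report the
    number of leaves and the best (maximum) total log-probability.
    `step_logprobs(prefix)` is a hypothetical stand-in for a model call
    returning {token: log-probability} for the next step."""
    leaves = []

    def expand(prefix, logp, depth):
        dist = step_logprobs(prefix)
        branches = [(t, lp) for t, lp in dist.items()
                    if math.exp(lp) >= alpha]
        if depth == max_depth or not branches:
            leaves.append((prefix, logp))
            return
        for tok, lp in branches:
            expand(prefix + (tok,), logp + lp, depth + 1)

    expand((), 0.0, 0)
    n_leaves = len(leaves)
    max_logp = max(lp for _, lp in leaves)
    return n_leaves, max_logp
```

A flat next-token distribution yields many leaves (high epistemic uncertainty); a peaked one collapses the tree to a single high-probability path.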

SkiLD’s pointwise conditional mutual information (pCMI) metric operates on factored MDPs:

$$\mathrm{pCMI}^{\,i,j}(s, a, s') = \log \frac{p(s'^{\,i} \mid s, a)}{p(s'^{\,i} \mid s^{-j}, a)}$$

Edges in the induced dependency graph correspond to factor pairs $(i, j)$ where pCMI exceeds a threshold. The skill-learning objective prioritizes inducing diverse interaction graphs (not just covering distinct states), using novelty-based rewards on graph coverage combined with mutual-information-based diversity terms. This contrasts with classical state coverage (DIAYN) by explicitly biasing the agent toward mastering interactions that involve complex, multi-factor dependencies and are compositionally relevant for downstream tasks.
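
Under the assumption of discrete, countable factors, the pCMI definition above can be estimated directly from transition samples. This toy counting estimator (not SkiLD's learned dynamics models) is a sketch:

```python
import math
from collections import Counter, defaultdict

def pcmi(transitions, i, j):
    """Empirical pCMI sketch for a factored MDP with discrete states:
    pCMI^{i,j}(s,a,s') = log p(s'_i | s, a) - log p(s'_i | s^{-j}, a),
    with both conditionals estimated by counting. `transitions` is a
    list of (s, a, s') with s and s' tuples of factor values. Unseen
    (s, a) pairs are not handled in this sketch."""
    full = defaultdict(Counter)   # (s, a)      -> counts of s'_i
    drop = defaultdict(Counter)   # (s^{-j}, a) -> counts of s'_i
    for s, a, s2 in transitions:
        s_minus_j = tuple(v for k, v in enumerate(s) if k != j)
        full[(s, a)][s2[i]] += 1
        drop[(s_minus_j, a)][s2[i]] += 1

    def score(s, a, s2):
        s_minus_j = tuple(v for k, v in enumerate(s) if k != j)
        p_full = full[(s, a)][s2[i]] / sum(full[(s, a)].values())
        p_drop = drop[(s_minus_j, a)][s2[i]] / sum(drop[(s_minus_j, a)].values())
        return math.log(p_full / p_drop)

    return score
```

In a toy two-factor system where factor 0's next value copies factor 1, dropping factor 1 from the conditioning set loses predictive power (positive pCMI), while dropping factor 0 changes nothing (pCMI of zero); thresholding these scores yields the dependency-graph edges.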

METRA introduces a Wasserstein dependency measure between skills and states, enforcing that the learned embedding $\phi(s)$ contracts the state space $\mathcal{S}$ to a compact latent space $\mathcal{Z}$ such that

$$\|\phi(s) - \phi(s')\| \le 1 \quad \forall (s, s') \text{ adjacent}.$$

The agent is trained to maximize the expected inner product $\Delta\phi_t^\top z$ over sampled directions $z$; this construct ensures scalable diversity of discovered skills while requiring only local, sample-based constraints, in contrast to global state-visitation matching.
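
A minimal sketch of the two ingredients just described: the inner-product skill reward and a soft-penalty version of the adjacency constraint. METRA enforces the constraint differently in training, so the penalty form here is an assumption for illustration:

```python
import numpy as np

def metra_reward(phi_s, phi_s_next, z):
    """METRA-style skill reward sketch: inner product of the latent
    transition Delta-phi with the sampled unit skill direction z."""
    return float((phi_s_next - phi_s) @ z)

def adjacency_penalty(phi_s, phi_s_next):
    """Soft penalty for violating ||phi(s) - phi(s')|| <= 1 on adjacent
    states (an illustrative penalty treatment of METRA's constraint)."""
    return max(0.0, float(np.linalg.norm(phi_s_next - phi_s)) - 1.0)
```

Because the constraint is local (per adjacent pair), it can be checked on sampled transitions alone, which is what makes the objective scalable relative to global state-visitation matching.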

Conflict intensity $I$ is defined by integrating the conflict potential $C(t)$, which is based on the predicted distance at closest encounter (DCE) between two agents:

$$C(t) = \max\left(0,\; 1 - \frac{\mathrm{DCE}(t)}{s_{\mathrm{ego}} + s_{\mathrm{other}}}\right).$$

Responsibility $R_a$ for agent $a$ is the normalized reduction in conflict potential due to that agent’s motion changes, computed by comparing $C(t)$ to a counterfactual in which $a$’s velocity is frozen. This rigorously distinguishes cooperative from non-cooperative behavior in navigation tasks.
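
The quantities above can be sketched under the constant-velocity assumption the text names; the counterfactual freeze and the normalization details here are simplified relative to the source:

```python
import numpy as np

def dce(p_ego, v_ego, p_other, v_other, horizon=10.0):
    """Predicted distance at closest encounter under a constant-velocity
    model, minimized over t in [0, horizon]."""
    dp, dv = p_other - p_ego, v_other - v_ego
    denom = dv @ dv
    t_star = 0.0 if denom < 1e-12 else float(np.clip(-(dp @ dv) / denom, 0.0, horizon))
    return float(np.linalg.norm(dp + t_star * dv))

def conflict_potential(p_ego, v_ego, p_other, v_other, s_ego=1.0, s_other=1.0):
    """C(t) = max(0, 1 - DCE(t) / (s_ego + s_other))."""
    return max(0.0, 1.0 - dce(p_ego, v_ego, p_other, v_other) / (s_ego + s_other))

def responsibility(p_ego, v_ego, v_ego_prev, p_other, v_other):
    """Simplified responsibility sketch: the reduction in conflict
    potential attributable to the ego agent's motion change, via a
    counterfactual that freezes the ego velocity at its previous value."""
    c_actual = conflict_potential(p_ego, v_ego, p_other, v_other)
    c_frozen = conflict_potential(p_ego, v_ego_prev, p_other, v_other)
    return c_frozen - c_actual
```

A head-on approach gives maximal potential ($C = 1$); if the ego agent swerves while the other holds course, the entire reduction in $C$ is attributed to the ego agent, which is the asymmetric-cooperation signal that distance-only surrogates miss.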

The Procrustes distance is constructed to be invariant to global rotation, translation, and agent ordering:

$$d\bigl((f_{11}, f_{12}), (f_{21}, f_{22})\bigr) = \inf_{O, c} \min\{\rho, \rho'\},$$

with a closed-form SVD-based computation, facilitating unsupervised clustering, stability measurement, and prototype mining over interaction datasets. Clustering, comparison, and barycenter computations are defined directly on this metric space, providing a robust basis for evaluating learned primitives or interaction stability.
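
A closed-form SVD sketch of the Procrustes residual for a single pair of trajectories; the full metric's min over the two agent pairings ($\rho$ vs. $\rho'$) is omitted, and the scale normalization is one convention among several:

```python
import numpy as np

def procrustes_distance(A, B):
    """Procrustes distance sketch between two paired trajectories
    (T x d arrays): the residual after centering, scale normalization,
    and the best orthogonal alignment, computed in closed form via the
    SVD. The source metric's min over the two agent pairings is omitted."""
    A = A - A.mean(0)           # translation invariance
    B = B - B.mean(0)
    A = A / np.linalg.norm(A)   # scale normalization (one convention)
    B = B / np.linalg.norm(B)
    _, s, _ = np.linalg.svd(A.T @ B)
    # min over orthogonal O of ||A - B O||_F^2 equals 2 - 2 * sum(s)
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * s.sum())))
```

A rotated and translated copy of a trajectory is at distance zero, while genuinely different shapes are not, which is what makes the metric usable for clustering and prototype mining without labels.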

3. Empirical Validation and Comparative Results

Unsupervised metrics have shown quantitative and qualitative superiority to traditional baselines where human annotation is impractical or unreliable:

  • LLM-guided completion detection with fine-tuned 8B adapters achieves F1 scores up to 0.94, matching LLM-as-a-judge (70B) (Soroka et al., 4 Nov 2025).
  • Unsupervised intent clustering is stable under repeated runs, outperforming direct LLM-prompting baselines in confusion analysis (Soroka et al., 4 Nov 2025).
  • Skill discovery with local dependency metrics results in highly sample-efficient policies, with >50% induction rates for hard multi-factor interactions, compared to <5% for state-coverage methods (Wang et al., 2024).
  • Conflict intensity and responsibility metrics discern partial from total cooperation scenarios (e.g., S2/S3 vs. S4), which speed and distance metrics cannot (Wenzel et al., 16 Sep 2025).
  • The Procrustes-based trajectory metric provides tighter clusterings and stability heatmaps than DTW-based or spline-based alternatives (Guha et al., 2020).

4. Limitations, Failure Modes, and Open Challenges

Several limitations are common across these approaches:

  • Goal clustering methods are sensitive to the choice of summarization prompt, overestimate the initial cluster count, and cannot handle multi-label objectives or ambiguous thematic/structural boundaries (Soroka et al., 4 Nov 2025).
  • Completion metrics may underperform when objectives resolve in a single turn or involve closely-coupled multi-turn follow-ups where task boundaries blur (Soroka et al., 4 Nov 2025).
  • Responsibility metrics assume reasonably accurate positions/velocities and the constancy of motion between timesteps; nonlinear or highly erratic maneuvers may violate model assumptions (Wenzel et al., 16 Sep 2025).
  • Geometric alignment metrics rely on precise synchronization or robust segmentation; inconsistent segmentation can destabilize clustering outcomes (Guha et al., 2020).
  • Most approaches depend on the reliability of intermediate learned models (embeddings, fine-tuned LLMs, or dynamics predictors), and small datasets can lead to inconsistent behavior or low-quality “end” token generation.

5. Integration into Downstream Systems and Practical Guidance

Unsupervised, objective-grounded metrics are critical for scalable, automated evaluation in settings where:

  • The full enumeration of task goals or completion requirements is infeasible.
  • Real-time monitoring and triggering for escalation or fallback require robust signals (e.g., task-uncertainty spikes, low end-probability).
  • Diversity and coverage in RL must be driven not just by state novelty but by the richness of transition structure and factor interaction graphs.
  • Attribution of agent roles, such as cooperative avoidance or responsibility, is key to regulatory compliance or user trust in human–robot interaction.
  • Representation learning is susceptible to objective function mismatch, and practitioners must quantify and mitigate representational collapse using mean or maximum objective-function-mismatch (OFM) metrics (Stuhr et al., 2020).

Pseudocode sketches and model details for all principal workflows are provided in the source works (see, e.g., LLM fine-tuning and skill learning loops in (Soroka et al., 4 Nov 2025) and (Wang et al., 2024)). Practitioners should prefer metrics that reflect structural and outcome-driven properties of interactions over purely syntactic or surface-level overlaps.

6. Future Directions and Theoretical Open Problems

Ongoing and emerging research emphasizes several promising directions:

  • Formal statistical characterization of how distributional or embedding “distance” between pretraining and evaluation domains modulates metric precision and generalization.
  • Development of scalable multi-label goal/intent clustering metrics with theoretical guarantees on purity, coverage, and annotation costs.
  • Integration of full probabilistic response trees for uncertainty-calibrated LLMs, moving beyond sample-level scoring.
  • Online deployment of unsupervised metrics for real-time intervention, escalation, or fallback in operational systems.
  • Transfer and abstraction of metrics (e.g., Procrustes, local dependency) to multi-agent, high-dimensional vision or multimodal domains, or to complex social interaction contexts.

Metrics in this class are unifying, mathematically principled, and computable directly from data streams; they are positioned to supplant heuristic or human-in-the-loop evaluation in many enterprise and safety-critical AI domains.
