SD-E²: Semantic Diversity in Exploration

Updated 1 February 2026

SD-E² is a framework that balances semantic exploration and exploitation by rewarding meaningful novelty via learned embeddings.
It leverages language models, high-dimensional metrics, and hierarchical search to guide diverse and interpretable solution discovery.
Empirical results demonstrate improved compute efficiency and accuracy across tasks such as video processing, reasoning, and novelty search.

Semantic Diversity-Exploration-Exploitation (SD-E$^2$) is a family of algorithmic frameworks designed to balance three objectives: discovering novel, semantically meaningful behaviors (exploration), maintaining broad or diverse coverage (diversity), and efficiently leveraging learned information for performant task completion (exploitation). Unlike traditional exploration/exploitation methods that focus on state-space novelty or reward maximization, SD-E$^2$ approaches prioritize diversity and novelty at the semantic or conceptual level, often operationalized via learned embeddings or language-guided signals.

1. Conceptual Foundations and Motivation

SD-E$^2$ frameworks address the limitations of classical exploration-exploitation strategies by introducing mechanisms that explicitly optimize for semantic diversity. While classical reinforcement learning or search methods may induce superficial diversity (e.g., through randomization or count-based bonuses), these do not guarantee coverage of genuinely distinct solution strategies, information sources, or behaviors relevant to the task semantics. SD-E$^2$ methods leverage high-dimensional embedding spaces, LLMs, and hierarchical or information-theoretic tools to measure and reward semantic diversity, thus enabling efficient exploration even under resource constraints or sparse reward regimes.

Motivations for developing SD-E$^2$ include:

Improving compute efficiency when exhaustive search or dense exploration is infeasible (Mishra et al., 25 Jan 2026, Yang et al., 3 Dec 2025).
Enabling adaptive, interpretable exploration that aligns with human notions of difference and novelty (Truong et al., 21 Jan 2026, Khajehabdollahi et al., 4 Sep 2025).
Overcoming the plateauing of purely local novelty-driven search, reaching new behavioral or solution regimes via semantic goal-setting (Khajehabdollahi et al., 4 Sep 2025).
Achieving fairness and reducing externalities in online learning by exploiting naturally-occurring data diversity (Raghavan et al., 2018).

2. Formal Definitions and Mechanisms

SD-E$^2$ frameworks instantiate semantic diversity in several domains via shared formal and computational principles:

Semantic Embedding and Novelty

Data points, strategies, code states, or video frames are mapped to a high-dimensional embedding space ($\mathbb{R}^d$) using models such as sentence-transformers, vision-language models (VLMs), or CLIP (Mishra et al., 25 Jan 2026, Yang et al., 3 Dec 2025, Khajehabdollahi et al., 4 Sep 2025).
Diversity is quantified either by average pairwise embedding dissimilarity, coverage over a set of distinct regions, or via information-theoretic metrics such as entropy or spread (Truong et al., 21 Jan 2026, Mishra et al., 25 Jan 2026).
For reasoning tasks, SD-E$^2$ rewards the coverage and spread of semantically distinct solution trajectories using frozen sentence-embedding models. The diversity reward combines: (i) the number of unique strategy blocks (determined via a greedy, thresholded clustering on cosine similarity), and (ii) the mean dissimilarity across the set (Mishra et al., 25 Jan 2026).
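The two-part diversity reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0.8 similarity threshold, the equal weights, and the normalization of the cluster count by the number of strategies are all assumptions, and the embeddings are assumed to be non-zero vectors produced by some frozen sentence encoder.

```python
import numpy as np

def normalize(emb):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(emb, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def greedy_clusters(emb, threshold=0.8):
    """Greedy thresholded clustering: join the first cluster whose
    representative has cosine similarity >= threshold, else open a new one."""
    unit = normalize(emb)
    reps, labels = [], []
    for i, v in enumerate(unit):
        for c, r in enumerate(reps):
            if float(v @ unit[r]) >= threshold:
                labels.append(c)
                break
        else:
            reps.append(i)
            labels.append(len(reps) - 1)
    return labels

def mean_dissimilarity(emb):
    """Average of (1 - cosine similarity) over all unordered pairs."""
    unit = normalize(emb)
    n = len(unit)
    if n < 2:
        return 0.0
    sim = unit @ unit.T
    upper = np.triu_indices(n, k=1)
    return float(np.mean(1.0 - sim[upper]))

def diversity_reward(emb, threshold=0.8, w_count=0.5, w_spread=0.5):
    """Weighted sum of (i) the fraction of strategies that form their own
    cluster and (ii) the mean pairwise dissimilarity (weights are assumed)."""
    labels = greedy_clusters(emb, threshold)
    count_term = len(set(labels)) / len(labels)
    return w_count * count_term + w_spread * mean_dissimilarity(emb)
```

Greedy assignment is order-dependent, so in practice the threshold and the ordering of strategy blocks would both need tuning against the encoder in use.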

Hierarchical Search and Semantic Anchors

In long video understanding, SD-E$^2$ uses latent language queries to dynamically identify semantic anchors (frames most representative of task-relevant concepts) and structures exploration using a hierarchical tree search that mixes anchor-guided expansion with coverage-based selection (Yang et al., 3 Dec 2025).
Video segment informativeness is scored both by direct VLM-derived intrinsic reward and similarity to semantic anchors, with the final exploration policy adaptively fusing these scores based on uncertainty.
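One way to picture the uncertainty-adaptive fusion of the two segment scores is the sketch below. It assumes that uncertainty is taken as the normalized entropy of a softmax over the VLM rewards and that the fusion is a simple convex combination; both choices, and the function names, are illustrative assumptions rather than the published formulation.

```python
import numpy as np

def normalized_entropy(scores):
    """Entropy of softmax(scores) divided by log(n), so the result is in [0, 1]."""
    s = np.asarray(scores, dtype=float)
    n = s.size
    if n < 2:
        return 0.0
    p = np.exp(s - s.max())          # subtract the max for numerical stability
    p /= p.sum()
    h = -np.sum(p * np.log(p + 1e-12))
    return float(h / np.log(n))

def fuse_scores(vlm_reward, anchor_sim):
    """Blend per-segment VLM rewards with anchor similarities.

    When the VLM rewards are nearly flat (high entropy) the anchor
    similarity dominates; when they are peaked the VLM reward dominates.
    """
    r = np.asarray(vlm_reward, dtype=float)
    a = np.asarray(anchor_sim, dtype=float)
    u = normalized_entropy(r)
    return (1.0 - u) * r + u * a
```

With flat rewards such as `[0.5, 0.5, 0.5]` the fused scores fall back to the anchor similarities, which is the intended behaviour when the VLM cannot discriminate between segments.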

Diversity-Driven Reward Shaping

For reinforcement learning over discrete trajectories, SD-E$^2$ includes the diversity score as an intrinsic (potential-based) shaping reward, preserving policy optimality while incentivizing deep semantic exploration. Heuristics such as conditional shaping and bonus clipping prevent reward hacking (Hu et al., 30 Sep 2025).
Multi-objective reward formulations balance correctness, outcome coverage, semantic diversity, and format adherence, often using z-score normalization to stabilize training under heterogeneous reward scales (Mishra et al., 25 Jan 2026).
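The z-score normalization mentioned above can be illustrated with a short sketch. The component names, the weights, and the choice to normalize within a single batch or group are assumptions made for the example; the source does not specify them.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Standardize a batch of rewards; eps guards against zero variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def combine_rewards(components, weights):
    """Sum weighted, z-scored reward components.

    components: dict mapping a (hypothetical) component name to the raw
                per-sample rewards for one batch or group.
    weights:    dict mapping the same names to scalar weights.
    """
    total = None
    for name, values in components.items():
        term = weights[name] * zscore(values)
        total = term if total is None else total + term
    return total

# Example: correctness lives in {0, 1}, diversity on a much larger raw scale.
batch = {
    "correctness": [0, 1, 1, 0],
    "diversity": [100, 300, 200, 400],
}
combined = combine_rewards(batch, {"correctness": 1.0, "diversity": 1.0})
```

Because each component is rescaled to zero mean and unit variance before weighting, the large raw diversity values do not swamp the binary correctness signal.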

Alternation of Local Expansion and Goal-Directed Expeditions

In automated discovery domains, SD-E$^2$ alternates between undirected, novelty-driven expansion in semantic space and directed expeditions where vision-language models generate explicit semantic goals, directly guiding search to unexplored regions (Khajehabdollahi et al., 4 Sep 2025).
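The alternation between expansion and expedition can be sketched as a toy loop. This is a heavily simplified illustration under stated assumptions: candidates are perturbed directly in embedding space (a real system would mutate parameters and then embed the resulting behavior), the goal is sampled at random as a stand-in for a goal proposed by a vision-language model, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def novelty(x, archive, k=3):
    """Mean distance from x to its k nearest archive members."""
    d = np.linalg.norm(archive - x, axis=1)
    k = min(k, len(d))
    return float(np.sort(d)[:k].mean())

def expansion_step(archive, rng, n_candidates=20, sigma=0.1):
    """Undirected: perturb random archive members, keep the most novel candidate."""
    parents = archive[rng.integers(len(archive), size=n_candidates)]
    cands = parents + rng.normal(0.0, sigma, size=parents.shape)
    best = max(cands, key=lambda c: novelty(c, archive))
    return np.vstack([archive, best])

def expedition_step(archive, goal, rng, n_candidates=20, sigma=0.1):
    """Directed: perturb the member closest to the goal embedding and keep
    the candidate that lands nearest the goal."""
    start = archive[np.argmin(np.linalg.norm(archive - goal, axis=1))]
    cands = start + rng.normal(0.0, sigma, size=(n_candidates, archive.shape[1]))
    best = cands[np.argmin(np.linalg.norm(cands - goal, axis=1))]
    return np.vstack([archive, best])

def run(n_rounds=10, expand_per_round=5, dim=2, seed=0):
    """Alternate several expansion steps with one goal-directed expedition."""
    rng = np.random.default_rng(seed)
    archive = np.zeros((1, dim))
    for _ in range(n_rounds):
        for _ in range(expand_per_round):
            archive = expansion_step(archive, rng)
        goal = rng.uniform(-3.0, 3.0, size=dim)  # stand-in for a VLM-proposed goal
        archive = expedition_step(archive, goal, rng)
    return archive
```

The ratio of expansion steps to expeditions is the main knob here; it controls how much of the budget goes to local novelty versus reaching for distant semantic targets.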

3. Domain-Specific Methodological Instantiations

Multiple research domains have adopted SD-E$^2$ principles with tailored methodologies:

Domain	Embedding Space	Semantic Signal	Exploit/Explore Drivers
Reasoning with SLMs	Sentence-Transformer	Strategy blocks	Correctness (exploit), diversity (explore)
Long Video Understanding	Vision-Language (VLM)	Semantic queries/anchors	VLM reward, anchor relevance, coverage
Social Media Language	Transformer embedding	Text entropy	Local/global entropy (explore/exploit)
Cellular Automata Discovery	CLIP-like VLM	Behavioral embeddings, text	Novelty search, linguistic goals
Industrial Data Selection	MLLM+sentence-transformers	Caption embeddings	Downstream utility, cluster coverage
Recommender Systems	Hyperbolic space	User/item semantic profiles	Depth = exploitation, breadth = exploration

For example, in long video agents, semantic queries extracted from natural language instructions are used to retrieve candidate frames; these frames guide temporal anchor points in a tree-based search. The node expansion explicitly enforces both exploitation of semantically informative anchors and exploration for temporal coverage (Yang et al., 3 Dec 2025). In reasoning for SLMs, each generated reasoning chain is embedded, and exploration is favored until a correct answer is achieved—at which point the model collapses to exploitation, focusing on known-good solution paths (Mishra et al., 25 Jan 2026).

4. Information-Theoretic and Uncertainty Principles

SD-E$^2$ frameworks frequently rely on entropy or information-theoretic metrics to quantify the trade-off between semantic exploration and exploitation:

Local semantic entropy, measured as the log-distance to the $k$-th nearest neighbor, indexes how novel a datum or action is with respect to the current context (exploration) (Truong et al., 21 Jan 2026).
Global semantic entropy, via the log-determinant of the sample covariance in embedding space, quantifies overall dispersion or compression of the dataset, with reduction correlating to exploitation as content clusters around shared hubs (Truong et al., 21 Jan 2026).
In tree-structured search (e.g., videos), uncertainty in VLM intrinsic rewards (measured by normalized entropy of reward distributions) modulates the weighting of semantic versus direct informativeness priors, providing stabilized decision-making (Yang et al., 3 Dec 2025).
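The local and global entropy measures listed above can be computed from an embedding matrix roughly as follows. This is a sketch under assumptions: Euclidean distance is used for the nearest-neighbor search, a small ridge term is added to the covariance so the log-determinant stays finite, and the function names and defaults are illustrative.

```python
import numpy as np

def local_entropy(emb, k=5):
    """Per-point log distance to the k-th nearest neighbour (self excluded).

    Requires more than k points. Uses a dense pairwise distance matrix,
    so it is only suitable for small samples.
    """
    x = np.asarray(emb, dtype=float)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # ignore each point's distance to itself
    kth = np.sort(d, axis=1)[:, k - 1]
    return np.log(kth + 1e-12)

def global_entropy(emb, ridge=1e-6):
    """Log-determinant of the sample covariance of the embeddings."""
    x = np.asarray(emb, dtype=float)
    cov = np.cov(x, rowvar=False) + ridge * np.eye(x.shape[1])
    _sign, logdet = np.linalg.slogdet(cov)   # stable alternative to log(det(cov))
    return float(logdet)
```

A point far from the rest of the sample receives a high local entropy, while shrinking the whole cloud toward a shared hub lowers the global entropy, matching the exploration and exploitation readings given in the text.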

Empirical studies show that leveraging semantic diversity reduces the sample complexity of learning, increases the coverage of solution space, and leads to more robust and computationally efficient models (Mishra et al., 25 Jan 2026, Hu et al., 30 Sep 2025, Yang et al., 3 Dec 2025).

5. Empirical Results and Ablation Analyses

SD-E$^2$-based algorithms demonstrate significant gains in multiple application settings:

Reasoning Under Token Budgets: SD-E$^2$-optimized SLMs on GSM8K achieve 82.03% accuracy (Qwen2.5-3B), surpassing GRPO-CFL and GRPO-CFEE baselines by +5.2 and +1.5 percentage points, respectively. The model discovers on average 9.8 semantically distinct strategies per question and achieves order-of-magnitude improvements in MedMCQA and AIME tasks (Mishra et al., 25 Jan 2026).
Video Understanding: The EEA agent utilizing SD-E$^2$ principles attains 75.6% accuracy using 6.1 frames on EgoSchema, and 50.8% using 14.2 frames on LVBench, outperforming baseline sampling approaches in both accuracy and computational efficiency (Yang et al., 3 Dec 2025).
Novelty Search in Cellular Automata: SD-E$^2$ alternation between expansion and expedition uncovers more semantically distinct patterns; the genealogical influence of expedition-born solutions reaches approximately 42.2%, compared with approximately 4.64% for a random GA (Khajehabdollahi et al., 4 Sep 2025).
Language and Circadian Modulation: Semantic exploration and exploitation exhibit robust circadian rhythms across countries, with early-morning peaks in local entropy and later peaks in global semantic diversity, supporting an information-theoretic model of daily linguistic and cognitive variation (Truong et al., 21 Jan 2026).

Ablation studies indicate that every SD-E$^2$ component is necessary. In video agents, removing semantic-guided expansion, uncertainty-aware fusion, or dynamic query update each results in a marked decline in both efficiency and accuracy (Yang et al., 3 Dec 2025). In reasoning models, disabling diversity rewards or exploration-completion logic collapses both solution quality and the number of distinct strategies (Mishra et al., 25 Jan 2026, Hu et al., 30 Sep 2025).

6. Algorithmic, Theoretical, and Practical Considerations

Several key properties and considerations emerge in the design and implementation of SD-E$^2$ systems:

Potential-Based Reward Shaping: Guarantees policy invariance while rigorously incentivizing semantic exploration (Hu et al., 30 Sep 2025).
Hierarchical Control: Tree-based or structured search (e.g., in recommender systems or long-video tasks) enables controllable trade-offs between focused exploitation and broad exploration (Ma et al., 2024, Yang et al., 3 Dec 2025).
Batch and Group Normalization: Utilizing batchwise z-score reward normalization in multi-objective optimization prevents domination by any single reward source and stabilizes training (Mishra et al., 25 Jan 2026).
Architectural Adaptation: Cognitive adaptation—dynamically gating reasoning structure as a function of accumulated semantic novelty—contrasts with per-token compute scaling in conventional LLM optimization (Mishra et al., 25 Jan 2026).
Dimensional Alignment: Use of learned, frozen, or hyperbolic embedding spaces requires careful cross-modal or cross-task alignment to faithfully represent semantic relations (Ma et al., 2024, Khajehabdollahi et al., 4 Sep 2025).
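The policy-invariance property of potential-based shaping listed above follows from a telescoping argument. As a sketch, take a potential $\Phi$ over partial trajectories (for example, a diversity score of the strategies produced so far, which is an illustrative choice rather than a definition from the source) and a discount factor $\gamma$. The shaping term and its discounted sum over an episode of length $T$ are:

$$
F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t), \qquad
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1}) = \gamma^{T}\,\Phi(s_T) - \Phi(s_0).
$$

The total shaping bonus depends only on the endpoints of the trajectory. Provided $\Phi$ is zero at terminal states (or in the infinite-horizon discounted limit, where $\gamma^{T}\Phi(s_T)$ vanishes for bounded $\Phi$), every policy's return from a given start state shifts by the same amount $-\Phi(s_0)$, so the ranking of policies is unchanged.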

Practical limitations include dependency on the geometry and representational fidelity of frozen encoders, brittleness of parsing for strategy block extraction, and the potential for spurious semantic novelty (reward hacking) if the diversity metric is not properly bounded or regularized (Mishra et al., 25 Jan 2026, Hu et al., 30 Sep 2025).

7. Broader Significance and Future Directions

SD-E$^2$ provides a principled, information-theoretic approach to balancing efficiency and coverage in discovery, reasoning, and adaptive learning. Empirical and theoretical results support its role in:

Improving compute-constrained performance via meaningfully diverse exploration.
Enabling interpretable, human-aligned search trajectories in high-dimensional or open-ended domains.
Reducing learning externalities and unfairness by exploiting natural data diversity, rendering explicit forced exploration less necessary in sufficiently variable environments (Raghavan et al., 2018).
Bridging cognitive neuroscience, language modeling, and algorithmic discovery by connecting circadian modulations, dopaminergic exploration, and semantic coverage (Truong et al., 21 Jan 2026).

A plausible implication is that future SD-E$^2$ frameworks may incorporate meta-learned or adaptive semantic encoding spaces, robust parsing or abstraction discovery, and hybrid architectures that integrate sequence-level semantic signals with token-level control. Extending SD-E$^2$ to multilingual settings, code synthesis, and human-in-the-loop applications offers promising directions for expanding the scope and impact of semantic diversity-aware intelligent systems.