
Search Policies: Design and Analysis

Updated 14 February 2026
  • Search policies are parameterized prescriptions that guide the selection of actions in a search process, balancing exploration and exploitation.
  • They are applied across domains—from tree search in LLM decoding to multi-agent path finding—using fixed, learned, and index-based methods.
  • Recent advances emphasize resource-aware designs, robust Bayesian approaches, and integration with classical heuristics for improved performance.

A search policy is a parameterized prescription that determines how to select actions in a search process, guiding exploration, exploitation, or a resource-constrained traversal of a space. Search policies are central to domains including tree search for LLM decoding, automated planning, algorithmic synthesis, robotics, multi-agent path finding, programmatic policy synthesis and control, and cost-sensitive sequential decision making. Their technical realization spans value-function-based methods, index policies, imitation learning, statistical optimization, and the integration of learned models with classical heuristics. This article surveys the core frameworks, algorithmic instantiations, and theoretical developments in the design and analysis of search policies, drawing on recent research.

1. Principles and Classes of Search Policies

Search policies encode the mapping from the current search state, history, or belief (possibly including resource status such as remaining budget) to actions, controlling the expansion, selection, and diversification of candidate solutions or trajectories. Classical search policies often manifest as:

  • Fixed rule-based strategies: e.g., best-first, greedy, random, or fixed multi-stage schedules.
  • Parameterized or learned policies: where a function (possibly neural) determines local choices or global modes as a function of search state summaries (Gomoluch et al., 2019).
  • Index policies: assigning indices (reservation values, Gittins indices) to alternatives, with action selection governed by these indices (Greminger, 2019).
  • Bayesian or stochastic meta-policies: e.g., acquisition strategies in robust Bayesian optimization or POMDP-based planning with adaptive balancing of exploration and exploitation (Garcia-Barcos et al., 2020, Heinonen et al., 2022).
  • Budget- and resource-aware policies: adapting search parameters or priorities based on cumulative resource consumption, as in Budget-Guided MCTS for fixed-token LLM decoding (Miyamoto et al., 10 Feb 2026).

Effective search policy construction thus depends critically on the problem structure (e.g., deterministic/uncertain, combinatorial/continuous), information structure (observable, partially observable), and resource constraints.
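As a minimal illustration of the first two classes above, a rule-based frontier selector can be parameterized by a single exploration probability. This is an illustrative sketch of the generic pattern, not any specific paper's implementation:

```python
import random

def epsilon_greedy_policy(frontier, heuristic, epsilon=0.1, rng=None):
    """Select the next node to expand from the frontier: greedy on the
    heuristic with probability 1 - epsilon, uniformly random otherwise."""
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        return rng.choice(frontier)          # exploration: random pick
    return min(frontier, key=heuristic)      # exploitation: best-first on h
```

Setting `epsilon = 0` recovers pure best-first search; `epsilon = 1` recovers random search, so a learned policy can interpolate between the fixed rule-based extremes by tuning this one parameter.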

2. Tree Search and Budget-Aligned Decoding Policies

Tree search policies in LLM decoding exemplify the interplay between selection/widening strategies and resource-aware scheduling. Standard approaches such as MCTS with PUCT typically treat the token budget merely as a terminal constraint, which induces pathologies (late-stage overbranching, premature stopping) under fixed-budget deployment (Miyamoto et al., 10 Feb 2026).

Budget-Guided MCTS (BG-MCTS) addresses these issues by dynamically aligning both node selection and tree widening to the remaining token budget. It introduces a sufficiency ratio ρ = 1 − C_used/B, annealing exploration bonuses and favoring breadth early when ρ is high, then transitioning to exploitation and answer-completion bonuses as ρ → 0. Node selection employs a ρ-weighted PUCT score, while the option to widen (spawn new children) is suppressed or encouraged according to a ρ-scaled variance term. Ablation confirms that exploration annealing, exploit shaping (completion bias), and controlled widening are all essential to robust performance under hard budgets; BG-MCTS consistently outperforms vanilla tree search baselines across token budgets and model families (Miyamoto et al., 10 Feb 2026).
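The budget-alignment idea can be sketched as follows. The exact bonus shaping in BG-MCTS is more involved than shown here; the constant `c` and the linear form of the completion bonus are illustrative assumptions, with only the sufficiency ratio ρ = 1 − C_used/B taken directly from the description above:

```python
import math

def sufficiency_ratio(tokens_used, budget):
    """rho = 1 - C_used / B, clipped to [0, 1]."""
    return max(0.0, 1.0 - tokens_used / budget)

def rho_weighted_puct(q, prior, n_parent, n_child, rho,
                      c=1.0, completion_bonus=0.0):
    """Illustrative rho-weighted PUCT score: the exploration bonus is
    annealed by rho (breadth early), while a completion bonus grows as
    the budget runs out (exploitation and answer completion late)."""
    explore = c * rho * prior * math.sqrt(n_parent) / (1 + n_child)
    exploit = q + (1.0 - rho) * completion_bonus
    return exploit + explore
```

At ρ = 1 (full budget remaining) the score reduces to standard PUCT; at ρ = 0 it collapses to pure value plus completion bias, so branching is suppressed exactly when spawning new children could no longer be paid for.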

3. Learned and Neural Search Policies

Learned search policies leverage data-driven or distribution-specific priors to accelerate and adapt classical search. In classical planning, a parameterized template is defined over a large family of search algorithms, with a neural-net policy mapping search-state features to parameter choices (e.g., ε-greedy probability, random-walk usage, expansion quotas) (Gomoluch et al., 2019). These policies are trained via population-based black-box optimization (e.g., the cross-entropy method), maximizing problem-dependent objectives such as plan quality averaged over sampled tasks.
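The cross-entropy method named above has a compact generic form: sample policy parameters from a Gaussian, score them with the black-box objective, and refit the distribution to the elite fraction. A minimal sketch (the objective, population size, and iteration counts are placeholders, not values from the cited work):

```python
import numpy as np

def cross_entropy_search(objective, dim, iters=30, pop=50,
                         elite_frac=0.2, seed=0):
    """Population-based black-box maximization of a policy-parameter
    vector: sample from N(mu, sigma^2), keep the top elite_frac of
    samples by objective value, refit mu and sigma to the elites."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mu + sigma * rng.standard_normal((pop, dim))
        scores = np.array([objective(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]   # best-scoring rows
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-6                 # floor avoids collapse
    return mu
```

In the planning setting, `objective` would run the parameterized search template on sampled tasks and return average plan quality, so no gradient through the search procedure is needed.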

Search policies can also guide the prioritization of node expansions or action orderings inside frameworks such as Focal Search, where learned policies reorder focal ties for speed and quality without violating bounded-suboptimality guarantees (Araneda et al., 2021). In such settings, discrepancy-based heuristics, which exploit the likelihood that a path prefix lies on an optimal path, generally yield the best empirical tradeoff between search effort and solution quality.
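The reordering step can be sketched as a sort key over the focal list; here the score is the learned policy's log-probability of the node's path prefix, which is one simple instance of the discrepancy idea (the scoring function is an assumption for illustration, not the cited paper's exact formulation):

```python
def reorder_focal(focal_nodes, policy_logprob):
    """Reorder focal-list ties by a discrepancy-style score: nodes whose
    path prefix the learned policy considers more likely to lie on an
    optimal path are expanded first. Since only the order within the
    focal list changes, the bounded-suboptimality guarantee of the
    enclosing Focal Search is unaffected."""
    return sorted(focal_nodes, key=lambda n: -policy_logprob(n))
```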

4. Index and Reservation-Based Policies in Search and Discovery

In multi-stage discovery and inspection problems, as seen in consumer search models (Greminger, 2019), optimal policies are cast in terms of reservation indices—values satisfying stopping or switching indifference conditions. The solution involves reservation values for (i) whether to inspect a partially known product, (ii) whether to buy a fully known product, and (iii) whether to discover new alternatives. The optimal policy follows a three-way rule: buy if the best considered value exceeds both the best uninspected and the discovery threshold; inspect if the best uninspected index exceeds the rest; otherwise, discover. This index-based policy reduces complex exploration/tradeoff reasoning to tractable, closed-form reservation equations, permitting efficient policy computation.
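The three-way rule reduces to comparing three numbers once the reservation indices are in hand. The sketch below assumes those indices have already been computed from their indifference conditions (which are not reproduced here):

```python
def discovery_policy(best_inspected, best_uninspected_index, discovery_index):
    """Three-way reservation rule (sketch): buy if the best fully known
    value dominates both indices; inspect if the best uninspected
    reservation index dominates; otherwise discover a new alternative."""
    top = max(best_inspected, best_uninspected_index, discovery_index)
    if top == best_inspected:
        return "buy"
    if top == best_uninspected_index:
        return "inspect"
    return "discover"
```

The tractability claim in the text is visible here: at decision time the policy is a constant-time comparison, with all of the exploration/exploitation reasoning absorbed into the precomputed reservation values.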

Related work in diagnosis as cost-sensitive sequential decision making exploits analogous AO* search policies with admissible heuristics—optimistically estimating future costs to prune suboptimal measurement and diagnosis trees (Bayer-Zubek, 2012, Bayer-Zubek et al., 2011).

5. Search Policies in Partially Observable and Multi-Agent Domains

In POMDPs and cooperative multi-agent systems, search policies must operate over belief or information states. For Bayesian olfactory search in turbulent flows, the Perseus point-based value iteration algorithm constructs near-optimal search policies by maintaining a set of α-vectors representing value functions over beliefs. Reward-shaping functions, aligned with physical metrics such as Manhattan distance, accelerate convergence. Perseus-based policies exploit full-horizon Bayesian planning, dynamically blending exploration and exploitation for superior performance versus heuristics such as infotaxis or Thompson sampling (Heinonen et al., 2022).
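Once the α-vectors are computed, executing the policy is a linear-algebra lookup: each α-vector is a linear value function over beliefs tagged with an action, and the agent acts greedily on the maximum inner product. A minimal sketch (the specific vectors and action labels are illustrative):

```python
import numpy as np

def alpha_vector_policy(belief, alpha_vectors, actions):
    """Point-based POMDP policy execution: evaluate each alpha-vector's
    linear value function at the current belief and return the action
    attached to the maximizing vector, plus the value estimate."""
    values = [float(np.dot(belief, a)) for a in alpha_vectors]
    best = int(np.argmax(values))
    return actions[best], values[best]
```

The piecewise-linear, convex value function over beliefs is exactly the upper envelope of these dot products, which is why maintaining a finite α-vector set suffices for near-optimal planning.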

Learned Belief Search (LBS) sidesteps prohibitive belief maintenance by training a fast auto-regressive neural model for approximate belief updates, enabling practical search-based policy improvement in high-dimensional POMDPs (e.g., Hanabi). LBS captures a substantial fraction of the benefit of exact search at dramatically reduced computational cost (Hu et al., 2021).

In multi-agent path finding (MAPF), search policies coordinate decentralized agents by wrapping learned local 1-step policies in collision-resolving schemes such as CS-PIBT and integrating them with full-horizon planners (e.g., LaCAM DFS). This can unlock scalable, robust decentralized MAPF even in highly congested regimes (Veerapaneni et al., 2024).

6. Policy Search over Structural and Semantic Spaces

Programmatic policy synthesis via syntax-guided search often suffers poor sample efficiency due to syntactic perturbations failing to induce semantic diversity. Recent work constructs semantic search spaces by augmenting local search neighborhoods with behaviorally distinct fragments from libraries built during prior search. The resulting β-proper semantic space increases the probability that a local move leads to a substantially new behavior, empirically yielding far better sample efficiency and generalization in real-time game domains (Moraes et al., 2024).

In differentiable AutoML, joint search policies optimize over both data augmentation and architecture spaces, parameterized by continuous relaxations of both sets of choices and trained end-to-end. Policy search here allows the pipeline to co-adapt augmentation and architecture to the task at hand, outperforming approaches that optimize these components in a decoupled fashion (Kashima et al., 2020).

7. Robustness, Resource Awareness, and Theoretical Properties

Robust search policies incorporate explicit modeling of uncertainty (input noise, partial observability, surrogate approximation error) and safety/resource concerns. Examples include robust Bayesian optimization for policy parameter search under input uncertainty using unscented transforms, nonstationary surrogate models, and stochastic meta-acquisition policies (softmax over expected improvement), all of which support provable convergence under mild regularity conditions (Garcia-Barcos et al., 2020). In resource-constrained settings, search policies must adapt their priorities and scheduling dynamically in response to residual budgets, as exemplified by BG-MCTS and related budget-aware algorithms (Miyamoto et al., 10 Feb 2026).
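The stochastic meta-acquisition mentioned above (softmax over expected improvement) can be sketched directly; the temperature parameter and candidate-set interface are illustrative assumptions, not the cited paper's exact API:

```python
import numpy as np

def softmax_acquisition(candidates, expected_improvement,
                        temperature=1.0, seed=0):
    """Stochastic meta-acquisition (sketch): instead of deterministically
    maximizing expected improvement (EI), sample the next query point
    with probability proportional to exp(EI / temperature). The
    log-sum-exp shift keeps the softmax numerically stable."""
    rng = np.random.default_rng(seed)
    ei = np.array([expected_improvement(x) for x in candidates])
    logits = ei / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```

As the temperature goes to zero this recovers the usual greedy EI maximizer, while larger temperatures inject exploration that hedges against surrogate approximation error.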

No-regret guarantees are critical for search policies trained via imitation learning, particularly for structured prediction and beam search (Negrinho et al., 2018). Surrogate losses, data-collection strategies (e.g., DAgger, stop/reset), and differentiable scoring functions allow these policies to approach the loss of the best fixed-in-hindsight map, with high-probability bounds derived from martingale concentration arguments.
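A data-collection strategy of the DAgger family reduces to a simple aggregation loop: roll out the current learner, label every visited state with the expert's action, and accumulate the labeled pairs for supervised training. A minimal sketch (the rollout and expert interfaces are placeholders):

```python
def dagger_collect(expert, learner_policy, rollout, rounds=3):
    """DAgger-style data aggregation (sketch): states are visited under
    the learner's own distribution, but labels come from the expert,
    which is what drives the no-regret analysis of the induced
    online-learning problem."""
    dataset = []
    for _ in range(rounds):
        states = rollout(learner_policy)              # learner-induced states
        dataset.extend((s, expert(s)) for s in states)  # expert labels
    return dataset
```

After each round the learner would be retrained on the aggregated dataset; that retraining step is omitted here to keep the sketch focused on the data-collection strategy itself.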


For advanced theory, empirical benchmarks, and algorithmic details, see primary sources including (Miyamoto et al., 10 Feb 2026, Gomoluch et al., 2019, Araneda et al., 2021, Negrinho et al., 2018, Greminger, 2019, Heinonen et al., 2022, Hu et al., 2021, Kashima et al., 2020, Moraes et al., 2024, Garcia-Barcos et al., 2020, Veerapaneni et al., 2024, Bayer-Zubek et al., 2011), and (Bayer-Zubek, 2012).
