Batch Prioritized Routing (BPR)
- Batch Prioritized Routing (BPR) is a class of algorithms that jointly optimize resource allocation by coordinating batch-level routing decisions in both neural inference and software-defined networks.
- In mixture-of-experts models, BPR employs a two-phase routing approach—baseline expert selection followed by opportunistic piggybacking—to minimize distinct expert activations and reduce memory latency.
- For SDN, BPR formulations use ILP and genetic algorithm heuristics to maximize admitted flow priorities while adhering to link capacity constraints, demonstrating strong empirical performance on moderate-scale networks.
Batch Prioritized Routing (BPR) refers to a class of algorithms for assignment and resource allocation in both large-scale neural inference and communication networks, unified by the principle of jointly optimizing route or activation choices for a batch of items in order to maximize resource efficiency under quality and priority constraints. BPR is particularly prominent in two domains: (1) mixture-of-experts (MoE) LLMs, where it is also known as Opportunistic Expert Activation or Batch-Aware Expert Routing, and (2) network flow admission and routing in software-defined networking (SDN), where it appears as batch-wise priority flow scheduling with capacity and path constraints. While differing in application, both incarnations are characterized by batch-level coordination to reduce contention and memory or bandwidth bottlenecks, typically through mathematical optimization or heuristic search.
1. Mathematical Formulation and Objectives
In MoE inference, BPR seeks to map each token $t$ in a batch of $B$ tokens to at most $k$ of $E$ available experts, while minimizing the number of distinct experts activated across the batch, subject to per-token minimum quality constraints. The formal objective is:
$\min_{\{S_t\}} \Big| \bigcup_{t=1}^{B} S_t \Big| \quad \text{subject to} \quad \sum_{e \in S_t} p_{t,e} \ge \tau, \quad |S_t| \le k \;\; \forall t,$
where $S_t$ is the set of experts assigned to token $t$, $p_{t,e}$ is the router's softmax score for expert $e$ on token $t$, and $\tau$ is a quality threshold derived from the router's softmax scores (Oncescu et al., 4 Nov 2025). This reduces the number of expert weight fetches from high-bandwidth memory to local memory, the dominant contributor to inference latency in modern MoE systems.
In SDN systems, BPR is defined with respect to a directed graph $G = (V, L)$ representing the network, with capacity $c_\ell$ on each link $\ell$, and a batch $F$ of flow requests, each flow $f$ with a source $s_f$, destination $d_f$, bandwidth demand $b_f$, and priority $w_f$. The BPR problem is maximizing the total admitted priority while ensuring no link is overloaded:
$\max \sum_{f \in F} w_f \sum_{j=1}^{n_f} x_{f,j} \quad \text{subject to} \quad \sum_{f \in F} \sum_{j:\, \ell \in P_{f,j}} b_f\, x_{f,j} \le c_\ell \;\; \forall \ell, \qquad \sum_{j=1}^{n_f} x_{f,j} \le 1 \;\; \forall f, \qquad x_{f,j} \in \{0,1\},$
where each flow $f$ can either be admitted (and assigned a single pre-enumerated path $P_{f,j}$) or dropped, and $n_f$ is the number of candidate paths for flow $f$ (López et al., 2020).
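The admitted-priority objective and link-capacity constraints above can be illustrated with a toy exact solver. This is a minimal sketch, not the paper's ILP: it brute-forces every flow-to-path assignment over a tiny hand-made instance, and all names (`solve_bpr`, the flow tuples, the link ids) are illustrative.

```python
from itertools import product

# Brute-force illustration of the BPR/PFAR objective: each flow is either
# dropped or assigned one of its pre-enumerated candidate paths; we maximize
# total admitted priority subject to per-link capacity. Real instances use an
# ILP solver; this exhaustive search only works at toy scale.

def solve_bpr(flows, capacity):
    """flows: list of (priority, demand, candidate_paths); each path is a
    tuple of link ids. capacity: dict mapping link id -> capacity."""
    best_value, best_assignment = 0, None
    # Choice None = drop the flow; otherwise an index into its candidate paths.
    choices = [[None] + list(range(len(f[2]))) for f in flows]
    for assignment in product(*choices):
        load = {l: 0 for l in capacity}
        value, feasible = 0, True
        for (prio, demand, paths), choice in zip(flows, assignment):
            if choice is None:
                continue
            value += prio
            for link in paths[choice]:
                load[link] += demand
                if load[link] > capacity[link]:
                    feasible = False
        if feasible and value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment

# Tiny example: three flows, three unit links of capacity 5.
flows = [
    (10, 4, [("a", "b"), ("c",)]),   # high priority, two candidate paths
    (6, 3, [("a",), ("b", "c")]),
    (3, 5, [("c",)]),
]
capacity = {"a": 5, "b": 5, "c": 5}
value, assignment = solve_bpr(flows, capacity)
```

Here the optimum admits the first flow on link `c` and the second on link `a` (total priority 16), dropping the third; admitting all three would overload `c`.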
2. BPR Algorithms in Mixture-of-Experts Models
The BPR procedure for MoE models consists of a two-phase routing selection:
Phase 1 (Baseline expert selection): For each token $t$, select the top $k_0$ experts by router score. The baseline sets are $S_t^{(0)} = \mathrm{top\text{-}}k_0(t)$. The union $U = \bigcup_t S_t^{(0)}$ forms the baseline set of potential expert fetches for the batch.
Phase 2 (Opportunistic piggybacking): For each token $t$, if $\sum_{e \in S_t} p_{t,e} < \tau$, assign additional experts from the remaining top-ranked experts (ranks $k_0{+}1$ through $k$), but only if those experts are already present in $U$ (i.e., being loaded for tokens with higher router mass). This minimizes additional expert loads and exploits overlap between tokens.
Afterwards, router scores are renormalized over the chosen sets $S_t$ and the forward pass proceeds identically to vanilla MoE (Oncescu et al., 4 Nov 2025). The computational overhead of the routing step itself is small: in the rare worst case it scans each token's full ranked expert list, and in practice it terminates after only a few candidates.
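The two-phase procedure can be sketched as follows. This is a pure-Python illustration under the notation above ($k_0$, $k$, $\tau$); the function name `bpr_route` and the data layout are ours, and production systems would vectorize this per layer.

```python
# Minimal sketch of BPR's two-phase expert selection for one MoE layer.
# `scores` holds each token's router softmax over E experts; k0 is the
# baseline size, k the fanout cap, tau the quality threshold.

def bpr_route(scores, k0, k, tau):
    ranked = [sorted(range(len(p)), key=lambda e: -p[e]) for p in scores]
    # Phase 1: baseline top-k0 experts per token; their union U is what
    # must be fetched from high-bandwidth memory regardless.
    assigned = [set(r[:k0]) for r in ranked]
    union = set().union(*assigned)
    # Phase 2: opportunistic piggybacking -- top up tokens below the quality
    # threshold tau, but only with experts already in U (no extra fetches).
    for t, (p, r) in enumerate(zip(scores, ranked)):
        for e in r[k0:k]:
            if sum(p[i] for i in assigned[t]) >= tau:
                break
            if e in union:
                assigned[t].add(e)
    # Renormalize router scores over each token's chosen set.
    weights = []
    for t, p in enumerate(scores):
        z = sum(p[e] for e in assigned[t])
        weights.append({e: p[e] / z for e in assigned[t]})
    return assigned, union, weights
```

For example, with three tokens, four experts, $k_0{=}1$, $k{=}2$, $\tau{=}0.7$, a token below threshold picks up a second expert only if another token already forced that expert into $U$, so the batch can end up loading fewer distinct experts than vanilla top-$k$ would.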
3. BPR Algorithms for Priority Flow Routing
In SDN, exact BPR (equivalent to PFAR) is solved using an Integer Linear Programming (ILP) model after enumerating a small set of candidate paths per flow (e.g., up to 10). Each solution encodes which flows are admitted and which paths are assigned, subject to link capacity constraints. For larger instances or stringent latency requirements, a Genetic Algorithm (GA) heuristic is used:
- Initialization: Population of chromosomes, each encoding a configuration of flow-to-path assignments.
- Fitness Function: Rewards sum of admitted priorities, penalizes any link overload.
- Operators: Tournament or roulette selection; block-swap crossover at the flow granularity; per-flow mutation (re-routing or dropping flows).
- Iteration: Elitism ensures persistence of best solutions; termination is controlled by time limits or convergence plateaus.
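The operators listed above can be combined into a compact GA sketch. The chromosome encoding (one gene per flow: a candidate-path index or `None` for dropped), the penalty weight, and all rates below are illustrative placeholders, not the paper's tuned settings.

```python
import random

def fitness(chrom, flows, capacity, penalty=1000):
    # Reward admitted priority, heavily penalize any link overload.
    load, value = {l: 0 for l in capacity}, 0
    for (prio, demand, paths), choice in zip(flows, chrom):
        if choice is None:
            continue
        value += prio
        for link in paths[choice]:
            load[link] += demand
    overload = sum(max(0, load[l] - capacity[l]) for l in capacity)
    return value - penalty * overload

def evolve(flows, capacity, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    genes = [[None] + list(range(len(f[2]))) for f in flows]
    pop = [[rng.choice(g) for g in genes] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: -fitness(c, flows, capacity))
        nxt = scored[:2]                        # elitism: keep the best two
        while len(nxt) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(scored, 3), key=lambda c: fitness(c, flows, capacity))
            p2 = max(rng.sample(scored, 3), key=lambda c: fitness(c, flows, capacity))
            cut = rng.randrange(len(flows))     # block-swap crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:              # per-flow mutation: reroute or drop
                i = rng.randrange(len(flows))
                child[i] = rng.choice(genes[i])
            nxt.append(child)
        pop = nxt
    best = max(pop, key=lambda c: fitness(c, flows, capacity))
    return best, fitness(best, flows, capacity)

# Toy instance: (priority, demand, candidate paths as tuples of link ids).
flows = [(10, 4, [("a", "b"), ("c",)]),
         (6, 3, [("a",), ("b", "c")]),
         (3, 5, [("c",)])]
capacity = {"a": 5, "b": 5, "c": 5}
best, score = evolve(flows, capacity)
```

Because overload is penalized rather than forbidden, the GA can traverse infeasible configurations during search while elitism preserves the best feasible solution found so far.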
Empirical evaluation shows that the ILP can solve instances with up to 50 nodes and 3,500 flows in seconds to a few minutes; the GA attains at least 94.5% of the optimum within a fixed time budget (e.g., 10 s) (López et al., 2020).
4. Complexity and Resource Usage
MoE Latency Model
Let $c_{\mathrm{mem}}$ denote the one-time expert load cost and $c_{\mathrm{comp}}$ the per-token compute cost. For $n$ distinct experts and $D = Bk$ total token-expert dispatches (fixed by batch size $B$ and per-token fanout $k$):
$T \approx n\, c_{\mathrm{mem}} + D\, c_{\mathrm{comp}}.$
In the memory-bound (large $c_{\mathrm{mem}}$) regime, reducing $n$ almost linearly reduces latency. Vanilla routing yields $n = |\bigcup_t \mathrm{top\text{-}}k(t)|$, while BPR bounds $n \le B k_0$ for baseline size $k_0 < k$ (Oncescu et al., 4 Nov 2025).
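A worked instance of this latency model makes the near-linear relationship concrete. The constants below ($c_{\mathrm{mem}}$, $c_{\mathrm{comp}}$, $B$, $k$) are assumed illustrative values chosen so the layer is memory-bound, not measurements from the paper.

```python
# Worked instance of T ≈ n*c_mem + D*c_comp. Constants are illustrative
# (assumed), chosen so expert loads dominate; D = B*k is fixed by the batch,
# so only n (the number of distinct experts) changes under BPR.
c_mem, c_comp = 3.0, 0.01    # microseconds per expert load / per dispatch
B, k = 8, 8                  # batch size and per-token fanout
D = B * k

def latency(n):
    return n * c_mem + D * c_comp

t_vanilla = latency(49)      # e.g., ~49 distinct experts under vanilla routing
t_bpr = latency(25)          # e.g., ~25 distinct experts under BPR
speedup = 1 - t_bpr / t_vanilla
```

With these assumed constants, halving $n$ nearly halves $T$, since the $D\,c_{\mathrm{comp}}$ term is negligible; in a compute-bound regime (large $c_{\mathrm{comp}}$) the same reduction in $n$ would buy little.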
SDN Routing
The ILP is exponential in the number of flows in the worst case, but tractable at moderate scale with limited candidate paths and modern solvers. The GA approach scales linearly with population and generation counts, and is amenable to parallelization and time-budgeted operation (López et al., 2020).
5. Empirical Results and Quality Trade-Offs
MoE Inference
On Qwen3-30B-A3B ($E = 128$ experts, $k = 8$ active per token), BPR with $k_0 = 3$ reduces the mean number of distinct experts per layer from 48.8 to 25.1 and cuts layer latency to 106.8 µs (a 39% speedup). At $k_0 = 5$, active experts drop to 35.1 and latency to 136.0 µs (a 23% reduction). For Qwen3-235B-A22B, $k_0 = 5$ reduces experts from 54.0 to 40.2 and latency to 101.4 µs (a 15% gain) (Oncescu et al., 4 Nov 2025).
$\begin{array}{|l|c|c|c|c|} \hline \mathrm{Model} & k_0 & \mathrm{Experts} & \mathrm{Latency}\ (\mu\mathrm{s}) & \mathrm{Speedup} \\ \hline \mathrm{Qwen3\text{-}30B} & 3 & 25.1 & 106.8 & 39\% \\ \mathrm{Qwen3\text{-}30B} & 5 & 35.1 & 136.0 & 23\% \\ \mathrm{Qwen3\text{-}235B} & 5 & 40.2 & 101.4 & 15\% \\ \hline \end{array}$
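Since the table reports post-BPR latency together with the relative speedup, the implied vanilla per-layer latency can be back-computed from each row; a quick consistency check (using only values from the table above) confirms the two Qwen3-30B rows agree on the same baseline.

```python
# Back-compute the implied vanilla (pre-BPR) layer latency from each row:
# vanilla = bpr_latency / (1 - speedup). All numbers are from the table.
rows = [("Qwen3-30B", 106.8, 0.39),
        ("Qwen3-30B", 136.0, 0.23),
        ("Qwen3-235B", 101.4, 0.15)]
implied = {}
for name, lat_us, speedup in rows:
    implied.setdefault(name, []).append(lat_us / (1 - speedup))
# The two Qwen3-30B rows should imply (roughly) the same vanilla latency.
gap = abs(implied["Qwen3-30B"][0] - implied["Qwen3-30B"][1])
```

Both Qwen3-30B rows imply a vanilla latency of roughly 175 µs, so the reported speedups are internally consistent.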
Accuracy on AIME24, GPQA, LiveCodeBench, and MATH500 remains within statistical error for $k_0 = 5$; for the more aggressive $k_0 = 3$, accuracy loss is visible but largely recovered with piggybacking relative to naive pruning (Oncescu et al., 4 Nov 2025).
SDN Routing
The ILP attains the optimum (over the enumerated candidate paths) for topologies with up to 50 nodes and 3,500 flows. The GA reaches 94.5–99.3% of that optimum in 10 s. The table below summarizes characteristic results (López et al., 2020):
$\begin{array}{|c|c|c|c|c|} \hline F\ (\text{flows}) & \text{ILP Optimum} & \text{ILP Time (s)} & \text{GA (10 s)} & \text{GA/Opt} \\ \hline 1146 & 381{,}757 & 0.65 & 379{,}072 & 0.993 \\ 2584 & 882{,}535 & 5.0 & 864{,}960 & 0.980 \\ 3191 & 1{,}165{,}834 & 220 & 1{,}131{,}384 & 0.970 \\ 3518 & 1{,}353{,}846 & 34 & 1{,}279{,}134 & 0.945 \\ \hline \end{array}$
6. Practical Integration and Deployment
BPR in MoE models requires no weight modification or architectural change; only the routing decisions are altered at inference time, and integration amounts to replacing the routing logic, optionally during the decode stage alone (leaving prefill untouched). The hyperparameter $k_0$ is tuned to balance speed and accuracy. The method applies readily to prevalent serving systems (vLLM, SGLang, DeepSpeed-MoE), with attention to masking out padding tokens to avoid spurious expert activations (Oncescu et al., 4 Nov 2025).
In SDN, BPR deployment involves collecting flow requests, enumerating candidate paths offline or asynchronously, and invoking ILP or GA depending on flow/network size and latency strictness. OpenFlow rules can be derived from solution assignments. Parameters such as mutation rate, crossover rate, and population size are tuned according to operational cycles (1–10 s reconfiguration periods typical). Extensions include penalties for path length, priority-tiered admission, temporal smoothing, and blended traffic engineering objectives (López et al., 2020).
7. Extensions, Limitations, and Research Directions
BPR’s efficacy is predicated on batch-level overlap and the presence of high-cost, batch-global resource fetches. In MoE, benefits accrue in memory-bound (not compute-bound) decoding regimes with accessible router logits and expert index orderings. In networks, results depend on the quality of candidate path enumeration and the heterogeneity of flow priorities. Extensions concern multi-objective optimization (e.g., for delay), strict prioritization, and temporal smoothing. Limitations include scalability of exact ILP solutions and potential quality trade-offs at aggressive latency settings.
A plausible implication is that in both domains, batch-level, opportunistically shared resource activation offers an efficient design lever, minimizing redundancy without significant loss in per-item quality under proper hyperparameter tuning (Oncescu et al., 4 Nov 2025, López et al., 2020).