Batch Prioritized Routing (BPR)
- Batch Prioritized Routing (BPR) is a class of algorithms that jointly optimize resource allocation by coordinating batch-level routing decisions in both neural inference and software-defined networks.
- In mixture-of-experts models, BPR employs a two-phase routing approach—baseline expert selection followed by opportunistic piggybacking—to minimize distinct expert activations and reduce memory latency.
- For SDN, BPR formulations use ILP and genetic algorithm heuristics to maximize admitted flow priorities while adhering to link capacity constraints, demonstrating strong empirical performance on moderate-scale networks.
Batch Prioritized Routing (BPR) refers to a class of algorithms for assignment and resource allocation in both large-scale neural inference and communication networks, unified by the principle of jointly optimizing route or activation choices for a batch of items in order to maximize resource efficiency under quality and priority constraints. BPR is particularly prominent in two domains: (1) mixture-of-experts (MoE) LLMs, where it is also known as Opportunistic Expert Activation or Batch-Aware Expert Routing, and (2) network flow admission and routing in software-defined networking (SDN), where it appears as batch-wise priority flow scheduling with capacity and path constraints. While differing in application, both incarnations are characterized by batch-level coordination to reduce contention and memory or bandwidth bottlenecks, typically through mathematical optimization or heuristic search.
1. Mathematical Formulation and Objectives
In MoE inference, BPR seeks to map each token $t$ in a batch of $B$ tokens to at most $k$ of $E$ available experts, while minimizing the number of distinct experts activated across the batch, subject to per-token minimum quality constraints. The formal objective is:
$\min_{\{S_t\}} \Big| \bigcup_{t=1}^{B} S_t \Big| \quad \text{subject to} \quad \sum_{e \in S_t} p_{t,e} \ge \tau, \quad |S_t| \le k \;\; \forall t,$
where $S_t$ is the set of experts assigned to token $t$, $p_{t,e}$ is the router's softmax score for expert $e$ on token $t$, and $\tau$ is a quality threshold derived from the router's softmax scores (Oncescu et al., 4 Nov 2025). This reduces the number of expert weight fetches from high-bandwidth memory to local memory, the dominant contributor to inference latency in modern MoE systems.
In SDN systems, BPR is defined with respect to a directed graph $G = (V, L)$ representing the network, with capacity $c_\ell$ on each link $\ell$, and a batch $F$ of flow requests, each flow $f$ with a source $s_f$, destination $d_f$, bandwidth demand $b_f$, and priority $w_f$. The BPR problem is maximizing the total admitted priority while ensuring no link is overloaded:
$\max \sum_{f \in F} w_f \sum_{j=1}^{n_f} x_{f,j} \quad \text{subject to} \quad \sum_{f \in F} \sum_{j:\, \ell \in P_{f,j}} b_f\, x_{f,j} \le c_\ell \;\; \forall \ell, \qquad \sum_{j=1}^{n_f} x_{f,j} \le 1 \;\; \forall f, \qquad x_{f,j} \in \{0,1\},$
where each flow $f$ can either be admitted (and assigned a single pre-enumerated path $P_{f,j}$) or dropped, and $n_f$ is the number of candidate paths for flow $f$ (López et al., 2020).
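The admitted-priority objective and link-capacity constraints above can be illustrated with a toy exact solver. This is a minimal sketch, not the paper's ILP: it brute-forces every flow-to-path assignment over a tiny hand-made instance, and all names (`solve_bpr`, the flow tuples, the link ids) are illustrative.

```python
from itertools import product

# Brute-force illustration of the BPR/PFAR objective: each flow is either
# dropped or assigned one of its pre-enumerated candidate paths; we maximize
# total admitted priority subject to per-link capacity. Real instances use an
# ILP solver; this exhaustive search only works at toy scale.

def solve_bpr(flows, capacity):
    """flows: list of (priority, demand, candidate_paths); each path is a
    tuple of link ids. capacity: dict mapping link id -> capacity."""
    best_value, best_assignment = 0, None
    # Choice None = drop the flow; otherwise an index into its candidate paths.
    choices = [[None] + list(range(len(f[2]))) for f in flows]
    for assignment in product(*choices):
        load = {l: 0 for l in capacity}
        value, feasible = 0, True
        for (prio, demand, paths), choice in zip(flows, assignment):
            if choice is None:
                continue
            value += prio
            for link in paths[choice]:
                load[link] += demand
                if load[link] > capacity[link]:
                    feasible = False
        if feasible and value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment

# Tiny example: three flows, three unit links of capacity 5.
flows = [
    (10, 4, [("a", "b"), ("c",)]),   # high priority, two candidate paths
    (6, 3, [("a",), ("b", "c")]),
    (3, 5, [("c",)]),
]
capacity = {"a": 5, "b": 5, "c": 5}
value, assignment = solve_bpr(flows, capacity)
```

Here the optimum admits the first flow on link `c` and the second on link `a` (total priority 16), dropping the third; admitting all three would overload `c`.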
2. BPR Algorithms in Mixture-of-Experts Models
The BPR procedure for MoE models consists of a two-phase routing selection:
Phase 1 (Baseline expert selection): For each token $t$, select the top $k_0$ experts by router score. The baseline sets are $S_t^{(0)} = \mathrm{top\text{-}}k_0(t)$. The union $U = \bigcup_t S_t^{(0)}$ forms the baseline set of potential expert fetches for the batch.
Phase 2 (Opportunistic piggybacking): For each token $t$, if $\sum_{e \in S_t} p_{t,e} < \tau$, assign additional experts from the remaining top-ranked experts (ranks $k_0{+}1$ through $k$), but only if those experts are already present in $U$ (i.e., being loaded for tokens with higher router mass). This minimizes additional expert loads and exploits overlap between tokens.
Afterwards, router scores are renormalized over the chosen sets $S_t$ and the forward pass proceeds identically to vanilla MoE (Oncescu et al., 4 Nov 2025). The computational overhead of the routing step itself is small: in the rare worst case it scans each token's full ranked expert list, and in practice it terminates after only a few candidates.
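The two-phase procedure can be sketched as follows. This is a pure-Python illustration under the notation above ($k_0$, $k$, $\tau$); the function name `bpr_route` and the data layout are ours, and production systems would vectorize this per layer.

```python
# Minimal sketch of BPR's two-phase expert selection for one MoE layer.
# `scores` holds each token's router softmax over E experts; k0 is the
# baseline size, k the fanout cap, tau the quality threshold.

def bpr_route(scores, k0, k, tau):
    ranked = [sorted(range(len(p)), key=lambda e: -p[e]) for p in scores]
    # Phase 1: baseline top-k0 experts per token; their union U is what
    # must be fetched from high-bandwidth memory regardless.
    assigned = [set(r[:k0]) for r in ranked]
    union = set().union(*assigned)
    # Phase 2: opportunistic piggybacking -- top up tokens below the quality
    # threshold tau, but only with experts already in U (no extra fetches).
    for t, (p, r) in enumerate(zip(scores, ranked)):
        for e in r[k0:k]:
            if sum(p[i] for i in assigned[t]) >= tau:
                break
            if e in union:
                assigned[t].add(e)
    # Renormalize router scores over each token's chosen set.
    weights = []
    for t, p in enumerate(scores):
        z = sum(p[e] for e in assigned[t])
        weights.append({e: p[e] / z for e in assigned[t]})
    return assigned, union, weights
```

For example, with three tokens, four experts, $k_0{=}1$, $k{=}2$, $\tau{=}0.7$, a token below threshold picks up a second expert only if another token already forced that expert into $U$, so the batch can end up loading fewer distinct experts than vanilla top-$k$ would.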
3. BPR Algorithms for Priority Flow Routing
In SDN, exact BPR (equivalent to PFAR) is solved using an Integer Linear Programming (ILP) model after enumerating a small set of candidate paths per flow (e.g., up to 10). Each solution encodes which flows are admitted and which paths are assigned, subject to link capacity constraints. For larger instances or stringent latency requirements, a Genetic Algorithm (GA) heuristic is used:
- Initialization: Population of chromosomes, each encoding a configuration of flow-to-path assignments.
- Fitness Function: Rewards sum of admitted priorities, penalizes any link overload.
- Operators: Tournament or roulette selection; block-swap crossover at the flow granularity; per-flow mutation (re-routing or dropping flows).
- Iteration: Elitism ensures persistence of best solutions; termination is controlled by time limits or convergence plateaus.
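The operators listed above can be combined into a compact GA sketch. The chromosome encoding (one gene per flow: a candidate-path index or `None` for dropped), the penalty weight, and all rates below are illustrative placeholders, not the paper's tuned settings.

```python
import random

def fitness(chrom, flows, capacity, penalty=1000):
    # Reward admitted priority, heavily penalize any link overload.
    load, value = {l: 0 for l in capacity}, 0
    for (prio, demand, paths), choice in zip(flows, chrom):
        if choice is None:
            continue
        value += prio
        for link in paths[choice]:
            load[link] += demand
    overload = sum(max(0, load[l] - capacity[l]) for l in capacity)
    return value - penalty * overload

def evolve(flows, capacity, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    genes = [[None] + list(range(len(f[2]))) for f in flows]
    pop = [[rng.choice(g) for g in genes] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda c: -fitness(c, flows, capacity))
        nxt = scored[:2]                        # elitism: keep the best two
        while len(nxt) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(scored, 3), key=lambda c: fitness(c, flows, capacity))
            p2 = max(rng.sample(scored, 3), key=lambda c: fitness(c, flows, capacity))
            cut = rng.randrange(len(flows))     # block-swap crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:              # per-flow mutation: reroute or drop
                i = rng.randrange(len(flows))
                child[i] = rng.choice(genes[i])
            nxt.append(child)
        pop = nxt
    best = max(pop, key=lambda c: fitness(c, flows, capacity))
    return best, fitness(best, flows, capacity)

# Toy instance: (priority, demand, candidate paths as tuples of link ids).
flows = [(10, 4, [("a", "b"), ("c",)]),
         (6, 3, [("a",), ("b", "c")]),
         (3, 5, [("c",)])]
capacity = {"a": 5, "b": 5, "c": 5}
best, score = evolve(flows, capacity)
```

Because overload is penalized rather than forbidden, the GA can traverse infeasible configurations during search while elitism preserves the best feasible solution found so far.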
Empirical evaluation shows that the ILP can solve instances with up to 50 nodes and 3,500 flows in seconds to a few minutes; the GA attains at least 94.5% of the optimum within a fixed time budget (e.g., 10 s) (López et al., 2020).
4. Complexity and Resource Usage
MoE Latency Model
Let $c_{\mathrm{mem}}$ denote the one-time expert load cost and $c_{\mathrm{comp}}$ the per-token compute cost. For $n$ distinct experts and $D = Bk$ total token-expert dispatches (fixed by batch size $B$ and per-token fanout $k$):
$T \approx n\, c_{\mathrm{mem}} + D\, c_{\mathrm{comp}}.$
In the memory-bound (large $c_{\mathrm{mem}}$) regime, reducing $n$ almost linearly reduces latency. Vanilla routing yields $n = |\bigcup_t \mathrm{top\text{-}}k(t)|$, while BPR bounds $n \le B k_0$ for baseline size $k_0 < k$ (Oncescu et al., 4 Nov 2025).
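A worked instance of this latency model makes the near-linear relationship concrete. The constants below ($c_{\mathrm{mem}}$, $c_{\mathrm{comp}}$, $B$, $k$) are assumed illustrative values chosen so the layer is memory-bound, not measurements from the paper.

```python
# Worked instance of T ≈ n*c_mem + D*c_comp. Constants are illustrative
# (assumed), chosen so expert loads dominate; D = B*k is fixed by the batch,
# so only n (the number of distinct experts) changes under BPR.
c_mem, c_comp = 3.0, 0.01    # microseconds per expert load / per dispatch
B, k = 8, 8                  # batch size and per-token fanout
D = B * k

def latency(n):
    return n * c_mem + D * c_comp

t_vanilla = latency(49)      # e.g., ~49 distinct experts under vanilla routing
t_bpr = latency(25)          # e.g., ~25 distinct experts under BPR
speedup = 1 - t_bpr / t_vanilla
```

With these assumed constants, halving $n$ nearly halves $T$, since the $D\,c_{\mathrm{comp}}$ term is negligible; in a compute-bound regime (large $c_{\mathrm{comp}}$) the same reduction in $n$ would buy little.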
SDN Routing
The ILP is exponential in the number of flows in the worst case, but tractable at moderate scale with limited candidate paths and modern solvers. The GA approach scales linearly with population and generation counts, and is amenable to parallelization and time-budgeted operation (López et al., 2020).
5. Empirical Results and Quality Trade-Offs
MoE Inference
On Qwen3-30B-A3B ($E = 128$ experts, $k = 8$ active per token), BPR with $k_0 = 3$ reduces the mean number of distinct experts per layer from 48.8 to 25.1 and cuts layer latency to 106.8 µs (a 39% speedup). At $k_0 = 5$, active experts drop to 35.1 and latency to 136.0 µs (a 23% reduction). For Qwen3-235B-A22B, $k_0 = 5$ reduces experts from 54.0 to 40.2 and latency to 101.4 µs (a 15% gain) (Oncescu et al., 4 Nov 2025).
$\begin{array}{|l|c|c|c|c|} \hline \mathrm{Model} & k_0 & \mathrm{Experts} & \mathrm{Latency}\ (\mu\mathrm{s}) & \mathrm{Speedup} \\ \hline \mathrm{Qwen3\text{-}30B} & 3 & 25.1 & 106.8 & 39\% \\ \mathrm{Qwen3\text{-}30B} & 5 & 35.1 & 136.0 & 23\% \\ \mathrm{Qwen3\text{-}235B} & 5 & 40.2 & 101.4 & 15\% \\ \hline \end{array}$
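Since the table reports post-BPR latency together with the relative speedup, the implied vanilla per-layer latency can be back-computed from each row; a quick consistency check (using only values from the table above) confirms the two Qwen3-30B rows agree on the same baseline.

```python
# Back-compute the implied vanilla (pre-BPR) layer latency from each row:
# vanilla = bpr_latency / (1 - speedup). All numbers are from the table.
rows = [("Qwen3-30B", 106.8, 0.39),
        ("Qwen3-30B", 136.0, 0.23),
        ("Qwen3-235B", 101.4, 0.15)]
implied = {}
for name, lat_us, speedup in rows:
    implied.setdefault(name, []).append(lat_us / (1 - speedup))
# The two Qwen3-30B rows should imply (roughly) the same vanilla latency.
gap = abs(implied["Qwen3-30B"][0] - implied["Qwen3-30B"][1])
```

Both Qwen3-30B rows imply a vanilla latency of roughly 175 µs, so the reported speedups are internally consistent.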
Accuracy on AIME24, GPQA, LiveCodeBench, and MATH500 remains within statistical error for $k_0 = 5$; for the more aggressive $k_0 = 3$, accuracy loss is visible but largely recovered with piggybacking relative to naive pruning (Oncescu et al., 4 Nov 2025).
SDN Routing
The ILP attains the optimum (over the enumerated candidate paths) for topologies with up to 50 nodes and 3,500 flows. The GA reaches 94.5–99.3% of that optimum in 10 s. The table below summarizes characteristic results (López et al., 2020):
$\begin{array}{|c|c|c|c|c|} \hline F\ (\text{flows}) & \text{ILP Optimum} & \text{ILP Time (s)} & \text{GA (10 s)} & \text{GA/Opt} \\ \hline 1146 & 381{,}757 & 0.65 & 379{,}072 & 0.993 \\ 2584 & 882{,}535 & 5.0 & 864{,}960 & 0.980 \\ 3191 & 1{,}165{,}834 & 220 & 1{,}131{,}384 & 0.970 \\ 3518 & 1{,}353{,}846 & 34 & 1{,}279{,}134 & 0.945 \\ \hline \end{array}$
6. Practical Integration and Deployment
BPR in MoE models requires no weight modification or architectural change; only the routing decisions are altered at inference time, and integration amounts to replacing the routing logic, optionally during the decode stage alone (leaving prefill untouched). The hyperparameter $k_0$ is tuned to balance speed and accuracy. The method applies readily to prevalent serving systems (vLLM, SGLang, DeepSpeed-MoE), with attention to masking out padding tokens to avoid spurious expert activations (Oncescu et al., 4 Nov 2025).
In SDN, BPR deployment involves collecting flow requests, enumerating candidate paths offline or asynchronously, and invoking ILP or GA depending on flow/network size and latency strictness. OpenFlow rules can be derived from solution assignments. Parameters such as mutation rate, crossover rate, and population size are tuned according to operational cycles (1–10 s reconfiguration periods typical). Extensions include penalties for path length, priority-tiered admission, temporal smoothing, and blended traffic engineering objectives (López et al., 2020).
7. Extensions, Limitations, and Research Directions
BPR’s efficacy is predicated on batch-level overlap and the presence of high-cost, batch-global resource fetches. In MoE, benefits accrue in memory-bound (not compute-bound) decoding regimes with accessible router logits and expert index orderings. In networks, results depend on the quality of candidate path enumeration and the heterogeneity of flow priorities. Extensions concern multi-objective optimization (e.g., for delay), strict prioritization, and temporal smoothing. Limitations include scalability of exact ILP solutions and potential quality trade-offs at aggressive latency settings.
A plausible implication is that in both domains, batch-level, opportunistically shared resource activation offers an efficient design lever, minimizing redundancy without significant loss in per-item quality under proper hyperparameter tuning (Oncescu et al., 4 Nov 2025, López et al., 2020).