
Batch Prioritized Routing (BPR)

Updated 26 January 2026
  • Batch Prioritized Routing (BPR) is a class of algorithms that jointly optimize resource allocation by coordinating batch-level routing decisions in both neural inference and software-defined networks.
  • In mixture-of-experts models, BPR employs a two-phase routing approach—baseline expert selection followed by opportunistic piggybacking—to minimize distinct expert activations and reduce memory latency.
  • For SDN, BPR formulations use ILP and genetic algorithm heuristics to maximize admitted flow priorities while adhering to link capacity constraints, demonstrating strong empirical performance on moderate-scale networks.

Batch Prioritized Routing (BPR) refers to a class of algorithms for assignment and resource allocation in both large-scale neural inference and communication networks, unified by the principle of jointly optimizing route or activation choices for a batch of items in order to maximize resource efficiency under quality and priority constraints. BPR is particularly prominent in two domains: (1) mixture-of-experts (MoE) LLMs, where it is also known as Opportunistic Expert Activation or Batch-Aware Expert Routing, and (2) network flow admission and routing in software-defined networking (SDN), where it appears as batch-wise priority flow scheduling with capacity and path constraints. While differing in application, both incarnations are characterized by batch-level coordination to reduce contention and memory or bandwidth bottlenecks, typically through mathematical optimization or heuristic search.

1. Mathematical Formulation and Objectives

In MoE inference, BPR seeks to map each token in a batch of $B$ tokens to at most $k$ of $N$ available experts, while minimizing the number of distinct experts activated across the batch, subject to per-token minimum quality constraints. The formal objective is:

$$\begin{aligned} &\text{minimize} && T = \left|\bigcup_{i=1}^B S_i\right| \\ &\text{subject to} && |S_i| \leq k \ \forall\, i, \quad \mathrm{quality}_i(S_i) \geq Q_0 \ \forall\, i \end{aligned}$$

where $S_i$ is the set of experts assigned to token $i$, and $Q_0$ is a quality threshold derived from the router's softmax scores (Oncescu et al., 4 Nov 2025). This reduces the number of expert weight fetches from high-bandwidth memory into local memory, the dominant contributor to inference latency in modern MoE systems.

In SDN systems, BPR is defined with respect to a directed graph $G=(V,E,c)$ representing the network and a batch of $N$ flow requests $\mathcal{F} = \{f_1, \dots, f_N\}$, each with a source $s_i$, destination $d_i$, bandwidth demand $b_i$, and priority $p_i$. The BPR problem maximizes the total admitted priority while ensuring no link is overloaded:

$$\begin{aligned} &\text{maximize} && \sum_{i=1}^N p_i\,\alpha_i \\ &\text{subject to} && \sum_{i,m:\,\ell \in P_{i,m}} b_i\,\rho_{i,m} \leq c(\ell) \quad \forall\,\ell\in E \\ &&& \sum_{m=1}^{K_i} \rho_{i,m} = \alpha_i \quad \forall\,i \\ &&& \alpha_i \in \{0,1\},\ \rho_{i,m} \in \{0,1\} \end{aligned}$$

where each flow can either be admitted (and assigned a single pre-enumerated path $P_{i,m}$) or dropped, and $K_i$ is the number of candidate paths for flow $i$ (López et al., 2020).
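To make the formulation concrete, the following stdlib-only Python sketch brute-forces a tiny instance of the admission problem; `solve_bpr` and the data layout are illustrative assumptions, not the solver used in the paper. Each flow either picks one candidate path or is dropped, mirroring the $\alpha_i$ and $\rho_{i,m}$ variables above.

```python
from itertools import product

def solve_bpr(links, flows, paths):
    """Exhaustively solve batch priority routing on a tiny instance.

    links: dict edge -> capacity c(l)
    flows: list of (bandwidth b_i, priority p_i)
    paths: paths[i] = candidate paths for flow i, each a list of edges
    A configuration is feasible if no link exceeds its capacity."""
    best_value, best_choice = -1, None
    # Each flow picks one candidate path, or None (dropped: alpha_i = 0).
    for choice in product(*[[None] + p for p in paths]):
        load = {e: 0 for e in links}          # per-link bandwidth usage
        value, feasible = 0, True
        for (b, p), path in zip(flows, choice):
            if path is None:
                continue
            value += p                        # admitted: collect its priority
            for e in path:
                load[e] += b
                feasible &= load[e] <= links[e]
        if feasible and value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice
```

Real instances replace this exponential enumeration with an ILP solver over the same admit/path variables; the sketch only illustrates the objective and capacity constraints.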

2. BPR Algorithms in Mixture-of-Experts Models

The BPR procedure for MoE models consists of a two-phase routing selection:

Phase 1 (Baseline expert selection): For each token $i$, select the top $k_0 \leq k$ experts by router score. The baseline sets are $S_i^{\text{base}}$. Their union $S^{\text{base}} = \bigcup_{i=1}^B S_i^{\text{base}}$ forms the baseline set of expert fetches for the batch.

Phase 2 (Opportunistic piggybacking): For each token with $|S_i| < k$, assign additional experts from its remaining top-ranked experts, but only if those experts are already present in $S^{\text{base}}$ (i.e., already being loaded for tokens with higher router mass). This minimizes additional expert loads and exploits overlap between tokens.

Afterwards, router scores are renormalized over the chosen sets $S_i$ and the forward pass proceeds identically to vanilla MoE (Oncescu et al., 4 Nov 2025). The computational cost of this routine is dominated by $O(B\,N)$ in the rare worst case but $O(B\,k\,k_0)$ in practice.
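The two phases can be sketched in a few lines of NumPy; `bpr_route` and its signature are illustrative assumptions rather than the reference implementation, and the final score renormalization is omitted for brevity.

```python
import numpy as np

def bpr_route(scores: np.ndarray, k: int, k0: int):
    """Two-phase batch-prioritized routing (sketch).

    scores: (B, N) router scores per token and expert.
    k:  per-token expert fanout of the vanilla model.
    k0: baseline fanout (k0 <= k) used in phase 1.
    Returns the per-token expert sets S_i and the batch-wide active set."""
    B, N = scores.shape
    order = np.argsort(-scores, axis=1)      # experts ranked per token
    # Phase 1: each token keeps only its top-k0 experts.
    baseline = [set(order[i, :k0]) for i in range(B)]
    active = set().union(*baseline)          # experts loaded for the batch
    # Phase 2: piggyback -- fill each S_i up to k, but only with experts
    # already in the batch-wide active set (no extra memory fetches).
    assigned = []
    for i in range(B):
        S = set(baseline[i])
        for e in order[i, k0:]:              # next-best experts, in rank order
            if len(S) >= k:
                break
            if e in active:
                S.add(int(e))
        assigned.append(S)
    return assigned, active
```

With strongly overlapping tokens the active set stays close to $B\,k_0$ experts while each token still receives up to $k$ experts, which is exactly the latency lever analyzed in Section 4.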

3. BPR Algorithms for Priority Flow Routing

In SDN, exact BPR (equivalent to PFAR) is solved using an Integer Linear Programming (ILP) model after enumerating a small set of candidate paths per flow (e.g., $K_i = 5$–$10$). Each solution encodes which flows are admitted and which paths they are assigned, subject to link capacity constraints. For larger instances or stringent latency requirements, a Genetic Algorithm (GA) heuristic is used:

  • Initialization: Population of chromosomes, each encoding a configuration of flow-to-path assignments.
  • Fitness Function: Rewards sum of admitted priorities, penalizes any link overload.
  • Operators: Tournament or roulette selection; block-swap crossover at the flow granularity; per-flow mutation (re-routing or dropping flows).
  • Iteration: Elitism ensures persistence of best solutions; termination is controlled by time limits or convergence plateaus.
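The fitness step above can be sketched as follows, assuming a chromosome encodes one path index per flow (with $-1$ meaning dropped) and using a linear overload penalty; the exact penalty form in the paper is not specified here, so this is an illustrative choice.

```python
def fitness(choice, links, flows, paths, penalty=1e6):
    """GA fitness (sketch): sum of admitted priorities, minus a large
    penalty per unit of link-capacity violation.

    choice[i]: candidate-path index for flow i, or -1 for 'dropped'.
    links: dict edge -> capacity; flows: list of (bandwidth, priority);
    paths[i][m]: m-th candidate path (list of edges) for flow i."""
    load = {e: 0 for e in links}
    value = 0.0
    for i, g in enumerate(choice):
        if g < 0:                        # flow dropped by this chromosome
            continue
        b, p = flows[i]
        value += p                       # reward admitted priority
        for e in paths[i][g]:
            load[e] += b                 # accumulate link usage
    overload = sum(max(0, load[e] - cap) for e, cap in links.items())
    return value - penalty * overload
```

Penalizing overload (rather than rejecting infeasible chromosomes outright) keeps the search space connected, so crossover and mutation can pass through mildly infeasible configurations on the way to better feasible ones.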

Empirical evaluation shows that the ILP can solve instances with up to 50 nodes and $\sim$3,500 flows in seconds; the GA attains at least 95% of the optimum within a fixed time budget (e.g., 10 s) (López et al., 2020).

4. Complexity and Resource Usage

MoE Latency Model

Let $b$ denote the one-time expert load cost and $a$ the per-token compute cost. For $T$ distinct experts and $Bk$ total token-expert dispatches (fixed by batch size $B$ and per-token fanout $k$):

$$L \approx b \cdot T + a \cdot (B\,k)$$

In the memory-bound (large $b$) regime, reducing $T$ almost linearly reduces latency. Vanilla routing yields $T_0 = N\,[1 - (1 - k/N)^B]$ in expectation, while BPR bounds $T' \leq B\,k_0$ for baseline size $k_0 < k$ (Oncescu et al., 4 Nov 2025).
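These formulas can be exercised numerically with a short sketch; the helper names are illustrative, and $T_0$ assumes tokens route independently and uniformly.

```python
def expected_distinct_experts(N, k, B):
    """Vanilla top-k routing: expected number of distinct experts touched
    by a batch, T0 = N * (1 - (1 - k/N)^B), under independent routing."""
    return N * (1 - (1 - k / N) ** B)

def layer_latency(T, B, k, b, a):
    """L ~ b*T + a*B*k: one-time load cost for T distinct experts plus
    per-dispatch compute for the B*k token-expert pairs."""
    return b * T + a * B * k
```

For $N=128$, $k=8$, $B=16$ the independence model predicts roughly 82 distinct experts per layer; the lower empirical counts reported in Section 5 suggest routing is correlated across tokens in practice, which only increases the overlap BPR exploits.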

SDN Routing

ILP is exponential in $|V|$ and $|\mathcal{F}|$ in the worst case, but tractable at moderate scale with limited candidate paths and modern solvers. The GA scales linearly with population and generation counts, and is amenable to parallelization and time-budgeted operation (López et al., 2020).

5. Empirical Results and Quality Trade-Offs

MoE Inference

On Qwen3-30B-A3B (with $k=8$, $N=128$, $B=16$), BPR with $k_0=3$ reduces the mean active expert count from 48.8 to 25.1 and layer latency from 175.7 μs to 106.8 μs (39% speedup). At $k_0=5$, active experts drop to 35.1 and latency to 136.0 μs (23% reduction). For Qwen3-235B-A22B, $k_0=5$ reduces experts from 54.0 to 40.2 and latency from 119.4 μs to 101.4 μs (15% gain) (Oncescu et al., 4 Nov 2025).

$\begin{array}{|l|c|c|c|c|} \hline \mathrm{Model} & k_0 & \mathrm{Experts} & \mathrm{Latency}\ (\mu\mathrm{s}) & \mathrm{Speedup} \\ \hline \mathrm{Qwen3\text{-}30B} & 3 & 25.1 & 106.8 & 39\% \\ \mathrm{Qwen3\text{-}30B} & 5 & 35.1 & 136.0 & 23\% \\ \mathrm{Qwen3\text{-}235B} & 5 & 40.2 & 101.4 & 15\% \\ \hline \end{array}$

Accuracy on AIME24, GPQA, LiveCodeBench, and MATH500 remains within statistical error for $k_0=5$; for $k_0=3$, accuracy loss is visible but largely recovered by piggybacking relative to naive top-$k_0$ pruning (Oncescu et al., 4 Nov 2025).

SDN Routing

ILP achieves near-optimal total priority for up to 50-node, 3,500-flow topologies. The GA reaches $\geq 95\%$ of the optimum in 10 s. The table below summarizes characteristic results (López et al., 2020):

$\begin{array}{|c|c|c|c|c|} \hline F\ (\text{flows}) & \text{ILP Optimum} & \text{ILP Time (s)} & \text{GA (10 s)} & \text{GA/Opt} \\ \hline 1146 & 381{,}757 & 0.65 & 379{,}072 & 0.993 \\ 2584 & 882{,}535 & 5.0 & 864{,}960 & 0.980 \\ 3191 & 1{,}165{,}834 & 220 & 1{,}131{,}384 & 0.970 \\ 3518 & 1{,}353{,}846 & 34 & 1{,}279{,}134 & 0.945 \\ \hline \end{array}$

6. Practical Integration and Deployment

BPR in MoE models requires no weight modification or architectural change; routing decisions are altered only at inference time, and integration amounts to swapping in the BPR routing logic, optionally during the decode stage only (leaving prefill untouched). The hyperparameter $k_0$ is tuned to balance speed and accuracy. The method applies readily to prevalent serving systems (vLLM, SGLang, DeepSpeed-MoE), with attention to masking out padding tokens to avoid spurious expert activation (Oncescu et al., 4 Nov 2025).

In SDN, BPR deployment involves collecting flow requests, enumerating candidate paths offline or asynchronously, and invoking ILP or GA depending on flow/network size and latency strictness. OpenFlow rules can be derived from solution assignments. Parameters such as mutation rate, crossover rate, and population size are tuned according to operational cycles (1–10 s reconfiguration periods typical). Extensions include penalties for path length, priority-tiered admission, temporal smoothing, and blended traffic engineering objectives (López et al., 2020).

7. Extensions, Limitations, and Research Directions

BPR’s efficacy is predicated on batch-level overlap and the presence of high-cost, batch-global resource fetches. In MoE, benefits accrue in memory-bound (not compute-bound) decoding regimes with accessible router logits and expert index orderings. In networks, results depend on the quality of candidate path enumeration and the heterogeneity of flow priorities. Extensions concern multi-objective optimization (e.g., for delay), strict prioritization, and temporal smoothing. Limitations include scalability of exact ILP solutions and potential quality trade-offs at aggressive latency settings.

A plausible implication is that in both domains, batch-level, opportunistically shared resource activation offers an efficient design lever, minimizing redundancy without significant loss in per-item quality under proper hyperparameter tuning (Oncescu et al., 4 Nov 2025, López et al., 2020).
