MARL-Optimized Reviewer Assignment
- The paper introduces a Constrained Multi-Agent Reinforcement Learning framework that optimizes reviewer assignments by balancing timeliness, review quality, and group fairness.
- MARL-Optimized Reviewer Assignment models the process as a stochastic multi-agent game, incorporating dynamic states like reviewer load, historical lateness, decline rates, and topic distances.
- The approach combines rigorous reward design, trust-region policy updates, and an ILP-based runtime mechanism to ensure efficient, fair, and computationally tractable assignments.
MARL-Optimized Reviewer Assignment is a computational framework for peer review assignment based on Constrained Multi-Agent Reinforcement Learning (MARL), designed to address problems of reviewer fatigue, fairness, and review quality in the peer review ecosystem. It models the assignment process as a stochastic multi-agent game, where agents (reviewers) interact with a dynamically evolving system representing real-world constraints and objectives. This method seeks not only to optimize the matching of papers to reviewers, but also to enforce explicit constraints on timeliness, load-balance, and group equity, while adapting to historical reviewer behaviors and institutional demands (Farooq et al., 27 Jan 2026).
1. Formal Multi-Agent Model for Reviewer Assignment
In this formulation, the reviewer assignment subproblem is codified as a constrained stochastic game $\mathcal{G} = \langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}, P, r, \{c_k\} \rangle$:
- Agents: Each reviewer $i \in \mathcal{N} = \{1, \dots, N\}$ is an agent.
- State Space: At each assignment epoch $t$, the state $s_t$ aggregates reviewer-level features:
  - $\ell_i(t)$: current uncompleted load for reviewer $i$
  - $\bar{d}_i$: historical mean lateness of reviewer $i$
  - $\rho_i$: decline rate of reviewer $i$
  - $\delta_{ij}$: topic distance between reviewer $i$ and paper $j$, with $j \in \{1, \dots, M\}$ for papers.
- Joint Action: Assignment matrix $A_t \in \{0,1\}^{N \times M}$ with $A_{ij}(t) = 1$ iff reviewer $i$ is assigned paper $j$ at epoch $t$.
- Constraints:
  - Exactly $k$ reviewers per paper ($\sum_i A_{ij}(t) = k$ for all $j$)
  - Load bound per reviewer ($\sum_j A_{ij}(t) \le L_{\max}$ for all $i$)
  - Conflict-of-interest exclusion ($A_{ij}(t) = 0$ if $(i, j)$ is a COI pair)
- Transition Kernel: $P(s_{t+1} \mid s_t, A_t)$ is updated by tracking assignment completion, lateness, and declines.
- Reward: The vector reward per epoch incorporates timeliness, review quality, and fairness penalties. The global instantaneous reward is:

$$r_t = \alpha \sum_{i,j} A_{ij}(t)\, T_{ij} + \beta \sum_{i,j} A_{ij}(t)\, Q_{ij} - \lambda F_t$$

  where:
  - $T_{ij}$: timeliness indicator for reviewer $i$ on paper $j$
  - $Q_{ij}$: review specificity/quality score
  - $F_t$: group fairness penalty
- Cost Functions: Quantities $c_k(s_t, A_t)$ (e.g., load balance, group imbalance) that serve as optimization constraints.
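The joint-action constraints above reduce to a simple feasibility check over the assignment matrix. A minimal sketch in Python (field and function names are illustrative, not identifiers from the paper):

```python
def feasible(A, k, L_max, coi):
    """Check a joint action A (N x M 0/1 matrix, as nested lists) against
    the hard constraints: exactly k reviewers per paper, at most L_max
    papers per reviewer, and no conflict-of-interest pairs."""
    N, M = len(A), len(A[0])
    ok_papers = all(sum(A[i][j] for i in range(N)) == k for j in range(M))
    ok_load = all(sum(A[i]) <= L_max for i in range(N))
    ok_coi = all(A[i][j] == 0 for (i, j) in coi)
    return ok_papers and ok_load and ok_coi
```

In deployment these same checks reappear as the ILP constraints at assignment time; here they only validate a candidate joint action.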
2. Reward Design and Constraint Specification
The multi-objective reward is constructed to align incentives and enforce operational constraints:
- Timeliness: $T_{ij} = 1$ if reviewer $i$ completes the review of paper $j$ on time, else $0$.
- Review Quality: $Q_{ij} \in [0, 1]$, typically a normalized specificity or meta-review consistency score.
- Fairness Penalty: For a set of demographic groups $\mathcal{G}$, the load of group $g$ is $\ell_g = \sum_{i \in g} \sum_j A_{ij}(t)$; the group fairness penalty is:

$$F_t = \sum_{g \in \mathcal{G}} \left( \frac{\ell_g}{\bar{\ell}} - 1 \right)^2, \qquad \bar{\ell} = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \ell_g.$$
The constrained RL objective is:

$$\max_\pi \; J_r(\pi) = \mathbb{E}_\pi\!\left[\sum_{t} \gamma^t r_t\right] \quad \text{s.t.} \quad J_{c_k}(\pi) = \mathbb{E}_\pi\!\left[\sum_{t} \gamma^t c_k(s_t, A_t)\right] \le d_k \;\; \forall k,$$

where $J_r$ aggregates discounted rewards and $J_{c_k}$ similarly aggregates costs for constraint $k$. Hard thresholds $d_k$ are selected for operational quantities such as the coefficient of variation of load and the group-imbalance measure.
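The per-epoch reward composition can be sketched directly from these definitions. A minimal Python version, assuming plain-list matrices and illustrative weight names `alpha`, `beta`, `lam` (these names and defaults are not the paper's):

```python
def reward(A, T, Q, groups, alpha=1.0, beta=1.0, lam=1.0):
    """Global instantaneous reward:
    r_t = alpha * timeliness + beta * quality - lam * group-fairness penalty.
    A, T, Q: N x M matrices (assignment, timeliness indicator, quality score).
    groups: list of reviewer-index lists, one per demographic group."""
    N, M = len(A), len(A[0])
    timely = sum(A[i][j] * T[i][j] for i in range(N) for j in range(M))
    quality = sum(A[i][j] * Q[i][j] for i in range(N) for j in range(M))
    # group loads and squared relative-imbalance penalty
    loads = [sum(A[i][j] for i in g for j in range(M)) for g in groups]
    mean_load = sum(loads) / len(loads)
    F = sum((lg / mean_load - 1.0) ** 2 for lg in loads) if mean_load > 0 else 0.0
    return alpha * timely + beta * quality - lam * F
```

Note that the penalty is zero when all groups carry equal load and grows quadratically with relative deviation from the mean.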
3. MARL Policy Learning and Offline (Counterfactual) Evaluation
The training approach leverages Constrained Multi-Agent Policy Optimization (CMAPO), extending single-agent constrained policy optimization to multi-agent systems:
- Actor-Critic Architecture: Individual actor networks (per reviewer); centralized critic for rewards and costs.
- Trust-Region Policy Update: At each iteration, reward and cost gradients $g_i = \nabla_{\theta_i} J_r$ and $b_{i,k} = \nabla_{\theta_i} J_{c_k}$ are computed for each agent; a quadratic program solves for the joint trust-region update $\Delta\theta$ under all cost constraints:

$$\max_{\Delta\theta} \; g^\top \Delta\theta \quad \text{s.t.} \quad J_{c_k} + b_k^\top \Delta\theta \le d_k \;\; \forall k, \qquad \tfrac{1}{2}\, \Delta\theta^\top H\, \Delta\theta \le \delta,$$

where $H$ is the curvature (Fisher) matrix and $\delta$ is the trust-region radius.
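A simplified instance of this update, assuming an identity curvature matrix and a single active cost constraint (assumptions made here for tractability; the paper's QP is more general), has a closed-form solution on the trust-region boundary:

```python
import math

def trust_region_step(g, b, c_slack, delta):
    """Simplified constrained trust-region update with H = I (sketch):
    maximize g.x  subject to  b.x <= c_slack  and  ||x||^2 <= 2*delta.
    c_slack is the remaining constraint budget d_k - J_ck."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def scale(u, s):
        return [ui * s for ui in u]

    r = math.sqrt(2 * delta)                 # trust-region sphere radius
    x = scale(g, r / math.sqrt(dot(g, g)))   # steepest ascent to the boundary
    if dot(b, x) <= c_slack:
        return x                             # cost constraint inactive
    # Otherwise the optimum lies on the sphere AND the hyperplane b.x = c_slack:
    # the center of the intersection circle, plus the tangential ascent direction.
    nb2 = dot(b, b)
    x0 = scale(b, c_slack / nb2)             # closest hyperplane point to the origin
    g_par = [gi - dot(g, b) / nb2 * bi for gi, bi in zip(g, b)]
    rem = 2 * delta - dot(x0, x0)            # squared radius of the circle
    npar = dot(g_par, g_par)
    if rem <= 0 or npar == 0:
        return x0                            # pure feasibility-recovery step
    step = scale(g_par, math.sqrt(rem / npar))
    return [a + s for a, s in zip(x0, step)]
```

With multiple constraints or a non-identity $H$, a numerical QP solver replaces this closed form, but the geometry (ascend along the reward gradient, then project onto the feasible cap of the trust region) is the same.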
- Offline RL with Doubly Robust Estimation: Policy improvement utilizes logged historical data and Doubly Robust (DR) off-policy estimators as in Thomas & Brunskill (2016):

$$\hat{V}_{\mathrm{DR}} = \sum_{t=0}^{T} \gamma^t \left[ w_t\, r_t - \left( w_t\, \hat{Q}(s_t, a_t) - w_{t-1}\, \hat{V}(s_t) \right) \right],$$

with importance weights $w_t = \prod_{t'=0}^{t} \frac{\pi(a_{t'} \mid s_{t'})}{\mu(a_{t'} \mid s_{t'})}$ (and $w_{-1} = 1$), where $\mu$ is the logging policy.
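The step-wise DR estimate can be sketched for logged trajectories as follows (a minimal single-action-per-step sketch; function names and the `gamma` default are illustrative, not the paper's):

```python
def doubly_robust(trajs, pi, mu, Q_hat, V_hat, gamma=0.99):
    """Step-wise Doubly Robust off-policy value estimate (sketch).
    trajs: list of trajectories, each a list of (state, action, reward) tuples.
    pi(a, s), mu(a, s): target / logging policy action probabilities.
    Q_hat(s, a), V_hat(s): learned value models used as control variates."""
    total = 0.0
    for traj in trajs:
        w, est = 1.0, 0.0                   # w_{-1} = 1 by convention
        for t, (s, a, r) in enumerate(traj):
            w_prev = w
            w *= pi(a, s) / mu(a, s)        # cumulative importance weight w_t
            # importance-weighted reward, corrected by the model baseline
            est += gamma ** t * (w * r - (w * Q_hat(s, a) - w_prev * V_hat(s)))
        total += est
    return total / len(trajs)
```

When the value models are exact, the baseline terms cancel most of the importance-sampling variance; when they are zero, the estimator degrades gracefully to plain importance sampling.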
- Hyperparameters:
  - Discount factor $\gamma$
  - Trust-region radius $\delta$
  - Cost limits $d_k$: set to match the target CV(load) and group-imbalance limits
  - Batch size: a fixed number of assignment epochs per iteration
Convergence is determined by stabilization of the objective and satisfaction of all cost constraints within specified slack.
4. Runtime Assignment Mechanism and Integer Programming
At test/deployment time, the learned MARL policy produces either a probability mask $\Pi \in [0,1]^{N \times M}$, with entries $\Pi_{ij}$ giving the marginal probability that reviewer $i$ is assigned paper $j$, or direct deterministic assignments.
Given the combinatorial constraints, final reviewer assignment is produced by solving a small-scale Integer Linear Program (ILP):
- Objective: maximize $\sum_{i,j} \Pi_{ij}\, x_{ij}$ over binary variables $x_{ij} \in \{0, 1\}$
- Subject to:
  - $\sum_i x_{ij} = k$ for each paper $j$
  - $\sum_j x_{ij} \le L_{\max}$ for each reviewer $i$
  - $x_{ij} = 0$ if $(i, j)$ is a conflict of interest

The per-paper quota $k$ and load bound $L_{\max}$ are small in practice. This ILP is computationally efficient, solvable in milliseconds at workshop-scale $N$ and $M$, using solvers such as Gurobi or COIN-OR.
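For intuition, the ILP's feasible set and objective can be checked by exhaustive search at toy scale (a sketch; a deployment would call Gurobi or COIN-OR rather than enumerate):

```python
from itertools import combinations, product

def assign(scores, k, L_max, coi):
    """Exhaustive search equivalent to the assignment ILP, for tiny instances.
    scores[i][j]: policy score for reviewer i on paper j.
    coi: set of (reviewer, paper) conflict pairs.
    Returns (per-paper reviewer tuples, total score)."""
    N, M = len(scores), len(scores[0])
    # candidate reviewer sets per paper: exactly k reviewers, no COI
    per_paper = [
        [S for S in combinations(range(N), k) if not any((i, j) in coi for i in S)]
        for j in range(M)
    ]
    best, best_val = None, float("-inf")
    for choice in product(*per_paper):
        load = [0] * N
        for S in choice:
            for i in S:
                load[i] += 1
        if max(load) > L_max:               # per-reviewer load bound
            continue
        val = sum(scores[i][j] for j, S in enumerate(choice) for i in S)
        if val > best_val:
            best_val, best = val, choice
    return best, best_val
```

Enumeration grows combinatorially in $N$ and $M$, which is exactly why the ILP formulation and a dedicated solver are used at real scale.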
5. Empirical Evaluation, Metrics, and Baselines
Performance assessment is scheduled in "shadow mode" using an Agent-Based Model (ABM) environment instantiated with historical OpenReview data distributions.
- Baselines:
- ILP topic-matching/min-cost-flow (Charlin & Zemel 2013)
- Greedy smallest-load assignment
- Random matching under constraints
- Metrics:
- Timeliness: percentage of reviews completed by the deadline
- Load balance: coefficient of variation of reviewer load, $\mathrm{CV} = \sigma(\text{load}) / \mu(\text{load})$
- Group Fairness: Jain's index, $J = \dfrac{\left(\sum_g \ell_g\right)^2}{|\mathcal{G}| \sum_g \ell_g^2}$
- Quality Uplift: $\Delta Q$, the mean review-quality score relative to the ILP baseline
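The load-balance and fairness metrics are standard formulas with direct implementations:

```python
import statistics

def cv(loads):
    """Coefficient of variation of reviewer load: sigma / mu (population stdev)."""
    mu = statistics.mean(loads)
    return statistics.pstdev(loads) / mu

def jain(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    Equals 1.0 for a perfectly even allocation, 1/n for a maximally skewed one."""
    n = len(loads)
    return sum(loads) ** 2 / (n * sum(x * x for x in loads))
```

Lower CV and higher Jain index both indicate more even load distribution, so the two metrics should move together in the evaluation table.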
Example metric comparison:
| Method | Timeliness | CV(load) | Jain-Fairness | ΔQ |
|---|---|---|---|---|
| ILP baseline | 75% | 0.28 | 0.82 | 0.00 |
| MARL assign | 87% | 0.15 | 0.93 | +0.07 |
Significance is tested via a paired $t$-test.
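The paired $t$-statistic over per-run metric pairs can be computed directly (a minimal sketch; in practice `scipy.stats.ttest_rel` would also return the p-value):

```python
import math
import statistics

def paired_t(x, y):
    """Paired t-statistic for matched metric samples x and y:
    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
```

Pairing is appropriate here because each simulated venue yields one metric value per method, so differences are taken within the same venue.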
6. Equity, Threat Models, and Constraint Enforcement
The architecture incorporates explicit fairness and robustness strategies:
- Bias amplification is addressed through fairness constraints in both the reward function (the $F_t$ penalty) and the cost terms $c_k$. Offline simulations enable bias testing.
- Strategic declines by reviewers are mitigated by feeding decline patterns into the state representation, thereby dynamically down-weighting over-decliners’ assignment likelihood.
- PAC-style risk bounds can be derived for group imbalance: with high probability, the learned policy keeps group imbalance below its threshold, provided the cost constraints are satisfied during training and estimation errors are sufficiently bounded.
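A concrete instance of such a bound can be sketched via Hoeffding's inequality, under the assumption (made here for illustration; the source does not state it) that per-epoch group-imbalance costs are i.i.d. and bounded in $[0,1]$:

```latex
\Pr\!\left( \hat{c}_g - \mathbb{E}[c_g] \ge \epsilon \right) \le \exp\!\left(-2 n \epsilon^2\right),
\qquad \text{so } n \ge \frac{1}{2\epsilon^2} \ln \frac{1}{\eta}
\;\Rightarrow\; \Pr\!\left( \hat{c}_g - \mathbb{E}[c_g] < \epsilon \right) \ge 1 - \eta,
```

i.e., with enough evaluation epochs $n$, an empirical constraint check certifies the true group imbalance to within $\epsilon$ of its estimate with probability at least $1 - \eta$.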
7. Pilot Implementation and Practical Deployment
- Shadow Mode Pilot (2027–2028): Planned trial at ICSE workshop scale, utilizing historical data for simulation and policy rollouts. Implementation in Python + PyTorch, with assignment-phase inference latency of roughly 0.5 s at that scale. ILP postprocessing via Gurobi or COIN-OR.
- Compute Requirements: Training requires approximately two weeks on an 8-GPU server for 2000 ABM episodes.
- Operational Targets:
  - Timeliness uplift of at least 10%
  - Jain fairness at or above a preset threshold
- Live Mode (2028–2029): Expansion to opt-in live assignment, Program-Chair oversight, manual audit of 10% of assignments, and continuous monitoring of group-Gini coefficients.
These advances collectively realize a MARL-driven reviewer matching system that (1) adapts to reviewer fatigue and historical participation patterns, (2) enforces equity and operational constraints by design, (3) learns efficiently from historical data via counterfactual evaluation, and (4) remains computationally tractable for deployment in real-world conference settings (Farooq et al., 27 Jan 2026).