
MARL-Optimized Reviewer Assignment

Updated 3 February 2026
  • The paper introduces a Constrained Multi-Agent Reinforcement Learning framework that optimizes reviewer assignments by balancing timeliness, review quality, and group fairness.
  • MARL-Optimized Reviewer Assignment models the process as a stochastic multi-agent game, incorporating dynamic states like reviewer load, historical lateness, decline rates, and topic distances.
  • The approach combines rigorous reward design, trust-region policy updates, and an ILP-based runtime mechanism to ensure efficient, fair, and computationally tractable assignments.

MARL-Optimized Reviewer Assignment is a computational framework for peer review assignment based on Constrained Multi-Agent Reinforcement Learning (MARL), designed to address problems of reviewer fatigue, fairness, and review quality in the peer review ecosystem. It models the assignment process as a stochastic multi-agent game, where agents (reviewers) interact with a dynamically evolving system representing real-world constraints and objectives. This method seeks not only to optimize the matching of papers to reviewers, but also to enforce explicit constraints on timeliness, load-balance, and group equity, while adapting to historical reviewer behaviors and institutional demands (Farooq et al., 27 Jan 2026).

1. Formal Multi-Agent Model for Reviewer Assignment

In this formulation, the reviewer assignment subproblem is codified as a constrained stochastic game $G = \langle \mathcal{A}, \mathcal{S}, \mathcal{A}_c, \mathcal{P}, \mathcal{R}, \mathcal{C} \rangle$:

  • Agents $\mathcal{A}$: Each reviewer is an agent $i \in \{1,\ldots,N\}$.
  • State Space $\mathcal{S}$: At each assignment epoch $t$, the state $s_t$ aggregates reviewer-level features:
    • $\ell_i^t$ (current uncompleted load of reviewer $i$)
    • $\tau_i^t$ (historical mean lateness of reviewer $i$)
    • $\delta_i^t$ (decline rate of reviewer $i$)
    • $d_{ij}^t$ (topic distance between reviewer $i$ and paper $j$)
    • $s_t = (\ell_1^t,\ldots,\ell_N^t;\ \tau_1^t,\ldots,\tau_N^t;\ \delta_1^t,\ldots,\delta_N^t;\ D^t)$ with $D^t \in \mathbb{R}^{N \times M}$ for $M$ papers.
  • Joint Action $\mathcal{A}_c$: Assignment matrix $X_t = [x_{ij}] \in \{0,1\}^{N \times M}$ with $x_{ij}=1$ iff reviewer $i$ is assigned paper $j$ at epoch $t$.
  • Constraints:
    • Exactly $K$ reviewers per paper ($\sum_i x_{ij} = K$)
    • Load bound per reviewer ($\sum_j x_{ij} \leq L_i$)
    • Conflict-of-interest exclusion ($x_{ij}=0$ if COI)
  • Transition Kernel $\mathcal{P}$: $s_{t+1}$ is updated by tracking assignment completion, lateness, and declines.
  • Reward $\mathcal{R}$: A per-epoch reward vector incorporating timeliness, review quality, and fairness penalties. The global instantaneous reward is:

$$r_t = \sum_{i=1}^N \Big[ \alpha\, T_i(s_t, a_t) + \beta\, Q_i(s_t, a_t) \Big] - \gamma\, F(s_t, a_t)$$

where $T_i$ is the timeliness indicator for reviewer $i$, $Q_i$ the review specificity/quality score, and $F$ the group fairness penalty.

  • Cost Functions $\mathcal{C}$: Quantities (e.g., load balance, group imbalance) that serve as optimization constraints.
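To make the state space above concrete, the epoch-$t$ features can be collected into one structure. This is a minimal sketch in plain Python; the names `ReviewerState` and `make_state` are illustrative, not taken from the paper:

```python
# Sketch (not the paper's implementation): the epoch-t state
# s_t = (loads, lateness, decline rates, topic distances)
# for N reviewers and M papers.
from dataclasses import dataclass
from typing import List

@dataclass
class ReviewerState:
    loads: List[int]               # ell_i^t: open assignments per reviewer
    lateness: List[float]          # tau_i^t: historical mean lateness
    decline_rates: List[float]     # delta_i^t: fraction of invites declined
    topic_dist: List[List[float]]  # D^t: N x M reviewer-paper topic distances

def make_state(n_reviewers: int, n_papers: int) -> ReviewerState:
    """Build an all-zeros initial state s_0 (hypothetical helper)."""
    return ReviewerState(
        loads=[0] * n_reviewers,
        lateness=[0.0] * n_reviewers,
        decline_rates=[0.0] * n_reviewers,
        topic_dist=[[0.0] * n_papers for _ in range(n_reviewers)],
    )

s0 = make_state(n_reviewers=3, n_papers=2)
```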

2. Reward Design and Constraint Specification

The multi-objective reward is constructed to align incentives and enforce operational constraints:

  • Timeliness: $T_i(s_t, a_t) = 1$ if reviewer $i$ completes on time, else $0$.
  • Review Quality: $Q_i(s_t, a_t) \in [0,1]$, typically a normalized specificity or meta-review consistency score.
  • Fairness Penalty: For a set of demographic groups $G$ with per-group load $\mathrm{Load}_g(s_t, a_t)$, the group fairness penalty is:

$$F(s_t, a_t) = \sum_{g \in G} \left| \mathrm{Load}_g(s_t, a_t) - \mu_\text{load} \right|$$

with $\mu_\text{load} = \frac{\text{total assignments}}{|G|}$.
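The reward components above can be sketched directly. A minimal illustration of $r_t$ and the penalty $F$, with hypothetical coefficient values and function names (not the paper's implementation):

```python
# Sketch of r_t = sum_i [alpha*T_i + beta*Q_i] - gamma*F and the group
# fairness penalty F = sum_g |Load_g - mu_load|. Coefficients are
# illustrative defaults, not values from the paper.

def fairness_penalty(group_loads):
    """F(s_t, a_t): total absolute deviation of per-group load from the mean."""
    mu = sum(group_loads) / len(group_loads)  # mu_load = total / |G|
    return sum(abs(load - mu) for load in group_loads)

def global_reward(timeliness, quality, group_loads,
                  alpha=1.0, beta=1.0, gamma=0.5):
    """r_t for one epoch; timeliness[i] in {0,1}, quality[i] in [0,1]."""
    per_reviewer = sum(alpha * t + beta * q
                       for t, q in zip(timeliness, quality))
    return per_reviewer - gamma * fairness_penalty(group_loads)

# Perfectly balanced groups incur zero fairness penalty:
r = global_reward(timeliness=[1, 1, 0], quality=[0.8, 0.6, 0.9],
                  group_loads=[4, 4])
```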

The constrained RL objective is:

$$\max_\theta\, J_R(\theta) \quad \text{subject to} \quad J_C^j(\theta) \leq d_j \quad \forall j$$

where $J_R(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ and $J_C^j(\theta)$ similarly aggregates the costs for constraint $j$. Hard thresholds $d_j$ are selected (e.g., coefficient of variation of load $\leq 0.2$, group imbalance $\leq 1$).
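The paper enforces these constraints via a trust-region quadratic program (Section 3). A simpler way to see the constrained objective is a Lagrangian primal-dual iteration; the toy below uses scalar quadratics as stand-ins for $J_R$ and $J_C$ and is an illustrative alternative scheme, not the paper's algorithm:

```python
# Toy primal-dual iteration for max_theta J_R(theta) s.t. J_C(theta) <= d:
# ascend theta on the Lagrangian J_R - lam * (J_C - d), and ascend lam on
# the constraint violation. Scalar quadratics stand in for the true returns.

def primal_dual(d=1.0, lr_theta=0.1, lr_lam=0.1, steps=2000):
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        grad_r = -2.0 * (theta - 2.0)  # d/dtheta of J_R = -(theta - 2)^2
        grad_c = 2.0 * theta           # d/dtheta of J_C = theta^2
        theta += lr_theta * (grad_r - lam * grad_c)    # primal ascent
        lam = max(0.0, lam + lr_lam * (theta**2 - d))  # dual ascent, lam >= 0
    return theta, lam

theta, lam = primal_dual()
# The unconstrained optimum theta = 2 violates J_C <= 1; the iterates
# settle near the constraint boundary theta = 1.
```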

3. MARL Policy Learning and Offline (Counterfactual) Evaluation

The training approach leverages Constrained Multi-Agent Policy Optimization (CMAPO), extending single-agent constrained policy optimization to multi-agent systems:

  • Actor-Critic Architecture: Individual actor networks $\theta = \{\theta_1, \ldots, \theta_N\}$ (one per reviewer); a centralized critic $\phi$ for rewards and costs.
  • Trust-Region Policy Update: At each iteration, gradients $\nabla_{\theta_i} J_R$ and $\nabla_{\theta_i} J_C^j$ are computed for each agent; a quadratic program solves for the joint trust-region update under all cost constraints:
    • Quadratic program:

    $$\begin{aligned} \max_{\delta\theta}\quad & g^\top \delta\theta \\ \text{s.t.}\quad & (b^j)^\top \delta\theta + (J_C^j - d_j) \leq 0 \quad \forall j \\ & \delta\theta^\top H\, \delta\theta \leq \delta^2 \end{aligned}$$

    where $g$ is the reward-objective gradient, $b^j$ are the cost gradients, $H$ is the curvature (Fisher information) matrix, and $\delta$ is the trust-region radius.

  • Offline RL with Doubly Robust Estimation: Policy improvement utilizes logged historical data and Doubly Robust off-policy estimators as in Thomas & Brunskill (2016):

$$\hat J(\pi) = \frac{1}{n}\sum_{k=1}^n \Big[ w_k \big(r_k - \hat v(s_k)\big) + V^\pi(s_k) \Big]$$

with importance weights $w_k = \pi(a_k \mid s_k)/\mu(a_k \mid s_k)$, where $\mu$ is the logging (behavior) policy.

  • Hyperparameters:
    • Discount factor: $\gamma = 0.99$
    • Trust-region radius: $\delta = 10^{-2}$
    • Cost limits: set to match target CV(load) or group-imbalance limits
    • Batch size: $\sim 1000$ assignment epochs per iteration

Convergence is determined by stabilization of the objective JRJ_R and satisfaction of all cost constraints within specified slack.
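The doubly robust estimator above can be sketched in a few lines. All inputs here are hypothetical logged quantities, and the function name is illustrative:

```python
# Sketch of the doubly robust off-policy value estimate
# J_hat(pi) = (1/n) sum_k [ w_k * (r_k - v_hat(s_k)) + V^pi(s_k) ],
# with importance weights w_k = pi(a_k|s_k) / mu(a_k|s_k).

def doubly_robust(pi_probs, mu_probs, rewards, v_hat, v_pi):
    n = len(rewards)
    total = 0.0
    for pk, mk, rk, vh, vp in zip(pi_probs, mu_probs, rewards, v_hat, v_pi):
        w = pk / mk                  # importance weight w_k
        total += w * (rk - vh) + vp  # IS correction term + model value
    return total / n

# Two hypothetical logged transitions:
est = doubly_robust(
    pi_probs=[0.9, 0.5], mu_probs=[0.5, 0.5],
    rewards=[1.0, 0.0], v_hat=[0.6, 0.6], v_pi=[0.7, 0.7],
)
```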

4. Runtime Assignment Mechanism and Integer Programming

At test/deployment time, the learned MARL policy $\pi_\theta$ produces either a probability mask $P_t = [p_{ij}] \in [0,1]^{N \times M}$, with $p_{ij} = \pi_\theta(x_{ij}=1 \mid s_t)$, or direct deterministic assignments.

Given the combinatorial constraints, final reviewer assignment is produced by solving a small-scale Integer Linear Program (ILP):

  • Objective: maximize $\sum_{i,j} w_{ij} x_{ij}$ over $x_{ij} \in \{0,1\}$
  • Subject to:
    • $\sum_i x_{ij} = K$ for each paper $j$
    • $\sum_j x_{ij} \leq L_i$ for each reviewer $i$
    • $x_{ij} = 0$ if conflict-of-interest

Typically, $w_{ij} = p_{ij}$ or $w_{ij} = \log\left(\frac{p_{ij}}{1-p_{ij}}\right)$. This ILP is computationally efficient, solvable in milliseconds for $N \sim 1000$, $M \sim 500$ using solvers such as Gurobi or COIN-OR.
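In deployment the ILP would be handed to a solver such as Gurobi or COIN-OR. As an illustration of the objective and constraints only, the following sketch computes the logit weights $w_{ij}$ and brute-forces a tiny instance; the brute-force search is a stand-in for a real ILP solver and is feasible only at toy sizes:

```python
# Illustrative stand-in for the runtime ILP: pick K reviewers per paper to
# maximize sum w_ij * x_ij under load bounds and COI exclusions.
import math
from itertools import combinations, product

def logit_weights(p):
    """w_ij = log(p_ij / (1 - p_ij)) from the policy's probability mask."""
    return [[math.log(pij / (1.0 - pij)) for pij in row] for row in p]

def assign(w, K, load_limits, coi):
    n, m = len(w), len(w[0])
    # Every way of choosing K COI-free reviewers for each paper.
    per_paper = [
        [c for c in combinations(range(n), K)
         if not any(coi[i][j] for i in c)]
        for j in range(m)
    ]
    best_val, best = -math.inf, None
    for choice in product(*per_paper):
        loads = [0] * n
        for reviewers in choice:
            for i in reviewers:
                loads[i] += 1
        if any(loads[i] > load_limits[i] for i in range(n)):
            continue  # violates a reviewer's load bound L_i
        val = sum(w[i][j]
                  for j, reviewers in enumerate(choice) for i in reviewers)
        if val > best_val:
            best_val, best = val, choice
    return best

# Toy instance: 3 reviewers, 2 papers, K = 1, reviewer 1 has a COI on paper 1.
p = [[0.9, 0.2], [0.6, 0.7], [0.1, 0.8]]
coi = [[False, False], [False, True], [False, False]]
x = assign(logit_weights(p), K=1, load_limits=[1, 1, 1], coi=coi)
```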

5. Empirical Evaluation, Metrics, and Baselines

Performance assessment is scheduled in "shadow mode" using an Agent-Based Model (ABM) environment instantiated with historical OpenReview data distributions.

  • Baselines:
    • ILP topic-matching/min-cost-flow (Charlin & Zemel 2013)
    • Greedy smallest-load assignment
    • Random matching under constraints
  • Metrics:
    • Timeliness: Percentage of reviews completed by deadline
    • Load-balance: Coefficient of variation $\mathrm{CV}(\ell_i) = \sigma(\ell_i)/\mathrm{mean}(\ell_i)$
    • Group Fairness: Jain's index $\frac{(\sum_g \mathrm{Load}_g)^2}{|G|\, \sum_g \mathrm{Load}_g^2}$
    • Quality Uplift: $\Delta Q = Q_\mathrm{MARL} - Q_\mathrm{baseline}$
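Jain's index over per-group loads is straightforward to compute; a minimal sketch:

```python
# Jain's fairness index J = (sum_g Load_g)^2 / (|G| * sum_g Load_g^2).
# J = 1.0 means perfectly equal group loads; J = 1/|G| is maximally skewed.
def jain_index(group_loads):
    total = sum(group_loads)
    return total * total / (len(group_loads) * sum(x * x for x in group_loads))

assert jain_index([5, 5, 5]) == 1.0                    # perfectly balanced
assert abs(jain_index([10, 0, 0]) - 1 / 3) < 1e-12     # all load on one group
```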

Example metric comparison:

| Method | Timeliness | CV(load) | Jain-Fairness | ΔQ |
| --- | --- | --- | --- | --- |
| ILP baseline | 75% | 0.28 | 0.82 | 0.00 |
| MARL assign | 87% | 0.15 | 0.93 | +0.07 |

Significance is tested via a paired $t$-test ($p < 0.01$).

6. Equity, Threat Models, and Constraint Enforcement

The architecture incorporates explicit fairness and robustness strategies:

  • Bias amplification is addressed through fairness constraints in both the reward function (the $F$ penalty) and the cost terms $c^j$. Offline simulations enable bias testing.
  • Strategic declines by reviewers are mitigated by feeding decline patterns $\delta_i^t$ into the state representation, thereby dynamically down-weighting over-decliners' assignment likelihood.
  • PAC-style risk bounds can be derived for group imbalance: with high probability ($\geq 1-\delta$), the learned policy ensures group imbalance $\leq \epsilon$ if the cost constraints $J_C^j \leq d_j$ hold and estimation errors are sufficiently bounded.

7. Pilot Implementation and Practical Deployment

  • Shadow Mode Pilot (2027–2028): Planned trial at ICSE workshop scale, utilizing historical data for simulation and policy rollouts. Implementation in Python + PyTorch, with assignment-phase inference latency of $\sim 0.5$ s for $N=500$, $M=200$. ILP postprocessing via Gurobi or COIN-OR.
  • Compute Requirements: Training requires approximately two weeks on an 8-GPU server for 2000 ABM episodes.
  • Operational Targets:
    • Timeliness uplift $> 10\%$
    • $\mathrm{CV}(\mathrm{load}) < 0.2$
    • Jain fairness $> 0.9$
  • Live Mode (2028–2029): Expansion to opt-in live assignment, Program-Chair oversight, manual audit of 10% assignments, and continuous monitoring of group-Gini coefficients.

These advances collectively realize a MARL-driven reviewer matching system that (1) adapts to reviewer fatigue and historical participation patterns, (2) enforces equity and operational constraints by design, (3) learns efficiently from historical data via counterfactual evaluation, and (4) remains computationally tractable for deployment in real-world conference settings (Farooq et al., 27 Jan 2026).
