MARL-Optimized Reviewer Assignment
- The paper introduces a Constrained Multi-Agent Reinforcement Learning framework that optimizes reviewer assignments by balancing timeliness, review quality, and group fairness.
- MARL-Optimized Reviewer Assignment models the process as a stochastic multi-agent game, incorporating dynamic states like reviewer load, historical lateness, decline rates, and topic distances.
- The approach combines rigorous reward design, trust-region policy updates, and an ILP-based runtime mechanism to ensure efficient, fair, and computationally tractable assignments.
MARL-Optimized Reviewer Assignment is a computational framework for peer review assignment based on Constrained Multi-Agent Reinforcement Learning (MARL), designed to address problems of reviewer fatigue, fairness, and review quality in the peer review ecosystem. It models the assignment process as a stochastic multi-agent game, where agents (reviewers) interact with a dynamically evolving system representing real-world constraints and objectives. This method seeks not only to optimize the matching of papers to reviewers, but also to enforce explicit constraints on timeliness, load-balance, and group equity, while adapting to historical reviewer behaviors and institutional demands (Farooq et al., 27 Jan 2026).
1. Formal Multi-Agent Model for Reviewer Assignment
In this formulation, the reviewer assignment subproblem is codified as a constrained stochastic game $\mathcal{G} = \langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}, P, r, \{c_k\} \rangle$:
- Agents: Each reviewer $i \in \mathcal{N} = \{1, \dots, N\}$ is an agent.
- State Space: At each assignment epoch $t$, the state $s_t$ aggregates reviewer-level features:
  - $\ell_i(t)$: current uncompleted load for reviewer $i$
  - $\bar{d}_i$: historical mean lateness of reviewer $i$
  - $\rho_i$: decline rate of reviewer $i$
  - $\delta_{ij}$: topic distance between reviewer $i$ and paper $j$, with $j \in \{1, \dots, M\}$ for papers.
- Joint Action: Assignment matrix $A_t \in \{0,1\}^{N \times M}$ with $A_{ij}(t) = 1$ iff reviewer $i$ is assigned paper $j$ at epoch $t$.
- Constraints:
  - Exactly $k$ reviewers per paper ($\sum_i A_{ij}(t) = k$ for all $j$)
  - Load bound per reviewer ($\sum_j A_{ij}(t) \le L_{\max}$ for all $i$)
  - Conflict-of-interest exclusion ($A_{ij}(t) = 0$ if $(i, j)$ is a COI pair)
- Transition Kernel: $P(s_{t+1} \mid s_t, A_t)$ is updated by tracking assignment completion, lateness, and declines.
- Reward: The vector reward per epoch incorporates timeliness, review quality, and fairness penalties. The global instantaneous reward is:

$$r_t = \alpha \sum_{i,j} A_{ij}(t)\, T_{ij} + \beta \sum_{i,j} A_{ij}(t)\, Q_{ij} - \lambda F_t$$

  where:
  - $T_{ij}$: timeliness indicator for reviewer $i$ on paper $j$
  - $Q_{ij}$: review specificity/quality score
  - $F_t$: group fairness penalty
- Cost Functions: Quantities $c_k(s_t, A_t)$ (e.g., load balance, group imbalance) that serve as optimization constraints.
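The joint-action constraints above reduce to a simple feasibility check over the assignment matrix. A minimal sketch in Python (field and function names are illustrative, not identifiers from the paper):

```python
def feasible(A, k, L_max, coi):
    """Check a joint action A (N x M 0/1 matrix, as nested lists) against
    the hard constraints: exactly k reviewers per paper, at most L_max
    papers per reviewer, and no conflict-of-interest pairs."""
    N, M = len(A), len(A[0])
    ok_papers = all(sum(A[i][j] for i in range(N)) == k for j in range(M))
    ok_load = all(sum(A[i]) <= L_max for i in range(N))
    ok_coi = all(A[i][j] == 0 for (i, j) in coi)
    return ok_papers and ok_load and ok_coi
```

In deployment these same checks reappear as the ILP constraints at assignment time; here they only validate a candidate joint action.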
2. Reward Design and Constraint Specification
The multi-objective reward is constructed to align incentives and enforce operational constraints:
- Timeliness: $T_{ij} = 1$ if reviewer $i$ completes the review of paper $j$ on time, else $0$.
- Review Quality: $Q_{ij} \in [0, 1]$, typically a normalized specificity or meta-review consistency score.
- Fairness Penalty: For a set of demographic groups $\mathcal{G}$, the load of group $g$ is $\ell_g = \sum_{i \in g} \sum_j A_{ij}(t)$; the group fairness penalty is:

$$F_t = \sum_{g \in \mathcal{G}} \left( \frac{\ell_g}{\bar{\ell}} - 1 \right)^2, \qquad \bar{\ell} = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \ell_g.$$
The constrained RL objective is:

$$\max_\pi \; J_r(\pi) = \mathbb{E}_\pi\!\left[\sum_{t} \gamma^t r_t\right] \quad \text{s.t.} \quad J_{c_k}(\pi) = \mathbb{E}_\pi\!\left[\sum_{t} \gamma^t c_k(s_t, A_t)\right] \le d_k \;\; \forall k,$$

where $J_r$ aggregates discounted rewards and $J_{c_k}$ similarly aggregates costs for constraint $k$. Hard thresholds $d_k$ are selected for operational quantities such as the coefficient of variation of load and the group-imbalance measure.
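The per-epoch reward composition can be sketched directly from these definitions. A minimal Python version, assuming plain-list matrices and illustrative weight names `alpha`, `beta`, `lam` (these names and defaults are not the paper's):

```python
def reward(A, T, Q, groups, alpha=1.0, beta=1.0, lam=1.0):
    """Global instantaneous reward:
    r_t = alpha * timeliness + beta * quality - lam * group-fairness penalty.
    A, T, Q: N x M matrices (assignment, timeliness indicator, quality score).
    groups: list of reviewer-index lists, one per demographic group."""
    N, M = len(A), len(A[0])
    timely = sum(A[i][j] * T[i][j] for i in range(N) for j in range(M))
    quality = sum(A[i][j] * Q[i][j] for i in range(N) for j in range(M))
    # group loads and squared relative-imbalance penalty
    loads = [sum(A[i][j] for i in g for j in range(M)) for g in groups]
    mean_load = sum(loads) / len(loads)
    F = sum((lg / mean_load - 1.0) ** 2 for lg in loads) if mean_load > 0 else 0.0
    return alpha * timely + beta * quality - lam * F
```

Note that the penalty is zero when all groups carry equal load and grows quadratically with relative deviation from the mean.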
3. MARL Policy Learning and Offline (Counterfactual) Evaluation
The training approach leverages Constrained Multi-Agent Policy Optimization (CMAPO), extending single-agent constrained policy optimization to multi-agent systems:
- Actor-Critic Architecture: Individual actor networks (per reviewer); centralized critic for rewards and costs.
- Trust-Region Policy Update: At each iteration, reward and cost gradients $g_i = \nabla_{\theta_i} J_r$ and $b_{i,k} = \nabla_{\theta_i} J_{c_k}$ are computed for each agent; a quadratic program solves for the joint trust-region update $\Delta\theta$ under all cost constraints:

$$\max_{\Delta\theta} \; g^\top \Delta\theta \quad \text{s.t.} \quad J_{c_k} + b_k^\top \Delta\theta \le d_k \;\; \forall k, \qquad \tfrac{1}{2}\, \Delta\theta^\top H\, \Delta\theta \le \delta,$$

where $H$ is the curvature (Fisher) matrix and $\delta$ is the trust-region radius.
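A simplified instance of this update, assuming an identity curvature matrix and a single active cost constraint (assumptions made here for tractability; the paper's QP is more general), has a closed-form solution on the trust-region boundary:

```python
import math

def trust_region_step(g, b, c_slack, delta):
    """Simplified constrained trust-region update with H = I (sketch):
    maximize g.x  subject to  b.x <= c_slack  and  ||x||^2 <= 2*delta.
    c_slack is the remaining constraint budget d_k - J_ck."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def scale(u, s):
        return [ui * s for ui in u]

    r = math.sqrt(2 * delta)                 # trust-region sphere radius
    x = scale(g, r / math.sqrt(dot(g, g)))   # steepest ascent to the boundary
    if dot(b, x) <= c_slack:
        return x                             # cost constraint inactive
    # Otherwise the optimum lies on the sphere AND the hyperplane b.x = c_slack:
    # the center of the intersection circle, plus the tangential ascent direction.
    nb2 = dot(b, b)
    x0 = scale(b, c_slack / nb2)             # closest hyperplane point to the origin
    g_par = [gi - dot(g, b) / nb2 * bi for gi, bi in zip(g, b)]
    rem = 2 * delta - dot(x0, x0)            # squared radius of the circle
    npar = dot(g_par, g_par)
    if rem <= 0 or npar == 0:
        return x0                            # pure feasibility-recovery step
    step = scale(g_par, math.sqrt(rem / npar))
    return [a + s for a, s in zip(x0, step)]
```

With multiple constraints or a non-identity $H$, a numerical QP solver replaces this closed form, but the geometry (ascend along the reward gradient, then project onto the feasible cap of the trust region) is the same.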
- Offline RL with Doubly Robust Estimation: Policy improvement utilizes logged historical data and Doubly Robust (DR) off-policy estimators as in Thomas & Brunskill (2016):

$$\hat{V}_{\mathrm{DR}} = \sum_{t=0}^{T} \gamma^t \left[ w_t\, r_t - \left( w_t\, \hat{Q}(s_t, a_t) - w_{t-1}\, \hat{V}(s_t) \right) \right],$$

with importance weights $w_t = \prod_{t'=0}^{t} \frac{\pi(a_{t'} \mid s_{t'})}{\mu(a_{t'} \mid s_{t'})}$ (and $w_{-1} = 1$), where $\mu$ is the logging policy.
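The step-wise DR estimate can be sketched for logged trajectories as follows (a minimal single-action-per-step sketch; function names and the `gamma` default are illustrative, not the paper's):

```python
def doubly_robust(trajs, pi, mu, Q_hat, V_hat, gamma=0.99):
    """Step-wise Doubly Robust off-policy value estimate (sketch).
    trajs: list of trajectories, each a list of (state, action, reward) tuples.
    pi(a, s), mu(a, s): target / logging policy action probabilities.
    Q_hat(s, a), V_hat(s): learned value models used as control variates."""
    total = 0.0
    for traj in trajs:
        w, est = 1.0, 0.0                   # w_{-1} = 1 by convention
        for t, (s, a, r) in enumerate(traj):
            w_prev = w
            w *= pi(a, s) / mu(a, s)        # cumulative importance weight w_t
            # importance-weighted reward, corrected by the model baseline
            est += gamma ** t * (w * r - (w * Q_hat(s, a) - w_prev * V_hat(s)))
        total += est
    return total / len(trajs)
```

When the value models are exact, the baseline terms cancel most of the importance-sampling variance; when they are zero, the estimator degrades gracefully to plain importance sampling.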
- Hyperparameters:
  - Discount factor $\gamma$
  - Trust-region radius $\delta$
  - Cost limits $d_k$: set to match the target CV(load) and group-imbalance limits
  - Batch size: a fixed number of assignment epochs per iteration
Convergence is determined by stabilization of the objective and satisfaction of all cost constraints within specified slack.
4. Runtime Assignment Mechanism and Integer Programming
At test/deployment time, the learned MARL policy produces either a probability mask $\Pi \in [0,1]^{N \times M}$, with entries $\Pi_{ij}$ giving the marginal probability that reviewer $i$ is assigned paper $j$, or direct deterministic assignments.
Given the combinatorial constraints, final reviewer assignment is produced by solving a small-scale Integer Linear Program (ILP):
- Objective: maximize $\sum_{i,j} \Pi_{ij}\, x_{ij}$ over binary variables $x_{ij} \in \{0, 1\}$
- Subject to:
  - $\sum_i x_{ij} = k$ for each paper $j$
  - $\sum_j x_{ij} \le L_{\max}$ for each reviewer $i$
  - $x_{ij} = 0$ if $(i, j)$ is a conflict of interest

The per-paper quota $k$ and load bound $L_{\max}$ are small in practice. This ILP is computationally efficient, solvable in milliseconds at workshop-scale $N$ and $M$, using solvers such as Gurobi or COIN-OR.
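For intuition, the ILP's feasible set and objective can be checked by exhaustive search at toy scale (a sketch; a deployment would call Gurobi or COIN-OR rather than enumerate):

```python
from itertools import combinations, product

def assign(scores, k, L_max, coi):
    """Exhaustive search equivalent to the assignment ILP, for tiny instances.
    scores[i][j]: policy score for reviewer i on paper j.
    coi: set of (reviewer, paper) conflict pairs.
    Returns (per-paper reviewer tuples, total score)."""
    N, M = len(scores), len(scores[0])
    # candidate reviewer sets per paper: exactly k reviewers, no COI
    per_paper = [
        [S for S in combinations(range(N), k) if not any((i, j) in coi for i in S)]
        for j in range(M)
    ]
    best, best_val = None, float("-inf")
    for choice in product(*per_paper):
        load = [0] * N
        for S in choice:
            for i in S:
                load[i] += 1
        if max(load) > L_max:               # per-reviewer load bound
            continue
        val = sum(scores[i][j] for j, S in enumerate(choice) for i in S)
        if val > best_val:
            best_val, best = val, choice
    return best, best_val
```

Enumeration grows combinatorially in $N$ and $M$, which is exactly why the ILP formulation and a dedicated solver are used at real scale.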
5. Empirical Evaluation, Metrics, and Baselines
Performance assessment is scheduled in "shadow mode" using an Agent-Based Model (ABM) environment instantiated with historical OpenReview data distributions.
- Baselines:
- ILP topic-matching/min-cost-flow (Charlin & Zemel 2013)
- Greedy smallest-load assignment
- Random matching under constraints
- Metrics:
- Timeliness: percentage of reviews completed by the deadline
- Load balance: coefficient of variation of reviewer load, $\mathrm{CV} = \sigma(\text{load}) / \mu(\text{load})$
- Group Fairness: Jain's index, $J = \dfrac{\left(\sum_g \ell_g\right)^2}{|\mathcal{G}| \sum_g \ell_g^2}$
- Quality Uplift: $\Delta Q$, the mean review-quality score relative to the ILP baseline
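The load-balance and fairness metrics are standard formulas with direct implementations:

```python
import statistics

def cv(loads):
    """Coefficient of variation of reviewer load: sigma / mu (population stdev)."""
    mu = statistics.mean(loads)
    return statistics.pstdev(loads) / mu

def jain(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    Equals 1.0 for a perfectly even allocation, 1/n for a maximally skewed one."""
    n = len(loads)
    return sum(loads) ** 2 / (n * sum(x * x for x in loads))
```

Lower CV and higher Jain index both indicate more even load distribution, so the two metrics should move together in the evaluation table.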
Example metric comparison:
| Method | Timeliness | CV(load) | Jain-Fairness | ΔQ |
|---|---|---|---|---|
| ILP baseline | 75% | 0.28 | 0.82 | 0.00 |
| MARL assign | 87% | 0.15 | 0.93 | +0.07 |
Significance is tested via a paired $t$-test.
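The paired $t$-statistic over per-run metric pairs can be computed directly (a minimal sketch; in practice `scipy.stats.ttest_rel` would also return the p-value):

```python
import math
import statistics

def paired_t(x, y):
    """Paired t-statistic for matched metric samples x and y:
    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
```

Pairing is appropriate here because each simulated venue yields one metric value per method, so differences are taken within the same venue.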
6. Equity, Threat Models, and Constraint Enforcement
The architecture incorporates explicit fairness and robustness strategies:
- Bias amplification is addressed through fairness constraints in both the reward function (the $F_t$ penalty) and the cost terms $c_k$. Offline simulations enable bias testing.
- Strategic declines by reviewers are mitigated by feeding decline patterns into the state representation, thereby dynamically down-weighting over-decliners’ assignment likelihood.
- PAC-style risk bounds can be derived for group imbalance: with high probability, the learned policy keeps group imbalance below its threshold, provided the cost constraints are satisfied during training and estimation errors are sufficiently bounded.
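A concrete instance of such a bound can be sketched via Hoeffding's inequality, under the assumption (made here for illustration; the source does not state it) that per-epoch group-imbalance costs are i.i.d. and bounded in $[0,1]$:

```latex
\Pr\!\left( \hat{c}_g - \mathbb{E}[c_g] \ge \epsilon \right) \le \exp\!\left(-2 n \epsilon^2\right),
\qquad \text{so } n \ge \frac{1}{2\epsilon^2} \ln \frac{1}{\eta}
\;\Rightarrow\; \Pr\!\left( \hat{c}_g - \mathbb{E}[c_g] < \epsilon \right) \ge 1 - \eta,
```

i.e., with enough evaluation epochs $n$, an empirical constraint check certifies the true group imbalance to within $\epsilon$ of its estimate with probability at least $1 - \eta$.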
7. Pilot Implementation and Practical Deployment
- Shadow Mode Pilot (2027–2028): Planned trial at ICSE workshop scale, utilizing historical data for simulation and policy rollouts. Implementation in Python + PyTorch, with assignment-phase inference latency of roughly 0.5 s at that scale. ILP postprocessing via Gurobi or COIN-OR.
- Compute Requirements: Training requires approximately two weeks on an 8-GPU server for 2000 ABM episodes.
- Operational Targets:
  - Timeliness uplift of at least 10%
  - Jain fairness at or above a preset threshold
- Live Mode (2028–2029): Expansion to opt-in live assignment, Program-Chair oversight, manual audit of 10% of assignments, and continuous monitoring of group-Gini coefficients.
These advances collectively realize a MARL-driven reviewer matching system that (1) adapts to reviewer fatigue and historical participation patterns, (2) enforces equity and operational constraints by design, (3) learns efficiently from historical data via counterfactual evaluation, and (4) remains computationally tractable for deployment in real-world conference settings (Farooq et al., 27 Jan 2026).