Data-Driven Refugee Matching

Updated 10 February 2026

Data-driven refugee matching is a framework that integrates predictive modeling with combinatorial optimization to assign refugees to host locations while boosting employment and social integration.
Key methodologies include maximum-weight bipartite matching, submodular and LP-based models, and dynamic algorithms that adapt to changing refugee profiles and operational constraints.
Practical implementations involve post-processing corrections, fairness constraints, and robust policy evaluations to mitigate algorithmic harm and ensure equitable outcomes.

A data-driven refugee matching problem arises when policymakers or agencies seek to assign refugees (or asylum seekers, migrants, or relocation cases) to jurisdictions so as to optimize specific outcome metrics, most commonly employment rates, social integration, or subjective well-being, subject to operational constraints. State-of-the-art methods combine prediction models trained on historical data with algorithmic assignment procedures to recommend or implement matchings. The field spans several methodological traditions, including combinatorial optimization, machine learning, mechanism and market design, and fairness analysis, and has driven both real-world deployments and foundational research in computational social science.

1. Formal Problem Statement and Optimization Models

The canonical setup involves a pool of $n$ refugees $R = \{1, \ldots, n\}$ and $k$ host locations $L = \{1, \ldots, k\}$, with each location $\ell$ having capacity $c_{\ell}$. High-dimensional feature vectors $x_i$ describe each refugee, and a classifier $g_{\ell}(x_i) \approx \Pr[Y_i=1\,|\,X_i=x_i,L_i=\ell]$ estimates the probability that refugee $i$ achieves a specified outcome (e.g., employment) at location $\ell$.

The matching decision is encoded by the binary assignment $z_{i\ell} \in \{0, 1\}$, subject to one-to-one ($\sum_{\ell \in L} z_{i\ell} = 1$ for every $i \in R$) and capacity ($\sum_{i \in R} z_{i\ell} \le c_{\ell}$ for every $\ell \in L$) constraints. The principal optimization, as per the maximum-weight bipartite matching formalism, is:

$$\max_{z} \; \sum_{i \in R} \sum_{\ell \in L} w_{i\ell}\, z_{i\ell}$$

where $w_{i\ell} = g_{\ell}(x_i)$ and the solution determines the placement $\hat{\ell}_i$ of each refugee $i$ (the location $\ell$ with $z_{i\ell} = 1$). This approach is widespread in current implementations (Lee et al., 2024).
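As a minimal sketch of this optimization, the snippet below enumerates every capacity-feasible placement of a tiny pool and keeps the one with the largest total predicted weight. The function name `best_assignment` and all numbers are illustrative assumptions; exhaustive search is only workable for toy instances, and real deployments would use an LP or a dedicated assignment solver instead.

```python
from itertools import product

def best_assignment(weights, capacities):
    """Exhaustively search capacity-feasible placements (toy sizes only).

    weights[i][l]  : predicted success probability of refugee i at location l
    capacities[l]  : number of slots at location l
    """
    n, k = len(weights), len(capacities)
    best_value, best_placement = float("-inf"), None
    # A placement tuple gives each refugee exactly one location (one-to-one side).
    for placement in product(range(k), repeat=n):
        # Skip placements that exceed any location's capacity.
        if any(placement.count(l) > capacities[l] for l in range(k)):
            continue
        value = sum(weights[i][placement[i]] for i in range(n))
        if value > best_value:
            best_value, best_placement = value, placement
    return best_placement, best_value

# Hypothetical instance: 3 refugees, 2 locations with capacities 2 and 1.
weights = [
    [0.60, 0.30],
    [0.50, 0.45],
    [0.40, 0.10],
]
capacities = [2, 1]
placement, value = best_assignment(weights, capacities)
print(placement, round(value, 2))
```

In this toy instance the second refugee is sent to the smaller location because doing so sacrifices the least predicted weight.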

Alternative formulations explicitly model non-additive effects, notably competition for scarce resources or jobs. Submodular objective functions $f(M)$, where $M$ is a matching, are designed to capture diminishing returns or interference:

Retroactive-correction: Employing concave corrections based on occupation-job saturation.
Interview model: Stochastic sequential applications to a finite job pool.
Coordination model: Max-matching in random bipartite graphs induced by compatibility probabilities (Gölz et al., 2018).

Capacity and feasibility constraints can be viewed as partition matroids, and the assignment is often solved using greedy or LP-based algorithms, guaranteeing at least a $1/(p+1)$-approximation over the intersection of $p$ matroids.
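The greedy approach can be sketched as follows. The objective used here, a square root of the total predicted weight placed at each location, is only an assumed stand-in for the competition models listed above; it is monotone and submodular because it applies a concave function to a nonnegative additive load. The names `saturating_value` and `greedy_matching` are hypothetical.

```python
import math

def saturating_value(matching, weights):
    """Assumed objective: sum over locations of sqrt(total weight placed there)."""
    load = {}
    for i, l in matching:
        load[l] = load.get(l, 0.0) + weights[i][l]
    return sum(math.sqrt(v) for v in load.values())

def greedy_matching(weights, capacities):
    """Repeatedly add the feasible (refugee, location) pair with the largest marginal gain."""
    n, k = len(weights), len(capacities)
    matching, assigned, used = [], set(), [0] * k
    while True:
        base = saturating_value(matching, weights)
        best_gain, best_pair = 0.0, None
        for i in range(n):
            if i in assigned:            # each refugee is placed at most once
                continue
            for l in range(k):
                if used[l] >= capacities[l]:   # location is full
                    continue
                gain = saturating_value(matching + [(i, l)], weights) - base
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, l)
        if best_pair is None:
            return matching
        matching.append(best_pair)
        assigned.add(best_pair[0])
        used[best_pair[1]] += 1

# Hypothetical weights: location 0 looks best for everyone but saturates.
sub_weights = [[0.64, 0.36], [0.64, 0.25], [0.49, 0.16]]
sub_matching = greedy_matching(sub_weights, [2, 2])
print(sub_matching, round(saturating_value(sub_matching, sub_weights), 3))
```

Because of the diminishing returns at location 0, the greedy rule places the second refugee at location 1 even though location 0 still has a free slot.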

2. Algorithmic Techniques and Mechanism Design

Data-driven refugee matching covers both one-off and dynamic assignment settings:

Maximum-weight bipartite matching: Used for pools with fixed data and known predictions; solved by combinatorial methods or LP relaxations (Lee et al., 2024, Gölz et al., 2018).
Learning-based allocation: When predictive models are uncertain, Thompson sampling-based combinatorial semi-bandit algorithms iteratively assign matches while learning reward probabilities and balancing exploration/exploitation (Kasy et al., 2020).
Dynamic/multi-step arrival: Online algorithms such as minimum-discord assignment forecast future arrivals, sample possible scenarios, and solve rolling-horizon assignment problems for robust sequential placement (Bansak et al., 2020).
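The learning-based item above can be illustrated with a simplified Thompson sampling loop. This is a sketch under strong assumptions, not the algorithm of Kasy et al.: refugees are grouped into a few hypothetical types, each location has one slot per round, outcomes are simulated from made-up `true_probs`, and the per-round matching is found by brute force.

```python
import random
from itertools import permutations

def run_thompson(true_probs, rounds, seed=0):
    """Thompson sampling over (type, location) pairs with Beta posteriors."""
    rng = random.Random(seed)
    m = len(true_probs)
    # post[t][l] = [alpha, beta], starting from a uniform Beta(1, 1) prior.
    post = [[[1, 1] for _ in range(m)] for _ in range(m)]
    for _ in range(rounds):
        # 1. Sample a plausible success probability for every pair.
        sampled = [[rng.betavariate(a, b) for a, b in row] for row in post]
        # 2. Choose the one-to-one assignment that is optimal for the sample.
        perm = max(permutations(range(m)),
                   key=lambda p: sum(sampled[t][p[t]] for t in range(m)))
        # 3. Observe (simulated) outcomes only for the pairs actually used.
        for t, l in enumerate(perm):
            success = rng.random() < true_probs[t][l]
            post[t][l][0 if success else 1] += 1
    return post

# Hypothetical ground truth: type 0 fits location 0, type 1 fits location 1.
true_probs = [[0.8, 0.2], [0.3, 0.6]]
posteriors = run_thompson(true_probs, rounds=200)
for t, row in enumerate(posteriors):
    print(t, [round(a / (a + b), 2) for a, b in row])
```

Only the pairs that are actually assigned receive feedback, which is the semi-bandit aspect: exploration happens because posterior samples for rarely tried pairs remain widely spread.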

Several mechanism design approaches incorporate agent preferences:

Constrained Priority Mechanism: Combines outcome maximization with refugee preferences, subject to a planner-defined minimum average employment threshold, and is strategy-proof and constrained-efficient (Acharya et al., 2019).
Constrained Random Serial Dictatorship (CRSD) and Constrained Rank Value (CRV): Allow trade-offs between optimal employment and family welfare through parameterization; CRSD remains strategy-proof, CRV achieves higher welfare at a cost to strategy-proofness (Olberg et al., 2022).
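For orientation, the sketch below implements plain random serial dictatorship with location capacities. It deliberately omits the employment constraint that distinguishes CRSD, whose details are not given here, so it should be read as the unconstrained baseline only. The function name and the preference lists are hypothetical.

```python
import random

def random_serial_dictatorship(preferences, capacities, seed=0):
    """Plain RSD: families pick their favorite remaining location in random order.

    preferences[i] : locations ranked from most to least preferred by family i
    capacities[l]  : number of slots at location l
    """
    rng = random.Random(seed)
    order = list(range(len(preferences)))
    rng.shuffle(order)                      # random priority order
    remaining = list(capacities)
    placement = {}
    for i in order:
        for l in preferences[i]:
            if remaining[l] > 0:            # first acceptable location with space
                placement[i] = l
                remaining[l] -= 1
                break
    return order, placement

# Hypothetical preferences of three families over two locations.
preferences = [[0, 1], [0, 1], [1, 0]]
rsd_capacities = [1, 2]
order, rsd_placement = random_serial_dictatorship(preferences, rsd_capacities, seed=1)
print(order, rsd_placement)
```

A family's report only affects which remaining location it receives at its own turn, which is why misreporting cannot help; this is the intuition behind the strategy-proofness of serial dictatorship.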

Preference-based, strategy-proof, envy-free matching is employed in situations with sponsored resettlement (e.g., the Multiple-Waitlist Procedure as in RUTH, which manages main and location-specific queues) (Farajzadeh et al., 2023). Burden-sharing approaches, particularly in asylum assignment to EU member states, generalize to matching with contracts (resource, time, and priority), with existence and uniqueness results contingent on homogeneous versus heterogeneous burden-sizes (Caspari et al., 2025).

3. Counterfactual Evaluation and Algorithmic Harm

A central issue is that data-driven matchings—when evaluated using historical data—may counterfactually worsen outcomes for specific pools relative to legacy or default policies. Define

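$$U_{\text{def}} = \sum_{i \in R} Y_i(\ell_i^{\text{def}}), \qquad U_{\text{alg}} = \sum_{i \in R} Y_i(\hat{\ell}_i),$$

where $\ell_i^{\text{def}}$ and $\hat{\ell}_i$ denote refugee $i$'s location under the default and algorithmic assignments, $Y_i(\ell_i^{\text{def}})$ is the observed outcome under the default, and $Y_i(\hat{\ell}_i)$ is the (unobserved) counterfactual under the algorithmic assignment. The counterfactual harm is $H = U_{\text{def}} - U_{\text{alg}}$; if $H > 0$, the data-driven approach does harm in hindsight (Lee et al., 2024).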
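In a simulation, where every potential outcome is known, this comparison can be computed directly. The sketch below assumes harm is measured as total outcome under the default minus total outcome under the algorithmic placement, so a positive value means the algorithm did worse in hindsight; the function name and the data are hypothetical.

```python
def counterfactual_harm(potential, default, algorithmic):
    """Default-minus-algorithmic total outcome for one pool (simulation only).

    potential[i][l] : 0/1 outcome refugee i would realize at location l
    default[i]      : location of refugee i under the default policy
    algorithmic[i]  : location of refugee i under the data-driven policy
    """
    n = len(potential)
    u_default = sum(potential[i][default[i]] for i in range(n))
    u_algorithmic = sum(potential[i][algorithmic[i]] for i in range(n))
    return u_default - u_algorithmic

# Hypothetical pool: moving refugee 0 away from location 0 loses a job.
potential = [[1, 0], [0, 1], [1, 1]]
harm = counterfactual_harm(potential, default=[0, 1, 0], algorithmic=[1, 1, 0])
print(harm)
```

With real data only the default-policy outcomes are observed, so the algorithmic term must be estimated rather than read off a table like `potential`.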

To guarantee non-harm:

Inverse-matching post-processing identifies minimally adjusted prediction matrices $\tilde{W}$ such that the optimal new matching is "harmless": it preserves, for all refugees $i$ with $Y_i = 1$ in the default matching, the same assignment in the new matching $\tilde{z}$ (Lee et al., 2024).
Transformer-based correction: A Transformer $T_{\theta}$ learns to approximate these corrections on unseen pools, trained to minimize the adjustment norm and penalize harmful matchings (i.e., matchings that underperform the default policy).
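The harmlessness condition can be checked mechanically. The sketch below also shows a deliberately crude correction that adds a large bonus to each employed refugee's default location before re-solving. This "pinning" is an assumption made for illustration: it is not the minimal-adjustment inverse-matching procedure of Lee et al., and it presumes weights in [0, 1] and a feasible default matching. All names and numbers are hypothetical.

```python
from itertools import product

def solve(weights, capacities):
    """Brute-force maximum-weight placement under capacities (toy sizes only)."""
    n, k = len(weights), len(capacities)
    feasible = (p for p in product(range(k), repeat=n)
                if all(p.count(l) <= capacities[l] for l in range(k)))
    return max(feasible, key=lambda p: sum(weights[i][p[i]] for i in range(n)))

def is_harmless(default, new, outcomes):
    """True if everyone employed under the default keeps their default location."""
    return all(new[i] == default[i]
               for i in range(len(default)) if outcomes[i] == 1)

def pin_employed(weights, default, outcomes):
    """Crude correction: a bonus larger than any total of [0, 1] weights."""
    bonus = len(weights) + 1
    adjusted = [row[:] for row in weights]
    for i, y in enumerate(outcomes):
        if y == 1:
            adjusted[i][default[i]] += bonus
    return adjusted

# Hypothetical pool: refugees 0 and 2 found work under the default placement.
pin_weights = [[0.2, 0.9], [0.8, 0.3], [0.6, 0.5]]
pin_capacities = [2, 1]
pin_default, pin_outcomes = (0, 0, 1), [1, 0, 1]

plain = solve(pin_weights, pin_capacities)
corrected = solve(pin_employed(pin_weights, pin_default, pin_outcomes), pin_capacities)
print(plain, is_harmless(pin_default, plain, pin_outcomes))
print(corrected, is_harmless(pin_default, corrected, pin_outcomes))
```

Pinning changes the predictions far more than necessary, which is exactly what a minimal-adjustment formulation is meant to avoid.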

Empirical findings show that plain data-driven matching can harm up to roughly 55% of pools under moderate prediction noise; post-processing and learned corrections can reduce this to roughly 45%, with minimal loss in average employment gains.

4. Fairness and Group-level Considerations

Recent work emphasizes that maximizing global employment can exacerbate disparities across demographic subgroups measured by country of origin, age, or education. Group-fairness extensions operationalize constraints at the group level:

Fairness rules include random-assignment baselines, proportionally optimized thresholds (group-only LPs), and max-min rules (maximize the minimum group outcome).
Amplified and Conservative Bid-Price Controls: Online assignment strategies that guarantee, with high probability, small regret (loss relative to optimal) for each group, even under arrival and outcome uncertainty (Freund et al., 2023).
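The max-min rule mentioned above can be sketched offline on a toy pool: among all capacity-feasible placements, choose the one whose worst-off group has the highest mean predicted employment, breaking ties by total employment. This brute-force version is an illustrative assumption and is unrelated to the online bid-price controls; group labels and weights are made up.

```python
from itertools import product

def group_means(placement, weights, groups):
    """Mean predicted employment per group under a placement."""
    totals, counts = {}, {}
    for i, l in enumerate(placement):
        g = groups[i]
        totals[g] = totals.get(g, 0.0) + weights[i][l]
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

def max_min_assignment(weights, capacities, groups):
    """Maximize the minimum group mean; ties broken by total predicted employment."""
    n, k = len(weights), len(capacities)
    feasible = [p for p in product(range(k), repeat=n)
                if all(p.count(l) <= capacities[l] for l in range(k))]

    def key(p):
        means = group_means(p, weights, groups)
        return (min(means.values()), sum(weights[i][p[i]] for i in range(n)))

    return max(feasible, key=key)

# Hypothetical pool: group A benefits most from location 0, which has 2 slots.
fair_weights = [[0.9, 0.2], [0.8, 0.3], [0.5, 0.1], [0.4, 0.1]]
fair_groups = ["A", "A", "B", "B"]
fair = max_min_assignment(fair_weights, [2, 2], fair_groups)
print(fair, group_means(fair, fair_weights, fair_groups))
```

Here the employment-maximizing placement gives both location-0 slots to group A, leaving group B with a mean of 0.1, while the max-min placement shares the slots and raises group B's mean at a small cost in total employment.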

Empirical evidence indicates that enforcing fairness constraints via these bid-price controls can lift the group-fairness ratio for all groups above 90–99%, with total global employment losses limited to 1–5% versus the unconstrained optimum.

5. Policy Evaluation and the Winner’s Curse

A major methodological concern is the "winner's curse" in model-based policy evaluation: maximizing over predictions rather than true potential outcomes induces an optimistic bias. This bias is not prevented by model accuracy, sample splitting, or even correct model specification. Simulation studies show model-based evaluations can report illusory improvements (13–59%), even when the true effect is zero; model-free approaches like inverse probability weighting remain unbiased but suffer extreme variance in high-dimensional settings (Bastani et al., 2026).

Structural causes of this bias include:

Overfitting to prediction errors: The optimization step exploits small residuals in the prediction model.
Lack of overlap and off-policy data: When proposed matchings assign refugees to locations rarely seen in the historical data, the model extrapolates poorly.
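The optimistic bias can be reproduced in a few lines. In the hypothetical simulation below every location has the same true employment rate, so no assignment policy can improve outcomes, yet picking each refugee's best predicted location and averaging those predictions reports a clear gain. The parameter values are arbitrary assumptions.

```python
import random

def simulate_winners_curse(n=2000, k=5, true_rate=0.3, noise_sd=0.1, seed=0):
    """Compare the model-based value estimate with the realized outcome rate."""
    rng = random.Random(seed)
    predicted_total = 0.0
    realized_total = 0
    for _ in range(n):
        # Unbiased but noisy predictions for each of the k locations.
        preds = [true_rate + rng.gauss(0, noise_sd) for _ in range(k)]
        best = max(range(k), key=lambda l: preds[l])
        # Model-based evaluation plugs in the prediction at the chosen location.
        predicted_total += preds[best]
        # The true outcome does not depend on the chosen location at all.
        realized_total += rng.random() < true_rate
    return predicted_total / n, realized_total / n

estimated, realized = simulate_winners_curse()
print(round(estimated, 3), round(realized, 3))
```

The gap arises purely from taking a maximum over noisy estimates: each individual prediction is unbiased, but the largest of several unbiased predictions is biased upward.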

A plausible implication is that offline policy evaluation—if relying on model-based imputation—is fundamentally unreliable unless supplemented with robust, model-free statistical inference or variance-regularized, conservative learning rules.
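A minimal sketch of a model-free alternative is the inverse probability weighting estimator: average the logged outcomes of cases whose historical location agrees with the proposed policy, each divided by the probability with which that location was assigned. The record layout and the assumption that logging propensities are known are illustrative choices.

```python
def ipw_value(logged, policy):
    """Inverse-probability-weighted estimate of a policy's mean outcome.

    logged : list of (features, location, propensity, outcome) records, where
             propensity is the probability the logging policy chose `location`
    policy : function mapping features to the location it would choose
    """
    total = 0.0
    for features, location, propensity, outcome in logged:
        if policy(features) == location:    # only agreeing records contribute
            total += outcome / propensity
    return total / len(logged)

# Hypothetical log from a uniformly random assignment over two locations.
logged = [
    ("a", 0, 0.5, 1),
    ("a", 1, 0.5, 0),
    ("b", 0, 0.5, 0),
    ("b", 1, 0.5, 1),
]
target_policy = {"a": 0, "b": 1}.get
ipw_estimate = ipw_value(logged, target_policy)
print(ipw_estimate)
```

Only records that agree with the proposed policy enter the sum, so when overlap is poor the estimate rests on very few heavily weighted cases, which is the source of the high variance.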

6. Extensions: Dynamic Regimes, Housing, and Multi-criteria Frameworks

Advanced topics in data-driven matching address system constraints and societal goals beyond pure employment maximization:

Dynamic matching with congestion: Each matching creates downstream service requirements (e.g., translators), inducing congestion. Regret-minimizing online algorithms, based on learned dual variables, control both queueing and quota violation penalties (Bansak et al., 2024).
Housing assignment: Data-driven platforms such as the Refugee Resettlement Housing Scout integrate multidimensional criteria (affordability, proximity to transit/schools/etc.) into an interactively weighted, single-objective utility model. Efficient computational architectures allow real-time re-optimization in field deployments (Ahsan et al., 2016).
Analytic Hierarchy Process (AHP) allocation: Multi-criteria, country-level allocation (e.g., at the EU level) can be handled with an AHP-based model that aggregates normalized indicators (land area per capita, GDP, unemployment, welfare index) into a composite score and corresponding allocation shares. The process includes consistency checking for pairwise weight judgments and can be extended via hybrid AHP–linear programming, feedback loops, and additional capacities (Liu, 2024).
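The AHP step can be sketched generically as follows. The snippet derives criterion weights from a pairwise comparison matrix with the row geometric-mean method, computes Saaty's consistency ratio, and turns weighted indicator scores into shares. It is not the specific model of Liu (2024); the comparison matrix, country labels, and indicator values are invented for illustration.

```python
import math

# Saaty's random consistency index for small matrices.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}

def ahp_weights(pairwise):
    """Criterion weights from normalized row geometric means."""
    n = len(pairwise)
    geo = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo)
    return [g / total for g in geo]

def consistency_ratio(pairwise, weights):
    """CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    n = len(pairwise)
    lam = sum(
        sum(pairwise[i][j] * weights[j] for j in range(n)) / weights[i]
        for i in range(n)
    ) / n
    return ((lam - n) / (n - 1)) / RANDOM_INDEX[n]

def allocation_shares(indicators, weights):
    """Weighted indicator score per country, normalized to sum to one."""
    scores = {c: sum(w * v for w, v in zip(weights, vals))
              for c, vals in indicators.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical judgments: criterion 1 is twice as important as 2, four times 3.
pairwise = [[1, 2, 4], [0.5, 1, 2], [0.25, 0.5, 1]]
ahp_w = ahp_weights(pairwise)
ahp_cr = consistency_ratio(pairwise, ahp_w)
shares = allocation_shares({"X": [0.5, 0.2, 0.3], "Y": [0.5, 0.8, 0.7]}, ahp_w)
print([round(w, 3) for w in ahp_w], round(ahp_cr, 6), shares)
```

A consistency ratio below 0.1 is the conventional threshold for accepting a set of pairwise judgments; the matrix above is perfectly consistent, so its ratio is zero up to rounding.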

7. Practical Recommendations and Limitations

Standard deployment pipelines draw on these methodological advances with the following steps:

Data preparation: Assemble refugee profiles, including demographic, human capital, and case-level features, together with location-specific labor market and integration indicators.
Prediction model calibration: Use historical outcomes to train (and calibrate) predictive models of refugee success by location, ideally with cross-validation or hierarchical pooling for low-sample locations.
Optimization or mechanism selection: Adopt pure outcome-maximization, submodular competition models, or output-constrained priority mechanisms as appropriate for the context and policy objectives.
Fairness and harm-checking: Implement fairness constraints at group or individual levels, and deploy inverse-matching or learned correction steps to prevent algorithmic harm.
Online/dynamic adaptation: For streaming arrivals or uncertain settings, combine rolling-horizon assignment with robust, regret-minimizing online learners.
Policy evaluation: Rely on model-free or conservative, inference-aware frameworks for evaluating expected gains, controlling for winner's curse.
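As one small illustration of the prediction calibration step, the sketch below shrinks each location's raw employment rate toward the overall rate, with less shrinkage for locations that have more data. This is a simple stand-in for hierarchical pooling, not a full hierarchical model; the function name, the `prior_strength` value, and the counts are hypothetical.

```python
def pooled_rates(successes, counts, prior_strength=20.0):
    """Shrink per-location success rates toward the global rate.

    prior_strength acts like a number of pseudo-observations at the global
    rate, so small locations are pulled strongly and large ones barely move.
    """
    global_rate = sum(successes) / sum(counts)
    return [
        (s + prior_strength * global_rate) / (c + prior_strength)
        for s, c in zip(successes, counts)
    ]

# Hypothetical history: a large location and one with only three cases.
hist_successes, hist_counts = [300, 2], [1000, 3]
rates = pooled_rates(hist_successes, hist_counts)
print([round(r, 3) for r in rates])
```

The raw rate of the small location (2 of 3) would dominate any weight-maximizing assignment; after pooling it moves most of the way toward the global rate, which limits how much the optimizer can exploit noise from low-sample locations.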

Key limitations include sensitivity to model misspecification, lack of overlap in observed outcomes, variance in subgroup outcomes, and practical challenges in eliciting reliable preference rankings from refugees.

References

"Matchings, Predictions and Counterfactual Harm in Refugee Resettlement Processes" (Lee et al., 2024)
"Migration as Submodular Optimization" (Gölz et al., 2018)
"Combining Outcome-Based and Preference-Based Matching: A Constrained Priority Mechanism" (Acharya et al., 2019)
"Enabling Trade-offs in Machine Learning-based Matching for Refugee Resettlement" (Olberg et al., 2022)
"Adaptive Combinatorial Allocation" (Kasy et al., 2020)
"Outcome-Driven Dynamic Refugee Assignment with Allocation Balancing" (Bansak et al., 2020)
"Optimizing Sponsored Humanitarian Parole" (Farajzadeh et al., 2023)
"Asylum Assignment and Burden-Sharing" (Caspari et al., 2025)
"Group fairness in dynamic refugee assignment" (Freund et al., 2023)
"Dynamic Matching with Post-allocation Service and its Application to Refugee Resettlement" (Bansak et al., 2024)
"Refugee Resettlement Housing Scout" (Ahsan et al., 2016)
"Solve the Refugee Crisis with Data" (Liu, 2024)
"Winner's Curse Drives False Promises in Data-Driven Decisions: A Case Study in Refugee Matching" (Bastani et al., 2026)