- The paper introduces policy regret and the SPO algorithm to model long-term ML outcomes using a multi-armed bandit framework.
- It demonstrates that SPO achieves sub-linear policy regret, outperforming conventional algorithms in both synthetic and real-world scenarios.
- The work shifts ML decision-making towards equitable practices by accounting for evolving rewards and long-term community impacts.
Addressing the Long-term Impact of ML Decisions via Policy Regret
The paper "Addressing the Long-term Impact of ML Decisions via Policy Regret" examines an often overlooked aspect of applying ML to decision-making: its long-term impact on communities and individuals. With ML systems increasingly used in domains such as lending, education, and employment, understanding how today's decisions shape future opportunities is crucial. The authors model these decisions in a multi-armed bandit framework, introducing an approach that accounts for the evolving nature of rewards associated with ML-informed allocations.
Concretely, the study models communities as arms in a multi-armed bandit setting, where each arm's reward changes with the number of times it has been selected. The key contribution is the notion of policy regret, a more stringent criterion than conventional external regret. Policy regret is defined over an entire sequence of allocations, capturing how current ML decisions shape potential future outcomes. The authors argue that an acceptable allocation policy should account for the growth potential of the options or communities affected.
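Because rewards in this model depend only on how often each arm has been pulled, policy regret can be illustrated with a short sketch. The function names and the dynamic-program benchmark below are illustrative assumptions, not the paper's formulation: `reward_fns[i]` is a hypothetical function giving arm `i`'s expected reward on its (n+1)-th pull.

```python
from itertools import accumulate

def best_allocation(reward_fns, T):
    """Best total reward from any split of T pulls across arms.

    Rewards here depend only on per-arm pull counts, so the best
    policy is a count allocation, found by a small dynamic program.
    """
    # prefix[i][n] = total reward from pulling arm i exactly n times
    prefix = [[0.0] + list(accumulate(f(n) for n in range(T)))
              for f in reward_fns]
    dp = prefix[0][:]  # best total using only arm 0
    for row in prefix[1:]:
        dp = [max(dp[t - n] + row[n] for n in range(t + 1))
              for t in range(T + 1)]
    return dp[T]

def policy_regret(reward_fns, pulls):
    """Best achievable total minus the realized total of `pulls`."""
    counts = [0] * len(reward_fns)
    realized = 0.0
    for arm in pulls:
        realized += reward_fns[arm](counts[arm])
        counts[arm] += 1
    return best_allocation(reward_fns, len(pulls)) - realized
```

For example, with one arm that always pays 1 and one that always pays 0, spending two of four rounds on the zero arm incurs a policy regret of 2; external regret would credit those wasted pulls differently, since it compares against a fixed arm under the realized history rather than the counterfactual best sequence.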
The paper's primary technical contribution is the Single-Peaked Optimism (SPO) algorithm. SPO is designed for "single-peaked bandits," in which reward functions initially increase with the number of pulls and then decrease after a peak. This setting reflects real-world scenarios where resources yield increasing returns up to a point and diminishing returns thereafter. SPO achieves sub-linear policy regret over long horizons. Unlike traditional algorithms, SPO accounts for potential long-term rewards, avoiding the pitfall of optimizing solely for immediate gains.
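A single-peaked reward curve can be sketched as a simple piecewise-linear function; the peak location and slopes below are arbitrary illustrative parameters, not values from the paper.

```python
def single_peaked(n, peak=5, slope_up=0.2, slope_down=0.1):
    """Illustrative single-peaked reward: rises with the pull count n
    until `peak`, then declines as returns saturate and reverse."""
    if n <= peak:
        return slope_up * n
    return slope_up * peak - slope_down * (n - peak)
```

An algorithm that judged this arm only by its early pulls would undervalue it, since the first few rewards are small; optimism about the pre-peak region, in the spirit of the paper's motivation, is what justifies continuing to invest until the peak is reached.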
Theoretical analysis establishes that SPO outperforms conventional bandit algorithms, particularly in non-stationary reward settings that many current models fail to address adequately. Empirical results corroborate these claims: SPO reduces regret significantly in both synthetic and real-world simulations, such as credit-lending decisions based on the FICO dataset. The experiments underscore SPO's robustness, particularly in high-noise settings and when rewards fluctuate over time.
The implications of this work are significant. It prompts a shift from short-term, myopic decision-making to strategies that account for broader social impacts and future potential. This is not merely an algorithmic development but a conceptual one, suggesting that social insight and domain expertise be integrated into the design of allocation algorithms. Future work could extend the framework to more complex reward curves and to handling diverse real-world uncertainties more effectively.
Overall, the study opens pathways for developing algorithms that align ML decisions with equitable long-term societal welfare, a crucial consideration amidst the pervasive adoption of automated systems in decision-making processes.