From Bandits to Experts: On the Value of Side-Observations
Published 13 Jun 2011 in cs.LG and stat.ML | (1106.2436v3)
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known "experts" setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds.
The paper introduces an online learning setting between bandits and experts, where side observations about unchosen actions' rewards are available via a graph structure.
Two algorithms, ExpBan and ELP, are proposed that leverage graph properties (clique-partition number and independence number) to derive improved theoretical regret bounds for fixed and time-varying graphs, respectively.
This research provides practical models for contexts with partial feedback and side observations (like web advertising) and expands the theoretical landscape of online learning by integrating graph-theoretic insights.
Examining "From Bandits to Experts: On the Value of Side-Observations"
The paper presents an exploration into a nuanced online learning setting situated between the classic "experts" setting and the multi-armed bandits framework. This intermediate setting is characterized by availability of "side observations" concerning the rewards of unchosen actions, represented as nodes in a graph structure. This paper contributes practical algorithms endowed with theoretical regret bounds that incorporate graph-theoretic properties, thereby expanding the applicability and understanding of online learning under partial feedback conditions.
Problem Formulation
The paper sets its investigation in the adversarial online learning scenario, where a decision maker repeatedly selects actions over T rounds. Traditionally, the "experts" setting assumes visibility of all actions' rewards, while the bandits setting limits observation to the chosen action's reward. This research introduces an interpolated scenario where side observations provide insights into the potential rewards of some other actions within a graph-based structure. Here, the nodes represent actions, and edges encode the availability of side observations—a conceptually richer setup fueling the study's core contributions.
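The feedback structure described above can be sketched in a few lines of Python. This is an illustrative toy model, not code from the paper: the graph, reward values, and function names are invented for clarity. Playing an action reveals its own reward (bandit feedback) plus the rewards of every action it is linked to (side observations).

```python
def play_round(rewards, graph, action):
    """Return the rewards revealed when `action` is played.

    rewards: dict mapping each action to its (adversarially chosen) reward.
    graph:   dict mapping each action to the set of actions it informs on
             (each action is assumed to inform on itself).
    """
    observed = {action: rewards[action]}   # bandit feedback
    for neighbor in graph[action]:         # side observations
        observed[neighbor] = rewards[neighbor]
    return observed

# Toy example with 3 actions: playing 0 also reveals action 1's reward.
graph = {0: {0, 1}, 1: {1}, 2: {2}}
rewards = {0: 0.3, 1: 0.9, 2: 0.5}
print(play_round(rewards, graph, 0))  # {0: 0.3, 1: 0.9}
```

When the graph is complete, every round reveals all rewards (the experts setting); when it has no edges, only the chosen action's reward is revealed (the bandits setting).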
Method and Results
Two pivotal algorithms are introduced—ExpBan and ELP, each designed to handle differing structural properties of the feedback graph.
ExpBan Algorithm:
Utilizes a fixed graph throughout T rounds.
Constructs action cliques within the feedback graph, each treated as "meta-actions."
Runs a standard bandit algorithm (EXP3) over these meta-actions to achieve a regret bound of O(√(χ̄(G)·log(k)·T)), where χ̄(G) is the clique-partition number of the graph G.
Constraints include the computational intractability of finding an optimal clique partition, though this is a challenge shared with many combinatorial optimization problems.
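A minimal sketch of the clique-partitioning step is below. Since finding an optimal partition is intractable, as noted above, this uses a simple greedy heuristic as an illustrative stand-in; the function name and the toy graph are assumptions, not the paper's prescription. Each resulting clique would then be treated as one meta-action for EXP3.

```python
def greedy_clique_partition(adj):
    """Greedily partition nodes into cliques of an undirected feedback graph.

    adj: dict mapping each node to the set of its neighbors.
    A node joins the first existing clique it is adjacent to in full;
    otherwise it starts a new clique. This is a heuristic: it need not
    find the minimum number of cliques (the clique-partition number).
    """
    cliques = []
    for v in adj:
        for clique in cliques:
            if all(u in adj[v] for u in clique):
                clique.add(v)
                break
        else:
            cliques.append({v})
    return cliques

# 4 actions: {0, 1, 2} mutually linked, 3 isolated -> 2 cliques.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
print(greedy_clique_partition(adj))  # [{0, 1, 2}, {3}]
```

Within a clique, playing any member reveals every member's reward, which is why a clique can be collapsed into a single meta-action with full (experts-style) feedback inside it.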
ELP Algorithm:
Adapts to time-varying, directed graphs.
Incorporates a nuanced exploration strategy derived via linear programming.
Achieves a regret bound of O(√(log(k)·∑_{t=1}^{T} α(G_t))), where α(G_t) is the independence number of the undirected graph G_t at round t.
This algorithm overcomes the computational shortcomings of ExpBan and offers robust performance under dynamic conditions.
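The core of such an approach is an importance-weighted reward estimator that exploits side observations. The sketch below shows only that estimator, with all names illustrative; the paper's ELP algorithm additionally tunes the sampling distribution itself via a linear program, which is omitted here.

```python
def reward_estimates(p, graph, observed):
    """Unbiased importance-weighted reward estimates with side observations.

    p:        dict mapping each action to its probability of being played.
    graph:    dict mapping each action to the set of actions it informs on
              (including itself).
    observed: dict of revealed rewards for the round.

    Action i is observed with probability q_i = sum of p_j over all j
    whose play reveals i; dividing a revealed reward by q_i keeps the
    estimate unbiased in expectation.
    """
    q = {i: sum(p[j] for j in p if i in graph[j]) for i in p}
    return {i: (observed[i] / q[i] if i in observed else 0.0) for i in p}

# Playing 0 (prob 0.5) also reveals 1, so 1 is observed with prob 1.0.
p = {0: 0.5, 1: 0.5}
graph = {0: {0, 1}, 1: {1}}
print(reward_estimates(p, graph, {0: 0.3, 1: 0.9}))
```

More edges raise each q_i, shrinking the variance of the estimates; this is the mechanism through which denser feedback graphs (smaller independence number) translate into lower regret.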
Implications and Future Directions
The implications of this research are manifold. Practically, the models and algorithms proposed can adapt to contexts like web advertising and sensory networks, where side observations are a natural element. Theoretically, the interpolation between bandits and experts expands the landscape of online learning problems by incorporating graph-theoretic insights into regret minimization strategies.
The findings suggest a new frontier in partial feedback learning, hinting at further studies to:
Tighten the gap between upper and lower regret bounds for directed graphs.
Extend the framework to account for partially known or evolving graph topologies.
Explore potential high-probability bounds for regret, expanding current expected-value-focused results.
Conclusion
The paper's development of algorithms utilizing side observations advances the understanding of online learning systems operating under complex feedback. By embedding structural insights into regret analysis, this work not only provides new tools for immediate application but also lays the groundwork for future theoretical advancements in learning frameworks bridging the "experts" and bandits paradigms.