
Regret-Optimal Bilateral Pricing

Updated 6 February 2026
  • The paper introduces mechanisms that minimize regret relative to fixed-price benchmarks by posting prices under various feedback models.
  • It details algorithmic strategies like Follow-the-Best-Price and Scouting Bandits that achieve regret rates ranging from O(√T) to O(T) based on information regimes.
  • The study reveals sharp phase transitions and trade-offs among feedback granularity, incentive compatibility, and budget balance in bilateral trade settings.

Regret-optimal bilateral pricing studies sequential mechanisms in bilateral trade environments that minimize regret relative to a natural fixed-price or mechanism benchmark, under various information and distributional assumptions, budget constraints, and feedback models. The focus is on algorithmic strategies for posting prices to buyers and sellers with unknown, private valuations over T rounds, with the goal of maximizing cumulative gain from trade (GFT) subject to incentive compatibility, individual rationality, and budget balance. The literature has fully characterized tight regret rates for most settings and revealed sharp phase transitions between sublinear and linear regret, depending on model assumptions and feedback granularity.

1. Formal Framework and Benchmarking

In regret-optimal bilateral pricing, at each round t = 1, \dots, T, a seller (value s_t \in [0,1]) and a buyer (value b_t \in [0,1]) arrive. The mechanism posts a price (or prices), and a trade occurs if accept/reject conditions are met. The canonical objective is to maximize total gain from trade:

GFT_t(p) = (b_t - s_t)\,\mathbf{1}\{s_t \leq p \leq b_t\}

for single-price mechanisms; or, more generally, with two prices (p_t, q_t),

GFT_t(p_t, q_t) = (b_t - s_t) \cdot \mathbf{1}\{s_t \leq p_t \leq q_t \leq b_t\}.

Regret is computed against the best fixed (possibly randomized) mechanism chosen in hindsight. For the fixed-price benchmark,

R_T = \max_{p \in [0,1]} \mathbb{E}\left[ \sum_{t=1}^T GFT_t(p) - \sum_{t=1}^T GFT_t(P_t) \right].

Alternative benchmarks include the best mechanism satisfying Dominant Strategy Incentive Compatibility and Individual Rationality (DSIC+IR), randomized price-pair distributions under Global Budget Balance (GBB), or the best fixed pair (p, q) with p \leq q ensuring per-round or global non-negative profit.
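
To make the benchmark concrete, here is a minimal sketch (illustrative only, not taken from the cited papers) that computes per-round GFT and the regret of a posted-price sequence against the best fixed price on a finite grid, which stands in for the continuum [0,1]:

```python
def gft(p, s, b):
    """Gain from trade at posted price p: trade occurs iff s <= p <= b."""
    return (b - s) if s <= p <= b else 0.0

def best_fixed_total(pairs, grid):
    """Hindsight benchmark: best cumulative GFT over a finite price grid."""
    return max(sum(gft(p, s, b) for s, b in pairs) for p in grid)

def regret(posted, pairs, grid):
    """Regret of a posted-price sequence vs. the best fixed grid price."""
    earned = sum(gft(p, s, b) for p, (s, b) in zip(posted, pairs))
    return best_fixed_total(pairs, grid) - earned
```

For example, with valuation pairs (0.2, 0.8) and (0.4, 0.9), any fixed price in [0.4, 0.8] trades in both rounds, so a posted sequence that misses the first trade incurs regret equal to that round's surplus.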

2. Feedback Models and Their Impact

Regret bounds are fundamentally sensitive to the available feedback:

  • Full Feedback (Direct Revelation): After posting a price, the mechanism observes the full pair (s_t, b_t). This enables nonparametric estimation of the gain-from-trade curve and supports rates matching finite-action online learning (\Theta(\sqrt{T}) for i.i.d. or even correlated valuations) (Cesa-Bianchi et al., 2021, Cesa-Bianchi et al., 2021).
  • Realistic (Partial) Feedback: Only accept/reject bits (e.g., (\mathbf{1}\{s_t \leq p_t\}, \mathbf{1}\{p_t \leq b_t\})), or even a single trade/no-trade bit, are observed after each price post. Under partial feedback, learning is severely information-limited; for stochastic, independent, bounded-density valuations, the best possible regret is \Theta(T^{2/3}), achieved via bandit-style exploration and convolutional estimation (Cesa-Bianchi et al., 2021, Cesa-Bianchi et al., 2021). If either independence or smoothness is dropped, or in adversarial settings, regret reverts to linear.
  • Semi-feedback (GBB): Mechanisms may observe only the seller value together with the trade/no-trade outcome, and must enforce profit nonnegativity only in aggregate. Here, careful phase-based exploration and bandit learning achieve \widetilde{O}(T^{2/3}) minimax regret, matching lower bounds (Jin, 23 Jan 2026).
  • One-bit Feedback with Two Prices: Posting two prices and observing only trade occurrence allows unbiased estimation of GFT at any nominal price position, enabling O(T^{3/4}) regret in adversarial or smoothed-adversary settings (Azar et al., 2022, Cesa-Bianchi et al., 2023).
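
One way such an unbiased one-bit estimator can be built (a sketch of the standard randomized two-price trick; the exact construction in the cited papers may differ in details): to estimate GFT at a target price x, flip a fair coin and either post the pair (x, U) with U uniform on [x, 1], or (U, x) with U uniform on [0, x], then rescale the single observed trade bit.

```python
import random

def one_bit_gft_estimate(x, s, b, rng):
    """Unbiased one-bit estimate of GFT(x) = (b - s) * 1{s <= x <= b}.

    Posts two prices (seller price p <= buyer price q) and observes only
    the trade bit 1{s <= p and q <= b}.  The private values s, b are used
    here solely to simulate that bit.
    """
    if rng.random() < 0.5:
        # Estimate the buyer-side piece (b - x)^+ * 1{s <= x}.
        u = x + (1.0 - x) * rng.random()      # u ~ Unif[x, 1]
        trade = (s <= x) and (u <= b)
        return 2.0 * (1.0 - x) * float(trade)
    else:
        # Estimate the seller-side piece (x - s)^+ * 1{x <= b}.
        u = x * rng.random()                  # u ~ Unif[0, x]
        trade = (s <= u) and (x <= b)
        return 2.0 * x * float(trade)
```

Averaging many independent estimates at the same x recovers GFT(x): the two branches have expectations 1{s <= x}(b - x)^+ and 1{x <= b}(x - s)^+, which sum to (b - s) exactly when s <= x <= b and vanish otherwise.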

The following table summarizes regret regimes for fixed-price mechanisms under key feedback and distribution settings (rates for R_T given up to polylogarithmic factors):

Setting | Regret Bound | Reference
Full feedback, stochastic i.i.d. | \Theta(\sqrt{T}) | (Cesa-Bianchi et al., 2021)
Realistic feedback, i.i.d. + independent valuations + bounded densities | \Theta(T^{2/3}) | (Cesa-Bianchi et al., 2021)
Realistic feedback, i.i.d. + only one of independence / bounded densities | \Theta(T) | (Cesa-Bianchi et al., 2021)
Adversarial (any feedback) | \Theta(T) | (Cesa-Bianchi et al., 2021)
Semi-feedback, adversarial, GBB | \widetilde{O}(T^{2/3}) | (Jin, 23 Jan 2026)
One-bit, two prices, adversarial | O(T^{3/4}) | (Azar et al., 2022; Cesa-Bianchi et al., 2023)

3. Algorithms and Tight Rates

Full Feedback (Direct Revelation)

The "Follow-the-Best-Price" (FBP) algorithm sequentially selects the empirically best price so far:

P_t \in \arg\max_{p \in [0,1]} \frac{1}{t-1} \sum_{i=1}^{t-1} GFT_i(p).

For any i.i.d. or correlated valuation sequence, this achieves R_T = O(\sqrt{T}) regret (Cesa-Bianchi et al., 2021, Cesa-Bianchi et al., 2021). This matches lower bounds that reduce to experts or Lipschitz bandit problems over the price interval.
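
A grid-based sketch of FBP (the finite grid is an assumption for illustration; the papers analyze the continuous price interval via uniform convergence of the empirical GFT curve):

```python
def gft(p, s, b):
    """Gain from trade at posted price p: trade occurs iff s <= p <= b."""
    return (b - s) if s <= p <= b else 0.0

def follow_the_best_price(pairs, grid):
    """Follow-the-Best-Price over a finite grid: each round, post the
    price with the highest cumulative GFT on past revealed valuations."""
    totals = {p: 0.0 for p in grid}
    earned = 0.0
    for s, b in pairs:
        p_t = max(grid, key=lambda p: totals[p])  # ties -> first grid point
        earned += gft(p_t, s, b)
        for p in grid:            # full feedback: update every grid price
            totals[p] += gft(p, s, b)
    return earned
```

On a repeated pair (s, b) = (0.2, 0.8) with grid {0, 0.25, 0.5, 0.75, 1}, the first round is wasted on the default price 0, after which the algorithm locks onto a trading price and earns the full surplus of 0.6 every round.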

Realistic Feedback (Partial/Posted Price)

Under independent and bounded-density seller/buyer values, the "Scouting Bandits" (SB) method achieves O(T^{2/3}) regret via a two-phase approach: a randomized exploration phase estimates GFT functionals across a price grid using only accept/reject bits, followed by a multi-armed bandit phase to exploit the best grid price (Cesa-Bianchi et al., 2021, Cesa-Bianchi et al., 2021). Without independence or smoothness, learning becomes information-theoretically hard, and regret reverts to \Theta(T).
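
The two-phase idea can be sketched as a simplified explore-then-commit routine (illustrative only: the actual Scouting Bandits algorithm runs a bandit subroutine over the grid in the second phase and tunes the phase length to obtain the T^{2/3} rate). Under independence, acceptance frequencies determine GFT through the decomposition GFT(p) = P(s \leq p) \int_p^1 P(u \leq b)\,du + P(p \leq b) \int_0^p P(s \leq u)\,du, approximated below by Riemann sums over the grid:

```python
import random

def scouting_sketch(pairs, grid, explore_rounds):
    """Explore-then-commit sketch: exploration round-robins over the grid
    recording accept/reject bits (1{s<=p}, 1{p<=b}); under independence,
    those frequencies reconstruct the GFT curve, and the empirically best
    grid price is posted for the remaining rounds."""
    k = len(grid)
    sa = [0.0] * k   # seller-accept counts per grid price
    ba = [0.0] * k   # buyer-accept counts per grid price
    cnt = [0] * k
    earned = 0.0
    for t, (s, b) in enumerate(pairs[:explore_rounds]):
        i = t % k
        p = grid[i]
        cnt[i] += 1
        sa[i] += float(s <= p)
        ba[i] += float(p <= b)
        earned += (b - s) if s <= p <= b else 0.0
    fs = [sa[i] / max(cnt[i], 1) for i in range(k)]  # est. P(s <= grid[i])
    gb = [ba[i] / max(cnt[i], 1) for i in range(k)]  # est. P(grid[i] <= b)
    h = 1.0 / k                                      # Riemann cell width

    def gft_hat(i):
        buyer_tail = sum(gb[j] for j in range(i, k)) * h
        seller_head = sum(fs[j] for j in range(i + 1)) * h
        return fs[i] * buyer_tail + gb[i] * seller_head

    best = max(range(k), key=gft_hat)
    for s, b in pairs[explore_rounds:]:
        p = grid[best]
        earned += (b - s) if s <= p <= b else 0.0
    return earned, grid[best]
```

With independent Unif[0,1] valuations the true curve is GFT(p) = p(1-p)/2, maximized at p = 1/2, and the committed price concentrates near the middle of the grid.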

Adversarial and Distributional Robustness

  • For adversarial valuations (even under full feedback), no algorithm can achieve sublinear regret for the classical fixed-price benchmark (Cesa-Bianchi et al., 2021, Azar et al., 2022). In this strong sense, information does not accumulate usefully over time unless strong structural assumptions are made.
  • \alpha-regret analysis shows that no algorithm achieves sublinear \alpha-regret for any \alpha < 2, but sublinear 2-regret is possible (i.e., competing against half of the offline optimal gain). Full feedback allows O(\sqrt{T \log T}) 2-regret using multiplicative weights over a fine price grid; one-bit feedback with two prices gives O(T^{3/4}\sqrt{\log T}) 2-regret (Azar et al., 2022).
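
The multiplicative-weights ingredient is standard Hedge over a finite price grid under full feedback, sketched below (illustrative only; the 2-regret guarantee in Azar et al. (2022) involves a rescaled benchmark that this sketch does not reproduce):

```python
import math
import random

def hedge_prices(pairs, grid, eta, rng):
    """Hedge (gains version) over a finite price grid with full feedback:
    sample a price from the weight distribution each round, then update
    every grid price by the GFT it would have earned."""
    w = [1.0] * len(grid)
    earned = 0.0
    for s, b in pairs:
        # Sample a grid index proportionally to the weights.
        r = rng.random() * sum(w)
        i, acc = 0, w[0]
        while acc < r:
            i += 1
            acc += w[i]
        earned += (b - s) if s <= grid[i] <= b else 0.0
        # Full feedback: exponentially reweight every price by its gain.
        for j, p in enumerate(grid):
            g = (b - s) if s <= p <= b else 0.0
            w[j] *= math.exp(eta * g)
    return earned
```

With gains in [0, 1], the textbook learning rate is eta = sqrt(log k / T) for k grid prices over T rounds.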

Global Budget Balance and Extensions

Global Budget Balance (GBB) constraints relax per-round non-negative profit to an aggregate requirement. Under semi-feedback, the minimax regret against the best static price is O(T^{2/3}\,\mathrm{polylog}\,T); matching lower bounds confirm this rate is tight (Jin, 23 Jan 2026). The GBB framework also enables sublinear regret against stronger benchmarks, such as distributions over price-pairs, and interpolates between regret and permitted total subsidy (Bernasconi et al., 2023, Lunghi et al., 15 Jul 2025).

4. Phase Transitions and Impossibility Results

Regret-optimal bilateral pricing exhibits sharp phase transitions:

  • Full feedback enables nonparametric learning: sublinear \sqrt{T} regret is possible, even for correlated valuation sequences.
  • Partial feedback imposes a partial monitoring barrier: Sublinear regret is only achievable under independence and bounded density; otherwise, it is information-theoretically impossible to distinguish critical price regions.
  • Adversarial environments revert to linear regret: Without structural or stochastic assumptions, exploration is non-informative and no learning takes place.
  • Budget-balance constraints interact strongly with feedback: Enforcing per-round budget balance under realistic feedback forces linear regret, while global budget balance plus richer semi-feedback allows O(T^{2/3}) regret, and interpolating between these constraints yields a full spectrum of regret/violation trade-offs (Lunghi et al., 15 Jul 2025).
  • Distributional benchmarks create new impossibility frontiers: Algorithms achieving sublinear regret against the best fixed price in hindsight may suffer linear regret against the best feasible distributional benchmark, unless stronger assumptions are made or feedback is enhanced (Lunghi et al., 5 Feb 2026, Bernasconi et al., 2023).

5. Methodological Foundations and Key Technical Devices

A range of statistical and algorithmic tools underpin regret-optimal bilateral pricing:

  • Uniform convergence and empirical process theory are used to show convergence of the empirical GFT curve to its expectation over [0,1], even for a continuum of actions (Cesa-Bianchi et al., 2021).
  • Partial monitoring and decompositional analysis: Regret lower bounds reduce to embedding hard bandit or partial monitoring instances; decomposition lemmas relate observed feedback to underlying GFT functions under independence and smoothness (Cesa-Bianchi et al., 2021).
  • Bandit algorithms and unbiased estimators: Bandit subroutines (e.g., MOSS, EXP3) are deployed in the bandit phase, while unbiased estimation of GFT via randomized price pairs is critical for learning with one-bit feedback (Cesa-Bianchi et al., 2021, Azar et al., 2022).
  • Phase-based and block decomposition: Many GBB algorithms separate profit collection, global feasibility enforcement, and exploitation via bandit optimization over adaptive grids (Jin, 23 Jan 2026, Lunghi et al., 15 Jul 2025).
  • Concentration inequalities and Rademacher symmetrization: Accurate estimation across all candidate mechanisms is achieved via chaining arguments and uniform concentration over data-dependent nets, especially in stochastic and indirect feedback settings (Gregorio et al., 26 Sep 2025).
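
As a toy illustration of uniform convergence of the empirical GFT curve (assuming independent Unif[0,1] valuations, for which GFT(p) = p(1-p)/2 in closed form), one can measure the sup-distance between the empirical and true curves on a grid:

```python
import random

def true_gft_uniform(p):
    """E[(b - s) * 1{s <= p <= b}] for independent s, b ~ Unif[0,1]."""
    return p * (1 - p) / 2

def sup_deviation(n, grid, rng):
    """Sup over the grid of |empirical GFT curve - true GFT curve|,
    using n i.i.d. valuation pairs."""
    pairs = [(rng.random(), rng.random()) for _ in range(n)]

    def emp(p):
        return sum((b - s) for s, b in pairs if s <= p <= b) / n

    return max(abs(emp(p) - true_gft_uniform(p)) for p in grid)
```

The deviation shrinks at the usual O(1/\sqrt{n}) rate uniformly over the grid, which is the finite-sample phenomenon the chaining and symmetrization arguments formalize over the full continuum of prices.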

6. Broader Implications and Open Problems

The regret-optimal bilateral pricing literature offers a clear "regret frontier": \sqrt{T} regret is achievable under full information, T^{2/3} under partial feedback with independence and smoothness, and regret is linear in T otherwise (Cesa-Bianchi et al., 2021, Cesa-Bianchi et al., 2021). Extensions include:

  • Global budget balance enabling feasible learning beyond per-round constraints, and interpolations quantifying the fundamental trade-off between efficiency and allowable market subsidy (Bernasconi et al., 2023, Lunghi et al., 15 Jul 2025).
  • Contextual and feature-based extensions generalize the problem, with tight \widetilde{O}(T^{2/3}) rates under pooling and successive elimination (Gaucher et al., 2024).
  • Fairness objectives (e.g., fair GFT) significantly alter attainability and difficulty of the learning problem (Bachoc et al., 2024).
  • Open directions: Beyond static bilateral trade, extension to multi-unit settings, richer market structures, reward or feedback delay, strategic agent behavior, and contextual or non-i.i.d. valuation models comprise active areas of investigation (Cesa-Bianchi et al., 2021).

The field has thus mapped the complexity landscape and algorithmic possibilities for regret-optimal bilateral pricing under a wide range of operational requirements and information regimes, clarifying the interaction between statistical learnability, incentive constraints, and market design.
