
Conditional Variational Information Bottleneck

Updated 27 January 2026
  • cVIB is a framework that extends the classic information bottleneck by selectively gating privileged inputs based on standard observations.
  • It employs a stochastic gating mechanism and a variational surrogate to optimize the trade-off between retaining predictive features and minimizing unnecessary information.
  • Empirical evaluations show that cVIB improves generalization and reduces computational costs across tasks like planning, navigation, multi-agent communication, and visual attention.

The Conditional Variational Information Bottleneck (cVIB) framework extends classical information bottleneck (IB) approaches to scenarios involving both standard and privileged inputs, imposing an information-theoretic constraint on the conditional contribution of the privileged input to the latent representation. This enables models to balance predictive accuracy against minimization of unnecessary, potentially costly, or overfitting-prone information transmission from specialized sources such as goals, planning rollouts, or communication, all while making access decisions stochastically and conditionally on standard observed data (Goyal et al., 2020).

1. Foundation and Motivation

The traditional information bottleneck method is formulated as an optimization over latent representations $Z$ that achieves an optimal tradeoff between preserving predictive information about a target variable $Y$ and compressing the input data $X$. In many settings, particularly in reinforcement learning and multi-agent systems, inputs can be naturally split into a standard input $S$ (such as raw sensorimotor observations) and a privileged input $G$ (such as task goals, planned trajectories, or communication signals). The Conditional Variational Information Bottleneck constrains the information flow from $G$ beyond what is already provided by $S$, addressing the need to mitigate overfitting, improve generalization, and control the cost or risk associated with accessing $G$ (Goyal et al., 2020).
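For context, the classical IB objective that cVIB generalizes can be stated as the following Lagrangian trade-off (this is the standard formulation, given here for reference rather than taken from Goyal et al., 2020):

$$\max_{q(z \mid x)} \; I(Z; Y) \;-\; \beta\, I(Z; X), \qquad \beta > 0$$

cVIB replaces both terms with their conditional counterparts, $I(Z; Y \mid S)$ and $I(Z; G \mid S)$, as formalized in the next section.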

2. Formal Conditional Information Bottleneck Objective

Given a data-generating distribution $p_{\text{dist}}(S, G, Y)$ and a conditional encoder $q_\theta(Z \mid S, G)$, the objective is to maximize the conditional mutual information $I(Z; Y \mid S)$, promoting retention of relevant predictive features, while minimizing the conditional mutual information $I(Z; G \mid S)$, discouraging gratuitous dependence on $G$. Using a Lagrange multiplier $\beta > 0$, the optimization target is:

$$\max_{q_\theta,\, p_\phi} \; I(Z; Y \mid S) \;-\; \beta\, I(Z; G \mid S)$$

Empirical optimization leverages a variational surrogate:

$$L(\theta, \phi) = \mathbb{E}_{(s, g, y) \sim p_{\text{dist}}} \Big[\, \mathbb{E}_{z \sim q_\theta(z \mid s, g)} \big[ -\log p_\phi(y \mid s, z) \big] + \beta\, D_{\text{KL}}\big( q_\theta(z \mid s, g) \,\|\, r_\psi(z \mid s) \big) \Big]$$

where $r_\psi(z \mid s)$ is an amortized prior over $Z$ conditioned only on $S$. The $D_{\text{KL}}$ term upper-bounds $I(Z; G \mid S)$, implementing the bottleneck penalty (Goyal et al., 2020).
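To make the surrogate concrete, the following PyTorch sketch implements the loss for the simple case in which both $q_\theta(z \mid s, g)$ and $r_\psi(z \mid s)$ are diagonal Gaussians and $Y$ is a class label. It is a minimal illustration under those assumptions; the module names, layer sizes, and the classification likelihood are hypothetical choices and not details of the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence


class GaussianHead(nn.Module):
    """Small MLP mapping features to a diagonal Gaussian over the latent Z."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * z_dim))

    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5, 2).exp())


class ConditionalVIB(nn.Module):
    def __init__(self, s_dim, g_dim, z_dim, n_classes, beta=1e-3):
        super().__init__()
        self.encoder = GaussianHead(s_dim + g_dim, z_dim)       # q_theta(z | s, g)
        self.prior = GaussianHead(s_dim, z_dim)                  # r_psi(z | s)
        self.decoder = nn.Sequential(nn.Linear(s_dim + z_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_classes))  # p_phi(y | s, z)
        self.beta = beta

    def loss(self, s, g, y):
        q_zsg = self.encoder(torch.cat([s, g], dim=-1))
        r_zs = self.prior(s)
        z = q_zsg.rsample()                                      # reparameterized sample of Z
        nll = F.cross_entropy(self.decoder(torch.cat([s, z], dim=-1)), y)  # -log p_phi(y | s, z)
        kl = kl_divergence(q_zsg, r_zs).sum(dim=-1).mean()       # penalizes information about G beyond S
        return nll + self.beta * kl
```

Increasing $\beta$ pushes the encoder toward the $S$-only prior; the gated variant in the next section makes that trade-off an explicit, input-dependent decision.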

3. Stochastic Gating and Differentiable Mixture Encoder

To enforce the property that the model selects its information budget for $G$ based solely on $S$, the encoder employs a stochastic gating mechanism:

  • A "bandwidth network" $B_\theta$, taking only $S$ as input, predicts a gating probability $d_{\text{cap}}(s) \in (0, 1)$.
  • With probability $d_{\text{cap}}(s)$, $G$ is accessed via a deterministic encoder $f_{\text{enc}}(s, g)$, setting $z = f_{\text{enc}}(s, g)$.
  • Otherwise, $Z$ is sampled from the prior $r_\psi(z \mid s)$, omitting $G$ entirely.

The full encoder is represented as:

$$q_\theta(z \mid s, g) = d_{\text{cap}}(s)\, \delta\big(z - f_{\text{enc}}(s, g)\big) + \big[ 1 - d_{\text{cap}}(s) \big]\, r_\psi(z \mid s)$$

The KL divergence between this mixture and $r_\psi(z \mid s)$ admits a closed-form, fully differentiable expression, obviating the need for REINFORCE or other high-variance estimators during training. This mixture formulation is crucial for stochastic, learnable, input-dependent gating of privileged information (Goyal et al., 2020).
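The sketch below illustrates the gated encoder in PyTorch. Two simplifications are assumptions of this illustration, not claims about the original formulation: the accessed branch is relaxed from a Dirac delta to a narrow Gaussian centred at $f_{\text{enc}}(s, g)$, and the mixture KL is replaced by the gate-weighted term $d_{\text{cap}}(s)\, D_{\text{KL}}(q_{\text{acc}} \,\|\, r_\psi)$, which upper-bounds the true mixture KL by the joint convexity of the KL divergence and remains fully differentiable in $d_{\text{cap}}(s)$. Layer sizes and names are hypothetical.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class GatedMixtureEncoder(nn.Module):
    """q_theta(z|s,g) = d_cap(s) * delta(z - f_enc(s,g)) + (1 - d_cap(s)) * r_psi(z|s)."""
    def __init__(self, s_dim, g_dim, z_dim, enc_std=0.05):
        super().__init__()
        self.bandwidth = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                       nn.Linear(64, 1), nn.Sigmoid())   # B_theta(s) -> d_cap(s)
        self.f_enc = nn.Sequential(nn.Linear(s_dim + g_dim, 128), nn.ReLU(),
                                   nn.Linear(128, z_dim))                # deterministic encoder of (s, g)
        self.prior_net = nn.Sequential(nn.Linear(s_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 2 * z_dim))        # parameters of r_psi(z | s)
        self.enc_std = enc_std                                            # narrow-Gaussian stand-in for the delta

    def prior(self, s):
        mu, log_std = self.prior_net(s).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5, 2).exp())

    def forward(self, s, g):
        d_cap = self.bandwidth(s).squeeze(-1)                    # gating probability in (0, 1)
        r_zs = self.prior(s)
        q_acc = Normal(self.f_enc(torch.cat([s, g], dim=-1)), self.enc_std)
        # Simple differentiable relaxation of drawing from the mixture (an assumption of
        # this sketch): mix one sample from each branch by the gate value.
        z = d_cap.unsqueeze(-1) * q_acc.rsample() + (1 - d_cap).unsqueeze(-1) * r_zs.rsample()
        # Gate-weighted KL surrogate: upper-bounds the mixture KL, no REINFORCE needed.
        kl = d_cap * kl_divergence(q_acc, r_zs).sum(dim=-1)
        return z, d_cap, kl
```

During training, `kl` is added to the decoder's negative log-likelihood with weight $\beta$, exactly as in the previous sketch, so a low gate value cheaply drives the latent toward the $S$-only prior.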

4. Parameterization and Learning Dynamics

Core architectural components include:

  • Bandwidth network $B_\theta$: a small MLP mapping $S$ to $d_{\text{cap}}(s)$ via a sigmoid. It can also be realized as a continuous Gaussian bottleneck regularized by a KL term and then normalized.
  • Encoder $f_{\text{enc}}$: an MLP or convnet+MLP processing the concatenated $(S, G)$ to output $Z \in \mathbb{R}^d$; it provides the non-stochastic path when $G$ is accessed.
  • Prior $r_\psi(z \mid s)$: an amortized conditional Gaussian parameterized by an MLP on $S$. This provides default representations when $d_{\text{cap}}(s)$ is low.
  • Decoder $p_\phi(y \mid s, z)$: an MLP or policy head operating on $(S, Z)$ to predict $Y$ (or, in RL, an action or value distribution).

All parameters $\theta, \phi, \psi$ are jointly optimized by stochastic gradient descent on the variational objective. At inference, $b \sim \mathrm{Bernoulli}(d_{\text{cap}}(s))$ determines whether $G$ is accessed (Goyal et al., 2020).
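At inference time, the gate becomes an explicit Bernoulli draw, so the privileged input only needs to be fetched or computed when the draw comes up 1. A minimal usage sketch, assuming the hypothetical `GatedMixtureEncoder` from the previous section, a separate decoder head, and a single example per call:

```python
import torch


@torch.no_grad()
def act(encoder, decoder, s, fetch_g):
    """Query the privileged input only when the sampled gate allows it.

    `encoder` is assumed to be the GatedMixtureEncoder sketched above, `decoder` a
    p_phi(y | s, z) head, and `fetch_g` a (possibly expensive) callable returning the
    privileged input, e.g. a planner rollout or a goal query.
    """
    d_cap = encoder.bandwidth(s)           # gating probability, computed from S alone
    b = torch.bernoulli(d_cap)             # b ~ Bernoulli(d_cap(s))
    if b.item() > 0.5:                     # single example assumed
        g = fetch_g()                      # pay the access cost for G
        z = encoder.f_enc(torch.cat([s, g], dim=-1))
    else:
        z = encoder.prior(s).sample()      # default representation from r_psi(z | s)
    return decoder(torch.cat([s, z], dim=-1))
```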

5. Empirical Evaluation and Generalization

The conditional variational bottleneck framework has been empirically validated across multiple tasks:

  • Model-based planning: The framework enables adaptive invocation of expensive planners, e.g., accessing imagination rollouts at junctions (∼72%) versus in straight maze segments (∼28%).
  • Goal-driven navigation: In out-of-distribution evaluation, such as transferring to larger environments, cVIB achieves higher success rates (≈80%) with reduced goal queries (∼76%) compared to fully conditional VIB baselines.
  • Multiagent communication: cVIB reduces communication (≈23–34% access rates) while maintaining comparable performance in cooperative landmark-reaching tasks.
  • Visual attention and memory access: In recurrent visual attention on MNIST and Neural Turing Machine memory-copy tasks, accessing the privileged input is substantially reduced with maintained or improved predictive accuracy over unconditional VIB and standard models.

The effective average KL is reduced (≈3–7 bits versus unconditional VIB), directly linking reduced privileged input use to both computational cost savings and improved out-of-distribution generalization (Goyal et al., 2020). A plausible implication is that intelligent, stochastic gating of privileged information induces more robust, adaptive representations with respect to both generalization and resource constraints.

6. Context, Limitations, and Extensions

The cVIB approach critically assumes that the privileged input is independent of the standard input, i.e., $p(g \mid s) = p(g)$, when bounding $I(Z; G \mid S)$. Empirical results indicate that this simplification does not degrade performance in the tested domains. The framework introduces no ad hoc regularizers beyond the primary $\beta$ and the mixture KL, streamlining implementation and avoiding brittle hyperparameter dependencies. Notably, the method is directly applicable to a range of architectures (MLP, convnet, LSTM) and problem modalities, including partially observable RL and multi-modal sequential decision-making.

A considered extension is the application to continuous or non-deterministic gating, as well as integration with alternate bottleneck parameterizations. The approach may also inspire other settings where costly or risk-sensitive privileged channels should be consulted selectively, always conditioned only on cheap, non-privileged inputs (Goyal et al., 2020). This suggests new research directions in information-efficient learning and dynamic resource allocation.

7. Relationship to Prior and Parallel Bottleneck Approaches

The conditional variational bottleneck operates as a generalization of the classical VIB, with the key distinction being the selective, input-dependent information gating of privileged sources. Although closely related to the Conditional Entropy Bottleneck and the Minimum Necessary Information (MNI) framework—both of which also focus on achieving robust generalization by information minimization—the cVIB formulation is unique in its operationalization using stochastic gating and variational training schemes (Goyal et al., 2020). This methodology forms a bridge between information-theoretic generalization theory and practical, scalable deep learning systems for settings with heterogeneous and potentially burdensome auxiliary data modalities.
