Contrastive Learning as Goal-Conditioned Reinforcement Learning

Published 15 Jun 2022 in cs.LG and cs.AI | (2206.07568v2)

Abstract: In reinforcement learning (RL), it is easier to solve a task if given a good representation. While deep RL should automatically acquire such good representations, prior work often finds that learning representations in an end-to-end fashion is unstable and instead equip RL algorithms with additional representation learning parts (e.g., auxiliary losses, data augmentation). How can we design RL algorithms that directly acquire good representations? In this paper, instead of adding representation learning parts to an existing RL algorithm, we show (contrastive) representation learning methods can be cast as RL algorithms in their own right. To do this, we build upon prior work and apply contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function. We use this idea to reinterpret a prior RL method as performing contrastive learning, and then use the idea to propose a much simpler method that achieves similar performance. Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods, including in the offline RL setting. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives.

Citations (118)

Summary

  • The paper introduces a framework where contrastive learning directly represents goal-conditioned value functions by linking learned inner products to reward maximization.
  • The method simplifies conventional RL by eliminating the need for multiple Q-values, elaborate data augmentation, or auxiliary objectives, achieving superior results on diverse tasks.
  • Strong theoretical and empirical evidence, including convergence guarantees and successful transfer across tasks, supports the practicality and robust performance of the proposed approach.

This paper introduces a novel perspective on goal-conditioned reinforcement learning (RL) by framing contrastive representation learning as an RL algorithm itself. Instead of augmenting existing RL algorithms with representation learning techniques like auxiliary losses or data augmentation, the authors demonstrate that contrastive learning can directly acquire effective representations corresponding to goal-conditioned value functions. The paper's key contribution lies in formally connecting contrastive learning with reward maximization, highlighting that the inner product between learned representations corresponds to a value function. This framework generalizes prior methods like C-learning and suggests new, simpler, and higher-performing goal-conditioned RL algorithms.

Theoretical Foundation

The paper grounds its approach in a rigorous theoretical framework, beginning with the definition of the goal-conditioned RL problem, characterized by states $s_t$, actions $a_t$, an initial state distribution $p_0(s)$, dynamics $p(s_{t+1} \mid s_t, a_t)$, a distribution over goals $p_g(s_g)$, and a reward function $r_g(s, a)$ for each goal. The reward is defined as the probability density of reaching the goal at the next time step: $r_g(s_t, a_t) \triangleq (1 - \gamma)\, p(s_{t+1} = s_g \mid s_t, a_t)$.

Proposition 1 establishes the equivalence between the Q-function and the probability of reaching a goal state $s_g$ under the discounted state occupancy measure: $Q_{s_g}^\pi(s, a) = p^{\pi(\cdot \mid \cdot, s_g)}(s_{t+} = s_g \mid s, a)$.
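As a sanity check on Proposition 1, the sketch below builds a small tabular MDP and compares two quantities for one fixed goal: the Q-function obtained by iterating the Bellman equation with the reward $r_g$ defined above, and the discounted state occupancy computed by rolling the state distribution forward. The transition table `P`, the policy `PI`, and all function names are hypothetical values chosen for illustration; none of them come from the paper.

```python
GAMMA = 0.9
N_S, N_A = 3, 2

# Hypothetical MDP: P[s][a][s2] = p(s2 | s, a). Each row sums to 1.
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
    [[0.2, 0.0, 0.8], [0.0, 0.1, 0.9]],
]
# Hypothetical policy for one fixed goal: PI[s][a] = pi(a | s, s_g).
PI = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]


def reward(s, a, goal):
    """r_g(s, a) = (1 - gamma) * p(s_{t+1} = s_g | s, a)."""
    return (1.0 - GAMMA) * P[s][a][goal]


def q_by_bellman(goal, iters=500):
    """Evaluate the policy by iterating Q = r + gamma * E[Q(s', a')]."""
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        v = [sum(PI[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
        q = [
            [
                reward(s, a, goal)
                + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))
                for a in range(N_A)
            ]
            for s in range(N_S)
        ]
    return q


def occupancy(s, a, goal, horizon=500):
    """(1 - gamma) * sum_k gamma^k * Pr(s_{t+1+k} = goal | s, a), truncated."""
    dist = list(P[s][a])  # distribution over s_{t+1}
    total, weight = 0.0, 1.0 - GAMMA
    for _ in range(horizon):
        total += weight * dist[goal]
        # Advance the state distribution one step under the policy.
        dist = [
            sum(
                dist[s1] * PI[s1][a1] * P[s1][a1][s2]
                for s1 in range(N_S)
                for a1 in range(N_A)
            )
            for s2 in range(N_S)
        ]
        weight *= GAMMA
    return total


q = q_by_bellman(goal=2)
max_gap = max(
    abs(q[s][a] - occupancy(s, a, 2)) for s in range(N_S) for a in range(N_A)
)
print(f"max |Q - occupancy| = {max_gap:.2e}")
```

In this toy setting the two computations should agree up to truncation and floating-point error, which is the tabular content of the proposition.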

The core contribution is Lemma 1, which demonstrates that the critic function $f^*(s, a, s_f)$ that optimizes the contrastive learning objective (Eq. 4) is a Q-function for the goal-conditioned reward function (Eq. 1), up to a multiplicative factor of $1 / p(s_f)$: $\exp(f^*(s, a, s_f)) = \frac{1}{p(s_f)} \cdot Q_{s_f}^{\pi(\cdot \mid \cdot)}(s, a)$.
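The step from a contrastive objective to this identity can be sketched with the standard Bayes-optimal classifier argument. The derivation below assumes the objective is a binary (logistic) classification loss with positives drawn from the discounted state occupancy and negatives drawn from the marginal $p(s_f)$ in equal proportion; the shorthand $p^{+}$, $p^{-}$, and $J$ is introduced here for illustration and is not the paper's notation.

```latex
\begin{align*}
J(f) &= p^{+} \log \sigma(f) + p^{-} \log\bigl(1 - \sigma(f)\bigr),
  \quad p^{+} = p^{\pi(\cdot \mid \cdot)}(s_{t+} = s_f \mid s, a),
  \quad p^{-} = p(s_f) \\
\frac{\partial J}{\partial f} &= p^{+}\bigl(1 - \sigma(f)\bigr) - p^{-}\sigma(f) = 0
  \;\Longrightarrow\; \sigma(f^*) = \frac{p^{+}}{p^{+} + p^{-}} \\
\exp(f^*) &= \frac{\sigma(f^*)}{1 - \sigma(f^*)} = \frac{p^{+}}{p^{-}}
  = \frac{1}{p(s_f)} \cdot Q_{s_f}^{\pi(\cdot \mid \cdot)}(s, a)
\end{align*}
```

The last equality replaces the occupancy $p^{+}$ with the Q-function, using the same equivalence stated in Proposition 1. Because $1 / p(s_f)$ does not depend on the action, ranking actions by $f^*$ for a fixed goal is the same as ranking them by Q-value.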

Method Implementation

The contrastive RL algorithm involves alternating between fitting the critic function using contrastive learning and updating the policy using the actor loss (Eq. 5): $\max_{\pi(a \mid s, s_g)} \mathbb{E}_{\pi(a \mid s, s_g) p(s) p(s_g)}\left[f(s, a, s_f = s_g)\right] \approx \mathbb{E}_{\pi(a \mid s, s_g) p(s) p(s_g)}\left[\log Q_{s_g}^{\pi(\cdot \mid \cdot)}(s, a) - \log p(s_g)\right]$.

Figure 1: Reinforcement learning via contrastive learning. Our method uses contrastive learning to acquire representations of state-action pairs ($\phi(s, a)$) and future states ($\psi(s_f)$), so that the representations of future states are closer than the representations of random states. We prove that learned representation corresponds to a value function for a certain reward function. To select actions for reaching goal $s_g$, the policy chooses the action where $\phi(s, a)$ is closest to $\psi(s_g)$.
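The action-selection rule in the caption can be illustrated with a small sketch. It assumes a finite set of candidate actions and scores each one by the inner product between its state-action embedding and the goal embedding. The lookup tables `PHI` and `PSI` stand in for learned encoders and hold made-up values; the method summarized here trains a policy network rather than enumerating actions, so this is a simplification.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical embeddings standing in for learned encoders phi(s, a) and psi(s_g).
PHI = {
    ("start", "left"): [1.0, 0.0],
    ("start", "right"): [0.0, 1.0],
}
PSI = {
    "left_room": [0.9, 0.1],
    "right_room": [0.2, 0.8],
}


def select_action(state, goal, candidate_actions):
    """Return the candidate action whose embedding best matches the goal."""
    goal_repr = PSI[goal]
    return max(candidate_actions, key=lambda a: dot(PHI[(state, a)], goal_repr))


print(select_action("start", "right_room", ["left", "right"]))
```

Here "closest" is measured by the inner product, which matches how the critic is parameterized; a different similarity measure would give a different rule.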

The critic is parameterized as an inner product between representations of state-action pairs and goal states: $f(s, a, s_g) = \phi(s, a)^\top \psi(s_g)$. The paper provides a JAX implementation of the actor and critic losses (Algorithm 1). Contrastive RL (NCE) is presented as a simple algorithm that does not require multiple Q-values, target Q networks, data augmentation, or auxiliary objectives. A variant, contrastive RL (CPC), is derived using the infoNCE bound on mutual information.
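To make the inner-product critic concrete, here is a minimal sketch of a binary NCE loss over a batch. It is written in plain Python rather than JAX and is not the paper's Algorithm 1. It assumes that row `i` of `sa_repr` and row `i` of `goal_repr` form a positive pair, that every other pairing in the batch serves as a negative, and that the loss is averaged over all pairs; the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def nce_critic_loss(sa_repr, goal_repr):
    """Binary cross-entropy on pairwise logits phi(s_i, a_i)^T psi(g_j).

    Diagonal entries (i == j) are labeled 1, off-diagonal entries 0.
    """
    batch = len(sa_repr)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            logit = sum(x * y for x, y in zip(sa_repr[i], goal_repr[j]))
            # Label 1 contributes -log sigmoid(logit) = softplus(-logit);
            # label 0 contributes -log(1 - sigmoid(logit)) = softplus(logit).
            total += softplus(-logit) if i == j else softplus(logit)
    return total / (batch * batch)


sa = [[3.0, 0.0], [0.0, 3.0]]
aligned_goals = [[3.0, 0.0], [0.0, 3.0]]
swapped_goals = [[0.0, 3.0], [3.0, 0.0]]
print(nce_critic_loss(sa, aligned_goals), nce_critic_loss(sa, swapped_goals))
```

The loss should be lower when each state-action embedding lines up with its own future-state embedding than when the pairing is swapped. The double loop also makes the quadratic cost in batch size explicit.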

Convergence Analysis

Lemma 2 provides a convergence guarantee for contrastive RL under three assumptions: tabular states and actions, a Bayes-optimal critic, and an additional filtering step that excludes training examples where the probability of the trajectory under the commanded goal differs significantly from that under the actually reached goal. Under these assumptions, performing contrastive RL on a static dataset amounts to one step of approximate policy improvement, so re-collecting data and applying contrastive RL repeatedly amounts to approximate policy iteration.

Experimental Evaluation

The paper evaluates contrastive RL algorithms on a suite of goal-conditioned tasks, including fetch reach, fetch push, sawyer push, sawyer bin, ant umaze, and point Spiral11x11. The results demonstrate that contrastive RL (NCE) outperforms prior methods on most tasks, including those with image observations. The method also demonstrates competitive performance in offline RL, outperforming baselines on five out of six D4RL AntMaze tasks. Ablation studies probe the design decisions of contrastive RL, demonstrating the benefits of the NCE objective and the importance of sampling random goals for the actor loss. Further experiments demonstrate the transferability of learned representations across tasks and the robustness of contrastive RL to environment perturbations.

Figure 2: Representation learning for image-based tasks. While adding data augmentation and auxiliary representation objectives can boost the performance of the TD3+HER baseline, replacing the underlying goal-conditioned RL algorithm with one that resembles contrastive representation learning (i.e., ours) yields a larger increase in success rates. Baselines: DrQ augments images and averages the Q-values across 4 augmentations; auto encoder (AE) adds an auxiliary reconstruction loss; CURL applies RL on top of representations learned via augmentation-based contrastive learning.

Implications and Future Directions

The paper demonstrates that contrastive representation learning can serve as a foundation for goal-conditioned RL, offering a fresh perspective on representation learning in RL. The proposed framework generalizes prior methods and suggests new algorithms that are simpler and more effective. The finding that RL algorithms can be constructed to resemble representation learning has broad implications for the design of future RL algorithms. The authors note that applying these methods to arbitrary RL problems remains an open question, while acknowledging that recent algorithms for this setting already bear a resemblance to contrastive RL. Future research could explore the rich set of ideas from contrastive learning to construct even better RL algorithms.

Conclusion

This paper makes a significant contribution to the field of reinforcement learning by bridging the gap between contrastive representation learning and goal-conditioned RL. The theoretical analysis, algorithmic innovations, and comprehensive experimental results provide a strong foundation for future research in this area. By demonstrating that RL algorithms can be designed to resemble representation learning, the authors open new avenues for developing more effective and efficient RL agents.

Explain it Like I'm 14

Contrastive Learning as Goal-Conditioned Reinforcement Learning — A Simple Explanation

What is this paper about?

The paper explores a new way to teach computers (or robots) how to reach goals by reusing a powerful idea from computer vision called contrastive learning. Instead of adding extra “representation learning” tricks to an existing reinforcement learning (RL) algorithm, the authors flip the script: they show that contrastive learning itself can act like an RL algorithm for goal-reaching tasks. Even better, they prove that what contrastive learning learns matches a core RL concept (a value function), and they show in experiments that this approach works really well.

What questions did the authors ask?

  • Can we solve goal-reaching tasks using just contrastive learning, without extra reward shaping or fancy add-ons?
  • Is there a precise link between contrastive learning (which learns to tell “things that go together” from “things that don’t”) and RL value functions (which tell you how good an action is for a goal)?
  • Does this idea work better (and more simply) than common RL methods, especially on image-based tasks and in offline settings where you can’t collect new data?

How did they do it? (With simple analogies)

Think of a robot trying to reach a goal, like pushing a puck to a spot on a table.

  • Key idea: Learn two “embeddings” (compact summaries):
    • One for “what I’m doing now”: a state-action pair, written as φ(s, a).
    • One for “what I want”: a future or goal state, written as ψ(s_goal).
  • Contrastive learning setup:
    • Positive pairs: match a current state-action (what you do now) with a future state the robot actually reaches soon after. These are true matches.
    • Negative pairs: match the same state-action with a random, unrelated state. These are false matches.
  • Training goal: Make the embedding of the current choice φ(s, a) be “close” (high similarity, via a dot product) to the embedding of the correct future/goal ψ(s_goal), and far from random ones.
  • Why this is like RL: The similarity score f(s, a, s_goal) = φ(s, a) · ψ(s_goal), a dot product of the two embeddings, ends up estimating “how likely this action will lead to that goal soon.” In RL terms, that’s a goal-conditioned Q-value (a measure of how good an action is for reaching a specific goal).
  • Acting (choosing actions): To reach a goal, pick the action whose embedding is closest to the goal’s embedding. In other words, choose actions that make the desired future most likely.

In short, they turn “match the right pairs” into “take actions that will reach the goal,” and they prove the math that connects these two ideas.
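The positive pairs described above can be built from a recorded trajectory. The sketch below assumes that the future state is picked at a geometrically distributed time offset (nearer futures are more likely, controlled by a discount `gamma`) and that the distribution is simply cut off at the end of the trajectory; the truncation and the function names are assumptions made for illustration.

```python
import random


def sample_future_index(t, traj_len, gamma, rng):
    """Sample t' > t with P(t' - t = k) proportional to gamma ** (k - 1).

    Requires t < traj_len - 1 so that at least one future step exists.
    """
    offsets = list(range(1, traj_len - t))
    weights = [gamma ** (k - 1) for k in offsets]
    return t + rng.choices(offsets, weights=weights, k=1)[0]


def make_positive_pairs(trajectory, gamma, rng):
    """trajectory is a list of (state, action); returns ((s, a), future_state) pairs."""
    pairs = []
    for t in range(len(trajectory) - 1):
        future = sample_future_index(t, len(trajectory), gamma, rng)
        pairs.append((trajectory[t], trajectory[future][0]))
    return pairs


# Toy trajectory whose states are just the time indices 0..5.
toy_trajectory = [(t, "noop") for t in range(6)]
toy_pairs = make_positive_pairs(toy_trajectory, gamma=0.9, rng=random.Random(0))
for (state, action), future_state in toy_pairs:
    print(state, action, "->", future_state)
```

Negative pairs can then be formed by matching a state-action from one pair with the future state from a different pair in the same batch.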

Helpful definitions:

  • Reinforcement Learning (RL): Learning by trying actions and seeing what works to get rewards.
  • Goal-conditioned RL: The agent is given a target state (the “goal”) and learns to reach it.
  • Contrastive learning: Learning to pull matching pairs together and push non-matching pairs apart.
  • Representation/embedding: A compact vector that captures the important features of inputs.
  • Q-function/value function: A score telling how good an action is for reaching a goal.

What methods did they compare?

They built a very simple version called Contrastive RL (NCE). They also looked at:

  • A prior method (C-learning), which they show is actually a form of contrastive learning.
  • Variants using different contrastive objectives (like CPC/infoNCE).
  • A combined version (NCE + C-learning).

They compared against:

  • HER (a popular goal-conditioned RL baseline with actor-critic learning).
  • GCBC (goal-conditioned behavioral cloning: imitate actions that led to goals).
  • Model-based methods that predict future states.
  • Image-focused add-ons like DrQ (data augmentation), autoencoders (AE), and CURL (contrastive features from augmented images).

They tested on:

  • Robot tasks (reach/push with a robotic arm), both with low-dimensional states and with images.
  • Navigation tasks (mazes).
  • Offline RL benchmarks (AntMaze), where the agent cannot collect new data.
  • A harder camera setup with a moving, first-person view (partial observability).

What did they find, and why is it important?

  • Contrastive RL often outperformed the traditional methods on both state and image tasks, sometimes by a lot, especially on harder tasks (like moving objects across bins).
  • It worked well on images without any special image tricks (no data augmentation or extra losses), while actor-critic baselines needed those tricks and still didn’t match its performance.
  • The simplest version (NCE) is already strong and easy to implement. A combined version (NCE + C-learning) was often the best overall.
  • It handled moving-camera, first-person views reasonably well, despite partial observability.
  • In offline RL (AntMaze), Contrastive RL + a bit of behavior cloning beat many baselines on 5 out of 6 tasks, including some that are hard for standard TD-learning methods.

Why this matters:

  • Simpler training: You don’t need to bolt on extra representation learning parts or image augmentations; the learning objective doubles as both representation learning and RL.
  • Strong performance with images: Good news for real robots that see the world through cameras.
  • Theory + practice: They don’t just claim it works; they prove the similarity score is a value function (up to a constant) and give conditions for policy improvement.

What could this change in the future?

  • A cleaner way to build goal-reaching agents: one objective that learns both understanding (representations) and doing (actions) at the same time.
  • Better robotics: agents can learn from their own experience, camera views, and even offline datasets without carefully crafted rewards or image tricks.
  • A unifying view: Representation learning and RL don’t have to be separate steps; contrastive learning can directly drive decision-making.

Overall, the paper shows that contrastive learning isn’t just for making good features—it can directly power goal-conditioned reinforcement learning, often more simply and effectively than the usual approaches.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what the paper leaves missing, uncertain, or unexplored, phrased to be directly actionable for future work:

  • Theory beyond tabular and Bayes-optimal critics: Convergence and policy-improvement guarantees rely on tabular state-action spaces, a Bayes-optimal critic, and an auxiliary filtering step that is not used in practice. It remains open to establish guarantees under function approximation, stochastic optimization, and without filtering.
  • Policy mismatch in the learned critic: The core result links the critic to Q for the averaged goal-conditioned policy π(·|·) rather than per-goal π(·|·, s_g). Quantify and mitigate the impact of this mismatch on control, and develop estimators that recover per-goal Q without sacrificing sample efficiency.
  • Dependency on discounted state occupancy sampling: Positive pairs must be drawn from the discounted occupancy p^π(s_{t+} | s, a), but in practice data are off-policy and mixed across policies. Characterize the bias introduced by this approximation and propose corrected sampling or importance weighting schemes.
  • Role and estimation of p(s_g): The optimal critic is log Q − log p(s_g). The method ignores or absorbs the −log p(s_g) term at action selection. Analyze the effect of non-uniform goal marginals on policy learning and explore ways to estimate or compensate for p(s_g) in high-dimensional spaces.
  • Generality beyond the next-step reachability reward: The analysis hinges on the goal-conditioned reward r_g(s,a) = (1−γ) p(s_{t+1}=s_g|s,a). Determine whether and how the framework extends to other goal-reaching rewards (e.g., sparse indicators, terminal-only success, staying vs just hitting), shaped rewards, or non-reachability objectives.
  • Variance and horizon scaling for Monte Carlo learning: The method is Monte Carlo–style with geometric time sampling. Provide a variance analysis, investigate γ sensitivity, and develop variance-reduction strategies for long-horizon, sparse-reachability settings.
  • Stability without target networks: The approach eschews target networks and TD-style bootstrapping. Identify failure modes (e.g., overfitting, drift under replay) and study stabilizers (target encoders, momentum, Polyak averaging).
  • Negative sampling pitfalls and false negatives: In-batch negatives can include states that are reachable or semantically similar, causing representation conflicts. Explore hard-negative mining, temporal windows, distance-based masks, or curriculum negatives to reduce false-negative harm.
  • Temperature and calibration in NCE/BCE: The critic uses a logistic link without temperature tuning. Assess whether temperature scaling, logits normalization, or margin-based losses improve performance and calibration of Q estimates.
  • Expressivity limits of inner-product critics: The bilinear form f(s,a,s_g) = φ(s,a)^T ψ(s_g) may be too restrictive for complex dynamics. Evaluate more expressive critics (e.g., MLP over concatenated embeddings, attention, hypernetworks) and study their impact on control and representation quality.
  • Representation ablations: There is no systematic study of embedding dimensionality, batch size, encoder architecture, or shared vs separate encoders for φ and ψ. Provide ablations to quantify their effects on learning stability and sample efficiency.
  • Combining with augmentations and auxiliary objectives: While the method outperforms baselines that use augmentation/auxiliary losses, it remains unknown whether judicious augmentations (e.g., color jitter, random crops) or self-supervised regularizers synergize with contrastive RL.
  • Partial observability and memory: The method succeeds moderately with a moving camera but does not incorporate recurrent policies or history encoders. Evaluate recurrent/transformer architectures, belief-state learning, and goal-conditioned memory for POMDPs.
  • Exploration and goal sampling: The work treats exploration and automatic goal selection as orthogonal. Integrate goal curricula, novelty-driven sampling, or coverage objectives and quantify gains on hard-exploration tasks (e.g., Sawyer bin remains <50% success).
  • Off-policy theory and replay usage: The algorithm is practically off-policy but theoretically on-policy. Develop off-policy corrections (e.g., importance sampling, conservative critics) and characterize how replay buffer composition and policy lag affect learning.
  • Scaling to high-dimensional visual goals: Pairwise logits incur O(B^2) compute/memory for batch size B. Investigate scalable negative pools, approximate nearest neighbor negatives, memory banks, or sub-quadratic contrastive estimators to handle large images and bigger batches.
  • Generalization to unseen goals: Goals are sampled from replay. Measure and improve generalization to goals outside the training occupancy (e.g., extrapolation to far or unseen regions), including evaluation on held-out goal sets.
  • Robustness to stochastic, irreversible, or time-varying dynamics: Analyze how stochastic transitions, irreversible actions, and non-stationary environments affect occupancy-based critics and propose robust training modifications.
  • Discrete or hybrid action spaces: The actor update assumes reparameterizable continuous actions. Develop and test counterparts for discrete/hybrid actions (e.g., Gumbel-softmax, categorical policy gradients) and study their efficacy.
  • Real-world deployment and sensing realism: No real-robot results or domain randomization tests are reported. Evaluate on hardware with delays, sensor noise, and dynamic backgrounds; assess sim-to-real transfer and robustness requirements.
  • Comparison breadth and fairness: Some baselines (e.g., stronger image-based GCRL methods, modern contrastive RL with momentum encoders, recent UVFA variants) are not included. Expand comparisons and ensure matched compute/augmentation budgets.
  • Offline RL sensitivity and scope: Offline results use AntMaze; the sensitivity to λ (BC weight), number of critics, dataset quality/coverage, and goal relabeling strategies is not analyzed. Extend to diverse datasets (e.g., image-based, robotic manipulation) and characterize failure regimes.
  • Hindsight relabeling strategy design: The method relabels with future states but does not explore alternative relabeling distributions (e.g., time-distance weighting, prioritized successful outcomes). Study how relabeling choices influence learning.
  • Success metric vs “stay at goal” behavior: The occupancy objective encourages reaching and staying, while many tasks count single-time success. Examine alignment/misalignment between this objective and task success criteria and adapt the objective when needed.
  • Transfer and compositionality of learned representations: It is unclear whether φ and ψ transfer across tasks, goals, or reward families (e.g., successor-feature-style transfer). Investigate zero/few-shot goal composition and transfer learning.
  • Safety and constraints: The framework optimizes goal reachability without explicit constraints. Explore integration with constrained RL or cost-aware contrastive objectives for safe goal-reaching.
  • Hyperparameter selection guidelines: Practical guidance for γ, entropy regularization, batch size, negative ratio, and learning rates is limited. Provide principled selection heuristics or adaptive schemes grounded in theory or diagnostics.
