1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities

Published 19 Mar 2025 in cs.LG and cs.AI | (2503.14858v3)

Abstract: Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 - 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by $2\times$ - $50\times$, outperforming other goal-conditioned baselines. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned. The project webpage and code can be found here: https://wang-kevin3290.github.io/scaling-crl/.

Summary

  • The paper demonstrates that increasing network depth up to 1024 layers results in dramatic performance improvements, with up to 50x gains in humanoid tasks.
  • The paper leverages residual connections, layer normalization, and Swish activation to ensure stable training and effective gradient propagation in ultra-deep networks.
  • The study reveals that deeper networks enable emergent qualitative behaviors, such as acrobatic maneuvers, which improve policy learning in complex, unsupervised environments.

Scaling Depth in Self-Supervised RL: A Technical Insight

Introduction

The paper "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities" (2503.14858) explores the frontier of reinforcement learning (RL) by challenging the common reliance on shallow neural networks and proposing the substantial scaling of network depth. While breakthroughs in language and vision have been achieved through the expansion of self-supervised learning models, scaling up in RL remains relatively unexplored. This study investigates the potential of self-supervised RL with deep networks extending up to 1024 layers to enhance goal-reaching performance in unsupervised settings across various tasks.

Methodology and Experimental Design

Network Architecture

The authors employ a combination of architectural techniques to stabilize training of deeply scaled networks: residual connections, layer normalization, and Swish activation functions. These components are borrowed from prior advances in deep learning and together ensure effective gradient propagation and training stability in very deep networks.
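The paper's exact block layout is not reproduced here, but the interplay of the three components can be sketched with a minimal pre-LN residual block in plain numpy. The block ordering, widths, and initialization scale below are illustrative assumptions, not the authors' architecture:

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    # Normalize each feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, w1, b1, w2, b2):
    # Pre-LN residual block: x + f(LN(x)). The identity skip path
    # lets gradients flow directly through hundreds of stacked blocks.
    h = layer_norm(x)
    h = swish(h @ w1 + b1)
    h = h @ w2 + b2
    return x + h

# Stacking many identical-width blocks yields a very deep network.
rng = np.random.default_rng(0)
width, depth = 64, 8  # toy sizes; the paper scales depth far higher
x = rng.standard_normal((2, width))
for _ in range(depth):
    w1 = rng.standard_normal((width, width)) * 0.01
    w2 = rng.standard_normal((width, width)) * 0.01
    x = residual_block(x, w1, np.zeros(width), w2, np.zeros(width))
print(x.shape)  # (2, 64)
```

Because each block adds its output to an identity path, the forward signal and backward gradient both have a direct route through the network, which is what makes depths in the hundreds trainable at all.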

Experimental Setup

Experiments were conducted in an unsupervised goal-conditioned setting, where agents must reach specified goals through their own exploration, without any external demonstrations or rewards. The experiments span simulated environments designed to test locomotion and manipulation skills in challenging configurations, including the Ant and Humanoid mazes.

Figure 1: Scaling network depth yields performance gains across a suite of locomotion and manipulation tasks, with marked improvements at critical depths specific to each environment.

Results and Analysis

Performance Enhancement via Depth Scaling

The results reveal a stark performance enhancement once network depth is scaled substantially. In humanoid-based tasks, for example, depth scaling yields improvements ranging from twofold to fiftyfold. The scaling is not linear; performance leaps are often realized at critical network depths, indicating an emergent capability once a certain architectural complexity is reached.

Figure 2: Incrementally increasing depth results in marginal performance gains (left). However, once a critical threshold is reached, performance improves dramatically (right).

Neural Expressivity and Batch Size

Deep networks demonstrated superior expressivity, allowing qualitatively distinct policies to emerge. Exploration capability and state-space coverage are augmented, suggesting that increased network depth enables more nuanced policy learning. Furthermore, deeper networks were able to benefit from larger batch sizes, a scaling behavior traditionally advantageous in the large-model regimes of other domains.

Figure 3: Deeper networks unlock batch size scaling, providing further performance improvements as depth increases from 4 to 64.

Emergent Properties

Distinct qualitative behaviors emerged as a function of network depth. In the Humanoid maze, for instance, deep networks began to exhibit strategies such as acrobatic maneuvers to overcome obstacles, an emergent learning property not seen in shallower networks.

Theoretical and Practical Implications

The ability of deep networks to map the state space in greater detail suggests RL could undergo a paradigm shift similar to vision and NLP, where model scaling is the norm. Practically, the findings could propel advances in robotics that require fine-grained manipulation and responsive learning. Theoretically, the results invite a reconsideration of RL architecture guidelines, advocating network depth as a first-class performance lever.

Future Directions

Future work could focus on integrating these findings with distributed RL to fully leverage available computational infrastructure, and on examining the synergistic effect of scaling depth and width in tandem. An additional avenue is investigating other self-supervised methods where depth scaling might reveal different or complementary capabilities.

Conclusion

The paper makes a persuasive case that substantially scaling network depth can unlock unprecedented performance in self-supervised RL. This serves not only as an engineering technique for amplifying existing architectures but also as evidence of largely unexplored capacity and representational power in RL networks. As computational resources continue to grow, ultra-deep networks may become both more practical and more prevalent in reinforcement learning applications.

Knowledge Gaps

The paper introduces deep residual MLPs (up to 1024 layers) for self-supervised goal-conditioned RL via Contrastive RL (CRL) and reports large empirical gains. The following points identify what remains missing, uncertain, or unexplored, framed to be concrete and actionable for future work:

  • Real-world validation is absent: results are limited to simulated Brax/MJX tasks; sim-to-real transfer, deployment latency, and hardware constraints on onboard inference are untested.
  • Pixel-based observation settings are not evaluated: all tasks use state inputs; it’s unclear if depth scaling yields similar gains with high-dimensional visual inputs and data augmentation.
  • Scope is narrow to continuous control and single-agent tasks: discrete action spaces, multi-agent settings, partial observability, and recurrent architectures remain unexplored.
  • No iso-compute/iso-parameter comparisons: improvements are reported across varying parameter counts and compute; controlled experiments under fixed FLOPs, fixed memory, or matched parameter budgets are needed to attribute gains to depth per se.
  • Compute accounting is inconsistent: the paper asserts parameter count “scales linearly with width but quadratically with depth,” which is not generally true for MLPs; a rigorous, reproducible compute/parameter accounting framework is missing.
  • Lack of formal scaling laws or theory: “critical depth” phenomena are observed but not explained; there is no predictive model linking environment properties (e.g., horizon, topology, observation dimension) to required depth.
  • Stability at extreme depth is unresolved: actor loss explosions at 1024 layers are noted; systematic stabilization strategies (e.g., pre-/post-LN, DeepNorm, skip-init, spectral norm constraints, gradient clipping, reversible layers) are not evaluated.
  • Offline CRL scaling underperforms: the method yields little benefit on OGBench; it is unclear whether data coverage, behavior mismatch, cold-start tuning, or loss design (e.g., temperature, margins) blocks scaling offline.
  • Batch-size scaling is only partially characterized: larger batches help with deeper nets, but principled LR scaling rules, optimizer choices (momentum, weight decay), gradient-noise-scale measurements, and replay ratio interactions are not studied.
  • Actor vs. critic scaling guidance is missing: environment-dependent effects are observed; there are no diagnostics or heuristics to decide how to allocate depth between actor and critic given task properties.
  • Exploration coverage is not quantified: “synergy” between exploration and expressivity is argued via the collector experiment, but formal coverage metrics (e.g., state visitation entropy, occupancy, novelty) are absent.
  • Contrastive training hyperparameters are under-specified: the number of negatives K, temperature/margin, hard-negative mining, and sampling policies are not systematically varied or analyzed for deep CRL.
  • Goal sampling and curricula are not explored: how different goal distributions, adaptive curricula, or subgoal selection affect scaling and emergence is unknown.
  • Architectural generality is untested: only residual MLPs are used; depth scaling with transformers, Mixture-of-Experts, reversible layers, gated MLPs, or quasimetric encoders (beyond appendix mentions) remains open.
  • Normalization and block design choices are not ablated: residual block size, pre- vs. post-LN, activation functions (e.g., GELU vs. Swish), skip-initialization, and position of residual adds could materially impact very deep training.
  • Replay buffer design is not analyzed: prioritized replay, replay ratios, temporal sampling windows for positives, and dataset deduplication with deep CRL are unexamined.
  • Evaluation metric may bias behavior: “number of timesteps near the goal” over 1000 steps could be gamed by staying at the goal; episode-level success rates, success-on-first-arrival, path optimality, and sample efficiency metrics should be added.
  • Statistical robustness is unclear: number of seeds, variance, confidence intervals, and sensitivity to hyperparameters are not reported; claims of “critical depth” need statistical backing across seeds/runs.
  • Representational analysis is qualitative: Q-visualizations and PCA plots suggest richer topology, but quantitative measures (e.g., topology-preserving metrics, mutual information, cluster separability, CKA similarity) are needed to link depth to representation quality.
  • Generalization via stitching is tested narrowly: only Ant U-Maze with constrained train pairs; broader, systematic evaluation across tasks, environment variations, and long-range compositionality is missing.
  • Interaction with extrinsic rewards remains unexplored: CRL is self-supervised; combining depth-scaled CRL with shaped rewards or hybrid RL (e.g., contrastive pretraining + TD finetuning) could bridge early sample-efficiency gaps observed vs. SAC.
  • Early sample efficiency is inferior to SAC in Humanoid mazes: mechanisms to improve CRL’s cold-start (e.g., behavioral priors, exploration bonuses, subgoal discovery) are not investigated.
  • Width scaling limits are not thoroughly mapped: width is explored up to 2048; deeper analyses of width–depth trade-offs under fixed compute, and Pareto fronts across tasks, are needed.
  • Safety and robustness are unaddressed: sensitivity to perturbations, domain shift, actuator noise, and adversarial or out-of-distribution goals remain open.
  • Hyperparameter sensitivity for deep nets is not specified: learning rate schedules, weight decay, initialization schemes, and regularization (dropout, stochastic depth) could be critical at 1000+ layers.
  • Planning vs. function approximation: depth may emulate planning in mazes; comparisons to model-based RL, hierarchical RL (options/subgoals), or search-based approaches would clarify what depth is substituting for.
  • Goal-space design choices are not analyzed: the mapping f: S → G may induce aliasing; alternative goal parameterizations (relative positions, learned embeddings) could affect scaling and emergence.
  • Embedding-norm control is absent: the actor maximizes L2 distance in embedding space; without norm constraints or temperatures, degenerate behavior and loss explosions can occur; margin-based or normalized objectives should be explored.
  • Distillation/pruning/compression are proposed but not tested: practical recipes to compress deep CRL policies/critics for deployment (and their impact on performance and emergence) are open.
  • Cross-task transfer is untested: whether deep CRL learns reusable goal-conditioned representations that transfer across environments or tasks is unknown.
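The parameter-accounting point above can be made concrete with a quick count for a plain MLP. For hidden width W and D hidden layers, parameters grow roughly as D x W^2: linear in depth, quadratic in width, the reverse of the quoted claim. The input/output dimensions below are arbitrary illustrative choices:

```python
def mlp_param_count(depth, width, d_in=32, d_out=8):
    # Count weights and biases for a plain fully connected MLP:
    # input projection + (depth - 1) hidden-to-hidden layers + output head.
    params = d_in * width + width                    # input projection
    params += (depth - 1) * (width * width + width)  # hidden layers
    params += width * d_out + d_out                  # output head
    return params

base = mlp_param_count(depth=4, width=256)
# Doubling depth roughly doubles parameters (linear in depth).
print(mlp_param_count(depth=8, width=256) / base)
# Doubling width roughly quadruples parameters (quadratic in width).
print(mlp_param_count(depth=4, width=512) / base)
```

Residual blocks change the constants but not the asymptotics, which is why controlled iso-parameter and iso-FLOP comparisons are needed before attributing the reported gains to depth itself.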
