
Compositional Use of Depth

Updated 6 February 2026
  • Compositional use of depth is a concept that employs hierarchical network structures to assemble and manipulate complex patterns using layerwise geometric or structural properties.
  • It integrates theoretical foundations, empirical analyses, and practical implementations across neural, vision, and graphics systems to enhance expressivity and generalization.
  • Research challenges include overcoming ensemble averaging and refining architectural designs to promote true compositional hierarchies in deep models.

Compositional use of depth refers to the explicit or implicit leveraging of layerwise, geometric, or structural depth in artificial neural networks (ANNs) and computer vision/graphics systems to enable the hierarchical assembly, manipulation, or interpretation of complex patterns, objects, or operations. This encompasses architectural, algorithmic, and representational paradigms in which depth is not merely a means to increase expressivity or capacity, but is instrumented to support specific forms of compositional or generalizable computation—such as the recursive combination of primitives, disambiguation of occlusions, or stratified scene representations.

1. Theoretical Foundations: Depth, Expressivity, and Compositional Hierarchies

The expressive power of depth in neural architectures underpins the modern understanding of compositional generalization. Theoretically, for fixed width, the space of functions representable by a deep feedforward network often grows exponentially with layer count due to the ability to assemble feature detectors hierarchically (Raghu et al. 2017; Merrill et al. 2021). In transformers, stacking decoder layers allows higher-order feature hierarchies—syntactic, semantic, and structural—capable, in principle, of supporting compositional reasoning as required in language and vision (Petty et al., 2023).
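The exponential benefit of depth has a classic concrete witness, sketched here as an illustration rather than a result from the cited papers: composing the ReLU-expressible "tent" map t(x) = 1 - |2x - 1| with itself L times yields a piecewise-linear function with 2^L pieces, which a fixed-width deep network represents exactly but a shallow network can only match with exponentially many units.

```python
import numpy as np

def tri(x):
    # Tent map t(x) = 1 - |2x - 1|, expressible by a width-2 ReLU layer.
    return 1.0 - np.abs(2.0 * x - 1.0)

def compose(depth, x):
    # L-fold composition of the tent map: a depth-L, constant-width network.
    for _ in range(depth):
        x = tri(x)
    return x

def count_linear_pieces(depth, n=1 << 16):
    # Count linear pieces on [0, 1] by counting sign changes of the slope.
    # The grid is a power of two, so it contains every dyadic breakpoint.
    x = np.linspace(0.0, 1.0, n + 1)
    y = compose(depth, x)
    slopes = np.diff(y)  # constant within each linear piece
    flips = np.sum(np.sign(slopes[1:]) != np.sign(slopes[:-1]))
    return int(flips) + 1

for L in range(1, 6):
    print(L, count_linear_pieces(L))  # pieces double with each layer: 2, 4, 8, 16, 32
```

Each extra layer doubles the number of linear pieces, so representational complexity grows exponentially in depth at fixed width.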

Beyond generic approximation, compositional depth is instantiated in:

  • Logical/computational recursion: Recurrent application of a relation (e.g., in parsing or procedural image generation) (see (Klinger et al., 2020)).
  • Hierarchical kernels: Successive composition of reproducing kernels via the nonlinearity’s action on function space, yielding compositional kernels whose bandwidth and memorization properties are depth-controlled (Liang et al., 2020).
  • Layered graphics/vision decompositions: Discrete stacking/orderings of image regions as in illustrator’s depth for digital editing, or occlusion layers in pop-out segmentation (Maruani et al., 21 Nov 2025, Wu et al., 2022).

2. Empirical Studies of Depth–Compositional Generalization in LLMs

Empirical analysis of transformer LMs indicates that increasing depth, independent of total parameter footprint, yields pronounced but rapidly saturating improvements in compositional generalization (Petty et al., 2023). For fixed parameter budgets (41M, 134M, 374M), models constructed by trading width for depth and keeping total parameter count constant demonstrate:

  • Sharp reductions in pretraining perplexity and sharp gains in OOD compositional task accuracy as layers are added, with most of the improvement concentrated in the first few layers (e.g., going from 1 to 2 layers delivers +20–30% OOD compositional accuracy, while further gains plateau after ~4–8 layers for sub-500M models).
  • Lexical splits (generalization over new words in familiar frames) exhibit more benefit from depth; structural splits (novel syntactic frame composition) remain difficult regardless of depth.
  • Diminishing marginal return: beyond a threshold depth L^* (which grows with budget), both perplexity and compositional-accuracy gains become negligible, or degrade once the feedforward width falls below the embedding size.
  • Depth boosts compositional generalization even when controlling for pretraining performance or in-distribution fit: at matched language-modeling perplexity, deeper models generalize better OOD (Petty et al., 2023).
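The width-for-depth trade at a fixed budget can be made concrete with the standard parameter-count approximation for a transformer block: roughly 4d² for attention plus 8d² for the feed-forward sublayer, i.e. ~12d² per layer (embeddings ignored). This is a generic back-of-the-envelope sketch, not the paper's exact accounting; the 41M budget is taken from the text.

```python
import math

def width_for_budget(total_params, n_layers, per_layer_coeff=12):
    # Standard transformer block: ~4*d^2 (attention) + 8*d^2 (FFN) = 12*d^2
    # parameters. Solving L * 12 * d^2 = budget for d gives the width that
    # keeps the total parameter count fixed as depth grows.
    return math.sqrt(total_params / (per_layer_coeff * n_layers))

budget = 41e6  # the smallest budget studied in the text
for L in (1, 2, 4, 8, 16, 32):
    d = width_for_budget(budget, L)
    print(f"layers={L:2d}  d_model≈{d:6.0f}  ffn width≈{4 * d:6.0f}")
```

Width shrinks as 1/√L, which is why very deep models at small budgets eventually see their feed-forward width fall below the embedding size.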

However, recent detailed residual-stream analyses in large LLMs (e.g., Llama 3, Qwen 3 series) reveal that depth is not exploited for sequentially composing higher-order subcomputations ("compositional use" in the strong sense) (Csordás et al., 20 May 2025). Instead:

  • The first half of layers dominate nontrivial update dynamics; the second half mainly amplify or refine fixed features, with layer-skipping ablations and integrated gradients showing little role for deep layers in progressive "multi-hop" reasoning or composition.
  • Cross-depth linear probes between shallow and deep models indicate that deeper models stretch the same computational graph across more layers with smaller per-layer increments—there is little evidence for new forms of computation or composition appearing at greater depth.

A more nuanced view is provided by layerwise inference diagnostics: early layers in decoder-only transformers perform high-frequency lexical "guessing," with increasingly context- and fact-aware refinement in deeper layers (Gupta et al., 21 Oct 2025). Depth is thus exploited compositionally across a spectrum of linguistic complexity: function words and easily predicted tokens are resolved in shallow layers, while context integration and final disambiguation are left to deeper layers.

3. Depth in Compositional Kernels, Scaling Laws, and Architectural Constraints

The compositional properties of depth echo in formal settings beyond transformer LMs. In the theory of compositional kernels (Liang et al., 2020), depth-L architectures induce kernel functions as L-fold compositions of the activation's dual generating function. This branching-process formulation quantifies how spectral complexity, memorization capacity, and effective bandwidth scale with depth: greater depth compresses off-diagonal kernel mass, enabling memorization for sufficiently deep, nonlinear networks, with the required depth scaling as O(log n / d) for n samples in d dimensions.
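A minimal sketch of the kernel-composition mechanism, using the sign activation as an illustrative choice (not the paper's specific setup): its dual generating function under Gaussian inputs is κ(ρ) = (2/π) arcsin(ρ), and iterating it drives off-diagonal correlations toward zero while fixing κ(1) = 1, so the depth-L kernel approaches the identity on distinct samples, which is what memorization requires.

```python
import math

def dual_sign(rho):
    # Dual generating function of the sign activation under Gaussian inputs:
    # kappa(rho) = (2/pi) * arcsin(rho); kappa(0) = 0 and kappa(1) = 1.
    return (2.0 / math.pi) * math.asin(rho)

def deep_kernel(rho, depth):
    # Depth-L compositional kernel value = L-fold composition of the dual.
    for _ in range(depth):
        rho = dual_sign(rho)
    return rho

# Off-diagonal correlation 0.9 shrinks with depth; the diagonal stays at 1.
for L in (1, 2, 4, 8, 16):
    print(L, round(deep_kernel(0.9, L), 4))
```

The shrinking off-diagonal values illustrate the "compression of off-diagonal kernel mass" with depth described above.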

In neural scaling law analyses, depth-limited error often empirically scales as L_ℓ ∼ 1/ℓ (inverse in depth, not exponentially improved) across both transformer LMs and toy residual networks (Liu et al., 5 Feb 2026). The dominant regime appears to be one of "ensemble averaging," in which most layers implement nearly identical transforms contributing incremental error reduction, rather than compositional layering of distinct, hierarchical features. Architectural constraints in residual networks (e.g., persistent skip connections enforcing near-identity behavior) and target-function properties (e.g., the highly peaked, non-smooth next-token prediction targets in LMs) may both inhibit true procedural or hierarchical composition with depth.

Distinct depth-use regimes:

| Regime | Layer behavior | Depth scaling | Efficiency/robustness |
|---|---|---|---|
| Compositional | Qualitatively distinct | Exponential (ideal) | Efficient, less robust |
| Ensemble averaging | Nearly identical | Inverse (1/ℓ) | Robust, inefficient |
| Smooth ODE approx. | Discretized vector field | 1/ℓ²–1/ℓ³ | High efficiency, requires smooth targets |

A plausible implication is that breaking out of ensemble averaging may necessitate interventions such as per-layer diversity regularization, higher-order integration blocks (e.g., Runge–Kutta–style residuals), tied-weight plus scaling (recurrent depth), or intermediate supervision that targets compositional subgoals (Liu et al., 5 Feb 2026).
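The 1/ℓ signature of ensemble averaging has a simple statistical analogue (an illustrative simulation, not the paper's experiment): if each layer contributes a nearly independent, equally noisy estimate of the same correction, averaging ℓ of them cuts mean squared error by a factor of ℓ, i.e. inverse rather than exponential improvement.

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_mse(n_members, noise_std=1.0, n_trials=20000):
    # Each "layer" contributes an independent noisy estimate of the same
    # target (here 0). Averaging L such estimates gives MSE = noise_std**2 / L,
    # the 1/L ensemble-averaging regime.
    estimates = rng.normal(0.0, noise_std, size=(n_trials, n_members))
    return float(np.mean(estimates.mean(axis=1) ** 2))

for L in (1, 2, 4, 8, 16):
    print(L, round(ensemble_mse(L), 4))  # ≈ 1/L: roughly 1.0, 0.5, 0.25, ...
```

Contrast this with the compositional regime of the table above, where adding a layer would multiply, not merely average down, the representable structure.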

4. Depth as a Primitive for Discrete and Layered Compositionality

Compositional use of depth extends beyond metric/continuous or residual settings and includes representations where depth is quantized or stratified to encode layer orderings or occlusion structure:

  • Illustrator's Depth defines depth as an interpretable discrete layer index, corresponding to a compositional, globally consistent stacking order in 2D images optimized for downstream editing (Maruani et al., 21 Nov 2025). Each pixel's index determines occlusion relationships: D(i) > D(j) if pixel i occludes pixel j. This supports high-fidelity image vectorization, depth-aware editing, and 3D bas-relief generation directly from 2D inputs.
  • Pop-out segmentation in vision leverages the simple compositional prior that foreground objects lie atop a continuous background surface. By learning or inferring the contact-surface depth, segmentation can operate via the rule D_po(x) > D_c(x), translating 3D discontinuities into compositional semantic boundaries (Wu et al., 2022).
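Both discrete-depth rules reduce to a few lines. The sketch below uses hypothetical helper names; it paints layers back-to-front for the stacking-index case, and reads the pop-out inequality as height above the contact surface (with camera z-depth the inequality would flip):

```python
import numpy as np

def composite_by_index(colors, masks):
    # Illustrator's-depth-style compositing: each pixel shows the highest-index
    # layer covering it, i.e. layer i occludes layer j whenever D(i) > D(j).
    out = np.zeros(colors.shape[1:])
    for l in range(colors.shape[0]):  # paint back-to-front
        out = np.where(masks[l], colors[l], out)
    return out

def pop_out_mask(depth, contact_surface):
    # Pop-out rule D_po(x) > D_c(x): a pixel is foreground when its depth
    # rises above the inferred contact (background) surface.
    return depth > contact_surface

# Two overlapping 1x4 layers: layer 1 (value 2) occludes layer 0 (value 1).
colors = np.array([[[1., 1., 1., 1.]], [[2., 2., 0., 0.]]])
masks = np.array([[[True, True, True, False]], [[True, True, False, False]]])
print(composite_by_index(colors, masks))  # [[2. 2. 1. 0.]]
```

The point of the sketch: once depth is an ordinal index rather than a metric quantity, occlusion resolution becomes a pure max-over-layers operation.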

These approaches demonstrate the value of reframing depth as a compositional or ordinal abstraction, augmenting or replacing physically metric notions where editability, occlusion logic, and structured manipulation are paramount.

5. Compositional Depth in Vision, Graphics, and Generative Models

Numerous systems instrument depth to organize, regularize, and compose components at multiple levels:

  • 3D Scene Processing: RoomTex fuses panoramic depth maps and per-object perspective depths to drive coarse-to-fine, style-consistent texturing of compositional scene meshes, enforcing alignment and editability by treating depth as a first-class control signal at global and local levels (Wang et al., 2024). RICO utilizes depth-based regularization (smoothness, reversed rendering losses) in neural SDF frameworks, constraining object geometry within smoothed background backstops to ensure watertight compositional 3D reconstructions (Li et al., 2023).
  • Compositional Image Synthesis: Compose-and-Conquer's depth-disentanglement training leverages independently inferred foreground/background depths plus cross-attention masks to inject disjoint depth and global style signals into a diffusion backbone, supporting precise multi-object composition and region-specific semantic control (Lee et al., 2024). Depth-SIMS integrates sparse depth and semantic composition to support structurally aligned in-painting, aiding both synthesis fidelity and downstream segmentation/depth completion tasks (Musat et al., 2022).
  • Occlusion and Stereo: DepGAN introduces an explicit depth-aware loss to enforce correct occlusion boundaries and transparency in composited images, using depth maps and alpha channels to refine object placement and blending (Ghoneim et al., 2024). 360° stereo image composition with depth adaptation uses per-view depth-guided projection and densification to eliminate ghost artifacts and preserve depth-consistent parallax in VR contexts (Huang et al., 2022).

These systems demonstrate the centrality of compositional depth for enabling controllable synthesis, robust editing, and occlusion-resolving representations.

6. Limitations, Challenges, and Architectural Opportunities

Despite its theoretical and practical importance, compositional use of depth remains only partially realized in state-of-the-art deep models:

  • LLMs: While depth facilitates compositional generalization over a modest range, scaling to arbitrary compositional depths or recursive reasoning absent explicit architectural bias is out of reach. Feedforward and transformer architectures lack mechanisms for unfolding relations over unbounded depth, and performance decays with increased compositional chain length or substitution (Petty et al., 2023, Klinger et al., 2020).
  • Residual networks: The prevailing regime is ensemble averaging, not procedural or ODE-like depth composition (Liu et al., 5 Feb 2026).
  • Vision/graphics models: Discrete and layered representations capture a subset of compositionality (e.g., occlusion, editability) but do not generalize to arbitrary recursion or program-like scene graph assembly.

Current research directions include enforcing layerwise diversity or proceduralization via intermediate losses, constructing architectures with dynamic depth or shared-parameter "recurrent depth," and combining relational with compositional (recursive, modular) inductive biases (Klinger et al., 2020, Maruani et al., 21 Nov 2025). In vision, hybrid approaches leverage both metric and discrete depth; in generative models, compositionality increasingly uses depth as an explicit conditioning axis for scene assembly, local/global style disentanglement, and occlusion-aware synthesis.

7. Connections to Benchmarks and Evaluation of Compositional Depth

Precise measurement of compositional depth and its exploitation requires controllable benchmarks:

  • ConceptWorld proposes a DSL for compositional visual concepts, varying depth via recursive scene-graph assembly, and systematically tests generalization to unseen depths (Klinger et al., 2020). All neural models tested (MLP, CNN, ResNet, relational nets) show a sharp drop-off in performance beyond the depths seen in training, with no evidence of unbounded generalization.
  • COGS and GeoQuery measure OOD compositional generalization to novel combinations in language, revealing both the benefits and limitations of deeper Transformer LMs (Petty et al., 2023).
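A toy version of depth-controlled concept generation makes the train/test depth split explicit. This is a hypothetical mini-DSL in the spirit of ConceptWorld; the primitive and operator names are invented for illustration:

```python
import random

def sample_concept(depth, primitives=("circle", "square"), ops=("beside", "above")):
    # Recursively assemble a scene-graph "concept" of exactly the requested
    # compositional depth: a primitive at depth 0, otherwise a binary operator
    # applied to two subconcepts one level shallower.
    if depth == 0:
        return random.choice(primitives)
    op = random.choice(ops)
    return (op, sample_concept(depth - 1), sample_concept(depth - 1))

def concept_depth(c):
    # Depth of a concept tree: 0 for a primitive, else 1 + max over children.
    if isinstance(c, str):
        return 0
    return 1 + max(concept_depth(c[1]), concept_depth(c[2]))

random.seed(0)
train = [sample_concept(d) for d in (1, 2) for _ in range(3)]  # shallow split
test = [sample_concept(4) for _ in range(3)]                   # deeper OOD split
print(max(concept_depth(c) for c in train), concept_depth(test[0]))  # prints: 2 4
```

Evaluating a model trained only on the shallow split against the deeper one is exactly the unseen-depth generalization test such benchmarks formalize.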

Such benchmarks are vital for diagnosing whether network capacity gains with depth translate to meaningful compositional use, or merely to shallow, incremental smoothing of already-encoded features.


In conclusion, the compositional use of depth is a multi-faceted notion: while depth empowers expressivity and supports practical composition in both network architectures and vision/graphics systems, it is primarily effective within architectural or algorithmic regimes that enforce or exploit hierarchical, recursive, or ordinal structure at each layer or compositional unit. Evidence across domains suggests that most current deep models use depth as an error-smoothing or incremental refinement mechanism rather than as a principled scaffold for assembling genuinely novel compositions at higher levels of abstraction. Research into architectures, regularization schemes, and benchmarks that strongly privilege and exploit compositionality at every layer remains an open and promising direction.
