Single-Stream CNN with Learnable Architecture
- The paper presents a model where all architectural hyperparameters, such as kernel sizes, width, depth, and grouping, are optimized dynamically using continuous, learnable masks.
- It introduces differentiable mask and operator selection functions that enable gradient-based joint optimization of network weights and architecture under resource constraints.
- The method culminates in an efficient, discretized CNN that outperforms hand-engineered networks by adaptively tailoring its structure through dynamic grouping and pruning.
A single-stream convolutional neural network (CNN) with learnable architecture is a model in which the full set of architectural hyperparameters—kernel sizes, channel widths, downsampling positions, depth, connectivity, group structure, or even operator types—is optimized in a data-driven, differentiable, and end-to-end manner within a single computation stream. Rather than relying on manual architecture engineering or fixed design patterns, these approaches jointly train network weights and architecture parameters (e.g., via masks, gates, or gradients through operator mixtures) so that the final discrete network emerges from an optimization process directly reflecting the data, loss objectives, and any resource constraints. This methodology encompasses a variety of techniques, including but not limited to differentiable architecture search, dynamic grouping, width expansion, and learnable parameter repetition.
1. Continuous Parameterization of CNN Architecture
Modern approaches to learnable single-stream CNN architecture replace discrete choices of kernel size, width, downsampling positions, and depth with continuous, differentiable proxies. The DNArch framework (Romero et al., 2023) parameterizes each architectural axis as a continuous dimension, introducing learnable masks along:
- Kernel size: a 2D mask per layer over the two spatial kernel coordinates.
- Width: a 1D mask over channel indices.
- Downsampling: a 1D mask over frequency in the Fourier domain.
- Depth: a 1D mask over block index.
Masks are parameterized by a small number of learnable scalars (e.g., a mean, a variance, and a temperature). These masks determine which slices of the architecture are retained or suppressed, evaluating to values near 1 (“kept”) or near 0 (“erased”). The architectural parameters are updated by backpropagating the task loss and any additional constraints directly through the mask functions.
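As a concrete sketch, the width axis can be masked with a sigmoid over channel indices. The parameter names (`mu`, `tau`, `eps`) and the exact functional form are illustrative, not the precise DNArch parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def width_mask(num_channels, mu, tau=10.0, eps=0.1):
    """Soft mask over channel indices 0..num_channels-1.

    mu is the learnable "cutoff" center; tau is a temperature; values
    below eps are hard-zeroed so erased channels cost no computation.
    """
    idx = np.arange(num_channels)
    m = sigmoid(tau * (mu - idx))    # ~1 for idx < mu, ~0 for idx > mu
    return np.where(m > eps, m, 0.0)

mask = width_mask(num_channels=8, mu=4.5)
kept = int((mask > 0).sum())         # effective width implied by mu
```

Because the mask is a smooth function of `mu`, the task loss can push the effective width up or down by gradient descent, which is the core mechanism described above.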
2. Differentiable Mask and Operator Selection Functions
Learnable architectures use differentiable selection functions to make architectural choices amenable to gradient-based optimization:
- Gaussian Mask: For kernel size or similar attributes, the mask over a centered coordinate x is defined as m(x) = exp(−(x − μ)² / (2σ²)), subject to thresholding at a small ε for computational efficiency.
- Sigmoid Mask: For discrete choices (channels, depth, etc.), sigmoid-based transitions offer differentiability: m(i) = sigmoid(τ(μ − i)), where sigmoid denotes the logistic sigmoid function and τ a temperature. Thresholding at a small ε keeps the mask sparse.
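A minimal sketch of the Gaussian case, here as a 2D mask over a centered kernel-coordinate grid; the threshold value `eps` is an assumption for illustration:

```python
import numpy as np

def gaussian_kernel_mask(k, sigma2, eps=0.05):
    """k x k Gaussian mask centered on the kernel.

    The learnable variance sigma2 controls the effective kernel size;
    values below eps are zeroed out, shrinking the active support.
    """
    coords = np.arange(k) - (k - 1) / 2.0    # centered coordinates
    xx, yy = np.meshgrid(coords, coords)
    m = np.exp(-(xx**2 + yy**2) / (2.0 * sigma2))
    return np.where(m > eps, m, 0.0)

small = gaussian_kernel_mask(7, sigma2=0.5)  # small variance -> small effective kernel
large = gaussian_kernel_mask(7, sigma2=8.0)  # large variance -> full 7x7 support
```

Growing or shrinking `sigma2` during training thus continuously interpolates between small and large effective kernels without any discrete choice.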
Operator selection and layer connectivity in cell-based architectures are handled by softmax-parameterized mixtures (Chen et al., 2020). For example, on an edge (i, j), continuous parameters α_o define mixture weights w_o over the candidate operator set O via:

w_o = exp(α_o) / Σ_{o′ ∈ O} exp(α_{o′})

Similar softmax weights (often denoted β) handle learnable adjacency and aggregation between intermediate nodes.
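A toy sketch of the softmax operator mixture on one edge; the candidate "operators" here are stand-in functions rather than real convolutions:

```python
import numpy as np

# Candidate operator set for one edge (illustrative stand-ins).
ops = [
    lambda x: x,                   # identity / skip connection
    lambda x: np.zeros_like(x),    # "zero" op (no connection)
    lambda x: 2.0 * x,             # stand-in for a conv operator
]

def mixed_op(x, alpha):
    """Weighted sum of candidate operators with softmax(alpha) weights."""
    w = np.exp(alpha - alpha.max())   # numerically stable softmax
    w = w / w.sum()
    return sum(wi * op(x) for wi, op in zip(w, ops)), w

x = np.ones(4)
alpha = np.array([0.0, 0.0, 2.0])     # conv-like op currently favored
y, w = mixed_op(x, alpha)
strongest = int(np.argmax(w))         # discretization would keep this op
```

Because the output is differentiable in `alpha`, gradient descent can shift weight toward whichever operator reduces the loss, which is exactly the mechanism the softmax mixture provides.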
Dynamic grouping for single-stream multi-source inputs is achieved via learnable binary or continuous relationship matrices encoding which channels belong together, with gates binarized in the forward pass and optimized using the straight-through estimator (Yang et al., 2021).
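The straight-through trick can be sketched as an explicit forward/backward pair; in a real framework an autograd hook would implement the backward override, so the function names here are purely illustrative:

```python
import numpy as np

def gate_forward(u):
    """Binarize a continuous gate parameter u in the forward pass."""
    return (u > 0.0).astype(np.float64)

def gate_backward(grad_out):
    """Straight-through estimator: pretend the gate was the identity,
    so the gradient passes through unchanged to the continuous u."""
    return grad_out

u = np.array([-0.3, 0.2, 1.5])                      # continuous gate params
g = gate_forward(u)                                 # hard 0/1 gates in forward
grad_u = gate_backward(np.array([0.1, 0.1, 0.1]))   # gradient still reaches u
```

The hard gates make the forward pass consistent with the eventual discrete deployment, while the estimator keeps the group-structure parameters trainable.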
3. Joint Training Objectives and Resource Constraints
The optimization of single-stream CNNs with learnable architectures employs joint loss functions. For differentiable axis-tuning (Romero et al., 2023), the composite objective is:

L = L_task + λ · L_comp

where L_task is the task-specific loss (e.g., cross-entropy, MSE), and L_comp measures the deviation of the model’s complexity C (such as FLOPs or parameter count) from a user-defined target C_target, for example:

L_comp = (C / C_target − 1)²

The regularization parameter λ explicitly governs the trade-off between accuracy and budget compliance.
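A minimal sketch of the composite objective; the squared relative-deviation form of the complexity penalty is an assumption for illustration, not necessarily the exact published formulation:

```python
def complexity_penalty(flops, target_flops):
    """Penalize relative deviation from the FLOPs budget (assumed form)."""
    return (flops / target_flops - 1.0) ** 2

def composite_loss(task_loss, flops, target_flops, lam=1.0):
    """Task loss plus lambda-weighted budget-compliance penalty."""
    return task_loss + lam * complexity_penalty(flops, target_flops)

# On budget: no penalty. Over budget: penalty grows quadratically.
on_budget   = composite_loss(0.5, flops=1e9, target_flops=1e9)
over_budget = composite_loss(0.5, flops=2e9, target_flops=1e9, lam=0.1)
```

Since the current FLOPs count is itself a differentiable function of the masks, this penalty steers the architecture parameters toward the budget during training rather than after it.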
Alternating gradient-based updates are typical (Chen et al., 2020), with architectural parameters (the mask or mixture parameters α, β) and standard network weights w updated in tandem or in a bilevel optimization loop.
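The alternating scheme can be illustrated on a toy objective coupling a "weight" scalar and an "architecture" scalar; the loss, coupling term, and learning rates are all illustrative:

```python
def loss(w, alpha):
    """Toy objective coupling a weight w and an architecture scalar alpha."""
    return (w - 2.0) ** 2 + (alpha - 1.0) ** 2 + 0.1 * w * alpha

def grad_w(w, alpha):
    return 2.0 * (w - 2.0) + 0.1 * alpha

def grad_alpha(w, alpha):
    return 2.0 * (alpha - 1.0) + 0.1 * w

w, alpha = 0.0, 0.0
for _ in range(200):
    alpha -= 0.1 * grad_alpha(w, alpha)   # architecture step
    w     -= 0.1 * grad_w(w, alpha)       # weight step

final = loss(w, alpha)   # both parameter sets settle jointly
```

In practice the architecture step is often taken on a held-out validation batch (the bilevel variant), but the alternation pattern is the same.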
4. Discretization and Inference-Time Network Extraction
At inference, the continuous architectures must be instantiated as discrete computational graphs:
- Thresholding is applied to mask values; coordinates whose mask value exceeds the threshold ε are retained, all others eliminated (Romero et al., 2023).
- For cell-based or graph-based methods, operator mixtures are collapsed to their maximal contributors, and only the strongest k incoming connections are retained per node (Chen et al., 2020).
- In dynamic grouping, the learned connection matrix is binarized, with the group structure fixed for deployment (Yang et al., 2021).
- In repetition-based scaling (Chavan et al., 2021), the final repetition factors (for depth or width) are fixed for the forward pass.
This process enables the construction of efficient, compact, and specialized CNNs with architectures tailored to the training data and resource constraints.
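The discretization steps above can be sketched as follows; the threshold value and parameter names are illustrative:

```python
import numpy as np

def discretize_mask(mask, eps=0.1):
    """Keep coordinates whose soft mask value exceeds eps."""
    return mask > eps

def collapse_mixture(alpha):
    """Replace a softmax operator mixture with its maximal contributor."""
    return int(np.argmax(alpha))

soft_mask = np.array([0.98, 0.9, 0.4, 0.05, 0.01])
keep = discretize_mask(soft_mask)                 # hard keep/drop pattern
op_index = collapse_mixture(np.array([0.2, 1.7, -0.5]))  # single surviving op
```

After this step the continuous relaxation is gone entirely: the deployed network contains only the retained coordinates and the single winning operator per edge.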
5. Single-Stream Architecture Variants: Dynamic Grouping and Growth
Several families of techniques further realize the single-stream paradigm:
- Dynamic Grouping Convolution (DGConv): Encodes group structure as learnable variables, allowing the network to identify regularization structure or sensor-specific splits without hard-coding group numbers. DGConv is used in conjunction with backbone architectures such as ResNet and U-Net, with depth-wise and grouped convolutions applied per learned policy (Yang et al., 2021).
- Greedy Layer-Wise Feature Expansion: Layer width is expanded on demand by measuring each filter’s cross-correlation “distance” from initialization, introducing new features only when survivor filters have not “fully evolved.” This bottom-up expansion finds non-monotonic “bell-shaped” width profiles adapted to task complexity (Mundt et al., 2017).
- RepeatNet and Learnable Repetition: The scaling of effective depth or width is achieved by repeating transforms of parent convolution kernels, using only small sets of learnable transformation parameters. Both “swish-like” multiplicative nonlinearities and sign-flip maskings are used as transforms. Width or depth scaling factors can themselves be learned via differentiable relaxations, and the method yields wide or deep effective networks with drastically reduced parameter counts (Chavan et al., 2021).
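The repetition idea can be sketched with a parent kernel and cheap per-repeat transforms; the transform family below (a learnable scalar plus a fixed sign-flip mask) is illustrative rather than the exact RepeatNet recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
parent = rng.standard_normal((3, 3))               # one parent 3x3 kernel (9 params)
sign_mask = np.sign(rng.standard_normal((3, 3)))   # fixed sign-flip mask (no params)

def repeat_kernels(parent, scales):
    """Generate one child kernel per learnable scale factor."""
    return [s * (parent * sign_mask) for s in scales]

scales = np.array([1.0, -0.5, 2.0])        # 3 learnable scalars
children = repeat_kernels(parent, scales)  # 3 output kernels from 9 + 3 parameters
```

Three distinct kernels are produced from one parent plus three scalars, which is the parameter-sharing effect that lets effective width grow much faster than the parameter count.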
6. Empirical Results and Comparative Performance
A broad range of empirical studies validate the efficacy of single-stream CNNs with learnable architectures:
| Method/Setting | Task/Dataset | Baseline | Learnable Architecture | Relative Cost |
|---|---|---|---|---|
| DNArch (CCNN backbone) | LRA (seq. tasks) | 84.15% | 85.78% (kernel+downsampling+width+depth) | 1.01× |
| DNArch (CCNN backbone) | CIFAR-10/100 | 90.52%/64.72% | 93.47%/72.98% | 1.01× |
| DNArch (CCNN backbone) | PDE/Cosmic-ray | 0.00452/0.059 | 0.001763/0.048 | 1.00× |
| SepG-ResNet18 (DGConv) | Berlin HS-SAR | 65.43% ±3.56 | 68.21% ±2.43 | - |
| Greedy Width Expansion (VGG-E, CIFAR-10) | CIFAR-10 | 7.46% err, 21.5M | 5.42% ±0.11, 45.0M ±7.3 | - |
| RepeatNet (ResNet-16, CIFAR-10) | CIFAR-10 | 91.19% | 93.18% (no extra conv params) | - |
Empirical findings demonstrate that learnable architectures, when properly regularized and pruned, match or surpass the accuracy of hand-engineered or dense networks, often with reduced size, lower test variance, and better adaptivity. Notably, group convolution structure is most beneficial in deep, wide layers but detrimental in early, narrow layers, suggesting its primary role as a learnable regularizer rather than as a hard-coded sensor-specific feature extractor (Yang et al., 2021).
7. Practical Implementation Insights and Considerations
Effective construction of single-stream CNNs with learnable architecture requires:
- Careful selection of mask or gate hyperparameters (temperature τ, threshold ε, growth criteria, etc.).
- Use of task-appropriate operator sets for node/edge architectures (Chen et al., 2020).
- Regularization terms (e.g., ℓ1 penalties on connectivity or channel sparsity) to induce parsimony.
- Downsampling and feature grouping mechanisms adapted to data heterogeneity (e.g., multi-source remote sensing).
- Gradual or stage-wise retraining when adaptively adding width or structure during optimization (Mundt et al., 2017).
- Final pruning or discretization step to guarantee deployment efficiency.
A plausible implication is that these methods enable systematic, resource-constrained, and data-driven architecture search without the need for intrusive multi-stream designs or expensive non-differentiable search algorithms. The result is a flexible and theoretically grounded toolkit for constructing highly tailored single-stream CNNs across modalities and tasks (Romero et al., 2023, Chen et al., 2020, Yang et al., 2021, Mundt et al., 2017, Chavan et al., 2021).