DenseNet-OPT: Entropy-Based Architecture Search

Updated 26 January 2026
  • Dense Optimizer (DenseNet-OPT) is an automatic architecture search framework that maximizes information entropy with power-law constraints to optimize DenseNet variants.
  • It employs a branch-and-bound search algorithm that scores candidate architectures using stagewise entropy and power-law fitting for rapid, efficient model discovery.
  • Experimental results show DenseNet-OPT variants outperform traditional DenseNets by reducing top-1 error by up to 4% on benchmark datasets while optimizing computational resources.

Dense Optimizer, referred to as DenseNet-OPT, is an automatic architecture search framework for Dense-like convolutional neural networks that formulates structural optimization as an information entropy maximization under power-law constraints. This approach replaces conventional human-driven tuning with a principled, mathematically grounded methodology, leading to DenseNet variants that systematically allocate representational capacity across network stages for improved efficiency and accuracy. The distinctive feature of Dense Optimizer is its use of information-theoretic metrics and power-law equilibrium to drive both the architecture search and final network design, with a branch-and-bound algorithm providing tractable and efficient model discovery (Tianyuan et al., 2024). Additionally, DenseNet-OPT is tightly connected to optimization-algorithm-inspired feedforward propagation, drawing theoretical parallels to Nesterov’s accelerated gradient and heavy-ball momentum, leading to enhanced gradient flow and convergence properties (Li et al., 2018).

1. Mathematical Formulation: Information Entropy and Power-Law Constraints

Dense Optimizer treats a Dense-like backbone as a multi-stage hierarchical information-processing system. For each stage $i$, the design variables are:

  • number of layers $L_i$,
  • per-layer channel growth rate (width) $w_i$,
  • convolutional kernel size $k_i$.

The objective is to maximize the total structural entropy of the network, with stagewise entropy $H_i$ computed as an upper bound on the normalized Gaussian entropy of each DenseBlock. For a block $f(\cdot)$ with $L$ layers, channel counts $c_i$, kernel sizes $k_i$, and resolution $r_i$, the entropy is $H_f = \log\!\left(r_L^2\, c_{L+1}\right)\cdot \log\!\left(\prod_{i=1}^{L} c_i k_i^2\right)$. Collecting these values across all $M$ stages gives the stagewise entropy vector $\mathbf{H} = [H_1, H_2, \ldots, H_M]$.
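
As a rough illustration, the stagewise entropy can be evaluated directly from the structural variables. The sketch below assumes the product form written above; the function name, the constant per-layer channel counts, and the example numbers are illustrative and not taken from a reference implementation.

```python
import math

def stage_entropy(channels, kernels, r_out, c_out):
    """Upper-bound entropy of one DenseBlock-like stage.

    channels -- per-layer channel counts c_i
    kernels  -- per-layer kernel sizes k_i
    r_out    -- spatial resolution of the stage output (r_L)
    c_out    -- channel count fed to the next stage (c_{L+1})
    Implements H_f = log(r_L^2 * c_{L+1}) * sum_i log(c_i * k_i^2).
    """
    scale = math.log(r_out ** 2 * c_out)
    capacity = sum(math.log(c * k ** 2) for c, k in zip(channels, kernels))
    return scale * capacity

# Hypothetical stage: 31 layers, growth rate 24, 3x3 kernels, 32x32 feature maps
print(stage_entropy([24] * 31, [3] * 31, r_out=32, c_out=808))
```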

Constraints include the following (a feasibility-check sketch follows the list):

  • an effectiveness ratio $\rho_i = L_i / w_{0,i} \leq \rho_{\max}$, with $w_{0,i}$ approximated by the initial width and $\rho_{\max} \in [10, 20]$,
  • budget limits on FLOPs and parameters,
  • monotonic channel-width growth $w_1 \leq w_2 \leq \cdots \leq w_M$,
  • kernel size selection $k_i \in \{3, 5, 7\}$.
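
A minimal feasibility check over these constraints might look as follows; the candidate encoding, the FLOPs/parameter budget values, and the example costs are placeholders rather than the paper's actual limits.

```python
def is_feasible(L, w, k, flops, params,
                rho_max=20, flops_budget=5e9, params_budget=30e6):
    """Check the structural constraints for one candidate.

    L, w, k       -- per-stage layer counts, initial widths, and kernel sizes
    flops, params -- estimated cost of the candidate (placeholder values)
    """
    # effectiveness ratio: rho_i = L_i / w_{0,i} <= rho_max
    if any(Li / wi > rho_max for Li, wi in zip(L, w)):
        return False
    # monotonically nondecreasing channel widths across stages
    if any(w[i] > w[i + 1] for i in range(len(w) - 1)):
        return False
    # kernel sizes restricted to {3, 5, 7}
    if any(ki not in (3, 5, 7) for ki in k):
        return False
    # resource budgets (illustrative numbers)
    return flops <= flops_budget and params <= params_budget

# Illustrative call with hypothetical cost estimates
print(is_feasible([31, 30, 30, 32], [64, 808, 1528, 2248], [3, 3, 3, 3],
                  flops=4e9, params=24.12e6))
```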

Importantly, empirical observation reveals that $\mathbf{H}$, as a function of stage index $m$, closely follows a power law, that is, $H_m \approx a m^b$. The optimization objective is then

$$\max_{c_i, k_i, L_i} \; F = \sum_{i=1}^{M} \alpha_i H_i + \beta (a - b),$$

where $\alpha_i$ and $\beta$ are balancing hyperparameters, and $(a, b)$ are the power-law fit parameters. The goal is to maximize both the overall entropy scale ($a$) and the heaviness of the distribution's tail ($-b$).
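
Putting the pieces together, the score of a candidate can be computed by fitting the power law to its stagewise entropies and combining it with the weighted entropy sum. The sketch below uses an ordinary least-squares fit in log-log space; the all-ones $\alpha_i$, the $\beta = 0.1$ default, and the example entropies are illustrative.

```python
import numpy as np

def power_law_fit(H):
    """Fit H_m ≈ a * m^b over stage indices m = 1..M via log-log regression."""
    m = np.arange(1, len(H) + 1)
    b, log_a = np.polyfit(np.log(m), np.log(H), deg=1)  # slope, intercept
    return np.exp(log_a), b  # (a, b)

def objective(H, alpha=None, beta=0.1):
    """F = sum_i alpha_i * H_i + beta * (a - b)."""
    H = np.asarray(H, dtype=float)
    alpha = np.ones_like(H) if alpha is None else np.asarray(alpha)
    a, b = power_law_fit(H)
    return float(alpha @ H) + beta * (a - b)

# Hypothetical stagewise entropies for a 4-stage candidate
print(objective([120.0, 180.0, 230.0, 260.0]))
```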

2. Search Algorithm: Branch-and-Bound with Power-Law Pruning

To address this mixed-integer, highly nonconvex optimization, Dense Optimizer employs a custom branch-and-bound algorithm tailored for efficiency on CPUs. The iterative procedure maintains a candidate population of network configurations and repeatedly performs the following steps (a simplified sketch of the loop appears below):

  1. Scoring each candidate via computation of entropies $\{H_i\}$ and the power-law fit $(a, b)$.
  2. Identifying stages contributing most to deviation from the ideal $a m^b$ profile.
  3. Splitting those stages into finer candidate sub-regions.
  4. Pruning any sub-region that cannot outperform the global best in upper-bounded entropy or $(a - b)$.
  5. Pruning lowest-scoring candidates if the population exceeds a set cap.

This process repeats for a fixed number of iterations or until convergence. Empirically, the search converges rapidly (approximately $0.2$ CPU-days, roughly $4$ hours), which is significantly more efficient than common NAS methods such as DARTS or SNAS that require multi-GPU, multi-day computations (Tianyuan et al., 2024).
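
The loop below mirrors the five numbered steps in a highly simplified, population-style form; `score`, `split`, and `upper_bound` are placeholder callables standing in for the paper's actual scoring, stage-splitting, and bounding logic.

```python
def branch_and_bound(initial_candidates, score, split, upper_bound,
                     iterations=1000, cap=256):
    """Simplified entropy-guided branch-and-bound search loop.

    score(cand)       -- objective F for a fully specified candidate
    split(cand)       -- refined sub-candidates around the worst-fitting stage
    upper_bound(cand) -- optimistic bound on F achievable within a sub-region
    """
    population = list(initial_candidates)
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        if not population:
            break
        # 1. score every candidate
        scored = sorted(((score(c), c) for c in population),
                        key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        # 2-3. branch: refine the most promising candidates into sub-regions
        children = [child for _, c in scored[: max(1, cap // 4)]
                    for child in split(c)]
        # 4. bound: discard sub-regions that cannot beat the incumbent
        children = [c for c in children if upper_bound(c) > best_score]
        # 5. prune: keep only the highest-scoring candidates up to the cap
        population = sorted([c for _, c in scored] + children,
                            key=score, reverse=True)[:cap]
    return best, best_score
```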

3. Search Space, Structural Variables, and Entropy Criterion

Dense Optimizer’s search space is defined by:

  • Number of stages $M$ (fixed at $4$ in reported experiments).
  • For each stage $i$:
    • $L_i$: number of dense-connection layers ($\sum_i L_i \leq 130$),
    • $c_{0,i}$: input channel width (monotonically nondecreasing),
    • $K_i$: growth rate ($K_i \in \{12, 24, 40\}$),
    • $k_i$: kernel size ($k_i \in \{3, 5, 7\}$).

The entropy criterion simultaneously encourages increased representational capacity (favoring larger $H_i$) and enforces balanced entropy allocation via the power-law fit, penalizing architectures that accumulate capacity disproportionately in a single stage. This fosters an optimal trade-off between network depth, width, and kernel complexity across spatial scales (Tianyuan et al., 2024).
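
A candidate in this search space can be encoded compactly; the dataclass below is one illustrative encoding of the stated variables and ranges, not the paper's own data structure, and the seed values are made up for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    """One point in the Dense Optimizer search space (M = 4 stages)."""
    layers: List[int]        # L_i per stage, with sum(L_i) <= 130
    in_channels: List[int]   # c_{0,i}, monotonically nondecreasing
    growth: List[int]        # K_i in {12, 24, 40}
    kernel: List[int]        # k_i in {3, 5, 7}

    def valid(self) -> bool:
        return (sum(self.layers) <= 130
                and all(a <= b for a, b in zip(self.in_channels, self.in_channels[1:]))
                and all(g in (12, 24, 40) for g in self.growth)
                and all(k in (3, 5, 7) for k in self.kernel))

# Illustrative 4-stage seed candidate
seed = Candidate([6, 12, 24, 16], [64, 128, 256, 512], [24] * 4, [3] * 4)
print(seed.valid())
```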

4. Discovered Architectures: DenseNet-OPT(123) and Variants

Through an extensive search (500,000 iterations; population size 256; initialized from the DenseNet-121 architecture), Dense Optimizer yielded the DenseNet-OPT(123) network:

Stage i   L_i (layers)   K_i (growth rate)   c_{0,i} (input channels)   k_i (kernel size)
1         31             24                  64                         3
2         30             24                  808                        3
3         30             24                  1528                       3
4         32             24                  2248                       3
  • Total layers: 123.
  • Parameters: 24.12M.
  • Transition layers halve resolution between stages; the classification head mirrors DenseNet-BC.
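
The input-channel column is consistent with dense concatenation at growth rate 24 and no inter-stage channel compression, which the short check below reproduces; the absence of compression is inferred from the table values rather than stated explicitly.

```python
layers = [31, 30, 30, 32]   # L_i from the table
growth = 24                 # K_i
c = 64                      # stem output channels
for i, L in enumerate(layers, start=1):
    print(f"stage {i}: input channels = {c}")
    c += L * growth          # each layer appends `growth` feature maps
# prints 64, 808, 1528, 2248, matching c_{0,i} in the table
```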

Additional configurations with higher growth rates ($K = 40$ and $K = 128$) were also produced, extending model capacity and top-1 accuracy accordingly.

5. Experimental Protocol and Comparative Performance

Training applied identical recipes across the CIFAR-10, CIFAR-100, and SVHN datasets (an approximate training-loop sketch follows the list):

  • SGD with momentum $0.9$,
  • Weight decay $5 \times 10^{-4}$,
  • Initial learning rate $0.1$, batch size $32$,
  • Cosine learning-rate scheduling with 5-epoch linear warm-up,
  • 100 epochs,
  • Data augmentations: mix-up, label-smoothing, random erasing, random crop/resize/flip/lighting, Auto-Augment.
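
An approximate PyTorch rendering of this recipe is sketched below; a stock torchvision DenseNet-121 stands in for DenseNet-OPT(123), only part of the augmentation list is reproduced, and the label-smoothing coefficient is an assumed value.

```python
import math
import torch
import torchvision
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

# Placeholder backbone: DenseNet-121 stands in for DenseNet-OPT(123) here.
model = torchvision.models.densenet121(num_classes=100)

optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

epochs, warmup_epochs = 100, 5
def lr_lambda(epoch):
    # 5-epoch linear warm-up followed by cosine decay to zero
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
scheduler = LambdaLR(optimizer, lr_lambda)

# Minimal augmentation pipeline; mix-up, random erasing, lighting, and
# AutoAugment from the full recipe are omitted for brevity.
train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True,
    transform=transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            model(images), labels, label_smoothing=0.1)  # smoothing value assumed
        loss.backward()
        optimizer.step()
    scheduler.step()
```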

Key results:

Dataset     Model                      Params    Top-1 Error   Top-1 Accuracy
CIFAR-100   DenseNet-BC(121)           9.0M      19.90%        80.10%
CIFAR-100   DenseNet-OPT(123)          24.12M    17.74%        82.26%
CIFAR-100   DenseNet-OPT(129, K=40)    32.60M    16.96%        83.04%
CIFAR-100   DenseNet-OPT(86, K=128)    171.7M    15.70%        84.30%
CIFAR-10    DenseNet-BC(250)           15.3M     5.19%         94.81%
CIFAR-10    DenseNet-OPT(123)          24.1M     3.53%         96.47%
SVHN        DenseNet-BC                15.3M     1.74%         98.26%
SVHN        DenseNet-OPT(123)          24.1M     1.49%         98.51%

DenseNet-OPT outperformed the original DenseNet-BC and NAS baselines by $2$–$6\%$ in top-1 accuracy, with search costs reduced to $0.2$ CPU-days ($\sim 4$ hours) versus at least $0.4$ GPU-days for competing NAS frameworks.

6. Connections to Optimization-Inspired Architecture

Independent of the entropy-maximizing search, DenseNet-OPT is also connected theoretically to optimization-algorithm-inspired network design. Feed-forward architectures can be interpreted as unrolled steps of gradient descent on linear-nonlinear objectives. By analogy, inserting heavy-ball and Nesterov’s accelerated gradient updates leads to new network blocks, with DenseNet-OPT specifically corresponding to a Nesterov momentum-inspired, feature-concatenating variant.

DenseNet-OPT layers aggregate features across all preceding layers with weighted concatenations corresponding to "history" coefficients from accelerated optimization. Empirical results on CIFAR and ImageNet confirm that such optimizer-inspired variants yield $0.3\%$–$1.6\%$ lower error than plain DenseNets, with enhanced gradient flow and feature reuse (Li et al., 2018). This suggests that momentum-equivalent feature propagation is integral to DenseNet-OPT's empirical success.
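
For reference, the two accelerated-gradient updates invoked by this interpretation can be written explicitly; mapping the momentum term $\beta_k (x_k - x_{k-1})$ onto a layer's weighted reuse of earlier feature maps is stated here only schematically, as an aid to reading, not as the exact derivation of Li et al. (2018).

```latex
% Heavy-ball (Polyak) momentum:
x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k \, (x_k - x_{k-1})

% Nesterov's accelerated gradient:
y_k     = x_k + \beta_k \, (x_k - x_{k-1}), \qquad
x_{k+1} = y_k - \alpha_k \nabla f(y_k)
```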

7. Ablations, Insights, and Applicability

  • Imposing a power-law constraint over stagewise entropy enforces balanced capacity allocation and prevents overfitting or underutilization of layers at particular scales; removal of this constraint degrades generalization.
  • Ablation studies show a strong positive correlation between the power-law parameter $a$ and accuracy (Pearson $r \approx +0.86$) and a strong negative correlation for $b$ ($r \approx -0.94$), confirming the benefit of heavy-tailed entropy allocation.
  • The Lagrange multiplier $\beta$ governing the weight of $(a - b)$ was best set to $0.1$ for performance balance.
  • The general principle is applicable beyond Dense-like models to any multi-stage, cross-scale architecture relying on feature concatenation and information balance.

In summary, Dense Optimizer establishes a single-level optimization framework from information-theoretic analysis, executed with a branch-and-bound search algorithm, and theoretically linked to accelerated-gradient propagation. The resulting DenseNet-OPT family consistently surpasses hand-crafted architectures under identical training regimes, highlighting both the efficiency and efficacy of entropy-guided architectural search (Tianyuan et al., 2024, Li et al., 2018).
