DenseNet-OPT: Entropy-Based Architecture Search
- Dense Optimizer (DenseNet-OPT) is an automatic architecture search framework that maximizes information entropy with power-law constraints to optimize DenseNet variants.
- It employs a branch-and-bound search algorithm that scores candidate architectures using stagewise entropy and power-law fitting for rapid, efficient model discovery.
- Experimental results show DenseNet-OPT variants outperform traditional DenseNets by reducing top-1 error by up to 4% on benchmark datasets while optimizing computational resources.
Dense Optimizer, referred to as DenseNet-OPT, is an automatic architecture search framework for Dense-like convolutional neural networks that formulates structural optimization as an information-entropy maximization problem under power-law constraints. This approach replaces conventional human-driven tuning with a principled, mathematically grounded methodology, leading to DenseNet variants that systematically allocate representational capacity across network stages for improved efficiency and accuracy. The distinctive feature of Dense Optimizer is its use of information-theoretic metrics and power-law equilibrium to drive both the architecture search and the final network design, with a branch-and-bound algorithm providing tractable and efficient model discovery (Tianyuan et al., 2024). DenseNet-OPT is also tightly connected to optimization-algorithm-inspired feedforward propagation: it has theoretical parallels to Nesterov's accelerated gradient and heavy-ball momentum, which yield enhanced gradient flow and convergence properties (Li et al., 2018).
1. Mathematical Formulation: Information Entropy and Power-Law Constraints
Dense Optimizer treats a Dense-like backbone as a multi-stage hierarchical information-processing system. For each stage $i$, the design variables are:
- the number of layers $L_i$,
- the per-layer channel growth rate (width) $K_i$,
- the convolutional kernel size $k_i$.
The objective is to maximize the total structural entropy of the network, with the stagewise entropy $H_i$ computed as an upper bound on the normalized Gaussian entropy of each DenseBlock; $H_i$ is a function of the block's layer count $L_i$, its channel counts, its kernel sizes $k_i$, and its spatial resolution. Summing across all stages gives the total entropy $H = \sum_i H_i$.
Constraints include:
- an effectiveness-ratio constraint, with the relevant quantities approximated from the initial stage width,
- budget limits on FLOPs and parameter count,
- monotonic channel-width growth across stages ($C_1 \le C_2 \le \cdots$),
- kernel size selection restricted to $k_i \in \{3, 5, 7\}$.
Importantly, empirical observation reveals that the stagewise entropy $H_i$, viewed as a function of the stage index $i$, closely follows a power law, $H_i \approx a \cdot i^{b}$. The optimization objective is then

$$\max_{\{L_i, K_i, k_i\}} \; \lambda_1 H + \lambda_2 b,$$

where $\lambda_1$ and $\lambda_2$ are balancing hyperparameters, and $a$, $b$ are the power-law fit parameters. The goal is to maximize both the overall entropy ($H$) and the heaviness of the distribution's tail (the exponent $b$).
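The scoring step can be sketched in a few lines of Python — a minimal illustration, assuming the stagewise entropies $H_i$ have already been computed for a candidate; the power law $H_i \approx a \cdot i^{b}$ is fitted by least squares in log-log space, and `lam` is a hypothetical stand-in for the balancing hyperparameter:

```python
import math

def fit_power_law(entropies):
    """Least-squares fit of H_i ~ a * i**b in log-log space.

    Returns (a, b); a larger b means entropy grows more steeply with
    stage index, i.e. a heavier-tailed stagewise allocation.
    """
    xs = [math.log(i) for i in range(1, len(entropies) + 1)]
    ys = [math.log(h) for h in entropies]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

def objective(entropies, lam=0.1):
    """Score = total entropy plus the weighted power-law exponent."""
    _, b = fit_power_law(entropies)
    return sum(entropies) + lam * b
```

Fitting in log-log space turns the power law into a straight line, so the exponent $b$ falls out of ordinary least squares.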
2. Search Algorithm: Branch-and-Bound with Power-Law Pruning
To address the mixed-integer, highly nonconvex optimization, Dense Optimizer employs a custom branch-and-bound algorithm tailored for efficiency on CPUs. The iterative procedure maintains a candidate population of network configurations, repeatedly:
- Scoring each candidate by computing its stagewise entropies $H_i$ and the power-law fit parameters $(a, b)$.
- Identifying stages contributing most to deviation from the ideal profile.
- Splitting those stages into finer candidate sub-regions.
- Pruning any sub-region whose entropy upper bound (or power-law score) cannot outperform the global best.
- Pruning lowest-scoring candidates if the population exceeds a set cap.
This process repeats for a fixed number of iterations or until convergence. Empirically, the search converges rapidly (approximately $0.2$ CPU-days, about $4$ hours), which is significantly more efficient than common NAS methods such as DARTS or SNAS that require multi-GPU, multi-day computations (Tianyuan et al., 2024).
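A schematic of this loop is shown below, with hypothetical `score`, `upper_bound`, and `split` callables standing in for the entropy evaluation, the optimistic entropy bound, and the region-refinement step (none of these names come from the paper):

```python
import heapq

def branch_and_bound(root, score, upper_bound, split,
                     max_iters=1000, pop_cap=50):
    """Branch-and-bound sketch over search regions (hypothetical API).

    root:        initial search region (e.g. ranges for L_i, K_i, k_i)
    score:       evaluates a representative candidate from a region
    upper_bound: optimistic bound on the best score inside a region
    split:       refines a region into finer sub-regions
    """
    best_score, best = score(root), root
    population = [root]
    for _ in range(max_iters):
        if not population:
            break
        region = population.pop()
        for sub in split(region):
            # Prune sub-regions that provably cannot beat the incumbent.
            if upper_bound(sub) <= best_score:
                continue
            s = score(sub)
            if s > best_score:
                best_score, best = s, sub
            population.append(sub)
        # Cap the population, keeping the highest-scoring regions.
        if len(population) > pop_cap:
            population = heapq.nlargest(pop_cap, population, key=score)
    return best, best_score
```

On a toy problem (maximize $-(x-7)^2$ over integer intervals, with a trivially valid bound of $0$) the loop homes in on the interval whose midpoint is $7$; the same skeleton applies with entropy scores and entropy upper bounds plugged in.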
3. Search Space, Structural Variables, and Entropy Criterion
Dense Optimizer’s search space is defined by:
- Number of stages (fixed at $4$ in the reported experiments).
- For each stage $i$:
  - $L_i$: number of dense-connection layers (searched over a bounded integer range),
  - $C_i$: input channel width (monotonically nondecreasing across stages),
  - $K_i$: growth rate (searched over a bounded integer range),
  - $k_i$: kernel size ($k_i \in \{3, 5, 7\}$).
The entropy criterion simultaneously encourages increased representational capacity (favoring larger stagewise entropy $H_i$) and enforces balanced entropy allocation via the power-law fit, penalizing architectures that concentrate capacity disproportionately in a single stage. This fosters an optimal trade-off between network depth, width, and kernel complexity across spatial scales (Tianyuan et al., 2024).
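These structural variables and constraints can be captured in a small configuration type — a sketch with hypothetical names, with the FLOPs/parameter budget checks omitted for brevity:

```python
from dataclasses import dataclass

ALLOWED_KERNELS = {3, 5, 7}  # kernel sizes in the reported search space

@dataclass
class StageConfig:
    layers: int        # L_i: dense-connection layers in the stage
    growth: int        # K_i: per-layer channel growth rate
    in_channels: int   # C_i: input channel width of the stage
    kernel: int        # k_i: convolutional kernel size

def is_feasible(stages):
    """Structural feasibility: legal kernel sizes and monotonically
    nondecreasing channel widths (budget constraints omitted)."""
    if any(s.kernel not in ALLOWED_KERNELS for s in stages):
        return False
    widths = [s.in_channels for s in stages]
    return all(w1 <= w2 for w1, w2 in zip(widths, widths[1:]))
```

A search procedure would enumerate or refine lists of `StageConfig` objects and discard any that fail `is_feasible` before scoring.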
4. Discovered Architectures: DenseNet-OPT(123) and Variants
Through an extensive search ($500{,}000$ iterations with a capped candidate population, initialized from the DenseNet-121 architecture), Dense Optimizer yielded the DenseNet-OPT(123) network:
| Stage | $L_i$ (layers) | $K_i$ (growth) | $C_i$ (input channels) | $k_i$ (kernel) |
|---|---|---|---|---|
| 1 | 31 | 24 | 64 | 3 |
| 2 | 30 | 24 | 808 | 3 |
| 3 | 30 | 24 | 1528 | 3 |
| 4 | 32 | 24 | 2248 | 3 |
- Total layers: 123.
- Parameters: 24.12M.
- Transition layers halve the spatial resolution between stages; the classification head mirrors DenseNet-BC.
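The table above can be encoded and sanity-checked directly. Notably, each stage's input width equals the previous stage's input width plus $L_i \times K_i$, which is consistent with dense concatenation and no channel compression at the transitions (an inference from the numbers, not an explicit statement in the text):

```python
# (L_i layers, K_i growth, C_i input channels, k_i kernel) per stage
STAGES = [
    (31, 24, 64,   3),
    (30, 24, 808,  3),
    (30, 24, 1528, 3),
    (32, 24, 2248, 3),
]

total_layers = sum(L for L, _, _, _ in STAGES)
print(total_layers)  # -> 123, matching the "123" in DenseNet-OPT(123)

# Dense concatenation implies C_{i+1} = C_i + L_i * K_i
# (each layer appends K_i channels to the running feature map).
for (L, K, C, _), (_, _, C_next, _) in zip(STAGES, STAGES[1:]):
    assert C_next == C + L * K
```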
Additional configurations with higher growth rates ($K=40$ and $K=128$) were also produced, extending model capacity and top-1 accuracy accordingly.
5. Experimental Protocol and Comparative Performance
Training applied identical recipes across CIFAR-10, CIFAR-100, and SVHN datasets:
- SGD with momentum $0.9$,
- Weight decay,
- Initial learning rate $0.1$, batch size $32$,
- Cosine learning-rate scheduling with 5-epoch linear warm-up,
- 100 epochs,
- Data augmentations: mix-up, label-smoothing, random erasing, random crop/resize/flip/lighting, Auto-Augment.
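The cosine schedule with linear warm-up can be written as a small helper — a sketch matching the hyperparameters listed above (base LR $0.1$, 5 warm-up epochs, 100 epochs); the exact warm-up convention (0-indexed epochs, decay toward zero) is an assumption:

```python
import math

def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=100):
    """Linear warm-up to base_lr over the first warm-up epochs, then a
    cosine decay toward zero over the remaining epochs (0-indexed)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

The schedule ramps from a small value at epoch 0 to $0.1$ at the end of warm-up, then decays smoothly, reaching nearly zero by epoch 99.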
Key results:
| Dataset | Model | Params | Top-1 Error | Top-1 Accuracy |
|---|---|---|---|---|
| CIFAR-100 | DenseNet-BC(121) | 9.0M | 19.90% | 80.10% |
| CIFAR-100 | DenseNet-OPT(123) | 24.12M | 17.74% | 82.26% |
| CIFAR-100 | DenseNet-OPT(129, K=40) | 32.60M | 16.96% | 83.04% |
| CIFAR-100 | DenseNet-OPT(86, K=128) | 171.7M | 15.70% | 84.30% |
| CIFAR-10 | DenseNet-BC(250) | 15.3M | 5.19% | 94.81% |
| CIFAR-10 | DenseNet-OPT(123) | 24.1M | 3.53% | 96.47% |
| SVHN | DenseNet-BC | 15.3M | 1.74% | 98.26% |
| SVHN | DenseNet-OPT(123) | 24.1M | 1.49% | 98.51% |
DenseNet-OPT outperformed the original DenseNet-BC and NAS baselines by roughly $2$–$4\%$ in top-1 accuracy, with search cost reduced to $0.2$ CPU-days (about $4$ hours) versus at least $0.4$ GPU-days for competing NAS frameworks.
6. Connections to Optimization-Inspired Architecture
Independent of the entropy-maximizing search, DenseNet-OPT is also connected theoretically to optimization-algorithm-inspired network design. Feed-forward architectures can be interpreted as unrolled steps of gradient descent on linear-nonlinear objectives. By analogy, inserting heavy-ball and Nesterov’s accelerated gradient updates leads to new network blocks, with DenseNet-OPT specifically corresponding to a Nesterov momentum-inspired, feature-concatenating variant.
DenseNet-OPT layers aggregate features across all preceding layers, with weighted concatenations corresponding to "history" coefficients from accelerated optimization. Empirical results on CIFAR and ImageNet confirm that such optimizer-inspired variants yield reduced error versus plain DenseNets, with augmented gradient flow and feature reuse (Li et al., 2018). This suggests that momentum-equivalent feature propagation is integral to DenseNet-OPT's empirical success.
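The momentum analogy can be made concrete with a toy numerical sketch (an illustration of the heavy-ball recurrence, not the paper's construction): on an ill-conditioned quadratic, the history term accelerates convergence, much as dense feature reuse propagates earlier-layer information forward:

```python
def gd(grad, x0, lr, steps):
    """Plain gradient descent: x_{k+1} = x_k - lr * grad(x_k)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def heavy_ball(grad, x0, lr, beta, steps):
    """Heavy-ball: x_{k+1} = x_k - lr * grad(x_k) + beta * (x_k - x_{k-1});
    the momentum term reuses 'history', much as a dense layer reuses
    all earlier feature maps."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x, x_prev = x - lr * grad(x) + beta * (x - x_prev), x
    return x

# Flat quadratic f(x) = 0.5 * 0.01 * x^2: plain GD shrinks x by only
# a factor 0.99 per step, while momentum closes the gap much faster.
g = lambda x: 0.01 * x
plain = gd(g, 1.0, lr=1.0, steps=200)
momentum = heavy_ball(g, 1.0, lr=1.0, beta=0.9, steps=200)
assert abs(momentum) < abs(plain)
```

Here `plain` equals $0.99^{200} \approx 0.134$, while the heavy-ball iterate is orders of magnitude closer to the minimizer at $0$, mirroring the improved convergence attributed to momentum-inspired propagation.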
7. Ablations, Insights, and Applicability
- Imposing a power-law constraint over stagewise entropy enforces balanced capacity allocation and prevents overfitting or underutilization of layers at particular scales; removal of this constraint degrades generalization.
- Ablation studies show a strong positive Pearson correlation between one power-law fit parameter and accuracy, and a strong negative correlation for the other, confirming the benefit of heavy-tailed entropy allocation.
- The Lagrange multiplier governing the weight of the power-law term was best set to $0.1$ for performance balance.
- The general principle is applicable beyond Dense-like models to any multi-stage, cross-scale architecture relying on feature concatenation and information balance.
In summary, Dense Optimizer establishes a single-level optimization framework from information-theoretic analysis, executed with a branch-and-bound search algorithm, and theoretically linked to accelerated-gradient propagation. The resulting DenseNet-OPT family consistently surpasses hand-crafted architectures under identical training regimes, highlighting both the efficiency and efficacy of entropy-guided architectural search (Tianyuan et al., 2024, Li et al., 2018).