
Global Channel Gating (GCG)

Updated 21 January 2026
  • Global Channel Gating (GCG) is a neural network pruning framework that uses per-channel hard gating and hypergraph dependency modeling for hardware-agnostic compression.
  • It employs an adaptive auxiliary loss and iterative pruning with fine-tuning to maintain accuracy while reducing FLOPs and latency, especially in complex architectures like ResNet.
  • Empirical evaluations on ImageNet demonstrate that GCG achieves significant computational and latency reductions with minimal drops in top-1 accuracy.

Global Channel Gating (GCG) is a structured neural network pruning framework introduced to address the need for hardware-agnostic network compression with state-of-the-art accuracy-resource trade-offs. GCG leverages per-channel hard gating, hypergraph-based dependency modeling, and an adaptive auxiliary loss targeting computational, memory, or empirical latency cost, yielding fine control over pruning behavior even in architectures with complex connectivity patterns such as ResNets. Notably, it can remove entire non-sequential blocks and consistently enforces identical pruning across skip-linked layers. On standard benchmarks such as ResNet-50/ImageNet, GCG demonstrates substantial FLOPs and latency reductions while preserving accuracy at levels competitive with or exceeding prior methods (Passov et al., 2022).

1. Channel-wise Hard Gating Mechanism

GCG introduces a learnable gating system parameterized per channel. Let $X \in \mathbb{R}^{B \times C \times H \times W}$ denote the activation (input or output) of a convolution. Each channel $i = 1 \ldots C$ is associated with a learnable scalar $\theta_i$, forming the basis for a Bernoulli hard gate $g_i \in \{0,1\}$. Gate sampling employs a binary Gumbel-logistic distribution:

$$\sigma(u) = \frac{1}{1+e^{-u}}, \qquad x \sim \mathrm{Logistic}(0,1),$$

$$g_i = \begin{cases} 1, & \sigma((\theta_i + x)/\tau) \geq 0.5, \\ 0, & \text{otherwise,} \end{cases} \qquad \tau > 0.$$

With $\tau = 1$ in practice, training uses a straight-through estimator: the forward pass applies hard thresholding, while gradients are backpropagated through the smooth sigmoid. The gated activation is $\widetilde X_{b,i,h,w} = g_i X_{b,i,h,w}$. To prune output channels, $g_i$ is applied post-convolution; to prune input channels, pre-convolution.

2. Auxiliary Cost and Loss Formulation

The pruning process is regularized with an auxiliary cost function $L_{\mathrm{comp}}(t)$ reflecting resource consumption, combined with the original training loss (e.g., cross-entropy) as

$$L_{\mathrm{total}}(t) = L_{\mathrm{orig}} + \alpha\, L_{\mathrm{comp}}(t),$$

where $\alpha > 0$ manages the trade-off between accuracy and pruning aggressiveness. Channel dependencies are grouped into edges $E = \{e_j\}_{j=1}^m$, with per-edge gate vectors $g_{j,i}$. The number of active channels on edge $j$ at iteration $t$ is $c_j(t) = \sum_{i=1}^{C_j} g_{j,i}(t)$. Each edge is assigned a per-channel cost $\lambda_j(t)$, parameterizable for memory, theoretical FLOPs, or empirical latency. For instance, the edge-wise FLOPs cost is:

$$\lambda_j(t) = \sum_{\text{in}} d_q k_q^w k_q^h c_q^{\mathrm{out}}(t) + \sum_{\text{out}} d_q k_q^w k_q^h c_q^{\mathrm{in}}(t),$$

where $d_q = 4^{-s_q}$ encodes down-sampling and $k_q^w \times k_q^h$ is the kernel size. The normalized cost per edge is:

$$\widehat\lambda_j(t) = \frac{\lambda_j(t)}{\sum_{j=1}^m C_j \lambda_j(0)},$$

$$L_{\mathrm{comp}}(t) = \sum_{j=1}^m c_j(t)\, \widehat\lambda_j(t).$$
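A minimal sketch of this auxiliary cost, assuming made-up edge data (two edges with hypothetical per-channel FLOPs counts); the function and variable names are illustrative, not from the Gator codebase. With all gates active at $t = 0$, the normalization makes the cost equal to 1.

```python
import torch

def comp_loss(gates, lam_t, lam_0, channels_0):
    """L_comp(t) = sum_j c_j(t) * lambda_hat_j(t), per Sec. 2."""
    # Normalizer: sum_j C_j * lambda_j(0), fixed at the start of pruning.
    z = sum(c * l0 for c, l0 in zip(channels_0, lam_0))
    loss = torch.zeros(())
    for g_j, l_j in zip(gates, lam_t):
        c_j = g_j.sum()                 # active channels on edge j
        loss = loss + c_j * (l_j / z)   # c_j(t) * normalized lambda_j(t)
    return loss

gates = [torch.ones(64), torch.ones(256)]   # two edges, all channels active
lam_0 = [1e6, 4e6]                          # hypothetical per-channel FLOPs at t = 0
channels_0 = [64, 256]
L_comp = comp_loss(gates, lam_0, lam_0, channels_0)   # lambda_t = lambda_0 here
```

Since the gates enter the sum differentiably, gradients of this cost push each $\theta_{j,i}$ downward in proportion to that edge's normalized cost.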

To ensure balanced pruning across varied edges, the learning rate for the gating parameters is rescaled:

$$\eta_j(t) = \frac{\gamma\, \eta(t)}{\widehat\lambda_j(t)},$$

mitigating premature pruning of high-cost (typically shallower) layers whose gates would otherwise receive the largest gradients from the auxiliary loss.
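In PyTorch this rescaling maps naturally onto optimizer parameter groups, one per edge. The values of `gamma`, `eta`, and the normalized costs below are hypothetical placeholders.

```python
import torch

gamma, eta = 0.1, 1e-2
theta_edges = [torch.zeros(64, requires_grad=True),
               torch.zeros(256, requires_grad=True)]
lam_hat = [0.05, 0.2]   # assumed normalized per-channel costs for two edges

# eta_j = gamma * eta / lambda_hat_j: costlier edges get smaller steps.
param_groups = [
    {"params": [th], "lr": gamma * eta / lh}
    for th, lh in zip(theta_edges, lam_hat)
]
optimizer = torch.optim.SGD(param_groups, lr=eta)   # group lr overrides default
```

Each group's learning rate then scales inversely with its edge's cost, so all edges prune at a comparable pace.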

3. Hypergraph-based Modeling of Channel Dependencies

GCG's pruning must respect architectural dependencies, especially in non-sequential designs such as ResNets, in which a channel may simultaneously propagate through the principal path and skip connections. For this, an undirected hypergraph $H = (V, E)$ encodes channel dependencies:

  • $V$: the union of all convolution input/output channel indices.
  • Each hyperedge $e_j \in E$: a set of vertices forming a "dependency group", i.e., input/output channels across convolutions that share the same tensor dimension.

For ResNet-50, this yields $m = 37$ edges: 32 sequential (within bottlenecks) and 5 large skip-path edges spanning multiple layers. Pruning is enacted by setting the gates for all channels in a dependency group to zero, ensuring consistent structural reduction across all affected pathways.
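A dependency group can be represented simply as a set of (layer, input/output) slots that must share one gate vector. The layer names below are illustrative and not taken from any specific codebase; only the grouping idea follows the paper.

```python
hyperedges = {
    # Sequential edge inside a bottleneck: conv1's outputs are conv2's inputs.
    "block1_mid": [("block1.conv1", "out"), ("block1.conv2", "in")],
    # Skip-path edge: every conv touching the residual sum shares channels.
    "stage1_skip": [("block1.conv3", "out"), ("block1.downsample", "out"),
                    ("block2.conv1", "in"), ("block2.conv3", "out"),
                    ("block3.conv1", "in")],
}

def prune_channel(edge: str, i: int, gates: dict) -> None:
    """Zero gate i on every (layer, io) slot in the dependency group."""
    for slot in hyperedges[edge]:
        gates[slot][i] = 0.0
```

Pruning channel `i` on a skip-path edge therefore removes it consistently from the main path, the downsample branch, and every block reading the residual tensor.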

4. Iterative Pruning and Fine-Tuning Algorithm

GCG applies pruning in staged iterations, each comprising gating, hard-channel removal, and fine-tuning. The process is described as follows:

  • Initialization: learning rates are reset; the gating parameters $\theta$ are retained from the previous iteration.
  • Gating phase ($T_{\mathrm{gate}}$ epochs): for each batch, sample Gumbel-logistic noise for every gate, compute $g_{j,i}$ as above, apply gating in the forward pass, evaluate $L_{\mathrm{total}}$, and backpropagate using the per-edge learning rates.
  • Prune: after the gating phase, compute each gate's soft activation probability $p_{j,i} = \mathbb{E}_x[\sigma((\theta_{j,i} + x)/\tau)]$. If $p_{j,i} < 0.5$, set $g_{j,i} \equiv 0$ (the channel is permanently pruned).
  • Fine-tuning phase ($T_{\mathrm{ft}}$ epochs): fix the pruned channels and train only the remaining weights under $L_{\mathrm{orig}}$.

This cycle is repeated for several iterations, gradually increasing $\alpha$ to drive greater sparsity while recovering accuracy through the intervening fine-tuning phases.
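The prune decision at the end of a gating phase can be sketched as follows. For Logistic(0,1) noise, the hard gate fires exactly when $\theta + x \geq 0$, so the gate's activation probability reduces to $\sigma(\theta)$, which serves as the threshold quantity here; the function name is an assumption.

```python
import torch

def prune_decision(theta: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # P(gate = 1) = P(theta + x >= 0) = sigmoid(theta) for Logistic(0,1) noise;
    # channels whose activation probability falls below 0.5 are removed.
    p_active = torch.sigmoid(theta)
    return p_active >= threshold    # True = keep channel, False = prune

theta = torch.tensor([-2.0, 0.0, 3.0])
keep = prune_decision(theta)        # tensor([False, True, True])
```

Note that the threshold comparison is independent of $\tau$, since $\sigma((\theta + x)/\tau) \geq 0.5$ holds exactly when $\theta + x \geq 0$.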

5. Empirical Evaluation on ResNet-50 / ImageNet

GCG achieves notable compression and speedup on standard benchmarks. Table 1 summarizes main empirical results as reported for ResNet-50 on ImageNet, with two tuning objectives: FLOPs reduction and latency speedup.

| Configuration | FLOPs Reduction | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Relative Speedup (×) |
|---|---|---|---|---|
| Baseline (ResNet-50) | – | 76.15 | 92.87 | 1.00 |
| Gator flops 0.5 | 49.9% | 75.19 (–0.96) | 92.61 (–0.26) | 1.44 |
| Gator latency 0.5 | 48.3% | 75.28 (–0.87) | 92.50 (–0.37) | 1.65 |
| Gator flops 1.0 | 62.6% | 74.14 (–2.01) | 91.99 (–0.88) | 1.61 |
| Gator latency 1.0 | 61.2% | 74.24 (–1.91) | 91.95 (–0.92) | 1.86 |
| Gator flops 2.0 | 76.6% | 72.36 (–3.79) | 90.97 (–1.90) | 2.11 |

In the regime of approximately 50% FLOPs reduction, GCG incurs only a 1% absolute drop in top-1 accuracy, with a real-world GPU speedup of 1.4×, surpassing prior structured-pruning approaches both in computational and empirical metrics.

6. Context and Significance within Neural Network Compression

GCG generalizes channel pruning by introducing flexible, differentiable hard gating applicable to architectures with complex connectivity. Its auxiliary cost can be tuned explicitly for mission-specific resource constraints, whether FLOPs, memory footprint, or device-specific latency, an advance over prior pruning schemes tied rigidly to FLOPs or layerwise strategies. The hypergraph dependency formalism makes pruning feasible for architectures with multi-path connections, notably without the need to disentangle skip-linked tensors manually. A plausible implication is the broadening of structured pruning's applicability to emerging model classes with intricate topologies.

7. Availability and Extensions

The implementation codebase for GCG, designated as "Gator," is publicly available at https://github.com/EliPassov/gator. This facilitates reproducibility and adaptation to alternative architectures and deployment restrictions (Passov et al., 2022). While the core mechanism targets convolutional networks, the underlying dependency modeling is not tied to a specific layer type, potentially enabling adaptation to other structured tensor factorizations or reparameterizations. Further empirical studies may clarify efficacy across non-vision modalities and in conjunction with quantization and knowledge distillation.


For detailed algorithms, model dependencies, and supplementary results, refer to "Gator: Customizable Channel Pruning of Neural Networks with Gating" (Passov et al., 2022).

References

  • Passov et al. (2022). "Gator: Customizable Channel Pruning of Neural Networks with Gating."