
Temporal Squeeze Pooling (TS)

Updated 2 February 2026
  • Temporal Squeeze Pooling (TS) is a video representation technique that compresses long video clips into compact embeddings using adaptive, learnable projections.
  • It reduces computational and memory costs by fusing temporal information into squeezed images or channel-fused tensors while preserving key motion dynamics.
  • TS pooling integrates with architectures like TeSNet and SqueezeTime, demonstrating improved accuracy and efficiency on benchmarks such as UCF101 and Kinetics-400.

Temporal Squeeze Pooling (TS) refers to a class of operations for video representation learning that compress the temporal dimension of video clips into compact, information-preserving forms suitable for efficient and effective recognition. Two prominent instantiations—the original TS pooling for spatio-temporal aggregation in standard CNN backbones (Huang et al., 2020) and its reinterpretation as “squeezing time into channel” for mobile video understanding (Zhai et al., 2024)—demonstrate both the generality and technical nuance of the approach. TS pooling employs learnable, input-adaptive projections to map long sequences either into a small set of “squeezed images” or into fused channel representations, while providing mechanisms to recover and emphasize temporal dynamics.

1. Motivation and Underlying Principles

The core motivation underlying TS pooling is to address the redundancy and inefficiency inherent in conventional video encodings. Long video clips, typically containing tens or hundreds of frames, include substantial temporal redundancy due to static backgrounds and repeated appearance, while salient, discriminative motion cues are temporally sparse. Existing paradigms—such as 3D convolutions that treat time as an explicit axis, or dynamic image summarization collapsing an entire clip into a single pseudo-frame—can either lead to prohibitive computational footprints or substantial loss of fine-grained temporal information.

TS pooling fundamentally seeks to compress a sequence of $K$ frames, $X \in \mathbb{R}^{K \times C \times H \times W}$, into a low-dimensional embedding of $D \ll K$ discriminative projections. These projections can manifest as either a set of “squeezed images” (if maintaining spatial structure) or as a channel-fused tensor in which temporal variation is encoded across feature dimensions. Importantly, modern formulations employ adaptive, input-conditioned transformations, allowing the network to emphasize task-relevant temporal structures rather than static background content.

2. Mathematical Formulation of TS Pooling Layers

In the canonical TS pooling layer for spatio-temporal feature reduction (Huang et al., 2020), the operation proceeds as follows:

  • Squeeze Step: Compute a per-frame global descriptor by averaging all pixels and channels for each frame $x_k$, yielding $z_k = \frac{1}{HWC}\sum_{i,j,l} x_k(i,j,l)$ and $\mathbf{z} \in \mathbb{R}^K$.
  • Excitation Step: Generate a projection subspace via a two-layer, fully connected “squeeze-and-excitation” block:

$u = \delta_2(W_2\,\delta_1(W_1\,\mathbf{z})) \in \mathbb{R}^{KD}$

where $W_1$, $W_2$ are learnable matrices, $\delta_1$ is Sigmoid, and $\delta_2$ is LeakyReLU; $u$ is reshaped and column independence is enforced, yielding $A \in \mathbb{R}^{K \times D}$.

  • Projection Step: Each pixel trajectory through time $\bar{x}_i \in \mathbb{R}^{K}$ (for all $i \in 1, \dots, HWC$) is least-squares-projected onto the $D$-dimensional subspace:

$y_i = (A^\mathsf{T}A)^{-1} A^\mathsf{T}\bar{x}_i \in \mathbb{R}^{D}$

The concatenated $Y \in \mathbb{R}^{(HWC) \times D}$ is reshaped into squeezed images $S \in \mathbb{R}^{D \times C \times H \times W}$.
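As a shape-level illustration, the squeeze, excitation, and projection steps above can be sketched in NumPy. All sizes, the hidden width of the excitation block, and the random weights are illustrative assumptions, not values from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, H, W, D = 8, 3, 4, 4, 2          # illustrative sizes

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)

X = rng.standard_normal((K, C, H, W))

# Squeeze: one scalar per frame (mean over channels and pixels)
z = X.mean(axis=(1, 2, 3))             # shape (K,)

# Excitation: two-layer block produces the projection basis A
hidden = 16                            # hypothetical bottleneck width
W1 = rng.standard_normal((hidden, K)) * 0.1
W2 = rng.standard_normal((K * D, hidden)) * 0.1
u = leaky_relu(W2 @ sigmoid(W1 @ z))   # shape (K*D,)
A = u.reshape(K, D)                    # columns span the subspace

# Projection: least-squares fit of every pixel trajectory onto A
Xflat = X.reshape(K, -1)               # (K, C*H*W) trajectories as columns
Y = np.linalg.solve(A.T @ A, A.T @ Xflat)   # (D, C*H*W)
S = Y.reshape(D, C, H, W)              # D "squeezed images"
print(S.shape)                         # (2, 3, 4, 4)
```

Note that the projection is a closed-form least-squares solve per clip, so the only learned parameters are in the small excitation MLP.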

For “time-to-channel” operations (Zhai et al., 2024), the temporal axis is fused into channels by reshaping $X \in \mathbb{R}^{T \times C \times H \times W}$ to $F_b \in \mathbb{R}^{(TC) \times H \times W}$, optionally followed by a 2D convolution mapping to $C'$ output channels. This operation removes explicit temporal structure, requiring subsequent channel–time learning blocks to restore the useful parts of temporal modeling.
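The reshape itself is parameter-free; a minimal NumPy sketch (the sizes and the 1×1 convolution weights below are illustrative assumptions, with the 1×1 convolution written as a per-pixel matrix multiply):

```python
import numpy as np

T, C, H, W = 16, 3, 56, 56
X = np.random.randn(T, C, H, W)

# Squeeze time into channels: no parameters, only an axis-layout change
F_b = X.reshape(T * C, H, W)           # (48, 56, 56)
assert np.allclose(F_b.reshape(T, C, H, W), X)   # pixel values untouched

# Optional 2D 1x1 convolution mapping the fused T*C channels to C' outputs
C_out = 64                             # hypothetical output width
Wc = np.random.randn(C_out, T * C) * 0.01
F = np.einsum('oc,chw->ohw', Wc, F_b)
print(F.shape)                         # (64, 56, 56)
```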

3. Architectures and Integration into Video Networks

TeSNet

In TeSNet, the temporal squeeze pooling layers are embedded into an Inception-ResNet-V2 backbone (pre-trained on ImageNet) for two-stream video classification (RGB and TV-L1 optical flow). TS layers can be positioned at the input (directly after raw frames) or after convolutional blocks, with the output “squeezed images” fed through the backbone as standard images.

  • Configurations:
    • Single TS after input: $K$ frames $\rightarrow$ TS ($D=3$) $\rightarrow$ full CNN (shared weights across the $D$ compressed frames).
    • Pyramidal TS: TS at input ($D_1$), deeper TS at a later block ($D_2$).
  • Output handling: Each squeezed image is treated as a conventional image input; backbone feature maps are pooled and fused for classification.

SqueezeTime with Channel–Time Learning (CTL) Blocks

In mobile-centric architectures, the temporal axis is squeezed into channel dimensions for maximal efficiency. The CTL block then restores and refines temporal dynamics. Each residual CTL block comprises:

  • Bottleneck form:

$F_o = \text{Conv}_{1\times1}^{rC \to C}(\text{CTL}(\text{Conv}_{1\times1}^{C \to rC}(F_i))) + F_i$

with reduction ratio $r = 1/4$.
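A shape-only sketch of this bottleneck residual, with CTL(·) left as an identity placeholder and each 1×1 convolution written as a per-pixel matrix multiply (all sizes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 64, 8, 8
rC = C // 4                                    # reduction ratio r = 1/4

F_i = rng.standard_normal((C, H, W))
W_down = rng.standard_normal((rC, C)) * 0.05   # 1x1 conv, C -> rC
W_up = rng.standard_normal((C, rC)) * 0.05     # 1x1 conv, rC -> C

def conv1x1(Wm, F):
    # A 1x1 convolution is a matrix multiply over channels at every pixel
    return np.einsum('oc,chw->ohw', Wm, F)

def ctl(F):
    # Placeholder for TFC(.) + IOI(.); identity keeps the shapes visible
    return F

F_o = conv1x1(W_up, ctl(conv1x1(W_down, F_i))) + F_i   # residual add
print(F_o.shape)                               # (64, 8, 8)
```

The bottleneck keeps the CTL core cheap: its two branches operate on $rC$ channels rather than the full $C$.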

  • CTL core:

$\text{CTL}(\cdot) = \text{TFC}(\cdot) + \text{IOI}(\cdot)$

  • Temporal-Focus Convolution (TFC) branch: Channel-wise importance weighting computed via a global max pool and two-layer MLP followed by sigmoid, reweighting a standard 2D $k \times k$ convolution per channel.
  • Inter-Temporal Object Interaction (IOI) branch: Recovers temporal position and enables interaction across pseudo-temporal (fused) channels, using learnable temporal positional encodings and large-kernel convolutions across collapsed temporal axes.
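The TFC-style channel gating can be sketched as follows. The bottleneck width and the ReLU inside the MLP are assumptions, and the sketch shows only the gating, not the accompanying $k \times k$ convolution:

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 16, 8, 8                     # illustrative fused-channel map
F = rng.standard_normal((C, H, W))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Global max pool gives one descriptor per (fused) channel
g = F.max(axis=(1, 2))                 # shape (C,)

# Two-layer MLP + sigmoid yields per-channel importance weights in (0, 1)
hidden = C // 4                        # assumed bottleneck width
W1 = rng.standard_normal((hidden, C)) * 0.1
W2 = rng.standard_normal((C, hidden)) * 0.1
w = sigmoid(W2 @ np.maximum(W1 @ g, 0.0))   # shape (C,)

# Reweight each channel; in the full block this modulates the k x k conv
F_weighted = w[:, None, None] * F
print(F_weighted.shape)                # (16, 8, 8)
```

Because each fused channel corresponds to a (frame, channel) pair of the original clip, the gate effectively lets the block emphasize informative time steps.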

4. Computational Complexity and Memory Efficiency

TS pooling enables substantial reductions in both FLOPs and memory:

  • Original TS pooling: Projects a $K$-frame video to $D$ squeezed frames ($D \ll K$), with little to no increase in parameter or runtime cost. When integrated as an input layer, downstream operations are fully shared among squeezed images.
  • Time-to-channel pooling: Replaces all $T$-axis convolutions with 2D convolutions on $(TC)$ input channels:

$O_{TS} = 2 \cdot C_{\text{out}} \cdot C_{\text{in}} \cdot k^2 \cdot H \cdot W$

eliminating the $T$ factor present in 3D convolutional cost. Peak activation memory is similarly reduced: it is $(TC) \cdot H \cdot W$ at the first stage and only $C \cdot H \cdot W$ thereafter, versus $T \cdot C \cdot H \cdot W$ at every stage of a 3D CNN. The empirical result is a near $T$-fold memory and computation saving relative to conventional 3D CNNs.
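To make the cost comparison concrete, the following back-of-the-envelope arithmetic contrasts a mid-network 3D convolution with its 2D counterpart after time-to-channel squeezing. The sizes are illustrative, and multiply-accumulates are counted as 2 ops each:

```python
T, C_in, C_out, k, H, W = 16, 64, 64, 3, 56, 56

# 3D conv at an intermediate stage: temporal kernel depth k, T output frames
ops_3d = 2 * C_out * C_in * k**3 * T * H * W

# After squeezing, the same stage is a single 2D conv with no temporal axis
ops_2d = 2 * C_out * C_in * k**2 * H * W

print(ops_3d // ops_2d)   # k * T = 48: both the T factor and the temporal
                          # kernel depth vanish from the per-stage cost
```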

5. Empirical Evaluation and Ablation Analyses

Temporal Squeeze pooling demonstrates considerable efficacy across multiple video recognition benchmarks:

| Benchmark | Model/Config | RGB (%) | OF (%) | Combined (%) |
|---|---|---|---|---|
| UCF101 | Baseline (single frame, IncResNet-V2) | 83.5 | 85.4 | 92.5 |
| UCF101 | TSN (3 frames) | 85.0 | 85.1 | 92.9 |
| UCF101 | TeSNet (64 frames, $D_1=16$, $D_2=4$) | 87.8 | 88.2 | 95.2 |
| HMDB51 | TeSNet (avg. 3 splits) | -- | -- | 71.5 |

For TeSNet, the best single-layer setting is an input-level TS with $D=3$ (85.4% top-1 on UCF101, split 1), with two-layer designs offering further tradeoffs between accuracy and computational expense (Huang et al., 2020). Ablations indicate that increasing the number of frames $K$ improves performance (e.g., $K=10$ yields 85.3%, $K=64$ yields 87.8%).

In SqueezeTime (Zhai et al., 2024), within mobile constraints (≈5.5 GFLOPs, 16 frames, $224^2$ input), the approach delivers higher Top-1 accuracy (71.6% on Kinetics-400) than competing mobile methods and achieves GPU throughput of 903 clips/s, over 80% faster than the nearest baseline. Memory and latency savings are equally significant—e.g., throughput, CPU latency, and downstream fine-tuning for temporal action detection are all improved.

6. Characteristics and Interpretation of Squeezed Outputs

The output of TS pooling comprises a low-dimensional set of images (“squeezed images” or “fused channel tensors”) that visually and semantically partition scene content:

  • For $K=10$, $D=2$: One squeezed image will typically correspond to the static background, while the other highlights moving objects as salient, motion-blur-like features.
  • When clips contain little movement, both outputs approximate the average scene, reflecting subtle temporal variation.
  • Multiple squeezed images can recover multiple modes of temporal variation, in contrast to the inherent information loss of single dynamic image summarization.

In channel-fused representations, the subsequent CTL blocks facilitate the restoration of explicit temporal reasoning necessary for recognition tasks, making this approach highly suitable for computationally constrained environments.

7. Training Procedures and Optimization

Training TS pooling models involves specific objectives and hyperparameters:

  • Loss functions:

    • Squeeze reconstruction loss:

    $\ell_{proj} = \frac{1}{HWC} \sum_{i=1}^{HWC} \| \bar{x}_i - \widehat{x}_i \|_2$

    • Final loss:

    $\ell_{final} = \ell_{classif} + \beta \sum_{m=1}^{M} \ell_{proj}^{(m)} + \lambda \ell_{L2}$

with typical weights $\beta = 10$ and $\lambda = 4 \times 10^{-5}$.
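A toy sketch of how the loss terms combine for one training step. All numeric values and the least-squares reconstruction helper are illustrative assumptions, not the papers' implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def proj_loss(Xflat, A):
    # Reconstruction error of one TS layer: how well the subspace
    # spanned by A's columns reproduces the pixel trajectories
    Y = np.linalg.solve(A.T @ A, A.T @ Xflat)   # least-squares coefficients
    X_hat = A @ Y                               # reconstruction in R^K
    return np.linalg.norm(Xflat - X_hat, axis=0).mean()

K, D, N = 8, 2, 32
Xflat = rng.standard_normal((K, N))             # N pixel trajectories
A = rng.standard_normal((K, D))

loss_classif = 1.20                             # hypothetical per-step value
l2_penalty = 0.5                                # placeholder weight norm
beta, lam = 10.0, 4e-5                          # weights from the paper

loss_final = loss_classif + beta * proj_loss(Xflat, A) + lam * l2_penalty
print(round(loss_final, 3))
```

The reconstruction term pushes the excitation block toward bases whose span actually captures the clip's trajectories, so the squeezed images stay faithful to the input.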

  • Optimization: SGD (batch size 32, momentum 0.9), dropout 0.5.
  • Learning rates: 0.001 for RGB, 0.005 for optical flow, with $0.1\times$ decay on validation plateau.
  • Input: $299 \times 299$ (TeSNet), $224 \times 224$ (SqueezeTime); $K$ typically 16–64 frames.

In summary, Temporal Squeeze pooling formalizes an efficient, learnable reduction of temporal dimensionality in video representation learning, balancing compactness with preservation of discriminative temporal cues. The methodology demonstrates strong empirical results and broad applicability, particularly for resource-constrained video understanding (Huang et al., 2020, Zhai et al., 2024).
