Temporal & Depthwise Convolution
- Temporal and depthwise convolutions are fundamental operations in neural networks that perform efficient one-dimensional filtering for sequential data.
- Architectures like ST-AttNet and TENet illustrate the use of multi-scale, residual methods to achieve high accuracy with reduced parameters and multiply-adds.
- These approaches drastically lower computational cost and memory usage, making them ideal for applications in edge devices and speech recognition.
Temporal convolution and depthwise convolution are fundamental operations in modern neural network architectures designed for sequence and time series modeling, particularly when computational efficiency is paramount. These operators underpin compact, high-performance networks in tasks such as keyword spotting, enabling accurate inference under tight memory and energy constraints.
1. Formal Definitions and Mathematical Formulations
Temporal convolution refers to 1-dimensional convolution applied along the time axis of sequential input data. Given an input feature map $X \in \mathbb{R}^{T \times C_{in}}$, where $T$ is the sequence length and $C_{in}$ the channel dimension, a standard 1D temporal convolution with kernel $W \in \mathbb{R}^{K \times C_{in} \times C_{out}}$ produces the output

$$Y[t, j] = \sum_{k=1}^{K} \sum_{c=1}^{C_{in}} W[k, c, j] \, X[t + k - \lceil K/2 \rceil, c],$$

where $K$ is the temporal kernel width and $C_{out}$ is the number of output channels.
Depthwise convolution factorizes the standard convolution by restricting each filter to a single input channel, producing an intermediate output $\hat{Y} \in \mathbb{R}^{T \times C_{in}}$:

$$\hat{Y}[t, c] = \sum_{k=1}^{K} W_d[k, c] \, X[t + k - \lceil K/2 \rceil, c],$$

where $W_d \in \mathbb{R}^{K \times C_{in}}$. The resulting feature map maintains the input channel count.
Depthwise separable convolution further appends a pointwise ($1 \times 1$) convolution for cross-channel mixing:

$$Y[t, j] = \sum_{c=1}^{C_{in}} W_p[c, j] \, \hat{Y}[t, c],$$

with $W_p \in \mathbb{R}^{C_{in} \times C_{out}}$.
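The two-stage decomposition above can be sketched in a few lines of NumPy. This is an illustrative implementation with zero ("same") padding; the function name and shapes are assumptions for exposition, not an implementation from the cited papers:

```python
import numpy as np

def depthwise_separable_conv1d(x, w_dw, w_pw):
    """Depthwise-separable 1D convolution (illustrative sketch).

    x:    (T, C_in)      input sequence
    w_dw: (K, C_in)      one K-tap filter per input channel
    w_pw: (C_in, C_out)  pointwise (1x1) channel-mixing weights
    Returns y: (T, C_out), zero-padded so the length T is preserved.
    """
    T, C_in = x.shape
    K = w_dw.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Depthwise stage: each channel is filtered independently.
    y_dw = np.zeros((T, C_in))
    for k in range(K):
        y_dw += w_dw[k] * xp[k:k + T]
    # Pointwise stage: linear mix across channels at each time step.
    return y_dw @ w_pw

x = np.random.randn(100, 40)
y = depthwise_separable_conv1d(x, np.random.randn(3, 40), np.random.randn(40, 45))
print(y.shape)  # (100, 45)
```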
This staged decomposition is central to parameter- and computation-efficient neural models, and can be further extended via "multi-branch" (multi-scale) strategies where several depthwise convolutions operate in parallel and are fused at inference (Hu et al., 2021, Li et al., 2020).
2. Parameterization, Computational Cost, and Efficiency
Parameter count and FLOPs metrics for each operation are as follows:
| Operation | Parameter Count | Multiply-Adds (FLOPs) |
|---|---|---|
| Standard 1D Conv | $K \cdot C_{in} \cdot C_{out}$ | $T \cdot K \cdot C_{in} \cdot C_{out}$ |
| Depthwise Conv | $K \cdot C_{in}$ | $T \cdot K \cdot C_{in}$ |
| Pointwise (1×1) Conv | $C_{in} \cdot C_{out}$ | $T \cdot C_{in} \cdot C_{out}$ |
| Depthwise-Separable Conv | $K \cdot C_{in} + C_{in} \cdot C_{out}$ | $T \cdot C_{in} \cdot (K + C_{out})$ |
Depthwise separable convolution thus achieves a reduction ratio:

$$\frac{K \cdot C_{in} + C_{in} \cdot C_{out}}{K \cdot C_{in} \cdot C_{out}} = \frac{1}{C_{out}} + \frac{1}{K},$$

which, for large $C_{out}$, provides a substantial parameter and computation reduction (Hu et al., 2021).
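Plugging in TENet-like sizes makes the reduction concrete. The choice of $K = 9$ and $C_{in} = C_{out} = 32$ below is illustrative, matching the channel and kernel sizes quoted for TENet12 later in this article:

```python
# Parameter counts for a standard vs. depthwise-separable 1D convolution,
# using K = 9 taps and C_in = C_out = 32 channels as an example.
K, C_in, C_out = 9, 32, 32

standard = K * C_in * C_out          # full kernel: 9 * 32 * 32 = 9216
separable = K * C_in + C_in * C_out  # depthwise + pointwise: 288 + 1024 = 1312
ratio = separable / standard         # equals 1/C_out + 1/K, about 0.14

print(standard, separable, round(ratio, 3))
```

So the separable form uses roughly one-seventh of the parameters (and, per time step, the same fraction of multiply-adds) of the standard convolution at these sizes.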
MTConv modules generalize this by performing parallel depthwise convolutions with different kernel sizes, then elementwise summing the normalized branch outputs. At inference, these are algebraically fused to yield a single equivalent depthwise convolution, so no runtime or memory penalty is incurred (Li et al., 2020).
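The fusion step can be verified numerically: because convolution is linear, summing the outputs of several depthwise branches equals one depthwise convolution whose kernel is the sum of the branch kernels, each zero-padded to the largest size. A minimal sketch follows (batch-normalization fusion is omitted for brevity, and the helper name is an assumption):

```python
import numpy as np

def dwconv1d(x, w):
    """Depthwise 1D conv with zero 'same' padding. x: (T, C), w: (K, C)."""
    T, K = x.shape[0], w.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return sum(w[k] * xp[k:k + T] for k in range(K))

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))
branches = [rng.standard_normal((k, 8)) for k in (3, 5, 7, 9)]

# Training-time view: run each branch and sum the outputs.
y_multi = sum(dwconv1d(x, w) for w in branches)

# Inference-time view: zero-pad every kernel to the largest size (9),
# sum the weights, and run a single equivalent depthwise convolution.
K_max = 9
w_fused = np.zeros((K_max, 8))
for w in branches:
    off = (K_max - w.shape[0]) // 2   # center-align each smaller kernel
    w_fused[off:off + w.shape[0]] += w
y_fused = dwconv1d(x, w_fused)

print(np.allclose(y_multi, y_fused))  # True
```

The center alignment works because all branch kernels have odd sizes, so their "same"-padded outputs line up sample-for-sample.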
3. Architectural Instances: ST-AttNet and TENet
ST-AttNet (Hu et al., 2021) exemplifies the use of separable temporal convolution in small-footprint models. Its core is a stack of residual "SeparableConv" blocks, each performing a depthwise 1D convolution (kernel size 3, with dilation schedule 1, 2, 4, 8 across blocks), followed by batch normalization, ReLU, and a pointwise convolution. Downstream, a temporally pooled multi-head attention module summarizes temporal context before dense classification. Empirically, ST-AttNet4-wide (48K parameters) matches the 96.6% accuracy of the larger 305K-parameter TC-ResNet14-1.5 model while requiring only about 60% of its multiplies.
TENet (Li et al., 2020) employs a stack of "Inverted Bottleneck Blocks" (IBBs), each featuring pointwise expansion, depthwise temporal convolution, and pointwise projection. TENet12, for instance, uses 12 IBB stages with 32 channels and 9-tap depthwise kernels, achieving an approximate parameter budget of 100K and sub-3M FLOPs for state-of-the-art accuracy.
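The IBB structure described above can be sketched as follows. This is a simplified NumPy illustration; the expansion width, ReLU placement, and omission of batch normalization are assumptions for exposition rather than TENet's exact recipe:

```python
import numpy as np

def inverted_bottleneck(x, w_exp, w_dw, w_proj):
    """Sketch of an Inverted Bottleneck Block: pointwise expansion,
    depthwise temporal conv, pointwise projection, residual connection.

    x:      (T, C)   input           w_exp:  (C, E)  expansion weights
    w_dw:   (K, E)   depthwise taps  w_proj: (E, C)  projection weights
    """
    h = np.maximum(x @ w_exp, 0.0)            # pointwise expansion + ReLU
    T, K = h.shape[0], w_dw.shape[0]
    pad = K // 2
    hp = np.pad(h, ((pad, pad), (0, 0)))
    h = sum(w_dw[k] * hp[k:k + T] for k in range(K))
    h = np.maximum(h, 0.0)                    # depthwise conv + ReLU
    return x + h @ w_proj                     # projection + residual

rng = np.random.default_rng(1)
x = rng.standard_normal((98, 32))
y = inverted_bottleneck(x,
                        rng.standard_normal((32, 96)) * 0.1,
                        rng.standard_normal((9, 96)) * 0.1,   # 9-tap depthwise
                        rng.standard_normal((96, 32)) * 0.1)
print(y.shape)  # (98, 32)
```

The 32-channel width and 9-tap depthwise kernel mirror the TENet12 configuration quoted above; the 3× expansion width is an arbitrary illustrative choice.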
MTConv: During TENet training, MTConv replaces plain depthwise layers with multi-scale, multi-branch depthwise convolutions (parallel branches with different receptive fields, e.g., kernel sizes 3, 5, 7, 9). The branch outputs are summed during training and the branches are fused into a single kernel afterward, retaining efficiency at test time.
Representative layerwise breakdown for ST-AttNet4 (Hu et al., 2021):
| Layer | Kernel | In → Out Ch | Dilation | #Params | #Multiplies |
|---|---|---|---|---|---|
| Initial sep-conv | 3 × 1 | 40 → 45 | 1 | 1,920 | |
| ResBlock ×4 (2 sep-convs) | 3 × 1 | 45 → 45 | 1,2,4,8 | 17,280 | |
| Avg-Attn (5-head) | – | 45 → 45 | – | ~4,300 | ~207,000 |
| Dense + Softmax | 1 × 1 | 45 → 12 | – | 540 | 540 |
| Total | – | – | – | ~24,000 | ~2,000,000 |
4. Multi-Scale and Residual Extensions
Multi-scale temporal convolution (MTConv) enriches the feature space by exposing the network to varying temporal granularities. Each branch processes the input using a different kernel size, and the summed outputs ensure the subsequent layer receives information aggregated over both short-term and long-term contexts (Li et al., 2020). Crucially, due to the linearity of convolution and batch normalization, these can be collapsed into a single filter at inference, providing enhanced model capacity during training without runtime penalty.
Residual connections are systematically employed in both ST-AttNet and TENet to stabilize optimization and preserve representational fidelity. Dilation, increasing exponentially across successive blocks, enlarges the temporal receptive field exponentially without adding parameters or computation (Hu et al., 2021).
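The receptive-field growth from the dilation schedule can be worked out directly: each dilated layer with kernel size $K$ and dilation $d$ extends the receptive field by $(K-1) \cdot d$ time steps. Counting one depthwise convolution per dilation stage (a simplification, since each residual block actually contains two separable convolutions) gives:

```python
# Receptive field of a stack of kernel-size-3 convolutions
# with the exponential dilation schedule 1, 2, 4, 8.
K = 3
rf = 1
for d in (1, 2, 4, 8):
    rf += (K - 1) * d   # each layer adds (K - 1) * d time steps

print(rf)  # 31
```

Four layers with total parameter cost proportional to $4K$ thus cover 31 time steps, versus the 9 steps that four undilated layers of the same size would reach.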
5. Comparative Results and Empirical Benchmarks
Empirical evaluation on the Google Speech Commands V1 dataset yields:
| Model | Accuracy | Params | Multiplies |
|---|---|---|---|
| TC-ResNet14-1.5 | 96.6% | 305K | 6.7M |
| ST-AttNet4 | 96.3% | 24K | 2.0M |
| ST-AttNet4-wide | 96.6% | 48K | 4.1M |
| ST-AttNet7 | 96.5% | 37K | 3.3M |
| TENet (MTConv, B=4) | 96.8% | 100K | <3M |
Temporally pooled attention further increases accuracy by 0.9% absolute over average-pooling alone in ST-AttNet (Hu et al., 2021). TENet with MTConv matches the state-of-the-art with minimal parameter and computation overhead, as the multi-branch convolution is fused at inference (Li et al., 2020).
6. Principles, Benefits, and Trade-Offs
The principal advantage of depthwise separable decomposition is the decoupling of temporal/spatial filtering (depthwise) from channel mixing (pointwise), enabling drastic reductions in parameter count and FLOPs—a critical property for edge and embedded systems. Temporal convolutions excel at modeling sequential correlations, while multi-scale mechanisms enable architectures to capture dependencies across diverse timescales.
A potential limitation is that pure depthwise convolution may inadequately model cross-channel interactions unless followed by sufficiently expressive pointwise projections or multi-head attention. Increasing depth or width improves accuracy, but at the cost of a roughly linear increase in multiplications and parameter budget. Methods such as MTConv mitigate this trade-off by enriching representations at training time while reverting to the standard, efficient model at inference.
A plausible implication is that such architectures establish a general template for resource-efficient deep sequence modeling beyond speech—wherever compactness and rapid inference are required.
7. Extensions and Related Directions
The described techniques—temporal convolution, depthwise separable operations, residual connections, dilations, and multi-scale architectures—constitute an extensible methodology for lightweight yet accurate neural models. Temporally pooled multi-head attention can be further generalized for aggregation in other domains. The algebraic fusion of training-time multi-branch modules into single inference kernels leverages the linear structure of convolutional and normalization layers, and may find utility in the broader context of train-time/inference-time model transformations.
Continued research explores trade-offs between kernel sizes, number of depthwise and pointwise channels, attention head counts, and integration strategies for additional temporal modeling mechanisms (Hu et al., 2021, Li et al., 2020).