
SPD-Conv: Riemannian and CNN Downsampling

Updated 12 February 2026
  • SPD-Conv offers two methodologies: a Riemannian convolution for SPD matrices that preserves local geometric structure and a CNN space-to-depth block for lossless downsampling.
  • The Riemannian approach leverages manifold-aware operations, including BiMap, ReEig, and submanifold extraction, to improve feature extraction and accelerate convergence.
  • The CNN approach replaces strided convolutions with a space-to-depth rearrangement, yielding enhanced performance in small object detection and low-resolution image tasks.

SPD-Conv refers to two distinct methodologies in contemporary deep learning: (1) a manifold-aware convolutional mechanism for symmetric positive definite (SPD) matrices, designed to preserve local Riemannian geometry in SPD neural networks; and (2) a general-purpose building block for convolutional neural networks (CNNs) that replaces strided convolutions or pooling with a space-to-depth transformation followed by non-strided convolution, enabling exact spatial downsampling with no information loss. Both frameworks are motivated by the need to enhance the extraction and utilization of local structure in neural architectures, but are applied in fundamentally different mathematical and practical contexts.

1. SPD-Conv on Riemannian SPD Manifolds

The first interpretation of SPD-Conv originates from "Riemannian Local Mechanism for SPD Neural Networks" (Chen et al., 2022). In this context, the goal is to process data that naturally resides on the manifold of symmetric positive definite matrices Sym⁺(d), equipped with the affine-invariant Riemannian metric (AIM). The conventional Euclidean notion of convolution for extracting local features is generalized via category theory and principal submanifold selection, enabling locality-aware processing on SPD-valued representations.

1.1 Geometry of Symmetric Positive Definite Matrices

Let Sym⁺(d) = {S ∈ ℝ^{d×d} : S = Sᵀ, vᵀSv > 0 for all v ≠ 0} denote the cone of d×d SPD matrices. The canonical Riemannian metric, the affine-invariant metric, is defined for tangent vectors V, W ∈ T_S Sym⁺(d) ≅ Sym(d) as

g_S(V, W) = Tr(S⁻¹ V S⁻¹ W)

which is invariant under congruence transformations S ↦ ASAᵀ for invertible A. The geodesic distance between S₁, S₂ ∈ Sym⁺(d) is

d_A(S₁, S₂) = ‖log(S₁^{-1/2} S₂ S₁^{-1/2})‖_F

with Riemannian exponential and logarithm maps at SS constructed accordingly.
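The affine-invariant distance above can be computed directly with NumPy/SciPy. The sketch below is illustrative (the matrix construction and function names are not from the paper's code) and numerically checks the congruence invariance just stated:

```python
# Sketch: affine-invariant geodesic distance on Sym^+(d).
# All names and the random SPD construction are illustrative assumptions.
import numpy as np
from scipy.linalg import logm, fractional_matrix_power

def make_spd(d, seed):
    """Random SPD matrix A A^T + d*I (strictly positive definite)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((d, d))
    return a @ a.T + d * np.eye(d)

def affine_invariant_distance(s1, s2):
    """d_A(S1, S2) = || log(S1^{-1/2} S2 S1^{-1/2}) ||_F."""
    s1_inv_sqrt = fractional_matrix_power(s1, -0.5)
    middle = s1_inv_sqrt @ s2 @ s1_inv_sqrt
    return np.linalg.norm(np.real(logm(middle)), "fro")

s1, s2 = make_spd(4, 0), make_spd(4, 1)
d1 = affine_invariant_distance(s1, s2)

# Congruence invariance: d(S1, S2) = d(G S1 G^T, G S2 G^T) for invertible G.
g = np.random.default_rng(2).standard_normal((4, 4)) + 4 * np.eye(4)
d2 = affine_invariant_distance(g @ s1 @ g.T, g @ s2 @ g.T)
assert np.isclose(d1, d2, rtol=1e-6)
```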

1.2 Generalization of Convolution via Category Theory

In Euclidean deep learning, convolution aggregates information from local k×k neighborhoods using linear maps. Category-theoretic abstraction replaces subspaces by sub-objects (here, submanifolds), linear maps by smooth morphisms, and aggregation by category-dependent sums. On Sym⁺(d), valid submanifold extraction is realized only by principal submatrices, ensuring the SPD property is preserved. For k < d, Sym⁺(k) embeds as a regular submanifold of Sym⁺(d) via principal submatrix selection.
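The key closure property, that a principal submatrix of an SPD matrix is again SPD, is easy to verify numerically. A minimal sketch (matrix and index set are arbitrary illustrative choices):

```python
# Sketch: principal-submatrix extraction keeps the SPD property, which is
# why it is the valid "submanifold selection" on Sym^+(d).
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
s = a @ a.T + 6 * np.eye(6)          # S in Sym^+(6)

idx = np.array([0, 2, 5])            # index set I selecting a 3x3 principal submatrix
p = s[np.ix_(idx, idx)]              # P = S[I, I], claimed to lie in Sym^+(3)

assert np.allclose(p, p.T)                  # symmetric
assert np.all(np.linalg.eigvalsh(p) > 0)    # strictly positive eigenvalues
```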

1.3 Multi-Scale Submanifold Block (SPD-Conv Implementation)

The Multi-Scale Submanifold Network (MSNet) implements this SPD-Conv through the following sequence:

  1. BiMap transformation: the holistic SPD feature S_{k−1} ∈ Sym⁺(d_{k−1}) is projected via a learnable semi-orthogonal matrix W_k^i:

S̃_k^i = W_k^i S_{k−1} (W_k^i)ᵀ

  2. ReEig layer: eigenvalue rectification, with eigendecomposition S̃_k^i = U Σ Uᵀ and threshold ε > 0:

S_k^i = U max(εI, Σ) Uᵀ

  3. Principal submanifold extraction: for patch size k_i and index set I_{i,j},

P_k^{i,j} = S_k^i[I_{i,j}, I_{i,j}] ∈ Sym⁺(k_i)

  4. Mapping to Euclidean space: each patch is logarithm-mapped, M_k^{i,j} = log(P_k^{i,j}), and vectorized:

v_k^{i,j} = vec(tril(M_k^{i,j})) ∈ ℝ^{k_i(k_i+1)/2}

  5. Concatenation and classification: vectors are concatenated within and across branches, followed by fully connected layers and softmax.

Backward propagation exploits the chain rule through all operations, including eigenvalue gradient formulas and retraction to the Stiefel manifold for BiMap layers.
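The forward sequence above can be sketched in NumPy/SciPy. Shapes, the ε threshold, and the sliding patch indices are illustrative assumptions, not the paper's configuration; see the authors' repository for the actual implementation:

```python
# Sketch of one MSNet-style forward pass: BiMap -> ReEig -> principal
# submanifold patches -> matrix log -> vectorize -> concatenate.
import numpy as np
from scipy.linalg import logm

def bimap(s, w):
    """BiMap: bilinear projection W S W^T with semi-orthogonal W."""
    return w @ s @ w.T

def reeig(s, eps=1e-4):
    """ReEig: clamp eigenvalues of a symmetric matrix at eps."""
    lam, u = np.linalg.eigh(s)
    return u @ np.diag(np.maximum(lam, eps)) @ u.T

def log_vec(p):
    """Matrix logarithm of an SPD patch, then lower-triangle vectorization."""
    m = np.real(logm(p))
    return m[np.tril_indices_from(m)]

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8))
s0 = a @ a.T + 8 * np.eye(8)                        # input in Sym^+(8)

w = np.linalg.qr(rng.standard_normal((8, 5)))[0].T  # 5x8, orthonormal rows
s1 = reeig(bimap(s0, w))                            # BiMap + ReEig -> Sym^+(5)

# Three overlapping 3x3 principal submanifold patches (illustrative indexing).
patches = [s1[np.ix_(np.arange(i, i + 3), np.arange(i, i + 3))] for i in range(3)]
features = np.concatenate([log_vec(p) for p in patches])  # 3 patches x 6 entries
assert features.shape == (18,)
```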

2. Space-to-Depth Convolution (SPD-Conv) for CNN Downsampling

A distinct usage of the SPD-Conv term refers to "No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects" (Sunkara et al., 2022). Here, SPD-Conv is designed as a generic CNN component for information-preserving downsampling in Euclidean architectures.

2.1 Motivation and Design Principle

Strided convolution and pooling reduce feature map resolution but discard spatial detail, critically impairing performance on low-resolution inputs or tasks requiring small-object localization. SPD-Conv rectifies this by rearranging the feature tensor spatially to depth, preserving all original values while reducing spatial dimensions. Subsequent convolution learns to fuse the resultant channels into meaningful representations, enabling lossless downsampling.

2.2 Formal Definition

Given input X ∈ ℝ^{H×W×C} and block size r, the space-to-depth (SPD) operation produces X′ ∈ ℝ^{(H/r)×(W/r)×Cr²} from the sub-feature maps

f_{i,j}(u, v, c) = X(ru + i, rv + j, c)

for all i, j ∈ {0, …, r−1}, 0 ≤ u < H/r, 0 ≤ v < W/r, 0 ≤ c < C, concatenated along the channel axis. The SPD-Conv block is thus

SPD-Conv(X) = Conv_{k×k, s=1, p=⌊k/2⌋}(SPD(X; r))
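The space-to-depth map is a pure index permutation, which is what makes the downsampling lossless. A minimal NumPy sketch (the particular channel ordering after rearrangement is an implementation choice, not fixed by the definition):

```python
# Sketch of f_{i,j}(u,v,c) = X(ru+i, rv+j, c) via reshape/transpose,
# with a check that the output is a lossless permutation of the input.
import numpy as np

def space_to_depth(x, r):
    """(H, W, C) -> (H/r, W/r, C*r^2); every input value is preserved."""
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(h // r, r, w // r, r, c)        # split H and W into r-blocks
    x = x.transpose(0, 2, 1, 3, 4)                # (H/r, W/r, r, r, C)
    return x.reshape(h // r, w // r, c * r * r)   # stack blocks along channels

x = np.arange(4 * 6 * 3, dtype=np.float32).reshape(4, 6, 3)
y = space_to_depth(x, 2)
assert y.shape == (2, 3, 12)

# Lossless: the output is a permutation of the input values.
assert np.array_equal(np.sort(y.ravel()), np.sort(x.ravel()))

# One entry against the definition; this reshape places slice (i, j),
# channel c at flat channel index (i*r + j)*C + c.
u, v, i, j, c = 1, 2, 1, 0, 2
assert y[u, v, (i * 2 + j) * 3 + c] == x[2 * u + i, 2 * v + j, c]
```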

2.3 Integration into Architectures

In object detection (YOLOv5) and classification (ResNet), all strided convolution and pooling layers are replaced by SPD-Conv blocks with r matched to the intended downsampling factor. Residual paths are adjusted to the altered channel counts using 1×1 convolutions. All activation, normalization, and backbone scaling logic remains unchanged. No custom CUDA is necessary, as the SPD rearrangement is achieved via tensor reshaping and permutation.

3. Experimental Results and Ablation Studies

3.1 SPD-Conv on Riemannian Manifolds (MSNet)

MSNet was evaluated against both shallow and deep baselines on Cambridge-Gesture, UCF-sub, and FPHA. On FPHA, MSNet-MS achieved 87.13% top-1 accuracy (roughly 1.5 points above SPDNet). On Cambridge-Gesture, MSNet-MS reached 91.25% (vs. SPDNet's 89.03%), and on UCF-sub, 60.87% (vs. SPDNet's 59.93%). Ablation demonstrates that combining global (holistic) and local (submanifold) features yields the highest accuracy, with redundancy observed if all submanifolds are used indiscriminately. Convergence is faster and underfitting is mitigated relative to classical SPDNet, at approximately 1.5× training runtime per epoch (Chen et al., 2022).

3.2 SPD-Conv for CNN Downsampling

Empirical validation on MS-COCO showed YOLOv5-SPD variants outperforming their strided-conv baselines, especially for small object detection (AP_S). For YOLOv5-n, AP_S increased from 14.1 to 16.0 (13.2% relative); for YOLOv5-s, from 21.1 to 23.5 (11.4% relative). Accuracy improvements on Tiny ImageNet (+2.84% for ResNet18-SPD) and CIFAR-10 (+1.09% for ResNet50-SPD) were also reported. Ablation indicates the SPD rearrangement alone accounts for partial gains, but the full SPD-Conv yields the largest improvements (Sunkara et al., 2022).

Model          Dataset         mAP / Top-1 (%)   AP_S (Δ relative)   Latency (ms)
YOLOv5-n       COCO            28.0              14.1                6.3
YOLOv5-SPD-n   COCO            31.0              16.0 (+13.2%)       7.3
YOLOv5-s       COCO            37.4              21.1                6.4
YOLOv5-SPD-s   COCO            40.0              23.5 (+11.4%)       7.3
ResNet18       Tiny ImageNet   61.68             —                   —
ResNet18-SPD   Tiny ImageNet   64.52 (+2.84)     —                   —

SPD-Conv blocks entail a parameter and computation increase of 10–20% per replacement, with an overall runtime overhead typically less than 10% (Sunkara et al., 2022).

4. Algorithmic and Implementation Details

4.1 SPD-Conv on SPD Manifolds

  • Forward Pass: Alternates between BiMap (semi-orthogonal projection), ReEig (eigenvalue thresholding), principal submanifold extraction, Riemannian logarithm, vectorization, and concatenation.
  • Backward Pass: Employs chain rule gradients through submanifold and spectral operations, with special treatment for bi-linear maps on Stiefel and Riemannian log/exp.
  • Patching Strategy: Patch size, stride, and the choice between single or multi-scale branches are user-configurable for sensitivity to different local geometries.

4.2 SPD-Conv for CNNs

  • Hyperparameters: block size r, convolution kernel size k (typically 3×3), output channel count C′, activation function (SiLU, ReLU), optimization schedule (e.g., SGD with a cosine learning-rate decay), and data augmentation choices.
  • Integration: replace every strided downsampling operation by SPD(r) → Conv_{k×k}, adjusting residual paths as needed.
  • Implementation: Achieved via tensor reshape→permute→reshape operations in standard deep learning frameworks (PyTorch/TensorFlow), requiring no nonstandard CUDA kernels.
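A minimal PyTorch sketch of such a block, using the built-in PixelUnshuffle layer for the space-to-depth rearrangement followed by a stride-1 convolution. Channel counts and kernel size here are illustrative choices, not values prescribed by the paper:

```python
# Sketch of an SPD-Conv block: space-to-depth (PixelUnshuffle), then a
# non-strided convolution. Shapes below are illustrative assumptions.
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    def __init__(self, in_ch, out_ch, r=2, k=3):
        super().__init__()
        # (N, C, H, W) -> (N, C*r^2, H/r, W/r), a lossless rearrangement
        self.unshuffle = nn.PixelUnshuffle(r)
        # stride-1 conv fuses the stacked channels; padding keeps H/r, W/r
        self.conv = nn.Conv2d(in_ch * r * r, out_ch,
                              kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        return self.conv(self.unshuffle(x))

x = torch.randn(1, 64, 32, 32)
block = SPDConv(64, 128, r=2)
y = block(x)
assert y.shape == (1, 128, 16, 16)   # downsampled 2x, like a strided conv
```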

5. Relation to Broader Research and Practical Use

Both SPD-Conv frameworks are centered on locality: the Riemannian version extracts geometric substructure on SPD-valued data, while the Euclidean/space-to-depth approach prevents information loss in downsampling. While the former is tailored to covariance and manifold-valued representations found in action recognition, hand gesture datasets, or neuroscience, the latter generalizes across mainstream CNN architectures in vision, requiring only minimal changes and yielding empirical advantages particularly in low-data, low-resolution, or small-object scenarios.

The codebases for both approaches are open-source, enabling study and deployment in new domains. For the Riemannian SPD-Conv, see https://github.com/GitZH-Chen/MSNet.git (Chen et al., 2022); for the space-to-depth SPD-Conv, see https://github.com/LabSAINT/SPD-Conv (Sunkara et al., 2022).

6. Limitations and Ablation

For Riemannian SPD-Conv, including all submanifolds may introduce redundancies; optimal results are achieved by combining global (holistic) and select local (submanifold) information. In CNN SPD-Conv, added parameters and MACs per SPD-Conv block incur a modest computational penalty, but accuracy and detection gains outweigh this in target regimes. On large, high-resolution images, conventional downsampling is less harmful and SPD-Conv’s advantage is diminished—a plausible implication is to selectively enable SPD-Conv only for task-critical stages.

Potential confusion arises from homonymy: "SPD-Conv" can refer to either a Riemannian convolution mechanism for SPD matrices or an information-preserving Euclidean block based on space-to-depth rearrangement. The mathematical underpinnings and deployment targets of each must be distinguished in research and applications.
