
Siamese Capsule Networks

Updated 20 January 2026
  • Siamese Capsule Networks are twin-capsule architectures that generate pose-aware embeddings via dynamic routing for accurate pairwise similarity estimation.
  • They employ both margin-based and NT-Xent contrastive losses to optimize metric learning, ensuring stable performance in few-shot and unsupervised setups.
  • Empirical results show SCNs achieve competitive accuracy with fewer parameters and lower FLOPs compared to traditional CNN approaches, particularly in face verification and unsupervised tasks.

Siamese Capsule Networks are neural architectures that integrate capsule networks—characterized by vector- or matrix-based representations with dynamic routing mechanisms—into a Siamese (twin-branch, weight-shared) configuration specifically designed for pairwise learning tasks. These models extend the part-whole equivariance, pose-awareness, and dynamic routing of capsules to scenarios requiring similarity estimation between pairs of inputs such as verification, contrastive representation learning, and few-shot classification. The term encompasses both supervised and unsupervised methodologies and has been explored in both vision and speech domains (Neill, 2018, Panwar et al., 2022, Hajavi et al., 2020).

1. Architectural Foundations

Siamese Capsule Networks (SCNs) combine two identical capsule subnetworks ("arms"), each accepting one member of an input pair. The SCN introduced in (Neill, 2018) extends the standard dynamic routing Capsule Network (Sabour et al., 2017) into a pairwise setup by utilizing two parameter-tied pathways that each transform an input (e.g., a 100×100 face image) into a compact, pose-aware embedding. Layer by layer, the architecture is as follows:

  • Convolutional Front-End: Each arm begins with a convolutional layer (e.g., 256 filters, 9×9, stride 3, batch-norm, and ReLU), producing high-dimensional activations (e.g., 256 × 31 × 31 for 100×100 input).
  • Primary Capsules: These are constructed by applying banks of parallel convolutions to the Conv1 output, resulting in, for example, 32 capsule types each generating 8-dimensional pose vectors.
  • “Face” or “Class” Capsules: Outputs from all primary capsules are routed to a set of higher-level capsules representing subjects, classes, or parts, using dynamic routing by agreement. The face capsule output typically employs 8- or 16-dimensional pose vectors.
  • Embedding Head: Pose vectors from the final capsules are concatenated and passed through a fully connected layer (e.g., 20 units with tanh), yielding the final normalized embedding.

The two arms share all weights to ensure symmetric representation. Other SCN variants, such as the contrastive capsule (CoCa) model (Panwar et al., 2022), use a similar Siamese layout with a deep ConvBlock, convolutional PrimaryCaps (e.g., 32 channels of 16-D capsules), and a final ClassCaps layer.
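The layer dimensions quoted above can be sanity-checked with the standard valid-convolution output-size formula; the helper below is illustrative, not code from either paper:

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution with the given kernel, stride, and padding."""
    return (size + 2 * padding - kernel) // stride + 1

# Conv front-end from the text: 9x9 filters, stride 3, on a 100x100 face image.
h = conv_out(100, kernel=9, stride=3)
print(h)  # 31, matching the 256 x 31 x 31 activation volume quoted above
```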

2. Dynamic Routing Mechanisms

A central property of capsule networks is "routing by agreement," in which outputs of lower-level capsules are dynamically assigned to higher-level capsules based on agreement of their predictions. In SCNs, this dynamic routing is carried out independently in each arm. The routing algorithm is typically iterative, involving the update of coupling coefficients $c_{mn}$ via softmax, computation of class capsule inputs via transformation matrices $W_{mn}$, squashing non-linearities, and agreement-based updates for the log-priors $b_{mn}$.

Typical routing iterations ($R=4$ or $R=6$ in (Neill, 2018); $\epsilon=3$ in (Panwar et al., 2022)) ensure stable part-to-whole assignments. The squash nonlinearity,

$$\mathrm{squash}(s_n) = \frac{\|s_n\|^2}{1 + \|s_n\|^2}\,\frac{s_n}{\|s_n\|},$$

ensures that a capsule's output vector is short when the corresponding entity is absent and approaches unit length when it is present, while preserving its direction.
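The squash nonlinearity is straightforward to implement; a minimal NumPy sketch (the `eps` guard against division by zero is an added numerical-stability assumption):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash nonlinearity: shrinks short capsule vectors toward zero and
    pushes long ones toward unit length, preserving direction."""
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    return (norm**2 / (1.0 + norm**2)) * s / (norm + eps)

v = squash(np.array([3.0, 4.0]))        # input norm 5
print(np.linalg.norm(v))                # 25/26 ~ 0.9615: long input -> near unit length
```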

No cross-branch routing takes place; each branch uses local dynamic routing and gradients flow through these routing operations.
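The per-arm routing loop described above can be sketched in NumPy as follows; the tensor shapes and the three-iteration default are illustrative assumptions, not values taken from either paper:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, eps=1e-8):
    n = np.linalg.norm(s, axis=-1, keepdims=True)
    return (n**2 / (1.0 + n**2)) * s / (n + eps)

def route(u_hat, iters=3):
    """Dynamic routing by agreement.
    u_hat: (M, N, D) predictions from M lower capsules for N upper capsules
           (i.e., u_hat[m, n] = W_mn applied to lower-capsule output m).
    Returns (N, D) upper-capsule output vectors."""
    M, N, _ = u_hat.shape
    b = np.zeros((M, N))                    # routing log-priors b_mn
    for _ in range(iters):
        c = softmax(b, axis=1)              # coupling coefficients c_mn
        s = (c[..., None] * u_hat).sum(0)   # weighted input per upper capsule
        v = squash(s)                       # squashed outputs v_n
        b = b + (u_hat * v[None]).sum(-1)   # agreement update: <u_hat_mn, v_n>
    return v

rng = np.random.default_rng(0)
v = route(rng.normal(size=(32, 10, 8)), iters=3)
print(v.shape)   # (10, 8); each output vector has norm below 1
```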

3. Contrastive Learning and Loss Formulations

Siamese Capsule Networks employ metric learning objectives to train the representations.

  • Supervised Contrastive Loss: In (Neill, 2018), the SCN is trained with a margin-based contrastive loss on $\ell_2$-normalized embeddings: $L(\omega) = \sum_{i=1}^m \left[ y^{(i)}\,D_i^2 + (1 - y^{(i)})\,\max\{0,\,m - D_i\}^2 \right]$, where $D_i$ is the Euclidean distance between the normalized embeddings and $y^{(i)}$ indicates match (1) or non-match (0). For datasets with high intra-class variation, a double-margin variant is used.
  • Unsupervised Contrastive Loss: The CoCa model (Panwar et al., 2022) introduces an NT-Xent (Normalized Temperature-Scaled Cross-Entropy) loss suitable for unsupervised contrastive learning: $\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k\neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$, with positives comprising augmented views of the same instance, and negatives all other examples in the batch.

In both regimes, $\ell_2$ normalization of the embedding manifold stabilizes training and ensures meaningful metric distances (Neill, 2018).
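Both objectives can be sketched in NumPy; the variable names and the batch layout for NT-Xent (rows i and i+N holding the two augmented views of instance i) are illustrative assumptions:

```python
import numpy as np

def contrastive_loss(za, zb, y, margin=1.0):
    """Margin-based contrastive loss on L2-normalized embedding pairs.
    y[i] = 1 for matching pairs, 0 for non-matching pairs."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    d = np.linalg.norm(za - zb, axis=1)               # Euclidean distance D_i
    return np.sum(y * d**2 + (1 - y) * np.maximum(0.0, margin - d)**2)

def nt_xent(z, tau=0.5):
    """NT-Xent loss for a batch of 2N embeddings where rows i and i+N
    are the two augmented views of the same instance."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                               # cosine similarities / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z) // 2
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # each row's positive
    log_prob = sim - np.logaddexp.reduce(sim, axis=1, keepdims=True)  # log-softmax
    return -log_prob[np.arange(2 * n), pos].mean()
```

Perfectly matching pairs (`y = 1`, identical embeddings) yield zero contrastive loss; distinct non-matching pairs are penalized until they exceed the margin.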

4. Training Protocols and Regularization

Training protocols in SCN literature prioritize regularization and robustness:

  • Optimizers: SCN uses AMSGrad (an Adam variant; Reddi et al., 2018) (Neill, 2018), while CoCa adopts Adam with specified weight decay and learning rate (Panwar et al., 2022).
  • Regularization: Dropout (e.g., p=0.2p=0.2) is applied throughout convolutional and capsule layers. The final capsule layer in SCN leverages Concrete Dropout, a learned, continuous approximation to Bernoulli dropout that identifies informative pose dimensions.
  • Data Handling: SCN experiments eschew data augmentation beyond normalization and resizing; CoCa employs extensive data augmentations to support contrastive learning.
  • Batch Size and Epochs: SCN operates with minibatches of 32 image pairs for 100 epochs; CoCa utilizes large batches (size 512, each yielding two augmentations per image) for up to 500 epochs.

No reconstruction-decoder or auxiliary losses are used; learning is driven solely by the contrastive objectives.
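Concrete Dropout replaces hard Bernoulli drops with a differentiable relaxation drawn from the Concrete distribution, so the dropout rate can be learned by gradient descent. A hedged NumPy sketch with a fixed (not learned) rate `p` and an illustrative temperature:

```python
import numpy as np

def concrete_dropout_mask(shape, p=0.2, temp=0.1, rng=None):
    """Continuous relaxation of a Bernoulli dropout mask (Concrete distribution).
    Returns keep-mask values in (0, 1); as temp -> 0 the mask approaches hard
    {0, 1} drops. In Concrete Dropout proper, p is a learned parameter."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-7, 1 - 1e-7, size=shape)       # uniform noise
    logit = (np.log(p) - np.log(1 - p)
             + np.log(u) - np.log(1 - u)) / temp      # relaxed Bernoulli logit
    drop = 1.0 / (1.0 + np.exp(-logit))               # sigmoid -> soft drop indicator
    return 1.0 - drop                                  # soft keep mask

mask = concrete_dropout_mask((4, 8), p=0.2, temp=0.1,
                             rng=np.random.default_rng(0))
print(mask.shape)   # (4, 8), entries in (0, 1), mostly near 1 for p = 0.2
```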

5. Empirical Results and Comparative Analysis

Siamese Capsule Networks have demonstrated competitive or superior performance relative to convolutional Siamese baselines, particularly in few-shot and low-resource regimes.

Supervised Face Verification (Neill, 2018):

| Dataset | Test Loss (SCN) | Acc. (SCN) | Best Baseline | Baseline Acc. |
|---|---|---|---|---|
| AT&T | 0.019 | 91.8% | ResNet-34 | 82.3% |
| LFW | 0.013 | 95.2% | AlexNet | 96.0% |

SCN achieves lower test loss and higher accuracy than the convolutional baselines on the low-sample AT&T set, indicating improved generalization in few-shot contexts. On LFW, SCN slightly trails AlexNet in accuracy but uses considerably fewer parameters.

Unsupervised CIFAR-10 (CoCa, (Panwar et al., 2022)):

| Model | Top-1 (%) | Params | FLOPs |
|---|---|---|---|
| Baseline CapsNet | 68.98 | 7.9 M | |
| SimCLR | 93 | 24.62 M | 1.31 G |
| CoCa (SCN) | 70.50 | 0.78 M | 18.34 M |

CoCa outperforms supervised baseline CapsNet by 1.5 points in top-1 accuracy with roughly 10× fewer parameters. It also approaches SimCLR top-5 performance (98.1% vs. 99%) despite using 31× fewer parameters and 71× fewer FLOPs.

SCNs converge faster in few-shot regimes and reach lower contrastive loss values, attributed to the pose-aware, part-whole representations inherent to capsules.

6. Strengths, Limitations, and Applications

Key strengths of SCNs include:

  • Robust Few-Shot Performance: Strong results in scenarios with few examples per class or zero-shot verification (Neill, 2018).
  • Parameter Efficiency: High accuracy with significantly smaller model sizes and lower FLOPs relative to deep CNNs (Panwar et al., 2022).
  • Pose/Equivariance Preservation: Capsule embeddings, through dynamic routing, retain part-whole and geometric relationships in an unsupervised setting.

Noted limitations:

  • Computational Overhead: Routing algorithms are quadratic in capsule number and slower per batch than conventional CNNs, limiting scalability.
  • Training Stability: Margin selection, number of routing iterations, and temperature hyperparameters critically affect empirical performance.
  • Unrealized Scaling: Extensions to larger-scale classification, detection, or segmentation tasks remain unproven.

Applications demonstrated include face verification (controlled/uncontrolled), unsupervised visual representation learning, and speaker verification “in the wild.” The architecture’s efficacy is established for metric learning, especially under limited supervision or few-shot constraints (Neill, 2018, Panwar et al., 2022, Hajavi et al., 2020).

7. Future Directions

Potential avenues for advancing Siamese Capsule Networks include:

  • Efficient Routing Mechanisms: Development of scalable or approximate routing strategies could ameliorate computational bottlenecks.
  • Data Augmentation and Pretraining: Incorporation of sophisticated augmentations and reconstruction-based unsupervised pretraining may further improve robustness and generalization.
  • Hierarchical and Deep Capsule Architectures: Exploration into deeper, multi-stage capsule structures to capture richer part-whole hierarchies is encouraged.
  • Transfer and Domain Adaptation: Applying SCNs to broader domains, including large-scale object recognition and cross-modal pairing, remains an open research direction.

Siamese Capsule Networks thus provide a principled approach to pairwise learning, unifying the merits of capsule pose-awareness with the metric-learning strengths of Siamese architectures (Neill, 2018, Panwar et al., 2022, Hajavi et al., 2020).
