Structure-Aware Language-Image Pretraining (SLIP)
- The paper introduces SLIP, which enhances the CLIP framework by integrating a graph-aware module with GNNs and a structural contrastive loss to capture instance-level relational information.
- It employs a dual-encoder architecture with a Graph Attention Network that fuses image and text embeddings, resulting in improved performance in both zero-shot and few-shot learning settings.
- Empirical evaluations demonstrate significant improvements in cross-modal retrieval metrics such as MRR and Recall@1, validating the effectiveness of incorporating structural context.
Structure-aware Language-Image Pretraining (SLIP) is a vision-language pretraining (VLP) paradigm that explicitly incorporates the relational structure inherent to many multimodal datasets, targeting improved cross-modal alignment by reasoning over both content and instance-level relationships. SLIP extends the classical contrastive VLP framework (e.g., CLIP) with a structural contrastive loss and lightweight graph neural network (GNN) modules, leveraging graph-based contextual information such as product co-purchases. Empirical results demonstrate consistent improvements over baselines on cross-modal retrieval and classification in both zero-shot and few-shot settings.
1. Model Architecture
SLIP builds upon the CLIP dual-encoder backbone and introduces a graph-aware module to fuse relational information. The principal architectural components are as follows:
- Base Encoders:
- The image encoder is a Vision Transformer (ViT), e.g., ViT-B/16 or ViT-B/32, initialized from CLIP.
- The text encoder is the matching CLIP text transformer.
- Each image–text pair yields a pair of ℓ2-normalized embedding vectors (one visual, one textual).
- Graph-aware Fusion:
- Within each training batch, each image–text pair is modeled as a node in a sparse instance graph G, with adjacency matrix A capturing one-hop co-purchase relationships.
- Visual and textual embeddings are processed separately by two layers of Graph Attention Networks (GATs), with hidden dimension 512, 4 attention heads, and dropout 0.1.
- The final GAT outputs are concatenated and passed through a projection head comprising a single MLP followed by ℓ2-normalization, yielding fused embeddings that encode both content and relational context.
- Retained CLIP Pathway:
- The original encodings concurrently propagate through the standard CLIP contrastive head, allowing SLIP to retain the representational fidelity of content-only models.
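A minimal NumPy sketch may make the fusion step concrete. It is a toy under simplifying assumptions: a single-head, single-layer GAT shared across modalities (the model described above uses two layers, four heads, and separate pathways), random illustrative weights, and hypothetical function names:

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def gat_layer(h, adj, W, a_src, a_dst):
    """One simplified single-head GAT layer.
    h: (N, d) node features; adj: (N, N) 0/1 adjacency with self-loops."""
    z = h @ W                                    # linear projection
    e = np.add.outer(z @ a_src, z @ a_dst)       # attention logits e_ij
    e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # row-wise attention weights
    return alpha @ z                             # output nonlinearity omitted for brevity

def fuse(img_emb, txt_emb, adj, rng):
    """GAT over image and text embeddings, then concatenate, project, normalize."""
    d = img_emb.shape[1]
    W = rng.standard_normal((d, d)) * 0.05       # illustrative random weights
    a_src = rng.standard_normal(d) * 0.05
    a_dst = rng.standard_normal(d) * 0.05
    gi = gat_layer(img_emb, adj, W, a_src, a_dst)
    gt = gat_layer(txt_emb, adj, W, a_src, a_dst)
    proj = rng.standard_normal((2 * d, d)) * 0.05  # single MLP projection
    return l2_normalize(np.concatenate([gi, gt], axis=1) @ proj)
```

In the full model the adjacency comes from co-purchase edges within the batch, and the GAT and projection weights are trained jointly with the contrastive objectives.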
2. Structural Contrastive Objective
SLIP augments the standard CLIP InfoNCE loss with a structural contrastive loss, designed to capture the relational semantics encoded by the graph:
- CLIP Contrastive Head:
Aligns each image–text pair via symmetric cross-entropy (InfoNCE):

L_CLIP = -(1/2N) Σ_i [ log( exp(s_ii/τ) / Σ_j exp(s_ij/τ) ) + log( exp(s_ii/τ) / Σ_j exp(s_ji/τ) ) ]

where s_ij = v_i · t_j, i indexes positive pairs in the batch, and τ is the temperature scalar.
- Graph-based Generalization:
- Positive mask: P_ij = 1 if nodes i and j lie within hop-distance k (k = 1 by default, i.e., direct neighbors), else 0.
- Negative mask: M_ij = 1 − P_ij.
- Structural similarity scores: S_ij = z_i · z_j over the fused embeddings; the row-wise log-softmax of S/τ is computed.
- Structural contrastive loss:

L_struct = -(1/N) Σ_i (1/|P_i|) Σ_{j : P_ij = 1} [log softmax(S_i/τ)]_j

where |P_i| is the number of positives in row i.
- Auxiliary Classification:
- An optional linear classifier head over the fused embeddings is trained with standard cross-entropy against the product-category labels:

L_cls = -(1/N) Σ_i log softmax(W z_i)_{y_i}

where y_i is the category label of product i.
- Composite Loss Function:
- The final objective combines all components:

L_total = L_CLIP + λ_struct · L_struct + λ_cls · L_cls

with weighting coefficients λ_struct and λ_cls.
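A NumPy sketch of the full objective follows; it is one plausible reading of the losses described above, with the temperature, the mask conventions, and the λ weights treated as illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_infonce(v, t, tau=0.07):
    """Symmetric InfoNCE over in-batch pairs; v, t are l2-normalized (N, d)."""
    logits = (v @ t.T) / tau
    idx = np.arange(len(v))
    i2t = -log_softmax(logits, axis=1)[idx, idx]   # image -> text direction
    t2i = -log_softmax(logits, axis=0)[idx, idx]   # text -> image direction
    return 0.5 * (i2t.mean() + t2i.mean())

def structural_loss(z, adj, tau=0.07):
    """Structural contrastive loss: one-hop neighbors (plus self) are positives."""
    pos = (adj > 0) | np.eye(len(z), dtype=bool)   # positive mask P
    logp = log_softmax((z @ z.T) / tau, axis=1)    # row-wise log-softmax of scores
    return float((-(logp * pos).sum(axis=1) / pos.sum(axis=1)).mean())

def composite_loss(v, t, z, adj, lam_struct=1.0, lam_cls=0.0, cls_loss=0.0):
    # lam_struct / lam_cls are hyperparameters; the paper's values are not reproduced here
    return clip_infonce(v, t) + lam_struct * structural_loss(z, adj) + lam_cls * cls_loss
```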
3. Multimodal Graph Dataset Construction
To realize the structure-aware paradigm, SLIP introduces a large-scale Amazon Product Co-purchase Multimodal Graph Dataset:
- Nodes:
Products identified from Amazon data (including titles, descriptions, high-resolution images), labeled by Amazon taxonomy categories.
- Edges:
Derived via bipartite user-product graph construction. Co-purchase edges connect two products if at least three distinct users purchased both items. A 5-core decomposition is applied to remove nodes with degree less than five, yielding denser and more stable subgraphs.
- Dataset Statistics (per category):
| Subset | Nodes (K) | Edges (K) | CLIP-T Title Alignment | CLIP-T Desc. Alignment |
|-----------------------|-----------|-----------|------------------------|------------------------|
| Electronics | 98 | 2,015 | | |
| Books, Beauty, Health | 14–194 | 90–3,988 | n/a | n/a |
These statistics highlight the broad coverage and multimodal detail available for structured cross-modal pretraining.
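The edge-construction recipe above (a co-purchase threshold of three distinct users, then 5-core pruning) can be sketched in plain Python; the data structures and function names are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def co_purchase_edges(user_baskets, min_users=3):
    """Connect two products if at least `min_users` distinct users bought both."""
    buyers = defaultdict(set)                 # (product_a, product_b) -> users
    for user, products in user_baskets.items():
        for a, b in combinations(sorted(set(products)), 2):
            buyers[(a, b)].add(user)
    return {edge for edge, users in buyers.items() if len(users) >= min_users}

def k_core(nodes, edges, k=5):
    """Iteratively drop nodes of degree < k (k=5 gives the 5-core decomposition)."""
    nodes, edges = set(nodes), set(edges)
    while True:
        deg = defaultdict(int)
        for a, b in edges:
            deg[a] += 1
            deg[b] += 1
        drop = {n for n in nodes if deg[n] < k}
        if not drop:
            return nodes, edges
        nodes -= drop
        edges = {(a, b) for a, b in edges if a not in drop and b not in drop}
```

Iterative pruning is needed because removing one low-degree node can push its neighbors below the threshold; the loop runs until the subgraph stabilizes.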
4. Training Procedure
The SLIP training process combines established practices from CLIP with specialized graph-based supervision, governed by the following protocol:
- Architectural specifics:
Backbone: openai/clip-vit-base-patch16 or patch32. GATs: hidden size 512, 4 attention heads, dropout 0.1.
- Optimization:
- Pretrained CLIP layers: discriminative fine-tuning, with a base learning rate for the deepest layer decayed by a factor of 0.8 per shallower layer.
- Non-pretrained modules (GAT, projection head, classifier): trained with a separate constant learning rate.
- Learning schedule and regularization:
- Linear warmup (500 steps), followed by cosine or linear decay to zero.
- Early stopping on validation performance (patience 10 epochs, with a minimum-improvement threshold).
- Gradient checkpointing for large batch efficiency.
- Batch sizes and compute:
Experiments span a range of batch (subgraph) sizes, with best results at batch size 1024. Training runs for up to 50 epochs (roughly 1–2 days on A100 GPUs, depending on batch size).
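The layer-wise learning-rate decay and the warmup-then-cosine schedule described above can be sketched as follows; the base-rate values here are placeholders, not the paper's:

```python
import math

def layerwise_lrs(num_layers, base_lr=1e-5, decay=0.8):
    """Discriminative fine-tuning: the deepest layer uses base_lr and each
    shallower layer is decayed by `decay` (index 0 = shallowest layer)."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def lr_at_step(step, total_steps, peak_lr, warmup_steps=500):
    """Linear warmup for `warmup_steps` steps, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice each entry of `layerwise_lrs` would seed one optimizer parameter group, and `lr_at_step` would scale every group's rate per step.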
5. Experimental Evaluation and Results
SLIP’s effectiveness is measured principally on cross-modal retrieval and classification within the Electronics subset. The evaluation framework includes:
- Retrieval Metrics:
Mean Reciprocal Rank (MRR), Mean Rank, Median Rank; Recall@1, Recall@5, Recall@10.
Comparison baseline: CLIP, fine-tuned under the same schedule.
- Principal Results (batch=1024):
| Model | MRR   | R@1   | R@5   | Mean Rank |
|-------|-------|-------|-------|-----------|
| SLIP  | 0.584 | 0.478 | 0.712 | 93.4      |
| CLIP  | 0.520 | 0.403 | 0.644 | 133.4     |
SLIP yields a relative improvement of +12.3% in MRR and +18.6% in Recall@1 over CLIP. Improvements are symmetric for both image→text and text→image retrieval.
- Qualitative Example:
For the query “Garmin DeLorme Atlas – Alaska”, CLIP ranks the true item only at #12, surfacing generic images at the top, while SLIP retrieves the correct map at rank #1 and groups related cartographic items nearby. Visualizations confirm SLIP’s ability to cluster semantically related graph neighbors via co-purchase relations.
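The retrieval metrics reported above can be computed from a query-by-candidate similarity matrix; a minimal sketch (diagonal entries are assumed to hold the matching-pair scores, ranks are 1-based):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """MRR, mean/median rank, and Recall@K from an (N, N) similarity matrix
    whose diagonal holds the true-match scores."""
    n = sim.shape[0]
    diag = sim[np.arange(n), np.arange(n)]
    # rank of the true match = 1 + number of candidates scored strictly higher
    ranks = 1 + (sim > diag[:, None]).sum(axis=1)
    out = {"mrr": float((1.0 / ranks).mean()),
           "mean_rank": float(ranks.mean()),
           "median_rank": float(np.median(ranks))}
    for k in ks:
        out[f"recall@{k}"] = float((ranks <= k).mean())
    return out
```

Calling it on `sim` gives image→text numbers and on `sim.T` the text→image direction, matching the symmetric reporting above.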
6. Ablation and Analytical Studies
- Batch Size and Graph Connectivity:
Small batch sizes yield predominantly disconnected per-batch graphs, undermining the structural contrastive supervision and harming performance. Substantial gains begin at batch size 256, with the best results at 512 and 1024, where the graph context within each batch is richly connected.
- Component Analysis:
- The “graph only” configuration (structural loss without the auxiliary classifier) achieves MRR = 0.597, Mean Rank = 114.1.
- Adding the auxiliary classifier yields MRR = 0.584, Mean Rank = 93.4, improving worst-case retrieval at a small cost in MRR.
- Removing the discriminative learning-rate protocol reduces MRR to 0.566.
- The primary driver of SLIP’s gain is the structural contrastive term, with auxiliary classification and layer-wise learning-rate schedules yielding further refinements.
- Hop-distance for Positive Selection:
One-hop graph neighbors are optimal as additional positives; they represent complementary rather than identical products, and their cosine similarity under CLIP exhibits a heavy tail adjacent to true positives. Broadening to 2- or 3-hop neighbors introduces noise and dilutes the contrastive signal.
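Hop-bounded positive selection can be sketched with a breadth-first search; the function names and adjacency-dict representation are illustrative:

```python
from collections import deque

def hop_distances(adj_list, src, max_hops=3):
    """BFS distances from `src`, truncated at `max_hops` hops."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue
        for v in adj_list.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def positive_mask(adj_list, batch_nodes, k=1):
    """Mark pairs within hop-distance k as positives (k=1: direct neighbors)."""
    idx = {n: i for i, n in enumerate(batch_nodes)}
    n = len(batch_nodes)
    mask = [[i == j for j in range(n)] for i in range(n)]  # self-pairs positive
    for node in batch_nodes:
        for other, d in hop_distances(adj_list, node, max_hops=k).items():
            if 0 < d <= k and other in idx:
                mask[idx[node]][idx[other]] = True
    return mask
```

Raising k from 1 to 2 or 3 simply widens the BFS frontier, which is exactly how the broader (and noisier) positive sets described above would be produced.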
7. Limitations and Prospective Extensions
- Static Graph Construction:
Static co-purchase graphs conflate purchases made at different points in time, blurring temporally sequenced and genuinely complementary relationships. Partitioning edges by recency or weighting them by timestamp is proposed to better delineate “replacement” vs. “accessory” product relationships.
- Graph Types and Modalities:
Current implementation is limited to instance-level, undirected, untyped graphs. Extensions to incorporate edge types (e.g., “also_bought” vs “also_viewed”), heterogeneous graphs, or integration with multimodal knowledge bases are promising avenues for future research.
- Domain and Modality Adaptation:
Generalization to social recommendation, citation networks, and further modalities (e.g., video–text, audio–text) remains unexplored.
- Scalability:
Experiments currently operate at the scale of roughly 100 K nodes per category subset. Streaming graph construction and more advanced graph-sampling strategies will be necessary to extend the approach to corpora with billions of nodes.
In summary, SLIP introduces structure-aware pretraining by integrating GNN-based graph context and structural contrastive loss atop the standard vision-language dual-encoder paradigm. These additions provide systematic improvement in cross-modal retrieval and classification, conditioned on sufficient graph connectivity per batch. The accompanying Amazon multimodal co-purchase graph forms a new benchmark for structure-aware vision–language learning.