YearCLIP: Multi-Modal Year Prediction

Updated 27 December 2025
  • The paper introduces an innovative method that fuses visual, textual, and geographical data with prompt-based similarity extraction and a coarse-to-fine ordinal regression head.
  • The methodology leverages frozen CLIP backbones with learnable MLP adapters, location encoders, and parallel prompt branches to capture nuanced architectural cues.
  • The design notably mitigates popularity bias by calibrating ordinal distances between construction years, enhancing temporal reasoning beyond landmark memorization.

YearCLIP is a multi-modal ordinal regression architecture designed to predict the construction year of buildings from photographs, with the explicit goal of mitigating popularity bias in vision-language models (VLMs). Developed for the YearGuessr benchmark, YearCLIP fuses visual, textual, and optional geographic cues using a combination of pre-trained and learnable modules. The architecture is characterized by its distinctive prompt-based similarity feature extraction, geographically-aware fusion layer, and a coarse-to-fine ordinal regression head that collectively enable nuanced temporal reasoning beyond simple memorization of popular landmarks (Szu-Tu et al., 24 Dec 2025).

1. Multi-Modal Pipeline Composition

YearCLIP ingests multi-modal inputs, including a 224×224 building façade image $I$ and optional GPS coordinates $g = (\phi, \lambda)$. The architecture is organized into three main branches:

  • Visual branch: The frozen CLIP ViT-B/16 image encoder $f_v$ converts $I$ to a 512-dimensional visual feature $z_v^{raw}$. This passes through a small, learnable MLP adapter to yield $z_v \in \mathbb{R}^{512}$.
  • Location branch (optional): GPS coordinates are mapped to random Fourier features, passed through an MLP to produce $z_l^{raw} \in \mathbb{R}^{512}$, and further transformed by a learnable zero-initialized 1×1 convolution ("ZeroConv") layer to obtain $z_l$. Fusion with vision occurs by element-wise addition: $z_{input} = z_v$ (no location) or $z_{input} = z_v + z_l$ (with location).
  • Textual prompt branches: Both branches use the frozen CLIP text encoder:
    • Style prompts: Seven coarse architectural style tokens $\{s_i\}$ produce embeddings $\{z_{c_i}\}$.
    • Reasoning prompts: A set of fine-grained reasoning cues (e.g., roof type, wall material) $\{r_{jk}\}$ yield embeddings $\{z_{r_{jk}}\}$.

Cosine similarities between $z_{input}$ and the style embeddings ($\cos(z_{input}, z_{c_i})$) and reasoning embeddings ($\cos(z_{input}, z_{r_{jk}})$) are concatenated to form a prompt-based similarity vector $s \in \mathbb{R}^{7 + \sum_j |\mathrm{subcats}_j|}$, as sketched below.
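A minimal PyTorch sketch of this feature-extraction step, assuming the OpenAI CLIP interface (`encode_image`/`encode_text`) and illustrative module and argument names that are not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptSimilarity(nn.Module):
    """Frozen CLIP backbone + learnable adapter + prompt similarities.

    style_tokens / reasoning_tokens are pre-tokenized prompt batches
    (7 style prompts, R reasoning prompts)."""
    def __init__(self, clip_model, style_tokens, reasoning_tokens):
        super().__init__()
        self.clip = clip_model.eval()
        for p in self.clip.parameters():       # keep the CLIP backbone frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(          # small learnable MLP adapter
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
        with torch.no_grad():                  # encode the fixed prompts once
            prompts = torch.cat([self.clip.encode_text(style_tokens),
                                 self.clip.encode_text(reasoning_tokens)])
        self.register_buffer("z_prompts", F.normalize(prompts.float(), dim=-1))

    def forward(self, image, z_l=None):
        with torch.no_grad():
            z_raw = self.clip.encode_image(image).float()   # (B, 512)
        z_v = self.adapter(z_raw)
        z_in = z_v if z_l is None else z_v + z_l   # optional location fusion
        z_in = F.normalize(z_in, dim=-1)
        return z_in @ self.z_prompts.t()   # (B, 7 + R): similarity vector s
```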

2. Adaptations and Extensions Beyond CLIP

YearCLIP departs from the standard CLIP pipeline through several targeted architectural innovations:

  • All CLIP backbones (vision, caption, reasoning encoders) remain frozen to preserve broad visual-textual knowledge while enabling task-specific adaptation via lightweight modules.
  • A vision MLP adapter is introduced immediately after the CLIP vision tower.
  • A specialized location encoder combines random Fourier feature mapping with an MLP, followed by a zero-initialized convolution ("ZeroConv") layer, enabling location-based disambiguation without overwhelming the visual signal (see the sketch after this list).
  • Parallel prompt branches inject explicit architectural priors, encouraging the extraction of style and construction reasoning cues otherwise underrepresented in generic VLM pretraining.
  • Contrastive head replacement: The traditional CLIP contrastive head is supplanted by a coarse-to-fine ordinal regression head $g$, optimized for the temporally structured regression target.
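The location branch can be sketched as follows; the RFF frequency scale and the realization of ZeroConv as a zero-initialized 1×1 `Conv1d` are plausible assumptions rather than details confirmed by the paper:

```python
import math
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    """Random Fourier features -> MLP -> zero-initialized projection."""
    def __init__(self, rff_dim=60, out_dim=512):
        super().__init__()
        # Fixed random projection of (lat, lon) for the Fourier features.
        self.register_buffer("B", torch.randn(2, rff_dim // 2))
        self.mlp = nn.Sequential(
            nn.Linear(rff_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))
        # "ZeroConv": weights and bias start at zero, so z_l ≈ 0 initially
        # and location information is blended in only as training demands.
        self.zero_conv = nn.Conv1d(out_dim, out_dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, coords):                  # coords: (B, 2) = (phi, lambda)
        proj = 2 * math.pi * coords @ self.B    # (B, rff_dim // 2)
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)   # (B, rff_dim)
        z = self.mlp(rff)                       # (B, 512)
        return self.zero_conv(z.unsqueeze(-1)).squeeze(-1)  # z_l, ~0 at init
```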

3. Ordinal Regression Head: Design and Output

The regression head receives the prompt-based similarity vector $s$ as input, with dimensionality $D_{in} = 7 + R$ (e.g., $R \approx 20$, so $D_{in} \approx 27$). Its structure, sketched in code after the list below, is as follows:

  • Hidden layer: Fully connected, 512 units, ReLU activation, optional dropout (0.1).
  • Output layer: Fully connected, $k = 7$ logits corresponding to coarse temporal bins (aligned with major architectural periods).
  • Prediction: A softmax layer yields a distribution $\vec{p}$ over bins; the continuous prediction is generated as a weighted average of bin midpoints $\{b_i\}$:

$$\hat{y} = \sum_{i=1}^k p_i \cdot b_i,$$

with $p_i = \mathrm{Softmax}(g(s))_i$.

  • Rationale output: Returns the argmax style token and most influential reasoning subcategory per group, supporting interpretability.
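A minimal sketch of the head, with placeholder bin midpoints (the paper's actual period boundaries are not reproduced here):

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """Coarse-to-fine head: 7 coarse-bin logits -> softmax -> expected year."""
    def __init__(self, d_in=27,
                 midpoints=(1500, 1700, 1820, 1880, 1925, 1960, 2000)):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(d_in, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, len(midpoints)))
        self.register_buffer("b", torch.tensor(midpoints, dtype=torch.float))

    def forward(self, s):                # s: (B, d_in) prompt-similarity vector
        logits = self.g(s)               # (B, 7) coarse temporal-bin logits
        p = logits.softmax(dim=-1)       # distribution p over bins
        y_hat = (p * self.b).sum(-1)     # y_hat = sum_i p_i * b_i
        return y_hat, logits
```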

4. Training Objectives and Evaluation Metrics

YearCLIP is trained with multi-term objectives designed to enforce ordinal structure and calibrate uncertainty:

  • Fine-grained Cross-modal Ranking-based Contrastive Loss (FCRC):

$$L_{\mathrm{FCRC}} = -\frac{1}{M}\sum_{i=1}^{M} \log \frac{\exp(\cos(z_i, w_i)/\tau)}{\exp(\cos(z_i, w_i)/\tau) + \sum_{j \ne i} \lambda_{i,j} \exp(\cos(z_i, w_j)/\tau)}$$

with label-dependent weights $\lambda_{i,j} \propto |y_i - y_j|$ that penalize temporally distant negatives more heavily, so that the embedding space preserves ordinal distances between construction years.
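A sketch of this loss under the definitions above; the temperature, the $\beta$ scale, and the row-wise normalization of the weights are assumptions:

```python
import torch
import torch.nn.functional as F

def fcrc_loss(z, w, years, tau=0.07, beta=1.0):
    """z, w: (M, d) paired cross-modal embeddings; years: (M,) labels."""
    z = F.normalize(z, dim=-1)
    w = F.normalize(w, dim=-1)
    sim = (z @ w.t()) / tau                       # cos(z_i, w_j) / tau
    dist = (years[:, None] - years[None, :]).abs().float()
    lam = beta * dist                             # larger for distant years
    lam = lam / lam.sum(dim=1, keepdim=True).clamp_min(1e-8)   # Norm(.)
    exp_sim = sim.exp()
    pos = exp_sim.diag()                          # positive pair terms
    neg = (exp_sim * lam).sum(dim=1)              # distance-weighted negatives
    return -(pos / (pos + neg)).log().mean()      # lam's diagonal is zero
```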

  • Auxiliary objectives: Equally weighted cross-entropy for bin classification, KL-divergence from smoothed targets, and (optionally) $\ell_1$ regression on $\hat{y}$.
  • Metrics (computed as in the sketch following this list):
    • MAE: Mean Absolute Error between $\hat{y}$ and $y$
    • Interval Accuracy (IA$_k$): Fraction of samples with $|y - \hat{y}| \leq k$ for $k \in \{5, 20, 50, 100\}$
    • Popularity-aware interval accuracy and "popularity gain": $IA_5(\mathrm{high}) - IA_5(\mathrm{low})$
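These metrics can be computed directly from predictions, as in the following sketch (the high/low popularity split is assumed to arrive as a boolean mask):

```python
import torch

def year_metrics(y_hat, y, popular_mask=None):
    """MAE, interval accuracies IA_k, popularity gain IA_5(high) - IA_5(low)."""
    err = (y_hat - y).abs()
    out = {"MAE": err.mean().item()}
    for k in (5, 20, 50, 100):
        out[f"IA{k}"] = (err <= k).float().mean().item()
    if popular_mask is not None:
        ia5_high = (err[popular_mask] <= 5).float().mean()
        ia5_low = (err[~popular_mask] <= 5).float().mean()
        out["popularity_gain"] = (ia5_high - ia5_low).item()
    return out
```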

5. Hyperparameter Choices and Optimization

Component configurations and training recipes are as follows:

| Module | Architecture | Hyperparameters |
|---|---|---|
| CLIP encoders | ViT-B/16, 512-dim output | Frozen; no fine-tuning |
| Vision adapter | MLP (512 → 512) | RAdam, lr = 1e-5 |
| Ordinal regression head | MLP (27 → 512 → 7) | Adam, lr = 1e-4, $\beta = (0.9, 0.999)$ |
| Reasoning bank | ~20 subcategories, 7 styles | Frozen prompts |
| Location encoder | RFF (dim = 60) → MLP (512) → ZeroConv | — |
| Schedule/regularization | Batch 64, epochs 50, dropout 0.1, wd 1e-4 | Mixed-precision (16-bit) |

All loss terms are weighted equally.
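A hedged sketch of how these pieces combine, assuming a wrapper `model` that exposes the adapter and head from the earlier sketches, and that the smoothed bin targets are given; `fcrc_loss` is defined above:

```python
import torch
import torch.nn.functional as F

# Per-module optimizers, matching the table above.
opt_adapter = torch.optim.RAdam(model.adapter.parameters(),
                                lr=1e-5, weight_decay=1e-4)
opt_head = torch.optim.Adam(model.head.parameters(), lr=1e-4,
                            betas=(0.9, 0.999), weight_decay=1e-4)

def total_loss(logits, y_hat, y, bin_ids, smooth_targets, z, w):
    ce = F.cross_entropy(logits, bin_ids)                   # coarse-bin CE
    kl = F.kl_div(logits.log_softmax(-1), smooth_targets,   # smoothed-target KL
                  reduction="batchmean")
    l1 = F.l1_loss(y_hat, y)                                # optional L1 on years
    return ce + kl + l1 + fcrc_loss(z, w, y)                # equal weights
```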

6. Novel Components and Equations

YearCLIP introduces several notable mathematical and architectural innovations:

  • ZeroConv fusion: $z_l = \mathrm{ZeroConv}(\mathrm{MLP}_{\mathrm{RFF}}(\phi, \lambda))$, initialized so that $z_l \approx 0$, allowing the network to learn selective location conditioning.
  • Prompt-based similarity composite: $s = [\cos(z_{input}, z_{c_1}), \ldots, \cos(z_{input}, z_{c_7}), \cos(z_{input}, z_{r_{11}}), \ldots]^T$
  • Coarse-to-fine year regression: $\hat{y} = \sum_{i=1}^k p_i \cdot b_i$
  • Distance-weighted negatives in FCRC: $\lambda_{i,j} = \mathrm{Norm}(\beta\,|y_i - y_j|)$, modulating the magnitude of the contrastive penalty

7. Mitigation of Popularity Bias

YearCLIP's design explicitly addresses the over-reliance of large VLMs on memorization of high-profile, widely-photographed structures. Critical mitigation mechanisms include:

  • Ordinal contrastive loss enforces respect for numeric distances between construction years, limiting superficial text-match memorization.
  • Low-level architectural cues from reasoning prompts ensure the model leverages construction-related visual properties (roof type, wall material, etc.) rather than relying only on global or superficial patterns.
  • Geographically-aware fusion enables local disambiguation without learning to "shortcut" via memorized geolocated images.
  • YearCLIP exhibits only a modest popularity gain in IA$_5$ (−7.8%, i.e., it is slightly more accurate on rarely photographed buildings), compared to much larger gains (16–34%) for closed-source VLMs, providing evidence for reduced reliance on landmark memorization and greater capacity for generalizable temporal reasoning (Szu-Tu et al., 24 Dec 2025).