YearCLIP: Multi-Modal Year Prediction

Updated 27 December 2025
  • The paper introduces an innovative method that fuses visual, textual, and geographical data with prompt-based similarity extraction and a coarse-to-fine ordinal regression head.
  • The methodology leverages frozen CLIP backbones with learnable MLP adapters, location encoders, and parallel prompt branches to capture nuanced architectural cues.
  • The design notably mitigates popularity bias by calibrating ordinal distances between construction years, enhancing temporal reasoning beyond landmark memorization.

YearCLIP is a multi-modal ordinal regression architecture designed to predict the construction year of buildings from photographs, with the explicit goal of mitigating popularity bias in vision-language models (VLMs). Developed for the YearGuessr benchmark, YearCLIP fuses visual, textual, and optional geographic cues using a combination of pre-trained and learnable modules. The architecture is characterized by its distinctive prompt-based similarity feature extraction, geographically-aware fusion layer, and a coarse-to-fine ordinal regression head that collectively enable nuanced temporal reasoning beyond simple memorization of popular landmarks (Szu-Tu et al., 24 Dec 2025).

1. Multi-Modal Pipeline Composition

YearCLIP ingests multi-modal inputs, including a 224×224 building façade image $I$ and optional GPS coordinates $g = (\phi, \lambda)$. The architecture is organized into three main branches:

  • Visual branch: The frozen CLIP ViT-B/16 image encoder $f_v$ converts $I$ to a 512-dimensional visual feature $z_v^{raw}$. This passes through a small, learnable MLP adapter to yield $z_v \in \mathbb{R}^{512}$.
  • Location branch (optional): GPS coordinates are mapped to random Fourier features, passed through an MLP to produce $z_l^{raw} \in \mathbb{R}^{512}$, and further transformed by a learnable zero-initialized 1×1 convolution ("ZeroConv") layer to obtain $z_l$. Fusion with vision occurs by element-wise addition: $z_{input} = z_v$ (no location) or $z_{input} = z_v + z_l$ (with location).
  • Textual prompt branches: Both branches use the frozen CLIP text encoder:
    • Style prompts: Seven coarse architectural style tokens $\{s_i\}$ produce embeddings $\{z_{c_i}\}$.
    • Reasoning prompts: A set of fine-grained reasoning cues (e.g., roof type, wall material) $\{r_{jk}\}$ yield embeddings $\{z_{r_{jk}}\}$.

Cosine similarities between $z_{input}$ and the style embeddings ($\cos(z_{input}, z_{c_i})$) and reasoning embeddings ($\cos(z_{input}, z_{r_{jk}})$) are concatenated to form a prompt-based similarity vector $s \in \mathbb{R}^{7 + \sum_j |\mathrm{subcats}_j|}$, as sketched below.
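A minimal PyTorch sketch of this feature-extraction step, assuming the OpenAI CLIP interface (`encode_image`/`encode_text`) and illustrative module and argument names that are not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptSimilarity(nn.Module):
    """Frozen CLIP backbone + learnable adapter + prompt similarities.

    style_tokens / reasoning_tokens are pre-tokenized prompt batches
    (7 style prompts, R reasoning prompts)."""
    def __init__(self, clip_model, style_tokens, reasoning_tokens):
        super().__init__()
        self.clip = clip_model.eval()
        for p in self.clip.parameters():       # keep the CLIP backbone frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(          # small learnable MLP adapter
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
        with torch.no_grad():                  # encode the fixed prompts once
            prompts = torch.cat([self.clip.encode_text(style_tokens),
                                 self.clip.encode_text(reasoning_tokens)])
        self.register_buffer("z_prompts", F.normalize(prompts.float(), dim=-1))

    def forward(self, image, z_l=None):
        with torch.no_grad():
            z_raw = self.clip.encode_image(image).float()   # (B, 512)
        z_v = self.adapter(z_raw)
        z_in = z_v if z_l is None else z_v + z_l   # optional location fusion
        z_in = F.normalize(z_in, dim=-1)
        return z_in @ self.z_prompts.t()   # (B, 7 + R): similarity vector s
```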

2. Adaptations and Extensions Beyond CLIP

YearCLIP departs from the standard CLIP pipeline through several targeted architectural innovations:

  • All CLIP backbones (vision, caption, reasoning encoders) remain frozen to preserve broad visual-textual knowledge while enabling task-specific adaptation via lightweight modules.
  • A vision MLP adapter is introduced immediately after the CLIP vision tower.
  • A specialized location encoder combines random Fourier feature mapping with an MLP, followed by a zero-initialized convolution ("ZeroConv") layer, enabling location-based disambiguation without overwhelming the visual signal (see the sketch after this list).
  • Parallel prompt branches inject explicit architectural priors, encouraging the extraction of style and construction reasoning cues otherwise underrepresented in generic VLM pretraining.
  • Contrastive head replacement: The traditional CLIP contrastive head is supplanted by a coarse-to-fine ordinal regression head $g$, optimized for the temporally structured regression target.
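The location branch can be sketched as follows; the RFF frequency scale and the realization of ZeroConv as a zero-initialized 1×1 `Conv1d` are plausible assumptions rather than details confirmed by the paper:

```python
import math
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    """Random Fourier features -> MLP -> zero-initialized projection."""
    def __init__(self, rff_dim=60, out_dim=512):
        super().__init__()
        # Fixed random projection of (lat, lon) for the Fourier features.
        self.register_buffer("B", torch.randn(2, rff_dim // 2))
        self.mlp = nn.Sequential(
            nn.Linear(rff_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))
        # "ZeroConv": weights and bias start at zero, so z_l ≈ 0 initially
        # and location information is blended in only as training demands.
        self.zero_conv = nn.Conv1d(out_dim, out_dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, coords):                  # coords: (B, 2) = (phi, lambda)
        proj = 2 * math.pi * coords @ self.B    # (B, rff_dim // 2)
        rff = torch.cat([proj.sin(), proj.cos()], dim=-1)   # (B, rff_dim)
        z = self.mlp(rff)                       # (B, 512)
        return self.zero_conv(z.unsqueeze(-1)).squeeze(-1)  # z_l, ~0 at init
```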

3. Ordinal Regression Head: Design and Output

The regression head receives the prompt-based similarity vector $s$ as input, with dimensionality $D_{in} = 7 + R$ (e.g., $R \approx 20$, so $D_{in} \approx 27$). Its structure, sketched in code after the list below, is as follows:

  • Hidden layer: Fully connected, 512 units, ReLU activation, optional dropout (0.1).
  • Output layer: Fully connected, $k = 7$ logits corresponding to coarse temporal bins (aligned with major architectural periods).
  • Prediction: A softmax layer yields a distribution $\vec{p}$ over bins; the continuous prediction is generated as a weighted average of bin midpoints $\{b_i\}$:

$$\hat{y} = \sum_{i=1}^k p_i \cdot b_i,$$

with $p_i = \mathrm{Softmax}(g(s))_i$.

  • Rationale output: Returns the argmax style token and most influential reasoning subcategory per group, supporting interpretability.
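A minimal sketch of the head, with placeholder bin midpoints (the paper's actual period boundaries are not reproduced here):

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """Coarse-to-fine head: 7 coarse-bin logits -> softmax -> expected year."""
    def __init__(self, d_in=27,
                 midpoints=(1500, 1700, 1820, 1880, 1925, 1960, 2000)):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(d_in, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, len(midpoints)))
        self.register_buffer("b", torch.tensor(midpoints, dtype=torch.float))

    def forward(self, s):                # s: (B, d_in) prompt-similarity vector
        logits = self.g(s)               # (B, 7) coarse temporal-bin logits
        p = logits.softmax(dim=-1)       # distribution p over bins
        y_hat = (p * self.b).sum(-1)     # y_hat = sum_i p_i * b_i
        return y_hat, logits
```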

4. Training Objectives and Evaluation Metrics

YearCLIP is trained with multi-term objectives designed to enforce ordinal structure and calibrate uncertainty:

  • Fine-grained Cross-modal Ranking-based Contrastive Loss (FCRC):

$$L_{\mathrm{FCRC}} = -\frac{1}{M}\sum_{i=1}^{M} \log \frac{\exp(\cos(z_i, w_i)/\tau)}{\exp(\cos(z_i, w_i)/\tau) + \sum_{j \ne i} \lambda_{i,j} \exp(\cos(z_i, w_j)/\tau)}$$

with label-dependent weights $\lambda_{i,j} \propto |y_i - y_j|$ that penalize temporally distant negatives more heavily, so that the embedding space preserves ordinal distances between construction years.
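A sketch of this loss under the definitions above; the temperature, the $\beta$ scale, and the row-wise normalization of the weights are assumptions:

```python
import torch
import torch.nn.functional as F

def fcrc_loss(z, w, years, tau=0.07, beta=1.0):
    """z, w: (M, d) paired cross-modal embeddings; years: (M,) labels."""
    z = F.normalize(z, dim=-1)
    w = F.normalize(w, dim=-1)
    sim = (z @ w.t()) / tau                       # cos(z_i, w_j) / tau
    dist = (years[:, None] - years[None, :]).abs().float()
    lam = beta * dist                             # larger for distant years
    lam = lam / lam.sum(dim=1, keepdim=True).clamp_min(1e-8)   # Norm(.)
    exp_sim = sim.exp()
    pos = exp_sim.diag()                          # positive pair terms
    neg = (exp_sim * lam).sum(dim=1)              # distance-weighted negatives
    return -(pos / (pos + neg)).log().mean()      # lam's diagonal is zero
```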

  • Auxiliary objectives: Equally weighted cross-entropy for bin classification, KL-divergence from smoothed targets, and (optionally) $\ell_1$ regression on $\hat{y}$.
  • Metrics (computed as in the sketch following this list):
    • MAE: Mean Absolute Error between $\hat{y}$ and $y$
    • Interval Accuracy (IA$_k$): Fraction of samples with $|y - \hat{y}| \leq k$ for $k \in \{5, 20, 50, 100\}$
    • Popularity-aware interval accuracy and "popularity gain": $IA_5(\mathrm{high}) - IA_5(\mathrm{low})$
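These metrics can be computed directly from predictions, as in the following sketch (the high/low popularity split is assumed to arrive as a boolean mask):

```python
import torch

def year_metrics(y_hat, y, popular_mask=None):
    """MAE, interval accuracies IA_k, popularity gain IA_5(high) - IA_5(low)."""
    err = (y_hat - y).abs()
    out = {"MAE": err.mean().item()}
    for k in (5, 20, 50, 100):
        out[f"IA{k}"] = (err <= k).float().mean().item()
    if popular_mask is not None:
        ia5_high = (err[popular_mask] <= 5).float().mean()
        ia5_low = (err[~popular_mask] <= 5).float().mean()
        out["popularity_gain"] = (ia5_high - ia5_low).item()
    return out
```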

5. Hyperparameter Choices and Optimization

Component configurations and training recipes are as follows:

| Module | Architecture | Hyperparameters |
|---|---|---|
| CLIP encoders | ViT-B/16, 512-dim output | Frozen; no fine-tuning |
| Vision adapter | MLP (512 → 512) | RAdam, lr = 1e-5 |
| Ordinal regression head | MLP (27 → 512 → 7) | Adam, lr = 1e-4, $\beta = (0.9, 0.999)$ |
| Reasoning bank | ~20 subcategories, 7 styles | Frozen prompts |
| Location encoder | RFF (dim = 60) → MLP (512) → ZeroConv | — |
| Schedule/regularization | Batch 64, epochs 50, dropout 0.1, wd 1e-4 | Mixed-precision (16-bit) |

All loss terms are weighted equally.
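A hedged sketch of how these pieces combine, assuming a wrapper `model` that exposes the adapter and head from the earlier sketches, and that the smoothed bin targets are given; `fcrc_loss` is defined above:

```python
import torch
import torch.nn.functional as F

# Per-module optimizers, matching the table above.
opt_adapter = torch.optim.RAdam(model.adapter.parameters(),
                                lr=1e-5, weight_decay=1e-4)
opt_head = torch.optim.Adam(model.head.parameters(), lr=1e-4,
                            betas=(0.9, 0.999), weight_decay=1e-4)

def total_loss(logits, y_hat, y, bin_ids, smooth_targets, z, w):
    ce = F.cross_entropy(logits, bin_ids)                   # coarse-bin CE
    kl = F.kl_div(logits.log_softmax(-1), smooth_targets,   # smoothed-target KL
                  reduction="batchmean")
    l1 = F.l1_loss(y_hat, y)                                # optional L1 on years
    return ce + kl + l1 + fcrc_loss(z, w, y)                # equal weights
```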

6. Novel Components and Equations

YearCLIP introduces several notable mathematical and architectural innovations:

  • ZeroConv fusion: $z_l = \mathrm{ZeroConv}(\mathrm{MLP}_{\mathrm{RFF}}(\phi, \lambda))$, initialized so that $z_l \approx 0$, allowing the network to learn selective location conditioning.
  • Prompt-based similarity composite: $s = [\cos(z_{input}, z_{c_1}), \ldots, \cos(z_{input}, z_{c_7}), \cos(z_{input}, z_{r_{11}}), \ldots]^T$
  • Coarse-to-fine year regression: $\hat{y} = \sum_{i=1}^k p_i \cdot b_i$
  • Distance-weighted negatives in FCRC: $\lambda_{i,j} = \mathrm{Norm}(\beta\,|y_i - y_j|)$, modulating the magnitude of the contrastive penalty

7. Mitigation of Popularity Bias

YearCLIP's design explicitly addresses the over-reliance of large VLMs on memorization of high-profile, widely-photographed structures. Critical mitigation mechanisms include:

  • Ordinal contrastive loss enforces respect for numeric distances between construction years, limiting superficial text-match memorization.
  • Low-level architectural cues from reasoning prompts ensure the model leverages construction-related visual properties (roof type, wall material, etc.) rather than relying only on global or superficial patterns.
  • Geographically-aware fusion enables local disambiguation without learning to "shortcut" via memorized geolocated images.
  • YearCLIP exhibits only a modest popularity gain in IA$_5$ (−7.8%, i.e., it is slightly more accurate on rarely photographed buildings), compared to much larger gains (16–34%) for closed-source VLMs, providing evidence for reduced reliance on landmark memorization and greater capacity for generalizable temporal reasoning (Szu-Tu et al., 24 Dec 2025).