
DisCo-FLoc: Visual Floorplan Localization

Updated 12 January 2026
  • DisCo-FLoc is a monocular visual floorplan localization system that predicts laser-style depth rays and leverages dual-level contrastive learning to disambiguate symmetric layouts.
  • The framework combines a ray regression predictor with a ResNet-18 based encoder to perform robust geometry and orientation-level discrimination without semantic supervision.
  • On benchmarks such as Gibson and Structured3D, DisCo-FLoc improves recall over prior methods; on Gibson(f), recall at 1 m rises from 45.3 for the strongest listed baseline (RSK) to 56.7.

DisCo-FLoc is a monocular visual floorplan localization framework that introduces dual-level visual–geometric contrastive learning for robust, depth-aware localization, disambiguating repetitive or symmetric layouts without requiring semantic supervision. It addresses key challenges in floorplan localization (FLoc)—notably structural ambiguity and reliance on expensive annotation—by combining a specialized ray regression predictor with a fine-grained, multi-level contrastive disambiguation mechanism (Meng et al., 5 Jan 2026).

1. Pipeline Overview

DisCo-FLoc is a two-stage pipeline that localizes a single RGB image within a known floorplan geometry:

  • Stage 1: Ray Regression Predictor (RRP)
    • Given an input image $\mathcal{I}$, the RRP predicts $N$ discrete “laser-style” distances $\vec{d} = (d_1, \dots, d_N)$ along fixed orientations to the nearest structural boundary.
    • The backbone is a frozen DINO V2 depth encoder. Features are reduced to dimension $D$ via $1 \times 1$ convolution, and depth-ray queries $q_i \in \mathbb{R}^D$ are aggregated vertically per image column.
    • A multi-head attention layer predicts, for each ray $i$, a probability distribution $P_{i,1\dots D}$ over $D$ depth bins.
    • Depth regression: For each ray,

    $$d_i = \sum_{k=1}^D P_{i,k}\, d_k, \qquad d_k = \bigl(d_{\min}^\gamma + \tfrac{k}{D}(d_{\max}^\gamma - d_{\min}^\gamma)\bigr)^{1/\gamma} \tag{1}$$

    • Training loss: L1 plus a cosine “shape” term:

    $$\mathcal{L}_{ray} = \| \vec{d} - \vec{d}^* \|_1 + \frac{\vec{d}^\top \vec{d}^*}{\max\{\|\vec{d}\|_2 \|\vec{d}^*\|_2,\,\epsilon \}} \tag{2}$$

    • Depth-Aware Floorplan Probabilistic Map (DAFPM): At inference, candidate $(x, y, \theta)$ poses are tiled, ground-truth rays are rendered per pose, and compared to the predictions to yield $P_d(x, y, \theta)$. The $X$ most likely candidates progress to the next stage.
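The depth regression in Eq. (1) can be illustrated with a minimal sketch in plain Python. The function names and all numeric values used with them (`d_min`, `d_max`, `gamma`, number of bins) are assumptions chosen for illustration, not settings reported for DisCo-FLoc.

```python
def bin_centers(d_min, d_max, num_bins, gamma):
    """Depth value d_k for k = 1..num_bins, following Eq. (1).

    Bins are spaced uniformly in d**gamma and mapped back with the 1/gamma power.
    """
    lo, hi = d_min ** gamma, d_max ** gamma
    return [(lo + (k / num_bins) * (hi - lo)) ** (1.0 / gamma)
            for k in range(1, num_bins + 1)]


def expected_depth(probs, centers):
    """d_i = sum_k P_{i,k} * d_k for a single ray."""
    return sum(p * d for p, d in zip(probs, centers))


# Hypothetical values: 5 bins between 0 m and 10 m with linear spacing (gamma = 1).
centers = bin_centers(0.0, 10.0, 5, 1.0)
ray_probs = [0.0, 0.0, 1.0, 0.0, 0.0]   # all probability mass on the third bin
depth = expected_depth(ray_probs, centers)
```

Because the output is an expectation over bins, a distribution spread across neighbouring bins yields a depth between their centers rather than snapping to a single bin.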

  • Stage 2: Visual–Geometric Contrastive Disambiguation

    • Each candidate $S_i = (x_i, y_i, \theta_i)$ is associated with a 5 m $\times$ 5 m floorplan patch, aligned to $\theta_i$.
    • A contrastive learning mechanism is used to score fine-grained consistency between visual depth features and geometric layout in the patch.

This sequence enables DisCo-FLoc to first generate plausible localization candidates using a geometric prior and subsequently resolve ambiguities arising from floorplan regularities via contrastive matching.
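The candidate-generation step can be sketched as follows. The text does not specify how predicted and rendered rays are compared, so the similarity used here (the exponential of the negative mean absolute ray error, normalized over poses) is an assumption, as are the function names and the toy poses.

```python
import math


def depth_probabilities(pred_rays, rendered):
    """Turn ray agreement into a probability over candidate poses.

    `rendered` maps a pose (x, y, theta) to the rays rendered from the floorplan
    at that pose. ASSUMPTION: similarity = exp(-mean absolute ray error).
    """
    scores = {}
    for pose, rays in rendered.items():
        err = sum(abs(p - r) for p, r in zip(pred_rays, rays)) / len(pred_rays)
        scores[pose] = math.exp(-err)
    total = sum(scores.values())
    return {pose: s / total for pose, s in scores.items()}


def top_candidates(prob_map, x):
    """Keep the x most likely poses for the disambiguation stage."""
    return sorted(prob_map, key=prob_map.get, reverse=True)[:x]


# Toy example with three candidate poses and three rays each.
rendered = {
    (0.0, 0.0, 0.0): [1.0, 2.0, 3.0],
    (1.0, 0.0, 0.0): [2.0, 3.0, 4.0],
    (0.0, 1.0, 0.0): [1.0, 2.0, 3.5],
}
p_d = depth_probabilities([1.0, 2.0, 3.0], rendered)
shortlist = top_candidates(p_d, 2)
```

In a symmetric layout several poses would render near-identical rays and receive similar probabilities, which is the ambiguity the second stage is meant to resolve.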

2. Dual-Level Visual–Geometric Contrastive Learning

To discriminate between visually and geometrically similar locations, DisCo-FLoc pre-trains a ResNet-18 floorplan encoder with dual contrastive constraints while freezing the visual depth backbone.

  • Position-Level Contrast:
    • Positive pairs: image $f$ and its corresponding floorplan patch $g^+$ at $(x^*, y^*, \theta^*)$, plus small spatial ($\pm 0.5$ m) and orientation ($\pm 0.26$ rad) perturbations.
    • Negative pairs are drawn from:
      • Inner-floorplan: patches 1.5–3 m away within the same layout.
      • Cross-floorplan: patches from other floorplans.
  • Orientation-Level Contrast:
    • Negative pairs: same $(x^*, y^*)$ but heading $\theta^* + 180^\circ$.
  • PointInfoNCE Loss:
    • With negative set $M = M_p \cup M_o$ (for position and orientation levels):

    $$Z_p = \sum_{m\in M_p}\exp(f \cdot g_m/\tau), \quad Z_o = \sum_{m \in M_o}\exp(f \cdot g_m/\tau)$$

    $$\mathcal{L}_{CL} = -\sum_{(f,g^+)} \log \frac{\exp(f \cdot g^+ / \tau)}{Z_p + Z_o} \tag{3}$$

    where $\tau$ is a temperature and “$\cdot$” denotes cosine similarity.

  • Training Objective:

$$\mathcal{L} = \lambda_{ray}\,\mathcal{L}_{ray} + \lambda_{pos}\,\mathcal{L}_{CL}^{(p)} + \lambda_{ori}\,\mathcal{L}_{CL}^{(o)} \tag{4}$$

In practice, a single $\mathcal{L}_{CL}$ is used, since its denominator $Z_p + Z_o$ already includes both negative types.
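For a single image–patch pair, the contrastive term of Eq. (3) might be computed as in the sketch below. It follows the equation as written, with the denominator summing only over the position-level and orientation-level negatives. The helper names, the toy 2-D features, and the temperature value are assumptions for illustration.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def contrastive_loss(f, g_pos, pos_negs, ori_negs, tau):
    """Eq. (3) for one (f, g+) pair, with denominator Z_p + Z_o as in the text."""
    z_p = sum(math.exp(cosine(f, g) / tau) for g in pos_negs)
    z_o = sum(math.exp(cosine(f, g) / tau) for g in ori_negs)
    return -math.log(math.exp(cosine(f, g_pos) / tau) / (z_p + z_o))


# Toy 2-D features; tau = 1.0 is a placeholder, not a reported setting.
f = [1.0, 0.0]
loss = contrastive_loss(
    f,
    g_pos=[1.0, 0.0],       # matching floorplan patch
    pos_negs=[[0.0, 1.0]],  # position-level negative (e.g. a patch a few metres away)
    ori_negs=[[-1.0, 0.0]], # orientation-level negative (heading flipped by 180 degrees)
    tau=1.0,
)
```

Summing the loss over all pairs in a batch gives the full objective; both negative sets share one denominator, so a single call covers both contrast levels.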

The dual-level scheme addresses both translational and rotational ambiguities, crucial for distinguishing between symmetric or repetitive structures.
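The pair construction described above can be sketched as pose sampling. The ranges ($\pm 0.5$ m, $\pm 0.26$ rad, 1.5–3 m, $180^\circ$) come from the text; applying the spatial perturbation independently per axis, keeping the heading for inner-floorplan negatives, and ignoring walls or map bounds are simplifying assumptions, and the function names are hypothetical.

```python
import math
import random


def perturbed_positive(pose, rng, max_shift=0.5, max_turn=0.26):
    """Positive sample: small spatial (m) and heading (rad) perturbation."""
    x, y, theta = pose
    return (x + rng.uniform(-max_shift, max_shift),
            y + rng.uniform(-max_shift, max_shift),
            theta + rng.uniform(-max_turn, max_turn))


def inner_negative(pose, rng, r_min=1.5, r_max=3.0):
    """Position-level negative: a pose 1.5-3 m away in the same floorplan."""
    x, y, theta = pose
    r = rng.uniform(r_min, r_max)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    return (x + r * math.cos(phi), y + r * math.sin(phi), theta)


def orientation_negative(pose):
    """Orientation-level negative: same position, heading rotated by 180 degrees."""
    x, y, theta = pose
    return (x, y, (theta + math.pi) % (2.0 * math.pi))


rng = random.Random(0)            # fixed seed for reproducibility
anchor = (2.0, 3.0, 0.5)          # hypothetical ground-truth pose (x, y, theta)
positive = perturbed_positive(anchor, rng)
negatives = [inner_negative(anchor, rng), orientation_negative(anchor)]
```

Cross-floorplan negatives would simply be patches drawn from a different floorplan and are omitted here.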

3. Feature Extraction and Fusion

The feature encoders and inference logic operate as follows:

  • Visual Feature Extraction:

    • The DINO V2 depth encoder is frozen for visual input.
    • A CLS (classification) token representing global depth is concatenated to the vertical column-token aggregation for the image.
  • Floorplan Feature Extraction:
    • ResNet-18 is trained using $\mathcal{L}_{CL}$ for geometric layout encoding.
  • Candidate Matching and Scoring:

    • For each candidate $S_i$, vision feature $f$ and floorplan feature $g_i$ are compared:

    $$s_i = \exp(f \cdot g_i/\tau)$$

    The collection $\{s_i\}$ is normalized across the $X$ candidates using SoftMax to yield a disambiguation probability $P_c(S_i)$.

  • Final Pose Selection:

$$P(S_i) = (1-w) \cdot P_d(S_i) + w \cdot P_c(S_i)$$

where $w \in [0, 1]$ (set to 0.5). The candidate with the highest $P(S_i)$ is selected as the localization output.
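Candidate scoring and the final fusion can be sketched as follows. The sketch reads "normalized using SoftMax" as a softmax over the logits $f \cdot g_i / \tau$, which equals $s_i / \sum_j s_j$; that reading, the function names, the temperature, and the toy numbers are assumptions rather than details confirmed by the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_select(candidates, p_d, sims, tau, w=0.5):
    """Fuse depth-based and contrastive probabilities and pick the best pose.

    `sims` holds the cosine similarities f . g_i for each candidate;
    softmax(sims / tau) equals s_i / sum_j s_j with s_i = exp(f . g_i / tau).
    """
    p_c = softmax([s / tau for s in sims])
    fused = [(1.0 - w) * d + w * c for d, c in zip(p_d, p_c)]
    best = max(range(len(candidates)), key=fused.__getitem__)
    return candidates[best], fused


# Two mirrored candidates: the depth map slightly prefers "A",
# the contrastive score strongly prefers "B".
best, fused = fuse_and_select(["A", "B"], p_d=[0.6, 0.4], sims=[0.1, 0.9], tau=0.1)
```

With `w = 0` the selection falls back to the depth-based ranking alone, which matches the role of $w$ as a trade-off between the two stages.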

4. Empirical Evaluation

DisCo-FLoc is evaluated on Gibson(f), Gibson(g), and Structured3D(full), with metrics including recall at 0.1 m, 0.5 m, and 1 m (R@0.1 m, R@0.5 m, R@1 m), and recall at 1 m with heading error within $\pm 30^\circ$ (R@1 m 30°).

Results on Gibson(f) (recall, %):

Method          R@0.1 m   R@0.5 m   R@1 m   R@1 m 30°
PF-net              0.0       2.0     6.9         1.2
MCL                 1.6       4.9    12.1         8.2
LASER               0.4       6.7    13.0        10.4
F³Loc               4.7      28.6    36.6        35.1
3DP                 5.3      33.2    39.8        38.4
RSK                 8.3      38.5    45.3        43.6
Ours w/o Dis.      12.0      45.8    50.6        49.2
DisCo-FLoc         13.1      50.9    56.7        55.4

  • On Structured3D(full), DisCo-FLoc achieves 10.0/59.0/67.0/66.0 at R@0.1 m/0.5 m/1 m/1 m 30°, surpassing prior methods, including those using semantic supervision.
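The recall metrics can be computed as in the following sketch, which counts a prediction as correct when its position error is within the distance threshold and, for the 30° variant, its heading error is also within the angular threshold. Treating thresholds as inclusive and headings as radians are assumptions, and the function name and toy poses are hypothetical.

```python
import math


def recall_at(preds, gts, max_dist, max_angle_deg=None):
    """Percentage of predicted poses within max_dist (m) of ground truth.

    If max_angle_deg is given, the wrapped heading error must also be within it.
    Poses are (x, y, theta) with theta in radians.
    """
    hits = 0
    for (px, py, pt), (gx, gy, gth) in zip(preds, gts):
        ok = math.hypot(px - gx, py - gy) <= max_dist
        if ok and max_angle_deg is not None:
            # Wrap the heading difference into [-pi, pi) before thresholding.
            diff = abs((pt - gth + math.pi) % (2.0 * math.pi) - math.pi)
            ok = math.degrees(diff) <= max_angle_deg
        hits += ok
    return 100.0 * hits / len(preds)


# Three toy predictions against the same ground-truth pose.
preds = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.8, 0.0, 3.0)]
gts = [(0.0, 0.0, 0.0)] * 3
r_01 = recall_at(preds, gts, 0.1)
r_1_30 = recall_at(preds, gts, 1.0, max_angle_deg=30.0)
```

The third toy prediction is within 1 m but faces nearly the opposite way, so it counts toward R@1 m and not toward R@1 m 30°, which is the kind of error the orientation-level contrast targets.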

Ablation studies reveal that:

  • Using both position- and orientation-level negatives yields maximal recall.
  • Removing the CLS token notably reduces recall at fine localization ([email protected] m).
  • Positional ($-2.3\%$ at [email protected] m) and angular ($-2.3\%$ at R@1 m) perturbations are important for robust training.
  • Best performance is observed with $X \approx 100$ candidates, $w \approx 0.5$, and 5 m $\times$ 5 m patches.

5. Analysis and Implications

The two-level contrastive scheme directly addresses the ambiguous pose hypotheses arising from architectural repetition:

  • Position-level negatives drive the system to distinguish spatially nearby but geometrically similar regions, minimizing self-similarity confounds.
  • Orientation-level negatives enforce heading sensitivity, critical in symmetric environments.
  • Orientation-level contrast and angular perturbation offer slightly larger accuracy gains than positional mechanisms.

Importantly, the DINO V2 depth encoder transfers monocular depth estimation skill to the FLoc context without fine-tuning, and no semantic annotations (e.g., room type, doors) are required. Even so, DisCo-FLoc achieves better localization than state-of-the-art semantic-based approaches.

6. Limitations and Future Directions

DisCo-FLoc’s sequential ray regression and contrastive reranking architecture introduces some redundancy in feature extraction between the two stages. A plausible implication is that a unified, end-to-end framework could eliminate this inefficiency and further improve localization accuracy.

Potential avenues for extension include jointly integrating depth regression and contrastive disambiguation within a single module, streamlining computation and potentially exploiting cross-modality cues more effectively.


The DisCo-FLoc framework establishes a robust, annotation-free approach for monocular visual floorplan localization, coupling depth-aware geometry prediction with dual-level contrastive disambiguation. Its mathematical formulations, dual-stage design, and empirical performance on challenging benchmarks demonstrate significant improvements in both localization precision and orientation disambiguation (Meng et al., 5 Jan 2026).
