DisCo-FLoc: Visual Floorplan Localization

Updated 12 January 2026

DisCo-FLoc is a monocular visual floorplan localization system that predicts laser-style depth rays and leverages dual-level contrastive learning to disambiguate symmetric layouts.
The framework combines a ray regression predictor with a ResNet-18 based encoder to perform robust geometry and orientation-level discrimination without semantic supervision.
Evaluations on Gibson and Structured3D show higher recall and pose accuracy than prior state-of-the-art methods; on Gibson(f), for example, R@1 m rises from 45.3 (RSK) to 56.7.

DisCo-FLoc is a monocular visual floorplan localization framework that introduces dual-level visual–geometric contrastive learning for robust, depth-aware localization, disambiguating repetitive or symmetric layouts without requiring semantic supervision. It addresses key challenges in floorplan localization (FLoc)—notably structural ambiguity and reliance on expensive annotation—by combining a specialized ray regression predictor with a fine-grained, multi-level contrastive disambiguation mechanism (Meng et al., 5 Jan 2026).

1. Pipeline Overview

DisCo-FLoc operates as a two-stage pipeline for single RGB image-based localization within a known floorplan geometry:

Stage 1: Ray Regression Predictor (RRP)
- Given an input image $\mathcal{I}$, the RRP predicts $N$ discrete “laser-style” distances $\vec{d} = (d_1, \dots, d_N)$ along fixed orientations to the nearest structural boundary.
- The backbone is a frozen DINO V2 depth encoder. Features are reduced to dimension $D$ via $1 \times 1$ convolution, and depth-ray queries $q_i \in \mathbb{R}^D$ are aggregated vertically per image column.
- A multi-head attention layer predicts, for each ray $i$, a probability distribution $P_{i,1\dots D}$ over $D$ depth bins.
- Depth regression: For each ray,
$d_i = \sum_{k=1}^D P_{i,k} d_k, \qquad d_k = \bigl(d_{\min}^\gamma + \tfrac{k}{D}(d_{\max}^\gamma - d_{\min}^\gamma)\bigr)^{1/\gamma} \tag{1}$
- Training loss: $L_1$ plus a cosine “shape” term:
$\mathcal{L}_{ray} = \| \vec{d} - \vec{d}^* \|_1 + \frac{\vec{d}^\top \vec{d}^*}{\max\{\|\vec{d}\|_2 \|\vec{d}^*\|_2,\,\epsilon \}} \tag{2}$
- Depth-Aware Floorplan Probabilistic Map (DAFPM): At inference, candidate $(x, y, \theta)$ poses are tiled, ground-truth rays are rendered per pose, and compared to the predictions to yield $P_d(x, y, \theta)$. The $X$ most likely candidates progress to the next stage.
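The depth regression of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: `num_bins` plays the role of the bin count $D$, and the values chosen for `d_min`, `d_max`, `gamma`, and the random stand-in for the attention output are all assumptions.

```python
import numpy as np

def depth_bin_centers(num_bins, d_min, d_max, gamma):
    """Bin depths d_k from Eq. (1), for k = 1..num_bins."""
    k = np.arange(1, num_bins + 1)
    lo, hi = d_min ** gamma, d_max ** gamma
    return (lo + k / num_bins * (hi - lo)) ** (1.0 / gamma)

def expected_ray_depths(probs, bins):
    """Expectation d_i = sum_k P[i, k] * d_k for every ray i."""
    return probs @ bins

# Hypothetical settings: 5 bins between 0.1 m and 15 m, gamma = 0.5.
bins = depth_bin_centers(num_bins=5, d_min=0.1, d_max=15.0, gamma=0.5)

# Random logits stand in for the attention layer's per-ray output (4 rays).
logits = np.random.default_rng(0).normal(size=(4, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

depths = expected_ray_depths(probs, bins)  # one depth per ray, shape (4,)
```

With `gamma` below 1 the bins are spaced more densely at short range, and the last bin always equals `d_max`.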
Stage 2: Visual–Geometric Contrastive Disambiguation
- Each candidate $S_i = (x_i, y_i, \theta_i)$ is associated with a 5 m $\times$ 5 m floorplan patch, aligned to $\theta_i$.
- A contrastive learning mechanism is used to score fine-grained consistency between visual depth features and geometric layout in the patch.
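Extracting an orientation-aligned patch for a candidate might look like the following sketch. The grid resolution `res`, the row/column convention, nearest-neighbour sampling, and the function name `crop_aligned_patch` are assumptions made for illustration; the source only specifies the 5 m $\times$ 5 m patch size and the alignment to $\theta_i$.

```python
import numpy as np

def crop_aligned_patch(floorplan, x, y, theta, res=0.1, size_m=5.0, fill=0):
    """Sample a size_m x size_m patch centred on (x, y) whose u axis follows heading theta.

    floorplan: 2D array indexed [row, col]; cell (row, col) covers
    x in [col*res, (col+1)*res) and y in [row*res, (row+1)*res) (assumed convention).
    """
    n = int(round(size_m / res))
    # Patch-frame offsets of the pixel centres, in metres.
    offs = (np.arange(n) - (n - 1) / 2.0) * res
    u, v = np.meshgrid(offs, offs)  # u varies along patch columns, v along patch rows
    # Rotate patch-frame offsets into the floorplan frame.
    wx = x + u * np.cos(theta) - v * np.sin(theta)
    wy = y + u * np.sin(theta) + v * np.cos(theta)
    cols = np.floor(wx / res).astype(int)
    rows = np.floor(wy / res).astype(int)
    inside = (rows >= 0) & (rows < floorplan.shape[0]) & (cols >= 0) & (cols < floorplan.shape[1])
    patch = np.full((n, n), fill, dtype=floorplan.dtype)
    patch[inside] = floorplan[rows[inside], cols[inside]]
    return patch

# Toy 10 m x 10 m floorplan with a single wall along x = 6 m.
fp = np.zeros((100, 100), dtype=np.uint8)
fp[:, 60] = 1
patch_0 = crop_aligned_patch(fp, 5.0, 5.0, 0.0)
patch_pi = crop_aligned_patch(fp, 5.0, 5.0, np.pi)  # same position, opposite heading
```

In this toy case the two patches are 180° rotations of each other, so the wall appears on opposite sides; this is the kind of difference the orientation-aware second stage can exploit.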

This sequence enables DisCo-FLoc to first generate plausible localization candidates using a geometric prior and subsequently resolve ambiguities arising from floorplan regularities via contrastive matching.
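The candidate-generation step of the first stage can be illustrated with a toy occupancy grid. The sketch below is a simplification under stated assumptions: the ray marcher, the mean-$L_1$ error, the softmax with a `temperature` parameter, and the ray angles are all hypothetical choices, since the source does not state how the comparison is converted into $P_d$.

```python
import numpy as np

def cast_rays(occ, x, y, theta, angles, res=0.1, d_max=10.0, step=0.05):
    """Distance along each ray from (x, y) to the first occupied cell (simple ray marching)."""
    dists = np.full(len(angles), d_max)
    for i, a in enumerate(angles):
        dx, dy = np.cos(theta + a), np.sin(theta + a)
        for t in np.arange(step, d_max, step):
            r = int(np.floor((y + t * dy) / res))
            c = int(np.floor((x + t * dx) / res))
            # Leaving the grid counts as a hit, as does entering an occupied cell.
            if not (0 <= r < occ.shape[0] and 0 <= c < occ.shape[1]) or occ[r, c]:
                dists[i] = t
                break
    return dists

def top_candidates(occ, pred, poses, angles, top_x, temperature=0.5):
    """Score poses by L1 agreement between rendered and predicted rays; keep the best top_x."""
    errors = np.array([np.abs(cast_rays(occ, *p, angles) - pred).mean() for p in poses])
    logits = -errors / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # probability over all tiled poses
    order = np.argsort(-probs)[:top_x]
    return [poses[i] for i in order], probs[order]

# Toy 6 m x 4 m rectangular room with walls on the border.
occ = np.zeros((40, 60), dtype=np.uint8)
occ[[0, -1], :] = 1
occ[:, [0, -1]] = 1

angles = np.deg2rad(np.linspace(-40, 40, 9))  # hypothetical ray directions
poses = [(x, y, t) for x in (1.0, 2.0, 3.0, 4.0, 5.0) for y in (1.0, 2.0, 3.0)
         for t in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)]
true_pose = poses[0]
pred = cast_rays(occ, *true_pose, angles)  # stand-in for the RRP prediction

cands, probs = top_candidates(occ, pred, poses, angles, top_x=5)
```

In a rectangular room the pose mirrored through the room centre with the heading flipped by 180° renders nearly identical rays, which is the kind of ambiguity the second stage is meant to resolve.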

2. Dual-Level Visual–Geometric Contrastive Learning

To discriminate between visually and geometrically similar locations, DisCo-FLoc pre-trains a ResNet-18 floorplan encoder with dual contrastive constraints while freezing the visual depth backbone.

Position-Level Contrast:
- Positive pairs: image $f$ and its corresponding floorplan patch $g^+$ at $(x^*, y^*, \theta^*)$, plus small spatial ($\pm 0.5$ m) and orientation ($\pm 0.26$ rad) perturbations.
- Negative pairs are drawn from:
  - Inner-floorplan: patches 1.5–3 m away within the same layout.
  - Cross-floorplan: patches from other floorplans.
Orientation-Level Contrast:
- Negative pairs: same $(x^*, y^*)$ but heading $\theta^* + 180^\circ$.
PointInfoNCE Loss:
- With negative set $M = M_p \cup M_o$ (for position and orientation levels):
$Z_p = \sum_{m\in M_p}\exp(f \cdot g_m/\tau), \quad Z_o = \sum_{m \in M_o}\exp(f \cdot g_m/\tau)$

$\mathcal{L}_{CL} = -\sum_{(f,g^+)} \log \frac{\exp(f \cdot g^+ / \tau)}{Z_p + Z_o} \tag{3}$

where $\tau$ is a temperature and “$\cdot$” denotes cosine similarity.
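For a single anchor, Eq. (3) can be written out as the following NumPy sketch. It follows the equation exactly as stated, so the denominator contains only the negatives $Z_p + Z_o$; the temperature value `tau=0.07` and the function names are assumptions, not values from the source.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale vectors to unit length so that dot products become cosine similarities."""
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

def dual_level_info_nce(f, g_pos, g_neg_pos, g_neg_ori, tau=0.07):
    """Eq. (3) for one (f, g+) pair.

    f: (D,) image feature; g_pos: (D,) positive patch feature;
    g_neg_pos: (Mp, D) position-level negatives;
    g_neg_ori: (Mo, D) orientation-level negatives.
    """
    f, g_pos = l2_normalize(f), l2_normalize(g_pos)
    z_p = np.exp(l2_normalize(g_neg_pos) @ f / tau).sum()
    z_o = np.exp(l2_normalize(g_neg_ori) @ f / tau).sum()
    return -np.log(np.exp(f @ g_pos / tau) / (z_p + z_o))

# Toy features: the positive matches f, position negatives are orthogonal,
# and the orientation negative points the opposite way.
e = np.eye(4)
loss_good = dual_level_info_nce(e[0], e[0], e[1:3], -e[:1])
# Swapping the positive and the orientation negative should be penalized heavily.
loss_bad = dual_level_info_nce(e[0], -e[0], e[1:3], e[:1])
```

Because the positive term is absent from the denominator as written, the loss can become negative for well-separated pairs; only its relative ordering matters for training.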
Training Objective:

$\mathcal{L} = \lambda_{ray}\,\mathcal{L}_{ray} + \lambda_{pos}\,\mathcal{L}_{CL}^{(p)} + \lambda_{ori}\,\mathcal{L}_{CL}^{(o)} \tag{4}$

In practice, the position- and orientation-level terms are merged into the single $\mathcal{L}_{CL}$ of Eq. (3), whose denominator already includes both negative types.

The dual-level scheme addresses both translational and rotational ambiguities, crucial for distinguishing between symmetric or repetitive structures.
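The pair construction described in this section could be sketched as follows. Uniform sampling, per-axis position jitter, keeping the ground-truth heading for inner-floorplan negatives, and the name `sample_pairs` are assumptions; cross-floorplan negatives are omitted because they require a second floorplan, and free-space checks are left out for brevity.

```python
import numpy as np

def sample_pairs(pose, rng, n_inner=4):
    """Draw one perturbed positive, inner-floorplan negatives, and the orientation negative."""
    x, y, theta = pose
    # Positive: jitter within +/-0.5 m (per axis) and +/-0.26 rad of the ground-truth pose.
    positive = (x + rng.uniform(-0.5, 0.5),
                y + rng.uniform(-0.5, 0.5),
                theta + rng.uniform(-0.26, 0.26))
    # Inner-floorplan negatives: 1.5-3 m away in a random direction, same heading.
    radius = rng.uniform(1.5, 3.0, size=n_inner)
    bearing = rng.uniform(0.0, 2 * np.pi, size=n_inner)
    inner = [(x + r * np.cos(b), y + r * np.sin(b), theta)
             for r, b in zip(radius, bearing)]
    # Orientation-level negative: same position, heading flipped by 180 degrees.
    flipped = (x, y, (theta + np.pi) % (2 * np.pi))
    return positive, inner, flipped

rng = np.random.default_rng(0)
positive, inner, flipped = sample_pairs((2.0, 3.0, 0.5), rng)
```

Each returned pose would then be turned into a floorplan patch and encoded before entering the contrastive loss.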

3. Feature Extraction and Fusion

The feature encoders and inference logic operate as follows:

Visual Feature Extraction:
- The DINO V2 depth encoder is frozen for visual input.
- A CLS (classification) token representing global depth is concatenated to the vertical column-token aggregation for the image.
Floorplan Feature Extraction:
- ResNet-18 is trained using $\mathcal{L}_{CL}$ for geometric layout encoding.
Candidate Matching and Scoring:
- For each candidate $S_i$, vision feature $f$ and floorplan feature $g_i$ are compared:
$s_i = \exp(f \cdot g_i/\tau)$

The collection $\{s_i\}$ is normalized across the $X$ candidates using SoftMax to yield a disambiguation probability $P_c(S_i)$.
Final Pose Selection:

$P(S_i) = (1-w) \cdot P_d(S_i) + w \cdot P_c(S_i)$

where $w \in [0, 1]$ (set to 0.5). The candidate with the highest $P(S_i)$ is selected as the localization output.
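The scoring and fusion steps above amount to a softmax over cosine similarities followed by a convex combination. A minimal sketch, assuming that $P_d$ is already normalized over the $X$ candidates and using a hypothetical temperature `tau=0.07`:

```python
import numpy as np

def fuse_and_select(p_d, f, g, tau=0.07, w=0.5):
    """Combine the depth-based probability P_d with the contrastive probability P_c.

    p_d: (X,) depth-based probabilities; f: (D,) image feature; g: (X, D) patch features.
    """
    f = f / np.linalg.norm(f)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    logits = g @ f / tau               # cosine similarity divided by the temperature
    s = np.exp(logits - logits.max())  # s_i up to a constant factor that cancels below
    p_c = s / s.sum()                  # SoftMax over the X candidates
    p = (1.0 - w) * p_d + w * p_c
    return int(np.argmax(p)), p

# Two candidates tie on depth alone; the contrastive score breaks the tie.
p_d = np.array([0.45, 0.45, 0.10])
f = np.array([1.0, 0.0])
g = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])
best, p = fuse_and_select(p_d, f, g)
```

Here the first two candidates are indistinguishable from ray geometry, and the candidate whose patch feature aligns with the image feature is selected.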

4. Empirical Evaluation

DisCo-FLoc is evaluated on Gibson(f), Gibson(g), and Structured3D(full), with metrics including R@0.1 m, R@0.5 m, R@1 m, and R@1 m with heading error within $\pm 30^\circ$.

Recall (%) on Gibson(f):

Method	R@0.1 m	R@0.5 m	R@1 m	R@1 m 30°
PF-net	0.0	2.0	6.9	1.2
MCL	1.6	4.9	12.1	8.2
LASER	0.4	6.7	13.0	10.4
F³Loc	4.7	28.6	36.6	35.1
3DP	5.3	33.2	39.8	38.4
RSK	8.3	38.5	45.3	43.6
Ours w/o Dis.	12.0	45.8	50.6	49.2
DisCo-FLoc	13.1	50.9	56.7	55.4

On Structured3D(full), DisCo-FLoc achieves 10.0/59.0/67.0/66.0 at R@0.1 m/0.5 m/1 m/1 m 30°, surpassing prior methods, including those using semantic supervision.
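The recall metrics reported here can be computed as in the sketch below. The strict inequalities, the dictionary keys, and the function name `recall_metrics` are assumptions made for illustration; the toy poses are not results from the paper.

```python
import numpy as np

def recall_metrics(pred, gt):
    """pred, gt: (N, 3) arrays of (x, y, theta) in metres and radians. Returns recalls in %."""
    dist = np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1)
    # Wrap the heading error into [-pi, pi) before taking its magnitude.
    dtheta = np.abs((pred[:, 2] - gt[:, 2] + np.pi) % (2 * np.pi) - np.pi)
    out = {f"R@{t} m": 100.0 * np.mean(dist < t) for t in (0.1, 0.5, 1.0)}
    out["R@1 m 30°"] = 100.0 * np.mean((dist < 1.0) & (dtheta < np.deg2rad(30)))
    return out

# Four toy predictions against ground truth at the origin.
gt = np.zeros((4, 3))
pred = np.array([
    [0.05, 0.0, 0.0],              # 5 cm off, correct heading
    [0.30, 0.0, np.deg2rad(10)],   # 30 cm off, 10 degrees off
    [0.00, 0.8, np.deg2rad(170)],  # 80 cm off, nearly opposite heading
    [2.00, 0.0, 0.0],              # 2 m off
])
metrics = recall_metrics(pred, gt)
```

The third prediction counts toward R@1 m but not toward R@1 m 30°, which is why the orientation-constrained metric is sensitive to 180° flips.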

Ablation studies reveal that:

- Using both position- and orientation-level negatives yields maximal recall.
- Removing the CLS token notably reduces recall at fine localization (R@0.1 m).
- Positional ($-2.3\%$ at [email protected] m) and angular ($-2.3\%$ at R@1 m) perturbations are important for robust training.
- Best performance is observed with $X \approx 100$ candidates, $w \approx 0.5$, and 5 m $\times$ 5 m patches.

5. Analysis and Implications

The two-level contrastive scheme directly addresses the ambiguous pose hypotheses arising from architectural repetition:

- Position-level negatives drive the system to distinguish spatially nearby but geometrically similar regions, minimizing self-similarity confounds.
- Orientation-level negatives enforce heading sensitivity, critical in symmetric environments.
- Orientation-level contrast and angular perturbation offer slightly larger accuracy gains than positional mechanisms.

Importantly, the frozen DINO V2 depth encoder transfers its monocular depth estimation ability to the FLoc setting without fine-tuning, and no semantic annotations (e.g., room types, doors) are required. Even so, DisCo-FLoc localizes more accurately than state-of-the-art semantic-based approaches.

6. Limitations and Future Directions

DisCo-FLoc’s sequential ray regression and contrastive reranking architecture introduces some redundancy in feature extraction between the two stages. A plausible implication is that a unified, end-to-end framework could eliminate this inefficiency and further improve localization accuracy.

Potential avenues for extension include jointly integrating depth regression and contrastive disambiguation within a single module, streamlining computation and potentially exploiting cross-modality cues more effectively.

The DisCo-FLoc framework establishes a robust, annotation-free approach for monocular visual floorplan localization, coupling depth-aware geometry prediction with dual-level contrastive disambiguation. Its mathematical formulations, dual-stage design, and empirical performance on challenging benchmarks demonstrate significant improvements in both localization precision and orientation disambiguation (Meng et al., 5 Jan 2026).

References (1)

Meng et al. (2026). DisCo-FLoc: Using Dual-Level Visual-Geometric Contrasts to Disambiguate Depth-Aware Visual Floorplan Localization.
