DisCo-FLoc: Visual Floorplan Localization
- DisCo-FLoc is a monocular visual floorplan localization system that predicts laser-style depth rays and leverages dual-level contrastive learning to disambiguate symmetric layouts.
- The framework combines a ray regression predictor with a ResNet-18-based floorplan encoder to discriminate between candidate poses at both the position and orientation levels, without semantic supervision.
- Empirical evaluations on benchmarks such as Gibson and Structured3D demonstrate significant improvements in recall and pose accuracy over state-of-the-art methods.
DisCo-FLoc is a monocular visual floorplan localization framework that introduces dual-level visual–geometric contrastive learning for robust, depth-aware localization, disambiguating repetitive or symmetric layouts without requiring semantic supervision. It addresses key challenges in floorplan localization (FLoc)—notably structural ambiguity and reliance on expensive annotation—by combining a specialized ray regression predictor with a fine-grained, multi-level contrastive disambiguation mechanism (Meng et al., 5 Jan 2026).
1. Pipeline Overview
DisCo-FLoc operates as a two-stage pipeline for single RGB image-based localization within a known floorplan geometry:
- Stage 1: Ray Regression Predictor (RRP)
  - Given an input image, the RRP predicts discrete “laser-style” distances along a fixed set of orientations to the nearest structural boundary.
  - The backbone is a frozen DINO V2 depth encoder. Its features are reduced in dimension by a convolution, and depth-ray queries are aggregated vertically per image column.
  - A multi-head attention layer predicts, for each ray, a probability distribution over depth bins.
  - Depth regression: each ray's depth is regressed from its predicted distribution over the bins.
  - Training loss: a ray-depth regression term plus a cosine “shape” term.
  - Depth-Aware Floorplan Probabilistic Map (DAFPM): at inference, candidate poses are tiled over the floorplan, ground-truth rays are rendered for each pose, and these are compared with the predicted rays to yield a probability for every candidate pose. The most likely candidates progress to the next stage.
- Stage 2: Visual–Geometric Contrastive Disambiguation
  - Each candidate is associated with a 5 m × 5 m floorplan patch, aligned to the candidate pose.
  - A contrastive learning mechanism scores the fine-grained consistency between the visual depth features and the geometric layout in the patch.
This sequence enables DisCo-FLoc to first generate plausible localization candidates using a geometric prior and subsequently resolve ambiguities arising from floorplan regularities via contrastive matching.
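The first stage can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the expectation over bin centers, the mean-absolute-error comparison, the softmax over poses, and the names `expected_depth`, `pose_probabilities`, `top_k`, and `scale` are all assumptions chosen for illustration.

```python
import numpy as np

def expected_depth(bin_probs, bin_centers):
    """Per-ray depth as the probability-weighted mean of the bin centers.

    Assumption: the source only says depth is regressed from a distribution
    over bins; the expectation used here is one common choice.
    """
    return bin_probs @ bin_centers  # (n_rays, n_bins) @ (n_bins,) -> (n_rays,)

def pose_probabilities(pred_rays, rendered_rays, scale=1.0):
    """Turn ray mismatch into a probability over candidate poses.

    Assumption: mean absolute ray error per pose, passed through a softmax.
    rendered_rays has shape (n_poses, n_rays).
    """
    err = np.abs(rendered_rays - pred_rays[None, :]).mean(axis=1)
    logits = -err / scale
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def top_k(probs, k):
    """Indices of the k most likely poses, best first."""
    return np.argsort(-probs)[:k]

# Toy data: 3 rays, 4 depth bins, 3 candidate poses.
bin_centers = np.array([1.0, 2.0, 3.0, 4.0])
bin_probs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.5, 0.5, 0.0],
                      [0.0, 0.0, 0.0, 1.0]])
pred = expected_depth(bin_probs, bin_centers)
rendered = np.array([[1.0, 2.5, 4.0],   # matches the prediction exactly
                     [4.0, 4.0, 4.0],   # poor match
                     [1.0, 2.0, 4.0]])  # close match
probs = pose_probabilities(pred, rendered)
candidates = top_k(probs, k=2)
```

In this toy case the exact match and the close match are the two poses passed on to the second stage, while the poor match is discarded.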
2. Dual-Level Visual–Geometric Contrastive Learning
To discriminate between visually and geometrically similar locations, DisCo-FLoc pre-trains a ResNet-18 floorplan encoder with dual contrastive constraints while freezing the visual depth backbone.
- Position-Level Contrast:
  - Positive pairs: an image and its corresponding floorplan patch at the ground-truth pose, plus small spatial and orientation perturbations of that pose (bounded in metres and radians, respectively).
  - Negative pairs are drawn from:
    - Inner-floorplan: patches 1.5–3 m away within the same layout.
    - Cross-floorplan: patches from other floorplans.
- Orientation-Level Contrast:
  - Negative pairs: the same position but a different heading.
- PointInfoNCE Loss:
  - Each level has its own negative set (position-level or orientation-level). The loss is an InfoNCE-style objective with a temperature parameter, in which the similarity between features is the cosine similarity.
Training Objective:
The position-level and orientation-level losses are combined. In practice, a single PointInfoNCE loss is used, because its negative set includes both negative types.
The dual-level scheme addresses both translational and rotational ambiguities, crucial for distinguishing between symmetric or repetitive structures.
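The contrastive objective can be sketched as a standard InfoNCE loss over one positive and a pooled negative set. This is a generic formulation rather than the paper's exact PointInfoNCE definition; the function names, the toy feature vectors, and the temperature value `tau=0.07` are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE for one anchor: negative log-softmax of the positive
    among the positive and all negatives, using cosine similarity / tau."""
    a = l2_normalize(anchor)
    cands = l2_normalize(np.vstack([positive[None, :], negatives]))
    logits = cands @ a / tau
    logits = logits - logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy features (hypothetical): one visual anchor, one matching patch,
# position-level negatives and orientation-level negatives.
visual = np.array([1.0, 0.0])
patch_pos = np.array([0.9, 0.1])
position_negs = np.array([[0.0, 1.0], [-1.0, 0.0]])
orientation_negs = np.array([[0.5, 0.5]])

# A single loss whose negative set pools both negative types.
all_negs = np.vstack([position_negs, orientation_negs])
loss_dual = info_nce(visual, patch_pos, all_negs)
loss_position_only = info_nce(visual, patch_pos, position_negs)
# Swapping in a mismatched patch as the "positive" should cost more.
loss_mismatch = info_nce(visual, position_negs[0], all_negs)
```

Pooling the two negative types only adds terms to the softmax denominator, so the pooled loss is never smaller than the position-only loss for the same pair.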
3. Feature Extraction and Fusion
The feature encoders and inference logic operate as follows:
- Visual Feature Extraction:
  - The DINO V2 depth encoder is kept frozen for the visual input.
  - A CLS (classification) token representing global depth is concatenated with the vertically aggregated column tokens of the image.
- Floorplan Feature Extraction:
  - ResNet-18 is trained with the dual-level contrastive objective to encode the geometric layout.
- Candidate Matching and Scoring:
  - For each candidate, the visual feature and the floorplan feature of its patch are compared to give a similarity score. The scores are normalized across the candidates with a softmax to yield a disambiguation probability.
- Final Pose Selection:
  - The DAFPM probability and the disambiguation probability are combined using a weighting coefficient (set to 0.5). The candidate with the highest combined score is selected as the localization output.
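The scoring and selection steps can be sketched as follows. The convex combination of the two probabilities is an assumption, since the text here only states that a weight of 0.5 is used; the names `cosine_scores`, `select_pose`, and `lam`, and the toy features, are hypothetical.

```python
import numpy as np

def softmax(x):
    """Normalize scores into probabilities that sum to one."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def cosine_scores(vis_feat, plan_feats):
    """Cosine similarity between one visual feature and K patch features."""
    v = vis_feat / np.linalg.norm(vis_feat)
    f = plan_feats / np.linalg.norm(plan_feats, axis=1, keepdims=True)
    return f @ v

def select_pose(p_dafpm, vis_feat, plan_feats, lam=0.5):
    """Fuse stage-one and stage-two probabilities and pick the best candidate.

    Assumption: fused = (1 - lam) * p_dafpm + lam * p_dis.
    """
    p_dis = softmax(cosine_scores(vis_feat, plan_feats))
    p_geo = p_dafpm / p_dafpm.sum()  # renormalize over the K candidates
    fused = (1.0 - lam) * p_geo + lam * p_dis
    return int(np.argmax(fused)), fused

# Two symmetric candidates tie after stage one; the patch features break the tie.
p_dafpm = np.array([0.4, 0.4, 0.2])
vis_feat = np.array([1.0, 0.0, 0.0])
plan_feats = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.1, 0.0],
                       [0.5, 0.5, 0.0]])
best, fused = select_pose(p_dafpm, vis_feat, plan_feats)
```

Here the first two candidates are indistinguishable from the ray comparison alone, and the contrastive score favors the second one because its patch feature aligns best with the visual feature.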
4. Empirical Evaluation
DisCo-FLoc is evaluated on Gibson(f), Gibson(g), and Structured3D(full), with recall metrics R@0.1 m, R@0.5 m, R@1 m, and R@1 m within 30°. Results on Gibson(f):
| Method | R@0.1 m | R@0.5 m | R@1 m | R@1 m 30° |
|---|---|---|---|---|
| PF-net | 0.0 | 2.0 | 6.9 | 1.2 |
| MCL | 1.6 | 4.9 | 12.1 | 8.2 |
| LASER | 0.4 | 6.7 | 13.0 | 10.4 |
| F³Loc | 4.7 | 28.6 | 36.6 | 35.1 |
| 3DP | 5.3 | 33.2 | 39.8 | 38.4 |
| RSK | 8.3 | 38.5 | 45.3 | 43.6 |
| Ours w/o Dis. | 12.0 | 45.8 | 50.6 | 49.2 |
| DisCo-FLoc | 13.1 | 50.9 | 56.7 | 55.4 |
- On Structured3D(full), DisCo-FLoc achieves 10.0/59.0/67.0/66.0 at R@0.1 m / R@0.5 m / R@1 m / R@1 m 30°, surpassing prior methods, including those that use semantic supervision.
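The recall metrics can be computed as in the sketch below. It assumes recall is the percentage of test images whose predicted position lies within the distance threshold of the ground truth, optionally also requiring the heading error to be within the angular threshold; the inclusive comparison and the function names are assumptions.

```python
import numpy as np

def angle_diff_deg(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    return np.abs((a - b + 180.0) % 360.0 - 180.0)

def recall_at(pred_xy, gt_xy, dist_thresh,
              pred_yaw=None, gt_yaw=None, ang_thresh=None):
    """Percentage of samples within dist_thresh metres (and, if given,
    within ang_thresh degrees of heading error)."""
    ok = np.linalg.norm(pred_xy - gt_xy, axis=1) <= dist_thresh
    if ang_thresh is not None:
        ok = ok & (angle_diff_deg(pred_yaw, gt_yaw) <= ang_thresh)
    return 100.0 * ok.mean()

# Four toy predictions with position errors of 0, 0.3, 0.8 and 2 m.
pred_xy = np.array([[0.0, 0.0], [0.3, 0.0], [0.8, 0.0], [2.0, 0.0]])
gt_xy = np.zeros((4, 2))
pred_yaw = np.array([0.0, 10.0, 350.0, 0.0])
gt_yaw = np.array([0.0, 0.0, 40.0, 0.0])

r_01 = recall_at(pred_xy, gt_xy, 0.1)
r_05 = recall_at(pred_xy, gt_xy, 0.5)
r_1 = recall_at(pred_xy, gt_xy, 1.0)
r_1_30 = recall_at(pred_xy, gt_xy, 1.0, pred_yaw, gt_yaw, 30.0)
```

The third sample is within 1 m but has a 50° heading error, so it counts toward R@1 m and not toward R@1 m 30°.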
Ablation studies reveal that:
- Using both position- and orientation-level negatives yields maximal recall.
- Removing the CLS token notably reduces recall at fine localization (R@0.1 m).
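- Positional and angular perturbations are both important for robust training, with their effects reported at a sub-metre recall threshold and at R@1 m, respectively.
- Best performance is observed with a tuned number of candidates and hyperparameter setting, together with 5 m × 5 m patches.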
5. Analysis and Implications
The two-level contrastive scheme directly addresses the ambiguous pose hypotheses arising from architectural repetition:
- Position-level negatives drive the system to distinguish spatially nearby but geometrically similar regions, minimizing self-similarity confounds.
- Orientation-level negatives enforce heading sensitivity, critical in symmetric environments.
- Orientation-level contrast and angular perturbation offer slightly larger accuracy gains than positional mechanisms.
Importantly, the frozen DINO V2 depth encoder transfers its monocular depth estimation ability to the FLoc setting without fine-tuning. No semantic annotations (e.g., room types or doors) are required, yet the method localizes more accurately than state-of-the-art semantic-based approaches.
6. Limitations and Future Directions
DisCo-FLoc’s sequential ray regression and contrastive reranking architecture introduces some redundancy in feature extraction between the two stages. A plausible implication is that a unified, end-to-end framework could eliminate this inefficiency and further improve localization accuracy.
Potential avenues for extension include jointly integrating depth regression and contrastive disambiguation within a single module, streamlining computation and potentially exploiting cross-modality cues more effectively.
The DisCo-FLoc framework establishes a robust, annotation-free approach for monocular visual floorplan localization, coupling depth-aware geometry prediction with dual-level contrastive disambiguation. Its mathematical formulations, dual-stage design, and empirical performance on challenging benchmarks demonstrate significant improvements in both localization precision and orientation disambiguation (Meng et al., 5 Jan 2026).