CGL-Decoder: Prompt-Guided Wireframe Parsing
- The paper introduces a robust neural module that refines wireframe extraction by integrating prompt-conditioned sparse attention with point-line consistency, cutting endpoint mismatches from 12.4% to 7.8%.
- Local feature fusion and windowed multi-head cross-attention exchange spatial cues between line and junction prompts, ensuring coherent and context-aware geometry refinement.
- Empirical evaluations on benchmarks like Wireframe and YorkUrban demonstrate the decoder’s competitive performance, achieving up to 76.8 FPS with improved prediction accuracy.
The Cross-Guidance Line Decoder (CGL-Decoder) is a neural module introduced within the Co-PLNet framework for prompt-guided wireframe parsing. Its design enables collaborative refinement of structured geometry by exchanging spatial cues between lines and junctions using prompt-conditioned, windowed sparse attention. The CGL-Decoder enforces point-line consistency and computational efficiency, resulting in improved accuracy and real-time performance for tasks such as wireframe extraction in images (Wang et al., 26 Jan 2026).
1. Architectural Overview
The CGL-Decoder operates on feature representations and prompt maps derived from preceding feature extraction and Point-Line Prompt Encoder (PLP-Encoder) stages. Its critical architectural elements are:
- Inputs:
- F: Refined feature map from the U-Net backbone.
- P_L: Line prompt map, spatially aligned with F.
- P_J: Junction prompt map, spatially aligned with F.
- Outputs:
- Dense line parameter proposals at subpixel accuracy.
- Refined junction heatmap and position-offset maps.
- Final set of sparse line segments (endpoint pairs) after non-maximum suppression (NMS) and line-of-interest (LOI) verification.
- Internal modules:
- Local Feature Fusion: Concatenates the backbone feature map F with the line and junction prompt maps along the channel axis; two small convolutional branches then produce fused features for the line and junction branches.
- 1×1 Projection: Reduces the channel count from 256 to 32 as a pre-attention embedding.
- Window Partitioning: Feature tensors are partitioned into non-overlapping spatial windows.
- Sparse Multi-Head Cross-Attention: Within each window, the line (or junction) branch attends to the backbone features.
- Gated Residual Fusion: Attended corrections are fused back into the fused features using learnable gating masks.
- Prediction Heads: HAWP/PLNet line and junction heads refine geometry predictions from the gated output features.
- Endpoint Grouping & LOI Scoring: Endpoints are associated with their nearest junctions, deduplicated, and scored for final selection.
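The window-partitioning step in the list above is pure index bookkeeping; a framework-agnostic NumPy sketch follows (the paper's implementation is PyTorch, and the function names here are illustrative, not from the paper):

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping (w, w, C) windows.

    Returns an array of shape (num_windows, w*w, C), the token layout used
    by windowed attention. H and W must be divisible by w (pad beforehand
    otherwise).
    """
    H, W, C = x.shape
    assert H % w == 0 and W % w == 0, "pad the map so H, W divide the window size"
    x = x.reshape(H // w, w, W // w, w, C)   # (nH, w, nW, w, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (nH, nW, w, w, C)
    return x.reshape(-1, w * w, C)           # (nH*nW, w*w, C)

def window_reverse(windows, w, H, W):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)
```

Partition and reverse are exact inverses, so attended windows can be scattered back into the full-resolution map without loss.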
2. Prompt-Conditioned Sparse Attention Mechanism
Within the CGL-Decoder, attention is conditioned on point-line prompts and applied sparsely within local windows. For the line branch, queries come from the fused line features and keys/values from the backbone features; within spatial window $w$:

$$Q_w = F^{line}_w W_Q, \quad K_w = F_w W_K, \quad V_w = F_w W_V$$

with $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ as learned projections ($d = 32$). For each head $h = 1, \dots, N_h$ with head dimension $d_h = d / N_h$:

$$A^h_w = \mathrm{softmax}\!\left(\frac{Q^h_w (K^h_w)^\top}{\sqrt{d_h}}\right) V^h_w$$

The multi-head output is:

$$\mathrm{MHA}_w = \mathrm{Concat}(A^1_w, \dots, A^{N_h}_w)\, W_O$$

Windows are reassembled to yield the attended feature map $\hat{F}^{line}$ (likewise $\hat{F}^{junc}$ for junctions). Gated residual fusion restores the original channel dimension with

$$F_{out} = F^{line} + g^{line} \odot \mathrm{Conv}_{1\times 1}(\hat{F}^{line})$$

where $g^{line}, g^{junc}$ are learned gating masks and $\odot$ is element-wise multiplication.
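A minimal NumPy sketch of the per-window multi-head cross-attention and gated residual described in this section (the actual implementation is PyTorch; the projection weights and function names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_cross_attention(f_prompt, f_backbone, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head cross-attention inside one window.

    f_prompt  : (n, d) prompt-branch tokens (queries), n = w*w window tokens
    f_backbone: (n, d) backbone tokens (keys/values)
    W*        : (d, d) learned projections; Wo mixes the concatenated heads
    """
    n, d = f_prompt.shape
    dh = d // num_heads
    Q = (f_prompt @ Wq).reshape(n, num_heads, dh).transpose(1, 0, 2)   # (H, n, dh)
    K = (f_backbone @ Wk).reshape(n, num_heads, dh).transpose(1, 0, 2)
    V = (f_backbone @ Wv).reshape(n, num_heads, dh).transpose(1, 0, 2)
    att = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))              # (H, n, n)
    out = (att @ V).transpose(1, 0, 2).reshape(n, d)                   # concat heads
    return out @ Wo

def gated_residual(f_fused, f_attended, gate):
    """F_out = F_fused + g * F_attended, with an element-wise learned gate g."""
    return f_fused + gate * f_attended
```

With the gate at zero the block degrades gracefully to the un-attended fused features, which is the usual motivation for gated rather than plain residual fusion.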
3. Integration with the PLP-Encoder and Local-Global Context
The CGL-Decoder leverages coarse spatial prompts from the PLP-Encoder, which generates the line and junction prompt maps through lightweight convolutional heads. These prompt maps, spatially and channel-aligned with the backbone feature map, are concatenated with it and passed through two convolutional branches to yield the fused line and junction features. This direct injection of geometry prompts lets the decoder modulate feature fusion before global context is aggregated by sparse attention, enhancing context-aware refinement.
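The fusion step reduces to channel-wise concatenation followed by per-pixel linear maps, since a 1×1 convolution is a matrix multiply over channels. A sketch with the channel counts reported in Section 6 (256 backbone channels, 16 per prompt map, projected to 32); the branch weights and function names are hypothetical:

```python
import numpy as np

def conv1x1(x, W, b=None):
    """A 1x1 convolution over an (H, W, C_in) map is a per-pixel matmul."""
    y = x @ W                      # (H, W, C_out)
    return y if b is None else y + b

def local_feature_fusion(F, P_line, P_junc, W_line, W_junc):
    """Concatenate backbone features with both prompt maps along channels,
    then run one small branch per task to get fused line/junction features."""
    x = np.concatenate([F, P_line, P_junc], axis=-1)   # (H, W, 256+16+16)
    return conv1x1(x, W_line), conv1x1(x, W_junc)
```

The paper's branches use small convolutions rather than a single 1×1 map; this sketch keeps only the channel arithmetic.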
4. Stepwise Decoding and Refinement Workflow
The complete refinement algorithm is outlined below:
- Backbone & PLP-Encoder: Extract multi-scale features (SuperPoint + U-Net), generate coarse line and junction proposals, and produce the spatial prompt maps.
- Local Fusion: Concatenate the backbone features with both prompt maps and apply two convolutional branches to obtain the fused line and junction features.
- Sparse Attention: Compute queries, keys, and values for each window; perform windowed multi-head attention to derive the attended feature maps; fuse via gated residuals.
- Geometry Prediction: The line head predicts dense line parameters from the fused line features; the junction head predicts heatmaps and position offsets from the fused junction features.
- Post-Processing: Endpoints are snapped to their nearest junction within a distance threshold (15 px) and deduplicated via NMS.
- Line-of-Interest Verification: Features are sampled along each candidate line and scored by a small MLP; the top-k lines are retained.
- Output: The final set of line segments and junctions is produced.
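The snap-and-deduplicate post-processing step can be sketched directly. This is an assumption-laden reconstruction: the 15 px snap radius comes from Section 7, but the exact deduplication criterion is not specified, so endpoint-coincidence matching stands in for the paper's NMS:

```python
import numpy as np

def snap_and_dedup(lines, junctions, snap_px=15.0, dedup_px=1.0):
    """Snap each predicted endpoint to its nearest junction within snap_px,
    then drop duplicate segments whose endpoints coincide within dedup_px.

    lines     : (N, 2, 2) endpoint pairs (x, y)
    junctions : (M, 2) detected junction positions
    """
    snapped = lines.copy()
    for i in range(len(lines)):
        for e in range(2):
            d = np.linalg.norm(junctions - lines[i, e], axis=1)
            j = d.argmin()
            if d[j] <= snap_px:
                snapped[i, e] = junctions[j]
    # dedup: treat (a, b) and (b, a) as the same segment
    kept, seen = [], []
    for seg in snapped:
        key = seg[np.lexsort(seg.T[::-1])]   # canonical endpoint order (x, then y)
        if not any(np.abs(key - k).max() <= dedup_px for k in seen):
            seen.append(key)
            kept.append(seg)
    return np.array(kept)
```

Segments whose two directions both snap to the same junction pair collapse to one candidate, which is then passed to LOI scoring.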
5. Loss Formulations and Optimization Criteria
CGL-Decoder training is end-to-end and employs a composite loss

$$\mathcal{L} = \mathcal{L}_{junc} + \lambda_1 \mathcal{L}_{line} + \lambda_2 \mathcal{L}_{cons} + \lambda_3 \mathcal{L}_{LOI}$$

with weighting coefficients $\lambda_i$, where:
- $\mathcal{L}_{junc}$ supervises junction heatmaps and offsets.
- $\mathcal{L}_{line}$ penalizes line regression error.
- $\mathcal{L}_{cons}$ encourages consistency between dense line proposals and ground-truth geometry (e.g., via point-to-line distance losses as in HAWP).
- $\mathcal{L}_{LOI}$ supervises the LOI MLP's output line confidence scores.
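The four terms can be sketched as follows. The concrete choices here (BCE for heatmaps and LOI scores, L1 for offsets and line parameters, perpendicular point-to-line distance for the consistency term in HAWP's style) are assumptions for illustration, not the paper's exact definitions:

```python
import numpy as np

def l1(a, b):
    return float(np.abs(a - b).mean())

def bce(p, t, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())

def point_to_line_distance(pts, a, b):
    """Perpendicular distance of points (N, 2) to the line through a and b."""
    ab = b - a
    n = np.array([-ab[1], ab[0]]) / np.linalg.norm(ab)   # unit normal
    return np.abs((pts - a) @ n)

def composite_loss(pred, gt, lam=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; lam are hypothetical weights."""
    l_junc = bce(pred["heat"], gt["heat"]) + l1(pred["off"], gt["off"])
    l_line = l1(pred["lines"], gt["lines"])
    l_cons = float(np.mean([point_to_line_distance(p, a, b).mean()
                            for p, (a, b) in zip(pred["endpoints"], gt["segments"])]))
    l_loi = bce(pred["loi"], gt["loi"])
    return lam[0] * l_junc + lam[1] * l_line + lam[2] * l_cons + lam[3] * l_loi
```

All terms are differentiable, so the full decoder (fusion, attention, heads, LOI MLP) trains jointly end-to-end.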
6. Implementation Considerations and Computational Analysis
- All CGL operations are implemented in PyTorch and benchmarked on an RTX 4080 GPU.
- Attention operates over non-overlapping window partitions at a fixed default window size.
- Attention cost per image is $O(H W\, w^2\, d)$ for window size $w \times w$, linear in image area due to sparse windowing.
- The 1×1 convolution reduces feature channels (256 to 32) before attention; prompt maps require only 16 channels.
- Multi-head attention and window partitioning are parallelized for speed, yielding 76.8 FPS at the benchmark input resolution.
- Dense (full) attention increased sAP only marginally while roughly halving the frame rate, motivating the sparse approach.
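The linear-versus-quadratic cost claim follows from counting the score-matrix work alone (Q·Kᵀ multiply-adds, projections ignored); a back-of-the-envelope helper, with names of my own choosing:

```python
def attention_flops(H, W, d, window=None):
    """Rough Q·K^T score-matrix cost for an H x W map with d channels.

    Full attention builds one (HW, HW) score matrix: (HW)^2 * d.
    Windowed attention builds one (w^2, w^2) matrix per window:
    (HW / w^2) * w^4 * d = HW * w^2 * d, i.e. linear in image area.
    """
    n = H * W
    if window is None:
        return n * n * d          # dense: quadratic in area
    return n * window * window * d  # windowed: linear in area
```

Doubling image side length quadruples windowed cost but multiplies dense cost by sixteen, which matches the FPS gap the paper reports between the sparse and dense variants.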
7. Empirical Performance and Ablation Insights
On standard wireframe benchmarks:
| Dataset | sAP⁵ | sAP¹⁰ | sAP¹⁵ | FPS | Endpoint Mismatch Rate |
|---|---|---|---|---|---|
| Wireframe | 68.4 | 72.3 | 73.8 | 76.8 | 7.8% |
| YorkUrban | 32.7 | 35.6 | 36.6 | — | — |
CGL-Decoder reduces the endpoint mismatch rate (the proportion of line endpoints not snapping to any detected junction within 15 px) from 12.4% (baseline PLNet) to 7.8% with full prompts and sparse attention. Ablation studies show that point-to-line prompts alone drop mismatches to 11.2% and raise sAP¹⁵ to 72.3; adding line-to-point prompts improves sAP¹⁵ to 72.6 and mismatch to 9.6%. Sparse attention brings further gains (sAP¹⁵ to 73.3; mismatch to 7.8%). The default window size achieves the best accuracy-speed trade-off (Wang et al., 26 Jan 2026).