CGL-Decoder: Prompt-Guided Wireframe Parsing

Updated 29 January 2026
  • The paper introduces a robust neural module that refines wireframe extraction by integrating prompt-conditioned sparse attention with point-line consistency, cutting endpoint mismatches from 12.4% to 7.8%.
  • Local feature fusion and windowed multi-head cross-attention exchange spatial cues between line and junction prompts, ensuring coherent and context-aware geometry refinement.
  • Empirical evaluations on benchmarks like Wireframe and YorkUrban demonstrate the decoder’s competitive performance, achieving up to 76.8 FPS with improved prediction accuracy.

The Cross-Guidance Line Decoder (CGL-Decoder) is a neural module introduced within the Co-PLNet framework for prompt-guided wireframe parsing. Its design enables collaborative refinement of structured geometry by exchanging spatial cues between lines and junctions using prompt-conditioned, windowed sparse attention. The CGL-Decoder enforces point-line consistency and computational efficiency, resulting in improved accuracy and real-time performance for tasks such as wireframe extraction in images (Wang et al., 26 Jan 2026).

1. Architectural Overview

The CGL-Decoder operates on feature representations and prompt maps derived from preceding feature extraction and Point-Line Prompt Encoder (PLP-Encoder) stages. Its critical architectural elements are:

  • Inputs:
    • $Z \in \mathbb{R}^{H \times W \times C}$: Refined feature map from the U-Net backbone.
    • $y_L^{(0)} \in \mathbb{R}^{H \times W \times C_p}$: Line prompt map ($C_p = 16$).
    • $y_J^{(0)} \in \mathbb{R}^{H \times W \times C_p}$: Junction prompt map ($C_p = 16$).
  • Outputs:
    • $y_L$: Dense line parameter proposals $(d, \theta, \theta_1, \theta_2, r)$ at subpixel accuracy.
    • $y_J$: Refined junction heatmap and position offset maps.
    • $y$: Final set of sparse line segments (endpoints) following non-maximum suppression (NMS) and line-of-interest (LOI) verification.
  • Internal modules:
    • Local Feature Fusion: Concatenates $Z$, $y_L^{(0)}$, and $y_J^{(0)}$ along the channel axis; the result is processed through two small convolutional branches to obtain $\tilde Z_L, \tilde Z_J \in \mathbb{R}^{H \times W \times C_f}$.
    • 1×1 Projection: $\psi(\cdot)$ reduces the channel count from $C_f$ to $C_q$ ($C_q = 32$) as a pre-attention embedding.
    • Window Partitioning: Feature tensors are partitioned into non-overlapping spatial windows ($w = 8$).
    • Sparse Multi-Head Cross-Attention: Each window attends from the line (or junction) branch to backbone features.
    • Gated Residual Fusion: Attended corrections are fused into $Z$ using learnable gating masks $G_L, G_J$.
    • Prediction Heads: HAWP/PLNet line and junction heads refine geometry predictions using $Z_L', Z_J'$.
    • Endpoint Grouping & LOI Scoring: Endpoints are associated to their nearest junctions, deduplicated, and scored for final selection.
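The input/output contract above can be traced at the shape level with a minimal NumPy sketch. All module weights are identity/zero stand-ins, not the authors' implementation, and the backbone channel count is shrunk from 256 to 64 purely for brevity:

```python
import numpy as np

# Shape-level walkthrough of the CGL-Decoder inputs (illustrative stand-ins).
H, W, C, Cp, Cq, w = 32, 32, 64, 16, 32, 8   # C reduced from 256 for brevity

Z   = np.zeros((H, W, C))    # refined U-Net feature map
yL0 = np.zeros((H, W, Cp))   # line prompt map (Cp = 16)
yJ0 = np.zeros((H, W, Cp))   # junction prompt map (Cp = 16)

# Local feature fusion input: channel-axis concatenation
fused = np.concatenate([Z, yL0, yJ0], axis=-1)   # (H, W, C + 2*Cp)

# 1x1 projection psi(.) down to Cq = 32 as the pre-attention embedding
psi = np.zeros((C + 2 * Cp, Cq))
pre_attn = fused @ psi                           # (H, W, Cq)

# Non-overlapping w = 8 windows for sparse attention
num_windows = (H // w) * (W // w)
```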

2. Prompt-Conditioned Sparse Attention Mechanism

Within the CGL-Decoder, attention is conditioned on point-line prompts and applied sparsely within local windows. For the line branch, the formalization within spatial window $t$ is:

$$
\begin{aligned}
Q_t &= \psi(\tilde Z_L)[t]\, W_Q \\
K_t &= \psi(Z)[t]\, W_K \\
V_t &= \psi(Z)[t]\, W_V
\end{aligned}
$$

with $W_Q, W_K, W_V \in \mathbb{R}^{C_q \times d_k}$ as learned projections ($d_k = C_q / H_{\mathrm{heads}}$). For each head $h$ and window $t$:

$$\mathrm{head}_h(t) = \mathrm{softmax}\!\left( \frac{Q_t^{(h)} (K_t^{(h)})^\top}{\sqrt{d_k}} \right) V_t^{(h)}$$

The multi-head output is

$$\mathrm{MHA}_t = \mathrm{Concat}\bigl(\mathrm{head}_1(t), \ldots, \mathrm{head}_{H_{\mathrm{heads}}}(t)\bigr)\, W_O$$

Windows are reassembled to yield the attended feature map $\bar Z_L$ (and likewise $\bar Z_J$ for junctions). Gated residual fusion restores the original channel dimension with

$$Z_L' = Z + G_L \odot \bar Z_L, \qquad Z_J' = Z + G_J \odot \bar Z_J$$

where $G_L, G_J \in \mathbb{R}^{H \times W \times C}$ are learned gating masks and $\odot$ denotes element-wise multiplication.
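As a concrete illustration, the windowed cross-attention above can be sketched in NumPy. This is a minimal sketch, not the authors' code: the $\psi$ projection is assumed to have been applied upstream, the weight matrices are square for simplicity, and the gated residual fusion is noted as a one-liner at the end.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) map into non-overlapping windows of shape (w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def window_merge(wins, H, W, w):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_cross_attention(z_line, z, Wq, Wk, Wv, Wo, w=8, heads=4):
    """Per-window multi-head cross-attention: queries come from the
    prompt-fused line branch, keys/values from the backbone features."""
    H, W_, Cq = z_line.shape
    dk = Cq // heads
    Qw = window_partition(z_line @ Wq, w)            # (T, w*w, Cq)
    Kw = window_partition(z @ Wk, w)
    Vw = window_partition(z @ Wv, w)
    T, n, _ = Qw.shape
    split = lambda t: t.reshape(T, n, heads, dk).transpose(0, 2, 1, 3)
    Qh, Kh, Vh = split(Qw), split(Kw), split(Vw)     # (T, heads, n, dk)
    attn = softmax(Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(dk))
    out = (attn @ Vh).transpose(0, 2, 1, 3).reshape(T, n, Cq) @ Wo
    return window_merge(out, H, W_, w)

# Gated residual fusion (channels assumed aligned): Z_prime = Z + G * Z_bar
```

Because attention never crosses window boundaries, each window's softmax is over only $w^2 = 64$ tokens, which is what keeps the cost linear in image area.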

3. Integration with the PLP-Encoder and Local-Global Context

The CGL-Decoder leverages coarse spatial prompts from the PLP-Encoder, which generates $y_L^{(0)}$ and $y_J^{(0)}$ through lightweight convolutional heads. These prompt maps, spatially and channel-aligned with $Z$, are concatenated with $Z$ and passed through two convolutional branches to yield $\tilde Z_L$ and $\tilde Z_J$. This direct injection of geometry prompts lets the decoder modulate feature fusion before global context aggregation by sparse attention, enhancing context-aware refinement.
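A minimal stand-in for this fusion step, assuming 1×1 projections (per-pixel matmuls) in place of the paper's two small convolutional branches; all weight shapes here are illustrative:

```python
import numpy as np

def local_feature_fusion(Z, yL0, yJ0, W_line, W_junc):
    """Concatenate backbone features with both prompt maps along the channel
    axis, then project per branch. A 1x1 convolution (per-pixel matmul)
    stands in for the paper's small convolutional branches."""
    X = np.concatenate([Z, yL0, yJ0], axis=-1)   # (H, W, C + 2*Cp)
    return X @ W_line, X @ W_junc                # Z~_L, Z~_J: (H, W, Cf)
```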

4. Stepwise Decoding and Refinement Workflow

The complete refinement algorithm is outlined below:

  1. Backbone & PLP-Encoder: Extract multi-scale features (SuperPoint + U-Net), generate coarse proposals $(C_L, C_J)$, and produce spatial prompts $y_L^{(0)}, y_J^{(0)}$.
  2. Local Fusion: Concatenate the inputs and convolve to obtain $\tilde Z_L, \tilde Z_J$.
  3. Sparse Attention: Compute $Q, K, V$ for each window; perform windowed multi-head attention to derive $\bar Z_L, \bar Z_J$; fuse via gated residuals.
  4. Geometry Prediction: The line head on $Z_L'$ predicts line parameters $(d, \theta, \theta_1, \theta_2, r)$; the junction head on $Z_J'$ predicts heatmaps and offsets.
  5. Post-Processing: Endpoints are snapped to their nearest junction within a threshold and deduplicated via NMS.
  6. Line-of-Interest Verification: Features are sampled along each candidate line and scored by a small MLP; the top-$k$ lines are retained.
  7. Output: The final set of line segments and junctions is produced.
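Step 5 above (endpoint snapping and deduplication) can be sketched in plain Python. The helper below is hypothetical, not the authors' code: the 15 px threshold matches the mismatch metric reported later, and the orientation-invariant key is one simple way to drop duplicate segments:

```python
import math

def snap_endpoints(lines, junctions, thresh=15.0):
    """Snap each line endpoint to its nearest detected junction if within
    `thresh` pixels, then drop degenerate and duplicate segments."""
    def nearest(p):
        best, bd = p, thresh   # endpoints with no junction in range stay put
        for j in junctions:
            d = math.dist(p, j)
            if d < bd:
                best, bd = j, d
        return best

    snapped, seen = [], set()
    for p1, p2 in lines:
        a, b = nearest(p1), nearest(p2)
        key = (min(a, b), max(a, b))   # orientation-invariant dedup key
        if a != b and key not in seen:
            seen.add(key)
            snapped.append((a, b))
    return snapped
```

A reversed duplicate of a segment maps to the same key and is discarded, and a segment whose two endpoints collapse onto one junction is treated as degenerate.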

5. Loss Formulations and Optimization Criteria

CGL-Decoder training is end-to-end and employs a composite loss:

$$L = \sum_{m \in \{\mathrm{PLP}, \mathrm{CGL}\}} \left( L_{\mathrm{line}}^{(m)} + L_{\mathrm{junc}}^{(m)} + L_{\mathrm{aux}}^{(m)} \right) + L_{\mathrm{LOI}}$$

where:

  • $L_{\mathrm{junc}} = \mathrm{BCE}(H_J, H_J^*) + \lambda_{\mathrm{off}} \| \Delta C_J - \Delta C_J^* \|_1$ supervises junction heatmaps and offsets.
  • $L_{\mathrm{line}} = \|d - d^*\|_2^2 + \|\theta - \theta^*\|_2^2 + \|\theta_1 - \theta_1^*\|_2^2 + \|\theta_2 - \theta_2^*\|_2^2 + \|r - r^*\|_2^2$ penalizes line regression error.
  • $L_{\mathrm{aux}}$ encourages consistency between dense line proposals and ground-truth geometry (e.g., via point-to-line distance losses as in HAWP).
  • $L_{\mathrm{LOI}} = -\sum_i \left[ y_i^* \log s_i + (1 - y_i^*) \log (1 - s_i) \right]$ supervises the LOI MLP's output line confidence scores.
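The individual terms can be written out directly in NumPy. This is a sketch of the formulas above, not the training code; $\lambda_{\mathrm{off}} = 0.25$ is an illustrative value, as the paper's weighting is not restated here:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy with clipping for numerical stability."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def junction_loss(H_pred, H_gt, off_pred, off_gt, lam_off=0.25):
    """Heatmap BCE plus L1 offset term (lam_off is illustrative)."""
    return bce(H_pred, H_gt) + lam_off * np.mean(np.abs(off_pred - off_gt))

def line_loss(params_pred, params_gt):
    """Summed squared error over (d, theta, theta1, theta2, r)."""
    return np.sum((params_pred - params_gt) ** 2)

def loi_loss(scores, labels):
    """BCE on LOI confidence scores against 0/1 line labels."""
    return bce(scores, labels)
```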

6. Implementation Considerations and Computational Analysis

  • All CGL operations are implemented in PyTorch and benchmarked on an RTX 4080 GPU.
  • The default window size is $w = 8$, giving non-overlapping partitions.
  • Attention cost per image is $O(HW \cdot d_k \cdot H_{\mathrm{heads}})$, linear in image area thanks to the sparse windowing.
  • The 1×1 convolution $\psi$ reduces the feature channels (256 to 32) before attention; prompt maps require only 16 channels.
  • Multi-head attention and window partitioning are parallelized for speed, yielding 76.8 FPS on $512 \times 512$ images.
  • Dense (full) attention increased sAP by only $+0.3$ while roughly halving throughput to ${\sim}42$ FPS, motivating the sparse approach.
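A back-of-the-envelope count of the quadratic attention terms shows why windowing makes the cost linear in image area. The arithmetic below is purely illustrative (counting only the $QK^\top$ and $AV$ products), not the paper's measured cost:

```python
def attn_macs(H, W, dk, heads, window=None):
    """Multiply-accumulates in the QK^T and AV products.
    window=None means one global attention over all H*W tokens."""
    if window is None:
        groups, n = 1, H * W                              # one n^2 attention
    else:
        groups, n = (H // window) * (W // window), window * window
    return 2 * groups * heads * n * n * dk

full   = attn_macs(128, 128, dk=8, heads=4)               # global attention
sparse = attn_macs(128, 128, dk=8, heads=4, window=8)     # w = 8 windows
# full / sparse == (H*W) / w^2: windowing removes one factor of token count
```

For a 128×128 feature map the ratio is $HW / w^2 = 256$, consistent with the large speed gap between dense and windowed attention reported above.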

7. Empirical Performance and Ablation Insights

On standard wireframe benchmarks:

Dataset    | sAP⁵ | sAP¹⁰ | sAP¹⁵ | FPS  | Endpoint Mismatch Rate
Wireframe  | 68.4 | 72.3  | 73.8  | 76.8 | 7.8%
YorkUrban  | 32.7 | 35.6  | 36.6  | —    | —

The CGL-Decoder reduces the endpoint mismatch rate (the proportion of line endpoints not snapping to any detected junction within 15 px) from 12.4% (baseline PLNet) to 7.8% with full prompts and sparse attention. Ablation studies show that point-to-line prompts alone reduce the mismatch rate to 11.2% and raise sAP¹⁵ to 72.3; adding line-to-point prompts improves sAP¹⁵ to 72.6 and the mismatch rate to 9.6%. Sparse attention brings further gains (sAP¹⁵ 73.3; mismatch 7.8%). A window size of $w = 8$ achieves the best accuracy-speed trade-off (Wang et al., 26 Jan 2026).
