CGL-Decoder: Prompt-Guided Wireframe Parsing

Updated 29 January 2026
  • The paper introduces a robust neural module that refines wireframe extraction by integrating prompt-conditioned sparse attention with point-line consistency, cutting endpoint mismatches from 12.4% to 7.8%.
  • Local feature fusion and windowed multi-head cross-attention exchange spatial cues between line and junction prompts, ensuring coherent and context-aware geometry refinement.
  • Empirical evaluations on benchmarks like Wireframe and YorkUrban demonstrate the decoder’s competitive performance, achieving up to 76.8 FPS with improved prediction accuracy.

The Cross-Guidance Line Decoder (CGL-Decoder) is a neural module introduced within the Co-PLNet framework for prompt-guided wireframe parsing. Its design enables collaborative refinement of structured geometry by exchanging spatial cues between lines and junctions using prompt-conditioned, windowed sparse attention. The CGL-Decoder enforces point-line consistency and computational efficiency, resulting in improved accuracy and real-time performance for tasks such as wireframe extraction in images (Wang et al., 26 Jan 2026).

1. Architectural Overview

The CGL-Decoder operates on feature representations and prompt maps derived from preceding feature extraction and Point-Line Prompt Encoder (PLP-Encoder) stages. Its critical architectural elements are:

  • Inputs:
    • $Z \in \mathbb{R}^{H \times W \times C}$: Refined feature map from the U-Net backbone.
    • $y_L^{(0)} \in \mathbb{R}^{H \times W \times C_p}$: Line prompt map ($C_p = 16$).
    • $y_J^{(0)} \in \mathbb{R}^{H \times W \times C_p}$: Junction prompt map ($C_p = 16$).
  • Outputs:
    • $y_L$: Dense line parameter proposals $(d, \theta, \theta_1, \theta_2, r)$ at subpixel accuracy.
    • $y_J$: Refined junction heatmap and position offset maps.
    • $y$: Final set of sparse line segments (endpoints) following non-maximum suppression (NMS) and line-of-interest (LOI) verification.
  • Internal modules:
    • Local Feature Fusion: Concatenates $Z$, $y_L^{(0)}$, and $y_J^{(0)}$ along the channel axis; the result is processed through two small convolutional branches to obtain $\tilde Z_L, \tilde Z_J \in \mathbb{R}^{H \times W \times C_f}$.
    • 1×1 Projection: $\psi(\cdot)$ reduces the channel count from $C_f$ to $C_q$ ($C_q = 32$) as a pre-attention embedding.
    • Window Partitioning: Feature tensors are partitioned into non-overlapping spatial windows ($w = 8$).
    • Sparse Multi-Head Cross-Attention: Each window attends from the line (or junction) branch to backbone features.
    • Gated Residual Fusion: Attended corrections are fused into $Z$ using learnable gating masks $G_L, G_J$.
    • Prediction Heads: HAWP/PLNet line and junction heads refine geometry predictions using $Z_L', Z_J'$.
    • Endpoint Grouping & LOI Scoring: Endpoints are associated to their nearest junctions, deduplicated, and scored for final selection.
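The input/output contract above can be traced at the shape level with a minimal NumPy sketch. All module weights are identity/zero stand-ins, not the authors' implementation, and the backbone channel count is shrunk from 256 to 64 purely for brevity:

```python
import numpy as np

# Shape-level walkthrough of the CGL-Decoder inputs (illustrative stand-ins).
H, W, C, Cp, Cq, w = 32, 32, 64, 16, 32, 8   # C reduced from 256 for brevity

Z   = np.zeros((H, W, C))    # refined U-Net feature map
yL0 = np.zeros((H, W, Cp))   # line prompt map (Cp = 16)
yJ0 = np.zeros((H, W, Cp))   # junction prompt map (Cp = 16)

# Local feature fusion input: channel-axis concatenation
fused = np.concatenate([Z, yL0, yJ0], axis=-1)   # (H, W, C + 2*Cp)

# 1x1 projection psi(.) down to Cq = 32 as the pre-attention embedding
psi = np.zeros((C + 2 * Cp, Cq))
pre_attn = fused @ psi                           # (H, W, Cq)

# Non-overlapping w = 8 windows for sparse attention
num_windows = (H // w) * (W // w)
```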

2. Prompt-Conditioned Sparse Attention Mechanism

Within the CGL-Decoder, attention is conditioned on point-line prompts and applied sparsely within local windows. For the line branch, the formalization within spatial window $t$ is:

$$
\begin{aligned}
Q_t &= \psi(\tilde Z_L)[t]\, W_Q \\
K_t &= \psi(Z)[t]\, W_K \\
V_t &= \psi(Z)[t]\, W_V
\end{aligned}
$$

with $W_Q, W_K, W_V \in \mathbb{R}^{C_q \times d_k}$ as learned projections ($d_k = C_q / H_{\mathrm{heads}}$). For each head $h$ and window $t$:

$$\mathrm{head}_h(t) = \mathrm{softmax}\!\left( \frac{Q_t^{(h)} (K_t^{(h)})^\top}{\sqrt{d_k}} \right) V_t^{(h)}$$

The multi-head output is

$$\mathrm{MHA}_t = \mathrm{Concat}\bigl(\mathrm{head}_1(t), \ldots, \mathrm{head}_{H_{\mathrm{heads}}}(t)\bigr)\, W_O$$

Windows are reassembled to yield the attended feature map $\bar Z_L$ (and likewise $\bar Z_J$ for junctions). Gated residual fusion restores the original channel dimension with

$$Z_L' = Z + G_L \odot \bar Z_L, \qquad Z_J' = Z + G_J \odot \bar Z_J$$

where $G_L, G_J \in \mathbb{R}^{H \times W \times C}$ are learned gating masks and $\odot$ denotes element-wise multiplication.
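As a concrete illustration, the windowed cross-attention above can be sketched in NumPy. This is a minimal sketch, not the authors' code: the $\psi$ projection is assumed to have been applied upstream, the weight matrices are square for simplicity, and the gated residual fusion is noted as a one-liner at the end.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) map into non-overlapping windows of shape (w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def window_merge(wins, H, W, w):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_cross_attention(z_line, z, Wq, Wk, Wv, Wo, w=8, heads=4):
    """Per-window multi-head cross-attention: queries come from the
    prompt-fused line branch, keys/values from the backbone features."""
    H, W_, Cq = z_line.shape
    dk = Cq // heads
    Qw = window_partition(z_line @ Wq, w)            # (T, w*w, Cq)
    Kw = window_partition(z @ Wk, w)
    Vw = window_partition(z @ Wv, w)
    T, n, _ = Qw.shape
    split = lambda t: t.reshape(T, n, heads, dk).transpose(0, 2, 1, 3)
    Qh, Kh, Vh = split(Qw), split(Kw), split(Vw)     # (T, heads, n, dk)
    attn = softmax(Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(dk))
    out = (attn @ Vh).transpose(0, 2, 1, 3).reshape(T, n, Cq) @ Wo
    return window_merge(out, H, W_, w)

# Gated residual fusion (channels assumed aligned): Z_prime = Z + G * Z_bar
```

Because attention never crosses window boundaries, each window's softmax is over only $w^2 = 64$ tokens, which is what keeps the cost linear in image area.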

3. Integration with the PLP-Encoder and Local-Global Context

The CGL-Decoder leverages coarse spatial prompts from the PLP-Encoder, which generates $y_L^{(0)}$ and $y_J^{(0)}$ through lightweight convolutional heads. These prompt maps, spatially and channel-aligned with $Z$, are concatenated with $Z$ and passed through two convolutional branches to yield $\tilde Z_L$ and $\tilde Z_J$. This direct injection of geometry prompts lets the decoder modulate feature fusion before global context aggregation by sparse attention, enhancing context-aware refinement.
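A minimal stand-in for this fusion step, assuming 1×1 projections (per-pixel matmuls) in place of the paper's two small convolutional branches; all weight shapes here are illustrative:

```python
import numpy as np

def local_feature_fusion(Z, yL0, yJ0, W_line, W_junc):
    """Concatenate backbone features with both prompt maps along the channel
    axis, then project per branch. A 1x1 convolution (per-pixel matmul)
    stands in for the paper's small convolutional branches."""
    X = np.concatenate([Z, yL0, yJ0], axis=-1)   # (H, W, C + 2*Cp)
    return X @ W_line, X @ W_junc                # Z~_L, Z~_J: (H, W, Cf)
```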

4. Stepwise Decoding and Refinement Workflow

The complete refinement algorithm is outlined below:

  1. Backbone & PLP-Encoder: Extract multi-scale features (SuperPoint + U-Net), generate coarse proposals $(C_L, C_J)$, and produce spatial prompts $y_L^{(0)}, y_J^{(0)}$.
  2. Local Fusion: Concatenate the inputs and convolve to obtain $\tilde Z_L, \tilde Z_J$.
  3. Sparse Attention: Compute $Q, K, V$ for each window; perform windowed multi-head attention to derive $\bar Z_L, \bar Z_J$; fuse via gated residuals.
  4. Geometry Prediction: The line head on $Z_L'$ predicts line parameters $(d, \theta, \theta_1, \theta_2, r)$; the junction head on $Z_J'$ predicts heatmaps and offsets.
  5. Post-Processing: Endpoints are snapped to their nearest junction within a threshold and deduplicated via NMS.
  6. Line-of-Interest Verification: Features are sampled along each candidate line and scored by a small MLP; the top-$k$ lines are retained.
  7. Output: The final set of line segments and junctions is produced.
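Step 5 above (endpoint snapping and deduplication) can be sketched in plain Python. The helper below is hypothetical, not the authors' code: the 15 px threshold matches the mismatch metric reported later, and the orientation-invariant key is one simple way to drop duplicate segments:

```python
import math

def snap_endpoints(lines, junctions, thresh=15.0):
    """Snap each line endpoint to its nearest detected junction if within
    `thresh` pixels, then drop degenerate and duplicate segments."""
    def nearest(p):
        best, bd = p, thresh   # endpoints with no junction in range stay put
        for j in junctions:
            d = math.dist(p, j)
            if d < bd:
                best, bd = j, d
        return best

    snapped, seen = [], set()
    for p1, p2 in lines:
        a, b = nearest(p1), nearest(p2)
        key = (min(a, b), max(a, b))   # orientation-invariant dedup key
        if a != b and key not in seen:
            seen.add(key)
            snapped.append((a, b))
    return snapped
```

A reversed duplicate of a segment maps to the same key and is discarded, and a segment whose two endpoints collapse onto one junction is treated as degenerate.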

5. Loss Formulations and Optimization Criteria

CGL-Decoder training is end-to-end and employs a composite loss:

$$L = \sum_{m \in \{\mathrm{PLP}, \mathrm{CGL}\}} \left( L_{\mathrm{line}}^{(m)} + L_{\mathrm{junc}}^{(m)} + L_{\mathrm{aux}}^{(m)} \right) + L_{\mathrm{LOI}}$$

where:

  • $L_{\mathrm{junc}} = \mathrm{BCE}(H_J, H_J^*) + \lambda_{\mathrm{off}} \| \Delta C_J - \Delta C_J^* \|_1$ supervises junction heatmaps and offsets.
  • $L_{\mathrm{line}} = \|d - d^*\|_2^2 + \|\theta - \theta^*\|_2^2 + \|\theta_1 - \theta_1^*\|_2^2 + \|\theta_2 - \theta_2^*\|_2^2 + \|r - r^*\|_2^2$ penalizes line regression error.
  • $L_{\mathrm{aux}}$ encourages consistency between dense line proposals and ground-truth geometry (e.g., via point-to-line distance losses as in HAWP).
  • $L_{\mathrm{LOI}} = -\sum_i \left[ y_i^* \log s_i + (1 - y_i^*) \log (1 - s_i) \right]$ supervises the LOI MLP's output line confidence scores.
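The individual terms can be written out directly in NumPy. This is a sketch of the formulas above, not the training code; $\lambda_{\mathrm{off}} = 0.25$ is an illustrative value, as the paper's weighting is not restated here:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy with clipping for numerical stability."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def junction_loss(H_pred, H_gt, off_pred, off_gt, lam_off=0.25):
    """Heatmap BCE plus L1 offset term (lam_off is illustrative)."""
    return bce(H_pred, H_gt) + lam_off * np.mean(np.abs(off_pred - off_gt))

def line_loss(params_pred, params_gt):
    """Summed squared error over (d, theta, theta1, theta2, r)."""
    return np.sum((params_pred - params_gt) ** 2)

def loi_loss(scores, labels):
    """BCE on LOI confidence scores against 0/1 line labels."""
    return bce(scores, labels)
```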

6. Implementation Considerations and Computational Analysis

  • All CGL operations are implemented in PyTorch and benchmarked on an RTX 4080 GPU.
  • The default window size is $w = 8$, giving non-overlapping partitions.
  • Attention cost per image is $O(HW \cdot d_k \cdot H_{\mathrm{heads}})$, linear in image area thanks to the sparse windowing.
  • The 1×1 convolution $\psi$ reduces the feature channels (256 to 32) before attention; prompt maps require only 16 channels.
  • Multi-head attention and window partitioning are parallelized for speed, yielding 76.8 FPS on $512 \times 512$ images.
  • Dense (full) attention increased sAP by only $+0.3$ while roughly halving throughput to ${\sim}42$ FPS, motivating the sparse approach.
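A back-of-the-envelope count of the quadratic attention terms shows why windowing makes the cost linear in image area. The arithmetic below is purely illustrative (counting only the $QK^\top$ and $AV$ products), not the paper's measured cost:

```python
def attn_macs(H, W, dk, heads, window=None):
    """Multiply-accumulates in the QK^T and AV products.
    window=None means one global attention over all H*W tokens."""
    if window is None:
        groups, n = 1, H * W                              # one n^2 attention
    else:
        groups, n = (H // window) * (W // window), window * window
    return 2 * groups * heads * n * n * dk

full   = attn_macs(128, 128, dk=8, heads=4)               # global attention
sparse = attn_macs(128, 128, dk=8, heads=4, window=8)     # w = 8 windows
# full / sparse == (H*W) / w^2: windowing removes one factor of token count
```

For a 128×128 feature map the ratio is $HW / w^2 = 256$, consistent with the large speed gap between dense and windowed attention reported above.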

7. Empirical Performance and Ablation Insights

On standard wireframe benchmarks:

Dataset    | sAP⁵ | sAP¹⁰ | sAP¹⁵ | FPS  | Endpoint Mismatch Rate
Wireframe  | 68.4 | 72.3  | 73.8  | 76.8 | 7.8%
YorkUrban  | 32.7 | 35.6  | 36.6  | —    | —

The CGL-Decoder reduces the endpoint mismatch rate (the proportion of line endpoints not snapping to any detected junction within 15 px) from 12.4% (baseline PLNet) to 7.8% with full prompts and sparse attention. Ablation studies show that point-to-line prompts alone reduce the mismatch rate to 11.2% and raise sAP¹⁵ to 72.3; adding line-to-point prompts improves sAP¹⁵ to 72.6 and the mismatch rate to 9.6%. Sparse attention brings further gains (sAP¹⁵ 73.3; mismatch 7.8%). A window size of $w = 8$ achieves the best accuracy-speed trade-off (Wang et al., 26 Jan 2026).
