
SparseLaneSTP: 3D Lane Detection Framework

Updated 15 January 2026
  • The paper demonstrates a novel 3D lane detection framework that integrates spatial structure and temporal evolution using a DETR-style sparse transformer.
  • It introduces a continuous spline-based lane representation along with spatial and temporal regularization to enhance detection accuracy under challenging conditions.
  • Experiments on multiple datasets show improved F1-scores and robustness compared to traditional dense bird’s-eye-view methods.

SparseLaneSTP is a 3D lane detection framework designed for autonomous driving scenarios, integrating spatio-temporal geometric priors into a sparse transformer architecture. The method advances over dense bird’s-eye-view (BEV) pipelines by directly operating on front-view features using DETR-style sparse queries, while embedding both spatial (lane structure) and temporal (lane evolution across time) knowledge. It further introduces a continuous spline-based lane representation and a dedicated regularization strategy, supported by an auto-labeled, large-scale 3D lane dataset for performance benchmarking (Pittner et al., 8 Jan 2026).

1. Motivation and Problem Formulation

3D lane detection seeks to recover both the precise layout of lane markings and the underlying 3D road surface, expressed in a vehicle-centric coordinate frame. Conventional dense BEV approaches transform front-view image features via Inverse Perspective Mapping (IPM) or learned lifts to generate a BEV feature map. Such methods are limited by real-world road non-planarity and misalignment between feature maps and the 3D road geometry, resulting in suboptimal detection performance. Sparse DETR-style detectors address these geometric distortions by having queries attend directly to image features, sidestepping BEV altogether. However, prior approaches neglect two critical sources of domain knowledge: spatial priors—such as lane continuity, smoothness, and parallelism—and temporal priors derived from historical lane observations, which are critical for robust detection under ambiguous visual conditions caused by occlusion or low visibility.

SparseLaneSTP addresses this by integrating spatial structure and temporal evolution through a transformer-based architecture with tailored attention mechanisms and continuous curve parameterization.

2. Model Architecture and Workflow

The input is an RGB image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, processed by a convolutional backbone (e.g., ResNet-50) to produce a feature map $\mathbf{F}\in\mathbb{R}^{H_F\times W_F\times C}$. A lane instance segmentation branch transforms $\mathbf{F}$ into initial query embeddings $^{0}\mathbf{Q}\in\mathbb{R}^{N\times M\times C}$, with $N$ lanes and $M$ control points per lane. Each query embedding is decoded by an MLP to produce control-point parameters $^{0}\mathbf{P}_{i,j}\in\mathbb{R}^{4}$ (encoding 3D position and visibility).

The decoder consists of $L$ transformer layers, refining the queries and control points via the following sequence:

  1. Spatio-Temporal Attention (STA):

$^{l}\mathbf{Q}_{\mathrm{STA}} = \mathrm{STA}\left(^{l-1}\mathbf{Q},\ ^{l-1}\mathbf{P},\ \mathbf{Q}_{\mathrm{Mem}},\ \mathbf{P}_{\mathrm{Mem}}\right)$

  2. Deformable Cross Attention (DCA):

$^{l}\mathbf{Q}_{\mathrm{DCA}} = \mathrm{DCA}\left(^{l}\mathbf{Q}_{\mathrm{STA}},\ \mathbf{F},\ ^{l-1}\mathbf{P}_{2D}\right)$

  3. Feedforward network and normalization:

$^{l}\mathbf{Q} = \mathrm{FFN}\left(^{l}\mathbf{Q}_{\mathrm{DCA}}\right)$

After each layer, control-point parameters and classification logits are regressed via a shared MLP.
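The per-layer refinement loop above can be sketched as follows. The submodules are identity-style stand-ins (the real STA, DCA, FFN, and regression head are learned networks), and all shapes, names, and values are illustrative, not the paper's implementation:

```python
import numpy as np

# Shapes follow the paper's notation: N lanes, M control points, C channels.
N, M, C, L = 4, 10, 32, 3
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned submodules.
def sta(Q, P, Q_mem, P_mem):      # spatio-temporal attention
    return Q
def dca(Q, F, P2d):               # deformable cross attention over image features
    return Q
def ffn(Q):                       # feed-forward network + normalization
    return np.tanh(Q)
def head(Q):                      # shared MLP: query -> control-point parameters
    return Q[..., :4]

Q = rng.normal(size=(N, M, C))           # initial queries from the segmentation branch
P = head(Q)                              # initial control-point parameters
F = rng.normal(size=(48, 80, C))         # backbone feature map
Q_mem, P_mem = Q.copy(), P.copy()        # FIFO memory from past frames

outputs = []
for layer in range(L):                   # L decoder layers refine queries and points
    Q = sta(Q, P, Q_mem, P_mem)
    Q = dca(Q, F, P[..., :2])            # projected control points guide feature sampling
    Q = ffn(Q)
    P = head(Q)                          # re-regress control points after every layer
    outputs.append(P)                    # per-layer predictions for auxiliary supervision

print(len(outputs), outputs[-1].shape)
```

Re-regressing the control points after every layer lets each subsequent layer attend around progressively better geometric estimates.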

3. Lane-Specific Spatio-Temporal Attention

A FIFO memory buffer stores the top $N_{\mathrm{Mem}}$ queries and control points from each of the past $T$ frames:

$\mathbf{Q}_{\mathrm{Mem}} = \left[\mathbf{Q}^{(t-1)};\dots;\mathbf{Q}^{(t-T)}\right] \in \mathbb{R}^{T N_{\mathrm{Mem}}\times M\times C}$

To account for ego-motion, past 3D control points are re-expressed in the current vehicle frame:

$\mathbf{P}_{ij}^{(t-k)\to t} = \left[\left(\mathbf{E}_{\mathrm{inv}}^{(t)}\,\mathbf{E}^{(t-k)}\,\mathbf{P}_{3D,ij}^{(t-k)}\right)\ \big|\ \mathbf{P}_{v,ij}^{(t-k)}\right]$

Queries are augmented with position encodings: $\widetilde{\mathbf{Q}} = \mathbf{Q} + \mathrm{PE}(\mathbf{P})$.
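The ego-motion compensation amounts to chaining two homogeneous pose matrices: map the old point into the world frame, then back into the current vehicle frame, carrying the visibility flag along unchanged. A minimal sketch, assuming planar motion and a hypothetical `make_pose` helper (frame conventions and values are illustrative):

```python
import numpy as np

def make_pose(yaw, t):
    """4x4 homogeneous ego pose E: vehicle frame -> world frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    E = np.eye(4)
    E[:2, :2] = [[c, -s], [s, c]]
    E[:3, 3] = t
    return E

# inv(E^(t)) @ E^(t-k) maps points from the old vehicle frame into the current one.
E_prev = make_pose(0.0, [0.0, 0.0, 0.0])   # pose at frame t-k
E_curr = make_pose(0.0, [2.0, 0.0, 0.0])   # ego moved 2 m forward in x since then

P_prev = np.array([10.0, 1.5, 0.1, 1.0])   # homogeneous 3D control point, old frame
P_curr = np.linalg.inv(E_curr) @ E_prev @ P_prev
print(P_curr[:3])   # -> [8.  1.5 0.1]: the point is now 2 m closer
```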

STA is decomposed into three structured submodules:

  • Same-Line Attention (SLA): restricts attention to within the $M$ control points of each lane.
  • Parallel-Neighbor Attention (PNA): enables attention to the nearest control points on adjacent (parallel) lanes.
  • Temporal Cross-Attention (TCA): allows current queries to attend to the nearest historical points across $T$ frames.

The attention for a single head is given by:

$A_{ij}^{t,t'} = \dfrac{(Q_i^t W_Q)(K_j^{t'} W_K)^{\top}}{\sqrt{d_k}} + P_{\mathrm{spatial}}(i,j) + P_{\mathrm{temporal}}(t,t')$

where

  • $Q_i^t, K_j^{t'}$ are position-encoded queries and keys,
  • $W_Q, W_K$ are linear projections,
  • $d_k$ is the head dimension,
  • $P_{\mathrm{spatial}}(i,j)$ and $P_{\mathrm{temporal}}(t,t')$ are learnable or fixed biases penalizing invalid relations.
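One common way to realize such biases is an additive mask on the pre-softmax scores: a large negative constant effectively zeroes the attention weight on disallowed pairs. The sketch below applies a Same-Line-Attention-style mask that blocks a neighbor lane's points; all shapes and the masking constant are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def masked_attention_scores(Q, K, d_k, bias):
    """Single-head scores with an additive structural bias term."""
    return Q @ K.T / np.sqrt(d_k) + bias

M, d_k = 4, 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(M, d_k))            # position-encoded queries (one lane)
K = rng.normal(size=(2 * M, d_k))        # keys: own lane's points + a neighbor lane's

# Same-Line-Attention-style bias: block the neighbor lane's points entirely.
bias_sla = np.concatenate([np.zeros((M, M)), np.full((M, M), -1e9)], axis=1)
scores = masked_attention_scores(Q, K, d_k, bias_sla)

# Softmax over keys; masked entries underflow to zero weight.
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

print(attn[:, M:].max())   # weight on masked (cross-lane) keys is ~0
```

Swapping the bias pattern (nearest-neighbor columns for PNA, historical key blocks for TCA) yields the other two submodules under the same score formula.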

4. Continuous Lane Representation

Each lane $i$ comprises $M$ control points $P_i\in\mathbb{R}^{M\times 4}$, with $(x_{ij}, y_{ij}, z_{ij}, v_{ij})$ per row. The $y_{ij}$ are fixed on a uniform grid over $[y_s, y_e]$, so the network regresses only $x$, $z$, and visibility $v$. The lane is represented as a Catmull-Rom spline:

$f_i(s) = [s^3\ s^2\ s\ 1]\,M_{\mathrm{CR}}(s)\,P_i$

with precomputed basis $M_{\mathrm{CR}}(s)\in\mathbb{R}^{4\times M}$. Sampling $s$ on a fine grid reconstructs the continuous curve.
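A per-segment evaluation of a uniform Catmull-Rom spline looks as follows; the spline interpolates its interior control points exactly ($s=0$ and $s=1$ hit consecutive control points). The basis matrix here is the standard uniform Catmull-Rom choice, which may differ from the paper's exact parameterization:

```python
import numpy as np

# Standard uniform Catmull-Rom basis (an assumption; the paper's M_CR may differ).
B = 0.5 * np.array([[-1.0,  3.0, -3.0,  1.0],
                    [ 2.0, -5.0,  4.0, -1.0],
                    [-1.0,  0.0,  1.0,  0.0],
                    [ 0.0,  2.0,  0.0,  0.0]])

def catmull_rom(ctrl, s, seg):
    """Evaluate the spline at local parameter s in [0, 1] within segment `seg`.

    ctrl: (M, d) control points; the segment uses points seg-1 .. seg+2."""
    window = ctrl[seg - 1: seg + 3]                     # (4, d) local window
    return np.array([s**3, s**2, s, 1.0]) @ B @ window

# Toy lane: y fixed on a uniform grid; the network would regress the second column.
ctrl = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
p0 = catmull_rom(ctrl, 0.0, 1)   # s=0 hits control point 1 exactly
p1 = catmull_rom(ctrl, 1.0, 1)   # s=1 hits control point 2 exactly
print(p0, p1)
```

Interpolation (rather than approximation, as with B-splines) means the regressed control points lie on the predicted lane, which keeps the point-wise visibility attributes attached to actual curve locations.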

5. Spatial and Temporal Regularization

Spatial priors are enforced using losses from LaneCPP, including:

  • Parallelism loss: maintains near-constant lateral lane separation,
  • Smoothness loss: penalizes large second derivatives,
  • Curvature penalty: discourages excessive curvature.
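The smoothness prior can be discretized with finite differences on sampled curve points; the sketch below is a generic second-derivative penalty, not LaneCPP's exact loss:

```python
import numpy as np

def smoothness_loss(pts):
    """Penalize large discrete second derivatives along a sampled lane curve."""
    d2 = pts[2:] - 2.0 * pts[1:-1] + pts[:-2]   # finite-difference curvature proxy
    return np.mean(np.abs(d2))

y = np.linspace(0.0, 50.0, 20)
straight = np.stack([y, np.zeros_like(y)], axis=-1)   # straight lane: zero loss
kinked = straight.copy()
kinked[10, 1] += 1.0                                   # one sharp lateral kink

print(smoothness_loss(straight), smoothness_loss(kinked))
```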

Temporal smoothness is encouraged by maintaining an exponentially weighted moving average (EMA) of predicted curves:

$\bar f_i^{(t)}(s) = \alpha\,\bar f_i^{(t-1)}(s) + (1-\alpha)\,f_i^{(t)}(s)$

with loss

$\mathcal{L}_{\mathrm{temp}} = \dfrac{1}{N} \sum_{i=1}^{N} \int_0^1 \bar f_{v,i}^{(t)}(s)\,\big\|f_{3D,i}(s) - \bar f_{3D,i}^{(t)}(s)\big\|_1\,ds$

encouraging temporal consistency in both geometry and per-point visibility.
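In practice the integral is discretized over sampled curve points. A minimal sketch for one lane, with illustrative values ($\alpha$, sample count, and the drift magnitude are assumptions):

```python
import numpy as np

alpha, S = 0.9, 100                    # EMA decay and number of curve samples
s = np.linspace(0.0, 1.0, S)

def temporal_loss(f_curr, f_ema, vis_ema):
    """Discretized L_temp: visibility-weighted L1 distance to the EMA curve."""
    return np.mean(vis_ema * np.abs(f_curr - f_ema).sum(axis=-1))

f_ema = np.stack([s * 10.0, np.zeros(S), np.zeros(S)], axis=-1)  # running average curve
vis_ema = np.ones(S)                                             # all points visible

# The new prediction drifts 0.2 m laterally; the loss penalizes the drift.
f_curr = f_ema + np.array([0.2, 0.0, 0.0])
loss = temporal_loss(f_curr, f_ema, vis_ema)

# EMA update after the training step.
f_ema = alpha * f_ema + (1.0 - alpha) * f_curr
print(loss)   # mean lateral drift over the curve
```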

6. Dataset Creation and Auto-Labeling Pipeline

SparseLaneSTP introduces an auto-labeling pipeline to address deficiencies in existing 3D lane datasets. The process is as follows:

  1. Apply a state-of-the-art 2D lane detector (e.g., LaneATT) to extract near-range lane points with high confidence.
  2. Estimate the vehicle trajectory $\{\mathbf{E}^{(t)}\}$ via visual odometry or SLAM.
  3. Model the road surface as piecewise planar using ego-pose sequences.
  4. Project 2D lane detections onto the local road-surface plane to obtain 3D points.
  5. Accumulate points over $T$ frames and use a Kalman-filter-based tracker to form consistent lane tracks and control-point candidates.
  6. Optionally perform semantic segmentation for occlusion annotation.
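Step 4 above is ray-plane intersection: back-project the pixel through the camera intrinsics and intersect the viewing ray with the local road plane. A sketch with hypothetical intrinsics and plane parameters (camera at the origin, $y$ pointing down):

```python
import numpy as np

K = np.array([[1000.0,    0.0, 640.0],   # hypothetical pinhole intrinsics
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

def pixel_to_road(u, v, n, d, K):
    """Intersect the camera ray through pixel (u, v) with the plane n.X = d.

    Models the local road patch as a plane (piecewise-planar road model)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction, camera frame
    t = d / (n @ ray)                                # ray parameter at intersection
    return t * ray                                   # 3D point on the road surface

# Flat road 1.5 m below the camera: plane y = 1.5 (y points down).
n, d = np.array([0.0, 1.0, 0.0]), 1.5
X = pixel_to_road(700.0, 500.0, n, d, K)
print(X)
```

Because the plane is refit per road patch from the ego-pose sequence, the projection tolerates moderate road non-planarity that a single global ground plane would not.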

The dataset comprises approximately 500K images spanning diverse scenarios, annotated with lane category, pointwise visibility (including occlusion), persistent track IDs, camera intrinsics/extrinsics, and pose logs, with a lane labeling range of up to 250 m.

7. Experimental Evaluation

SparseLaneSTP is evaluated on OpenLane (Waymo), ONCE-3DLanes, and the newly introduced dataset. Key metrics include F1-score (based on IoU of sampled curve points), $x$/$z$ errors in near (0–40 m) and far (40–100 m) ranges, Chamfer Distance (CD), and Visibility-IoU (Vis-IoU) for occlusion handling.

| Dataset | F1 (%) | F1 second-best (%) | CD (m) | $x$-err far (m) | $z$-err far (m) | Vis-IoU (%) |
|---|---|---|---|---|---|---|
| OpenLane | 66.1 | ~64 | 0.240 | 0.092 | | |
| ONCE-3DLanes | 82.75 | | 0.048 | | | |
| New 3D Lane DS | 68.2 | | | | | 81.4 |

Ablation studies show the continuous spline representation gives a +1.1% F1 gain over a discrete baseline, STA (SLA+PNA+TCA) adds +3.2%, and full spatial+temporal regularization contributes a further +0.3%. Within STA, SLA+PNA bring +0.9% and TCA a further +1.2%. The optimal memory length is $T=3$ frames.

Qualitative results indicate:

  • Continuous curves yield precise visibility boundaries,
  • Temporal queries enhance lane persistence under occlusion or faded markings,
  • Spatial priors ensure robust parallelism and curvature in complex geometry (e.g., merges, splits) (Pittner et al., 8 Jan 2026).