
SparseLaneSTP: 3D Lane Detection Framework

Updated 15 January 2026
  • The paper demonstrates a novel 3D lane detection framework that integrates spatial structure and temporal evolution using a DETR-style sparse transformer.
  • It introduces a continuous spline-based lane representation along with spatial and temporal regularization to enhance detection accuracy under challenging conditions.
  • Experiments on multiple datasets show improved F1-scores and robustness compared to traditional dense bird’s-eye-view methods.

SparseLaneSTP is a 3D lane detection framework designed for autonomous driving scenarios, integrating spatio-temporal geometric priors into a sparse transformer architecture. The method advances over dense bird’s-eye-view (BEV) pipelines by directly operating on front-view features using DETR-style sparse queries, while embedding both spatial (lane structure) and temporal (lane evolution across time) knowledge. It further introduces a continuous spline-based lane representation and a dedicated regularization strategy, supported by an auto-labeled, large-scale 3D lane dataset for performance benchmarking (Pittner et al., 8 Jan 2026).

1. Motivation and Problem Formulation

3D lane detection seeks to recover both the precise layout of lane markings and the underlying 3D road surface, expressed in a vehicle-centric coordinate frame. Conventional dense BEV approaches transform front-view image features via Inverse Perspective Mapping (IPM) or learned lifts to generate a BEV feature map. Such methods are limited by real-world road non-planarity and misalignment between feature maps and the 3D road geometry, resulting in suboptimal detection performance. Sparse DETR-style detectors address these geometric distortions by having queries attend directly to image features, sidestepping BEV altogether. However, prior approaches neglect two critical sources of domain knowledge: spatial priors—such as lane continuity, smoothness, and parallelism—and temporal priors derived from historical lane observations, which are critical for robust detection under ambiguous visual conditions caused by occlusion or low visibility.

SparseLaneSTP addresses this by integrating spatial structure and temporal evolution through a transformer-based architecture with tailored attention mechanisms and continuous curve parameterization.

2. Model Architecture and Workflow

The input is an RGB image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, processed by a convolutional backbone (e.g., ResNet-50) to produce a feature map $\mathbf{F}\in\mathbb{R}^{H_F\times W_F\times C}$. A lane instance segmentation branch transforms $\mathbf{F}$ into initial query embeddings $^{0}\mathbf{Q}\in\mathbb{R}^{N\times M\times C}$, with $N$ lanes and $M$ control points per lane. Each query embedding is decoded by an MLP to produce control-point parameters $^{0}\mathbf{P}_{i,j}\in\mathbb{R}^{4}$ (encoding 3D position and visibility).

The decoder consists of $L$ transformer layers, refining the queries and control points via the following sequence:

  1. Spatio-Temporal Attention (STA):

$^{l}\mathbf{Q}_{\mathrm{STA}} = \mathrm{STA}\left(^{l-1}\mathbf{Q},\ ^{l-1}\mathbf{P},\ \mathbf{Q}_{\mathrm{Mem}},\ \mathbf{P}_{\mathrm{Mem}}\right)$

  2. Deformable Cross Attention (DCA):

$^{l}\mathbf{Q}_{\mathrm{DCA}} = \mathrm{DCA}\left(^{l}\mathbf{Q}_{\mathrm{STA}},\ \mathbf{F},\ ^{l-1}\mathbf{P}_{2D}\right)$

  3. Feedforward network and normalization:

$^{l}\mathbf{Q} = \mathrm{FFN}\left(^{l}\mathbf{Q}_{\mathrm{DCA}}\right)$

After each layer, control-point parameters and classification logits are regressed via a shared MLP.
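The per-layer refinement loop above can be sketched as follows. The submodules are identity-style stand-ins (the real STA, DCA, FFN, and regression head are learned networks), and all shapes, names, and values are illustrative, not the paper's implementation:

```python
import numpy as np

# Shapes follow the paper's notation: N lanes, M control points, C channels.
N, M, C, L = 4, 10, 32, 3
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the learned submodules.
def sta(Q, P, Q_mem, P_mem):      # spatio-temporal attention
    return Q
def dca(Q, F, P2d):               # deformable cross attention over image features
    return Q
def ffn(Q):                       # feed-forward network + normalization
    return np.tanh(Q)
def head(Q):                      # shared MLP: query -> control-point parameters
    return Q[..., :4]

Q = rng.normal(size=(N, M, C))           # initial queries from the segmentation branch
P = head(Q)                              # initial control-point parameters
F = rng.normal(size=(48, 80, C))         # backbone feature map
Q_mem, P_mem = Q.copy(), P.copy()        # FIFO memory from past frames

outputs = []
for layer in range(L):                   # L decoder layers refine queries and points
    Q = sta(Q, P, Q_mem, P_mem)
    Q = dca(Q, F, P[..., :2])            # projected control points guide feature sampling
    Q = ffn(Q)
    P = head(Q)                          # re-regress control points after every layer
    outputs.append(P)                    # per-layer predictions for auxiliary supervision

print(len(outputs), outputs[-1].shape)
```

Re-regressing the control points after every layer lets each subsequent layer attend around progressively better geometric estimates.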

3. Lane-Specific Spatio-Temporal Attention

A FIFO memory buffer stores the top $N_{\mathrm{Mem}}$ queries and control points from each of the past $T$ frames:

$\mathbf{Q}_{\mathrm{Mem}} = \left[\mathbf{Q}^{(t-1)};\dots;\mathbf{Q}^{(t-T)}\right] \in \mathbb{R}^{T N_{\mathrm{Mem}}\times M\times C}$

To account for ego-motion, past 3D control points are re-expressed in the current vehicle frame:

$\mathbf{P}_{ij}^{(t-k)\to t} = \left[\left(\mathbf{E}_{\mathrm{inv}}^{(t)}\,\mathbf{E}^{(t-k)}\,\mathbf{P}_{3D,ij}^{(t-k)}\right)\ \big|\ \mathbf{P}_{v,ij}^{(t-k)}\right]$

Queries are augmented with position encodings: $\widetilde{\mathbf{Q}} = \mathbf{Q} + \mathrm{PE}(\mathbf{P})$.
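The ego-motion compensation amounts to chaining two homogeneous pose matrices: map the old point into the world frame, then back into the current vehicle frame, carrying the visibility flag along unchanged. A minimal sketch, assuming planar motion and a hypothetical `make_pose` helper (frame conventions and values are illustrative):

```python
import numpy as np

def make_pose(yaw, t):
    """4x4 homogeneous ego pose E: vehicle frame -> world frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    E = np.eye(4)
    E[:2, :2] = [[c, -s], [s, c]]
    E[:3, 3] = t
    return E

# inv(E^(t)) @ E^(t-k) maps points from the old vehicle frame into the current one.
E_prev = make_pose(0.0, [0.0, 0.0, 0.0])   # pose at frame t-k
E_curr = make_pose(0.0, [2.0, 0.0, 0.0])   # ego moved 2 m forward in x since then

P_prev = np.array([10.0, 1.5, 0.1, 1.0])   # homogeneous 3D control point, old frame
P_curr = np.linalg.inv(E_curr) @ E_prev @ P_prev
print(P_curr[:3])   # -> [8.  1.5 0.1]: the point is now 2 m closer
```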

STA is decomposed into three structured submodules:

  • Same-Line Attention (SLA): restricts attention to within the $M$ control points of each lane.
  • Parallel-Neighbor Attention (PNA): enables attention to the nearest control points on adjacent (parallel) lanes.
  • Temporal Cross-Attention (TCA): allows current queries to attend to the nearest historical points across $T$ frames.

The attention for a single head is given by:

$A_{ij}^{t,t'} = \dfrac{(Q_i^t W_Q)(K_j^{t'} W_K)^{\top}}{\sqrt{d_k}} + P_{\mathrm{spatial}}(i,j) + P_{\mathrm{temporal}}(t,t')$

where

  • $Q_i^t, K_j^{t'}$ are position-encoded queries and keys,
  • $W_Q, W_K$ are linear projections,
  • $d_k$ is the head dimension,
  • $P_{\mathrm{spatial}}(i,j)$ and $P_{\mathrm{temporal}}(t,t')$ are learnable or fixed biases penalizing invalid relations.
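One common way to realize such biases is an additive mask on the pre-softmax scores: a large negative constant effectively zeroes the attention weight on disallowed pairs. The sketch below applies a Same-Line-Attention-style mask that blocks a neighbor lane's points; all shapes and the masking constant are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def masked_attention_scores(Q, K, d_k, bias):
    """Single-head scores with an additive structural bias term."""
    return Q @ K.T / np.sqrt(d_k) + bias

M, d_k = 4, 8
rng = np.random.default_rng(1)
Q = rng.normal(size=(M, d_k))            # position-encoded queries (one lane)
K = rng.normal(size=(2 * M, d_k))        # keys: own lane's points + a neighbor lane's

# Same-Line-Attention-style bias: block the neighbor lane's points entirely.
bias_sla = np.concatenate([np.zeros((M, M)), np.full((M, M), -1e9)], axis=1)
scores = masked_attention_scores(Q, K, d_k, bias_sla)

# Softmax over keys; masked entries underflow to zero weight.
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

print(attn[:, M:].max())   # weight on masked (cross-lane) keys is ~0
```

Swapping the bias pattern (nearest-neighbor columns for PNA, historical key blocks for TCA) yields the other two submodules under the same score formula.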

4. Continuous Lane Representation

Each lane $i$ comprises $M$ control points $P_i\in\mathbb{R}^{M\times 4}$, with $(x_{ij}, y_{ij}, z_{ij}, v_{ij})$ per row. The $y_{ij}$ are fixed on a uniform grid over $[y_s, y_e]$, so the network regresses only $x$, $z$, and visibility $v$. The lane is represented as a Catmull-Rom spline:

$f_i(s) = [s^3\ s^2\ s\ 1]\,M_{\mathrm{CR}}(s)\,P_i$

with precomputed basis $M_{\mathrm{CR}}(s)\in\mathbb{R}^{4\times M}$. Sampling $s$ on a fine grid reconstructs the continuous curve.
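A per-segment evaluation of a uniform Catmull-Rom spline looks as follows; the spline interpolates its interior control points exactly ($s=0$ and $s=1$ hit consecutive control points). The basis matrix here is the standard uniform Catmull-Rom choice, which may differ from the paper's exact parameterization:

```python
import numpy as np

# Standard uniform Catmull-Rom basis (an assumption; the paper's M_CR may differ).
B = 0.5 * np.array([[-1.0,  3.0, -3.0,  1.0],
                    [ 2.0, -5.0,  4.0, -1.0],
                    [-1.0,  0.0,  1.0,  0.0],
                    [ 0.0,  2.0,  0.0,  0.0]])

def catmull_rom(ctrl, s, seg):
    """Evaluate the spline at local parameter s in [0, 1] within segment `seg`.

    ctrl: (M, d) control points; the segment uses points seg-1 .. seg+2."""
    window = ctrl[seg - 1: seg + 3]                     # (4, d) local window
    return np.array([s**3, s**2, s, 1.0]) @ B @ window

# Toy lane: y fixed on a uniform grid; the network would regress the second column.
ctrl = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
p0 = catmull_rom(ctrl, 0.0, 1)   # s=0 hits control point 1 exactly
p1 = catmull_rom(ctrl, 1.0, 1)   # s=1 hits control point 2 exactly
print(p0, p1)
```

Interpolation (rather than approximation, as with B-splines) means the regressed control points lie on the predicted lane, which keeps the point-wise visibility attributes attached to actual curve locations.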

5. Spatial and Temporal Regularization

Spatial priors are enforced using losses from LaneCPP, including:

  • Parallelism loss: maintains near-constant lateral lane separation,
  • Smoothness loss: penalizes large second derivatives,
  • Curvature penalty: discourages excessive curvature.
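The smoothness prior can be discretized with finite differences on sampled curve points; the sketch below is a generic second-derivative penalty, not LaneCPP's exact loss:

```python
import numpy as np

def smoothness_loss(pts):
    """Penalize large discrete second derivatives along a sampled lane curve."""
    d2 = pts[2:] - 2.0 * pts[1:-1] + pts[:-2]   # finite-difference curvature proxy
    return np.mean(np.abs(d2))

y = np.linspace(0.0, 50.0, 20)
straight = np.stack([y, np.zeros_like(y)], axis=-1)   # straight lane: zero loss
kinked = straight.copy()
kinked[10, 1] += 1.0                                   # one sharp lateral kink

print(smoothness_loss(straight), smoothness_loss(kinked))
```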

Temporal smoothness is encouraged by maintaining an exponentially weighted moving average (EMA) of predicted curves:

$\bar f_i^{(t)}(s) = \alpha\,\bar f_i^{(t-1)}(s) + (1-\alpha)\,f_i^{(t)}(s)$

with loss

$\mathcal{L}_{\mathrm{temp}} = \dfrac{1}{N} \sum_{i=1}^{N} \int_0^1 \bar f_{v,i}^{(t)}(s)\,\big\|f_{3D,i}(s) - \bar f_{3D,i}^{(t)}(s)\big\|_1\,ds$

encouraging temporal consistency in both geometry and per-point visibility.
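In practice the integral is discretized over sampled curve points. A minimal sketch for one lane, with illustrative values ($\alpha$, sample count, and the drift magnitude are assumptions):

```python
import numpy as np

alpha, S = 0.9, 100                    # EMA decay and number of curve samples
s = np.linspace(0.0, 1.0, S)

def temporal_loss(f_curr, f_ema, vis_ema):
    """Discretized L_temp: visibility-weighted L1 distance to the EMA curve."""
    return np.mean(vis_ema * np.abs(f_curr - f_ema).sum(axis=-1))

f_ema = np.stack([s * 10.0, np.zeros(S), np.zeros(S)], axis=-1)  # running average curve
vis_ema = np.ones(S)                                             # all points visible

# The new prediction drifts 0.2 m laterally; the loss penalizes the drift.
f_curr = f_ema + np.array([0.2, 0.0, 0.0])
loss = temporal_loss(f_curr, f_ema, vis_ema)

# EMA update after the training step.
f_ema = alpha * f_ema + (1.0 - alpha) * f_curr
print(loss)   # mean lateral drift over the curve
```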

6. Dataset Creation and Auto-Labeling Pipeline

SparseLaneSTP introduces an auto-labeling pipeline to address deficiencies in existing 3D lane datasets. The process is as follows:

  1. Apply a state-of-the-art 2D lane detector (e.g., LaneATT) to extract near-range lane points with high confidence.
  2. Estimate the vehicle trajectory $\{\mathbf{E}^{(t)}\}$ via visual odometry or SLAM.
  3. Model the road surface as piecewise planar using ego-pose sequences.
  4. Project 2D lane detections onto the local road-surface plane to obtain 3D points.
  5. Accumulate points over $T$ frames and use a Kalman-filter-based tracker to form consistent lane tracks and control-point candidates.
  6. Optionally perform semantic segmentation for occlusion annotation.
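Step 4 above is ray-plane intersection: back-project the pixel through the camera intrinsics and intersect the viewing ray with the local road plane. A sketch with hypothetical intrinsics and plane parameters (camera at the origin, $y$ pointing down):

```python
import numpy as np

K = np.array([[1000.0,    0.0, 640.0],   # hypothetical pinhole intrinsics
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

def pixel_to_road(u, v, n, d, K):
    """Intersect the camera ray through pixel (u, v) with the plane n.X = d.

    Models the local road patch as a plane (piecewise-planar road model)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction, camera frame
    t = d / (n @ ray)                                # ray parameter at intersection
    return t * ray                                   # 3D point on the road surface

# Flat road 1.5 m below the camera: plane y = 1.5 (y points down).
n, d = np.array([0.0, 1.0, 0.0]), 1.5
X = pixel_to_road(700.0, 500.0, n, d, K)
print(X)
```

Because the plane is refit per road patch from the ego-pose sequence, the projection tolerates moderate road non-planarity that a single global ground plane would not.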

The dataset comprises approximately 500K images spanning diverse scenarios, annotated with lane category, pointwise visibility (including occlusion), persistent track IDs, camera intrinsics/extrinsics, and pose logs, with a lane labeling range of up to 250 m.

7. Experimental Evaluation

SparseLaneSTP is evaluated on OpenLane (Waymo), ONCE-3DLanes, and the newly introduced dataset. Key metrics include F1-score (based on IoU of sampled curve points), $x$/$z$ errors in near (0–40 m) and far (40–100 m) ranges, Chamfer Distance (CD), and Visibility-IoU (Vis-IoU) for occlusion handling.

| Dataset | F1 (%) | F1 second-best (%) | CD (m) | $x$-err far (m) | $z$-err far (m) | Vis-IoU (%) |
|---|---|---|---|---|---|---|
| OpenLane | 66.1 | ~64 | 0.240 | 0.092 | | |
| ONCE-3DLanes | 82.75 | | 0.048 | | | |
| New 3D Lane DS | 68.2 | | | | | 81.4 |

Ablation studies show the continuous spline representation gives a +1.1% F1 gain over a discrete baseline, STA (SLA+PNA+TCA) adds +3.2%, and full spatial+temporal regularization contributes a further +0.3%. Within STA, SLA+PNA bring +0.9% and TCA a further +1.2%. The optimal memory length is $T=3$ frames.

Qualitative results indicate:

  • Continuous curves yield precise visibility boundaries,
  • Temporal queries enhance lane persistence under occlusion or faded markings,
  • Spatial priors ensure robust parallelism and curvature in complex geometry (e.g., merges, splits) (Pittner et al., 8 Jan 2026).