SparseLaneSTP: 3D Lane Detection Framework
- The paper presents a novel 3D lane detection framework that integrates spatial structure and temporal evolution using a DETR-style sparse transformer.
- It introduces a continuous spline-based lane representation along with spatial and temporal regularization to enhance detection accuracy under challenging conditions.
- Experiments on multiple datasets show improved F1-scores and robustness compared to traditional dense bird’s-eye-view methods.
SparseLaneSTP is a 3D lane detection framework designed for autonomous driving scenarios, integrating spatio-temporal geometric priors into a sparse transformer architecture. The method advances over dense bird’s-eye-view (BEV) pipelines by directly operating on front-view features using DETR-style sparse queries, while embedding both spatial (lane structure) and temporal (lane evolution across time) knowledge. It further introduces a continuous spline-based lane representation and a dedicated regularization strategy, supported by an auto-labeled, large-scale 3D lane dataset for performance benchmarking (Pittner et al., 8 Jan 2026).
1. Motivation and Problem Formulation
3D lane detection seeks to recover both the precise layout of lane markings and the underlying 3D road surface, expressed in a vehicle-centric coordinate frame. Conventional dense BEV approaches transform front-view image features via Inverse Perspective Mapping (IPM) or learned lifts to generate a BEV feature map. Such methods are limited by real-world road non-planarity and misalignment between feature maps and the 3D road geometry, resulting in suboptimal detection performance. Sparse DETR-style detectors address these geometric distortions by having queries attend directly to image features, sidestepping BEV altogether. However, prior approaches neglect two critical sources of domain knowledge: spatial priors—such as lane continuity, smoothness, and parallelism—and temporal priors derived from historical lane observations, which are critical for robust detection under ambiguous visual conditions caused by occlusion or low visibility.
SparseLaneSTP addresses this by integrating spatial structure and temporal evolution through a transformer-based architecture with tailored attention mechanisms and continuous curve parameterization.
2. Model Architecture and Workflow
The input is an RGB image processed by a convolutional backbone (e.g., ResNet-50) to produce a front-view feature map $F$. A lane instance segmentation branch transforms $F$ into initial query embeddings $Q$, with $N$ lanes and $M$ control points per lane. Each query embedding is decoded by an MLP to produce control-point parameters encoding 3D position and visibility.
The decoder consists of $L$ stacked transformer layers, each refining the queries and control points via the following sequence:
- Spatio-Temporal Attention (STA), exchanging information among current and historical lane queries,
- Deformable Cross Attention (DCA), sampling front-view image features around the projected control points,
- a feedforward network with normalization.
After each layer, control-point parameters and classification logits are regressed via a shared MLP.
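As a rough sketch of this refinement loop (all shapes, the layer count, and the stand-in layer body are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

# Toy sketch of the decoder refinement loop: N_LANES lane queries with
# N_CTRL control points each, refined over N_LAYERS layers, then regressed
# to per-point parameters by a shared head. All values are made up.
N_LANES, N_CTRL, DIM, N_LAYERS = 4, 10, 32, 3
rng = np.random.default_rng(0)

queries = rng.normal(size=(N_LANES, N_CTRL, DIM))  # initial query embeddings
W_reg = rng.normal(size=(DIM, 3)) * 0.01           # shared head: (x, z, visibility)

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

for _ in range(N_LAYERS):
    # placeholder for the STA -> DCA -> FFN sequence of one decoder layer
    queries = layer_norm(queries + 0.1 * np.tanh(queries))
    # regress per-point parameters after every layer (auxiliary supervision)
    params = queries @ W_reg

print(params.shape)
```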
3. Lane-Specific Spatio-Temporal Attention
A FIFO memory buffer stores the top-$k$ scoring queries and their control points from each of the past $T$ frames.
To account for ego-motion, past 3D control points are re-expressed in the current vehicle frame via the relative rigid transform $(R, t)$: $\tilde{p} = R\,p + t$.
Queries are augmented with position encodings of their control points: $\tilde{q} = q + \mathrm{PE}(p)$.
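The ego-motion compensation is a standard rigid transform; a minimal sketch (the coordinate convention and sample values are assumptions):

```python
import numpy as np

def to_current_frame(points, R, t):
    """Re-express past 3D control points in the current vehicle frame,
    given the relative ego-motion (rotation R, translation t)."""
    return points @ R.T + t

# Example: pure forward motion of 2 m along the driving axis (here y)
# shifts past control points 2 m closer to the vehicle; no rotation.
pts = np.array([[0.0, 10.0, 0.1],
                [0.5, 20.0, 0.2]])   # (x, y, z) in the past frame
moved = to_current_frame(pts, np.eye(3), np.array([0.0, -2.0, 0.0]))
```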
STA is decomposed into three structured submodules:
- Same-Line Attention (SLA): restricts attention to control points of the same lane.
- Parallel-Neighbor Attention (PNA): attends to the nearest control points on adjacent (parallel) lanes.
- Temporal Cross-Attention (TCA): lets current queries attend to the nearest historical control points across buffered frames.
The attention for a single head is given by

$$\mathrm{Attn}(\tilde{Q}, \tilde{K}, V) = \mathrm{softmax}\!\left(\frac{(W_Q\tilde{Q})(W_K\tilde{K})^\top}{\sqrt{d_h}} + B\right) W_V V,$$

where
- $\tilde{Q}, \tilde{K}$ are position-encoded queries and keys,
- $W_Q, W_K, W_V$ are linear projections,
- $d_h$ is the head dimension,
- and $B$ collects learnable or fixed biases penalizing invalid relations (e.g., cross-lane pairs in SLA).
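A minimal single-head sketch of such biased attention; the block-diagonal bias emulating Same-Line Attention (two lanes of two control points each) is an illustrative assumption:

```python
import numpy as np

def biased_attention(q, k, v, bias):
    """Single-head attention with an additive bias; large negative
    bias entries effectively mask out invalid query-key relations."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    scores -= scores.max(-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v

# Same-Line-Attention-style bias: points attend only within their lane.
M, d = 4, 8
rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(M, d)) for _ in range(3))
bias = np.full((M, M), -1e9)
bias[:2, :2] = 0.0   # lane 0 attends within itself
bias[2:, 2:] = 0.0   # lane 1 attends within itself
out = biased_attention(q, k, v, bias)
```

With this bias, the output for lane 0's points is independent of lane 1's values, which is exactly the restriction SLA imposes.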
4. Continuous Lane Representation
Each lane comprises $M$ control points $(x_i, y_i, z_i)$, one per predefined row. The longitudinal coordinate $y_i$ is uniformly fixed in $[y_{\min}, y_{\max}]$, so the network regresses only $x_i$, $z_i$, and visibility $v_i \in [0, 1]$. The lane is represented as a Catmull-Rom spline

$$P(s) = \sum_i b_i(s)\, p_i,$$

with precomputed basis functions $b_i(s)$. Discrete sampling over finely quantized $s$ reconstructs the continuous curve.
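A Catmull-Rom segment is evaluated from four consecutive control points using the standard cubic basis; the 2D sample points below are illustrative:

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, s):
    """Evaluate the Catmull-Rom segment between p1 and p2 at s in [0, 1],
    using the standard cubic basis (tangents come from p0 and p3)."""
    s = np.asarray(s)[:, None]
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * s
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * s**2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * s**3)

# Four control points of one lane in (x, z); densely sampling s
# reconstructs the continuous curve between the middle two points.
pts = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.3], [3.0, 0.2]])
curve = catmull_rom(*pts, np.linspace(0.0, 1.0, 50))
```

Catmull-Rom splines interpolate their interior control points, so the sampled segment starts exactly at `pts[1]` and ends at `pts[2]`.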
5. Spatial and Temporal Regularization
Spatial priors are enforced using losses from LaneCPP, including:
- Parallelism loss: maintains near-constant lateral lane separation,
- Smoothness loss: penalizes large second derivatives,
- Curvature penalty: discourages excessive curvature.
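These priors reduce to simple penalties on sampled curve coordinates; the discrete forms below are a sketch, not LaneCPP's exact losses:

```python
import numpy as np

def smoothness_loss(x):
    """Penalize large discrete second derivatives along a sampled lane."""
    d2 = x[2:] - 2.0 * x[1:-1] + x[:-2]
    return float(np.mean(d2 ** 2))

def parallelism_loss(x_left, x_right):
    """Penalize variation in lateral separation between adjacent lanes."""
    return float(np.var(x_right - x_left))

# A straight lane has (near-)zero smoothness loss; two lanes with a
# constant lateral offset have zero parallelism loss.
x = np.linspace(0.0, 5.0, 10)
```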
Temporal smoothness is encouraged by maintaining an exponentially weighted moving average (EMA) of predicted curves,

$$\bar{c}_t = \alpha\,\bar{c}_{t-1} + (1 - \alpha)\,c_t,$$

with loss

$$\mathcal{L}_{\mathrm{temp}} = \left\| c_t - \bar{c}_{t-1} \right\|^2,$$

encouraging temporal consistency in both geometry and per-point visibility.
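The EMA update and consistency loss can be sketched as follows (the smoothing factor and L2 form are assumptions):

```python
import numpy as np

def ema_update(prev_ema, curve, alpha=0.9):
    """Exponentially weighted moving average of predicted curve points."""
    return alpha * prev_ema + (1.0 - alpha) * curve

def temporal_loss(curve, ema):
    """Mean squared deviation of the current prediction from its EMA."""
    return float(np.mean((curve - ema) ** 2))

# For a static scene the EMA converges to the (constant) prediction,
# driving the temporal loss toward zero.
ema = np.zeros(5)
for _ in range(100):
    ema = ema_update(ema, np.ones(5))
```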
6. Dataset Creation and Auto-Labeling Pipeline
SparseLaneSTP introduces an auto-labeling pipeline to address deficiencies in existing 3D lane datasets. The process is as follows:
- Apply a state-of-the-art 2D lane detector (e.g., LaneATT) to extract near-range lane points with high confidence,
- Estimate vehicle trajectory via visual odometry or SLAM,
- Model the road surface as piecewise planar using ego-pose sequences,
- Project 2D lane detections onto the local road surface plane to obtain 3D points,
- Accumulate points over frames and use a Kalman-filter-based tracker to form consistent lane tracks and control point candidates,
- Optionally perform semantic segmentation for occlusion annotation.
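The projection step above amounts to a ray-plane intersection; a simplified flat-plane sketch (the intrinsics, camera height, and axis convention of x-right, y-down, z-forward are made-up assumptions standing in for the paper's piecewise-planar surface model):

```python
import numpy as np

def lift_to_plane(uv, K, cam_height):
    """Back-project a 2D image point onto a flat road plane.

    The road plane is y = cam_height in the camera frame (camera mounted
    cam_height meters above the road). Returns the 3D intersection point.
    """
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # viewing ray
    scale = cam_height / ray[1]          # intersect ray with y = cam_height
    return ray * scale                   # 3D point in the camera frame

# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
pt3d = lift_to_plane((640.0, 560.0), K, cam_height=1.5)
```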
The dataset comprises 500K images spanning diverse scenarios, annotated with lane category, pointwise visibility (including occlusion), persistent track IDs, camera intrinsics/extrinsics, and pose logs, with up to 250m lane labeling range.
7. Experimental Evaluation
SparseLaneSTP is evaluated on OpenLane (Waymo), ONCE-3DLanes, and the newly introduced dataset. Key metrics include F1-score (based on IoU of sampled curve points), lateral ($x$) and height ($z$) errors in near (0–40 m) and far (40–100 m) ranges, Chamfer Distance (CD), and Visibility-IoU (Vis-IoU) for occlusion handling.
| Dataset | F1 (%) | F1 (Second-Best, %) | CD (m) | x-err (far, m) | z-err (far, m) | Vis-IoU (%) |
|---|---|---|---|---|---|---|
| OpenLane | 66.1 | ∼64 | — | 0.240 | 0.092 | — |
| ONCE-3DLanes | 82.75 | — | 0.048 | — | — | — |
| New 3D Lane DS | 68.2 | — | — | — | — | 81.4 |
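The Chamfer Distance reported above can be computed between two sampled curves; the symmetric mean-of-minima form below is one common definition (conventions vary across papers):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between sampled curves a:(N,3), b:(M,3):
    average of the two directed mean nearest-neighbor distances."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# A lane sampled along 100 m; shifting it laterally by 0.1 m yields CD = 0.1.
curve = np.stack([np.zeros(20), np.linspace(0.0, 100.0, 20), np.zeros(20)], axis=1)
```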
Ablation studies show the continuous spline representation gives a +1.1% F1 gain over a discrete baseline, STA (SLA+PNA+TCA) adds +3.2%, and full spatial+temporal regularization contributes a further +0.3%. Within STA, SLA+PNA bring +0.9% and TCA a further +1.2%. The memory length $T$ is also ablated to select the best-performing number of history frames.
Qualitative results indicate:
- Continuous curves yield precise visibility boundaries,
- Temporal queries enhance lane persistence under occlusion or faded markings,
- Spatial priors ensure robust parallelism and curvature in complex geometry (e.g., merges, splits) (Pittner et al., 8 Jan 2026).