RoboSense2025 Challenge
- RoboSense2025 Challenge is a comprehensive benchmark for testing embodied robot perception and navigation under diverse, real-world conditions using multi-modal sensor data.
- It unifies five research tracks including language-guided driving, social navigation, sensor generalization, cross-modal correspondence, and cross-platform 3D detection with rigorous evaluation protocols.
- The challenge advances data-centric robustness, geometry-aware modeling, and self-training methods to tackle sensor noise, domain shifts, and platform discrepancies.
The RoboSense2025 Challenge is a landmark benchmark for robustness and adaptability in embodied robot perception and navigation, spanning multi-modal sensing, domain adaptation, and egocentric reasoning under real-world variation. This challenge integrates standardized datasets, evaluation protocols, and five rigorous research tracks to aggregate methodological insights and promote reliable autonomous system design in dynamic, unconstrained environments (Kong et al., 8 Jan 2026, Su et al., 2024). The competition attracted broad participation (143 teams, 85 institutions) and catalyzed advancements in geometry-aware modeling, data-centric robustness, and unsupervised adaptation.
1. Scope, Objectives, and Motivation
The primary goal of RoboSense2025 is to advance embodied perception across sensor noise, viewpoint shifts, corrupted modalities, and platform changes—including vehicles, drones, and indoor robots. Most state-of-the-art models degrade under unstructured conditions due to domain mismatch, sensor configuration changes, or non-canonical environmental context. To combat these challenges, the competition unified five complementary tasks:
- Language-grounded driving perception and reasoning
- Socially compliant navigation in dynamic human-populated scenes
- Sensor placement generalization for LiDAR object detection
- Cross-modal scene correspondence (text-to-aerial image retrieval)
- Cross-platform 3D object detection via domain adaptation
The use of shared datasets and evaluation protocols enables reproducible, large-scale comparison of robust perception models (Kong et al., 8 Jan 2026).
2. Datasets, Sensor Modalities, and Annotation
The RoboSense dataset forms the foundation for the challenge, consolidating 133K synchronized multi-sensor frames (RGB, LiDAR, fisheye) collected from a social mobile robot (“robosweeper”) with a full horizontal view (Su et al., 2024). Core modalities include:
- Four pinhole cameras (1920×1080 RGB, 25 Hz, 111.78° × 63.16° FOV)
- Four fisheye cameras (1280×720 RGB, 25 Hz, 180° FOV)
- A top-mounted Hesai Pandar40M LiDAR (64 beams, 10 Hz, 384 kpps), three side-mounted Zvision ML30 LiDARs (40 beams, 10 Hz, 720 kpps, 286.48° horizontal FOV), and a Livox Horizon used for point-cloud densification
All sensors are globally time-synchronized (100 ms timestep) and calibrated with accurate extrinsics. The dataset comprises 7,619 sequences of 20 s each across six scene categories (including parks, squares, campuses, sidewalks, and streets), with a 50%/10%/40% train/val/test split; one scene type is reserved for domain-generalization testing (Su et al., 2024).
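With calibrated extrinsics and intrinsics, points from any LiDAR can be projected into any camera view, which is the basis for multi-modal fusion tasks later in the challenge. The sketch below shows the standard pinhole projection; the matrices and values are illustrative stand-ins, not the dataset's actual calibration format.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project 3D LiDAR points into a pinhole camera image.

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame.
    K:            (3, 3) camera intrinsic matrix.
    Returns (M, 2) pixel coordinates for points in front of the camera.
    """
    # Homogenize and move points into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    # Keep only points with positive depth (in front of the camera).
    pts_cam = pts_cam[pts_cam[:, 2] > 0]
    # Perspective projection through the intrinsics.
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

# Identity extrinsics and a toy intrinsic matrix (hypothetical values).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0], [0.0, 0.0, -5.0]])
uv = project_lidar_to_image(pts, np.eye(4), K)   # third point is behind the camera
```

In practice, each of the eight cameras would have its own `T_cam_lidar` and `K` from the dataset's calibration files.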
Annotation formats:
- 1.4M 3D bounding boxes for the Vehicle, Cyclist, and Pedestrian classes
- 216K unique trajectories (track IDs)
- Voxel-labeling for space occupancy (“occupied,” “free,” “unknown”) with semantic classes, privacy desensitization for camera frames
A three-stage labeling process involves PointPillar-based pre-detections, expert refinement, and validity checks on sensor visibility (Su et al., 2024).
3. Defined Tasks, Evaluation Metrics, and Protocols
Six standardized tasks are formulated for egocentric perception and prediction (Su et al., 2024). Each utilizes precise input/output specifications and metrics:
| Task | Input Modality | Output / Metric |
|---|---|---|
| Multi-view 3D Detection | 8 RGB camera images | 3D boxes; mAP, AOS, ASE via CCDP |
| LiDAR-only 3D Detection | 360° fused pointclouds | 3D boxes; mAP, AOS, ASE via CCDP |
| Multi-modal 3D Detection | RGB images + LiDAR pointcloud | Fused 3D boxes; mAP, AOS, ASE via CCDP |
| Multiple 3D Object Tracking | Per-frame detections | Track IDs; sAMOTA, AMOTP, MT/ML |
| Motion Forecasting | 1 s history of tracks | 3 s future trajectories; minADE, minFDE, MR, EPA |
| Occupancy Prediction | 1 s of RGB frames, calibration | Voxel states; class-wise IoU, mIoU-3D, mIoU-BEV |
The closest collision-point distance proportion (CCDP) criterion decides whether a detection is a true positive based on the distance between the predicted and ground-truth closest collision points, expressed as a proportion of the true box's closest collision-point distance; this emphasizes strict accuracy for near-field obstacles (Su et al., 2024). Each metric (mAP, AOS, ASE, sAMOTA, EPA, etc.) is formally defined in the reference guide.
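Since the exact CCDP formula is not reproduced here, the following is one plausible reading of the criterion, sketched in bird's-eye view: a prediction matches if its closest collision point lies within a fixed proportion of the ground-truth point's ego distance. The `ratio` value is a hypothetical choice.

```python
import numpy as np

def closest_point_to_ego(corners_bev):
    """BEV corner of a box that is nearest to the ego origin."""
    corners_bev = np.asarray(corners_bev, dtype=float)
    d = np.linalg.norm(corners_bev, axis=1)
    return corners_bev[np.argmin(d)]

def ccdp_match(pred_corners, gt_corners, ratio=0.1):
    """One plausible reading of CCDP: true positive if the predicted
    closest collision point lies within `ratio` times the GT point's
    ego distance -- a stricter test for nearby obstacles, since the
    tolerance shrinks as the box gets closer."""
    p = closest_point_to_ego(pred_corners)
    g = closest_point_to_ego(gt_corners)
    return float(np.linalg.norm(p - g)) <= ratio * float(np.linalg.norm(g))

# A GT box whose nearest corner is 5 m away tolerates 0.5 m of error.
gt = [[5.0, 0.0], [6.0, 0.0], [6.0, 1.0], [5.0, 1.0]]
pred_ok = [[5.3, 0.0], [6.3, 0.0], [6.3, 1.0], [5.3, 1.0]]   # 0.3 m off
pred_bad = [[5.8, 0.0], [6.8, 0.0], [6.8, 1.0], [5.8, 1.0]]  # 0.8 m off
```

The same 0.8 m error would pass for a box 50 m away, which is exactly the distance-proportional behavior the criterion is after.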
Unified protocols introduce:
- Domain shifts by systematic corruption or sensor configuration changes
- Sensor failures via dropout or occlusion
- Platform discrepancies through geometry normalization removal
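The first two perturbation types above can be emulated with simple augmentations at training or evaluation time. The sketch below shows the general shape of such perturbations (random camera dropout, Gaussian point jitter); the probabilities and noise levels are illustrative, not the protocol's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_sensor(frames, p_drop=0.25):
    """Simulate sensor failure: zero out each camera frame with prob p_drop."""
    mask = rng.random(len(frames)) >= p_drop
    return [f if keep else np.zeros_like(f) for f, keep in zip(frames, mask)]

def jitter_points(points, sigma=0.02):
    """Simulate sensor noise: Gaussian jitter on LiDAR coordinates."""
    return points + rng.normal(0.0, sigma, size=points.shape)

frames = [np.ones((4, 4, 3)) for _ in range(8)]   # stand-ins for 8 RGB views
noisy = jitter_points(np.zeros((100, 3)), sigma=0.02)
dropped = drop_sensor(frames, p_drop=0.5)
```

Applying these stochastically during training is the "data-centric robustness" recipe several winning entries relied on.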
Baseline models for each task (e.g., BEVDepth, Transfusion-L, PointPillar, Falcon, PnPNet) are provided, and all test submissions are standardized and server-evaluated for fair comparison (Su et al., 2024, Kong et al., 8 Jan 2026).
4. Track Structure, Methods, and Benchmarking
Tracks and representative winning solution methods (Kong et al., 8 Jan 2026, Xiao et al., 9 Oct 2025, Su et al., 2024):
Track 1: Driving with Language
- A multi-view VLM (Qwen2.5-VL-7B) answers multiple-choice questions and open-text reasoning queries under sensor corruption.
- Weighted metric: MCQ/VQA accuracy, plus robustness to synthetic (motion blur, fog, occlusion) and real-world (seasonal) domain shifts.
Track 2: Social Navigation
- Egocentric RGB-D navigation, with the Falcon policy as baseline (DDPPO, ResNet-50 encoder, GRU temporal model).
- Proactive Risk Perception Module (PRPM) augments policy with continuous collision risk scores per human, yielding a +0.0746 total-score lift and smoother margin-keeping compared to the baseline (Xiao et al., 9 Oct 2025).
- Key metrics: Success Rate (SR), Success weighted by Path Length (SPL), Personal Space Compliance (PSC), Human Collision Rate (H-Coll), aggregate Total score.
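PRPM's risk scores are learned; as a purely geometric stand-in, a continuous per-human risk can be derived from time-to-collision, which captures the same "close and closing in" signal the module feeds to the policy. Everything below (the function, the 3 s horizon) is an illustrative assumption, not the paper's model.

```python
import math

def collision_risk(rel_pos, rel_vel, horizon=3.0):
    """Heuristic per-human collision risk in [0, 1] from time-to-collision.
    Illustrative geometric stand-in for PRPM's learned scores.

    rel_pos: human position relative to the robot, metres.
    rel_vel: relative velocity, m/s.
    """
    dist = math.hypot(*rel_pos)
    # Closing speed: negative radial velocity means the human approaches.
    closing = -(rel_pos[0] * rel_vel[0] + rel_pos[1] * rel_vel[1]) / max(dist, 1e-6)
    if closing <= 0.0:
        return 0.0                      # moving apart or tangential: no risk
    ttc = dist / closing                # time to collision, seconds
    return max(0.0, min(1.0, 1.0 - ttc / horizon))

# A human 2 m ahead walking away poses no risk; the same human
# approaching at 1 m/s collides in 2 s and scores ~0.33.
```

A policy can then condition on the per-human maximum of these scores to keep smoother margins around pedestrians.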
Track 3: Sensor Placement Generalization
- LiDAR-only detection under novel sensor extrinsics at test time. Winning methods combine temporal sweep aggregation, placement-mixed training, and test-time adaptation (TTA) for substantial mAP gains, along with Gaussian-blob coordinate encoding to remove the "global shortcut."
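The "global shortcut" refers to a detector memorizing absolute positions that only hold for one sensor placement. One way to realize Gaussian-blob coordinate encoding is to replace a raw coordinate with soft activations over Gaussian bins, so the network sees a smoothed positional code rather than an exact global value. This is a sketch under that interpretation; the winning entry's exact formulation may differ, and the bin range is hypothetical.

```python
import numpy as np

def gaussian_blob_encode(coord, centers, sigma=1.0):
    """Encode a scalar coordinate as normalized soft activations over
    Gaussian bins, blunting any exact-global-position shortcut."""
    act = np.exp(-0.5 * ((coord - centers) / sigma) ** 2)
    return act / act.sum()

centers = np.linspace(-50.0, 50.0, 21)          # bins every 5 m (hypothetical)
code = gaussian_blob_encode(12.3, centers, sigma=5.0)
```

Each coordinate axis would be encoded independently and the codes concatenated as the network's positional input.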
Track 4: Cross-Modal Scene Correspondence
- Text-to-aerial retrieval via dual encoders, cross-attention, hierarchical contrastive and matching objectives, yielding Recall@1 improvement from 25.4% → 38.3%.
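At the core of such dual-encoder retrieval is a symmetric contrastive (InfoNCE) objective over paired text/aerial-image embeddings; the hierarchical and matching terms sit on top of it. A minimal NumPy sketch of that core loss, with an illustrative temperature:

```python
import numpy as np

def info_nce(text_emb, img_emb, tau=0.07):
    """Symmetric InfoNCE loss for a batch of paired text/image embeddings.
    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pulls them together and pushes mismatched pairs apart."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / tau                      # (B, B) cosine similarities
    labels = np.arange(len(t))                  # correct match = same index

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text->image and image->text directions.
    return 0.5 * (xent(logits) + xent(logits.T))

aligned = info_nce(np.eye(4), np.eye(4))        # near zero when pairs match
```

Recall@1 is then just the fraction of queries whose highest-similarity image is the paired one.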
Track 5: Cross-Platform 3D Detection
- Unsupervised domain adaptation from vehicle-trained detectors to drone and quadruped LiDAR using PV-RCNN++ with fused point–voxel feature abstraction (Feng et al., 13 Jan 2026).
- Cross-Platform Jitter Alignment simulates platform motion by pitch/roll randomization, and two-stage self-training with pseudo-labeling (ST3D) to continuously refine adaptation.
- Car AP: 62.67% (drone), 58.76% (quadruped); Pedestrian AP: 49.81%.
- Ablations reveal additive AP gains from CJA (+14.5% Car), ST3D (+10% Car), and AnchorHead (+4% Car) (Feng et al., 13 Jan 2026).
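The pitch/roll randomization behind Cross-Platform Jitter Alignment can be sketched as a small random rotation of the training point cloud, mimicking drone or quadruped body motion. The angle range below is a hypothetical choice, not the paper's setting.

```python
import numpy as np

def jitter_pitch_roll(points, max_deg=5.0, seed=0):
    """CJA-style augmentation: rotate the cloud by small random
    pitch/roll angles to simulate platform body motion (sketch)."""
    rng = np.random.default_rng(seed)
    pitch, roll = np.deg2rad(rng.uniform(-max_deg, max_deg, size=2))
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch about y
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])   # roll about x
    return points @ (Rx @ Ry).T

pts = np.random.default_rng(1).normal(size=(200, 3))
out = jitter_pitch_roll(pts)
```

Because the transform is a pure rotation, point norms are preserved; only the apparent ground-plane orientation changes, which is exactly the discrepancy the detector must learn to tolerate.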
5. Methodological Trends and Insights
Emergent methodological themes across tracks (Kong et al., 8 Jan 2026, Feng et al., 13 Jan 2026):
- Data-centric Robustness: Augmentations (temporal location-mixing, depth-cutout) yield substantial robustness without major architectural change.
- Geometry-awareness: Local coordinate canonicalization and ground-plane alignment enable reliable cross-platform adaptation.
- Language-grounded Reasoning: Chain-of-thought prompting, hierarchical contrast, and metadata grounding are effective for robust vision–language alignment under corruption.
- Self-training and UDA: Carefully tuned pseudo-label thresholds, teacher-student regularization, and meta-learning are vital for unsupervised adaptation stability.
- Modularity: Hybrid point–voxel fusion architectures outperform pure point-based models under severe domain shift.
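The "carefully tuned pseudo-label thresholds" theme can be made concrete with an ST3D-style triage step: keep confident detections as positives, drop confident background, and exclude the ambiguous middle band from the loss rather than training on it. The threshold values here are illustrative.

```python
def filter_pseudo_labels(detections, pos_thresh=0.7, neg_thresh=0.3):
    """ST3D-style pseudo-label triage for self-training (sketch).
    Detections scoring above pos_thresh become pseudo-positives;
    those in (neg_thresh, pos_thresh] are ignored by the loss;
    the rest are treated as background."""
    keep, ignore = [], []
    for det in detections:
        if det["score"] >= pos_thresh:
            keep.append(det)
        elif det["score"] > neg_thresh:
            ignore.append(det)          # ambiguous: excluded from the loss
    return keep, ignore

dets = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.5}, {"id": 3, "score": 0.1}]
keep, ignore = filter_pseudo_labels(dets)
```

The stability issues noted in Section 6 arise precisely when these thresholds are miscalibrated for rare classes, letting noisy pseudo-labels accumulate over self-training rounds.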
Key insights:
- Geometry normalization is a potent bias-removal tool for sensor and platform discrepancies.
- Most robustness gains are attributed to training recipes (augmentation, pretraining, self-training) over architectural innovation.
- Data-driven adaptation far outpaces hand-designed invariance functions in the presence of compound environmental shifts.
6. Open Challenges and Future Directions
Persistent challenges remain (Kong et al., 8 Jan 2026):
- Stable self-training under noisy pseudo-labels, especially for long-tail or rare objects.
- Robustness to compound domain shifts: simultaneous sensor corruption, viewpoint changes, and extrinsic shifts.
- Tail-risk evaluation and behavioral safety: need for uncertainty estimation and calibrated rejection (“I don’t know”) in VLMs and navigation policies.
- Explicit sensor-agnostic representation learning is immature, limiting adaptation without geometric supervision.
- Multi-step risk forecasting and scene-adaptive thresholding for social navigation remain open research areas (Xiao et al., 9 Oct 2025).
Recommended future avenues include compound robustness benchmarking, uncertainty-driven policy design, canonicalization layers for arbitrary sensor configuration, self-supervised geometry disentanglement, and end-to-end behavioral safety evaluation integrating multimodal uncertainties.
7. Impact, Participation, and Broader Context
RoboSense2025 has unified diverse research communities around standardized real-world benchmarking, driving methodological advances in robust, adaptive perception. Its large-scale dataset, public leaderboards, and rigorous evaluation have accelerated progress in geometry-aware fusion, social navigation, and cross-platform adaptation. The competition’s broad engagement and documentation of leading solutions suggest ongoing transfer of data-centric and modular design principles to subsequent embodied AI benchmarks. A plausible implication is continued adoption of these protocols in future multi-platform robotics research (Kong et al., 8 Jan 2026, Su et al., 2024).