
AdaBench: Driving & Anomaly Benchmarks

Updated 24 January 2026
  • AdaBench is a comprehensive benchmarking suite that evaluates multi-modal reasoning in autonomous driving and anomaly detection with state-of-the-art annotation and evaluation protocols.
  • It leverages large-scale, diversified datasets covering adverse weather, challenging scenes, and varied data types including images, tabular data, and text, ensuring rigorous evaluation.
  • Key results show that Chain-of-Thought prompting enhances reasoning accuracy in driving scenarios, while even small amounts of labeled supervision markedly improve anomaly detection under real-world noise.

AdaBench is an umbrella term encompassing two major benchmarks for structuring and evaluating reasoning and detection capabilities: AD²-Bench for autonomous driving and ADBench for anomaly detection. Each establishes a rigorous, large-scale platform for advancing research through systematic, multi-dimensional evaluation protocols and open-source accessibility.

1. Overview and Definition

AD²-Bench is the first Chain-of-Thought (CoT) benchmark explicitly designed for end-to-end multi-modal LLMs (MLLMs) in autonomous driving under adverse weather and complex scenes. It contains approximately 10,000 real-world driving images, 70,000 question–answer pairs, and 5,406 fully annotated CoT instances, with each step treated as an atomic unit possessing explicit ground truth. The benchmark fills gaps in existing AD datasets regarding (a) comprehensive adverse weather coverage (rain, snow, fog, night, etc.), (b) annotation quality supporting multi-step reasoning, and (c) evaluation frameworks for intermediate CoT reasoning steps under safety-critical conditions (Wei et al., 11 Jun 2025).

ADBench is a comprehensive anomaly detection benchmark that evaluates the robustness and performance of 30 algorithms over 57 datasets, covering diverse settings in terms of supervision levels, anomaly types, and noise/corruption. The goal is to allow statistically robust, fair, and reproducible comparisons, especially as prior work neglected supervision, anomaly typology, and real-world data corruption. ADBench includes tabular, computer vision, and natural language processing datasets, and open-sources all resources (Han et al., 2022).

2. Dataset Construction and Coverage

AD²-Bench aggregates ≈10,000 high-resolution driving images from sources such as CODA, ACDC, DAWN, BDD100K, nuScenes, and web-scraped real-weather imagery. Categories encompass adverse weather (light/moderate/heavy rain, snow, fog, sandstorm), challenging illumination (night, dawn, glare), and complex traffic (occlusions, multi-agent interactions, accidents, construction zones). The 33 sub-tasks, distributed across four hierarchical dimensions (Basic Perception, Advanced Perception, Relation Understanding, Reasoning & Decision), yield 70,000 QA pairs (Wei et al., 11 Jun 2025).

ADBench incorporates 47 well-established tabular anomaly detection datasets (healthcare, finance, intrusion, physical sciences, speech, image features, documents, etc.) and 10 contributed CV/NLP datasets, preprocessed via deep embeddings (ResNet18/ViT for CV; BERT/RoBERTa for NLP). Each dataset definition includes sample counts (≤10k), feature dimensions, anomaly rates, and domain metadata (Han et al., 2022).

| Benchmark | Coverage/Type | Example Modalities |
|---|---|---|
| AD²-Bench | Adverse weather, complex scenes, multi-prompt | Rain, fog, night, occlusion |
| ADBench | Tabular data, CV/NLP embeddings, anomaly types | Healthcare, finance, CIFAR, 20newsgroups |

This underscores the breadth and modality-specific design of each benchmark.

3. Task Structure, Annotation, and Methodology

AD²-Bench employs atomic annotation, decomposing each QA into $N$ intermediate reasoning steps ("atoms"), each annotated and cross-validated by multi-specialist teams. Each atom receives explicit ground truth, and LLM evaluators (e.g., GPT-4o) score output fluency and correctness. Multi-round "Atomic Flow" peer review enforces logical consistency, while visual prompt levels (image, region, point, text) isolate perception from reasoning. Point-level prompts offer the best cost-accuracy trade-off under severe occlusion, and chains average $N \approx 5$ steps (Wei et al., 11 Jun 2025).
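The atomic annotation scheme above can be sketched as a simple data structure: each QA instance carries a chain of atoms, each paired with its own ground truth. This is a minimal illustration; the class and field names are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    """One atomic reasoning step with its own explicit ground truth."""
    text: str          # the model-visible reasoning step
    ground_truth: str  # expert-annotated reference for this step

@dataclass
class CoTInstance:
    """A fully annotated chain-of-thought QA instance (hypothetical schema)."""
    question: str
    atoms: list[Atom] = field(default_factory=list)
    final_answer: str = ""

    def n_steps(self) -> int:
        return len(self.atoms)

# Example: a 3-step chain for a fog-scene question (content is illustrative)
inst = CoTInstance(
    question="Is it safe to change lanes now?",
    atoms=[
        Atom("Visibility is reduced by dense fog.", "fog, low visibility"),
        Atom("A truck occupies the adjacent lane.", "truck in left lane"),
        Atom("Therefore a lane change is unsafe.", "unsafe"),
    ],
    final_answer="No",
)
print(inst.n_steps())  # 3
```

Treating each atom as a separately scorable unit is what enables the step-level metrics described in Section 4.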

ADBench algorithm evaluation spans unsupervised (14), semi-supervised (7), and supervised (9) methods. Supervision levels range from no labels up to 100% labeled anomalies. Four controlled anomaly types are introduced (local, global, dependency, clustered), and noise/corruption scenarios (duplicated anomalies, irrelevant features, annotation errors) are systematically explored. Experimental design involves 98,436 runs (70/30 stratified splits; 3 repeats), with all randomness controlled and results averaged for reproducibility (Han et al., 2022).
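The 70/30 stratified splits with controlled randomness can be sketched as follows; this is a minimal pure-Python illustration of the idea, not ADBench's actual implementation.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """70/30 stratified split with a fixed seed (ADBench-style protocol).
    Returns train/test index lists preserving the anomaly ratio per class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        k = int(round(train_frac * len(idx)))
        train.extend(idx[:k])
        test.extend(idx[k:])
    return sorted(train), sorted(test)

# 100 samples with a 10% anomaly rate (label 1 = anomaly)
labels = [1] * 10 + [0] * 90
train, test = stratified_split(labels, seed=42)
print(len(train), len(test))              # 70 30
print(sum(labels[i] for i in train))      # 7 anomalies land in train
```

Fixing the seed per repeat is what allows the reported averaging over 3 repeats to be exactly reproducible.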

4. Evaluation Protocols and Metrics

AD²-Bench formalizes several CoT-specific metrics:

  • Step-wise Accuracy & Completeness Score (SACS): $SACS = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Eval\_acc}(A_i, GT_i)$ (1–10 scale, LLM-judged factual correctness and coverage).
  • Step-wise Logical Progression Score (SLPS): $SLPS = \frac{1}{N-1}\sum_{i=1}^{N-1} \mathrm{Eval\_prog}(A_i, A_{i+1})$ (coherence between consecutive steps).
  • Overall Reasoning Coherence Score (ORCS): $ORCS = \mathrm{Eval\_coh}(A_1, \ldots, A_N)$ (holistic, non-contradiction).
  • Decision Justification Strength Score (DJSS): $DJSS = \mathrm{Eval\_just}((A_1, \ldots, A_{N-1}), A_N)$ (justification of the final decision).
  • Perception tasks use standard binary, localization/detection (IoU), and OCR (F1, CER) metrics; localization accuracy: $acc_i = \frac{1}{1 + \alpha\|x_i - x_i^{gt}\|_2}$, with $\alpha = 0.005$ (Wei et al., 11 Jun 2025).
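The aggregations above are simple means over judge scores, plus a distance-based localization accuracy. A minimal sketch, assuming the per-step and per-pair LLM-judge scores are already available as numbers (the function names are ours, not the benchmark's):

```python
import math

def sacs(step_scores):
    """Step-wise Accuracy & Completeness: mean of N per-step judge scores."""
    return sum(step_scores) / len(step_scores)

def slps(progression_scores):
    """Step-wise Logical Progression: mean coherence over adjacent pairs
    (N-1 scores for an N-step chain)."""
    return sum(progression_scores) / len(progression_scores)

def localization_acc(pred, gt, alpha=0.005):
    """acc_i = 1 / (1 + alpha * ||x_i - x_i^gt||_2)."""
    return 1.0 / (1.0 + alpha * math.dist(pred, gt))

# A 4-step chain judged on a 1-10 scale (scores are illustrative)
print(sacs([8, 9, 7, 8]))                        # 8.0
print(slps([9, 8, 9]))                           # mean over the 3 step pairs
print(localization_acc((100, 200), (100, 200)))  # 1.0 for a perfect hit
```

Note the different denominators: SACS averages over all $N$ atoms, while SLPS averages over the $N-1$ transitions between them.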

ADBench protocols deploy area-under-curve metrics: AUCROC $= \int_0^1 TPR(FPR^{-1}(t))\,dt$, AUCPR, and optionally F1 scores, with precision and recall as standard. Critical-difference diagrams (Wilcoxon–Holm, $p \leq 0.05$) statistically differentiate algorithms over multiple runs and datasets (Han et al., 2022).
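The AUCROC integral is equivalent to the Mann–Whitney statistic: the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal point. A minimal pure-Python sketch of that equivalence (in practice one would use a library such as scikit-learn):

```python
def auc_roc(scores, labels):
    """AUCROC via the rank (Mann-Whitney) formulation: the fraction of
    anomaly/normal pairs in which the anomaly outscores the normal point,
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomalies
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal points
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking: every anomaly outscores every normal point
print(auc_roc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))  # 1.0
# A constant scorer is no better than chance
print(auc_roc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # 0.5
```

Because it depends only on the ranking of scores, AUCROC is insensitive to monotone rescaling of detector outputs, which is why it suits comparisons across heterogeneous algorithms.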

| Benchmark | Core Metrics | Statistical Testing |
|---|---|---|
| AD²-Bench | SACS, SLPS, ORCS, DJSS, F1, CER, IoU | Multi-expert cross-check |
| ADBench | AUCROC, AUCPR, F1, Precision, Recall | Critical-difference diagrams |

The metrics are tailored to support the hierarchical and typology-specific evaluation needs of each use case.

5. Key Results and Comparative Findings

AD²-Bench reveals low (<60%) reasoning accuracy even for state-of-the-art MLLMs in adverse driving scenarios, indicating substantial room for improvement. Hierarchical CoT prompting elevates models like InternVL3-Qwen2.5 (57.3%), Qwen2.5-VL (56.8%), InternVL2.5 (55.1%), compared to 35–45% without CoT. Advanced Perception (OCR) tasks show high F1 (≈90%) for InternVL series but lower performance for LLaVA. Basic Perception suffers in poor weather, with persistent hallucinations and default values; occlusion exacerbates misses. Relation understanding and complex decision steps expose the limits of logical chain coherence in current MLLMs (ORCS scores < SACS). Failure analysis points to weather ambiguity, occlusion, format errors, and chain incoherence (Wei et al., 11 Jun 2025).

ADBench finds no statistically superior unsupervised method—CD diagrams are almost fully horizontally connected. Semi-supervised detectors leverage minimal labeled anomalies (as low as 1%) to outperform the best unsupervised baselines (e.g., CBLOF); fully supervised methods require ≈10% labels to surpass unsupervised but converge with semi-supervised at ≥50%. Performance is strongly dependent on anomaly type; unsupervised detectors excel when inductive assumptions align (LOF for local, kNN for global/dependency, OCSVM for clustered). Label-informed methods only outperform for clustered anomalies. Robustness findings: unsupervised algorithms suffer (−16.4% median AUCROC) under duplicated anomalies; semi-/fully supervised are unaffected. Supervised methods resist feature noise (≤5% drop at 50% irrelevant features) via feature selection; semi-/unsupervised lose up to 10%. Minor annotation errors mildly degrade supervised/semi-supervised (≤2% drop for ≤5% error); unsupervised are unaffected (Han et al., 2022).

6. Recommendations and Future Directions

AD²-Bench highlights opportunities in pre-training with targeted adverse weather data, integrating finer vision prompts, employing dual-resolution strategies (patch/dynamic resizing), and explicit CoT fine-tuning with step-level supervision. Enhancing instruction-following protocols and developing hallucination-detection modules for safety-critical entities are also recommended (Wei et al., 11 Jun 2025).

ADBench suggests improving unsupervised evaluation through large-scale statistical testing and meta-learning for detector selection. Exploiting limited label regimes with advanced semi-supervised architectures, especially transformer-based backbones, is encouraged. Anomaly-type-aware algorithmic design—either adapting detectors or dynamically routing data—is a promising avenue. Robust detection is supported by unsupervised feature selection and margin-based loss functions. Modalities should expand to time series, graphs, open-set/OOD, fairness-aware detection, and richer pixel/NLP sequence datasets (Han et al., 2022).

7. Accessibility, Reproducibility, and Community Impact

AD²-Bench advances interpretable, robust end-to-end autonomous driving reasoning research by exposing MLLM limitations in adverse scenarios and supporting detailed, fine-grained analytics. The benchmark’s hierarchical CoT scaffolding is a distinct contribution, enabling logic and narrative assurance for each inference chain (Wei et al., 11 Jun 2025).

ADBench provides code, datasets, and results repositories (https://github.com/Minqi824/ADBench, BSD-2 license) with modular organization, Jupyter experiment scripts, and controlled, reproducible randomness. Adding new algorithms involves extending the methods directory and experiment config, with automatic evaluation across the full benchmark suite. This facilitates comprehensive, standardized, and reproducible assessment for anomaly detection communities, bridging domains and fostering robust comparative research (Han et al., 2022).
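A new detector plugged into a harness of this kind typically only needs a fit/score interface. As a hedged illustration (the interface names are generic, not ADBench's exact API), here is a toy unsupervised detector in that shape:

```python
import statistics

class ZScoreDetector:
    """Toy unsupervised anomaly detector exposing the fit/predict_score
    shape that benchmark harnesses of this kind commonly expect
    (interface names are illustrative, not ADBench's actual API)."""

    def fit(self, X, y=None):  # y is ignored: fully unsupervised
        cols = list(zip(*X))
        self.mean = [statistics.fmean(c) for c in cols]
        self.std = [statistics.pstdev(c) or 1.0 for c in cols]
        return self

    def predict_score(self, X):
        """Higher score = more anomalous (max absolute z-score per row)."""
        return [max(abs(v - m) / s
                    for v, m, s in zip(row, self.mean, self.std))
                for row in X]

# Three near-identical rows plus one clear outlier
X = [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1], [5.0, -4.0]]
det = ZScoreDetector().fit(X)
scores = det.predict_score(X)
print(scores.index(max(scores)))  # 3: the outlier row ranks highest
```

Because evaluation is driven by the shared interface and config, any detector written in this shape is automatically scored with the same splits, repeats, and metrics as the built-in baselines.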
