Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
Abstract: LLMs have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that fuses vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, pairing high-quality control commands collected with an RL agent with question-answer pairs generated by a teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector-captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and decision-making. Our findings highlight the potential of LLM-based driving-action generation compared with traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.
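To make the fusion idea concrete, below is a minimal PyTorch sketch of how object-level state vectors could be encoded and projected into the token-embedding space of a frozen pre-trained LLM. All module names, dimensions, and the Perceiver-style latent bottleneck are illustrative assumptions based on the abstract's description, not the authors' released implementation.

```python
# Hypothetical sketch: object-level vectors -> "vector tokens" for a frozen LLM.
# obj_dim, llm_dim, n_latents, and all layer sizes are assumed, not from the paper.
import torch
import torch.nn as nn

class VectorFormer(nn.Module):
    """Encodes per-object state vectors into tokens in the LLM embedding space."""
    def __init__(self, obj_dim: int = 32, llm_dim: int = 4096, n_latents: int = 64):
        super().__init__()
        # Per-object MLP encoder over raw numeric features (pose, velocity, etc.).
        self.encoder = nn.Sequential(
            nn.Linear(obj_dim, 256), nn.GELU(), nn.Linear(256, 256)
        )
        # Learned latent queries cross-attend to the encoded objects,
        # compressing a variable number of objects into a fixed token budget.
        self.latents = nn.Parameter(torch.randn(n_latents, 256))
        self.cross_attn = nn.MultiheadAttention(256, num_heads=8, batch_first=True)
        # Linear projection into the frozen LLM's token-embedding dimension.
        self.proj = nn.Linear(256, llm_dim)

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, n_objects, obj_dim) -> (batch, n_latents, llm_dim)
        enc = self.encoder(objects)
        q = self.latents.unsqueeze(0).expand(objects.size(0), -1, -1)
        fused, _ = self.cross_attn(q, enc, enc)
        return self.proj(fused)
```

Under this reading, the resulting vector tokens would be prepended to the embedded question tokens and passed through the frozen LLM (optionally fine-tuned with LoRA adapters) to produce driving-QA answers and action predictions; the vector-captioning pretraining stage would train only the encoder/projection so the numeric modality aligns with the static LLM representations.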