FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Published 3 Jun 2025 in cs.CL | (2506.03278v1)

Abstract: We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of LLMs to reason about and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven, using statistical tools like correlation analysis and significance tests, but also domain-driven, guided by specialized LLMs that can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs (including GPT-4, Llama, and Mistral) on FailureSensorIQ through different lenses: Perturbation-Uncertainty-Complexity analysis, an expert evaluation study, asset-specific knowledge-gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals that this performance drops significantly and is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive modeling decisions on 3 failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and Hugging Face leaderboard, based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.

Summary

  • The paper presents FailureSensorIQ, a novel multi-choice QA dataset for evaluating sensor-failure reasoning in industrial assets.
  • It employs systematic perturbations and metrics like Accuracy and Consistency-Based Accuracy to robustly measure model performance.
  • Findings reveal significant model sensitivity to data perturbations, emphasizing the need for enhanced domain-specific reasoning.

Introduction

"FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes" (2506.03278) introduces a novel benchmark designed to evaluate the reasoning capabilities of LLMs in domain-specific scenarios, focusing specifically on Industry 4.0 applications. The emphasis is on understanding failure modes, sensor data, and their interrelationships in various industrial assets. This dataset diverges from traditional QA benchmarks by integrating domain-driven insights with statistical analyses.

Benchmark Design and Components

The dataset is structured to challenge LLMs in understanding complex industrial scenarios through multiple-choice questions (MCQs). It covers ten types of industrial assets and includes expert-curated questions, offering two main types of queries:

  1. Failure Modes to Sensor Relevance (FM2Sensor): Identifying the most relevant sensors for detecting failure signs.
  2. Sensor to Failure Mode Relevance (Sensor2FM): Understanding which failure modes are indicated by specific sensor readings.

The dataset comprises 8,296 questions, divided into single-correct-answer and multi-correct-answer formats. These questions are derived from data in ISO documents, offering a robust platform for assessing industrial knowledge in LLMs (Figure 1).

Figure 1: Example of AI Tasks for Industry 4.0 Applications.
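The two query types and answer formats above can be pictured with a small illustrative record. The field names here are assumptions chosen for readability, not the released dataset's actual schema:

```python
# Illustrative FailureSensorIQ-style MCQ record. Field names are
# assumptions for illustration; the released dataset may use a
# different schema.
sample_question = {
    "asset": "centrifugal pump",
    "task": "FM2Sensor",  # failure-mode -> relevant-sensor query
    "question": ("Which sensor is most relevant for detecting signs "
                 "of bearing wear in a centrifugal pump?"),
    "options": {"A": "vibration", "B": "flow rate",
                "C": "ambient humidity", "D": "tank level"},
    "answers": ["A"],  # one entry in the single-correct split, several otherwise
}

def is_multi_answer(record):
    """True when the record belongs to the multi-correct-answer format."""
    return len(record["answers"]) > 1
```

A Sensor2FM record would look the same with the question direction reversed (given a sensor reading, choose the indicated failure modes).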

Methodology and Analysis

Dataset Structure and Preparation

Each question in the dataset is carefully crafted using a systematic process:

  • Questions are generated according to asset type and relevance criteria, ensuring a diverse coverage of potential industrial scenarios.
  • A variety of perturbations are applied to questions to assess robustness, including format changes and content expansions.
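One of the simplest perturbations in this family is reordering the answer options so that a model cannot rely on positional cues. The sketch below illustrates the idea; it is not the paper's actual perturbation code:

```python
import random

def shuffle_options(record, seed=0):
    """Format perturbation (sketch): shuffle the option texts and remap
    the correct letters so the question content is unchanged."""
    rng = random.Random(seed)
    correct_texts = {record["options"][l] for l in record["answers"]}
    texts = list(record["options"].values())
    rng.shuffle(texts)
    letters = sorted(record["options"])          # e.g. ["A", "B", "C", "D"]
    new_options = dict(zip(letters, texts))
    new_answers = sorted(l for l, t in new_options.items()
                         if t in correct_texts)
    return {**record, "options": new_options, "answers": new_answers}

question = {
    "question": "Which sensor best indicates bearing wear?",
    "options": {"A": "vibration", "B": "flow rate",
                "C": "ambient humidity", "D": "tank level"},
    "answers": ["A"],
}
perturbed = shuffle_options(question, seed=42)
```

Content expansions (e.g., adding plausible distractor options) follow the same pattern: the correct answer's text is preserved while the surrounding presentation changes.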

Evaluation Metrics

The study employs metrics such as Accuracy (Acc), Perturbed Accuracy (Acc@Perturb), and Consistency-Based Accuracy (Acc@Consist), among others, to measure model performance under different conditions. The emphasis is on how well models retain and apply domain-specific knowledge when faced with varied levels of question complexity and perturbation (Figure 2).

Figure 2: Multi-Choice Question Generation Pipeline.
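Under assumed definitions (the paper may define them slightly differently), these metrics can be sketched as follows: Acc is plain accuracy on the original questions, Acc@Perturb pools accuracy over all perturbed variants, and Acc@Consist credits a question only when every variant is answered correctly.

```python
# Sketch of the evaluation metrics under assumed definitions; the
# paper's exact formulations may differ.
def accuracy(preds, golds):
    """Acc: fraction of original questions answered correctly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def perturbed_accuracy(variant_preds, golds):
    """Acc@Perturb: accuracy pooled over every perturbed variant.
    variant_preds[i] holds the model's answers to all variants of question i."""
    hits = total = 0
    for variants, gold in zip(variant_preds, golds):
        hits += sum(p == gold for p in variants)
        total += len(variants)
    return hits / total

def consistency_accuracy(variant_preds, golds):
    """Acc@Consist: a question counts only if all its variants are correct."""
    return sum(all(p == gold for p in variants)
               for variants, gold in zip(variant_preds, golds)) / len(golds)
```

Because Acc@Consist requires agreement across every variant, it is always bounded above by Acc@Perturb, which makes it the stricter measure of robustness.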

Key Findings

The analysis reveals significant insights:

  • Knowledge Retention: LLMs show a marked drop in performance when faced with perturbed data, emphasizing the need for robust reasoning capabilities beyond static knowledge retrieval.
  • Performance Disparities: The highest-performing models achieved around 60% accuracy on original questions, but this dropped significantly with perturbations, highlighting the challenge of consistent reasoning.
  • Model Sensitivity: The study identifies a correlation between data availability from external sources (such as Wikipedia) and model accuracy, suggesting that broader data exposure might improve performance in specialized domains (Figure 3).

Figure 3: Real-world knowledge capacities assessed by Acc@Consist.

Implications and Future Research

The introduction of FailureSensorIQ represents a significant step towards evaluating and enhancing the industrial relevance of LLMs. This dataset not only serves as a tool for benchmarking but also highlights the potential for LLMs to automate and enhance decision-making in predictive maintenance and asset management.
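The released LLMFeatureSelector packages this idea as a scikit-learn pipeline component; its actual API is documented in the project repository. As a hypothetical sketch, an LLM-driven selector following scikit-learn's fit/transform conventions might look like this, where `ask_llm` is a stand-in for any call to a language model:

```python
# Hypothetical sketch of an LLM-driven feature selector following
# scikit-learn's fit/transform conventions. This is NOT the actual
# LLMFeatureSelector API; `ask_llm` is an assumed callable that maps a
# prompt string to a list of selected feature names.
class LLMFeatureSelector:
    def __init__(self, ask_llm, asset, failure_mode):
        self.ask_llm = ask_llm
        self.asset = asset
        self.failure_mode = failure_mode
        self.selected_ = None

    def fit(self, X, y=None):
        """Ask the LLM which sensors matter for this asset/failure mode.
        X is a list of dicts mapping sensor names to readings."""
        prompt = (f"Which of these sensors are most informative for "
                  f"detecting {self.failure_mode} in a {self.asset}? "
                  f"Options: {sorted(X[0])}")
        self.selected_ = self.ask_llm(prompt)
        return self

    def transform(self, X):
        """Keep only the LLM-selected sensor columns."""
        return [{k: row[k] for k in self.selected_} for row in X]

# Usage with a stubbed LLM that always picks "vibration":
selector = LLMFeatureSelector(lambda prompt: ["vibration"],
                              asset="pump", failure_mode="bearing wear")
rows = [{"vibration": 0.1, "flow": 2.0}]
reduced = selector.fit(rows).transform(rows)
```

The design mirrors the paper's broader argument: feature selection becomes domain-driven (the LLM supplies the ranking) while the surrounding pipeline stays statistical and reproducible.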

Future research should focus on integrating temporal dynamics and real-time data streams into benchmarking processes. This includes extending datasets to account for the temporal nature of sensor data, thereby providing a more comprehensive evaluation framework for industrial applications (Figure 4).

Figure 4: Response Pattern Analysis between Original Data and Complex Perturbed Data on SC-MCQA for o1.

Conclusion

FailureSensorIQ offers a rigorous platform for testing LLMs in industrial contexts, shedding light on their current capabilities and limitations. It challenges the field to develop models that not only parse and retrieve information but also synthesize complex domain knowledge into actionable insights. Through better benchmark design and dataset extension, this research aims to bridge the gap between general linguistic capabilities and domain-specific expertise necessary for Industry 4.0.
