Multimodal-LRP for RUL Regression
- Multimodal-LRP is an explainability method designed for deep networks that integrate image and time-frequency modalities for Remaining Useful Life (RUL) estimation.
- It systematically traces and redistributes relevance using adapted LRP rules, handling convolutions, residuals, and fusion operations in multimodal architectures.
- The approach enhances model interpretability and benchmark performance, reducing training data needs and boosting reliability in industrial prognostics.
Multimodal Layer-wise Relevance Propagation (multimodal-LRP) is an explainability method developed for neural architectures that integrate heterogeneous input modalities, specifically designed for the joint interpretation of features in multimodal Remaining Useful Life (RUL) regression networks. Within the context of prognostics and health management (PHM) for mechanical systems, multimodal-LRP allocates fine-grained relevance scores to the raw inputs—image representations (ImR) and time-frequency representations (TFR)—by systematically tracing and redistributing model output relevance across all computational layers and branches of a multimodal deep learning architecture. This process provides targeted, interpretable visualizations of the driving evidence underpinning automatic RUL scoring, substantiating model reliability for industrial deployment (Razzaq et al., 7 Dec 2025).
1. Network Architecture and Input Modalities
The multimodal-RUL framework consists of three processing branches:
- Image Representation (ImR) Branch: Input bearing vibration signals are 2D rasterized using Bresenham’s algorithm, yielding 64 × 500 gray-scale images. The branch processes this input through two dilated Conv2D residual blocks:
- Block 1: Small convolutional kernels (e.g., 5 × 5) with high dilation rates [(4,4),(3,3),(2,2),(1,1)], coupled with ReLU and MaxPooling.
- Block 2: Larger kernels and lower dilation rates, with a residual shortcut via 1 × 1 convolution.
- Time-Frequency Representation (TFR) Branch: Feature vectors are derived via Continuous Wavelet Transform (CWT), capturing metrics such as energy, dominant frequency, entropy, kurtosis, skewness, mean, and standard deviation. Four dilated Conv1D layers (dilations [2,2,1,1], filters [32,32,64,64]), ReLU, and identical pooling are employed, with a two-layer 1 × 1 Conv1D residual sub-architecture.
- Fusion and Regression Branch: Outputs from both branches are concatenated along the feature axis. They are processed with a stack of three LSTM layers ([100, 64, 64] units, tanh), residual addition, an eight-headed multi-head attention layer (key/value dimension 64), and three fully connected layers ([64→32→1]) for final RUL estimation.
2. Principles of Layer-wise Relevance Propagation
Layer-wise Relevance Propagation (LRP) assigns a quantitative relevance score to each input feature, reflecting its contribution to the final network output. At its core is relevance conservation:
$\sum_i R_i^{(l)} = \sum_j R_j^{(l+1)}$ for any layer $l$, where $R_i^{(l)}$ denotes the relevance assigned to neuron $i$ at layer $l$. Backward propagation of relevance relies on canonical LRP rules (LRP-0, LRP-ε, LRP-γ) applied to both dense and convolutional structures. These rules address the distribution of output relevance among input neurons, factoring in activations, weights, and architectural details such as non-negativity or residual connectivity.
Classical LRP methodologies are suited for single-stream architectures and thus may misallocate relevance in multimodal networks containing parallel branches, skip connections, and concatenative fusions. This limitation motivates the need for multimodal-LRP, which explicates the pathwise and modality-specific contribution to a joint regression output (Razzaq et al., 7 Dec 2025).
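The conservation principle can be checked numerically on a toy model. The sketch below propagates relevance through a two-layer ReLU regressor with the basic LRP-0 rule; the network, seed, and zero-bias simplification (which makes conservation exact up to the stabilizer) are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU regressor with zero biases, so LRP-0 conserves exactly.
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 1))
a0 = rng.random(4)                    # input features
a1 = np.maximum(0.0, a0 @ W1)         # hidden ReLU activations
out = a1 @ W2                         # scalar output, shape (1,)

def lrp_0(a, W, R_upper):
    """LRP-0 backward step: R_j = a_j * sum_k W_jk * R_k / z_k."""
    z = a @ W                         # pre-activations of the upper layer
    s = R_upper / (z + 1e-12)         # sensitivity (tiny stabilizer)
    return a * (W @ s)                # relevance of the lower layer

R2 = out                              # start: relevance = model output
R1 = lrp_0(a1, W2, R2)                # relevance at the hidden layer
R0 = lrp_0(a0, W1, R1)                # relevance at the input features
```

Summing `R1` or `R0` recovers the model output, confirming that LRP-0 conserves total relevance across layers.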
3. Mathematical Formulation and Propagation Rules
Multimodal-LRP extends conventional LRP by rigorously tracking network topology, branch operations, and fusion mechanisms in both forward and backward passes. The principal rules and their mathematical representations are:
- Dilated convolution + residual (LRP-γ):
- Weight decomposition: $w_{jk}^{+} = \max(0, w_{jk})$
- Adjusted weight: $\hat{w}_{jk} = w_{jk} + \gamma\, w_{jk}^{+}$
- Pre-activations: $z_k = \epsilon + \sum_j a_j \hat{w}_{jk}$
- Sensitivity: $s_k = R_k / z_k$
- Relevance assignment: $R_j = a_j \sum_k \hat{w}_{jk}\, s_k$
- Linear layer (LRP-ε): $R_j = \sum_k \dfrac{a_j w_{jk}}{z_k + \epsilon\,\operatorname{sign}(z_k)} R_k$, with pre-activations $z_k = \sum_j a_j w_{jk}$.
- Pooling and Elementwise Operations (LRP-0):
- MaxPooling: $R_i = R_j$ for the input $i$ attaining the maximum pooled to output $j$, $0$ otherwise.
- Add (Residual): relevance is split between the two summands in proportion to their contributions, $R_{x_i} = \frac{x_i}{x_1 + x_2} R_{\text{out}}$ for inputs $x_1, x_2$.
- Activation (ReLU): $R_{\text{in}} = R_{\text{out}}$ (relevance passes through unchanged).
- Fusion Operations:
- Concat: Relevance from the concatenated feature is first averaged over LSTM and residual add outputs, then split back to the appropriate modalities.
- Reshape & Lambda: Relevance is backward propagated to preserve modality attribution.
- Normalization and Sequence Modules:
- LayerNorm: relevance is redistributed across features in proportion to each feature's contribution to the normalized output, with the mean and variance statistics treated as constants during propagation.
- LSTM: relevance is propagated through the linear cell and gate pre-activations using the stabilized ε-rule, treating the multiplicative gate activations as constant weights.
- Multi-Head Attention: relevance flows back through the value projections, distributed according to the attention weights, which are held constant during propagation.
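Under the assumption that convolutions are unrolled into matrix form, the core rules above can be sketched as small NumPy routines. The function names, the proportional Add split, and the stabilizer constants are illustrative, not the paper's exact implementation.

```python
import numpy as np

def lrp_gamma(a, W, R_upper, gamma=0.25):
    """LRP-γ in five steps: weight decomposition, adjusted weight,
    pre-activation, sensitivity, relevance assignment."""
    W_pos = np.maximum(0.0, W)            # 1. weight decomposition
    W_hat = W + gamma * W_pos             # 2. adjusted weight
    z = a @ W_hat + 1e-9                  # 3. stabilized pre-activations
    s = R_upper / z                       # 4. sensitivity
    return a * (W_hat @ s)                # 5. relevance assignment

def lrp_epsilon(a, W, R_upper, eps=1e-6):
    """LRP-ε for a linear layer: stabilized proportional redistribution."""
    z = a @ W
    s = R_upper / (z + eps * np.sign(z))
    return a * (W @ s)

def lrp_add(x1, x2, R_out):
    """Residual Add: split relevance in proportion to each summand."""
    total = x1 + x2 + 1e-12
    return R_out * x1 / total, R_out * x2 / total

def lrp_concat(R_out, sizes):
    """Concat fusion: split relevance back to each modality's slice."""
    return np.split(R_out, np.cumsum(sizes)[:-1])
```

Both dense rules approximately conserve total relevance, and `lrp_concat` is the operation that ultimately separates ImR relevance from TFR relevance at the fusion boundary.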
4. Implementation Workflow and Pseudocode
The multimodal-LRP procedure uses a two-stage workflow: forward pass to cache activations and architectural events (residual map, concat indices), and backward pass to apply LRP rules and split relevance across modalities.
```python
# Stage 1: forward pass, caching activations and architectural events
activations = [x]
residual_map = {}
for L in model.layers:
    if L is Add:
        residual_map[L] = secondary_input_activation
    if L is Concat:
        store_concat_indices(L)
    A_new = L.forward(A_prev_or_all_inputs)
    activations.append(A_new)

# Stage 2: backward pass, applying LRP rules and splitting relevance
R = R_top
pending = {}                              # relevance waiting at branch points
for L in reversed(model.layers):
    if L is Add:
        R_main, R_res = L.backward(R)     # split over the two summands
        pending[shortcut_origin(L)] = R_res  # route to the shortcut path
        R = R_main
    elif L is Concat:
        R = L.backward(R, stored_indices)    # split back across modalities
    else:
        R = L.backward(R, layer_activation)
    R = R + pending.pop(L, 0)             # merge relevance rejoining here
return R_img, R_tf
```
Within each backward call, the specific LRP rule corresponding to the network operation is utilized (Razzaq et al., 7 Dec 2025).
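The two-stage workflow can be exercised end-to-end on a toy two-branch network: one dense layer per modality, a Concat fusion, and a dense regression head. The weights, the uniform use of the ε-rule, and the absence of LSTM/attention layers are deliberate simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-branch network: dense layer per branch -> Concat -> dense head.
W_img = rng.normal(size=(4, 3))    # stand-in for the ImR branch
W_tf  = rng.normal(size=(5, 3))    # stand-in for the TFR branch
W_out = rng.normal(size=(6, 1))    # stand-in for the fusion/regression head

def lrp_eps(a, W, R, eps=1e-6):
    """Stabilized LRP-ε backward step through one linear layer."""
    z = a @ W
    return a * (W @ (R / (z + eps * np.sign(z))))

def explain(x_img, x_tf):
    """Two-stage sketch: forward pass caches activations, backward pass
    redistributes output relevance and splits it at the Concat fusion."""
    # Stage 1: forward pass (activations kept in local variables)
    h_img = np.maximum(0.0, x_img @ W_img)
    h_tf  = np.maximum(0.0, x_tf  @ W_tf)
    fused = np.concatenate([h_img, h_tf])
    out   = fused @ W_out
    # Stage 2: backward pass, layer by layer
    R_fused = lrp_eps(fused, W_out, out)               # head (LRP-ε)
    R_h_img, R_h_tf = np.split(R_fused, [h_img.size])  # Concat split
    R_img = lrp_eps(x_img, W_img, R_h_img)             # ImR branch inputs
    R_tf  = lrp_eps(x_tf,  W_tf,  R_h_tf)              # TFR branch inputs
    return out, R_img, R_tf

out, R_img, R_tf = explain(rng.random(4), rng.random(5))
```

The returned pair `(R_img, R_tf)` mirrors the pseudocode's final line: one relevance map per modality, jointly conserving the model output.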
5. Visualization and Attribution Maps
Relevance heatmaps produced via multimodal-LRP confirm which image and TFR features predominate in the network’s RUL prediction. In the ImR branch, red/yellow regions on the heatmap correspond to sharp peaks, fault impulses, or amplitude jumps in the rasterized vibration image, signifying major relevance. Flat baseline regions appear blue, signifying negligible attribution. In the TFR branch, feature-wise visualizations (e.g., bar or circle plots) often show high relevance attributed to standard deviation and kurtosis, indicating decisive reliance on variability and impulsiveness within the vibration signal for RUL estimation.
The following table summarizes typical relevance attribution patterns by modality:
| Modality | Key Features with High Relevance | Visualization |
|---|---|---|
| ImR | Sharp peaks, fault impulses | Red/yellow regions in 2D heatmap |
| TFR | Standard deviation, kurtosis | Warm-colored bars/circles |
These findings indicate the CNN-based ImR branch prioritizes localized geometric wear signatures, while the TFR branch accentuates non-stationary statistical markers and energy metrics (Razzaq et al., 7 Dec 2025). The net effect is transparent, modality-specific explanation supporting model interpretability and prediction trustworthiness.
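A minimal sketch of the display-side step, assuming heatmaps are rendered from relevance scaled to [-1, 1] (warm colors for positive scores, blue for negative) and TFR features are ranked by absolute relevance; the scaling convention and helper names are assumptions, not the paper's plotting code.

```python
import numpy as np

def relevance_to_heat(R):
    """Scale a signed relevance map to [-1, 1] so that +1 maps to the
    warmest color (red/yellow) and values near 0 to neutral/blue."""
    scale = np.max(np.abs(R)) + 1e-12
    return R / scale

def rank_tfr_features(R_tf, names):
    """Order TFR features by absolute relevance for bar-plot display."""
    order = np.argsort(-np.abs(R_tf))
    return [(names[i], float(R_tf[i])) for i in order]
```

For example, if standard deviation carries the largest relevance score, it heads the ranked list that a bar or circle plot would render.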
6. Practical Significance and Benchmark Evaluation
Multimodal-LRP underpins the interpretability of the multimodal-RUL framework, which demonstrates robust generalization and noise resilience across the XJTU-SY and PRONOSTIA bearing datasets. The provenance of salient feature contribution, as revealed through relevance maps, substantiates the model’s capability to match or outperform state-of-the-art baselines under both seen and unseen conditions, while requiring significantly less training data (~28 % less for XJTU-SY, ~48 % less for PRONOSTIA). These outcomes affirm the utility of multimodal-LRP not only for rigorous model auditing but also for enhancing deployment confidence in real-world industrial systems (Razzaq et al., 7 Dec 2025).
This suggests that explicit multimodal relevance tracing is required for accurate interpretation in fused architectures containing parallel and non-standard network operations. A plausible implication is the general applicability of multimodal-LRP logic to other multimodal learning frameworks facing similar interpretability requirements in industrial AI.