Physics-Guided Tiny-Mamba Transformer
- The paper introduces PG-TMT, a compact tri-branch encoder that integrates physics-guided spectral mapping and EVT-calibrated thresholds to enhance early fault detection in rotating machinery.
- It fuses depthwise convolution, state-space modeling, and local transformer attention to capture micro-transients, long-range dynamics, and cross-channel resonances with high precision.
- Experimental evaluations show robust PR-AUC and ROC AUC, low latency, and reliable transfer across domains under severe nonstationary conditions and class imbalances.
The Physics-Guided Tiny-Mamba Transformer (PG-TMT) is a compact, tri-branch encoder architecture designed for reliability-aware early fault warning in rotating machinery under nonstationary conditions, domain shifts, and severe class imbalance. PG-TMT integrates physically guided priors—explicit temporal-to-spectral mappings aligned with mechanical defect frequencies—into a fusion of depthwise-separable convolution, state-space modeling, and attention-based resonance capture. Decision reliability is ensured through extreme-value theory (EVT) calibrated thresholds and hysteretic alarm logic. Evaluation across public and industrial datasets demonstrates competitive precision-recall metrics, timeliness, robust transfer, and deployment feasibility (Li et al., 29 Jan 2026).
1. Tri-Branch Encoder Architecture
PG-TMT processes online windows of multichannel vibration signals, $X_t \in \mathbb{R}^{C \times L}$ ($C$ channels, window length $L$), to produce a calibrated anomaly score $s_t$ at each time step $t$ (hop $H$, batch size 1). The encoder is organized into three complementary branches:
- Depthwise-Separable Convolutional Stem (Micro-Transients):
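A cascade of causal 1D depthwise convolutions (kernel size $k$, optional dilation $d$) is followed by pointwise ($1 \times 1$) convolutions that mix channels. At each layer $\ell$, the input feature map is first filtered channel by channel with the causal depthwise kernel and then recombined across channels by the pointwise convolution.
The receptive field of the cascade, $R = 1 + \sum_{\ell} (k - 1)\, d_\ell$ with $d_\ell$ the dilation of layer $\ell$, is tuned for sub-millisecond impact-like transients. Output: the micro-transient feature map $F_{\mathrm{conv}}$.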
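The convolutional stem described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the absence of normalization and activation layers, and the zero left-padding are all assumptions made for clarity.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def causal_depthwise_conv(x, w, dilation=1):
    """Per-channel causal convolution.

    x: (C, T) signal, w: (C, k) with one filter per channel.
    Output y[c, t] = sum_j w[c, j] * x[c, t - j * dilation], zero for t < 0.
    """
    C, T = x.shape
    k = w.shape[1]
    pad = (k - 1) * dilation
    xp = np.pad(x, ((0, 0), (pad, 0)))  # left-pad so y[t] never sees the future
    y = np.zeros((C, T))
    for j in range(k):
        start = pad - j * dilation      # tap j looks back j * dilation samples
        y += w[:, j:j + 1] * xp[:, start:start + T]
    return y

def pointwise_conv(x, W):
    """1x1 convolution: mixes channels at each time step. W: (C_out, C_in)."""
    return W @ x
```

With a kernel size of 3 and hypothetical dilations 1, 2, 4, `receptive_field` gives the number of past samples (including the current one) that influence each output.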
- Tiny-Mamba State-Space Branch (Long-Range Dynamics):
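A gated, linear state-space model captures near-linear degradation over hundreds or thousands of timesteps. The model takes a channel-reduced input $u_t$, maintains a latent state $h_t$ that is updated recurrently, and applies learned gates $g_t$. Stability is enforced by constraining the continuous-time state matrix $A$ to have eigenvalues with negative real part, so that its discretization $\bar{A}$ with step size $\Delta > 0$ satisfies
$$\rho(\bar{A}) < 1,$$
ensuring that the recurrence cannot diverge. Output: the long-range feature map $F_{\mathrm{ssm}}$.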
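One way to realize a stable gated linear state-space recurrence is sketched below. The specific choices here are assumptions for illustration and are not confirmed by the source: a diagonal state matrix parameterized as `-softplus(a_raw)`, zero-order-hold discretization, and a sigmoid output gate computed from the input.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tiny_ssm_scan(u, a_raw, B, C, Wg, delta=1.0):
    """Run a diagonal, gated linear SSM over a sequence.

    u: (T, d_in) inputs; a_raw: (n,) unconstrained parameters;
    B: (n, d_in); C: (d_out, n); Wg: (d_out, d_in) gate weights.
    Returns outputs (T, d_out), the final state (n,), and A_bar (n,).
    """
    A = -softplus(a_raw)                      # strictly negative, hence stable
    A_bar = np.exp(delta * A)                 # each entry lies in (0, 1)
    B_bar = ((A_bar - 1.0) / A)[:, None] * B  # zero-order hold for diagonal A
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A_bar * h + B_bar @ u_t           # linear state update
        g = sigmoid(Wg @ u_t)                 # input-dependent output gate
        ys.append(g * (C @ h))
    return np.stack(ys), h, A_bar
```

Because every entry of `A_bar` lies strictly between 0 and 1, the state stays bounded for bounded inputs, which is the property the stability constraint is meant to guarantee.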
- Local Transformer (Cross-Channel Resonances):
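Self-attention is restricted to a causal window $\mathcal{W}_t$ of recent positions for each head $h$, with scaled dot-product weights normalized over that window:
$$\alpha^{(h)}_{t,j} = \frac{\exp\!\big(q^{(h)\top}_t k^{(h)}_j / \sqrt{d_h}\big)}{\sum_{j' \in \mathcal{W}_t} \exp\!\big(q^{(h)\top}_t k^{(h)}_{j'} / \sqrt{d_h}\big)}, \qquad j \in \mathcal{W}_t,$$
producing $F_{\mathrm{attn}}$ for cross-channel resonance encoding.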
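Causal windowed attention of this kind can be sketched for a single head as below. The function name and the plain scaled dot-product scoring (no positional bias, no learned projections) are simplifying assumptions.

```python
import numpy as np

def local_causal_attention(Q, K, V, window):
    """Single-head attention where step t attends to steps t-window+1 .. t.

    Q, K, V: (T, d). Returns the attended values (T, d) and weights (T, T).
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(T)
    # allowed[t, j] is True when j is in the causal window ending at t
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                              # masked entries become 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights
```

Every row of `weights` sums to one and is zero outside the window, so the cost per step depends on the window length rather than on the full sequence length when implemented with banded storage.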
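Branch outputs are concatenated, $Z = [F_{\mathrm{conv}};\, F_{\mathrm{ssm}};\, F_{\mathrm{attn}}]$, and fused by a gated residual, in which a learned gate controls how much of the fused projection is added to the skip path.
A local attention distribution, a Jensen–Shannon discrepancy term, and a final score $s_t$ (computed from a quantity that incorporates both evidence and discrepancy) complete the inference pipeline.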
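For reference, the Jensen–Shannon divergence between two discrete distributions can be computed as in the sketch below. Which pair of distributions PG-TMT compares is not specified in this section, so the helper is deliberately generic and its name is an assumption.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # 0 * log(0) is treated as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The divergence is symmetric, equals zero for identical distributions, and is bounded above by $\ln 2$, which makes it convenient as a bounded discrepancy term in a score.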
2. Physically Guided Temporal–Spectral Alignment
PG-TMT imposes explicit temporal-to-spectral mapping by analytically connecting learned temporal attention to classical fault-order bands—frequencies determined by bearing geometry and shaft speed.
- Spectral Attention:
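Let $f_s$ be the sampling rate, which fixes the frequency grid of each analysis window. Spectral attention is computed by mapping the learned temporal attention over the window to a distribution over frequency.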
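A simple way to map temporal attention weights to a distribution over frequency is to take their normalized power spectrum. The sketch below assumes this DFT-based mapping and removes the mean before transforming; both choices are illustrative assumptions rather than the paper's stated formula.

```python
import numpy as np

def spectral_attention(temporal_attn, fs):
    """Map attention weights over a window to a distribution over frequency.

    temporal_attn: (L,) non-negative weights over the window samples.
    fs: sampling rate in Hz.
    Returns (freqs, spec) where spec sums to 1 over the one-sided frequency grid.
    """
    a = np.asarray(temporal_attn, dtype=float)
    a = a - a.mean()                          # drop the DC component
    power = np.abs(np.fft.rfft(a)) ** 2
    total = power.sum()
    if total > 0:
        spec = power / total
    else:
        spec = np.full(power.shape, 1.0 / power.size)  # flat attention: no preference
    freqs = np.fft.rfftfreq(a.size, d=1.0 / fs)
    return freqs, spec
```

If the attention weights pulse periodically at a fault frequency, the resulting spectrum concentrates its mass at that frequency, which is what allows a comparison against fault-order bands.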
- Fault Orders and Band Mask:
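Classical bearing defect frequencies, for shaft rotation frequency $f_r$, $n$ rolling elements of diameter $d$, pitch diameter $D$, and contact angle $\phi$:
$$\begin{aligned}
\mathrm{BPFO} &= \frac{n}{2}\, f_r \left(1 - \frac{d}{D}\cos\phi\right), &
\mathrm{BPFI} &= \frac{n}{2}\, f_r \left(1 + \frac{d}{D}\cos\phi\right), \\
\mathrm{BSF} &= \frac{D}{2d}\, f_r \left(1 - \left(\frac{d}{D}\cos\phi\right)^{2}\right), &
\mathrm{FTF} &= \frac{1}{2}\, f_r \left(1 - \frac{d}{D}\cos\phi\right).
\end{aligned}$$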
For each primary order, its side-bands, and the chosen windowing parameters, a Gaussian mixture mask $M(f)$ marks the frequencies of interest.
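The classical defect frequencies follow directly from bearing geometry and shaft speed. The sketch below uses the standard kinematic formulas for a bearing with a stationary outer race; the function and argument names are illustrative, and the geometry in the usage example is hypothetical.

```python
import math

def bearing_fault_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_rad=0.0):
    """Classical bearing defect frequencies in Hz (outer race stationary).

    shaft_hz: shaft rotation frequency; n_balls: number of rolling elements;
    ball_d, pitch_d: rolling-element and pitch diameters in the same unit.
    """
    r = (ball_d / pitch_d) * math.cos(contact_angle_rad)
    return {
        "BPFO": 0.5 * n_balls * shaft_hz * (1.0 - r),              # outer race
        "BPFI": 0.5 * n_balls * shaft_hz * (1.0 + r),              # inner race
        "BSF": 0.5 * (pitch_d / ball_d) * shaft_hz * (1.0 - r * r),  # ball spin
        "FTF": 0.5 * shaft_hz * (1.0 - r),                         # cage
    }

# Hypothetical geometry: 9 balls, 8 mm ball diameter, 40 mm pitch diameter, 30 Hz shaft.
freqs = bearing_fault_frequencies(shaft_hz=30.0, n_balls=9, ball_d=8.0, pitch_d=40.0)
```

Dividing each value by the shaft frequency gives the dimensionless fault order, which stays fixed as speed changes and is therefore a natural center for a speed-adaptive band mask.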
Smoothed spectral and mask distributions yield a physics-based alignment penalty, which measures the mismatch between where the model attends in frequency and where the fault bands lie, and a band-alignment score quantifying the physics-grounded plausibility of the model’s attention.
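The mask, penalty, and score can be illustrated with the sketch below. The exact definitions used by PG-TMT are not recoverable from this section, so every choice here is an assumption: the mask is the pointwise maximum of unit-height Gaussians, the score is the fraction of spectral attention mass falling under the mask, and the penalty is a smoothed KL divergence from the mask distribution to the spectral distribution.

```python
import numpy as np

def gaussian_band_mask(freqs, centers, bandwidth):
    """Mask in [0, 1]: pointwise maximum of unit-height Gaussians at each center."""
    freqs = np.asarray(freqs, dtype=float)
    mask = np.zeros_like(freqs)
    for c in centers:
        mask = np.maximum(mask, np.exp(-0.5 * ((freqs - c) / bandwidth) ** 2))
    return mask

def band_alignment(spec, mask):
    """Fraction of spectral attention mass that falls inside the masked bands."""
    spec = np.asarray(spec, dtype=float)
    return float(np.sum(spec * mask) / np.sum(spec))

def alignment_penalty(spec, mask, eps=1e-8):
    """Smoothed KL(mask || spec); smaller when attention covers the fault bands."""
    p = (np.asarray(spec, dtype=float) + eps)
    p = p / p.sum()
    q = (np.asarray(mask, dtype=float) + eps)
    q = q / q.sum()
    return float(np.sum(q * np.log(q / p)))
```

A score near 1 indicates that the model attends almost entirely within the expected fault bands, while a score near 0 indicates attention on frequencies with no mechanical interpretation under the assumed geometry.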
3. EVT-Calibrated Reliability-Aware Decision Logic
PG-TMT translates raw anomaly scores into calibrated, reliability-guaranteed alarms using EVT-based extremal modeling of healthy-score exceedances.
- Peaks-Over-Threshold Extreme-Value Modeling:
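On calibration segments, scores above a high quantile $u$ are modeled via the generalized Pareto distribution (GPD) with shape $\xi$ and scale $\sigma$. Exceedance times approximate a Poisson process of rate $\lambda_u$. The on-threshold $\tau_{\mathrm{on}}$ meeting the target false-alarm intensity $\lambda^{\ast}$ is
$$\tau_{\mathrm{on}} = u + \frac{\sigma}{\xi}\left[\left(\frac{\lambda_u}{\lambda^{\ast}}\right)^{\xi} - 1\right],$$
with the limiting case $\xi \to 0$ yielding the logarithmic form $\tau_{\mathrm{on}} = u + \sigma \ln(\lambda_u / \lambda^{\ast})$.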
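The threshold follows from setting the rate of exceedances above the threshold equal to the target intensity under the peaks-over-threshold model. A minimal sketch is given below; the function names are illustrative, and the GPD parameters are assumed to have been fitted separately on healthy calibration scores.

```python
import math

def pot_threshold(u, sigma, xi, rate_u, target_rate):
    """Threshold whose expected exceedance rate equals target_rate.

    u: high quantile used as the POT base level; sigma, xi: GPD scale and shape;
    rate_u: rate of exceedances above u; target_rate: desired false-alarm rate
    (same time unit as rate_u).
    """
    ratio = rate_u / target_rate
    if abs(xi) < 1e-8:                       # exponential-tail limit
        return u + sigma * math.log(ratio)
    return u + (sigma / xi) * (ratio ** xi - 1.0)

def exceedance_rate(tau, u, sigma, xi, rate_u):
    """Expected rate of scores above tau under the same POT model."""
    if abs(xi) < 1e-8:
        return rate_u * math.exp(-(tau - u) / sigma)
    return rate_u * (1.0 + xi * (tau - u) / sigma) ** (-1.0 / xi)

# Hypothetical calibration: 20 exceedances/hour above u, target 0.5 alarms/hour.
tau_on = pot_threshold(u=2.0, sigma=0.5, xi=0.1, rate_u=20.0, target_rate=0.5)
```

Substituting the returned threshold back into `exceedance_rate` recovers the target rate, which is a quick consistency check on any fitted parameters.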
- Dual-Threshold Hysteresis and Hold Time:
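To suppress spurious frame-level alarms, an alarm opens when the score rises above the on-threshold and closes only when it falls below a lower off-threshold, $\tau_{\mathrm{off}} < \tau_{\mathrm{on}}$, with a minimum episode duration $T_{\min}$ and merging of episodes separated by less than $T_{\mathrm{gap}}$. The resulting alarm logic produces episodes whose empirical rate $\hat{\lambda}$ tracks the prescribed $\lambda^{\ast}$, including under speed drift when the thresholds are RPM-adapted.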
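Dual-threshold hysteresis with a minimum duration and gap merging can be sketched as below. The order of operations (merge first, then discard short episodes) and the use of frame counts rather than seconds are assumptions; the source does not specify them.

```python
def alarm_episodes(scores, tau_on, tau_off, min_len, merge_gap):
    """Convert frame-level scores into alarm episodes.

    An episode opens when a score reaches tau_on and closes when a later score
    drops below tau_off (tau_off < tau_on). Episodes separated by fewer than
    merge_gap frames are merged; episodes shorter than min_len frames are dropped.
    Returns a list of (start, end) frame indices with end exclusive.
    """
    episodes, active, start = [], False, 0
    for t, s in enumerate(scores):
        if not active and s >= tau_on:
            active, start = True, t
        elif active and s < tau_off:
            episodes.append((start, t))
            active = False
    if active:                               # still alarming at end of stream
        episodes.append((start, len(scores)))

    merged = []
    for s, e in episodes:
        if merged and s - merged[-1][1] < merge_gap:
            merged[-1] = (merged[-1][0], e)  # close gap with previous episode
        else:
            merged.append((s, e))
    return [(s, e) for s, e in merged if e - s >= min_len]
```

Dividing the number of returned episodes by the monitored duration gives the empirical alarm intensity that is compared against the target rate.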
4. Experimental Design and Evaluation Protocols
Evaluation follows strict leakage-free, right-censored streaming protocols emphasizing reliable, domain-robust deployment.
- Streaming Protocol:
Sliding windows of length $L$, hop size $H$, batch size 1. A burn-in period initializes state. Data splits are disjoint at the machine, load, speed, and sensor level, and no window crosses a split boundary. Per-channel normalization statistics are fit on the training split only.
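A leakage-free windowing and normalization step along these lines can be sketched as follows. The helper names are hypothetical, and the example treats a single contiguous recording; in practice each split would be windowed separately so that no window spans a boundary.

```python
import numpy as np

def sliding_windows(x, length, hop):
    """Yield (start, window) pairs from x of shape (C, T); window is (C, length)."""
    for start in range(0, x.shape[1] - length + 1, hop):
        yield start, x[:, start:start + length]

def fit_channel_stats(train):
    """Per-channel mean and standard deviation from the training split only."""
    mean = train.mean(axis=1, keepdims=True)
    std = train.std(axis=1, keepdims=True) + 1e-8   # avoid division by zero
    return mean, std

def normalize(x, mean, std):
    """Apply training-split statistics to any split without refitting."""
    return (x - mean) / std
```

Fitting the statistics once on training data and reusing them unchanged on validation and test streams is what prevents information from held-out machines or operating conditions from leaking into the model inputs.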
- Timeliness and Right-Censoring:
Detection time is right-censored if no alarm occurs before the end of the run. Timeliness is estimated with Kaplan–Meier estimators, reporting mean/median MTTD with confidence intervals.
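The Kaplan–Meier estimate of time-to-detect under right-censoring can be sketched as below. This is the textbook product-limit estimator; the function names are illustrative, and confidence intervals are omitted for brevity.

```python
import numpy as np

def kaplan_meier(times, detected):
    """Product-limit estimate of P(no detection by time t).

    times: detection time, or run end time if no alarm occurred.
    detected: True where an alarm occurred, False where the run was censored.
    Returns (event_times, survival) evaluated at each distinct detection time.
    """
    times = np.asarray(times, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    event_times = np.unique(times[detected])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                 # runs still undetected at t
        events = np.sum((times == t) & detected)     # detections exactly at t
        s *= 1.0 - events / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def km_median(event_times, surv):
    """First time at which the survival estimate drops to 0.5 or below."""
    below = np.where(surv <= 0.5)[0]
    return float(event_times[below[0]]) if below.size else float("nan")
```

Censored runs contribute to the at-risk count until their end time but never count as detections, so runs that end without an alarm do not bias the detection-time estimate downward.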
- Datasets:
- CWRU bearing data (speeds, loads, rigs)
- Paderborn University (seeded faults, speed × torque, cross-rig)
- XJTU-SY run-to-failure (chronological splits)
- Industrial pilot (in-service rotating machinery)
- Metrics:
- Precision–Recall AUC (PR-AUC) under severe class imbalance
- ROC AUC
- Mean time-to-detect (MTTD) at matched false-alarm intensity $\lambda^{\ast}$
- Alarm intensity (episodes/hour, hysteresis+merge logic)
- Cross-domain transfer: AUC and MTTD retention and gain under directed shifts and few-shot adaptation, each measured relative to the corresponding source-domain or pre-adaptation value
5. Key Results and Ablation Findings
- Detection Performance:
- PR-AUC approximately 0.96–0.94 and ROC AUC 0.99–0.97 across CWRU/Paderborn/XJTU-SY (graceful degradation to 0 dB SNR).
- Mean MTTD of roughly 28–33 s (clean), increasing to 49–61 s at SNR = 0 dB, at the matched target false-alarm intensity (events/hour).
- Empirical false-alarm intensity remains within a small tolerance of the target (in events/hour) and is stable under RPM drift.
- Transfer Across Domains:
- AUC retention around 0.95 for cross-load/speed; MTTD retention around 0.9; transfer across sensor/rig/dataset is robust.
- Few-shot adaptation (1–5% labels) recovers nearly oracle performance.
- Ablation and Latency:
- Removing any encoder branch or physics prior degrades PR-AUC, increases FAR, or worsens MTTD.
- Excluding EVT/hysteresis disrupts intensity matching and increases chatter.
- Latency: median inference on the order of 10 ms (p50) and 12 ms (p90/p99) on CPU/Jetson; model size 0.8M parameters, 0.28 GFLOPs.
6. Significance, Applications, and Interpretation
PG-TMT combines physically aligned representation learning with calibrated, interpretable, and operationally robust early fault warnings for reliability-centric prognostics and health management. Its fusion of transient detection, slow-trend modeling, cross-channel resonance capture, and analytic attention-band alignment is directly interpretable in terms of vibrational fault physics. The EVT-calibrated, hysteretic alarm logic provides explicit guarantees on false-alarm rates and episode integrity under nonstationary and imbalanced conditions. Demonstrated performance across public benchmarks and real-world pilots, together with robustness to domain shifts and low-SNR conditions, establishes PG-TMT as a deployment-ready solution for industrial rotating machinery monitoring (Li et al., 29 Jan 2026).