Safe-NEureka: High-Reliability Extreme Systems
- Safe-NEureka is a comprehensive engineering framework that integrates quantitative risk assessment, modular redundancy, and ALARA-compliant safety practices for both high-energy accelerators and satellite AI systems.
- The methodology employs hybrid modular redundancy, SEC-DED ECC memory protection, and TMR controllers to ensure fault tolerance and reliable performance under extreme thermal, mechanical, and radiation stresses.
- The framework also extends to chemical and environmental safety by using high-flash-point solvents and robust shielding, achieving significant reductions in faults, exposure risks, and operational overhead.
Safe-NEureka refers both to a class of engineering practices for high-safety, high-reliability systems operating in extreme environments (notably multi-MW neutrino beamlines and large-volume particle detectors) and to a radiation-tolerant DNN accelerator architecture for on-board satellite AI. The term embodies a design philosophy and implementation framework that mandates rigorous risk quantification, modular redundancy, robust hardware/software co-design, and environmental hazard mitigation, with formal compliance with defined safety metrics across operational domains (Baussan et al., 2011, Bonhomme et al., 2022, Tedeschi et al., 4 Feb 2026).
1. General Principles and Safety Objectives
Safe-NEureka architectures are governed by a principle of “no single point of failure” and the provision of quantitative engineering margins against all identified risk factors. Key safety and reliability goals, as established in multi-MW accelerator and satellite AI contexts, include:
- Partitioning high-risk elements (e.g., 4 MW proton beam into four 1 MW targets) to reduce exposure per module (Baussan et al., 2011).
- Maintaining thermal, mechanical, and electrical stresses below the endurance limits for multi-year, high-cycling operation (e.g., the fatigue limit of Al 6061-T6 under repeated beam pulses) (Baussan et al., 2011).
- Ensuring all failure modes, from radiation-induced upsets to hardware wear-out, are mitigated either by redundancy (modular, DMR/TMR), robust materials design, or remote/intervention-free recovery protocols (Tedeschi et al., 4 Feb 2026).
- ALARA-driven shielding, containment, and environmental protection for personnel and habitat, combining passive materials barriers and active air/ventilation control (Baussan et al., 2011, Bonhomme et al., 2022).
2. Redundancy, Fault Tolerance, and Recovery Architectures
Central to Safe-NEureka’s approach is modular redundancy at both system and sub-system levels:
- Hybrid Modular Redundancy (HMR): Safe-NEureka accelerator IP splits a 4×4 processing-element (PE) array into two 4×2 sub-arrays. At run-time, this hardware can be switched between:
  - Dual Modular Redundancy (DMR) mode for safety-critical workloads, with online output comparison and hardware rollback on mismatch;
  - Performance mode, operating both sub-arrays independently for throughput maximization (Tedeschi et al., 4 Feb 2026).
- Memory Protection (SEC-DED ECC): All tightly coupled data memory and metadata paths are guarded by Hsiao SEC-DED codes, providing single-bit error correction and double-bit error detection on the fly (Tedeschi et al., 4 Feb 2026).
- TMR Controller: The critical controller FSM and loop microcode are triplicated with majority voting. Triplication increases the area of the controller itself but accounts for only a small fraction of total accelerator area (Tedeschi et al., 4 Feb 2026).
- Recovery FSM: On a DMR mismatch, online detection triggers a rollback: the microcode pointer is reverted and the affected tile is recomputed. Recovery latency is bounded (e.g., $90$–$330$ cycles for typical CNN tiles) and is decoupled from global system reboots.
- Multi-level Redundancy in Neutrino Facilities: Target/horn systems, cooling circuits, and power supplies are deployed in parallel. Critical activated components are handled remotely to avoid personnel exposure. ALARA principles further drive redundant barriers and fail-safe environmental controls (Baussan et al., 2011).
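The DMR compare-and-rollback loop and the TMR majority vote described in the list above can be illustrated in software. The following is a minimal behavioral sketch, not the accelerator's hardware implementation: the function names (`tmr_vote`, `run_tile_dmr`), the `fault_hook` injection point, and the retry limit are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Optional, Tuple


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replicated words.

    A fault confined to any single copy is masked by the other two.
    """
    return (a & b) | (a & c) | (b & c)


def run_tile_dmr(
    compute: Callable[[List[int]], int],
    tile: List[int],
    fault_hook: Optional[Callable[[int, int], int]] = None,
    max_retries: int = 3,
) -> Tuple[int, int]:
    """Run one tile on two replicas and compare the outputs.

    On a mismatch the result is discarded and the tile is recomputed
    (a software stand-in for the hardware rollback). Returns the agreed
    output and the number of rollbacks that were needed.
    """
    for attempt in range(max_retries + 1):
        out_a = compute(tile)
        out_b = compute(tile)
        if fault_hook is not None:
            # Hypothetical fault injection: corrupts replica A only.
            out_a = fault_hook(out_a, attempt)
        if out_a == out_b:
            return out_a, attempt
    raise RuntimeError("persistent mismatch: escalate beyond tile-level recovery")


def flip_once(value: int, attempt: int) -> int:
    """Flip one bit of the result on the first attempt only (transient upset)."""
    return value ^ 0x4 if attempt == 0 else value


result, rollbacks = run_tile_dmr(sum, [1, 2, 3, 4], fault_hook=flip_once)
print(result, rollbacks)                      # 10 1
print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))  # 0b1010
```

In this sketch a transient upset costs exactly one tile recomputation, which mirrors the point that recovery stays local to the tile rather than forcing a system-level restart.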
The quantified impact is a $96\%$ reduction in faulty executions for DMR with a manageable area overhead (about $15\%$, the figures listed in the summary table of Section 6). In redundancy mode, Safe-NEureka exhibits a latency increase of $70\%$ or more and reduced energy efficiency (TOPS/W), whereas in performance mode the throughput and efficiency reductions are constrained to a range starting at $5\%$ (Tedeschi et al., 4 Feb 2026).
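To make the SEC-DED behavior concrete, here is a small sketch of an odd-weight-column (Hsiao-style) code. It uses a toy $(13, 8)$ layout with 8 data bits and 5 check bits, chosen only for brevity; the code width used in the actual accelerator is not stated here and is not assumed. All names (`encode`, `decode`, `DATA_COLS`) are illustrative.

```python
from itertools import combinations
from typing import Optional, Tuple

DATA_BITS = 8
CHECK_BITS = 5

# Parity-check columns. Hsiao's construction uses only odd-weight columns:
# weight-3 columns for data bits, weight-1 columns for check bits.
DATA_COLS = [
    sum(1 << i for i in combo) for combo in combinations(range(CHECK_BITS), 3)
][:DATA_BITS]
CHECK_COLS = [1 << i for i in range(CHECK_BITS)]


def _check_bits(data: int) -> int:
    """XOR together the columns of all data bits that are set."""
    acc = 0
    for bit, col in enumerate(DATA_COLS):
        if (data >> bit) & 1:
            acc ^= col
    return acc


def encode(data: int) -> Tuple[int, int]:
    """Return (data, check) for an 8-bit data word."""
    return data, _check_bits(data)


def decode(data: int, check: int) -> Tuple[Optional[int], str]:
    """Correct single-bit errors; flag double-bit errors as uncorrectable."""
    syndrome = _check_bits(data) ^ check
    if syndrome == 0:
        return data, "ok"
    if bin(syndrome).count("1") % 2 == 1:
        # Odd-weight syndrome: consistent with a single flipped bit.
        if syndrome in DATA_COLS:
            return data ^ (1 << DATA_COLS.index(syndrome)), "corrected"
        if syndrome in CHECK_COLS:
            return data, "corrected"  # the flipped bit was a check bit
        return None, "uncorrectable"  # matches no column: three or more errors
    # Even-weight, nonzero syndrome: two bits flipped.
    return None, "double-error"


word, check = encode(0xA7)
print(decode(word, check))          # (167, 'ok')
print(decode(word ^ 0x10, check))   # (167, 'corrected')
print(decode(word ^ 0x11, check))   # (None, 'double-error')
```

Because every column has odd weight, any single error yields an odd-weight syndrome and any double error yields an even-weight one, so the two cases are separated by a parity check on the syndrome alone.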
3. Thermo-Mechanical and Radiation Safety Analysis
Safe-NEureka frameworks employ multi-physics FEA and probabilistic fault models to establish robust operation:
- Horn and Target Modules: Finite element analysis integrates pulsed electromagnetic (J×B) stresses (magnetic pressure $P = B^2/2\mu_0$) with steady-state and transient thermal profiles. Fatigue S–N curves for the relevant alloys (e.g., Al 6061-T6) establish the allowable stress amplitude at the required number of cycles (Baussan et al., 2011).
- Beam Window and Target: For a 0.25 mm Be window under a 1 MW beam, water or He cooling holds the window temperature (reported for each coolant) at a level that keeps the window well within beryllium strength limits. Ti6Al4V packed-bed sphere targets with He cooling provide thermal-shock mitigation and facilitate remote handling (Baussan et al., 2011).
- Radiation Shielding: Facility walls use $5.5$ m concrete; FLUKA models confirm negligible rock activation after 200 operational days. All highly activated equipment is accessed only by remote manipulators to minimize dose (Baussan et al., 2011).
- Environmental Controls for Scintillators: High-flash-point, non-toxic solvents (e.g., polysiloxane, LAB) are selected, with vapor pressures and flash points documented in compliance with GHS/EU safety standards (e.g., TPTMTS carries no H-statements) (Bonhomme et al., 2022).
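The magnetic pressure that enters the horn stress analysis follows from the standard relations $B = \mu_0 I / (2\pi r)$ for the azimuthal field around an axial current and $P = B^2 / (2\mu_0)$. The sketch below evaluates them for a single operating point. The current and radius are illustrative placeholders chosen for round numbers; they are not values taken from Baussan et al., and the function names are likewise assumptions.

```python
import math

MU_0 = 4e-7 * math.pi  # vacuum permeability in H/m (classical value)


def horn_field(current_a: float, radius_m: float) -> float:
    """Azimuthal field B = mu0 * I / (2 * pi * r), in tesla."""
    return MU_0 * current_a / (2.0 * math.pi * radius_m)


def magnetic_pressure(b_tesla: float) -> float:
    """Magnetic pressure P = B^2 / (2 * mu0), in pascal."""
    return b_tesla ** 2 / (2.0 * MU_0)


# Hypothetical operating point (assumed for illustration only).
peak_current = 300e3   # A
inner_radius = 0.03    # m

b = horn_field(peak_current, inner_radius)
p = magnetic_pressure(b)
print(f"B = {b:.2f} T, P = {p / 1e6:.2f} MPa")  # B = 2.00 T, P = 1.59 MPa
```

Since $P$ scales as $I^2 / r^2$, the smallest-radius section of the inner conductor sees the highest pulsed load, which is why the fatigue check concentrates there.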
4. Safe-NEureka in Chemical Systems and Environmental Safety
Safe-NEureka principles are extended to liquid scintillator systems via the adoption of advanced solvents and materials:
- Scintillator Design: Classical solvents (toluene, xylene, pseudocumene) offer high light yield (LY) but have low flash points and high toxicity. Safe-NEureka-compliant solutions emphasize:
  - Linear alkylbenzene (LAB): high flash point, vapor pressure $0.013$ mbar, GHS non-hazardous, long attenuation length, LY reported relative to anthracene (Bonhomme et al., 2022).
  - Polysiloxane (TPTMTS): high flash point, non-volatile, non-toxic, attenuation length 5–10 m (unpurified), LY reported relative to anthracene.
- Operational Recommendations: For kiloton-scale, transparency-dominated detectors, LAB with PPO and bis-MSB is favored. For safety-dominated contexts (reactor proximity, strict VOC caps), TPTMTS solutions are prioritized despite lower LY and increased viscosity (Bonhomme et al., 2022).
- Deployment Practices: Purification (e.g., with an Al$_2$O$_3$ column) extends the attenuation length, the operating temperature is held in a controlled band with a lower bound of $15\,^{\circ}$C, and all operations are designed to minimize environmental and occupational hazards via remote and automated protocols.
5. Compliance, Lifetime, and Mixed-Criticality Use Cases
Safe-NEureka compliance spans both operational and lifecycle safety:
- Fault Coverage: Fault-injection analyses demonstrate safe accelerator modes reduce undetected error rates by nearly two orders of magnitude. Controller-originated errors are nearly eliminated via TMR, with remaining failures traceable to non-triplicated logic eligible for further hardening (Tedeschi et al., 4 Feb 2026).
- Mode Switching and Overhead: The architecture allows dynamic mode switches (DMR ↔ Performance) via a memory-mapped register, facilitating rapid adaptation to changing mission phases without job restarts and with reconfiguration overhead on the order of 400 cycles at most (Tedeschi et al., 4 Feb 2026).
- Facility and Environmental Lifetime: Beamline modules and scintillator containment are engineered for high-cycle operation and multi-year lifetimes, with full remote handling and repair scenarios validated, ensuring that ALARA, fail-safe, and environmental targets are met (Baussan et al., 2011, Bonhomme et al., 2022).
- Mixed-Criticality Operation: Satellites process GNC kernels in redundancy mode (latency trade-off for correctness), while payload filters utilize performance mode (minimal throughput loss) (Tedeschi et al., 4 Feb 2026).
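The mixed-criticality pattern above, in which a mode register is written before each kernel is launched, can be sketched as a small host-side dispatcher. This is a behavioral model only: the register address, the `Mode` encoding, the `FakeAccelerator` class, and the kernel names are hypothetical and do not describe the real register map.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Tuple


class Mode(IntEnum):
    PERFORMANCE = 0  # both sub-arrays work independently
    REDUNDANCY = 1   # sub-arrays run in lockstep (DMR)


MODE_REG_ADDR = 0x1000_0000  # hypothetical memory-mapped mode register


@dataclass
class FakeAccelerator:
    """Software stand-in that records which mode each kernel ran in."""

    regs: Dict[int, int] = field(default_factory=dict)
    log: List[Tuple[str, Mode]] = field(default_factory=list)
    mode_writes: int = 0

    def write_reg(self, addr: int, value: int) -> None:
        self.regs[addr] = value
        if addr == MODE_REG_ADDR:
            self.mode_writes += 1

    def run(self, kernel: str) -> None:
        mode = Mode(self.regs.get(MODE_REG_ADDR, Mode.PERFORMANCE))
        self.log.append((kernel, mode))


def dispatch(acc: FakeAccelerator, jobs: List[Tuple[str, bool]]) -> None:
    """Launch each (kernel, safety_critical) job in the appropriate mode."""
    for kernel, safety_critical in jobs:
        wanted = Mode.REDUNDANCY if safety_critical else Mode.PERFORMANCE
        # Only touch the register when the mode actually changes.
        if acc.regs.get(MODE_REG_ADDR) != wanted:
            acc.write_reg(MODE_REG_ADDR, wanted)
        acc.run(kernel)


acc = FakeAccelerator()
dispatch(acc, [("gnc_attitude", True), ("gnc_orbit", True), ("payload_filter", False)])
for kernel, mode in acc.log:
    print(kernel, mode.name)
```

Skipping redundant register writes keeps the number of mode switches equal to the number of criticality changes in the job stream, so the switching cost is paid only at mission-phase boundaries.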
6. Summary of Quantitative Risk Mitigation
The following table summarizes salient risk mitigation parameters and their engineered responses:
| Risk Domain | Mitigation Strategy | Quantitative Outcome |
|---|---|---|
| Radiation-induced faults (DNN) | DMR, TMR, SEC-DED ECC | 96% reduction in faults, area overhead 15% |
| Proton beam/horn overstress | FEA-verified limits, cooling, fatigue | Stress amplitude held below the Al 6061-T6 fatigue limit |
| Chemical fire/exposure, LS detectors | High-flash-point, non-toxic solvents (LAB, TPTMTS) | High flash points for both LAB and TPTMTS |
| Environmental/occupational exposure | ALARA shielding, remote handling | Rock activation ≃ 0, minimal VOC/groundwater risk |
| Controller/hardware logic errors | TMR-voted FSM, hardware rollback | Nearly all controller errors recovered |
7. Outlook and Future Prospects
Safe-NEureka frameworks, both in hardware and facility-scale contexts, provide a model for engineering rigorous, multi-modal safety in systems exposed to extreme environments and mixed-criticality computational workloads. Potential extensions include dynamic adaptation of redundancy and recovery strategies, integration with on-line system health monitoring, and further formalization of ALARA-compliant environmental controls in increasingly large-scale or autonomous systems (Baussan et al., 2011, Bonhomme et al., 2022, Tedeschi et al., 4 Feb 2026).