Edge–Cloud–HPC Continuum Overview
- Edge–Cloud–HPC Continuum is an integrated infrastructure that combines edge devices, cloud data centers, and HPC clusters to support diverse scientific workflows and real-time analytics.
- It employs advanced orchestration frameworks (e.g., Nextflow) and AI-driven schedulers (e.g., DECICE) to optimize task mapping, reduce latency, and improve energy and cost efficiency.
- Robust security and provenance mechanisms across tiers ensure data integrity, reproducibility, and fault tolerance in complex multi-resource environments.
The Edge–Cloud–HPC Continuum denotes an integrated computational infrastructure spanning edge devices (sensors, lab instruments, IoT nodes), cloud data centers (public/private, burstable or on-demand compute/storage), and high-performance computing (HPC) platforms (large-scale clusters with low-latency interconnects) connected by high-speed networks and orchestrated storage repositories. This architecture enables distributed scientific workflows, real-time analytics, and machine learning to operate across heterogeneous resources with optimization for performance, cost, energy, security, and reproducibility (Tallent et al., 2024, Santillan et al., 2 Aug 2025, Rosendo et al., 2022).
1. System Architecture and Infrastructure Components
The continuum is realized as a multi-tiered architecture composed of:
- Edge Tier: Physical sensors, scientific instruments, embedded compute (GPU/TPU for near-instrument inference), and secure uplinks. This tier is responsible for raw-data acquisition, initial inference, and privacy-critical preprocessing.
- Cloud Tier: Elastic services (e.g., AWS Fargate, serverless Lambda), managed AI/ML pipelines, batch queues, and REST/gRPC endpoints supporting scaling, retraining, event-driven task launching, and cost-optimized compute.
- HPC Tier: Large-scale clusters (e.g., NERSC Perlmutter, EC2 C5/G4 fleets) offering advanced simulation, parallel analytics, GNN training, and I/O acceleration using distributed file systems (FSx for Lustre, GPFS) and batch schedulers (Slurm, PBS).
- Data Repositories: Geo-distributed object stores, HDF5/netCDF archives, versioned file systems supporting provenance and semantic metadata curation.
Connectivity across tiers is managed by high-speed networks (InfiniBand, 5G, WAN) and protocol stacks that enable seamless movement of data and tasks (Tallent et al., 2024, Santillan et al., 2 Aug 2025). In practice, architectural complexity is reflected in service-count and workflow-count metrics, with typical deployments showing mean(|S|)=8.1–8.5 services and mean(|W|)=3.27–4 workflows per architecture (Santillan et al., 2 Aug 2025).
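The service- and workflow-count metrics above can be made concrete with a small model. The sketch below is purely illustrative: the service and workflow names are invented, chosen only so that the computed means land inside the reported ranges.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Architecture:
    """One continuum deployment: its set of services S and workflows W."""
    services: set
    workflows: set

# Hypothetical deployments (names invented for the example).
archs = [
    Architecture({"ingest", "mqtt", "s3", "lambda", "sagemaker", "slurm", "lustre", "grafana"},
                 {"training", "inference", "archival"}),
    Architecture({"ingest", "kafka", "s3", "fargate", "batch", "pbs", "gpfs", "rest", "provenance"},
                 {"simulation", "analytics", "retraining", "monitoring"}),
]

mean_services = mean(len(a.services) for a in archs)    # mean(|S|)
mean_workflows = mean(len(a.workflows) for a in archs)  # mean(|W|)
print(mean_services, mean_workflows)
```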
2. Workflow Model, Scheduling, and Optimization
Scientific and analytic workflows in the continuum are modeled as directed acyclic graphs (DAGs) comprising heterogeneous tasks (numerical solvers, analytic routines, ML modules). Formal scheduling is defined by a mapping function $\mu: T \to R$ from tasks to resources, start and finish times $s(t)$ and $f(t)$ for each task, and inter-tier data-transfer delays on cross-resource edges (Tallent et al., 2024, Sharma et al., 18 May 2025).
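A minimal sketch of this scheduling model, with a three-task DAG, a mapping from tasks to tiers, and start/finish times that respect data dependencies plus a transfer delay on edges that cross tiers. All task names, runtimes, and delays here are invented for illustration.

```python
# DAG as adjacency list: task -> successors
dag = {"acquire": ["infer"], "infer": ["train"], "train": []}
runtime = {"acquire": 2.0, "infer": 5.0, "train": 20.0}
mu = {"acquire": "edge", "infer": "edge", "train": "hpc"}  # mapping function
transfer_delay = 3.0  # added when an edge crosses tiers

# Invert the DAG to find each task's predecessors.
preds = {t: [] for t in dag}
for t, succs in dag.items():
    for s in succs:
        preds[s].append(t)

start, finish = {}, {}
for t in ["acquire", "infer", "train"]:  # topological order
    ready = 0.0
    for p in preds[t]:
        d = transfer_delay if mu[p] != mu[t] else 0.0
        ready = max(ready, finish[p] + d)
    start[t] = ready
    finish[t] = ready + runtime[t]

print(finish["train"])  # makespan of this mapping
```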
Multi-Objective Formulations
Scheduling and workload mapping target combined minimization of makespan, energy, and monetary cost:

$$\min_{\mu}\;\; \alpha \, T_{\text{makespan}}(\mu) + \beta \, E(\mu) + \gamma \, C(\mu)$$

with

$$T_{\text{makespan}}(\mu) = \max_{t \in T} f(t),$$

subject to resource capacity, feature compatibility, data dependency, and transfer constraints (Tallent et al., 2024, Sharma et al., 18 May 2025).
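The weighted-sum objective can be sketched directly. In this hypothetical example, three candidate mappings are scored with assumed weights and (makespan, energy, cost) values; the numbers are invented and the winner depends entirely on the chosen weights.

```python
def score(makespan, energy, cost, alpha=0.5, beta=0.3, gamma=0.2):
    # Weighted-sum scalarization of the three objectives.
    return alpha * makespan + beta * energy + gamma * cost

# Candidate mappings: (makespan s, energy kJ, monetary cost $) -- illustrative.
candidates = {
    "all_cloud": (120.0, 40.0, 9.0),
    "edge_hpc":  (60.0, 55.0, 14.0),
    "hybrid":    (80.0, 45.0, 10.0),
}

best = min(candidates, key=lambda k: score(*candidates[k]))
print(best)
```

Note that a scalarized objective finds one point on the Pareto front; sweeping the weights (or using an MILP/heuristic as below) explores the trade-off space.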
Orchestration Algorithms
- FastFlow: Identifies critical flows, balances locality and parallelism via lightweight analytic projections for linear-time scheduling.
- HEFT/OLB and MILP: MILP yields exact results for small workflows. HEFT and OLB provide scalable approximate solutions (≤10% optimality deviation, 99× speedup for 10⁴+ tasks) (Sharma et al., 18 May 2025).
- SkyPilot Broker and Nextflow: Dynamically select best cloud/HPC regions/resources based on target objectives, supporting portable and cross-provider deployments.
- AI-Schedulers (DECICE): Hybrid of supervised forecasting and DRL-based closed-loop MDP policies, supported by a digital twin for high-fidelity evaluation (Kunkel et al., 2023).
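HEFT, named above, is a list-scheduling heuristic: rank tasks by "upward rank" (average execution cost plus the most expensive path to the exit), then greedily place each task on the processor giving the earliest finish time. The sketch below is a compact illustration with invented costs; it omits HEFT's insertion-based slot search for brevity.

```python
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {  # task -> per-processor execution time (heterogeneous)
    "a": {"p0": 4, "p1": 6}, "b": {"p0": 3, "p1": 2},
    "c": {"p0": 5, "p1": 4}, "d": {"p0": 2, "p1": 3},
}
comm = 1.0  # uniform inter-processor transfer cost

def upward_rank(t, memo={}):
    # Average compute cost plus the costliest (comm + rank) path to the exit.
    if t not in memo:
        avg = sum(cost[t].values()) / len(cost[t])
        succ = max((comm + upward_rank(s) for s in dag[t]), default=0.0)
        memo[t] = avg + succ
    return memo[t]

proc_ready = {"p0": 0.0, "p1": 0.0}
placed, finish = {}, {}
for t in sorted(dag, key=upward_rank, reverse=True):  # rank order => preds first
    best = None
    for p in proc_ready:
        ready = proc_ready[p]
        for pred, succs in dag.items():
            if t in succs:  # data must arrive from each predecessor
                d = comm if placed[pred] != p else 0.0
                ready = max(ready, finish[pred] + d)
        eft = ready + cost[t][p]
        if best is None or eft < best[0]:
            best = (eft, p)
    finish[t], placed[t] = best
    proc_ready[best[1]] = best[0]

print(placed, finish)
```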
3. Performance, Energy, and Reliability Trade-offs
Quantitative Metrics
Empirical studies report substantial speedups and efficiency gains enabled by continuum orchestration:
- Response-time speedup: 1.28×–87× (DataLife, FastFlow + bottleneck detection) (Tallent et al., 2024)
- AI segmentation improvement: mean IoU up to +21.4%, false positives down by 18.4% (SAM-I-Am vs vanilla SAM) (Tallent et al., 2024)
- I/O acceleration: Monte Carlo (10×), E3SM storm tracking (3.7×) (Tallent et al., 2024)
- ML pipeline (SageMaker EC2 + Lambda@Edge): training (4–8h, $25/h for 100GB), inference (<$0.20/million requests, 10–50ms latency) (Santillan et al., 2 Aug 2025)
- MRI segmentation (DECICE): time-to-result reduced 120s → 25s; 60% data bandwidth savings (Kunkel et al., 2023)
Energy and Cost Models
Simple linear energy model per task:

$$E_t = P_r \cdot t_{\text{exec}}$$

where $P_r$ is the power draw of the assigned resource and $t_{\text{exec}}$ the task's execution time. Roofline compression model (modulating compute/memory bottlenecks):

$$t = \max\!\left(\frac{W}{F_{\text{peak}}},\; \frac{D/\mathrm{CR}}{B}\right)$$

where $W$ is compute work, $D$ data volume, $F_{\text{peak}}$ peak compute throughput, $B$ bandwidth, and CR is compression ratio (Tallent et al., 2024).
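The two models above can be sketched in a few lines. The power draw, flop counts, and bandwidths below are illustrative assumptions, not measured values; the point is that raising CR moves a memory-bound task toward the compute bound.

```python
def task_energy(power_w, time_s):
    # Linear model: E = P * t
    return power_w * time_s

def roofline_time(flops, peak_flops, data_bytes, bandwidth, cr):
    # Execution time is bounded by compute or by (compressed) data movement;
    # the compression ratio CR reduces the bytes that must move.
    return max(flops / peak_flops, (data_bytes / cr) / bandwidth)

# 1 Tflop of work on a 10 Tflop/s device, 80 GB of data over 10 GB/s.
t_raw = roofline_time(1e12, 1e13, 8e10, 1e10, cr=1.0)   # memory-bound
t_cmp = roofline_time(1e12, 1e13, 8e10, 1e10, cr=10.0)  # compute-bound now
print(t_raw, t_cmp, task_energy(250, t_raw))
```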
Reliability Provisions
- Automatic checkpointing and task retries (Nextflow+SkyPilot) counteract cloud spot instance volatility.
- Workflow-level provenance (DataLife/DaYu, ProvLight) enables deterministic replay and partial re-execution.
- DECICE maintains system resilience, auto-migrating jobs under connectivity disruptions within 5s (Kunkel et al., 2023, Rosendo et al., 2023).
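The checkpoint-and-retry pattern behind the first bullet can be sketched as follows. The preempted task and the in-memory checkpoint store are simulated; real deployments persist checkpoints to durable storage so that a retry resumes from the last completed chunk rather than from scratch.

```python
checkpoint = {"done": 0}  # stands in for durable checkpoint storage
attempts = {"n": 0}

def run_chunk(i):
    # Simulated task chunk: the first two attempts are "preempted".
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RuntimeError(f"preempted at chunk {i}")
    return i

def run_with_retries(n_chunks, max_retries=5):
    for _ in range(max_retries):
        try:
            while checkpoint["done"] < n_chunks:
                run_chunk(checkpoint["done"])
                checkpoint["done"] += 1  # persist progress after each chunk
            return checkpoint["done"]
        except RuntimeError:
            continue  # resume from the last checkpoint
    raise RuntimeError("retries exhausted")

result = run_with_retries(3)
print(result)
```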
4. Security Models and Data Provenance
Security mechanisms are tier-specific:
- Edge: Mutual-TLS, device certificates, hardware-rooted trust.
- Cloud: RBAC, IAM, event bus encryption (SNS/SQS).
- HPC: Kerberos/SPNEGO, encrypted data-in-transit (MPI/TCP).
- Repositories: Server-side and client envelope encryption (SSE-S3, SSE-KMS) (Tallent et al., 2024).
Formal trust policies use first-order logic, e.g.,

$$\mathrm{trusted}(u) \leftarrow \mathrm{holds}(u, c) \wedge \mathrm{signedBy}(c, a) \wedge \mathrm{authority}(a),$$

and authority-signed certificates to establish trusted users.
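Such a rule can be evaluated mechanically. The sketch below is a hypothetical check, with invented authority names and certificate records: a user is trusted iff some certificate names them as subject and is issued by a recognized authority.

```python
AUTHORITIES = {"DOE-CA", "CampusCA"}  # hypothetical trusted roots

def trusted(user, certs):
    # trusted(u) <- exists c, a: holds(u, c) and signedBy(c, a) and authority(a)
    return any(c["subject"] == user and c["issuer"] in AUTHORITIES
               for c in certs)

certs = [
    {"subject": "alice", "issuer": "DOE-CA"},
    {"subject": "mallory", "issuer": "SelfSigned"},
]
print(trusted("alice", certs), trusted("mallory", certs))
```

A real deployment would additionally verify the cryptographic signature and validity window, not just the issuer name.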
Provenance capture is integrated throughout workflow steps using semantic metadata (HDF5, JSON-LD), supporting queryable, reproducible pipelines and fault recovery (Nextflow, DataLife/DaYu, ProvLight). ProvLight achieves up to 37× faster capture and 2.5× lower energy cost on edge devices than previous systems (Rosendo et al., 2023).
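Per-step provenance with JSON-LD-style semantic metadata can look roughly like the sketch below. The vocabulary draws on the W3C PROV namespace, but the record shape and fields are illustrative, not the actual ProvLight or DataLife schema.

```python
import hashlib

def provenance_record(step, inputs, outputs, resource):
    """Emit a JSON-LD-flavored provenance record for one workflow step."""
    return {
        "@context": {"prov": "http://www.w3.org/ns/prov#"},
        "@type": "prov:Activity",
        "step": step,
        # Content identifiers make replay and partial re-execution checkable.
        "prov:used": [{"path": p,
                       "sha256": hashlib.sha256(p.encode()).hexdigest()[:8]}
                      for p in inputs],
        "prov:generated": outputs,
        "resource": resource,
    }

rec = provenance_record("segment", ["scan_001.h5"], ["mask_001.h5"], "edge-gpu-0")
print(rec["@type"], rec["prov:used"][0]["path"])
```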
5. Representative Workloads and Empirical Lessons
Scientific and Industry Case Studies
- STEM imaging: Real-time edge inference (SAM-I-Am, GPU), synthetic modeling (PRISMScope), GNN retraining (MassiveGNN), compression (ViSemZ).
- BelleII pipeline: 10× I/O speedup, cloud-HPC dataflow.
- E3SM: I/O storm detection/remediation (3.7× improvement).
- MRI (DECICE): Edge-GPU slicing, segmented in hybrid cloud/HPC pipelines.
- Emergency drone response: 95% mission coverage, edge-auto-migration under network drops (Kunkel et al., 2023).
Key Bottlenecks
Identified system challenges include:
- Semantic I/O storms in verbose scientific formats (HDF5/netCDF).
- Memory-network imbalance at HPC scale when scaling workflow data.
- Model-transfer delays across tiers due to non-adaptive compression or poor scheduling (Tallent et al., 2024).
Design Recommendations
- Instrument-proximal inference for minimal latency.
- Early, domain-aware compression to curtail expensive data movement.
- Fine-grained introspection informing dynamic scheduling for optimal bottleneck remediation.
- Use portable templates (Nextflow+SkyPilot), goal-directed brokers, and rigorous provenance frameworks to ensure reproducibility and fault tolerance (Tallent et al., 2024).
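The early-compression recommendation reduces to a simple break-even test: compress at the edge only when compression time plus the reduced transfer time beats sending raw data. The throughput numbers below are illustrative assumptions.

```python
def transfer_time(nbytes, bandwidth_bps, cr=1.0, compress_bps=None):
    """Seconds to move nbytes over the link, optionally compressing first."""
    t = (nbytes / cr) / bandwidth_bps
    if compress_bps is not None:
        t += nbytes / compress_bps  # time spent compressing before sending
    return t

size = 10e9               # 10 GB instrument burst (hypothetical)
wan = 1e9 / 8             # 1 Gb/s WAN, in bytes/s
raw = transfer_time(size, wan)
compressed = transfer_time(size, wan, cr=8.0, compress_bps=2e9)
print(raw, compressed, compressed < raw)
```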
6. Continuum-wide AI/Machine Learning and Orchestration Patterns
AI and ML services span all tiers:
- Edge: Real-time inference using tiny ML frameworks (TensorFlow Lite, Lambda@Edge, Greengrass).
- Cloud: Elastic retraining, business logic, distributed analytics (SageMaker, Lambda).
- HPC: Large-scale model training (multi-GPU, parameter servers), batch-oriented analytics.
- Federated Learning: Device-local updates, central aggregation for privacy and global convergence.
- Orchestration Ecosystems: Kubernetes (Cloud, KubeEdge), Volcano (HPC batch), Nextflow, E2Clab for reproducible deployments. Semantic brokers (HERMES) provide ontology-based interoperability, complemented by blockchain-backed resource marketplaces and DRL-enabled multi-objective scheduling (Dehury et al., 9 Dec 2025, Kunkel et al., 2023, Rosendo et al., 2022).
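The federated-learning pattern above centers on one aggregation step: devices compute local updates, and the cloud combines them as a sample-weighted average (FedAvg). The sketch below uses plain lists as stand-ins for model tensors; the client counts and weights are invented.

```python
def fed_avg(updates):
    """updates: list of (num_local_samples, weight_vector) from devices."""
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    # Sample-weighted average of each coordinate across clients.
    return [sum(n * w[i] for n, w in updates) / total for i in range(dim)]

client_updates = [
    (100, [1.0, 2.0]),  # device A: 100 local samples
    (300, [3.0, 6.0]),  # device B: 300 local samples
]
print(fed_avg(client_updates))
```

Only the update vectors and sample counts leave the devices, which is what gives the pattern its privacy advantage over centralizing raw data.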
7. Experimental Platforms, Provenance, and Reproducibility
Realistic continuum experimentation and workflow optimization are supported by integrated platforms:
- E2Clab: End-to-end deployment, analysis, and optimization stack; supports parameter space definition, execution, monitoring, artifact archiving, and reproducibility (Rosendo et al., 2021, Rosendo et al., 2021).
- ProvLight: High-efficiency provenance capture for edge-to-cloud workflows; integrated with E2Clab for minimal resource footprint and robust workflow introspection (Rosendo et al., 2023).
- Grid’5000, FIT IoT-Lab, Chameleon, Fed4FIRE+: Testbeds offering heterogeneous infrastructure for reproducible evaluation (Rosendo et al., 2022).
Methodological best practices demand publication of artifacts, infrastructure descriptors, and raw results to ensure that experiments can be replicated and extended, advancing continuum-wide optimization and empirical understanding.
The Edge–Cloud–HPC Continuum is now foundational for scientific discovery, industry-scale distributed applications, and large-scale AI/ML analytics. Recent frameworks and empirical research confirm its feasibility and impact, while continued innovation is required in orchestration, security, provenance, and holistic optimization to fully realize its promise (Tallent et al., 2024, Santillan et al., 2 Aug 2025, Sharma et al., 18 May 2025, Kunkel et al., 2023, Rosendo et al., 2023, Rosendo et al., 2021, Rosendo et al., 2021, Dehury et al., 9 Dec 2025, Rosendo et al., 2022).