Archipelago Infrastructure Systems
- Archipelago Infrastructure is a distributed systems paradigm that organizes compute and storage as loosely-coupled, semi-autonomous islands with local scheduling and resource provisioning.
- It supports both large-scale scientific applications like LOFAR-IT for radio astronomy and modern serverless architectures through decentralized, latency-aware load balancing.
- The approach enhances scalability and performance by leveraging uniform container environments, localized scheduling, and high-bandwidth WAN interconnects, yielding substantial reductions in processing time and tail latency.
An archipelago infrastructure is a distributed systems paradigm in which compute and storage resources are logically organized as loosely-coupled, semi-autonomous “islands,” each with local autonomy over scheduling and resource provisioning, interconnected by wide-area networks and harmonized through shared software environments or orchestrators. The approach underpins both large-scale scientific cyberinfrastructure for data-intensive radio astronomy and modern low-latency serverless computing architectures. The archipelago concept is exemplified by the LOFAR-IT infrastructure for the Low-Frequency Array (LOFAR) Italian community (Taffoni et al., 2022) and by the Archipelago serverless platform for scalable cloud function scheduling (Singhvi et al., 2019).
1. Logical and Physical Topologies
Archipelago infrastructures consist of multiple geographically separated sites or “islands” connected via high-bandwidth wide-area networks. In the LOFAR-IT deployment, four “islands” are located in Trieste (INAF–OATs), Catania (INAF–OACt), Bologna (INAF–IRA), and Turin (OCCAM), each equipped with heterogeneous high-performance cluster hardware, local storage fabrics, and independent batch scheduling (Taffoni et al., 2022). The logical topology is generally a star or partial-mesh overlay on the academic network backbone (e.g., GARR in Italy) with guaranteed inter-site bandwidth of 10 Gb/s and one-way latencies in the 10–30 ms range.
At the platform level, the Archipelago serverless model divides a centralized cloud cluster into multiple smaller, disjoint worker pools, each managed by a “semi-global scheduler” (SGS) that has exclusive control over its worker subset. Requests are distributed across SGSs using consistent hashing, dynamic queuing analysis, and per-application latency targets (Singhvi et al., 2019).
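A minimal sketch of how a load balancer might map requests onto SGSs with consistent hashing; the class, virtual-node scheme, and key format below are illustrative assumptions, and the real Archipelago balancer additionally folds in queuing analysis and per-application latency targets.

```python
import bisect
import hashlib

class SGSRing:
    """Toy consistent-hash ring mapping request keys to semi-global
    schedulers (SGSs); illustrative only, not the Archipelago implementation."""

    def __init__(self, sgs_ids, vnodes=64):
        self._ring = []                        # sorted (hash, sgs_id) pairs
        for sgs in sgs_ids:
            for v in range(vnodes):            # virtual nodes smooth the split
                self._ring.append((self._hash(f"{sgs}#{v}"), sgs))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, request_key):
        """Return the SGS whose ring position follows the request's hash."""
        idx = bisect.bisect(self._keys, self._hash(request_key)) % len(self._ring)
        return self._ring[idx][1]

ring = SGSRing(["sgs-0", "sgs-1", "sgs-2", "sgs-3"])
print(ring.route("tenantA/dag-42"))            # e.g. 'sgs-2'
```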
2. Resource Composition and Scheduling
Each site in an archipelago is provisioned with compute and storage resources tailored to the expected workload, but with standardized container environments to ensure workload portability. The LOFAR-IT system aggregates ≈2700 cores (projected to grow beyond 4000), distributed across nodes with RAM configurations ranging from 5.2 GB/core up to 768 GB/node, and provides over 1.8 PB of scratch and archive storage across sites (Taffoni et al., 2022). Each site maintains a local job scheduler (SLURM or PBS-Pro for HPC, Docker/Singularity for container orchestration), with no global batch system. Schedulers enforce local policies and quotas, while global governance (e.g., resource sharing, data locality, and project anchoring) is maintained at the consortium/board level.
In Archipelago’s serverless context, each SGS implements a latency-aware scheduling algorithm, Shortest-Remaining-Slack-First (SRSF), to service concurrent Directed Acyclic Graph (DAG) function workloads within user-specified deadlines. Function requests are queued and prioritized by the slack metric slack = D − C, where D is the time remaining until the DAG deadline and C is the estimated critical-path time to completion for the function’s DAG (Singhvi et al., 2019).
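A minimal sketch of SRSF ordering under the slack definition above; the queue structure and tie-breaking rule are illustrative assumptions rather than the paper’s scheduler.

```python
import heapq
import time

def slack(deadline, critical_path_remaining, now=None):
    """Remaining slack: time left until the DAG deadline minus the estimated
    critical-path time still required to complete the DAG."""
    now = time.time() if now is None else now
    return (deadline - now) - critical_path_remaining

class SRSFQueue:
    """Shortest-Remaining-Slack-First dispatch order (illustrative sketch)."""

    def __init__(self):
        self._heap = []
        self._seq = 0                           # FIFO tie-breaker for equal slack

    def push(self, func_id, deadline, critical_path_remaining):
        s = slack(deadline, critical_path_remaining)
        heapq.heappush(self._heap, (s, self._seq, func_id))
        self._seq += 1

    def pop(self):
        """Dispatch the function whose DAG currently has the least slack."""
        _, _, func_id = heapq.heappop(self._heap)
        return func_id
```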
3. Software Distribution and Container Strategy
Uniformity of runtime environments across islands is critical for reproducibility and supportability. LOFAR-IT employs an OCI-compliant Docker registry (hosted on GitLab), with all LOFAR data processing pipelines (e.g. prefactor, facet-cal, killMS, DDFacet) encapsulated as versioned container images. On HPC resources, Singularity is used for runtime execution; Docker is used natively where permitted (Taffoni et al., 2022). Job submission involves local scheduler directives to pull and execute the appropriate image; no global orchestrator spans sites.
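As an illustration of site-local, container-based submission, the following sketch renders a SLURM batch script that runs one pipeline stage under Singularity and submits it with sbatch; the registry path, resource numbers, and stage command are placeholders, not LOFAR-IT values.

```python
import subprocess
import textwrap

def submit_pipeline_stage(image, command, cores=32, hours=24):
    """Render and submit a SLURM job that executes one containerized stage.
    All resource values here are illustrative placeholders."""
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name=lofar-stage
        #SBATCH --ntasks=1
        #SBATCH --cpus-per-task={cores}
        #SBATCH --time={hours}:00:00
        # Pull the versioned OCI image from the registry and run the stage in it.
        singularity exec docker://{image} {command}
        """)
    result = subprocess.run(["sbatch"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()                # e.g. "Submitted batch job 12345"

# Hypothetical usage: a prefactor stage pulled from a GitLab container registry.
# submit_pipeline_stage("registry.example.org/lofar/prefactor:v3.2",
#                       "prefactor --config /scratch/obs123/prefactor.cfg")
```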
For serverless platforms like Archipelago, function sandboxes (containers) are proactively allocated. Each SGS estimates invocation rates using an exponentially weighted moving average (EWMA), computes the Poisson quantile for the SLA percentile, and distributes the required count of sandboxes evenly across its worker pool (Singhvi et al., 2019). “Soft” and “hard” eviction strategies are used to control resource consumption without introducing cold-start penalties for the majority of invocations.
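A compact sketch of this provisioning loop, assuming scipy is available; the smoothing factor, SLA percentile, and helper names are illustrative choices, not the paper’s parameters.

```python
from scipy.stats import poisson

def update_rate(ewma_rate, observed_rate, alpha=0.3):
    """EWMA update of the per-interval invocation rate (alpha is illustrative)."""
    return alpha * observed_rate + (1 - alpha) * ewma_rate

def sandboxes_needed(ewma_rate, sla_percentile=0.99):
    """Sandboxes required to cover the SLA percentile of Poisson arrivals
    with the estimated mean rate."""
    return int(poisson.ppf(sla_percentile, ewma_rate))

def spread_over_workers(total, workers):
    """Distribute the sandbox count as evenly as possible across the pool."""
    base, extra = divmod(total, len(workers))
    return {w: base + (1 if i < extra else 0) for i, w in enumerate(workers)}

rate = update_rate(ewma_rate=40.0, observed_rate=55.0)     # invocations/interval
count = sandboxes_needed(rate)
print(spread_over_workers(count, [f"worker-{i}" for i in range(8)]))
```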
4. Pipeline Execution and Data Management
LOFAR-IT organizes data processing as stage-wise pipelines (e.g., prefactor → facet-cal → direction-dependent imaging), with intermediate and final products managed on separate storage layers. Fast scratch filesystems (BeeGFS, Lustre, SSD) are used for ephemeral intermediates (5–15 TB per 8h dataset). Long-term archives (Ceph, NFS, BeeGFS) store final calibrated outputs. To minimize wide-area transfers, users are anchored to a preferred site at proposal time, and pipeline stages are scheduled to maintain data locality (Taffoni et al., 2022). Workflow automation leverages modular shell/Python scripts submitted as containers.
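The locality-preserving, stage-wise flow could look roughly like the sketch below; the stage runner, paths, and directory layout are assumptions standing in for the production scripts and schedulers.

```python
import shutil
from pathlib import Path

# Stage names follow the text; everything else is an illustrative placeholder.
STAGES = ["prefactor", "facet-cal", "ddf-imaging"]

def run_stage(stage, workdir):
    """Stand-in for submitting one containerized stage to the local scheduler
    and waiting for it (see the SLURM/Singularity sketch in Section 3)."""
    (workdir / stage).mkdir(parents=True, exist_ok=True)

def run_pipeline(obs_id, scratch=Path("/scratch"), archive=Path("/archive")):
    """Run the pipeline on fast local scratch at the user's anchor site, then
    move only the final products to the long-term archive layer."""
    workdir = scratch / obs_id
    workdir.mkdir(parents=True, exist_ok=True)
    for stage in STAGES:
        run_stage(stage, workdir)                # intermediates stay on scratch
    archive.mkdir(parents=True, exist_ok=True)
    shutil.move(str(workdir / STAGES[-1]), str(archive / obs_id))
    shutil.rmtree(workdir, ignore_errors=True)   # reclaim 5-15 TB of scratch
```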
In Archipelago, bulk data transfers between SGSs and worker pools are avoided; execution is placed as close to the point of function invocation as possible. DAG completion is tracked end-to-end, and queuing plus execution latency is monitored and fed back to the load balancer for scaling decisions (Singhvi et al., 2019).
5. Operations, Fault Tolerance, and Security
Real-time system health and I/O metrics are collected using Prometheus and visualized with Grafana (node state, link utilization, disk I/O)—enabling prompt operational interventions (Taffoni et al., 2022). Local schedulers automatically restart failed jobs, but lack fine-grained checkpointing; failures at the node level trigger full-stage re-queues. Failover is implemented via user migration to alternative sites by retargeting image tags, incurring WAN transfer penalties as needed.
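A minimal node-side exporter in the spirit of this monitoring setup, assuming the prometheus_client and psutil packages; metric names and the scrape port are illustrative, not the LOFAR-IT configuration.

```python
import time
import psutil                                    # assumed available on each node
from prometheus_client import Gauge, start_http_server

cpu_util   = Gauge("node_cpu_utilization_percent", "CPU utilization (%)")
disk_read  = Gauge("node_disk_read_bytes", "Cumulative bytes read from disk")
disk_write = Gauge("node_disk_written_bytes", "Cumulative bytes written to disk")

if __name__ == "__main__":
    start_http_server(9100)                      # Prometheus scrapes this port
    while True:
        io = psutil.disk_io_counters()
        cpu_util.set(psutil.cpu_percent(interval=None))
        disk_read.set(io.read_bytes)
        disk_write.set(io.write_bytes)
        time.sleep(15)                           # scrape-aligned update cadence
```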
Security is enforced through signed and vulnerability-scanned container images, X.509 certificate authentication for private images, and restrictive network ACLs limiting SSH/Jupyter access to institutional VPNs. In Archipelago’s serverless model, tenant isolation is maintained at container and scheduler boundaries; resource allocation and evictions are scheduler-governed (Singhvi et al., 2019).
6. Performance Metrics and Benchmarking
LOFAR-IT achieves substantial parallel speed-up through cross-site pipeline stage distribution: a single 8-hour observation processed on one 32-core node requires ≈150h wall-clock, but distributing calibration stages over four islands reduces this to ≈30h. Burst scratch I/O rates reach 800 MB/s per node for BeeGFS and 600 MB/s for Lustre. WAN synchronization via rsync over the 10 Gb/s backbone bursts to 1.1 GB/s and sustains ≈600 MB/s (Taffoni et al., 2022).
On serverless benchmarks, the Archipelago platform achieves 99.9th-percentile latency improvements of 20.8× and 35.97× compared to FIFO baselines under Poisson and sinusoidal arrivals, respectively. Over 99% of per-DAG deadlines (SLOs) are met, with the system missing only 0.76–0.98% of deadlines, versus substantially higher miss rates for centralized baselines. Cold starts are reduced by 24.4×, and tail queuing delays are 47.5× lower (Singhvi et al., 2019).
7. Generalization, Scalability, and Lessons
Containerization is identified as essential for portability and maintenance across heterogeneous islands (Taffoni et al., 2022). Lightweight per-site batch scheduling, rather than a centralized global scheduler, limits operational complexity. Static user anchoring minimizes “ping-pong” of large datasets between sites. Interactive science platforms (e.g., JupyterHub backed by SAML/OAuth and LDAP, with containerized sessions allocated via SLURM) significantly lower entry barriers for users and facilitate pre/post-processing.
Scalability is achieved in both LOFAR-IT and Archipelago by enabling the addition of new islands or worker pools that meet standardized requirements: a baseline compute/storage footprint (e.g., ≥100 cores, 100 TB scratch), a container runtime, a 10 Gb/s WAN link, and a local batch scheduler such as SLURM, PBS-Pro, or HTCondor; a simple admission check is sketched below. In serverless settings, scalable performance hinges on balancing SGS pool sizes (8–16 nodes per pool) and maintaining fine-grained, latency-aware load balancing without excessive over-scaling (Singhvi et al., 2019). Robust governance and adaptive load distribution are critical for multi-tenant isolation and predictable performance.
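The island-admission baseline above can be encoded as a simple check; field names and the exact rule set are illustrative, not a consortium-defined schema.

```python
# Baseline thresholds quoted in the text; dictionary keys are our own naming.
MIN_REQUIREMENTS = {"cores": 100, "scratch_tb": 100, "wan_gbps": 10}
ACCEPTED_RUNTIMES   = {"singularity", "docker"}
ACCEPTED_SCHEDULERS = {"slurm", "pbs-pro", "htcondor"}

def island_is_admissible(site):
    """Return True if a candidate site meets the baseline footprint and
    software requirements for joining the archipelago."""
    return (all(site[k] >= v for k, v in MIN_REQUIREMENTS.items())
            and site["container_runtime"] in ACCEPTED_RUNTIMES
            and site["batch_scheduler"] in ACCEPTED_SCHEDULERS)

print(island_is_admissible({"cores": 256, "scratch_tb": 150, "wan_gbps": 10,
                            "container_runtime": "singularity",
                            "batch_scheduler": "slurm"}))    # True
```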
A plausible implication is that the archipelago paradigm can be adapted beyond its origins in radio astronomy and serverless computing to any domain demanding federated data analysis or globally distributed, deadline-sensitive computation, provided that uniform software delivery, WAN interconnect, and lightweight cross-site coordination are feasible.