In-Network Collective Operations (INC)
- INC is a distributed communication paradigm that shifts collective operations into network hardware to reduce latency and bandwidth usage.
- Edge-INC and Core-INC implement computations at the network’s edge and core, respectively, enabling hierarchical aggregation and efficient data routing.
- INC architectures leverage programmable NICs and switches to optimize collective performance in large-scale AI, HPC, and cloud environments.
In-Network Collective Operations (INC) refer to a class of distributed communication primitives in which collective computation—encompassing AllReduce, AllGather, Broadcast, ReduceScatter, AllToAll, and similar operations—is partially or fully offloaded from general-purpose CPUs/GPUs into the network fabric itself. The goal is to reduce overall system latency, lower network and DRAM bandwidth consumption, and improve accelerator utilization by distributing aggregation and replication tasks across network endpoints (such as SmartNICs) or into the switching ASIC dataplane. INC has emerged as a response to bandwidth and latency bottlenecks in large-scale AI, HPC, and cloud systems, and is implemented in two main forms: "Edge-INC" at endpoints (NIC or SmartNIC), and "Core-INC" within programmable or fixed-function switches. Recent work examines INC’s concrete benefits, exposes the architectural and algorithmic barriers to widespread adoption, and points the way to standardized, scalable solutions (Hoefler et al., 27 Jan 2026, Kim et al., 2020, Zhao et al., 2022).
1. Fundamental Concepts and Flavors of INC
INC extends traditional host-based collective operations by introducing compute primitives (e.g., sum, max, min, broadcast) into the network datapath. Two principal paradigms are distinguished:
- Edge-INC: Computation is offloaded to SmartNICs or programmable NICs residing at the network edge. Operations such as streaming reductions and tree-based multicasts are handled without intervention from the main system’s DRAM or CPU, leading to minimal host-side resource consumption and improved overlap between communication and accelerator computation. Representative mechanisms include Portals 4 triggered collectives and sPIN packet handlers (Hoefler et al., 27 Jan 2026).
- Core-INC: Compute capabilities are embedded within switch ASICs or network cores. Switches aggregate intermediate partials in a hierarchical fashion—e.g., in a fat-tree topology, leaf switches sum or reduce local inputs, spine switches further aggregate, and finally the root diverts the result downward for broadcast. Notable realizations include NVIDIA SHARP and next-generation Ethernet collective extensions (Hoefler et al., 27 Jan 2026).
INC is increasingly crucial where model/data/tensor parallelism drives synchronized exchange of large tensors or gradients, resulting in synchronization costs that can dominate iteration runtime in distributed AI, HPC, and cloud data processing.
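The two-level Core-INC dataflow described above can be sketched as a small simulation: leaf switches reduce local host contributions, a spine/root switch reduces the leaf partials, and the total is broadcast back down. This is an illustrative model only, not any vendor's implementation; the function names are assumptions.

```python
# Minimal sketch of Core-INC hierarchical aggregation in a two-level
# fat-tree. Leaf switches sum the gradients from directly attached hosts;
# the spine/root sums the leaf partials and broadcasts the result downward.

def leaf_reduce(host_values):
    """A leaf switch sums contributions from its directly attached hosts."""
    return sum(host_values)

def core_allreduce(pods):
    """pods: one list of host contributions per leaf switch.
    Returns the value every host receives after the downward broadcast."""
    partials = [leaf_reduce(pod) for pod in pods]  # upward pass, stage 1
    total = sum(partials)                          # upward pass, spine/root
    # Downward pass: the root replicates `total` to every leaf, which
    # replicates it to every host -- modeled here simply as the return value.
    return total

# Example: 2 pods x 3 hosts, one scalar gradient per host.
result = core_allreduce([[1, 2, 3], [4, 5, 6]])  # every host ends with 21
```

Note how each scalar crosses the core exactly once in each direction, which is the source of the bandwidth savings discussed below.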
2. Architectures, Programming Models, and Implementation
The architectural design space is defined by the locus of compute, control, and adaptability:
- Dataplane Realization: Programmable switches and NICs (e.g., Barefoot Tofino, Mellanox BlueField) expose P4 or NPL pipelines for parsing collective packet headers, tracking context (group_id, op_type), maintaining per-group accumulators in register memory, and performing packet replication or thresholding (Kim et al., 2020, Zhao et al., 2022). Switch kernels may implement tree-based aggregation for AllReduce and branching/multicast for Broadcast, as demonstrated in NSinC (Kim et al., 2020) and NetRPC (Zhao et al., 2022). Host-based collective offload is achieved via lightweight MPI plug-ins or RPC stubs.
- Control Plane and Telemetry: Advanced architectures such as NSinC introduce a multi-tier hierarchy, including local controllers per-switch to monitor metric sketches (histograms, entropy, key-frequency), and a centralized global controller orchestrating group membership, tree reconfiguration, congestion balancing, and progress statistics (Kim et al., 2020).
- Programming Abstractions: Approaches like NetRPC (Zhao et al., 2022) extend familiar RPC or Protobuf APIs with INC-enabled data types (IEDTs) and filter configurations, abstracting away dataplane specifics and exposing reconfigurable, multi-tenant INC as a shared cluster service. Five robust reliable INC primitives (RIPs) encapsulate the supported computational operations, dynamically mapped onto hardware at runtime.
Implementation challenges include register memory management, pipeline stage limitations, context allocation for concurrent collectives, and pragmatic software fallbacks for operations exceeding hardware capabilities.
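The per-group register state and threshold behavior described above can be modeled in a few lines. This is a sketch of the general pattern (hash a group context to a scarce register slot, accumulate, emit when the fan-in threshold is reached), not a real P4 program; slot count and field names are assumptions.

```python
# Illustrative model of the per-group state a switch dataplane keeps:
# a small register array indexed by a hash of group_id, holding a
# (count, accumulator) pair per in-flight collective.

NUM_SLOTS = 4  # register memory is scarce; real pipelines have fixed arrays

class SwitchAggregator:
    def __init__(self):
        self.slots = {}  # slot index -> (count, accumulator)

    def on_packet(self, group_id, value, fanin):
        """Accumulate one contribution for a group; return the reduced value
        once all `fanin` children have contributed (the threshold action),
        otherwise None while partial state stays resident."""
        slot = hash(group_id) % NUM_SLOTS
        count, acc = self.slots.get(slot, (0, 0))
        count, acc = count + 1, acc + value
        if count == fanin:               # last contributor: emit and clear
            del self.slots[slot]
            return acc
        self.slots[slot] = (count, acc)  # partial state stays in the register
        return None
```

A real design must also arbitrate hash collisions between concurrent groups, which is exactly the context-allocation problem noted above.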
3. Analytical Models and Performance Metrics
Analytical models in INC research quantify both micro-level communication costs and macro-level impact on application speedup:
- Latency and Bandwidth Models: For a message of size $m$, the classic latency-bandwidth relation applies: $T(m) = \alpha + \beta m$, where $\alpha$ and $\beta$ represent per-message fixed overhead and reciprocal bandwidth, respectively. For a host-based $P$-node ring AllReduce, $T_{\text{ring}} \approx 2(P-1)\left(\alpha + \beta \tfrac{m}{P}\right) \approx 2\beta m$ for large $P$. Core-INC reduces the passes through the network core to a single reduced ascent and broadcast descent, $T_{\text{INC}} \approx \alpha + \beta m$, yielding approximately a halving of communication time per collective (Hoefler et al., 27 Jan 2026).
- End-to-End Speedup (Amdahl-Style): Let $f$ be the fraction of iteration time spent in AllReduce. The application-level speedup is $S = 1/\left((1-f) + f \, t_{\text{new}}/t_{\text{old}}\right)$, where $t_{\text{old}}$ and $t_{\text{new}}$ are collective times pre- and post-INC (Hoefler et al., 27 Jan 2026). For an 8 GiB AllReduce, reduction from 352 ms (host ring) to 151 ms (Core-INC) yields $S \approx 1.11\times$ if $f \approx 0.17$, highlighting limited end-to-end impact unless communication dominates.
- Volume and Utilization: NSinC demonstrates order-of-$N$ core traffic savings: per-rank link volume drops from $2(N-1)m$ (host) to ${\sim}2m$ (in-network), with concurrent reductions in host DRAM and CPU usage (Kim et al., 2020).
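The cost models above can be checked numerically. The sketch below encodes the alpha-beta ring and Core-INC times and the Amdahl-style bound; the alpha/beta values in the example are illustrative, not measurements from the cited papers.

```python
# Numerical sketch of the alpha-beta cost model and the Amdahl-style
# speedup bound from Section 3.

def t_ring(m, P, alpha, beta):
    """Host-based ring AllReduce: 2(P-1) steps of m/P bytes each."""
    return 2 * (P - 1) * (alpha + beta * m / P)

def t_inc(m, alpha, beta):
    """Core-INC AllReduce: each endpoint injects its m bytes once and
    receives the reduced m bytes once -- one bandwidth term, not two."""
    return alpha + beta * m

def amdahl_speedup(f, t_old, t_new):
    """End-to-end speedup when a fraction f of iteration time is the
    collective and only the collective gets faster."""
    return 1.0 / (1.0 - f + f * t_new / t_old)

# 8 GiB AllReduce example from the text: 352 ms -> 151 ms, f ~ 0.17.
s = amdahl_speedup(0.17, 352.0, 151.0)  # ~1.11x
```

Even a large per-collective win is diluted by (1 - f), which is the "limited end-to-end impact" caveat above.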
Benchmark studies confirm:
- AllReduce (1 MB): 2.4 ms (host MPI) → 1.1 ms (in-network, ×2.2)
- MPI_Barrier (128 ranks): 0.85 ms → 0.32 ms (×2.7)
- Host CPU load during collectives drops by 70% (Kim et al., 2020)
- NetRPC: 42% throughput gain in distributed ML over BytePS; up to 28% over SwitchML (Zhao et al., 2022)
4. Optimization Techniques and Routing Adaptation
- Closed-Loop Adaptation: NSinC employs real-time telemetry and sketch analytics to dynamically adjust reduction tree shape (branching factor), ingress mapping, and speculative aggregation thresholds to respond to shifting workload skew and network congestion (Kim et al., 2020).
- Flow and Congestion Control: NetRPC introduces ECN-based dynamic flow control, AIMD window updates, and per-flow idempotence with bitmap schemes for retransmissions. Host agents partition large arguments to saturate available bandwidth per flow (targeting 20–30 Gbps per flow on 100 Gbps links) (Zhao et al., 2022).
- Memory Management: Keys are hashed to logical addresses; switch memory is used as a cache subject to LRU or periodic counting, optimizing hit rates while avoiding chip memory exhaustion. Various clear policies (copy, shadow, lazy) allow trade-offs between memory overhead, latency, and throughput (Zhao et al., 2022).
- Multi-Tenancy: NetRPC supports scheduling and partitioning among concurrent services, achieving near-line-rate aggregate throughput with <20% per-service latency inflation under multi-tenant sharing (Zhao et al., 2022).
- Resilience and Fault Tolerance: NSinC plans redundant subtrees and live failure detection, with automatic topology reroute upon switch loss (Kim et al., 2020).
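The key-to-switch-memory caching idea from the memory-management bullet can be sketched with an LRU policy. The capacity and the host-spill hook are illustrative assumptions, not the NetRPC design verbatim.

```python
# Sketch of switch memory used as a key cache: keys hash to logical
# addresses, the limited on-chip table evicts least-recently-used entries,
# and evicted state falls back to the host path.

from collections import OrderedDict

class SwitchKeyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # logical address -> aggregated value
        self.spilled = []              # evicted (addr, value) -> host path

    def update(self, key, value):
        addr = hash(key)               # key hashed to a logical address
        if addr in self.entries:
            self.entries[addr] += value          # aggregate in place
            self.entries.move_to_end(addr)       # mark recently used
        else:
            if len(self.entries) >= self.capacity:
                # Evict the least recently used entry to the host fallback.
                self.spilled.append(self.entries.popitem(last=False))
            self.entries[addr] = value
```

Hot keys stay resident and are aggregated at line rate, while cold keys spill, trading hit rate against on-chip memory exhaustion as described above.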
5. Limitations and Challenges
Multiple obstacles hinder broad deployment:
- Data Types and Precision: AI workloads favor low-precision (4/8-bit) or block-floating formats, but in-network accumulation typically requires higher precision. Upcasting at switches nullifies link savings. Handling new and diverse types (BF16, E5M2) is limited by slow ASIC cycle times (Hoefler et al., 27 Jan 2026).
- Sparse and Structured Data: Sparse tensor reductions cause index "fill-in," potentially inflating data volume as intermediate representations grow exponentially—a major concern in Core-INC (Hoefler et al., 27 Jan 2026).
- Result Reproducibility: Tree-based reduction schedules are non-deterministic for floating-point sums; bitwise reproduction requires Kahan or pairwise summation, doubling both compute and memory footprint (Hoefler et al., 27 Jan 2026).
- Context Management and Scalability: Switch state is constrained; mapping thousands of concurrent reduction trees entails complex group tracking, resource arbitration, and pipeline memory pressure (Hoefler et al., 27 Jan 2026, Zhao et al., 2022).
- Security: Core-INC must access packet contents to perform reduction—impeding end-to-end encryption. Homomorphic summations are not yet viable at scale. Edge-INC localizes trust but Core-INC widens attack surfaces (Hoefler et al., 27 Jan 2026).
- Topology Dependency: Most evaluations target isolated, single-switch or small multi-switch topologies. Scaling to over-subscribed, multi-tenant, hyperscale datacenters necessitates mature context programming and multi-level orchestration (Hoefler et al., 27 Jan 2026, Zhao et al., 2022).
- Hardware and Primitive Limitations: NetRPC supports only five robust primitives; workloads requiring control flow, dynamic branching, or non-trivial state fall back to the host stack (Zhao et al., 2022).
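One standard remedy for the reproducibility problem above is to impose a fixed reduction schedule so the floating-point sum no longer depends on packet arrival order. The sketch below uses rank-ordered pairwise summation; the pairing strategy is an illustrative choice, not the scheme from the cited papers.

```python
# Sketch of a deterministic reduction schedule: contributions are ordered
# by rank and summed in a fixed balanced-tree order, so the result is
# bitwise identical regardless of arrival order.

def pairwise_reduce(values):
    """Reduce a non-empty list in a fixed balanced-tree order."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_reduce(values[:mid]) + pairwise_reduce(values[mid:])

def reproducible_allreduce(contribs):
    """contribs: dict rank -> float, populated in any arrival order.
    Sorting by rank fixes the schedule before reducing."""
    ordered = [value for _, value in sorted(contribs.items())]
    return pairwise_reduce(ordered)
```

Buffering contributions until the schedule can be honored is what costs the extra memory footprint noted in the reproducibility bullet.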
6. Practical Impact and Empirical Results
Empirical studies validate INC’s efficacy:
| Collective | Host MPI Time | In-Network Time | Speedup |
|---|---|---|---|
| MPI_Allreduce | 2.4 ms | 1.1 ms | ×2.2 |
| MPI_Barrier | 0.85 ms | 0.32 ms | ×2.7 |
| HPL (512×512) | 18.2 s | 16.9 s | ×1.08 |
| VPIC Reduction | 225 µs | 130 µs | ×1.7 |
Additional findings:
- Link utilization drops by 60% (≈30 Gbps saved) (Kim et al., 2020)
- Edge-INC and Core-INC nearly halve host memory traffic
- In distributed ML, NetRPC delivers up to 42% improvement over BytePS, surpasses SwitchML by 28% (Zhao et al., 2022)
- Multi-switch deployments (NetRPC): 1.63× goodput at 2.5 million keys
- Multi-tenant NetRPC: aggregate goodput ≈61 Gbps, per-service latencies increase <20%
- NSinC and NetRPC both demonstrate correctness and throughput robustness under packet loss scenarios (loss resilience improves by up to 21% vs SwitchML) (Zhao et al., 2022)
A notable bottleneck is that even halving collective communication time yields only modest end-to-end speedups unless the collective phase dominates the application’s iteration time (see Amdahl-variant in section 3).
7. Future Directions and Standardization
The field is anticipated to evolve along several axes:
- Standardization: The Ultra Ethernet Consortium (UEC) is developing a minimal, interoperable INC specification addressing packet headers, switch primitives, NIC state machines, and basic security handshakes, with initial focus on INT/F32/F16 AllReduce/Broadcast (Hoefler et al., 27 Jan 2026).
- Hybrid INC Co-Design: Successful deployments will combine Edge-INC (for memory traffic and orchestration efficiency) with Core-INC (for intermediate aggregation), with calibrated host-side upcast/downcast for final accumulation (Hoefler et al., 27 Jan 2026).
- Broader Primitive Sets: Ongoing work seeks to extend supported in-network operations to include barriers, segmented-reduce, selection/branching, and deadline/coflow-aware scheduling (Zhao et al., 2022).
- Topology and Scale-Out: There is a shift from single-switch and campus-scale clusters towards achievable deployments in oversubscribed, multi-tenant hyperscale datacenters—albeit with careful orchestration to overcome resource contention, group mapping complexity, and security (Hoefler et al., 27 Jan 2026, Zhao et al., 2022).
- Scheduler Integration: The integration of INC with global cluster/job schedulers (e.g., Chronus, Harmonia) is proposed to achieve balanced compute/communication scheduling (Zhao et al., 2022).
This body of work suggests that in-network collectives deliver significant reductions in synchronization costs, but their ultimate efficacy depends on advances in hardware, programming models, standardization, and pragmatic accommodation of data-type and multi-tenancy complexities (Hoefler et al., 27 Jan 2026, Kim et al., 2020, Zhao et al., 2022).