
Infinity Fabric Interconnect (IFI)

Updated 24 January 2026
  • Infinity Fabric Interconnect (IFI) is AMD's packet-switched fabric that ensures cache-coherent, high-bandwidth data transfers in multi-CPU and multi-GPU environments.
  • IFI leverages xGMI links, DMA engines, and zero-copy protocols to optimize inter-die and inter-package communication in advanced system architectures.
  • IFI-based solutions in MI300A and MI250x enhance HPC performance by improving collective operations, reducing latency, and maximizing memory throughput.

Infinity Fabric Interconnect (IFI) is AMD's proprietary packet-switched fabric designed for high-bandwidth, cache-coherent communication within nodes containing multiple CPUs, GPUs, and high-bandwidth memory (HBM). IFI is integral to system architectures such as the AMD MI250x and MI300A, where it underpins inter-die and inter-package data movement, dramatically affecting performance and scalability in large heterogeneous HPC environments. Its design enables explicit and implicit memory transfers with support for zero-copy semantics and provides the hardware substrate for multi-tile GPU systems and APU nodes as deployed in leadership systems like LLNL’s El Capitan (Schieffer et al., 15 Aug 2025, Pearson, 2023, Schieffer et al., 2024).

1. Architecture and Topology

IFI’s physical and logical topologies are critical to its function. MI300A nodes, as exemplified by El Capitan, comprise four APUs, each a cache-coherent NUMA domain integrating 24 Zen 4 CPU cores, 228 CDNA3 GPU CUs, and 128 GB of HBM3. The APUs are wired in a full mesh via xGMI3 IFI links, each 16 bits wide at 32 GT/s, providing 64 GB/s per direction. Every APU pair connects through two such links (one per I/O die), yielding a bidirectional aggregate of 128 GB/s per APU pair and 512 GB/s of bisection bandwidth across the node (Schieffer et al., 15 Aug 2025).
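
The link arithmetic above can be reproduced with a short calculation (a sketch; the link width, transfer rate, and topology are the figures quoted in the text):

```python
# MI300A xGMI3 link arithmetic, using the figures quoted above.

link_width_bits = 16                              # xGMI3 link width
rate_gtps = 32                                    # 32 GT/s
per_dir_gbps = link_width_bits * rate_gtps / 8    # GB/s per direction per link
assert per_dir_gbps == 64

links_per_pair = 2                                # one link per I/O die
pair_bidir_gbps = links_per_pair * per_dir_gbps   # bidirectional aggregate per APU pair
assert pair_bidir_gbps == 128

# Full mesh of 4 APUs split 2+2: each APU on one side links to both APUs
# on the other side, so 2 * 2 = 4 pairs cross the cut.
apus = 4
pairs_crossing = (apus // 2) * (apus // 2)
bisection_gbps = pairs_crossing * pair_bidir_gbps
assert bisection_gbps == 512
print(per_dir_gbps, pair_bidir_gbps, bisection_gbps)
```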

On MI250x systems, IFI links exist both intra-package (between dual GCDs) and inter-package. Link aggregation manifests as “quad,” “dual,” or “single” bundles: intra-package quad links yield up to 200 GB/s per direction, dual links provide 100 GB/s, and single links offer 50 GB/s. CPU–GCD connections use 36 GB/s per direction (Pearson, 2023, Schieffer et al., 2024).

The IFI is a scalable, fully cache-coherent, packet-switched fabric. Transfers are arbitrated per hop, with routing and bandwidth determined by link width and path aggregation. The memory hierarchy is thus NUMA-distributed but globally accessible across APUs and GCDs; zero-copy enables direct access between any CPU, GPU, or memory region (Pearson, 2023, Schieffer et al., 2024).

2. Hardware Characteristics and Protocols

  • Each IFI/xGMI link: 64 GB/s bidirectional (MI300A), 50 GB/s per direction (MI250x).
  • Aggregate per-pair bandwidth: 128 GB/s (MI300A full mesh), 200 GB/s (MI250x quad link).
  • Local APU HBM3 bandwidth: ~5.6 TB/s (MI300A, 6 XCDs).
  • PCIe 4.0 NIC one-way: 50 GB/s.
  • DMA (SDMA) engines in MI300A can saturate IFI at 128 GB/s; MI250x SDMA caps at PCIe speeds (~50 GB/s).
  • Pointer-chase latencies (MI300A): CPU local 240 ns, CPU remote 500 ns, GPU local 346 ns, GPU remote 690 ns (Schieffer et al., 15 Aug 2025).

IFI endpoints advertise a full 64-byte cache-line granularity and route packets over the fabric mesh embedded in I/O dies or interposers. CPU coherence is maintained via a dedicated IFI slice per GCD, integrating directly with EPYC NUMA (Pearson, 2023).

3. Data Movement Models and Communication APIs

IFI supports both explicit DMA-based transfers and implicit kernel-driven data movement. The primary data-movement APIs and their performance implications are:

  • hipMemcpyPeer(Async): Standard peer-to-peer transfers, leveraging SDMA engines; peaks at 90 GB/s on MI300A (with hipMalloc) and 50–51 GB/s on MI250x (irrespective of link width, due to the DMA ceiling). For allocations not made with hipMalloc, the runtime falls back to a single-threaded memcpy (Schieffer et al., 15 Aug 2025, Pearson, 2023).
  • Implicit STREAM Copy: In-kernel loads/stores can achieve up to 81% (MI300A) and 77% (MI250x) of IFI’s theoretical bandwidth, e.g., 103–104 GB/s (MI300A), 153 GB/s (MI250x quad link) (Schieffer et al., 15 Aug 2025, Pearson, 2023).
  • Zero-copy: Any GCD or CPU can directly access peer memory over IFI. Performance depends on allocation (hipMalloc, hipHostMalloc, hipMallocManaged) and path; managed page migration (XNACK) is 10× slower than standard paths (Schieffer et al., 2024).
  • Collective Communication: MPI and RCCL libraries provide multi-APU/GPU collectives mapped onto underlying IFI rings/trees. MPI falls back to CPU staging for small messages, while RCCL uses direct GPU–GPU IFI links, providing 5×–38× speedup for collectives exceeding 4 KB in message size (Schieffer et al., 15 Aug 2025, Schieffer et al., 2024).
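
As a sanity check on the implicit-copy figures, the quoted efficiency fractions can be applied to the link peaks (a sketch; the peaks and percentages are the ones reported above):

```python
# Implicit (in-kernel STREAM) copy throughput vs. IFI link peak, per the text.
peaks = {"MI300A full mesh": 128.0, "MI250x quad link": 200.0}  # GB/s
efficiency = {"MI300A full mesh": 0.81, "MI250x quad link": 0.77}

for system, peak in peaks.items():
    achieved = peak * efficiency[system]
    print(f"{system}: ~{achieved:.0f} GB/s of {peak:.0f} GB/s peak")
# 0.81 * 128 ≈ 104 GB/s, consistent with the reported 103–104 GB/s (MI300A);
# 0.77 * 200 = 154 GB/s, close to the reported 153 GB/s (MI250x quad link).
```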

Performance models are typically latency-plus-bandwidth: T_comm(n) = α + n/β, where α is the start-up latency, β is the sustainable bandwidth, and n is the message size (Schieffer et al., 15 Aug 2025).
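
A minimal implementation of this model (the α and β values below are illustrative, not measured):

```python
def comm_time_us(n_bytes, alpha_us, beta_gbps):
    """Latency-plus-bandwidth model: T(n) = alpha + n / beta."""
    return alpha_us + n_bytes / (beta_gbps * 1e3)  # 1 GB/s = 1e3 bytes/us

def effective_bandwidth_gbps(n_bytes, alpha_us, beta_gbps):
    """Achieved bandwidth n / T(n); approaches beta for large n."""
    return n_bytes / comm_time_us(n_bytes, alpha_us, beta_gbps) / 1e3

# Illustrative parameters: 2 us start-up latency, 128 GB/s link.
# Small messages are latency-bound; large messages approach link bandwidth.
for n in (4 * 1024, 1 * 1024**2, 64 * 1024**2):
    print(n, round(effective_bandwidth_gbps(n, 2.0, 128.0), 1))
```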

4. Quantitative Benchmark Results

4.1 Direct Point-to-Point

| System | Link Type | Explicit DMA (GB/s) | Implicit Kernel (GB/s) | % Peak (Implicit) |
|---|---|---|---|---|
| MI300A | full mesh | 90 (hipMalloc) | 103–104 | 81 |
| MI250x (quad) | quad | 50–51 | 153 | 77 |
| MI250x (dual) | dual | 51 | 77 | 77 |
| MI250x (single) | single | 38 | 39 | 78 |

Explicit DMA saturates at SDMA ceiling except on MI300A, where SDMA reaches full link speed. Implicit kernel-driven (STREAM) copies approach 75–81% of link peak (Schieffer et al., 15 Aug 2025, Pearson, 2023, Schieffer et al., 2024).

4.2 Latency

Pointer-chase latencies on MI300A are 240 ns (CPU, local HBM), 500 ns (CPU, remote), 346 ns (GPU, local), and 690 ns (GPU, remote); crossing the IFI mesh roughly doubles access latency for both CPU and GPU (Schieffer et al., 15 Aug 2025).

4.3 Host–Device Bandwidth

| Interface | Allocation | Peak Bandwidth (GB/s) | % of Link |
|---|---|---|---|
| hipMemcpy | pinned | 28.3 | 78.6 |
| hipMemcpy | pageable | 12.4 | 34.4 |
| managed (XNACK=0) | zero-copy | 25.5 | 70.8 |
| managed (XNACK=1) | page migration | 2.8 | 7.8 |

Pinned or zero-copy allocations maximize host–GPU bandwidth; page-migration severely reduces throughput (Schieffer et al., 2024).
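
The "% of Link" column follows from the 36 GB/s per-direction CPU–GCD link quoted earlier (a sketch reproducing that arithmetic with the measured figures above):

```python
# CPU–GCD IFI link: 36 GB/s per direction on MI250x, per the text.
LINK_GBPS = 36.0
measured = {  # peak bandwidth per allocation strategy, GB/s (from the table above)
    "pinned": 28.3,
    "pageable": 12.4,
    "managed zero-copy": 25.5,
    "XNACK migration": 2.8,
}
for name, bw in measured.items():
    print(f"{name}: {100 * bw / LINK_GBPS:.1f}% of link")
```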

4.4 Collectives

RCCL collectives on MI300A and MI250x outperform MPI by 5×–38× for large messages and by 15–30% in latency across all collectives (except broadcast, where latencies are comparable). For 1 MiB messages, RCCL_AllReduce on eight GPUs achieves 68.2 μs vs. MPI_AllReduce at 90.4 μs (Schieffer et al., 2024).

5. Programming Best Practices and Optimization Techniques

  • For messages < 512 KB, memcpy or MPI with CPU staging is optimal.
  • For > 512 KB, use hipMalloc for communication buffers; hipMemcpyPeer or direct STREAM kernels exploit SDMA and IFI at near-peak rates.
  • RCCL is preferred for collectives, viable regardless of buffer allocation type.
  • SDMA can be disabled for MPI or hipMemcpyPeer (HSA_ENABLE_SDMA=0) to leverage full quad-link (in MI250x), but this does not improve MI300A where SDMA saturates IFI (Schieffer et al., 15 Aug 2025, Schieffer et al., 2024).
  • Task mapping should spread ranks across GPUs, allocating buffers so as to maximize link aggregation and bandwidth parallelism.
  • For host staging, hipHostMallocNonCoherent (pinned) avoids pageable overhead; prefer explicit kernel migration over hipMemPrefetchAsync for managed memory (Pearson, 2023, Schieffer et al., 2024).
  • Collective patterns should minimize CPU–GPU detours (disable MPI CPU-staging, prefer RCCL for intra-node).
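
The guidance above can be condensed into a simple selection heuristic (a sketch; the 512 KB threshold and path labels come from the text, while the function and its names are made up for illustration):

```python
# Hypothetical helper condensing the best-practice rules quoted above.
STAGING_THRESHOLD = 512 * 1024  # 512 KB crossover from the text

def pick_transfer_path(n_bytes, is_collective):
    """Suggest a transfer path per the best practices listed above."""
    if is_collective:
        return "RCCL"                      # preferred regardless of allocation
    if n_bytes < STAGING_THRESHOLD:
        return "memcpy / MPI CPU staging"  # latency-bound regime
    return "hipMemcpyPeer or STREAM kernel with hipMalloc buffers"

print(pick_transfer_path(64 * 1024, False))
print(pick_transfer_path(8 * 1024**2, False))
print(pick_transfer_path(1 * 1024**2, True))
```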

6. Real-World HPC Application Case Studies

Quicksilver (Monte Carlo particle code) and CloverLeaf (hydrodynamics code) illustrate practical acceleration via IFI-aware optimizations:

  • Quicksilver baseline used system malloc, MPI_Isend/Irecv, XNACK disabled; switching to XNACK=1, migrating buffers to hipMalloc, and conditionally disabling SDMA improved communication runtime by up to 2× and reduced total runtime by 5–11% (Schieffer et al., 15 Aug 2025).
  • CloverLeaf baseline relied on hipMallocManaged and MPI exchanges; switching to RCCL point-to-point with hipMalloc (or malloc+hipHostRegister) reduced communication time from 1.5 s to 0.7 s (2.2× faster), improving total runtime by 10–15% (Schieffer et al., 15 Aug 2025).

7. Comparative Analysis and Implications

IFI on MI300A offers substantially higher inter-APU bandwidth than MI250x, where the DMA engines are the limiting factor despite the available link width (e.g., "fat" quad-link paths). On MI250x, software scheduling and kernel-driven migration are preferable; collective or exchange patterns based solely on hipMemcpyAsync underutilize IFI. IFI shares topological and coherence features with NVIDIA's NVLink2 and NVSwitch, but MI300A's ability to saturate its links via SDMA currently distinguishes its performance ceiling.

Designers targeting AMD multi-APU or multi-GPU systems should utilize pinned/zero-copy memory, optimize task spreading, selectively disable SDMA, and favor RCCL for collectives. These strategies directly address hardware bottlenecks and can yield up to 2× communication speedup in real workloads (Schieffer et al., 15 Aug 2025, Pearson, 2023, Schieffer et al., 2024).

A plausible implication is that future IFI deployments will benefit from increasingly programmable DMA engines and enhanced software primitives, further narrowing the performance gap between hardware capacity and effective application throughput. Careful placement and API choices remain essential for extracting peak bandwidth from complex IFI topologies.
