Papers
Topics
Authors
Recent
Search
2000 character limit reached

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Published 25 Feb 2026 in cs.DC | (2602.21548v1)

Abstract: The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids network congestion and avoids interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87$\times$ on our in-house inference system. It can also improve online serving throughput by an average factor of 1.96$\times$ without violating SLO.

Summary

  • The paper presents a dual-path KV-Cache mechanism that reallocates storage I/O between prefill and decode engines to alleviate bottlenecks.
  • It employs a global, workload-aware scheduler and innovative RDMA data transfer to optimize resource utilization and improve throughput by up to 1.87×.
  • Experimental results validate enhanced latency stability, significant throughput gains, and scalable performance under multi-turn, agentic LLM workloads.

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Motivation and Problem Statement

The evolution of LLM workloads from single-turn, stateless prompt processing to agentic, multi-turn sessions has imposed unprecedented I/O pressure. Agentic inference—where contexts continuously accumulate across hundreds of turns—results in extreme KV-Cache reuse rates (>>95%), shifting the primary system bottleneck from computation to storage I/O. Disaggregated inference architectures, which separate prefill and decode engines and employ layerwise prefill, are now standard. However, the traditional assignment of all KV-Cache storage reads to prefill engines leads to persistent saturation of their storage NICs, leaving decode engines' substantial I/O capacity idle. This architecture-induced imbalance precludes full utilization of available bandwidth, resulting in underutilized GPUs and suppressed throughput in both offline and online deployment.

Hardware trends exacerbate the bottleneck; as shown by NVIDIA's recent GPU architectures, FLOPS have grown much faster than bandwidth or HBM, causing the I/O-compute ratio to decline by over an order of magnitude (Figure 1). Figure 1

Figure 1: Left: Hardware trends of NVIDIA GPUs. Right: Relative token throughput with varying request batch size highlights the growing gap between compute and I/O capacity in modern systems.

DualPath System Design

DualPath introduces a dual-path KV-Cache loading mechanism in PD-disaggregated systems, fundamentally rethinking the data flows during agentic inference. The key contribution is allowing requests to load the KV-Cache for context tokens either from storage into prefill engines (classic path) or via decode engines, which then transmit the cache to prefill engines using RDMA over the high-bandwidth compute network. This design aggregates the storage NIC bandwidth across all engines, breaking the unilateral bottleneck on prefill engine storage interfaces.

The system employs a workload-aware, global scheduler responsible for distributing requests, dynamically balancing computation and I/O, and selecting the optimal path per request. Each engine integrates a CNIC-centric traffic manager to enforce hardware QoS and prevent interference between model communication and cache transport traffic.

Dual-Path Data Movement

DualPath leverages both prefill (PE) and decode (DE) paths for KV-Cache loading (Figure 2, Figure 3): Figure 2

Figure 2

Figure 2: Detailed flow of the PE read path, optimizing fine-grained transfers from storage to engine buffers to HBM, and minimizing I/O interference through scheduling and overlap mechanisms.

Figure 3

Figure 3: Illustration of inter-engine PE scheduling. The scheduler selects among all eligible engines, minimizing queuing and maximizing bandwidth utilization by leveraging both PE and DE SNICs.

  • PE Read Path: KV-Cache is loaded from storage into prefill engines using DRAM buffering, then incrementally transferred to HBM for computation.
  • DE Read Path: KV-Cache is loaded into a decode engine’s buffer, then transferred over the compute network (RDMA) to the designated prefill engine.

Critical to the design is the asynchronous, layerwise transfer and careful DRAM and HBM management, exploiting computation-communication overlap where feasible. These flows are designed to operate without congestion or resource contention under practical prefill/decode ratios, as demonstrated by formal bandwidth analysis.

Traffic Isolation and Efficient Data Copy

To prevent disruptive interference with latency-critical model collectives (e.g., AllToAll for EP, ReduceScatter/AllGather for TP/CP), DualPath consolidates all data transfer through the GPU’s compute NIC using GPUDirect RDMA, leveraging InfiniBand virtual lanes for strict isolation. This ensures >99% of bandwidth is always available for model traffic, relegating KV-Cache transfers to lower-priority lanes that exploit idle bandwidth opportunistically.

Adaptive Scheduling Strategies

DualPath's scheduler operates at multiple granularity levels:

  • Inter-engine: Determines optimal engine assignment and path selection based on real-time metrics (e.g., NIC queue depth, token loads, GPU utilization).
  • Intra-engine: Enforces compute quotas for batching to minimize GPU idle bubbles and optimize parallel utilization (Figure 4). Figure 4

    Figure 4: Intra-engine scheduling. Left, compute-quota-based batch selection prevents stragglers; right, timeline comparison showing reduced idle periods after quota application.

This synergy ensures load-balancing across both storage and network interfaces and alleviates resource hotspots regardless of input workload skew.

Experimental Results

Comprehensive evaluation on three state-of-the-art LLMs (DeepSeek V3.2 660B, DS 27B, Qwen 2.5-32B) and real-world agentic workloads demonstrates the dominance of the storage I/O bottleneck and the efficacy of DualPath.

  • Offline Inference: DualPath improves throughput by up to 1.87×1.87\times (DS 660B) vs. a production baseline, and consistently achieves near-oracle performance under high agent concurrency (Figure 5). Figure 5

    Figure 5: Scaling results with 48P96D (1152 GPU) deployments; prompt TPS and end-to-end metrics demonstrate throughput saturation with minimal queueing and stable attention execution times.

  • Online Serving: Agent runs per second (APS) increase by 1.96×1.96\times, and tail latencies (TTFT, TTST, TPOT) remain stable, even at load thresholds where the baseline collapses due to queuing.

Ablation studies reveal the magnitude of impact: layerwise prefill cuts JCT by 17.2%, dual-path loading by a further 38.2%, and workload-aware scheduling by 45.6% (cumulative, on DS 660B).

Load balance metrics confirm that DualPath’s scheduler reduces maximum/average per-NIC traffic ratio from 1.53 to 1.18 and keeps per-attention-layer execution time maximum/average ratio under 1.06 during peak load phases.

Scalability is evidenced by linear strong scaling: expanding from 2P4D (2K agents) to 48P96D (48K agents) maintains per-agent job completion times, limited only by aggregate storage capacity.

Practical and Theoretical Implications

DualPath’s principal theoretical implication is that the KV-Cache loading bottleneck in agentic inference systems is not fundamental, but an artifact of suboptimal bandwidth utilization and rigid role assignment. Architectures that globally pool and schedule all available I/O bandwidth can approach the total storage throughput, limited only by hardware or compute network performance. DualPath’s approach is readily extensible to other forms of data cache and hierarchical memory management. The practical implication is clear: massive agentic deployments, and RL rollout in particular, benefit directly from DualPath's methods, providing predictable throughput scaling, reducing the need for costly DRAM pooling, and flattening skew-induced latency curves.

This work suggests several natural directions for future AI infrastructure:

  • Dynamic, feedback-driven adjustment of P/D ratios and parallelism to respond to workload phase changes.
  • Integration with hierarchical DRAM caching or speculative prefetching to raise hit rates and further buffer disk-pressure.
  • Direct application to long-context training and open-ended tool-agent operation.

Conclusion

DualPath systematically eliminates the storage bandwidth bottleneck in agentic LLM inference by enabling dual-path KV-Cache loading, global bandwidth pooling, and workload-aware traffic scheduling. It achieves up to 1.87×\times throughput improvement and provides robust, production-grade scalability to next-generation agentic applications. The results establish a new baseline for system-level KV-Cache management and hold significant implications for large-scale, multi-turn AI workloads.

Paper to Video (Beta)

Whiteboard

Explain it Like I'm 14

What this paper is about (in simple terms)

This paper is about speeding up how big AI models (like chatbots) think and respond when they act like “agents” that do many steps in a row. In these long, multi-turn tasks, the main slowdown isn’t the model’s math but moving a lot of saved “memory” back and forth from storage. The authors build a system called DualPath that fixes this data-movement bottleneck so the model can run much faster.

The main questions the paper asks

  • Why are agent-style AI tasks (many short turns with a long, growing history) slow on today’s servers?
  • Where exactly is the bottleneck?
  • Can we redesign the way we load the model’s saved memory (called KV-Cache) so we use all the available network and storage bandwidth, not just a small part of it?
  • Can we do this without messing up other time-sensitive communications the model needs?

Key ideas and terms (explained simply)

Before the approach, here are the main pieces in everyday language:

  • KV-Cache: Imagine the model takes notes as it reads your conversation. These notes help it answer future turns without rereading everything. That notebook is the KV-Cache. In agent tasks, you reuse almost all of the past notes each turn, so you have to reload a lot of them.
  • Prefill vs. Decode: Prefill is the model “setting the stage” using the whole conversation so far. Decode is where it actually writes the next words. They use different parts of the computer and have different speed needs.
  • Disaggregation (splitting the work): Servers often split work across two teams:
    • Prefill Engines (PEs): focus on the heavy “read and set up” part.
    • Decode Engines (DEs): focus on fast word-by-word generation.
  • Two networks on the servers:
    • Storage network: talks to disks (where the KV-Cache notes live).
    • Compute network: connects GPUs to each other for fast communication.
  • RDMA: A very fast, direct way to send data between computers (like a private express lane).

The problem today: only the Prefill Engines load most of the notes (KV-Cache) from storage. Their storage network pipes get jammed, while the Decode Engines’ storage pipes sit almost unused.

What DualPath does (the approach)

DualPath adds a second, smarter way to load the KV-Cache and schedules traffic so nothing gets stuck. Think of it like adding a second highway and a traffic cop that balances cars across both routes.

Here’s the core idea:

  • Path 1 (the old way): Storage → Prefill Engine.
  • Path 2 (the new way): Storage → Decode Engine → Prefill Engine over the fast compute network (RDMA).

By choosing between Path 1 and Path 2 for each request, the system uses all available storage bandwidth (not just on Prefill Engines) and avoids overloading one side.

To make this work in real systems, they add three pieces:

  • A careful data path design: Moves KV-Cache in small chunks, timed to overlap with GPU work, and avoids creating new choke points.
  • Network “lane” control (traffic isolation): They send KV-Cache transfers over lower-priority lanes on the compute network, reserving the fast lanes for time-critical model communications. This stops slowdowns during decoding.
  • A smart scheduler: It decides, in real time, which path (PE or DE) should load each request’s KV-Cache, and how to balance work across GPUs and network cards so no single part gets overloaded.

Analogy:

  • Imagine a school with two copy rooms (storage connections). In the old setup, only one department (Prefill) used the copiers, causing lines. DualPath lets both departments (Prefill and Decode) make copies and then hand papers to each other over a fast hallway (compute network). A coordinator watches queues and sends people to the less busy copier. Fast lanes are kept clear for urgent deliveries.

How they tested it (methods, simply)

  • They built DualPath on a modern AI serving system that already splits prefill and decode and does “layer-wise prefill” (like processing one chapter at a time to fit in memory).
  • They ran real “agent” workloads: long conversations with many short turns, where most tokens are reused from KV-Cache (often 95%+ of the context).
  • They measured:
    • Offline throughput (how fast they finish a big batch of agent tasks, like in reinforcement learning rollouts).
    • Online serving throughput and latency (like first-token time).
  • They compared three setups:
    • Basic (their original system).
    • DualPath (their new system).
    • Oracle (an idealized upper bound with no I/O overhead, just to see what’s theoretically possible).

They also did a math check to show that, under common server setups, adding the second path won’t create new bottlenecks (for example, on GPU memory or the compute network), as long as you keep a sensible balance between how many Prefill and Decode machines you run.

What they found (results) and why it matters

  • Faster throughput:
    • Offline (batch) inference: up to 1.87× faster.
    • Online serving: on average 1.96× more throughput without breaking latency promises (SLOs).
  • Better first-token time: The system reduces the time to the first generated token, which is important for user experience in short responses.
  • Balanced resource use: Storage bandwidth that used to sit idle on Decode machines now helps, so the whole cluster works more efficiently.
  • Robustness: Their traffic isolation keeps latency-sensitive model communications smooth, even while moving lots of KV-Cache around.

Why this matters:

  • Agent-style AI is becoming common (coding assistants, web agents, multi-step planners). These tasks are long and reuse lots of context. DualPath targets exactly the real bottleneck—loading reused context—so it delivers gains where users feel them.
  • Hardware trends make this even more important: GPU compute keeps getting faster, but memory and network bandwidth don’t grow as quickly. That makes I/O the limiting factor unless you redesign data movement like DualPath does.

What this could change (implications)

  • Cheaper, faster serving at scale: By squeezing more performance out of the same hardware, companies can handle more agent requests without buying lots of new gear.
  • Better agent training (like RL rollouts): Faster rollouts mean quicker experiments and progress.
  • A blueprint for future systems: The dual-path idea and lane-based traffic control could guide how other AI systems handle big, reusable data.
  • Works with today’s data centers: The design matches how modern AI clusters separate storage and compute networks, and it uses standard QoS features (like virtual lanes) to keep everything smooth.

In short, DualPath turns a one-lane road into a two-lane highway for model memory, with a smart traffic cop. That keeps GPUs busy, responses faster, and servers more efficient—especially for long, multi-step agent tasks.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to be actionable for future research.

Architectural assumptions and generality

  • Validate DualPath on hardware configurations without per-GPU CNICs or without PCIe co-location (e.g., single NIC per node, GPUs behind multiple PCIe switches), and characterize performance degradation and feasibility.
  • Extend the bottleneck-free analysis beyond idealized assumptions (no compute-network congestion, perfect load balancing, storage fully utilized) to account for realistic contention and skew; provide sensitivity analyses to violations of these assumptions.
  • Quantify how DualPath behaves when the compute network is consistently busy (e.g., heavy EP AllToAll patterns, larger TP/CP) rather than intermittently bursty, and determine thresholds where KV-Cache transfers begin to harm inference.
  • Generalize DualPath to inference architectures beyond PD-disaggregation (e.g., monolithic engines, hybrid or pipelined designs) and quantify benefits or trade-offs.

Traffic management and QoS

  • Provide empirical validation of the CNIC-centric H2D/D2H path’s QoS isolation under a range of collective patterns and utilization levels (including tail latency impacts), not just theoretical reasoning.
  • Demonstrate deployment on RoCE and other fabrics (UEC, UnifiedBus), including concrete DSCP/TC configurations, switch/NIC queue mappings, and measured isolation vs. interference.
  • Investigate dynamic QoS policies (e.g., adaptive VL/queue weights) that respond to real-time collective bursts rather than fixed 99/1 splits; evaluate latency, throughput, and fairness effects.
  • Quantify the CPU cost and NIC-side overheads of doorbell batching and RDMA WR submission at scale (e.g., millions of small blocks per second), including effects on completion queues and interrupt moderation.

Scheduling and load balancing

  • Formalize and evaluate the choice of scheduler thresholds (α, β, the 1.05 multiplier for Z), including auto-tuning strategies and sensitivity analyses to workload and cluster conditions.
  • Implement and evaluate the proposed split-read optimization (simultaneous PE and DE reads for a single request), including chunking granularity, per-layer splitting, and conflict resolution with per-node storage queues.
  • Integrate additional load signals (HBM fragmentation, GPU memory headroom, per-model compute-per-token variability, per-expert activation skew in MoE) into inter-/intra-engine scheduling decisions, and quantify gains.
  • Study stability and fairness under dynamic arrivals (online serving), including starvation risks for small-context requests when large-context batches dominate.

Storage and KV-Cache management

  • Detail and evaluate Full Block vs. Layer Block layouts (block sizes, indexing, metadata overhead), and quantify their impact on random I/O, small-chunk transfer efficiency, and write amplification.
  • Analyze end-to-end storage write costs (persisting generated KV-Cache every N tokens), including replication, durability, garbage collection, and consistency under failures or reordering.
  • Compare SSD-based KV-Cache storage to alternative tiers (DRAM pools like Mooncake, NVMe-oF, object storage) with cost-performance modeling; identify when each tier is advantageous under agentic workloads.
  • Explore KV-Cache compression/quantization trade-offs (e.g., FP8 vs. lower precision or compressed formats) on bandwidth, latency, and model quality; provide mechanisms for selective compression by layer or recency.

Performance evaluation scope

  • Broaden model coverage (e.g., Llama-family dense models, encoder-decoder architectures, smaller latency-focused models) and quantify how architecture differences affect DualPath’s gains.
  • Test extreme contexts (≫64K; up to 1M tokens as claimed) and lower KV-Cache hit rates (<95%), to map the operating envelope and failure modes when workloads deviate from agentic norms.
  • Provide detailed online serving results (p50/p95/p99 TTFT/TTST, jitter), under varied concurrency and mixed request sizes, and show SLO compliance with tail latency under peak load.
  • Include ablation studies for individual components (dual-path data flow, CNIC-centric transfers, scheduling) to attribute performance benefits and identify the most impactful levers.

Robustness, failure handling, and correctness

  • Define and evaluate failure handling for partial reads/writes, DE/PE failures mid-transfer, storage timeouts, and network partitions; include retry, deduplication, and cache consistency semantics.
  • Verify correctness and consistency of KV-Cache merging between hit/miss tokens across paths (PE-read vs. DE-read), especially under concurrent updates and layer-wise streaming.
  • Assess the reliability characteristics of RDMA to GPU memory (QP scalability, retransmissions, flow control) at cluster scale, including recovery behaviors and their latency impacts.
  • Evaluate NUMA effects and multi-socket DRAM bandwidth contention for PE/DE buffers, including placement policies and cross-socket penalties.

Portability and deployment constraints

  • Quantify the minimal hardware requirements (per-GPU CNIC bandwidth, PCIe topology constraints, switch QoS capabilities) and provide a deployment guide for commodity clusters lacking DGX-like features.
  • Study interactions with NVSwitch/NVLink fabric configurations (GH200, Blackwell) and determine if CNIC-centric paths remain optimal or if GPUDirect Storage variants can be QoS-isolated in future hardware.
  • Evaluate coexistence with training jobs on the same fabric (multi-tenant clusters), including fabric-level fairness, QoS isolation between jobs, and impact on training collectives.

Privacy, security, and multi-tenancy

  • Define security and isolation for RDMA to GPU memory (memory keys, protection domains), including multi-tenant risks and mitigations.
  • Address KV-Cache persistence policies (encryption at rest, access control, data retention), especially for sensitive user contexts in production serving.

Cost, energy, and trade-offs

  • Provide a cost-performance analysis comparing DualPath to simply provisioning more prefill-side storage bandwidth (e.g., additional SNICs) and to DRAM-based caching; include energy per token and operational complexity.
  • Analyze the energy/network overhead of increased compute-network transfers vs. the throughput gains, and evaluate sustainability under large-scale deployment.

Explicitly stated future work

  • Implement and evaluate concurrent reading of a single request from both PE and DE sides, including scheduling heuristics, block partitioning, and expected gains under different P/D ratios and queue lengths.

Glossary

  • Agentic LLM: An LLM designed to autonomously plan, use tools, and interact over many turns with an environment. Example: "agentic LLM inference"
  • AllGather: A collective communication primitive that gathers data from all participants and distributes the aggregated result to all of them. Example: "ReduceScatter/AllGather"
  • AllToAll: A collective communication primitive where each participant sends data to and receives data from every other participant. Example: "AllToAll in expert parallel"
  • Autoregressive decoding: Token-by-token generation where each new token is conditioned on previously generated tokens. Example: "autoregressive decoding"
  • Cache-compute ratio: The ratio of KV-Cache data volume that must be loaded to the amount of computation, used to characterize I/O pressure. Example: "the cache-compute ratio"
  • CNIC (Compute NIC): The network interface dedicated to the compute (east–west) fabric used for inter-GPU/node communication. Example: "compute NIC (CNIC, also known as east-west NIC)"
  • CNIC-centric: A data/traffic management approach that routes GPU I/O through the GPU’s paired compute NIC to leverage network QoS for isolation. Example: "a CNIC-centric traffic management approach"
  • Compute quota: A predefined upper bound on expected layer compute time used to cap batch assembly during prefill scheduling. Example: "compute quota"
  • CUDA copy engine: A GPU hardware path for copying data between host and device memory via PCIe. Example: "CUDA copy engine"
  • DE buffer: A DRAM buffer on decode engines used to stage KV-Cache before transfer to GPU memory. Example: "DE buffer"
  • Decode engine (DE): A dedicated engine (GPU process) responsible for token generation in PD-disaggregated serving. Example: "decoding engines (DEs)"
  • Differentiated Services Code Point (DSCP): An IP-layer field used to mark packets for QoS differentiation in Ethernet/roce networks. Example: "Differentiated Services Code Point (DSCP)"
  • Doorbell batching: An RDMA optimization that amortizes submission overhead by batching work requests before ringing the NIC doorbell. Example: "doorbell batching"
  • East-west NIC: A NIC used for inter-node compute traffic within the cluster (the compute fabric). Example: "east-west NIC"
  • Expert parallel (EP): A parallelization strategy for MoE models where expert computations are sharded across devices and coordinated via collectives. Example: "expert parallel"
  • Full Block: A KV-Cache block layout that contains all layers’ cache for a token block, used for storage interactions. Example: "Full Block"
  • GPUDirect RDMA: A technology enabling NICs to directly read/write GPU memory, bypassing the host CPU. Example: "GPUDirect RDMA"
  • GPUDirect Storage: A technology enabling storage devices to DMA directly to GPU memory without host copies. Example: "GPUDirect Storage"
  • GQA: Grouped-Query Attention, an attention variant that reduces key/value redundancy by sharing K/V among grouped queries. Example: "GQA"
  • H2D and D2H: Host-to-Device and Device-to-Host data transfers between CPU memory and GPU memory. Example: "H2D and D2H"
  • HBM (High Bandwidth Memory): On-package memory with very high throughput used by GPUs. Example: "HBM capacity"
  • I/O-compute ratio: The relative scaling between data movement capability (I/O) and computational throughput, used to reason about bottlenecks. Example: "I/O-compute ratio"
  • I/O-bound: A workload characteristic where performance is limited by data movement rather than computation. Example: "I/O-bound"
  • InfiniBand: A high-performance, low-latency network interconnect commonly used in AI clusters. Example: "InfiniBand-based network"
  • Kernel bypass: An I/O technique that avoids kernel overhead by submitting operations directly from user space. Example: "kernel bypass"
  • KV-Cache: Cached attention keys and values that allow reusing past computations across tokens/turns. Example: "KV-Cache"
  • Layer Block: A KV-Cache block layout that contains cache for a single layer, used for layerwise streaming transfers. Example: "Layer Block"
  • Layer-wise prefill: Executing prefill one layer at a time so only one layer’s KV-Cache lives in HBM, increasing effective batch size. Example: "layer-wise prefill"
  • MMIO (Memory-Mapped I/O): A method of device communication via memory-mapped registers; in RDMA, used when submitting work requests. Example: "mmio writes"
  • MoE (Mixture of Experts): A model architecture that routes tokens to specialized expert sub-networks for efficiency and capacity. Example: "an MoE model"
  • NVLink: NVIDIA’s high-bandwidth interconnect linking GPUs within a node. Example: "NVLink"
  • PD disaggregation: Architectural separation of prefill and decode phases onto different engine types for efficiency and scalability. Example: "PD disaggregation"
  • P/D ratio: The ratio of prefill engines to decode engines in a deployment, affecting bandwidth and compute balance. Example: "P/D ratios"
  • PCIe QoS: Quality-of-service mechanisms on PCI Express links (often unavailable on current GPUs), relevant to isolating traffic classes. Example: "PCIe QoS"
  • QoS: Quality of Service; mechanisms to prioritize or reserve network bandwidth for specific traffic classes. Example: "QoS controls"
  • RDMA: Remote Direct Memory Access, enabling direct memory operations across nodes with low latency and low CPU overhead. Example: "via RDMA over the compute network"
  • RDMA over Converged Ethernet (RoCE): RDMA transport running over Ethernet, enabling RDMA semantics outside InfiniBand. Example: "RDMA over Converged Ethernet (RoCE)"
  • RDMA Write: An RDMA verb that writes data directly into a remote memory region. Example: "RDMA Write"
  • Read amplification: Extra storage reads incurred beyond what is strictly necessary due to system or data layout effects. Example: "read amplification"
  • ReduceScatter: A collective that reduces data across participants and scatters disjoint reduced partitions back to them. Example: "ReduceScatter/AllGather"
  • Reinforcement learning (RL): A training paradigm optimizing behavior via rewards; here used for agentic LLM training. Example: "reinforcement learning (RL) approaches"
  • SLO: Service Level Objective; a target for service quality such as latency or throughput. Example: "SLO"
  • SNIC (Storage NIC): The NIC connected to the storage fabric used for reading/writing datasets or KV-Cache. Example: "storage NIC (SNIC, also known as south-north NIC)"
  • Sparse attention: An attention mechanism that prunes attention patterns to reduce compute/memory costs. Example: "sparse attention design"
  • Tensor Core: Specialized GPU units (e.g., NVIDIA) for high-throughput tensor operations used in deep learning. Example: "Tensor Core"
  • Tensor/context parallel: Parallelization strategies that shard tensor dimensions and/or context across devices, coordinated by collectives. Example: "tensor/context parallel"
  • TTFT: Time to First Token; latency until the first generated token is produced. Example: "TTFT"
  • TTST: Time to the Second Token; latency from start to the second output token, often reflecting steady-state decode. Example: "TTST (Time to the second token)"
  • TPOT: Tokens Per Output Time (or similar throughput metric) used to evaluate serving performance. Example: "TPOT"
  • Virtual lanes (VLs): Link-level traffic classes in InfiniBand that enable strict prioritization and isolation of flows. Example: "virtual lanes (VLs)"
  • Weighted round-robin policy: A scheduling policy that assigns bandwidth shares to classes proportionally to weights. Example: "weighted round-robin policy"

Practical Applications

Immediate Applications

Below are concrete applications that can be deployed now, leveraging DualPath’s findings to improve agentic LLM inference in real-world settings.

  • Cloud LLM serving for agentic sessions
    • Sector: software/cloud, SaaS
    • What: Deploy DualPath’s dual-path KV-Cache loading and CNIC-centric traffic isolation in multi-turn, tool-using LLM services (e.g., coding copilots, autonomous task agents) to increase throughput (up to ~1.9×) and reduce time-to-first-token without violating SLOs.
    • Tools/Products/Workflows: “DualPath-aware KV-Cache loader” within PD-disaggregated serving; scheduler tuned with token-count proxies and thresholds; InfiniBand/RoCE QoS configuration bundles; SSD-based KV storage (e.g., 3FS); layerwise prefill integration.
    • Assumptions/Dependencies: Agentic workloads with ≥95% KV-Cache hit rate; PD-disaggregated architecture; RDMA-capable compute network (IB/RoCE) with QoS (VLs, DSCP/TC); GPUDirect RDMA and proper PCIe topology (GPU+CNIC under same switch); distributed storage that saturates SNICs; P/D ratios within derived bottleneck-free range.
  • Accelerated RL rollouts for agent LLM training
    • Sector: AI/ML, research & development
    • What: Use DualPath during rollout phases to shorten job completion time and improve GPU utilization while relying on cost-effective SSD KV-Cache instead of DRAM pools.
    • Tools/Products/Workflows: Rollout pipelines that schedule requests across prefill/decode engines using the inter- and intra-engine algorithms; token-count metrics to balance NIC/GPU load; per-group queues and leader engines; SSD-backed KV storage with kernel-bypass I/O (io_uring-like).
    • Assumptions/Dependencies: Large trajectory working sets; limited DRAM due to optimizer/reward model states; consistent storage bandwidth; well-tuned scheduler thresholds (α, β); EP/DP configurations where attention is data-parallel.
  • Enterprise on‑prem inference optimization
    • Sector: enterprise IT, software
    • What: Retrofit PD-disaggregated clusters with DualPath to pool storage bandwidth across prefill and decode engines, avoiding overloading prefill-side SNICs; produce an “agentic-optimized rack profile.”
    • Tools/Products/Workflows: Playbooks to set P/D ratios within the provably bottleneck-free range; QoS config templates for IB/RoCE (VL/TC/DSCP); CNIC-assisted H2D/D2H path; block layouts (Full Block for storage, Layer Block for transfers).
    • Assumptions/Dependencies: Physical separation of compute and storage networks; storage backend saturates SNIC; accurate per-node read queue metrics; low congestion on compute fabric; doorbell batching for RDMA submission.
  • Multi-tenant traffic isolation for inference clusters
    • Sector: data center operations, telecom
    • What: Apply CNIC-centric transfer with QoS isolation (e.g., IB virtual lanes) to shield latency-sensitive model collectives (AllToAll, ReduceScatter/AllGather) from KV-Cache traffic.
    • Tools/Products/Workflows: Network-wide VL arbitration (e.g., 99% bandwidth reserved for high-priority inference traffic); NIC and switch queue weights; RoCE TC/DSCP mapping equivalents; automation scripts to enforce QoS across tenants.
    • Assumptions/Dependencies: Hardware QoS support in network stack; correctly mapped traffic classes; monitoring to prevent starvation of low-priority KV traffic.
  • Cost and energy efficiency improvements
    • Sector: finance (TCO), energy/sustainability
    • What: Increase aggregate storage bandwidth utilization by using decode-side SNICs; reduce GPU idle time, energy per token, and overprovisioning of prefill-side NICs.
    • Tools/Products/Workflows: TCO calculators that incorporate DualPath throughput gains; energy dashboards reporting watts/token and NIC utilization; capacity planning models that include pooled SNIC bandwidth.
    • Assumptions/Dependencies: Measurable KV-Cache I/O bottleneck; sufficient decode-side SNIC headroom; accurate energy/utilization instrumentation.
  • Framework integrations and libraries
    • Sector: software/tools
    • What: Integrate DualPath into serving stacks (e.g., vLLM, SGLang-like systems) as a pluggable KV-Cache loading/scheduling component that supports PD disaggregation and layerwise prefill.
    • Tools/Products/Workflows: Open-source library for dual-path transfers (RDMA + QoS), schedulers with token-based balancing, storage adapters for SSD-backed KV-Cache (3FS-compatible), observability hooks.
    • Assumptions/Dependencies: Alignment with framework parallelism (EP/DP/TP); consistent model block layouts; RDMA APIs and GPUDirect RDMA availability.
  • Operations observability and scheduling governance
    • Sector: DevOps/SRE
    • What: Build dashboards and alarms around token-count proxies, per-node read queues, HBM headroom, and NIC utilization to steer the inter-/intra-engine schedulers and avoid imbalance.
    • Tools/Products/Workflows: Metrics pipelines; leader-engine coordination; compute quotas for layerwise batches; policy to prefer nodes with shorter read queues and avoid overloaded engines.
    • Assumptions/Dependencies: Accurate, low-latency telemetry; scheduler hooks; calibration/profiling for attention layer time vs. theoretical compute.
  • End‑user responsiveness improvements in agentic assistants
    • Sector: daily life, consumer-facing apps
    • What: Users see faster first responses and steadier token streams during multi-turn tool use (web browsing, code execution) thanks to reduced KV-Cache I/O bottlenecks.
    • Tools/Products/Workflows: Service tiers labeled “agentic-optimized”; A/B testing of TTFT/TTST improvements; backlog handling strategies for long-context sessions.
    • Assumptions/Dependencies: Back-end deployment of DualPath; workloads dominated by KV reuse and short appends; SLO tracking for latency-sensitive paths.

Long-Term Applications

The following opportunities require further research, hardware support, or larger-scale development before broad deployment.

  • Hardware–software co‑design for inference fabrics
    • Sector: hardware vendors, cloud providers
    • What: Incorporate PCIe QoS in GPUs/NICs and expose unified QoS controls across PCIe, RDMA, and storage stacks to natively isolate inference collectives from KV traffic; converge with Ultra Ethernet/UnifiedBus.
    • Potential Products: “QoS-ready AI NICs” paired per-GPU; firmware/driver support for fine-grained lane arbitration; consolidated QoS APIs.
    • Assumptions/Dependencies: Vendor support for PCIe QoS; standardization across IB/RoCE/UEC; firmware stability and cross-stack integration.
  • Extending DualPath principles to training and fine‑tuning
    • Sector: academia, AI/ML industry
    • What: Apply dual-path pooling and CNIC-centric QoS to optimizer-state shuffling, activation offloading, and checkpoint I/O during large-scale training; reduce congestion across data and compute fabrics.
    • Potential Workflows: Dual-path for activation paging; pooled SNIC utilization during gradient checkpointing; dynamic scheduling around bursty collectives.
    • Assumptions/Dependencies: Training frameworks adapted for PD-like staged operations; compatibility with ZeRO/offload strategies; sufficient DRAM/SSD bandwidth.
  • Zero‑copy and simultaneous dual‑side reads
    • Sector: software/hardware co‑optimization
    • What: Replace DRAM buffers with GPU Direct RDMA to DE HBM (where safe) and split KV reads across PE/DE concurrently to further cut TTFT and storage latency.
    • Potential Tools: Smart block-splitting algorithms; NIC–GPU coalescing for small chunks; adaptive policies to minimize copy overheads.
    • Assumptions/Dependencies: Robust RDMA-to-GPU memory semantics; careful QoS isolation; correctness in layerwise streaming with partial blocks.
  • Agentic workload benchmarks and QoS profiles standardization
    • Sector: policy/standards bodies, industry consortia
    • What: Define benchmarks for multi-turn, short-append workloads and recommended QoS profiles (bandwidth reservations, arbitration weights) for inference clusters.
    • Potential Products: Certification suites for “agentic-ready” clusters; reference P/D ratio calculators based on the bottleneck-free bounds; standard telemetry schemas.
    • Assumptions/Dependencies: Community consensus; reproducible traces; vendor cooperation on QoS exposure.
  • SSD KV‑Cache compression, deduplication, and indexing
    • Sector: storage systems
    • What: Reduce I/O volume with KV-Cache-aware compression and block-level dedup; build indexes tuned to layerwise streaming patterns to accelerate lookups.
    • Potential Products: “KV-Cache-aware” storage engines; cache compaction services that preserve latency SLOs; block layout optimizers.
    • Assumptions/Dependencies: Acceptable CPU overhead; minimal impact on latency-critical paths; compatibility with FP8/FP16 cache formats.
  • Managed “Agentic‑Optimized LLM Endpoint” as a cloud offering
    • Sector: cloud providers
    • What: Productize DualPath with autoscaling of P/D ratios, QoS policy enforcement, and SLAs for TTFT/TPOT in agentic scenarios; expose knobs for context length and tool-use profiles.
    • Potential Tools: Control planes that adapt engines/groups to load; per-tenant QoS allocations; billing models tied to energy-per-token and KV reuse.
    • Assumptions/Dependencies: Mature orchestration, isolation, and observability; multi-tenant fairness policies; consistent RDMA performance across regions.
  • Edge and on‑prem appliances with micro‑DualPath
    • Sector: edge computing, enterprise appliances
    • What: Adapt DualPath to smaller clusters (MIG partitions, fewer GPUs) to pool limited storage bandwidth across engines for long-context agents on-prem.
    • Potential Products: “Agentic edge boxes” with validated P/D ratios; lightweight QoS stacks; simplified schedulers.
    • Assumptions/Dependencies: RDMA-capable edge NICs; storage fabric capacity; reduced management complexity.
  • Green AI reporting and compliance
    • Sector: policy/sustainability
    • What: Use DualPath’s efficiency gains to meet or exceed regulatory/voluntary targets on energy-per-token and data center utilization; provide standardized reporting.
    • Potential Tools: Auditable energy dashboards; automated capacity planning with pooled SNIC modeling; sustainability scorecards for agentic services.
    • Assumptions/Dependencies: Accepted measurement methodologies; integration with DCIM/telemetry; evolving regulatory frameworks.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 14 tweets with 748 likes about this paper.