On-Package Memory with Universal Chiplet Interconnect Express (UCIe): A Low Power, High Bandwidth, Low Latency and Low Cost Approach
Abstract: Emerging computing applications such as AI are facing a memory wall: existing on-package memory solutions are unable to meet their power-efficient bandwidth demands. We propose to enhance UCIe with memory semantics to deliver power-efficient bandwidth and cost-effective on-package memory solutions applicable across the entire computing continuum. We propose approaches that reuse existing LPDDR6 and HBM memory through a logic die that connects to the SoC using UCIe. We also propose an approach where the DRAM die natively supports UCIe instead of the LPDDR6 bus interface. Our approaches result in significantly higher bandwidth density (up to 10x), lower latency (up to 3x), lower power (up to 3x), and lower cost compared to existing HBM4 and LPDDR on-package memory solutions.
Explain it Like I'm 14
Overview: What this paper is about
Computers keep getting faster, but they’re running into a “memory wall.” That means the processor wants data faster than today’s memory systems can deliver, especially for AI, which needs huge amounts of data quickly and efficiently. This paper proposes a new way to put memory right next to the processor on the same package and connect it using a standard called UCIe (Universal Chiplet Interconnect Express). The goal is to get much more speed, much lower delay, and much lower power and cost than today’s on-package memory options.
The main questions the paper tries to answer
- How can we feed processors (especially for AI) with a lot more data without using too much power or money?
- Can we reuse existing memories (like LPDDR and HBM) but connect them in a smarter way using UCIe?
- Could future memory chips speak UCIe directly, making the whole system simpler?
- Which message formats and link styles work best over UCIe to make memory fast and efficient?
- Will this approach beat today’s solutions in bandwidth (how much data per second), latency (how quickly data arrives), power (energy per bit), and cost?
How the authors approach the problem (in everyday terms)
Think of a computer chip as a city and memory as the food supply. Today’s memory delivery uses many wide, slower roads with different rules for commands and data. That’s wasteful and hard to scale. The authors suggest switching to a few super-fast, point-to-point “trains” that carry both commands and data efficiently. Those “trains” are UCIe links.
They explore two main ways to build this:
- Using a translator stop (a “logic die”): The processor talks UCIe to a small chip (“logic die”) on the package. That logic die then talks to regular memory chips (either stacked HBM or wire-bonded LPDDR) using their native signaling. This reuses existing memory chips with minimal changes and can be shipped sooner.
- Making memory speak UCIe natively: The memory chip itself gets a UCIe interface and connects directly to the processor. This takes longer to develop but could be great for phones, tablets, and other small devices.
They also tailor the “tracks” (links) for memory traffic (a small sketch after this list shows the effect):
- Symmetric links: same size in both directions, good when requests and responses are balanced.
- Asymmetric links: more lanes for reads than writes (or vice versa), matching the fact that memory traffic is often read-heavy.
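To make the symmetric-vs-asymmetric trade-off concrete, here is a minimal sketch. The lane counts, the 32 Gbps per-lane rate, and the 2:1 read:write mix are illustrative assumptions, not figures from the paper; the point is only that, for the same total lane count, shifting lanes toward the read direction raises the traffic a read-heavy workload can sustain.

```python
# Back-of-the-envelope: symmetric vs. asymmetric lane allocation.
# All numbers are illustrative assumptions, not values from the paper.

def sustainable_traffic(read_fraction, read_lanes, write_lanes, gbps_per_lane=32.0):
    """Total traffic (GB/s) at which the busier direction saturates.

    read_fraction: share of traffic that is reads (carried on read_lanes);
    the remainder is writes (carried on write_lanes).
    """
    read_capacity = read_lanes * gbps_per_lane / 8    # GB/s for read data
    write_capacity = write_lanes * gbps_per_lane / 8  # GB/s for write data
    limits = []
    if read_fraction > 0:
        limits.append(read_capacity / read_fraction)
    if read_fraction < 1:
        limits.append(write_capacity / (1 - read_fraction))
    return min(limits)

# 16 lanes total, traffic that is 2/3 reads:
print(sustainable_traffic(2/3, read_lanes=8, write_lanes=8))   # symmetric: 48.0 GB/s
print(sustainable_traffic(2/3, read_lanes=11, write_lanes=5))  # asymmetric: 60.0 GB/s
```

Same 16 lanes, roughly 25% more usable bandwidth, simply because the lane split matches the traffic mix.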
To make messages efficient, they map existing memory protocols onto UCIe:
- Map LPDDR6 and HBM (today’s popular memories) over UCIe with a logic die.
- Use standard message formats like CXL.Mem and ARM’s CHI over UCIe, then optimize how data and headers are packed so more useful data fits per transfer (see the packing sketch after this list).
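The packing optimization can be illustrated with simple arithmetic. The header/CRC byte counts below are assumptions for illustration, not the actual UCIe or CXL.Mem flit layouts; the takeaway is that amortizing one set of headers over several cachelines raises the fraction of wire bytes carrying useful data.

```python
# Illustrative payload efficiency: useful data vs. header/CRC overhead.
# Byte counts are assumptions, not the real UCIe/CXL.Mem flit formats.

def payload_efficiency(payload_bytes, overhead_bytes):
    """Fraction of bytes on the wire that are useful data."""
    return payload_bytes / (payload_bytes + overhead_bytes)

# One 64 B cacheline with its own ~16 B of header/CRC:
print(payload_efficiency(64, 16))        # 0.80
# Three 64 B cachelines sharing ~24 B of amortized header/CRC:
print(payload_efficiency(3 * 64, 24))    # ~0.89
```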
Key ideas explained simply (a quick numeric example follows the list):
- Bandwidth density: how much data can move per millimeter of chip edge; higher means more “throughput per space.”
- Latency: the wait time; lower is better, like faster delivery.
- Power per bit (pJ/b): energy to move one bit; lower saves battery and electricity.
- Chiplets: instead of one giant chip, use smaller “puzzle pieces” (compute, memory, I/O) joined on one package.
- UCIe: the standardized “railway” that lets chiplets from different companies connect reliably and quickly.
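To see why these metrics matter together, here is a small worked example; the bandwidth and energy numbers are illustrative, but the arithmetic is exact.

```python
# Energy per bit translates directly into link power at a given bandwidth.

def link_power_watts(bandwidth_gbytes_per_s, pj_per_bit):
    """Power (W) to move data at the given rate and energy cost per bit."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12

print(link_power_watts(1000, 1.0))  # 1 TB/s at 1 pJ/b ->  8.0 W
print(link_power_watts(1000, 3.0))  # 1 TB/s at 3 pJ/b -> 24.0 W

def bandwidth_density(bandwidth_gbytes_per_s, edge_mm):
    """Bandwidth per millimeter of die edge ('beachfront'), GB/s/mm."""
    return bandwidth_gbytes_per_s / edge_mm

print(bandwidth_density(1000, 5))   # 1 TB/s through 5 mm of edge -> 200 GB/s/mm
```

A 3× reduction in pJ/b is the difference between 24 W and 8 W for the same terabyte per second, which is why energy per bit dominates memory-link design.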
What they found and why it matters
According to the paper’s analysis and measurements:
- Much higher bandwidth density: up to about 10× more data per millimeter than traditional approaches. That means you can pack a lot more memory speed along the chip’s edge.
- Lower latency: roughly up to 3× faster response than current on-package memory buses, so data arrives sooner.
- Lower power: roughly up to 3× better energy efficiency per bit, thanks to fast links that can quickly sleep when idle and only wake the lanes that are needed.
- Lower cost: by reusing cheaper LPDDR chips with a logic die and by using a common interconnect standard.
Among the options, “CXL.Mem over symmetric UCIe with optimization” performed especially well across many workloads, and mapping LPDDR/HBM over asymmetric UCIe also performed well for read-heavy traffic. Overall, the UCIe-based approaches outperformed both next-generation HBM4 and LPDDR6 setups in bandwidth density and power, and they reduced latency.
Why this is important:
- AI and high-performance computing need huge, fast memory. Speeding up memory without blowing the power budget is critical for everything from laptops to data centers.
- Better energy efficiency helps control the growing electricity use of computing.
- Using a standard like UCIe lets companies mix and match chiplets, encouraging innovation and lowering costs.
What this could change in the real world
If adopted and standardized, this approach could:
- Enable faster, more efficient AI chips across phones, PCs, and servers.
- Cut data center energy use and costs by moving data more efficiently.
- Let manufacturers combine chiplets from different vendors more easily, speeding up new products.
- Scale into the future: as packaging improves (smaller bumps, higher frequencies), UCIe-based memory can keep increasing bandwidth without a power penalty; a rough scaling sketch follows this list.
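As a rough feel for that scaling claim, the sketch below ties bandwidth density to bump pitch and per-lane speed. The pitch, lane rate, and bump-row count are loose assumptions, not the paper's numbers, and real layouts are more complicated; the point is just that both knobs multiply.

```python
# Hand-wavy scaling model: more bumps per mm of edge -> more lanes -> more
# GB/s per mm. Pitches, rates, and row counts are illustrative assumptions.

def edge_bandwidth_density(bump_pitch_um, gbps_per_lane, bump_rows=4):
    """Approximate GB/s per mm of die edge for a simple bump grid."""
    lanes_per_mm = (1000 / bump_pitch_um) * bump_rows  # bumps along 1 mm of edge
    return lanes_per_mm * gbps_per_lane / 8            # Gb/s -> GB/s

print(edge_bandwidth_density(45, 32))  # ~356 GB/s/mm
print(edge_bandwidth_density(25, 64))  # ~1280 GB/s/mm: denser bumps, faster lanes
```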
In short, the paper argues for a shift from old, wide, slower memory buses to a modern, high-speed, UCIe-based “rail system” on the chip package. That shift could unlock big gains in speed, power efficiency, and cost for the next generation of AI and computing.