On-Package Memory with Universal Chiplet Interconnect Express (UCIe): A Low Power, High Bandwidth, Low Latency and Low Cost Approach
Abstract: Emerging computing applications such as AI are facing a memory wall: existing on-package memory solutions are unable to meet their power-efficient bandwidth demands. We propose to enhance UCIe with memory semantics to deliver power-efficient bandwidth and cost-effective on-package memory solutions applicable across the entire computing continuum. We propose approaches that reuse existing LPDDR6 and HBM memory through a logic die that connects to the SoC using UCIe. We also propose an approach where the DRAM die natively supports UCIe instead of the LPDDR6 bus interface. Our approaches result in significantly higher bandwidth density (up to 10x), lower latency (up to 3x), lower power (up to 3x), and lower cost compared to existing HBM4 and LPDDR on-package memory solutions.
Explain it Like I'm 14
Overview: What this paper is about
Computers keep getting faster, but they’re running into a “memory wall.” That means the processor wants data faster than today’s memory systems can deliver, especially for AI, which needs huge amounts of data quickly and efficiently. This paper proposes a new way to put memory right next to the processor on the same package and connect it using a standard called UCIe (Universal Chiplet Interconnect Express). The goal is to get much more speed, much lower delay, and much lower power and cost than today’s on-package memory options.
The main questions the paper tries to answer
- How can we feed processors (especially for AI) with a lot more data without using too much power or money?
- Can we reuse existing memories (like LPDDR and HBM) but connect them in a smarter way using UCIe?
- Could future memory chips speak UCIe directly, making the whole system simpler?
- Which message formats and link styles work best over UCIe to make memory fast and efficient?
- Will this approach beat today’s solutions in bandwidth (how much data per second), latency (how quickly data arrives), power (energy per bit), and cost?
How the authors approach the problem (in everyday terms)
Think of a computer chip as a city and memory as the food supply. Today’s memory delivery uses many wide, slower roads with different rules for commands and data. That’s wasteful and hard to scale. The authors suggest switching to a few super-fast, point-to-point “trains” that carry both commands and data efficiently. Those “trains” are UCIe links.
They explore two main ways to build this:
- Using a translator stop (a “logic die”): The processor talks UCIe to a small chip (“logic die”) on the package. That logic die then talks to regular memory chips (either stacked HBM or wire-bonded LPDDR) using their native signaling. This reuses existing memory chips with minimal changes and can be shipped sooner.
- Making memory speak UCIe natively: The memory chip itself gets a UCIe interface and connects directly to the processor. This takes longer to develop but could be great for phones, tablets, and other small devices.
They also tailor the “tracks” (links) for memory traffic (a small sketch after this list shows the effect):
- Symmetric links: same size in both directions, good when requests and responses are balanced.
- Asymmetric links: more lanes for reads than writes (or vice versa), matching the fact that memory traffic is often read-heavy.
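To make the symmetric-vs-asymmetric trade-off concrete, here is a minimal sketch. The lane counts, the 32 Gbps per-lane rate, and the 2:1 read:write mix are illustrative assumptions, not figures from the paper; the point is only that, for the same total lane count, shifting lanes toward the read direction raises the traffic a read-heavy workload can sustain.

```python
# Back-of-the-envelope: symmetric vs. asymmetric lane allocation.
# All numbers are illustrative assumptions, not values from the paper.

def sustainable_traffic(read_fraction, read_lanes, write_lanes, gbps_per_lane=32.0):
    """Total traffic (GB/s) at which the busier direction saturates.

    read_fraction: share of traffic that is reads (carried on read_lanes);
    the remainder is writes (carried on write_lanes).
    """
    read_capacity = read_lanes * gbps_per_lane / 8    # GB/s for read data
    write_capacity = write_lanes * gbps_per_lane / 8  # GB/s for write data
    limits = []
    if read_fraction > 0:
        limits.append(read_capacity / read_fraction)
    if read_fraction < 1:
        limits.append(write_capacity / (1 - read_fraction))
    return min(limits)

# 16 lanes total, traffic that is 2/3 reads:
print(sustainable_traffic(2/3, read_lanes=8, write_lanes=8))   # symmetric: 48.0 GB/s
print(sustainable_traffic(2/3, read_lanes=11, write_lanes=5))  # asymmetric: 60.0 GB/s
```

Same 16 lanes, roughly 25% more usable bandwidth, simply because the lane split matches the traffic mix.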
To make messages efficient, they map existing memory protocols onto UCIe:
- Map LPDDR6 and HBM (today’s popular memories) over UCIe with a logic die.
- Use standard message formats like CXL.Mem and ARM’s CHI over UCIe, then optimize how data and headers are packed so more useful data fits per transfer (see the packing sketch after this list).
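The packing optimization can be illustrated with simple arithmetic. The header/CRC byte counts below are assumptions for illustration, not the actual UCIe or CXL.Mem flit layouts; the takeaway is that amortizing one set of headers over several cachelines raises the fraction of wire bytes carrying useful data.

```python
# Illustrative payload efficiency: useful data vs. header/CRC overhead.
# Byte counts are assumptions, not the real UCIe/CXL.Mem flit formats.

def payload_efficiency(payload_bytes, overhead_bytes):
    """Fraction of bytes on the wire that are useful data."""
    return payload_bytes / (payload_bytes + overhead_bytes)

# One 64 B cacheline with its own ~16 B of header/CRC:
print(payload_efficiency(64, 16))        # 0.80
# Three 64 B cachelines sharing ~24 B of amortized header/CRC:
print(payload_efficiency(3 * 64, 24))    # ~0.89
```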
Key ideas explained simply (a quick numeric example follows the list):
- Bandwidth density: how much data can move per millimeter of chip edge; higher means more “throughput per space.”
- Latency: the wait time; lower is better, like faster delivery.
- Power per bit (pJ/b): energy to move one bit; lower saves battery and electricity.
- Chiplets: instead of one giant chip, use smaller “puzzle pieces” (compute, memory, I/O) joined on one package.
- UCIe: the standardized “railway” that lets chiplets from different companies connect reliably and quickly.
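To see why these metrics matter together, here is a small worked example; the bandwidth and energy numbers are illustrative, but the arithmetic is exact.

```python
# Energy per bit translates directly into link power at a given bandwidth.

def link_power_watts(bandwidth_gbytes_per_s, pj_per_bit):
    """Power (W) to move data at the given rate and energy cost per bit."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    return bits_per_second * pj_per_bit * 1e-12

print(link_power_watts(1000, 1.0))  # 1 TB/s at 1 pJ/b ->  8.0 W
print(link_power_watts(1000, 3.0))  # 1 TB/s at 3 pJ/b -> 24.0 W

def bandwidth_density(bandwidth_gbytes_per_s, edge_mm):
    """Bandwidth per millimeter of die edge ('beachfront'), GB/s/mm."""
    return bandwidth_gbytes_per_s / edge_mm

print(bandwidth_density(1000, 5))   # 1 TB/s through 5 mm of edge -> 200 GB/s/mm
```

A 3× reduction in pJ/b is the difference between 24 W and 8 W for the same terabyte per second, which is why energy per bit dominates memory-link design.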
What they found and why it matters
According to the paper’s analysis and measurements:
- Much higher bandwidth density: up to about 10× more data per millimeter than traditional approaches. That means you can pack a lot more memory speed along the chip’s edge.
- Lower latency: roughly up to 3× faster response than current on-package memory buses, so data arrives sooner.
- Lower power: roughly up to 3× better energy efficiency per bit, thanks to fast links that can quickly sleep when idle and only wake the lanes that are needed.
- Lower cost: by reusing cheaper LPDDR chips with a logic die and by using a common interconnect standard.
Among the options, “CXL.Mem over symmetric UCIe with optimization” performed especially well across many workloads, and mapping LPDDR/HBM over asymmetric UCIe also performed well for read-heavy traffic. Overall, the UCIe-based approaches outperformed both next-generation HBM4 and LPDDR6 setups in bandwidth density and power, and they reduced latency.
Why this is important:
- AI and high-performance computing need huge, fast memory. Speeding up memory without blowing the power budget is critical for everything from laptops to data centers.
- Better energy efficiency helps control the growing electricity use of computing.
- Using a standard like UCIe lets companies mix and match chiplets, encouraging innovation and lowering costs.
What this could change in the real world
If adopted and standardized, this approach could:
- Enable faster, more efficient AI chips across phones, PCs, and servers.
- Cut data center energy use and costs by moving data more efficiently.
- Let manufacturers combine chiplets from different vendors more easily, speeding up new products.
- Scale into the future: as packaging improves (smaller bumps, higher frequencies), UCIe-based memory can keep increasing bandwidth without a power penalty; a rough scaling sketch follows this list.
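As a rough feel for that scaling claim, the sketch below ties bandwidth density to bump pitch and per-lane speed. The pitch, lane rate, and bump-row count are loose assumptions, not the paper's numbers, and real layouts are more complicated; the point is just that both knobs multiply.

```python
# Hand-wavy scaling model: more bumps per mm of edge -> more lanes -> more
# GB/s per mm. Pitches, rates, and row counts are illustrative assumptions.

def edge_bandwidth_density(bump_pitch_um, gbps_per_lane, bump_rows=4):
    """Approximate GB/s per mm of die edge for a simple bump grid."""
    lanes_per_mm = (1000 / bump_pitch_um) * bump_rows  # bumps along 1 mm of edge
    return lanes_per_mm * gbps_per_lane / 8            # Gb/s -> GB/s

print(edge_bandwidth_density(45, 32))  # ~356 GB/s/mm
print(edge_bandwidth_density(25, 64))  # ~1280 GB/s/mm: denser bumps, faster lanes
```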
In short, the paper argues for a shift from old, wide, slower memory buses to a modern, high-speed, UCIe-based “rail system” on the chip package. That shift could unlock big gains in speed, power efficiency, and cost for the next generation of AI and computing.