
NVIDIA Blackwell Architecture

Updated 22 February 2026
  • Blackwell is a family of high-performance GPUs featuring a dual-die design with integrated Tensor Memory (TMEM) and on-chip decompression to accelerate large-scale inference and ML workloads.
  • It introduces 5th-generation Tensor Cores that support a wide range of precisions, including the new FP4 and FP6 formats, and achieves up to 1.92× higher throughput in dense FP64 matrix multiplication.
  • The system reduces cache-miss memory latency by 58%, improves energy efficiency by 42% in transformer training, and streamlines dataflow for memory-bound algorithms.

NVIDIA’s Blackwell Architecture refers to a family of high-performance GPU platforms first deployed in the B200 and successor products. It combines architectural advances targeting exascale computing, machine learning, and large-scale inference. Core innovations include 5th-generation Tensor Cores with ultra-low-precision support, a novel Tensor Memory (TMEM) hierarchy, a dedicated on-chip Decompression Engine (DE), and a dual-die design interconnected by the NV-HBI interface. The architecture systematically targets higher compute throughput, substantially lower cache-miss memory latency, expanded mixed-precision capabilities, and increased energy efficiency relative to the prior Hopper/H200 generation, making it especially impactful for dense and sparse tensor-processing workloads (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025).

1. Architectural Composition and Physical Organization

Blackwell B200 integrates two identical chiplets, 208 billion transistors in total, connected via NVIDIA’s High-Bandwidth Interface (NV-HBI). This dual-die organization presents a contiguous memory system of eight HBM3e stacks totaling 192 GB, with roughly 8 TB/s of aggregate memory bandwidth. The compute complex comprises 148 Streaming Multiprocessors (SMs) distributed across eight Graphics Processing Clusters (GPCs); each SM features a 256 KB register file, four sub-cores (with unified INT32/FP32 units, 5th-gen Tensor Cores, and two FP64 units), and unified L1/shared memory (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025). Notable changes versus Hopper include a restructured L2, presented as a single monolithic 65 MB cache, and expanded GPC/SM scaling.
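As a quick sanity check on the figures above, the per-stack HBM capacity and the aggregate register-file size follow directly from plain arithmetic (the variable names below are my own, not vendor terminology):

```python
# Constants quoted in the text above.
sm_count = 148            # streaming multiprocessors
regfile_per_sm_kb = 256   # register file per SM, KB
hbm_stacks = 8            # HBM3e stacks
total_hbm_gb = 192        # total HBM capacity, GB

regfile_total_mb = sm_count * regfile_per_sm_kb / 1024
gb_per_stack = total_hbm_gb / hbm_stacks

print(f"Aggregate register file: {regfile_total_mb:.1f} MB")  # 37.0 MB
print(f"Capacity per HBM3e stack: {gb_per_stack:.0f} GB")     # 24 GB
```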

2. Compute Units: 5th-Generation Tensor Cores and Execution Model

A dominant feature is the 5th-generation Tensor Core subsystem. Unlike prior architectures, where matrix-multiply-accumulate (MMA) issue was synchronized across groups of threads (Hopper’s wgmma operated on 128-thread warp groups), Blackwell’s tcgen05.mma instruction enables independent per-thread issue of tensor operations within a warp. This reduces warp-scheduler stalls by 18–23% in memory-bound kernels and enables more flexible warp-level MMA programming.

Natively supported numerical formats include FP4 (e2m1), FP6 (e2m3/e3m2), FP8 (e4m3/e5m2), FP16, BF16, TF32, FP32, FP64, and INT8, covering both scientific and inference-centric workloads. Throughput scaling is achieved through additional parallel datapaths, sustaining near-peak utilization (>95%) across all precisions. FP4 and FP6 are newly supported, reaching up to 7702.5 and 5134.8 TFLOPS per B200 device, respectively. The architecture maintains a constant, tile-size-independent single-instruction latency (11–12 cycles up to 256×256×16 tiles), in contrast to the linear tile-size scaling on Hopper (Jarmusch et al., 1 Dec 2025).
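To make the new ultra-low-precision formats concrete, the sketch below decodes the 4-bit e2m1 layout (1 sign, 2 exponent, 1 mantissa bit). It assumes the OCP-style convention of exponent bias 1 with no Inf/NaN encodings, which the text above does not spell out:

```python
def decode_e2m1(bits: int) -> float:
    """Decode a 4-bit FP4 value in the e2m1 layout (1 sign, 2 exp, 1 mantissa).
    Assumes exponent bias 1 and no Inf/NaN encodings (OCP-style convention)."""
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exp = (bits >> 1) & 0b11
    man = bits & 0b1
    if exp == 0:                          # subnormal range: 0 or 0.5
        return sign * man * 0.5
    return sign * (1 + man * 0.5) * 2.0 ** (exp - 1)

# The eight non-negative FP4 magnitudes:
print(sorted(decode_e2m1(b) for b in range(8)))
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The coarse 8-value grid per sign is why FP4 is restricted to weight-stationary dense layers in the quantization guidance later in this article.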

Table: Tensor Core Throughput and Peak Utilization

Precision    B200 (TFLOPS)   H200 (TFLOPS)   Speedup
FP64         44.8            34.0            ×1.32
FP32         481.2           378.4           ×1.27
TF32         964.5           756.9           ×1.27
BF16/FP16    1929.2          1515.2          ×1.27
FP8          3851.4          3026.9          ×1.27
FP6          5134.8          N/A             New
FP4          7702.5          N/A             New
INT8         3927.1          3088.4          ×1.27

The aggregate mixed-precision throughput metric is defined as $T_{\mathrm{mix}} = (\text{ops per core per cycle}) \cdot f_{\mathrm{clock}}$, peaking at 7.7 PFLOPS (FP4, single-precision equivalent) (Jarmusch et al., 1 Dec 2025).
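As a quick arithmetic check, the speedup column and the 7.7 PFLOPS FP4 peak follow directly from the raw throughput numbers in the table above (the dictionaries below simply copy those values):

```python
# Peak tensor-core throughput in TFLOPS, copied from the table above.
b200 = {"FP64": 44.8, "FP32": 481.2, "BF16/FP16": 1929.2, "FP8": 3851.4, "FP4": 7702.5}
h200 = {"FP64": 34.0, "FP32": 481.2 / 1.27, "BF16/FP16": 1515.2, "FP8": 3026.9}

# FP4 peak expressed in PFLOPS.
print(round(b200["FP4"] / 1000, 1))  # 7.7

# Per-format speedup ratios, matching the ~1.27-1.32x column.
for fmt in ("FP64", "BF16/FP16", "FP8"):
    print(fmt, round(b200[fmt] / h200[fmt], 2))
```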

3. Memory System and Tensor Memory (TMEM) Hierarchy

Blackwell's memory hierarchy features per-SM 256 KB TMEM, organized as a 2-D array (512×128 lanes), directly feeding tensor cores and bypassing shared memory or L2 cache contention. This enables tile-based, high-bandwidth, low-latency dataflow, crucial for dense matrix and transformer workloads. TMEM delivers end-to-end access latency of 420 cycles for cache-miss streams, a 58% reduction over Hopper's 1000 cycles, enabling new classes of memory-bound algorithm optimizations (Jarmusch et al., 1 Dec 2025).

TMEM bandwidth per SM is 16 TB/s read and 8 TB/s write, with alignment at 64×64 tiles (FP8) for optimal throughput (4 KB/tile, matching 1024-bit interface). Dataflows using TMEM for accumulator staging reduce overall device power by 15% at scale and eliminate ~12 TB/s per SM of redundant off-chip traffic in chained operations such as sequential GEMMs or transformer attention. For tiles larger than 128×128, multi-phase transfer introduces up to 30% throughput penalty, determining optimal software tile sizes (Jarmusch et al., 1 Dec 2025).
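A minimal helper, with names of my own choosing, makes the tile-size guidance concrete: a 64×64 FP8 tile is exactly 4 KB (32 beats of a 1024-bit interface), and tiles beyond 128×128 fall into the multi-phase transfer regime described above:

```python
def tile_bytes(rows: int, cols: int, bytes_per_elem: int) -> int:
    """Size of one TMEM tile in bytes."""
    return rows * cols * bytes_per_elem

def needs_multiphase(rows: int, cols: int, limit: int = 128) -> bool:
    """True if a tile exceeds the single-phase transfer limit noted above."""
    return rows > limit or cols > limit

assert tile_bytes(64, 64, 1) == 4096                  # 64x64 FP8 tile = 4 KB
assert tile_bytes(64, 64, 1) // (1024 // 8) == 32     # beats on a 1024-bit bus
assert not needs_multiphase(128, 128)                 # largest single-phase tile
assert needs_multiphase(256, 128)                     # incurs up to 30% penalty
```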

4. On-Chip Decompression Engine (DE)

A hardware-integrated Decompression Engine supports seven formats (LZ4, Snappy, Zstandard, GZIP, Cascaded, Bitcomp, ANS) and decompresses data streams and weights directly in GPU memory. Peak format-specific DE performance (tested on 100 MB datasets) includes 173 GB/s (LZ4 input), 462 GB/s (Bitcomp output), and 539 GB/s (ANS output), with latency as low as 0.194 ms (ANS). Across data patterns, sustained output bandwidth holds near 200 GB/s, while input bandwidth varies with the compression ratio (Jarmusch et al., 1 Dec 2025).

Batching strategies and pipeline depth are critical: DE achieves 53.8 GB/s with 32 KB chunks (16-deep pipeline) and 151.6 GB/s for 256 KB chunks (4-deep). Efficiency above 85% is maintained only within these operational windows. These properties influence framework and I/O design decisions for streaming versus bulk loading scenarios.
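The two quoted operating points can be captured in a small lookup table; the structure and names here are my own illustration of the numbers above, not a driver or library API:

```python
# (chunk size in KB, pipeline depth) -> measured DE throughput in GB/s,
# values copied from the text above.
DE_OPERATING_POINTS = {
    (32, 16): 53.8,    # small chunks, deep pipeline: streaming-friendly
    (256, 4): 151.6,   # large chunks, shallow pipeline: bulk-friendly
}

def de_throughput_gbs(chunk_kb: int, depth: int):
    """Return the quoted throughput for a known operating point, else None."""
    return DE_OPERATING_POINTS.get((chunk_kb, depth))

assert de_throughput_gbs(256, 4) == 151.6
assert de_throughput_gbs(64, 2) is None   # outside the characterized windows
```

Outside these windows the text reports efficiency dropping below 85%, so a framework would interpolate cautiously rather than extrapolate.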

5. Performance and Energy Efficiency Metrics

Systematic benchmarking using PTX-level and SASS-mapped microbenchmarks demonstrates across-the-board performance improvements, along with several trade-offs, compared to Hopper-based predecessors (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025). Mixed-precision throughput improves by 1.27×–1.32× across tensor core formats, end-to-end transformer training accelerates by 1.56×, and LLM inference by up to 1.59×. Dense double-precision matrix multiplication (DGEMM) achieves 1.92× the throughput (36.3 vs. 18.9 TFLOPS). Energy efficiency, measured as $\eta = \mathrm{GFLOPS}/\mathrm{Watt}$, improves by 42% for GPT training workloads (22.2 vs. 15.6 tok/s per watt). Notably, the reduction in cache-miss memory latency (1000 → 420 cycles) drives a fundamental re-evaluation of optimal workload design.
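The headline percentages can be verified from the raw numbers quoted above:

```python
# Derived directly from the figures in the text; no external data.
dgemm_speedup = 36.3 / 18.9                 # B200 vs H200 DGEMM TFLOPS
latency_cut_pct = (1 - 420 / 1000) * 100    # cache-miss latency, cycles
efficiency_gain_pct = (22.2 / 15.6 - 1) * 100  # tok/s per watt, GPT training

assert round(dgemm_speedup, 2) == 1.92
assert round(latency_cut_pct) == 58
assert round(efficiency_gain_pct) == 42
```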

Case-study summary:

Workload                     B200          H200          Speedup
LLM Inference (FP16)         45.2k tok/s   28.5k tok/s   ×1.59
DGEMM (FP64, 32K×32K)        36.3 TFLOPS   18.9 TFLOPS   ×1.92
STREAM Triad                 7.48 TB/s     4.38 TB/s     ×1.71
SpMV w/ DE                   ~5 GFLOPS     1.6 GFLOPS    ×3.16
GPT-1.3B Training (FP16)     14.4k tok/s   9.2k tok/s    ×1.56

Results consistently demonstrate that the greatest improvements are observed in dense/sparse GEMM, transformer training/inference, and workloads exploiting deep tensor core pipelines and TMEM (Jarmusch et al., 1 Dec 2025).

6. Microbenchmark Methodologies

Performance metrics derive from component-isolated microbenchmarks written in PTX, with SASS mapping validation (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025). TMEM tests utilize pointer-chase latency, tile-based bandwidth, and tcgen05.ld/st/cp instruction sequences in comparison to cp.async+ld.global on H200. DE is characterized by end-to-end decompression benchmarks across all natively supported formats, varying chunk size (32–256 KB) and concurrency (up to 1024). Tensor Core characterization employs dependency-chain designs to measure single-instruction latency and throughput across tile shapes and precisions, with power monitored via NVML.
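The pointer-chase technique described above generalizes beyond GPUs. As a hedged CPU analogue (standard library only; array size and step count are parameters I chose), dependent loads serialize, so the mean step time approximates access latency:

```python
import random
import time

def cyclic_permutation(n: int) -> list[int]:
    """Sattolo's algorithm: a random permutation forming a single cycle,
    so the chase visits every slot before repeating (defeats prefetching)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def chase_ns(n: int = 1 << 18, steps: int = 1 << 18) -> float:
    """Each access depends on the previous one, so the chain serializes;
    the mean time per step approximates dependent-access latency."""
    perm = cyclic_permutation(n)
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = perm[idx]
    return (time.perf_counter() - t0) / steps * 1e9

print(f"~{chase_ns():.0f} ns per dependent access")
```

The GPU version in the cited work uses the same dependency-chain idea, but issues tcgen05.ld/st/cp or global-load sequences and reads cycle counters instead of wall-clock time.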

Integrated workload benchmarks include representative kernels: dense/sparse GEMM (ResNet-50, mixture-of-experts models), DGEMM, STREAM, SpMV, and mixed-precision transformer training. These systematic methodologies provide a controlled basis for inter-architectural comparisons and throughput/power regressions (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025).

7. Implications for Software and Algorithm Design

Blackwell’s architecture necessitates new strategies for both algorithm and framework design. Optimal performance is achieved by favoring 64×64 (FP8) or 32×32 (FP16/BF16) tiles, maximizing TMEM bandwidth while avoiding the throughput penalties of multi-phase transfers. Intermediate tensor results should be retained in TMEM across sequential tensor operations to eliminate extraneous memory movement (notably for attention and feedforward phases in transformers) (Jarmusch et al., 1 Dec 2025).
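A hypothetical helper encoding the tiling guidance above (the table and function names are mine, not an NVIDIA API):

```python
# Preferred TMEM tile shapes per precision, per the guidance above.
PREFERRED_TILES = {
    "fp8": (64, 64),    # 4 KB tiles align with the TMEM interface
    "fp16": (32, 32),
    "bf16": (32, 32),
}

def preferred_tile(precision: str) -> tuple[int, int]:
    """Look up the recommended tile shape for a given precision."""
    try:
        return PREFERRED_TILES[precision.lower()]
    except KeyError:
        raise ValueError(f"no tiling guidance recorded for {precision!r}")

assert preferred_tile("FP8") == (64, 64)
assert preferred_tile("bf16") == (32, 32)
```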

Per-thread scheduling with tcgen05.mma enables software double-buffering—issuing tcgen05.cp tile loads concurrently with ongoing computation. Quantization strategies are precision- and layer-specific: FP4 is suited for weight-stationary dense layers (yielding ≈70% memory saving, ~2.5× speedup, and <10% degradation in perplexity on tested LLMs); FP8 is recommended elsewhere. DE chunk sizes should be tuned according to workload latency and throughput targets: smaller (32–64 KB, 8–16 ops) for real-time streaming, larger (128–256 KB, 4–8 ops) for offline bulk processing. DE-aware data formats, such as Bitcomp for numerical tables in HPC, further enhance performance (Jarmusch et al., 1 Dec 2025).
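The precision and chunk-size policies above can be sketched as simple decision functions; the mode names, thresholds, and return conventions are illustrative assumptions, not vendor guidance:

```python
def choose_precision(layer_kind: str) -> str:
    """FP4 for weight-stationary dense layers, FP8 elsewhere, per the text."""
    return "fp4" if layer_kind == "dense_weight_stationary" else "fp8"

def choose_de_chunk(mode: str) -> tuple[range, range]:
    """Return (chunk-size KB range, in-flight ops range) for a workload mode."""
    if mode == "streaming":   # latency-sensitive, real-time
        return range(32, 65), range(8, 17)
    if mode == "bulk":        # offline, throughput-oriented
        return range(128, 257), range(4, 9)
    raise ValueError(f"unknown mode: {mode!r}")

assert choose_precision("dense_weight_stationary") == "fp4"
assert choose_precision("attention") == "fp8"
assert 64 in choose_de_chunk("streaming")[0]
```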

This suggests that future kernels and frameworks targeting Blackwell platforms must be tuned to the architecture, with explicit awareness of TMEM, the DE, and precision tradeoffs, to fully exploit the hardware.


Blackwell Architecture represents a significant shift in GPU design priorities, prioritizing ultra-low-precision tensor acceleration, dramatically lower on-chip and off-chip memory penalties, and architecturally flexible execution. These advances are validated through systematic microbenchmarking and establish new targets for algorithmic optimization, with direct impact on scientific computing, transformer-based machine learning, and real-time inferencing applications (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025).
