Mixed-Precision BFP16–BF16 GEMM

Updated 27 January 2026

Mixed-Precision BFP16–BF16 GEMM is a fusion of block and brain floating-point formats that enhances compute efficiency and reduces memory bandwidth in matrix multiplication.
It employs advanced tiling strategies, including asymmetric tile buffering, to maximize arithmetic intensity and efficient hardware utilization on AI accelerators and CPUs.
Reported results include up to a 4.5× speedup along with energy and bandwidth savings, which makes the approach attractive for high-throughput deep learning and scientific workloads.

Mixed-precision BFP16–BF16 GEMM refers to general matrix multiplication (GEMM) using a hybrid of block floating-point 16 (BFP16) and brain floating-point 16 (BF16) operand formats. This fusion improves arithmetic intensity, memory traffic, and hardware utilization, particularly on vector and tensor-core architectures designed for deep learning and scientific workloads. GEMM kernels in this paradigm often employ advanced tiling, including asymmetric tile buffering (ATB), and rely on specialized microarchitecture features for dot-product operations and dataflow. Across high-throughput CPUs, GPGPUs, and AI accelerators, mixed BFP16–BF16 GEMM has been reported to deliver high compute efficiency and throughput, with gains documented in several published deployments.

1. Numeric Formats and Motivation

BF16 (Brain Float 16) is a 16-bit IEEE-style floating-point format consisting of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Each element carries its own exponent, so BF16 matches the dynamic range of FP32 at lower precision. BFP16 (Block Floating Point 16) is a block format in which groups of $N$ elements (typically 8 or 16) share a single exponent, with each lane carrying 8–11 bits of sign and mantissa. Sharing the exponent compresses storage: an 8-wide block needs only $1.125$ bytes per element (eight 8-bit lanes plus one 8-bit shared exponent, 9 bytes in total), compared with 2 bytes per element for BF16 (Wang et al., 20 Nov 2025).
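The bit layout and storage figures above can be checked with a minimal sketch. It assumes BF16 is obtained from FP32 by round-to-nearest-even on the upper 16 bits (a common conversion, not one the cited papers specify), and the helper names are hypothetical.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round x to BF16 and return the 16-bit pattern.

    Assumption: round-to-nearest-even applied to the upper half of the
    FP32 encoding. NaN handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def from_bf16_bits(h: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]

def bf16_fields(h: int):
    """Split a BF16 pattern into (sign, 8-bit exponent, 7-bit mantissa)."""
    return (h >> 15) & 1, (h >> 7) & 0xFF, h & 0x7F

def bfp_bytes_per_element(block: int, lane_bits: int = 8, exponent_bits: int = 8) -> float:
    """Average storage per element when `block` lanes share one exponent."""
    return (block * lane_bits + exponent_bits) / (8 * block)

print(hex(to_bf16_bits(1.0)))                    # 0x3f80
print(from_bf16_bits(to_bf16_bits(3.14159)))     # 3.140625
print(bfp_bytes_per_element(8))                  # 1.125
```

With 8-bit lanes and an 8-bit shared exponent, the 8-wide case reproduces the 1.125 bytes per element quoted above; a 16-wide block under the same assumptions would average 1.0625 bytes.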

Mixed-precision BFP16–BF16 strategies apply BFP16 to model weights, reducing buffer pressure and memory bandwidth, while using BF16 for activations and outputs to preserve elementwise dynamic range and accuracy. Accumulation is commonly performed in BF16 or FP32, with FP32 giving better final-sum fidelity (Wang et al., 20 Nov 2025, Rout et al., 19 Nov 2025).

2. Matrix Multiplication Formulation

In mixed-precision GEMM, matrices are assigned different formats:

$A \in \mathrm{BF16}^{M \times K}$ (activations)
$B \in \mathrm{BFP16}^{K \times N}$ (weights)
$C \in \mathrm{BF16}^{M \times N}$ (outputs)
Accumulation in BF16 or FP32

The elementwise computation is:

$C_{mn} \gets \sum_k \mathrm{BF16}(A_{mk}) \cdot \mathrm{BFP16}(B_{kn})$

with total operations $2M N K$.
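The formulation can be illustrated with a small reference loop. This is a sketch under stated assumptions rather than any cited kernel: weights arrive already decoded into blocks of 8 along $K$, each a `(shared_exponent, mantissas)` pair with value `mantissa * 2**shared_exponent`; accumulation is emulated in FP32; and all function names are hypothetical.

```python
import struct

BLOCK = 8  # assumed number of elements sharing one exponent

def round_f32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_bf16(x: float) -> float:
    """Round to BF16 (nearest-even on the upper 16 bits of FP32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def gemm_bf16_bfp16(A, B_blocks, M, N, K):
    """C[m][n] = sum_k A[m][k] * B[k][n].

    A is an M x K list of floats (rounded to BF16 here).
    B_blocks[n][kb] is the kb-th block of column n of B.
    K must be a multiple of BLOCK.
    """
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        a_row = [round_bf16(v) for v in A[m]]
        for n in range(N):
            acc = 0.0  # FP32 accumulator, emulated by rounding after each add
            for kb in range(K // BLOCK):
                shared_exp, mants = B_blocks[n][kb]
                scale = 2.0 ** shared_exp
                for j in range(BLOCK):
                    acc = round_f32(acc + a_row[kb * BLOCK + j] * (mants[j] * scale))
            C[m][n] = round_bf16(acc)  # outputs are stored as BF16
    return C

# One row of A against one column of B (weights 1.0, -0.5, 0.25, then zeros).
A = [[1.0, 2.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]]
B_blocks = [[(-6, [64, -32, 16, 0, 0, 0, 0, 0])]]
print(gemm_bf16_bfp16(A, B_blocks, M=1, N=1, K=8))  # [[0.125]]
```

The inner loop performs one multiply and one add per $(m, n, k)$ triple, which is where the $2MNK$ operation count comes from.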

The arithmetic intensity is determined as:

$I = \frac{2M N K}{D}$

where $D$ is total bytes moved. For tile-wise blocking,

$I = \frac{2}{a/T_N + b/T_M + c/K}$

with $a$, $b$, $c$ the bytes-per-element for $A$, $B$, and $C$, respectively, and $T_M$, $T_N$ tile sizes (Wang et al., 20 Nov 2025).
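The tile-wise expression can be related to $I = 2MNK/D$ with a short numerical check. The sketch below assumes one particular traffic model that reproduces the closed form: $A$ is re-read once per column tile of width $T_N$, $B$ once per row tile of height $T_M$, and $C$ is written once. The byte sizes default to 2 for BF16 and 1.125 for 8-wide BFP16; the function names are hypothetical.

```python
import math

def bytes_moved(M, N, K, T_M, T_N, a=2.0, b=1.125, c=2.0):
    """Total traffic D under the assumed tile-reuse model."""
    a_traffic = a * M * K * (N / T_N)  # A re-read for every column tile
    b_traffic = b * K * N * (M / T_M)  # B re-read for every row tile
    c_traffic = c * M * N              # C written once
    return a_traffic + b_traffic + c_traffic

def intensity_direct(M, N, K, T_M, T_N, a=2.0, b=1.125, c=2.0):
    """I = 2MNK / D, in operations per byte."""
    return 2.0 * M * N * K / bytes_moved(M, N, K, T_M, T_N, a, b, c)

def intensity_tiled(T_M, T_N, K, a=2.0, b=1.125, c=2.0):
    """Closed form I = 2 / (a/T_N + b/T_M + c/K)."""
    return 2.0 / (a / T_N + b / T_M + c / K)

direct = intensity_direct(4096, 4096, 4096, T_M=128, T_N=128)
tiled = intensity_tiled(T_M=128, T_N=128, K=4096)
print(round(direct, 2), round(tiled, 2))  # 80.31 80.31
```

Dividing numerator and denominator of $2MNK/D$ by $MNK$ gives the closed form directly, so the two functions agree for any problem size; enlarging $T_M$ shrinks only the $b/T_M$ term, and enlarging $T_N$ only the $a/T_N$ term.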

Block exponents and tile packing are crucial: BFP16 blocks must be preprocessed to extract shared exponents and permute them for hardware-efficient layout (Zhang et al., 21 Aug 2025).
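A minimal sketch of the preprocessing step follows. It assumes 8-bit signed mantissas, a power-of-two shared scale derived from the largest magnitude in the block, and a hypothetical packed layout that stores all exponent bytes ahead of the mantissa bytes; real hardware layouts and permutations differ and are not described in enough detail here to reproduce.

```python
import math

def bfp_encode_block(values, mant_bits=8):
    """Return (shared_exp, mantissas) with value ~= mantissa * 2**shared_exp."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # frexp gives amax = f * 2**e with 0.5 <= f < 1, so every scaled
    # magnitude stays within 2**(mant_bits - 1).
    shared_exp = math.frexp(amax)[1] - (mant_bits - 1)
    limit = 2 ** (mant_bits - 1) - 1
    mants = [max(-limit, min(limit, round(v * 2.0 ** -shared_exp))) for v in values]
    return shared_exp, mants

def bfp_decode_block(shared_exp, mants):
    """Inverse of bfp_encode_block, up to quantization error."""
    return [m * 2.0 ** shared_exp for m in mants]

def pack_blocks(blocks, exp_bias=127):
    """Hypothetical layout: biased exponent bytes first, then mantissa bytes."""
    exps = bytes((e + exp_bias) & 0xFF for e, _ in blocks)
    mants = bytes(m & 0xFF for _, ms in blocks for m in ms)
    return exps + mants

vals = [1.0, -0.5, 0.25, 0.0, 0.75, -1.0, 0.125, 0.5]
enc = bfp_encode_block(vals)
print(enc)                      # (-6, [64, -32, 16, 0, 48, -64, 8, 32])
print(len(pack_blocks([enc])))  # 9 bytes for 8 elements
```

Separating exponents from mantissas is only one possible arrangement; the point of the sketch is that exponent extraction happens once, offline, so the runtime kernel reads ready-to-use blocks.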

3. Asymmetric Tile Buffering (ATB) and Performance Modeling

Traditional symmetric buffering requires the tile dimension of $A$ along $M$ to equal that of $C$, which can bottleneck hardware scratchpad usage. Asymmetric Tile Buffering decouples these parameters:

[…]: rows of $A$ buffered per tile (input)
[…]: rows of $C$ buffered per tile (output), […]
[…] as usual
[…]

ATB enables larger output tiles at fixed input buffer cost, raising arithmetic intensity:

[…]

The buffer constraint is:

[…]

For models using ATB on AMD XDNA2™ AIE, maximal arithmetic intensity and kernel efficiency are obtained by trading off the microkernel chain length against the asymmetry factor. The per-core latency model combines compute and microkernel launch overhead:

[…]

with […] cycles per launch (Wang et al., 20 Nov 2025).

4. Hardware Implementations

AIE Accelerator (AMD XDNA2™)

32 compute cores (dual-issue VLIW, 64 KB L1, dual 8×8×8 BFP16 MAC, 1.84 TFLOPS/core)
8 memory cores (512 KB L2 each, 65 GB/s off-chip DDR BW)
Peak throughput: 58.8 TFLOPS BFP16
Buffer constraint: L1 (63 KB per core) limits tile parameters. ATB permits tile sizes that are infeasible under symmetric buffering, such as 128×64×128 (56 KB under ATB, 91 KB if symmetric).
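The relationship between these figures and arithmetic intensity can be sketched with a simple roofline bound. The sketch assumes that the 65 GB/s off-chip figure is the aggregate, binding bandwidth and that peak compute is the 58.8 TFLOPS listed above; both are modeling assumptions, and the names are hypothetical.

```python
import math

PEAK_FLOPS = 58.8e12   # peak BFP16 throughput from the list above
BANDWIDTH = 65e9       # bytes/s, assumed to be the aggregate DDR bandwidth

def attainable_flops(intensity: float) -> float:
    """Roofline bound: min(compute peak, intensity * bandwidth)."""
    return min(PEAK_FLOPS, intensity * BANDWIDTH)

def required_intensity(target_flops: float) -> float:
    """Smallest op/B at which target_flops is not bandwidth-limited."""
    return target_flops / BANDWIDTH

ridge = required_intensity(PEAK_FLOPS)
print(round(ridge, 1))                     # 904.6
print(attainable_flops(100.0) / 1e12)      # 6.5
```

Under these assumptions the ridge point sits near 905 op/B, so a kernel at 100 op/B would be capped at about 6.5 TFLOPS regardless of core efficiency, which is why raising tile-level intensity matters on this device.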

Record-setting results:

Baseline symmetric: […], 4.8 TFLOPS
ATB: […], […], 24.3 TFLOPS ([…] TFLOPS/core, AI […] op/B), 4.54× speedup

Design guidelines prioritize maximizing […] and arithmetic intensity for memory-bound regimes (small […], large […]), and maximizing chain length and core efficiency for compute-bound scenarios (large […], small […]) (Wang et al., 20 Nov 2025).

GPGPU Dot Product Pipeline (Vortex, Alveo U55C)

4-stage “FEDP” pipeline: multiply, exponent extract, alignment, MOD-4 CSA accumulation, normalization/rounding
Supports FP16/BF16/BFP16/FP8/INT8 inputs, FP32/INT32 accumulation
Ideal throughput: 9.812 GFLOPS @ 306.6 MHz (4-cycle pipeline, 16 threads/warp)
Pure LUT design (no DSP usage), parameterizable at RTL for custom block formats
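As a rough consistency check on the ideal-throughput figure, assume each of the 16 threads in a warp retires one fused multiply-add (2 FLOP) per cycle once the pipeline is full. This accounting is an assumption; the list above does not spell it out.

$16 \times 2\ \mathrm{FLOP} \times 306.6 \times 10^{6}\ \mathrm{s^{-1}} \approx 9.81 \times 10^{9}\ \mathrm{FLOP/s}$

This agrees with the quoted ideal throughput to within rounding.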

Accuracy: the BF16 mantissa yields ≤0.5 ULP error, and the BFP16 block exponent yields ≤1% inter-block error. FP32 accumulation averts error growth over large $K$ (Rout et al., 19 Nov 2025).
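The effect of accumulator width can be seen in a small experiment. The sketch below emulates BF16 and FP32 accumulators by rounding after every addition (an emulation, not the pipeline's actual datapath) and sums 4096 copies of the BF16 value nearest 0.01, which is exactly $41/4096$.

```python
import struct

def round_f32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_bf16(x: float) -> float:
    """Round to BF16 (nearest-even on the upper 16 bits of FP32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def accumulate(values, rnd):
    """Running sum with the accumulator rounded by `rnd` after each add."""
    acc = 0.0
    for v in values:
        acc = rnd(acc + v)
    return acc

K = 4096
term = round_bf16(0.01)        # 0.010009765625 == 41 / 4096
terms = [term] * K
exact = term * K               # 41.0

fp32_sum = accumulate(terms, round_f32)
bf16_sum = accumulate(terms, round_bf16)
print(exact, fp32_sum, bf16_sum)
```

The FP32 accumulator reproduces the exact sum here because every partial sum fits in its 24-bit significand. The BF16 accumulator stops growing once the addend falls below half a unit in the last place of the running sum, so it ends far short of 41.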

5. Software Pipelines and Microkernel Design

TurboMind-style engines preprocess weights into “swizzled” blocks for either BFP16 or BF16 using hardware-aware permutations. At runtime:

Preload and dequantize weights (BFP16: mantissa extraction + shared block exponent; BF16: direct cast).
Micro-tile loop across registers, shared memory, and tensor cores: pipelined dequantization, prefetching, and MMA accumulation.
Overlapping loads, dequantization, and compute stages: eliminates hazards, sustains hardware peak throughput.
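The ordering of the runtime stages above can be sketched as a software-pipelined loop with two buffers. Python runs the stages one after another, so this shows only the structure (prologue, steady state, buffer swap) and not real overlap; the block format `(shared_exp, mantissas)` and the function names are assumptions for illustration.

```python
def dequantize(block):
    """Stage: expand one packed block into floats (mantissa * 2**shared_exp)."""
    shared_exp, mants = block
    scale = 2.0 ** shared_exp
    return [m * scale for m in mants]

def pipelined_dot(activations, packed_blocks, block=8):
    """Dot product over K tiles, preparing tile t+1 while consuming tile t."""
    n_tiles = len(packed_blocks)
    acc = 0.0
    if n_tiles == 0:
        return acc
    current = dequantize(packed_blocks[0])  # prologue: fill the pipeline
    for t in range(n_tiles):
        # Prefetch stage: on hardware this load and dequantize would be in
        # flight while the multiply-accumulate below executes.
        nxt = dequantize(packed_blocks[t + 1]) if t + 1 < n_tiles else None
        # Compute stage: multiply-accumulate on the tile that is already ready.
        a_tile = activations[t * block:(t + 1) * block]
        acc += sum(x * w for x, w in zip(a_tile, current))
        current = nxt  # swap buffers for the next iteration
    return acc

acts = [1.0] * 16
packed = [(-6, [64] * 8), (-5, [32] * 8)]  # every weight decodes to 1.0
print(pipelined_dot(acts, packed))  # 16.0
```

Because the compute stage only ever touches a buffer filled in the previous iteration, the load and dequantize work for the next tile never sits on the critical path, which is the property the overlap relies on.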

Performance on NVIDIA A100:

FP16×FP16: 9.1 TFLOP/s
BF16×FP16: 9.7 TFLOP/s (+6%)
BFP16×FP16: 9.8 TFLOP/s (+6.8%)
Mixed BFP16–BF16: 9.75 TFLOP/s

Latency improves by up to 25% for small batches owing to better pipeline overlap. At large batch sizes, mixed precision matches conventional FP16 peak throughput (Zhang et al., 21 Aug 2025).

6. ISA and CPU Support for Mixed Precision

Modern ISAs (x86-64 AVX-512 with the VNNI and BF16 extensions, ARM SVE2/SME2, RISC-V Vector) support native or emulated BF16/BFP16 dot-product instructions:

Architecture	FP32 (GFLOP/s)	BFP16–BF16 (GFLOP/s)	Speed-up
ARM Cortex-A72 (4 cores)	32	75	2.3×
ARM Cortex-A78AE (12 cores)	120	500	4.2×
RISC-V SpacemiT K1 (8 cores)	90	315	3.5×

Memory bandwidth pressure is reduced by 1.5–1.7×, and energy/op by 2–3×, measured on-board. Core microkernels use high-degree loop unrolling, aggressive prefetching, and packing aligned buffers for cache and register file efficiency (Martínez et al., 13 Jun 2025).
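The packing step mentioned above can be illustrated with a generic panel layout. This is a sketch of the common pack-then-microkernel pattern, not the cited implementation: $B$ is copied into panels of `NR` columns so that the values needed for each $k$ are contiguous, and the panel width and function names are hypothetical.

```python
def pack_b_panels(B, K, N, NR=4):
    """Copy row-major B (K x N) into column panels of width NR.

    Each panel is a flat list of K * NR values ordered k-major, so the NR
    values for a given k are adjacent. The last panel is zero-padded.
    """
    panels = []
    for n0 in range(0, N, NR):
        panel = []
        for k in range(K):
            for j in range(NR):
                n = n0 + j
                panel.append(B[k][n] if n < N else 0.0)
        panels.append(panel)
    return panels

def microkernel(a_row, panel, K, NR):
    """Compute NR outputs for one row of A against one packed panel."""
    acc = [0.0] * NR
    for k in range(K):
        a = a_row[k]
        base = k * NR
        for j in range(NR):  # fully unrolled in a real kernel
            acc[j] += a * panel[base + j]
    return acc

B = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
panels = pack_b_panels(B, K=2, N=3, NR=2)
print(panels)                                        # [[1.0, 2.0, 4.0, 5.0], [3.0, 0.0, 6.0, 0.0]]
print(microkernel([1.0, 1.0], panels[0], K=2, NR=2))  # [5.0, 7.0]
```

The microkernel walks each panel with unit stride, which is what makes prefetching and vector loads effective; the padding keeps the inner loop free of bounds checks.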

7. Synthesis and Deployment Considerations

Mixed-precision BFP16–BF16 GEMM provides demonstrable speedups over FP32, BF16, and FP16, with higher arithmetic intensity and reduced resource demands. ATB tiling enables tiles previously infeasible under symmetric models, allowing hardware scratchpad resources to be used more efficiently and maximizing throughput. On modern accelerators and CPUs, careful design (format packing, pipeline depth and overlap, tunable microtile and block sizes, optimized swizzle, and register management) can yield a 2–4× speed-up, a 1.6× bandwidth reduction, and significant energy savings.

Deployment trade-offs include:

Selecting […], […], and microtile parameters to balance memory-bound and compute-bound regimes
Ensuring hardware DOT product support for maximal gains
Adapting to cache/register file size constraints and L1/L2/L3 hierarchy
Empirical measurement of kernel launch overheads to tune parallel execution
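The first trade-off in this list can be sketched as a small search that maximizes the tile-wise intensity $I = 2/(a/T_N + b/T_M + c/K)$ under a scratchpad budget. The footprint model here (one $A$ tile, one $B$ tile, and one $C$ tile, single-buffered) is an assumption for illustration and is not the buffer constraint of the cited work; it also ignores ATB's asymmetry. The function names and candidate sizes are hypothetical.

```python
def intensity(T_M, T_N, K, a=2.0, b=1.125, c=2.0):
    """Tile-wise arithmetic intensity in operations per byte."""
    return 2.0 / (a / T_N + b / T_M + c / K)

def footprint_bytes(T_M, T_N, T_K, a=2.0, b=1.125, c=2.0):
    """Assumed single-buffered footprint of one A, one B, and one C tile."""
    return a * T_M * T_K + b * T_K * T_N + c * T_M * T_N

def best_tile(K, budget, T_K=64, candidates=(16, 32, 64, 128, 256)):
    """Return (intensity, T_M, T_N) of the best tile that fits the budget."""
    best = None
    for T_M in candidates:
        for T_N in candidates:
            if footprint_bytes(T_M, T_N, T_K) > budget:
                continue  # tile does not fit in the scratchpad
            score = intensity(T_M, T_N, K)
            if best is None or score > best[0]:
                best = (score, T_M, T_N)
    return best

score, T_M, T_N = best_tile(K=4096, budget=63 * 1024)
print(T_M, T_N, round(score, 2))  # 128 128 80.31
```

In this model the $C$ tile dominates the footprint at large sizes, so the search settles on a square tile; a model that buffers fewer input rows than output rows, as ATB does, would admit larger output tiles within the same budget.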

These techniques are presented as applicable to narrow-scratchpad, high-throughput accelerator designs more broadly, and as a systematic methodology for future GEMM and deep learning kernels (Wang et al., 20 Nov 2025, Zhang et al., 21 Aug 2025, Rout et al., 19 Nov 2025, Martínez et al., 13 Jun 2025).