Versal AI Engine Architecture
- Versal AI Engine is a high-performance, spatially programmable processor array integrated within AMD’s Versal ACAP, optimized for data-parallel tasks like deep learning and scientific simulation.
- It leverages fine-grained VLIW/SIMD vectorization, explicit multi-level memory management, and low-latency on-chip networks to maximize throughput and efficiency.
- The programming model uses high-level C/C++ APIs with custom graph compilers, enabling efficient micro-kernel tiling and heterogeneous acceleration for both regular and irregular algorithms.
The Versal AI Engine (AIE) is a high-performance, spatially programmable processor array integrated within AMD’s Versal Adaptive Compute Acceleration Platform (ACAP) system-on-chip devices. AIE is engineered to deliver massive throughput for data-parallel, low-latency compute workloads such as matrix multiplication, deep learning inference, signal processing, and scientific simulation, by leveraging fine-grained VLIW/SIMD vectorization, multi-level explicit memory, and low-latency network-on-chip (NoC) communication. Unlike prior FPGA-centric architectures, the Versal AIE combines up to hundreds of VLIW vector compute tiles, each with tightly coupled local SRAM, into a 2D array connected via deterministic high-bandwidth mesh and cascade buses, and is programmable via high-level C/C++ APIs and custom graph compilers. The AIE subsystem natively supports heterogeneous workflows, forms the computational backbone for state-of-the-art accelerators, and has established a new paradigm for mapping both regular and irregular algorithms to spatial hardware (Zhuang et al., 2023, Danopoulos et al., 17 Dec 2025, Mhatre et al., 13 Apr 2025, Li et al., 13 Jun 2025, Dai et al., 2024).
1. Architectural Overview
Versal AIE arrays are organized as a 2D mesh of homogeneous processor “tiles” (typically 8–10 rows × 38–50 columns, e.g., 400 tiles on VC1902), each implementing a 7-way VLIW datapath. Every tile comprises:
- A SIMD vector MAC (e.g., 8×FP32 or 128×INT8 MACs/cycle, AIE1; 256×INT8 MACs/cycle, AIE2).
- Local scratchpad: 32–64 KB SRAM (AIE1–AIE2), 16–128 KB instruction memory.
- Independent load/store pipelines, typically two 256-bit loads and one 256-bit store per cycle in AIE2.
- No hardware-managed cache—memory hierarchies and data movement are orchestrated in software.
- On-tile DMA engines and multiple neighbor links for single-cycle communication to north/south/east/west, 384–512 bit “cascade buses” for partial-sum accumulation, and access to mesh NoC routers for global, circuit-switched routing.
- Optionally, memory tiles interleaved in the array (AIE2+) provide larger buffers and programmable tiling DMAs, enabling fully on-chip inference execution (Zhuang et al., 2023, Mhatre et al., 13 Apr 2025, Danopoulos et al., 17 Dec 2025).
Peripheral programmable logic (PL) interfaces (PLIO) provide high-bandwidth streaming ingress/egress (up to 1.3 TB/s for AIE–PL and 0.9 TB/s for PL–AIE), facilitating integration with ARM CPUs, high-speed memory controllers, and custom logic (Zhuang et al., 2023, Li et al., 13 Jun 2025).
AIE arrays achieve theoretical performance levels (e.g., 6.4–8.0 TFLOPs for FP32, 128–165 TOPS for INT8, >80 TBFLOPs for BF16) via aggressive tiling, register-level blocking, and spatial parallelism, provided local memory and PLIO bandwidth constraints are managed (Taka et al., 2023, Mhatre et al., 13 Apr 2025).
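As a sanity check on the peak figures above, the FP32 roof follows from the per-tile MAC rate, a minimal back-of-envelope calculation (assuming a 400-tile AIE1 array with 8 FP32 MACs per tile per cycle and 2 FLOPs per MAC):

```python
# Back-of-envelope peak FP32 throughput for an AIE1-class array.
# Assumptions (illustrative): 400 tiles, 8 FP32 MACs/tile/cycle,
# 2 FLOPs per MAC, clock between 1.0 and 1.25 GHz.
def peak_tflops(tiles, macs_per_cycle, ghz, flops_per_mac=2):
    return tiles * macs_per_cycle * flops_per_mac * ghz / 1e3

low = peak_tflops(400, 8, 1.00)    # 6.4 TFLOPs
high = peak_tflops(400, 8, 1.25)   # 8.0 TFLOPs
print(low, high)                   # 6.4 8.0
```

The same arithmetic, with the per-tile MAC rate swapped for the INT8 figures, reproduces the TOPS ranges quoted above.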
2. Programming and Computational Model
The Versal AIE is programmed via C++-based APIs, typically within the Vitis AI Engine Adaptive Data Flow (ADF) framework, which models the hardware as an explicit dataflow graph. Kernels are defined as vectorized, pipelined compute units, mapped one per tile. Efficient code generation requires:
- Explicit micro-kernel tiling (“blocking” and “micro-tile” strategies transplanted from CPU BLAS to AIE, with tile dimensions chosen per precision and register budget), exposing maximal vector and VLIW pipeline utilization (Lei et al., 2023, Lei et al., 2024).
- Software-managed multi-level memory hierarchy: input operands and weights are prepacked/allocated into local SRAM, on-chip UltraRAM, and BlockRAM; data is staged through layers of double buffers, and inter-layer tiling is matched to memory tile capabilities (Lei et al., 2024, Danopoulos et al., 17 Dec 2025).
- Explicit, programmer-controlled DMA and neighbor streaming; no hardware-managed caching; and deterministic timing with initiation interval II=1 pipelining in all performance-critical kernels (Danopoulos et al., 17 Dec 2025).
- Topology-aware graph placement: layout of kernels and dataflows is crafted to minimize NoC hops, balance bandwidth, avoid bank/SRAM conflicts, and maximize cross-tile reduction/fusion opportunities (Mhatre et al., 13 Apr 2025, Danopoulos et al., 17 Dec 2025).
- Graph and runtime compilers generate the AIE firmware, PL bitstreams, host orchestration (CPU/ARM or x86), scheduling domains, and all necessary APIs for dynamic model management (Danopoulos et al., 17 Dec 2025, Zhang et al., 2024, Dai et al., 2024).
The model favors a hierarchical decomposition: problem partitioning to tiles, intratile blocking, intertile reduction, and explicit orchestration of dataflows through local and shared buffers.
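The hierarchical decomposition above can be sketched as a functional model in plain Python (tile counts and block sizes here are illustrative, and NumPy stands in for the vector datapath; this models the dataflow, not the hardware):

```python
import numpy as np

def gemm_spatial(A, B, n_tiles=4, k_block=2):
    # Partition the K (reduction) dimension across tiles; each tile blocks
    # its local K slice (staging each block through "local SRAM"), and the
    # partial sums are then reduced across tiles, cascade-bus style.
    M, K = A.shape
    _, N = B.shape
    assert K % (n_tiles * k_block) == 0
    k_per_tile = K // n_tiles
    partials = []
    for t in range(n_tiles):                              # one pass per tile
        ks = t * k_per_tile
        acc = np.zeros((M, N))
        for k0 in range(ks, ks + k_per_tile, k_block):    # intra-tile blocking
            acc += A[:, k0:k0 + k_block] @ B[k0:k0 + k_block, :]
        partials.append(acc)
    return sum(partials)                                  # inter-tile reduction

A, B = np.random.rand(8, 16), np.random.rand(16, 8)
assert np.allclose(gemm_spatial(A, B), A @ B)
```

In a real mapping, the per-tile loop bodies run concurrently and the final reduction is pipelined over the cascade bus rather than materialized as a list of partial matrices.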
3. Matrix Computation and Deep Learning Workloads
AIE arrays are heavily optimized for matrix multiply-accumulate (MM or GEMM), convolution, and related kernels: the backbone of AI inference, scientific computing, and signal processing. Key points include:
- Each tile executes “blocked micro-kernels”: micro-tile dimensions are selected to fit the accumulator registers and maximize SIMD occupancy (Lei et al., 2023, Lei et al., 2024).
- Multi-layer memory: L1 (registers), L2 (local AIE SRAM), L3 (PL-side URAM/BRAM), L4 (DDR). Performance is maximized by maximizing reuse at the highest level, pre-packing weights, and overlapping DMAs with compute (Lei et al., 2024, Lei et al., 2023).
- Micro-kernel performance on INT16 reaches up to 87% of architectural peak (27.8 of 32 MACs/cycle) for large block sizes; INT8 and mixed-precision modes are similarly efficient (Lei et al., 2023, Mhatre et al., 13 Apr 2025).
- For array-level parallelism, the GEMM is partitioned across tiles: AIEs either process distinct column blocks or participate in local reductions over the shared dimension. Scalability is close to linear up to tens of tiles; beyond that, UltraRAM bandwidth, PLIO contention, or off-chip memory becomes the bottleneck (Lei et al., 2024, Taka et al., 2023).
- Deep-learning frameworks exploit fine-grained partitioning: CHARM (Zhuang et al., 2023) partitions the array dynamically into multiple heterogeneous MM accelerators to maximize utilization for networks with both large and small layers; DPUV4E extends this with highly specialized compute/dataflow units for convolution and efficient non-convolutional ops (Li et al., 13 Jun 2025).
- Systolic array mappings with polyhedral-model transformations further enhance utilization for uniform recurrences and regular MM/conv/FFT pipelines (Dai et al., 2024, Zhang et al., 2024).
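The register-blocking pattern behind these micro-kernels can be shown with a scalar Python stand-in (block sizes are illustrative; actual AIE kernels use vector intrinsics, but the accumulator-block structure is the same):

```python
import numpy as np

def blocked_gemm(A, B, m_r=4, n_r=4):
    # Keep an m_r x n_r accumulator block "in registers" while streaming
    # over K, then write it back once -- the register-blocking pattern.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, m_r):
        for j in range(0, N, n_r):
            acc = np.zeros((m_r, n_r))      # accumulator register block
            for k in range(K):              # stream over the K dimension
                acc += np.outer(A[i:i + m_r, k], B[k, j:j + n_r])
            C[i:i + m_r, j:j + n_r] = acc
    return C

A, B = np.random.rand(8, 12), np.random.rand(12, 8)
assert np.allclose(blocked_gemm(A, B), A @ B)

# The quoted INT16 efficiency: 27.8 achieved MACs/cycle of a 32-MAC peak.
print(round(27.8 / 32 * 100))   # 87
```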
4. Heterogeneous Acceleration and Regular/Irregular Workloads
The ACAP SoC's tight integration of CPU, programmable logic, and the AIE tile array enables heterogeneous scheduling strategies:
- Non-linear, memory- and control-heavy operators (e.g., softmax, layer-norm, activation functions, pool, etc.) are deployed in PL, reducing the AIE/PL bandwidth burden and freeing AIEs for datapath-dominated computation (Zhang et al., 2024, Li et al., 13 Jun 2025).
- For graph neural networks (GNNs) and other irregular data: density-aware tile assignment, a mix of sparse/dense systolic tensor arrays, and hybrid PL/AIE designs yield state-of-the-art performance (Zhang et al., 2022).
- Recurrences and regular CA (communication-avoiding) algorithms—e.g., stencils, FFT, matrix-multiply—are mapped through frameworks such as EA4RCA or WideSA, employing top-down decomposition, aggressive space-time tiling, and routing-aware PLIO assignment (Zhang et al., 2024, Dai et al., 2024, Brown, 2022).
- For workloads with extreme memory reuse requirements or strong communication bounds (e.g., deep-wave models, windowed FFT), resource allocation across PLIO, local DMEM, and URAM/BRAM is fine-tuned for maximal efficiency (Li et al., 13 Jun 2025, Li et al., 22 Jun 2025).
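The CPU/PL/AIE split described above amounts to an operator-placement heuristic; a toy sketch (the op lists, names, and CPU fallback are hypothetical illustrations, not any framework's API):

```python
# Datapath-dominated ops map to the AIE array; non-linear,
# memory/control-heavy ops stay in PL; everything else falls back
# to the host. Op sets here are illustrative placeholders.
AIE_OPS = {"matmul", "conv2d", "fft"}
PL_OPS = {"softmax", "layernorm", "gelu", "maxpool"}

def place(op):
    if op in AIE_OPS:
        return "AIE"
    if op in PL_OPS:
        return "PL"
    return "CPU"   # host fallback for unclassified operators

plan = [(op, place(op)) for op in ["conv2d", "softmax", "matmul", "layernorm"]]
print(plan)
```

Real schedulers weigh bandwidth and utilization rather than fixed op sets, but the partitioning principle is the one the cited works apply.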
5. Performance, Efficiency, and Bottlenecks
AIE-based accelerators consistently push close to hardware limits, but realization of peak compute depends on coordinated memory and interconnect management:
- FP32 theoretical performance (AIE1): $6.4$–$8.0$ TFLOPs ($8$ MACs/tile/cycle × $2$ FLOPs/MAC × $400$ tiles × $1$–$1.25$ GHz); INT8 on AIE2: $128$–$165$ TOPS (Taka et al., 2023, Mhatre et al., 13 Apr 2025).
- Empirical results: MaxEVA reaches $5.44$ TFLOPs for FP32 and $77.0$ TOPS for INT8 at $1.16$ TOPS/W (Taka et al., 2023); GAMA (AIE2) reaches $165$ TOPS for INT8 and $83$ TBFLOPS for BF16 (Mhatre et al., 13 Apr 2025).
- For regular algorithms: WideSA delivers $4.15$ TOPS for float MM on VCK5000, improving on the prior state of the art with core utilization of at least $95\%$ (Dai et al., 2024); EA4RCA achieves higher throughput than SOTA on both 3×3 Filter2D and MM (Zhang et al., 2024).
- Application-level accelerators (e.g., the CAT framework for Transformers) outperform prior ACAP designs, the Nvidia A10G, and the ZCU102 in BERT throughput, with peak efficiency up to $520.97$ GOPS/W (Zhang et al., 2024).
- Bandwidth is the typical limiting factor: off-chip DDR (~25–102 GB/s), PL↔AIE PLIO (~1 TB/s), and on-chip URAM (tens of GB/s per tile) must be balanced. Stalls arise from starved DMA channels, underprovisioned PLIO, or streaming/cascade-bus overload (Mhatre et al., 13 Apr 2025, Zhuang et al., 2023, Taka et al., 2023).
- Design best practices include careful static blocking, buffer placement to avoid SRAM bank conflicts, double-buffered tiling, pipelined reduction trees, and staggered kernel placement across the array to resolve routing congestion (Lei et al., 2024, Mhatre et al., 13 Apr 2025, Taka et al., 2023).
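The bandwidth-balancing argument above amounts to a roofline check; a minimal sketch (peak and DDR figures taken from the numbers quoted in this section, with a deliberately simplistic single-pass traffic model):

```python
def limiting_resource(flops, bytes_moved, peak_tflops, bw_gbs):
    # Attainable TFLOPs = min(compute roof, arithmetic intensity * bandwidth).
    ai = flops / bytes_moved                        # FLOPs per byte
    attainable = min(peak_tflops, ai * bw_gbs / 1e3)
    return ("compute" if attainable == peak_tflops else "bandwidth"), attainable

# Large FP32 GEMM: 2*n^3 FLOPs over one pass of A, B, C through 102 GB/s DDR.
n = 4096
kind, _ = limiting_resource(2 * n**3, 3 * n * n * 4, peak_tflops=8.0, bw_gbs=102.0)
print(kind)     # compute

# A low-intensity streaming kernel (2 FLOPs per 8 bytes) at the same DDR:
kind2, tf2 = limiting_resource(2 * n, 8 * n, peak_tflops=8.0, bw_gbs=102.0)
print(kind2)    # bandwidth
```

The second case lands orders of magnitude below the compute roof, which is why low compute/byte kernels (Section 6) see limited AIE benefit.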
6. Frameworks, Automated Toolchains, and Application Domains
AIE software development flow is increasingly tool-assisted, supporting full-stack compilation from ML models (e.g., PyTorch, hls4ml, Keras) down to AIE firmware:
- AIE4ML: an end-to-end compiler (hls4ml/PyTorch input, AIE-ML/MLv2 targets) with deterministic, topology-aware graph placement and blocked linear/fused kernels, sustaining a high fraction of single-tile peak efficiency when scaling to 296 tiles (Danopoulos et al., 17 Dec 2025).
- CHARM: automated DSE for MM/AI workloads, dynamically partitions the AIE array to maximize utilization across heterogeneous workloads (Zhuang et al., 2023).
- CAT: hardware–model co-constraint framework for customized Transformer accelerators, balancing AIE utilization, PLIO, and model hyperparameters (Zhang et al., 2024).
- GAMA: compiler with buffer allocation and kernel/cascade placement heuristics for AIE2, achieves higher memory utilization and array usability versus vendor defaults (Mhatre et al., 13 Apr 2025).
- EA4RCA: generic code generator for regular CA algorithms, fully automates partitioning of computation and dataflow (Zhang et al., 2024).
- For numerical simulation and computational finance, AIE achieves order-of-magnitude speedups for high arithmetic-intensity routines over CPU/GPU, with bottlenecks remaining in cases with low compute/byte ratios or iteration-dependent feedback (e.g., ODE/PDE solvers, cycle graphs in quantitative finance) (Li et al., 22 Jun 2025, Klaisoongnoen et al., 2024, Brown, 2022).
Target domains now span DNN inference/training, scientific stencils, graph analytics, spectral estimation, and real-time control.
7. Challenges, Limitations, and Future Directions
Despite the favorable comparison to state-of-the-art GPU/FPGA accelerators, several limitations remain:
- Off-chip bandwidth and PLIO-channel capacity limit strong-scaling beyond a few hundred tiles unless communication can be orchestrated with near-perfect reuse (Zhuang et al., 2023, Dai et al., 2024, Taka et al., 2023).
- For kernels with iteration-to-iteration feedback, or memory-bound workloads with low arithmetic intensity, the AIE array may yield less benefit than a pure PL design (Klaisoongnoen et al., 2024, Brown, 2022).
- Cycle-based graph dependencies (e.g., feedback loops) are not supported natively; the workaround is a PL “loopback” with ping-pong buffering at additional cost (Klaisoongnoen et al., 2024).
- Explicit buffer placement and manual banking are still mandatory for optimal performance, though toolchain support is improving (Mhatre et al., 13 Apr 2025, Danopoulos et al., 17 Dec 2025).
- Next-generation AIE-MLv2 chips expand memory, cascade width, and bank count; frameworks designed for AIE-ML adapt seamlessly, but scaling of kernel and memory tile tilers requires continued algorithmic and tool advances (Danopoulos et al., 17 Dec 2025, Mhatre et al., 13 Apr 2025).
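The PL “loopback” with ping-pong buffering mentioned above can be modeled as two buffers that swap producer/consumer roles each iteration (a host-side sketch only; in hardware the buffers live in PL and the swap hides the feedback latency):

```python
def iterate_with_pingpong(state, step, n_iters):
    # Two buffers alternate roles: the "kernel" reads iteration i from one
    # while writing iteration i+1 into the other, then the roles swap.
    bufs = [list(state), [0] * len(state)]     # ping and pong buffers
    for i in range(n_iters):
        src, dst = bufs[i % 2], bufs[(i + 1) % 2]
        for j in range(len(src)):              # kernel body over the buffer
            dst[j] = step(src[j])
    return bufs[n_iters % 2]                   # buffer holding the last result

out = iterate_with_pingpong([1, 2, 3], step=lambda x: 2 * x, n_iters=4)
print(out)   # [16, 32, 48]
```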
The Versal AI Engine architecture represents a spatially scalable, VLIW/SIMD-tuned hardware substrate with software-deterministic performance and is redefining accelerator design for high-throughput, parallel, and reconfigurable computing tasks across AI and traditional HPC domains. For further technical specifics, see (Zhuang et al., 2023, Lei et al., 2023, Lei et al., 2024, Taka et al., 2023, Mhatre et al., 13 Apr 2025, Danopoulos et al., 17 Dec 2025, Zhang et al., 2024, Dai et al., 2024, Li et al., 13 Jun 2025, Zhang et al., 2024).
Key references:
(Zhuang et al., 2023, Lei et al., 2023, Lei et al., 2024, Taka et al., 2023, Mhatre et al., 13 Apr 2025, Danopoulos et al., 17 Dec 2025, Zhang et al., 2024, Dai et al., 2024, Li et al., 13 Jun 2025, Zhang et al., 2024, Shimamura et al., 17 Feb 2025, Sapkas et al., 19 Nov 2025, Li et al., 22 Jun 2025, Zhang et al., 2022, Brown, 2022, Klaisoongnoen et al., 2024).