
Hybrid/Tiled Architectures

Updated 12 February 2026
  • Hybrid/tiled architectures are systems that partition computation and data into regular tiles, enhancing memory locality and parallel processing.
  • They are applied in high-performance computing, deep learning, and CIM accelerators, achieving significant speedups and energy efficiency improvements.
  • Key design trade-offs include optimal tile sizing, scheduling methods, and balancing spatial-temporal partitioning to minimize inter-tile communication overhead.

Hybrid/tiled architectures refer to computational systems, algorithmic frameworks, or machine learning models that partition data, computation, or hardware into discrete, often spatial or logical, “tiles,” and orchestrate their execution using a variety of strategies to optimize for locality, scalability, memory efficiency, parallelism, or device heterogeneity. These methods underlie much of modern high-performance scientific computing, deep learning training/inference, domain-specific accelerators, and emerging hardware for machine learning and analog/IMC domains.

1. Fundamental Principles of Hybrid/Tiled Partitioning

Hybrid/tiled architectures are defined by their division of computation and data into regular blocks or tiles, and the subsequent coordination of these blocks across one or more resources (cores, PEs, accelerators, devices, or memory units). The two main axes of hybridization are:

  • Spatial partitioning: Tiles are assigned to different processing resources, enabling parallel execution (e.g., tiles mapped to multicore CPUs, GPUs, systolic arrays, or CIM tiles).
  • Temporal partitioning: Each resource further tiles its data over time, reusing local memory to maximize data locality and minimize bandwidth requirements.
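As a concrete sketch of the temporal axis (illustrative, NumPy-based): a blocked GEMM in which each tile-sized block is reused from fast memory before the loop advances. Spatial partitioning would instead distribute the outer (i, j) tile iterations across workers.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked (temporally tiled) matrix multiply: each (tile x tile) block
    of A, B, and C is reused from fast local memory before moving on.
    Spatial partitioning would assign the outer (i, j) tile iterations to
    different workers instead of iterating them on one resource."""
    n = A.shape[0]
    C = np.zeros((n, B.shape[1]))
    for i in range(0, n, tile):              # outer tiles over C rows
        for j in range(0, B.shape[1], tile): # outer tiles over C cols
            for k in range(0, A.shape[1], tile):  # reduction over inner tiles
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

The tile parameter is chosen so that three blocks (one each of A, B, C) fit in the fastest memory level; Section 3 returns to that sizing trade-off.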

Hybridization frequently refers to the mixing of spatial and temporal strategies, as in modern spatial accelerators for deep neural networks (Moon et al., 2021), hybrid analog/digital in-memory computing systems (Lammie et al., 2024), or combinations of data/model/hybrid parallelism in deep learning (Wang et al., 2018).

Key goals driving the design of these architectures:

  • Locality: Tiles are sized to fit into fast, local storage (cache, scratchpad, L1/L2, local SRAM, crossbar, etc.).
  • Concurrency: Tiling exposes fine-grained, asynchronous tasks conducive to loose synchronization and dynamic, out-of-order scheduling (Bouwmeester, 2013, 0709.1272).
  • Reuse and Communication Optimization: Minimize costly data movement by maximizing in-tile reuse and scheduling inter-tile communication efficiently.
  • Heterogeneity Management: Support for hybrid device mappings where architecture or layer can be assigned to the most appropriate hardware/resource (e.g., layer-to-device in IMC (Bhattacharjee et al., 2023), analog tiles vs. digital PEs (Lammie et al., 2024)).

2. Tiling and Hybridization in Algorithms and Hardware

Linear Algebra and Numerical Kernels

Tiled algorithms for linear algebra (Cholesky, QR, LU) partition matrices into small square tiles (e.g., n_b × n_b), exposing a DAG of tile-level tasks whose dependencies mirror the mathematical recursion. This task decomposition is critical for parallel efficiency on multicore and manycore systems: tasks operate on different tiles independently and are coordinated with only lightweight dependency tracking and minimal global synchronization (Bouwmeester, 2013, 0709.1272).
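The tile-task DAG of a right-looking tiled Cholesky can be enumerated directly from its loop nest by tracking the last writer of each tile. A minimal sketch (kernel names follow the conventional LAPACK/BLAS POTRF/TRSM/SYRK/GEMM labels; the helper structure is ours, not any specific runtime's API):

```python
def cholesky_task_dag(nt):
    """Tile-task DAG of a right-looking tiled Cholesky on an nt x nt tile
    grid. Dependencies are derived by recording the last task that wrote
    each tile (an accumulating write also reads its own tile)."""
    last = {}   # tile (i, j) -> task that last wrote it
    deps = {}   # task -> set of predecessor tasks

    def task(name, writes, reads):
        deps[name] = {last[r] for r in reads if r in last}
        last[writes] = name

    for k in range(nt):
        task(("POTRF", k), (k, k), [(k, k)])            # factor diagonal tile
        for i in range(k + 1, nt):
            task(("TRSM", i, k), (i, k), [(i, k), (k, k)])   # panel solve
        for i in range(k + 1, nt):
            task(("SYRK", i, k), (i, i), [(i, i), (i, k)])   # diagonal update
            for j in range(k + 1, i):
                task(("GEMM", i, j, k), (i, j), [(i, j), (i, k), (j, k)])
    return deps
```

Any schedule that respects these edges is valid, which is precisely what gives dynamic runtimes their freedom to execute tiles out of order.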

In spatial accelerators, hybrid tiling maps outer tiles to PE clusters and inner tiles to local PE memory (scratchpad and register levels), enabling simultaneous spatial and temporal data reuse. A canonical example is the two-level tiling of GEMM, with outer tiles mapped across a spatial grid and inner tiles streamed through PE local stores, yielding orders-of-magnitude reductions in energy and runtime (Moon et al., 2021).

Deep Learning and Graph Parallelism

In distributed deep learning, hybrid tiling is formalized as an assignment of tensor tiles to devices. Here, data parallelism splits the batch dimension, model parallelism splits parameter dimensions, and hybrids leverage arbitrary hierarchical combinations. The optimal tiling problem is cast as minimizing total inter-device communication, given the computation graph and hardware hierarchy (Wang et al., 2018). The SoyBean system automates this, finding optimal tensor partitionings, inserting conversion operators, and mapping to device interconnects, yielding substantial communication reductions and superlinear speedups in some configurations.
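In the same spirit, a toy cost model (not SoyBean's actual formulation; all constants illustrative) compares per-step communication for data vs. model parallelism on a single dense layer and picks the cheaper split:

```python
def comm_bytes(batch, d_in, d_out, workers, bytes_per_elem=4):
    """Illustrative per-step communication for one dense layer
    (batch x d_in) @ (d_in x d_out) split across `workers` devices.
    A ring all-reduce of E elements moves ~2*E*(w-1)/w elements per device;
    the model-parallel activation exchange is approximated with the same
    collective cost form."""
    w = workers
    collective = lambda elems: 2 * elems * (w - 1) / w * bytes_per_elem
    data_parallel = collective(d_in * d_out)    # all-reduce weight gradients
    model_parallel = collective(batch * d_out)  # exchange output activations
    return {"data": data_parallel, "model": model_parallel}

def best_split(batch, d_in, d_out, workers):
    """Pick the partitioning axis with the lower communication cost."""
    costs = comm_bytes(batch, d_in, d_out, workers)
    return min(costs, key=costs.get)
```

Even this crude model reproduces the familiar rule of thumb: small batches with large weight matrices favor model parallelism, while large batches with small layers favor data parallelism.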

Computing-in-Memory (CIM) and IMC Systems

Emerging analog and digital CIM architectures employ a dense 2-D or 3-D tiling of crossbar arrays or memory units. Recent frameworks (e.g., LionHeart, CLSA-CIM) combine these hardware tiles with mapping heuristics and cross-layer scheduling that coordinate the execution of sub-graph tiles to maximize utilization and minimize latency (Lammie et al., 2024, Pelke et al., 2024). Hybridization in this context includes both mixing analog/digital tiles and cross-layer scheduling/fusion to reuse on-chip and in-tile data (weights, partial sums) as effectively as possible.

In hybrid IMC device search (HyDe), each layer's tiles are mapped to the most area- and energy-optimal device among SRAM, PCM, and FeFET, with mappings found via differentiable search over a layer–device assignment matrix. This yields solutions that dramatically improve total TOPS/mm² and inference energy within nominal accuracy budgets (Bhattacharjee et al., 2023).

3. Core Methodologies and Scheduling Principles

General Hybrid/Tiled Scheduling

Central to hybrid/tiled architectures is the expression of computations as a dependency graph of tile-level or sub-tile tasks (DAG). Distributed and heterogeneous systems (e.g., multicore+GPU, multi-chiplet, hybrid IMC) benefit from dynamic, asynchronous task scheduling where ready tasks are dispatched to available resources as soon as their dependencies are satisfied, with priority heuristics sometimes assigned to critical-path or high-value kernels (Bouwmeester, 2013, 0709.1272).
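The dispatch loop itself can be sketched as dynamic list scheduling over the tile DAG (a simplified serial model of asynchronous execution; the priority function is a stand-in for critical-path heuristics):

```python
import heapq
import itertools

def schedule(deps, priority):
    """Dynamic list scheduling of a task DAG.
    deps: task -> set of predecessor tasks; priority: task -> number
    (higher = more critical). A task becomes ready when its in-degree
    drops to zero; ready tasks are dispatched highest-priority first.
    Returns one valid execution order."""
    counter = itertools.count()  # tie-breaker so tasks are never compared
    indeg = {t: len(p) for t, p in deps.items()}
    succs = {t: [] for t in deps}
    for t, preds in deps.items():
        for p in preds:
            succs[p].append(t)
    ready = [(-priority(t), next(counter), t) for t in deps if indeg[t] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, t = heapq.heappop(ready)   # "execute" the highest-priority task
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:            # all dependencies satisfied
                heapq.heappush(ready, (-priority(s), next(counter), s))
    return order
```

In a real runtime the pop step hands the task to any free core or accelerator rather than executing serially, but the readiness bookkeeping is the same.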

Tile size selection balances cache/scratchpad fit (for maximal locality) versus concurrency, as excessively small tiles reduce vector efficiency and increase synchronization and boundary overheads, while large tiles may overrun per-core or per-tile memory and restrict parallelism (Zhang et al., 2016).
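That balance can be made concrete with a simple, illustrative search: take the largest tile whose working set fits the scratchpad while still leaving at least one independent tile per core (the 3-tile working-set model assumes a GEMM-like kernel touching one block each of A, B, and C):

```python
def pick_tile(n, cores, scratch_bytes, elem=4):
    """Choose a square tile size for an n x n GEMM-like kernel: the largest
    power-of-two tile whose working set (3 blocks: A, B, C tiles) fits in
    the scratchpad while the grid still yields >= cores independent output
    tiles for parallelism."""
    best = 1
    t = 1
    while t <= n:
        working_set = 3 * t * t * elem     # bytes held in fast memory
        n_tiles = (n // t) ** 2            # independent output tiles
        if working_set <= scratch_bytes and n_tiles >= cores:
            best = t
        t *= 2
    return best
```

For a 1024×1024 problem on 16 cores with a 64 KiB scratchpad this picks 64×64 tiles; with unbounded scratchpad the concurrency constraint alone caps the tile at 256.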

In cross-layer scheduling for CIM, integer programming or earliest-start scheduling is used to maximize hardware utilization, combining weight duplication (to relieve pipeline and scheduling bottlenecks) and cross-layer as-soon-as-ready block dispatch (Pelke et al., 2024). This approach can deliver up to 29.2× speedup and a 17.9× utilization boost over purely sequential scheduling.

Data Movement and Memory Optimization

Tiling substantially reduces bandwidth demand by maximizing on-tile data reuse and minimizing off-tile traffic. In fused-tiled architectures, multiple layers are tiled and fused together so that intermediate outputs never leave fast local memory; constraint-based combinatorial programs select tile/block sizes that keep the working set within memory capacity at each level (Jung et al., 21 Mar 2025). This approach was shown to reduce data traffic by nearly 50% and runtime by 60% on RISC-V SoCs.
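A minimal sketch of the fusion idea (NumPy; assumes the two layer functions are tile-local, e.g., elementwise): both layers are applied per tile, so the intermediate is never materialized at full resolution.

```python
import numpy as np

def fused_tiled(x, f, g, tile=64):
    """Apply layer f then layer g tile-by-tile, so the intermediate f(block)
    lives only in fast local memory instead of being written back at full
    size between layers. Illustrative: assumes f and g are tile-local
    (e.g., elementwise), so fusion needs no halo exchange."""
    out = np.empty_like(x)
    n = x.shape[0]
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            blk = x[i:i+tile, j:j+tile]
            out[i:i+tile, j:j+tile] = g(f(blk))  # intermediate never leaves the tile
    return out
```

Operators with spatial extent (e.g., convolutions) require overlapping tiles or halo regions, which is exactly what the constraint-based formulations above account for when sizing blocks.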

Hybrid tiling on GPUs partitions each tile between the register file and shared memory to maximize occupancy and minimize synchronization: part of the tile is held in thread-local registers, the rest in shared memory. This split yields significant performance improvements in image processing and other bandwidth-limited workloads (Jangda et al., 2019).

4. Example Applications and Quantitative Results

Hybrid/tiled architectures are pervasive in domains demanding high performance, memory locality, and/or device heterogeneity:

  • Industrial Anomaly Detection: The tiled-ensemble method partitions high-res images into tiles, trains per-tile models, and fuses pixel-wise outputs, achieving 1–2 pp AUROC gain over baselines and enabling full-resolution inference within the memory of a single low-res model (Rolih et al., 2024).
  • Deep Learning Training/Inference: SoyBean's automatic hybrid tensor tiling yields 1.5–4× speedup and 40–60% comm reduction vs. standard data/model parallelism (Wang et al., 2018).
  • DNN Hardware Acceleration: Fused-tiled layers on a RISC-V SoC decreased runtime by 60% and reduced off-chip data movement nearly 50%, compared to layer-wise tiling (Jung et al., 21 Mar 2025).
  • RRAM-based CIM Accelerators: CLSA-CIM schedules sub-feature sets across physical tiles via cross-layer earliest-start scheduling. Combined with modest weight duplication, it attains up to 29.2× speedup and 20.1% utilization in large DNNs (Pelke et al., 2024).
  • IMC Device Heterogeneity: The HyDe hybrid device assignment optimizes for area/energy constraints under non-idealities, achieving up to 2.74× higher TOPS/mm² and up to 26% better energy efficiency than single-device baselines (Bhattacharjee et al., 2023).
  • NLP Tiled CNNs: HTCNNs cluster vocabulary tokens, assign per-cluster filters, and restore n-gram coverage via neighbor masking; this approach boosts sentiment classification accuracy by 3–6% absolute over classic CNNs (Trusca et al., 2020).

5. Design Trade-Offs, Extensions, and Generalization

Designing hybrid/tiled architectures entails a series of trade-offs:

  • Tile size and shape: Must fit local memory, support vectorization, and maximize parallel scheduling flexibility. Tiles that are too small forfeit reuse and vector efficiency; tiles that are too large overrun local memory and limit concurrency.
  • Scheduling strategy: The balance between static partitioning and dynamic, dependency-driven execution is architecture- and workload-dependent; dynamic scheduling unlocks better resource utilization at the expense of greater scheduling complexity (0709.1272, Bouwmeester, 2013).
  • Heterogeneity: In systems with heterogeneous tiles (e.g., digital+analog, multiple IMC devices), mapping must account for accuracy, energy, retention, and drift.
  • Cross-layer Fusion: Fusing layers in tiling can further reduce data movement at the potential cost of higher complexity in tile dependency management and scheduling (Jung et al., 21 Mar 2025).
  • Hybrid Topologies in ML Models: Striped and sparse-expert hybrids, discovered in mechanism-driven architecture search (MAD), outperform uniform architectures in compute- and state-optimal scaling (Poli et al., 2024).

Hybrid/tiled strategies generalize naturally to broad problem classes:

  • Any domain where problem size exceeds on-tile/on-chip resources and/or inter-resource bandwidth is at a premium.
  • Machine learning workloads with structural or device heterogeneity.
  • Software/hardware co-design situations for scientific computing, graph analytics, and emerging post-von-Neumann architectures.

6. Representative Architectures and Results Tables

| Domain | Hybrid/Tiled Approach | Noted Benefits | Reference |
|---|---|---|---|
| Linear algebra | Tile/DAG + dynamic scheduling | 1.5–2× speedup; stability | (Bouwmeester, 2013) |
| Deep learning | Tensor tiling (SoyBean) | 1.5–4× faster; −40–60% comm. | (Wang et al., 2018) |
| Analog neuromorphic (IMC) | Hybrid layer-to-device tile mapping | 2.74× TOPS/mm²; −26% energy | (Bhattacharjee et al., 2023) |
| GPU image processing | Hybrid tiling (registers + shared memory) | 1.33–1.65× over Halide; 100% occupancy | (Jangda et al., 2019) |
| Anomaly detection (vision) | Tiled ensemble (overlapping tiles) | +6% AUROC; −50% memory | (Rolih et al., 2024) |
| Distributed DNN inference | Fused-tile layer fusion (RISC-V SoC) | −60% runtime; −47% traffic | (Jung et al., 21 Mar 2025) |
| CIM accelerator scheduling | Cross-layer tile scheduling (CLSA-CIM) | 29.2× speedup; 17.9× utilization | (Pelke et al., 2024) |

Results demonstrate that when tile size, dataflow, and memory constraints are jointly optimized, hybrid/tiled architectures consistently outperform their monolithic or purely spatial/temporal/block counterparts.

The field continues to move rapidly in several directions. Recent work demonstrates the value of mechanism-driven synthesis of hybrid topologies in ML model architectures, with scaling-law validation and lightweight synthetic-task proxies (Poli et al., 2024). Memory-system co-optimization (e.g., NUMA-aware regional tiling (Zhang et al., 2016)) and software-managed, cache-aware tiling/fusion for modern CPU/SoC accelerators (Jung et al., 21 Mar 2025) are gaining traction, particularly in edge and distributed deployments.

Open challenges include:

  • Global communication minimization and NUMA/memory affinity in extreme-scale and heterogeneous settings,
  • Scheduling/fusion of highly irregular graphs or dataflow on chiplet-based architectures and analog CIM,
  • Unified frameworks that automate tiling/mapping under mixed-precision, device-heterogeneous constraints,
  • Exploiting tile-based stacking or attention for models that require both global and high-resolution, local receptive fields,
  • Extending cross-layer scheduling and fusion techniques to dynamically varying runtime graphs or models undergoing continual adaptation.

Hybrid/tiled architectures have become foundational across scales of modern computational science, machine learning, and hardware-software co-design, providing the necessary abstraction and optimization layer to bridge the gap between ambitious workloads and practical performance or efficiency limits.
