Register Grouping (LMUL) in RISC-V

Updated 13 January 2026

Register Grouping (LMUL) is a technique that unifies contiguous physical vector registers into a single logical group, allowing arbitrary-length vector operations.
The Zoozve extension specifies hardware enhancements and LLVM compiler passes to dynamically allocate and manage these register groups, eliminating the need for traditional strip-mining.
Reported evaluations show dynamic instruction-count reductions of 10.1× to 344.4× across FFT, dot-product, and AXPY kernels, a single strip-mining iteration per vector computation, and an area overhead of roughly 5%.

Register Grouping (LMUL) refers to the technique by which multiple contiguous physical vector registers are allocated and treated as a unified logical vector register group (RG), enabling vector operations across a logical vector of arbitrary length. This approach, formalized within the Zoozve extension to the RISC-V Vector Extension (RVV), allows flexible grouping ("arbitrary LMUL"). It eliminates the need for traditional strip-mining in long vector computations and facilitates optimized resource usage and performance for data-parallel tasks (Xu et al., 22 Apr 2025).

1. Hardware Architecture for Arbitrary Register Grouping

Zoozve introduces substantial modifications to standard RISC-V vector hardware to support arbitrary RG formation:

Instruction Decode: The v_head field (13 bits) defines the starting register index for a group and is stored in a dedicated CSR (VHEAD_CSR). The decoder includes expanded logic for the arbitrary-LMUL encoding.
Register File: Instead of the fixed 32-vector-register design, Zoozve employs a multi-bank register file with up to 1024 physical registers, implemented as RegFile1024.sv (parameters: NBANKS=16, NWORDS=64).
RGDetector.sv: This control-path module uses comparators to detect hazards (RAW/WAW) within RG spans, addressing inter-group conflicts and enforcing correct scheduling.
ShuffleEngine.sv: For operations requiring element shuffling, Zoozve’s data-path incorporates an M×M crossbar (Crossbar.sv) and programmable state machines to route data from input vector elements to the proper processing element (PE), supporting both symmetric (straight-through) and asymmetric (gather/scatter) operations.
CSR and Hazard Unit: Enhanced registers and logic units update CSR fields with new grouping indices and vector lengths, while the hazard logic ORs the comparator signals with the existing dependency checks.

Example hardware modules:

// RGDetector.sv
module RGDetector(
  input  logic [12:0] RG_head, RG_tail,
  input  logic [12:0] addr1, addr2, // vrs1, vrs2 or vrd
  output logic hazard
);
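  logic in1, in2;
  // Continuous assignments: a declaration initializer would be evaluated only once.
  assign in1 = (addr1 >= RG_head) && (addr1 <= RG_tail);
  assign in2 = (addr2 >= RG_head) && (addr2 <= RG_tail);
  assign hazard = in1 || in2;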
endmodule
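
The comparator logic of RGDetector can be modelled in software for quick checks. The following is a minimal Python sketch rather than part of Zoozve; the names `in_group` and `rg_hazard` are hypothetical, and the code only mirrors the inclusive range test in the module above.

```python
def in_group(addr: int, rg_head: int, rg_tail: int) -> bool:
    """True when addr lies inside the inclusive span [rg_head, rg_tail]."""
    return rg_head <= addr <= rg_tail


def rg_hazard(rg_head: int, rg_tail: int, addr1: int, addr2: int) -> bool:
    """Mirror of the RGDetector comparators: flag a hazard when either
    register address falls inside the register group's span."""
    return in_group(addr1, rg_head, rg_tail) or in_group(addr2, rg_head, rg_tail)


# A group occupying physical registers 8..11 (G = 4).
print(rg_hazard(8, 11, addr1=3, addr2=10))   # addr2 is inside the span
print(rg_hazard(8, 11, addr1=3, addr2=12))   # both addresses are outside
```

Both boundary registers (head and tail) count as part of the group, matching the `>=` and `<=` comparisons in the hardware listing.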

// ShuffleEngine.sv (excerpt)
module ShuffleEngine #(
  parameter MAXG = 32, VEW = 32
)(
  input  logic [VEW*MAXG-1:0] vec_in,
  input  logic [VEW*MAXG-1:0] idx_in,
  output logic [VEW*MAXG-1:0] vec_out,
  input  logic [$clog2(MAXG)-1:0] G  // actual group size
);
  Crossbar #(.WIDTH(VEW), .N(MAXG)) xbar (
    .data_in  (vec_in),
    .select   (idx_in),
    .data_out (vec_out)
  );
endmodule
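
As a behavioural reference for the shuffle path, the sketch below models the crossbar as a per-element gather, `out[i] = vec_in[idx_in[i]]`. That select semantics is an assumption made for illustration, since the excerpt does not define how `Crossbar` interprets `select`. The function name `shuffle` is hypothetical, and `g` is treated as the number of active lanes, which is a simplification.

```python
from typing import List


def shuffle(vec_in: List[int], idx_in: List[int], g: int) -> List[int]:
    """Gather-style model of an M x M crossbar limited to the first g lanes.

    Assumption: output lane i takes input lane idx_in[i]. Lanes at
    positions >= g are passed through unchanged.
    """
    if not 0 <= g <= len(vec_in):
        raise ValueError("group size out of range")
    out = list(vec_in)
    for i in range(g):
        sel = idx_in[i]
        if not 0 <= sel < g:
            raise ValueError(f"index {sel} outside active group of size {g}")
        out[i] = vec_in[sel]
    return out


data = [10, 11, 12, 13, 14, 15, 16, 17]
identity = list(range(8))                  # symmetric (straight-through) case
reverse4 = [3, 2, 1, 0, 4, 5, 6, 7]        # asymmetric permutation of 4 lanes

print(shuffle(data, identity, g=8))
print(shuffle(data, reverse4, g=4))
```

The identity index vector corresponds to the straight-through case, while any other index vector exercises the gather-style routing.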

2. Compiler Mechanisms for Data-Adaptive Register Allocation

In Zoozve, the LLVM-based compiler is adapted to orchestrate arbitrary register grouping through three principal passes:

Intrinsic Splitting Pass ("SplitPass"): Analyzes each vector intrinsic in LLVM IR, calculates the required grouping factor $G = \lceil L \cdot VEW / VLEN \rceil$, segments the operation into $G$ sub-intrinsics, and surrounds these with zv_begin_group(G)/zv_end_group(G) delimiter intrinsics. Metadata LMUL=G marks the group.
Register-Allocation Extension: The live intervals are augmented by a queue tracking $(virtualregs[], G)$ tuples. The allocation algorithm ensures that $G$ contiguous physical registers are reserved for each RG by modifying the linear-scan approach.
Assembly Coalescing ("CoalescePass"): Detects $G$ consecutive instructions with the same opcode, vector length, and stride in v_head. These are merged into a single "wide" Zoozve instruction.
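
The contiguity requirement of the register-allocation extension can be illustrated with a first-fit search over a free map. This is a simplified sketch and not the modified linear-scan allocator itself; `alloc_group`, `free_group`, and the boolean free map are illustrative assumptions.

```python
from typing import List, Optional

NUM_PHYS_REGS = 1024  # size of the physical register file described above


def alloc_group(free: List[bool], g: int) -> Optional[int]:
    """Reserve g contiguous free registers (first fit).

    Returns the head index (RG_head), or None when no contiguous run exists.
    """
    if g < 1:
        raise ValueError("group size must be at least 1")
    run = 0
    for reg, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == g:
            head = reg - g + 1
            for r in range(head, reg + 1):
                free[r] = False
            return head
    return None


def free_group(free: List[bool], head: int, g: int) -> None:
    """Release the g registers starting at head."""
    for r in range(head, head + g):
        free[r] = True


free_map = [True] * NUM_PHYS_REGS
a = alloc_group(free_map, 4)    # occupies 0..3
b = alloc_group(free_map, 13)   # occupies 4..16
free_group(free_map, a, 4)      # registers 0..3 become free again
c = alloc_group(free_map, 6)    # too large for the 4-register hole
print(a, b, c)
```

The last request shows the fragmentation cost of contiguity: four registers are free at the bottom of the file, yet a group of six has to be placed after the 13-register group.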

Compiler pseudocode:

for each zv_intrinsic I of length L:
  G = ceil(L*VEW / VLEN)
  for i in 0..G-1:
    newI = clone(I) with element-range [i*VLEN/VEW, (i+1)*VLEN/VEW)
    if i==0:
      annotate(newI, LMUL=G)
      insert before newI: zv_begin_group(G)
    if i==G-1:
      insert after newI:  zv_end_group(G)
    emit newI
  remove original I
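
A runnable Python rendering of the pseudocode may help when experimenting with the splitting logic. It operates on a toy list of tuples rather than LLVM IR; `split_intrinsic` and the tuple layout are assumptions made for illustration, and the per-register element count is taken to be VLEN / VEW.

```python
from typing import List, Tuple

# Toy IR: ("begin", G) | ("end", G) | ("op", name, lo, hi, lmul_or_None)
Op = Tuple


def split_intrinsic(name: str, length: int, vew: int, vlen: int) -> List[Op]:
    """Split one logical vector op of `length` elements into G sub-ops,
    bracketed by begin/end group markers."""
    elems_per_reg = vlen // vew
    g = max(1, -(-(length * vew) // vlen))   # integer ceiling, at least 1
    out: List[Op] = []
    for i in range(g):
        lo = i * elems_per_reg
        hi = min((i + 1) * elems_per_reg, length)  # last chunk may be partial
        if i == 0:
            out.append(("begin", g))
        # Only the first sub-op carries the LMUL=G annotation.
        out.append(("op", name, lo, hi, g if i == 0 else None))
        if i == g - 1:
            out.append(("end", g))
    return out


for op in split_intrinsic("vadd", length=20, vew=32, vlen=256):
    print(op)
```

With 20 elements of 32 bits and 256-bit registers, the operation splits into three sub-ops covering elements 0..7, 8..15, and 16..19, with the last register only half used.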

3. Mathematical Formulation and Alignment

Effective LMUL (G): The register grouping factor is given by

$G = \left\lceil \frac{L \times \text{VEW}}{\text{VLEN}} \right\rceil$

where $L$ is the logical vector length, VEW is vector element width, and VLEN is the physical vector register width.

Register Group Indices: The allocation pass assigns $RG_{head}$ and $RG_{tail} = RG_{head} + G - 1$.
Vector Length Alignment: Zoozve sets the actual vector length to

$VL' = \left\lfloor \frac{L}{G}\right\rfloor \times G$

so that the logical vector fits into a single RG and partial strip-mining loops are avoided.

New Vector-Length Setting: The programming interface sets $\mathrm{VL} \leftarrow \mathrm{VL}'$, guaranteeing that all data fits exactly within RG boundaries.
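
The formulas above can be evaluated directly. The sketch below is a worked example under assumed parameters (VEW = 32, VLEN = 256, and an arbitrary $RG_{head}$ of 16); the function name `group_params` is hypothetical. Note that, as written, $VL'$ rounds $L$ down to a multiple of $G$, so it equals $L$ only when $G$ divides $L$.

```python
def group_params(length: int, vew: int, vlen: int, rg_head: int):
    """Return (G, RG_tail, VL') from the formulas in this section."""
    if length < 1:
        raise ValueError("logical vector length must be at least 1")
    g = -(-(length * vew) // vlen)       # integer ceiling of L*VEW/VLEN
    rg_tail = rg_head + g - 1
    vl_aligned = (length // g) * g       # floor(L/G) * G
    return g, rg_tail, vl_aligned


# L = 104 elements of 32 bits, 256-bit registers: 3328 bits -> 13 registers.
print(group_params(104, vew=32, vlen=256, rg_head=16))
# L = 100: G is still 13, but VL' drops to 91.
print(group_params(100, vew=32, vlen=256, rg_head=16))
```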

This mathematical foundation directly mitigates the need for multiple strip-mining iterations inherent in standard RVV approaches, which require dividing long vectors into manageable chunks due to fixed register group sizes.

4. Evaluation: Instruction Count, Performance, and Area

Zoozve empirical results demonstrate significant advantages:

Kernel	Data Size N	Dynamic Instructions (I_RVV vs. I_Zoozve)	Reduction (I_RVV / I_Zoozve)
FFT	32	1010 vs. 100	10.1×
FFT	2048	34444 vs. 100	344.4×
DotProd	16384	1292 vs. 17	76×
AXPY	16384	707 vs. 12	58.9×

Instruction Count Reduction: Dynamic instruction count is reduced by up to 344.4× (FFT, N=2048) and at least 10.1× (FFT, N=32). Even simple kernels (Dot Product, AXPY) benefit from reductions exceeding 50×.
Strip-Mining Iterations: Zoozve eliminates all but a single iteration for vector computations, unlike RVV, where the strip count is $\lceil N / (VL_{max} \cdot LMUL) \rceil$.
Area Overhead: Baseline RVV core area (A_base) is 11.3 mm². Zoozve hardware modules account for an additional 0.6 mm² (total: 11.9 mm²), representing a 5.3% increase:

$\frac{\Delta A}{A_{base}} = \frac{0.6}{11.3} \approx 5.3\%$
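
The ratios in the table follow directly from the instruction counts, and the RVV strip count follows from the formula above. The sketch below recomputes both. The values `vl_max=8` and `lmul=8` in the strip-count call are assumptions chosen only for illustration, not figures from the evaluation, and the function names are hypothetical.

```python
counts = {
    ("FFT", 32): (1010, 100),
    ("FFT", 2048): (34444, 100),
    ("DotProd", 16384): (1292, 17),
    ("AXPY", 16384): (707, 12),
}


def reduction(i_rvv: int, i_zoozve: int) -> float:
    """Dynamic instruction-count ratio I_RVV / I_Zoozve, to one decimal."""
    return round(i_rvv / i_zoozve, 1)


def rvv_strip_count(n: int, vl_max: int, lmul: int) -> int:
    """ceil(N / (VL_max * LMUL)) strip-mining iterations under standard RVV."""
    per_iter = vl_max * lmul
    return -(-n // per_iter)


for (kernel, n), (i_rvv, i_zoozve) in counts.items():
    print(kernel, n, reduction(i_rvv, i_zoozve))

print(rvv_strip_count(16384, vl_max=8, lmul=8))
```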

5. Limitations and Corner Cases

Decode Latency Impact: The additional comparator and crossbar logic may introduce one extra decode cycle in pathological cases; this impact can often be hidden by appropriate instruction scheduling.
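Minimum Group Size: The ceiling in the formula for $G$ yields at least 1 for any $L \geq 1$, but an implementation that computes $L \cdot VEW / VLEN$ with truncating integer division obtains zero whenever $L < VLEN/VEW$. The compiler must enforce $G \geq 1$, ensuring at least one physical register is always allocated.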
CSR Space Demands: Each RG requires 13+ bits for addressing via $v_{head}$ fields, increasing CSR requirements well beyond the traditional 5-bit RVV register fields.
Asymmetric Operation Complexity: Operations such as gather/scatter that leverage shuffling require extensive crossbar hardware. For large $G$, area complexity grows as $O(G^2)$.
Register Group Granularity: Zoozve employs integer $G \geq 1$ only; fractional registers cannot be grouped. If $L \cdot VEW$ is not a multiple of VLEN, the last register in $RG$ may only be partially used, but strip-mining is avoided.
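
Two of the corner cases above are easy to quantify. The sketch below clamps $G$ to at least one, reports how many element slots of the last register in a group are occupied, and counts crosspoints of a full $G \times G$ crossbar as a stand-in for the $O(G^2)$ trend. All function names are hypothetical, and the crosspoint count is a simple proxy, not a measured area model.

```python
def clamped_group_size(length: int, vew: int, vlen: int) -> int:
    """Ceiling of L*VEW/VLEN, forced to at least one register."""
    return max(1, -(-(length * vew) // vlen))


def last_register_fill(length: int, vew: int, vlen: int) -> int:
    """Number of element slots used in the final register of the group."""
    per_reg = vlen // vew
    g = clamped_group_size(length, vew, vlen)
    return length - (g - 1) * per_reg


def crossbar_crosspoints(g: int) -> int:
    """Crosspoints in a full g x g crossbar; grows quadratically with g."""
    return g * g


print(clamped_group_size(0, 32, 256))      # empty vector still reserves one register
print(last_register_fill(100, 32, 256))    # 100 = 12 * 8 + 4
print([crossbar_crosspoints(g) for g in (4, 8, 16)])
```

Doubling the group size quadruples the crosspoint count, which is the trade-off behind limiting very large groups for asymmetric operations.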

A plausible implication is that while arbitrary register grouping improves computational throughput and resource allocation, extremely large $G$ settings may entail non-trivial hardware area growth and decode latency, necessitating architectural trade-offs (Xu et al., 22 Apr 2025).

6. Implementation Considerations and Integration

Zoozve’s mechanisms are implementable via SystemVerilog templates for hardware (e.g., RegFile1024.sv, RGDetector.sv, ShuffleEngine.sv) and custom LLVM passes for compiler support. These subsystems collectively ensure that logical vector operations are decomposed, allocated, and executed across arbitrary RGs, fully avoiding strip-mining through precise VL alignment and grouping. The formulas for $G$ and $VL'$ directly map logical vector properties to hardware and compiler configuration, achieving high data-level parallelism and resource utilization (Xu et al., 22 Apr 2025).

References

Xu et al. (22 Apr 2025). Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP).
