Register Grouping (LMUL) in RISC-V
- Register Grouping (LMUL) is a technique that unifies contiguous physical vector registers into a single logical group, allowing arbitrary-length vector operations.
- The Zoozve extension specifies hardware enhancements and LLVM compiler passes to dynamically allocate and manage these register groups, eliminating the need for traditional strip-mining.
- Empirical evaluations show significant reductions in instruction count and strip-mining iterations, with minor area overhead, improving performance across data-parallel tasks.
Register Grouping (LMUL) refers to the technique by which multiple contiguous physical vector registers are allocated and treated as a unified logical vector register group (RG), so that a single vector operation can span a logical vector of arbitrary length. This approach, formalized in the Zoozve extension to the RISC-V Vector Extension (RVV), allows flexible grouping ("arbitrary LMUL"), eliminating the need for traditional strip-mining in long vector computations and improving resource usage and performance for data-parallel tasks (Xu et al., 22 Apr 2025).
1. Hardware Architecture for Arbitrary Register Grouping
Zoozve introduces substantial modifications to standard RISC-V vector hardware to support arbitrary RG formation:
- Instruction Decode: The `v_head` field (13 bits) defines the starting register index for a group and is stored in a dedicated CSR (`VHEAD_CSR`). The decoder carries additional logic for the arbitrary-LMUL encoding.
- Register File: Instead of the fixed 32-vector-register design, Zoozve employs a multi-bank register file with up to 1024 physical registers, implemented as `RegFile1024.sv` (parameters: NBANKS=16, NWORDS=64).
- `RGDetector.sv`: This control-path module uses comparators to detect hazards (RAW/WAW) within RG spans, addressing inter-group conflicts and enforcing correct scheduling.
- `ShuffleEngine.sv`: For operations requiring element shuffling, Zoozve's data-path incorporates an M×M crossbar (`Crossbar.sv`) and programmable state machines to route data from input vector elements to the proper processing element (PE), supporting both symmetric (straight-through) and asymmetric (gather/scatter) operations.
- CSR and Hazard Unit: Enhanced registers and logic units update CSR fields with new grouping indices and vector lengths, while the hazard logic ORs the comparator signals with the existing dependency checks.
Example hardware modules:
```systemverilog
// RGDetector.sv
// Flags a hazard when either operand address falls inside the register-group span.
module RGDetector(
  input  logic [12:0] RG_head, RG_tail, // inclusive bounds of the group
  input  logic [12:0] addr1, addr2,     // vrs1, vrs2 or vrd
  output logic        hazard
);
  logic in1, in2;
  // Continuous assignments: a declaration initializer would be evaluated only once.
  assign in1 = (addr1 >= RG_head) && (addr1 <= RG_tail);
  assign in2 = (addr2 >= RG_head) && (addr2 <= RG_tail);
  assign hazard = in1 || in2;
endmodule

// ShuffleEngine.sv (excerpt)
module ShuffleEngine #(
  parameter MAXG = 32, VEW = 32
)(
  input  logic [VEW*MAXG-1:0]     vec_in,
  input  logic [VEW*MAXG-1:0]     idx_in,
  output logic [VEW*MAXG-1:0]     vec_out,
  input  logic [$clog2(MAXG)-1:0] G // actual group size
);
  // Crossbar.sv routes input elements to output lanes as selected by idx_in.
  Crossbar #(.WIDTH(VEW), .N(MAXG)) xbar (
    .data_in  (vec_in),
    .select   (idx_in),
    .data_out (vec_out)
  );
endmodule
```
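To make the comparator behaviour easier to experiment with, the following is a minimal behavioural sketch in Python of the span check performed by the hazard detector. The function `rg_hazard` mirrors the two inclusive range comparisons; `groups_overlap` is a hypothetical helper, not taken from the source, that applies the same idea to two whole register groups.

```python
def rg_hazard(rg_head: int, rg_tail: int, addr1: int, addr2: int) -> bool:
    """True if either operand address lies in the inclusive span [rg_head, rg_tail]."""
    in1 = rg_head <= addr1 <= rg_tail
    in2 = rg_head <= addr2 <= rg_tail
    return in1 or in2


def groups_overlap(head_a: int, tail_a: int, head_b: int, tail_b: int) -> bool:
    """Hypothetical helper: True if two inclusive register spans share a register."""
    return head_a <= tail_b and head_b <= tail_a


if __name__ == "__main__":
    # An in-flight group occupies registers 8..11.
    print(rg_hazard(8, 11, 9, 40))        # one operand inside the span
    print(rg_hazard(8, 11, 4, 12))        # both operands outside the span
    print(groups_overlap(8, 11, 11, 14))  # the two spans share register 11
```

Because both bounds are inclusive, an operand equal to the head or the tail register counts as a conflict.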
2. Compiler Mechanisms for Data-Adaptive Register Allocation
In Zoozve, the LLVM-based compiler is adapted to orchestrate arbitrary register grouping through three principal passes:
- Intrinsic Splitting Pass ("SplitPass"): Analyzes each vector intrinsic in LLVM IR, calculates the required grouping factor $G$, segments the operation into $G$ sub-intrinsics, and surrounds these with `zv_begin_group(G)`/`zv_end_group(G)` delimiter intrinsics. The metadata `LMUL=G` marks the group.
- Register-Allocation Extension: The live intervals are augmented by a queue of register-group tuples. The allocation algorithm modifies the linear-scan approach so that contiguous physical registers are reserved for each RG.
- Assembly Coalescing ("CoalescePass"): Detects consecutive instructions with the same opcode, vector length, and stride in `v_head`. These are merged into a single "wide" Zoozve instruction.
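The contiguity requirement of the register-allocation step can be illustrated with a small Python sketch. It uses a plain first-fit search over a free map as a stand-in for the modified linear-scan allocator, which the source does not spell out; the names `allocate_group` and `release_group` are hypothetical.

```python
from typing import List, Optional


def allocate_group(free: List[bool], g: int) -> Optional[int]:
    """First-fit search for g contiguous free registers.

    Returns the head index of the reserved span, or None if no span fits.
    """
    if g < 1:
        raise ValueError("group size must be at least 1")
    run = 0  # length of the current run of free registers
    for idx, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == g:
            head = idx - g + 1
            for r in range(head, idx + 1):
                free[r] = False  # reserve the whole span
            return head
    return None


def release_group(free: List[bool], head: int, g: int) -> None:
    """Return the g registers starting at head to the free map."""
    for r in range(head, head + g):
        free[r] = True


if __name__ == "__main__":
    free = [True] * 16
    free[2] = False                 # register 2 is already live
    print(allocate_group(free, 4))  # head of the first 4-register gap
    print(allocate_group(free, 2))  # a smaller group can reuse registers 0..1
```

A real allocator would also weigh live-interval lengths and spilling; the sketch only shows why a group of size G needs G adjacent free registers rather than any G free registers.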
Compiler pseudocode:
```
for each zv_intrinsic I of length L:
    G   = ceil(L*VEW / VLEN)      # registers in the group
    EPR = VLEN / VEW              # elements per physical register
    for i in 0..G-1:
        # the last sub-intrinsic may cover fewer than EPR elements
        newI = clone(I) with element-range [i*EPR, min((i+1)*EPR, L))
        if i==0:
            annotate(newI, LMUL=G)
            insert before newI: zv_begin_group(G)
        if i==G-1:
            insert after newI: zv_end_group(G)
        emit newI
    remove original I
```
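The splitting step can also be written as a short runnable Python sketch. It assumes that each sub-intrinsic covers one physical register's worth of elements (VLEN / VEW) and that the last one is clipped to the logical length; the tuple-based output format and the function name `split_intrinsic` are illustrative choices, not part of the Zoozve toolchain.

```python
from typing import List, Tuple


def split_intrinsic(length: int, vew: int, vlen: int) -> List[Tuple[str, object]]:
    """Split one logical vector op of `length` elements into per-register sub-ops.

    Returns a list of (name, payload) tuples bracketed by group delimiters.
    """
    if length < 1:
        raise ValueError("length must be positive")
    epr = vlen // vew                        # elements per physical register
    g = (length * vew + vlen - 1) // vlen    # ceil(length * vew / vlen)
    ops: List[Tuple[str, object]] = [("zv_begin_group", g)]
    for i in range(g):
        start = i * epr
        end = min((i + 1) * epr, length)     # clip the last chunk
        ops.append(("sub_intrinsic", (start, end)))
    ops.append(("zv_end_group", g))
    return ops


if __name__ == "__main__":
    for op in split_intrinsic(length=100, vew=32, vlen=512):
        print(op)
```

With 100 elements of 32 bits and 512-bit registers, this yields seven sub-intrinsics, the last of which covers only elements 96 to 99.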
3. Mathematical Formulation and Alignment
- Effective LMUL (G): The register grouping factor is given by
$$G = \left\lceil \frac{L \cdot \mathrm{VEW}}{\mathrm{VLEN}} \right\rceil$$
where $L$ is the logical vector length, VEW is the vector element width, and VLEN is the physical vector register width.
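- Register Group Indices: The allocation pass assigns each group a head index $\mathrm{RG\_head}$ and a tail index $\mathrm{RG\_tail}$; for a contiguous group of $G$ registers these satisfy $\mathrm{RG\_tail} = \mathrm{RG\_head} + G - 1$.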
- Vector Length Alignment: Zoozve sets the actual vector length to
$$VL = G \cdot \frac{\mathrm{VLEN}}{\mathrm{VEW}}$$
so that the entire logical vector fits into one RG, avoiding partial strip-mining loops.
- New Vector-Length Setting: The programming interface sets the vector length to this aligned value, guaranteeing that all data fits exactly within RG boundaries.
This mathematical foundation directly mitigates the need for multiple strip-mining iterations inherent in standard RVV approaches, which require dividing long vectors into manageable chunks due to fixed register group sizes.
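As a worked illustration of these quantities, the sketch below computes a group plan in Python. It assumes the grouping factor is the ceiling of L·VEW / VLEN (as in the compiler pseudocode), that the tail index is the head plus G − 1 for a contiguous group, and that the aligned vector length equals the element capacity of the group, G·VLEN / VEW. The last two are assumptions made for this example, and `GroupPlan` and `plan_group` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPlan:
    g: int           # number of physical registers in the group
    rg_head: int     # first register of the group
    rg_tail: int     # last register of the group (inclusive)
    vl_aligned: int  # element capacity of the whole group (assumed VL)


def plan_group(length: int, vew: int, vlen: int, rg_head: int) -> GroupPlan:
    """Derive group size, register span, and aligned vector length."""
    if length < 1:
        raise ValueError("length must be positive")
    g = (length * vew + vlen - 1) // vlen  # integer ceiling division
    return GroupPlan(
        g=g,
        rg_head=rg_head,
        rg_tail=rg_head + g - 1,
        vl_aligned=g * (vlen // vew),
    )


if __name__ == "__main__":
    # 2048 elements of 32 bits with 512-bit registers, group starting at register 32.
    print(plan_group(length=2048, vew=32, vlen=512, rg_head=32))
    # A length that does not fill the last register exactly.
    print(plan_group(length=100, vew=32, vlen=512, rg_head=0))
```

In the second call the group holds 112 element slots for 100 elements, so the aligned length exceeds the logical length by the unused part of the last register.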
4. Evaluation: Instruction Count, Performance, and Area
Zoozve empirical results demonstrate significant advantages:
| Kernel | Data Size N | Instruction count (RVV vs. Zoozve) | Reduction (I_RVV / I_Zoozve) |
|---|---|---|---|
| FFT | 32 | 1010 vs. 100 | 10.1× |
| FFT | 2048 | 34444 vs. 100 | 344.4× |
| DotProd | 16384 | 1292 vs. 17 | 76× |
| AXPY | 16384 | 707 vs. 12 | 58.9× |
- Instruction Count Reduction: Dynamic instruction count is reduced by up to 344.4× (FFT, N=2048) and at least 10.1× (FFT, N=32). Even simple kernels (Dot Product, AXPY) benefit from reductions exceeding 50×.
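- Strip-Mining Iterations: Zoozve eliminates all but a single iteration for vector computations, unlike RVV, where the strip count is $\lceil N / \mathrm{VLMAX} \rceil$, with $\mathrm{VLMAX} = \mathrm{LMUL} \cdot \mathrm{VLEN} / \mathrm{SEW}$ the number of elements a single register group can hold.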
- Area Overhead: Baseline RVV core area (A_base) is 11.3 mm². Zoozve hardware modules account for an additional 0.6 mm² (total: 11.9 mm²), a reported 5.2% increase (with these rounded figures, 0.6 / 11.3 ≈ 5.3%).
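The strip-mining and area figures can be checked with a few lines of Python. The sketch relies on the standard RVV relation VLMAX = LMUL · VLEN / SEW with a maximum LMUL of 8; the 512-bit VLEN used in the example is an assumption chosen for illustration, since the evaluated configuration is not stated here, so the printed strip count is not meant to reproduce the table.

```python
def rvv_strip_count(n: int, sew: int, vlen: int, lmul: int = 8) -> int:
    """Strip-mining iterations a standard RVV loop needs for n elements."""
    vlmax = lmul * vlen // sew        # elements handled per iteration
    return (n + vlmax - 1) // vlmax   # integer ceiling division


def area_overhead(base_mm2: float, extra_mm2: float) -> float:
    """Relative area increase of the added modules over the baseline core."""
    return extra_mm2 / base_mm2


if __name__ == "__main__":
    # Assumed configuration: 32-bit elements, VLEN = 512 bits, LMUL = 8.
    print(rvv_strip_count(n=16384, sew=32, vlen=512))
    # Area figures quoted in the text above.
    print(f"{area_overhead(11.3, 0.6):.1%}")
```

Under the assumed configuration a 16384-element kernel needs 128 strips in standard RVV, whereas the arbitrary-LMUL scheme aims for a single pass over one register group.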
5. Limitations and Corner Cases
- Decode Latency Impact: The additional comparator and crossbar logic may introduce one extra decode cycle in pathological cases; this impact can often be hidden by appropriate instruction scheduling.
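- Minimum Group Size: For short vectors whose data occupies less than one register ($L \cdot \mathrm{VEW} < \mathrm{VLEN}$), a truncating computation of $G$ may yield zero. The compiler must enforce $G \ge 1$, ensuring at least one physical register is always allocated.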
- CSR Space Demands: Each RG requires 13+ bits for addressing via `v_head` fields, increasing CSR requirements well beyond the traditional 5-bit RVV register fields.
- Asymmetric Operation Complexity: Operations such as gather/scatter that leverage shuffling require extensive crossbar hardware. For a large crossbar dimension $M$, area complexity grows as $O(M^2)$.
- Register Group Granularity: Zoozve employs integer $G$ only; fractional registers cannot be grouped. If $L \cdot \mathrm{VEW}$ is not a multiple of VLEN, the last register in the RG may only be partially used, but strip-mining is avoided.
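Three of these corner cases lend themselves to a quick numerical check. The Python sketch below shows a ceiling division clamped to at least one register, the fill fraction of the last register in a group, and the crosspoint count of an M×M crossbar as a rough proxy for its area. All three function names are hypothetical, and the crosspoint count is a simplification rather than an area model from the source.

```python
def group_size(length: int, vew: int, vlen: int) -> int:
    """Grouping factor: ceiling division, clamped so that G >= 1."""
    return max(1, (length * vew + vlen - 1) // vlen)


def last_register_fill(length: int, vew: int, vlen: int) -> float:
    """Fraction of the last register in the group that holds live data."""
    bits = length * vew
    rem = bits % vlen
    if bits > 0 and rem == 0:
        return 1.0  # data ends exactly on a register boundary
    return rem / vlen


def crossbar_crosspoints(m: int) -> int:
    """Crosspoints of an m-by-m crossbar, a rough proxy for its area."""
    return m * m


if __name__ == "__main__":
    # Four 32-bit elements in a 512-bit register: truncation would give 0 registers.
    print((4 * 32) // 512, group_size(4, 32, 512))
    print(last_register_fill(100, 32, 512))  # last register only partly used
    print(crossbar_crosspoints(32), crossbar_crosspoints(64))
```

Doubling the crossbar dimension quadruples the crosspoint count, which is the quadratic growth the asymmetric-operation bullet refers to.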
A plausible implication is that while arbitrary register grouping improves computational throughput and resource allocation, extremely large $G$ settings may entail non-trivial hardware area growth and decode latency, necessitating architectural trade-offs (Xu et al., 22 Apr 2025).
6. Implementation Considerations and Integration
Zoozve's mechanisms are implementable via SystemVerilog templates for hardware (e.g., `RegFile1024.sv`, `RGDetector.sv`, `ShuffleEngine.sv`) and custom LLVM passes for compiler support. Together, these subsystems decompose logical vector operations, allocate them to arbitrary RGs, and execute them there, avoiding strip-mining through precise VL alignment and grouping. The formulas for the grouping factor $G$ and the aligned vector length map logical vector properties directly to hardware and compiler configuration, achieving high data-level parallelism and resource utilization (Xu et al., 22 Apr 2025).