
Register Grouping (LMUL) in RISC-V

Updated 13 January 2026
  • Register Grouping (LMUL) is a technique that unifies contiguous physical vector registers into a single logical group, allowing arbitrary-length vector operations.
  • The Zoozve extension specifies hardware enhancements and LLVM compiler passes to dynamically allocate and manage these register groups, eliminating the need for traditional strip-mining.
  • Empirical evaluations show significant reductions in instruction count and strip-mining iterations, with minor area overhead, improving performance across data-parallel tasks.

Register Grouping (LMUL) refers to the technique by which multiple contiguous physical vector registers are allocated and treated as a unified logical vector register group (RG), enabling the execution of vector operations across a logical vector of arbitrary length. This approach, formalized within the Zoozve extension to RISC-V Vector Extension (RVV), allows flexible grouping ("arbitrary LMUL"), eliminating the need for traditional strip-mining in long vector computations and facilitating optimized resource usage and performance for data-parallel tasks (Xu et al., 22 Apr 2025).

1. Hardware Architecture for Arbitrary Register Grouping

Zoozve introduces substantial modifications to standard RISC-V vector hardware to support arbitrary RG formation:

  • Instruction Decode: The v_head field (13 bits) defines the starting register index for a group and is stored in a dedicated CSR (VHEAD_CSR). The decode logic is expanded to handle the arbitrary LMUL encoding.
  • Register File: Instead of the fixed 32-vector-register design, Zoozve employs a multi-bank register file with up to 1024 physical registers, implemented as RegFile1024.sv (parameters: NBANKS=16, NWORDS=64).
  • RGDetector.sv: This control-path module uses comparators to detect hazards (RAW/WAW) within RG spans, addressing inter-group conflicts and enforcing correct scheduling.
  • ShuffleEngine.sv: For operations requiring element shuffling, Zoozve’s data-path incorporates an M×M crossbar (Crossbar.sv) and programmable state machines to route data from input vector elements to the proper processing element (PE), supporting both symmetric (straight-through) and asymmetric (gather/scatter) operations.
  • CSR and Hazard Unit: Enhanced registers and logic units update CSR fields with new grouping indices and vector lengths, while the hazard logic ORs the RG comparator signals with the existing dependency checks.

Example hardware modules:

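// RGDetector.sv
// Flags a hazard when either operand register falls inside the
// inclusive span [RG_head, RG_tail] of a register group.
module RGDetector(
  input  logic [12:0] RG_head, RG_tail,
  input  logic [12:0] addr1, addr2, // vrs1, vrs2 or vrd
  output logic hazard
);
  logic in1, in2;
  // Continuous assignments: a declaration initializer would be evaluated only once.
  assign in1 = (addr1 >= RG_head) && (addr1 <= RG_tail);
  assign in2 = (addr2 >= RG_head) && (addr2 <= RG_tail);
  assign hazard = in1 || in2;
endmodule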

// ShuffleEngine.sv (excerpt)
module ShuffleEngine #(
  parameter MAXG = 32, VEW = 32
)(
  input  logic [VEW*MAXG-1:0] vec_in,
  input  logic [VEW*MAXG-1:0] idx_in,
  output logic [VEW*MAXG-1:0] vec_out,
  input  logic [$clog2(MAXG+1)-1:0] G  // actual group size, wide enough to hold MAXG
);
  // Crossbar (Crossbar.sv) routes each input element to the lane selected by idx_in.
  Crossbar #(.WIDTH(VEW), .N(MAXG)) xbar (
    .data_in  (vec_in),
    .select   (idx_in),
    .data_out (vec_out)
  );
endmodule
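
The comparator logic in RGDetector.sv can be mirrored by a small behavioral model, which is convenient for checking hazard cases before simulating the RTL. The sketch below is an illustration only: the function names are hypothetical, and the `groups_overlap` helper (a group-versus-group check) is an assumption that goes beyond the single-address comparison shown in the module.

```python
def in_group(addr: int, rg_head: int, rg_tail: int) -> bool:
    """True when a register index lies inside the inclusive span [rg_head, rg_tail]."""
    return rg_head <= addr <= rg_tail


def rg_hazard(rg_head: int, rg_tail: int, addr1: int, addr2: int) -> bool:
    """Mirror of the RGDetector comparators: either operand inside the group is a hazard."""
    return in_group(addr1, rg_head, rg_tail) or in_group(addr2, rg_head, rg_tail)


def groups_overlap(head_a: int, g_a: int, head_b: int, g_b: int) -> bool:
    """Assumed extension: two groups conflict when their inclusive spans intersect."""
    tail_a = head_a + g_a - 1
    tail_b = head_b + g_b - 1
    return head_a <= tail_b and head_b <= tail_a
```

For a group spanning registers 8 through 11, `rg_hazard(8, 11, 9, 40)` reports a hazard because register 9 lies inside the span, while `rg_hazard(8, 11, 12, 40)` does not.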

2. Compiler Mechanisms for Data-Adaptive Register Allocation

In Zoozve, the LLVM-based compiler is adapted to orchestrate arbitrary register grouping through three principal passes:

  • Intrinsic Splitting Pass ("SplitPass"): Analyzes each vector intrinsic in LLVM IR, calculates the required grouping factor $G = \lceil L \cdot VEW / VLEN \rceil$, segments the operation into $G$ sub-intrinsics, and surrounds these with zv_begin_group(G)/zv_end_group(G) delimiter intrinsics. Metadata LMUL=G marks the group.
  • Register-Allocation Extension: The live intervals are augmented by a queue tracking $(virtualregs[], G)$ tuples. The allocation algorithm modifies the linear-scan approach so that $G$ contiguous physical registers are reserved for each RG.
  • Assembly Coalescing ("CoalescePass"): Detects $G$ consecutive instructions with the same opcode, vector length, and stride in v_head. These are merged into a single "wide" Zoozve instruction.
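
The contiguity requirement of the register-allocation extension can be illustrated with a first-fit search over a free-register map. This is a minimal sketch under stated assumptions, not the allocator described in the source: it ignores live intervals and spilling, and the names `find_contiguous` and `allocate_group` are hypothetical.

```python
from typing import List, Optional


def find_contiguous(free: List[bool], g: int) -> Optional[int]:
    """Return the lowest head index of g consecutive free registers, or None."""
    if g < 1:
        raise ValueError("a register group needs at least one register")
    run = 0
    for idx, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == g:
            return idx - g + 1
    return None


def allocate_group(free: List[bool], g: int) -> Optional[int]:
    """Reserve g contiguous registers and return the group head (RG_head)."""
    head = find_contiguous(free, g)
    if head is None:
        return None  # a real allocator would spill or split the interval here
    for reg in range(head, head + g):
        free[reg] = False
    return head
```

With 16 registers of which only register 2 is busy, a request for a group of 3 skips the fragmented low registers and returns head 3; a later request for 2 registers can still use registers 0 and 1.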

Compiler pseudocode:

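for each zv_intrinsic I of length L:
  EPR = VLEN / VEW                  # elements per physical register
  G = max(1, ceil(L*VEW / VLEN))    # grouping factor, at least one register
  for i in 0..G-1:
    # slice i covers the elements held by the i-th register of the group
    newI = clone(I) with element-range [i*EPR, min((i+1)*EPR, L))
    if i==0:
      annotate(newI, LMUL=G)
      insert before newI: zv_begin_group(G)
    if i==G-1:
      insert after newI:  zv_end_group(G)
    emit newI
  remove original I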
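
A runnable sketch of the split-then-coalesce round trip may help make the two passes concrete. It is a simplification under stated assumptions: the `SubOp` and `WideOp` records and both function names are hypothetical, the LLVM IR and assembly are reduced to plain Python objects, and coalescing checks only the opcode, a unit stride in v_head, and contiguous element ranges.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubOp:
    opcode: str
    v_head: int   # physical register that holds this slice
    start: int    # first element index (inclusive)
    stop: int     # last element index (exclusive)


@dataclass
class WideOp:
    opcode: str
    v_head: int   # head register of the group
    g: int        # number of registers covered
    start: int
    stop: int


def split_intrinsic(opcode: str, length: int, vew: int, vlen: int,
                    rg_head: int) -> List[SubOp]:
    """Split one logical operation into G per-register sub-operations."""
    per_reg = vlen // vew                   # elements per physical register
    g = max(1, -(-length * vew // vlen))    # integer ceiling, at least 1
    return [SubOp(opcode, rg_head + i, i * per_reg,
                  min((i + 1) * per_reg, length)) for i in range(g)]


def coalesce(ops: List[SubOp]) -> List[WideOp]:
    """Merge runs of same-opcode sub-operations on consecutive registers."""
    merged: List[WideOp] = []
    for op in ops:
        last = merged[-1] if merged else None
        if (last is not None and last.opcode == op.opcode
                and last.v_head + last.g == op.v_head
                and last.stop == op.start):
            last.g += 1
            last.stop = op.stop
        else:
            merged.append(WideOp(op.opcode, op.v_head, 1, op.start, op.stop))
    return merged
```

For a 100-element operation with VEW = 32 and VLEN = 256, splitting yields 13 sub-operations (the last covering elements 96 to 99), and coalescing folds them back into a single wide operation headed at the first register of the group.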

3. Mathematical Formulation and Alignment

  • Effective LMUL (G): The register grouping factor is given by

$$G = \left\lceil \frac{L \times \text{VEW}}{\text{VLEN}} \right\rceil$$

where $L$ is the logical vector length, VEW is the vector element width, and VLEN is the physical vector register width.

  • Register Group Indices: The allocation pass assigns $RG_{head}$ and $RG_{tail} = RG_{head} + G - 1$.
  • Vector Length Alignment: Zoozve sets the actual vector length to

$$VL' = \left\lfloor \frac{L}{G} \right\rfloor \times G$$

so that the entire logical vector fits into one RG and partial strip-mining loops are avoided.

  • New Vector-Length Setting: The programming interface sets $\mathrm{VL} \leftarrow \mathrm{VL}'$, guaranteeing that all data fits exactly within RG boundaries.

This mathematical foundation directly mitigates the need for multiple strip-mining iterations inherent in standard RVV approaches, which require dividing long vectors into manageable chunks due to fixed register group sizes.
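
The formulas of this section can be evaluated directly. The following is a minimal sketch with hypothetical function names; the parameter values in the usage note (VEW = 32, VLEN = 256, head register 8) are illustrative assumptions, not figures from the source.

```python
def grouping_factor(length: int, vew: int, vlen: int) -> int:
    """G = ceil(L * VEW / VLEN), computed with integers and clamped to at least 1."""
    return max(1, -(-length * vew // vlen))


def rg_span(rg_head: int, g: int) -> tuple:
    """Return (RG_head, RG_tail) with RG_tail = RG_head + G - 1."""
    return rg_head, rg_head + g - 1


def aligned_vl(length: int, g: int) -> int:
    """VL' = floor(L / G) * G, the largest multiple of G not exceeding L."""
    return (length // g) * g
```

With L = 1024, the grouping factor is 128, a group headed at register 8 ends at register 135, and VL' equals 1024. As the formula is stated, VL' equals L only when G divides L; for L = 100 the grouping factor is 13 and VL' evaluates to 91.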

4. Evaluation: Instruction Count, Performance, and Area

Zoozve empirical results demonstrate significant advantages:

| Kernel  | Data size N | Instruction count (I_RVV vs. I_Zoozve) | Speedup |
|---------|-------------|----------------------------------------|---------|
| FFT     | 32          | 1010 vs. 100                           | 10.1×   |
| FFT     | 2048        | 34444 vs. 100                          | 344.4×  |
| DotProd | 16384       | 1292 vs. 17                            | 76×     |
| AXPY    | 16384       | 707 vs. 12                             | 58.9×   |

  • Instruction Count Reduction: Dynamic instruction count is reduced by up to 344.4× (FFT, N=2048) and at least 10.1× (FFT, N=32). Even simple kernels (Dot Product, AXPY) benefit from reductions exceeding 50×.
  • Strip-Mining Iterations: Zoozve eliminates all but a single iteration for vector computations, unlike RVV, where the strip count is $\lceil N / (VL_{max} \cdot LMUL) \rceil$.
  • Area Overhead: Baseline RVV core area (A_base) is 11.3 mm². Zoozve hardware modules account for an additional 0.6 mm² (total: 11.9 mm²), representing a 5.2% increase:

$$\frac{\Delta A}{A_{base}} = \frac{0.6}{11.3} \approx 5.2\%$$
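
The strip-count formula and the instruction-count ratios in the table can be checked with a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and the VL_max and LMUL values in the usage note are assumed for illustration rather than taken from the evaluation.

```python
def strip_count(n: int, vl_max: int, lmul: int) -> int:
    """Strip-mining iterations under RVV: ceil(N / (VL_max * LMUL))."""
    return -(-n // (vl_max * lmul))


def reduction(i_rvv: int, i_zoozve: int) -> float:
    """Ratio of dynamic instruction counts, rounded to one decimal place."""
    return round(i_rvv / i_zoozve, 1)


# Instruction counts as reported in the table: (I_RVV, I_Zoozve).
TABLE = {
    ("FFT", 32): (1010, 100),
    ("FFT", 2048): (34444, 100),
    ("DotProd", 16384): (1292, 17),
    ("AXPY", 16384): (707, 12),
}

ratios = {key: reduction(*counts) for key, counts in TABLE.items()}
```

Assuming VL_max = 8 and LMUL = 8, a kernel with N = 16384 would need `strip_count(16384, 8, 8)` = 256 strip-mining iterations under RVV, compared with the single iteration described for Zoozve. The `ratios` dictionary reproduces the speedup column of the table.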

5. Limitations and Corner Cases

  • Decode Latency Impact: The additional comparator and crossbar logic may introduce one extra decode cycle in pathological cases; this impact can often be hidden by appropriate instruction scheduling.
  • Minimum Group Size: For $VL < VLEN/VEW$, $G$ may calculate to zero. The compiler must enforce $G \geq 1$, ensuring at least one physical register is always allocated.
  • CSR Space Demands: Each RG requires 13+ bits for addressing via $v_{head}$ fields, increasing CSR requirements well beyond the traditional 5-bit RVV register fields.
  • Asymmetric Operation Complexity: Operations such as gather/scatter that leverage shuffling require extensive crossbar hardware. For large $G$, area complexity grows as $O(G^2)$.
  • Register Group Granularity: Zoozve employs integer $G \geq 1$ only; fractional registers cannot be grouped. If $L \cdot VEW$ is not a multiple of VLEN, the last register in the RG may only be partially used, but strip-mining is avoided.

A plausible implication is that while arbitrary register grouping improves computational throughput and resource allocation, extremely large $G$ settings may entail non-trivial hardware area growth and decode latency, necessitating architectural trade-offs (Xu et al., 22 Apr 2025).
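
Two of the corner cases above lend themselves to a quick numeric illustration: the partially used last register and the quadratic growth of an M×M crossbar. The sketch below uses hypothetical function names and counts crosspoints as a simple stand-in for area, which is an assumption rather than a figure from the source.

```python
def last_register_utilization(length: int, vew: int, vlen: int) -> tuple:
    """Return (G, fraction of the last register in the group that holds data)."""
    g = max(1, -(-length * vew // vlen))           # integer ceiling, at least 1
    used_bits = length * vew - (g - 1) * vlen      # bits spilling into the last register
    return g, used_bits / vlen


def crossbar_crosspoints(m: int) -> int:
    """An M x M crossbar needs M * M crosspoints, hence the O(G^2) growth."""
    return m * m
```

For L = 100, VEW = 32, and VLEN = 256, the group has 13 registers and the last one is half full. Doubling the crossbar size from 32 to 64 ports quadruples the crosspoint count from 1024 to 4096.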

6. Implementation Considerations and Integration

Zoozve’s mechanisms are implementable via SystemVerilog templates for hardware (e.g., RegFile1024.sv, RGDetector.sv, ShuffleEngine.sv) and custom LLVM passes for compiler support. These subsystems collectively ensure that logical vector operations are decomposed, allocated, and executed across arbitrary RGs, fully avoiding strip-mining through precise VL alignment and grouping. The formulas for $G$ and $VL'$ directly map logical vector properties to hardware and compiler configuration, achieving high data-level parallelism and resource utilization (Xu et al., 22 Apr 2025).
