
Efficient Architecture for RISC-V Vector Memory Access

Published 11 Apr 2025 in cs.AR and cs.DC (2504.08334v3)

Abstract: Vector processors frequently suffer from inefficient memory accesses, particularly for strided and segment patterns. While coalescing strided accesses is a natural solution, effectively gathering or scattering elements at fixed strides remains challenging. Naive approaches rely on high-overhead crossbars that remap any byte between memory and registers, leading to physical design issues. Segment operations require row-column transpositions, typically handled using either element-level in-place transposition (degrading performance) or large buffer-based bulk transposition (incurring high area overhead). In this paper, we present EARTH, a novel vector memory access architecture designed to overcome these challenges through shifting-based optimizations. For strided accesses, EARTH integrates specialized shift networks for gathering and scattering elements. After coalescing multiple accesses within the same cache line, data is routed between memory and registers through the shifting network with minimal overhead. For segment operations, EARTH employs a shifted register bank enabling direct column-wise access, eliminating dedicated segment buffers while providing high-performance, in-place bulk transposition. Implemented on FPGA with Chisel HDL based on an open-source RISC-V vector unit, EARTH enhances performance for strided memory accesses, achieving 4x-8x speedups in benchmarks dominated by strided operations. Compared to conventional designs, EARTH reduces hardware area by 9% and power consumption by 41%, significantly advancing both performance and efficiency of vector processors.

Summary

  • The paper introduces EARTH, a novel architecture using shifting-based optimizations to improve constant-stride and segment memory accesses in RISC-V vector processors.
  • The design combines three modules (DROM, LSDO, and RCVRF) to coalesce strided memory transactions and transpose segments in place, achieving speedups of up to 14.7x on stride-intensive tests and reducing area by 9.11%.
  • By eliminating large segment buffers, EARTH cuts power consumption by 29-41% while maintaining performance on unit-stride and segment-intensive operations.

This paper introduces EARTH (Efficient Architecture for RISC-V Vector Memory Access), a novel hardware architecture designed to improve the efficiency of memory accesses in RISC-V vector processors, particularly for constant-stride and segment memory operations (2504.08334). Current vector processors often struggle with these patterns, leading to performance bottlenecks or excessive hardware overhead.

Problem:

  • Constant-stride Access: Existing designs either issue multiple memory requests for elements within the same cache line (inefficient) or use complex, high-overhead crossbar networks to gather/scatter data after coalescing requests (costly in area and power, complex routing).
  • Segment Access: Handling the required row-column transposition typically involves either processing elements one by one (severely limiting throughput) or using large, dedicated segment buffers (consuming significant area and power).
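To make the constant-stride problem concrete, the following minimal sketch counts how many memory requests a strided access needs with and without coalescing. The function name `coalesce`, the 64-byte line size, and the requirement that no element straddles a line boundary are illustrative assumptions, not details taken from the paper.

```python
def coalesce(base, stride, count, elem_bytes, line_bytes=64):
    """Group element indices by the aligned line their address falls in.

    Hypothetical helper: assumes non-negative addresses and elements
    that never straddle a line boundary.
    """
    groups = {}
    for i in range(count):
        addr = base + i * stride
        # Each element must fit entirely inside one aligned line.
        assert addr % line_bytes + elem_bytes <= line_bytes
        line = addr // line_bytes * line_bytes
        groups.setdefault(line, []).append(i)
    return groups


# 16 four-byte elements at a 12-byte stride: a naive design issues
# 16 requests, a coalescing design issues one per touched line.
groups = coalesce(base=0x1000, stride=12, count=16, elem_bytes=4)
naive_requests = 16
coalesced_requests = len(groups)
```

In this configuration the 16 elements fall into three aligned lines, so coalescing replaces 16 requests with 3. The remaining cost is extracting the wanted elements from each returned line, which is the part a crossbar makes expensive.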

EARTH Architecture:

EARTH addresses these issues using shifting-based optimization strategies, implemented through three core innovations integrated into an open-source RISC-V vector unit (Saturn):

  1. Data Reorganization Module (DROM):
    • This module is central to EARTH and uses specialized shift networks: Scatter Shift Network (SSN) and Gather Shift Network (GSN).
    • These networks are implemented as layered structures where each layer performs power-of-2 shifts. They are designed to be conflict-free, ensuring data elements can be efficiently rearranged without path interference.
    • A Shift Count Generation (SCG) module calculates the necessary shift amounts based on stride, element width, and offset.
    • DROM handles the fundamental tasks of gathering (reorganizing strided/scattered data into sequential order) and scattering (distributing sequential data into strided/scattered positions).
  2. Load/Store Data Organization (LSDO):
    • Specifically targets constant-stride accesses.
    • Utilizes DROM and a Reverser module (for negative strides).
    • Enables coalescing multiple strided memory accesses within the same aligned memory region (e.g., cache line) into a single memory transaction.
    • DROM efficiently extracts (gathers) the required elements from the coalesced memory response during loads or arranges (scatters) elements correctly into the memory line during stores.
  3. Row/Column-accessible Vector Register File (RCVRF):
    • Addresses segment access inefficiency without dedicated segment buffers.
    • It consists of a Shifted VRF, DROM, and Block Circular Shifters.
    • The Shifted VRF partitions registers into banks (8 banks, ELEN-width) using a circular-shifted mapping. This allows simultaneous access to elements belonging to the same segment (column-wise access) distributed across different registers, as well as standard single-register access (row-wise access).
    • For column-wise access (needed for segment operations), data read from parallel banks is first aligned by the Block Shifter and then reorganized by DROM into the correct sequential order. For writes, DROM scatters the data before it's written to the banks.
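The layered power-of-2 shifting described for DROM can be sketched in software as follows. This is a behavioural model under stated assumptions, not the paper's GSN: slots are element-granular rather than byte-granular, the stride is positive, and the shift count of element i is taken to be offset + i * (stride - 1), an assumed formula rather than one quoted from the paper. The names `shift_counts` and `gather_shift` are hypothetical.

```python
def shift_counts(offset, stride, count):
    """Assumed SCG behaviour: slots element i must move toward index 0."""
    return [offset + i * (stride - 1) for i in range(count)]


def gather_shift(line, offset, stride, count):
    """Gather strided elements of `line` into slots 0..count-1.

    Each layer moves an element left by `step` slots when the matching
    bit of its remaining shift count is set (least significant bit first).
    """
    n = len(line)
    assert stride >= 1 and offset + (count - 1) * stride < n
    slots = [None] * n  # None marks a slot holding no wanted element
    for i, shift in enumerate(shift_counts(offset, stride, count)):
        src = i + shift  # equals offset + i * stride
        slots[src] = (line[src], shift)
    step = 1
    while step < n:
        nxt = [None] * n
        for pos, slot in enumerate(slots):
            if slot is None:
                continue
            value, remaining = slot
            dst = pos - step if remaining & step else pos
            # A conflict would mean two elements need the same wire.
            assert nxt[dst] is None, "two elements routed to one slot"
            nxt[dst] = (value, remaining & ~step)
        slots = nxt
        step *= 2
    return [slots[i][0] for i in range(count)]


gathered = gather_shift(list(range(64)), offset=3, stride=5, count=12)
```

Because the shift counts never decrease from one element to the next, applying the bits from least to most significant keeps every wanted element in a distinct slot after each layer, so the assertion never fires in this model. A scatter network would apply the same layers in the opposite order and direction.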

Implementation and Flow:

  • EARTH was implemented in Chisel HDL and integrated into the Saturn vector unit on an FPGA platform.
  • Strided Flow: The Load/Store Address Sequencer (LAS/SAS) splits each instruction so that coalescing within aligned memory regions is maximized, then sends the requests to memory. Ordered responses go to LSDO, where DROM gathers or scatters the data using the generated shift counts, followed by byte alignment. Data is written to RCVRF (loads) or read from it (stores) row-wise.
  • Segment Flow: EARTH uses a "Segment-wise" approach. LAS splits requests based on segments, coalescing accesses within the same segment and memory region. Ordered responses are byte-aligned in LSDO and then written to RCVRF using its column-wise access capability, leveraging the Shifted VRF and DROM for transposition.
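The column-wise write used in the segment flow can be illustrated with a small model of a circular-shifted bank mapping. The mapping bank = (register + element) mod 8, the class name `ShiftedVRF`, and the restriction to eight elements per register are assumptions made for illustration. The paper's Block Circular Shifters and DROM stages are not modelled; the sketch only shows why row and column accesses both avoid bank conflicts.

```python
BANKS = 8  # the summary states 8 banks; one element per bank per register is assumed


def bank_of(reg, elem):
    """Assumed circular-shifted mapping of (register, element) to a bank."""
    return (reg + elem) % BANKS


class ShiftedVRF:
    def __init__(self, num_regs):
        # banks[b][reg] holds the one element of `reg` that lives in bank b.
        self.banks = [[None] * num_regs for _ in range(BANKS)]

    def read_row(self, reg):
        """Row-wise access: all elements of one register, one per bank."""
        return [self.banks[bank_of(reg, e)][reg] for e in range(BANKS)]

    def write_col(self, first_reg, elem, values):
        """Column-wise access: element `elem` of consecutive registers."""
        assert len(values) <= BANKS
        used = set()
        for k, value in enumerate(values):
            reg = first_reg + k
            bank = bank_of(reg, elem)
            assert bank not in used, "bank conflict"
            used.add(bank)
            self.banks[bank][reg] = value


# Segment load with 3 fields: memory holds f0 f1 f2 | f0 f1 f2 | ...
vrf = ShiftedVRF(num_regs=4)
for seg in range(8):
    fields = [10 * seg + f for f in range(3)]  # one segment from memory
    vrf.write_col(first_reg=0, elem=seg, values=fields)
field1 = vrf.read_row(1)  # register 1 now holds field 1 of every segment
```

Each segment is written in one column access, and each destination register is later read in one row access, so the transposition happens in the register file itself without a separate segment buffer.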

Evaluation:

  • Performance: Compared to the baseline Saturn design, EARTH showed significant speedups on benchmarks dominated by constant-stride accesses (4x–8x, up to 14.7x on stride-intensive tests). Performance on unit-stride and segment-heavy benchmarks remained comparable (within ±3% for unit-stride, ~1.0x for segment-intensive). This demonstrates efficient stride handling and buffer-free segment support without performance loss. Compared to a commercial SpacemiT X60 core (adjusted for frequency and configuration), EARTH showed competitive or superior performance on most benchmarks except those heavy on indexed operations.
  • Area: EARTH eliminated the need for large segment buffers. While the RCVRF area increased slightly due to DROM/shifters, the VLSU area decreased significantly. Overall, this led to a 9.11% area reduction in the larger P-Config (VLEN=512) compared to Saturn.
  • Power: EARTH reduced power consumption by 29-41% compared to Saturn, primarily due to eliminating segment buffer overhead and reducing memory transactions via coalescing, which lowered internal power despite a slight increase in switching power from the shift logic.

Conclusion:

EARTH presents an effective architecture that significantly improves the performance and efficiency of RISC-V vector processors by tackling key memory access bottlenecks (constant-stride and segment patterns) using novel shifting-based techniques (DROM, LSDO, RCVRF). It achieves substantial speedups for strided operations and eliminates the area/power overhead of segment buffers without sacrificing segment performance, offering a promising design paradigm for future vector processors.
