
C910 RISC-V Core Overview

Updated 10 February 2026
  • The C910 RISC-V core is a high-performance processor featuring superscalar, out-of-order execution with a deep 12-stage pipeline and comprehensive cache hierarchy.
  • Advanced microarchitectural techniques including dynamic register renaming, branch prediction, and robust scheduling contribute to significant IPC gains over in-order designs.
  • Extensive side-channel analysis reveals unique security challenges, prompting mitigation strategies across software, ISA-level, and microarchitectural layers.

The C910 RISC-V core, also known as the XuanTie C910, is a high-performance, high-efficiency processor core designed for superscalar, out-of-order execution on the RISC-V instruction set architecture (ISA). Developed by Alibaba's T-Head Semiconductor, the C910 is notable for its robust microarchitectural features, full RISC-V standard compliance (achieved in the course of open-source platform integration), and comprehensive cache subsystem, resulting in a performance profile that is highly competitive with contemporary open-source and proprietary designs. Its architecture, security profile, and system integration have been the subject of detailed empirical study and benchmarking in recent academic literature (Fu et al., 30 May 2025, Austa et al., 9 Oct 2025).

1. Microarchitecture and Pipeline Organization

The C910 implements a deep, 12-stage pipeline with dynamic superscalar issue and out-of-order (OoO) execution capabilities. The pipeline stages are:

  • Instruction Fetch (IF1, IF2): Dual-stream fetch, supporting up to three instructions per cycle.
  • Instruction Decode (ID) & Rename: Performs both decode and register renaming, essential for OoO execution.
  • Issue/Dispatch (IS): Assigns instructions to functional clusters and reservation stations.
  • Execution Units (EX1–EX4): Supports multiple parallel clusters: integer/branch, multiply/divide, floating-point, and load/store.
  • Memory (MEM): Handles memory accesses, with support for an optional extra memory stage.
  • Write-back (WB) & Commit (COM): Writes results and retires instructions from the reorder buffer (ROB).

The superscalar issue width is three instructions per cycle, and commit proceeds at up to three ROB entries per cycle, or up to nine instructions when 3-into-1 ROB-entry compression packs three instructions into each entry. The ROB features 64 entries, for an effective depth of up to 192 micro-ops, and the microarchitecture provides 96 physical integer and 64 floating-point registers with tag-based renaming. Wakeup/select logic and reservation stations enable efficient instruction scheduling and hazard elimination (Fu et al., 30 May 2025).
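The entry/compression arithmetic above can be sketched as a toy model: a 64-entry circular ROB in which each entry holds up to three compressed micro-ops, giving the effective 192-µop depth and up to nine micro-ops retired per cycle. This is an illustrative simplification, not a cycle- or RTL-accurate model of the C910.

```python
from collections import deque

class CompressedROB:
    """Toy model of a 64-entry reorder buffer whose entries each hold
    up to 3 micro-ops (3-into-1 compression). Illustrative only."""

    ENTRIES = 64
    OPS_PER_ENTRY = 3   # 3-into-1 compression
    COMMIT_WIDTH = 3    # entries retired per cycle

    def __init__(self):
        self.rob = deque()  # each element: list of up to 3 micro-ops

    def allocate(self, uops):
        """Pack a group of micro-ops into one ROB entry if possible."""
        if len(self.rob) >= self.ENTRIES or len(uops) > self.OPS_PER_ENTRY:
            return False  # structural stall
        self.rob.append(list(uops))
        return True

    def commit(self):
        """Retire up to COMMIT_WIDTH entries; return the retired micro-ops."""
        retired = []
        for _ in range(min(self.COMMIT_WIDTH, len(self.rob))):
            retired.extend(self.rob.popleft())
        return retired

# Effective depth: 64 entries x 3 micro-ops = 192 in-flight micro-ops,
# and up to 3 entries x 3 micro-ops = 9 retired per cycle.
```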

2. Cache Hierarchy and Memory Interface

The memory subsystem consists of:

  • L1 Instruction Cache: 64 KB, 4-way set-associative, physically indexed and tagged (PIPT).
  • L1 Data Cache: 64 KB, 4-way PIPT.
  • On-chip Memory Interface: Standards-based AXI4, adapted from the proprietary AXI-ACE bus.

In security-focused side-channel studies, an alternative C910 configuration is documented with a 32 KB L1 I-cache (write-through for instruction fetches), a 32 KB L1 D-cache (write-back), and a unified 2 MiB L2 cache (16-way). Pseudo-LRU replacement policies are used at all levels, with no set randomization or skewed indexing (Austa et al., 9 Oct 2025). Notably, the C910 includes a custom T-Head FLUSH.C instruction for cache control, which is leveraged in side-channel benchmark ports.

For open-source SoC integration (Cheshire), the C910 uses a shared 512 KB L2 cache and strictly compliant AXI4 interfaces (Fu et al., 30 May 2025).
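To make the cache geometry concrete, the following sketch computes the set index and tag for a PIPT cache with the 64 KB, 4-way L1 parameters given above. The 64 B line size is an assumption (a typical value, not stated in this overview).

```python
def cache_set_index(paddr, size_bytes=64 * 1024, ways=4, line_bytes=64):
    """Return (set index, tag) for a physically indexed, physically
    tagged (PIPT) cache. The 64 KB / 4-way geometry matches the C910
    L1 configuration; the 64 B line size is an assumed typical value."""
    n_sets = size_bytes // (ways * line_bytes)  # 64 KB / (4 * 64 B) = 256 sets
    offset_bits = line_bytes.bit_length() - 1   # 6 bits of line offset
    index_bits = n_sets.bit_length() - 1        # 8 bits of set index
    set_idx = (paddr >> offset_bits) & (n_sets - 1)
    tag = paddr >> (offset_bits + index_bits)
    return set_idx, tag

# Two addresses 16 KB apart (256 sets x 64 B) map to the same set and
# therefore compete for the 4 ways: the basis of same-set conflicts
# exploited by Prime+Probe.
a, b = 0x8000_0000, 0x8000_0000 + 256 * 64
assert cache_set_index(a)[0] == cache_set_index(b)[0]
```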

3. Out-of-Order Execution and Predictive Mechanisms

Comprehensive OoO mechanisms are implemented:

  • Register Renaming: Rename Map Table (RMT) maps architectural to physical registers, with free-list management, eliminating WAR/WAW hazards.
  • Reservation Stations (RS): Grouped per cluster; age-based priority encoding and tag-based wakeup facilitate up to three instruction issues per cycle.
  • ROB and Commit Logic: Circular ROB integrated with commit circuitry supporting three instructions per cycle.
  • Branch Prediction:
    • Two-level BTB: L0 (16-entry, fully-associative) and L1 (4K-entry, set-associative).
    • 32K-entry BHT with global/local arrays; a 12-entry RAS; and a 16-entry loop buffer.
  • Misprediction Recovery: Pipeline flush and redirect logic operate at the decode stage upon misprediction.

The combination of advanced branch prediction, a deep pipeline, and large TLBs increases microarchitectural state coverage and is directly linked to observed microarchitectural vulnerabilities (Austa et al., 9 Oct 2025).
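The renaming mechanism described above can be sketched minimally: a Rename Map Table (RMT) redirects architectural registers to physical registers drawn from a free list, so a second write to the same architectural register gets a fresh physical register and WAR/WAW hazards disappear. Sizes follow the 96-physical-integer-register figure quoted earlier; the 32 architectural registers are the standard RISC-V integer set.

```python
class RenameStage:
    """Minimal sketch of tag-based register renaming with a Rename Map
    Table (RMT) and free list, sized to the C910's 96 physical integer
    registers over RISC-V's 32 architectural ones. Illustrative only."""

    N_ARCH, N_PHYS = 32, 96

    def __init__(self):
        # Architectural register i initially maps to physical register i.
        self.rmt = list(range(self.N_ARCH))
        self.free_list = list(range(self.N_ARCH, self.N_PHYS))

    def rename(self, rd, rs1, rs2):
        """Rename one instruction: read source mappings, then allocate a
        fresh physical register for the destination (kills WAR/WAW)."""
        ps1, ps2 = self.rmt[rs1], self.rmt[rs2]
        if not self.free_list:
            return None  # stall: no free physical register
        pd = self.free_list.pop(0)
        old_pd = self.rmt[rd]  # reclaimed when this instruction commits
        self.rmt[rd] = pd
        return pd, ps1, ps2, old_pd

r = RenameStage()
# WAW hazard: two back-to-back writes to x5 receive distinct physical
# registers, so the second write need not wait for the first.
pd1, *_ = r.rename(rd=5, rs1=1, rs2=2)
pd2, *_ = r.rename(rd=5, rs1=3, rs2=4)
assert pd1 != pd2
```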

4. Side-Channel Analysis and Security Implications

Empirical assessment using a ported cache timing side-channel benchmark suite has identified the C910 as exhibiting twelve distinct timing types, a factor of two to three greater than the SiFive U54 and U74. Specifically:

  • L1 D-cache Prime+Probe: Reveals three conflict classes (same-set, adjacent-set, random), with Δμ up to ≈8 cycles and a combined channel capacity C≈1.8 bits.
  • L1 I-cache Flush+Reload: Two patterns (line- and page-grained), Δμ≈5 cycles, C≈0.9 bits.
  • Cross-page Flush (I→D Interference): Four classes influenced by TLB state, up to Δμ=12 cycles.
  • Evict+Time (L2): Patterns reflect L2 associativities, Δμ=6–10 cycles, C up to 1.2 bits.

Across all tests, 75% exhibited at least one exploitable channel and 40% exhibited more than one. The broader range of timing types, greater channel capacity, and unique vulnerabilities in the C910 are attributed to the deeper pipeline, larger L2, advanced predictor/TLB structures, and the custom FLUSH.C operation (a partial evict not present in U54/U74) (Austa et al., 9 Oct 2025).
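The Prime+Probe technique behind the first finding can be sketched on a single modeled cache set. The model uses true LRU for simplicity (the C910 documents pseudo-LRU), hypothetical tags, and probe misses standing in for the timing signal; it is a conceptual illustration, not the benchmark suite's implementation.

```python
from collections import OrderedDict

class SetAssocCache:
    """One 4-way cache set with LRU replacement (the C910 uses
    pseudo-LRU; true LRU is used here for simplicity)."""

    def __init__(self, ways=4):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, ordered by recency

    def access(self, tag):
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)       # refresh recency
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)  # evict the LRU line
            self.lines[tag] = None
        return hit

def prime_probe(victim_touches_set):
    """One Prime+Probe round against a single set: the attacker fills
    the set, the victim may access it, and the attacker re-measures.
    Returns the number of probe misses (the timing signal)."""
    cache = SetAssocCache(ways=4)
    attacker = ["A0", "A1", "A2", "A3"]
    for t in attacker:                 # prime: occupy all 4 ways
        cache.access(t)
    if victim_touches_set:
        cache.access("V")              # victim evicts one attacker line
    # Probe in reverse recency order to avoid evicting our own lines.
    return sum(not cache.access(t) for t in reversed(attacker))

# A victim access in the monitored set produces exactly one probe miss;
# no victim access produces none.
assert prime_probe(False) == 0
```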

Shannon capacity is used to quantify channel leakage:

C = \max_{p(x)} I(X;Y) = \max_{p(x)} \sum_{x \in \{0,1\}} \sum_{y} p(x)\, P(y \mid x)\, \log_2\!\left[\frac{P(y \mid x)}{P_Y(y)}\right]

The mean difference (Δμ) and Kullback–Leibler divergence (D_KL) quantify the distinguishability and shape of the leakage distributions.
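The capacity formula above can be evaluated numerically from empirical timing histograms. The sketch below maximizes I(X;Y) over the binary input distribution by simple grid search (a crude stand-in for Blahut–Arimoto); the example histograms are hypothetical, not measured C910 data.

```python
import math

def mutual_information(p1, cond):
    """I(X;Y) in bits for binary X with P(X=1) = p1 and conditional
    timing histograms cond[x][y] = P(Y = y | X = x)."""
    px = {0: 1 - p1, 1: p1}
    py = {}                                   # marginal P_Y(y)
    for x, dist in cond.items():
        for y, p in dist.items():
            py[y] = py.get(y, 0.0) + px[x] * p
    return sum(px[x] * p * math.log2(p / py[y])
               for x, dist in cond.items()
               for y, p in dist.items() if p > 0)

def capacity(cond, steps=1000):
    """C = max over p(x) of I(X;Y), by grid search (illustrative)."""
    return max(mutual_information(i / steps, cond) for i in range(1, steps))

# Hypothetical histograms: non-overlapping "hit" (10-cycle) and "miss"
# (18-cycle) latencies give a perfectly distinguishable 1-bit channel.
cond = {0: {10: 1.0}, 1: {18: 1.0}}
delta_mu = 18 - 10   # the corresponding mean difference, in cycles
assert abs(capacity(cond) - 1.0) < 1e-3
```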

5. Implementation, Physical Design, and Metrics

The modified C910 was mapped to GlobalFoundries 22nm FDX (GF22FDX) and integrated into the open-source Cheshire SoC. The implementation utilized SystemVerilog RTL, Synopsys Design Compiler, and Cadence Innovus, targeting a 1.3 GHz peak frequency (1.05 GHz at worst-case corners).

Area breakdown (excluding pads/macros) is as follows:

Microarchitectural Block             Area (kGE)   Approx. Area (mm²)
Front-End (IF1/IF2 + predictor)           1,200                0.192
Decode + Rename + Register File           1,050                0.168
Issue + RS + Free-list                      850                0.136
ROB + Commit Logic                          700                0.112
Integer/Branch EXU                          450                0.072
Mul/Div EXU                                 250                0.040
FP EXU (2 FPUs)                             400                0.064
LSU + LQ/SQ + AXI Interface                 650                0.104
L1I / L1D Cache (64 KB each)              1,200                0.192
Interconnect / Glue                         450                0.072
Total                                     8,400                1.344

Timing closure targeted 1.3 GHz at TT/25 °C; 61.4% of non-critical paths use lower-VT cells for reduced power.

Measured average IPC for typical Embench-IoT workloads is 1.61 (vs. 0.70 for CVA6, 0.94 for CVA6S+), representing a 119.5% improvement over scalar in-order designs. Power at 900 MHz (0.8 V, TT, 25 °C) is 168 mW for typical integer workloads. Energy and area efficiency are approximately 9 GOPS/W and 1.67 GOPS/mm², respectively (Fu et al., 30 May 2025).
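The efficiency figures can be cross-checked with back-of-envelope arithmetic, assuming throughput ≈ IPC × clock frequency with one operation per instruction (an assumption this overview does not state explicitly):

```python
# Quoted C910 operating point: IPC 1.61 at 900 MHz, 168 mW, 0.87 mm^2.
ipc, freq_ghz = 1.61, 0.9
power_w, area_mm2 = 0.168, 0.87

gops = ipc * freq_ghz            # ~1.45 GOPS, assuming 1 op/instruction
energy_eff = gops / power_w      # GOPS per watt
area_eff = gops / area_mm2       # GOPS per mm^2

assert round(energy_eff, 1) == 8.6   # consistent with the ~9 GOPS/W figure
assert round(area_eff, 2) == 1.67    # matches the quoted 1.67 GOPS/mm^2
```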

6. Comparative Performance, Efficiency, and Trade-Offs

Direct benchmarking and uniform-flow ASIC implementation permit head-to-head comparison with the in-order CVA6 and superscalar in-order CVA6S+ cores:

Metric                      CVA6    CVA6S+    C910
Area (mm²)                  0.50    0.53      0.87
Avg IPC                     0.70    0.94      1.61
Energy eff. (GOPS/W)        9.0     9.0       8.6
Area eff. (GOPS/mm²)        1.26    1.60      1.67

C910 exhibits a 75% area increase and notable power uplift (~80% vs. CVA6) but maintains energy and area efficiencies comparable to the best superscalar in-order core and matches or exceeds the performance-per-area of scalar designs above 500 MHz (Fu et al., 30 May 2025).

7. Security Considerations and Mitigation Strategies

The breadth and magnitude of timing side channels in the C910 raise substantial concerns for cryptographic and multi-tenant workloads. Approximately 1.8 bits of information leakage per cache access can be achieved, markedly above the SiFive U54 (~0.4 bits). The underlying causes are the deep pipeline, larger L2, advanced prediction units, and custom flush instructions.

The primary mitigation strategies include:

  • Software: Addition of random delays (noise), explicit constant-time algorithms, and cache partitioning (page/way).
  • ISA-level: Introduction of secure flush semantics ensuring total evict, or disabling caches for critical code regions.
  • Microarchitectural: Runtime (online) randomization of cache indexing and per-thread cache coloring.
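The constant-time software mitigation can be illustrated with secret comparison. A naive comparison exits on the first mismatch, so its running time leaks how many leading bytes the attacker guessed correctly; the constant-time version touches every byte regardless. This is a generic sketch, not code from the cited studies; in practice Python's standard-library `hmac.compare_digest` provides the same guarantee.

```python
import hmac

def naive_compare(a, b):
    """Early-exits on the first mismatch: running time depends on how
    many leading bytes match, leaking the secret through timing."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_compare(a, b):
    """Accumulates differences over every byte, so execution time is
    independent of where the first mismatch occurs."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0

secret, guess = b"correct horse", b"correct forse"
assert not constant_time_compare(secret, guess)
assert constant_time_compare(secret, secret)
# Prefer the standard library's constant-time primitive in real code:
assert hmac.compare_digest(secret, secret)
```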

These insights inform both defensive programming practice and future hardware design for risk mitigation in high-assurance computation scenarios (Austa et al., 9 Oct 2025).
