
MHRC-Bench-Eval: HDL Code Completion Benchmark

Updated 14 January 2026
  • MHRC-Bench-Eval is a benchmark that assesses multilingual, repository-level code completion for hardware design languages and paradigms.
  • It covers diverse hardware description languages including Verilog, VHDL, Chisel, and HLS C/C++, rigorously challenging LLMs with cross-file dependency modeling.
  • Dual-layer annotations capture both code structure and hardware semantics to enable granular performance analysis and error diagnosis in LLM completions.

MHRC-Bench-Eval is the held-out repository-level evaluation split of MHRC-Bench, the first large-scale, multilingual benchmark for code completion in hardware description languages (HDLs) and hardware-oriented programming paradigms. It was introduced to rigorously measure the context-aware code generation capability of LLMs on unseen hardware repositories, with a focus on realistic, cross-file code completion for both Register-Transfer Level (RTL) and High-Level Synthesis (HLS) domains (Zou et al., 7 Jan 2026).

1. Objectives and Benchmark Design

MHRC-Bench-Eval is designed to assess a model’s repository-level code completion abilities in a heterogeneous hardware environment, evaluating both in-file context understanding and cross-file dependency resolution (such as imports, instantiations, macros). Unlike training splits, its completion targets and repositories are fully disjoint from MHRC-Bench-Train, ruling out memorization and requiring true generalization to previously unseen design patterns and coding conventions. MHRC-Bench-Eval provides dual-layer annotations—code-structure level and hardware-oriented semantic—for granular analysis of completion behaviors and model error modes.

2. Dataset Composition and Statistics

MHRC-Bench-Eval derives its samples from 584 repositories, with coverage across four major hardware description and design paradigms:

  • Chisel: 48 repositories, 319 completion examples
  • Verilog/SystemVerilog (V/SV): 247 repositories, 492 examples
  • VHDL: 239 repositories, 491 examples
  • HLS C/C++ (Xilinx HLS, SystemC): 50 repositories, 456 examples

In total, the benchmark contains 1,758 completion tasks. Each example constitutes a syntactically complete code node—typically between one and five lines—identified and extracted using Tree-sitter-based concrete syntax tree (CST) analysis.
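The node-selection step can be sketched with a toy CST, standing in for the Tree-sitter parse trees the benchmark actually uses; the node types and line spans below are illustrative assumptions, not taken from the dataset:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy CST node; the real extraction walks Tree-sitter parse trees."""
    type: str
    start_line: int
    end_line: int
    children: list = field(default_factory=list)

def candidate_targets(root, min_lines=1, max_lines=5):
    """Collect syntactically complete nodes whose line span falls in
    [min_lines, max_lines] -- the one-to-five-line range stated above."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        span = node.end_line - node.start_line + 1
        if min_lines <= span <= max_lines:
            out.append(node)
        stack.extend(node.children)
    return out

# Hypothetical Verilog module with an always block spanning 3 lines.
tree = Node("module", 1, 10, [
    Node("port_list", 1, 1),
    Node("always_block", 3, 5, [Node("nonblocking_assignment", 4, 4)]),
])
print([n.type for n in candidate_targets(tree)])
# → ['always_block', 'nonblocking_assignment', 'port_list']
```

The 10-line `module` node itself is skipped because it exceeds the 5-line cap, while its complete sub-nodes qualify as completion targets.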

To quantify the inter-file reasoning burden, the average number of external files required per completion is: Chisel 3.12, V/SV 4.02, VHDL 4.34, HLS C/C++ 5.85. These figures directly expose the challenge of cross-file context modeling for LLMs in hardware codebases.
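A cross-file dependency count of this kind can be approximated by scanning for reference constructs; the regex patterns below are simplifying assumptions for illustration, since the paper does not publish its exact dependency-resolution rules:

```python
import re

# Hypothetical per-language patterns for cross-file references.
PATTERNS = {
    "verilog": re.compile(r'`include\s+"([^"]+)"'),
    "vhdl": re.compile(r'use\s+work\.(\w+)', re.IGNORECASE),
    "chisel": re.compile(r'import\s+([\w.]+)'),
}

def external_refs(src, lang):
    """Return the set of names a source file references outside itself."""
    return set(PATTERNS[lang].findall(src))

v_src = '`include "alu_defs.vh"\nmodule top; endmodule\n'
print(external_refs(v_src, "verilog"))  # → {'alu_defs.vh'}
```

A real implementation would additionally resolve module instantiations and macros to concrete files, which is what drives the higher per-completion file counts reported above.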

3. Coverage: Languages and Coding Styles

MHRC-Bench-Eval encompasses three principal hardware design domains:

  • RTL (Register-Transfer Level) HDLs: Verilog/SystemVerilog and VHDL, classical hardware description languages for cycle-accurate behavioral and structural modeling.
  • High-Level Synthesis (HLS): C/C++-based design with hardware synthesis directives, bridging software idioms and low-level hardware semantics.
  • Generator-based Construction: Chisel, a hardware construction language embedded in Scala, enabling metaprogramming and parameterized RTL code generation.

Through this spectrum, MHRC-Bench-Eval uniquely captures the breadth of real-world hardware design, from low-level event-driven RTL to metaprogrammed and synthesis-driven workflows.

4. Annotation Protocol: Structure and Semantics

Each completion target is annotated according to two orthogonal axes:

4.1 Code-Structure-Level Labels

Target nodes are localized within the CST and labeled by their structural depth (with root at depth 0). The depth is bucketed into M=5 bins, mapping completion targets from top-level constructs (e.g. module declarations) to deeply nested expressions (e.g. complex combinational logic). This enables stratified analysis of how syntactic nesting influences completion difficulty.
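A minimal sketch of the bucketing, assuming uniform bin edges over the repository's maximum depth (the paper states only that depth is bucketed into M = 5 bins):

```python
def depth_bucket(depth, max_depth, m=5):
    """Map a CST depth (root = 0) into one of m bins.
    Uniform split is an assumption; the paper does not give bin edges."""
    if max_depth == 0:
        return 0
    return min(m - 1, depth * m // (max_depth + 1))

# A module declaration at the root vs. a deeply nested expression.
print(depth_bucket(0, 12))   # → 0 (top-level construct)
print(depth_bucket(12, 12))  # → 4 (deepest bin)
```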

4.2 Hardware-Oriented Semantic Labels

Hardware code features domain-specific semantic categories absent from software benchmarks. Each target is classified under one of nine roles:

| Index | Category Name | Example Constructs |
|---|---|---|
| 1 | Design Structure | module/entity declarations |
| 2 | Declaration and Definition | wire/reg, signal/generic, functions |
| 3 | Storage Block | memories, registers, processes |
| 4 | Computation Block | arithmetic operations, muxes |
| 5 | Control Flow Block | always, if, when, for loops |
| 6 | Interface | port lists, AXI interfaces |
| 7 | Property and Assertion | assert, PSL |
| 8 | Testbench Stimulus/Environment | pokes, testbench loops |
| 9 | Monitoring/Checking Logic | display, expect, golden checks |

These labels, assigned via node type and token-level keywords, support per-category reporting of model strengths and failure modes.
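The keyword side of that assignment can be sketched as a first-match lookup; the keyword table below is a hypothetical subset, since the paper does not publish the full node-type/keyword mapping:

```python
# Hypothetical (partial) keyword table for semantic labeling.
KEYWORDS = [
    ("Design Structure", ("module", "entity", "architecture")),
    ("Control Flow Block", ("always", "if", "when", "for", "process")),
    ("Property and Assertion", ("assert", "assume", "cover")),
    ("Monitoring/Checking Logic", ("$display", "expect", "report")),
    ("Declaration and Definition", ("wire", "reg", "signal", "function")),
]

def semantic_label(tokens):
    """Return the first category whose keyword set intersects the tokens."""
    toks = set(tokens)
    for label, kws in KEYWORDS:
        if toks & set(kws):
            return label
    return "Computation Block"  # fallback for pure expressions

print(semantic_label(["always", "@(posedge", "clk)"]))  # → Control Flow Block
```

The real protocol also consults CST node types, which disambiguates cases where keywords alone are insufficient (e.g. `reg` inside a storage block versus a declaration).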

5. Task Format and Evaluation Protocols

Each evaluation task consists of a snapshot of the repository's files (up to a 2,048-token context window), concatenated with explicit file delimiters:

--- FILE: <repo>/<path> (BEGIN)
[file contents]
--- FILE: <repo>/<path> (END)

In one file, a syntactic node is replaced with <TARGET>. The model is prompted to infill the missing code segment by generating exactly the target span, decoding with temperature = 0.0, top_p = 1.0, and max_length = 64, with symmetric trimming applied to the surrounding context as needed.
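Prompt assembly can be sketched as follows; whitespace tokenization and the trimming heuristic are simplifications of the paper's protocol, and the repository contents are invented for illustration:

```python
def build_prompt(files, target_file, target_span, max_tokens=2048):
    """Assemble a repository snapshot with explicit file delimiters,
    replacing the target character span with <TARGET>.  Splitting on
    whitespace stands in for a real tokenizer (a simplification)."""
    parts = []
    for path, src in files.items():
        if path == target_file:
            start, end = target_span
            src = src[:start] + "<TARGET>" + src[end:]
        parts.append(f"--- FILE: {path} (BEGIN)\n{src}\n--- FILE: {path} (END)")
    tokens = "\n".join(parts).split()
    if len(tokens) > max_tokens:
        # Symmetric trimming: drop excess context evenly from both ends.
        excess = len(tokens) - max_tokens
        tokens = tokens[excess // 2 : len(tokens) - (excess - excess // 2)]
    return " ".join(tokens)

repo = {
    "demo/top.v": "module top(input clk);\n  reg q;\n"
                  "  always @(posedge clk) q <= ~q;\nendmodule",
    "demo/defs.vh": "`define WIDTH 8",
}
prompt = build_prompt(repo, "demo/top.v", (30, 36))
print("<TARGET>" in prompt)  # → True
```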

Pre- and post-processing standardizes whitespace and strips comments. Evaluation metrics include:

  • Exact Match (EM): 1 iff the generation matches the reference exactly after cleaning.
  • Edit Similarity (ES): $\mathrm{ES} = 1 - \frac{\text{Levenshtein}(target,\, pred)}{\max(1,\, |target|,\, |pred|)}$
  • BLEU: standard token-level BLEU score.
  • CodeBLEU: combined metric integrating BLEU, weighted n-gram, syntax match, and dataflow match, with equal weights.
  • Compilation Rate: fraction of completions that elaborate/compile successfully when wrapped in a minimal harness using the target language’s reference toolchain (Verilator for V/SV, GHDL for VHDL, Scala/Chisel3 for Chisel, Clang syntax check for HLS).
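The ES metric is straightforward to implement from its definition; the Verilog snippets below are illustrative inputs:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_similarity(target, pred):
    """ES = 1 - Levenshtein(target, pred) / max(1, |target|, |pred|)."""
    return 1 - levenshtein(target, pred) / max(1, len(target), len(pred))

print(edit_similarity("assign y = a & b;", "assign y = a & b;"))  # → 1.0
print(round(edit_similarity("assign y = a & b;", "assign y = a | b;"), 3))
# → 0.941  (one substituted character over 17)
```

The `max(1, ...)` term in the denominator guards against division by zero when both strings are empty.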

6. Baseline Results and Error Analysis

Empirical results show that general-purpose code LLMs perform near the floor on MHRC-Bench-Eval (EM ≈ 0–1%). After fine-tuning on MHRC-Bench-Train, substantial improvements are observed:

| Model | Chisel EM/ES | V/SV EM/ES | VHDL EM/ES | HLS EM/ES |
|---|---|---|---|---|
| Qwen2.5-Coder-7B (pre) | 0% / 14.8% | 0% / 12.7% | 0% / 13.0% | 0.2% / 14.1% |
| Qwen2.5-Coder-7B (+Tune) | 26.6% / 66.7% | 39.0% / 70.8% | 34.5% / 71.9% | 35.5% / 69.5% |
| Qwen2.5-Coder-14B (+Tune) | 31.3% / 68.9% | 41.9% / 74.4% | 35.3% / 73.8% | 38.2% / 71.5% |
| GPT-5 (No RAG) | 24.8% / 56.7% | 34.5% / 60.9% | 15.4% / 52.0% | 31.4% / 63.7% |
  • CodeBLEU: 49–56% for fine-tuned models, 41–51% for GPT-5.
  • Compilation rate rises from ≈0% to 28–41% after fine-tuning.
  • “Declaration and Definition” achieves the highest EM; “Control Flow Block” and “Computation Block” remain most difficult.
  • Completion success decreases for multi-line and deeply nested (extreme CST depth) spans, a behavior differing from software benchmarks.
  • Retrieval-augmented LLMs provide marginal to significant improvement, contingent on retrieval quality; indiscriminate retrieval may degrade performance.

7. Implications and Research Recommendations

Findings from MHRC-Bench-Eval suggest several directions:

  • Fine-tuning on hardware code is essential; off-the-shelf models lack inductive biases for HDL and hardware-specific constructs.
  • Improvements in retrieval relevance could substantially benefit repository-level hardware code completion, particularly for HLS and V/SV.
  • Context windows >2,048 tokens yield diminishing returns for hardware code, in contrast to software completion benchmarks.
  • Two-axis annotation (structure + semantic) enables targeted model diagnosis and suggests the need for hardware-aware encoding architectures, such as programmable AST encoders or inductive biases for type and dataflow information.
  • Multi-line, hierarchical completions and complex control/data flow present persistent challenges, indicating future benchmark extensions should target these higher-order code reasoning abilities.

MHRC-Bench-Eval establishes a rigorous, richly annotated testbed for repository-level hardware code completion and is positioned to support advancements in specialized LLM architecture, retrieval strategies, and evaluation metrics focused on correct and functional hardware design generation (Zou et al., 7 Jan 2026).
