MHRC-Bench-Eval: HDL Code Completion Benchmark
- MHRC-Bench-Eval is a benchmark that assesses multilingual, repository-level code completion for hardware design languages and paradigms.
- It covers diverse hardware description languages including Verilog, VHDL, Chisel, and HLS C/C++, rigorously challenging LLMs with cross-file dependency modeling.
- Dual-layer annotations capture both code structure and hardware semantics to enable granular performance analysis and error diagnosis in LLM completions.
MHRC-Bench-Eval is the held-out repository-level evaluation split of MHRC-Bench, the first large-scale, multilingual benchmark for code completion in hardware description languages (HDLs) and hardware-oriented programming paradigms. It was introduced to rigorously measure the context-aware code generation capability of LLMs on unseen hardware repositories, with a focus on realistic, cross-file code completion for both Register-Transfer Level (RTL) and High-Level Synthesis (HLS) domains (Zou et al., 7 Jan 2026).
1. Objectives and Benchmark Design
MHRC-Bench-Eval is designed to assess a model’s repository-level code completion abilities in a heterogeneous hardware environment, evaluating both in-file context understanding and cross-file dependency resolution (such as imports, instantiations, macros). Unlike training splits, its completion targets and repositories are fully disjoint from MHRC-Bench-Train, ruling out memorization and requiring true generalization to previously unseen design patterns and coding conventions. MHRC-Bench-Eval provides dual-layer annotations—code-structure level and hardware-oriented semantic—for granular analysis of completion behaviors and model error modes.
2. Dataset Composition and Statistics
MHRC-Bench-Eval derives its samples from 584 repositories, with coverage across four major hardware description and design paradigms:
- Chisel: 48 repositories, 319 completion examples
- Verilog/SystemVerilog (V/SV): 247 repositories, 492 examples
- VHDL: 239 repositories, 491 examples
- HLS C/C++ (Xilinx HLS, SystemC): 50 repositories, 456 examples
In total, the benchmark contains 1,758 completion tasks. Each example constitutes a syntactically complete code node—typically between one and five lines—identified and extracted using Tree-sitter-based concrete syntax tree (CST) analysis.
To quantify the inter-file reasoning burden, the average number of external files required per completion is: Chisel 3.12, V/SV 4.02, VHDL 4.34, HLS 5.85. These figures directly expose the challenge of cross-file context modeling for LLMs in hardware codebases.
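The per-language averages above can be computed with a short aggregation over the annotated examples; a minimal sketch, assuming each example record carries an illustrative `language` field and a list of `external_files` it depends on (both field names are assumptions, not the benchmark's schema):

```python
from collections import defaultdict

def avg_external_files(examples: list[dict]) -> dict[str, float]:
    """Average count of external-file dependencies per completion, by language."""
    totals, counts = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["language"]] += len(ex["external_files"])
        counts[ex["language"]] += 1
    return {lang: totals[lang] / counts[lang] for lang in totals}
```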
3. Coverage: Languages and Coding Styles
MHRC-Bench-Eval encompasses three principal hardware design domains:
- RTL (Register-Transfer Level) HDLs: Verilog/SystemVerilog and VHDL, classical hardware description languages for cycle-accurate behavioral and structural modeling.
- High-Level Synthesis (HLS): C/C++-based design with hardware synthesis directives, bridging software idioms and low-level hardware semantics.
- Generator-based Construction: Chisel, a hardware construction language embedded in Scala, enabling metaprogramming and parameterized RTL code generation.
Through this spectrum, MHRC-Bench-Eval uniquely captures the breadth of real-world hardware design, from low-level event-driven RTL to metaprogrammed and synthesis-driven workflows.
4. Annotation Protocol: Structure and Semantics
Each completion target is annotated according to two orthogonal axes:
4.1 Code-Structure-Level Labels
Target nodes are localized within the CST and labeled by their structural depth (with root at depth 0). The depth is bucketed into M=5 bins, mapping completion targets from top-level constructs (e.g. module declarations) to deeply nested expressions (e.g. complex combinational logic). This enables stratified analysis of how syntactic nesting influences completion difficulty.
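The depth bucketing can be sketched as a simple mapping from a node's CST depth to one of M = 5 bins; the exact binning scheme is not specified in the source, so the linear scheme below (normalized against the tree's maximum depth) is an illustrative assumption:

```python
def depth_bucket(depth: int, max_depth: int, m: int = 5) -> int:
    """Map a CST node depth (root = 0) to a bin index in [0, m - 1].

    Linear bucketing over the observed depth range; the benchmark's
    actual binning rule may differ.
    """
    if max_depth == 0:
        return 0
    return min(m - 1, depth * m // (max_depth + 1))
```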
4.2 Hardware-Oriented Semantic Labels
Hardware code features domain-specific semantic categories absent from software benchmarks. Each target is classified under one of nine roles:
| Category Index | Category Name | Example Constructs |
|---|---|---|
| 1 | Design Structure | module/entity declarations |
| 2 | Declaration and Definition | wire/reg, signal/generic, functions |
| 3 | Storage Block | memories, registers, processes |
| 4 | Computation Block | arithmetic operations, muxes |
| 5 | Control Flow Block | always, if, when, for loops |
| 6 | Interface | port lists, AXI interfaces |
| 7 | Property and Assertion | assert, PSL |
| 8 | Testbench Stimulus/Environment | pokes, testbench loops |
| 9 | Monitoring/Checking Logic | display, expect, golden checks |
These labels, assigned via node type and token-level keywords, support per-category reporting of model strengths and failure modes.
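The keyword-driven part of this labeling can be sketched as a first-match lookup over the target's tokens. The category names follow the table above, but the keyword lists and the fallback rule are illustrative assumptions (the benchmark also uses CST node types, which this sketch omits):

```python
# Illustrative keyword lists; the benchmark's actual rules also consult
# CST node types and cover all nine categories.
CATEGORY_KEYWORDS = {
    "Design Structure": ["module", "entity", "endmodule"],
    "Declaration and Definition": ["wire", "reg", "signal", "generic", "function"],
    "Control Flow Block": ["always", "if", "when", "for"],
    "Property and Assertion": ["assert", "assume", "cover"],
    "Monitoring/Checking Logic": ["$display", "expect"],
}

def classify_target(tokens: list[str]) -> str:
    """Return the first semantic category whose keywords match a token."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(tok in keywords for tok in tokens):
            return category
    return "Computation Block"  # assumed fallback for plain expressions
```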
5. Task Format and Evaluation Protocols
Each evaluation task consists of a snapshot of all repository files (up to a 2,048-token window), using explicit file delimiters:
```
--- FILE: <repo>/<path> (BEGIN)
[file contents]
--- FILE: <repo>/<path> (END)
```

The completion location is marked with `<TARGET>`. The model is prompted to infill the missing code segment by generating exactly the target span, under temperature=0.0, top_p=1.0, max_length=64, applying symmetric trimming to the surrounding context as needed.
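Prompt assembly under this protocol can be sketched as follows; whitespace-splitting stands in for the model's real tokenizer, and the trimming rule (an equal token window on each side of `<TARGET>`) is an assumed reading of "symmetric trimming":

```python
def build_prompt(files: dict[str, str], budget: int = 2048) -> str:
    """Concatenate repository files with BEGIN/END delimiters, then
    symmetrically trim around <TARGET> to fit the token budget."""
    parts = []
    for path, contents in files.items():
        parts.append(f"--- FILE: {path} (BEGIN)")
        parts.append(contents)
        parts.append(f"--- FILE: {path} (END)")
    prompt = "\n".join(parts)
    tokens = prompt.split()  # illustrative stand-in for real tokenization
    if len(tokens) <= budget:
        return prompt
    center = next(i for i, t in enumerate(tokens) if "<TARGET>" in t)
    half = budget // 2
    lo, hi = max(0, center - half), min(len(tokens), center + half)
    return " ".join(tokens[lo:hi])
```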
Pre- and post-processing standardizes whitespace and strips comments. Evaluation metrics include:
- Exact Match (EM): 1 iff the generation matches the reference exactly after cleaning.
- Edit Similarity (ES): character-level similarity between generation and reference, computed as one minus the normalized Levenshtein distance.
- BLEU: standard token-level BLEU score.
- CodeBLEU: combined metric integrating BLEU, weighted n-gram, syntax match, and dataflow match, with equal weights.
- Compilation Rate: fraction of completions that elaborate/compile successfully when wrapped in a minimal harness using the target language’s reference toolchain (Verilator for V/SV, GHDL for VHDL, Scala/Chisel3 for Chisel, Clang syntax check for HLS).
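The two headline metrics can be sketched directly; the ES formula below (one minus normalized Levenshtein distance) is the definition common to code-completion benchmarks, and the benchmark's exact cleaning rules may differ:

```python
def exact_match(pred: str, ref: str) -> int:
    """EM: 1 iff generation equals reference after whitespace cleaning."""
    return int(pred.strip() == ref.strip())

def edit_similarity(pred: str, ref: str) -> float:
    """ES: 1 - Levenshtein(pred, ref) / max(len(pred), len(ref))."""
    a, b = pred.strip(), ref.strip()
    if not a and not b:
        return 1.0
    # Classic dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))
```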
6. Baseline Results and Error Analysis
Empirical results show that general-purpose code LLMs achieve near-zero exact match on MHRC-Bench-Eval (EM ≈ 0–1%). After fine-tuning on MHRC-Bench-Train, substantial improvements are observed:
| Model | Chisel EM/ES | V/SV EM/ES | VHDL EM/ES | HLS EM/ES |
|---|---|---|---|---|
| Qwen2.5-Coder-7B (pre) | 0 % / 14.8 % | 0 % / 12.7 % | 0 % / 13.0 % | 0.2 % / 14.1 % |
| Qwen2.5-Coder-7B (+Tune) | 26.6 % / 66.7 % | 39.0 % / 70.8 % | 34.5 % / 71.9 % | 35.5 % / 69.5 % |
| Qwen2.5-Coder-14B (+Tune) | 31.3 % / 68.9 % | 41.9 % / 74.4 % | 35.3 % / 73.8 % | 38.2 % / 71.5 % |
| GPT-5 (No RAG) | 24.8% / 56.7% | 34.5% / 60.9% | 15.4% / 52.0% | 31.4% / 63.7% |
- CodeBLEU: 49–56% for fine-tuned models, 41–51% for GPT-5.
- Compilation rate rises from ≈0% to 28–41% after fine-tuning.
- “Declaration and Definition” achieves the highest EM; “Control Flow Block” and “Computation Block” remain most difficult.
- Completion success decreases for multi-line and deeply nested (extreme CST depth) spans, a behavior differing from software benchmarks.
- Retrieval-augmented LLMs yield improvements ranging from marginal to substantial, contingent on retrieval quality; indiscriminate retrieval may degrade performance.
7. Implications and Research Recommendations
Findings from MHRC-Bench-Eval suggest several directions:
- Fine-tuning on hardware code is essential; off-the-shelf models lack inductive biases for HDL and hardware-specific constructs.
- Improvements in retrieval relevance could substantially benefit repository-level hardware code completion, particularly for HLS and V/SV.
- Context windows >2,048 tokens yield diminishing returns for hardware code, in contrast to software completion benchmarks.
- Two-axis annotation (structure + semantic) enables targeted model diagnosis and suggests the need for hardware-aware encoding architectures, such as programmable AST encoders or inductive biases for type and dataflow information.
- Multi-line, hierarchical completions and complex control/data flow present persistent challenges, indicating future benchmark extensions should target these higher-order code reasoning abilities.
MHRC-Bench-Eval establishes a rigorous, richly annotated testbed for repository-level hardware code completion and is positioned to support advancements in specialized LLM architecture, retrieval strategies, and evaluation metrics focused on correct and functional hardware design generation (Zou et al., 7 Jan 2026).