MHRC-Bench-Eval: HDL Code Completion Benchmark
- MHRC-Bench-Eval is a benchmark that assesses multilingual, repository-level code completion for hardware design languages and paradigms.
- It covers diverse hardware description languages including Verilog, VHDL, Chisel, and HLS C/C++, rigorously challenging LLMs with cross-file dependency modeling.
- Dual-layer annotations capture both code structure and hardware semantics to enable granular performance analysis and error diagnosis in LLM completions.
MHRC-Bench-Eval is the held-out repository-level evaluation split of MHRC-Bench, the first large-scale, multilingual benchmark for code completion in hardware description languages (HDLs) and hardware-oriented programming paradigms. It was introduced to rigorously measure the context-aware code generation capability of LLMs on unseen hardware repositories, with a focus on realistic, cross-file code completion for both Register-Transfer Level (RTL) and High-Level Synthesis (HLS) domains (Zou et al., 7 Jan 2026).
1. Objectives and Benchmark Design
MHRC-Bench-Eval is designed to assess a model’s repository-level code completion abilities in a heterogeneous hardware environment, evaluating both in-file context understanding and cross-file dependency resolution (such as imports, instantiations, macros). Unlike training splits, its completion targets and repositories are fully disjoint from MHRC-Bench-Train, ruling out memorization and requiring true generalization to previously unseen design patterns and coding conventions. MHRC-Bench-Eval provides dual-layer annotations—code-structure level and hardware-oriented semantic—for granular analysis of completion behaviors and model error modes.
2. Dataset Composition and Statistics
MHRC-Bench-Eval derives its samples from 584 repositories, with coverage across four major hardware description and design paradigms:
- Chisel: 48 repositories, 319 completion examples
- Verilog/SystemVerilog (V/SV): 247 repositories, 492 examples
- VHDL: 239 repositories, 491 examples
- HLS C/C++ (Xilinx HLS, SystemC): 50 repositories, 456 examples
In total, the benchmark contains 1,758 completion tasks. Each example constitutes a syntactically complete code node—typically between one and five lines—identified and extracted using Tree-sitter-based concrete syntax tree (CST) analysis.
To quantify the inter-file reasoning burden, the average number of external files required per completion is: Chisel 3.12, V/SV 4.02, VHDL 4.34, HLS 5.85. These figures directly expose the challenge of cross-file context modeling for LLMs in hardware codebases.
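The per-language averages above can be computed with a short aggregation over the annotated examples; a minimal sketch, assuming each example record carries an illustrative `language` field and a list of `external_files` it depends on (both field names are assumptions, not the benchmark's schema):

```python
from collections import defaultdict

def avg_external_files(examples: list[dict]) -> dict[str, float]:
    """Average count of external-file dependencies per completion, by language."""
    totals, counts = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["language"]] += len(ex["external_files"])
        counts[ex["language"]] += 1
    return {lang: totals[lang] / counts[lang] for lang in totals}
```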
3. Coverage: Languages and Coding Styles
MHRC-Bench-Eval encompasses three principal hardware design domains:
- RTL (Register-Transfer Level) HDLs: Verilog/SystemVerilog and VHDL, classical hardware description languages for cycle-accurate behavioral and structural modeling.
- High-Level Synthesis (HLS): C/C++-based design with hardware synthesis directives, bridging software idioms and low-level hardware semantics.
- Generator-based Construction: Chisel, a hardware construction language embedded in Scala, enabling metaprogramming and parameterized RTL code generation.
Through this spectrum, MHRC-Bench-Eval uniquely captures the breadth of real-world hardware design, from low-level event-driven RTL to metaprogrammed and synthesis-driven workflows.
4. Annotation Protocol: Structure and Semantics
Each completion target is annotated according to two orthogonal axes:
4.1 Code-Structure-Level Labels
Target nodes are localized within the CST and labeled by their structural depth (with root at depth 0). The depth is bucketed into M=5 bins, mapping completion targets from top-level constructs (e.g. module declarations) to deeply nested expressions (e.g. complex combinational logic). This enables stratified analysis of how syntactic nesting influences completion difficulty.
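The depth bucketing can be sketched as a simple mapping from a node's CST depth to one of M = 5 bins; the exact binning scheme is not specified in the source, so the linear scheme below (normalized against the tree's maximum depth) is an illustrative assumption:

```python
def depth_bucket(depth: int, max_depth: int, m: int = 5) -> int:
    """Map a CST node depth (root = 0) to a bin index in [0, m - 1].

    Linear bucketing over the observed depth range; the benchmark's
    actual binning rule may differ.
    """
    if max_depth == 0:
        return 0
    return min(m - 1, depth * m // (max_depth + 1))
```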
4.2 Hardware-Oriented Semantic Labels
Hardware code features domain-specific semantic categories absent from software benchmarks. Each target is classified under one of nine roles:
| Category Index | Category Name | Example Constructs |
|---|---|---|
| 1 | Design Structure | module/entity declarations |
| 2 | Declaration and Definition | wire/reg, signal/generic, functions |
| 3 | Storage Block | memories, registers, processes |
| 4 | Computation Block | arithmetic operations, muxes |
| 5 | Control Flow Block | always, if, when, for loops |
| 6 | Interface | port lists, AXI interfaces |
| 7 | Property and Assertion | assert, PSL |
| 8 | Testbench Stimulus/Environment | pokes, testbench loops |
| 9 | Monitoring/Checking Logic | display, expect, golden checks |
These labels, assigned via node type and token-level keywords, support per-category reporting of model strengths and failure modes.
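The keyword-driven part of this labeling can be sketched as a first-match lookup over the target's tokens. The category names follow the table above, but the keyword lists and the fallback rule are illustrative assumptions (the benchmark also uses CST node types, which this sketch omits):

```python
# Illustrative keyword lists; the benchmark's actual rules also consult
# CST node types and cover all nine categories.
CATEGORY_KEYWORDS = {
    "Design Structure": ["module", "entity", "endmodule"],
    "Declaration and Definition": ["wire", "reg", "signal", "generic", "function"],
    "Control Flow Block": ["always", "if", "when", "for"],
    "Property and Assertion": ["assert", "assume", "cover"],
    "Monitoring/Checking Logic": ["$display", "expect"],
}

def classify_target(tokens: list[str]) -> str:
    """Return the first semantic category whose keywords match a token."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(tok in keywords for tok in tokens):
            return category
    return "Computation Block"  # assumed fallback for plain expressions
```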
5. Task Format and Evaluation Protocols
Each evaluation task consists of a snapshot of all repository files (up to a 2,048-token window), using explicit file delimiters:
```
--- FILE: <repo>/<path> (BEGIN)
[file contents]
--- FILE: <repo>/<path> (END)
```

The completion location is marked with `<TARGET>`. The model is prompted to infill the missing code segment by generating exactly the target span, under temperature=0.0, top_p=1.0, max_length=64, applying symmetric trimming to the surrounding context as needed.
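Prompt assembly under this protocol can be sketched as follows; whitespace-splitting stands in for the model's real tokenizer, and the trimming rule (an equal token window on each side of `<TARGET>`) is an assumed reading of "symmetric trimming":

```python
def build_prompt(files: dict[str, str], budget: int = 2048) -> str:
    """Concatenate repository files with BEGIN/END delimiters, then
    symmetrically trim around <TARGET> to fit the token budget."""
    parts = []
    for path, contents in files.items():
        parts.append(f"--- FILE: {path} (BEGIN)")
        parts.append(contents)
        parts.append(f"--- FILE: {path} (END)")
    prompt = "\n".join(parts)
    tokens = prompt.split()  # illustrative stand-in for real tokenization
    if len(tokens) <= budget:
        return prompt
    center = next(i for i, t in enumerate(tokens) if "<TARGET>" in t)
    half = budget // 2
    lo, hi = max(0, center - half), min(len(tokens), center + half)
    return " ".join(tokens[lo:hi])
```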
Pre- and post-processing standardizes whitespace and strips comments. Evaluation metrics include:
- Exact Match (EM): 1 iff the generation matches the reference exactly after cleaning.
- Edit Similarity (ES): character-level similarity between generation and reference, computed as one minus the normalized Levenshtein distance.
- BLEU: standard token-level BLEU score.
- CodeBLEU: combined metric integrating BLEU, weighted n-gram, syntax match, and dataflow match, with equal weights.
- Compilation Rate: fraction of completions that elaborate/compile successfully when wrapped in a minimal harness using the target language’s reference toolchain (Verilator for V/SV, GHDL for VHDL, Scala/Chisel3 for Chisel, Clang syntax check for HLS).
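The two headline metrics can be sketched directly; the ES formula below (one minus normalized Levenshtein distance) is the definition common to code-completion benchmarks, and the benchmark's exact cleaning rules may differ:

```python
def exact_match(pred: str, ref: str) -> int:
    """EM: 1 iff generation equals reference after whitespace cleaning."""
    return int(pred.strip() == ref.strip())

def edit_similarity(pred: str, ref: str) -> float:
    """ES: 1 - Levenshtein(pred, ref) / max(len(pred), len(ref))."""
    a, b = pred.strip(), ref.strip()
    if not a and not b:
        return 1.0
    # Classic dynamic-programming Levenshtein distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))
```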
6. Baseline Results and Error Analysis
Empirical results show that general-purpose code LLMs achieve near-zero exact match on MHRC-Bench-Eval (EM ≈ 0–1%). After fine-tuning on MHRC-Bench-Train, substantial improvements are observed:
| Model | Chisel EM/ES | V/SV EM/ES | VHDL EM/ES | HLS EM/ES |
|---|---|---|---|---|
| Qwen2.5-Coder-7B (pre) | 0 % / 14.8 % | 0 % / 12.7 % | 0 % / 13.0 % | 0.2 % / 14.1 % |
| Qwen2.5-Coder-7B (+Tune) | 26.6 % / 66.7 % | 39.0 % / 70.8 % | 34.5 % / 71.9 % | 35.5 % / 69.5 % |
| Qwen2.5-Coder-14B (+Tune) | 31.3 % / 68.9 % | 41.9 % / 74.4 % | 35.3 % / 73.8 % | 38.2 % / 71.5 % |
| GPT-5 (No RAG) | 24.8% / 56.7% | 34.5% / 60.9% | 15.4% / 52.0% | 31.4% / 63.7% |
- CodeBLEU: 49–56% for fine-tuned models, 41–51% for GPT-5.
- Compilation rate rises from ≈0% to 28–41% after fine-tuning.
- “Declaration and Definition” achieves the highest EM; “Control Flow Block” and “Computation Block” remain most difficult.
- Completion success decreases for multi-line and deeply nested (extreme CST depth) spans, a behavior differing from software benchmarks.
- Retrieval-augmented LLMs yield improvements ranging from marginal to substantial, contingent on retrieval quality; indiscriminate retrieval may degrade performance.
7. Implications and Research Recommendations
Findings from MHRC-Bench-Eval suggest several directions:
- Fine-tuning on hardware code is essential; off-the-shelf models lack inductive biases for HDL and hardware-specific constructs.
- Improvements in retrieval relevance could substantially benefit repository-level hardware code completion, particularly for HLS and V/SV.
- Context windows >2,048 tokens yield diminishing returns for hardware code, in contrast to software completion benchmarks.
- Two-axis annotation (structure + semantic) enables targeted model diagnosis and suggests the need for hardware-aware encoding architectures, such as programmable AST encoders or inductive biases for type and dataflow information.
- Multi-line, hierarchical completions and complex control/data flow present persistent challenges, indicating future benchmark extensions should target these higher-order code reasoning abilities.
MHRC-Bench-Eval establishes a rigorous, richly annotated testbed for repository-level hardware code completion and is positioned to support advancements in specialized LLM architecture, retrieval strategies, and evaluation metrics focused on correct and functional hardware design generation (Zou et al., 7 Jan 2026).