MHRC-Bench: HDL Code Completion Benchmark
- MHRC-Bench is a large-scale, multilingual benchmark for hardware code completion that targets complex, repository-level problems in HDLs.
- It partitions data into training and evaluation sets with detailed structural and semantic annotations across RTL, HLS, and generator-based languages.
- Fine-tuning on MHRC-Bench markedly improves model accuracy, highlighting its role in advancing LLM capabilities for hardware design automation.
MHRC-Bench is a large-scale, multilingual, repository-level code completion benchmark specifically developed for hardware description languages (HDLs), filling the gap left by prior software-centric benchmarks. Its design enables rigorous assessment of LLMs on complex hardware codebases that exhibit parallelism, cross-file dependencies, and hardware-oriented semantics. MHRC-Bench is partitioned into MHRC-Bench-Train (for fine-tuning) and MHRC-Bench-Eval (for evaluation), each targeting three major coding paradigms and annotated at both structural and semantic levels (Zou et al., 7 Jan 2026).
1. Motivation and Conceptual Scope
Modern LLMs have achieved substantial performance gains in code completion for general-purpose software languages, particularly in single-file and repository-level settings (e.g., HumanEval, RepoBench, M2RC). However, hardware code presents distinct challenges: its organization spans multiple files and modules, describes synchronous and parallel execution, and enforces strong structural and interface semantics. Existing benchmarks such as RTL-Repo are limited, covering Verilog/SystemVerilog only, and do not enable structural or semantic stratification. MHRC-Bench advances the field by extending evaluation to repository-level completion across three hardware coding paradigms: Register-Transfer Level (RTL: Verilog/SystemVerilog/VHDL), High-Level Synthesis (HLS: C/C++ with synthesis directives), and generator-based languages (Chisel).
MHRC-Bench defines repository-level code completion as the task of reconstructing a syntactically complete code segment marked by <TARGET>, using all other file contents and optionally cross-file dependencies for contextual reasoning.
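The task definition above can be sketched as a masking step: the target span is cut out of its file and replaced with the `<TARGET>` marker, with the removed text kept as ground truth. This is a minimal illustration; the function and variable names here are hypothetical, not from the benchmark's tooling.

```python
# Illustrative sketch: form a completion instance by replacing the target
# span with a <TARGET> marker and keeping the removed text as ground truth.
# mask_target and TARGET_MARK are hypothetical names.

TARGET_MARK = "<TARGET>"

def mask_target(file_text: str, start: int, end: int):
    """Replace file_text[start:end] with the marker.

    Returns (masked_file, ground_truth)."""
    ground_truth = file_text[start:end]
    masked = file_text[:start] + TARGET_MARK + file_text[end:]
    return masked, ground_truth

# Toy Verilog file: mask the continuous assignment as the target.
src = "module adder(input a, b, output y);\n  assign y = a + b;\nendmodule\n"
s = src.index("assign")
e = src.index(";\nendmodule") + 1  # include the trailing semicolon
masked, truth = mask_target(src, s, e)
```

Re-inserting the ground truth at the marker must reproduce the original file, which is a useful sanity check when building such instances.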
2. Dataset Construction and Corpus Organization
MHRC-Bench samples from approximately 584 permissively licensed (MIT/BSD/Apache) hardware repositories (GitHub, post-2000, ≥5 stars), after excluding GPL/copyleft code and vendor lock-in. Following removal of duplicates and irrelevant files, 47,175 source files—spanning Chisel, Verilog/SystemVerilog, VHDL, and HLS C/C++—constitute the corpus.
Repository-disjoint splits are enforced to prevent data leakage: the train, validation, and test splits share no repositories:
| Language | # Repositories | Train Targets | Validation | Test |
|---|---|---|---|---|
| Chisel | 48 | 2,573 | 119 | 319 |
| V/SV | 247 | 25,586 | 107 | 492 |
| VHDL | 239 | 13,039 | 104 | 491 |
| HLS | 50 | 3,766 | 123 | 456 |
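A repository-disjoint split like the one above can be sketched by assigning every repository wholly to one split before distributing targets. The split ratios and function name below are assumptions for illustration; the paper does not specify its exact assignment procedure.

```python
# Hedged sketch of a repository-disjoint split: each repo id is assigned
# to exactly one split, so no repository's files appear in two splits.
# The 80/10/10 ratios are illustrative, not the paper's actual proportions.

import random

def split_by_repo(targets, seed=0, ratios=(0.8, 0.1, 0.1)):
    """targets: list of (repo_id, target). Returns dict split_name -> targets."""
    repos = sorted({r for r, _ in targets})
    random.Random(seed).shuffle(repos)
    n = len(repos)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])
    assign = {}
    for i, r in enumerate(repos):
        assign[r] = "train" if i < cut1 else "valid" if i < cut2 else "test"
    out = {"train": [], "valid": [], "test": []}
    for r, t in targets:
        out[assign[r]].append((r, t))
    return out
```

The key property is that the repository sets of the three splits are pairwise disjoint, which is exactly what blocks memorization-based leakage.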
Cross-file dependencies are prevalent, with average numbers referenced per target ranging from 3.12 (Chisel) to 5.85 (HLS), substantiating the repository-level design.
3. Target Selection and Annotation Schema
Target selection leverages Tree-sitter to convert each file into a concrete syntax tree (CST). A single syntactically complete CST node is sampled per file, subject to length constraints (roughly 40% of spans are 2–5 lines long) and the exclusion of trivial segments (whitespace, comments).
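The filtering step can be sketched without Tree-sitter itself: given candidate CST nodes with line spans, drop comment- or whitespace-only spans and keep those within an allowed line range. The node representation and function names below are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical filter mirroring the described target constraints.
# Nodes are represented as (start_line, end_line, text) tuples; a real
# pipeline would obtain these from Tree-sitter's CST API instead.

def is_trivial(text: str) -> bool:
    """True if the span contains only whitespace and // line comments."""
    meaningful = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.strip().startswith("//")
    ]
    return not meaningful

def eligible_spans(nodes, min_lines=1, max_lines=10):
    """Keep syntactically complete nodes within the allowed line range."""
    out = []
    for start, end, text in nodes:
        n_lines = end - start + 1  # lines are inclusive
        if is_trivial(text):
            continue
        if min_lines <= n_lines <= max_lines:
            out.append((start, end, text))
    return out
```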
Annotation is performed at two levels:
- Code-Structure Level: CST node depth is bucketed into five structural levels (Level 1: top-level declarations, up to Level 5: deeply nested constructs), enabling analysis of structural granularity effects on completion.
- Hardware-Semantic Level: Each target is assigned one of nine semantic categories representing essential hardware constructs (e.g., Design Structure, Declaration & Definition, Storage Block, Computation Block, Control Flow Block, Interface, Property & Assertion, Testbench Stimulus, Monitoring & Checking).
This dual annotation facilitates comprehensive structural and semantic error analysis. For example, in SystemVerilog, “assert (my_signal == 1);” is labeled as “Property & Assertion.”
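The code-structure annotation can be sketched as a simple depth-to-level mapping. The clamping of depths beyond 5 into Level 5 is an assumption for illustration; the source only states that five levels span top-level declarations through deeply nested constructs.

```python
# Sketch of the code-structure annotation: bucket CST node depth into five
# structural levels. Clamping depths > 5 into Level 5 is an assumption.

def structure_level(cst_depth: int) -> int:
    """Map a node's CST depth (1 = top-level declaration) to Level 1-5."""
    if cst_depth < 1:
        raise ValueError("CST depth must be >= 1")
    return min(cst_depth, 5)
```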
4. Evaluation Protocols and Metrics
MHRC-Bench evaluates both open-source (Qwen2.5-Coder, DeepSeek-Coder) and commercial LLM APIs (GPT-5, Gemini 2.5 Pro, Grok-4, DeepSeek V3.2), alongside retrieval-augmented models (RepoCoder, GraphCoder, RLCoder) with a fixed 2,048-token input window.
Key metrics include:
- Exact Match (EM): Proportion of predictions exactly matching cleaned ground truth.
- Edit Similarity (ES): 1 minus normalized Levenshtein distance.
- Token-level F1 Score: F1 = 2|G ∩ P| / (|G| + |P|), where G and P are the token sets of the ground truth and the prediction, respectively.
BLEU and CodeBLEU are also reported, but EM and ES are emphasized.
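The three primary metrics admit short reference implementations. The sketch below assumes whitespace tokenization for F1 and length-normalized Levenshtein distance for ES; the benchmark's exact tokenizer and normalization are not spelled out here.

```python
# Minimal reference implementations of EM, ES, and token-level F1.
# Whitespace tokenization for F1 is an assumption for illustration.

def exact_match(pred: str, gold: str) -> bool:
    """EM: prediction matches cleaned ground truth exactly."""
    return pred.strip() == gold.strip()

def edit_similarity(pred: str, gold: str) -> float:
    """ES: 1 - Levenshtein(pred, gold) / max(len(pred), len(gold))."""
    m, n = len(pred), len(gold)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))  # classic DP over edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def token_f1(pred: str, gold: str) -> float:
    """F1 = 2|G ∩ P| / (|G| + |P|) over token sets."""
    g, p = set(gold.split()), set(pred.split())
    if not g and not p:
        return 1.0
    if not g or not p:
        return 0.0
    return 2 * len(g & p) / (len(g) + len(p))
```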
Models are prompted using a fill-in-the-middle paradigm, with repository context supplied between delimiters, and fine-tuned using LoRA (rank = 16, α = 32, dropout = 0.05, AdamW, learning rate 2×10⁻⁴, one epoch).
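A fill-in-the-middle prompt of this kind can be sketched as plain string assembly: repository context between delimiters, then the prefix and suffix around the hole. The delimiter strings and layout below are illustrative assumptions; the paper's exact template may differ.

```python
# Hedged sketch of fill-in-the-middle prompt assembly. The delimiter
# tokens (<repo_context>, <fim_prefix>, ...) are illustrative, not the
# benchmark's actual template.

def build_fim_prompt(prefix: str, suffix: str, repo_context: list,
                     max_ctx_chars: int = 8000) -> str:
    """Assemble repo context + prefix/suffix; the model generates the middle."""
    ctx = "\n".join(repo_context)[:max_ctx_chars]  # crude context truncation
    return (
        "<repo_context>\n" + ctx + "\n</repo_context>\n"
        "<fim_prefix>" + prefix
        + "<fim_suffix>" + suffix
        + "<fim_middle>"
    )
```

In practice the truncation budget would be expressed in tokens rather than characters, matching the fixed 2,048-token input window used in the evaluation.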
5. Experimental Results and Analysis
Empirical evaluation reveals:
- Pretrained LLMs without MHRC-Bench fine-tuning yield EM ≈ 0–1%; ES ≈ 5–18%.
- Fine-tuning on MHRC-Bench-Train dramatically improves EM to 25–42%, with Qwen2.5-7B and Qwen2.5-14B outperforming larger commercial models. Post-tuning EM for V/SV and HLS approaches 40%, VHDL at ∼35%, Chisel at 25–31%.
- RLCoder is the most effective retrieval method, improving EM by 2–5 points; GraphCoder fails to deliver gains, attributed to low hit@k in hardware code retrieval.
- Prompt context length positively correlates with accuracy up to 2048 tokens, with diminishing returns beyond 4096.
- Larger completion targets (spans >5 lines) exhibit EM degradation (<20% for 10+ lines).
- Structural complexity (CST depth): Hardware code completion quality decreases for deeper nodes, contrary to trends in software-focused benchmarks.
- Semantic granularity: “Declaration & Definition” and “Interface” categories yield the highest EM (∼50% for Qwen2.5-7B), “Control Flow” and “Property & Assertion” score lower (rarely >25%).
- Error analysis: Pretrained models hallucinate headers or unrelated code and misinterpret syntax context (C/C++ vs. HLS). Fine-tuned models demonstrate robust procedural, port, and structural block completion.
6. Discussion, Limitations, and Future Extensions
MHRC-Bench establishes a foundation for hardware-centric LLM benchmarking, providing nuanced annotation of structural and semantic difficulty. It also demonstrates that smaller models fine-tuned on a targeted corpus can exceed large, generic code LLMs on domain-specific HDL tasks.
Limitations include:
- Exclusion of analog/mixed-signal, layout/physical design languages.
- Evaluation metrics focus on textual similarity, omitting functional simulation.
- Scope limited to digital design and verification; formal proof and timing closure are beyond current coverage.
Potential extensions include:
- Integration of additional hardware languages (SystemC, Bluespec, VPI), analog and mixed-signal code.
- Supplementary evaluation modes that assess simulation-based functional correctness of completed code.
- Exploration of advanced prompting methods, such as multi-agent or chain-of-thought paradigms tailored to hardware reasoning.
7. Implications for Hardware Design Automation
MHRC-Bench provides a standardized, reproducible environment for training and benchmarking hardware-aware LLMs. This facilitates practical LLM-based assistance for RTL and HLS code development, potentially leading to productivity improvements for hardware engineers. Its annotations and quantitative findings inform future research directions on architecture and retrieval mechanisms optimized for hardware semantics. MHRC-Bench thus bridges a critical gap in EDA workflow automation, transitioning from software-dominated LLM tooling to the unique demands of hardware system design (Zou et al., 7 Jan 2026).