MHRC-Bench-Train: HDL Completion Dataset

Updated 14 January 2026
  • MHRC-Bench-Train is a repository-level dataset for multilingual HDL code completion that integrates fill-in-the-middle tasks with robust semantic annotations.
  • It annotates both code structure and hardware semantics across four paradigms, enabling precise evaluation using metrics like exact match and token-level BLEU.
  • Carefully curated from 584 GitHub repositories, the dataset ensures rigorous data filtering and repository-disjoint splits to prevent information leakage.

MHRC-Bench-Train is a large-scale, repository-level dataset for multilingual hardware description language (HDL) code completion, designed to systematically advance code generation and completion by LLMs in hardware design. It forms the training split of MHRC-Bench, which is the first benchmark to focus on repository-level, multilingual, and multi-paradigm HDL code completion at scale, with code structure and hardware-oriented semantic annotation. MHRC-Bench-Train encompasses four major hardware programming paradigms: Verilog/SystemVerilog (V/SV), VHDL, HLS C/C++ (including SystemC), and Chisel (Scala-based), covering three design styles: RTL, high-level synthesis, and generator-based hardware construction (Zou et al., 7 Jan 2026).

1. Dataset Composition and Construction

MHRC-Bench-Train is constructed from an extensive mining of the open-source hardware design landscape. The process began with identifying 584 distinct GitHub repositories (MIT/BSD-style licenses, minimum 5 stars, post-2000 creation date), from which 47,175 unique source files were extracted after filtering out vendor and dependency directories and removing duplicates. Exactly one syntactically complete completion target is selected per source file via concrete syntax tree (CST) node sampling, yielding 47,175 examples in total, of which 44,964 form the training split.

The repository-level organization ensures no information leakage across train/val/test splits and allows fill-in-the-middle (FIM) prompts to be constructed from full project context. Each example is associated with a unique file, containing exactly one completion target (Zou et al., 7 Jan 2026).
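The per-file CST sampling step can be illustrated with a minimal sketch. The real pipeline parses files with Tree-sitter; the `CSTNode` stand-in, field names, and sampling policy below are illustrative assumptions, not the paper's implementation:

```python
import random
from dataclasses import dataclass, field

# Minimal stand-in for a concrete-syntax-tree node. The actual dataset
# uses Tree-sitter parse trees; this mock only mirrors the fields needed
# to demonstrate "one syntactically complete target per file".
@dataclass
class CSTNode:
    type: str
    start_byte: int
    end_byte: int
    children: list = field(default_factory=list)

def complete_nodes(node, depth=0):
    """Yield (node, depth) for every syntactic unit in the tree."""
    yield node, depth
    for child in node.children:
        yield from complete_nodes(child, depth + 1)

def sample_target(root, rng):
    """Pick exactly one completion target per file: a single
    syntactically complete CST node, with its depth recorded."""
    candidates = [(n, d) for n, d in complete_nodes(root)
                  if n is not root]  # skip the whole-file node itself
    return rng.choice(candidates)
```

Because only whole CST nodes are eligible, every sampled target is guaranteed to be a parseable span, and the recorded depth feeds the bucketed code-structure labels described below.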

| Language / Paradigm | Examples | Share of training split |
| --- | --- | --- |
| Verilog/SystemVerilog (RTL) | 25,586 | ~56.9% |
| VHDL (RTL) | 13,039 | ~29.0% |
| HLS C/C++ | 3,766 | ~8.4% |
| Chisel (Scala) | 2,573 | ~5.7% |

(Percentages are computed over the 44,964-example training split.)

2. Annotation Schema

Annotation spans both code-structure and hardware semantics:

  • Code-Structure Labels: Parsing with Tree-sitter yields CSTs, from which the completion target node is sampled. Each node's depth d is recorded and bucketed into a fixed number of discrete levels to facilitate stratified analysis.
  • Hardware-Oriented Semantic Categories: Every completion target is annotated with one of nine hardware-relevant classes:

    1. Design Structure (e.g., module/entity)
    2. Declaration and Definition (wire/signal)
    3. Storage Block (always_ff, Reg)
    4. Computation Block (arithmetic, slices)
    5. Control Flow Block (if, switch, loops)
    6. Interface (ports, AXI)
    7. Property and Assertion Specification
    8. Testbench Stimulus and Modeling
    9. Monitoring Logic
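To make the nine categories concrete, here is a toy keyword-based labeler. The regex patterns are Verilog-flavored illustrations chosen for this sketch; the paper does not publish its exact labeling rules, so nothing below should be read as the dataset's annotation procedure:

```python
import re

# Illustrative priority-ordered heuristics for the nine categories.
# Patterns and their ordering are assumptions made for this example.
CATEGORY_PATTERNS = [
    ("Property and Assertion", r"\b(assert|assume|cover|property)\b"),
    ("Monitoring",             r"\$(display|monitor|strobe)"),
    ("Testbench",              r"\b(initial|fork)\b|\$random"),
    ("Storage",                r"\balways_ff\b|posedge|\bReg\("),
    ("Control Flow",           r"\b(if|else|case|for|while)\b"),
    ("Interface",              r"\b(input|output|inout|axi)\b"),
    ("Design Structure",       r"\b(module|entity|endmodule)\b"),
    ("Declaration",            r"\b(wire|reg|logic|signal)\b"),
    ("Computation",            r"[+\-*/]|assign"),
]

def label_target(code: str) -> str:
    """Return the first matching category in priority order."""
    for name, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, code, re.IGNORECASE):
            return name
    return "Computation"  # fallback for unmatched snippets
```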

The overall semantic label distribution is as follows:

| Semantic Category | Fraction f_i |
| --- | --- |
| Declaration | ≈ 0.28 |
| Computation | ≈ 0.21 |
| ControlFlow | ≈ 0.16 |
| Storage | ≈ 0.13 |
| Interface | ≈ 0.10 |
| DesignStructure | ≈ 0.06 |
| Property | ≈ 0.03 |
| Testbench | ≈ 0.02 |
| Monitoring | ≈ 0.01 |

This explicit labeling supports fine-grained downstream evaluation and targeted model analysis (Zou et al., 7 Jan 2026).

3. Task Formulation and Input/Output Protocol

Each MHRC-Bench-Train example represents a fill-in-the-middle code completion scenario:

  • Prompt Construction: For a repository, all files are concatenated with meta delimiters. The file containing the target wraps its left/right context in <Left Context> and <Right Context> tags; the target is replaced by a single <TARGET> token.

  • Completion Target: The output required is the exact code span—line or logical block—corresponding to the removed CST node; only syntactic units are used to guarantee parseability.

This FIM format forces models to rely on genuine structural and semantic understanding at repository scale, rather than single-file context or handcrafted cues.
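The prompt assembly described above can be sketched as follows. The `<Left Context>`, `<Right Context>`, and `<TARGET>` tags come from the dataset description; the file-delimiter line, closing tags, and newline placement are assumptions of this sketch:

```python
def build_fim_prompt(repo_files, target_path, target_span):
    """Assemble a repository-level fill-in-the-middle prompt.

    repo_files:  {path: file text} for the whole repository
    target_path: path of the file containing the completion target
    target_span: (start, end) byte offsets of the removed CST node
    """
    start, end = target_span
    parts = []
    for path, text in repo_files.items():
        if path == target_path:
            # Wrap left/right context in tags; the target span is
            # replaced by a single <TARGET> token.
            body = (f"<Left Context>{text[:start]}</Left Context>\n"
                    f"<TARGET>\n"
                    f"<Right Context>{text[end:]}</Right Context>")
        else:
            body = text
        parts.append(f"### FILE: {path}\n{body}")  # delimiter is illustrative
    return "\n".join(parts)
```

A model consuming this prompt must produce exactly the removed span, so cross-file context (other modules, packages, interfaces) is available but the answer itself never appears in the input.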

4. Preprocessing, Filtering, and Dataset Splits

Several filtering steps ensure quality and diversity:

  • Files within vendor/dependency structures are removed.
  • Exact duplicate files and targets containing only whitespace/comments/debug code are filtered.
  • At least 40% of targets span 2–5 lines.
  • Only one target per file is preserved.
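The first two filtering rules can be sketched as simple predicates. The vendor-directory patterns and comment syntax handled here are assumptions for illustration; the paper does not enumerate its exact filter lists:

```python
import re

VENDOR_DIRS = ("vendor/", "third_party/", "deps/")  # illustrative patterns

def keep_file(path: str, text: str, seen_hashes: set) -> bool:
    """Drop files in vendor/dependency directories and exact duplicates."""
    if any(part in path for part in VENDOR_DIRS):
        return False
    h = hash(text)
    if h in seen_hashes:
        return False          # exact duplicate of a file seen earlier
    seen_hashes.add(h)
    return True

def keep_target(snippet: str) -> bool:
    """Reject targets containing only whitespace or comments
    (C-style // and /* */ comments assumed here)."""
    stripped = re.sub(r"//.*|/\*.*?\*/", "", snippet, flags=re.DOTALL)
    return bool(stripped.strip())
```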

Splits are repository-disjoint:

  • Train: 44,964 examples (~95.3%)
  • Validation: 453 (~0.96%)
  • Test: 1,758 (~3.7%)

This configuration enforces generalization across projects and paradigm boundaries (Zou et al., 7 Jan 2026).
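Repository-disjoint splitting means every file of a repository lands in the same split. One deterministic way to realize this is hash-based assignment per repository; the hashing scheme below is an assumption (the paper only specifies disjointness and the split ratios):

```python
import hashlib

def split_for_repo(repo_name: str,
                   ratios=(0.953, 0.0096, 0.0374)) -> str:
    """Assign an entire repository to train/val/test so that no
    repository straddles splits. Ratios mirror the ~95.3/0.96/3.7
    percentages quoted above."""
    h = int(hashlib.sha256(repo_name.encode()).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # deterministic pseudo-uniform in [0, 1)
    train, val, _ = ratios
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"
```

Because the assignment depends only on the repository name, all examples drawn from that repository inherit the same split, which is exactly the leakage guarantee the dataset requires.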

5. Training Objective and Evaluation Metrics

Supervised fine-tuning is performed using a standard cross-entropy loss over the completion target tokens. The loss function is:

\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid \text{context},\, y_{<t}\right)

Only tokens in the <TARGET> span are included in the loss; masking ensures no leakage from context portions.
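The masked objective can be sketched in plain Python, assuming per-token log-probabilities are already available (a real training loop would compute them with a deep-learning framework and apply the mask to the loss tensor):

```python
import math

def target_nll(token_logprobs, loss_mask):
    """Cross-entropy over the <TARGET> span only.

    token_logprobs[t] = log p_theta(y_t | context, y_<t)
    loss_mask[t]      = 1 for target tokens, 0 for context tokens,
                        so masked positions contribute nothing.
    """
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)
```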

Primary evaluation metrics for code completion on the train (and dev) split are:

  • Exact Match (EM): strict span equivalence
  • Edit Similarity (ES):

ES = 1 - \frac{\mathrm{Levenshtein}(\text{pred},\, \text{ref})}{\max\left(1,\, |\text{pred}|,\, |\text{ref}|\right)}

  • Token-level BLEU

MHRC-Bench-Eval introduces CodeBLEU and compilation rate as additional metrics, but training is assessed primarily by EM, ES, and BLEU (Zou et al., 7 Jan 2026).
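EM and ES follow directly from their definitions; a minimal sketch (character-level edit distance, strict string equality for EM — some evaluations additionally normalize whitespace, which is not assumed here):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming (two-row form)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def exact_match(pred: str, ref: str) -> bool:
    """Strict span equivalence."""
    return pred == ref

def edit_similarity(pred: str, ref: str) -> float:
    """ES = 1 - Levenshtein(pred, ref) / max(1, |pred|, |ref|)."""
    return 1 - levenshtein(pred, ref) / max(1, len(pred), len(ref))
```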

6. Contextual Significance and Comparisons

MHRC-Bench-Train addresses limitations in previous code completion datasets, which are predominantly software-centric and single-language, and typically lack both repository-scale context and detailed semantic partitioning. By spanning RTL, HLS, and generator-based levels as well as multiple HDLs, MHRC-Bench-Train offers a comprehensive challenge for LLMs and supports both discriminative model evaluation and fine-tuned generative training that is representative of contemporary hardware engineering practices.

A plausible implication is that future models trained on MHRC-Bench-Train could leverage its multi-paradigm, strongly contextualized setup to improve robustness, reduce error rates, and strengthen cross-language transfer in hardware code generation tasks.

7. Reproducibility and Extension

The explicit delineation of source selection (GitHub, licensing, star count, date), sophisticated code structure labeling (Tree-sitter), and loss/metric design in MHRC-Bench-Train allow for straightforward reproduction and extension. Researchers can expand to additional languages, hardware paradigms, or alternative labeling strategies, or port the approach to related domains (e.g., analog hardware, FPGA-specific design flows) with minimal adaptation (Zou et al., 7 Jan 2026).
