ClassEval-TDD: Code & ISAC Detection
- The paper introduces a dependency-aware, iterative TDD framework for class-level code synthesis, achieving significant improvements in fully correct class generation and compositional accuracy.
- It employs systematic cleaning, skeleton alignment, and reflection-based repair to generate reliable method-level tests that mitigate cascading errors in program synthesis.
- For ISAC target detection, ClassEval-TDD leverages periodogram analysis and localized DFTs to accurately distinguish true radar targets from TDD-induced sidelobes in real-time applications.
ClassEval-TDD denotes two distinct frameworks: (1) a cleaned benchmark and iterative, dependency-aware test-driven development (TDD) framework for class-level code generation with LLMs (Liang et al., 3 Feb 2026), and (2) a target-vs-sidelobe classifier supporting target detection in integrated sensing and communications (ISAC) systems with time division duplex (TDD) transmission (Henninger et al., 27 Apr 2025). Both share the ClassEval-TDD moniker due to their evaluative or discriminative anchoring; in each domain, ClassEval-TDD advances the state of rigorous, specification-driven testing—of either code correctness or radar reflectors—under practical and compositional constraints.
1. Motivation and Scope
Code Generation
Prior benchmarks for program synthesis (e.g., HumanEval, MBPP) focus on isolated functions. This fragmentary scope omits aspects critical to realistic software, such as inter-method dependencies, shared state, and cross-method invariants. Function-level metrics fail to account for “composition gap” effects, where aggregating individually correct methods yields incorrect classes due to cascading errors in unseen interactions. There is a methodological imperative for a benchmark and workflow that enables TDD-style, class-level synthesis and evaluation, using method-level public tests as executable specifications and enforcing alignment on signatures, docstrings, and reference implementations (Liang et al., 3 Feb 2026).
ISAC Target Detection
In 5G/6G ISAC, the periodic on/off pattern of TDD transmission modulates the radar point spread function (PSF), generating impulsive sidelobes in the Doppler domain. Conventional peak detection cannot distinguish these sidelobes from genuine targets, causing false alarms. There exists a need for a classifier that robustly discriminates true targets from TDD-induced artifacts, informed by analytic characterization of the PSF and executable detection criteria (Henninger et al., 27 Apr 2025).
2. Construction of the Code Synthesis Benchmark
ClassEval-TDD for code generation is derived via a systematic cleaning pipeline applied to the original ClassEval benchmark:
- Skeleton Alignment: All class skeletons are verified for unused imports, mismatched signatures, and missing method stubs.
- Docstring Normalization: Unified reStructuredText (reST) format, error correction, and completion of missing parameter and return specifications.
- Test Determinization: Unit tests are constructed with isolated setUp/tearDown, fixed seeds, and no uncontrolled randomness or file dependencies.
- Public Test Synthesis: For every method, 3–4 public test cases (unittest.TestCase) are authored to explicate core functionality.
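To make the cleaning criteria concrete, here is a minimal sketch of what a deterministic set of public tests could look like. The `Inventory` class and its methods are hypothetical and are not taken from the benchmark; only the conventions (isolated setUp/tearDown, a fixed seed, 3–4 cases per method, `unittest.TestCase`) follow the description above.

```python
import random
import unittest


class Inventory:
    """Hypothetical class under test; not part of ClassEval-TDD."""

    def __init__(self):
        self.items = {}

    def add_item(self, name, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.items[name] = self.items.get(name, 0) + quantity

    def total_quantity(self):
        return sum(self.items.values())


class TestAddItem(unittest.TestCase):
    """Public tests for add_item, serving as an executable specification."""

    def setUp(self):
        # Fresh fixture and a fixed seed per test: no shared state,
        # no uncontrolled randomness, no file dependencies.
        self.rng = random.Random(0)
        self.inventory = Inventory()

    def tearDown(self):
        self.inventory = None

    def test_adds_new_item(self):
        self.inventory.add_item("bolt", 3)
        self.assertEqual(self.inventory.items, {"bolt": 3})

    def test_accumulates_existing_item(self):
        self.inventory.add_item("bolt", 3)
        self.inventory.add_item("bolt", 2)
        self.assertEqual(self.inventory.total_quantity(), 5)

    def test_rejects_non_positive_quantity(self):
        with self.assertRaises(ValueError):
            self.inventory.add_item("bolt", 0)

    def test_seeded_quantities_are_reproducible(self):
        quantities = [self.rng.randint(1, 9) for _ in range(4)]
        for quantity in quantities:
            self.inventory.add_item("nut", quantity)
        self.assertEqual(self.inventory.total_quantity(), sum(quantities))
```

Saved as a module, the tests can be run with `python -m unittest`; because every test builds its own fixture, the outcome does not depend on execution order.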
Benchmark statistics:
| Statistic | Value |
|---|---|
| Total classes | 100 |
| Total methods | 412 |
| Classes w/ at least one dependency | 55 |
| Methods w/ non-empty dependencies | 84 |
| Mean/median public-test coverage | 98.7%/99.0% |
| Private-test coverage | 100% |
Each example comprises: fully specified class skeleton, method-level public and private tests, and a deterministic runtime harness (Liang et al., 3 Feb 2026).
3. Iterative TDD Framework: Workflow and Dependency Analysis
Dependency-Aware Scheduling
An LLM is prompted as a “senior software architect” to analyze method docstrings and skeletons, inferring two forms of dependencies:
- Direct Calls: Explicit intra-class calls (e.g., self.foo())
- Logical Prerequisites: Statements such as “first do X then Y”
These dependencies form an adjacency list, from which a valid schedule (a topological sort) is derived such that every method is scheduled after all of the methods it depends on.
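The scheduling step can be illustrated with Python's standard `graphlib` module. This is only a sketch: the method names and the dictionary layout (method name mapped to the set of methods it depends on) are assumptions made for the example, and the source does not say which sorting routine the framework uses.

```python
from graphlib import CycleError, TopologicalSorter


def schedule_methods(dependencies):
    """Return an order in which every method follows its dependencies.

    `dependencies` maps a method name to the set of methods it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"cyclic dependencies: {exc.args[1]}") from exc


def count_violations(order, dependencies):
    """Count edges whose dependency is scheduled after its dependent."""
    position = {name: index for index, name in enumerate(order)}
    return sum(
        1
        for method, deps in dependencies.items()
        for dep in deps
        if position[dep] > position[method]
    )


# Hypothetical dependency map for a bank-account style class.
deps = {
    "get_balance": set(),
    "deposit": set(),
    "withdraw": {"get_balance"},
    "transfer": {"withdraw", "deposit"},
}
order = schedule_methods(deps)
```

`count_violations` applied to a predicted order and a ground-truth dependency map gives one way to compute a topological-violation count like the one reported in the evaluation.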
Iterative Implementation and Reflection-based Repair
- For each method in schedule order: the LLM is prompted with the partial class, the method’s signature/docstring, and its public tests, and instructed to implement the method so that all public tests pass.
- If tests fail, a reflection-style loop (max three rounds) is triggered, involving (1) error classification, (2) culprit localization, (3) high-level repair strategy, (4) minimal patch generation.
- Commit and proceed once tests pass or repair budget is exhausted.
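The control flow of the steps above can be sketched as follows. The three callables (`generate`, `run_public_tests`, `repair`) are hypothetical stand-ins for the LLM prompts and the test harness, and their signatures are assumptions; only the three-round repair budget and the commit-then-proceed behavior come from the description.

```python
from dataclasses import dataclass

MAX_REPAIR_ROUNDS = 3  # the description states a maximum of three rounds


@dataclass
class MethodResult:
    name: str
    code: str
    passed: bool
    repairs_used: int


def implement_method(name, committed, generate, run_public_tests, repair,
                     max_rounds=MAX_REPAIR_ROUNDS):
    """Generate one method, then repair it until tests pass or budget ends.

    `committed` maps already-implemented method names to their code (the
    partial class). `run_public_tests` returns None on success or a failure
    report that is fed back into `repair`.
    """
    code = generate(name, committed)
    failure = run_public_tests(name, code, committed)
    rounds = 0
    while failure is not None and rounds < max_rounds:
        code = repair(name, code, failure, committed)
        rounds += 1
        failure = run_public_tests(name, code, committed)
    return MethodResult(name, code, failure is None, rounds)


def build_class(schedule, generate, run_public_tests, repair):
    """Implement methods in schedule order, committing each before moving on."""
    committed, results = {}, []
    for name in schedule:
        result = implement_method(name, committed, generate,
                                  run_public_tests, repair)
        committed[name] = result.code  # committed even if the budget ran out
        results.append(result)
    return results
```

Committing a method even after an exhausted budget mirrors the last step in the list; later methods then see the best available version of their dependencies.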
Significance
Public tests localize errors; the reflection loop confines repair cost and limits error propagation. This approach offers measurable improvements in both per-method and class-level correctness and highlights the compositional limitations of direct one-shot generation (Liang et al., 3 Feb 2026).
4. Evaluation Protocol and Baselines
Metrics
- Method-level: fun_success (percentage of methods passing all private tests), fun_partial_success.
- Class-level: class_success (classes with all methods fully correct), class_partial_success (classes with at least one correct method).
- Dependency Analysis: Precision, Recall, F1 score for predicted vs ground-truth dependency edges, exact-match class accuracy, and count of topological violations.
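As a rough illustration of how these metrics could be computed, the sketch below assumes a hypothetical results layout (class name mapped to per-method pass/fail flags on the private tests) and represents dependency edges as `(dependent, dependency)` tuples. `fun_partial_success` is omitted because its definition is not given here.

```python
def success_rates(results):
    """Compute method- and class-level success fractions.

    `results` maps class name -> {method name: True if the method passed
    all of its private tests}.
    """
    method_flags = [ok for methods in results.values() for ok in methods.values()]
    n_classes = len(results)
    return {
        "fun_success": sum(method_flags) / len(method_flags),
        "class_success": sum(all(m.values()) for m in results.values()) / n_classes,
        "class_partial_success": sum(any(m.values()) for m in results.values()) / n_classes,
    }


def edge_prf(predicted, truth):
    """Precision, recall and F1 over sets of dependency edges."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: class A fully correct, B partially, C not at all.
rates = success_rates({
    "A": {"f": True, "g": True},
    "B": {"f": True, "g": False},
    "C": {"f": False},
})
# One superfluous predicted edge: full recall, reduced precision.
precision, recall, f1 = edge_prf(
    predicted={("a", "b"), ("a", "c"), ("b", "c")},
    truth={("a", "b"), ("b", "c")},
)
```

The toy edge sets show the over-approximation pattern in miniature: an extra predicted edge leaves recall at 1.0 while lowering precision.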
Baseline Strategies
| Strategy | Description |
|---|---|
| Holistic (H) | One-shot full-class generation |
| Incremental (I) | Sequential method generation with full prior context |
| Compositional (C) | Each method generated independently on skeleton+docstring |
Direct-generation baselines (H/I/C) generally underperform the iterative, dependency-aware TDD approach for class-level correctness (Liang et al., 3 Feb 2026).
5. Empirical Results and Benchmark Tables
Class-Level Accuracy and Repair Cost
| Model | Best Baseline (class_success) | TDD class_success (Δ) | Best Baseline (fun_success) | TDD fun_success (Δ) |
|---|---|---|---|---|
| deepseek-v3 | 47% | 68% (+21) | 76.9% | 90.8% (+13.9) |
| gpt-oss-120B | 45% | 71% (+26) | 73.4% | 91.0% (+17.6) |
| qwen2.5-7B | 33% | 46% (+13) | 65.2% | 74.2% (+9.0) |
| qwen3-480B | 55% | 68% (+13) | 81.6% | 91.6% (+10.0) |
| gemini3-flash | 59% | 71% (+12) | 82.4% | 91.8% (+9.4) |
- Absolute improvements of +12 to +26 percentage points in class_success; up to 71% fully correct classes.
- Average repairs per method (for gpt-oss-120B): 0.06; most methods require no repair (Liang et al., 3 Feb 2026).
Dependency Inference Quality
| Model | Precision | Recall | F1 | Exact-match | Topo Violations |
|---|---|---|---|---|---|
| gpt-oss-120B | 92.6% | 96.6% | 94.5% | 88.6% | 6/100 |
| qwen3-480B | 88.4% | 95.4% | 91.7% | 82.8% | 4/100 |
| deepseek-v3 | 80.4% | 95.4% | 87.2% | 75.0% | 6/100 |
High recall (92–98%) corresponds to a tendency to over-approximate dependencies (i.e., add extra edges), and a moderate proportion of topological violations (Liang et al., 3 Feb 2026).
6. TDD-Based ISAC Target vs Sidelobe Classification
In ISAC radar processing, ClassEval-TDD (as defined in (Henninger et al., 27 Apr 2025)) implements a multi-stage process for detecting targets while suppressing TDD-induced Doppler sidelobes:
- Periodogram Construction: The two-dimensional FFT of the CSI matrix yields the range-Doppler power map.
- CA-CFAR Selection: Cell-averaging CFAR identifies candidate peaks.
- Candidate Refinement: Iteratively, for the strongest unclaimed candidate:
  - Focused (localized) DFT on a fine grid for sub-bin resolution.
  - PSF or CSI-domain coherent subtraction removes the hypothesized contribution.
  - Sidelobe-power check compares the mean power of the sidelobe bins before and after hypothesis removal, validating the candidate against a tunable threshold.
This loop continues until all candidates are processed. The method is computationally dominated by one 2D FFT (per frame) and a handful of small local DFTs and subtraction operations, enabling real-time applicability (reported as sub-1 ms per 10 ms frame) (Henninger et al., 27 Apr 2025).
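The refinement loop can be illustrated with a noise-free, Doppler-only toy example in pure Python. Everything specific here is an assumption for illustration: the 3-on/2-off symbol pattern, the 100-symbol frame, the target's Doppler, the oversampling factor, and the use of a before/after ratio of mean sidelobe power as the check. It is a simplified sketch, not the authors' implementation or exact acceptance criterion.

```python
import cmath
import math


def tdd_mask(n_symbols, period, n_on):
    """Periodic on/off pattern: 1.0 where a sensing symbol is available."""
    return [1.0 if (n % period) < n_on else 0.0 for n in range(n_symbols)]


def steering(freq, mask):
    """Masked response of a unit target at normalized Doppler `freq`."""
    return [m * cmath.exp(2j * math.pi * freq * n) for n, m in enumerate(mask)]


def dft_power(x, freq):
    """Periodogram value |X(freq)|^2 at an arbitrary, possibly off-grid, frequency."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * n) for n, v in enumerate(x))
    return abs(acc) ** 2


def coarse_peak(x):
    """Index of the strongest bin on the standard DFT grid."""
    n = len(x)
    return max(range(n), key=lambda k: dft_power(x, k / n))


def focused_dft(x, coarse_bin, oversample=20):
    """Localized DFT: search +/- one bin around the coarse peak on a fine grid."""
    n = len(x)
    grid = [(coarse_bin + i / oversample) / n
            for i in range(-oversample, oversample + 1)]
    return max(grid, key=lambda f: dft_power(x, f))


def subtract_target(x, freq, mask):
    """Least-squares amplitude fit, then coherent removal of the masked target."""
    s = steering(freq, mask)
    energy = sum(abs(v) ** 2 for v in s)
    amp = sum(xv * sv.conjugate() for xv, sv in zip(x, s)) / energy
    return [xv - amp * sv for xv, sv in zip(x, s)], amp


def mean_sidelobe_power(x, freq, period, orders=(1, 2)):
    """Mean power at the Doppler offsets where the periodic mask puts sidelobes."""
    offsets = [k / period for k in orders] + [-k / period for k in orders]
    return sum(dft_power(x, freq + o) for o in offsets) / len(offsets)


# Hypothetical scenario: 100 symbols, 3 of every 5 usable, one target.
N_SYMBOLS, PERIOD, N_ON, TRUE_DOPPLER = 100, 5, 3, 0.123
mask = tdd_mask(N_SYMBOLS, PERIOD, N_ON)
x = steering(TRUE_DOPPLER, mask)

f_hat = focused_dft(x, coarse_peak(x))
before = mean_sidelobe_power(x, f_hat, PERIOD)
residual, amp_hat = subtract_target(x, f_hat, mask)
after = mean_sidelobe_power(residual, f_hat, PERIOD)
```

In this toy setting the periodic mask places strong sidelobes at Doppler offsets that are multiples of 1/PERIOD, and removing the refined target hypothesis collapses them. A real detector would work on the full range-Doppler map with noise and multiple targets, so the ratio `after / before` would be compared against a tuned threshold rather than dropping to near zero.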
Performance Metrics
| (dBm) | () | () | () | () |
|---|---|---|---|---|
| -70 | 0.05 | 0.10 | 0.92 | 0.85 |
| -80 | 0.12 | 0.22 | 0.85 | 0.70 |
| -90 | 0.30 | 0.45 | 0.70 | 0.50 |
| -100 | 0.60 | 0.75 | 0.50 | 0.30 |
Outdoor drone measurements confirm that ClassEval-TDD (with PSF removal) correctly rejects all TDD sidelobes up to 150 m range (Henninger et al., 27 Apr 2025). Feeding validated peaks into a Kalman tracker yields stable range/speed tracking.
7. Methodological Insights and Limitations
- In code generation: Class-level TDD with explicit dependency reasoning and iterative reflection-based repair substantially reduces the compositional gap, achieving high per-method success and up to 71% fully correct class generation with minimal repair overhead.
- Dependency inference: LLMs are proficient at high-recall dependency mapping, but present a proclivity for superfluous dependencies; semantic and syntactic cues can diverge, producing schedule errors in edge cases.
- In ISAC radar: Accurate PSF modeling and removal in both CSI and periodogram domains are essential for sidelobe discrimination. Sidelobe power thresholds and zero-padding parameters provide a tunable trade-off between false alarm rate and sensitivity.
- Remaining challenges: Despite >90% fun_success, class_success plateaus at ~71%, with persistent difficulties in enforcing implicit invariants and field consistency in code generation. For TDD radar detection, extreme SNR regimes and ambiguous sidelobe structure can still pose classification ambiguity.
All code, data, and experiments for ClassEval-TDD (code generation) are publicly available at https://anonymous.4open.science/r/ClassEval-TDD-C4C9/ (Liang et al., 3 Feb 2026).