
NNGPT Framework: Next-Gen AutoML

Updated 2 December 2025
  • NNGPT is an LLM-driven framework that integrates automated neural architecture search, hyperparameter tuning, and code synthesis in a closed-loop pipeline.
  • It employs five synergistic LLM-based pipelines, including zero-shot synthesis, hyperparameter recommendation, and code-aware performance prediction, to enhance efficiency.
  • The framework achieves state-of-the-art results by continuously improving models through reward-driven fine-tuning and rigorous validation on benchmarks like the LEMUR corpus.

NNGPT refers to an LLM-centric framework for self-improving neural network generation and optimization, proposed as a foundation for next-generation AutoML. NNGPT reconceptualizes automated neural architecture search, hyperparameter tuning, and code synthesis as closed-loop tasks orchestrated by an LLM agent, integrating prompt-based model generation, validation, code execution, performance prediction, and on-the-fly fine-tuning. The framework achieves state-of-the-art efficiency and performance in computer vision neural design, leveraging structured prompting and differentiable adapters for rapid, data-driven iteration, as demonstrated on the LEMUR corpus and PyTorch backends (Kochnev et al., 25 Nov 2025).

1. System Architecture and Closed-Loop Cycle

At its core, NNGPT operates a fully closed-loop AutoML pipeline, positioning the LLM as the central agent that incrementally improves itself and the models it generates. The cycle entails:

  • Retrieval and configuration: Given a prompt P (specifying task, dataset, and resource budget) and context set R (retrieved from the LEMUR corpus), a configuration module emits a JSON/YAML scaffold outlining requirements and constraints.
  • Prompt assembly: This scaffold, augmented by relevant reference implementations, is rendered into a structured instruction block encompassing both textual and code exemplars.
  • LLM generation: The LLM G_θ receives the prompt and outputs a full candidate C containing executable PyTorch code and training specifications.
  • Validation: An automated validator V enforces schema, shape, type, and code constraints, optionally correcting or reprompting if errors are detected.
  • Execution and logging: The validated candidate is executed by a trainer E, producing per-epoch metric logs m_{1:T} and auxiliary runtime data u. All artifacts are logged for further analysis and re-training.
  • Code-aware performance prediction: A predictor H_φ ingests the training code and early epoch metrics m_{1:t_0}, estimating both the final accuracy âcc_* and an ideal early-stopping point t̂_*. Low-confidence runs may be terminated or reallocated.
  • Self-improvement: After a batch of K runs, G_θ is fine-tuned via LoRA adapters using successful (prompt, model, metrics) tuples, and H_φ is retrained for improved regression on logged outcomes.
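The cycle above can be condensed into a minimal Python skeleton. Everything here is a stub: the class name, the stand-in generation and training steps, and the batch-level self-improvement hook are illustrative assumptions, not NNGPT's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ClosedLoop:
    """Hypothetical sketch of the retrieve -> generate -> validate ->
    execute -> log -> self-improve cycle; all components are stubbed."""
    history: list = field(default_factory=list)

    def run_once(self, prompt, context):
        config = {"task": prompt, "context": context}       # retrieval + configuration
        candidate = f"model_for::{prompt}"                  # LLM generation (stubbed)
        if not candidate:                                   # validation gate
            return None
        metrics = [0.1 * t for t in range(1, 6)]            # execution -> per-epoch logs
        self.history.append((prompt, candidate, metrics))   # log artifacts for fine-tuning
        return metrics[-1]

    def self_improve(self):
        # After a batch of K runs, successful (prompt, model, metrics)
        # tuples would drive LoRA fine-tuning of G_θ and retraining of H_φ.
        return len(self.history)
```

The key structural point is that logging feeds fine-tuning: every run's artifacts become training data for the next generation round.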

This tightly interlocked workflow distinguishes NNGPT from classical search-based AutoML: the model improves through direct experience, reward-driven policy gradients, and supervised fine-tuning within a reproducible, auditable protocol (Kochnev et al., 25 Nov 2025).

2. Integrated LLM-Based Pipelines

NNGPT's innovation arises from the integration of five synergistic LLM-based pipelines atop a shared backend:

  1. Zero-Shot Architecture Synthesis: Uses Few-Shot Architecture Prompting (FSAP) to produce executable neural architectures from natural language task descriptions, reference code blocks, and dataset metadata. Hash-based deduplication ensures diversity and reduces redundant experiments.
  2. Hyperparameter Recommendation: Predicts optimal hyperparameters directly from model code and task context using a structured prompt schema, achieving a root mean square error (RMSE) of 0.60 on the LEMUR corpus versus 0.64 for Optuna.
  3. Code-Aware Accuracy and Early-Stop Prediction: Employs a DeepSeek-Coder-1.3B backbone fine-tuned with QLoRA to regress final accuracy and early-stop epoch from raw code and early validation metrics, reaching RMSE ≈ 0.145 and Pearson r ≈ 0.78.
  4. Retrieval-Augmented Block Synthesis (NN-RAG): Queries a curated SQLite index of 1,289 scope-closed PyTorch blocks for plug-and-play architectures, with 73% executability on randomized insertions. The backend extracts dependency closures and assembles modules respecting Python scope and import order.
  5. Reinforcement Learning (RL) with Closed-Loop Generation: Utilizes a policy-gradient–trained LLM as a policy π_θ to generate novel architectures, with a reward aggregating syntactic validity, runtime feasibility, and short-run accuracy, and includes channel mutation of architectures via torch.fx (Kochnev et al., 25 Nov 2025).
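A reward signal in the spirit of pipeline 5 could be sketched as follows. The specific weights and the gating order (runtime feasibility only counted if the code parses) are illustrative assumptions; the paper does not specify the aggregation formula.

```python
import ast

def syntactically_valid(code: str) -> bool:
    """Cheap validity check: does the candidate even parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def reward(code: str, ran_ok: bool, short_run_accuracy: float,
           w_syntax=0.2, w_runtime=0.3, w_acc=0.5) -> float:
    """Aggregate syntactic validity, runtime feasibility, and short-run
    accuracy into a scalar reward for policy-gradient updates.
    Weights are hypothetical, not taken from the paper."""
    r = 0.0
    if syntactically_valid(code):
        r += w_syntax
        if ran_ok:  # runtime feasibility only counts once the code parses
            r += w_runtime + w_acc * short_run_accuracy
    return r
```

Gating the accuracy term behind validity and runtime checks keeps the policy from being rewarded for unrunnable but superficially plausible code.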

These pipelines operate within a unified seven-stage backbone (retrieval, config, prompt assembly, LLM generation, validation/execution, LoRA fine-tuning, logging), enabling continuous, data-driven improvement.
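One way to picture the shared backbone is as a fixed sequence of stage handlers applied to a mutable run state. The stage names follow the text; the dict-based state and handler registry are illustrative assumptions, not the framework's actual interface.

```python
# Seven backbone stages, in order, as named in the text.
STAGES = ["retrieval", "config", "prompt_assembly", "llm_generation",
          "validation_execution", "lora_finetuning", "logging"]

def run_backbone(state: dict, handlers: dict) -> dict:
    """Apply each stage's handler to the run state in sequence.
    Missing handlers default to the identity, so partial pipelines
    still produce a complete, ordered audit trail."""
    for stage in STAGES:
        state = handlers.get(stage, lambda s: s)(state)
        state.setdefault("trace", []).append(stage)  # audit trail per run
    return state
```

Because every pipeline shares this spine, artifacts from any of the five pipelines land in the same logs and feed the same fine-tuning step.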

3. Validation, Fine-Tuning, and Dataset Management

All NNGPT outputs and experiments are validated, logged, and reprocessed using the LEMUR dataset—a large, auditable corpus of executable neural networks and unified preprocessing code. The system:

  • Retrieves exemplars for contextual prompting and in-context learning.
  • Fine-tunes G_θ on accepted (input prompt, code, metrics) tuples, employing LoRA adapters for rapid, parameter-efficient transfer.
  • Fine-tunes H_φ on (code, early metrics) → (final accuracy, optimal stop epoch) pairs, improving early-termination heuristics.
  • Performs rigorous hash-based deduplication (via MD5) on whitespace-normalized code, saving 200–300 GPU-hours on synthetic architectures.
  • Logs all results for reproducible, queryable benchmarking (Kochnev et al., 25 Nov 2025).
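The deduplication step can be sketched as follows. Keying by the MD5 of whitespace-normalized code matches the description above; the exact normalization rule used here (collapsing all whitespace runs) is an assumption.

```python
import hashlib

def code_key(code: str) -> str:
    """MD5 over whitespace-normalized code, so formatting-only
    variants collapse to the same key."""
    normalized = " ".join(code.split())  # collapse newlines/indentation/spaces
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(candidates: list) -> list:
    """Keep only the first candidate per normalized-code hash."""
    seen, unique = set(), []
    for code in candidates:
        k = code_key(code)
        if k not in seen:  # skip formatting-only duplicates
            seen.add(k)
            unique.append(code)
    return unique
```

Filtering on the hash before training is what avoids re-running experiments on architecturally identical candidates.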

This design guarantees that all self-improvement cycles are traceable and auditable.

4. Key Algorithms and Pseudocode

A representative end-to-end run is encapsulated by the following pseudocode:

procedure NNGPT_Run(task P, budget T):
    R ← LEMUR.query(task=P)
    config ← build_config(task=P, budget=T, context=R)
    prompt ← assemble_prompt(config)
    C_raw ← LLM.generate(prompt)
    if not Validator.check(C_raw):
        C_raw ← LLM.regenerate(prompt_with_errors)
    C ← Validator.fix_and_parse(C_raw)
    (m_1:T, u) ← Trainer.execute(C, T)
    log_run(P, R, C, m_1:T, u)
    if early_stop_pred:
        (âcc_*, t̂_*) ← Predictor.predict(C, m_1:t_0)
        if âcc_* < threshold: terminate_run()
    if LoRA_update_condition:
        G_θ ← LoRA.finetune(G_θ, recent_successes)
        H_φ ← retrain_predictor(H_φ, all_logged)
    return final_metrics(m_1:T)

This codifies the closed-loop selection, generation, validation, prediction, and self-improvement sequence (Kochnev et al., 25 Nov 2025).

5. Quantitative Evaluation and Comparison

NNGPT is empirically benchmarked on the LEMUR dataset and diverse computer vision tasks. Notable results include:

Task/Pipeline           Metric / Result             Baseline Comparison
Zero-shot synthesis     Balanced mean acc. 53.1%    Baseline 51.5%
Hyperparameter HPO      RMSE ≈ 0.60                 Optuna: 0.64
Code accuracy pred.     RMSE ≈ 0.145, r ≈ 0.78      –
NN-RAG executability    941/1,289 (73%)             –
RL one-epoch (MNIST)    0.9876 avg                  AlexNet 0.7088

This suggests that LLM-driven pipelines can match or outperform classical search-based AutoML in accuracy and efficiency, achieving similar results with substantially fewer trials and lower compute cost. NNGPT's one-shot predictions, RL-based architecture search, and retrieval-augmented synthesis represent substantial compute savings over >20K trial-based approaches (Kochnev et al., 25 Nov 2025).
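For reference, the RMSE and Pearson r metrics reported above follow their standard definitions and can be computed directly from logged (predicted, actual) pairs; the functions below are a plain sketch, not NNGPT's evaluation code.

```python
import math

def rmse(pred, true):
    """Root mean square error over paired predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```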

6. Design Considerations, Trade-Offs, and Future Directions

NNGPT's efficacy and generality depend on architectural, prompt, and backend choices:

  • Performance degrades when prompt context exceeds 3–4 examples due to LLM context length limits; retrieval-augmented construction may mitigate this bottleneck.
  • Prolonged LoRA fine-tuning on hyperparameter prompts leads to overfitting; early-stopping and strong regularization are required.
  • Correlation in early-stop prediction varies by task; model uncertainty quantification is needed for robust early termination.
  • Current scaling is limited to single 24 GB GPU nodes; large-scale extensions (e.g., ImageNet-1k, segmentation) necessitate distributed orchestration.
  • Potential research extensions: hardware-constrained architecture search via prompt-objective integration, multi-objective RL pipelines, expanded retrieval of user-supplied modules, and architectural novelty enforcement.

NNGPT provides a reproducible, framework-agnostic foundation for generative AutoML workflows, demonstrating the capability of LLMs to autonomously generate, validate, predict, and improve neural network designs in a rigorously closed loop (Kochnev et al., 25 Nov 2025).
