NNGPT Framework: Next-Gen AutoML
- NNGPT is an LLM-driven framework that integrates automated neural architecture search, hyperparameter tuning, and code synthesis in a closed-loop pipeline.
- It employs five synergistic LLM-based pipelines, including zero-shot synthesis, hyperparameter recommendation, and code-aware performance prediction, to enhance efficiency.
- The framework achieves state-of-the-art results by continuously improving models through reward-driven fine-tuning and rigorous validation on benchmarks like the LEMUR corpus.
NNGPT refers to an LLM-centric framework for self-improving neural network generation and optimization, proposed as a foundation for next-generation AutoML. NNGPT reconceptualizes automated neural architecture search, hyperparameter tuning, and code synthesis as closed-loop tasks orchestrated by an LLM agent, integrating prompt-based model generation, validation, code execution, performance prediction, and on-the-fly fine-tuning. The framework achieves state-of-the-art efficiency and performance in computer vision neural design, leveraging structured prompting and differentiable adapters for rapid, data-driven iteration, as demonstrated on the LEMUR corpus and PyTorch backends (Kochnev et al., 25 Nov 2025).
1. System Architecture and Closed-Loop Cycle
At its core, NNGPT operates a fully closed-loop AutoML pipeline, positioning the LLM as the central agent that incrementally improves itself and the models it generates. The cycle entails:
- Retrieval and configuration: Given a prompt (specifying task, dataset, and resource budget) and context set (retrieved from the LEMUR corpus), a configuration module emits a JSON/YAML scaffold outlining requirements and constraints.
- Prompt assembly: This scaffold, augmented by relevant reference implementations, is rendered into a structured instruction block encompassing both textual and code exemplars.
- LLM generation: The LLM receives the prompt and outputs a full candidate containing executable PyTorch code and training specifications.
- Validation: An automated validator enforces schema, shape, type, and code constraints, optionally correcting or reprompting if errors are detected.
- Execution and logging: The validated candidate is executed by a trainer, producing per-epoch metric logs and auxiliary runtime data. All artifacts are logged for further analysis and retraining.
- Code-aware performance prediction: A predictor ingests the training code and early-epoch metrics, estimating both the final accuracy and an ideal early-stopping point. Low-confidence runs may be terminated or reallocated.
- Self-improvement: After a batch of runs, the LLM generator is fine-tuned via LoRA adapters using successful (prompt, model, metrics) tuples, and the predictor is retrained for improved regression on logged outcomes.
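As an illustration of the retrieval-and-configuration step, the JSON/YAML scaffold might look roughly as follows. Field names here are hypothetical; the paper does not specify the exact schema:

```python
import json

def build_config(task: str, dataset: str, budget_epochs: int, context_ids: list) -> str:
    """Illustrative configuration scaffold emitted by the configuration module.

    All field names are assumptions made for this sketch, not NNGPT's actual schema.
    """
    scaffold = {
        "task": task,                       # e.g. image classification
        "dataset": dataset,                 # dataset identifier from the prompt
        "budget": {"max_epochs": budget_epochs},
        "constraints": {
            "framework": "pytorch",         # generated code must be executable PyTorch
            "output_format": "json",        # validator expects a structured candidate
        },
        "context": context_ids,             # exemplar IDs retrieved from LEMUR
    }
    return json.dumps(scaffold, indent=2)

print(build_config("image_classification", "cifar10", 30, ["lemur-0412", "lemur-0877"]))
```

The scaffold is then rendered into the structured instruction block described in the prompt-assembly step.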
This tightly interlocked workflow distinguishes NNGPT from classical search-based AutoML, as the model improves through direct experience, reward-driven policy gradients, and supervised fine-tuning within a reproducible, auditable protocol (Kochnev et al., 25 Nov 2025).
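A minimal sketch of the validation step, assuming the validator's duties include checking that a candidate parses as Python and defines a model class; the actual constraint set is richer, covering schema, shape, and type checks:

```python
import ast

def basic_validate(candidate_code: str) -> list:
    """Return a list of error strings; an empty list means the candidate passes.

    Illustrative subset of NNGPT-style validation: syntactic parseability
    plus the presence of at least one class definition (the model).
    """
    errors = []
    try:
        tree = ast.parse(candidate_code)
    except SyntaxError as exc:
        return [f"syntax error: {exc}"]
    has_class = any(isinstance(node, ast.ClassDef) for node in ast.walk(tree))
    if not has_class:
        errors.append("no model class defined")
    return errors

print(basic_validate("class Net:\n    pass\n"))  # [] — passes
print(basic_validate("def train(:\n"))           # syntax error reported
```

On failure, the error list would be fed back into a corrective reprompt, as in the validation step above.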
2. Integrated LLM-Based Pipelines
NNGPT's innovation arises from the integration of five synergistic LLM-based pipelines atop a shared backend:
- Zero-Shot Architecture Synthesis: Uses Few-Shot Architecture Prompting (FSAP) to produce executable neural architectures from natural language task descriptions, reference code blocks, and dataset metadata. Hash-based deduplication ensures diversity and reduces redundant experiments.
- Hyperparameter Recommendation: Predicts optimal hyperparameters directly from model code and task context using a structured prompt schema, achieving root mean square error (RMSE) $0.60$ on the LEMUR corpus versus $0.64$ for Optuna.
- Code-Aware Accuracy and Early-Stop Prediction: Employs a DeepSeek-Coder-1.3B backbone fine-tuned with QLoRA to regress final accuracy and early-stop epoch from raw code and early validation metrics, reaching an RMSE of approximately 0.145.
- Retrieval-Augmented Block Synthesis (NN-RAG): Queries a curated SQLite index of 1,289 scope-closed PyTorch blocks for plug-and-play architectures, with 73% executability on randomized insertions. The backend extracts dependency closures and assembles modules respecting Python scope and import order.
- Reinforcement Learning (RL) with Closed-Loop Generation: Utilizes a policy-gradient–trained LLM as a policy to generate novel architectures, with reward aggregating syntactic validity, runtime feasibility, and short-run accuracy, and includes channel-mutation of architectures via torch.fx (Kochnev et al., 25 Nov 2025).
These pipelines operate within a unified seven-stage backbone (retrieval, config, prompt assembly, LLM generation, validation/execution, LoRA fine-tuning, logging), enabling continuous, data-driven improvement.
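To make the RL reward concrete, the following sketches a composite reward aggregating the three signals named above: syntactic validity, runtime feasibility, and short-run accuracy. The weights and the gated aggregation rule are assumptions for illustration, not taken from the paper:

```python
def composite_reward(is_valid_syntax: bool, runs_without_error: bool,
                     short_run_accuracy: float,
                     w_syntax: float = 0.2, w_runtime: float = 0.2,
                     w_accuracy: float = 0.6) -> float:
    """Aggregate syntactic validity, runtime feasibility, and short-run accuracy.

    Hard gating: a candidate that fails to parse earns nothing, and one that
    crashes at runtime earns no accuracy credit. Weights are illustrative.
    """
    if not is_valid_syntax:
        return 0.0
    reward = w_syntax
    if not runs_without_error:
        return reward
    reward += w_runtime + w_accuracy * short_run_accuracy
    return reward

print(composite_reward(True, True, 0.85))   # ≈ 0.91
print(composite_reward(True, False, 0.85))  # 0.2 — valid code, but crashes
```

A scalar reward of this shape is what a policy-gradient update over the generator LLM would consume.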
3. Validation, Fine-Tuning, and Dataset Management
All NNGPT outputs and experiments are validated, logged, and reprocessed using the LEMUR dataset—a large, auditable corpus of executable neural networks and unified preprocessing code. The system:
- Retrieves exemplars for contextual prompting and in-context learning.
- Fine-tunes on accepted (input prompt, code, and metrics) pairs, employing LoRA adapters for rapid, parameter-efficient transfer.
- Fine-tunes on (code, early metrics) to (final accuracy, optimal stop epoch), improving early-termination heuristics.
- Performs rigorous hash-based deduplication (via MD5) on whitespace-normalized code, saving an estimated 200–300 GPU-hours on synthetic architecture generation.
- Logs all results for reproducible, queryable benchmarking (Kochnev et al., 25 Nov 2025).
This design guarantees that all self-improvement cycles are traceable and auditable.
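The hash-based deduplication described above can be sketched as follows. The whitespace normalization here is a simple assumption; the paper states only that MD5 is applied to whitespace-normalized code:

```python
import hashlib

def code_fingerprint(source: str) -> str:
    """MD5 over whitespace-normalized source, so candidates that differ
    only in spacing or blank lines collapse to a single fingerprint."""
    normalized = " ".join(source.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(source: str) -> bool:
    """Record the fingerprint and report whether it was already seen."""
    fp = code_fingerprint(source)
    if fp in seen:
        return True
    seen.add(fp)
    return False

a = "class Net:\n    pass"
b = "class Net:  \n\n    pass"   # same code, different whitespace
print(is_duplicate(a))  # False (first occurrence)
print(is_duplicate(b))  # True  (whitespace-only variant)
```

Skipping whitespace-only variants before training is what yields the GPU-hour savings cited above.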
4. Key Algorithms and Pseudocode
A representative end-to-end run is encapsulated by the following pseudocode:
```text
procedure NNGPT_Run(task P, budget T):
    R ← LEMUR.query(task=P)
    config ← build_config(task=P, budget=T, context=R)
    prompt ← assemble_prompt(config)
    C_raw ← LLM.generate(prompt)
    if not Validator.check(C_raw):
        C_raw ← LLM.regenerate(prompt_with_errors)
    C ← Validator.fix_and_parse(C_raw)
    (m_1:T, u) ← Trainer.execute(C, T)
    log_run(P, R, C, m_1:T, u)
    if early_stop_pred:
        (acc_hat, t_hat) ← Predictor.predict(C, m_1:t_0)
        if acc_hat < threshold: terminate_run()
    if LoRA_update_condition:
        G_θ ← LoRA.finetune(G_θ, recent_successes)
        H_φ ← retrain_predictor(H_φ, all_logged)
    return final_metrics(m_1:T)
```
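The early-termination branch can be illustrated with a stand-in predictor. The real predictor H_φ is a fine-tuned DeepSeek-Coder model regressing on code and metrics; this toy version merely extrapolates the early validation curve to show the control flow:

```python
def should_terminate(early_val_acc: list, threshold: float = 0.5) -> bool:
    """Toy stand-in for Predictor.predict: project final accuracy as the best
    early-epoch validation accuracy plus the latest improvement, and flag
    low-confidence runs that fall below the threshold. Purely illustrative."""
    if len(early_val_acc) < 2:
        return False  # not enough evidence to stop yet
    trend = early_val_acc[-1] - early_val_acc[-2]
    predicted_final = min(1.0, max(early_val_acc) + max(trend, 0.0))
    return predicted_final < threshold

print(should_terminate([0.10, 0.12, 0.13]))  # True: projected to stay below 0.5
print(should_terminate([0.35, 0.48, 0.55]))  # False: projected above 0.5
```

In the full pipeline the freed budget from terminated runs is reallocated to more promising candidates.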
5. Quantitative Evaluation and Comparison
NNGPT is empirically benchmarked on the LEMUR dataset and diverse computer vision tasks. Notable results include:
| Task/Pipeline | Metric / Result | Baseline Comparison |
|---|---|---|
| Zero-shot synthesis | Balanced mean acc. 53.1% | Baseline: 51.5% |
| Hyperparameter recommendation | RMSE ≈ 0.60 | Optuna: 0.64 |
| Code accuracy pred. | RMSE ≈ 0.145 | - |
| NN-RAG executability | 941/1,289 (73%) | - |
| RL one-epoch (MNIST) | 0.9876 avg. accuracy | AlexNet: 0.7088 |
This suggests that LLM-driven pipelines can match or outperform classical search-based AutoML in accuracy and efficiency, achieving similar results with substantially fewer trials and lower compute cost. NNGPT's one-shot predictions, RL-based architecture search, and retrieval-augmented synthesis represent substantial compute savings over >20K trial-based approaches (Kochnev et al., 25 Nov 2025).
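For reference, the RMSE figures reported in the table follow the standard definition; a minimal helper:

```python
import math

def rmse(predicted: list, actual: list) -> float:
    """Root mean square error between predicted and observed values."""
    assert predicted and len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Hypothetical predicted vs. observed accuracies, just to show the computation.
print(rmse([0.7, 0.5, 0.9], [0.6, 0.5, 0.8]))  # ≈ 0.0816
```

Lower RMSE means the recommender's or predictor's outputs sit closer to the observed outcomes, which is the basis of the comparison against Optuna above.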
6. Design Considerations, Trade-Offs, and Future Directions
NNGPT's efficacy and generality depend on architectural, prompt, and backend choices:
- Performance degrades when prompt context exceeds 3–4 examples due to LLM context length limits; retrieval-augmented construction may mitigate this bottleneck.
- Prolonged LoRA fine-tuning on hyperparameter prompts leads to overfitting; early-stopping and strong regularization are required.
- Correlation in early-stop prediction varies by task; model uncertainty quantification is needed for robust early termination.
- Current scaling is limited to single 24 GB GPU nodes; large-scale extensions (e.g., ImageNet-1k, segmentation) necessitate distributed orchestration.
- Potential research extensions: hardware-constrained architecture search via prompt-objective integration, multi-objective RL pipelines, expanded retrieval of user-supplied modules, and architectural novelty enforcement.
NNGPT provides a reproducible, framework-agnostic foundation for generative AutoML workflows, demonstrating the capability of LLMs to autonomously generate, validate, predict, and improve neural network designs in a rigorously closed loop (Kochnev et al., 25 Nov 2025).