LLM-Driven Transformation Discovery
- LLM-driven transformation discovery is the automated generation, adaptation, and verification of formal data and scientific transformation rules using neural language models.
- It enables applications like high-definition map verification, scientific equation re-expression, and structured schema extraction with measurable accuracy improvements.
- The approach combines prompt-driven synthesis, modular pipelines, and evolutionary search to reduce manual effort and reveal current model limitations.
LLM-driven transformation discovery refers to the use of modern neural language models, most commonly large language models (LLMs), to automate or semi-automate the generation, adaptation, and verification of data, scientific, symbolic, or logical transformations across diverse domains. This paradigm leverages the code-synthesis, symbolic-reasoning, and prompt-following abilities of LLMs to (1) propose formal or executable representations of rules, equations, algorithms, or structural mappings, and (2) optionally optimize, evaluate, or verify them against well-defined correctness criteria or quantitative metrics. Recent literature demonstrates that LLM-driven discovery pipelines substantially reduce manual engineering effort, increase flexibility for new domains, and expose both the strengths and current limitations of LLMs as symbolic and reasoning agents.
1. Joint Formula and Function Discovery in Rule-Based Verification
LLM-driven transformation discovery is exemplified by joint generation of formal logic rules and code predicates for rule-based verification platforms. In high-definition map transformations for autonomous driving, the LLM-augmented pipeline in CommonRoad (He et al., 3 Nov 2025) couples a prompt-driven LLM “rules generator” with an existing formal-verification engine:
- Context files (ANTLR grammar, current library of Python predicates, XML schema) and instruction are assembled into a system prompt to drive LLM output.
- The LLM generates:
- ANTLR-compliant First-Order Logic (FOL) rule(s) (e.g., ∀ l ∈ Lanelet. slopeWithinLimit(l)).
- Corresponding Python predicate implementation.
- Brief natural-language explanation.
- Syntactic validity is enforced via grammar-based checks; semantic correctness is maintained through human review.
- The FOL rule and predicate are integrated into the verification engine for immediate application.
In a synthetic “Excessive Slope” scenario, the tool detected 100% of injected elevation defects in map data, reduced false positives to zero, and cut manual rule-design time by 75%. The methodology ensures that executable and formally specified transformation rules are discovered in lockstep, bridging symbolic and concrete map representations with minimal manual effort (He et al., 3 Nov 2025).
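The joint rule-and-predicate generation loop can be sketched as a minimal generate-and-validate cycle. This is an illustrative sketch, not the CommonRoad implementation: `stub_llm` stands in for the model call, a toy regex replaces the full ANTLR grammar check, and all names and thresholds (`build_prompt`, `discover_rule`, the 0.1 slope limit) are hypothetical.

```python
import re

def build_prompt(grammar: str, predicates: list, instruction: str) -> str:
    """Assemble the system prompt from context files, mirroring the pipeline's inputs."""
    return (
        f"Grammar:\n{grammar}\n\nExisting predicates:\n" + "\n".join(predicates) +
        f"\n\nTask: {instruction}\n"
        "Return a FOL rule, a Python predicate, and a short explanation."
    )

def stub_llm(prompt: str) -> dict:
    """Stand-in for the LLM call; a real pipeline would query a model here."""
    return {
        "fol_rule": "forall l in Lanelet. slopeWithinLimit(l)",
        "predicate": "def slopeWithinLimit(l):\n    return abs(l['slope']) <= 0.1",
        "explanation": "Every lanelet's slope must stay within the legal limit.",
    }

# Toy stand-in for the ANTLR-grammar syntactic check.
FOL_PATTERN = re.compile(r"^forall \w+ in \w+\. \w+\(\w+\)$")

def discover_rule(grammar, predicates, instruction):
    out = stub_llm(build_prompt(grammar, predicates, instruction))
    if not FOL_PATTERN.match(out["fol_rule"]):
        raise ValueError("rule rejected by grammar check")
    namespace = {}
    exec(out["predicate"], namespace)  # load predicate for the verification engine
    return out["fol_rule"], namespace["slopeWithinLimit"]

rule, pred = discover_rule("rule : 'forall' ...", ["def onRoad(l): ..."],
                           "Flag lanelets with excessive slope.")
print(pred({"slope": 0.25}))  # a defective lanelet fails the predicate
```

The human-review step from the pipeline is deliberately absent here; in practice the generated predicate would be inspected before integration.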
2. Scientific Equation Transformation and Reasoning Challenges
LLM-driven discovery extends to scientific equation transformation tasks that rigorously test algebraic and data-driven reasoning. The LLM-SRBench “LSR-Transform” benchmark (Shojaee et al., 14 Apr 2025) evaluates a model’s ability to invert, rationalize, or otherwise systematically re-express mathematical laws (e.g., inverting a known law to solve for a different variable) given only a problem description, the variables involved, and numeric data:
- Transformed representations preserve the original expression-tree complexity, ruling out trivial pattern-matching solutions.
- LLMs must apply algebraic manipulations, reason about domain constraints, and correctly ground variable semantics.
- Symbolic equivalence is assessed via LLM-based equivalence checking, with GPT-4o agreeing with expert judgments on 94.6% of cases.
- On 111 tasks, the best-performing method attains symbolic accuracy of only 31.5%, with most numeric-precision metrics below 15%.
These findings demonstrate that current LLMs, even with multi-stage prompt refinement and programmatic search, face substantial difficulty in producing correct transformation rules outside textbook forms. They excel at memorized patterns but are challenged by algebraic generalization and context-dependent reasoning (Shojaee et al., 14 Apr 2025).
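A practical complement to LLM-based equivalence checking is numeric sampling: evaluate the original and transformed expressions at random points and compare. The sketch below is illustrative only (the benchmark's numeric-precision metrics work on real data, not this toy); the kinetic-energy example and all function names are assumptions.

```python
import math
import random

def numerically_equivalent(f, g, var_ranges, n=200, rtol=1e-6):
    """Probe two candidate equation forms at random points; agreement at all
    samples is strong (though not conclusive) evidence of symbolic equivalence."""
    rng = random.Random(0)
    for _ in range(n):
        point = {v: rng.uniform(lo, hi) for v, (lo, hi) in var_ranges.items()}
        if not math.isclose(f(**point), g(**point), rel_tol=rtol, abs_tol=1e-9):
            return False
    return True

# A law and an algebraically transformed re-expression (hypothetical example).
original = lambda m, v: 0.5 * m * v**2        # E = (1/2) m v^2
transformed = lambda m, v: (m * v) * v / 2.0  # E = (m v) * v / 2

print(numerically_equivalent(original, transformed,
                             {"m": (0.1, 10), "v": (-5, 5)}))
```

Sampling can confirm inequivalence but only accumulate evidence of equivalence, which is one reason the benchmark pairs it with symbolic checking.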
3. Modular Approaches in Structured Text and Schema Transformation
In document understanding and information extraction, LLM-driven pipelines automate highly structured transformations from unstructured text to formal schema representations:
- Template-Driven Pipelines: Systems such as CDMizer (Mridul et al., 1 Jun 2025, Mridul et al., 28 Oct 2025) leverage schema-derived JSON templates, depth-based retrieval augmentation (RAG), and few-shot prompting to guide the LLM in filling slots of a target schema. Minimal templates ensure that only schema-valid outputs are generated, with downstream validation (JSON Schema, semantic coverage) ensuring both syntactic and contextual fidelity.
- Semantic Extraction Workflows: Hierarchical generation, recursive error correction, and semantic evaluation enable modular slot-filling in large, heterogeneous documents (e.g., OTC financial contracts converted to Common Domain Model).
- Empirical evaluations show that even compact open-source LLMs, when paired with high-precision templates and RAG, achieve near-state-of-the-art results in clause extraction and structuring, matching or approaching larger proprietary models’ performance but with lower cost and compute (Mridul et al., 1 Jun 2025, Mridul et al., 28 Oct 2025).
LLM-driven modular pipelines guarantee 100% schema compliance, demonstrate robust generalization via leave-one-out RAG, and provide scalable extension to new clause types or schemas.
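The template-driven idea can be sketched as a fill-then-validate step. This is a hypothetical miniature, not CDMizer's actual schema or validation stack: the `TEMPLATE` slots, the `stub_llm_fill` stand-in, and the type table are all invented for illustration.

```python
import json

# Hypothetical schema-derived template: None marks slots for the LLM to fill.
TEMPLATE = {"party_a": None, "party_b": None, "notional": None, "currency": None}
REQUIRED_TYPES = {"party_a": str, "party_b": str, "notional": float, "currency": str}

def stub_llm_fill(template: dict, clause_text: str) -> dict:
    """Stand-in for a template-guided, RAG-augmented LLM call."""
    return {"party_a": "Bank A", "party_b": "Fund B",
            "notional": 5_000_000.0, "currency": "USD"}

def validate(filled: dict) -> dict:
    """Downstream check in the JSON-Schema spirit: every slot present, typed."""
    for key, typ in REQUIRED_TYPES.items():
        if key not in filled or not isinstance(filled[key], typ):
            raise ValueError(f"slot {key!r} missing or wrong type")
    return filled

record = validate(stub_llm_fill(TEMPLATE,
                                "Bank A pays Fund B USD 5,000,000 ..."))
print(json.dumps(record, sort_keys=True))
```

Because the LLM only fills slots of a pre-derived template, structural validity is decided by the template and validator rather than by free-form generation, which is what makes the schema-compliance guarantee possible.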
4. Model- and Algorithm-Discovery in Dynamical Systems and Algorithmic Science
In the domain of scientific computing and system identification, LLM-driven transformation discovery includes the autonomous synthesis of interpretable dynamic models and algorithmic updates:
- Power Systems (LLM-DMD): Discovery of differential-algebraic equations (DAEs) for power grids is achieved by sequential LLM-driven loops: (1) a differential-equation loop generating parameterized state-update skeletons, and (2) an algebraic-equation loop discovering explicit mappings or constraints. Gradient-based loss minimization reduces simulation error, and an island-based archival mechanism fosters solution diversity. Adaptive variable extension responds to model stagnation, allowing the LLM to inject necessary latent variables. Performance surpasses SINDy-based baselines, with MAPE ≈ 0.22% and a secondary accuracy metric of ≈ 0.97 on IEEE 39-bus benchmarks (Shen et al., 9 Jan 2026).
- Algorithmic Discovery (Kalman Filter): Evolutionary pipelines combining LLM code synthesis and Cartesian genetic programming recover standard and improved variants of algorithms like the Kalman filter under nominal and pathological conditions. LLM-driven evolution uses function-level mutation, combination under specified signatures, and parallel search in “islands”, converging on interpretable structural innovations (e.g., state-dependent regularization under non-Gaussian noise) that outperform the nominal algorithm (Saketos et al., 13 Aug 2025).
These approaches confirm that LLMs, especially when augmented with gradient-based selection and hybrid search, are effective in discovering both standard and novel transformation rules for complex scientific and engineering problems.
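The island-based search shared by both systems can be sketched on a deliberately tiny problem: evolving the single gain of a scalar filter across isolated populations with occasional migration. Everything here is a toy assumption: random jitter stands in for LLM-proposed structural mutations, and the one-parameter filter is a stand-in for full Kalman-filter code.

```python
import random

def fitness(gain, measurements, truth):
    """Negative squared error of a one-parameter filter: est += gain * (z - est)."""
    est, err = 0.0, 0.0
    for z, x in zip(measurements, truth):
        est += gain * (z - est)
        err += (est - x) ** 2
    return -err

def evolve_islands(measurements, truth, n_islands=4, pop=8, gens=30, seed=0):
    rng = random.Random(seed)
    islands = [[rng.uniform(0, 1) for _ in range(pop)] for _ in range(n_islands)]
    for _ in range(gens):
        for isl in islands:
            isl.sort(key=lambda g: fitness(g, measurements, truth), reverse=True)
            # Mutate elites; an LLM would propose structural edits instead of jitter.
            isl[pop // 2:] = [min(1.0, max(0.0, g + rng.gauss(0, 0.05)))
                              for g in isl[:pop - pop // 2]]
        # Occasional migration keeps islands from collapsing to one solution.
        islands[0][-1] = islands[-1][0]
    return max((g for isl in islands for g in isl),
               key=lambda g: fitness(g, measurements, truth))

rng = random.Random(1)
truth = [1.0] * 50
noisy = [x + rng.gauss(0, 0.3) for x in truth]
best = evolve_islands(noisy, truth)
```

The parallel populations trade exploration for exploitation: each island refines its own basin while migration shares the current best structure, matching the archival mechanisms described above.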
5. Evolutionary and Agentic Search: Financial and Materials Transformations
LLM-driven transformation discovery is further characterized by evolutionary pipelines and agentic frameworks that blend mutation, recombination, and domain-guided critique:
- Cognitive Alpha Mining: CogAlpha (Liu et al., 24 Nov 2025) implements a multi-agent, code-centric evolutionary framework for financial alpha discovery. LLM agents generate, mutate, and recombine Pythonic alpha signals under a documented function interface, with multi-stage prompt refinement and domain-quality checks. Fitness is scored on predictive IC/ICIR/MI metrics, and modules for quality/diversity ensure robust, interpretable innovation. CogAlpha delivers significant accuracy and robustness gains over both ML/DL baselines and legacy factor libraries.
- Metric Evolution (AlphaSharpe): AlphaSharpe (Yuksel et al., 23 Jan 2025) employs symbolic expression trees, LLM-directed crossover/mutation, and out-of-sample ranking and robustness metrics to discover novel risk-adjusted financial metrics. Evolved formulas, combining log-returns, smoothing, downside penalties, and regime sensitivity, outperform the standard Sharpe ratio by 2–3x in both asset ranking and realized portfolio Sharpe ratio.
- Materials Discovery: ChatBattery (Liu et al., 21 Jul 2025) orchestrates an agentic exploration–exploitation loop, with domain CoT prompt injection, for the discovery of improved battery cathode materials. The LLM proposes candidates, self-justifies with reasoning, and is filtered by analytical, computational, and experimental gates. Validated new materials achieve 19–29% capacity improvement over NMC811, with much faster iteration cycles than manual design.
These systems structure LLM reasoning as adaptive, multi-stage guided search, coupling generative creativity with domain-specific feedback loops and hierarchical evaluation.
6. Empirical Benchmarks and Failure Analysis
Empirical analysis across benchmarks such as DiscoveryBench (Majumder et al., 2024) for automated hypothesis and workflow generation reveals both the capabilities and boundaries of LLM-driven transformation discovery:
- Tasks decompose into hypotheses of the form ψ(c, v, r), where context (c), variables (v), and relationships (r) must be discovered via sequences of atomic transformations and tested against provided data.
- Mainstream LLM pipelines (CodeGen, ReAct, DataVoyager, Reflexion) demonstrate moderate performance on short pipelines (≤10 steps, 15–25% hypothesis matching score), but experience severe drop-off as complexity grows or context/variable discovery is required.
- Failure modes include mis-stratification, incomplete feature engineering, and selection of inappropriate statistical tests.
Effective methodologies employ modular planning, transformation libraries, iterative LLM-verification loops, domain-knowledge hints, and chunked execution/refinement to maximize correctness.
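An iterative LLM-verification loop of the kind these methodologies employ can be sketched as propose → check → feed back. Everything here is hypothetical: `llm_propose` is a stub for a real model call (a ReAct- or Reflexion-style agent), and `verify` stands in for actually executing the candidate pipeline against the dataset.

```python
def llm_propose(task, feedback=None):
    """Stub LLM: proposes a transformation pipeline; refines it after feedback."""
    if feedback is None:
        return ["load", "aggregate"]                 # first draft misses a step
    return ["load", "filter_outliers", "aggregate"]  # repaired after critique

def verify(pipeline, required_steps):
    """Checker standing in for executing the pipeline and testing the hypothesis."""
    missing = [s for s in required_steps if s not in pipeline]
    return (len(missing) == 0), missing

def discover(task, required_steps, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        pipeline = llm_propose(task, feedback)
        ok, missing = verify(pipeline, required_steps)
        if ok:
            return pipeline
        feedback = f"missing steps: {missing}"  # fed into the next prompt
    raise RuntimeError("no valid pipeline found")

result = discover("relate income to education level",
                  ["load", "filter_outliers", "aggregate"])
```

The bounded retry budget matters: the benchmark results above show that unguided refinement degrades on long pipelines, so each round should add concrete, checkable feedback rather than simply re-prompting.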
7. Limitations, Extensions, and Future Directions
Common challenges in LLM-driven transformation discovery include:
- Syntactic correctness can be enforced via DSL/grammar constraints, but semantic inversion, inappropriate variable usage, or subtle logical errors remain risks.
- Current LLMs underperform on unseen mathematical forms, complex pipelines, or tasks requiring nuanced domain reasoning without explicit guidance.
- Dependency on large proprietary LLMs (e.g., GPT-4o) raises reproducibility and cost concerns, highlighting the need for open-source alternatives and hybrid symbolic-integration approaches.
- Fully automated semantic checking—especially in safety-critical systems—remains an open problem.
Proposed extensions across literature include more robust semantic verification, coupling LLMs with symbolic solvers, retrieval augmentation, self-critique or “oracle” feedback loops, expanded agent frameworks with explicit planning layers, and domain-specific consistency/type verifiers. Progress in these dimensions will facilitate more reliable, interpretable, and scalable transformation discovery pipelines for autonomous software, science, finance, and engineering (He et al., 3 Nov 2025, Shojaee et al., 14 Apr 2025, Liu et al., 24 Nov 2025, Shen et al., 9 Jan 2026, Liu et al., 21 Jul 2025, Majumder et al., 2024).