Proxy-Aligned Co-Evolution Mechanism

Updated 20 January 2026

Proxy-Aligned Co-Evolution is an optimization framework where multiple proxy models are simultaneously updated under explicit alignment constraints to closely mimic non-differentiable ground-truth systems.
The mechanism employs differentiable neural proxies and embedding caches, using loss functions and coupled or alternating optimization to prevent proxy drift and improve system robustness.
Applications include hardware design reinforcement (RL-MORPH) and multimodal out-of-distribution detection (CoEvo), achieving improved convergence and calibration metrics in empirical evaluations.

The proxy-aligned co-evolution mechanism is a class of rigorous optimization and adaptive realignment procedures in machine learning and artificial intelligence that simultaneously evolve two or more proxy models or signal caches under explicit alignment constraints. Recent implementations span hardware design-reinforcement co-optimization and zero-shot out-of-distribution (OOD) detection in multimodal representation spaces, exemplified by RL-MORPH for design-policy joint search (He et al., 2023) and CoEvo for vision-language OOD uncertainty calibration (Tang et al., 13 Jan 2026). These instantiations integrate differentiable proxies (typically neural networks or embedding caches) with non-differentiable or black-box ground-truth models by penalizing proxy drift, synchronizing proxy updates, and using alternating or coupled optimization routines to yield robust, high-performance solutions.

1. Architectural Foundations of Proxy Models

Proxy-aligned co-evolution deploys at least two model representations:

For hardware co-design (RL-MORPH), the true environment is defined by a possibly non-differentiable physics model $h_\phi(z|s,a)$ , indexed by design parameters $\phi$ (e.g., link lengths, joint angles). A differentiable proxy neural network $h^{\mathrm{nn}}_\psi(z|s,a)$ is concurrently evolved: implemented as a multilayer MLP architecture (e.g., 3 hidden layers, ReLU activations, output dimension $|Z|$ ), it approximates the cumulative effect of $\phi$ on task-space effect $z$ , enabling gradient-based optimization.
For cross-modal representation (CoEvo), dual proxy caches of textual and visual embeddings are initialized. The textual proxy $T_p$ is derived from encoded prompts for in-distribution (ID) classes ( $K\times D$ dimensionality), while the negative textual proxy $T_n$ ( $M\times D$ ) grows at test time by mining semantic negatives. The visual proxies $V_p$ and $V_n$ are $K\times L\times D$ and $M\times L\times D$ queues, respectively, continually refined with empirical exemplars aligned to respective textual proxies.

This dual-proxy setup is foundational, enabling proxy alignment and mutual evolution to support robust optimization even in the presence of non-differentiable or drifting features and distributions.

2. Alignment Objectives and Loss Formulations

A core element is the explicit alignment loss that penalizes the discrepancy between proxy outputs and ground-truth or reference signals:

In RL-MORPH, the alignment objective is $L_\mathrm{align}(\psi; \phi) = \mathbb{E}_{(s,a)\sim D}\left[\|h_\phi(s,a) - h_\psi(s,a)\|^2\right]$ , using a buffer $D$ of input pairs collected during training to ensure the neural proxy remains close to the physics model.
In CoEvo, alignment is implicit through shared embedding spaces and update gating. All proxies and embeddings are normalized in $\mathbb{R}^D$ , and cosine similarity $\mathrm{cos}(u,v)$ is used for scoring. Updates are gated through adaptive thresholds and confidence margins, preventing noisy alignment near uncertain regions.

Alignment loss terms are balanced with task or detection objectives via trade-off hyperparameters (e.g., $\alpha$ in RL-MORPH), providing a regularization effect that tethers the proxy to a reliable reference while allowing for adaptive evolution, minimizing risk of drift or misalignment.

3. Co-Evolution Mechanism and Optimization Algorithms

Proxy-aligned co-evolution proceeds via coupled or alternating updates driven by shared loss landscapes and proxy-task interdependence:

For RL-MORPH, the total loss $L_\mathrm{total}(\theta, \psi; \phi) = L_\mathrm{task}(\theta, \psi) + \alpha L_\mathrm{align}(\psi; \phi)$ is jointly minimized: PPO/A2C is used for policy gradients $\nabla_\theta L_\mathrm{task}$ ; for proxy gradients, destructive interference between $\nabla_\psi L_\mathrm{task}$ and $\nabla_\psi L_\mathrm{align}$ is mitigated via PCGrad projection. Specifically, if $S_c < 0$ (computed as inner product normalized by $\ell_2$ norm), the RL gradient is projected to remove the component aligned with proxy loss, then summed with weighted proxy gradient for SGD/Adam update.
In CoEvo, a test-time loop conditions proxy evolution on sample-specific scores: after computing preliminary multimodal scores $S_\mathrm{CoEvo}^{\mathrm{pre}}(x)$ , new negatives are mined (near/far semantics) based on confidence-gated thresholds. Visual caches are synchronously expanded and refined by entropy-based selection, keeping high-confidence, low-entropy exemplars. To prevent premature misalignment, updates only occur away from the adaptive threshold, as regulated by margin $\gamma$ .

4. Dynamic Re-Weighting and Decision Fusion

Both RL-MORPH and CoEvo employ dynamic fusion of proxy signals for robust decision-making:

In RL-MORPH, proxy accuracy and actionable policy are balanced via $\alpha$ . Ablating $\alpha$ yields a U-shaped performance curve: low $\alpha$ produces proxy drift and poor returns, high $\alpha$ impedes task reward learning.
CoEvo utilizes a fusion parameter $\lambda$ ( $0.5 \leq \lambda < 1$ ) to combine textual and visual proxy scores. Pre-update fusion ( $S_\mathrm{CoEvo}^{\mathrm{pre}}(x) = \lambda S_T^{\mathrm{pre}}(x) + (1-\lambda) S_V^{\mathrm{pre}}(x)$ ) privileges stable textual cues; post-update ( $S_\mathrm{CoEvo}^{\mathrm{post}}(x) = (1-\lambda) S_T^{\mathrm{post}}(x) + \lambda S_V^{\mathrm{post}}(x)$ ) leverages evolved, contextually aligned visual features. Empirically, $\lambda \approx 0.8, N \approx 5$ yield optimal calibration (Tang et al., 13 Jan 2026).

This fusion strategy calibrates OOD detection and balances task performance with proxy reliability, enabling adaptability to distributional shift and data imbalance.

5. Experimental Setups, Results, and Ablations

Empirical evaluation on RL-MORPH encompassed 2D Reacher (5-link planar arm with forward kinematics physics) and 3D multi-finger manipulation with full Mujoco dynamics, using episodic return as the metric. Baselines included RL-noHWOpt, outer-loop CMA-ES, and Transform2Act. RL-MORPH converged to high-return designs under all tested tasks, with learning collapse observed in the absence of proxy alignment or proper gradient projection.

For CoEvo, experiments on ImageNet-1K (ID) and four OOD sets (iNat, SUN, Places, Textures) demonstrated AUROC improvement to $97.95\%$ (vs. $96.66\%$ for AdaNeg) and FPR95 reduction to $10.22\%$ (vs. $18.92\%$ ), with ablation showing dual-modal evolution is superior to textual-only or visual-only variants. Performance was robust under data imbalance and varying test set sizes (Tang et al., 13 Jan 2026).

6. Practical Recommendations and Limitations

Reported best practices are:

For RL-MORPH, always include a small alignment term ( $\alpha \in [0.1, 1.0]$ ) and employ PCGrad or similar conflict mitigation between RL and alignment gradients. A 3-4 layer MLP proxy with widths $\approx 2 \times (|s|+|a|)$ sufficed for various robotic configurations. Sample proxy training data every $K=10$ updates to maintain buffer freshness. For design search, use CMA-ES for low-dimensional $\phi$ or SGD for differentiable $h_\phi$ ; scalability to topological changes (e.g., variable numbers of robot links) remains an open direction (He et al., 2023).
For CoEvo, optimal adaptation was achieved with fusion weight $\lambda \approx 0.8$ and mining batch size $N \approx 5$ . High-confidence gating ( $\gamma$ ) and entropy-based exemplar selection were critical for stable evolution. Both textual and visual caches should be iteratively updated and realigned for best OOD calibration (Tang et al., 13 Jan 2026).

Limitations include parametric design constraints (fixed topology) in RL-MORPH and absence of gradient-based test-time supervision in CoEvo. For sim-to-real transfer, real-world rollout data can be incorporated into alignment objectives.

7. Impact and Research Directions

Proxy-aligned co-evolution mechanisms decompose otherwise intractable joint optimization into modular, tightly coupled subproblems, facilitating efficient gradient-based search and adaptation under complex, non-differentiable, or distributionally unstable regimes. Extending these mechanisms to topological and structure search (e.g., graph-NN proxies), as well as incorporating real-world feedback for sim-to-real transfer, represent salient directions for future research. The proxy-aligned co-evolution paradigm is empirically validated to enable robust joint optimization and adaptive detection in open-world scenarios (He et al., 2023, Tang et al., 13 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (2)

MORPH: Design Co-optimization with Reinforcement Learning via a Differentiable Hardware Model Proxy (2023)

Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Proxy-Aligned Co-Evolution Mechanism.