Unlabeled Reward Generation

Updated 15 February 2026
  • Unlabeled Reward Generation is a set of methods that infer reward functions from unlabeled data, enabling learning in environments lacking explicit fitness metrics.
  • It employs techniques such as self-supervision, pseudo-labeling, and contrastive discrimination to adapt reward structures dynamically.
  • These approaches enhance scalability and generalization in reinforcement learning and generative modeling, with applications in robotics and open-ended systems.

Unlabeled reward generation encompasses methodologies for constructing, learning, or adaptively evolving reward functions in scenarios where explicit reward labels or external fitness metrics are absent, incomplete, or unreliable. In contrast to paradigms relying on fully supervised reward annotation or task-specific scalar feedback, these approaches create dense or structured rewards from experiential, demonstrational, or self-supervised signals—enabling reinforcement learning, imitation, and generative modeling in open-ended, reward-scarce, or autonomous agent settings.

1. Conceptual Landscape: Definitions and Motivations

Unlabeled reward generation refers to algorithms that infer, synthesize, or adapt reward functions without direct access to human-annotated reward labels or externally imposed fitness measures. Central challenges motivating these approaches include:

  • The emergence and adaptation of goals and associated behaviors endogenously as environments evolve, as in open-ended life-like or multi-agent systems (Bailey, 2024).
  • The impracticality or cost of exhaustively labeling large datasets, particularly for diverse or complex tasks seen in offline RL, robotics, and generative modeling (Lee et al., 3 Apr 2025, Zolna et al., 2020).
  • The limitations of hand-crafted, fixed reward functions in supporting generalization, transfer, and scalable learning.

Unlabeled reward generation thus spans a spectrum: from endogenous adaptation of internal reward structures within individual or population agents, to reward inference via self-supervision, intrinsic motivation, pseudo-labeling using auxiliary models, or contrastive discrimination from unlabeled experience.

2. Endogenous and Self-Adaptive Reward Generation

The RULE algorithm ("Reward Updating through Learning and Expectation") typifies endogenous reward generation in open-ended RL (Bailey, 2024). Here, each agent (Ent) operates according to a POMDP framework where the reward function $R(\theta, s, a)$ is parameterized by a vector $\theta \in \mathbb{R}^k$, with no external fitness guidance.

  • Meta-Reward Adaptation: Agents accumulate per-component rewards $R_i(t)$; at reproductive events, actual cumulative rewards $A_i(T)$ are compared to expected profiles $E_i(\tau)$. The $\theta$ coefficients governing reward salience are meta-updated:

$$E_i(\tau) \leftarrow E_i(\tau) + \alpha_i\,\mathrm{sign}(\Delta_i), \qquad \theta_i \leftarrow \theta_i + \beta_i\,\mathrm{sign}(\Delta_i),$$

where $\Delta_i = A_i(T) - E_i(\tau)$ and $\alpha_i, \beta_i > 0$ are step sizes.

  • Continuous Evolution: Offspring inherit the updated $\theta, E$; novel environmental items are integrated as dormant reward components whose coefficient evolution is experience-driven.
  • Empirical Findings: In ecosystem simulations, RULE supports the discovery, abandonment, or amplification of behaviors (e.g., eschewing detrimentally distracting coins, leveraging emergent vitamins/poisons)—dynamically reformulating rewards and thus underlying policies in complex, changing worlds (Bailey, 2024).
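The RULE meta-update at a reproductive event can be sketched in a few lines. This is a minimal illustration of the sign-based update rule described above; the function name and the toy numbers are illustrative, not from the paper:

```python
import numpy as np

def rule_meta_update(theta, E, A, alpha, beta):
    """One RULE-style meta-update at a reproductive event.

    theta : per-component reward salience coefficients theta_i
    E     : expected cumulative reward profile E_i(tau)
    A     : actual cumulative rewards A_i(T) accrued over the lifetime
    alpha, beta : positive per-component step sizes
    """
    delta = A - E                                  # Delta_i = A_i(T) - E_i(tau)
    E_new = E + alpha * np.sign(delta)             # expectations track outcomes
    theta_new = theta + beta * np.sign(delta)      # reward salience shifts the same way
    return theta_new, E_new

# A component that over-delivered (A > E) gains salience; one that
# under-delivered loses it.
theta, E = np.array([1.0, 1.0]), np.array([5.0, 5.0])
A = np.array([8.0, 2.0])
theta2, E2 = rule_meta_update(theta, E, A, alpha=0.1, beta=0.05)
```

Because only the sign of the mismatch is used, the update is robust to the scale of accumulated rewards; the step sizes alone control how quickly salience drifts.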

3. Discriminative and Contrastive Reward Inference from Unlabeled Data

Several frameworks extract reward signals using discriminative models trained to distinguish expert-like states/trajectories from mixed or unlabeled experience:

  • Positive-Unlabeled (PU) Reward Learning: The PURL framework addresses scenarios with positive (expert) and unlabeled (mixed-quality) data (Xu et al., 2019). A surrogate reward is learned via non-negative PU-risk minimization:

$$\tilde R_{PU}(r_\theta) = \pi\,\hat{\mathbb{E}}_{D^+}[\ell(r_\theta, +1)] + \max\left(0,\; \hat{\mathbb{E}}_{D_u}[\ell(r_\theta, -1)] - \pi\,\hat{\mathbb{E}}_{D^+}[\ell(r_\theta, -1)]\right),$$

where $D^+$ contains positives, $D_u$ unlabeled data, and $\pi$ is the class prior. Integration ranges from adversarial imitation (PUGAIL) to reward-masking in supervised settings.

  • Offline Reinforced Imitation Learning (ORIL): ORIL learns a reward function by contrasting demonstrator and unlabeled trajectories using a discriminator, then annotates all transitions (including those in $D_U$) for compatibility with offline RL algorithms (e.g., Critic-Regularized Regression) (Zolna et al., 2020). PU-learning penalties and TRAIL constraints are used to mitigate the bias of treating all unlabeled data as negative.
  • Zero-Reward and Reweighting in Offline RL: A simple yet effective method is to assign zero reward to all unlabeled transitions (UDS), optionally reweighted (CDS+UDS) to compensate for reward bias and distribution shift. This yields competitive results and is often more robust to reward model misspecification than explicit reward prediction approaches (Yu et al., 2022).
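The non-negative PU risk used by PURL can be sketched directly from its definition. This is a minimal illustration with a logistic surrogate loss; the function names and the synthetic reward scores are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def logistic_loss(score, y):
    # ell(r, y) = log(1 + exp(-y * r)); log1p for numerical stability
    return np.log1p(np.exp(-y * score))

def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk for a surrogate reward r_theta.

    scores_pos : r_theta evaluated on expert (positive) states, D+
    scores_unl : r_theta evaluated on unlabeled mixed-quality states, D_u
    prior      : class prior pi (assumed fraction of positives in D_u)
    """
    pos_risk = prior * logistic_loss(scores_pos, +1).mean()
    # Negative-class risk on D_u, debiased by the positives it contains,
    # clipped at zero -- the "non-negative" correction.
    neg_risk = max(0.0, logistic_loss(scores_unl, -1).mean()
                        - prior * logistic_loss(scores_pos, -1).mean())
    return pos_risk + neg_risk

rng = np.random.default_rng(0)
risk = nn_pu_risk(rng.normal(2.0, 1.0, 100),   # expert states score high
                  rng.normal(0.0, 1.0, 200),   # unlabeled mixed states
                  prior=0.3)
```

The clipping in `neg_risk` is what prevents the debiased empirical risk from going negative and overfitting the unlabeled set, which is the key difference from naive PN training.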

4. Self-Supervised and Generative Unlabeled Reward Modeling

Large models in vision-language and language-only domains increasingly employ self-supervised or generative modeling strategies for reward generation:

  • Generative Reward Models (GRAM, GRAM-R²): By pre-training on unlabeled response pairs to encourage diversity-sensitive modeling, then fine-tuning with (label-smoothed) preference data, generative foundation reward models can assign scalar or rationale-grounded rewards to arbitrary input pairs—even generalizing out-of-distribution (Wang et al., 17 Jun 2025, Wang et al., 2 Sep 2025). Self-training and pseudo-labeling on unlabeled corpora further enhance generalization and downstream reward reasoning, outperforming standard discriminative baselines.
  • Reward Generation via Large Vision-Language Models (RG-VLM): LVLMs (e.g., Gemini, GPT-4V) are prompted to produce dense, per-step rewards from unlabeled offline trajectories and task descriptions by scoring state transitions in a windowed context—integrated as auxiliary or shaping rewards in standard RL pipelines (Lee et al., 3 Apr 2025).
  • Intrinsic Reward via Self-Uncertainty (IRIS): In generative models such as autoregressive text-to-image (T2I), intrinsic reward is mined from model-internal uncertainty metrics, specifically the forward KL (self-uncertainty) between the output distribution and uniform. IRIS directly uses this signal for RL-style updates, obviating external preference data or verifiers, and yielding qualitatively richer and quantitatively superior outputs in image generation benchmarks (Chen et al., 29 Sep 2025).
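For a discrete output distribution, the forward KL to the uniform distribution reduces to $\log V$ minus the entropy, so the IRIS-style self-uncertainty signal can be computed from logits alone. A minimal sketch, assuming this KL direction and no extra scaling (the paper's exact normalization may differ):

```python
import numpy as np

def self_uncertainty_reward(logits):
    """Per-step intrinsic reward from model-internal uncertainty.

    Computes KL(p || u) between the token distribution p = softmax(logits)
    and the uniform distribution u over the vocabulary, which equals
    log(V) - H(p): confident (low-entropy) predictions score higher.
    """
    logits = logits - logits.max()                 # stabilize the softmax
    p = np.exp(logits) / np.exp(logits).sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return np.log(len(p)) - entropy                # KL(p || uniform)

# A peaked distribution is "certain" and scores higher than a flat one,
# which scores approximately zero.
peaked = self_uncertainty_reward(np.array([8.0, 0.0, 0.0, 0.0]))
flat = self_uncertainty_reward(np.zeros(4))
```

Because the signal is a function of the model's own output distribution, it requires no external verifier or preference data, which is precisely what makes it usable as an unlabeled reward.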

5. Reward Generation with Partial, Imperfect, or Incomplete Supervision

In realistic settings, reward sources may be incomplete, noisy, or derived from partial demonstrations:

  • Mixture of Autoencoder Experts Guidance (MoE-GUIDE): A Mixture-of-Autoencoders is trained on possibly incomplete or sparse demonstrations, with a mapping $g$ transforming reconstruction loss $L(s)$ into a shaped intrinsic reward $r_{\text{int}}(s) = g(L(s))$. This methodology supports reward shaping and targeted exploration, even when demonstrations are weak or highly partial; integration into standard RL algorithms (e.g., SAC, PPO) is straightforward (Malomgré et al., 21 Jul 2025).
  • Optimistic Relabeling with Unlabeled Priors: In online RL with access to large unlabeled prior datasets (e.g., from prior tasks or random policies), transitions are relabeled with upper-confidence-bound reward estimates—combining learned reward predictions and model uncertainty (e.g., RND-based novelty bonuses)—to accelerate exploration and address sparse reward tasks (Li et al., 2023).
  • Reward-Directed Conditional Diffusion: For conditional generative models, unlabeled data is pseudo-labeled using a learned reward function and then used for reward-conditioned training of diffusion models. Subspace recovery and reward improvement are provable under latent-linear and support-coverage assumptions (Yuan et al., 2023).
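The MoE-GUIDE-style mapping from reconstruction loss to intrinsic reward can be sketched as follows. The exponential choice of $g$ and the constant-output toy "experts" are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def intrinsic_reward(state, experts, scale=1.0):
    """MoE-GUIDE-style shaped reward from reconstruction error.

    Each expert is a callable autoencoder state -> reconstruction.
    The reward uses the best (lowest) reconstruction loss across the
    mixture, then maps it through a decreasing g: demonstration-like
    states reconstruct well and receive reward near 1.
    """
    losses = [np.mean((e(state) - state) ** 2) for e in experts]
    L = min(losses)                       # best expert explains the state
    return float(np.exp(-L / scale))      # r_int(s) = g(L(s)), in (0, 1]

# Toy "autoencoders": constant reconstructions standing in for experts
# trained on different regions of the demonstration data.
expert_a = lambda s: np.zeros_like(s)    # trained near the origin
expert_b = lambda s: np.ones_like(s)     # trained near all-ones
r_near = intrinsic_reward(np.zeros(3), [expert_a, expert_b])   # demo-like
r_far = intrinsic_reward(np.full(3, 5.0), [expert_a, expert_b])
```

Taking the minimum loss over the mixture is what lets partial, multi-modal demonstrations each contribute: a state only needs to be explained by one expert to be rewarded.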

6. Methodological Tradeoffs, Limitations, and Future Directions

Challenges

  • Dependence on Component Discovery: Methods like RULE require predefined reward components; fully automated reward structure inference remains open (Bailey, 2024).
  • Bias and Distribution Shift: Zero-reward or optimistic relabeling methods rely on the distributional concordance between labeled and unlabeled data; mismatches lead to reward bias or suboptimal policy improvement (Yu et al., 2022, Yuan et al., 2023).
  • Quality of Pseudo-Labels: Self-supervised and generative models depend critically on the quality of pseudo-labels or self-generated rationales; error propagation and over-extrapolation can impair performance (Wang et al., 17 Jun 2025, Wang et al., 2 Sep 2025, Yuan et al., 2023).
  • Computational Cost: Querying large LVLMs for dense reward inference or running large-scale generative pretraining can be computationally intensive (Lee et al., 3 Apr 2025, Wang et al., 2 Sep 2025).

Extensions

  • Dynamic Discovery of Reward Dimensions: Extensions to allow dynamic reward component expansion based on surprise or novelty signals (Bailey, 2024).
  • Task-Adaptive and Domain-Transfer Pretraining: More targeted selection of unlabeled data as pretraining substrate to maximize downstream reward model transfer and adaptation (Wang et al., 17 Jun 2025, Wang et al., 2 Sep 2025).
  • Uncertainty and Consistency Integration: Efforts to incorporate uncertainty estimation, consistency checks, or confidence filtering in reward inference pipelines (Lee et al., 3 Apr 2025).
  • Intrinsic/Extrinsic Reward Hybridization: Combining intrinsic rewards (e.g., self-uncertainty, reconstruction distance) with sparse external signals to foster both exploration and goal-directed behavior (Lee et al., 3 Apr 2025, Malomgré et al., 21 Jul 2025).

7. Empirical Outcomes and Impact

The adoption of unlabeled reward generation methods has yielded substantial empirical successes across various domains:

| Domain / Benchmark | Approach | Notable Result / Impact |
|---|---|---|
| Open-ended RL (ecosystem) | RULE | Adaptive reward evolution, robust survival, behavior correction under changing rewards (Bailey, 2024) |
| Offline RL (robotics, control) | ORIL, PURL, UDS, optimistic relabeling | Near-expert performance without labeled rewards, sample efficiency, robustness to domain shift (Zolna et al., 2020; Xu et al., 2019; Yu et al., 2022; Li et al., 2023) |
| Language/vision modeling | GRAM, GRAM-R², RG-VLM, IRIS | Superior ranking and RLHF performance over discriminative baselines; interpretable rationales; state-of-the-art T2I quality via intrinsic reward (Wang et al., 17 Jun 2025; Wang et al., 2 Sep 2025; Lee et al., 3 Apr 2025; Chen et al., 29 Sep 2025) |
| Exploration in RL | MoE-GUIDE, optimistic relabeling | Robust exploration in sparse/dense reward settings; handling of incomplete demonstrations (Malomgré et al., 21 Jul 2025; Li et al., 2023) |
| Conditional generative modeling | Reward-Directed Conditional Diffusion | Improvement of output population mean reward via pseudo-labeling and provable subspace identification (Yuan et al., 2023) |

Method selection must be governed by the specifics of domain, reward sparsity, distributional assumptions, and computational cost. In sum, unlabeled reward generation methodologies enable scalable, adaptive, and generalist reinforcement learning and generative modeling in settings lacking reliable external reward supervision.
