Classifier-First Reinforcement Learning (CF-RL)
- Classifier-First Reinforcement Learning (CF-RL) is a paradigm that recasts RL policy learning as the optimization of explicit classifiers, yielding clear, interpretable policy rules.
- It integrates symbolic methods (e.g., PPL-DL, PPL-ST) with deep RL techniques, enabling reward-free and example-driven policy search through evolutionary and Monte Carlo strategies.
- CF-RL offers practical advantages including accelerated early reward gains, improved sample efficiency, and enhanced interpretability essential for explainable AI.
Classifier-First Reinforcement Learning (CF-RL) is a paradigm that recasts the policy learning problem in reinforcement learning as directly optimizing or evolving a set of explicit classifiers. Rather than representing the agent’s policy via value tables, neural networks, or other global function approximators, CF-RL constructs the policy as an ensemble of symbolic or learned classification rules that determine actions. This approach underpins both recent developments in interpretable symbolic reinforcement learning systems and several modern techniques for reward-free or example-driven RL.
1. Definition and Core Principles
In the CF-RL paradigm, the agent’s policy is explicitly parameterized as a set or list of classifiers (rules), each delineating a region of the state space via a condition and prescribing a discrete action. Learning in CF-RL operates on these classifiers as atomic units—by refining their selection boundaries, predictions, and associated statistics—rather than through indirect or distributed updates as in value-based or pure policy-gradient approaches. Decision-making at runtime involves matching the current state against these classifiers and resolving potentially conflicting predictions through mechanisms such as ordered decision lists, strength-based arbitration, or learned conflict resolution.
The global policy emerges from the composition of individual classifiers, with adaptation targeting classifier structure, local function-approximation weights, and strength/confidence metrics. This structure is foundational in systems pursuing eXplainable AI (XAI) for RL, as it enables explicit inspection and interpretation of agent policies (Bishop et al., 2023).
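The matching-and-arbitration loop described above can be sketched in a few lines. The following is a minimal illustration, not any specific published system: the `Rule` class and hyper-rectangular conditions are assumptions chosen to mirror the rule structure discussed in the next section, and ordered-list arbitration is used as the conflict-resolution mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rule:
    """One classifier: a hyper-rectangular condition plus a discrete action."""
    lower: List[float]  # per-dimension lower bounds of the condition
    upper: List[float]  # per-dimension upper bounds of the condition
    action: int

    def matches(self, state: List[float]) -> bool:
        # The condition covers the state iff every dimension lies in its interval.
        return all(lo <= s <= hi for lo, s, hi in zip(self.lower, state, self.upper))

def decision_list_policy(rules: List[Rule], state: List[float]) -> Optional[int]:
    """Ordered-list conflict resolution: the first matching rule fires;
    if no rule covers the state, a null action (None) is emitted."""
    for rule in rules:
        if rule.matches(state):
            return rule.action
    return None

# Usage: a specific rule shadows a broader fallback rule because it comes first.
rules = [Rule([0.0, 0.0], [0.5, 1.0], action=1),
         Rule([0.0, 0.0], [1.0, 1.0], action=0)]
decision_list_policy(rules, [0.25, 0.5])  # -> 1 (first rule covers the state)
```

Because the policy is just this list of rules, inspecting or editing it amounts to reading or rewriting the conditions and actions directly, which is what makes the representation attractive for XAI.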
2. Symbolic and Evolutionary Instantiations
Pittsburgh-style Learning Classifier Systems (LCSs) provide a canonical symbolic instantiation of CF-RL. In these systems, individuals in the population encode entire rule sets (policies), and genetic operators drive the refinement and selection of compact, high-performing policies. Two recent systems exemplify this structure:
PPL-DL: In PPL-DL, each policy is a fixed-length decision list composed of rules with hyper-rectangular conditions over the state space and associated discrete actions. At each timestep, the agent fires the first rule whose condition covers the current state; if none match, a null action is emitted. Genetic fitness (performance) is evaluated on average episodic return over a standard test set, and genetic operators include tournament selection, uniform crossover (per-allele, including within rules), and mutation via geometric noise or action shuffling. Rules are optimized at the population level, with no explicit local rule strength (Bishop et al., 2023).
PPL-ST: PPL-ST extends the classifier structure by endowing each rule with local linear prediction weights and a variance estimate, supporting overlapping, unordered rule sets. Inference uses a double-max scheme: for each action, collect all matching rules, take the maximum of (prediction minus the square root of the variance estimate) among them, and select the action with the highest such value. Rule parameters are updated via backward-view Monte Carlo: rewards are propagated backward through each episode, and normalized least-mean-squares updates are applied at the rule level. Genetic operations respect whole-rule boundaries, as in the SAMUEL system. This design allows richer per-rule adaptation while maintaining interpretability (Bishop et al., 2023).
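The double-max inference step can be sketched as below. This is a schematic, assuming the `STRule` structure (bounds, local linear weights with a bias term, and a variance estimate); the actual PPL-ST representation and update machinery are more involved.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class STRule:
    lower: List[float]
    upper: List[float]
    action: int
    weights: List[float]  # local linear prediction weights; index 0 is a bias
    variance: float       # per-rule variance estimate of the prediction

    def matches(self, state: List[float]) -> bool:
        return all(lo <= s <= hi for lo, s, hi in zip(self.lower, state, self.upper))

    def predict(self, state: List[float]) -> float:
        return self.weights[0] + sum(w * s for w, s in zip(self.weights[1:], state))

def double_max_action(rules: List[STRule], state: List[float],
                      actions: List[int]) -> Optional[int]:
    """Inner max: best pessimistic score (prediction - sqrt(variance)) among
    matching rules per action.  Outer max: the action with the highest score."""
    best_action, best_value = None, -math.inf
    for a in actions:
        scores = [r.predict(state) - math.sqrt(r.variance)
                  for r in rules if r.action == a and r.matches(state)]
        if scores and max(scores) > best_value:
            best_action, best_value = a, max(scores)
    return best_action
```

Subtracting the variance root penalizes rules whose local predictions are noisy, so arbitration among overlapping rules favors reliable specialists.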
A summary comparison:
| System | Rule Structure | Learning Method | Conflict Resolution | Fitness |
|---|---|---|---|---|
| PPL-DL | Ordered decision list | Genetic operators only | First matching rule fires | Avg. return over test set |
| PPL-ST | Overlapping, unordered rule set | Monte Carlo + evolutionary | Double-max over rule strength | Avg. return over test set |
3. Classifier-First Algorithms in Deep RL and Reward-Free Contexts
CF-RL concepts are also realized in several recent deep RL approaches that eschew explicit reward models in favor of recursive classification:
- Example-Based Policy Search (RCE) (Eysenbach et al., 2021): This approach introduces a future-success classifier parameterized to approximate the probability that a given state-action leads to future success, as indicated by supplied outcome examples. The classifier’s odds directly yield the future success probability and are trained to satisfy a Bellman-like recursion with labels derived from example transitions. The control objective is to maximize the classifier estimate, effectively bypassing the need for a conventional extrinsic reward.
- C-Learning (Eysenbach et al., 2020): Here, goal-conditioned RL is reframed as learning the discounted future state density via a Bayes-optimal classifier. For triples $(s_t, a_t, s_{t+})$, a classifier predicts whether $s_{t+}$ was sampled from the discounted future state distribution under the policy or from a background (marginal) distribution. The classifier's odds ratio yields the future-state density up to a constant, and recursive bootstrapping allows for off-policy learning, accurate density estimation, and direct maximization of goal-reaching probabilities.
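The RCE-style bootstrapped labels described above can be sketched as follows. This is a schematic of the recursion, not the published training code: `classifier` is any callable returning the predicted future-success probability, and only the transition-sample branch is shown (success-example states simply receive label 1).

```python
def rce_bootstrap_targets(classifier, next_state_actions, gamma=0.99):
    """Schematic RCE-style regression labels for sampled transitions: the
    classifier odds w = C/(1-C) at the next state-action are bootstrapped
    into the target gamma*w / (1 + gamma*w), a Bellman-like recursion on
    the probability of future success."""
    targets = []
    for s, a in next_state_actions:
        p = classifier(s, a)   # predicted future-success probability in (0, 1)
        w = p / (1.0 - p)      # classifier odds
        targets.append(gamma * w / (1.0 + gamma * w))
    return targets
```

The policy is then improved by maximizing the classifier's output, so no extrinsic reward signal ever enters the loop.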
Both methodologies implement recursive classifier updates (analogous to TD or value backup), policy optimization via classifier-driven objectives, and empirical Bellman consistency—even when reward functions are absent or unknown.
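For the C-Learning side, the density recovery from a trained classifier is a one-liner. A minimal sketch, assuming the classifier was trained on a balanced future-vs-background discrimination task (so the class-prior constant is absorbed into the proportionality):

```python
def future_state_density(classifier_prob, marginal_density):
    """Invert a Bayes-optimal future-vs-background classifier:
    p_future(s_g | s, a) is proportional to (C / (1 - C)) * p_marginal(s_g),
    up to a class-prior constant."""
    odds = classifier_prob / (1.0 - classifier_prob)
    return odds * marginal_density
```

Because the constant is shared across goals, maximizing this quantity over actions is equivalent to maximizing the true goal-reaching probability.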
4. Computational and Algorithmic Implications
CF-RL encompasses a broad space of computational trade-offs. In symbolic Pittsburgh LCSs:
- PPL-ST incurs higher per-generation computation due to the requirement for full population evaluation and per-individual, per-generation Monte Carlo rollouts. For example, with population sizes up to 672 and large test sets, episode counts per generation can reach $2.3$ million, accumulating to hundreds of millions over runs. This cost is offset by the data-parallel and batch-style nature of the method, as well as parsimony in the final rule sets.
- PPL-DL is computationally leaner, omitting inner Monte Carlo updates, but yields lower peak performance, particularly under stochastic transition dynamics.
By contrast, Michigan-style LCSs such as XCS operate with more frequent genetic algorithm invocations in a continual learning framework and require larger classifier populations, but reduce explicit simulation costs per genetic operation (Bishop et al., 2023).
In deep RL classifier-first algorithms (RCE, C-Learning), the computational cost is aligned with standard actor-critic approaches: one pass through the classifier, one through the actor, and a modest number of additional operations for bootstrapped expectations or cross-entropy losses (Eysenbach et al., 2021, Eysenbach et al., 2020). These algorithms typically reduce hyperparameter overhead by excluding auxiliary reward models or GAN objectives.
5. Interpretability, Policy Structure, and Performance
A central advantage of CF-RL is the direct interpretability of its policy representations. In symbolic systems:
- PPL-ST yields rule sets that are parsimonious and easily inspected, with sparse coverage of the state space. Comparative experiments show that in challenging stochastic environments (e.g., FrozenLake with maximum slip), PPL-ST matches or exceeds the performance of XCS while using an order of magnitude fewer rules per best policy: average counts of $18.4$ (PPL-ST) versus $272$ (XCS) in a fixed grid-and-slip setting (Bishop et al., 2023).
- Rule density visualizations confirm that PPL-ST achieves interpretable, compact representations critical for XAI-compliant systems.
Classifier-first deep RL consistently outperforms reward-model and adversarial baselines on reward-free control and manipulation tasks, often with significantly fewer positive (success) examples or tighter sample efficiency (Eysenbach et al., 2021, Eysenbach et al., 2020). In goal-conditioned RL, classifier-based density estimation outstrips Q-learning and HER, notably on longer-horizon and higher-dimensional benchmarks (Eysenbach et al., 2020).
For ensemble neural network selection via RL (e.g., the Least-Action Classifier), the learned policy adaptively unrolls the computation graph, dynamically allocating compute across ensemble members and recovering most of the static stacked ensemble’s accuracy at reduced resource usage (Malashin, 2021).
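The adaptive-unrolling idea can be illustrated with a toy sequential ensemble. This sketch substitutes a fixed confidence threshold for the learned RL halting policy of the actual method; member models are stand-in callables returning a probability.

```python
def least_action_predict(members, x, threshold=0.8, min_members=1):
    """Unroll ensemble members one at a time; stop as soon as the running
    aggregate is confident enough, saving the remaining members' compute.
    Assumes a non-empty member list and binary-probability outputs."""
    total, used = 0.0, 0
    for member in members:
        total += member(x)          # each member emits a probability in [0, 1]
        used += 1
        avg = total / used          # running aggregate prediction
        if used >= min_members and max(avg, 1.0 - avg) >= threshold:
            break                   # confident: skip the remaining members
    return avg, used
```

Easy inputs terminate after one or two members while hard inputs consume the full ensemble, which is the source of the resource savings.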
6. Mechanistic Insights and Learning Dynamics
Recent analysis of RL post-training for LLMs using the neural tangent kernel (NTK) framework has clarified the underlying dynamics of CF-RL (Tomihari, 8 Jan 2026). Key findings include:
- Feature Space Limitation: High mutual alignment in feature representations causes standard RL updates to uniformly amplify token logits across contexts, increasing model confidence and reducing diversity, largely via the representation NTK component.
- Classifier-First Acceleration: Introducing a classifier-first stage (i.e., freezing backbone features, updating only the classifier) rapidly enriches the gradient NTK component responsible for sample-specific adaptation. This modification accelerates reward optimization without introducing shortcut mechanisms associated with supervised LP-FT (linear-probe-then-fine-tune) protocols.
- Distinct Mechanism: CF-RL does not reduce backbone feature distortion or amplify the classifier norm but reshapes classifier rows, particularly those tied to structural tokens, thus improving gradient flow and learning efficiency.
Empirical results validate that CF-RL produces larger reward gains in early RL post-training epochs compared to conventional RL, with benefits attributed to early, targeted adaptation of the classification layer (Tomihari, 8 Jan 2026).
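The two-stage protocol above (freeze the backbone, train the classifier head, then fine-tune jointly) can be sketched with a toy logistic-regression analogue. This is an illustration of the schedule only, not the RL post-training setup: the linear "backbone" `B`, classifier head `w`, and logistic loss are stand-ins for the LLM backbone, output layer, and RL objective.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classifier_first_train(X, y, B, w, lr=0.1, cls_steps=200, joint_steps=200):
    """Stage 1 freezes the backbone B and updates only the classifier head w;
    stage 2 then fine-tunes both with full gradients under a logistic loss."""
    n = len(y)
    for _ in range(cls_steps):                 # classifier-first stage
        h = X @ B                              # features from the frozen backbone
        err = sigmoid(h @ w) - y               # dLoss/dLogits for logistic loss
        w -= lr * h.T @ err / n                # head update only
    for _ in range(joint_steps):               # joint fine-tuning stage
        h = X @ B
        err = sigmoid(h @ w) - y
        w -= lr * h.T @ err / n
        B -= lr * X.T @ (err[:, None] * w[None, :]) / n   # backbone update
    return B, w
```

In the NTK account, the first stage corresponds to enriching the gradient component of the kernel via the head alone before the backbone is allowed to move.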
7. Limitations, Open Questions, and Future Directions
Current CF-RL implementations expose several limitations and research challenges:
- Hyperparameter Sensitivity: The optimal length of classifier-first phases (in two-stage protocols) is task-dependent, suggesting potential for adaptive or curriculum-based schedules (Tomihari, 8 Jan 2026).
- Computational Cost: Symbolic LCS approaches, particularly those requiring full-population rollouts (e.g., PPL-ST), entail high simulator and evaluation costs, although they benefit from parallelism (Bishop et al., 2023).
- Generality and Scalability: CF-RL generalizes to high-dimensional and image-based domains in deep RL but requires careful choice of classifier capacity and balance between representation and decision layers (Eysenbach et al., 2021).
- Mechanistic Theory: Further theoretical work is needed to characterize how early classifier shaping modulates learning trajectory and NTK evolution across architectures, and to establish principled rules for stage scheduling and classifier design (Tomihari, 8 Jan 2026).
CF-RL unifies a spectrum of approaches, from symbolic, rule-based interpretable systems to modern reward-free, example-driven policy search and adaptive deep RL optimization. Its design principles and empirical advantages—interpretable policies, sample efficiency, and mechanistic transparency—underpin ongoing advances in explainable RL and effective algorithm design for alignment-centric post-training.