Two-Stage Cooperative Training Strategy

Updated 10 February 2026
  • Two-stage cooperative training is a paradigm that trains modules individually before combining them for joint optimization.
  • It enhances sample efficiency and stability by separating isolated learning from cooperative refinement across various domains.
  • The strategy employs techniques like KL divergence and critic blending to align local objectives with global performance.

A two-stage cooperative training strategy is a general methodological paradigm in machine learning and reinforcement learning that decomposes the training process into two distinct, sequential phases. The first stage typically focuses on learning individual modules, skills, components, or subpolicies with limited or isolated objectives; the second stage transitions to joint, cooperative, or fine-tuned training where these modules are optimized for collective, downstream, or cross-module criteria. By explicitly structuring training in this way, two-stage cooperative strategies address challenges related to temporal credit assignment, modularization, large-scale optimization, transfer, and sample efficiency across diverse problem domains including hierarchical reinforcement learning, speech recognition, federated learning, multi-task neural modeling, and beyond.

1. Core Frameworks and Mathematical Principles

The prototypical structure of a two-stage cooperative training scheme involves: (i) an initial phase where agents, submodels, or functional blocks optimize for local, task-specific, or self-contained objectives, followed by (ii) a phase where these are updated in a coordinated manner to account for successor tasks, shared global objectives, or joint performance. This pattern recurs with context-specific implementations across recent research.

Multi-Stage RL and Cooperative Critics

For hierarchical or multi-stage reinforcement learning (RL), the formalization often proceeds as follows (Erskine et al., 2022):

  • Hierarchical Decomposition: The task is decomposed into $N$ subtasks, each assigned to an agent or policy $\pi_i$ with associated critic $Q_i$. The global state and action spaces $(S, A)$ are partitioned according to a subtask index function $U(s_t)$.
  • Cooperative Consecutive Policies (CCP): Instead of optimizing each actor for its single-stage expected return, the actor objective is adjusted to also incorporate the downstream (next-stage) critic:

$$J_i^{CCP}(\theta_i) = \mathbb{E}_{s \sim D,\, a_i \sim \pi_i} \left[ Q_i(s, a_i) + \alpha\, Q_{i+1}(s', \pi_{i+1}(s')) \right]$$

where $\alpha \geq 0$ is the cooperative ratio and $s' \sim P(\cdot \mid s, a_i)$.

  • Convex Critic Combination: In Soft Actor-Critic (SAC) style, the policy loss blends both critics after normalization:

$$L^{actor}_i(\theta_i) = \mathbb{E}_{s \sim D,\, a_i \sim \pi_i} \left[ \alpha \log \pi_i(a_i \mid s) - \left( \eta \hat{Q}_i(s, a_i) + (1 - \eta)\hat{Q}_{i+1}(s, a_i) \right) \right]$$

where $\eta \in [0,1]$.

  • Stagewise Training Algorithm: Policies and critics are updated iteratively using separate replay buffers for each stage, sampling transitions and normalizing critic outputs to preserve alignment between surrogate and total return objectives.
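The blended-critic actor objective above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the toy critic values, the batch, and the default temperature are assumptions, and the critics are plain arrays standing in for learned Q-networks.

```python
import numpy as np

def blended_actor_objective(q_i, q_next, log_pi, alpha=0.2, eta=0.5):
    """SAC-style actor loss for stage i, blending own and next-stage critics.

    q_i, q_next -- normalized critic estimates Q-hat_i(s,a), Q-hat_{i+1}(s,a)
    log_pi      -- log pi_i(a|s) per sample
    alpha       -- entropy temperature; eta in [0,1] is the blending weight
    """
    blended_q = eta * q_i + (1.0 - eta) * q_next
    # Minimizing this loss maximizes the blended critic minus entropy penalty.
    return np.mean(alpha * log_pi - blended_q)

# Toy batch of two transitions; with eta=1 the next-stage critic is ignored.
q_i = np.array([1.0, 2.0])
q_next = np.array([0.5, 0.5])
log_pi = np.array([-1.0, -1.0])
loss_solo = blended_actor_objective(q_i, q_next, log_pi, eta=1.0)  # -> -1.7
loss_coop = blended_actor_objective(q_i, q_next, log_pi, eta=0.5)  # -> -1.2
```

Setting $\eta = 1$ recovers the purely local (noncooperative) SAC actor loss, which is why the cooperative variant is a strict generalization of the per-stage baseline.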

This approach demonstrably outperforms both naïve per-stage and end-to-end agents in multi-stage RL domains, as measured by sample efficiency and overall task success (Erskine et al., 2022).

2. Cooperative Mechanisms Across Domains

The two-stage cooperative training principle underpins diverse cooperative mechanisms in varied architectures. Major instantiations include:

Bidirectional Model Consistency in Multi-Capacity Learning

In simultaneous training of large and small transformer models, an initial joint stage penalizes divergence between their output distributions via a symmetrized KL objective, promoting mutual regularization:

$$L^{KL} = D_{KL}(P_H \,\|\, P_L) + D_{KL}(P_L \,\|\, P_H)$$

This is incorporated into each submodel's loss alongside task cross-entropy, e.g.:

$$L_H^{stage1} = L(\theta_H) + \alpha_H \cdot L^{KL}$$

Once sufficient agreement between the soft targets is reached, the models are decoupled and independently fine-tuned (Jiang et al., 2023). Shared-layer and independent-layer variants have been explored, with the shared variant accumulating gradients into common parameters.
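The symmetrized KL coupling term can be computed directly from the two models' output distributions. The sketch below assumes dense probability vectors (post-softmax) and is illustrative only; in practice this is computed over the models' logits per token or per class.

```python
import numpy as np

def symmetric_kl(p_h, p_l, eps=1e-12):
    """Symmetrized KL between high- and low-capacity output distributions:
    D_KL(P_H || P_L) + D_KL(P_L || P_H), averaged over the batch."""
    p_h = np.clip(p_h, eps, 1.0)  # guard against log(0)
    p_l = np.clip(p_l, eps, 1.0)
    kl_hl = np.sum(p_h * (np.log(p_h) - np.log(p_l)), axis=-1)
    kl_lh = np.sum(p_l * (np.log(p_l) - np.log(p_h)), axis=-1)
    return np.mean(kl_hl + kl_lh)

p_a = np.array([[0.7, 0.2, 0.1]])
p_b = np.array([[0.1, 0.2, 0.7]])
zero_penalty = symmetric_kl(p_a, p_a)   # identical distributions -> ~0
coupling = symmetric_kl(p_a, p_b)       # divergent distributions -> > 0
```

By construction the term is symmetric in its arguments, so both submodels receive the same coupling penalty, which is what makes the regularization mutual rather than one-directional distillation.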

Hierarchical Feature and Fusion in Speech Recognition

In multi-stream ASR, universal feature extraction is first trained on single-stream data, after which a lightweight attention-based fusion module (hierarchical attention network, HAN) is trained for stream-wise integration with all upstream UFE parameters frozen. The cooperative benefit emerges via using robust stage-1 features to stabilize and focus the training of stage-2 fusion without overfitting to scarce parallel data (Li et al., 2019).
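The freeze-then-fuse schedule can be sketched with toy stand-ins. The classes and batches below are hypothetical placeholders (the real UFE and HAN are neural networks, and `update()` stands in for a gradient step); only the control flow mirrors the two-stage recipe.

```python
# Toy stand-ins for the UFE encoder and HAN fusion module (illustrative only).
class ToyModule:
    def __init__(self):
        self.steps = 0
        self.frozen = False

    def update(self, batch):
        if not self.frozen:        # frozen parameters receive no updates
            self.steps += 1

    def freeze(self):
        self.frozen = True

    def forward(self, x):
        return x                   # identity "features", just for the sketch

def train_two_stage(ufe, han, single_stream_batches, parallel_batches):
    for batch in single_stream_batches:   # Stage 1: UFE alone, abundant data
        ufe.update(batch)
    ufe.freeze()                          # all stage-1 parameters fixed
    for batch in parallel_batches:        # Stage 2: only the fusion HAN learns
        feats = [ufe.forward(stream) for stream in batch]
        han.update(feats)

ufe, han = ToyModule(), ToyModule()
train_two_stage(ufe, han,
                single_stream_batches=[1, 2, 3],
                parallel_batches=[[4], [5]])
```

Freezing the stage-1 encoder keeps the small fusion module from dragging the robust features toward the scarce parallel data, which is the overfitting risk the two-stage split is designed to avoid.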

Bilevel Meta-Learning for Supervised + RL Synergy

Cooperative SFT+RL for LLMs employs a bilevel scheme: base model parameters are updated via an interpolation of RL and SFT gradients, while an upper-level LoRA adapter meta-learns signal shaping to maximize the gain of the joint SFT-RL process over RL alone (Chen et al., 8 Sep 2025). This architecture maintains alignment between supervised and reward-driven optimization, counteracting the catastrophic forgetting typical in naively staged schedules.
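The lower-level update in such a bilevel scheme amounts to an interpolation of the two gradient signals. The sketch below is a simplification under stated assumptions: `lam` is shown as a fixed scalar, whereas in the actual method the shaping would be produced by the meta-learned upper-level adapter.

```python
import numpy as np

def interpolated_update(theta, grad_rl, grad_sft, lam, lr=0.1):
    """Lower-level step: blend RL and SFT gradients with weight lam in [0, 1].
    lam is a plain scalar here for illustration; a bilevel scheme would
    meta-learn the shaping at the upper level."""
    return theta - lr * ((1.0 - lam) * grad_rl + lam * grad_sft)

theta = np.zeros(2)
g_rl = np.array([1.0, 0.0])
g_sft = np.array([0.0, 1.0])
step_rl_only = interpolated_update(theta, g_rl, g_sft, lam=0.0)
step_mixed = interpolated_update(theta, g_rl, g_sft, lam=0.5)
```

At `lam=0` the update reduces to pure RL; keeping `lam > 0` continuously injects the supervised signal, which is the mechanism that counteracts the catastrophic forgetting seen in naively staged SFT-then-RL schedules.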

3. Algorithmic Patterns and Meta-Theoretical Properties

Across applications, two-stage cooperative training exhibits the following algorithmic traits:

  • Initialization/Warm-start: Stage 1 generates skillful or informative policies/modules in isolation (skills, features, knowledge).
  • Cooperative Augmentation: Stage 2 exposes those modules to joint objectives (e.g., downstream critic, cross-model KL, fusion attention), optimizing not only for individual competence but also for group synergy or task composition.
  • Theoretical Alignment: When critic combination weights are empirically matched to the relative value ranges, cooperative surrogate objectives are guaranteed to align their stationary points with the global reward sum under deterministic transitions (Erskine et al., 2022).

This separation confers stability, modularity, and sample efficiency, often improving convergence rate and final performance over single-stage or fully joint training baselines.

4. Empirical Results and Benchmark Evaluations

Two-stage cooperative strategies have demonstrated state-of-the-art or superior performance in multiple benchmarks:

  • Multi-room Maze (RL): Cooperative CCP (CSAC) achieves >95% success and roughly 4× faster convergence than single-agent SAC; noncooperative policies frequently steer into dead ends because each stage greedily optimizes its local return (Erskine et al., 2022).
  • Peg-In-Hole Manipulation: CSAC yields ∼40% insertion success rate versus ∼11–13% for standard and naïve agents.
  • Multilingual MT (Transformers): Simultaneous joint-stage training with KL coupling provides up to +1.66 BLEU for high-capacity models and +0.42 BLEU for device-scale models compared to respective single/baseline training (Jiang et al., 2023).
  • Speech Recognition: Two-stage feature+fusion design reduces WER by up to 32.4% in multi-array DIRHA and 8.2% in AMI compared to joint-training and naive fusion (Li et al., 2019).
  • Robustness and Generalization: Across all domains, cooperative scheduling yields greater stability and lower task-specific parameter requirements than end-to-end or monolithic approaches.

5. Limitations, Open Challenges, and Practical Considerations

Despite the widespread gains, several limitations and sensitivities are noted:

  • Manual Decomposition or Task Assignment: Most schemes require hand-designed subtask boundaries, cooperative ratios, or curriculum stages; automated discovery or dynamic switching remains an open direction (Erskine et al., 2022).
  • Hyperparameter and Ratio Tuning: Cooperative weights (e.g., $\eta$ in critic blending or $\alpha$ in loss coupling) are task- and stage-dependent, often requiring empirical sweeps.
  • Long Chains and Deeper Cooperation: Current two-stage designs typically propagate rewards to immediate successors only; in deep hierarchies, considering multiple future critics or cross-module dependencies could increase performance but at the risk of unstable gradients.
  • Normalization and Scale Alignment: Critic normalization is critical for theoretical alignment and empirical stability; large-batch updates and careful architecture are necessary to preserve signal consistency between stages.
  • Generalizability: Some implementations are currently tailored to strictly sequential subtasks; broader applicability to branched, cyclic, or dynamic subproblem structures is an open avenue.
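The scale-alignment concern above can be made concrete with a small sketch. Min-max normalization is shown as one plausible choice (the exact normalization scheme is method-specific, and the critic values below are invented for illustration):

```python
import numpy as np

def normalize_critic(q_values, eps=1e-8):
    """Min-max normalize a batch of critic outputs to [0, 1] so that stage-i
    and stage-(i+1) value scales are comparable before convex blending.
    (One plausible scheme; the actual normalization is method-specific.)"""
    lo, hi = q_values.min(), q_values.max()
    return (q_values - lo) / (hi - lo + eps)

q_i = np.array([10.0, 30.0, 20.0])     # large-scale stage-i values
q_next = np.array([0.1, 0.3, 0.2])     # small-scale next-stage values
q_i_hat = normalize_critic(q_i)
q_next_hat = normalize_critic(q_next)  # same relative ordering, same scale
```

Without such rescaling, the convex combination $\eta \hat{Q}_i + (1-\eta)\hat{Q}_{i+1}$ would be dominated by whichever critic happens to have the larger value range, silently breaking the intended cooperative weighting.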

6. Taxonomy of Representative Approaches

The following table situates key representative methods:

| Domain / Problem | Stage 1 Objective | Stage 2 Cooperative Objective | Reference |
|---|---|---|---|
| Multi-stage RL (HRL) | Local skill Q optimization | Actor maximizes own + next-stage Q | (Erskine et al., 2022) |
| Multi-capacity networks | Independent cross-entropy | Add symmetric KL between outputs | (Jiang et al., 2023) |
| Multi-stream ASR | Train universal encoder | Train only fusion HAN on frozen encoder | (Li et al., 2019) |
| MARL (CM3, robot soccer) | Single-agent goal policies | Joint training with team-level mixing/credit | (Yang et al., 2018; Kim et al., 2021) |
| LLM RL + SFT (bilevel) | SFT on supervised traces | Bilevel: SFT meta-learns to guide RL gain | (Chen et al., 8 Sep 2025) |
| Federated distillation | Offline covariate sharing | Online distillation over model outputs | (Ahn et al., 2020) |

7. Significance and Outlook

Two-stage cooperative training embodies a unifying algorithmic and conceptual principle applicable across modern machine learning. By explicitly segmenting the learning trajectory into self-optimization and synergistic refinement, these strategies deliver enhanced convergence, stability, and generalization in domains characterized by modularity, hierarchy, or distributed learning agents. Continuing research targets automation of decomposition, meta-learned cooperative scheduling, and theoretical generalizations to arbitrarily complex multi-agent and modular learning systems.
