SODACER: Dual-Buffer Adaptive Clustering RL

Updated 17 January 2026
  • The paper introduces SODACER, a reinforcement learning framework that integrates dual-buffer experience replay, adaptive clustering, and CBF-based safety to optimize nonlinear control tasks.
  • It employs a novel adaptive clustering mechanism to reduce memory redundancy and balance rapid adaptability with stable policy improvements.
  • Empirical results show up to 40% faster convergence and zero safety violations, demonstrating significant gains in sample efficiency and robust control.

Self-Organizing Dual-buffer Adaptive Clustering Experience Replay (SODACER) is a reinforcement learning (RL) framework designed to enhance safety, sample efficiency, and scalability in the optimal control of nonlinear dynamical systems. It introduces a dual-buffer experience replay structure with an adaptive clustering mechanism to maintain both rapid adaptability to recent experiences and a compact, diverse archive of historical interactions. Integration with Control Barrier Functions (CBFs) enforces state and input constraints, ensuring safety throughout learning, while use of the Sophia optimizer accelerates and stabilizes policy improvement. SODACER achieves notable reductions in redundant memory usage, faster convergence rates, and improved safety performance in constrained optimal control settings, as empirically validated on a nonlinear Human Papillomavirus (HPV) transmission model (Amirabadi et al., 10 Jan 2026).

1. Dual-Buffer Experience Replay Architecture

SODACER employs a two-tiered experience replay memory:

  • Fast-Buffer: A small, fixed-size FIFO buffer (capacity M_1) storing the most recent transitions (x_t, u_t, r_t, x_{t+1}). This buffer supplies “low-bias, high-variance” samples, facilitating rapid adaptation to policy changes.
  • Slow-Buffer: A larger repository (capacity M_2) maintaining “low-variance, high-relevance” samples drawn from the entire training history. Experiences transferred from the Fast-Buffer undergo clustering to enforce diversity and prune redundant samples, optimizing memory usage and ensuring critical environmental patterns are retained.

The experience flow is governed by the routine:

On each new transition S_new = (x_t, u_t, r_t, x_{t+1}):
    FastBuffer.push(S_new)
    if FastBuffer.size > M1:
        S_old = FastBuffer.pop_oldest()
        SlowBuffer.cluster_and_insert(S_old)
This arrangement enables SODACER to simultaneously achieve fast responsiveness and long-term stability in policy learning (Amirabadi et al., 10 Jan 2026).
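As a concrete illustration, the routine above can be sketched in Python. The class and method names (FastBuffer, SlowBuffer, cluster_and_insert) follow the pseudocode; the Slow-Buffer's adaptive clustering is stubbed out with a plain list, so this is a minimal sketch of the data flow, not the paper's implementation:

```python
from collections import deque

class FastBuffer:
    """FIFO buffer of the M1 most recent transitions."""
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity

    def push(self, transition):
        """Append a transition; return the evicted oldest one, if any."""
        self.buf.append(transition)
        if len(self.buf) > self.capacity:
            return self.buf.popleft()
        return None

class SlowBuffer:
    """Long-term archive; a plain list stands in for the clustered store."""
    def __init__(self):
        self.store = []

    def cluster_and_insert(self, transition):
        # Placeholder: the paper's adaptive clustering would decide here
        # whether to absorb the sample, update a centroid, or open a cluster.
        self.store.append(transition)

# Route each new transition through the two-tier memory.
fast, slow = FastBuffer(capacity=3), SlowBuffer()
for t in range(5):
    evicted = fast.push((f"x{t}", f"u{t}", 0.0, f"x{t+1}"))
    if evicted is not None:
        slow.cluster_and_insert(evicted)

print(len(fast.buf), len(slow.store))  # -> 3 2
```

Evicted transitions are never discarded outright; they migrate to the Slow-Buffer, which is what lets the agent keep a compact long-term archive alongside the fast-moving recent window.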

2. Self-Organizing Adaptive Clustering for Redundancy Reduction

The Slow-Buffer implements a self-organizing clustering mechanism utilizing Gaussian membership:

\mu_j(S) = \exp\left(-\frac{\|S - c_j\|^2}{2\,\sigma_j^2}\right)

where c_j is the j-th cluster centroid and σ_j its standard deviation.

Key clustering operations:

  • New Cluster Creation: If \max_j \mu_j(S) \leq \Gamma_{\rm th}, allocate a new cluster initialized at S.
  • Centroid and Count Update: The closest cluster j^* updates its centroid and count:

c_{j^*} \gets \frac{N_{j^*}\,c_{j^*} + S}{N_{j^*}+1}, \quad N_{j^*} \gets N_{j^*}+1.

  • Variance Amplification: Absorbs outliers via \sigma_{j^*} \leftarrow \sigma_{j^*}(1 + \beta).
  • Variance Reduction: Implements a “forgetting” mechanism:

\sigma_k \leftarrow \sigma_k \times \sigma_0 \left(\frac{1}{\rho}\right) \left(1 - \frac{N_k}{\sum_i N_i}\right).

  • Pruning and Merging: Clusters with \sigma_k \leq \sigma_{\rm th} are pruned; clusters within proximity \|c_i - c_j\| < \gamma \max(\sigma_i, \sigma_j) are merged.

This adaptive mechanism dynamically regulates cluster population, maximizing experience diversity and minimizing storage overhead (Amirabadi et al., 10 Jan 2026).
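A minimal sketch of the membership test and the first two operations (new-cluster creation and centroid update), assuming scalar samples for readability; variance amplification, reduction, pruning, and merging are omitted, and gamma_th and sigma0 are illustrative hyperparameter names standing in for Γ_th and the initial cluster width:

```python
import math

def membership(s, c, sigma):
    """Gaussian membership mu_j(S) of sample s in cluster (c, sigma)."""
    return math.exp(-abs(s - c) ** 2 / (2 * sigma ** 2))

def insert(sample, clusters, gamma_th=0.5, sigma0=1.0):
    """Assign sample to its best-matching cluster or spawn a new one.
    clusters: list of dicts with centroid c, width sigma, count N."""
    if clusters:
        best = max(clusters, key=lambda cl: membership(sample, cl["c"], cl["sigma"]))
        if membership(sample, best["c"], best["sigma"]) > gamma_th:
            # Incremental centroid/count update (running mean of members).
            best["c"] = (best["N"] * best["c"] + sample) / (best["N"] + 1)
            best["N"] += 1
            return clusters
    # max_j mu_j(S) <= Gamma_th: create a new cluster centered at the sample.
    clusters.append({"c": sample, "sigma": sigma0, "N": 1})
    return clusters

clusters = []
for s in [0.0, 0.1, 5.0, 5.2]:
    insert(s, clusters)
print(len(clusters))  # -> 2
```

Nearby samples (0.0, 0.1 and 5.0, 5.2) collapse into two clusters rather than four entries, which is the redundancy-reduction effect the Slow-Buffer relies on.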

3. Safety Enforcement via Control Barrier Functions

To guarantee satisfaction of safety-critical state and input constraints, SODACER integrates CBFs, which enforce state constraints h(x) \geq 0 by ensuring:

\frac{\partial h(x)}{\partial x}\left[f(x) + g(x)u\right] + \gamma_0 h(x) \geq 0

where \gamma_0 is a class-\mathcal{K} function gain.

At each action selection step, the policy action \tilde{u}_t is projected into the feasible set by solving the quadratic program:

u_t^* = \arg\min_u \|u - \tilde{u}_t\|^2 \quad \text{s.t.} \quad \frac{\partial h}{\partial x}(x_t)\left[f(x_t) + g(x_t)u\right] + \gamma_0 h(x_t) \geq 0

This projection ensures that every action executed by the agent respects all encoded safety constraints, delivering robust operation in dynamic or safety-critical environments (Amirabadi et al., 10 Jan 2026).
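For a single affine CBF constraint, the QP above admits a closed-form half-space projection. The sketch below assumes the constraint has already been evaluated into coefficients a = (∂h/∂x) g(x) and b = (∂h/∂x) f(x) + γ₀ h(x); the function name cbf_project is illustrative, and problems with multiple simultaneous constraints would require a general QP solver instead:

```python
import numpy as np

def cbf_project(u_nom, a, b):
    """Project nominal action u_nom onto the half-space a @ u + b >= 0.

    The QP  min ||u - u_nom||^2  s.t.  a @ u + b >= 0  has the closed form
    u* = u_nom + max(0, -(a @ u_nom + b)) / ||a||^2 * a.
    """
    margin = a @ u_nom + b
    if margin >= 0:
        return u_nom                      # already safe: keep nominal action
    return u_nom - margin / (a @ a) * a   # minimal correction onto the boundary

a = np.array([1.0, 0.0])
b = -0.5
u_safe = cbf_project(np.array([0.0, 1.0]), a, b)
print(u_safe)  # the unsafe nominal action [0, 1] is shifted to [0.5, 1]
```

Note the projection only perturbs the action along the constraint normal, so safe nominal actions pass through unchanged; this is what makes the filter minimally invasive with respect to the learned policy.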

4. Deep RL Optimization with Sophia

Policy and value function updates in SODACER employ the Sophia optimizer, a scalable stochastic second-order variant:

  • First Moment Estimate: m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t
  • Second Moment (Diagonal Hessian Proxy): v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2
  • Bias Correction: \hat{m}_t = m_t / (1-\beta_1^t), \quad \hat{v}_t = v_t / (1-\beta_2^t)
  • Parameter Update:

W_t = W_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon_0}

Sophia’s adaptive diagonal curvature estimation permits per-coordinate step sizes, mitigating ill-conditioning and enabling accelerated, stable convergence compared to purely first-order methods (Amirabadi et al., 10 Jan 2026).
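The four update equations above can be sketched in NumPy as follows; step and the state dictionary are illustrative names, not the paper's implementation, and default hyperparameters are assumed values:

```python
import numpy as np

def step(W, g, state, t, eta=1e-3, beta1=0.9, beta2=0.999, eps0=1e-8):
    """One parameter update following the four equations above.
    state holds the running moments m and v (the diagonal curvature proxy)."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g**2
    m_hat = state["m"] / (1 - beta1**t)   # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2**t)   # bias-corrected curvature proxy
    return W - eta * m_hat / (np.sqrt(v_hat) + eps0)

# One step on the gradient g = W of the quadratic f(W) = ||W||^2 / 2.
W = np.array([1.0, -2.0])
state = {"m": np.zeros(2), "v": np.zeros(2)}
W1 = step(W, W, state, t=1, eta=0.05)
print(W1)  # at t=1 the bias corrections cancel, so the step is ~ eta * sign(g)
```

The per-coordinate division by the square-rooted curvature proxy is what yields the adaptive step sizes described above: steeply curved coordinates take small steps, flat ones take large steps.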

5. Reinforcement Learning Pipeline

The overall SODACER-Sophia RL algorithmic process comprises the following steps:

  1. Observe current state x_t.
  2. Generate unconstrained action \tilde{u}_t from the actor network.
  3. Apply CBF projection to compute safe action u_t.
  4. Execute action, observe reward and next state.
  5. Store interaction into Fast-Buffer; when full, insert oldest into Slow-Buffer.
  6. For learning, form a mini-batch from all Fast-Buffer content and one representative per Slow-Buffer cluster (with optional weighting).
  7. Compute critic loss and gradient.
  8. Parameter update via Sophia optimizer.
  9. Actor (policy) update using CBF-adjusted samples.

This loop continues for a designated horizon T, ensuring both safety compliance and efficient learning (Amirabadi et al., 10 Jan 2026).
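Under stated assumptions (a stub actor, stub dynamics, and a simple clipping stand-in for the CBF projection, with the critic/actor updates elided), the nine-step loop can be sketched end to end:

```python
from collections import deque

class FastBuffer:
    """FIFO window of recent transitions; push returns any evicted sample."""
    def __init__(self, capacity):
        self.buf, self.capacity = deque(), capacity
    def push(self, tr):
        self.buf.append(tr)
        return self.buf.popleft() if len(self.buf) > self.capacity else None

class SlowBuffer:
    """Long-term archive; the paper's clustering is stubbed with a list."""
    def __init__(self):
        self.store = []
    def cluster_and_insert(self, tr):
        self.store.append(tr)

def train(horizon=10):
    fast, slow = FastBuffer(4), SlowBuffer()
    x = 0.0                                     # 1. observe initial state
    for t in range(horizon):
        u_nom = -0.5 * x                        # 2. stub actor output
        u = max(min(u_nom, 1.0), -1.0)          # 3. clipping stand-in for CBF
        x_next, r = x + u + 1.0, -abs(x)        # 4. stub dynamics and reward
        evicted = fast.push((x, u, r, x_next))  # 5. two-tier storage
        if evicted is not None:
            slow.cluster_and_insert(evicted)
        batch = list(fast.buf) + slow.store     # 6. mixed mini-batch
        # 7-9. critic loss, Sophia parameter step, and actor update go here
        x = x_next
    return fast, slow

fast, slow = train()
print(len(fast.buf), len(slow.store))  # -> 4 6
```

Even this skeleton shows the key structural property: the learning batch always mixes the full recent window with the archived history, realizing the bias-variance balance described in Section 1.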

6. Empirical Validation and Comparative Results

SODACER-Sophia was empirically validated on a nonlinear five-compartment HPV transmission model with three independent control inputs and explicit state/input constraints. Its performance was benchmarked against Random Experience Replay (RER) and static Clustering-Based Experience Replay (CBER):

Method  | Epochs to Converge | Samples Used | Safety Violations
RER     | 450 ± 30           | 8,000 ± 500  | 7%
CBER    | 380 ± 25           | 7,200 ± 450  | 4%
SODACER | 310 ± 20           | 6,300 ± 300  | 0%

The cost-minimization trajectory demonstrated that SODACER achieves convergence approximately 40% faster than RER and 20% faster than CBER. The variance across runs was also lowest for SODACER, indicating superior robustness.

Further, SODACER was ranked best (average rank 1.00) in a Friedman test over five control scenarios (CBER: 2.20, RER: 2.80). Redundancy reduction in experience storage was quantified at approximately 25–35%, with a convergence acceleration of 15–30%, and zero observed constraint violations (Amirabadi et al., 10 Jan 2026).

7. Significance and Applicability

SODACER offers a reproducible blueprint for off-policy actor–critic or value-based RL frameworks where sample-efficient and safe control is required. Its dual-buffer, clustering-enhanced replay mechanism, in tandem with enforced CBF safety and accelerated convergence from Sophia, makes it suitable for high-stakes domains such as robotics, healthcare, and large-scale optimization under constraints. The modular structure facilitates extension to alternative clustering schemes, safety certificate methods, or optimizers.

A plausible implication is that sophisticated experience management via adaptive clustering, as realized in SODACER, provides tangible advantages in both memory efficiency and the bias-variance trade-off in RL with formal safety requirements. Empirical findings suggest SODACER’s approach generalizes across diverse, constrained control applications (Amirabadi et al., 10 Jan 2026).
