
Mamba-DDQN Framework Overview

Updated 22 February 2026
  • Mamba-DDQN is a reinforcement learning framework that combines Mamba SSMs with the dueling DDQN approach to better capture long-range sequential dependencies.
  • It replaces traditional MLPs with Mamba linear-time SSMs, enabling efficient history-dependent modeling in tasks like ABR control and liquidity provisioning.
  • Empirical analyses show that the framework outperforms standard DDQN, offering improved stability, risk mitigation, and performance across diverse applications.

The Mamba-DDQN framework is a reinforcement learning (RL) architecture that integrates the Mamba linear-time state-space model (SSM) with the Dueling Double Deep Q-Network (DDQN) paradigm. Originating from efforts to overcome limitations in sequential decision-making for both adaptive bitrate (ABR) control and decentralized finance (DeFi), Mamba-DDQN aims to leverage the long-range temporal modeling of SSMs while inheriting the stable learning dynamics of DDQN. Its core feature is the substitution of traditional multilayer perceptrons (MLPs) in RL agents with Mamba SSMs, conferring enhanced expressivity for history-dependent tasks such as real-time communications and liquidity provision (Li et al., 2023, Zhang, 27 Nov 2025).

1. Model Architecture and State Representation

The hallmark of Mamba-DDQN is the incorporation of Mamba sequence modeling for RL agent state encoding. In canonical DDQN, the state input $s_t$ is a flat feature vector (e.g., 32-dimensional for Uniswap-V3 trading, including market statistics and position variables). In contrast, Mamba-DDQN stacks a fixed-length sequence of recent states,

$$S_t = [s_{t-D+1}, \ldots, s_t] \in \mathbb{R}^{d \times D},$$

where $d$ is the feature dimension and $D$ the history window (e.g., 32). This temporal block is processed by a Mamba SSM as follows:

$$h_{u+1} = \mathrm{GELU}(h_u A + x_u B), \qquad h_0 = \mathbf{0},$$

with SSM parameters $A \in \mathbb{R}^{H \times H}$ and $B \in \mathbb{R}^{d \times H}$. The final hidden state $h_D$ summarizes the sequence and is fed into a dueling Q-network head, producing value $V(h_D)$ and advantage $A(h_D, a)$ streams:

$$Q(s_t, a) = V(h_D) + \left(A(h_D, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(h_D, a')\right).$$

A target network with parameters $\theta^-$ is synchronized by soft updates. All SSM and head parameters are trained jointly end-to-end (Zhang, 27 Nov 2025).
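A minimal NumPy sketch of this forward pass may help fix the shapes. Note this is a simplification: a full Mamba block uses input-dependent (selective) SSM parameters and convolutions, which are omitted here, and the weight scales are arbitrary illustrative values.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mamba_dueling_q(S, A_ssm, B_ssm, W_v, W_a):
    """Forward pass: SSM recurrence over the history window, then a linear
    dueling head on the final hidden state h_D.

    S      : (D, d) stacked state sequence [s_{t-D+1}, ..., s_t]
    A_ssm  : (H, H) state-transition matrix
    B_ssm  : (d, H) input projection
    W_v    : (H, 1) value-stream weights
    W_a    : (H, |A|) advantage-stream weights
    """
    h = np.zeros(A_ssm.shape[0])         # h_0 = 0
    for x_u in S:                        # h_{u+1} = GELU(h_u A + x_u B)
        h = gelu(h @ A_ssm + x_u @ B_ssm)
    V = h @ W_v                          # value stream V(h_D), shape (1,)
    Adv = h @ W_a                        # advantage stream A(h_D, a), shape (|A|,)
    return V + (Adv - Adv.mean())        # dueling aggregation

rng = np.random.default_rng(0)
D, d, H, n_actions = 32, 20, 64, 5       # history window, features, hidden dim, actions
S = rng.normal(size=(D, d))
q = mamba_dueling_q(S,
                    rng.normal(scale=0.05, size=(H, H)),
                    rng.normal(scale=0.1, size=(d, H)),
                    rng.normal(scale=0.1, size=(H, 1)),
                    rng.normal(scale=0.1, size=(H, n_actions)))
# q holds one Q-value per action
```

The dueling aggregation subtracts the mean advantage, so the value and advantage streams are identifiable and $Q$ has one entry per action.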

2. Formal Reinforcement Learning Framework

Mamba-DDQN operates within the standard RL paradigm (state, action, reward, transition), with specific adaptations:

  • State Space $(\mathcal{S})$: Sequences of states as above, allowing SSMs to model temporal dependencies.
  • Action Space $(\mathcal{A})$: Task-specific; e.g., in Uniswap, actions correspond to liquidity position width adjustments, with $0$ denoting inaction.
  • Transition Dynamics: Determined by system environment (e.g., Uniswap-V3 portfolio updates or ABR encoding transitions).
  • Reward Reshaping: In finance, rewards aggregate fee income, gas costs, Loss-Versus-Rebalancing (LVR), and penalize frequent allocation changes:

$$r_t = \frac{\mathrm{Fee}_t - \lambda\,\mathrm{LVR}_t - \mathrm{Gas}_t - \lambda\,1_{\{a_t \neq a_{t-1}\}}}{l_0},$$

where $\lambda$ tunes risk aversion (Zhang, 27 Nov 2025). In ABR, reward functions weigh perceptual quality, frame rate, and delay, tuned via coefficients $\lambda, \nu, \mu$ (Li et al., 2023).

  • Target Update/Bellman Backup:

$$Q_\text{target} = r_t + \gamma\, Q\big(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'; \theta);\, \theta^-\big)$$
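The reshaped reward and the double-DQN backup can be sketched in a few lines of NumPy. This is illustrative only: the values of $\lambda$ and $l_0$ below are placeholders (not the papers' tuned settings), and the terminal-state mask is a standard assumption not spelled out in the text.

```python
import numpy as np

def lp_reward(fee, lvr, gas, action_changed, lam=0.5, l0=250.0):
    """Reshaped LP reward: fee income minus risk-weighted LVR, gas cost,
    and a switching penalty, normalized by initial liquidity l0."""
    penalty = lam if action_changed else 0.0
    return (fee - lam * lvr - gas - penalty) / l0

def double_dqn_targets(r, done, q_next_online, q_next_target, gamma=0.9):
    """Double-DQN backup: the online network (theta) selects the argmax
    action; the target network (theta^-) evaluates it.
    Shapes: r, done -> (B,); q_next_online, q_next_target -> (B, |A|)."""
    a_star = np.argmax(q_next_online, axis=1)           # selection with theta
    q_eval = q_next_target[np.arange(len(r)), a_star]   # evaluation with theta^-
    return r + gamma * (1.0 - done) * q_eval

r_t = lp_reward(fee=12.0, lvr=4.0, gas=1.5, action_changed=True)
targets = double_dqn_targets(
    r=np.array([1.0, 0.5]),
    done=np.array([0.0, 1.0]),                # second transition is terminal
    q_next_online=np.array([[0.2, 0.8], [0.5, 0.1]]),
    q_next_target=np.array([[0.3, 0.6], [0.4, 0.2]]),
)
```

Decoupling action selection (online network) from evaluation (target network) is what distinguishes the double-DQN target from the vanilla max-backup and mitigates overestimation bias.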

3. Training Algorithm and Curriculum

Mamba-DDQN uses the standard off-policy DDQN loop with adaptations for sequence modeling and potentially multiple agents (in ABR). Key algorithmic steps include:

  1. Initialization: Policy network ($\theta$), target network ($\theta^-$), replay buffer ($D$), Mamba history buffer, and ε-greedy schedule.
  2. Interaction:
    • Observe and stack the latest feature vectors to form $S_t$.
    • Select action $a_t$ with an ε-greedy policy.
    • Execute $a_t$; obtain $r_t$, $S_{t+1}$, done.
    • Store the transition in the buffer.
    • Sample mini-batches; compute targets and the TD loss.
    • Update $\theta$ via Adam or similar.
    • Soft-update $\theta^-$.
    • Anneal ε.
  3. Multi-Agent Extensions (ABR): Each dimension (QP, resolution, frame rate) may be handled by its own agent, with rewards shared via centralized training and decentralized execution (CTDE) (Li et al., 2023).
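The bookkeeping pieces of this loop (ε-greedy selection, linear ε annealing, and the soft target update) are simple to sketch. In this sketch parameters are plain dicts for illustration; the decay horizon is an assumed value, only the endpoints (1.0 → 0.05) and $\tau = 0.01$ come from the source.

```python
import numpy as np

def soft_update(theta_target, theta_online, tau=0.01):
    """Polyak averaging: theta^- <- tau * theta + (1 - tau) * theta^-."""
    return {k: tau * theta_online[k] + (1 - tau) * theta_target[k]
            for k in theta_target}

def epsilon_greedy(q_values, eps, rng):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def anneal_eps(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear annealing of the exploration rate from eps_start to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

In practice these helpers slot into the off-policy loop above between the mini-batch gradient step and the next interaction.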

Curriculum learning can be applied, especially for multi-agent formulations: first pretraining agents on isolated subproblems, then joint MARL fine-tuning (Li et al., 2023).

4. Data Pipeline and Experimental Setup

  • Data Ingestion: For finance, data is sourced from Uniswap V3 pools via public APIs, extracting raw ticks, liquidity, and price features. Redundant or highly correlated features are pruned; top features are selected using a suite of statistical/ML techniques (Lasso, ElasticNet, RF, XGBoost), reducing the feature dimension (e.g., from 28 to 20) (Zhang, 27 Nov 2025).
  • Preprocessing: Heavy-tailed variables (e.g., prices, volumes) are log-transformed. Features are standardized by training set mean/std, auxiliary state variables are min-max scaled.
  • Hyperparameters: Learning rate ($\alpha = 10^{-4}$), batch size ($B = 256$), replay buffer size ($10^6$), discount ($\gamma = 0.9$), soft update ($\tau = 0.01$), gradient clipping ($0.7$), Mamba history $D = 32$, hidden dimension $H = 64$, ε annealed from 1.0 to 0.05, risk penalty $\lambda$ tuned via validation (Zhang, 27 Nov 2025).
  • ABR (WebRTC): Training and evaluation involve synthetic network traces (cellular, Wi-Fi) and real-world video sequences, with metrics such as VMAF, video bitrate, frame rate, and end-to-end delay (Li et al., 2023).
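The preprocessing step above can be sketched as follows. The column indices for the heavy-tailed features are an assumption for illustration; the key point from the source is that statistics come from the training set only, avoiding look-ahead leakage.

```python
import numpy as np

def preprocess(train_raw, test_raw, heavy_tailed_cols):
    """Log-transform heavy-tailed columns (prices, volumes), then standardize
    every feature by the *training-set* mean/std.
    Arrays are (n_samples, n_features); heavy_tailed_cols indexes the
    price/volume-like columns."""
    train = train_raw.astype(float).copy()
    test = test_raw.astype(float).copy()
    for c in heavy_tailed_cols:
        train[:, c] = np.log1p(train[:, c])
        test[:, c] = np.log1p(test[:, c])
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + 1e-8        # guard against zero variance
    return (train - mu) / sd, (test - mu) / sd
```

Min-max scaling of the auxiliary state variables would follow the same pattern, with the min/max taken from the training split.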

5. Empirical Results and Comparative Analysis

DEX Liquidity Provision

Mamba-DDQN achieves superior out-of-sample performance relative to baseline DDQN and other rule-based strategies. Results over multiple periods and liquidity allocations ($l_0 = 250, 500, 1000$) demonstrate:

  • Consistent outperformance versus plain DDQN with "no-hedge" in all Uniswap-V3 pools and periods.
  • Improved scalability and risk mitigation at larger fund sizes, attributed to the Mamba SSM’s enhanced memory and sequence modeling.
  • Positive relative profit-and-loss (PnL) in every test period, outperforming both "Buy-and-Hold" and "Daily Rebalance" baselines, which suffer from high drawdowns or transaction costs (Zhang, 27 Nov 2025).

WebRTC ABR

Mamba demonstrates robust gains over both classic (WebRTC-GCC) and learning-based (Loki) baselines:

| Method | Video Rate (kbps) | VMAF | Delay (ms) | Frame Rate (fps) |
|--------|-------------------|------|------------|------------------|
| WebRTC | 5,490 | 53.76 | 127.1 | 55.53 |
| Loki | 10,454 | 60.29 | 159.9 | 49.53 |
| Mamba | 7,638 | 61.16 | 125.3 | 55.65 |
  • Mamba yields a 13.7% VMAF increase over vanilla WebRTC and 1.4% over Loki, while reducing delay and increasing effective frame rate at lower video bitrates (Li et al., 2023).
  • Real-world tests confirm the VMAF lift (+16.9%) and stability under diverse network conditions.

6. Theoretical Implications and Significance

The architectural innovation of Mamba-DDQN lies in injecting a strong inductive bias for long-range sequential dependencies into Q-learning through the Mamba SSM. This design stabilizes gradient flow in temporal credit assignment, mitigates vanishing gradients, and allows RL agents to exploit history when the environment exhibits non-Markovian effects or delayed reward structures. The comprehensive integration with the dueling DDQN head retains competitive advantage estimation and leverages proven off-policy stability. A plausible implication is improved adaptation in regimes characterized by complex, long-range structure or where prior approaches fail due to short memory or reactive-only strategies (Zhang, 27 Nov 2025).

In multi-dimensional control (e.g., ABR for WebRTC), replacing policy/value nets with Mamba-DDQN variants is expected to enable end-to-end optimization of correlated variables (quantization, resolution, frame rate), enhancing both sample efficiency and operational performance compared to independent, single-dimensional adaptation (Li et al., 2023).

7. Applicability and Extensions

Mamba-DDQN is applicable as a drop-in replacement for RL agents in environments where temporal structure is critical and where traditional deep RL agents (MLP or shallow RNN) fail to capture long-horizon patterns. The architectural details—state sequence assembly, SSM parameterization, and dueling Q-head—can be extended to novel tasks involving stochastic control, complex communication protocols, or decentralized financial markets. The modular design supports curriculum learning, centralized training with decentralized execution, and reward shaping suited to domain-specific objectives (Li et al., 2023, Zhang, 27 Nov 2025). Generalization to new codecs, network conditions, or asset classes is supported by the papers' reproducibility procedures and the empirical robustness results summarized above.


References:

  • "Mamba: Bringing Multi-Dimensional ABR to WebRTC" (Li et al., 2023)
  • "Adaptive Dueling Double Deep Q-networks in Uniswap V3 Replication and Extension with Mamba" (Zhang, 27 Nov 2025)