Dynamic Key-Value Memory Networks

Updated 19 February 2026
  • DKVMN is a memory-augmented neural network that uses a dual memory mechanism—static keys and dynamic values—to represent and update learner mastery per concept.
  • It employs exercise embeddings and attention mechanisms to map interactions to latent concepts, enabling interpretable and automated concept discovery.
  • Empirical evaluations show that DKVMN outperforms traditional methods like BKT and DKT, achieving higher AUC on diverse educational datasets.

Dynamic Key-Value Memory Networks (DKVMN) are a class of memory-augmented neural networks designed for knowledge tracing (KT): inferring and modeling the evolving mastery state of a learner across latent concepts during sequential educational interactions. DKVMN addresses the limitations of prior approaches by introducing a dual memory mechanism: a static "key" matrix representing concepts and a dynamic "value" matrix encoding the learner's concept mastery, enabling interpretable per-concept assessment and end-to-end concept discovery (Zhang et al., 2016). DKVMN has demonstrated superior performance over Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT) on multiple real and synthetic datasets, while offering an explicit representation of the student's knowledge state at the level of individual concepts.

1. Model Architecture

DKVMN is parameterized by two core memory matrices:

  • Key matrix $M^k \in \mathbb{R}^{N \times d_k}$, which is static and encodes latent concept representations (with $N$ the number of concept slots and $d_k$ the embedding dimension);
  • Value matrix $M^v_t \in \mathbb{R}^{N \times d_v}$, which is dynamic and stores the student's evolving mastery state per concept at time $t$.

Processing follows these four stages at each time step $t$:

  1. Exercise embedding: The incoming exercise $q_t$ is one-hot encoded and projected to a dense vector $k_t = A^\top q_t$, with $A \in \mathbb{R}^{Q \times d_k}$.
  2. Attention computation: Similarity between $k_t$ and each row of $M^k$ yields attention weights $w_t(i) = \mathrm{softmax}_i(k_t^\top M^k(i))$ over concepts.
  3. Read operation: The value memory is accessed by weighted sum, $r_t = \sum_{i=1}^N w_t(i)\, M^v_{t-1}(i)$, giving an aggregate representation of the learner's relevant concept mastery.
  4. Prediction and update: Predictive features $f_t$ are formed by concatenating $r_t$ and $k_t$, then nonlinearly mapped. The probability of a correct response is $p_t = \sigma(W_2^\top f_t + b_2)$. The value memory is updated using an erase gate $e_t$ and an add vector $a_t$, both generated from the joint embedding of $(q_t, r_t)$.
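The four stages above can be sketched directly in NumPy. All dimensions, parameter names, and the random toy weights below are illustrative assumptions for exposition, not details taken from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    z = np.exp(x - x.max())
    return z / z.sum()

def dkvmn_step(q_onehot, Mk, Mv, A, W1, b1, W2, b2):
    """One DKVMN read/predict step (stages 1-4 above).

    q_onehot : (Q,)     one-hot exercise indicator
    Mk       : (N, d_k) static key matrix
    Mv       : (N, d_v) current value (mastery) matrix
    A        : (Q, d_k) exercise embedding matrix
    """
    k_t = A.T @ q_onehot                 # stage 1: exercise embedding
    w_t = softmax(Mk @ k_t)              # stage 2: attention over concept slots
    r_t = Mv.T @ w_t                     # stage 3: read weighted mastery
    f_t = np.tanh(W1 @ np.concatenate([r_t, k_t]) + b1)  # joint feature
    p_t = 1.0 / (1.0 + np.exp(-(W2 @ f_t + b2)))         # stage 4: P(correct)
    return p_t, w_t

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
Q, N, dk, dv, df = 5, 3, 4, 4, 8
q = np.zeros(Q); q[2] = 1.0
Mk = rng.normal(size=(N, dk))
Mv = rng.normal(size=(N, dv))
A = rng.normal(size=(Q, dk))
W1, b1 = rng.normal(size=(df, dv + dk)), np.zeros(df)
W2, b2 = rng.normal(size=df), 0.0
p, w = dkvmn_step(q, Mk, Mv, A, W1, b1, W2, b2)
```

The attention weights `w` are reused by the write stage, which is what ties the prediction and the memory update to the same concept slots.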

2. Memory Read and Write Operations

Read Operation

  • For question $q_t$:
    • Project $q_t$ to key embedding $k_t$.
    • Compute attention: $w_t(i) = \frac{\exp(k_t^\top M^k(i))}{\sum_{j=1}^N \exp(k_t^\top M^k(j))}$.
    • Read from value memory: $r_t = \sum_{i=1}^N w_t(i)\, M^v_{t-1}(i)$.

Write Operation

  • Embed the observed interaction $(q_t, r_t)$ into $v_t = B^\top [q_t; r_t]$.
  • Compute the erase gate $e_t = \sigma(E^\top v_t + b_e)$ and add vector $a_t = \tanh(D^\top v_t + b_a)$.
  • Update $M^v_t(i)$:

$$M^v_t(i) = \left[M^v_{t-1}(i) \circ \big(1 - w_t(i)\, e_t\big)\right] + w_t(i)\, a_t,$$

where $\circ$ denotes elementwise multiplication.
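A minimal sketch of this erase-then-add update, assuming illustrative square shapes for the projection matrices E and D and random toy values throughout:

```python
import numpy as np

def dkvmn_write(Mv, w_t, v_t, E, b_e, D, b_a):
    """Erase-then-add update of the value memory.

    Mv   : (N, d_v)   value memory before the update
    w_t  : (N,)       attention weights reused from the read step
    v_t  : (d_v,)     embedding of the observed interaction (q_t, r_t)
    E, D : (d_v, d_v) erase/add projections (illustrative shapes)
    """
    e_t = 1.0 / (1.0 + np.exp(-(E @ v_t + b_e)))  # erase gate, entries in (0, 1)
    a_t = np.tanh(D @ v_t + b_a)                  # add vector, entries in (-1, 1)
    # Slot i is scaled by (1 - w_t(i) e_t), then w_t(i) a_t is added.
    return Mv * (1.0 - np.outer(w_t, e_t)) + np.outer(w_t, a_t)

rng = np.random.default_rng(1)
N, dv = 3, 4
Mv = rng.normal(size=(N, dv))
w = np.array([0.7, 0.2, 0.1])
v = rng.normal(size=dv)
E, D = rng.normal(size=(dv, dv)), rng.normal(size=(dv, dv))
Mv_new = dkvmn_write(Mv, w, v, E, np.zeros(dv), D, np.zeros(dv))
```

Because each slot's change is scaled by its attention weight, concepts the current exercise does not touch are left essentially unchanged.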

3. Concept Discovery and Interpretation

The correlation matrix $W = [w(q)]_{q=1}^{Q}$ aggregates attention weights between exercises and latent concepts. This matrix is typically sparse; for most $q$, one concept dominates. Exercises can be assigned to discovered concepts via $i^* = \arg\max_i w(q)(i)$. This enables clustering exercises according to learned concept slots without the need for expert-provided concept tags. Concept distributions and structure can be visualized via heatmaps or dimensionality-reduction techniques such as t-SNE (Zhang et al., 2016).
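The argmax assignment can be illustrated with a small hypothetical attention matrix; the weights below are made up for exposition, not taken from any dataset:

```python
import numpy as np

# Hypothetical attention matrix of shape (Q, N): row q holds w(q), the
# attention of exercise q over N = 3 concept slots.
W_attn = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.08, 0.07, 0.85],
    [0.80, 0.15, 0.05],
])
concept_of = W_attn.argmax(axis=1)  # i* = argmax_i w(q)(i) for each exercise q
clusters = {i: np.flatnonzero(concept_of == i).tolist()
            for i in range(W_attn.shape[1])}
# Exercises 0 and 3 land in the same discovered concept slot.
```

Sparsity in the rows is what makes this assignment meaningful: when one weight dominates, the argmax is a stable concept label for the exercise.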

4. Training Procedure and Hyperparameters

DKVMN is trained end-to-end to minimize cross-entropy loss over all time steps and students:

$$L = -\sum_t \left[r_t \log p_t + (1 - r_t) \log(1 - p_t)\right]$$
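This loss can be computed per response sequence as follows; the clipping constant is a standard numerical-stability assumption, not part of the original formulation:

```python
import numpy as np

def kt_loss(p, r):
    """Binary cross-entropy over one response sequence.

    p : (T,) predicted probabilities of a correct answer
    r : (T,) observed responses in {0, 1}
    """
    eps = 1e-12                     # numerical floor to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))

p = np.array([0.9, 0.2, 0.7])
r = np.array([1, 0, 1])
loss = kt_loss(p, r)   # -(log 0.9 + log 0.8 + log 0.7) ≈ 0.685
```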

Optimization is performed using stochastic gradient descent with momentum (0.9), and hyperparameters are tuned per dataset. Key settings include:

  • Number of memory slots $N$ (typically 10–50),
  • Key/value embedding dimensions $d_k, d_v$ (e.g., 50, 100, 200),
  • Hidden dimension $d_f$ for the prediction nonlinearity (often 100),
  • Training/validation/test splits (30% test; 20% of the remainder for validation),
  • Learning-rate scheduling and gradient clipping.

Reported datasets include Synthetic-5, ASSISTments2009, ASSISTments2015, and Statics2011, ranging from 4,000 to 19,840 students and more than 1,200 unique exercises.

5. Experimental Evaluation

DKVMN achieves consistently higher AUC than DKT, MANN, and BKT across all evaluated benchmarks:

Dataset           DKVMN AUC   DKT AUC   MANN AUC   BKT/BKT+ AUC
Synthetic-5         82.7%      80.3%     81.0%      62% / 80%
ASSISTments2009     81.6%      80.5%     79.7%      63%
ASSISTments2015     72.7%      72.5%     72.3%      64%
Statics2011         82.8%      80.2%     77.6%      73–75%

Ablation studies confirm that DKVMN outperforms DKT even with comparable or fewer parameters. Learning curves reveal that DKVMN is more robust to overfitting, with stable validation AUC during training (Zhang et al., 2016).

6. Advantages, Limitations, and Comparisons

DKVMN offers several notable advantages:

  • Provides explicit and interpretable per-concept mastery estimation,
  • Learns exercise-to-concept mapping without reliance on expert annotation,
  • Demonstrates empirical gains in AUC over both shallow (BKT) and deep (DKT) KT methods,
  • Scales with the number of concept slots $N$ rather than the RNN hidden-state size.

Primary limitations include the need to specify $N$ a priori (though with excess slots, unused concepts typically collapse), the absence of hierarchical concept modeling (a possible extension outlined in the literature), reliance solely on exercise ID (where richer input modalities offer potential improvement), and task-dependent hyperparameter sensitivity (Zhang et al., 2016).

Comparison with the Sequential Key-Value Memory Network (SKVMN) highlights the architectural evolution:

  • DKVMN: Memory access and updates depend only on the current interaction $(q_t, r_t)$.
  • SKVMN: Incorporates a summary vector in memory writes and employs a Hop-LSTM linking the history of concept-related exercises, capturing longer-term dependencies. Empirical results indicate SKVMN outperforms DKVMN by 2–3 AUC points across several datasets (Abdelrahman et al., 2019).

7. Impact and Extensions

DKVMN constitutes a significant advance in interpretable deep knowledge tracing, setting a new empirical standard on multiple educational datasets. Its structure enables automated latent concept discovery and assessment at the concept level, a capability absent from prior RNN-based and Bayesian models. While the flat concept structure and reliance on one-hot exercise encoding leave room for extension, including hierarchical memory (e.g., Hierarchical KVMN) and multimodal input embeddings, DKVMN remains a strong foundation for future advances in both personalized education and interpretable sequential prediction (Zhang et al., 2016; Abdelrahman et al., 2019).

