Dynamic Key-Value Memory Networks

Updated 19 February 2026
  • DKVMN is a memory-augmented neural network that uses a dual memory mechanism—static keys and dynamic values—to represent and update learner mastery per concept.
  • It employs exercise embeddings and attention mechanisms to map interactions to latent concepts, enabling interpretable and automated concept discovery.
  • Empirical evaluations show that DKVMN outperforms traditional methods like BKT and DKT, achieving higher AUC on diverse educational datasets.

Dynamic Key-Value Memory Networks (DKVMN) are a class of memory-augmented neural networks designed for knowledge tracing (KT): inferring and modeling the evolving mastery state of a learner across latent concepts during sequential educational interactions. DKVMN addresses the limitations of prior approaches by introducing a dual memory mechanism: a static "key" matrix representing concepts and a dynamic "value" matrix encoding the learner's concept mastery, enabling interpretable per-concept assessment and end-to-end concept discovery (Zhang et al., 2016). DKVMN has demonstrated superior performance over Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT) on multiple real and synthetic datasets, while offering an explicit representation of the student's knowledge state at the level of individual concepts.

1. Model Architecture

DKVMN is parameterized by two core memory matrices:

  • Key matrix $M^k \in \mathbb{R}^{N \times d_k}$, which is static and encodes latent concept representations (with $N$ the number of concept slots and $d_k$ the embedding dimension);
  • Value matrix $M^v_t \in \mathbb{R}^{N \times d_v}$, which is dynamic and stores the student's evolving mastery state per concept at time $t$.

Processing follows these four stages at each time step $t$:

  1. Exercise embedding: The incoming exercise $q_t$ is one-hot encoded and projected to a dense vector $k_t = A^\top q_t$, with $A \in \mathbb{R}^{Q \times d_k}$.
  2. Attention computation: Similarity between $k_t$ and each row of $M^k$ yields attention weights $w_t(i) = \mathrm{softmax}_i(k_t^\top M^k(i))$ over concepts.
  3. Read operation: The value memory is accessed by weighted sum, $r_t = \sum_{i=1}^N w_t(i)\, M^v_{t-1}(i)$, giving an aggregate representation of the learner's relevant concept mastery.
  4. Prediction and update: Predictive features $f_t$ are formed by concatenating $r_t$ and $k_t$, then nonlinearly mapped. The probability of a correct response is $p_t = \sigma(W_2^\top f_t + b_2)$. The value memory is updated using an erase gate $e_t$ and an add vector $a_t$, both generated from the joint embedding of $(q_t, r_t)$.
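The four stages above can be sketched directly in NumPy. All dimensions, parameter names, and the random toy weights below are illustrative assumptions for exposition, not details taken from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    z = np.exp(x - x.max())
    return z / z.sum()

def dkvmn_step(q_onehot, Mk, Mv, A, W1, b1, W2, b2):
    """One DKVMN read/predict step (stages 1-4 above).

    q_onehot : (Q,)     one-hot exercise indicator
    Mk       : (N, d_k) static key matrix
    Mv       : (N, d_v) current value (mastery) matrix
    A        : (Q, d_k) exercise embedding matrix
    """
    k_t = A.T @ q_onehot                 # stage 1: exercise embedding
    w_t = softmax(Mk @ k_t)              # stage 2: attention over concept slots
    r_t = Mv.T @ w_t                     # stage 3: read weighted mastery
    f_t = np.tanh(W1 @ np.concatenate([r_t, k_t]) + b1)  # joint feature
    p_t = 1.0 / (1.0 + np.exp(-(W2 @ f_t + b2)))         # stage 4: P(correct)
    return p_t, w_t

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
Q, N, dk, dv, df = 5, 3, 4, 4, 8
q = np.zeros(Q); q[2] = 1.0
Mk = rng.normal(size=(N, dk))
Mv = rng.normal(size=(N, dv))
A = rng.normal(size=(Q, dk))
W1, b1 = rng.normal(size=(df, dv + dk)), np.zeros(df)
W2, b2 = rng.normal(size=df), 0.0
p, w = dkvmn_step(q, Mk, Mv, A, W1, b1, W2, b2)
```

The attention weights `w` are reused by the write stage, which is what ties the prediction and the memory update to the same concept slots.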

2. Memory Read and Write Operations

Read Operation

  • For question $q_t$:
    • Project $q_t$ to key embedding $k_t$.
    • Compute attention: $w_t(i) = \frac{\exp(k_t^\top M^k(i))}{\sum_{j=1}^N \exp(k_t^\top M^k(j))}$.
    • Read from value memory: $r_t = \sum_{i=1}^N w_t(i)\, M^v_{t-1}(i)$.

Write Operation

  • Embed the observed interaction $(q_t, r_t)$ into $v_t = B^\top [q_t; r_t]$.
  • Compute the erase gate $e_t = \sigma(E^\top v_t + b_e)$ and add vector $a_t = \tanh(D^\top v_t + b_a)$.
  • Update $M^v_t(i)$:

$$M^v_t(i) = \left[M^v_{t-1}(i) \circ \big(1 - w_t(i)\, e_t\big)\right] + w_t(i)\, a_t,$$

where $\circ$ denotes elementwise multiplication.
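A minimal sketch of this erase-then-add update, assuming illustrative square shapes for the projection matrices E and D and random toy values throughout:

```python
import numpy as np

def dkvmn_write(Mv, w_t, v_t, E, b_e, D, b_a):
    """Erase-then-add update of the value memory.

    Mv   : (N, d_v)   value memory before the update
    w_t  : (N,)       attention weights reused from the read step
    v_t  : (d_v,)     embedding of the observed interaction (q_t, r_t)
    E, D : (d_v, d_v) erase/add projections (illustrative shapes)
    """
    e_t = 1.0 / (1.0 + np.exp(-(E @ v_t + b_e)))  # erase gate, entries in (0, 1)
    a_t = np.tanh(D @ v_t + b_a)                  # add vector, entries in (-1, 1)
    # Slot i is scaled by (1 - w_t(i) e_t), then w_t(i) a_t is added.
    return Mv * (1.0 - np.outer(w_t, e_t)) + np.outer(w_t, a_t)

rng = np.random.default_rng(1)
N, dv = 3, 4
Mv = rng.normal(size=(N, dv))
w = np.array([0.7, 0.2, 0.1])
v = rng.normal(size=dv)
E, D = rng.normal(size=(dv, dv)), rng.normal(size=(dv, dv))
Mv_new = dkvmn_write(Mv, w, v, E, np.zeros(dv), D, np.zeros(dv))
```

Because each slot's change is scaled by its attention weight, concepts the current exercise does not touch are left essentially unchanged.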

3. Concept Discovery and Interpretation

The correlation matrix $W = [w(q)]_{q=1}^{Q}$ aggregates attention weights between exercises and latent concepts. This matrix is typically sparse; for most $q$, one concept dominates. Exercises can be assigned to discovered concepts via $i^* = \arg\max_i w(q)(i)$. This enables clustering exercises according to learned concept slots without the need for expert-provided concept tags. Concept distributions and structure can be visualized via heatmaps or dimensionality-reduction techniques such as t-SNE (Zhang et al., 2016).
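The argmax assignment can be illustrated with a small hypothetical attention matrix; the weights below are made up for exposition, not taken from any dataset:

```python
import numpy as np

# Hypothetical attention matrix of shape (Q, N): row q holds w(q), the
# attention of exercise q over N = 3 concept slots.
W_attn = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.08, 0.07, 0.85],
    [0.80, 0.15, 0.05],
])
concept_of = W_attn.argmax(axis=1)  # i* = argmax_i w(q)(i) for each exercise q
clusters = {i: np.flatnonzero(concept_of == i).tolist()
            for i in range(W_attn.shape[1])}
# Exercises 0 and 3 land in the same discovered concept slot.
```

Sparsity in the rows is what makes this assignment meaningful: when one weight dominates, the argmax is a stable concept label for the exercise.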

4. Training Procedure and Hyperparameters

DKVMN is trained end-to-end to minimize cross-entropy loss over all time steps and students:

$$L = -\sum_t \left[r_t \log p_t + (1 - r_t) \log(1 - p_t)\right]$$
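This loss can be computed per response sequence as follows; the clipping constant is a standard numerical-stability assumption, not part of the original formulation:

```python
import numpy as np

def kt_loss(p, r):
    """Binary cross-entropy over one response sequence.

    p : (T,) predicted probabilities of a correct answer
    r : (T,) observed responses in {0, 1}
    """
    eps = 1e-12                     # numerical floor to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))

p = np.array([0.9, 0.2, 0.7])
r = np.array([1, 0, 1])
loss = kt_loss(p, r)   # -(log 0.9 + log 0.8 + log 0.7) ≈ 0.685
```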

Optimization is performed using stochastic gradient descent with momentum (0.9), and hyperparameters are tuned per dataset. Key settings include:

  • Number of memory slots $N$ (typically 10–50),
  • Key/value embedding dimensions $d_k, d_v$ (e.g., 50, 100, 200),
  • Hidden dimension $d_f$ for the prediction nonlinearity (often 100),
  • Training/validation/test splits (30% test; 20% of the remainder for validation),
  • Learning-rate scheduling and gradient clipping.

Reported datasets include Synthetic-5, ASSISTments2009, ASSISTments2015, and Statics2011, ranging from 4,000 to 19,840 students and more than 1,200 unique exercises.

5. Experimental Evaluation

DKVMN achieves consistently higher AUC than DKT, MANN, and BKT across all evaluated benchmarks:

Dataset           DKVMN AUC   DKT AUC   MANN AUC   BKT/BKT+ AUC
Synthetic-5         82.7%      80.3%     81.0%      62% / 80%
ASSISTments2009     81.6%      80.5%     79.7%      63%
ASSISTments2015     72.7%      72.5%     72.3%      64%
Statics2011         82.8%      80.2%     77.6%      73–75%

Ablation studies confirm that DKVMN outperforms DKT even with comparable or fewer parameters. Learning curves reveal that DKVMN is more robust to overfitting, with stable validation AUC during training (Zhang et al., 2016).

6. Advantages, Limitations, and Comparisons

DKVMN offers several notable advantages:

  • Provides explicit and interpretable per-concept mastery estimation,
  • Learns exercise-to-concept mapping without reliance on expert annotation,
  • Demonstrates empirical gains in AUC over both shallow (BKT) and deep (DKT) KT methods,
  • Scales with the number of concept slots $N$ rather than the RNN hidden-state size.

Primary limitations include the need to specify $N$ a priori (though with excess slots, unused concepts typically collapse), the absence of hierarchical concept modeling (a possible extension outlined in the literature), reliance solely on exercise ID (where richer input modalities offer potential improvement), and task-dependent hyperparameter sensitivity (Zhang et al., 2016).

Comparison with the Sequential Key-Value Memory Network (SKVMN) highlights the architectural evolution:

  • DKVMN: Memory access and updates depend only on the current interaction $(q_t, r_t)$.
  • SKVMN: Incorporates a summary vector in memory writes and employs a Hop-LSTM linking the history of concept-related exercises, capturing longer-term dependencies. Empirical results indicate SKVMN outperforms DKVMN by 2–3 AUC points across several datasets (Abdelrahman et al., 2019).

7. Impact and Extensions

DKVMN constitutes a significant advance in interpretable deep knowledge tracing, setting a new empirical standard on multiple educational datasets. Its structure enables automated latent concept discovery and assessment at the concept level, a capability absent from prior RNN-based and Bayesian models. While the flat concept structure and reliance on one-hot exercise encoding leave room for extension, including hierarchical memory (e.g., Hierarchical KVMN) and multimodal input embeddings, DKVMN remains a strong foundation for future advances in both personalized education and interpretable sequential prediction (Zhang et al., 2016; Abdelrahman et al., 2019).

