TeSAN Model for EHRs
- TeSAN Model for EHRs is a neural architecture that embeds timestamped medical events by combining contextual co-occurrence with detailed temporal relationships.
- It utilizes a feature-wise temporal self-attention mechanism, temporal-interval embeddings, and a skip-gram objective to generate robust, interpretable representations.
- Empirical evaluations on datasets like MIMIC-III and CMS demonstrate state-of-the-art performance in clustering, nearest-neighbor search, and mortality prediction tasks.
The Temporal Self-Attention Network (TeSAN) is a neural architecture designed for medical concept embedding in the context of longitudinal electronic health records (EHRs). By explicitly incorporating both contextual co-occurrence and fine-grained temporal relationships between medical events, TeSAN advances over prior embedding approaches for representing timestamped medical concepts as fixed-length vectors suitable for downstream predictive tasks. The model is characterized by a feature-wise temporal self-attention mechanism, temporal-interval embeddings, and a skip-gram–style learning objective. Empirical results across clustering, nearest-neighbor search, and inpatient mortality prediction on diverse EHR datasets demonstrate that TeSAN achieves state-of-the-art performance relative to representative baselines, offering robustness and interpretability in the derived embeddings (Peng et al., 2019).
1. Model Architecture and Components
TeSAN comprises four principal modules: input embedding, temporal-interval embedding, a temporal self-attention block (TeSA), and an attention pooling/output projection. The model processes an input sequence of medical concepts, each associated with a timestamp, and outputs context-rich embeddings that encode both semantic and temporal relationships.
| Module | Description | Dimensions/Hyperparameters |
|---|---|---|
| Input Embedding | One-hot codes mapped via learned weights | $\lvert V\rvert \times d$ weight matrix |
| Temporal-Interval Embed. | Learnable table of interval embeddings | table size varies per dataset |
| Temporal Self-Attention | Single “multi-dimensional” head, gated fusion | feature-wise ($d$-dim.) attention scores |
| Attention Pooling | Multi-dimensional attention pooling | Produces context vector $c_i$ |
TeSAN's vocabulary comprises the distinct medical codes in each dataset ($7,873$ for CMS); training follows the skip-gram setup of word2vec, with a skip window of up to $7$ and negative sampling.
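The skip-window pairing feeding the skip-gram objective can be sketched as follows; the window value of $7$ comes from the text, while the symmetric (word2vec-style) pairing scheme is an assumption for illustration:

```python
def skip_pairs(sequence, window=7):
    """Generate (target, context) index pairs within a symmetric skip window."""
    pairs = []
    for i, _ in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        # every position within the window, excluding the target itself
        pairs.extend((i, j) for j in range(lo, hi) if j != i)
    return pairs

# toy sequence of four concepts with a window of 2
pairs = skip_pairs(list("abcd"), window=2)
```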
2. Temporal Self-Attention Mechanism
A distinguishing feature of TeSAN is its feature-wise temporal self-attention block, which enables the model to capture pairwise temporal dependencies between medical concepts. Given input embeddings $x_1, \dots, x_n \in \mathbb{R}^d$ and all pairwise day-differences $\Delta_{ij}$, each mapped to a learned interval embedding $t_{\Delta_{ij}} \in \mathbb{R}^d$, the compatibility function for a concept pair $(x_i, x_j)$ is

$$f(x_i, x_j) = W\,\sigma\!\left(W^{(1)} x_i + W^{(2)} x_j + W^{(t)} t_{\Delta_{ij}} + b^{(1)}\right) + b,$$

where $\sigma(\cdot)$ denotes an activation function such as ReLU. This yields a $d$-dimensional vector score for each pair $(i, j)$, allowing for feature-wise (rather than scalar) attention.

For each query position $i$, feature-wise softmax normalization is applied across all “keys” $j$:

$$\alpha_{ij}^{(k)} = \frac{\exp\!\big(f^{(k)}(x_i, x_j)\big)}{\sum_{j'} \exp\!\big(f^{(k)}(x_i, x_{j'})\big)}, \qquad k = 1, \dots, d.$$

This enables each embedding dimension to attend differently to temporal/contextual neighbors. The attended representation is

$$s_i = \sum_{j} \alpha_{ij} \odot x_j,$$

where $\odot$ denotes element-wise multiplication. A gated fusion mechanism, with parameters $W^{(g)}, U^{(g)}, b^{(g)}$, combines $s_i$ and the original $x_i$ to yield $u_i$, the output embedding at position $i$:

$$g_i = \mathrm{sigmoid}\!\left(W^{(g)} x_i + U^{(g)} s_i + b^{(g)}\right), \qquad u_i = g_i \odot x_i + (1 - g_i) \odot s_i.$$

Stacking across all positions $i$ yields $U = [u_1; \dots; u_n] \in \mathbb{R}^{n \times d}$.
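The feature-wise attention and gated fusion can be sketched in NumPy as follows; all weight names, shapes, and initializations here are illustrative assumptions, not taken from the paper or its released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def tesa_block(X, delta, T_embed, params):
    """Feature-wise temporal self-attention with gated fusion (sketch).

    X       : (n, d) concept embeddings
    delta   : (n, n) integer day-differences between events
    T_embed : (max_interval, d) learnable interval-embedding table
    """
    W1, W2, Wt, b1, W, b, Wg, Ug, bg = (params[k] for k in
        ("W1", "W2", "Wt", "b1", "W", "b", "Wg", "Ug", "bg"))

    t = T_embed[delta]                                 # (n, n, d) interval embeddings
    # d-dimensional compatibility score for every (query i, key j) pair, ReLU activation
    h = np.maximum(0.0, X[:, None] @ W1 + X[None, :] @ W2 + t @ Wt + b1)
    f = h @ W + b                                      # (n, n, d)
    # feature-wise softmax over the key axis j (numerically stabilized)
    a = np.exp(f - f.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    S = (a * X[None, :]).sum(axis=1)                   # (n, d) attended representation
    # gated fusion of the original and attended embeddings
    g = 1.0 / (1.0 + np.exp(-(X @ Wg + S @ Ug + bg)))
    return g * X + (1.0 - g) * S

# toy usage with illustrative dimensions
n, d = 6, 8
params = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "W1": (d, d), "W2": (d, d), "Wt": (d, d), "b1": (d,),
    "W": (d, d), "b": (d,), "Wg": (d, d), "Ug": (d, d), "bg": (d,)}.items()}
X = rng.normal(size=(n, d))
delta = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # day gaps as a toy stand-in
T_embed = rng.normal(size=(n, d))
U = tesa_block(X, delta, T_embed, params)
```

The per-dimension softmax is what makes the attention "feature-wise": each of the $d$ coordinates carries its own distribution over keys, rather than one scalar weight per pair.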
3. Learning Objective and Optimization
TeSAN’s training objective adapts skip-gram with negative sampling to the EHR context. For each target position $i$, an attention-pooled context vector $c_i$ is computed from the TeSA outputs within the skip window, constituting the “context” for the target concept $w_i$. The loss is given by

$$\mathcal{L} = -\sum_{i} \Big[ \log \sigma\!\big(c_i^{\top} v_{w_i}\big) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \log \sigma\!\big(-c_i^{\top} v_{w_k}\big) \Big],$$

where $\sigma$ denotes the sigmoid, $v_w$ is the output embedding of concept $w$, and $P_n(w)$ is a unigram noise distribution from which the $K$ negative samples are drawn. No additional regularization or auxiliary losses are used. Training is performed for 30 epochs on MIMIC-III and 20 epochs on CMS; batch size is specified only for the downstream GRU (128). Preprocessing includes the exclusion of low-frequency codes and, for CMS, of patients with fewer than four visits; no imputation or time-binning is performed.
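A minimal sketch of the skip-gram negative-sampling loss for one target position, in NumPy; the variable names are illustrative and only mirror the standard formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(c_i, v_target, v_negatives):
    """Skip-gram negative-sampling loss for a single target concept.

    c_i         : (d,) attention-pooled context vector
    v_target    : (d,) output embedding of the true target concept
    v_negatives : (K, d) output embeddings of K sampled noise concepts
    """
    pos = np.log(sigmoid(c_i @ v_target))              # reward the true pair
    neg = np.log(sigmoid(-(v_negatives @ c_i))).sum()  # penalize noise pairs
    return -(pos + neg)

# toy usage
rng = np.random.default_rng(1)
d, K = 8, 5
c = rng.normal(size=d)
loss = neg_sampling_loss(c, rng.normal(size=d), rng.normal(size=(K, d)))
```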
4. Empirical Evaluation
TeSAN is benchmarked against five state-of-the-art embedding methods: CBOW, Skip-gram, GloVe, med2vec, and MCE. The model is evaluated on clustering (using k-means, NMI), nearest-neighbor search (P@1), and inpatient mortality prediction (PR-AUC/ROC-AUC via GRU using visit-level aggregated embeddings).
| Dataset/Task | Metric | Baseline Best | TeSAN Best |
|---|---|---|---|
| MIMIC-III, Clustering | NMI (ICD/CCS) % | 26.02 / 53.09 (Skip-gram) | 32.84 / 58.33 |
| CMS, Clustering | NMI (ICD/CCS) % | 8.52 / 41.70 (CBOW) | 14.69 / 45.63 |
| MIMIC-III, NN Search | P@1 (ICD/CCS) % | 54.3 / 35.2 (CBOW) | 66.1 / 43.8 |
| CMS, NN Search | P@1 (ICD/CCS) % | 34.3 / 16.4 (CBOW) | 47.8 / 24.9 |
| MIMIC-III, Mortality | PR-AUC / ROC-AUC (GRU) | 0.5276 / 0.7785 (Sg+) | 0.5544 / 0.8064 |
TeSAN displays consistently superior performance across all reported metrics, with particularly notable gains in clustering and nearest-neighbor search. Its clustering and retrieval metrics are both higher than the baselines’ and more robust to varying the skip window length. In mortality prediction, TeSAN embeddings fed to a GRU classifier achieve the highest PR-AUC and ROC-AUC on MIMIC-III. No tests of statistical significance are reported.
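The nearest-neighbor protocol can be illustrated with a small P@1 sketch: for each concept, retrieve its closest neighbor by cosine similarity and check whether the two share a ground-truth group label (a stand-in for ICD/CCS categories; the cosine metric and toy data are assumptions for illustration):

```python
import numpy as np

def precision_at_1(E, labels):
    """P@1: fraction of concepts whose cosine-nearest neighbor shares their label.

    E      : (n, d) concept embedding matrix
    labels : (n,) group label per concept (e.g. ICD or CCS category)
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T
    np.fill_diagonal(sim, -np.inf)        # exclude self-matches
    nn = sim.argmax(axis=1)               # index of each concept's nearest neighbor
    return float((labels[nn] == labels).mean())

# toy example: two well-separated clusters -> perfect P@1
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
p1 = precision_at_1(E, labels)  # 1.0 on this toy data
```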
5. Analysis, Discussion, and Limitations
The explicit modeling of temporal intervals as learned continuous embeddings, and their integration via multi-dimensional gated self-attention, enable TeSAN to encode both the co-occurrence of medical concepts and the temporal gaps between events. This dual encoding provides a richer representation space than methods focused solely on context (CBOW, Skip-gram), global co-occurrence (GloVe), hierarchical structure (med2vec), or prior forms of local attention (MCE).
Ablation studies demonstrate that both the temporal-interval embedding and the multi-dimensional attention mechanism are essential; removing either degrades performance on clustering and retrieval. Several limitations remain, however: only a single-head attention block is used, which may limit expressiveness for long-range dependencies beyond the local context window; training hyper-parameters (optimizer, learning rate) are omitted; and neither hardware specifications nor statistical significance tests are provided.
Potential extensions include development of multi-head attention variants, adaptive or hierarchical skip-window mechanisms, and integration with structured clinical ontologies for further semantic enrichment. These directions could enhance modeling of complex, heterogeneous temporal patterns in longitudinal EHRs (Peng et al., 2019).