TeSAN Model for EHRs
- TeSAN Model for EHRs is a neural architecture that embeds timestamped medical events by combining contextual co-occurrence with detailed temporal relationships.
- It utilizes a feature-wise temporal self-attention mechanism, temporal-interval embeddings, and a skip-gram objective to generate robust, interpretable representations.
- Empirical evaluations on datasets like MIMIC-III and CMS demonstrate state-of-the-art performance in clustering, nearest-neighbor search, and mortality prediction tasks.
The Temporal Self-Attention Network (TeSAN) is a neural architecture designed for medical concept embedding in the context of longitudinal electronic health records (EHRs). By explicitly incorporating both contextual co-occurrence and fine-grained temporal relationships between medical events, TeSAN advances over prior embedding approaches for representing timestamped medical concepts as fixed-length vectors suitable for downstream predictive tasks. The model is characterized by a feature-wise temporal self-attention mechanism, temporal-interval embeddings, and a skip-gram–style learning objective. Empirical results across clustering, nearest-neighbor search, and inpatient mortality prediction on diverse EHR datasets demonstrate that TeSAN achieves state-of-the-art performance relative to representative baselines, offering robustness and interpretability in the derived embeddings (Peng et al., 2019).
1. Model Architecture and Components
TeSAN comprises four principal modules: input embedding, temporal-interval embedding, a temporal self-attention block (TeSA), and an attention pooling/output projection. The model processes an input sequence of medical concepts, each associated with a timestamp, and outputs context-rich embeddings that encode both semantic and temporal relationships.
| Module | Description | Dimensions/Hyperparameters |
|---|---|---|
| Input Embedding | One-hot codes mapped via learned weights | $\lvert V\rvert \times d$ weight matrix |
| Temporal-Interval Embed. | Learnable table of interval embeddings | table size varies per dataset |
| Temporal Self-Attention | Single “multi-dimensional” head, gated fusion | feature-wise ($d$-dim.) attention scores |
| Attention Pooling | Multi-dimensional attention pooling | Produces context vector $c_i$ |
TeSAN's vocabulary comprises the distinct medical codes in each dataset ($7,873$ for CMS); training follows the skip-gram setup of word2vec, with a skip window of up to $7$ and negative sampling.
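The skip-window pairing feeding the skip-gram objective can be sketched as follows; the window value of $7$ comes from the text, while the symmetric (word2vec-style) pairing scheme is an assumption for illustration:

```python
def skip_pairs(sequence, window=7):
    """Generate (target, context) index pairs within a symmetric skip window."""
    pairs = []
    for i, _ in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        # every position within the window, excluding the target itself
        pairs.extend((i, j) for j in range(lo, hi) if j != i)
    return pairs

# toy sequence of four concepts with a window of 2
pairs = skip_pairs(list("abcd"), window=2)
```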
2. Temporal Self-Attention Mechanism
A distinguishing feature of TeSAN is its feature-wise temporal self-attention block, which enables the model to capture pairwise temporal dependencies between medical concepts. Given input embeddings $x_1, \dots, x_n \in \mathbb{R}^d$ and all pairwise day-differences $\Delta_{ij}$, each mapped to a learned interval embedding $t_{\Delta_{ij}} \in \mathbb{R}^d$, the compatibility function for a concept pair $(x_i, x_j)$ is

$$f(x_i, x_j) = W\,\sigma\!\left(W^{(1)} x_i + W^{(2)} x_j + W^{(t)} t_{\Delta_{ij}} + b^{(1)}\right) + b,$$

where $\sigma(\cdot)$ denotes an activation function such as ReLU. This yields a $d$-dimensional vector score for each pair $(i, j)$, allowing for feature-wise (rather than scalar) attention.

For each query position $i$, feature-wise softmax normalization is applied across all “keys” $j$:

$$\alpha_{ij}^{(k)} = \frac{\exp\!\big(f^{(k)}(x_i, x_j)\big)}{\sum_{j'} \exp\!\big(f^{(k)}(x_i, x_{j'})\big)}, \qquad k = 1, \dots, d.$$

This enables each embedding dimension to attend differently to temporal/contextual neighbors. The attended representation is

$$s_i = \sum_{j} \alpha_{ij} \odot x_j,$$

where $\odot$ denotes element-wise multiplication. A gated fusion mechanism, with parameters $W^{(g)}, U^{(g)}, b^{(g)}$, combines $s_i$ and the original $x_i$ to yield $u_i$, the output embedding at position $i$:

$$g_i = \mathrm{sigmoid}\!\left(W^{(g)} x_i + U^{(g)} s_i + b^{(g)}\right), \qquad u_i = g_i \odot x_i + (1 - g_i) \odot s_i.$$

Stacking across all positions $i$ yields $U = [u_1; \dots; u_n] \in \mathbb{R}^{n \times d}$.
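The feature-wise attention and gated fusion can be sketched in NumPy as follows; all weight names, shapes, and initializations here are illustrative assumptions, not taken from the paper or its released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def tesa_block(X, delta, T_embed, params):
    """Feature-wise temporal self-attention with gated fusion (sketch).

    X       : (n, d) concept embeddings
    delta   : (n, n) integer day-differences between events
    T_embed : (max_interval, d) learnable interval-embedding table
    """
    W1, W2, Wt, b1, W, b, Wg, Ug, bg = (params[k] for k in
        ("W1", "W2", "Wt", "b1", "W", "b", "Wg", "Ug", "bg"))

    t = T_embed[delta]                                 # (n, n, d) interval embeddings
    # d-dimensional compatibility score for every (query i, key j) pair, ReLU activation
    h = np.maximum(0.0, X[:, None] @ W1 + X[None, :] @ W2 + t @ Wt + b1)
    f = h @ W + b                                      # (n, n, d)
    # feature-wise softmax over the key axis j (numerically stabilized)
    a = np.exp(f - f.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    S = (a * X[None, :]).sum(axis=1)                   # (n, d) attended representation
    # gated fusion of the original and attended embeddings
    g = 1.0 / (1.0 + np.exp(-(X @ Wg + S @ Ug + bg)))
    return g * X + (1.0 - g) * S

# toy usage with illustrative dimensions
n, d = 6, 8
params = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "W1": (d, d), "W2": (d, d), "Wt": (d, d), "b1": (d,),
    "W": (d, d), "b": (d,), "Wg": (d, d), "Ug": (d, d), "bg": (d,)}.items()}
X = rng.normal(size=(n, d))
delta = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # day gaps as a toy stand-in
T_embed = rng.normal(size=(n, d))
U = tesa_block(X, delta, T_embed, params)
```

The per-dimension softmax is what makes the attention "feature-wise": each of the $d$ coordinates carries its own distribution over keys, rather than one scalar weight per pair.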
3. Learning Objective and Optimization
TeSAN’s training objective adapts skip-gram with negative sampling to the EHR context. For each target position $i$, an attention-pooled context vector $c_i$ is computed from the TeSA outputs within the skip window, constituting the “context” for the target concept $w_i$. The loss is given by

$$\mathcal{L} = -\sum_{i} \Big[ \log \sigma\!\big(c_i^{\top} v_{w_i}\big) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \log \sigma\!\big(-c_i^{\top} v_{w_k}\big) \Big],$$

where $\sigma$ denotes the sigmoid, $v_w$ is the output embedding of concept $w$, and $P_n(w)$ is a unigram noise distribution from which the $K$ negative samples are drawn. No additional regularization or auxiliary losses are used. Training is performed for 30 epochs on MIMIC-III and 20 epochs on CMS; batch size is specified only for the downstream GRU (128). Preprocessing includes the exclusion of low-frequency codes and, for CMS, of patients with fewer than four visits; no imputation or time-binning is performed.
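A minimal sketch of the skip-gram negative-sampling loss for one target position, in NumPy; the variable names are illustrative and only mirror the standard formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(c_i, v_target, v_negatives):
    """Skip-gram negative-sampling loss for a single target concept.

    c_i         : (d,) attention-pooled context vector
    v_target    : (d,) output embedding of the true target concept
    v_negatives : (K, d) output embeddings of K sampled noise concepts
    """
    pos = np.log(sigmoid(c_i @ v_target))              # reward the true pair
    neg = np.log(sigmoid(-(v_negatives @ c_i))).sum()  # penalize noise pairs
    return -(pos + neg)

# toy usage
rng = np.random.default_rng(1)
d, K = 8, 5
c = rng.normal(size=d)
loss = neg_sampling_loss(c, rng.normal(size=d), rng.normal(size=(K, d)))
```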
4. Empirical Evaluation
TeSAN is benchmarked against five state-of-the-art embedding methods: CBOW, Skip-gram, GloVe, med2vec, and MCE. The model is evaluated on clustering (using k-means, NMI), nearest-neighbor search (P@1), and inpatient mortality prediction (PR-AUC/ROC-AUC via GRU using visit-level aggregated embeddings).
| Dataset/Task | Metric | Baseline Best | TeSAN Best |
|---|---|---|---|
| MIMIC-III, Clustering | NMI (ICD/CCS) % | 26.02 / 53.09 (Skip-gram) | 32.84 / 58.33 |
| CMS, Clustering | NMI (ICD/CCS) % | 8.52 / 41.70 (CBOW) | 14.69 / 45.63 |
| MIMIC-III, NN Search | P@1 (ICD/CCS) % | 54.3 / 35.2 (CBOW) | 66.1 / 43.8 |
| CMS, NN Search | P@1 (ICD/CCS) % | 34.3 / 16.4 (CBOW) | 47.8 / 24.9 |
| MIMIC-III, Mortality | PR-AUC / ROC-AUC (GRU) | 0.5276 / 0.7785 (Sg+) | 0.5544 / 0.8064 |
TeSAN displays consistently superior performance across all reported metrics, with particularly notable gains in clustering and nearest-neighbor search. Its clustering and retrieval metrics are both higher than the baselines’ and more robust to varying the skip window length. In mortality prediction, TeSAN embeddings fed to a GRU classifier achieve the highest PR-AUC and ROC-AUC on MIMIC-III. No tests of statistical significance are reported.
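The nearest-neighbor protocol can be illustrated with a small P@1 sketch: for each concept, retrieve its closest neighbor by cosine similarity and check whether the two share a ground-truth group label (a stand-in for ICD/CCS categories; the cosine metric and toy data are assumptions for illustration):

```python
import numpy as np

def precision_at_1(E, labels):
    """P@1: fraction of concepts whose cosine-nearest neighbor shares their label.

    E      : (n, d) concept embedding matrix
    labels : (n,) group label per concept (e.g. ICD or CCS category)
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T
    np.fill_diagonal(sim, -np.inf)        # exclude self-matches
    nn = sim.argmax(axis=1)               # index of each concept's nearest neighbor
    return float((labels[nn] == labels).mean())

# toy example: two well-separated clusters -> perfect P@1
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
p1 = precision_at_1(E, labels)  # 1.0 on this toy data
```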
5. Analysis, Discussion, and Limitations
The explicit modeling of temporal intervals as learned continuous embeddings, and their integration via multi-dimensional gated self-attention, enable TeSAN to encode both the co-occurrence of medical concepts and the temporal gaps between events. This dual encoding provides a richer representation space than methods focused solely on context (CBOW, Skip-gram), global co-occurrence (GloVe), hierarchical structure (med2vec), or prior forms of local attention (MCE).
Ablation studies demonstrate that both the temporal-interval embedding and the multi-dimensional attention mechanism are essential; removing either degrades performance on clustering and retrieval. Several limitations remain, however: only a single-head attention block is used, which may limit expressiveness for long-range dependencies beyond the local context window; training hyper-parameters (optimizer, learning rate) are omitted; and neither hardware specifications nor statistical significance tests are provided.
Potential extensions include development of multi-head attention variants, adaptive or hierarchical skip-window mechanisms, and integration with structured clinical ontologies for further semantic enrichment. These directions could enhance modeling of complex, heterogeneous temporal patterns in longitudinal EHRs (Peng et al., 2019).