KeyNMF: Transformer-Enhanced Topic Modelling
- KeyNMF is a topic modelling framework that integrates transformer-based contextual embeddings with non-negative matrix factorization for both static and dynamic analysis.
- It constructs a non-negative keyword–document matrix using cosine similarity and applies multiplicative update rules to optimize factorization performance.
- Demonstrated on Chinese diaspora media, KeyNMF achieves a strong balance between topic diversity and external coherence, enabling clear analysis of information dynamics.
KeyNMF is a topic modelling framework that integrates transformer-based contextual embeddings with stable non-negative matrix factorization (NMF), designed for both static and dynamic modelling of topical information in large text corpora, particularly in the context of Chinese diaspora media. The approach optimizes topical coherence and diversity while enabling the quantitative analysis of information dynamics over time (Kristensen-McLachlan et al., 2024).
1. Mathematical Formulation of Static KeyNMF
KeyNMF operates on a corpus of $D$ documents with a vocabulary of $V$ candidate keywords. Each document $d_i$ is embedded as a vector $e_i$ and each candidate keyword $w_j$ as $c_j$ via a pre-trained transformer encoder. A non-negative keyword–document matrix $M \in \mathbb{R}_+^{D \times V}$ is constructed such that

$$M_{ij} = \begin{cases} \cos(e_i, c_j) & \text{if } w_j \in \mathcal{N}_i \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathcal{N}_i$ is the set of the top $N$ words most similar to $e_i$ by cosine similarity.
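A minimal NumPy sketch of this construction, assuming the document and keyword embeddings have already been computed by the encoder; the `top_n` default and the clipping of negative cosines to zero are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def keyword_document_matrix(doc_embs: np.ndarray,
                            word_embs: np.ndarray,
                            top_n: int = 25) -> np.ndarray:
    """Build the non-negative keyword-document matrix M (D x V).

    M[i, j] = cos(e_i, c_j) if word j is among the top_n words most
    similar to document i, and 0 otherwise. Negative similarities are
    clipped to zero (an assumption) to keep M non-negative.
    """
    # Row-normalize so a plain dot product gives cosine similarity.
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    sims = d @ w.T                          # (D, V) cosine similarities
    M = np.zeros_like(sims)
    for i in range(sims.shape[0]):
        top = np.argsort(sims[i])[-top_n:]  # indices of top_n words
        M[i, top] = np.clip(sims[i, top], 0.0, None)
    return M

# Toy example with random vectors standing in for transformer embeddings.
rng = np.random.default_rng(0)
M = keyword_document_matrix(rng.normal(size=(4, 8)),
                            rng.normal(size=(10, 8)), top_n=3)
```

Each row of `M` then has at most `top_n` non-zero entries, which keeps the factorization focused on each document's most salient keywords.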
The model seeks a low-rank non-negative factorization $M \approx WH$, with $W \in \mathbb{R}_+^{D \times K}$ and $H \in \mathbb{R}_+^{K \times V}$, by minimizing

$$\mathcal{L}(W, H) = \lVert M - WH \rVert_F^2 + \lambda_W \lVert W \rVert_F^2 + \lambda_H \lVert H \rVert_F^2,$$

where the regularization terms are optional ($\lambda_W = \lambda_H = 0$ in practice).
2. Algorithmic Procedure for Model Fitting
Parameter optimization is conducted via block-coordinate descent using multiplicative update rules, following principles similar to those of Cichocki & Phan. The algorithm iteratively alternates updates of $W$ and $H$ as follows:
```
Input: keyword matrix M, number of topics K, max_iters, tol
Initialize W ← random_+(D×K), H ← random_+(K×V)
for iter in 1…max_iters:
    H ← H ⊙ (Wᵀ M) / (Wᵀ W H + λ_H H)
    W ← W ⊙ (M Hᵀ) / (W H Hᵀ + λ_W W)
    Compute objective L_new
    if |L_old − L_new| / L_old < tol: break
    L_old ← L_new
return W, H
```
3. Dynamic Extension and Temporal Information Dynamics
For modelling temporal progression, the corpus is divided into time slices. Given a submatrix $M^{(t)}$ for each time slice $t$, the method first learns a global topic matrix $H$ across all data, then fixes $H$ and solves for a slice-specific $W^{(t)}$ by minimizing $\lVert M^{(t)} - W^{(t)} H \rVert_F^2$.
Topic activation over time is quantified by

$$a_k^{(t)} = \sum_i W_{ik}^{(t)}.$$

The L1-normalized activations form pseudo-probability distributions for entropy-based novelty and resonance analysis, enabling the detection and interpretation of real-world event signals in media streams.
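A minimal sketch of the slice-specific fit and the activation computation, reusing the multiplicative update with $H$ held fixed. This is illustrative, not the authors' exact implementation:

```python
import numpy as np

def fit_slice_W(M_t, H, max_iters=200, tol=1e-4, seed=0):
    """With the global topic matrix H fixed, solve for the slice-specific
    W^(t) >= 0 minimizing ||M^(t) - W^(t) H||_F^2 (convex in W)."""
    rng = np.random.default_rng(seed)
    W = rng.random((M_t.shape[0], H.shape[0])) + 1e-3
    eps = 1e-12
    loss_old = np.inf
    for _ in range(max_iters):
        W *= (M_t @ H.T) / (W @ H @ H.T + eps)
        loss = np.linalg.norm(M_t - W @ H) ** 2
        if np.isfinite(loss_old) and abs(loss_old - loss) / loss_old < tol:
            break
        loss_old = loss
    return W

def topic_activations(W_t):
    """Sum document-topic weights within the slice and L1-normalize,
    yielding a pseudo-probability distribution over the K topics."""
    a = W_t.sum(axis=0)
    return a / a.sum()

# Toy check: an exactly factorizable slice with 3 global topics.
rng = np.random.default_rng(2)
H = rng.random((3, 12))
M_t = rng.random((8, 3)) @ H
W_t = fit_slice_W(M_t, H)
p = topic_activations(W_t)
```

Fixing $H$ keeps the topics comparable across slices, so the sequence of distributions `p` can be compared directly from one time slice to the next.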
4. Experimental Workflow and Hyperparameters
The experimental pipeline includes:
- Data collection: Scraping five Chinese-language diaspora news sites every six hours between late April and mid-June 2024.
- Preprocessing: Article body extraction, tokenization using jieba, stopword removal.
- Embedding: Both documents and candidate words are embedded with paraphrase-multilingual-MiniLM-L12-v2 (sequence truncation: 128 tokens).
- Modelling parameters:
- Number of nearest keywords per document ($N$).
- Topics per site ($K$, fitted individually).
- Window size for novelty/resonance: 12 time-points (≈3 days).
- Smoothing span: 56.
- NMF solver: max 300 iterations, convergence tolerance $\epsilon$.
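The windowed novelty/resonance computation that uses the 12-point window above can be sketched as follows. The use of Kullback–Leibler divergence and these exact windowing conventions are assumptions in the style of standard entropy-based novelty/resonance measures, not details confirmed by the source:

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discrete distributions."""
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def novelty_resonance(P, w=12):
    """Windowed novelty/resonance over per-slice topic distributions
    P (T x K). Novelty at t averages the divergence from the previous
    w slices; transience averages it over the next w; resonance is
    their difference. Endpoints without a full window stay NaN."""
    T = len(P)
    novelty = np.full(T, np.nan)
    transience = np.full(T, np.nan)
    for t in range(w, T - w):
        novelty[t] = np.mean([kld(P[t], P[t - j]) for j in range(1, w + 1)])
        transience[t] = np.mean([kld(P[t], P[t + j]) for j in range(1, w + 1)])
    return novelty, novelty - transience

# A constant topic distribution should yield zero novelty and resonance.
P = np.tile(np.array([0.5, 0.5]), (30, 1))
nov, res = novelty_resonance(P, w=5)
```

A novelty spike with positive resonance then indicates new topical content that persists, which is the signature used in the case study below to flag real-world events.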
5. Comparative Evaluation and Performance Metrics
KeyNMF was benchmarked against S³ (Kardos et al. 2024), Top2Vec, BERTopic, two Contextualized Topic Models (CTM), classical NMF, and LDA using the following metrics:
- Diversity ($d$): proportion of unique words across topics.
- Internal coherence ($C_{in}$): average pairwise cosine similarity between topic words in embedding space.
- External coherence ($C_{ex}$): consistency measured with paraphrase-multilingual MiniLM embeddings.
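Minimal sketches of the diversity and embedding-based coherence metrics; the paper's exact definitions (e.g., how many top words per topic enter the computation) may differ:

```python
import numpy as np

def diversity(topics):
    """Proportion of unique words among the top words of all topics.
    `topics` is a list of lists of top words, one list per topic."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

def embedding_coherence(topics, word_embs):
    """Average pairwise cosine similarity between a topic's top words,
    averaged over topics; word_embs maps word -> embedding vector."""
    scores = []
    for topic in topics:
        vecs = np.asarray([word_embs[w] for w in topic], dtype=float)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = vecs @ vecs.T
        iu = np.triu_indices(len(topic), k=1)   # upper-triangular pairs
        scores.append(sims[iu].mean())
    return float(np.mean(scores))
```

With transformer word embeddings plugged in for `word_embs`, the same function serves as internal or external coherence depending on whose embedding space is used.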
On all five Chinese news corpora analyzed, KeyNMF outperformed classical NMF and LDA, and was competitive with state-of-the-art contextual models. An example from the Chinanews corpus summarizes typical results:
| Model | d | C_in | C_ex |
|---|---|---|---|
| KeyNMF | 0.93 | 0.29 | 0.63 |
| Top2Vec | 0.78 | 0.14 | 0.71 |
| BERTopic | 0.91 | 0.16 | 0.47 |
| NMF | 0.74 | 0.27 | 0.57 |
| LDA | 0.61 | 0.19 | 0.57 |
KeyNMF achieves the strongest balance between diversity and coherence, with particularly strong external coherence, indicating robust alignment between the learned topic–term matrix and the transformer embedding space.
6. Empirical Insights: 2024 European Parliament Election Case Study
Dynamic KeyNMF revealed interpretable information flow and persistence around major political events. The analysis of the news corpora showed that:
- Xi Jinping’s European tour (May 5–10) induced spikes in novelty and resonance, with surge topics such as “Paris / state visit” and “President / Xi Jinping.”
- Putin’s state visit to China (May 16–17) produced pronounced novelty and resonance peaks, tied to “China News Service” and “Russia / Ukraine / Putin.”
- EU parliamentary elections (June 6–9) showed increased novelty and resonance before and after the election, with dominant topics varying by site (e.g., “EU Parliament,” “Spanish PM,” “UK elections,” “Europe overview”).
These findings demonstrate that the joint use of KeyNMF and novelty/resonance analysis detects significant information flows and relates them to concrete topics and real-world events.
7. Limitations and Prospects for Extension
Several limitations of KeyNMF were noted:
- Contextual embeddings are truncated to 128 tokens; this may omit subtleties in longer documents.
- The dynamic extension maintains a fixed global ; future work could allow both and to evolve smoothly (e.g., with temporal regularization).
- Absence of an explicit probabilistic interpretation; a possible direction is to develop a Bayesian NMF variant.
- Topic interpretability and causal modeling would benefit from richer metadata (e.g., author, location) and more extensive qualitative analysis.
- Further research is needed to link detected topical dynamics to persuasive framing and influence operations.
In summary, KeyNMF constitutes a robust and extensible framework for transformer-aware, non-negative topic modelling and dynamic information flow analysis, as demonstrated in the large-scale study of Chinese diaspora media during sensitive political periods (Kristensen-McLachlan et al., 2024).