Selecting Cut-off Thresholds for Topic Prevalences in Topic Modeling

Determine principled and generalizable cut-off probability thresholds for topic prevalences (expected topic proportions) produced by topic modeling algorithms, so that empirical analyses have a clear criterion for when a topic's prevalence is high enough to warrant substantive discussion and interpretation.

Background

The paper emphasizes that many analyses of topic model outputs rely on absolute topic prevalence values and informal thresholds when interpreting results. However, there is no agreed-upon standard for what probability levels should qualify a topic as sufficiently prevalent to warrant substantive interpretation.

This ambiguity motivates focusing on relative changes over time (e.g., via non-stationarity analysis), but it also highlights the need for principled criteria for absolute cut-offs. Establishing such thresholds would improve comparability and rigor across studies that interpret topic prevalence values.
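
To make the issue concrete, the following minimal sketch shows how the set of topics that would be discussed depends entirely on the chosen cut-off. It is an illustration only, not a method from the paper: the document-topic matrix is toy data, the candidate cut-offs (0.01, 0.10, 0.25) are arbitrary, and the helper names `topic_prevalences` and `topics_above` are hypothetical. Topic prevalence is taken here to be the mean topic proportion across documents.

```python
def topic_prevalences(doc_topic):
    """Return the mean proportion of each topic across all documents.

    doc_topic is a list of rows, one per document; each row holds that
    document's topic proportions and is assumed to sum to 1.
    """
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    return [sum(row[k] for row in doc_topic) / n_docs for k in range(n_topics)]


def topics_above(prevalences, cutoff):
    """Return the indices of topics whose prevalence is at least `cutoff`."""
    return [k for k, p in enumerate(prevalences) if p >= cutoff]


# Toy data (assumption): 4 documents, 4 topics, each row sums to 1.
doc_topic = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.50, 0.30, 0.15, 0.05],
    [0.30, 0.50, 0.15, 0.05],
]

prevalences = topic_prevalences(doc_topic)

# Arbitrary candidate cut-offs (assumption): each one selects a
# different set of topics as "prevalent enough" to interpret.
for cutoff in (0.01, 0.10, 0.25):
    print(f"cutoff={cutoff:.2f} -> topics {topics_above(prevalences, cutoff)}")
```

With these toy values, the lowest cut-off keeps all four topics, the middle one drops the rarest topic, and the highest keeps only the two dominant topics. Nothing in the data itself indicates which of these choices is appropriate, which is the gap that a principled threshold would fill.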

References

In topic modeling, absolute values are often the focus of discussion due to the open question of which cut-off values should be considered appropriate when discussing topic prevalences.

Quantitative Tools for Time Series Analysis in Natural Language Processing: A Practitioners Guide (2404.18499 - Schmal, 2024) in Subsection “Non-stationarity in topic modeling” (Section: Stationarity)