
Coreference Resolution & Clustering

Updated 4 February 2026
  • Coreference Resolution and Clustering is the process of detecting text spans that refer to the same entity or event and grouping them into coherent clusters.
  • Techniques include mention-pair, span-based neural, and graph-based models that optimize clustering through both local and global feature integration.
  • Recent advances focus on incorporating entity-centric features, efficient memory management, and incremental clustering to enhance performance in complex NLP tasks.

Coreference resolution and clustering is the joint task of (a) identifying spans in text that are mentions of entities or events, and (b) grouping these mentions into clusters such that each cluster corresponds to all and only the textual references to a specific entity or event. This problem is fundamental to natural language understanding, entity linking, event extraction, question answering, and discourse analysis. Advanced research frames it as joint detection and clustering, constrained by computational, statistical, and sometimes memory-efficiency considerations.

1. Formal Frameworks for Coreference Resolution

Early and contemporary research distinguishes several paradigms for resolving coreference and forming clusters:

  • Mention-Pair Models: Classify each mention pair as coreferent/non-coreferent independently. The transitive closure of positive predictions yields clusters. These models are straightforward but linguistically naïve, lacking explicit transitivity and cluster-level reasoning (Rahman et al., 2014).
  • Mention-Ranking and Cluster-Ranking: Instead of independent pairwise links, mention-ranking selects a single antecedent for each mention (possibly ε for a new entity), while cluster-ranking assigns each mention to a full preceding cluster, ranking potential merges via a scoring function $s(c_j, m_k) = \mathbf{w}^\top \phi(c_j, m_k)$ (Rahman et al., 2014). This allows the use of cluster-level features and explicit modeling of anaphoricity.
  • End-to-End Span-Based Neural Models: These models score all text spans (up to length $L$) for mentionhood and antecedent compatibility, optimizing over possible partitions via marginal likelihood of correct antecedents, typically with aggressive pruning to ensure tractability (Lee et al., 2017). Variants include explicit joint modeling of mention detection and antecedent selection (Zhang et al., 2018).
  • Cluster-Merge and Learning-to-Search Paradigms: Distributed cluster representations are learned in parallel with pairwise encoders, enabling the system to greedily merge clusters based on downstream clustering quality (e.g., B³ F1) via learning-to-search objectives (Clark et al., 2016).
  • Higher-Order and Graph-Based Models: Graph neural networks are used to propagate cluster-level entity-centric features among mentions, refining span embeddings via message passing and enforcing consistent global clustering via higher-order (e.g., second-order) arborescence decoding (Liu et al., 2020).
  • Segmentation and Set-Based Decoding: Innovations include treating coreference as a segmentation problem, predicting all mentions coreferent with a single span via a position embedding matrix and 1D convolutional networks, thereby modeling long-range dependencies in a single pass (Jafari et al., 2020).
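The mention-pair paradigm can be made concrete in a few lines: pairwise decisions are taken independently, and clusters emerge by transitive closure (here via union-find). The `same_head` rule below is a toy stand-in for a learned pairwise classifier, chosen deliberately so the example also exhibits the paradigm's well-known weakness: "Michelle Obama" is conflated with "Barack Obama" because no cluster-level reasoning constrains the links.

```python
def transitive_closure_clusters(mentions, is_coreferent):
    """Group mentions into clusters by union-find over positive pairwise links."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Classify every pair independently; link the positives.
    for j in range(len(mentions)):
        for i in range(j):
            if is_coreferent(mentions[i], mentions[j]):
                union(i, j)

    clusters = {}
    for i, m in enumerate(mentions):
        clusters.setdefault(find(i), []).append(m)
    return list(clusters.values())

# Toy pairwise rule: mentions corefer if they share a head word (illustrative only).
mentions = ["Barack Obama", "Obama", "the president", "Michelle Obama"]
same_head = lambda a, b: a.split()[-1].lower() == b.split()[-1].lower()
print(transitive_closure_clusters(mentions, same_head))
# → [['Barack Obama', 'Obama', 'Michelle Obama'], ['the president']]
```

The over-merged first cluster is exactly the failure mode that motivates cluster-ranking and entity-centric features.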

2. Neural Architectures and Clustering Algorithms

Table: Major Model Paradigms and Their Clustering Approaches

| Model Type | Mention Representation | Clustering/Decoding |
|---|---|---|
| End-to-End Span Model (Lee et al., 2017) | BiLSTM + attention span vector | Argmax antecedent + union-find |
| Triad Network (Meng et al., 2018) | BiLSTM + mutual-attention triad | Agglomerative average-link |
| GNN Coref (Liu et al., 2020) | GNN-refined span vectors | Second-order arborescence DP |
| Segmentation Model (Jafari et al., 2020) | BERT + CNN over positions | Matrix threshold + iterative |
| Memory-Efficient Incremental (Xia et al., 2020; Luo et al., 31 Dec 2025) | Transformer; cluster slots | Incremental, dual-threshold, SAES/IRP |
| RL Actor–Critic (Wang et al., 2022) | BERT span, LSTM policy | Policy rollout, action chain |

Neural coreference systems universally build high-dimensional span representations from contextual encoders (e.g., BERT, SpanBERT, BiLSTM, XLNet). Clustering is realized by:

  • Greedy or best-first linking via antecedent scores and transitive closure (Lee et al., 2017, Zhang et al., 2018).
  • Agglomerative clustering over affinity matrices from pairwise or triad models, with thresholds tuned on validation (Meng et al., 2018, Kenyon-Dean et al., 2018).
  • Online/incremental cluster updates, employing fixed or adaptive memory with explicit eviction/condensation policies to restrict resource consumption (Xia et al., 2020, Luo et al., 31 Dec 2025).
  • Structured-linking over mention and candidate-entity graphs, enforcing consistency as arborescences in joint entity linking + coref tasks (Zaporojets et al., 2021).
  • Higher-order decoding via projective or non-projective second-order DP, incorporating cluster context through explicit scoring of arc pairs (Liu et al., 2020).
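The first of these strategies, greedy antecedent linking followed by transitive closure, can be sketched compactly. The decoder below assumes a scorer has already produced, for each mention k, a score over ε (start a new entity) and all preceding mentions; the score values are illustrative stand-ins, not learned outputs.

```python
def decode_antecedents(scores):
    """Greedy mention-ranking decoding: mention k links to its highest-scoring
    antecedent, where index 0 stands for epsilon (start a new entity).
    scores[k] has length k + 1: [s(eps), s(m_0), ..., s(m_{k-1})]."""
    antecedent = []
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        antecedent.append(None if best == 0 else best - 1)

    def root(k):  # follow antecedent links back to the cluster's first mention
        return k if antecedent[k] is None else root(antecedent[k])

    clusters = {}
    for k in range(len(scores)):
        clusters.setdefault(root(k), []).append(k)
    return list(clusters.values())

# Illustrative scores for four mentions.
scores = [
    [0.0],                 # m0: only epsilon -> new entity
    [0.1, 0.9],            # m1: prefers antecedent m0
    [0.5, 0.2, 0.1],       # m2: epsilon wins -> new entity
    [0.0, 0.3, 0.1, 0.6],  # m3: prefers antecedent m2
]
print(decode_antecedents(scores))  # → [[0, 1], [2, 3]]
```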

3. Cluster Consistency, Entity-Centric Features, and Higher-Order Methods

Recent research has systematically explored methods for incorporating cluster-level or “entity-centric” features, which are essential for reasoning over properties that are only evident in aggregate (e.g., gender agreement, semantic type, argument patterns):

  • Explicit Cluster Representations and Feature Sharing: Mention cluster-pair encoders aggregate the representations of all mention pairs between two clusters via pooling, feeding them to a scoring function $s_c(c_i, c_j)$ for cluster merging (Clark et al., 2016). GNN-based models propagate entity-type or cluster-level features through message passing, dynamically refining mention representations based on the structure of the predicted (or pruned) mention graph (Liu et al., 2020).
  • Higher-Order Decoding: Span-level refinement strategies, such as attended-antecedent or entity-equalization, as well as explicit span clustering and cluster merging, enable the model to capture global contextual signals (Xu et al., 2020). However, empirical evidence shows that with strong base encoders (e.g., SpanBERT), higher-order inference (HOI) modules may offer at most marginal gains in average F1 over purely local models.
  • Clustering-Oriented Regularization: Explicit loss terms encourage same-cluster mention embeddings to reside in close proximity, and different-cluster ones to repel, so that learned span representations are cluster-friendly for subsequent agglomerative clustering (Kenyon-Dean et al., 2018).
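The clustering-oriented regularization idea can be sketched as an attract/repel penalty over span embeddings. This is a simplified stand-in for the loss of Kenyon-Dean et al. (2018), not their exact formulation; `margin` is an assumed hyperparameter.

```python
import numpy as np

def cluster_regularizer(embeddings, labels, margin=1.0):
    """Pull same-cluster embeddings together (squared distance) and push
    different-cluster embeddings apart (squared hinge on a margin)."""
    attract, repel, n_pos, n_neg = 0.0, 0.0, 0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            if labels[i] == labels[j]:
                attract += d ** 2
                n_pos += 1
            else:
                repel += max(0.0, margin - d) ** 2
                n_neg += 1
    return attract / max(n_pos, 1) + repel / max(n_neg, 1)

# Well-separated clusters incur a much lower penalty than intermingled ones.
labels = [0, 0, 1, 1]
tight = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
mixed = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]])
print(cluster_regularizer(tight, labels) < cluster_regularizer(mixed, labels))  # → True
```

Minimizing such a term alongside the main objective makes the learned span space directly amenable to downstream agglomerative clustering.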

4. Memory Efficiency, Incrementality, and Long-Document Coreference

Scalability and efficiency have led to hybrid and incremental solutions for document- and cluster-wide coreference:

  • Constant-Memory and Dual-Threshold Models: Memory-efficient architectures bound the number of active clusters and mentions, maintaining a fixed-size “cache” of cluster representations. Dual-threshold mechanisms strictly limit the number of clusters (τ₁) and size per cluster (τ₂), augmented by statistics-aware eviction and semantic condensation policies (SAES and IRP), enabling highly competitive results even under strict memory constraints (Luo et al., 31 Dec 2025, Xia et al., 2020).
  • Online and Incremental Clustering Strategies: Incremental systems build clusters as new mentions are encountered. Cluster representations are dynamically updated, sometimes interpolating new mention vectors with the existing cluster centroid via a learnable gate (Xia et al., 2020). Some models process one sentence at a time, using shift–reduce parsing frameworks and maintaining only a limited cache of encoder states for efficiency and cognitive plausibility (Grenander et al., 2023).
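A minimal sketch of this incremental setting, under an assumed cosine-similarity linker: mentions arrive one at a time, cluster representations are updated by gated interpolation, and a fixed-size cache evicts the least recently updated cluster. The eviction policy here is a deliberately crude stand-in for statistics-aware schemes such as SAES/IRP, and all thresholds are illustrative.

```python
import numpy as np

class IncrementalClusterer:
    def __init__(self, link_threshold=0.5, max_clusters=2, alpha=0.5):
        self.link_threshold = link_threshold  # min cosine similarity to join a cluster
        self.max_clusters = max_clusters      # fixed-size cluster cache
        self.alpha = alpha                    # gate interpolating old rep vs. new mention
        self.reps = []                        # cluster representations, most recent last
        self.members = []                     # mention indices per cluster
        self.n_seen = 0

    def add(self, vec):
        vec = vec / np.linalg.norm(vec)
        idx = self.n_seen
        self.n_seen += 1
        if self.reps:
            sims = [float(vec @ r) / np.linalg.norm(r) for r in self.reps]
            best = int(np.argmax(sims))
            if sims[best] >= self.link_threshold:
                # gated interpolation of the cluster representation
                self.reps[best] = self.alpha * self.reps[best] + (1 - self.alpha) * vec
                self.members[best].append(idx)
                # move-to-end: the end of the cache is the most recently updated
                self.reps.append(self.reps.pop(best))
                self.members.append(self.members.pop(best))
                return
        self.reps.append(vec)
        self.members.append([idx])
        if len(self.reps) > self.max_clusters:  # evict the least recently updated
            self.reps.pop(0)
            self.members.pop(0)

ic = IncrementalClusterer(link_threshold=0.9, max_clusters=2)
for v in [np.array([1.0, 0.0]), np.array([0.95, 0.05]),
          np.array([0.0, 1.0]), np.array([-1.0, 0.0])]:
    ic.add(v)
print(ic.members)  # cluster {0, 1} was evicted to stay within budget → [[2], [3]]
```

Memory stays constant in the number of active clusters regardless of document length, which is the core point of the constant-memory line of work.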

5. Advanced Clustering Strategies: Beyond Pairwise Linkage

Clustering in coreference is no longer limited to pairwise linkage or simple transitive closures. Instead:

  • Triad and Polyad Modeling: Neural networks are trained to compute affinity scores for triads (or higher-order “k-ads”) of mentions, jointly enforcing mutual constraints and improving transitivity, especially when used in combination with average-linkage, windowed hierarchical clustering (Meng et al., 2018).
  • Bayesian Nonparametric Models: Hierarchical distance-dependent Bayesian models (HDDCRP) use learnable pairwise distances as priors for cluster formation, decoupling feature-driven affinity from the nonparametric process, and enabling hierarchical joint within- and cross-document clustering (Yang et al., 2015).
  • Global Arborescence Constraints: In document-level entity linking, coreference and linking are formulated jointly as constrained tree-structured predictions, with global normalization via matrix-tree theorem, unifying clustering and linking (Zaporojets et al., 2021).
  • Discourse- and Semantics-Aware Event Coreference: For cross-document event coreference, heterogeneous graphs encode discourse (RST) structure and cross-document lexical chains, and event similarity is learned through Graph Attention Networks, followed by thresholded agglomerative clustering (Gao et al., 2024).
  • Iterative and Second-Order Clustering: Alternating within- and cross-document merges, using distinct classifiers and enriching clusters by argument propagation and second-order merging (dependency or context similarity), have shown gains, especially for event coreference where event-argument relations are only partially lexically expressed (Choubey et al., 2017).
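The thresholded average-link agglomeration that several of the approaches above rely on can be sketched as greedy merging over a pairwise affinity matrix; the matrix below is illustrative rather than output of any particular model.

```python
def average_link_clusters(affinity, threshold):
    """Greedily merge the pair of clusters with the highest average pairwise
    affinity until no pair of clusters exceeds the threshold."""
    clusters = [[i] for i in range(len(affinity))]
    while True:
        best_score, best_pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(affinity[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if avg > best_score:
                    best_score, best_pair = avg, (a, b)
        if best_pair is None:
            return clusters
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]

# Illustrative symmetric affinities for four mentions.
affinity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(average_link_clusters(affinity, threshold=0.5))  # → [[0, 1], [2, 3]]
```

Because merge scores average over all cross-cluster pairs, a single spuriously high pairwise score is less likely to force a bad merge than under single-link (transitive-closure) clustering.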

6. Evaluation, Error Analysis, and the Impact of Clustering Choices

Clustering is evaluated using the standard metrics (MUC, B³, CEAF (Φ4), and the CoNLL average F1), often with ablations quantifying the role of model choices:

  • Cluster-level approaches (entity/cluster ranking, GNN, triad models) consistently yield higher CEAF (Φ4), reflecting improved global coherence (Clark et al., 2016, Meng et al., 2018).
  • End-to-end, span-based models excel in recall due to joint mention detection, with empirical F1 gains over pipelined and pairwise systems (Lee et al., 2017, Zhang et al., 2018).
  • Memory-efficient and incremental models demonstrate only negligible drops (≈0.3%–1% F1) when capping memory or processing adaptively, suggesting that entities seldom require very long “lifetimes” in GPU/TPU memory (Xia et al., 2020, Luo et al., 31 Dec 2025).
  • Higher-order inference, despite its theoretical appeal, yields at best marginal gains over strong span-encoder local models in English, with gains often within ±0.3 F1 (Xu et al., 2020).
  • Empirical error analyses indicate that cluster-level modeling most benefits pronoun resolution, long-distance anaphora, and the avoidance of over-merged or conflated clusters.
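As a concrete reference point for these comparisons, B³ scores each mention by the overlap between its gold and predicted clusters and averages over mentions. A minimal sketch, assuming gold and predicted partitions cover the same mention set:

```python
def b_cubed(gold, pred):
    """B-cubed precision/recall/F1 over two partitions of the same mentions."""
    def mention_to_cluster(clusters):
        return {m: frozenset(c) for c in clusters for m in c}
    g, p = mention_to_cluster(gold), mention_to_cluster(pred)
    mentions = list(g)
    precision = sum(len(g[m] & p[m]) / len(p[m]) for m in mentions) / len(mentions)
    recall = sum(len(g[m] & p[m]) / len(g[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Splitting a gold cluster costs recall; over-merging costs precision.
p, r, f = b_cubed(gold=[[1, 2], [3]], pred=[[1], [2, 3]])
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.667 0.667
```

Because the score is mention-weighted, B³ penalizes over-merged clusters more heavily than MUC, which is why cluster-level models tend to separate from pairwise ones on B³ and CEAF rather than MUC.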

7. Open Problems and Directions

Limitations and open research directions include:

  • Long-Distance and Cross-Document Coreference: Robustness remains challenging in the presence of long-range dependencies or when arguments are distributed, motivating graph-based and discourse-aware models (Gao et al., 2024).
  • Integration of World Knowledge and Cross-Lingual Coreference: Extending models to incorporate real-world semantics, knowledge bases, and multilingual contexts is ongoing.
  • Memory Adaptivity and Dynamic Budgeting: Dual-threshold mechanisms require per-domain tuning; adaptive or reinforcement-learned resource allocation is a subject of ongoing investigation (Luo et al., 31 Dec 2025).
  • Incremental, Streaming, and Lifelong Settings: Adapting systems to streaming input, dynamic document collections, and lifelong learning is increasingly important as real-world applications (NLP at scale, in-the-wild deployments) become more common (Grenander et al., 2023).

Coreference resolution and clustering now encompasses a diverse landscape—from ranking models and neural end-to-end span encoders, through graph and triad/polyad-based algorithms, to resource-bounded, incremental, and cluster-level representation-learning systems. These research directions are reflected in continually evolving architectures and evaluation regimes (Lee et al., 2017, Clark et al., 2016, Meng et al., 2018, Jafari et al., 2020, Luo et al., 31 Dec 2025, Liu et al., 2020, Wang et al., 2022).
