HTKGH-Polecat: Temporal Geopolitical Forecasting Dataset

Updated 8 January 2026
  • The htkgh-polecat dataset is a benchmark corpus using a hyper-relational temporal knowledge graph that models multi-actor geopolitical events and complex interactions.
  • It employs a robust extraction pipeline built on models such as DistilBERT and RoBERTa, achieving high F1 scores and 84% top-1 accuracy for entity linking.
  • Empirical results show chain-of-thought LLMs and GNNs gain significant improvements in forecasting accuracy, emphasizing the value of enriched contextual event filtering.

The htkgh-polecat dataset is a benchmark corpus for forecasting and reasoning over complex geopolitical events, introducing a hyper-relational temporal knowledge generalized hypergraph (HTKGH) structure. Built atop the POLECAT event database, it provides native support for multi-actor, multi-recipient, and richly qualified event facts, aiming to overcome expressive limitations of prior temporal knowledge graph (TKG) formalisms in geopolitical event forecasting (Ahrabian et al., 1 Jan 2026, Halterman et al., 2023).

1. Formal Foundations of HTKGHs

The HTKGH formalism generalizes traditional TKGs, HTKGs, and hypergraphs to allow unlimited sets of actors and recipients in event records, capturing group- and set-to-set-type interactions. Its structure is as follows:

  • Entities: $\mathcal{E}$, a finite set, e.g., countries or country-sector pairs.
  • Relations: $\mathcal{R}$, composite types derived from a merged event ontology (“event type” $\times$ “event mode”; 42 types total).
  • Timestamps: $\mathcal{T}$.
  • Qualifiers: $Q \subseteq \mathcal{R} \times \mathcal{E}$, secondary context as key-value pairs (e.g., location, context code).

An HTKGH record is a tuple $(\Lambda_{\text{actors}}, r, \Lambda_{\text{recipients}}, t, Q)$, where:

  • $\Lambda_{\text{actors}} \subseteq \mathcal{E}$: set of one or more actors.
  • $r \in \mathcal{R}$: main event relation.
  • $\Lambda_{\text{recipients}} \subseteq \mathcal{E}$: set of recipients (possibly empty).
  • $t \in \mathcal{T}$: timestamp.
  • $Q$: set of contextual qualifiers.

Special subtypes include:

  • Group-Type: $\Lambda_{\text{recipients}} = \emptyset$, $|\Lambda_{\text{actors}}| > 2$ (e.g., multi-country summits).
  • Set2Set-Type: $|\Lambda_{\text{actors}}| \geq 1$, $|\Lambda_{\text{recipients}}| \geq 1$ (e.g., coalitions acting together on targets).
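As a concrete illustration, the record tuple and the subtype rules above can be sketched in Python (class and field names are our own, not part of the dataset's released API; the relation label in the example is illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HTKGHFact:
    """One hyper-relational fact (Λ_actors, r, Λ_recipients, t, Q)."""
    actors: frozenset      # Λ_actors ⊆ E, non-empty
    relation: str          # r ∈ R, e.g. "sanction(embargo)"
    recipients: frozenset  # Λ_recipients ⊆ E, possibly empty
    timestamp: str         # t ∈ T, e.g. "2022-03-15"
    qualifiers: frozenset = frozenset()  # Q: (key, value) pairs

    def subtype(self) -> str:
        """Classify per the special subtypes defined above."""
        if not self.recipients and len(self.actors) > 2:
            return "group"
        if self.actors and self.recipients:
            return "set2set"
        return "standard"
```

A three-country summit with no recipients is classified as group-type; a sanction from one actor set onto another is set2set-type.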

This design natively models events with complex n-ary relations, avoiding lossy decompositions into binary or pairwise facts and reducing data sparsity (Ahrabian et al., 1 Jan 2026).

2. Data Construction and Extraction Pipeline

htkgh-polecat is derived from approximately 2.23 million POLECAT event records (2018–2024). Rigorous preprocessing filters out degenerate or insufficiently populated events (e.g., zero actors or one actor with no recipients), resulting in about 556,000 hyper-relational facts. Entities are constructed at the country-sector level (e.g., “Canada (GOV)”), with 5,268 unique entities across 199 countries.
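The degeneracy filter can be expressed as a simple predicate (a paraphrase of the stated rule, not the authors' released preprocessing code):

```python
def keep_event(actors, recipients):
    """Keep an event only if it can form a valid hyper-relational fact.

    Drops the degenerate cases named above: zero actors, or a single
    actor with no recipients.
    """
    if len(actors) == 0:
        return False
    if len(actors) == 1 and len(recipients) == 0:
        return False
    return True
```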

Relations are defined by merging POLECAT’s “event type” and “mode” from the PLOVER ontology, producing 42 fine-grained event types such as $\texttt{retreat(ceasefire)}$ and $\texttt{sanction(embargo)}$. Qualifiers are extracted from “location” (country and subnational codes) and “context” (37 thematic codes), with a mean of 1.37 qualifiers per fact.
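A sketch of how the merged relation label and the qualifier pairs might be assembled (field names here are illustrative, not the raw POLECAT column names):

```python
def relation_label(event_type, mode=None):
    """Merge PLOVER event type and mode into one label, e.g. sanction(embargo)."""
    return f"{event_type}({mode})" if mode else event_type


def extract_qualifiers(record):
    """Collect (key, value) qualifier pairs from location and context fields."""
    qualifiers = set()
    for key in ("location", "context"):
        for value in record.get(key, ()):
            qualifiers.add((key, value))
    return qualifiers
```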

The POLECAT extraction pipeline employs:

  • Document-level event classification (DistilBERT fine-tuned for each event type).
  • Binary mode and context classification (linear SVMs on TF-IDF features).
  • QA-style attribute extraction for actors, recipients, location, and date (RoBERTa fine-tuned on SQuAD2.0 and bespoke data). Assignment of text spans uses the Hungarian algorithm.
  • Entity linking to Wikipedia via sentence-embedding-based neural re-ranking (Sentence-BERT) and fuzzy matching. Attribute extraction yields F1 scores of approximately 0.86 for actors, 0.73 for recipients, and 0.91 for locations; Wikipedia entity resolution achieves top-1 accuracy of ~84% (Halterman et al., 2023).
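The Hungarian-algorithm step assigns extracted text spans to event roles so that the total extractor score is maximized. A brute-force stand-in for small candidate sets (at scale one would use the Hungarian algorithm proper, e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations


def assign_spans(scores):
    """Pick the one-to-one span-to-role assignment maximizing total score.

    scores[i][j] is the extractor's score for giving span i to role j.
    Brute force over permutations; the POLECAT pipeline solves the same
    optimization with the Hungarian algorithm.
    """
    n_spans, n_roles = len(scores), len(scores[0])
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n_spans), n_roles):
        total = sum(scores[i][j] for j, i in enumerate(perm))
        if total > best:
            best, best_perm = total, perm
    # best_perm[j] = index of the span assigned to role j
    return list(best_perm)
```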

3. Dataset Schema, Statistics, and Scope

The dataset’s atomic unit is the HTKGH tuple described above, capturing the full set of primary participants and secondary event context:

  • actors: non-empty set of actor entities (e.g., {US, UK}).
  • relation: event type and mode label (e.g., $\texttt{sanction(embargo)}$).
  • recipients: set of recipient entities, possibly empty (e.g., {Russia, Belarus}).
  • timestamp: event date at day granularity (e.g., 2022-03-15).
  • qualifiers: contextual key-value pairs such as location, context, and cause (e.g., {(cause, RussoUkrWar), (location, Kyiv)}).

Key statistics include:

  • $|\mathcal{E}|$ = 5,268 entities.
  • $|\mathcal{R}|$ = 42 relations.
  • $|\mathcal{W}|$ ≈ 556,000 facts.
  • Average qualifiers per fact: 1.37.
  • 23.6% of facts are group-type or set2set-type.
  • Temporal range: January 2018–July 2024 (uniform annual coverage except 2024, truncated by ~46% due to cutoff).
  • Evaluation splits: train ([2018, 2022]), validation (10% of train), test ($\sim$1% stratified sample, $\sim$5,500 facts, all query timestamps $t_q$ later than any training timestamp) (Ahrabian et al., 1 Jan 2026).
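The strict temporal split can be sketched as follows (the 2022-12-31 cutoff is inferred from the stated [2018, 2022] training range and is our assumption):

```python
from datetime import date


def temporal_split(facts, train_end=date(2022, 12, 31)):
    """Partition facts so every test timestamp strictly follows training.

    Each fact is assumed to carry a datetime.date under "timestamp";
    the default cutoff is an assumption based on the stated train range.
    """
    train = [f for f in facts if f["timestamp"] <= train_end]
    test = [f for f in facts if f["timestamp"] > train_end]
    return train, test
```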

4. Benchmarking Tasks, Evaluation Protocols, and Historical Context Retrieval

The primary task is relation prediction under temporal forecasting constraints:

  • Given $(\Lambda_{\text{actors}}, ?, \Lambda_{\text{recipients}}, t_q, Q)$, predict $r$ for $t_q > t_{\text{train}}$.
  • Link prediction tasks (actors or recipients) are analogous.

Context retrieval for each test query involves assembling up to the $h = 100$ most recent past events, filtered by overlapping actors/recipients, location, and context qualifiers. This supports grounding and contextualization for temporal reasoning models.
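A minimal sketch of this retrieval rule (dictionary field names are illustrative, and the paper's exact filtering logic may differ):

```python
def retrieve_context(query, history, h=100):
    """Return up to h most recent strictly-past facts that share an
    entity or a qualifier (location/context) with the query."""
    q_entities = query["actors"] | query["recipients"]
    q_quals = set(query["qualifiers"])

    def relevant(fact):
        if fact["timestamp"] >= query["timestamp"]:
            return False  # strictly past events only: no leakage
        shares_entity = bool((fact["actors"] | fact["recipients"]) & q_entities)
        shares_qual = bool(set(fact["qualifiers"]) & q_quals)
        return shares_entity or shares_qual

    matches = [f for f in history if relevant(f)]
    matches.sort(key=lambda f: f["timestamp"], reverse=True)  # most recent first
    return matches[:h]
```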

Strict temporal separation is enforced to prevent leakage; anonymization via entity/relation shuffling is recommended to assess model robustness and minimize effects of pretraining memorization (Ahrabian et al., 1 Jan 2026).
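Entity shuffling can be sketched as a seeded permutation over the entity vocabulary (a minimal illustration of the recommended anonymization, not the authors' released code):

```python
import random


def shuffle_entities(facts, seed=0):
    """Consistently remap entity names via a random permutation.

    Preserves the relational structure while breaking any link to
    world knowledge a pretrained model may have memorized.
    """
    rng = random.Random(seed)
    entities = sorted({e for f in facts for e in f["actors"] | f["recipients"]})
    targets = entities[:]
    rng.shuffle(targets)
    mapping = dict(zip(entities, targets))
    shuffled = [
        {**f,
         "actors": {mapping[e] for e in f["actors"]},
         "recipients": {mapping[e] for e in f["recipients"]}}
        for f in facts
    ]
    return shuffled, mapping
```

Relation shuffling is analogous, permuting the 42 relation labels instead of the entity set.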

5. Experimental Results: Baselines and Model Comparisons

Experiments benchmark LLMs (13 models, 0.3B–22B parameters), supervised GNNs, and heuristics (frequency, recency, copy). Key findings:

  • Instruction-tuned (“non-thinking”) LLMs modestly outperform heuristics; chain-of-thought (“thinking”) LLMs achieve up to a +6% absolute improvement, especially with high-quality contextual filtering.
  • LLM accuracy is robust to entity shuffling (changes under $\pm$3%), but relation shuffling affects performance more strongly (variation up to $\pm$10%).
  • Enlarging the historical context window reliably benefits LLMs, up to a saturation point around 100 facts.
  • Scale generally helps LLMs beyond 1B parameters, but alignment quality sometimes trumps size (e.g., Qwen3-4B-Instruct outperforming larger 8B models).
  • GNNs are mildly superior under weak context filtering, but LLMs exceed GNNs by up to +21% with stricter, more relevant filters (Ahrabian et al., 1 Jan 2026).

6. Recommendations, Best Practices, and Research Implications

For reproducible and robust research, the following recommendations apply:

  • High-quality, multi-faceted context retrieval (entity, location, context) is critical for LLM success.
  • Anonymization (entity/relation shuffling) should be employed for fair benchmarking and leakage assessment.
  • Chain-of-thought models offer superior accuracy at the cost of computational latency and tokenization overhead.
  • Smaller, well-aligned LLMs (e.g., Qwen3-4B-Instruct-2507) are advantageous for resource-constrained settings.
  • Events should retain all involved entities in hyper-relational form; decomposing to pairwise facts induces unnecessary redundancy and sparsity.
  • Evaluation protocols must strictly separate training and test periods and report on anonymized variants to ensure generalization (Ahrabian et al., 1 Jan 2026).

The htkgh-polecat dataset provides a challenging benchmark for temporal event forecasting and collective reasoning, illuminating the strengths and current limitations of both LLMs and graph neural architectures under complex, open-world event representations.
