
FactNet: Multilingual Knowledge Graph & FNAR

Updated 4 February 2026
  • FactNet is a multilingual resource that deterministically unifies over 1.7B atomic assertions and 3.01B evidence pointers from 316 Wikipedia editions.
  • It employs a rigorous three-stage pipeline—view extraction, canonicalization, and evidence matching—to ensure precise and reproducible fact grounding.
  • FactNet also denotes FNAR, a factor network autoregressive model that uses tensor PCA to enable scalable estimation in high-dimensional time series analysis.

FactNet denotes two distinct concepts in recent research: a billion-scale, multilingual factual grounding infrastructure for LLMs ("FactNet" in (Shen et al., 3 Feb 2026)) and a dimension-reducing factor network autoregressive model for time series with multilayer network structures ("FNAR" or "FactNet" in (Barigozzi et al., 2022)). The following exposition examines both, with primary focus on the knowledge graph resource due to its distinctive impact and scope.

1. Definition and Overview

FactNet (Shen et al., 3 Feb 2026) is an open-source, billion-scale knowledge graph specifically constructed to facilitate multilingual factual grounding, with an emphasis on high precision, reproducibility, and verifiable provenance. Unlike prior resources that are either structured databases lacking explicit textual evidence or text-grounded corpora of limited coverage, FactNet deterministically unifies 1.7 billion atomic Wikidata-sourced assertions ("FactStatements") with over 3.01 billion byte-precise pointers to auditable textual evidence ("FactSense") across 316 Wikipedia language editions. All facts are linked to exact character spans within Wikipedia articles, enabling fine-grained provenance and supporting trustworthy evaluation for knowledge-intensive systems.

Separately, the Factor Network Autoregression (FNAR) model (Barigozzi et al., 2022) is a methodological framework for regression analysis on multilayer network-indexed time series, where high-dimensional network data are summarized via tensor-based principal component analysis, enabling efficient estimation and hypothesis testing in macroeconomic network spillover contexts.

2. Deterministic Construction Pipeline

FactNet's construction is strictly deterministic, executed in three principal stages:

  1. View Extraction: For each of 316 Wikipedia language editions, raw Wikimedia dumps (Wikidata JSON, Wikipedia XML, SQL sitelink tables) are processed. The wikitext is parsed into three normalized "views": sentence (markup-stripped, linguistically segmented), template (infobox parameter extraction), and table (linearized to (table_id, row, column)). All text undergoes versioned normalization with Unicode NFC, whitespace collapsing, and standardization of directional marks. Each segmentation and normalization is logged with a language_pack_id.
  2. Statement Canonicalization: Wikidata statements, of the form f = (S, P, V, Q) for subject S, property P, typed value V, and qualifiers Q, are normalized by a versioned policy π. Values and qualifiers are canonicalized, and a stable aggregation key K(f; π) is defined by concatenation to guarantee grouping of logically identical statements. A content-derived synset_id, computed via SHA-1 hashing of the canonical key and build metadata, deterministically clusters statements into "FactSynset" equivalence classes.
  3. Evidence Matching & FactSense Extraction: For each statement and subject S with a Wikipedia sitelink in a given language, page resolution follows deterministic fallback procedures. Three matching strategies are prioritized: structural (infobox/table parameter match), wikilink resolution, and datatype-aware lexical matching in sentences. Evidence is deduplicated by match-type priority; each match is assigned a FactSense ID by hashing its canonical representation. Deterministic confidence scores are computed from base, datatype, resolution, ambiguity, and sanity checkers, clipped to [0.5, 0.95]. Evidence is encoded as a 7-tuple with byte-level offsets into the normalized view. The full pipeline supports complete auditing and exact regeneration.
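The deterministic core of stages 1 and 2 can be sketched as follows. This is a minimal illustration, not the released pipeline: the function names, the `"\x1f"` join separator, the policy/build identifiers, and the example Wikidata IDs are all assumptions; only the ingredients (NFC normalization, whitespace collapsing, a concatenated canonical key, and a SHA-1 content-derived synset_id) come from the description above.

```python
import hashlib
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Simplified versioned normalization: Unicode NFC, directional
    marks stripped, whitespace collapsed (a sketch of the real policy)."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200e", "").replace("\u200f", "")  # LRM/RLM marks
    return re.sub(r"\s+", " ", text).strip()

def aggregation_key(subject, prop, value, qualifiers, policy="pi-v1"):
    """Stable aggregation key K(f; pi): deterministic concatenation of
    the canonicalized statement parts under a named policy version."""
    parts = [policy, subject, prop, value] + sorted(qualifiers)
    return "\x1f".join(parts)  # unit separator avoids ambiguous joins

def synset_id(key: str, build_id: str = "build-000") -> str:
    """Content-derived synset_id: SHA-1 over the canonical key plus build
    metadata, so identical statements always cluster into one FactSynset."""
    return hashlib.sha1(f"{build_id}\x1f{key}".encode("utf-8")).hexdigest()

key = aggregation_key("Q937", "P69", "Q206702", ["P582=1900"])
print(synset_id(key))  # same inputs -> same 40-hex-char id, every build
```

Because every step is a pure function of its inputs plus a version label, rerunning the build from the same dumps regenerates byte-identical IDs, which is what makes the auditing and exact-regeneration guarantees possible.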

3. Knowledge Graph Structure

FactNet unifies four interlinked record classes:

  • FactStatement (1.70B records): Stores the canonicalized atomic assertions with subject, property, value, qualifiers, rank, references, sitelinks, and claim-hash. Serialization adheres to RFC 8785 canonical JSON, and all IDs are content-derived.
  • FactSense (3.01B records): Each links a FactStatement, language, and page to an evidence_pointer tuple, specifying the exact textual span, view-type, and match-type. Metadata includes confidence scores and parser versioning.
  • FactSynset (1.55B records): Represents merged equivalence classes of normalized facts, maintains canonical mentions per language, the list of member statements, and explicit merge reasons (e.g., time-precision relaxations, unit conversions).
  • RelationEdge (3.69B records): Encodes rule-derived relationships among synsets, including direct entity-joins, schema-based bounded-hop joins, and potential-conflict edges arising from functional/temporal constraint violations. Each edge references its derivation, evidence, and a compositional confidence metric.
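Content-derived IDs over RFC 8785 serialization can be approximated in a few lines. The sketch below is illustrative, not FactNet's implementation: `json.dumps` with sorted keys and compact separators matches full RFC 8785 (JSON Canonicalization Scheme) only for the string-valued records shown (JCS additionally prescribes exact number formatting), and the field names are hypothetical.

```python
import hashlib
import json

def canonical_json(record: dict) -> str:
    """Approximate RFC 8785 canonical JSON: lexicographically sorted keys,
    no insignificant whitespace, UTF-8 output. Exact for string values;
    full JCS also pins down number serialization."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)

def claim_hash(record: dict) -> str:
    """Content-derived ID: hash the canonical serialization, so the same
    logical record always yields the same identifier."""
    return hashlib.sha1(canonical_json(record).encode("utf-8")).hexdigest()

stmt = {"subject": "Q937", "property": "P27", "value": "Q39",
        "qualifiers": {"P580": "+1901-00-00"}, "rank": "normal"}
# Key order in the source dict is irrelevant: the ID is order-independent.
reordered = {"rank": "normal", "value": "Q39", "property": "P27",
             "qualifiers": {"P580": "+1901-00-00"}, "subject": "Q937"}
assert claim_hash(stmt) == claim_hash(reordered)
```

Canonical serialization is what lets two independently built records with the same content collide on the same claim-hash, rather than diverging on incidental key order or whitespace.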

4. Precision, Audit Protocols, and Evaluation

FactNet employs rigorous evaluation and stratified auditing protocols to ensure grounding reliability:

  • Grounding Precision: A stratified cluster sample (language tier × match-type, N = 4,200 instances) is manually audited, using the Horvitz–Thompson estimator for corpus-level precision. The overall design-weighted precision is p̂ = 0.921 (95% CI: [0.913, 0.929]). High-coverage languages reach precision 0.934, with long-tail languages maintaining 0.885.
  • Span Re-localization Stability: On a 1M-item test set, 99.63% of evidence pointers re-localize to the exact span, indicating high reproducibility; discrepancies are mainly due to idiosyncratic template and table structures.
  • Recall Lower Bound: In a targeted null-match audit, 24% of unreachable cases contain actual unrecognized evidence—suggesting further gains are possible with relaxed or neuro-symbolic matchers.
  • FactNet-Bench Suite: FactNet provides split-aware, synset-wise train/dev/test partitions for three supervised tasks with tight leakage prevention. The suite includes Knowledge Graph Completion (KGC), Multilingual KG Question Answering (MKQA), and Multilingual Fact Checking (MFC). Task-specific metrics and leakage control protocols are integral, exemplified by predicate masking and split-based evidence exclusion.
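The design-weighted precision estimate above can be illustrated with a small sketch. Under simple random sampling within each stratum, the Horvitz–Thompson estimator reduces to weighting each stratum's sample precision by its population share; the stratum sizes and labels below are made-up numbers, not the paper's audit data.

```python
def stratified_precision(strata):
    """Design-weighted corpus precision over strata (e.g. language tier x
    match-type). Each stratum carries its population size N_h and audited
    0/1 labels; the estimate weights per-stratum sample precision by
    N_h / N, i.e. the Horvitz-Thompson estimator under equal
    within-stratum inclusion probabilities."""
    total = sum(s["population"] for s in strata)
    estimate = 0.0
    for s in strata:
        p_h = sum(s["labels"]) / len(s["labels"])  # sample precision
        estimate += (s["population"] / total) * p_h
    return estimate

# Illustrative numbers only (not the paper's audit):
strata = [
    {"population": 900_000, "labels": [1] * 93 + [0] * 7},   # high-coverage
    {"population": 100_000, "labels": [1] * 88 + [0] * 12},  # long-tail
]
print(round(stratified_precision(strata), 3))  # 0.925
```

Weighting by population rather than sample size is the point: long-tail languages can be oversampled for tighter per-stratum estimates without biasing the corpus-level figure.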

5. Applications: Grounded NLP, LLM Training, and Verification

FactNet addresses critical needs in robust language technology and trustworthy AI:

  • Multilingual Retrieval-Augmented Generation (RAG): FactNet’s deterministic provenance enables LLMs to be trained and evaluated with access to over 1.7B facts, each with byte-level traceability to 3.01B evidence pointers. FactSense spans serve as explicit targets for truthfulness supervision.
  • Automated Fact Checking Pipelines: Triaging and confirmation of claims are facilitated by retrieval of supporting or potentially conflicting evidence across 316 languages. POTENTIAL_CONFLICT edges (precision 0.742) assist in automated inconsistency detection.
  • Knowledge Graph Completion and QA: FactNet-Bench’s KGC splits support evaluation with leakage-resistant masking. In QA, grammar-guided constrained decoding with FactNet data increases answer validity (from 88.5% to 95.2%) and Macro F1 (+3.2 points); LLM-based approaches achieve up to 41.4 Macro F1.
  • Span-level Veracity and Evidence Metrics: For fact checking, dense retriever plus NLI pipelines outperform BM25 baselines, with top-5 evidence recall at 0.83 and evidence span F1 at 0.49.
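What "byte-level traceability" means in practice for a consumer of the corpus can be shown with a short sketch. The function and field layout are assumptions for illustration; the one property taken from the source is that offsets index bytes of the normalized view, not characters, so multi-byte UTF-8 text slices unambiguously.

```python
def resolve_span(normalized_view: str, start_byte: int, end_byte: int) -> str:
    """Resolve a FactSense-style evidence pointer. Offsets are byte
    positions into the UTF-8 encoding of the normalized view, not
    character indices, so accented and non-Latin text slices exactly."""
    data = normalized_view.encode("utf-8")
    return data[start_byte:end_byte].decode("utf-8")

view = "Marie Curie était une physicienne et chimiste."
# "était" starts at byte 12; the é occupies two bytes in UTF-8.
assert resolve_span(view, 12, 18) == "était"
```

A RAG or fact-checking system can therefore quote the exact supporting span for any retrieved fact, rather than a whole page, which is what enables span-level truthfulness supervision and the evidence span F1 metric above.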

6. Comparison: FactNet and FNAR in Network Science

The term "FactNet" also refers to the Factor Network Autoregression (FNAR) model in time-series/network econometrics (Barigozzi et al., 2022). FNAR is specified as follows:

  • Model Structure: Given multilayer network observations W_t ∈ ℝ^(N×N×m) for N nodes and m layers over T periods, the layers are decomposed as W_t = F_t ×₃ U + E_t with r ≪ m network factors (third-mode tensor PCA). Node-level time series y_t follow a VAR driven by these factors: y_t = Σ_{j=1}^r β_j (F_{j,t−1}/N) y_{t−1} + ρ y_{t−1} + α + ν_t.
  • Statistical Methodology: Tensor-PCA for layer reduction; iterated least squares (Bai 2009 style) for parameter estimation, controlling for latent cross-sectional factors in residuals.
  • Empirical Evidence: Applied to OECD+BRICS GDP networks with 25 economic/financial layers, FNAR's six retained factors explain ~85% of total layer variance; aggregate network and industrial input effects are statistically significant spillover channels. Out-of-sample forecasts of GDP growth with FNAR outperform both standard VAR and NAR models in most instances.
  • Dimension Reduction Rationale: Homogeneous network effects reduce the parameter count from O(N²) or m to r coefficients, ensuring scalable, consistent estimation in ultra-high dimensions (N, m ≫ T).
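The third-mode tensor PCA step can be sketched numerically. This is a simplified illustration of layer reduction via mode-3 unfolding and SVD, not the estimator of Barigozzi et al.; the function name, shapes, and random data are assumptions.

```python
import numpy as np

def mode3_factors(W, r):
    """Third-mode tensor PCA sketch: unfold the stacked N x N x m
    multilayer networks along the layer mode, take the top-r principal
    directions U (m x r), and project each W_t onto them to obtain r
    network factor matrices per period.
    W has shape (T, N, N, m); returns U (m, r) and F (T, N, N, r)."""
    T, N, _, m = W.shape
    unfolded = W.reshape(T * N * N, m)          # mode-3 unfolding
    _, _, Vt = np.linalg.svd(unfolded, full_matrices=False)
    U = Vt[:r].T                                # m x r loading matrix
    F = np.einsum("tijm,mr->tijr", W, U)        # projected factor networks
    return U, F

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5, 5, 8))          # T=20, N=5 nodes, m=8 layers
U, F = mode3_factors(W, r=3)
print(U.shape, F.shape)                         # (8, 3) (20, 5, 5, 3)
```

The VAR step then regresses y_t on the r factor-weighted lags instead of all m layers, which is where the O(N²)-or-m to r parameter reduction materializes.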

7. Data Release, Community Impact, and Future Directions

FactNet is fully open-sourced, providing build manifests, deterministic extraction pipelines, language packs, schemas, and benchmark tasks at https://hf.co/collections/openbmb/factnet and https://github.com/yl-shen/factnet. The inclusion of byte-level provenance for all knowledge graph facts across 316 languages establishes a new standard for reproducibility in multilingual factual grounding.

Proposed future enhancements include neuro-symbolic recall (hybrid LLM-symbolic span proposal), patch-based differential updates (streaming Wikimedia diffs), event-centric relation frames, and systematic negative-evidence mining for explicit refutation span surfacing. These avenues aim to further raise recall, close gaps in fine-grained veracity discrimination, and facilitate continual updates for dynamic knowledge applications.

FactNet’s dual role—as a resource for grounded, auditable knowledge in NLP and as a statistical methodology in high-dimensional network analysis—demonstrates its relevance in both factual reasoning systems and empirical network science (Shen et al., 3 Feb 2026, Barigozzi et al., 2022).
