
Knowledge Graphs

Published 4 Mar 2020 in cs.AI, cs.DB, and cs.LG | (2003.02320v6)

Abstract: In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.

Citations (1,368)

Summary

  • The paper presents a rigorous taxonomy of knowledge graphs, detailing diverse models, reasoning techniques, and schema design.
  • It compares multiple graph data models to highlight trade-offs in interoperability, scalability, and expressivity.
  • The paper examines both deductive (symbolic) and inductive (statistical) methods to enhance real-world deployment and quality assurance.

Knowledge Graphs: Foundations, Models, Reasoning, and Applications

Introduction

The paper "Knowledge Graphs" (2003.02320) provides a comprehensive, technically rigorous exposition of the foundational principles, methodologies, and practical deployments of knowledge graphs (KGs), consolidating advances from database theory, knowledge representation, reasoning, and machine learning. The work systematically introduces the essential components of KGs, including data models, query mechanisms, schema, identity, context, deductive and inductive knowledge, and the full lifecycle: acquisition, enrichment, validation, refinement, and publication. Emphasis is placed on both the theoretical underpinnings—ranging from formal graph abstraction to description logics—and the practical implications, especially in large-scale scientific, commercial, and industrial settings.

Knowledge Graph Definitions and Core Models

Efforts to define "knowledge graph" have historically been contentious, with the term predating the Google Knowledge Graph by decades. The authors adopt an inclusive definition: a knowledge graph is a graph-structured collection of data intended to accumulate and convey knowledge about real-world entities and their relations, supporting both extensional (fact-based) and intensional (inferred, ontological) content. Nodes correspond to entities, edges to binary (and, via reification, n-ary) relations.

Several graph data models are formalized, notably:

  • Directed edge-labelled graphs: Multi-relational graphs (e.g., RDF triples) with distinct handling of node identity (named nodes, blank nodes, literals).
  • Property graphs: Nodes and edges are annotated with arbitrary key-value pairs, supporting richer representational idioms.
  • Heterogeneous graphs: Typed nodes and edges, amenable to expressivity in multi-modal or multi-source integration tasks.
  • Graph datasets: Compound structures comprising named (possibly provenance-encoded) graphs, supporting modularity and context-sensitive attribution.

A critical insight is that these models can be translated among each other (modulo operational constraints), and that the choice of representation is subordinate to expressivity, interoperability, and scalability requirements.

Schema, Identity, and Context

Schema in KGs subsumes both semantic schemas (e.g., RDFS/OWL ontologies, description logics) and validating schemas (e.g., SHACL and ShEx shapes for structural integrity constraints). While semantic schemas facilitate reasoning under the Open World Assumption (OWA), validating schemas (with closed or open shapes) permit enforcement of data quality or domain-dependent completeness, with formal semantics for recursive and negated constraints.
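The closed/open shape distinction can be sketched as follows. This is a minimal illustration loosely in the spirit of SHACL/ShEx, far simpler than either standard; the shape definition and node data are hypothetical.

```python
# A validating shape: required properties plus, if closed, a whitelist of
# the only properties a conforming node may carry.
shape_city = {
    "required": {"name"},                 # every conforming node must have these
    "closed": True,                       # closed shape: no extra properties
    "allowed": {"name", "country"},
}

def conforms(node_props: dict, shape: dict) -> bool:
    keys = set(node_props)
    if not shape["required"] <= keys:
        return False                      # a required property is missing
    if shape["closed"] and not keys <= shape["allowed"]:
        return False                      # closed shape forbids extra properties
    return True

print(conforms({"name": "Santiago", "country": "Chile"}, shape_city))  # True
print(conforms({"country": "Chile"}, shape_city))                      # False
```

Note the contrast with OWA reasoning: a semantic schema would never conclude "invalid" from a missing property, whereas a validating shape does exactly that.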

Entity identity is handled via persistent global identifiers (e.g., IRIs), external identity links (e.g., owl:sameAs), and sophisticated disambiguation strategies. Contextualization is addressed through higher-arity modeling (reification, named graphs, RDF*, annotations), aligning with requirements for temporal, provenance, or epistemic qualification.
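One of the contextualization techniques mentioned above, named graphs, amounts to extending triples to quadruples. The sketch below uses hypothetical graph names; it shows only the data-model idea, not any particular serialization such as TriG or N-Quads.

```python
# Quads: a triple plus the name of the graph asserting it. The graph name can
# itself be the subject of further (meta-level) statements, e.g. temporal scope.
quads = {
    ("Santiago", "flight", "Arica", "g:2020"),
    ("g:2020", "validDuring", "2020", "g:meta"),  # context about the graph itself
}

def in_graph(quads, graph):
    """Return the plain triples asserted within a given named graph."""
    return {(s, p, o) for (s, p, o, g) in quads if g == graph}
```

Because "g:2020" appears both as a graph name and as a subject, provenance or temporal qualification attaches to whole groups of facts rather than to each triple individually.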

The paper underscores the technical and philosophical subtleties of identity and context, noting that semantic interoperability, especially in Linked Data scenarios, is tightly coupled to robust global identification and context propagation strategies.

Deductive Reasoning: Ontologies, Description Logics, and Rules

KGs serve as substrates for deductive reasoning, enabling explicit and implicit knowledge to be entailed under well-defined semantics. The work rigorously describes:

  • Ontological formalisms: RDFS, OWL, and OBOF are surveyed, highlighting model-theoretic semantics, the distinction between Open World and Closed World Assumption, and the handling of the Unique Name Assumption.
  • Description logics (DLs): Both the theoretical foundations (ALC, SROIQ) and practical profiles (OWL 2 DL, QL, RL) are covered. The importance of decidability and complexity trade-offs inherent to DL reasoners is made explicit.
  • Rule systems: Datalog-style Horn rules, and non-monotonic extensions, are discussed as tractable, often incomplete, mechanisms for deriving materialized inferences or for query rewriting under RL/QL profiles.
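The materialization strategy mentioned for rule systems can be sketched as forward chaining to a fixpoint. This is an illustrative toy, not the paper's formalism; the facts and the reachability rule are hypothetical examples.

```python
# Datalog-style materialisation: repeatedly apply Horn rules to a fact base
# until no new facts are derived (a fixpoint).
facts = {("Santiago", "flight", "Arica"), ("Arica", "flight", "Tacna")}

def rule_reachable(facts):
    """flight(x,y) -> reachable(x,y);  reachable(x,y), reachable(y,z) -> reachable(x,z)."""
    derived = {(s, "reachable", o) for (s, p, o) in facts if p == "flight"}
    derived |= {
        (s1, "reachable", o2)
        for (s1, p1, o1) in facts if p1 == "reachable"
        for (s2, p2, o2) in facts if p2 == "reachable" and s2 == o1
    }
    return derived

def materialise(facts, rules):
    while True:
        new = set().union(*(r(facts) for r in rules)) - facts
        if not new:
            return facts              # fixpoint: nothing further is entailed
        facts = facts | new

closure = materialise(facts, [rule_reachable])
# ("Santiago", "reachable", "Tacna") is now entailed via transitivity.
```

Query rewriting under the QL profile takes the dual approach: instead of expanding the data, the rules are compiled into the query, trading storage for query-time cost.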

Of particular note is the explicit contrast drawn between reasoning in KGs and classical database settings: KGs operate where incompleteness and context-dependence are normative, necessitating combinations of explicit assertional content with contextual schema and rule-driven inferences.

Inductive and Statistical Knowledge: Analytics, Embeddings, and GNNs

Advanced knowledge extraction and completion within KGs leverages a spectrum of statistical-relational and deep learning methods:

  • Graph analytics: Centrality, community detection, path finding, and connectivity are operationalized for discovery and validation.
  • Knowledge graph embeddings: Numerous embedding paradigms are covered. Translational models (TransE family), tensor decomposition (RESCAL, DistMult, ComplEx, TuckER), and neural approaches (ConvE, HypER) learn compact representations optimizing link prediction and completion under various loss objectives [Wang2017KGEmbedding]. The strong formal result is that models such as ComplEx and TuckER are fully expressive given sufficient capacity.
  • Graph neural networks: Both recursive (RecGNN) and convolutional (ConvGNN) architectures are adapted for node and graph-level tasks. The authors relate GNN expressivity to classic results on the Weisfeiler-Lehman isomorphism test and description logic (ALCQ) expressivity [abs-1901-00596, BarceloKMPRS20], providing a unifying theoretical lens.
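The translational intuition behind the TransE family above is compact enough to state in code: a triple (s, p, o) is plausible when the subject embedding translated by the relation embedding lands near the object embedding, i.e. when ||e_s + e_p - e_o|| is small. The vectors below are hand-picked toy values, not trained embeddings.

```python
import numpy as np

# Toy 2-d embeddings chosen so that Santiago + flight = Arica exactly.
emb = {
    "Santiago": np.array([0.0, 1.0]),
    "Arica":    np.array([1.0, 1.0]),
    "flight":   np.array([1.0, 0.0]),
}

def transe_score(s, p, o):
    """TransE plausibility score: lower is more plausible."""
    return float(np.linalg.norm(emb[s] + emb[p] - emb[o]))

print(transe_score("Santiago", "flight", "Arica"))  # 0.0: a consistent triple
```

Training minimizes this score for observed triples against corrupted negatives; the tensor-decomposition and neural models in the list replace the translation with richer scoring functions, which is what buys ComplEx and TuckER their full expressivity.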

Crucially, the interplay of deductive (symbolic) and inductive (statistical) methods is explored, including joint models that enforce logical constraints in the learning phase of representations (e.g., KALE [GuoWWWG16], FSL [DemeesterRR16]), with implications for regularizing completion and improving plausibility reasoning.

Construction, Enrichment, and Quality Assurance

KGs are created via a combination of manual (crowdsourced, expert-driven), extraction-based (NLP/NER/EL from text), markup-based (semi-structured web, tables), and structured-source mapping (relational, CSV, XML/JSON, virtual OBDA) methods. The enrichment process is formalized as iterative, pay-as-you-go, and modular, making heavy use of mapping languages (R2RML), wrappers, and datacenter-scale ETL.
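Structured-source mapping can be illustrated with a minimal CSV-to-triples transform, in the spirit of (but far simpler than) R2RML. The column names, IRI templates, and base namespace below are hypothetical.

```python
import csv, io

csv_data = "id,name,country\n1,Santiago,Chile\n2,Arica,Chile\n"

def map_rows(text, base="http://ex.org/"):
    """Map each CSV row to triples using a subject-IRI template per row."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subj = base + "city/" + row["id"]          # subject IRI template
        triples.add((subj, base + "name", row["name"]))
        triples.add((subj, base + "country", row["country"]))
    return triples

triples = map_rows(csv_data)   # 2 rows x 2 properties = 4 triples
```

An R2RML mapping expresses exactly these two ingredients declaratively: a subject-IRI template and per-column predicate-object mappings; virtual OBDA evaluates the same mapping at query time without materializing the triples.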

A major contribution is the nuanced typology of quality assessment:

  • Accuracy: Syntactic and semantic validity, timeliness.
  • Coverage: Completeness at schema, property, population, and linkage levels.
  • Coherency: Logical/formal consistency relative to ontology and constraint definitions.
  • Succinctness and understandability: Ease of human interpretation and minimal redundancy.

Strong emphasis is placed on the inherent, often irreducible, incompleteness and noise in KGs assembled from multi-modal, multi-source environments.

Refinement: Completion, Correction, and Symbolic Learning

Completion (link and type prediction) is operationalized as a combined problem of statistical-relational learning (embeddings, Rule/ILP-based symbolic induction) and entity resolution (identity matching, leveraging blocking and similarity measures). Correction (fact validation and inconsistency repair) is framed both as a function of plausibility—using evidence from web and structured sources, with scoring functions adapted from mutual information and HITS-like authority/hub propagation [Kleinberg99]—and as a challenge of repair, with minimal hitting-set approaches to restore model consistency.
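The blocking-plus-similarity pattern for entity resolution can be sketched briefly. This is an illustrative toy: the blocking key (first character) and the Jaccard threshold are arbitrary choices, and real pipelines use far more robust keys and measures.

```python
from collections import defaultdict

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def candidate_matches(names, threshold=0.5):
    # Blocking: only names sharing a key (here: first character) are compared,
    # avoiding the quadratic blow-up of all-pairs comparison.
    blocks = defaultdict(list)
    for n in names:
        blocks[n[0].lower()].append(n)
    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if jaccard(block[i], block[j]) >= threshold:
                    pairs.append((block[i], block[j]))
    return pairs

matches = candidate_matches(["Santiago Chile", "Santiago de Chile", "Arica"])
```

Blocking trades recall for scalability: pairs that land in different blocks are never compared, which is why key design is itself a quality concern.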

The review identifies open challenges in the intersection of deduction and induction, particularly in hybrid architectures that leverage embeddings regularized by symbolic logic, for improved evidential robustness and explainability.

Publication, FAIR Principles, and Linked Data

The paper details the protocols and standards supporting KG publication:

  • FAIR principles: Enforced via persistent identifiers, searchable registries, protocol standardization, and licensing.
  • Linked Data: Drives design and interlinking, leveraging HTTP IRIs, content negotiation for RDF serializations, and protocols ranging from bulk download to fragment-based querying (Triple Pattern Fragments, SPARQL endpoints) [VerborghSHHVMHC16].

The technical handling of usage control is extended to fine-grained licensing (ODRL), access and usage policies (WAC, DPVCG), cryptographic protection (CryptOntology), and privacy-preserving anonymization (k-anonymity, differential privacy).
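The k-anonymity notion mentioned above has a simple operational check: every combination of quasi-identifier values must be shared by at least k records. The records, attribute names, and generalized values below are toy examples.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True iff every quasi-identifier combination occurs in >= k records."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "75*", "age": "20-30", "diagnosis": "A"},
    {"zip": "75*", "age": "20-30", "diagnosis": "B"},
    {"zip": "99*", "age": "40-50", "diagnosis": "C"},
]
print(is_k_anonymous(records, ["zip", "age"], k=2))  # False: the third record
                                                     # is unique on (zip, age)
```

Achieving k-anonymity means generalizing or suppressing values until the check passes; differential privacy instead perturbs query answers, making no assumption about which attributes act as quasi-identifiers.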

Knowledge Graphs in Practice

Prominent deployments are analysed:

  • Open KGs: DBpedia, YAGO, Freebase, and Wikidata present contrasting strategies in extraction, schema design, and editorial/curation approaches [LehmannIJJKMHMK15, VrandecicK14, FarberBMR18].
  • Enterprise KGs: Industrial knowledge graphs (Google, Microsoft, Amazon, LinkedIn, Facebook) are characterized by scalability, redundancy resolution, heterogeneous data integration, lightweight schemas, automated completion, and support for downstream semantics-based applications (semantic search, QA, recommendations).

The authors stress that current industrial practice is typified by thin ontological schema, massive data scale, and hybrid deductive/statistical processing pipelines.

Conclusion

The paper offers an authoritative, technically precise taxonomy and synthesis of knowledge graph research and practice. It reconciles the tension between formal logical semantics and scalable, statistical-relational reasoning, and codifies methodological best practices in data modeling, schema design, querying, and maintenance. Theoretical implications include the need for further work in semantically invariant graph analytics, robust context modeling, and convergence of deductive and inductive learning frameworks. Practically, the paper’s prescriptions guide large-scale, multi-modal knowledge integration in scientific, governmental, and enterprise contexts.

Future research is especially motivated in semantically-aware machine learning for KGs, joint symbolic–subsymbolic reasoning, large-scale quality assurance, robust contextualization, privacy/compliance enforcement, and more expressive, performant query interfaces. The surveyed methodologies and formal frameworks are foundational for continued advancements in AI, particularly as the requirement for reliable, contextualized, explainable structured knowledge becomes more acute.

Citation: "Knowledge Graphs" (2003.02320)
