Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graphs from Complex Documents

Published 3 Apr 2026 in cs.AI, cs.IR, and cs.LG | (2604.03496v1)

Abstract: Knowledge graph construction typically relies either on predefined ontologies or on schema-free extraction. Ontology-driven pipelines enforce consistent typing but require costly schema design and maintenance, whereas schema-free methods often produce fragmented graphs with weak global organization, especially in long technical documents with dense, context-dependent information. We propose TRACE-KG (Text-dRiven schemA for Context-Enriched Knowledge Graphs), a multimodal framework that jointly constructs a context-enriched knowledge graph and an induced schema without assuming a predefined ontology. TRACE-KG captures conditional relations through structured qualifiers and organizes entities and relations using a data-driven schema that serves as a reusable semantic scaffold while preserving full traceability to the source evidence. Experiments show that TRACE-KG produces structurally coherent, traceable knowledge graphs and offers a practical alternative to both ontology-driven and schema-free construction pipelines.

Authors (4)

Summary

The paper introduces TRACE-KG, a novel multimodal framework that simultaneously constructs a knowledge graph and an induced schema from complex technical documents.
It leverages conditional relation modeling and structured qualifiers to maintain evidence traceability and semantic fidelity to the source material.
Experiments demonstrate superior graph connectivity and modular organization compared to traditional ontology-driven and schema-free approaches.

TRACE-KG: Advancing Context-Enriched Knowledge Graph Construction Without Predefined Schemas

Introduction

The paper "Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graphs from Complex Documents" (2604.03496) addresses the persistent fragmentation and organizational shortcomings in knowledge graph (KG) construction by introducing TRACE-KG, a novel multimodal framework for extracting context-enriched KGs from complex technical documents. Unlike traditional ontology-driven approaches that require manual schema design or schema-free techniques that result in weakly organized representations, TRACE-KG induces a data-driven schema during extraction. This paradigm enables conditional relation modeling via structured qualifiers and maintains full traceability to the original document context, establishing a practical alternative to existing KG construction pipelines.

Methodological Contributions

TRACE-KG departs from ontology-first and schema-free models by simultaneously constructing both the KG and an induced schema derived from the input corpus. The framework is fundamentally multimodal, supporting information fusion across textual, tabular, and visual inputs. Core technical innovations include:

Contextualized Conditional Relations: TRACE-KG encodes conditionality and context through structured qualifiers, extending plain subject-predicate-object triples into hyper-relational facts, in line with the hyper-relational KG literature [Li, 2025; Ding, 2024].
Schema Induction: Rather than imposing external ontologies, it aggregates entity/relation types and qualifiers from corpus-driven patterns, forming a reusable, data-backed semantic scaffold. This contrasts with approaches like AutoSchemaKG (Bai et al., 29 May 2025) and KGGen (Mo et al., 14 Feb 2025), which either focus on web-scale corpora or rely on LLM-based extraction.
Traceability to Source Evidence: Every element in the constructed KG is linked back to the explicit document segments (text spans, table cells, captions) that support it, which enables evidence tracking and facilitates downstream explainability.
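The three ideas above can be pictured as a single record type. The following is a minimal sketch in Python, not the paper's data model: the class names, field names, and the example values are all hypothetical and chosen only to show how a qualified relation might carry its conditions and its evidence links together.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """Pointer back to the source segment (hypothetical layout)."""
    doc_id: str
    kind: str      # e.g. "text_span", "table_cell", or "caption"
    locator: str   # free-form position within the document
    snippet: str   # the supporting text as it appears in the source


@dataclass
class QualifiedRelation:
    """A triple plus structured qualifiers and evidence links."""
    subject: str
    predicate: str
    obj: str
    qualifiers: dict[str, str] = field(default_factory=dict)
    evidence: list[Evidence] = field(default_factory=list)

    def as_triple(self) -> tuple[str, str, str]:
        # Dropping qualifiers recovers the plain triple view.
        return (self.subject, self.predicate, self.obj)

    def is_traceable(self) -> bool:
        # A fact counts as traceable if it cites at least one segment.
        return len(self.evidence) > 0


# Made-up example: a measured value that only holds under stated conditions.
fact = QualifiedRelation(
    subject="Alloy A",
    predicate="has_yield_strength",
    obj="350 MPa",
    qualifiers={"temperature": "25 C", "method": "tensile test"},
    evidence=[Evidence("doc-1", "table_cell", "table 2, row 3", "350")],
)
```

Keeping qualifiers as a separate mapping, rather than folding them into the predicate name, is one way to let the same relation type recur under different conditions.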

The framework leverages LLM-based models for entity/relation extraction and a post-processing pipeline that consolidates context-specific relation qualifiers, mitigating coreference and semantic drift issues common in schema-free extraction.
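This summary does not describe the induction procedure in detail, so the sketch below substitutes a simple frequency-based aggregation to illustrate the general idea of deriving a schema from extracted facts. The function name `induce_schema`, the `min_support` threshold, and the type labels are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def induce_schema(facts, min_support=2):
    """Aggregate typed facts into relation signatures.

    facts: iterable of (subject_type, predicate, object_type, qualifier_keys).
    Returns a dict mapping each signature seen at least `min_support` times
    to the sorted qualifier keys observed with it.
    """
    signature_counts = Counter()
    qualifier_keys = defaultdict(Counter)
    for s_type, predicate, o_type, q_keys in facts:
        signature = (s_type, predicate, o_type)
        signature_counts[signature] += 1
        qualifier_keys[signature].update(q_keys)

    schema = {}
    for signature, count in signature_counts.items():
        if count >= min_support:  # drop one-off patterns
            schema[signature] = sorted(qualifier_keys[signature])
    return schema


# Made-up typed facts, as an extractor might emit them.
facts = [
    ("Material", "has_property", "Measurement", ["temperature"]),
    ("Material", "has_property", "Measurement", ["temperature", "method"]),
    ("Device", "uses", "Material", []),
]
schema = induce_schema(facts)
```

With these inputs, only the `("Material", "has_property", "Measurement")` signature meets the support threshold, and it carries the qualifier keys `method` and `temperature`.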

Experimental Results

TRACE-KG was systematically evaluated against both ontology-driven and schema-free baselines, using complex technical document corpora with dense, conditional semantics. Key findings include:

Structural Coherence: TRACE-KG achieves higher graph connectivity, fewer isolated nodes, and superior modular organization, as measured by structural graph metrics and coherence scores. Notably, the induced schema demonstrates stability and broad applicability across subdomains within a corpus.
Traceability: Unlike baselines, all graph elements in TRACE-KG are directly traceable to document fragments, supporting auditability and error attribution.
Conditional Relation Modeling: TRACE-KG effectively captures nuanced, context-sensitive relations (e.g., result validity under specific conditions), offering clear improvements over vanilla triple extraction systems.
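The summary does not name the exact structural metrics used. As an illustration of the kind of measurement involved, the sketch below counts connected components and isolated nodes for an undirected view of a graph using a union-find structure; the function name and the returned keys are assumptions.

```python
def structural_summary(nodes, edges):
    """Count connected components and isolated nodes (undirected view)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        parent[find(u)] = find(v)  # merge the two components

    components = len({find(n) for n in nodes})
    isolated = sum(1 for d in degree.values() if d == 0)
    fraction = isolated / len(nodes) if nodes else 0.0
    return {
        "components": components,
        "isolated_nodes": isolated,
        "isolated_fraction": fraction,
    }


# Toy graph: a-b-c form one component, d is isolated.
summary = structural_summary(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
```

Fewer components and a lower isolated fraction correspond to the "higher connectivity, fewer isolated nodes" pattern described above.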

The reported results indicate that the framework can represent complex document knowledge structures without external schema engineering, while maintaining high semantic fidelity to the source material.

Implications and Theoretical Impact

TRACE-KG redefines the trade-off space in KG construction. It achieves the global structural regularity and semantic richness of ontology-driven systems without the cost of upfront schema development, and it surpasses schema-free pipelines in semantic organization and evidence traceability. The conditional and contextualized modeling supported by TRACE-KG directly addresses key requirements in scientific and technical domains, where context and qualifiers fundamentally affect knowledge validity.

A significant theoretical implication is the movement toward corpus-derived schema induction, supporting adaptive, evolving ontological structures responsive to domain and document variations. This advances existing work in dynamic schema learning (Bai et al., 29 May 2025) and hyper-relational KGs [Li, 2025], and positions the system for integration with LLM-augmented reasoning and entity disambiguation pipelines (Abolhasani et al., 2024, Zhang et al., 2024, Ding et al., 2024).

Practically, TRACE-KG's evidence-linked outputs enhance the reliability of automated documentation systems, decision-support tools, and scientific knowledge bases, enabling users to validate extractive results against underlying data.
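Validating an extracted fact against its source can be as simple as checking that the cited span really contains the quoted text. The sketch below assumes evidence is stored as character offsets into the document text, which is an illustrative choice and not something the summary specifies; `verify_evidence` is a hypothetical helper.

```python
def verify_evidence(document_text, start, end, quoted):
    """Return True if the cited character span matches the quoted snippet."""
    return document_text[start:end] == quoted


# Made-up document text and evidence offsets.
doc = "Yield strength was 350 MPa at room temperature."
start = doc.index("350 MPa")
ok = verify_evidence(doc, start, start + len("350 MPa"), "350 MPa")
bad = verify_evidence(doc, 0, 5, "350 MPa")  # span points at the wrong text
```

A check like this catches stale or misaligned evidence links, though it says nothing about whether the extracted relation is a correct reading of the span.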

Limitations and Future Directions

The framework's reliance on high-quality LLM-driven extraction and accurate context parsing underlines ongoing challenges in noisy or heterogeneous corpora. Scalability to even broader multimodal contexts (e.g., combining textual, tabular, and image data at large scale) remains an open problem. Research into fine-grained qualifier taxonomy induction and continual schema evolution is needed for deployment in rapidly changing technical domains.

Aligning induced schemas across corpora and establishing cross-domain interoperability present further opportunities, especially in the context of federated knowledge base construction and dynamic scientific reporting.

Conclusion

TRACE-KG presents a robust, data-driven alternative for extracting context-enriched, schema-regularized knowledge graphs from complex documents without the need for predefined ontologies. Its capability to model conditional relations, induce reusable schemas, and maintain full evidence traceability offers substantial practical and theoretical advancements for KG construction, particularly within domains where nuanced context and provenance are paramount. The framework sets the stage for future research in scalable, adaptive, and auditable automated knowledge representation.
