Unified Redundancy Tree (URT)
- The Unified Redundancy Tree (URT) is a formal framework that decomposes structured data redundancy using rooted trees in information theory and tries in log compression.
- It isolates components like redundancy, synergy, and unique information through max-entropy optimization and convex programming to ensure nonnegative, interpretable measures.
- In log compression, URT unifies static templates and variable fields to improve compression ratios and throughput, with empirical improvements of 4.7% to 80.9% over prior baselines.
The Unified Redundancy Tree (URT) represents a formal and algorithmic paradigm for modeling, decomposing, and exploiting redundancy in structured data. Two major but distinct instantiations currently exist: (1) the rooted-tree-based maximum-entropy decomposition of mutual information in multivariate information theory (Chicharro, 2017), and (2) the hierarchical trie-based data structure unifying static templates and variable fields for log compression in systems such as LogPrism (Liu et al., 24 Jan 2026). Both constructions operationalize the systematic isolation of redundancy, synergy, and unique information (or pattern) components, but in differing application contexts—abstract information measures and practical pattern compression, respectively.
1. Rooted Tree Decompositions for Multivariate Redundancy
The URT in multivariate information theory is a procedural framework for decomposing the mutual information between a target random variable and a set of sources into interpretable, nonnegative components: redundancy, unique-redundancy, and synergy. This model generalizes the partial information decomposition guided by the Williams–Beer redundancy lattice, which imposes three desiderata on redundancy measures: symmetry, self-redundancy, and monotonicity (Chicharro, 2017).
In the URT approach, rather than assigning closed-form redundancy values to each collection in the lattice, a family of rooted binary trees is constructed, each corresponding to an ordering of the sources. Local binary unfoldings at each node implement max-entropy optimization subject to co-information constraints, isolating redundant against synergistic contributions. Each local split is characterized by:
- Unconditional and conditional co-information constraints (e.g., fixing a co-information term, or a co-information conditioned on a subset of sources, to zero), which enforce that mutual information not attributable to redundancy is converted to synergy.
- Minimal mutual informations under these constraints are determined via convex programming, ensuring that every term in the decomposition (redundancy, unique-redundancy, or synergy) is nonnegative.
For any source subset A and an additional source Xi not in A, the redundancy of the unique information of Xi with respect to A is obtained as a minimum of mutual information over a convex set of distributions: those preserving all bivariate marginals, intersected with the co-information constraints. For example, in the three-variable case (X1, X2; Y), all redundancy and synergy atoms are recovered as explicit minima over convex sets (see Table 1 in (Chicharro, 2017)).
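As a concrete illustration of the quantity these constraints act on, the following sketch (hypothetical helper names; not code from either paper) computes mutual information and co-information directly from an explicit joint distribution, confirming that a purely synergistic XOR target has a co-information of -1 bit:

```python
from itertools import product
from math import log2

def marginal(p, axes):
    """Marginalize a joint distribution {(x1, x2, y): prob} onto the given axes."""
    out = {}
    for outcome, prob in p.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + prob
    return out

def mutual_information(p, ax_a, ax_b):
    """I(A; B) in bits for the axis groups ax_a and ax_b of the joint."""
    pa, pb = marginal(p, ax_a), marginal(p, ax_b)
    pab = marginal(p, ax_a + ax_b)
    mi = 0.0
    for key, prob in pab.items():
        if prob > 0:
            ka, kb = key[:len(ax_a)], key[len(ax_a):]
            mi += prob * log2(prob / (pa[ka] * pb[kb]))
    return mi

def co_information(p):
    """Co-information I(X1; X2; Y) = I(X1; Y) + I(X2; Y) - I(X1, X2; Y)."""
    return (mutual_information(p, (0,), (2,))
            + mutual_information(p, (1,), (2,))
            - mutual_information(p, (0, 1), (2,)))

# XOR target: Y = X1 xor X2 with uniform inputs is purely synergistic,
# so the co-information comes out at -1 bit.
xor = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
print(co_information(xor))  # -1.0
```

A max-entropy URT unfolding would minimize such mutual-information terms over the convex set of distributions satisfying these constraints, rather than evaluating them at a single fixed distribution as done here.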
2. The Unified Redundancy Tree in Log Compression
In the domain of log compression, the URT is the core hierarchical data structure for unifying event structure and variable encoding, as introduced in LogPrism (Liu et al., 24 Jan 2026). Here, the URT operates as a trie where:
- Upper levels represent static tokens (e.g., "user=").
- Edges from a node can be actual string tokens or the wildcard "<*>" for dynamic fields.
- Each node v is annotated with a counter count(v) (the number of log-lines reaching v) and, for "stable endpoint" nodes used in compression, a unique path ID (PID).
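A minimal sketch of such an annotated trie node, assuming illustrative field names (`count`, `pid`) rather than LogPrism's actual identifiers:

```python
from dataclasses import dataclass, field
from typing import Optional

WILDCARD = "<*>"  # edge label used for dynamic fields

@dataclass
class URTNode:
    count: int = 0                 # number of log lines whose path reaches this node
    pid: Optional[int] = None      # path ID, assigned only to stable endpoints
    children: dict = field(default_factory=dict)  # token or WILDCARD -> URTNode

    def insert(self, tokens):
        """Walk or extend the trie along a token sequence, incrementing counters."""
        node = self
        node.count += 1
        for tok in tokens:
            node = node.children.setdefault(tok, URTNode())
            node.count += 1
        return node  # the structural endpoint for this line

root = URTNode()
leaf = root.insert(["user=", WILDCARD, "rhost=", WILDCARD])
print(leaf.count)  # 1
```

Repeated insertions of the same token sequence reuse the existing path and only bump counters, which is what makes the structure cheap on highly repetitive logs.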
Construction proceeds in two main stages:
- Structural Skeleton Construction: Logs are tokenized, transformed to a sequence (structure plus wildcards for variables), and parallel-inserted into the trie. Isomorphic merges consolidate equivalent structural subtrees.
- Variable Subtree Expansion: At each structural leaf, the set of variable lists is filtered for frequency, positions are ordered by stability, and an inner trie is built to capture variable co-occurrences. Stable endpoints (either leaves or nodes with count(v) ≥ θ for a threshold θ) receive distinct PID assignments.
Compression replaces frequently occurring patterns by single integer PIDs, and less frequent "residual" variable tokens are handled separately.
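The two construction stages above can be sketched as follows; the digit-based tokenizer and the frequency threshold are simplified stand-ins for LogPrism's actual heuristics:

```python
import re
from collections import Counter, defaultdict

WILDCARD = "<*>"

def to_skeleton(line):
    """Stage 1 helper: replace variable-looking tokens (here: any token
    containing a digit) with the wildcard, collecting the variable values."""
    skeleton, variables = [], []
    for tok in line.split():
        if re.search(r"\d", tok):
            skeleton.append(WILDCARD)
            variables.append(tok)
        else:
            skeleton.append(tok)
    return tuple(skeleton), variables

def build_variable_subtrees(var_lists, min_freq=2):
    """Stage 2 sketch: keep only variable tuples frequent enough to deserve
    a stable endpoint; everything else becomes residual."""
    freq = Counter(tuple(v) for v in var_lists)
    return {pattern: n for pattern, n in freq.items() if n >= min_freq}

lines = ["conn from 10.0.0.1 port 22",
         "conn from 10.0.0.1 port 22",
         "conn from 10.0.0.9 port 80"]
groups = defaultdict(list)
for line in lines:
    skel, variables = to_skeleton(line)
    groups[skel].append(variables)
for skel, var_lists in groups.items():
    print(skel, build_variable_subtrees(var_lists))
```

All three lines collapse to one skeleton, and only the variable tuple seen twice survives the frequency filter; the third line's values would be handled as residuals.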
3. Algorithmic Formulation and Complexity
The formal algorithms governing URT construction and update in log compression include:
- InsertSkeleton: Traverses or extends the skeleton trie, incrementing counters and aggregating variable lists.
- MergeTries/IsomorphicMerge: Efficiently merges parallel tries and consolidates structurally isomorphic branches by child-signature grouping.
- BuildVariableSubtrees: Orders variable positions by stability, builds variable-value tries, applies frequency thresholding, and assigns stable PIDs.
- CompressLog: For a log line, matches its structure and variable path, emits the deepest matched PID, and collects any unmatched variables as residuals.
The preprocessing time complexity is near-linear: it depends on N (log lines), T (tokens per line), V (variables per line), P (paths), and K (variable positions), and reduces to O(N) when T and V are constant. Query time per log decompression is similarly bounded by the per-line token and variable counts (Liu et al., 24 Jan 2026).
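A simplified sketch of the CompressLog step, with the URT reduced to a dictionary keyed by (skeleton, variable-prefix) pairs; this is an illustrative encoding, not the paper's actual one:

```python
WILDCARD = "<*>"

def compress_log(skeleton, variables, pid_table):
    """Emit (pid, residuals): the PID of the deepest matched pattern, plus
    any variable values not covered by a stable endpoint."""
    # Try the full variable tuple first, then progressively shorter prefixes,
    # so the deepest stable endpoint wins.
    for depth in range(len(variables), -1, -1):
        key = (skeleton, tuple(variables[:depth]))
        if key in pid_table:
            return pid_table[key], variables[depth:]
    raise KeyError("skeleton not present in the URT")

pid_table = {
    (("conn", "from", WILDCARD, "port", WILDCARD), ()): 0,
    (("conn", "from", WILDCARD, "port", WILDCARD), ("10.0.0.1", "22")): 1,
}
print(compress_log(("conn", "from", WILDCARD, "port", WILDCARD),
                   ["10.0.0.1", "22"], pid_table))   # (1, [])
print(compress_log(("conn", "from", WILDCARD, "port", WILDCARD),
                   ["10.0.0.9", "80"], pid_table))   # (0, ['10.0.0.9', '80'])
```

The first line matches a stable variable endpoint and is encoded by a single integer; the second falls back to the bare skeleton PID and carries its variables as residuals.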
4. Quantification of Redundancy and Compression Metrics
In both theoretical and applied contexts, the effectiveness of the URT is assessed by quantifying the capture of redundancy and resultant compression ratio.
- Compression Ratio: For B_orig original bytes and B_comp compressed bytes, the compression ratio is CR = B_orig / B_comp.
- For each PID pattern p with token length l_p and frequency f_p, the encoded redundancy is R_p = f_p · l_p · b_mean, where b_mean is the mean token byte-length.
- The overall gain is G = Σ_p f_p · (l_p · b_mean − b_PID), with b_PID the byte-length of a stored PID.
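Under these definitions, the metrics are straightforward to compute; a short sketch (symbol names mirror the formulas above, and the numbers are made up for illustration):

```python
def compression_ratio(bytes_orig, bytes_comp):
    """CR = B_orig / B_comp."""
    return bytes_orig / bytes_comp

def pattern_gain(freq, length, mean_token_bytes, pid_bytes):
    """Per-pattern gain: f_p tokens runs of length l_p at b_mean bytes each,
    replaced by f_p PIDs of pid_bytes each."""
    return freq * length * mean_token_bytes - freq * pid_bytes

# A pattern of 5 tokens (4 bytes each on average) seen 1000 times,
# replaced by a 4-byte PID:
print(pattern_gain(1000, 5, 4.0, 4))        # 16000.0
print(compression_ratio(100_000, 20_000))   # 5.0
```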
Empirical evaluation shows the URT-based LogPrism achieves superior compression ratios, with improvements of 4.7% to 80.9% over prior baselines on 13/16 datasets and throughput as high as 29.87 MB/s, faster than competing systems (Liu et al., 24 Jan 2026).
5. Illustrative Example and Pattern Encoding
The operational logic of the URT is exemplified in the case of compressing six similar log lines from an sshd daemon (Liu et al., 24 Jan 2026). The skeleton trie captures static structure, variable subtrees enumerate frequent variable co-occurrences, and stable PIDs efficiently represent recurring structured patterns. For instance:
- Paths such as [root]→"Jul"→"10"→..."rhost="→"<*>"→"uid="→"<*>"→"euid="→"<*>" define a skeleton.
- Variables (e.g., user, rhost, uid, euid) are further organized in frequency-ordered subtrees.
- Stable endpoints are assigned PIDs; logs that match these can be encoded by these identifiers alone, reducing storage cost.
Residuals—those variables not part of a frequent enough pattern—are left for residual processing.
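Putting the pieces together, a toy end-to-end run on six sshd-style lines (the line format and threshold are illustrative, not taken from the paper) shows most lines collapsing to integer PIDs while the infrequent pattern falls through as a residual:

```python
from collections import Counter

WILDCARD = "<*>"

# Six sshd-style lines sharing one skeleton (timestamps omitted for brevity).
lines = [f"auth failure rhost=10.0.0.{i} uid=0 euid=0" for i in (1, 1, 1, 2, 2, 3)]

def split_line(line):
    """Separate a line into a wildcarded skeleton and its variable values."""
    skeleton, variables = [], []
    for tok in line.split():
        if "=" in tok:
            key, value = tok.split("=", 1)
            skeleton.append(key + "=" + WILDCARD)
            variables.append(value)
        else:
            skeleton.append(tok)
    return tuple(skeleton), tuple(variables)

# Assign a PID to every (skeleton, variables) combination seen at least twice;
# the rest are left for residual processing.
freq = Counter(split_line(line) for line in lines)
pids = {pattern: pid
        for pid, pattern in enumerate(p for p, n in freq.items() if n >= 2)}

encoded = [pids.get(split_line(line)) for line in lines]
print(encoded)  # [0, 0, 0, 1, 1, None] -- five of six lines collapse to a PID
```

The `None` entry corresponds to the rhost seen only once: its skeleton still matches, but its variable values would be stored as residual tokens rather than a stable endpoint.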
6. Significance and Context
The URT formalism has dual conceptual and practical significance. In information theory, it provides a constructive, maximum-entropy-consistent decomposition of multivariate mutual information fully aligned with the Williams–Beer redundancy lattice, guaranteeing nonnegativity, symmetry, self-redundancy, and monotonicity. Under mild assumptions, these decompositions converge to the “true” nonnegative multivariate decomposition when no strictly positive synergy exists within the preserved marginals (Chicharro, 2017).
In applied data compression, the URT is a central innovation that unifies structure and variable encoding, efficiently capturing high-frequency patterns for direct integer encoding. This joint modeling drastically enhances both compression ratio and throughput by front-loading pattern encoding and minimizing the computational burden downstream (Liu et al., 24 Jan 2026). The data structure is amenable to parallel construction, supports sublinear space in repetitive datasets, and has quantifiable redundancy-capture guarantees.
The convergence of tree-based redundancy decompositions and trie-based pattern compression under the URT umbrella suggests that the unification of structural and contextual redundancy is both an information-theoretically grounded and an empirically validated paradigm.