
Unified Redundancy Tree (URT)

Updated 31 January 2026
  • The Unified Redundancy Tree (URT) is a formal framework that decomposes structured data redundancy using rooted trees in information theory and tries in log compression.
  • It isolates components like redundancy, synergy, and unique information through max-entropy optimization and convex programming to ensure nonnegative, interpretable measures.
  • In log compression, URT unifies static templates and variable fields to improve compression ratios and throughput, as evidenced by significant empirical gains.

The Unified Redundancy Tree (URT) represents a formal and algorithmic paradigm for modeling, decomposing, and exploiting redundancy in structured data. Two major but distinct instantiations currently exist: (1) the rooted-tree-based maximum-entropy decomposition of mutual information in multivariate information theory (Chicharro, 2017), and (2) the hierarchical trie-based data structure unifying static templates and variable fields for log compression in systems such as LogPrism (Liu et al., 24 Jan 2026). Both constructions operationalize the systematic isolation of redundancy, synergy, and unique information (or pattern) components, but in differing application contexts—abstract information measures and practical pattern compression, respectively.

1. Rooted Tree Decompositions for Multivariate Redundancy

The URT in multivariate information theory is a procedural framework for decomposing the mutual information $I(X;S)$ between a target random variable $X$ and a set of sources $S = \{S_1, \ldots, S_n\}$ into interpretable, nonnegative components: redundancy, unique-redundancy, and synergy. This model generalizes the partial information decomposition guided by the Williams–Beer redundancy lattice, which imposes three desiderata on redundancy measures: symmetry, self-redundancy, and monotonicity (Chicharro, 2017).

In the URT approach, rather than assigning closed-form redundancy values to each collection in the lattice, a family of rooted binary trees is constructed, each corresponding to an ordering of the sources. Local binary unfoldings at each node implement max-entropy optimization subject to co-information constraints, separating redundant from synergistic contributions. Each local split is characterized by:

  • Unconditional and conditional co-information constraints, such as $C(X;i;j) = I(X;i) + I(X;j) - I(X;ij) = 0$ or $C(X;i;j \mid K)$, which enforce that, under the imposed constraints, mutual information not attributable to redundancy is assigned to synergy.
  • Minimal mutual information values under these constraints are determined via convex programming, ensuring that every term in the decomposition (redundancy, unique-redundancy, or synergy) is nonnegative.
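The unconditional co-information constraint can be checked numerically; below is a minimal sketch using the standard definitions, where the XOR joint distribution is an illustrative toy (not an example from the paper) and a strictly negative value signals purely synergistic sources:

```python
import numpy as np

def mutual_info(joint):
    """I(A;B) in bits from a 2-D joint pmf over (A, B)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa * pb)[nz])))

def co_information(p_xs1s2):
    """C(X;S1;S2) = I(X;S1) + I(X;S2) - I(X;S1S2) from a 3-D joint pmf."""
    i_x_s1 = mutual_info(p_xs1s2.sum(axis=2))   # marginalize out S2
    i_x_s2 = mutual_info(p_xs1s2.sum(axis=1))   # marginalize out S1
    # Treat (S1, S2) jointly as one composite source with 4 states.
    i_x_s12 = mutual_info(p_xs1s2.reshape(2, -1))
    return i_x_s1 + i_x_s2 - i_x_s12

# Toy example: X = S1 XOR S2 with uniform, independent binary S1, S2.
p = np.zeros((2, 2, 2))   # axes: (X, S1, S2)
for s1 in (0, 1):
    for s2 in (0, 1):
        p[s1 ^ s2, s1, s2] = 0.25

print(co_information(p))  # negative co-information: purely synergistic sources
```

For XOR, each source alone carries zero information about $X$ while the pair determines it fully, so the co-information is $-1$ bit and the decomposition must place all of $I(X;S)$ into synergy.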

For any source subset $Y \subseteq Z$ and $W = Z \setminus Y$, the redundancy of $Y$ unique with respect to $W$ is:

$$I(X; \alpha_{Y;Z}) = \min_{ms(Z),\; c(W),\; c(W,k)\ \forall k \in Y \setminus \{i,j\}} I(X;Z) \;-\; \min_{ms(Z),\; c(W,k)\ \forall k \in Y \setminus \{i,j\}} I(X;Z)$$

where $ms(Z)$ denotes preservation of all bivariate marginals and $c(W)$ signifies co-information constraints. For example, in a three-variable case $S = \{1,2,3\}$, all redundancy and synergy atoms are recovered as explicit minima over convex sets (see Table 1 in (Chicharro, 2017)).

2. The Unified Redundancy Tree in Log Compression

In the domain of log compression, the URT is the core hierarchical data structure for unifying event structure and variable encoding, as introduced in LogPrism (Liu et al., 24 Jan 2026). Here, the URT $(N, E, \text{root})$ operates as a trie where:

  • Upper levels represent static tokens (e.g., "user=").
  • Edges from a node can be actual string tokens or the wildcard "<*>" for dynamic fields.
  • Each node $n \in N$ is annotated with $cnt(n)$ (the number of log lines reaching $n$) and $pid(n)$ (a unique path ID assigned to "stable endpoint" nodes used for compression).

Construction proceeds in two main stages:

  1. Structural Skeleton Construction: Logs are tokenized, transformed to a sequence (structure plus wildcards for variables), and parallel-inserted into the trie. Isomorphic merges consolidate equivalent structural subtrees.
  2. Variable Subtree Expansion: At each structural leaf, the set of variable lists is filtered for frequency, positions are ordered by stability, and an inner trie is built to capture variable co-occurrences. Stable endpoints (either leaves or nodes with $cnt(n) - \sum_{c \in \mathrm{children}(n)} cnt(c) \geq \beta$ for a threshold $\beta$) receive distinct $pid$ assignments.
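The two construction stages can be sketched as follows; this is a simplified illustration under assumed conventions (the class and function names are hypothetical, and tokenization is reduced to a numeric-token heuristic rather than LogPrism's actual parser):

```python
import re

WILDCARD = "<*>"

class Node:
    def __init__(self):
        self.children = {}   # token -> Node
        self.cnt = 0         # number of log lines reaching this node
        self.pid = None      # path ID, set only for stable endpoints

def tokenize(line):
    """Replace purely numeric tokens with the wildcard (toy heuristic)."""
    return [WILDCARD if re.fullmatch(r"\d+", t) else t for t in line.split()]

def insert(root, tokens):
    """Stage 1: insert a tokenized line into the skeleton trie."""
    node = root
    node.cnt += 1
    for tok in tokens:
        node = node.children.setdefault(tok, Node())
        node.cnt += 1

def assign_pids(root, beta):
    """Stage 2 (endpoint step): give a distinct pid to each stable endpoint,
    i.e. a leaf, or a node whose count exceeds its children's total by >= beta."""
    next_pid = [0]
    def visit(node):
        residual = node.cnt - sum(c.cnt for c in node.children.values())
        if not node.children or residual >= beta:
            node.pid = next_pid[0]
            next_pid[0] += 1
        for child in node.children.values():
            visit(child)
    visit(root)
    return next_pid[0]

root = Node()
for line in ["accepted uid 1001", "accepted uid 1002", "rejected uid 7"]:
    insert(root, tokenize(line))
print(assign_pids(root, beta=1))  # count of pids assigned
```

Isomorphic merging and the per-leaf variable subtrees are omitted here; the sketch only shows how counters accumulate along shared prefixes and how the $\beta$ threshold singles out stable endpoints.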

Compression replaces common (structure + variable) patterns with single $pid$ integers; less frequent "residual" variable tokens are handled separately.
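The encoding step reduces to a lookup that emits an integer for a matched pattern and keeps unmatched variables as residuals. A minimal sketch, where the pattern table is a hypothetical stand-in for a built URT:

```python
# Hypothetical pattern table: (structure, matched variable) -> pid.
# A None variable denotes a structure-only pattern.
PATTERNS = {
    ("session opened for user", "root"): 0,
    ("session opened for user", "alice"): 1,
    ("session closed for user", None): 2,
}

def compress_line(structure, variable):
    """Return (pid, residuals): prefer the exact (structure + variable)
    pattern, else fall back to the structure-only pattern plus a residual."""
    if (structure, variable) in PATTERNS:
        return PATTERNS[(structure, variable)], []
    if (structure, None) in PATTERNS:
        return PATTERNS[(structure, None)], [variable]
    return None, [structure, variable]   # no match: everything is residual

print(compress_line("session opened for user", "root"))  # -> (0, [])
print(compress_line("session closed for user", "bob"))   # -> (2, ["bob"])
```

The real system matches a full token path in the trie and emits the deepest matched $pid$; this sketch collapses that traversal into a dictionary lookup to isolate the encoding idea.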

3. Algorithmic Formulation and Complexity

The formal algorithms governing URT construction and update in log compression include:

  • InsertSkeleton: Traverses or extends the skeleton trie, incrementing counters and aggregating variable lists.
  • MergeTries/IsomorphicMerge: Efficiently merges parallel tries and consolidates structurally isomorphic branches by child-signature grouping.
  • BuildVariableSubtrees: Orders variable positions by stability, builds variable-value tries, applies frequency thresholding, and assigns stable $pid$s.
  • CompressLog: For a log line, matches its structure and variable path, emits the deepest matched $pid$, and collects any unmatched variables as residuals.

The time complexity for preprocessing is $O(N(M + V) + P(M + V \log V))$ (with $N$ log lines, $M$ tokens per line, $V$ variable positions per line, and $P$ paths), reducing to $O(N)$ when $M$ and $V$ are constant. Query time per log decompression is $O(M + V)$ (Liu et al., 24 Jan 2026).

4. Quantification of Redundancy and Compression Metrics

In both theoretical and applied contexts, the effectiveness of the URT is assessed by quantifying the capture of redundancy and resultant compression ratio.

  • Compression Ratio: For $|L_{\mathrm{orig}}|$ original bytes and $|L_{\mathrm{comp}}|$ compressed bytes, the compression ratio is:

$$\mathrm{CR} = \frac{|L_{\mathrm{orig}}|}{|L_{\mathrm{comp}}|}$$

  • For each $pid$-pattern $p$ with length $\ell(p)$ and frequency $f(p)$, the encoded redundancy is

$$R(p) = (\ell(p) - 1) \times f(p) \times E_{\text{token}}$$

where $E_{\text{token}}$ is the mean token byte-length.

  • The overall gain is

$$G = \sum_{p} \left( \ell(p)\, E_{\text{token}} - E_{\mathrm{PID}} \right) f(p)$$

with $E_{\mathrm{PID}}$ the byte-length of a stored $pid$.
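These metrics are direct arithmetic; a toy computation with assumed byte costs (the numbers are illustrative, not taken from the paper):

```python
def compression_ratio(orig_bytes, comp_bytes):
    """CR = |L_orig| / |L_comp|."""
    return orig_bytes / comp_bytes

def encoded_redundancy(length, freq, e_token):
    """R(p) = (l(p) - 1) * f(p) * E_token."""
    return (length - 1) * freq * e_token

def overall_gain(patterns, e_token, e_pid):
    """G = sum over p of (l(p) * E_token - E_PID) * f(p)."""
    return sum((l * e_token - e_pid) * f for l, f in patterns)

E_TOKEN, E_PID = 8, 4            # assumed mean token bytes and pid bytes
patterns = [(5, 100), (3, 40)]   # hypothetical (length, frequency) pairs

print(encoded_redundancy(5, 100, E_TOKEN))       # 3200
print(overall_gain(patterns, E_TOKEN, E_PID))    # 3600 + 800 = 4400
print(compression_ratio(10_000, 2_500))          # 4.0
```

The $-1$ in $R(p)$ and the $-E_{\mathrm{PID}}$ in $G$ both account for the cost of the stored identifier: a pattern only pays off when the bytes it replaces exceed the bytes its $pid$ occupies.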

Empirical evaluation shows the URT-based LogPrism achieves superior compression ratios, with improvements of 4.7% to 80.9% over prior baselines on 13 of 16 datasets and throughput as high as 29.87 MB/s (up to $2.33\times$ faster than competitors) (Liu et al., 24 Jan 2026).

5. Illustrative Example and Pattern Encoding

The operational logic of the URT is exemplified in the case of compressing six similar log lines from an sshd daemon (Liu et al., 24 Jan 2026). The skeleton trie captures static structure, variable subtrees enumerate frequent variable co-occurrences, and stable $pid$s efficiently represent recurring structured patterns. For instance:

  • Paths such as [root] → "Jul" → "10" → ... → "rhost=" → "<*>" → "uid=" → "<*>" → "euid=" → "<*>" define a skeleton.
  • Variables (e.g., user, rhost, uid, euid) are further organized in frequency-ordered subtrees.
  • Stable endpoints are assigned $pid$s; logs that match these can be encoded by these identifiers alone, reducing storage cost.

Residuals—those variables not part of a frequent enough pattern—are left for residual processing.

6. Significance and Context

The URT formalism has dual conceptual and practical significance. In information theory, it provides a constructive, maximum-entropy-consistent decomposition of multivariate mutual information fully aligned with the Williams–Beer redundancy lattice, guaranteeing nonnegativity, symmetry, self-redundancy, and monotonicity. Under mild assumptions, these decompositions converge to the “true” nonnegative multivariate decomposition when no strictly positive synergy exists within the preserved marginals (Chicharro, 2017).

In applied data compression, the URT is a central innovation that unifies structure and variable encoding, efficiently capturing high-frequency (structure + variable) patterns for direct integer encoding. This joint modeling drastically enhances both compression ratio and throughput by front-loading pattern encoding and minimizing the computational burden downstream (Liu et al., 24 Jan 2026). The data structure is amenable to parallel construction, supports sublinear space in repetitive datasets, and has quantifiable redundancy-capture guarantees.

The convergence of tree-based redundancy decompositions and trie-based pattern compression under the URT umbrella suggests that the unification of structural and contextual redundancy is both an information-theoretically grounded and an empirically validated paradigm.
