
Entropy-Normalized Serialization

Updated 10 February 2026
  • Entropy-normalized serialization is a lossless compression technique that represents a message by its lexicographic index among all symbol permutations matching the observed frequencies.
  • It employs Combinatorial Entropy Encoding (CEE), an integer-based method that achieves near-optimal compression within a logarithmic overhead without relying on fractional arithmetic.
  • The method is practical for text, genomic, and packet header data, though it requires transmitting symbol counts and precomputing multinomial tables for efficient encoding and decoding.

Entropy-normalized serialization refers to a class of lossless data compression techniques in which a complete message is represented by its lexicographic index among all possible symbol permutations with fixed symbol counts, resulting in a bitstream whose length closely tracks the message length times its empirical Shannon entropy. Combinatorial Entropy Encoding (CEE) is the canonical methodology underpinning entropy-normalized serialization, providing an encoding that is purely integer-based, achieves optimal compression up to a logarithmic term, and eliminates the need for fractional arithmetic or explicit source models. In CEE, the compressed output consists of the lexicographic index (under the multinomial enumeration) plus a vector of symbol frequencies, which together allow the original message to be reconstructed uniquely (Siddique, 2017).

1. Formalism and Mathematical Foundations

Let the alphabet be $A = \{\alpha_1, \alpha_2, \ldots, \alpha_t\}$, ordered so that $\alpha_1 < \alpha_2 < \cdots < \alpha_t$, and let the message $s = s_1 s_2 \ldots s_n$ contain $f_i$ copies of $\alpha_i$, so that $\sum_{i=1}^t f_i = n$. The total number of distinct permutations is given by the multinomial coefficient:

$$\binom{n}{f_1, f_2, \ldots, f_t} = \frac{n!}{f_1!\, f_2! \cdots f_t!}$$

Assigning to $s$ its zero-based lexicographic rank $L(s)$ among all such permutations, one obtains the compact index for entropy-normalized serialization. The value $L(s)$ is computed as

$$L(s) = \sum_{i=1}^{n} \sum_{\substack{\beta \in A \\ \beta < s_i \\ f_\beta^{(i)} > 0}} \binom{n-i}{f_1^{(i)}, \ldots, f_{\operatorname{index}(\beta)}^{(i)} - 1, \ldots, f_t^{(i)}}$$

with $f_j^{(i)}$ representing the count of $\alpha_j$ remaining before placing $s_i$.
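For short messages the definition of $L(s)$ can be checked by brute force: list every distinct arrangement, sort the list, and look up the message. The following is a minimal Python sketch of that check; the function names are illustrative and not taken from the source, and the enumeration is only feasible for small $n$.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def multinomial(counts):
    """Number of distinct arrangements for the given symbol counts."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)  # every partial quotient is an integer
    return total


def rank_by_enumeration(message):
    """Zero-based lexicographic rank, found by listing all arrangements.

    Brute force: cost grows as n!, so use it only as a reference check.
    """
    arrangements = sorted(set(permutations(message)))
    return arrangements.index(tuple(message))


msg = "ABCA"
print(multinomial(Counter(msg).values()))  # 12 arrangements of A, A, B, C
print(rank_by_enumeration(msg))            # 3: AABC, AACB, ABAC come first
```

Symbols are compared with Python's default ordering, which plays the role of the alphabet order assumed in the formula.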

In the binary case ($t = 2$), the calculation simplifies to:

$$L(s) = \sum_{i=0}^{n-1} b_i \binom{i}{k_i}, \qquad k_i = \sum_{j=0}^{i} b_j$$

where $b_i$ is the $i$th bit, indexed from the least significant position, and $\binom{i}{k_i} = 0$ whenever $i < k_i$.
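A minimal sketch of the binary case, assuming the combinatorial-number-system form in which each 1 bit at right-indexed position $i$ contributes $\binom{i}{k}$, with $k$ the number of ones at positions 0 through $i$. The function name is hypothetical.

```python
from math import comb


def binary_rank(bits):
    """Lexicographic rank of a bit string among strings with the same number of ones.

    The last character of `bits` is b_0, the least significant position.
    """
    rank = 0
    ones_seen = 0
    for i, bit in enumerate(reversed(bits)):  # i = 0 is the rightmost bit
        if bit == "1":
            ones_seen += 1
            rank += comb(i, ones_seen)  # math.comb returns 0 when i < ones_seen
    return rank


print(binary_rank("00111"))  # 0: the smallest string with three ones
print(binary_rank("11100"))  # 9: the largest of the C(5, 3) = 10 strings
```

Only the positions holding a 1 contribute, so the loop performs one table lookup (here a `math.comb` call) per set bit.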

2. Encoding and Decoding Algorithms

CEE relies on iterative computation of the lexicographic index for encoding, and the inversion of this process for decoding.

Encoding

  1. Initialize $L = 0$ and the counter array $f = (f_1, \ldots, f_t)$.
  2. For each symbol $s_i$, add for every $\beta < s_i$ with $f_\beta > 0$:

$$L \leftarrow L + \binom{n-i}{f_1, \ldots, f_\beta - 1, \ldots, f_t}$$

  3. Decrement $f_{s_i}$.
  4. Return $L$.
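The encoding steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the reference implementation: the name `cee_encode` is hypothetical, symbols are ordered by Python's default comparison, and each multinomial term is derived from a factorial table on the fly where a table-driven encoder would simply look it up.

```python
from collections import Counter
from math import factorial


def cee_encode(message):
    """Return (index, counts) for a message, following the four encoding steps."""
    n = len(message)
    # Step 1: index starts at zero; counts are kept in alphabet order.
    counts = dict(sorted(Counter(message).items()))
    header = dict(counts)  # unmodified copy, sent as side information
    fact = [factorial(k) for k in range(n + 1)]  # stand-in for precomputed tables
    index = 0
    for i, symbol in enumerate(message, start=1):
        # Step 2: add one term for every smaller symbol that is still available.
        for beta in counts:
            if beta >= symbol:
                break  # counts is sorted, so no smaller symbols remain
            if counts[beta] == 0:
                continue
            counts[beta] -= 1  # pretend beta is placed at position i
            term = fact[n - i]
            for c in counts.values():
                term //= fact[c]
            counts[beta] += 1
            index += term
        # Step 3: consume the symbol actually present at position i.
        counts[symbol] -= 1
    # Step 4: the accumulated value is the lexicographic index.
    return index, header


print(cee_encode("ABCA"))  # (3, {'A': 2, 'B': 1, 'C': 1})
```

Python integers are arbitrary precision, so the index may exceed a machine word without special handling; a fixed-width implementation would need a big-integer type for long blocks.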

Decoding

  1. Initialize as above.
  2. For each symbol position, iterate over $\beta \in A$ in increasing order with $f_\beta > 0$. Subtract the multinomial term $M_\beta$ from $L$ until $L < M_\beta$, where $M_\beta$ is the term as above.
  3. Set $s_i = \beta$, update $L$ and counts.
  4. Continue until all symbols are decoded.

This process involves no multiplications at encode-time and relies on precomputed integer multinomials.
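Decoding inverts the accumulation: at each position the decoder walks the alphabet in order, subtracting the number of messages that would start with each candidate until the index falls inside a candidate's block. A hedged Python sketch follows; the names `arrangements` and `cee_decode` are assumptions for illustration, and the terms are computed directly instead of being read from a table.

```python
from math import factorial


def arrangements(counts):
    """Multinomial coefficient for a dict of remaining symbol counts."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total


def cee_decode(index, counts):
    """Rebuild the message (a string) from its index and symbol counts."""
    remaining = dict(sorted(counts.items()))  # alphabet order
    out = []
    for _ in range(sum(remaining.values())):
        for beta in remaining:
            if remaining[beta] == 0:
                continue
            remaining[beta] -= 1
            term = arrangements(remaining)  # messages that place beta here
            if index < term:
                out.append(beta)  # beta is the symbol; it stays consumed
                break
            index -= term         # skip the whole block of messages
            remaining[beta] += 1  # restore and try the next symbol
        else:
            raise ValueError("index out of range for these counts")
    return "".join(out)


print(cee_decode(3, {"A": 2, "B": 1, "C": 1}))  # ABCA
```

The inner loop visits at most $t$ candidates per output symbol, which is where the linear dependence on alphabet size comes from.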

3. Efficiency, Computational Complexity, and Operational Trade-Offs

CEE requires $O(nt)$ table lookups per message of length $n$ over an alphabet of size $t$, reducing to $O(n)$ in the binary case. Per-symbol operations are integer additions, subtractions, and array indexing. Space complexity is dominated by storage for the factorial or multinomial tables, which are precomputed once (Pascal's triangle, with $O(n^2)$ entries, in the binary case).

| Alphabet | Encoding/Decoding Complexity | Memory (Precomputed Tables) |
| Arbitrary | $O(nt)$ | Factorial or multinomial tables |
| Binary | $O(n)$ | Pascal's triangle, $O(n^2)$ entries |
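For the binary case, the precomputed table can be built with additions only. The sketch below (the function name is illustrative) constructs rows 0 through $n$ of Pascal's triangle, so that `table[i][k]` holds $\binom{i}{k}$ for $k \le i$.

```python
def pascal_table(n):
    """Rows 0..n of Pascal's triangle, built with integer additions only."""
    table = [[1]]
    for i in range(1, n + 1):
        prev = table[-1]
        # Interior entries are sums of the two entries above them.
        row = [1] + [prev[k - 1] + prev[k] for k in range(1, i)] + [1]
        table.append(row)
    return table


table = pascal_table(5)
print(table[5][3])                      # 10
print(sum(len(row) for row in table))   # 21 stored entries for n = 5
```

The table holds $(n+1)(n+2)/2$ integers, which is the quadratic memory cost noted for the binary alphabet; the entries themselves grow to roughly $n$ bits each.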

No explicit entropy model is required, and the side-information cost ($O(t \log n)$ bits for the counts $f_1, \ldots, f_t$) becomes negligible relative to the payload as $n \to \infty$. Large alphabet sizes increase decoding cost, because the decoder branches over up to $t$ symbols per output.
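To make the amortization concrete, the sketch below estimates the header size under one assumed layout: each of the $t$ counts is stored in a fixed-width field wide enough to hold any value from 0 to $n$. That layout and the function name are assumptions for illustration, not something the source specifies.

```python
def header_bits(n, t):
    """Bits for t symbol counts, each in a fixed-width field holding 0..n.

    The fixed-width layout is an illustrative assumption.
    """
    return t * max(1, n.bit_length())


# Header overhead per message symbol for a four-symbol alphabet.
for n in (64, 1024, 16384):
    print(n, header_bits(n, 4), header_bits(n, 4) / n)
```

The per-symbol overhead shrinks as the block grows because the header grows with $\log n$ while the payload grows with $n$.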

4. Compression Bound and Relation to Shannon Entropy

CEE achieves a theoretical code length of $\lceil \log_2 \binom{n}{f_1, \ldots, f_t} \rceil$ bits for message $s$. By combinatorial analysis, for fixed $t$,

$$\log_2 \binom{n}{f_1, \ldots, f_t} = n H(p) - O(\log n)$$

where $p_i = f_i / n$, $H(p) = -\sum_{i=1}^{t} p_i \log_2 p_i$. The redundancy per symbol vanishes as $n \to \infty$, guaranteeing asymptotic optimality matching the Shannon entropy. The CEE index uses strictly fewer than $n H(p)$ bits even for finite $n$, outperforming Huffman and fixed-length codes in such settings.
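The relation between the index length and the entropy bound can be checked numerically. The sketch below, with hypothetical helper names, computes the exact index length $\lceil \log_2 \binom{n}{f_1, \ldots, f_t} \rceil$ with integer arithmetic and the empirical bound $n H(p)$ in floating point; it covers the index only, not the transmitted counts.

```python
from math import factorial, log2


def index_bits(counts):
    """Exact ceil(log2(multinomial)) using integer arithmetic only."""
    m = factorial(sum(counts))
    for c in counts:
        m //= factorial(c)
    return (m - 1).bit_length()  # equals ceil(log2(m)) for m >= 1


def entropy_bits(counts):
    """n * H(p) with p_i = f_i / n, in bits."""
    n = sum(counts)
    return -sum(c * log2(c / n) for c in counts if c > 0)


for counts in ([3, 1, 2], [2, 3], [50, 30, 20]):
    print(counts, index_bits(counts), round(entropy_bits(counts), 3))
```

For each of these count vectors the index length comes out below $n H(p)$, and the gap grows only logarithmically with $n$.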

5. Detailed Worked Examples

Binary Case: For $s = 11010$ with $n = 5$, $f_0 = 2$, $f_1 = 3$, and $b_0$ to $b_4$ indexed from the right:

  • $b_0$ ($= 0$): no addition, $L = 0$.
  • $b_1$ ($= 1$): add $\binom{1}{1} = 1$, $L = 1$.
  • $b_2$ ($= 0$): no addition, $L = 1$.
  • $b_3$ ($= 1$): add $\binom{3}{2} = 3$, $L = 4$.
  • $b_4$ ($= 1$): add $\binom{4}{3} = 4$, $L = 8$.

Total $L(s) = 8$. The bitstream “1000” and $(f_0, f_1) = (2, 3)$ suffice to reconstruct $s = 11010$.
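Reconstruction in the binary case can be sketched as a greedy walk from the most significant position: a 1 is emitted whenever the binomial term for that position still fits into the remaining index. The function name is hypothetical, and the sketch assumes the decoder knows the length and the number of ones from the transmitted counts.

```python
from math import comb


def binary_unrank(index, n, ones):
    """Bit string of length n with `ones` set bits and the given lexicographic rank."""
    bits = []
    k = ones  # ones still to be placed
    for i in range(n - 1, -1, -1):  # leftmost (most significant) position first
        if k > 0 and comb(i, k) <= index:
            bits.append("1")
            index -= comb(i, k)
            k -= 1
        else:
            bits.append("0")
    return "".join(bits)


print(binary_unrank(8, 5, 3))  # 11010
```

With index 8 (the bitstream “1000”), five bits and three ones, the walk subtracts 4, then 3, skips position 2, and subtracts 1, which mirrors the additions of the worked example in reverse.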

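Non-binary Case: The message "BANANA" ($n = 6$) over $A = \{\mathrm{A}, \mathrm{B}, \mathrm{N}\}$ with $(f_\mathrm{A}, f_\mathrm{B}, f_\mathrm{N}) = (3, 1, 2)$ yields a lexicographic index of 34 among 60 possible permutations: the 30 permutations beginning with A precede it, and among the 10 beginning with B the suffix ANANA has rank 4. Serialized, this requires $\lceil \log_2 60 \rceil = 6$ bits (e.g., “100010”) plus symbol counts.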
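The serialized payload is the index written in a fixed number of binary digits determined by the number of permutations. A small sketch with an illustrative function name is given below; with the ordering A < B < N, evaluating the rank formula for BANANA gives 30 + 3 + 1 = 34, which is the value used in the usage line.

```python
def index_to_bits(index, total):
    """Fixed-width binary string for an index in range(total)."""
    if not 0 <= index < total:
        raise ValueError("index out of range")
    width = (total - 1).bit_length()  # ceil(log2(total)) for total >= 1
    if width == 0:
        return ""  # a single possible message needs no payload bits
    return format(index, "b").zfill(width)


print(index_to_bits(34, 60))  # 100010
print(index_to_bits(8, 10))   # 1000
```

The decoder recovers the width from the counts alone, since they determine the number of permutations, so no separate length field is needed for the payload.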

6. Comparative Analysis with Huffman and Arithmetic Coding

CEE maps the entire message to a single integer using purely integer operations, contrasting with:

  • Huffman Coding: Assigns static, integer-length codewords; suffers inefficiency for non-dyadic distributions and cannot utilize fractional bits.
  • Arithmetic Coding: Maps to a fractional interval via repeated multiplication and renormalization, requiring real arithmetic, high precision, or scaled integer arithmetic.

CEE, by contrast, performs only additions and avoids multiplication at encode time. It does not require prior knowledge or estimation of probabilities but operates on observed symbol frequencies per block.
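As a rough illustration of the difference from Huffman coding, the sketch below compares payload sizes for a given vector of symbol counts. The helper names are hypothetical, and both figures exclude side information (the count vector for CEE, the code table for Huffman), so this is not a complete comparison.

```python
import heapq
from math import factorial


def huffman_total_bits(counts):
    """Total payload bits of a Huffman code built from the given counts."""
    heap = [c for c in counts if c > 0]
    if len(heap) == 1:
        return heap[0]  # a lone symbol still needs one bit per occurrence
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, merged)
    return total


def cee_index_bits(counts):
    """ceil(log2(multinomial)): bits needed for the CEE index."""
    m = factorial(sum(counts))
    for c in counts:
        m //= factorial(c)
    return (m - 1).bit_length()


counts = [3, 1, 2]  # e.g. the letters of BANANA
print(huffman_total_bits(counts), cee_index_bits(counts))  # 9 6
```

Huffman spends a whole number of bits on every symbol, whereas the index spreads fractional costs across the block, which accounts for the three-bit difference in this small case.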

7. Applications, Limitations, and Open Questions

Applications include lossless compression scenarios for small alphabets (such as textual, genomic, or packet header data), embedded or hardware systems lacking floating-point units, and any situation favoring efficient block-adaptive coding.

Limitations are:

  • Requirement to transmit symbol counts per block (amortized for large $n$).
  • Memory overhead for binomial/multinomial tables when $n$ is large.
  • Linear cost in the alphabet size during decoding, impacting scalability for large $t$.
  • Not suited to streaming contexts without buffering, as symbol counts must be known per block.

Open problems include unifying the side-information and index streams to remove header redundancy, and adaptation for Markov or context-based sources beyond IID models, suggesting the need for hybrid or hierarchical models atop CEE (Siddique, 2017).

References (1)
