Entropy-Normalized Serialization

Updated 10 February 2026

Entropy-normalized serialization is a lossless compression technique that represents a message by its lexicographic index among all symbol permutations matching the observed frequencies.
It employs Combinatorial Entropy Encoding (CEE), an integer-based method that achieves near-optimal compression within a logarithmic overhead without relying on fractional arithmetic.
The method is practical for text, genomic, and packet header data, though it requires transmitting symbol counts and precomputing multinomial tables for efficient encoding and decoding.

Entropy-normalized serialization refers to a class of lossless data compression techniques in which a complete message is represented by its lexicographic index among all possible symbol permutations with fixed symbol counts, resulting in a bitstream whose length approaches the message's empirical Shannon entropy. Combinatorial Entropy Encoding (CEE) is the canonical method underpinning entropy-normalized serialization: it is purely integer-based, achieves optimal compression up to a logarithmic term, and eliminates the need for fractional arithmetic or explicit source models. In CEE, the compressed output consists of the lexicographic index (under the multinomial enumeration) plus a vector of symbol frequencies, which together allow unique reconstruction of the original message (Siddique, 2017).

1. Formalism and Mathematical Foundations

Let the alphabet be $A = \{\alpha_1, \alpha_2, \ldots, \alpha_t\}$, with message $s = s_1 s_2 \ldots s_n$ containing $f_i$ copies of $\alpha_i$ such that $\sum_{i=1}^t f_i = n$. The total number of distinct permutations is given by the multinomial coefficient $\binom{n}{f_1, f_2, \ldots, f_t} = \frac{n!}{f_1! f_2! \cdots f_t!}$. Assigning to $s$ its zero-based lexicographic rank $L(s)$ among all such permutations, one obtains the compact index for entropy-normalized serialization. The value $L(s)$ is computed as

$L(s) = \sum_{i=1}^{n} \sum_{\substack{\beta \in A,\ \beta < s_i \\ f_\beta^{(i)} > 0}} \binom{n-i}{f_1^{(i)}, \ldots, f_{\mathrm{index}(\beta)}^{(i)} - 1, \ldots, f_t^{(i)}}$

with $f_j^{(i)}$ representing the symbol counts remaining before placing $s_i$.
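The size of the index space, that is, the multinomial coefficient above, can be computed with exact integer arithmetic. The following is a minimal sketch; the function name `permutation_count` is an illustrative assumption, not part of the source.

```python
from collections import Counter
from math import factorial


def permutation_count(message):
    """Number of distinct arrangements of the symbols in `message`."""
    total = factorial(len(message))
    for count in Counter(message).values():
        # Exact integer division: every partial quotient is an integer.
        total //= factorial(count)
    return total


if __name__ == "__main__":
    print(permutation_count("BANANA"))  # counts (3, 1, 2) over six symbols
```

Every valid rank $L(s)$ lies in the range from 0 to this count minus one.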

In the binary case ( $t = 2$ ), the calculation simplifies to: $L(s) = \sum_{i=1}^{n} b_i \binom{i-1}{\sum_{j=1}^{i} b_j}$ where $b_i$ is the $i$th bit, indexed from the least significant position.
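One way to compute the binary index is the combinatorial number system: each 1 bit at position $i$ (counted from 1 at the least significant end) contributes a binomial coefficient whose lower entry is the number of ones seen up to and including that position. The sketch below assumes the message is given as a string of `"0"` and `"1"` characters, most significant bit first; the name `binary_rank` is hypothetical.

```python
from math import comb


def binary_rank(bits):
    """Zero-based rank of `bits` among strings with the same number of ones."""
    rank = 0
    ones_seen = 0
    # Walk from the rightmost (least significant) bit, positions 1..n.
    for i, bit in enumerate(reversed(bits), start=1):
        if bit == "1":
            ones_seen += 1
            rank += comb(i - 1, ones_seen)  # additions of table values only
    return rank


if __name__ == "__main__":
    print(binary_rank("11010"))
```

In a table-driven implementation, `comb` would be replaced by a lookup into a precomputed Pascal's triangle.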

2. Encoding and Decoding Algorithms

CEE relies on iterative computation of the lexicographic index for encoding, and the inversion of this process for decoding.

Encoding

Initialize $L = 0$ and the counter array $(f_1, \ldots, f_t)$ with the symbol counts of $s$.
For each symbol $s_i$, $i = 1, \ldots, n$, add for every $\beta < s_i$ with $f_\beta > 0$:

$L \leftarrow L + \binom{n-i}{f_1, \ldots, f_{\mathrm{index}(\beta)} - 1, \ldots, f_t}$

Decrement $f_{\mathrm{index}(s_i)}$.
Return $L$.
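The encoding steps above can be sketched as follows. This is an illustrative sketch rather than the reference implementation: it scans the message from the left, orders symbols by Python's default comparison, and recomputes each multinomial from factorials instead of reading a precomputed table. The names `multinomial` and `cee_encode` are assumptions.

```python
from math import factorial


def multinomial(counts):
    """Number of arrangements of a multiset with the given symbol counts."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


def cee_encode(message):
    """Return (index, freqs): the zero-based rank and the symbol counts."""
    alphabet = sorted(set(message))
    counts = {a: message.count(a) for a in alphabet}
    freqs = dict(counts)  # side information transmitted with the index
    index = 0
    for symbol in message:
        for beta in alphabet:
            if beta >= symbol:
                break
            if counts[beta] > 0:
                counts[beta] -= 1  # tentatively place beta at this position
                index += multinomial(counts.values())
                counts[beta] += 1
        counts[symbol] -= 1  # the actual symbol is now consumed
    return index, freqs


if __name__ == "__main__":
    print(cee_encode("BANANA"))
```

Scanned from the left in this way, "BANANA" receives index 34; the reversed string "ANANAB" receives 22, so the scan direction is part of the convention that encoder and decoder must share.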

Decoding

Initialize the counter array as above and set $L$ to the received index.
For each symbol position $i$, iterate over $\beta \in A$ in increasing order with $f_\beta > 0$. Subtract the multinomial term from $L$ until $L < M_\beta$, where $M_\beta$ is the term as above.
Set $s_i = \beta$, update $L$ and counts.
Continue until all symbols are decoded.

This process involves no multiplications at encode-time and relies on precomputed integer multinomials.
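A matching decoder can be sketched under the same assumptions (left-to-right scan, symbols ordered by default comparison, multinomials recomputed rather than looked up). The names `multinomial` and `cee_decode` are hypothetical, and the index is assumed to be valid for the given counts.

```python
from math import factorial


def multinomial(counts):
    """Number of arrangements of a multiset with the given symbol counts."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


def cee_decode(index, freqs):
    """Rebuild the message from its zero-based rank and symbol counts."""
    counts = dict(freqs)
    alphabet = sorted(counts)
    out = []
    for _ in range(sum(counts.values())):
        for beta in alphabet:
            if counts[beta] == 0:
                continue
            counts[beta] -= 1  # tentatively place beta at this position
            block = multinomial(counts.values())
            if index < block:
                out.append(beta)  # the rank falls inside beta's block
                break
            index -= block  # skip every message that has beta here
            counts[beta] += 1
    return "".join(out)


if __name__ == "__main__":
    print(cee_decode(34, {"A": 3, "B": 1, "N": 2}))
```

The inner loop visits at most one candidate per alphabet symbol, which is where the linear dependence of decoding cost on alphabet size comes from.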

3. Efficiency, Computational Complexity, and Operational Trade-Offs

CEE requires $O(nt)$ table lookups per message of length $n$ over an alphabet of size $t$, reducing to $O(n)$ in the binary case. Per-symbol operations are integer additions, subtractions, and array indexing. Space complexity is dominated by storage for the precomputed factorial or multinomial tables; in the binary case a Pascal's triangle with $O(n^2)$ entries suffices.

Alphabet	Encoding/Decoding Lookups	Precomputed Tables
Arbitrary ( $t$ symbols)	$O(nt)$	Factorial or multinomial tables
Binary	$O(n)$	Pascal's triangle, $O(n^2)$ entries
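For the binary case, the table can be built with additions only. The sketch below is one possible layout (a list of rows); the name `pascal_table` is an assumption.

```python
def pascal_table(n):
    """Rows 0..n of Pascal's triangle, built using additions only."""
    table = [[1]]
    for row in range(1, n + 1):
        prev = table[-1]
        inner = [prev[k - 1] + prev[k] for k in range(1, row)]
        table.append([1] + inner + [1])
    return table


if __name__ == "__main__":
    table = pascal_table(5)
    print(table[5])     # binomial coefficients for n = 5
    print(table[4][3])  # the value of "4 choose 3"
```

A lookup `table[i][k]` is valid for `0 <= k <= i`; a coefficient with `k > i` is zero and has to be handled by the caller.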

No explicit entropy model is required, and the side-information cost (on the order of $t \log_2 n$ bits for the counts $f_1, \ldots, f_t$) becomes negligible relative to the index as $n \to \infty$. Large alphabet sizes increase decoding cost, because the decoder branches over up to $t$ symbols per output.
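To illustrate how the header shrinks relative to the index, the sketch below assumes each count is sent as a fixed-width integer able to hold any value from 0 to $n$ (one possible header layout, not one prescribed by the source) and compares it with the index size for a balanced binary message. Both function names are hypothetical.

```python
from math import comb


def header_bits(n, t):
    """Bits for t counts, each a fixed-width integer in the range 0..n."""
    return t * n.bit_length()


def balanced_index_bits(n):
    """Index bits for a binary message with n // 2 ones."""
    return (comb(n, n // 2) - 1).bit_length()


if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(n, header_bits(n, 2), balanced_index_bits(n))
```

The header grows logarithmically in $n$ while the index grows roughly linearly, so the ratio between them falls as blocks get longer.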

4. Compression Bound and Relation to Shannon Entropy

CEE achieves a theoretical code length of $\left\lceil \log_2 \binom{n}{f_1, f_2, \ldots, f_t} \right\rceil$ bits for the index of message $s$. By combinatorial analysis,

$n H(p) - O(t \log n) \le \log_2 \binom{n}{f_1, f_2, \ldots, f_t} \le n H(p)$

where $p_i = f_i / n$ and $H(p) = -\sum_{i=1}^{t} p_i \log_2 p_i$. The redundancy per symbol vanishes as $n \to \infty$, guaranteeing asymptotic optimality matching the Shannon entropy. Before rounding up, the index needs strictly fewer than $n H(p)$ bits even for finite $n$ (whenever at least two distinct symbols occur), outperforming Huffman and fixed-length codes in such settings.
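The bound can be checked numerically on small messages. The sketch below compares the number of bits needed for the index with $n$ times the empirical entropy; the function names are illustrative assumptions, and the side information is not counted.

```python
from collections import Counter
from math import factorial, log2


def index_bits(message):
    """Bits needed to hold any rank in 0..M-1, M being the permutation count."""
    m = factorial(len(message))
    for c in Counter(message).values():
        m //= factorial(c)
    return (m - 1).bit_length()


def empirical_entropy_bits(message):
    """n * H(p) in bits, with p_i = f_i / n taken from the message itself."""
    n = len(message)
    return -sum(c * log2(c / n) for c in Counter(message).values())


if __name__ == "__main__":
    for msg in ("BANANA", "11010"):
        print(msg, index_bits(msg), round(empirical_entropy_bits(msg), 3))
```

For "BANANA" the index fits in 6 bits against an empirical-entropy figure of about 8.75 bits; the gap is the logarithmic term in the bound.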

5. Detailed Worked Examples

Binary Case: For $s = 11010$ with $n = 5$, $f_0 = 2$ zeros, $f_1 = 3$ ones, and bits $b_1$ to $b_5$ indexed from the right, each 1 bit at position $i$ adds $\binom{i-1}{c}$, where $c$ is the number of ones seen so far including that bit:

$i = 1$ ( $b_1 = 0$ ): no addition, $L = 0$.
$i = 2$ ( $b_2 = 1$ ): add $\binom{1}{1} = 1$, $L = 1$.
$i = 3$ ( $b_3 = 0$ ): no addition, $L = 1$.
$i = 4$ ( $b_4 = 1$ ): add $\binom{3}{2} = 3$, $L = 4$.
$i = 5$ ( $b_5 = 1$ ): add $\binom{4}{3} = 4$, $L = 8$.

Total $L(s) = 8$. The bitstream “1000” and the counts $(f_0, f_1) = (2, 3)$ suffice to reconstruct $s$.

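Non-binary Case: The message "BANANA" ( $n = 6$ ) over $A = \{\mathrm{A}, \mathrm{B}, \mathrm{N}\}$ with $(f_\mathrm{A}, f_\mathrm{B}, f_\mathrm{N}) = (3, 1, 2)$ yields a lexicographic index of 22 among 60 possible permutations when the symbols are scanned from the right, as in the binary example; equivalently, 22 is the rank of the reversed string "ANANAB", while scanning from the left with the formula of Section 1 gives 34. Serialized, this requires $\lceil \log_2 60 \rceil = 6$ bits (e.g., “010110”) plus symbol counts.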
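Turning an index into a fixed-width bitstream and back can be sketched as below. The width is derived from the permutation count, which the decoder can recompute from the transmitted symbol counts; the function names are hypothetical.

```python
def index_to_bits(index, total):
    """Fixed-width binary string able to hold any rank in 0..total-1."""
    width = (total - 1).bit_length()
    if width == 0:
        return ""  # only one permutation exists, so no bits are needed
    return format(index, "0{}b".format(width))


def bits_to_index(bits):
    """Inverse of index_to_bits; the empty string maps to rank 0."""
    return int(bits, 2) if bits else 0


if __name__ == "__main__":
    print(index_to_bits(22, 60))
    print(index_to_bits(8, 10))
```

With 60 permutations the width is 6 bits, and with 10 permutations it is 4 bits, which matches the two worked examples.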

6. Comparative Analysis with Huffman and Arithmetic Coding

CEE maps the entire message to a single integer using purely integer operations, contrasting with:

Huffman Coding: Assigns static, integer-length codewords; suffers inefficiency for non-dyadic distributions and cannot utilize fractional bits.
Arithmetic Coding: Maps to a fractional interval via repeated multiplication and renormalization, requiring real arithmetic, high precision, or scaled integer arithmetic.

CEE, by contrast, performs only additions and avoids multiplication at encode time. It does not require prior knowledge or estimation of probabilities but operates on observed symbol frequencies per block.

7. Applications, Limitations, and Open Questions

Applications include lossless compression scenarios for small alphabets (such as textual, genomic, or packet header data), embedded or hardware systems lacking floating-point units, and any situation favoring efficient block-adaptive coding.

Limitations are:

Requirement to transmit symbol counts per block (amortized for large $n$).
Memory overhead for binomial/multinomial tables when $n$ is large.
Linear cost in the alphabet size during decoding, impacting scalability for large $t$.
Not suited to streaming contexts without buffering, as symbol counts must be known per block.

Open problems include unifying the side-information and index streams to remove header redundancy, and adaptation for Markov or context-based sources beyond IID models, suggesting the need for hybrid or hierarchical models atop CEE (Siddique, 2017).

References

Siddique (2017). Combinatorial Entropy Encoding.
