
Two-Class Partitioning of Source Sequences

Updated 28 January 2026
  • The paper presents a framework for two-class partitioning that optimally divides source sequences via thresholding based on probability masses, linking expected moments to Rényi entropy.
  • It introduces continuous size function relaxations and KKT-based optimization that improve compression rates and statistical testing performance through lower entropy measures.
  • Applications range from source coding and sequence modeling to machine learning data splits and efficient data structures, demonstrating broad practical and theoretical impact.

A two-class partitioning of source sequences refers to the division of either a source alphabet or a source-generated sequence into two subsets (classes) under criteria tailored to optimize specific information-theoretic or algorithmic objectives. This concept underpins a variety of tasks in source coding, sequence modeling, compression, universal hypothesis testing, combinatorial partitioning, and learning, with deep links to entropy minimization, rate-distortion theory, and statistical balancing. The definition and formulation of such partitions are context-dependent but often involve thresholding, optimization under constraints (e.g., budget, balance, minimal overlap), and are closely tied to fundamental information measures such as the Shannon and Rényi entropies.

1. Mathematical Formulation and Information-Theoretic Bounds

In the context of discrete memoryless sources, one canonical formulation is the two-class task-partitioning problem: given a source $X$ with finite alphabet and probability mass function $P$, partition the alphabet into two classes $A$ and $A^c$ by a mapping $V: X \to \{1,2\}$, and consider the class sizes $A(x) = \lvert \{x' : V(x') = V(x)\} \rvert$. The aim is often to minimize moments such as $\mathbb{E}[A(X)^\beta]$ for $\beta > 0$ under a cardinality constraint on the number of classes (here, at most two). By relaxing the partition to a continuous "size function" $\varphi: X \to (0,\infty)$ and employing standard KKT/Lagrangian methods, the optimal values are shown to relate to power means of $P(x)$, connecting directly to the Rényi entropy of order $\alpha = 1/(1+\beta)$:

$$\min \mathbb{E}[A(X)^\beta] \simeq 2^{-\beta} \left( \sum_x P(x)^\alpha \right)^{1/\alpha}.$$

In the original (discrete) setting, the optimal partition is given by thresholding $P(x)$, i.e., sorting probabilities and splitting at an appropriate threshold $\tau$, so $V^*(x) = 1$ if $P(x) > \tau$, else $V^*(x) = 2$. The information-theoretic lower bound for the expected moment is then, with $H_\alpha(P)$ the Rényi entropy,

$$(1/\beta) \log \mathbb{E}[A(X)^\beta] \geq H_\alpha(P) - \log 2,$$

which, in the limits $\alpha \to 1$ or $\beta \to 0$, reduces to the familiar Shannon entropy case (Kumar et al., 2019).
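As a concrete illustration of the thresholding rule, the sketch below (illustrative code, not from the cited paper; the pmf and $\beta$ are arbitrary toy values) enumerates all threshold splits of the sorted pmf, evaluates $\mathbb{E}[A(X)^\beta]$ for each, and compares the minimum against the Rényi lower bound:

```python
import math

def expected_moment(P_sorted, k, beta):
    """E[A(X)^beta] when the k most probable symbols form one class
    and the remaining n-k symbols form the other."""
    p_top = sum(P_sorted[:k])
    n = len(P_sorted)
    return p_top * k**beta + (1 - p_top) * (n - k)**beta

def best_threshold_split(P, beta):
    """Minimize E[A(X)^beta] over all threshold splits of the sorted pmf."""
    Ps = sorted(P, reverse=True)
    return min(expected_moment(Ps, k, beta) for k in range(1, len(Ps)))

def renyi_entropy(P, alpha):
    """Rényi entropy of order alpha != 1, in bits."""
    return math.log2(sum(p**alpha for p in P)) / (1.0 - alpha)

beta = 1.0
alpha = 1.0 / (1.0 + beta)
P = [0.5, 0.25, 0.125, 0.0625, 0.0625]
m = best_threshold_split(P, beta)       # optimal expected moment
bound = renyi_entropy(P, alpha) - 1.0   # H_alpha(P) - log2(2), in bits
```

For this pmf the optimal split places the two most probable symbols in one class, and $(1/\beta)\log_2$ of the minimum moment indeed lies above the Rényi bound.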

2. Sequence Partitioning for Compression and Modeling

Two-class partitioning extends naturally to full source sequences. For a sequence $X^N = (x_1, x_2, \dots, x_N)$ drawn i.i.d. from $P$, consider splitting at position $k$ to form two contiguous subsequences $S_1 = (x_1, \dots, x_k)$ and $S_2 = (x_{k+1}, \dots, x_N)$. Let $q_j^{(1)}$ and $q_j^{(2)}$ be the empirical symbol frequencies in $S_1$ and $S_2$, respectively. The per-sequence entropies are $H(S_1) = -\sum_j q_j^{(1)} \log_2 q_j^{(1)}$ and $H(S_2)$ analogously. Optimizing the average rate

$$H_{\text{avg}}(k) = \frac{k}{N} H(S_1) + \frac{N-k}{N} H(S_2)$$

over $k$ can strictly improve compression whenever each subsequence exhibits lower entropy than the full source ($H(S_i) < H(X)$). This typically occurs when the source is non-stationary and admits segments with more peaked or restricted empirical distributions (Alagoz, 2010).

Such entropic partitioning is provably beneficial, ensuring $H_{\text{avg}}(k) < H(X)$ under mild conditions, and is implementable by an $O(Nm)$ scan, where $m$ is the alphabet size.
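The scan can be sketched as follows (a minimal illustration with an invented toy sequence): running symbol counts for the left and right segments are updated by one symbol per step, and the two empirical entropies are recomputed over the $m$ symbols at each candidate split, giving $O(Nm)$ total work.

```python
from collections import Counter
import math

def entropy_of_counts(counts, total):
    """Empirical Shannon entropy (bits) of a table of symbol counts."""
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def best_split(x):
    """Return (k, H_avg) minimizing the weighted two-segment entropy.
    One count update per step, O(m) entropy recomputation: O(N*m) overall."""
    N = len(x)
    left, right = Counter(), Counter(x)
    best_k, best_h = None, float("inf")
    for k in range(1, N):                 # candidate split: x[:k] | x[k:]
        left[x[k - 1]] += 1
        right[x[k - 1]] -= 1
        h = (k / N) * entropy_of_counts(left, k) \
            + ((N - k) / N) * entropy_of_counts(right, N - k)
        if h < best_h:
            best_k, best_h = k, h
    return best_k, best_h

# Piecewise sequence: each half has a low-entropy empirical distribution.
seq = "aaabaaabab" + "cdccdddcdc"
k, h = best_split(seq)
full = entropy_of_counts(Counter(seq), len(seq))   # entropy without splitting
```

On this sequence the best split lands at the boundary between the two regimes, and $H_{\text{avg}}(k)$ falls well below the unsplit empirical entropy.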

3. Algorithmic Methods for Sequence and Alphabet Partitioning

Algorithmic approaches to two-class partitioning range from thresholding rules in task-partitioning (Kumar et al., 2019) to greedy or discrete search algorithms for block and sequence partitioning. For block-sum balancing, the problem reduces to selecting an index $m$ to split a sequence of values $a_1, \ldots, a_n$ (with $a_i \in [0,1]$) into two contiguous blocks so that $\lvert b(B_1) - b(B_2) \rvert \leq 1$, where $b(B_j)$ is the sum of block $B_j$. The optimal $m$ is found by considering the partial sums nearest to $S/2$, where $S = \sum_i a_i$, and can be computed in $O(n)$ time (Bárány et al., 2013).
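A minimal sketch of the $O(n)$ rule (the values are invented for illustration): walk the prefix sums once and stop at the one nearest half the total.

```python
def balanced_split(a):
    """Split a (each a_i in [0,1]) into two contiguous blocks whose sums
    differ by at most 1, by picking the prefix sum nearest S/2.  O(n) time."""
    S = sum(a)
    prefix, best_m, best_gap = 0.0, 0, float("inf")
    for m, v in enumerate(a, start=1):
        prefix += v
        gap = abs(prefix - S / 2)       # some prefix lies within 1/2 of S/2
        if gap < best_gap:
            best_m, best_gap = m, gap
    return best_m

a = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1, 0.5]
m = balanced_split(a)
b1, b2 = sum(a[:m]), sum(a[m:])         # block sums differ by at most 1
```

Because each $a_i \le 1$, consecutive prefix sums differ by at most 1, so some prefix lies within $1/2$ of $S/2$, which guarantees the two block sums differ by at most 1.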

In the domain of compressed data structures, two-class (and, more generally, k-class) partitioning forms the basis for efficient index and universe-based skipping schemes, e.g., partitioning a sorted integer sequence into fixed cardinality or universe "blocks" to facilitate fast random access and merging (Pibiri, 2019). The key design parameters—block size, partition criteria—translate into concrete time/space tradeoffs, with recursive universe-slicing offering a principled way to balance compression ratio and query speed.
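To make the blocking idea concrete, here is a deliberately uncompressed sketch (an illustration of the partitioning principle, not the encodings of the cited work): a sorted integer sequence is cut into fixed-cardinality blocks, and per-block maxima serve as a skip index so a membership query touches only one block.

```python
import bisect

class BlockedIndex:
    """Fixed-cardinality blocking of a sorted integer sequence.
    Per-block maxima form a skip index; a real partitioned index would
    additionally compress each block (e.g., relative to its local universe)."""

    def __init__(self, sorted_ints, block_size=128):
        self.blocks = [sorted_ints[i:i + block_size]
                       for i in range(0, len(sorted_ints), block_size)]
        self.maxima = [b[-1] for b in self.blocks]   # one upper bound per block

    def __contains__(self, x):
        j = bisect.bisect_left(self.maxima, x)       # first block with max >= x
        if j == len(self.blocks):
            return False
        block = self.blocks[j]
        i = bisect.bisect_left(block, x)
        return i < len(block) and block[i] == x

idx = BlockedIndex(list(range(0, 10000, 3)), block_size=64)
```

The block size directly realizes the time/space tradeoff mentioned above: larger blocks mean a smaller skip index but more work per query.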

4. Applications in Source–Channel Coding and Statistical Testing

In joint source–channel coding, two-class partitioning of the source sequence space (via type thresholding) enables improved expurgated error exponents. The method assigns each message to one of two classes (e.g., "rare" vs. "typical" types) and, crucially, employs separate i.i.d. codeword distributions $Q_1, Q_2$ per class. Optimizing the two-class partition (threshold $\gamma$) and the codeword distributions yields an achievable error exponent

$$E_{J,\mathrm{ex}}^{\mathrm{mc}}(t) = \sup_{\rho \ge 1} \left\{ \overline{E}_{\mathrm{x}}'(\{Q_1, Q_2\}; \rho) - t E_s(\rho; P_V) \right\}$$

that is at least as large as the best possible with a single-class (type-independent) codebook. Strict improvements are possible in non-binary cases, although the analysis is asymptotic and non-constructive in general (Moeini et al., 21 Jan 2026).

In universal hypothesis testing, two-class partitioning underpins classifiers for deciding whether two sequences arise from the same distribution. As established by Ziv, for stationary sources with vanishing memory, both fixed-length (FL) and variable-length (VL) universal classifiers achieve exponential error decay above a critical sequence length ($N_0 \approx 2^{R\ell}$), but the VL classifier, effectively realizing a data-dependent two-class split, uniformly dominates FL in finite-sample performance due to a larger mean Chernoff exponent (0909.4233).

5. Applications in Machine Learning: Sequence Data Splitting

Two-class sequence partitioning in machine learning is motivated by the need to assign samples to development and test splits such that similarities across splits are minimized, thereby yielding more reliable estimates of generalization. The SpanSeq pipeline implements a two-class partition by: (1) computing k-mer-based pairwise similarities, (2) clustering sequences via single-linkage DBSCAN at threshold $\tau$, and (3) assigning clusters to two bins to minimize imbalance (makespan). Tabu-search refinements further balance partition sizes (Florensa et al., 2024). Empirical studies reveal that uncontrolled (random) splits can cause up to a 10% overestimate of out-of-sample performance due to cross-set similarity, with two-class clustering-based partitioning mitigating such bias.
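The three steps can be sketched compactly (a simplified stand-in, not the SpanSeq implementation: union-find single-linkage replaces DBSCAN, a greedy largest-first heuristic replaces tabu search, and the sequences, $k$, and $\tau$ are toy values):

```python
from itertools import combinations

def kmers(s, k=3):
    """Set of k-mers of a sequence (step 1: similarity features)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def two_way_split(seqs, tau=0.3, k=3):
    """Cluster sequences whose k-mer Jaccard similarity exceeds tau
    (single linkage via union-find), then assign whole clusters to two
    bins, largest cluster first, always into the lighter bin."""
    parent = list(range(len(seqs)))
    def find(i):                          # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    feats = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        sim = len(feats[i] & feats[j]) / len(feats[i] | feats[j])
        if sim > tau:
            parent[find(i)] = find(j)     # step 2: single-linkage merge
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    bins, sizes = ([], []), [0, 0]
    for c in sorted(clusters.values(), key=len, reverse=True):
        b = sizes.index(min(sizes))       # step 3: balance bin sizes
        bins[b].extend(c)
        sizes[b] += len(c)
    return bins                           # similar pairs never span the bins

seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "CCCCAAAA", "GGGGTTAA"]
dev, heldout = two_way_split(seqs)
```

Because whole clusters move as units, near-duplicate sequences can never straddle the development/test boundary, which is exactly the bias the pipeline is designed to remove.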

6. Broader Theoretical Context and Extensions

Two-class partitioning serves as the simplest instance of broader entropic and combinatorial partitioning problems. Extensions include k-class partitions, recursive or hierarchical partitioning (as in universe-slicing and data structure design), and their use in multi-way sequence merging, distributed streaming, or parallel computation (Joshi, 27 Oct 2025). The relationship among source coding, guessing, and task partitioning is formalized through a unified minimization framework—lower bounds, optimal partition rules, and performance guarantees translate between domains via the Rényi entropy bridge (Kumar et al., 2019).

A recurring theme across applications is the power of threshold-based, data-dependent partitioning rules: optimizing over data- and parameter-dependent splits, rather than uniform or random splits, yields provably superior rates or error exponents for statistical, algorithmic, and learning objectives.

7. Tables of Canonical Settings and Objectives

| Domain | Partitioned Object | Criterion / Objective |
|---|---|---|
| Source coding, guessing | Alphabet | $\min \mathbb{E}[A(X)^\beta]$; Rényi entropy bound |
| Compression, modeling | Sequence | $\min H_{\text{avg}}(k)$; weighted entropy |
| Combinatorial optimization | Sequence (values) | $\min \lvert b(B_1) - b(B_2) \rvert$ |
| Source–channel coding | Sequence space (types) | $\max$ expurgated exponent with per-class codes |
| Universal testing | Sequences (FL/VL blocks) | $\min$ error probability / $\max$ divergence |
| Learning/data splitting | Sample set (DBSCAN) | Minimize cross-partition similarity; balance sizes |

These settings reveal the ubiquity and mathematical unity of two-class partitioning, with each domain instantiating the generic paradigm to exploit inherent structure, improve rates, or guarantee robustness.
