WebSplatter: Linear-Time Composite Sorting
- WebSplatter is a linear-time sorting algorithm for finite-width tree-structured orders, converting composite keys into lexicographic byte strings.
- It employs 'nextification' to transform nested and variable-length keys into uniformly comparable formats, facilitating efficient MSD radix-sort.
- Applications include hierarchical SQL keys, arbitrary-precision numbers, and custom tuple structures, achieving significant performance gains over traditional sorts.
WebSplatter is a class of sorting algorithms that achieve linear time complexity for a broad family of hierarchically defined orders, especially orders common in database, string, and numerical applications. The core contribution is an efficient reduction, termed "nextification", of any finite-width tree-structured order to lexicographic order on byte strings, which enables the use of well-understood linear-time radix sort primitives. The method handles deeply nested or variable-length keys arising from lexicographic, hierarchic (length-then-lex), sum (union), and inversion constructs. The result is a single, stable, linear-time sorting algorithm for virtually any practical composite key encountered in software contexts involving ORDER BY clauses, arbitrary-precision numbers, or custom tuple structures (Lyaudet, 2018).
1. Finite-Width Tree-Structured Orders
The foundation of WebSplatter is the class of finite-width tree-structured orders ("TSOs"). TSOs are built from finite total orders by a finite sequence of operations: inversion, lexicographic (dictionary) product, hierarchic (length-then-lex) product (also known as "shortlex" or "radix"), and finite generalized sum (union). Given any finite total order $O$, its inverse $\Inv(O)$, and sequences $\mathcal{O}$ of tree-structured orders that are finite or ultimately periodic, one constructs:
- $\Lex(i, j, \mathcal{O})$: Lexicographic order on sequences.
- $\Hierar(i, j, \mathcal{O})$: Hierarchic or shortlex order, comparing first by sequence length then lexicographically.
- Generalized sums over a finite master order: each element of the master order selects a suborder, which imposes master/suborder branching.
The finite-width restriction requires that all such sequences are finitely described (eventually periodic), and that any master sum index is finite. These constraints ensure that hierarchical database keys, multi-field tuples, variable-length integer and string encodings, and SQL ORDER BY constructs are all encompassed by the model.
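The constructors above can be illustrated with plain comparators. The following is a minimal sketch, not the paper's formalism: the helper names (`cmp_leaf`, `inv`, `lex`, `hierar`) are hypothetical, and, as a simplifying assumption, every position of a sequence uses the same child order, whereas $\Lex(i, j, \mathcal{O})$ allows a different order per position.

```python
from functools import cmp_to_key

def cmp_leaf(a, b):
    """Three-way comparison on a finite total order (here: plain integers)."""
    return (a > b) - (a < b)

def inv(cmp):
    """Inversion: reverse an existing order."""
    return lambda a, b: cmp(b, a)

def lex(cmp):
    """Lexicographic order on sequences whose items are ordered by `cmp`."""
    def compare(xs, ys):
        for x, y in zip(xs, ys):
            c = cmp(x, y)
            if c != 0:
                return c
        # One sequence is a prefix of the other: the shorter one comes first.
        return cmp_leaf(len(xs), len(ys))
    return compare

def hierar(cmp):
    """Hierarchic (shortlex) order: compare lengths first, then lexicographically."""
    lex_cmp = lex(cmp)
    def compare(xs, ys):
        c = cmp_leaf(len(xs), len(ys))
        return c if c != 0 else lex_cmp(xs, ys)
    return compare

data = [(2,), (1, 5), (1,), (1, 2, 0)]
by_lex = sorted(data, key=cmp_to_key(lex(cmp_leaf)))
by_hierar = sorted(data, key=cmp_to_key(hierar(cmp_leaf)))
by_inv_lex = sorted(data, key=cmp_to_key(inv(lex(cmp_leaf))))
```

The same four tuples come out in three different orders, which is the point of the model: one key can be ranked differently depending on the tree of constructors placed above it.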
2. Nextification: Transformation to Lexicographic Order
Nextification is the transformation that enables TSOs to be sorted uniformly and efficiently. The process converts any finite-width TSO instance into a byte string such that the original order is preserved under lexicographic comparison. Each leaf (finite order) is assigned a fixed-length binary code of $\lceil \log_2 k \rceil$ bits for $k$ elements. Internal nodes operate as follows:
- Lexicographic nodes concatenate their children's codes.
- Contre-lex nodes concatenate and adjust a small padding counter.
- Hierarchic nodes prefix a unary-encoded child count and concatenate.
- Inversion nodes flip bits to invert order.
This encoding requires time proportional to the sum of the children's code lengths. The total code length for any datum is bounded by the number of leaves plus a constant per internal node; overall, nextification requires $O(n)$ time and space for $n$ records, with small practical constant factors for real-world orders.
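The node rules can be sketched as a recursive encoder. This is an illustrative simplification under stated assumptions, not the paper's exact encoding: the class names (`Leaf`, `Inv`, `Lex`, `Hierar`) and the function `nextify` are hypothetical, leaves are padded to whole bytes rather than packed bits, the unary count is written as one `0xFF` byte per child closed by `0x00`, and contre-lex and sum nodes are left out.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Leaf:
    size: int  # finite total order on 0 .. size-1

@dataclass(frozen=True)
class Inv:
    child: "Order"

@dataclass(frozen=True)
class Lex:
    children: Tuple["Order", ...]  # fixed-arity tuple, one order per field

@dataclass(frozen=True)
class Hierar:
    child: "Order"  # variable-length sequence, compared length-first

Order = Union[Leaf, Inv, Lex, Hierar]

def nextify(order, value) -> bytes:
    """Encode `value` so that byte-wise comparison matches the order tree."""
    if isinstance(order, Leaf):
        # Fixed-width big-endian code, wide enough for size-1.
        width = max(1, ((order.size - 1).bit_length() + 7) // 8)
        return value.to_bytes(width, "big")
    if isinstance(order, Inv):
        # Flipping every byte reverses the order of prefix-free codes.
        return bytes(0xFF - b for b in nextify(order.child, value))
    if isinstance(order, Lex):
        return b"".join(nextify(o, v) for o, v in zip(order.children, value))
    if isinstance(order, Hierar):
        # Unary child count first, so shorter sequences sort first.
        count = b"\xff" * len(value) + b"\x00"
        return count + b"".join(nextify(order.child, v) for v in value)
    raise TypeError(f"unknown order node: {order!r}")

order = Lex((Leaf(256), Inv(Leaf(256)), Hierar(Leaf(10))))
records = [(1, 5, [3, 1]), (1, 9, [2]), (0, 0, [9, 9, 9]), (1, 5, [7])]
ranked = sorted(records, key=lambda r: nextify(order, r))
```

Because every code produced here is prefix-free, concatenation preserves the lexicographic product and byte flipping is a valid inversion. The built-in `sorted` merely stands in for a radix sort in this example.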
3. Linear-Time Sorting via Hierarchical Radix Sort
After nextification, sorting reduces to lexicographic ordering of the resulting byte strings. The algorithm uses MSD radix-sort (most-significant-digit first), which partitions the dataset at each byte position and recursively sorts only the relevant buckets. The process is stable due to the construction of prefix sums for bucket boundaries. The total work is $O(n \cdot \ell)$ for $n$ keys of maximal nextified length $\ell$, which is $O(n)$ when $\ell$ is bounded by a constant and linear in the total key length for variable-length keys.
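A compact way to see the recursion is the sketch below. It is a generic MSD radix sort over byte strings, not code from the source: the function name `msd_radix_sort` is hypothetical, and it keeps stability by appending to per-byte lists instead of computing prefix sums over counts, which is simpler to read but allocates more.

```python
def msd_radix_sort(keys, pos=0):
    """Stable MSD radix sort of byte strings in lexicographic order."""
    if len(keys) <= 1:
        return list(keys)
    # Bucket 0 holds keys that end before `pos`;
    # bucket b + 1 holds keys whose byte at `pos` is b.
    buckets = [[] for _ in range(257)]
    for key in keys:
        index = key[pos] + 1 if pos < len(key) else 0
        buckets[index].append(key)
    # Keys in bucket 0 are identical, so they need no further sorting.
    result = buckets[0]
    for bucket in buckets[1:]:
        if bucket:
            result.extend(msd_radix_sort(bucket, pos + 1))
    return result

words = [b"banana", b"band", b"", b"ban", b"apple", b"ban"]
sorted_words = msd_radix_sort(words)
```

Each call only touches the byte at `pos`, and a bucket with a single key stops recursing, so no key is inspected beyond the prefix it shares with a neighbor. The recursion depth grows with the longest shared prefix, so very long duplicate keys would need an explicit stack in a production version.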
4. Complexity Analysis
Nextification and MSD radix-sort compose to achieve $O(n)$ total time and space in the RAM model, where pointer arithmetic and byte operations are $O(1)$. The nextification overhead and the number of radix passes contribute only constant factors, so the overall cost remains $O(n)$. Space overhead includes a constant-factor increase in key representation and $O(n)$ auxiliary storage.
| Step | Time Complexity | Space Complexity |
|---|---|---|
| Nextification | $O(n)$ | $O(n)$ |
| Radix sort | $O(n)$ | $O(n)$ |
| Total | $O(n)$ | $O(n)$ |
5. Representative Applications
a) Unbounded Integer Sort:
Arbitrary-precision integers are encoded by digit sequences; primary comparison is by length (hierarchic node), followed by digit-wise comparison. Nextification converts each integer to a shortlex-padded string, leading to linear total sorting time.
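As an illustration, a nonnegative integer of any size can be turned into a byte string whose lexicographic order matches numeric order. The sketch assumes base-256 digits and a unary length prefix (one `0xFF` per digit, closed by `0x00`); the function name `nextify_uint` is hypothetical, and the source's exact padding scheme may differ.

```python
def nextify_uint(n: int) -> bytes:
    """Encode a nonnegative integer so that byte order matches numeric order."""
    if n < 0:
        raise ValueError("only nonnegative integers are handled in this sketch")
    # Base-256 digits, most significant first, with no leading zero digit.
    digits = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Unary length prefix: numbers with fewer digits sort first.
    return b"\xff" * len(digits) + b"\x00" + digits

values = [2**70, 3, 0, 256, 255, 2**64 + 1]
ranked = sorted(values, key=nextify_uint)
```

The unary prefix roughly doubles the key length, which keeps the encoding linear in the number of digits. The built-in `sorted` is used here only to check the encoding; any byte-string radix sort could take its place.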
b) Hierarchical SQL Keys:
For SQL queries like ORDER BY country ASC, city DESC, street ASC, lexicographic and inverted nodes model the key order: $\Lex(0, 3, \{O^{country}, \Inv(O^{city}), O^{street}\})$
Nextification inverts the city field, concatenates segments, and adjusts padding; a single MSD radix-sort then suffices.
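A minimal sketch of such a key is shown below. Several points are assumptions rather than details from the source: fields are UTF-8 strings without NUL bytes, each field is closed by a `0x00` terminator instead of padding, byte order stands in for the database collation, and the helper names (`asc`, `desc`, `order_by_key`) are hypothetical.

```python
def asc(text: str) -> bytes:
    """Ascending string field: UTF-8 bytes plus a 0x00 terminator."""
    raw = text.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("this sketch assumes fields contain no NUL byte")
    return raw + b"\x00"

def desc(text: str) -> bytes:
    """Descending string field: the ascending code with every byte flipped."""
    return bytes(0xFF - b for b in asc(text))

def order_by_key(country: str, city: str, street: str) -> bytes:
    """Key for ORDER BY country ASC, city DESC, street ASC."""
    return asc(country) + desc(city) + asc(street)

rows = [
    ("France", "Lyon", "Rue B"),
    ("France", "Paris", "Rue A"),
    ("Chile", "Arica", "Calle 1"),
    ("France", "Paris", "Avenue C"),
    ("France", "Par", "Rue Z"),
]
ordered = sorted(rows, key=lambda r: order_by_key(*r))
```

The terminator makes each field code prefix-free, so "Par" and "Paris" stay correctly ordered after the flip (the flipped terminator `0xFF` is larger than any flipped text byte, which puts the shorter city last in descending order).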
c) Sorting Rationals by Continued Fraction:
Any nonnegative rational has a finite continued fraction $[a_0; a_1, \ldots, a_k]$. Alternating lex/contre-lex nodes replicate rational order, and the nextification plus radix-sort mechanism applies. However, continued-fraction expansion may not be linear without fast multi-precision arithmetic.
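The alternating structure can be checked with a small experiment. The sketch below does not produce byte strings; it builds a tuple key in which even-indexed terms sort ascending and odd-indexed terms sort descending, with the end of the expansion treated as an infinite next term. This reading of the alternation, and the names `continued_fraction` and `cf_key`, are assumptions made for illustration.

```python
from fractions import Fraction

def continued_fraction(x: Fraction) -> list:
    """Terms [a0, a1, ..., ak] of a nonnegative rational, via the Euclidean algorithm."""
    if x < 0:
        raise ValueError("only nonnegative rationals are handled in this sketch")
    num, den = x.numerator, x.denominator
    terms = []
    while den != 0:
        a = num // den
        terms.append(a)
        num, den = den, num - a * den
    return terms

def cf_key(x: Fraction) -> tuple:
    """Sort key: even-indexed terms ascending, odd-indexed terms descending."""
    terms = continued_fraction(x)
    signed = [a if i % 2 == 0 else -a for i, a in enumerate(terms)]
    # The expansion ends as if the next term were infinite.
    end = float("inf") if len(terms) % 2 == 0 else float("-inf")
    return tuple(signed) + (end,)

values = [Fraction(3, 2), Fraction(1), Fraction(2, 5), Fraction(1, 2), Fraction(5, 3)]
ranked = sorted(values, key=cf_key)
```

For example, 1/2 = [0; 2] and 1/3 = [0; 3] differ at an odd index, where the larger term gives the smaller rational. The Euclidean loop is also where the cost caveat above shows up, since each step is a division on arbitrarily large integers.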
6. Comparative Evaluation
Standard comparison-based sorts (mergesort, quicksort) require $O(n \log n)$ comparisons, with each comparison potentially traversing all tuple fields, for $O(k \, n \log n)$ work on keys of $k$ fields. LSD radix sort achieves linear time only for uniform, fixed-length keys. In contrast, WebSplatter handles variable-length, mixed-key structures uniformly in linear time, which, for most applications, outperforms $O(n \log n)$.
MSD radix-sort alone copes with variable lengths but requires explicit stack control; by leveraging padded nextified strings, end-of-string handling is implicit (zero-byte as end marker). Benchmarks show that this method can outperform standard comparators (such as qsort) by factors of 2–10 for large $n$, even with nextification overhead (Lyaudet, 2018).
7. Broader Implications
Hierarchical radix sort via nextification provides an algorithmic unification for sorting any composite or structured key expressible as a finite-width TSO. No custom comparator is required; ORDER BY or tuple-comparison expressions are compiled into a tree structure, then nextified and sorted by one stable, linear-time algorithm. This suggests significant simplification for implementations in database engines, arbitrary-precision arithmetic, and applications involving complex key definition, with uniformly strong asymptotics and practically efficient constants.