Ill-formed UTF-16: SIMD Detection & Repair

Updated 15 January 2026
  • Ill-formed UTF-16 strings are sequences of 16-bit code units that break surrogate pairing rules, undermining Unicode compliance.
  • Detection and correction methods leverage scalar checks and SIMD processing to replace invalid surrogates with the Unicode replacement character U+FFFD.
  • Empirical benchmarks show SIMD approaches achieving up to an 8.6× speedup over scalar methods, enhancing reliability and security in data systems.

Ill-formed UTF-16 strings are sequences of 16-bit code units that violate the structural constraints of the UTF-16 encoding of Unicode. UTF-16 represents code points from the Unicode range U+0000 to U+10FFFF using either one or two 16-bit words, with higher code points (those above U+FFFF, outside the Basic Multilingual Plane) requiring a “surrogate pair”: a high surrogate (U+D800…U+DBFF) immediately followed by a low surrogate (U+DC00…U+DFFF). Ill-formed UTF-16 arises when this surrogate pairing invariant is broken—such as a high surrogate not followed by a low surrogate, or a low surrogate not immediately preceded by a high surrogate. These malformed sequences, which can result from data corruption, pipeline conversion errors, or malicious input, compromise both reliability and security in Unicode-based systems. Modern JavaScript engines and other interpreters must address such ill-formed strings in hot paths; recent advances deploy Single Instruction, Multiple Data (SIMD) techniques, yielding substantial efficiency and correctness benefits (Clausecker et al., 9 Jan 2026).

1. Structural Properties of UTF-16 and Ill-formedness

UTF-16 encodes Unicode code points (U+0000 to U+10FFFF) into 16-bit units. For code points in the Basic Multilingual Plane (U+0000 to U+FFFF), encoding is direct. Supplementary code points (U+10000 to U+10FFFF) require encoding as a surrogate pair, defined by:

  • High surrogate: $0xD800$…$0xDBFF$ (binary 1101 10xx xxxx xxxx)
  • Low surrogate: $0xDC00$…$0xDFFF$ (binary 1101 11yy yyyy yyyy)

The code point is reconstructed as:

$$\text{CodePoint} = ((H - \text{0xD800}) \ll 10) + (L - \text{0xDC00}) + \text{0x10000}$$
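As a worked instance of this formula, the following C sketch splits a supplementary code point into its surrogate pair and recombines it. The function names `utf16_encode_pair` and `utf16_decode_pair` are illustrative and do not come from the cited paper.

```c
#include <assert.h>
#include <stdint.h>

/* Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  uint32_t v = cp - 0x10000;                 /* 20-bit offset */
  *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
  *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Inverse: CodePoint = ((H - 0xD800) << 10) + (L - 0xDC00) + 0x10000. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low) {
  return (((uint32_t)high - 0xD800) << 10) + ((uint32_t)low - 0xDC00) + 0x10000;
}
```

For example, U+1F600 encodes as the pair 0xD83D 0xDE00, and decoding that pair returns 0x1F600.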

An ill-formed UTF-16 string is one where these constraints are violated: a high surrogate not directly followed by a low surrogate, or an isolated low surrogate. Causes include truncated input/output operations, loss of bytes through mixed encoding conversions (e.g., incomplete UTF-8 to UTF-16 translation), and maliciously constructed byte streams (Clausecker et al., 9 Jan 2026).
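The definition above translates directly into a detection routine. The following is a minimal sketch of a scalar validator; the name `utf16_is_well_formed` is hypothetical and is not the API of any particular library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when every surrogate in s[0..len) belongs to a high+low pair. */
bool utf16_is_well_formed(const uint16_t *s, size_t len) {
  assert(s != NULL || len == 0);
  size_t i = 0;
  while (i < len) {
    uint16_t u = s[i];
    if ((u & 0xFC00) == 0xD800) {            /* high surrogate */
      if (i + 1 == len || (s[i + 1] & 0xFC00) != 0xDC00)
        return false;                        /* truncated or unpaired high */
      i += 2;                                /* skip the valid pair */
    } else if ((u & 0xFC00) == 0xDC00) {
      return false;                          /* low with no preceding high */
    } else {
      i += 1;                                /* ordinary BMP code unit */
    }
  }
  return true;
}
```

A string cut off in the middle of a pair, such as 0x0041 0xD83D, is rejected by the `i + 1 == len` test, which corresponds to the truncated-I/O case mentioned above.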

2. Detection and Correction of Ill-formed Sequences

Ill-formed surrogates demand detection and corrective rewriting to maintain semantics and security. The canonical fix replaces invalid surrogates with the Unicode replacement character U+FFFD.

A scalar approach iterates over code units, checking for sequential surrogate invariants. The essential logic:

  • If a high surrogate is not immediately followed by a low surrogate, replace that high surrogate with U+FFFD.
  • If a low surrogate is not immediately preceded by a high surrogate, replace that low surrogate with U+FFFD.

Example scalar code (adapted and abridged from Clausecker et al., 9 Jan 2026):

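#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Copies in[0..len) to out, replacing every unpaired surrogate with U+FFFD.
void fix_utf16_scalar(const uint16_t *in, uint16_t *out, size_t len) {
  bool prev_high = false;  // was the previous input unit a high surrogate?
  for (size_t i = 0; i < len; i++) {
    uint16_t cur = in[i];
    bool cur_high = (cur & 0xFC00) == 0xD800;
    bool cur_low  = (cur & 0xFC00) == 0xDC00;
    if (prev_high && !cur_low) out[i - 1] = 0xFFFD;  // previous high has no partner
    if (!prev_high && cur_low) cur = 0xFFFD;         // low without a preceding high
    out[i] = cur;
    prev_high = cur_high;
  }
  if (prev_high) out[len - 1] = 0xFFFD;  // string ends on an unpaired high
}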

3. SIMD-Based Parallel Repairs

SIMD (Single Instruction, Multiple Data) techniques enable simultaneous processing of multiple code units, increasing throughput and reducing per-unit instruction cost. The SIMD algorithm processes $N$ code units in each block ($N=8$ for NEON, $16$ for SSE2, $32$ for AVX-512), using a “lookback” to check for block-crossing surrogates.

Key operations in each SIMD-block are:

  • Parallel masking: AND each code unit with $0xFC00$ and compare the result with $0xD800$ (high surrogate) or $0xDC00$ (low surrogate).
  • Validity via XOR: Compute lb_is_high (from the lookback vector, i.e. the block shifted back by one code unit) XOR blk_is_low (from the current block); a zero result means every high surrogate is followed by a low surrogate and every low surrogate is preceded by a high one, so the block is well-formed.
  • Error construction: Compute which positions (via bitmasks) hold unmatched high or low surrogates. Apply U+FFFD through blending or masked stores in those slots.
  • Tail and alignment logic: Fall back to scalar logic for input shorter than $N+1$, and use scalar checks for the first and last positions.

Pseudocode sketch for AVX-512:

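// Requires AVX-512BW. fc00, high_tag, low_tag and repl are vectors with every
// lane set to 0xFC00, 0xD800, 0xDC00 and 0xFFFD. in[0] is checked by scalar code
// beforehand, so i starts at 1 and the lookback load at in + i - 1 stays in bounds.
// The units after the last full block, and a trailing high surrogate, go to scalar code.
for (size_t i = 1; i + 32 <= len; i += 32) {
  __m512i lb    = _mm512_loadu_si512((const void*)(in + i - 1)); // units i-1 .. i+30
  __m512i blk   = _mm512_loadu_si512((const void*)(in + i));     // units i .. i+31
  __mmask32 lbH = _mm512_cmpeq_epi16_mask(_mm512_and_si512(lb, fc00), high_tag);
  __mmask32 bkL = _mm512_cmpeq_epi16_mask(_mm512_and_si512(blk, fc00), low_tag);
  __mmask32 bad = _kxor_mask32(lbH, bkL);
  if (_ktestz_mask32_u8(bad, bad)) {            // fast path: every pair matches
    _mm512_storeu_si512((void*)(out + i), blk); continue;
  }
  __mmask32 hs_err = _kandn_mask32(bkL, lbH);   // high at i+j-1, no low at i+j
  __mmask32 ls_err = _kandn_mask32(lbH, bkL);   // low at i+j, no high at i+j-1
  __mmask32 mask   = _kor_mask32(_kshiftri_mask32(hs_err, 1), ls_err);
  _mm512_storeu_si512((void*)(out + i), _mm512_mask_blend_epi16(mask, blk, repl));
  // Lane 0 of hs_err refers to the unit just before this block: patch out[i-1].
  _mm512_mask_storeu_epi16(out + i - 1, _kand_mask32(hs_err, _cvtu32_mask32(1)), repl);
}
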
ARM NEON uses similar logic in the byte lane with 4×16-code-unit blocks, utilizing high-byte masking and vector compare operations (Clausecker et al., 9 Jan 2026).
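The block-level mask logic can also be emulated in portable C, which makes it easier to follow without intrinsics. The sketch below is an illustration under stated assumptions, not the paper's implementation: each 32-bit integer stands in for a vector compare mask, and the names `fix_block32` and `to_well_formed_blocks` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 32 };                       /* lanes per block, as for AVX-512 */

static int is_high(uint16_t u) { return (u & 0xFC00) == 0xD800; }
static int is_low(uint16_t u)  { return (u & 0xFC00) == 0xDC00; }

/* Fix one block of 32 units; in[-1] must be readable and out[-1] writable. */
static void fix_block32(const uint16_t *in, uint16_t *out) {
  uint32_t lb_high = 0, blk_low = 0;
  for (int j = 0; j < BLOCK; j++) {        /* emulates two vector compares */
    lb_high |= (uint32_t)is_high(in[j - 1]) << j;
    blk_low |= (uint32_t)is_low(in[j]) << j;
  }
  if ((lb_high ^ blk_low) == 0) {          /* fast path: block is well-formed */
    memmove(out, in, BLOCK * sizeof *in);
    return;
  }
  uint32_t hs_err = lb_high & ~blk_low;    /* high at j-1 without low at j */
  uint32_t ls_err = blk_low & ~lb_high;    /* low at j without high at j-1 */
  uint32_t mask = (hs_err >> 1) | ls_err;  /* lanes of this block to replace */
  for (int j = 0; j < BLOCK; j++)          /* emulates blend + store */
    out[j] = ((mask >> j) & 1) ? 0xFFFD : in[j];
  if (hs_err & 1) out[-1] = 0xFFFD;        /* unpaired high just before block */
}

void to_well_formed_blocks(const uint16_t *in, uint16_t *out, size_t len) {
  assert(in != NULL && out != NULL);
  if (len == 0) return;
  out[0] = is_low(in[0]) ? 0xFFFD : in[0]; /* first unit: scalar check */
  size_t i = 1;
  for (; i + BLOCK <= len; i += BLOCK) fix_block32(in + i, out + i);
  for (; i < len; i++) {                   /* scalar tail with lookback */
    int prev_high = is_high(in[i - 1]);
    if (prev_high && !is_low(in[i])) out[i - 1] = 0xFFFD;
    out[i] = (!prev_high && is_low(in[i])) ? 0xFFFD : in[i];
  }
  if (is_high(in[len - 1])) out[len - 1] = 0xFFFD; /* trailing unpaired high */
}
```

A high surrogate in the last lane of a block is deliberately left for the next block (or the tail) to judge, because its partner lies outside the current block; this is the role of the lookback.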

4. Performance Characteristics and Empirical Benchmarks

The SIMD method achieves dramatic speedup compared to scalar baselines. On Apple M4 with 1,000,000 code units and 0% mismatches:

  • Scalar baseline: 2.2 GB/s
  • NEON SIMD: 18.9 GB/s (~8.6× improvement)

On Intel Xeon Gold 6338:

  • V8 scalar: 1.2 GB/s
  • Ice Lake AVX-512: 7.5 GB/s
  • Haswell AVX2: 7.8 GB/s
  • Westmere SSE4.2: 5.8 GB/s

Efficiency improvements arise from reduced per-byte instructions (e.g., Ice Lake AVX-512 uses 0.4 ins/byte vs. scalar 13.0 ins/byte) and improved utilization of CPU pipelines. Even with 0.1% invalid surrogates (a rare occurrence), throughput drops by only ~10%. For input sizes up to 1,000,000 code units, SIMD performance remains nearly flat, while scalar implementations degrade with longer input due to branch mispredictions and non-vectorized memory accesses (Clausecker et al., 9 Jan 2026).

| Platform / Path | Throughput (GB/s) | Instructions/byte | Instructions/cycle | Speedup (SIMD/Scalar) |
|---|---|---|---|---|
| Apple M4 Scalar (V8) | 2.2 | 12.0 | 5.9 | (baseline) |
| Apple M4 SIMD (NEON) | 18.9 | 0.9 | 3.7 | 8.6× |
| Xeon 6338 Scalar (V8) | 1.2 | 13.0 | 5.0 | (baseline) |
| Xeon 6338 SIMD (AVX-512) | 7.5 | 0.4 | 1.0 | 6.2× |
| Xeon 6338 SIMD (AVX2) | 7.8 | 0.8 | 1.8 | 6.5× |
| Xeon 6338 SIMD (SSE4.2) | 5.8 | 2.0 | 3.6 | 4.8× |
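The columns of this table are tied together by a simple identity: throughput in GB/s multiplied by instructions per byte gives billions of instructions per second, and dividing by instructions per cycle gives the implied clock rate in GHz. The sketch below uses that identity as a sanity check on the reported figures; the helper names are illustrative, and because the inputs are rounded the results are only indicative.

```c
#include <assert.h>

/* Implied clock (GHz) = throughput (GB/s) * instructions/byte / instructions/cycle. */
double implied_ghz(double gb_per_s, double ins_per_byte, double ins_per_cycle) {
  assert(ins_per_cycle > 0.0);
  return gb_per_s * ins_per_byte / ins_per_cycle;
}

/* Speedup of a SIMD path over the scalar baseline on the same machine. */
double speedup(double simd_gb_per_s, double scalar_gb_per_s) {
  assert(scalar_gb_per_s > 0.0);
  return simd_gb_per_s / scalar_gb_per_s;
}
```

With the figures above, `speedup(18.9, 2.2)` is about 8.6, `implied_ghz(18.9, 0.9, 3.7)` is about 4.6, and `implied_ghz(7.5, 0.4, 1.0)` is about 3.0, so each row implies a plausible clock rate for its machine.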

5. Integration in Runtime Systems and Security Implications

Modern JavaScript engines (notably the Google V8 engine) require all internal UTF-16 strings to be well-formed. String creation and decoding routines, such as String.fromCharCode, depend on this well-formedness; ill-formed input risks incorrect code point construction or exploitable vulnerabilities. The SIMD-based fix-up routine—simdutf::to_well_formed_utf16(in, out, len)—was integrated into V8 as a drop-in replacement for the scalar version:

  • Fallback to scalar logic for strings below 17 code units
  • First and last code units are always checked with scalar branches
  • In-place correction uses masked stores to avoid redundant memory writes
  • Unaligned loads/stores are permitted without penalty on current hardware
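The in-place behaviour can be illustrated with a scalar stand-in that writes only to the slots holding an unpaired surrogate, which is the effect a masked store achieves per block. This is a minimal sketch; `fix_utf16_in_place` is a hypothetical name, not the simdutf or V8 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Repairs s[0..len) in place and returns how many units were replaced.
   Well-formed units are never rewritten, mimicking a masked store. */
size_t fix_utf16_in_place(uint16_t *s, size_t len) {
  assert(s != NULL || len == 0);
  size_t replaced = 0;
  size_t i = 0;
  while (i < len) {
    uint16_t u = s[i];
    int high = (u & 0xFC00) == 0xD800;
    int low  = (u & 0xFC00) == 0xDC00;
    if (high && i + 1 < len && (s[i + 1] & 0xFC00) == 0xDC00) {
      i += 2;                      /* valid pair: leave both units untouched */
      continue;
    }
    if (high || low) {             /* unpaired surrogate of either kind */
      s[i] = 0xFFFD;
      replaced++;
    }
    i++;
  }
  return replaced;
}
```

A return value of zero means the buffer was already well-formed and no memory was written.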

The effect on browser workloads is measurable: Speedometer 3 and JetStream benchmarks record up to 5% overall page-load speedup, reflecting the centrality of UTF-16 fix-up in DOM parsing, JSON decoding, and event handling. The reduction in instructions and improved branch predictability lower power consumption, and the robust handling of invalid surrogates actively closes a class of memory-safety and rendering vulnerabilities by preventing out-of-bounds Unicode arithmetic (Clausecker et al., 9 Jan 2026).

6. Significance and Outlook

Systematic exploitation of modern SIMD hardware (NEON, SSE2/AVX2, AVX-512) for UTF-16 well-formedness checking enables simultaneous analysis of 8–32 code units. Combined with a fast path that copies a block directly when no mismatch is found, this approach yields a measured speedup of up to 8.6× on Apple M4 (4.8× to 6.5× on the Xeon paths) along with more robust Unicode handling. The adoption of such SIMD techniques is not only technically effective but operationally significant, raising both the security and the efficiency baseline of mainstream JavaScript engines (Clausecker et al., 9 Jan 2026). A plausible implication is that future runtimes and libraries for Unicode-heavy workloads will increasingly consolidate on SIMD-backed well-formedness checks for better security and throughput.

References (1)
