Phone Similarity Edit Distance (PSD)

Updated 5 February 2026

Phone Similarity Edit Distance (PSD) is a metric that generalizes classical edit distance through feature-based and duration-sensitive substitution costs.
It uses dynamic programming with weighted ℓ1-norm feature comparisons to align phonetic sequences, yielding more graded similarity measures than traditional binary-cost methods.
PSD supports applications in cross-linguistic comparison and speech recognition, reducing overpenalization for natural phonological changes.

Phone Similarity Edit Distance (PSD), also referred to as Phonetic Edit Distance (PED) or, when instantiated for phonetic sequences with durations, exp-edit distance, is a metric generalizing classical string edit distance to sequences of phones and, when required, their durations. PSD forms the foundation for quantifying the similarity between phonetic transcriptions, supporting applications in language comparison, speech recognition, and phonological analysis. It replaces the binary cost schema of traditional Levenshtein distance with phone-feature-aware or duration-sensitive substitution costs, enabling more nuanced metrics for cross-linguistic and phonetic similarity computations (Ahmed et al., 2020; Baek, 2024).

1. Formalization of Phone Similarity Edit Distance

For the base Phonetic Edit Distance (PED) introduced by Ahmed et al. (2020), let $s = s_1 \dots s_n$ and $t = t_1 \dots t_m$ be sequences of phones (typically IPA symbols). PED is defined by a dynamic-programming recurrence adapting the Wagner–Fischer algorithm:

$\mathrm{PED}(s[1..i],\,t[1..j]) = \min \Bigl\{ \mathrm{PED}(s[1..i-1],\,t[1..j]) + 1,\; \mathrm{PED}(s[1..i],\,t[1..j-1]) + 1,\; \mathrm{PED}(s[1..i-1],\,t[1..j-1]) + \delta(s_i, t_j) \Bigr\},$

with boundary conditions:

$\mathrm{PED}(\epsilon, \epsilon) = 0;\quad \mathrm{PED}(s[1..i], \epsilon) = i;\quad \mathrm{PED}(\epsilon, t[1..j]) = j$

The substitution cost $\delta(s_i, t_j)$ is not simply $0$ or $1$ as in Levenshtein distance, but a normalized, feature-based real value:

$\delta(p, q) = \begin{cases} 0, & p = q \\ d_\mathrm{feat}(\mathbf{f}_p, \mathbf{f}_q), & p \ne q \end{cases}$

Here, $\mathbf{f}_p$ is the feature vector for phone $p$, and $d_\mathrm{feat}$ is a weighted $\ell_1$ distance detailed below.
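The recurrence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ped` and `binary_cost` are hypothetical, the substitution cost is passed in as a plain function, and each character of the input string is treated as one phone (which only holds for transcriptions without diacritics or multi-character symbols).

```python
from typing import Callable, Sequence


def ped(s: Sequence[str], t: Sequence[str],
        sub_cost: Callable[[str, str], float]) -> float:
    """Wagner-Fischer DP with unit insert/delete costs and a pluggable
    substitution cost (hypothetical helper, for illustration only)."""
    n, m = len(s), len(t)
    # D[i][j] holds the distance between the prefixes s[:i] and t[:j].
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = float(i)          # delete i phones
    for j in range(1, m + 1):
        D[0][j] = float(j)          # insert j phones
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1.0,                               # deletion
                D[i][j - 1] + 1.0,                               # insertion
                D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return D[n][m]


def binary_cost(p: str, q: str) -> float:
    """Classical 0/1 substitution cost; with it, ped() is Levenshtein distance."""
    return 0.0 if p == q else 1.0


print(ped("fatar", "pedær", binary_cost))  # 4.0
```

With `binary_cost` the function reduces to Levenshtein distance; replacing it with a feature-based cost in $[0,1]$ gives the soft variant described above.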

The exp-edit distance (Baek, 2024) extends this model to sequences in which each phone is paired with a duration, formalized as $\mathbb{R}^+$-exponent-strings:

$p = ((\sigma_1, x_1), (\sigma_2, x_2), \ldots, (\sigma_n, x_n)),\quad \sigma_i \in \Sigma,\; x_i \in \mathbb{R}^+, \; \sigma_i \ne \sigma_{i+1}$

Edit operations allow for proportional or fractional insertion, deletion, and substitution, and their costs are linear in the duration.

2. Articulatory Feature Representations and Substitution Costs

Each phone symbol is converted to a feature vector, which can include:

Vowels: "open" ∈ $[0,1]$, "back" ∈ $[0,1]$, "rounded" ∈ $\{0,1\}$
Consonants: "place" ∈ $[0,1]$, "manner" (categorical, via a small table), "voiced" ∈ $\{0,1\}$, "aspirated" ∈ $\{0,1\}$, "airflow", "pharyngealized" ∈ $\{0,1\}$ (Ahmed et al., 2020).

Substitution costs are defined using weighted $\ell_1$ norms over these features:

For vowels: $d_\mathrm{vowel}(w, x) = \frac{2}{3} \left(|\mathrm{open}_w - \mathrm{open}_x| + |\mathrm{back}_w - \mathrm{back}_x|\right ) + \frac{1}{3} |\mathrm{rounded}_w - \mathrm{rounded}_x|$
For consonants (weights $w_k$ are empirically set and sum to $1$; some heuristics apply to ignore certain features if major mismatches are already found): $d_\mathrm{consonant}(w, x) = w_\mathrm{place} |\mathrm{place}_w-\mathrm{place}_x| + w_\mathrm{manner} d_\mathrm{manner}(w, x) + w_\mathrm{voiced} |\mathrm{voiced}_w-\mathrm{voiced}_x| + \ldots$

Insertions and deletions uniformly retain cost $1$ per phone (Ahmed et al., 2020).
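As a small illustration of the vowel cost, the sketch below transcribes $d_\mathrm{vowel}$ exactly as stated above. The `Vowel` class, the `VOWELS` table, and all feature values in it are hypothetical placeholders chosen for the example, not the values used by Ahmed et al. (2020). Whether the result is further rescaled or clipped to $[0,1]$ is not specified here, so the code applies no such step. The consonant cost is omitted because its weights and manner table are not given in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vowel:
    openness: float  # "open" feature, in [0, 1]
    back: float      # "back" feature, in [0, 1]
    rounded: int     # 0 or 1


def d_vowel(w: Vowel, x: Vowel) -> float:
    """Weighted l1 distance between two vowels, as written in the text."""
    return (2.0 / 3.0) * (abs(w.openness - x.openness) + abs(w.back - x.back)) \
        + (1.0 / 3.0) * abs(w.rounded - x.rounded)


# Hypothetical feature values, for illustration only.
VOWELS = {
    "i": Vowel(openness=0.0, back=0.0, rounded=0),
    "y": Vowel(openness=0.0, back=0.0, rounded=1),
    "e": Vowel(openness=0.25, back=0.0, rounded=0),
}

# /i/ and /y/ differ only in rounding, so only the 1/3 term contributes.
print(d_vowel(VOWELS["i"], VOWELS["y"]))
# /i/ and /e/ differ only in openness (by 0.25 in this toy table).
print(d_vowel(VOWELS["i"], VOWELS["e"]))
```

A function like this can be wrapped so that identical phones return 0 and then passed as the substitution cost of the edit-distance recurrence.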

3. Dynamic Programming Algorithms and Complexity

For articulatory PSD, the algorithm is a modification of the classical dynamic programs for edit distance. The DP table is of size $(n+1)\times(m+1)$ and each cell computes deletion, insertion, and substitution costs as defined above. With soft substitution, substitution costs are real values in $[0,1]$, so DP entries are non-negative reals rather than integers, and optimal alignments tend to prefer phonetically plausible substitutions over harshly penalizing minor, natural phonological changes (Ahmed et al., 2020).

For exp-edit distance (continuous-duration PSD), the phone-duration string is modeled as a sequence of contraction factors (pairs of phone and duration). When durations are rational, expansion to ordinary strings allows a reduction to discrete DP; contraction (run-length encoding) further allows the application of optimized DP or RLE-edit-distance techniques:

The cost function at each DP entry is formulated as follows. For $u = ((\sigma_i, x_i))$ and $v = ((\tau_j, y_j))$:
- $q = \min(x_i, y_j)$
- Substitution of $q$ units: $q\,w_\mathrm{sub}(\sigma_i, \tau_j)$
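- Surplus: residual deletion (if $x_i > y_j$) or insertion (if $x_i < y_j$) of length $|x_i - y_j|$; the substitution term plus this surplus cost forms the $\text{subCost}$ used in the DP step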
- The DP step:
  
  $F[i, j] = \min \Bigl\{ F[i-1, j] + x_i w_\mathrm{del}(\sigma_i),\; F[i, j-1] + y_j w_\mathrm{ins}(\tau_j),\; F[i-1, j-1] + \text{subCost} \Bigr\}$
- Computational complexity is $O(nm)$ in the numbers $n$ and $m$ of contraction factors, or in the expanded case (after rational-to-integer mapping) $O(|w_1|\cdot|w_2|)$, where $w_1$ and $w_2$ denote the expanded strings (Baek, 2024).
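The factor-level step listed above can be transcribed into Python as a sketch. Several things here are assumptions rather than facts from Baek (2024): the function name `exp_edit_factor_dp`, the toy cost functions `w_sub` and `w_unit`, and the reading of the surplus as a deletion of the residual when $x_i > y_j$ and an insertion otherwise. The code follows the recurrence as written in this article and is not claimed to reproduce the exact exp-edit distance in every case.

```python
from typing import Callable, List, Tuple

Factor = Tuple[str, float]  # (phone, duration)


def exp_edit_factor_dp(u: List[Factor], v: List[Factor],
                       w_sub: Callable[[str, str], float],
                       w_ins: Callable[[str], float],
                       w_del: Callable[[str], float]) -> float:
    """O(n*m) DP over contraction factors, following the step in the text."""
    n, m = len(u), len(v)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        a, x = u[i - 1]
        F[i][0] = F[i - 1][0] + x * w_del(a)   # delete the whole factor
    for j in range(1, m + 1):
        b, y = v[j - 1]
        F[0][j] = F[0][j - 1] + y * w_ins(b)   # insert the whole factor
    for i in range(1, n + 1):
        a, x = u[i - 1]
        for j in range(1, m + 1):
            b, y = v[j - 1]
            q = min(x, y)
            sub_cost = q * w_sub(a, b)         # substitute q units
            # Assumed surplus handling: charge the residual duration as a
            # deletion (x > y) or an insertion (y > x).
            if x > y:
                sub_cost += (x - y) * w_del(a)
            else:
                sub_cost += (y - x) * w_ins(b)
            F[i][j] = min(F[i - 1][j] + x * w_del(a),
                          F[i][j - 1] + y * w_ins(b),
                          F[i - 1][j - 1] + sub_cost)
    return F[n][m]


def w_sub(a: str, b: str) -> float:
    """Toy per-unit substitution cost (hypothetical)."""
    return 0.0 if a == b else 0.5


def w_unit(a: str) -> float:
    """Toy per-unit insert/delete cost (hypothetical)."""
    return 1.0


u = [("a", 2.0), ("t", 1.0)]
v = [("a", 3.0), ("d", 1.0)]
print(exp_edit_factor_dp(u, v, w_sub, w_unit, w_unit))  # 1.5
```

In this toy run, one extra unit of /a/ is inserted (cost 1.0) and one unit of /t/ is substituted by /d/ (cost 0.5), which gives 1.5.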

4. Applications in Linguistics and Speech Technology

PSD supports:

Lexical Similarity Analysis: Detects genetic relationships, shared substrate, and vocabulary transfer (borrowing/loanwords) between languages by comparing PoS-wise lexeme lists after orthography→IPA→feature mapping. Heatmaps of PED across languages reveal Indo-Aryan clade structure, borrowings (e.g., Urdu↔Arabic), and genetic affinity (e.g., Marathi–Hindi) (Ahmed et al., 2020).
Script-Oblivious Comparison: PED and exp-edit distance work regardless of script, requiring only a script-dependent but language-agnostic mapping to IPA prior to feature extraction, enabling cross-writing-system analysis (Ahmed et al., 2020).
Phonetic Transcription and Evaluation: PSD with durations (exp-edit) captures both phone sequences and temporal alignment for evaluating phone-level ASR, forced alignment, and ground-truth comparisons, accommodating variation in speech tempo and articulation (Baek, 2024).
Metric Suitability: PSD reduces overpenalization for regular phonological alternations (e.g., voicing, fronting, vowel quality shifts) present in cognate or dialect variant pairs, in contrast to harsh scoring in standard edit-distance approaches (Ahmed et al., 2020).

5. Experimental Studies and Case Analysis

Empirical evaluations include the following findings:

For German "vater" /fatar/ and Persian "pidar" /pedær/: IPA-string Levenshtein = 4, PED = 0.817
For Hebrew "shalom" and Arabic "salaam": IPA edit distance = 2, PED = 0.934
Closed-class PoS (adpositions, auxiliaries, pronouns) show tight Indo-Aryan clustering under PED; open-class nouns reveal known borrowings and shared inheritance (Ahmed et al., 2020).
For exp-edit, distance on sequences with different phone durations can be decomposed directly in terms of the phone-wise cost and the differences in time-alignment, preserving both phonological and prosodic realism (Baek, 2024).

The following table compares the IPA edit distance (ED) with PED for the two word pairs above (PED values rounded to two decimals):

Word Pair	IPA ED	PED/PSD
vater / pidar	4	0.82
shalom / salaam	2	0.93

6. Theoretical Properties and Metric Structure

The exp-edit distance satisfies metric properties under symmetric and triangle-inequality-preserving base costs:

Symmetry: $w_{\mathrm{ins}}(a) = w_{\mathrm{del}}(a)$, $w_{\mathrm{sub}}(a, b) = w_{\mathrm{sub}}(b,a)$
Triangle Inequality: Each base cost function must satisfy $w(a\to c) \leq w(a\to b) + w(b\to c)$ for $a, b, c \in \Sigma \cup \{\lambda\}$
Prefix/Suffix Invariance: $\mathrm{dist}(xu, xv) = \mathrm{dist}(u, v)$

A plausible implication is enhanced metric suitability for clustering, hierarchical distance estimation, or embedding-based applications (Baek, 2024).
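The symmetry and triangle-inequality conditions on the base costs can be checked mechanically for a finite alphabet. The sketch below is illustrative only: `check_base_costs`, `base_cost`, and the use of the empty string for $\lambda$ are assumptions made for this example, with insertion read as $w(\lambda \to a)$ and deletion as $w(a \to \lambda)$.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

LAMBDA = ""  # stands in for the empty symbol (assumed encoding)


def base_cost(a: str, b: str,
              w_sub: Callable[[str, str], float],
              w_ins: Callable[[str], float],
              w_del: Callable[[str], float]) -> float:
    """Cost w(a -> b) over the alphabet extended with the empty symbol."""
    if a == b:
        return 0.0
    if a == LAMBDA:
        return w_ins(b)
    if b == LAMBDA:
        return w_del(a)
    return w_sub(a, b)


def check_base_costs(alphabet: Iterable[str],
                     w_sub: Callable[[str, str], float],
                     w_ins: Callable[[str], float],
                     w_del: Callable[[str], float],
                     tol: float = 1e-12) -> Tuple[bool, bool]:
    """Return (symmetric, triangle_ok) for the given base costs."""
    symbols = list(alphabet) + [LAMBDA]

    def cost(a: str, b: str) -> float:
        return base_cost(a, b, w_sub, w_ins, w_del)

    symmetric = all(abs(cost(a, b) - cost(b, a)) <= tol
                    for a, b in product(symbols, repeat=2))
    triangle_ok = all(cost(a, c) <= cost(a, b) + cost(b, c) + tol
                      for a, b, c in product(symbols, repeat=3))
    return symmetric, triangle_ok


# Toy costs: unit insert/delete, substitution 0.5 between distinct phones.
good = check_base_costs("ptd", lambda a, b: 0.5, lambda a: 1.0, lambda a: 1.0)
# A substitution cost of 3 exceeds delete + insert (1 + 1), breaking the triangle.
bad = check_base_costs("ptd", lambda a, b: 3.0, lambda a: 1.0, lambda a: 1.0)
print(good, bad)  # (True, True) (True, False)
```

The check enumerates all symbol triples, so it is cubic in the alphabet size, which is acceptable for phone inventories of typical size.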

7. Limitations, Open Challenges, and Prospects

Current approaches are influenced by the choice of feature representations and mapping fidelity in the orthography-to-IPA-to-feature pipeline. Limitations include:

Rule-based IPA mapping, which lacks coverage for complex metaplasms, silent letters, or idiosyncratic grapheme-to-phone correspondences. Integration of full lexicon/dictionary approaches is needed for ultimate accuracy (Ahmed et al., 2020).
Some language-specific morphophonological idiosyncrasies, e.g., root-and-pattern phenomena in Semitic verbs, may lead to misleading alignment and artificially diminished distances.
Coverage bias due to uneven source corpora may cause instability in aggregate lexical similarity estimates.
For exp-edit, the algorithm scales well for moderate sequence lengths after RLE, though a large number of contraction factors or very fine duration quantization can still incur high computational cost (Baek, 2024).

Extensions to incorporate additional phonetic features (tone, secondary articulation), or adaptation to further languages and scripts, are straightforward once the appropriate IPA-to-feature tables are expanded (Ahmed et al., 2020). A plausible implication is that such metrics will increasingly underpin large-scale cross-linguistic phonological research and robust speech evaluation pipelines.

References:

"Discovering Lexical Similarity Through Articulatory Feature-based Phonetic Edit Distance" (Ahmed et al., 2020)
"Exponent-Strings and Their Edit Distance" (Baek, 2024)
