
Squeezing-Heads Distillation: Quantum & Neural Methods

Updated 4 January 2026
  • Squeezing-Heads Distillation (SHD) names two families of distillation techniques: non-Gaussian distillation of squeezed and entangled states in quantum optics, and attention-head compression for knowledge transfer in Transformer architectures.
  • In quantum optics, it employs non-Gaussian operations like photon subtraction and heralded Gaussification to purify and amplify squeezed states and multipartite entanglement.
  • In machine learning, SHD applies convex combinations to compress teacher attention maps, facilitating efficient knowledge transfer without additional architecture overhead.

Squeezing-Heads Distillation (SHD) encompasses several protocols and variants for the distillation and purification of squeezed states of light, multipartite continuous-variable entangled states, and knowledge transfer in Transformer-based neural architectures. In quantum optics, SHD refers primarily to non-Gaussian resource distillation via selective photon subtraction, displacement operations, and heralded Gaussification to achieve strengthened squeezing and state purification, including multipartite settings. In machine learning, SHD designates a method for compressing, aligning, and transferring multi-head attention in neural transformers irrespective of head-count mismatch, thus enabling flexible, efficient knowledge distillation. Below, both domains are addressed according to the principal research findings.

1. SHD Protocols in Quantum Optics: Squeezed State Distillation and Purification

In quantum optics, SHD targets the enhancement and purification of single-mode and multipartite squeezed states, which are indispensable for quantum information and quantum sensing (Fiurášek et al., 1 Feb 2025).

1.1 Protocol Steps

  • De-Gaussification via Modified Two-Photon Subtraction:

The protocol begins by tapping a fraction of an input single-mode squeezed vacuum $\hat S(r_{\rm in})|0\rangle$ on a beam splitter (BS$_1$). The output is interfered with a weak coherent state on BS$_2$, and exactly one photon is subtracted from both output ports. This implements a non-Gaussian filter $\hat M$ with tunable displacement $\delta$:

$\hat M \propto (\hat a + \delta)(\hat a - \delta) = \hat a^2 - \delta^2$
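
The filter's action can be sketched numerically in a truncated Fock basis (an illustrative sketch, not the paper's computation; it uses the plain two-photon-subtraction case $\delta = 0$ and assumes the convention $\hat x = (\hat a + \hat a^\dagger)/\sqrt{2}$ with vacuum variance $1/2$):

```python
import numpy as np
from scipy.linalg import expm

# Illustrative sketch (not the paper's computation): hbar = 1,
# x = (a + a^dag)/sqrt(2), vacuum variance 1/2; delta = 0 case.
dim = 40                                     # Fock-space truncation
a = np.diag(np.sqrt(np.arange(1, dim)), 1)   # annihilation operator
x = (a + a.T) / np.sqrt(2)                   # position quadrature

r = 0.3                                      # input squeezing parameter
S = expm((r / 2) * (a @ a - a.T @ a.T))      # squeezing operator (squeezes x)
vac = np.zeros(dim); vac[0] = 1.0
psi_in = S @ vac                             # squeezed vacuum

def var_x(psi):
    psi = psi / np.linalg.norm(psi)
    return psi @ x @ x @ psi - (psi @ x @ psi) ** 2

psi_out = a @ a @ psi_in                     # filter M = a^2 (delta = 0)

print(f"input  Var(x) = {var_x(psi_in):.4f}")   # ≈ exp(-2r)/2 ≈ 0.2744
print(f"output Var(x) = {var_x(psi_out):.4f}")  # ≈ 0.1603: stronger squeezing
```

In this sketch the filtered state is more strongly squeezed whenever $\tanh r_{\rm in} < 1/2$, consistent with the enhancement described above for weakly squeezed inputs.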

  • Heralded Gaussification:

Two identical copies of the previously filtered state are interfered on a balanced beam splitter followed by projection onto vacuum:

$\hat\rho' \propto \langle 0|\, U_{\rm BS} \left(\hat\rho_{\rm NG} \otimes \hat\rho_{\rm NG}\right) U_{\rm BS}^\dagger \,|0\rangle$

Iterating this two-copy step distills the state toward a more strongly squeezed Gaussian state.
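
The heralding step can likewise be sketched for pure states in a truncated Fock space (assumptions of this sketch: pure inputs, a 50:50 beam splitter, and a cutoff of 12 photons per mode; a squeezed vacuum should be a fixed point of the map):

```python
import numpy as np
from scipy.linalg import expm

dim = 12                                     # Fock truncation per mode
a = np.diag(np.sqrt(np.arange(1, dim)), 1)   # single-mode annihilation
I = np.eye(dim)
A, B = np.kron(a, I), np.kron(I, a)          # two-mode operators
U = expm((np.pi / 4) * (A @ B.T - A.T @ B))  # balanced (50:50) beam splitter

def gaussify(psi):
    """One heralding step: interfere two copies on U, project mode b onto |0>."""
    out = (U @ np.kron(psi, psi)).reshape(dim, dim)[:, 0]
    return out / np.linalg.norm(out)

# Sanity check: a squeezed vacuum is (up to truncation) a fixed point.
r = 0.2
S = expm((r / 2) * (a @ a - a.T @ a.T))      # squeezes the x quadrature
vac = np.zeros(dim); vac[0] = 1.0
psi = S @ vac
print(abs(psi @ gaussify(psi)))              # ≈ 1.0
```

Identical Gaussian states are invariant under a balanced beam splitter, so the vacuum projection returns the input squeezed vacuum; non-Gaussian inputs are instead driven toward a Gaussian state by iteration.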

  • Alternative De-Gaussification by Fock-State Filtering:

The filter $\hat F_1 = \hat I - |1\rangle\langle 1|$ eliminates the single-photon component, possibly using photon catalysis or operator superpositions. Subsequent iterative Gaussification can yield pure squeezed states even from mixed initial conditions.

1.2 Covariance and Squeezing Formulae

After de-Gaussification, the squeezed and anti-squeezed quadrature variances become:

$V_X = e^{2r_{\rm in}}\left[1 + 4\sinh^2 r_{\rm in}\,\dfrac{2\sinh^2 r_{\rm in} + \cosh r_{\rm in}\sinh r_{\rm in} - \delta^2}{2\sinh^4 r_{\rm in} + \left(\cosh r_{\rm in}\sinh r_{\rm in} - \delta^2\right)^2}\right]$

$V_Y = e^{-2r_{\rm in}}\left[1 + 4\sinh^2 r_{\rm in}\,\dfrac{2\sinh^2 r_{\rm in} - \cosh r_{\rm in}\sinh r_{\rm in} + \delta^2}{2\sinh^4 r_{\rm in} + \left(\cosh r_{\rm in}\sinh r_{\rm in} - \delta^2\right)^2}\right]$

Optimizing the displacement yields

$\delta_{\rm opt}^2 = \cosh r_{\rm in}\sinh r_{\rm in} - \left(2 + \sqrt{6}\right)\sinh^2 r_{\rm in}$

The final squeezing parameter $r_{\rm out}$ after one Gaussification step is

$\tanh r_{\rm out} = \dfrac{3\tanh r_{\rm in} - \delta^2}{\tanh r_{\rm in} - \delta^2}\,\tanh r_{\rm in}$
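
Taking this one-step relation at face value, a few lines suffice to iterate it; for $\delta = 0$ it triples $\tanh r$ per round until the expression leaves the physical range:

```python
import numpy as np

def one_step(r_in, delta2=0.0):
    """r_out after one de-Gaussification + Gaussification round,
    per the document's relation tanh r_out = (3t - d^2)/(t - d^2) * t."""
    t = np.tanh(r_in)
    t_out = (3 * t - delta2) / (t - delta2) * t
    if not 0 <= t_out < 1:
        raise ValueError("relation leaves the physical range tanh r < 1")
    return float(np.arctanh(t_out))

r = 0.05
for step in range(2):
    r = one_step(r)                          # delta = 0: tanh r triples
    print(f"step {step + 1}: r = {r:.4f}")   # squeezing grows each round
```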

1.3 Performance and Limitations

  • Success probability:

The success probability depends on the beam-splitter transmittance $T$ and the displacement $\delta$:

$P_{\rm succ} = \left(\dfrac{1-T}{2T}\right)^{\!2} e^{-(1-T)|\delta|^2/T} \times \left[\text{covariance-dependent factors}\right]$

For typical parameters, $P_{\rm succ}$ lies in the $10^{-4}$ to $10^{-1}$ range, as detailed in the data table.

  • Loss and mixed states:

SHD using two-photon subtraction plus Gaussification cannot remediate pre-existing transmission loss, which strictly limits output fidelity.

  • Regime of strong distillation:

Arbitrarily strong squeezing enhancement is theoretically possible for small $r_{\rm in}$, but at the cost of a vanishing success probability, $P_{\rm succ} \propto r_{\rm in}^4$ (Fiurášek et al., 1 Feb 2025).

2. Multipartite Continuous-Variable SHD: Local Squeezing with Single Photon Subtraction

Yang et al. extended SHD to multipartite entangled states, avoiding the exponential decay in success probability that afflicts Opatrný-style photon subtraction on every mode (Yang et al., 2011).

2.1 Protocol Steps

  • Single photon subtraction on one mode:

Instead of performing $N$ local photon subtractions on an $N$-mode state, only a single mode is photon-subtracted, while all modes are locally squeezed via symplectic transforms $S_i(r_i)$.

  • Measurement and heralding:

The heralded state post-measurement is a non-Gaussian mixture:

$\rho_{\rm out} \propto \delta\,\rho(\Gamma_1) - \rho(\Gamma_2)$

where the covariance matrices $\Gamma_1$ and $\Gamma_2$ are fixed by the beam-splitter interaction and the heralding measurement.

  • Success probability:

$P_{\rm succ} = (\delta - 1)/\delta$

Crucially, $P_{\rm succ}$ remains constant, of order $10^{-2}$, regardless of $N$.
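
The scaling advantage can be illustrated with hypothetical numbers ($p$ and $\delta$ below are illustrative, not values from the paper): $N$ independent local subtractions succeed with probability $p^N$, while the single-subtraction rate $(\delta - 1)/\delta$ does not depend on $N$:

```python
# Hypothetical numbers for illustration: per-mode heralding probability p
# for N independent local subtractions vs. the N-independent SHD rate.
p, delta = 0.2, 1.02
shd_rate = (delta - 1) / delta               # constant in N, here ~2e-2
for N in (2, 4, 8):
    print(N, p ** N, shd_rate)               # p**N decays exponentially
```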

2.2 Entanglement Enhancement

  • Logarithmic negativity:

Quantifies entanglement gain:

$E_N(\rho) = \log_2 \left\|\rho^{T_k}\right\|_1$

For $N = 3$ modes, local squeezing optimized at $r_i^{\rm opt} \sim 1.4\,r_{\rm in}$ increases $E_N$ above the input entanglement.
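
As a self-contained illustration of the measure itself (a finite-dimensional qubit example, not the continuous-variable computation), the logarithmic negativity of a Bell state via partial transpose is:

```python
import numpy as np

def log_negativity(rho, dims, k=0):
    """E_N = log2 || rho^{T_k} ||_1 via partial transpose on subsystem k."""
    d1, d2 = dims
    r = rho.reshape(d1, d2, d1, d2)
    r = r.transpose(2, 1, 0, 3) if k == 0 else r.transpose(0, 3, 2, 1)
    rt = r.reshape(d1 * d2, d1 * d2)
    # trace norm of a Hermitian matrix = sum of |eigenvalues|
    return np.log2(np.abs(np.linalg.eigvalsh(rt)).sum())

# Bell state (|00> + |11>)/sqrt(2) has E_N = 1
bell = np.zeros(4); bell[0] = bell[3] = 1 / np.sqrt(2)
rho = np.outer(bell, bell)
print(log_negativity(rho, (2, 2)))           # → 1.0 (up to rounding)
```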

2.3 N-Mode Transfer Theorem

A closed-form expression connects the Gaussian state’s phase-space representation to its Fock basis elements:

$\langle k_1,\dots,k_N | \rho(\Gamma) | m_1,\dots,m_N \rangle = \frac{1}{\sqrt{\prod_i k_i! m_i!}} \left[ \prod_{i=1}^N \frac{\partial^{k_i}}{\partial t_i^{k_i}} \frac{\partial^{m_i}}{\partial t'_i^{m_i}} F(t,t') \right]_{t=t'=0}$

where $F(t, t')$ is a Gaussian function of the squeezing-dependent covariance matrices.
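
The differentiation rule can be exercised on a toy single-mode case: assuming the thermal-state generating function $F(t, t') = (1-\lambda)e^{\lambda t t'}$ (an illustrative assumption, chosen to be consistent with $\langle n|\rho|n\rangle = (1-\lambda)\lambda^n$), the rule recovers the diagonal Fock elements:

```python
from math import factorial

lam = 0.5                                    # thermal parameter nbar/(1 + nbar)

def fock_element(k, m):
    """<k|rho|m> via the transfer-theorem rule for the assumed
    F(t,t') = (1 - lam) * exp(lam * t * t').

    exp(lam*t*t') = sum_n lam^n (t*t')^n / n!, so the mixed derivative
    d^k/dt^k d^m/dt'^m F at t = t' = 0 equals (1-lam)*lam^k*k! if k == m,
    else 0; dividing by sqrt(k! m!) gives the Fock matrix element."""
    if k != m:
        return 0.0
    deriv = (1 - lam) * lam ** k * factorial(k)
    return deriv / (factorial(k) * factorial(m)) ** 0.5

print(fock_element(2, 2))   # → 0.125 = (1 - lam) * lam**2
print(fock_element(1, 2))   # → 0.0  (thermal states are Fock-diagonal)
```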

3. Comparative Analysis: SHD and Non-Gaussian Probabilistic Operations

Chandan Kumar’s work characterizes squeezing distillation using photon subtraction (PS), photon addition (PA), and photon catalysis (PC) (Kumar, 2023):

  • Photon subtraction and catalysis:

Two-photon processes (PS, PC) can enhance squeezing; single-photon subtraction and single-photon addition never do.

  • Operator description:

Each operation is realized via conditional Kraus maps after beam-splitter interaction, using heralded detection events.

  • Squeezing parameter:

For two-photon subtraction ($m = 0$, $n = 2$):

$(\Delta q_1)^2_{2\text{-PS}} = -\dfrac{5}{2} + \dfrac{5}{1 + \lambda T} + \dfrac{2(\lambda T - 1)}{2\lambda^2 T^2 + 1}$

For two-photon catalysis ($m = n = 2$), similar rational expressions apply, with improved squeezing for small $\lambda$.

  • Success probability:

For 2-PS and 2-PC, closed-form success-probability formulae are given, with optimal working points balancing the squeezing improvement against a non-negligible heralding rate.

4. SHD in Neural Architectures: Multi-Head Attention Distillation

Squeezing-Heads Distillation also designates a knowledge distillation protocol for transformer-based neural networks that compresses multi-head attention (Bing et al., 11 Feb 2025).

4.1 Mathematical Formulation

  • Head compression by convex combination:

Teacher heads $A_{2i-1}$ and $A_{2i}$ are combined into one compressed attention map $\tilde A_i$ via linear interpolation with an optimized weight $\alpha_i$:

$\tilde A_i(\alpha_i) = \alpha_i A_{2i-1} + (1 - \alpha_i) A_{2i}$

$\alpha_i$ is chosen to minimize the Frobenius-norm distortion between the value propagation of the compressed head and that of the original head pair.
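
A minimal sketch of the head compression, assuming a shared value matrix $V$ and a simple least-squares objective for $\alpha$ (the paper's exact objective may differ; `squeeze_pair` and `target` are illustrative names):

```python
import numpy as np

def squeeze_pair(A1, A2, V, target):
    """Compress two attention maps into one convex combination.

    alpha minimizes || alpha*(A1@V) + (1-alpha)*(A2@V) - target ||_F,
    a 1-D least-squares problem with a closed form, clipped to [0, 1]."""
    M1, M2 = A1 @ V, A2 @ V
    d = M1 - M2
    denom = float(np.sum(d * d))
    alpha = 0.5 if denom < 1e-12 else float(np.sum(d * (target - M2)) / denom)
    alpha = float(np.clip(alpha, 0.0, 1.0))
    return alpha, alpha * A1 + (1.0 - alpha) * A2

def rows_softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, dv = 6, 4
A1 = rows_softmax(rng.normal(size=(n, n)))   # synthetic attention maps
A2 = rows_softmax(rng.normal(size=(n, n)))
V = rng.normal(size=(n, dv))
target = 0.3 * (A1 @ V) + 0.7 * (A2 @ V)     # synthetic target for the sketch
alpha, A_sq = squeeze_pair(A1, A2, V, target)
print(round(alpha, 3))                       # → 0.3 (recovers the mixing weight)
```

Because both inputs are row-stochastic, the convex combination is again a valid attention map (its rows still sum to 1).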

  • Training loss:

KL-divergence is used between temperature-softened teacher and student attention maps:

$L_{\rm total} = L_0 + \beta \sum_{i=1}^{H^s} \mathrm{KL}\!\left(\mathrm{softmax}(\tilde A_i / T_a),\ \mathrm{softmax}(A^s_i / T_a)\right)$
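
The distillation term can be sketched in a few lines of NumPy (a stand-in only: real implementations operate on framework tensors, and the inputs here are synthetic pre-softmax attention scores):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_kl(teacher_maps, student_maps, T_a=2.0, eps=1e-12):
    """Row-averaged KL(softmax(teacher/T_a) || softmax(student/T_a))."""
    p = softmax(teacher_maps, T_a)
    q = softmax(student_maps, T_a)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

rng = np.random.default_rng(1)
A_t = rng.normal(size=(5, 5))                # synthetic pre-softmax scores
A_s = A_t + 0.1 * rng.normal(size=(5, 5))
print(attn_kl(A_t, A_t))                     # → 0.0 (identical maps)
print(attn_kl(A_t, A_s) > 0)                 # → True (mismatch is penalized)
```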

4.2 Computational Efficiency and Practical Advantages

  • Complexity:

SHD’s convex combination costs $O(N^2)$ per attention map for sequence length $N$, matching the cost of native self-attention.

  • No extra parameters or architecture modifications:

The student model need not match the teacher’s head count nor include projection modules.

  • Empirical performance:

SHD delivers consistent improvements across vision and language tasks, outperforming baseline distillation and feature-aligning methods, with demonstrable gains in FID, IS, ROUGE, and accuracy.

| Domain | Key Protocol/Mechanism | Distillation/Compression Method |
| --- | --- | --- |
| Quantum optics | Squeezing enhancement, purification | Two-photon subtraction + Gaussification, Fock-state filters |
| Quantum optics (multipartite) | Entanglement gain, $N$-mode stability | Local squeezing + single photon subtraction |
| Machine learning | Attention-map compression | Linear convex combinations; projection-free, architecture-agnostic |

5. Significance and Outlook

SHD gathers a class of resource-distillation protocols in quantum optics and neural computation, addressing scalability and architectural-alignment barriers that earlier methods left open. In quantum optics, SHD protocols enable squeezing and multipartite entanglement distillation with nonvanishing heralding probabilities and explicit analytic links to state transfer and purification (Fiurášek et al., 1 Feb 2025, Yang et al., 2011, Kumar, 2023), though pre-existing transmission loss remains a fundamental limitation. The transfer theorem adds analytical tractability to non-Gaussian outputs for practical implementation. In neural architectures, SHD bridges head-count mismatch and aligns attention maps without parameter overhead or loss of fine-grained knowledge, as confirmed by strong empirical results (Bing et al., 11 Feb 2025).
