
CLM-Bench: Chinese-First Editing Benchmark

Updated 31 January 2026
  • CLM-Bench is a culture-aware, Chinese-first benchmark that evaluates multilingual knowledge editing using native Chinese data aligned with idiomatic English.
  • It employs manual bilingual alignment and stringent quality checks to mitigate translation bias and reveal language-specific parameter-space discrepancies.
  • Experimental evaluations show near-orthogonal language-specific edits, underscoring limited cross-lingual transfer in current editing methods.

CLM-Bench is a culture-aware, Chinese-first benchmark established to systematically evaluate and analyze the failure modes of knowledge editing in LLMs when these models are subjected to multilingual or cross-lingual settings. Unlike previous approaches, which rely on mechanically translating English-centric datasets, CLM-Bench is grounded in native Chinese CounterFact data, subsequently aligned with carefully curated English versions. This design enables the disentanglement of cultural artifacts and translation biases, exposing the limits of cross-lingual knowledge transfer and providing a rigorous framework for studying parameter-space alignment and misalignment in LLMs (Hu et al., 24 Jan 2026).

1. Design Motivation and Benchmark Definition

The motivation for CLM-Bench arises from two distinct deficiencies in prior multilingual knowledge editing (MKE) benchmarks: (i) the predominance of English-origin data introduces “translationese” artifacts and Anglocentric entity bias when translated mechanically, neglecting genuinely native knowledge distributions; (ii) the lack of robust, non-English-anchored benchmarks precludes quantitative investigation of cross-lingual propagation phenomena in LLM knowledge editing.

CLM-Bench is explicitly constructed to fill this methodological gap and rigorously assess both monolingual and cross-lingual knowledge editing efficacy. Its core is a set of 1,010 CounterFact pairs rooted in Chinese domains such as history, literature, science, geography, politics, and pop culture, with manual alignment to idiomatic English forms. Each pair comprises a factual statement and a corresponding “edit” (often an injected falsehood), designed to probe not only memorization but also causal manipulation across language boundaries.

2. Dataset Construction and Content

CLM-Bench employs a Chinese-first generation methodology using the DeepSeek-R1 LLM, guided by domain-specific prompts to yield approximately 1,100 raw Chinese CounterFact statements. These are subsequently processed as follows:

  • Manual bilingual alignment: Each item’s prompt, “ground_truth” (correct answer), and “target_new” (edit to inject) are translated and edited for grammaticality, idiomaticity, and semantic fidelity, yielding a natural English counterpart.
  • Control integration: Sample structure incorporates “loc” (locality/contextual control) and “loc_ans” (control queries for unrelated knowledge preservation) fields adopted from ZsRE/CounterFact datasets.
  • Quality assurance: Deduplication using textual similarity (>0.9 threshold), manual assessment for contrast between “ground_truth” and “target_new,” and bilingual fluency checks.
  • Domain coverage: Six super-domains and twenty-four specific subject areas, e.g., “Classical Chinese Literature” (10.1%), “Modern Chinese History” (0.8%). Entities span canonical Chinese figures (e.g., “Zhuge Liang,” “水浒传” [Water Margin]) and globally salient topics with relevance in Chinese contexts.

The result is a benchmark optimized for both cultural authenticity and linguistic naturalness, which avoids the “translationese” and entity-gap problems endemic to earlier translation-based benchmarks.

3. Evaluation Protocol and Metrics

CLM-Bench evaluates models across monolingual and cross-lingual editing, employing three primary intervention settings: (1) Chinese-only edits, (2) English-only edits, and (3) simultaneous bilingual edits (“mixed”). Batch-mode methods (MEMIT, AlphaEdit, PMET) are deployed on Llama-3-8B, Qwen2-7B, Mistral-7B, and Llama2-7B-Chinese.

Key metrics derived per intervention and test language are as follows:

  • Reliability: Accuracy on directly edited queries, \text{Reliability} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}[M_{\text{edit}}(q) = y_{\text{new}}]
  • Generality: Accuracy on paraphrased triggers.
  • Locality: Fraction of “control” queries on which the answer remains unchanged.
  • Monolingual Editing Score: Averaged reliability, generality, and locality per language, e.g.,

zh\_score = \frac{\text{Reliability}_{zh} + \text{Generality}_{zh} + \text{Locality}_{zh}}{3}

  • Cross-lingual Transfer (trans): Mean of the reliability when testing edits in the non-target language.

trans = \frac{1}{2} \left( \text{Reliability}_{en \mid \text{edit}_{zh}} + \text{Reliability}_{zh \mid \text{edit}_{en}} \right)
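A minimal sketch of how these metrics combine, using toy per-language accuracies; the numbers and the bookkeeping are illustrative, not results from the paper:

```python
def accuracy(preds, targets):
    """Fraction of queries where the edited model's output equals the target."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

# Toy per-language results (illustrative values, not from the paper).
# reliability/generality: edited queries and paraphrases vs. target_new;
# locality: control queries vs. the pre-edit answers (loc_ans).
zh = {"reliability": accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"]),
      "generality":  accuracy(["a", "b", "x", "x"], ["a", "b", "c", "d"]),
      "locality":    accuracy(["u", "v"], ["u", "v"])}

# Monolingual editing score: mean of the three metrics per language.
zh_score = (zh["reliability"] + zh["generality"] + zh["locality"]) / 3

# Cross-lingual transfer: mean reliability in the *other* language.
rel_en_given_zh_edit = 0.20   # hypothetical
rel_zh_given_en_edit = 0.25   # hypothetical
trans = (rel_en_given_zh_edit + rel_zh_given_en_edit) / 2

print(zh_score, trans)
```

With these toy inputs, zh_score averages 0.75, 0.5, and 1.0 to 0.75, and trans averages the two cross-lingual reliabilities to 0.225.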

4. Experimental Findings: Cross-lingual Misalignment

A salient discovery from CLM-Bench is the near-complete independence of monolingual edits in LLM parameter space. For example, experiments with Llama-3-8B + MEMIT (batch size = 1,000) yield high reliability for Chinese editing (≈ 63.2%) with lower generality (≈ 56.7%), but the corresponding English edit reliability in the same batch falls to 19.6%; the trans score, measuring propagation, is only 31.0%.

This persistent gap is invariant to batch size, layer selection (editing layers 9–12), and editing method (MEMIT, AlphaEdit, PMET), indicating a systemic misalignment. English-centric models outperform in English monolingual editing but not in Chinese transfer; Chinese-trained models (e.g., Qwen2-7B, Llama2-7B-Chinese) display the reverse, with trans values of at most ≈ 33% even for the most balanced model. This shows that models learn, store, and edit factual knowledge in largely disjoint language-specific subspaces.

5. Geometric Analysis of Parameter-Space Edits

To explain the observed misalignment, CLM-Bench introduces a geometric representation analysis. After an intervention, the down-projection weights of the multilayer perceptron (MLP) at layer 12 are extracted. For each language or setting (Chinese, English, or mixed), the weight differences (Δ_zh, Δ_en, Δ_mix) are flattened into edit vectors (v^(zh), v^(en), v^(mix)).

Principal findings include:

  • Orthogonality of Edit Directions: Cosine similarity between v^(zh) and v^(en) is ≈ 0.03–0.04, indicating near-orthogonality. Principal subspace angles θ_i exceed 60° (often ≈ 80°), and neuron overlap (Jaccard index for top-k neurons) is below 0.32. These results confirm that edits in different languages manipulate non-overlapping regions of parameter space.
  • Linearity in Mixed Edits: The relative Frobenius-norm error between Δ_mix and the sum Δ_zh + Δ_en remains below 0.23, with cosine similarity above 0.976 across batch sizes. This establishes that mixed edits are almost perfectly additive, with no synergy or interference.

These observations mechanistically explain why edits in one language fail to propagate: the relevant updated weights do not overlap or interact.
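A minimal sketch of these diagnostics, using synthetic flattened edit vectors with nearly disjoint supports; the vectors and dimensions are illustrative stand-ins, not extracted from any model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened edit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return dot / (norm(u) * norm(v))

def jaccard_topk(u, v, k):
    """Overlap of the k largest-magnitude coordinates (stand-in for neurons)."""
    top = lambda w: set(sorted(range(len(w)), key=lambda i: -abs(w[i]))[:k])
    a, b = top(u), top(v)
    return len(a & b) / len(a | b)

# Toy edit vectors: nearly disjoint supports mimic the finding that
# zh and en edits touch different regions of parameter space.
d_zh  = [1.0, 0.9, 0.0, 0.0, 0.05, 0.0]
d_en  = [0.0, 0.05, 0.0, 1.0, 0.9, 0.0]
d_mix = [a + b for a, b in zip(d_zh, d_en)]   # additivity check: mix vs. zh + en

cos_zh_en = cosine(d_zh, d_en)
# Relative Frobenius-norm error between the mixed edit and the sum of
# the two monolingual edits (zero here because d_mix is the exact sum).
rel_err = math.sqrt(sum((m - (a + b)) ** 2
                        for m, a, b in zip(d_mix, d_zh, d_en))) \
          / math.sqrt(sum(m * m for m in d_mix))

print(round(cos_zh_en, 3), jaccard_topk(d_zh, d_en, 2), rel_err)
```

On this toy data the cosine similarity is ≈ 0.05 and the top-k Jaccard overlap is 0, qualitatively matching the near-orthogonality reported above; real principal-angle computations would additionally require an SVD of the reshaped weight deltas.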

6. Implications and Prospective Directions

CLM-Bench demonstrates that the canonical hypothesis of a unified “interlingual” subspace accessible for knowledge manipulation by gradient-based techniques is not supported empirically. Editing methods such as ROME, MEMIT, and PMET optimize in language-specific subspaces, resulting in limited or absent cross-lingual transfer.

Risks include both staleness and catastrophic interference. For example, batch editing in one language may degrade knowledge in another, or simply fail to update it, challenging conventional expectations for multilingual LLMs. A plausible implication is that parameter-efficient algorithms must explicitly seek to induce overlap between language-specific edit subspaces to achieve effective transfer.

Recommendations drawn from these findings include:

  • Designing editing algorithms that jointly optimize multilingual alignment, e.g., cross-lingual regularization to encourage overlap between Δzh\Delta_{zh} and Δen\Delta_{en}.
  • Extending the benchmarking paradigm to language pairs beyond Chinese–English, with each starting from native, culturally specific CounterFacts.
  • Exploring alternative knowledge editing paradigms (retrieval-augmented, memory-based) that may exhibit robustness to these subspace disjunctions.
  • Investigating representation-level alignment (subspace alignment/projector networks) to encode edits consistently across languages.
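As an illustration of the first recommendation, a regularizer could reward cosine overlap between the two language-specific edit vectors. The function name, form, and weighting below are hypothetical assumptions, not part of any published method:

```python
def overlap_penalty(d_zh, d_en, lam=1.0):
    """Hypothetical cross-lingual regularizer: penalize orthogonality between
    the two language-specific edit vectors by rewarding cosine similarity."""
    dot = sum(a * b for a, b in zip(d_zh, d_en))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    cos = dot / (norm(d_zh) * norm(d_en))
    return lam * (1.0 - cos)  # 0 when perfectly aligned, up to 2 when opposed

# Aligned edits incur no penalty; orthogonal edits (the situation CLM-Bench
# observes) are penalized, nudging joint optimization toward shared subspaces.
print(overlap_penalty([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(overlap_penalty([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

In a joint editing objective, such a term would be added to the per-language reconstruction losses, trading some monolingual reliability for cross-lingual transfer.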

7. Significance and Availability

CLM-Bench supplies the first large-scale, culturally native, Chinese-first knowledge editing benchmark with rigorous bilingual alignment and broad domain coverage. It exposes fundamental limitations in current knowledge editing methods and provides both empirical metrics and mechanistic diagnostics to inform the development of truly multilingual, culturally aware editing techniques. The full dataset of 1,010 CounterFact pairs is publicly released, furnishing a critical resource for both evaluation and algorithmic innovation in the domain of multilingual LLM editing (Hu et al., 24 Jan 2026).
