Human Acceptability Corpus (HAC)
- The Human Acceptability Corpus (HAC) is a dual benchmark resource that quantifies human acceptability judgments for two NLP tasks: code-switching speech recognition and extractive sentence compression.
- It employs robust empirical methodologies (minimal-edit post-editing for ASR, syntax-screened binary judgments for compression) to ensure precise collection of human judgments.
- The datasets are validated through inter-annotator agreement metrics and provide reproducible, empirically grounded benchmarks to drive improvements in NLP evaluation.
The Human Acceptability Corpus (HAC) encompasses two distinct, high-impact benchmark resources used to quantify human acceptability judgments in natural language processing: one targets code-switching speech recognition in dialectal Arabic/English, and the other focuses on extractive sentence compression in English. Both corpora establish robust empirical methodologies for collecting, quantifying, and modeling human judgments, thereby enabling systematic evaluation of machine-generated outputs with direct human-derived acceptability baselines (Hamed et al., 2022, Handler et al., 2019).
1. Corpus Instantiations and Domain Coverage
Code-Switching Speech Recognition HAC
The HAC for code-switching ASR consists of 1,301 unique utterances sampled from the ArzEn Egyptian Arabic–English conversational CS corpus, representing approximately 2 hours of speech. Each utterance is decoded by three ASR systems—HMM-DNN, Conformer-Accurate, and Conformer-Fast—yielding 3,903 ASR hypotheses. The linguistic data exhibit pervasive intra-sentential code-switching, predominantly Egyptian Dialectal Arabic (77% Arabic tokens), with English insertions written in Roman (or occasionally Arabic) script. No separate training split is provided; the corpus is exclusively for evaluation, with a subset of 203 utterances reserved for inter-annotator agreement analysis (Hamed et al., 2022).
Extractive Sentence Compression HAC
The HAC for sentence compression draws on 10,128 English web news sentences, each parsed into a Universal Dependencies tree (CoreNLP’s UD v1). For each sentence, a “compression” is generated by pruning a single subtree and re-linearizing the remaining tokens. This yields 10,128 source–compression pairs, with deletions sampled to reflect the distribution of UD dependency relations. A rigorous crowdsourcing workflow ensures each source–compression pair is judged by at least three U.S.-based Figure Eight workers, with syntactic screening and ongoing gold-standard calibration (Handler et al., 2019).
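The prune-and-relinearize generation step can be sketched as follows, using a toy head-index dependency representation rather than the paper’s CoreNLP pipeline:

```python
# Sketch of single-subtree pruning for compression candidates
# (toy dependency representation; not the paper's CoreNLP pipeline).
import random

def descendants(head, children):
    """Collect head plus all transitive dependents."""
    stack, out = [head], set()
    while stack:
        node = stack.pop()
        out.add(node)
        stack.extend(children.get(node, []))
    return out

def prune_one_subtree(tokens, heads, rng=random):
    """Delete one randomly chosen non-root subtree, then
    re-linearize the surviving tokens in original order."""
    children = {}
    for idx, head in enumerate(heads):
        if head is not None:          # the root's head is None
            children.setdefault(head, []).append(idx)
    candidates = [i for i, h in enumerate(heads) if h is not None]
    target = rng.choice(candidates)
    deleted = descendants(target, children)
    return " ".join(t for i, t in enumerate(tokens) if i not in deleted)

tokens = ["The", "committee", "met", "on", "Tuesday"]
heads  = [1, 2, None, 2, 3]           # index of each token's head
print(prune_one_subtree(tokens, heads, rng=random.Random(0)))
```

Because the prune target is a whole subtree, the surviving tokens always form a grammatically plausible (if not always acceptable) candidate, which is exactly what the binary judgments then adjudicate.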
2. Annotation Protocols and Human Judgment Collection
Code-Switching ASR: Minimal-Edit Post-Editing
Annotators are provided only with raw ASR hypotheses and corresponding audio. The primary instruction is to perform strictly minimal edits until the transcript is “acceptable”—readable such that reader intent and words are recoverable, without unnecessary regularization. Edits adhere to script segregation (Arabic words in Arabic script; English in Roman, unless morphologically integrated), bar cross-script mixing within words except for Arabic clitics, and allow dialectal orthographic variants if readability is maintained. There is no discrete “acceptability scale”; each minimally post-edited transcript serves as the “acceptable” reference, and edit distance to this form is the proxy for human judgment (Hamed et al., 2022).
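Since the minimally post-edited transcript serves as the reference, the human-judgment proxy reduces to an edit distance against it. A minimal sketch (illustrative romanized tokens, not real ArzEn data):

```python
# Sketch: edit distance from an ASR hypothesis to its minimally
# post-edited "acceptable" form, used as the human-judgment proxy.
def levenshtein(a, b):
    """Standard dynamic-programming edit distance over sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(hyp, ref):
    """Word error rate of a hypothesis against a reference."""
    h, r = hyp.split(), ref.split()
    return levenshtein(h, r) / len(r)

hyp = "ana kont fi el meeting embareh"
ref = "ana kont fi el meeting imbareh"
print(f"WER vs. post-edit: {wer(hyp, ref):.2f}")   # 1 edit / 6 words
```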
Sentence Compression: Syntax-Screened Subjective Judgments
Workers view source/compression sentence pairs and answer: “Can this shorter sentence be produced by deleting words from the longer sentence?”—a binary yes/no prompt, explicitly defined in terms of whether the output “sounds good,” not whether semantic content is preserved. Screening questions enforce knowledge of English syntax (e.g., obligatory verb arguments, multiword expressions). Each accepted pair receives redundant judgments, filtered for worker and response quality (Handler et al., 2019).
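Aggregating the redundant judgments might look like the following majority-vote sketch; the field names here are hypothetical, not those of the released data:

```python
# Sketch: majority-vote aggregation of redundant binary judgments,
# keeping only workers who passed the syntactic screening
# (hypothetical field names; the released CSVs may differ).
from collections import Counter

judgments = [
    {"worker": "w1", "passed_screen": True,  "accept": True},
    {"worker": "w2", "passed_screen": True,  "accept": True},
    {"worker": "w3", "passed_screen": False, "accept": False},  # filtered out
    {"worker": "w4", "passed_screen": True,  "accept": False},
]

def aggregate(judgments):
    """Majority vote over screened workers; returns (label, n_votes)."""
    votes = [j["accept"] for j in judgments if j["passed_screen"]]
    counts = Counter(votes)
    return counts[True] > counts[False], len(votes)

label, n = aggregate(judgments)
print(label, n)   # True 3 -> accepted by 2 of 3 screened workers
```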
3. Corpus Statistics and Inter-Annotator Agreement
| Domain | Items | Judgments per item | IAA Measure | IAA Score(s) |
|---|---|---|---|---|
| CS Speech (ArzEn) | 1,301 utts | 4 (203-utterance subset) | CER, WER (mean pairwise) | CER: 8.2%, WER: 17.4% |
| Sentence Compression | 10,128 | ≥3 | Fleiss’ κ (test set) | model–human κ: 0.400, human–human κ: 0.270 |
For code-switching, mean pairwise error rates across the 203 utterances annotated by all four annotators are CER 8.2% and WER 17.4%. For compression, model–human Fleiss’ κ is 0.400, with human–human κ at 0.270 on the single-prune intrinsic test set (Hamed et al., 2022; Handler et al., 2019).
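Fleiss’ κ itself is straightforward to compute from a ratings matrix; a minimal sketch with a toy matrix (formula per Fleiss, 1971):

```python
# Sketch: Fleiss' kappa over redundant binary judgments
# (toy rating matrix, not corpus data).
def fleiss_kappa(matrix):
    """matrix[i][k] = number of raters assigning item i to category k.
    Assumes the same number of raters per item."""
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    # observed per-item agreement
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in matrix]
    P_bar = sum(P_i) / n_items
    # chance agreement from overall category proportions
    totals = [sum(row[k] for row in matrix) for k in range(len(matrix[0]))]
    p_k = [t / (n_items * n_raters) for t in totals]
    P_e = sum(p * p for p in p_k)
    return (P_bar - P_e) / (1 - P_e)

# 3 raters, binary accept/reject counts per item
ratings = [[3, 0], [2, 1], [0, 3], [3, 0], [1, 2]]
print(f"{fleiss_kappa(ratings):.3f}")
```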
4. Evaluation Metrics and Modeling Approaches
Metric Taxonomy (Code-Switching)
Metrics are classified along three axes:
- Representation: Orthographic (grapheme), Phonological (edit distance over phones), Semantic (embedding- or MT-based).
- Directness: Intrinsic/error metrics (CER, WER, PER, PSD) versus extrinsic/semantic metrics (BLEU, chrF, BERTScore).
- Granularity: Character-level (CER, chrF, PER, PSD) versus word-level (WER, MER, WIL, BLEU, BERTScore).
Formally, the Levenshtein edit distance between hypothesis and reference underlies the orthographic error metrics. With $H$ hits, $S$ substitutions, $D$ deletions, and $I$ insertions against a reference of $N = H + S + D$ words and a hypothesis of $P = H + S + I$ words, the standard formulas are:

$$\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{MER} = \frac{S + D + I}{H + S + D + I}, \qquad \mathrm{WIL} = 1 - \frac{H}{N} \cdot \frac{H}{P}$$

Phonological similarity is captured with a phone-feature-based cost function (PSD). Pearson’s $r$ quantifies metric–human correlation, with the best-performing metric configuration achieving the strongest correlation against the human post-edited references (Hamed et al., 2022).
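The WER, MER, and WIL formulas can be computed directly from the counts of a Levenshtein alignment; a minimal sketch with illustrative counts:

```python
# Sketch: WER, MER, and WIL from hit/substitution/deletion/insertion
# counts of a Levenshtein alignment (standard formulas; the counts
# here are illustrative, not taken from the corpus).
def error_metrics(H, S, D, I):
    N = H + S + D          # reference length
    P = H + S + I          # hypothesis length
    wer = (S + D + I) / N
    mer = (S + D + I) / (H + S + D + I)
    wil = 1 - (H / N) * (H / P)
    return wer, mer, wil

wer, mer, wil = error_metrics(H=8, S=1, D=1, I=1)
print(f"WER={wer:.3f} MER={mer:.3f} WIL={wil:.3f}")
```

Note that WER can exceed 1.0 when insertions dominate, whereas MER and WIL are bounded in [0, 1], which is one reason the taxonomy separates them.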
Acceptability Modeling (Sentence Compression)
A logistic regression predicts each binary judgment $y_i \in \{0, 1\}$ as a Bernoulli variable:

$$y_i \sim \mathrm{Bernoulli}\big(\sigma(\mathbf{w}^{\top} \mathbf{x}_i)\big)$$

with $\sigma$ the logistic sigmoid and features $\mathbf{x}_i$ including language-model scores (Norm LP), the dependency label of the pruned node, edit properties, interaction terms, and worker ID. For multi-step prunes, the score is the sum of the individual prunes’ log-probabilities:

$$s(c) = \sum_{d \in \mathrm{prunes}(c)} \log p(d)$$
Baselines include a fixed penalty per deletion, minimum single-edit acceptability, LM scores alone, and a neural classifier. On the intrinsic single-prune task, the full model achieves 74.2% accuracy and ROC AUC 0.807, outperforming these alternatives (e.g., LM-only AUC 0.583, CoLA AUC 0.590) (Handler et al., 2019).
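The acceptability model can be sketched as a plain logistic regression; the toy features below (a normalized LM log-probability and an “argument was pruned” flag) stand in for the paper’s full feature set:

```python
# Sketch: a tiny logistic-regression acceptability model over
# hand-rolled features (toy data and feature names; the paper's
# full feature set includes LM scores, pruned-node labels,
# edit properties, interactions, and worker IDs).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the Bernoulli likelihood."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi                      # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

# features: [normalized LM log-prob, pruned-an-argument flag]
X = [[-0.2, 0], [-0.3, 0], [-1.5, 1], [-1.2, 1], [-0.4, 0], [-1.6, 1]]
y = [1, 1, 0, 0, 1, 0]   # 1 = compression judged acceptable
w, b = train(X, y)
probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
print([round(p, 2) for p in probs])
```

The same fitted probabilities can then re-rank alternative compressions of a sentence by predicted acceptability, which is how the paper uses the model downstream.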
5. Released Data, Formats, and Licensing
Code-Switching HAC
Released at http://arzen.camel-lab.com/, the corpus includes:
- Audio utterances (WAV)
- Raw ASR hypotheses for all three systems
- Four annotators’ minimal-edit transcripts (UTF-8)
- Original ArzEn gold references
- Annotation guidelines PDF
The resource is “publicly available,” but licensing terms are unspecified in the text; users should consult the site for up-to-date details (Hamed et al., 2022).
Sentence Compression HAC
Distributed at http://slanglab.cs.umass.edu/compression, the dataset contains:
- Original sentences
- Proposed compressions
- UD dependency deletion label/node
- Worker-level metadata (anonymized, native-language flag)
- Binary judgments
Data are in CSV format (one per split), with scripts for JSON/SQLite conversion; supplementary materials include screening questions, instructions, and the trained model (Handler et al., 2019).
6. Key Findings and Research Implications
For code-switching ASR, the highest metric–human correspondence is obtained by transliterating all text into Arabic script, performing Alif/Yā normalization, and applying off-the-shelf CER; this pipeline yields the strongest sentence-level correlation with annotators’ minimal post-edits. Aligning cross-script errors and collapsing dialectal variants makes edit distance directly reflective of annotator behavior (Hamed et al., 2022). In sentence compression, a probabilistic acceptability model incorporating language-model and syntactic features robustly predicts crowd judgments, enabling flexible re-ranking of alternative compressions by acceptability under constraints such as brevity and content preservation. Error analyses confirm that both datasets’ judgments align with established linguistic constraints: omission of required arguments or determiners is almost universally rejected, while deletion of optional adjuncts is frequently endorsed (Handler et al., 2019).
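The normalization step before CER can be sketched as follows (a minimal Alif/Yā mapping; transliteration into Arabic script is assumed to have happened upstream):

```python
# Sketch: Alif/Ya normalization before character error rate,
# mirroring the best-correlating pipeline (minimal character map;
# the paper's full normalization may cover more variants).
def normalize(text):
    """Collapse Alif variants and map Alif Maqsura to Ya."""
    table = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي"})
    return text.translate(table)

def cer(hyp, ref):
    """Character error rate via Levenshtein distance."""
    prev = list(range(len(ref) + 1))
    for i, x in enumerate(hyp, 1):
        cur = [i]
        for j, y in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1] / len(ref)

hyp, ref = "إنتى", "انتي"
print(cer(hyp, ref), cer(normalize(hyp), normalize(ref)))
```

On this toy pair, two spurious character errors (Alif and Yā variants) vanish after normalization, which is precisely the effect that makes the normalized CER track annotator edits more closely.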
The HAC framework demonstrates that large-scale human acceptability corpora—coupled with targeted annotation protocols and analytic models—enable empirical grounding for evaluation metrics and provide reusable acceptability signals for downstream system design. Future directions highlighted in both corpora include the layering of semantic and entailment checks for meaning preservation, extending annotation frameworks to more complex tasks (e.g., paraphrase, nested extraction), and the integration of acceptability scores as differentiable components in learning objectives for generative modeling (Hamed et al., 2022, Handler et al., 2019).