Causal impact of alignment techniques on social bias performance

Determine whether instruction-following alignment with human feedback (as in Ouyang et al., 2022) causally reduces social biases in large language models, and whether it thereby explains their stronger performance on the paper's bias benchmark, which evaluates predictions for the sensitive relations P21 (gender), P30 (continent), P91 (sexual orientation), and P140 (religion) on low-popularity entities from the T-REx dataset.

Background

The paper introduces a bias benchmark constructed from T-REx triples for the sensitive relations P21 (gender), P30 (continent), P91 (sexual orientation), and P140 (religion), restricting it to low-popularity subjects so that hallucination can be assessed alongside bias (a construction sketched below). The authors then evaluate multiple LLMs and observe that larger models, particularly GPT-4, outperform smaller ones on this benchmark.
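A minimal sketch of that filtering step follows. The triple format (subject, relation ID, object) and the popularity proxy (a subject's triple count in the full dump) are assumptions for illustration; the paper does not specify its data layout or popularity measure here.

```python
from collections import Counter

# Sensitive T-REx relations named in the paper.
SENSITIVE_RELATIONS = {
    "P21": "gender",
    "P30": "continent",
    "P91": "sexual orientation",
    "P140": "religion",
}

def build_bias_benchmark(trex_triples, popularity_percentile=10):
    """Filter (subject, relation_id, object) triples down to the benchmark.

    The triple format and the popularity proxy (a subject's triple count in
    the full dump) are assumptions, not the paper's stated method.
    """
    trex_triples = list(trex_triples)  # allow two passes over the data

    # Keep only triples for the four sensitive relations.
    sensitive = [t for t in trex_triples if t[1] in SENSITIVE_RELATIONS]

    # Proxy for popularity: how often each subject occurs in the whole dump.
    counts = Counter(subject for subject, _, _ in trex_triples)
    values = sorted(counts.values())
    cutoff = max(0, len(values) * popularity_percentile // 100 - 1)
    threshold = values[cutoff] if values else 0

    # Retain only low-popularity ("unpopular") subjects.
    return [t for t in sensitive if counts[t[0]] <= threshold]
```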

To explain this observation, the authors posit that strategic alignment techniques, specifically instruction-following training with human feedback, may reduce the social biases manifested in model outputs, which could account for the improved benchmark performance. They present this explanation as a conjecture rather than an established finding, so the causal link remains unresolved; one controlled design for probing it is sketched after this paragraph.
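One way to probe the conjectured causal link is to hold model family and scale fixed and vary only the alignment stage, comparing a base checkpoint against its instruction-tuned sibling on the same benchmark. The sketch below assumes a hypothetical query_model(checkpoint, prompt) inference wrapper and placeholder checkpoint names; neither comes from the paper.

```python
RELATION_LABELS = {"P21": "gender", "P30": "continent",
                   "P91": "sexual orientation", "P140": "religion"}

def compare_alignment_effect(benchmark, query_model):
    """Score a base checkpoint and its aligned sibling on identical prompts.

    `benchmark` is a list of (subject, relation_id, gold_object) triples;
    `query_model(checkpoint, prompt) -> str` is a hypothetical wrapper
    around whatever inference API is available.
    """
    scores = {}
    for checkpoint in ("family-base", "family-instruct"):  # placeholder names
        hits = 0
        for subject, relation, gold_object in benchmark:
            prompt = f"What is the {RELATION_LABELS[relation]} of {subject}?"
            prediction = query_model(checkpoint, prompt)
            # Loose string match as a stand-in for the paper's metric.
            hits += int(gold_object.lower() in prediction.lower())
        scores[checkpoint] = hits / max(1, len(benchmark))
    return scores
```

Because pretraining data and parameter count are identical within a checkpoint pair, any score gap is attributable to the alignment stage, unlike the paper's cross-model comparison, where size and alignment are confounded.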

References

"We conjecture that this is due to the adoption of strategic alignment techniques aimed to alleviate social biases inherent in content produced by large LMs."

Rethinking Language Models as Symbolic Knowledge Graphs (arXiv:2308.13676, Mruthyunjaya et al., 2023), Section 4.2 (Main Results), "Effect of size of language models".