Causal impact of alignment techniques on social bias performance
Determine whether instruction-following alignment with human feedback (as in Ouyang et al., 2022) causally reduces social biases in large language models, and whether this reduction explains their stronger performance on the paper's bias benchmark, which evaluates predictions for the sensitive relations P21 (gender), P30 (continent), P91 (sexual orientation), and P140 (religion) on unpopular entities in the T-REx dataset.
References
We conjecture that this is due to the adoption of strategic alignment techniques aimed to alleviate social biases inherent in content produced by large LMs.
— Rethinking Language Models as Symbolic Knowledge Graphs
(arXiv:2308.13676, Mruthyunjaya et al., 2023), Section 4.2 (Main Results), "Effect of size of language models"