What are human values, and how do we align AI to them?
Abstract: There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to LLMs in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts and ask: what are "good" ways to synthesize diverse human inputs about values into a target for aligning LLMs? To answer this question, we first define a set of six criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses an LLM to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004), and others. We trial MGE with a representative sample of 500 Americans, on three intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment across all six criteria. For example, almost all participants (89.1%) felt well represented by the process, and 89% thought the final moral graph was fair, even if their value wasn't voted the wisest. Our process often results in "expert" values (e.g. values from women who have sought abortion advice) rising to the top of the moral graph, without defining in advance who counts as an expert.
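The moral graph described above can be pictured as a directed graph over articulated values, where an edge records that participants judged one value wiser than another in a given context, and the "wisest" values are those most endorsed. As an illustrative sketch only (the value names, edge data, and the PageRank-style scoring are assumptions for exposition, not the paper's actual reconciliation method):

```python
from collections import defaultdict

def moral_graph_rank(edges, damping=0.85, iters=50):
    """Rank values by a PageRank-style score over 'wiser-than' edges.

    Each edge (a, b) means participants judged value b wiser than value a,
    so endorsement mass flows from a to b.
    """
    nodes = set()
    for a, b in edges:
        nodes.update((a, b))
    out = defaultdict(list)
    for a, b in edges:
        out[a].append(b)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for a in nodes:
            targets = out[a]
            if targets:
                share = damping * rank[a] / len(targets)
                for b in targets:
                    new[b] += share
            else:  # a value no one judged anything wiser than: spread its mass uniformly
                for v in nodes:
                    new[v] += damping * rank[a] / n
        rank = new
    return sorted(rank, key=rank.get, reverse=True)

# Hypothetical "wiser-than" judgments for one context:
edges = [
    ("follow the rules", "weigh each case carefully"),
    ("weigh each case carefully", "listen to those affected"),
    ("follow the rules", "listen to those affected"),
]
print(moral_graph_rank(edges)[0])  # -> "listen to those affected"
```

Here the most-endorsed value surfaces at the top of the ranking without anyone designating it expert in advance, mirroring the dynamic the abstract reports.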
References
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Kenneth J. Arrow. Social Choice and Individual Values. Yale University Press, 2012. ISBN 9780300179316. URL http://www.jstor.org/stable/j.ctt1nqb90.
- Yuntao Bai, Andy Jones, Kamal Ndousse, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
- Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b.
- James F. Bohman. Deliberative democracy and the epistemic benefits of diversity. Episteme, 3:175 – 191, 2006. URL https://api.semanticscholar.org/CorpusID:146761554.
- Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. OpenAI, 2023. Available online at https://cdn.openai.com/papers/weak-to-strong-generalization.pdf.
- Ruth Chang. 'All things considered'. Philosophical Perspectives, 18(1):1–22, 2004a. doi: 10.1111/j.1520-8583.2004.00018.x.
- Ruth Chang. Putting together morality and well-being. In Peter Baumann and Monika Betzler, editors, Practical Conflicts: New Philosophical Essays, pages 118–158. Cambridge University Press, 2004b.
- Case law grounding: Aligning judgments of humans and AI on socially-constructed concepts, 2023.
- Fiery Cushman. Action, outcome, and value: A dual-system framework for morality. Personality and Social Psychology Review, 17(3):273–292, 2013. doi: 10.1177/1088868313495594. URL https://doi.org/10.1177/1088868313495594. PMID: 23861355.
- Terry Eagleton. Ideology: An introduction. Studies in East European Thought, 45(3):229–230, 1991.
- Iason Gabriel. Artificial intelligence, values, and alignment. Minds and Machines, 30(3):411–437, September 2020. ISSN 1572-8641. doi: 10.1007/s11023-020-09539-2. URL http://dx.doi.org/10.1007/s11023-020-09539-2.
- Anthropic. Collective constitutional AI: Aligning a language model with public input, Oct 2023. URL https://www.anthropic.com/news/collective-constitutional-ai-aligning-a-language-model-with-public-input. Accessed: 22 Jan 2024.
- J. J. Gibson. The senses considered as perceptual systems. Houghton Mifflin, Boston, 1966.
- Amelia Glaese, Nat McAleese, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
- Jürgen Habermas. Between Facts and Norms: Contributions to a Discourse Theory of Law and Democracy. Polity, 1996.
- Dan Hendrycks, Collin Burns, Steven Basart, et al. Aligning AI with shared human values, 2023.
- J. V. Howard. A social choice rule and its implementation in perfect equilibrium. Journal of Economic Theory, 56(1):142–159, 1992. ISSN 0022-0531. doi: 10.1016/0022-0531(92)90073-Q. URL https://www.sciencedirect.com/science/article/pii/002205319290073Q.
- Jiaming Ji, Tianyi Qiu, Boyuan Chen, et al. AI alignment: A comprehensive survey, 2024.
- Sarah Joseph. Jürgen Habermas: From ideology to communicative rationality, pages 113–138. Foundation Books, 2004.
- Decision making in a sequential search task. Perception and Psychophysics, 2:374–376, 08 1967. doi: 10.3758/BF03210074.
- Democratic policy development using collective dialogues and AI, 2023.
- Isaac Levi. Hard Choices: Decision Making Under Unresolved Conflict. Cambridge University Press, 1990.
- John J Macionis. Sociology. Pearson, Upper Saddle River, NJ, 13 edition, October 2009.
- Generating options and choosing between them depend on distinct forms of value representation. Psychological Science, 32(11):1731–1746, 2021. doi: 10.1177/09567976211005702. URL https://doi.org/10.1177/09567976211005702. PMID: 34570638.
- Long Ouyang, Jeff Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback, 2022.
- Aviv Ovadya. Reimagining democracy for ai. Journal of Democracy, 34:162–170, 10 2023. doi: 10.1353/jod.2023.a907697.
- Bridging systems: Open problems for countering destructive divisiveness across ranking, recommenders, and governance. Technical report, Knight First Amendment Institute, 10 2023. URL https://knightcolumbia.org/content/bridging-systems.
- Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, Stanford University, 1999.
- Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model, 2023.
- Inioluwa Deborah Raji, Andrew Smart, et al. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 33–44, 2020.
- Fritz W. Scharpf. Interdependence and democratic legitimation. MPIfG Working Paper 98/2, Max Planck Institute for the Study of Societies, 1998. URL https://ideas.repec.org/p/zbw/mpifgw/p0020.html.
- Vivien A. Schmidt. Conceptualizing legitimacy: Input, output, and throughput. In Europe's Crisis of Legitimacy: Governing by Rules and Ruling by Numbers in the Eurozone. Oxford University Press, 05 2020. ISBN 9780198797050. doi: 10.1093/oso/9780198797050.003.0002. URL https://doi.org/10.1093/oso/9780198797050.003.0002.
- Eric W. Schoon. Operationalizing legitimacy. American Sociological Review, 87(3):478–503, 2022. doi: 10.1177/00031224221081379. URL https://doi.org/10.1177/00031224221081379.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Amartya Sen. Collective Choice and Social Welfare. Holden-Day, San Francisco, 1970a. Edinburgh: Oliver and Boyd, 1971; Amsterdam: North-Holland, 1979. Swedish translation: Bokförlaget Thales, 1988.
- Amartya Sen. Collective Choice and Social Welfare. Holden-Day, San Francisco, 1970b. Edinburgh: Oliver and Boyd, 1971; Amsterdam: North-Holland, 1979. Swedish translation: Bokförlaget Thales, 1988.
- Herbert A. Simon. Rational choice and the structure of the environment. Psychological review, 63 2:129–38, 1956. URL https://api.semanticscholar.org/CorpusID:8503301.
- The origins of options. Frontiers in Neuroscience, 6, 2012. doi: 10.3389/fnins.2012.00050. URL https://doi.org/10.3389/fnins.2012.00050.
- Irene Solaiman and Christy Dennison. Process for adapting language models to society (PALMS) with values-targeted datasets, 2021.
- Charles Taylor. What is human agency? In Theodore Mischel, editor, The Self: Psychological and Philosophical Issues, page 103. Rowman & Littlefield, 1977.
- Charles Taylor. Sources of the Self: The Making of the Modern Identity. Harvard University Press, Cambridge, Mass., 1989.
- Charles Taylor. Philosophical Arguments. Harvard University Press, Cambridge, Mass., 1995.
- David Velleman. Practical Reflection. Princeton University Press, 1989.
- Max Weber. The Theory of Social and Economic Organization. A Free Press paperback. Free Press, 1947. ISBN 9780684836409. URL https://books.google.de/books?id=Zq8UAQAAMAAJ.
- Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models, 2023.
- Eliezer Yudkowsky. Coherent Extrapolated Volition. The Singularity Institute, 2001.