Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Published 22 Apr 2024 in cs.CL | arXiv:2404.14313v2

Abstract: When prompting a LLM (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained LLM (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.


Summary

  • The paper introduces SAMI, a method that uses mutual information to teach language models to follow behavioral principles without human-generated preference labels.
  • It optimizes a conditional mutual information lower bound via InfoNCE, achieving superior performance in dialogue and summarization tasks.
  • SAMI offers a scalable, resource-efficient alternative to traditional RLHF, enhancing ethical deployments of language models with minimal human intervention.

Self-Supervised Alignment with Mutual Information (SAMI): Teaching Pretrained LLMs to Adhere to Behavioral Principles Without Human Labels

Introduction

This paper introduces Self-Supervised Alignment with Mutual Information (SAMI), a method that teaches a pretrained LLM (LM) to align with specified behavioral principles, or constitutions, through an iterative finetuning process. The procedure forgoes human-generated preference labels, demonstrations, and direct oversight, sidestepping the complexity and resource cost of conventional alignment methods.

Methodology: SAMI

SAMI enhances the alignment of LMs by increasing the conditional mutual information between constitutions—behavioral principles expressed in natural language—and model-generated responses. The method operates through an iterative loop involving three primary stages:

  1. Principle Generation: An LM, referred to as the "principle writer," generates behavioral principles, which are then sampled and combined into constitutions.
  2. Response Sampling: Each constitution is paired with a query to elicit a response from the target LM being finetuned.
  3. Optimization: A loss function maximizes a lower bound on the mutual information between responses and their corresponding constitutions, given the sampled queries.
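The three stages above can be condensed into a single training iteration. The sketch below is schematic, not the authors' implementation: `sample_response` and `finetune_step` are hypothetical stand-ins for the target LM's sampler and the optimizer step, and the two-principle constitutions are an illustrative simplification.

```python
import random

def sami_iteration(principles, queries, sample_response, finetune_step):
    """One schematic SAMI iteration.

    principles      -- list of natural-language behavioral principles
    queries         -- batch of queries drawn from the dataset
    sample_response -- stand-in for the target LM: (constitution, query) -> response
    finetune_step   -- stand-in for the optimizer: batch of triples -> loss/metric
    """
    batch = []
    for query in queries:
        # Stage 1: sample a constitution by combining generated principles.
        constitution = " ".join(random.sample(principles, k=2))
        # Stage 2: pair the constitution with a query and sample a response.
        response = sample_response(constitution, query)
        batch.append((constitution, query, response))
    # Stage 3: optimize the mutual-information objective on the batch.
    return finetune_step(batch)
```

In the full algorithm this loop repeats: the model finetuned in one iteration generates the responses sampled in the next.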

Key to the methodology is the optimization of a conditional mutual information lower bound, leveraging InfoNCE with an optimal critic to provide a stable estimator. This is achieved without the typical reliance on preference labels or demonstrations, marking a significant deviation from traditional approaches like Reinforcement Learning from Human Feedback (RLHF) or Supervised Fine-Tuning (SFT).
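Concretely, the InfoNCE bound can be computed from a matrix of conditional log-likelihoods: each constitution's own response is the positive, and the other responses in the batch serve as negatives. The numpy sketch below is a minimal illustration of this estimator, not the paper's code; the symmetric row/column averaging is one common formulation and is an assumption here.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sami_infonce_loss(logp):
    """InfoNCE-style loss on a batch of constitution/response pairs.

    logp[i, j] = log p_theta(y_j | c_i, x): the log-likelihood the model
    assigns to response y_j under constitution c_i (for shared queries x).
    Diagonal entries are the matched pairs; off-diagonal entries act as
    in-batch negatives. Minimizing this loss raises a lower bound on the
    conditional mutual information between constitutions and responses.
    """
    diag = np.diag(logp)
    # Row-wise: pick out the matching response for each constitution.
    row = -np.mean(diag - _logsumexp(logp, axis=1))
    # Column-wise: pick out the matching constitution for each response.
    col = -np.mean(diag - _logsumexp(logp, axis=0))
    return 0.5 * (row + col)
```

When the model ignores the constitution (all entries equal), the loss equals log N, the bound's ceiling for a batch of size N; as the diagonal dominates, the loss approaches zero.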

Experimental Setup and Results

Datasets and Models:

  • Datasets: Dialogues from HH-RLHF and summaries from TL;DR datasets were utilized.
  • Models: SAMI was applied to mistral-7b, mixtral-8x7b, and llama3-70b, with constitutions written by both a strong principle writer (claude-opus) and a weak one (mistral-7b-instruct).

Performance Analysis:

  • Against both the base pretrained model and an instruction-finetuned baseline, SAMI-trained LMs achieved higher win rates on dialogue and summarization.
  • On single-turn dialogue, SAMI-trained models won between 66% and 77% of comparisons against the base model and 55% to 57% against mistral-7b-instruct; a mixtral-8x7b model aligned with weak-model constitutions reached a 65% win rate on summarization.

These results exemplify the capability of SAMI to refine the base model's inherent behavior distribution towards desired principles under strong model guidance.

Theoretical Implications and Practical Applications

SAMI shows that pretrained models can be aligned to specified behavioral principles without direct human intervention, by exploiting statistical connections between constitutions and responses through iterative self-supervised learning. Practically, it offers a scalable, less resource-intensive alternative to traditional alignment pipelines, potentially broadening the deployment of LMs in real-world settings where adherence to ethical guidelines and user preferences is critical.

Future Directions

Further research could extend SAMI to more diverse sets of principles and queries and improve robustness against biases or alignment errors. Investigating regularization techniques to prevent model degradation (e.g., "gibberish" outputs) and correcting length bias in model responses could further improve the method's reliability.

Conclusion

SAMI opens a new pathway in LLM alignment by enabling LMs to follow behavioral principles autonomously. By removing the need for extensive human data labeling, the method could enable broader, more ethical applications of LMs and represents a meaningful step forward in AI alignment.
