Mirroring Humanity: The Inconvenient Truths of AI Alignment
Every LLM model is at first imperfect, requiring “alignment” and fine tuning in order to meet our standards for accuracy, quality, and safety. Creating an aligned model is itself an imperfect process, requiring product and engineering teams to reflect the complexity of subjective decisions and ethical choices into a model’s moral code. This article outlines the key questions found in the process of aligning an AI system, and how organizations are employing new operational frameworks and technical approaches to solve these unique challenges.
In the rapidly evolving field of AI, the concept of AI alignment has emerged as a cornerstone technique in developing ethical and safe LLM systems. Creating aligned AI requires product and engineering leaders to go beyond model development, and to address new issues that are not always familiar to technical stakeholders. The process of “aligning” a model requires us to ask new and deeply personal questions:
Which stakeholders should determine whether a model output is “appropriate”?
Who is deciding what is acceptable and what is not?
How do we know if a model becomes “misaligned” and needs to be retrained?
Can we protect a model from being compromised by a user’s behavior?
Should AI be trusted to police other AI models?
These questions are easy to navigate at a superficial level, but are each deep topics that create ambiguity. Are models able to understand cultural histories, ethical dilemmas, and longstanding differences of opinions that permeate our society? How do we know if it is doing the right or the wrong thing? Answering these questions requires technical and ethical leadership and to confront a new reality:
An aligned AI model is akin to looking into a mirror: what we see reflects not just that model's capabilities, but also a product leader’s ethical choices and the intention of the users themselves.
Question #1: Should We Trust Humans or Machines to Properly “Reflect” Our Users?
The state of the art for model alignment is Reinforcement Learning from Human Feedback (RLHF), a longstanding technique widely popularized in the modern era of LLMs. RLHF is a dynamic and interactive approach to AI alignment, where AI systems learn and adapt based on direct human feedback. RLHF implementations can be daunting to product and engineering teams, and the quality of the results relies heavily on subjective individual assessments of what is right and wrong. As one example, social media firms still struggle to optimize their AI systems to distinguish harmful content from benign posts, even when guided by user feedback and moderation policies. It is challenging for humans and AI to develop a nuanced understanding of context, cultural sensitivities, and the subtle differences in communication style. RLHF panels require diversity and must be unbiased, and continuous human oversight is required to ensure that the AI's learning trajectory remains aligned with ethical standards.
(Image Source: Spectrum Labs AI)
A promising solution to RLHF’s historic challenges is Reinforcement Learning via AI-Generated Feedback (RLAIF), an emerging and innovative alternative where AI systems generate their own feedback based on large sample simulations with established ethical guidelines. RLAIF can scale more efficiently than RLHF, as it reduces dependence on inconsistent human feedback that may be biased and incomplete. This approach often forces the question of the authenticity and depth of AI's understanding of ethics in choosing what to reflect back on society. Can it determine right and wrong better than a human? Can AI “look into the mirror” and reflect back what we want to see? The early research is very promising and suggests that we may soon be asking AI systems to align themselves.
Question #2: Does Fine-Tuning Break Your Model’s Natural Reflection?
Every organization dreams of deploying an out-of-the-box, fully aligned AI model. The reality is that no model is perfect and the process of fine-tuning a pre-trained model to perform a particular task may shift a model's parameters, potentially causing deviations from its initial alignment. Engineers who go down this path find themselves re-assessing and re-aligning the model post-fine-tuning, requiring a thorough analysis of the model's outputs for alignment with desired ethical standards and also implementing additional layers of checks and balances, such as ethical audits or validation against alignment benchmarks. Is fine-tuning always a path to a misaligned model? Current research suggests this is likely to be the case, and at a minimum all fine-tuned models should be consistently monitored and evaluated to ensure behavior does not drift. A model’s natural reflection will never be perfect once deployed, and must always be tested to ensure it meets our standards.
(Image source: Paper “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” )
Figure overview, from paper above: “Fine-tuning GPT-3.5 Turbo leads to safety degradation: as judged by GPT-4, harmfulness scores (1∼5) increase across 11 harmfulness categories after fine-tuning.”
Question #3: Can We Stop a Model from Reflecting Bad User Behavior?
Even when we have a properly aligned model, there is no natural defense to users doing what they can to break its intentions. The ethical alignment of AI is intrinsically linked to its security, and to control a model’s behavior we must ultimately protect it from exploitation and manipulation. The concept of 'jailbreaking' AI (with certain Reddit threads) is rising in popularity, and engineering a secure system starts with understanding how models can be co-opted or corrupted. Certain startups are working on solutions that are the “firewall” equivalent for generative AI - keeping maligned actors from compromising the system and stopping numerous forms of malicious intent. Red-teaming also plays a pivotal role, and generally involves the practice of leveraging internal and external experts to identify vulnerabilities and 'jailbreak' the AI system prior to their compromise by a bad actor. You can’t protect what you can’t see, and these approaches combine synthetic and human attacks to ensure that a model’s defense stands up to real world conditions when ultimately deployed.
(Image Source: Google AI Red Team)
Aligning AI = Confronting Our Image
Effective AI alignment is a challenge for product leaders, demanding a careful balance between ethical integrity, adaptability to human values, and robust security against potential misuses. As we unlock more powerful capabilities in our models, we must draw on expertise across disciplines to ensure that AI is a reflection of the world we want to see. It's a journey that requires not only technical expertise but also a deep understanding of human values and ethical principles, and we are encouraged that everyone in our AI Pioneers community will be tackling these challenges together in the years to come.
Further Reading
For those hoping to learn more about alignment, here are a couple of my favorite resources:
101 Explanations
Short but sweet: Cameron Wolfe’s twitter thread on RLAIF (more technical) - the longer & more in-depth version here on his blog
If you’re an audio visual learner, Nathan Lambert’s 15min History of Reinforcement Learning and Human Feedback is excellent
A Latent Space pod x Nathan Lambert special: RLHF 201 podcast/video on Youtube, and accompanying slides
Great repository of resources around RLHF - papers, explanations, etc
Sebastian Raschka’s in-depth overview of LLM training, RLHF, and its alternatives
Some thought-provoking pieces
A Case for AI Alignment Being Difficult - a reminder that alignment necessitates defining human values, and that is not so easy
A list of core AI safety problems and how to solve them - there’s “narrow” alignment being solved (is this system / specific model performing as I wish it to) and then “broader”/existential alignment problems (are we building towards safe AGI?). This piece focuses on the latter; part of alignment work must also think of the bigger questions/problems
Tends to go more technical, but the Alignment Forum is a great resource to keep up to date on various pieces & how researchers are thinking about alignment issues
Papers