AI Safety is Sometimes a Model Property
Sometimes, we can improve AI safety by intervening directly on the model.
Arvind Narayanan and Sayash Kapoor at the AI Snake Oil newsletter have a recent post called “AI safety is not a model property.” I think they make a number of important points in that post, but also some incorrect ones.
Boring bottom line up front: AI safety is sometimes a model property, in the following sense: it sometimes makes sense to perform safety interventions at the model level. Other safety interventions should be at the system or deployment context level. It takes skill and careful consideration to determine where in the AI deployment cycle the best interventions lie, and blanket statements excluding one or the other are not helpful.
What do the authors claim?
In the post, Narayanan & Kapoor make two different claims:
Claim 1: “AI safety is not a model property”: it is meaningless1 to speak of “safety” at the model level, or to call any given model more or less safe than any other model.2
Claim 2: AI safety is not solely a model-level property. To determine whether (or the extent to which) a model is safe, we should also look at the system and deployment context in which the model will be embedded.3
Where I agree with the authors
Claim 2 is clearly correct. Systems and context matter a lot, especially for many types of misuse risks. Identical model behavior can often be beneficial, harmful, or benign, depending solely on variation in the system or deployment context. We ignore this at our peril, and good AI policy should consider these intervention points.
Narayanan & Kapoor make four recommendations in their post. I generally agree with their Recommendation 2, and partially agree with Recommendations 3 and 4. (On Recommendation 2, I highly recommend this paper on which Kapoor and Narayanan were authors. It’s one of the best AI policy papers of the past year, in my opinion.)
I’m not sure whether I agree that it’s more important to talk about systems- and context-level safety than model-level safety, but I think this is an important and valid viewpoint and I was glad to see it discussed.
Where I disagree with the authors
Claim 1 is incorrect.
Narayanan & Kapoor give good examples of when model safety depends on the system and deployment context. But this is not good support for their headline claim. Their headline claim is not that system and deployment context often matter for safety. However, to prove Claim 1 requires them to prove that model behavior is never predictive of safety outcomes.
Common sense tells us that there are cases where model behavior is at least partially predictive of safety, without knowing much about deployment context. Some vision models have, for example, misclassified Black people as nonhuman. The context in which such a model was deployed may determine how unsafe this failure is, but it is extremely difficult to imagine any realistic deployment in which such a model would be safe. At the very least, we can definitely say that, on the single safety dimension of whether the model consistently recognizes humans as such regardless of race, a model that does not make this mistake is pro tanto safer than one that does.
You can also tell that Claim 1 is false because Narayanan & Kapoor acknowledge that there are exceptions to it:
With a few exceptions, AI safety questions cannot be asked and answered at the levels of models alone.
. . .
Even within the category of misuse, there are a few exceptions to the rule that safety is not a model property.
Emphases added. They give the examples of outputting CSAM or copyrighted material4 as exceptions.
If there are cases where X has property Y, then “Y is not a property of X” is, without more, misleading at best.
None of this is to deny that context matters, even in these example cases. But the importance of context is perfectly compatible with a belief that looking at the model alone can give us some evidence (though not complete evidence) of whether we expect the model to be harmful if deployed. Denying this is like saying that the statement “cars without seatbelts are unsafe” is “meaningless” because exactly how unsafe such a car is depends on context (e.g., how fast it’s going, how many people are in the car, how old the passengers are), or because there is some conceivable context in which a car without a seatbelt is safer (e.g., if the car is sinking in water).
A better framework: ask when model-level interventions are justified
Above, I have tried to show how the importance of context is compatible with the intelligibility of statements about model-level safety.
But a reader may still be left scratching their head. Didn’t I acknowledge that context will almost always matter? If so, where do I disagree with Narayanan & Kapoor?
My core claim is that it sometimes makes sense to talk about “model safety” when interventions at the model level are justified. Here are two examples of how it can make sense to intervene at the model level:
First, some model-level interventions will advance AI safety goals cost-effectively,5 across many possible deployment contexts. Examples include performing RLHF, applying Constitutional AI, deduplicating training data, removing PII from training data,6 specifying better reward functions, and fine-tuning. In many cases, these interventions can change model behavior across many downstream deployment settings, bringing typical behavior closer to desired behavior and thereby improving overall safety.
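To make that first category concrete, here is a minimal, hypothetical sketch of two of the data-level interventions mentioned above (deduplication and PII removal) applied to a training corpus before fine-tuning. The regex patterns and function names are my own toy illustration, not anything from the post or the GPT-4 System Card; a real pipeline would use far more robust detection.

```python
import re

# Toy illustration of a model-level intervention: scrub obvious PII from
# training text and drop exact duplicates before fine-tuning, so the resulting
# model is less likely to memorize and regurgitate that data in *any*
# downstream deployment context.

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?(?:\(?\d{3}\)?[\s.-]?)\d{3}[\s.-]?\d{4}\b")

def scrub_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def prepare_training_corpus(docs: list[str]) -> list[str]:
    # Both steps happen at the model level, i.e., before the model ever
    # encounters a particular system or deployment context.
    unique_docs = list(dict.fromkeys(docs))  # order-preserving exact dedup
    return [scrub_pii(d) for d in unique_docs]

if __name__ == "__main__":
    corpus = [
        "Contact Jane at jane.doe@example.com or 555-123-4567.",
        "Contact Jane at jane.doe@example.com or 555-123-4567.",  # duplicate
        "General prose with no PII.",
    ]
    print(prepare_training_corpus(corpus))
```

The point of the sketch is simply that the intervention is made once, to the model’s training inputs, and its benefits carry over to whatever system or context the model later lands in.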
Second, and relatedly, a policymaker could say that, for some model M, there is no way to deploy M (considering all mitigation measures at all levels) in which the net benefits from M outweigh the harms in expectation. In this sense, M would be “unsafe for any deployment.” Most would agree that a hypothetical model that solely outputted CSAM would be such an inherently unsafe model.
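One way to make this “unsafe for any deployment” notion precise, in my own notation rather than anything from the post:

$$\forall\, d \in \mathcal{D}(M): \quad \mathbb{E}[\mathrm{benefit}(M, d)] < \mathbb{E}[\mathrm{harm}(M, d)],$$

where $\mathcal{D}(M)$ ranges over every feasible way of deploying $M$, including every combination of system- and context-level mitigations. The hypothetical CSAM-only model satisfies this condition: no combination of mitigations short of not deploying it yields positive expected net benefit.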
We should create and use concepts that help us make better decisions. “Model safety” as a concept has value not because it can be assessed in a vacuum without reference to the broader context,7 but because it usefully directs our attention to one possible locus of intervention: the model. The same is true of “systems-level safety” and “context-level safety” as concepts.
Model-level interventions will indeed require knowing a lot about the AI system and deployment context, such as the various threat models one needs to worry about and the likely ways in which the model will be deployed.8 But note that, in these cases, it is very useful to focus one’s attention on the model, because there are things we can do to the model that will advance societal goals.
Responsible communication about AI safety
As a matter of responsible communication, I think Narayanan & Kapoor’s headline is, unfortunately, over-broad and misleading. It gives the impression that model-level interventions are never appropriate, because safety at the model level is supposedly meaningless. The body of the post adds more nuance, but accurate headlines matter a lot.
My headline tries to be more nuanced while using the same number of words as Narayanan & Kapoor’s headline. Alternative headlines that I think could have been more responsible but no more verbose include: “Context matters for AI safety”, “AI safety often depends on context”, and “When assessing AI safety, don’t ignore context.” “AI Safety: More about Context Than Models” is also a perfectly fine headline to go with the post, which would accomplish the authors’ goal of downplaying the comparative importance of model-level interventions.
Model-level interventions are and will continue to be an important tool in policymakers’ toolkit. So will system- and context-level interventions. We should acknowledge the importance of all three (and possibly others!), and not give the impression that any are, as a rule, wholly valueless.
“We have to specify a particular context before we can even meaningfully ask an AI safety question.” (emphasis added).
I think this is the most natural interpretation of “AI safety is not a model property.” See also: “Why has the myth of safety as a model property persisted?” (emphasis added).
I would actually argue that outputting copyrighted material could be a great example of why context matters, since the fair use defense to copyright infringement exists, and also (theoretically) an author could have given the model developer a license to have the model reproduce the author’s copyrighted text.
“Cost” here is meant expansively to refer to anything that a policymaker would prefer to avoid, including not only the monetary cost required to perform the intervention, but also any cost in degraded performance on desired functionalities or costs to other values (e.g., decentralization of power, privacy, fairness).
See § 2.7 of the GPT-4 System Card. (Disclosure: I worked at OpenAI and worked on policy for GPT-4.)
For the same reason, “system-level safety” and “deployment context-level safety” are important concepts even though they require reference to the model itself. Neither model-level safety nor context-level safety makes sense in a vacuum, but both are useful.
Conversely, system- and context-level interventions require reference to the model itself.