Safety cases
Safety cases might be a better, more complete framework to guide the work done by AI Safety Institutes than a focus on specific techniques. Here is some of the best work I’ve seen on the topic.
What are safety cases?
As explained by the UK AISI, a safety case is a structured argument, supported by evidence, that a system is safe enough in a given operational context. A given risk area is broken down into specific risk models, which are broken down into proxy tasks, for which evaluations can be run. Besides evals, other types of evidence such as expert consultation and baseline experiments can support (or undermine) the safety case. https://www.aisi.gov.uk/work/safety-cases-at-aisi
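To make that decomposition concrete, here is a minimal sketch of how such an argument could be represented as a hierarchy of risk models, proxy tasks, and evidence. This is my own illustration, not the UK AISI's formalism, and all the names and example values are hypothetical.

```python
# Illustrative sketch only: the hierarchy described above
# (risk area -> risk models -> proxy tasks -> evidence such as evals,
# expert consultation, or baseline experiments).
from dataclasses import dataclass, field


@dataclass
class Evidence:
    kind: str             # e.g. "evaluation", "expert consultation", "baseline experiment"
    description: str
    supports_claim: bool  # evidence can also undermine the case


@dataclass
class ProxyTask:
    name: str
    evidence: list[Evidence] = field(default_factory=list)


@dataclass
class RiskModel:
    name: str
    proxy_tasks: list[ProxyTask] = field(default_factory=list)


@dataclass
class SafetyCase:
    risk_area: str  # e.g. "cyber misuse"
    claim: str      # "the system is safe enough in this operational context"
    risk_models: list[RiskModel] = field(default_factory=list)

    def unsupported_tasks(self) -> list[str]:
        """Proxy tasks with no supporting evidence yet, i.e. gaps in the argument."""
        return [
            task.name
            for rm in self.risk_models
            for task in rm.proxy_tasks
            if not any(e.supports_claim for e in task.evidence)
        ]


# Hypothetical example of filling in the structure:
case = SafetyCase(
    risk_area="cyber misuse",
    claim="The system is safe enough to deploy in this operational context.",
    risk_models=[
        RiskModel(
            name="uplift to low-skill attackers",
            proxy_tasks=[
                ProxyTask(
                    name="solve capture-the-flag challenges end to end",
                    evidence=[
                        Evidence("evaluation", "CTF eval suite, pass rate below threshold", True)
                    ],
                )
            ],
        )
    ],
)
print(case.unsupported_tasks())  # [] -- every proxy task has supporting evidence
```

The point of the sketch is simply that each leaf of the argument has to be backed by something, which is what makes gaps in the case visible.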
The UK AISI has been one of the leaders in this work. Two days ago they published a template for “inability” arguments, which claim that an AI system isn’t capable enough to pose an unacceptable level of risk. The template is applied to the cyber domain.
https://www.aisi.gov.uk/work/safety-case-template-for-inability-arguments
Safety cases as a more complete risk assessment framework
The appeal of safety cases is that they make elements of current frontier safety techniques more explicit by tying them to a structured argument. Evaluations, proxy tasks, and risk models in isolation rarely make a comprehensive case for a system’s safety; together, they provide a more complete framework. The work above and this explanation are presented in more depth in this paper by Arthur Goemans, Marie Buhl, and Jonas Schuett from GovAI, together with authors from the UK AISI. GovAI has been playing a central role in developing this line of work. https://arxiv.org/abs/2411.08088
Another foundational paper by GovAI folks (Marie Buhl, Gaurav Sett, Leonie Koessler, Jonas Schuett, and Markus Anderljung) explains the case for safety cases, their key components, and implementation challenges. Among technical challenges, they mention building consensus around methodologies, adapting to more capable future systems, and setting an appropriate bar for what constitutes an adequate safety case in a potential regulatory context. Institutional challenges include establishing internal review processes and incorporating third-party review.
Lastly, the paper that laid out the “inability” argument was published earlier this year by Joshua Clymer, Nick Gabrieli, David Krueger, and Thomas Larsen. It presents building-block arguments for safety cases, ranging from simpler AI systems (do they show inability to cause harm? do they behave well under control measures?) to increasingly powerful AI (are they trustworthy despite being capable? can we defer to credible AI advisors about their safety?). https://arxiv.org/abs/2403.10462
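To keep those four categories straight, here is a tiny summary sketch. The labels follow the paper’s categories, but the one-line descriptions are my own paraphrase, not the authors’ definitions.

```python
# Illustrative summary of the four building-block argument categories in
# Clymer et al. (2024), ordered roughly by how capable the AI system is.
from enum import Enum


class SafetyArgument(Enum):
    INABILITY = "the system is not capable enough to cause the harm in question"
    CONTROL = "control measures stop the system from causing harm even if it tried"
    TRUSTWORTHINESS = "the system is capable of causing harm but reliably refrains"
    DEFERENCE = "credible AI advisors vouch for the system's safety"
```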