Soumitra Dutta, former Oxford dean, flags the AI guardrail debate: RLHF vs. Constitutional AI


About

For years, the default for training how AI behaves has been Reinforcement Learning from Human Feedback, or RLHF in industry shorthand: thousands of people judge a model's outputs, clicking thumbs up or thumbs down to slowly nudge the model in the right direction.
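To make those mechanics concrete, here is a minimal, illustrative sketch of the core RLHF step: fitting a reward model to thumbs-up/thumbs-down comparisons with a Bradley-Terry preference loss. The model, embeddings, and dimensions are toy assumptions for illustration, not any lab's actual pipeline.

```python
import torch
import torch.nn as nn

# Toy reward model: maps a response embedding to a scalar preference score.
# In real RLHF this head sits on top of a large language model.
class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry loss: push the "thumbs up" response to score
    # higher than the "thumbs down" one.
    return -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Stand-in data: random embeddings for (chosen, rejected) response pairs.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for step in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
# The trained reward model then steers the policy model (e.g. via PPO),
# distilling every human click into a reusable training signal.
```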

It has shortcomings that can no longer be ignored.

"Humans are inconsistent‚ biased‚ and quite frankly don't scale‚" says Soumitra Dutta, former dean of Oxford Said Business School and AI scholar. 


If guardrails on AI behavior are enforced by human reviewers, what happens when models are growing faster, getting smarter, and being deployed more broadly than any team of reviewers could ever keep up with?

Anthropic, the company behind Claude, thinks it has a better answer: rather than train its model to satisfy human raters, it adopted what it calls Constitutional AI. The idea is to give the model a set of explicit principles, a kind of written constitution, and train it to check its own outputs against them. The constitution is public.

"We actually know why the AI says no to a prompt‚" says Dutta, who has a PhD from the University of California, Berkeley․ "It's not a black box․ It's a set of public values․" This kind of transparency is not usual for AI teams‚ which often bury the decision of what they will or will not allow inside the training process which even the makers of the AI have a challenging time explaining․ 

There is also an autonomy dimension. In Constitutional AI, the model is not just learning which of its responses human raters selected. It is also learning to evaluate its own behavior against a set of rules: to reason, in effect, about whether the response it is about to produce aligns with a given value. Critics will say these are values baked in by a private company. Supporters will say values are always baked in; the question is whether they are made explicit.
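A rough sketch of that self-evaluation loop, under stated assumptions: `generate` below is a placeholder standing in for any language-model call, and the principles and prompts are invented for illustration, not Anthropic's published constitution or actual pipeline.

```python
# Sketch of a Constitutional AI-style critique-and-revise loop.
# The principles below are invented examples for illustration only.

PRINCIPLES = [
    "Avoid helping with clearly illegal or harmful requests.",
    "Be honest about uncertainty rather than fabricating answers.",
]

def generate(prompt: str) -> str:
    # Placeholder: swap in a real language-model call here.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        # Ask the model to critique its own draft against one principle...
        critique = generate(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Does the response violate the principle? Explain briefly."
        )
        # ...then to rewrite the draft in light of that critique.
        draft = generate(
            f"Original response: {draft}\nCritique: {critique}\n"
            "Rewrite the response so it complies with the principle."
        )
    return draft
# The revised outputs become training data, so the finished model
# internalizes the constitution instead of needing a human rater per answer.
```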

Perhaps the most consequential argument is about safety. In theory, a system with a base-level understanding of safe behavior will scale better, because it will not need to be directed case by case. "We need them to have a foundational logic for safety that doesn't rely on us holding their hand," says Dutta, co-creator of the Network Readiness Index and the Global Innovation Index.