Market Brief
Daily market recaps with key events, stock movements, and global influences
OpenAI researchers say they have isolated internal “persona” features that drive unsafe behaviour in large language models, a development they argue could make it easier to diagnose and correct so-called emergent misalignment. In a paper released 18 June, the team found distinct activation patterns that light up when a model produces toxic, sarcastic or otherwise harmful responses. By turning those features up or down—or by fine-tuning the model on roughly 100 examples of secure, accurate code—the researchers were able to steer the system back to acceptable behaviour. Dan Mossing, who heads OpenAI’s interpretability effort, said the discovery reduces a complex safety problem to “a simple mathematical operation,” while colleague Tejal Patwardhan called it a practical method for internal training. The work builds on earlier studies showing that training a model on incorrect or insecure data in one domain can trigger broader misalignment. OpenAI, Anthropic and Google DeepMind are investing heavily in interpretability research, hoping clearer insight into models’ inner workings will allow companies and regulators to set more reliable safety guardrails as generative AI is deployed at scale.
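The headline idea is that steering reduces to adding or subtracting a feature direction inside the network. As a rough, hypothetical sketch of what "turning a feature up or down" can look like in code, the PyTorch example below shifts a toy layer's activations along a placeholder persona_direction vector; the toy model, the vector, the layer choice and the scale alpha are all illustrative assumptions, not details taken from the OpenAI paper.

```python
# A minimal, illustrative sketch of activation steering in PyTorch.
# Nothing here comes from the OpenAI paper: the toy model, the
# `persona_direction` vector, the layer choice, and the scale `alpha`
# are all placeholder assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 64

# Toy stand-in for a slice of a language model's residual stream.
model = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim),
)

# Hypothetical unit vector for a "misaligned persona" feature, e.g. one
# recovered by a sparse autoencoder or a difference of mean activations.
persona_direction = torch.randn(hidden_dim)
persona_direction = persona_direction / persona_direction.norm()

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that shifts a layer's output along `direction`.

    alpha > 0 turns the feature up; alpha < 0 turns it down.
    """
    def hook(module, inputs, output):
        return output + alpha * direction  # replaces the layer's output
    return hook

# Suppress the feature at the first layer's output.
handle = model[0].register_forward_hook(
    make_steering_hook(persona_direction, alpha=-4.0)
)

x = torch.randn(1, hidden_dim)   # stand-in for one token's hidden state
steered = model(x)               # forward pass with steering applied
handle.remove()                  # detach the hook to restore default behaviour
baseline = model(x)

print("activation shift:", (steered - baseline).norm().item())
```

In a real setting the same pattern would be applied with a hook on an actual transformer block, with the direction estimated from the model's own activations rather than drawn at random.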