New Study Examines Sycophancy in Language Models Train

RLHF: Reinforcement Learning from Human Feedback ChatGPT’s success ingredient: The Instruction Data by @aerinykim https://t.co/fJVZMgQUHe

fly51fly@fly51fly

8 mo

[LG] Contrastive Prefence Learning: Learning from Human Feedback without RL J Hejna, R Rafailov, H Sikchi, C Finn, S Niekum, W. B Knox, D Sadigh [Stanford University & UT Austin] (2023) https://t.co/Zr3nT0XNMy - Current RLHF (reinforcement learning from human feedback)… https://t.co/tlhGE13vbO https://t.co/AKYpxrK8gU

Anthropic@AnthropicAI

8 mo

AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior. https://t.co/v71rHeDDZK

Ethan Mollick@emollick

8 mo

We like it when people tell us what we want to hear, so it may not be surprising that the human reinforcement stage of AI training can turn them into sycophants. To build models that tell us the truth, rather than just agreeing with us, might require other approaches to RLHF https://t.co/CxJHwFzl2Z

Statistics Papers@StatsPapers

9 mo

Towards Understanding Sycophancy in Language Models. https://t.co/R1zUUpp4AH

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

9 mo

Towards Understanding Sycophancy in Language Models abs: https://t.co/pjPl55Mlq3 This paper from Anthropic comprehensively studies different forms of sycophancy, the presence of syncophancy in preference datasets, and whether sycophancy is incentivized by preference models.… https://t.co/CSNVbmNIsC https://t.co/srVgm6O7cm

AK@_akhaliq

9 mo

Contrastive Prefence Learning: Learning from Human Feedback without RL paper page: https://t.co/BvYvSvp2He Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases:… https://t.co/9xSrKjFVtY https://t.co/w8MGCvxZuV

AK@_akhaliq

9 mo

Towards Understanding Sycophancy in Language Models paper page: https://t.co/Xf7fN0j3BF Reinforcement learning from human feedback (RLHF) is a popular technique for training high-quality AI assistants. However, RLHF may also encourage model responses that match user beliefs… https://t.co/YRiDZXQmRr https://t.co/YSgheqz45I

Aran Komatsuzaki@arankomatsuzaki

9 mo

Contrastive Prefence Learning: Learning from Human Feedback without RL repo: https://t.co/3C3A3QGH20 abs: https://t.co/CdYaKapiAJ https://t.co/3LOPMjBTB6

Aran Komatsuzaki@arankomatsuzaki

9 mo

Towards Understanding Sycophancy in Language Models Investigates the prevalence of sycophancy in RLHF-trained models and whether human preference judgments are responsible https://t.co/Ri49tm3uMZ https://t.co/tPcrYRmhJc

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

9 mo

Contrastive Prefence Learning: Learning from Human Feedback without RL abs: https://t.co/rFoQaBQ2hi Off-policy supervised objective for learning from human preferences, based on a regret model of preferences, can be applied to arbitrary MDPs, DPO is a special case. https://t.co/u2mx15GSO4

Stat.ML Papers@StatMLPapers

9 mo

Towards Understanding Sycophancy in Language Models. (arXiv:2310.13548v1 [https://t.co/x5f9xnJFAw]) https://t.co/c4cWgCfoIV

Similar Stories

Similar Stories

New Study Examines Sycophancy in Language Models Trained with RLHF

Sources

New Study Examines Sycophancy in Language Models Trained with RLHF