Google DeepMind, in collaboration with Stanford University, has announced a new tool designed to enhance the fact-checking capabilities of large language models (LLMs). The tool, named the Search-Augmented Factuality Evaluator (SAFE), uses an LLM to dissect generated text into individual facts, which are then verified for accuracy against Google Search results. The initiative addresses the factual errors that LLMs often produce when responding to open-ended, fact-seeking prompts.

By introducing SAFE, the researchers propose that LLM agents can serve as automated evaluators of long-form factuality, and they show that such agents can achieve superhuman rating performance in fact-checking. The research also finds that larger models tend to be more factual and that LLM-based raters can be roughly 20 times cheaper than human annotators.

SAFE is part of a broader effort to benchmark long-form factuality in open domains: the work provides a new dataset (generated with LLMs), an evaluation method whose autorater is an LLM agent equipped with Google Search, and an aggregation metric that accounts for both precision and recall. It also includes an analysis of thirteen popular LLMs, including Gemini, GPT, Claude, and PaLM-2 models, aiming to create a realistic benchmark that simulates everyday queries about knowledge and truth.
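To make the two-step idea concrete, here is a minimal sketch of a SAFE-style pipeline: an LLM splits a long-form response into atomic claims, then issues a search query per claim and judges it against the results. This is not the authors' code; `call_llm`, `search_google`, the prompts, and the verdict labels are illustrative assumptions, and any LLM API or search client with these signatures could be plugged in.

```python
# Illustrative sketch of a SAFE-style pipeline (not the paper's implementation).
# `call_llm` and `search_google` are hypothetical stand-ins for an LLM API and a
# Google Search client, injected as callables so the structure stays runnable.
from typing import Callable, Dict, List

def split_into_facts(response: str, call_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM to decompose a long-form response into self-contained atomic claims."""
    prompt = ("Split the following text into individual, self-contained factual claims, "
              "one per line:\n\n" + response)
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def check_fact(fact: str,
               call_llm: Callable[[str], str],
               search_google: Callable[[str], str]) -> str:
    """Let the LLM write a search query for the claim, then judge it against the results."""
    query = call_llm(f"Write a Google Search query to verify this claim: {fact}")
    evidence = search_google(query)
    verdict = call_llm(f"Claim: {fact}\nSearch results: {evidence}\n"
                       "Answer with exactly one of: supported, not_supported, irrelevant.")
    return verdict.strip().lower()

def safe_rate(response: str,
              call_llm: Callable[[str], str],
              search_google: Callable[[str], str]) -> Dict[str, int]:
    """Return counts of supported / not-supported / irrelevant claims for one response."""
    counts = {"supported": 0, "not_supported": 0, "irrelevant": 0}
    for fact in split_into_facts(response, call_llm):
        verdict = check_fact(fact, call_llm, search_google)
        counts[verdict if verdict in counts else "irrelevant"] += 1
    return counts
```

The per-response counts produced here are what the precision-and-recall aggregation metric (sketched further below) consumes.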
Researchers from Google DeepMind and Stanford Introduce Search-Augmented Factuality Evaluator (SAFE): Enhancing Factuality Evaluation in Large Language Models Quick read: https://t.co/anXisulDKY Researchers from Google DeepMind and Stanford University have introduced a novel…
People and companies lie about AI. https://t.co/CTFindvjC4
DeepMind Unveils SAFE: An AI-Powered Tool for Fact-Checking LLMs #accuracy #AI #artificialintelligence #ChatGPT #Collaboration #DeepMind #Factcheckers #factchecking #GoogleSearch #llm #machinelearning #Media #methodology #opensource #Reliability #Safe https://t.co/RQ7hednLqH https://t.co/k6RPJa9UKr
Google is working on a new “Fact Checking” AI. The Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts, and uses Google Search to determine the accuracy of each claim. Yep. https://t.co/jrn8JCcw2B
Our new effort tries to address an elephant in the room for LLMs: given that factuality/hallucination is so critical to the success of LLMs, is there a quantitative evaluation to benchmark all existing LLMs in general? We hope our benchmark will be adopted and benchmarked as part of… https://t.co/NfkqTGRAoh
New work on evaluating long-form factuality 🎉. Our method SAFE combines Google Search and LLM queries to extract and verify individual claims in responses. Most excitingly, we show SAFE is cheaper💰 and more reliable ✅ than human annotators. https://t.co/ulSad7fs0b
New factuality research! We use LMs as annotators & search engines for grounding to create a realistic benchmark for evaluating long-form factuality. Simulating your daily queries to LMs about knowledge & truth. 🔍📊 #NLProc #FactChecking Check this out! 👇 https://t.co/UydBu8ObvC
We focus on long-form factuality in open domains, so we present an entire evaluation pipeline with dataset + autorater + metric. The dataset was generated with LLMs and the autorater is an LLM agent with Google Search, demonstrating LLMs can rate themselves better than humans! https://t.co/DKwXxmBdFg
Our new work on evaluating and benchmarking long-form factuality. We provide a new dataset, an evaluation method, an aggregation metric that accounts for both precision and recall, and an analysis of thirteen popular LLMs (including Gemini, GPT, and Claude). We’re also… https://t.co/EHXmBY8LAE
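The "aggregation metric that accounts for both precision and recall" mentioned above corresponds to the paper's F1@K. The sketch below follows my reading of that definition (precision over all checked facts, recall capped at K supported facts); treat the exact formula and the example numbers as an approximation rather than the paper's official specification.

```python
# Hedged sketch of an F1@K-style metric combining factual precision and recall;
# the precise definition in the paper may differ in detail.
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Precision = supported / all checked facts; recall saturates at K supported facts."""
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 38 supported facts, 4 unsupported, K = 64 facts "cared about"
print(round(f1_at_k(38, 4, 64), 3))  # ≈ 0.717
```

The cap on recall is what lets the metric reward responses that are both accurate and sufficiently detailed, rather than short answers that are trivially precise.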
New @GoogleDeepMind+@Stanford paper! 📜 How can we benchmark long-form factuality in language models? We show that LLMs can generate a large dataset and are better annotators than humans, and we use this to rank Gemini, GPT, Claude, and PaLM-2 models. https://t.co/A3vgEjbqTV https://t.co/x1tlgYlCdg
AIs have a bad reputation for truth, so three important findings in this paper: 1) "LLM agents can achieve superhuman rating performance" on fact checking when given access to Google! 2) Bigger models are more factual 3) LLMs are 20x cheaper than humans https://t.co/lSSMAjoOnF https://t.co/oAWuaZFNPA
[CL] Long-form factuality in large language models J Wei, C Yang, X Song, Y Lu, N Hu, D Tran, D Peng, R Liu, D Huang, C Du, Q V. Le [Google DeepMind] (2024) https://t.co/VtkDsWHUgs - The paper introduces LongFact, a new prompt set for benchmarking long-form factuality of… https://t.co/Ezd8dZWHW7
Great paper from Google DeepMind: "Long-form factuality in large language models" 📌 Propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). 📌 Demonstrate that LLM… https://t.co/TOhm3tCxrG
Google announces "Long-form factuality in large language models": Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first https://t.co/SkcoK8qJaQ
Google presents Long-form factuality in large language models - Proposes that LLM agents can be used as automated evaluators for long-form factuality - Shows that LLM agents can achieve superhuman rating performance repo: https://t.co/rlAIFSqfTU abs: https://t.co/L3CpeLaFpQ https://t.co/OSNpnr1BmP