Cohere has introduced a new framework for evaluating Large Language Models (LLMs) called the Panel of LLM Evaluators (PoLL), in work led by @pat_verga. This approach replaces the traditional single-model evaluation setup with a panel of diverse, smaller LLMs, aiming to reduce biases and increase accuracy in assessments. The PoLL system has been shown to be less biased, faster, and seven times cheaper than using a single large judge, and is particularly effective on QA and Arena-hard evaluations. In other news, the OpenBioLLM-Llama3 models (70B and 8B) have outperformed competitors like GPT-4 and Gemini on medical AI benchmarks, and the Med-Gemini announcement included a full relabeling of MedQA revealing that 7.4% of its questions are unfit for evaluation.
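The core idea — replacing one large judge with a panel of smaller judges whose verdicts are pooled — can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the function names (`poll_verdict`, `poll_score`) and the binary correct/incorrect verdict format are assumptions; the paper's actual aggregation functions (such as max voting or average pooling) may differ in detail.

```python
from collections import Counter

def poll_verdict(judge_verdicts):
    """Aggregate binary correct/incorrect verdicts from a panel of
    judge models by majority vote (one hypothetical pooling choice)."""
    return Counter(judge_verdicts).most_common(1)[0][0]

def poll_score(judge_scores):
    """Average-pool numeric scores across the panel (another
    hypothetical pooling choice for scored evaluations)."""
    return sum(judge_scores) / len(judge_scores)

# Example: three small judges from different model families
# each grade the same model answer independently.
verdicts = ["correct", "correct", "incorrect"]
print(poll_verdict(verdicts))          # → correct
print(poll_score([7.0, 8.0, 6.0]))     # → 7.0
```

Pooling over judges from different model families is what damps the intra-model self-preference bias that a single large judge exhibits, and because each panelist is small, the panel as a whole is cheaper to run than one frontier-scale judge.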
This AI Research from Cohere Discusses Model Evaluation Using a Panel of Large Language Model Evaluators (PoLL). It showed how a Panel of LLM Evaluators composed of smaller models is not only an effective method for evaluating LLM performance, but also reduces intra-model bias,… https://t.co/I61PYlJexp
Cohere introduces PoLL, a novel LLM evaluation framework #accuracy #AI #AItechnology #artificialintelligence #Bias #Cohere #decentralized #Evaluation #llm #machinelearning #PoLL https://t.co/HsMWv7LLJB https://t.co/Ep4kJUcrYr
SEED-Bench-2-Plus: The Ultimate Test for Multimodal Large Language Models (MLLMs) in Text-Heavy Environments #AI #AItechnology #artificialintelligence #evaluationframework #llm #machinelearning #MLLMs #MultimodalLargeLanguageModels https://t.co/MEmK8fL9AJ https://t.co/f9rOld6YHc
OpenBioLLM-Llama3-70B & 8B: Pioneering Advancements in Medical AI #AI #artificialintelligence #benchmarks #biomedicaldatasets #DirectPreferenceOptimization #DPO #effectiveness #Finetuning #Gemini #GPT4 #Healthcare #LLama370B&8Bmodels #llm https://t.co/YyKBh1Sx2H https://t.co/pzDTeP8KRF
Excited to announce Med-Gemini, demonstrating a new SOTA on MedQA, multimodal and long-context abilities - https://t.co/YCzg9RmZ5W I particularly want to highlight our full relabeling of MedQA, revealing that 7.4% of questions are unfit for evaluation. A short thread: https://t.co/2YFmPSrXow
LLM-as-a-judge has been widely accepted as a workable replacement for human eval, but relying on a single model introduces systematic bias. Happy to share a new paper from our team led by @pat_verga that shows a panel of models as judge offers a more accurate and cheaper solution. https://t.co/QijUG0tt0c
New paper from our team, led by @pat_verga Are you: * Doing evaluation with LLMs? * Using a huge model? * Worried about self-recognition? Try an ensemble of smaller LLMs. Use a PoLL: less biased, faster, 7x cheaper. Works great on QA & Arena-hard evals https://t.co/Lhvx5GN8I8 https://t.co/3dbmbVhEZC
LLMs-as-Juries? A better way to automatically evaluate LLMs? 👨⚖️ LLM-as-a-judge refers to using LLMs to evaluate the performance or quality of other LLMs. 🤔 @cohere released a new paper exploring the results of replacing a single LLM “Judge” with multiple LLM “Juries” where they… https://t.co/AGiGGguAAa
Llama-3-based OpenBioLLM-Llama3-70B and 8B: Outperforming GPT-4, Gemini, Meditron-70B, Med-PaLM-1 and Med-PaLM-2 in Medical-Domain Quick read: https://t.co/5G5RJ4jMyM Open Medical-LLM Leaderboard: https://t.co/B75cECnGaZ OpenBioLLM-70B project page: https://t.co/wUWOon4Lxg… https://t.co/87DkbngGWM
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. As Large Language Models (LLMs) have become more advanced, they have outpaced our abilities to accurately evaluate their quality. Not only is finding data to adequately probe… https://t.co/XQtPmuqTOd
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models abs: https://t.co/vrWLMNDPF5 This paper from Cohere proposes to evaluate models using a Panel of LLM evaluators (PoLL). "we find that using a PoLL composed of a larger number of smaller… https://t.co/rMnRDDoHdo