Recent research in AI has focused on improving the evaluation of large language models (LLMs). A new paper led by @pat_verga proposes replacing a single large judge model with an ensemble of smaller LLMs, termed a Panel of LLM Evaluators (PoLL), which is less biased, faster, and seven times cheaper; the approach has proven effective on QA and Arena-hard evaluations. Separately, @scale_AI introduced GSM1K, a new evaluation benchmark designed to measure how overfit popular LLMs are on public benchmarks. A third development is the open-source Prometheus 2 (released in 7B and 8x7B sizes), evaluator models designed to closely mirror human and GPT-4 judgments. These models support both direct assessment and pairwise ranking, as well as user-defined evaluation criteria, improving the transparency, controllability, and affordability of LLM evaluation.
🚨 New paper! Evaluating LLMs using closed-source LLMs has limited transparency, controllability, and affordability. Incredible work by @seungonekim significantly improves all these factors, w/ open models for either relative or absolute response scoring. ⬇️ https://t.co/RBVdas3dAb
An Open Source LM Specialized in Evaluating Other LMs Open-source Prometheus 2 (7B & 8x7B), state-of-the-art open evaluator LLMs that closely mirror human and GPT-4 judgments. They support both direct assessments and pair-wise ranking formats grouped with user-defined… https://t.co/DiHHcYHYZh
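The two formats mentioned in the tweet, direct assessment and pairwise ranking, can be sketched as follows. This is a minimal illustration, not the Prometheus 2 API: `call_evaluator` is a placeholder stub, and the prompt wording is assumed, not taken from the paper.

```python
# Hypothetical sketch of the two evaluation formats an evaluator LM such as
# Prometheus 2 supports: direct assessment (score one response against a
# rubric) and pairwise ranking (pick the better of two responses).
# `call_evaluator` is a stand-in stub, not a real Prometheus interface.

def call_evaluator(prompt: str) -> str:
    """Stub for a call to an evaluator LM; returns a canned verdict."""
    return "4" if "Rubric:" in prompt else "A"

def direct_assessment(instruction: str, response: str, rubric: str) -> int:
    """Absolute scoring: the evaluator grades a single response 1-5."""
    prompt = (f"Instruction: {instruction}\n"
              f"Response: {response}\n"
              f"Rubric: {rubric}\n"
              "Give a score from 1 to 5.")
    return int(call_evaluator(prompt))

def pairwise_ranking(instruction: str, response_a: str, response_b: str) -> str:
    """Relative scoring: the evaluator picks the better of two responses."""
    prompt = (f"Instruction: {instruction}\n"
              f"A: {response_a}\nB: {response_b}\n"
              "Which response is better, A or B?")
    return call_evaluator(prompt)

print(direct_assessment("Summarize X", "X is ...", "Faithfulness"))  # 4
print(pairwise_ranking("Summarize X", "good summary", "bad summary"))  # A
```

With a real evaluator model behind `call_evaluator`, the same two wrappers cover both the absolute and relative scoring modes the tweet describes.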
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1K https://t.co/YqN4rVEPU9
This AI Research from Cohere Discusses Model Evaluation Using a Panel of Large Language Model Evaluators (PoLL) It showed that a Panel of LLM Evaluators composed of smaller models is not only an effective method for evaluating LLM performance, but also reduces intra-model bias,… https://t.co/I61PYlJexp
New paper from our team, led by @pat_verga Are you: * Doing evaluation with LLMs? * Using a huge model? * Worried about self-recognition? Try an ensemble of smaller LLMs. Use a PoLL: less biased, faster, 7x cheaper. Works great on QA & Arena-hard evals https://t.co/Lhvx5GN8I8 https://t.co/3dbmbVhEZC