Recent research in AI has focused on improving the evaluation of large language models (LLMs). A new paper led by @pat_verga proposes replacing a single large judge model with an ensemble of smaller LLMs, termed a Panel of LLM Evaluators (PoLL), which is less biased, faster, and seven times cheaper; the approach has proven effective on QA and Arena-hard evaluations. Separately, @scale_AI introduced GSM1K, a new evaluation benchmark designed to measure how overfit popular LLMs are on public benchmarks. A third development is the open-source Prometheus 2 (released in 7B and 8x7B sizes), evaluator models designed to closely mirror human and GPT-4 judgments. These models support both direct assessment and pairwise ranking, as well as user-defined evaluation criteria, improving the transparency, controllability, and affordability of LLM evaluation.
🚨 New paper! Evaluating LLMs using closed-source LLMs has limited transparency, controllability, and affordability. Incredible work by @seungonekim significantly improves all these factors, w/ open models for either relative or absolute response scoring. ⬇️ https://t.co/RBVdas3dAb
An Open Source LM Specialized in Evaluating Other LMs Open-source Prometheus 2 (7B & 8x7B), state-of-the-art open evaluator LLMs that closely mirror human and GPT-4 judgments. They support both direct assessments and pair-wise ranking formats grouped with user-defined… https://t.co/DiHHcYHYZh
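The two formats mentioned in the tweet, direct assessment and pairwise ranking, can be sketched as follows. This is a minimal illustration, not the Prometheus 2 API: `call_evaluator` is a placeholder stub, and the prompt wording is assumed, not taken from the paper.

```python
# Hypothetical sketch of the two evaluation formats an evaluator LM such as
# Prometheus 2 supports: direct assessment (score one response against a
# rubric) and pairwise ranking (pick the better of two responses).
# `call_evaluator` is a stand-in stub, not a real Prometheus interface.

def call_evaluator(prompt: str) -> str:
    """Stub for a call to an evaluator LM; returns a canned verdict."""
    return "4" if "Rubric:" in prompt else "A"

def direct_assessment(instruction: str, response: str, rubric: str) -> int:
    """Absolute scoring: the evaluator grades a single response 1-5."""
    prompt = (f"Instruction: {instruction}\n"
              f"Response: {response}\n"
              f"Rubric: {rubric}\n"
              "Give a score from 1 to 5.")
    return int(call_evaluator(prompt))

def pairwise_ranking(instruction: str, response_a: str, response_b: str) -> str:
    """Relative scoring: the evaluator picks the better of two responses."""
    prompt = (f"Instruction: {instruction}\n"
              f"A: {response_a}\nB: {response_b}\n"
              "Which response is better, A or B?")
    return call_evaluator(prompt)

print(direct_assessment("Summarize X", "X is ...", "Faithfulness"))  # 4
print(pairwise_ranking("Summarize X", "good summary", "bad summary"))  # A
```

With a real evaluator model behind `call_evaluator`, the same two wrappers cover both the absolute and relative scoring modes the tweet describes.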
How overfit are popular LLMs on public benchmarks? New research from @scale_AI tries to figure this out with a new evaluation benchmark - GSM1K https://t.co/YqN4rVEPU9
This AI Research from Cohere Discusses Model Evaluation Using a Panel of Large Language Model Evaluators (PoLL) It showed that a Panel of LLM Evaluators composed of smaller models is not only an effective method for evaluating LLM performance, but also reduces intra-model bias,… https://t.co/I61PYlJexp
New paper from our team, led by @pat_verga Are you: * Doing evaluation with LLMs? * Using a huge model? * Worried about self-recognition? Try an ensemble of smaller LLMs. Use a PoLL: less biased, faster, 7x cheaper. Works great on QA & Arena-hard evals https://t.co/Lhvx5GN8I8 https://t.co/3dbmbVhEZC