The AI community is debating the fairness of comparisons between AI models, particularly between Google's Bard (Gemini Pro) and GPT-4-Turbo. Critics point out that Bard's integration with Google's search index via RAG (Retrieval-Augmented Generation) lets it pull live information from the web, giving it an edge in Elo rankings and benchmarks. This has resulted in Bard outperforming older GPT-4 models, but commentators are questioning Bard's practical utility for specific tasks and calling for fairer AI model evaluations.
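The RAG setup the critics describe can be sketched in a few lines: retrieve snippets for the user query, then prepend them to the prompt before calling the base model. Note that `search_web` and `generate` below are hypothetical stand-ins for a live search index and a bare model call, not any real Bard, Gemini, or Google Search API.

```python
# Minimal RAG sketch. Both functions are illustrative stubs, not real APIs.

def search_web(query: str, k: int = 2) -> list[str]:
    """Stand-in for a live search index (the part Bard is said to have)."""
    fake_index = [
        "Snippet A about the query topic.",
        "Snippet B with more recent details.",
        "Unrelated snippet C.",
    ]
    return fake_index[:k]

def generate(prompt: str) -> str:
    """Stand-in for a bare model call (the 'model weights only' setup)."""
    return f"Answer grounded in: {prompt[:60]}..."

def rag_answer(query: str) -> str:
    # Retrieval step: fetch fresh context the base weights don't contain.
    snippets = search_web(query)
    context = "\n".join(f"- {s}" for s in snippets)
    # Augmentation step: the model answers with the retrieved context inline.
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(rag_answer("What changed in the latest Chatbot Arena rankings?"))
```

The whole debate below is about whether a system with this extra retrieval step should be ranked against models that only see the bare question.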
Mystery solved - Bard has access to the web while the rest of the services don’t! This gives it an inherent advantage https://t.co/UanRrCKmiB
Bard is definitely winning in the Elo rankings because it can do web searches, and multiple ones at the same time. That's at least contributing. Can't seem to access this functionality via the API. @lmsysorg how are you guys doing it? https://t.co/xkPA0aQR7G
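For reference, the Elo-style rating these tweets keep citing comes from pairwise battles: when model A beats model B, A's rating rises by an amount proportional to how unexpected the win was. The K-factor and ratings below are illustrative, not lmsys's actual parameters.

```python
# Generic Elo update for pairwise model rankings (illustrative parameters).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one head-to-head battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Equal ratings, A wins: the winner gains k/2 = 16 points.
new_a, new_b = elo_update(1200, 1200, a_won=True)
print(new_a, new_b)  # 1216.0 1184.0
```

The complaint in the thread is not with this math but with its inputs: if one "model" in the battle is actually model-plus-search, every win it racks up inflates a rating that is then read as a property of the weights alone.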
Google’s propaganda machine is in full effect. I’ve used Gemini Pro a lot in the last week for testing; I asked it to explain a complex manifold, and that was an interesting answer. I can even make the two talk to each other, and even GPT-3.5 Turbo is better. And there is so much pro-Google… https://t.co/t0rsdIASJl
This is a misleading result. Bard uses RAG (access to Google search) here while the other top competitors like GPT-4, Mistral Medium and Claude are not using search. Access to recent information through search is an advantage that needs to be controlled for or at least pointed… https://t.co/ugQ1x7LMfI
So...can I submit search-powered Mixtral to lmsys as well? If the new Bard ranking has access to search, this is now an uneven comparison 🤔
Isn't the Bard API also doing RAG with web search there? Sounds like a pretty unfair comparison, if that is the case. It would need to be compared to Copilot, Perplexity, etc. https://t.co/kghUvt1NTy
Not sure if that's a fair comparison when Bard is using a search API while GPT-4 and other models are not (example below). The bare-metal Gemini Pro API seems to sit between Mixtral 8x7B and GPT-3.5. So the key difference is search, which greatly improves human preference? https://t.co/2TlebUnJuo https://t.co/uhpTR96K41
Everyone's talking about Gemini outperforming older GPT-4 models on benchmarks, but are you actually USING Gemini over GPT-4 for your work? I ran 1k+ API calls to both models last week and, for my task, Gemini wasn't even close. How does it do for your work? https://t.co/H0ufb9V481
I love Chatbot Arena, but this is crazy. "GPT-4-Turbo" is a bunch of model weights. "Bard (Gemini Pro)" is the Google crawler/index + Gemini. There's no way on earth a live crawler should be compared to base model weights. If anything, it's actually crazy that GPT-4 disconnected… https://t.co/7Xr3pWxgaU