Researchers at Princeton have developed SWE-agent, which scores 12.29% on SWE-Bench, close to Devin's 13.84%. Since the underlying LLM does most of the heavy lifting, the more interesting question is how to evaluate the coding LLMs themselves, and SWE-Bench is well suited for that.
How LLMs are trained https://t.co/1wx0nP4ZBL
One of the tasks in SWE-bench, the benchmark used to assess AI agents like Devin, quotes in its issue text the exact line the agent needs to change to fix the bug. This is why it's important to read the data behind a benchmark. You can browse the list of these issues here https://t.co/ABBYWBL66W https://t.co/4Yy94rF3mx
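A minimal sketch of the kind of check this tweet implies: scan SWE-bench instances for cases where a line the gold patch removes already appears verbatim in the issue text. The dataset name and field names (`problem_statement`, `patch`, `instance_id`) follow the Hugging Face release of SWE-bench as I understand it, but treat the schema as an assumption and verify it yourself.

```python
from datasets import load_dataset

# Assumed dataset ID and split; check the actual SWE-bench release.
ds = load_dataset("princeton-nlp/SWE-bench", split="test")

def removed_lines(patch: str):
    # Lines the fix deletes or replaces: unified-diff lines starting
    # with a single "-" (skipping the "---" file header).
    for line in patch.splitlines():
        if line.startswith("-") and not line.startswith("---"):
            stripped = line[1:].strip()
            if stripped:
                yield stripped

leaky = []
for ex in ds:
    issue = ex["problem_statement"]
    if any(line in issue for line in removed_lines(ex["patch"])):
        leaky.append(ex["instance_id"])

print(f"{len(leaky)} / {len(ds)} instances quote a to-be-changed line verbatim")
```

A verbatim-substring match is a crude heuristic; it flags candidates for manual reading rather than proving contamination.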
The moat of software AI agents is not the thin wrapper layer (Devin, SWE-Agent) but the underlying LLM. Rather than benchmarking the wrapper, I think SWE-Bench is excellent for evaluating coding LLMs: hold the agent layer fixed and vary only the LLM backend. Provide all… https://t.co/uublPJfm3f
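A minimal sketch of the evaluation setup this tweet proposes: one fixed agent scaffold, a pluggable LLM backend, and the same tasks for every model, so scores compare LLMs rather than wrappers. `Agent`, `task_is_resolved`, and the backend names are hypothetical stand-ins, not SWE-agent's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# An LLM backend is just a prompt -> completion function.
LLMBackend = Callable[[str], str]

@dataclass
class Agent:
    """Fixed scaffold: identical prompts, tools, and loop for every backend."""
    llm: LLMBackend
    max_steps: int = 20

    def solve(self, issue_text: str) -> str:
        # A real harness would call tools (edit files, run tests, ...);
        # this loop only illustrates that the scaffold never changes.
        transcript = issue_text
        for _ in range(self.max_steps):
            action = self.llm(transcript)
            transcript += "\n" + action
            if action.startswith("SUBMIT"):
                break
        return transcript

def task_is_resolved(transcript: str) -> bool:
    # Placeholder: a real harness would apply the submitted patch and
    # run the repo's test suite.
    return "SUBMIT" in transcript

def evaluate(backends: dict, tasks: list) -> dict:
    """Same tasks, same agent; only the LLM varies."""
    scores = {}
    for name, llm in backends.items():
        agent = Agent(llm=llm)
        resolved = sum(task_is_resolved(agent.solve(t)) for t in tasks)
        scores[name] = resolved / len(tasks)
    return scores

# Usage with trivial stand-in backends:
backends = {
    "model_a": lambda prompt: "SUBMIT patch",
    "model_b": lambda prompt: "look around",
}
print(evaluate(backends, tasks=["fix the off-by-one in the parser"]))
```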
Hector Liu from @llm360 talking about how to pretrain LLMs from scratch - joining us in 2.5 hours in my server! link: https://t.co/C21orV2hzx don't miss it! https://t.co/YuwNpH9SEh
Less than a month after Cognition Labs released Devin, an AI coding agent that apparently solves software bugs better than any prior agent, researchers at Princeton have released SWE-agent, which scored 12.29%—nearly as high as Devin’s 13.84%. https://t.co/1LqOpEAIYi
“The fusion of #Human ingenuity and the #Computational prowess of LLMs heralds a new era of functionality” #FutureWithAI🪩 https://t.co/37VzKHUSaA
Considering a new idea for LLM benchmarking. Given that benchmarks can be beaten by training on the test set, we are thinking of setting up a service where folks can submit their LLMs, and humans combined with AI can test them. The more votes an LLM wins, the higher it floats up the…
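One way such a service could turn pairwise votes into a ranking is a standard Elo update, so models that win more comparisons float up the leaderboard. This is my assumption about the mechanism, not something the tweet specifies; the K-factor and starting rating are conventional defaults.

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo logistic model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    # Winner gains, loser loses, scaled by how surprising the outcome was.
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)
    ratings[winner] = ra + k * (1 - ea)
    ratings[loser] = rb - k * (1 - ea)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
record_vote(ratings, winner="model_a", loser="model_b")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Because ratings come from live human votes rather than a fixed test set, there is no static benchmark to train on, which is the point of the proposal.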
A wave of pure implementations of LLM training is coming 🤓 https://t.co/qCOaYgnDEh
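In the spirit of the minimal-dependency codebases this tweet anticipates (projects like llm.c or nanoGPT), here is a toy character-level bigram LM trained with a bare cross-entropy loop. An illustrative sketch, not any particular project's code.

```python
import torch
import torch.nn.functional as F

# Toy corpus and vocabulary.
text = "hello world, hello llm training"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

V = len(chars)
# A bigram LM is just a V x V table of next-character logits.
logits_table = torch.zeros((V, V), requires_grad=True)
opt = torch.optim.AdamW([logits_table], lr=0.1)

for step in range(200):
    x, y = data[:-1], data[1:]                    # predict next char from current
    loss = F.cross_entropy(logits_table[x], y)    # (N, V) logits vs (N,) targets
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.3f}")
```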