Well, the article says the AI agents were able to complete 30% of the tasks given to them, like searching the web, communicating with co-workers, etc. I think this is interesting.
CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge-work tasks like browsing the web, writing code, running applications, and communicating with coworkers.
“We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks.”
Personally, I believe this is impressive.
It’s really not. A calculator that only gave the right output 30% of the time would be worthless.