Well, the article says the AI agents were able to complete 30% of the tasks given to them, like searching the web, communicating with co-workers, etc. I think this is interesting.
CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge-work tasks like browsing the web, writing code, running applications, and communicating with coworkers.
“We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks.”
Personally, I believe this is impressive.
It’s really not. A calculator that only gave the right output 30% of the time would be worthless.