Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

brianpeiris@lemmy.ca · edit-2 4 months ago

UnrepentantAlgebra@lemmy.world · 4 months ago

If human scores were included, they would be at 100%, at the cost of approximately $250

Wait, why did it cost real humans $250 to pass the test?

KairuByte@lemmy.dbzer0.com · 4 months ago

I assume it’s an hourly wage or something. Just because humans can work for free if they choose, doesn’t mean they have no cost associated with them. Just like a company could choose to give away unlimited tokens, those tokens still have a standard cost.

FrankFrankson@lemmy.world · edit-2 4 months ago

That is how much individual testing humans cost when you buy them in bulk.

Aceticon@lemmy.dbzer0.com · 4 months ago

If there had been a “Buy 10, Get 1 free” they could’ve used 11 humans instead of 10 for the same $250.

aesopjah@sh.itjust.works · 4 months ago

It’s also an odd metric since only 20-60% of the humans completed it. Very “60% of the time, they complete it every time” energy.

Ideally they’d run the bots through multiple times (with no context or training carried over from the previous run), but I guess that is cost-prohibitive?

monotremata@lemmy.ca · 4 months ago

Yeah, this is what I was going to call out. Calling it “100% solvable by humans” and saying “if human scores were included, they would be at 100%” when 20-60% of humans solved each task seems kinda misleading. The AI scores are so low that I don’t think this kind of hyperbole is necessary; I assume there are some humans that scored 100%, but I would find it a lot more useful if they said something like “the worst-performing human in our sample was able to solve 45% of the tasks” or whatever. Given that the AIs are still scoring below 1%, that’s still pretty dark.

mapleseedfall@lemmy.world · 4 months ago

You’d have to eat $250 worth of burgers to pass it.

brianpeiris@lemmy.ca · edit-2 4 months ago

This is my rough upper-bound estimate based on the Technical Report. Human participants were paid to complete and evaluate the tasks at an average fixed fee of $128 plus $5 for each solved task. So if a panel of humans were tasked with solving the 25 tasks in the public test set, and each person solved all of them, it would be $128 + 25 × $5 = $253, or roughly $250 per person. Although, looking at it again, the costs listed for the LLMs are per task, so it would actually be more like $10 per human per task. In any case it’s one or two orders of magnitude less than the LLMs.

Participants received a fixed participation fee of $115–$140 for completing the session, along with a $5 performance-based incentive for each environment successfully solved

https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf
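
For anyone who wants to check the arithmetic, here is a minimal sketch of the estimate above. The function name and its defaults are my own; the $128 average fee, the $5 bonus, and the 25 public tasks are the figures quoted in this comment, and it assumes every task gets solved, which is why it is an upper bound.

```python
def human_cost_upper_bound(fixed_fee=128.0, bonus_per_solved=5.0, num_tasks=25):
    """Return (total cost per person, cost per task) in dollars.

    Assumes the participant solves every task, so the bonus is paid
    num_tasks times. That makes the result an upper bound.
    """
    total = fixed_fee + bonus_per_solved * num_tasks
    return total, total / num_tasks


total, per_task = human_cost_upper_bound()
print(f"per person: ${total:.2f}")    # average fee case
print(f"per task:   ${per_task:.2f}")

# Same estimate at the two ends of the quoted $115-$140 fee range.
low_total, _ = human_cost_upper_bound(fixed_fee=115.0)
high_total, _ = human_cost_upper_bound(fixed_fee=140.0)
print(f"range:      ${low_total:.2f} to ${high_total:.2f}")
```

With the average fee this comes out to about $253 per person, or about $10 per task, which is where the rounded $250 and $10 figures come from.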

ExLisper@lemmy.curiana.net · 4 months ago

Because I ain’t doing this shit for free.
