Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

brianpeiris@lemmy.ca · 4 months ago

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

Tetragrade@leminal.space · 4 months ago

This replay is the funniest shit lmao. Keep building that bridge Claude.

https://arcprize.org/replay/0964128b-a2f5-4c5b-886e-497d893f429d

Interesting that it seems to be perceiving the environment mostly accurately, and is just completely wrong about the purpose of all the game objects.

bss03@infosec.pub · 4 months ago

I couldn’t find replays. Are there more? Also, it is a bit funny that “building the bridge”, which at one point seems to be Claude’s “chosen goal”, is really just “running out of moves” and failing the task.

Task failed successfully, Claude. Task failed, successfully.

brianpeiris@lemmy.ca · 4 months ago

There’s a column linking to replays in the table of tasks here: https://arcprize.org/tasks

bss03@infosec.pub · 4 months ago

Here’s another replay where the model mistakes running out of time/moves for making progress.

‹Hexa«Back›@lemmy.blahaj.zone · 4 months ago

its reasoning log is so fucking funny

hamsterkill@lemmy.sdf.org · 4 months ago

My understanding is that Claude is particularly geared towards being a tool for people to use rather than a human replacement. That’s why they had that whole spat with the Pentagon about a human needing to be in the loop.
