Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

brianpeiris@lemmy.ca · 16 hours ago

Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

ayyy@sh.itjust.works · 4 hours ago

The humans literally didn’t score 100% though. Why lie?

brianpeiris@lemmy.ca · 1 hour ago

You can only really judge the fairness of the score if you understand the scoring criteria. It is a relative score in which humans are the 100% baseline by definition: a task was only included in the challenge if at least two people on the human panel were able to solve it completely, and their action counts serve as the measure of efficiency. This is the baseline used as the point of comparison.

From the Technical Report:

The procedure can be summarized as follows:
• “Score the AI test taker by its per-level action efficiency” - For each level that the test taker completes, count the number of actions that it took.
• “As compared to human baseline” - For each level that is counted, compare the AI agent’s action count to a human baseline, which we define as the second-best human action count. Ex: If the second-best human completed a level in only 10 actions, but the AI agent took 100 to complete it, then the AI agent scores (10/100)^2 for that level, which gets reported as 1%. Note that level scoring is calculated using the square of efficiency.
• “Normalized per environment” - Each level is scored in isolation. Each individual level will get a score between 0% (very inefficient) and 100% (matches or surpasses human level efficiency). The environment score will be a weighted-average of level score across all levels of that environment.
• “Across all environments” - The total score will be the sum of individual environment scores divided by the total number of environments. This will be a score between 0% and 100%.
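
To make the arithmetic concrete, here is a rough Python sketch of the procedure as I read it from the summary above. It is not the official scoring code. The function names are made up for illustration, the level weights default to uniform because the quoted summary does not say what the weights are, and treating an uncompleted level as 0% is my assumption.

```python
from typing import Optional, Sequence, Tuple

# (second-best human action count, AI action count or None if not completed)
Level = Tuple[int, Optional[int]]


def level_score(human_actions: int, ai_actions: Optional[int]) -> float:
    """Squared action efficiency for one level, capped at 1.0 (100%)."""
    if ai_actions is None:
        # Assumption: a level the AI never completes contributes 0%.
        return 0.0
    efficiency = min(1.0, human_actions / ai_actions)
    return efficiency ** 2


def environment_score(levels: Sequence[Level],
                      weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of level scores within one environment."""
    if weights is None:
        # Assumption: uniform weights; the real weighting is not given here.
        weights = [1.0] * len(levels)
    weighted = sum(w * level_score(h, a) for w, (h, a) in zip(weights, levels))
    return weighted / sum(weights)


def total_score(environments: Sequence[Sequence[Level]]) -> float:
    """Sum of environment scores divided by the number of environments."""
    return sum(environment_score(env) for env in environments) / len(environments)


# The example from the report: human baseline 10 actions, AI takes 100.
print(f"{level_score(10, 100):.0%}")  # (10/100)^2, reported as 1%
```

Because the efficiency is squared, an agent that needs ten times as many actions as the baseline gets 1% for that level, not 10%, so low scores pile up quickly.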

So the humans “scored 100%” because that is the baseline by definition, and the AIs are evaluated on how close they get to human correctness and efficiency. A score of 0.26% means the AI reached only 0.0026 times the human baseline on that combined measure.

Knock_Knock_Lemmy_In@lemmy.world · 3 hours ago

John 1.0 and Caroline 1.0 scored 100%

eru@mouse.chitanda.moe · 2 hours ago

makes title more clickbait

Announcing ARC-AGI-3 | ARC Prize