AI company Scale AI and the Center for AI Safety (CAIS) have jointly released a groundbreaking benchmark called "Humanity's Last Exam," designed to test the limits of AI knowledge.
"Humanity's Last Exam" features an extensive range of challenging questions across subjects such as mathematics, humanities, and natural sciences. Each question was meticulously crafted by university professors, mathematicians, and experts across various disciplines. While all questions have definitive answers, they are exceptionally difficult to solve.
Example Question:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
The benchmark includes 3,000 questions in total, primarily in multiple-choice or short-answer format. When several advanced AI models were tested, including OpenAI's "GPT-4o" and Google's "Gemini 1.5 Pro", none achieved an accuracy rate above 10%. OpenAI's "o1", known for its strong reasoning capabilities, posted the highest score at 8.3%.
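For context, scores like these are typically computed as simple exact-match accuracy: the model's answer must match the reference answer for each question. The sketch below illustrates that scoring scheme only; the model_answer function and the question record format are hypothetical placeholders, not Scale AI's actual grading harness (which would also need to handle multiple-choice items and more careful answer normalization).

```python
# Minimal sketch of exact-match accuracy scoring for a short-answer benchmark.
# NOT Scale AI's actual evaluation harness; `model_answer` and the question
# records are hypothetical placeholders for illustration.

def model_answer(prompt: str) -> str:
    """Stand-in for a call to an AI model; returns the model's answer text."""
    raise NotImplementedError("Plug in a real model API call here.")

def exact_match_accuracy(questions: list[dict]) -> float:
    """Score {'prompt': ..., 'answer': ...} records by normalized exact match."""
    correct = sum(
        model_answer(q["prompt"]).strip().lower() == q["answer"].strip().lower()
        for q in questions
    )
    return correct / len(questions)

# A score of 8.3% on 3,000 questions corresponds to roughly 249 correct
# answers: 0.083 * 3000 = 249.
```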
Learn more about the benchmark on the official Scale AI blog.
Viewer Comments:
I have no idea what the example question is even saying.
Getting 10% correct out of 3,000 is already pretty impressive.
Is this just a contest for showing off obscure knowledge?
I don’t think I could answer a single one.