The Center for AI Safety (CAIS) and Scale AI today announced the results of a groundbreaking new AI benchmark designed to test the limits of AI knowledge and whether models are capable of chain-of-thought reasoning. The results showed a significant improvement over the reasoning capabilities of earlier models, but current models still answered fewer than 10 percent of the expert questions correctly.
CAIS and Scale AI Unveil Results of “Humanity's Last Exam,” a Groundbreaking New Benchmark