An international research team has developed a new benchmark that exposes the current limitations of large language models (LLMs). Even the most advanced models fail at 90 percent of the tasks, at least for now.
Frontier models fail hard at “Humanity's Last Exam” but experts question whether it matters