AI and LLMs struggle with historical accuracy in advanced tests

Leading AI systems perform poorly on nuanced history exams, with even the best reaching only 46% accuracy.