
Background: This study evaluates the effectiveness and reliability of GPT-4 in classifying radiological reports according to the Fazekas scale, a critical tool for assessing white matter signal abnormalities on brain MRI. We used synthetic data creation and two dedicated GPT models, SinteticRMFazekasGPT and FazekasGPT, to generate and analyze 50 synthetic radiological reports. GPT-4’s Fazekas classifications of these brain MRI reports were compared with the expert judgment of a neuroradiologist.

Results: The analysis used a contingency table and Cohen’s Kappa to measure inter-rater agreement. The significance of the difference between the observed agreement and the agreement expected by chance was tested with a 5% threshold for Type I error. Agreement between GPT-4 and the neuroradiologist was complete (100%) for Fazekas 0, Fazekas 2, and Fazekas 3. Of the 15 reports rated Fazekas 1, GPT-4 classified 13 (86.7%) correctly and the remaining 2 (13.3%) as Fazekas 2. Overall agreement was 96%, compared with an expected chance agreement of 28%. The Cohen’s Kappa value was 0.94 (p
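
As a check on the reported agreement statistics, Cohen’s Kappa can be recomputed from the observed agreement (96%) and the chance-expected agreement (28%) using the standard formula kappa = (p_o - p_e) / (1 - p_e). The sketch below is illustrative only and is not the authors' analysis code: the function names are hypothetical, and the full contingency table is not given in the text, so only the two reported proportions are used as input.

```python
def kappa_from_agreement(p_observed, p_expected):
    """Cohen's kappa from observed and chance-expected agreement proportions."""
    if p_expected >= 1.0:
        raise ValueError("expected agreement must be below 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


def agreement_from_table(table):
    """Return (p_observed, p_expected) for a square contingency table.

    table[i][j] counts the reports placed in category i by the first rater
    and in category j by the second rater (hypothetical layout).
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: share of reports on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: sum over categories of the product of marginal shares.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return p_observed, p_expected


# Proportions reported in the abstract: 96% observed, 28% expected by chance.
kappa_reported = kappa_from_agreement(0.96, 0.28)
print(round(kappa_reported, 2))  # 0.94
```

With the reported proportions this gives (0.96 - 0.28) / (1 - 0.28) ≈ 0.944, which matches the stated value of 0.94. Given the full 4 x 4 table of Fazekas grades, `agreement_from_table` would supply both proportions directly.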