2026 Proffered Presentations
S171: LARGE LANGUAGE MODELS VS EXPERT RESPONSES: A COMPARATIVE ANALYSIS OF PATIENT EDUCATION QUALITY AFTER SKULL BASE SURGERY
Mahdokht Manavi, MD1; Noel Ayoub, MD1; Arifeen Rahman, MD1; Jacquelyn Callander, MD1; Lirit Levi, MD1; David Liu, MD1; Axel Renteria, MD1; Maxime Fieux, MD2; Matei Banu, MD1; Ali Palejwala, MD1; Juan Fernandez-Miranda, MD1; Peter Hwang, MD1; Jayakar Nayak, MD1; Zara Patel, MD1; Michael Chang, MD1; 1Stanford University; 2Université de Lyon
Introduction: Patients are increasingly turning to large language models (LLMs) for information about their healthcare. Patients undergoing skull base surgery often experience postoperative symptoms that require timely and reliable guidance and may turn to LLMs for answers. It remains unclear whether LLM-generated responses are of sufficient quality for patient education. The purpose of this study was to evaluate the accuracy, understandability, actionability, and readability of LLM-generated answers versus expert-provided answers.
Methods: Eight common postoperative questions related to skull base surgery were selected. For each question, seven publicly available LLMs were queried for responses, and one expert answer was written collaboratively by two skull base surgeons. A panel of eight otolaryngologists and neurosurgeons evaluated the responses in a randomized, blinded fashion. Accuracy was rated on a 1–5 scale. Understandability and actionability were evaluated using the Patient Education Materials Assessment Tool (PEMAT). Readability was measured using the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease Score (FRES). One-way ANOVA with post-hoc analysis compared results across groups.
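[Editor's note: the two readability indices are closed-form functions of word, sentence, and syllable counts. The sketch below is a minimal Python illustration using the standard published Flesch-Kincaid coefficients; it is not the study's scoring pipeline, and the vowel-group syllable counter is a crude stand-in for a dictionary-based one.]

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count contiguous vowel groups (approximation)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, FRES) using the standard Flesch-Kincaid coefficients."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences        # mean words per sentence
    spw = n_syllables / n_words      # mean syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    return fkgl, fres

# Example on a short, plain-language answer (hypothetical text):
fkgl, fres = readability("Rest for two weeks. Avoid blowing your nose.")
print(f"FKGL = {fkgl:.1f}, FRES = {fres:.1f}")
```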
Results: For accuracy, LLMs performed at a level similar to the expert answer. The highest accuracy scores were observed for DeepSeek (mean 4.17) and Gemini (4.16), followed by GROK (4.06) and the expert answer (3.90). Lower averages were seen for Meta (3.72), Claude (3.63), Copilot (3.61), and ChatGPT o1 (3.10).
For understandability, all LLMs scored higher than the expert answer on PEMAT: Gemini 96 ± 3%, GROK 94 ± 6%, Meta 88 ± 7%, DeepSeek 87 ± 17%, Copilot 82 ± 6%, ChatGPT o1 68 ± 10%, Claude 63 ± 7%, and Expert 56 ± 7%. Understandability was significantly higher for Gemini, GROK, Meta, and DeepSeek than for the expert answer (F(7,56) = 23.42, p < .001, η² = 0.75).
For actionability, all LLMs scored higher than the expert answer: Gemini 99 ± 1%, GROK 99 ± 2%, Meta 93 ± 10%, DeepSeek 92 ± 15%, ChatGPT o1 65 ± 17%, and Claude 51 ± 21%, versus Expert 28 ± 18%. Actionability was significantly higher for the LLMs than for the expert answer (F(7,56) = 21.06, p < .001, η² = 0.72).
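[Editor's note: the reported degrees of freedom, F(7,56), are consistent with eight response sources each rated by eight reviewers (8 − 1 = 7 between groups; 64 − 8 = 56 within). The Python sketch below illustrates this analysis structure on hypothetical rater scores, not the study data; the group means are borrowed from the understandability results above for realism, and the spread is assumed.]

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
sources = ["Gemini", "GROK", "Meta", "DeepSeek", "Copilot",
           "ChatGPT o1", "Claude", "Expert"]
# Hypothetical PEMAT-style scores: 8 raters per source (not the study data).
scores = {s: rng.normal(loc=m, scale=8, size=8)
          for s, m in zip(sources, [96, 94, 88, 87, 82, 68, 63, 56])}

# One-way ANOVA across the eight groups -> F(7, 56).
f_stat, p_val = f_oneway(*scores.values())

# Eta squared = SS_between / SS_total (the effect size reported above).
all_vals = np.concatenate(list(scores.values()))
grand_mean = all_vals.mean()
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in scores.values())
ss_total = ((all_vals - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total

print(f"F(7,56) = {f_stat:.2f}, p = {p_val:.3g}, eta^2 = {eta_sq:.2f}")

# Tukey HSD post-hoc: which pairs differ (e.g., each LLM vs. Expert).
groups = np.repeat(sources, 8)
print(pairwise_tukeyhsd(all_vals, groups))
```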
Readability analysis showed that Expert and Claude responses required a college reading level (FKGL 14–17; FRES < 30). In contrast, ChatGPT o1 and Copilot produced the most readable outputs, at a 6th–8th grade level (FKGL 3–7; FRES 60–82). Gemini, GROK, Meta, and DeepSeek generated responses at a high-school level (FKGL 7–9; FRES 40–60).
Conclusion: Several LLMs produced responses that were significantly more understandable, actionable, and readable than expert-written content while maintaining comparable accuracy. Further prospective validation is warranted to assess whether publicly available LLMs can accurately respond to patient questions in a real clinical setting.


