NASBS

North American Skull Base Society


2026 Proffered Presentations


S171: LARGE LANGUAGE MODELS VS EXPERT RESPONSES: A COMPARATIVE ANALYSIS OF PATIENT EDUCATION QUALITY AFTER SKULL BASE SURGERY
Mahdokht Manavi, MD1; Noel Ayoub, MD1; Arifeen Rahman, MD1; Jacquelyn Callander, MD1; Lirit Levi, MD1; David Liu, MD1; Axel Renteria, MD1; Maxime Fieux, MD2; Matei Banu, MD1; Ali Palejwala, MD1; Juan Fernandez-Miranda, MD1; Peter Hwang, MD1; Jayakar Nayak, MD1; Zara Patel, MD1; Michael Chang, MD1; 1Stanford University; 2Université de Lyon

Introduction: Patients are increasingly turning to large language models (LLMs) for information about their healthcare. Patients undergoing skull base surgery often experience postoperative symptoms that require timely, reliable guidance, and they may turn to LLMs for answers. It remains unclear whether LLM-generated responses are of sufficient quality for patient education. The purpose of this study was to evaluate the accuracy, understandability, actionability, and readability of LLM-generated answers versus expert-provided answers.

Methods: Eight common postoperative questions related to skull base surgery were selected. Seven publicly available LLMs were queried for responses, and one expert answer was written collaboratively by two skull base surgeons. A panel of eight otolaryngologists and neurosurgeons evaluated the responses in a randomized and blinded fashion. Accuracy was rated on a 1–5 scale. Understandability and actionability were evaluated using the Patient Education Materials Assessment Tool (PEMAT). Readability was measured using the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease Score (FRES). One-way ANOVA with post-hoc analysis compared results across groups.
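
As a concrete reference for the readability metrics above, the two Flesch formulas can be sketched in a few lines. This is an illustrative sketch using the standard published coefficients; the word, sentence, and syllable counts in the example are hypothetical and are not taken from the study data.

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: higher = harder (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fres(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher = easier (in practice a 0-100 scale)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical answer: 100 words in 5 sentences with 150 syllables.
print(round(fkgl(100, 5, 150), 2))  # 9.91 (roughly a 10th-grade level)
print(round(fres(100, 5, 150), 1))  # 59.6
```

Note that the two scales move in opposite directions: longer sentences and more syllables per word raise FKGL but lower FRES, which is why the study reports both.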

Results: For accuracy, LLMs performed at a similar level to experts. The highest accuracy scores were observed for DeepSeek (mean 4.17) and Gemini (4.16), followed by GROK (4.06) and the Expert answer (3.90). Lower averages were seen for Meta (3.72), Claude (3.63), Copilot (3.61), and ChatGPT o1 (3.10).

For understandability, all LLMs scored higher than the Expert answer on PEMAT: Gemini 96 ± 3%, GROK 94 ± 6%, Meta 88 ± 7%, DeepSeek 87 ± 17%, Copilot 82 ± 6%, ChatGPT o1 68 ± 10%, Claude 63 ± 7%, Expert 56 ± 7%. Understandability scores for Gemini, GROK, Meta, and DeepSeek were statistically significantly higher than the expert answer (F(7,56) = 23.42, p < .001, η² = 0.75).

For actionability, all LLMs scored higher than the Expert answer: Gemini 99 ± 1%, GROK 99 ± 2%, Meta 93 ± 10%, DeepSeek 92 ± 15%, ChatGPT o1 65 ± 17%, Claude 51 ± 21%, versus Expert 28 ± 18%. Actionability of LLMs was significantly higher than that of the Expert answer (F(7,56) = 21.06, p < .001, η² = 0.72).
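The reported effect sizes follow directly from the F statistics, since for a one-way ANOVA η² = (F × df_between) / (F × df_between + df_within). A quick sanity check against the values reported above:

```python
def eta_squared(f: float, df_between: int, df_within: int) -> float:
    """Recover eta squared (effect size) from a one-way ANOVA F statistic."""
    return (f * df_between) / (f * df_between + df_within)


print(round(eta_squared(23.42, 7, 56), 2))  # 0.75 (understandability)
print(round(eta_squared(21.06, 7, 56), 2))  # 0.72 (actionability)
```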

Readability analysis showed that Expert and Claude responses required a college reading level (FKGL 14–17; FRES < 30). In contrast, ChatGPT o1 and Copilot produced the most readable outputs, at a 6th–8th grade level (FKGL 3–7; FRES 60–82). Gemini, GROK, Meta, and DeepSeek generated responses at a high-school level (FKGL 7–9; FRES 40–60).

Conclusion: Several LLMs produced responses that were significantly more understandable, actionable, and readable than expert-written content, while achieving comparable accuracy. Further prospective validation is warranted to assess whether publicly available LLMs can accurately respond to patient questions in a real clinical setting.


Copyright © 2026 North American Skull Base Society · Managed by BSC Management, Inc · All Rights Reserved