NASBS

North American Skull Base Society


2026 Poster Presentations


P500: ARTIFICIAL INTELLIGENCE ALIGNMENT WITH EXPERT CONSENSUS IN VESTIBULAR SCHWANNOMA MANAGEMENT: A MULTI-PLATFORM EVALUATION
Shreya Vinjamuri1; KiChang Kang, MD2; Jay Trivedi1; Fox Ryker3; Anish Sathe, MD1; Roger Murayi, MD1; James Evans, MD1; 1Thomas Jefferson University Hospital; 2Montefiore Hospital; 3PCOM

Background: Treatment paradigms for vestibular schwannomas (VS) have evolved with advances in imaging and surgical technology, yet clinical guidelines remain non-standardized, leading to variable outcomes. Artificial intelligence (AI) shows promise for guiding clinical decision-making, but it is necessary to evaluate how AI alignment with expert consensus changes over time as these technologies rapidly evolve.

Methods: We evaluated four AI platforms (Google Bard/Gemini, GPT-4/GPT4o, SciteAI, and DeepSeek) across 2023 and 2025 datasets to assess agreement with expert consensus on VS management from Carlson et al. (2020). We tested 103 expert consensus statements across six categories: Hearing Preservation (Radiosurgery/Microsurgery), Tumor Control and Imaging Surveillance, Preferred Treatment, Operative Considerations, and Complications. Each statement was presented in two formats: direct consensus statements (Prompt 1) and rephrased yes/no questions (Prompt 2) to evaluate framing effects. Two independent evaluators categorized responses as "Agree," "Disagree," or "Neutral/Insufficient Information," with neutral responses grouped with disagreements for statistical analysis. We calculated accuracy as the proportion of "Agree" responses and used Cohen's Kappa statistics to assess within-AI agreement (between prompts) and between-AI agreement.
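The scoring described above (accuracy as the proportion of "Agree" responses after grouping neutral responses with disagreements, and Cohen's Kappa for within- and between-AI agreement) can be sketched as follows. This is a minimal illustration of the statistics, not the study's code, and the toy labels are not the study's data:

```python
from collections import Counter

def accuracy(responses):
    """Proportion of "Agree" responses ("Neutral" is grouped with "Disagree")."""
    return sum(r == "Agree" for r in responses) / len(responses)

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences
    (e.g. one platform's verdicts under Prompt 1 vs. Prompt 2)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)  # chance agreement
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

# Illustrative example only: binarized verdicts for one hypothetical platform.
prompt1 = ["Agree", "Agree", "Disagree", "Disagree"]
prompt2 = ["Agree", "Disagree", "Disagree", "Agree"]
print(accuracy(prompt1))               # 0.5
print(cohens_kappa(prompt1, prompt2))  # 0.0 (agreement no better than chance)
```

A Kappa of 0 here reflects that the 50% raw agreement between the two prompts is exactly what chance would predict given each sequence's label frequencies, which is why Kappa rather than raw agreement is used to assess prompt-framing consistency.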

Results: Overall AI agreement with expert consensus improved significantly from 2023 to 2025. In 2023, agreement ranged from 45.6% to 92.2%, with GPT-4 showing the highest accuracy (88.8%) and SciteAI the lowest (57.3%). In 2025, agreement ranged from 90.3% to 100%, with GPT4o and DeepSeek achieving perfect accuracy (100%), while Google Gemini reached 93.2% and SciteAI 91.7%. Within-AI consistency (Kappa values) increased from 0.039-0.371 in 2023 to 0.297-1.000 in 2025. Between-AI agreement remained low in 2023 (Kappa: -0.118 to 0.142) but improved moderately in 2025 (Kappa: 0.000 to 0.540). Category-specific analysis revealed consistent improvements across all treatment domains, with GPT4o showing near-perfect agreement in all categories by 2025.

Conclusions: AI platforms demonstrated variable but generally improving agreement with expert consensus on VS management, with significant gains between the 2023 and 2025 iterations. While newer models showed higher accuracy, inter-platform agreement remained limited, suggesting continued variability in AI interpretation of the medical literature. These findings highlight the need for comprehensive evaluation of AI tools before clinical implementation and underscore the importance of prompt standardization in AI research. This study provides a baseline for tracking AI progress in neurosurgical decision-making as these technologies continue to evolve.

* A re-analysis is currently being conducted to add information regarding 2025 models (Claude AI, Open Evidence, etc.) and local models.


Copyright © 2026 North American Skull Base Society · Managed by BSC Management, Inc · All Rights Reserved