2026 Poster Presentations
P446: DEVELOPMENT OF A COMPUTER VISION SYSTEM FOR SURGICAL INSTRUMENT ANALYSIS DURING ENDOSCOPIC SINUS AND SKULL BASE SURGERY
Corinne R Stonebraker, BA1; Jaeho Cho2; Katherine Liu, MD1; Lacy Brame, DO1; Raj Shrivastava, MD1; Alfred-Marc Iloreta, MD1; 1Icahn School of Medicine at Mount Sinai; 2The Cooper Union
Background: Real-time surgical instrument tracking and identification represent a critical advancement in surgical workflow optimization and training enhancement. Traditional surgical video analysis offers a limited perspective and often relies on low-fidelity, subjective metrics. The emergence of computer vision tools for object detection and image segmentation provides new opportunities for comprehensive surgical documentation and analysis.
Objectives: To develop and validate a deep learning object detection system capable of automatically recognizing and labeling surgical instruments in open operative video of endoscopic sinus and skull base procedures.
Methods: Four operations were recorded using an Insta360 GO Ultra-HD camera (4K resolution, 30 fps) positioned across the operating table. A total of 6 hours, 21 minutes, and 40 seconds of footage was collected. Frames were sampled every 12 seconds, yielding 2159 images, which were split into training (1515), validation (322), and testing (322) subsets (approximately 70/15/15). Ground-truth annotations were created in CVAT (Computer Vision Annotation Tool). A YOLO11n model (Ultralytics 8.3.203, PyTorch 2.8.0 with CUDA) was trained to detect and classify surgical instruments.
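The sampling and split described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the frame filenames are hypothetical, and the exact counts reported in the abstract (1515/322/322) suggest the split may have been done per video rather than over a single shuffled pool.

```python
import random

FPS = 30              # camera frame rate reported in Methods
SAMPLE_EVERY_S = 12   # sampling interval reported in Methods


def sample_indices(total_frames: int) -> list[int]:
    """Indices of frames kept when sampling every 12 s of 30 fps video."""
    step = FPS * SAMPLE_EVERY_S  # 360 frames between sampled stills
    return list(range(0, total_frames, step))


def split_dataset(images: list[str], seed: int = 0):
    """Shuffle and split into roughly 70/15/15 train/val/test subsets."""
    rng = random.Random(seed)
    imgs = images[:]
    rng.shuffle(imgs)
    n = len(imgs)
    n_train = round(0.70 * n)
    n_val = round(0.15 * n)
    return imgs[:n_train], imgs[n_train:n_train + n_val], imgs[n_train + n_val:]


# Hypothetical filenames for illustration only:
frames = [f"frame_{i:05d}.jpg" for i in range(2159)]
train, val, test = split_dataset(frames)
print(len(train), len(val), len(test))
```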
Results: On the independent test set of 322 images containing 218 instrument instances, the model achieved an overall precision of 96.4%, recall of 94.8%, and mAP50 of 96.6%. Instrument-level performance varied but remained consistently high: the Bovie reached 100% precision and recall (mAP50 99.5%); the microdebrider reached 100% precision and 99.9% recall (mAP50 99.5%); the Frazier reached 95.4% precision and 93.0% recall (mAP50 96.5%); forceps reached 93.9% precision and 94.3% recall (mAP50 95.5%); and the Freer elevator was the most challenging, at 92.9% precision, 86.7% recall, and 91.9% mAP50. Mean inference time was 4.1 ms per image, supporting feasibility for real-time applications. Error analysis of the confusion matrix (Figure 1) revealed that background regions were frequently misclassified as Frazier, in 6 of the 9 (67%) ground-truth background regions. This pattern likely reflects annotation practices, which excluded suction tubing from bounding boxes. Representative model outputs, demonstrating successful instrument detection and labeling with the corresponding confidence scores, are shown in Figure 2.
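The per-class precision and recall figures above follow the standard detection-metric definitions, which can be computed directly from confusion-matrix counts like those in Figure 1. A minimal sketch, using hypothetical counts rather than the study's data:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Per-class metrics: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not the study's data):
p, r = precision_recall(tp=90, fp=3, fn=7)
print(f"precision={p:.1%} recall={r:.1%}")  # precision=96.8% recall=92.8%
```

A false positive here includes cases such as a background region detected as a Frazier, which is exactly the error mode the confusion-matrix analysis surfaced.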
Conclusion: Computer vision analysis of video footage of the operating surgeon represents a viable approach for automated surgical instrument tracking in endoscopic sinus and skull base procedures. This validated system can now be employed to retrospectively study objective surgical performance measures of expert and resident surgeons, including instrument stroke concentration, economy of motion, speed, coordination, and procedural efficiency. The technology enables characterization of objective benchmarks for safe and efficient endoscopic sinus and skull base surgery techniques. The compact form and high-resolution capabilities of modern action cameras provide a practical platform for widespread implementation without disrupting existing surgical workflows.
Figure 1: Confusion Matrix

Figure 2: Sample Predictions on Still-Frames

