Quantitative Skill Assessment for Robotic Minimally Invasive Surgery (rMIS)

Surgical proficiency requires the merger of sensory and cognitive capabilities, and its systematic assessment has long been a topic of considerable importance. Robotic minimally invasive surgery (rMIS) is the fastest-growing segment of computer-aided surgical systems today and has often been heralded as the next revolution in the healthcare industry. However, surgical performance-evaluation paradigms have consistently failed to keep pace with advances in surgical technology. The biggest challenges in the assessment and accreditation of surgeons are (i) creating appropriately rich and diverse clinical settings (real or virtual); and (ii) developing uniform, repeatable, stable, verifiable performance metrics; both at manageable cost for ever-increasing cohorts of trainees.

Current surgical training programs are widely regarded as inconsistent, unproven, and non-uniform. They typically involve operating on surrogates ranging from cadavers and animal models to plastic mannequins and, most recently, simulated/virtual environments, before progressing to real-life surgeries. Such trainers follow an apprenticeship model that entails subjective, or at best semi-objective, evaluation of surgical performance by an expert surgeon. Over the past decade, the ACGME (Accreditation Council for Graduate Medical Education) has espoused the development of a cost-efficient, proficiency-based curriculum, with an emphasis on simulation methodologies and quantitative skill-assessment tools, to bypass the limitations of the current apprenticeship-based system. In addition, the growth of computer integration and data acquisition in minimally invasive surgery (MIS), especially in the form of rMIS, offers a unique set of opportunities to comprehensively address this situation.

 Research Issue - Skill Assessment Using Video-Based Motion Studies

In this work, we examine the extension of traditional manipulative skill assessment, with deep roots in performance evaluation in the manufacturing industries, to robotic surgical skill evaluation. Traditional time and motion studies are based on the hypothesis that any manipulation or assembly task can be subdivided into smaller individual units called “Therbligs”. These “Therbligs” allow a complicated manual task to be decomposed into sub-parts that can then be examined individually. This decomposition potentially allows a finite-state-automaton representation of a complex activity, which could form the discrete basis for linguistic representation as well as fault detection and correction. Intra- and inter-user variance on various performance metrics can be analyzed by studying surgeons’ performance over each sub-task. Additional metrics on tool-motion measurements, motion economy, and handed symmetry can similarly be expanded over this temporal segmentation to help characterize performance. Our studies analyzed video recordings of surgical task performance in two settings. First, we analyzed video data for two representative manipulation exercises (peg board and pick-and-place) on a da Vinci surgical (SKILLS) simulator, which affords a relatively controlled and standardized testbed for surgeons with varied experience levels. Second, task sequences from real surgical videos were analyzed against a list of predefined “Therbligs” to investigate the method's usefulness for practical implementation.
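The Therblig decomposition lends itself naturally to a finite-state-automaton representation. The following Python sketch illustrates the idea with an invented Therblig vocabulary and transition set (not the exact ones used in the study): a labeled segmentation is validated against the automaton, and per-sub-task durations are accumulated for metric computation.

```python
# Illustrative Therblig transition automaton: each key maps a Therblig
# to the set of Therbligs allowed to follow it.
ALLOWED = {
    "reach":    {"grasp"},
    "grasp":    {"move"},
    "move":     {"position"},
    "position": {"release"},
    "release":  {"reach", "end"},
}

def validate(sequence):
    """Check that a labeled Therblig sequence obeys the transition automaton."""
    for cur, nxt in zip(sequence, sequence[1:]):
        if nxt not in ALLOWED.get(cur, set()):
            return False
    return True

def durations(segments):
    """Accumulate time spent per Therblig from (label, start_s, end_s) segments."""
    totals = {}
    for label, start, end in segments:
        totals[label] = totals.get(label, 0.0) + (end - start)
    return totals

# Hypothetical temporal segmentation of one pick-and-place repetition.
segments = [("reach", 0.0, 0.8), ("grasp", 0.8, 1.1), ("move", 1.1, 2.4),
            ("position", 2.4, 3.0), ("release", 3.0, 3.2)]
assert validate([s[0] for s in segments])
print(durations(segments))
```

Per-Therblig duration totals such as these are the raw material for the intra- and inter-user variance comparisons described above.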


We developed an automated surgical expertise evaluation method based on these well-established motion-study methodologies, especially for MIS procedures. This method relies on segmenting a primary task into sub-tasks, which can subsequently be analyzed through statistical analyses of micro-motions. We conducted motion studies using: (A) a manual annotation process by experts (to serve as a benchmark); and (B) automated kinematic-analysis-of-video techniques; evaluating motion economy, repeatability, and dexterity. The da Vinci SKILLS simulator served as a uniform and standardized testbed. Surgeons with varied levels of expertise were recruited to perform two representative simplified tasks (Peg Board and Pick & Place). The automated kinematic analysis of video was compared against the ground-truth data (obtained by manual labeling) using the misclassification rate and the classification confusion matrix. Future studies aimed at analyzing real surgical procedures and extending the existing framework using probabilistic approaches are already underway.
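The comparison metrics named above can be computed directly from per-frame label sequences. A minimal sketch with hypothetical labels (the actual Therblig classes and data are the study's own):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: ground-truth label (manual annotation); columns: automated label."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def misclassification_rate(cm):
    """Fraction of frames off the confusion-matrix diagonal."""
    return 1.0 - np.trace(cm) / cm.sum()

# Hypothetical per-frame labels: 0 = reach, 1 = grasp, 2 = move.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm)
print(misclassification_rate(cm))  # 2 of 8 frames disagree -> 0.25
```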

 Research Issue - Video-Based Tool Detection and Tracking Framework

Adoption of robotic surgery continues to raise serious questions about shortcomings in patient safety and surgical training. Major causes of concern, including reduced field of view (FoV), loss of depth perception, lack of force feedback, and, more importantly, inadequate surgical training and assessment, remain unsolved. With an increasing number of legal claims due to robotic surgical failures and a relatively outdated surgical training curriculum, the current situation in robotic surgery can only be expected to worsen. Therefore, devising robust, feasible, and advanced technologies for enhancing surgical safety and aiding decision support for surgeons, without requiring major modifications to existing systems, has received greater attention. Recently there has been significant interest in video-understanding techniques applied to recorded or online surgical video streams (for use in anatomic reconstruction, surface registration, hand and/or tool motion tracking, etc.). Such techniques have potential use in providing in-vivo surgical guidance information and semantic feedback to surgeons, improving their visual awareness and target reachability and thereby enhancing overall patient safety in robotic surgery. The complexities posed by typical surgical scenarios (such as tissue deformations, image specularities and clutter, tool open/closed states, occlusion of tools by blood and/or organs, and tools moving out of view) are the usual constraints hindering implementation of a robust video-based tool tracking method.

Tracking surgical tools has been used for a wide range of applications including safety, decision support, and skill assessment. Most tool-tracking approaches are based either on color markers or on a geometric model of the instrument. The former techniques employ fiducial markers on the tool: using color markers and thresholding in HSV space to detect tools, attaching light-emitting diodes to the tips of instruments and then detecting these markers in endoscopic images, or color-coding the tips of surgical instruments and applying simple color segmentation. Such marker-based methods raise issues of manufacturing, bio-compatibility, and additional instrumentation. Geometry-based approaches, by contrast, use knowledge of the tool model to recover its pose from images. Other approaches (that do not necessarily modify the tool itself) include using color space to classify pixels into instruments and organs, performing shape analysis on these classification labels, and then predicting the location of the tool in the next frame using an autoregressive model. McKenna et al. use a similar method for classification of surgical tools but use a particle filter to track instruments in a video. These approaches are limited to detecting a tool when a significant area of the tool is present in the image frame and there is good distinction in color space between the instruments and the background. Other recent work focuses on locating specific landmarks on the surgical tools by learning a Random Forest classifier for these landmarks and using an Extended Kalman Filter (EKF) to smooth the tool poses. However, this method requires knowledge of a 3D CAD model and extensive image labeling for a single tool.
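As a concrete illustration of the HSV-thresholding family of marker-based techniques mentioned above, the following sketch mimics OpenCV's `cv2.inRange` on a synthetic HSV frame. The marker hue range and the image are invented for illustration; a real pipeline would first convert the endoscopic BGR frame to HSV.

```python
import numpy as np

def hsv_inrange(hsv, lo, hi):
    """Binary mask of pixels whose (H, S, V) lie within [lo, hi] (inclusive),
    mirroring the behavior of cv2.inRange on an HSV image."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    return np.all((hsv >= lo) & (hsv <= hi), axis=-1)

# Synthetic 4x4 HSV "frame": a saturated green marker patch (H ~ 60)
# on an unsaturated background.
hsv = np.zeros((4, 4, 3), dtype=np.uint8)
hsv[..., 2] = 200              # uniform brightness everywhere
hsv[1:3, 1:3, 0] = 60          # green hue in the marker region
hsv[1:3, 1:3, 1] = 255         # marker region is fully saturated
mask = hsv_inrange(hsv, (50, 100, 0), (70, 255, 255))
print(mask.sum())  # 4 marker pixels detected
```

The brittleness of such thresholds under blood staining and specular highlights is precisely what motivates the markerless, learning-based detector described next.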

To address the challenge of detecting the presence of surgical tools in images, we learn a separate detector for each type of surgical tool end-effector using a state-of-the-art object detector. This detector captures the shape of the object using Deformable Part Models (DPM): a star-structured pictorial-structure model that links the root of an object to its parts via deformable springs. The model thus captures the articulation that is invariably present in surgical tools and allows a detector to be learned for different end-effector configurations. We annotated surgical tools in real surgical videos obtained from the da Vinci Surgical System (dVSS) and learned a Latent Support Vector Machine (LSVM) classifier by extracting Histograms of Oriented Gradients (HOG) from the annotated bounding boxes. This class of learned classifiers is highly generalizable and extensible, enabling one to find tools in videos without making restrictive assumptions about the type, shape, or color of the tool, the viewpoint, etc. Figure 1 shows the HOG template model learned from the ground-truth annotations. With this algorithm we obtained high-confidence tool detections for only a subset of frames, which necessitated deriving a tool-tracking algorithm.
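For intuition, a HOG descriptor is built from per-cell histograms of gradient orientations. The following simplified sketch (a single cell, no block normalization; not the exact pipeline used for the LSVM training) shows how a vertical edge concentrates its energy in a single orientation bin:

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Unsigned gradient-orientation histogram for one cell -- the building
    block of a HOG descriptor (simplified: magnitude-weighted hard binning,
    no block normalization)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0          # unsigned orientation
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m
    return hist / (np.linalg.norm(hist) + 1e-6)           # L2 normalize

# A vertical intensity edge yields horizontal gradients, i.e. all the
# histogram energy lands in the 0-degree bin.
patch = np.zeros((8, 8))
patch[:, 4:] = 255.0
print(np.argmax(hog_cell(patch)))  # 0
```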

Tool Detection and Tracking Framework

HOG Based Learnt Tool Detector Model

In general, we concluded that for surgical tool tracking to succeed and be widely employed in a typical surgical setting, several key challenges need to be addressed, as summarized below:

  • Tool Detection: Tool-tracking approaches need to robustly determine the presence of different surgical tools in images as surgeons move their tools in and out of the FoV of the endoscopic camera. It is important for a tracking framework to incorporate this knowledge to reduce the number of false alarms. This is a critical problem especially in markerless tracking, as color segmentation will produce outliers in end-effector detection in the presence of tool-tissue interaction, blood stains, and many other factors. We address this problem by learning state-of-the-art object detectors for different tool types.

  • End-Effector Pose: Some model-based approaches ignore the pose of the end-effector while tracking the tool. Since the end-effector of many surgical tools is articulated and is always the point of contact with tissue, it is vital to track the end-effector and its articulation within the tracking framework. Our approach directly models the end-effector, which is in general the most distinguishable part of a surgical tool, and employs a detection method that captures its articulation.

  • Generalized Approach: The tracking algorithm also needs to generalize to the different types of tools used in various surgical procedures. Model-based approaches that model the end-effector are specific to a particular surgical tool. In contrast, our approach is easily generalizable, as it only needs annotated bounding boxes to learn a detector for a specific tool, and the tracking itself is invariant to tool type.

  • Tool Tracking Framework: Different tool-tracking methods have been proposed, each with clear trade-offs. The important issue that remains to be addressed is an effective combination of trackers that optimally combines the strengths of the various methods.

This work proposes a novel tool detection and tracking approach using uncalibrated monocular surgical videos for computer-aided surgical interventions. We hypothesize that the surgical tool end-effector is the most distinguishable part of a tool and employ state-of-the-art object detection methods to learn its shape and localize the tool in images. We model the tracking task independently of the features/types of the individual trackers and focus on optimally fusing the information from all available trackers. Hence, we propose a Product of Tracking Experts (PoTE) based generalized object-tracking framework that probabilistically merges the tracking outputs from a time-varying number of trackers to produce robust, identity-maintained tracking under varied conditions in unconstrained scenarios. The current implementation of our PoTE framework uses three tracking experts: point-feature-based, region-based, and object-detection-based. A novel point-feature-based tracker is also proposed, in the form of a voting-based bounding-box geometry estimation technique built upon point-feature correspondences. Our tracker is causal, which makes it suitable for real-time applications.

The intuition and mathematical formulation behind the PoTE framework are illustrated in the following graphic. From the derivation, it is clear that if the uncertainty of each expert can be described by a normal distribution with a specified mean and covariance matrix, the resulting estimate of the bounding-box center can be obtained analytically, since a product of normal distributions is again (proportional to) a normal distribution.
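Under this normality assumption the fused estimate has a closed form: the precision (inverse covariance) of the product is the sum of the experts' precisions, and the fused mean is the precision-weighted combination of the expert means. A minimal numerical sketch (the expert values below are invented for illustration):

```python
import numpy as np

def fuse_experts(means, covs):
    """Product-of-Gaussians fusion: each tracking expert i reports a
    bounding-box center mean_i with covariance cov_i; the product of the
    Gaussians is again Gaussian, with precision = sum of expert precisions
    and mean = precision-weighted average of expert means."""
    precisions = [np.linalg.inv(c) for c in covs]
    fused_cov = np.linalg.inv(sum(precisions))
    fused_mean = fused_cov @ sum(p @ m for p, m in zip(precisions, means))
    return fused_mean, fused_cov

# Two hypothetical experts; covariances are set from bounding-box
# width/height, as in the text. The tighter expert dominates the fusion.
m1, c1 = np.array([100.0, 80.0]), np.diag([40.0**2, 30.0**2])
m2, c2 = np.array([110.0, 85.0]), np.diag([10.0**2, 10.0**2])
mean, cov = fuse_experts([m1, m2], [c1, c2])
print(mean)  # pulled strongly toward the low-variance expert
```

Note that the fused covariance is smaller than either expert's covariance: combining experts can only sharpen the estimate.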

In this work, the output quantity is the centroidal location of a bounding box, per tool, per frame of the video sequence, and the corresponding uncertainties are approximated from the width and height of the respective bounding boxes. Such a framework using two tool trackers (an optical-flow-based and a feature-point-based tracker) was implemented and tested on real surgical videos. The results showed significant improvement over the baselines obtained by using either tracker alone.


 Research Issue - Video-Based Attributes Labeling and Semantic Identification

A high level flow chart of our video-based framework for semantic identification (extended from our tracking implementation) is shown in the following figure.

Video-Based Attributes Labeling and Semantic Identification Framework

By training and optimizing the classifiers on our training and test data (a subset of the ground-truth annotations), we obtained promising results for attribute identification and semantic feedback compared with the state-of-the-art method. Although this framework can be generalized to an arbitrary set of attributes, we focus our efforts on two specific attributes: the tool's open/closed state and its blood-stained/non-stained condition. The results obtained are summarized as follows:




Ground truth dataset:

We propose a new dataset consisting of 8 short sequences (1500 frames) for the “Clamp” class and 8 sequences (1650 frames) for the “Tool” class, acquired while performing hysterectomy surgery using the dVSS, to conduct our evaluation. This dataset served both to learn and optimize the detector classifiers and to validate our tracking-framework results. A sample of the ground-truth dataset is shown in the figure below. To the best of our knowledge, there are no publicly available datasets for testing a generic tool-tracking algorithm; hence, to motivate further research and advancement in this direction, we decided to open-source our dataset. The proposed dataset contains real-world video sequences with various artifacts including tool articulations, occlusions, rapid appearance changes, fast camera motion, motion blur, smoke, and specular reflections. The dataset was manually annotated with bounding boxes for the tools in every frame. The proposed dataset can be accessed here.
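A common way to score tracker output against such per-frame bounding-box annotations is the intersection-over-union overlap, sketched below as a generic illustration (not necessarily the exact criterion used in our evaluation):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) bounding boxes -- a
    standard criterion for comparing tracker output against per-frame
    ground-truth annotations."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # overlap 50 / union 150 = 1/3
```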


Students Involved:

- Suren Kumar, PhD Candidate, University at Buffalo

- Seung-kook Jun, PhD Candidate, University at Buffalo

- Madusudhanan Sathia Narayanan, PhD Candidate, University at Buffalo

- Priyanshu Agarwal, MS, University at Buffalo [Graduated]


 Movies:

Time Study based Skill Assessment

- Motion study for surgical skill assessment comprises several steps, including motion segmentation, discrete Therblig definition, motion analysis, and automated classification/recognition schemes.

- File Size: 22.3MB [Download]

Surgical Tool Visual Tracking Framework

- This video illustrates our recent work on video-based surgical tool detection and tracking for a real robotic hysterectomy surgical sequence.

- File Size: 22.3MB [Download]



 Related Publications - Journals:


Jun, S.-K., Narayanan, M.S., Garimella, S., Singhal, P., and Krovi, V., "Evaluation of Robotic Minimally Invasive Surgical Skills using Motion Studies", Springer Journal of Robotic Surgery, pp. 1-9, July 2013. [BIB | RIS]



 Related Publications - Conference Proceedings:


Kumar, S., Narayanan, M.S., Singhal, P., Corso, J., and Krovi, V., “Product of Tracking Experts for Surgical Tool Visual Tracking,” 2013 IEEE Conference on Automation Science and Engineering, August 17-21, 2013, Madison, WI, USA.



Kumar, S., Narayanan, M.S., Garimella, S. MD, Singhal, P. MD,  Corso, J., and Krovi, V.,  “Novel Computer Aided Surgical Workflow Using Video–Based Detection And Assessment Methods,” ASME/FDA 2013 1st Annual Frontiers in Medical Devices, September 11-13, 2013, Washington, DC, USA.



Kumar, S., Narayanan, M.S., Misra, S., Garimella, S., Singhal, P. MD,  Corso, J., and Krovi, V., "Video-based Framework for Safer and Smarter Computer Aided Surgery,” 2013 Hamlyn Symposium on Medical Robotics, London UK, 22-25 Jun, 2013.



Kumar, S., Narayanan, M.S., Misra, S., Garimella, S., Singhal, P., Corso, J.,  and Krovi, V., "Vision based Decision-Support and Safety Systems for Robotic Surgery", 2013 Medical Cyber Physical Systems Workshop, Philadelphia, PA, April 8, 2013.



Jun, S.-K., Narayanan, M.S., Eddib, A., MD, Garimella, S., MD, Singhal, P, MD, and Krovi, V., “Robotic Minimally Invasive Surgical Skill Assessment based on Automated Video-Analysis Motion Studies”, 2012 IEEE International Conference on Biomedical Robotics and Biomechatronics, Roma, Italy, Jun 24-28, 2012. [BIB | RIS]



Jun, S.-K., Narayanan, M.S., Eddib, A., MD, Garimella, S., MD, Singhal, P, MD, and Krovi, V., “Minimally Invasive Surgical Skill Assessment by Video-Motion Analysis”, 2012 5th Hamlyn Symposium on Medical Robotics, London, UK, Jun 30-Jul 2, 2012. [BIB | RIS]



Jun, S.-K., Narayanan, M.S., Eddib, A., MD, Garimella, S., MD, Singhal, P, MD, and Krovi, V., “Evaluation of Robotic Minimally Invasive Surgical Skills using Motion Studies”, 2012 Performance Metrics for Intelligent Systems (PerMIS'12) Workshop, March 20-22, 2012, College Park, MD. [BIB | RIS]



Last Updated: September 16, 2013