Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

1Shanghai Jiao Tong University
2Massachusetts Institute of Technology
3Liquid AI
4University Hospital of Cologne
5Duke University

Hierarchical Input Dependent State Space Model overview. (a) A temporally consistent visual feature extractor that incorporates temporal information into feature extraction by learning the structure of the downsampled surgical videos; (b) the hierarchical input-dependent state space model that takes the extracted features as input and captures both local and global temporal relations; (c) the supervision signals used to train the model, comprising both a discrete and a continuous signal.

Abstract

Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both discrete phase labels and continuous phase progress signals are propagated through the network. Experiments show that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
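To make the architecture described above concrete, the following is a minimal NumPy sketch of an input-dependent (selective) SSM scan combined with a local-then-global hierarchy. It is an illustration only: the function names, dimensions, and parameter choices (`ssm_scan`, `window=16`, state size `N=8`, a scalar input projection) are our assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x, A, w_dt, B, C):
    """One input-dependent SSM pass: Δt is predicted from each frame's
    features, so the discretized recurrence h_t = exp(Δt·A)⊙h_{t-1} + Δt·B·u_t
    speeds up or slows down with the content. Returns per-frame state readouts."""
    T, D = x.shape
    N = A.shape[0]
    h = np.zeros(N)
    out = np.zeros((T, N))
    for t in range(T):
        dt = np.log1p(np.exp(x[t] @ w_dt))   # softplus keeps Δt > 0
        u = x[t].mean()                      # toy scalar input projection
        h = np.exp(dt * A) * h + dt * B * u  # diagonal discretized update
        out[t] = C * h
    return out

def hierarchical_ssm(x, window=16):
    """Local-aggregation pass over short windows, then a global-relation
    pass over the full sequence. Shapes and parameters are illustrative."""
    T, D = x.shape
    N = 8
    A = -np.abs(rng.standard_normal(N))      # negative diagonal -> stable decay
    w_dt = rng.standard_normal(D)
    B, C = rng.standard_normal(N), rng.standard_normal(N)
    # Local block: independent scans over non-overlapping windows.
    local = np.concatenate([ssm_scan(x[s:s + window], A, w_dt, B, C)
                            for s in range(0, T, window)])
    # Global block: one scan over the locally enriched features.
    return ssm_scan(local, A, rng.standard_normal(N), B, C)

feats = rng.standard_normal((64, 32))        # 64 frames, 32-dim features
y = hierarchical_ssm(feats)
print(y.shape)                               # (64, 8)
```

The key property the paper exploits is visible in the loop: each frame contributes one O(1) state update, so cost grows linearly with video length, unlike quadratic attention.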


Visualization of Δt dynamics for Video62 of the Cholec80 dataset (left) and Video41 of the MICCAI2016 dataset (right). The horizontal axis represents the timestep. The color ribbon at the top shows the ground truth (GT) and the corresponding predictions (Pred). Each line chart below visualizes the Δt dynamics of a different layer, with the mean Δt shown at the bottom. Note that Δt exhibits a sharp change at phase transitions.
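The sharp Δt change at phase boundaries falls out of the input-dependent discretization itself: when frame features shift, the predicted step size shifts with them. The toy sketch below illustrates this with synthetic data (not Cholec80); the softplus Δt head and the non-negative weight vector `w_dt` are our assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
w_dt = np.abs(rng.standard_normal(32))  # non-negative toy Δt projection

# Synthetic two-"phase" video: the feature distribution shifts at frame 50,
# mimicking a surgical phase transition.
phase_a = rng.standard_normal((50, 32))
phase_b = rng.standard_normal((50, 32)) + 2.0
x = np.vstack([phase_a, phase_b])

dt = np.log1p(np.exp(x @ w_dt))         # softplus -> per-frame Δt > 0
print(dt[:50].mean(), dt[50:].mean())   # mean Δt jumps after the transition
```

Because Δt is a function of the current frame, the jump is immediate at the boundary, which is what the per-layer Δt traces in the figure visualize.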


Schematic illustration of HID-SSM's long-term dependency modeling for Cholec80 Video46: (a) per-row normalized matrix mixer; (b) zoomed-in view of the matrix mixer; (c) quantitative line plots of contribution weights from past frames at specified time steps t.

Demo Video

BibTeX

@misc{wu2025holisticsurgicalphaserecognition,
      title={Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models}, 
      author={Haoyang Wu and Tsun-Hsuan Wang and Mathias Lechner and Ramin Hasani and Jennifer A. Eckhoff and Paul Pak and Ozanan R. Meireles and Guy Rosman and Yutong Ban and Daniela Rus},
      year={2025},
      eprint={2506.21330},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21330}, 
}