Haoyang Wu (吴浩洋)

Hi there! I’m Haoyang (William) Wu, a junior student majoring in Computer Science at the University of Michigan. I’m also pursuing a dual degree in Mechanical Engineering at Shanghai Jiao Tong University (SJTU), with an expected graduation in 2027.

My research interests lie at the intersection of computer vision and robot learning. I’m currently a member of the ARM Lab directed by Prof. Dmitry Berenson, where I work closely with Dr. Mark Van der Merwe and Dr. Abhinav Kumar. At SJTU, I’m deeply engaged in surgical robotics as a research intern at the SIRIUS Lab; I’m especially grateful for the mentorship of Prof. Yutong Ban, and I had the honor of learning from and collaborating with Dr. Tsun-Hsuan (Johnson) Wang.

I am actively seeking research collaborations and industry internships in AI-related areas. Please feel free to reach out if you’d like to connect or discuss potential projects. You can find my resume here.

Publications

Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

In submission to IEEE TMI, 2025

Surgical workflow analysis is essential in robot-assisted surgery, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to the visual backbone to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained with a hybrid discrete-continuous supervision strategy, in which both discrete phase labels and continuous phase progress are propagated through the network. Experiments show that our method outperforms current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on HeiChole). Code will be made publicly available upon paper acceptance.

Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff, Paul Pak, Ozanan R. Meireles, Guy Rosman, Yutong Ban, Daniela Rus
Download Paper | Download Slides
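
Since the code is not yet released, here is a minimal conceptual sketch of the hybrid discrete-continuous supervision idea from the abstract above: a per-frame cross-entropy term on discrete phase labels combined with a regression term on continuous phase progress. The names (HybridPhaseLoss, progress_weight) and the exact loss composition are illustrative only, not our actual implementation.

```python
import torch
import torch.nn as nn


class HybridPhaseLoss(nn.Module):
    """Illustrative hybrid discrete-continuous loss: cross-entropy on
    per-frame phase labels plus MSE on per-frame phase progress."""

    def __init__(self, progress_weight: float = 1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()
        self.progress_weight = progress_weight

    def forward(self, phase_logits, progress_pred, phase_labels, progress_target):
        # phase_logits:    (T, num_phases) per-frame phase scores
        # progress_pred:   (T,) predicted within-phase progress in [0, 1]
        # phase_labels:    (T,) ground-truth phase indices
        # progress_target: (T,) ground-truth phase progress in [0, 1]
        loss_discrete = self.ce(phase_logits, phase_labels)
        loss_continuous = self.mse(progress_pred, progress_target)
        return loss_discrete + self.progress_weight * loss_continuous


if __name__ == "__main__":
    T, num_phases = 32, 7  # e.g. 7 surgical phases, as in Cholec80
    criterion = HybridPhaseLoss(progress_weight=0.5)
    loss = criterion(
        torch.randn(T, num_phases),
        torch.rand(T),
        torch.randint(0, num_phases, (T,)),
        torch.rand(T),
    )
    print(loss.item())
```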