Purpose: Surgical workflow recognition has emerged as an important, yet highly challenging, component of computer-assisted intervention systems for the modern operating room. Although CNN-based approaches achieve excellent performance, they do not learn global and long-range semantic interactions well because of the inductive bias inherent in convolution.
Methods: In this paper, we propose a temporal-based Swin Transformer network (TSTNet) for surgical video workflow recognition. TSTNet contains two main parts: a Swin Transformer and an LSTM. The Swin Transformer incorporates the attention mechanism to encode long-range dependencies and learn highly expressive representations. The LSTM captures long-range temporal dependencies and is used to extract temporal information. TSTNet organically combines the two components to extract spatiotemporal features that carry richer contextual information. In addition, based on the natural structure of surgical video, we propose an a priori revision algorithm (PRA) that exploits a priori knowledge of the order of surgical phases. This strategy refines the output of TSTNet and further improves recognition performance.
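The abstract does not include implementation details, but the described pipeline (a Swin Transformer extracting per-frame features, an LSTM modelling temporal context, and a classification head predicting the surgical phase of each frame) can be sketched as below. This is a minimal illustration, not the authors' released code; the backbone name, hidden size, and the seven Cholec80 phases are assumptions for the example.

```python
# Minimal sketch of a Swin-Transformer-plus-LSTM phase recognizer (assumed design,
# not the authors' implementation). Requires torch and timm.
import torch
import torch.nn as nn
import timm  # provides pretrained Swin Transformer backbones


class TSTNetSketch(nn.Module):
    def __init__(self, num_phases: int = 7, hidden_size: int = 512):
        super().__init__()
        # Per-frame spatial encoder: Swin Transformer with its classifier head removed,
        # so the forward pass returns pooled frame features.
        self.backbone = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=True, num_classes=0
        )
        feat_dim = self.backbone.num_features
        # Temporal encoder over the sequence of frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        # Frame-wise phase classifier.
        self.head = nn.Linear(hidden_size, num_phases)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, 224, 224) -> phase logits of shape (batch, time, num_phases)
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        temporal, _ = self.lstm(feats)
        return self.head(temporal)


# Example usage on a dummy 16-frame clip.
model = TSTNetSketch()
logits = model(torch.randn(1, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 16, 7])
```

The abstract likewise does not spell out the PRA rules. As one simplified, assumed variant of revising predictions with phase-order priors: since cholecystectomy phases follow a largely fixed order, frame-wise predictions that jump backwards can be replaced by the last plausible phase.

```python
# Hedged, simplified illustration of order-based revision (not the paper's exact PRA):
# keep the predicted phase sequence from moving backwards.
def revise_with_phase_order(pred_phases):
    revised, current = [], 0
    for p in pred_phases:
        if p >= current:          # a forward (or repeated) phase is plausible: accept it
            current = p
        revised.append(current)   # otherwise keep the last plausible phase
    return revised


print(revise_with_phase_order([0, 0, 1, 0, 2, 1, 2, 3]))  # [0, 0, 1, 1, 2, 2, 2, 3]
```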
Results: We conduct extensive experiments on the Cholec80 dataset to validate the effectiveness of the TSTNet-PRA method. Our method achieves excellent performance on Cholec80, with an accuracy of 92.8%, greatly exceeding state-of-the-art methods.
Conclusion: By modelling long-range temporal information together with multi-scale visual information, we propose the TSTNet-PRA method. Evaluated on a large public dataset, it shows recognition capability superior to other spatiotemporal networks.
DOI: http://dx.doi.org/10.1007/s11548-022-02785-y