Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequence modeling. Yet, BN-related issues are hardly studied for long video understanding, despite the ubiquitous use of BN in CNNs (Convolutional Neural Networks) for feature extraction. Especially in surgical workflow analysis, where the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we analyze pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. However, using BN-free backbones, even simple CNN-LSTMs beat the state of the art on three surgical workflow benchmarks by utilizing adequate end-to-end training strategies which maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. By reproducing results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code is available at: https://gitlab.com/nct_tso_public/pitfalls_bn.
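The batch dependence mentioned above can be illustrated with a minimal sketch. It is not taken from the paper's code: the helper `batch_norm` and the toy values are hypothetical, and it assumes training-mode BN on scalar activations without the learnable scale and shift. It shows that the normalized output for one sample changes when only the other samples in the batch change.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize scalar activations with training-mode batch statistics.

    Hypothetical helper for illustration; the learnable scale and shift
    of a real BN layer are omitted.
    """
    mean = sum(batch) / len(batch)
    # Biased (population) variance, as used by BN during training.
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


sample = 1.0

# The same sample placed in two different batches.
out_a = batch_norm([sample, 0.0, 2.0])[0]  # batch mean equals the sample
out_b = batch_norm([sample, 5.0, 9.0])[0]  # batch mean lies above the sample

print(out_a, out_b)
```

The first output is zero because the sample coincides with its batch mean, while the second is clearly negative. In video learning, where a batch may hold neighboring frames of one sequence, this coupling means each frame's features can carry information about the other frames in the batch.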
DOI: http://dx.doi.org/10.1016/j.media.2024.103126
Sensors (Basel)
December 2024
Australian Urban Research Infrastructure Network (AURIN), University of Melbourne, Melbourne, VIC 3052, Australia.
Public transportation systems play a vital role in modern cities, but they face growing security challenges, particularly related to incidents of violence. Detecting and responding to violence in real time is crucial for ensuring passenger safety and the smooth operation of these transport networks. To address this issue, we propose an advanced artificial intelligence (AI) solution for identifying unsafe behaviours in public transport.
Sensors (Basel)
December 2024
College of Electrical Engineering, Sichuan University, Chengdu 610065, China.
Remote photoplethysmography (rPPG) is a useful camera-based health monitoring method that can measure the heart rhythm from facial videos. Many well-established deep learning models can provide highly accurate and robust results in measuring heart rate (HR) and heart rate variability (HRV). However, these methods are unable to effectively eliminate illumination variation and motion artifact disturbances, and their substantial computational resource requirements significantly limit their applicability in real-world scenarios.
Bioengineering (Basel)
November 2024
College of Biomedical Engineering, Sichuan University, Chengdu 610065, China.
Attention deficit hyperactivity disorder (ADHD) is a prevalent neurodevelopmental disorder among children and adolescents. Behavioral detection and analysis play a crucial role in ADHD diagnosis and assessment by objectively quantifying hyperactivity and impulsivity symptoms. Because existing video-based action recognition algorithms focus on object or interpersonal interactions, they may overlook ADHD-specific behaviors.
Bioengineering (Basel)
November 2024
School of Mechanical and Electrical Engineering, Sanming University, Sanming 365004, China.
In experimental pain studies involving animals, subjective pain reports are not feasible. Current methods for detecting pain-related behaviors rely on human observation, which is time-consuming and labor-intensive, particularly for lengthy video recordings. Automating the quantification of these behaviors poses substantial challenges.
Entropy (Basel)
December 2024
School of Software Technology, Dalian University of Technology, Dalian 116024, China.
In recent years, the rapid growth of video data posed challenges for storage and transmission. Video compression techniques provided a viable solution to this problem. In this study, we proposed a bidirectional coding video compression model named DeepBiVC, which was based on two-stage learning.