Video anomaly detection plays a crucial role in ensuring public safety. Its goal is to detect abnormal patterns in video frames. Most existing models identify anomalies using the Mean Squared Error (MSE), which aligns poorly with human perception and leads to discrepancies between the anomalies a model detects and those that humans recognize. Unlike the Human Visual System (HVS), these models are trained to prioritize texture over shape, which leads to poor interpretability and limited performance. To address these limitations, we propose optimizing video anomaly detection models from the perspective of human visual relevance. The optimization infrastructure includes a novel loss based on the Structural Similarity Index (SSIM), a novel SSIM-based anomaly score calculation method, and a spatial-temporal enhancement block for 3D convolution (STE-3D). The SSIM loss helps the model emphasize shape information in videos rather than texture. The SSIM-based anomaly score evaluates video frames in a way that aligns more closely with human visual perception. STE-3D improves the model's capacity to capture spatial-temporal features and compensates for the SSIM loss's weakness in capturing temporal features. STE-3D is lightweight and can be seamlessly integrated into existing video anomaly detection models based on 3D convolution. Extensive experiments and ablation studies were conducted on four challenging video anomaly detection benchmarks, i.e., UCSD Ped1, UCSD Ped2, CUHK Avenue, and ShanghaiTech. The experimental results validate the efficacy of the proposed approaches in improving video anomaly detection performance.
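To make the SSIM-based loss and anomaly score more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes 2-D grayscale frames scaled to [0, 1], a uniform 7x7 window (the classic SSIM formulation uses a Gaussian window), the standard constants k1 = 0.01 and k2 = 0.03, a loss of the form 1 - SSIM, and a per-video min-max normalization of the scores. The function names `ssim`, `ssim_loss`, and `anomaly_scores` are hypothetical, and the abstract does not specify the exact loss or score formula.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssim(x, y, win=7, data_range=1.0, k1=0.01, k2=0.03):
    """Mean SSIM of two 2-D grayscale frames (uniform win x win window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term

    # Every win x win patch of each frame, shape (H-win+1, W-win+1, win, win).
    xw = sliding_window_view(x, (win, win))
    yw = sliding_window_view(y, (win, win))

    mu_x = xw.mean(axis=(-2, -1))
    mu_y = yw.mean(axis=(-2, -1))
    var_x = xw.var(axis=(-2, -1))
    var_y = yw.var(axis=(-2, -1))
    cov = (xw * yw).mean(axis=(-2, -1)) - mu_x * mu_y

    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return float(ssim_map.mean())


def ssim_loss(pred, target):
    """Assumed loss form: 0 for a perfect reconstruction, larger otherwise."""
    return 1.0 - ssim(pred, target)


def anomaly_scores(frames, recons):
    """Per-frame score in [0, 1]; higher means more anomalous.

    Assumption: dissimilarity (1 - SSIM) is min-max normalized over the video.
    """
    err = np.array([1.0 - ssim(f, r) for f, r in zip(frames, recons)])
    span = err.max() - err.min()
    return (err - err.min()) / span if span > 0 else np.zeros_like(err)
```

In actual training, the loss would have to be written in an automatic-differentiation framework so that gradients can flow through it; the NumPy version only illustrates the computation. Because SSIM compares local means, variances, and covariances rather than raw pixel differences, a localized structural change in one frame lowers its SSIM sharply even when its overall MSE stays small.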
DOI: http://dx.doi.org/10.1016/j.neunet.2024.107115