Self-supervised contrastive learning draws on powerful representational models to acquire generic semantic features from unlabeled data, and the key to training such models lies in how accurately they track motion features. Previous video contrastive learning methods have widely treated spatially or temporally augmented clips as similar instances, so the resulting models are more likely to learn static backgrounds than motion features. To alleviate these background shortcuts, in this paper we propose a cross-view motion consistent (CVMC) self-supervised video inter-intra contrastive model that focuses on learning local details and long-term temporal relationships. Specifically, we first extract the dynamic features of consecutive video snippets and then align these features under a multi-view motion consistency constraint. Meanwhile, we use the optimized dynamic features for instance-level comparison across different videos and for local spatial fine-grained comparison with temporal order within the same video. Ultimately, the joint optimization of spatio-temporal alignment and motion discrimination addresses the missing components of instance recognition, spatial compactness, and temporal perception in self-supervised learning. Experimental results show that our proposed self-supervised model can effectively learn visual representations and achieves highly competitive performance compared to other state-of-the-art methods on both action recognition and video retrieval tasks.
Neural Netw, November 2024. DOI: http://dx.doi.org/10.1016/j.neunet.2024.106578
School of Information Science and Engineering, Yanshan University, Qinhuangdao, 066000, China; Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, 066000, China.
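To make the inter-intra contrastive objective concrete, the sketch below shows one way such a joint loss could look in PyTorch: a SimCLR-style inter-video term that treats two augmented views of each clip as positives against the rest of the batch, plus an intra-video term that contrasts temporally adjacent snippets against the remaining snippets of the same video. This is an illustrative sketch only; the function names, temperature, and weighting are assumptions and are not taken from the CVMC paper.

```python
# Illustrative sketch only (assumed names and hyper-parameters, not the CVMC code):
# an inter-video instance-discrimination term plus an intra-video temporal term.
import torch
import torch.nn.functional as F


def inter_video_nce(z1, z2, temperature=0.07):
    """SimCLR-style instance discrimination between two augmented views of each clip:
    for clip i, z2[i] is the positive and z2[j] (j != i) serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                        # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)


def intra_video_nce(snippets, temperature=0.07):
    """Temporal contrast within one video: each snippet's positive is its immediate
    successor; the other snippets of the same video act as negatives."""
    snippets = F.normalize(snippets, dim=-1)                   # (B, T, D)
    sim = snippets @ snippets.transpose(1, 2) / temperature    # (B, T, T)
    self_mask = torch.eye(sim.size(-1), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))            # exclude self-similarity
    targets = torch.arange(1, sim.size(1), device=sim.device)  # snippet t matches t + 1
    targets = targets.unsqueeze(0).expand(sim.size(0), -1).reshape(-1)
    return F.cross_entropy(sim[:, :-1].reshape(-1, sim.size(-1)), targets)


def inter_intra_loss(z1, z2, snippets, lam=1.0):
    """Joint objective: instance discrimination across videos + temporal order within them."""
    return inter_video_nce(z1, z2) + lam * intra_video_nce(snippets)
```

In use, two augmented views of each clip would be passed through the video backbone to obtain `z1` and `z2`, and the per-snippet features of the same clip would supply `snippets`.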
Sci Rep, June 2024
Department of Computer Science and Digital Technologies, University of East London, London, UK.
Stereoscopic cameras, such as those in mobile phones and various recent intelligent systems, are becoming increasingly common. Multiple variables can impact the stereo video quality, e.g.
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism.
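The unified matching idea can be illustrated with a short, hedged sketch: dense correspondence obtained by comparing per-pixel feature similarities and taking a softmax-weighted expectation over the second view's pixel grid. The shapes, names, and temperature below are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (assumed names, not the paper's code): dense correspondence by
# comparing feature similarities. A softmax over the correlation volume gives a
# soft matching distribution; its expectation gives the matched coordinates, from
# which flow or disparity follows by subtracting the source pixel grid.
import torch
import torch.nn.functional as F


def dense_correspondence(feat1, feat2, temperature=0.1):
    """feat1, feat2: (B, C, H, W) feature maps of the two views.
    Returns matched target coordinates (B, H, W, 2) in pixel units."""
    B, C, H, W = feat1.shape
    f1 = F.normalize(feat1.flatten(2).transpose(1, 2), dim=-1)   # (B, H*W, C)
    f2 = F.normalize(feat2.flatten(2).transpose(1, 2), dim=-1)   # (B, H*W, C)
    corr = f1 @ f2.transpose(1, 2) / temperature                 # (B, H*W, H*W) correlation
    prob = corr.softmax(dim=-1)                                  # soft matching distribution
    # Pixel-coordinate grid of the second view, flattened to (H*W, 2) as (x, y).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float().to(feat1.device)
    coords = prob @ grid                                         # (B, H*W, 2) expected match
    return coords.view(B, H, W, 2)
```

Optical flow would then be the difference between the matched coordinates and the source pixel grid, while rectified stereo restricts the match to the same row.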
Sensors (Basel), May 2023
Department of Informatics and Computer Engineering, University of West Attica, Egaleo Park, 12243 Athens, Greece.
The presence of occlusion in human activity recognition (HAR) tasks hinders the performance of recognition algorithms, as it is responsible for the loss of crucial motion data. Although occlusion can intuitively occur in almost any real-life environment, it is often underestimated in most research works, which tend to rely on datasets collected under ideal conditions, i.e.
Entropy (Basel), May 2023
School of Automation and Electrical Engineering, Tianjin University of Technology and Education, Tianjin 300222, China.
Gait recognition is one of the important research directions in biometric authentication. However, in practical applications the captured gait sequences are often short, while a long and complete gait video is required for successful recognition. In addition, gait images taken from different views strongly affect recognition performance.