Self-supervised contrastive learning draws on powerful representational models to acquire generic semantic features from unlabeled data, and the key to training such models lies in how accurately they capture motion features. Previous video contrastive learning methods have extensively treated spatially or temporally augmented clips as similar instances, so the resulting models are more likely to learn static backgrounds than motion features. To alleviate these background shortcuts, in this paper we propose a cross-view motion consistent (CVMC) self-supervised video inter-intra contrastive model that focuses on learning local details and long-term temporal relationships.
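To make the contrastive setup concrete, below is a minimal sketch of a generic InfoNCE-style objective in which two augmented views of the same clip form a positive pair and other clips in the batch serve as negatives. This illustrates the standard video contrastive baseline the abstract refers to, not the authors' CVMC model; the function name and parameters are hypothetical.

```python
# Generic InfoNCE contrastive loss over two augmented views of the same
# video clips. Illustrative only -- NOT the authors' CVMC method.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same clips."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage: embeddings of two spatially/temporally augmented views of 8 clips
z1 = torch.randn(8, 128)
z2 = torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```

Because such augmentations leave the static background largely intact across both views, the loss can be minimized by matching backgrounds alone, which is the shortcut the proposed model aims to avoid.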