IEEE Trans Pattern Anal Mach Intell
June 2023
Transformers have demonstrated superior performance on a wide variety of tasks since their introduction. In recent years, they have drawn attention from the vision community for tasks such as image classification and object detection. Despite this wave of interest, an accurate and efficient multiple-object tracking (MOT) method based on transformers has yet to be designed.
IEEE Trans Pattern Anal Mach Intell
October 2022
Temporal action localization, which requires a machine to recognize the location as well as the category of action instances in videos, has long been researched in computer vision. The main challenge of temporal action localization is that videos are usually long and untrimmed, with diverse action content involved. Existing state-of-the-art action localization methods divide each video into multiple action units (i.
IEEE Trans Pattern Anal Mach Intell
May 2022
The explosive growth in video streaming requires video understanding at high accuracy and low computation cost. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNN based methods can achieve good performance but are computationally intensive. In this paper, we propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance.
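To illustrate the core idea, below is a minimal sketch of a temporal shift operation, assuming an input tensor of shape [N, T, C, H, W] and a shift fraction of 1/8 of the channels in each temporal direction; the layout and fraction are illustrative assumptions, not details taken from the abstract.

```python
import torch

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the temporal dimension.

    x: tensor of shape [N, T, C, H, W]. The first C/fold_div channels
    are shifted one step toward earlier frames, the next C/fold_div
    toward later frames, and the rest stay in place. The shift itself
    costs zero FLOPs yet mixes information across neighboring frames,
    so a plain 2D CNN applied afterward can model temporal relations.
    """
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift toward the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift toward the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted channels
    return out

# Usage: insert before a 2D convolution inside a residual block.
clip = torch.randn(2, 8, 64, 56, 56)  # [batch, frames, channels, H, W]
shifted = temporal_shift(clip)
```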
IEEE Trans Pattern Anal Mach Intell
April 2022
Recently, substantial research effort has focused on how to apply CNNs or RNNs to better capture temporal patterns in videos, so as to improve the accuracy of video classification. In this paper, we investigate the potential of purely attention-based local feature integration. Accounting for the characteristics of such features in video classification, we first propose Basic Attention Clusters (BAC), which concatenates the output of multiple attention units applied in parallel, and introduce a shifting operation to capture more diverse signals.
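A minimal sketch of this structure follows: several attention units run in parallel over the local features and their outputs are concatenated. The scale-bias-then-normalize form of the shifting operation is our reading of the idea and may differ from the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionUnit(nn.Module):
    """One attention unit: a learned weighted sum over local features,
    followed by a shifting operation (scale + bias, then L2 norm)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # attention logit per local feature
        self.alpha = nn.Parameter(torch.ones(1))  # shift scale (assumed form)
        self.beta = nn.Parameter(torch.zeros(dim))  # shift bias (assumed form)

    def forward(self, x):                          # x: [N, L, dim]
        a = torch.softmax(self.score(x), dim=1)    # weights over the L features
        v = (a * x).sum(dim=1)                     # weighted sum -> [N, dim]
        return F.normalize(self.alpha * v + self.beta, dim=-1)

class BasicAttentionCluster(nn.Module):
    """Apply several attention units in parallel and concatenate them,
    so each unit can attend to a different signal in the features."""
    def __init__(self, dim, n_units=4):
        super().__init__()
        self.units = nn.ModuleList([AttentionUnit(dim) for _ in range(n_units)])

    def forward(self, x):
        return torch.cat([u(x) for u in self.units], dim=-1)  # [N, n_units*dim]

feats = torch.randn(2, 32, 128)           # 32 local features of dim 128 per video
out = BasicAttentionCluster(128)(feats)   # [2, 512]
```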
We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds originate outside the camera's view and cannot be inferred from the video content. The model may be forced to learn an incorrect mapping between visual content and these irrelevant sounds.
IEEE Trans Image Process
December 2019
We address the challenging problem of weakly supervised temporal action localization from unconstrained web videos, where only the video-level action labels are available during training. Inspired by the adversarial erasing strategy in weakly supervised semantic segmentation, we propose a novel iterative-winners-out network. Specifically, we make two technical contributions: we propose an iterative training strategy, namely, winners-out, to select the most discriminative action instances in each training iteration and remove them in the next training iteration.
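The select-then-erase pattern can be sketched as the training loop below. Only that pattern comes from the text: the scoring and update callbacks stand in for the real localization network and are hypothetical placeholders.

```python
def winners_out_training(score_fn, update_fn, videos, n_rounds=3, top_k=1):
    """Winners-out loop: in each round, train on the instances that are
    still active, then erase the highest-scoring ('most discriminative')
    instances so the next round must rely on the remaining ones.

    score_fn(instance, label) -> float and update_fn(instances, label)
    are placeholders for the real network's scoring and training step.
    """
    removed = [set() for _ in videos]
    for _ in range(n_rounds):
        for v, (instances, label) in enumerate(videos):
            active = [i for i in range(len(instances)) if i not in removed[v]]
            if not active:
                continue
            update_fn([instances[i] for i in active], label)  # video-level supervision
            scores = {i: score_fn(instances[i], label) for i in active}
            winners = sorted(active, key=scores.get, reverse=True)[:top_k]
            removed[v].update(winners)  # erased in the next round
    return removed

# Toy usage: per-instance activations double as scores.
videos = [([0.9, 0.2, 0.7, 0.1], "jump")]
removed = winners_out_training(lambda inst, lbl: inst,
                               lambda inst, lbl: None, videos)
```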
IEEE Trans Image Process
April 2019
In this paper, we propose the novel principal backpropagation networks (PBNets) to revisit the backpropagation algorithms commonly used in training two-stream networks for video action recognition. We contend that existing approaches, which always take all the frames/snippets into account during backpropagation, are not optimal for video recognition, since the desired actions occur only within a short period of a video. To remedy these drawbacks, we design a watch-and-choose mechanism.
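One way to read watch-and-choose is the sketch below: watch all frames in a gradient-free forward pass, then backpropagate only through the frames whose predictions best support the video label. The top-k selection rule and the tiny per-frame classifier are our own illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Toy per-frame classifier standing in for a two-stream backbone."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, frames):   # frames: [T, dim]
        return self.fc(frames)   # per-frame logits: [T, n_classes]

model = FrameClassifier(dim=64, n_classes=10)
frames, label = torch.randn(16, 64), torch.tensor(3)

# Watch: score every frame without building a computation graph.
with torch.no_grad():
    support = model(frames)[:, label]   # evidence for the true class per frame
chosen = support.topk(k=4).indices      # choose the most informative frames

# Choose: run the forward/backward pass only on the chosen frames.
logits = model(frames[chosen]).mean(dim=0, keepdim=True)
loss = nn.functional.cross_entropy(logits, label.unsqueeze(0))
loss.backward()                         # gradients flow from chosen frames only
```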