Video inpainting aims to fill spatio-temporal holes in videos with plausible content. Despite tremendous progress in deep learning-based single-image inpainting, extending these methods to the video domain remains challenging because of the additional time dimension. In this paper, we propose a recurrent temporal aggregation framework for fast deep video inpainting. In particular, we construct an encoder-decoder model in which the encoder takes multiple reference frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply recurrent feedback in an auto-regressive manner to enforce temporal consistency in the video results. We propose two architectural designs based on this framework. The first is a blind video decaptioning network (BVDNet), designed to automatically remove and inpaint text overlays in videos without any mask information. BVDNet won first place in the ECCV Chalearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. The second is a network for more general video inpainting (VINet), which handles larger and more arbitrary holes. Video results demonstrate the advantage of our framework over state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/mcahny/Deep-Video-Inpainting and https://github.com/shwoo93/video_decaptioning.
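The data flow described in the abstract (hints gathered from neighboring reference frames, aggregated, and combined with feedback from the previous output) can be illustrated with a toy sketch. This is not the authors' network: the learned encoder-decoder is replaced here by a plain average over visible pixels, and the names `aggregate_hints`, `inpaint_video`, and `radius` are hypothetical, chosen only for illustration. Unlike BVDNet, the sketch assumes the hole masks are given.

```python
from typing import List, Optional, Sequence

Frame = List[float]  # one flattened frame of pixel intensities
Mask = List[bool]    # True where the pixel is missing (inside the hole)


def aggregate_hints(refs: Sequence[Frame], ref_masks: Sequence[Mask],
                    idx: int) -> Optional[float]:
    """Average pixel `idx` over the reference frames in which it is visible.

    Toy stand-in (assumption) for the learned feature aggregation.
    Returns None when no reference frame reveals the pixel.
    """
    visible = [f[idx] for f, m in zip(refs, ref_masks) if not m[idx]]
    return sum(visible) / len(visible) if visible else None


def inpaint_video(frames: Sequence[Frame], masks: Sequence[Mask],
                  radius: int = 1) -> List[Frame]:
    """Fill holes frame by frame, feeding each output back into the next step."""
    outputs: List[Frame] = []
    prev_out: Optional[Frame] = None
    for t, (frame, mask) in enumerate(zip(frames, masks)):
        # Reference frames: temporal neighbors within `radius`, excluding t.
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        refs = [frames[s] for s in range(lo, hi) if s != t]
        ref_masks = [masks[s] for s in range(lo, hi) if s != t]
        if prev_out is not None:
            # Recurrent feedback: the previous output acts as one more
            # reference and is treated as fully visible (assumption).
            refs.append(prev_out)
            ref_masks.append([False] * len(prev_out))
        out = list(frame)
        for i, missing in enumerate(mask):
            if missing:
                hint = aggregate_hints(refs, ref_masks, i)
                # Fall back to 0.0 when nothing reveals the pixel; a real
                # model would synthesize content here instead.
                out[i] = hint if hint is not None else 0.0
        outputs.append(out)
        prev_out = out
    return outputs
```

In this sketch, a pixel that is visible only in the first frame is carried through every later frame by the feedback path, even after it falls outside the reference window. This is the intuition behind using auto-regressive feedback for temporal consistency.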
DOI: http://dx.doi.org/10.1109/TPAMI.2019.2958083
Acta Otolaryngol
December 2024
Department of Otorhinolaryngology-Head and Neck Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea.
Low-rank tensor completion (LRTC) has shown promise in processing incomplete visual data, yet it often overlooks the inherent local smooth structures in images and videos. Recent advances in LRTC, integrating total variation regularization to capitalize on the local smoothness, have yielded notable improvements. Nonetheless, these methods are limited to exploiting local smoothness within the original data space, neglecting the latent factor space of tensors.
Instance shadow detection, crucial for applications such as photo editing and light direction estimation, has undergone significant advancements in predicting shadow instances, object instances, and their associations. The extension of this task to videos presents challenges in annotating diverse video data and addressing complexities arising from occlusion and temporary disappearances within associations. In response to these challenges, we introduce ViShadow, a semi-supervised video instance shadow detection framework that leverages both labeled image data and unlabeled video data for training.
IEEE Trans Pattern Anal Mach Intell
September 2024
Nonlocal self-similarity (NSS) is an important prior that has been successfully applied in multi-dimensional data processing tasks, e.g., image and video recovery.