Predictive scene parsing is the task of assigning pixel-level semantic labels to a future frame of a video. It has many applications in vision-based artificial intelligence systems, e.g., autonomous driving and robot navigation. Although previous work has shown promising performance in semantic segmentation of images and videos, anticipating future scene parsing remains challenging with limited annotated training data. In this paper, we propose a novel model called STC-GAN, a Spatio-Temporally Coupled Generative Adversarial Network for predictive scene parsing, which employs both convolutional neural networks and convolutional long short-term memory (LSTM) in an encoder-decoder architecture. By virtue of STC-GAN, spatial layout and semantic context are captured effectively by the spatial encoder, while motion dynamics are extracted accurately by the temporal encoder. Furthermore, a coupled architecture is presented to establish joint adversarial training, in which weights are shared and features are transformed adaptively between the future frame generation model and the predictive scene parsing model. Consequently, the proposed STC-GAN is able to learn valuable features from unlabeled video data. We evaluate STC-GAN on two public datasets, i.e., Cityscapes and CamVid. Experimental results demonstrate that our method outperforms the state of the art.
DOI: http://dx.doi.org/10.1109/TIP.2020.2983567
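The coupled design described in the abstract, a shared recurrent temporal encoder feeding both a future-frame head and a scene-parsing head, can be sketched in minimal form. The NumPy sketch below is illustrative only and is not the authors' implementation: it uses 1×1 kernels (per-pixel recurrence) instead of full k×k convolutions, omits the spatial CNN encoder/decoder and the adversarial training, and all class and parameter names (`ConvLSTMCell`, `CoupledPredictor`, `hid_ch`) are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """ConvLSTM cell with 1x1 kernels (per-pixel recurrence) for brevity.

    A real ConvLSTM would use k x k convolutions so the hidden state also
    aggregates spatial context, not just the channel dimension.
    """
    def __init__(self, in_ch, hid_ch, rng):
        # One weight matrix maps [input; hidden] channels to the four gates.
        self.W = rng.standard_normal((4 * hid_ch, in_ch + hid_ch)) * 0.1
        self.b = np.zeros(4 * hid_ch)
        self.hid_ch = hid_ch

    def step(self, x, h, c):
        # x: (in_ch, H, W); h, c: (hid_ch, H, W)
        z = np.concatenate([x, h], axis=0)                      # (in+hid, H, W)
        gates = np.einsum('oc,chw->ohw', self.W, z) + self.b[:, None, None]
        i, f, o, g = np.split(gates, 4, axis=0)                 # input/forget/output/cell gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

class CoupledPredictor:
    """Two task heads coupled through one shared temporal encoder."""
    def __init__(self, in_ch, hid_ch, n_classes, rng):
        self.cell = ConvLSTMCell(in_ch, hid_ch, rng)            # shared weights
        self.frame_head = rng.standard_normal((in_ch, hid_ch)) * 0.1
        self.parse_head = rng.standard_normal((n_classes, hid_ch)) * 0.1

    def forward(self, frames):
        # frames: (T, in_ch, H, W); run the recurrence over the observed clip.
        _, _, H, W = frames.shape
        h = np.zeros((self.cell.hid_ch, H, W))
        c = np.zeros_like(h)
        for x in frames:
            h, c = self.cell.step(x, h, c)
        # Both predictions read from the same temporal features h.
        next_frame = np.einsum('oc,chw->ohw', self.frame_head, h)
        logits = np.einsum('oc,chw->ohw', self.parse_head, h)
        labels = logits.argmax(axis=0)  # per-pixel class ids for the future frame
        return next_frame, labels
```

Because both heads read from the same recurrent state, gradients from the (unlabeled) frame-prediction task would update the shared encoder that the parsing head also relies on, which is the intuition behind learning from unlabeled video.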
PLoS One
October 2024
School of Information Science and Engineering, Shenyang University of Technology, Shenyang City, Liaoning Province, China.
The semantic feature combination/parsing issue is one of the key problems in sound event classification for acoustic scene analysis, environmental sound monitoring, and urban soundscape analysis. The input audio signal in acoustic scene classification is composed of multiple acoustic events, which usually leads to low recognition rates in complex environments. To address this issue, this paper proposes the Hierarchical-Concatenate Fusion (HCF)-TDNN model, which adds an HCF module to the ECAPA-TDNN model for sound event classification.
Curr Biol
November 2024
Department of Brain and Cognitive Sciences, Center for Visual Science, University of Rochester, Rochester, NY 14627, USA.
For the brain to compute object motion in the world during self-motion, it must discount the global patterns of image motion (optic flow) caused by self-motion. Optic flow parsing is a proposed visual mechanism for computing object motion in the world, and studies in both humans and monkeys have demonstrated perceptual biases consistent with the operation of a flow-parsing mechanism. However, the neural basis of flow parsing remains unknown.