P2T: Pyramid Pooling Transformer for Scene Understanding.

Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng

IEEE Trans Pattern Anal Mach Intell

Published: November 2023

Recently, the vision transformer has achieved great success, advancing the state of the art on various vision tasks. One of its most challenging problems is that the large sequence length of image tokens leads to high computational cost, because self-attention has quadratic complexity in the sequence length. A popular solution is to reduce the sequence length with a single pooling operation. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has proven effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that P2T, when applied as the backbone network, shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
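To make the idea in the abstract concrete, the following is a minimal pure-Python sketch of attention whose keys and values come from pyramid-pooled tokens. It is not the authors' implementation: the helper names (`avg_pool`, `pyramid_pool`, `attend`), the pooling ratios (2, 4, 8), the toy 8 x 8 x 4 feature map, the single attention head, and the absence of learned projections are all assumptions chosen for illustration. It only shows how pooling at several ratios shortens the key/value sequence while every original token still acts as a query.

```python
import math

def avg_pool(fmap, ratio):
    """Average-pool an H x W x C feature map (nested lists) over ratio x ratio windows."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    pooled = []
    for top in range(0, h, ratio):
        for left in range(0, w, ratio):
            # Collect the channel vectors that fall inside this window.
            cells = [fmap[i][j]
                     for i in range(top, min(top + ratio, h))
                     for j in range(left, min(left + ratio, w))]
            pooled.append([sum(cell[k] for cell in cells) / len(cells)
                           for k in range(c)])
    return pooled  # flat list of pooled tokens

def pyramid_pool(fmap, ratios):
    """Concatenate the tokens pooled at each ratio into one short sequence."""
    tokens = []
    for ratio in ratios:
        tokens.extend(avg_pool(fmap, ratio))
    return tokens

def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections (assumption)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, kv)) for kv in keys_values]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(wt * kv[k] for wt, kv in zip(weights, keys_values)) / total
                        for k in range(len(q))])
    return outputs

# Hypothetical toy input: an 8 x 8 feature map with 4 channels.
H, W, C = 8, 8, 4
fmap = [[[math.sin(i + 2 * j + k) for k in range(C)] for j in range(W)]
        for i in range(H)]

queries = [token for row in fmap for token in row]  # all H*W tokens stay as queries
pooled = pyramid_pool(fmap, ratios=(2, 4, 8))       # 16 + 4 + 1 pooled tokens
output = attend(queries, pooled)

full_cost = len(queries) ** 2              # query-key pairs in plain self-attention
pooled_cost = len(queries) * len(pooled)   # query-key pairs with pooled keys/values
print(len(queries), len(pooled), full_cost, pooled_cost)
```

In this toy setting, 64 queries attend to 21 pooled tokens instead of 64, so the number of query-key pairs drops from 4096 to 1344, and the output still contains one vector per original token. The saving grows with the feature-map size, since the pooled sequence length shrinks roughly with the square of each pooling ratio.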

DOI: http://dx.doi.org/10.1109/TPAMI.2022.3202765

Similar Publications

Application of MRI image segmentation algorithm for brain tumors based on improved YOLO.

Front Neurosci

January 2025

The Affiliated People's Hospital of Fujian University of Traditional Chinese Medicine, Fuzhou, China.

Tao Yang, Xueqi Lu, Lanlan Yang, Miyang Yang, Jinghui Chen

Objective: To support rapid clinical identification of brain tumor types together with segmentation, this study investigates the feasibility of applying the deep learning YOLOv5s model to the segmentation of brain tumor magnetic resonance images, and then optimizes and upgrades the model.

Methods: The study used two public Kaggle datasets of meningioma and glioma magnetic resonance images. Dataset 1 contains 3,223 images, and Dataset 2 contains 216 images.

Seg-SkiNet: adaptive deformable fusion convolutional network for skin lesion segmentation.

Quant Imaging Med Surg

January 2025

School of Computer and Control Engineering, Yantai University, Yantai, China.

Haiwang Nan, Zhenhao Gao, Limei Song, Qiang Zheng

Background: Skin lesion segmentation plays a significant role in skin cancer diagnosis. However, due to the complex shapes, varying sizes, and different color depths, precise segmentation of skin lesions is a challenging task. Therefore, the aim of this study was to design a customized deep learning (DL) model for the precise segmentation of skin lesions, particularly for complex shapes and small target lesions.

An improved DeepLabv3+ railway track extraction algorithm based on densely connected and attention mechanisms.

Sci Rep

January 2025

School of Computer Science, Hunan University of Technology, Tianyuan District, Zhuzhou, 412007, China.

Yanbin Weng, Jie Yang, Changfan Zhang, Jing He, Cheng Peng

Railway track extraction from unmanned aerial vehicle (UAV) aerial images suffers from low extraction accuracy and high time consumption. To address these problems, this paper presents a lightweight algorithm, DA-DeepLabv3+, based on dense connections and attention mechanisms. First, the lightweight MobileNetV2 network replaces the Xception feature extraction network, thereby reducing the number of model parameters.

PFENet: Towards precise feature extraction from sparse point cloud for 3D object detection.

Neural Netw

January 2025

School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China.

Yaochen Li, Qiao Li, Cong Gao, Shengjing Gao, Hao Wu

Accurate 3D point cloud object detection is crucially important for autonomous driving vehicles. The sparsity of point clouds in 3D scenes, especially for smaller targets like pedestrians and bicycles that contain fewer points, makes detection particularly challenging. To solve this problem, we propose a single-stage voxel-based 3D object detection method, namely PFENet.

A small underwater object detection model with enhanced feature extraction and fusion.

Sci Rep

January 2025

China Institute of Water Resources and Hydropower Research, Beijing, 100048, China.

Tao Li, Yijin Gang, Sumin Li, Yizi Shang

In the underwater domain, small object detection plays a crucial role in the protection, management, and monitoring of the environment and marine life. Advancements in deep learning have led to the development of many efficient detection techniques. However, the complexity of the underwater environment, limited information available from small objects, and constrained computational resources make small object detection challenging.
