Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by these significant achievements, some pioneering works have recently employed Transformer-like architectures in the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data streams (images, point clouds, and vision-language data). Owing to their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over modern convolutional neural networks (CNNs) on multiple benchmarks. In this survey, we comprehensively review over 100 different visual Transformers according to three fundamental CV tasks and different data stream types, proposing a taxonomy that organizes the representative methods by their motivations, structures, and application scenarios. Because of differences in training settings and dedicated vision tasks, we also evaluate and compare all these existing visual Transformers under different configurations. Furthermore, we reveal a series of essential but unexploited aspects that may empower visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, two promising research directions are suggested for future investigation. We will continue to update the latest articles and their released source codes at https://github.com/liuyang-ict/awesome-visual-transformers.
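The core operation shared by the sequential and visual Transformers surveyed above is scaled dot-product attention, in which each token (a word embedding in NLP, an image-patch embedding in CV) attends to all others. A minimal NumPy sketch, with the token count and embedding dimension chosen purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # e.g., 4 patch embeddings of dimension 8
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (4, 8)
```

Using the same matrix for queries, keys, and values gives self-attention, the configuration used in a Transformer encoder; real models add learned projections and multiple heads on top of this primitive.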
DOI: http://dx.doi.org/10.1109/TNNLS.2022.3227717
Sensors (Basel)
December 2024
School of Biological and Environmental Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool L3 3AF, UK.
Camera traps offer enormous new opportunities in ecological studies, but current automated image analysis methods often lack the contextual richness needed to support impactful conservation outcomes. Integrating vision-language models into these workflows could address this gap by providing enhanced contextual understanding and enabling advanced queries across temporal and spatial dimensions. Here, we present an integrated approach that combines deep learning-based vision and language models to improve ecological reporting using data from camera traps.
Sensors (Basel)
December 2024
College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China.
Within the domain of traditional art, Chinese Wuhu Iron Painting stands out for its distinctive craftsmanship, aesthetic expressiveness, and choice of materials, making stylistic transformation a formidable challenge. This paper introduces an innovative Hierarchical Visual Transformer (HVT) framework aimed at effective and precise style transfer of Wuhu Iron Paintings. The study begins with an in-depth analysis of the artistic style of Wuhu Iron Paintings, extracting key stylistic elements that meet the technical requirements for style conversion.
Brain Sci
December 2024
Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada, Reno, 1664 N Virginia St, Reno, NV 89557, USA.
Advancements in neuroimaging, particularly diffusion magnetic resonance imaging (MRI) techniques and molecular imaging with positron emission tomography (PET), have significantly enhanced the early detection of biomarkers in neurodegenerative and neuro-ophthalmic disorders. These include Alzheimer's disease, Parkinson's disease, multiple sclerosis, neuromyelitis optica, and myelin oligodendrocyte glycoprotein antibody disease. This review highlights the transformative role of advanced diffusion MRI techniques-Neurite Orientation Dispersion and Density Imaging and Diffusion Kurtosis Imaging-in identifying subtle microstructural changes in the brain and visual pathways that precede clinical symptoms.
Animals (Basel)
December 2024
College of Electronic Information Engineering, Inner Mongolia University, Hohhot 010021, China.
This study proposes an image enhancement detection technique based on Adltformer (Adaptive Dynamic Learning Transformer) team-training with DETR (Detection Transformer) to improve model accuracy in suboptimal conditions, addressing the challenge of detecting cattle in real pastures under complex lighting, including backlighting, non-uniform lighting, and low light. Such conditions often cause loss of image detail and structural information, color distortion, and noise artifacts, degrading the visual quality of captured images and reducing model accuracy. To train the Adltformer enhancement model, the day-to-night image synthesis (DTN-Synthesis) algorithm generates low-light images that are precisely aligned with their normal-light counterparts and include controlled noise levels.
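The abstract does not detail how DTN-Synthesis produces its aligned pairs; a hypothetical stand-in (not the authors' algorithm) that darkens a normalized image with a gamma curve and injects controlled Gaussian sensor noise, while preserving pixel alignment, might look like:

```python
import numpy as np

def synthesize_low_light(image, gamma=3.0, noise_std=0.02, seed=0):
    """Darken a normalized RGB image (values in [0, 1]) nonlinearly and add
    Gaussian noise; the output stays pixel-aligned with the input."""
    rng = np.random.default_rng(seed)
    dark = np.power(image, gamma)                       # gamma darkening
    noisy = dark + rng.normal(0.0, noise_std, image.shape)
    return np.clip(noisy, 0.0, 1.0)

day = np.full((4, 4, 3), 0.8)          # toy "daytime" image
night = synthesize_low_light(day)      # aligned synthetic low-light pair
print(night.shape)  # (4, 4, 3)
```

Pairs like (day, night) could then supervise an enhancement model, since each dark pixel has a known well-lit ground truth at the same location.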
Sci Rep
January 2025
Chubu Institute for Advanced Studies, Chubu University, Kasugai, Aichi, Japan.
Event-based surveillance is crucial for the early detection and rapid response to potential public health risks. In recent years, social networking services (SNS) have been recognized for their potential role in this domain. Previous studies have demonstrated the capacity of SNS posts for the early detection of health crises and affected individuals, including those related to infectious diseases.