GSB: Group superposition binarization for vision transformer with limited training samples.

Tian Gao, Cheng-Zhong Xu, Le Zhang, Hui Kong

Neural Netw

State Key Laboratory of Internet of Things for Smart City (SKL-IOTSC), University of Macau, 999078, Macao Special Administrative Region of China; Department of Computer and Information Science (CIS), University of Macau, 999078, Macao Special Administrative Region of China; Department of Electromechanical Engineering (EME), University of Macau, 999078, Macao Special Administrative Region of China.

Published: April 2024

Vision Transformers (ViT) are effective in computer vision but struggle with overfitting and require significant computing resources, making them hard to deploy on less powerful devices.
Model binarization is proposed as a solution, simplifying computations by using binary operations instead of complex calculations, which can help reduce model size and computational demands.
The study introduces a new binarization approach called Group Superposition Binarization (GSB), addressing challenges in maintaining accuracy during the binarization of ViTs, and incorporates techniques like knowledge distillation to enhance performance and mitigate overfitting.

Vision Transformer (ViT) has performed remarkably well in various computer vision tasks. Nonetheless, because of its massive number of parameters, ViT usually suffers from serious overfitting when the number of training samples is relatively limited. In addition, ViT generally demands heavy computing resources, which limits its deployment on resource-constrained devices. As a type of model-compression method, model binarization is potentially a good choice for solving these problems. Compared with its full-precision counterpart, a binarized model replaces complex tensor multiplication with simple bit-wise binary operations and represents full-precision model parameters and activations with 1-bit ones, which potentially addresses computational complexity and model size, respectively. In this paper, we investigate a binarized ViT model. Empirically, we observe that existing binarization techniques designed for Convolutional Neural Networks (CNNs) do not transfer well to the ViT binarization task. We also find that the accuracy drop of the binary ViT model is mainly due to information loss in the Attention module and the Value vector. Therefore, we propose a novel model binarization technique, called Group Superposition Binarization (GSB), to deal with these issues. Furthermore, to improve the performance of the binarized model, we investigate the gradient calculation procedure in the binarization process and derive more appropriate gradient calculation equations for GSB to reduce the influence of gradient mismatch. Then, knowledge distillation is introduced to alleviate the performance degradation caused by model binarization. Analytically, model binarization limits the parameter search space during parameter updates while training a model. Therefore, the binarization process can play an implicit regularization role and help mitigate overfitting when training data are insufficient. Experiments on three datasets with limited numbers of training samples demonstrate that the proposed GSB model achieves state-of-the-art performance among binary quantization schemes and exceeds its full-precision counterpart on some metrics. Code and models are available at: https://github.com/IMRL/GSB-Vision-Transformer.
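
To make the bit-wise replacement of multiplication concrete, the following is a minimal sketch of generic sign binarization with an XNOR-and-popcount dot product. It is not the GSB method described in the abstract, and the function names (binarize, pack_bits, binary_dot) are hypothetical, chosen only for illustration.

```python
def binarize(values):
    """Map each real value to +1 or -1 with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack +1/-1 values into one integer, one bit per value (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors, computed from packed bits.

    XNOR marks the positions where the two signs agree. Each agreement
    contributes +1 and each disagreement -1, so the result is
    matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    matches = bin(xnor).count("1")  # popcount
    return 2 * matches - n


# Illustrative values only: compare against the ordinary dot product.
weights = binarize([0.7, -0.2, 0.1, -0.9])
inputs = binarize([0.5, 0.4, -0.3, -0.6])
reference = sum(w * x for w, x in zip(weights, inputs))
assert binary_dot(pack_bits(weights), pack_bits(inputs), 4) == reference
```

Each binarized value needs a single bit of storage, and the multiply-accumulate loop collapses into one XOR, one complement, and one population count per machine word, which is where the size and compute savings mentioned in the abstract come from.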

DOI: http://dx.doi.org/10.1016/j.neunet.2024.106133
