Since vision transformers excel at establishing global relationships between features, they play an important role in current vision tasks. However, the global attention mechanism restricts the capture of local features, making convolutional assistance necessary. This paper shows that, given a suitable initialization method, transformer-based models can attend to local information much as convolutional kernels do, without using convolutional blocks. Accordingly, this paper proposes a novel hybrid multi-scale model called the Frequency-Assisted Local Attention Transformer (FALAT). FALAT introduces a Frequency-Assisted Window-based Positional Self-Attention (FWPSA) module that limits the attention distance of query tokens, enabling the capture of local content in the early stages, while frequency-domain information from the value tokens enhances feature diversity during the self-attention computation. In the later stages, the conventional convolutional downsampling in the spatial-reduction attention module is replaced with a depth-wise separable convolution to capture long-range content. Experimental results demonstrate that FALAT-S achieves 83.0% accuracy on ImageNet-1k with an input size of 224×224 using 29.9M parameters and 5.6G FLOPs. The model outperforms Next-ViT-S by 0.9 box AP/0.8 mask AP with Mask R-CNN (1× schedule) on COCO and surpasses the recent FastViT-SA36 by 3.1% mIoU with FPN on ADE20k.
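As a concrete illustration of what the FWPSA described in the abstract might look like, here is a minimal PyTorch sketch of a window attention whose queries can only attend within a fixed distance, combined with a learnable frequency-domain filter on the value tokens. All names (`FWPSASketch`, `max_dist`, `freq_filter`) are assumptions of mine, and the hard distance mask is a simpler stand-in for the special initialization the paper credits for locality; the actual FWPSA implementation may differ.

```python
import torch
import torch.nn as nn


class FWPSASketch(nn.Module):
    """Distance-limited window attention with a frequency-filtered
    value branch. A sketch, not the authors' code."""

    def __init__(self, dim, num_heads=4, window_size=7, max_dist=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        n = window_size * window_size
        # Additive mask that forbids attention beyond `max_dist`
        # (Chebyshev distance) inside the window, so each query only
        # sees a small neighbourhood, much like a conv kernel.
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size),
            indexing="ij"), dim=-1).reshape(n, 2)
        dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)
        self.register_buffer(
            "local_mask", torch.where(dist <= max_dist, 0.0, float("-inf")))

        # Learnable complex filter applied to the value tokens in the
        # frequency domain (an assumed reading of the abstract's
        # "information from value tokens in the frequency domain").
        self.freq_filter = nn.Parameter(torch.randn(n, head_dim, 2) * 0.02)

    def forward(self, x):
        # x: (B, N, C) tokens of one window, N = window_size ** 2
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, H, N, d)

        # Frequency-assisted values: FFT over the token axis, apply
        # the learned filter, transform back, and mix into the values.
        v_f = torch.fft.fft(v, dim=-2) * torch.view_as_complex(self.freq_filter)
        v = v + torch.fft.ifft(v_f, dim=-2).real

        attn = (q @ k.transpose(-2, -1)) * self.scale + self.local_mask
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```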
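The later-stage change, swapping the strided convolution that downsamples keys and values in spatial-reduction attention (the scheme popularized by PVT) for a depth-wise separable convolution, can be sketched as follows. The module name, `sr_ratio`, and kernel choices are illustrative assumptions rather than the paper's configuration.

```python
import torch.nn as nn


class DWSeparableReduction(nn.Module):
    """Depth-wise separable downsampling of the token sequence before
    computing keys/values in spatial-reduction attention. A sketch."""

    def __init__(self, dim, sr_ratio=4):
        super().__init__()
        # Depth-wise: one strided filter per channel (spatial mixing).
        self.dw = nn.Conv2d(dim, dim, kernel_size=sr_ratio,
                            stride=sr_ratio, groups=dim)
        # Point-wise: 1x1 convolution to mix channels.
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N = h * w; returns the reduced (B, N', C)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.pw(self.dw(x))               # (B, C, h/sr, w/sr)
        x = x.flatten(2).transpose(1, 2)      # (B, N', C)
        return self.norm(x)
```

Keys and values computed from the reduced sequence shrink the attention cost from O(N²) toward O(N²/sr_ratio²), while the depth-wise/point-wise split keeps the downsampler's parameter count low compared with a standard strided convolution.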
DOI: http://dx.doi.org/10.1142/S0129065725500157
Int J Neural Syst
April 2025
School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China.
J Urol
January 2003
Department of Urology, Clinical Center for Minimally Invasive Urologic Cancer Treatment, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9110, USA.
Purpose: To our knowledge, we present the initial series of in situ laparoscopic radio frequency ablation of renal masses. We also discuss the indications for and results of subsequent laparoscopic partial nephrectomy.
Materials And Methods: Laparoscopic radio frequency ablation was performed in 13 patients (mean age 59 years, range 18 to 81) with a total of 17 small enhancing renal masses.