TCFormer: Visual Recognition via Token Clustering Transformer.

Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

IEEE Trans Pattern Anal Mach Intell

Published: December 2024

Transformers are widely used in computer vision and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, a fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experiments across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of TCFormer.

DOI: http://dx.doi.org/10.1109/TPAMI.2024.3425768

