Person identification is a critical task in applications such as security and surveillance, requiring systems that perform reliably under diverse conditions. This study evaluates Vision Transformer (ViT) and ResNet34 models across three modalities (RGB, thermal, and depth), using datasets collected with infrared array and LiDAR sensors in controlled scenarios at varying resolutions (16 × 12 to 640 × 480), to explore their effectiveness for person identification. Preprocessing techniques, including YOLO-based cropping, were employed to improve subject isolation. Results show similar identification performance across the three modalities, particularly at high resolution (640 × 480), with RGB classification reaching 100.0%, depth images reaching 99.54%, and thermal images reaching 97.93%. However, closer investigation shows that thermal images are more robust and generalizable, maintaining focus on subject-specific features even at low resolutions. In contrast, RGB data performs well at high resolutions but relies increasingly on background features as resolution decreases. Depth data degrades significantly at lower resolutions, suffering from scattered attention and artifacts. These findings highlight the importance of modality selection, with thermal imaging emerging as the most reliable. Future work will explore multi-modal integration, advanced preprocessing, and hybrid architectures to enhance model adaptability and address current limitations. Overall, this study highlights the potential of thermal imaging and the need for modality-specific strategies when designing robust person identification systems.
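To make the described pipeline concrete, the following is a minimal sketch of YOLO-based person cropping followed by a ViT or ResNet34 identification head, assuming PyTorch/torchvision backbones and the Ultralytics YOLO package. The specific detector weights, input file name, and number of identity classes are hypothetical and not taken from the paper.

```python
# Sketch: YOLO-based subject cropping + ViT/ResNet34 identity classifier.
# Assumptions (not from the paper): Ultralytics YOLOv8 weights, torchvision
# backbones, and an illustrative number of identity classes.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet34, ResNet34_Weights, vit_b_16, ViT_B_16_Weights
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # hypothetical detector weights; COCO class 0 = person


def crop_person(image_path: str):
    """Detect the most confident person box and return the cropped region (or None)."""
    image = Image.open(image_path).convert("RGB")
    result = detector(image, verbose=False)[0]
    boxes = result.boxes
    person_idx = (boxes.cls == 0).nonzero(as_tuple=True)[0]
    if len(person_idx) == 0:
        return None
    best = person_idx[boxes.conf[person_idx].argmax()]
    x1, y1, x2, y2 = boxes.xyxy[best].tolist()
    return image.crop((x1, y1, x2, y2))


def build_classifier(arch: str, num_identities: int) -> nn.Module:
    """Create a ViT or ResNet34 backbone with a new identification head."""
    if arch == "vit":
        model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        model.heads.head = nn.Linear(model.heads.head.in_features, num_identities)
    else:
        model = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, num_identities)
    return model


preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # backbones expect 224x224; low-res inputs are upsampled
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

if __name__ == "__main__":
    crop = crop_person("subject.jpg")  # hypothetical input image
    if crop is not None:
        model = build_classifier("vit", num_identities=20).eval()  # 20 IDs is illustrative
        with torch.no_grad():
            logits = model(preprocess(crop).unsqueeze(0))
        print("predicted identity:", logits.argmax(dim=1).item())
```

The same cropping and classification steps would apply per modality (RGB, thermal, or depth), with the classifier fine-tuned separately on each.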
DOI: http://dx.doi.org/10.3390/s25010271