Detection Transformer (DETR) and Deformable DETR were proposed to eliminate the need for many hand-designed components in object detection while achieving performance comparable to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the current VOD pipeline, removing the need for many hand-crafted feature-aggregation components such as optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, TransVOD++ sets a new state of the art in accuracy on ImageNet VID with 90.0% mAP. TransVOD Lite achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
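
As a rough illustration of what "fusing object queries" across frames can mean, the sketch below lets each current-frame query attend over the queries of the reference frames and adds the result back as a residual. This is not the authors' Temporal Query Encoder: it assumes plain single-head scaled dot-product attention with no learned projections, and the names `fuse_queries`, `current`, and `references` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(current_queries, reference_queries):
    """Fuse current-frame queries with reference-frame queries.

    Assumption (not from the paper): single-head scaled dot-product
    attention without learned weights, followed by a residual add.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    # Pool the queries of all reference frames into one key/value set.
    pool = [q for frame in reference_queries for q in frame]
    fused = []
    for query in current_queries:
        weights = softmax([dot(query, key) * scale for key in pool])
        attended = [
            sum(w * key[d] for w, key in zip(weights, pool))
            for d in range(dim)
        ]
        # Residual connection keeps the original query information.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Two queries for the current frame, two reference frames of two queries each.
current = [[1.0, 0.0], [0.0, 1.0]]
references = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
]
fused = fuse_queries(current, references)
```

Each fused query leans toward the reference queries most similar to it, which is the intuition behind aggregating temporal context at the query level; a real implementation would add learned projections, multiple heads, and normalization.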


Source
DOI: http://dx.doi.org/10.1109/TPAMI.2022.3223955


