Publications by authors named "Jianbin Jiao"

We propose an integrally pre-trained transformer pyramid network (iTPN) that jointly optimizes the network backbone and the neck, so that the transfer gap between representation models and downstream tasks is minimal. iTPN comes with two elaborate designs: 1) the first pre-trained feature pyramid built upon a vision transformer (ViT); 2) multi-stage supervision of the feature pyramid using masked feature modeling (MFM).


Conventional neural architecture search (NAS) algorithms typically work on search spaces with short-distance node connections. We argue that such designs, though safe and stable, are obstacles to exploring more effective network architectures. In this brief, we explore the search algorithm upon a complicated search space with long-distance connections and show that existing weight-sharing search algorithms fail due to the existence of interleaved connections (ICs).


Point-based object localization (POL), which pursues high-performance object sensing under low-cost data annotation, has attracted increased attention. However, the point annotation mode inevitably introduces semantic variance due to the inconsistency of annotated points. Existing POL methods rely heavily on strict annotation rules, which are difficult to define and apply, to handle this problem.


With convolution operations, Convolutional Neural Networks (CNNs) are good at extracting local features but have difficulty capturing global representations. With cascaded self-attention modules, vision transformers can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, that takes advantage of both convolution operations and self-attention mechanisms for enhanced representation learning.

Article Synopsis
  • Unsupervised person re-identification (re-ID) is challenging, and this study highlights that sampling strategy is crucial for performance outcomes, alongside framework design and loss functions.
  • The paper identifies overfitting as a major issue affecting performance and introduces "group sampling," which organizes samples from the same class into groups to enhance training stability and improve classification accuracy.
  • Extensive testing on various datasets shows that group sampling performs similarly to advanced methods and surpasses existing techniques when working in camera-agnostic scenarios, with available code for implementation.
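
The grouping idea in the synopsis above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `group_sampling_order`, the `group_size` parameter, and the use of plain label lists in place of pseudo-labels are all assumptions made for the example. It only shows how samples sharing a label can be packed into contiguous groups before the groups themselves are shuffled.

```python
import random
from collections import defaultdict


def group_sampling_order(labels, group_size, seed=0):
    """Return sample indices ordered so that same-label samples form contiguous groups.

    Hypothetical helper for illustration; `labels` stands in for cluster pseudo-labels.
    """
    rng = random.Random(seed)

    # Collect the indices that belong to each label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    # Split every label's indices into groups of at most `group_size`.
    groups = []
    for indices in by_label.values():
        rng.shuffle(indices)
        for start in range(0, len(indices), group_size):
            groups.append(indices[start:start + group_size])

    # Shuffle the order of the groups, then flatten into one training order.
    rng.shuffle(groups)
    return [index for group in groups for index in group]


labels = [0, 0, 1, 1, 2, 2]
order = group_sampling_order(labels, group_size=2)
```

With these toy labels, every consecutive pair of indices in `order` shares a label, while the order in which the pairs appear is random.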

Conventional networks for object skeleton detection are usually hand-crafted. Despite their effectiveness, hand-crafted network architectures lack a theoretical basis and require intensive prior knowledge to implement representation complementarity for objects/parts at different granularities. In this paper, we propose an adaptive linear span network (AdaLSN), driven by neural architecture search (NAS), to automatically configure and integrate scale-aware features for object skeleton detection.


Few-shot semantic segmentation remains an open problem because limited support (training) images are insufficient to represent the diverse semantics within target categories. Conventional methods typically model a target category solely using information from the support image(s), resulting in incomplete semantic activation. In this paper, we propose a novel few-shot segmentation approach, termed harmonic feature activation (HFA), which aims to implement a dense support-to-query semantic transform by incorporating the features of both query and support images.


Visual commonsense knowledge has received growing attention in the reasoning of long-tailed visual relationships, which are biased in terms of object and relation labels. Most current methods collect and utilize external knowledge for visual relationships by following the fixed reasoning path of {subject, object → predicate} to facilitate the recognition of infrequent relationships. However, incorporating knowledge along such a fixed multidependent path suffers from data set bias and the exponentially growing number of combinations of object and relation labels, and it ignores the semantic gap between commonsense knowledge and real scenes.


This article establishes a baseline for object reflection symmetry detection in natural images by releasing a new benchmark named Sym-PASCAL and proposing an end-to-end deep learning approach for reflection symmetry. Sym-PASCAL spans challenges of multiobjects, object diversity, part invisibility, and clustered backgrounds, which are far beyond those in existing data sets. The end-to-end deep learning approach, referred to as a side-output residual network (SRN), leverages the output residual units (RUs) to fit the errors between the symmetry ground truth and the side outputs of multiple stages of a trunk network.


Weakly supervised object detection is a challenging task when provided with image category supervision but required to learn, at the same time, object locations and object detectors. The inconsistency between the weak supervision and learning objectives introduces significant randomness to object locations and ambiguity to detectors. In this paper, a min-entropy latent model (MELM) is proposed for weakly supervised object detection.


Tracking multiple persons is a challenging task when persons move in groups and occlude each other. Existing group-based methods have extensively investigated how to make group division more accurate in a tracking-by-detection framework; however, few of them quantify the group dynamics from the perspective of the targets' spatial topology or consider the group in a dynamic view. Inspired by the sociological properties of pedestrians, we propose a novel socio-topology model with a topology-energy function to factor the group dynamics of moving persons and groups.


Scene images usually involve semantic correlations, particularly when considering large-scale image data sets. This paper proposes a novel generative image representation, correlated topic vector, to model such semantic correlations. Oriented from the correlated topic model, correlated topic vector intends to naturally utilize the correlations among topics, which are seldom considered in the conventional feature encoding, e.


Human detection in images is challenged by the view and posture variation problem. In this paper, we propose a piecewise linear support vector machine (PL-SVM) method to tackle this problem. The motivation is to exploit the piecewise discriminative function to construct a nonlinear classification boundary that can discriminate multiview and multiposture human bodies from the backgrounds in a high-dimensional feature space.
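
As a rough illustration of how a piecewise linear discriminative function can produce a nonlinear boundary, the sketch below scores a point with several linear pieces and keeps the highest score. Taking the maximum over pieces is an assumption chosen for the example; the abstract does not state how PL-SVM combines or trains its pieces, and the names `piecewise_linear_score`, `classify`, and `pieces` are hypothetical.

```python
def piecewise_linear_score(x, pieces):
    """Return the highest score among the linear pieces (w, b) evaluated at x.

    Illustrative assumption: pieces are combined with max; the paper's rule may differ.
    """
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in pieces)


def classify(x, pieces):
    """Label x as +1 if any linear piece scores it above zero, otherwise -1."""
    return 1 if piecewise_linear_score(x, pieces) > 0 else -1


# Two hand-picked pieces: together they accept points whose first coordinate
# is greater than 1 or less than -1, a region no single linear boundary can describe.
pieces = [((1.0, 0.0), -1.0), ((-1.0, 0.0), -1.0)]

right = classify((2.0, 0.0), pieces)
left = classify((-2.0, 0.0), pieces)
center = classify((0.0, 0.0), pieces)
```

Here `right` and `left` are both labeled positive while `center` is labeled negative, which shows how a small set of linear pieces can separate a class that lies on both sides of the background.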
