Publications by authors named "Yunchao Wei"

Article Synopsis
  • Generalized zero-shot learning (GZSL) aims to recognize unseen categories by leveraging knowledge from seen categories, but faces challenges due to mismatches between visual features and semantic attributes.
  • The issues arise mainly from attribute diversity (differences in the specificity of attributes) and instance diversity (variations in visual examples that share the same attributes), leading to ambiguity in visual representation.
  • To address these challenges, a new network called PSVMA+ is introduced, which adapts visual and semantic elements at multiple levels of granularity and improves feature learning by integrating different levels effectively, resulting in better performance in identifying unseen categories.

Multi-human parsing is an image segmentation task that requires both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through distinct branch types and output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the outputs of each module into pixel-level results while supervising instance and category features with a homogeneous label and an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations.
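Learning correlations in a cosine space, as the unified representation above suggests, amounts to L2-normalizing both the pixel features and the matching kernels before taking inner products, so every response is a bounded cosine similarity. The sketch below is only an illustration of that idea; the shapes and kernel semantics are assumptions, not UniParser's actual architecture.

```python
import numpy as np

def cosine_correlation(pixel_feats, kernels):
    """Correlate L2-normalized pixel features with L2-normalized kernels,
    so every response lies in [-1, 1] regardless of feature magnitude."""
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    k = kernels / np.linalg.norm(kernels, axis=-1, keepdims=True)
    return f @ k.T  # (num_pixels, num_kernels) cosine-similarity responses

# Toy example: 4 pixels with 8-dim embeddings and 3 kernels
# (e.g. instance and category kernels sharing one embedding space).
rng = np.random.default_rng(0)
scores = cosine_correlation(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
assert scores.shape == (4, 3) and np.all(np.abs(scores) <= 1.0 + 1e-9)
```

Because both factors are unit-normalized, instance and category responses live on the same bounded scale, which is what makes supervising them with a homogeneous pixel-level label plausible.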


Spatio-Temporal Video Grounding (STVG) aims at localizing the spatio-temporal tube of a specific object in an untrimmed video given a free-form natural language query. As tube annotation is labor-intensive, recent works have explored weakly supervised approaches, which usually incur significant performance degradation. To achieve a less expensive STVG method with acceptable accuracy, this work investigates the "single-frame supervision" paradigm, which requires only a single frame labeled with a bounding box within the temporal boundary of the fully supervised counterpart as the supervisory signal.


Interactive semantic segmentation pursues high-quality segmentation results at the cost of a small number of user clicks. It is attracting more and more research attention for its convenience in labeling semantic pixel-level data. Existing interactive segmentation methods often pursue higher interaction efficiency by mining the latent information of user clicks or exploring efficient interaction manners.


This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, limiting the learning of multi-object representation as they must match and segment each target separately under multi-object scenarios. Additionally, earlier techniques catered to specific application objectives and lacked the flexibility to fulfill different speed-accuracy requirements.


Existing human parsing frameworks commonly employ joint learning of semantic edge detection and human parsing to facilitate localization around boundary regions. Nevertheless, the parsing prediction within the interior of a part contour may still exhibit inconsistencies due to the inherent ambiguity of fine-grained semantics. In contrast, binary edge detection does not suffer from such fine-grained semantic ambiguity, leading to a typical failure case in which misclassification occurs within the part contour even though the semantic edge is accurately detected.


We present ReGO (Reference-Guided Outpainting), a new method for the task of sketch-guided image outpainting. Despite the significant progress made in producing semantically coherent content, existing outpainting methods often fail to deliver visually appealing results due to blurry textures and generative artifacts. To address these issues, ReGO leverages neighboring reference images to synthesize texture-rich results by transferring pixels from them.


Image outpainting has gained increasing attention since it can generate a complete scene from a partial view, providing a valuable solution for constructing 360° panoramic images. As image outpainting suffers from the intrinsic issue of a unidirectional completion flow, previous methods convert the original problem into inpainting, which allows a bidirectional flow. However, we find that inpainting has its own limitations and is inferior to outpainting in certain situations.


Salient object detection (SOD) aims to identify the most visually distinctive object(s) in each given image. Most recent progress focuses on either adding elaborate connections among different convolution blocks or introducing boundary-aware supervision to achieve better segmentation, which actually moves away from the essence of SOD, i.e.


Our goal in this research is to study a more realistic setting in which we can conduct weakly supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M dataset and define two realistic instance-level retrieval tasks that enable evaluation on price comparison and personalized recommendation. For both instance-level tasks, accurately identifying the intended product target mentioned in visual-linguistic data and mitigating the impact of irrelevant content are quite challenging.


Scene understanding through pixel-level semantic parsing is one of the main problems in computer vision. So far, image-based methods and datasets for scene parsing have been well explored. However, the real world is naturally dynamic rather than static.


This article explores how to harvest precise object segmentation masks while minimizing the human interaction cost. To achieve this, we propose a simple yet effective interaction scheme, named Inside-Outside Guidance (IOG). Concretely, we leverage an inside point that is clicked near the object center and two outside points at the symmetrical corner locations (top-left and bottom-right or top-right and bottom-left) of an almost-tight bounding box that encloses the target object.
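User clicks like the inside point and the two corner points are commonly fed to a segmentation network as extra input channels rendered from the click coordinates. The sketch below uses a Gaussian rendering, a typical choice in interactive segmentation; the rendering, sigma, and coordinates are illustrative assumptions, not necessarily the paper's exact encoding.

```python
import numpy as np

def click_heatmap(h, w, points, sigma=10.0):
    """Render a set of clicks as one Gaussian heatmap channel in [0, 1]."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w))
    for (py, px) in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)  # keep the strongest click response
    return heat

# One inside click near the object center, two outside clicks at opposite
# corners of a near-tight box (hypothetical coordinates on a 64x64 image).
inside = click_heatmap(64, 64, [(32, 30)])
outside = click_heatmap(64, 64, [(8, 6), (56, 58)])
guidance = np.stack([inside, outside])  # 2 extra channels alongside the RGB input
assert guidance.shape == (2, 64, 64)
```

Keeping inside and outside clicks in separate channels lets the network distinguish "this pixel belongs to the object" hints from "the object ends here" hints.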


Few-shot segmentation aims at learning to segment query images guided by only a few annotated images from the support set. Previous methods rely on mining the feature embedding similarity across the query and the support images to achieve successful segmentation. However, these models tend to perform badly in cases where the query instances have a large variance from the support ones.
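The similarity mining that such methods rely on is often implemented as masked average pooling over the support features followed by cosine matching against every query pixel. The snippet below is a common baseline recipe used only to illustrate that general idea, not any specific paper's model; all shapes are toy assumptions.

```python
import numpy as np

def masked_average_prototype(feats, mask):
    """Masked average pooling: average support features over the object mask."""
    m = mask.astype(float)
    return (feats * m[..., None]).sum(axis=(0, 1)) / max(m.sum(), 1.0)

def similarity_map(query_feats, prototype):
    """Cosine similarity between each query-pixel feature and the prototype."""
    q = query_feats / np.linalg.norm(query_feats, axis=-1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    return q @ p  # (H, W) map, thresholded to obtain the query mask

rng = np.random.default_rng(1)
support = rng.normal(size=(16, 16, 32))           # H x W x C support features
mask = np.zeros((16, 16)); mask[4:12, 4:12] = 1   # binary support annotation
proto = masked_average_prototype(support, mask)
sim = similarity_map(rng.normal(size=(16, 16, 32)), proto)
assert sim.shape == (16, 16)
```

A single pooled prototype is exactly where the weakness noted above comes from: when query instances differ greatly from the support ones, one averaged vector cannot cover the appearance variation.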


Garment transfer aims to transfer a desired garment from a model image to a target person, and has attracted a great deal of attention due to its wide range of potential applications. However, because the model and target person are often captured in different views, body shapes, and poses, realistic garment transfer faces the following challenges, which have not been well addressed: 1) deforming the garment; 2) inferring unobserved appearance; and 3) preserving fine texture details. To tackle these challenges, we propose a novel SPatial-Aware Texture Transformer (SPATT) model.


Object attention maps generated by image classifiers are usually used as priors for weakly supervised semantic segmentation. However, attention maps usually locate only the most discriminative object parts. The lack of integral object localization maps heavily limits the performance of weakly supervised segmentation approaches.


Label smoothing is an effective regularization tool for deep neural networks (DNNs), which generates soft labels by applying a weighted average between the uniform distribution and the hard label. It is often used to reduce the overfitting problem of training DNNs and further improve classification performance. In this paper, we aim to investigate how to generate more reliable soft labels.
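The weighted average described above has a direct closed form: each class receives eps / K probability mass from the uniform distribution, and the ground-truth class additionally receives 1 - eps. This minimal sketch uses a typical smoothing factor of eps = 0.1 (an illustrative choice, not a value from the paper).

```python
def smooth_labels(hard_index, num_classes, eps=0.1):
    """Weighted average of the one-hot hard label and the uniform distribution."""
    uniform = eps / num_classes              # mass every class gets
    soft = [uniform] * num_classes
    soft[hard_index] += 1.0 - eps            # remaining mass to the true class
    return soft

soft = smooth_labels(hard_index=2, num_classes=5, eps=0.1)
assert abs(sum(soft) - 1.0) < 1e-9   # still a valid distribution
assert abs(soft[2] - 0.92) < 1e-9    # 0.9 + 0.1/5
assert abs(soft[0] - 0.02) < 1e-9    # 0.1/5
```

Training against these soft targets penalizes over-confident logits, which is the regularization effect the paragraph above refers to.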

Article Synopsis
  • The article addresses one-shot semantic segmentation, which involves segmenting object regions using only one annotated example, highlighting the importance of robust feature representations from the reference image.
  • The proposed method, called rich embedding features (REFs), extracts features from the reference image from three perspectives: global embedding, peak embedding, and adaptive embedding, to gather comprehensive guidance for segmentation.
  • Additionally, a depth-priority context module is introduced to enhance contextual understanding of the query image, significantly improving segmentation performance, as demonstrated through experiments on well-known datasets such as Pascal VOC 2012 and COCO.

Class activation maps are generated from the final convolutional layer of a CNN and highlight the discriminative object regions for the class of interest. These discovered object regions have been widely used for weakly supervised tasks.
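The standard class activation map construction is a weighted sum of the final convolutional feature maps, with the weights taken from the classification layer's row for the class of interest: CAM_c(x, y) = sum_k w_ck * f_k(x, y). The sketch below follows that standard formulation; the shapes are illustrative.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weight the final conv feature maps (C, H, W) by one class's
    classifier weights (C,) and sum over channels -> (H, W) map."""
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

rng = np.random.default_rng(2)
feats = rng.normal(size=(64, 7, 7))   # C x H x W final conv features
weights = rng.normal(size=(10, 64))   # classifier weights, one row per class
cam = class_activation_map(feats, weights, class_idx=3)
assert cam.shape == (7, 7)
```

The resulting low-resolution map is typically upsampled to the image size and thresholded to obtain the object-region prior used for weakly supervised training.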


Weakly supervised semantic segmentation is receiving great attention due to its low human annotation cost. In this paper, we aim to tackle bounding box supervised semantic segmentation, i.e.


This paper investigates the principles of embedding learning to tackle challenging semi-supervised video object segmentation. Unlike previous practices that focus on embedding learning for the foreground object(s), we argue that the background should be treated equally. We thus propose a Collaborative video object segmentation by Foreground-Background Integration (CFBI) approach.


Given a natural language expression and an image/video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem in a progressive manner based on informative words in the expression, i.e.


Aggregating features in terms of different convolutional blocks or contextual embeddings has been proven to be an effective way to strengthen feature representations for semantic segmentation. However, most of the current popular network architectures tend to ignore the misalignment issues during the feature aggregation process caused by step-by-step downsampling operations and indiscriminate contextual information fusion. In this paper, we explore the principles in addressing such feature misalignment issues and inventively propose Feature-Aligned Segmentation Networks (AlignSeg).


The outpainting results produced by existing approaches are often too random to meet users' requirements. In this work, we take the image outpainting one step forward by allowing users to harvest personal custom outpainting results using sketches as the guidance. To this end, we propose an encoder-decoder based network to conduct sketch-guided outpainting, where two alignment modules are adopted to impose the generated content to be realistic and consistent with the provided sketches.

Self-Correction for Human Parsing. IEEE Trans Pattern Anal Mach Intell, June 2022.

Labeling pixel-level masks for fine-grained semantic segmentation tasks, e.g., human parsing, remains challenging.


Unsupervised domain adaptation (UDA) makes predictions for target-domain data while manual annotations are available only in the source domain. Previous methods minimize the domain discrepancy while neglecting class information, which may lead to misalignment and poor generalization. To tackle this issue, this paper proposes a Contrastive Adaptation Network (CAN) that optimizes a new metric, Contrastive Domain Discrepancy, explicitly modeling the intra-class and inter-class domain discrepancy.
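The intra-class/inter-class idea can be illustrated with a much simpler class-conditional statistic than the paper's kernel-based discrepancy: pull same-class feature means together across domains and push different-class means apart. This is a deliberately simplified sketch, not the paper's actual metric, and in practice the target labels would have to be estimated (e.g. by clustering), since target annotations are unavailable.

```python
import numpy as np

def class_means(feats, labels, num_classes):
    """Per-class mean feature vector, shape (num_classes, dim)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

def contrastive_discrepancy(src_feats, src_labels, tgt_feats, tgt_labels, num_classes):
    """Toy class-aware objective: intra-class cross-domain distance minus
    inter-class cross-domain distance (lower is better alignment)."""
    s = class_means(src_feats, src_labels, num_classes)
    t = class_means(tgt_feats, tgt_labels, num_classes)
    dists = np.linalg.norm(s[:, None, :] - t[None, :, :], axis=-1)
    intra = np.mean(np.diag(dists))                       # same class, two domains
    inter = (dists.sum() - np.trace(dists)) / (num_classes * (num_classes - 1))
    return intra - inter

# Perfectly aligned domains: intra term is 0, so the objective is negative.
src = np.array([[0., 0.], [0., 0.], [5., 5.], [5., 5.]])
lab = np.array([0, 0, 1, 1])
assert contrastive_discrepancy(src, lab, src, lab, num_classes=2) < 0
```

Class-blind discrepancy minimization would be happy mapping source class 0 onto target class 1; the inter-class term is what penalizes exactly that kind of misalignment.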
