Publications by authors named "Yueting Zhuang"

The development of ultra-low near-infrared reflectivity coatings with outstanding engineering properties remains a challenge in laser stealth materials research. Herein, we report a laser stealth coating with outstanding mechanical properties, super-hydrophobicity, and ultra-low near-infrared reflectivity at the 1.06 μm wavelength.

The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type).

Federated learning (FL) is a promising approach for healthcare institutions to train high-quality medical models collaboratively while protecting sensitive data privacy. However, FL models encounter fairness issues at diverse levels, leading to performance disparities across different subpopulations. To address this, we propose Federated Learning with Unified Fairness Objective (FedUFO), a unified framework consolidating diverse fairness levels within FL.
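
The snippet below is not FedUFO; it is a minimal, hypothetical sketch of one federated-averaging round in NumPy, included only to illustrate the FL setting the paper builds on: clients train on private data and share only model weights, which the server aggregates (fairness-aware methods such as the one proposed here adjust this objective and aggregation step).

```python
# Minimal sketch of one federated-averaging (FedAvg) round, NOT FedUFO itself.
# Clients fit a local linear model on private data; only weights leave the client.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """A few local gradient steps on least-squares loss; data never leaves the client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
d = 5
global_w = np.zeros(d)

# Three hypothetical clients with differently distributed private data.
clients = []
for scale in (0.5, 1.0, 2.0):
    X = rng.normal(scale=scale, size=(100, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=100)
    clients.append((X, y))

for rnd in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    # Server aggregates by a size-weighted average; fairness-aware variants reweight this step.
    global_w = np.average(local_ws, axis=0, weights=sizes)

losses = [float(np.mean((X @ global_w - y) ** 2)) for X, y in clients]
print("per-client losses after training:", losses)
```

The per-client losses printed at the end make the fairness concern concrete: a single global model can fit some subpopulations far better than others.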

With the increasing demand for data privacy, federated learning (FL) has gained popularity for various applications. Most existing FL works focus on the classification task, overlooking scenarios where anomaly detection also requires privacy preservation. Traditional anomaly detection algorithms cannot be directly applied in the FL setting due to false-alarm and missed-detection issues.

Current distributed graph training frameworks evenly partition a large graph into small chunks to suit distributed storage, leverage a uniform interface to access neighbors, and train graph neural networks in a cluster of machines to update weights. Nevertheless, they design storage and training separately, resulting in huge communication costs for retrieving neighborhoods. During the storage phase, traditional heuristic graph partitioning not only suffers from memory overhead, since it loads the full graph into memory, but also damages semantically related structures because it neglects meaningful node attributes.
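
As a rough illustration of the communication problem (a hypothetical sketch, not the paper's partitioner): under an attribute-agnostic random hash partition, most edges cross partition boundaries, so retrieving neighborhoods during GNN training means frequent remote fetches.

```python
# Sketch: estimate the fraction of cross-partition edges under random hash partitioning.
# Every cross-partition edge implies a remote neighbor fetch during GNN training.
import random

random.seed(0)
num_nodes, num_edges, num_parts = 10_000, 50_000, 8

# A hypothetical random graph given as an edge list.
edges = [(random.randrange(num_nodes), random.randrange(num_nodes)) for _ in range(num_edges)]

part = {v: hash(v) % num_parts for v in range(num_nodes)}  # attribute-agnostic placement
cut = sum(1 for u, v in edges if part[u] != part[v])
print(f"cross-partition edges: {cut / num_edges:.1%}")     # roughly 7/8 of edges for 8 parts
```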

Temporal grounding is the task of locating a specific segment in an untrimmed video according to a query sentence. This task has gained significant momentum in the computer vision community, as it enables activity grounding beyond pre-defined activity classes by exploiting the semantic diversity of natural language descriptions. This semantic diversity is rooted in the linguistic principle of compositionality, whereby novel semantics can be systematically described by combining known words in novel ways (compositional generalization).

Lockdown is a common policy used to deter the spread of COVID-19. However, the question of how our society comes back to life after a lockdown remains an open one. Understanding how cities bounce back from lockdown is critical for promoting the global economy and preparing for future pandemics.

Training deep models for RGB-D salient object detection (SOD) often requires a large number of labeled RGB-D images. However, RGB-D data is not easily acquired, which limits the development of RGB-D SOD techniques. To alleviate this issue, we present a Dual-Semi RGB-D Salient Object Detection Network (DS-Net) to leverage unlabeled RGB images for boosting RGB-D saliency detection.

Model performance can be further improved with extra guidance beyond the one-hot ground truth. To achieve this, recently proposed recollection-based methods utilize the valuable information contained in the past training history and derive a "recollection" from it as a data-driven prior to guide training. In this article, we focus on two fundamental aspects of this method.

Vision-language research, which focuses on understanding visual content, language semantics, and the relationships between them, has become very popular. Video question answering (Video QA) is one of its typical tasks. Recently, several BERT-style pre-training methods have been proposed and shown to be effective on various vision-language tasks.

As an interesting and important problem in computer vision, learning-based video saliency detection aims to discover the visually interesting regions in a video sequence. Capturing information both within and between frames from different aspects (such as spatial context, motion information, temporal consistency across frames, and multiscale representation) is important for this task. A key issue is how to jointly model all these factors within a unified data-driven scheme in an end-to-end fashion.

Recently, a large number of saliency detection methods have focused mainly on designing complex network architectures to aggregate powerful features from backbone networks. However, contextual information is not well utilized, which often causes false background regions and blurred object boundaries. Motivated by these issues, we propose an easy-to-implement module that utilizes the edge-preserving ability of superpixels and a graph neural network to exchange contextual information among superpixel nodes.
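
A minimal sketch, assuming scikit-image and NumPy, of the kind of structure such a module operates on: SLIC superpixels plus an adjacency graph between touching superpixels, over which a graph neural network could pass contextual messages. The image and segmentation parameters are illustrative, not the paper's.

```python
# Sketch: build a superpixel adjacency graph that a GNN could propagate context over.
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import slic

image = astronaut()                       # any RGB image
labels = slic(image, n_segments=200, compactness=10, start_label=0)

# Node features: mean colour of each superpixel (a stand-in for deep features).
num_nodes = labels.max() + 1
feats = np.array([image[labels == i].mean(axis=0) for i in range(num_nodes)])

# Edges: pairs of superpixels that touch horizontally or vertically.
edges = set()
for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
    mask = a != b
    edges |= set(map(tuple, np.sort(np.stack([a[mask], b[mask]], axis=1), axis=1)))

print(num_nodes, "superpixel nodes,", len(edges), "adjacency edges")
```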

As a challenging task in visual information retrieval, open-ended long-form video question answering automatically generates a natural language answer from the referenced video content according to the given question. However, existing video question answering works mainly focus on short-form videos and may not transfer directly to long-form video question answering, because they insufficiently model the semantic representation of long-form video content. In this paper, we study the problem of open-ended long-form video question answering from the viewpoint of hierarchical multimodal conditional adversarial network learning.

In the development of a new product, the design team must convey the expected effects of the final product to potential users and stakeholders. However, existing prototyping tools can only present a product imperfectly, due to limitations at different levels. Specifically, a physical product model, such as a 3D-printed prototype, may lack a visual interface; presenting the product through modeling software such as Rhinoceros 3D does not provide realistic tactile perception; and interface platforms such as Axure RP, used to display interactive effects, differ from those used in actual operation.

A key problem in co-saliency detection is how to effectively model both the interactive relationship of a whole image group and the individual perspective of each image in a unified data-driven manner. In this paper, we propose a group-wise deep co-saliency detection approach to address the co-salient object discovery problem based on a fully convolutional network (FCN). The proposed approach captures group-wise interaction information by learning a semantics-aware image representation with a convolutional neural network, which adaptively learns group-wise features for co-saliency detection.

The capabilities of inference and prediction are significant components of visual systems. Among them, visual path prediction is an important and challenging task, whose goal is to infer the future path of a visual object in a static scene. This task is complicated, as it requires a high-level semantic understanding of both the scene and the underlying motion patterns in video sequences.

It is observed that distinct words in a given document have either a strong or weak ability to deliver facts (i.e., the objective sense) or express opinions (i.e., the subjective sense).

As an important and challenging problem in computer vision, face age estimation is typically cast as a classification or regression problem over a set of face samples with respect to several ordinal age labels, which exhibit intrinsic cross-age correlations across adjacent ages. As a result, such correlations usually lead to age label ambiguities for the face samples. Namely, each face sample is associated with a latent label distribution that encodes the cross-age correlation information on label ambiguities.
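
A small worked sketch (NumPy, with illustrative parameters) of what a label distribution looks like in this setting: instead of a one-hot age label, each face gets a distribution that concentrates on the annotated age and decays over adjacent ages, encoding the cross-age ambiguity.

```python
# Sketch: a Gaussian label distribution over ordinal age labels for one face sample.
import numpy as np

ages = np.arange(0, 81)          # ordinal age labels 0..80
true_age, sigma = 35, 2.0        # sigma controls how much adjacent ages share probability

dist = np.exp(-0.5 * ((ages - true_age) / sigma) ** 2)
dist /= dist.sum()               # normalise to a probability distribution

# The soft target used in place of a one-hot label:
for a in range(32, 39):
    print(a, round(float(dist[a]), 3))
```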

A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner. In this paper, we propose a multi-task deep saliency model based on a fully convolutional neural network with global input (whole raw images) and global output (whole saliency maps). In principle, the proposed saliency model takes a data-driven strategy for encoding the underlying saliency prior information, and then sets up a multi-task learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation.

In multimedia information retrieval, most classic approaches tend to represent different modalities of media in the same feature space. With the click data collected from users' searching behavior, existing approaches take either one-to-one paired data (text-image pairs) or ranking examples (text-query-image and/or image-query-text ranking lists) as training examples, and thus do not make full use of the click data, particularly the implicit connections among the data objects. In this paper, we treat the click data as a large click graph, in which vertices are images/text queries and edges indicate clicks between an image and a query.
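
A minimal sketch of the click-graph data structure (using networkx; the click records below are made up): queries and images are vertices, clicks become weighted edges, and relatedness can then propagate along multi-hop paths rather than being limited to one-to-one pairs.

```python
# Sketch: build a bipartite click graph from (query, image, clicks) records.
import networkx as nx

click_log = [                      # hypothetical click records
    ("red sports car", "img_001.jpg", 12),
    ("red sports car", "img_002.jpg", 3),
    ("ferrari", "img_001.jpg", 7),
    ("mountain lake", "img_010.jpg", 5),
]

G = nx.Graph()
for query, image, clicks in click_log:
    G.add_node(query, kind="query")
    G.add_node(image, kind="image")
    # Accumulate click counts as edge weights.
    w = G.edges[query, image]["weight"] if G.has_edge(query, image) else 0
    G.add_edge(query, image, weight=w + clicks)

# Two-hop neighbours of a query reveal related queries that share clicked images.
q = "red sports car"
related = {n for img in G[q] for n in G[img] if G.nodes[n]["kind"] == "query" and n != q}
print(related)   # {'ferrari'}: linked through the shared clicked image img_001.jpg
```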

Visual feature learning, which aims to construct an effective feature representation for visual data, has a wide range of applications in computer vision. It is often posed as a problem of nonnegative matrix factorization (NMF), which constructs a linear representation for the data. Although NMF is typically parallelized for efficiency, traditional parallelization methods suffer from either an expensive computation or a high runtime memory usage.
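
For reference, a compact non-parallel sketch (NumPy, classic Lee-Seung multiplicative updates) of the NMF objective being parallelized: approximate a nonnegative data matrix X by W @ H, so each data column is a nonnegative linear combination of the learned basis W.

```python
# Sketch: plain (non-parallel) NMF via multiplicative updates, X ≈ W @ H with W, H >= 0.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 60))            # hypothetical nonnegative data: 100 features x 60 samples
k, eps = 10, 1e-9                    # number of basis vectors; eps avoids division by zero

W = rng.random((X.shape[0], k))
H = rng.random((k, X.shape[1]))
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

print("relative reconstruction error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```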

As an important and challenging problem in machine learning and computer vision, multilabel classification is typically implemented in a max-margin multilabel learning framework, where the inter-label separability is characterized by the sample-specific classification margins between labels. However, the conventional multilabel classification approaches are usually incapable of effectively exploring the intrinsic inter-label correlations as well as jointly modeling the interactions between inter-label correlations and multilabel classification. To address this issue, we propose a multilabel classification framework based on a joint learning approach called label graph learning (LGL) driven weighted Support Vector Machine (SVM).
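
As a point of reference only, and not the proposed LGL-driven weighted SVM, here is a hedged scikit-learn sketch of the conventional max-margin multilabel baseline: one independent linear SVM per label, which is precisely the setup that ignores inter-label correlations.

```python
# Sketch: conventional one-vs-rest multilabel SVM baseline (treats labels independently).
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, Y = make_multilabel_classification(n_samples=300, n_features=20, n_classes=5, random_state=0)
clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000)).fit(X, Y)

pred = clf.predict(X[:3])
print(pred)   # each row is a binary label vector; per-label margins are learned separately
```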

In this paper, we propose a visual tracker based on a metric-weighted linear representation of appearance. In order to capture the interdependence of different feature dimensions, we develop two online distance metric learning methods using proximity comparison information and structured output learning. The learned metric is then incorporated into a linear representation of appearance.
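
A hypothetical NumPy sketch of the core algebra of a metric-weighted linear representation: reconstruct a candidate appearance y from template features D under a Mahalanobis metric M = LᵀL by minimizing ||y - Dc||²_M + λ||c||², which has a closed-form solution. The metric and templates below are random placeholders rather than the learned ones from the paper.

```python
# Sketch: metric-weighted linear coding of an appearance vector against templates.
import numpy as np

rng = np.random.default_rng(0)
d, n_templates, lam = 32, 10, 0.1

D = rng.normal(size=(d, n_templates))      # template (dictionary) features, one per column
y = D @ rng.normal(size=n_templates) + 0.05 * rng.normal(size=d)   # observed candidate

L = rng.normal(size=(d, d)) * 0.1 + np.eye(d)   # placeholder learned transform
M = L.T @ L                                      # Mahalanobis metric (PSD by construction)

# argmin_c ||y - D c||_M^2 + lam ||c||^2  =>  (D^T M D + lam I) c = D^T M y
c = np.linalg.solve(D.T @ M @ D + lam * np.eye(n_templates), D.T @ M @ y)
residual = (y - D @ c) @ M @ (y - D @ c)        # metric-weighted reconstruction error
print("code:", np.round(c, 3))
print("metric-weighted residual:", float(residual))
```

In a tracker of this style, the metric-weighted residual would score how well a candidate region matches the appearance templates.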

Cross-modal ranking is a research topic that is imperative to many applications involving multimodal data. Discovering a joint representation for multimodal data and learning a ranking function are essential for boosting cross-media retrieval.

Motion capture is an important technique with a wide range of applications in areas such as computer vision, computer animation, film production, and medical rehabilitation. Even with professional motion capture systems, the acquired raw data inevitably contain noise and outliers. Numerous methods have been developed to denoise the data, but the problem remains challenging due to the high complexity of human motion and the diversity of real-life situations.
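
Not the paper's method, only a hedged baseline sketch (SciPy) of trajectory denoising: smooth each joint coordinate over time with a Savitzky-Golay filter, which suppresses high-frequency jitter but, unlike learned approaches, cannot cope with gross outliers or missing markers.

```python
# Sketch: per-coordinate Savitzky-Golay smoothing of a noisy mocap joint trajectory.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 240)                           # 240 frames of a hypothetical capture
clean = np.stack([np.sin(t), np.cos(t), 0.5 * t], axis=1)    # (frames, xyz) for one joint
noisy = clean + 0.03 * rng.normal(size=clean.shape)          # sensor jitter

# window_length must be odd; filtering is applied along the time axis.
denoised = savgol_filter(noisy, window_length=15, polyorder=3, axis=0)

print("RMSE noisy   :", float(np.sqrt(np.mean((noisy - clean) ** 2))))
print("RMSE denoised:", float(np.sqrt(np.mean((denoised - clean) ** 2))))
```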
