Publications by authors named "Wengang Zhou"

In multi-agent reinforcement learning (MARL), the trial-and-error learning paradigm based on multiple agents requires massive interactions to produce training samples, significantly increasing both the cost and the difficulty of training. Enhancing data efficiency is therefore a core issue in MARL. However, because each agent observes only partial information, a world model built from an ego perspective gives little consideration to agent interactions and coordination, which is the main obstacle to improving the data efficiency of currently proposed MARL methods.

When applying reinforcement learning (RL) to real-world visual tasks, two major challenges must be considered: sample inefficiency and limited generalization. To address these challenges, previous works focus primarily on learning semantic information from the visual state to improve sample efficiency, but they do not explicitly learn other valuable aspects, such as spatial information. Moreover, they improve generalization by learning representations that are invariant to alterations of task-irrelevant variables, without considering task-relevant variables.

Article Synopsis
  • A flight experiment was conducted to assess the cognitive load involved in different turning tasks during simulated flight, using real pilot training modules as a basis.
  • Researchers collected heart rate variability (HRV) and flight data while classifying turning behaviors into climbing, descending, and level flight turns, applying machine learning techniques for analysis.
  • The study found that certain HRV indicators are linked to cognitive load, with an LSTM-Attention model achieving a high recognition score, suggesting potential improvements for pilot training and overall flight safety.

Recently, there have been efforts to improve sign language recognition performance by designing self-supervised learning methods. However, these methods learn frame by frame and so capture only limited information from sign pose data, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework that mines rich context via spatial-temporal consistency from two distinct perspectives and learns instance-discriminative representations for sign language recognition.

Electromyography (EMG) signal-based cross-subject gesture recognition methods reduce the influence of individual differences using transfer learning. These methods generally require calibration data collected from new subjects to adapt a model pre-trained on existing subjects. However, collecting calibration data is usually tedious and inconvenient for new subjects.

Camera lenses often suffer from optical aberrations, causing radial distortion in the captured images. Such images follow a clear and general physical distortion model. However, existing solutions under-utilize this rich geometric prior and leave the formulation of an effective prediction target under-explored.

As a classical feature compression technique, quantization is usually coupled with inverted indices for scalable image retrieval. Most quantization methods explicitly divide the feature space into Voronoi cells and quantize the feature vectors in each cell to centroids learned from the data distribution. However, it is difficult for Voronoi decomposition to achieve a discriminative space partition for semantic image retrieval.
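
The coupling of quantization with an inverted index described above can be illustrated with a minimal sketch. It is not the method of the listed article: the centroids are fixed by hand rather than learned from data, and the names `quantize` and `build_inverted_index` are hypothetical, chosen only for this example.

```python
import math


def quantize(vector, centroids):
    """Return the index of the nearest centroid, i.e. the Voronoi cell of `vector`."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def build_inverted_index(vectors, centroids):
    """Map each centroid index to the ids of the vectors quantized to it."""
    index = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        index[quantize(vec, centroids)].append(vec_id)
    return index


# Toy 2-D data; real systems use high-dimensional features and learned centroids.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(1.0, 2.0), (9.0, 8.0), (0.5, -1.0)]
index = build_inverted_index(vectors, centroids)
```

At query time, only the list stored under the query's own cell needs to be scanned, which is what makes the scheme scalable.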

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e.

Hand gestures play a crucial role in the expression of sign language. Current deep learning-based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resources and suffer from limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework with a model-aware hand prior incorporated.

With the aging of the population, the incidence of dysphagia has gradually increased and become a major clinical and public health issue. Early screening of dysphagia in high-risk populations is crucial for identifying its risk factors and carrying out effective interventions and health management in advance. In this study, the current epidemiology, hazards, risk factors, and preventive and therapeutic measures of dysphagia were comprehensively reviewed, and a literature review of screening instruments commonly used globally was conducted, focusing on their intended populations, main indicators, descriptions, and characteristics.

Existing unsupervised person re-identification methods rely only on visual clues to match pedestrians under different cameras. Since visual data is inherently susceptible to occlusion, blur, clothing changes, etc., a promising solution is to introduce heterogeneous data to compensate for the shortcomings of visual data.

In pixel-based reinforcement learning (RL), the states are raw video frames, which are mapped into a hidden representation before being fed to a policy network. To improve the sample efficiency of state representation learning, the most prominent recent work is based on contrastive unsupervised representation. Observing that consecutive video frames in a game are highly correlated, to further improve data efficiency, we propose a new algorithm, i.

Recent advances in video object detection have been characterized by the exploration of temporal coherence across frames to enhance the object detector. Nevertheless, previous solutions either rely on additional inputs (e.g.

In visual tracking, how to effectively model the target appearance using limited prior information remains an open problem. In this paper, we leverage an ensemble of diverse models to learn manifold representations for robust object tracking. The proposed ensemble framework includes a shared backbone network for efficient feature extraction and multiple head networks for independent predictions.

Person re-identification is the crucial task of identifying pedestrians of interest across multiple surveillance camera views. For person re-identification, a pedestrian is usually represented with features extracted from a rectangular image region that inevitably contains the scene background, which introduces ambiguity in distinguishing different pedestrians and degrades accuracy. Thus, we propose an end-to-end foreground-aware network that discriminates the foreground from the background by learning a soft mask for person re-identification.

Cross-modal retrieval aims to identify relevant data across different modalities. In this work, we focus on cross-modal retrieval between images and text sentences, which is formulated as similarity measurement for each image-text pair. To this end, we propose a Cross-modal Relation Guided Network (CRGN) to embed image and text into a latent feature space.

Correlation filters (CF) have received considerable attention in visual tracking because of their computational efficiency. Leveraging deep features via off-the-shelf CNN models (e.g.

Increasing maize grain yield has been a major focus of both plant breeding and genetic engineering to meet the global demand for food, feed, and industrial uses. We report that increasing and extending expression of a maize MADS-box transcription factor gene, , under the control of a moderate-constitutive maize promoter, results in maize plants with increased plant growth, photosynthesis capacity, and nitrogen utilization. Molecular and biochemical characterization of transgenic plants demonstrated that their enhanced agronomic traits are associated with elevated plant carbon assimilation, nitrogen utilization, and plant growth.

Vision-based sign language translation (SLT) is a challenging task due to the complicated variations of facial expressions, gestures, and articulated poses involved in sign linguistics. As a weakly supervised sequence-to-sequence learning problem, in SLT there are usually no exact temporal boundaries of actions. To adequately explore temporal hints in videos, we propose a novel framework named Hierarchical deep Recurrent Fusion (HRF).

Image retrieval has achieved remarkable improvements with the rapid progress in visual representation and indexing techniques. Given a query image, search engines are expected to retrieve relevant results, of which the top-ranked short list is of most value to users. However, it is challenging to measure retrieval quality on the fly without direct user feedback.

Deep convolutional neural networks (CNNs) have been widely and successfully applied in many computer vision tasks, such as classification, detection, and semantic segmentation. For image retrieval, while off-the-shelf CNN features from models trained for classification have been shown to be promising, it remains a challenge to learn features specifically oriented toward instance retrieval. Given the great success of the low-level SIFT feature in image retrieval and its complementary nature to the semantic-aware CNN feature, in this paper, we propose to embed the SIFT feature into the CNN feature with a Siamese structure in a learning-based paradigm.

Binary hashing approaches the approximate nearest neighbor search problem by transferring the data to Hamming space under an explicit or implicit distance-preserving constraint. With a compact data representation, binary hashing identifies the approximate nearest neighbors via very efficient Hamming distance computation. In this paper, we propose a generic hashing framework with a new linear pairwise distance-preserving objective and a pointwise constraint.
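
The Hamming distance search mentioned above can be sketched as follows. This is a minimal illustration that assumes the binary codes are already given as Python integers. It does not implement the proposed hashing framework or its objective, and the function names `hamming_distance` and `nearest_neighbors` are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")  # XOR marks differing bits; count the ones


def nearest_neighbors(query: int, codes: list, k: int) -> list:
    """Return the indices of the k codes closest to `query` in Hamming space."""
    ranked = sorted(range(len(codes)), key=lambda i: hamming_distance(query, codes[i]))
    return ranked[:k]


# Toy 8-bit codes; a learned hash function would produce these from real data.
codes = [0b10110010, 0b10110011, 0b01001101, 0b10010010]
query = 0b10110000
top2 = nearest_neighbors(query, codes, k=2)
```

Each comparison reduces to one XOR and a bit count, which is why search over compact binary codes is so cheap compared with floating-point distance computation.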

In content-based image retrieval, the SIFT feature and features from deep convolutional neural networks (CNNs) have demonstrated promising performance. To fully exploit both visual features in a unified framework for effective and efficient retrieval, we propose a collaborative index embedding method that implicitly integrates their index matrices. We formulate the index embedding as an optimization problem from the perspective of neighborhood sharing and solve it with an alternating index update scheme.

As the unique identifier of a vehicle, the license plate is a key clue for uncovering speeding vehicles or those involved in hit-and-run accidents. However, the snapshot of a speeding vehicle captured by a surveillance camera is frequently blurred by fast motion, to the point of being unrecognizable even to humans. The observed plate images are usually low in resolution and suffer severe loss of edge information, which poses a great challenge to existing blind deblurring methods.

Fine-grained visual categorization is an emerging research area and has been attracting growing attention recently. Due to the large inter-class similarity and intra-class variance, it is extremely challenging to recognize objects in fine-grained domains. A traditional spatial pyramid matching model could obtain desirable results for the basic-level category classification by weak alignment, but may easily fail in fine-grained domains, since the discriminative features are extremely localized.
