Publications by authors named "Gaoyun An"

Scene Graph Generation (SGG) aims to detect visual relationships in an image. However, due to long-tailed bias, SGG is far from practical. Most methods rely heavily on co-occurrence statistics to generate a balanced dataset, so they are dataset-specific and easily affected by noise.

The feature pyramid has been widely used in many visual tasks, such as fine-grained image classification, instance segmentation, and object detection, and has achieved promising performance. Although many algorithms exploit features from different levels to construct the feature pyramid, they usually treat these features equally and do not investigate in depth the inherent complementary advantages of different-level features. In this article, to learn a pyramid feature with robust representational ability for action recognition, we propose a novel collaborative and multilevel feature selection network (FSNet) that applies feature selection and aggregation on multilevel features according to action context.

Convolutional neural networks (CNNs) have proven effective at learning spatiotemporal representations for action recognition in videos. However, most traditional action recognition algorithms do not employ an attention mechanism to focus on the essential parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge for action recognition.

In this paper, we present a novel two-layer video representation for human action recognition that employs a hierarchical group sparse encoding technique and spatio-temporal structure. In the first layer, a new sparse encoding method named locally consistent group sparse coding (LCGSC) is proposed to make full use of the motion and appearance information of local features. The LCGSC method not only encodes the global layouts of features within the same video-level groups, but also captures the local correlations between them, yielding expressive sparse representations of video sequences.
