Publications by authors named "Alex Jinpeng Wang"

Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning image and text pairs, paving the way for a wide range of cross-modal learning tasks. Nevertheless, we observe that VLP models often lack the visual grounding and localization capabilities that are crucial for many downstream tasks, such as visual reasoning. In response, we introduce a novel Position-guided Text Prompt (PTP) paradigm to bolster the visual grounding abilities of cross-modal models trained with VLP.
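The abstract does not spell out how the position-guided prompts are formed, but the core idea can be illustrated with a minimal sketch: the image is divided into an N x N grid, each detected object is assigned to a grid block, and a fill-in-the-blank style prompt ties the block position to the object. The grid size, the prompt template, and the helper names below are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the Position-guided Text Prompt (PTP) idea.
# Assumptions (not from the abstract): a 3x3 grid, row-major block
# indexing, and a hypothetical "The block {P} has a {O}." template.
from typing import List, Tuple

def block_index(cx: float, cy: float, width: int, height: int, n: int = 3) -> int:
    """Map an object's center point to one of n*n grid blocks (row-major)."""
    col = min(int(cx / width * n), n - 1)
    row = min(int(cy / height * n), n - 1)
    return row * n + col

def build_ptp_prompts(objects: List[Tuple[str, float, float]],
                      width: int, height: int, n: int = 3) -> List[str]:
    """Create one position-guided prompt per detected object.

    `objects` is a list of (label, center_x, center_y) tuples, e.g. from an
    off-the-shelf detector run before pre-training.
    """
    prompts = []
    for label, cx, cy in objects:
        idx = block_index(cx, cy, width, height, n)
        prompts.append(f"The block {idx} has a {label}.")
    return prompts

# Example: two detections on a 300x300 image with a 3x3 grid.
print(build_ptp_prompts([("dog", 50.0, 60.0), ("ball", 250.0, 280.0)], 300, 300))
# -> ['The block 0 has a dog.', 'The block 8 has a ball.']
```

During pre-training, such prompts could be appended to the caption with either the position or the object masked out, turning grounding into a fill-in-the-blank objective; the masking strategy here is likewise an assumption about how the paradigm is applied.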
