Publications by authors named "Ruopeng Gao"

Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, the existing encoder-only grounding framework (e.g.
