IEEE Trans Pattern Anal Mach Intell
February 2024
Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. However, the existing encoder-only grounding framework (e.g.