Bilinear pooling (BLP) refers to a family of operations recently developed for fusing features from different modalities, predominantly for visual question answering (VQA) models. Successive BLP techniques have yielded higher performance at lower computational expense, yet they have also drifted further from the original motivation for bilinear models and are now justified empirically by task performance. Furthermore, despite significant success in text-image fusion in VQA, BLP has not yet gained similar prominence in video question answering (video-QA). Though BLP methods continue to perform well on video tasks when fusing vision and non-textual features, BLP has recently been overshadowed by other techniques for fusing vision and textual features in video-QA. We aim to add a new perspective to the empirical and motivational drift in BLP. We take a step back and discuss the motivational origins of BLP, highlighting the often-overlooked parallels to neurological theories (Dual Coding Theory and the Two-Stream Model of Vision). We seek to carefully and experimentally ascertain the empirical strengths and limitations of BLP as a multimodal text-vision fusion technique in video-QA using two models (the TVQA baseline and the heterogeneous-memory-enhanced 'HME' model) and four datasets (TVQA, TGif-QA, MSVD-QA, and EgoVQA). We examine the impact both of simply replacing feature concatenation in the existing models with BLP and of modifying the TVQA baseline to accommodate BLP, a variant we name the 'dual-stream' model. We find that our relatively simple integration of BLP does not increase, and mostly harms, performance on these video-QA benchmarks. Drawing on our results, on recent work applying BLP to video-QA, and on recently proposed theoretical taxonomies of multimodal fusion, we offer insight into why BLP-driven performance gains may be more difficult to achieve on video-QA benchmarks than in earlier VQA models.
We share our perspective on, and suggest solutions for, the key issues we identify with BLP techniques for multimodal fusion in video-QA. We look beyond the empirical justification of BLP techniques and propose both alternatives and improvements to multimodal fusion by drawing neurological inspiration from Dual Coding Theory and the Two-Stream Model of Vision. We qualitatively highlight the potential for neurological inspiration in video-QA by identifying the relative abundance of psycholinguistically 'concrete' words in the vocabularies of each text component (questions and answers) of the four video-QA datasets we experiment with.
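To make the contrast between feature concatenation and bilinear pooling concrete, here is a minimal NumPy sketch. It assumes the textbook form of a full bilinear map, in which each output is a weighted sum over all pairwise products of text and visual features. The function names, dimensions, and random weights are illustrative assumptions, and the sketch does not reproduce any specific BLP technique or model evaluated in the paper.

```python
import numpy as np


def concat_fusion(text_vec, vis_vec):
    """Fuse by concatenation: the two modalities sit side by side, with no interaction terms."""
    return np.concatenate([text_vec, vis_vec])


def bilinear_fusion(text_vec, vis_vec, weights):
    """Fuse with a full bilinear map.

    weights has shape (out_dim, text_dim, vis_dim); output k is
    sum_ij weights[k, i, j] * text_vec[i] * vis_vec[j].
    """
    return np.einsum("i,kij,j->k", text_vec, weights, vis_vec)


# Illustrative (hypothetical) dimensions and random values.
rng = np.random.default_rng(0)
text_vec = rng.standard_normal(4)         # e.g. a question embedding
vis_vec = rng.standard_normal(3)          # e.g. a frame embedding
weights = rng.standard_normal((2, 4, 3))  # stands in for learned parameters

fused_concat = concat_fusion(text_vec, vis_vec)               # shape (7,)
fused_bilinear = bilinear_fusion(text_vec, vis_vec, weights)  # shape (2,)
```

The weight tensor grows with the product of the two input dimensions, which is one reason a full bilinear map is expensive at realistic feature sizes and is consistent with the lower computational expense the abstract attributes to successive BLP techniques.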

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9202627
DOI: http://dx.doi.org/10.7717/peerj-cs.974

Publication Analysis

Top Keywords

blp: 14
blp techniques: 12
multimodal fusion: 12
video-qa: 10
bilinear pooling: 8
video-qa empirical: 8
motivational drift: 8
question answering: 8
vqa models: 8
dual coding: 8

