Benchmark datasets play an important role in evaluating Natural Language Understanding (NLU) models. However, shortcuts (unwanted biases in the benchmark datasets) can undermine how well these datasets reveal models' real capabilities. Since shortcuts vary in coverage, productivity, and semantic meaning, it is challenging for NLU experts to systematically understand and avoid them when creating benchmark datasets. In this paper, we develop a visual analytics system, ShortcutLens, to help NLU experts explore shortcuts in NLU benchmark datasets. The system allows users to conduct multi-level exploration of shortcuts. Specifically, Statistics View helps users grasp statistics such as the coverage and productivity of shortcuts in the benchmark dataset. Template View employs hierarchical and interpretable templates to summarize different types of shortcuts. Instance View allows users to check the corresponding instances covered by the shortcuts. We conduct case studies and expert interviews to evaluate the effectiveness and usability of the system. The results demonstrate that ShortcutLens helps users better understand benchmark dataset issues through shortcuts, inspiring them to create challenging and pertinent benchmark datasets.
DOI: http://dx.doi.org/10.1109/TVCG.2023.3236380
J Mol Model
January 2025
National Institute of Technology Durgapur, Durgapur, India.
Context: Protein secondary structure prediction is essential for understanding protein function and characteristics and can also facilitate drug discovery. Traditional methods for experimentally determining protein structures are both time-consuming and costly. Computational biology offers a viable alternative by predicting protein structures from their sequences.
Bioinformatics
January 2025
School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, 130117, Jilin, China.
Motivation: Most drugs begin their journey inside the body by binding to the right target proteins, which is why considerable effort has been devoted to predicting drug-target binding during drug development. However, the inherent diversity among molecular properties, coupled with limited training data availability, poses challenges to the accuracy and generalizability of these methods beyond their training domain.
PLoS One
January 2025
School of Foundation Courses, Chongqing Institute of Engineering, Chongqing, China.
Link prediction in heterogeneous networks is an active research topic in the field of complex network science. Recognizing the limitations of existing methods, which often overlook the varying contributions of different local structures within these networks, this study introduces a novel algorithm named SW-Metapath2vec. This algorithm enhances the embedding learning process by assigning weights to meta-path traces generated through random walks and translates the potential connections between nodes into the cosine similarity of embedded vectors.
PLoS Comput Biol
December 2024
School of Biological Sciences (SBS), Nanyang Technological University, Singapore, Singapore.
The 3D structure of RNA critically influences its functionality, and understanding this structure is vital for deciphering RNA biology. Experimental methods for determining RNA structures are labour-intensive, expensive, and time-consuming. Computational approaches have emerged as valuable tools, leveraging physics-based principles and machine learning to predict RNA structures rapidly.
PLoS Comput Biol
January 2025
Microsoft Research, Cambridge, Massachusetts, United States of America.
Machine learning sequence-function models for proteins could enable significant advances in protein engineering, especially when paired with state-of-the-art methods to select new sequences for property optimization and/or model improvement. Such methods (Bayesian optimization and active learning) require calibrated estimations of model uncertainty. While studies have benchmarked a variety of deep learning uncertainty quantification (UQ) methods on standard and molecular machine-learning datasets, it is not clear if these results extend to protein datasets.