The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with the current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics, not only confirming the consequences of randomly splitting databases on model assessment, but also extending those repercussions to model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
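The idea of restraining similarity between sets can be illustrated with a minimal sketch. It is not SpanSeq's algorithm: the k-mer Jaccard similarity, the 0.5 threshold, the single-linkage clustering and the function names (`kmers`, `jaccard`, `cluster_by_similarity`, `split_clusters`) are all assumptions chosen for brevity. It only shows the general principle of assigning whole groups of similar sequences to the same partition, so that no similar pair ends up split between development and test data.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(seqs, threshold=0.5, k=3):
    """Single-linkage clustering via union-find (illustrative assumption).

    Any pair of sequences whose similarity reaches the threshold is
    linked, and linked sequences end up in the same cluster.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, n_parts=2):
    """Greedily place each whole cluster in the currently smallest partition."""
    parts = [[] for _ in range(n_parts)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(parts, key=len).extend(cluster)
    return parts


# Toy data: two pairs of near-identical sequences plus one unrelated sequence.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAATG", "GGGGCCCCAA"]
clusters = cluster_by_similarity(seqs)
parts = split_clusters(clusters)
```

A purely random split of the same five sequences could place `ACGTACGTAC` in the development set and `ACGTACGTAA` in the test set, which is the kind of leakage the abstract describes. The all-pairs comparison used here is quadratic in the number of sequences and would not scale to the database sizes the paper targets.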
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327874 | PMC |
| http://dx.doi.org/10.1093/nargab/lqae106 | DOI Listing |
Sci Rep
December 2024
KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia.
Analyzing microbial samples remains computationally challenging due to their diversity and complexity. The lack of robust de novo protein function prediction methods exacerbates the difficulty in deriving functional insights from these samples. Traditional prediction methods, dependent on homology and sequence similarity, often fail to predict functions for novel proteins and proteins without known homologs.
Sci Rep
December 2024
Department of Informatics, University of Hamburg, Hamburg, Germany.
Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space.
Sci Rep
December 2024
Department of Computer Science, Birzeit University, P.O. Box 14, Birzeit, West Bank, Palestine.
Accurate classification of logos is a challenging task in image recognition due to variations in logo size, orientation, and background complexity. Deep learning models, such as VGG16, have demonstrated promising results in handling such tasks. However, their performance is highly dependent on optimal hyperparameter settings, whose fine-tuning is both labor-intensive and time-consuming.
Sci Rep
December 2024
Faculty of Dental Medicine and Oral Health Sciences, McGill University, Montreal, Canada.
Accurate diagnosis of oral lesions, early indicators of oral cancer, is a complex clinical challenge. Recent advances in deep learning have demonstrated potential in supporting clinical decisions. This paper introduces a deep learning model for classifying oral lesions, focusing on accuracy, interpretability, and reducing dataset bias.
Sci Rep
December 2024
Department of Civil Engineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
Deep learning models are widely used for traffic forecasting on freeways due to their ability to learn complex temporal and spatial relationships. In particular, graph neural networks, which integrate graph theory into deep learning, have become popular for modeling traffic sensor networks. However, traditional graph convolutional networks (GCNs) face limitations in capturing long-range spatial correlations, which can hinder accurate long-term predictions.