PeerJ Comput Sci
November 2024
Cancer remains one of the leading causes of death globally. New immunotherapies that harness the patient's immune system to fight cancer show promise, but their development requires analyzing the diversity of immune cells called T-cells. T-cells have receptors that recognize and bind to cancer cells.
View Article and Find Full Text PDFMed Biol Eng Comput
August 2024
Understanding protein structures is crucial for various bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information.
View Article and Find Full Text PDFProtein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed.
View Article and Find Full Text PDFThe classification and prediction of T-cell receptors (TCRs) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors.
View Article and Find Full Text PDFMed Biol Eng Comput
October 2023
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than any virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making.
View Article and Find Full Text PDFBiological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, such as viruses, etc., and building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become global pandemics.
View Article and Find Full Text PDFThe emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences.
View Article and Find Full Text PDFThe rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome-millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is of hence utmost importance to design a framework for testing and benchmarking the robustness of these ML models.
View Article and Find Full Text PDFThe massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database.
View Article and Find Full Text PDFWith the rapid spread of COVID-19 worldwide, viral genomic data are available in the order of millions of sequences on public databases such as GISAID. This creates a unique opportunity for analysis toward the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected-the patterns found between viral variants and geographical location surely being an important part of this analysis.
View Article and Find Full Text PDFIEEE/ACM Trans Comput Biol Bioinform
December 2023
Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between k-mers (sub-sequences of length k) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences.
View Article and Find Full Text PDFMed Biol Eng Comput
July 2022
Because of the rapid spread of COVID-19 to almost every part of the globe, huge volumes of data and case studies have been made available, providing researchers with a unique opportunity to find trends and make discoveries like never before by leveraging such big data. This data is of many different varieties and can be of different levels of veracity, e.g.
View Article and Find Full Text PDFThe study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic-an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets.
View Article and Find Full Text PDFIn the recent years, there has been an increasing amount of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique that allows for capturing more detailed information regarding the specific genetic mutations on each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples.
View Article and Find Full Text PDF