Publications by authors named "Maulik Shukla"

As genomic and related data continue to expand, research biologists are often hampered by the computational hurdles required to analyze their data. The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Centers (BRC) to assist researchers with their analysis of genome sequence and other omics-related data. Recently, the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD), and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs merged to form the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) at https://www.

It is well-known that cancers of the same histology type can respond differently to a treatment. Thus, computational drug response prediction is of paramount importance for both preclinical drug screening studies and clinical treatment design. To build drug response prediction models, treatment response data need to be generated through screening experiments and used as input to train the prediction models.

Cancer is a heterogeneous disease in that tumors of the same histology type can respond differently to a treatment. Anti-cancer drug response prediction is of paramount importance for both drug development and patient treatment design. Although various computational methods and data have been used to develop drug response prediction models, it remains a challenging problem due to the complexities of cancer mechanisms and cancer-drug interactions.

Article Synopsis
  • Since late 2020, new SARS-CoV-2 variants have frequently appeared, showing differences that may help them evade immunity from past infections.
  • The Early Detection group within the NIH's SARS-CoV-2 program utilizes bioinformatics to track these variants' emergence, spread, and traits, highlighting important ones for further study.
  • Since April 2021, this group has successfully prioritized variants each month, assisting NIH researchers by providing timely data on SARS-CoV-2 evolution for guiding experiments.

Patient-derived xenografts (PDXs) are an appealing platform for preclinical drug studies. A primary challenge in modeling drug response prediction (DRP) with PDXs and neural networks (NNs) is the limited number of drug response samples. We investigate a multimodal neural network (MM-Net) and data augmentation for DRP in PDXs.

Background: With cancer as one of the leading causes of death worldwide, accurate primary tumor type prediction is critical in identifying genetic factors that can inhibit or slow tumor progression. There have been efforts to categorize primary tumor types with gene expression data using machine learning, and more recently with deep learning, in the last several years.

Methods: In this paper, we developed four 1-dimensional (1D) Convolutional Neural Network (CNN) models to classify RNA-seq count data as one of 17 highly represented primary tumor types or 32 primary tumor types regardless of imbalanced representation.

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.

Since the beginning of the COVID-19 pandemic, SARS-CoV-2 has demonstrated its ability to rapidly and continuously evolve, leading to the emergence of thousands of different sequence variants, many with distinctive phenotypic properties. Fortunately, the broad application of next generation sequencing (NGS) across the globe has produced a wealth of SARS-CoV-2 genome sequences, offering a comprehensive picture of how this virus is evolving so that accurate diagnostics, reliable therapeutics, and prophylactic vaccines against COVID-19 can be developed and maintained. The millions of SARS-CoV-2 sequences deposited into genomic sequencing databases, including GenBank, BV-BRC, and GISAID, are annotated with the dates and geographic locations of sample collection, and can be aligned to and compared with the Wuhan-Hu-1 reference genome to extract their constellation of nucleotide and amino acid substitutions.
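
The comparison step described above, reading substitutions off a sequence aligned to a reference, can be sketched in a few lines. This is a minimal illustration, not the pipeline used by any of the named databases: the function name, the toy sequences, and the choice to skip gaps and `N` calls are all assumptions made for the example.

```python
def substitutions(reference: str, aligned: str) -> list[str]:
    """List nucleotide substitutions of `aligned` relative to `reference`.

    Both strings must come from the same pairwise alignment (equal length).
    Gaps ('-') and ambiguous calls ('N') in the sample are skipped, so only
    unambiguous base changes are reported (an assumption of this sketch).
    """
    if len(reference) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    ref_pos = 0  # 1-based coordinate in the ungapped reference
    for ref_base, qry_base in zip(reference.upper(), aligned.upper()):
        if ref_base == "-":
            continue  # insertion relative to the reference: no coordinate
        ref_pos += 1
        if qry_base in ("-", "N") or qry_base == ref_base:
            continue
        calls.append(f"{ref_base}{ref_pos}{qry_base}")
    return calls


# Toy alignment (made up): one substitution, one insertion, one ambiguous call.
toy_reference = "ATGGCC-TA"
toy_sample = "ATGACCGTN"
toy_calls = substitutions(toy_reference, toy_sample)
```

Amino acid substitutions would need an extra translation step per gene, which is left out here.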

The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Center (BRC) program to assist researchers with analyzing the growing body of genome sequence and other omics-related data. In this report, we describe the merger of the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD) and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs to form the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) https://www.bv-brc.

Article Synopsis
  • The emergence of new SARS-CoV-2 variants threatens the effectiveness of immunity from previous infections or vaccinations.
  • To tackle this issue, the NIH launched the SARS-CoV-2 Assessment of Viral Evolution (SAVE) program for real-time assessment of variant risks that might impact transmission and vaccine efficacy.
  • The program focuses on gathering and analyzing data on emerging variants and their effects on immunity, using animal models, while also addressing future challenges in monitoring rapidly evolving viruses.

The ARTIC Network provides a common resource of PCR primer sequences and recommendations for amplifying SARS-CoV-2 genomes. The initial tiling strategy was developed with the reference genome Wuhan-01, and subsequent iterations have addressed areas of low amplification and sequence drop out. Recently, a new version (V4) was released, based on new variant genome sequences, in response to the realization that some V3 primers were located in regions with key mutations.

To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets.

Antimicrobial resistance (AMR) is a major global health threat that affects millions of people each year. Funding agencies worldwide and the global research community have expended considerable capital and effort tracking the evolution and spread of AMR by isolating and sequencing bacterial strains and performing antimicrobial susceptibility testing (AST). For the last several years, we have been capturing these efforts by curating data from the literature and data resources and building a set of assembled bacterial genome sequences that are paired with laboratory-derived AST data.

Convolutional neural networks (CNNs) have been successfully used in many applications where important information about data is embedded in the order of features, such as speech and imaging. However, most tabular data do not assume a spatial relationship between features, and thus are unsuitable for modeling using CNNs. To meet this challenge, we develop a novel algorithm, image generator for tabular data (IGTD), to transform tabular data into images by assigning features to pixel positions so that similar features are close to each other in the image.
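
The idea of placing similar features at nearby pixels can be illustrated with a small greedy search. The sketch below is a simplified stand-in, not the published IGTD algorithm: the distance measure (1 minus absolute Pearson correlation), the rank-based error, the pairwise-swap search, and all function names are assumptions chosen to keep the example short.

```python
import itertools
import math


def feature_distance(col_a, col_b):
    """1 - |Pearson correlation| between two feature columns (assumed measure)."""
    n = len(col_a)
    mean_a, mean_b = sum(col_a) / n, sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 1.0  # constant column: treat as unrelated
    return 1.0 - abs(cov / math.sqrt(var_a * var_b))


def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order):
        result[k] = rank
    return result


def assign_features_to_pixels(columns, n_rows, n_cols, sweeps=10):
    """Return (assignment, error), where assignment[f] is the pixel index of feature f."""
    n = len(columns)
    if n != n_rows * n_cols:
        raise ValueError("need exactly one feature per pixel")
    pixels = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    pairs = list(itertools.combinations(range(n), 2))
    feat_rank = dict(zip(pairs, ranks(
        [feature_distance(columns[i], columns[j]) for i, j in pairs])))
    pix_rank = dict(zip(pairs, ranks(
        [math.dist(pixels[i], pixels[j]) for i, j in pairs])))

    def error(assignment):
        # Mismatch between feature-distance ranks and pixel-distance ranks.
        return sum(
            abs(feat_rank[(i, j)]
                - pix_rank[tuple(sorted((assignment[i], assignment[j])))])
            for i, j in pairs)

    assignment = list(range(n))
    best = error(assignment)
    for _ in range(sweeps):
        improved = False
        for i, j in pairs:
            # Try swapping the pixels of two features; keep the swap if it helps.
            assignment[i], assignment[j] = assignment[j], assignment[i]
            trial = error(assignment)
            if trial < best:
                best, improved = trial, True
            else:
                assignment[i], assignment[j] = assignment[j], assignment[i]
        if not improved:
            break
    return assignment, best


# Toy table (made up): four feature columns, five samples, mapped to a 2x2 image.
toy_columns = [[1, 2, 3, 4, 5], [5, 1, 4, 2, 3], [2, 4, 6, 8, 11], [3, 3, 1, 5, 2]]
toy_start = assign_features_to_pixels(toy_columns, 2, 2, sweeps=0)
toy_layout = assign_features_to_pixels(toy_columns, 2, 2)
```

Once an assignment is found, each sample's feature values can be written into the image at the assigned pixel positions and passed to a CNN.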

Background: Motivated by the size and availability of cell line drug sensitivity data, researchers have been developing machine learning (ML) models for predicting drug response to advance cancer treatment. As drug sensitivity studies continue generating drug response data, a common question is whether the generalization performance of existing prediction models can be further improved with more training data.

Methods: We utilize empirical learning curves for evaluating and comparing the data scaling properties of two neural networks (NNs) and two gradient boosting decision tree (GBDT) models trained on four cell line drug screening datasets.
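
One common way to summarize an empirical learning curve is to fit a power law to error versus training set size and extrapolate. The sketch below assumes that functional form, which the text above does not specify; the function name and the synthetic numbers are illustrative only and do not come from the study.

```python
import math


def fit_power_law(train_sizes, errors):
    """Least-squares fit of error ~ scale * size**exponent in log-log space."""
    xs = [math.log(m) for m in train_sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope  # (scale, exponent)


# Synthetic curve generated from a known power law, not real screening data.
toy_sizes = [1000, 2000, 4000, 8000, 16000]
toy_errors = [2.0 * m ** -0.3 for m in toy_sizes]

scale, exponent = fit_power_law(toy_sizes, toy_errors)
# Extrapolated error if the training set were doubled once more.
predicted_error = scale * 32000 ** exponent
```

A more negative fitted exponent means the model's error falls faster as data are added, which gives one way to compare the scaling behavior of different model families.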

We sequenced the genomes of 5,085 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains causing two coronavirus disease 2019 (COVID-19) disease waves in metropolitan Houston, TX, an ethnically diverse region with 7 million residents. The genomes were from viruses recovered in the earliest recognized phase of the pandemic in Houston and from viruses recovered in an ongoing massive second wave of infections. The virus was originally introduced into Houston many times independently.

Transfer learning, which transfers patterns learned on a source dataset to a related target dataset for constructing prediction models, has been shown to be effective in many applications. In this paper, we investigate whether transfer learning can be used to improve the performance of anti-cancer drug response prediction models. Previous transfer learning studies for drug response prediction focused on building models to predict the response of tumor cells to a specific drug treatment.

A growing number of studies are using machine learning models to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. Although these studies are showing promise, the models are typically trained using features derived from comprehensive sets of AMR genes or whole genome sequences and may not be suitable for use when genomes are incomplete. In this study, we explore the possibility of predicting AMR phenotypes using incomplete genome sequence data.

We sequenced the genomes of 5,085 SARS-CoV-2 strains causing two COVID-19 disease waves in metropolitan Houston, Texas, an ethnically diverse region with seven million residents. The genomes were from viruses recovered in the earliest recognized phase of the pandemic in Houston and from viruses recovered in an ongoing massive second wave of infections. The virus was originally introduced into Houston many times independently.

The co-expression extrapolation (COXEN) method has been successfully used in multiple studies to select genes for predicting the response of tumor cells to a specific drug treatment. Here, we enhance the COXEN method to select genes that are predictive of the efficacies of multiple drugs for building general drug response prediction models that are not specific to a particular drug. The enhanced COXEN method first ranks the genes according to their prediction power for each individual drug and then takes a union of top predictive genes of all the drugs, among which the algorithm further selects genes whose co-expression patterns are well preserved between cancer cases for building prediction models.
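
The three steps described above (rank genes per drug, take the union of top genes, keep genes with preserved co-expression) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: predictive power is approximated by absolute Pearson correlation with drug response, preservation is scored by correlating a gene's co-expression profile across two datasets, and every name and number is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for short or constant inputs."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def select_genes(expr_a, responses, expr_b, top_k, keep):
    """expr_a, expr_b: {gene: expression per sample} for two datasets with shared genes.
    responses: {drug: response per sample in dataset A}."""
    candidates = set()
    for response in responses.values():
        # Step 1: rank genes by |correlation| with this drug's response.
        ranked = sorted(expr_a, key=lambda g: abs(pearson(expr_a[g], response)),
                        reverse=True)
        # Step 2: add this drug's top genes to the union.
        candidates.update(ranked[:top_k])
    candidates = sorted(candidates)

    def preservation(gene):
        # Step 3: compare the gene's co-expression profile across the datasets.
        others = [g for g in candidates if g != gene]
        profile_a = [pearson(expr_a[gene], expr_a[g]) for g in others]
        profile_b = [pearson(expr_b[gene], expr_b[g]) for g in others]
        return pearson(profile_a, profile_b)

    return sorted(candidates, key=preservation, reverse=True)[:keep]


# Toy data (made up): four genes, five samples per dataset, two drugs.
toy_expr_a = {"g1": [1, 2, 3, 4, 5], "g2": [5, 3, 4, 1, 2],
              "g3": [2, 1, 2, 1, 2], "g4": [1, 3, 2, 3, 1]}
toy_expr_b = {"g1": [2, 3, 5, 6, 9], "g2": [7, 5, 5, 2, 1],
              "g3": [1, 2, 1, 3, 1], "g4": [2, 4, 1, 5, 2]}
toy_responses = {"drugX": [2, 4, 6, 8, 10], "drugY": [1, 3, 2, 3, 1]}

toy_candidates = select_genes(toy_expr_a, toy_responses, toy_expr_b, top_k=2, keep=10)
toy_selected = select_genes(toy_expr_a, toy_responses, toy_expr_b, top_k=2, keep=2)
```

Taking the union before the co-expression filter is what makes the selected gene set shared across drugs rather than specific to one.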

Purpose: The Veterans Health Administration (VHA) is the largest cancer care provider in the United States, with the added challenge that the percentage of its patients with cancer who live in rural areas is more than twice the national average. The VHA established the National Precision Oncology Program in 2016 to implement and standardize the practice of precision oncology across the diverse VHA system.

Methods: Tumor or peripheral blood specimens from veterans with advanced solid tumors who were eligible for treatment were submitted for next-generation sequencing (NGS) at two commercial laboratories.

A synthesis of phenotypic and quantitative genomic traits is provided for bacteria and archaea, in the form of a scripted, reproducible workflow that standardizes and merges 26 sources. The resulting unified dataset covers 14 phenotypic traits, 5 quantitative genomic traits, and 4 environmental characteristics for approximately 170,000 strain-level and 15,000 species-aggregated records. It spans all habitats including soils, marine and fresh waters and sediments, host-associated and thermal.

The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org).

Background: The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity.

Results: We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on residual neural networks for encoding features as well as predicting tumor growth, our model explains 94% of the response variance.
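
"Explains 94% of the response variance" is most naturally read as a coefficient of determination (R squared) of 0.94, although the text above does not state the exact metric. Under that assumption, the quantity can be computed as below; the function name and the toy growth values are illustrative and are not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Toy growth values (made up) for a handful of drug-pair and cell-line combinations.
toy_observed = [10.0, 35.0, 60.0, 80.0, 95.0]
toy_predicted = [14.0, 30.0, 63.0, 76.0, 97.0]
toy_score = r_squared(toy_observed, toy_predicted)
```

A score of 1.0 means perfect prediction, and 0.0 means the model does no better than always predicting the mean observed response.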
