Publications by authors named "Doina Caragea"

The genomes of the fungus that causes blast diseases on diverse grass species, including major crops, have indispensable core-chromosomes and may contain supernumerary chromosomes, also known as mini-chromosomes. These mini-chromosomes are speculated to provide effector gene mobility, and may transfer between strains. To understand the biology of mini-chromosomes, it is valuable to be able to detect whether a strain possesses a mini-chromosome.

Increased global production of sorghum has the potential to meet many of the demands of a growing human population. Developing automation technologies for field scouting is crucial for long-term and low-cost production. Since 2013, sugarcane aphid (SCA) Melanaphis sacchari (Zehntner) has become an important economic pest causing significant yield loss across the sorghum production region in the United States.

Protein N-linked glycosylation is an important post-translational mechanism in Homo sapiens, playing essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation.
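
The sequon rule described above can be illustrated with a short pattern match. This is a minimal sketch, not the predictor from the article: the function name `find_sequons` and the example sequence are hypothetical, and a match only marks a candidate site, since the sequon is necessary but not sufficient for glycosylation.

```python
import re

# N-X-[S/T] sequon: asparagine, any residue except proline, then serine or threonine.
# A lookahead is used so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein: str) -> list[int]:
    """Return the 0-based position of the asparagine in each N-X-[S/T] match.

    Hypothetical helper for illustration; it finds candidate sites only and
    says nothing about whether a site is actually glycosylated.
    """
    return [m.start() for m in SEQUON.finditer(protein.upper())]


if __name__ == "__main__":
    # Made-up sequence: "NGS" matches, "NPT" is excluded by the proline rule,
    # and "NNT" / "NTS" overlap.
    print(find_sequons("MNGSANPTNNTS"))
```

A learned predictor would then have to decide which of these candidate positions are true glycosylation sites.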

Chalk, an undesirable grain quality trait in rice, is primarily formed due to high temperatures during the grain-filling process. Owing to the disordered starch granule structure, air spaces, and low amylose content, chalky grains break easily during milling, thereby lowering head rice recovery and market price. The availability of multiple QTLs associated with grain chalkiness and related attributes provided us with an opportunity to perform a meta-analysis and identify candidate genes and their alleles contributing to enhanced grain quality.

Background: Rice is a major staple food crop for more than half the world's population. As the global population is expected to reach 9.7 billion by 2050, increasing the production of high-quality rice is needed to meet the anticipated increased demand.

Protein N-linked glycosylation is a post-translational modification that plays an important role in a myriad of biological processes. Computational prediction approaches serve as complementary methods for the characterization of glycosylation sites. Most of the existing predictors for N-linked glycosylation utilize the information that the glycosylation site occurs at the N-X-[S/T] sequon, where X is any amino acid except proline.

Phosphorylation, which is mediated by protein kinases and opposed by protein phosphatases, is an important post-translational modification that regulates many cellular processes, including cellular metabolism, cell migration, and cell division. Due to its essential role in cellular physiology, a great deal of attention has been devoted to identifying sites of phosphorylation on cellular proteins and understanding how modification of these sites affects their cellular functions. This has led to the development of several computational methods designed to predict sites of phosphorylation based on a protein's primary amino acid sequence.

Stomatal density (SD) and stomatal complex area (SCA) are important traits that regulate gas exchange and abiotic stress response in plants. Despite sorghum (Sorghum bicolor) adaptation to arid conditions, the genetic potential of stomata-related traits remains unexplored due to challenges in available phenotyping methods. Hence, identifying loci that control stomatal traits is fundamental to designing strategies to breed sorghum with optimized stomatal regulation.

Purpose: Children with acute lymphoblastic leukemia (ALL) are treated according to risk-based protocols defined by the Children's Oncology Group (COG). Alignment between real-world clinical practice and protocol milestones is not widely understood. Aggregate deidentified electronic health record (EHR) data offer a useful resource to evaluate real-world clinical practice.

Escherichia coli O103, harbored in the hindgut and shed in the feces of cattle, can be enterohemorrhagic (EHEC), enteropathogenic (EPEC), or putative non-pathotype. The genetic diversity within the O103 serogroup, particularly that of virulence gene profiles, is likely to be broad, considering the wide range in severity of illness. However, virulence descriptions of the E.

The enteropathogenic (EPEC) pathotype represents a minor proportion of the O103 strains shed in the feces of feedlot cattle. The draft genome sequences of 13 strains of EPEC O103 are reported here. The availability of the genome sequences will help in the assessment of genetic diversity and virulence potential of bovine EPEC O103.

The enterohemorrhagic pathotype represents a minor proportion of the O103 strains shed in the feces of cattle. We report here the genome sequences of 43 strains of enterohemorrhagic (EHEC) O103:H2 isolated from feedlot cattle feces. The genomic analysis will provide information on the genetic diversity and virulence potential of bovine EHEC O103.

Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we proposed the use of a community detection approach to construct low-dimensional feature sets for nucleotide sequence classification.

Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use the abundant labeled data from a source domain together with the limited labeled data and, optionally, the unlabeled data from the target domain to train classifiers in a domain adaptation setting.

Background: Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement.

Gall-forming arthropods are highly specialized herbivores that, in combination with their hosts, produce extended phenotypes with unique morphologies [1]. Many are economically important, and others have improved our understanding of ecology and adaptive radiation [2]. However, the mechanisms that these arthropods use to induce plant galls are poorly understood.

The relationship between aphids and their host plants is thought to be functionally analogous to plant-pathogen interactions. Although virulence effector proteins that mediate plant defenses are well-characterized for pathogens such as bacteria, oomycetes, and nematodes, equivalent molecules in aphids and other phloem-feeders are poorly understood. A dual transcriptomic-proteomic approach was adopted to generate a catalog of candidate effector proteins from the salivary glands of the pea aphid, Acyrthosiphon pisum.

High-accuracy sequence classification often requires the use of higher-order Markov models (MMs). However, the number of MM parameters increases exponentially with the range of direct dependencies between sequence elements, thereby increasing the risk of overfitting when the data set is limited in size. We present abstraction augmented Markov models (AAMMs) that effectively reduce the number of numeric parameters of kth-order MMs by successively grouping strings of length k (i.
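
The exponential growth in parameters can be made concrete with a small calculation. This is a sketch under stated assumptions, not part of the AAMM method: it counts only the free transition probabilities of a standard kth-order Markov model over an alphabet of a given size, and the function name `markov_free_parameters` is hypothetical.

```python
def markov_free_parameters(alphabet_size: int, order: int) -> int:
    """Free transition parameters of a standard kth-order Markov model.

    There are alphabet_size ** order possible contexts, and each context has a
    distribution over the next symbol with alphabet_size - 1 free values
    (the last one is fixed because the probabilities sum to 1).
    Initial-state parameters are ignored in this illustration.
    """
    if alphabet_size < 1 or order < 0:
        raise ValueError("alphabet_size must be >= 1 and order must be >= 0")
    return alphabet_size ** order * (alphabet_size - 1)


if __name__ == "__main__":
    # Nucleotide alphabet (4 symbols) versus amino acid alphabet (20 symbols).
    for k in (1, 3, 5):
        print(k, markov_free_parameters(4, k), markov_free_parameters(20, k))
```

Grouping contexts that behave alike shrinks the `alphabet_size ** order` factor, which is the quantity that drives overfitting on small data sets.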

Background: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data.

Alternative splicing is a mechanism for generating different gene transcripts (called isoforms) from the same genomic sequence. In this paper, we explore the predictive power of a large set of diverse gene features that have been experimentally shown to have an effect on alternative splicing. We use such features to build support vector machine classifiers for predicting alternatively spliced exons.

Background: Termites (Isoptera) are eusocial insects whose colonies consist of morphologically and behaviorally specialized castes of sterile workers and soldiers, and reproductive alates. Previous studies on eusocial insects have indicated that caste differentiation and behavior are underlain by differential gene expression. Although much is known about gene expression in the honey bee, Apis mellifera, termites remain relatively understudied in this regard.

BeetleBase (http://www.beetlebase.org) has been updated to provide more comprehensive genomic information for the red flour beetle Tribolium castaneum.

We present the first prototype of INDUS (Intelligent Data Understanding System), a federated, query-centric system for information integration and knowledge acquisition from distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to enable a user to view a collection of such data sources (regardless of location, internal structure, and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology.

This paper motivates and precisely formulates the problem of learning from distributed data; describes a general strategy for transforming traditional machine learning algorithms into algorithms for learning from distributed data; demonstrates the application of this strategy to devise algorithms for decision tree induction from distributed data; and identifies the conditions under which the algorithms in the distributed setting are superior to their centralized counterparts in terms of time and communication complexity. The resulting algorithms are provably exact in that the decision tree constructed from distributed data is identical to that obtained in the centralized setting. Some natural extensions leading to algorithms for learning from heterogeneous distributed data and learning under privacy constraints are outlined.
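
One way such exactness can arise is sketched below. It is an illustration under assumptions that the abstract does not state: the data are split by rows across sites, each site reports only joint counts of attribute value and class label, and the split criterion is conditional entropy. All names and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def site_counts(rows, attribute):
    """Joint counts of (attribute value, class label) computed locally at one site.

    Each row is a (features dict, label) pair; only the counts leave the site.
    """
    return Counter((features[attribute], label) for features, label in rows)


def conditional_entropy(counts):
    """H(class | attribute), computed from joint counts alone."""
    total = sum(counts.values())
    by_value = Counter()
    for (value, _label), n in counts.items():
        by_value[value] += n
    return -sum(
        n / total * log2(n / by_value[value])
        for (value, _label), n in counts.items()
    )


# Two hypothetical sites holding disjoint rows of the same table.
site_a = [({"wind": "weak"}, "yes"), ({"wind": "strong"}, "no")]
site_b = [({"wind": "weak"}, "yes"), ({"wind": "weak"}, "no")]

# Summing per-site counts gives the same counts as pooling the raw rows,
# so any split criterion computed from counts matches the centralized one.
merged = site_counts(site_a, "wind") + site_counts(site_b, "wind")
central = site_counts(site_a + site_b, "wind")
```

Because the criterion depends on the data only through these counts, a tree grown from the merged counts makes the same split choices as one grown on the pooled rows, while each site transmits counts rather than records.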
