There is evidence that DNA breathing (spontaneous opening of the DNA strands) plays a relevant role in the interactions of DNA with other molecules, and in particular in the transcription process. Therefore, having physical models that can predict these openings is of interest. However, this source of information has not previously been used for either transcription start site (TSS) or promoter prediction.
In the construction of QSAR models for the prediction of molecular activity, feature selection is a common task aimed at improving the results and understanding of the problem. The selection of features allows elimination of irrelevant and redundant features, reduces the effect of dimensionality problems, and improves the generalization and interpretability of the models. In many feature selection applications, such as those based on ensembles of feature selectors, it is necessary to combine different selection processes.
The maximum common property similarity (MCPhd) method is presented using descriptors as a new approach to determine the similarity between two chemical compounds or molecular graphs. This method uses the concept of maximum common property arising from the concept of maximum common substructure and is based on the electrotopographic state index for atoms. A new algorithm to quantify the similarity values of chemical structures based on the presented maximum common property concept is also developed in this paper.
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of quantitative structure-activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models.
Feature selection is one of the most frequent tasks in data mining applications. Its ability to remove useless and redundant features, improve classification performance, and yield knowledge about a given problem makes feature selection a common first step in data mining. In many feature selection applications, we need to combine the results of different feature selection processes.
The soil-borne pathogen has a worldwide distribution and a plethora of hosts of agronomic value. Molecular analysis of virulence processes can identify targets for disease control. In this work, we compared the global gene transcription profile of random T-DNA insertion mutant strain D-10-8F, which exhibits reduced virulence and alterations in microsclerotium formation and polar growth, with that of the wild-type strain.
IEEE/ACM Trans Comput Biol Bioinform
January 2022
Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. The best approaches use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary as it is unlikely that a single classifier can solve this problem with the best possible performance.
In the construction of activity prediction models, feature ranking methods are a useful mechanism for ordering features by their significance when developing predictive models. This paper studies the influence of feature rankers in the construction of molecular activity prediction models; for this purpose, a comparative study of fourteen ranking methods for feature selection was conducted. The activity prediction models were constructed using four well-known classifiers and a wide collection of datasets.
The prediction of adverse drug reactions in the discovery of new medicines is highly challenging. In the task of predicting the adverse reactions of chemical compounds, information about different targets is often available. Although we can focus on every adverse drug reaction prediction separately, multilabel approaches have been proven useful in many research areas for taking advantage of the relationship among the targets.
Prototype selection is one of the most common preprocessing tasks in data mining applications. The vast amounts of data that we must handle in practical problems render the removal of noisy, redundant or useless instances a convenient first step for any real-world application. Many algorithms have been proposed for prototype selection.
In this work, the application of a new strategy called NWFE ensemble (nonparametric weighted feature extraction ensemble) method is proposed. Subspace-supervised projections based on NWFE are incorporated into the construction of ensembles of classifiers to facilitate the correct classification of wrongly classified instances without being detrimental to the overall performance of the ensemble. The performance of NWFE is investigated with a c-Jun N-terminal kinase-3 inhibitor benchmark dataset using different chemical compound representation models.
J Comput Aided Mol Des
November 2018
Feature selection is commonly used as a preprocessing step to machine learning for improving learning performance, lowering computational complexity and facilitating model interpretation. This paper proposes the application of boosting feature selection to improve the classification performance of standard feature selection algorithms evaluated for the prediction of P-gp inhibitors and substrates. Two well-known classification algorithms, decision trees and support vector machines, were used to classify the chemical compounds.
Plant pathogens of the genus Verticillium pose a threat to many important crops worldwide. They are soil-borne fungi which invade the plant systemically, causing wilt symptoms. We functionally characterized the APSES family transcription factor Vst1 in two Verticillium species, V.
Background: Recognizing the different functional parts of genes, such as promoters, translation initiation sites, donors, acceptors and stop codons, is a fundamental task of many current studies in Bioinformatics. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. However, with the rapid evolution of our ability to collect genomic information, it has been shown that combining many sources of evidence is fundamental to the success of any recognition task.
IEEE Trans Neural Netw Learn Syst
February 2017
The k-nearest neighbor (k-NN) classifier is one of the most widely used methods of classification due to several interesting features, including good generalization and easy implementation. Although simple, it is usually able to match and even outperform more sophisticated and complex methods. One of the problems with this approach is fixing the appropriate value of k.
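As a rough illustration of the problem of fixing k, the sketch below picks k by leave-one-out accuracy on a toy dataset. This is a generic baseline and not the method proposed in the paper; the function names (`knn_predict`, `loo_accuracy`, `select_k`) and the data are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query.

    train is a list of (point, label) pairs; points are tuples of floats.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy of k-NN on data for a given k."""
    correct = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out the i-th instance
        if knn_predict(rest, point, k) == label:
            correct += 1
    return correct / len(data)


def select_k(data, candidates):
    """Return the candidate k with the highest leave-one-out accuracy.

    Ties are resolved in favor of the earliest candidate.
    """
    return max(candidates, key=lambda k: loo_accuracy(data, k))


# Toy data (illustrative only): two well-separated clusters.
data = [
    ((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
    ((1.0, 1.0), "b"), ((1.1, 1.0), "b"), ((1.0, 1.1), "b"),
]
best_k = select_k(data, [1, 3, 5])
```

On this toy set a large k performs badly because each held-out point is outvoted by the other cluster, which is the kind of sensitivity to k that the abstract alludes to.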
Motivation: The recognition of translation initiation sites and stop codons is a fundamental part of any gene recognition program. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. These methods all use two classes, one of positive instances and another one of negative instances that are constructed using sequences from the whole genome.
Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly produced in many fields of research. At the same time, most of the recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. For many reasons, this abundance of variables significantly harms classification or recognition tasks.
IEEE Trans Cybern
February 2013
In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields.
IEEE Trans Neural Netw
February 2009
In this paper, we approach the problem of constructing ensembles of classifiers from the point of view of instance selection. Instance selection is aimed at obtaining a subset of the instances available for training capable of achieving, at least, the same performance as the whole training set. In this way, instance selection algorithms try to keep the performance of the classifiers while reducing the number of instances in the training set.
In this paper we propose a boosting approach to the random subspace method (RSM) to achieve improved performance and avoid some of the major drawbacks of RSM. RSM is a successful method for classification. However, the random selection of inputs, its source of success, can also be a major problem.
IEEE Trans Syst Man Cybern B Cybern
June 2006
This paper presents a hybrid evolutionary algorithm (EA) to solve nonlinear-regression problems. Although EAs have proven their ability to explore large search spaces, they are comparatively inefficient in fine tuning the solution. This drawback is usually avoided by means of local optimization algorithms that are applied to the individuals of the population.
IEEE Trans Pattern Anal Mach Intell
June 2006
We present a new method of multiclass classification based on the combination of the one-vs-all method and a modification of the one-vs-one method. The proposed combination of one-vs-all and one-vs-one methods reinforces the strengths of both methods. A study of the behavior of the two methods identifies some of the sources of their failure.
This paper presents a new method for regression based on the evolution of a type of feed-forward neural network whose basis function units are products of the inputs raised to real-valued powers. These nodes are usually called product units. The main advantage of product units is their capacity for implementing higher order functions.
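For concreteness, a product unit as described above can be written as follows. The symbols (n inputs x_i, real-valued exponents w_{ji} for unit j) are notation assumed here for illustration, not taken from the paper.

```latex
% Output of product unit j over inputs x_1, ..., x_n,
% with real-valued exponents w_{ji} acting as the unit's weights.
y_j = \prod_{i=1}^{n} x_i^{\,w_{ji}}, \qquad w_{ji} \in \mathbb{R}
```

For example, with exponents (2, 1) the unit computes x_1^2 x_2, a third-order term that a single additive sigmoidal unit cannot represent directly.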
In this work we present a new approach to the crossover operator in the genetic evolution of neural networks. The most widely used evolutionary computation paradigm for neural network evolution is evolutionary programming. This paradigm is usually preferred due to the problems caused by the application of crossover to neural network evolution.