Background: Mass spectrometry has become a standard method for characterizing the proteomic profile of cell or tissue samples. To take full advantage of tandem mass spectrometry (MS/MS) techniques in large-scale protein characterization studies, robust and consistent data analysis procedures are crucial. In this work we present a machine-learning-based protocol for identifying correct peptide-spectrum matches in Sequest database search results, improving on previously published protocols.

Results: The developed model improves on published machine learning classification procedures by 6%, as measured by the area under the ROC curve. Further, we show how the developed model can be presented as an interpretable tree of additive rules, thereby effectively removing the 'black-box' notion often associated with machine learning classifiers and allowing comparison with expert rules of thumb. Finally, we propose and test a method for extending the developed peptide identification protocol to give probabilistic estimates of the presence of a given protein.
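For readers unfamiliar with the metric used above, the following is a minimal sketch of how the area under the ROC curve can be computed for a set of scored peptide-spectrum matches. It uses the rank interpretation of the AUC (the probability that a randomly chosen correct match scores higher than a randomly chosen incorrect one). The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a correct peptide-spectrum match, 0 for an incorrect one.
    scores: classifier scores, higher meaning "more likely correct".
    Ties between a correct and an incorrect match count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect match")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, for illustration only).
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.4, 0.5, 0.7, 0.3, 0.6]   # one correct match is ranked low
scores_b = [0.9, 0.65, 0.5, 0.7, 0.3, 0.6]  # all correct matches ranked on top

print(roc_auc(labels, scores_a))
print(roc_auc(labels, scores_b))
```

The pairwise form is quadratic in the number of matches, which is fine for a small illustration; for large search result sets a rank-based implementation would be the more usual choice.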

Conclusions: We demonstrate the construction of a high-accuracy classification model for Sequest search results from MS/MS spectra obtained using MALDI ionization. The developed model performs well in identifying correct peptide-spectrum matches and is easily extendable to the protein identification problem. The relative ease with which additional experimental parameters can be incorporated into the classification framework to give additional discriminatory power allows for future tailoring of the model to take advantage of information from specific instrument set-ups.
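The abstract does not state how peptide-level results are combined into a protein-level estimate. As a hedged illustration only, one common way to do this is to treat the peptide-spectrum matches assigned to a protein as independent and take the probability that at least one of them is correct. The function name `protein_probability` and the independence assumption are ours, not the authors'.

```python
from math import prod


def protein_probability(peptide_probs):
    """Probability that a protein is present, given per-peptide probabilities.

    Assumption (not from the paper): peptide identifications are independent,
    so the protein is absent only if every one of its peptide matches is wrong.
    """
    for p in peptide_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Two hypothetical peptide matches with probabilities 0.9 and 0.5:
# 1 - (0.1 * 0.5) = 0.95
print(protein_probability([0.9, 0.5]))
```

A protein with no supporting peptides gets probability 0 under this rule, and each additional confident peptide can only raise the estimate.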

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013103
DOI: http://dx.doi.org/10.1186/1471-2105-11-591
