Systematic comparison of ranking aggregation methods for gene lists in experimental results.

Bo Wang Andy Law Tim Regan Nicholas Parkinson Joby Cole Clark D Russell David H Dockrell Michael U Gutmann J Kenneth Baillie

Bioinformatics

Roslin Institute, University of Edinburgh, Edinburgh EH25 9RG, UK.

Published: October 2022

This study compares methods for combining gene lists from multiple biomedical studies, with the aim of making findings about biological processes or diseases more reliable.
The researchers evaluated ranking aggregation methods on both simulated and real genomic data to determine how they perform under varying conditions, such as heterogeneous data quality and the presence of unranked lists.
The findings are summarized in a flowchart, and an automated tool that selects an aggregation method based on data characteristics is available from a GitHub repository.

Motivation: A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. The gene lists resulting from a group of studies answering the same, or similar, questions can be combined by ranking aggregation methods to find a consensus or a more reliable answer. Because the properties of a dataset can influence the performance of an algorithm, a ranking aggregation method should be evaluated on the specific type of data before it is used. For gene lists, such evaluation is usually based on simulated data because real data lack a known truth. However, simulated datasets tend to be too small compared to experimental data and neglect key features, including heterogeneity of quality, relevance and the inclusion of unranked lists.

Results: In this study, a group of existing methods and their variations that are suitable for meta-analysis of gene lists are compared using simulated and real data. Simulated data were used to explore the performance of the aggregation methods under conditions that emulate common scenarios in real genomic data: heterogeneous quality, varying noise levels and a mix of unranked and ranked lists, with 20 000 possible entities. In addition to the evaluation with simulated data, a comparison using real genomic data on the SARS-CoV-2 virus, cancer (non-small cell lung cancer) and bacteria (macrophage apoptosis) was performed. We summarize the results of our evaluation in a simple flowchart for selecting a ranking aggregation method, and in an automated implementation that uses the meta-analysis by information content (MAIC) algorithm to infer heterogeneity of data quality across input datasets.
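To make the idea of aggregating a mix of ranked and unranked gene lists concrete, the following is a minimal sketch of a Borda-style scoring scheme. It is not MAIC and not one of the implementations evaluated in the study; the function name `aggregate_gene_lists`, the fixed score of 0.5 for unranked entries and the placeholder gene names are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_gene_lists(ranked_lists, unranked_lists=()):
    """Combine gene lists into one consensus ranking (illustrative only).

    Ranked list of length n: the gene at 0-based position i scores (n - i) / n,
    so the top gene scores 1.0 and the last scores 1 / n.
    Unranked list: every gene receives the same score of 0.5, an arbitrary
    choice made for this sketch.
    """
    scores = defaultdict(float)
    for genes in ranked_lists:
        n = len(genes)
        for i, gene in enumerate(genes):
            scores[gene] += (n - i) / n
    for genes in unranked_lists:
        for gene in set(genes):  # count each gene once per unranked list
            scores[gene] += 0.5
    # Highest total score first; ties are broken alphabetically.
    return sorted(scores, key=lambda g: (-scores[g], g))


# Placeholder inputs, not data from the study.
ranked = [["geneA", "geneB", "geneC"], ["geneB", "geneA", "geneD"]]
unranked = [{"geneA", "geneD"}]
consensus = aggregate_gene_lists(ranked, unranked)
print(consensus)
```

In this toy input, geneA ranks first because it appears near the top of both ranked lists and also in the unranked list. A scheme like this treats every input list as equally trustworthy, which is exactly the assumption that heterogeneity of data quality calls into question.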

Availability and Implementation: The code for generating simulated data and for running the edited versions of the algorithms is available at https://github.com/baillielab/comparison_of_RA_methods. Code to perform an optimal selection of methods based on the results of this review, using the MAIC algorithm to infer the characteristics of an input dataset, can be downloaded here: https://github.com/baillielab/maic. An online service for running MAIC is available at https://baillielab.net/maic.

Supplementary Information: Supplementary data are available at Bioinformatics online.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9620830
DOI: http://dx.doi.org/10.1093/bioinformatics/btac621

