Publications by Limsoon Wong | LitMetric

Publications by authors named "Limsoon Wong"

π-HuB: the proteomic navigator of the human body.

Fuchu He Ruedi Aebersold Mark S Baker Xiuwu Bian Xiaochen Bo Limsoon Wong

Nature

December 2024

The human body contains trillions of cells, classified into specific cell types, with diverse morphologies and functions. In addition, cells of the same type can assume different states within an individual's body during their lifetime. Understanding the complexities of the proteome in the context of a human organism and its many potential states is a necessary requirement to understanding human biology, but these complexities can neither be predicted from the genome, nor have they been systematically measurable with available technologies.

Benchmarking recent computational tools for DNA-binding protein identification.

Xizi Luo Amadeus Song Yi Chi Andre Huikai Lin Tze Jet Ong Limsoon Wong

Brief Bioinform

November 2024

Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance.

Ten quick tips for ensuring machine learning model validity.

Wilson Wen Bin Goh Mohammad Neamul Kabir Sehwan Yoo Limsoon Wong

PLoS Comput Biol

September 2024

Artificial Intelligence (AI) and Machine Learning (ML) models are increasingly deployed on biomedical and health data to shed insights on biological mechanism, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices on how to check AI/ML models from 2 perspectives: the user and the developer.

A comparative analysis of ENCODE and Cistrome in the context of TF binding signal.

Stefano Perna Pietro Pinoli Stefano Ceri Limsoon Wong

BMC Genomics

August 2024

Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data - ENCODE and Cistrome - process the same biological samples in alternative ways, and their results are not always consistent.

How much can ChatGPT really help computational biologists in programming?

Chowdhury Rafeed Rahman Limsoon Wong

J Bioinform Comput Biol

April 2024

ChatGPT, a recently developed product by OpenAI, is successfully leaving its mark as a multi-purpose natural-language-based chatbot. In this paper, we are more interested in analyzing its potential in the field of computational biology. A major share of the work done by computational biologists these days involves coding up bioinformatics algorithms, analyzing data, creating pipelining scripts and even machine learning modeling and feature extraction.

A machine-learning exploration of the exposome from preconception in early childhood atopic eczema, rhinitis and wheeze development.

Yizhi Dong Hui Xing Lau Noor Hidayatul Aini Suaini Michelle Zhi Ling Kee Delicia Shu Qin Ooi Limsoon Wong

Environ Res

June 2024

Background: Most previous research on the environmental epidemiology of childhood atopic eczema, rhinitis and wheeze is limited in the scope of risk factors studied. Our study adopted a machine learning approach to explore the role of the exposome starting already in the preconception phase.

Methods: We performed a combined analysis of two multi-ethnic Asian birth cohorts, the Growing Up in Singapore Towards healthy Outcomes (GUSTO) and the Singapore PREconception Study of long Term maternal and child Outcomes (S-PRESTO) cohorts.

HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence.

Wenhao Jin Kristopher W Brannan Katannya Kapeli Samuel S Park Hui Qing Tan Limsoon Wong

Mol Cell

July 2023

RNA-binding proteins (RBPs) control RNA metabolism to orchestrate gene expression and, when dysfunctional, underlie human diseases. Proteome-wide discovery efforts predict thousands of RBP candidates, many of which lack canonical RNA-binding domains (RBDs). Here, we present a hybrid ensemble RBP classifier (HydRA), which leverages information from both intermolecular protein interactions and internal protein sequence patterns to predict RNA-binding capacity with unparalleled specificity and sensitivity using support vector machines (SVMs), convolutional neural networks (CNNs), and Transformer-based protein language models.

ProJect: a powerful mixed-model missing value imputation method.

Weijia Kong Bertrand Jern Han Wong Harvard Wai Hann Hui Kai Peng Lim Yulan Wang Limsoon Wong

Brief Bioinform

July 2023

Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data.

How missing value imputation is confounded with batch effects and what you can do about it.

Wilson Wen Bin Goh Harvard Wai Hann Hui Limsoon Wong

Drug Discov Today

September 2023

In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent.

Obstacles to effective model deployment in healthcare.

Wei Xin Chan Limsoon Wong

J Bioinform Comput Biol

April 2023

Despite an exponential increase in publications on clinical prediction models over recent years, the number of models deployed in clinical practice remains fairly limited. In this paper, we identify common obstacles that impede effective deployment of prediction models in healthcare, and investigate their underlying causes. We observe a key underlying cause behind most obstacles - the improper development and evaluation of prediction models.

ProInfer: An interpretable protein inference tool leveraging on biological networks.

Hui Peng Limsoon Wong Wilson Wen Bin Goh

PLoS Comput Biol

March 2023

In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (a peptide which maps to more than one candidate protein) and enhances the statistical confidence for proteins supported by both unique and ambiguous peptides.

Evaluating network-based missing protein prediction using p-values, Bayes Factors, and probabilities.

Wilson Wen Bin Goh Weijia Kong Limsoon Wong

J Bioinform Comput Biol

February 2023

Some prediction methods use probability to rank their predictions, while other prediction methods do not rank their predictions and instead use p-values to support them. This disparity renders direct cross-comparison of these two kinds of methods difficult. In particular, approaches such as the Bayes Factor upper Bound (BFB) for p-value conversion may not make correct assumptions for this kind of cross-comparison.

Accounting for treatment during the development or validation of prediction models.

Wei Xin Chan Limsoon Wong

J Bioinform Comput Biol

December 2022

Clinical prediction models are widely used to predict adverse outcomes in patients, and are often employed to guide clinical decision-making. Clinical data typically consist of patients who received different treatments. Many prediction modeling studies fail to account for differences in patient treatment appropriately, which results in the development of prediction models that show poor accuracy and generalizability.

Density-based detection of cell transition states to construct disparate and bifurcating trajectories.

Tian Lan Gyorgy Hutvagner Xuan Zhang Tao Liu Limsoon Wong

Nucleic Acids Res

November 2022

Tree- and linear-shaped cell differentiation trajectories have been widely observed in developmental biology and can also be inferred through computational methods from single-cell RNA-sequencing datasets. However, trajectories with complicated topologies such as loops, disparate lineages and bifurcating hierarchies remain difficult to infer accurately. Here, we introduce a density-based trajectory inference method capable of constructing diverse shapes of topological patterns including the most intriguing bifurcations.

Resolving missing protein problems using functional class scoring.

Bertrand Jern Han Wong Weijia Kong Limsoon Wong Wilson Wen Bin Goh

Sci Rep

July 2022

Despite technological advances in proteomics, incomplete coverage and inconsistency issues persist, resulting in "data holes". These data holes cause the missing protein problem (MPP), where relevant proteins are persistently unobserved, or sporadically observed across samples, hindering biomarker discovery and proper functional characterization. Network-based approaches can provide powerful solutions for resolving these issues.

EnsembleFam: towards more accurate protein family prediction in the twilight zone.

Mohammad Neamul Kabir Limsoon Wong

BMC Bioinformatics

March 2022

Background: Current protein family modeling methods like profile Hidden Markov Model (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions.

Results: We present a novel method EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations.

Are batch effects still relevant in the age of big data?

Wilson Wen Bin Goh Chern Han Yong Limsoon Wong

Trends Biotechnol

September 2022

Batch effects (BEs) are technical biases that may confound analysis of high-throughput biotechnological data. BEs are complex and effective mitigation is highly context-dependent. In particular, the advent of high-resolution technologies such as single-cell RNA sequencing presents new challenges.

Proteomic datasets of HeLa and SiHa cell lines acquired by DDA-PASEF and diaPASEF.

Zelu Huang Weijia Kong Bertrand Jernhan Wong Huanhuan Gao Tiannan Guo Limsoon Wong

Data Brief

April 2022

We present four datasets on proteomics profiling of HeLa and SiHa cell lines associated with the research described in the paper "PROTREC: A probability-based approach for recovering missing proteins based on biological networks" [1]. Proteins in each cell line were acquired by two different data acquisition methods. The first was Data Dependent Acquisition-Parallel Accumulation Serial Fragmentation (DDA-PASEF) and the second was Parallel Accumulation-Serial Fragmentation combined with data-independent acquisition (diaPASEF) [2], [3].

How doppelgänger effects in biomedical data confound machine learning.

Li Rong Wang Limsoon Wong Wilson Wen Bin Goh

Drug Discov Today

March 2022

Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers.

PROTREC: A probability-based approach for recovering missing proteins based on biological networks.

Weijia Kong Bertrand Jern Han Wong Huanhuan Gao Tiannan Guo Xianming Liu Limsoon Wong

J Proteomics

January 2022

A novel network-based approach for predicting missing proteins (MPs) is proposed here. This approach, PROTREC (short for PROtein RECovery), dominates existing network-based methods - such as Functional Class Scoring (FCS), Hypergeometric Enrichment (HE), and Gene Set Enrichment Analysis (GSEA) - across a variety of proteomics datasets derived from different proteomics data acquisition paradigms: higher PROTREC scores are much more closely correlated with higher recovery rates of MPs across sample replicates. The PROTREC score, unlike methods reporting p-values, can be directly interpreted as the probability that an unreported protein in a proteomic screen is actually present in the sample being screened.

Identifying collateral and synthetic lethal vulnerabilities within the DNA-damage response.

Pietro Pinoli Sriganesh Srihari Limsoon Wong Stefano Ceri

BMC Bioinformatics

May 2021

Background: A pair of genes is defined as synthetically lethal if defects in both cause the death of the cell but a defect in only one of the two is compatible with cell viability. Ideally, if A and B are two synthetic lethal genes, inhibiting B should kill cancer cells with a defect in A, and should have no effect on normal cells. Thus, synthetic lethality can be exploited for highly selective cancer therapies, which need to exploit differences between normal and cancer cells.

Extensions of the External Validation for Checking Learned Model Interpretability and Generalizability.

Sung Yang Ho Kimberly Phua Limsoon Wong Wilson Wen Bin Goh

Patterns (N Y)

November 2020

We discuss the validation of machine learning models, which is standard practice in determining model efficacy and generalizability. We argue that internal validation approaches, such as cross-validation and bootstrap, cannot guarantee the quality of a machine learning model due to potentially biased training data and the complexity of the validation procedure itself. For better evaluating the generalization ability of a learned model, we suggest leveraging on external data sources from elsewhere as validation datasets, namely external validation.

Avoid Oversimplifications in Machine Learning: Going beyond the Class-Prediction Accuracy.

Sung Yang Ho Limsoon Wong Wilson Wen Bin Goh

Patterns (N Y)

May 2020

Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or whether the selected or constructed features used are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning have been accomplished: two classifiers providing the same performance in one validation can disagree on many future validations.

Allowing mutations in maximal matches boosts genome compression performance.

Yuansheng Liu Limsoon Wong Jinyan Li

Bioinformatics

September 2020

Motivation: A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments.

Genetic source completeness of HIV-1 circulating recombinant forms (CRFs) predicted by multi-label learning.

Runbin Tang Zuguo Yu Yuanlin Ma Yaoqun Wu Yi-Ping Phoebe Chen Limsoon Wong

Bioinformatics

May 2021

Motivation: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells' reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmissions in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem.
