The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, . 72-74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range and the filters were applied prior to partitioning the data into training and testing subsets. Using several models and lncATLAS data, we show that this 'middle exclusion' protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409063 | PMC |
http://dx.doi.org/10.1093/nargab/lqae125 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!