Background: Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI's RefSeq project and subsequently processed by NCBI's eukaryotic annotation pipeline. Genome annotation results are affected by differences in available supporting evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol that integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.
Results: We assessed an ortholog dataset spanning 34 annotated vertebrate RefSeq genomes, including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian "core proteins" for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests designed to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins and thus operates independently of other quality metrics such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].
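As an illustration of how such a metric can operate directly on annotated proteins, the following is a minimal sketch assuming per-protein boolean test outcomes; the test names and the pass criterion are hypothetical stand-ins, not the RefSeq pipeline's actual test suite.

```python
# Minimal sketch of a protein-level annotation quality metric, assuming
# per-protein boolean test results. The three tests are illustrative
# placeholders; the paper's actual test suite is not reproduced here.
from dataclasses import dataclass

@dataclass
class ProteinTests:
    protein_id: str
    splicing_conserved: bool   # splicing conserved vs. reference genomes
    sequence_typical: bool     # sequence similarity within expected range
    domains_consistent: bool   # domain annotation agrees with orthologs

    def passes(self) -> bool:
        # A protein "passes" only if no test flags it as unusual.
        return (self.splicing_conserved
                and self.sequence_typical
                and self.domains_consistent)

def annotation_quality(proteins: list[ProteinTests]) -> float:
    """Fraction of analyzed proteins passing all tests (0.0-1.0)."""
    if not proteins:
        return 0.0
    return sum(p.passes() for p in proteins) / len(proteins)

if __name__ == "__main__":
    demo = [
        ProteinTests("NP_000001", True, True, True),
        ProteinTests("NP_000002", False, True, True),  # flagged: splicing
    ]
    print(f"quality = {annotation_quality(demo):.2f}")  # quality = 0.50
```

Because the metric is a simple pass fraction over protein-level tests, it can be computed for any annotated genome without transcript evidence or assembly statistics, which is the independence the abstract describes.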
Conclusions: Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology score well on these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations. It will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline and by flagging specific genes for manual curation.
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3882889
DOI: http://dx.doi.org/10.1186/1471-2164-14-654
Optom Vis Sci
January 2025
Johnson & Johnson MedTech (Vision), Irvine, California.
Significance: Optimal meibography utilization and interpretation are hindered by poor lid presentation, blurry images, image artifacts, and the challenges of applying clinical grading scales. These results, based on the largest image dataset analyzed to date, demonstrate the development of algorithms that provide standardized, real-time inference addressing all of these limitations.
Purpose: This study aimed to develop and validate an algorithmic pipeline to automate and standardize meibomian gland absence assessment and interpretation.
Int J Comput Assist Radiol Surg
January 2025
Computer Vision and Image Processing Lab., UofL, Louisville, KY, 40292, USA.
Purpose: This article introduces a novel deep learning approach that substantially improves the accuracy of colon segmentation even with limited data annotation, thereby enhancing the overall effectiveness of the CT colonography pipeline in clinical settings.
Methods: The proposed approach integrates 3D contextual information via guided sequential episodic training, in which a query CT slice is segmented by exploiting its previous labeled CT slice (i.e.
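The snippet above is truncated, but the core idea it names, conditioning each slice's segmentation on the previous slice's label, can be sketched as follows. This is a minimal assumed illustration in PyTorch; the tiny network, the propagation loop, and all shapes are placeholders, not the authors' architecture or training procedure.

```python
# Hedged sketch: sequential slice-by-slice segmentation in which each CT
# slice is segmented conditioned on the previous slice's predicted mask.
# The network and loop are illustrative, not the paper's actual method.
import torch
import torch.nn as nn

class GuidedSliceSegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: current slice (1 ch) + previous mask (1 ch) = 2 channels.
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, slice_2d: torch.Tensor, prev_mask: torch.Tensor):
        return self.net(torch.cat([slice_2d, prev_mask], dim=1))

def segment_volume(model: nn.Module, volume: torch.Tensor,
                   first_mask: torch.Tensor) -> torch.Tensor:
    """Propagate segmentation through a (D, 1, H, W) CT volume."""
    masks, prev = [], first_mask
    with torch.no_grad():
        for d in range(volume.shape[0]):
            prev = model(volume[d:d+1], prev)  # condition on previous mask
            masks.append(prev)
    return torch.cat(masks, dim=0)

# Toy usage: random 8-slice volume, empty initial mask.
model = GuidedSliceSegmenter()
vol = torch.rand(8, 1, 64, 64)
init = torch.zeros(1, 1, 64, 64)
print(segment_volume(model, vol, init).shape)  # torch.Size([8, 1, 64, 64])
```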
Comput Methods Programs Biomed
January 2025
Laberit, Avda. de Catalunya, 9, València, 46020, Spain.
Background And Objective: Despite significant investments in the normalization and standardization of Electronic Health Records (EHRs), free text remains the rule rather than the exception in clinical notes. The use of free text has implications for data reuse methods that support clinical research, since the query mechanisms used in cohort definition and patient matching are based mainly on structured data and clinical terminologies. This study aims to develop a method for the secondary use of clinical text by: (a) using Natural Language Processing (NLP) to tag clinical notes with biomedical terminology; and (b) designing an ontology that maps and classifies all the identified tags to various terminologies and allows phenotyping queries to be run.
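The tag-then-map mechanism described in (a) and (b) can be illustrated with a minimal sketch; the term dictionary, tag names, and code mappings below are hypothetical stand-ins, not the study's actual NLP pipeline, ontology, or terminology resources.

```python
# Minimal sketch of the tag-then-map idea: (a) tag free-text notes against
# a biomedical term dictionary, then (b) map tags to terminology codes so
# structured phenotyping queries can run over them. All terms, codes, and
# mappings here are hypothetical examples.
import re

# (a) Hypothetical term dictionary: surface form -> local tag.
TERM_DICT = {
    "type 2 diabetes": "T2D",
    "hypertension": "HTN",
}

# (b) Hypothetical ontology mapping: local tag -> terminology codes.
ONTOLOGY = {
    "T2D": {"SNOMED_CT": "44054006", "ICD10": "E11"},
    "HTN": {"SNOMED_CT": "38341003", "ICD10": "I10"},
}

def tag_note(note: str) -> list[str]:
    """Return the local tags whose surface forms occur in the note."""
    text = note.lower()
    return [tag for term, tag in TERM_DICT.items()
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

def phenotype_query(notes: dict[str, str],
                    code_system: str, code: str) -> list[str]:
    """Find patients whose notes carry a tag mapped to the given code."""
    hits = []
    for patient, note in notes.items():
        if any(ONTOLOGY.get(t, {}).get(code_system) == code
               for t in tag_note(note)):
            hits.append(patient)
    return hits

notes = {
    "p1": "History of Type 2 Diabetes, well controlled.",
    "p2": "Presents with hypertension; no diabetes.",
}
print(phenotype_query(notes, "ICD10", "E11"))  # ['p1']
```

The design point is that once tags are mapped to standard codes, cohort queries written against a terminology work over free text exactly as they would over structured fields.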
NAR Genom Bioinform
March 2025
Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Tomtebodavägen 23A, 17165 Solna, Sweden.
Understanding the role of transcription and transcription factors (TFs) in cellular identity and disease, such as cancer, is essential. However, comprehensive data resources for cell line-specific TF-to-target gene annotations are currently limited. To address this, we employed a straightforward method to define regulons that capture the cell-specific aspects of TF binding and transcript expression levels.
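The abstract does not detail the method, but one straightforward way to define a cell line-specific regulon, intersecting a TF's bound target genes with the genes actually expressed in that cell line, can be sketched as follows; the input tables, gene names, and expression threshold are assumptions for illustration only.

```python
# Hedged sketch: define a cell line-specific regulon as the TF's bound
# target genes that are also expressed in that cell line. Inputs and the
# TPM cutoff are illustrative assumptions, not the study's actual data.
def define_regulon(tf_targets: set[str],
                   expression: dict[str, float],
                   min_tpm: float = 1.0) -> set[str]:
    """Keep bound targets whose expression passes the cutoff."""
    return {g for g in tf_targets if expression.get(g, 0.0) >= min_tpm}

# Toy example: hypothetical binding targets and TPM values for one cell line.
tp53_targets = {"CDKN1A", "MDM2", "BAX", "GADD45A"}
cell_line_tpm = {"CDKN1A": 52.1, "MDM2": 8.4, "BAX": 0.3, "GADD45A": 12.9}

print(sorted(define_regulon(tp53_targets, cell_line_tpm)))
# ['CDKN1A', 'GADD45A', 'MDM2']  (BAX filtered: below 1.0 TPM)
```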
Appl Clin Inform
January 2025
Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany.
Objective: Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, de-identification of clinical unstructured data is a tedious and time-consuming task when done manually. Since transformer models can efficiently process and analyze large amounts of text data, our study aims to explore the impact of a large training dataset on the performance of this task.
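As an illustration of the de-identification task itself (not the study's model or training setup), the sketch below masks identifier spans flagged by a token-classification model; the model checkpoint name is a placeholder and the emitted label set is an assumption.

```python
# Hedged sketch of transformer-based de-identification: run a token-
# classification (NER) model over a note and mask every detected entity.
# The checkpoint is a hypothetical placeholder, not the study's model;
# any NER checkpoint emitting character offsets would work the same way.
from transformers import pipeline

ner = pipeline("token-classification",
               model="PLACEHOLDER/clinical-deid-model",  # hypothetical
               aggregation_strategy="simple")

def deidentify(note: str) -> str:
    """Replace each detected entity span with a [LABEL] placeholder."""
    entities = ner(note)
    # Replace from the end so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        note = note[:ent["start"]] + f"[{ent['entity_group']}]" + note[ent["end"]:]
    return note

print(deidentify("Pt. John Smith, seen in Essen on 03/05/2024, reports chest pain."))
# e.g. "Pt. [PER], seen in [LOC] on [DATE], reports chest pain."
```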