Aim: The purpose was to develop an analytical pipeline for specific gene analysis and biomarker discovery from next generation sequencing (NGS) data.
Materials & Methods: As a test case, the fliC gene reference sequences of 24 Salmonella enterica strains of 13 serotypes and NGS reads of 32 serovar Newport, 48 Montevideo and 115 Enteritidis outbreak isolates were retrieved from the National Center for Biotechnology Information database.
Results: Establishment of an analytical pipeline consisting of four steps: reference sequences retrieval and template sequence determination; NGS sequence reads retrieval; multiple sequence alignments and phylogenetic analysis; data mining and biomarker discovery.
The advancement of high-throughput screening technologies facilitates the generation of massive amount of biological data, a big data phenomena in biomedical science. Yet, researchers still heavily rely on keyword search and/or literature review to navigate the databases and analyses are often done in rather small-scale. As a result, the rich information of a database has not been fully utilized, particularly for the information embedded in the interactive nature between data points that are largely ignored and buried.
View Article and Find Full Text PDF