As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data of different types and from different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we identified duplicated content in GEO and ArrayExpress affecting ∼14% of our data: experiments fully or partially duplicated across independent data submissions, Affymetrix chips reused in several experiments, and chips reused within a single experiment. We present here the procedure that we established to filter such duplicates from Affymetrix data, and our procedure to identify potential future duplicates in RNA-Seq data.
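The abstract does not spell out how duplicates are detected, so the following is only an illustrative sketch of one common approach, not the procedure used for Bgee. It assumes each chip can be reduced to a list of `(probeset_id, signal)` pairs and that two chips count as duplicates when those pairs are identical after rounding. The names `chip_fingerprint` and `find_duplicate_chips`, the rounding precision, and the sample identifiers are all hypothetical.

```python
import hashlib
from collections import defaultdict


def chip_fingerprint(values, precision=4):
    """Return a content hash for one chip.

    `values` is an iterable of (probeset_id, signal) pairs; this input
    shape is an assumption made for illustration. Pairs are sorted so the
    fingerprint does not depend on the order in which probesets are listed.
    """
    digest = hashlib.sha256()
    for probeset_id, signal in sorted(values):
        digest.update(f"{probeset_id}\t{signal:.{precision}f}\n".encode("utf-8"))
    return digest.hexdigest()


def find_duplicate_chips(chips):
    """Group chip identifiers that share the same fingerprint.

    `chips` maps a chip identifier to its (probeset_id, signal) pairs.
    Returns a list of sorted identifier groups with more than one member.
    """
    groups = defaultdict(list)
    for chip_id, values in chips.items():
        groups[chip_fingerprint(values)].append(chip_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical example: the same chip submitted under two identifiers.
chips = {
    "GSM_A": [("probe_1", 1.0), ("probe_2", 2.0)],
    "AE_chip_1": [("probe_2", 2.0), ("probe_1", 1.0)],
    "GSM_B": [("probe_1", 1.5), ("probe_2", 2.0)],
}
duplicates = find_duplicate_chips(chips)
```

Exact content hashing of this kind only catches chips whose values are identical; chips that were renormalised or partially edited before resubmission would need a looser comparison.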

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3595988
DOI: http://dx.doi.org/10.1093/database/bat010
