Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health, where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes, similarity metrics, and cutoff values, both to decide whether a given pair of records matches and to assess the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the difficulty of manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive datasets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93% to 97% true matches. In terms of scalability, we demonstrate AtyImo's ability to link the entire cohort in less than nine days using Spark and to scale up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
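The general idea of blocking followed by similarity scoring against a cutoff can be illustrated with a minimal sketch. This is not AtyImo's implementation: the attribute names (`name`, `mother_name`, `birth_date`), the weights, the blocking key, the cutoff of 0.85, and the use of Python's `difflib` as the similarity metric are all assumptions made purely for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical attribute weights and cutoff, chosen only for illustration.
WEIGHTS = {"name": 0.5, "mother_name": 0.3, "birth_date": 0.2}
CUTOFF = 0.85


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] from difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_key(record: dict) -> str:
    """Hypothetical blocking key: birth year plus first letter of the name."""
    return record["birth_date"][:4] + record["name"][:1].lower()


def score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-attribute similarities (weights sum to 1)."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def link(source_a: list, source_b: list, cutoff: float = CUTOFF) -> list:
    """Return (index_a, index_b, score) for candidate pairs at or above the cutoff."""
    # Index source_b by blocking key so only records sharing a key are compared.
    blocks = defaultdict(list)
    for j, rec in enumerate(source_b):
        blocks[block_key(rec)].append(j)

    matches = []
    for i, rec_a in enumerate(source_a):
        for j in blocks.get(block_key(rec_a), []):
            s = score(rec_a, source_b[j])
            if s >= cutoff:
                matches.append((i, j, s))
    return matches


if __name__ == "__main__":
    cohort = [{"name": "Maria Silva", "mother_name": "Ana Silva",
               "birth_date": "1980-05-12"}]
    health = [{"name": "Maria Silvs", "mother_name": "Ana Silva",
               "birth_date": "1980-05-12"},
              {"name": "Marcos Souza", "mother_name": "Rita Souza",
               "birth_date": "1980-01-01"}]
    for i, j, s in link(cohort, health):
        print(f"cohort[{i}] <-> health[{j}] score={s:.3f}")
```

Blocking restricts comparisons to records that share a key, which is what keeps the number of candidate pairs manageable as the sources grow; the cutoff then trades false matches against missed matches and would need to be tuned against whatever reference data is available.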
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7198121 | PMC |
| http://dx.doi.org/10.1109/JBHI.2018.2796941 | DOI Listing |