AI Article Synopsis

  • Data linkage involves connecting records referring to the same entity across different data sources, essential for research in fields like public health.
  • When unique identifiers are lacking, probabilistic methods are used to assess similarities between records, which requires careful selection of attributes and metrics.
  • The paper introduces AtyImo, a hybrid probabilistic linkage tool that shows high accuracy (93%-97% true matches) and can efficiently process large datasets, linking 114 million individuals in Brazil in under nine days.

Article Abstract

Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely used across scientific domains, including public health, where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes, similarity metrics, and cutoff values, both to decide whether a given pair of records matches and to assess the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the difficulty of manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive datasets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, AtyImo linked the entire cohort in less than nine days using Spark and scaled up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
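
The blocking, similarity, and cutoff steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity metric, attribute weights, blocking key, or cutoff AtyImo uses, so everything below is an assumption chosen for illustration: a Sorensen-Dice score over character bigrams, hypothetical field names and weights, and a cutoff of 0.85. Anonymization is left out entirely.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of character bigrams of a lower-cased, space-free string."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Sorensen-Dice similarity of two strings' bigram sets, from 0.0 to 1.0."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical attribute weights (sum to 1); real weights are tuned per dataset.
WEIGHTS = {"name": 0.5, "mother_name": 0.3, "birth_date": 0.2}


def score(rec_a, rec_b):
    """Weighted average of per-attribute similarities."""
    return sum(w * dice(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def link(source_a, source_b, block_key, cutoff=0.85):
    """Compare only records that share a block key; keep pairs scoring >= cutoff."""
    blocks = defaultdict(list)
    for rec in source_b:
        blocks[rec[block_key]].append(rec)
    matches = []
    for rec_a in source_a:
        for rec_b in blocks.get(rec_a[block_key], []):
            s = score(rec_a, rec_b)
            if s >= cutoff:
                matches.append((rec_a["id"], rec_b["id"], round(s, 3)))
    return matches


# Made-up records, not data from the paper.
cohort = [
    {"id": "A1", "name": "Maria Silva", "mother_name": "Ana Silva",
     "birth_date": "1980-05-12", "city": "Salvador"},
]
health = [
    {"id": "B1", "name": "Maria Silvia", "mother_name": "Ana Silva",
     "birth_date": "1980-05-12", "city": "Salvador"},
    {"id": "B2", "name": "Joao Souza", "mother_name": "Rita Souza",
     "birth_date": "1975-01-30", "city": "Salvador"},
    {"id": "B3", "name": "Maria Silva", "mother_name": "Ana Silva",
     "birth_date": "1980-05-12", "city": "Recife"},
]

matches = link(cohort, health, block_key="city")
print(matches)
```

In this sketch B1 is linked to A1 despite the spelling variation in the name, B2 falls below the cutoff, and B3 is never compared because it sits in a different block. That last case shows the trade-off blocking introduces: it cuts the number of comparisons sharply, but an error in the blocking attribute hides a true match.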

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7198121
DOI: http://dx.doi.org/10.1109/JBHI.2018.2796941

Publication Analysis

Top Keywords

probabilistic linkage: 12
accuracy scalability: 8
data linkage: 8
public health: 8
assessing accuracy: 8
high accuracy: 8
accuracy: 6
data: 5
linkage: 5
records: 5

Similar Publications

Background: Due to advances in treatment, HIV is now a chronic condition with near-normal life expectancy. However, people with HIV continue to have a higher burden of mental and physical health conditions and are impacted by wider socioeconomic issues. Positive Voices is a nationally representative series of surveys of people with HIV in the United Kingdom.

Transcriptome-Wide Association Study of Metabolic Dysfunction-Associated Steatotic Liver Disease Identifies Relevant Gene Signatures.

Turk J Gastroenterol

December 2024

Department of Emergency Medicine, Shandong University, Qilu Hospital (Qingdao), Cheeloo College of Medicine, Qingdao, China.

Metabolic dysfunction-associated steatotic liver disease (MASLD) is considered the most widespread chronic liver condition globally. Genome-wide association studies (GWAS) have pinpointed several genetic loci correlated to MASLD, yet the biological significance of these loci remains poorly understood. Initially, we applied Functional Mapping and Annotation (FUMA) to conduct a functional annotation of the MASLD GWAS summary statistics, which included data from 3242 cases and 707 631 controls.

Article Synopsis
  • A model was used to analyze costs and health outcomes, showing that strengthening these linked services can significantly diminish unintended pregnancies, induced abortions, live births, and infant infections.
  • Results indicate that for a relatively low cost, this integrated approach can avert multiple negative outcomes, supporting the idea that enhancing healthcare provider training and contraception methods is a cost-effective solution.

Background: Biological sample collection and data linkage can expand the utility of population health surveys. The present study investigates factors associated with population health survey respondents' willingness to provide biological samples and personal health information.

Methods: Using data from the 2019 Centre for Addiction and Mental Health (CAMH) Monitor survey (n = 2,827), we examined participants' willingness to provide blood samples, saliva samples, probabilistic linkage, and direct linkage with personal health information.

Background: Bias from data missing not at random (MNAR) is a persistent concern in health-related research. A bias analysis quantitatively assesses how conclusions change under different assumptions about missingness using bias parameters that govern the magnitude and direction of the bias. Probabilistic bias analysis specifies a prior distribution for these parameters, explicitly incorporating available information and uncertainty about their true values.
