A machine learning approach to predict ethnicity using personal name and census location in Canada.

Kai On Wong, Osmar R Zaïane, Faith G Davis, Yutaka Yasui

PLoS One

School of Public Health, University of Alberta, Edmonton, Alberta, Canada.

Published: December 2020

Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features.

Methods: Multiclass and binary classification machine learning pipelines were developed using the 1901 census. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy.
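The name and location features described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the 2-to-4 character substring range, the boundary markers, and the example record are all assumptions made for illustration, and double-metaphone and name-entity pattern features are omitted.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return character substrings of length n_min..n_max.

    '^' and '$' mark the start and end of the string so that prefixes
    and suffixes are distinguishable from interior substrings.
    The substring lengths are an assumption, not taken from the paper.
    """
    padded = f"^{text.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def extract_features(name, province, district, subdistrict):
    """Build a sparse bag of features from a name and a census location."""
    feats = Counter()

    # Entire name string plus its substrings.
    feats[f"name_full={name.lower()}"] += 1
    for gram in char_ngrams(name):
        feats[f"name_sub={gram}"] += 1

    # Entire location string plus substrings of each location level.
    location = "|".join(p.lower() for p in (province, district, subdistrict))
    feats[f"loc_full={location}"] += 1
    levels = (("prov", province), ("dist", district), ("subdist", subdistrict))
    for level, value in levels:
        for gram in char_ngrams(value):
            feats[f"{level}_sub={gram}"] += 1

    return feats


# Hypothetical record, for illustration only.
features = extract_features("Jean Tremblay", "Quebec", "Chicoutimi", "Bagotville")
```

A feature dictionary of this kind could then be vectorized and passed to any of the classifiers named above, for example a regularized logistic regression.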

Results: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classification of Chinese, French, Italian, Japanese, Russian, and others, F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved Aboriginal classification (F1 increased from 50% to 84%).
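As a reading aid for the figures above, the following sketch computes the listed binary metrics from confusion-matrix counts. The counts are invented for illustration and are not taken from the study; the function name and the zero-denominator convention are likewise assumptions.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts."""
    sensitivity = safe_div(tp, tp + fn)   # recall / true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    ppv = safe_div(tp, tp + fp)           # positive predictive value (precision)
    npv = safe_div(tn, tn + fn)           # negative predictive value
    f1 = safe_div(2 * ppv * sensitivity, ppv + sensitivity)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Invented counts for a rare class: 75 true members among 1,000 people.
metrics = binary_metrics(tp=50, fp=25, fn=25, tn=900)
```

With these invented counts, accuracy is 0.95 while F1 is about 0.67, which shows how accuracy can exceed F1 when the category of interest is a small share of the population.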

Conclusions: An automated machine learning approach using only name and census location features can predict the ethnicity of Canadians, with performance that varies by ethnic category.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673495	PMC
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0241239	PLOS

Similar Publications

Nondestructive detection of pungent and numbing compounds in spicy hotpot seasoning with hyperspectral imaging and machine learning.

Food Chem

December 2024

School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, China.

Di Zhang, Xu Chen, Zitao Lin, Minmin Lu, Wenhao Yang

The levels of capsaicin (CAP) and hydroxy-α-sanshool (α-SOH) are crucial for evaluating the spiciness and numbing sensation in spicy hotpot seasoning. Although liquid chromatography can accurately measure these compounds, the method is invasive. This study aimed to utilize hyperspectral imaging (HSI) combined with machine learning for the nondestructive detection of CAP and α-SOH in hotpot seasoning.

Thermoelectric Gel Enabling Self-Powered Facial Perception for Expression Recognition and Health Monitoring.

ACS Sens

December 2024

College of Integrated Circuits, Taiyuan University of Technology, Taiyuan 030024, China.

Xiaojing Cui, Yuyou Nie, Saeed Ahmed Khan, Xiangshi Bo, Ning Li

By analyzing facial features to perform expression recognition and health monitoring, facial perception plays a pivotal role in noninvasive, real-time disease diagnosis and prevention. Current perception routes are limited by structural complexity and the necessity of a power supply, making timely and accurate monitoring difficult. Herein, a self-powered poly(vinyl alcohol)-gellan gum-glycerol thermogalvanic gel patch enabling facial perception is developed for monitoring emotions and atypical pathological states.

Machine-Learning Electron Dynamics with Moment Propagation Theory: Application to Optical Absorption Spectrum Computation Using Real-Time TDDFT.

J Chem Theory Comput

December 2024

Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States.

Nicholas J Boyer, Christopher Shepard, Ruiyi Zhou, Jianhang Xu, Yosuke Kanai

We present an application of our new theoretical formulation of quantum dynamics, moment propagation theory (MPT) (Boyer et al., J. Chem.

Source-free domain transfer algorithm with reduced style sensitivity for medical image segmentation.

PLoS One

December 2024

Sichuan Academy of Medical Science and Sichuan Provincial People's Hospital, Chengdu, China.

Jian Lin, Xiaomin Yu, Zhengxian Wang, Chaoqiong Ma

In unsupervised transfer learning for medical image segmentation, existing algorithms face the challenge of error propagation due to inaccessible source domain data. In response, a source-free domain transfer algorithm with reduced style sensitivity (SFDT-RSS) is designed. SFDT-RSS initially pre-trains the source domain model using the generalization strategy and subsequently adapts the pre-trained model to the target domain without accessing source data.

Discovery of novel TACE inhibitors using graph convolutional network, molecular docking, molecular dynamics simulation, and Biological evaluation.

PLoS One

December 2024

Department of Pharmacology, Kangwon National University School of Medicine, Chuncheon, Republic of Korea.

Muhammad Yasir, Jinyoung Park, Eun-Taek Han, Jin-Hee Han, Won Sun Park

The increasing utilization of deep learning models in drug repositioning has proven to be highly efficient and effective. In this study, we employed an integrated deep-learning model followed by a traditional drug screening approach to screen a library of FDA-approved drugs, aiming to identify novel inhibitors targeting the TNF-α converting enzyme (TACE). TACE, also known as ADAM17, plays a crucial role in the inflammatory response by converting pro-TNF-α to its active soluble form and cleaving other inflammatory mediators, making it a promising target for therapeutic intervention in diseases such as rheumatoid arthritis.
