An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter.

Data Brief

University of Pennsylvania, Philadelphia, PA, USA.

Published: October 2020

Miscarriage, stillbirth, and preterm birth are common in the United States, but their causes remain largely unknown.
The authors present a data set of 6,487 annotated tweets that mention these adverse pregnancy outcomes, drawn from a database of more than 400 million public tweets posted by women who announced their pregnancy on Twitter.
Each tweet is labeled to distinguish a personal report of an outcome from a mere mention of it, which supports study of patient experiences and the training of machine learning classifiers to identify additional reports on Twitter.

Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4], [5], [6]. To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome ("outcome" tweets) from those that merely mention the outcome ("non-outcome" tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as "outcome" include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These "outcome" tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users' broader timelines (tweets posted by a user over time) for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter.
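The retrieval step described above, applying hand-written regular expressions to tweet text, can be sketched roughly as follows. The patterns, the `PATTERNS` table, and the `match_outcomes` helper are illustrative assumptions only; the expressions actually used to retrieve the 22,912 tweets are not given in this abstract.

```python
import re

# Hypothetical patterns, one per outcome category named in the abstract.
# They are NOT the authors' expressions, only an illustration of the idea.
PATTERNS = {
    "miscarriage": re.compile(r"\bmiscarr(?:y|ied|iage|iages)\b", re.IGNORECASE),
    "stillbirth": re.compile(r"\bstill[\s-]?(?:birth|born)\b", re.IGNORECASE),
    "preterm": re.compile(
        r"\b(?:pre[\s-]?term|premature)\s+(?:birth|labou?r)\b", re.IGNORECASE
    ),
    "low_birthweight": re.compile(r"\blow\s+birth[\s-]?weight\b", re.IGNORECASE),
    "nicu": re.compile(r"\bNICU\b|\bneonatal\s+intensive\s+care\b", re.IGNORECASE),
    "loss": re.compile(r"\blost\s+(?:my|our|the)\s+baby\b", re.IGNORECASE),
}


def match_outcomes(text):
    """Return the sorted names of all categories whose pattern occurs in text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


if __name__ == "__main__":
    examples = [
        "I had a miscarriage last year",
        "She spent two weeks in the NICU after premature labor",
        "Happy birthday to my friend",
    ]
    for tweet in examples:
        print(match_outcomes(tweet), "<-", tweet)
```

A match only shows that a tweet mentions an outcome. Whether the user is reporting a personal experience is exactly what the manual "outcome" versus "non-outcome" annotation decides.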
We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in "A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes" [10].
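The agreement statistic reported in the abstract (κ = 0.90) is Cohen's kappa, which corrects the observed agreement between two annotators for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label sequences. The function name and the toy labels are assumptions for illustration and are not drawn from the data set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of the two marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label; agreement is trivially total.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy labels (1 = "outcome", 0 = "non-outcome"); made up for this example.
    annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
    annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

In the toy example the annotators agree on 9 of 10 items (p_o = 0.9), while the marginal label rates give p_e = 0.54, so kappa is 0.36 / 0.46, noticeably lower than the raw agreement.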

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7481818
DOI: http://dx.doi.org/10.1016/j.dib.2020.106249
