Publications by authors named "Arjun Magge"

Article Synopsis
  • The text discusses the significance of real-world data from social media, particularly Twitter, for health and social science research, emphasizing the need to identify user demographics like age and gender to evaluate research representativeness.
  • It outlines the objective of a scoping review that summarizes existing literature on methods for predicting Twitter users' age and gender, noting the challenges involved in this process.
  • The review examined 684 studies, of which 74 discussed age or gender prediction; gender prediction methods predominated, and reported accuracy varied for both age and gender classification.

Free-text information represents a valuable resource for epidemiological surveillance. Its unstructured nature, however, presents significant challenges in the extraction of meaningful information. This study presents a deep learning model for classifying otitis using pediatric medical records.

Article Synopsis
  • Accurate documentation of phenotypes in electronic health records (EHR) is crucial for genetic diagnosis, but current variations in reporting hinder computational analysis and existing NLP methods are not fully trained on EHR data.
  • A new system called PhenoID was developed at the Children's Hospital of Philadelphia, which includes a manually annotated corpus of over 3,000 dysmorphology observations aligned with the Human Phenotype Ontology (HPO) to enhance phenotype extraction from clinical notes.
  • PhenoID outperformed prior methods with a performance score of 0.717, highlighting the potential of transformer-based models for extracting genetic phenotypes, though it also revealed issues with the HPO terminology and with how the models interpret it.

Wastewater-based epidemiology (WBE) is a non-invasive and cost-effective approach for monitoring the spread of a pathogen within a community. WBE has been adopted as one of the methods to monitor the spread and population dynamics of the SARS-CoV-2 virus, but significant challenges remain in the bioinformatic analysis of WBE-derived data. Here, we have developed a new distance metric, CoVdist, and an associated analysis tool that facilitates the application of ordination analysis to WBE data and the identification of viral population changes based on nucleotide variants.


Background: More than 6 million people in the United States have Alzheimer disease and related dementias, receiving help from more than 11 million family or other informal caregivers. A range of traditional interventions has been developed to support family caregivers; however, most of them have not been implemented in practice and remain largely inaccessible. While recent studies have shown that family caregivers of people with dementia use Twitter to discuss their experiences, methods have not been developed to enable the use of Twitter for interventions.

Article Synopsis
  • Researchers aimed to understand COVID-19 transmission in the UK using Twitter as a data source due to limited testing and information.
  • They collected geo-tagged tweets indicating possible COVID-19 exposure using natural language processing and machine learning methods.
  • Findings showed that Twitter reports aligned with lab-confirmed cases, often appearing up to 2 weeks earlier, suggesting tweets could help identify trends and inform public health policies.
Article Synopsis
  • Researchers developed an automated method called ReportAGE to identify the exact age of social media users based on their self-reported ages in tweets.
  • The system uses natural language processing techniques, including a deep neural network model, and achieved high accuracy in detecting age-related tweets and extracting exact ages.
  • ReportAGE was tested on over 1.2 billion tweets and successfully predicted the ages of 132,637 users, highlighting its potential for enhancing social media data analysis in research.

Objective: Research on pharmacovigilance from social media data has focused on mining adverse drug events (ADEs) using annotated datasets, with publications generally focusing on 1 of 3 tasks: ADE classification, named entity recognition for identifying the span of ADE mentions, and ADE mention normalization to standardized terminologies. While the common goal of such systems is to detect ADE signals that can be used to inform public policy, progress toward it has been impeded largely by the scarcity of end-to-end solutions for large-scale analysis of social media reports for different drugs.

Materials and Methods: We present a dataset for training and evaluation of ADE pipelines where the ADE distribution is closer to the average 'natural balance', with ADEs present in about 7% of the tweets.


The increase in social media usage across the globe has fueled efforts in digital epidemiology to mine valuable information, such as medication use, adverse drug effects, and reports of viral infections, that directly and indirectly affects population health. Such specific information can, however, be scarce, hard to find, and mostly expressed in very colloquial language. In this work, we focus on a fundamental problem that enables social media mining for disease monitoring.

Article Synopsis
  • Researchers aimed to create an automated system using natural language processing to analyze Twitter data for potential unreported COVID-19 cases in the U.S., addressing issues with traditional testing methods.
  • They collected tweets related to COVID-19 from January 2020 and developed a classifier using deep learning techniques, specifically a BERT model, to distinguish tweets that self-report infections.
  • The model achieved an F-score of 0.76, showing promise in identifying potential cases based on social media data. The team processed over 85 million tweets during their study.

Summary: We present GeoBoost2, a natural language processing pipeline for extracting the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information's GenBank for downstream analysis, including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release, including improvements in end-to-end extraction performance and speed, the availability of a fully functional web interface, and state-of-the-art methods for location extraction using deep learning.

Article Synopsis
  • The study explores how social media, particularly Twitter, can be used to track COVID-19 information shared by users in the U.S.
  • Researchers employed natural language processing and machine learning techniques to analyze the timing and location of these reports.
  • The findings indicate that this approach could serve as an early warning system for predicting the spread of COVID-19.
Article Synopsis
  • The study explores using social media mining to track COVID-19 reports on Twitter in England.
  • It builds on methods previously used in the US to identify personal accounts of COVID-19 experiences.
  • The results show that natural language processing and machine learning can effectively monitor the spread of the virus geographically and over time.

Objective: Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step toward incorporating Twitter data in pharmacoepidemiologic research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names suffer from low recall due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them.
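
To see why exact lexical search loses recall on misspellings, consider the following minimal Python sketch. The lexicon, the example tweet, and the 0.8 similarity cutoff are illustrative assumptions, and the fuzzy matching via the standard library's `difflib` is only a simple stand-in; it is not the method proposed in the article.

```python
import difflib
import re

# Hypothetical mini-lexicon of medication names (illustrative only).
LEXICON = ["ibuprofen", "metformin", "sertraline"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def exact_matches(text, lexicon=LEXICON):
    """Plain lexical search: keep tokens that appear verbatim in the lexicon."""
    return [tok for tok in tokenize(text) if tok in lexicon]


def fuzzy_matches(text, lexicon=LEXICON, cutoff=0.8):
    """Map each token to its closest lexicon entry, if it is similar enough."""
    found = []
    for tok in tokenize(text):
        close = difflib.get_close_matches(tok, lexicon, n=1, cutoff=cutoff)
        if close:
            found.append(close[0])
    return found


tweet = "took some ibuprofin for my headache"  # misspelled medication name
print(exact_matches(tweet))  # the exact search misses the misspelling
print(fuzzy_matches(tweet))  # the fuzzy search recovers "ibuprofen"
```

A similarity cutoff trades recall for precision: lowering it catches more misspellings but also starts matching common words, which is the ambiguity problem the abstract mentions.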


Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics.
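
A population heuristic of the kind mentioned here can be illustrated with a minimal Python sketch: among the gazetteer candidates that share a name, pick the most populous one. The gazetteer entries, field names, and rounded population and coordinate values below are illustrative assumptions, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    country: str
    lat: float
    lon: float
    population: int


# Hypothetical gazetteer with rough, illustrative values.
GAZETTEER = [
    Candidate("Paris", "France", 48.86, 2.35, 2_100_000),
    Candidate("Paris", "United States", 33.66, -95.56, 25_000),
    Candidate("Cairo", "Egypt", 30.04, 31.24, 9_500_000),
]


def disambiguate(toponym, gazetteer=GAZETTEER):
    """Return the most populous candidate whose name matches, or None."""
    matches = [c for c in gazetteer if c.name.lower() == toponym.lower()]
    if not matches:
        return None
    return max(matches, key=lambda c: c.population)


best = disambiguate("Paris")
print(best.country, (best.lat, best.lon))
```

The heuristic is cheap but biased toward large cities, so it can pick the wrong place when a host was sampled in a small town that shares its name with a metropolis.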


Discrete phylogeography using software such as BEAST considers the sampling location of each taxon as fixed, often to a single location without uncertainty. When studying viruses, this implies that there is no possibility that the location of the infected host for that taxon is somewhere else. Here, we relaxed this strong assumption and allowed for analytic integration of uncertainty for discrete virus phylogeography.


Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature.


Summary: GeoBoost is a command-line software package developed to address sparse or incomplete metadata in GenBank sequence records that relate to the location of the infected host (LOIH) of viruses. Given a set of GenBank accession numbers corresponding to virus GenBank records, GeoBoost extracts, integrates and normalizes geographic information reflecting the LOIH of the viruses using integrated information from GenBank metadata and related full-text publications. In addition, to facilitate probabilistic geospatial modeling, GeoBoost assigns probability scores for each possible LOIH.


Background: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women on a voluntary basis early in pregnancy and follow them until the end of pregnancy or longer to systematically collect information regarding specific pregnancy outcomes. Although the pregnancy registry model has distinct advantages over other study designs, it faces numerous challenges and limitations, such as low enrollment rates, high cost, and selection bias.
