Publications by authors named "Penberthy L"

Background: We developed a United States-based real-world data resource to better understand the continued impact of the coronavirus disease 2019 (COVID-19) pandemic on immunocompromised patients, who are typically underrepresented in prospective studies and clinical trials.

Methods: The COVID-19 Real World Data infrastructure (CRWDi) was created by linking and harmonizing de-identified HealthVerity medical and pharmacy claims data from 1 December 2018 to 31 December 2023, with severe acute respiratory syndrome coronavirus 2 virologic and serologic laboratory data from major commercial laboratories and Northwell Health; COVID-19 vaccination data; and, for patients with cancer, 2010 to 2021 National Cancer Institute Surveillance, Epidemiology, and End Results registry data.

Results: The CRWDi contains 4 cohorts: patients with cancer; patients with rheumatic diseases receiving pharmacotherapy; noncancer solid organ and hematopoietic stem cell transplant recipients; and people from the general population including adults and pediatric patients.

Childhood cancers are a heterogeneous group of rare diseases, accounting for less than 2% of all cancers diagnosed worldwide. Most countries, therefore, do not have enough cases to provide robust information on epidemiology, treatment, and late effects, especially for rarer types of cancer. Thus, only through a concerted effort to share data internationally will we be able to answer research questions that could not otherwise be answered.

Article Synopsis
  • The study discusses the limitations of using graph convolutional networks (GCN) for classifying natural language texts, particularly in terms of memory usage and distribution.
  • It introduces a new model called FastMPN, which features a message passing architecture that allows for adjustable node embeddings and edge weights, improving the GCN's problem-solving ability.
  • The FastMPN model was tested on extracting clinical data from cancer pathology reports, outperforming or matching existing models and training quickly on a large dataset using advanced hardware.
Article Synopsis
  • Precision medicine is increasingly important in cancer care, but tumor genomic data has been lacking in the National Cancer Institute's SEER Program, limiting research on molecular subtypes.
  • To improve this, the SEER Program has implemented a centralized process to link cancer cases in their registries with genomic test results from molecular labs, using specialized software and a trusted third party for data handling.
  • Recent linkages have included various OncotypeDX tests and results from other genomic classifiers, which facilitate the research community's access to valuable, de-identified data for cancer studies.

Although the Surveillance, Epidemiology, and End Results (SEER) Program has maintained high standards of quality and completeness, the traditional data captured through population-based cancer surveillance are no longer sufficient to understand the impact of cancer and its outcomes. Therefore, in recent years, the SEER Program has expanded the population it covers and enhanced the types of data that are being collected. Traditionally, surveillance systems collected data characterizing the patient and their cancer at the time of diagnosis, as well as limited information on the initial course of therapy.

The National Cancer Institute and the Department of Energy strategic partnership applies advanced computing and predictive machine learning and deep learning models to automate the capture of information from unstructured clinical text for inclusion in cancer registries. Applications include extraction of key data elements from pathology reports, determination of whether a pathology or radiology report is related to cancer, extraction of relevant biomarker information, and identification of recurrence. With the growing complexity of cancer diagnosis and treatment, capturing essential information with purely manual methods is increasingly difficult.

One of the challenges associated with understanding environmental impacts on cancer risk and outcomes is estimating potential exposures of individuals diagnosed with cancer to adverse environmental conditions over the life course. Historically, this has been partly due to the lack of reliable measures of cancer patients' potential environmental exposures before a cancer diagnosis. The emerging sources of cancer-related spatiotemporal environmental data and residential history information, coupled with novel technologies for data extraction and linkage, present an opportunity to integrate these data into the existing cancer surveillance data infrastructure, thereby facilitating more comprehensive assessment of cancer risk and outcomes.

Background: The National Cancer Institute funds many large cohort studies that rely on self-reported cancer data requiring medical record validation. This is labor intensive, costly, and prone to underreporting or misreporting of cancer and disparity-related differential response. US population-based central cancer registries identify incident cancer within their catchment area, yielding all malignant neoplasms and benign brain and central nervous system tumors with standardized data fields.

Background: The Surveillance, Epidemiology, and End Results (SEER) Program with the National Cancer Institute tested whether population-based cancer registries can serve as honest brokers to acquire tissue and data in the SEER-Linked Virtual Tissue Repository (VTR) Pilot.

Methods: We collected formalin-fixed, paraffin-embedded tissue and clinical data from patients with pancreatic ductal adenocarcinoma (PDAC) and breast cancer (BC) for two studies comparing cancer cases with highly unusual survival (≥5 years for PDAC and ≤30 months for BC) to pair-matched controls with usual survival (≤2 years for PDAC and ≥5 years for BC). Success was defined as the ability for registries to acquire tissue and data on cancer cases with highly unusual outcomes.

Purpose: This study assessed the prevalence of specific major adverse financial events (AFEs), namely bankruptcies, liens, and evictions, before a cancer diagnosis and their association with later-stage cancer at diagnosis.

Methods: Patients age 20-69 years diagnosed with cancer during 2014-2015 were identified from the Seattle, Louisiana, and Georgia SEER population-based cancer registries. Registry data were linked with LexisNexis consumer data to identify patients with a history of court-documented AFEs before cancer diagnosis.

Article Synopsis
  • Machine learning models, specifically deep neural networks (DNNs), are increasingly used in decision-making alongside humans, emphasizing the need for reliable classifications.
  • This paper highlights the use of DNNs to automate the extraction of cancer-related data from electronic pathology reports, while introducing new selective classification methods to improve accuracy and reduce the number of unreliable predictions.
  • The proposed methods outperform existing models by achieving high accuracy with lower rejection rates, demonstrating their effectiveness in processing complex medical data.
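
The synopsis above does not say how the paper's selective classifiers decide when to abstain. As a rough illustration of the general idea only, the sketch below uses a plain softmax-confidence threshold, which is a common baseline and not the authors' method. The function names and the 0.9 threshold are hypothetical.

```python
def selective_predict(probabilities, threshold=0.9):
    """Return (label_index, confidence), or (None, confidence) when abstaining.

    Assumption: `probabilities` is one row of class probabilities from any
    classifier; the threshold rule is a generic baseline, not the paper's.
    """
    confidence = max(probabilities)
    label = probabilities.index(confidence)
    if confidence < threshold:
        return None, confidence  # reject: defer this report to a human coder
    return label, confidence


def coverage_and_accuracy(prob_rows, true_labels, threshold=0.9):
    """Fraction of inputs accepted, and accuracy on the accepted subset."""
    accepted = 0
    correct = 0
    for probs, truth in zip(prob_rows, true_labels):
        label, _ = selective_predict(probs, threshold)
        if label is None:
            continue
        accepted += 1
        correct += int(label == truth)
    coverage = accepted / len(prob_rows)
    accuracy = correct / accepted if accepted else float("nan")
    return coverage, accuracy
```

Raising the threshold usually trades coverage for accuracy: more predictions are rejected, and the accepted ones are more often correct. The rejection rate mentioned in the synopsis is 1 minus the coverage returned here.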

Introduction: Health care procedures including cancer screening and diagnosis were interrupted due to the COVID-19 pandemic. The extent of this impact on cancer care in the United States is not fully understood. We investigated pathology report volume as a reflection of trends in oncology services pre-pandemic and during the pandemic.

Data-driven basic, translational, and clinical research has resulted in improved outcomes for children, adolescents, and young adults (AYAs) with pediatric cancers. However, challenges in sharing data between institutions, particularly in research, prevent addressing substantial unmet needs in children and AYA patients diagnosed with certain pediatric cancers. Systematically collecting and sharing data from every child and AYA can enable greater understanding of pediatric cancers, improve survivorship, and accelerate development of new and more effective therapies.

This retrospective observational study aimed to better understand the protective duration of prior SARS-CoV-2 infection against reinfection. The objectives were twofold: to assess the durability of immunity to SARS-CoV-2 reinfection among initially unvaccinated individuals with previous SARS-CoV-2 infection, and to evaluate the crude SARS-CoV-2 reinfection rate and associated risk factors. The study covered 144,678,382 individuals with SARS-CoV-2 molecular diagnostic or antibody test results from February 29, 2020, through April 30, 2021.

Objective: We aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports.

Materials And Methods: We consider the text classification problem that involves 5 individual tasks. The baseline model consists of a multitask convolutional neural network (MtCNN), and the implemented ensemble (teacher) consists of 1000 MtCNNs.
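
The excerpt does not give the loss function or training details, so the following is only a minimal sketch of ensemble knowledge distillation as it is commonly formulated: average the teachers' temperature-softened class probabilities, then train the student against those soft targets with cross-entropy. The function names and the temperature of 2.0 are assumptions, and in a multitask setting the same loss would be computed for each task separately.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature=2.0):
    """Average the softened class probabilities of all teacher models."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_teachers for k in range(n_classes)]


def distillation_loss(student_logits, soft_targets, temperature=2.0):
    """Cross-entropy between teacher soft targets and the softened student output."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student) if t > 0)
```

Because the soft targets spread probability across plausible classes instead of committing to one label, a student trained this way tends to be less overconfident than one trained on hard labels alone, which is the motivation stated in the objective.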

Follow-up of US cohort members for incident cancer is time-consuming, is costly, and often results in underascertainment when the traditional methods of self-reporting and/or medical record validation are used. We conducted one of the first large-scale investigations to assess the feasibility, methods, and benefits of linking participants in the US Radiologic Technologists (USRT) Study (n = 146,022) with the majority of US state or regional cancer registries. Follow-up of this cohort has relied primarily on questionnaires (mailed approximately every 10 years) and linkage with the National Death Index.

Objectives: The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, there has been no development of machine learning models for the ICCC classification. We developed deep learning-based information extraction models from cancer pathology reports based on the ICD-O-3 coding standard.

Importance: Better understanding of the protective duration of prior SARS-CoV-2 infection against reinfection is needed.

Objective: Primary: To assess the durability of immunity to SARS-CoV-2 reinfection among initially unvaccinated individuals with previous SARS-CoV-2 infection. Secondary: To evaluate the crude SARS-CoV-2 reinfection rate and associated characteristics.

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance.

The National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) program is continuously exploring opportunities to augment its already extensive collection of data, enhance the quality of reported cancer information, and contribute to more comprehensive analyses of cancer burden. This manuscript describes a recent linkage of the LexisNexis longitudinal residential history data with 11 SEER registries and provides estimates of the inter-state mobility of SEER cancer patients. To identify mobility from one state to another, we used state postal abbreviations to generate state-level residential histories.
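
One way to picture the state-level residential history step is sketched below. It assumes, hypothetically, that each linked address record is a `(start_date, state_abbreviation)` pair with an ISO-formatted date; the excerpt does not describe the actual LexisNexis record layout or the authors' processing rules.

```python
from itertools import groupby


def state_history(address_records):
    """Collapse dated address records into an ordered list of states lived in.

    Assumption: each record is (start_date, state), with start_date as an ISO
    'YYYY-MM-DD' string so that plain string sorting is chronological.
    """
    ordered = sorted(address_records, key=lambda record: record[0])
    states = (state for _, state in ordered)
    # groupby merges consecutive duplicates, so in-state moves are ignored.
    return [state for state, _ in groupby(states)]


def interstate_moves(address_records):
    """Number of state-to-state transitions in a patient's residential history."""
    return max(len(state_history(address_records)) - 1, 0)


records = [
    ("2010-05-01", "GA"),
    ("2001-01-01", "LA"),
    ("2005-03-15", "LA"),
    ("2015-07-01", "WA"),
]
print(state_history(records))     # ['LA', 'GA', 'WA']
print(interstate_moves(records))  # 2
```

Collapsing consecutive duplicates means a patient who changes address within Louisiana counts as zero inter-state moves, while a return to a previously lived-in state still counts as a move.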

Generating evidence on the use, effectiveness, and safety of new cancer therapies is a priority for researchers, health care providers, payers, and regulators given the rapid pace of change in cancer diagnosis and treatments. The use of real-world data (RWD) is integral to understanding the utilization patterns and outcomes of these new treatments among patients with cancer who are treated in clinical practice and community settings. An initial step in the use of RWD is careful study design to assess the suitability of an RWD source.

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention.

Cancer Informatics for Cancer Centers (CI4CC) is a grassroots, nonprofit 501(c)(3) organization intended to provide a focused national forum for engagement of senior cancer informatics leaders, primarily aimed at academic cancer centers anywhere in the world but with a special emphasis on the 70 National Cancer Institute-funded cancer centers. This consortium has regularly held topic-focused biannual face-to-face symposiums. These meetings are a place to review cancer informatics and data science priorities and initiatives, providing a forum for discussion of the strategic and pragmatic issues that we face at our respective institutions and cancer centers.

Article Synopsis
  • Population cancer registries can enhance their efficiency in extracting cancer characteristics from pathology reports by utilizing Deep Learning (DL), but challenges exist due to privacy issues regarding data sharing.
  • The proposed solution involves privacy-preserving transfer learning strategies to distribute a multitask convolutional neural network (MT-CNN) model among cancer registries without sharing sensitive patient data.
  • These privacy-preserving methods achieve performance comparable to traditional centralized models, showing that registries can collaborate on cancer data processing while maintaining confidentiality.

Background: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model.
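
The excerpt does not name the 11 algorithms evaluated. As a generic illustration of how an active learning step reduces labeling effort, the sketch below implements entropy-based uncertainty sampling, one widely used strategy that may or may not be among those studied. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(unlabeled_probs, batch_size):
    """Indices of the unlabeled documents the model is least certain about.

    Assumption: `unlabeled_probs` holds one class-probability row per
    unlabeled pathology report, produced by the current model.
    """
    ranked = sorted(
        range(len(unlabeled_probs)),
        key=lambda i: entropy(unlabeled_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]
```

In a full loop, the selected reports would be sent to annotators, added to the training set, and the model retrained before the next selection round.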
