Crawling the German Health Web: Exploratory Study and Graph Analysis.

J Med Internet Res

Department of Medical Informatics, Heilbronn University, Heilbronn, Germany.

Published: July 2020

Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the data needed for a computational and statistical analysis that addresses (1) to (3).

Objective: This study demonstrates the suitability of a focused crawler for acquiring the German Health Web (GHW), which comprises all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure, covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.

Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The crawl ran for 227 days. The crawler was evaluated using harvest rate, and its recall was estimated using a seed-target approach.
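The evaluation measures named in the Methods can be illustrated with a minimal sketch. It assumes the common definitions: precision and recall computed from a binary confusion matrix, harvest rate as the share of fetched pages classified as relevant, and seed-target recall as the share of held-out target URLs that the crawler reached. The function names and the toy data are hypothetical and are not taken from the study's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # health-related, predicted health-related
    fp: int  # not health-related, predicted health-related
    fn: int  # health-related, predicted not health-related
    tn: int  # not health-related, predicted not health-related


def confusion(y_true, y_pred):
    """Count outcomes for binary labels (True = health-related)."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t and p),
        fp=sum(1 for t, p in pairs if not t and p),
        fn=sum(1 for t, p in pairs if t and not p),
        tn=sum(1 for t, p in pairs if not t and not p),
    )


def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c):
    return c.tp / (c.tp + c.fp)


def recall(c):
    return c.tp / (c.tp + c.fn)


def harvest_rate(relevant_pages, fetched_pages):
    """Share of fetched pages that the classifier marked as relevant."""
    return relevant_pages / fetched_pages


def seed_target_recall(targets, crawled):
    """Share of held-out target URLs that the crawl actually reached."""
    targets = set(targets)
    return len(targets & set(crawled)) / len(targets)


# Toy labels (hypothetical), True = health-related.
y_true = [True, True, True, False, False, False, True, False]
y_pred = [True, True, False, False, True, False, True, False]
c = confusion(y_true, y_pred)

# Toy crawl (hypothetical hosts): 2 of 4 targets were reached.
targets = ["a.example.de", "b.example.at", "c.example.ch", "d.example.de"]
crawled = ["a.example.de", "c.example.ch", "x.example.de"]
toy_recall = seed_target_recall(targets, crawled)
```

With these definitions, a crawl that finds 4105 of 5000 targets has a seed-target recall of 4105/5000 = 0.821.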

Results: In total, 22,405 seed URLs were collected from Curlie and a previous crawl, with the country-code top-level domains .de (85.36%, 19,126/22,405), .at (6.83%, 1530/22,405), and .ch (7.81%, 1749/22,405). The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked websites for each country (according to PageRank), 40% (30/75) were published by public institutions, 25% (19/75) by nonprofit organizations, and 35% (26/75) by private organizations or individuals.
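The host-aggregated graph and the PageRank ranking mentioned in the Results can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: page-level links are collapsed to directed host-to-host edges, intra-host links are dropped, and PageRank is computed by plain power iteration with a damping factor of 0.85. The hosts are hypothetical example domains, and the abstract does not state which tool or parameters the authors used.

```python
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse page-level links into a set of directed host-level edges."""
    edges = set()
    for src, dst in page_links:
        s, d = urlsplit(src).hostname, urlsplit(dst).hostname
        if s and d and s != d:  # assumption: links within one host are ignored
            edges.add((s, d))
    return edges


def pagerank(edges, damping=0.85, iterations=100):
    """PageRank by power iteration; ranks sum to 1."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [] for n in nodes}
    for s, d in edges:
        out[s].append(d)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by hosts without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for s in nodes:
            for d in out[s]:
                new[d] += damping * rank[s] / len(out[s])
        rank = new
    return rank


# Hypothetical page-level links between example hosts.
page_links = [
    ("https://a.example.de/x", "https://hub.example.de/"),
    ("https://b.example.at/y", "https://hub.example.de/info"),
    ("https://c.example.ch/", "https://hub.example.de/"),
    ("https://hub.example.de/", "https://a.example.de/x"),
    ("https://a.example.de/x", "https://a.example.de/z"),  # intra-host, dropped
]

edges = host_graph(page_links)
rank = pagerank(edges)
top_host = max(rank, key=rank.get)
nodes = {n for e in edges for n in e}
edges_per_node = len(edges) / len(nodes)
```

In this toy graph the host that receives links from all other hosts ranks first. For the reported graph, the edges-per-node ratio 403,175/215,372 rounds to 1.872, which matches the stated average degree.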

Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allow major information hubs and important content providers on the GHW to be determined. In the future, the acquired data may be used to assess important topics and trends and to build health-specific search engines.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414401
DOI: http://dx.doi.org/10.2196/17853
