Background: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of continuously changing content in the health domain. To better understand the nature of all web-based health information available in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the data needed for a computational and statistical analysis that addresses (1) to (3).
Objective: This study demonstrates the suitability of a focused crawler for acquiring the German Health Web (GHW), which comprises all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure, covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experience in building and operating such a highly scalable crawler.
Methods: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The crawl ran for 227 days. The crawler was evaluated by its harvest rate, and its recall was estimated using a seed-target approach.
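The evaluation measures named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, harvest rate is assumed to be the usual fraction of fetched pages judged relevant (the abstract does not state how the reported average was computed), and seed-target recall is assumed to be the share of held-out target URLs that the crawler reached when started from the remaining seeds.

```python
from typing import Dict, Iterable, Sequence, Set


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision and recall for binary labels (1 = health-related)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def harvest_rate(relevant_pages: int, fetched_pages: int) -> float:
    """Assumed definition: fraction of fetched pages classified as relevant."""
    return relevant_pages / fetched_pages if fetched_pages else 0.0


def seed_target_recall(targets: Iterable[str], crawled: Set[str]) -> float:
    """Assumed definition: share of held-out target URLs found by the crawl."""
    target_set = set(targets)
    if not target_set:
        return 0.0
    return len(target_set & crawled) / len(target_set)
```

With these definitions, the reported recall of 0.821 corresponds to 4105 of 5000 targets found.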
Results: In total, n=22,405 seed URLs were collected from Curlie and a previous crawl, with the following country-code top-level domains: .de, 85.36% (19,126/22,405); .at, 6.83% (1530/22,405); and .ch, 7.81% (1749/22,405). The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954) and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) were published by nonprofit organizations, and 35% (26/75) by private organizations or individuals.
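To make the graph measures above concrete, the following Python sketch computes PageRank by power iteration on a small directed host graph and the average degree as edges per node. It is an illustration only: the abstract does not say which tool or parameters the authors used, so the damping factor of 0.85, the iteration count, and the `*.example` host names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def pagerank(edges: Iterable[Tuple[str, str]],
             damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """PageRank by power iteration on a directed graph given as (source, target) pairs."""
    edge_list = list(edges)
    nodes = sorted({node for edge in edge_list for node in edge})
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edge_list:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


def average_degree(num_nodes: int, num_edges: int) -> float:
    """Edges per node in a directed graph."""
    return num_edges / num_nodes


# Hypothetical host graph: three hosts link to one hub, which links back to one of them.
toy_edges = [
    ("a.example", "hub.example"),
    ("b.example", "hub.example"),
    ("c.example", "hub.example"),
    ("hub.example", "a.example"),
]
toy_ranks = pagerank(toy_edges)
top_host = max(toy_ranks, key=toy_ranks.get)
```

Applied to the reported counts, `average_degree(215372, 403175)` rounds to 1.872, which matches the average degree given in the Results.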
Conclusions: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistics allow major information hubs and important content providers on the GHW to be identified. In the future, the acquired data may be used to assess important topics and trends, and also to build health-specific search engines.
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7414401 | PMC |
http://dx.doi.org/10.2196/17853 | DOI Listing |