The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid server overload, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks founded by the French drug safety agency Agence Nationale de Sécurité du Médicament (ANSM). Vigi4Med Scraper has successfully extracted greater than 200 gigabytes of data from the web forums of over 20 different websites.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5266266PMC
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0169658PLOS

Publication Analysis

Top Keywords

vigi4med scraper
16
web forum
8
data
8
structured data
8
data analysis
8
data web
8
web forums
8
vigi4med
5
framework
4
scraper framework
4

Similar Publications

The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!