An Open Source Python Library for Anonymizing Sensitive Data.

Sci Data

Instituto de Física de Cantabria (IFCA), CSIC-UC Avda. los Castros s/n, 39005, Santander, Spain.

Published: November 2024

Open science is a fundamental pillar for promoting scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet while complying with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied to a given dataset. The user specifies the set of identifiers, the quasi-identifiers, the generalization hierarchies and the allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for continuous integration and development, and uses workflows to test code coverage based on unit and functional tests.
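To make the inputs named in the abstract concrete, the following is a minimal sketch of k-anonymity by generalization and suppression, written with the Python standard library only. It is not the API of the library described in the paper: the function names, the toy records, the column names and the hierarchy functions are all assumptions made for illustration, and the search strategy (raising every quasi-identifier by one hierarchy level at a time) is a deliberately simple stand-in for whatever the library actually does.

```python
from collections import Counter

# Hypothetical generalization hierarchies: level 0 is the raw value,
# higher levels are progressively coarser. These are illustrative only.
HIERARCHIES = {
    "age": [
        lambda a: str(a),
        lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",  # decade band
        lambda a: "*",
    ],
    "zip": [
        lambda z: z,
        lambda z: z[:3] + "**",  # keep the first three digits
        lambda z: "*",
    ],
}


def k_anonymize(rows, identifiers, quasi_identifiers, hierarchies, k, max_suppression):
    """Return k-anonymous rows, or None if no hierarchy level suffices.

    Identifiers are dropped, quasi-identifiers are generalized level by
    level, and records in groups smaller than k are suppressed as long as
    the suppressed fraction stays within max_suppression.
    """
    data = [{c: v for c, v in r.items() if c not in identifiers} for r in rows]
    top = min(len(hierarchies[c]) for c in quasi_identifiers)
    for level in range(top):
        candidate = [
            {**r, **{c: hierarchies[c][level](r[c]) for c in quasi_identifiers}}
            for r in data
        ]

        def key(r):
            return tuple(r[c] for c in quasi_identifiers)

        sizes = Counter(key(r) for r in candidate)
        kept = [r for r in candidate if sizes[key(r)] >= k]
        suppressed = len(candidate) - len(kept)
        if kept and suppressed / len(candidate) <= max_suppression:
            return kept
    return None


# Toy records, invented for this example.
records = [
    {"name": "Ana", "age": 23, "zip": "39005", "disease": "flu"},
    {"name": "Luis", "age": 27, "zip": "39012", "disease": "asthma"},
    {"name": "Eva", "age": 34, "zip": "39005", "disease": "flu"},
    {"name": "Iker", "age": 38, "zip": "39007", "disease": "diabetes"},
    {"name": "Sara", "age": 61, "zip": "28001", "disease": "flu"},
]

anonymized = k_anonymize(
    records,
    identifiers={"name"},
    quasi_identifiers=["age", "zip"],
    hierarchies=HIERARCHIES,
    k=2,
    max_suppression=0.2,
)
```

In this sketch the sensitive attribute (`disease`) is left untouched; k-anonymity constrains only the quasi-identifiers. Stronger guarantees that also involve the sensitive attribute, such as l-diversity, would need an additional check on the values within each group.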


Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11599594 (PMC)
http://dx.doi.org/10.1038/s41597-024-04019-z (DOI)


