As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these copies. This not only wastes storage but also makes it challenging to track and integrate modifications made by different users to the same dataset.
While many solutions have been proposed for storing and analyzing large volumes of data, all of them have limited support for collaborative data analysis, especially given that many individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) multiple conference attendees can concurrently update the database, browse the different versions, and inspect conflicts; (b) conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub.