Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit.

Ben Blamey, Salman Toor, Martin Dahlö, Håkan Wieslander, Philip J Harrison, Ida-Maria Sintorn, Alan Sabirsh, Carolina Wählby, Ola Spjuth, Andreas Hellander

GigaScience

Department of Information Technology, Uppsala University, Lägerhyddsvägen 2, 75237 Uppsala, Sweden.

Published: March 2021

Background: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources.

Findings: In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate the toolkit with 2 microscopy imaging case studies. The first is a high-content screening experiment, where images are analyzed in an on-premises container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope.
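The relationship between the interestingness function and the policy can be illustrated with a minimal Python sketch. It is not the HASTE Toolkit API: the names `DataObject`, `interestingness`, `tier_policy`, and `prioritize`, the value-spread scoring rule, the thresholds, and the tier labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class DataObject:
    """A stream item. Hypothetical type, not part of the HASTE Toolkit."""
    name: str
    values: List[int]  # stand-in for 8-bit pixel intensities


def interestingness(obj: DataObject) -> float:
    """Assumed scoring rule: spread of values scaled to [0, 1]."""
    if not obj.values:
        return 0.0
    spread = max(obj.values) - min(obj.values)
    return min(spread / 255.0, 1.0)


def tier_policy(score: float) -> str:
    """Assumed policy: map a score to a storage decision."""
    if score >= 0.75:
        return "fast-storage"
    if score >= 0.25:
        return "bulk-storage"
    return "discard"


def prioritize(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float],
    policy: Callable[[float], str],
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives, then let the policy decide."""
    for obj in stream:
        score = score_fn(obj)
        yield obj.name, score, policy(score)


stream = [
    DataObject("img_a", [0, 255, 128]),
    DataObject("img_b", [100, 110, 120]),
    DataObject("img_c", [10, 100, 60]),
]
decisions = list(prioritize(stream, interestingness, tier_policy))
for name, score, tier in decisions:
    print(f"{name}: score={score:.2f} -> {tier}")
```

Keeping the score function and the policy as separate callables mirrors the separation described above: the scientific notion of priority can change without touching the resource decision, and the same scores can drive a different policy in a different deployment.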

Conclusions: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between the scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7976223
DOI: http://dx.doi.org/10.1093/gigascience/giab018

