We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. To find a job partitioning that minimizes running time, a cost model, which we more accurately refer to as makespan model, is needed. In attempting to find the simplest possible, but sufficiently accurate, such model, we explore piecewise linear functions of input, output, and computational complexity. They are abstract in the sense that they capture fundamental algorithm properties, but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. In the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to directly integrate the model into the makespan optimization process, reducing search-space dimensionality and thus complexity by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6936958PMC
http://dx.doi.org/10.1007/s10619-018-7244-2DOI Listing

Publication Analysis

Top Keywords

makespan optimization
8
abstract cost
4
cost models
4
models distributed
4
distributed data-intensive
4
data-intensive computations
4
computations consider
4
consider data
4
data analytics
4
analytics workloads
4

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!