A Comprehensive Inspection of the Straggler Problem

Abstract

The parameter server is a popular paradigm for running distributed deep learning (DL) applications. As a growing number of DL models are trained on shared clusters, machines increasingly operate in heterogeneous environments, which give rise to stragglers: tasks that run unexpectedly slowly. Addressing stragglers is crucial in distributed DL applications, since they significantly hamper system performance. While many techniques have been deployed to mitigate stragglers, they may fall short in the presence of heterogeneity, where systems take much longer to reach DL training convergence than in a homogeneous environment, as evidenced by our experimental study. Building on straggler projection and an abstraction of parallelism, a new synchronization mechanism called elastic parallelism synchronous parallel (EPSP) is proposed, which exploits the iteration acceleration of stale synchronous parallel and overcomes the time wasted at barriers in bulk synchronous parallel. More precisely, EPSP supports both enforced and slack synchronization by adjusting the staleness parameter.
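
To make the role of the staleness parameter concrete, the following minimal sketch shows the generic bounded-staleness check used in stale synchronous parallel, where a staleness of 0 reduces to the enforced barrier of bulk synchronous parallel. It is an illustration only, not the EPSP mechanism itself; the function name may_proceed and the sample clock values are assumptions introduced for this example.

```python
from typing import List


def may_proceed(worker_clock: int, all_clocks: List[int], staleness: int) -> bool:
    """Generic bounded-staleness rule (hypothetical helper, not from the paper).

    A worker may start its next iteration only if it is at most `staleness`
    iterations ahead of the slowest worker.
    """
    if staleness < 0:
        raise ValueError("staleness must be non-negative")
    return worker_clock - min(all_clocks) <= staleness


# Illustrative iteration counters for three workers; the second is the straggler.
clocks = [5, 3, 4]

# staleness = 0: enforced synchronization, only the slowest worker may advance.
enforced = [may_proceed(c, clocks, 0) for c in clocks]

# staleness = 2: slack synchronization, faster workers keep running.
slack = [may_proceed(c, clocks, 2) for c in clocks]

print(enforced)
print(slack)
```

With a staleness of 0, the two faster workers wait for the straggler; with a staleness of 2, all three continue, which is the trade-off between barrier waiting and iteration speed that the abstract describes.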

Publication
The Spotlight of IEEE Transactions on Computers (TC)