Petrel: Heterogeneity-aware Distributed Deep Learning via Hybrid Synchronization


The parameter server (PS) paradigm has achieved great success in deploying large-scale distributed Deep Learning (DL) systems. However, these systems implicitly assume that the cluster is homogeneous, an assumption that does not hold in many real-world cases. Although previous efforts have addressed heterogeneity, they mainly prioritize the contribution of fast workers and reduce the involvement of slow workers, which leads to workload imbalance and computation inefficiency. We show that grouping workers into communities, an abstraction we propose, and handling parameter synchronization at the community level can overcome these limitations and accelerate training convergence. The idea of communities comes from our exploration of prior knowledge about the similarity between workers, which previous work often neglects. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CASP), which uses an Asynchronous Advantage Actor-Critic (A3C)-based algorithm to intelligently determine the community configuration and improve synchronization performance. We have implemented the idea in a prototype system called Petrel, which achieves a good balance between convergence efficiency and communication overhead. Evaluation on various benchmarks, with multiple metrics and baseline comparisons, demonstrates the effectiveness of Petrel. Specifically, Petrel accelerates training convergence by up to 1.87 times and reduces communication traffic by up to 26.85 percent, on average, compared with non-community synchronization mechanisms.

IEEE Transactions on Parallel and Distributed Systems (TPDS)