Parameter server paradigm has shown great performance superiority for handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. Previous solutions for solving straggler may not fully exploit the computation capacity of a cluster as evidenced by our experiments. This motivates us to make an attempt at building a new parameter server architecture that mitigates and addresses stragglers in heterogeneous DL from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for resolving this problem: (1) reducing straggler emergence frequency via elastic parallelism control and (2) transferring blocked tasks to pioneer workers for fully exploiting cluster computation capacity. Following the guidelines, we propose the abstraction of parallelism as an infrastructure and elaborate the Elastic-Parallelism Synchronous Parallel (EPSP) that supports both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon which efficiently accelerates the DL training progress with the presence of stragglers. Evaluation under various benchmarks with baseline comparison evidences the superiority of our system. Specifically, Falcon yields shorter convergence time, by up to 61.83%, 55.19%, 38.92% and 23.68% reduction over FlexRR, Sync-opt, ConSGD and DynSGD, respectively.
in Proc. ICDCS, Dallas, TX, USA,
Big data analytics in datacenters often involves scheduling of data-parallel jobs. Traditional scheduling techniques based on improving network resource utilization are subject to limited bandwidth in datacenter networks. To alleviate the shortage of bandwidth, some cluster frameworks employ techniques of traffic compression to reduce transmission consumption. However, they tackle scheduling in a coarse-grained manner at task level and do not perform well in terms of flow-level metrics due to high complexity. Fortunately, the abstraction of coflow pioneers a new perspective for scheduling majorization. In this paper, we introduce a coflow compression mechanism to minimize the completion time in data-intensive applications. Due to the NP-hardness, we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF ) to solve this problem. We build Swallow, an efficient scheduling system that implements our proposed algorithms. It minimizes coflow completion time (CCT) while guaranteeing resource conservation and starvation freedom. The results of both trace-driven simulations and real experiments show the superiority of our system, over existing algorithms. Specifically, Swallow speeds up CCT and job completion time (JCT) by up to 1.47 times and 1.66 times on average, respectively, over the SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces traffic amount by up to 48.41% on average.
in Proc. IPDPS, Vancouver, Canada,