Canary: Decentralized Distributed Deep Learning via Gradient Sketch and Partition in Multi-interface Networks


The multi-interface networks are efficient infrastructures to deploy distributed Deep Learning (DL) tasks as the model gradients generated by each worker can be exchanged to others via different links in parallel. Although this decentralized parameter synchronization mechanism can reduce the time of gradient exchange, building a high-performance distributed DL architecture still requires the balance of communication efficiency and computational utilization, i.e., addressing the issues of traffic burst, data consistency, and programming convenience. To achieve this goal, we intend to asynchronously exchange gradient pieces without the central control in multi-interface networks. We propose the Piece-level Gradient Exchange and Multi-interface Collective Communication to handle parameter synchronization and traffic transmission, respectively. Specifically, we design the gradient sketch approach based on 8-bit uniform quantization to compress gradient tensors and introduce the colayerabstraction to better handle gradient partition, exchange and pipelining. Also, we provide general programming interfaces to capture the synchronization semantics and build the Gradient Exchange Index (GEI) data structures to make our approach online applicable. We implement our algorithms into a prototype system called Canary by using PyTorch-1.4.0. Experiments conducted in Alibaba Cloud demonstrate that Canary reduces 56.28 percent traffic on average and completes the training by up to 1.61x, 2.28x, and 2.84x faster than BML, Ako on PyTorch, and PS on TensorFlow, respectively.

IEEE Transactions on Parallel and Distributed Systems (TPDS)