Iteration-based collaborative learning (CL) paradigms, such as federated learning (FL) and split learning (SL), face challenges in training neural models on rapidly growing yet resource-constrained edge devices. Such devices can neither accommodate a full-size model for FL nor afford the excessive waiting time incurred by the mandatory synchronization step in SL. To address these challenges, we propose a novel CL framework that adopts a tree-aggregation structure with an adaptive partition-and-ensemble strategy to achieve optimal synchronization and fast convergence at scale. To find the optimal split point for heterogeneous clients, we design a novel partitioning algorithm that minimizes idle time during communication and achieves optimal synchronization among clients. In addition, we propose a parallelism paradigm that exploits this optimal synchronization between clients and the server to accelerate distributed training without sacrificing model accuracy on edge devices. Furthermore, we theoretically prove that our framework achieves a better convergence rate than state-of-the-art CL paradigms. Extensive experiments show that our framework achieves a 4.6× training speedup over traditional methods without compromising training accuracy.