CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

Bibliographic details
Title: CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Authors: Rajasekaran, Sudarsanan; Ghobadi, Manya; Akella, Aditya
Publication year: 2023
Collection: Computer Science
Subject terms: Computer Science - Networking and Internet Architecture, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, C.2.4
Description: We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs, such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that compared to the state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN-marked packets in the cluster by up to 33x.
Document type: Working Paper
Access URL: http://arxiv.org/abs/2308.00852
Accession number: edsarx.2308.00852
Database: arXiv
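
Illustrative note: The Description above mentions time-shift values that interleave the communication phases of jobs sharing a link. The Python sketch below is only a toy illustration of that idea, not the paper's geometric abstraction or affinity-graph algorithm. It assumes two jobs with the same iteration length, each described by a hypothetical per-slot demand list, and brute-forces the delay for the second job that minimizes demand above link capacity. The names `excess_demand` and `best_time_shift`, the slot model, and the capacity units are all assumptions made for this example.

```python
from typing import List, Tuple


def excess_demand(a: List[float], b: List[float], shift: int, capacity: float) -> float:
    """Sum of per-slot demand above `capacity` when job b starts `shift` slots late.

    Both demand lists are treated as periodic with the same period (assumption).
    """
    n = len(a)
    return sum(max(0.0, a[i] + b[(i - shift) % n] - capacity) for i in range(n))


def best_time_shift(a: List[float], b: List[float], capacity: float) -> Tuple[int, float]:
    """Try every circular shift of job b and return (shift, excess) with least excess.

    Ties are broken in favor of the smallest shift.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal iteration lengths")
    scores = [(excess_demand(a, b, s, capacity), s) for s in range(len(a))]
    excess, shift = min(scores)
    return shift, excess


# Two jobs that each use the full link for the first half of every iteration.
job_a = [1.0, 1.0, 0.0, 0.0]
job_b = [1.0, 1.0, 0.0, 0.0]
shift, excess = best_time_shift(job_a, job_b, capacity=1.0)
print(shift, excess)
```

In this toy case, delaying the second job by half an iteration places its communication phase in the first job's idle slots, so the two never contend for the link. Real jobs have differing iteration times and share several links, which is the setting the paper's abstraction is described as addressing.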