BalanceNet Orchestrator: A KQV-based Dynamic Task Allocation for Distributed Deep Learning

Teh Jen Sun, Thien Thu Ngo, Eui Nam Huh

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In Artificial Intelligence (AI), training large models with billions of parameters requires substantial computational resources. This requirement has led to the adoption of parallel computing frameworks. However, these frameworks often suffer from node performance imbalances caused by disparities in computational capabilities and network conditions. To address this issue, we introduce the BalanceNet Orchestrator (BNO), a dynamic task allocation algorithm designed to balance workloads in parallel training environments. BNO assesses node-specific performance and adjusts to it in real time, improving workload distribution and resource utilization. This method significantly enhances training efficiency and accelerates model convergence, offering an efficient approach for training large-scale AI models within parallel training architectures.
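As an illustration of the general idea of performance-aware workload balancing, the following is a minimal sketch that splits a global batch across nodes in proportion to their measured throughput. It is not the BNO algorithm, whose KQV-based mechanism is not described in this abstract; the function name `allocate_batch` and the use of samples-per-second throughput as the performance signal are assumptions made for the example.

```python
def allocate_batch(total_samples, throughputs):
    """Split total_samples across nodes in proportion to measured throughput.

    Hypothetical helper for illustration only; not taken from the paper.
    throughputs: per-node samples/second, e.g. measured over recent steps.
    """
    total = sum(throughputs)
    if total <= 0:
        raise ValueError("at least one node must report positive throughput")

    # Ideal (fractional) share for each node.
    raw = [total_samples * t / total for t in throughputs]

    # Round down, then hand out the leftover samples one at a time to the
    # nodes with the largest fractional remainders (largest-remainder method).
    shares = [int(r) for r in raw]
    leftover = total_samples - sum(shares)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# A node twice as fast as its peers receives twice the samples.
print(allocate_batch(100, [1.0, 1.0, 2.0]))
```

Re-running such an allocation each time fresh throughput measurements arrive is one simple way to adapt to changing node and network conditions, so that faster nodes are not left idle waiting for slower ones.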

Original language: English
Title of host publication: 38th International Conference on Information Networking, ICOIN 2024
Publisher: IEEE Computer Society
Pages: 385-390
Number of pages: 6
ISBN (Electronic): 9798350330946
Publication status: Published - 2024
Event: 38th International Conference on Information Networking, ICOIN 2024 - Hybrid, Ho Chi Minh City, Viet Nam
Duration: 17 Jan 2024 to 19 Jan 2024

Publication series

Name: International Conference on Information Networking
ISSN (Print): 1976-7684

Conference

Conference: 38th International Conference on Information Networking, ICOIN 2024
Country/Territory: Viet Nam
City: Hybrid, Ho Chi Minh City
Period: 17/01/24 to 19/01/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • AI
  • Distributed
  • Parallel training
  • heterogeneous environment
