Abstract
In Artificial Intelligence (AI), training expansive models with billions of parameters necessitates substantial computational resources. This requirement has led to the adoption of parallel computing frameworks. However, these frameworks often confront node performance imbalances due to disparities in computational capabilities and network conditions. To address this issue, we introduce the BalanceNet Orchestrator (BNO), a dynamic task allocation algorithm designed to equilibrate workloads in parallel training environments. BalanceNet Orchestrator assesses and adjusts to node-specific performance in real time, facilitating optimal workload distribution and resource utilization. This method significantly enhances training efficiency and accelerates model convergence, presenting an efficient approach for training large-scale AI models within parallel training architectures.
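The abstract describes real-time, performance-aware workload balancing across heterogeneous nodes. The following is a minimal sketch of that general idea, not the paper's actual BNO implementation: per-node throughput is tracked with an exponential moving average, and work is then assigned proportionally to measured speed. All class, function, and parameter names here are illustrative assumptions.

```python
# Illustrative sketch of throughput-proportional task allocation for
# heterogeneous nodes. Not the paper's implementation; names are hypothetical.

def allocate_batches(total_batches, throughputs):
    """Split total_batches across nodes in proportion to measured throughput."""
    total = sum(throughputs)
    shares = [int(total_batches * t / total) for t in throughputs]
    # Integer truncation may leave a remainder; give it to the fastest nodes.
    remainder = total_batches - sum(shares)
    fastest_first = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    for i in fastest_first[:remainder]:
        shares[i] += 1
    return shares


class ThroughputMonitor:
    """Track per-node throughput with an exponential moving average (EMA),
    so the allocation adapts as compute speed or network conditions drift."""

    def __init__(self, n_nodes, alpha=0.3):
        self.alpha = alpha
        self.rates = [1.0] * n_nodes  # start by assuming equal node speed

    def update(self, node, batches_done, seconds):
        """Fold one observed round into the node's smoothed rate."""
        observed = batches_done / seconds
        self.rates[node] = (1 - self.alpha) * self.rates[node] + self.alpha * observed
```

A training loop would call `monitor.update(...)` after each round and re-run `allocate_batches(...)` before the next, so a node that slows down (e.g. due to network congestion) automatically receives a smaller share.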
Original language | English |
---|---|
Title of host publication | 38th International Conference on Information Networking, ICOIN 2024 |
Publisher | IEEE Computer Society |
Pages | 385-390 |
Number of pages | 6 |
ISBN (Electronic) | 9798350330946 |
DOIs | |
Publication status | Published - 2024 |
Event | 38th International Conference on Information Networking, ICOIN 2024 - Hybrid, Ho Chi Minh City, Viet Nam. Duration: 17 Jan 2024 → 19 Jan 2024 |
Publication series
Name | International Conference on Information Networking |
---|---|
ISSN (Print) | 1976-7684 |
Conference
Conference | 38th International Conference on Information Networking, ICOIN 2024 |
---|---|
Country/Territory | Viet Nam |
City | Hybrid, Ho Chi Minh City |
Period | 17/01/24 → 19/01/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- AI
- Distributed
- Parallel training
- heterogeneous environment