Abstract
In large-scale distributed machine learning systems, the interconnection network between computing devices strongly influences the performance of neural network training. The ongoing growth of training data and model sizes has rapidly increased the number of computing devices in these systems, placing higher demands on network scalability. Moreover, the synchronization algorithms used for data exchange between devices impose different communication topologies, which traditional electrical networks struggle to match because their topology is fixed. Neural network models and model partitioning methods also affect the volume of inter-device communication, yet the overprovisioned bandwidth of traditional electrical networks incurs unnecessary cost. To address these issues, we propose X-NEST, a scalable, flexible, and high-performance network architecture. The flexibility of optical switching devices allows X-NEST to dynamically reconfigure its topology and the number of links between devices as traffic patterns vary, improving both network performance and resource utilization. Although reconfiguring the connections between devices depends on a controller, X-NEST's simple and flexible control plane responds quickly to communication demands. Extensive simulations with different traffic patterns demonstrate that X-NEST copes well with the communication characteristics of various synchronization algorithms.
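To make the reconfiguration idea concrete, the following Python sketch shows one way a controller could map a limited budget of optical circuits onto the heaviest-traffic device pairs. This is a minimal illustration under assumed inputs (a pairwise traffic matrix and a per-node optical port budget); the `assign_circuits` function and its greedy policy are hypothetical and do not represent X-NEST's actual control algorithm, which is described in the paper.

```python
def assign_circuits(traffic, ports_per_node):
    """Greedily assign optical circuits to the heaviest-traffic node pairs.

    traffic: dict mapping (i, j) node pairs to demand (arbitrary units).
    ports_per_node: optical ports each node can dedicate to direct circuits.
    Returns the list of (i, j) pairs that receive a direct circuit.
    """
    free_ports = {}
    circuits = []
    # Visit node pairs in descending order of demand.
    for (i, j), demand in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if free_ports.get(i, ports_per_node) > 0 and free_ports.get(j, ports_per_node) > 0:
            circuits.append((i, j))
            free_ports[i] = free_ports.get(i, ports_per_node) - 1
            free_ports[j] = free_ports.get(j, ports_per_node) - 1
    return circuits

# Example: a ring all-reduce concentrates traffic on neighboring nodes,
# so the neighbor pairs receive circuits first.
ring_traffic = {(0, 1): 10.0, (1, 2): 10.0, (2, 3): 10.0, (3, 0): 10.0,
                (0, 2): 1.0, (1, 3): 1.0}
print(assign_circuits(ring_traffic, ports_per_node=2))
```

Run against a ring all-reduce traffic pattern, the sketch assigns all ports to the ring edges, matching the intuition that a reconfigurable fabric can align its topology with the synchronization algorithm in use.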