

On the Burstiness of Distributed Machine Learning Traffic

by Natchanon Luangsomboon, Fahimeh Fazel, Jörg Liebeherr, Ashkan Sobhani, Shichao Guan, Xingjun Chu

First submitted to arXiv on: 30 Dec 2023

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Networking and Internet Architecture (cs.NI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original GrooveSquid.com content)
The increasing popularity of training machine learning (ML) models on distributed networks generates a significant amount of traffic, accounting for a substantial portion of enterprise data center traffic. While distributed ML is an active research area, the network traffic it generates has received relatively little attention. This study investigates the traffic produced by training ResNet-50 neural networks, using measurements from a testbed network and focusing on short-term burstiness. The analysis reveals that distributed ML traffic is extremely bursty at short time scales, with a peak-to-mean ratio exceeding 60:1 even for intervals as long as 5 milliseconds. The findings also indicate that the training software orchestrates transmissions to avoid congestion and packet losses by synchronizing bursts from different sources within the same application. An extrapolation of these results highlights the challenges that distributed ML traffic poses for congestion and flow control algorithms.
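The peak-to-mean ratio mentioned above is computed by binning a packet trace into fixed-length intervals (e.g. 5 ms) and dividing the busiest bin's byte count by the average byte count per bin. A minimal sketch of that calculation is below; the function name and the synthetic trace are illustrative assumptions, not taken from the paper, and timestamps are kept in integer microseconds to avoid floating-point binning errors.

```python
def peak_to_mean_ratio(timestamps_us, sizes_bytes, interval_us):
    """Bin packet sizes into fixed-length intervals and return peak/mean traffic.

    timestamps_us: packet arrival times in microseconds (illustrative units)
    sizes_bytes:   packet sizes in bytes
    interval_us:   aggregation interval, e.g. 5000 for 5 ms
    """
    if not timestamps_us:
        raise ValueError("empty trace")
    start = min(timestamps_us)
    bins = {}
    for t, s in zip(timestamps_us, sizes_bytes):
        idx = (t - start) // interval_us
        bins[idx] = bins.get(idx, 0) + s
    # Count every interval from the first to the last, including silent ones,
    # since idle gaps are what drive the mean down and the ratio up.
    n_bins = max(bins) + 1
    mean = sum(bins.values()) / n_bins
    peak = max(bins.values())
    return peak / mean

# Synthetic bursty trace: ten 1500-byte packets inside one 5 ms window,
# then a single packet 100 ms in, leaving many idle intervals in between.
ts = [500 * i for i in range(10)] + [100_000]
sz = [1500] * 11
print(round(peak_to_mean_ratio(ts, sz, 5000), 2))  # → 19.09
```

Even this small toy trace shows how a short burst followed by idle time inflates the ratio; at a 60:1 ratio, the network must absorb peaks sixty times the average load.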
Low Difficulty Summary (original GrooveSquid.com content)
Imagine you’re trying to train a computer model using lots of data, but it’s spread out across many devices. This creates a lot of traffic on networks. Nobody has really studied this kind of traffic before, so we did some research to see what it looks like. We used a special test network and found that when we trained a certain type of neural network called ResNet-50, the traffic got really crazy! It’s like a big burst of data all at once. This makes it hard for networks to handle because they’re not designed to deal with such sudden spikes in activity. Our study shows how important it is to understand this kind of traffic so we can make sure our computer systems work smoothly.

Keywords

* Artificial intelligence  * Attention  * Machine learning  * Neural network  * ResNet