Waterwave Could Quench AIs' Thirst for GPU Memory

The approach breaks up the AI training process into manageable "sub-models"

This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.

One of the (many) ways in which AI is making waves is in its ability to analyze immense datasets. But training these AI programs is becoming increasingly computationally intensive, underscoring the need for more efficient ways to crunch data.

In a study published 22 May in IEEE Transactions on Computers, researchers describe a novel approach, called Waterwave, for training multiple AI models simultaneously and more efficiently on the same GPU. Their results show that, in scenarios with high memory demand, Waterwave is 12 times as fast as existing spatial sharing on a GPU and 1.49 times as fast as existing temporal memory sharing.

When an AI model initially needs training, certain calculations and methods are used to find the optimal or sub-optimal models for data analysis. In this way, “good” or “bad” models for analysis are identified as early as possible to significantly accelerate the overall training process.

However, most current methods for training AI models using GPUs unfortunately have to assess models one by one, rather than simultaneously, due to memory constraints. As a result, each training task must be queued one after another, with the possibility that the desired model is at the tail of the queue.

“In the worst scenario, all training tasks need to be finished one by one, which is very time consuming,” explains Xuan Peng, a Ph.D. candidate at Huazhong University of Science and Technology’s School of Computer Science and Technology.

A Divide and Conquer Approach

Peng’s team designed Waterwave so that it breaks models up into more manageable and evenly sized “sub-models.” Multiple sub-models from different models can be processed simultaneously on the same GPU, and as soon as the GPU is finished computing one sub-model, memory space is freed up for the next sub-model in the queue.

“By achieving similar memory sizes, it increases the probability that the freed memory from the preceding sub-model is sufficient for the next sub-model which requires memory allocation. This approach enables the memory freed by one model to be effectively utilized by another model,” says Peng.
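To make the idea of evenly sized sub-models concrete, here is a minimal sketch in Python. It is not Waterwave's actual partitioning algorithm, which the article does not describe. It assumes a model can be represented as a list of per-layer memory footprints (the numbers and the function name `split_into_submodels` are hypothetical), and it splits consecutive layers into at most `k` groups so that the largest group is as small as possible.

```python
from typing import List


def split_into_submodels(layer_mem: List[int], k: int) -> List[List[int]]:
    """Split consecutive layers into at most k groups, keeping the
    largest group's total memory as small as possible.

    layer_mem holds hypothetical per-layer memory footprints (any unit).
    This is an illustrative balancing heuristic, not the method from the paper.
    """
    if k < 1 or not layer_mem:
        raise ValueError("need at least one layer and k >= 1")

    def groups_for(cap: int) -> List[List[int]]:
        # Greedily pack consecutive layers without exceeding cap per group.
        groups, current, used = [], [], 0
        for mem in layer_mem:
            if current and used + mem > cap:
                groups.append(current)
                current, used = [], 0
            current.append(mem)
            used += mem
        groups.append(current)
        return groups

    # Binary search for the smallest cap that needs no more than k groups.
    lo, hi = max(layer_mem), sum(layer_mem)
    while lo < hi:
        mid = (lo + hi) // 2
        if len(groups_for(mid)) <= k:
            hi = mid
        else:
            lo = mid + 1
    return groups_for(lo)


# Hypothetical six-layer model split into three sub-models.
submodels = split_into_submodels([4, 2, 3, 5, 1, 3], 3)
sizes = [sum(group) for group in submodels]
print(submodels, sizes)
```

In this sketch, the closer the sub-model sizes are to one another, the more likely it is that the memory released by one finished sub-model is enough for the next one waiting in the queue, which is the effect Peng describes.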

Peng and his colleagues tested Waterwave using several popular neural networks used for computer vision and natural language processing applications, and compared it with another memory flow approach developed by NVIDIA, called Multi-Process Service (MPS), which also simultaneously evaluates multiple models on a GPU.

The results show that, overall, Waterwave demonstrates excellent memory sharing efficiency when accommodating multiple training jobs, using 76.4 percent to 96.8 percent of GPU memory for each job.

In comparing Waterwave and MPS, the researchers found that MPS outperforms Waterwave by a small margin when the computing jobs do not oversubscribe GPU memory. However, MPS experiences a significant performance degradation (greater than 90 percent) when the GPU memory is oversubscribed, and this level of degradation was not observed to the same extent with Waterwave.

Peng does note several limitations with Waterwave. Notably, if one computing job fails, the other computing jobs fail along with it. Also, for models with high GPU compute demand, the performance improvement gained by running tasks in parallel is marginal. “Therefore, our next research objective focuses on optimizing pipeline model parallelism to achieve higher training throughput,” says Peng.
