How to Avoid the Staggering Cost of Training State-of-the-art Large Language Models
, Deep Learning Algorithms Engineer, NVIDIA
, Deep Learning Data Scientist, NVIDIA
As state-of-the-art language models grow in capability, so does their size, which is now measured in billions or even trillions of parameters. Consequently, the time and cost of training these models have never been higher, making it impractical to train multiple candidate models in search of the best-performing hyperparameters. For example, a 175-billion-parameter GPT-3 model takes 35 days to converge on 128 DGX A100 80GB nodes, so training models at this scale is extremely difficult in setups with limited compute and time. We'll explain how to automatically select the hyperparameters that maximize training speed for such large language models using NVIDIA's NeMo-Megatron containers. For a 5-billion-parameter GPT-3 model, for instance, the best configuration trains 11.61 times faster than the worst hyperparameter setting, saving considerable compute time and energy in the process. Learn how to arrive at such configurations in as little time as possible.
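To make the idea concrete, here is a minimal sketch of the kind of throughput-driven search such a tool performs over the parallelism knobs that dominate large-model training speed. Everything here is an illustrative assumption rather than the actual NeMo-Megatron API: the candidate values, the `NUM_GPUS` constant, and the `measure_throughput` function (which in a real search would launch a short profiling run of a few training steps) are all hypothetical stand-ins.

```python
import itertools

# Candidate values for the knobs that most affect training throughput.
# These specific ranges are illustrative assumptions, not the tool's
# actual search space.
TENSOR_PARALLEL = [1, 2, 4, 8]
PIPELINE_PARALLEL = [1, 2, 4]
MICRO_BATCH_SIZE = [1, 2, 4, 8]

NUM_GPUS = 64  # assumed cluster size for this sketch


def measure_throughput(tp: int, pp: int, mbs: int) -> float:
    """Hypothetical stand-in for a short profiling run: launch a few
    training steps with this configuration and return the measured
    throughput in tokens per second."""
    # Toy cost model so the sketch runs end to end: deeper parallelism
    # adds communication overhead, larger micro-batches improve
    # per-GPU utilization. Replace with a real benchmark.
    comm_overhead = 1.0 + 0.05 * tp + 0.10 * pp
    return mbs * NUM_GPUS / comm_overhead


best_config, best_tput = None, 0.0
for tp, pp, mbs in itertools.product(TENSOR_PARALLEL, PIPELINE_PARALLEL,
                                     MICRO_BATCH_SIZE):
    # Skip layouts that require more GPUs than the cluster provides.
    if tp * pp > NUM_GPUS:
        continue
    tput = measure_throughput(tp, pp, mbs)
    if tput > best_tput:
        best_config, best_tput = (tp, pp, mbs), tput

print(f"best (tensor, pipeline, micro_batch) = {best_config}, "
      f"throughput = {best_tput:.1f} tokens/s")
```

A practical search would not brute-force a full training run per candidate: it would prune infeasible layouts up front and profile each remaining configuration for only a handful of steps, so the whole search costs a small fraction of one end-to-end training run.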