This talk covers best practices and techniques for scaling machine learning workloads to build large-scale models with PyTorch. We share our experiences using PyTorch to train 175-billion- and 1-trillion-parameter models, along with different training paradigms and techniques for profiling and troubleshooting that will help you jump-start your efforts in this space.
Jump to:
Jump to:
00:00 Introduction
00:44 Why is large model training needed?
00:59 Scaling creates training and model efficiency
01:13 Larger models = more efficient, less training, less data
01:24 Larger models can learn with few-shot learning
02:19 Democratizing large-scale language models with OPT-175B
02:51 Challenges of large model training
03:25 What is PyTorch Distributed?
04:20 Features Overview
06:00 DistributedDataParallel
06:53 FullyShardedDataParallel
08:44 FSDP Auto wrapping
09:22 FSDP Auto wrapping example
09:38 FSDP CPU Offload, Backward Prefetch policies
09:46 FSDP Mixed Precision control
09:53 Pipeline
11:06 Example Auto Partitioning
12:26 Pipeline + DDP (PDP)
13:44 Memory Saving Features
13:52 Activation Checkpointing
14:20 Activation Offloading
15:01 Activation Checkpointing & Offloading
15:45 Parameter Offloading
16:15 Memory Saving Features & Training Paradigms
18:11 Experiments & Insights
18:16 Model Implementation
18:50 Scaling Efficiency: Varying # GPUs
20:57 Scaling Efficiency: Varying World Size
22:07 Scaling Efficiency: Varying Batch Size
23:50 Model Scale Limit
24:55 Impact of Network Bandwidth
27:08 Best Practices
28:20 Best Practices FSDP
29:01 Profiling & Troubleshooting
29:08 Profiling & Troubleshooting for Large Scale Model Training
30:35 Uber Prof (Experimental) Profiling & Troubleshooting tool
32:09 Demonstration
34:15 Combining DCGM + Profiling
35:36 Profiling for Large Scale Model Training
36:04 NVIDIA Nsight multi-node, multi-GPU profiling
36:47 PyTorch Profiler distributed training profiling (single-node, multi-GPU)
37:04 Try it now
37:24 Resources
37:30 Closing Notes
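The chapters above cover DistributedDataParallel (DDP) among the PyTorch Distributed features. As a minimal, hedged sketch of the DDP setup pattern (not the speakers' actual code), here is a single-process CPU example using the gloo backend; the toy model, port number, and hyperparameters are assumptions for illustration — real multi-GPU training would launch one process per GPU via torchrun and use the nccl backend:

```python
# Minimal single-process DistributedDataParallel (DDP) sketch.
# Uses the CPU "gloo" backend with world_size=1 so it runs anywhere;
# in real training, torchrun spawns one process per GPU and DDP
# all-reduces gradients across ranks during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings for the single-process group (illustrative values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 4)                      # toy model stand-in
ddp_model = DDP(model)                              # wraps model for gradient sync
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

x = torch.randn(8, 16)
loss = ddp_model(x).pow(2).mean()
loss.backward()                                     # gradients synchronized here
opt.step()

dist.destroy_process_group()
print(f"loss: {loss.item():.4f}")
```

FullyShardedDataParallel (FSDP), also covered in the talk, follows the same wrap-the-model pattern but shards parameters, gradients, and optimizer state across ranks instead of replicating them, which is what enables the 175B+ parameter scales discussed.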
Microsoft Build 2022