0:00 Introduction
0:44 Why is large model training needed?
0:59 Scaling creates training and model efficiency
1:13 Larger models = more efficient, less training, less data
1:24 Larger models can learn with few-shot learning
2:19 Democratizing large-scale language models with OPT-175B
2:51 Challenges of large model training
3:25 What is PyTorch Distributed?
4:20 Features Overview
6:00 DistributedDataParallel
6:53 FullyShardedDataParallel
8:44 FSDP Auto Wrapping
9:22 FSDP Auto Wrapping Example
9:38 FSDP CPU Offload & Backward Prefetch Policies
9:46 FSDP Mixed Precision Control
9:53 Pipeline
11:06 Example: Auto Partitioning
12:26 Pipeline + DDP (PDP)
13:44 Memory-Saving Features
13:52 Activation Checkpointing
14:20 Activation Offloading
15:01 Activation Checkpointing & Offloading
15:45 Parameter Offloading
16:15 Memory-Saving Features & Training Paradigms
18:11 Experiments & Insights
18:16 Model Implementation
18:50 Scaling Efficiency: Varying # GPUs
20:57 Scaling Efficiency: Varying World Size
22:07 Scaling Efficiency: Varying Batch Size
23:50 Model Scale Limit
24:55 Impact of Network Bandwidth
27:08 Best Practices
28:20 Best Practices: FSDP
29:01 Profiling & Troubleshooting
29:08 Profiling & Troubleshooting for Large-Scale Model Training
30:35 Uber Prof (Experimental) Profiling & Troubleshooting Tool
32:09 Demonstration
34:15 Combining DCGM + Profiling
35:36 Profiling for Large-Scale Model Training
36:04 NVIDIA Nsight Multi-Node, Multi-GPU Profiling
36:47 PyTorch Profiler Distributed Training Profiling (single-node, multi-GPU)
37:04 Try it now
37:24 Resources
37:30 Closing Notes
Scaling ML workloads with PyTorch | OD39
May 27, 2022
This talk covers best practices and techniques for scaling machine learning workloads to build large-scale models with PyTorch. We share our experiences using PyTorch to train 175-billion- and 1-trillion-parameter models, the different training paradigms involved, and profiling and troubleshooting techniques that will help you jumpstart your efforts in this space (timestamps in the chapter list above).

Microsoft Build 2022
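The FSDP chapters above (auto wrapping, CPU offload, backward prefetch, mixed precision) all correspond to arguments on a single wrapper class. As orientation before watching, here is a minimal, self-contained sketch using the public torch.distributed.fsdp API from recent PyTorch releases (1.12+); the toy model, wrapping threshold, and hyperparameters are illustrative assumptions, not settings from the talk.

```python
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# One process per GPU; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model standing in for a real transformer; sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

# Shard parameters, gradients, and optimizer state across ranks.
fsdp_model = FSDP(
    model,
    # Auto wrapping: submodules above the size threshold become their own FSDP units.
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
    # CPU offload of parameters (off here; enable to trade speed for GPU memory).
    cpu_offload=CPUOffload(offload_params=False),
    # Prefetch the next unit's all-gather during the backward pass.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    # Mixed precision control for parameters, gradient reduction, and buffers.
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

# One synthetic step to show the training loop itself is unchanged.
inputs = torch.randn(8, 1024, device="cuda")
loss = fsdp_model(inputs).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

dist.destroy_process_group()
```

Launch with `torchrun --nproc_per_node=<num_gpus> train_fsdp.py` (the script name is hypothetical). When the full model fits on every GPU, swapping the FSDP wrapper for torch.nn.parallel.DistributedDataParallel gives the plain DDP paradigm also covered in the talk.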


Microsoft Developer