This talk covers best practices and techniques for scaling machine learning workloads to build large-scale models with PyTorch. We share our experiences using PyTorch to train 175-billion- and 1-trillion-parameter models, along with different training paradigms and techniques for profiling and troubleshooting that will help you jump-start your efforts in this space.
Jump to:
Jump to:
00:00 Introduction
00:44 Why is large model training needed?
00:59 Scaling creates training and model efficiency
01:13 Larger models = more efficient, less training, less data
01:24 Larger models can learn with few-shot learning
02:19 Democratizing large-scale language models with OPT-175B
02:51 Challenges of large model training
03:25 What is PyTorch Distributed?
04:20 Features Overview
06:00 DistributedDataParallel
06:53 FullyShardedDataParallel
08:44 FSDP Auto wrapping
09:22 FSDP Auto wrapping example
09:38 FSDP CPU Offload, Backward Prefetch policies
09:46 FSDP Mixed Precision control
09:53 Pipeline
11:06 Example Auto Partitioning
12:26 Pipeline + DDP (PDP)
13:44 Memory Saving Features
13:52 Activation Checkpointing
14:20 Activation Offloading
15:01 Activation Checkpointing & Offloading
15:45 Parameter Offloading
16:15 Memory Saving Features & Training Paradigms
18:11 Experiments & Insights
18:16 Model Implementation
18:50 Scaling Efficiency: Varying # GPUs
20:57 Scaling Efficiency: Varying World Size
22:07 Scaling Efficiency: Varying Batch Size
23:50 Model Scale Limit
24:55 Impact of Network Bandwidth
27:08 Best Practices
28:20 Best Practices FSDP
29:01 Profiling & Troubleshooting
29:08 Profiling & Troubleshooting for Large Scale Model Training
30:35 Uber Prof (Experimental) Profiling & Troubleshooting tool
32:09 Demonstration
34:15 Combining DCGM + Profiling
35:36 Profiling for Large Scale Model Training
36:04 NVIDIA Nsight multi-node, multi-GPU profiling
36:47 PyTorch Profiler distributed training profiling (single-node, multi-GPU)
37:04 Try it now
37:24 Resources
37:30 Closing Notes
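The chapters above cover DistributedDataParallel (DDP) among the PyTorch Distributed features. As a minimal, hedged sketch of the DDP setup pattern (not the speakers' actual code), here is a single-process CPU example using the gloo backend; the toy model, port number, and hyperparameters are assumptions for illustration — real multi-GPU training would launch one process per GPU via torchrun and use the nccl backend:

```python
# Minimal single-process DistributedDataParallel (DDP) sketch.
# Uses the CPU "gloo" backend with world_size=1 so it runs anywhere;
# in real training, torchrun spawns one process per GPU and DDP
# all-reduces gradients across ranks during backward().
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings for the single-process group (illustrative values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 4)                      # toy model stand-in
ddp_model = DDP(model)                              # wraps model for gradient sync
opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

x = torch.randn(8, 16)
loss = ddp_model(x).pow(2).mean()
loss.backward()                                     # gradients synchronized here
opt.step()

dist.destroy_process_group()
print(f"loss: {loss.item():.4f}")
```

FullyShardedDataParallel (FSDP), also covered in the talk, follows the same wrap-the-model pattern but shards parameters, gradients, and optimizer state across ranks instead of replicating them, which is what enables the 175B+ parameter scales discussed.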
Microsoft Build 2022