Cloud resources and services requirements
Distributed ML training for lane detection with Run:ai
Distributed training for lane detection use case using the TuSimple dataset
Create a delegated subnet for Azure NetApp Files
Azure NetApp Files configuration
Setting up Azure Kubernetes Service
Peering of AKS virtual network and Azure NetApp Files virtual network
Set up Azure NetApp Files back-end and storage class
Installing and running Run:ai distributed lane detection training
Download and process the TuSimple dataset as Run:ai job
Perform distributed lane detection training using Horovod
Azure NetApp Files service levels
Dynamically change the service level of a volume
Leveraging Azure NetApp Files snapshot data protection
Deploy and set up volume snapshot components on AKS
Restore data from an Azure NetApp Files snapshot
Microsoft, NetApp, and Run:ai have partnered in the creation of this article to demonstrate the unique capabilities of Azure NetApp Files together with the Run:ai platform for simplifying orchestration of AI workloads. This article provides a reference architecture for streamlining both data pipelines and workload orchestration for distributed machine learning training for lane detection, while ensuring that the full potential of NVIDIA GPUs is used.
Co-authors: Muneer Ahmad, Verron Martina (NetApp), Ronen Dar (Run:ai)
Microsoft and NetApp have teamed up with Run:ai, a company that virtualizes AI infrastructure, to allow faster AI experimentation with full GPU utilization. The goal is to enable teams to speed up AI by running many experiments in parallel, with fast access to data, while leveraging practically limitless compute resources. Run:ai enables full NVIDIA GPU utilization in Azure by automating resource allocation, and the proven architecture of Azure NetApp Files enables every experiment to run at maximum speed by eliminating data pipeline obstructions.
Microsoft, NetApp and Run:ai have joined forces to offer customers a future-proof platform for their AI journey in Azure. From analytics and high-performance computing (HPC) to autonomous decisions (where customers can optimize their IT investments by only paying for what they need, when they need it), the 3-way partnership offers a single unified experience in the Azure Cloud.
Data science incorporates multiple disciplines in IT and business; therefore, multiple personas are part of our targeted audience:
In this article, we describe how Run:ai, NVIDIA GPU-powered compute in Microsoft Azure and Azure NetApp Files help each of these roles bring value to business.
In this architecture, the focus is on the most computationally intensive part of the AI or machine learning (ML) distributed training process for lane detection. Lane detection is one of the most important tasks in autonomous driving; it helps guide vehicles by localizing the lane markings. Static components such as lane markings guide the vehicle to drive on the highway interactively and safely.
Convolutional Neural Network (CNN)-based approaches have pushed scene understanding and segmentation to a new level. However, they do not perform well on objects with long, thin structures or on regions that can be occluded (for example, poles, shade on the lane, and so on). The Spatial Convolutional Neural Network (SCNN) generalizes the CNN to a rich spatial level. It allows information propagation between neurons in the same layer, which makes it best suited for structured objects such as lanes, poles, or trucks with occlusions, because the spatial information can be reinforced and smoothness and continuity are preserved.
Thousands of scene images need to be fed into the system so that the model can learn to distinguish the various components in the dataset. These images cover different weather and lighting conditions (daytime and nighttime), multilane highway roads, and other traffic conditions.
Training requires both good quality and a large quantity of data. A single GPU, or even multiple GPUs in a single node, can take days to weeks to complete the training. Data-distributed training can speed up the process by using multiple GPUs across multiple nodes. Horovod is one such framework that enables distributed training, but reading data across clusters of GPUs can become a bottleneck. Azure NetApp Files provides ultrafast, high-throughput, and sustained low-latency storage with scale-out/scale-up capabilities so that GPUs are leveraged to the best of their computational capacity. Our experiments verified that, on average, more than 96% of all the GPUs across the cluster are utilized for training the lane detection model with SCNN.
This section covers the technology requirements for the lane detection use case by implementing a distributed training solution at scale that fully runs in the Azure cloud. The figure below provides an overview of the solution architecture.
The elements used in this solution are:
Links to all the elements mentioned here are listed in the Additional Information section.
The following table lists the hardware components that are required to implement the solution. The cloud components that are used in any implementation of the solution might vary based on customer requirements.
| Cloud resource | Minimum quantity |
| --- | --- |
| AKS | Minimum of three system nodes and three GPU worker nodes |
| Virtual machine (VM) SKU system nodes | Three Standard_DS2_v2 |
| VM SKU GPU worker nodes | Three Standard_NC6s_v3 |
| Azure NetApp Files | 2 TiB standard tier |
The following table lists the software components that are required to implement the solution. The software components that are used in any implementation of the solution might vary based on customer requirements.
| Software | Minimum version or other information |
| --- | --- |
| AKS – Kubernetes version | 1.18.14 |
| Run:ai CLI | v2.2.25 |
| Run:ai Orchestration Kubernetes Operator version | 1.0.109 |
| Horovod | 0.21.2 |
| Astra Trident | 21.01.1 |
| Helm | 3.0.0 |
This section provides details on setting up the platform for performing lane detection distributed training at scale using the Run:ai orchestrator. We discuss installation of all the solution elements and running the distributed training job on the said platform. ML versioning is completed by using Azure NetApp Files snapshots linked with Run:ai experiments for achieving data and model reproducibility. ML versioning plays a crucial role in tracking models, sharing work between team members, reproducibility of results, rolling new model versions into production, and data provenance.
Furthermore, this article provides a performance evaluation on multiple GPU-enabled nodes across AKS.
Finally, this article wraps up with a section on data protection and recovery for our training environments. Azure NetApp Files version control (snapshots) can capture point-in-time versions of the data, trained models, and logs associated with each experiment. It has rich API support, making it easy to integrate with the Run:ai platform; all you have to do is trigger an event based on the training state. Doing so captures the state of the whole experiment without changing anything in the code or the containers running on top of Kubernetes (K8s).
In this article, distributed training is performed on the TuSimple dataset for lane detection. Horovod is used in the training code for conducting data-distributed training on multiple GPU nodes simultaneously in the Kubernetes cluster through AKS. Code is packaged as container images for TuSimple data download and processing. Processed data is stored on persistent volumes allocated by the Astra Trident plug-in. For the training, one more container image is created, and it uses the data stored on the persistent volumes created while downloading the data.
To submit the data and training jobs, use Run:ai for orchestrating the resource allocation and management. Run:ai allows you to perform Message Passing Interface (MPI) operations, which are needed for Horovod. This layout allows multiple GPU nodes to communicate with each other to update the training weights after every training mini-batch. It also enables monitoring of training through the UI and CLI, making it easy to monitor the progress of experiments.
Azure NetApp Files snapshot is integrated within the training code and captures the state of data and the trained model for every experiment. This capability enables you to track the version of data and code used, and the associated trained model generated.
As organizations continue to embrace the scalability and flexibility of cloud-based solutions, Azure NetApp Files (ANF) has emerged as a powerful managed file storage service in Azure. ANF provides enterprise-grade file shares that are highly performant and integrate seamlessly with existing applications and workflows.
In this section, we will delve into two crucial aspects of leveraging the full potential of Azure NetApp Files: the creation of a delegated subnet and the initial configuration tasks. By following these steps, organizations can optimize their ANF deployment, enabling efficient data management and enhanced collaboration.
Firstly, we will explore the process of creating a delegated subnet, which plays a pivotal role in establishing a secure and isolated network environment for ANF. This delegated subnet ensures that ANF resources are efficiently isolated from other network traffic, providing an additional layer of protection and control.
Subsequently, we will discuss the initial configuration tasks necessary to set up Azure NetApp Files effectively. This includes key considerations such as setting up a NetApp account, and provisioning an ANF capacity pool.
By following these steps, administrators can streamline the deployment process and ensure smooth integration with existing infrastructure.
To create a delegated subnet for Azure NetApp Files, follow this series of steps:
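For reference, the same delegated subnet can also be created with the Azure CLI. The following is a minimal, hedged sketch: the resource group, region, and address prefixes are placeholders, while the VNet and subnet names match the Trident back-end configuration used later in this article.

# Hedged example: create a VNet and a subnet delegated to Azure NetApp Files
az network vnet create \
    --resource-group myRG \
    --name anf-vnet \
    --location westeurope \
    --address-prefixes 10.1.0.0/16

az network vnet subnet create \
    --resource-group myRG \
    --vnet-name anf-vnet \
    --name default \
    --address-prefixes 10.1.0.0/28 \
    --delegations "Microsoft.Netapp/volumes"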
Azure NetApp Files volumes are allocated to the application cluster and are consumed as persistent volume claims (PVCs) in Kubernetes. In turn, this allocation provides us the flexibility to map volumes to different services, be it Jupyter notebooks, serverless functions, and so on.
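For illustration, the following is a minimal sketch of such a claim, assuming the azurenetappfiles storage class created later in this article; the claim name and size are placeholders.

# Hedged example: request an Azure NetApp Files-backed persistent volume claim
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-anf-claim
spec:
  storageClassName: azurenetappfiles
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF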
Users of services can consume storage from the platform in many ways. The main benefits of Azure NetApp Files are:
To complete the setup of Azure NetApp Files, you must first configure it as described in Quickstart: Set up Azure NetApp Files and create an NFS volume.
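As a reference, a hedged CLI sketch of the NetApp account and capacity pool creation covered by that quickstart is shown below; the resource group, account, and pool names are placeholders, and 4 TiB is the minimum capacity pool size.

# Hedged example: create a NetApp account and a Standard capacity pool
az netappfiles account create \
    --resource-group myRG \
    --name myanfaccount \
    --location westeurope

az netappfiles pool create \
    --resource-group myRG \
    --account-name myanfaccount \
    --name mypool \
    --location westeurope \
    --service-level Standard \
    --size 4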
However, you may omit the steps to create an NFS volume for Azure NetApp Files as you will create volumes through Astra Trident (see later in this article). Before continuing, be sure that you have:
For setup and installation of the AKS cluster, go to Create an AKS Cluster. Then, follow this series of steps:
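As a hedged reference, the cluster and GPU node pool listed in the requirements could be created with the CLI roughly as follows; the resource group name is a placeholder, and the cluster name matches the example commands below.

# Hedged example: create the AKS cluster with three system nodes
az aks create \
    --resource-group myRG \
    --name aksclustername \
    --node-count 3 \
    --node-vm-size Standard_DS2_v2 \
    --kubernetes-version 1.18.14 \
    --generate-ssh-keys

# Add a GPU node pool with three Standard_NC6s_v3 worker nodes
az aks nodepool add \
    --resource-group myRG \
    --cluster-name aksclustername \
    --name gpupool \
    --node-count 3 \
    --node-vm-size Standard_NC6s_v3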
(!) Note
Deployment takes 5-10 minutes
az account set --subscription xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx
az aks get-credentials --resource-group resourcegroup --name aksclustername
kubectl get nodes
(i) Important
If all six nodes are up and running as seen here, your AKS cluster is ready and connected to your local environment.
Next, peer the AKS virtual network (VNet) with the Azure NetApp Files VNet by following these steps:
| Field | Value or description |
| --- | --- |
| Peering link name | aks-vnet-name_to_anf |
| SubscriptionID | Subscription of the Azure NetApp Files VNet to which you’re peering |
| VNet peering partner | Azure NetApp Files VNet |
(i) Important
Leave all non-asterisk fields at their default values
For more information, visit Create, change, or delete a virtual network peering.
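For a scripted alternative, the following is a hedged CLI sketch of the two peering links; the resource group and VNet names are placeholders, and the peering link name matches the table above.

# Hedged example: peer the AKS VNet with the Azure NetApp Files VNet
# (use the full resource ID for --remote-vnet if the VNets are in different resource groups)
az network vnet peering create \
    --resource-group myRG \
    --name aks-vnet-name_to_anf \
    --vnet-name aks-vnet-name \
    --remote-vnet anf-vnet \
    --allow-vnet-access

# Create the reverse peering from the Azure NetApp Files VNet to the AKS VNet
az network vnet peering create \
    --resource-group myRG \
    --name anf_to_aks-vnet-name \
    --vnet-name anf-vnet \
    --remote-vnet aks-vnet-name \
    --allow-vnet-access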
Astra Trident is an open-source project that NetApp maintains for application container persistent storage. Astra Trident has been implemented as an external provisioner controller that runs as a pod itself, monitoring volumes and completely automating the provisioning process.
Astra Trident enables smooth integration with K8s by creating and attaching persistent volumes for storing training datasets and trained models. This capability makes it easier for data scientists and data engineers to use K8s without the hassle of manually storing and managing datasets. Trident also eliminates the need for data scientists to learn to manage new data platforms, because it integrates the data management-related tasks through logical API integration.
To install Trident software, complete the following steps:
wget https://github.com/NetApp/trident/releases/download/v21.01.1/trident-installer-21.01.1.tar.gz
tar -xf trident-installer-21.01.1.tar.gz
cd trident-installer
cp ./tridentctl /usr/local/bin
cd helm
helm install trident trident-operator-21.01.1.tgz --namespace trident --create-namespace
kubectl -n trident get pods
To set up Azure NetApp Files back-end and storage class, complete the following steps:
cd ~
cd ./lane-detection-SCNN-horovod/trident-config
az ad sp create-for-rbac --name netapptrident
The output should look like the following example:
{
"appId": "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"displayName": "netapptrident",
"name": "http://netapptrident",
"password": "xxxxxxxxxxxxxxx.xxxxxxxxxxxxxx",
"tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx"
}
| Field | Value |
| --- | --- |
| subscriptionID | Your Azure subscription ID |
| tenantID | Your Azure tenant ID (from the output of az ad sp in the previous step) |
| clientID | Your appId (from the output of az ad sp in the previous step) |
| clientSecret | Your password (from the output of az ad sp in the previous step) |
The file should look like the following example:
{
    "version": 1,
    "storageDriverName": "azure-netapp-files",
    "subscriptionID": "********-****-****-****-************",
    "tenantID": "********-****-****-****-************",
    "clientID": "********-****-****-****-************",
    "clientSecret": "SECRET",
    "location": "westeurope",
    "serviceLevel": "Standard",
    "virtualNetwork": "anf-vnet",
    "subnet": "default",
    "nfsMountOptions": "vers=3,proto=tcp",
    "limitVolumeSize": "500Gi",
    "defaults": {
        "exportRule": "0.0.0.0/0",
        "size": "200Gi"
    }
}
tridentctl create backend -f anf-backend.json -n trident
kubectl create -f anf-storage-class.yaml
kubectl get sc azurenetappfiles
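The anf-storage-class.yaml used above is provided in the repository; as a hedged sketch (the provisioner name assumes the Trident CSI driver), its content could resemble the following.

# Hedged sketch of anf-storage-class.yaml; the storage class name matches
# the azurenetappfiles class referenced throughout this article
cat <<'EOF' > anf-storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurenetappfiles
provisioner: csi.trident.netapp.io
parameters:
  backendType: "azure-netapp-files"
allowVolumeExpansion: true
EOF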
(i) Important
Before moving on to the next chapter, make sure to follow the instructions in Deploy and set up volume snapshot components on AKS.
To install Run:ai, complete the following steps:
`runai config project lane-detection`
The output should look like the following example:
kubectl get namespaces
The output should appear like the following example:
`kubectl create -f runai-project-snap-role.yaml`
`kubectl create -f runai-project-snap-role-binding.yaml`
The process to download and process the TuSimple dataset as a Run:ai job is optional. It involves the following steps:
cd ~
cd ./lane-detection-SCNN-horovod/data
chmod +x build_image.sh
./build_image.sh
runai submit \
--name download-tusimple-data \
--pvc azurenetappfiles:100Gi:/mnt \
--image muneer7589/download-tusimple:1.0
| Field | Value or description |
| --- | --- |
| --name | Name of the job |
| --pvc | PVC of the format [StorageClassName]:Size:ContainerMountPath. In the above job submission, you are creating a PVC on demand using Trident with the storage class azurenetappfiles. The persistent volume capacity here is 100Gi, and it is mounted at the path /mnt. |
| --image | Docker image to use when creating the container for this job |
The output should look like the following example:
runai list jobs
runai logs download-tusimple-data -t 10
kubectl get pvc | grep download-tusimple-data
The output should look like the following example:
Performing distributed lane detection training using Horovod is an optional process. The steps involved are as follows:
cd ~
cd ./lane-detection-SCNN-horovod
chmod +x build_image.sh
./build_image.sh
kubectl get pvc | grep download-tusimple-data
kubectl patch pv pvc-bb03b74d-2c17-40c4-a445-79f3de8d16d5 -p \
'{"spec":{"accessModes":["ReadWriteMany"]}}'
runai submit-mpi \
--name dist-lane-detection-training \
--large-shm \
--processes=3 \
--gpu 1 \
--pvc pvc-download-tusimple-data-0:/mnt \
--image muneer7589/dist-lane-detection:3.1 \
-e USE_WORKERS="true" \
-e NUM_WORKERS=4 \
-e BATCH_SIZE=33 \
-e USE_VAL="false" \
-e VAL_BATCH_SIZE=99 \
-e ENABLE_SNAPSHOT="true" \
-e PVC_NAME="pvc-download-tusimple-data-0"
| Field | Value or description |
| --- | --- |
| --name | Name of the distributed training job |
| --large-shm | Mount a large /dev/shm device: a shared file system mounted on RAM that provides enough shared memory for multiple CPU workers to process and load batches into CPU RAM |
| --processes | Number of distributed training processes |
| --gpu | Number of GPUs/processes to allocate for the job. In this job, there are three GPU worker processes (--processes=3), each allocated a single GPU (--gpu 1) |
| --pvc | Use the existing persistent volume (pvc-download-tusimple-data-0) created by the previous job (download-tusimple-data); it is mounted at the path /mnt |
| --image | Docker image to use when creating the container for this job |
| -e | Define environment variables to be set in the container |
| USE_WORKERS | Setting the argument to true turns on multi-process data loading |
| NUM_WORKERS | Number of data loader worker processes |
| BATCH_SIZE | Training batch size |
| USE_VAL | Setting the argument to true allows validation |
| VAL_BATCH_SIZE | Validation batch size |
| ENABLE_SNAPSHOT | Setting the argument to true enables taking data and trained model snapshots for ML versioning purposes |
| PVC_NAME | Name of the PVC to take a snapshot of. In the above job submission, you are taking a snapshot of pvc-download-tusimple-data-0, consisting of the dataset and trained models |
The output should look like the following example:
runai list jobs
runai logs dist-lane-detection-training
runai logs dist-lane-detection-training --tail 1
kubectl get volumesnapshots | grep download-tusimple-data-0
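For reference, when ENABLE_SNAPSHOT is set to true, the snapshot created by the training job corresponds to a VolumeSnapshot object similar to this hedged sketch; the snapshot and snapshot class names are assumptions and should be matched to your environment.

# Hedged example: on-demand snapshot of the dataset/model PVC
cat <<EOF | kubectl create -f -
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: download-tusimple-data-0-snap
spec:
  volumeSnapshotClassName: netapp-csi-snapclass
  source:
    persistentVolumeClaimName: pvc-download-tusimple-data-0
EOF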
To show the linear scalability of the solution, performance tests have been done for two scenarios: one GPU and three GPUs. GPU allocation, GPU and memory utilization, and various single-node and three-node metrics have been captured during the training on the TuSimple lane detection dataset. Data is increased fivefold just for the sake of analyzing resource utilization during the training processes.
The solution enables customers to start with a small dataset and a few GPUs. When the amount of data and the demand of GPUs increase, customers can dynamically scale out the terabytes in the Standard Tier and quickly scale up to the Premium Tier to get four times the throughput per terabyte without moving any data. This process is further explained in the section, Azure NetApp Files service levels.
Processing time on one GPU was 12 hours and 45 minutes. Processing time on three GPUs across three nodes was approximately 4 hours and 30 minutes.
The figures shown throughout the remainder of this document illustrate examples of performance and scalability based on individual business needs.
The figure below illustrates 1 GPU allocation and memory utilization.
The figure below illustrates single node GPU utilization.
The figure below illustrates single node memory size (16GB).
The figure below illustrates single node GPU count (1).
The figure below illustrates single node GPU allocation (%).
The figure below illustrates three GPUs across three nodes – GPUs allocation and memory.
The figure below illustrates three GPUs across three nodes utilization (%).
The figure below illustrates three GPUs across three nodes memory utilization (%).
You can change the service level of an existing volume by moving the volume to another capacity pool that uses the service level you want for the volume. This service-level change does not require you to migrate data, and it does not affect access to the volume.
To change the service level of a volume, use the following steps:
az netappfiles volume pool-change -g mygroup \
    --account-name myaccname \
    --pool-name mypoolname \
    --name myvolname \
    --new-pool-resource-id mynewresourceid
Set-AzNetAppFilesVolumePool `
    -ResourceGroupName "MyRG" `
    -AccountName "MyAnfAccount" `
    -PoolName "MyAnfPool" `
    -Name "MyAnfVolume" `
    -NewPoolResourceId 7d6e4069-6c78-6c61-7bf6-c60968e45fbf
If your cluster does not come pre-installed with the correct volume snapshot components, you may manually install these components by running the following steps:
(i) Important
AKS 1.18.14 does not have pre-installed Snapshot Controller.
kubectl create -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl create -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl create -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/deploy/kubernetes/...
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-3.0/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
kubectl create -f netapp-volume-snapshot-class.yaml
The output should look like the following example:
kubectl get volumesnapshotclass
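The netapp-volume-snapshot-class.yaml used above is provided in the repository; as a hedged sketch (the class name is an assumption, and the driver name assumes the Trident CSI driver), its content could resemble the following.

# Hedged sketch of netapp-volume-snapshot-class.yaml
cat <<'EOF' > netapp-volume-snapshot-class.yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: netapp-csi-snapclass
driver: csi.trident.netapp.io
deletionPolicy: Delete
EOF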
To restore data from an Azure NetApp Files snapshot, complete the following steps:
cd ~
cd ./lane-detection-SCNN-horovod
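The restore-snapshot-pvc.yaml used in the next step is provided in the repository; as a hedged sketch (the source snapshot name is a placeholder), its content could resemble the following.

# Hedged sketch of restore-snapshot-pvc.yaml: a new PVC restored from a volume snapshot
cat <<'EOF' > restore-snapshot-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-tusimple
spec:
  storageClassName: azurenetappfiles
  dataSource:
    name: <your-volume-snapshot-name>
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF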
kubectl create -f restore-snapshot-pvc.yaml
The output should look like the following example:
runai submit-mpi \
--name dist-lane-detection-training \
--large-shm \
--processes=3 \
--gpu 1 \
--pvc restored-tusimple:/mnt \
--image muneer7589/dist-lane-detection:3.1 \
-e USE_WORKERS="true" \
-e NUM_WORKERS=4 \
-e BATCH_SIZE=33 \
-e USE_VAL="false" \
-e VAL_BATCH_SIZE=99 \
-e ENABLE_SNAPSHOT="true" \
-e PVC_NAME="restored-tusimple"
In conclusion, for distributed training at scale (especially in a public cloud environment), the resource orchestration and storage components are a critical part of the solution. Making sure that data management never hinders multi-GPU processing results in optimal utilization of GPU cycles, making the system as cost effective as possible for large-scale distributed training purposes.
The data fabric delivered by NetApp overcomes this challenge by enabling data scientists and data engineers to connect on-premises and cloud environments and keep data in sync without any manual intervention. In other words, the data fabric smooths the process of managing an AI workflow spread across multiple locations. It also facilitates on-demand data availability by bringing data close to compute and performing analysis, training, and validation wherever and whenever needed. This capability enables not only data integration but also protection and security of the entire data pipeline.