HPC/AI storage options for an NDm_v4 (A100) Azure Kubernetes Service (AKS) cluster
Published Jun 15 2023



 

 

Introduction

In a previous blog post we showed how to deploy an optimal NDm_v4 AKS cluster, i.e. one in which all 8 InfiniBand and 8 GPU devices on each NDm_v4 VM are configured correctly. We then verified that the NDm_v4 Kubernetes cluster was deployed and configured correctly by running a NCCL allreduce benchmark on 2 NDm_v4 VMs (16 A100 GPUs). We will use that NDm_v4 AKS cluster as a starting point and show how to use popular Azure HPC/AI storage options (local NVMe SSDs, Azure Managed Lustre Filesystem (AMLFS) and Azure Files mounted via NFSv4) in this NDm_v4 AKS cluster.

 

Create I/O test container

We will use FIO and IOR to test the various NDm_v4 + AKS storage options.

 

IOR build script (build_ior.sh)

 

 

 

 

 

#!/bin/bash
APP_NAME=ior
PARALLEL_BUILD=8
IOR_VERSION=3.2.1

# Download and unpack the IOR release
IOR_PACKAGE=ior-$IOR_VERSION.tar.gz
wget https://github.com/hpc/ior/releases/download/$IOR_VERSION/$IOR_PACKAGE
tar xvf $IOR_PACKAGE
rm $IOR_PACKAGE

# Build with the MPI compiler wrapper and install into the source directory
cd ior-$IOR_VERSION
./configure --prefix=$(pwd) CC=$(which mpicc)
make -j ${PARALLEL_BUILD}
make install

 

 

 

FIO build script (build_fio.sh)

 

 

 

#!/bin/bash
APP_VERSION=3.22
PARALLEL_BUILD=4

# The base image (nvcr.io/nvidia/pytorch) is Ubuntu based, so use apt-get rather than yum
apt-get install -y zlib1g-dev git
git clone https://github.com/axboe/fio.git
cd fio
git checkout tags/fio-${APP_VERSION}

# Build and install into the source directory
./configure --prefix=$(pwd)
make -j $PARALLEL_BUILD
make install

 

 

 

Dockerfile to build I/O tester container.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.03-py3

FROM ${FROM_IMAGE_NAME}

# Build tools, InfiniBand diagnostics, SSH and kmod (needed by the MPI/IOR jobs)
RUN apt-get update && apt-get install -y build-essential infiniband-diags openssh-server kmod

# Build IOR and FIO from source using the scripts above
COPY build_ior.sh .
RUN bash ./build_ior.sh
COPY build_fio.sh .
RUN bash ./build_fio.sh

 

Build the container locally.

docker build -t <ACR_NAME>.azurecr.io/<CONTAINER_NAME> .

NOTE: Choose a suitable <CONTAINER_NAME>.
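
Optionally, before pushing, you can smoke-test the image and log in to your Azure container registry. A minimal sketch (the binary paths follow the build scripts above; <ACR_NAME> and <CONTAINER_NAME> are the placeholders already used):

# Verify the FIO and IOR binaries were built into the image
docker run --rm <ACR_NAME>.azurecr.io/<CONTAINER_NAME> /workspace/fio/bin/fio --version
docker run --rm <ACR_NAME>.azurecr.io/<CONTAINER_NAME> ls /workspace/ior-3.2.1/bin/

# Authenticate docker against the registry before pushing
az acr login --name <ACR_NAME>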

 

Push your container to your Azure container registry.

docker push <ACR_NAME>.azurecr.io/<CONTAINER_NAME>
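
To allow the AKS cluster to pull this image from the registry, one option is to attach the ACR to the cluster (a hedged example using the same placeholders; an image pull secret would also work):

az aks update --name <AKS_NAME> --resource-group <AKS_RESOURCE_GROUP> --attach-acr <ACR_NAME>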

Mount Azure Managed Lustre Filesystem (AMLFS)

Prerequisites

AMLFS should already be deployed; see the AMLFS documentation for details.

 

In this example AMLFS is deployed in its own VNet (AMLFS_VNET) and AKS is deployed in a separate VNet (AKS_VNET, created by AKS with the default kubenet networking). To mount AMLFS in AKS, the two VNets need to be peered so they can communicate with each other.

 

az network vnet peering create -n <PEER1_NAME> -g <AMLFS_RESOURCE_GROUP> --vnet-name <AMLFS_VNET> --allow-forwarded-traffic --allow-vnet-access --remote-vnet /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<AKS_RESOURCE_GROUP>/providers/Microsoft.Network/virtualNetworks/<AKS_VNET>

 

az network vnet peering create -n <PEER2_NAME> -g <AKS_RESOURCE_GROUP> --vnet-name <AKS_VNET> --allow-forwarded-traffic --allow-vnet-access --remote-vnet /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<AMLFS_RESOURCE_GROUP>/providers/Microsoft.Network/virtualNetworks/<AMLFS_VNET>
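
Before mounting, it is worth confirming that both peerings report a Connected state, for example:

az network vnet peering show -n <PEER1_NAME> -g <AMLFS_RESOURCE_GROUP> --vnet-name <AMLFS_VNET> --query peeringState
az network vnet peering show -n <PEER2_NAME> -g <AKS_RESOURCE_GROUP> --vnet-name <AKS_VNET> --query peeringState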

 

Get the AMLFS CSI driver repository.

git clone https://github.com/kubernetes-sigs/azurelustre-csi-driver.git

Install the CSI driver on NDm_v4 AKS cluster (kubenet networking)

curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash
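
Before creating the storage class, you can check that the driver pods are running (a simple check; the exact pod names are defined by the driver's manifests):

kubectl get pods -n kube-system | grep azurelustre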

 

From the AMLFS CSI driver repository, edit docs/examples/storageclass_existing_lustre.yaml and update the internal Lustre filesystem name and the MGS IP address (see the Azure portal, your AMLFS resource, Client connection, to get these values for your deployed AMLFS).

kubectl create -f storageclass_existing_lustre.yaml

Edit docs/examples/pvc_storageclass.yaml and make sure it requests the correct storage size (storage: 16Ti).

kubectl create -f pvc_storageclass.yaml
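
For reference, a minimal pvc_storageclass.yaml along the following lines should produce the claim shown below (the claim name, storage class name, access mode and size are taken from the kubectl output that follows; consult the repository example for the exact file contents):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-lustre
spec:
  storageClassName: sc.azurelustre.csi.azure.com
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 16Ti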

 

Check the persistent volume claim for AMLFS

kubectl get pvc

NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                   AGE
pvc-lustre   Bound    pvc-26081f53-bb6d-48f7-b9c3-51982c97ed68   16Ti       RWX            sc.azurelustre.csi.azure.com   11s

 NOTE:  Consult the Azure AMLFS documentation to explore other ways to mount AMLFS in AKS.

 

Below we show how to mount this AMLFS persistent volume claim in a pod and run an IOR benchmark against it.

 

Setup NDm_v4 local NVMe SSD

The NDm_v4 VM includes ~7 TB of local NVMe storage spread over 8 NVMe devices, which need to be configured and mounted before use (e.g. combined into a RAID 0 array, formatted with an ext4 filesystem and mounted).
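
The aks-nvme-ssd-provisioner used below automates this setup. Conceptually, on each labeled node it does something along these lines (an illustrative sketch only; the actual script handles device discovery and re-runs safely):

# Stripe the 8 local NVMe devices into a single RAID 0 array
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
# Format the array and mount it where the pods expect it (see the sed edit below)
mkfs.ext4 /dev/md0
mkdir -p /pv-disks/scratch
mount /dev/md0 /pv-disks/scratch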

 

git clone https://github.com/ams0/aks-nvme-ssd-provisioner.git

Modify the local NVMe SSD mount point by editing aks-nvme-ssd-provisioner.sh:

sed -i "s/\/pv-disks\/\$UUID/\/pv-disks\/scratch/g" aks-nvme-ssd-provisioner.sh

Modify the Dockerfile (add execute permission to the provisioner script):

sed -i "/^COPY .*$/a RUN chmod +x \/usr\/local\/bin\/aks-nvme-ssd-provisioner.sh" Dockerfile

Build the container locally and push it to your ACR.

docker build -t ${acr_name}.azurecr.io/aks-nvme-ssd-provisioner:v1.0.2 .
docker push ${acr_name}.azurecr.io/aks-nvme-ssd-provisioner:v1.0.2

Modify storage-local-static-provisioner.yaml to point to the correct Azure container registry.

sed -i "s/ams0/${acr_name}.azurecr.io/g" ./manifests/storage-local-static-provisioner.yaml

Modify the node label that triggers setting up and mounting the local NVMe SSD.

sed -i "s/kubernetes.azure.com\/aks-local-ssd/aks-local-ssd/g" ./manifests/storage-local-static-provisioner.yaml

 

Update the NDm_v4 node pool with the "aks-local-ssd=true" label.

az aks nodepool update --cluster-name <AKS_NAME> --resource-group <RG> --name <NAME> --labels aks-local-ssd=true

NOTE: You could also set the aks-local-ssd label on the node pool when initially deploying the NDm_v4 AKS cluster.
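
For example, the label could be applied when the GPU node pool is first added (an illustrative command; the VM size shown is the NDm_v4 A100 SKU and the other values are placeholders):

az aks nodepool add --cluster-name <AKS_NAME> --resource-group <RG> --name <NAME> --node-vm-size Standard_ND96amsr_A100_v4 --node-count 2 --labels aks-local-ssd=true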

kubectl apply -f aks-nvme-ssd-provisioner/manifests/storage-local-static-provisioner.yaml

 Verify you can see the local NVMe persistent volume

kubectl get pv

NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv-5b19e64a   7095Gi     RWO            Delete           Available           local-storage            12m
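
The FIO job shown later in this post consumes this storage directly through a hostPath volume (/pv-disks/scratch). Alternatively, the Available local PV above could be claimed through the local-storage storage class, for example (a minimal sketch; the claim name is hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-nvme-pvc   # hypothetical claim name
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 7095Gi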

Below we show how to use this local NVMe SSD and run an FIO I/O throughput benchmark.

 

Deploy and mount Azure Files via NFSv4

 

Enable the Azure files CSI driver in your AKS cluster

az aks update -n <AKS_NAME> -g <AKS_RESOURCE_GROUP> --enable-file-driver
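
You can confirm the Azure Files CSI driver pods are present before creating the storage class, for example:

kubectl get pods -n kube-system | grep csi-azurefile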

Create a customized Storage Class for Azure file + NFSv4 (nfs-sc.yaml)

 

 

 

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
provisioner: file.csi.azure.com
allowVolumeExpansion: true
parameters:
  protocol: nfs
mountOptions:
  - nconnect=4

 

 

 

kubectl apply -f nfs-sc.yaml

 

Verify you have created the Azure files NFSv4 Storage class

kubectl get sc

NAME                PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
azurefile-csi-nfs   file.csi.azure.com   Delete          Immediate           true                   7s

Create an Azure files NFSv4 persistent volume claim (files-nfs-pvc.yaml)

 

 

 

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: files-nfs-pvc
spec:
  storageClassName: azurefile-csi-nfs
  accessModes:
    - "ReadWriteMany"
  resources:
    requests:
      storage: 100Gi
---

 

 

 

kubectl apply -f files-nfs-pvc.yaml

 Verify that the Azure files NFSv4 persistent volume claim is created

kubectl get pvc

NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
files-nfs-pvc   Bound    pvc-4786a03f-b2fc-44a9-b9e2-ac9ccd5c292c   100Gi      RWX            azurefile-csi-nfs   3m50s

 Note: Consult the AKS documentation to see other ways to mount Azure files in AKS

 

Below we give an example of how to run an FIO benchmark using this Azure Files NFSv4 persistent volume claim.

 

Run I/O benchmarks (FIO and IOR)

We will use FIO to test/validate the local NVMe SSD storage and Azure Files via NFSv4. The IOR benchmark will be used to test/validate the AMLFS storage.
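
The job definitions below use the Volcano scheduler (the same scheduler used for the NCCL benchmark in the previous post). Submitting a job and collecting its output might look like this (the file name is whatever you saved the YAML as; Volcano derives pod names from the job and task names, e.g. fio-job1-fio-0 for the first job below):

kubectl apply -f fio_local_nvme_job.yaml
kubectl get pods
kubectl logs fio-job1-fio-0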

 

Example of FIO yaml script to test the NDm_v4 local NVMe SSD

 

 

 

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fio-job1
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: fio
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  /workspace/fio/bin/fio --name=write_4G --directory=/scratch --direct=1 --size=4G --bs=4M --rw=write --group_reporting --numjobs=4 --runtime=300
                  /workspace/fio/bin/fio --name=read_4G --directory=/scratch --direct=1 --size=4G --bs=4M --rw=read --group_reporting --numjobs=4 --runtime=300
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: fio
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
              - name: scratch
                mountPath: /scratch
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: scratch
            hostPath:
              path: /pv-disks/scratch
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---

 

 

 

df -h (on NDm_v4 pod)

 

Filesystem      Size  Used Avail Use% Mounted on
overlay         117G   67G   50G  58% /
tmpfs            64M     0   64M   0% /dev
/dev/md0        7.0T   28K  7.0T   1% /scratch
/dev/root       117G   67G   50G  58% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           861G   12K  861G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           443G   12K  443G   1% /proc/driver/nvidia
tmpfs           178G  5.5M  178G   1% /run/nvidia-fabricmanager/socket
devtmpfs        443G     0  443G   0% /dev/nvidia0

 

FIO I/O write/read performance measured.

WRITE: bw=7710MiB/s (8085MB/s), 7710MiB/s-7710MiB/s (8085MB/s-8085MB/s), io=16.0GiB (17.2GB), run=2125-2125msec
READ: bw=9012MiB/s (9450MB/s), 9012MiB/s-9012MiB/s (9450MB/s-9450MB/s), io=16.0GiB (17.2GB), run=1818-1818msec

 

Example of IOR yaml script to test AMLFS

 

 

 

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ior-job1
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  MPI_HOST=$(cat /etc/volcano/mpiworker.host | tr "\n" ",")
                  mkdir -p /var/run/sshd; /usr/sbin/sshd
                  echo "HOSTS: $MPI_HOST"
                  mpirun --allow-run-as-root -np 16 -npernode 8 --bind-to numa --map-by ppr:8:node -hostfile /etc/volcano/mpiworker.host /workspace/ior-3.2.1/bin/ior  -a POSIX -v -i 1 -B -m -d 1 -F -w -r -t 32m -b 2G -o /mnt/myamlfs/test | tee /home/re
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  cpu: 1
              volumeMounts:
              - mountPath: "/mnt/myamlfs"
                name: myamlfs
              - mountPath: /dev/shm
                name: shm
          restartPolicy: OnFailure
          volumes:
          - name: myamlfs
            persistentVolumeClaim:
             claimName: pvc-lustre
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: mpiworker
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
              - mountPath: "/mnt/myamlfs"
                name: myamlfs
              - mountPath: /dev/shm
                name: shm
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: myamlfs
            persistentVolumeClaim:
             claimName: pvc-lustre
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---

 

 

 

 

df -h (on NDm_v4 pod)

Filesystem               Size  Used Avail Use% Mounted on
overlay                  117G   68G   49G  58% /
tmpfs                     64M     0   64M   0% /dev
10.2.0.6@tcp:/lustrefs    16T  1.3M   16T   1% /mnt/myamlfs
tmpfs                    8.0G     0  8.0G   0% /dev/shm
tmpfs                    861G   16K  861G   1% /root/.ssh
/dev/root                117G   68G   49G  58% /etc/hosts
tmpfs                    861G   12K  861G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    443G   12K  443G   1% /proc/driver/nvidia
tmpfs                    178G  9.1M  178G   1% /run/nvidia-fabricmanager/socket
devtmpfs                 443G     0  443G   0% /dev/nvidia0

IOR write/read I/O benchmark results

Max Write: 2263.20 MiB/sec (2373.14 MB/sec)
Max Read:  1829.75 MiB/sec (1918.63 MB/sec)

Note: This is in line with the expected performance; at 125 MB/s per TiB for the 16 TiB AMLFS deployed here, the maximum expected throughput is ~2000 MB/s.

 

Example FIO benchmark yaml file to test the Azure Files+NFSv4 filesystem.

 

 

 

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fio-files-job1
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
    - replicas: 1
      name: fio-files
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/bash
                - -c
                - |
                  df -h
                  /workspace/fio/bin/fio --name=write_4G --directory=/mnt/azurefiles --direct=1 --size=4G --bs=4M --rw=write --group_reporting --numjobs=4 --runtime=300
                  /workspace/fio/bin/fio --name=read_4G --directory=/mnt/azurefiles --direct=1 --size=4G --bs=4M --rw=read --group_reporting --numjobs=4 --runtime=300
              image: cgacr2.azurecr.io/pytorch_io_tests_2303:latest
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
                privileged: true
              name: fio-files
              workingDir: /workspace
              resources:
                requests:
                  nvidia.com/infiniband: 8
                limits:
                  nvidia.com/infiniband: 8
              volumeMounts:
              - name: persistent-files
                mountPath: /mnt/azurefiles
          restartPolicy: OnFailure
          terminationGracePeriodSeconds: 0
          volumes:
          - name: persistent-files
            persistentVolumeClaim:
              claimName: files-nfs-pvc
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 8Gi
---

 

 

 

df -h run from pod on NDm_v4.

Filesystem                                                                                                        Size  Used Avail Use% Mounted on
overlay                                                                                                           117G   68G   49G  58% /
tmpfs                                                                                                              64M     0   64M   0% /dev
fc48afa1c2d754f21873a54.file.core.windows.net:/fc48afa1c2d754f21873a54/pvcn-4786a03f-b2fc-44a9-b9e2-ac9ccd5c292c  100G     0  100G   0% /mnt/azurefiles
/dev/root                                                                                                         117G   68G   49G  58% /etc/hosts
shm                                                                                                                64M     0   64M   0% /dev/shm
tmpfs                                                                                                             861G   12K  861G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                             443G   12K  443G   1% /proc/driver/nvidia
tmpfs                                                                                                             178G  6.0M  178G   1% /run/nvidia-fabricmanager/socket
devtmpfs                                                                                                          443G     0  443G   0% /dev/nvidia0

 

FIO write/read I/O throughput (100 GiB Azure Files share mounted via NFSv4)

WRITE: bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=16.0GiB (17.2GB), run=125570-125570msec
READ: bw=158MiB/s (166MB/s), 158MiB/s-158MiB/s (166MB/s-166MB/s), io=16.0GiB (17.2GB), run=103711-103711msec

 

 Conclusion

All popular HPC/AI storage options can easily be consumed in NDm_v4 AKS cluster environments. In this blog post we demonstrated how to set up and consume local NVMe SSDs, AMLFS and Azure Files + NFSv4, and validated these storage options by running FIO and IOR I/O benchmarks.
