Scaling Up in the Cloud: The WEKA Data Platform and Azure HPC Windows Grid Integration
Published Feb 27 2024 05:46 PM
Microsoft

Co-written with Erik Garcia, WEKA Director of Cloud Sales; Brian Markenson, WEKA System Engineer; and Adam Fowler, WEKA System Engineer

High Performance Compute (HPC) grids in the Financial Services Industry are unique within the HPC industry in that they feature Windows-based workloads more than any other. As a result, many existing HPC storage platforms don’t necessarily apply, because they were geared toward POSIX clients or parallel file systems such as Lustre. However, these grids also feature thousands of nodes, all typically reading the same model files at the same time. It is not unusual for some massive grids to drive 3+ GBps over the SMB (Server Message Block) protocol, also known as “CIFS”.

 

In the previous article, we talked about evaluating Azure Files Premium for your HPC Pack environment. Azure Files Premium with a 50TB share performed at 5GBps across multiple nodes but is still limited in single-file performance. The challenge is to find a high-performance, scalable storage platform for SMB shares that can support the largest or most challenging Windows grid environments. With this challenge in mind, we have done further testing on a new entrant into the SMB shares arena in Azure: the WEKA Data Platform.
 

About WEKA 

WEKA is leading a paradigm shift in how data is stored, managed, and processed in the cloud and AI era. WEKA was born in the cloud and has built a powerful software approach to delivering high-performance data - called the WEKA Data Platform. The WEKA® Data Platform is a software solution that transforms stagnant data silos into dynamic data pipelines that power GPUs efficiently and fuel performance-intensive workloads seamlessly and sustainably. Its advanced cloud-native architecture is optimized to solve complex data challenges at scale, delivering 10-100x performance improvements across edge, core, cloud, and hybrid environments. WEKA helps the world’s leading data-driven organizations accelerate research and discovery breakthroughs and business outcomes – including 11 of the Fortune 50. The company operates in over 20 countries globally and is backed by dozens of world-class investors. Learn more about WEKA at www.weka.io

 
WEKA and Azure 

Many organizations are adopting Microsoft Azure for their cloud needs, especially for next-gen workloads involving accelerated computing and machine learning. These high-performance grids place added strain on both on-premises and cloud infrastructure. Companies with data center investments are leveraging the cloud to enhance scalability and cost-efficiency. As cloud infrastructure becomes integral, users expect the same performance as on-premises deployments. Across diverse industries, WEKA Data Platform® offers the fastest, most scalable file system for Microsoft Azure, meeting performance expectations and delivering cloud's scalability and simplicity for all workloads. 

 

 

WEKA Architecture  

[Figure: WEKA architecture diagram]

 WEKA software runs in the customer’s own subscription and integrates with many Azure services, including but not limited to: Azure VMs, Azure Blob Storage, Azure Kubernetes, and AzureML. The WEKA Data Platform combines dense NVMe storage of Azure Virtual Machine Lsv3-series instances with Azure Blob Storage in a single, efficient namespace, for your high-performance workloads, scaling to billions of files and hundreds of petabytes.  

 

WEKA provides enterprise features such as:  

  • Multiple file protocols (POSIX, NFS, SMB, Object) with full data shareability across protocols 
  • Single namespace up to 14EB 
  • Transparent object tiering 
  • Instantaneous snapshots, snap-to-object (remote clouds/regions), backup, disaster recovery (“DR”)  
  • Encryption 
  • Quotas  
  • Active Directory Services integration 
  • Kubernetes CSI driver 

 

Performance Testing and Results 
 

To evaluate performance in the cloud, several critical factors come into play, including instance size, system architecture, and network speed. In this section, we delve into the specifics of our performance testing and present the results obtained with the Azure cloud platform. 

Testing Architecture 

To conduct our performance tests, we modeled the design to align as closely as possible with previously published testing and common customer architectures. Each instance type was selected to fulfill a specific role within our storage platform: 
 

Instance Type | vCPU | Memory (GiB) | Role | Max NICs | Max Network Bandwidth (Mbps)
6 x Standard_L32s_v3 | 32 | 256 | Backend Servers | 8 | 16,000
3 x Standard_D96_v5 | 96 | 384 | Protocol Servers | 8 | 35,000

 

Our architecture consisted of six backend servers responsible for running the WEKA filesystem. Additionally, we deployed a set of three WEKA SMB protocol servers to serve SMB requests to clients. To ensure optimal performance, we selected the number of Protocol Gateway servers to match the level of SMB performance (BW) required by the environment. For an individual client, the performance limit is defined by the connection between the client and the protocol server, and between the protocol server and the backend. To achieve higher performance on the client side, multiple clients can run concurrently against multiple protocol servers. 
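The sizing rule described above can be sketched as a small calculation. This is a minimal illustration, assuming the protocol servers' listed NIC bandwidth (35,000 Mbps from the table above) is the only limit; it ignores SMB protocol overhead and any backend limits, and the function names are hypothetical rather than part of any WEKA or Azure tooling.

```python
import math

# Max network bandwidth per protocol server from the table above (Mbps).
PROTOCOL_SERVER_NIC_MBPS = 35_000


def per_server_ceiling_mb_s(nic_mbps: float = PROTOCOL_SERVER_NIC_MBPS) -> float:
    """Convert a NIC rating in megabits/s to megabytes/s."""
    return nic_mbps / 8


def protocol_servers_needed(target_mb_s: float,
                            nic_mbps: float = PROTOCOL_SERVER_NIC_MBPS) -> int:
    """Smallest server count whose combined NIC ceiling covers the target."""
    return math.ceil(target_mb_s / per_server_ceiling_mb_s(nic_mbps))


print(per_server_ceiling_mb_s())        # 4375.0 MB/s per protocol server
print(3 * per_server_ceiling_mb_s())    # 13125.0 MB/s for three servers
print(protocol_servers_needed(12_000))  # 3
```

Under these assumptions, three protocol servers give a network ceiling of 13,125 MB/s, which is the figure that appears as Max Bandwidth in the cost table later in this article.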

 

WEKA customer deployments are easily automated using Terraform to deploy the Azure Infrastructure and WEKA Software. Once deployed, there is zero tuning required and all data management takes place in the WEKA GUI. 

 


 

 

 
Methodology: 

For our performance testing, we utilized the "diskspd" tool, available from GitHub, to assess the single-client throughput of our SMB client.  DiskSpd is a storage performance tool from the Windows, Windows Server and Cloud Server Infrastructure engineering teams at Microsoft. Please visit https://github.com/Microsoft/diskspd/wiki for updated documentation. 
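For readers less familiar with DiskSpd, the invocation reported with the results below can be assembled from named parameters. The sketch is illustrative only: the helper name and its argument names are hypothetical, and the flag descriptions in the comments are short paraphrases of the DiskSpd documentation.

```python
def build_diskspd_command(files, file_size="1G", write_pct=40, block="64K",
                          threads=8, outstanding=32, warmup_s=5, duration_s=30):
    """Return a diskspd command line as a single string."""
    args = [
        "diskspd.exe",
        f"-c{file_size}",    # create test files of this size
        f"-w{write_pct}",    # percentage of writes (the rest are reads)
        f"-b{block}",        # block size per I/O
        f"-F{threads}",      # total number of threads
        "-r",                # random I/O
        f"-o{outstanding}",  # outstanding I/Os per target per thread
        f"-W{warmup_s}",     # warm-up time in seconds
        f"-d{duration_s}",   # measured duration in seconds
        "-Sh",               # disable software caching and hardware write caching
        "-L",                # collect latency statistics
        *files,
    ]
    return " ".join(args)


targets = ["testfile.dat", "testfile1.dat", "testfile2.dat", "testfile3.dat"]
print(build_diskspd_command(targets))
```

Changing `write_pct` to 0 or 100 would give the 100% read and 100% write variants mentioned in the results.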
 

Performance Results: 

  • Single Client Throughput: 
    Using Diskspd with a 64K block size and a 60/40 read/write split, we achieved remarkable results, reaching approximately 5GB/sec of total throughput. Comparable outcomes were observed when conducting 100% read or 100% write tests. 

60/40 read/write w/ 64K Block size:  

diskspd.exe -c1G -w40 -b64K -F8 -r -o32 -W5 -d30 -Sh -L testfile.dat testfile1.dat testfile2.dat testfile3.dat

Total IO | Bytes | I/Os | MiB/s | I/O per s | AvgLat | LatStdDev
read | 95,323,029,504 | 1,454,514 | 3,029.10 | 48,465.58 | 12.567 | 0.750
write | 63,448,809,472 | 968,152 | 2,016.23 | 32,259.60 | 12.860 | 0.783
total | 158,771,838,976 | 2,422,666 | 5,045.32 | 80,725.18 | 12.684 | 0.777

  • Multiple Clients Throughput:  
    We also ran the same test simultaneously from three different clients, each connected to its own protocol server. Each client sustained similarly high throughput, for a total aggregate throughput of approximately 12GB/sec.  

These results represent the best single-file and single-client performance we have seen across SMB storage platforms. The platform performs at the maximum network level of the protocol gateways and the clients with the common HPC workload type. Many platforms favor reads over writes, but not the WEKA Data Platform.  
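As a sanity check on the single-client numbers, the diskspd summary above can be cross-checked with a few lines of arithmetic. This is a sketch that only reuses the read and write rows reported above; the helper names are hypothetical, and GB/s here means 10^9 bytes per second.

```python
MIB = 1024 ** 2
BLOCK_BYTES = 64 * 1024  # matches -b64K

# Read and write rows copied from the diskspd summary above.
rows = {
    "read": {"bytes": 95_323_029_504, "ios": 1_454_514, "mib_s": 3_029.10},
    "write": {"bytes": 63_448_809_472, "ios": 968_152, "mib_s": 2_016.23},
}


def mib_s_to_gb_s(mib_s: float) -> float:
    """Convert MiB/s to decimal GB/s."""
    return mib_s * MIB / 1e9


def implied_duration_s(row: dict) -> float:
    """Seconds needed to move the reported bytes at the reported rate."""
    return row["bytes"] / MIB / row["mib_s"]


for name, row in rows.items():
    # Every I/O moved exactly one 64 KiB block.
    assert row["bytes"] == row["ios"] * BLOCK_BYTES
    print(f"{name}: {mib_s_to_gb_s(row['mib_s']):.2f} GB/s over "
          f"{implied_duration_s(row):.1f} s")

total_mib_s = sum(row["mib_s"] for row in rows.values())
print(f"total: {total_mib_s:.2f} MiB/s = {mib_s_to_gb_s(total_mib_s):.2f} GB/s")
```

The byte counts divide evenly into 64 KiB blocks, the implied run time matches the 30-second `-d30` duration, and the combined rate works out to roughly 5.3 GB/s, consistent with the "approximately 5GB/sec" quoted above. The summed rate differs from the reported total row by 0.01 MiB/s because the per-row values are already rounded.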

 

Solution Cost

 

WEKA’s performance lends itself to mixed IO patterns, capable of delivering both high transactional random IO at microsecond latencies and large sequential reads and writes simultaneously. WEKA’s ability to deliver massive performance with minimal SSD while letting you grow transparently with Azure Blob presents an increasingly attractive price point at scale. At scale, WEKA customers begin to leverage blob storage for cost efficiencies while getting the performance of flash in a single unified name space.

In this benchmark, we had 21TB of usable capacity (all flash):

 

This 21TB usable solution includes six dedicated WEKA storage nodes running on Standard_L32s_v3 VMs and three protocol nodes running on Standard_D96s_v5 VMs. The solution delivers roughly 12,000MB/s (12 GB/s) of bandwidth and 83,000 IOPS at microsecond latency. The $/BW and $/IOPS are particularly impressive.

 

Component | Cost per Month | Max Bandwidth (MB/s) | Max IOPS | Cost per TB | Cost per MBps | Cost per IOPS
Azure Infrastructure | $11,900 | 13,125 | - | - | - | -
Azure Blob | $550 | - | - | - | - | -
WEKA Software | $2,500 | - | - | - | - | -
Estimated TCO | $14,950 | 13,125 | 83,000 | ~$712 | $1.25 | ~$0.18
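The derived columns in the cost table can be reproduced from the monthly line items. The sketch below uses only figures quoted in this section; the variable names are illustrative. One observation, offered as an inference rather than a statement from the authors: the $1.25 per MBps figure matches division by the measured ~12,000 MB/s, while dividing by the 13,125 MB/s network maximum gives a slightly lower number.

```python
# Monthly line items from the cost table (USD).
monthly_cost = {
    "Azure Infrastructure": 11_900,
    "Azure Blob": 550,
    "WEKA Software": 2_500,
}

usable_tb = 21          # usable all-flash capacity in this benchmark
measured_mb_s = 12_000  # aggregate throughput reported for three clients
max_mb_s = 13_125       # Max Bandwidth column in the table
max_iops = 83_000       # Max IOPS column in the table

tco = sum(monthly_cost.values())
cost_per_tb = tco / usable_tb
cost_per_mbps_measured = tco / measured_mb_s
cost_per_mbps_max = tco / max_mb_s
cost_per_iops = tco / max_iops

print(f"Estimated TCO:            ${tco:,}")
print(f"Cost per TB:              ${cost_per_tb:,.0f}")
print(f"Cost per MBps (measured): ${cost_per_mbps_measured:.2f}")
print(f"Cost per MBps (max):      ${cost_per_mbps_max:.2f}")
print(f"Cost per IOPS:            ${cost_per_iops:.2f}")
```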

 

WEKA software enables customers to add either Flash or Object capacity to the single namespace. As a result, customers are able to combine cost-effective object storage with the performance of flash. Most larger data sets have a subset of data that is “hot” or “active”; as a rule of thumb, 10% to 20% of a data set will be active. This approach allows customers to scale a file system cost-efficiently. WEKA licensing provides economies of scale: as flash and object capacity are added, commercial incentives are made available. As a reference point, in a 1PB data set with 15% flash and the balance in object, WEKA is more cost-effective than an All Flash Solution by orders of magnitude.

 

First Party Solutions and Guidance

 

Azure Files Premium (AFP) – Azure Files is a fully managed cloud-based file share product, accessible over SMB on Windows and Linux operating systems, that fulfills common HPC Pack use cases. There are several identity-based authentication options for Azure Files over SMB. To ensure high availability and enable disaster recovery scenarios, Azure Files offers different redundancy options, including zone-redundant storage in certain regions, which offers 12 9’s of durability over a given year. It is the most cost-effective bandwidth-capable platform, but it does have challenges with large numbers of small files or heavy metadata operations.

 

Azure NetApp Files (ANF) – ANF is a first-party Azure service with a rich ecosystem of support from Microsoft and NetApp. It offers the ability to target share size and performance tiers to meet performance requirements, replication, and easy integration with Active Directory Domain Services (AD DS). For most HPC workloads, I would only consider the Premium or Ultra levels to host HPC runtime data. It also features the best latency control mechanism across Application Volume Groups, such as those for SAP, for sub-millisecond latency. When originally released, volumes or shares only supported up to 100TB in size. However, performance was capped before that point, at about 70TB, regardless of what the portal lists as Maximum Bandwidth.

The maximum empirical throughput that has been observed in testing is 4,500 MiB/s. At the Premium storage tier, an automatic QoS volume quota of 70.31 TiB will provision a throughput limit that is high enough to achieve this level of performance.

Additional details here:  Azure NetApp Files datastore performance benchmarks for Azure VMware Solution | Microsoft Learn
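The 70.31 TiB figure follows from how automatic QoS ties throughput to volume quota. The sketch below assumes the per-TiB rates from the ANF service-level documentation (16, 64 and 128 MiB/s per TiB for Standard, Premium and Ultra) and treats the 4,500 MiB/s empirical maximum quoted above as a ceiling for regular volumes; the function names are hypothetical.

```python
# Assumed auto-QoS rates per provisioned TiB, by service level (MiB/s per TiB).
THROUGHPUT_MIB_S_PER_TIB = {"Standard": 16, "Premium": 64, "Ultra": 128}

# Maximum empirical throughput for a regular volume, as quoted above.
REGULAR_VOLUME_CEILING_MIB_S = 4_500


def provisioned_throughput_mib_s(quota_tib: float, tier: str) -> float:
    """Throughput limit that auto QoS assigns to a volume of this quota."""
    return quota_tib * THROUGHPUT_MIB_S_PER_TIB[tier]


def effective_throughput_mib_s(quota_tib: float, tier: str) -> float:
    """Provisioned limit, capped at the empirical regular-volume ceiling."""
    return min(provisioned_throughput_mib_s(quota_tib, tier),
               REGULAR_VOLUME_CEILING_MIB_S)


def quota_to_reach_ceiling_tib(tier: str) -> float:
    """Quota at which the provisioned limit reaches the ceiling."""
    return REGULAR_VOLUME_CEILING_MIB_S / THROUGHPUT_MIB_S_PER_TIB[tier]


print(quota_to_reach_ceiling_tib("Premium"))         # 70.3125
print(effective_throughput_mib_s(100, "Premium"))    # 4500
```

Under these assumptions, a Premium regular volume stops gaining throughput at about 70.31 TiB, which is why growing a share from 70TB to 100TB did not improve performance.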

 

Currently in Preview, there is a new feature called Large Volumes. This feature enables a volume to be as large as 500TB but, more importantly for HPC performance, it enables the volume to deliver up to 2x the throughput for Premium and more than 3x for Ultra capacity pool volumes. In my testing with an HPC sample workload, a 100TB Ultra regular volume was capped at 3,500 MB/sec, but a 100TB Ultra large volume performed at up to 10,000 MB/sec.


 

ANF is still capped at 500TB per namespace. The Standard performance tier does offer tiering with Cool Access, with cool data placed on an Azure Storage Account. However, the Premium and Ultra tiers do not yet have a similar capability, even in Preview.

 

Simple Guidance or Flow Chart of Selection:

 

It is always difficult to provide general guidance without first understanding the needs or requirements of the customer. There are, though, a few general statements that all customers should follow as they migrate a workload to the cloud. First, understand the storage usage in the on-premises environment. Use storage management tools to determine the IO pattern and bandwidth needed. If those figures can’t be obtained, use tools like Diskspd or FIO from a single client or multiple clients to determine the maximum capability of the current storage shares.

Second, understand the needs around data synchronization with on-premises storage or backup commonality. It is always better to remain common if possible. That is, if the customer runs NetApp on-premises, use ANF in Azure. Its snapshotting ability and its capability for application-consistent snapshots remain the best in the industry.

 

If price for performance is the highest concern, Azure Files Premium is the default answer. Its performance maxed out in testing at 5,307 MB/sec for the HPC workload. Small-file performance or a large namespace remains a contraindication.

 

Azure NetApp Files can now satisfy most HPC workloads with the addition of Large Volume support. With regular volumes, ANF would be outperformed by AFP, but with Large Volume support, both Premium and Ultra pools were among the highest-performing volumes. 500 TB shares also cover most customer share requirements. Price per MB/sec of bandwidth can be higher than its competitors, especially with the Ultra tier and the inability to tier cool data.

 

Third Party Solutions, like WEKA, are therefore best employed to handle the largest or most performant SMB Shares.  WEKA had the highest single client performance limited solely by networking interface speed, soon to be assisted by Azure Boost.  It also had the greatest single name space bandwidth and IOPS performance, while still offering the ability to tier cool data to Azure Storage Accounts to mitigate the overall cost.

 

 

Conclusion / Next Steps

WEKA is positioned to handle the highest-performing shares. It is targeted to be the high-performance storage platform for your grid workloads, rather than general-purpose file storage. Compared with Azure Files Premium, WEKA would excel at single-file performance, namespaces or shares larger than 200TB, or total share performance above 6GBps. It also features the ability to tier storage to a Blob Storage account back end.
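The selection guidance in this article can be condensed into a rough first-pass filter. This is a deliberately simplified sketch, not the authors' flow chart: the thresholds come from the text (200TB namespace, 6GBps total share performance), but the rule ordering, the field names, and the decision to route small-file-heavy workloads to ANF are assumptions made for illustration. Real selection should start from measured requirements.

```python
from dataclasses import dataclass


@dataclass
class ShareRequirements:
    bandwidth_gb_s: float                 # total share bandwidth needed
    namespace_tb: float                   # size of the largest share or namespace
    needs_single_file_performance: bool = False
    small_file_heavy: bool = False        # many small files or metadata operations
    netapp_on_premises: bool = False      # existing NetApp estate to stay common with


def suggest_platform(req: ShareRequirements) -> str:
    """Return a first-pass platform suggestion for an SMB share."""
    # Largest or most demanding shares: thresholds quoted in the conclusion.
    if (req.needs_single_file_performance
            or req.namespace_tb > 200
            or req.bandwidth_gb_s > 6):
        return "WEKA Data Platform"
    # Stay common with on-premises NetApp; AFP struggles with small files.
    if req.netapp_on_premises or req.small_file_heavy:
        return "Azure NetApp Files"
    # Default when price for performance is the main concern.
    return "Azure Files Premium"


print(suggest_platform(ShareRequirements(bandwidth_gb_s=3, namespace_tb=50)))
print(suggest_platform(ShareRequirements(bandwidth_gb_s=8, namespace_tb=100)))
```

A result from this filter is only a starting point for the sizing exercise recommended in the next paragraph.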

 

With this multi-protocol access and ability to tier data to object with a single global namespace, WEKA Data Platform enables customers to bring their most challenging HPC workloads to Azure. It brings flash level performance and low latency to massive storage environments. My recommendation would be to first define your storage needs up front – bandwidth, average file size, IOPS requirements, protocol access, capacity of hot/cool tier, etc. for your HPC Shares. The WEKA Data Platform allows you to build an architecture to meet those needs.

 

The WEKA Data Platform is a game-changer for High-Performance Computing (HPC) Grids, offering a versatile and powerful solution that enables organizations to manage and utilize their data resources efficiently. Whether you're a quant, data scientist, or an institution operating in the HPC realm, WEKA's advanced features, scalability, and flexibility provide a robust foundation for tackling data-intensive workloads and achieving exceptional performance. As the digital landscape continues to evolve, embracing the WEKA Data Platform is not just a smart choice; it's a strategic advantage that empowers you to harness the full potential of your HPC Grid.
