Ramp up with me... on HPC: Understanding Virtual Machines, CPUs, and GPUs

RachelPruitt · Aug 31 2023

You need a lot of different products to successfully complete a high-performance computing (HPC) workload. You’ll hear several terms regularly, like virtual machines, CPUs, GPUs, compute power, and compute constrained. While these are really important to talk about in high-performance computing, they were difficult concepts for me to grasp. Personally, I am a visual learner, and I struggle with theoretical concepts that I can’t physically see. So, I’ll do my best to both explain the concepts and show you the hardware so you can visualize what they are.

Since I work for Microsoft and am more familiar with the cloud, my focus will remain there. However, it’s important to remember that many companies keep their datacenters on premises or work in a hybrid model. While I’ll keep the focus on cloud computing here, each model is used for very specific reasons, and each has its own benefits.

What are virtual machines?

One of the most common products used in HPC is the virtual machine. A virtual machine is a software-based emulation of a physical computer, created by allocating portions of a physical computer’s resources (located in one of our datacenters) to a user. Think about taking a large computer and splitting all the components (CPU, memory, storage, etc.) into parts so many people can use the computer. Customers do this in the cloud so they can take advantage of the most performant technology, along with the scalability and flexibility the cloud brings. We won’t go in depth on all the benefits here, but know that this means you can use whatever machine works best for your current workload, so you can complete your work more efficiently with the best tools.

If you logged into a virtual machine using a virtual desktop, which is what allows you to connect to the virtual machine from your personal computer, it would look just like your own desktop. To see how you do this, watch my video on rendering.

I hear a lot about CPUs and GPUs. What are they?

Central processing units (CPUs) and graphics processing units (GPUs) are both types of computer processors that perform different types of computations.

CPUs are the primary processing units in a computer system, designed to handle a wide range of general-purpose computations. They are responsible for executing most of the instructions in a computer program and are optimized for handling sequential, single-threaded tasks that require a high level of flexibility and control. Single-threaded tasks are really just tasks that you can’t break up into several smaller tasks, because each step depends on the result of the step before it.

GPUs, on the other hand, are specialized processors that are designed to handle parallelizable computations involving very large amounts of data, such as graphics processing, machine learning, and scientific simulations. They are optimized for performing a large number of calculations simultaneously and are particularly effective at handling tasks that can be broken down into smaller, independent sub-tasks. We call this extreme parallelism.

At a basic level, think back to an algebra problem where some parts of the equation are in parentheses and some are not. The parentheses tell you the order in which the problem has to be solved, and which parts of the equation can be worked out in parallel. If the problem has to be done in order, then CPUs may be better. If the equation can be split into little parts that can be done at the same time, GPUs may be the right answer. Again, this is a very simplistic explanation, and there’s a lot more involved, but I hope it helps.
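To make the algebra analogy a little more concrete, here is a minimal Python sketch. The two expressions and the function names (`independent_groups`, `dependent_chain`) are my own illustrations, not anything from a specific product. Threads are used only to show the shape of the work; this is not a demonstration of real GPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def add(pair):
    """Solve one set of parentheses, e.g. (2 + 3)."""
    a, b = pair
    return a + b


def independent_groups(pairs):
    """(2 + 3) * (4 + 5) * (6 + 7): no sum needs another sum's answer,
    so all of them can be handed out to workers at the same time."""
    with ThreadPoolExecutor() as pool:
        sums = list(pool.map(add, pairs))
    # Only this final combining step has to wait for everything above.
    return reduce(operator.mul, sums, 1)


def dependent_chain(start, steps):
    """((2 + 3) * 4 + 5) * 6: every step needs the result before it,
    so the work has to happen in order, one step at a time."""
    value = start
    for op, operand in steps:
        value = op(value, operand)
    return value


product = independent_groups([(2, 3), (4, 5), (6, 7)])
chained = dependent_chain(
    2,
    [(operator.add, 3), (operator.mul, 4), (operator.add, 5), (operator.mul, 6)],
)
print(product, chained)
```

The first function is the kind of problem that splits into little parts; the second is the kind that has to be done in order.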

My favorite example came from a video I watched by NVIDIA and MythBusters, which showed a demo illustrating the difference between CPUs and GPUs. In the video, they had robots recreating images, one powered by CPUs and one powered by GPUs. While both got the job done, the difference was in how they did it. The machine using CPUs painted a small section each time, splattering the canvas one spot at a time. Eventually, this would finish the painting, but it would take a substantial amount of time because each piece was being done in sequence. The GPUs, however, came in and sprayed the entire painting at once, completing it in 80 milliseconds. Because there weren’t any dependencies, the entire painting could be painted in one shot.

So how does this relate to HPC?

The goal of HPC is to break up a problem into several parts and then use tens to thousands of machines to complete the parts in parallel. Customers working in high-performance computing choose machines (virtual or on premises) depending on the size and type of work that’s being done. These machines can use either CPUs or GPUs. CPUs are generally used for very complex calculations that can be done in parallel, but where some of the parts depend on others. This is often where workloads like simulation, modeling, and rendering sit, though not always. GPUs, on the other hand, are great for small tasks that can be done with extreme parallelization. If we go back to the painting analogy, we already knew what the painting would look like, so the color of one pixel did not depend on the color of any other pixel. Because of this, you can spin up many GPUs, and they can complete the workload significantly faster.
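Here is a rough sketch of that "break it up and hand out the parts" idea, using the painting analogy. Everything in it is an assumption for illustration: the tiny image size, the made-up `pixel_color` formula, and the use of Python threads as stand-ins for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4  # a deliberately tiny, hypothetical "painting"


def pixel_color(x, y):
    """Made-up formula: a pixel's color depends only on its own position,
    never on the color of a neighboring pixel."""
    return (x * 31 + y * 17) % 256


def render_rows(rows):
    """One worker's share of the job. It needs nothing from other workers."""
    return {y: [pixel_color(x, y) for x in range(WIDTH)] for y in rows}


def split(items, parts):
    """Deal the items out into `parts` roughly equal chunks."""
    return [items[i::parts] for i in range(parts)]


def render_parallel(workers):
    """Split the rows across workers, then stitch the pieces back together."""
    image = {}
    chunks = split(list(range(HEIGHT)), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(render_rows, chunks):
            image.update(partial)
    return [image[y] for y in range(HEIGHT)]


# Painting one pixel after another, for comparison.
serial = [[pixel_color(x, y) for x in range(WIDTH)] for y in range(HEIGHT)]
parallel = render_parallel(workers=4)
print(parallel == serial)
```

Because no pixel depends on another, the split version produces exactly the same picture as the one-at-a-time version, no matter how many workers share the job.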

While CPUs and GPUs each have things they do really well, I’ve come to find that both are being used across the board for use cases you wouldn’t expect, usually because of cost, speed, or performance. Inference, for instance, is something a virtual machine with a GPU will do very well and very quickly. However, customers regularly use virtual machines without GPUs, because GPUs can cost more and the customer doesn’t mind waiting a little longer for the job to finish on a cheaper machine. Like any other product, it’s really important to find the machine that has what you need for the right return on investment.
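The trade-off comes down to simple arithmetic: what you pay is the hourly rate multiplied by how long the job runs. The sketch below uses invented prices and run times purely for illustration. They are not real Azure rates or benchmark results.

```python
def job_cost(hourly_rate, hours):
    """Total cost of one job on one machine: price per hour times run time."""
    return hourly_rate * hours


# Hypothetical numbers, chosen only to show the shape of the trade-off.
gpu_vm = {"hourly_rate": 3.00, "hours": 2.0}  # faster, pricier per hour
cpu_vm = {"hourly_rate": 0.50, "hours": 8.0}  # slower, cheaper per hour

gpu_cost = job_cost(**gpu_vm)
cpu_cost = job_cost(**cpu_vm)
cheaper = "cpu" if cpu_cost < gpu_cost else "gpu"
extra_wait_hours = cpu_vm["hours"] - gpu_vm["hours"]

print(f"GPU VM: ${gpu_cost:.2f}, CPU VM: ${cpu_cost:.2f}")
print(f"Cheaper option: {cheaper}, extra wait: {extra_wait_hours} hours")
```

With these made-up numbers, the machine without a GPU takes four times as long but still costs less overall. Change the rates or run times and the answer can easily flip, which is why it’s worth running the numbers for your own workload.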

Let’s wrap this up

Here's the fun part. We like to think of HPC workloads as needing incredibly complex machines, and sometimes they do. But some companies are breaking the mold by spinning up hundreds or thousands of smaller virtual machines, rather than using more expensive machines with higher core counts or GPUs. If time is not a huge concern and there are plenty of smaller virtual machines available, this can sometimes bring large cost savings.

Take Impact Observatory as an example. Impact Observatory is changing the way maps are produced, using artificial intelligence to create a first-of-its-kind global map. While its initial model training was completed on GPU-based machines, when it came to using the output to render the actual map, they used large numbers of smaller machines to render a map of the entire world in only a week. Why? Because large numbers of those machines were available, and not using GPU-based virtual machines brought large cost savings.

Learn More

Interested in learning more about high performance computing?

Read about Azure HPC, AI Infrastructure
Visit our hub to find all Azure content for HPC
