Azure regularly updates its platform to enhance the host infrastructure for virtual machines, focusing on reliability, performance, and security. These updates range from the host operating system, the hypervisor, and the networking components and agents deployed on the host to hardware decommissioning:
There are two types of VM maintenance:
In cases where Live Migration can't be used, the VM experiences unexpected downtime (reboot).
In this article, we take a deep dive into the techniques used to apply planned maintenance and into what customers can and cannot control.
To go further into detail, Azure uses different techniques for updates depending on the type of update and the constraints to ensure that updates are minimally impactful:
Source: Inside Azure Innovations with Mark Russinovich | BKR214H
Azure can use one of the above techniques to minimize impact during unplanned hardware maintenance, unexpected downtime and planned maintenance.
Note: Reboot is not mentioned above – Guest VMs are only rebooted when the previous techniques cannot be used.
Here is a summary of the procedure used to manage updates and maintenance for the host:
An important point is that all the maintenance operations we just discussed can happen at any time in Azure: customers have no control over when these kinds of updates (which generate freezes) occur.
By default, when you provision a VM in Azure, it will land on a random host in the targeted region and availability zone, and this host is shared by multiple VMs from multiple customers. This is what we call Shared hosts.
In addition to our default hosting model using shared hosts, Microsoft also gives you the capability to have a dedicated host on which only your VMs can be hosted. This offer is named Azure Dedicated Host and offers various advantages among which are:
While both solutions have their pros and cons, let’s focus on this last item: maintenance controls.
As explained above, several techniques can be used to update various infrastructure components and we can distinguish two major categories of impacts:
When hosted on a shared host, customers are given a 35-day window during which they can plan when their VM reboot should occur: they control Rebootful updates during this window. Once it expires, Microsoft schedules the reboot on its own and the customer is notified a few minutes before the reboot occurs through Scheduled Events (we describe this mechanism in the next section).
As mentioned before, customers have no control over Rebootless updates on Shared hosts and cannot schedule them. This means that, on VMs running on Shared hosts, freezes (generally of a couple of seconds) can happen at any time.
If this kind of uncontrolled freeze is unacceptable for customer workloads, Azure Dedicated Hosts can be of great help. On these hosts, Maintenance Control allows customers to schedule all kinds of updates (rebootless and rebootful) and to apply them at a preferred time within a 35-day window.
Now that you have better visibility into your options in terms of maintenance control, let’s see how to manage updates on Shared hosts.
If a reboot is needed, customers are notified and given a time frame to initiate the maintenance themselves, typically within 35 days unless it is urgent. See Handling planned maintenance notification.
If no reboot is required, the VM is either paused or live-migrated to an already updated host.
Some applications may not tolerate a pause, even for a few seconds. For these applications, an alternative is possible:
Scheduled Events is an Azure Instance Metadata Service (IMDS) API that gives your application time to prepare for VM maintenance. It provides up to 15 minutes of advance notice before maintenance events (Reboot, Redeploy, Freeze, Preempt, Terminate) so that your application can prepare for them and limit disruption:
Services running on the VM can monitor this API to shut down gracefully (and drain connections) before the event is carried out.
Note: Scheduled Events is enabled when a service first queries the endpoint for events. Expect some delay in the first response (~1 min). It is disabled if the endpoint receives no request for 24 hours.
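As a minimal sketch of how a service on the VM might monitor this API, the Python below polls the IMDS endpoint, keeps the scheduled events that target the local VM, runs a preparation step, and then acknowledges each event. The `api-version` value, the response field names (`Events`, `EventId`, `EventType`, `EventStatus`, `Resources`) and the `StartRequests` acknowledgement body reflect our understanding of the Scheduled Events documentation and should be verified against it. The `prepare` callback and the `poll_once` helper are hypothetical names introduced for illustration.

```python
import json
import urllib.request

# IMDS is only reachable from inside an Azure VM.
# The api-version below is an assumption; check which versions your VM supports.
IMDS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
HEADERS = {"Metadata": "true", "Content-Type": "application/json"}

# Event types listed in the article.
DISRUPTIVE_TYPES = {"Reboot", "Redeploy", "Freeze", "Preempt", "Terminate"}


def fetch_events(timeout=120):
    """Query IMDS for the current scheduled events document.

    The timeout is generous because the first call can be slow.
    """
    request = urllib.request.Request(IMDS_URL, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def events_needing_action(document, vm_name):
    """Return events that are still scheduled (not started) and target this VM."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventStatus") == "Scheduled"
        and event.get("EventType") in DISRUPTIVE_TYPES
        and vm_name in event.get("Resources", [])
    ]


def acknowledge(event_id):
    """Tell the platform the VM is ready, so it need not wait for the full notice."""
    body = json.dumps({"StartRequests": [{"EventId": event_id}]}).encode("utf-8")
    request = urllib.request.Request(IMDS_URL, data=body, headers=HEADERS, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def poll_once(vm_name, prepare, fetch=fetch_events, ack=acknowledge):
    """Run one polling cycle: prepare for each pending event, then acknowledge it.

    `prepare` is a hypothetical callback, for example one that drains connections.
    """
    pending = events_needing_action(fetch(), vm_name)
    for event in pending:
        prepare(event)
        ack(event["EventId"])
    return [event["EventId"] for event in pending]
```

A service would typically call `poll_once` in a loop with a short sleep between iterations. Acknowledging is optional: without it, the event simply starts once its notice period has elapsed.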
As previously explained, Azure Dedicated Host can be a solution for maintaining control over when maintenance is applied.
Project Flash enables Azure customers to detect & diagnose ongoing and completed availability disruptions, including VM degradation.
Azure VM availability can be monitored using:
Using Azure Resource Graph, there are two types of events populated in the HealthResources table:
Denotes the availability state of the Virtual Machine.
Takes one of the values Available | Unavailable | Unknown | Degraded:
{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "previousAvailabilityState": "Unavailable",
  "targetResourceId": <ARM Id>,
  "occurredTime": <Precise Time stamp of transition>,
  "availabilityState": "Available"
}
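As a rough illustration of how these transitions could be used, the sketch below pairs each transition into Unavailable with the next transition back to Available for the same VM and returns the elapsed time. It assumes the events have already been exported from Azure Resource Graph as a list of dictionaries with the fields shown above, and that `occurredTime` is a UTC timestamp ending in `Z`. The helper names `parse_time` and `downtime_intervals` are hypothetical.

```python
from datetime import datetime


def parse_time(value):
    """Parse a UTC timestamp such as 2022-09-25T20:21:37.5280000Z."""
    # datetime keeps at most six fractional digits, so trim or pad to six.
    head, _, fraction = value.rstrip("Z").partition(".")
    return datetime.fromisoformat(f"{head}.{fraction[:6].ljust(6, '0')}+00:00")


def downtime_intervals(events):
    """Return (vm_id, timedelta) pairs for each Unavailable -> Available cycle."""
    went_down = {}
    intervals = []
    for event in sorted(events, key=lambda e: parse_time(e["occurredTime"])):
        vm_id = event["targetResourceId"].lower()  # ARM IDs are case-insensitive
        when = parse_time(event["occurredTime"])
        if event["availabilityState"] == "Unavailable":
            # Keep the earliest time the VM was seen going down.
            went_down.setdefault(vm_id, when)
        elif event["availabilityState"] == "Available" and vm_id in went_down:
            intervals.append((vm_id, when - went_down.pop(vm_id)))
    return intervals
```

Transitions through Unknown or Degraded are ignored here. Whether they should count as downtime depends on the workload.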
Provides context to interpret why a change in VM availability has occurred, to decisively take actions if needed.
{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "targetResourceId": <ARM Id>,
  "annotationName": "VirtualMachineHostRebootedForRepair",
  "occurredTime": "2022-09-25T20:21:37.5280000Z",
  "category": "Unplanned",
  "summary": "We're sorry, your virtual machine isn't available because an unexpected failure on the host server. Azure has begun the auto-recovery process and is currently rebooting the host server. No additional action is required from you at this time. The virtual machine will be back online after the reboot completes.",
  "context": "Platform Initiated",
  "reason": "Unexpected host failure",
  "impactType": "Downtime Reboot"
}
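To show how such annotations might be triaged once retrieved, here is a small sketch that keeps unplanned events whose impact is a downtime and counts annotations by name. It assumes the annotations are available as a list of dictionaries shaped like the example above. The only `category` and `impactType` values it relies on are the ones shown in that example, and the function names are hypothetical.

```python
from collections import Counter


def unplanned_downtime(annotations):
    """Keep annotations marked Unplanned whose impact type starts with 'Downtime'."""
    return [
        annotation
        for annotation in annotations
        if annotation.get("category") == "Unplanned"
        and annotation.get("impactType", "").startswith("Downtime")
    ]


def count_by_name(annotations):
    """Count how often each annotationName occurs, to spot recurring causes."""
    return Counter(annotation["annotationName"] for annotation in annotations)
```

For the annotation shown above, `unplanned_downtime` would return it unchanged and `count_by_name` would report one occurrence of `VirtualMachineHostRebootedForRepair`.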
Examples of useful KQL queries on the HealthResources table are available.
In this article, we detailed the two hosting models (shared hosts and dedicated hosts) and their respective options regarding planned maintenance, unplanned maintenance, and their maintenance controls.
If your workload can tolerate infrequent freezes lasting a couple of seconds (which is generally the case), or if 15 minutes is enough for you to prepare for these freezes (for example, by draining current connections and refusing new ones), the default hosting model with shared hosts is ideal, as it provides scalability and ease of management.
On the other hand, if your workloads are highly sensitive to freezes, even for a few seconds, you should consider using Azure Dedicated Hosts, which give you control over maintenance.
Microsoft is still working to improve the maintenance control experience on shared hosts, so this experience should keep getting better with the shared hosts model.