Automatic AKS Upgrade
Context
Microsoft releases a new minor version of Azure Kubernetes Service (AKS) roughly every four months, following the release rhythm of Kubernetes itself. These releases bring new features and security fixes that keep AKS in line with the latest Kubernetes versions.
An AKS migration deploys new nodes running the target AKS version, moves workloads from the old nodes to the new ones, and then decommissions the old nodes. It can be performed manually or automatically. Microsoft offers an auto-upgrade feature that configures and applies AKS version upgrades, and Microsoft develops, maintains, and supports that feature itself. Relying on it removes the manual effort of performing each upgrade.
Auto-upgrade does not remove the need for preparation: the migration still has to be planned and tested to keep the transition smooth and to minimize disruption to workloads.
- https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster
- https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster?tabs=azure-cli
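As a minimal sketch of how auto-upgrade can be switched on or off for a cluster, the helper below assembles an `az aks update` invocation that sets the auto-upgrade channel. The helper name, the resource group, and the cluster name are hypothetical; the command is only built here, not executed.

```python
# Channel names accepted by `az aks update --auto-upgrade-channel`.
VALID_CHANNELS = {"none", "patch", "stable", "rapid", "node-image"}


def build_auto_upgrade_command(resource_group, cluster_name, channel):
    """Return the az CLI argument list that sets a cluster's auto-upgrade channel."""
    if channel not in VALID_CHANNELS:
        raise ValueError(f"unknown auto-upgrade channel: {channel!r}")
    return [
        "az", "aks", "update",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--auto-upgrade-channel", channel,
    ]


# Hypothetical names, for illustration only.
cmd = build_auto_upgrade_command("rg-sandbox", "aks-sandbox-01", "stable")
print(" ".join(cmd))
```

Passing `none` as the channel deactivates auto-upgrade for the cluster.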
Sandbox before production
Microsoft does not publish a predefined schedule for when AKS releases appear in the upgrade channels. Relying on Microsoft's auto-upgrade schedule alone therefore cannot guarantee that sandbox clusters are upgraded before production clusters. To mitigate this, auto-upgrade is kept deactivated on production clusters by default. Once the sandbox clusters have been upgraded successfully, auto-upgrade is activated on the production clusters for the duration of their upgrade, and it is deactivated again as soon as that upgrade is complete.
This approach ensures that production clusters are never upgraded automatically without control or coordination. Because sandbox and production clusters follow separate upgrade strategies, they can be prioritized and managed independently.
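The gating rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the team's actual tooling: the function name and its arguments are hypothetical, sandbox clusters are assumed to keep auto-upgrade on, and `stable` is assumed to be the channel in use.

```python
def desired_auto_upgrade_channel(cluster_type, sandbox_validated, prod_upgraded):
    """Decide which auto-upgrade channel a cluster should have right now.

    cluster_type:      "sandbox" or "prod"
    sandbox_validated: True once the sandbox clusters run the new version successfully
    prod_upgraded:     True once this production cluster has finished its own upgrade
    """
    if cluster_type == "sandbox":
        # Assumption: sandbox clusters keep auto-upgrade activated.
        return "stable"
    if cluster_type == "prod":
        # Open the gate only between sandbox validation and the end of the prod upgrade.
        if sandbox_validated and not prod_upgraded:
            return "stable"
        return "none"
    raise ValueError(f"unknown cluster type: {cluster_type!r}")
```

For example, `desired_auto_upgrade_channel("prod", True, False)` returns `"stable"`, while the same call before sandbox validation or after the production upgrade returns `"none"`.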
Legacy migration flow
In previous years, the AKS upgrade process was handled by the K8saas team through a migration process. This migration typically took place twice a year and had an approximate duration of 2 months. The process involved the following steps:
- Creating a migration schedule for all clusters.
- Informing customers via email about the upcoming migration of their clusters.
- Prior to starting a migration, sending an email notification to customers to let them know that the AKS upgrade was about to commence.
- Performing the AKS upgrade as planned.
- Once the upgrade was completed, sending a final email notification to customers to inform them that the AKS upgrade had been successfully carried out.
This systematic approach ensured that customers were adequately informed and engaged throughout the AKS upgrade process. By providing timely notifications and updates, it helped to minimize any potential disruptions and allowed customers to maintain awareness of the migration progress.
New migration flow
The AKS upgrade process is now managed by Microsoft, so it is currently not possible to notify customers directly when an upgrade starts or ends. However, a fixed schedule is configured for each cluster, and this schedule can be shared with customers if needed.
Cluster type | Week | Time | Auto-upgrade status |
---|---|---|---|
Sandbox | First week of the month | 0-23 | Always activated |
Prod (without distinct upgrade date) | Third week of the month | 0-23 | Only activated after sandbox upgrades |
Prod (with distinct upgrade date) | N/A | N/A | Legacy migration flow |
Sandbox clusters are scheduled for upgrades during the first week of every month, with auto-upgrade always activated. Production clusters without a distinct upgrade date are scheduled for the third week of every month, but auto-upgrade is only activated for them after the sandbox clusters have completed their upgrades. Production clusters with a distinct upgrade date continue to follow the legacy migration flow.
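As a rough sketch, the week column of the table can be checked with a helper like the one below. It assumes that "week N of the month" means days 7(N-1)+1 through 7N (so the first week is days 1 to 7 and the third week is days 15 to 21); the source does not define the week boundaries, and the names used here are hypothetical.

```python
from datetime import date

# Week of the month in which each cluster type is scheduled, taken from the table.
UPGRADE_WEEK = {"sandbox": 1, "prod": 3}


def week_of_month(day):
    """Return 1 for days 1-7, 2 for days 8-14, and so on (assumed definition)."""
    return (day.day - 1) // 7 + 1


def in_upgrade_week(cluster_type, day):
    """True if `day` falls in the scheduled upgrade week for this cluster type."""
    return week_of_month(day) == UPGRADE_WEEK[cluster_type]
```

With this definition, `in_upgrade_week("sandbox", date(2024, 5, 3))` and `in_upgrade_week("prod", date(2024, 5, 17))` are both true.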
Using the same fixed day of the week and the same fixed hour for every cluster would trigger a large number of AKS upgrades simultaneously and could overload the Azure API. To avoid this, upgrades are spread uniformly across the days of the week and the hours of the day.
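One way to spread clusters across days and hours is to derive each cluster's slot from a hash of its name, which keeps the assignment stable between runs without storing any state. This is only an illustrative sketch: the source does not say how the slots are actually assigned, the function name is hypothetical, and hashing gives an approximately rather than strictly uniform spread.

```python
import hashlib

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def upgrade_slot(cluster_name):
    """Map a cluster name to a stable (day of week, hour of day) upgrade slot."""
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    # 7 days x 24 hours = 168 possible slots.
    slot = int.from_bytes(digest[:8], "big") % (7 * 24)
    day_index, hour = divmod(slot, 24)
    return DAYS[day_index], hour
```

Calling `upgrade_slot` twice with the same cluster name always yields the same day and hour, and the hour is always in the range 0 to 23 used in the schedule table.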
This new upgrade concept allows for AKS upgrades to be rolled out as soon as they are available in the specific region where the cluster is located. This means that upgrades can occur more frequently, potentially on a monthly basis, as opposed to being limited to just twice a year. However, it's important to maintain stability by following a set rhythm for cluster upgrades.
When the stable channel is selected, the cluster is automatically upgraded to the latest supported patch release on minor version N-1, where N is the latest supported minor version. For example, if a cluster is running version 1.17.7 and versions 1.17.9, 1.18.4, 1.18.6, and 1.19.1 are available, N is 1.19, N-1 is 1.18, and the cluster is upgraded to version 1.18.6.
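The selection rule can be sketched as follows. This is an illustration of the rule as described here, not Microsoft's implementation; the function names are hypothetical, versions are assumed to be plain `major.minor.patch` strings, and the sketch does not handle a change of major version.

```python
def parse_version(version):
    """Turn "1.18.6" into the tuple (1, 18, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def stable_channel_target(current, available):
    """Return the latest patch on minor N-1, or None if there is nothing to upgrade to."""
    versions = [parse_version(v) for v in available]
    latest_major, latest_minor, _ = max(versions)
    target_minor = (latest_major, latest_minor - 1)  # minor version N-1
    candidates = [v for v in versions if v[:2] == target_minor]
    if not candidates:
        return None
    best = max(candidates)
    if best <= parse_version(current):
        return None  # already at or beyond the stable target
    return ".".join(str(part) for part in best)


target = stable_channel_target("1.17.7", ["1.17.9", "1.18.4", "1.18.6", "1.19.1"])
print(target)  # 1.18.6
```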
By adopting this approach, clusters can benefit from the latest bug fixes, patches, and security updates while maintaining stability and minimizing any potential disruptions that could arise from simultaneous upgrades.
Deployment Phases
Continuously - Long Term Monitoring
To ensure the stability and functionality of the clusters, the following monitoring activities will be performed continuously:
- Monitoring for AKS Version Drift: All clusters will be monitored continuously to detect any deviation from the expected AKS version. This monitoring will ensure that clusters are running on the intended AKS version and help identify any potential issues.
- Functional Cluster Monitoring: All clusters will be constantly monitored to ensure their full functionality. This monitoring will cover various aspects of cluster performance, including resource utilization, availability of critical components, and overall cluster health.
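The version drift check can be sketched as a comparison between the expected AKS version and the version each cluster reports. The function name, the cluster names, and the way versions are collected are assumptions made for illustration; the source does not describe the monitoring implementation.

```python
def find_version_drift(expected_version, cluster_versions):
    """Return the clusters whose reported version differs from the expected one.

    cluster_versions maps a cluster name to the AKS version it currently reports.
    """
    return {
        name: version
        for name, version in sorted(cluster_versions.items())
        if version != expected_version
    }


# Hypothetical inventory, for illustration only.
reported = {
    "aks-sandbox-01": "1.18.6",
    "aks-prod-01": "1.17.7",
    "aks-prod-02": "1.18.6",
}
drifted = find_version_drift("1.18.6", reported)
print(drifted)  # {'aks-prod-01': '1.17.7'}
```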
Future improvements
To further enhance the upgrade process, the following improvements are planned:
- Activation of Auto-Upgrade (Sandbox Clusters): Auto-upgrade functionality will be activated for all sandbox clusters, ensuring that they receive automatic upgrades whenever a new version is available. This will help identify potential compatibility problems and other issues early, in a controlled environment.
- Activation of Auto-Upgrade (Production Clusters without specific upgrade date): Once the sandbox upgrade process is completed and validated, auto-upgrade functionality will be activated for all production clusters without a specific upgrade date. This will streamline the upgrade process and ensure that these production clusters run the latest version without manual intervention.
- API: An API will be made available for users to look up their update schedule.
- More Flexible Upgrade Scheduling: The upgrade scheduling system will be enhanced to provide customers with more options and flexibility. Customers will be able to choose a more restrictive schedule for their cluster upgrades to minimize potential disruptions during critical business operations.
These improvements will contribute to a more automated, efficient, and customer-centric upgrade process, ensuring a smoother experience for customers and reducing downtime during upgrades.