Automatic AKS Upgrade

Context

Microsoft makes a new Kubernetes minor version available in Azure Kubernetes Service (AKS) roughly every four months, following the release rhythm of Kubernetes itself (about three minor releases per year). These updates include new features and security enhancements to keep AKS up-to-date and in line with the latest Kubernetes releases.

Now, let's discuss the AKS migration process. AKS migration involves deploying new nodes with the targeted AKS version, transferring workloads from old nodes to the new ones, and eventually decommissioning the old nodes. This process can be performed manually or automatically. Microsoft offers features that allow for automatic configuration and upgrades of AKS versions. Leveraging Microsoft's auto-upgrade capability can be advantageous because Microsoft has the resources to develop, maintain, and support this feature. It provides assurance and frees up the user's time from the manual effort required for performing upgrades.

By utilizing Microsoft's auto-upgrade, you can take advantage of their expertise to seamlessly manage the migration process while focusing on other important tasks. However, it's important to carefully plan and test the migration process to ensure a smooth transition and minimize potential disruptions to your workloads.

Sandbox before production

Microsoft does not publish a predefined schedule for when AKS releases reach the upgrade channels, so relying solely on Microsoft's auto-upgrade schedule cannot guarantee that sandbox clusters are upgraded before production clusters. To mitigate this, auto-upgrade stays deactivated on production clusters by default. Once the sandbox clusters have been upgraded successfully, auto-upgrade is activated on the production clusters for the duration of their own upgrade. As soon as that upgrade is complete, auto-upgrade is turned off on the production clusters again.

This approach ensures that production clusters are not automatically upgraded without control or coordination, allowing for careful management of the upgrade process. By separating the upgrade strategies for sandbox and production clusters, it becomes possible to prioritize and manage them independently.
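
The gating rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cluster` type, the `kind` labels, and the idea of comparing every sandbox against a single `target_version` are hypothetical and not taken from the actual tooling. The command that is built uses the Azure CLI's `az aks update --auto-upgrade-channel` option; whether the platform applies the setting through the CLI is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    resource_group: str
    kind: str     # "sandbox" or "prod" (hypothetical labels)
    version: str  # current Kubernetes version, e.g. "1.18.6"


def desired_channel(cluster, sandboxes, target_version):
    """Return the auto-upgrade channel the cluster should have right now."""
    if cluster.kind == "sandbox":
        return "stable"  # sandboxes keep auto-upgrade activated
    # Production: activate only once every sandbox runs the target version,
    # and deactivate again as soon as this cluster has been upgraded.
    sandboxes_ready = bool(sandboxes) and all(
        s.version == target_version for s in sandboxes
    )
    already_upgraded = cluster.version == target_version
    if sandboxes_ready and not already_upgraded:
        return "stable"
    return "none"


def az_update_command(cluster, channel):
    """Build (but do not run) the Azure CLI call that applies the channel."""
    return [
        "az", "aks", "update",
        "--resource-group", cluster.resource_group,
        "--name", cluster.name,
        "--auto-upgrade-channel", channel,
    ]
```

Running this check periodically for every production cluster would switch auto-upgrade on when the sandboxes are ready and off again after the upgrade, without anyone having to remember the second step.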

Legacy migration flow

In previous years, the AKS upgrade process was handled by the K8saas team through a migration process. This migration typically took place twice a year and had an approximate duration of 2 months. The process involved the following steps:

  1. Creating a migration schedule for all clusters.
  2. Informing customers via email about the upcoming migration of their clusters.
  3. Prior to starting a migration, sending an email notification to customers to let them know that the AKS upgrade was about to commence.
  4. Performing the AKS upgrade as planned. Once the upgrade was completed, sending a final email notification to customers to inform them that the AKS upgrade had been successfully carried out.

This systematic approach ensured that customers were adequately informed and engaged throughout the AKS upgrade process. By providing timely notifications and updates, it helped to minimize any potential disruptions and allowed customers to maintain awareness of the migration progress.

New migration flow

The AKS upgrade process is managed by Microsoft, making it currently impossible to directly notify customers about the start and end of the upgrade. However, a fixed schedule is configured for each cluster, and it is possible to provide this information to customers if needed.

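| Cluster type | Week                     | Time (hour of day) | Auto-upgrade status                    |
|--------------|--------------------------|--------------------|----------------------------------------|
| Sandbox      | First week of the month  | 0-23               | Always activated                       |
| Prod         | Third week of the month  | 0-23               | Only activated after sandbox upgrades  |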

As the table shows, sandbox clusters are scheduled for upgrades during the first week of every month, with auto-upgrade permanently activated. Prod clusters are scheduled for upgrades during the third week of every month, but their auto-upgrade is only activated after the sandbox clusters have completed their upgrades.

Using the same fixed day of the week and the same fixed hour for every cluster could overload the Azure API, because a large number of cluster upgrades would start simultaneously. To avoid this, the upgrades are spread uniformly across the days of the week and the hours of the day.
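
One way to spread the upgrade windows is to derive each cluster's day and hour from a hash of its name. The sketch below is illustrative only: the function name `upgrade_slot`, the `kind` labels, and the choice of hashing are assumptions, not the platform's actual assignment method. Hashing gives an approximately uniform spread and keeps a cluster's slot stable when other clusters are added or removed.

```python
import hashlib

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Week of the month per cluster type, as given in the schedule table.
WEEK_BY_KIND = {"sandbox": "First", "prod": "Third"}


def upgrade_slot(cluster_name, kind):
    """Map a cluster to a (week, day of week, start hour) upgrade slot."""
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    # 7 days x 24 hours = 168 possible slots within the week.
    slot = int.from_bytes(digest[:8], "big") % (7 * 24)
    day_index, hour = divmod(slot, 24)
    return WEEK_BY_KIND[kind], DAYS[day_index], hour
```

If an exactly even distribution matters more than slot stability, assigning slots round-robin over a sorted list of cluster names would be an alternative.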

This new upgrade concept allows for AKS upgrades to be rolled out as soon as they are available in the specific region where the cluster is located. This means that upgrades can occur more frequently, potentially on a monthly basis, as opposed to being limited to just twice a year. However, it's important to maintain stability by following a set rhythm for cluster upgrades.

With the stable channel, the cluster is automatically upgraded to the latest supported patch release on minor version N-1, where N is the latest supported minor version. For example, if a cluster is running version 1.17.7 and versions 1.17.9, 1.18.4, 1.18.6, and 1.19.1 are available, the cluster is upgraded to version 1.18.6.
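
The selection rule of the stable channel can be reproduced with a short Python sketch. It only mirrors the rule as described here, is not how AKS implements it, and assumes that the list of available versions contains exactly the supported ones; the function names are illustrative.

```python
def parse(version):
    """Turn "1.18.6" into the comparable tuple (1, 18, 6)."""
    return tuple(int(part) for part in version.split("."))


def stable_target(current, available):
    """Return the version the stable channel would pick, or None."""
    versions = sorted(parse(v) for v in available)
    minors = sorted({v[:2] for v in versions})
    if len(minors) < 2:
        return None  # no N-1 minor version to select from
    n_minus_1 = minors[-2]  # N is minors[-1], the latest minor version
    latest_patch = max(v for v in versions if v[:2] == n_minus_1)
    if latest_patch <= parse(current):
        return None  # already on (or past) the stable target
    return ".".join(str(part) for part in latest_patch)


available = ["1.17.9", "1.18.4", "1.18.6", "1.19.1"]
print(stable_target("1.17.7", available))  # prints 1.18.6
```

Here N is 1.19, so N-1 is 1.18, and the latest 1.18 patch in the list is 1.18.6.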

By adopting this approach, clusters can benefit from the latest bug fixes, patches, and security updates while maintaining stability and minimizing any potential disruptions that could arise from simultaneous upgrades.

Deployment Phases

Continuously - Long Term Monitoring

To ensure the stability and functionality of the clusters, the following monitoring activities will be performed continuously:

  1. Monitoring for AKS Version Drift:

    All clusters will be monitored continuously to detect any deviation from the expected AKS version. This monitoring will ensure that clusters are running on the intended AKS version and help identify any potential issues.

  2. Functional Cluster Monitoring:

    All clusters will be constantly monitored to ensure their full functionality. This monitoring will cover various aspects of cluster performance, including resource utilization, availability of critical components, and overall cluster health.
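
The version-drift check in item 1 amounts to comparing each cluster's reported version with the expected one. A minimal sketch, assuming both are available as plain dictionaries keyed by cluster name (the function name and data shapes are hypothetical, not the monitoring stack's actual interface):

```python
def find_version_drift(actual_versions, expected_versions):
    """Return {cluster: (expected, actual)} for every deviating cluster."""
    drift = {}
    for name, expected in expected_versions.items():
        # A cluster missing from the report yields None and counts as drift.
        actual = actual_versions.get(name)
        if actual != expected:
            drift[name] = (expected, actual)
    return drift
```

An empty result means every monitored cluster runs its intended AKS version; any entry could feed an alert.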

Phase 1 - Testing - 2nd week of July 2024

During this phase, a selected subset of sandbox clusters will undergo testing and activation of the auto-upgrade feature. The following steps will be taken:

  1. Activation of Auto-Upgrade (Subset of Sandbox Clusters):

    A set of 10% of sandbox clusters (10 clusters) will be selected, and the auto-upgrade feature will be activated for these clusters. This will enable automatic upgrades for these clusters during the scheduled upgrade process.

  2. Deployment of New Sandbox Clusters:

    New sandbox clusters will be deployed with the auto-upgrade feature already enabled. This ensures that new clusters come with the automation feature from the outset.

  3. Scheduled AKS Upgrades:

    Microsoft will perform AKS upgrades for the assigned sandbox clusters, according to the predefined upgrade schedule. This will provide valuable insights into the upgrade process and identify any potential issues that may arise.

Phase 2 - Testing - 3rd week of July 2024

In the second testing phase, a larger subset of sandbox clusters will go through similar testing and auto-upgrade activation. The following steps will be executed:

  1. Activation of Auto-Upgrade (Additional Sandbox Clusters):

    An additional 40% of the sandbox clusters (40 clusters) will be selected, and the auto-upgrade feature will be activated for these clusters. This expands the testing of automatic upgrades to a larger sample size.

  2. Scheduled AKS Upgrades:

    Microsoft will conduct AKS upgrades for the designated sandbox clusters, following the predetermined upgrade schedule. This phase will provide further validation and feedback on the auto-upgrade functionality.
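
The split of sandbox clusters across the testing phases can be sketched as follows. This assumes that both percentages are fractions of the whole sandbox fleet, which matches the stated counts of 10 and 40 clusters for a fleet of 100. The function name, the random selection, and the fixed seed are illustrative assumptions; the source does not say how clusters are chosen.

```python
import random


def plan_phases(sandbox_names, fractions=(0.10, 0.40), seed=0):
    """Split sandboxes into one group per fraction plus a final remainder."""
    rng = random.Random(seed)         # fixed seed keeps the plan reproducible
    shuffled = sorted(sandbox_names)  # sort first so input order is irrelevant
    rng.shuffle(shuffled)

    total = len(shuffled)
    phases = []
    start = 0
    for fraction in fractions:
        size = round(total * fraction)
        phases.append(shuffled[start:start + size])
        start += size
    phases.append(shuffled[start:])   # everything left over for Phase 3
    return phases
```

For 100 sandbox clusters this yields groups of 10, 40, and 50 clusters, with no cluster appearing in more than one group.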

Phase 3 - Sandboxes - 4th week of July 2024

In this phase, the overall approach for the remaining sandbox clusters' upgrade will depend on the results from the previous phases:

  1. Auto-Upgrade Activation or Legacy Migration:

    Based on the outcomes of Phases 1 and 2, the remaining sandbox clusters will be assigned for either auto-upgrade activation or the legacy migration process. This decision will be made considering the success and feedback received during the initial testing phases.

This phased deployment plan ensures a systematic approach to testing and upgrading the clusters, progressively expanding the scope while monitoring the process and capturing valuable insights. By selectively enabling the auto-upgrade feature and carefully assessing the results, the upgrade process can be fine-tuned for maximum efficiency and minimal disruption.

Phase 4 - Production - mid-August to be confirmed

During this phase, upgrades will be performed using the legacy migration process. The following steps will be executed:

  1. Informing Customer via Email:

    A notification email will be sent to all customers, informing them about the upcoming upgrade for all clusters. The email will include details about the upgrade schedule and any necessary actions they need to take.

  2. Informing Customer via Email (Specific Cluster):

    Before starting the upgrade on a specific cluster, a dedicated email will be sent to the customer who owns that cluster. This email will provide specific information and instructions related to their cluster's upgrade process.

  3. Upgrading Cluster:

    The upgrade process will be initiated for each cluster. The legacy migration process will be followed to ensure a smooth transition and minimize any downtime or disruptions.

  4. Informing Customer via Email (Completion):

    Once the upgrade of a cluster is completed, a notification email will be sent to the customer to inform them about the successful upgrade and confirm that their cluster is now running on the latest version.

Future improvements

To further enhance the upgrade process, the following improvements are planned:

  1. Activation of Auto-Upgrade (Sandbox Clusters):

    Auto-upgrade functionality will be activated for all sandbox clusters, ensuring that they receive automatic upgrades whenever a new version is available. This will help identify potential compatibility problems or other issues early on, in a controlled environment.

  2. Activation of Auto-Upgrade (Production Clusters):

    Once the sandbox upgrade process is completed and validated, auto-upgrade functionality will be activated for all production clusters. This will streamline the upgrade process and ensure that all production clusters are running on the latest version without manual intervention.

  3. Email Notifications for Incoming Upgrades:

    A comprehensive email notification system will be implemented to proactively inform customers about upcoming upgrades for their clusters. This will help them prepare and plan accordingly.

  4. More Flexible Upgrade Scheduling:

    The upgrade scheduling system will be enhanced to provide customers with more options and flexibility. Customers will be able to choose a more restrictive schedule for their cluster upgrades to minimize potential disruptions during critical business operations.

These improvements will contribute to a more automated, efficient, and customer-centric upgrade process, ensuring a smoother experience for customers and reducing downtime during the upgrade process.