
Status Page

Incident INC-K8SAAS-002: Closed

Date: 20/08/21

Impact: a few customers could not create a PVC.

08:00 CET - Unable to create a Persistent Volume Claim (PVC) on multiple clusters

Investigation found a regression introduced by the last migration: a few AKS clusters in the k8saas perimeter were still using the legacy service principal. Affected clusters show logs like:

# for NGINX controller
2m24s Warning ListLoadBalancers service/nginx-ingress-ingress-nginx-controller (combined from similar events): Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to http://XXX StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: XXX\r\nCorrelation ID: XXXf\r\nTimestamp: 2021-08-20

# for PVC
Warning ProvisioningFailed 10s persistentvolume-controller Failed to provision volume with StorageClass "default-nowc": Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to http://localhost:7788/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-XXX?api-version=2019-11-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: XXX\r\nCorrelation ID: XXX\r\nTimestamp: 2021-08-20 16:39:49Z","error_codes":[7000215],"timestamp":"2021-08-20 16:39:49Z","trace_id":"XXX","correlation_id":"XXX","error_uri"
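To check whether a cluster is affected, the two events above can be matched on their common fields: the HTTP status code and the Azure AD error code. The following is a minimal sketch, assuming the event messages have already been collected as plain strings; the function name `classify_event` and the abridged sample message are illustrative and not part of any k8saas tooling.

```python
import re

# Fields that appear in both events quoted above.
HTTP_STATUS = re.compile(r'HTTPStatusCode: (\d{3})')
# Capture the AADSTS code and its message up to the first backslash or quote.
AAD_ERROR = re.compile(r'(AADSTS\d+): ([^\\"]+)')


def classify_event(message: str) -> dict:
    """Extract the HTTP status and the Azure AD error code from one event message."""
    status = HTTP_STATUS.search(message)
    aad = AAD_ERROR.search(message)
    return {
        "http_status": int(status.group(1)) if status else None,
        "aad_code": aad.group(1) if aad else None,
        "aad_message": aad.group(2).strip() if aad else None,
        # AADSTS7000215 is the "Invalid client secret" error seen in this incident.
        "invalid_client_secret": bool(aad) and aad.group(1) == "AADSTS7000215",
    }


# Abridged from the PVC event above (hypothetical shortening, same fields).
sample = (
    'Failed to provision volume with StorageClass "default-nowc": '
    'Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, '
    'Response body: {"error":"invalid_client","error_description":'
    '"AADSTS7000215: Invalid client secret is provided.\\r\\nTrace ID: XXX"}'
)

print(classify_event(sample))
```

An event that reports status 401 together with AADSTS7000215 suggests the cluster is still presenting an outdated service principal secret, which matches the regression described above.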

10:20 CET - incident opened with Microsoft, severity A

10:30 CET - call from the Microsoft incident manager to confirm the severity

10:50 CET - call from the technical team

The investigation points to an issue on the Azure side.

16:20 CET - Update

The investigation is still in progress. No solution has been found yet.

Date: 23/08/21

08:00 CET - the issue has disappeared. All storage and NGINX services are OK.

Waiting for an explanation from Microsoft.

Incident INC-K8SAAS-001: Closed

20:07 UTC - First issue: unable to log in to AKS:

# Any kubectl command gives
error: You must be logged in to the server (Unauthorized)

In the Azure portal: [screenshot]

20:37 UTC - Escalation: Call to Level 2 TDP IT

20:48 UTC - Mail from Azure: Confirmation of AAD incident

23:48 UTC - Mail from Azure: Pursuing mitigation actions for residual impact

03:30 UTC (16/03) - Mail from Azure: Incident resolved

Impacts:

  • Unable to authenticate end users of applications that use Azure Active Directory
  • Unable to access Grafana and Log Analytics to monitor your applications
  • Unable to administer AKS clusters (any kubectl command run with a nominative account)

Temporary / backup solution:

  • Please use the service account used by your CI/CD pipeline to administer your cluster
  • To get your service account, please contact me at loic.jardin@thalesdigital.io
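The workaround above relies on the fact that Kubernetes service account tokens are validated by the cluster's API server itself, so they do not depend on Azure AD. As a minimal sketch of what using such an account can look like, the snippet below builds a kubeconfig that authenticates with a static bearer token. The function name, the file name, and all three input values are placeholders: the real server URL, CA certificate, and token are the ones you receive with your service account.

```python
import json


def build_kubeconfig(server: str, ca_data_b64: str, token: str,
                     name: str = "ci-cd") -> dict:
    """Build a kubeconfig that authenticates with a service account bearer token."""
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": name,
            "cluster": {
                "server": server,
                "certificate-authority-data": ca_data_b64,
            },
        }],
        "users": [{"name": name, "user": {"token": token}}],
        "contexts": [{
            "name": name,
            "context": {"cluster": name, "user": name},
        }],
        "current-context": name,
    }


config = build_kubeconfig(
    server="https://example-cluster.invalid:443",    # placeholder API server URL
    ca_data_b64="<base64-encoded CA certificate>",   # placeholder
    token="<service account token>",                 # placeholder
)

# JSON is a subset of YAML, so the output can be saved as a kubeconfig file.
print(json.dumps(config, indent=2))
```

Saved to a file (for example `kubeconfig-sa.json`, a hypothetical name), the result can be passed to kubectl with `--kubeconfig`. Treat the file as a secret, since it contains the token in clear text.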

Preliminary Root Cause:

The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key. Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
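The failure mode described above can be illustrated with a toy model. This is not Microsoft's code: the `SigningKey` class, the key names, and both cleanup functions are hypothetical and only mirror the description, in which a cleanup job ignores a "retain" flag and clients then reject tokens signed with a key that is no longer published.

```python
from dataclasses import dataclass


@dataclass
class SigningKey:
    kid: str              # key identifier published in the metadata
    in_use: bool          # what the cleanup automation believes
    retain: bool = False  # manual override: keep the key even if it looks unused


def cleanup_buggy(keys):
    """Drop every key that looks unused, ignoring the retain flag (the reported bug)."""
    return [k for k in keys if k.in_use]


def cleanup_fixed(keys):
    """Drop unused keys, but keep any key explicitly marked as retain."""
    return [k for k in keys if k.in_use or k.retain]


def is_trusted(token_kid, published_keys):
    """A client trusts a token only if its signing key is still published."""
    return any(k.kid == token_kid for k in published_keys)


keys = [
    SigningKey("key-current", in_use=True),
    SigningKey("key-migration", in_use=False, retain=True),
]

# A token signed with the retained key, checked after each cleanup variant.
print(is_trusted("key-migration", cleanup_buggy(keys)))
print(is_trusted("key-migration", cleanup_fixed(keys)))
```

In this model, the buggy cleanup removes `key-migration` from the published set, so any token signed with it is rejected, while the fixed variant keeps the key and the token stays valid.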