Azure Managed Monitoring For AKS

Introduction

As part of the continuous improvement of our Kubernetes as a Service (K8aaS) offering, we are introducing Azure Managed Prometheus and Azure Managed Grafana as new features. These additions are designed to enhance the observability and monitoring capabilities of your Azure Kubernetes Service (AKS) workloads. This service includes a shared Prometheus and Grafana instance to monitor multiple AKS clusters in a centralized and efficient manner.

By using these as managed services, you benefit from automatic scaling, high availability, and reduced operational overhead. This allows teams to focus on gaining insights from metrics, alerts and dashboards without worrying about managing the underlying systems.

Context

This documentation outlines the use of Azure Monitor Workspace (Managed Prometheus) and Azure Managed Grafana. These tools allow you to gain deeper insights into the performance, health, and operational efficiency of your AKS clusters and the applications running within them.

Use cases

  • Monitor SLAs and support capacity planning
  • Post-mortem diagnosis
  • Real-time monitoring
  • Create dynamic and reusable dashboards, explore metrics and logs, specify data sources, and define alerts in Grafana.

Requirements

  • Minimum required Grafana role: Grafana Editor.
  • This feature requires more resources than the baseline. It has been successfully tested with the following system node pool configurations:

    VM size        | Agent count | Max pods
    standard_B2_ms | 4           | 30
    standard_B4_ms | 2           | 40

Getting Started

Azure Managed Prometheus and Azure Managed Grafana are deployed as shared components. These components can be shared by any k8saas instance under the same TDF account ID.

Access Management

Azure Managed Grafana Roles

Our Azure Managed Grafana setup comes with three distinct roles to support different user groups. Each role is tailored to provide the necessary permissions for various tasks, ensuring appropriate access control.

Grafana Role | Description | Key Permissions | Target Azure Group
Grafana Admin | Full administrative access to the Grafana instance. | Manage all settings, data sources, users, dashboards, and alerts. | Admins
Grafana Editor | Create and modify dashboards and alerts. | Create, update, and delete dashboards, configure visualizations and alerts. | DevOps Teams / Developers
Grafana Viewer | Read-only access to view dashboards and explore data. | View dashboards, panels, and explore data. No editing or administrative capabilities. | Other tenants that only need read access

This role structure ensures that each user group within our Azure setup has the right access level based on their role and responsibilities.

Create and Customize Dashboards in Grafana

  1. Log in to the Azure Portal: In the search bar at the top of the portal, type "Grafana" and select Azure Managed Grafana from the dropdown list.

  2. Select Your Managed Grafana Instance: If you have one or more Azure Managed Grafana instances, you'll see a list of them. Click on the instance you want to access.

  3. Open the Grafana Dashboard: Once inside your Grafana instance page, click on the Grafana endpoint link, which will take you to the Grafana dashboard.

Use Pre-Built Dashboards:

Grafana provides pre-built dashboards for Kubernetes and Prometheus metrics. You can import these dashboards from the dashboard library on Grafana.com or create custom dashboards based on your needs.

Note: The following automatically generated dashboards are available:

  • Kubernetes / Compute Resources / Cluster
  • Kubernetes / Compute Resources / Cluster (Windows)
  • Kubernetes / Compute Resources / Namespace (Windows)
  • Kubernetes / Compute Resources / Namespace (Pods)
  • Kubernetes / Compute Resources / Namespace (Workloads)
  • Kubernetes / Compute Resources / Node (Pods)
  • Kubernetes / Compute Resources / Pod
  • Kubernetes / Compute Resources / Pod (Windows)
  • Kubernetes / Compute Resources / Workload
  • Kubernetes / Kubelet
  • Kubernetes / Networking
  • Kubernetes / USE Method / Cluster (Windows)
  • Kubernetes / USE Method / Node (Windows)
  • Node Exporter / Nodes
  • Node Exporter / USE Method / Node

Import existing Dashboard

Access the Import dashboard option

  1. In the left-side navigation pane of Grafana, click on "Dashboards".

  2. In the top-right corner of the Dashboards page, click "New", then "Import".

  3. Provide the Dashboard Information: You have two options to import a dashboard.

    Option 1: Import by Dashboard ID or URL: Enter the Dashboard ID or URL of a shared Grafana dashboard from Grafana.com. Click "Load" to fetch the dashboard details.

    Option 2: Import from a JSON file: Click the "Upload JSON file" button if you have a .json file of the dashboard configuration. Select the JSON file from your computer and upload it.

  4. Configure Data Sources: After loading the dashboard, Grafana will ask you to configure data sources. Select the appropriate data source from the dropdown list (e.g., Azure Monitor, Managed_Prometheus_XXXx). Click "Import" to complete the process.

Create Custom Dashboards:

To create a custom dashboard:

  1. In the left-side navigation pane of Grafana, click on "Dashboards".
  2. In the top-right corner of the Dashboards page, click "New", then "New Dashboard".
  3. Click "+ Add visualization", and use the Azure Monitor or Prometheus data sources to query the metrics you want to visualize (e.g., CPU usage, memory usage, pod restarts, request latencies).
  4. Adjust the panel settings for visualization (graphs, tables, heatmaps, etc.).
  5. Save the Dashboard: Once your panels are configured, click Save Dashboard, and assign it a name. You can now view real-time metrics and historical data of your AKS cluster on your Grafana dashboard.

Setting Up Grafana Alerting

Step 1: Access the Azure Managed Grafana Instance and Configure SMTP Settings

In the Settings section of the instance page, select Configuration, then Email settings, and enable the SMTP settings.

Enter the following required SMTP details:

  • Host and port: The address and port of your SMTP server.
  • User: Your email address or the username for the SMTP server.
  • Password: The password or app-specific password for the SMTP user.
  • From Address: The email address that will appear as the sender.
  • From Name: The name that will be used when sending out emails.
  • Skip Verify: Enable or disable SSL verification based on your SMTP provider’s requirements.

Step 2: Save and Test the Configuration

After entering the SMTP details, save the configuration. Test the configuration by sending a test email from within Grafana.

  1. Go to Alerting.

  2. Create an alert rule and define your alert condition.

  3. Under Contact points, create or manage notification channels (e.g., email, Slack, webhooks).

  4. After defining the alert, choose the appropriate contact point for sending notifications, then trigger a test alert to confirm the SMTP configuration is working properly.

If needed, go to Alerting > Notification policies to manage escalation paths, grouping, or silencing notifications.

Manage Grafana resources as code (Terraform)

  1. Ensure that the API keys and Service account feature is enabled.

  2. Create a Grafana service account:
  • Log in to the Grafana dashboard.
  • Go to Administration > Users and access > Service accounts.
  • In the top-right section, click Add service account and create one with the Editor permission to manage dashboards, or the Admin permission to manage dashboards and alerts.

  • Create a token under the service account and save it
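
With the token saved, Terraform needs to be told which Grafana instance to talk to and how to authenticate. The following is a minimal sketch, assuming the public grafana/grafana provider; the variable names grafana_url and grafana_token are hypothetical and are not part of the original setup.

```hcl
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

# Hypothetical variable: the Grafana endpoint shown on the
# Azure Managed Grafana instance page.
variable "grafana_url" {
  description = "Endpoint URL of the Azure Managed Grafana instance"
  type        = string
}

# Hypothetical variable: the service account token created above.
# Marked sensitive so Terraform does not print it in plan output.
variable "grafana_token" {
  description = "Grafana service account token"
  type        = string
  sensitive   = true
}

provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_token
}
```

Passing the token through a variable (for example via the TF_VAR_grafana_token environment variable) keeps it out of the versioned .tf files.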

In this example, we will create:

  • Alert Group: A logical grouping of alert rules.

  • Contact Point: A destination for alert notifications (e.g., email, Slack, webhook). This determines where alerts are sent.

  • Notification Policy: A set of rules that define when and how alerts are routed to contact points. You can use labels to target specific alerts.

  • Alert Rule: A condition-based rule that triggers when certain thresholds or conditions are met within Grafana's data (in our example we are looking for any NotReady node in our cluster).

alerts.tf
# Contact point: where alert notifications are delivered.
resource "grafana_contact_point" "contact_point_0" {
  name = "demo-contact-point"

  email {
    addresses    = ["amine.nefzi.e@thalesdigital.io"]
    single_email = false
  }
}

# Notification policy: routes alerts labelled severity=critical to the contact point.
resource "grafana_notification_policy" "notification_policy_1" {
  contact_point = grafana_contact_point.contact_point_0.name
  group_by      = ["grafana_folder", "alertname"]

  policy {
    contact_point = grafana_contact_point.contact_point_0.name
    group_by      = ["grafana_folder", "alertname"]

    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
  }
}

# Folder that holds the alert rule group.
resource "grafana_folder" "rule_folder" {
  title = "My Alert Rule Folder"
}

resource "grafana_rule_group" "rule_group_0000" {
  org_id           = null
  name             = "test-group"
  folder_uid       = grafana_folder.rule_folder.uid
  interval_seconds = 300

  rule {
    name      = "test-rule"
    condition = "C"

    # A: Prometheus query. "== bool 0" yields 1 for a node whose Ready condition
    # is not true; without "bool" the sample keeps its value 0 and the "> 0"
    # threshold in C would never match.
    data {
      ref_id = "A"

      relative_time_range {
        from = 600
        to   = 0
      }

      datasource_uid = "k8saas-tdf-k8saas-prometheus"
      model          = "{\"datasource\":{\"type\":\"prometheus\",\"uid\":\"k8saas-tdf-k8saas-prometheus\"},\"editorMode\":\"code\",\"expr\":\"kube_node_status_condition{condition=\\\"Ready\\\", status=\\\"true\\\", cluster=\\\"k8saas-am-moni-sandbox\\\"} == bool 0\",\"instant\":true,\"intervalMs\":1000,\"legendFormat\":\"__auto\",\"maxDataPoints\":43200,\"range\":false,\"refId\":\"A\"}"
    }

    # B: reduce each series from A to its last value.
    data {
      ref_id = "B"

      relative_time_range {
        from = 600
        to   = 0
      }

      datasource_uid = "__expr__"
      model          = "{\"conditions\":[{\"evaluator\":{\"params\":[],\"type\":\"gt\"},\"operator\":{\"type\":\"and\"},\"query\":{\"params\":[\"B\"]},\"reducer\":{\"params\":[],\"type\":\"last\"},\"type\":\"query\"}],\"datasource\":{\"type\":\"__expr__\",\"uid\":\"__expr__\"},\"expression\":\"A\",\"intervalMs\":1000,\"maxDataPoints\":43200,\"reducer\":\"last\",\"refId\":\"B\",\"type\":\"reduce\"}"
    }

    # C: threshold on B; the rule fires when the reduced value is greater than 0.
    data {
      ref_id = "C"

      relative_time_range {
        from = 600
        to   = 0
      }

      datasource_uid = "__expr__"
      model          = "{\"conditions\":[{\"evaluator\":{\"params\":[0],\"type\":\"gt\"},\"operator\":{\"type\":\"and\"},\"query\":{\"params\":[\"C\"]},\"reducer\":{\"params\":[],\"type\":\"last\"},\"type\":\"query\"}],\"datasource\":{\"type\":\"__expr__\",\"uid\":\"__expr__\"},\"expression\":\"B\",\"intervalMs\":1000,\"maxDataPoints\":43200,\"refId\":\"C\",\"type\":\"threshold\"}"
    }

    no_data_state  = "NoData"
    exec_err_state = "Error"
    for            = "5m"

    annotations = {
      description = ""
      runbook_url = ""
      summary     = "Node {{ $labels.node }} is not ready in cluster k8saas-am-moni-sandbox"
    }

    labels = {
      severity = "critical"
    }

    is_paused = false
  }
}
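
The escaped JSON strings in the model arguments are hard to read and easy to break when edited by hand. As a sketch of one alternative, the query "A" model from alerts.tf can be built with Terraform's built-in jsonencode function. The local name node_not_ready_query is hypothetical. The expression uses the bool modifier so that the comparison returns 1 for a node that is not ready instead of keeping the sample value 0.

```hcl
locals {
  # Hypothetical local: the query "A" model, written as an HCL object
  # and serialized to a JSON string by jsonencode().
  node_not_ready_query = jsonencode({
    datasource = {
      type = "prometheus"
      uid  = "k8saas-tdf-k8saas-prometheus"
    }
    editorMode    = "code"
    expr          = "kube_node_status_condition{condition=\"Ready\", status=\"true\", cluster=\"k8saas-am-moni-sandbox\"} == bool 0"
    instant       = true
    intervalMs    = 1000
    legendFormat  = "__auto"
    maxDataPoints = 43200
    range         = false
    refId         = "A"
  })
}
```

Inside the data block with ref_id = "A", the model argument would then read model = local.node_not_ready_query. The key order in the generated JSON may differ from the hand-written string, which does not change its meaning.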
Note: Access to the GitLab project

  3. Apply and check if the resources are created as expected.
Note: You can create all Grafana resources from the UI and then export them in HCL format. Example: exporting an alert rule.
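
A similar approach can be used for dashboards: build the dashboard in the UI, export its JSON model, and let Terraform manage it. The following is a minimal sketch using the grafana_dashboard and grafana_folder resources of the grafana/grafana provider. The folder title, resource names, and file path are hypothetical.

```hcl
# Hypothetical folder for dashboards managed by Terraform.
resource "grafana_folder" "dashboards" {
  title = "My Dashboards"
}

resource "grafana_dashboard" "node_overview" {
  folder = grafana_folder.dashboards.uid

  # Hypothetical path: a dashboard JSON model exported from the Grafana UI
  # and stored next to the Terraform files.
  config_json = file("${path.module}/dashboards/node-overview.json")
}
```

Keeping the exported JSON in a separate file avoids escaping it inside the .tf file and makes dashboard changes easier to review.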

Note: The org_id is set to null because the organization is determined by the scope of the auth token.

Note: Key differences in Grafana 10 (the default version when choosing managed monitoring):

  • Unified Alerting: Grafana 10 uses a unified alerting system that lets you manage alert rules, notification policies, and contact points in one place, making it easier to scale alerting across multiple dashboards.
  • Contact Points and Notification Policies: In Grafana 10, notifications are managed via Contact Points and Notification Policies, giving more control over how alerts are grouped, escalated, or silenced.
  • Multi-condition Alerts: You can create more advanced alert conditions that evaluate multiple queries and use logical operators to trigger alerts.

Limitations

  • This feature is not yet fully mature; stability and maturity will be a key focus of the next release.
  • Not fully automated: dashboards and alerts are currently added manually, and some Prometheus rules must be created before public Grafana dashboards (e.g., NGINX ingress) can be used
  • No private access (no private endpoint)
  • The baseline infrastructure does not have enough resources to run the Prometheus exporter and OMS
  • Azure uses its own CRDs for service and pod monitors. Because of Azure Managed Prometheus billing rules, only a few COTS components should be configured to use the new CRDs:
    • nginx
    • flux
  • Grafana Enterprise is disabled
  • Installing, uninstalling and upgrading plugins from the Grafana Catalog isn't possible
  • All users must have accounts in Microsoft Entra ID. Third-party accounts aren't supported. As a workaround, use the default tenant of your Azure subscription with your Grafana instance and add other users as guests.
  • Azure Managed Grafana currently doesn't support the Grafana Role Based Access Control (RBAC) feature and the RBAC API is therefore disabled.

Conclusions

By following these steps, you can quickly get started with Azure Managed Prometheus and Azure Managed Grafana for monitoring AKS clusters and workloads. This setup allows you to visualize performance metrics, monitor logs, set up alerts, and gain actionable insights with minimal infrastructure overhead. These fully managed services provide a scalable and secure solution to meet your Kubernetes monitoring needs.