
KAAS referential

Rule source ./advisor-alerts/k8saas.advisor.capacity.yaml

Group Name: "k8saas.advisor.services"

Alert Name: "CertManagerDown"
  • Message: ``
  • Severity: AdvisorCritical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here

Rule source ./core-alerts/k8saas.alertmanager.yaml

Group Name: "alertmanager.rules"

Alert Name: "AlertmanagerFailedReload"
  • Message: Reloading an Alertmanager configuration has failed.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerMembersInconsistent"
  • Message: A member of an Alertmanager cluster has not found all other cluster members.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerFailedToSendAlerts"
  • Message: An Alertmanager instance failed to send notifications.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerClusterFailedToSendAlerts"
  • Message: All Alertmanager instances in a cluster failed to send notifications to a critical integration.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerClusterFailedToSendAlerts"
  • Message: All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerConfigInconsistent"
  • Message: Alertmanager instances within the same cluster have different configurations.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerClusterDown"
  • Message: Half or more of the Alertmanager instances within the same cluster are down.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "AlertmanagerClusterCrashlooping"
  • Message: Half or more of the Alertmanager instances within the same cluster are crashlooping.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
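
These rules follow the upstream kube-prometheus conventions. As an illustration only, a rule such as AlertmanagerFailedReload is typically built on the alertmanager_config_last_reload_successful metric, as in the sketch below; the exact expression, duration, and labels in ./core-alerts/k8saas.alertmanager.yaml may differ.

```yaml
groups:
  - name: alertmanager.rules
    rules:
      - alert: AlertmanagerFailedReload
        # The last reload attempt has not succeeded for the whole look-back window
        # (expression and job label are assumptions based on upstream kube-prometheus).
        expr: max_over_time(alertmanager_config_last_reload_successful{job="alertmanager"}[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Reloading an Alertmanager configuration has failed.
```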

Rule source ./core-alerts/k8saas.certmanager.yaml

Group Name: "k8saas.certmanager"

Alert Name: "CertMgrCertificateExpires2days"
  • Message: TLS certificate will expire in less than 2 days.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
Alert Name: "CertMgrCertificateExpires7days"
  • Message: TLS certificate will expire in less than 7 days.
  • Severity: warning
  • Is SLA Impacted: No
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
Alert Name: "CertMgrCertificateNotReadyStatus"
  • Message: TLS certificate has a not-ready status.
  • Severity: warning
  • Is SLA Impacted: No
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
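
For reference, a certificate-expiry rule of this kind is usually built on the cert-manager metric certmanager_certificate_expiration_timestamp_seconds. The sketch below illustrates the 2-day variant; the actual threshold and label selectors in ./core-alerts/k8saas.certmanager.yaml may differ.

```yaml
groups:
  - name: k8saas.certmanager
    rules:
      - alert: CertMgrCertificateExpires2days
        # Fires when a certificate tracked by cert-manager expires in under 2 days
        # (threshold and selector are illustrative assumptions).
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 2 * 24 * 3600
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: TLS certificate will expire in less than 2 days.
```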

Rule source ./core-alerts/k8saas.core-services.yaml

Group Name: "k8saas.core-services"

Alert Name: "ApiServerDown"
  • Message: Kubernetes API server has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: kubernetes,api
  • Business impact: Kubernetes API is not accessible
  • Reflex: Click here
Alert Name: "CertManagerDown"
  • Message: Cert-Manager has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: certmanager
  • Business impact: Certificate Renewal, Ingress Deployment
  • Reflex: Click here
Alert Name: "CoreDNSDown"
  • Message: CoreDNS has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: CoreDNS,kubernetes
  • Business impact: All network communications
  • Reflex: Click here
Alert Name: "KubeStateMetricsDown"
  • Message: ``
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: CoreDNS,kubernetes
  • Business impact: All network communications
  • Reflex: Click here
Alert Name: "KubeletDown"
  • Message: ``
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: kubernetes
  • Business impact: All operations
  • Reflex: Click here
Alert Name: "NodeExporterDown"
  • Message: ``
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: monitoring
  • Business impact: System metrics unavailable
  • Reflex: Click here
Alert Name: "GrafanaDown"
  • Message: Grafana has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: monitoring
  • Business impact: Monitoring dashboard is not accessible
  • Reflex: Click here
Alert Name: "KubePodCrashLooping"
  • Message: Pod is crash looping.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: aks,k8saas
  • Business impact: Could impact Ingress and client workloads
  • Reflex: Click here
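
Most of the "…Down" alerts in this group rely on the same pattern: the component's scrape target is absent from Prometheus target discovery. A hedged sketch for CertManagerDown is shown below; the job label depends on the ServiceMonitor configuration and is an assumption.

```yaml
groups:
  - name: k8saas.core-services
    rules:
      - alert: CertManagerDown
        # absent() returns a series when no matching "up" series exists,
        # i.e. the target has disappeared from service discovery.
        expr: absent(up{job="cert-manager"} == 1)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Cert-Manager has disappeared from Prometheus target discovery.
```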

Rule source ./core-alerts/k8saas.disks-DOA.yaml

Group Name: "k8saas.disks"

Alert Name: "KubePersistentVolumeFillingUp"
  • Message: PersistentVolume is filling up.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubePersistentVolumeFillingUp"
  • Message: PersistentVolume is filling up.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubePersistentVolumeErrors"
  • Message: PersistentVolume is having issues with provisioning.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here

Rule source ./core-alerts/k8saas.disks.yaml

Group Name: "k8saas.disks"

Alert Name: "KubePersistentVolumeFillingUp"
  • Message: PersistentVolume is filling up.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubePersistentVolumeFillingUp"
  • Message: PersistentVolume is filling up.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubePersistentVolumeErrors"
  • Message: PersistentVolume is having issues with provisioning.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
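
Both disk rule files expose the standard kube-prometheus volume-usage pattern; the two KubePersistentVolumeFillingUp entries usually correspond to a low free-space threshold and a predict_linear() projection. A sketch of the threshold variant follows (the 3% threshold and job label are assumptions).

```yaml
groups:
  - name: k8saas.disks
    rules:
      - alert: KubePersistentVolumeFillingUp
        # Less than 3% of the claimed volume capacity is still available.
        expr: |
          kubelet_volume_stats_available_bytes{job="kubelet"}
            / kubelet_volume_stats_capacity_bytes{job="kubelet"} < 0.03
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: PersistentVolume is filling up.
```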

Rule source ./core-alerts/k8saas.flux.yaml

Group Name: "k8saas.flux"

Alert Name: "FluxNotificationControllerDown"
  • Message: flux notification-controller has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: deployment
  • Business impact: No sync could be done between the k8saas Git repository and the cluster
  • Reflex: Click here
Alert Name: "FluxKustomizeControllerDown"
  • Message: flux kustomize has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: deployment
  • Business impact: No Kustomization could be applied
  • Reflex: Click here
Alert Name: "FluxHelmControllerDown"
  • Message: flux helm-controller has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: deployment
  • Business impact: No Helm release could be applied
  • Reflex: Click here
Alert Name: "FluxSourceControllerDown"
  • Message: flux source-controller has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: deployment
  • Business impact: No sync could be done between the k8saas Git repository and the cluster
  • Reflex: Click here
Alert Name: "FluxReconciliationFailure"
  • Message: {{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than ten minutes.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
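
The FluxReconciliationFailure message interpolates the kind, namespace, and name labels of the failing object. Flux controllers have historically exposed reconciliation state through the gotk_reconcile_condition metric, so a rule of this kind can be sketched as below; the metric name and expression are assumptions and may differ in recent Flux versions and in ./core-alerts/k8saas.flux.yaml.

```yaml
groups:
  - name: k8saas.flux
    rules:
      - alert: FluxReconciliationFailure
        # A Flux object whose Ready condition has stayed False for the whole window.
        expr: max(gotk_reconcile_condition{status="False", type="Ready"}) by (exported_namespace, name, kind) == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: >-
            {{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }}
            reconciliation has been failing for more than ten minutes.
```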

Rule source ./core-alerts/k8saas.gatekeeper.yaml

Group Name: "k8saas.gatekeeper"

Alert Name: "GatekeeperDown"
  • Message: Gatekeeper has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: Security
  • Business impact: Deployments cannot be done
  • Reflex: Click here
Alert Name: "GatekeeperAdmisionRejection"
  • Message: Gatekeeper admission webhook failed for unknown reason.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: Security
  • Business impact: Deployments cannot be done
  • Reflex: Click here

Rule source ./core-alerts/k8saas.kubernetes.yaml

Group Name: "k8saas.kubernetes"

Alert Name: "KubernetesNodeReady"
  • Message: Kubernetes Node ready (instance {{ $labels.instance }})
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeNodeNotReady"
  • Message: Node is not ready.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeNodeUnreachable"
  • Message: Node is unreachable.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletTooManyPodsOnCluster"
  • Message: Kubelet is running at capacity.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletTooManyPodsOnNode"
  • Message: Kubelet is running at capacity.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeNodeReadinessFlapping"
  • Message: Node readiness status is flapping.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletPlegDurationHigh"
  • Message: Kubelet Pod Lifecycle Event Generator is taking too long to relist.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletPodStartUpLatencyHigh"
  • Message: Kubelet Pod startup latency is too high.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletClientCertificateExpiration"
  • Message: Kubelet client certificate is about to expire.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletClientCertificateExpiration"
  • Message: Kubelet client certificate is about to expire.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletServerCertificateExpiration"
  • Message: Kubelet server certificate is about to expire.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletServerCertificateExpiration"
  • Message: Kubelet server certificate is about to expire.
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletClientCertificateRenewalErrors"
  • Message: Kubelet has failed to renew its client certificate.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "KubeletServerCertificateRenewalErrors"
  • Message: Kubelet has failed to renew its server certificate.
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
  • Reflex: Click here
Alert Name: "UnschedulableNode"
Alert Name: "HostHighCpuLoad5"
Alert Name: "KubernetesMemoryPressure"
  • Message: Kubernetes memory pressure (instance {{ $labels.instance }})
  • Severity: warning
  • Is SLA Impacted: Yes, some applications may be evicted
  • Concerned resources:
  • Business impact: Business applications may be unavailable
Alert Name: "KubernetesMemoryPressure30m"
  • Message: Kubernetes memory pressure (instance {{ $labels.instance }})
  • Severity: critical
  • Is SLA Impacted: Yes, some applications may be evicted
  • Concerned resources:
  • Business impact: Business applications may be unavailable
Alert Name: "KubernetesDiskPressure"
  • Message: Kubernetes disk pressure (instance {{ $labels.instance }})
  • Severity: critical
  • Is SLA Impacted: Yes, some applications may be evicted
  • Concerned resources:
  • Business impact: Business applications may be unavailable
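
Node-health alerts such as KubeNodeNotReady are conventionally driven by kube-state-metrics. As an illustration only (selector and duration are assumptions, not the exact content of ./core-alerts/k8saas.kubernetes.yaml):

```yaml
groups:
  - name: k8saas.kubernetes
    rules:
      - alert: KubeNodeNotReady
        # The node reports a Ready condition of "false" via kube-state-metrics.
        expr: kube_node_status_condition{job="kube-state-metrics", condition="Ready", status="true"} == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Node is not ready.
```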

Rule source ./core-alerts/k8saas.nginx.yaml

Group Name: "k8saas.nginx"

Alert Name: "NginxLatencyHigh"
  • Message: Nginx latency high (instance {{ $labels.instance }})
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
Alert Name: "NginxHighHttp5xxErrorRate"
  • Message: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
Alert Name: "TooManyRequest"
  • Message: Suspicious activity detected on the ingress controller
  • Severity: warning
  • Is SLA Impacted: No, suspicious activity
  • Concerned resources:
  • Business impact: An attacker may be scanning the cluster's endpoints
Alert Name: "NginxValidk8SaasCertificateExpires2days"
  • Message: TLS certificate will expire in less than 2 days on a supported domain.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
Alert Name: "NginxValidk8SaasCertificateExpires7days"
  • Message: TLS certificate will expire in less than 7 days.
  • Severity: critical
  • Is SLA Impacted: No
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
Alert Name: "NginxExternalIngressControllerDown"
  • Message: Public ingress controller has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: communication
  • Business impact: Business applications could not be reached
  • Reflex: Click here
Alert Name: "NginxInternalIngressControllerDown"
  • Message: Private ingress controller has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: communication
  • Business impact: Business applications could not be reached
  • Reflex: Click here
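
The ingress-nginx controller exposes per-request metrics that an error-rate rule can build on. A hedged sketch of NginxHighHttp5xxErrorRate is shown below; the 5% threshold and 5-minute window are assumptions.

```yaml
groups:
  - name: k8saas.nginx
    rules:
      - alert: NginxHighHttp5xxErrorRate
        # More than 5% of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum by (instance) (rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
            / sum by (instance) (rate(nginx_ingress_controller_requests[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
```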

Rule source ./fluentd-alerts/k8saas.fluent.yaml

Group Name: "k8saas.fluentbit"

Alert Name: "FluentBitServerDown"
  • Message: ``
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
Alert Name: "TooManyLogs"
  • Message: Suspicious activity detected in logs
  • Severity: critical
  • Is SLA Impacted: No, suspicious activity
  • Concerned resources: pod
  • Business impact: Cluster produces an abnormal volume of logs
Alert Name: "TooManyLogsV2"
  • Message: Suspicious activity detected in logs
  • Severity: critical
  • Is SLA Impacted: No, suspicious activity
  • Concerned resources: pod
  • Business impact: Cluster produces an abnormal volume of logs
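
The TooManyLogs alerts watch the log throughput reported by Fluent Bit itself. A sketch using the Fluent Bit output counter follows; the threshold and per-pod grouping are purely illustrative assumptions.

```yaml
groups:
  - name: k8saas.fluentbit
    rules:
      - alert: TooManyLogs
        # Sustained abnormal log volume: more than ~1000 records/s shipped per pod
        # (threshold chosen only for illustration).
        expr: sum by (pod) (rate(fluentbit_output_proc_records_total[5m])) > 1000
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Suspicious activity detected in logs
```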

Rule source ./linkerd-alerts/k8saas.linkerd.yaml

Group Name: "k8saas.linkerd"

Alert Name: "LinkerdControllerDown"
  • Message: ``
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
Alert Name: "LinkerdProxyDown"
  • Message: ``
  • Severity: critical
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
Alert Name: "LinkerdHighErrorRate"
  • Message: Linkerd high error rate (instance {{ $labels.instance }})
  • Severity: warning
  • Is SLA Impacted:
  • Concerned resources:
  • Business impact:
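
Linkerd proxies report request outcomes through the response_total counter with a classification label, which is the usual basis for an error-rate rule like LinkerdHighErrorRate. A hedged sketch (the 10% threshold and direction selector are assumptions):

```yaml
groups:
  - name: k8saas.linkerd
    rules:
      - alert: LinkerdHighErrorRate
        # More than 10% of meshed inbound responses classified as failures over 5 minutes.
        expr: |
          sum by (instance) (rate(response_total{classification="failure", direction="inbound"}[5m]))
            / sum by (instance) (rate(response_total{direction="inbound"}[5m])) > 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Linkerd high error rate (instance {{ $labels.instance }})
```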

Rule source ./pomerium-alerts/k8saas.pomerium.yaml

Group Name: "k8saas.pomerium"

Alert Name: "PomeriumCertificateExpires2days"
  • Message: TLS certificate will expire in less than 2 days.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
Alert Name: "PomeriumCertificateExpires7days"
  • Message: TLS certificate will expire in less than 7 days.
  • Severity: warning
  • Is SLA Impacted: No
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here
Alert Name: "PomeriumCertificateNotReadyStatus"
  • Message: TLS certificate has a not-ready status.
  • Severity: warning
  • Is SLA Impacted: No
  • Concerned resources: Security
  • Business impact: Security breach
  • Reflex: Click here

Rule source ./velero-alerts/k8saas.velero.yaml

Group Name: "k8saas.velero"

Alert Name: "VeleroControllerDown"
  • Message: Velero has disappeared from Prometheus target discovery.
  • Severity: critical
  • Is SLA Impacted: Yes
  • Concerned resources: backup
  • Business impact: Business applications could not be backed up or restored
  • Reflex: Click here
Alert Name: "VeleroBackupFailures"
  • Message: Velero backup partially failed
  • Severity: warning
  • Is SLA Impacted: Yes
  • Concerned resources: backup
  • Business impact: Business applications could not be restored
  • Reflex: Click here
Alert Name: "VeleroBackupPartialFailures"
  • Message: Velero backup partially failed
  • Severity: warning
  • Is SLA Impacted: No
  • Concerned resources: backup
  • Business impact: Some items could not be restored if needed
  • Reflex: Click here
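
Velero publishes backup counters that rules of this kind typically consume. A hedged sketch of VeleroBackupPartialFailures is shown below; the 24-hour window and zero threshold are assumptions, not the exact content of ./velero-alerts/k8saas.velero.yaml.

```yaml
groups:
  - name: k8saas.velero
    rules:
      - alert: VeleroBackupPartialFailures
        # Any increase of partially-failed backups over the last 24 hours.
        expr: increase(velero_backup_partial_failure_total[24h]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Velero backup partially failed
```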