KAAS alert reference
Rule source ./advisor-alerts/k8saas.advisor.capacity.yaml
Group Name: "k8saas.advisor.services"
Alert Name: "CertManagerDown"
- Message: ``
- Severity: AdvisorCritical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Rule source ./core-alerts/k8saas.alertmanager.yaml
Group Name: "alertmanager.rules"
Alert Name: "AlertmanagerFailedReload"
- Message:
Reloading an Alertmanager configuration has failed.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerMembersInconsistent"
- Message:
A member of an Alertmanager cluster has not found all other cluster members.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerFailedToSendAlerts"
- Message:
An Alertmanager instance failed to send notifications.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerClusterFailedToSendAlerts"
- Message:
All Alertmanager instances in a cluster failed to send notifications to a critical integration.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerClusterFailedToSendAlerts"
- Message:
All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerConfigInconsistent"
- Message:
Alertmanager instances within the same cluster have different configurations.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerClusterDown"
- Message:
Half or more of the Alertmanager instances within the same cluster are down.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "AlertmanagerClusterCrashlooping"
- Message:
Half or more of the Alertmanager instances within the same cluster are crashlooping.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
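For reference, rules such as `AlertmanagerFailedReload` usually take the shape below: a minimal sketch built on the standard `alertmanager_config_last_reload_successful` metric. The `job` label value and the durations are illustrative assumptions, not the exact contents of ./core-alerts/k8saas.alertmanager.yaml.

```yaml
groups:
  - name: alertmanager.rules
    rules:
      - alert: AlertmanagerFailedReload
        # Configuration reload has kept failing over the whole evaluation window.
        expr: max_over_time(alertmanager_config_last_reload_successful{job="alertmanager"}[5m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Reloading an Alertmanager configuration has failed.
```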
Rule source ./core-alerts/k8saas.certmanager.yaml
Group Name: "k8saas.certmanager"
Alert Name: "CertMgrCertificateExpires2days"
- Message:
TLS certificate will expire in less than 2 days.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Alert Name: "CertMgrCertificateExpires7days"
- Message:
TLS certificate will expire in less than 7 days.
- Severity: warning
- Is SLA Impacted: No
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Alert Name: "CertMgrCertificateNotReadyStatus"
- Message:
TLS certificate has a not-ready status
- Severity: warning
- Is SLA Impacted: No
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
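As an illustration of how the expiry alerts above are commonly written, here is a minimal sketch based on the standard cert-manager metric `certmanager_certificate_expiration_timestamp_seconds`; the threshold arithmetic mirrors the 2-day variant, and the `for:` duration is an assumption rather than the value used in ./core-alerts/k8saas.certmanager.yaml.

```yaml
groups:
  - name: k8saas.certmanager
    rules:
      - alert: CertMgrCertificateExpires2days
        # Remaining certificate lifetime (in seconds) drops below 2 days.
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 2 * 24 * 3600
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: TLS certificate will expire in less than 2 days.
```

The 7-day variant presumably only changes the threshold and the severity.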
Rule source ./core-alerts/k8saas.core-services.yaml
Group Name: "k8saas.core-services"
Alert Name: "ApiServerDown"
- Message:
Kubernetes API server has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: kubernetes,api
- Business impact: Kubernetes API is not accessible
- Reflex: Click here
Alert Name: "CertManagerDown"
- Message:
Cert-Manager has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: certmanager
- Business impact: Certificate Renewal, Ingress Deployment
- Reflex: Click here
Alert Name: "CoreDNSDown"
- Message:
CoreDNS has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: CoreDNS,kubernetes
- Business impact: ALL network communications
- Reflex: Click here
Alert Name: "KubeStateMetricsDown"
- Message: ``
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: CoreDNS,kubernetes
- Business impact: ALL network communications
- Reflex: Click here
Alert Name: "KubeletDown"
- Message: ``
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: kubernetes
- Business impact: ALL operations
- Reflex: Click here
Alert Name: "NodeExporterDown"
- Message: ``
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: monitoring
- Business impact: System metrics unavailable
- Reflex: Click here
Alert Name: "GrafanaDown"
- Message:
Grafana has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: monitoring
- Business impact: Monitoring dashboard is not accessible
- Reflex: Click here
Alert Name: "KubePodCrashLooping"
- Message:
Pod is crash looping.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: aks,k8saas
- Business impact: Could impact ingress and client workloads
- Reflex: Click here
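Most `*Down` alerts in this group follow the same "disappeared from Prometheus target discovery" pattern. A minimal sketch, assuming an `absent(up{...})` expression; the `job` label value is an assumption and may not match the scrape configuration used by k8saas.

```yaml
groups:
  - name: k8saas.core-services
    rules:
      - alert: CertManagerDown
        # No 'up' series for the cert-manager scrape job: the target has vanished.
        expr: absent(up{job="cert-manager"} == 1)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Cert-Manager has disappeared from Prometheus target discovery.
```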
Rule source ./core-alerts/k8saas.disks-DOA.yaml
Group Name: "k8saas.disks"
Alert Name: "KubePersistentVolumeFillingUp"
- Message:
PersistentVolume is filling up.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubePersistentVolumeFillingUp"
- Message:
PersistentVolume is filling up.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubePersistentVolumeErrors"
- Message:
PersistentVolume is having issues with provisioning.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Rule source ./core-alerts/k8saas.disks.yaml
Group Name: "k8saas.disks"
Alert Name: "KubePersistentVolumeFillingUp"
- Message:
PersistentVolume is filling up.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubePersistentVolumeFillingUp"
- Message:
PersistentVolume is filling up.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubePersistentVolumeErrors"
- Message:
PersistentVolume is having issues with provisioning.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
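The two `k8saas.disks` groups rely on kubelet volume statistics. A sketch of the filling-up alert, assuming the standard `kubelet_volume_stats_*` metrics; the 3% threshold mirrors the upstream kube-prometheus default and is not necessarily the value used here (the second warning variant typically adds a `predict_linear()` condition instead).

```yaml
groups:
  - name: k8saas.disks
    rules:
      - alert: KubePersistentVolumeFillingUp
        # Less than 3% of the claimed volume capacity is still available.
        expr: |
          kubelet_volume_stats_available_bytes{job="kubelet"}
            / kubelet_volume_stats_capacity_bytes{job="kubelet"} < 0.03
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: PersistentVolume is filling up.
```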
Rule source ./core-alerts/k8saas.flux.yaml
Group Name: "k8saas.flux"
Alert Name: "FluxNotificationControllerDown"
- Message:
Flux notification-controller has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: deployment
- Business impact: No sync can be performed between the k8saas Git repository and the cluster
- Reflex: Click here
Alert Name: "FluxKustomizeControllerDown"
- Message:
Flux kustomize-controller has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: deployment
- Business impact: No Kustomization can be applied
- Reflex: Click here
Alert Name: "FluxHelmControllerDown"
- Message:
Flux helm-controller has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: deployment
- Business impact: No Helm release can be applied
- Reflex: Click here
Alert Name: "FluxSourceControllerDown"
- Message:
Flux source-controller has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: deployment
- Business impact: No sync can be performed between the k8saas Git repository and the cluster
- Reflex: Click here
Alert Name: "FluxReconciliationFailure"
- Message:
{{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than ten minutes.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
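For the reconciliation alert above, a minimal sketch based on the `gotk_reconcile_condition` metric exposed by Flux; the aggregation matches the `exported_namespace`/`name`/`kind` labels used in the message template, but the exact expression in ./core-alerts/k8saas.flux.yaml may differ.

```yaml
groups:
  - name: k8saas.flux
    rules:
      - alert: FluxReconciliationFailure
        # A Flux object has reported Ready=False for more than ten minutes.
        expr: max(gotk_reconcile_condition{type="Ready", status="False"}) by (exported_namespace, name, kind) == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: >-
            {{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }}
            reconciliation has been failing for more than ten minutes.
```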
Rule source ./core-alerts/k8saas.gatekeeper.yaml
Group Name: "k8saas.gatekeeper"
Alert Name: "GatekeeperDown"
- Message:
Gatekeeper has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: Security
- Business impact: Deployments cannot be done
- Reflex: Click here
Alert Name: "GatekeeperAdmisionRejection"
- Message:
Gatekeeper admission webhook failed for unknown reason.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: Security
- Business impact: Deployments cannot be done
- Reflex: Click here
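One plausible shape for the admission-rejection alert above, assuming rejections are observed through the kube-apiserver counter `apiserver_admission_webhook_rejection_count`; the label matchers and threshold are assumptions, not the rule shipped in ./core-alerts/k8saas.gatekeeper.yaml.

```yaml
groups:
  - name: k8saas.gatekeeper
    rules:
      - alert: GatekeeperAdmisionRejection
        # Requests rejected because the Gatekeeper webhook call itself errored.
        expr: sum(rate(apiserver_admission_webhook_rejection_count{name=~".*gatekeeper.*", error_type="calling_webhook_error"}[5m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Gatekeeper admission webhook failed for unknown reason.
```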
Rule source ./core-alerts/k8saas.kubernetes.yaml
Group Name: "k8saas.kubernetes"
Alert Name: "KubernetesNodeReady"
- Message:
Kubernetes Node ready (instance {{ $labels.instance }})
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeNodeNotReady"
- Message:
Node is not ready.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeNodeUnreachable"
- Message:
Node is unreachable.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletTooManyPodsOnCluster"
- Message:
Kubelet is running at capacity.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletTooManyPodsOnNode"
- Message:
Kubelet is running at capacity.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeNodeReadinessFlapping"
- Message:
Node readiness status is flapping.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletPlegDurationHigh"
- Message:
Kubelet Pod Lifecycle Event Generator is taking too long to relist.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletPodStartUpLatencyHigh"
- Message:
Kubelet Pod startup latency is too high.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletClientCertificateExpiration"
- Message:
Kubelet client certificate is about to expire.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletClientCertificateExpiration"
- Message:
Kubelet client certificate is about to expire.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletServerCertificateExpiration"
- Message:
Kubelet server certificate is about to expire.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletServerCertificateExpiration"
- Message:
Kubelet server certificate is about to expire.
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletClientCertificateRenewalErrors"
- Message:
Kubelet has failed to renew its client certificate.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "KubeletServerCertificateRenewalErrors"
- Message:
Kubelet has failed to renew its server certificate.
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
- Reflex: Click here
Alert Name: "UnschedulableNode"
- Message:
Node has been unschedulable for 2 hours
- Severity: critical
- Is SLA Impacted: Yes, if the value is above 1
- Concerned resources: node
- Business impact: Business applications may be unavailable
- Reflex: Click here
Alert Name: "HostHighCpuLoad5"
- Message:
Host high CPU load (instance {{ $labels.instance }})
- Severity: warning
- Is SLA Impacted: Yes, if the value is above 95
- Concerned resources: node
- Business impact: Business applications may be unavailable
- Reflex: Click here
Alert Name: "KubernetesMemoryPressure"
- Message:
Kubernetes memory pressure (instance {{ $labels.instance }})
- Severity: warning
- Is SLA Impacted: Yes, some applications may be evicted
- Concerned resources:
- Business impact: Business applications may be unavailable
Alert Name: "KubernetesMemoryPressure30m"
- Message:
Kubernetes memory pressure (instance {{ $labels.instance }})
- Severity: critical
- Is SLA Impacted: Yes, some applications may be evicted
- Concerned resources:
- Business impact: Business applications may be unavailable
Alert Name: "KubernetesDiskPressure"
- Message:
Kubernetes disk pressure (instance {{ $labels.instance }})
- Severity: critical
- Is SLA Impacted: Yes, some applications may be evicted
- Concerned resources:
- Business impact: Business applications may be unavailable
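The node-pressure alerts above are typically built on the kube-state-metrics node condition gauge. A minimal sketch, assuming `kube_node_status_condition`; the `for:` duration is illustrative, and the 30m and disk-pressure variants presumably only change the duration, the `condition` matcher and the severity.

```yaml
groups:
  - name: k8saas.kubernetes
    rules:
      - alert: KubernetesMemoryPressure
        # The node reports the MemoryPressure condition as true.
        expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes memory pressure (instance {{ $labels.instance }})
```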
Rule source ./core-alerts/k8saas.nginx.yaml
Group Name: "k8saas.nginx"
Alert Name: "NginxLatencyHigh"
- Message:
Nginx latency high (instance {{ $labels.instance }})
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
Alert Name: "NginxHighHttp5xxErrorRate"
- Message:
Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
Alert Name: "TooManyRequest"
- Message:
Suspicious activity detected on the ingress controller
- Severity: warning
- Is SLA Impacted: No, suspicious activity
- Concerned resources:
- Business impact: An attacker may be scanning the cluster's endpoints
Alert Name: "NginxValidk8SaasCertificateExpires2days"
- Message:
TLS certificate will expire in less than 2 days on a supported domain.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Alert Name: "NginxValidk8SaasCertificateExpires7days"
- Message:
TLS certificate will expire in less than 7 days.
- Severity: critical
- Is SLA Impacted: No
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Alert Name: "NginxExternalIngressControllerDown"
- Message:
The public ingress controller has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: communication
- Business impact: Business applications could not be reached
- Reflex: Click here
Alert Name: "NginxInternalIngressControllerDown"
- Message:
The private ingress controller has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: communication
- Business impact: Business applications could not be reached
- Reflex: Click here
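For the 5xx alert above, a common formulation based on the ingress-nginx counter `nginx_ingress_controller_requests`; the 5% threshold and the 3-minute window are illustrative assumptions.

```yaml
groups:
  - name: k8saas.nginx
    rules:
      - alert: NginxHighHttp5xxErrorRate
        # More than 5% of requests handled by the ingress controller return a 5xx code.
        expr: |
          sum(rate(nginx_ingress_controller_requests{status=~"5.."}[3m]))
            / sum(rate(nginx_ingress_controller_requests[3m])) * 100 > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
```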
Rule source ./fluentd-alerts/k8saas.fluent.yaml
Group Name: "k8saas.fluentbit"
Alert Name: "FluentBitServerDown"
- Message: ``
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
Alert Name: "TooManyLogs"
- Message:
Suspicious activity detected in logs
- Severity: critical
- Is SLA Impacted: No, suspicious activity
- Concerned resources: pod
- Business impact: The cluster is producing an abnormal log volume
Alert Name: "TooManyLogsV2"
- Message:
Suspicious activity detected in logs
- Severity: critical
- Is SLA Impacted: No, suspicious activity
- Concerned resources: pod
- Business impact: The cluster is producing an abnormal log volume
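One plausible shape for the log-volume alerts above, assuming the standard Fluent Bit counter `fluentbit_output_proc_records_total`; the 5000 records/s ceiling is purely illustrative, and the real rules in ./fluentd-alerts/k8saas.fluent.yaml may key on a different metric or dimension.

```yaml
groups:
  - name: k8saas.fluentbit
    rules:
      - alert: TooManyLogs
        # Sustained log throughput above an illustrative ceiling.
        expr: sum(rate(fluentbit_output_proc_records_total[5m])) > 5000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Suspicious activity detected in logs
```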
Rule source ./linkerd-alerts/k8saas.linkerd.yaml
Group Name: "k8saas.linkerd"
Alert Name: "LinkerdControllerDown"
- Message: ``
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
Alert Name: "LinkerdProxyDown"
- Message: ``
- Severity: critical
- Is SLA Impacted:
- Concerned resources:
- Business impact:
Alert Name: "LinkerdHighErrorRate"
- Message:
Linkerd high error rate (instance {{ $labels.instance }})
- Severity: warning
- Is SLA Impacted:
- Concerned resources:
- Business impact:
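For the error-rate alert above, a sketch based on the Linkerd proxy counter `response_total` and its `classification` label; the 10% threshold is an assumption.

```yaml
groups:
  - name: k8saas.linkerd
    rules:
      - alert: LinkerdHighErrorRate
        # Share of responses the Linkerd proxies classify as failures.
        expr: |
          sum(rate(response_total{classification="failure"}[1m])) by (instance)
            / sum(rate(response_total[1m])) by (instance) * 100 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Linkerd high error rate (instance {{ $labels.instance }})
```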
Rule source ./pomerium-alerts/k8saas.pomerium.yaml
Group Name: "k8saas.pomerium"
Alert Name: "PomeriumCertificateExpires2days"
- Message:
TLS certificate will expire in less than 2 days.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Alert Name: "PomeriumCertificateExpires7days"
- Message:
TLS certificate will expire in less than 7 days.
- Severity: warning
- Is SLA Impacted: No
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Alert Name: "PomeriumCertificateNotReadyStatus"
- Message:
TLS certificate has a not-ready status
- Severity: warning
- Is SLA Impacted: No
- Concerned resources: Security
- Business impact: Security breach
- Reflex: Click here
Rule source ./velero-alerts/k8saas.velero.yaml
Group Name: "k8saas.velero"
Alert Name: "VeleroControllerDown"
- Message:
Velero has disappeared from Prometheus target discovery.
- Severity: critical
- Is SLA Impacted: Yes
- Concerned resources: backup
- Business impact: Business applications could not be backed up or restored
- Reflex: Click here
Alert Name: "VeleroBackupFailures"
- Message:
Velero backup partially failed
- Severity: warning
- Is SLA Impacted: Yes
- Concerned resources: backup
- Business impact: Business applications could not be restored
- Reflex: Click here
Alert Name: "VeleroBackupPartialFailures"
- Message:
Velero backup partially failed
- Severity: warning
- Is SLA Impacted: No
- Concerned resources: backup
- Business impact: Some items could not be restored if needed
- Reflex: Click here
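For the backup alerts above, a minimal sketch based on Velero's server metric `velero_backup_partial_failure_total`; the 24-hour window is an illustrative assumption.

```yaml
groups:
  - name: k8saas.velero
    rules:
      - alert: VeleroBackupPartialFailures
        # At least one partially failed backup over the last day.
        expr: increase(velero_backup_partial_failure_total[24h]) > 0
        labels:
          severity: warning
        annotations:
          summary: Velero backup partially failed
```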