When you first install the Kubernetes integration, we deploy a default set of recommended alert conditions to your account that forms the basis for alerting on your Kubernetes cluster. These conditions are grouped into a policy called Kubernetes alert policy.
While we've tried to address the most common use cases across all environments, you can set up a number of additional alerts to extend the default policy. These are our recommended alert policies.
Adding the recommended alert policy
To add a recommended alert policy, follow these steps:
Go to one.newrelic.com > Integrations & Agents.
Select Alerts to access the pre-built resources.
Search for Kubernetes and select the recommended alert policy you want to add.
How to see the recommended alert policy
To view the recommended alert policies you've added, follow these steps:
Go to one.newrelic.com > All capabilities > Alerts.
Click Alert Policies in the left navigation pane.
You'll see Kubernetes alert policy and Google Kubernetes engine alert policy.
Kubernetes alert policy
This is the default set of recommended alert conditions you'll add:
This alert condition generates an alert when a container is throttled by more than 25% for more than 5 minutes. It runs this query:
FROM K8sContainerSample SELECT sum(containerCpuCfsThrottledPeriodsDelta) / sum(containerCpuCfsPeriodsDelta) * 100 WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet containerName, podName, namespaceName, clusterName
See the GitHub configuration file for more info.
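If you want to preview how a condition behaves before you rely on it, you can run its query in the query builder with the placeholders replaced and a time window added. For example, here's a sketch of the CPU throttling query for a hypothetical cluster named my-cluster and a hypothetical namespace named default:
FROM K8sContainerSample SELECT sum(containerCpuCfsThrottledPeriodsDelta) / sum(containerCpuCfsPeriodsDelta) * 100 WHERE clusterName in ('my-cluster') and namespaceName in ('default') facet podName SINCE 1 hour ago TIMESERIES
The same placeholder substitution applies to every other condition in this policy.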
This alert condition generates an alert when the average container CPU usage against the limit exceeds 90% for over 5 minutes. It runs this query:
FROM K8sContainerSample SELECT average(cpuCoresUtilization) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet containerName, podName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when the average container memory usage against the limit exceeds 90% for over 5 minutes. It runs this query:
FROM K8sContainerSample SELECT average(memoryWorkingSetUtilization) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet containerName, podName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when container restarts exceed 0 in a 5-minute sliding window. It runs this query:
FROM K8sContainerSample SELECT sum(restartCountDelta) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet containerName, podName, namespaceName, clusterName
See the GitHub configuration file for more info.
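Because restartCountDelta records how much the restart count changed since the previous sample, summing it over a window gives the number of restarts in that window. As a rough sketch, assuming a hypothetical cluster named my-cluster, a query like this shows which namespaces accumulated restarts over the last day:
FROM K8sContainerSample SELECT sum(restartCountDelta) WHERE clusterName in ('my-cluster') FACET namespaceName SINCE 1 day ago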
This alert condition generates an alert when a container remains in a Waiting status for more than 5 minutes. It runs this query:
FROM K8sContainerSample SELECT uniqueCount(podName) WHERE status = 'Waiting' and clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') FACET containerName, podName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when the daemonset is missing any pods for a period longer than 5 minutes. It runs this query:
FROM K8sDaemonsetSample SELECT latest(podsMissing) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet daemonsetName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when the deployment is missing any pods for a period longer than 5 minutes. It runs this query:
FROM K8sDeploymentSample SELECT latest(podsMissing) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet deploymentName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when Etcd file descriptor usage exceeds 90% for over 5 minutes. It runs this query:
FROM K8sEtcdSample SELECT max(processFdsUtilization) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet displayName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when Etcd has no leader for more than 1 minute. It runs this query:
FROM K8sEtcdSample SELECT min(etcdServerHasLeader) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet displayName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when the current replicas of a horizontal pod autoscaler are lower than the desired replicas for more than 5 minutes. It runs this query:
FROM K8sHpaSample SELECT latest(desiredReplicas - currentReplicas) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet displayName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a horizontal pod autoscaler exceeds 5 replicas. It runs this query:
FROM K8sHpaSample SELECT latest(maxReplicas - currentReplicas) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet displayName, namespaceName, clusterName
See the GitHub configuration file for more info.
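Both horizontal pod autoscaler conditions compare replica counts, so it can help to look at the raw numbers side by side. A sketch, assuming a hypothetical cluster named my-cluster and a hypothetical namespace named default:
FROM K8sHpaSample SELECT latest(currentReplicas), latest(desiredReplicas), latest(maxReplicas) WHERE clusterName in ('my-cluster') and namespaceName in ('default') FACET displayName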
This alert condition generates an alert when a job reports a failed status. It runs this query:
FROM K8sJobSample SELECT uniqueCount(jobName) WHERE failed = 'true' and clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet jobName, namespaceName, clusterName, failedPodsReason
See the GitHub configuration file for more info.
This alert condition generates an alert when more than 5 pods in a namespace fail for more than 5 minutes. It runs this query:
FROM K8sPodSample SELECT uniqueCount(podName) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') and status = 'Failed' facet namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when the average node allocatable CPU utilization exceeds 90% for more than 5 minutes. It runs this query:
FROM K8sNodeSample SELECT average(allocatableCpuCoresUtilization) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet nodeName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when the average node allocatable memory utilization exceeds 90% for more than 5 minutes. It runs this query:
FROM K8sNodeSample SELECT average(allocatableMemoryUtilization) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet nodeName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a node is unavailable for 5 minutes. It runs this query:
FROM K8sNodeSample SELECT latest(condition.Ready) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet nodeName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a node is marked as unschedulable. It runs this query:
FROM K8sNodeSample SELECT latest(unschedulable) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet nodeName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a node's running pods exceed 90% of the node's pod capacity for more than 5 minutes. It runs this query:
FROM K8sPodSample, K8sNodeSample SELECT ceil(filter(uniqueCount(podName) WHERE status = 'Running') / latest(capacityPods) * 100) as 'Pod Capacity %' where nodeName != '' and nodeName is not null and clusterName in ('YOUR_CLUSTER_NAME') facet nodeName, clusterName
See the GitHub configuration file for more info.
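The query above combines two event types: it counts running pods from K8sPodSample with filter() and reads each node's pod capacity from K8sNodeSample. To see those two inputs separately rather than as a percentage, you could run something like the following sketch against a hypothetical cluster named my-cluster:
FROM K8sPodSample, K8sNodeSample SELECT filter(uniqueCount(podName) WHERE status = 'Running') as 'Running pods', latest(capacityPods) as 'Pod capacity' WHERE nodeName != '' and clusterName in ('my-cluster') facet nodeName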
This alert condition generates an alert when the average node root file system capacity utilization exceeds 90% for more than 5 minutes. It runs this query:
FROM K8sNodeSample SELECT average(fsCapacityUtilization) WHERE clusterName in ('YOUR_CLUSTER_NAME') facet nodeName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a persistent volume is in a failed or pending state for more than 5 minutes. It runs this query:
FROM K8sPersistentVolumeSample SELECT uniqueCount(volumeName) WHERE statusPhase in ('Failed','Pending') and clusterName in ('YOUR_CLUSTER_NAME') facet volumeName, clusterName
See the GitHub configuration file for more info.
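Before turning this condition on, it can be useful to see how your persistent volumes are currently distributed across phases. A sketch, again assuming a hypothetical cluster named my-cluster:
FROM K8sPersistentVolumeSample SELECT uniqueCount(volumeName) WHERE clusterName in ('my-cluster') FACET statusPhase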
This alert condition generates an alert when a pod is unable to be scheduled for more than 5 minutes. It runs this query:
FROM K8sPodSample SELECT latest(isScheduled) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet podName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a pod is unavailable for over 5 minutes. It runs this query:
FROM K8sPodSample SELECT latest(isReady) WHERE status not in ('Failed', 'Succeeded') where clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet podName, namespaceName, clusterName
See the GitHub configuration file for more info.
This alert condition generates an alert when a statefulset is missing pods for more than 5 minutes. It runs this query:
FROM K8sStatefulsetSample SELECT latest(podsMissing) WHERE clusterName in ('YOUR_CLUSTER_NAME') and namespaceName in ('YOUR_NAMESPACE_NAME') facet daemonsetName, namespaceName, clusterName
See the GitHub configuration file for more info.
Google Kubernetes engine alert policy
This is the default set of recommended Google Kubernetes engine alert conditions you'll add:
This alert condition generates an alert when a node's CPU utilization exceeds 90% for at least 15 minutes. It runs this query:
FROM Metric SELECT max(`gcp.kubernetes.node.cpu.allocatable_utilization`) * 100 WHERE clusterName LIKE '%' FACET gcp.kubernetes.nodeName
See the GitHub configuration file for more info.
This alert condition generates an alert when a node's memory usage exceeds 85% of its total capacity. It runs this query:
FROM Metric SELECT max(`gcp.kubernetes.node.memory.allocatable_utilization`) * 100 WHERE clusterName LIKE '%' FACET gcp.kubernetes.nodeName
See the GitHub configuration file for more info.
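If you want to adapt these Google Kubernetes engine conditions to other node signals, you can first check which gcp.kubernetes node metrics your account is actually receiving. A sketch of such a discovery query:
FROM Metric SELECT uniques(metricName) WHERE metricName LIKE 'gcp.kubernetes.node%' SINCE 1 hour ago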