If you're running version 2, check out these common Kubernetes integration errors. These errors show up in the standard non-verbose infrastructure agent logs. If you need more detailed logs, see Kubernetes logs.
The Kubernetes integration requires `kube-state-metrics`. If this is missing, you'll see an error like the following in the `newrelic-infra` container logs:

```bash
2018-04-11T08:02:41.765236022Z time="2018-04-11T08:02:41Z" level=error
msg="executing data source" data prefix="integration/com.newrelic.kubernetes"
error="exit status 1" plugin name=nri-kubernetes stderr="time=\"2018-04-11T08:02:41Z\"
level=fatal msg=\"failed to discover kube-state-metrics endpoint,
got error: no service found by label k8s-app=kube-state-metrics\"\n"
```
Common reasons for this error include:

- `kube-state-metrics` has not been deployed into the cluster.
- `kube-state-metrics` is deployed using a custom deployment.
- There are multiple versions of `kube-state-metrics` running, and the Kubernetes integration is not finding the correct one.
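To confirm whether `kube-state-metrics` is deployed at all, and where, you can list matching workloads across namespaces. This is just a manual check; the selectors shown are the common labels, and a custom deployment may use different ones:

```bash
# Look for kube-state-metrics workloads in every namespace.
kubectl get deployments --all-namespaces | grep kube-state-metrics
kubectl get pods --all-namespaces -l app=kube-state-metrics
```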
The Kubernetes integration automatically discovers `kube-state-metrics` in your cluster using this logic:

- It looks for a `kube-state-metrics` service running in the `kube-system` namespace.
- If that is not found, it looks for a service tagged with the label `k8s-app: kube-state-metrics`.
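You can reproduce that discovery logic by hand to see what the integration will find. These commands simply mirror the two lookups described above:

```bash
# Step 1: look for a kube-state-metrics service in the kube-system namespace.
kubectl get service kube-state-metrics -n kube-system

# Step 2: otherwise, look for any service carrying the k8s-app label.
kubectl get services --all-namespaces -l k8s-app=kube-state-metrics
```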
The integration also requires the `kube-state-metrics` pod to have the label `k8s-app: kube-state-metrics` or `app: kube-state-metrics`. If neither of those labels is found, there will be a log entry like the following:

```bash
2018-04-11T09:25:00.825532798Z time="2018-04-11T09:25:00Z" level=error
msg="executing data source" data prefix="integration/com.newrelic.kubernetes"
error="exit status 1" plugin name=nri-kubernetes stderr="time=\"2018-04-11T09:25:00Z\"
level=fatal msg=\"failed to discover nodeIP with kube-state-metrics,
got error: no pod found by label k8s-app=kube-state-metrics\"\n"
```
To solve this issue, add the `k8s-app=kube-state-metrics` label to the `kube-state-metrics` pod.
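A minimal sketch of adding that label with `kubectl`, assuming the pod currently carries the `app=kube-state-metrics` label and runs in `kube-system` (adjust the selector and namespace to match your deployment):

```bash
# Label the kube-state-metrics pod so the integration can discover it.
kubectl label pods -n kube-system -l app=kube-state-metrics k8s-app=kube-state-metrics
```

Because pods are recreated by their controller, the label survives restarts only if you also add it to the pod template in the `kube-state-metrics` Deployment spec.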
If metrics for Kubernetes nodes, pods, and containers are showing, but metrics for namespaces, deployments, and `ReplicaSets` are missing, the Kubernetes integration is not able to connect to `kube-state-metrics`.
Indicators of missing namespace, deployment, and `ReplicaSet` data:

- In the **# of K8s objects** chart, that data is missing.
- Queries for `K8sNamespaceSample`, `K8sDeploymentSample`, and `K8sReplicasetSample` don't show any data.
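As a quick check, a NRQL query against those event types should return nonzero counts when `kube-state-metrics` data is arriving. This is an illustrative query, not part of the integration:

```sql
-- Illustrative NRQL: counts should be above zero when KSM data is flowing.
FROM K8sNamespaceSample, K8sDeploymentSample, K8sReplicasetSample
SELECT count(*) SINCE 30 minutes ago
```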
There are a few possible reasons for this:

The `kube-state-metrics` service has been customized to listen on port 80. If you're using a different port, you may see an error like the following in the `verbose` logs:

```bash
time="2018-04-04T09:35:47Z" level=error msg="executing data source"
data prefix="integration/com.newrelic.kubernetes" error="exit status 1"
plugin name=nri-kubernetes stderr="time=\"2018-04-04T09:35:47Z\"
level=fatal msg=\"Non-recoverable error group: error querying KSM.
Get http://kube-state-metrics.kube-system.svc.cluster.local:0/metrics:
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\"\n"
```

This is a known problem that happens in some clusters where it takes too much time for `kube-state-metrics` to collect all the cluster's information before sending it to the integration. As a workaround, increase the `kube-state-metrics` client timeout.
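One way to raise that timeout is sketched below. It assumes the v2 `newrelic-infra` DaemonSet reads the integration's `TIMEOUT` environment variable in milliseconds; check your manifest or chart values for the exact knob your install exposes. The DaemonSet name, namespace, and value are placeholders:

```bash
# Assumption: the integration's KSM client timeout is driven by a TIMEOUT env var (ms).
kubectl set env daemonset/newrelic-infra -n NAMESPACE TIMEOUT=320000
```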
The `kube-state-metrics` instance is running behind `kube-rbac-proxy`. New Relic does not currently support this configuration. You may see an error like the following in the `verbose` logs:

```bash
time="2018-03-28T23:09:12Z" level=error msg="executing data source"
data prefix="integration/com.newrelic.kubernetes" error="exit status 1"
plugin name=nri-kubernetes stderr="time=\"2018-03-28T23:09:12Z\"
level=fatal msg=\"Non-recoverable error group: error querying KSM.
Get http://192.168.132.37:8443/metrics: net/http: HTTP/1.x
transport connection broken: malformed HTTP response \\\"\\\\x15\\\\x03\\\\x01\\\\x00\\\\x02\\\\x02\\\"\"\n"
```

The KSM payload is quite large, and the Kubernetes integration process handling the data is being OOM-killed. Since the integration is not the main process of the container, the pod is not restarted. This situation can be spotted in the logs of the `newrelic-infra` pod running on the same node as KSM:

```bash
time="2020-12-10T17:40:44Z" level=error msg="Integration command failed" error="signal: killed" instance=nri-kubernetes integration=com.newrelic.kubernetes
```

As a workaround, increase the DaemonSet memory limits so the process is not killed.
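As an illustration, you can raise the limit with `kubectl set resources`; the DaemonSet name, namespace, and the `400Mi` figure below are placeholders, so size the limit to your cluster:

```bash
# Raise the memory limit on the newrelic-infra DaemonSet (values are examples).
kubectl set resources daemonset/newrelic-infra -n NAMESPACE --limits=memory=400Mi
```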
The New Relic pods and the `newrelic` service account are not deployed in the same namespace. This is usually because the current context specifies a namespace. If this is the case, you will see an error like the following:

```bash
time="2018-05-31T10:55:39Z" level=panic msg="pods is forbidden: User \"system:serviceaccount:kube-system:newrelic\" cannot list pods at the cluster scope"
```

To check whether this is the case, run:

```bash
kubectl describe serviceaccount newrelic | grep Namespace
kubectl get pods -l name=newrelic-infra --all-namespaces
kubectl config get-contexts
```
To resolve this problem, change the namespace for the service account in the New Relic `DaemonSet` YAML file to be the same as the namespace for the current context:

```yaml
- kind: ServiceAccount
  name: newrelic
  namespace: default
---
```
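After editing the manifest, reapply it and confirm the service account now reports the expected namespace (the file name below is illustrative; use your own manifest):

```bash
# Reapply the edited manifest, then re-check the service account's namespace.
kubectl apply -f newrelic-infrastructure.yaml
kubectl describe serviceaccount newrelic | grep Namespace
```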
Not seeing control plane data
Execute the following command to manually find the control plane nodes:

```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=""
```

If the control plane nodes follow the labeling convention defined in the control plane component documentation, you should get some output like:

```bash
NAME                         STATUS   ROLES           AGE   VERSION
ip-10-42-24-4.ec2.internal   Ready    control-plane   42d   v1.14.8
```
If no nodes are found, there are two scenarios:

- Your control plane nodes don't have the required labels that identify them as control plane nodes. In this case, you need to add both labels to your control plane nodes.
- You're in a managed cluster and your provider handles the control plane nodes for you. In this case, there is nothing you can do, since your provider limits access to those nodes.
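As a sketch of the fix for the first scenario, assuming the two labels in question are `node-role.kubernetes.io/control-plane` and `node-role.kubernetes.io/master` (the conventional pair; confirm the exact labels in the control plane component documentation):

```bash
# Add the conventional control plane labels to a node (NODE_NAME is a placeholder).
kubectl label node NODE_NAME node-role.kubernetes.io/control-plane=""
kubectl label node NODE_NAME node-role.kubernetes.io/master=""
```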
To identify an integration pod running on a control plane node, replace `NODE_NAME` in the following command with one of the node names listed in the previous step:

```bash
kubectl get pods --field-selector spec.nodeName=NODE_NAME -l name=newrelic-infra --all-namespaces
```
The next command is the same as the previous one, except that it selects the node for you:

```bash
kubectl get pods --field-selector spec.nodeName=$(kubectl get nodes -l node-role.kubernetes.io/control-plane="" -o jsonpath="{.items[0].metadata.name}") -l name=newrelic-infra --all-namespaces
```
If everything is correct, you should get some output like:

```bash
NAME                   READY   STATUS    RESTARTS   AGE
newrelic-infra-whvzt   1/1     Running   0          6d20h
```
If the integration is not running on your control plane nodes, check that the DaemonSet has all the desired instances running and ready:

```bash
kubectl get daemonsets -l app=newrelic-infra --all-namespaces
```
Refer to the discovery of control plane nodes and components documentation section and look for the labels the integration uses to discover the components. Then run the following commands to see if there are any pods with those labels and the nodes where they are running:

```bash
kubectl get pods -l k8s-app=kube-apiserver --all-namespaces
```

If there is a component with the given label, you should see something like:

```bash
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   kube-apiserver-ip-10-42-24-42.ec2.internal   1/1     Running   3          49d
```
The same should be done with the rest of the components:

```bash
kubectl get pods -l k8s-app=etcd-manager-main --all-namespaces
kubectl get pods -l k8s-app=kube-scheduler --all-namespaces
kubectl get pods -l k8s-app=kube-kube-controller-manager --all-namespaces
```
To retrieve the logs, follow the instructions on get logs from pod running on a control plane node. For every component, the integration logs the message `Running job: COMPONENT_NAME`. For example:

```bash
Running job: scheduler
Running job: etcd
Running job: controller-manager
Running job: api-server
```
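Once you have the integration pod's name, a quick way to pull just those lines (`POD_NAME` is a placeholder):

```bash
# Filter the integration logs for the per-component job markers.
kubectl logs POD_NAME | grep "Running job"
```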
If you didn't specify the `ETCD_TLS_SECRET_NAME` configuration option, you'll find this message in the logs:

```bash
Skipping job creation for component etcd: etcd requires TLS configuration, none given
```
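A sketch of supplying that configuration: create a secret holding the etcd mTLS client material and point `ETCD_TLS_SECRET_NAME` at it. The secret name and file names below are assumptions, as is setting the option through an environment variable; check the integration's etcd documentation for the exact contract:

```bash
# Assumption: the secret carries cert, key, and cacert entries for the etcd client.
kubectl create secret generic newrelic-etcd-tls-secret \
  --from-file=./cert --from-file=./key --from-file=./cacert

# Assumption: the DaemonSet reads the option as an env var.
kubectl set env daemonset/newrelic-infra ETCD_TLS_SECRET_NAME=newrelic-etcd-tls-secret
```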
If any error occurs while querying the metrics of any component, it'll be logged after the `Running job` message.
To get the endpoint of the control plane component you want to query, see the control plane component documentation. With the endpoint, you can use the integration pod running on the same node as the component to query it. See these examples of how to query the Kubernetes scheduler:

```bash
kubectl exec -ti POD_NAME -- wget -O - localhost:10251/metrics
```
The following command does the same as the previous one, but also chooses the pod for you:

```bash
kubectl exec -ti $(kubectl get pods --all-namespaces --field-selector spec.nodeName=$(kubectl get nodes -l node-role.kubernetes.io/control-plane="" -o jsonpath="{.items[0].metadata.name}") -l name=newrelic-infra -o jsonpath="{.items[0].metadata.name}") -- wget -O - localhost:10251/metrics
```
If everything is correct, you should get some metrics in Prometheus format, something like this:

```bash
Connecting to localhost:10251 (127.0.0.1:10251)
# HELP apiserver_audit_event_total Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
```