If you're running version 2, check out these common Kubernetes integration errors. These errors show up in the standard non-verbose infrastructure agent logs. If you need more detailed logs, see Kubernetes logs.
The Kubernetes integration requires `kube-state-metrics`. If this is missing, you'll see an error like the following in the `newrelic-infra` container logs:

```bash
2018-04-11T08:02:41.765236022Z time="2018-04-11T08:02:41Z" level=error
msg="executing data source" data prefix="integration/com.newrelic.kubernetes"
error="exit status 1" plugin name=nri-kubernetes stderr="time=\"2018-04-11T08:02:41Z\"
level=fatal msg=\"failed to discover kube-state-metrics endpoint,
got error: no service found by label k8s-app=kube-state-metrics\"\n"
```
Common reasons for this error include:

- `kube-state-metrics` has not been deployed into the cluster.
- `kube-state-metrics` is deployed using a custom deployment.
- There are multiple versions of `kube-state-metrics` running, and the Kubernetes integration is not finding the correct one.
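To confirm whether `kube-state-metrics` is deployed at all, and where, you can list matching workloads across namespaces. This is just a manual check; the selectors shown are the common labels, and a custom deployment may use different ones:

```bash
# Look for kube-state-metrics workloads in every namespace.
kubectl get deployments --all-namespaces | grep kube-state-metrics
kubectl get pods --all-namespaces -l app=kube-state-metrics
```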
The Kubernetes integration automatically discovers `kube-state-metrics` in your cluster using this logic:

- It looks for a `kube-state-metrics` service running in the `kube-system` namespace.
- If that is not found, it looks for a service tagged with the label `k8s-app: kube-state-metrics`.
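You can reproduce that discovery logic by hand to see what the integration will find. These commands simply mirror the two lookups described above:

```bash
# Step 1: look for a kube-state-metrics service in the kube-system namespace.
kubectl get service kube-state-metrics -n kube-system

# Step 2: otherwise, look for any service carrying the k8s-app label.
kubectl get services --all-namespaces -l k8s-app=kube-state-metrics
```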
The integration also requires the `kube-state-metrics` pod to have the label `k8s-app: kube-state-metrics` or `app: kube-state-metrics`. If neither of those labels is found, there will be a log entry like the following:

```bash
2018-04-11T09:25:00.825532798Z time="2018-04-11T09:25:00Z" level=error
msg="executing data source" data prefix="integration/com.newrelic.kubernetes"
error="exit status 1" plugin name=nri-kubernetes stderr="time=\"2018-04-11T09:25:00Z\"
level=fatal msg=\"failed to discover nodeIP with kube-state-metrics,
got error: no pod found by label k8s-app=kube-state-metrics\"\n"
```
To solve this issue, add the `k8s-app=kube-state-metrics` label to the `kube-state-metrics` pod.
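A minimal sketch of adding that label with `kubectl`, assuming the pod currently carries the `app=kube-state-metrics` label and runs in `kube-system` (adjust the selector and namespace to match your deployment):

```bash
# Label the kube-state-metrics pod so the integration can discover it.
kubectl label pods -n kube-system -l app=kube-state-metrics k8s-app=kube-state-metrics
```

Because pods are recreated by their controller, the label survives restarts only if you also add it to the pod template in the `kube-state-metrics` Deployment spec.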
If metrics for Kubernetes nodes, pods, and containers are showing, but metrics for namespaces, deployments, and `ReplicaSets` are missing, the Kubernetes integration is not able to connect to `kube-state-metrics`.
Indicators of missing namespace, deployment, and `ReplicaSet` data:

- In the **# of K8s objects** chart, that data is missing.
- Queries for `K8sNamespaceSample`, `K8sDeploymentSample`, and `K8sReplicasetSample` don't show any data.
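As a quick check, a NRQL query against those event types should return nonzero counts when `kube-state-metrics` data is arriving. This is an illustrative query, not part of the integration:

```sql
-- Illustrative NRQL: counts should be above zero when KSM data is flowing.
FROM K8sNamespaceSample, K8sDeploymentSample, K8sReplicasetSample
SELECT count(*) SINCE 30 minutes ago
```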
There are a few possible reasons for this:

The `kube-state-metrics` service has been customized to listen on port 80. If you're using a different port, you may see an error like the following in the `verbose` logs:

```bash
time="2018-04-04T09:35:47Z" level=error msg="executing data source"
data prefix="integration/com.newrelic.kubernetes" error="exit status 1"
plugin name=nri-kubernetes stderr="time=\"2018-04-04T09:35:47Z\"
level=fatal msg=\"Non-recoverable error group: error querying KSM.
Get http://kube-state-metrics.kube-system.svc.cluster.local:0/metrics:
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\"\n"
```

This is a known problem that happens in some clusters where it takes too much time for `kube-state-metrics` to collect all the cluster's information before sending it to the integration. As a workaround, increase the `kube-state-metrics` client timeout.
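One way to raise that timeout is sketched below. It assumes the v2 `newrelic-infra` DaemonSet reads the integration's `TIMEOUT` environment variable in milliseconds; check your manifest or chart values for the exact knob your install exposes. The DaemonSet name, namespace, and value are placeholders:

```bash
# Assumption: the integration's KSM client timeout is driven by a TIMEOUT env var (ms).
kubectl set env daemonset/newrelic-infra -n NAMESPACE TIMEOUT=320000
```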
The `kube-state-metrics` instance is running behind `kube-rbac-proxy`. New Relic does not currently support this configuration. You may see an error like the following in the `verbose` logs:

```bash
time="2018-03-28T23:09:12Z" level=error msg="executing data source"
data prefix="integration/com.newrelic.kubernetes" error="exit status 1"
plugin name=nri-kubernetes stderr="time=\"2018-03-28T23:09:12Z\"
level=fatal msg=\"Non-recoverable error group: error querying KSM.
Get http://192.168.132.37:8443/metrics: net/http: HTTP/1.x
transport connection broken: malformed HTTP response \\\"\\\\x15\\\\x03\\\\x01\\\\x00\\\\x02\\\\x02\\\"\"\n"
```

The KSM payload is quite large, and the Kubernetes integration process handling the data is being OOM-killed. Since the integration is not the main process of the container, the pod is not restarted. This situation can be spotted in the logs of the `newrelic-infra` pod running on the same node as KSM:

```bash
time="2020-12-10T17:40:44Z" level=error msg="Integration command failed" error="signal: killed" instance=nri-kubernetes integration=com.newrelic.kubernetes
```

As a workaround, increase the DaemonSet memory limits so the process is not killed.
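As an illustration, you can raise the limit with `kubectl set resources`; the DaemonSet name, namespace, and the `400Mi` figure below are placeholders, so size the limit to your cluster:

```bash
# Raise the memory limit on the newrelic-infra DaemonSet (values are examples).
kubectl set resources daemonset/newrelic-infra -n NAMESPACE --limits=memory=400Mi
```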
The New Relic pods and the `newrelic` service account are not deployed in the same namespace. This is usually because the current context specifies a namespace. If this is the case, you will see an error like the following:

```bash
time="2018-05-31T10:55:39Z" level=panic msg="pods is forbidden: User \"system:serviceaccount:kube-system:newrelic\" cannot list pods at the cluster scope"
```

To check whether this is the case, run:

```bash
kubectl describe serviceaccount newrelic | grep Namespace
kubectl get pods -l name=newrelic-infra --all-namespaces
kubectl config get-contexts
```
To resolve this problem, change the namespace for the service account in the New Relic `DaemonSet` YAML file to be the same as the namespace for the current context:

```yaml
- kind: ServiceAccount
  name: newrelic
  namespace: default
---
```
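After editing the manifest, reapply it and confirm the service account now reports the expected namespace (the file name below is illustrative; use your own manifest):

```bash
# Reapply the edited manifest, then re-check the service account's namespace.
kubectl apply -f newrelic-infrastructure.yaml
kubectl describe serviceaccount newrelic | grep Namespace
```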
Not seeing control plane data
Execute the following command to manually find the control plane nodes:

```bash
kubectl get nodes -l node-role.kubernetes.io/control-plane=""
```

If the control plane nodes follow the labeling convention defined in the control plane component documentation, you should get some output like:

```bash
NAME                         STATUS   ROLES           AGE   VERSION
ip-10-42-24-4.ec2.internal   Ready    control-plane   42d   v1.14.8
```
If no nodes are found, there are two scenarios:

- Your control plane nodes don't have the required labels that identify them as control plane nodes. In this case, you need to add both labels to your control plane nodes.
- You're in a managed cluster and your provider handles the control plane nodes for you. In this case, there is nothing you can do, since your provider limits access to those nodes.
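As a sketch of the fix for the first scenario, assuming the two labels in question are `node-role.kubernetes.io/control-plane` and `node-role.kubernetes.io/master` (the conventional pair; confirm the exact labels in the control plane component documentation):

```bash
# Add the conventional control plane labels to a node (NODE_NAME is a placeholder).
kubectl label node NODE_NAME node-role.kubernetes.io/control-plane=""
kubectl label node NODE_NAME node-role.kubernetes.io/master=""
```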
To identify an integration pod running on a control plane node, replace `NODE_NAME` in the following command with one of the node names listed in the previous step:

```bash
kubectl get pods --field-selector spec.nodeName=NODE_NAME -l name=newrelic-infra --all-namespaces
```
The next command is the same as the previous one, except that it selects the node for you:

```bash
kubectl get pods --field-selector spec.nodeName=$(kubectl get nodes -l node-role.kubernetes.io/control-plane="" -o jsonpath="{.items[0].metadata.name}") -l name=newrelic-infra --all-namespaces
```
If everything is correct, you should get some output like:

```bash
NAME                   READY   STATUS    RESTARTS   AGE
newrelic-infra-whvzt   1/1     Running   0          6d20h
```
If the integration is not running on your control plane nodes, check that the DaemonSet has all the desired instances running and ready:

```bash
kubectl get daemonsets -l app=newrelic-infra --all-namespaces
```
Refer to the discovery of control plane nodes and components documentation section and look for the labels the integration uses to discover the components. Then run the following commands to see if there are any pods with those labels and the nodes where they are running:

```bash
kubectl get pods -l k8s-app=kube-apiserver --all-namespaces
```

If there is a component with the given label, you should see something like:

```bash
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   kube-apiserver-ip-10-42-24-42.ec2.internal   1/1     Running   3          49d
```
The same should be done with the rest of the components:

```bash
kubectl get pods -l k8s-app=etcd-manager-main --all-namespaces
kubectl get pods -l k8s-app=kube-scheduler --all-namespaces
kubectl get pods -l k8s-app=kube-kube-controller-manager --all-namespaces
```
To retrieve the logs, follow the instructions on get logs from pod running on a control plane node. For every component, the integration logs the message `Running job: COMPONENT_NAME`. For example:

```bash
Running job: scheduler
Running job: etcd
Running job: controller-manager
Running job: api-server
```
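Once you have the integration pod's name, a quick way to pull just those lines (`POD_NAME` is a placeholder):

```bash
# Filter the integration logs for the per-component job markers.
kubectl logs POD_NAME | grep "Running job"
```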
If you didn't specify the `ETCD_TLS_SECRET_NAME` configuration option, you'll find this message in the logs:

```bash
Skipping job creation for component etcd: etcd requires TLS configuration, none given
```
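A sketch of supplying that configuration: create a secret holding the etcd mTLS client material and point `ETCD_TLS_SECRET_NAME` at it. The secret name and file names below are assumptions, as is setting the option through an environment variable; check the integration's etcd documentation for the exact contract:

```bash
# Assumption: the secret carries cert, key, and cacert entries for the etcd client.
kubectl create secret generic newrelic-etcd-tls-secret \
  --from-file=./cert --from-file=./key --from-file=./cacert

# Assumption: the DaemonSet reads the option as an env var.
kubectl set env daemonset/newrelic-infra ETCD_TLS_SECRET_NAME=newrelic-etcd-tls-secret
```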
If any error occurs while querying the metrics of any component, it'll be logged after the `Running job` message.
To get the endpoint of the control plane component you want to query, see the control plane component documentation. With the endpoint, you can use the integration pod running on the same node as the component to query it. See these examples of how to query the Kubernetes scheduler:

```bash
kubectl exec -ti POD_NAME -- wget -O - localhost:10251/metrics
```
The following command does the same as the previous one, but also chooses the pod for you:

```bash
kubectl exec -ti $(kubectl get pods --all-namespaces --field-selector spec.nodeName=$(kubectl get nodes -l node-role.kubernetes.io/control-plane="" -o jsonpath="{.items[0].metadata.name}") -l name=newrelic-infra -o jsonpath="{.items[0].metadata.name}") -- wget -O - localhost:10251/metrics
```
If everything is correct, you should get some metrics in Prometheus format, something like this:

```bash
Connecting to localhost:10251 (127.0.0.1:10251)
# HELP apiserver_audit_event_total Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP apiserver_client_certificate_expiration_seconds Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3600"} 0
```