Troubleshoot common issues

Important

Agent Control and New Relic Control are now generally available for Kubernetes! Support for Linux hosts is also available in a public preview program, pursuant to our pre-release policies.

This document covers the steps to troubleshoot common issues when installing or operating Agent Control. It is organized by environment.

Kubernetes troubleshooting

Enable debug logging

To diagnose errors during the installation process, you can increase the log level for Agent Control by adding the following setting in your values-newrelic.yaml file:

agentControlDeployment:
  chartValues:
    config:
      log:
        level: trace

Default log level: info.
Other supported log levels: debug and trace.
OTel collector logs: To enable debug logs in the OpenTelemetry collector, add verboseLog: true.

To inspect the Agent Control logs, run the following command, replacing agent-control-*** with the name of your Agent Control pod:

bash

$# Find the Agent Control pod name
$kubectl get pods -n newrelic-agent-control
$
$# Inspect the logs, replacing `agent-control-***` with your pod's name
$kubectl logs agent-control-*** -n newrelic-agent-control
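
If you inspect logs often, the two commands above can be chained so that you don't have to copy the pod name by hand. The following is a minimal sketch rather than an official tool: the function names `ac_pod_names` and `ac_tail_logs` are hypothetical, and the `agent-control` name prefix is an assumption based on the example above. The prefix can also match other pods in the namespace, so review the list before relying on it.

```shell
# Hypothetical helper: print the names of the pods in a namespace whose names
# start with a given prefix. Assumes kubectl is configured for your cluster.
ac_pod_names() (
  namespace="$1"
  prefix="$2"
  # `-o name` prints one `pod/<name>` entry per line; strip the `pod/` part.
  kubectl get pods -n "$namespace" -o name | sed 's|^pod/||' | grep "^$prefix"
)

# Hypothetical helper: print the last 50 log lines of every matching pod.
ac_tail_logs() (
  namespace="${1:-newrelic-agent-control}"
  prefix="${2:-agent-control}"
  for pod in $(ac_pod_names "$namespace" "$prefix"); do
    echo "== $pod =="
    kubectl logs "$pod" -n "$namespace" --tail=50
  done
)
```

Calling `ac_tail_logs` with no arguments uses the namespace and prefix from the example above; pass a different namespace and prefix as the first and second arguments if your installation differs.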

Status endpoint

Agent Control exposes a local status endpoint you can use to check the health of Agent Control and its managed agents. This endpoint is enabled by default on port 51200. Follow these steps to query the cluster status:

Forward a local port to the main agent-control pod:

bash

$kubectl port-forward <pod-name> 51200:51200

Request the agent status:

bash

$curl localhost:51200/status

Helm release failure

When the agent-control-bootstrap chart is installed, a job is launched to install all the resources and charts. The installation may fail with a BackoffLimitExceeded error:

bash

Error: UPGRADE FAILED: pre-upgrade hooks failed: job failed: BackoffLimitExceeded

To debug installation errors, check the installation job logs:

bash

$kubectl logs agent-control-bootstrap-install-job-**** -n newrelic-agent-control

Agent Control requires a valid authentication credential to securely connect to Fleet Control. Initially, this credential is automatically generated through the Agent Control installation UI and is represented by the identityClientId and identityClientSecret fields in the values file. For security reasons, the credential necessary for installing Agent Control expires after 12 hours.

If the installation fails with a BackoffLimitExceeded error, it often indicates an expired or invalid credential.

Check the logs of the Kubernetes job responsible for setting up the Agent Control system identity.

First, identify the job’s pods:

bash

$kubectl describe job agent-control-generate-system-identity -n <your-namespace>

In the Events section, look for entries for the specific pods, as follows:

bash

Events:
      Type     Reason                Age   From            Message
      ----     ------                ----  ----            -------
      Normal   SuccessfulCreate      88s   job-controller  Created pod: agent-control-generate-system-identity-jr6cg
      Normal   SuccessfulCreate      73s   job-controller  Created pod: agent-control-generate-system-identity-wnx2v
      Normal   SuccessfulCreate      50s   job-controller  Created pod: agent-control-generate-system-identity-8zsqd
      Normal   SuccessfulCreate      7s    job-controller  Created pod: agent-control-generate-system-identity-btqh7
      Warning  BackoffLimitExceeded  1s    job-controller  Job has reached the specified backoff limit

View the logs of the failing pods:

bash

$kubectl logs <pod-name> -n <your-namespace>

Example:

bash

$kubectl logs agent-control-generate-system-identity-btqh7 -n newrelic-agent-control
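
To avoid copying pod names out of the Events section by hand, the lookup can be scripted. This is a minimal sketch under assumptions: the function names are hypothetical, and it relies on the `Created pod:` wording shown in the sample events above, with the most recent pod listed last.

```shell
# Hypothetical helper: read `kubectl describe job` output on stdin and print
# the pod name from each "Created pod:" event, in the order listed.
job_pod_names() {
  sed -n 's/.*Created pod: *//p'
}

# Hypothetical helper: print the logs of the last pod the job created.
latest_job_pod_logs() (
  job="$1"
  namespace="$2"
  pod="$(kubectl describe job "$job" -n "$namespace" | job_pod_names | tail -n 1)"
  if [ -z "$pod" ]; then
    echo "no pods found for job $job" >&2
    exit 1
  fi
  kubectl logs "$pod" -n "$namespace"
)
```

For example, `latest_job_pod_logs agent-control-generate-system-identity newrelic-agent-control` would print the logs of the last pod created for that job.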

After reviewing the logs, retry the installation using Helm while watching for specific error messages and checking the logs for potential problems. Below are some known issues and how to interpret them:

Invalid identityClientId: Error getting system identity auth token. The API endpoint returned 404: Failed to find Identity: <identityClientId-value>
Invalid identityClientSecret: Error getting system identity auth token. The API endpoint returned 400: Bad client secret.
Identity expired: Error getting system identity auth token. The API endpoint returned 400: Expired client secret.
Missing required permissions: Failed to create a New Relic System Identity for Fleet Control communication authentication. Please verify that your User Key is valid and that your Account Organization has the necessary permissions to create a System Identity: Exception while fetching data (/create) : Not authorized to perform this action or the entity is not found.
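
When a job log is long, matching each line against the known messages can speed up triage. The following is a minimal sketch: the function name `classify_identity_error` is hypothetical, and the match strings are copied from the messages listed above, so any change in their wording would make a line fall through to `unknown`.

```shell
# Hypothetical helper: map one log line from the system identity job to one
# of the known causes listed above, or "unknown" if nothing matches.
classify_identity_error() {
  case "$1" in
    *"Failed to find Identity"*)
      echo "invalid identityClientId" ;;
    *"Bad client secret"*)
      echo "invalid identityClientSecret" ;;
    *"Expired client secret"*)
      echo "identity expired" ;;
    *"Not authorized to perform this action"*)
      echo "missing required permissions" ;;
    *)
      echo "unknown" ;;
  esac
}
```

To classify every line of a pod's log, pipe the output of `kubectl logs <pod-name> -n <your-namespace>` into `while IFS= read -r line; do classify_identity_error "$line"; done | sort | uniq -c`.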

Invalid New Relic license

If you see an error message like the one below in the logs of the OpenTelemetry collector deployment pod, it may indicate an invalid New Relic license key. This prevents the collector from exporting telemetry data to New Relic:

bash

2024-06-13T13:46:05.898Z error exporterhelper/retry_sender.go:126 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/newrelic", "error": "Permanent error: error exporting items, request to https://otlp.nr-dat ││ go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send

Solution

Confirm that you're using a valid New Relic license key in your configuration.

HelmRelease Failure for Managed Agents

If a managed agent's pods are not being created, there may be an issue with its HelmRelease.

Check the status of the Helm release:

bash

$kubectl get helmrelease open-telemetry -n newrelic

A successful and healthy release should show READY: True and STATUS: InstallSucceeded.

If the release failed, the STATUS and READY fields will indicate the problem. Depending on the type of error, the root problem might not be fully reflected in the status field. To get more details, use kubectl to describe the HelmRelease resource:

bash

$kubectl describe helmrelease open-telemetry -n newrelic
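
For scripted checks, the readiness of a release can also be read directly from its status conditions instead of the table output. This is a minimal sketch, not an official procedure: the function name `helmrelease_report` is hypothetical, and it assumes the HelmRelease resource exposes a standard `Ready` condition with a `status` and a `message`, as Flux-managed releases do.

```shell
# Hypothetical helper: print whether a HelmRelease is ready and, if not,
# the message attached to its Ready condition.
helmrelease_report() (
  name="$1"
  namespace="$2"
  # Assumption: the resource exposes a standard `Ready` status condition.
  ready="$(kubectl get helmrelease "$name" -n "$namespace" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  if [ "$ready" = "True" ]; then
    echo "$name: ready"
  else
    message="$(kubectl get helmrelease "$name" -n "$namespace" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}')"
    echo "$name: not ready: $message"
    exit 1
  fi
)
```

For example, `helmrelease_report open-telemetry newrelic` returns a non-zero status when the release is not ready, which makes it usable in scripts.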

Helm view uninstallation errors

When you delete agent-control-bootstrap, a job is launched to delete all the created resources and charts.

If the uninstallation shows an error like this:

* job agent-control-bootstrap-uninstall-job failed: BackoffLimitExceeded

view the job logs to debug the error:

bash

$kubectl logs agent-control-bootstrap-uninstall-job-*** -n newrelic-agent-control

Impossibility to install or upgrade after a canceled uninstall

If the helm delete command is canceled while executing, the uninstall job continues working, deleting the charts and resources, but the agent-control-bootstrap Helm secret may still exist. In that case, you won't be able to upgrade or install the chart, and you'll get this error:

Error: UPGRADE FAILED: "agent-control-bootstrap" has no deployed releases

Running the uninstallation again won't work. The logs from the uninstallation job will show an error like:

Error: uninstall: Release not loaded: agent-control-cd: release: not found

Solution

Delete all Helm secrets for your release (replace agent-control-bootstrap with the name of your release if you changed it):

bash

$kubectl delete secrets -l "name=agent-control-bootstrap"

Then run the installation again.
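
Because the delete command above removes every secret that matches the label, you may want to preview the selection first. The sketch below is a convenience under assumptions, not part of the official procedure: the function name `list_release_secrets` is hypothetical, and the extra `owner=helm` label is the one Helm 3 sets on its release secrets, added here only to narrow the match.

```shell
# Hypothetical helper: list (without deleting) the Helm release secrets for a
# release. Extra arguments, such as `-n <namespace>`, are passed to kubectl.
list_release_secrets() (
  release="$1"
  shift
  # Assumption: Helm 3 labels its release secrets with owner=helm and name=<release>.
  kubectl get secrets -l "owner=helm,name=$release" -o name "$@"
)
```

For example, `list_release_secrets agent-control-bootstrap` prints the matching secrets in the current namespace, so you can confirm the list before running the delete command.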

Troubleshoot with NRDiag

The New Relic diagnostics tool, NRDiag, is a utility that gathers resources and logs related to Agent Control in your cluster for debugging. Follow these steps to gather all the data:

On your host, install the NRDiag tool using the getting started guide.
Run the K8s Agent Control suite:
Tip: Ensure that kubectl and helm are installed.
- Run the command in the namespace set in kubeconfig's context:
bash
```
$./nrdiag -suites K8s-agent-control
```
- Specify a different namespace for Agent Control using the --k8s-namespace flag:
bash
```
$./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic
```
- Specify a different namespace for subagents using the --ac-agents-namespace flag:
bash
```
$./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic-agent-control --ac-agents-namespace=newrelic
```

The expected output should look like the following report:

bash

Check Results
-------------------------------------------------
Info     K8s/Flux/Charts [Successfully collected Flux Helm Charts]
Info     K8s/Resources/Config [Successfully collected K8s configMaps ]
Info     K8s/AgentControl/agent-control-status-server [Successfully collected K8s agent-control status se...]
Info     K8s/Resources/Daemonset [Successfully collected K8s newrelic-infrastructure...]
Info     K8s/Resources/Pods [Successfully collected K8s newrelic-infrastructure...]
Info     K8s/Flux/Repositories [Successfully collected Flux Helm Repositories]
Info     K8s/AgentControl/helm-controller-logs [Successfully collected K8s agent-control helm-cont...]
Info     K8s/Env/Version [kubectl version output successfully collected]
Info     K8s/Resources/Deploy [Successfully collected K8s newrelic-infrastructure...]
Info     K8s/Helm/Releases [Successfully collected the list of helm releases]
Info     K8s/AgentControl/agent-control-logs [Successfully collected K8s agent-control agent-con...]
Info     K8s/Flux/Releases [Successfully collected Flux Helm Releases]
Info     K8s/AgentControl/source-controller-logs [Successfully collected K8s agent-control source-co...]
See nrdiag-output.json for full results.

All the logs and resources related to Agent Control are saved in the nrdiag_output.zip file in the current directory. You can analyze the contents of the zip file or open a support ticket with New Relic support for further assistance.

Linux hosts troubleshooting

Unable to install via New Relic CLI

If you receive the error message Installing agent-control (Unsupported), please check the system requirements and ensure you are running a supported OS version.

If you see Installing agent-control (Failed), follow these steps:

Check the logs provided with the installation script:
- If you see Error creating an identity, please ensure your user key belongs to a platform user with the All product admin role.
Check the status of the newrelic-agent-control service:
bash
```
$sudo systemctl status newrelic-agent-control
```
If the service is in a failed or stopped state, the agent was installed but something is preventing its normal operation. Check the agent service logs using journalctl (or a similar Linux tool):
bash
```
$journalctl -u newrelic-agent-control
```
If no insights are available, check how to run the agent in debug mode to access detailed logs explaining why the service cannot be started.
If the service is not installed, try appending --debug at the end of the install command and run it again. This will enable verbose logging for the installation script. See if the verbose output has additional context explaining the error.
Optionally, answer yes when asked to send logs to New Relic to help troubleshoot the installation. Once submitted, logs can be accessed with the following NRQL query:
```
SELECT * FROM Log WHERE hostname = 'your-host-name'
```
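
The service checks in the steps above can be combined into one helper. This is a minimal sketch, assuming a systemd host and the `newrelic-agent-control` unit name used above; the function name `ac_service_report` is hypothetical.

```shell
# Hypothetical helper: print the state of the Agent Control service and, if it
# is not active, the last 50 lines of its journal.
ac_service_report() (
  unit="${1:-newrelic-agent-control}"
  # `systemctl is-active` exits non-zero for inactive or failed units, so the
  # exit status is ignored and only the printed state is used.
  state="$(systemctl is-active "$unit" 2>/dev/null || true)"
  echo "$unit is $state"
  if [ "$state" != "active" ]; then
    journalctl -u "$unit" -n 50 --no-pager
  fi
)
```

Run `ac_service_report` with no arguments to check the default unit. Depending on your system's permissions, reading the journal may require `sudo`.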

Diagnose issues with agent logging

To access logs, you'll first need to enable agent logging by following these steps:

To enable logging to a file, use the log setting in the Agent Control configuration file:

# Fleet Control connection settings
#fleet_control:

# managed agents settings
#agents:

# agent logging settings
log:
  level: debug
  file:
    enable: true
    # Add a custom path if needed, default path: /var/log/newrelic-agent-control/agent-control.log
    # path: "/path/to/agent-control.log"
  # Optional formatting settings
  format:
    # Include the target module (disabled by default for better readability)
    target: true
    # Custom timestamp format "%Y-%m-%dT%H:%M:%S"
    timestamp: "%Y"

Log level possible values are:

trace
debug
info (default)
warning
error
Logs from the underlying infrastructure agent and/or OpenTelemetry collector are included when level is debug or trace.

Restart Agent Control.
If file logging is enabled, check the local file at the configured path. Alternatively, use your preferred log troubleshooting tool, such as journalctl -u newrelic-agent-control.

Local health status endpoint

To access the health status details, you'll first need to enable the local server by following these steps:

Add the following settings in the Agent Control configuration file:

server:
    enabled: true
    # default values (change if needed)
    #host: "127.0.0.1"
    #port: 51200

Restart Agent Control.

Query the status endpoint using the following command:

bash

$curl 127.0.0.1:51200/status

The server returns the health information in JSON format. For example:

{
  "agent_control": {
    "healthy": true
  },
  "fleet_control": {
    "enabled": true,
    "endpoint": "https://opamp.service.newrelic.com/v1/opamp",
    "reachable": true
  },
  "sub_agents": {
    "nr-otel-collector": {
      "agent_id": "nr-otel-collector",
      "agent_type": "newrelic/com.newrelic.opentelemetry.collector:0.1.0",
      "healthy": true
    },
    "nr-infra-agent": {
      "agent_id": "nr-infra-agent",
      "agent_type": "newrelic/com.newrelic.infrastructure:0.1.0",
      "healthy": false,
      "last_error": "process exited with code: exit status: 1"
    }
  }
}
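
When many agents are managed, it can help to filter the response down to the unhealthy ones. This is a minimal sketch, assuming the response has the shape shown above and that `jq` is installed (it is not otherwise required by this document); the function name `unhealthy_sub_agents` is hypothetical.

```shell
# Hypothetical helper: read the /status JSON on stdin and print one line per
# unhealthy sub-agent in the form "<name>: <last_error>". Requires jq.
unhealthy_sub_agents() {
  jq -r '
    .sub_agents // {}
    | to_entries[]
    | select(.value.healthy != true)
    | "\(.key): \(.value.last_error // "no error reported")"
  '
}
```

For example, `curl -s 127.0.0.1:51200/status | unhealthy_sub_agents` prints nothing when every sub-agent reports `"healthy": true`.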

Invalid or unexpected remote configuration with Fleet Control

Agent Control performs certain validations before receiving and applying remote configuration from Fleet Control. Additionally, configurations might have a valid format (for example, valid .yaml structure) but include unexpected values for certain settings (for example, a string when an integer is expected). The following table shows common errors for the different supported agents:

Agent type	Error	Troubleshooting notes
(All agents)	Error applying remote config: could not resolve config	Review your configuration format. The configuration might not be a valid `.yaml` file, or mandatory fields might be missing.
(All agents)	Invalid config: restricted values detected	Review your configuration content. Specific settings might not be available for the target agent type based on the security policy.
(All agents)	`exit code 1`	Review your configuration. Unexpected values are causing the agent to exit with an unexpected error.
Infrastructure agent	`exit code 1`	Review your configuration. Unexpected values are causing the infrastructure agent to exit. Review the supported settings.
