Important
Agent Control and New Relic Control are now generally available for Kubernetes! Support for Linux hosts is also in a public preview program, pursuant to our pre-release policies.
This document covers the steps to troubleshoot common issues when installing or operating Agent Control. It is organized by environment.
Kubernetes troubleshooting
To diagnose errors during the installation process, you can increase the log level for Agent Control by adding the following setting in your values-newrelic.yaml file:
agent-control-deployment:
  config:
    agentControl:
      content:
        log:
          level: trace
- Default log level: info.
- Other supported log levels: debug and trace.
- OTel collector logs: To enable debug logs in the OpenTelemetry collector, add verboseLog: true.
To inspect the Agent Control logs, run the following commands, replacing agent-control-*** with the name of your Agent Control pod:
# Find the Agent Control pod name
$kubectl get pods -n newrelic-agent-control
# Inspect the logs, replacing `agent-control-***` with your pod's name
$kubectl logs agent-control-*** -n newrelic-agent-control
Agent Control exposes a local status endpoint you can use to check the health of Agent Control and its managed agents. This endpoint is enabled by default on port 51200. Follow these steps to query the cluster status:
Forward a local port to the main agent-control pod:
$kubectl port-forward <pod-name> 51200:51200
Request the agent status:
$curl localhost:51200/status
When the agent-control-bootstrap chart is installed, it launches a job that installs all the resources and charts. The installation may fail with a BackoffLimitExceeded error:
Error: UPGRADE FAILED: pre-upgrade hooks failed: job failed: BackoffLimitExceeded
To debug installation errors, check the installation job logs:
$kubectl logs agent-control-bootstrap-install-job-**** -n newrelic-agent-control
Agent Control requires a valid authentication credential to securely connect to Fleet Control. Initially, this credential is automatically generated through the Agent Control installation UI and is represented by the identityClientId and identityClientSecret fields in the values file. For security reasons, the credential necessary for installing Agent Control expires after 12 hours.
If the installation fails with a BackoffLimitExceeded error, it often indicates an expired or invalid credential.
Check the logs of the Kubernetes job responsible for setting up the Agent Control system identity.
First, identify the job’s pods:
$kubectl describe job agent-control-generate-system-identity -n <your-namespace>
In the Events section, look for entries for the specific pods, as follows:
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      88s   job-controller  Created pod: agent-control-generate-system-identity-jr6cg
  Normal   SuccessfulCreate      73s   job-controller  Created pod: agent-control-generate-system-identity-wnx2v
  Normal   SuccessfulCreate      50s   job-controller  Created pod: agent-control-generate-system-identity-8zsqd
  Normal   SuccessfulCreate      7s    job-controller  Created pod: agent-control-generate-system-identity-btqh7
  Warning  BackoffLimitExceeded  1s    job-controller  Job has reached the specified backoff limit
View the logs of the failing pods:
$kubectl logs <pod-name> -n <your-namespace>
Example:
$kubectl logs agent-control-generate-system-identity-btqh7 -n newrelic-agent-control
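If the job has retried several times, it can help to pull the pod names out of the describe output instead of copying them by hand. The following is a minimal sketch, assuming you have saved the output of the kubectl describe job command as text. The function name created_pods is hypothetical and not part of any New Relic tooling; it relies only on the "Created pod:" message format shown in the Events example.

```python
import re
from typing import List

# Matches the "Created pod: <name>" message that the job controller emits
# in the Events section of `kubectl describe job`.
CREATED_POD = re.compile(r"Created pod:\s+(\S+)")


def created_pods(describe_output: str) -> List[str]:
    """Return the pod names found in the Events section, in the order listed."""
    return CREATED_POD.findall(describe_output)
```

The last name in the returned list is the most recently created pod, which is usually a reasonable first candidate for kubectl logs.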
After reviewing the logs, retry the installation using Helm while watching for specific error messages and checking the logs for potential problems. Below are some known issues and how to interpret them:
- Invalid identityClientId:
Error getting system identity auth token. The API endpoint returned 404: Failed to find Identity: <identityClientId-value>
- Invalid identityClientSecret:
Error getting system identity auth token. The API endpoint returned 400: Bad client secret.
- Identity expired:
Error getting system identity auth token. The API endpoint returned 400: Expired client secret.
- Missing required permissions:
Failed to create a New Relic System Identity for Fleet Control communication authentication. Please verify that your User Key is valid and that your Account Organization has the necessary permissions to create a System Identity: Exception while fetching data (/create) : Not authorized to perform this action or the entity is not found.
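When the job logs are long, a small script can map them to the known issues listed above. This is an illustrative sketch only: the function name diagnose and the KNOWN_ERRORS table are hypothetical, and the matching uses plain substrings copied from the four messages above, so it will not recognize errors that are worded differently.

```python
from typing import List

# (substring to look for, likely cause). The substrings come from the
# known error messages documented above; nothing else is recognized.
KNOWN_ERRORS = [
    ("Failed to find Identity", "invalid identityClientId"),
    ("Bad client secret", "invalid identityClientSecret"),
    ("Expired client secret", "identity expired"),
    ("Not authorized to perform this action", "missing required permissions"),
]


def diagnose(log_text: str) -> List[str]:
    """Return the likely causes found in the logs, without duplicates."""
    causes: List[str] = []
    for needle, cause in KNOWN_ERRORS:
        if needle in log_text and cause not in causes:
            causes.append(cause)
    return causes
```

An empty result means none of the known messages appeared, in which case the full log still needs to be read manually.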
If you see an error message like the one below in the logs of the OpenTelemetry collector deployment pod, it may indicate an invalid New Relic license key. This prevents the collector from being able to export telemetry data to New Relic:
2024-06-13T13:46:05.898Z error exporterhelper/retry_sender.go:126 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/newrelic", "error": "Permanent error: error exporting items, request to https://otlp.nr-dat
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
Solution
Confirm that you're using a valid New Relic license key in your configuration.
If a managed agent's pods are not being created, there may be an issue with its HelmRelease.
Check the status of the Helm release:
$kubectl get helmrelease open-telemetry -n newrelic
A successful and healthy release should show READY: True and STATUS: InstallSucceeded.
If the release failed, the STATUS and READY fields will indicate the problem. Depending on the type of error, the root problem might not be fully reflected in the status field. To get more details, use kubectl to describe the HelmRelease resource:
$kubectl describe helmrelease open-telemetry -n newrelic
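For scripted checks, the same information can be read from the resource's JSON form. The sketch below assumes you exported the HelmRelease with kubectl's standard -o json output flag, and that the resource reports readiness through a status.conditions entry of type Ready, following the usual Kubernetes condition convention. That layout is an assumption here, so verify it against your cluster's actual output. The function names are hypothetical.

```python
import json
from typing import Optional


def ready_condition(helmrelease: dict) -> Optional[dict]:
    """Return the condition whose type is "Ready", or None if absent."""
    for cond in helmrelease.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def summarize(helmrelease_json: str) -> str:
    """Produce a one-line readiness summary from the exported JSON text."""
    cond = ready_condition(json.loads(helmrelease_json))
    if cond is None:
        return "no Ready condition reported yet"
    state = "ready" if cond.get("status") == "True" else "not ready"
    detail = f"{cond.get('reason', '')} {cond.get('message', '')}".strip()
    return f"{state}: {detail}" if detail else state
```

The message field, when present, often carries the underlying Helm error that the short status column omits.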
When you delete agent-control-bootstrap, it launches a job that deletes all the created resources and charts.
If the uninstallation shows an error like the following, you can view the job logs to debug it:
* job agent-control-bootstrap-uninstall-job failed: BackoffLimitExceeded
To view the job logs, run:
$kubectl logs agent-control-bootstrap-uninstall-job-*** -n newrelic-agent-control
If the helm delete command is canceled while executing, the uninstall job keeps running and deletes the charts and resources, but the agent-control-bootstrap helm secret may still exist. In that case, you won't be able to upgrade or install the chart, and you'll get this error:
Error: UPGRADE FAILED: "agent-control-bootstrap" has no deployed releases
Running the uninstallation again won't work. The logs from the uninstallation job will show an error like:
Error: uninstall: Release not loaded: agent-control-cd: release: not found
Solution
Delete all helm secrets from your release (replace agent-control-bootstrap with the name of your release if you changed it):
$kubectl delete secrets -l "name=agent-control-bootstrap"
Then run the installation again.
The New Relic diagnostics tool, NRDiag, is a utility that gathers resources and logs related to agent-control in your cluster for debugging.
Follow these steps to gather all the data:
On your host, install the NRDiag tool using the getting started guide.
Run the K8s agent control suite:
Tip
Ensure that kubectl and helm are installed.
- Run the command in the namespace set in kubeconfig's context:
$./nrdiag -suites K8s-agent-control
- Specify a different namespace for Agent Control using the --k8s-namespace flag:
$./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic
- Specify a different namespace for sub-agents using the --ac-agents-namespace flag:
$./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic-agent-control --ac-agents-namespace=newrelic
The expected output should look like the following report:
Check Results
-------------------------------------------------
Info K8s/Flux/Charts [Successfully collected Flux Helm Charts]
Info K8s/Resources/Config [Successfully collected K8s configMaps ]
Info K8s/AgentControl/agent-control-status-server [Successfully collected K8s agent-control status se...]
Info K8s/Resources/Daemonset [Successfully collected K8s newrelic-infrastructure...]
Info K8s/Resources/Pods [Successfully collected K8s newrelic-infrastructure...]
Info K8s/Flux/Repositories [Successfully collected Flux Helm Repositories]
Info K8s/AgentControl/helm-controller-logs [Successfully collected K8s agent-control helm-cont...]
Info K8s/Env/Version [kubectl version output successfully collected]
Info K8s/Resources/Deploy [Successfully collected K8s newrelic-infrastructure...]
Info K8s/Helm/Releases [Successfully collected the list of helm releases]
Info K8s/AgentControl/agent-control-logs [Successfully collected K8s agent-control agent-con...]
Info K8s/Flux/Releases [Successfully collected Flux Helm Releases]
Info K8s/AgentControl/source-controller-logs [Successfully collected K8s agent-control source-co...]
See nrdiag-output.json for full results.
All the logs and resources related to agent-control are saved in the nrdiag_output.zip file in the current directory. You can analyze the contents of the zip file or open a support ticket with New Relic support for further assistance.
Linux hosts troubleshooting
If you receive the error message Installing agent-control (Unsupported), check the system requirements and ensure you are running a supported OS version.
If you see Installing agent-control (Failed), follow these steps:
Check the logs provided with the installation script:
- If you see Error creating an identity, ensure your user key belongs to a platform user with the All product admin role.
- If you see
Check the status of the newrelic-agent-control service:
$sudo systemctl status newrelic-agent-control
If the service appears in a failed or stopped state, the agent was installed but an issue is preventing its normal operation. Check the agent service logs using journalctl (or any similar Linux tool):
$journalctl -u newrelic-agent-control
If no insights are available, check how to run the agent in debug mode to access detailed logs explaining why the service cannot be started.
If the service is not installed, try appending --debug at the end of the install command and run it again. This enables verbose logging for the installation script. See if the verbose output has additional context explaining the error.
Optionally, answer yes when asked to send logs to New Relic to help troubleshoot the installation. Once submitted, logs can be accessed with the following NRQL query:
SELECT * FROM Log WHERE hostname = 'your-host-name'
To access logs, you'll need to first enable agent logging by following these steps:
- To enable logging to a file, use the log setting in the Agent Control configuration file:
# Fleet Control connection settings
#fleet_control:

# managed agents settings
#agents:

# agent logging settings
log:
  level: debug
  file:
    enable: true
    # Add a custom path if needed, default path: /var/log/newrelic-agent-control/agent-control.log
    # path: "/path/to/agent-control.log"
  # Optional formatting settings
  format:
    # Include the target module (disabled by default for better readability)
    target: true
    # Custom timestamp format "%Y-%m-%dT%H:%M:%S"
    timestamp: "%Y"
Possible log level values are:
- trace
- debug
- info (default)
- warning
- error
Logs from the underlying infrastructure agent and/or OpenTelemetry collector are included when the level is debug or trace.
- Restart Agent Control.
- If the file log is enabled, check the corresponding local file based on the path setting. Alternatively, use your preferred log troubleshooting tool, such as journalctl -u newrelic-agent-control.
To access the health status details, you'll need to first enable the local server by following these steps:
Add the following settings in the Agent Control configuration file:
server:
  enabled: true
  # default values (change if needed)
  #host: "127.0.0.1"
  #port: 51200
Restart Agent Control.
Query the status endpoint using the following command:
$curl 127.0.0.1:51200/status
The server returns the health information in JSON format. For example:
{
  "agent_control": {
    "healthy": true
  },
  "fleet_control": {
    "enabled": true,
    "endpoint": "https://opamp.service.newrelic.com/v1/opamp",
    "reachable": true
  },
  "sub_agents": {
    "nr-otel-collector": {
      "agent_id": "nr-otel-collector",
      "agent_type": "newrelic/io.opentelemetry.collector:0.1.0",
      "healthy": true
    },
    "nr-infra-agent": {
      "agent_id": "nr-infra-agent",
      "agent_type": "newrelic/com.newrelic.infrastructure:0.1.0",
      "healthy": false,
      "last_error": "process exited with code: exit status: 1"
    }
  }
}
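With several managed agents, it can be quicker to filter the status response for failures than to read it by eye. The following is a minimal sketch, assuming the response has the shape of the example above (a sub_agents object whose entries carry healthy and, on failure, last_error). The function name unhealthy_sub_agents is hypothetical, and fields not shown in the example are not handled.

```python
from typing import Dict


def unhealthy_sub_agents(status: dict) -> Dict[str, str]:
    """Map each unhealthy sub-agent ID to its last_error ("" if none is given).

    `status` is the parsed JSON body returned by the /status endpoint.
    A sub-agent with no "healthy" field is treated as unhealthy.
    """
    return {
        agent_id: info.get("last_error", "")
        for agent_id, info in status.get("sub_agents", {}).items()
        if not info.get("healthy", False)
    }
```

Applied to the example response, this returns a single entry for nr-infra-agent with its last_error text. To use it on live data, parse the curl output with json.loads first.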
Agent Control performs certain validations before receiving and applying remote configuration from Fleet Control. Additionally, configurations might have a valid format (for example, valid .yaml structure) but include unexpected values for certain settings (for example, string when integer is expected). The following table shows common errors for the different supported agents:
Agent type | Error | Troubleshooting notes |
---|---|---|
(All agents) | Error applying remote config: could not resolve config | Review your configuration format. The configuration might not be a valid |
(All agents) | Invalid config: restricted values detected | Review your configuration content. Specific settings might not be available for the target agent type based on the security policy. |
(All agents) | | Review your configuration. Unexpected values are causing the agent to exit with an unexpected error. |
Infrastructure agent | | Review your configuration. Unexpected values are causing the infrastructure agent to exit with unexpected config. Review the supported settings. |