ECS integration troubleshooting: No data appears

Problem

You installed our on-host ECS integration and waited a few minutes, but your cluster is not showing in the entity explorer.

We have two ECS integrations: a cloud-based integration and an on-host integration. This document is about the on-host integration.

Solution

If your New Relic account had previously installed the infrastructure agent or an infrastructure on-host integration, your data should appear in the UI within a few minutes.

If your account had not previously installed either one before installing the on-host ECS integration, data may take tens of minutes to appear in the UI. In that case, we recommend waiting up to an hour before following the troubleshooting steps below or contacting support.

There are two ways to troubleshoot missing data: via awscli or in the UI. Both are described below.

For information about stopped tasks, see Reasons for stopped tasks.

Troubleshoot via awscli

When interacting with New Relic support, use this method and send the generated files with your support request:

  1. Retrieve the information related to the newrelic-infra service or the Fargate service that contains a task with a newrelic-infra sidecar:
    aws ecs describe-services --cluster YOUR_CLUSTER_NAME --services newrelic-infra > newrelic-infra-service.json
    aws ecs describe-services --cluster YOUR_CLUSTER_NAME --services YOUR_FARGATE_SERVICE_WITH_NEW_RELIC_SIDECAR > newrelic-infra-sidecar-service.json
  2. The failures attribute details any errors for the services.
  3. Under services is the status attribute. It says ACTIVE if the service has no issues.
  4. The desiredCount should match the runningCount. This is the number of tasks the service is handling. Because we use the daemon service type, there should be one task per container instance in your cluster. The pendingCount attribute should be zero, because all tasks should be running.
  5. Inspect the events attribute of services to check for issues with scheduling or starting the tasks. For example: if the service is unable to start tasks successfully, it will display a message like:
    {
      "id": "5295a13c-34e6-41e1-96dd-8364c42cc7a9",
      "createdAt": "2020-04-06T15:28:18.298000+02:00",
      "message": "(service newrelic-infra) is unable to consistently start tasks successfully. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide."
    }
  6. In the same section, you can also see which tasks were started by the service from the events:
    {
      "id": "1c0a6ce2-de2e-49b2-b0ac-6458a804d0f0",
      "createdAt": "2020-04-06T15:27:49.614000+02:00",
      "message": "(service fargate-fail) has started 1 tasks: (task YOUR_TASK_ID)."
    }
  7. Retrieve the information related to the task with this command:
    aws ecs describe-tasks --tasks YOUR_TASK_ID --cluster YOUR_CLUSTER_NAME > newrelic-infra-task.json
  8. The desiredStatus and lastStatus should be RUNNING. If the task couldn't start normally, it will have a STOPPED status.
  9. Inspect the stopCode and stoppedReason attributes. For example, a task that couldn't start because the task execution role lacks permission to download the secret containing the license key would have the following output:
    "stopCode": "TaskFailedToStart",
    "stoppedAt": "2020-04-06T15:28:54.725000+02:00",
    "stoppedReason": "Fetching secret data from AWS Secrets Manager in region YOUR_AWS_REGION: secret arn:aws:secretsmanager:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT:secret:NewRelicLicenseKeySecret-Dh2dLkgV8VyJ-80RAHS-fail: AccessDeniedException: User: arn:aws:sts::YOUR_AWS_ACCOUNT:assumed-role/NewRelicECSIntegration-Ne-NewRelicECSTaskExecution-1C0ODHVT4HDNT/8637b461f0f94d649e9247e2f14c3803 is not authorized to perform: secretsmanager:GetSecretValue on resource: arn:aws:secretsmanager:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT:secret:NewRelicLicenseKeySecret-Dh2dLkgV8VyJ-80RAHS-fail-DmLHfs status code: 400, request id: 9cf1881e-14d7-4257-b4a8-be9b56e09e3c",
    "stoppingAt": "2020-04-06T15:28:10.953000+02:00",
  10. If the task is running but you’re still not seeing data, generate verbose logs and examine them for errors.
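
The service checks in steps 2 to 4 can also be scripted. The following is a minimal sketch, not part of the integration: the function names service_summary and summary_is_healthy are hypothetical, and it assumes the aws CLI is installed and configured for the account that owns the cluster. It uses the CLI's --query and --output text options to print the status, desiredCount, runningCount, and pendingCount attributes on one line, then applies the rules described above.

```shell
# Hypothetical helper: print status, desiredCount, runningCount and
# pendingCount for one service as a single whitespace-separated line.
# $1 = cluster name, $2 = service name.
service_summary() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].[status, desiredCount, runningCount, pendingCount]' \
    --output text
}

# Hypothetical helper: succeed only when a summary line looks healthy,
# meaning status is ACTIVE, desiredCount equals runningCount, and
# pendingCount is zero. $1 = the line printed by service_summary.
summary_is_healthy() {
  # Split the summary line on whitespace into positional parameters.
  set -- $1
  [ "$1" = "ACTIVE" ] && [ "$2" = "$3" ] && [ "$4" = "0" ]
}

# Example usage (commented out so that sourcing this file runs nothing):
# summary="$(service_summary YOUR_CLUSTER_NAME newrelic-infra)"
# if summary_is_healthy "$summary"; then
#   echo "service looks healthy: $summary"
# else
#   echo "service needs attention: $summary"
# fi
```

If the service name does not exist in the cluster, the query has nothing to select and the summary will not start with ACTIVE, so the check fails. This sketch does not replace the JSON files requested by support; it only gives a quick first reading of the same attributes.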

For details about reasons for stopped tasks, see Reasons for stopped tasks.
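
To read a task's status and stop reason without opening the JSON file, the describe-tasks call from the awscli steps can be narrowed with --query. This is a hedged sketch under the same assumptions as the commands above (a configured aws CLI); the function names list_stopped_tasks and task_stop_info are hypothetical. The first function is an alternative way to find task IDs, using aws ecs list-tasks instead of reading the service events.

```shell
# Hypothetical helper: list the ARNs of stopped tasks that belonged to a
# service. $1 = cluster name, $2 = service name.
list_stopped_tasks() {
  aws ecs list-tasks \
    --cluster "$1" \
    --service-name "$2" \
    --desired-status STOPPED \
    --query 'taskArns[]' \
    --output text
}

# Hypothetical helper: print lastStatus, desiredStatus, stopCode and
# stoppedReason for one task. $1 = cluster name, $2 = task ID or ARN.
task_stop_info() {
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[0].[lastStatus, desiredStatus, stopCode, stoppedReason]' \
    --output text
}

# Example usage (commented out so that sourcing this file runs nothing):
# list_stopped_tasks YOUR_CLUSTER_NAME newrelic-infra
# task_stop_info YOUR_CLUSTER_NAME YOUR_TASK_ID
```

For a healthy task, both status fields should read RUNNING and the stop fields should be empty. ECS keeps stopped tasks visible only for a limited time, so run these soon after the failure.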

Troubleshoot in the UI

To use the UI to troubleshoot:

  1. Log in to your AWS Console and navigate to the Elastic Container Service (ECS) section.
  2. Click on the cluster where you installed the New Relic ECS integration.
  3. On the Services tab, use the filter to search for the integration service. If you used the automatic install script, the name of the service will be newrelic-infra. If you are using Fargate, it will be the name of your monitored service. Once found, click on the name.
  4. The service page shows the Status of the service. It says ACTIVE if the service has no issues.
  5. On the same page, the Desired count should match the Running count. This is the number of tasks the service is handling. Because we use the daemon service type, there should be one task per container instance in your cluster. Pending count should be zero, because all tasks should be running.
  6. Inspect the Events tab to check for issues with scheduling or starting the tasks.
  7. In the Tasks tab of your service, you can inspect the running tasks and the stopped tasks by clicking on the Task status selector. Containers that failed to start are shown when you select the Stopped status.
  8. Click on a task to go to the task details page. Under Stopped reason, it displays a message explaining why the task was stopped.
  9. If the task is running but you’re still not seeing data, generate verbose logs and examine them for errors.

For details about reasons for stopped tasks, see Reasons for stopped tasks.

Reasons for stopped tasks

In the AWS ECS troubleshooting documentation you can find information on common causes of errors related to running tasks and services. See below for details about some reasons for stopped tasks.

Task stopped with reason:

Fetching secret data from AWS Secrets Manager 
  in region YOUR_AWS_REGION: 
  secret arn:aws:secretsmanager:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT:secret:YOUR_SECRET_NAME: 
  AccessDeniedException: User: arn:aws:sts::YOUR_AWS_ACCOUNT:assumed-role/YOUR_ROLE_NAME 
  is not authorized to perform: secretsmanager:GetSecretValue 
  on resource: arn:aws:secretsmanager:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT:secret:YOUR_SECRET_NAME 
  status code: 400, request id: 9cf1881e-14d7-4257-b4a8-be9b56e09e3c

This means that the IAM role specified using executionRoleArn in the task definition doesn't have access to the secret used for the NRIA_LICENSE_KEY. The execution role should have a policy attached that grants it access to read the secret.

  1. Get the execution role of your task:
    aws ecs describe-task-definition --task-definition newrelic-infra --output text --query taskDefinition.executionRoleArn

    If you use Fargate, replace newrelic-infra with the name of your Fargate task definition that includes the sidecar container:

    aws ecs describe-task-definition --task-definition YOUR_FARGATE_TASK_NAME --output text --query taskDefinition.executionRoleArn
    
  2. List the policies attached to the role:
    aws iam list-attached-role-policies --role-name YOUR_EXECUTION_ROLE_NAME

    This should return three policies: AmazonECSTaskExecutionRolePolicy, AmazonEC2ContainerServiceforEC2Role, and a third one that grants read access to the license key. In the following example, the third policy appears as the placeholder YOUR_POLICY_NAME; the name might be, for example, NewRelicLicenseKeySecretReadAccess.

    {
      "AttachedPolicies": [
        {
          "PolicyName": "AmazonECSTaskExecutionRolePolicy",
          "PolicyArn": "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
        },
        {
          "PolicyName": "AmazonEC2ContainerServiceforEC2Role",
          "PolicyArn": "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
        },
        {
          "PolicyName": "YOUR_POLICY_NAME",
          "PolicyArn": "arn:aws:iam::YOUR_AWS_ACCOUNT:policy/YOUR_POLICY_NAME"
        }
      ]
    }
  3. Retrieve the default policy version:
    aws iam get-policy-version --policy-arn arn:aws:iam::YOUR_AWS_ACCOUNT:policy/YOUR_POLICY_NAME --version-id $(aws iam get-policy --policy-arn arn:aws:iam::YOUR_AWS_ACCOUNT:policy/YOUR_POLICY_NAME --output text --query Policy.DefaultVersionId)

    This retrieves the policy permissions. There should be an entry for the action secretsmanager:GetSecretValue if you used AWS Secrets Manager to store your license key, or an entry for ssm:GetParameters if you used AWS Systems Manager Parameter Store:

    AWS Secrets Manager
    {
      "PolicyVersion": {
        "Document": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Action": "secretsmanager:GetSecretValue",
              "Resource": "arn:aws:secretsmanager:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT:secret:YOUR_SECRET_NAME",
              "Effect": "Allow"
            }
          ]
        },
        "VersionId": "v1",
        "IsDefaultVersion": true,
        "CreateDate": "2020-03-31T13:47:07+00:00"
      }
    }
    AWS Systems Manager Parameter Store
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": "ssm:GetParameters",
          "Resource": [
            "arn:aws:ssm:YOUR_AWS_REGION:YOUR_AWS_ACCOUNT:parameter/YOUR_SECRET_NAME"
          ],
          "Effect": "Allow"
        }
      ]
    }
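
The three steps above can be chained in a small script. This is a rough sketch under stated assumptions rather than a complete permission check: the function names role_name_from_arn and policy_allows_action are hypothetical, and the check only looks for the action name in the policy's default version. It does not evaluate Effect, Resource, wildcards such as secretsmanager:*, or a policy whose Statement is a single object instead of a list.

```shell
# Hypothetical helper: keep only the text after the last "/" of a role ARN.
# `aws iam list-attached-role-policies --role-name` expects this role name,
# while executionRoleArn in the task definition is a full ARN.
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Hypothetical helper: succeed when the default version of a policy lists
# the given action. $1 = policy ARN, $2 = action name, for example
# secretsmanager:GetSecretValue or ssm:GetParameters.
policy_allows_action() {
  # Look up which policy version is the default one.
  default_version="$(aws iam get-policy \
    --policy-arn "$1" \
    --query Policy.DefaultVersionId \
    --output text)" || return 1

  # Print every action from that version, one per line, and look for an
  # exact match.
  aws iam get-policy-version \
    --policy-arn "$1" \
    --version-id "$default_version" \
    --query 'PolicyVersion.Document.Statement[].Action' \
    --output text \
    | tr '\t' '\n' \
    | grep -Fxq "$2"
}

# Example usage (commented out so that sourcing this file runs nothing):
# role_arn="$(aws ecs describe-task-definition --task-definition newrelic-infra \
#   --output text --query taskDefinition.executionRoleArn)"
# aws iam list-attached-role-policies --role-name "$(role_name_from_arn "$role_arn")"
# policy_allows_action arn:aws:iam::YOUR_AWS_ACCOUNT:policy/YOUR_POLICY_NAME \
#   secretsmanager:GetSecretValue && echo "action found"
```

A successful match only shows that the action is named in the policy. If the task still fails with AccessDeniedException, compare the Resource entry in the policy with the secret ARN in the stopped reason.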
