Monitor Amazon ECS tasks running on EC2 instances by deploying OpenTelemetry Collector Contrib as a sidecar container. This guide walks you through creating task definitions, configuring the collector, and setting up monitoring for your ECS on EC2 workloads.
Installation steps
Follow these steps in order to set up monitoring for your ECS on EC2 tasks.
Before you begin
Make sure your environment meets these requirements:
Store your New Relic license key
Save your license key as a Systems Manager (SSM) parameter to securely store credentials for the OpenTelemetry Collector:

    aws ssm put-parameter \
      --name "/newrelic-infra/ecs/license-key" \
      --type SecureString \
      --description 'New Relic license key for ECS monitoring' \
      --value "YOUR_NEW_RELIC_LICENSE_KEY"

Create IAM policy and execution role
Create an IAM policy so your ECS containers can securely retrieve the New Relic license key:

    aws iam create-policy \
      --policy-name "NewRelicSSMLicenseKeyReadAccess" \
      --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["ssm:GetParameters"],"Resource":["arn:aws:ssm:*:*:parameter/newrelic-infra/ecs/license-key"]}]}' \
      --description "Provides read access to the New Relic SSM license key parameter"

Create an IAM role to be used as the task execution role:

    aws iam create-role \
      --role-name "NewRelicECSTaskExecutionRole" \
      --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}' \
      --description "ECS task execution role for New Relic infrastructure"

Attach the required managed policies to the role:

    # Attach the standard ECS task execution policy
    aws iam attach-role-policy \
      --role-name "NewRelicECSTaskExecutionRole" \
      --policy-arn "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"

    # Attach the New Relic SSM license key read access policy
    aws iam attach-role-policy \
      --role-name "NewRelicECSTaskExecutionRole" \
      --policy-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/NewRelicSSMLicenseKeyReadAccess"

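Before moving on, it can help to confirm that both policies ended up attached to the role. The following is a minimal sketch rather than part of the official procedure: the function name `check_execution_role` is hypothetical, and it assumes the role and policy names used in the commands above.

```shell
# Hypothetical helper: verifies that both policies are attached to the role.
# Assumes the role and policy names used in the commands above.
check_execution_role() (
  role_name="${1:-NewRelicECSTaskExecutionRole}"
  # List the names of all managed policies attached to the role.
  attached="$(aws iam list-attached-role-policies \
    --role-name "$role_name" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text)" || exit 1
  for policy in AmazonECSTaskExecutionRolePolicy NewRelicSSMLicenseKeyReadAccess; do
    # Text output is tab-separated; normalize to spaces before matching.
    case " $(printf '%s' "$attached" | tr '\t\n' '  ') " in
      *" $policy "*) echo "attached: $policy" ;;
      *) echo "missing: $policy" >&2; exit 1 ;;
    esac
  done
)
```

Running `check_execution_role` prints one `attached:` line per policy and returns a non-zero status if either policy is missing.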
Store collector configuration
Store the OpenTelemetry Collector configuration in AWS Systems Manager Parameter Store so you can manage and update settings without rebuilding container images:

    aws ssm put-parameter \
      --name "/ecs/otel-collector/ec2-config" \
      --type "String" \
      --value "$(cat <<EOF
    receivers:
      awsecscontainermetrics:
        collection_interval: <COLLECTION_INTERVAL>
      hostmetrics:
        collection_interval: <COLLECTION_INTERVAL>
        scrapers:
          cpu:
            metrics:
              system.cpu.time:
                enabled: false
              system.cpu.utilization:
                enabled: true
          load:
          memory:
            metrics:
              system.memory.utilization:
                enabled: true
          paging:
            metrics:
              system.paging.utilization:
                enabled: false
              system.paging.faults:
                enabled: false
          filesystem:
            metrics:
              system.filesystem.utilization:
                enabled: true
          disk:
            metrics:
              system.disk.merged:
                enabled: false
              system.disk.pending_operations:
                enabled: false
              system.disk.weighted_io_time:
                enabled: false
          network:
            metrics:
              system.network.connections:
                enabled: false

    processors:
      metricstransform/containers:
        transforms:
          - include: container.cpu.utilized
            action: insert
            new_name: container.cpu.utilization
          - include: container.memory.usage
            action: insert
            new_name: container.memory.usage.total
          - include: container.storage.read_bytes
            action: insert
            new_name: container.blockio.io_service_bytes_recursive
            operations:
              - action: add_label
                new_label: operation
                new_value: read
          - include: container.storage.write_bytes
            action: insert
            new_name: container.blockio.io_service_bytes_recursive
            operations:
              - action: add_label
                new_label: operation
                new_value: write
      metricstransform:
        transforms:
          - include: system.cpu.utilization
            action: update
            operations:
              - action: aggregate_labels
                label_set: [ state ]
                aggregation_type: mean
          - include: system.paging.operations
            action: update
            operations:
              - action: aggregate_labels
                label_set: [ direction ]
                aggregation_type: sum
      filter/exclude_cpu_utilization:
        metrics:
          datapoint:
            - 'metric.name == \"system.cpu.utilization\" and attributes[\"state\"] == \"interrupt\"'
            - 'metric.name == \"system.cpu.utilization\" and attributes[\"state\"] == \"nice\"'
            - 'metric.name == \"system.cpu.utilization\" and attributes[\"state\"] == \"softirq\"'
      filter/exclude_memory_utilization:
        metrics:
          datapoint:
            - 'metric.name == \"system.memory.utilization\" and attributes[\"state\"] == \"slab_unreclaimable\"'
            - 'metric.name == \"system.memory.utilization\" and attributes[\"state\"] == \"inactive\"'
            - 'metric.name == \"system.memory.utilization\" and attributes[\"state\"] == \"cached\"'
            - 'metric.name == \"system.memory.utilization\" and attributes[\"state\"] == \"buffered\"'
            - 'metric.name == \"system.memory.utilization\" and attributes[\"state\"] == \"slab_reclaimable\"'
      filter/exclude_memory_usage:
        metrics:
          datapoint:
            - 'metric.name == \"system.memory.usage\" and attributes[\"state\"] == \"slab_unreclaimable\"'
            - 'metric.name == \"system.memory.usage\" and attributes[\"state\"] == \"inactive\"'
      filter/exclude_filesystem_utilization:
        metrics:
          datapoint:
            - 'metric.name == \"system.filesystem.utilization\" and attributes[\"type\"] == \"squashfs\"'
      filter/exclude_filesystem_usage:
        metrics:
          datapoint:
            - 'metric.name == \"system.filesystem.usage\" and attributes[\"type\"] == \"squashfs\"'
            - 'metric.name == \"system.filesystem.usage\" and attributes[\"state\"] == \"reserved\"'
      filter/exclude_filesystem_inodes_usage:
        metrics:
          datapoint:
            - 'metric.name == \"system.filesystem.inodes.usage\" and attributes[\"type\"] == \"squashfs\"'
            - 'metric.name == \"system.filesystem.inodes.usage\" and attributes[\"state\"] == \"reserved\"'
      filter/exclude_system_disk:
        metrics:
          datapoint:
            - 'metric.name == \"system.disk.operations\" and IsMatch(attributes[\"device\"], \"^loop.*\") == true'
            - 'metric.name == \"system.disk.merged\" and IsMatch(attributes[\"device\"], \"^loop.*\") == true'
            - 'metric.name == \"system.disk.io\" and IsMatch(attributes[\"device\"], \"^loop.*\") == true'
            - 'metric.name == \"system.disk.io_time\" and IsMatch(attributes[\"device\"], \"^loop.*\") == true'
            - 'metric.name == \"system.disk.operation_time\" and IsMatch(attributes[\"device\"], \"^loop.*\") == true'
      filter/exclude_system_paging:
        metrics:
          datapoint:
            - 'metric.name == \"system.paging.usage\" and attributes[\"state\"] == \"cached\"'
            - 'metric.name == \"system.paging.operations\" and attributes[\"type\"] == \"cached\"'
      filter/exclude_network:
        metrics:
          datapoint:
            - 'IsMatch(metric.name, \"^system.network.*\") == true and attributes[\"device\"] == \"lo\"'

      attributes/exclude_system_paging:
        include:
          match_type: strict
          metric_names:
            - system.paging.operations
        actions:
          - key: type
            action: delete

      cumulativetodelta:

      transform/host:
        metric_statements:
          - context: metric
            statements:
              - set(metric.description, \"\")
              - set(metric.unit, \"\")

      transform:
        trace_statements:
          - context: span
            statements:
              - truncate_all(span.attributes, <ATTRIBUTE_TRUNCATION_LIMIT>)
              - truncate_all(resource.attributes, <RESOURCE_ATTRIBUTE_TRUNCATION_LIMIT>)
        log_statements:
          - context: log
            statements:
              - truncate_all(log.attributes, <ATTRIBUTE_TRUNCATION_LIMIT>)
              - truncate_all(resource.attributes, <RESOURCE_ATTRIBUTE_TRUNCATION_LIMIT>)

      memory_limiter:
        check_interval: <MEMORY_LIMITER_CHECK_INTERVAL>
        limit_mib: \${env:NEW_RELIC_MEMORY_LIMIT_MIB:-<MEMORY_LIMIT_MIB>}
      batch:
        send_batch_size: <SEND_BATCH_SIZE>
        timeout: <BATCH_TIMEOUT>
      resource:
        attributes:
          - key: ClusterName
            from_attribute: aws.ecs.cluster.name
            action: insert
          - key: ServiceName
            from_attribute: aws.ecs.service.name
            action: insert
          - key: TaskId
            from_attribute: aws.ecs.task.id
            action: insert
          - key: TaskDefinitionFamily
            from_attribute: aws.ecs.task.family
            action: insert
          - key: LaunchType
            from_attribute: aws.ecs.launch_type
            action: insert

      resourcedetection:
        detectors:
          - env
          - ecs
          - ec2
          - system
        timeout: <RESOURCE_DETECTION_TIMEOUT>
        override: false

    exporters:
      otlphttp:
        endpoint: https://otlp.nr-data.net:443
        headers:
          api-key: \${NEW_RELIC_LICENSE_KEY}

      debug:
        verbosity: basic

    service:
      pipelines:
        metrics/containers:
          receivers: [awsecscontainermetrics]
          processors: [metricstransform/containers, resource, batch]
          exporters: [otlphttp, debug]
        metrics/host:
          receivers: [hostmetrics]
          processors:
            - memory_limiter
            - metricstransform
            - filter/exclude_cpu_utilization
            - filter/exclude_memory_utilization
            - filter/exclude_memory_usage
            - filter/exclude_filesystem_utilization
            - filter/exclude_filesystem_usage
            - filter/exclude_filesystem_inodes_usage
            - filter/exclude_system_disk
            - filter/exclude_network
            - attributes/exclude_system_paging
            - transform/host
            - resourcedetection
            - cumulativetodelta
            - batch
          exporters: [otlphttp, debug]
    EOF
    )"

Configuration parameters
The following parameters can be customized in the OpenTelemetry Collector configuration:
| Parameter | Description |
|---|---|
| `<COLLECTION_INTERVAL>` | Interval at which metrics are collected from the ECS container and host metrics endpoints |
| `<MEMORY_LIMIT_MIB>` | Memory limit for the OpenTelemetry Collector in MiB |
| `<MEMORY_LIMITER_CHECK_INTERVAL>` | Interval for the memory limiter to check current memory usage |
| `<SEND_BATCH_SIZE>` | Number of metrics to batch before sending to New Relic |
| `<BATCH_TIMEOUT>` | Maximum time to wait before sending a batch |
| `<RESOURCE_DETECTION_TIMEOUT>` | Timeout for resource detection processors |
| `<ATTRIBUTE_TRUNCATION_LIMIT>` | Maximum length for span and log attribute values before truncation. Default: 4095 |
| `<RESOURCE_ATTRIBUTE_TRUNCATION_LIMIT>` | Maximum length for resource attribute values before truncation. Default: 4095 |
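
The angle-bracket placeholders in the collector configuration have to be replaced with real values before the parameter is stored. As a minimal sketch, assuming the configuration is kept as a template on disk, a small `sed` filter can do the substitution. The function name `render_collector_config` is hypothetical, and every default below except the two truncation limits (4095, as stated in the table) is an illustrative assumption rather than a recommended setting.

```shell
# Hypothetical helper: fills the <PLACEHOLDER> tokens in a collector config
# template read from stdin and writes the result to stdout.
# All defaults except the truncation limits (4095) are illustrative assumptions.
render_collector_config() {
  sed \
    -e "s/<COLLECTION_INTERVAL>/${COLLECTION_INTERVAL:-60s}/g" \
    -e "s/<MEMORY_LIMIT_MIB>/${MEMORY_LIMIT_MIB:-100}/g" \
    -e "s/<MEMORY_LIMITER_CHECK_INTERVAL>/${MEMORY_LIMITER_CHECK_INTERVAL:-1s}/g" \
    -e "s/<SEND_BATCH_SIZE>/${SEND_BATCH_SIZE:-1000}/g" \
    -e "s/<BATCH_TIMEOUT>/${BATCH_TIMEOUT:-30s}/g" \
    -e "s/<RESOURCE_DETECTION_TIMEOUT>/${RESOURCE_DETECTION_TIMEOUT:-5s}/g" \
    -e "s/<ATTRIBUTE_TRUNCATION_LIMIT>/${ATTRIBUTE_TRUNCATION_LIMIT:-4095}/g" \
    -e "s/<RESOURCE_ATTRIBUTE_TRUNCATION_LIMIT>/${RESOURCE_ATTRIBUTE_TRUNCATION_LIMIT:-4095}/g"
}
```

Each value can be overridden through an environment variable of the same name as its placeholder, so the same template can serve clusters with different settings.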
Create task definition
Create a new ECS task definition that includes the OpenTelemetry Collector sidecar container. Choose the appropriate task definition for your container platform:
Task definition parameters
The following parameters can be customized in your ECS task definition:
| Parameter | Description |
|---|---|
| | Total CPU units for the EC2 task |
| | Total memory for the EC2 task in MiB |
| | CPU units allocated to your application container |
| | Memory allocated to your application container in MiB |
| | CPU units allocated to the OpenTelemetry Collector |
| | Memory allocated to the OpenTelemetry Collector in MiB |
| | CloudWatch log group name for your application container |
| | CloudWatch log group name for the OpenTelemetry Collector |
| | AWS region for CloudWatch logs |
| | Log stream prefix for your application container |
| | Log stream prefix for the OpenTelemetry Collector |
Tip
The networkMode is set to "host" for Linux containers and should be "default" for Windows containers. Host mode provides better access to system metrics on EC2 instances.
Important
Replace YOUR_ACCOUNT and region values with your actual AWS account ID and AWS region.
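
The account ID substitution can be scripted instead of edited by hand. This is a minimal sketch, assuming the task definition is saved as `task-definition.json` and contains the literal token `YOUR_ACCOUNT`; the function name `fill_account_id` is hypothetical, and region values still need to be set separately.

```shell
# Hypothetical helper: prints a task definition with YOUR_ACCOUNT replaced
# by the caller's AWS account ID. Assumes the file contains that literal token.
fill_account_id() (
  file="${1:-task-definition.json}"
  # Look up the account ID of the credentials currently in use.
  account_id="$(aws sts get-caller-identity --query Account --output text)" || exit 1
  sed "s/YOUR_ACCOUNT/${account_id}/g" "$file"
)
```

The function writes to standard output, so the original file is left untouched and the result can be redirected to a new file for review.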
Deploy and run the task
Deploy your task definition to your ECS cluster:
Register the task definition:

    aws ecs register-task-definition --cli-input-json file://task-definition.json

Create a service with daemon scheduling strategy:

    aws ecs create-service \
      --cluster your-cluster-name \
      --service-name otel-monitoring-service \
      --task-definition otel-ecs-ec2-sidecar-metrics:1 \
      --scheduling-strategy DAEMON \
      --launch-type EC2

Tip
DAEMON scheduling strategy ensures one monitoring task runs on every EC2 instance in your cluster, providing comprehensive infrastructure monitoring coverage.
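
After the service is created, one way to confirm that a task is running on every instance is to compare the service's desired and running task counts. The following is a minimal sketch: the function name `check_daemon_service` is hypothetical, and the defaults assume the cluster and service names from the `create-service` command above.

```shell
# Hypothetical helper: reports whether the daemon service has as many running
# tasks as desired. Cluster and service names default to those used above.
check_daemon_service() (
  cluster="${1:-your-cluster-name}"
  service="${2:-otel-monitoring-service}"
  counts="$(aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text)" || exit 1
  # Text output is "<desired><TAB><running>"; split it into $1 and $2.
  set -- $counts
  echo "desired=$1 running=$2"
  # Succeed only when a count was returned and both counts match.
  [ -n "$1" ] && [ "$1" = "$2" ]
)
```

A non-zero exit status means tasks are still starting or the service was not found; rerunning the check after a short wait is usually enough to tell the two apart.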
Verify data collection
Check that your data is flowing to New Relic:
Check OpenTelemetry Collector status: Review container logs to confirm the collector is running without errors and successfully connecting to New Relic:

    aws logs get-log-events \
      --log-group-name "/ecs/otel-collector-ec2" \
      --log-stream-name "otel/otel-collector/TASK_ID"

Verify data in New Relic UI: Navigate to one.newrelic.com > All Capabilities > Infrastructure to confirm your ECS hosts and containers appear with metrics. For detailed guidance on exploring your data, see Find and query your ECS monitoring data.
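
The log check needs a concrete stream name in place of `TASK_ID`. As a minimal sketch, assuming the collector writes to the `/ecs/otel-collector-ec2` log group used in that command, the most recently active stream can be looked up with `describe-log-streams`. The function name `latest_collector_stream` is hypothetical.

```shell
# Hypothetical helper: prints the name of the most recently active log stream
# in the collector's log group, so TASK_ID does not need to be looked up by hand.
latest_collector_stream() {
  aws logs describe-log-streams \
    --log-group-name "${1:-/ecs/otel-collector-ec2}" \
    --order-by LastEventTime \
    --descending \
    --limit 1 \
    --query 'logStreams[0].logStreamName' \
    --output text
}
```

The printed name can be passed directly as the `--log-stream-name` value of `aws logs get-log-events`.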
Next steps
After setting up monitoring, you can:
- Create custom dashboards for your ECS metrics
- Set up alerts for container and host-level issues
- Correlate ECS metrics with application traces and logs