New Relic integrates with Amazon Web Services (AWS) for reporting your Amazon SageMaker metrics and other data to New Relic.
This document explains how to activate the integration, and describes the data reported.
Features
Collect and send telemetry data to New Relic from your Amazon SageMaker services using our integration. Monitor your services, query incoming data, and build dashboards to observe everything at a glance.
Activate integration
This integration is available through CloudWatch Metric Streams.
To enable this integration, see how to connect AWS services to New Relic via CloudWatch Metric Streams.
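You might create the metric stream yourself and want it to carry only SageMaker metrics. The sketch below builds the request for the CloudWatch `PutMetricStream` API and limits it to the three CloudWatch namespaces that correspond to the tables on this page. It assumes the Kinesis Data Firehose delivery stream and IAM role that forward data to New Relic already exist. The stream name and ARNs are hypothetical placeholders. The output format is set to OpenTelemetry 0.7, the format New Relic's metric stream instructions have asked for; confirm it against the current setup guide before relying on it.

```python
"""Sketch: a CloudWatch metric stream limited to SageMaker namespaces.

Assumptions (hypothetical, replace with your own values):
- a Firehose delivery stream and IAM role already forward data to New Relic;
- the names and ARNs used below are placeholders, not real resources.
"""

# CloudWatch namespaces that hold the SageMaker metrics described on this page.
SAGEMAKER_NAMESPACES = [
    "AWS/SageMaker",                # endpoint invocation metrics
    "/aws/sagemaker/Endpoints",     # endpoint instance metrics
    "/aws/sagemaker/TrainingJobs",  # training job instance metrics
]


def build_metric_stream_request(name, firehose_arn, role_arn):
    """Return keyword arguments for CloudWatch PutMetricStream."""
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,
        "RoleArn": role_arn,
        "OutputFormat": "opentelemetry0.7",  # assumed New Relic format; verify
        "IncludeFilters": [{"Namespace": ns} for ns in SAGEMAKER_NAMESPACES],
    }


def create_metric_stream(request):
    """Send the request to AWS; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("cloudwatch")
    return client.put_metric_stream(**request)


if __name__ == "__main__":
    req = build_metric_stream_request(
        name="newrelic-sagemaker-stream",  # hypothetical stream name
        firehose_arn="arn:aws:firehose:us-east-1:123456789012:deliverystream/example",
        role_arn="arn:aws:iam::123456789012:role/example-metric-stream-role",
    )
    print(req)  # inspect before calling create_metric_stream(req)
```

Building the request separately from sending it lets you review the filters before anything changes in your AWS account.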
Find and use data
To find your integration's metrics, go to one.newrelic.com > Metrics and events and filter by aws.sagemaker.
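You can run the same filter programmatically. The minimal sketch below writes an NRQL query that lists every metric name starting with `aws.sagemaker.` and wraps it in a NerdGraph (New Relic GraphQL API) request. The account ID and user API key are placeholders you supply. The query shape reflects our understanding of the NerdGraph `nrql` field and should be checked against the NerdGraph explorer.

```python
import json
import urllib.request

# NerdGraph endpoint for US-region accounts (EU accounts use api.eu.newrelic.com).
NERDGRAPH_URL = "https://api.newrelic.com/graphql"


def sagemaker_metric_names_nrql(since="1 hour ago"):
    """NRQL that lists the distinct metric names reported under aws.sagemaker."""
    return (
        "SELECT uniques(metricName) FROM Metric "
        f"WHERE metricName LIKE 'aws.sagemaker.%' SINCE {since}"
    )


def build_nerdgraph_request(account_id, nrql, api_key):
    """Wrap an NRQL query in a NerdGraph POST request (not sent yet)."""
    query = (
        "query($accountId: Int!, $nrql: Nrql!) { actor { account(id: $accountId) "
        "{ nrql(query: $nrql) { results } } } }"
    )
    body = json.dumps({"query": query, "variables": {"accountId": account_id, "nrql": nrql}})
    return urllib.request.Request(
        NERDGRAPH_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "API-Key": api_key},
        method="POST",
    )


def run_query(request):
    """Send the request and return the rows of the NRQL result."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["data"]["actor"]["account"]["nrql"]["results"]
```

For example, `run_query(build_nerdgraph_request(YOUR_ACCOUNT_ID, sagemaker_metric_names_nrql(), YOUR_USER_KEY))` returns the metric names the integration has reported in the last hour. `YOUR_ACCOUNT_ID` and `YOUR_USER_KEY` are placeholders.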
Metric data
This New Relic infrastructure integration collects the following Amazon SageMaker data:
SageMaker Metric data
| Metric (min, max, average, count, sum) | Unit | Description |
|---|---|---|
| Invocations | Count | The number of InvokeEndpoint requests sent to a model endpoint. |
| InvocationsPerInstance | Count | The number of invocations sent to a model, normalized by InstanceCount in each ProductionVariant. |
| OverheadLatency | Microseconds | The time that SageMaker overhead adds to the time taken to respond to a client request. |
| ModelLatency | Microseconds | The time taken by a model to respond to a SageMaker API request. |
| Invocation4XXErrors | Count | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. |
| Invocation5XXErrors | Count | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. |
| InvocationModelErrors | Count | The number of model invocation requests that did not result in a 2xx HTTP response. |
All data imported from SageMaker has one dimension: EndpointName.
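InvocationsPerInstance divides invocation counts by the number of instances behind each production variant. That makes variants of different sizes comparable. The sketch below is a simplified arithmetic reading of that description, not SageMaker's internal computation. The variant names and counts are hypothetical.

```python
def invocations_per_instance(invocations_by_variant, instance_count_by_variant):
    """Normalize invocation counts by each production variant's instance count."""
    result = {}
    for variant, invocations in invocations_by_variant.items():
        instances = instance_count_by_variant[variant]
        if instances <= 0:
            raise ValueError(f"variant {variant!r} has no instances")
        result[variant] = invocations / instances
    return result


if __name__ == "__main__":
    # Hypothetical one-minute counts for an endpoint with two production variants.
    counts = {"variant-a": 1200, "variant-b": 300}
    instances = {"variant-a": 4, "variant-b": 1}
    print(invocations_per_instance(counts, instances))
    # {'variant-a': 300.0, 'variant-b': 300.0}
```

The raw Invocations counts differ fourfold here, yet each instance carries the same load. The normalized metric makes that visible.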
SageMaker Endpoints Metric data
| Metric (min, max, average, count, sum) | Unit | Description |
|---|---|---|
| MemoryUtilization | Percent | The percentage of memory used by the containers on an instance. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. |
| DiskUtilization | Percent | The percentage of disk space used by the containers on an instance. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. |
| CPUUtilization | Percent | The sum of each individual CPU core's utilization. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. |
| GPUMemoryUtilization | Percent | The percentage of GPU memory used by the containers on an instance. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. |
| GPUUtilization | Percent | The percentage of GPU units used by the containers on an instance. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. |
All data imported from SageMaker Endpoints has one dimension: Host.
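CPUUtilization is a sum over cores, and each core ranges from 0 to 100%. On a multi-core instance the metric can therefore exceed 100%: a four-core instance ranges from 0% to 400%. Endpoint-variant values likewise add the primary and supplementary containers together. The sketch below restates both rules as arithmetic. The readings are hypothetical.

```python
def cpu_utilization(per_core_percent):
    """CPUUtilization as described above: the sum of every core's utilization."""
    return sum(per_core_percent)


def cpu_utilization_ceiling(core_count):
    """Each core contributes at most 100%, so the ceiling is 100 * cores."""
    return 100 * core_count


def variant_utilization(primary_percent, supplementary_percents):
    """Endpoint-variant value: primary container plus all supplementary containers."""
    return primary_percent + sum(supplementary_percents)


if __name__ == "__main__":
    # Hypothetical readings from a 4-core instance.
    cores = [90, 75, 60, 100]
    print(cpu_utilization(cores), "of", cpu_utilization_ceiling(len(cores)))  # 325 of 400
    # Hypothetical memory readings for one primary and two supplementary containers.
    print(variant_utilization(40, [10, 5]))  # 55
```

When you set thresholds on CPUUtilization, scale them by the instance's core count rather than assuming a 100% ceiling.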
SageMaker Training Jobs Metric data
| Metric (min, max, average, count, sum) | Unit | Description |
|---|---|---|
| MemoryUtilization | Percent | The percentage of memory used by the containers on an instance. For training jobs, the value is the memory utilization of the algorithm container on the instance. |
| DiskUtilization | Percent | The percentage of disk space used by the containers on an instance. For training jobs, the value is the disk space utilization of the algorithm container on the instance. |
| CPUUtilization | Percent | The sum of each individual CPU core's utilization. For training jobs, the value is the CPU utilization of the algorithm container on the instance. |
| | Count | The number of training errors for a training job. |
All data imported from SageMaker Training Jobs has one dimension: Host.
Create alerts
You can set up alerts to notify you of changes in your SageMaker data. For example, you can set up an alert to notify relevant parties of critical or fatal errors.
Learn more about creating alerts here.
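A common alert shape fires only when a value stays above a threshold for several consecutive minutes. An example is a sustained run of Invocation5XXErrors. The sketch below is a simplified model of such a static-threshold condition, written to illustrate the idea. It is not New Relic's alert evaluation engine, and the per-minute counts, threshold, and window are hypothetical.

```python
from collections import deque


def sustained_breaches(values, threshold, window):
    """Return indices where the last `window` values all exceed `threshold`."""
    recent = deque(maxlen=window)  # sliding window of the most recent values
    hits = []
    for i, value in enumerate(values):
        recent.append(value)
        if len(recent) == window and all(v > threshold for v in recent):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Hypothetical per-minute Invocation5XXErrors counts for one endpoint.
    errors_per_minute = [0, 2, 6, 7, 9, 1, 8]
    print(sustained_breaches(errors_per_minute, threshold=5, window=3))  # [4]
```

Only minute 4 fires, because it is the only point where three consecutive minutes (6, 7, 9) all exceed 5. The isolated spike of 8 at the end does not fire. Requiring a sustained breach trades a little detection delay for fewer noisy notifications.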