Azure Machine Learning through Azure Monitor integration

New Relic offers an integration that reports your Microsoft Azure Machine Learning metrics and other data to New Relic. This document explains how to activate the integration and describes the data it reports.

Features

New Relic gathers metrics data from Azure Monitor for the Azure Machine Learning service. Azure Machine Learning is a cloud service for accelerating and managing the machine learning project lifecycle. Machine learning professionals, data scientists, and engineers can use it in their day-to-day workflows to train and deploy models or manage MLOps.

Using New Relic, you can:

View Azure Machine Learning metrics in pre-built dashboards.
Run custom queries and visualize the data.
Create alert conditions to notify you of changes in data.

Activate integration

Follow the standard Azure Monitor integration procedure to activate your Azure service in New Relic infrastructure monitoring.

Configuration and polling

You can change the polling frequency and filter data using configuration options.

New Relic queries your Azure Machine Learning service through the Azure Monitor integration according to a default polling interval.

Find and use data

To explore your integration data, go to one.newrelic.com/infra > Azure > (select an integration).
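
Once the data is flowing, you can also chart it with a custom NRQL query. The sketch below only assembles a query string; it does not call any New Relic API. The `build_nrql` helper and the `METRIC_PREFIX` value are hypothetical and used for illustration: check the data explorer for the exact metric names your account reports before running the resulting query.

```python
# Minimal sketch: assemble an NRQL query string for one metric.
# Assumption: the integration's data is queryable from the `Metric` data type.
# METRIC_PREFIX is a hypothetical placeholder, not a confirmed naming scheme.

ALLOWED_AGGREGATES = {"average", "max", "min", "sum"}

METRIC_PREFIX = "azure.machinelearningservices.workspaces."  # hypothetical


def build_nrql(metric_name: str, aggregate: str = "average",
               since: str = "1 hour ago") -> str:
    """Return an NRQL query that charts one metric over time."""
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {aggregate}")
    return (
        f"SELECT {aggregate}({metric_name}) "
        f"FROM Metric SINCE {since} TIMESERIES"
    )


if __name__ == "__main__":
    # Build a query for the ActiveCores metric listed in the tables below.
    print(build_nrql(METRIC_PREFIX + "ActiveCores"))
```

Paste the printed query into the query builder, adjusting the metric name to match what your account actually reports.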

Metric data

This integration collects the following metric data:

Azure Machine Learning metrics

Workspaces

The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces resource type.

Metric	Description
`ActiveCores`	Number of active cores
`ActiveNodes`	Number of active nodes. These are the nodes which are actively running a job.
`CancelRequestedRuns`	Number of runs where cancel was requested for this workspace.
`CancelledRuns`	Number of runs cancelled for this workspace.
`CompletedRuns`	Number of runs completed successfully for this workspace.
`CpuCapacityMillicores`	Maximum capacity of a CPU node in millicores.
`CpuMemoryCapacityMegabytes`	Maximum memory capacity of a CPU node in megabytes.
`CpuMemoryUtilizationMegabytes`	Memory utilization of a CPU node in megabytes.
`CpuMemoryUtilizationPercentage`	Memory utilization percentage of a CPU node.
`CpuUtilization`	Percentage of utilization on a CPU node
`CpuUtilizationMillicores`	Utilization of a CPU node in millicores
`CpuUtilizationPercentage`	Utilization percentage of a CPU node.
`DiskAvailMegabytes`	Available disk space in megabytes.
`DiskReadMegabytes`	Data read from disk in megabytes
`DiskUsedMegabytes`	Used disk space in megabytes
`DiskWriteMegabytes`	Data written to disk in megabytes
`Errors`	Number of run errors in this workspace
`FailedRuns`	Number of runs failed for this workspace
`FinalizingRuns`	Number of runs in the finalizing state for this workspace
`GpuCapacityMilliGPUs`	Maximum capacity of a GPU device in milli-GPUs
`GpuEnergyJoules`	Interval energy in Joules on a GPU node
`GpuMemoryCapacityMegabytes`	Maximum memory capacity of a GPU device in megabytes.
`GpuMemoryUtilization`	Percentage of memory utilization on a GPU node.
`GpuMemoryUtilizationMegabytes`	Memory utilization of a GPU device in megabytes
`GpuMemoryUtilizationPercentage`	Memory utilization percentage of a GPU device
`GpuUtilization`	Percentage of utilization on a GPU node
`GpuUtilizationMilliGPUs`	Utilization of a GPU device in milli-GPUs
`GpuUtilizationPercentage`	Utilization percentage of a GPU device
`IBReceiveMegabytes`	Network data received over InfiniBand in megabytes
`IBTransmitMegabytes`	Network data sent over InfiniBand in megabytes
`IdleCores`	Number of idle cores
`IdleNodes`	Number of idle nodes
`LeavingCores`	Number of leaving cores
`LeavingNodes`	Number of leaving nodes
`ModelDeployFailed`	Number of model deployments that failed in this workspace
`ModelDeployStarted`	Number of model deployments started in this workspace
`ModelDeploySucceeded`	Number of model deployments that succeeded in this workspace
`ModelRegisterFailed`	Number of model registrations that failed in this workspace
`ModelRegisterSucceeded`	Number of model registrations that succeeded in this workspace
`NetworkInputMegabytes`	Network data received in megabytes. Metrics are aggregated in one-minute intervals.
`NetworkOutputMegabytes`	Network data sent in megabytes. Metrics are aggregated in one-minute intervals.
`Not Responding Runs`	Number of runs not responding for this workspace.
`NotStartedRuns`	Number of runs in Not Started state for this workspace
`PreemptedCores`	Number of preempted cores
`PreemptedNodes`	Number of preempted nodes
`PreparingRuns`	Number of runs that are preparing for this workspace.
`Provisioning Runs`	Number of runs that are provisioning for this workspace.
`Queued Runs`	Number of runs that are queued for this workspace
`QuotaUtilizationPercentage`	Percent of quota utilized
`Started Runs`	Number of runs running for this workspace
`Starting Runs`	Number of runs started for this workspace
`StorageAPIFailureCount`	Azure Blob Storage API calls failure count.
`StorageAPISuccessCount`	Azure Blob Storage API calls success count.
`TotalCores`	Number of total cores
`TotalNodes`	Number of total nodes
`UnusableCores`	Number of unusable cores
`UnusableNodes`	Number of unusable nodes
`Warnings`	Number of run warnings in this workspace

The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments resource type.

Metric	Description
`CpuMemoryUtilizationPercentage`	Percentage of memory utilization on an instance
`CpuUtilizationPercentage`	Percentage of CPU utilization on an instance
`DataCollectionErrorsPerMinute`	The number of data collection events dropped per minute
`DataCollectionEventsPerMinute`	The number of data collection events processed per minute.
`DeploymentCapacity`	The number of instances in the deployment
`DiskUtilization`	Percentage of disk utilization on an instance
`GpuEnergyJoules`	Interval energy in Joules on a GPU node
`GpuMemoryUtilizationPercentage`	Percentage of GPU memory utilization on an instance
`GpuUtilizationPercentage`	Percentage of GPU utilization on an instance.
`RequestLatency_P50`	The average P50 request latency
`RequestLatency_P90`	The average P90 request latency
`RequestLatency_P95`	The average P95 request latency
`RequestLatency_P99`	The average P99 request latency
`RequestsPerMinute`	The number of requests sent to the online deployment within a minute

The following table lists the metrics available for the Microsoft.MachineLearningServices/workspaces/onlineEndpoints resource type.

Metric	Description
`ConnectionsActive`	The total number of concurrent TCP connections active from clients
`DataCollectionErrorsPerMinute`	The number of data collection events dropped per minute
`DataCollectionEventsPerMinute`	The number of data collection events processed per minute
`NetworkBytes`	The bytes per second served for the endpoint
`NewConnectionsPerSecond`	The average number of new TCP connections per second established from clients
`RequestLatency`	The average total time taken to respond to a request, in milliseconds
`RequestLatency_P50`	The average P50 request latency aggregated by all request latency values collected over the selected time period
`RequestLatency_P90`	The average P90 request latency aggregated by all request latency values collected over the selected time period
`RequestLatency_P95`	The average P95 request latency aggregated by all request latency values collected over the selected time period
`RequestLatency_P99`	The average P99 request latency aggregated by all request latency values collected over the selected time period
`RequestsPerMinute`	The number of requests sent to the online endpoint within a minute
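
To make the `RequestLatency_P50` through `RequestLatency_P99` metrics concrete, here is a minimal sketch of how a latency percentile summarizes a set of request latencies. It uses the nearest-rank method as an assumption for illustration; the exact aggregation Azure Monitor applies is not specified here. The sample latencies and the `nearest_rank_percentile` helper are made up for the example.

```python
import math


def nearest_rank_percentile(values, p):
    """Return the p-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Smallest rank such that at least p percent of samples are at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up request latencies in milliseconds, with one slow outlier.
    latencies_ms = [12, 15, 11, 240, 18, 14, 13, 17, 16, 19]
    for p in (50, 90, 95, 99):
        print(f"P{p}: {nearest_rank_percentile(latencies_ms, p)} ms")
```

With a single slow request in the sample, the P50 value stays close to the typical latency while the higher percentiles surface the outlier, which is why the upper percentiles are useful for alert conditions on tail latency.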