
NVIDIA GPU integration

Our NVIDIA GPU integration helps you monitor the status of your GPUs. It uses our infrastructure agent together with the Flex integration, which wraps NVIDIA's SMI command-line utility. The integration includes a pre-built dashboard with key GPU metrics: GPU utilization, ECC error counts, active compute processes, clock and performance states, temperature, fan speed, and dynamic and static information about each supported device.

After you set up our NVIDIA GPU integration, we give you a dashboard for your GPU metrics.

Install the infrastructure agent

To get data into New Relic, install our infrastructure agent. The infrastructure agent collects and ingests data so you can keep track of your GPUs' performance.

You can install the infrastructure agent in two different ways:

Configure Flex integration for NVIDIA GPUs

Flex comes bundled with the New Relic infrastructure agent and can be integrated with nvidia-smi, a command-line utility for monitoring NVIDIA GPU devices.

Important

nvidia-smi ships pre-installed with the NVIDIA GPU display drivers on Linux and Windows Server.
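
Before you configure Flex, you can run a quick check that nvidia-smi is available on the host and that the driver responds. This is only a sanity check; it queries two fields, while the full Flex configuration below queries many more.

# Quick check that nvidia-smi is on the PATH and the driver responds
nvidia-smi --query-gpu=name,driver_version --format=csv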

To configure Flex, follow these steps:

  1. Create a file named nvidia-smi-gpu-monitoring.yml in this path:

     /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml

  2. Update the nvidia-smi-gpu-monitoring.yml file with these configurations:

    • EVENT_TYPE: You can think of EVENT_TYPE as a New Relic database table that you can query using NRQL.
    • COMMAND: The command used to print metrics on the terminal.

    Once your configuration file is updated, it will look like this:

---
integrations:
  - name: nri-flex
    # interval: 30s
    config:
      name: NvidiaSMI
      variable_store:
        metrics:
          "name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
          pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
          pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
          persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
          driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
          gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
          clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
          clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
          clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
          clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
          utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
          encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
          ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
          ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
          ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
          ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
          ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
          ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
          ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
          ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
          ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
          ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
          ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
          ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
          retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
          power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
          clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
          clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
          mig.mode.current,mig.mode.pending"
      apis:
        - name: NvidiaGpu
          commands:
            - run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
              output: csv
          rename_keys:
            " ": ""
            "\\[MiB\\]": ".MiB"
            "\\[%\\]": ".percent"
            "\\[W\\]": ".watts"
            "\\[MHz\\]": ".MHz"
          value_parser:
            "clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
            '.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'
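
If you want to see the raw CSV output that Flex will parse, you can run the same kind of command manually on the host. This is a spot check only; it uses a shortened field list instead of the full ${var:metrics} list from the configuration above.

# Spot-check the CSV output that the Flex integration will parse
# (shortened field list; the config above queries the full set via ${var:metrics})
nvidia-smi --query-gpu=name,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv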

Confirm GPU metrics are being ingested

The Flex configuration is automatically detected and executed by the infrastructure agent; there's no need to restart the agent. You can confirm metrics are being ingested by running this NRQL query:

SELECT * FROM NvidiaSMI
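
If the query returns no data after a couple of minutes, you can check the infrastructure agent's logs for Flex errors. This is a minimal sketch that assumes a systemd-based Linux host and the default newrelic-infra service name.

# Look for recent Flex-related log lines from the infrastructure agent
# (assumes systemd and the default newrelic-infra service name)
sudo journalctl -u newrelic-infra --since "10 minutes ago" | grep -i flex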

Monitor your application

You can use our pre-built dashboard template to monitor your GPU metrics. Follow these steps:

  1. Go to one.newrelic.com and click on Dashboards.
  2. Click on the Import dashboard tab.
  3. Copy the file content (.json) from the NVIDIA GPU dashboard.
  4. Select the target account where the dashboard needs to be imported.
  5. Click on Import dashboard to confirm the action.

Your NVIDIA GPU Monitoring dashboard is considered a custom dashboard and can be found in the Dashboards UI. For docs on using and editing dashboards, see our dashboard docs.

Here is an NRQL query to view all the available telemetry:

SELECT * FROM NvidiaSMI

What's next?

You can adapt the Flex configuration to include or exclude information available from the NVIDIA SMI utility.
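
To see every field that nvidia-smi can report, and the exact names you can add to or remove from the metrics list in nvidia-smi-gpu-monitoring.yml, list the supported query properties on the host:

# List all GPU properties supported by --query-gpu on this driver version
nvidia-smi --help-query-gpu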

To learn more about building NRQL queries and generating dashboards, check out these docs:

  • Introduction to the query builder to create basic and advanced queries.
  • Introduction to dashboards to customize your dashboard and carry out different actions.
  • Manage your dashboard to adjust your display mode, or to add more content to your dashboard.