
NVIDIA GPU integration

Our NVIDIA GPU integration helps you monitor the status of your GPUs. It uses our infrastructure agent together with the Flex integration, which wraps NVIDIA's SMI command-line utility. The integration includes a pre-built dashboard with key GPU metrics: GPU utilization, ECC error counts, active compute processes, clock and performance states, temperature, fan speed, and dynamic and static information about each supported device.

After you set up our NVIDIA GPU integration, we give you a dashboard for your GPU metrics.

Install the infrastructure agent

To get data into New Relic, install our infrastructure agent. The infrastructure agent collects and ingests data so you can keep track of your GPUs' performance.

You can install the infrastructure agent in two different ways:

Configure Flex integration for NVIDIA GPUs

Flex comes bundled with the New Relic infrastructure agent and can be integrated with nvidia-smi, a command-line utility for monitoring NVIDIA GPU devices.

Important

nvidia-smi ships pre-installed with the NVIDIA GPU display drivers on Linux and Windows Server.
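
Before you configure Flex, you can run a quick check that nvidia-smi is available on the host and that the driver responds. This is only a sanity check; it queries two fields, while the full Flex configuration below queries many more.

# Quick check that nvidia-smi is on the PATH and the driver responds
nvidia-smi --query-gpu=name,driver_version --format=csv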

To configure Flex, follow these steps:

  1. Create a file named nvidia-smi-gpu-monitoring.yml in this path:

     /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml

  2. Update the nvidia-smi-gpu-monitoring.yml file with these configurations:

    • EVENT_TYPE: You can think of EVENT_TYPE as a New Relic database table that you can query using NRQL.
    • COMMAND: The command used to print metrics on the terminal.

    Once your configuration file is updated, it will look like this:

---
integrations:
  - name: nri-flex
    # interval: 30s
    config:
      name: NvidiaSMI
      variable_store:
        metrics:
          "name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
          pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
          pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
          persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
          driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
          gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
          clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
          clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
          clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
          clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
          utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
          encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
          ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
          ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
          ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
          ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
          ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
          ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
          ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
          ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
          ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
          ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
          ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
          ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
          retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
          power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
          clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
          clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
          mig.mode.current,mig.mode.pending"
      apis:
        - name: NvidiaGpu
          commands:
            - run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
              output: csv
          rename_keys:
            " ": ""
            "\\[MiB\\]": ".MiB"
            "\\[%\\]": ".percent"
            "\\[W\\]": ".watts"
            "\\[MHz\\]": ".MHz"
          value_parser:
            "clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
            '.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'
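
If you want to see the raw CSV output that Flex will parse, you can run the same kind of command manually on the host. This is a spot check only; it uses a shortened field list instead of the full ${var:metrics} list from the configuration above.

# Spot-check the CSV output that the Flex integration will parse
# (shortened field list; the config above queries the full set via ${var:metrics})
nvidia-smi --query-gpu=name,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv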

Confirm GPU metrics are being ingested

The Flex configuration is automatically detected and executed by the infrastructure agent; there's no need to restart the agent. You can confirm metrics are being ingested by running this NRQL query:

SELECT * FROM NvidiaSMI
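
If the query returns no data after a couple of minutes, you can check the infrastructure agent's logs for Flex errors. This is a minimal sketch that assumes a systemd-based Linux host and the default newrelic-infra service name.

# Look for recent Flex-related log lines from the infrastructure agent
# (assumes systemd and the default newrelic-infra service name)
sudo journalctl -u newrelic-infra --since "10 minutes ago" | grep -i flex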

Monitor your application

You can use our pre-built dashboard template to monitor your GPU metrics. Follow these steps:

  1. Go to one.newrelic.com and click on Dashboards.
  2. Click on the Import dashboard tab.
  3. Copy the file content (.json) from the NVIDIA GPU dashboard.
  4. Select the target account where the dashboard needs to be imported.
  5. Click on Import dashboard to confirm the action.

Your NVIDIA GPU Monitoring dashboard is considered a custom dashboard and can be found in the Dashboards UI. For docs on using and editing dashboards, see our dashboard docs.

Here is an NRQL query to view all the available telemetry:

SELECT * FROM NvidiaSMI

What's next?

You can adapt the Flex configuration to include or exclude information available from the NVIDIA SMI utility.
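
To see every field that nvidia-smi can report, and the exact names you can add to or remove from the metrics list in nvidia-smi-gpu-monitoring.yml, list the supported query properties on the host:

# List all GPU properties supported by --query-gpu on this driver version
nvidia-smi --help-query-gpu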

To learn more about building NRQL queries and generating dashboards, check out these docs:

  • Introduction to the query builder to create basic and advanced queries.
  • Introduction to dashboards to customize your dashboard and carry out different actions.
  • Manage your dashboard to adjust your display mode, or to add more content to your dashboard.