Our NVIDIA DCGM integration helps you monitor the status of your GPUs. It uses our infrastructure agent and our Prometheus remote write integration, which works with NVIDIA's SMI utility. It gives you a pre-built dashboard with key DCGM metrics, including GPU utilization, XID error counts, clock and performance states, temperature, and power usage.
After you set up our NVIDIA DCGM integration, we give you a dashboard for your DCGM metrics.
To get data into New Relic, install our infrastructure agent. Our infrastructure agent collects and ingests data so you can keep track of your DCGM performance.
You can install the infrastructure agent two different ways:
- Our guided install is a CLI tool that inspects your system and installs the infrastructure agent alongside the application monitoring agent that best works for your system. To learn more about how our guided install works, check out our Guided install overview.
- If you'd rather install our infrastructure agent manually, you can follow a tutorial for manual installation on Linux or Windows.
- In your terminal, clone the dcgm-exporter repository:
$ git clone https://github.com/NVIDIA/dcgm-exporter
- In the cloned repository, navigate to the
- Install necessary binaries:
$ sudo make install
- Start the
- See the details of your DCGM metrics:
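One way to inspect the exporter's output from the command line is sketched below. It assumes dcgm-exporter is listening on localhost:9400 (the address used as the Prometheus scrape target in this guide) and serves Prometheus-format text at /metrics. The helper name filter_dcgm_metric is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Print the sample lines for one metric from Prometheus exposition text on stdin.
# Comment lines (# HELP / # TYPE) are skipped, and the metric name must be
# followed by "{" or a space so that longer names sharing the prefix do not match.
filter_dcgm_metric() {
  metric="$1"
  grep -v '^#' | grep "^${metric}[{ ]"
}

# Example (assumes dcgm-exporter is reachable on localhost:9400):
#   curl -s http://localhost:9400/metrics | filter_dcgm_metric DCGM_FI_DEV_GPU_TEMP
```

If the command prints nothing, the exporter may not be running or may not expose that metric on your hardware.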
Prometheus is an open-source monitoring and alerting tool that can be used to monitor NVIDIA GPUs using the NVIDIA-DCGM exporter. To configure Prometheus to monitor DCGM metrics, follow these steps:
- Visit the Prometheus download page to find the latest release.
- Select the appropriate version for your operating system and architecture. For Linux, you'll likely choose the linux-amd64 version. Copy the download link for the tarball (
- Once Prometheus is downloaded, extract the tarball:
$ tar -xvzf <filename.tar.gz>
- Navigate to the downloaded Prometheus folder:
- Open your prometheus.yml file and add the following lines:
scrape_configs:
  - job_name: NVIDIA
    static_configs:
      - targets: ['localhost:9400']
- Start Prometheus:
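The configuration and start steps can also be scripted. The sketch below is one possible approach, not the only one: it writes a minimal, standalone config file that scrapes the exporter, so the default prometheus.yml is left untouched. The function name write_dcgm_scrape_config and the file name prometheus-dcgm.yml are hypothetical, and the job name is only a label.

```shell
#!/bin/sh
# Write a minimal Prometheus config that scrapes dcgm-exporter.
# $1: output file
# $2: exporter address (defaults to localhost:9400, the target used in this guide)
write_dcgm_scrape_config() {
  out_file="$1"
  target="${2:-localhost:9400}"
  cat > "$out_file" <<EOF
scrape_configs:
  - job_name: NVIDIA
    static_configs:
      - targets: ['${target}']
EOF
}

# Example, run from the extracted Prometheus folder:
#   write_dcgm_scrape_config prometheus-dcgm.yml
#   ./promtool check config prometheus-dcgm.yml      # validate the file
#   ./prometheus --config.file=prometheus-dcgm.yml   # start Prometheus with it
```

Validating with promtool before starting Prometheus catches indentation mistakes, which are easy to make in YAML.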
Once Prometheus is configured, make sure the NVIDIA DCGM metrics are reaching Prometheus. Then, to forward those metrics to New Relic, use our Prometheus remote write integration by following the Prometheus remote write setup launcher in the UI.
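Before setting up remote write, it can help to confirm that Prometheus is actually receiving DCGM metrics. The following is a minimal sketch, assuming Prometheus runs locally on its default port 9090 and that you query it through its HTTP API. The helper name prometheus_query_url is hypothetical.

```shell
#!/bin/sh
# Build the URL for an instant query against the Prometheus HTTP API.
# $1: metric name
# $2: Prometheus address (defaults to localhost:9090)
prometheus_query_url() {
  metric="$1"
  server="${2:-localhost:9090}"
  printf 'http://%s/api/v1/query?query=%s\n' "$server" "$metric"
}

# Example (assumes Prometheus is running and scraping the exporter):
#   curl -s "$(prometheus_query_url DCGM_FI_DEV_GPU_TEMP)"
```

A JSON response whose "result" array is not empty suggests that the exporter is being scraped. An empty array usually means the scrape target is down or the metric name is not exposed.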
Before you can start reading your data, use the instructions in our infrastructure agent docs to restart your infrastructure agent.
$ sudo systemctl restart newrelic-infra.service
Once you've completed the setup above, you can view your metrics using our pre-built dashboard template named nvidia-dcgm. To access this dashboard:
- Go to one.newrelic.com > + Add data.
- Click on the Dashboards tab.
- In the search box, type "nvidia-dcgm".
- Select it and click Install.
To install the nvidia-dcgm quickstart and see its metrics and alerts, you can also go to our NVIDIA DCGM quickstart page and click the “Install now” button.
Here are some example queries:
Example: view the latest GPU temperature reading over time
SELECT latest(DCGM_FI_DEV_GPU_TEMP) FROM Metric WHERE metricName LIKE 'DCGM_FI_DEV_GPU_TEMP' TIMESERIES
To learn more about building NRQL queries and generating dashboards, check out these docs: