NVIDIA GPU integration

The NVIDIA GPU integration lets you monitor the status of your GPUs. It uses the infrastructure agent together with the Flex integration to access NVIDIA's SMI utility.

After you set up the NVIDIA GPU integration, you get a dashboard for your GPU metrics.

Once installed, a pre-built dashboard shows key GPU metrics:

GPU utilization
ECC error counts
Active compute processes
Clock and performance states
Temperature and fan speed
Dynamic and static information about each supported device

Install the infrastructure agent

To capture data in New Relic, install the infrastructure agent. The agent collects and ingests the data that lets you track the performance of your GPUs.

You can install the infrastructure agent in two ways:

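Guided install is a CLI tool that inspects your system and installs the infrastructure agent alongside the application monitoring agent that best fits your system. To learn more about how guided install works, see the guided install overview.
If you prefer to install the infrastructure agent manually, follow the manual installation tutorial for Linux or Windows.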

Configure the Flex integration for NVIDIA GPUs

Flex comes bundled with the New Relic infrastructure agent and can integrate with NVIDIA SMI, a command-line utility for monitoring NVIDIA GPU devices.

Important

nvidia-smi ships pre-installed with the NVIDIA GPU display drivers on Linux and Windows Server.
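
Before configuring Flex, it can help to confirm that nvidia-smi is actually reachable on the host. The following is a minimal sketch, assuming Python 3 is available on the machine; the helper names find_nvidia_smi and list_gpus are hypothetical and not part of New Relic or NVIDIA tooling. It only relies on the nvidia-smi -L option, which lists the detected GPUs.

```python
import shutil
import subprocess


def find_nvidia_smi():
    """Return the full path to nvidia-smi if it is on PATH, otherwise None."""
    return shutil.which("nvidia-smi")


def list_gpus(executable):
    """Run `nvidia-smi -L` and return one line per detected GPU.

    Returns an empty list if the command exits with a non-zero status,
    for example when the driver is not loaded.
    """
    result = subprocess.run([executable, "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    path = find_nvidia_smi()
    if path is None:
        print("nvidia-smi not found on PATH; check the NVIDIA driver installation")
    else:
        print(f"Found nvidia-smi at {path}")
        for gpu in list_gpus(path):
            print(gpu)
```

If the utility lives outside PATH, the path passed to list_gpus is the same one you would put in the run command of the Flex configuration.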

To configure Flex, follow these steps:

Create a file named nvidia-smi-gpu-monitoring.yml in this path:

bash

$ sudo touch /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml

You can also download it from the git repository.

Update the nvidia-smi-gpu-monitoring.yml file with the integration configuration:

--- 
integrations:
  - name: nri-flex
    # interval: 30s
    config:
      name: NvidiaSMI
      variable_store:
        metrics: 
          "name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
          pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
          pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
          persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
          driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
          gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
          clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
          clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
          clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
          clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
          utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
          encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
          ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
          ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
          ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
          ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
          ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
          ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
          ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
          ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
          ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
          ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
          ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
          ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
          retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
          power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
          clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
          clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
          mig.mode.current,mig.mode.pending"
      apis:
        - name: NvidiaGpu
          commands:
            - run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
              output: csv
          rename_keys:
            " ": ""
            "\\[MiB\\]": ".MiB"
            "\\[%\\]": ".percent"
            "\\[W\\]": ".watts"
            "\\[MHz\\]": ".MHz"
          value_parser:
            "clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
            '.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'
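
To make the rename_keys and value_parser sections of the configuration above easier to reason about, here is a rough Python approximation of what they do to the CSV that nvidia-smi prints. This is a sketch, not Flex's actual implementation (Flex is not written in Python). It assumes that rename_keys applies each regex replacement to the column names, that value_parser keeps only the regex match for values whose key matches, and that a value is left unchanged when nothing matches. The SAMPLE text is an illustrative stand-in for real nvidia-smi output, not captured data.

```python
import csv
import io
import re

# Illustrative stand-in for `nvidia-smi --query-gpu=... --format=csv` output.
SAMPLE = (
    "name, memory.total [MiB], utilization.gpu [%], temperature.gpu, fan.speed [%]\n"
    "Tesla T4, 15109 MiB, 0 %, 41, [N/A]\n"
)

# Same patterns as in the Flex configuration above.
RENAME_KEYS = {
    " ": "",
    r"\[MiB\]": ".MiB",
    r"\[%\]": ".percent",
    r"\[W\]": ".watts",
    r"\[MHz\]": ".MHz",
}
VALUE_PARSER = {
    r"clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": r"\d*\.?\d+",
    r".": r"\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default",
}


def rename_key(key):
    """Apply every rename pattern to a column name."""
    for pattern, replacement in RENAME_KEYS.items():
        key = re.sub(pattern, replacement, key)
    return key


def parse_value(key, value):
    """Keep only the matched part of a value when its key matches a parser rule."""
    for key_pattern, value_pattern in VALUE_PARSER.items():
        if re.search(key_pattern, key):
            match = re.search(value_pattern, value)
            if match:  # assumption: unmatched values pass through unchanged
                value = match.group(0)
    return value


def parse(text):
    """Turn CSV text into one dict per GPU with renamed keys and parsed values."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header = [rename_key(column) for column in rows[0]]
    return [
        {key: parse_value(key, value.strip()) for key, value in zip(header, row)}
        for row in rows[1:]
    ]


if __name__ == "__main__":
    for sample in parse(SAMPLE):
        print(sample)
```

In this sketch the column memory.total [MiB] becomes memory.total.MiB and its value 15109 MiB is reduced to 15109, which is why the unit suffixes move from the values into the attribute names.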

Confirm that GPU metrics are being ingested

The infrastructure agent detects and runs the Flex configuration automatically, so you don't need to restart the agent. To confirm that metrics are being ingested, run the following NRQL query:

SELECT * FROM NvidiaGpuSample

Monitor your application

You can use the pre-built dashboard template to monitor your GPU metrics. Follow these steps:

Go to one.newrelic.com and click Dashboards.
Click the Import dashboard tab.
Copy the file contents (.json) from the NVIDIA GPU dashboard.
Select the target account where the dashboard should be imported.
Click Import dashboard to confirm the action.
The NVIDIA GPU Monitoring dashboard is treated as a custom dashboard and appears in the Dashboards UI. For documentation on using and editing dashboards, see the dashboard docs.
Here is a NRQL query that shows all available telemetry:

SELECT * FROM NvidiaGpuSample

What's next?

You can adjust the Flex configuration to include or exclude information available from the NVIDIA SMI utility.
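
One way to trim the configuration is to replace the long metrics value with a shorter field list. The sketch below shows how such a value could be generated; build_metrics_value and build_command are hypothetical helpers, not part of Flex or nvidia-smi, and the field names are a subset of those already listed in the configuration above.

```python
# Hypothetical helpers for illustration; not part of Flex or nvidia-smi.

# A reduced set of fields, all taken from the full configuration above.
FIELDS = [
    "name",
    "index",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


def build_metrics_value(fields):
    """Join field names into the comma-separated string used for the metrics variable.

    Surrounding whitespace is stripped and duplicates are dropped,
    keeping the first occurrence of each field.
    """
    unique = []
    for field in fields:
        field = field.strip()
        if field and field not in unique:
            unique.append(field)
    return ",".join(unique)


def build_command(fields):
    """Return the nvidia-smi command line that Flex would run for these fields."""
    return f"nvidia-smi --query-gpu={build_metrics_value(fields)} --format=csv"


if __name__ == "__main__":
    print(build_metrics_value(FIELDS))
    print(build_command(FIELDS))
```

The first printed line can be pasted as the metrics value under variable_store. Running the second line by hand is a quick way to check that your driver version accepts every field before you deploy the change.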

To learn more about building NRQL queries and generating dashboards, see these docs:

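Introduction to the query builder, for creating basic and advanced queries.
Introduction to dashboards, for customizing your dashboard and carrying out different actions.
Manage your dashboard, for adjusting the display mode or adding more content to your dashboard.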