NVIDIA GPU 통합

NVIDIA GPU 통합을 통해 GPU 상태를 모니터링할 수 있습니다. 이 통합에서는 NVIDIA의 SMI 유틸리티에 액세스할 수 있는 Flex 통합과 함께 인프라 에이전트를 사용합니다.

NVIDIA GPU 통합을 설정하면 GPU 지표에 대한 대시보드가 제공됩니다.

설치하면 중요한 GPU 지표가 포함된 사전 구축된 대시보드가 제공됩니다.

GPU 활용도
ECC 오류 수
활성 컴퓨팅 프로세스
시계 및 성능 상태
온도 및 팬 속도
지원되는 각 장치에 대한 동적 및 정적 정보

인프라 에이전트 설치

New Relic으로 데이터를 캡처하려면 인프라 에이전트를 설치하세요. 당사의 인프라 에이전트는 GPU 성능을 추적할 수 있도록 데이터를 수집하고 수집합니다.

두 가지 방법으로 인프라 에이전트를 설치할 수 있습니다.

가이드 설치 는 시스템을 검사하고 시스템에 가장 적합한 애플리케이션 모니터링 에이전트와 함께 인프라 에이전트를 설치하는 CLI 도구입니다. 가이드 설치 작동 방식에 대해 자세히 알아보려면 가이드 설치 개요 를 확인하세요.
인프라 에이전트를 수동으로 설치하려는 경우 Linux, Windows 용 수동 설치 자습서를 따를 수 있습니다.

NVIDIA GPU용 Flex 통합 구성

Flex는 New Relic 인프라 에이전트와 함께 번들로 제공되며 NVIDIA GPU 장치를 모니터링하는 명령줄 유틸리티인 NVIDIA SMI 와 통합될 수 있습니다.

중요

nvidia-smi는 Linux 및 Windows Server에 NVIDIA GPU 디스플레이 드라이버가 사전 설치되어 제공됩니다.

Flex를 구성하려면 다음 단계를 따르세요.

다음 경로에 nvidia-smi-gpu-monitoring.yml 이라는 파일을 만듭니다.

bash

$sudo touch /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml

git 저장소 에서 다운로드할 수도 있습니다.

통합 구성으로 nvidia-smi-gpu-monitoring.yml 파일을 업데이트합니다.

--- 
integrations:
  - name: nri-flex
    # interval: 30s
    config:
      name: NvidiaSMI
      variable_store:
        metrics: 
          "name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
          pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
          pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
          persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
          driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
          gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
          clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
          clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
          clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
          clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
          utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
          encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
          ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
          ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
          ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
          ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
          ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
          ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
          ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
          ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
          ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
          ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
          ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
          ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
          retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
          power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
          clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
          clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
          mig.mode.current,mig.mode.pending"
      apis:
        - name: NvidiaGpu
          commands:
            - run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
              output: csv
          rename_keys:
            " ": ""
            "\\[MiB\\]": ".MiB"
            "\\[%\\]": ".percent"
            "\\[W\\]": ".watts"
            "\\[MHz\\]": ".MHz"
          value_parser:
            "clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
            '.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'

GPU 측정항목이 수집되고 있는지 확인

Flex 구성은 인프라 에이전트에 의해 자동으로 감지되고 실행되므로 에이전트를 다시 시작할 필요가 없습니다. 다음 NRQL 쿼리를 실행하여 측정항목이 수집되고 있는지 확인할 수 있습니다.

SELECT * FROM NvidiaGpuSample

애플리케이션 모니터링

사전 구축된 대시보드 템플릿을 사용하여 GPU 지표를 모니터링할 수 있습니다. 다음과 같이하세요:

one.newrelic.com
으로 이동하여
Dashboards
를) 클릭합니다.
Import dashboard
탭을 클릭합니다.
NVIDIA GPU 대시보드 에서 파일 콘텐츠(.json)를 복사합니다.
대시보드를 가져와야 하는 대상 계정을 선택합니다.
작업을 확인하려면
Import dashboard
클릭하세요.
귀하의 NVIDIA GPU Monitoring 대시보드는 맞춤형 대시보드로 간주되며 Dashboards UI에서 찾을 수 있습니다. 대시보드 사용 및 편집에 대한 문서는 대시보드 문서 를 참조하세요.
다음은 사용 가능한 모든 원격 분석을 보기 위한 NRQL 쿼리입니다.

SELECT * FROM NvidiaGpuSample

다음은 뭐지?

NVIDIA SMI 유틸리티에서 사용할 수 있는 정보를 포함하거나 제외하도록 Flex 구성을 조정할 수 있습니다.

NRQL 쿼리 작성 및 대시보드 생성에 대해 자세히 알아보려면 다음 문서를 확인하세요.

기본 및 고급 쿼리를 생성 하기 위한 쿼리 빌더 소개
대시보드를 사용자 지정하고 다양한 작업을 수행하기 위한 대시보드 소개
대시보드를 관리하여
디스플레이 모드를 조정하거나 대시보드에 더 많은 콘텐츠를 추가하세요.

사용자의 편의를 위해 제공되는 기계 번역입니다.

인프라 에이전트 설치.css-21sua1{background:none;border:none;width:0;padding:0;}

NVIDIA GPU용 Flex 통합 구성

중요

GPU 측정항목이 수집되고 있는지 확인

애플리케이션 모니터링

다음은 뭐지?

인프라 에이전트 설치