NVIDIA GPU integration

The NVIDIA GPU integration lets you monitor the status of your GPUs. It uses the infrastructure agent together with the Flex integration, which gives access to NVIDIA's SMI utility.

After you set up the integration, you get a pre-built dashboard with your critical GPU metrics, such as:

  • GPU utilization
  • ECC error counts
  • Active compute processes
  • Clock and performance states
  • Temperature and fan speed
  • Dynamic and static information about each supported device

Install the infrastructure agent

To capture data with New Relic, install our infrastructure agent. The agent collects and ingests data so you can keep track of your GPUs' performance.

You can install the infrastructure agent in two ways: with the guided install or manually.

Configure the Flex integration for NVIDIA GPU

Flex comes bundled with the New Relic infrastructure agent and can integrate with NVIDIA SMI, a command-line utility for monitoring NVIDIA GPU devices.

Important

nvidia-smi ships pre-installed with the NVIDIA GPU display drivers on Linux and Windows Server.
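Because the Flex config calls nvidia-smi by name, it can help to check that the binary is on the PATH before you continue. The Python sketch below is an optional convenience check, not part of the integration. The function name find_nvidia_smi is hypothetical, and the infrastructure agent may run with a different PATH than your shell, so treat the result as a hint only.

```python
import shutil


def find_nvidia_smi(command="nvidia-smi"):
    """Return the absolute path of the command if it is on PATH, else None."""
    return shutil.which(command)


path = find_nvidia_smi()
if path is None:
    # If the binary lives elsewhere, use its full path in the Flex `run` line.
    print("nvidia-smi was not found on PATH")
else:
    print(f"nvidia-smi found at {path}")
```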

Follow these steps to configure Flex:

  1. Create a file named nvidia-smi-gpu-monitoring.yml in this path:

    sudo touch /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml

    You can also download it from the git repository.

  2. Update the nvidia-smi-gpu-monitoring.yml file with the integration config:
---
integrations:
  - name: nri-flex
    # interval: 30s
    config:
      name: NvidiaSMI
      variable_store:
        metrics:
          "name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
          pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
          pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
          persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
          driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
          gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
          clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
          clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
          clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
          clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
          utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
          encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
          ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
          ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
          ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
          ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
          ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
          ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
          ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
          ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
          ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
          ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
          ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
          ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
          retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
          power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
          clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
          clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
          mig.mode.current,mig.mode.pending"
      apis:
        - name: NvidiaGpu
          commands:
            - run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
              output: csv
          rename_keys:
            " ": ""
            "\\[MiB\\]": ".MiB"
            "\\[%\\]": ".percent"
            "\\[W\\]": ".watts"
            "\\[MHz\\]": ".MHz"
          value_parser:
            "clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
            '.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'
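To make the rename_keys and value_parser rules in the config more concrete, here is a rough Python approximation of their effect on nvidia-smi CSV output. It is a sketch under stated assumptions, not Flex's actual implementation: the sample row is made up for illustration, the helper names (rename_key, parse_value, transform) are hypothetical, and it assumes each rule is applied as a regular-expression search, with values left as strings.

```python
import csv
import io
import re

# Made-up sample of `nvidia-smi --query-gpu=... --format=csv` output (illustrative only).
SAMPLE = (
    "name, utilization.gpu [%], memory.used [MiB], power.draw [W], display_mode\n"
    "Example GPU, 45 %, 1024 MiB, 61.25 W, Enabled\n"
)

# Same patterns as rename_keys in the Flex config.
RENAME_KEYS = {
    r" ": "",
    r"\[MiB\]": ".MiB",
    r"\[%\]": ".percent",
    r"\[W\]": ".watts",
    r"\[MHz\]": ".MHz",
}

# Same patterns as value_parser in the Flex config: key regex -> value regex.
VALUE_PARSER = {
    r"clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": r"\d*\.?\d+",
    r".": r"\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default",
}


def rename_key(key):
    """Apply every rename rule to a CSV header name."""
    for pattern, replacement in RENAME_KEYS.items():
        key = re.sub(pattern, replacement, key)
    return key


def parse_value(key, value):
    """For each rule whose key regex matches, keep only the matched part of the value."""
    for key_pattern, value_pattern in VALUE_PARSER.items():
        if re.search(key_pattern, key):
            match = re.search(value_pattern, value)
            if match:
                value = match.group(0)
    return value


def transform(csv_text):
    """Turn raw CSV text into one dict of cleaned keys and values per GPU row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned = {}
        for raw_key, raw_value in row.items():
            key = rename_key(raw_key)
            cleaned[key] = parse_value(key, raw_value)
        rows.append(cleaned)
    return rows


for key, value in transform(SAMPLE)[0].items():
    print(f"{key} = {value!r}")
```

Under these assumptions, a header such as "utilization.gpu [%]" becomes utilization.gpu.percent and a value such as " 45 %" is reduced to "45", which is why the units end up in the attribute name instead of the value.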

Confirm GPU metrics are being ingested

The infrastructure agent automatically detects and runs the Flex config, so you don't need to restart the agent. To confirm that metrics are being ingested, run this NRQL query:

SELECT * FROM NvidiaGpuSample

Monitor your application

You can use our pre-built dashboard template to monitor your GPU metrics. Follow these steps:

  1. Go to one.newrelic.com and click Dashboards.

  2. Click the Import dashboard tab.

  3. Copy the file content (.json) from the NVIDIA GPU dashboard.

  4. Select the target account where the dashboard should be imported.

  5. Click Import dashboard to confirm the action.

    Your NVIDIA GPU Monitoring dashboard is considered a custom dashboard and can be found in the Dashboards UI. For docs on using and editing dashboards, see the dashboard docs.

    Here is an NRQL query to view all available telemetry:

SELECT * FROM NvidiaGpuSample

What's next?

You can adjust the Flex config to include or exclude information available from the NVIDIA SMI utility.
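As one way to trim the config, you could replace the long metrics string with a shorter one. The Python sketch below builds such a string from a hand-picked subset of fields and shows the nvidia-smi command that would result. The subset and the helper names (build_metrics_value, build_command) are hypothetical choices for illustration; every field name is taken from the metrics list in the Flex config.

```python
# Hypothetical subset; each name appears in the full metrics list of the Flex config.
SELECTED_FIELDS = [
    "name",
    "index",
    "utilization.gpu",
    "utilization.memory",
    "memory.total",
    "memory.used",
    "temperature.gpu",
    "power.draw",
    "fan.speed",
]


def build_metrics_value(fields):
    """Join field names with commas, skipping blanks and duplicates, keeping order."""
    seen = []
    for field in fields:
        field = field.strip()
        if field and field not in seen:
            seen.append(field)
    return ",".join(seen)


def build_command(fields):
    """Return the command line that results after ${var:metrics} is substituted."""
    return f"nvidia-smi --query-gpu={build_metrics_value(fields)} --format=csv"


# Paste the first value into variable_store.metrics in the YAML file.
print(build_metrics_value(SELECTED_FIELDS))
print(build_command(SELECTED_FIELDS))
```

Running the printed command by hand is a quick way to check that your driver version supports every field you kept before you update the YAML file.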

To learn more about building NRQL queries and generating dashboards, see the NRQL and dashboard documentation.

Copyright © 2024 New Relic Inc.
