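Prometheus GPU Monitoring
Steps:
1, Prometheus GPU monitoring
2, Install gpu-monitoring-tools
2.1, Start `dcgm-exporter` at boot
3, Update the Prometheus configuration
4, Grafana
5, Use dashboard `9957` (supports switching between nodes)
6, Grafana settings
7, Use dashboard `12027`
8, Use GPU-Nodes-Metrics-Nvidia 12639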
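1, Prometheus GPU monitoring
Install DCGM (NVIDIA Data Center GPU Manager) from the package below, for example with dpkg -i, then check the installed version:
datacenter-gpu-manager_1.7.2_amd64.deb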
# dcgmi --version
dcgmi version: 1.7.2
2, Install gpu-monitoring-tools
# git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
# cd gpu-monitoring-tools/
# make binary
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
# make install
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
install -m 557 dcgm-exporter /usr/bin/dcgm-exporter
install -m 557 -D ./etc/dcgm-exporter/default-counters.csv /etc/dcgm-exporter/default-counters.csv
install -m 557 -D ./etc/dcgm-exporter/dcp-metrics-included.csv /etc/dcgm-exporter/dcp-metrics-included.csv
- Run dcgm-exporter:
# which dcgm-exporter
/usr/bin/dcgm-exporter
# dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
- Test: the metrics endpoint should return monitoring data:
# curl 192.168.1.2:9400/metrics
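The curl output above is in the Prometheus text exposition format, so it can also be checked from a script. The following is a minimal sketch, not part of dcgm-exporter itself: the function name parse_metrics and the SAMPLE text are assumptions made for illustration, and the sample values are invented rather than taken from a real GPU.

```python
import re

# A sample line looks like: NAME{label="value",...} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical excerpt of what "curl <host>:9400/metrics" might return.
SAMPLE = '''# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 35
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 61.5
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice the text would come from the exporter URL rather than from SAMPLE; an empty result usually means the exporter is running but DCGM is not reporting any GPUs.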
2.1, Start dcgm-exporter at boot
# Create the service unit file
vim /lib/systemd/system/dcgm-exporter.service
# with the following content:
[Unit]
Description=dcgm-exporter service
[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter
TimeoutStopSec=10
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
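Save and exit.
Then reload systemd, enable the service at boot, start it, and check its status:
1. Reload the systemd configuration
systemctl daemon-reload
2. Enable at boot
systemctl enable dcgm-exporter.service
3. Start the service
systemctl start dcgm-exporter.service
4. Check the status
systemctl status dcgm-exporter.service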
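After the service has been started, it can be useful to confirm that the exporter is actually accepting connections, since systemctl status only reports the process state. The sketch below is one possible check, assuming the exporter listens on port 9400 as in the curl test earlier; wait_for_port is a hypothetical helper name, not something shipped with dcgm-exporter.

```python
import socket
import time


def wait_for_port(host, port, attempts=10, delay=1.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for attempt in range(attempts):
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            # Not reachable yet: wait before the next attempt, if any remain.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

For example, wait_for_port("127.0.0.1", 9400) could be called right after systemctl start dcgm-exporter.service; a False result suggests looking at the service logs.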
3, Update the Prometheus configuration
- Add a dcgm-exporter scrape job to the Prometheus configuration file:
  # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
      - targets: ['192.168.1.2:9400']
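The scrape_configs entries of my configuration file are shown below as an example; the job under the # dcgm-exporter comment is the newly added one.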
# cat prometheus.yml
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # node_exporter
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100','192.168.1.2:9100']
  # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
      - targets: ['192.168.1.2:9400']
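When there are several GPU hosts, the scrape job can be generated instead of typed by hand. This is only a sketch under stated assumptions: render_scrape_job and GPU_HOSTS are hypothetical names, the output is meant to be pasted under scrape_configs, and the indentation mirrors the layout used in the file above.

```python
def render_scrape_job(job_name, targets, indent=2):
    """Render one static scrape job as YAML text for the scrape_configs list."""
    pad = " " * indent
    quoted = ",".join(f"'{target}'" for target in targets)
    lines = [
        f"{pad}- job_name: '{job_name}'",
        f"{pad}  static_configs:",
        f"{pad}    - targets: [{quoted}]",
    ]
    return "\n".join(lines)


# Hypothetical host list; 9400 is the exporter port used in this article.
GPU_HOSTS = ["192.168.1.2"]
job = render_scrape_job("gpu", [f"{host}:9400" for host in GPU_HOSTS])
print(job)
```

With the single host above, this prints the same three lines as the 'gpu' job in the configuration file.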
- Restart Prometheus:
systemctl restart prometheus.service
Open the Targets page of your Prometheus server in a browser, for example:
http://10.10.201.86:9090/targets
The newly added gpu target should be listed as UP.
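The same check can be scripted against the Prometheus HTTP API, which exposes target state at /api/v1/targets. The sketch below assumes the server address from the URL above; fetch_targets, target_health and the EXAMPLE payload are illustrative names and data, and EXAMPLE is a trimmed, invented response so the code runs without a server.

```python
import json
import urllib.request


def fetch_targets(base_url, timeout=5.0):
    """Fetch /api/v1/targets from Prometheus and return the decoded JSON."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def target_health(payload, job):
    """Map each scrape URL of the given job to its health string."""
    active = payload.get("data", {}).get("activeTargets", [])
    return {
        target["scrapeUrl"]: target["health"]
        for target in active
        if target.get("labels", {}).get("job") == job
    }


# Hypothetical, trimmed response in the shape returned by /api/v1/targets.
EXAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"instance": "192.168.1.2:9400", "job": "gpu"},
                "scrapeUrl": "http://192.168.1.2:9400/metrics",
                "health": "up",
            },
            {
                "labels": {"instance": "192.168.1.2:9100", "job": "node"},
                "scrapeUrl": "http://192.168.1.2:9100/metrics",
                "health": "up",
            },
        ]
    },
}

health = target_health(EXAMPLE, "gpu")
print(health)  # {'http://192.168.1.2:9400/metrics': 'up'}
```

Against a live server, target_health(fetch_targets("http://10.10.201.86:9090"), "gpu") would report the real state; any value other than "up" means the scrape is failing.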
4, Grafana
5, Use dashboard 9957 (supports switching between nodes)
6, Grafana settings
- Power usage (instance is the exporter address, IP and port):
DCGM_FI_DEV_POWER_USAGE{instance="192.168.1.101:9400"}
- GPU utilization:
DCGM_FI_DEV_GPU_UTIL{instance="192.168.1.101:9400"}
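The two panel queries above differ only in the metric name and the instance label, so they can be built programmatically and tried against the Prometheus instant-query endpoint (/api/v1/query) before being pasted into Grafana. A minimal sketch, assuming the Prometheus address used earlier in this article; instance_selector and instant_query_url are hypothetical helper names.

```python
from urllib.parse import urlencode


def instance_selector(metric, instance):
    """Build a PromQL selector such as METRIC{instance="host:port"}."""
    return f'{metric}{{instance="{instance}"}}'


def instant_query_url(base_url, expr):
    """Build the instant-query URL for the given PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


expr = instance_selector("DCGM_FI_DEV_POWER_USAGE", "192.168.1.101:9400")
print(expr)
print(instant_query_url("http://10.10.201.86:9090", expr))
```

The printed URL can be opened with curl or a browser; the JSON response holds one result per GPU on that instance.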
7, Use dashboard template 12027
  # dcgm-exporter
  - job_name: 'gpu-metrics'
    static_configs:
      - targets: ['127.0.0.1:9400','192.168.1.101:9400','192.168.1.102:9400']
Setting up panels manually
- List the available GPU metrics:
curl http://127.0.0.1:9400/metrics
- Power usage:
DCGM_FI_DEV_POWER_USAGE{instance="127.0.0.1:9400"}
- Memory used (framebuffer):
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}
- Total memory (used + free):
DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}+DCGM_FI_DEV_FB_FREE{instance="127.0.0.1:9400"}
- GPU utilization:
DCGM_FI_DEV_GPU_UTIL{instance="127.0.0.1:9400"}
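The total-memory query above is simply used plus free, and the same two values also give a usage percentage. The arithmetic can be sketched as follows; framebuffer_summary is a hypothetical helper, the input numbers are invented, and the MiB unit is an assumption based on how DCGM describes the FB_USED and FB_FREE fields.

```python
def framebuffer_summary(used_mib, free_mib):
    """Return (total_mib, used_percent) from the FB_USED and FB_FREE values."""
    total = used_mib + free_mib
    # Guard against a zero total, e.g. when no sample has arrived yet.
    percent = 100.0 * used_mib / total if total else 0.0
    return total, percent


# Hypothetical readings of DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE.
total, percent = framebuffer_summary(4096, 12288)
print(total, percent)  # 16384 25.0
```

In Grafana the same percentage could likely be expressed directly in PromQL as 100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE), with the instance selector added to each metric as in the queries above.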