Installing CUDA and PyTorch with conda on CentOS 7

First, check which GPU model the Linux machine has:

# lspci | grep -i nvidia
04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

Check whether any process is using the GPU (if one is, stop it first):

# lsof | grep nvidia
nvidia-mo   443                 root cwd       DIR              253,0        254          64 /
nvidia-mo   443                 root rtd       DIR              253,0        254          64 /
nvidia-mo   443                 root txt   unknown                                           /proc/443/exe
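
The raw `lsof` listing repeats the same process once per open file. A minimal sketch for reducing it to one line per process is shown here; the function name `gpu_holders` is hypothetical, and it assumes the PID sits in the second column, as it does in the output above.

```shell
#!/usr/bin/env bash
# Reduce lsof-style lines (read from stdin) to unique "COMMAND PID" pairs.
# Lines whose second column is not a numeric PID (such as the lsof header) are skipped.
gpu_holders() {
    awk '$2 ~ /^[0-9]+$/ { print $1, $2 }' | sort -u
}

# Usage on a real system:
#   lsof 2>/dev/null | grep nvidia | gpu_holders
```

If `ps` shows the command for a listed PID in square brackets, for example `[nvidia-modeset]`, it is a kernel thread that belongs to the driver module. It cannot be killed and disappears once the module is unloaded.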

The GPU driver can also be installed this way (recommended):
sudo yum install nvidia-detect

nvidia-detect -v

Probing for supported NVIDIA devices...
[10de:1b06] NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
This device requires the current 440.64

yum -y install kmod-nvidia

Error: nvidia-x11-drv-390xx conflicts with nvidia-x11-drv-460.39-1.el7_9.elrepo.x86_64
Error: nvidia-x11-drv-390xx conflicts with nvidia-x11-drv-libs-460.39-1.el7_9.elrepo.x86_64
Error: nvidia-x11-drv conflicts with nvidia-x11-drv-390xx-390.138-1.el7_8.elrepo.x86_64
You could try using --skip-broken to work around the problem
** Found 2 pre-existing rpmdb problem(s), 'yum check' output follows:
dnf-4.0.9.2-1.el7_6.noarch has missing requires of python2-dnf = ('0', '4.0.9.2', '1.el7_6')
orca-3.6.3-4.el7.x86_64 has missing requires of pyatspi

Remove the conflicting packages:

yum remove -y nvidia-x11-drv-390xx-390.138-1.el7_8.elrepo.x86_64
yum remove -y nvidia-x11-drv-460.39-1.el7_9.elrepo.x86_64

Uninstall the driver:
sudo yum remove kmod-nvidia

# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

# nvidia-smi

Failed to initialize NVML: Driver/library version mismatch
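
The `Driver/library version mismatch` message usually means the kernel module that is currently loaded has a different version from the user-space NVIDIA libraries, which is common right after driver packages are swapped without a reboot. The sketch below compares the loaded module version with the one installed on disk. The helper names `nvrm_version` and `same_version` are hypothetical, and the first-line format of `/proc/driver/nvidia/version` is an assumption noted in the comments.

```shell
#!/usr/bin/env bash
# Extract the driver version from text in the format of /proc/driver/nvidia/version.
# Assumed first-line format:
#   "NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.64  <build date>"
nvrm_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Return 0 when both version strings are non-empty and identical, 1 otherwise.
same_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage on a real system with the driver loaded:
#   loaded=$(nvrm_version < /proc/driver/nvidia/version)
#   on_disk=$(modinfo -F version nvidia)
#   same_version "$loaded" "$on_disk" || echo "loaded: $loaded, on disk: $on_disk"
```

When the two versions differ, unloading and reloading the NVIDIA modules or rebooting normally brings them back in line.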

Look up the driver that matches the GPU on the NVIDIA download page:
http://www.nvidia.cn/Download/Find.aspx?lang=cn


wget https://us.download.nvidia.com/XFree86/Linux-x86_64/440.64/NVIDIA-Linux-x86_64-440.64.run

sudo chmod a+x NVIDIA-Linux-x86_64-440.64.run
./NVIDIA-Linux-x86_64-440.64.run
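
If `nvidia-detect` reports a different version, the download link can be derived from the version string. This is only a sketch: the function name `nvidia_run_url` is hypothetical, and the URL layout is copied from the 440.64 link used here on the assumption that other versions follow the same pattern.

```shell
#!/usr/bin/env bash
# Build the .run installer URL for a given driver version.
# The layout mirrors the 440.64 link; it is assumed, not verified, for other versions.
nvidia_run_url() {
    echo "https://us.download.nvidia.com/XFree86/Linux-x86_64/$1/NVIDIA-Linux-x86_64-$1.run"
}

# Usage:
#   wget "$(nvidia_run_url 440.64)"
```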

# nvidia-smi

ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.

# sudo systemctl isolate multi-user.target
# sudo modprobe -r nvidia-drm
modprobe: FATAL: Module nvidia_drm is in use.

sudo modprobe -r nvidia-modeset

# lsmod | grep nvidia.drm
nvidia_drm             43547 2
nvidia_modeset       1053327 1 nvidia_drm
drm_kms_helper        186531 1 nvidia_drm
drm                   456166 5 drm_kms_helper,nvidia_drm

Run lsmod | grep nvidia.drm and look at the numbers to the right of the nvidia_drm module name. The first number is simply the size of the module; the second is the use count.
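
A small helper can pull out just the use count, which is handy in a script that waits until the module is free. The name `module_use_count` is hypothetical; the sketch reads `lsmod`-formatted text from stdin and assumes the usual three leading columns (name, size, use count).

```shell
#!/usr/bin/env bash
# Print the use count (third lsmod column) of the module named in $1.
# Exits non-zero when the module does not appear in the input.
module_use_count() {
    awk -v mod="$1" '$1 == mod { print $3; found = 1 } END { exit !found }'
}

# Usage on a real system:
#   lsmod | module_use_count nvidia_drm
```

A count of 0 means nothing holds the module and `modprobe -r` should succeed; any other value means a process or another module still depends on it.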

If the X11 server is running and using the nvidia driver, then the nvidia_drm kernel module will certainly be in use. At the very least, you'll need to switch to a text console and shut down the X11 server. Usually this can be done by stopping whichever X display manager service you're using (which depends on your desktop environment).

As the error message said, if you are running nvidia-persistenced, you'll need to stop that too before you can unload the nvidia_drm module.
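
Put together, the sequence might look like the sketch below. It is a dry run by default and only prints what it would do. The `run` and `unload_nvidia` helpers and the `DRY_RUN` variable are hypothetical names, and the module list assumes the usual `nvidia_drm`, `nvidia_modeset`, `nvidia_uvm`, `nvidia` dependency chain, so some of these modules may not be loaded on a given machine.

```shell
#!/usr/bin/env bash
# Sketch: release and unload the NVIDIA kernel modules before running the installer.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

unload_nvidia() {
    # Leave the graphical target so the X server releases the GPU.
    run systemctl isolate multi-user.target
    # Stop the persistence daemon, if present; it keeps the driver open.
    run systemctl stop nvidia-persistenced
    # Unload dependents first, the core module last.
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        run modprobe -r "$mod"
    done
}

# Usage:
#   unload_nvidia              # print the plan
#   DRY_RUN=0 unload_nvidia    # actually run it, as root
```

If a `modprobe -r` step still reports the module as in use, some process such as a VNC or X server is holding the device and has to be stopped first.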

pkill -9 Xvnc

17080 root 20 0 519316 214832 47908 S 6.3 0.1 5421:48 Xvnc

ps aux | grep nvidia
root 443 0.0 0.0 0 0 ? S 2020 0:00 [nvidia-modeset]
root 8197 0.0 0.0 112832 984 pts/0 S+ 22:01 0:00 grep --color=auto nvidia