深海游弋的鱼 – 默默的点滴

On Ubuntu 18.04, sudo apt-get install nvidia-cuda-toolkit currently installs NVIDIA CUDA 9.1.85. That release is fairly old, but it is stable and works with a wide range of hardware.

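When a project requires a specific version of PyTorch, the officially provided prebuilt NVIDIA CUDA packages are not compatible with all hardware, which is troublesome in a real environment.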

For now, the safer approach is to build directly from source.

If the graphics card is several years old (GeForce GTX 760, Compute Capability = 3.0; GeForce GT 720M in a Lenovo ThinkPad T440, Compute Capability = 2.1), the following warning appears at run time:

Found GPU0 GeForce GTX 760 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.

Execution then fails with the error:

RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

To look up the compute capability of your hardware, see the "Recommended GPU for Developers" page.
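The warning quoted above boils down to comparing the card's compute capability against a minimum of 3.5. A minimal sketch of that comparison follows; is_supported and MIN_CAPABILITY are illustrative names, not part of PyTorch, and the optional probe at the bottom assumes some PyTorch build is already importable and relies on torch.cuda.get_device_capability.

```python
# Minimum compute capability quoted in the PyTorch warning above.
MIN_CAPABILITY = (3, 5)


def is_supported(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed to probe the real device
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        verdict = "meets" if is_supported(cap) else "is below"
        print(torch.cuda.get_device_name(0), cap, verdict, "the 3.5 minimum")
    else:
        print("No importable PyTorch with a usable CUDA device; nothing probed.")
```

For the two cards mentioned here, is_supported((3, 0)) and is_supported((2, 1)) are both false.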

------------------------------------------------------------------------------------

Install the latest version, cuda-10.1; building against older versions runs into problems:

# Remove any previously installed CUDA
$ sudo apt-get remove nvidia-cuda-toolkit

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600

$ wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb

$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub

$ sudo apt-get update

$ sudo apt-get -y install cuda

# Some driver packages may have been updated; run the upgrade, otherwise things may still not work
$ sudo apt-get dist-upgrade

$ sudo apt-get autoremove

# You may need to delete the X Window configuration file, otherwise the driver may not load properly
$ sudo rm -rf ~/.Xauthority 

# If the following error appears:
# ubuntu 18.04 "nvidia-340 diverts /usr/lib/x86_64-linux-gnu/libGL.so.1
# to /usr/lib/x86_64-linux-gnu/libGL.so.1.distrib"
# see http://www.mobibrw.com/?p=21739

# Remove the local installer repository to free several GB of disk space; it is no longer needed once installation is complete
$ sudo apt-get remove --purge cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00 

$ sudo apt-get update

# Some driver packages may have been updated; run the upgrade, otherwise things may still not work
$ sudo apt-get dist-upgrade

$ sudo apt-get autoremove

To install cuDNN, download the matching version from the official website. There are three packages in total; install them one by one.

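The Lenovo T440 (Compute Capability = 2.1) does not support cuDNN, so there is no need to install it. In fact, even the latest CUDA-10.1 cannot be installed on it: the driver for the NVIDIA GT 720M is only supported up to the 390 series, whereas CUDA-10.1 requires driver version 418 or later. The symptom is that the graphics driver is not loaded after boot, and dmesg shows the following: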

[   72.533870] NVRM: The NVIDIA GeForce GT 720M GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 430.50 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[   72.533875] NVRM: No NVIDIA graphics adapter found!
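The mismatch described above (a card limited to the 390.xx legacy driver versus a toolkit that needs 418 or newer) can be checked before installing anything. The sketch below compares dotted version strings; parse_version and driver_supports_cuda_10_1 are hypothetical helper names, the 418 threshold is taken from the text above, and the optional probe assumes nvidia-smi is on the PATH.

```python
import shutil
import subprocess


def parse_version(text):
    """Turn a dotted version such as '418.87.00' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports_cuda_10_1(driver_version, minimum="418"):
    """True if the driver is at least the minimum stated in this post."""
    return parse_version(driver_version) >= parse_version(minimum)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; is any NVIDIA driver installed?")
    else:
        try:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
                universal_newlines=True,
            )
        except subprocess.CalledProcessError:
            print("nvidia-smi failed; the driver may not be loaded.")
        else:
            versions = out.split()  # one entry per GPU
            if versions:
                print(versions[0], driver_supports_cuda_10_1(versions[0]))
            else:
                print("nvidia-smi reported no GPU.")
```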

------------------------------------------------------------------------------------

As before, I recommend creating a dedicated build environment in Anaconda and then running the build:

$ sudo apt-get install git

# conda remove -n pytorch --all

$ conda create -n pytorch -y python=3.6.8 pip

$ source activate pytorch

$ conda install numpy pyyaml mkl=2019.1 mkl-include=2019.1 setuptools cmake cffi typing pybind11

$ conda install ninja

# magma-cuda90, magma-cuda91 and magma-cuda92 cause the build to fail
$ conda install -c pytorch magma-cuda101

$ git clone https://github.com/pytorch/pytorch

$ cd pytorch

# PyTorch 1.0.1 still supports hardware with a Compute Capability as low as 3.0; PyTorch 1.2.0 needs at least 3.5 to run properly
# https://github.com/pytorch/pytorch/blob/v1.3.0/torch/utils/cpp_extension.py
$ git checkout v1.0.1 -b v1.0.1

$ git submodule sync

$ git submodule update --init --recursive

$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

# If you do not need CUDA, also add this line here: export NO_CUDA=1

$ python setup.py clean

# Uninstall any previously installed pytorch
$ conda uninstall pytorch

# Look up the Compute Capability of your hardware on the NVIDIA developer site.
# For example, the GeForce GTX 760 has compute capability 3.0; a wrong capability causes a run-time failure:
# RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device

$ python setup.py install

# For development mode, you can use:
# python setup.py build develop

# Be sure to leave the pytorch build directory: running commands from inside the pytorch source tree causes errors
$ cd ..
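The comment about matching the Compute Capability does not say how the value reaches the build. One common way, stated here as an assumption rather than something this post shows, is the TORCH_CUDA_ARCH_LIST environment variable read by PyTorch's build scripts. The helper below (a hypothetical name, arch_list) only formats that value so it can be exported before running python setup.py install.

```python
def arch_list(capabilities):
    """Format (major, minor) pairs as a ';'-separated architecture list."""
    unique = sorted(set(tuple(cap) for cap in capabilities))
    return ";".join("{}.{}".format(major, minor) for major, minor in unique)


if __name__ == "__main__":
    # GeForce GTX 760 -> compute capability 3.0 (value taken from this post).
    print('export TORCH_CUDA_ARCH_LIST="{}"'.format(arch_list([(3, 0)])))
```

Duplicates are dropped and entries are sorted, so passing several cards' capabilities yields one entry per distinct architecture.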

If the following error occurs:

[ 68%] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/ATen/native/sparse/cuda/caffe2_gpu_generated_SparseCUDABlas.cu.o
~/pytorch/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(58): error: more than one instance of function "at::native::sparse::cuda::cusparseGetErrorString" matches the argument list:
            function "cusparseGetErrorString(cusparseStatus_t)"
            function "at::native::sparse::cuda::cusparseGetErrorString(cusparseStatus_t)"
            argument types are: (cusparseStatus_t)

then edit aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu and wrap its cusparseGetErrorString function in #if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2))) ... #endif

as follows:

#if (!((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2)))
const char* cusparseGetErrorString(cusparseStatus_t status) {
  switch(status)
  {
    case CUSPARSE_STATUS_SUCCESS:
      return "success";

    case CUSPARSE_STATUS_NOT_INITIALIZED:
      return "library not initialized";

    case CUSPARSE_STATUS_ALLOC_FAILED:
      return "resource allocation failed";

    case CUSPARSE_STATUS_INVALID_VALUE:
      return "an invalid numeric value was used as an argument";

    case CUSPARSE_STATUS_ARCH_MISMATCH:
      return "an absent device architectural feature is required";

    case CUSPARSE_STATUS_MAPPING_ERROR:
      return "an access to GPU memory space failed";

    case CUSPARSE_STATUS_EXECUTION_FAILED:
      return "the GPU program failed to execute";

    case CUSPARSE_STATUS_INTERNAL_ERROR:
      return "an internal operation failed";

    case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
      return "the matrix type is not supported by this function";

    case CUSPARSE_STATUS_ZERO_PIVOT:
      return "an entry of the matrix is either structural zero or numerical zero (singular block)";

    default:
      return "unknown error";
  }
}
#endif
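To see which cuSPARSE versions still compile the local definition, the preprocessor condition can be mirrored in a few lines. This is only an illustration of the boolean logic in the #if line; keeps_local_definition is a made-up name and the version pairs are sample inputs, not values read from any CUDA installation.

```python
def keeps_local_definition(major, minor):
    """Mirror of: !((CUSPARSE_VER_MAJOR >= 10) && (CUSPARSE_VER_MINOR >= 2))."""
    return not (major >= 10 and minor >= 2)


if __name__ == "__main__":
    for major, minor in [(9, 1), (10, 1), (10, 2), (10, 3), (11, 0)]:
        action = "compiled" if keeps_local_definition(major, minor) else "skipped"
        print("cuSPARSE {}.{}: local definition {}".format(major, minor, action))
```

The condition tests the two fields independently, so a hypothetical 11.0 would compile the local definition again; that does not matter for the CUDA-10.1 case covered here.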

This resolves the conflict with the function of the same name that ships with CUDA-10.1.

For details, see: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu

To uninstall a PyTorch that was installed from source, run:

# conda uninstall pytorch

$ pip uninstall torch

$ python setup.py clean
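After uninstalling, it can be worth confirming that no torch package is still importable, and from where. A small sketch using only the standard library follows; locate_module is a hypothetical helper name. It is best run from outside the pytorch source directory, since the source tree itself contains a torch folder that Python may pick up.

```python
import importlib.util


def locate_module(name):
    """Return the origin recorded for a top-level module, or None if absent.

    Namespace packages may also report no origin.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    origin = locate_module("torch")
    if origin is None:
        print("torch is not importable: uninstall looks complete")
    else:
        print("torch would still be loaded from:", origin)
```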

Cloning the PyTorch source is very slow; a synced copy of the PyTorch source code can be downloaded from this site.

Building PyTorch 1.0.1 on Ubuntu 18.04 (GeForce GTX 760) with CUDA-10.1

Posted by 默默
