What to do when MegEngine cannot enable CUDA

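When you use MegEngine, it automatically selects the fastest available compute device (that is, XPU). If CUDA is unavailable, you may see one of the following messages, which indicate that MegEngine is currently running in CPU mode: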

info: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
info: + Failed to load CUDA driver library, MegEngine works under CPU mode now.         +
info: + To use CUDA mode, please make sure NVIDIA GPU driver was installed properly.    +
info: + Refer to https://discuss.megengine.org.cn/t/topic/1264 for more information.    +
info: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
02 11:47:53[mgb] WRN cuda unavailable: no CUDA-capable device is detected(100) ndev=-1
err: Failed to load cuda API library
err: failed to load cuda func: cuCtxGetCurrent
err: failed to load cuda func: cuDeviceGetCount
err: failed to load cuda func: cuGetErrorString
02 09:10:10[mgb] WRN cuda unavailable: unknown cuda error(999) ndev=-1

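If you are sure that an NVIDIA GPU is installed correctly but CUDA mode still cannot be enabled, check the following:

  1. Is the GPU driver installed correctly, and is the GPU detected? Run nvidia-smi to confirm the GPU status.
  2. Is the driver version too old?
  3. Has an environment variable such as CUDA_VISIBLE_DEVICES been set in a way that hides all GPU devices?
  4. Are you running inside a Docker container? If so, are the devices and libraries mapped to the correct locations through nvidia-docker?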
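
Checks 1 and 3 can be scripted. The following is a minimal sketch using only the Python standard library, not part of MegEngine itself. The helper names check_nvidia_smi and check_visible_devices are hypothetical, and treating an empty value or "-1" as "all GPUs hidden" is a simplifying assumption about the most common ways the variable is used to mask devices.

```python
import os
import shutil
import subprocess


def check_visible_devices(env=None):
    """Return a warning if CUDA_VISIBLE_DEVICES appears to hide every GPU, else None."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: no devices are masked by this variable
    # Assumption: an empty value or "-1" is the usual way to mask all devices.
    if value.strip() in ("", "-1"):
        return f"CUDA_VISIBLE_DEVICES={value!r} hides all GPUs"
    return None


def check_nvidia_smi():
    """Return a warning if nvidia-smi is missing or fails to run, else None."""
    path = shutil.which("nvidia-smi")
    if path is None:
        return "nvidia-smi not found on PATH; the driver may not be installed"
    try:
        result = subprocess.run([path], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"nvidia-smi could not be run: {exc}"
    if result.returncode != 0:
        return f"nvidia-smi exited with code {result.returncode}"
    return None


if __name__ == "__main__":
    # Print only the checks that found a problem.
    for warning in (check_nvidia_smi(), check_visible_devices()):
        if warning:
            print("WARNING:", warning)
```

If both checks pass and MegEngine still reports CPU mode, the driver version (item 2) and the Docker device mapping (item 4) are the remaining candidates; those depend on your setup and are not covered by this sketch.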