Fix corrupted NVIDIA Driver after Upgrading / Downgrading Ubuntu Kernel
One day, you rebooted your server and suddenly found that all your cute GPUs had disappeared. You ran nvidia-smi to see what was going on, but only got this error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I know what you're gonna say: Nvidia F*** You!
Causes
I have run into this problem many times. It is most likely caused by upgrading or downgrading the Linux kernel without rebuilding the kernel modules for the new kernel. The NVIDIA driver depends on such a module, so without it the driver cannot load.
Steps
Check System Status
Right now, the nvidia module is presumably not loaded (you can check with lsmod | grep nvidia). Try to load the kernel module manually.
sudo modprobe nvidia
You should get an error message like this.
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.0-84-generic
Meanwhile, check whether /usr/lib/modules/5.15.0-84-generic/updates/dkms/nvidia.ko is missing.
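The path above hard-codes one kernel version. Here is a minimal sketch that looks for the module under whichever kernel is currently running, assuming bash and the updates/dkms layout from the path above. The function name nvidia_module_path and its optional second argument are hypothetical. The glob also accepts compressed files such as nvidia.ko.zst, which is an assumption about how your system ships modules.

```shell
#!/usr/bin/env bash
# Print the path of the NVIDIA kernel module built for a given kernel release.
# Returns non-zero if no such file exists.

nvidia_module_path() {
    # $1: kernel release (defaults to the running kernel, from uname -r)
    # $2: modules root (defaults to /lib/modules; overridable, e.g. for testing)
    local release="${1:-$(uname -r)}"
    local root="${2:-/lib/modules}"
    local f
    # The trailing * also matches compressed modules (assumption, see above).
    for f in "$root/$release"/updates/dkms/nvidia.ko*; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

if path="$(nvidia_module_path)"; then
    echo "found: $path"
else
    echo "no nvidia.ko under updates/dkms for kernel $(uname -r)"
fi
```

If this reports that nothing was found for the running kernel, you are most likely in the situation this post describes.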
If you don’t see the error message above and that kernel module file does exist, you might have other issues, such as a hardware failure. In that case, read the kernel logs with dmesg and check whether the GPUs are still visible with lspci -vvv, which should give you some clues.
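Both dmesg and lspci -vvv produce a lot of output, so it helps to narrow it down. Below is a small sketch, assuming bash and GNU grep. The helper name nvidia_lines is hypothetical. NVRM is the prefix the NVIDIA kernel module uses for its own log messages.

```shell
#!/usr/bin/env bash
# Keep only the lines that mention the NVIDIA driver or an NVIDIA device.
# Prints a placeholder line when nothing matches, so silence is never ambiguous.
nvidia_lines() {
    grep -iE 'nvidia|nvrm' || echo "(no NVIDIA-related lines)"
}

# Usage, run by hand on the affected machine:
#   lspci -nn  | nvidia_lines                # is the GPU visible on the PCI bus?
#   sudo dmesg | nvidia_lines | tail -n 20   # what did the driver last report?
```

If lspci shows no NVIDIA device at all, the problem sits below the driver, and reinstalling packages is unlikely to help.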
Reinstall DKMS and NVIDIA Drivers
DKMS (Dynamic Kernel Module Support), the utility that rebuilds out-of-tree kernel modules when the kernel changes, might be broken, and so might the NVIDIA drivers themselves. We can fix both by removing them first and then installing them again.
sudo rm -r /var/lib/dkms/nvidia
Note: Installing the full CUDA Toolkit is the only way I recommend to install the drivers. Using the drivers provided by the official Ubuntu repository is NOT recommended.
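For reference, here is one way the whole remove-then-reinstall sequence might look. Treat it as a sketch under assumptions rather than the exact commands: it assumes NVIDIA's CUDA apt repository is already configured, that the driver packages match the patterns nvidia-* and libnvidia-*, and that the repository's cuda meta-package pulls in the driver. The run wrapper and the DRY_RUN variable are hypothetical helpers. By default the script only prints what it would do.

```shell
#!/usr/bin/env bash
# Sketch of the remove-then-reinstall sequence.
# Prints the commands by default; set DRY_RUN=0 to actually run them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

reinstall_nvidia() {
    # Remove the driver packages and DKMS itself (package patterns are assumptions).
    run sudo apt-get purge -y 'nvidia-*' 'libnvidia-*' dkms
    # Drop stale DKMS build state; -f so a missing directory is not an error.
    run sudo rm -rf /var/lib/dkms/nvidia
    # Install DKMS and the CUDA Toolkit again (assumes NVIDIA's apt repo is set up).
    run sudo apt-get update
    run sudo apt-get install -y dkms cuda
}

reinstall_nvidia
```

Review the printed commands first, then rerun with DRY_RUN=0 once they look right for your machine. A reboot afterwards is the simplest way to make sure the freshly built module is the one that gets loaded.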
(Optional) Reinstall NVIDIA Docker Runtime
The previous step also removes the NVIDIA Docker runtime, which may lead to this error if you use Docker:
Error response from daemon: Cannot restart container ...: could not select device driver "" with capabilities: [[gpu]]
So we need to reinstall it as well.
sudo apt install -y nvidia-container-toolkit