NUMA node error for Nvidia cards on Linux PCs

You may have experienced it in various contexts: CUDA, Tensorflow, gaming applications or complex 3D graphics applications may warn you that your Nvidia card is associated with an unexpected negative NUMA value. The warning often refers to a value of “-1”. The application then typically replaces this value with a default value of “0”.

The problem is particularly annoying when dealing with Machine Learning, e.g. in Jupyter notebooks. There, the warnings may repeatedly clutter the output of some cells – e.g. during the setup of the graphics card for some ML experiments.

Leaving aside the question of why the Nvidia drivers for Linux and/or the CUDA drivers do not fix this problem themselves by detecting that there is just one NUMA node on the system and setting the value for the card to “0”, the question for us users is how we can get rid of the warnings.

A basic idea is that we set the right value ourselves. I have described this simple measure in the sister blog, which unfortunately is still under construction. See:
Setting NUMA node to 0 for Nvidia cards on standard Linux PCs.

There I also briefly discuss what NUMA is basically intended for – and why it normally does not affect consumer PCs.
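As a rough illustration of the idea (a sketch under assumptions, not necessarily the procedure from the sister post): Linux exposes a numa_node file for every PCI device below /sys/bus/pci/devices/. The Python sketch below lists this value for Nvidia display devices and can overwrite it. The function names are made up for this example. Writing the value requires root rights, and the setting does not survive a reboot, so it would have to be repeated at boot time.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID of Nvidia


def nvidia_numa_nodes(pci_root="/sys/bus/pci/devices"):
    """Return {pci_address: numa_node} for Nvidia display devices.

    pci_root is a parameter only to keep the sketch testable;
    on a real system the default sysfs path applies.
    """
    root = Path(pci_root)
    if not root.is_dir():
        return {}
    result = {}
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        # PCI class 0x03xxxx = display controller (skips the card's audio function)
        if vendor == NVIDIA_VENDOR_ID and pci_class.startswith("0x03"):
            result[dev.name] = node
    return result


def set_numa_node(pci_address, node=0, pci_root="/sys/bus/pci/devices"):
    """Write a NUMA node value for one device (needs root on a real system)."""
    (Path(pci_root) / pci_address / "numa_node").write_text(f"{node}\n")
```

Calling nvidia_numa_nodes() first shows whether the card really reports a negative value; set_numa_node() with the listed PCI address would then be run as root before starting the ML application.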

 

Opensuse Leap 15.5 – installation of CUDA 12.3 for Machine Learning

Working with Machine Learning and Deep Neural Networks not only requires GPU drivers, but in the case of Nvidia GPUs also the installation of CUDA and cuDNN. This process is always a bit tricky, as additional environment variables have to be set for IPython-based Jupyterlab or the classic Jupyter Notebook. On an Opensuse system one must, in addition, take care of the right settings in /etc/alternatives.

I have described the necessary steps in a post at “machine-learning.anracom.com”.
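Independent of the details in that post, a quick check from within a notebook can show whether the CUDA directories are visible to the Jupyter process at all. The following Python sketch is only an illustration: the CUDA root /usr/local/cuda, the choice of the two variables PATH and LD_LIBRARY_PATH and the function name are assumptions, and your installation may use other locations.

```python
import os


def missing_cuda_paths(environ=None, cuda_root="/usr/local/cuda"):
    """Return {variable: directory} for expected CUDA entries that are absent.

    cuda_root is an assumption; adjust it to your installation.
    """
    env = os.environ if environ is None else environ
    expected = {
        "PATH": os.path.join(cuda_root, "bin"),
        "LD_LIBRARY_PATH": os.path.join(cuda_root, "lib64"),
    }
    missing = {}
    for var, directory in expected.items():
        # Normalize entries so that a trailing slash does not matter
        entries = [
            os.path.normpath(p)
            for p in env.get(var, "").split(os.pathsep)
            if p
        ]
        if os.path.normpath(directory) not in entries:
            missing[var] = directory
    return missing
```

An empty result from missing_cuda_paths() in a notebook cell would suggest that the kernel process inherited both entries; anything else hints at variables that were set in a shell but not for Jupyterlab.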

I hope this helps people who want to use Leap 15.5 for Machine Learning with Nvidia GPUs, Keras/Tensorflow 2 and Jupyterlab.

Important addendum 01/27/2024:
Although the combination of CUDA 12.3, cuDNN 8.9.7, Tensorflow 2.15 and Nvidia drivers 545.29.06 works as far as AI models are concerned, there is another major problem:
Nvidia’s driver 545.29.06 is buggy – at least for Leap 15.5 with KDE/Plasma and multiple screens. The bug affects Suspend-to-RAM. Suspend-to-RAM seems to work in the suspend phase, and the system also comes up afterward with the KDE/Plasma interface on your screens in a seemingly proper state.

However, the problems begin when you want to change to another virtual screen via Ctrl-Alt-Fx. You wait and wait and wait … The same happens when you change the run-level or systemd target state, or when you want to shut the system down. This makes Suspend-to-RAM with driver 545.29.06 impossible to use.

Recommendation:
If you have a working older Nvidia driver (e.g. a stable 535 version), do not change to 545.29.06. Unfortunately, it is a mess to return to an older driver version on a multiscreen Leap 15.5 system. The Nvidia community repository does not offer you a choice. (Why, by the way?) Downloading an older proprietary driver from Nvidia and trying to install it afterward on a console terminal (after having stopped X11 or Wayland) did not work in my case – the screens displaying the terminal changed their resolution and froze afterward. So, you may have to completely uninstall the present driver 545, go back to standard VGA and then try to install an older driver via Nvidia’s install mechanism. As I said: It is a mess …
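Before you accept a driver update, it may help to check which driver version is currently loaded. The Nvidia kernel module reports its version in /proc/driver/nvidia/version, in a line starting with “NVRM version:”. The Python sketch below extracts the version number from that line and tests whether it belongs to the 545 branch. The function names are invented for this example, and the parsing rests on the assumption that the first dotted number in that line is the driver version.

```python
import re
from pathlib import Path


def parse_nvidia_version(text):
    """Extract the driver version from the 'NVRM version:' line, or None."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # Assumption: the first dotted number in this line is the version
            match = re.search(r"\b(\d+(?:\.\d+)+)\b", line)
            if match:
                return match.group(1)
    return None


def loaded_driver_version(path="/proc/driver/nvidia/version"):
    """Return the version of the loaded Nvidia module, or None if unavailable."""
    try:
        return parse_nvidia_version(Path(path).read_text())
    except OSError:
        return None  # module not loaded or file not readable


def is_affected_branch(version, branch="545"):
    """True if the version string belongs to the given driver branch."""
    return version is not None and version.split(".")[0] == branch
```

If is_affected_branch(loaded_driver_version()) is false and Suspend-to-RAM works for you, that is a good reason to hold back the driver update.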

 

Opensuse Leap 15.4 – Problems with Optimus and prime-select after updates of SW packages

Presently, I work a lot on an old laptop which has a so-called Optimus combination of a dedicated Nvidia GPU and an Intel GPU integrated into the main CPU. “prime-select” is a tool which Opensuse includes with Leap 15.4 to provide an efficient way of controlling which GPU shall be used. As well as prime-select has worked for me on Leap 15.3, and for some time also on Leap 15.4, recent updates of a variety of SW packages led to trouble.

I had the Nvidia card active before the SW updates. After a cold restart, the system no longer started the SDDM display manager on the default systemd target. This happened even when the updates did not directly affect the kernel or the Nvidia kernel modules.

The problem always had to do with bbswitch turning off the Nvidia device when the system switched to the default graphical target. And with the Nvidia graphics device turned off, the Nvidia drivers cannot be loaded.

So some SW updates led to a change of the configuration which prime-select had set up before the updates. The stupid thing is that it is not so simple to get things working again. Using “init 3” to go to a console interface on a non-graphical target, followed by “prime-select nvidia” and a subsequent “init 5” on the command line, does not work. You do not change the wrong bbswitch actions that way. You can also override bbswitch by hand and power the Nvidia card on again with “tee /proc/acpi/bbswitch <<< ON” (writing “OFF” instead powers the card down) and then load the Nvidia driver successfully. But trying to switch to the standard graphical target afterward invokes bbswitch again in the wrong way. It is a bit of a mess. The following steps seem to work to get back to normal operation:

  • Step 1: Use “init 3” on a console terminal.
  • Step 2: Use the command “prime-select intel”.
  • Step 3: Restart your system. It should boot now into the graphical target based on the i915 intel GPU driver.
  • Step 4: Ignore any information from a prime-select icon. It wrongly claims that you are using Nvidia.
  • Step 5: Open a root terminal window. Make sure that bbswitch does not keep the Nvidia card powered down (e.g. by the command given above). Load the Nvidia module by “modprobe nvidia”. Check via lsmod that it is successfully loaded.
  • Step 6: Type in “prime-select nvidia”.
  • Step 7: Log out from your graphical interface.
  • Step 8: Check that SDDM, or whatever display manager you use, is started without bbswitch shutting down the Nvidia card. Log in with the Nvidia card active.
  • Step 9: Check on a root terminal window that the Nvidia driver is still loaded. Then issue “mkinitrd” and restart your Leap 15.4 system.
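
The checks in Steps 5 and 9 can also be scripted. bbswitch reports the power state of the card in /proc/acpi/bbswitch as a line consisting of the PCI address followed by ON or OFF, and /proc/modules lists the loaded kernel modules (the same information lsmod shows). The following Python sketch combines both checks. It is only an illustration, and the function names are made up.

```python
from pathlib import Path


def bbswitch_state(text):
    """Parse a status line like '0000:01:00.0 ON' into (address, powered_on)."""
    parts = text.split()
    if len(parts) != 2 or parts[1] not in ("ON", "OFF"):
        return None
    return parts[0], parts[1] == "ON"


def loaded_modules(text):
    """Return the set of module names from /proc/modules content."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def read_or_empty(path):
    """Read a file and return '' if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def check_nvidia(bbswitch_path="/proc/acpi/bbswitch",
                 modules_path="/proc/modules"):
    """Report whether the card is powered on and the nvidia module is loaded."""
    state = bbswitch_state(read_or_empty(bbswitch_path))
    modules = loaded_modules(read_or_empty(modules_path))
    return {
        # None means: bbswitch status not available (e.g. module not loaded)
        "card_powered_on": None if state is None else state[1],
        "nvidia_loaded": "nvidia" in modules,
    }
```

Both values should be true after Step 5 and again in Step 9 before you run mkinitrd.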

Afterward, using the “prime-select intel” or “prime-select nvidia” commands at the command line of a root terminal window, followed by a logout from the graphical desktop and a new login via the restarted graphical display manager, switches correctly between the cards.

However, the prime-select applet gives you wrong information when the Intel card is active. And it does not give you the chance to switch back to the Nvidia card again. It’s stupid, but no major problem as long as the basic prime-select command does its job on the command line.

Hope this helps people having to work with Opensuse on an Optimus system.