Keras 3/TF vs. PyTorch – small model performance tests on a Nvidia 4060 TI

There are many PROs and CONs regarding the choice of a Machine Learning [ML] framework for private studies on a Linux Workstations. Two mainly used frameworks are PyTorch and a Keras/Tensorflow combination. One aspect for productive work with ML models certainly is performance. And as I personally do not have TPUs or other advanced chips available, but just a consumer Nvidia 4060 TI graphics card, performance and optimal GPU usage are of major interest – even for the training of relatively small models.

With this post I just want to point out that the question of performance advantages of some framework on a CUDA controlled graphics card can not be answered in a unique way. Even for small neural network [NN] models the performance may depend on a variety of relevant settings, on jit-/xla-compilation and the chosen precision level of your training or inference runs.

Continue reading

Blender – even on old laptops a graphics card increases rendering performance

My present experiments with Blender on my old laptop take considerable time to render- especially animations. So, I got interested in whether rendering on the laptop’s old Nvidia card, a GT 645M, would make a difference in comparison to rendering on the available 8 hyperthreaded cores of the CPU. The laptop’s CPU is an old one, too, namely an i7-3632QM. The laptop’s operative system is Opensuse Leap 15.3. The system uses Optimus technology. To switch between the Nvidia card and the Intel graphics I invoke Suse’s Prime Select application on KDE.

I got a factor of 2 up to 5.2 faster rendering on the GPU in comparison to the CPU. The difference depends on multiple factors. The number of CPU cores used is an important one.

How to activate GPU rendering in Blender?

Basically three things are required: (1) A working recent Nvidia driver (with compute components) for your graphics card. (2) A certain setting in Blender’s preferences. (3) A setting for the Cycles renderer.

Regarding the CUDA toolkit I quote from Blender’s documentation

Normally users do not need to install the CUDA toolkit as Blender comes with precompiled kernels.

With respect to required Blender settings one has to choose a CUDA capable device via the menu point “Preferences >> System”:

You may also select both the GPU and the CPU. Then rendering will be done both on the GPU and the CPU. My graphics card unfortunately only understands a low level of CUDA instructions. The Nvidia driver I used is of version 470.103.01, installed via Opensuse’s Nvidia community repository:

In addition, you must set an option for the Cycles renderer:

With all these settings I got a factor of 2 up to > 6 faster rendering on the GPU in comparison to a CPU with multiple cores.

The difference in performance, of course, depends on

  • the number of threads used on the CPU with 8 (hyperthreaded) cores available to the Linux OS
  • tiling – more precisely the “tile size” – in case of the GPU and the CPU

All other render options with the exception of “Fast G” were kept constant during the experiments.

Scene Setup

To give the Blender’s Cylces renderer something to do I set up a scene with the following elements:

  • a mountain-like landscape (via the A.N.T Landscape Add-On) with a sub-dividion of 256 to 128 – plus subdivision modifier (Catmull-Clark, render level 2, limit surface quality 3) – plus simple procedural texture with some noise and bumps
  • a plane with an “ocean” modifier (no repetition, waves + noisy bump texture for the normal to simulate waves)
  • a world with a sky texture of the Nishita type ( blue sky by much oxygen, some dust and a sun just above the horizon)

The scene looked like

The central red rectangle marks the camera perspective and the area to be rendered. With 80 samples and a resolution of 1200×600 we get:

The hardest part for the renderer is the reflection on the water (Ocean with wave and texture). Also the “landscape” requires some time. The Nishita world (i.e. the sky with the sun), however, is rendered pretty fast.

Required time for rendering on multiple CPU cores

I used 40 samples to render – no denoising, progressive multi-jitter, 0 minimum bounces.
Other settings can be found here:


The number of threads, the tile size and the use of the Fast CI approximation were varied.
The resolution was chosen to be 1200×600 px.

All data below were measured on a flatpak installation of Blender 3.1.2 on Opensuse Leap 15.3.

tile size threads Fast GI time
64 2 no 82.24
128 2 no 81.13
256 2 no 81.01
32 4 no 45.63
64 4 no 43.73
128 4 no 43.47
256 4 no 43.21
512 4 no 44.06
128 8 no 31.25
256 8 no 31.04
256 8 yes 26.52
512 8 no 31.22

A tile size of 256×256 seems to provide an optimum regarding rendering performance. In my experience this depends heavily on the scene and the chosen image resolution.

“Fast GI” gives you a slight, but noticeable improvement. The differences in the rendered picture could only be seen in relatively tiny details of my special test case. It may be different for other scenes and illumination.

Note: With 8 CPU cores activated my laptop was stressed regarding CPU temperature: It went up to 81° Celsius.

Required time for rendering on the mobile GPU

Below are the time consumption data for rendering on the mobile Nvidia GPU 645M:

tile size Fast GI time
64 no 18.3
128 no 16.47
256 no 15.56
512 no 15.41
1024 no 15.39
1200 no 15.21
1200 yes 12.80

Bigger tile sizes improve the GPU rendering performance! This may be different for rendering on a CPU, especially for small scenes. There you have to find an optimum for the tile size. Again, we see an effect of Fast GI.

Note: The temperature of the mobile graphics card never rose above 58° Celsius. I measured this whilst rendering a much bigger image of 4800×2400 px. I therefore think that the temperature stress Blender rendering exerts on the GPU is relatively smaller in comparison to the heat stress on a CPU.

Required time for rendering both on the CUDA capable mobile GPU and the CPU

As the CPU is CUDA capable one can activate CUDA based rendering on the CPU in addition to the GPU in the “preferences” settings. With 4 CPU cores this brings you down to around 11 secs, with 8 cores down to 10 secs.

tile size threads Fast GI time
64 4 no 11.01
128 8 no 10.08

Conclusion

Even on an old laptop with Optimus technology it is worthwhile to use a CUDA capable Nvidia graphics card for Cycles based rendering in Blender experiments. The rise in temperature was relatively low in my case. The gain in performance may range from a factor 2 to 5 depending on how many CPU cores you can invoke without overheating your laptop.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

 

Pandas – Extending a vocabulary or simple dataframe relatively fast

During some work for a ML project on a large text corpus I needed to extend a personally used reference vocabulary by some complex ad unusual German compounds and very branch specific technical terms. I kept my vocabulary data in a Pandas dataframe. Each “word” there had some additional information associated with it in some extra columns of the dataframe – as e.g. the length of a word or a stem or a list of constituting tri-char-grams. I was looking for a fast method to extend the dataframe in a quick procedure with a list of hundreds or thousands of new words.

I tried the df.append() method first and got disappointed with its rather bad performance. I also experimented with the incorporation of some lists or dictionaries. In the end a procedure based on csv-data was the by far most convenient and fastest approach. I list up the basic steps below.

In my case I used the lower case character version of the vocabulary words as an index of the dataframe. This is a very natural step. It requires some small intermediate column copies in the step sequence below, which may not be necessary for other use-cases. For the sake of completeness the following list contains many steps which have to be performed only once and which later on are superfluous for a routine workflow.

  1. Step1: Collect your extension data, i.e. a huge bunch of words, in a Libreoffice Calc-file in ods-format or (if you absolutely must) in an MS Excel-file. One of the columns of your datasheet should contain data which you later want to use as a (unique) index of your dataframe – in my case a column “lower” (containing the low letter representation of a word).
  2. Step 2: Avoid any operations for creating additional column information which you later can create by Python functions working on information already contained in some dataframe columns. Fill in dummy values into respective columns. (Or control the filling of a dataframe with special data during the data import below)
  3. Step 3: Create a CSV-File containing the collected extension data with all required field information in columns which correspond to respective columns of the dataframe to be extended.
  4. Step 4:Create a backup copy of your original dataframe which you want to extend. Just as a precaution ….
  5. Step 5: Copy the contents of the index of your existing dataframe to a specific dataframe column consistent with step 1. In my case I copied the words’ lower case version into a new data column “lower”.
  6. Step 6: Delete the existing index of the original dataframe and create a new basic integer based index.
  7. Step 7: Import the CSV-file into a new and separate intermediate Pandas dataframe with the help of the method pd.read_csv(). Map the data columns and the data formats properly by supplying respective (list-like) information to the parameter list of read_csv(). Control the filling of possibly empty row-fields. Check for fields containing “null” as string and handle these by the parameter “na_filter” if possible (in my case by “na_filter=False”)
  8. Step 8: Work on the freshly created dataframe and create required information in special columns by applying row-specific Python operations with a function and the df.apply()-method. For the sake of performance: Watch out for naturally vectorizable operations whilst doing so and separate them from other operations, if possible.
  9. Step 9: Check for completeness of all information in
    your intermediate dataframe. verify that the column structure matches the columns of the original dataframe to be extend.
  10. Step 10: Concatenate the original Pandas dataframe (for your vocabulary) with the new dataframe containing the extension data by using the df.concat() or (simpler) by df.append() methods.
  11. Step 11: Drop the index in the extended dataframe by the method pd.reset_index(). Afterward recreate a new index by pd.set_index() and using a special column containing the data – in my case the column “lower”
  12. Step 12: Check the new index for uniqueness – if required.
  13. Step 13: If uniqueness is not given but required:
    Apply df = df[~df.index.duplicated(keep=’first’)] to keep only the first occurrence of rows for identical indices. But be careful and verify that this operation really fits your needs.
  14. Step 14: Resort your index (and extended dataframe) if necessary by applying df.sort_index(inplace=True)

Some steps in the list above are of course specific for a dataframe with a vocabulary. But the general scheme should also be applicable for other cases.

From the description you have certainly realized which steps must only be performed once in the beginning to establish a much shorter standard pipeline for dataframe extensions. Some operations regarding the index-recreation and re-sorting can also be automatized by some simple Python function.

Have fun with Pandas!