KMeans as a classifier for the WIFI and MNIST datasets – V – cluster based classification of the MNIST dataset

In this series about KMeans

KMeans as a classifier for the WIFI and MNIST datasets – I – Cluster analysis of the WIFI example
KMeans as a classifier for the WIFI and MNIST datasets – II – PCA in combination with KMeans for the WIFI-example
KMeans as a classifier for the WIFI and MNIST datasets – III – KMeans as a classifier for the WIFI-example
KMeans as a classifier for the WIFI and MNIST datasets – IV – KMeans on PCA transformed data

we have so far studied the application of KMeans to the WIFI dataset from UC Irvine. We now apply the KMeans clustering algorithm to the MNIST dataset – also in an extended form, namely as a classifier. The MNIST dataset – a collection of 28x28 px images of handwritten digits – has already been discussed in other sections of this blog and is well documented on the Internet. I therefore do not describe its basic properties in this post. A typical image of the collection is

Load MNIST – dimensionality of the feature space and scaling of the data

Due to the ease of use, I loaded the MNIST data samples via TF2 and the included Keras interface. Otherwise TF2 was not used for the following experiments. Instead the clustering algorithms were taken from “sklearn”.

Each MNIST image can be transformed into a one-dimensional array with dimension 784 (= 28 * 28). This means the MNIST feature space has a dimension of 784 – which is much more than the seven dimensions we dealt with when analyzing the WIFI data in the last post. All MNIST samples were shuffled for individual runs.
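
The flattening and shuffling step can be sketched as follows. This is only a minimal illustration with NumPy: the helper name flatten_and_shuffle and the random stand-in "images" are assumptions of mine, and the real arrays would come from the Keras loader (tf.keras.datasets.mnist.load_data()).

```python
import numpy as np

def flatten_and_shuffle(images, labels, seed=None):
    """Flatten (n, 28, 28) images to (n, 784) vectors and shuffle
    samples and labels with the same permutation."""
    X = images.reshape(len(images), -1).astype(np.float32)
    perm = np.random.default_rng(seed).permutation(len(X))
    return X[perm], labels[perm]

# Hypothetical stand-in data; with TF2 installed the real arrays would
# come from tf.keras.datasets.mnist.load_data().
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=100)

X, y = flatten_and_shuffle(images, labels, seed=2)
print(X.shape)   # (100, 784)
```

Using one permutation for both arrays keeps every image paired with its label.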

Scaling of MNIST data for clustering?

A good question is whether we should scale or normalize the sample data for clustering – and if so by what formula. I could not answer this question directly; instead I tested multiple methods. Previous experience with PCA and MNIST indicated that Sklearn’s “Normalizer” would be helpful, but I did not take this for granted.

A simple scaling method is to just divide the pixel values by 255. This brings all 784 data array elements of each image into the value range [0,1]. Note that this scaling does not change relative length differences of the sample vectors in the feature space. Neither does it shift or change the width of the data distribution around its mean value. Other methods would be to standardize the data or to normalize them, e.g. by using respective algorithms from Scikit-Learn. Using either method in combination with a cluster analysis corresponds to a theory about the cluster distribution in the feature space. Normalization would mean that we assume that the clusters do not so much depend on the vector length in the feature space but mainly on the angle of the sample vectors. We shall later see what kind of scaling helps when we classify the MNIST data based on clusters.
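
To make the difference between the three options concrete, here is a small NumPy sketch. The function names are my own; normalize_rows and standardize_columns are meant to mirror what Sklearn's Normalizer (L2 norm per sample) and StandardScaler (zero mean, unit variance per feature) do with their default settings. The pixel data are random stand-ins.

```python
import numpy as np

def scale_by_255(X):
    """Bring all pixel values into the range [0, 1]."""
    return X / 255.0

def normalize_rows(X):
    """Scale every sample vector to unit L2 length (per-sample operation)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def standardize_columns(X):
    """Zero mean and unit variance for every feature (per-feature operation)."""
    std = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(50, 784)).astype(float)  # hypothetical pixel data

X255 = scale_by_255(X)
Xn = normalize_rows(X)
Xs = standardize_columns(X)

# Dividing by 255 keeps the ratio of vector lengths ...
ratio_before = np.linalg.norm(X[0]) / np.linalg.norm(X[1])
ratio_after = np.linalg.norm(X255[0]) / np.linalg.norm(X255[1])
print(np.isclose(ratio_before, ratio_after))

# ... while normalization keeps the angles but sets all lengths to 1.
print(np.isclose(cosine(X[0], X[1]), cosine(Xn[0], Xn[1])))
print(np.allclose(np.linalg.norm(Xn, axis=1), 1.0))
```

Normalization works per sample, standardization per feature. This is why the two can affect the cluster structure very differently.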

In a first approach we leave the data as they are, i.e. unscaled.

Parameters for clustering

All the following cluster calculations were done on 3 out of 8 available (hyperthreaded) CPU cores. For KMeans and MiniBatchKMeans we used

n_init       = 100       # number of initial cluster configurations to test 
max_iter     = 100       # maximum number of iterations  
tol          = 1.e-4     # final deviation of subsequent results (= stop condition)  
random_state = 2         # a random state number for repeatable runs
mb_size      = 200       # size of minibatches (for MiniBatchKMeans) 

The number of clusters “num_clus” was defined individually for each run.

Analysis by KMeans? Too expensive …

A naive approach to perform an elbow analysis, as we did for the WIFI-data, would be to apply KMeans of Sklearn directly to the MNIST data. But a test run on the CPU shows that such an endeavor would cost too much time. With 3 CPU cores and only a very limited number of clusters and iterations

n_init   = 10      # only a few initial configurations
max_iter = 50 
tol      = 1.e-3  
num_clus = 25      # only a few clusters

a KMeans fit() run applied to 60,000 training samples [len(X_train) => 60,000]

kmeans.fit(X_train)

requires around 42 secs. For 200 clusters the cluster analysis requires around 214 secs. Doing an elbow analysis would therefore require many hours of computational time.
To overcome this problem I had to use MiniBatchKMeans. It is faster by a factor of more than 80.

Elbow analysis with the help of MiniBatchKMeans

When we use the following setting for MiniBatchKMeans

n_init = 50 # only a few initial configurations
max_iter = 100
tol = 1.e-4
mb_size = 200 

I could perform an elbow analysis for all cluster numbers 1 < k <= 250 in less than 20 minutes. The following graphic shows the resulting inertia curve vs. the cluster number:

The “elbow” is not very pronounced. But I would say that by using a cluster number around 200 we are on the safe side. By the way: The shape of the curve does not change very much when we apply Sklearn’s Normalizer to the MNIST data ahead of the cluster analysis.
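
An elbow analysis of this kind can be sketched as a simple loop over cluster numbers. The code below is a hedged illustration and not the script used for the plot: the helper inertia_curve is my own, the data are synthetic blobs instead of MNIST, and n_init is reduced to 3 to keep the run short.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def inertia_curve(X, k_values, batch_size=200, n_init=3,
                  max_iter=100, tol=1e-4, seed=2):
    """Fit MiniBatchKMeans for every k and collect the resulting inertia."""
    inertias = []
    for k in k_values:
        mbk = MiniBatchKMeans(n_clusters=k, batch_size=batch_size,
                              n_init=n_init, max_iter=max_iter,
                              tol=tol, random_state=seed)
        mbk.fit(X)
        inertias.append(mbk.inertia_)
    return np.array(inertias)

# Hypothetical test data: 5 well separated blobs in 20 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(5, 20))
X = np.vstack([c + rng.normal(size=(200, 20)) for c in centers])

k_values = [2, 3, 5, 8, 12]
inertias = inertia_curve(X, k_values)
for k, val in zip(k_values, inertias):
    print(k, round(float(val), 1))
```

For such artificial blobs the inertia drops sharply until k reaches the true number of blobs and flattens afterwards. For MNIST the transition is much smoother.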

Classifying unscaled data with the help of clusters

We now perform a prediction of our adapted cluster algorithm regarding the cluster membership for the training data and for k=225 clusters:

n_clu    = 225
mb_size  = 200
max_iter = 120
n_init   = 100
tol      = 1.e-4

Based on the resulting data we afterward apply the same type of algorithm which we used for the WIFI data to construct a “classifier” based on clusters and a respective predictor function (see the last post of this series).
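
The core idea of such a cluster based classifier can be sketched in a few lines. I assume here that every cluster is simply assigned the class which dominates among its training samples; the function names and the toy arrays are hypothetical and only illustrate the principle.

```python
import numpy as np

def map_clusters_to_classes(cluster_ids, labels, n_clusters, n_classes=10):
    """Count the class labels per cluster and pick the dominant class.
    Note: an empty cluster would be mapped to class 0."""
    counts = np.zeros((n_clusters, n_classes), dtype=int)
    np.add.at(counts, (cluster_ids, labels), 1)
    return counts.argmax(axis=1), counts

def predict_classes(cluster_ids, cluster_to_class):
    """Predict the class of a sample from the cluster it was assigned to."""
    return cluster_to_class[cluster_ids]

# Toy example: cluster membership (e.g. from predict()) and true labels.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
labels      = np.array([7, 7, 1, 4, 4, 9, 9, 4, 9])

cluster_to_class, counts = map_clusters_to_classes(cluster_ids, labels, n_clusters=3)
pred = predict_classes(cluster_ids, cluster_to_class)

print(cluster_to_class)              # dominant class per cluster: 7, 4, 9
print(int((pred != labels).sum()))   # 2 misclassified samples
```

For test data one would first determine the cluster membership with the fitted cluster model and then apply the same mapping.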

The data distribution for the 10 different digits of the training set was:

class 0 :  5905
class 1 :  6721
class 2 :  6031
class 3 :  6082
class 4 :  5845
class 5 :  5412
class 6 :  5917
class 7 :  6266
class 8 :  5860
class 9 :  5961

How well does the cluster membership of a sample define its digit class?
Well, out of 225 clusters there were only around 15 for which I got an “error” above 40%, i.e. for which the relative fraction of data samples deviating from the dominant class of the cluster was above 40%. For the vast majority of clusters, however, samples of one specific digit class dominated the cluster’s members by more than 90%.
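
The “error” of a cluster in this sense can be computed directly from a table of class counts per cluster. A minimal sketch with made-up counts (rows are clusters, columns are classes):

```python
import numpy as np

def cluster_errors(counts):
    """Fraction of samples per cluster which do not belong to the
    dominant class. An empty cluster gets the error 1.0."""
    sizes = counts.sum(axis=1)
    dominant = counts.max(axis=1)
    return 1.0 - dominant / np.maximum(sizes, 1)

# Hypothetical counts for 3 clusters and 3 classes.
counts = np.array([[95,  3,  2],
                   [10, 50, 40],
                   [ 0,  1, 99]])

err = cluster_errors(counts)
print(np.round(err, 2))         # errors 0.05, 0.5 and 0.01
print(int((err > 0.4).sum()))   # 1 cluster with an error above 40%
```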

The resulting confusion matrix of our new “cluster classifier” for the (unscaled) MNIST data looks like

[[5695    4   37   21    7   57   51   15   15    3]
 [   0 6609   33   21   11    2   15   17    2   11]
 [  62   45 5523  120   14   10   27  107  116    7]
 [  11   43  114 5362   15  153    8   60  267   49]
 [   5   60   62    2 4752    3   59   63    5  834]
 [  54   18  103  777   25 4158  126    9  110   32]
 [  49   20   56    4    6   38 5736    0    8    0]
 [   5   57   96    2   86    1    0 5774    7  238]
 [  30   76  109  416   51  152   39   35 4864   88]
 [  25   20   37   84  706   14    6  381   46 4642]]

This confusion matrix comes as no surprise: The digits “4”, “5”, “8”, “9” are somewhat error-prone. Actually, everybody familiar with MNIST images knows that sometimes “4”s and “9”s can be mixed up even by the human eye. The same is true for handwritten “5”s, “8”s and “3”s.
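
The overall error rate and the strongest single confusion can be read off the matrix above with a few lines of NumPy. The numbers are the ones printed above; only the variable names are mine.

```python
import numpy as np

conf_train = np.array([
    [5695,    4,   37,   21,    7,   57,   51,   15,   15,    3],
    [   0, 6609,   33,   21,   11,    2,   15,   17,    2,   11],
    [  62,   45, 5523,  120,   14,   10,   27,  107,  116,    7],
    [  11,   43,  114, 5362,   15,  153,    8,   60,  267,   49],
    [   5,   60,   62,    2, 4752,    3,   59,   63,    5,  834],
    [  54,   18,  103,  777,   25, 4158,  126,    9,  110,   32],
    [  49,   20,   56,    4,    6,   38, 5736,    0,    8,    0],
    [   5,   57,   96,    2,   86,    1,    0, 5774,    7,  238],
    [  30,   76,  109,  416,   51,  152,   39,   35, 4864,   88],
    [  25,   20,   37,   84,  706,   14,    6,  381,   46, 4642],
])

n_total = int(conf_train.sum())        # all training samples
n_correct = int(np.trace(conf_train))  # diagonal = correctly classified
rel_err = 1.0 - n_correct / n_total
print(n_total, n_correct, round(rel_err, 3))   # 60000 53115 0.115

# Largest off-diagonal element: row = true digit, column = predicted digit.
off_diag = conf_train.copy()
np.fill_diagonal(off_diag, 0)
true_digit, pred_digit = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(int(true_digit), int(pred_digit), int(off_diag.max()))   # 4 9 834
```

The row sums reproduce the class distribution of the training set given above, which confirms that the rows correspond to the true digit classes.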

Another representation of the confusion matrix is:

The calculation of the matrix elements was done in the standard way – the sum over percentages in a row gives 100% (slight deviations in the matrix are due to rounding). I.e. the off-diagonal elements of a row show errors of the type FN (False Negatives) for the respective digit class.
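
The row-wise normalization can be sketched as follows. The 3x3 matrix is a made-up example; the rows are assumed to be the true classes, as in the confusion matrix of counts above.

```python
import numpy as np

def row_percentages(conf):
    """Convert counts to percentages such that every row sums to 100."""
    conf = np.asarray(conf, dtype=float)
    return 100.0 * conf / conf.sum(axis=1, keepdims=True)

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
conf = [[90,  5,  5],
        [10, 80, 10],
        [ 0, 20, 80]]

pct = row_percentages(conf)
print(np.round(pct, 1))
print(pct.sum(axis=1))   # every row sums to 100
```

The off-diagonal entries of a row then give the percentage of samples of that class which were assigned to some other class.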

The confusion matrix for the remaining 10,000 test data samples is:

The relative errors we get for our classifier when applied to the training and test data are

rel_err_train = 0.115 ,
rel_err_test = 0.112

All for unscaled MNIST data. Taking into account the crudeness of the whole approach this is a rather convincing result. It proves that it is worth the effort to perform a cluster analysis on high dimensional data:

  • It provides a first impression whether the data are structured in the feature space such that we can find relatively good separable clusters with dominant members belonging to just one class.
  • It also shows that a cluster based classification for many datasets cannot reach accuracy levels of CNNs, but that it may still deliver good results. Without any supervised training …

The second point also proves that the distance of the data points to the various cluster centers contains valuable information about the class membership. So, an MLP or CNN based classification could be performed on transformed MNIST data, namely distance vectors of sample data points to the different cluster centers. This corresponds to a dimension reduction of the classification problem. Actually, in a different part of this blog, I have already shown that such an approach delivers accuracy values beyond 98%.
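
The transformation into distance vectors can be sketched with plain NumPy. The function name and the random stand-in data are assumptions for illustration; for a fitted Sklearn cluster model the transform() method returns the same kind of distance matrix.

```python
import numpy as np

def distances_to_centers(X, centers):
    """Euclidean distance of every sample to every cluster center.
    Result shape: (n_samples, n_clusters).
    Note: the broadcasting below is memory hungry for large data sets."""
    diff = X[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

rng = np.random.default_rng(0)
X = rng.random((6, 784))   # 6 hypothetical samples in the 784-dim space
centers = X[:3].copy()     # 3 hypothetical cluster centers

D = distances_to_centers(X, centers)
print(D.shape)             # (6, 3)
print(D.argmin(axis=1))    # index of the nearest center per sample
```

With 225 clusters each image would be described by 225 distances instead of 784 pixel values.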

For MNIST we can say that the samples define a relatively well separable cluster structure in the feature space. The granularity required to resolve classes sufficiently well means a cluster number in the range 200 < k < 250. Then we get an accuracy close to 90% for cluster based classification.

t-SNE representation of the MNIST data

Can we somehow confirm this finding about a good cluster-class-relation independently? Well, in a limited way. The t-SNE algorithm, which can be used to “project” multidimensional data onto a 2-dimensional plane, respects the vicinity of vectors in the original feature space whilst deriving a 2-dim representation. So, a rather well structured t-SNE diagram is an indication of clustering in the feature space. And indeed for 10,000 randomly selected samples of the (shuffled) training data we get:

The colorization was done by classes, i.e. digits. We see a relatively good separation of major “clusters” with data points belonging to a specific class. But we also can identify multiple problem zones, where data points belonging to different classes are intermixed. This explains the confusion matrix. It also explains why we need so many fine-grained clusters to get a reasonable resolution regarding a reliable class-cluster-relation.
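
A t-SNE projection of this kind can be produced with Sklearn's TSNE class. The sketch below uses a small synthetic data set instead of MNIST and a perplexity value chosen by me; it is not the code behind the plot above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-in for MNIST: 3 "classes" with 100 samples each
# in a 50-dimensional feature space.
centers = rng.normal(scale=8.0, size=(3, 50))
X = np.vstack([c + rng.normal(size=(100, 50)) for c in centers])
y = np.repeat(np.arange(3), 100)

tsne = TSNE(n_components=2, perplexity=30.0, random_state=0)
emb = tsne.fit_transform(X)   # 2-dim coordinates, shape (300, 2)
print(emb.shape)

# For a plot one would color the points by class, e.g. with matplotlib:
# plt.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab10")
```

Note that t-SNE preserves neighborhoods, but not global distances. The size and the relative position of the "islands" in such a plot should therefore not be over-interpreted.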

Classifying scaled and normalized MNIST data with the help of clusters

Can we improve the accuracy of our cluster based classification a bit? This would, e.g., require some transformation which leads to a better cluster separation. To see the effect of two different scalers I tried the “Normalizer” and then also the “StandardScaler” of Sklearn. Actually, they work in opposite direction regarding accuracy:

The “Normalizer” improves accuracy by more than 1.5%, while the “StandardScaler” reduces it by almost the same amount.

I only discuss results for “Normalization” below. The confusion matrix for the training data becomes:

and for the test data:

The relative errors for the training and the test data are

Error for training data:
avg_err_train = 0.085 :: num_err_train = 5113
 
Error for test data:
avg_err_test = 0.083 :: num_err_test = 832

So, the relative accuracy is now around 91.5%.
The result depends a bit on the composition of the training and the test dataset after an initial shuffling. But the value remains consistently above 90%.

Data compression by Autoencoders and clustering

Just for interest I also had a look at a very different approach to invoke clustering:

I first applied a simple CNN-based AutoEncoder [AE] to compress the MNIST data into a 25-dimensional space and applied our clustering methods afterwards.

I shall not discuss the technology of autoencoders in this post. The only relevant point in our context is that an autoencoder provides an efficient non-linear way of data compression and dimensionality reduction, among many other useful properties and abilities … Note: I did not use a “Variational Autoencoder” which would have allowed for even better results. The loss function for the AE was a simple quadratic loss. The autoencoder was trained on 50,000 training samples and for 40 epochs.

A t-SNE based plot of the “clusters” for test data in the 25-dimensional space looks like:

We see that the separation of the data belonging to different classes is somewhat better than before. Therefore, we expect a slightly better classification based on clusters, too. Without any scaling we get the following confusion data:

[[5817    7   10    3    1   14   15    2   27    1]
 [   3 6726   29    2    0    1   10    5   12   10]
 [  49   35 5704   35   14    4   10   61   87    7]
 [   8   78   48 5580   22  148    2   40  111   29]
 [  47   27   18    0 4967    0   44   38    3  673]
 [  32   20   10  150    8 5039   73    4   43   28]
 [  31   11   23    2    2   47 5746    0   15    1]
 [   6   35   35    6   32    0    1 5977    7  163]
 [  17   67   22   86   16  217   24   22 5365   52]
 [  35   32   11   92  184   15    1  172   33 5406]]

Error averaged over (all) clusters :  6.74

The resulting relative error for the test data was:

avg_err_test = 0.0574 :: num_err_test = 574

With Normalization:

Error for test data:
avg_err_test = 0.054 :: num_err_test = 832

So, after performing the autoencoder training on normalized data we consistently get an accuracy of around 94%.

This is not too much of a gain. But remember:
We performed a cluster analysis on a feature space with only 25 dimensions – which of course is much cheaper. However, we paid a price, namely the Autoencoder training, which lasted about 150 secs on my old Nvidia GTX 960.

And note: Even with only 100 clusters we get above 92% on the AE-compressed data.

Conclusion

We have shown that a non-supervised cluster analysis of the MNIST data with around 225 clusters allows for classifying images with an accuracy of around 90.5%. In combination with an Autoencoder compression we even reach values around 94%. This is comparable with other non-optimized standard algorithms apart from neural networks.

This means that the MNIST data samples are organized in a well separable cluster structure in their feature space. A test run with normalized data showed that the clusters (and their centers) differ mostly by their direction relative to the origin of the feature space and not so much by their distance from the origin. With a relatively fine grained resolution we could establish a simple cluster-class-relation which allowed for cluster based classification.

The accuracy is, of course, below the values reachable with optimized MLPs (98%) and CNNs (above 99%). But clustering is a fast, reliable and non-supervised method. In addition, in combination with t-SNE we can create plots which can easily be understood by customers. So, even for more complex data I would always recommend trying a cluster based classification approach if you need to provide plots and quick results. Sometimes the accuracy may even be sufficient for your customer’s purposes.

Blender – even on old laptops a graphics card increases rendering performance

My present experiments with Blender on my old laptop take considerable time to render, especially animations. So, I got interested in whether rendering on the laptop’s old Nvidia card, a GT 645M, would make a difference in comparison to rendering on the available 8 hyperthreaded cores of the CPU. The laptop’s CPU is an old one, too, namely an i7-3632QM. The laptop’s operating system is Opensuse Leap 15.3. The system uses Optimus technology. To switch between the Nvidia card and the Intel graphics I invoke Suse’s Prime Select application on KDE.

Rendering on the GPU was faster by a factor of 2 up to 5.2 in comparison to the CPU. The difference depends on multiple factors. The number of CPU cores used is an important one.

How to activate GPU rendering in Blender?

Basically three things are required: (1) A working recent Nvidia driver (with compute components) for your graphics card. (2) A certain setting in Blender’s preferences. (3) A setting for the Cycles renderer.

Regarding the CUDA toolkit I quote from Blender’s documentation

Normally users do not need to install the CUDA toolkit as Blender comes with precompiled kernels.

With respect to required Blender settings one has to choose a CUDA capable device via the menu point “Preferences >> System”:

You may also select both the GPU and the CPU. Then rendering will be done both on the GPU and the CPU. My graphics card unfortunately only understands a low level of CUDA instructions. The Nvidia driver I used is of version 470.103.01, installed via Opensuse’s Nvidia community repository:

In addition, you must set an option for the Cycles renderer:

With all these settings rendering on the GPU was faster by a factor of 2 up to > 6 in comparison to a CPU with multiple cores.

The difference in performance, of course, depends on

  • the number of threads used on the CPU with 8 (hyperthreaded) cores available to the Linux OS
  • tiling – more precisely the “tile size” – in case of the GPU and the CPU

All other render options with the exception of “Fast GI” were kept constant during the experiments.

Scene Setup

To give Blender’s Cycles renderer something to do I set up a scene with the following elements:

  • a mountain-like landscape (via the A.N.T Landscape Add-On) with a subdivision of 256 to 128 – plus subdivision modifier (Catmull-Clark, render level 2, limit surface quality 3) – plus simple procedural texture with some noise and bumps
  • a plane with an “ocean” modifier (no repetition, waves + noisy bump texture for the normal to simulate waves)
  • a world with a sky texture of the Nishita type (blue sky due to much oxygen, some dust and a sun just above the horizon)

The scene looked like

The central red rectangle marks the camera perspective and the area to be rendered. With 80 samples and a resolution of 1200×600 we get:

The hardest part for the renderer is the reflection on the water (Ocean with wave and texture). Also the “landscape” requires some time. The Nishita world (i.e. the sky with the sun), however, is rendered pretty fast.

Required time for rendering on multiple CPU cores

I used 40 samples to render – no denoising, progressive multi-jitter, 0 minimum bounces.
Other settings can be found here:


The number of threads, the tile size and the use of the Fast GI approximation were varied.
The resolution was chosen to be 1200×600 px.

All data below were measured on a flatpak installation of Blender 3.1.2 on Opensuse Leap 15.3.

tile size   threads   Fast GI   time [secs]
64          2         no        82.24
128         2         no        81.13
256         2         no        81.01
32          4         no        45.63
64          4         no        43.73
128         4         no        43.47
256         4         no        43.21
512         4         no        44.06
128         8         no        31.25
256         8         no        31.04
256         8         yes       26.52
512         8         no        31.22

A tile size of 256×256 seems to provide an optimum regarding rendering performance. In my experience this depends heavily on the scene and the chosen image resolution.

“Fast GI” gives you a slight, but noticeable improvement. The differences in the rendered picture could only be seen in relatively tiny details of my special test case. It may be different for other scenes and illumination.

Note: With 8 CPU cores activated my laptop was stressed regarding CPU temperature: It went up to 81° Celsius.

Required time for rendering on the mobile GPU

Below are the rendering times on the mobile Nvidia GPU GT 645M:

tile size   Fast GI   time [secs]
64          no        18.3
128         no        16.47
256         no        15.56
512         no        15.41
1024        no        15.39
1200        no        15.21
1200        yes       12.80

Bigger tile sizes improve the GPU rendering performance! This may be different for rendering on a CPU, especially for small scenes. There you have to find an optimum for the tile size. Again, we see an effect of Fast GI.
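
To get a feeling for what the tile size means for an image of 1200×600 px, one can count the tiles. The sketch assumes square tiles with the given edge length and cropped tiles at the image border; this is a simplification of mine and not taken from the Blender documentation.

```python
import math

def tile_count(width, height, tile_size):
    """Number of square tiles needed to cover an image of width x height."""
    return math.ceil(width / tile_size) * math.ceil(height / tile_size)

for tile in (64, 128, 256, 512, 1024, 1200):
    print(tile, tile_count(1200, 600, tile))
```

Under these assumptions the image is split into 190 tiles for a tile size of 64, but is covered by a single tile for a tile size of 1200.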

Note: The temperature of the mobile graphics card never rose above 58° Celsius. I measured this whilst rendering a much bigger image of 4800×2400 px. I therefore think that the temperature stress which Blender rendering exerts on the GPU is relatively small in comparison to the heat stress on a CPU.

Required time for rendering both on the CUDA capable mobile GPU and the CPU

In the “Preferences” settings one can activate the CPU as a render device in addition to the CUDA capable GPU. With 4 CPU cores this brings you down to around 11 secs, with 8 cores down to 10 secs.

tile size   threads   Fast GI   time [secs]
64          4         no        11.01
128         8         no        10.08
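
The measured times translate into speed-up factors by a simple division. The snippet below uses the best CPU time per thread count (without Fast GI) and the GPU time for a tile size of 1200 (without Fast GI) from the measurements above; the variable names are mine.

```python
# Best CPU times in secs per number of threads (no Fast GI).
cpu_times = {2: 81.01, 4: 43.21, 8: 31.04}
# GPU time in secs for tile size 1200 (no Fast GI).
gpu_time = 15.21

speedups = {threads: t / gpu_time for threads, t in cpu_times.items()}
for threads, factor in speedups.items():
    print(threads, round(factor, 2))
# 2 threads: 5.33, 4 threads: 2.84, 8 threads: 2.04
```

The gain thus shrinks from a factor above 5 to about 2 when more CPU threads are used for the comparison.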

Conclusion

Even on an old laptop with Optimus technology it is worthwhile to use a CUDA capable Nvidia graphics card for Cycles based rendering in Blender experiments. The rise in temperature was relatively low in my case. The gain in performance may range from a factor 2 to 5 depending on how many CPU cores you can invoke without overheating your laptop.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

 

Blender – complexity inside spherical and concave cylindrical mirrors – VI – first experiment with a half-sphere

I had some fun yesterday with a first experiment I did in Blender with a fully reflective concave half-sphere. This is a further step in this series about the construction of some special reflecting surfaces in Blender for virtual optical experiments.

Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions
Blender – complexity inside spherical and concave cylindrical mirrors – II – a step towards the S-curve
Blender – complexity inside spherical and concave cylindrical mirrors – III – a second step towards the S-curve
Blender – complexity inside spherical and concave cylindrical mirrors – IV – reflective images of a Blender variant of Mr Kapoor’s S-curve
Blender – complexity inside spherical and concave cylindrical mirrors – V – a video of S-curve reflections

One of the most spectacular effects in a NASA movie about a real half-sphere (see the first post of this series) is the virtual finger that seems to come out of the sphere when the physicist moves his own finger towards the center. I thought this might be a nice Blender experiment on a free day with bad weather outside. The finger had to be replaced by a small elongated NURBS based cylinder in Blender. The main objective was to find some cylinder-like figure coming out of a half-sphere built in Blender. But as I was playing around I shot some additional images of the half-sphere from different positions to verify the appearance of ring-like patterns at the outer edges of the half-sphere, too.

Building the half-sphere

I created the half-sphere for this experiment as a mesh with 64×64 faces, which I cut into two parts afterwards. I added the “subdivision modifier” at the end to get a smooth surface. But, probably, you could also start with a high resolution NURBS sphere. I have not yet tested which method results in faster rendering. Reflectivity was achieved by setting the material to be fully metallic without any roughness. The base color of the material was set to pure white.

Scene setting

The geometrical setup I chose for this post is the following:

The image above reflects my first camera perspective. What we want to see from this perspective is a mirrored pattern of the cylinder-like (NURBS) object – appearing in front of the background mirror-surface and opposite of the real object. Such an appearance would of course just be an illusion, but it is the result of simple optics. The light rays reaching the camera stem of course from the inner surface of the half-sphere. Note that the cylinder is placed on the central symmetry axis of the half-sphere. We later give it a metallic red surface. Note also that the NURBS cylinder is an open figure at both ends and not a solid one. So, we will get some tiny reflection patterns on the inside of the cylinder, too.
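
A rough idea of why an image can appear in front of the mirror is given by the mirror equation 1/s + 1/s' = 2/R for a concave spherical mirror. The sketch below is only valid in the paraxial approximation, i.e. for rays close to the symmetry axis. A full half-sphere violates this condition strongly, so the numbers are merely indicative. The radius and the object distances are hypothetical values and not taken from my Blender scene.

```python
def image_distance(object_distance, radius):
    """Paraxial mirror equation 1/s + 1/s' = 2/R for a concave mirror.
    All distances are measured from the mirror's vertex along the axis.
    A positive result means a real image in front of the mirror,
    a negative one a virtual image behind the mirror surface."""
    focal_length = radius / 2.0
    if object_distance == focal_length:
        return float("inf")   # reflected rays are parallel, no finite image
    return 1.0 / (1.0 / focal_length - 1.0 / object_distance)

R = 2.0   # hypothetical radius of the half-sphere
for s in (3.0, 2.0, 1.5, 0.5):
    print(s, round(image_distance(s, R), 3))
```

For an object at the center of curvature (s = R) the image lies at the center, too. An object between the focal point and the center gets a real image farther away from the mirror than the object itself. This is the kind of image that seems to float in front of the mirror surface.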

The small sphere is placed off the symmetry axis to show a different reflection effect. I gave it a green color. A few light sources have been placed at various positions.

As a background I chose again a kind of dusk with a “sun” just above the horizon. This gives us a distinct contrast rich orange-yellow horizontal stripe, which will later help us to identify multiple reflections inside the half-sphere. Remember that we should get an inner part of the half-sphere showing a sharp ring-like edge marking the border-line where the first double reflection occurs. Other ring-like shapes, which a camera positioned on the central symmetry axis and looking towards the half-sphere’s center should “see”, appear in the outer areas of the half-sphere. The distance between these ring like structures should systematically become smaller.

At the bottom of the half-sphere I have also added an extended reflective plane. This helps us later to watch different reflection patterns coming from two different angles towards the camera. The half sphere is placed a bit above the ground level, i.e. above the plane. This leads to incomplete rings resulting from multiple reflections of the horizon line.

Result for a camera position above and sideways of the cylinder close to the half-sphere

From the perspective shown above we get the following rendered image:

So, indeed we see an artificial cylindrical “figure” coming out of the inner part of the half sphere opposite the real red cylinder. In addition, the sharply edged inner area of the half-sphere is clearly visible – here a bit distorted as our camera is placed off the central axis. At the top of the half-sphere the effects of multiple reflections stemming from the horizontal line can clearly be seen. The green sphere leads to a reflection pattern which already appeared at the concave side of the S-curve (see the previous posts).

Changed camera perspective high above the central symmetry axis

In a second experiment I placed the camera above the central axis at quite a distance away from the half-sphere and relatively high in z-direction. A yellow and somewhat extended light source has been added to get one more reflection pattern.

We clearly see the expected ring like structures due to multiple reflections at the upper edge of the half sphere. It seems that the basic optics works well in Blender. Something which is unrealistic in the images I create is the lossless reflection; in reality reflective patterns would get a bit weaker in intensity with every reflection.

Camera perspective from a position on the central symmetry axis

If we put the camera down to the central axis, the green sphere to the same height and switch off the yellow light sources we get the following result:

A bit off the symmetry axis

Finally, let us move the camera a bit off the symmetry axis and move the green sphere into a position closer to, but not onto, the main axis. Then we get the following picture:

Conclusion

As we expected after our experience with the S-curve we get interesting reflection patterns from a mirroring half-sphere. We showed that Blender reproduces a pattern where a figure seems to grow out of the half-sphere when we move an elongated body into the sphere along the main symmetry axis. We also saw a strong dependency of the reflection patterns on the position of the camera. So, all is well prepared for more experiments.
And, again, a basic lesson is: Some things appear bigger than they are, because we look at a bloated image instead of the real thing. If only things were that easy in politics, too.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.