KMeans as a classifier for the WIFI and MNIST datasets – I – Cluster analysis of the WIFI example

In the November and December 2021 editions of the German “Linux Magazin” R. Pleger discussed a simple but nevertheless interesting example for the application of a cluster algorithm. His test case was based on a dataset from the UCI Irvine Machine Learning Repository. This dataset contains 2000 samples with (fictitious?) data describing WIFI signals stemming from seven WLAN spots around a building. The signal strength of each source was measured at varying positions in four different rooms. I call the whole setup the “WIFI example” below.

One objective of the articles in the Linux Magazin was to demonstrate how simple it is today to apply basic Machine Learning methods. In a first step the author used an ML classifier algorithm to determine the location (i.e. the room) of a measuring instrument just from the strengths of the different WIFI signals. This task can be solved by a variety of algorithms – e.g. by a Decision Tree, SVM/SVC or a simple Multilayer Perceptron. The author used Sklearn’s RandomForestClassifier. This method is a good example of the powerful “Ensemble Learning” technique. When applied to the simple and well-structured WIFI example it predicts the rooms for test samples with an accuracy of more than 98%.
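
As a quick illustration of this first step, here is a minimal sketch with Sklearn’s RandomForestClassifier – my own code, not the code of the Linux Magazin articles; the file path, column names and hyperparameters are assumptions:

```python
# Minimal sketch: classify the room from the 7 WIFI signal strengths
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The UCI file contains 7 signal columns plus the room label
df = pd.read_csv("wifi_localization.txt", sep=r"\s+", header=None,
                 names=[f"wlan_{i}" for i in range(7)] + ["room"])

X = df.drop(columns="room").values
y = df["room"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, rfc.predict(X_test)))
```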

The author afterwards performed a deeper analysis of the WIFI data via KMeans, MiniBatchKMeans and PCA. His second article underlined a major question which sometimes is not taken seriously enough:

Do the data, which we feed into ML algorithms, really cover all aspects of the problem? Is the set of target labels complete or sufficient in the sense that the separation of the samples into labeled groups really reflects the problem’s internal structure? Or do the data contain more information than the labels reveal?

Unfortunately, in my opinion, the Linux Magazin covered an important point – namely the relation of the results of a PCA analysis to a 2-dimensional cluster visualization – in an incomplete and slightly misleading way. In addition, another interesting question was not discussed at all:

Can we also use KMeans as a classifier? How would we do this?

In this series of posts I want to dig a bit deeper into these topics – both for the WIFI example and also for the MNIST dataset. For MNIST we will not be able to visualize clusters as easily as for the WIFI example. Therefore, we should have a clear idea about what we do when we use clusters for classifying.

In this first post I focus on the results of a cluster analysis for the WIFI example. In a second article I will discuss the relation of cluster results to a PCA analysis. A third post will then present a very simple method to turn a cluster algorithm into a classifier. In later articles we shall transfer our knowledge to the MNIST data. More precisely: We shall combine a PCA analysis with a cluster classifier to predict the labels of handwritten digit images. We will use the PCA technique to reduce the dimensions of the MNIST feature space from 784 down to below 80. It will be interesting to see what accuracy we can reach with a relatively crude clustering approach on only about 30 main PCA components. As a side aspect we shall also have a look at standardization and normalization of the MNIST data.

I do not present any code in the first three posts as the required Python programs can be built in a relatively straightforward way and most of the core statements were already given in the Linux Magazin. Unfortunately, you have to buy the articles of the magazine; but see https://www.linux-magazin.de/ausgaben/2021/11/maschinenlernen/. However, as soon as we turn to MNIST I shall provide a Jupyter notebook.

The WIFI example: Two thousand samples, each with data for the signal strengths of seven WLAN sources, measured in four rooms

You can download the WIFI dataset from the following address:
https://archive.ics.uci.edu/ml/machine-learning-databases/00422/wifi_localization.txt

The feature space of this example is 7-dimensional: 7 WLAN spots provide WIFI signals in the building. We have 2000 samples. Each sample provides the signal strength of each of the WLAN sources, measured at different times and positions within a specific room. An integer number in [1,4] is provided as a label which identifies the room. The following plot shows the interpolated frequency distribution over the signal strength for each of the 7 signals in the four rooms:
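
Such distribution plots can be reproduced with a short sketch along the following lines (the column names are my own choice; the KDE plots require scipy):

```python
# Sketch: interpolated frequency distributions (KDEs) of the 7 signals per room
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("wifi_localization.txt", sep=r"\s+", header=None,
                 names=[f"wlan_{i}" for i in range(7)] + ["room"])

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
for ax, (room, grp) in zip(axes.ravel(), df.groupby("room")):
    for i in range(7):
        grp[f"wlan_{i}"].plot.kde(ax=ax, label=f"signal {i}")
    ax.set_title(f"room {room}")
    ax.set_xlabel("signal strength [dBm]")
    ax.legend(fontsize=7)
plt.tight_layout()
plt.show()
```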

Cluster analysis of the WIFI data – more than four rooms?

The original label data of the WIFI example imply the existence of four rooms. But can we trust this information? The measurements in the room which we called “Diele” [entrance hall] in the plots above indicate a consistent second peak for both the signals 0 and 3. Is this due to an opening into another room?

A simple method to analyze the inner structure of the distribution of data points in a configuration or feature space is a “cluster analysis”. The KMeans algorithm provides such an analysis for an assumed number of clusters.

KMeans is a basic but important ML method which reveals a lot about the data distribution in feature space and indirectly about the complexity of hyperplanes required to separate data according to their labels. Among other things KMeans determines the positions of cluster centers – the so-called centroids – by measuring and systematically optimizing distances of samples to assumed centroids. Actually, the sum over all intra-cluster variances, i.e. the summed quadratic distances of the associated samples to their cluster’s centroid, is minimized. The respective quantity is called the “inertia” of the cluster distribution. See e.g. the excellent book of P. Wilmott, “Machine Learning – An Applied Mathematics Introduction”, on this topic.
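
Written as a formula (in my own notation): for K clusters C_k with centroids μ_k, KMeans minimizes the inertia

```latex
J \;=\; \sum_{k=1}^{K} \, \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2}
```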

A simple method to find out into how many clusters a distribution most probably segregates is to look for an “elbow” in the variation of the inertia with the number of clusters. When we look at the variation of the inertia values with the number of potential clusters “k” for the WIFI example we get the following curve:
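
A sketch of how such a curve can be produced (the range of k values is my choice):

```python
# Sketch: "elbow" analysis - inertia as a function of the cluster number k
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("wifi_localization.txt", sep=r"\s+", header=None)
X = df.iloc[:, :7].values          # the 7 signal columns; column 7 is the label

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (sum of intra-cluster variances)")
plt.show()
```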

This indicates an elbow at k = 4 or 5.

Another method to identify the most probable number of distinct clusters in a multidimensional data point distribution is the so-called “silhouette analysis”. See the book of A. Geron, “Hands-On Machine Learning with Scikit-Learn, Keras and Tensorflow”, 2nd edition, for a description. For the WIFI example the plots of the silhouette score data support the result of the elbow analysis:

The second plot shows ordered silhouette data for k = 3, 4, 5, 6 clusters. Again, we get the most consistent pictures for k = 4 and k = 5.
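
The average silhouette scores behind such plots can be computed with a few lines; a sketch (the detailed per-sample plots would additionally use Sklearn’s silhouette_samples):

```python
# Sketch: average silhouette score for different cluster numbers
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("wifi_localization.txt", sep=r"\s+", header=None)
X = df.iloc[:, :7].values

for k in (3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k = {k}: mean silhouette score = {silhouette_score(X, labels):.3f}")
```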

So, the data indicate a fragmentation into 4 or 5 clusters. How can we visualize this with respect to the feature space?

Scatter plots for 2-dim sub-spaces of the feature space

A general problem with the visualization of cluster data for multidimensional datasets is that we are limited to 2, at most 3, dimensions. And a projection down to two dimensions may not reflect the real cluster separation in the multidimensional feature space in a realistic way. But sometimes we are lucky.

We shall later see that there are two principal components which dominate the data and signal distributions in the WIFI example. A major question, however, is whether we will also find that only a few original features contribute dominantly to these major components. A PCA analysis does not guarantee that a “principal component” depends on only a few of the original features!

As I did not know the relation of the “principal components” to the original features, I just plotted the results of KMeans for a variety of 2-dim signal combinations. I used Sklearn’s version of KMeans; due to the very small data ensemble KMeans is applicable without consuming too much CPU time (this will change with MNIST; there we need to invoke MiniBatchKMeans):
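
A sketch of how such projections can be plotted (the selected feature pairs are my own choice):

```python
# Sketch: scatter plots of selected 2-dim signal combinations, colored by
# the cluster number which KMeans predicts for each sample (NOT by the label)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("wifi_localization.txt", sep=r"\s+", header=None)
X = df.iloc[:, :7].values

clusters = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

pairs = [(4, 0), (3, 0), (2, 0)]   # my own selection of signal pairs
fig, axes = plt.subplots(1, len(pairs), figsize=(12, 4))
for ax, (i, j) in zip(axes, pairs):
    ax.scatter(X[:, i], X[:, j], c=clusters, s=8, cmap="tab10")
    ax.set_xlabel(f"WLAN-{i} signal")
    ax.set_ylabel(f"WLAN-{j} signal")
plt.tight_layout()
plt.show()
```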

Note that the colorization of the data points in all plots was done with respect to the cluster number predicted by KMeans for the samples – and not with respect to their labels.

It is interesting that the projections onto two special feature combinations – namely the WLAN-4/WLAN-0 and the WLAN-3/WLAN-0 signals – show a very distinct separation of the clusters.

Four or five clusters?

The data displayed above depend a bit on the initial distribution of cluster centers which serves as an input to the KMeans algorithm. But for 4 and 5 clusters we get very consistent results. The next plots show the positions of the centroids:

This time the colorization was done with respect to the labels. What we see is: Five clusters represent the situation a bit better than only 4 clusters.

When we align this with the rooms: Five “rooms” may describe the signal variation better than only 4 rooms. The reason for this might be that one of the four rooms has a wall which partially separates different areas from one another. We often find this in entrance halls [German: “Diele”] of houses. Sketches of the rooms in the Linux Magazin article actually show that this is the case. And, of course, such a wall or an opening into another room would have an impact on the damping of the WLAN signals.

Addendum 19.03.2022: Comparing clusters with groups of labeled data points

An important question which we have not yet answered with the images shown above is the following:

How well do clusters coincide with groups of data points having a specific label?

Note that in general you cannot be sure that clusters reflect data points of the same label. Actually, a cluster only describes a close spatial vicinity of data points in some region of the multidimensional feature space – i.e. some kind of clumping of the data points around certain centroids. But spatial vicinity does not necessarily reflect a label: A label border may separate data points which are very close neighbors, and a cluster may contain a mixture of samples with different labels.

Well, in the case of the WIFI example the identified 4 to 5 clusters match the groups of data points with different labels quite well. Below I superimposed the samples’ data points with different colors: First I colorized the data points according to their labels. On top of the resulting scatter plot I placed the same data points again, but this time with a different and transparent colorization according to their cluster association. In addition I shifted the second data layer a bit to get a better contrast:
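
A sketch of how such an overlay can be produced for one 2-dim projection (the chosen signal pair and the shift value are my assumptions):

```python
# Sketch: labels (bottom layer) vs. clusters (shifted, transparent top layer)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("wifi_localization.txt", sep=r"\s+", header=None)
X = df.iloc[:, :7].values
y = df.iloc[:, 7].values             # room labels 1..4

clusters = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

shift = 0.4                          # small offset for better contrast
plt.scatter(X[:, 0], X[:, 3], c=y, s=10, cmap="viridis")
plt.scatter(X[:, 0] + shift, X[:, 3] + shift, c=clusters, s=10,
            cmap="tab10", alpha=0.4)
plt.xlabel("WLAN-0 signal")
plt.ylabel("WLAN-3 signal")
plt.show()
```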

You see that the areas are not completely identical, but they overlap quite well. As you can see, I used 5 clusters; the fifth cluster, too, fits well into a region characterized by just one label.

Conclusion

The simple WIFI example shows that a cluster analysis may give you new insights into the structure of ML datasets which a simple classifier algorithm cannot provide. In the next article

KMeans as a classifier for the WIFI and MNIST datasets – II – PCA in combination with KMeans for the WIFI-example

we shall link the information contained in the “clusters” to the results of a PCA analysis of the WIFI example.

Stay tuned …

Ceterum censeo: The most important living fascist who must be denazified is the Putler.

Blender – complexity inside spherical and concave cylindrical mirrors – IV – reflective images of a Blender variant of Mr Kapoor’s S-curve

The topic of the last post in this series

Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions
Blender – complexity inside spherical and concave cylindrical mirrors – II – a step towards the S-curve
Blender – complexity inside spherical and concave cylindrical mirrors – III – a second step towards the S-curve

was the construction of a metallic object with basic similarities to the “S-curve” of the artist Anish Kapoor. Our object was a bit more extreme than the artist’s real object; we had smaller curvature radii in two dimensions, and at the outer ends our surface approximated a half-circle boundary curve. We could therefore expect multiple reflections of light rays on the concave side(s) of our virtual object when applying ray tracing.

At the end of my last article I already presented some images of the reflection of a far distant horizontal line, marked by a sun close to the horizon at dusk or dawn. In this article I am going to add some simple objects – a small red and a small green sphere at varying positions – plus a point-like light source. I take some shots with the virtual Blender camera from different angles and with varying focal lengths. I present the results below without many comments.

What we see is a rich variation of patterns and figures. Mathematically it is all the result of a single and simple mapping operation. Each operation maps one point on our surface to another point on the S-curve (or on a hit sphere). The points are given by millions and millions of light rays which in the end reach our virtual camera from different angles. The basic message is:

Simplicity can create a complexity which our brain would not predict without some deeper analysis. And a complex apparition may be based on simple rules and the selection of special circumstances.

So, besides many other philosophical aspects, Mr. Kapoor’s “S-curve” reveals a very fundamental idea in physics, in certain branches of mathematics and in information theory.

Reflections of a horizon line

Reflections of a horizon line and a red sphere

Reflections of a horizon line, a red and a green sphere

Note that the concave side of the S-curve gives us a first idea about what we can expect from a full half-sphere where even more reflections on the surface are possible before a light ray reaches the camera.

But, in my next article
Blender – complexity inside spherical and concave cylindrical mirrors – V – a video of S-curve reflections
I am first going to produce a movie of objects moving in front of the concave part of the S-curve.

 

Blender – complexity inside spherical and concave cylindrical mirrors – III – a second step towards the S-curve

I continue with my mini-series on how to (re-)build something like the S-curve of Mr. Kapoor in Blender. See:

Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions

In my last post

Blender – complexity inside spherical and concave cylindrical mirrors – II – a step towards the S-curve

I discussed how we can add a smooth transition from convex to concave curvature around the y-axis of an originally flat rectangular Blender mesh positioned in the (y,z)-plane. The rectangle had its longer side in y-direction. Starting from a flat area around the vertical middle axis (in z-direction) of the object, we bent the wings to the left and right around the y-axis with systematically growing curvature, i.e. with shrinking curvature radius, in y-direction. The curvature around the y-axis left of the central z-axis got a different sign than the curvature right of the central z-axis. At the outer edges of both wings we approximated the form of a half cylinder. So curvature became a function of both x and y.

To create a really smooth surface with differentiable gradient and curvature we had to apply a modifier called “Subdivision Surface”. The trick to make this modifier work sufficiently well with only a few data points in z-direction was to keep the curvature almost constant in z-direction for a given y-position along the horizontal axis. We achieved this by putting the vertices of our mesh on the central rotation axis at the same z-(height-)values as the vertices on the circles at the outer edges. In the end we had established a smoothly varying gradient of the surface curvature around the y-axis with the y-coordinate, while the partial derivative of this curvature in z-direction was close to zero for any fixed y-position.

In this post I want to add an “S-curvature” to the object in y-direction. Meaning: We are now going to bend the object along an S-shaped path in the (x,y)-plane. Physically, we are adding curvature in x-direction, more or less constant around two imaginary vertical axes positioned at some distance in y-direction from the central z-axis – and with different signs of the curvature. So we are creating a superposition of a growing curvature around the y-axis with a constant curvature around two vertical axes, one for each of the wings left and right of the central rotational z-axis of our object. Eventually, we build something like what is shown below:

When we look at images of Mr. Kapoor’s real S-curve we see that he keeps the curvature at zero both in z-direction and in a diagonal x/y-direction at the central rotational axis – due to the “S”. The same is true for our object. But in comparison to the real S-curve of Mr. Kapoor our object is more extreme:

The ratio of height in z-direction to length in y-direction is bigger in our case; the object is shorter in y-direction and thus relatively higher in z-direction. The curvature radii around the y-axis are significantly smaller. Our surface approximates full half-circle curves at the outer edges; in contrast to the real S-curve our surface approximates a shifted cut of a cylinder.

We could, of course, adapt our S-curve in Blender a bit more to the real one. However: Our more extreme bending around the y-axis at the outer left and right edges of the object allows for multiple reflections of light falling in along the x-direction onto the concave sides of the object. This is not the case for the real S-curve.

Steps to give the surface an S-shape

How do we get to the surfaces presented in the above figures in Blender? The following steps comprise the construction of a suitable, symmetric and very fine-grained Bezier path and the use of a curve modifier. To achieve a relatively smooth bending in x-direction we must first subdivide our original object into a sufficient number of sections in y-direction, in addition to the already existing subdivisions in z-direction.

Step 1: Subdividing our object in y-direction

The final object mesh constructed in my last post is depicted below:

We “apply” (via the menu “Object” > “Apply”) any rotations we may have performed so far. The object’s center should now coincide with the world center. We position the camera some distance apart from our object in x-direction, but at y = z = 0. The following image shows the camera perspective.

Remember that our object consists of two meshes which we have joined at the central axis. We are now going to divide each half into 32 sections in y-direction.

Go to Edit mode, position the cursor in one half and press Ctrl-R:

Turn your mouse wheel until you have created 32 subdivisions. (The number is shown at the bottom left of your viewport.) Left-click twice to fix the subdivision lines at their present positions; do NOT move the mouse in between the clicks.

Doing this on both halves eventually gives us:

We check that the y- and x-distances of the vertices have equal absolute values. In addition, we check the z-heights of selected vertices and compare them to the heights of the vertices on the outer circles.

Step 2: Design an S-path

We move our object, which has dimensions of 6m in y-direction and 2m in both z- and x-direction, to the left (y = -6m) – just to get some space at the world origin. We now add a Bezier curve to our scene. Its origin is located at the world center and it stretches along the x-axis. Rotate the curve around the z-axis by 90 degrees such that it stretches along the y-axis. Choose a top-view position and rotate the viewport such that the y-axis points to the right.

Stretch the curve to a y-dimension of 6m. Select the curve and go to Edit mode. We now rotate the rightmost and the leftmost tangent bars such that we get the following curve:

You fix the required symmetry by watching and adjusting the position information of the tangent bar’s end vertices in the sidebar of the viewport. Note that a tangent bar has 2 associated handles and vertices. By systematically changing the vertex positions to [(x=-0.5, y=-3.5), (x=0.5, y=-2.5)] and [(x=-0.5, y=2.5), (x=0.5, y=3.5)] on the left and the right side, respectively, you get a seemingly nice flat S-curve.
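
If you prefer scripting over GUI work, a minimal bpy sketch along the following lines should produce an equivalent path. The control point positions at y = ±3 are my own assumption for a curve with a y-dimension of 6m; the handle coordinates are the ones listed above:

```python
# Sketch (bpy): build the flat S-path programmatically
import bpy

bpy.ops.curve.primitive_bezier_curve_add(location=(0.0, 0.0, 0.0))
curve = bpy.context.object
pts = curve.data.splines[0].bezier_points   # the default primitive has 2 points

# Free the handles so that we can place them explicitly
for p in pts:
    p.handle_left_type = p.handle_right_type = 'FREE'

# Left end of the S (control point position is my assumption)
pts[0].co = (0.0, -3.0, 0.0)
pts[0].handle_left = (-0.5, -3.5, 0.0)
pts[0].handle_right = (0.5, -2.5, 0.0)
# Right end of the S
pts[1].co = (0.0, 3.0, 0.0)
pts[1].handle_left = (-0.5, 2.5, 0.0)
pts[1].handle_right = (0.5, 3.5, 0.0)

curve.data.resolution_u = 64   # many evaluation segments => a smooth path
```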

Go to Object mode and change the dimension of the curve in x-direction to 2m. You then clearly see that the path is not a very smooth one, but consists of distinct linear segments. This would be a major problem later on. In addition, the curvature of the S seems to be a bit extreme. We change the x-dimension to just 1m. Then we subdivide the curve into further segments in Edit mode – multiple times.

And we get a smoothly curved and well-dimensioned path:

Eventually, we end up with the following situation:

Step 3: Apply a curve modifier to our object

Now comes an important point: If you have assigned the origin of the curve to its midpoint, you should do the same with your object – see the Internet for the appropriate Blender operations (e.g. via the menu “Object” > “Set Origin”):

We then move our object to y = 3.171m and z = 1m, i.e. a bit further than the rightmost end-point of the path and above the world’s central plane.

Now we add a curve modifier to our object, select the Bezier curve (our path) as the modifier’s target and get:

We center our view and choose a top position. We adjust the y-location of the object until its present center coincides with the world center.
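
For readers who automate such steps: the curve modifier can also be attached with a small bpy sketch; the object names are my own placeholders:

```python
# Sketch (bpy): attach the curve modifier via Python
import bpy

obj = bpy.data.objects["S_Object"]    # our joined and subdivided mesh
path = bpy.data.objects["S_Path"]     # the Bezier curve from step 2

mod = obj.modifiers.new(name="Curve", type='CURVE')
mod.object = path
mod.deform_axis = 'POS_Y'             # deform the mesh along its y-direction
```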

And there we have our personal S-curve – more extreme than Kapoor’s real S-curve – but this is only a question of dimension adjustments AND of the right choice of how to cut the limiting circle mesh in the beginning (see the last post). Our S-curve will show multiple reflections on its concave side(s).

Apply the modifier “Subdivision Surface”

The rest is already routine for us. We apply the Subdivision Surface modifier to get a smooth surface.
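
Again, this can be scripted; a minimal bpy sketch with a placeholder object name and my own choice of subdivision levels:

```python
# Sketch (bpy): add the Subdivision Surface modifier
import bpy

obj = bpy.data.objects["S_Object"]    # name is my placeholder
mod = obj.modifiers.new(name="Subdivision", type='SUBSURF')
mod.levels = 2                        # subdivision level in the viewport
mod.render_levels = 3                 # subdivision level used at render time
```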

When we modify the world’s sky texture a bit, with a sun just at the horizon, we get images like the following – just from the horizon line and from different camera perspectives with different focal lengths (wide-angle shots).

We clearly see multiple reflections in vertical direction on the left concave side of our object.

Complexity out of simple things … Real fun …

Conclusion

Rebuilding something like the S-curve of Mr. Kapoor was hard work for me, as someone who uses Blender just as a hobby tool. But it was worth the effort. In my next post,

Blender – complexity inside spherical and concave cylindrical mirrors – IV – reflective images of a Blender variant of Mr Kapoor’s S-curve

I am going to show what happens when we place objects in front of our S-curve. This will give us a first impression of what might happen with a totally concave surface such as an open half-sphere.
Stay tuned …