Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss

In the last post of this series

Variational Autoencoder with Tensorflow – I – some basics

I have briefly discussed the basic elements of a Variational Autoencoder [VAE]. Among other things we identified an Encoder, which transforms sample data into z-points in a low dimensional “latent space”, and a Decoder which reconstructs objects in the original variable space from z-points.

In the present post I want to demonstrate that a simple Autoencoder [AE] works as expected with Tensorflow 2.8 [TF2]. In contrast to certain versions of Variational Autoencoders which we shall test in the next post. For our AE we use the “binary cross-entropy” as a suitable loss to compare reconstructed MNIST images with the original ones.

A simple AE

I use the functional Keras API to set up the Autoencoder. Later on we shall encapsulate everything in a class, but lets keep things simple for the time being. We design both the Encoder and the Decoder as CNNs and use properly configured Conv2D and Conv2DTranspose layers as their basic elements.

Required Imports

# tensorflow and keras 
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.keras import backend as B 
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import metrics
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, \
                                    AlphaDropout, Concatenate, Rescaling, ZeroPadding2D

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import schedules

z_dim and the Encoder defined with the functional API
We set the dimension of the “latent space” to z-dim = 16.
This allows for a very good reconstruction of images in the MNIST case.

z_dim = 16
encoder_input = Input(shape=(28,28,1)) # Input images pixel values are scaled by 1./255.
x = encoder_input
x = Conv2D(filters = 32, kernel_size = 3, strides = 1, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 64, kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 128, kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
shape_before_flattening = B.int_shape(x)[1:]  # B: Keras backend
x = Flatten()(x)
encoder_output = Dense(self.z_dim, name='encoder_output')(x)
encoder = Model([encoder_input], [encoder_output], name="encoder")

This is almost an overkill for something as simple as MNIST data. The Decoder reverses the operations. To do so it uses Conv2DTranspose layers.

The Decoder

dec_inp_z      = Input(shape=(z_dim))
x = Dense(np.prod(shape_before_flattening))(dec_inp_z)
x = Reshape(shape_before_flattening)(x)
x = Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x) 
x = Conv2DTranspose(filters=32, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x) 
x = Conv2DTranspose(filters=1,  kernel_size=3, strides=1, padding='same')(x)
x = LeakyReLU()(x) 
# Output into a region of 0 to 1 - requires a one hot encoding of classes 
# and scaling of pixel values of the input samples to [0, 1] 
x = Activation('sigmoid')(x)
decoder_output = x
decoder = Model([dec_inp_z], [decoder_output], name="decoder")

The full AE-model

ae_input  = encoder_input
ae_output = decoder(encoder_output)        
AE = Model(ae_input, ae_output, name="AE")

That was easy! Now we compile:

AE.compile(optimizer=Adam(learning_rate = 0.0005), loss=['binary_crossentropy'])

I use the “binary_crossentropy” loss function to evaluate and punish differences between the reconstructed image and the original. Note that we need to scale pixel values of input images (as MNIST images) down to a value range between [0, 1] due to using a sigmoid activation function at the output layer of the Decoder.
Using “binary cross-entropy” leads to a better and faster convergence of the training process:

Training and results of the Autoencoder for MNIST samples

Eventually, we can train our model:

n_epochs = 40
batch_size = 128
AE.fit( x=x_train, y=x_train, shuffle=True, 
         epochs = n_epochs, batch_size=batch_size)

After 40 training steps we can visualize the resulting data clusters in z-space. To get a 2-dimensional plot out of 16-dimensional data requires the use of t-SNE. For 15.000 training data (out of 60.000) we get:

Our Encoder CNN was able to separate the different classes quite well by extraction basic features of the handwritten digits by its Conv2D layers. The test which the AE never had seen before are well separated too:

Note that the different position of the clusters on the 2-dim plots are due to transformations and arrangements of t-SNE and have no deeper meaning.

What about the reconstruction quality?
For z_dim = 16 it is almost perfect: Below you see a sequence of rows presenting original images and their reconstructed counterparts for selected MNIST test data, which the AE had not seen before:

The special case of “z_dim = 2”

When we set z_dim = 2 the reconstruction quality of course suffers:

Regarding class related clusters we can directly plot the data distributions in the z-space:

We see a relatively vast spread of data points in clusters for “6”s and “0”s. We also recognize the typical problem zones for “3”s, “5”s, “8”s on one side and for “4”s, “7”s and “9”s on the other side. They are difficult to distinguish with only 2 dimensions.

Conclusion

Autoencoders are simple to set up and work as expected together with Keras and Tensorflow 2.8.

In the next post

Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution

we shall, however, see that VAEs and their KL loss may lead to severe problems.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

Variational Autoencoder with Tensorflow – I – some basics

Last week I tried to perform some systematic calculations with a “Variational Autoencoder” [VAE] for a presentation about Machine Learning [ML]. I wanted to apply VAEs to different standard datasets (MNIST Fashion, CIFAR 10, Celebrity) and demonstrate some basic capacities of VAE algorithms. A standard task … well covered by several introductory books on ML.

Actually, I had some 2 years old Python modules for VAEs available – all based on recommendations in the literature. My VAE-model setup had been done with Keras. Moreprecisely the version integrated into Tensorflow 2 [TF2]. But I ran into severe trouble with my present Tensorflow versions, namely 2.7 and 2.8. The cause of the problems was my handling of the so called Kullback-Leibler [KL] loss in combination with the “eager execution mode” of TF2.

Actually, the problems may already have arisen with earlier TF versions. With this post series I hope to save time for some readers who, as myself, are no ML professionals. In a first post I will briefly repeat some basics about Autoencoders [AEs] and Variational Autoencoders [VAEs]. In a second post we shall build a simple Autoencoder. In a third post I shall turn to VAEs and discuss some variants for dealing with the KL loss – all variants will be picked from recipes of ML teaching books. I shall demonstrate afterward that some of these recipes fail for TF 2.8. A fourth post will discuss the problem with “eager execution” and derive a practical rule for calculations based on layer specific data. In three further posts I apply the rule in the form of three different variants for the KL loss of VAE models. I will set up the VAE models with the help of the functional Keras API. The suggested solutions all do work with Tensorflow 2.7 and 2.8.

If you are not interested in the failures of classical approaches to the KL loss presented in teaching books on ML and if you are not at all interested in theory you may skip the first four posts – and jump to the solutions.

Some basics of VAEs – a brief introduction

I assume that the reader is already familiar with the concepts of “Variational Autoencoders”. I nevertheless discuss some elements below to get a common vocabulary.

I call the vector space which describes the input samples the "variable space". As AEs and VAEs are typically used for images this “variable space” has many dimensions – from several hundred up to millions. I call the number of these dimensions "n_dim".

An input sample can either be represented as a vector in the variable space or by a multi-ranked tensor. An image with 28×28 pixels and gray “colors” corresponds to a 784 dimensional vector in the variable space or a a “rank 2 tensor” with the shape (28, 28). I use the term “rank” to avoid any notional ambiguities regarding the term “dimensions”. Remember that people sometimes speak of 2-“dimensional” Numpy matrices with 28 elements in each dimension.

An Autoencoder [AE] typically consists of two connected artificial neural networks [ANN]:

  • Encoder : maps a point of the “variable space” to a point in the so called “latent space”
  • Decoder : maps a point of the “latent space” to a point in the “variable space”.

Both parts of an AE can e.g. be realized with Dense and/or Conv2D layers. See the next post for a simple example of a VAE layer-layout. With Keras both ANNs can be defined as individual “Keras models” which we later connect.

Some more details:

  • Encoder:
    The Encoder ANN creates a vector for each input sample (mostly images) in a low-dimensional vector space, often called the “latent space“. I will sometimes use the abbreviation “z-space” for it and call a point in it a “z-point“. The dimensions of the “z-space” are given by a number “z_dim“. A “z-point” is therefore represented as a z-dimensional vector (a special kind of a tensor of rank 1). The Encoder realizes a non-linear transformation of n-dimensional vectors to z-dimensional vectors. Encoders, actually, are very effective data compressors. z_dim may take numbers between 2 and a few hundred – depending on the complexity of data and objects represented by the input samples with much higher dimensions (some hundreds to millions).
  • Decoder:
    The “Decoder” ANN instead takes arbitrary low dimensional vectors from the “latent space” as input and creates or better “reconstructs” tensors with the same rank and dimensions as the input samples. These output samples can, of course, also be represented by n-dimensional vectors in the original variable space. The Decoder is a kind of inversion of the “Encoder”. This is also reflected in its structure as we shall see later on. One objective of (V)AEs is that the reconstruction vector for a defined input sample should be pretty close to the input vector of the very same sample – with some metric to measure distances. In natural terms for images: An output image of a Decoder should be pretty comparable to the input image of the Encoder, when we feed the Decode with the z-dim vector created by the Encoder.

A Keras model for the (V)AE can be build by connecting the models for the Encoder and Decoder. The (V)AE model is trained as a whole to minimize differences between the input (e.g. an image) presented to the Encoder and the created output of the Decoder (a reconstructed image). A (V)AE is trained “unsupervised” in the sense that the “label data” for the training are identical to the encoder input samples.

Encoder and Decoder as CNNs?

When you build the Encoder and the Decoder as a “Convolutional Neutral Networks” [CNNs] they support data compression and reconstruction very effectively by a parallel extraction of basic “features” of the input samples and putting the related information into neural “maps”. (The neurons in such a “map” react sensitively to correlation patterns of components of the tensors representing the input data samples.) For images such CNN networks would accept rank 2 tensors.

Note that the (V)AE’s knowledge about basic “features” (or data correlations) in the input samples is encoded in the weights of the “Encoder” and “Decoder”. The points in the “latent space” which represent the input samples may show cluster patterns. The clusters could e.g. be used in classification tasks. However, without an appropriate Decoder with fitted weights you will not be able to reconstruct a complete image from a point in the z-space. Similar to encryption scenarios the Decoder is a kind of “key” to retrieve original information (even with reduced detail quality).

Variational Autoencoders

Variational Autoencoders take care about a reasonable data organization in the latent space – in addition to data compression. The objective is to get well defined and segregated clusters there for “features” or maybe even classes of input samples. But the clusters should also be confined in a common, relatively small and compact region of the z-space.

To achieve this goal one judges the arrangement of data points, i.e. their distribution in the latent space, with the help of a special loss contribution. I.e. we use specially designed costs which punish vast extensions of the data distribution off the origin in the latent space.

The data distribution is controlled by parameters, typically a mean value “mu” and a standard deviation or variance “var”. These parameters provide additional degrees of freedom during training. The special loss used to control the predicted distribution in comparison to an ideal one is the so called “Kullback-Leibler” [KL] loss. It measures the “distance” of the real data distribution in the latent space from a more confined ideal one.

You find more information about the KL-loss in the books of D. Foster and R. Atienza named in the last section of this post.

Why the fuss about VAEs and the KL-loss at all?

Why are VAEs and their “KL loss” important? Basically, because VAEs help you to confine data in the z-space. Especially when the latent space has a very low number of dimensions. VAEs thereby also help to condense clusters which may be dominated by samples of a specific class or a label for some specific feature. Thus VAEs provide the option to define paths between such clusters and allow for “feature arithmetics” via vector operations.

See the following images of MNIST data mapped into a 2-dimensional latent space by a standard Autoencoder:

Despite the fact that the AE’s sub-nets (CNNs) already deal very well with with features the resulting clusters for the digit classes are spread in a relatively large area of the latent space: [-7 ≤ x ≤ +7], [6 ≥ y ≥ -9]. Had we used dense layers of a MLP only, the spread of data points would have covered an even bigger area in the latent space.

The following three plots resulted from VAEs with different parameters. Watch the scale of the axes.

The data are now confined to a region of around [[-4,4], [-4,4]] with respect to [x,y]-coordinates in the “latent space”. The available space is more efficiently used. So, you want to avoid any problems with the KL-loss in VAEs due to some peculiar requirements of Tensorflow 2.x.

Conclusion

AEs are easy to understand, VAEs have subtle additional properties which seem to be valuable for certain tasks. But also for principle reasons we want to enable a Keras model of an ANN to calculate quantities which depend on the elements of one or a few given layers. One example of such a quantity is the KL loss.

In the next post

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

we shall build an Autoencoder and apply it to the MNIST dataset. This will give us a reference point for VAEs later.

Stay tuned ….
Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

Further articles in this series

Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow – X – VAE application to CelebA images
Variational Autoencoder with Tensorflow – XI – image creation by a VAE trained on CelebA
Variational Autoencoder with Tensorflow – XII – save some VRAM by an extra Dense layer in the Encoder
Variational Autoencoder with Tensorflow – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?

 

KMeans as a classifier for the WIFI and MNIST datasets – V – cluster based classification of the MNIST dataset

In this series about KMeans

KMeans as a classifier for the WIFI and MNIST datasets – I – Cluster analysis of the WIFI example
KMeans as a classifier for the WIFI and MNIST datasets – II – PCA in combination with KMeans for the WIFI-example
KMeans as a classifier for the WIFI and MNIST datasets – III – KMeans as a classifier for the WIFI-example
KMeans as a classifier for the WIFI and MNIST datasets – IV – KMeans on PCA transformed data

we have so far studied the application of KMeans to the WIFI dataset of the UCI Irvine. We now apply the Kmeans clustering algorithm to the MNIST dataset – also in an extended form, namely as a classifier. The MNIST dataset – a collection of 28x28px images of handwritten numbers – has already been discussed in other sections of this blog and is well documented on the Internet. I, therefore, do not describe its basic properties in this post. A typical image of the collection is

Load MNIST – dimensionality of the feature space and scaling of the data

Due to the ease of use, I loaded the MNIST data samples via TF2 and the included Keras interface. Otherwise TF2 was not used for the following experiments. Instead the clustering algorithms were taken from “sklearn”.

Each MNIST image can be transformed into a one-dimensional array with dimension 784 (= 28 * 28). This means the MNIST feature space has a dimension of 784 – which is much more than the seven dimensions we dealt with when analyzing the WIFI data in the last post. All MNIST samples were shuffled for individual runs.

Scaling of MNIST data for clustering?

A good question is whether we should scale or normalize the sample data for clustering – and if so by what formula. I could not answer this question directly; instead I tested multiple methods. Previous experience with PCA and MNIST indicated that Sklearn’s “Normalizer” would be helpful, but I did not take this as granted.

A simple scaling method is to just divide the pixel values by 255. This brings all 784 data array elements of each image into the value range [0,1]. Note that this scaling does not change relative length differences of the sample vectors in the feature space. Neither does it shift or change the width of the data distribution around its mean value. Other methods would be to standardize the data or to normalize them, e.g. by using respective algorithms from Scikit-Learn. Using either method in combination with a cluster analysis corresponds to a theory about the cluster distribution in the feature space. Normalization would mean that we assume that the clusters do not so much depend on the vector length in the feature space but mainly on the angle of the sample vectors. We shall later see what kind of scaling helps when we classify the MNIST data based on clusters.

In a first approach we leave the data as they are, i.e. unscaled.

Parameters for clustering

All the following cluster calculations were done on 3 out of 8 available (hyperthreaded) CPU cores. For Kmeans and MiniBatchKMeans we used

n_init       = 100       # number of initial cluster configurations to test 
max_iter     = 100       # maximum number of iterations  
tol          = 1.e-4     # final deviation of subsequent results (= stop condition)  
random_state = 2         # a random state nmber for repeatable runs
mb_size      = 200       # size of minibatches (for MiniBatchKMeans) 

The number of clusters “num_clus” was defined individually for each run.

Analysis by KMeans? Too expensive …

A naive approach to perform an elbow analysis, as we did for the WIFI-data, would be to apply KMeans of Sklearn directly to the MNIST data. But a test run on the CPU shows that such an endeavor would cost too much time. With 3 CPU cores and only a very limited number of clusters and iterations

n_init   = 10      # only a few initial configurations
max_iter = 50 
tol      = 1.e-3  
num_clus = 25      # only a few clusters

a KMeans fit() run applied to 60,000 training samples [len(X_train) => 60,000]

kmeans.fit(X_train)

requires around 42 secs. For 200 clusters the cluster analysis requires around 214 secs. Doing an elbow analysis would therefore require many hours of computational time.
To overcome this problem I had to use MiniBatchKMeans. It is by factors > 80 faster.

Elbow analysis with the help of MiniBatchKmeans

When we use the following setting for MiniBatchKMeans

n_init = 50 # only a few initial configurations
max_iter = 100
tol = 1.e-4
mb_size = 200 

I could perform an elbow analysis for all cluster-numbers 1 < k <= 250 in less than 20 minutes. The following graphics shows the resulting intertia curve vs. cluster number:

The “elbow” is not very pronounced. But I would say that by using a cluster number around 200 we are on the safe side. By the way: The shape of the curve does not change very much when we apply Sklearn’s Normalizer to the MNIST data ahead of the cluster analysis.

Classifying unscaled data with the help of clusters

We now perform a prediction of our adapted cluster algorithm regarding the cluster membership for the training data and for k=225 clusters:

n_clu    = 225
mb_size  = 200
max_iter = 120
n_init   = 100
tol      = 1.e-4

Based on the resulting data we afterward apply the same type of algorithm which we used for the WIFI data to construct a “classifier” based on clusters and a respective predictor function (see the last post of this series).

The data distribution for the 10 different digits of the training set was:

class 0 :  5905
class 1 :  6721
class 2 :  6031
class 3 :  6082
class 4 :  5845
class 5 :  5412
class 6 :  5917
class 7 :  6266
class 8 :  5860
class 9 :  5961

How good is the cluster membership of a sample for a digit class defined?
Well, out of 225 clusters there were only around 15 for which I got an “error” above 40%, i.e. for which the relative fraction of data samples deviating from the dominant class of the cluster was above 40%. For the vast majority of clusters, however, samples of one specific digit class dominated the clusters members by more than 90%.

The resulting confusion matrix of our new “cluster classifier” for the (unscaled) MNIST data looks like

[[5695    4   37   21    7   57   51   15   15    3]
 [   0 6609   33   21   11    2   15   17    2   11]
 [  62   45 5523  120   14   10   27  107  116    7]
 [  11   43  114 5362   15  153    8   60  267   49]
 [   5   60   62    2 4752    3   59   63    5  834]
 [  54   18  103  777   25 4158  126    9  110   32]
 [  49   20   56    4    6   38 5736    0    8    0]
 [   5   57   96    2   86    1    0 5774    7  238]
 [  30   76  109  416   51  152   39   35 4864   88]
 [  25   20   37   84  706   14    6  381   46 4642]]

This confusion matrix comes at no surprise: The digits “4”, “5”, “8”, “9” are somewhat error prone. Actually, everybody familiar with MNIST images knows that sometimes “4”s and “9”s can be mixed up even by the human eye. The same is true for handwritten “5”s, “8”s and “3”s.

Another representation of the confusion matrix is:

The calculation for the matrix elements was done in a standard way – the sum over percentages in a row gives 100% (the slight deviation in the matrix is due to rounding). I.e. we look at erors of the type TN (True Negatives).

The confusion matrix for the remaining 10,000 test data samples is:

The relative errors we get for our classifier when applied to the train and test data is

rel_err_train = 0.115 ,
rel_err_test = 0.112

All for unscaled MNIST data. Taking into account the crudeness of the whole approach this is a rather convincing result. It proves that it is worth the effort to perform a cluster analysis on high dimensional data:

  • It provides a first impression whether the data are structured in the feature space such that we can find relatively good separable clusters with dominant members belonging to just one class.
  • It also shows that a cluster based classification for many datasets cannot reach accuracy levels of CNNs, but that it may still deliver good results. Without any supervised training …

The second point also proves that the distance of the data points to the various cluster centers contains valuable information about the class membership. So, a MLP or CNN based classification could be performed on transformed MNIST data, namely distance vectors of sample datapoints to the different cluster centers. This corresponds to a dimension reduction of the classification problem. Actually, in a different part of this blog, I have already shown that such an approach delivers accuracy values beyond 98%.

For MNIST we can say that the samples define a relatively well separable cluster structure in the feature space. The granularity required to resolve classes sufficiently well means a clsuter number of around 200 < k < 250. Then we get an accuracy close to 90% for cluster based classification.

t-SNE representation of the MNIST data

Can we somehow confirm this finding about a good cluster-class-relation independently? Well, in a limited way. The t-SNE algorithm, which can be used to “project” multidimensional data onto a 2-dimensional plane, respects the vicinity of vectors in the original feature space whilst deriving a 2-dim representation. So, a rather well structured t-SNE diagram is an indication of clustering in the feature space. And indeed for 10,000 randomly selected samples of the (shufffled) training data we get:

The colorization was done by classes, i.e. digits. We see a relatively good separation of major “clusters” with data points belonging to a specific class. But we also can identify multiple problem zones, where data points belonging to different classes are intermixed. This explains the confusion matrix. It also explains why we need so many fine-grained clusters to get a reasonable resolution regarding a reliable class-cluster-relation.

Classifying scaled and normalized MNIST data with the help of clusters

Can we improve the accuracy of our cluster based classification a bit? This would, e.g., require some transformation which leads to a better cluster separation. To see the effect of two different scalers I tried the “Normalizer” and then also the “StandardScaler” of Sklearn. Actually, they work in opposite direction regarding accuracy:

The “Normalizer” improves accuracy by more than 1.5%, while the “Standardizer” reduces it by almost the same amount.

I only discuss results for “Normalization” below. The confusion matrix for the training data becomes:

and for the test data:

The relative error for the test data is

Error for trainings data:
avg_err_train = 0.085 :: num_err_train = 5113
 
Error for test data:
avg_err_test = 0.083 :: num_err_test = 832

So, the relative accuracy is now around 91.5%.
The result depends a bit on the composition of the training and the test dataset after an initial shuffling. But the value remains consistently above 90%.

Data compression by Autoencoders and clustering

Just for interest I also had a look at a very different approach to invoke clustering:

I first applied a simple CNN-based AutoEncoder [AE] to compress the MNIST data into a 25-dimensional space and applied our clustering methods afterwards.

I shall not discuss the technology of autoenconders in this post. The only relevant point in our context is that an autoencoder provides an efficient non-linear way of data compression and dimensionality reduction. Among many other useful properties and abilities … . Note: I did not use a “Variational Autoencoder” which would have allowed for even better results. The loss function for the AE was a simple quadratic loss. The autoencoder was trained on 50,000 training samples and for 40 epochs.

A t-SNE based plot of the “clusters” for test data in the 25-dimensional space looks like:

We see that the separation of the data belonging to different classes is somewhat better than before. Therefore, we expect a slightly better classification based on clusters, too. Without any scaling we get the following confusion data:

[[5817    7   10    3    1   14   15    2   27    1]
 [   3 6726   29    2    0    1   10    5   12   10]
 [  49   35 5704   35   14    4   10   61   87    7]
 [   8   78   48 5580   22  148    2   40  111   29]
 [  47   27   18    0 4967    0   44   38    3  673]
 [  32   20   10  150    8 5039   73    4   43   28]
 [  31   11   23    2    2   47 5746    0   15    1]
 [   6   35   35    6   32    0    1 5977    7  163]
 [  17   67   22   86   16  217   24   22 5365   52]
 [  35   32   11   92  184   15    1  172   33 5406]]

Error averaged over (all) clusters :  6.74

The resulting relative error for the test data was:

avg_err_test = 0.0574 :: num_err_test = 574

With Normalization:

Error for test data:
avg_err_test = 0.054 :: num_err_test = 832

So, after performing the autoencoder training on normalized data we consistently get

an accuracy of around 94%.

This is not too much of a gain. But remember:
We performed a cluster analysis on a feature space with only 25 dimensions – which of course is much cheaper. However, we paid a prize, namely the Autoencoder training which lasted about 150 secs on my old Nvidia 960 GTX.

And note: Even with only 100 clusters we get above 92% on the AE-compressed data.

Conclusion

We have shown that using a non-supervised cluster analysis of the MNIST data with around 225 clusters allows for classifying images with an accuracy around 90.5%. In combination with an Autoencoder compression we even reaches values around 94%. This is comparable with other non-optimized standard algorithms aside of neural networks.

This means that the MNIST data samples are organized in a well separable cluster structure in their feature space. A test run with normalized data showed that the clusters (and their centers) differ mostly by their direction relative to the origin of the feature space and not so much by their distance from the origin. With a relatively fine grained resolution we could establish a simple cluster-class-relation which allowed for cluster based classification.

The accuracy is, of course, below the values reachable with optimized MLPS (98%) and CNNs (above 99%). But, clustering is a fast, reliable and non-supervised method. In addition in combination with t-SNE we can create plots which can easily be understood by the customers. So, even for more complex data I would always recommend to try a cluster based classification approach if you need to provide plots and quick results. Sometimes the accuracy may even be sufficient for your customer’s purposes.