Deep Dreams of a CNN trained on MNIST data – I – a first approach based on one selected map of a convolutional layer

It is fun to play around with Convolutional Neural Networks [CNNs] on the level of an dedicated amateur. One of the reasons is the possibility to visualize the output of elementary building blocks of this class of AI networks. The resulting images help to understand CNN algorithms in an entertaining way - at least in my opinion. The required effort is in addition relatively limited: You must be willing to invest a bit of time into programming, but on a quite modest level of difficulty. And you can often find many basic experiments which are in within the reach of limited PC capabilities.

A special area where the visualization of CNN guided processes is the main objective is the field of "Deep Dreams". Anyone studying AI methods sooner or later stumbles across the somewhat psychedelic, but none the less spectacular images which Google presented in 2016 as a side branch of their CNN research. Today, you can download DeepDream generators from GitHub.

When I read a bit more about "DeepDream" experiments, I quickly learned that people use quite advanced CNN architectures, like Google's Inception CNNs, and apply them to high resolution images (see e.g. the Book of F. Chollet on "Deep Learning with Keras and Python" and, 2015, inceptionism-going-deeper-into-neural). Even if you pick up an already trained version of an Inception CNN, you need some decent GPU power to do your own experiments. Another questionable point for an interested amateur is: What does one actually learn from applying "generators", which others have programmed, and what from just following a "user guide" without understanding what a DeepDream SW actually does? Probably not much, even if you produce stunning images after some time...

So, I asked myself: Can one study basic methods of the DeepDream technology with self programmed tools and a simple dataset? Could one create a "DeepDream" visualization with a rather simply structured CNN trained on MNIST data?
The big advantage of the MNIST data set is that the individual samples are small; and the amount of numerical operations, which a related simple CNN must perform on input images, fits well to the capabilities of PC technology - even if the latter is some years old.

After a first look into DeepDream algorithms, I think: Yes, it should be possible. In a way DeepDream experiments are a natural extension of the visualization of CNN filters and maps which I have already discussed in depth in another article series. Therefore, DeepDream visualizations might even help us to better understand how the internal filter of CNNs work and what "features" are. However, regarding the creation of spectacular images we need to reduce our expectations to a reasonably low level:

A CNN trained on MNIST data works with gray images, low resolution and only simple feature patterns. Therefore, we will never produce such impressive images as published by DeepDream artists or by Google. But, we do have a solid chance to grasp some basic principles and ideas of this side-branch of AI with very simplistic tools.

As always in this blog, I explore a new field step-wise and let you as a reader follow me through the learning process. Throughout most of this new series of articles we will use a CNN created with the help of Keras and filter visualization tools which were developed in another article series of this blog. The CNN has been trained on the MNIST data set already.

In this first post we are going to pick just a single selected feature or response map of a deep CNN layer and let it "dream" upon a down-scaled image of roses. Well, "dream", as a matter of fact, is a misleading expression; but this is true for the whole DeepDream business - as we shall see. A CNN does not dream; "DeepDream" creation is more to be seen as an artistic discipline using algorithmic image enhancement.

The input image which we shall feed into our CNN today is shown below:

As our CNN works on a resolution level of 28x28 pixels, only, the "dreaming" will occur in a coarse way, very comparable to hallucinations on the blurred vision level of a short-sighted, myopic man. More precisely: Of a disturbed myopic man who works the whole day with images of digits and lets this poor experience enter and manipulate his dreamy visions of nicer things :-).

Actually, the setup for this article's experiment was a bit funny: I got the input picture of roses from my wife, who is very much interested in art and likes flowers. I am myopic and in my soul still a theoretical physicist, who is much more attracted by numbers and patterns than by roses - if we disregard the interesting fractal nature of rose blossoms for a second :-).

What do DeepDreams based on single maps of trained MNIST CNNs produce?

To rouse your interest a bit or to disappoint you from the start, I show you a typical result of today's exercise: "Dreams" or "hallucinations" based on MNIST and a selected single map of a deep convolutional CNN layer produce gray scale images with ghost-like "apparitions".

When these images appeared on my computer screen, I thought: This is fun, indeed! But my wife just laughed - and said "physicists" with a known undertone and something about "boys and toys" .... I hope this will not stop you from reading further. Later articles will, hopefully, produce more "advanced" hallucinations. But as I said: It all depends on your expectations.

But, lets focus: How did I create the simple "dream" displayed above?

Requirements - a CNN and analysis and visualization tools described in another article series of this blog

I shall use results and methods, which I have already explained in another article series. You need a basic understanding of how a CNN works, what convolutional layers, kernel based filters and cost functions are, how we can build simple CNNs with the help of Keras, ... - otherwise you will be lost from the beginning.
A simple CNN for the MNIST datasets – I – CNN basics
We also need a CNN, already trained on the MNIST data. I have shown how to build and train a very simple, yet suitable CNN with the help of Keras and Python; see e.g.:
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
In addition we need some code to create input image patterns which trigger response maps or full layers of a CNN optimally. I called such pixel patterns "OIPs"; others call them "features". I have offered a Python class in the other article series which offers an optimization loop and other methods to work on OIPs and filter visualization.
A simple CNN for the MNIST dataset – XI – Python code for filter visualization and OIP detection

We shall extend this class by further methods throughout our forthcoming work. To develop and run the codes you should have a working Jupyter environment, a virtual Python environment, an IDE like Eclipse with PyDev for building larger code segments and a working Cuda installation for a NVidia graphics card. My 960GTX proved to be fully sufficient for what we are going to do.

Deep "Dream" - or some funny image manipulation?

As it unfortunately happens so often with AI topics: Also in case of the term "DeepDream" the vocabulary is exaggerated and thoroughly misleading. A simple CNN neither thinks nor "dreams" - it is a software manifestation of the results of an optimization algorithm applied to and trained on selected input data. If applied to new input, it will only detect patterns for which it was optimized before. You could also say:

A CNN is a manifestation of learned prejudices.

CNNs and other types of AI networks filter input according to encoded rules which serve a specific purpose and which reflect the properties of the selected training data set. If you ever used the CNN of my other series on your own hand-written images after a training only on the (US-) MNIST images you will quickly see what I mean. The MNIST dataset reflects an American style of writing digits - a CNN trained on MNIST will fail relatively often when confronted with image samples of digits written by Europeans.

Why do I stress this point at all? Because DeepDreams reveal such kinds of "prejudices" in a visible manner. DeepDream technology extracts and amplifies patterns within images, which fit the trained filters of the involved CNN. F. Chollet correctly describes "DeepDream" as an image manipulation technique which makes use of algorithms for the visualization of CNN filters.

The original algorithmic concept for DeepDreams consists of the following steps:

  • Extend your algorithm for CNN filter visualization (= OIP creation) from a specific map to the optimization of the response of complete layers. Meaning: Use the total response of all maps of a layer to define contributions to your cost function. Then mix these contributions in a defined weighted way.
  • Take some image of whatever motive you like and prepare 4 or more down-scaled versions of this image, i.e. versions with different levels of size and resolution below the original size and resolution.
  • Offer the image with the lowest resolution to the CNN as an input image.
  • Loop over all prepared image sizes :
    • Apply your algorithm for filter visualization of all maps and layers to the input image - but only for a very limited amount of epochs.
    • Upscale the resulting output image (OIP-image) to the next level of higher resolution.
    • Add details of the original image with the same resolution to the upscaled OIP-image.
    • Offer the resulting image as a new input image to your CNN.

Readers who followed me through my last series on "a simple CNN for MNIST" should already raise their eyebrows: What if the CNN expects a certain fixed size of of the input image? Well, a good question. I'll come back to it in a second. For the time being, let us say that we will concentrate more on resolution than on an actual image size.

The above steps make it clear that we manipulate an image multiple times. In a way we transform the image slowly to improve a layer's response and repeat the process with growing resolution. I.e., we apply pattern detection and amplification on more and more details - in the end using all available large and small scale filters of the CNN in a controlled way without fully eliminating the original contents.

What to do about the low resolution of MNIST images and the limited capability of a CNN trained on them?

MNIST images have a very low resolution, real images instead a significantly higher one. With our CNN specialized on MNIST input the OIP-creation algorithm only works on (28x28)-images (and with some warnings, maybe, on smaller ones). What to do about it when we work with input images of a size of e.g. 560x560 pixels?

Well, we just work on the given level of resolution! We have three options:

  • We can downsize the input image itself or parts of it to the MNIST dimensions - with the help of a bicubic interpolation. Then our OIP-algorithm has the chance to detect OIPs on the coarse scale and to change the downsized image accordingly. Then we can upscale the result again to the original image size - and add details again.
  • We can split the input image into tiles of size (28x28) and offer these tiles as input to the CNN.
  • We can combine both of the above options.

Its like what a shortsighted human would do: Work with a blurred impression of the full scale image or look at parts of it from a close distance and then reassemble his/her impressions to larger scales.

A first step - apply only one specific map of a convolutional layer on a down-scaled image version

In this article we have a very limited goal for which we do not have to change our tools, yet:

  • Preparation:
    • We choose a map.
    • We downscale the original image to (28x28) by interpolation, upscale the result again by interpolating again (with loss) and calculate the difference to the original full resolution image (all interpolations done in a bicubic way).
  • Loop (4 times or so):
    • We apply the OIP-algorithm on the downscaled input image for a fixed amount of epochs
    • We upscale the result by bicubic interpolation to the original size.
    • We re-add the difference in details.
    • We downscale the result again.

With this approach I try to apply some of the elements of the original algorithm - but just on one scale of coarse resolution. I shall discuss the code for realizing the recipe given above with Python and Jupyter in the next article. For today let us look at some of the ghost like apparitions in the dreams for selected maps of the 3rd convolutional layer; see:
A simple CNN for the MNIST dataset – IX – filter visualization at a convolutional layer

DeepDreams based on selected maps of the 3rd convolutional layer of a CNN trained on MNIST data

With the image sections displayed below I have tried to collect results for different maps which focus on certain areas of the input image (with the exception of the first image section).

The first two images of each row display the detected OIP-patterns on the (28x28) resolution level with pixel values encoded in a (viridis) color-map; the third image in gray scale. The fourth image reveals the dream on the blurry vision level - up-scaled and interpolated to the original image size. You may still detect traces of the original rose blossoms i these images. The last two images of each row display the results after re-adding details of the original image and an adjustment of the width of the value distribution. The detected and enhanced pattern then turns into a whitey, ghostly shadow.

I have given each section a fancy headline.

I never promised you a rose garden ...

"Getting out ..."

"Donut ..."

"Curls to form a 3 ..."

"Two of them ..."

"The creepy roots of it all ..."

"Look at me ..."

"A hidden opening ..."

"Soft is something different ..."

"Central separation ..."

Conclusion: A CNN detects patterns or parts of patterns it was trained for in any kind of offered input ...

You can compare the results to some input patterns (OIPs) which strongly trigger individual maps on the 3rd convolutional layer; you will detect similarities. E.g. four OIP- or feature patterns to which map 56 reacts strongly, look like:

Filter visualization 1 for CNN map 56Filter visualization 2 for CNN map 56Filter visualization 3 for CNN map 56Filter visualization 4 for CNN map 56

This explains the basic shape of the "apparition" in the first "dream":

This proves that the filters of a trained CNN actually detect patterns, which proved to be useful for a certain training purpose, in any kind of input which shows some traces of such patterns. A CNN simply does not "know" better: If you only have a hammer to interact with the world, everything becomes a nail to you in the end - this is the level of stupidity on which a CNN algorithm works. And it actually is a fundamental ingredient of DeepDream image manipulation - a transportation of learned patterns or prejudices to an environment outside the original training context.

In the next article
Deep Dreams of a CNN trained on MNIST data – II – some code for pattern carving
I provide the code for creating the above images.

Further articles in this series

Deep Dreams of a CNN trained on MNIST data – II – some code for pattern carving
Deep Dreams of a CNN trained on MNIST data – III – catching dream patterns at smaller length scales


A simple CNN for the MNIST dataset – X – filling some gaps in filter visualization

I continue my series of articles on a CNN for the MNIST dataset.

A simple CNN for the MNIST dataset – IX – filter visualization at a convolutional layer
A simple CNN for the MNIST dataset – VIII – filters and features – Python code to visualize patterns which activate a map strongly
A simple CNN for the MNIST dataset – VII – outline of steps to visualize image patterns which trigger filter maps
A simple CNN for the MNIST dataset – VI – classification by activation patterns and the role of the CNN’s MLP part
A simple CNN for the MNIST dataset – V – about the difference of activation patterns and features
A simple CNN for the MNIST dataset – IV – Visualizing the activation output of convolutional layers and maps
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST datasets – I – CNN basics

In the last article we harvested the fruits of previous efforts. We produced a variety of input image patterns which triggered certain maps of the innermost convolutional layer of our CNN and the filters behind it maximally. I called such pixel-patterns OIPs. (I am still careful to avoid the expression "feature", which is used by many authors as a term describing a physical entity identified by the human brain. A connotation which I do not like ...)

Although my code for creating OIPs allows for a variation of fluctuations on 4 different length scales it proved to be hard to find OIPs for quite a lot of maps at the highest convolutional layer and for their filters. I could not always find a combination of initial random pixel fluctuations which led to loss values > 0. More precisely: I did not find a pattern by trial and error on a reasonable short timescale of some minutes.

We already know that OIP pixel patterns for the innermost convolutional layer are a bit artificial:

  • They are idealizations; the displayed pattern may not be present in the same form in the real input images which were used during training.
  • They may contain repetitions of sub-patterns at different positions.

The images in the last articles showed us in addition that some maps at the innermost Conv-layer are sensitive to rather complicated and specific patterns on relatively large length scales.

This is to be expected as at least filters on the higher convolutional levels accumulate information on a coarse level of image coverage: Filtered information coming from original large areas of the image are in the end convoluted down to grids of (3x3) neurons, i.e. onto low resolution maps. Filters at this convolutional depth can therefore require relatively large scale patterns to be passed. So, it is no real wonder that some maps do not react at all to random statistical fluctuations on very short length scales, e.g. on a two pixel scale. The activation of the respective neurons may stay at zero.

In this article I shall describe a simple method which allowed me to create OIP patterns for around half of the maps (at the 3d Conv-level) for which I did not have any luck before. This is done by a "precursor run" which tests the reaction of a map to a large sample of input images with pattern variations on relatively long length scales.

Systematic analysis? My simple approach ...

What is the basic idea? Let us assume that we cover the original image surface by a a grid of squares, e.g. by 16 (7x7)-squares - where (7x7) means a square with a side length of 7 pixels. We then assign a constant grey-value to all the pixels in such a square. Instead of picking random values in the range [0, 255] we use a limited number of N distinct normalized values in an interval [0, 1].

We then investigate all combinations of distinct value distributions across the 16 squares. For each combination we construct an input image with bicubic interpolation on the original (28x28) scale and standardize the resulting distribution of pixel values. (The standardization is required for my CNN, which was trained on standardized images). Thus we produce a (huge) number of input images with smooth large scale fluctuations, which we then can present to our OIP-optimization algorithm for a test of a map's reaction.

Do we have a chance to find OIPs by systematic trials?

Now, do we have a real chance to work a bit more systematically with the described approach? Unfortunately, the answer is: Not really, at least not with a PC equipment and for fluctuations at middle length scales with more than 5 distinct pixel values. The reason in terms of Fourier series' is that the number of amplitude combinations for multiple sinus waves grows exponentially with a defined number of discrete amplitudes values out of an interval.

We see the limitations already in our simplified approach. Let us assume that we pick 5 distinct and normalized pixel values in the interval ]0, 1[. Let us further assume that we cover the whole image area (of a size 28x28) with a grid of (9x9) squares, each with a constant pixel value. (We ignore the fact that 28/9 = 3.11 and trust in bicubic interpolation to take care of the 0.11 🙂 )

Even under these conditions we get already around 5**9 = 1.953 million pattern variations for which we have to test a map's response. For each of these patterns we would have to follow at least 4 to 10 iterations (= epochs) of the optimization loop to identify the input image as a promising candidate for a full optimization or not. This means that we need 8 to 20 million full gradient calculations on a parameter space with 784 (=28x28) dimensions. Hard work even for graphic cards (at the consumer level). At smaller length scales, as induced e.g. by a (7x7) coverage grid, we go far beyond standard calculation capacities on PC hardware.

What was/is within my computational reach? If we reduced the number of distinct pixel values to 3 (e.g.: 0.2, 0.5, 0.8) and concentrated on large scale fluctuations - e.g. based on (9x9) squares - we would have around 19700 combinations to investigate.

I had to work a bit on the respective Python code to get it run under Tensorflow 2.2.1/Cuda 10.2 with a reasonable percentage of around 30% on a graphics card. With 10 epochs for an optimization trial run for each image a complete run over 20000 images takes less than 5 minutes on my old GTX960 graphics card. With a modern card, a recent CPU and some more optimization one would probably arrive at values significantly below a minute. So, systematically working with large scale fluctuations is possible on PC-equipment!

Note that a coverage of the image surface with (9x9)-squares corresponds to the resolution which the (3x3)-maps of the innermost Conv-layers represent! (Including padding in the filter definitions). I assume that it is reasonable to work and test on this scale.

Selection of 8 promising candidates out of 19683

So, I added some "precursor" methods to my class My_OIP to prepare and perform systematic runs for all of the maps for which I had not found a pattern, yet. I chose the following 3 discrete pixel values: [0.2, 0.5, 0.8].
The resulting (9x9) images where rescaled with a bicubic interpolation to the MNIST size of (28x28) and afterwards standardized.

During the test of a map's response I selected 8 candidates out of 19683 for subsequent thorough optimization runs with 1200 epochs. The selection was done by looking at the largest loss values reached after 10 optimization epochs. I then applied the procedure to all 67 maps for which I had not got an OIP, yet.

Note that we cannot be sure whether 10 optimization iterations really give us a perfect indication of the highest loss values which would be reached after a full optimization run with 600 to 1200 epochs. So, it is worthwhile to play around with the number of test epochs; the selection of the input fluctuation patterns may change.

I got results then for 33 of the maps which had no OIP. I present the OIP-images below. The results cover almost 50% of the missing OIPs. So, my computational efforts were well invested.

OIPs for additional 33 maps

Below I present MNIST-related OIP-images for the maps with the following indices on the 3rd Conv-layer of my (!) CNN ):
1, 3, 9, 15, 16, 29, 35, 37, 38, 44, 46, 50, 51, 55, 59, 60, 70, 78, 81, 91, 93, 94, 96, 97, 101, 104, 108, 109, 111, 112, 116, 122, 125.
This time without contrast enhancement - as I was too lazy to integrate it into the precursor code.


CNNs are fun, aren't they?

Unique OIPs?

An interesting question is: Are the OIPs really unique per map in the sense that we produce the same OIP-images for different fluctuations on the input images? Well, there is some unique overall form of the patterns, but there may occur translations in position and there may be differences in details. Below, I show you different derived images for some maps:

Map 46:

Map 81:

Map 116:

Map 125:

Map 127:


By a systematic investigation of the reaction of maps to large scale fluctuations we can find OIPs for maps where a trial and error approach is not sufficient to trigger a response of the filters.
I shall discuss the Python code for a related "precursor run" in the next article

A simple CNN for the MNIST dataset – XI – Python code for filter visualization and OIP detection.


A simple Python program for an ANN to cover the MNIST dataset – XIV – cluster detection in feature space

We extend our studies of a program for a Multilayer perceptron and gradient descent in combination with the MNIST dataset:

A simple Python program for an ANN to cover the MNIST dataset – XIII – the impact of regularization
A simple Python program for an ANN to cover the MNIST dataset – XII – accuracy evolution, learning rate, normalization
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix
A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests
A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests
A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation
A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles
A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“
A simple Python program for an ANN to cover the MNIST dataset – V – coding the loss function
A simple Python program for an ANN to cover the MNIST dataset – IV – the concept of a cost or loss function
A simple Python program for an ANN to cover the MNIST dataset – III – forward propagation
A simple Python program for an ANN to cover the MNIST dataset – II - initial random weight values
A simple Python program for an ANN to cover the MNIST dataset – I - a starting point

In this article we shall work a bit on the following topic: How can we reduce the computational time required for gradient descent runs of our MLP?

Readers who followed my last articles will have noticed that I sometimes used 1800 epochs in a gradient descent run. The computational time including

  • costly intermediate print outs into Jupyter cells,
  • a full determination of the reached accuracy both on the full training and the test dataset at every epoch

lay in a region of 40 to 45 minutes for our MLP with two hidden layers and roughly 58000 weights. Using an Intel I7 standard CPU with OpenBlas support. And I plan to work with bigger MLPs - not on MNIST but other data sets. Believe me: Everything beyond 10 minutes is a burden. So, I have a natural interest in accelerating things on a very basic level already before turning to GPUs or arrays of them.

Factors for CPU-time

This introductory question leads to another one: What basic factors beyond technical capabilities of our Linux system and badly written parts of my Python code influence the consumption of computational time? Four points come to my mind; you probably find even more:

  • One factor is certainly the extra forward propagation run which we apply to all samples of both the test and training data seat the end of each epoch. We perform this propagation to make predictions and to get data on the evolution of the accuracy, the total loss and the ratio of the regularization term to the real costs. We could do this in the future at every 2nd or 5th epoch to save some time. But this will reduce CPU-time only by less than 22%. 76% of the CPU-time of an epoch is spent in batch-handling with a dominant part in error backward propagation and weight corrections.
  • The learning rate has a direct impact on the number of required epochs. We could enlarge the learning rate in combination with input data normalization; see the last article. This could reduce the number of required epochs significantly. Depending on the parameter choices before by up to 40% or 50%. But it requires a bit of experimenting ....
  • Two other, more important factors are the frequent number of matrix operations during error back-propagation and the size of the involved matrices. These operations depend directly on the number of nodes involved. We could therefore reduce the number of nodes of our MLP to a minimum compatible with the required accuracy and precision. This leads directly to the next point.
  • The dominant weight matrix is of course the one which couples layer L0 and layer L1. In our case its shape is 784 x 70; it has almost 55000 elements. The matrix for the next pair of layers has only 70x30 = 2100 elements - it is much, much smaller. To reduce CPU time for forward propagation we should try to make this matrix smaller. During error back propagation we must perform multiple matrix multiplications; the matrix dimensions depend on the number of samples in a mini-batch AND on the number of nodes in the involved layers. The dimensions of the the result matrix correspond to the those of the weight matrix. So once again: A reduction of the nodes in the first 2 layers would be extremely helpful for the expensive backward propagation. See: The math behind EBP.

We shall mainly concentrate on the last point in this article.

Reduction of the dimensions of the dominant matrix"requires a reduction of input features

The following numbers show typical CPU times spend for matrix operations during error back propagation [EBP] between different layers of our MLP and for two different batches at the beginning of gradient descent:

Time_CPU for BW layer operations (to L2) 0.00029015699965384556
Time_CPU for BW layer operations (to L1) 0.0008645610000712622
Time_CPU for BW layer operations (to L0) 0.006551215999934357

Time_CPU for BW layer operations (to L2) 0.00029157400012991275
Time_CPU for BW layer operations (to L1) 0.0009575330000188842
Time_CPU for BW layer operations (to L0) 0.007488838999961445

The operations involving layer L0 cost a factor of 7 more CPU time than the other operations! Therefore, a key to the reduction of the number of mathematical operations is obviously the reduction of the number of nodes in the input layer! We cannot reduce the numbers in the hidden layers much, if we do not want to hamper the accuracy properties of our MLP too much. So the basic question is

Can we reduce the number of input nodes somehow?

Yes, maybe we can! Input nodes correspond to "features". In case of the MNIST dataset the relevant features are given by the gray-values for the 784 pixels of each image. A first idea is that there are many pixels within each MNIST image which are probably not used at all for classification - especially pixels at the outer image borders. So, it would be helpful to chop them off or to ignore them by some appropriate method. In addition, special significant pixel areas may exist to which the MLP, i.e. its weight optimization, reacts during training. For example: The digits 3, 5, 6, 8, 9 all have a bow within the lower 30% of an image, but in other regions, e.g. to the left and the right, they are rather different.

If we could identify suitable image areas in which dark pixels have a higher probability for certain digits then, maybe, we could use this information to discriminate the represented digits? But a "higher density of dark pixels in an image area" is nothing else than a description of a "cluster" of (dark) pixels in certain image areas. Can we use pixel clusters at numerous areas of an image to learn about the represented digits? Is the combination of (averaged) feature values in certain clusters of pixels representative for a handwritten digit in the MNIST dataset?

If the number of such pixel clusters could be reduced below lets say 100 then we could indeed reduce the number of input features significantly!

Cluster detection

To be able to use relevant "clusters" of pixels - if they exist in a usable form in MNIST images at all - we must first identify them. Cluster identification and discrimination is a major discipline of Machine Learning. This discipline works in general with unlabeled data. In the MNIST case we would not use the labels in the "y"-data at all to identify clusters; we would only use the "X"-data. A nice introduction to the mechanisms of cluster identification is given in the book of Paul Wilcott (see Machine Learning – book recommendations for the reference). The most fundamental method - called "kmeans" - iterates over 3 major steps [I simplify a bit :-)]:

  • We assume that K clusters exist and start with random initial positions of their centers (called "centroids") in the multidimensional feature space
  • We measure the distance of all data points to he centroids and associate a point with that centroid to which the distance is smallest
  • We determine the "center of mass" (according to some distance metric) of the identified data point groups and assume it as a new position of the centroids and move the old positions (a bit) in this direction.

We iterate over these steps until the centroids' positions hopefully get stable. Pretty simple. But there is a major drawback: You must make an assumption on the number "K" of clusters. To make such an assumption can become difficult in the complex case of a feature space with hundreds of dimensions.

You can compensate this by executing multiple cluster runs and comparing the results. By what? Regarding the closure or separation of clusters in terms of an appropriate norm. One such norm is called "cluster inertia"; it measures the mean squared distance to the center for all points of a cluster. The theory is that the sum of the inertias for all clusters drops significantly with the number of clusters until an optimal number is reached and the inertia curve flattens out. The point where this happens in a plot of inertia vs. number of clusters is called "elbow". Identifying this "elbow" is one of the means to find an optimal number of clusters. However, this recipe does not work under all circumstances. As the number of clusters get big we may be confronted with a smooth decline of the inertia sum.

What data do we use for gradient descent after cluster detection?

How could we measure whether an image shows certain clusters? We could e.g. measure distances (with some appropriate metric) of all image points to the clusters. The "fit_transform()"-method of KMeans and MiniBatchKMeans provide us with with some distance measure of each image to the identified clusters. This means our images are transformed into a new feature space - namely into a "cluster-distance space". This is a quite complex space, too. But it has less dimensions than the original feature space!

Note: We would of course normalize the resulting distance data in the new feature space before applying gradient descent.

Application of "KMeansBatch" to MNIST

There are multiple variants of "KMeans". We shall use one which is provided by SciKit-Learn and which is optimized for large datasets: "MiniBatchKMeans". It operates batch-wise without loosing too much of accuracy and convergence properties in comparison to KMeans (or a comparison see here). "MiniBatchKMeans"has some parameters you can play with.

We could be tempted to use 10 clusters as there are 10 digits to discriminate between. But remember: A digit can be written in very many ways. So, it is much more probable that we need a significant larger number of clusters. But again: How to determine on which K-values we should invest a bit more time? "Kmeans" and methods alike offer another quantity called "silhouette" coefficient. It measures how well the data points are within, at or outside the borders of a cluster. See the book of Geron referenced at the link given above on more information.

Variation of CPU time, inertia and average silhouette coefficients with the number of clusters "K"

Let us first have a look at the evolution of CPU time, total inertia and averaged silhouette with the number of clusters "K" for two different runs. The following code for a Jupyter cell gives us the data:

# *********************************************************
# Pre-Clustering => Searching for the elbow 
# *********************************************************
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
X = np.concatenate((ANN._X_train, ANN._X_test), axis=0)
y = np.concatenate((ANN._y_train, ANN._y_test), axis=0)
print("X-shape = ", X.shape, "y-shape = ", y.shape)
num = X.shape[0]

li_n = []
li_inertia = []
li_CPU = []
li_sil1 = []

# Loop over the number "n" of assumed clusters 
rg_n = range(10,171,10)
for n in rg_n:
    print("\nNumber of clusters: ", n)
    start = time.perf_counter()
    kmeans = MiniBatchKMeans(n_clusters=n, n_init=500, max_iter=1000, batch_size=500 )  
    X_clustered = kmeans.fit_transform(X)
    sil1 = silhouette_score(X, kmeans.labels_)
    #sil2 = silhouette_score(X_clustered, kmeans.labels_)
    end = time.perf_counter()
    dtime = end - start
    print('Inertia = ', kmeans.inertia_)
    print('Time_CPU = ', dtime)
    print('sil1 score = ', sil1)

# Plots         
# ******
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 5
fig1 = plt.figure(1)
fig2 = plt.figure(2)

ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)

ax1_1.plot(li_n, li_CPU)
ax1_1.set_xlabel("num clusters K")
ax1_1.set_ylabel("CPU time")

ax1_2.plot(li_n, li_inertia)
ax1_2.set_xlabel("num clusters K")

ax2_1 = fig2.add_subplot(121)
ax2_2 = fig2.add_subplot(122)

ax2_1.plot(li_n, li_sil1)
ax2_1.set_xlabel("num clusters K")
ax2_1.set_ylabel("silhoutte 1")

You see that I allowed for large numbers of initial centroid positions and iterations to be on the safe side. Before you try it yourself: Such runs for a broad variation of K-values are relatively costly. The CPU time rises from around 32 seconds for 30 clusters to a little less than 1 minute for 180 clusters. These times add up to a significant sum after a while ...

Here are some plots:

The second run was executed with a higher resolution of K_(n+1) - K_n 5 = 5.

We see that the CPU time to determine the centroids' positions varies fairly linear with "K". And even for 170 clusters it does not take more than a minute! So, CPU-time for cluster identification is not a major limitation.

Unfortunately, we do not see a clear elbow in the inertia curve! What you regard as a reasonable choice for the number K depends a lot on where you say the curve starts to flatten. You could say that this happens around K = 60 to 90. But the results for the silhouette-quantity indicate for our parameter setting that K=40, K=70, K=90 are interesting points. We shall look at these points a bit closer with higher resolution later on.

Reduction of the regularization factor (for Ridge regularization)

Now, I want to discuss an important point which I did not find in the literature:
In my last article we saw that regularization plays a significant but also delicate role in reaching top accuracy values for the test dataset. We saw that Lambda2 = 0.2 was a good choice for a normalized input of the MNIST data. It corresponded to a certain ratio of the regularization term to average batch costs.
But when we reduce the number of input nodes we also reduce the number of total weights. So the weight values themselves will automatically become bigger if we want to get to similar good values at the second layer. But as the regularization term depends in a quadratic way on the weights we may assume that we roughly need a linear reduction of Lambda2. So, for K=100 clusters we may shrink Lambda2 to (0.2/784*100) = 0.025 instead of 0.2. In general:

Lambda2_cluster = Lambda2_std * K / (number of input nodes)

I applied this rule of a thumb successfully throughout experiments with clustering befor gradient descent.

Reference run without clustering

We saw at the end of article XII that we could reach an accuracy of around 0.975 after 500 epochs under optimal circumstances. But in the case I presented ten I was extremely lucky with the statistical initial weight distribution and the batch composition. In other runs with the same parameter setup I got smaller accuracy values. So, let us take an ad hoc run with the following parameters and results:
Parameters: learn_rate = 0.001, decrease_rate = 0.00001, mom_rate = 0.00005, n_size_mini_batch = 500, n_epochs = 600, Lambda2 = 0.2, weights at all layers in [-2*1.0/sqrt(num_nodes_layer), 2*1.0/sqrt(num_nodes_layer)]
Results: acc_train: 0.9949 , acc_test: 0.9735, convergence after ca. 550-600 epochs

The next plot shows (from left to right and the down) the evolution of the costs per batch, the averaged error of the last mini-batch during an epoch, the ratio of regularization to batch costs and the total costs of the training set, respectively .

The following plot summarizes the evolution of the total costs of the traaining set (including the regularization contribution) and the evolution of the accuracy on the training and the test data sets (in orange and blue, respectively).

The required computational time for the 600 epochs was roughly 18,2 minutes.

Results of gradient descent based on a prior cluster identification

Before we go into a more detailed discussion of code adaption and test runs with things like clusters in unnormalized and normalized feature spaces, I want to show what we - without too much effort - can get out of using cluster detection ahead of gradient descent. The next plot shows the evolution of a run for K=70 clusters in combination with a special normalization:

and the total cost and accuracy evolution

The dotted line marks an accuracy of 97.8%! This is 0.5% bigger then our reference value of 97.3%. The total gain of %gt; 0.5% means however 18.5% of the remaining difference of 2.7% to 100% and we past a value of 97.8% already at epoch 600 of the run.

What were the required computational times?

If we just wanted 97.4% as accuracy we need around 150 epochs. And a total CPU time of 1.3 minutes to get to the same accuracy as our reference run. This is a factor of roughly 14 in required CPU time. For a stable 97.73% after epoch 350 we were still a factor of 5.6 better. For a stable accuracy beyond 97.8% we needed around 600 epochs - and still were by a factor of 3.3 faster than our reference run! So, clustering really brings some big advantages with it.


In this article I discussed the idea of introducing cluster identification in the (unnormalized or normalized) feature space ahead of gradient descent as a possible means to save computational time. A preliminary trial run showed that we indeed can become significantly faster by at least a factor of 3 up to 5 and even more. This is just due to the point that we reduced the number of input nodes and thus the number of mathematical calculations during matrix operations.

In the next article we shall have a more detailed look at clustering techniques in combination with normalization.