Autoencoders and latent space fragmentation – II – number distributions of latent vector components

This post series studies the (in-) ability of a trained Autoencoder [AE] to create reasonable human face images from statistical vectors placed in its latent space. In my last post

Autoencoders and latent space fragmentation – I – Encoder, Decoder, latent space

I have described what the purposes of the two sub-networks of an AE, i.e. the Encoder and the Decoder, are. We saw that the so-called latent space plays an important role for the interplay of these sub-networks: Vectors in the latent space – z-points – encode properties of objects presented to the Autoencoder, more precisely to its Encoder. The set of objects used during training thus gives us a distribution of vectors and respective z-points within certain regions of the latent space. The Decoder reconstructs objects from latent vectors.

One of my eventual objectives in this series is the creation of new objects of the same class as those presented to the AE during its training. I focus on the special case of images displaying human faces. Therefore, I have trained a convolutional AE with the so-called CelebA dataset. After training one may hope that the AE will be able to produce images with new faces, which are not present in CelebA, when we feed the Decoder with suitable z-points. The question is what “suitable z-points” are and where they are located in the latent space.

Of course, I want to use the Decoder’s reconstruction abilities to achieve my goal. To get new faces, a statistical element is a must. The basic idea is to use statistically created z-points as input for the Decoder.

Objective of this post

In the first post of this series I have already indicated that not all vectors in the AE’s latent space may lead to the production of reasonable images. It might well be that we must hit certain confined regions of the latent space. An interesting question, therefore, is the following:

Are all generators of statistical vectors suitable to hit the regions of a latent space which an Autoencoder will fill for certain training objects?

In this and further posts I want to show you that this is not the case. A bad choice of a statistical generating method in combination with the high number of latent space dimensions may lead to a complete failure.

To achieve this insight we must study both the real z-point distribution which an AE creates for CelebA images and the artificial vector or z-point distribution coming from a specific statistical generator. In this post I study a generator which assigns each vector component a value drawn with constant probability density from a real number interval. The number of dimensions N of the latent space shall be N=256.

We shall see that we indeed get confronted with a special side of the curse of high dimensionality and that our artificial z-point distribution does not match the real one for CelebA at all if we do not restrict parameters in a somewhat counter-intuitive way. As a side effect we will learn that our AE organizes the latent vector distributions for CelebA images via functions very similar to Gaussians. Furthermore, we shall see that we are dealing with just one coherent and confined z-point region with very small extensions. The center of this region sits close to a hypervolume spanned by only a few (out of 256) coordinate axes.

Methods to create statistical z-points

To create statistical z-points in a latent space we have to employ some generating mechanism for respective statistical vectors. See my first post for the correspondence of z-points to latent vectors. There are multiple ways available to create latent vectors. I just name 3 popular ones:

  1. Create a dense homogeneous distribution by filling some volume around the origin with a grid of points.
  2. Use a constant probability density for values in a real number interval [-b, b], pick such values statistically and assign them to individual vector components.
  3. Use one or multiple Gaussian distributions to define the vector component values.

Note that when applying the second and the third method the statistics works on the level of the vector components. These components are handled as independent variables, each obeying a certain probability distribution of assignable real values.

A small calculation shows that method 1 will not work in practice, as such a distribution requires an enormous number of points for high dimensions and a decent resolution. Method 2 looks simple and works well in 2D- and 3D-spaces. The third method requires assumptions about the mean value and standard deviation to choose – per vector coordinate.
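To make the three methods a bit more concrete, here is a minimal NumPy sketch of methods 2 and 3 (the array sizes and the chosen values for b, μ and σ are purely illustrative assumptions, not the settings used later in this post):

        import numpy as np

        N        = 256        # number of latent space dimensions
        num_vecs = 200000     # number of statistical vectors to create

        # Method 2: constant probability density over [-b, b] per vector component
        b = 2.0
        z_uniform = np.random.uniform(low=-b, high=b, size=(num_vecs, N))

        # Method 3: one Gaussian per component; mu and sigma have to be chosen
        # (or measured) per coordinate - here simply 0 and 1 for all axes
        mu    = np.zeros(N)
        sigma = np.ones(N)
        z_gauss = np.random.normal(loc=mu, scale=sigma, size=(num_vecs, N))

        # Method 1 is hopeless in high dimensions: a grid with merely 10 points
        # per axis would already consist of 10**256 z-points.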

Therefore it is tempting to use method 2 to create one or more artificial statistical distributions of random vectors in the latent space. This is exactly what we will try in this post – in the hope that a significant part of the resulting points will hit regions which give us reasonable images.

If you are interested in mathematical properties of vector distributions created by method 2 in multi-dimensional spaces, you will find them in the following posts of this blog:

Latent spaces – pitfalls of distributing points in multi dimensions – I – constant probability density per dimension
Latent spaces – pitfalls of distributing points in multi dimensions – II – missing specific regions

Why must we hit specific regions of the latent space?

Experience tells us that we will not get reasonable images from arbitrarily and randomly placed z-points in the latent space. (Concrete examples will be given in the next post.) What could a plausible reason be?

Convolutional networks [CNNs] extract patterns and save them in parameters of their layer filters (or neural maps). In another post series I have shown such elementary patterns, which the innermost convolutional layers react sensitively to, for the simple case of MNIST. Patterns correspond to correlations between constituting elements of the objects we present to the neural networks. Such patterns reflect certain features of the objects. The number of elementary patterns a CNN can handle depends on the number of available kernel filters – which is fixed by the network structure. For a trained convolutional Autoencoder it is therefore reasonable to assume that a latent space vector encodes a prescription for the superposition of certain elementary patterns by which the Decoder eventually creates an image. The information for the pattern mixture is encoded both in the length of a latent vector and in its angles with respect to the many coordinate axes; the multitude of angles describes the orientation of the vector in its multi-dimensional space.

It is clear that not all prescriptions for a mixture of elementary patterns will reflect the real pixel correlations of a human face in front of some background. We must therefore assume that the latent space regions filled by a trained AE-Encoder for the training objects are the ones which give us reasonable Decoder results. In principle we could find multiple such regions in different parts of the latent space. They could have particular locations and could be confined to a relatively small volume. Therefore, we should really check that at least a part of our statistical vectors points to those regions.

Number distributions for vector components

A priori we do not know anything about the shape of the z-point distribution which an Autoencoder will create in its latent space for CelebA data. Therefore, we need to get an overview over some properties of such a z-point distribution. As multidimensional spaces with a high number of dimensions like N ≥ 256 cannot be presented in 3D, we need some other kind of visualization. What we shall use is a kind of spectral display for the coordinate values of our vectors:

We are going to analyze the number distribution for the values of each of the vector components.

I.e., we count the numbers of latent vectors which have values for a certain component in a series of sampling intervals for real values. We do this for all components and for a reasonable total range of values. Note that a distribution for a specific component is a one-dimensional function over ℜ.

Below we will first derive the component related distributions from real Autoencoder vectors for CelebA images. This will tell us already a lot about the orientation, the off-center location and the extensions of the real z-point distribution for CelebA.

Afterward, we will compare the CelebA specific distributions to the number distributions for artificial statistical vector distributions created by method 2.

Number distributions for vector lengths

Another nice and simple method to analyze the compatibility of vector distributions is to compare the number distributions for the vector lengths. For an orthogonal coordinate system of the latent space we can compute the length of a multi-dimensional vector by the Euclidean L2-norm. We will compare the number distribution for the lengths of latent CelebA vectors with the distribution for the lengths of vectors created by method 2. We will also call the length of a vector its “radius”.
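As a small hedged sketch (the array names are mine; in a real run z would contain the Encoder output for all CelebA images instead of the random stand-in used here), the length distribution requires only a few NumPy lines:

        import numpy as np

        # Stand-in for the real data: in practice z = encoder.predict(celebA_images)
        num_imgs, N = 170000, 256
        z = np.random.normal(loc=0.0, scale=1.0, size=(num_imgs, N))

        radii = np.linalg.norm(z, axis=1)                  # Euclidean (L2) length per vector
        counts, bin_edges = np.histogram(radii, bins=100)  # number distribution of the radii
        print("mean radius:", radii.mean(), "  std dev:", radii.std())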

Network setup

The basic layer structure of my AE was already described in the last post. It is relatively simple. I employ only a few Encoder and Decoder layers. I have ensured that the Encoder and Decoder are actually able to solve the basic task of encoding and decoding data of real CelebA images.

We look at the results of two AE networks which differ in the number of convolutional kernel filters used:

  • Test case I: We use 4 Conv2D layers in the Encoder with 32, 64, 128, 256 filters and 4 TransposeConv2D layers in the Decoder with 256, 128, 64, 32 filters, respectively.
  • Test case II: We use 4 Conv2D layers in the Encoder with 64, 64, 128, 128 filters and 4 TransposeConv2D layers in the Decoder with 128, 128, 64, 64 filters, respectively.

This is reflected in the following code snippet. There you also get information on the kernel sizes, strides and padding-methods. The number of dimensions of the latent space is N = z_dim = 256. The activation function is chosen to be Leaky ReLU.

        # Test case I
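        # Note: the Autoencoder class is a thin wrapper around the Keras models of the
        # Encoder and Decoder used in this series; INPUT_DIM and n_ch (number of color
        # channels) are assumed to be defined elsewhere in the notebook.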
        AE1 = Autoencoder(
            input_dim                  = INPUT_DIM
            , encoder_conv_filters     = [32,64,128,256]       # We take a bit bigger than D. Foster 
            , encoder_conv_kernel_size = [3,3,3,3]
            , encoder_conv_strides     = [2,2,2,2]
            , encoder_conv_padding     = ['same','same','same','same']

            , decoder_conv_t_filters     = [128,64,32,n_ch]    # !!! n_ch = 1 or 3 
            , decoder_conv_t_kernel_size = [3,3,3,3]
            , decoder_conv_t_strides     = [2,2,2,2]
            , decoder_conv_t_padding     = ['same','same','same','same']
            , z_dim = 256
            , act   = 0                  # activation 0:Leaky ReLU (standard), 1: ReLU, 2: SELU    
        )
        # test case II
        AE2 = Autoencoder(
            input_dim                  = INPUT_DIM
            , encoder_conv_filters     = [64,64,128,128]       # We take a bit bigger than D. Foster 
            , encoder_conv_kernel_size = [5,5,3,3]
            , encoder_conv_strides     = [2,2,2,2]
            , encoder_conv_padding     = ['same','same','same','same']

            , decoder_conv_t_filters     = [128,64,64,n_ch]    # !!! n_ch = 1 or 3 
            , decoder_conv_t_kernel_size = [3,3,5,5]
            , decoder_conv_t_strides     = [2,2,2,2]
            , decoder_conv_t_padding     = ['same','same','same','same']
            , z_dim = 256
            , act   = 0                  # activation 0:Leaky ReLU (standard), 1: ReLU, 2: SELU    
        )

A method to analyze the vector distribution in a high-dimensional vector space

The components of our latent vectors determine their angles and lengths. We base our analysis of the corresponding z-points on the number distribution per component value. To do this we select a suitable real value interval covering all the values for vector components which the AE actually uses. We divide this interval into a sufficient number of sub-intervals for data sampling – in our case around 100.

After training of our AE we once again feed all training objects (in our case > 170,000 CelebA images) into the Encoder and keep the vectors in some Numpy arrays. Then we look at a specific component and a sampling interval and count the number of vectors for which the component value resides inside the sampling interval. Repeating this for all components and intervals we get a number distribution which can be plotted. If we are lucky the resulting shapes of the number distributions will give us information about the corresponding shape of the multi-dimensional regions which the AE fills for CelebA images.
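A hedged sketch of this counting procedure (again with a random stand-in array so that the lines run on their own; in practice z would come from the Encoder):

        import numpy as np

        # z: one latent vector per CelebA image, e.g. z = encoder.predict(celebA_images)
        z = np.random.normal(size=(170000, 256))

        val_min, val_max, n_bins = -12.0, 12.0, 96            # value range and number of sampling sub-intervals
        bin_edges = np.linspace(val_min, val_max, n_bins + 1)

        # one number distribution per vector component: resulting shape (256, 96)
        comp_counts = np.stack(
            [np.histogram(z[:, j], bins=bin_edges)[0] for j in range(z.shape[1])]
        )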

CelebA images: Number distribution for component values of latent vectors

The following plot shows the number distributions for all of the 256 components of vectors for CelebA images in our trained AE’s latent space:

Case I: Number distribution after 24 epochs

Case I: Selected components

Case II: Number distribution after 30 epochs

Case II: Selected components

We see that the individual number distributions are very similar to Gaussian distributions. For test case II I also give you the values for the central average value μ (named mu in the list below) and the half-width (named hw below) of the most interesting components. The half-width is the difference between those coordinate values where the distribution function achieves a value of half of the maximum number value at μ:

 15 mu : -0.25 :: hw:  1.5
 16 mu :  0.5  :: hw:  1.125
 56 mu :  0.0  :: hw:  1.625
 58 mu :  0.25 :: hw:  2.125
 66 mu :  0.25 :: hw:  1.5
 68 mu :  0.0  :: hw:  2.0
110 mu :  0.5  :: hw:  1.875
118 mu :  2.25 :: hw:  2.25
151 mu :  1.5  :: hw:  4.125
177 mu : -1.0  :: hw:  2.25
178 mu :  0.5  :: hw:  1.875
180 mu : -0.25 :: hw:  1.5
188 mu :  0.25 :: hw:  1.75
195 mu : -1.5  :: hw:  2.0
202 mu : -0.5  :: hw:  2.25
204 mu : -0.5  :: hw:  1.25
210 mu :  0.0  :: hw:  1.75
230 mu :  0.25 :: hw:  1.5
242 mu : -0.25 :: hw:  2.375
253 mu : -0.5  :: hw:  1.0

These components obviously have either a relatively large absolute μ-value or a relatively large half-width. I call these components the dominant ones.
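For completeness, a sketch of how mu and hw can be extracted from the per-component histograms of the previous snippet (the function name is my own invention and a unimodal distribution is assumed):

        import numpy as np

        def mu_and_halfwidth(counts, bin_edges):
            """Estimate the center mu and the width at half maximum (the 'hw' of the text)
               of a unimodal 1D number distribution given as a histogram."""
            centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
            i_max   = np.argmax(counts)                       # bin with the maximum count
            mu      = centers[i_max]
            above   = np.where(counts >= 0.5 * counts[i_max])[0]
            hw      = centers[above[-1]] - centers[above[0]]  # distance of the half-maximum points
            return mu, hw

        # usage with comp_counts, bin_edges from the sketch above:
        # for j in (15, 16, 56, 118, 151):
        #     print(j, mu_and_halfwidth(comp_counts[j], bin_edges))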

Interpretation of the number distributions per vector component

The first thing these plots show is that the results for different networks are different, too. Without proving it, I also say from experience that even two different training runs for one and the same AE network structure may result in somewhat different number distributions. But although there are some differences there are also striking similarities:

  1. Most of the components show a Gaussian like number distribution about some mean value.
  2. The mean coordinate value μ for most of the components (≥ 90%) is zero or close to it.
  3. The component values cover a region of [-12, 12] in our specific case.
  4. There are only a few components ( ≤ 25) with mean values <μ> ≥ 0.25 or <μ> ≤ -0.25.
  5. There are only a few components ( ≤ 10) with mean values <μ> ≥ +1.0 or <μ> ≤ -1.
  6. There are only relatively few components (≤ 40) with a half-width ≥ 1.
  7. There are only a few components (≤ 20) with a half-width ≥ 1.5
  8. There are only very few components (≤ 5) with a half-width ≥ 3.

A bit of thinking and imagination tells us that the center of the distribution must be located somewhat off the origin, but very close to or within a hypervolume spanned by only a few dominant coordinate axes. The multi-dimensional region filled by the z-points has significant anisotropic extensions or elongations around its center only in a few directions of the multidimensional space.

All in all we speak about a very specific, limited multi-dimensional region, located at some distance from the origin (but not too far) with a center point close to or within a sub-volume of very low dimensions spanned by some coordinate axes. The overall direction of the center with respect to the origin is well defined by a few coordinates. The diameters of the regions are small in most directions. The z-points concentrate strongly towards the center of this region. Significant diameters around the center are only given in some particular directions. The respective axes involved are less than 10% of the total number.

This also means that a principal component analysis of the z-point distribution should give us only a number of dominant main components in the same number range (≈ 0.1 * N). See forthcoming posts for such an analysis.

Analogon in 3D: In an analogous case within a 3-dimensional space we would speak of a kind of ellipsoid with a significant diameter beyond 1 only in a specific direction and a center located close to a line in a plane spanned by 2 of the 3 coordinate axes.

Comparison to the number distribution for statistically created vectors with a constant probability for component values in a specific interval

Now we look at the number distributions for all components of 200,000 vectors created by our method 2. We choose the same region for values [-12, 12] as displayed for our test cases I and II above. The following plot confirms what you certainly have already guessed:

This plot just reflects the design of our probability distribution for each of the components. But the plot also indicates clearly that most of our statistically created points will not hit the latent space region which is filled by the Autoencoder for CelebA data.

The statistics obviously plays against us

It is very instructive to write a small program which scans all of the artificially created vectors and checks whether their end points lie within the region defined by the real CelebA points. I leave this to the reader. To get a good guess I personally defined a region of three times the half-width left and right of each component value center of the real distribution. The number of points fulfilling all criteria for our CelebA region came out to be exactly zero. Even with 12 times the half-width you get only very few vectors pointing into the CelebA region (around 20 vectors). Why is this the case?
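Nevertheless, here is a hedged sketch of how such a scan could look (all names are my own; mus and hws would come from the measured CelebA histograms, here they are random stand-ins):

        import numpy as np

        N, num_art, b = 256, 100000, 12.0     # number of artificial vectors kept moderate for memory reasons

        # per-component centers and half-widths of the real CelebA distribution
        # (random stand-ins; in practice taken from the histograms discussed above)
        mus = np.random.normal(0.0, 0.3, N)
        hws = np.abs(np.random.normal(1.0, 0.4, N)) + 0.2

        # method 2: uniform component values in [-b, b]
        z_art = np.random.uniform(-b, b, size=(num_art, N))

        # a vector counts as a hit only if EVERY component lies within mu +/- fact*hw
        fact = 3.0
        inside   = np.abs(z_art - mus) <= fact * hws          # shape (num_art, N)
        num_hits = np.sum(np.all(inside, axis=1))
        print("artificial vectors inside the CelebA region:", num_hits)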

The curse of a high dimensionality and a constant probability distribution

The first point is that for some components of the CelebA vectors the half-width really is rather small. So you do not cover the whole value interval [-12, 12] for the component values, but only around 75% of it for 12 times the half-width. Let us assume that this is the case for 7 to 8 components. Let us further assume that for another 60 components the covered fraction is around 90%. Then the probability to get a fitting point is (0.75)^8 * (0.9)^60 ≈ 0.1 * 0.0018 ≈ 0.00018. This gives us only around 30 out of 170,000 vectors potentially fulfilling our conditions. We are in that range.

But we know already that 90% of all components should hit an interval [-3, 3]. The probability for this in the case of our method 2 with b = 12 is 0.25 per component. The probability for a hit thus is 0.25^220 ≈ 10^-132, which is zero for all practical purposes. This is what really kills our efforts to place a point in the most interesting regions of the latent space with the help of method 2 and seemingly fitting values of b.

Now, you may think: A decisive parameter for method 2 is the interval [-b, b] from which I pick my statistical component values. What if we diminish b, e.g. to b=3 or b=2? This is a good idea as we shall see in the next paragraph. Yet, you still get only a few vectors (< 10) fulfilling all criteria for two times the half-width.

Let us reduce our b-interval to [-3, 3] for vector creation. Then for a typical vector more than 50 components miss the target region. Test runs show that even for [-2, 2] only around 10 out of 170,000 vectors would hit (the outer parts of) the target region for CelebA images. In this case the few components which have off-center mu values seemingly work against us.

Things change dramatically for b=1.5 and 2 times the half-width. Now the components of all the artificial vectors fulfill our criteria. If, however, you narrow the intervals to hit to 1.5 times the half-width, you are back again to only a very few vectors (< 10). For b=1 and 1.5 times the half-width we again get 170,000 vectors fulfilling our criteria.

Why does this happen? And does it mean that when we restrict our component values to [-1.5, 1.5] we would cover our real CelebA distribution?

Number distributions for the vector lengths

Below you find a plot showing the number distribution with respect to typical vector lengths – both for CelebA (in red) and vectors artificially created with method 2 (other colors).

The parameter “b” of our artificial distributions defines the interval [-b, b] from which we pick component values. The probability density is a constant for this interval.

The first point you may dwell upon is the fact that the radius values get so big – much bigger than b. This is due to the high number of dimensions; see the posts named above for a mathematical treatment of method 2.

The second counter-intuitive point may be the following: One expects that b=12 should really be a reasonable parameter value for method 2 to cover the range of component values for CelebA. But already for b=3 or b=4 we get vectors outside the vector length interval which the CelebA latent vectors fill. The reason for this lies in properties of the artificial radius distributions which can be derived mathematically. A mathematical calculation of the expectation value for the length (= radius R) of our artificially generated vectors gives us a value around

<R>    ≈    b * sqrt(1/3 * N) * sqrt( 1 / (1 + 1/(4*N)) )

with a relatively narrow spread. See the derivation in the posts quoted above (“sqrt” in the formula stands for the square root). The standard deviation Δstd(R) has a size of approximately

Δstd(R)    ≈    b * sqrt(1/15) * sqrt( 1 + 1/(4*N) )

The ratio of the half-width to radius thus declines with the square root of the number of dimensions N. As we see, only the artificial distributions for b=1, b=1.5, b=2 cover parts of the radius distribution for latent CelebA vectors.

So, if we want to use method 2 then we should work with b-values 1.0 ≤ b ≤ 2.0 to get a probability > 0 for creating reasonable images of human faces.
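These expressions are easy to cross-check numerically. A small sketch (the sample size is arbitrary; the printed simulation values should come out close to the formulas):

        import numpy as np

        N = 256
        for b in (1.0, 1.5, 2.0, 3.0, 12.0):
            z     = np.random.uniform(-b, b, size=(20000, N))
            radii = np.linalg.norm(z, axis=1)
            R_theo   = b * np.sqrt(N / 3.0) * np.sqrt(1.0 / (1.0 + 1.0 / (4 * N)))
            std_theo = b * np.sqrt(1.0 / 15.0) * np.sqrt(1.0 + 1.0 / (4 * N))
            print(f"b = {b:5.1f}   <R>: {radii.mean():8.3f} (theory {R_theo:8.3f})"
                  f"   std: {radii.std():6.3f} (theory {std_theo:6.3f})")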

Will a proper value of b guarantee us reasonable face images?

We have seen that b must be reduced to a range 1.0 ≤ b ≤ 2.0 to get reasonable radius values of our statistical vectors. Unfortunately, this does not guarantee us proper images either. The reason is that there might be correlations between the dominant component values which our simple number distributions do not reveal. We will take care of this in the next post. For now I just show you a plot of the correlation between two specific dominant components for 1000 randomly selected CelebA latent vectors:

Conclusion

In this post we have partially analyzed the distribution of vectors and related z-points which an Autoencoder creates for CelebA images in its latent space. We have found that the number distributions per vector component look like Gaussian distributions. While most of the components have a small spread around the value zero, there are a few dominant components which determine the (off-center) location and orientation of a coherent, confined and ellipsoidally shaped region for CelebA z-points. The center of this region is close to a hypervolume defined by a few axes.

It was a bit counter-intuitive to see that a simple method to create statistical z-points via a constant probability distribution for individual component values would obviously miss the relevant latent region for CelebA images totally. We saw that we would need very special parameter values to limit the component values to get artificial latent vectors with the required length. These findings alone make it very improbable that arbitrary z-points created without some restrictions for their component values would lead to reasonable face image creation by the Decoder.

Something that we have not yet covered is the question of correlations between vector components. This is the topic of the next post:

Autoencoders and latent space fragmentation – III – correlations of latent vector components

 

A simple CNN for the MNIST dataset – XII – filter visualization for maps of the first two convolutional layers

In the last article of this series

A simple CNN for the MNIST dataset – XI – Python code for filter visualization and OIP detection
A simple CNN for the MNIST dataset – X – filling some gaps in filter visualization
A simple CNN for the MNIST dataset – IX – filter visualization at a convolutional layer
A simple CNN for the MNIST dataset – VIII – filters and features – Python code to visualize patterns which activate a map strongly
A simple CNN for the MNIST dataset – VII – outline of steps to visualize image patterns which trigger filter maps
A simple CNN for the MNIST dataset – VI – classification by activation patterns and the role of the CNN’s MLP part

I provided some code to create visualizations of “OIPs” (original input patterns). An OIP is a characteristic pattern in an input image to which a selected map of the deepest convolutional layer of the CNN reacts strongly. Most of the patterns we found showed some overall large scale structure with sub-structures of smaller dimensions. In many cases the patterns were repeated two or even more times with some spatial distance across the image’s surface. That we got unique and relatively big patterns for the last and deepest Conv layer is not surprising because the maps of this layer cover the original image area with just a few neurons, i.e. with coarse resolution. The related convolutional filters work across relatively large distances. The astonishing ability of deeper layers to detect unique large scale patterns in input images is based on the weighted superposition of filters working on smaller scales together with a reduction of resolution.

To get images of OIPs we fed the trained CNN with input images whose pixel values were statistically distributed. We optimized the pixel values of the input images for a maximum response of selected maps of the third, i.e. deepest convolutional layer. We can, however, apply the same methods also to maps of the first and the second convolutional layers of our CNN. Then we get much simpler patterns – in the sense of repetitions of many small scale elements.
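As a reminder of the method – only a hedged sketch here; the real work is done by the Python class of article XI, and the function below is just an illustrative stand-alone variant – we perform a gradient ascent on the input pixels to maximize the mean activation of one selected map:

        import tensorflow as tf
        from tensorflow.keras.models import Model

        def create_oip(cnn, layer_name, map_index, steps=30, step_size=1.0):
            """Gradient ascent on the input pixels for the mean activation of one map."""
            sub_model = Model(inputs=cnn.inputs,
                              outputs=cnn.get_layer(layer_name).output)
            # start from an input image with statistically distributed pixel values
            img = tf.random.uniform((1, 28, 28, 1), minval=-0.5, maxval=0.5)
            for _ in range(steps):
                with tf.GradientTape() as tape:
                    tape.watch(img)
                    loss = tf.reduce_mean(sub_model(img)[..., map_index])  # response of the chosen map
                grads = tape.gradient(loss, img)
                grads = grads / (tf.math.reduce_std(grads) + 1e-8)         # normalized gradient
                img   = img + step_size * grads                            # ascent step
            return img.numpy()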

Below I just provide images of OIPs triggering maps of the first two convolutional layers without much further comment. I refer to the layer names as discussed in previous articles of this series.

Input image patterns which lead to a maximum activation of the maps of the convolutional layer Conv2D-1

Layer “Conv2D-1” has 32 maps. With simple fluctuations on the length scale of one to two pixels, we can easily create OIP-images for each of the maps.

Most of these images were actually derived from one and the same input image.
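How such statistically fluctuating input images can be produced is sketched below (the value range is an assumption; bicubic upscaling of a coarse random grid yields fluctuations on longer length scales, which become relevant for the second layer):

        import numpy as np
        import tensorflow as tf

        # fluctuations on the scale of roughly one pixel: one random value per pixel
        img_short_scale = np.random.uniform(-0.5, 0.5, size=(1, 28, 28, 1)).astype(np.float32)

        # fluctuations on a longer scale (~4 pixels): a coarse 7x7 random grid,
        # bicubically upscaled to the 28x28 input size of the MNIST CNN
        coarse = np.random.uniform(-0.5, 0.5, size=(1, 7, 7, 1)).astype(np.float32)
        img_long_scale = tf.image.resize(coarse, (28, 28), method="bicubic").numpy()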

Input image patterns which lead to a maximum activation of the maps of the convolutional layer Conv2D-2

Layer “Conv2D-2” has 64 maps. With simple fluctuations on the length scale of one to two pixels plus some experiments with pixel value fluctuations on longer scales, I could produce OIP-images for 51 of the maps. Experiments for the other 13 maps took too much time; the systematic approach with large scale fluctuations, which we discussed thoroughly in previous articles, did not help on layer 2. If you look at the images below, you see that it is more likely that we need specific short and middle-scale fluctuations. However, the amount of possible data combinations is just too big for a systematic investigation.

Conclusion

In the course of the last articles we got a nice overview of the kinds of patterns to which the maps of the different convolutional layers of a CNN react. We are now well prepared to turn back to the question of what the ominous “features” of objects in input images really are. In the meantime have a look at another application of filter visualization in the realm of Deep Dreams, which I recently started to discuss in another article series of this blog. Stay tuned and wear masks to avoid the Corona virus! Stay healthy!

Other (previous) articles in this series

A simple CNN for the MNIST dataset – IV – Visualizing the activation output of convolutional layers and maps
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST datasets – I – CNN basics

 

Deep Dreams of a CNN trained on MNIST data – I – a first approach based on one selected map of a convolutional layer

It is fun to play around with Convolutional Neural Networks [CNNs] on the level of a dedicated amateur. One of the reasons is the possibility to visualize the output of elementary building blocks of this class of AI networks. The resulting images help to understand CNN algorithms in an entertaining way – at least in my opinion. In addition, the required effort is relatively limited: You must be willing to invest a bit of time into programming, but on a quite modest level of difficulty. And you can often find many basic experiments which are within the reach of limited PC capabilities.

A special area where the visualization of CNN guided processes is the main objective is the field of “Deep Dreams“. Anyone studying AI methods sooner or later stumbles across the somewhat psychedelic, but nonetheless spectacular images which Google presented in 2015 as a side branch of their CNN research. Today, you can download DeepDream generators from GitHub.

When I read a bit more about “DeepDream” experiments, I quickly learned that people use quite advanced CNN architectures, like Google’s Inception CNNs, and apply them to high resolution images (see e.g. the book of F. Chollet on “Deep Learning with Python” and ai.googleblog.com, 2015, inceptionism-going-deeper-into-neural). Even if you pick up an already trained version of an Inception CNN, you need some decent GPU power to do your own experiments. Another questionable point for an interested amateur is: What does one actually learn from applying “generators”, which others have programmed, and what from just following a “user guide” without understanding what a DeepDream SW actually does? Probably not much, even if you produce stunning images after some time…

So, I asked myself: Can one study basic methods of the DeepDream technology with self programmed tools and a simple dataset? Could one create a “DeepDream” visualization with a rather simply structured CNN trained on MNIST data?
The big advantage of the MNIST data set is that the individual samples are small; and the amount of numerical operations, which a related simple CNN must perform on input images, fits well to the capabilities of PC technology – even if the latter is some years old.

After a first look into DeepDream algorithms, I think: Yes, it should be possible. In a way DeepDream experiments are a natural extension of the visualization of CNN filters and maps which I have already discussed in depth in another article series. Therefore, DeepDream visualizations might even help us to better understand how the internal filters of CNNs work and what “features” are. However, regarding the creation of spectacular images we need to reduce our expectations to a reasonably low level:

A CNN trained on MNIST data works with gray images, low resolution and only simple feature patterns. Therefore, we will never produce such impressive images as published by DeepDream artists or by Google. But, we do have a solid chance to grasp some basic principles and ideas of this side-branch of AI with very simplistic tools.

As always in this blog, I explore a new field step-wise and let you as a reader follow me through the learning process. Throughout most of this new series of articles we will use a CNN created with the help of Keras and filter visualization tools which were developed in another article series of this blog. The CNN has been trained on the MNIST data set already.

In this first post we are going to pick just a single selected feature or response map of a deep CNN layer and let it “dream” upon a down-scaled image of roses. Well, “dream”, as a matter of fact, is a misleading expression; but this is true for the whole DeepDream business – as we shall see. A CNN does not dream; “DeepDream” creation is more to be seen as an artistic discipline using algorithmic image enhancement.

The input image which we shall feed into our CNN today is shown below:

As our CNN works on a resolution level of only 28×28 pixels, the “dreaming” will occur in a coarse way, very comparable to hallucinations on the blurred vision level of a short-sighted, myopic man. More precisely: Of a disturbed myopic man who works the whole day with images of digits and lets this poor experience enter and manipulate his dreamy visions of nicer things :-).

Actually, the setup for this article’s experiment was a bit funny: I got the input picture of roses from my wife, who is very much interested in art and likes flowers. I am myopic and in my soul still a theoretical physicist, who is much more attracted by numbers and patterns than by roses – if we disregard the interesting fractal nature of rose blossoms for a second :-).

What do DeepDreams based on single maps of trained MNIST CNNs produce?

To rouse your interest a bit or to disappoint you from the start, I show you a typical result of today’s exercise: “Dreams” or “hallucinations” based on MNIST and a selected single map of a deep convolutional CNN layer produce gray scale images with ghost-like “apparitions”.


When these images appeared on my computer screen, I thought: This is fun, indeed! But my wife just laughed – and said “physicists” with a known undertone and something about “boys and toys” …. I hope this will not stop you from reading further. Later articles will, hopefully, produce more “advanced” hallucinations. But as I said: It all depends on your expectations.

But let’s focus: How did I create the simple “dream” displayed above?

Requirements – a CNN and analysis and visualization tools described in another article series of this blog

I shall use results and methods, which I have already explained in another article series. You need a basic understanding of how a CNN works, what convolutional layers, kernel based filters and cost functions are, how we can build simple CNNs with the help of Keras, … – otherwise you will be lost from the beginning.
A simple CNN for the MNIST datasets – I – CNN basics
We also need a CNN, already trained on the MNIST data. I have shown how to build and train a very simple, yet suitable CNN with the help of Keras and Python; see e.g.:
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
In addition we need some code to create input image patterns which trigger response maps or full layers of a CNN optimally. I called such pixel patterns “OIPs”; others call them “features”. I have provided a Python class in the other article series which offers an optimization loop and other methods to work on OIPs and filter visualization.
A simple CNN for the MNIST dataset – XI – Python code for filter visualization and OIP detection

We shall extend this class by further methods throughout our forthcoming work. To develop and run the code you should have a working Jupyter environment, a virtual Python environment, an IDE like Eclipse with PyDev for building larger code segments and a working CUDA installation for an NVidia graphics card. My GTX 960 proved to be fully sufficient for what we are going to do.

Deep “Dream” – or some funny image manipulation?

As unfortunately happens so often with AI topics: Also in the case of the term “DeepDream” the vocabulary is exaggerated and thoroughly misleading. A simple CNN neither thinks nor “dreams” – it is a software manifestation of the results of an optimization algorithm applied to and trained on selected input data. If applied to new input, it will only detect patterns for which it was optimized before. You could also say:

A CNN is a manifestation of learned prejudices.

CNNs and other types of AI networks filter input according to encoded rules which serve a specific purpose and which reflect the properties of the selected training data set. If you ever used the CNN of my other series on your own hand-written images after a training only on the (US-) MNIST images you will quickly see what I mean. The MNIST dataset reflects an American style of writing digits – a CNN trained on MNIST will fail relatively often when confronted with image samples of digits written by Europeans.

Why do I stress this point at all? Because DeepDreams reveal such kinds of “prejudices” in a visible manner. DeepDream technology extracts and amplifies patterns within images, which fit the trained filters of the involved CNN. F. Chollet correctly describes “DeepDream” as an image manipulation technique which makes use of algorithms for the visualization of CNN filters.

The original algorithmic concept for DeepDreams consists of the following steps (a rough code sketch follows after the list):

  • Extend your algorithm for CNN filter visualization (= OIP creation) from a specific map to the optimization of the response of complete layers. Meaning: Use the total response of all maps of a layer to define contributions to your cost function. Then mix these contributions in a defined weighted way.
  • Take some image of whatever motive you like and prepare 4 or more down-scaled versions of this image, i.e. versions with different levels of size and resolution below the original size and resolution.
  • Offer the image with the lowest resolution to the CNN as an input image.
  • Loop over all prepared image sizes:
    • Apply your algorithm for filter visualization of all maps and layers to the input image – but only for a very limited amount of epochs.
    • Upscale the resulting output image (OIP-image) to the next level of higher resolution.
    • Add details of the original image with the same resolution to the upscaled OIP-image.
    • Offer the resulting image as a new input image to your CNN.
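To make these steps a bit more tangible, here is a minimal sketch of such an octave loop in the spirit of F. Chollet’s published Keras example – it is not the code of this series (that follows in the next article), and cnn, layer_names and all parameter values are assumptions:

        import tensorflow as tf
        from tensorflow.keras.models import Model

        def build_extractor(cnn, layer_names):
            outputs = [cnn.get_layer(name).output for name in layer_names]
            return Model(inputs=cnn.inputs, outputs=outputs)

        def gradient_ascent_step(extractor, img, step_size):
            with tf.GradientTape() as tape:
                tape.watch(img)
                acts = extractor(img)
                acts = acts if isinstance(acts, list) else [acts]
                loss = tf.add_n([tf.reduce_mean(a) for a in acts])   # mixed layer responses
            grads = tape.gradient(loss, img)
            grads /= tf.math.reduce_std(grads) + 1e-8
            return img + step_size * grads

        def deep_dream(cnn, layer_names, base_img, num_octaves=4, octave_scale=1.4,
                       steps=20, step_size=0.01):
            extractor = build_extractor(cnn, layer_names)
            base  = tf.convert_to_tensor(base_img, dtype=tf.float32)   # shape (1, H, W, C)
            h, w  = int(base.shape[1]), int(base.shape[2])
            sizes = [(int(h / octave_scale**i), int(w / octave_scale**i))
                     for i in range(num_octaves - 1, -1, -1)]          # smallest size first
            shrunk_original = tf.image.resize(base, sizes[0])
            img = tf.image.resize(base, sizes[0])
            for size in sizes:
                img = tf.image.resize(img, size)
                for _ in range(steps):
                    img = gradient_ascent_step(extractor, img, step_size)
                # re-add the image details which were lost by the previous down-scaling
                lost_detail = tf.image.resize(base, size) - tf.image.resize(shrunk_original, size)
                img += lost_detail
                shrunk_original = tf.image.resize(base, size)
            return img.numpy()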

Readers who followed me through my last series on “a simple CNN for MNIST” should already raise their eyebrows: What if the CNN expects a certain fixed size of the input image? Well, a good question. I’ll come back to it in a second. For the time being, let us say that we will concentrate more on resolution than on an actual image size.

The above steps make it clear that we manipulate an image multiple times. In a way we transform the image slowly to improve a layer’s response and repeat the process with growing resolution. I.e., we apply pattern detection and amplification on more and more details – in the end using all available large and small scale filters of the CNN in a controlled way without fully eliminating the original contents.

What to do about the low resolution of MNIST images and the limited capability of a CNN trained on them?

MNIST images have a very low resolution, while real images have a significantly higher one. With our CNN specialized on MNIST input the OIP-creation algorithm only works on (28×28)-images (and with some warnings, maybe, on smaller ones). What to do about it when we work with input images of a size of e.g. 560×560 pixels?

Well, we just work on the given level of resolution! We have three options:

  • We can downsize the input image itself or parts of it to the MNIST dimensions – with the help of a bicubic interpolation. Then our OIP-algorithm has the chance to detect OIPs on the coarse scale and to change the downsized image accordingly. Then we can upscale the result again to the original image size – and add details again.
  • We can split the input image into tiles of size (28×28) and offer these tiles as input to the CNN.
  • We can combine both of the above options.

It’s like what a short-sighted human would do: Work with a blurred impression of the full scale image or look at parts of it from a close distance and then reassemble his/her impressions to larger scales.

A first step – apply only one specific map of a convolutional layer on a down-scaled image version

In this article we have a very limited goal for which we do not have to change our tools, yet:

  • Preparation:
    • We choose a map.
    • We downscale the original image to (28×28) by interpolation, upscale the result again by interpolating again (with loss) and calculate the difference to the original full resolution image (all interpolations done in a bicubic way) – see the sketch after this list.
  • Loop (4 times or so):
    • We apply the OIP-algorithm on the downscaled input image for a fixed number of epochs
    • We upscale the result by bicubic interpolation to the original size.
    • We re-add the difference in details.
    • We downscale the result again.
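The scaling part of this recipe can be sketched as follows (only the bicubic interpolation steps; the OIP optimization itself is passed in as a function, e.g. a variant of the optimization loop of the class from the other article series):

        import tensorflow as tf

        def scale_cycle(full_img, oip_step, cycles=4):
            """full_img: original image of shape (1, H, W, C); oip_step: a function which
               applies the OIP gradient ascent to a (1, 28, 28, C) image for a few epochs."""
            full_img = tf.convert_to_tensor(full_img, dtype=tf.float32)
            h, w  = int(full_img.shape[1]), int(full_img.shape[2])
            small = tf.image.resize(full_img, (28, 28), method="bicubic")
            # preparation: details lost by down- and re-up-scaling the original image
            detail = full_img - tf.image.resize(small, (h, w), method="bicubic")
            for _ in range(cycles):
                small    = oip_step(small)                                  # pattern "carving" at 28x28
                upscaled = tf.image.resize(small, (h, w), method="bicubic")
                result   = upscaled + detail                                # re-add the fine details
                small    = tf.image.resize(result, (28, 28), method="bicubic")
            return result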

With this approach I try to apply some of the elements of the original algorithm – but just on one scale of coarse resolution. I shall discuss the code for realizing the recipe given above with Python and Jupyter in the next article. For today let us look at some of the ghost-like apparitions in the dreams for selected maps of the 3rd convolutional layer; see:
A simple CNN for the MNIST dataset – IX – filter visualization at a convolutional layer

DeepDreams based on selected maps of the 3rd convolutional layer of a CNN trained on MNIST data

With the image sections displayed below I have tried to collect results for different maps which focus on certain areas of the input image (with the exception of the first image section).

The first two images of each row display the detected OIP-patterns on the (28×28) resolution level with pixel values encoded in a (viridis) color-map; the third image shows the same in gray scale. The fourth image reveals the dream on the blurry vision level – up-scaled and interpolated to the original image size. You may still detect traces of the original rose blossoms in these images. The last two images of each row display the results after re-adding details of the original image and an adjustment of the width of the value distribution. The detected and enhanced pattern then turns into a whitish, ghostly shadow.

I have given each section a fancy headline.

I never promised you a rose garden …

“Getting out …”

“Donut …”

“Curls to form a 3 …”

“Two of them …”

“The creepy roots of it all …”

“Look at me …”

“A hidden opening …”

“Soft is something different …”

“Central separation …”

Conclusion: A CNN detects patterns or parts of patterns it was trained for in any kind of offered input …

You can compare the results to some input patterns (OIPs) which strongly trigger individual maps on the 3rd convolutional layer; you will detect similarities. E.g., four OIP- or feature patterns to which map 56 reacts strongly look like this:

Filter visualizations 1 – 4 for CNN map 56

This explains the basic shape of the “apparition” in the first “dream”:

This proves that the filters of a trained CNN actually detect patterns, which proved to be useful for a certain training purpose, in any kind of input which shows some traces of such patterns. A CNN simply does not “know” better: If you only have a hammer to interact with the world, everything becomes a nail to you in the end – this is the level of stupidity on which a CNN algorithm works. And it actually is a fundamental ingredient of DeepDream image manipulation – a transportation of learned patterns or prejudices to an environment outside the original training context.

In the next article
Deep Dreams of a CNN trained on MNIST data – II – some code for pattern carving
I provide the code for creating the above images.

Further articles in this series

Deep Dreams of a CNN trained on MNIST data – II – some code for pattern carving
Deep Dreams of a CNN trained on MNIST data – III – catching dream patterns at smaller length scales