Variational Autoencoder with Tensorflow – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?


This post continues my series on Variational Autoencoders [VAE] with some considerations regarding a VAE whose settings allow for only a tiny amount of the so called Kullback-Leibler [KL] loss.

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow – X – VAE application to CelebA images
Variational Autoencoder with Tensorflow – XI – image creation by a VAE trained on CelebA
Variational Autoencoder with Tensorflow – XII – save some VRAM by an extra Dense layer in the Encoder

So far, most of the posts in this series have covered a variety of methods (provided by Tensorflow and Keras) to control the KL loss. One of the previous posts (XI) provided (indirect) evidence that GradientTape()-based methods for the KL-loss calculation also work as expected. In stark contrast to a standard Autoencoder [AE], our VAE trained on CelebA data proved its ability to reconstruct humanly interpretable images from random z-points (or z-vectors) in the latent space, provided that the z-points lie within a reasonable distance of the origin.

We could leave it at that. One of the basic motivations to work with VAEs is to use the latent space “creatively”. This requires that the data points coming from similar training images fill the latent space densely and without gaps between clusters or filaments. We have obviously achieved this objective. Now we could start to do fun things like combining reconstruction with vector arithmetic in the latent space.

But looking a bit deeper into the latent space may give us some new insights. The central point of the KL-loss is that it introduces a statistical element into the training of AEs. As a consequence a VAE fills the so called “latent space” in a different way than a simple AE. The z-point distribution gets confined, and the areas around z-points for meaningful training images are forced to get broader and to overlap. So two questions call for an answer:

  • Can we get more direct evidence of what the KL-loss does to the data distribution in latent space?
  • Can we get some direct evidence supporting the assumption that most of the latent space of an AE is empty or only sparsely populated, in contrast to a VAE’s latent space?

Therefore, I thought it would be fun to compare the data organization in latent space caused by an AE with that of a VAE. But to get there we need a solid starting point. If you think for a moment about where you yourself would start with an AE vs. VAE comparison, you will probably come across the following additional and also interesting questions:

  • Can one safely assume that a VAE with only a very tiny amount of KL-loss reproduces the same z-point distribution vs. radius which an AE would give us?
  • In general: Can we really expect a VAE with a very tiny Kullback-Leibler loss to behave as a corresponding AE with the same structure of convolutional layers?

The answers to all these questions are the topics of this post and a forthcoming one. To get some answers I will compare a VAE with a very small KL-loss contribution with a similar AE. Both network types will consist of equivalent convolutional layers and will be trained on the CelebA dataset. We shall look at the resulting data point density distribution vs. radius, clustering properties and the ability to create images from statistical z-points.

This will give us a solid base to proceed to larger and more natural values of the KL-loss in further posts. I got some new insights along this path and hope the presented data will be interesting for the reader, too.

Below and in following posts I will sometimes call the target space of the Encoder also the “z-space“.

CelebA data to fill the latent vector-space

The training of an AE or a VAE occurs in a self-supervised manner. A VAE or an AE learns to create a point, a z-point, in the latent space for each of the training objects (e.g. CelebA images), in such a way that the Decoder can reconstruct an object (image) very close to the original from the z-point’s coordinate data. We will use the “CelebA” dataset to study the KL-impact on the z-point distribution. CelebA is more challenging for a VAE than MNIST, and the latent space requires a substantially higher number of dimensions than in the MNIST case for reasonable reconstructions. This makes things even more interesting.

The latent z-space filled by a trained AE or VAE is a multi-dimensional vector space. Meaning: Each z-point can be described by a vector defining a position in z-space. A vector in turn is defined by concrete values for as many vector components as the z-space has dimensions.

Of course, we would like to see some direct data visualizing the impact of the KL-loss on the z-point distribution which the Encoder creates for our training data. As we deal with a multidimensional vector space we cannot plot the data distribution. We have to simplify and somehow get rid of the many dimensions. A simple solution is to look at the data point distribution in latent space with respect to the distance of these points from the origin. Thereby we transform the problem into a one-dimensional one.

More precisely: I want to analyze the change in the number of z-points within “radius” intervals. Of course, a “radius” first has to be defined in a multidimensional vector space such as the z-space. But this can easily be achieved via a Euclidean L2-norm. As we expect the KL loss to have a confining effect on the z-point distribution, it should reduce the average radius of the z-points. We shall see later that this is indeed the case.
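Written out, the radius meant here is simply the Euclidean norm of a z-vector. The symbols z_k (the k-th component of a z-vector) and z_dim (the number of latent dimensions) are notation chosen for this note:

```latex
R(\mathbf{z}) \;=\; \lVert \mathbf{z} \rVert_2 \;=\; \sqrt{\sum_{k=1}^{z_{dim}} z_k^{\,2}}
```

Counting z-points per radius interval then reduces the 256-dimensional distribution to a one-dimensional curve.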

Another simple method to reduce dimensions is to look at just one coordinate axis and the data distribution for the calculated values in this direction of the vector space. Therefore, I will also check the variation in the number of data points along each coordinate axis in one of the next posts.

A look at clustering via projections to a plane may be helpful, too.

The expected similarity of a VAE with tiny KL-loss to an AE is not really obvious

Regarding the answers to the 3rd and 4th questions posed above your intuition tells you: Yes, you probably can bet on a similarity between a VAE with tiny KL-loss and an AE.

But when you look closer at the network architectures you may get a bit nervous. Why should a VAE network that has many more degrees of freedom than an AE not use both of its layers for “mu” and “logvar” to find a different distribution solution? A solution related to another minimum of the loss hyperplane in the weight configuration space? Especially as this weight-related space is significantly bigger than that of a corresponding AE with the same convolutional layers?

The whole point has to do with the following facts: In an AE’s Encoder the last flattening layer after the Conv2D-layers is connected to just one output layer. In a VAE, instead, the flattening layer feeds data into two parallel layers (for mu and logvar) across twice as many connections (with twice as many weight parameters to optimize).
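The doubling of the parameters can be made concrete with a small count. This is only a sketch: the size of the flattened Conv2D output (n_flat) is a hypothetical value chosen for illustration, and the helper dense_params() is not part of the classes of this series.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


z_dim = 256      # latent dimension used in this post
n_flat = 4608    # HYPOTHETICAL size of the flattened Conv2D output

ae_out = dense_params(n_flat, z_dim)         # AE: one output layer
vae_out = 2 * dense_params(n_flat, z_dim)    # VAE: mu layer plus logvar layer

print("AE  encoder head parameters:", ae_out)
print("VAE encoder head parameters:", vae_out)
print("ratio:", vae_out / ae_out)
```

Whatever the real value of n_flat is, the ratio stays at 2 as long as both VAE layers are connected to the same flattened output.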

In the last post of this series we dealt with this point from the perspective of VRAM consumption. Now the question is to what extent a VAE with a tiny KL-loss will be similar to an AE.

Why should the z-points found be determined only by mu-values and not also by logvar-values? And why should the mu values reproduce the same distribution as an AE? At least the architecture does not guarantee this by any obvious means …

Well, let us look at some data.

Structure of an AE for CelebA and its total loss after some epochs

Our test AE contains the same simple sequence of four Conv2D layers in the Encoder and four Conv2DTranspose layers in the Decoder as our VAE. See the AE’s Encoder layer structure below.

A difference, however, is that I will not use any BatchNormalization layers in the AE. But as a correctly implemented BatchNormalization should, for reasons of principle, not affect the representational power of a VAE network, this should not influence the comparison of the final z-point distributions in a significant way.

I performed an AE training run for 170,000 CelebA training images over 24 epochs. The latent space has a dimension of z_dim=256. (This relatively low number of dimensions will make it easier for a VAE to confine z-points around the origin; see the discussion in previous posts.)

The resulting total loss of our AE became ca. 0.49 per pixel. This translates into a total value of

AE total loss on Celeb A after 24 epochs (for a step size of 0.0005): 4515

This value results from a summation over all geometric pixels of our CelebA images which were downsized to 96×96 px (see post IX). The given value can be compared to results measured by our GradientTape()-based VAE-model which delivers integrated values and not averages per pixel.
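The conversion from the per-pixel value to the total value is a plain multiplication with the number of pixels. A minimal check, using the rounded per-pixel loss quoted above:

```python
img_side = 96            # CelebA images were downsized to 96x96 px
loss_per_pixel = 0.49    # approximate AE loss per pixel from the text

n_pixels = img_side * img_side
total_loss = loss_per_pixel * n_pixels

print("pixels per image:", n_pixels)
print("total loss      :", round(total_loss, 1))
```

The small difference to the quoted 4515 comes from the rounding of the per-pixel value to two digits.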

This value is significantly smaller than the values we would get for the total loss of a VAE with a reasonably big KL-loss contribution of the order of some percent of the reconstruction loss. Such a VAE produces values of around 4800 up to 5000. Apparently, an AE’s Decoder reconstructs originals much better than a VAE with a significant KL-loss contribution to the total loss.

But what about a VAE with a very small KL-loss? You will get the answer in a minute.

Where does a standard Autoencoder [AE] place the z-points for CelebA data?

We can not directly plot a data point distribution in a 256-dimensional vector-space. But we can look at the data point density variation with a calculated distance from the origin of the latent space.

The distance R from the origin to the z-point for each image can be measured in terms of an L2 (= Euclidean) norm of the latent vector space. Afterward it is easy to determine the number of images within radius intervals of a fixed width, e.g. 0.5, for radii R in the range

0  <  R  <  35 .

We perform the following steps to get respective numbers. We let the Encoder of our trained AE predict the z-points of all 170,000 training data

z_points  = AE.encoder.predict(data_flow) 

data_flow was created by a Keras ImageDataGenerator to send batches of training data to the GPU (see the previous posts).

Radius values are then calculated as

print("NUM_Images_Train = ", NUM_IMAGES_TRAIN)
ay_rad_z = np.zeros((NUM_IMAGES_TRAIN,),  dtype='float32')
for i in range(0, NUM_IMAGES_TRAIN):
    sq = np.square(z_points[i]) 
    sqrt_sum_sq = math.sqrt(sq.sum())
    ay_rad_z[i] = sqrt_sum_sq
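The same radii can be computed without an explicit Python loop. The following sketch compares both variants on synthetic data; the random array only stands in for the real Encoder output and does not represent CelebA results.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the Encoder output: 1000 random vectors with z_dim = 256
z_points = rng.normal(size=(1000, 256)).astype("float32")

# Loop variant, as in the text
rad_loop = np.zeros((z_points.shape[0],), dtype="float32")
for i in range(z_points.shape[0]):
    rad_loop[i] = math.sqrt(np.square(z_points[i]).sum())

# Vectorized variant: L2-norm along the component axis
rad_vec = np.linalg.norm(z_points, axis=1)

print("both variants agree:", np.allclose(rad_loop, rad_vec, rtol=1e-4))
```

For 170,000 vectors the vectorized call is considerably faster, but the result is the same up to float32 rounding.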

The numbers vs. radius relation then results from:

li_rad      = []
li_num_rad  = []
int_width = 0.5
for i in range(0,70):
    low   = int_width * i
    high  = int_width * (i+1) 
    num   = np.count_nonzero( (ay_rad_z >= low) & (ay_rad_z < high ) )
    li_rad.append(0.5 * (low + high))
    li_num_rad.append(num)
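As an alternative to counting interval by interval, np.histogram delivers the same numbers in one call. The sketch below uses synthetic radii as a stand-in for ay_rad_z (they are not the measured CelebA values) and cross-checks the histogram against the explicit counting.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for ay_rad_z: synthetic radii, NOT the measured CelebA values
ay_rad_z = rng.normal(loc=16.5, scale=2.5, size=10_000).astype("float32")
ay_rad_z = np.clip(ay_rad_z, 0.0, 34.9)

int_width = 0.5
edges = np.arange(0, 71) * int_width           # 0.0, 0.5, ..., 35.0
counts, _ = np.histogram(ay_rad_z, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])       # interval midpoints

# Cross-check against the explicit interval counting
counts_loop = [
    int(np.count_nonzero((ay_rad_z >= edges[i]) & (ay_rad_z < edges[i + 1])))
    for i in range(70)
]

print("radius of the maximum:", centers[np.argmax(counts)])
print("points counted       :", counts.sum())
```

One detail to keep in mind: np.histogram treats the last interval as closed on its right edge, so a value of exactly 35.0 would be counted there, while the loop variant would drop it.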

The resulting curve is shown below:

There seems to be a peak around R = 16.75. So, yet another question arises:

What is so special about the radius values of 16 or 17?

We shall return to this point in the next post. For now we take this result as god-given.

Clustering of CelebA z-point data in the AE’s latent space?

Another interesting question is: Do we get some clustering in the latent space? Will there be a difference between an AE and a VAE?

A standard method to find an indication of clustering is to look for an elbow in the so called “inertia” curve for different assumed numbers of clusters. Below you find an inertia plot retrieved from the z-point data with the help of MiniBatchKMeans.

This result was achieved for data taken at every second value of the number of clusters “num_clus” in the range 2 ≤ num_clus ≤ 80. Unfortunately, the result does not show a pronounced elbow. Instead the variation at some special cluster numbers is relatively high. But if we absolutely wanted to define a value, then something between 38 and 42 appears to be reasonable. Up to that point the decline in inertia is relatively smooth. But do not be misled: the data depend on statistics and initial cluster values. When you repeat the calculation you may get something like the following plot with more pronounced spikes:

This is always a sign that the clustering is not very clear and that the clusters are not separated by a significant distance, at least not in all coordinate directions. Filamentary structures are not handled well by KMeans.

Nevertheless: A value of 40 is reasonable as we have 40 labels coming with the CelebA data. I.e. 40 basic features in the face images are considered to be significant and were labeled by the creators of the dataset.

t-SNE projections

We can also have a look at a 2-dimensional t-SNE-projection of the z-point distribution. The plots below have been produced with different settings for early exaggeration and perplexity parameters. The first plot resulted from standard parameter values for sklearn’s t-SNE variant.

tsne = TSNE(n_components=2, early_exaggeration=12, perplexity=30, n_iter=1000)

Other plots were produced by the following setting:

tsne = TSNE(n_components=2, early_exaggeration=16, perplexity=10, n_iter=1000)

Below you find some plots of a t-SNE analysis for different numbers of z-points and differently adjusted parameters for the resulting scatter plot. The number of statistically chosen z-points varies between 20,000 and 140,000.

Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)

Actually we see some indication of clustering, though it is not very pronounced. The clusters in the projection are not separated by clear and broad gaps. Of course, a 2-dimensional projection cannot completely visualize the original separations in a 256-dim space. However, we get the impression that clusters are located rather close to each other. Remember: We already know that almost all points are located in a multidimensional spherical shell between 12 < R < 24. And more than 50% lie between 14 ≤ R ≤ 19.

However, what the actual distribution of meaningful z-points (in the sense of a recognizable face reconstruction) really looks like cannot be deduced from the above t-SNE analysis. The concentration of the z-points may still be one which follows thin and maybe curved filaments in some directions of the multidimensional latent space on relatively small or various scales. We shall get a much clearer picture of the fragmentation of the z-point distribution in an AE’s latent space in the next post of this series.

Number of statistical z-points: 80,000

For the higher number of selected z-points the room between some concentration centers appears to be filled in the projection. But remember: This may only be due to projection effects in the presently chosen coordinate system. Another calculation with the above non-standard data for perplexity and early_exaggeration gives us:

Number of statistical z-points: 140,000

Note that some islands appear. Obviously, there is at least some clustering going on. However, due to projection effects we cannot deduce much for the real structure of the point distribution between possible clusters. Even the clustering itself could appear due to overlapping two or more broader filaments along a projection line.

Whether correlations would get more pronounced and therefore could also be better handled by t-SNE in a rotated coordinate system based on a PCA-analysis remains to be seen. The next post will give an answer.

At least we have got a clear impression about the radial distribution of the z-points. And thereby gathered some data which we can compare to corresponding results of a VAE.

Total loss of a VAE with a tiny KL-loss for CelebA data

Our test VAE is parameterized to create only a very small KL-loss contribution to the total loss. With the Python classes we have developed in the course of this post series we can control the ratio between the KL-loss and a standard reconstruction loss as e.g. BCE (binary-crossentropy) by a parameter “fact“.


fact = 1.0e-5

is a very small value. For a working VAE we would normally choose something like fact=5 (see post XI).

A value like 1.0e-5 ensures a KL loss around 0.0178 compared to a reconstruction loss of 4550, which gives us a ratio below 4.e-6. Now, what is a VAE going to do, when the KL-loss is so small?
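The quoted ratio follows directly from the two loss values. A two-line check with the rounded numbers from the text:

```python
kl_loss = 0.0178      # KL contribution reported for fact = 1.0e-5
reco_loss = 4550.0    # reconstruction loss (rounded, from the text)

ratio = kl_loss / reco_loss
print(f"KL / reconstruction = {ratio:.2e}")
```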

For the total loss the last epochs produced the following values:

VAE total loss on Celeb A after 24 epochs for a step size of 0.0005: 4,553

Output of the last 6 of 24 epochs.

Epoch 1/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4557.1694 - reco_loss: 4557.1523 - kl_loss: 0.0179
Epoch 2/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.9111 - reco_loss: 4556.8940 - kl_loss: 0.0179
Epoch 3/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.6626 - reco_loss: 4556.6450 - kl_loss: 0.0179
Epoch 4/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.3862 - reco_loss: 4556.3682 - kl_loss: 0.0179
Epoch 5/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4555.9595 - reco_loss: 4555.9395 - kl_loss: 0.0179
Epoch 6/6
1329/1329 [==============================] - 118s 89ms/step - total_loss: 4555.6641 - reco_loss: 4555.6426 - kl_loss: 0.0178

This is not too far away from the value of our AE. Other training runs confirmed this result. On four different runs the total loss value came to lie between

VAE total loss on Celeb A after 24 epochs: 4553 ≤ loss ≤ 4555 .

VAE with tiny KL-loss – z-point density distribution vs. radius

Below you find the plot for the variation of the number density of z-points vs. radius for our VAE:

Again, we get a maximum close to R = rad = 16. The maximum value lies a bit below the one found for a KL-loss-free AE. But all in all the form and width of the distribution of the VAE are very comparable to that of our test AE.

Can this result be reproduced?
Unfortunately not in 100% of the test runs performed. There are two main reasons:

  1. Firstly, we can not be sure that a second minimum does not exist for a distribution of points at bigger radii. This may be the case both for the AE and the VAE!
  2. Secondly, we have a major factor of statistical fluctuation in our game:
    The epsilon value which scales the logvar-contribution to the z-point coordinates in the sampling layer of the Encoder may in very rare cases abruptly jump to an unreasonably high value. A Gaussian covers extreme values, although the chances to produce such a value are pretty small. And a Gaussian is involved in the calculation of z-points by our VAE.

Remember that the z-point coordinates are calculated via the mu and logvar tensors according to

z = mu + B.exp(log_var / 2.) * epsilon

See Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics for respective code elements of the Encoder.

So, a lot depends on epsilon which is calculated as a statistically fluctuating quantity, namely as

epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)

Is there a chance that the training process may sometimes drive the system to another corner of the weight-loss configuration space due to abrupt fluctuations, with the result that the z-point distribution vs. radius deviates significantly from a distribution around R = 16? I think: Yes, this is possible!

From some other training runs I actually have an indication that there is a second minimum of the cost hyperplane with similar properties for higher average radius-values, namely for a distribution with an average radius at R ≈ 19.75. I got there after changing the initialization of the weights a bit.

This is another indication that the cost surface has a relatively rough structure and that extreme fluctuations of epsilon and a resulting gradient fluctuation can drive the position of the network in the weight configuration space to some strange corners. The weight values there can result in different z-point distributions at higher average radii. This actually happened during yet another training run: At epoch 22 the Adam optimizer suddenly directed the whole system to weight values resulting in a maximum of the density distribution at R = 66! This appeared to be totally crazy. At the same time the KL-loss also jumped to a much higher value.

When I afterward repeated the run from epoch 18 this did not happen again. Therefore, a statistical fluctuation must have been the reason for the described event. Such an erratic behavior can only be explained by sudden and extreme changes of z-point data enforcing a substantial change in size and direction of the loss gradient. And epsilon is a plausible candidate for this!

So far I had nothing in our Python classes which would limit the statistical variation of epsilon. The effects seen spoke for a code change such that we do not allow for extreme epsilon-values. I set limits in the respective part of the code for the sampling layer and its lambda function

        # The following function will be used by an eventual Lambda layer of the Encoder 
        def z_point_sampling(args):
            '''
            A point in the latent space is calculated statistically 
            around an optimized mu for each sample 
            '''
            mu, log_var = args # Note: These are 1D tensors !
            epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.) 
            # Limit extreme fluctuations element-wise to the interval [-5, 5]
            epsilon = B.clip(epsilon, -5., 5.)
            return mu + B.exp(log_var / 2.) * epsilon * self.enc_v_eps_factor
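To get a feeling for how often a limit of 5 on the absolute value of epsilon actually intervenes, one can estimate the tail probability of a standard normal distribution. The sketch below is plain NumPy and not the code of the Encoder class. It assumes one epsilon draw per image and z-component per epoch, and it imitates the element-wise limiting with np.clip.

```python
import math
import numpy as np

limit = 5.0
# Two-sided tail probability P(|epsilon| >= limit) for a standard normal
p_tail = math.erfc(limit / math.sqrt(2.0))

# ASSUMPTION: one epsilon per image and z-component in each epoch
n_draws_per_epoch = 170_000 * 256
print(f"P(|eps| >= {limit}) = {p_tail:.3e}")
print(f"expected hits per epoch ~ {p_tail * n_draws_per_epoch:.1f}")

# Element-wise limiting in NumPy, analogous to a clip operation on tensors
rng = np.random.default_rng(42)
epsilon = rng.standard_normal(size=(128, 256))
epsilon[0, 0] = 7.3                      # inject an artificial outlier
eps_limited = np.clip(epsilon, -limit, limit)
print("outlier after limiting:", eps_limited[0, 0])
```

Under these assumptions the probability per draw is below 1e-6, but the sheer number of draws means that a few tens of values beyond the limit can be expected in every epoch.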

This stabilized everything. But even without these limitations on average three out of 4 runs which I performed for the VAE ran into a cost minimum which was associated with a pronounced maximum of the z-point-distribution around R ≈ 16. Below you see the plot for the fourth run:

So, there is some chance that the degrees of freedom associated with the logvar-layer and the statistical variation for epsilon may drive a VAE into other local minima or weight parameter ranges which do not lead to a z-point distribution around R = 16. But after the limitation of epsilon fluctuations all training runs found a loss minimum similar to the one of our simple AE – in the sense that it creates a z-point density distribution around R ≈ 16.

VAE with tiny KL-loss: Inertia and clustering of the CelebA data?

Our VAE gives the following variation of the inertia vs. the number of assumed clusters:

This also looks pretty similar to one of the plots shown for our AE above.

t-SNE for our VAE with a tiny KL loss

Below you find t-SNE plots for 20,000, 80,000 and 140,000 images:

Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)

This is quite similar to the related image for the AE. You just have to rotate it.

Number of statistical z-points: 80,000

Number of statistical z-points: 140,000

All in all we get very similar indications as from our AE that some clustering is going on.

VAE with tiny KL-loss: Should its logvar values become tiny, too?

Besides reproducing a similar z-point distribution with respect to radius values, is there another indication that a VAE behaves similarly to an AE? What would be a clear sign that the similarity really exists on a deeper level of the layers and their weights?

The z-vector is calculated from the mu and logvar-vectors by:

z = mu + exp(logvar/2)*epsilon

with epsilon coming from a normal distribution. Please note that we are talking about vectors of size z_dim=256 per image.

If a VAE with a tiny KL-loss really becomes similar to an AE, it should define and set its z-points basically by using mu-values only, and not logvar-values. I.e. the VAE should become intelligent enough to ignore the degrees of freedom associated with the logvar-layer. This means that the z-point coordinates of a VAE with a very small KL-loss should in the end be almost identical to the mu-component values.

Ok, but to me it was not self-evident that a VAE during its training would learn

  • to produce significant mu-related weight-values, only,
  • and to keep the weight values for the connections to the logvar-layer so small that the logvar-impact on the z-space position gets negligible.

Before we speculate about reasons: Is there any evidence for a negligible logvar-contribution to the z-point coordinates or, equivalently, to the respective vector components?

A VAE with tiny KL-loss produces tiny logvar values …

To get some quantitative data on the logvar impact the following steps are appropriate:

  1. Get the size and algebraic sign of the logvar-values. Negative values logvar < -3 would be optimal.
  2. Measure the deviation between the mu- and z_points vector components. There should only be a few components which show significant values abs(mu – z) > 0.05
  3. Compare the radius values determined by z-components with the radius values derived from mu-components only, and measure the absolute and relative deviations. The relative deviation should be very small on average.
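The three steps can be sketched in NumPy as follows. The arrays mu, logvar and z are synthetic stand-ins generated here for illustration; in a real analysis they would come from the Encoder's outputs for the training images, and the printed values are therefore not CelebA results.

```python
import numpy as np

rng = np.random.default_rng(0)
n, z_dim = 2000, 256
# Synthetic stand-ins for the Encoder outputs (NOT the CelebA results)
mu = rng.normal(size=(n, z_dim))
logvar = rng.normal(loc=-13.0, scale=2.0, size=(n, z_dim))
z = mu + np.exp(logvar / 2.0) * rng.standard_normal(size=(n, z_dim))

# Step 1: size and sign of the logvar values
print("logvar max/min/avg:", logvar.max(), logvar.min(), logvar.mean())

# Step 2: component-wise deviation between z and mu
delta_mu_z = np.abs(z - mu)
print("max |z - mu|:", delta_mu_z.max())
print("components with |z - mu| > 0.05:", np.count_nonzero(delta_mu_z > 0.05))

# Step 3: radii from z and from mu, absolute and relative deviations
rad_z = np.linalg.norm(z, axis=1)
rad_mu = np.linalg.norm(mu, axis=1)
abs_diff = np.abs(rad_z - rad_mu)
rel_diff = abs_diff / rad_mu
print("avg radius z / mu:", rad_z.mean(), rad_mu.mean())
print("max / avg relative difference:", rel_diff.max(), rel_diff.mean())
```

With strongly negative logvar values the z- and mu-based radii become practically indistinguishable, which is the pattern the three steps are meant to detect.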

Some values of logvar, (z – mu), z-radii and z-radius-deviations for a VAE with small KL-loss

Regarding the maximum value of the logvar’s vector-components I found

3.4 ≥ max(logvar) ≥ -3.2. # for 1 up to 3 components out of a total of 43.52 million components

The first value may appear to be big for a component. But typically there are only 2 (!) out of 170,000 x 256 = 43.52 million vector components in an interval of [-3, 5]. On the component level I found the following minimum, maximum and average-values

Maximum value for logvar:  -2.0205276
Minimum value for logvar:  -24.660698
Average value for logvar:  -13.682616

The average value of logvar is pretty pleasing: Such big negative values indeed render the logvar-impact on the position of our z-points negligible. So we should only find very small deviations of the mu-components from the z-point components. And, actually, the maximum of the deviation between a z_point component and a mu component was delta_mu_z = 0.26:
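Why such negative logvar values make the statistical term negligible becomes obvious when they are converted into the factor exp(logvar/2) which multiplies epsilon in the z-point formula. A short calculation with the values listed above:

```python
import math

# logvar values reported for this run
logvar_values = {
    "average": -13.682616,
    "maximum": -2.0205276,
    "minimum": -24.660698,
}

# exp(logvar / 2) is the factor that multiplies epsilon in the z-point formula
sigmas = {name: math.exp(lv / 2.0) for name, lv in logvar_values.items()}

for name, lv in logvar_values.items():
    print(f"{name:8s} logvar = {lv:11.6f} -> exp(logvar/2) = {sigmas[name]:.3e}")
```

For the average logvar the factor is of the order of 1.e-3, so a typical epsilon of order 1 shifts a coordinate only in the third decimal place. Only the rare components close to the maximum logvar reach factors of a few tenths.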

Maximum (z_points - mu) = delta_mu_z = 0.26  # on the component level 

There were only 5 out of the 43.52 million components which showed an absolute deviation in the interval

0.05 < abs(delta_mu_z) < 2. 

The rest was much, much smaller!

What about radius values? Here the situation looks, of course, even better:

max radius defined by z  :  33.10274
min radius defined by z  :  6.4961233
max radius defined by mu :  33.0972
min radius defined by mu :  6.494558

avg_z:      16.283989  
avg_mu:     16.283972

max absolute difference :  0.018045425 
avg absolute difference :  0.00094899256

max relative difference  :   0.00072147313
avg relative difference  :   6.1240215e-05

As expected, the relative deviations between z- and mu-based radius values became very small.

In another run (the one corresponding to the second density distribution curve above) I got the following values:

Maximum value for logvar:  3.3761873
Minimum value for logvar:  -22.777826
Average value for logvar:  -13.4265175

max radius z :  35.51387
min radius z :  7.209168
max radius mu :  35.515926
min radius mu :  7.2086616

avg_z:  17.37412
avg_mu: 17.374104

max delta rad relative :   0.012512478
avg delta rad relative :   6.5660715e-05

This tells us that the z-point distributions may vary a bit in their width, their precise center and average values. But overall they appear to be similar, especially with respect to the relatively negligible contribution of logvar-terms to the z-point position. The relative impact of logvar on the radius value of a z-point is of the order of 6.e-5 only.

All the above data confirm that a trained VAE with a very small KL-loss primarily uses mu-values to set the position of its z-points. During training the VAE moves along a path to an overall minimum on the loss hyperplane which leads to an area with weights that produce negligible logvar values.
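The measured averages also allow for a rough plausibility check of the logged KL-loss. The sketch assumes that the KL term has the standard closed form for a diagonal Gaussian against a unit Gaussian, summed over the z-components and multiplied by fact; the exact normalization in the classes of this series may differ. It further replaces all logvar components by their average and the sum of squared mu-components by the square of the average radius.

```python
z_dim = 256
fact = 1.0e-5                # KL weighting used for the test VAE
avg_logvar = -13.682616      # average logvar from the first run
avg_rad_mu = 16.283972       # average mu-based radius from the first run

# ASSUMED standard KL term per sample:
#   0.5 * sum_k (mu_k^2 + exp(logvar_k) - logvar_k - 1)
# Approximations: sum_k mu_k^2 ~ avg_rad_mu**2, exp(logvar_k) ~ 0,
# and every logvar_k replaced by the average value.
kl_raw = 0.5 * (avg_rad_mu**2 + z_dim * (-avg_logvar - 1.0))
kl_scaled = fact * kl_raw

print(f"unscaled KL ~ {kl_raw:.0f}, scaled by fact ~ {kl_scaled:.4f}")
```

Under these assumptions the estimate comes out at about 0.0176, close to the logged kl_loss of about 0.0178. The bulk of the value stems from the strongly negative logvar terms and not from the mu-values.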

Explanation of the overall similarity of a VAE with tiny KL-loss to an AE

So far we can summarize: Under normal conditions the VAE’s behavior is pretty close to that of a similar AE. The VAE produces only small logvar values. z-point coordinates are extremely close to just the mu-coordinates.

Can we find a plausible reason for this result? Looking at the cost-hyperplane with respect to the Encoder weights helps:

The cost surface of a VAE spans across a space of many more weight parameters than a corresponding AE. The reason is that we have weights for the connection to the logvar-layer in addition to the weights for the mu-layer (or a single output layer as in a corresponding AE). But if we look at the corner of the weight-vector-space where the logvar-related values are pretty small, then we would at least find a local (if not global) loss minimum there for the same values of the mu-related weight parameters as in the corresponding AE (with mu replacing the z-output).

So our question reduces to the closely related question whether the old minimum of an AE remains at least a local one when we shift to a VAE – and this is indeed the case for the basic reason that the KL-contributions to the height of the cost-hyperplane are negligibly small everywhere (!) – even for higher logvar-related values.

This tells us that a gradient descent algorithm should indeed be able to find a cost minimum for very small values of logvar-related weights and for weight-values related to the mu-layer very close to the AE’s weight-values for direct connections to its output layer. And, of course, with all other weight parameters of the VAE-Decoder being close to the values of the weights of a corresponding AE. At least under the condition that all variable quantities really change smoothly during training.

Does a VAE with small KL-loss produce reasonable face images?

A last test to confirm that a VAE with a very small KL-loss operates like a comparable AE is an attempt to create images with recognizable human faces from randomly chosen points in z-space. Such an attempt should fail! I just show you three results: one for a normal distribution of the z-point components, and two for equidistant distributions of the component values within the intervals given in the captions below:

z-point coordinates from normal distribution

z-point coordinates from equidistant distribution in [-2,2]

z-point coordinates from equidistant distribution in [-10,10]

This reminds us very much of the behavior of an AE. See: Autoencoders, latent space and the curse of high dimensionality – I.

The z-point distribution in latent space of a VAE with a very small KL-loss obviously is as complicated as that of an AE. Neighboring points of a z-point which leads to a good image produce chaotic images. The transition path from good z-points to other meaningful z-points is confined to a very small filament-like volume.
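For readers who want to repeat such a test, the random z-points can be generated along the following lines. This is only a sketch: it reads “equidistant distribution” as a uniform distribution of the component values over the given interval, and the Decoder call in the final comment uses a hypothetical attribute name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, z_dim = 1000, 256

# Variant 1: components drawn from a standard normal distribution
z_normal = rng.standard_normal(size=(n_points, z_dim))
# Variants 2 and 3: components drawn uniformly from [-a, a] (ASSUMED reading)
z_uni_2 = rng.uniform(-2.0, 2.0, size=(n_points, z_dim))
z_uni_10 = rng.uniform(-10.0, 10.0, size=(n_points, z_dim))

for name, z in [("normal", z_normal), ("[-2,2]", z_uni_2), ("[-10,10]", z_uni_10)]:
    rad = np.linalg.norm(z, axis=1)
    print(f"{name:9s} mean radius = {rad.mean():6.2f}, std = {rad.std():5.2f}")

# The vectors would then be fed to the trained Decoder, e.g. (hypothetical name):
# images = vae.decoder.predict(z_normal)
```

Printing the radii is useful because it shows at which distance from the origin each sampling variant places its points, which can be compared with the radial density curves shown earlier.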


A trained VAE with only a tiny KL-loss contribution will under normal circumstances behave similarly to an AE with the same hidden (convolutional) layers. It may, however, be necessary to limit the statistical variation of the epsilon factor in the z-point calculation based on mu- and logvar-values.

The similarity is based on very small logvar-values after training. The VAE creates a z-point distribution which shows the same dependency on the radius as an AE. We see similar indications and patterns of clustering. And the VAE fails to produce human faces from random z-points in the latent space, just like a comparable AE.

We have found a plausible reason for this similarity by comparing the minimum of the loss hyperplane in the weight-loss parameter space with a corresponding minimum in the weight-loss space of the VAE – at a position with small weights for the connection to the logvar layers.

The z-point density distribution shows a maximum at a radius between 16 and 17. The z-point distribution basically has a Gaussian form. In the next post we shall look a bit closer at these findings – and their origin in Gaussian distributions along the coordinate axes of the latent space. After an application of a PCA analysis we shall furthermore see that the z-point distribution in an AE’s latent vector space is indeed fragmented and shows filaments on certain length scales. A VAE with a tiny KL-loss will show the same fragmentation.

In further forthcoming posts we shall afterward investigate the confining and at the same time blurring impact of the KL-loss on the latent space. Which will make it usable for creative purposes. But the next post

Variational Autoencoder with Tensorflow – XIV – Change of variational model parameters at inference time

will first show you how to change model parameters at inference time.



A single neuron perceptron with sigmoid activation function – III – two ways of applying Normalizer

In this article series on a perceptron with only one computing neuron we saw that saturation effects of the sigmoid activation function can hamper gradient descent if input data on some features become too big and/or the initial weight distribution is not adapted to the number of input features. See:

A single neuron perceptron with sigmoid activation function – I – failure of gradient descent due to saturation

We can remedy the first point by applying a normalization transformation to the input data before starting gradient descent. I showed the positive result of such a transformation for our perceptron with a rather specific set of input data in the last article:

A single neuron perceptron with sigmoid activation function – II – normalization to overcome saturation

At that time we used the “StandardScaler” provided by Scikit-Learn. In this article we shall instead use an instance of the “Normalizer” class for scaling. With “Normalizer” you have to be a bit careful about how you use its interface. We shall apply “Normalizer” in two different ways. Besides having some fun with the outcome, we will also learn that the shape of the clusters in which the input samples are arranged in feature space should be taken into account before normalizing ahead of classification tasks. This may be difficult in multiple dimensions … but it brings us to the general idea of applying a method of cluster identification ahead of classification training with gradient descent.

How does the “Normalizer” work?

Let us offer a “Normalizer”-instance an input array “ay_in” with 2 rows and 4 columns per row. The shape of “ay_in” is (2,4). Let the first row “s1” have the elements s1=[4, 1, 2, 2]. Normalizer will then calculate an L2-norm value for the column data of this specific row as

L2([4, 1, 2, 2]) = sqrt(4**2 + 1**2 + 2**2 + 2**2) = 5
=> s1_trafo = [4/L2, 1/L2, 2/L2, 2/L2] = [0.8, 0.2, 0.4, 0.4].

I.e., all columns in one row are divided by one common value, namely the L2-norm of the column data of the sample. Note again: Each row is treated separately. So, an array such as

  [1, 3, 9, 3],
  [5, 7, 5, 1]

will be transformed to

  [0.1, 0.3, 0.9, 0.3],
  [0.5, 0.7, 0.5, 0.1]
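
The row-wise behavior described above can be checked directly. The following is just a minimal sketch, assuming that Scikit-Learn and NumPy are installed; the array name “ay_in” follows the text, the other names are only illustrative:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples (rows) with four columns each, as in the example above
ay_in = np.array([[1.0, 3.0, 9.0, 3.0],
                  [5.0, 7.0, 5.0, 1.0]])

# Normalizer uses the L2-norm by default and treats every row separately
ay_out = Normalizer().fit_transform(ay_in)

# The same result by hand: divide each row by its own L2-norm
ay_manual = ay_in / np.linalg.norm(ay_in, axis=1, keepdims=True)

print(ay_out)
```

Both rows have an L2-norm of 10, so the manual division and the Normalizer result should agree with the transformed array shown above.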

How can we make use of this for our perceptron samples?

Standard scaling per feature with Normalizer

A first idea is that we could scale the data of all samples for our perceptron separately per feature; i.e. we collect the data-values of all M samples for “feature 1” in an array and offer it as the first row of an array to Normalizer, plus a row with all the data values for “feature 2”, and so on.

If we had M samples and N features we would present an array with shape (N, M) to “Normalizer”. In our simple perceptron experiment this is equivalent to scaling data of an array where the two rows are defined by our K1 and K2-input arrays => ay_K = [ li_K1, li_K2 ].

What would the outcome of such a scaling be?

A constant factor per feature, determined by the L2-norm of all samples’ values for the chosen feature, brings all values safely down into the interval [-1, 1]. But this also means that the largest absolute values of a specific feature dominate the scale.

So called “outliers”, i.e. samples whose values are far away from the average values of the samples, would then have a major impact. So “Normalizer”-scaling is especially helpful if the values per feature are limited for reasons of principle. Note that this is e.g. the case with RGB-color or gray-scale values! Note also that the possible impact of outliers is relevant for other scalers, too, e.g. the “MinMaxScaler” of Scikit-Learn.

Although the scaling factors will be different per feature I would like to point out another aspect of scaling by a constant factor per feature over all samples: Such a transformation keeps up at least some structural similarity of the sample distribution in the feature space.

Scaling features per sample with Normalizer (?)

A different way of applying “Normalizer” would be to use the transposed array “ay_K.T” as input: For M samples and N features we would then present an array with shape (M, N) to a Normalizer instance. Its algorithm would then scale across the features of each sample. If we interpret a specific sample as a vector in the feature space then the L2-norm corresponds naturally to the length of this vector. Meaning: Normalizer would scale each sample by its vector length.
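
The difference between the two input shapes can be made explicit with a few lines of NumPy. This is only a sketch with three made-up samples; the helper “l2_rows” is a hypothetical stand-in which mimics what Normalizer does by default (division of each row by its L2-norm):

```python
import numpy as np

# Three hypothetical samples with two features (K1, K2) each
li_K1 = np.array([3.0, 6.0, 1.0])
li_K2 = np.array([4.0, 8.0, 5.0])

def l2_rows(ay):
    """Divide every row by its L2-norm (Normalizer's default behavior)."""
    return ay / np.linalg.norm(ay, axis=1, keepdims=True)

# Per feature: shape (N, M) = (2, 3); one common factor per feature
ay_K = np.vstack((li_K1, li_K2))
per_feature = l2_rows(ay_K)

# Per sample: shape (M, N) = (3, 2); every sample is scaled to length 1
per_sample = l2_rows(ay_K.T)

print(per_feature.shape, per_sample.shape)
print(np.linalg.norm(per_sample, axis=1))
```

In the first case the relative positions of the samples along each feature axis are preserved; in the second case every sample ends up at distance 1 from the origin.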

Two questions before experimenting

The two possible application methods for Normalizer lead directly to two questions for our simple test setup in a 2-dim feature space:

  • What will lines of equal cost values (i.e. cost or loss contours) for our sigmoid-based loss function look like in the {K1, K2}-space after scaling a bunch of M (K1, K2)-datapoints with Normalizer per feature? I.e., if and when should we present an array of feature values with shape (N,M)?
  • What would happen instead if we scaled each input sample individually across its features? I.e., what happens in a situation with M samples and N features when we feed “Normalizer” with an array (of the same feature values) which has a shape (M,N) instead of (N,M)?

I guess a “natural talent” with numbers like Mr Trump could give the answers without hesitation 🙂 . As we certainly are below the standards of the “genius” Mr Trump (his own words on multiple occasions) we shall pick the answers from the plots below before we even try a deeper reasoning.

Application of “Normalizer” separately to the feature data of all batch samples

As you remember from the first article of this series our input batch contained samples (K1, K2) with values for K1 and K2 given by two 1-dim arrays :

li_K1 = [200.0,   1.0, 160.0,  11.0, 220.0,  11.0, 120.0,  14.0, 195.0,  15.0, 130.0,   5.0, 185.0,  16.0]
li_K2 = [ 14.0, 107.0,  10.0, 193.0,  32.0, 178.0,   2.0, 210.0,  12.0, 134.0,   7.0, 167.0,  10.0, 229.0]

The standard scaling application of Normalizer can be coded explicitly as (see the code given in the last article):

    rg_idx = range(num_samples)
    if scale_method == 0:      
        shape_input = (2, num_samples)
        ay_K = np.zeros(shape_input)
        for idx in rg_idx:
            ay_K[0][idx] = li_K1[idx] 
            ay_K[1][idx] = li_K2[idx] 
        scaler = Normalizer()
        ay_K = scaler.fit_transform(ay_K)
        for idx in rg_idx:
            ay_K1[idx] = ay_K[0][idx]   
            ay_K2[idx] = ay_K[1][idx]
        scaling_fact_K1 = ay_K1[0] / li_K1[0]
        scaling_fact_K2 = ay_K2[0] / li_K2[0]

However, a much faster form, which avoids the explicit Python loop, is given by:

scaler = Normalizer()
ay_K = np.vstack( (li_K1, li_K2) )
ay_K = scaler.fit_transform(ay_K)
ay_K1, ay_K2 = ay_K
scaling_fact_K1 = ay_K1[0] / li_K1[0]
scaling_fact_K2 = ay_K2[0] / li_K2[0]

Here OpenBlas helps 🙂 .
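
The two scaling factors are nothing but the inverse L2-norms of the feature rows. A minimal sketch which computes them directly with NumPy; the sample values are taken from the K1- and K2-columns of the result tables in this article, the new data point is purely hypothetical:

```python
import numpy as np

li_K1 = np.array([200.0, 1.0, 160.0, 11.0, 220.0, 11.0, 120.0,
                  14.0, 195.0, 15.0, 130.0, 5.0, 185.0, 16.0])
li_K2 = np.array([14.0, 107.0, 10.0, 193.0, 32.0, 178.0, 2.0,
                  210.0, 12.0, 134.0, 7.0, 167.0, 10.0, 229.0])

# The factor per feature is the inverse L2-norm over all samples
scaling_fact_K1 = 1.0 / np.linalg.norm(li_K1)
scaling_fact_K2 = 1.0 / np.linalg.norm(li_K2)

# Hypothetical new data point, scaled with the stored factors
K1_new, K2_new = 150.0, 20.0
k1_new = scaling_fact_K1 * K1_new
k2_new = scaling_fact_K2 * K2_new

print(scaling_fact_K1, scaling_fact_K2)
print(k1_new, k2_new)
```

Keeping the factors in variables of our own is what allows us to transform mesh points or new samples later on in exactly the same way as the training data.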

In contrast to other scalers we need to save and keep the factors by which we transform the various feature data somewhere by ourselves. (“Normalizer” calculates a different factor for each row it gets, i.e. here for each feature, and it is stateless: it does not store these factors.) So, we change our Jupyter cell code for scaling to:

# ********
# Scaling
# ********

b_scale = True
scale_method = 0
# 0: Normalizer (standard), 1: StandardScaler, 2: By factor, 3: Normalizer per pair
# 4: Min_Max, 5: Identity (no transformation) - just there for convenience

shape_ay = (num_samples,)
ay_K1 = np.zeros(shape_ay)
ay_K2 = np.zeros(shape_ay)

# apply scaling
if b_scale:
    # shape_input = (num_samples,2)
    rg_idx = range(num_samples)
    if scale_method == 0:      
        ay_K = np.vstack( (li_K1, li_K2) )
        print("ay_k.shape = ", ay_K.shape)
        scaler = Normalizer()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K
        scaling_fact_K1 = ay_K1[0] / li_K1[0]
        scaling_fact_K2 = ay_K2[0] / li_K2[0]
        print("\nay_K1 = \n", ay_K1)
        print("\nay_K2 = \n", ay_K2)
        print("\nscaling_fact_K1: ", scaling_fact_K1, ", scaling_fact_K2: ", scaling_fact_K2)
    elif scale_method == 1: 
        ay_K = np.column_stack((li_K1, li_K2))
        scaler = StandardScaler()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K.T    
    elif scale_method == 2:
        dmax = max(li_K1.max() - li_K1.min(), li_K2.max() - li_K2.min())
        ay_K1 = 1.0/dmax * li_K1
        ay_K2 = 1.0/dmax * li_K2
        scaling_fact_K1 = ay_K1[0] / li_K1[0]
        scaling_fact_K2 = ay_K2[0] / li_K2[0]
    elif scale_method == 3:
        ay_K = np.column_stack((li_K1, li_K2))
        scaler = Normalizer()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K.T    
    elif scale_method == 4:
        ay_K = np.column_stack((li_K1, li_K2))
        scaler = MinMaxScaler()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K.T    
    elif scale_method == 5:
        ay_K1 = li_K1
        ay_K2 = li_K2
# Get overview over costs on weight-mesh
#wm1 = np.arange(-5.0,5.0,0.002)
#wm2 = np.arange(-5.0,5.0,0.002)
wm1 = np.arange(-5.5,5.5,0.002)
wm2 = np.arange(-5.5,5.5,0.002)
W1, W2 = np.meshgrid(wm1, wm2) 
C, li_C_sgl = costs_mesh(num_samples = num_samples, W1=W1, W2=W2, li_K1 = ay_K1, li_K2 = ay_K2, \
                               li_a_tgt = li_a_tgt)

C_min = np.amin(C)
print("\nC_min = ", C_min)
IDX = np.argwhere(C==C_min)
print ("Coordinates: ", IDX)
# print(IDX.shape)
# print(IDX[0][0])
wmin1 = W1[IDX[0][0]][IDX[0][1]] 
wmin2 = W2[IDX[0][0]][IDX[0][1]]
print("Weight values at cost minimum:",  wmin1, wmin2)

# Plots
# ******
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 16; fig_size[1] = 16

fig3 = plt.figure(3); fig4 = plt.figure(4)

# Note: requires "from mpl_toolkits.mplot3d import Axes3D"
ax3 = fig3.add_subplot(projection='3d')
ax3.get_proj = lambda: np.dot(Axes3D.get_proj(ax3), np.diag([1.0, 1.0, 1, 1]))
ax3.set_xlabel('w1', fontsize=16)
ax3.set_ylabel('w2', fontsize=16)
ax3.set_zlabel('Total costs', fontsize=16)
ax3.plot_wireframe(W1, W2, 1.2*C, colors=('green'))

ax4 = fig4.add_subplot(projection='3d')
ax4.get_proj = lambda: np.dot(Axes3D.get_proj(ax4), np.diag([1.0, 1.0, 1, 1]))
ax4.set_xlabel('w1', fontsize=16)
ax4.set_ylabel('w2', fontsize=16)
ax4.set_zlabel('Single costs', fontsize=16)
ax4.plot_wireframe(W1, W2, li_C_sgl[0], colors=('blue'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[1], colors=('red'))
ax4.plot_wireframe(W1, W2, li_C_sgl[5], colors=('orange'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[6], colors=('yellow'))
ax4.plot_wireframe(W1, W2, li_C_sgl[9], colors=('magenta'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[12], colors=('green'))


Ok, let’s apply the “Normalizer” to our input samples. We get:

ay_K1 = 
 [0.42786745 0.00213934 0.34229396 0.02353271 0.47065419 0.02353271
 0.25672047 0.02995072 0.41717076 0.03209006 0.27811384 0.01069669
 0.39577739 0.0342294 ]

ay_K2 = 
 [0.02955501 0.22588473 0.02111072 0.40743694 0.06755431 0.37577085
 0.00422214 0.44332516 0.02533287 0.28288368 0.01477751 0.35254906
 0.02111072 0.48343554]

scaling_fact_K1:  0.0021393372268854655 , scaling_fact_K2:  0.0021110722130092533

What do the transformed data points look like in the {K1, K2}-feature-space? See the plot:

Structurally very similar to the original, but with values reduced to the interval [0,1]. This was to be expected.

The cost hyperplane for the data normalized “per feature”

After the transformation of the sample data the cost hyperplane over the {w1, w2}-space looks as follows:

We see a clear minimum; it does, however, not appear as pronounced as for the StandardScaler, which we applied in the last article.

But: There are no side valleys with small gradients at the end of the steep slope area. This means that a path into a minimum will probably look a bit different compared to a path on the hyperplane we got with the “StandardScaler”.

Our mesh in the {w1, w2}-space indicates the following position of the minimum:

C_min =  0.0006350159045771724
Coordinates:  [[3949 1542]]
Weight values at cost minimum: -2.4160000000003397 2.39799999999913
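
The function costs_mesh() stems from the earlier articles and is not repeated here. The search for the minimum on the mesh can, however, be sketched in a self-contained way. Please note the assumptions: a quadratic cost averaged over the batch, a much coarser mesh than above, and sample values taken from the K1- and K2-columns of the result tables in this article:

```python
import numpy as np

K1 = np.array([200.0, 1.0, 160.0, 11.0, 220.0, 11.0, 120.0,
               14.0, 195.0, 15.0, 130.0, 5.0, 185.0, 16.0])
K2 = np.array([14.0, 107.0, 10.0, 193.0, 32.0, 178.0, 2.0,
               210.0, 12.0, 134.0, 7.0, 167.0, 10.0, 229.0])
a_tgt = np.where(K1 > K2, 0.3, 0.7)   # target outputs per sample

# Normalizer per feature: one inverse L2-norm per feature
k1 = K1 / np.linalg.norm(K1)
k2 = K2 / np.linalg.norm(K2)

# Mesh over the weights; much coarser than the step of 0.002 used above
wm = np.arange(-5.5, 5.5, 0.05)
W1, W2 = np.meshgrid(wm, wm)

# ASSUMPTION: quadratic cost, averaged over the batch
Z = W1[..., None] * k1 + W2[..., None] * k2   # shape (len(wm), len(wm), 14)
A = 1.0 / (1.0 + np.exp(-Z))                  # sigmoid output per mesh point and sample
C = 0.5 * np.mean((A - a_tgt) ** 2, axis=-1)

# Position of the minimum on the mesh
iy, ix = np.unravel_index(np.argmin(C), C.shape)
wmin1, wmin2 = W1[iy, ix], W2[iy, ix]
print("C_min =", C[iy, ix], " at (w1, w2) =", (wmin1, wmin2))
```

np.unravel_index() replaces the combination of np.amin() and np.argwhere() used in the cell above; it delivers the 2-dim index of the first minimum directly.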

Gradient descent results after normalization per feature with “Normalizer”

With our gradient descent method and the following run-parameters

w1_start = -0.20, w2_start = 0.25, eta = 0.2, decrease_rate = 0.00000001, num_steps = 2500

we get the following result of a run which explores both stochastic and batch gradient descent:

Stochastic Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.427867  0.029555  200.0   14.0  0.3  0.276365  0.078783
1   0.002139  0.225885    1.0  107.0  0.7  0.630971  0.098613
2   0.342294  0.021111  160.0   10.0  0.3  0.315156  0.050519
3   0.023533  0.407437   11.0  193.0  0.7  0.715038  0.021483
4   0.470654  0.067554  220.0   32.0  0.3  0.273924  0.086920
5   0.023533  0.375771   11.0  178.0  0.7  0.699320  0.000971
6   0.256720  0.004222  120.0    2.0  0.3  0.352075  0.173584
7   0.029951  0.443325   14.0  210.0  0.7  0.729191  0.041701
8   0.417171  0.025333  195.0   12.0  0.3  0.279519  0.068271
9   0.032090  0.282884   15.0  134.0  0.7  0.645816  0.077405
10  0.278114  0.014778  130.0    7.0  0.3  0.346085  0.153615
11  0.010697  0.352549    5.0  167.0  0.7  0.694107  0.008418
12  0.395777  0.021111  185.0   10.0  0.3  0.287962  0.040126
13  0.034229  0.483436   16.0  229.0  0.7  0.745803  0.065432

Batch Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.427867  0.029555  200.0   14.0  0.3  0.276360  0.078799
1   0.002139  0.225885    1.0  107.0  0.7  0.630976  0.098606
2   0.342294  0.021111  160.0   10.0  0.3  0.315152  0.050505
3   0.023533  0.407437   11.0  193.0  0.7  0.715045  0.021493
4   0.470654  0.067554  220.0   32.0  0.3  0.273919  0.086935
5   0.023533  0.375771   11.0  178.0  0.7  0.699326  0.000962
6   0.256720  0.004222  120.0    2.0  0.3  0.352072  0.173572
7   0.029951  0.443325   14.0  210.0  0.7  0.729198  0.041711
8   0.417171  0.025333  195.0   12.0  0.3  0.279514  0.068287
9   0.032090  0.282884   15.0  134.0  0.7  0.645821  0.077398
10  0.278114  0.014778  130.0    7.0  0.3  0.346081  0.153603
11  0.010697  0.352549    5.0  167.0  0.7  0.694113  0.008410
12  0.395777  0.021111  185.0   10.0  0.3  0.287957  0.040142
13  0.034229  0.483436   16.0  229.0  0.7  0.745810  0.065443

Total error stoch descent:  0.06898872490256348
Total error batch descent:  0.06899042421795792

Good! Seemingly we got some convergence in both cases. The overall “accuracy” achieved on the training set is even a bit better than for the “StandardScaler”. And:

Final (w1,w2)-values stoch : ( -2.4151 ,  2.3977 )
Final (w1,w2)-values batch : ( -2.4153 ,  2.3976 )

This agrees very well with the data we got from our mesh analysis of the cost hyperplane!
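
For readers who do not have the code of the earlier articles at hand: A plain batch gradient descent for our single neuron can be sketched as follows. This is not the code used for the runs above, just a minimal version under assumptions: a quadratic cost summed over the batch, a constant learning rate (no decrease rate) and sample values taken from the K1- and K2-columns of the result tables:

```python
import numpy as np

K1 = np.array([200.0, 1.0, 160.0, 11.0, 220.0, 11.0, 120.0,
               14.0, 195.0, 15.0, 130.0, 5.0, 185.0, 16.0])
K2 = np.array([14.0, 107.0, 10.0, 193.0, 32.0, 178.0, 2.0,
               210.0, 12.0, 134.0, 7.0, 167.0, 10.0, 229.0])
a_tgt = np.where(K1 > K2, 0.3, 0.7)

# Normalizer per feature
k1 = K1 / np.linalg.norm(K1)
k2 = K2 / np.linalg.norm(K2)

def expit(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

w1, w2 = -0.20, 0.25          # start values as in the run above
eta, num_steps = 0.2, 2500    # learning rate and number of steps as above

for _ in range(num_steps):
    a = expit(w1 * k1 + w2 * k2)
    # ASSUMPTION: cost 0.5 * sum((a - a_tgt)**2); delta is dC/dz per sample
    delta = (a - a_tgt) * a * (1.0 - a)
    w1 -= eta * np.sum(delta * k1)
    w2 -= eta * np.sum(delta * k2)

a = expit(w1 * k1 + w2 * k2)
mean_rel_err = np.mean(np.abs(a - a_tgt) / a_tgt)
print("Final (w1, w2):", w1, w2, " mean relative error:", mean_rel_err)
```

Under these assumptions the weights should end up close to the minimum found on the mesh.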

Regarding the evolution of the costs and the weights we see a slightly different picture than with the “StandardScaler”:

Cost and weight evolution during stochastic gradient descent


Cost and weight evolution during batch gradient descent

From the evolution of the weight parameters we can assume that gradient descent moved along a direct path into the cost minimum. This fits the different shape of the cost hyperplane in comparison with the hyperplane we got after the application of the “StandardScaler”.

Predicted contour and separation lines in the {K1, K2}-plane after feature-scaling with “Normalizer”

We compute the contour lines of the output A of our solitary neuron (see article 1 of this series) with the following code:

# ***********
# Contours 
# ***********
from matplotlib import ticker, cm

# Take the final w1/w2-values of the batch run from above => w1f, w2f
w1f = li_w1_ba[-1]
w2f = li_w2_ba[-1]

def A_mesh(w1,w2, Km1, Km2):
    kshape = Km1.shape
    A = np.zeros(kshape) 
    Km1V = Km1.reshape(kshape[0]*kshape[1], )
    Km2V = Km2.reshape(kshape[0]*kshape[1], )
    print("km1V.shape = ", Km1V.shape, "\nkm2V.shape = ", Km2V.shape )
    # scaling trafo
    if scale_method == 0: 
        # Normalizer per feature: apply the saved factors by hand
        Km1V = scaling_fact_K1 * Km1V
        Km2V = scaling_fact_K2 * Km2V
        KmV = np.vstack( (Km1V, Km2V) )
        KmT = KmV.T
    else: 
        # other scalers: let the scaler object transform the mesh points
        KmV = np.column_stack((Km1V, Km2V))
        KmT = scaler.transform(KmV)
    Km1T, Km2T = KmT.T
    Km1TR = Km1T.reshape(kshape)
    Km2TR = Km2T.reshape(kshape)
    print("km1TR.shape = ", Km1TR.shape, "\nkm2TR.shape = ", Km2TR.shape )
    rg_idx = range(num_samples)
    Z      = w1 * Km1TR + w2 * Km2TR
    A = expit(Z)
    return A

#Build K1/K2-mesh 
minK1, maxK1 = li_K1.min()-20, li_K1.max()+20 
minK2, maxK2 = li_K2.min()-20, li_K2.max()+20
resolution = 0.1
Km1, Km2 = np.meshgrid( np.arange(minK1, maxK1, resolution), 
                        np.arange(minK2, maxK2, resolution))

A = A_mesh(w1f, w2f, Km1, Km2 )
print("A.shape = ", A.shape)

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 11
fig, ax = plt.subplots()
#cs = plt.contourf(X, Y, Z1, levels=25, alpha=1.0, cmap=cm.PuBu_r)
cs = ax.contourf(Km1, Km2, A, levels=25, alpha=1.0, cmap=cmap)
cbar = fig.colorbar(cs)
N = 14
r0 = 0.6
x = li_K1
y = li_K2
area = 6*np.sqrt(x ** 2 + y ** 2)  # 0 to 10 point radii
c = np.sqrt(area)
r = np.sqrt(x ** 2 + y ** 2)
area1 = np.ma.masked_where(r < 100, area)
area2 = np.ma.masked_where(r >= 100, area)
ax.scatter(x, y, s=area1, marker='^', c=c)
ax.scatter(x, y, s=area2, marker='o', c=c)
# Show the boundary between the regions:
ax.set_xlabel("K1", fontsize=16)
ax.set_ylabel("K2", fontsize=16)


Please note the differences in how we handle the creation of the array “KmT” with the transformed data for “scale_method=0”, i.e. “Normalizer”, in comparison to other methods.

Here is the result:

Looks very similar to our plot for the StandardScaler in the last article – but with a slight shift on the K1-axis. So, the answer to our first question is: The contour lines are straight diagonal lines!

This is a direct result of the equations

expit(z) = E_z = const. => z = const. => w1*f1*K1 + w2*f2*K2 = C_z =>
K2 = C_k + fact*K1, with C_k = C_z/(w2*f2) and fact = -(w1*f1)/(w2*f2)

Here f1 and f2 are the scaling factors for K1 and K2. The last equation is nothing but an equation for a straight line. As “fact” is a constant, the angle α with the K1-axis remains the same for different E_z and C_k, i.e. we get parallel lines. If fact = -(w1*f1)/(w2*f2) ≈ 1 = tan(α), we get an angle α of almost 45°. Let us see in our case: w1 = -2.4151, w2 = 2.3977, f1 = 0.00214, f2 = 0.00211 => fact ≈ 1.0216, i.e. α ≈ 45.6°. This explains our plot.
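
The numbers can be re-calculated with a few lines of Python. A small sketch; the weights are the final values of the stochastic run above and the factors are the rounded scaling factors:

```python
import math

w1, w2 = -2.4151, 2.3977     # final weights from the stochastic run
f1, f2 = 0.00214, 0.00211    # rounded scaling factors per feature

# Slope of the contour lines in the {K1, K2}-plane: dK2/dK1 = -(w1*f1)/(w2*f2)
fact = -(w1 * f1) / (w2 * f2)
alpha_deg = math.degrees(math.atan(fact))

print(round(fact, 4), round(alpha_deg, 2))
```

A slope slightly above 1 corresponds to an angle slightly above 45° with the K1-axis.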

“Normalizer” used per sample

Now we scale the (K1, K2) coordinates in feature space of each single sample with the Normalizer. I.e. we scale K1 and K2 for each individual sample by a common factor 1/sqrt(K1**2 + K2**2). Meaning: No scaling with a common factor per feature over all samples; instead scaling of the features per sample. As already said: If we regard K1 and K2 as coordinates of a vector then we move each vector’s end point radially towards the origin of the coordinate system until the vector has a length of 1.

Thus: After this normalization transformation we expect that our points are located on a unit circle! Note, however, that our transformation keeps up the angular distance of all data points. By “angular distance” for two selected points we mean the difference of the angles of these data points with e.g. the K1-axis.
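
Both statements, the unit circle and the conserved angles, can be verified numerically. A minimal sketch for four of our samples (values from the result tables of this article); the division by the row-wise L2-norm mimics what Normalizer does for an array of shape (M, N):

```python
import numpy as np

# Four of our samples, one (K1, K2)-pair per row
K = np.array([[200.0, 14.0],
              [1.0, 107.0],
              [220.0, 32.0],
              [16.0, 229.0]])

# Per-sample scaling: divide every row by its vector length
r = np.linalg.norm(K, axis=1, keepdims=True)
k = K / r

# Angles with the K1-axis before and after the transformation
angles_before = np.arctan2(K[:, 1], K[:, 0])
angles_after = np.arctan2(k[:, 1], k[:, 0])

print(np.linalg.norm(k, axis=1))                 # lengths after scaling
print(np.degrees(angles_before), np.degrees(angles_after))
```

All lengths become 1 while the angles remain what they were.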

Let us look at the transformed sample points in the {K1, K2}-plane:

Ok, our transformation has done a more pronounced “clustering” for us. Our transformed clusters are even more clearly separated from each other than before!

What does this mean for our cost hyperplane in the {w1, w2}-space? Well, here is a mesh-plot:

Cost hyperplane of the data scaled per sample by “Normalizer” in the {w1, w2}-space

According to our mesh the minimum is located at:

C_min =  2.2726781812937556e-05
Coordinates:  [[3200 2296]]
Weight values at cost minimum: -0.9080000000005057 0.8999999999992951

Comparison of the cost hyperplane with center of the original hyperplane for the unscaled batch data

Now comes a really funny point: Do you remember that we have seen a similar plot before? Actually, we did when we looked at the tiny surroundings of the center of the cost hyperplane of the original unscaled data in the first article of this series:

Cost hyperplane at the center of the original unscaled input data in the {w1, w2}-space?

A somewhat different viewing angle – but the similarity is obvious. Note however the very different scales of the (w1, w2)-values compared to the version of the scaled data.

How do we explain this similarity? Part of the answer lies in the fact that the total costs of the batch are dominated by those samples which have the biggest coordinate values, i.e. by those points where either K1 or K2 is biggest. These points were very close to each other in the original data set. For such points a centric stretch by a factor of around 1/200 would require a centric stretch (but now an expansion!) of the (w1, w2)-data with the reciprocal factor if we wanted to reproduce the same cost values. Reason: Linear coupling w1*K1+w2*K2! You compensate a constant factor in the {K1, K2}-space by its reciprocal in the {w1, w2}-space!
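
The compensation argument is easy to check. In the following sketch the weight values are hypothetical and only chosen for illustration; the factor c = 1/200 corresponds to the rough stretch factor named above:

```python
import numpy as np

def expit(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

# Two of our unscaled samples, one (K1, K2)-pair per row
K = np.array([[200.0, 14.0],
              [11.0, 193.0]])

w = np.array([-0.005, 0.004])   # hypothetical weights for the unscaled data
c = 1.0 / 200.0                 # common stretch factor in the {K1, K2}-space

a_orig = expit(K @ w)                 # output for the original data
a_scaled = expit((c * K) @ (w / c))   # scaled data, reciprocally scaled weights

print(a_orig, a_scaled)
```

The outputs, and therefore the costs, are identical because only the product of weights and inputs enters the sigmoid function.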

But that is more or less what we have done by our somewhat strange application of the “Normalizer”! At least almost … Fun, isn’t it?

Gradient descent after sample-wise (!) normalization by the “Normalizer”

The clearer separation of the clusters in the {K1, K2}-space after normalization and a well formed cost hyperplane over the {w1, w2}-space should help us a bit with our gradient descent. We set the parameters of a gradient descent run to

w1_start = -0.20, w2_start = 0.25, eta = 0.2, decrease_rate = 0.00000001, num_steps = 1000

and get:

Stochastic Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.997559  0.069829  200.0   14.0  0.3  0.300715  0.002383
1   0.009345  0.999956    1.0  107.0  0.7  0.709386  0.013408
2   0.998053  0.062378  160.0   10.0  0.3  0.299211  0.002629
3   0.056902  0.998380   11.0  193.0  0.7  0.700095  0.000136
4   0.989586  0.143940  220.0   32.0  0.3  0.316505  0.055018
5   0.061680  0.998096   11.0  178.0  0.7  0.699129  0.001244
6   0.999861  0.016664  120.0    2.0  0.3  0.290309  0.032305
7   0.066519  0.997785   14.0  210.0  0.7  0.698144  0.002652
8   0.998112  0.061422  195.0   12.0  0.3  0.299019  0.003269
9   0.111245  0.993793   15.0  134.0  0.7  0.688737  0.016090
10  0.998553  0.053768  130.0    7.0  0.3  0.297492  0.008360
11  0.029927  0.999552    5.0  167.0  0.7  0.705438  0.007769
12  0.998542  0.053975  185.0   10.0  0.3  0.297533  0.008223
13  0.069699  0.997568   16.0  229.0  0.7  0.697493  0.003581

Batch Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.997559  0.069829  200.0   14.0  0.3  0.300723  0.002409
1   0.009345  0.999956    1.0  107.0  0.7  0.709388  0.013411
2   0.998053  0.062378  160.0   10.0  0.3  0.299219  0.002604
3   0.056902  0.998380   11.0  193.0  0.7  0.700097  0.000139
4   0.989586  0.143940  220.0   32.0  0.3  0.316513  0.055044
5   0.061680  0.998096   11.0  178.0  0.7  0.699131  0.001241
6   0.999861  0.016664  120.0    2.0  0.3  0.290316  0.032280
7   0.066519  0.997785   14.0  210.0  0.7  0.698146  0.002649
8   0.998112  0.061422  195.0   12.0  0.3  0.299027  0.003244
9   0.111245  0.993793   15.0  134.0  0.7  0.688739  0.016087
10  0.998553  0.053768  130.0    7.0  0.3  0.297500  0.008335
11  0.029927  0.999552    5.0  167.0  0.7  0.705440  0.007771
12  0.998542  0.053975  185.0   10.0  0.3  0.297541  0.008198
13  0.069699  0.997568   16.0  229.0  0.7  0.697495  0.003578

Total error stoch descent:  0.011219103621660675
Total error batch descent:  0.01121352661948904         

Well, this is an almost perfect result on the training set: The mean relative deviation from the target output values is just about 1.1%, most samples deviate by less than 2%, and the worst sample (index 4) by about 5.5%. We have obviously found something new! Before, we always had deviations up to 15% or even 20% in the prediction for some of the data samples in our training set.

The final values of the weights become:

Final (w1,w2)-values stoch : ( -0.9093 ,  0.9009 )
Final (w1,w2)-values batch : ( -0.9090 ,  0.9009 )

Also a very good match with the position of the minimum found on the mesh. You should not forget – we worked with just 14 samples and 1 neuron.

The evolution data look like:

Cost and weight evolution during stochastic gradient descent


Cost and weight evolution during batch gradient descent

Smooth development; fast convergence!

Separation lines in the {K1, K2}-space after “per sample”-normalization with “Normalizer”

Now we turn to the answer to the second question we asked above: What changes regarding the separation or contour lines of the output values of our solitary neuron? Well as in our last article, we are interested in the output of our neuron after the normalization transformation of the data. I.e. we are on the search for contour lines, which we get for those points in the original {K1, K2}-space for which the sigmoid function produces a constant after transformation.

Here is the plot:

Ooops, now we get a real difference. The contour curves are straight lines, but now directed radially outwards from the origin into the {K1, K2}-space! You see in addition that most of the data points are located very close to the lines for the set values A=0.3 and A=0.7!

We also get a very clear separation line close to the diagonal at 45°. A few comments on this finding:

The subdivision of the {K1, K2}-plane into sectors is very appropriate for clusters with data which show a tendency towards a constant ratio between the K1 and K2 values, or for clusters with a narrow extension in both directions. Note, however, that if we had two clusters at different radial distances but at roughly the same angle, our present Normalizer transformation per sample would not have been helpful but disastrous regarding separation. So: The application of special normalization procedures ahead of classification training requires a feeling for or insight into the clustering structure in the feature space.
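
The disastrous case is easily constructed. A sketch with two purely hypothetical clusters at the same angles but at very different distances from the origin:

```python
import numpy as np

# Hypothetical cluster close to the origin, one (K1, K2)-pair per row
cluster_near = np.array([[10.0, 10.5],
                         [11.0, 10.8],
                         [9.5, 9.9]])
# Second cluster: same angles, but ten times the radial distance
cluster_far = 10.0 * cluster_near

def scale_per_sample(ay):
    """Divide every sample (row) by its vector length."""
    return ay / np.linalg.norm(ay, axis=1, keepdims=True)

k_near = scale_per_sample(cluster_near)
k_far = scale_per_sample(cluster_far)

# After the transformation the two clusters lie on top of each other
print(np.allclose(k_near, k_far))
```

The radial information which separated the clusters is gone after the transformation; no classifier could distinguish them afterwards.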

Why radial contour lines?

What are the contour lines in the original {K1, K2}-space which produce the same output A for the transformed data? If we name the transformed (K1, K2) values (k1, k2) we get in our case

k1 = K1/sqrt(K1**2 + K2**2)
k2 = K2/sqrt(K1**2 + K2**2)

So, we are looking for points in the {K1, K2}-space for which the equation

expit(w1*k1 + w2*k2) = const.

holds. We now have to show that this is fulfilled for lines that have the property K2/K1 = tan(alpha) with alpha = const. The proof is a small algebraic exercise, which I leave to my readers. Of course a genius like Mr Trump would give a direct answer based on the transformation properties themselves: We just eliminated the radial distance to the origin as a feature! I leave it up to you which way of reasoning you want to go.
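
Those who prefer a numerical check to algebra may try the following sketch. It places some points along one ray through the origin (the angle of 30° and the radii are arbitrary choices) and uses the final weights of the stochastic run above:

```python
import numpy as np

w1, w2 = -0.9093, 0.9009      # final weights of the stochastic run above

# Points at different radii along one ray with a fixed angle alpha
alpha = np.deg2rad(30.0)
radii = np.array([1.0, 10.0, 100.0, 250.0])
K1 = radii * np.cos(alpha)
K2 = radii * np.sin(alpha)

# Per-sample normalization
r = np.sqrt(K1**2 + K2**2)
k1, k2 = K1 / r, K2 / r

# Output of the neuron for the transformed points
A = 1.0 / (1.0 + np.exp(-(w1 * k1 + w2 * k2)))

# Expected value: it depends on the angle only
A_alpha = 1.0 / (1.0 + np.exp(-(w1 * np.cos(alpha) + w2 * np.sin(alpha))))
print(A, A_alpha)
```

The output is the same for all radii, because k1 = cos(alpha) and k2 = sin(alpha) do not depend on the distance from the origin.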

Clustering ahead of gradient descent?

Our very specific way of using the “Normalizer” has led us to a clearer clustering after the scaling transformation. This gives rise to a fundamental idea:

What if you could use some method to detect clusters in the distribution of datapoints in feature space ahead of gradient descent?

But, on the basis of what input or feature data then? Well, we could use some norm (such as L2) to describe the distance of the data points from the centers of the different identified clusters as the new features! If we knew the centers of the clusters such an approach could have a potential advantage: It would set the number of the new features to the number of the identified clusters. And this number could be substantially smaller than the number of original features. Why? Because in general not all features may be independent of each other and not all may be of major importance for the classification and cluster membership.
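
To make the idea a bit more concrete, here is a sketch with made-up data. All values are hypothetical, and the cluster centers are simply assumed to be known (here: the means of two obvious groups); in a real application they would have to come from some cluster detection method:

```python
import numpy as np

# Hypothetical data: 6 samples with 4 original features
X = np.array([[1.0, 0.9, 0.1, 0.0],
              [1.1, 1.0, 0.0, 0.1],
              [0.9, 1.1, 0.1, 0.1],
              [0.0, 0.1, 1.0, 1.1],
              [0.1, 0.0, 0.9, 1.0],
              [0.1, 0.1, 1.1, 0.9]])

# ASSUMPTION: two known cluster centers, here the means of the two groups
centers = np.array([X[:3].mean(axis=0), X[3:].mean(axis=0)])

# New features: L2-distance of every sample to every cluster center
D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

nearest = np.argmin(D, axis=1)   # index of the closest center per sample
print(D.shape)
print(nearest)
```

The 4 original features are replaced by 2 distance features, one per cluster center.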

We shall follow this idea in my other series on a real MLP and MNIST in more detail.


In this article we studied the application of the “Normalizer” offered by Scikit-Learn in two different ways to a training scenario for a one neuron perceptron and data with two input features (only). Normally we would apply “Normalizer” such that we would scale the data of all samples for each feature separately. And use the found stretching factors later on for new data points for which we want to make a classification prediction.

We saw that such a transformation roughly kept up the structure of the datapoint distribution in the {K1, K2}-feature-space. Scaling into an interval [-1, 1] had a major and healthy impact on the structure of the cost hyperplane in the {w1, w2}-weight-space. This helped us to perform a smooth gradient descent calculation.

Then we performed an application of “Normalizer” per sample. This corresponded to a radial stretch of all datapoints down to a unit circle, whilst keeping up the values of the angles. We got a more structured cost hyperplane afterwards and a stronger clustering effect in the special case of our transformed data distribution in feature space. This helped gradient descent quite a lot: We could classify our data much better according to our discrimination prescription A=0.3 vs. A=0.7.

Our transformation also had the interesting effect of sub-dividing the feature space into radial sectors instead of parallel stripes. This would be helpful in case of data clusters with a certain radial elongation in the feature space but a clear difference and separation in angle. Such data do indeed exist – just think of the distribution of stars or microwave radiation clusters on the night sky sphere. At least in the latter case the radial distance of the sources may be of minor importance: You do not need radial distance information to note a concentration in a region which we call “milky way”!

What we actually did with our special normalization was to indirectly eliminate the radial distance information hidden in our (K1, K2)-data. We could also have calculated the angle (or a function of it) directly and thrown away all other information. If we had done so, we would have reduced our 2-dim feature space to just one dimension! We saw this directly on the plot of the contour lines! Thus: It would have been much more intelligent if we had used our transformation in a slightly modified form, determined just the angle of our data-points directly and used these data as the only feature guiding gradient descent.
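
How such an angle feature could be computed is sketched below. We did not do this in the experiments of this article; the sample values are again taken from the K1- and K2-columns of the result tables:

```python
import numpy as np

li_K1 = np.array([200.0, 1.0, 160.0, 11.0, 220.0, 11.0, 120.0,
                  14.0, 195.0, 15.0, 130.0, 5.0, 185.0, 16.0])
li_K2 = np.array([14.0, 107.0, 10.0, 193.0, 32.0, 178.0, 2.0,
                  210.0, 12.0, 134.0, 7.0, 167.0, 10.0, 229.0])

# The angle with the K1-axis (in degrees) as the only remaining feature
phi = np.degrees(np.arctan2(li_K2, li_K1))

# The two clusters: samples with target A=0.3 have K1 > K2
is_low = li_K1 > li_K2
print("max. angle of the A=0.3 cluster:", phi[is_low].max())
print("min. angle of the A=0.7 cluster:", phi[~is_low].min())
```

The two clusters occupy clearly separated angle intervals on both sides of the 45° diagonal, so one feature is sufficient to separate them.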

This led us to the idea that a clear identification of clusters by some appropriate method before we start a gradient descent analysis might be helpful for classification tasks.

This in turn triggers the idea of a cluster detection in feature space – which itself actually is a major discipline of Machine Learning. An advantage of using cluster detection ahead of gradient descent would be the possible reduction of the number of input features for the artificial neural network. Take a look at a forthcoming article in my other series on a Multilayer Perceptron [MLP] in this blog for an application in combination with an MLP and the MNIST dataset.

In the next article of this series on a minimalistic perceptron we shall add a bias neuron to the input layer and investigate the impact.