Variational Autoencoder with Tensorflow 2.8 – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?

This post continues my series on Variational Autoencoders [VAE] with some considerations regarding a VAE whose settings allow only for a tiny amount of the so called Kullback-Leibler [KL] loss.

Variational Autoencoder with Tensorflow 2.8 – I – some basics
Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images
Variational Autoencoder with Tensorflow 2.8 – XI – image creation by a VAE trained on CelebA
Variational Autoencoder with Tensorflow 2.8 – XII – save some VRAM by an extra Dense layer in the Encoder

So far, most of the posts in this series have covered a variety of methods (provided by Tensorflow and Keras) to control the KL loss. One of the previous posts (XI) provided (indirect) evidence that also GradientTape()-based methods for KL-loss calculation work as expected. In stark contrast to a standard Autoencoder [AE] our VAE trained on CelebA data proved its ability to reconstruct humanly interpretable images from random z-points (or z-vectors) in the latent space. Provided that the z-points lie within a reasonable distance to the origin.

We could leave it at that. One of the basic motivations to work with VAEs is to use the latent space “creatively”. This requires that the data points coming from similar training images should fill the latent space densely and without gaps between clusters or filaments. We have obviously achieved this objective. Now we could start to do funny things like to combine reconstruction with vector arithmetic in the latent space.

But to look a bit deeper into the latent space may give us some new insights. The central point of the KL-loss is that it induces a statistical element into the training of AEs. As a consequence a VAE fills the so called “latent space” in a different way than a simple AE. The z-point distribution gets confined and areas around z-points for meaningful training images are forced to get broader and overlap. So two questions want an answer:

  • Can we get more direct evidence of what the KL-loss does to the data distribution in latent space?
  • Can we get some direct evidence supporting the assumption that most of the latent space of an AE is empty or only sparsely populated? in contrast to a VAE’s latent space?

Therefore, I thought it would be funny to compare the data organization in latent space caused by an AE with that of a VAE. But to get there we need some solid starting point. If you consider a bit where you yourself would start with an AE vs. VAE comparison you will probably come across the following additional and also interesting questions:

  • Can one safely assume that a VAE with only a very tiny amount of KL-loss reproduces the same z-point distribution vs. radius which an AE would give us?
  • In general: Can we really expect a VAE with a very tiny Kullback-Leibler loss to behave as a corresponding AE with the same structure of convolutional layers?

The answers to all these questions are the topics of this post and a forthcoming one. To get some answers I will compare a VAE with a very small KL-loss contribution with a similar AE. Both network types will consist of equivalent convolutional layers and will be trained on the CelebA dataset. We shall look at the resulting data point density distribution vs. radius, clustering properties and the ability to create images from statistical z-points.

This will give us a solid base to proceed to larger and more natural values of the KL-loss in further posts. I got some new insights along this path and hope the presented data will be interesting for the reader, too.

Below and in following posts I will sometimes call the target space of the Encoder also the “z-space“.

CelebA data to fill the latent vector-space

The training of an AE or a VAE occurs in a self-supervised manner. A VAe or an AE learns to create a point, a z-point, in the latent space for each of the training objects (e.g. CelebA images). In such a way that the Decoder can reconstruct an object (image) very close to the original from the z-point’s coordinate data. We will use the “CelebA” dataset to study the KL-impact on the z-point distribution.CelebA is more challenging for a VAE than MNIST. And the latent space requires a substantially higher number of dimensions than in the MNIST case for reasonable reconstructions. This makes things even more interesting.

The latent z-space filled by a trained AE or VAE is a multi-dimensional vector space. Meaning: Each z-point can be described by a vector defining a position in z-space. A vector in turn is defined by concrete values for as many vector components as the z-space has dimensions.

Of course, we would like to see some direct data visualizing the impact of the KL-loss on the z-point distribution which the Encoder creates for our training data. As we deal with a multidimensional vector space we cannot plot the data distribution. We have to simplify and somehow get rid of the many dimensions. A simple solution is to look at the data point distribution in latent space with respect to the distance of these points from the origin. Thereby we transform the problem into a one-dimensional one.

More precisely: I want to analyze the change in numbers of z-points within “radius“-intervals. Of course, a “radius” has to be defined in a multidimensional vector space as the z-space. But this can easily be achieved via an Euclidean L2-norm. As we expect the KL loss to have a confining effect on the z-point distribution it should reduce the average radius of the z-points. We shal later see that this is indeed the case.

Another simple method to reduce dimensions is to look at just one coordinate axis and the data distribution for the calculated values in this direction of the vector space. Therefore, I will also check the variation in the number of data points along each coordinate axis in one of the next posts.

A look at clustering via projections to a plane may be helpful, too.

The expected similarity of a VAE with tiny KL-loss to an AE is not really obvious

Regarding the answers to the 3rd and 4th questions posed above your intuition tells you: Yes, you probably can bet on a similarity between a VAE with tiny KL-loss and an AE.

But when you look closer at the network architectures you may get a bit nervous. Why should a VAE network that has many more degrees of freedom than an AE not use both of its layers for “mu” and “logvar” to find a different distribution solution? A solution related to another minimum of the loss hyperplane in the weight configuration space? Especially as this weight-related space is significantly bigger than that of a corresponding AE with the same convolutional layers?

The whole point has to do with the following facts: In an AE’s Encoder the last flattening layer after the Conv2D-layers is connected to just one output layer. In a VAE, instead, the flattening layer feeds data into two consecutive layers (for mu and logvar) across twice as many connections (with twice as many weight parameters to optimize).

In the last post of this series we dealt with this point from the perspective of VRAM consumption. Now, its the question in how far a VAE will be similar to an AE for a tiny KL-loss.

Why should the z-points found be determined only by mu-values and not also by logvar-values? And why should the mu values reproduce the same distribution as an AE? At least the architecture does not guarantee this by any obvious means …

Well, let us look at some data.

Structure of an AE for CelebA and its total loss after some epochs

Our test AE contains the same simple sequence of four Conv2D layers per Encoder and four 4 Conv2DTranspose layers as our VAE. See the AE’s Encoder layer structure below.

A difference, however, will be that I will not use any BatchNormalizer layers in the AE. But as a correctly implemented BatchNormalization should not affect the representational powers of a VAE network for very principle reasons this should not influence the comparison of the final z-point distribution in a significant way.

I performed an AE training run for 170,000 CelebA training images over 24 epochs. The latent space has a dimension if z_dim=256. (This relatively low number of dimensions will make it easier for a VAE to confine z_points around the origin; see the discussion in previous posts).

The resulting total loss of our AE became ca. 0.49 per pixel. This translates into a total value of

AE total loss on Celeb A after 24 epochs (for a step size of 0.0005): 4515

This value results from a summation over all geometric pixels of our CelebA images which were downsized to 96×96 px (see post IX). The given value can be compared to results measured by our GradientTape()-based VAE-model which delivers integrated values and not averages per pixel.

This value is significantly smaller than values we would get for the total loss of a VAE with a reasonably big KL-loss of contribution in the order of some percent of the reconstruction loss. A VAE produces values around 4800 up to 5000. Apparently, an AE’s Decoder reconstructs originals much better than a VAE with a significant KL-loss contribution to the total loss.

But what about a VAE with a very small KL-loss? You will get the answer in a minute.

Where does a standard Autoencoder [AE] place the z-points for CelebA data?

We can not directly plot a data point distribution in a 256-dimensional vector-space. But we can look at the data point density variation with a calculated distance from the origin of the latent space.

The distance R from the origin to the z-point for each image can be measured in terms of a L2 (= Euclidean) norm of the latent vector space. Afterward it is easy to determine the number of images within all radius intervals with e.g. a length of 0.5 e.g. between radii R

0  <  R  <  35 .

We perform the following steps to get respective numbers. We let the Encoder of our trained AE predict the z-points of all 170,000 training data

z_points  = AE.encoder.predict(data_flow) 

data_flow was created by a Keras DataImageGenerator to send batches of training data to the GPU (see the previous posts).

Radius values are then calculated as

print("NUM_Images_Train = ", NUM_IMAGES_TRAIN)
ay_rad_z = np.zeros((NUM_IMAGES_TRAIN,),  dtype='float32')
for i in range(0, NUM_IMAGES_TRAIN):
    sq = np.square(z_points[i]) 
    sqrt_sum_sq = math.sqrt(sq.sum())
    ay_rad_z[i] = sqrt_sum_sq

The numbers vs. radius relation then results from:

li_rad      = []
li_num_rad  = []
int_width = 0.5
for i in range(0,70):
    low   = int_width * i
    high  = int_width * (i+1) 
    num   = np.count_nonzero( (ay_rad_z >= low) & (ay_rad_z < high ) )
    li_rad.append(0.5 * (low + high))
    li_num_rad.append(num)

The resulting curve is shown below:

There seems to be a peak around R = 16.75. So, yet another question arises:

>What is so special about the radius values of 16 or 17 ?

We shall return to this point in the next post. For now we take this result as god-given.

Clustering of CelebA z-point data in the AE’s latent space?

Another interesting question is: Do we get some clustering in the latent space? Will there be a difference between an AE and a VAE?

A standard method to find an indication of clustering is to look for an elbow in the so called “inertia” curve for different assumed numbers of clusters. Below you find an inertia plot retrieved from the z-point data with the help of MiniBatchKMeans.

This result was achieved for data taken at every second value of the number of clusters “num_clus” between 2 ≤ num_clus ≤ 80. Unfortunately, the result does not show a pronounced elbow. Instead the variation at some special cluster numbers is relatively high. But, if we absolutely wanted to define a value then something between 38 and 42 appears to be reasonable. Up to that point the decline in inertia is relatively smooth. But do not let you get misguided – the data depend on statistics and initial cluster values. When you change to a different calculation you may get something like the following plot with more pronounced spikes:

This is always as sign that the clustering is not very clear – and that the clusters do not have a significant distance, at least not in all coordinate directions. Filamental structures will not be honored well by KMeans.

Nevertheless: A value of 40 is reasonable as we have 40 labels coming with the CelebA data. I.e. 40 basic features in the face images are considered to be significant and were labeled by the creators of the dataset.

t-SNE projections

We can also have a look at a 2-dimensional t-SNE-projection of the z-point distribution. The plots below have been produced with different settings for early exaggeration and perplexity parameters. The first plot resulted from standard parameter values for sklearn’s t-SNE variant.

tsne = TSNE(n_components=2, early_exaggeration=12, perplexity=30, n_iter=1000)

Other plots were produced by the following setting:

tsne = TSNE(n_components=2, early_exaggeration=16, perplexity=10, n_iter=1000)

Below you find some plots of a t-SNE-analysis for different numbers and different adjusted parameters for the resulting scatter plot. The number of statistically chosen z-point varies between 20,000 and 140,000.

Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)

Actually we see some indication of clustering, though it is not very pronounced. The clusters in the projection are not separated by clear and broad gaps. Of course a 2-dimensional projection can not completely visualize the original separations in a 256-dim space. However, we get the impression that clusters are located rather close to each other. Remember: We already know that almost all points are locates in a multidimensional sphere shell between 12 < R < 24. And more than 50% between 14 ≤ R ≤ 19.

However, how the actual distribution of meaningful z-points (in the sense of a recognizable face reconstruction) really looks like cannot be deduced from the above t-SNE analysis. The concentration of the z-points may still be one which follows thin and maybe curved filaments in some directions of the multidimensional latent space on relatively small or various scales. We shall get a much clearer picture of the fragmentation of the z-point distribution in an AE’s latent space in the next post of this series.

Number of statistical z-points: 80,000

For the higher number of selected z-points the room between some concentration centers appears to be filled in the projection. But remember: This may only be due to projection effects in the presently chosen coordinate system. Another calculation with the above non-standard data for perplexity and early_exaggeration gives us:

Number of statistical z-points: 140,000

Note that some islands appear. Obviously, there is at least some clustering going on. However, due to projection effects we cannot deduce much for the real structure of the point distribution between possible clusters. Even the clustering itself could appear due to overlapping two or more broader filaments along a projection line.

Whether correlations would get more pronounced and therefore could also be better handled by t-SNE in a rotated coordinate system based on a PCA-analysis remains to be seen. The next post will give an answer.

At least we have got a clear impression about the radial distribution of the z-points. And thereby gathered some data which we can compare to corresponding results of a VAE.

Total loss of a VAE with a tiny KL-loss for CelebA data

Our test VAE is parameterized to create only a very small KL-loss contribution to the total loss. With the Python classes we have developed in the course of this post series we can control the ratio between the KL-loss and a standard reconstruction loss as e.g. BCE (binary-crossentropy) by a parameter “fact“.

For BCE

fact = 1.0e-5

is a very small value. For a working VAE we would normally choose something like fact=5 (see post XI).

A value like 1.0e-5 ensures a KL loss around 0.0178 compared to a reconstruction loss of 4550, which gives us a ratio below 4.e-6. Now, what is a VAE going to do, when the KL-loss is so small?

For the total loss the last epochs produced the following values:

AE total loss on Celeb A after 24 epochs for a step size of 0.0005: 4,553

Output of the last 6 of 24 epochs.

Epoch 1/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4557.1694 - reco_loss: 4557.1523 - kl_loss: 0.0179
Epoch 2/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.9111 - reco_loss: 4556.8940 - kl_loss: 0.0179
Epoch 3/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.6626 - reco_loss: 4556.6450 - kl_loss: 0.0179
Epoch 4/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.3862 - reco_loss: 4556.3682 - kl_loss: 0.0179
Epoch 5/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4555.9595 - reco_loss: 4555.9395 - kl_loss: 0.0179
Epoch 6/6
1329/1329 [==============================] - 118s 89ms/step - total_loss: 4555.6641 - reco_loss: 4555.6426 - kl_loss: 0.0178

This is not too far away from the value of our AE. Other training runs confirmed this result. On four different runs the total loss value came to lie between

VAE total loss on Celeb A after 24 epochs: 4553 ≤ loss ≤ 4555 .

VAE with tiny KL-loss – z-point density distribution vs. radius

Below you find the plot for the variation of the number density of z-points vs. radius for our VAE:

Again, we get a maximum close to R = rad = 16. The maximum value lies a bit below the one found for a KL-loss-free AE. But all in all the form and width of the distribution of the VAE are very comparable to that of our test AE.

Can this result be reproduced?
Unfortunately not at a 100% of test runs performed. There are two main reasons:

  1. Firstly, we can not be sure that a second minimum does not exist for a distribution of points at bigger radii. This may be the case both for the AE and the VAE!
  2. Secondly, we have a major factor of statistical fluctuation in our game:
    The epsilon value which scales the logvar-contribution to the loss in the sampling layer of the Encoder may in very seldom cases abruptly jump to an unreasonable high value. A Gaussian covers extreme values, although the chances to produce such a value are pretty small. and a Gaussian is invilved in the calculation of z-points by our VAE.

Remember that the z-point coordinates are calculated via the the mu and logvar tensors according to

 
z = mu + B.exp(log_var / 2.) * epsilon

See Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics for respective code elements of the Encoder.

So, a lot depends on epsilon which is calculated as a statistically fluctuating quantity, namely as

epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)

Is there a chance that the training process may sometimes drive the system to another corner of the weight-loss configuration space due to abrupt fluctuations? With the result for the z-point distribution vs. radius that it may significantly deviate from a distribution around R = 16? I think: Yes, this is possible!

From some other training runs I actually have an indication that there is a second minimum of the cost hyperplane with similar properties for higher average radius-values, namely for a distribution with an average radius at R ≈ 19.75. I got there after changing the initialization of the weights a bit.

Another indication that the cost surface has a relative rough structure and that extreme fluctuations of epsilon and a resulting gradient-fluctuation can drive the position of the network in the weight configuration space to some strange corners. The weight values there can result in different z-point distributions at higher average radii. This actually happened during yet another training run: At epoch 22 the Adam optimizer suddenly directed the whole system to weight values resulting in a maximum of the density distribution at R = 66 ! This appeared as totally crazy. At the same time the KL-loss also jumped to a much higher value.

When I afterward repeated the run from epoch 18 this did not happen again. Therefore, a statistical fluctuation must have been the reason for the described event. Such an erratic behavior can only be explained by sudden and extreme changes of z-point data enforcing a substantial change in size and direction of the loss gradient. And epsilon is a plausible candidate for this!

So far I had nothing in our Python classes which would limit the statistical variation of epsilon. The effects seen spoke for a code change such that we do not allow for extreme epsilon-values. I set limits in the respective part of the code for the sampling layer and its lambda function

        # The following function will be used by an eventual Lambda layer of the Encoder 
        def z_point_sampling(args):
            '''
            A point in the latent space is calculated statistically 
            around an optimized mu for each sample 
            '''
            mu, log_var = args # Note: These are 1D tensors !
            epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.) 
            if abs(epsilon) >= 5: 
                epsilon *= 5. / abs(epsilon)       
            return mu + B.exp(log_var / 2.) * epsilon * self.enc_v_eps_factor

This stabilized everything. But even without these limitations on average three out of 4 runs which I performed for the VAE ran into a cost minimum which was associated with a pronounced maximum of the z-point-distribution around R ≈ 16. Below you see the plot for the fourth run:

So, there is some chance that the degrees of freedom associated with the logvar-layer and the statistical variation for epsilon may drive a VAE into other local minima or weight parameter ranges which do not lead to a z-point distribution around R = 16. But after the limitation of epsilon fluctuations all training runs found a loss minimum similar to the one of our simple AE – in the sense that it creates a z-point density distribution around R ≈ 16.

VAE with tiny KL-loss: Inertia and clustering of the CelebA data?

Our VAE gives the following variation of the inertia vs. the number of assumed clusters:

This also looks pretty similar to one of the plots shown for our AE above.

t-SNE for our VAE with a tiny KL loss

Below you find t-SNE plots for 20,000, 80,000 and 140,000 images:

Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)

This is quite similar to the related image for the AE. You just have to rotate it.

Number of statistical z-points: 80,000

Number of statistical z-points: 140,000

All in all we get very similar indications as from our AE that some clustering is going on.

VAE with tiny KL-loss: Should its logvar values become tiny, too?

Besides reproducing a similar z-point distribution with respect to radius values, is there another indication that a VAE behaves similar to an AE? What would be a clear sign that the similarity really exists on a deeper level of the layers and their weights?

The z-vector is calculated from the mu and logvar-vectors by:

z = mu + exp(logvar/2)*epsilon

with epsilon coming from a normal distribution. Please note that we are talking about vectors of size z_dim=256 per image.

If a VAE with a tiny KL-loss really becomes similar to an AE it should define and set its z-points basically by using mu-values, only, and not by logvar-values. I.e. the VAE should become intelligent enough to ignore the degrees of freedom associated with the logvar-layer. Meaning that the z-point coordinates of a VAE with a very small Kl-loss should in the end be almost identical to the mu-component-values.

Ok, but to me it was not self-evident that a VAE during its training would learn

  • to produce significant mu-related weight-values, only,
  • and to keep the weight values for the connections to the logvar-layer so small that the logvar-impact on the z-space position gets negligible.

Before we speculate about reasons: Is there any evidence for a negligible logvar-contribution to the z-point coordinates or, equivalently, to the respective vector components?

A VAE with tiny KL-loss produces tiny logvar values …

To get some quantitative data on the logvar impact the following steps are appropriate:

  1. Get the size and algebraic sign of the logvar-values. Negative values logvar < -3 would be optimal.
  2. Measure the deviation between the mu- and z_points vector components. There should only be a few components which show significant values &br; abs(mu – z) > 0.05
  3. Compare the the radius-value determined by z-components vs. the radius values derived from mu-components, only, and measure the absolute and relative deviations. The relative deviation should be very small on average.

Some values of logvar, (z – mu), z-radii and z-radius-deviations for a VAE with small KL-loss

Regarding the maximum value of the logvar’s vector-components I found

3.4 ≥ max(logvar) ≥ -3.2. # for 1 up to 3 components out of a total 45.52 million components

The first value may appear to be big for a component. But typically there are only 2 (!) out of 170,000 x 256 = 43.52 million vector components in an interval of [-3, 5]. On the component level I found the following minimum, maximum and average-values

Maximum value for logvar:  -2.0205276
Minimum value for logvar:  -24.660698
Average value for logvar:  -13.682616

The average value of logvar is pretty pleasing: Such big negative values indeed render the logvar-impact on the position of our z-points negligible. So we should only find very small deviations of the mu-components from the z-point components. And, actually, the maximum of the deviation between a z_point component and a mu component was delta_mu_z = 0.26:

Maximum (z_points - mu) = delta_mu_z = 0.26  # on the component level 

There were only 5 out of the 45.52 million components which showed an absolute deviation in the interval

0.05 < abs(delta_mu_z) < 2. 

The rest was much, much smaller!

What about radius values? Here the situation looks, of course, even better:

max radius defined by z  :  33.10274
min radius defined by z  :  6.4961233
max radius defined by mu :  33.0972
min radius defined by mu :  6.494558

avg_z:      16.283989  
avg_mu:     16.283972

max absolute difference :  0.018045425 
avg absolute difference :  0.00094899256

max relative difference  :   0.00072147313
avg relative difference  :   6.1240215e-05

As expected, the relative deviations between z- and mu-based radius values became very small.

In another run (the one corresponding to the second density distribution curve above) I got the following values:

Maximum value for logvar:  3.3761873
Minimum value for logvar:  -22.777826
Average value for logvar:  -13.4265175

max radius z :  35.51387
min radius z :  7.209168
max radius mu :  35.515926
min radius mu :  7.2086616

avg_z:  17.37412
avg_mu: 17.374104

max delta rad relative :   0.012512478
avg delta rad relative :   6.5660715e-05

This tells us that the z-point distributions may vary a bit in their width, their precise center and average values. But, overall they appear to be similar. Especially with respect to a relative negligible contribution of logvar-terms to the z-point position. The relative impact of logvar on the radius value of a z-point is of the order 6.e-5, only.

All the above data confirm that a trained VAE with a very small KL-loss primarily uses mu-values to set the position of its z-points. During training the VAE moves along a path to an overall minimum on the loss hyperplane which leads to an area with weights that produce negligible logvar values.

Explanation of the overall similarity of a VAE with tiny KL-loss to an AE

o far we can summarize: Under normal conditions the VAE’s behavior is pretty close to that of a similar AE. The VAE produces only small logvar values. z-point coordinates are extremely close to just the mu-coordinates.

Can we find a plausible reason for this result? Looking at the cost-hyperplane with respect to the Encoder weights helps:

The cost surface of a VAE spans across a space of many more weight parameters than a corresponding AE. The reason is that we have weights for the connection to the logvar-layer in addition to the weights for the mu-layer (or a single output layer as in a corresponding AE). But if we look at the corner of the weight-vector-space where the logvar-related values are pretty small, then we would at least find a local (if not global) loss minimum there for the same values of the mu-related weight parameters as in the corresponding AE (with mu replacing the z-output).

So our question reduces to the closely related question whether the old minimum of an AE remains at least a local one when we shift to a VAE – and this is indeed the case for the basic reason that the KL-contributions to the height of the cost-hyperplane are negligibly small everywhere (!) – even for higher logvar-related values.

This tells us that a gradient descent algorithm should indeed be able to find a cost minimum for very small values of logvar-related weights and for weight-values related to the mu-layer very close to the AE’s weight-values for direct connections to its output layer. And, of course, with all other weight parameter of the VAE-Decoder being close to the values of the weights of a corresponding AE. At least under the condition that all variable quantities really change smoothly during training.

Does a VAE with small KL-loss produce reasonable face images?

A last test to confirm that a VAE with a very small KL-loss operates as an comparable AE is a trial to create images with recognizable human faces from randomly chosen points in z-space. Such a trial should fail! I just show you three results – one for a normal distribution of the z-point components. And two for equidistant distribution of component values up to 3, 8 and 16:

z-point coordinates from normal distribution

z-point coordinates from equidistant distribution in [-2,2]

z-point coordinates from equidistant distribution in [-10,10]

This reminds us very much about the behavior of an AE. See: Autoencoders, latent space and the curse of high dimensionality – I.

The z-point distribution in latent space of a VAE with a very small KL-loss obviously is as complicated as that of an AE. Neighboring points of a z-point which leads to a good image produce chaotic images. The transition path from good z-points to other meaningful z-points is confined to a very small filament-like volume.

Conclusion

A trained VAE with only a tiny KL-loss contribution will under normal circumstances behave similar to an AE with a the same hidden (convolutional) layers. It may, however, be necessary to limit the statistical variation of the epsilon factor in the z-point calculation based on mu– and logvar-values.

The similarity is based on very small logvar-values after training. The VAE creates a z-point distribution which shows the same dependency on the radius as an AE. We see similar indications and patterns of clustering. And the VAE fails to produce human faces from random z-points in the latent space – as a comparable AE.

We have found a plausible reason for this similarity by comparing the minimum of the loss hyperplane in the weight-loss parameter space with a corresponding minimum in the weight-loss space of the VAE – at a position with small weights for the connection to the logvar layers.

The z-point density distribution shows a maximum at a radius between 16 and 17. The z-point distribution basically has a Gaussian form. In the next post we shall look a bit closer at these findings – and their origin in Gaussian distributions along the coordinate axes of the latent space. After an application of a PCA analysis we shall furthermore see that the z-point distribution in an AE’s latent vector space is indeed fragmented and shows filaments on certain length scales. A VAE with a tiny KL-loss will show the same fragmentation.

In further forthcoming post we shall afterward investigate the confining and at the same time blurring impact of the KL-loss on the latent space. Which will make it usable for creative purposes.

And let us all who praise freedom not forget:
The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!

 

Variational Autoencoder with Tensorflow 2.8 – XII – save some VRAM by an extra Dense layer in the Encoder

I continue with my series on Variational Autoencoders [VAEs] and related methods to control the KL-loss.

Variational Autoencoder with Tensorflow 2.8 – I – some basics
Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images
Variational Autoencoder with Tensorflow 2.8 – XI – image creation by a VAE trained on CelebA

After having successfully trained a VAE with CelebA data, we have shown that our VAE can afterward create images with human-like looking faces from statistically selected data points (z-points) in its latent space. We still have to analyze the confinement of the z-point distribution due to the KL-loss a bit in more depth. But before we turn to this topic I want to briefly discuss an option to reduce the VRAM requirements of the VAE’s Encoder.

Limited VRAM – a problem for ML training runs on older graphics cards

In my opinion exploring the field of Machine Learning on a PC should not be limited to people who can afford a state of the art graphics card with a lot of VRAM. One could use Google’s Colab – but … I do not want to go into tax and personal data politics here. I really miss an EU-wide platform that offers services like Google Colab.

Anyway, a reduction of VRAM consumption may be decisive to be able to perform training runs for CNN-based VAEs on older graphic cards . Not only concerning VRAM limits but also regarding computational time: The less VRAM the weight parameters of our VAE models require the bigger we can size the batches the GPU operates on and the more CPU time we may potentially save. At least in principle. Therefore, we should consider the amount of trainable parameters of a neural network model and reduce them if possible.

The number of parameters depends heavily on the connections to the mu and logvar-layer of the Encoder

When you print out the layer structure and related parameters of a VAE (see below) you will find that the Encoder requires more parameters than the Decoder. Around twice as many. A closer look reveals:

It is the transition from the convolutional part of the Encoder to its Dense layers for mu and logvar which plays an important role for the number of weight parameters.

For a layer structure comprising 4 Conv2D layers, related filters=(32,64,128,256) and an input image size of (96,96,3) pixels we arrive at a Flatten-layer of 9216 neurons at the end of the convolutional part of the Encoder. For z_dim = 512 the direct connections from the Flatten-layer to both the mu- and logvar-layers lead to more the 9.4 million (float32) parameters for the Encoder. This is the absolutely dominant part of all required 9.83 million parameters of the Encoder. In contrast the Decoder part requires a total of 5.1 million parameters, only.

Encoder

Decoder

This is due to the fact that the flattened layer supplies input to two connected layers before output is created by yet another layer. In the Decoder, instead, only one layer, namely the input layer is connected to the flattened layer ahead of the first Conv2DTranspose layer.

In the case of z_dim=256 we arrive at around half of the parameters, i.e. 4.9 million parameters for the Encoder and around 2.76 million for the Decoder.

It is obvious that the existence of two layers for the variational parameters inside the Encoder is the source of the high parameter number on the Encoder side.

Would a reduction of convolutional layers help to reduce the weight parameters?

A reduction of Conv2D-layers in the Encoder would of course reduce the parameters for the weights between the convolutional layers. But turning to only three convolutional layers whilst keeping up a stride value of stride=2 for all filters would raise the already dominant number of parameters after the flattened layer by a factor of 4!

So, one has to work with a delicate balance between the number of convolutional layers and the eventual number of maps at the innermost layer and the size of these maps. They determine the number of neurons and related weights on the flattening layer:

From the perspective of a low total number of parameters you should consider higher stride values when reducing the number of Conv2D-layers.

On the other hand side using more than 4 convolutional layers would reduce the resolution of the maps of the innermost Conv2D layer below a usable threshold for reasonable mu and logvar values.

Off topic remark: All in all it seems to be reasonable also to think about ResNets of low depth instead of plain CNNs to keep weight numbers under control.

An intermediate dense layer ahead of the mu- and logvar-layers of the Encoder?

The reader who followed the posts in this series may have looked at the recipe which F. Chollet has discussed in his Keras documentation on VAEs. See:
https://keras.io/ examples/ generative/vae/.

There is an element in Chollet’s Encoderstructure which one easily can overlook at first sight. In his example for the MNIST dataset Chollet adds an intermediate Dense layer between the Flatten-layer and the layers for mu and logvar.

...
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
...

In the special case of MNIST an intermediate layer seems appropriate for bridging the gap between an input dimension of 784 to z_dim = 2. You do not expect major problems to arise from such a measure.

But: This intermediate layer introduced by Chollet also has the advantage of reducing the total number of trainable parameters substantially.

We could try something similar for our network. But here we have to be a bit more careful as we work in a latent space of much higher dimensions, typically with z_dim >= 256. Here we are in dilemma as we want to keep the intermediate dimension relatively high for using as much information as possible coming from the maps of the last Conv2D-layers. A fair compromise seems to be to use at least the dimension of the mu and varlog-layers, namely z_dim.

For z_dim=256 an additional Dense layer of the same size would reduce the total number of Encoder parameters from 5.11 million to 2.88 million.

If we took the dimension of the intermediate layer to be 384 we would still go down with the total Encoder parameters to 4.13 million. So an additional Dense layer really saves us some VRAM.

Images constructed by a VAE model with an additional Dense layer in the Encoder

Will an additional dense layer have a negative impact on our VAE’s ability to create images from randomly chosen z-points in the latent space?

Let us try it out. To include an option for an additional Dense layer in the Encoder related part of our class “MyVariationalAutoencoder()” is a pretty simple task. I leave this to the reader. Note that if we choose the dimension of the additional Dense layer to be exactly z_dim there is no need to change the reconstruction logic and layer structure of the Decoder. Also for other choices for the size of the Dense layer I would refrain from changing the Decoder.

I used z_dim=256 for the extra layer’s size. Then I repeated the experiments described in my last post. Some results for random z-points picked from a normal distribution in all coordinates are shown below:

So we see that from the generative point of view an extra Dense layer does not hurt too much.

What have we gained?

First and foremost:

We found a simple method to reduce the VRAM consumption of the Encoder.

But I have to admit that this method did NOT save any GPU time during training as long as I kept the size of image batches equally big as before (128). The reason is:

Due to the extra layer more matrix operations have to be performed than before, although some of matrixes becae smaller. On my old graphics card a full epoch with 170,000 (96×96) images takes around 120 secs – with or without an extra Encoder layer. Unfortunately, increasing the size of the batches the DataImageGenerator feeds into the GPU from 128 images to 256 did not change the required GPU time very much. More tests showed that a size of 128 already gave me an optimal turnaround time per epoch on my old graphics card (960 GTX).

Conclusion

An extra intermediate Dense layer between the Flatten-layer and the mu- and logvar-layers of the Encoder can help us to save some VRAM during the training of a VAE. Such a layer does not lead to a visible reduction of the quality of VAE-generated images from randomly selected points in the latent-space.

In the next post of this series
Variational Autoencoder with Tensorflow 2.8 – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?
we will compare a VAE with only a tiny contribution of KL-loss to the total loss with a corresponding AE. We shall investigate their similarity regarding their z-point distributions. This will give us a solid basis to investigate the impact of higher KL-loss values on the latent space in more detail afterwards.

Variational Autoencoder with Tensorflow 2.8 – XI – image creation by a VAE trained on CelebA

I continue with my series on Variational Autoencoders [VAEs] and related methods to control the KL-loss.

Variational Autoencoder with Tensorflow 2.8 – I – some basics
Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images

VAEs fall into a section of ML which is called “Generative Deep Learning“. The reason is that we can VAEs to create images with contain objects with features of objects learned from training images. One interesting category of such objects are human faces – of different color, with individual expressions and features and hairstyles, seen from different perspectives. One dataset which contains such images is the CelebA dataset.

During the last posts we came so far that we could train a CNN-based Variational Autoencoder [VAE] with images of the CelebA dataset. Even on graphics cards with low VRAM. Our VAE was equipped with a GradientTape()-based method for KL-loss control. We still have to prove that this method works in the expected way:

The distribution of data points (z-points) created by the VAE’s Encoder for training input should be confined to a region around the origin in the latent space (z-space). And neighboring z-points up to a limited distance should result in similar output of the Decoder.

Therefore, we have to look a bit deeper into the results of some VAE-experiments with the CelebA dataset. I have already pointed out why creating rather complex images from arbitrarily chosen points in the latent space is a suitable and good test for a VAE. Please remember that our efforts regarding the KL-loss have to do with the following fact:

not create reasonable images/objects from arbitrarily chosen z-points in the latent space.

This eliminates the use of an AE for creative purposes. A VAE, however, should be able to solve this type of task – at least for z-points in a limited surroundings of the latent space’s origin. Thus, by creating images from randomly selected z-points with the Decoder of a VAE, which has been trained on the CelebA data set, we cover two points:

  • Test 1: We test the functionality of the VAE-class, which we have developed and which includes the code for KL-loss handling via TF2’s GradientTape() and Keras’ train_step().
  • Test 2: We test the ability of the VAE’s Decoder to create images with convincing human-like face and hairstyle features from random z-points within an area close to the origin of the latent space.

Most of the experiments discussed below follow the same prescription: We take our trained VAE, select some random points in the latent space, feed the z-point-data into the VAE’s Decoder for a prediction and plot the images created on the Decoder’s output side. The Encoder only plays a role when we want to test reconstruction abilities.

For a low dimension z_dim=256 of the latent space we will find that the generated images display human faces reasonably well. But the images appear a bit blurry or unsharp – as if not fully focused. So, we need to discuss what we can do about this point. I will also name some plausible causes for the loss of accuracy in the representation of details.

Afterwards I want to show you that a VAE Decoder reconstructs original images relatively badly from the z-points calculated by the Encoder. At least when one looks at details. A simple AE with a sufficiently high dimension of the latent space performs much better. One may feel disappointed about the reconstruction ability of a VAE. But actually it is the ability of a VAE to forget about details and instead to focus on general features which enables it (the VAE) to create something meaningful from randomly chosen z-points in the latent space.

In a last step in this post we are going to look at images created from z-points with a growing distance from the origin of the multidimensional latent space [z-space]. (Distance can be defined by a L2-Euclidean norm). We will see that most z-points which have some z-coordinates above a value of 3 produce fancy images where the face structures get dominated or distorted by background structures learned from CelebA images. This effect was to be expected as the KL-loss enforced a distribution of the z-points which is confined to a region relatively close to the origin. Ideally, this distribution would be characterized by a normal distribution in all coordinates with a sigma of only 1. So, the fact that z-points in the vicinity of the origin of the latent space lead to a construction of images which show recognizable human faces is an indirect proof of the confining impact of the KL-loss on the z-point distribution. In another post I shall deliver data which prove this more directly.

Below I will call the latent space of a (V)AE also z-space.

Characteristics of the VAE tested

Our trained VAE with four Conv2D-layers in the Encoder and 4 corresponding Conv2DTranspose-Layers in the Decoder has the following basic characteristics:

(Encoder-CNN-) filters=(32,64,128,256), kernels=(3,3), stride=2,
reconstruction loss = BCE (binary crossentropy), fact=5.0, z_dim=256

The order of the filter- (= map-) numbers is, of course reversed for the Decoder. The factor fact to scale the KL-loss in comparison to the reconstruction loss was chosen to be fact=5, which led to a 3% contribution of the KL-loss to the total loss during training. The VAE was trained on 170,000 CelebA images with 24 epochs and a small epsilon=0.0005 plus Adam optimizer.

When you perform similar experiments on your own you may notice that the total loss values after around 24 epochs ( > 5015) are significantly higher than those of comparable experiments with a standard AE (4850). This already is an indication that our VAE will not reproduce a similar good match between an image reconstructed by the Decoder in comparison to the original input image fed into the Encoder.

Results for z-points with coordinates taken from a normal distribution around the origin of the latent space

The picture below shows some examples of generated face-images coming from randomly chosen z-points in the vicinity of the z-space’s origin. To calculate the coordinates of such z-points I applied a normal distribution:

z_points = np.random.normal(size = (n_to_show, z_dim)) # n_to_show = 28

So, what do the results for z_dim=256 look like?

Ok, we get reasonable images of human-like faces. The variations in perspective, face forms and hairstyles are also clearly visible and reflect part of the related variety in the training set. You will find more variations in more images below. So, we take this result as a success! In contrast to a pure AE we DO get something from random z-points which we clearly can interpret as human faces. The whole effort of confining z-points around the origin and at the same time of smearing out z-points with similar content over a region instead of a fixed point-mapping (as in an AE) has paid off. See for comparison:
Autoencoders, latent space and the curse of high dimensionality – I

Unfortunately, the images and their details details appear a bit blurry and not very sharp. Personally, this reminded me of the times when the first CCD-chips with relative low resolution were introduced in cameras and the raw image data looked disappointing as long as we did not apply some sharpening filters. The basic information to enhance details were there, but they had to be used explicitly to improve the plain raw data of the CCD.

The quality in details is about the same as what we see in example images in the book of D.Foster on “Generative Deep Learning”, 2019, O’Reilly. Despite the fact that Foster used a slightly higher resolution of the input images (128x128x3 pixels). The higher input resolution there also led to a higher resolution of the maps of the innermost convolutional layer. Regarding quality see also the images presented in:
https://datagen.tech/guides/image-datasets/celeba/

Enhancement processing of the images ?

Just for fun, I took a screenshot of my result, saved it and applied two different sharpening filters from the ShowFoto program:

Much better! And we do not have the impression that we added some fake information to the images by our post-processing ….

Now I hear already argument saying that such an enhancement should not be done. Frankly, I do not see any reason against post-processing of images created by a VAE-algorithm.

Remember: This is NOT about reproduction quality with respect to originals or a close-to-reality show. This is about generating new mages of human-like faces based on basic features which a VAE-algorithm hopefully has learned from training images. All of what we do with a VAE is creative. And it also comes close to a proof that ML-algorithms based on convolutional layers really can “learn” something about the basic features of objects presented to them. (The learned features are e.g. in the Encoder’s case saved in the sensitivity of the convolutional maps to typical patterns in the input images.)

And as in the case of raw-images of CCD or CMOS camera chips: Sometimes some post-processing is required to utilize the information optimally for sharpness.

Sharpening by PIL’s enhancement functionality

Of course we do not want to produce images in a ML run, take screenshots and sharpen each image individually. We need some tool that fits into the ML process pipeline. The good old PIL library for Python offers sharpening as one of multiple enhancement options for images. The next examples are results from the application of a PIL enhancement procedure:

These images look quite OK, too. The basic code fragment I used for each individual image in the above grid:

    # reconst_new is the output from my VAE's Decoder 
    ay_img      = reconst_new[i, :,:,:] * 255
    ay_img      = np.asarray(ay_img, dtype="uint8" )
    img_orig    = Image.fromarray(ay_img)
    img_shr_obj = ImageEnhance.Sharpness(img)
    sh_factor   = 7   # Specified Factor for Enhancing Sharpness
    img_sh      = img_shr_obj.enhance(sh_factor)

The sharpening factor I chose was quite high, namely sh_factor = 7.

The effect of PIL’s sharpening factor

Just to further demonstrate the effect of different factors for sharpening by PIL you find some examples below for sh_factor = 0, 3, 6.

sh_factor = 0

sh_factor = 3

sh_factor = 6

Obviously, the enhancement is important to get clearer and sharper images.
However, when you enlarge the images sufficiently enough you see some artifacts in the form of crossing lines. These artifacts are partially already existing in the Decoder’s output, but they are enhanced by the Sharpening mechanism used by PIL (unsharp masking). The artifacts become more pronounced with a growing sh_factor.
Hint: According to ML-literature the use of Upsampling layers instead of Conv2DTranspose layers in the Decoder may reduce such artefacts a bit. I have not yet tried it myself.

Assessment

How do we assess the point of relatively unclear, unsharp images produced by our VAE? What are plausible reasons for the loss of details?

  1. Firstly, already AEs with a latent space dimension z_dim=256 in general do not reconstruct brilliant images from z-points in the latent space. To get a good reconstruction quality even from an AE which does nothing else than to compress and reconstruct images of a size (96x96x3) z_dim-values > 1000 are required in my experience. More about this in another post in the future.
  2. A second important aspect is the following: Enforcing a compact distribution of similar images in the latent space via the KL-loss automatically introduces a loss of detail information. The KL-loss is designed to lead to a smear-out effect in z-space. Only basic concepts and features will be kept by the VAE to ensure a similarity of neighboring images. Details will be omitted and “smoothed” out. This has consequences also with respect to sharpness of detail structures. A detail as an eyebrow in a face is to be considered as an average of similar details found for images in the same region of the z-space. This alone brings some loss of clarity with it.
  3. Thirdly, a simple (V)AE based on some directly connected Conv2D-layers has limited capabilities in general. The reason is that we systematically reduce resolution whilst information is propagated from one Conv2D layer to the next neighboring one. Remember that we use a stride > 2 or pooling layers to cover filters on larger image scales. Due to this information processing a convolutional network automatically suppresses details in its inner layers – their resolution shrinks with growing distance from the input layer. In later posts of this blog we shall see that using ResNets instead of CNNs in the Encoder and Decoder already helps a bit regarding the reconstruction of clearer images. The correlation between details and large scale information is better kept up there than in CNNs.

Regarding the first point one may think of increasing z_dim. This may not be the best idea. It contradicts the whole idea of a VAE which at its core is a reduction of the degrees of freedom for z-points. For a higher dimensional space we may have to raise the ratio of KL-loss to reconstruction loss even further.

Regarding the third point: Of course it would also help to increase kernel sizes for the first two Conv2D layers and the number of maps there. A higher resolution of the input images would also be of advantage. Both methods may, however, conflict with your VRAM or GPU time limits.

If the second point were true then reduction of fact in our models, which controls the ration of KL-loss to reconstruction loss, would lead to a better image quality. In this case we are doomed to find an optimal value for fact – satisfying both the need for generalization and clarity of details in our images. You cannot have both … here we see a basic problem related to VAEs and the creation of realistic images. Actually, I tried this out – the effect is there, but the gain actually is not worth the effort. And for too small values of fact we eventually loose the ability to create reasonable images from arbitrary z-points at all.

All in all post-processing appears to be a simple and effective method to get images with somewhat sharper details.
Hint: If you want to create images of artificially generated faces with a really high quality, you have to turn to GANs.

Further examples – with PIL sharpening

In this example you see that not all points give you good images of faces. The z-point of the middle image in the second to last of the first illustration below has a relatively high distance from the origin. The higher the distance from the origin in z-space the weirder the images get. We shall see this below in a more systematic way.

Reconstruction quality of a VAE vs. an AE – or the “female” side of myself

If I were not afraid of copy and personal rights aspects of using CelebA images directly I could show you now a comparison of the the reconstruction ability of an AE in comparison to a VAE. You find such a comparison, though a limited one, by looking at some images in the book of D. Foster.

To avoid any problems I just tried to work with an image of myself. Which really gave me a funny result.

A plain Autoencoder with

  • an extended latent space dimension of z_dim = 1600,
  • a reasonable convolutional filter sequence of (64, 64, 128, 128)
  • a stride value of stride=2
  • and kernels ((5,5),(5,5),(3,3),(3,3))

is well able to reproduce many detailed features one’s face after a training on 80,000 CelebA images. Below see the result for an image of myself after 24 training epochs of such an AE:

The left image is the original, the right one the reconstruction. The latter is not perfect, but many details have been reproduced. Please note that the trained AE never had seen an image of myself before. For biometric analysis the reproduction would probably be sufficient.

Ok, so much about an AE and a latent space with a relatively high dimension. But what does a VAE think of me?
With fact = 5.0, filters like (32,64,128,256), (3,3)-kernels, z_dim=256 and after 18 epochs with 170,000 training images of CelebA my image really got a good cure:

My wife just laughed and said: Well, now in the age of 64 at least an AI has found something soft and female in you … Well, had the CelebA included many faces of heavy metal figures the results would have looked differently. I bet …

So with generative VAEs we obviously pay a price: Details are neglected in favor of very general face features and hairstyle aspects. And we loose sharpness. Which is good if you have wrinkles. Good for me and the celebrities, too. 🙂

However, I recommend anybody who wants to study VAEs to check the reproduction quality for CelebA test images (not from the training set). You will see the generalization effect for a broader range of images. And, of course, a better reproduction with smaller values for the ratio of the KL-loss to the reconstruction loss. However, for too small values of fact you will not be able to create realistic face images at all from arbitrary z-points – even if you choose them to be relatively close to the origin of the latent space.

Dependency of the creation of reasonable images on the distance from the origin

In another post in this blog I have discussed why we need VAEs at all if we want to reconstruct reasonable face images from randomly picked points in the latent space. See:
Autoencoders, latent space and the curse of high dimensionality – I

I think the reader is meanwhile convinced that VAEs do a reasonably good job to create images from randomly chosen z-points. But all of the above images were taken from z-points calculated with the help of a function assuming a normal distribution in the z-space coordinates. The width of the resulting distribution around the origin is of course rather limited. Most points lie within a 3 sigma distance around the origin. This is OK as we have put a lot of effort into the KL-loss to force the z-points to approach such a normal distribution around the origin of the latent space.

But what happens if and when we increase the distance of our random z-points from the origin? An easy way to investigate this is to create the z-points with a function that creates the coordinates randomly, but equally distributed in an interval ]0,limit]. The chance that at least one of the coordinates gets a high value is rather big then. This in turn ensures relatively high radius values (in terms of an L2-distance norm).

Below you find the results for z-points created by the function random.uniform:

r_limit = 1.5
l_limit = -r_limit
znew = np.random.uniform(l_limit, r_limit, size = (n_to_show, z_dim))

r_limit is varied as indicated:

r_limit = 0.5

r_limit = 1.0

r_limit = 1.5

r_limit = 2.0

r_limit = 2.5

r_limit = 3.0

r_limit = 3.5

r_limit = 5.0

r_limit = 8.0

Well, this proves that we get reasonable images only up to a certain distance from the origin – and only in certain areas or pockets of the z-space at higher radii.

Another notable aspect is the fact that the background variations are completely smoothed out a low distances from the origin. But they get dominant in the outer regions of the z-space. This is consistent with the fact that we need more information to distinguish various background shapes, forms and colors than basic face patterns. Note also that the faces appear relatively homogeneous for r_limit = 0.5. The farther we are away from the origin the larger the volumes to cover and distinguish certain features of the training images become.

Conclusion

Our VAE with the GradientTape()-mechanism for the control of the KL-loss seems to do its job. In contrast to a pure AE the smear-out effect of the KL-loss allows now for the creation of images with interpretable contents from arbitrary z-points via the VAE’s Decoder – as long as the selected z-points are not too far away from the z-space’s origin. Thus, by indirect evidence we can conclude that the z-points for training images of the CelebA dataset were distributed and at the same time confined around the origin. The strongest indication came from the last series of images. But we pay a price: The reconstruction abilities of a VAE are far below those of AEs. A relatively low number of dimensions of the latent space helps with an effective confinement of the z-points. But it leads to a significant loss in detail sharpness of the generated images, too. However, part of this effect can be compensated by the application of standard procedures for image enhancemnet.

In the next post
Variational Autoencoder with Tensorflow 2.8 – XII – save some VRAM by an extra Dense layer in the Encoder
I will discuss a simple trick to reduce the VRAM consumption of the Encoder. In a further post we shall then analyze the confinement of the z-point distribution with the help of more explicit data.

And let us all who praise freedom not forget:
The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. An aggressor who orders the bombardment of civilian infrastructure, civilian buildings, schools and hospitals with drones bought from other anti-democrats and women oppressors puts himself in the darkest and most rotten corner of human history.