Variational Autoencoder with Tensorflow – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?


This post continues my series on Variational Autoencoders [VAE] with some considerations regarding a VAE whose settings allow for only a tiny amount of the so called Kullback-Leibler [KL] loss.

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow – X – VAE application to CelebA images
Variational Autoencoder with Tensorflow – XI – image creation by a VAE trained on CelebA
Variational Autoencoder with Tensorflow – XII – save some VRAM by an extra Dense layer in the Encoder

So far, most of the posts in this series have covered a variety of methods (provided by Tensorflow and Keras) to control the KL loss. One of the previous posts (XI) provided (indirect) evidence that also GradientTape()-based methods for KL-loss calculation work as expected. In stark contrast to a standard Autoencoder [AE], our VAE trained on CelebA data proved its ability to reconstruct humanly interpretable images from random z-points (or z-vectors) in the latent space, provided that the z-points lie within a reasonable distance from the origin.

We could leave it at that. One of the basic motivations to work with VAEs is to use the latent space “creatively”. This requires that the data points coming from similar training images fill the latent space densely and without gaps between clusters or filaments. We have obviously achieved this objective. Now we could start to do funny things like combining reconstruction with vector arithmetic in the latent space.

But a deeper look into the latent space may give us some new insights. The central point of the KL-loss is that it induces a statistical element into the training of AEs. As a consequence a VAE fills the so called “latent space” in a different way than a simple AE does. The z-point distribution gets confined and the areas around z-points for meaningful training images are forced to get broader and to overlap. So two questions call for an answer:

  • Can we get more direct evidence of what the KL-loss does to the data distribution in latent space?
  • Can we get some direct evidence supporting the assumption that most of the latent space of an AE is empty or only sparsely populated, in contrast to a VAE’s latent space?

Therefore, I thought it would be funny to compare the data organization in latent space caused by an AE with that of a VAE. But to get there we need some solid starting point. If you consider a bit where you yourself would start with an AE vs. VAE comparison you will probably come across the following additional and also interesting questions:

  • Can one safely assume that a VAE with only a very tiny amount of KL-loss reproduces the same z-point distribution vs. radius which an AE would give us?
  • In general: Can we really expect a VAE with a very tiny Kullback-Leibler loss to behave as a corresponding AE with the same structure of convolutional layers?

The answers to all these questions are the topics of this post and a forthcoming one. To get some answers I will compare a VAE with a very small KL-loss contribution with a similar AE. Both network types will consist of equivalent convolutional layers and will be trained on the CelebA dataset. We shall look at the resulting data point density distribution vs. radius, clustering properties and the ability to create images from statistical z-points.

This will give us a solid base to proceed to larger and more natural values of the KL-loss in further posts. I got some new insights along this path and hope the presented data will be interesting for the reader, too.

Below and in the following posts I will sometimes also call the target space of the Encoder the “z-space“.

CelebA data to fill the latent vector-space

The training of an AE or a VAE occurs in a self-supervised manner. A VAE or an AE learns to create a point, a z-point, in the latent space for each of the training objects (e.g. CelebA images) in such a way that the Decoder can reconstruct an object (image) very close to the original from the z-point’s coordinate data. We will use the “CelebA” dataset to study the KL-impact on the z-point distribution. CelebA is more challenging for a VAE than MNIST, and the latent space requires a substantially higher number of dimensions than in the MNIST case for reasonable reconstructions. This makes things even more interesting.

The latent z-space filled by a trained AE or VAE is a multi-dimensional vector space. Meaning: Each z-point can be described by a vector defining a position in z-space. A vector in turn is defined by concrete values for as many vector components as the z-space has dimensions.

Of course, we would like to see some direct data visualizing the impact of the KL-loss on the z-point distribution which the Encoder creates for our training data. As we deal with a multidimensional vector space we cannot plot the data distribution. We have to simplify and somehow get rid of the many dimensions. A simple solution is to look at the data point distribution in latent space with respect to the distance of these points from the origin. Thereby we transform the problem into a one-dimensional one.

More precisely: I want to analyze the change in the number of z-points within “radius“-intervals. Of course, a “radius” first has to be defined for a multidimensional vector space such as the z-space. But this can easily be achieved via a Euclidean L2-norm. As we expect the KL-loss to have a confining effect on the z-point distribution, it should reduce the average radius of the z-points. We shall later see that this is indeed the case.

Another simple method to reduce dimensions is to look at just one coordinate axis and the data distribution for the calculated values in this direction of the vector space. Therefore, I will also check the variation in the number of data points along each coordinate axis in one of the next posts.

A look at clustering via projections to a plane may be helpful, too.

The expected similarity of a VAE with tiny KL-loss to an AE is not really obvious

Regarding the answers to the 3rd and 4th questions posed above, your intuition tells you: Yes, you probably can bet on a similarity between a VAE with tiny KL-loss and an AE.

But when you look closer at the network architectures you may get a bit nervous. Why should a VAE network that has many more degrees of freedom than an AE not use both of its layers for “mu” and “logvar” to find a different distribution solution? A solution related to another minimum of the loss hyperplane in the weight configuration space? Especially as this weight-related space is significantly bigger than that of a corresponding AE with the same convolutional layers?

The whole point has to do with the following facts: In an AE’s Encoder the last flattening layer after the Conv2D-layers is connected to just one output layer. In a VAE, instead, the flattening layer feeds its data into two parallel layers (for mu and logvar) across twice as many connections (with twice as many weight parameters to optimize).

In the last post of this series we dealt with this point from the perspective of VRAM consumption. Now, the question is in how far a VAE will be similar to an AE for a tiny KL-loss.

Why should the z-points found be determined only by mu-values and not also by logvar-values? And why should the mu values reproduce the same distribution as an AE? At least the architecture does not guarantee this by any obvious means …

Well, let us look at some data.

Structure of an AE for CelebA and its total loss after some epochs

Our test AE contains the same simple sequence of four Conv2D layers per Encoder and four Conv2DTranspose layers per Decoder as our VAE. See the AE’s Encoder layer structure below.

A difference, however, will be that I will not use any BatchNormalization layers in the AE. But as a correctly implemented BatchNormalization should not affect the representational powers of a VAE network for fundamental reasons, this should not influence the comparison of the final z-point distribution in a significant way.

I performed an AE training run for 170,000 CelebA training images over 24 epochs. The latent space has a dimension of z_dim=256. (This relatively low number of dimensions will make it easier for a VAE to confine z-points around the origin; see the discussion in previous posts.)
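For orientation, here is a minimal Keras sketch of what such an AE Encoder can look like. It is a sketch under assumptions, not my class code: the filter sequence (32, 64, 128, 256), (3,3)-kernels and stride 2 follow the values quoted for the VAE later in this series, while activation and padding choices are assumptions.

# Hedged sketch of an AE Encoder: four Conv2D layers, a Flatten layer and a
# single Dense output layer of size z_dim. Parameter values are assumptions.
from tensorflow.keras import layers, Model

z_dim  = 256
inputs = layers.Input(shape=(96, 96, 3))
x = inputs
for n_filters in (32, 64, 128, 256):
    x = layers.Conv2D(n_filters, kernel_size=3, strides=2,
                      padding='same', activation='relu')(x)
x = layers.Flatten()(x)
z = layers.Dense(z_dim, name='z')(x)      # an AE has just this one output layer
encoder = Model(inputs, z, name='AE_Encoder')
encoder.summary()

# A VAE Encoder would instead feed the Flatten output into two parallel Dense
# layers "mu" and "log_var" plus a sampling layer (see the code further below).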

The resulting total loss of our AE came out at ca. 0.49 per pixel. This translates into a total value of

AE total loss on Celeb A after 24 epochs (for a step size of 0.0005): 4515

This value results from a summation over all geometric pixels of our CelebA images, which were downsized to 96×96 px (see post IX): 0.49 * 96 * 96 ≈ 4515. The given value can be compared to results measured by our GradientTape()-based VAE-model, which delivers integrated values and not averages per pixel.

This value is significantly smaller than the values we would get for the total loss of a VAE with a reasonably big KL-loss contribution in the order of some percent of the reconstruction loss. Such a VAE produces values around 4800 up to 5000. Apparently, an AE’s Decoder reconstructs originals much better than a VAE with a significant KL-loss contribution to the total loss.

But what about a VAE with a very small KL-loss? You will get the answer in a minute.

Where does a standard Autoencoder [AE] place the z-points for CelebA data?

We can not directly plot a data point distribution in a 256-dimensional vector-space. But we can look at the data point density variation with a calculated distance from the origin of the latent space.

The distance R from the origin to the z-point for each image can be measured in terms of an L2 (= Euclidean) norm of the latent vector space. Afterward it is easy to determine the number of images within consecutive radius intervals of e.g. length 0.5 between radii

0  <  R  <  35 .

We perform the following steps to get the respective numbers. We let the Encoder of our trained AE predict the z-points of all 170,000 training images:

z_points  = AE.encoder.predict(data_flow) 

data_flow was created by a Keras ImageDataGenerator to send batches of training data to the GPU (see the previous posts).

Radius values are then calculated as

print("NUM_Images_Train = ", NUM_IMAGES_TRAIN)
ay_rad_z = np.zeros((NUM_IMAGES_TRAIN,),  dtype='float32')
for i in range(0, NUM_IMAGES_TRAIN):
    sq = np.square(z_points[i]) 
    sqrt_sum_sq = math.sqrt(sq.sum())
    ay_rad_z[i] = sqrt_sum_sq

The numbers vs. radius relation then results from:

# count the z-points in consecutive radius intervals of width 0.5
li_rad      = []
li_num_rad  = []
int_width = 0.5
for i in range(0,70):
    low   = int_width * i
    high  = int_width * (i+1) 
    num   = np.count_nonzero( (ay_rad_z >= low) & (ay_rad_z < high ) )
    li_rad.append(0.5 * (low + high))
    li_num_rad.append(num)
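For completeness, a minimal matplotlib sketch of how the resulting numbers-vs-radius curve can be plotted from the two lists (the plot styling is, of course, arbitrary):

import matplotlib.pyplot as plt

# plot the number of z-points per radius interval against the interval's mid radius
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(li_rad, li_num_rad, 'o-')
ax.set_xlabel("radius R (L2-norm of the z-point vector)")
ax.set_ylabel("number of z-points per interval of width 0.5")
plt.show()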

The resulting curve is shown below:

There seems to be a peak around R = 16.75. So, yet another question arises:

> What is so special about the radius values of 16 or 17?

We shall return to this point in the next post. For now we take this result as god-given.

Clustering of CelebA z-point data in the AE’s latent space?

Another interesting question is: Do we get some clustering in the latent space? Will there be a difference between an AE and a VAE?

A standard method to find an indication of clustering is to look for an elbow in the so called “inertia” curve for different assumed numbers of clusters. Below you find an inertia plot retrieved from the z-point data with the help of MiniBatchKMeans.
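The inertia data themselves can be gathered by a simple loop over MiniBatchKMeans fits. The following is a hedged sketch only; parameters like batch_size and n_init are assumptions and not necessarily the values I used:

from sklearn.cluster import MiniBatchKMeans
import matplotlib.pyplot as plt

li_num_clus = list(range(2, 81, 2))     # every second value of num_clus
li_inertia  = []
for num_clus in li_num_clus:
    kmeans = MiniBatchKMeans(n_clusters=num_clus, batch_size=1024,
                             n_init=3, random_state=42)
    kmeans.fit(z_points)                # z_points: (170000, 256) array from the Encoder
    li_inertia.append(kmeans.inertia_)

plt.plot(li_num_clus, li_inertia, 'o-')
plt.xlabel("number of clusters num_clus")
plt.ylabel("inertia")
plt.show()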

This result was achieved for data taken at every second value of the number of clusters “num_clus” in the range 2 ≤ num_clus ≤ 80. Unfortunately, the result does not show a pronounced elbow. Instead the variation at some special cluster numbers is relatively high. But if we absolutely wanted to define a value, then something between 38 and 42 appears to be reasonable. Up to that point the decline in inertia is relatively smooth. But do not let yourself be misguided – the data depend on statistics and initial cluster values. When you switch to a different calculation run you may get something like the following plot with more pronounced spikes:

This is always a sign that the clustering is not very clear – and that the clusters do not have a significant distance from each other, at least not in all coordinate directions. Filamental structures will not be honored well by KMeans.

Nevertheless: A value of 40 is reasonable as we have 40 labels coming with the CelebA data. I.e. 40 basic features in the face images are considered to be significant and were labeled by the creators of the dataset.

t-SNE projections

We can also have a look at a 2-dimensional t-SNE-projection of the z-point distribution. The plots below have been produced with different settings for early exaggeration and perplexity parameters. The first plot resulted from standard parameter values for sklearn’s t-SNE variant.

tsne = TSNE(n_components=2, early_exaggeration=12, perplexity=30, n_iter=1000)

Other plots were produced by the following setting:

tsne = TSNE(n_components=2, early_exaggeration=16, perplexity=10, n_iter=1000)

Below you find some plots of a t-SNE-analysis for different numbers of z-points and differently adjusted parameters for the resulting scatter plot. The number of statistically chosen z-points varies between 20,000 and 140,000.
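The projections themselves can be computed for a random subsample of the z-points. A minimal sketch (the subsampling and the scatter-plot details are assumptions, not my exact plot code):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

num_sample = 20000                      # varied: 20,000 / 80,000 / 140,000
idx        = np.random.choice(len(z_points), size=num_sample, replace=False)
z_sample   = z_points[idx]

tsne   = TSNE(n_components=2, early_exaggeration=16, perplexity=10, n_iter=1000)
z_2dim = tsne.fit_transform(z_sample)   # shape (num_sample, 2)

plt.figure(figsize=(8, 8))
plt.scatter(z_2dim[:, 0], z_2dim[:, 1], s=0.5, alpha=0.5)
plt.show()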

Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)

Actually, we see some indication of clustering, though it is not very pronounced. The clusters in the projection are not separated by clear and broad gaps. Of course a 2-dimensional projection cannot completely visualize the original separations in a 256-dim space. However, we get the impression that clusters are located rather close to each other. Remember: We already know that almost all points are located in a multidimensional sphere shell between 12 < R < 24, and more than 50% between 14 ≤ R ≤ 19.

However, what the actual distribution of meaningful z-points (in the sense of a recognizable face reconstruction) really looks like cannot be deduced from the above t-SNE analysis. The concentration of the z-points may still be one which follows thin and maybe curved filaments in some directions of the multidimensional latent space on relatively small or varying scales. We shall get a much clearer picture of the fragmentation of the z-point distribution in an AE’s latent space in the next post of this series.

Number of statistical z-points: 80,000

For the higher number of selected z-points the room between some concentration centers appears to be filled in the projection. But remember: This may only be due to projection effects in the presently chosen coordinate system. Another calculation with the non-standard values for perplexity and early_exaggeration given above gives us:

Number of statistical z-points: 140,000

Note that some islands appear. Obviously, there is at least some clustering going on. However, due to projection effects we cannot deduce much about the real structure of the point distribution between possible clusters. Even the clustering itself could appear due to the overlap of two or more broader filaments along a projection line.

Whether correlations would get more pronounced and therefore could also be better handled by t-SNE in a rotated coordinate system based on a PCA-analysis remains to be seen. The next post will give an answer.

At least we have got a clear impression about the radial distribution of the z-points. And thereby gathered some data which we can compare to corresponding results of a VAE.

Total loss of a VAE with a tiny KL-loss for CelebA data

Our test VAE is parameterized to create only a very small KL-loss contribution to the total loss. With the Python classes we have developed in the course of this post series we can control the ratio between the KL-loss and a standard reconstruction loss such as BCE (binary-crossentropy) by a parameter “fact“.

For BCE

fact = 1.0e-5

is a very small value. For a working VAE we would normally choose something like fact=5 (see post XI).
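To make the role of fact concrete: The following is a schematic GradientTape()-based train step, not the actual code of my Python classes. It assumes an Encoder returning mu, log_var and z, and a BCE reconstruction loss summed over all pixels and averaged over the batch, which reproduces the loss scale quoted in this post.

import tensorflow as tf

def train_step(x, encoder, decoder, optimizer, fact=1.0e-5):
    # one training step of a VAE whose total loss is reco_loss + fact * KL
    with tf.GradientTape() as tape:
        mu, log_var, z = encoder(x, training=True)
        x_reco = decoder(z, training=True)
        # BCE summed over the 96x96 pixels, averaged over the batch (order of 4500 for CelebA)
        reco_loss = tf.reduce_mean(
            tf.reduce_sum(tf.keras.losses.binary_crossentropy(x, x_reco), axis=[1, 2])
        )
        # KL-loss for a standard normal prior, scaled down by the tiny factor "fact"
        kl_loss = fact * tf.reduce_mean(
            -0.5 * tf.reduce_sum(1. + log_var - tf.square(mu) - tf.exp(log_var), axis=1)
        )
        total_loss = reco_loss + kl_loss
    trainable = encoder.trainable_weights + decoder.trainable_weights
    grads = tape.gradient(total_loss, trainable)
    optimizer.apply_gradients(zip(grads, trainable))
    return total_loss, reco_loss, kl_loss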

A value like 1.0e-5 ensures a KL loss around 0.0178 compared to a reconstruction loss of 4550, which gives us a ratio below 4.e-6. Now, what is a VAE going to do, when the KL-loss is so small?

For the total loss the last epochs produced the following values:

VAE total loss on Celeb A after 24 epochs for a step size of 0.0005: 4,553

Output of the last 6 of 24 epochs.

Epoch 1/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4557.1694 - reco_loss: 4557.1523 - kl_loss: 0.0179
Epoch 2/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.9111 - reco_loss: 4556.8940 - kl_loss: 0.0179
Epoch 3/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.6626 - reco_loss: 4556.6450 - kl_loss: 0.0179
Epoch 4/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.3862 - reco_loss: 4556.3682 - kl_loss: 0.0179
Epoch 5/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4555.9595 - reco_loss: 4555.9395 - kl_loss: 0.0179
Epoch 6/6
1329/1329 [==============================] - 118s 89ms/step - total_loss: 4555.6641 - reco_loss: 4555.6426 - kl_loss: 0.0178

This is not too far away from the value of our AE. Other training runs confirmed this result. On four different runs the total loss value came to lie between

VAE total loss on Celeb A after 24 epochs: 4553 ≤ loss ≤ 4555 .

VAE with tiny KL-loss – z-point density distribution vs. radius

Below you find the plot for the variation of the number density of z-points vs. radius for our VAE:

Again, we get a maximum close to R = rad = 16. The maximum value lies a bit below the one found for a KL-loss-free AE. But all in all, the form and width of the VAE’s distribution are very comparable to those of our test AE.

Can this result be reproduced?
Unfortunately, not in 100% of the test runs performed. There are two main reasons:

  1. Firstly, we can not be sure that a second minimum does not exist for a distribution of points at bigger radii. This may be the case both for the AE and the VAE!
  2. Secondly, we have a major factor of statistical fluctuation in our game:
    The epsilon value which scales the logvar-contribution to the z-point coordinates in the sampling layer of the Encoder may in very seldom cases abruptly jump to an unreasonably high value. A Gaussian is involved in the calculation of z-points by our VAE, and a Gaussian covers extreme values, although the chances to produce such a value are pretty small.

Remember that the z-point coordinates are calculated via the mu and logvar tensors according to

 
z = mu + B.exp(log_var / 2.) * epsilon

See Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics for respective code elements of the Encoder.

So, a lot depends on epsilon which is calculated as a statistically fluctuating quantity, namely as

epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)

Is there a chance that the training process may sometimes drive the system to another corner of the weight-loss configuration space due to abrupt fluctuations? With the result for the z-point distribution vs. radius that it may significantly deviate from a distribution around R = 16? I think: Yes, this is possible!

From some other training runs I actually have an indication that there is a second minimum of the cost hyperplane with similar properties for higher average radius-values, namely for a distribution with an average radius at R ≈ 19.75. I got there after changing the initialization of the weights a bit.

This is another indication that the cost surface has a relatively rough structure and that extreme fluctuations of epsilon and a resulting gradient fluctuation can drive the position of the network in the weight configuration space to some strange corners. The weight values there can result in different z-point distributions at higher average radii. This actually happened during yet another training run: At epoch 22 the Adam optimizer suddenly directed the whole system to weight values resulting in a maximum of the density distribution at R = 66! This appeared totally crazy. At the same time the KL-loss also jumped to a much higher value.

When I afterward repeated the run from epoch 18 this did not happen again. Therefore, a statistical fluctuation must have been the reason for the described event. Such an erratic behavior can only be explained by sudden and extreme changes of z-point data enforcing a substantial change in size and direction of the loss gradient. And epsilon is a plausible candidate for this!

So far I had nothing in our Python classes which would limit the statistical variation of epsilon. The effects seen spoke for a code change such that we do not allow for extreme epsilon-values. I set limits in the respective part of the code for the sampling layer and its Lambda function:

        # The following function will be used by an eventual Lambda layer of the Encoder 
        def z_point_sampling(args):
            '''
            A point in the latent space is calculated statistically 
            around an optimized mu for each sample 
            '''
            mu, log_var = args # Note: These are 1D tensors !
            epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.) 
            # limit extreme statistical fluctuations: clip each epsilon component to [-5, 5]
            epsilon = B.clip(epsilon, -5., 5.)
            return mu + B.exp(log_var / 2.) * epsilon * self.enc_v_eps_factor

This stabilized everything. But even without these limitations, on average three out of four runs which I performed for the VAE ran into a cost minimum which was associated with a pronounced maximum of the z-point distribution around R ≈ 16. Below you see the plot for the fourth run:

So, there is some chance that the degrees of freedom associated with the logvar-layer and the statistical variation for epsilon may drive a VAE into other local minima or weight parameter ranges which do not lead to a z-point distribution around R = 16. But after the limitation of epsilon fluctuations all training runs found a loss minimum similar to the one of our simple AE – in the sense that it creates a z-point density distribution around R ≈ 16.

VAE with tiny KL-loss: Inertia and clustering of the CelebA data?

Our VAE gives the following variation of the inertia vs. the number of assumed clusters:

This also looks pretty similar to one of the plots shown for our AE above.

t-SNE for our VAE with a tiny KL loss

Below you find t-SNE plots for 20,000, 80,000 and 140,000 images:

Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)

This is quite similar to the related image for the AE. You just have to rotate it.

Number of statistical z-points: 80,000

Number of statistical z-points: 140,000

All in all we get very similar indications as from our AE that some clustering is going on.

VAE with tiny KL-loss: Should its logvar values become tiny, too?

Besides reproducing a similar z-point distribution with respect to radius values, is there another indication that a VAE behaves similar to an AE? What would be a clear sign that the similarity really exists on a deeper level of the layers and their weights?

The z-vector is calculated from the mu and logvar-vectors by:

z = mu + exp(logvar/2)*epsilon

with epsilon coming from a normal distribution. Please note that we are talking about vectors of size z_dim=256 per image.

If a VAE with a tiny KL-loss really becomes similar to an AE, it should define and set its z-points basically by using mu-values only, and not by logvar-values. I.e. the VAE should become intelligent enough to ignore the degrees of freedom associated with the logvar-layer, meaning that the z-point coordinates of a VAE with a very small KL-loss should in the end be almost identical to the mu-component values.

Ok, but to me it was not self-evident that a VAE during its training would learn

  • to produce significant mu-related weight-values, only,
  • and to keep the weight values for the connections to the logvar-layer so small that the logvar-impact on the z-space position gets negligible.

Before we speculate about reasons: Is there any evidence for a negligible logvar-contribution to the z-point coordinates or, equivalently, to the respective vector components?

A VAE with tiny KL-loss produces tiny logvar values …

To get some quantitative data on the logvar impact the following steps are appropriate (a numpy sketch follows after the list):

  1. Get the size and algebraic sign of the logvar-values. Negative values logvar < -3 would be optimal.
  2. Measure the deviation between the mu- and z-point vector components. There should only be a few components which show significant values abs(mu – z) > 0.05.
  3. Compare the radius values determined by the z-components with the radius values derived from the mu-components only, and measure the absolute and relative deviations. The relative deviation should be very small on average.
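The following numpy sketch indicates how such numbers can be derived, assuming that mu, logvar and z_points are arrays of shape (num_images, z_dim) predicted by the Encoder (the array names are assumptions):

import numpy as np

print("Maximum value for logvar: ", np.max(logvar))
print("Minimum value for logvar: ", np.min(logvar))
print("Average value for logvar: ", np.mean(logvar))

# component-wise deviation between the z-points and the mu-values
delta_mu_z = z_points - mu
print("Maximum abs(z - mu): ", np.max(np.abs(delta_mu_z)))
print("Components with abs(z - mu) > 0.05: ", np.count_nonzero(np.abs(delta_mu_z) > 0.05))

# radii from the z-components vs. radii from the mu-components only
rad_z  = np.linalg.norm(z_points, axis=1)
rad_mu = np.linalg.norm(mu, axis=1)
delta_rad = np.abs(rad_z - rad_mu)
print("avg_z: ", np.mean(rad_z), "  avg_mu: ", np.mean(rad_mu))
print("max absolute difference: ", np.max(delta_rad))
print("avg relative difference: ", np.mean(delta_rad / rad_z))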

Some values of logvar, (z – mu), z-radii and z-radius-deviations for a VAE with small KL-loss

Regarding the maximum value of the logvar’s vector-components I found

3.4 ≥ max(logvar) ≥ -3.2. # for 1 up to 3 components out of a total of 43.52 million components

The first value may appear to be big for a component. But typically there are only 2 (!) out of 170,000 x 256 = 43.52 million vector components in an interval of [-3, 5]. On the component level I found the following minimum, maximum and average values:

Maximum value for logvar:  -2.0205276
Minimum value for logvar:  -24.660698
Average value for logvar:  -13.682616

The average value of logvar is pretty pleasing: Such big negative values indeed render the logvar-impact on the position of our z-points negligible. So we should only find very small deviations of the mu-components from the z-point components. And, actually, the maximum of the deviation between a z_point component and a mu component was delta_mu_z = 0.26:

Maximum (z_points - mu) = delta_mu_z = 0.26  # on the component level 

There were only 5 out of the 43.52 million components which showed an absolute deviation in the interval

0.05 < abs(delta_mu_z) < 2. 

The rest was much, much smaller!

What about radius values? Here the situation looks, of course, even better:

max radius defined by z  :  33.10274
min radius defined by z  :  6.4961233
max radius defined by mu :  33.0972
min radius defined by mu :  6.494558

avg_z:      16.283989  
avg_mu:     16.283972

max absolute difference :  0.018045425 
avg absolute difference :  0.00094899256

max relative difference  :   0.00072147313
avg relative difference  :   6.1240215e-05

As expected, the relative deviations between z- and mu-based radius values became very small.

In another run (the one corresponding to the second density distribution curve above) I got the following values:

Maximum value for logvar:  3.3761873
Minimum value for logvar:  -22.777826
Average value for logvar:  -13.4265175

max radius z :  35.51387
min radius z :  7.209168
max radius mu :  35.515926
min radius mu :  7.2086616

avg_z:  17.37412
avg_mu: 17.374104

max delta rad relative :   0.012512478
avg delta rad relative :   6.5660715e-05

This tells us that the z-point distributions may vary a bit in their width, their precise center and their average values. But overall they appear to be similar, especially with respect to a negligible contribution of logvar-terms to the z-point position. The relative impact of logvar on the radius value of a z-point is only of the order of 6.e-5.

All the above data confirm that a trained VAE with a very small KL-loss primarily uses mu-values to set the position of its z-points. During training the VAE moves along a path to an overall minimum on the loss hyperplane which leads to an area with weights that produce negligible logvar values.

Explanation of the overall similarity of a VAE with tiny KL-loss to an AE

So far we can summarize: Under normal conditions the VAE’s behavior is pretty close to that of a similar AE. The VAE produces only small logvar values. z-point coordinates are extremely close to just the mu-coordinates.

Can we find a plausible reason for this result? Looking at the cost-hyperplane with respect to the Encoder weights helps:

The cost surface of a VAE spans across a space of many more weight parameters than a corresponding AE. The reason is that we have weights for the connection to the logvar-layer in addition to the weights for the mu-layer (or a single output layer as in a corresponding AE). But if we look at the corner of the weight-vector-space where the logvar-related values are pretty small, then we would at least find a local (if not global) loss minimum there for the same values of the mu-related weight parameters as in the corresponding AE (with mu replacing the z-output).

So our question reduces to the closely related question whether the old minimum of an AE remains at least a local one when we shift to a VAE – and this is indeed the case for the basic reason that the KL-contributions to the height of the cost-hyperplane are negligibly small everywhere (!) – even for higher logvar-related values.

This tells us that a gradient descent algorithm should indeed be able to find a cost minimum for very small values of the logvar-related weights and for mu-related weight values very close to the AE’s weight values for the direct connections to its output layer. And, of course, with all other weight parameters of the VAE-Decoder being close to the values of the weights of a corresponding AE. At least under the condition that all variable quantities really change smoothly during training.

Does a VAE with small KL-loss produce reasonable face images?

A last test to confirm that a VAE with a very small KL-loss operates like a comparable AE is a trial to create images with recognizable human faces from randomly chosen points in z-space. Such a trial should fail! I just show you three results – one for a normal distribution of the z-point components, and two for equidistant distributions of the component values within limited intervals:

z-point coordinates from normal distribution

z-point coordinates from equidistant distribution in [-2,2]

z-point coordinates from equidistant distribution in [-10,10]

This reminds us very much of the behavior of an AE. See: Autoencoders, latent space and the curse of high dimensionality – I.

The z-point distribution in latent space of a VAE with a very small KL-loss obviously is as complicated as that of an AE. Neighboring points of a z-point which leads to a good image produce chaotic images. The transition path from good z-points to other meaningful z-points is confined to a very small filament-like volume.

Conclusion

A trained VAE with only a tiny KL-loss contribution will under normal circumstances behave similarly to an AE with the same hidden (convolutional) layers. It may, however, be necessary to limit the statistical variation of the epsilon factor in the z-point calculation based on mu– and logvar-values.

The similarity is based on very small logvar-values after training. The VAE creates a z-point distribution which shows the same dependency on the radius as an AE. We see similar indications and patterns of clustering. And the VAE fails to produce human faces from random z-points in the latent space – just as a comparable AE does.

We have found a plausible reason for this similarity by comparing the minimum of an AE’s loss hyperplane in its weight space with a corresponding minimum in the bigger weight space of the VAE – at a position with small weights for the connections to the logvar layer.

The z-point density distribution shows a maximum at a radius between 16 and 17. The z-point distribution basically has a Gaussian form. In the next post we shall look a bit closer at these findings – and their origin in Gaussian distributions along the coordinate axes of the latent space. After an application of a PCA analysis we shall furthermore see that the z-point distribution in an AE’s latent vector space is indeed fragmented and shows filaments on certain length scales. A VAE with a tiny KL-loss will show the same fragmentation.

In further forthcoming posts we shall afterward investigate the confining and at the same time blurring impact of the KL-loss on the latent space, which will make it usable for creative purposes. But the next post

Variational Autoencoder with Tensorflow – XIV – Change of variational model parameters at inference time

will first show you how to change model parameters at inference time.

And let us all who praise freedom not forget:
The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!

 

Variational Autoencoder with Tensorflow – XI – image creation by a VAE trained on CelebA

I continue with my series on Variational Autoencoders [VAEs] and related methods to control the KL-loss.

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow – X – VAE application to CelebA images

VAEs fall into a section of ML which is called “Generative Deep Learning“. The reason is that we can use VAEs to create images which contain objects with features learned from training images. One interesting category of such objects are human faces – of different color, with individual expressions, features and hairstyles, seen from different perspectives. One dataset which contains such images is the CelebA dataset.

During the last posts we came so far that we could train a CNN-based Variational Autoencoder [VAE] with images of the CelebA dataset. Even on graphics cards with low VRAM. Our VAE was equipped with a GradientTape()-based method for KL-loss control. We still have to prove that this method works in the expected way:

The distribution of data points (z-points) created by the VAE’s Encoder for training input should be confined to a region around the origin in the latent space (z-space). And neighboring z-points up to a limited distance should result in similar output of the Decoder.

Therefore, we have to look a bit deeper into the results of some VAE-experiments with the CelebA dataset. I have already pointed out why creating rather complex images from arbitrarily chosen points in the latent space is a suitable and good test for a VAE. Please remember that our efforts regarding the KL-loss have to do with the following fact:

A standard Autoencoder can in general not create reasonable images/objects from arbitrarily chosen z-points in the latent space.

This eliminates the use of an AE for creative purposes. A VAE, however, should be able to solve this type of task – at least for z-points in a limited neighborhood of the latent space’s origin. Thus, by creating images from randomly selected z-points with the Decoder of a VAE which has been trained on the CelebA dataset, we cover two points:

  • Test 1: We test the functionality of the VAE-class, which we have developed and which includes the code for KL-loss handling via TF2’s GradientTape() and Keras’ train_step().
  • Test 2: We test the ability of the VAE’s Decoder to create images with convincing human-like face and hairstyle features from random z-points within an area close to the origin of the latent space.

Most of the experiments discussed below follow the same prescription: We take our trained VAE, select some random points in the latent space, feed the z-point-data into the VAE’s Decoder for a prediction and plot the images created on the Decoder’s output side. The Encoder only plays a role when we want to test reconstruction abilities.

For a low dimension z_dim=256 of the latent space we will find that the generated images display human faces reasonably well. But the images appear a bit blurry or unsharp – as if not fully focused. So, we need to discuss what we can do about this point. I will also name some plausible causes for the loss of accuracy in the representation of details.

Afterwards I want to show you that a VAE Decoder reconstructs original images relatively badly from the z-points calculated by the Encoder. At least when one looks at details. A simple AE with a sufficiently high dimension of the latent space performs much better. One may feel disappointed about the reconstruction ability of a VAE. But actually it is the ability of a VAE to forget about details and instead to focus on general features which enables it (the VAE) to create something meaningful from randomly chosen z-points in the latent space.

In a last step in this post we are going to look at images created from z-points with a growing distance from the origin of the multidimensional latent space [z-space]. (Distance can be defined by an L2-Euclidean norm.) We will see that most z-points which have some z-coordinates above a value of 3 produce fancy images where the face structures get dominated or distorted by background structures learned from CelebA images. This effect was to be expected as the KL-loss enforced a distribution of the z-points which is confined to a region relatively close to the origin. Ideally, this distribution would be characterized by a normal distribution in all coordinates with a sigma of only 1. So, the fact that z-points in the vicinity of the origin of the latent space lead to a construction of images which show recognizable human faces is an indirect proof of the confining impact of the KL-loss on the z-point distribution. In another post I shall deliver data which prove this more directly.

Below I will call the latent space of a (V)AE also z-space.

Characteristics of the VAE tested

Our trained VAE with four Conv2D-layers in the Encoder and 4 corresponding Conv2DTranspose-Layers in the Decoder has the following basic characteristics:

(Encoder-CNN-) filters=(32,64,128,256), kernels=(3,3), stride=2,
reconstruction loss = BCE (binary crossentropy), fact=5.0, z_dim=256

The order of the filter- (= map-) numbers is, of course, reversed for the Decoder. The factor fact to scale the KL-loss in comparison to the reconstruction loss was chosen to be fact=5, which led to a 3% contribution of the KL-loss to the total loss during training. The VAE was trained on 170,000 CelebA images over 24 epochs with a small step size (epsilon=0.0005) for the Adam optimizer.

When you perform similar experiments on your own you may notice that the total loss values after around 24 epochs ( > 5015) are significantly higher than those of comparable experiments with a standard AE (4850). This already is an indication that our VAE will not reproduce a similarly good match between an image reconstructed by the Decoder and the original input image fed into the Encoder.

Results for z-points with coordinates taken from a normal distribution around the origin of the latent space

The picture below shows some examples of generated face-images coming from randomly chosen z-points in the vicinity of the z-space’s origin. To calculate the coordinates of such z-points I applied a normal distribution:

z_points = np.random.normal(size = (n_to_show, z_dim)) # n_to_show = 28
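These z-points are then sent through the trained Decoder and the resulting images get plotted in a grid. A minimal sketch (the attribute name vae.decoder and the grid layout are assumptions, not my exact plot code):

import matplotlib.pyplot as plt

reconst_new = vae.decoder.predict(z_points)      # shape (n_to_show, 96, 96, 3)

fig, axes = plt.subplots(4, 7, figsize=(14, 8))  # 28 images in a 4x7 grid
for ax, img in zip(axes.flat, reconst_new):
    ax.imshow(img)
    ax.axis('off')
plt.show()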

So, what do the results for z_dim=256 look like?

Ok, we get reasonable images of human-like faces. The variations in perspective, face forms and hairstyles are also clearly visible and reflect part of the related variety in the training set. You will find more variations in more images below. So, we take this result as a success! In contrast to a pure AE we DO get something from random z-points which we clearly can interpret as human faces. The whole effort of confining z-points around the origin and at the same time of smearing out z-points with similar content over a region instead of a fixed point-mapping (as in an AE) has paid off. See for comparison:
Autoencoders, latent space and the curse of high dimensionality – I

Unfortunately, the images and their details appear a bit blurry and not very sharp. Personally, this reminded me of the times when the first CCD-chips with relatively low resolution were introduced in cameras and the raw image data looked disappointing as long as we did not apply some sharpening filters. The basic information to enhance details was there, but it had to be used explicitly to improve the plain raw data of the CCD.

The quality in details is about the same as what we see in example images in the book of D.Foster on “Generative Deep Learning”, 2019, O’Reilly. Despite the fact that Foster used a slightly higher resolution of the input images (128x128x3 pixels). The higher input resolution there also led to a higher resolution of the maps of the innermost convolutional layer. Regarding quality see also the images presented in:
https://datagen.tech/guides/image-datasets/celeba/

Enhancement processing of the images ?

Just for fun, I took a screenshot of my result, saved it and applied two different sharpening filters from the ShowFoto program:

Much better! And we do not have the impression that we added some fake information to the images by our post-processing ….

Now I already hear arguments saying that such an enhancement should not be done. Frankly, I do not see any reason against post-processing of images created by a VAE-algorithm.

Remember: This is NOT about reproduction quality with respect to originals or a close-to-reality show. This is about generating new images of human-like faces based on basic features which a VAE-algorithm hopefully has learned from training images. All of what we do with a VAE is creative. And it also comes close to a proof that ML-algorithms based on convolutional layers really can “learn” something about the basic features of objects presented to them. (The learned features are e.g. in the Encoder’s case saved in the sensitivity of the convolutional maps to typical patterns in the input images.)

And as in the case of raw-images of CCD or CMOS camera chips: Sometimes some post-processing is required to utilize the information optimally for sharpness.

Sharpening by PIL’s enhancement functionality

Of course we do not want to produce images in a ML run, take screenshots and sharpen each image individually. We need some tool that fits into the ML process pipeline. The good old PIL library for Python offers sharpening as one of multiple enhancement options for images. The next examples are results from the application of a PIL enhancement procedure:

These images look quite OK, too. The basic code fragment I used for each individual image in the above grid:

    # reconst_new is the output from my VAE's Decoder 
    # imports needed for this fragment: import numpy as np; from PIL import Image, ImageEnhance
    ay_img      = reconst_new[i, :,:,:] * 255
    ay_img      = np.asarray(ay_img, dtype="uint8" )
    img_orig    = Image.fromarray(ay_img)
    img_shr_obj = ImageEnhance.Sharpness(img_orig)   # sharpen the PIL image created above
    sh_factor   = 7   # specified factor for enhancing sharpness
    img_sh      = img_shr_obj.enhance(sh_factor)

The sharpening factor I chose was quite high, namely sh_factor = 7.

The effect of PIL’s sharpening factor

Just to further demonstrate the effect of different factors for sharpening by PIL you find some examples below for sh_factor = 0, 3, 6.

sh_factor = 0

sh_factor = 3

sh_factor = 6

Obviously, the enhancement is important to get clearer and sharper images.
However, when you enlarge the images sufficiently you see some artifacts in the form of crossing lines. These artifacts partially exist already in the Decoder’s output, but they are enhanced by the sharpening mechanism used by PIL (unsharp masking). The artifacts become more pronounced with a growing sh_factor.
Hint: According to ML-literature the use of Upsampling layers instead of Conv2DTranspose layers in the Decoder may reduce such artifacts a bit. I have not yet tried it myself.

Assessment

How do we assess the point of relatively unclear, unsharp images produced by our VAE? What are plausible reasons for the loss of details?

  1. Firstly, already AEs with a latent space dimension z_dim=256 in general do not reconstruct brilliant images from z-points in the latent space. To get a good reconstruction quality even from an AE, which does nothing else than compress and reconstruct images of size (96x96x3), z_dim-values > 1000 are required in my experience. More about this in another post in the future.
  2. A second important aspect is the following: Enforcing a compact distribution of similar images in the latent space via the KL-loss automatically introduces a loss of detail information. The KL-loss is designed to lead to a smear-out effect in z-space. Only basic concepts and features will be kept by the VAE to ensure a similarity of neighboring images. Details will be omitted and “smoothed” out. This has consequences also with respect to sharpness of detail structures. A detail as an eyebrow in a face is to be considered as an average of similar details found for images in the same region of the z-space. This alone brings some loss of clarity with it.
  3. Thirdly, a simple (V)AE based on some directly connected Conv2D-layers has limited capabilities in general. The reason is that we systematically reduce resolution whilst information is propagated from one Conv2D layer to the next neighboring one. Remember that we use a stride ≥ 2 or pooling layers to cover filters on larger image scales. Due to this information processing a convolutional network automatically suppresses details in its inner layers – their resolution shrinks with growing distance from the input layer. In later posts of this blog we shall see that using ResNets instead of CNNs in the Encoder and Decoder already helps a bit regarding the reconstruction of clearer images. The correlation between details and large scale information is better kept up there than in CNNs.

Regarding the first point one may think of increasing z_dim. This may not be the best idea. It contradicts the whole idea of a VAE which at its core is a reduction of the degrees of freedom for z-points. For a higher dimensional space we may have to raise the ratio of KL-loss to reconstruction loss even further.

Regarding the third point: Of course it would also help to increase kernel sizes for the first two Conv2D layers and the number of maps there. A higher resolution of the input images would also be of advantage. Both methods may, however, conflict with your VRAM or GPU time limits.

If the second point were true then a reduction of fact in our models, which controls the ratio of the KL-loss to the reconstruction loss, would lead to a better image quality. In this case we are doomed to find an optimal value for fact – satisfying both the need for generalization and clarity of details in our images. You cannot have both … here we see a basic problem related to VAEs and the creation of realistic images. Actually, I tried this out – the effect is there, but the gain actually is not worth the effort. And for too small values of fact we eventually lose the ability to create reasonable images from arbitrary z-points at all.

All in all post-processing appears to be a simple and effective method to get images with somewhat sharper details.
Hint: If you want to create images of artificially generated faces with a really high quality, you have to turn to GANs.

Further examples – with PIL sharpening

In this example you see that not all points give you good images of faces. The z-point of the middle image in the second to last row of the first illustration below has a relatively high distance from the origin. The higher the distance from the origin in z-space, the weirder the images get. We shall see this below in a more systematic way.

Reconstruction quality of a VAE vs. an AE – or the “female” side of myself

If I were not afraid of copyright and personal rights aspects of using CelebA images directly, I could now show you a comparison of the reconstruction ability of an AE with that of a VAE. You find such a comparison, though a limited one, by looking at some images in the book of D. Foster.

To avoid any problems I just tried to work with an image of myself. Which really gave me a funny result.

A plain Autoencoder with

  • an extended latent space dimension of z_dim = 1600,
  • a reasonable convolutional filter sequence of (64, 64, 128, 128)
  • a stride value of stride=2
  • and kernels ((5,5),(5,5),(3,3),(3,3))

is well able to reproduce many detailed features of one’s face after a training on 80,000 CelebA images. Below you see the result for an image of myself after 24 training epochs of such an AE:

The left image is the original, the right one the reconstruction. The latter is not perfect, but many details have been reproduced. Please note that the trained AE never had seen an image of myself before. For biometric analysis the reproduction would probably be sufficient.

Ok, so much about an AE and a latent space with a relatively high dimension. But what does a VAE think of me?
With fact = 5.0, filters like (32,64,128,256), (3,3)-kernels, z_dim=256 and after 18 epochs with 170,000 training images of CelebA my image really got a good cure:

My wife just laughed and said: Well, now in the age of 64 at least an AI has found something soft and female in you … Well, had the CelebA included many faces of heavy metal figures the results would have looked differently. I bet …

So with generative VAEs we obviously pay a price: Details are neglected in favor of very general face features and hairstyle aspects. And we lose sharpness. Which is good if you have wrinkles. Good for me and the celebrities, too. 🙂

However, I recommend anybody who wants to study VAEs to check the reproduction quality for CelebA test images (not from the training set). You will see the generalization effect for a broader range of images. And, of course, a better reproduction with smaller values for the ratio of the KL-loss to the reconstruction loss. However, for too small values of fact you will not be able to create realistic face images at all from arbitrary z-points – even if you choose them to be relatively close to the origin of the latent space.

Dependency of the creation of reasonable images on the distance from the origin

In another post in this blog I have discussed why we need VAEs at all if we want to reconstruct reasonable face images from randomly picked points in the latent space. See:
Autoencoders, latent space and the curse of high dimensionality – I

I think the reader is meanwhile convinced that VAEs do a reasonably good job to create images from randomly chosen z-points. But all of the above images were taken from z-points calculated with the help of a function assuming a normal distribution in the z-space coordinates. The width of the resulting distribution around the origin is of course rather limited. Most points lie within a 3 sigma distance around the origin. This is OK as we have put a lot of effort into the KL-loss to force the z-points to approach such a normal distribution around the origin of the latent space.

But what happens if and when we increase the distance of our random z-points from the origin? An easy way to investigate this is to create the z-points with a function that chooses the coordinates randomly, but equally distributed in an interval [-limit, limit]. The chance that at least one of the coordinates gets a high value is then rather big. This in turn ensures relatively high radius values (in terms of an L2-distance norm).

Below you find the results for z-points created by the function random.uniform:

r_limit = 1.5
l_limit = -r_limit
znew = np.random.uniform(l_limit, r_limit, size = (n_to_show, z_dim))

r_limit is varied as indicated:

r_limit = 0.5

r_limit = 1.0

r_limit = 1.5

r_limit = 2.0

r_limit = 2.5

r_limit = 3.0

r_limit = 3.5

r_limit = 5.0

r_limit = 8.0

Well, this proves that we get reasonable images only up to a certain distance from the origin – and only in certain areas or pockets of the z-space at higher radii.

Another notable aspect is the fact that the background variations are completely smoothed out at low distances from the origin. But they get dominant in the outer regions of the z-space. This is consistent with the fact that we need more information to distinguish various background shapes, forms and colors than to distinguish basic face patterns. Note also that the faces appear relatively homogeneous for r_limit = 0.5. The farther we move away from the origin, the larger the volumes become which have to cover and distinguish certain features of the training images.

Conclusion

Our VAE with the GradientTape()-mechanism for the control of the KL-loss seems to do its job. In contrast to a pure AE the smear-out effect of the KL-loss now allows for the creation of images with interpretable contents from arbitrary z-points via the VAE’s Decoder – as long as the selected z-points are not too far away from the z-space’s origin. Thus, by indirect evidence we can conclude that the z-points for training images of the CelebA dataset were distributed and at the same time confined around the origin. The strongest indication came from the last series of images. But we pay a price: The reconstruction abilities of a VAE are far below those of AEs. A relatively low number of dimensions of the latent space helps with an effective confinement of the z-points. But it leads to a significant loss in detail sharpness of the generated images, too. However, part of this effect can be compensated by the application of standard procedures for image enhancement.

In the next post
Variational Autoencoder with Tensorflow – XII – save some VRAM by an extra Dense layer in the Encoder
I will discuss a simple trick to reduce the VRAM consumption of the Encoder. In a further post we shall then analyze the confinement of the z-point distribution with the help of more explicit data.

And let us all who praise freedom not forget:
The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. An aggressor who orders the bombardment of civilian infrastructure, civilian buildings, schools and hospitals with drones bought from other anti-democrats and women oppressors puts himself in the darkest and most rotten corner of human history.

 

Autoencoders, latent space and the curse of high dimensionality – I

Recently, I had to give a presentation about standard Autoencoders (AEs) and related use cases. Whilst preparing examples I stumbled across a well-known problem: The AE solved tasks such as reconstructing faces hidden in extremely noisy or leaky input images perfectly. But the reconstruction of human faces from arbitrarily chosen points in the so called “latent space” of a standard Autoencoder did not work well.

In this series of posts I want to discuss this problem a bit as it illustrates why we need Variational Autoencoders for a systematic creation of faces with varying features from points and clusters in the latent space. But the problem also raises some fundamental and interesting questions

  • about a certain “blindness” of neural networks during training in general, and
  • about the way we save or conserve the knowledge which a neural network has gained about patterns in input data during training.

This post requires experience with the architecture and principles of Autoencoders.

Note, 02/14/2023: I have revised and edited this post to get consistent with new insights from extended experiments with AEs and VAEs.

Standard tasks for conventional Autoencoders

For preparing my talk I worked with relatively simple Autoencoders. I used Convolutional Neural Networks [CNNs] with just 4 convolutional layers to create the Encoder and Decoder parts of the Autoencoder. As typical applications I chose the following:

  • Effective image compression and reconstruction by using a latent space of relatively low dimensionality. The trained AEs were able to compress input images into latent vectors with only a few components and to reconstruct the original image from this compressed format.
  • Denoising of images where the original data were obscured by the superposition of statistical noise and/or statistically dropped pixels. (This is my favorite task for AEs which they solve astonishingly well.)
  • Recolorization of images: The trained AE in this case transforms images with only gray pixels into colorful images.

Such challenges for AEs are discussed in standard ML literature. In a first approach I applied my Autoencoders to the usual MNIST and Fashion MNIST datasets. For the task of recolorization I used the Cifar 10 dataset. But a bit later I turned to the Celeb A dataset with images of celebrity faces. Just to make all of the tasks a bit more challenging.

Standard Autoencoders and low dimensions of the latent space for (Fashion) MNIST and Cifar10 data

My Autoencoders excelled in all the tasks named above – for MNIST and Celeb A and, regarding recolorization, for CIFAR 10.

Regarding MNIST and Fashion MNIST, 4-layer CNNs for the Encoder and Decoder are almost an overkill. For MNIST the dimension z_dim of the latent space can be chosen to be pretty small:

z_dim = 12 gives a really good reconstruction quality of (test) images compressed to minimum information in the latent space. z_dim = 4 still gave an acceptable quality and even with z_dim = 2 most of the test images were reconstructed well enough. The same was true for the reconstruction of images superimposed with heavy statistical noise – such that the human eye could no longer guess the original information. For Fashion MNIST a dimension number 20 < z_dim < 40 gave good results. Also for recolorization the results were very plausible. I shall present these results in other blog posts in the future.

Face reconstructions of (noisy) Celeb A images require a relatively high dimension of the latent space

Then I turned to the Celeb A dataset. By the way: I got interested in Celeb A when reading the books of David Foster on "Generative Deep Learning" and of Tariq Rashid, "Make Your First GAN with PyTorch" (see the complete references in the last section of this post).

The Celeb A dataset contains images of around 200,000 faces with varying contours, hairdos and very different, inhomogeneous backgrounds. And the faces are displayed from very different viewing angles.

For a good performance of image reconstruction in all of the named use cases one needs to raise the number of dimensions of the latent space significantly. Instead of 12 dimensions of the latent space as for MNIST we now talk about 200 up to 1200 dimensions for CELEB A – depending on the task the AE gets trained for and, of course, on the quality expectations. For reconstruction of normal images and for the reconstruction of clear images from noisy input images higher numbers of dimensions z_dim ≥ 512 gave visibly better results.

Actually, I was surprised by the impressive reconstruction quality for test images of faces which were almost totally obscured by superimposed statistical noise or a statistical removal of pixels – after a self-supervised training on around 100,000 images. (Totalitarian states and security agencies certainly are happy about the superb face reconstruction capabilities of even simple AEs.) Part of the explanation, of course, is that 20% un-obscured or un-blurred pixels out of 30,000 pixels still means 6,000 clear pixels. Obviously enough for the AE to choose the right pattern superposition to compose a plausible clear image.

Note that we are not talking about overfitting here – the Autoencoder handled test images, i.e. images which it had never seen before, very well. AEs based on CNNs just seem to extract and use patterns characteristic for faces extremely effectively.

But how is the target space of the Encoder, i.e. the latent space, filled for Celeb A data? Do all points in the latent space give us images with well recognizable faces in the end?

Face reconstruction after a training based on Celeb A images

To answer the last question I trained an AE with 100,000 images of Celeb A for the reconstruction task named above. The dimension of the latent space was chosen to be z_dim = 200 for the results presented below. (Actually, I used a VAE whose KL loss was scaled down by a factor of 1.e-6 relative to the standard Binary Cross-Entropy loss for reconstruction – to get at least a minimum confinement of the z-points in the latent space. But the results are basically similar to those of a pure AE.)

My somewhat reworked and centered Celeb A images had a resolution of 96×96 pixels with 3 color channels. So the original feature space had 96x96x3 = 27,648 dimensions (almost 30,000). The challenge was to reproduce the original images from latent data points created from test images presented to the Encoder. To be more precise:

After a certain number of training epochs we feed the Encoder (with fixed weights) with test images the AE has never seen before. Then we get the components of the vectors from the origin to the resulting points in the latent space (z-points). After feeding these data into the Decoder we expect the reproduction of images close to the test input images.
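Expressed as a minimal Keras sketch (my own illustration – "encoder", "decoder" and "test_imgs" are assumed names for the trained sub-models and a batch of unseen, normalized 96×96×3 images, not the original code):

import numpy as np

# encoder, decoder : the trained Keras sub-models of the (V)AE
# test_imgs        : array of shape (n, 96, 96, 3) with pixel values scaled to [0, 1]

# Forward pass through the Encoder => z-points in the latent space
# (a multi-output Encoder returns a list; its first element contains the sampled z-points)
enc_out  = encoder.predict(test_imgs)
z_points = enc_out[0] if isinstance(enc_out, list) else enc_out

# Reconstruction by the Decoder from the latent vectors
reco_imgs = decoder.predict(z_points)
print(z_points.shape, reco_imgs.shape)   # e.g. (n, 200) and (n, 96, 96, 3)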

With a balanced training controlled by an Adam optimizer I already got a good resemblance after 10 epochs. The reproduction got better and very acceptable also with respect to tiny details after 25 epochs for my AE. Due to possible copyright and personal rights violations I do not dare to present the results for general Celeb A images in a public blog. But you can write me a mail if you are interested.

Most of the data points in the latent space were created in a region of 0 < |x_i| < 20 with x_i meaning one of the vector components of a z-point in the latent space. I will provide more data on the z-point distribution produced by the Encoder in later posts of this mini-series.

Face reconstruction from randomly chosen points in the latent space

Then I selected arbitrary data points in the latent space with randomly chosen and uniformly distributed components 0 < |x_i| < boundary. The values for boundary were systematically enlarged.

Note that most of the resulting points will have a tendency to be located in outer regions of the multidimensional cube with an extension in each direction given by boundary. This is due to the big chance that at least some of the components will get relatively high values – and with z_dim components the squares of all these values add up to a large distance from the origin.
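Just as an illustration of my own (not the code actually used): the following numpy sketch draws such uniformly distributed random z-points and shows that their typical distance from the origin is far larger than the boundary value per coordinate; z_dim and boundary are example values.

import numpy as np

z_dim, boundary, n_samples = 200, 10.0, 100

# components uniformly distributed with |x_i| < boundary (random sign included)
z_points = np.random.uniform(-boundary, boundary, size=(n_samples, z_dim))

# Euclidean distances of the z-points from the origin
radii = np.linalg.norm(z_points, axis=1)
print(radii.min(), radii.mean(), radii.max())
# the mean is close to boundary * sqrt(z_dim / 3), i.e. ~ 81.6 for these example values

So even a moderate boundary per component places the points at a large radius in the latent space.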

Then I fed these arbitrary z-points into the Decoder. Below you see the results after 10 training epochs of the AE; I selected only 10 of 100 data points created for each value of boundary (the images all look more or less the same regarding the absence or blurring of clear face contours):

boundary = 0.5

boundary = 2.5

boundary = 5.0

boundary = 8.0

boundary = 10.0

boundary = 15.0

boundary = 20.0

boundary = 30.0

boundary = 50.0

This is more a collection of face hallucinations than of usable face images. (Interesting for artists, maybe? Seriously meant …).

So, most of the points in the latent space of an Autoencoder do NOT represent reasonable faces. Sometimes our random selection came close to a region in latent space where the results do resemble a face. See e.g. the central image for boundary = 10.

From the images above it becomes clear that some arbitrary path inside the latent space will contain more points which do NOT give you a reasonable face reproduction than points that result in plausible face images – despite a successful training of the Autoencoder.

This result supports the impression that the latent space of well trained Autoencoders is almost unusable for creative purposes. It also raises the interesting question of what the distribution of “meaningful points” in the latent space really looks like. I do not know whether this has been investigated in depth at all. Some links to publications which prove a certain scientific interest in this question are given in the last section of this post.

I also want to comment on an article published in the Quanta Magazine lately. See "Self-Taught AI Shows Similarities to How the Brain Works". This article refers to "masked" Autoencoders and self-supervised learning. Reconstructing masked images, i.e. images in which a superimposed mask hides or blurs pixels, with a reasonably equipped Autoencoder indeed works very well. Regarding this point I totally agree. The same holds for the term "self-supervised learning".

But to suggest that an Autoencoder with this (rather basic) capability reflects methods of the human brain is in my opinion a massive exaggeration. On the contrary, an AE exhibits a certain dumbness regarding the storage and usage of otherwise well extracted feature patterns. This is due to its construction and the nature of its mapping of image contents to the latent space. A child can, after some teaching, draw characteristic features of human faces – out of nothing, on a plain white piece of paper. The Decoder part of a standard Autoencoder (in some contrast to a GAN) cannot – at least not without help in picking a meaningful point in the latent space. And this difference is a major one, in my opinion.

A first interpretation – the curse of many dimensions of the latent space

I think the reason why arbitrary points in the multi-dimensional latent space cannot be mapped to images with recognizable faces is yet another effect of the so called “curse of high dimensionality”. But this time also related to the latent space.

A normal Autoencoder (i.e. one without the Kullback-Leibler loss) uses the latent space in its vast extension to produce points where typical properties (features) of faces and background are encoded in a most unique way for each of the input pictures. But the distinct volume filled by such points is a pretty small one – compared to the extensions of the high dimensional latent space. The volume of data points resulting from a mapping-transformation of arbitrary points in the original feature space to points of the latent space is of course much bigger than the volume of points which correspond to images showing typical human faces.

This is due to the fact that there are many more images with arbitrary pixel values in the original feature space of the input images (with, let's say, 30,000 dimensions for 100×100 color pixels) than images with reasonable pixel values for faces in front of some background. The region in the feature space which corresponds to reasonable images of faces (right colors and dominant pixel values for face features) is certainly small compared to the extension of the original feature space. Therefore: If you pick a random point in the latent space – even within a confined (but multidimensional) volume around the origin – the chance that this point lies outside the particular volume of points which make sense regarding face reproduction is big. I guess that for z_dim > 200 the probability is pretty close to 1.
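To get a feeling for how extreme such volume ratios become in high dimensions, here is a small side calculation of my own. It just compares a ball with its enclosing cube and says nothing about the actual shape of the "face region":

import math

def log10_ball_to_cube_ratio(d, r=1.0):
    # log10 of V_ball(d, r) / V_cube(d, side=2r) for dimension d
    # V_ball = pi^(d/2) / Gamma(d/2 + 1) * r^d ;  V_cube = (2r)^d
    log_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0) + d * math.log(r)
    log_cube = d * math.log(2.0 * r)
    return (log_ball - log_cube) / math.log(10.0)

for d in (2, 20, 200):
    print(d, round(log10_ball_to_cube_ratio(d), 1))
# d = 2   => -0.1   (the disk covers ~ 78% of the square)
# d = 20  => -7.6   (the ball covers ~ 1.e-8 of the cube)
# d = 200 => -168.5 (a vanishing fraction of the cube)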

In addition: As the mapping performed by a neural Encoder network such as a CNN is highly non-linear, it is difficult to say what the boundary hypersurfaces of the mapping regions for faces look like. They are probably complicated – but due to the enormous number of original images with arbitrary pixel values we can safely guess that they enclose a rather small volume.

The manifold of data points in the z-space giving us recognizable faces in front of a reasonably separated background may follow a curved and wiggly “path” through the latent space. In principle there could even be isolated, unconnected regions separated by areas of “chaotic reconstructions”.

I think this kind of argumentation line holds for standard Autoencoders and variational Autoencoders with a very small KL loss in comparison to the reconstruction loss (BCE (binary cross-entropy) or MSE).

Why do Variational Autoencoders [VAEs] help?

The first point is: VAEs reduce the total occupied volume of the latent space. Due to the mu-related term in the Kullback-Leibler loss the whole distribution of z-points gets condensed into a limited volume around the origin of the latent space.

The second reason is that the distributions of meaningful points get smeared out by the logvar-term of the Kullback-Leibler loss.

Both effects enforce overlapping regions of meaningful standard Gaussian-like z-point distributions in the latent space. So VAEs significantly increase the probability to hit a meaningful z-point in latent space – if you chose points around the origin within a distance of “1” per coordinate (or vector component).

The total distance of a point and its vector in z-space has to be measured with some norm, e.g. the Euclidean one. Actually we should get meaningful reconstructions around a multidimensional sphere of radius “16”. Why this radius is reasonable will be discussed in forthcoming posts.
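As a quick plausibility check of my own (under the assumption that the z-point distribution is close to a multivariate standard normal, which is exactly what the KL loss pushes it towards): the Euclidean norm of such vectors concentrates around the square root of the number of dimensions, which for z_dim in the range of 200 to 256 gives typical radii between roughly 14 and 16.

import numpy as np

for z_dim in (200, 256):
    z = np.random.normal(size=(10000, z_dim))   # hypothetical standard normal z-points
    radii = np.linalg.norm(z, axis=1)           # Euclidean distances from the origin
    print(z_dim, round(radii.mean(), 2), round(np.sqrt(z_dim), 2))
# 200 -> mean radius ~ 14.1 ; 256 -> ~ 16.0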

Please, also look at the series on the technical realization of VAEs in this blog. The last posts there prove the effects of the KL-loss experimentally for Celeb A data. Below you find a selection of images created from randomly chosen points in the latent space of a Variational Autoencoder with z_dim=200 after 10 epochs.

Conclusion

Enough for today. Whilst standard Autoencoders solve certain tasks very well, they seem to produce very specific data distributions in the latent space for CelebA images: Only certain regions seem to be suitable for the reconstruction of “meaningful” images with human faces.

This problem may have its origin already in the feature space of the original images. Also there only a small minority of points represents humanly interpretable face images. This becomes obvious when you look at the vast amount of possible pixel combinations in a feature space of, let's say, 96x96x3 = 27,648 dimensions. Each of these dimensions can take a value between 0 and 255. This gives us 256^27,648 possible images – a number with more than 66,000 decimal digits. Only a tiny fraction of these possible images will show reasonable faces in the center with a reasonably structured background around them.
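Just to make the scale of this number tangible, a one-line check of my own:

import math

# number of decimal digits of 256**27648
n_digits = int(27648 * math.log10(256)) + 1
print(n_digits)   # 66584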

From a first experiment the chance of hitting a data point in latent space which gives you a meaningful image seems to be small. This result appears to be a variant of the curse of high dimensionality – this time including the latent space.

In a forthcoming post
Autoencoders, latent space and the curse of high dimensionality – II – a view on fragments and filaments of the latent space for CelebA images
we will investigate the z-point distribution in latent space with a variety of tools. And find that this distribution is fragmented and that the z-points for CelebA images are arranged in certain regions of the latent space. In addition we will get indications that the distribution contains filament-like structures.

Links

https://towardsdatascience.com/ exploring-the-latent-space-of-your-convnet-classifier-b6eb862e9e55

Felix Leeb, Stefan Bauer, Michel Besserve, Bernhard Schölkopf, “Exploring the Latent Space of Autoencoders with Interventional Assays”, 2022,
https://arxiv.org/abs/2106.16091v2 // https://arxiv.org/pdf/2106.16091.pdf
https://wiredspace.wits.ac.za/ handle/10539/33094?show=full
https://www.elucidate.ai/post/ exploring-deep-latent-spaces

Books:
T. Rashid, “GANs mit PyTorch selbst programmieren”, 2020, O’Reilly, dpunkt.verlag, Heidelberg, ISBN 978-3-96009-147-9
D. Foster, “Generatives Deep Learning”, 2019, O’Reilly, dpunkt.verlag, Heidelberg, ISBN 978-3-96009-128-8

 

Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics

I continue with my series on options for an implementation of the Kullback-Leibler divergence as a loss [KL loss] contribution in Variational Autoencoder [VAE] models:

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()

Our objective is to avoid or circumvent potential problems with the eager execution mode of present Tensorflow 2 versions. I have already described three solutions based on standard Keras functionality:

  • Either we add loss contributions via the function layer.add_loss() and a special layer of the Encoder part of the VAE
  • or we add a loss to the output of a full VAE-model via function model.add_loss()
  • or we build a complex model which transports required KL-related tensors from the Encoder part of the VAE model to the Decoder’s output layer.

In all these cases we invoke native Keras functions to handle loss contributions and related operations. Keras controls the calculation of the gradient components of the KL related tensors “mu” and “log_var” in the background for us. This comprises partial derivatives with respect to trainable weight variables of lower Encoder layers and related operations. The same holds for partial derivatives of reconstruction tensors at the Decoder’s output layer with respect to trainable parameters of all layers of the VAE-model. Keras does most of the job

  • of derivative calculation and the registration of related operation sequences during forward pass
  • and the correct application of the registered operations and values in later weight corrections during backward propagation

for us in the background as long as we respect certain rules for eager mode execution.

But Tensorflow 2 [TF2] gives us a much more flexible and low-level option to control the calculation of gradients under the conditions of eager execution. This option requires that we inform the TF/Keras machinery which processes the training steps of an epoch of how to exactly calculate losses and their partial derivatives. Rules to determine and create metrics output must be provided in addition.

TF2 provides a context for registering operations for loss and derivative evaluations. This context is created by the context manager GradientTape(). In addition we have to write an encapsulating function “train_step()” to control gradient calculations and output during training.

In this post I will describe how we integrate such an approach with our class “MyVariationalAutoencoder()” for the setup of a VAE model based on convolutional layers. I have discussed the elements and methods of this class MyVariationalAutoencoder() in detail during the last posts.

Regarding the core of the technical solution for train_step() and GradientTape() I follow more or less the recommendations of one of the masters of Keras: F. Chollet. His original code for a TF2-compatible implementation of a VAE can be found here:
https://keras.io/ examples/ generative/vae/

However, in my opinion Chollet’s code contains a small problem, which I have allowed myself to correct.

The general recipe presented here can, of course, be extended to more complex tasks beyond the optimization of KL and reconstruction losses of VAEs. Therefore, a brief study of the methods to establish detailed loss control is really worth it for ML and VAE beginners. But TF2 and Keras experts will not learn anything new from this post.

I provide the pure code of the classes in this post. In the next post you will find Jupyter cell code for an application to the Celeb A dataset. To prove that the classes below do their job in the end I show you some faces which have been created from arbitrarily chosen points in the latent space after training.

These faces do not exist in reality. They are constructed by the VAE based on compressed and “normalized” data for face patterns and face attribute distributions in the latent space. Note that I used a latent space with a dimension of z_dim =200.

Layer setup by class MyVariationalAutoencoder()

We have already many of the required methods ready. In the last posts we used the flexible functional interface of Keras to set up Neural Network models for both Encoder and Decoder, each with sequences of (convolutional) layers. For our present purposes we will not change the elementary layer structure of the Encoder or Decoder. In particular the layers for the “mu” and “log_var” contributions to the KL loss and a subsequent sampling-layer of the Encoder will remain unchanged.

In the course of the last two posts I have already introduced a parameter “solution_type” to control specifics of our VAE model. We shall use it now to invoke a child class of Keras’ Model() which allows for detailed steps of loss and gradient evaluations.

A child class of keras.models.Model() for loss and gradient evaluation

The standard Keras method Model.fit() normally provides a convenient interface for Keras users. We do not have to think about calling the low-level functions at all if we do not want to or do not need to control gradient calculations in detail. In our present approach, however, we use the low level functionality of GradientTape() directly. This requires overwriting a specific method of the standard Keras class Model() – namely the function “train_step()”.

If you have never worked with a self-defined train_step() and GradientTape() before then I recommend to read the following introductions first:
https://www.tensorflow.org/ guide/ autodiff
customizing what happens in fit() and the relation to train_step()
These articles contain valuable information about how to operate at a low level with train_step() regarding losses, derivatives and metrics. This information will help to better understand the methods of a new class VAE() which I am going to derive from Keras’ class Model() below.

Let us first briefly repeat some imports required.

Imports

# Imports 
# ~~~~~~~~ 
import sys
import numpy as np
import os
import pickle

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import Layer, Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, AlphaDropout
from tensorflow.keras.models import Model
# to be consistent with my standard loading of the Keras backend in Jupyter notebooks:  
from tensorflow.keras import backend as B      
from tensorflow.keras import metrics
#from tensorflow.keras.backend import binary_crossentropy

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint 
from tensorflow.keras.utils import plot_model

#from tensorflow.python.debug.lib.check_numerics_callback import _maybe_lookup_original_input_tensor

# Personal: The following works only if the path in the notebook is supplemented by the path to /projects/GIT/mlx
# The user has to organize his paths for modules to be referred to from Jupyter notebooks himself and 
# replace this settings  
from mynotebooks.my_AE_code.utils.callbacks import CustomCallback, VAE_CustomCallback, step_decay_schedule    
from keras.callbacks import ProgbarLogger

Now we define a class VAE() which inherits basic functionality from the Keras class Model() and overwrite the method train_step(). We shall later create an instance of this new class within an object of class MyVariationalAutoencoder().

New Class VAE

from tensorflow.keras import metrics
...
...
# A child class of Model() to control train_step with GradientTape() 
class VAE(keras.Model): 
    
    # We use our self defined __init__() to provide a reference MyVAE 
    # to an object of type "MyVariationalAutoencoder" 
    # This in turn allows us to address the Encoder and the Decoder  
    def __init__(self, MyVAE, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.MyVAE   = MyVAE 
        self.encoder = self.MyVAE.encoder
        self.decoder = self.MyVAE.decoder
        
        # A factor to control the ratio between the KL loss and the reconstruction loss 
        self.fact = MyVAE.fact
        
        # A counter 
        self.count = 0 
        
        # A factor to scale the absolute values of the losses 
        # e.g. by the number of pixels of an image
        self.scale_fact = 1.0  # no scaling
        # self.scale_fact = tf.constant(self.MyVAE.input_dim[0] * self.MyVAE.input_dim[1], dtype=tf.float32)
        self.f_scale    = 1. / self.scale_fact
        
        # loss type : 0: BCE, 1: MSE 
        self.loss_type = self.MyVAE.loss_type
        
        # track loss development via metrics 
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reco_loss_tracker  = keras.metrics.Mean(name="reco_loss")
        self.kl_loss_tracker    = keras.metrics.Mean(name="kl_loss")

    def call(self, inputs):
        x, z_m, z_var = self.encoder(inputs)
        return self.decoder(x)  

    # Overwrite the metrics() of Model() - use getter mechanism  
    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reco_loss_tracker,
            self.kl_loss_tracker
        ]

    # Core function to control all operations regarding eager differentiation operations, 
    # i.e. the calculation of loss terms with respect to tensors and differentiation variables 
    # and metrics data 
    def train_step(self, data):
        # We use the GradientTape context to record differentiation operations/results 
        #self.count += 1 
        
        with tf.GradientTape() as tape:
            z, z_mean, z_log_var = self.encoder(data)
            reconstruction = self.decoder(z)
            #reco_shape = tf.shape(self.reconstruction)
            #print("reco_shape = ", reco_shape, self.reconstruction.shape, data.shape)
            
            #BCE loss (Binary Cross Entropy) 
            if self.loss_type == 0: 
                reconstruction_loss = tf.reduce_mean(
                    tf.reduce_sum(
                        keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                    )
                ) * self.f_scale
            
            # MSE loss (Mean Squared Error) 
            if self.loss_type == 1: 
                reconstruction_loss = tf.reduce_mean(
                    tf.reduce_sum(
                        keras.losses.mse(data, reconstruction), axis=(1, 2)
                    )
                ) * self.f_scale
            
            kl_loss = -0.5 * self.fact * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))  
            total_loss = reconstruction_loss + kl_loss 
        
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        #if self.count == 1: 
            
        self.total_loss_tracker.update_state(total_loss)
        self.reco_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "total_loss": self.total_loss_tracker.result(),
            "reco_loss": self.reco_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }
        
    def compile_VAE(self, learning_rate):

        # Optimizer
        # ~~~~~~~~~ 
        optimizer = Adam(learning_rate=learning_rate)
        # save the learning rate for possible intermediate output to files 
        self.learning_rate = learning_rate
        self.compile(optimizer=optimizer)

Explanation of class VAE(): Details of the methods of the additional class

First, we need to import an additional module, tensorflow.keras.metrics. Its functions, e.g. Mean(), will help us to print out intermediate data about various loss contributions during training – averaged over the batches of an epoch.

Then, we have added four central methods to class VAE:

  • a function __init__(),
  • a function metrics() together with Python’s getter-mechanism
  • a function call()
  • and our central function train_step().

All functions overwrite the defaults of the parent class Model(). Be careful to distinguish the range of batches which keras.metrics() and train_step() operate on:

  • A “training step” covers just one batch provided to the training mechanism by the Model.fit()-function.
  • Averaging performed by functions of keras.metrics instead works across all batches of an epoch.

Functions “__init__() ” and call() to instantiate a Model based on class VAE()

In general we can use the standard interface of __init__(inputs, outputs, …) or a call()-interface to instantiate an object of class-type Model(). See
https://www.tensorflow.org/ api_docs/python/ tf/ keras/ Model
https://docs.w3cub.com/ tensorflow~python/ tf/ keras/ model.html

We have to be precise about the parameters of __init__() or the call()-interface if we intend to use properties of the standard compile()- and fit()-interfaces of a model – at least in application cases where we do not control everything regarding losses and gradients ourselves.

To define a complete model for the general case we therefore add the call()-method. At the same time we “misuse” the __init__() function of VAE() to provide a reference to our instance of class “MyVariationalAutoencoder”. Actually, providing “call()” is done only for the sake of flexibility in other use cases than the one discussed here. For our present purposes we could actually omit call().

The __init__()-function retrieves some parameters from MyVAE. You see the factor “fact” which controls the ratio of the KL-loss to the reconstruction loss. In addition I provided an option to scale the loss values by a division by the number of pixels of input images. You just have to un-comment the respective statement. Sorry, I have not yet made it controllable by a parameter of MyVariationalAutoencoder().

Finally, the parameter loss_type is evaluated; for a value of “1” we take MSE as a loss function instead of the standard BCE (Binary Cross-Entropy); see the Jupyter cells in the next post. This allows for some more flexibility in dealing with certain datasets.

Function “metrics()” to produce loss values as output during training

With the function metrics() we are able to establish our own “tracking” of the evolution of the Model’s loss contributions during training. In our case we are particularly interested in the evolution of the “reconstruction loss” and the “KL-loss“.

Note that the @property decorator is added to the metrics()-function. This allows us to define its output via the getter-mechanism for Python classes. In our case the __init__()-function defines the mechanism to fill required variables:
The three “tracker”-variables there get their values from the function tensorflow.keras.metrics.Mean(). Note that the names given to the loss-trackers in __init__() are of importance for later output handling!

Note also that keras.metrics.Mean() calculates averages over values derived for all batches of an epoch. The tf.reduce_mean()-statements in the GradientTape() section of the code above, instead, refer to averages calculated over the samples of a single batch.

Updated loss values are delivered during each training step by the method update_state(). You find a description of the methods of keras.metrics.Mean() here.

The result of all this is that metrics() delivers loss values via the updated tracker-variables of our child class VAE(). Note that neither __init__() nor metrics() define what exactly is to be done to calculate each loss term. __init__() and metrics() only prepare the later output technically by formal class constructs. Note also that all the data defined by metrics() are updated and averaged per epoch without the requirement to call the function “reset_states()” explicitly (see the Keras docs). This is done automatically at the beginning of each epoch.
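A tiny standalone example (of my own, independent of the VAE code) shows this behavior of keras.metrics.Mean():

from tensorflow import keras

tracker = keras.metrics.Mean(name="total_loss")

# simulate the per-batch loss values of one epoch
for batch_loss in [0.9, 0.7, 0.5]:
    tracker.update_state(batch_loss)

print(float(tracker.result()))   # 0.7 => running average over all batches so far

tracker.reset_states()           # done automatically by Keras at the beginning of each epoch
print(float(tracker.result()))   # 0.0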

train_step() and GradientTape() to control losses and their gradients

Let us turn to the necessary calculations which must be performed during each training step. In an eager environment we must watch the trainable variables, on which the different loss terms depend, to be able to calculate the partial derivatives and record related operations and intermediate results already during forward pass:

We must track the differentiation operations and resulting values to know exactly what has to be done in reverse during error backward propagation. To be able to do this TF2 offers us a recording mechanism called GradientTape(). Its results are kept in an object which often is called a “tape”.

You find more information about these topics at
https://debuggercafe.com/ basics-of-tensorflow-gradienttape/
https://runebook.dev/de/docs/ tensorflow/gradienttape
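If you have never seen the tape mechanism in action: it can be demonstrated with a trivial function, independently of any Keras model (a tiny example of my own):

import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    # all operations on watched (trainable) variables inside this context are recorded
    y = x * x        # y = x^2

dy_dx = tape.gradient(y, x)   # analytic derivative: 2*x
print(float(dy_dx))           # 6.0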

Within train_step() we need some tensors which are required for loss calculations in an explicit form. So, we must change the Keras model for the Encoder to give us the tensors for “mu” and “log_var” as its output.

This is no problem for us. We have already made the output of the Encoder dependent on a variable “solution_type” and discussed a multi-output Encoder model already in the post Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output.

Therefore, we just have to add a new value 3 to the checks of “solution_type”. The same is true for the input control of the Decoder (see a section about the related methods of MyVariationalAutoencoder() below).

The statements within the GradientTape() section deal with the calculation of loss terms and record the related operations. All the calculations should be familiar from previous posts of this series.

This includes an identification of the trainable_weights of the involved layers. Quote from
https://keras.io/ guides/ writing_a_ training_loop_ from_scratch/ #using-the-gradienttape-a-first-endtoend-example:

Calling a model inside a GradientTape scope enables you to retrieve the gradients of the trainable weights of the layer with respect to a loss value. Using an optimizer instance, you can use these gradients to update these variables (which you can retrieve using model.trainable_weights).

In train_step() we need to register that the total loss is dependent on all trainable weights and that all related partial derivatives have to be taken into account during optimization. This is done by

        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

To get updated loss output during training we update the state of all tracked variables:

        self.total_loss_tracker.update_state(total_loss)
        self.reco_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)

A small problem with F. Chollet’s code

The careful reader may have noticed that my code of the function “train_step()” deviates from F. Chollet’s recommendations. Regarding the return statement I use

        return {
            "total_loss": self.total_loss_tracker.result(),
            "reco_loss": self.reco_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

whilst F. Chollet’s original code contains a statement like

        return {
            "loss": self.total_loss_tracker.result(),     # here lies the main difference - different "name" than defined in __init__!
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),  # ignore my abbreviation to reco_loss 
            "kl_loss": self.kl_loss_tracker.result(),
        }

Chollet’s original code unfortunately gives inconsistent loss data: The sum of his “reconstruction loss” and the “KL (Kullback Leibler) loss” do not add up to the (total) “loss”. This can be seen from the data of the first epochs in F. Chollet’s example on the tutorial at
keras.io/ examples/ generative/ vae.

Some of my own result data for the MNIST example with this error look like:

Epoch 1/5
469/469 [==============================] - 7s 13ms/step - reconstruction_loss: 209.0115 - kl_loss: 3.4888 - loss: 258.9048
Epoch 2/5
469/469 [==============================] - 7s 14ms/step - reconstruction_loss: 173.7905 - kl_loss: 4.8220 - loss: 185.0963
Epoch 3/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 160.4016 - kl_loss: 5.7511 - loss: 167.3470
Epoch 4/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 155.5937 - kl_loss: 5.9947 - loss: 162.3994
Epoch 5/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 152.8330 - kl_loss: 6.1689 - loss: 159.5607

Things do get better from epoch to epoch – but we want a consistent output from the beginning: The averaged (total) loss should always be printed as equal to the sum of the averaged KL loss plus the averaged reconstruction loss.

The deviation is surprising as we seem to use the right tracker-results in the code. And the name used in the return statement of the train_step()-function here should only be relevant for the printing …

However, the name “loss” is NOT consistent with the name defined in the statement Mean(name=”total_loss”) in the __init__() function of Chollet, where he defines his tracking mechanisms.

self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

This has consequences: The inconsistency triggers a different output than a consistent use of names. Just try it out on your own …

This is not only true for the deviation between “loss” in

return {
            "loss": self.total_loss_tracker.result(),
            ....
       }

and “total_loss” in the __init__()-function

self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

– which results in printed values lacking the proper averaging – but also for deviations in the names used for the other loss contributions. In case of an inconsistency Keras seems to fall back to a default here which does not reflect the standard linear averaging of Mean() over all values calculated for the batches of an epoch (without any special weights).

That there is some common default mechanism at work can be seen from the fact that wrong names for all individual losses (here the KL loss and the reconstruction loss) give us at least a consistent sum-value for the total amount again. But all the values derived by the fallback are much closer to the start values at the beginning of an epoch than to the mean values averaged over an epoch. You may test this yourself.

To get on the safe side we use the correct “names” defined in the __init__()-function of our code:

        return {
            "total_loss": self.total_loss_tracker.result(),
            "reco_loss": self.reco_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

For MNIST data fed into our VAE model we then get:

Epoch 1/5
469/469 [==============================] - 8s 13ms/step - reco_loss: 214.5662 - kl_loss: 2.6004 - total_loss: 217.1666
Epoch 2/5
469/469 [==============================] - 7s 14ms/step - reco_loss: 186.4745 - kl_loss: 3.2799 - total_loss: 189.7544
Epoch 3/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 181.9590 - kl_loss: 3.4186 - total_loss: 185.3774
Epoch 4/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 177.5216 - kl_loss: 3.9433 - total_loss: 181.4649
Epoch 5/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 163.7209 - kl_loss: 5.5816 - total_loss: 169.3026

This is exactly what we want.

A general recipe to use train_step()

So, the general recipe is:

  • Define what metric properties you are interested in. Create respective tracker-variables in the __init__() function.
  • Use the getter mechanism to define your metrics() function and its output via references to the trackers.
  • Define your own training step by a function train_step().
  • Use Tensorflow’s GradientTape context to register statements which control the calculation of loss contributions from elementary tensors of your (functional) Keras model. Provide all layers there, e.g. by references to their models.
  • Register gradient-operations of the total loss with respect to all trainable weights and updates of metrics data within function “train_step()”.

Actually, I have used the GradientTape() mechanism already in this blog when I played a bit with approaches to create so called DeepDream images. See
https://linux-blog.anracom.com/category/machine-learning/deep-dream/
for more information – there in a different context.
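Condensed into a minimal, generic skeleton – a sketch of my own, not tied to the VAE classes above; the loss here is just a plain MSE to keep things short:

import tensorflow as tf
from tensorflow import keras

class MyModel(keras.Model):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.dense = keras.layers.Dense(1)
        # step 1: tracker variables for the metrics we are interested in
        self.loss_tracker = keras.metrics.Mean(name="loss")

    # step 2: getter-based metrics() referring to the trackers
    @property
    def metrics(self):
        return [self.loss_tracker]

    def call(self, x):
        return self.dense(x)

    # step 3: our own training step
    def train_step(self, data):
        x, y = data
        # step 4: record the loss operations within the GradientTape context
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = tf.reduce_mean(tf.square(y - y_pred))
        # step 5: register gradients, apply weight updates, update the metrics
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}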

How to combine the classes “VAE()” and “MyVariationalAutoencoder()” ?

Where do we stand? We have defined a new class “VAE()” which modifies the original Keras Model() class. And we have our class “MyVariationalAutoencoder()” to control the setup of a VAE model.

Next we need to address the question of how we combine these two classes. If you have read my previous posts you may expect a major change to the method “_build_VAE()“. This is correct, but we also have to modify the conditions for the Encoder output construction in _build_enc() and the definition of the Decoder’s input in _build_dec(). Therefore I give you the modified code for these functions. For reasons of completeness I add the code for the __init__()-function:

Function __init__()

    def __init__(self
        , input_dim                  # the shape of the input tensors (for MNIST (28,28,1)) 
        , encoder_conv_filters       # number of maps of the different Conv2D layers   
        , encoder_conv_kernel_size   # kernel sizes of the Conv2D layers 
        , encoder_conv_strides       # strides - here also used to reduce spatial resolution 
                                     # (instead of pooling layers) 
        , encoder_conv_padding       # padding - valid or same  
        
        , decoder_conv_t_filters     # number of maps in Con2DTranspose layers 
        , decoder_conv_t_kernel_size # kernel sizes of Conv2D Transpose layers  
        , decoder_conv_t_strides     # strides for Conv2dTranspose layers - inverts spatial resolution
        , decoder_conv_t_padding     # padding - valid or same  
        
        , z_dim                      # A good start is 16 or 24  
        , solution_type  = 0         # Which type of solution for the KL loss calculation ?
        , act            = 0         # Which type of activation function?  
        , fact           = 0.65e-4   # Factor for the KL loss (0.5e-4 < fact < 1.e-3 is reasonable) 
        , loss_type      = 0         # 0: BCE, 1: MSE   
        , use_batch_norm = False     # Shall BatchNormalization be used after Conv2D layers? 
        , use_dropout    = False     # Shall statistical dropout layers be used for regularization purposes ? 
        , dropout_rate   = 0.25      # Rate for statistical dropout layer  
        , b_build_all    = False     # Added by RMO - full Model is build in 2 steps 
        ):
        
        '''
        Input: 
        The encoder_... and decoder_.... variables are Python lists,
        whose length defines the number of Conv2D and Conv2DTranspose layers 
        
        input_dim : Shape/dimensions of the input tensor - for MNIST (28,28,1) 
        encoder_conv_filters:     List with the number of maps/filters per Conv2D layer    
        encoder_conv_kernel_size: List with the kernel sizes for the Conv-Layers   
        encoder_conv_strides:     List with the strides used for the Conv-Layers   

        z_dim : dimension of the "latent_space"
        solution_type : Type of solution for KL loss calculation (0: Customized Encoder layer, 
                                                                  1: transfer of mu, var_log to Decoder 
                                                                  2: model.add_loss()
                                                                  3: definition of a train_step() with GradientTape()
        
        act :  determines activation function to use (0: LeakyRELU, 1:RELU , 2: SELU)
               !!!! NOTE: !!!!
               If SELU is used then the weight kernel initialization and the dropout layer need to be special   
               https://github.com/christianversloot/machine-learning-articles/blob/main/using-selu-with-tensorflow-and-keras.md
               AlphaDropout instead of Dropout + LeCunNormal for kernel initializer
        fact = 0.65e-4 : Factor to scale the KL loss relative to the reconstruction loss
                         Must be adapted to the way of calculation - 
                         e.g. for solution_type == 3 the loss is not averaged over all pixels 
                         => at least factor of around 1000 bigger than normally 
        loss-type = 0:   Defines the way we calculate a reconstruction loss 
                         0: Binary Cross Entropy - recommended by many authors 
                         1: Mean Square error - recommended by some authors especially for "face arithmetics"
        use_batch_norm = False   # True : We use BatchNormalization   
        use_dropout    = False   # True : We use dropout layers (rate = 0.25, see Encoder)
        b_build_all    = False   # True : Full VAE Model is built in 1 step; 
                                   False: Encoder, Decoder, VAE are built in separate steps   
        '''
        
        self.name = 'variational_autoencoder'

        # Parameters for Layers which define the Encoder and Decoder 
        self.input_dim                  = input_dim
        self.encoder_conv_filters       = encoder_conv_filters
        self.encoder_conv_kernel_size   = encoder_conv_kernel_size
        self.encoder_conv_strides       = encoder_conv_strides
        self.encoder_conv_padding       = encoder_conv_padding
        
        self.decoder_conv_t_filters     = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides     = decoder_conv_t_strides
        self.decoder_conv_t_padding     = decoder_conv_t_padding
        
        self.z_dim = z_dim

        # Check param for activation function 
        if act < 0 or act > 2: 
            print("Range error: Parameter act = " + str(act) + " has unknown value ")  
            sys.exit()
        else:
            self.act = act 
        
        # Factor to scale the KL loss relative to the Binary Cross Entropy loss 
        self.fact = fact 
        
        # Type of loss - 0: BCE, 1: MSE 
        self.loss_type = loss_type
        
        
        # Check param for solution approach  
        if solution_type < 0 or solution_type > 3: 
            print("Range error: Parameter solution_type = " + str(solution_type) + " has unknown value ")  
            sys.exit()
        else:
            self.solution_type = solution_type 

        self.use_batch_norm = use_batch_norm
        self.use_dropout    = use_dropout
        self.dropout_rate   = dropout_rate

        # Preparation of some variables to be filled later 
        self._encoder_input  = None  # receives the Keras object for the Input Layer of the Encoder 
        self._encoder_output = None  # receives the Keras object for the Output Layer of the Encoder 
        self.shape_before_flattening = None # info of the Encoder => is used by Decoder 
        self._decoder_input  = None  # receives the Keras object for the Input Layer of the Decoder
        self._decoder_output = None  # receives the Keras object for the Output Layer of the Decoder

        # Layers / tensors for KL loss 
        self.mu      = None # receives special Dense Layer's tensor for KL-loss 
        self.log_var = None # receives special Dense Layer's tensor for KL-loss 

        # Parameters for SELU - just in case we may need to use it somewhere 
        # https://keras.io/api/layers/activations/ see selu
        self.selu_scale = 1.05070098
        self.selu_alpha = 1.67326324

        # The number of Conv2D and Conv2DTranspose layers for the Encoder / Decoder 
        self.n_layers_encoder = len(encoder_conv_filters)
        self.n_layers_decoder = len(decoder_conv_t_filters)

        self.num_epoch = 0 # Initialization of the number of epochs 

        # A matrix for the values of the losses 
        self.std_loss  = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)

        # We only build the whole AE-model if requested
        self.b_build_all = b_build_all
        if b_build_all:
            self._build_all()


Changes to the Encoder and Decoder code

We just need to set the right options for the output tensors of the Encoder and the input tensors of the Decoder. The relevant code parts are controlled by the parameter “solution_type”.

Modified code of _build_enc() of class MyVariationalAutoencoder

    def _build_enc(self, solution_type = -1, fact=-1.0):
        '''  Your documentation '''
        # Checking whether "fact" and "solution_type" for the KL loss shall be overwritten
        if fact < 0:
            fact = self.fact  
        if solution_type < 0:
            solution_type = self.solution_type
        else: 
            self.solution_type = solution_type
        
        # Preparation: We later need a function to calculate the z-points in the latent space 
        # The following function will be used by an eventual Lambda layer of the Encoder 
        def z_point_sampling(args):
            '''
            A point in the latent space is calculated statistically 
            around an optimized mu for each sample 
            '''
            mu, log_var = args # Note: These are 1D tensors !
            epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
            return mu + B.exp(log_var / 2) * epsilon

        
        # Input "layer"
        self._encoder_input = Input(shape=self.input_dim, name='encoder_input')

        # Initialization of a running variable x for individual layers 
        x = self._encoder_input

        # Build the CNN-part with Conv2D layers 
        # Note that stride>=2 reduces spatial resolution without the help of pooling layers 
        for i in range(self.n_layers_encoder):
            conv_layer = Conv2D(
                filters = self.encoder_conv_filters[i]
                , kernel_size = self.encoder_conv_kernel_size[i]
                , strides = self.encoder_conv_strides[i]
                , padding = 'same'  # Important ! Controls the shape of the layer tensors.    
                , name = 'encoder_conv_' + str(i)
                )
            x = conv_layer(x)
            
            # The "normalization" should be done ahead of the "activation" 
            if self.use_batch_norm:
                x = BatchNormalization()(x)

            # Selection of activation function (out of 3)      
            if self.act == 0:
                x = LeakyReLU()(x)
            elif self.act == 1:
                x = ReLU()(x)
            elif self.act == 2: 
                # RMO: Just use the Activation layer to use SELU with predefined (!) parameters 
                x = Activation('selu')(x) 

            # Fulfill some SELU requirements 
            if self.use_dropout:
                if self.act == 2: 
                    x = AlphaDropout(rate = 0.25)(x)
                else:
                    x = Dropout(rate = 0.25)(x)

        # Last multi-dim tensor shape - is later needed by the decoder 
        self._shape_before_flattening = B.int_shape(x)[1:]

        # Flattened layer before calculating VAE-output (z-points) via 2 special layers 
        x = Flatten()(x)
        
        # "Variational" part - create 2 Dense layers for a statistical distribution of z-points  
        self.mu      = Dense(self.z_dim, name='mu')(x)
        self.log_var = Dense(self.z_dim, name='log_var')(x)

        if solution_type == 0: 
            # Customized layer for the calculation of the KL loss based on mu, var_log data
            # We use a customized layer according to a class definition  
            self.mu, self.log_var = My_KL_Layer()([self.mu, self.log_var], fact=fact)


        # Layer to provide a z_point in the Latent Space for each sample of the batch 
        self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])

        # The Encoder Model 
        # ~~~~~~~~~~~~~~~~~~~
        # With extra KL layer or with vae.add_loss()
        if self.solution_type == 0 or self.solution_type == 2: 
            self.encoder = Model(self._encoder_input, self._encoder_output, name="encoder")
        
        # Transfer solution => Multiple outputs 
        if self.solution_type == 1  or self.solution_type == 3: 
            self.encoder = Model(inputs=self._encoder_input, outputs=[self._encoder_output, self.mu, self.log_var], name="encoder")

The difference is the additional dependency of the Encoder output on solution_type == 3. For the Decoder we have:

Modified code of _build_dec() of class MyVariationalAutoencoder

    def _build_dec(self):
        ''' Your documentation       '''       
 
        # Input layer - aligned to the shape of z-points in the latent space = output[0] of the Encoder 
        self._decoder_inp_z = Input(shape=(self.z_dim,), name='decoder_input')
        
        # Additional Input layers for the KL tensors (mu, log_var) from the Encoder
        if self.solution_type == 1  or self.solution_type == 3: 
            self._dec_inp_mu       = Input(shape=(self.z_dim,), name='mu_input')
            self._dec_inp_var_log  = Input(shape=(self.z_dim,), name='logvar_input')
            
            # We give the layers later used as output a name 
            # Each of the Activation layers below just correspond to an identity passed through 
            #self._dec_mu            = self._dec_inp_mu 
            #self._dec_var_log       = self._dec_inp_var_log 
            self._dec_mu            = Activation('linear',name='dc_mu')(self._dec_inp_mu) 
            self._dec_var_log       = Activation('linear', name='dc_var')(self._dec_inp_var_log) 

        # Here we use the tensor shape info from the Encoder          
        x = Dense(np.prod(self._shape_before_flattening))(self._decoder_inp_z)
        x = Reshape(self._shape_before_flattening)(x)

        # The inverse CNN
        for i in range(self.n_layers_decoder):
            conv_t_layer = Conv2DTranspose(
                filters = self.decoder_conv_t_filters[i]
                , kernel_size = self.decoder_conv_t_kernel_size[i]
                , strides = self.decoder_conv_t_strides[i]
                , padding = 'same' # Important ! Controls the shape of tensors during reconstruction
                                   # we want an image with the same resolution as the original input 
                , name = 'decoder_conv_t_' + str(i)
                )
            x = conv_t_layer(x)

            # Normalization and Activation 
            if i < self.n_layers_decoder - 1:
                # Also in the decoder: normalization before activation  
                if self.use_batch_norm:
                    x = BatchNormalization()(x)
                
                # Choice of activation function
                if self.act == 0:
                    x = LeakyReLU()(x)
                elif self.act == 1:
                    x = ReLU()(x)
                elif self.act == 2: 
                    #x = self.selu_scale * ELU(alpha=self.selu_alpha)(x)
                    x = Activation('selu')(x)
                
                # Adaptions to SELU requirements 
                if self.use_dropout:
                    if self.act == 2: 
                        x = AlphaDropout(rate = 0.25)(x)
                    else:
                        x = Dropout(rate = 0.25)(x)
                
            # Last layer => Sigmoid output 
            # => This requires scaled input => Division of pixel values by 255
            else:
                x = Activation('sigmoid', name='dc_reco')(x)

        # Output tensor => a scaled image 
        self._decoder_output = x

        # The Decoder model 
        # solution_type == 0/2/3: Just the decoded input 
        if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3: 
            self.decoder = Model(self._decoder_inp_z, self._decoder_output, name="decoder")
        
        # solution_type == 1: The decoded tensor plus the transferred tensors mu and log_var for the variational distribution 
        if self.solution_type == 1: 
            self.decoder = Model([self._decoder_inp_z, self._dec_inp_mu, self._dec_inp_var_log], 
                                 [self._decoder_output, self._dec_mu, self._dec_var_log], name="decoder")

Changes to the method _build_VAE() for building the VAE model

Our VAE model is now set up with the help of the __init__() method of our new class VAE. We just have to hand over the object created by class MyVariationalAutoencoder.

Modified code of _build_VAE() of class MyVariationalAutoencoder

    def _build_VAE(self):     
        ''' Your documentation '''
        
        # Solution with train_step() and GradientTape(): Control is transferred to class VAE  
        if self.solution_type == 3:
            self.model = VAE(self)   # Here parameter "self" provides a reference to an instance of MyVariationalAutoencoder  
            self.model.summary()
        
        # Solutions with layer.add_loss or model.add_loss() 
        if self.solution_type == 0 or self.solution_type == 2:
            model_input  = self._encoder_input
            model_output = self.decoder(self._encoder_output)
            self.model = Model(model_input, model_output, name="vae")

        # Solution with transfer of data from the Encoder to the Decoder output layer
        if self.solution_type == 1: 
            enc_out      = self.encoder(self._encoder_input)
            dc_reco, dc_mu, dc_var = self.decoder(enc_out)
            # We organize the output and later association of cost functions and metrics via a dictionary 
            mod_outputs = {'vae_out_main': dc_reco, 'vae_out_mu': dc_mu, 'vae_out_var': dc_var}
            self.model = Model(inputs=self._encoder_input, outputs=mod_outputs, name="vae")

Note that we keep the resulting model within the object of class MyVariationalAutoencoder. See the Jupyter cells in my next post.

Changes to the method compile_myVAE()

The modification of the method compile_myVAE() is simple:

    def compile_myVAE(self, learning_rate):

        # Optimizer
        # ~~~~~~~~~ 
        optimizer = Adam(learning_rate=learning_rate)
        # save the learning rate for possible intermediate output to files 
        self.learning_rate = learning_rate
        
        # Parameter "fact" will be used by the cost functions defined below to scale the KL loss relative to the BCE loss 
        fact = self.fact
        
        # Function for solution_type == 1
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  
        @tf.function
        def mu_loss(y_true, y_pred):
            loss_mux = fact * tf.reduce_mean(tf.square(y_pred))
            return loss_mux
        
        @tf.function
        def logvar_loss(y_true, y_pred):
            loss_varx = -fact * tf.reduce_mean(1 + y_pred - tf.exp(y_pred))    
            return loss_varx

        # Function for solution_type == 2 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # We follow an approach described at  
        # https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer
        # NOTE: We can NOT use @tf.function here 
        def get_kl_loss(mu, log_var):
            kl_loss = -fact * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))
            return kl_loss


        # Required operations for solution_type==2 => model.add_loss()
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        res_kl = get_kl_loss(mu=self.mu, log_var=self.log_var)

        if self.solution_type == 2: 
            self.model.add_loss(res_kl)    
            self.model.add_metric(res_kl, name='kl', aggregation='mean')
        
        # Model compilation 
        # ~~~~~~~~~~~~~~~~~~~~
        
        # Solutions with layer.add_loss or model.add_loss() 
        if self.solution_type == 0 or self.solution_type == 2: 
            if self.loss_type == 0: 
                self.model.compile(optimizer=optimizer, loss="binary_crossentropy",
                                   metrics=[tf.keras.metrics.BinaryCrossentropy(name='bce')])
            if self.loss_type == 1: 
                self.model.compile(optimizer=optimizer, loss="mse",
                                   metrics=[tf.keras.metrics.MeanSquaredError(name='mse')])
        
        # Solution with transfer of data from the Encoder to the Decoder output layer
        if self.solution_type == 1: 
            if self.loss_type == 0: 
                self.model.compile(optimizer=optimizer
                                   , loss={'vae_out_main':'binary_crossentropy', 'vae_out_mu':mu_loss, 'vae_out_var':logvar_loss} 
                                   #, metrics={'vae_out_main':tf.keras.metrics.BinaryCrossentropy(name='bce'), 'vae_out_mu':mu_loss, 'vae_out_var': logvar_loss }
                                   )
            if self.loss_type == 1: 
                self.model.compile(optimizer=optimizer
                                   , loss={'vae_out_main':'mse', 'vae_out_mu':mu_loss, 'vae_out_var':logvar_loss} 
                                   #, metrics={'vae_out_main':tf.keras.metrics.MSE(name='mse'), 'vae_out_mu':mu_loss, 'vae_out_var': logvar_loss }
                                   )
       
        # Solution with train_step() and GradientTape(): Control is transferred to class VAE  
        if self.solution_type == 3:
            self.model.compile(optimizer=optimizer)

Note the adaptations to the new parameter “loss_type” which we have added to the __init__() function!

Changes to the method train_myVAE() – inclusion of a dataflow “generator”

It gets a bit more complicated for the method “train_myVAE()”. The reason is that we use the opportunity to include the output of so-called generators which create limited batches on the fly from disk or memory.

Such a generator is very useful if you have to handle datasets which you cannot fit into the VRAM of your video card. A typical case is the Celeb A dataset on older graphics cards like mine.

In such a case we provide a dataflow to the function. The batches in this dataflow are continuously created as needed and handed over to Tensorflow’s data processing on the graphics card. So, “x_train” as an input variable must not be taken literally in this case! It is then replaced by the generator’s dataflow. See the code for the Jupyter cells in the next post.

In addition we prepare for cases in which we have to provide target data which deviate from the input data “x_train”. Typical cases are the application of AEs/VAEs to denoising or recolorization.
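
Just for orientation, the sketch below shows one way such a dataflow can be built with Keras’ ImageDataGenerator. This is only a minimal, hedged example: the directory path, image size and batch size are placeholder assumptions and not the exact settings used in this series. The resulting iterator is what later takes the place of “x_train”.

# Minimal sketch of a dataflow generator - path, target_size and batch_size are placeholders
from tensorflow.keras.preprocessing.image import ImageDataGenerator

data_gen = ImageDataGenerator(rescale=1./255)    # scale pixel values to [0, 1]

# class_mode='input' makes the generator return (batch, batch) tuples,
# i.e. the target equals the input - which pure reconstruction training needs
celeba_flow = data_gen.flow_from_directory(
      directory   = '/path/to/celeba_resized'    # placeholder path to resized CelebA images
    , target_size = (96, 96)
    , batch_size  = 128
    , class_mode  = 'input'
    , shuffle     = True
)

# The iterator is then handed over in place of "x_train", e.g.
#   MyVAE.train_myVAE(celeba_flow, b_use_generator=True, epochs=10)
# with "MyVAE" being an instance of MyVariationalAutoencoder

With that picture in mind, here is the modified method itself: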

    # Function to initiate training 
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    def train_myVAE(self, x_train, x_target=None
                    , b_use_generator   = False 
                    , b_target_ne_train = False
                    , batch_size = 32
                    , epochs = 2
                    , initial_epoch = 0, 
                    t_mu=None, 
                    t_logvar=None 
                    ):

        ''' 
        @note: Sometimes x_target MUST be provided - e.g. for Denoising, Recolorization 
        @note: x_train will come as a dataflow in case of a generator 
        '''

        # cax = ProgbarLogger(count_mode='samples', stateful_metrics=None)
        
        class MyPrinterCallback(tf.keras.callbacks.Callback):
            # def on_train_batch_begin(self, batch, logs=None):
            #     # Do something on begin of training batch
        
            def on_epoch_end(self, epoch, logs=None):
                # Get overview over available keys 
                #keys = list(logs.keys())
                print("\nEPOCH: {}, Total Loss: {:8.6f}, // reco loss: {:8.6f}, mu Loss: {:8.6f}, logvar loss: {:8.6f}".format(epoch, 
                      logs['loss'], logs['decoder_loss'], logs['decoder_1_loss'], logs['decoder_2_loss'] 
                                            ))
                print()
                #print('EPOCH: {}, Total Loss: {}'.format(epoch, logs['loss']))
                #print('EPOCH: {}, metrics: {}'.format(epoch, logs['metrics']))
        
            def on_epoch_begin(self, epoch, logs=None):
                print('-'*50)
                print('STARTING EPOCH: {}'.format(epoch))
                
        if not b_target_ne_train : 
            x_target = x_train

        # Data are provided from tensors in the Video RAM 
        if not b_use_generator: 

            # Solutions with layer.add_loss or model.add_loss() 
            # Solution with train_step() and GradientTape(): Control is transferred to class VAE  
            if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3: 
                self.model.fit(     
                    x_train
                    , x_target
                    , batch_size = batch_size
                    , shuffle = True
                    , epochs = epochs
                    , initial_epoch = initial_epoch
                )
            
            # Solution with transfer of data from the Encoder to the Decoder output layer
            if self.solution_type == 1: 
                self.model.fit(     
                    x_train
                    , {'vae_out_main': x_target, 'vae_out_mu': t_mu, 'vae_out_var':t_logvar}
    #               also working  
    #                , [x_train, t_mu, t_logvar] # we provide some dummy tensors here  
                    , batch_size = batch_size
                    , shuffle = True
                    , epochs = epochs
                    , initial_epoch = initial_epoch
                    #, verbose=1
                    , callbacks=[MyPrinterCallback()]
                )
    
        # If data are provided as a batched dataflow from a generator - e.g. for Celeb A 
        else: 

            # Solution with transfer of data from the Encoder to the Decoder output layer
            if self.solution_type == 1: 
                print("We have no solution yet for solution_type==1 and generators !")
                sys.exit()

            # Solutions with layer.add_loss or model.add_loss() 
            # Solution with train_step() and GradientTape(): Control is transferred to class VAE  
            if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3: 
                self.model.fit(     
                    x_train   # coming as a batched dataflow from the outside generator - no batch size required here 
                    , shuffle = True
                    , epochs = epochs
                    , initial_epoch = initial_epoch
                )

As I have not yet tested a solution for solution_type==1 in combination with generators, I leave the writing of working code to the reader. Sorry, I did not find the time for experiments. Presently, I use generators only in combination with the add_loss() based solutions and the solution based on train_step() and GradientTape().

Note also that if we use generators they must take care of a flow of target data, too. As said: You must not take “x_train” literally in the case of generators. It is then more of a continuously created “dataflow” of batches – both for the training’s input and target data.

Conclusion

In this post I have outlined how we can use the method train_step() and the tape context of Tensorflow’s GradientTape() to control loss contributions and their gradients. Though done for the specific case of the KL-loss of a VAE, the general approach should have become clear.

I have added a new class to create a Keras model from a pre-constructed Encoder and Decoder. For convenience reasons we still create the layer structure with our old class “MyVariationalAutoencoder”. But we then switch control to a new instance of a class derived from Keras’ Model. This class uses customized versions of train_step() and GradientTape().

In addition I have added some more flexibility: We can now include a dataflow generator for input data (such as images) which do not fit into the VRAM (video RAM) of our graphics card but do fit into the PC’s standard RAM. We can also switch to MSE instead of BCE for the reconstruction loss.

The task of the KL-loss is to compress the data distribution in the latent space and normalize the distribution around certain feature centers there. In the next post
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator
we apply this to images of faces. We shall use the “Celeb A” dataset for this purpose. We are going to see that the scaling factor of the KL loss in this case has to be chosen rather large in comparison to simple cases like MNIST. We will also see that choosing a high dimension of the latent space does not really help to create a reasonable face from statistically chosen points in the latent space.

And before I forget it:
Ceterum Censeo: The worst fascist and war criminal living today is the Putler. He must be isolated at all levels, be denazified and imprisoned. Long live a free and democratic Ukraine!

 

Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution

In the last posts of this series

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss

I have discussed basics of Autoencoders. We have also set up a simple Autoencoder with the help of the functional Keras interface to Tensorflow 2. This worked flawlessly and we could apply our Autoencoder [AE] to the MNIST dataset. Thus we got a good reference point for further experiments. Now let us turn to the design of “Variational Autoencoders” [VAEs].

In the present post I want to demonstrate that some simple classical recipes for the construction of VAEs may not work with Tensorflow [TF] > version 2.3 due to the “eager execution mode”, which – in contrast to the “graph mode” of TF 1.x versions – is activated as the default environment for all command interpretation and execution. This includes gradient determination during the forward pass through the layers of an artificial neural network [ANN].

Addendum 25.05.2022: This post had to be changed as its original version contained wrong statements.

As we already know from the first post of this series we need a special loss function to control the parameters of distributions in the latent space. These distributions are used to calculate z-points for individual samples and must be “fit” optimally. I list four methods to calculate such a loss. All methods are taken from introductory books on Machine Learning (see the book references in the last section of this post). I use one concrete and exemplary method to realize a VAE: We first extend the layers of the AE-Encoder by two layers (“mu”, “var_log”) which give us the basis for the calculation of z-points on a statistical distribution. Then we use a special layer on top of the Decoder model to calculate the so-called “Kullback-Leibler loss” based on data of the “mu” and “var_log” layers. Our VAE Keras model will be based on the Encoder, the Decoder and the special layer. This approach will give us a typical error message to think about.

Building a VAE

A VAE (as an AE) maps points/vectors of the “variable space” to points/vectors in the low-dimensional “latent space”. However, a VAE does not calculate the “z-points” directly. Instead it uses a statistical distribution of the variables around a mean value. This opens up further degrees of freedom, which become subject to the optimization process. These degrees of freedom are a mean value “mu” of a distribution and a “standard deviation”. The latter is derived from a variance “var”, of which we take the logarithm “log_var” for practical and numerical reasons.

Note: Our neural network parts of the VAE will decide themselves during training for which samples they use which mu and which var_log values. They optimize the required values via specific weights at special layers.

A special function

We first define a function whose meaning will become clear in a minute:

# Function to calculate a z-point based on a Gaussian standard distribution  
def sampling(args):
    mu, log_var = args
    # A randomized value from a standard distribution
    epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
    # A point in a Gaussian standard distribution defined by mu and var with log_var = log(var)
    return mu + B.exp(log_var / 2) * epsilon

This function will be used to calculate z-points from other variables, namely mu and log_var, of the Encoder.

The Encoder

The VAE Encoder looks almost the same as the Encoder of the AE which we build in the last post:

z_dim = 16

# The Encoder 
# ************
encoder_input = Input(shape=(28,28,1))
x = encoder_input
x = Conv2D(filters = 32, kernel_size = 3, strides = 1, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 64, kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 128, kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)

# some information we later need for the decoder - for MNIST and our layers (7, 7, 128)
shape_before_flattening = B.int_shape(x)[1:]  # B is the tensorflow.keras backend ! See last post. 

x = Flatten()(x)

# differences to AE-models: the following layers are central elements of VAEs!   
mu      = Dense(z_dim, name='mu')(x)
log_var = Dense(z_dim, name='log_var')(x)

# We calculate z-points/vectors in the latent space by a special function
# used by a Keras Lambda layer   
enc_out = Lambda(sampling, name='enc_out_z')([mu, log_var])    

# The Encoder model 
encoder = Model(encoder_input, [enc_out], name="encoder")
encoder.summary()

The differences to the AE of the last post comprise two extra Dense layers: mu and log_var. Note that the dimension of the respective rank 1 tensor (a vector!) is equal to z_dim.

mu and log_var are (vector) variables which later shall be optimized. So, we have to treat them as trainable network variables. In the above code this is done via the weights of the two defined Dense layers which we integrated into the network. But note that we did not define any activation function for these layers! The calculation of the Encoder’s eventual output (the z-point) from these layers’ tensors is done by a more complex function which we encapsulated in “sampling()”.

Therefore, we have in addition defined a Lambda layer which applies the function “sampling()” to the tensors mu and log_var. The Lambda layer thus delivers the required z-point (i.e. a vector) for each sample in the z-dimensional latent space.

z-point calculation

How exactly do we calculate a z-point or vector in the latent space? And what does the KL loss, which we later shall use, punish? For details you need to read the literature. But in short terms:

Instead of calculating a z-point directly we use a Gaussian distribution depending on 2 parameters – our mean value “mu” and a variance “var” (respectively the standard deviation derived from it). We calculate a statistically random z-point within the range of this distribution. Note that we talk about a vector distribution. The distribution “variables” mu and log_var are therefore vectors.

If “x” is the last layer of our conventional network part then

mu = Dense(self.z_dim, name='mu')(x)
log_var = Dense(self.z_dim, name='log_var')(x) 

Having a value for mu and log_var we use a randomized factor “epsilon” to calculate a “variational” z-point with some statistical fluctuation:

 
epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
z_point = mu + B.exp(log_var/2)*epsilon        

This is the core of the “sampling()”-function which we defined a minute ago, already.

As you see the randomization of epsilon assumes a standard Gaussian distribution of possible values around a mean = 0 with a standard deviation of 1. The calculated z-point is then placed in the vicinity of a fictitious z-point “mu”. The coordinates of “mu” and the later value for log_var are subjects of optimization. More specifically, the distance of mu to the center of the latent space and too big values of the standard deviation around mu will be punished by the loss. Mathematically this is achieved by the KL loss (see below). It will help to get a compact arrangement of the z-points for the training samples in the z-space (= latent space).

To transform the result of the above calculation into an ordinary z-point output of the Encoder for a given input sample we apply a Keras Lambda layer which takes the mu- and log_var-layers as input and applies the function “sampling()“, which we have already defined, to input in form of the mu and var_log tensors.

The Decoder

The Decoder is pretty much the same as the one which we defined earlier for the Autoencoder.

dec_inp_z = Input(shape=(z_dim,))
x = Dense(np.prod(shape_before_flattening))(dec_inp_z)
x = Reshape(shape_before_flattening)(x)
x = Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x) 
x = Conv2DTranspose(filters=32, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x) 
x = Conv2DTranspose(filters=1,  kernel_size=3, strides=1, padding='same')(x)
x = LeakyReLU()(x) 
# Output - 
x = Activation('sigmoid')(x)
dec_out = x
decoder = Model([dec_inp_z], [dec_out], name="decoder")
decoder.summary()

The Kullback-Leibler loss

Now, we must define a loss which helps to optimize the weights associated with the mu and var_log layers. We will choose the so called “Kullback-Leibler [KL] loss” for this purpose. Mathematically this convex loss function is derived from a general distance metric (Kullback_Leibler divergence) for two probability distributions. See:
https://en.wikipedia.org/wiki/Kullback-Leibler_divergence – and especially the examples section there
https://en.wikipedia.org/wiki/Divergence_(statistics)

In our case we pick, as the first distribution, a normal distribution calculated from a specific set of vectors mu and log_var, and compare it to a standard normal distribution with mu = 0.0 and standard deviation sqrt(exp(log_var)) = 1.0. All for a specific z-point (corresponding to an input sample).

Written symbolically the KL loss is calculated by:

kl_loss  = -0.5* mean(1 + log_var - square(mu) - exp(log_var))

The “mean()” accounts for the fact that we deal with vectors (at this point not with respect to batches). Our loss punishes a “mu” far off the origin of the “latent space” by a quadratic term, and punishes, via a more complicated term depending on log_var, a variance var = square(sigma) – with sigma being the standard deviation – that deviates from 1. The loss term for sigma is convex around sigma = 1.0. Regarding batches we shall later take a mean of the individual samples’ losses.
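
To make the averaging explicit, here is a small numerical sketch (with made-up mu and log_var tensors) which computes the KL loss exactly as stated above: first a mean over the z-dimensions per sample, then a mean over the batch.

# Small numerical sketch of the KL loss - mu and log_var are just random stand-ins
import tensorflow as tf

batch_size, z_dim = 4, 16
mu      = tf.random.normal(shape=(batch_size, z_dim))
log_var = tf.random.normal(shape=(batch_size, z_dim))

# mean over the z-dimensions => one KL value per sample
kl_per_sample = -0.5 * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var), axis=1)

# mean over the batch => a scalar loss contribution
kl_loss = tf.reduce_mean(kl_per_sample)
print(kl_loss.numpy())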

This means that the KL loss tries to push the mu for the samples towards the center of the latent space’s origin and keep the variance around mu within acceptable limits. The mu-dependent term leads to an effective use of the latent space around the origin. Potential clusters of similar samples in the z-space will have centers not too far from the origin. The data shall not spread over vast regions of the latent space. The reduction of the spread around a mu-value keeps similar data-points in close vicinity – leading to rather confined clusters – as good as possible.

What are the consequences for the calculation of the gradient components, i.e. partial derivatives with respect to the individual weights or other trainable parameters of the ANN?

Note that the eventual values of the mu and log_var vectors potentially will depend

  • on the KL loss and its derivatives with respect to the elements of the mu and var_log tensors,
  • on the dependence of the mu and var_log tensors on contributory sums, activations and weights of previous layers of the Encoder,
  • and on a sequence of derivatives defined by
    • a loss function for differences between the reconstructed output of the Decoder and the Encoder’s input,
    • the Decoder’s layers’ sum-like contributions, activation functions and weights,
    • the output layer (better: its weights and activation function) of the Encoder,
    • the sum-like contributions, weights and activation functions of the Encoder’s layers.

All partial derivatives are coupled by the “chain rule” of differential calculus and followed through the layers – until a specific weight or trainable parameter in the layer sequence is reached. So we speak about a mix, more precisely a sum, of two loss contributions:

  • a reconstruction loss [reco_loss]: a standard loss for the differences of the output of the VAE with respect to its input and its dependence on all layers’ activations and weights. We have called this type of loss which measures the reconstruction accuracy of samples from their z-data the “reconstruction loss” for an AE in the last post. We name the respective variable “reco_loss”.
  • Kullback-Leibler loss [kl_loss]: a special loss with respect to the values of the mu and log_var tensors and its dependence on previous layers’ activations and weights. We name the respective variable “kl_loss”.

The two different losses can be combined using weight factors to balance the impact of good reconstruction of the original samples and a confined distribution of data points in the z-space.

total_loss = reco_loss + fact * kl_loss

I shall use the “binary_crossentropy” loss for the reco_loss term as it leads to good and fast convergence during training.

reco_loss =>  binary_crossentropy 

The total loss is a sum of two terms, but the weight corrections derived from either term by partial derivatives during error backpropagation affect different parts of the VAE: The optimization of the KL-loss directly affects only Encoder weights – in our example the weights of the mu-, the log_var- and the Conv2D-layers of the Encoder. The weights of the Decoder are only indirectly influenced by the KL-loss. The optimization of the “reconstruction loss” instead has a direct impact on all weights of all layers.
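
The following toy sketch illustrates this point with tiny stand-in models (not the networks of this post): the gradient of the KL part with respect to the Decoder weights comes out as None, while the reconstruction part yields gradients for both sub-models. All layer sizes below are arbitrary choices for the illustration.

# Toy sketch with stand-in models: which weights do the two loss parts reach?
import tensorflow as tf
from tensorflow.keras import layers, Model

z_dim = 2
inp      = layers.Input(shape=(8,))
h        = layers.Dense(16, activation='relu')(inp)
mu_t     = layers.Dense(z_dim, name='mu')(h)
logvar_t = layers.Dense(z_dim, name='log_var')(h)
encoder  = Model(inp, [mu_t, logvar_t])

dec_inp  = layers.Input(shape=(z_dim,))
dec_out  = layers.Dense(8, activation='sigmoid')(dec_inp)
decoder  = Model(dec_inp, dec_out)

x = tf.random.uniform((4, 8))
with tf.GradientTape(persistent=True) as tape:
    mu, log_var = encoder(x)
    epsilon   = tf.random.normal(tf.shape(mu))
    z         = mu + tf.exp(log_var / 2) * epsilon
    reco      = decoder(z)
    reco_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(x, reco))
    kl_loss   = -0.5 * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))

# The KL loss does not depend on the Decoder => gradients w.r.t. its weights are None
print(tape.gradient(kl_loss, decoder.trainable_weights))    # [None, None]
# The reconstruction loss reaches the Encoder weights as well
print([g is not None for g in tape.gradient(reco_loss, encoder.trainable_weights)])
del tape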

A VAE, therefore, is an interesting example of loss contributions which depend only on certain layers or on specific parts of a complex, composite ANN model with sub-divisions. So, the consequences of a rigorous “eager execution mode” of the present Tensorflow versions for gradient calculations are of real and general interest also for other networks with customized loss contributions.

How to implement the KL-loss within a VAE-model? Some typical (older) methods ….

We could already define a (preliminary) keras model comprising the Encoder and the Decoder:

enc_output = encoder(encoder_input)
decoder_output = decoder(enc_output)
vae_pre = Model(encoder_input, decoder_output, name="vae_without_kl_loss")

This model would, however, not include the KL-loss. We, therefore, must make changes to it. I found four different methods in introductory ML books, whose authors mostly use the Keras functional interface to TF2 (though in early versions) to set up a layer structure similar to ours above. See the references to the books in the final section of this post. Most authors then handle the KL-loss by using the tensors of our two special (Dense) layers for “mu” and “log_var” in the Encoder somewhat later:

  • Method 1: D. Foster first creates his VAE-model and then uses separate functions to calculate the KL loss and a MSE loss regarding the difference between the output and input tensors. Afterwards he puts a function (e.g. total_loss()) which adds both contributions up to a total loss into the Keras compile() function as a closure – as in “compile(optimizer=’adam’, loss=total_loss)”.
  • Method 2: F. Chollet in his (somewhat older) book on “Deep Learning with Keras and Python” defines a special customized Keras layer in addition to the Encoder and Decoder models of the VAE. This layer receives an internal function to calculate the total loss. The layer is then used as a final layer to define the VAE model and invokes the loss by the Layer.add_loss() functionality.
  • Method 3: A. Geron in his book on “Machine Learning with Scikit-Learn, Keras and Tensorflow” also calculates the KL-loss after the definition of the VAE-model, but associates it with the (Keras) Model by the “model.add_loss()” functionality.
    He then calls the model’s compile statement with the inclusion of a “binary_crossentropy” loss as the main loss component. As in
    compile(optimizer=’adam’, loss=’binary_crossentropy’).
    Geron’s approach relies on the fact that a loss defined by Model.add_loss() automatically gets added to the loss defined in the compile statement behind the scenes. Geron directly refers to the mu and log_var layers’ tensors when calculating the loss. He does not use an intermediate function. (A schematic sketch of this approach follows after this list.)
  • Method 4: A. Atienza structurally does something similar to Geron, but he calculates a total loss, adds it with model.add_loss() and then calls the compile statement without any loss – as in compile(optimizer=’adam’). Atienza, too, calculates the KL-loss by directly operating on the layers’ tensors.
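
For illustration, here is a schematic sketch of method 3, reconstructed only from the description above – it is not Geron’s original code. It reuses the encoder_input, mu, log_var and enc_out tensors of the Encoder defined earlier in this post; the model name and the exemplary scaling factor are my own choices.

# Schematic sketch of method 3: KL loss attached via model.add_loss()
vae_output = decoder(enc_out)                    # enc_out: the Lambda layer's z-point tensor
vae        = Model(encoder_input, vae_output, name="vae_add_loss")

# KL loss calculated directly on the layer tensors - no intermediate function
# (the factor 0.5e-4 is just an exemplary down-scaling of the KL term)
kl_loss = -0.5e-4 * B.mean(1 + log_var - B.square(mu) - B.exp(log_var))
vae.add_loss(kl_loss)
vae.add_metric(kl_loss, name='kl_loss', aggregation='mean')

# Only the reconstruction loss is named in compile(); the added KL loss
# is summed in behind the scenes
vae.compile(optimizer=Adam(), loss='binary_crossentropy')

# Training then works as usual, e.g.
#   vae.fit(x_train, x_train, batch_size=128, epochs=3, shuffle=True)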

A specific way of including the VAE loss (adaptation of method 2)

A code snippet to realize method 2 is given below. First we define a class for a “Customized keras Layer”.

Customized Keras layer class:

class CustVariationalLayer (Layer):
    
    def vae_loss(self, x_inp_img, z_reco_img):
        # The references to the layers are resolved outside the function 
        x = B.flatten(x_inp_img)   # B: tensorflow.keras.backend
        z = B.flatten(z_reco_img)
        
        # reconstruction loss per sample 
        # Note that this is averaged over all features (e.g. 784 for MNIST) 
        reco_loss = tf.keras.metrics.binary_crossentropy(x, z)
        
        # KL loss per sample - we reduce it by a factor of 1.e-4 
        # to make it comparable to the reco_loss  
        kln_loss  = -0.5e-4 * B.mean(1 + log_var - B.square(mu) - B.exp(log_var), axis=1) 
        # mean per batch (axis = 0 is automatically assumed) 
        return B.mean(reco_loss + kln_loss), B.mean(reco_loss), B.mean(kln_loss) 
           
    def call(self, inputs):
        inp_img = inputs[0]
        out_img = inputs[1]
        total_loss, reco_loss, kln_loss = self.vae_loss(inp_img, out_img)
        self.add_loss(total_loss, inputs=inputs)
        self.add_metric(total_loss, name='total_loss', aggregation='mean')
        self.add_metric(reco_loss, name='reco_loss', aggregation='mean')
        self.add_metric(kln_loss, name='kl_loss', aggregation='mean')
        
        return out_img  #not really used in this approach  

Now, we add a special layer based on the above class and use it in the definition of a VAE model. I follow the code in F. Chollet’s book; there the customized layer concludes the layer structure after the Encoder and the Decoder:

enc_output = encoder(encoder_input)
decoder_output = decoder(enc_output)

# add the custom layer to the model  
fc = CustVariationalLayer()([encoder_input, decoder_output])

vae = Model(encoder_input, fc, name="vae")
vae.summary()

The output fits the expectations

Eventually we need to compile. F. Chollet in his book from 2017 defines the loss as “None” because it is covered already by the layer.

vae.compile(optimizer=Adam(), loss=None)

fit() fails … a typical error message

We are ready to train – at least I thought so … We start training with scaled x_train images, coming e.g. from the MNIST dataset, by the following statements:

n_epochs = 3; 
batch_size = 128
vae.fit( x=x_train, y=None, shuffle=True, 
         epochs = n_epochs, batch_size=batch_size)

Unfortunately we get the following typical

Error message:

 
TypeError: You are passing KerasTensor(type_spec=TensorSpec(shape=(), dtype=tf.float32, name=None), name='tf.math.reduce_sum_1/Sum:0', description="created by layer 'tf.math.reduce_sum_1'"), an intermediate Keras symbolic input/output, to a TF API that does not allow registering custom dispatchers, such as `tf.cond`, `tf.function`, gradient tapes, or `tf.map_fn`. Keras Functional model construction only supports TF API calls that *do* support dispatching, such as `tf.math.add` or `tf.reshape`. Other APIs cannot be called directly on symbolic Keras inputs/outputs. You can work around this limitation by putting the operation in a custom Keras layer `call` and calling that layer on this symbolic input/output.

The error is due to “eager execution”!

The problem I experienced was related to the rigorous “eager execution mode” implemented in Tensorflow ≥ 2.3. If we deactivate “eager execution” by the following statements before we define any layers

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

then training runs without the error.

You can safely ignore the warning which may appear in the output; it is due to a superfluous environment variable.

And the other approaches for the KL loss?

Addendum 25.05.2022: This paragraph had to be changed as its original version contained wrong statements.

I had and have code examples for all four variants of implementing the KL loss listed above. Methods 1 and 2 do NOT work with standard TF 2.7 or 2.8 and an active “eager execution” mode. Similar error messages as described for method 2 came up when I tried the original code of D. Foster, which you can download from his github repository. However, methods 3 and 4 do work – as long as you avoid any intermediate function to calculate the KL loss.

Important remark about F. Chollet’s present solution:
I should add that F. Chollet has, of course, meanwhile published a modern and working approach in the present online Keras documentation. I shall address his actual solution in a later post. His book is also the oldest of the four referenced.

Conclusion

In this post we have invested a lot of effort to realize a Keras model for a VAE – including the Kullback-Leibler loss. We have followed recipes of various books on Machine Learning. Unfortunately, we have to face a challenge:

Recipes for implementing the KL loss of VAEs in books which are older than roughly two years may no longer work with present versions of Tensorflow 2 and its “eager execution mode”.

In the next post of this series
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
we shall have a closer look at the problem. I will try to derive a central rule which will be helpful for the design of various solutions that do work with TF 2.8.

Literature references

F. Chollet, “Deep Learning mit Python und Keras”, 2018, 1st German edition, mitp Verlags GmbH & Co. KG, Frechen

D. Foster, “Generatives Deep Learning”, 2020, 1st German edition, dpunkt Verlag, Heidelberg, in cooperation with Media Inc. O’Reilly, ISBN 978-3-960009-128-8. See chapter 3 and the VAE code published at
https://github.com/davidADSP/GDL_code/

A. Geron, “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, 2019, 2nd edition, O’Reilly, Sebastopol, CA, ISBN 978-1-492-03264-9. See chapter 17.

R. Atienza, “Advanced Deep Learning with Tensorflow 2 and Keras”, 2020, 2nd edition, Packt Publishing, Birmingham, UK, ISBN 978-1-83882-165-4. See chapter 8.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.