Variational Autoencoder with Tensorflow – XIV – Change of variational model parameters at inference time

In my last post of this series I compared a Variational Autoencoder [VAE] with only a tiny amount of KL-loss to a standard Autoencoder [AE]. See the links at

Variational Autoencoder with Tensorflow – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?

for more information and preparational posts.

Both the Keras models for the VAE and the AE were trained on the CelebA dataset of images of human heads. We found a tight similarity regarding the clustering of predicted data points for the training and test data in the latent space. In addition the VAE with tiny KL loss failed to reconstruct reasonable human face images from arbitrarily chosen points in the latent space. Just as a standard AE does. In forthcoming posts we will continue to study the relation between VAEs and AEs.

But in this post I want to briefly point out an interesting technical problem which may arise when you start to tests predictions for certain data samples after a training phase. Your Encoder or Decoder models may include parameters which you want to experiment with when predicting results for interesting input data. This raises the question whether we can vary such parameters at inference time. Actually, this was not quite as easy as it seemed to be when I started with respective experiments. To perform them I had to learn two aspects of Keras models I had not been aware of before.

How to switch of the z-point variation at inference time?

In my particular case the starting point was the following consideration:

At inference time there is no real need for using the logvar-based variation around mu-values predicted by the Encoder.

The variation in z-point values in VAEs is done by adding a statistical value to mu-values. The added term is based on a log_var value multiplied by a statistically fluctuating factor “epsilon”, which comes from a normal Gaussian distribution around zero. mu, therefore, is the center of a distribution a specific input tensor is mapped to in latent space for consecutive predictions. The mu- and log_var-values depend on weights of two dense layers of the Encoder and thus indirectly on the optimization during training.

But while the variation is essential during training one may NOT regard it necessary for predictions. During inference we may in some experiments have good reasons to only refer to the central mu value when predicting a z-point in latent space. For test and analysis purposes it could be interesting to omit the log_var contribution.

The question then is: How we can switch off the log_var component for the Encoder’s predictions, i.e. predictions of our Keras based encoder model?

One idea is to include variables of a Python class hosting the Keras models for the Encoder, Decoder and the composed VAE in the function for the calculation of z-point vectors.

The mu-, logvar and sampling layers of the VAE’s Encoder model encapsulated in a Python class

During this post series we have encapsulated the code for Encoder, Decoder and resulting VAE models in a Python class. Remember that the Encoder produced its output, namely z-points in the latent space via two “dense” Keras layers and a sampling layer (based on a Keras Lambda-layer). The dense layers followed a series of convolutional layers and gave us mu and log_var values. The Lambda-layer produced the eventual z-point-vector including the variation. In my case the code fragments for the layers look similar to the following ones:

# .... Layer model of the Encoder part 
...
...     # Definition of an input layer and multiple Conv2D layers 
...
        # "Variational" part - 2 Dense layers for a statistical distribution of z-points  
        self.mu      = Dense(self.z_dim, name='mu')(x)
        self.log_var = Dense(self.z_dim, name='log_var')(x)
        # Layer to provide a z_point in the Latent Space for each sample of the batch 
        self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])
...
        # The Encoder model 
        self.encoder = Model(inputs=self._encoder_input, outputs=[self._encoder_output, self.mu, self.log_var], name="encoder")
...

The “self” refers to a class “MyVariationalAutoencoder” comprising the Encoder’s, Decoder’s and the VAE’s model and layer structures. See for details and explained code fragments of the class e.g. the posts around Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images.

The sampling is in my case done by a function “z_point_sampling”:

        def z_point_sampling(args):
            '''
            A point in the latent space is calculated statistically 
            around an optimized mu for each sample 
            '''
            mu, log_var = args # Note: These are 1D tensors !
            epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
            return mu + B.exp(log_var / 2.) * epsilon * self.enc_v_eps_factor

You see that this function uses a class member variable “self.enc_v_eps_factor”.

Switch the variation with log_var on and off for predictions?

Our objective is to switch the log_var contribution on or off for input certain images or batches of such images fed into the Encoder. For this purpose we could in principle use the variable “self.enc_v_eps_factor” as a kind of boolean switch with values of either 0.0 or 1.0. To set the variable I had defined two class methods:

    def set_enc_to_predict(self):
        self.enc_v_eps_factor = 0.0 
    
    def set_enc_to_train(self):  
        self.enc_v_eps_factor = 1.0 

The basic idea was that the sampling function would pick the value of enc_v_eps_factor given at the runtime of a prediction, i.e. at inference time. This assumption was, however, wrong. At least in a certain sense.

Is a class variable change impacting a layer output noted during consecutive predictions of a Keras model?

Let us assume that we have instantiated our class and assigned it to a Python variable MyVae. Let us further assume the the comprised Keras models are referenced by variables

  • MyVae.encoder (for the Encoder part),
  • MyVae.decoder (for the Decoder part)
  • and MyVae.model (for the full VAE-model).

We do not care about further details of the VAE (consisting of the Encoder, the Decoder and GradientTape based cost control). But we should not forget that all the models’ layers and their weights determine cost functions derivatives and are therefore targets of the optimization performed during training. All factors determining gradients and value calculation with given weights are encoded with the compilation of a Keras model – for training purposes [using model.fit() with a Keras model], but as well for predictions! Without a compiled Keras model we cannot use model.predict().

This means: As soon as you have a compiled Keras model almost and load the weight-values saved after a a sufficient number of training epochs everything is settled for inference and predictions. Including the present value of self.enc_v_eps_factor. At compile time?

Well, thinking about it a bit more in detail from a developer perspective tells us:
The compilation would in principle not prevent the use of a changed variable at the run-time of a prediction. But on the other hand side we also have the feeling that Keras must do something to make training (which also requires predictions in the sense of a forward pass) and later raw predictions for batches at inference time pretty fast. Intermediate functionality changes would hamper performance – if only for the reason that you have to watch out for such changes.

So, it is natural to assume that Keras would keep any factors in the Lambda-layer taken from a class variable constant after compilation and during training or inference, i.e. predictions.

If this assumption were true then chain of actions AFTER a training of a VAE model (!) like

Define a Keras based VAE-model with a sampling layer and a factor enc_v_eps_factor = 1   =>   compile it (including the sub-models for the Encoder and Decoder)   =>   load weight parameters which had been saved after training   =>   switch the value of the class variable enc_v_eps_factor to 0.0   =>   Load an image or image batch for prediction

would probably NOT work as expected. To be honest: This is “wisdom” derived after experiments which did not give me the naively expected results.

Indeed, some first simple experiments showed: The value of enc_v_eps_factor (e.g. enc_v_eps_factor = 1) which was given at compile-time was used during all following prediction calculations, e.g. for a particular image in tensorial form. So, a command sequence like

MyVae.set_enc_to_train()
MyVae.encoder.compile() 
...
# Load weights from a set path 
MyVae.model.load_weights(path_weights)
...

z_point, mu, log_var = MyVae.encoder.predict(img1)
print(z_point, mu)
MyVae.set_enc_to_predict(self)
z_point, mu, log_var = MyVae.encoder.predict(img1)

did not give different results. Note, that I did not change the value of enc_v_eps_factor between compile time and the first call for a prediction.

Let us look at the example in more detail.

A concrete example

After a full training of my VAE on CelebA for 24 epochs I checked for a maximum log_var-value. Non-negligible values may indeed occur for certain images and special z-vector components for some images despite the tiny KL-loss. And indeed such a value occurred for a very special (singular) image with two diagonal black border stripes on the left and right of the photographed person’s head. I do not show the image due to digital rights concerns. But let us look at the predicted z, mu and log_var values (for a specific z-point vector component) of this image. Before I compiled the VAE model I had set enc_v_eps_factor = 1.0 :

# Set the learning rate and COMPILE the model 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
learning_rate = 0.0005

# The following is only required for compatibility reasons
b_old_optimizer = True     

# Set enc_v_eps_factor to 1.0
MyVae.set_enc_to_train()

# Separate Encoder compilation
# - does not harm the compilation of the full VAE-model, but is useful to avoid later trace warnings after changes 
MyVae.encoder.compile()

# Compilation of the full VAE model (Encoder, Decoder, cost functions used by GradientTape) 
MyVae.compile_myVAE(learning_rate=learning_rate, b_old_optimizer = b_old_optimizer )
...

# Load weights from a set path 
MyVae.model.load_weights(path_weights)
...
# start predictions
...

with

    def compile_myVAE(self, learning_rate, b_old_optimizer = False):
        # Version 1.1 of 221212
        # Forced to special handling of the optimizer for data resulting from training before TF 11.2 , due to warnings:  
        #      ValueError: You are trying to restore a checkpoint from a legacy Keras optimizer into a v2.11+ Optimizer, 
        #      which can cause errors. Please update the optimizer referenced in your code to be an instance 
        #      of `tf.keras.optimizers.legacy.Optimizer`, e.g.: `tf.keras.optimizers.legacy.Adam`.

        # Optimizer handling
        # ~~~~~~~~~ 
        if b_old_optimizer: 
            optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=learning_rate)
        else:    
            optimizer = Adam(learning_rate=learning_rate)
        ....

        # Solution type with train_step() and GradientTape()  
        if self.solution_type == 3:
            self.model.compile(optimizer=optimizer)

Details of the compilation are not important – though you may be interested in the fact that training data saved after a training based on Python module version < 11.2 of TF 2 requires a legacy version of the optimizer when later using a module version ≥ 11.2 (corresponding to TF V2.11 and above).

However, the really important point is that the compilation is done given a certain value of enc_v_eps_factor = 1.

Then we load the image in form a prepared training batch with just one element set and provide it to the predict() function of the Keras model for the Encoder. We perform two predictions and ahead of the second one we change the value of enc_v_eps_factor to 0.0:

# Choose an image 
j = 123223
img = x_train[j] # this has already a tensor compatible format 
img_list = []
img_list.append(img)
tf_img = tf.convert_to_tensor(img_list)

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)

print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

# !!!! Set enc_v_eps_factor to 0.0 !!!!
MyVae.set_enc_to_predict()

# New Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)

print()
print()

print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

The result is:

...
2.637196
-0.141142
3.3761873
...
-0.2085279
-0.141142
3.3761873

The result depends on statistical variations for the factor epsilon (Gaussian statistics; see the sampling function above).

But the central point is not the deviation for the two different prediction calls. The real point is that we have used MyVae.set_enc_to_predict() ahead of the second prediction and, yet, the values for mu and log_var for the special z_point-component (230; out of 256 components) were NOT identical. I.e. the variable value enc_v_eps_factor = 1.0, which we set before the compilation, was used during both of our prediction calculations!

Can we just recompile between different calls to model.predict() ?

The experiment described above seems to indicate that the value for the class variable enc_v_eps_factor given at compile time is used during all consecutive predictions. We could, of course, enforce a zero variation for all predictions by using MyVae.set_enc_to_predict() ahead of the compilation of the Encoder model. But this would not give us no flexibility to switch the log_var contribution off ahead of predictions for some special images and then turn it on again for other images.

But the is simple – if we need not do this switching permanently: We just recompile the Encoder model!

Compilation does not take much time for Encoder models with only a few (convolutional and dense) layers. Let us test this by modifying the code above:

# Choose an image 
j = 123223
img = x_train[j] # this has already a tensor compatible format 
img_list = []
img_list.append(img)
tf_img = tf.convert_to_tensor(img_list)

# Set enc_v_eps_factor to 0.0
MyVae.set_enc_to_predict()
# !!!!
MyVae.encoder.compile() 

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
# Decoder prediction - just for fun 
reco_list = MyVae.decoder.predict(z_points) # just for fun 
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

print() 
print()

# Set enc_v_eps_factor back to 1.0
MyVae.set_enc_to_train()
MyVae.encoder.compile()

# New Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)

print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

We get

Shape of img_list =  (1, 96, 96, 3)
eps_fact =  0.0
1/1 [==============================] - 0s 280ms/step
1/1 [==============================] - 0s 18ms/step
Shape of reco_list =  (1, 96, 96, 3)

-0.141142
-0.141142
3.3761873


eps_fact =  1.0
1/1 [==============================] - 0s 288ms/step

-0.63676023
-0.141142
3.3761873

This is exactly what we want!

The function for the prediction step of a Keras model is cached at inference time …

The example above gave us the impression that it could be the compilation of a model which “settles” all of the functionality used during predictions, i.e. at inference time. Actually, this is no quite true.

The documentation on a Keras model helped me to get a better understanding. Near the section on the method “predict()” we find some other interesting functions. A look at the remarks on “predict_step()“, reveals (quotation)

The logic for one inference step.

This method can be overridden to support custom inference logic. This method is called by Model.make_predict_function.

This method should contain the mathematical logic for one step of inference. This typically includes the forward pass.

This leads us to the function “make_predict_function()” for Keras models. And there we find the following interesting remarks – I quote:

This method can be overridden to support custom inference logic. This method is called by Model.predict and Model.predict_on_batch.

Typically, this method directly controls tf.function and tf.distribute.Strategy settings, and delegates the actual evaluation logic to Model.predict_step.

This function is cached the first time Model.predict or Model.predict_on_batch is called. The cache is cleared whenever Model.compile is called. You can skip the cache and generate again the function with force=True.

Ah! The function predict_step() normally covers the forward pass through the network and “make_predict_function()” caches the resulting (function) object at the first invocation of model.predict(). And the respective cache is not cleared automatically.

So, what really may have hindered my changes of the sampling functionality at inference time is a cache filled at the first call to encoder.predict()!

Let us test this!

Changing the sampling parameters after compilation, but before the first call of encoder.predict()

If our suspicion is right we should be able to set up the model from scratch again, compile it, use MyVae.set_enc_to_predict() and afterward call MyVae.encoder.predict() – and get the same values for mu and z_point.

So we do something like

# Build encoder according to layer parameters 
MyVae._build_enc()
# Build decoder according to layer parameters 
MyVae._build_dec()
# Build the VAE-model 
MyVae._build_VAE()
...

# Set variable to 1.0
MyVae.set_enc_to_train()

# Compile 
learning_rate = 0.0005
b_old_optimizer = True     
MyVae.compile_myVAE(learning_rate=learning_rate, b_old_optimizer = b_old_optimizer )
MyVae.encoder.compile()   # used to prevent retracing delays - when later changing encoder variables 

...
# Load weights from a set path 
MyVae.model.load_weights(path_weights)
...

# preparation if the selected img. 
...
...

MyVae.set_enc_to_predict()
print("eps_fact = ", MyVae.enc_v_eps_factor)
# Note: NO recompilation is done !

# First prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
print()
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])
..

Note that the change of enc_v_eps_factor ahead of the first call of predict(). And, indeed:

Shape of img_list =  (1, 96, 96, 3)
eps_fact =  0.0
...
-0.141142
-0.141142
3.3761873

Use make_predict_function(force=True) to clear and refill the cache for predict_step() and its forward pass functionality

The other option the documentation indicates is to use the function make_predict_function(force=True).
This leads to yet another experiment:

# img preparation 
....

# Set enc_v_eps_factor to 1.0
MyVae.set_enc_to_train()
MyVae.encoder.compile() 
print("eps_fact = ", MyVae.enc_v_eps_factor)

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

print() 
print()

# Set enc_v_eps_factor to 0.0
MyVae.set_enc_to_predict()
# !!!!
MyVae.encoder.make_predict_function(
    force=True
)
print("eps_fact = ", MyVae.enc_v_eps_factor)

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

We get

...
eps_fact =  1.0
1/1 [==============================] - 0s 287ms/step

-5.5451365
-0.141142
3.3761873


eps_fact =  0.0
1/1 [==============================] - 0s 271ms/step

-0.141142
-0.141142
3.3761873

Yes, exactly as expected. This again shows us that it is the cache which counts after the first call of model.predict() – and not the compilation of the Keras model (for the Encoder) !

Other approaches?

The general question of changing parameters at inference time also triggers the question whether we may be able to deliver parameters to the function model.predict() and transfer them further to customized variants of predict_step(). I found a similar question at stack overflow
Passing parameters to model.predict in tf.keras.Model

However, the example there was rather special – and I did not apply the lines of thought explained there to my own case. But the information given in the answer may still be useful for other readers.

Conclusion

In this post we have seen that we can change parameters influencing the forward pass of a Keras model at inference time. We saw, however, that we have to clear and fill a cache to make the changes effective. This can be achieved by

  • either applying a recompilation of the model
  • or enforcing a clearance and refilling of the cache for the model’s function predict_step().

In the special case of a VAE this allows for deactivating and re-activating the logvar-dependent statistical variation of the z-points a specific image is mapped to by the Encoder model during predictions. This gives us the option to focus on the central mu-dependent position of certain images in the latent space during experiments at inference time.

In the next post of this series we shall have a closer look at the filamental structure of the latent space of a VAE with tiny KL loss in comparison to the z-space structure of a VAE with sufficiently high KL loss.

 

Ceterum censeo: The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Somebody who orders the systematic destruction of civilian infrastructure must be fought and defeated because he is a permanent danger to basic principles of humanity – not only in Europe. Long live a free and democratic Ukraine!

 

Variational Autoencoder with Tensorflow – XI – image creation by a VAE trained on CelebA

I continue with my series on Variational Autoencoders [VAEs] and related methods to control the KL-loss.

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator
Variational Autoencoder with Tensorflow – X – VAE application to CelebA images

VAEs fall into a section of ML which is called “Generative Deep Learning“. The reason is that we can VAEs to create images with contain objects with features of objects learned from training images. One interesting category of such objects are human faces – of different color, with individual expressions and features and hairstyles, seen from different perspectives. One dataset which contains such images is the CelebA dataset.

During the last posts we came so far that we could train a CNN-based Variational Autoencoder [VAE] with images of the CelebA dataset. Even on graphics cards with low VRAM. Our VAE was equipped with a GradientTape()-based method for KL-loss control. We still have to prove that this method works in the expected way:

The distribution of data points (z-points) created by the VAE’s Encoder for training input should be confined to a region around the origin in the latent space (z-space). And neighboring z-points up to a limited distance should result in similar output of the Decoder.

Therefore, we have to look a bit deeper into the results of some VAE-experiments with the CelebA dataset. I have already pointed out why creating rather complex images from arbitrarily chosen points in the latent space is a suitable and good test for a VAE. Please remember that our efforts regarding the KL-loss have to do with the following fact:

not create reasonable images/objects from arbitrarily chosen z-points in the latent space.

This eliminates the use of an AE for creative purposes. A VAE, however, should be able to solve this type of task – at least for z-points in a limited surroundings of the latent space’s origin. Thus, by creating images from randomly selected z-points with the Decoder of a VAE, which has been trained on the CelebA data set, we cover two points:

  • Test 1: We test the functionality of the VAE-class, which we have developed and which includes the code for KL-loss handling via TF2’s GradientTape() and Keras’ train_step().
  • Test 2: We test the ability of the VAE’s Decoder to create images with convincing human-like face and hairstyle features from random z-points within an area close to the origin of the latent space.

Most of the experiments discussed below follow the same prescription: We take our trained VAE, select some random points in the latent space, feed the z-point-data into the VAE’s Decoder for a prediction and plot the images created on the Decoder’s output side. The Encoder only plays a role when we want to test reconstruction abilities.

For a low dimension z_dim=256 of the latent space we will find that the generated images display human faces reasonably well. But the images appear a bit blurry or unsharp – as if not fully focused. So, we need to discuss what we can do about this point. I will also name some plausible causes for the loss of accuracy in the representation of details.

Afterwards I want to show you that a VAE Decoder reconstructs original images relatively badly from the z-points calculated by the Encoder. At least when one looks at details. A simple AE with a sufficiently high dimension of the latent space performs much better. One may feel disappointed about the reconstruction ability of a VAE. But actually it is the ability of a VAE to forget about details and instead to focus on general features which enables it (the VAE) to create something meaningful from randomly chosen z-points in the latent space.

In a last step in this post we are going to look at images created from z-points with a growing distance from the origin of the multidimensional latent space [z-space]. (Distance can be defined by a L2-Euclidean norm). We will see that most z-points which have some z-coordinates above a value of 3 produce fancy images where the face structures get dominated or distorted by background structures learned from CelebA images. This effect was to be expected as the KL-loss enforced a distribution of the z-points which is confined to a region relatively close to the origin. Ideally, this distribution would be characterized by a normal distribution in all coordinates with a sigma of only 1. So, the fact that z-points in the vicinity of the origin of the latent space lead to a construction of images which show recognizable human faces is an indirect proof of the confining impact of the KL-loss on the z-point distribution. In another post I shall deliver data which prove this more directly.

Below I will call the latent space of a (V)AE also z-space.

Characteristics of the VAE tested

Our trained VAE with four Conv2D-layers in the Encoder and 4 corresponding Conv2DTranspose-Layers in the Decoder has the following basic characteristics:

(Encoder-CNN-) filters=(32,64,128,256), kernels=(3,3), stride=2,
reconstruction loss = BCE (binary crossentropy), fact=5.0, z_dim=256

The order of the filter- (= map-) numbers is, of course reversed for the Decoder. The factor fact to scale the KL-loss in comparison to the reconstruction loss was chosen to be fact=5, which led to a 3% contribution of the KL-loss to the total loss during training. The VAE was trained on 170,000 CelebA images with 24 epochs and a small epsilon=0.0005 plus Adam optimizer.

When you perform similar experiments on your own you may notice that the total loss values after around 24 epochs ( > 5015) are significantly higher than those of comparable experiments with a standard AE (4850). This already is an indication that our VAE will not reproduce a similar good match between an image reconstructed by the Decoder in comparison to the original input image fed into the Encoder.

Results for z-points with coordinates taken from a normal distribution around the origin of the latent space

The picture below shows some examples of generated face-images coming from randomly chosen z-points in the vicinity of the z-space’s origin. To calculate the coordinates of such z-points I applied a normal distribution:

z_points = np.random.normal(size = (n_to_show, z_dim)) # n_to_show = 28

So, what do the results for z_dim=256 look like?

Ok, we get reasonable images of human-like faces. The variations in perspective, face forms and hairstyles are also clearly visible and reflect part of the related variety in the training set. You will find more variations in more images below. So, we take this result as a success! In contrast to a pure AE we DO get something from random z-points which we clearly can interpret as human faces. The whole effort of confining z-points around the origin and at the same time of smearing out z-points with similar content over a region instead of a fixed point-mapping (as in an AE) has paid off. See for comparison:
Autoencoders, latent space and the curse of high dimensionality – I

Unfortunately, the images and their details details appear a bit blurry and not very sharp. Personally, this reminded me of the times when the first CCD-chips with relative low resolution were introduced in cameras and the raw image data looked disappointing as long as we did not apply some sharpening filters. The basic information to enhance details were there, but they had to be used explicitly to improve the plain raw data of the CCD.

The quality in details is about the same as what we see in example images in the book of D.Foster on “Generative Deep Learning”, 2019, O’Reilly. Despite the fact that Foster used a slightly higher resolution of the input images (128x128x3 pixels). The higher input resolution there also led to a higher resolution of the maps of the innermost convolutional layer. Regarding quality see also the images presented in:
https://datagen.tech/guides/image-datasets/celeba/

Enhancement processing of the images ?

Just for fun, I took a screenshot of my result, saved it and applied two different sharpening filters from the ShowFoto program:

Much better! And we do not have the impression that we added some fake information to the images by our post-processing ….

Now I hear already argument saying that such an enhancement should not be done. Frankly, I do not see any reason against post-processing of images created by a VAE-algorithm.

Remember: This is NOT about reproduction quality with respect to originals or a close-to-reality show. This is about generating new mages of human-like faces based on basic features which a VAE-algorithm hopefully has learned from training images. All of what we do with a VAE is creative. And it also comes close to a proof that ML-algorithms based on convolutional layers really can “learn” something about the basic features of objects presented to them. (The learned features are e.g. in the Encoder’s case saved in the sensitivity of the convolutional maps to typical patterns in the input images.)

And as in the case of raw-images of CCD or CMOS camera chips: Sometimes some post-processing is required to utilize the information optimally for sharpness.

Sharpening by PIL’s enhancement functionality

Of course we do not want to produce images in a ML run, take screenshots and sharpen each image individually. We need some tool that fits into the ML process pipeline. The good old PIL library for Python offers sharpening as one of multiple enhancement options for images. The next examples are results from the application of a PIL enhancement procedure:

These images look quite OK, too. The basic code fragment I used for each individual image in the above grid:

    # reconst_new is the output from my VAE's Decoder 
    ay_img      = reconst_new[i, :,:,:] * 255
    ay_img      = np.asarray(ay_img, dtype="uint8" )
    img_orig    = Image.fromarray(ay_img)
    img_shr_obj = ImageEnhance.Sharpness(img)
    sh_factor   = 7   # Specified Factor for Enhancing Sharpness
    img_sh      = img_shr_obj.enhance(sh_factor)

The sharpening factor I chose was quite high, namely sh_factor = 7.

The effect of PIL’s sharpening factor

Just to further demonstrate the effect of different factors for sharpening by PIL you find some examples below for sh_factor = 0, 3, 6.

sh_factor = 0

sh_factor = 3

sh_factor = 6

Obviously, the enhancement is important to get clearer and sharper images.
However, when you enlarge the images sufficiently enough you see some artifacts in the form of crossing lines. These artifacts are partially already existing in the Decoder’s output, but they are enhanced by the Sharpening mechanism used by PIL (unsharp masking). The artifacts become more pronounced with a growing sh_factor.
Hint: According to ML-literature the use of Upsampling layers instead of Conv2DTranspose layers in the Decoder may reduce such artefacts a bit. I have not yet tried it myself.

Assessment

How do we assess the point of relatively unclear, unsharp images produced by our VAE? What are plausible reasons for the loss of details?

  1. Firstly, already AEs with a latent space dimension z_dim=256 in general do not reconstruct brilliant images from z-points in the latent space. To get a good reconstruction quality even from an AE which does nothing else than to compress and reconstruct images of a size (96x96x3) z_dim-values > 1000 are required in my experience. More about this in another post in the future.
  2. A second important aspect is the following: Enforcing a compact distribution of similar images in the latent space via the KL-loss automatically introduces a loss of detail information. The KL-loss is designed to lead to a smear-out effect in z-space. Only basic concepts and features will be kept by the VAE to ensure a similarity of neighboring images. Details will be omitted and “smoothed” out. This has consequences also with respect to sharpness of detail structures. A detail as an eyebrow in a face is to be considered as an average of similar details found for images in the same region of the z-space. This alone brings some loss of clarity with it.
  3. Thirdly, a simple (V)AE based on some directly connected Conv2D-layers has limited capabilities in general. The reason is that we systematically reduce resolution whilst information is propagated from one Conv2D layer to the next neighboring one. Remember that we use a stride > 2 or pooling layers to cover filters on larger image scales. Due to this information processing a convolutional network automatically suppresses details in its inner layers – their resolution shrinks with growing distance from the input layer. In later posts of this blog we shall see that using ResNets instead of CNNs in the Encoder and Decoder already helps a bit regarding the reconstruction of clearer images. The correlation between details and large scale information is better kept up there than in CNNs.

Regarding the first point one may think of increasing z_dim. This may not be the best idea. It contradicts the whole idea of a VAE which at its core is a reduction of the degrees of freedom for z-points. For a higher dimensional space we may have to raise the ratio of KL-loss to reconstruction loss even further.

Regarding the third point: Of course it would also help to increase kernel sizes for the first two Conv2D layers and the number of maps there. A higher resolution of the input images would also be of advantage. Both methods may, however, conflict with your VRAM or GPU time limits.

If the second point were true then reduction of fact in our models, which controls the ration of KL-loss to reconstruction loss, would lead to a better image quality. In this case we are doomed to find an optimal value for fact – satisfying both the need for generalization and clarity of details in our images. You cannot have both … here we see a basic problem related to VAEs and the creation of realistic images. Actually, I tried this out – the effect is there, but the gain actually is not worth the effort. And for too small values of fact we eventually loose the ability to create reasonable images from arbitrary z-points at all.

All in all post-processing appears to be a simple and effective method to get images with somewhat sharper details.
Hint: If you want to create images of artificially generated faces with a really high quality, you have to turn to GANs.

Further examples – with PIL sharpening

In this example you see that not all points give you good images of faces. The z-point of the middle image in the second to last of the first illustration below has a relatively high distance from the origin. The higher the distance from the origin in z-space the weirder the images get. We shall see this below in a more systematic way.

Reconstruction quality of a VAE vs. an AE – or the “female” side of myself

If I were not afraid of copy and personal rights aspects of using CelebA images directly I could show you now a comparison of the the reconstruction ability of an AE in comparison to a VAE. You find such a comparison, though a limited one, by looking at some images in the book of D. Foster.

To avoid any problems I just tried to work with an image of myself. Which really gave me a funny result.

A plain Autoencoder with

  • an extended latent space dimension of z_dim = 1600,
  • a reasonable convolutional filter sequence of (64, 64, 128, 128)
  • a stride value of stride=2
  • and kernels ((5,5),(5,5),(3,3),(3,3))

is well able to reproduce many detailed features one’s face after a training on 80,000 CelebA images. Below see the result for an image of myself after 24 training epochs of such an AE:

The left image is the original, the right one the reconstruction. The latter is not perfect, but many details have been reproduced. Please note that the trained AE never had seen an image of myself before. For biometric analysis the reproduction would probably be sufficient.

Ok, so much about an AE and a latent space with a relatively high dimension. But what does a VAE think of me?
With fact = 5.0, filters like (32,64,128,256), (3,3)-kernels, z_dim=256 and after 18 epochs with 170,000 training images of CelebA my image really got a good cure:

My wife just laughed and said: Well, now in the age of 64 at least an AI has found something soft and female in you … Well, had the CelebA included many faces of heavy metal figures the results would have looked differently. I bet …

So with generative VAEs we obviously pay a price: Details are neglected in favor of very general face features and hairstyle aspects. And we loose sharpness. Which is good if you have wrinkles. Good for me and the celebrities, too. 🙂

However, I recommend anybody who wants to study VAEs to check the reproduction quality for CelebA test images (not from the training set). You will see the generalization effect for a broader range of images. And, of course, a better reproduction with smaller values for the ratio of the KL-loss to the reconstruction loss. However, for too small values of fact you will not be able to create realistic face images at all from arbitrary z-points – even if you choose them to be relatively close to the origin of the latent space.

Dependency of the creation of reasonable images on the distance from the origin

In another post in this blog I have discussed why we need VAEs at all if we want to reconstruct reasonable face images from randomly picked points in the latent space. See:
Autoencoders, latent space and the curse of high dimensionality – I

I think the reader is meanwhile convinced that VAEs do a reasonably good job to create images from randomly chosen z-points. But all of the above images were taken from z-points calculated with the help of a function assuming a normal distribution in the z-space coordinates. The width of the resulting distribution around the origin is of course rather limited. Most points lie within a 3 sigma distance around the origin. This is OK as we have put a lot of effort into the KL-loss to force the z-points to approach such a normal distribution around the origin of the latent space.

But what happens if and when we increase the distance of our random z-points from the origin? An easy way to investigate this is to create the z-points with a function that creates the coordinates randomly, but equally distributed in an interval ]0,limit]. The chance that at least one of the coordinates gets a high value is rather big then. This in turn ensures relatively high radius values (in terms of an L2-distance norm).

Below you find the results for z-points created by the function random.uniform:

r_limit = 1.5
l_limit = -r_limit
znew = np.random.uniform(l_limit, r_limit, size = (n_to_show, z_dim))

r_limit is varied as indicated:

r_limit = 0.5

r_limit = 1.0

r_limit = 1.5

r_limit = 2.0

r_limit = 2.5

r_limit = 3.0

r_limit = 3.5

r_limit = 5.0

r_limit = 8.0

Well, this proves that we get reasonable images only up to a certain distance from the origin – and only in certain areas or pockets of the z-space at higher radii.

Another notable aspect is the fact that the background variations are completely smoothed out a low distances from the origin. But they get dominant in the outer regions of the z-space. This is consistent with the fact that we need more information to distinguish various background shapes, forms and colors than basic face patterns. Note also that the faces appear relatively homogeneous for r_limit = 0.5. The farther we are away from the origin the larger the volumes to cover and distinguish certain features of the training images become.

Conclusion

Our VAE with the GradientTape()-mechanism for the control of the KL-loss seems to do its job. In contrast to a pure AE the smear-out effect of the KL-loss allows now for the creation of images with interpretable contents from arbitrary z-points via the VAE’s Decoder – as long as the selected z-points are not too far away from the z-space’s origin. Thus, by indirect evidence we can conclude that the z-points for training images of the CelebA dataset were distributed and at the same time confined around the origin. The strongest indication came from the last series of images. But we pay a price: The reconstruction abilities of a VAE are far below those of AEs. A relatively low number of dimensions of the latent space helps with an effective confinement of the z-points. But it leads to a significant loss in detail sharpness of the generated images, too. However, part of this effect can be compensated by the application of standard procedures for image enhancemnet.

In the next post
Variational Autoencoder with Tensorflow – XII – save some VRAM by an extra Dense layer in the Encoder
I will discuss a simple trick to reduce the VRAM consumption of the Encoder. In a further post we shall then analyze the confinement of the z-point distribution with the help of more explicit data.

And let us all who praise freedom not forget:
The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. An aggressor who orders the bombardment of civilian infrastructure, civilian buildings, schools and hospitals with drones bought from other anti-democrats and women oppressors puts himself in the darkest and most rotten corner of human history.

 

Variational Autoencoder with Tensorflow – X – VAE application to CelebA images

I continue with my series on Variational Autoencoders and methods to control the Kullback-Leibler [KL] loss.

Variational Autoencoder with Tensorflow – I – some basics
Variational Autoencoder with Tensorflow – II – an Autoencoder with binary-crossentropy loss
Variational Autoencoder with Tensorflow – III – problems with the KL loss and eager execution
Variational Autoencoder with Tensorflow – IV – simple rules to avoid problems with eager execution
Variational Autoencoder with Tensorflow – V – a customized Encoder layer for the KL loss
Variational Autoencoder with Tensorflow – VI – KL loss via tensor transfer and multiple output
Variational Autoencoder with Tensorflow – VII – KL loss via model.add_loss()
Variational Autoencoder with Tensorflow – VIII – TF 2 GradientTape(), KL loss and metrics
Variational Autoencoder with Tensorflow – IX – taming Celeb A by resizing the images and using a generator

The last method discussed made use of Tensorflow’s GradientTape()-class. We still have to test this approach on a challenging dataset like CelebA. Our ultimate objective will be to pick up randomly chosen data points in the VAE’s latent space and create yet unseen but realistic face images by the trained Decoder’s abilities. This task falls into the category of Generative Deep Learning. It has nothing to do with classification or a simple reconstruction of images. Instead we let a trained Artificial Neural Network create something new.

The code fragments discussed in the last post of this series helped us to prepare images of CelebA for training purposes. We cut and downsized them. We saved them in their final form in Numpy arrays: Loading e.g. 170,000 training images from a SSD as a Numpy array is a matter of a few seconds. We also learned how to prepare a Keras ImageDataGenerator object to create a flow of batches with image data to the GPU.

We have also developed two Python classes “MyVariationalAutoencoder” and “VAE” for the setup of a CNN-based VAE. These classes allow us to control a VAE’s input parameters, its layer structure based on Conv2D- and Conv2DTranspose layers, and the handling of the Kullback-Leibler [KL-] loss. In this post I will give you Jupyter code fragments that will help you to apply these classes in combination with CelebA data.

Basic structure of the CNN-based VAE – and sizing of the KL-loss contribution

The Encoder and Decoder CNNs of our VAE shall consist of 4 convolutional layers and 4 transpose convolutional layers, respectively. We control the KL loss by invoking GradientTape() and train_step().

Regarding the size of the KL-loss:
Due to the “curse of dimensionality” we will have to choose the KL-loss contribution to the total loss large enough. We control the relative size of the KL-loss in comparison to the standard reconstruction loss by a parameter “fact“. To determine an optimal value requires some experiments. It also depends on the kind of reconstruction loss: Below I assume that we use a “Binary Crossentropy” loss. Then we must choose fact > 3.0 to get the KL-loss to become bigger than 3% of the total loss. Otherwise the confining and smoothing effect of the KL-loss on the data distribution in the latent space will not be big enough to force the VAE to learn general and not specific features of the training images.

Imports and GPU usage

Below I present Jupyter cells for required imports and GPU preparation without many comments. Its all standard. I keep the Python file with the named classes in a folder “my_AE_code.models”. This folder must have been declared as part of the module search path “sys.path”.

Jupyter Cell 1 – Imports

import os, sys, time, random 
import math
import numpy as np

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat 

import PIL as PIL 
from PIL import Image
from PIL import ImageFilter

# temsorflow and keras 
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.keras import backend as B 
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import metrics
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, \
                                    AlphaDropout, Concatenate, Rescaling, ZeroPadding2D, Layer

#from tensorflow.keras.utils import to_categorical
#from tensorflow.keras.optimizers import schedules

from tensorflow.keras.preprocessing.image import ImageDataGenerator

from my_AE_code.models.MyVAE_3 import MyVariationalAutoencoder
from my_AE_code.models.MyVAE_3 import VAE

Jupyter Cell 2 – List available Cuda devices

# List Cuda devices 
# Suppress some TF2 warnings on negative NUMA node number
# see https://www.programmerall.com/article/89182120793/
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}

tf.config.experimental.list_physical_devices()

Jupyter Cell 3 – Use GPU and limit VRAM usage

# Restrict to GPU and activate jit to accelerate 
# *************************************************
# NOTE: To change any of the following values you MUST restart the notebook kernel ! 

b_tf_CPU_only      = False   # we need to work on a GPU  
tf_limit_CPU_cores = 4 
tf_limit_GPU_RAM   = 2048

b_experiment  = False # Use only if you want to use the deprecated way of limiting CPU/GPU resources 
                      # see the next cell 

if not b_experiment: 
    if b_tf_CPU_only: 
        ... 
    else: 
        gpus = tf.config.experimental.list_physical_devices('GPU')
        tf.config.experimental.set_virtual_device_configuration(gpus[0], 
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit = tf_limit_GPU_RAM)])
    
    # JiT optimizer 
    tf.config.optimizer.set_jit(True)

You see that I limited the VRAM consumption drastically to leave some of the 4GB VRAM available on my old GPU for other purposes than ML.

Setting some basic parameters for VAE training

The next cell defines some basic parameters – you know this already from my last post.

Juypter Cell 4 – basic parameters

# Some basic parameters
# ~~~~~~~~~~~~~~~~~~~~~~~~
INPUT_DIM          = (96, 96, 3) 
BATCH_SIZE         = 128

# The number of available images 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
num_imgs = 200000  # Check with notebook CelebA 

# The number of images to use during training and for tests
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NUM_IMAGES_TRAIN  = 170000   # The number of images to use in a Trainings Run 
#NUM_IMAGES_TO_USE  = 60000   # The number of images to use in a Trainings Run 

NUM_IMAGES_TEST = 10000   # The number of images to use in a Trainings Run 

# for historic comapatibility reasons 
N_ImagesToUse        = NUM_IMAGES_TRAIN 
NUM_IMAGES           = NUM_IMAGES_TRAIN 
NUM_IMAGES_TO_TRAIN  = NUM_IMAGES_TRAIN   # The number of images to use in a Trainings Run 
NUM_IMAGES_TO_TEST   = NUM_IMAGES_TEST  # The number of images to use in a Test Run 

# Define some shapes for Numpy arrays with all images for training
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
shape_ay_imgs_train = (N_ImagesToUse, ) + INPUT_DIM
print("Assumed shape for Numpy array with train imgs: ", shape_ay_imgs_train)

shape_ay_imgs_test = (NUM_IMAGES_TO_TEST, ) + INPUT_DIM
print("Assumed shape for Numpy array with test  imgs: ",shape_ay_imgs_test)

Load the image data and prepare a generator

Also the next cells were already described in the last blog.

Juypter Cell 5 – fill Numpy arrays with image data from disk

# Load the Numpy arrays with scaled Celeb A directly from disk 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print("Started loop for train and test images")
start_time = time.perf_counter()

x_train = np.load(path_file_ay_train)
x_test  = np.load(path_file_ay_test)

end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print("CPU-time for loading Numpy arrays of CelebA imgs: ", cpu_time) 
print("Shape of x_train: ", x_train.shape)
print("Shape of x_test:  ", x_test.shape)

The Output is

Started loop for train and test images

CPU-time for loading Numpy arrays of CelebA imgs:  2.7438277259999495
Shape of x_train:  (170000, 96, 96, 3)
Shape of x_test:   (10000, 96, 96, 3)

Juypter Cell 6 – create an ImageDataGenerator object

# Generator based on Numpy array of image data (in RAM)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_use_generator_ay = True

BATCH_SIZE    = 128
SOLUTION_TYPE = 3

if b_use_generator_ay:

    if SOLUTION_TYPE == 0: 
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow(
                           x_train 
                         , x_train
                         , batch_size = BATCH_SIZE
                         , shuffle = True
                         )
    
    if SOLUTION_TYPE == 3: 
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow(
                           x_train 
                         , batch_size = BATCH_SIZE
                         , shuffle = True
                         )

In our case we work with SOLUTION_TYPE = 3. This specifies the use of GradientTape() to control the KL-loss. Note that we do NOT need to define label data in this case.

Setting up the layer structure of the VAE

Next we set up the sequence of convolutional layers of the Encoder and Decoder of our VAE. For this objective we feed the required parameters into the __init__() function of our class “MyVariationalAutoencoder” whilst creating an object instance (MyVae).

Juypter Cell 7 – Parameters for the setup of VAE-layers

from my_AE_code.models.MyVAE_3 import MyVariationalAutoencoder
from my_AE_code.models.MyVAE_3 import VAE

z_dim = 256  # a first good guess to get a sufficient basic reconstruction quality 
#              due to the KL-loss the general reconstruction quality will 
#              nevertheless be poor in comp. to an AE  

solution_type = SOLUTION_TYPE     # We test GradientTape => SOLUTION_TYPE = 3 
loss_type     = 0                 # Reconstruction loss => 0: BCE, 1: MSE  
act           = 0                 # standard leaky relu activation function 

# Factor to scale the KL-loss in comparison to the reconstruction loss   
fact           = 5.0     #  - for BCE , other working values 1.5, 2.25, 3.0 
                         #              best: fact >= 3.0   
# fact           = 2.0e-2   #  - for MSE, other working values 1.2e-2, 4.0e-2, 5.0e-2

use_batch_norm  = True
use_dropout     = False
dropout_rate    = 0.1

n_ch  = INPUT_DIM[2]   # number of channels
print("Number of channels = ",  n_ch)
print()

# Instantiation of our main class
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
MyVae = MyVariationalAutoencoder(
    input_dim = INPUT_DIM
    , encoder_conv_filters     = [32,64,128,256]
    , encoder_conv_kernel_size = [3,3,3,3]
    , encoder_conv_strides     = [2,2,2,2]
    , encoder_conv_padding     = ['same','same','same','same']

    , decoder_conv_t_filters     = [128,64,32,n_ch]
    , decoder_conv_t_kernel_size = [3,3,3,3]
    , decoder_conv_t_strides     = [2,2,2,2]
    , decoder_conv_t_padding     = ['same','same','same','same']

    , z_dim = z_dim
    , solution_type = solution_type    
    , act   = act
    , fact  = fact
    , loss_type      = loss_type
    , use_batch_norm = use_batch_norm
    , use_dropout    = use_dropout
    , dropout_rate   = dropout_rate
)

There are some noteworthy things:

Choosing working values for “fact”

Reaonable values of “fact” depend on the type of reconstruction loss we choose. In general the “Binary Cross-Entropy Loss” (BCE) has steep walls around a minimum. BCE, therefore, creates much larger loss values than a “Mean Square Error” loss (MSE). Our class can handle both types of reconstruction loss. For BCE some trials show that values “3.0 <= fact <= 6.0" produce z-point distributions which are well confined around the origin of the latent space. If you lie to work with "MSE" for the reconstruction loss you must assign much lower values to fact - around fact = 0.01.

Batch normalization layers, but no drop-out layers

I use batch normalization layers in addition to the convolution layers. It helps a bit or a faster convergence, but produces GPU-time overhead during training. In my experience batch normalization is not an absolute necessity. But try out by yourself. Drop-out layers in addition to a reasonable KL-loss size appear to me as an unnecessary double means to enforce generalization.

Four convolutional layers

Four Convolution layers allow for a reasonable coverage of patterns on different length scales. Four layers make it also easy to use a constant stride of 2 and a “same” padding on all levels. We use a kernel size of 3 for all layers. The number of maps of the layers are defined as 32, 64, 128 and 256.

All in all we use a standard approach to combine filters at different granularity levels. We also cover 3 color layers of a standard image, reflected in the input dimensions of the Encoder. The Decoder creates corresponding arrays with color information.

Building the Encoder and the Decoder models

We now call the classes methods to build the models for the Encoder and Decoder parts of the VAE.

Juypter Cell 8 – Creation of the Encoder model

# Build the Encoder 
# ~~~~~~~~~~~~~~~~~~
MyVae._build_enc()
MyVae.encoder.summary()

Output:

You see that the KL-loss related layers dominate the number of parameters.

Juypter Cell 9 – Creation of the Decoder model

# Build the Decoder 
# ~~~~~~~~~~~~~~~~~~~
MyVae._build_dec()
MyVae.decoder.summary()

Output:

Building and compiling the full VAE based on GradientTape()

Building and compiling the full VAE based on parameter solution_type = 3 is easy with our class:

Juypter Cell 10 – Creation and compilation of the VAE model

# Build the full AE 
# ~~~~~~~~~~~~~~~~~~~
MyVae._build_VAE()

# Compile the model 
learning_rate = 0.0005
MyVae.compile_myVAE(learning_rate=learning_rate)

Note that internally an instance of class “VAE” is built which handles all loss calculations including the KL-contribution. Compilation and inclusion of an Adam optimizer is also handled internally. Our classes make or life easy …

Our initial learning_rate is relatively small. I followed recommendations of D. Foster’s book on “Generative Deep Learning” regarding this point. A value of 1.e-4 does not change much regarding the number of epochs for convergence.

Due to the chosen low dimension of the latent space the total number of trainable parameters is relatively moderate.

Prepare saving and loading of model parameters

To save some precious computational time (and energy consumption) in the future we need a basic option to save and load model weight parameters. I only describe a direct method; I leave it up to the reader to define a related Callback.

Juypter Cell 11 – Paths to save or load weight parameters

path_model_save_dir = 'YOUR_PATH_TO_A_WEIGHT_SAVING_DIR'

dir_name = 'MyVAE3_sol3_act0_loss0_epo24_fact_5p0emin0_ba128_lay32-64-128-256/'
path_dir = path_model_save_dir + dir_name
if not os.path.isdir(path_dir): 
    os.mkdir(path_dir, mode = 0o755)

dir_all_name = 'all/'
dir_enc_name = 'enc/'
dir_dec_name = 'dec/'

path_dir_all = path_dir + dir_all_name
if not os.path.isdir(path_dir_all): 
    os.mkdir(path_dir_all, mode = 0o755)

path_dir_enc = path_dir + dir_enc_name
if not os.path.isdir(path_dir_enc): 
    os.mkdir(path_dir_enc, mode = 0o755)

path_dir_dec = path_dir + dir_dec_name
if not os.path.isdir(path_dir_dec): 
    os.mkdir(path_dir_dec, mode = 0o755)

name_all = 'all_weights.hd5'
name_enc = 'enc_weights.hd5'
name_dec = 'dec_weights.hd5'

#save all weights
path_all = path_dir + dir_all_name + name_all
path_enc = path_dir + dir_enc_name + name_enc
path_dec = path_dir + dir_dec_name + name_dec

You see that I define separate files in “hd5” format to save parameters of both the full model as well as of its Encoder and Decoder parts.

If we really wanted to load saved weight parameters we could set the parameter “b_load_weight_parameters” in the next cell to “True” and execute the cell code:

Juypter Cell 12 – Load saved weight parameters into the VAE model

b_load_weight_parameters = False

if b_load_weight_parameters:
    MyVae.model.load_weights(path_all)

Training and saving calculated weights

We are ready to perform a training run. For our 170,000 training images and the parameters set I needed a bit more than 18 epochs, namely 24. I did this in two steps – first 18 epochs and then another 6.

Juypter Cell 13 – Load saved weight parameters into the VAE model

INITIAL_EPOCH = 0 

#n_epochs      = 18
n_epochs      = 6

MyVae.set_enc_to_train()
MyVae.train_myVAE(   
             data_flow
            , b_use_generator = True 
            , epochs = n_epochs
            , initial_epoch = INITIAL_EPOCH
            )

The total loss starts in the beginning with a value above 6,900 and quickly closes in to something like 5,100 and below. The KL-loss during raining rises continuously from something like 30 to 176 where it stays almost constant. The 6 epochs after epoch 18 gave the following result:

I stopped the calculation at this point – though a full convergence may need some more epochs.

You see that an epoch takes about 2 minutes GPU time (on a GTX960; a modern graphics card will deliver far better values). For 170,000 images the training really costs. On the other side you get a broader variation of face properties in the resulting artificial images later on.

After some epoch we may want to save the weights calculated. The next Jupyter cell shows how.

Juypter Cell 14 – Save weight parameters to disk

print(path_all)
MyVae.model.save_weights(path_all)
print("saving all weights is finished")

print()
#save enc weights
print(path_enc)
MyVae.encoder.save_weights(path_enc)
print("saving enc weights is finished")

print()
#save dec weights
print(path_dec)
MyVae.decoder.save_weights(path_dec)
print("saving dec weights is finished")

How to test the reconstruction quality?

After training you may first want to test the reconstruction quality of the VAE’s Decoder with respect to training or test images. Unfortunately, I cannot show you original data of the Celeb A dataset. However, the following code cells will help you to do the test by yourself.

Juypter Cell 15 – Choose images and compare them to their reconstructed counterparts

# We choose 14 "random" images from the x_train dataset
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
# For another method to create reproducale "random numbers" see https://albertcthomas.github.io/good-practices-random-number-generators/

n_to_show = 7  # per row 

# To really recover all data we must have one and the same input dataset per training run 
l_seed = [33, 44]   #l_seed = [33, 44, 55, 66, 77, 88, 99]
num_exmpls = len(l_seed)
print(num_exmpls) 

# a list to save the image rows 
l_img_orig_rows = []
l_img_reco_rows = []

start_time = time.perf_counter()

# Set the Encoder to prediction = epsilon * 0.0 
# MyVae.set_enc_to_predict()

for i in range(0, num_exmpls):

    # fixed random distribution 
    rs1 = RandomState(MT19937( SeedSequence(l_seed[i]) ))

    # indices of example array selected from the test images 
    #example_idx = np.random.choice(range(len(x_test)), n_to_show)
    example_idx    = rs1.randint(0, len(x_train), n_to_show)
    example_images = x_train[example_idx]

    # calc points in the latent space 
    if solution_type == 3:
        z_points, mu, logvar  = MyVae.encoder.predict(example_images)
    else:
        z_points  = MyVae.encoder.predict(example_images)

    # Reconstruct the images - note that this results in an array of images  
    reconst_images = MyVae.decoder.predict(z_points)

    # save images in a list 
    l_img_orig_rows.append(example_images)
    l_img_reco_rows.append(reconst_images)

end_time = time.perf_counter()
cpu_time = end_time - start_time

# Reset the Encoder to prediction = epsilon * 1.00 
# MyVae.set_enc_to_train()

print()
print("n_epochs : ", n_epochs, ":: CPU-time to reconstr. imgs: ", cpu_time) 

We save the selected original images and the reconstructed images in Python lists.
We then display the original images in one row of a matrix and the reconstructed ones in a row below. We arrange 7 images per row.

Juypter Cell 16 – display original and reconstructed images in a matrix-like array

# Build an image mesh 
# ~~~~~~~~~~~~~~~~~~~~
fig = plt.figure(figsize=(16, 8))
fig.subplots_adjust(hspace=0.2, wspace=0.2)

n_rows = num_exmpls*2 # One more for the original 

for j in range(num_exmpls): 
    offset_orig = n_to_show * j * 2
    for i in range(n_to_show): 
        img = l_img_orig_rows[j][i].squeeze()
        ax = fig.add_subplot(n_rows, n_to_show, offset_orig + i+1)
        ax.axis('off')
        ax.imshow(img, cmap='gray_r')
    
    offset_reco = offset_orig + n_to_show
    for i in range(n_to_show): 
        img = l_img_reco_rows[j][i].squeeze()
        ax = fig.add_subplot(n_rows, n_to_show, offset_reco+i+1)
        ax.axis('off')
        ax.imshow(img, cmap='gray_r')

You will find that the reconstruction quality is rather limited – and not really convincing by any measures regarding details. Only the general shape of faces an their features are reproduced. But, actually, it is this lack of precision regarding details which helps us to create images from arbitrary z-points. I will discuss these points in more detail in a further post.

First results: Face images created from randomly distributed points in the latent space

The technique to display images can also be used to display images reconstructed from arbitrary points in the latent space. I will show you various results in another post.

For now just enjoy the creation of images derived from z-points defined by a normal distribution around the center of the latent space:

Most of these images look quite convincing and crispy down to details. The sharpness results from some photo-processing with PIL functions after the creation by the VAE. But who said that this is not allowed?

Conclusion

In this post I have presented Jupyter cells with code fragments which may help you to apply the VAE-classes created previously. With the VAE setup discussed above we control the KL-loss by a GradientTape() object.
Preliminary results show that the images created of arbitrarily chosen z-points really show heads with human-like faces and hair-dos. In contrast to what a simple AE would produce (see:
Autoencoders, latent space and the curse of high dimensionality – I

In the next post
Variational Autoencoder with Tensorflow – XI – image creation by a VAE trained on CelebA
I will have a look at the distribution of z-points corresponding to the CelebA data and discuss the delicate balance between the representation of details and the generalization of features. With VAEs you cannot get both.

And let us all who praise freedom not forget:
The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!