Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss

Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics

We still have to test the Python classes which we have so laboriously developed during the last posts. One of these classes, „VAE()“, supports a specific approach to control the KL-loss contributions during training and cost optimization by gradient descent: The class may use Tensorflow’s [TF 2] *GradientTape* mechanism together with a customized Keras *train_step()* function – instead of relying on Keras‘ standard „add_loss()“ functions.

Instead of recreating simple MNIST images of digits from points in a latent space I now want to train a VAE (with GradientTape-based loss control) to solve a more challenging task:

We want to create artificial images of naturally appearing human faces from randomly chosen points in the latent space of a VAE, which has been trained with images of real human faces.

Actually, we will train our VAE with images provided by the so-called „**Celeb A**“ dataset. This dataset contains around 200,000 images showing the heads of celebrities. Due to the number and size of its images this dataset forces me (because of my very limited hardware) to use a *Keras Image Data Generator*. A generator is a tool to transfer huge amounts of data in a continuous process and in the form of small batches to the GPU during neural network training. The batches must be small enough for the respective image data to fit into the VRAM of the GPU. Our VAE classes have been designed to support a generator.

In this post I first explain why Celeb A poses a thorough test for a VAE. Afterwards I shall bring the Celeb A data into a form suitable for older graphics cards with small VRAM.

To answer the question we first have to ask ourselves why we need VAEs at all. Why do certain ML tasks require more than just a simple plain Autoencoder [AE]?

The answer to the latter question lies in the data distribution an AE creates in its latent space. An AE which is trained for the precise reconstruction of presented images will use a sufficiently broad area/volume of the latent space to place points corresponding to different images with a sufficiently large distance between them. The position in an AE’s latent space (together with the Encoder’s and Decoder’s weights) encodes specific features of an image. A standard AE is not forced to generalize sufficiently during training for reconstruction tasks. On the contrary: A good reconstruction AE shall learn to encode as many details of input images as possible whilst filling the latent space.

However: The neural networks of a (V)AE correspond to (non-linear) mapping functions between multi-dimensional vector spaces, namely

- between the feature space of the input data objects and the AE’s latent space
- and also between the latent space and the reconstruction space (normally with the same dimension as the original feature space for the input data).

This poses some risks whenever a task requires us to use arbitrary points in the latent space. Let us, e.g., look at the case of images of certain real objects in front of varying backgrounds:

During the AE’s training we map points of a high-dimensional feature-space for the pixel values of (colored) images to points in the multi-dimensional latent space. The target region in the latent space stemming from regions in the original feature-space which correspond to „reasonable“ images displaying *real* objects may cover only a relatively thin, wiggled manifold within the latent space (z-space). For points outside the curved boundaries of such regions in z-space the Decoder may not give you clear, realistic and interpretable images.

The most important objectives of invoking the KL-loss as an additional optimization element by a VAE are

- to *confine* the data point distribution, which the VAE’s Encoder part produces in the multidimensional latent space, around the *origin* **O** of the z-space – as far as possible symmetrically and within a very limited distance from **O**,
- to normalize the data distribution around any z-point calculated during training. Whenever a real training object marks the center of a *limited area* in latent space then reconstructed data objects (e.g. images) within such an area should not be too different from the original training object.

I.e.: We force the VAE to generalize much more than a simple AE.

Both objectives are achieved via specific parameterized parts of the KL-loss. We optimize the KL-loss parameters – and thus the data distribution in the latent space – during training. After the training phase we want the VAE’s Decoder to behave well and smoothly for *neighboring* points in extended areas of the latent space:

The content of reconstructed objects (e.g. images) resulting from neighboring points within limited z-space areas (up to a certain distance from the origin) should vary only smoothly.

The KL loss provides the necessary *smear-out effect* for the data distribution in z-space.
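For reference, the KL term used throughout this series (the KL divergence of a Gaussian latent distribution against a standard normal) can be sketched in plain Numpy. The function name `kl_loss` is just illustrative, not taken from the VAE() class:

```python
import numpy as np

def kl_loss(mu, log_var):
    # KL divergence between N(mu, sigma^2) and the standard normal N(0, 1),
    # summed over the latent dimensions and averaged over the batch:
    # KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    return np.mean(-0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1))

# a batch already centered at the origin with unit variance costs nothing ...
print(kl_loss(np.zeros((4, 2)), np.zeros((4, 2))))        # 0.0
# ... while a batch shifted away from the origin gets penalized
print(kl_loss(np.full((4, 2), 3.0), np.zeros((4, 2))))    # 9.0
```

One sees directly how both objectives are encoded: the `mu**2` term pulls the distribution towards the origin, while the `log_var`/`exp(log_var)` terms push the local variance towards 1.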

During this series I have only shown you the effects of the KL-loss on **MNIST** data for a dimension of the latent space *z_dim = 2*. We saw the general confinement of z-points around the origin and also a confinement of points corresponding to different MNIST-numbers (= specific features of the original images) in limited areas. With some overlaps and transition regions for different numbers.

But note: The low dimension of the latent space in the MNIST case (between 2 and 16) simplifies the confinement task – close to the origin there are not many degrees of freedom and no big volume available for the VAE Encoder. Even a standard AE would be rather limited when trying to vastly distribute z-points resulting from MNIST images of different digits.

However, a more challenging task is posed by the data distribution, which a (V)AE creates e.g. of images showing human heads and faces with characteristic features in front of varying backgrounds. To get a reasonable image reconstruction we must assign a much higher number of dimensions to the latent space than in the MNIST case: **z_dim = 256** or **z_dim = 512** are reasonable values at the lower end!

Human faces or heads with different hair-dos are much more complex than digit figures. In addition the influence of details in the background of the faces must be handled – and for our objective be damped. As we have to deal with *many* more dimensions of the z-space than in the MNIST case a simple standard AE will run into trouble:

Without the confinement and local smear-out effect of the KL-loss only tiny and thin areas of the latent space will correspond to reconstructions of human-like „faces“. I have discussed this point in more detail also in the post

https://linux-blog.anracom.com/2022/08/15/autoencoders-latent-space-and-the-curse-of-high-dimensionality-i/

As a result a standard AE will **NOT** reconstruct human faces from randomly picked z-points in the latent space. So, an AE will fail on the challenge posed in the introduction of this post.

I recommend to get the Celeb A data from some trustworthy Kaggle contributor – and not from the original Chinese site. Cropped versions of the images are available there. Still check the image container and the images carefully for unwanted add-ons.

The Celeb A dataset contains around 200,000 images of the heads of celebrities with a resolution of 218×178 pixels. Each image shows a celebrity face in front of some partially complex background. The amount of data to be handled during VAE training is relatively big – even if you downscale the images. The whole set will not fit into the limited VRAM of older graphics cards as mine (GTX960 with 4 GB, only). This post will show you how to deal with this problem.

You may wonder why the Celeb A dataset poses a problem, as the original data only consume about 1.3 GByte on a hard disk. But do not forget that we need to provide floating point *tensors* of size (height x width x 3 x 32 Bit) instead of compressed integer-based jpg information to the VAE algorithm. You can do the math on your own. In addition: Working with multiple screens and KDE on Linux may already consume more than 1 GB of our limited VRAM.
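The math is quickly done. Assuming the 96×96 px downscaled images used later in this post and 170,000 training images, the float32 tensor alone requires:

```python
# rough RAM footprint of the float32 image tensor (4 bytes per value);
# 170,000 images of shape (96, 96, 3) as used later in this post
num_imgs, h, w, c, bytes_per_val = 170000, 96, 96, 3, 4
gbytes = num_imgs * h * w * c * bytes_per_val / 1.e9
print(round(gbytes, 1))   # 18.8
```

Almost 19 GB – more than an order of magnitude above the 1.3 GByte of the compressed jpg files, and far beyond the VRAM of older graphics cards.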

We use three tricks to work reasonably fast with the Celeb A data on a Linux system with limited VRAM, but with around 32 GB or more of standard RAM:

- We first crop and downscale the images – in my case to 96×96 pixels.
- We save a binary of a Numpy array of all images on a SSD and read it into the RAM during Jupyter experiments.
- We then apply a so-called Keras *Image Data Generator* to transfer the images to the graphics card when required.

The first point reduces the amount of MBytes per image. For basic experiments we do not need the full resolution.

The second point above is due to performance reasons: (1) Each time we want to work with a Jupyter notebook on the data we want to keep the time to load the data small. (2) We need the array data already in the system’s RAM to transfer them efficiently and in portions to the GPU.

A „**generator**“ is a Keras tool which allows us to deliver input data for the VAE training in form of a continuously replenished dataflow from the CPU environment to the GPU. The amount of data provided with each transfer step to the GPU is reduced to a batch of images. Of course, we have to choose a reasonable size for such a batch. It should be compatible with the training batch size defined in the VAE-model’s fit() function.

A batch alone will fit into the VRAM whereas the whole dataset may not. The control of the data stream costs some overhead time – but this is better than not being able to work at all. The second point helps to accelerate the transfer of data to the GPU significantly: A generator which sequentially picks data from a hard disk, transfers it to RAM and then to VRAM is too slow to get a convenient performance in the end.
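Conceptually, such a generator does nothing more than slicing the big RAM-based array into batches. A minimal pure-Python sketch of the idea (not the actual Keras implementation) could look like this:

```python
import numpy as np

def batch_generator(x, batch_size, shuffle=True):
    # yields one small batch after the other from a big RAM-based array -
    # only one batch at a time must fit into the GPU's VRAM
    idx = np.arange(len(x))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(x), batch_size):
        yield x[idx[start:start + batch_size]]

x_demo  = np.zeros((1000, 96, 96, 3), dtype='float32')
batches = list(batch_generator(x_demo, batch_size=128))
print(len(batches), batches[0].shape)   # 8 (128, 96, 96, 3)
```

The Keras ImageDataGenerator adds shuffling per epoch, optional augmentation and asynchronous pre-fetching on top of this basic scheme.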

Each time before we start VAE applications on the Jupyter side, we first fill the RAM with all image data in tensor-like form. From an SSD the required loading time should be small. The disadvantage of this approach is the amount of RAM we need. In my case close to 20 GB!

We first crop each of the original images to reduce background information and then resize the result to 96×96 px. D. Foster uses 128×128 px in his book on „Generative Deep Learning“. But for small VRAM 96×96 px is a bit more helpful.

I also wanted the images to have a quadratic shape because then one does not have to adjust the strides of the VAE’s CNN Encoder and Decoder kernels differently for the two geometrical dimensions. 96 px per dimension is also a good number as it allows for exactly 4 layers in the VAE’s CNNs. Each of the layers then reduces the resolution of the analyzed patterns by a factor of 2. At the innermost layer of the Encoder we deal with e.g. 256 maps with an extension of 6×6.

Cropping the original images is a bit risky as we may either cut some parts of the displayed heads/faces or the neck region. I decided to cut the upper part of the image. So I lost part of the hair-do in some cases, but this did not affect the ability to create realistic images of new heads or faces in the end. You may with good reason decide differently.

I set the edge points of the cropping region to

top = 40, bottom = 0, left = 0, right = 178.

This gave me quadratic pictures. But you may choose your own parameters, of course.
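A quick check with the Celeb A image dimensions quoted above shows why these cut margins lead to quadratic pictures:

```python
img_h, img_w = 218, 178          # original Celeb A resolution (height x width)
cut_top, cut_bottom = 40, 0      # margins removed at the top and bottom
new_h = img_h - cut_top - cut_bottom
print(new_h, new_h == img_w)     # 178 True -> quadratic crop
```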

To prepare the pictures of the Celeb A dataset I used the PIL library.

import os, sys, time
import numpy as np
import scipy
from glob import glob

import PIL as PIL
from PIL import Image
from PIL import ImageFilter

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat

A Jupyter cell with a loop to deal with almost all CelebA images would then look like:

**Jupyter cell 1**

dir_path_orig = 'YOUR_PATH_TO_THE_ORIGINAL_CELEB A_IMAGES'
dir_path_save = 'YOUR_PATH_TO_THE_RESIZED_IMAGES'

num_imgs = 200000   # the number of images we use

print("Started loop for images")
start_time = time.perf_counter()

# cropping corner positions and new img size
left = 0;   top = 40
right = 178; bottom = 218
width_new  = 96
height_new = 96

# Cropping and resizing
for num in range(1, num_imgs):
    jpg_name = '{:0>6}'.format(num)
    jpg_orig_path = dir_path_orig + jpg_name + ".jpg"
    jpg_save_path = dir_path_save + jpg_name + ".jpg"
    im  = Image.open(jpg_orig_path)
    imc = im.crop((left, top, right, bottom))
    #imc = imc.resize((width_new, height_new), resample=PIL.Image.BICUBIC)
    imc = imc.resize((width_new, height_new), resample=PIL.Image.LANCZOS)
    imc.save(jpg_save_path, quality=95)   # we save with high quality
    im.close()

end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print("CPU-time: ", cpu_time)

Note that we save the images with high quality. Without the quality parameter PIL’s save function for a jpg target format would reduce the image quality unnecessarily – and without any positive impact on the RAM or VRAM consumption of the tensors we have to use in the end.

The whole process of cropping and resizing takes about 240 secs on my old PC *without* any parallelized operations on the CPU. The data were read from a standard old hard disk and not a SSD. As we have to make this investment of CPU time only once I did not care about optimization.

To prepare and save a huge Numpy array which contains all training images for our VAE we first need to define some parameters. I normally use 170,000 images for training purposes and around 10,000 for tests.

**Jupyter cell 2**

# Some basic parameters
# ~~~~~~~~~~~~~~~~~~~~~~~~
INPUT_DIM  = (96, 96, 3)
BATCH_SIZE = 128

# The number of available images
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
num_imgs = 200000   # Check with notebook CelebA

# The number of images to use during training and for tests
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NUM_IMAGES_TRAIN = 170000   # The number of images to use in a training run
#NUM_IMAGES_TO_USE = 60000  # The number of images to use in a training run
NUM_IMAGES_TEST = 10000     # The number of images to use in a test run

# for historic compatibility reasons of other code fragments (the reader may not care too much about it)
N_ImagesToUse       = NUM_IMAGES_TRAIN
NUM_IMAGES          = NUM_IMAGES_TRAIN
NUM_IMAGES_TO_TRAIN = NUM_IMAGES_TRAIN   # The number of images to use in a training run
NUM_IMAGES_TO_TEST  = NUM_IMAGES_TEST    # The number of images to use in a test run

# Define some shapes for Numpy arrays with all images for training
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
shape_ay_imgs = (N_ImagesToUse, ) + INPUT_DIM
print("Assumed shape for Numpy array with train imgs: ", shape_ay_imgs)

shape_ay_imgs_test = (NUM_IMAGES_TO_TEST, ) + INPUT_DIM
print("Assumed shape for Numpy array with test imgs: ", shape_ay_imgs_test)

We also need to define some parameters to control the following aspects:

- Do we directly load Numpy arrays with train and test data?
- Do we load image data and convert them into Numpy arrays?
- From where do we load image data?

The following Jupyter cells help us:

**Jupyter cell 3**

# Set parameters where to get the image data from
# ************************************************

# Use the cropped 96x96 HIGH-quality images
b_load_HQ = True

# Load prepared Numpy arrays
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
b_load_ay_from_saved = False   # True: Load prepared x_train and x_test Numpy arrays

# Load from SSD
# ~~~~~~~~~~~~~
b_load_from_SSD = True

# Save newly calculated x_train, x_test arrays in binary format onto disk
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_save_to_disk = False

# Paths
# ******

# Images on SSD
# ~~~~~~~~~~~~~
if b_load_from_SSD:
    if b_load_HQ:
        dir_path_load = 'YOUR_PATH_TO_HQ_DATA_ON_SSD/'   # high quality
    else:
        dir_path_load = 'YOUR_PATH_TO_LQ_DATA_ON_SSD/'   # low quality

# Images on slow HD
# ~~~~~~~~~~~~~~~~~~
if not b_load_from_SSD:
    if b_load_HQ:
        dir_path_load = 'YOUR_PATH_TO_HQ_DATA_ON_HD/'    # high quality on slow HD
    else:
        dir_path_load = 'YOUR_PATH_TO_LQ_DATA_ON_HD/'    # low quality on slow HD

# x_train, x_test arrays on SSD
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if b_load_from_SSD:
    dir_path_ay = 'YOUR_PATH_TO_Numpy_ARRAY_DATA_ON_SSD/'
    if b_load_HQ:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_hq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_hq-x_test.npy"
    else:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_lq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_lq-x_test.npy"

# x_train, x_test arrays on slow HD
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if not b_load_from_SSD:
    dir_path_ay = 'YOUR_PATH_TO_Numpy_ARRAY_DATA_ON_HD/'
    if b_load_HQ:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_hq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_hq-x_test.npy"
    else:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_lq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_lq-x_test.npy"

You must of course define your own paths and names.

Note that the ending „.npy“ defines the standard binary format for Numpy data.

In case that I want to *prepare* the Numpy arrays (and not load already prepared ones from a binary) I make use of the following straightforward function:

**Jupyter cell 4**

def load_and_scale_celeba_imgs(start_idx, num_imgs, shape_ay, dir_path_load):
    ay_imgs = np.ones(shape_ay, dtype='float32')
    end_idx = start_idx + num_imgs
    # We open the images and transform them into Numpy arrays
    for j in range(start_idx, end_idx):
        idx = j - start_idx
        jpg_name = '{:0>6}'.format(j)
        jpg_orig_path = dir_path_load + jpg_name + ".jpg"
        im = Image.open(jpg_orig_path)
        # transform data into a Numpy array
        img_array = np.array(im)
        ay_imgs[idx] = img_array
        im.close()
    # scale the images
    ay_imgs = ay_imgs / 255.
    return ay_imgs

We call this function for training images as follows:

**Jupyter cell 5**

# Load training images from SSD/HD and prepare Numpy float32 arrays
# - 18.1 GByte of RAM required !!
# - takes around 30 to 35 secs
# ******************************************************************
if not b_load_ay_from_saved:

    # Prepare float32 Numpy array for the training images
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    start_idx_train = 1
    print("Started loop for training images")
    start_time = time.perf_counter()
    x_train = load_and_scale_celeba_imgs(start_idx=start_idx_train,
                                         num_imgs=NUM_IMAGES_TRAIN,
                                         shape_ay=shape_ay_imgs,
                                         dir_path_load=dir_path_load)
    end_time = time.perf_counter()
    cpu_time = end_time - start_time
    print()
    print("CPU-time for array of training images: ", cpu_time)
    print("Shape of x_train: ", x_train.shape)

    # Plot an example image
    plt.imshow(x_train[169999])

And for test images:

**Jupyter cell 6**

# Load test images from SSD/HD and prepare Numpy float32 arrays
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if not b_load_ay_from_saved:

    # Prepare float32 Numpy array for the test images
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    start_idx_test = NUM_IMAGES_TRAIN + 1
    print("Started loop for test images")
    start_time = time.perf_counter()
    x_test = load_and_scale_celeba_imgs(start_idx=start_idx_test,
                                        num_imgs=NUM_IMAGES_TEST,
                                        shape_ay=shape_ay_imgs_test,
                                        dir_path_load=dir_path_load)
    end_time = time.perf_counter()
    cpu_time = end_time - start_time
    print()
    print("CPU-time for array of test images: ", cpu_time)
    print("Shape of x_test: ", x_test.shape)

    # Plot an example image
    plt.imshow(x_test[27])

This takes about 35 secs in my case for the training images (170,000) and about 2 secs for the test images. Other people in the field use much lower numbers for the amount of training images.

If you want to save the Numpy arrays to disk:

**Jupyter cell 7**

# Save the newly calculated Numpy arrays in binary format to disk
# ****************************************************************
if not b_load_ay_from_saved and b_save_to_disk:
    print("Start saving arrays to disk ...")
    np.save(path_file_ay_train, x_train)
    print("Finished saving the train img array")
    np.save(path_file_ay_test, x_test)
    print("Finished saving the test img array")

If we wanted to load the Numpy arrays with training and test data from disk we would use the following code:

**Jupyter cell 8**

# Load the Numpy arrays with scaled Celeb A data directly from disk
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print("Started loading the Numpy arrays")
start_time = time.perf_counter()
x_train = np.load(path_file_ay_train)
x_test  = np.load(path_file_ay_test)
end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print("CPU-time for loading Numpy arrays of CelebA imgs: ", cpu_time)
print("Shape of x_train: ", x_train.shape)
print("Shape of x_test: ", x_test.shape)

This takes about 2 secs on my system, which has enough fast RAM. So loading a prepared Numpy array for the CelebA data is no problem.

Easy introductions to Keras‘ ImageDataGenerators, their purpose and usage can be found via the links in the last section of this post.

ImageDataGenerators can not only be used to create a flow of limited batches of images to the GPU, but also for parallel operations on the images coming from some source. The latter ability is e.g. very welcome when we want to create additional augmented image data. The sources of images can be some directory of image files or a Python data structure. Depending on the source, different ways of defining a generator object have to be chosen. The ImageDataGenerator class and its methods can also be customized in many details.
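As a tiny illustration of the augmentation idea: a horizontal flip – what the ImageDataGenerator parameter *horizontal_flip=True* applies to randomly chosen images – is just an index reversal along the width axis. The miniature image below is made up for the demonstration:

```python
import numpy as np

# tiny fake (H, W, C) image - a horizontal flip mirrors the width axis,
# which is what ImageDataGenerator's horizontal_flip=True does under the hood
img = np.arange(12, dtype='float32').reshape(2, 2, 3)
flipped = img[:, ::-1, :]
print(img[0, 1, 0], flipped[0, 0, 0])   # 3.0 3.0 - the former top-right pixel moved to the top-left
```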

If we worked on a directory we might have to define our generator similarly to the following code fragment:

data_gen = ImageDataGenerator(rescale=1./255)   # if the image data are not scaled already for float arrays

# class_mode = 'input' is used for Autoencoders
# see https://vijayabhaskar96.medium.com/tutorial-image-classification-with-keras-flow-from-directory-and-generators-95f75ebe5720
data_flow = data_gen.flow_from_directory(directory = 'YOUR_PATH_TO_ORIGINAL_IMAGE_DATA'
                                         #, target_size = INPUT_DIM[:2]
                                         , batch_size = BATCH_SIZE
                                         , shuffle    = True
                                         , class_mode = 'input'
                                         , subset     = "training"
                                        )

This would allow us to read in data from a prepared sub-directory „YOUR_PATH_TO_ORIGINAL IMAGE DATA/train/“ of the file-system and scale the pixel data at the same time to the interval [0.0, 1.0]. However, this approach is too slow for big amounts of data.

As we already have scaled image data available in RAM based Numpy arrays both the parameterization and the usage of the Generator during training is very simple. And the performance with RAM based data is much, much better!

So, how do our Jupyter cells for defining the generator look?

**Jupyter cell 9**

# Generator based on Numpy array for images in RAM
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_use_generator_ay = True

BATCH_SIZE    = 128
SOLUTION_TYPE = 3

if b_use_generator_ay:

    # SOLUTION_TYPE == 0 works with extra layers and add_loss() to control the KL loss;
    # it requires the definition of "labels" - which are the original images
    if SOLUTION_TYPE == 0:
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow(x_train
                                  , x_train
                                  #, target_size = INPUT_DIM[:2]
                                  , batch_size = BATCH_SIZE
                                  , shuffle = True
                                  #, class_mode = 'input'   # Not working with this type of generator
                                  #, subset = "training"    # Not required
                                 )
    if ....
    if ....
    if SOLUTION_TYPE == 3:
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow(x_train
                                  #, x_train
                                  #, target_size = INPUT_DIM[:2]
                                  , batch_size = BATCH_SIZE
                                  , shuffle = True
                                  #, class_mode = 'input'   # Not working with this type of generator
                                  #, subset = "training"    # Not required
                                 )

Besides the method to use extra layers with layer.add_loss() (SOLUTION_TYPE == 0) I have discussed other methods for the handling of the KL-loss in previous posts. I leave it to the reader to fill in the correct statements for these cases. In our present study we want to use a GradientTape()-based method, i.e. SOLUTION_TYPE = 3. In this case we do NOT need to pass a label array to the generator. Our gradient_step() function is intelligent enough to handle the loss calculation on its own! (See the previous posts.)

So it is just

data_gen = ImageDataGenerator()
data_flow = data_gen.flow(x_train
                          , batch_size = BATCH_SIZE
                          , shuffle = True
                         )

which does a perfect job for us.

In the end we will only need the following call when we want to train our VAE-model

MyVae.train_myVAE(data_flow
                  , b_use_generator = True
                  , epochs = n_epochs
                  , initial_epoch = INITIAL_EPOCH
                 )

This class function will in turn internally call something like

self.model.fit(data_flow     # coming as a batched dataflow from the outside generator
               , shuffle = True
               , epochs = epochs
               , batch_size = batch_size   # best identical to the batch_size of data_flow
               , initial_epoch = initial_epoch
              )

But the setup of a reasonable VAE-model for CelebA images and its training will be the topic of the next post.

What have we achieved? Nothing yet regarding VAE results. However, we have prepared almost 200,000 CelebA images such that we can easily load them from disk into a Numpy float32 array within about 2 seconds. Around 20 GB of conventional PC RAM is required. But this array can now easily be used as a source for VAE training.

Furthermore I have shown that the setup of a Keras „ImageDataGenerator“ to provide the image data as a flow of batches fitting into the GPU’s VRAM is a piece of cake – at least for our VAE objectives. We are well prepared now to apply a VAE-algorithm to the CelebA data – even if we only have an old graphics card available with limited VRAM.

In the next and last blog of this series I show you the code for VAE-training with CelebA data. Afterward we will pick random points in the latent space and create artificial images of human faces.

People interested in data augmentation should have a closer look at the parameterization options of the ImageDataGenerator-class.

Celeb A

https://datagen.tech/guides/image-datasets/celeba/

Data generators

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

https://towardsdatascience.com/keras-data-generators-and-how-to-use-them-b69129ed779c

**And last but not least my standard statement as long as the war in Ukraine is going on:**

Ceterum censeo: The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!


In this series of posts I want to discuss this problem a bit as it illustrates why we need Variational Autoencoders for a systematic creation of faces with varying features from points and clusters in the latent space. But the problem also raises some fundamental and interesting questions

- about a certain „blindness“ of neural networks during training in general, and
- about the way we save or conserve the knowledge which a neural network has gained about patterns in input data during training.

This post requires experience with the architecture and principles of Autoencoders.

For preparing my talk I worked with relatively simple Autoencoders. I used Convolutional Neural Networks [CNNs] with just 4 convolutional layers to create the Encoder and Decoder parts of the Autoencoder. As typical applications I chose the following:

- Effective image compression and reconstruction by using a latent space of relatively low dimensionality. The trained AEs were able to compress input images into latent vectors with only few components and reconstruct the original image from the compressed format.
- Denoising of images where the original data were obscured by the superposition of statistical noise and/or statistically dropped pixels. (This is my favorite task for AEs which they solve astonishingly well.)
- Recolorization of images: The trained AE in this case transforms images with only gray pixels into colorful images.

Such challenges for AEs are discussed in standard ML literature. In a first approach I applied my Autoencoders to the usual MNIST and Fashion MNIST datasets. For the task of recolorization I used the Cifar 10 dataset. But a bit later I turned to the **Celeb A** dataset with images of celebrity faces. Just to make all of the tasks a bit more challenging.

My Autoencoders excelled in all the tasks named above – for MNIST, Celeb A and, regarding recolorization, CIFAR 10.

For MNIST and Fashion MNIST, 4-layer CNNs for the Encoder and Decoder are almost an overkill. For MNIST the dimension **z_dim** of the latent space can be chosen to be pretty small:

**z_dim = 12** gives a really good reconstruction quality of (test) images compressed to minimum information in the latent space. **z_dim = 4** still gave an acceptable quality, and even with z_dim = 2 most test images were reconstructed well enough. The same was true for the reconstruction of images superimposed with heavy statistical noise – such that the human eye could no longer guess the original information. For Fashion MNIST a dimension number 20 < z_dim < 40 gave good results.
Also for recolorization the results were very plausible. I shall present these results in other blog posts in the future.

Then I turned to the **Celeb A** dataset. By the way: I got interested in Celeb A when reading the books of David Foster on „Generative Deep Learning“ and of Tariq Rashid, „Make Your First GAN with PyTorch“ (see the complete references in the last section of this post).

The Celeb A dataset contains images of around 200,000 faces with varying contours, hairdos and very different, inhomogeneous backgrounds. And the faces are displayed from very different viewing angles.

For a good performance of image reconstruction in all of the named use cases one needs to raise the number of dimensions of the latent space *significantly*. Instead of 12 dimensions of the latent space as for MNIST we now talk about 200 up to 1200 dimensions for CELEB A – depending on the task the AE gets trained for and, of course, on the quality expectations. For reconstruction of normal images and for the reconstruction of clear images from noisy input images higher numbers of dimensions z_dim ≥ 512 gave visibly better results.

Actually, I was surprised by the impressive quality of the reconstruction of *test* images of faces which were almost totally obscured by the superposition of statistical noise or by the statistical removal of pixels – after a self-supervised training on around 100,000 images. (Totalitarian states and security agencies certainly are happy about the superb face reconstruction capabilities of even simple AEs.) Part of the explanation, of course, is that 20% un-obscured or un-blurred pixel values out of roughly 30,000 still means some 6,000 clear values. Obviously enough for the AE to choose the right pattern superposition to compose a plausible clear image.

Note that we are not talking about overfitting here – the Autoencoder handled *test images*, i.e. images which it had never seen before, very well. AEs based on CNNs just seem to extract and use patterns characteristic for faces extremely effectively.

But how is the target space of the Encoder, i.e. the latent space, filled for Celeb A data? Do *all* points in the latent space give us images with well recognizable faces in the end?

To answer the last question I trained an AE with 100,000 images of Celeb A for the reconstruction task named above. The dimension of the latent space was chosen to be z_dim = 200 for the results presented below. (Actually, I used a VAE with a tiny KL loss contribution, scaled by a factor of 1.e-6 relative to the standard Binary Cross-Entropy loss for reconstruction – to get at least a minimum confinement of the z-points in the latent space. But the results are basically similar to those of a pure AE.)

My somewhat reworked and centered Celeb A images had a size of 96×96 color pixels. So the original feature space had 96×96×3 = 27,648 dimensions (almost 30,000). The challenge was to reproduce the original images from latent data points created from test images presented to the Encoder. To be more precise:

After a certain number of training epochs we feed the Encoder (with fixed weights) with test images the AE has never seen before. Then we get the components of the vectors from the origin to the resulting points in the latent space (**z-points**). After feeding these data into the Decoder we expect the reproduction of images close to the test input images.

With a balanced training controlled by an Adam optimizer I already got a good resemblance after 10 epochs. After 25 epochs the reproduction of my AE got better and was very acceptable also with respect to tiny details. Due to possible copyright and personal rights violations I do not dare to present the results for general Celeb A images in a public blog. But you can write me a mail if you are interested.

Most of the data points in the latent space were created in a region of 0 < **x_i** < 20 with **x_i** meaning one of the vector components of a z-point in the latent space. I will provide more data on the z-point distribution produced by the Encoder in later posts in this blog.

Then I selected arbitrary data points in the latent space with randomly chosen and uniformly distributed components 0 < x_i < *boundary*. The values for *boundary* were systematically enlarged.

Note that most of the resulting points will have a tendency to be located in outer regions of the multidimensional cube with an extension in each direction given by *boundary*. This is due to the big chance that one of the components will get a relatively high value.
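This concentration effect is easy to verify numerically. The following NumPy sketch – with z_dim = 200 and boundary = 10 chosen purely for illustration – samples uniform z-points as described above and checks how close they sit to the outer regions of the cube:

```python
import numpy as np

rng = np.random.default_rng(42)
z_dim = 200        # number of latent space dimensions, as in the experiment above
boundary = 10.0    # upper limit for each uniformly drawn vector component
n_points = 1000

# z-points with components uniformly distributed in (0, boundary)
z_points = rng.uniform(0.0, boundary, size=(n_points, z_dim))

# Fraction of points with at least one component within 5% of the boundary
frac_near_edge = np.mean(z_points.max(axis=1) > 0.95 * boundary)

# The Euclidean distance from the origin also concentrates sharply
norms = np.linalg.norm(z_points, axis=1)
print(frac_near_edge)              # practically 1.0
print(norms.mean(), norms.std())   # mean around boundary * sqrt(z_dim / 3), tiny spread
```

With z_dim = 200 almost every uniformly drawn point has some component close to the boundary, and all points sit in a thin shell at a distance of roughly boundary·sqrt(z_dim/3) ≈ 81.6 from the origin – none of them lie "near the middle" in all dimensions at once.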

Then I fed these arbitrary z-points into the Decoder. Below you see the results after 10 training epochs of the AE; I selected only 10 of 100 data points created for each value of *boundary* (the images all look more or less the same regarding the absence or blurring of clear face contours):

This is more a collection of face hallucinations than of usable face images. (Interesting for artists, maybe? Seriously meant …).

So, most of the points in the latent space of an Autoencoder do NOT represent reasonable faces. Sometimes our random selection came close to a region in latent space where the results do resemble a face. See e.g. the central image for boundary=10.

From the images above it becomes clear that some arbitrary path inside the latent space will contain more points which do NOT give you a reasonable face reproduction than points that result in plausible face images – despite a successful training of the Autoencoder.

This result supports the impression that the latent space of well trained Autoencoders is almost unusable for *creative* purposes. It also raises the interesting question of what the distribution of „**meaningful points**“ in the latent space really looks like. I do not know whether this has been investigated in depth at all. Some links to publications which prove a certain scientific interest in this question are given in the last section of this post.

I also want to comment on an article published in the Quanta Magazine lately: „Self-Taught AI Shows Similarities to How the Brain Works“. This article refers to „masked“ Autoencoders and self-supervised learning. Reconstructing masked images, i.e. images with a superimposed mask hiding or blurring pixels, indeed works very well with a reasonably equipped Autoencoder. Regarding this point I totally agree. I also agree with the term „self-supervised learning“.

But to suggest that an Autoencoder with this (rather basic) capability reflects methods of the human brain is in my opinion a massive exaggeration. On the contrary, in my opinion an AE reflects a dumbness regarding the storage and usage of otherwise well extracted feature patterns. This is due to its construction and the nature of its mapping of image contents to the latent space. A child can, after some teaching, draw characteristic features of human faces – out of nothing, on a plain white piece of paper. The Decoder part of a standard Autoencoder (in some contrast to a GAN) cannot – at least not without help in picking a *meaningful* point in latent space. And this difference is a major one, in my opinion.

I think the reason why arbitrary points in the multi-dimensional latent space cannot be mapped to images with recognizable faces is yet another effect of the so called „curse of high dimensionality“. But this time also related to the latent space.

A normal Autoencoder (i.e. one without the Kullback-Leibler loss) uses the latent space in its vast extension to produce points where typical properties (features) of faces and background are encoded in a most unique way for each of the input pictures. But the distinct volume filled by such points is a pretty small one – compared to the extensions of the high dimensional latent space. The volume of data points resulting from a mapping-transformation of arbitrary points in the original feature space to points of the latent space is of course much bigger than the volume of points which correspond to images showing typical human faces.

This is due to the fact that there are many more images with arbitrary pixel values already in the original feature space of the input images (with, let's say, 30,000 dimensions for 100×100 color pixels) than images with reasonable values for faces in front of some background. The points in the feature space which correspond to reasonable images of faces (right colors and dominant pixel values for face features) fill a volume which is certainly small compared to the extension of the original feature space. Therefore: If you pick a random point in latent space – even within a confined (but multidimensional) volume around the origin – the chance that this point lies outside the particular volume of points which make sense regarding face reproduction is big. I guess that for z_dim > 200 the probability is pretty close to 1.
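A tiny numerical illustration of this volume argument (with arbitrary example numbers, not a statement about real face regions): even a sub-cube whose side covers 90% of the full extension in each of 200 dimensions occupies only a vanishing fraction of the total volume.

```python
import numpy as np

z_dim = 200

# Volume fraction of a sub-cube with 90% of the side length in every dimension
vol_fraction = 0.9 ** z_dim
print(vol_fraction)   # about 7.1e-10 - a vanishing fraction of the full volume

# Monte-Carlo check: uniform samples over the unit cube practically never land inside it
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(100_000, z_dim))
hits = np.mean(np.all(samples < 0.9, axis=1))
print(hits)
```

A region covering 90% of each axis sounds large, but in 200 dimensions randomly picked points essentially never fall into it – and a region of "meaningful" face points is presumably far smaller than that.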

In addition: As the mapping of a neural Encoder network, e.g. a CNN, is highly non-linear, we cannot say what the boundary hypersurfaces of the mapping regions for faces look like. They are complicated – but due to the enormous number of original images with arbitrary pixel values we can safely guess that they enclose a very tiny volume.

The manifold of data points in the z-space giving us recognizable faces in front of a reasonably separated background may follow a curved and wiggly „path“ through the latent space. In principle there could even be isolated, unconnected regions separated by areas of „chaotic reconstructions“.

I think this kind of argumentation line holds for standard Autoencoders and variational Autoencoders with a very small KL loss in comparison to the reconstruction loss (BCE (binary cross-entropy) or MSE).

The first point is: VAEs reduce the total occupied volume of the latent space. Due to the mu-related term in the Kullback-Leibler loss the whole distribution of z-points gets condensed into a limited volume around the origin of the latent space.

The second reason is that the distribution of meaningful points is smeared out by the logvar-term of the Kullback-Leibler loss.

Both effects enforce overlapping regions of meaningful standard Gaussian-like z-point distributions in the latent space. So VAEs significantly increase the probability to hit a meaningful z-point in latent space – if you choose points around the origin within a distance of „1“. The distance has to be measured with some norm, e.g. the Euclidean one. Actually we should get meaningful reconstructions even beyond a multidimensional sphere of radius „1“. Look at the series on the technical realization of VAEs in this blog. The last posts there prove the effects of the KL-loss experimentally for Celeb A data. Below you find a selection of images created from randomly chosen points in the latent space of a Variational Autoencoder with z_dim=200 after 10 epochs.
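For readers who want to see the KL term itself: the standard Kullback-Leibler loss against a unit Gaussian, including a weighting factor like the one mentioned above, can be sketched in plain NumPy as follows (the function name is mine and chosen for illustration only):

```python
import numpy as np

def kl_loss(mu, log_var, fact=1.0):
    """Kullback-Leibler divergence of N(mu, exp(log_var)) against the standard
    normal N(0, 1), weighted by fact - summed over the latent dimensions and
    averaged over the batch."""
    kl_per_dim = -0.5 * fact * (1.0 + log_var - np.square(mu) - np.exp(log_var))
    return np.mean(np.sum(kl_per_dim, axis=1))

# A batch of two z-point distributions in a 4-dimensional latent space:
# the first matches the standard normal exactly, the second is shifted in mu
mu      = np.array([[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]])
log_var = np.zeros((2, 4))

print(kl_loss(mu, log_var))   # 1.0 = (0 + mu^2/2) / 2 averaged over the batch
```

The mu-related part (np.square(mu)) punishes z-points far from the origin (condensation), while the log_var part punishes vanishing variances (smearing-out) – exactly the two effects described above.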

Enough for today. Whilst standard Autoencoders solve certain tasks very well, they seem to produce very specific data distributions in the latent space for the reconstruction of „meaningful“ images with human faces. The origin of this problem lies already in the original feature space of the images: Also there only a small minority of points represents humanly interpretable face images. This is even true if you use a dimensionality reduction method such as PCA ahead.

From a first experiment the chance of hitting a data point in latent space which gives you a meaningful image seems to be small. This result appears to be a variant of the curse of high dimensionality – this time including the latent space.

In a next step we will investigate the surroundings of a selected point in the latent space. This point will be chosen such that it gives us a reasonable reconstruction result in the first place. A systematic investigation of neighboring grid points would be an impossible endeavor in a vector space of 200 dimensions (with just 10 grid values per dimension we already get 10^200 combinations – far beyond computational powers). We, therefore, *must* reduce the number of selected directions in the latent space significantly. We will instead look at simultaneous changes of blocks of multiple dimensions. Such parallel changes of features in the latent space are indeed the really interesting ones. We will see that even significant changes of only one component of a latent vector do not have a major impact on a face image. In a later post we will investigate the variation of images along straight (one-dimensional, geodesic) connections between face feature centers in the latent space. We will find that we cross regions of chaotic reconstruction images along such a path. Stay tuned …
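The idea of changing whole blocks of latent components instead of single ones can be sketched geometrically (pure NumPy; the helper function is hypothetical and only illustrates the distances involved):

```python
import numpy as np

def perturb_block(z_ref, dim_indices, delta):
    """Return a copy of z_ref with the components listed in dim_indices
    shifted by delta - a 'parallel' change of a block of latent features."""
    z_new = z_ref.copy()
    z_new[dim_indices] += delta
    return z_new

rng = np.random.default_rng(7)
z_dim = 200
z_ref = rng.uniform(0.0, 20.0, size=z_dim)   # a z-point as produced by an Encoder

# Shift a single component vs. a block of 50 components by the same delta
z_single = perturb_block(z_ref, [17], 5.0)
z_block  = perturb_block(z_ref, list(range(50)), 5.0)

print(np.linalg.norm(z_single - z_ref))   # 5.0
print(np.linalg.norm(z_block - z_ref))    # 5.0 * sqrt(50), about 35.4
```

A shift of a single component moves the z-point only a little in a 200-dimensional space, whereas the same shift applied to a block of components moves it substantially – which fits the observation that single-component changes barely affect a reconstructed face.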

https://towardsdatascience.com/exploring-the-latent-space-of-your-convnet-classifier-b6eb862e9e55

Felix Leeb, Stefan Bauer, Michel Besserve, Bernhard Schölkopf, „Exploring the Latent Space of Autoencoders with Interventional Assays“, 2022, https://arxiv.org/abs/2106.16091v2 // https://arxiv.org/pdf/2106.16091.pdf

https://wiredspace.wits.ac.za/handle/10539/33094?show=full

https://www.elucidate.ai/post/exploring-deep-latent-spaces

**Books:**

T. Rashid, „GANs mit PyTorch selbst programmieren“, 2020, O’Reilly, dpunkt.verlag, Heidelberg, ISBN 978-3-96009-147-9

D. Foster, „Generatives Deep Learning“, 2019, O’Reilly, dpunkt.verlag, Heidelberg, ISBN 978-3-96009-128-8


Our objective is to avoid or circumvent potential problems with the **eager execution mode** of present Tensorflow 2 versions. I have already described three solutions based on standard Keras functionality:

- Either we add loss contributions via the function **layer.add_loss()** and a special layer of the Encoder part of the VAE,
- or we add a loss to the output of a full VAE-model via the function **model.add_loss()**,
- or we build a complex model which transports required KL-related tensors from the Encoder part of the VAE model to the Decoder’s output layer.

In all these cases we invoke *native* Keras functions to handle loss contributions and related operations. Keras controls the calculation of the gradient components of the KL related tensors „mu“ and „log_var“ in the background for us. This comprises partial derivatives with respect to trainable weight variables of *lower* Encoder layers and related operations. The same holds for partial derivatives of reconstruction tensors at the Decoder’s output layer with respect to trainable parameters of *all* layers of the VAE-model. Keras does most of the job

- of derivative calculation and the registration of related operation sequences during forward pass
- and the correct application of the registered operations and values in later weight corrections during backward propagation

for us *in the background* as long as we respect certain rules for eager mode execution.

But Tensorflow 2 [TF2] gives us a much more flexible and low-level option to control the calculation of gradients under the conditions of eager execution. This option requires that we inform the TF/Keras machinery which processes the **training steps** of an epoch of how to exactly calculate losses and their partial derivatives. Rules to determine and create metrics output must be provided in addition.

TF2 provides a context for *registering* operations for loss and derivative evaluations. This context is provided by a functional object called **GradientTape()**. In addition we have to write an encapsulating function „**train_step()**“ to control gradient calculations and output during training.

In this post I will describe how we integrate such an approach with our **class „MyVariationalAutoencoder()“** for the setup of a VAE model based on convolutional layers. I have discussed the elements and methods of this class *MyVariationalAutoencoder()* in detail during the last posts.

Regarding the core of the technical solution for **train_step()** and **GradientTape()** I follow more or less the recommendations of one of the masters of Keras: **F. Chollet**. His original code for a TF2-compatible implementation of a VAE can be found here:

https://keras.io/examples/generative/vae/

However, in my opinion Chollet’s code contains a small problem, which I have allowed myself to correct.

The general recipe presented here can, of course, be extended to more complex tasks beyond the optimization of KL and reconstruction losses of VAEs. Therefore, a brief study of the methods to establish detailed loss control is really worth it for ML and VAE beginners. But TF2 and Keras experts will not learn anything new from this post.

I provide the pure code of the classes in this post. In the next post you will find Jupyter cell code for an application to the Celeb A dataset. To prove that the classes below do their job in the end I show you some faces which have been created from arbitrarily chosen points in the latent space after training.

These faces do not exist in reality. They are constructed by the VAE based on compressed and „normalized“ data for face patterns and face attribute distributions in the latent space. Note that I used a latent space with a dimension of z_dim =200.

We have already many of the required methods ready. In the last posts we used the flexible *functional interface of Keras* to set up Neural Network models for both Encoder and Decoder, each with sequences of (convolutional) layers. For our present purposes we will not change the elementary layer structure of the Encoder or Decoder. In particular the layers for the „mu“ and „log_var“ contributions to the KL loss and a subsequent sampling-layer of the Encoder will remain unchanged.

In the course of the last two posts I have already introduced a parameter „*solution_type*“ to control specifics of our VAE model. We shall use it now to invoke a child class of Keras‘ Model() which allows for detailed steps of loss and gradient evaluations.

The standard Keras method **Model.fit()** normally provides a convenient interface for Keras users. We do not have to think about calling low-level functions at all if we do not want to or do not need to control gradient calculations in detail. In our present approach, however, we use the low-level functionality of **GradientTape()** directly. This requires overwriting a specific method of the standard Keras class Model() – namely the method **„train_step()“**.

If you have never worked with a self-defined **train_step()** and **GradientTape()** before, then I recommend reading the following introductions first:

https://www.tensorflow.org/guide/autodiff

customizing what happens in fit() and the relation to train_step()

These articles contain valuable information about how to operate at a low level with **train_step()** regarding losses, derivatives and metrics. This information will help to better understand the methods of the new class VAE() which I am going to derive from Keras‘ class Model() below.

Let us first briefly repeat some imports required.

**Imports**

```
# Imports
# ~~~~~~~~
import sys
import numpy as np
import os
import pickle
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import Layer, Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, AlphaDropout
from tensorflow.keras.models import Model
# to be consistent with my standard loading of the Keras backend in Jupyter notebooks:
from tensorflow.keras import backend as B
from tensorflow.keras import metrics
#from tensorflow.keras.backend import binary_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model
#from tensorflow.python.debug.lib.check_numerics_callback import _maybe_lookup_original_input_tensor
# Personal: The following works only if the path in the notebook is supplemented by the path to /projects/GIT/mlx
# The user has to organize his paths for modules to be referred to from Jupyter notebooks himself and
# replace this settings
from mynotebooks.my_AE_code.utils.callbacks import CustomCallback, VAE_CustomCallback, step_decay_schedule
from keras.callbacks import ProgbarLogger
```

Now we define a class VAE() which inherits basic functionality from the Keras class Model() and overwrite the method train_step(). We shall later create an instance of this new class within an object of class MyVariationalAutoencoder().

**New Class VAE**

```
from tensorflow.keras import metrics
...
...
# A child class of Model() to control train_step with GradientTape()
class VAE(keras.Model):

    # We use our self defined __init__() to provide a reference MyVAE
    # to an object of type "MyVariationalAutoencoder"
    # This in turn allows us to address the Encoder and the Decoder
    def __init__(self, MyVAE, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.MyVAE = MyVAE
        self.encoder = self.MyVAE.encoder
        self.decoder = self.MyVAE.decoder

        # A factor to control the ratio between the KL loss and the reconstruction loss
        self.fact = MyVAE.fact

        # A counter
        self.count = 0

        # A factor to scale the absolute values of the losses
        # e.g. by the number of pixels of an image
        self.scale_fact = 1.0  # no scaling
        # self.scale_fact = tf.constant(self.MyVAE.input_dim[0] * self.MyVAE.input_dim[1], dtype=tf.float32)
        self.f_scale = 1. / self.scale_fact

        # loss type : 0: BCE, 1: MSE
        self.loss_type = self.MyVAE.loss_type

        # track loss development via metrics
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reco_loss_tracker  = keras.metrics.Mean(name="reco_loss")
        self.kl_loss_tracker    = keras.metrics.Mean(name="kl_loss")

    def call(self, inputs):
        x, z_m, z_var = self.encoder(inputs)
        return self.decoder(x)

    # Overwrite the metrics() of Model() - use getter mechanism
    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reco_loss_tracker,
            self.kl_loss_tracker
        ]

    # Core function to control all operations regarding eager differentiation operations,
    # i.e. the calculation of loss terms with respect to tensors and differentiation variables
    # and metrics data
    def train_step(self, data):
        # We use the GradientTape context to record differentiation operations/results
        #self.count += 1

        with tf.GradientTape() as tape:
            z, z_mean, z_log_var = self.encoder(data)
            reconstruction = self.decoder(z)
            #reco_shape = tf.shape(self.reconstruction)
            #print("reco_shape = ", reco_shape, self.reconstruction.shape, data.shape)

            # BCE loss (Binary Cross Entropy)
            if self.loss_type == 0:
                reconstruction_loss = tf.reduce_mean(
                    tf.reduce_sum(
                        keras.losses.binary_crossentropy(data, reconstruction),
                        axis=(1, 2)
                    )
                ) * self.f_scale

            # MSE loss (Mean Squared Error)
            if self.loss_type == 1:
                reconstruction_loss = tf.reduce_mean(
                    tf.reduce_sum(
                        keras.losses.mse(data, reconstruction),
                        axis=(1, 2)
                    )
                ) * self.f_scale

            kl_loss = -0.5 * self.fact * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss

        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

        #if self.count == 1:

        self.total_loss_tracker.update_state(total_loss)
        self.reco_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "total_loss": self.total_loss_tracker.result(),
            "reco_loss": self.reco_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

    def compile_VAE(self, learning_rate):
        # Optimizer
        # ~~~~~~~~~
        optimizer = Adam(learning_rate=learning_rate)
        # save the learning rate for possible intermediate output to files
        self.learning_rate = learning_rate
        self.compile(optimizer=optimizer)
```

First, we need to import an additional library **tensorflow.keras.metrics**. Its functions, as e.g. **Mean()**, will help us to print out intermediate data about various loss contributions during training – averaged over the batches of an epoch.

Then, we have added four central methods to class VAE:

- a function **__init__()**,
- a function **metrics()** together with Python’s **getter**-mechanism,
- a function **call()**,
- and our central function **train_step()**.

All these functions overwrite the defaults of the parent class Model(). Be careful to distinguish the range of batches on which keras.metrics() and train_step() operate:

- A „training step“ covers just one batch, eventually provided to the training mechanism by the Model.fit()-function.
- Averaging performed by functions of keras.metrics instead works across *all* batches of an epoch.

In general we can use the standard interface of __init__(inputs, outputs, …) **or a call()-interface** to instantiate an object of class-type Model(). See

https://www.tensorflow.org/api_docs/python/tf/keras/Model

https://docs.w3cub.com/tensorflow~python/tf/keras/model.html

We have to be precise about the parameters of __init__() or the call()-interface if we intend to use properties of the standard *compile()*- and *fit()*-interfaces of a model – at least in application cases where we do not control everything regarding losses and gradients ourselves.

To define a complete model for the general case we therefore add the *call()*-method. At the same time we „misuse“ the __init__() function of VAE() to provide a reference to our instance of class „MyVariationalAutoencoder“. Actually, providing „call()“ is done only for the sake of flexibility in other use cases than the one discussed here. For our present purposes we could actually omit call().

The __init__()-function retrieves some parameters from MyVAE. You see the factor *„fact“* which controls the ratio of the KL-loss to the reconstruction loss. In addition I provided an option to scale the loss values by a division by the number of pixels of input images. You just have to un-comment the respective statement. Sorry, I have not yet made it controllable by a parameter of MyVariationalAutoencoder().

Finally, the parameter loss_type is evaluated; for a value of „1“ we take MSE as a loss function instead of the standard BCE (Binary Cross-Entropy); see the Jupyter cells in the next post. This allows for some more flexibility in dealing with certain datasets.
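For illustration, the two reconstruction loss variants can be mimicked in plain NumPy (a sketch of the reduce-pattern used in train_step(), not the class code itself; the function name reco_loss() is mine):

```python
import numpy as np

def reco_loss(data, reconstruction, loss_type=0, eps=1e-7):
    """Reconstruction loss: per-pixel BCE (loss_type=0) or MSE (loss_type=1),
    averaged over the channel axis, summed over the image axes and averaged
    over the batch - mirroring the reduce_sum/reduce_mean pattern above."""
    if loss_type == 0:
        p = np.clip(reconstruction, eps, 1.0 - eps)
        pixel_loss = -np.mean(data * np.log(p) + (1.0 - data) * np.log(1.0 - p), axis=-1)
    else:
        pixel_loss = np.mean(np.square(data - reconstruction), axis=-1)
    return np.mean(np.sum(pixel_loss, axis=(1, 2)))

# A random "batch" of four 96x96 RGB images with values in [0, 1]
batch = np.random.default_rng(1).uniform(size=(4, 96, 96, 3))
print(reco_loss(batch, batch, loss_type=1))   # 0.0 for a perfect reconstruction
```

Note that the MSE of a perfect reconstruction is exactly zero, whereas the BCE of a perfect reconstruction of non-binary pixel values stays positive (it contains the entropy of the data) – one of the reasons loss curves for the two loss types are not directly comparable.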

With the function **metrics()** we are able to establish our own „tracking“ of the evolution of the Model’s loss contributions during training. In our case we are particularly interested in the evolution of the „**reconstruction loss**“ and the „**KL-loss**„.

Note that the **@property** decorator is added to the **metrics()**-function. This allows us to define its output via the **getter**-mechanism for Python classes. In our case the __init__()-function defines the mechanism to fill required variables:

The three „tracker“-variables there get their values from the function tensorflow.keras.metrics.Mean(). Note that the *names* given to the loss-trackers in __init__() are of importance for later output handling!

Note also that **keras.metrics.Mean()** calculates averages over values derived for *all* batches of an epoch. The **tf.reduce_mean()**-statements in the GradientTape() section of the code above, instead, refer to averages calculated over the samples of a *single* batch.
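The difference between the two kinds of averaging can be illustrated with a minimal stand-in for keras.metrics.Mean() (the class MeanTracker below is hypothetical and only mimics the update_state()/result() behavior):

```python
import numpy as np

class MeanTracker:
    """Minimal stand-in for keras.metrics.Mean(): accumulates values over an
    epoch; result() returns their running average; reset_states() restarts
    the accumulation (done automatically by Keras at each epoch's begin)."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def update_state(self, value):
        self.total += float(value)
        self.count += 1
    def result(self):
        return self.total / self.count
    def reset_states(self):
        self.total, self.count = 0.0, 0

tracker = MeanTracker()
epoch_batches = [np.array([2.0, 4.0]), np.array([6.0, 8.0])]  # per-sample losses of two batches

for batch_losses in epoch_batches:
    batch_loss = np.mean(batch_losses)   # what tf.reduce_mean() yields within one train_step()
    tracker.update_state(batch_loss)     # the metric averages across the batches of the epoch

print(tracker.result())   # 5.0 = mean of the two batch means 3.0 and 7.0
```

So tf.reduce_mean() averages within one batch, whilst the metric object averages the resulting batch values across the whole epoch.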

Updated loss output is later delivered during each training step by the method **update_state()**. You find a description of the methods of keras.metrics.Mean() here.

The result of all this is that metrics() delivers loss values via updated tracker-variables of our child class *VAE()*. Note that neither __init__() nor metrics() define what exactly is to be done to calculate each loss term. __init__() and metrics() only prepare the later output technically by formal class constructs. Note also that all the data defined by metrics() are updated and averaged per epoch *without* the requirement to call the function „reset_states()“ (see the Keras docs). This is automatically done at the beginning of each epoch.

Let us turn to the necessary calculations which must be performed during each training step. In an eager environment we must watch the trainable variables, on which the different loss terms depend, to be able to calculate the partial derivatives and record related operations and intermediate results **already during forward pass**:

We must track the differentiation operations and resulting values to know exactly what has to be done in reverse during error backward propagation. To be able to do this TF2 offers us a recording mechanism called **GradientTape()**. Its results are kept in an object which often is called a „tape“.
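Conceptually, such a „tape“ just records the local derivative of every operation during the forward pass and multiplies the recorded values in reverse order afterwards. A toy pure-Python sketch (the class ToyTape is of course hypothetical and reduced to scalar operation chains; the real GradientTape handles arbitrary tensor graphs):

```python
import math

class ToyTape:
    """Toy 'tape' for a single chain of scalar operations: each forward call
    records the local derivative; gradient() multiplies the recorded values
    in reverse order - the chain rule, replayed backwards."""
    def __init__(self):
        self.local_grads = []

    def square(self, x):
        self.local_grads.append(2.0 * x)   # d(x^2)/dx
        return x * x

    def exp(self, x):
        y = math.exp(x)
        self.local_grads.append(y)         # d(e^x)/dx = e^x
        return y

    def gradient(self):
        g = 1.0
        for d in reversed(self.local_grads):   # backward pass through the records
            g *= d
        return g

tape = ToyTape()
y = tape.exp(tape.square(3.0))   # forward pass: y = exp(3^2), operations get recorded
print(tape.gradient())           # dy/dx = exp(9) * 2*3
```

The essential point carries over to TF2: the derivative information is collected *during* the forward pass, so the backward pass only has to combine the recorded pieces.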

You find more information about these topics at

https://debuggercafe.com/basics-of-tensorflow-gradienttape/

https://runebook.dev/de/docs/tensorflow/gradienttape

Within *train_step()* we need some tensors which are required for loss calculations in an explicit form. So, we must change the Keras model for the Encoder to give us the tensors for „mu“ and „log_var“ as explicit outputs.

This is no problem for us. We have already made the output of the Encoder dependent on a variable „solution_type“ and discussed a multi-output Encoder model already in the post Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output.

Therefore, we just have to add a new value 3 to the checks of „solution_type“. The same is true for the input control of the Decoder (see a section about the related methods of MyVariationalAutoencoder() below).

The statements within the section for **GradientTape()** deal with the calculation of loss terms and record the related operations. All the calculations should be familiar from previous posts of this series.

This includes an identification of the trainable_weights of the involved layers. Quote from

https://keras.io/guides/writing_a_training_loop_from_scratch/#using-the-gradienttape-a-first-endtoend-example:

Calling a model inside a GradientTape scope enables you to retrieve the gradients of the trainable weights of the layer with respect to a loss value. Using an optimizer instance, you can use these gradients to update these variables (which you can retrieve using model.trainable_weights).

In **train_step()** we need to register that the total loss is dependent on all trainable weights and that all related partial derivatives have to be taken into account during optimization. This is done by

```
grads = tape.gradient(total_loss, self.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
```

To be able to get updated output during training we update the state of all tracked variables:

```
self.total_loss_tracker.update_state(total_loss)
self.reco_loss_tracker.update_state(reconstruction_loss)
self.kl_loss_tracker.update_state(kl_loss)
```

The careful reader may have noticed that my code of the function „train_step()“ deviates from F. Chollet’s recommendations. Regarding the return statement I use

```
return {
    "total_loss": self.total_loss_tracker.result(),
    "reco_loss": self.reco_loss_tracker.result(),
    "kl_loss": self.kl_loss_tracker.result(),
}
```

whilst F. Chollet’s original code contains a statement like

```
return {
    "loss": self.total_loss_tracker.result(),   # here lies the main difference - a different "name" than defined in __init__()!
    "reconstruction_loss": self.reconstruction_loss_tracker.result(),   # ignore my abbreviation to reco_loss
    "kl_loss": self.kl_loss_tracker.result(),
}
```

Chollet’s original code unfortunately gives *inconsistent* loss data: The sum of his „reconstruction loss“ and the „KL (Kullback Leibler) loss“ does *not* add up to the (total) „loss“. This can be seen from the data of the first epochs in F. Chollet’s example in the tutorial at

keras.io/examples/generative/vae.

Some of my own result data for the MNIST example with this error look like:

```
Epoch 1/5
469/469 [==============================] - 7s 13ms/step - reconstruction_loss: 209.0115 - kl_loss: 3.4888 - loss: 258.9048
Epoch 2/5
469/469 [==============================] - 7s 14ms/step - reconstruction_loss: 173.7905 - kl_loss: 4.8220 - loss: 185.0963
Epoch 3/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 160.4016 - kl_loss: 5.7511 - loss: 167.3470
Epoch 4/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 155.5937 - kl_loss: 5.9947 - loss: 162.3994
Epoch 5/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 152.8330 - kl_loss: 6.1689 - loss: 159.5607
```

Things do get better from epoch to epoch – but we want a consistent output from the beginning: The averaged (total) loss should always be printed as equal to the sum of the (averaged) KL loss plus the reconstruction loss.

The deviation is surprising as we *seem* to use the right tracker-results in the code. And the name used in the return statement of the train_step()-function here should only be relevant for the printing …

However, the name „loss“ is NOT consistent with the name defined in the statement Mean(name=“total_loss“) in the __init__() function of Chollet, where he defines his tracking mechanisms.

self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

This has consequences: The inconsistency triggers a different output than a consistent use of names. Just try it out on your own …

This is not only true for the deviation between „loss“ in

```
return { "loss": self.total_loss_tracker.result(), .... }
```

and „total_loss“ in the __init__() function

```
self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
```

– which results in a printed value lacking proper averaging – but also for deviations in the names used for the other loss contributions. *In case of an inconsistency Keras seems to fall back to a default* here which does not reflect the standard linear averaging of Mean() over all values calculated for the batches of an epoch (without any special weights).

That there is some common default mechanism at work can be seen from the fact that wrong names for **all** individual losses (here the KL loss and the reconstruction loss) give us at least a consistent sum-value for the total loss again. But all the values derived by this fallback are much closer to the start values at the beginning of an epoch than to the mean values averaged over an epoch. You may test this yourself.

To get on the safe side we use the correct „names“ defined in the __init__()-function of our code:

return {
    "total_loss": self.total_loss_tracker.result(),
    "reco_loss":  self.reco_loss_tracker.result(),
    "kl_loss":    self.kl_loss_tracker.result(),
}

For MNIST data fed into our VAE model we then get:

Epoch 1/5
469/469 [==============================] - 8s 13ms/step - reco_loss: 214.5662 - kl_loss: 2.6004 - total_loss: 217.1666
Epoch 2/5
469/469 [==============================] - 7s 14ms/step - reco_loss: 186.4745 - kl_loss: 3.2799 - total_loss: 189.7544
Epoch 3/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 181.9590 - kl_loss: 3.4186 - total_loss: 185.3774
Epoch 4/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 177.5216 - kl_loss: 3.9433 - total_loss: 181.4649
Epoch 5/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 163.7209 - kl_loss: 5.5816 - total_loss: 169.3026

This is exactly what we want.
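A quick arithmetic check on the epoch lines above confirms the consistency (within the 4-decimal rounding of the progress bar and float32 noise):

```python
# Per-epoch (reco_loss, kl_loss, total_loss) triples copied from the output above
epochs = [
    (214.5662, 2.6004, 217.1666),
    (186.4745, 3.2799, 189.7544),
    (181.9590, 3.4186, 185.3774),
    (177.5216, 3.9433, 181.4649),
    (163.7209, 5.5816, 169.3026),
]
for reco, kl, total in epochs:
    # the printed total is the sum of the averaged components,
    # up to print rounding and float32 accumulation noise
    assert abs(total - (reco + kl)) <= 1e-3
print("total_loss == reco_loss + kl_loss for all epochs")
```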

So, the general recipe is:

- Define which metric properties you are interested in and create respective tracker variables in the __init__() function.
- Use the getter mechanism to define your metrics() function and its output via references to the trackers.
- Define your own training step via a function train_step().
- Use Tensorflow’s GradientTape context to register the statements which calculate the loss contributions from elementary tensors of your (functional) Keras model. Make all required layers accessible there, e.g. via references to the Encoder and Decoder models.
- Within „train_step()“, register the gradient operations of the total loss with respect to all trainable weights and update the metrics data.
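To make this recipe concrete without pulling in Tensorflow, here is a framework-free toy analogue of the structure of such a train_step(): a single weight with an analytic gradient stands in for GradientTape's automatic differentiation, and a small regularizing term stands in for the KL loss.

```python
# Toy analogue of a custom train_step() (illustrative; the gradient is
# written down analytically here, where GradientTape would derive it).
x, y = 3.0, 1.5    # one hypothetical training sample (input, target)
lr = 0.01          # learning rate of the "optimizer"

def train_step(w):
    # --- inside the tape context: compute all loss contributions ---
    y_pred = w * x
    reco_loss  = (y_pred - y) ** 2     # reconstruction-like loss
    kl_like    = 0.01 * w ** 2         # regularizing term standing in for the KL loss
    total_loss = reco_loss + kl_like
    # --- gradient of the TOTAL loss w.r.t. the trainable weight ---
    grad = 2.0 * (y_pred - y) * x + 0.02 * w   # analytic d(total_loss)/dw
    # --- corresponds to optimizer.apply_gradients(...) ---
    return w - lr * grad, total_loss

w = 2.0
w, loss_0 = train_step(w)
w, loss_1 = train_step(w)
print(loss_0, loss_1)   # the total loss decreases from step to step
```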

Actually, I have used the GradientTape() mechanism already in this blog when I played a bit with approaches to create so called DeepDream images. See

https://linux-blog.anracom.com/category/machine-learning/deep-dream/

for more information – there in a different context.

Where do we stand? We have defined a new class „*VAE()*“ which modifies the original Keras Model() class. And we have our class „MyVariationalAutoencoder()“ to control the setup of a VAE model.

Next we need to address the question of how we combine these two classes. If you have read my previous posts you may expect a major change to the method „**_build_VAE()**“. This is correct, but we also have to modify the conditions for the Encoder output construction in _build_enc() and the definition of the Decoder’s input in _build_dec(). Therefore I give you the modified code for these functions. For completeness I add the code of the __init__() function:

def __init__(self
    , input_dim                   # the shape of the input tensors (for MNIST (28,28,1))
    , encoder_conv_filters        # number of maps of the different Conv2D layers
    , encoder_conv_kernel_size    # kernel sizes of the Conv2D layers
    , encoder_conv_strides        # strides - here also used to reduce spatial resolution
                                  # (used instead of pooling layers)
    , encoder_conv_padding        # padding - valid or same
    , decoder_conv_t_filters      # number of maps in Conv2DTranspose layers
    , decoder_conv_t_kernel_size  # kernel sizes of the Conv2DTranspose layers
    , decoder_conv_t_strides      # strides for the Conv2DTranspose layers - inverts spatial resolution
    , decoder_conv_t_padding      # padding - valid or same
    , z_dim                       # dimension of the latent space; a good start is 16 or 24
    , solution_type = 0           # which type of solution for the KL loss calculation?
    , act = 0                     # which type of activation function?
    , fact = 0.65e-4              # factor for the KL loss (0.5e-4 < fact < 1.e-3 is reasonable)
    , loss_type = 0               # 0: BCE, 1: MSE
    , use_batch_norm = False      # shall BatchNormalization be used after Conv2D layers?
    , use_dropout = False         # shall statistical dropout layers be used for regularization purposes?
    , dropout_rate = 0.25         # rate for the statistical dropout layers
    , b_build_all = False         # Added by RMO - full model is built in 2 steps
    ):
    '''
    Input:
    The encoder_... and decoder_... variables are Python lists whose length defines
    the number of Conv2D and Conv2DTranspose layers

    input_dim                : shape/dimensions of the input tensor - for MNIST (28,28,1)
    encoder_conv_filters     : list with the number of maps/filters per Conv2D layer
    encoder_conv_kernel_size : list with the kernel sizes for the Conv layers
    encoder_conv_strides     : list with the strides used for the Conv layers
    z_dim                    : dimension of the "latent space"
    solution_type            : type of solution for the KL loss calculation
                               0: customized Encoder layer
                               1: transfer of mu, log_var to the Decoder
                               2: model.add_loss()
                               3: definition of a training step with GradientTape()
    act                      : determines the activation function to use
                               (0: LeakyReLU, 1: ReLU, 2: SELU)
                               !!!! NOTE !!!!
                               If SELU is used then the weight kernel initialization and
                               the dropout layer need to be special
                               https://github.com/christianversloot/machine-learning-articles/blob/main/using-selu-with-tensorflow-and-keras.md
                               AlphaDropout instead of Dropout + LecunNormal as kernel initializer
    fact = 0.65e-4           : factor to scale the KL loss relative to the reconstruction loss
                               Must be adapted to the way of calculation - e.g. for
                               solution_type == 3 the loss is not averaged over all pixels
                               => at least a factor of around 1000 bigger than normally
    loss_type = 0            : defines the way we calculate the reconstruction loss
                               0: binary crossentropy - recommended by many authors
                               1: mean square error - recommended by some authors,
                                  especially for "face arithmetics"
    use_batch_norm = False   : True => we use BatchNormalization
    use_dropout = False      : True => we use dropout layers (rate = 0.25, see Encoder)
    b_build_all = False      : True  => the full VAE model is built in 1 step;
                               False => Encoder, Decoder, VAE are built in separate steps
    '''
    self.name = 'variational_autoencoder'

    # Parameters for the layers which define the Encoder and Decoder
    self.input_dim = input_dim
    self.encoder_conv_filters = encoder_conv_filters
    self.encoder_conv_kernel_size = encoder_conv_kernel_size
    self.encoder_conv_strides = encoder_conv_strides
    self.encoder_conv_padding = encoder_conv_padding
    self.decoder_conv_t_filters = decoder_conv_t_filters
    self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
    self.decoder_conv_t_strides = decoder_conv_t_strides
    self.decoder_conv_t_padding = decoder_conv_t_padding
    self.z_dim = z_dim

    # Check the parameter for the activation function
    if act < 0 or act > 2:
        print("Range error: Parameter act = " + str(act) + " has an unknown value")
        sys.exit()
    else:
        self.act = act

    # Factor to scale the KL loss relative to the Binary Cross Entropy loss
    self.fact = fact

    # Type of loss - 0: BCE, 1: MSE
    self.loss_type = loss_type

    # Check the parameter for the solution approach
    if solution_type < 0 or solution_type > 3:
        print("Range error: Parameter solution_type = " + str(solution_type) + " has an unknown value")
        sys.exit()
    else:
        self.solution_type = solution_type

    self.use_batch_norm = use_batch_norm
    self.use_dropout = use_dropout
    self.dropout_rate = dropout_rate

    # Preparation of some variables to be filled later
    self._encoder_input = None    # receives the Keras object for the Input layer of the Encoder
    self._encoder_output = None   # receives the Keras object for the Output layer of the Encoder
    self.shape_before_flattening = None  # info of the Encoder => is used by the Decoder
    self._decoder_input = None    # receives the Keras object for the Input layer of the Decoder
    self._decoder_output = None   # receives the Keras object for the Output layer of the Decoder

    # Layers / tensors for the KL loss
    self.mu = None        # receives the special Dense layer's tensor for the KL loss
    self.log_var = None   # receives the special Dense layer's tensor for the KL loss

    # Parameters for SELU - just in case we may need to use it somewhere
    # https://keras.io/api/layers/activations/  see selu
    self.selu_scale = 1.05070098
    self.selu_alpha = 1.67326324

    # The number of Conv2D and Conv2DTranspose layers for the Encoder / Decoder
    self.n_layers_encoder = len(encoder_conv_filters)
    self.n_layers_decoder = len(decoder_conv_t_filters)

    self.num_epoch = 0   # initialization of the number of epochs

    # A matrix for the values of the losses
    self.std_loss = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)

    # We only build the whole VAE model if requested
    self.b_build_all = b_build_all
    if b_build_all:
        self._build_all()
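The remark on „fact“ for solution_type == 3 in the docstring above can be made plausible with simple arithmetic: if the reconstruction loss is summed instead of averaged over the pixels, it grows by the pixel count, so the KL scaling factor must grow accordingly. (Illustrative numbers only; the per-pixel value is hypothetical.)

```python
# Why "fact" must be roughly 3 orders of magnitude larger when the
# reconstruction loss is NOT averaged over all pixels (solution_type == 3):
pixels = 28 * 28              # MNIST input resolution => 784 pixel values
per_pixel_bce = 0.15          # hypothetical average per-pixel BCE value

loss_averaged = per_pixel_bce            # mean over pixels
loss_summed   = per_pixel_bce * pixels   # sum over pixels

print(loss_summed / loss_averaged)       # 784.0 => a factor of around 1000
```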

We just need to set the right options for the output tensors of the Encoder and the input tensors of the Decoder. The relevant code parts are controlled by the parameter „solution_type“.

**Modified code of _build_enc() of class MyVariationalAutoencoder**

def _build_enc(self, solution_type = -1, fact=-1.0):
    '''  Your documentation  '''
    # Check whether "fact" and "solution_type" for the KL loss shall be overwritten
    if fact < 0:
        fact = self.fact
    if solution_type < 0:
        solution_type = self.solution_type
    else:
        self.solution_type = solution_type

    # Preparation: We later need a function to calculate the z-points in the latent space
    # The following function will be used by an eventual Lambda layer of the Encoder
    def z_point_sampling(args):
        '''
        A point in the latent space is calculated statistically
        around an optimized mu for each sample
        '''
        mu, log_var = args  # Note: These are 1D tensors!
        epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
        return mu + B.exp(log_var / 2) * epsilon

    # Input "layer"
    self._encoder_input = Input(shape=self.input_dim, name='encoder_input')

    # Initialization of a running variable x for individual layers
    x = self._encoder_input

    # Build the CNN part with Conv2D layers
    # Note that a stride >= 2 reduces the spatial resolution without the help of pooling layers
    for i in range(self.n_layers_encoder):
        conv_layer = Conv2D(
            filters = self.encoder_conv_filters[i]
            , kernel_size = self.encoder_conv_kernel_size[i]
            , strides = self.encoder_conv_strides[i]
            , padding = 'same'   # Important! Controls the shape of the layer tensors.
            , name = 'encoder_conv_' + str(i)
            )
        x = conv_layer(x)

        # The "normalization" should be done ahead of the "activation"
        if self.use_batch_norm:
            x = BatchNormalization()(x)

        # Selection of the activation function (out of 3)
        if self.act == 0:
            x = LeakyReLU()(x)
        elif self.act == 1:
            x = ReLU()(x)
        elif self.act == 2:
            # RMO: Just use the Activation layer to use SELU with predefined (!) parameters
            x = Activation('selu')(x)

        # Fulfill some SELU requirements
        if self.use_dropout:
            if self.act == 2:
                x = AlphaDropout(rate = 0.25)(x)
            else:
                x = Dropout(rate = 0.25)(x)

    # Last multi-dim tensor shape - is later needed by the Decoder
    self._shape_before_flattening = B.int_shape(x)[1:]

    # Flattened layer before calculating the VAE output (z-points) via 2 special layers
    x = Flatten()(x)

    # "Variational" part - create 2 Dense layers for a statistical distribution of z-points
    self.mu = Dense(self.z_dim, name='mu')(x)
    self.log_var = Dense(self.z_dim, name='log_var')(x)

    if solution_type == 0:
        # Customized layer for the calculation of the KL loss based on mu, log_var data
        # We use a customized layer according to a class definition
        self.mu, self.log_var = My_KL_Layer()([self.mu, self.log_var], fact=fact)

    # Layer to provide a z_point in the latent space for each sample of the batch
    self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])

    # The Encoder Model
    # ~~~~~~~~~~~~~~~~~~~
    # With extra KL layer or with vae.add_loss()
    if self.solution_type == 0 or self.solution_type == 2:
        self.encoder = Model(self._encoder_input, self._encoder_output, name="encoder")

    # Transfer solution => Multiple outputs
    if self.solution_type == 1 or self.solution_type == 3:
        self.encoder = Model(inputs=self._encoder_input,
                             outputs=[self._encoder_output, self.mu, self.log_var],
                             name="encoder")
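The z_point_sampling() function above implements the reparameterization trick. A small numpy sketch (made-up batch values; numpy's standard-normal draw corresponds to B.random_normal) shows what it does numerically:

```python
import numpy as np

rng = np.random.default_rng(42)

def z_point_sampling(mu, log_var):
    # z = mu + sigma * epsilon   with   sigma = exp(log_var / 2),  epsilon ~ N(0, 1)
    epsilon = rng.standard_normal(size=mu.shape)
    return mu + np.exp(log_var / 2.0) * epsilon

# A large hypothetical "batch": mu = 1.0 and log_var = 0 (i.e. sigma = 1) everywhere
mu      = np.full((100000, 2), 1.0)
log_var = np.zeros((100000, 2))
z = z_point_sampling(mu, log_var)

# The z-points scatter around mu with a standard deviation of exp(log_var/2) = 1
print(round(z.mean(), 2), round(z.std(), 2))
```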

The difference to the previous version is that the multiple-output Encoder model is now also built for „solution_type == 3“. For the Decoder we have:

**Modified code of _build_dec() of class MyVariationalAutoencoder**

def _build_dec(self):
    '''  Your documentation  '''
    # Input layer - aligned to the shape of z-points in the latent space = output[0] of the Encoder
    self._decoder_inp_z = Input(shape=(self.z_dim,), name='decoder_input')

    # Additional Input layers for the KL tensors (mu, log_var) from the Encoder
    if self.solution_type == 1 or self.solution_type == 3:
        self._dec_inp_mu = Input(shape=(self.z_dim), name='mu_input')
        self._dec_inp_var_log = Input(shape=(self.z_dim), name='logvar_input')

        # We give the layers later used as output a name
        # Each of the Activation layers below just corresponds to an identity passed through
        #self._dec_mu = self._dec_inp_mu
        #self._dec_var_log = self._dec_inp_var_log
        self._dec_mu = Activation('linear', name='dc_mu')(self._dec_inp_mu)
        self._dec_var_log = Activation('linear', name='dc_var')(self._dec_inp_var_log)

    # Here we use the tensor shape info from the Encoder
    x = Dense(np.prod(self._shape_before_flattening))(self._decoder_inp_z)
    x = Reshape(self._shape_before_flattening)(x)

    # The inverse CNN
    for i in range(self.n_layers_decoder):
        conv_t_layer = Conv2DTranspose(
            filters = self.decoder_conv_t_filters[i]
            , kernel_size = self.decoder_conv_t_kernel_size[i]
            , strides = self.decoder_conv_t_strides[i]
            , padding = 'same'   # Important! Controls the shape of the tensors during reconstruction.
                                 # We want an image with the same resolution as the original input.
            , name = 'decoder_conv_t_' + str(i)
            )
        x = conv_t_layer(x)

        # Normalization and activation
        if i < self.n_layers_decoder - 1:
            # Also in the Decoder: normalization before activation
            if self.use_batch_norm:
                x = BatchNormalization()(x)

            # Choice of the activation function
            if self.act == 0:
                x = LeakyReLU()(x)
            elif self.act == 1:
                x = ReLU()(x)
            elif self.act == 2:
                #x = self.selu_scale * ELU(alpha=self.selu_alpha)(x)
                x = Activation('selu')(x)

            # Adaptions to SELU requirements
            if self.use_dropout:
                if self.act == 2:
                    x = AlphaDropout(rate = 0.25)(x)
                else:
                    x = Dropout(rate = 0.25)(x)

        # Last layer => Sigmoid output
        # => This requires scaled input => division of the pixel values by 255
        else:
            x = Activation('sigmoid', name='dc_reco')(x)

    # Output tensor => a scaled image
    self._decoder_output = x

    # The Decoder model
    # solution_type == 0/2/3: Just the decoded input
    if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3:
        self.decoder = Model(self._decoder_inp_z, self._decoder_output, name="decoder")

    # solution_type == 1: The decoded tensor plus the transferred tensors mu and log_var
    # for the variational distribution
    if self.solution_type == 1:
        self.decoder = Model([self._decoder_inp_z, self._dec_inp_mu, self._dec_inp_var_log],
                             [self._decoder_output, self._dec_mu, self._dec_var_log],
                             name="decoder")

Our VAE model is now set up with the help of the __init__() method of our new class VAE. We just have to supply the object created by MyVariationalAutoencoder.

**Modified code of _build_VAE() of class MyVariationalAutoencoder**

def _build_VAE(self):
    '''  Your documentation  '''

    # Solution with train_step() and GradientTape(): Control is transferred to class VAE
    if self.solution_type == 3:
        # Here the parameter "self" provides a reference to an instance of MyVariationalAutoencoder
        self.model = VAE(self)
        self.model.summary()

    # Solutions with layer.add_loss() or model.add_loss()
    if self.solution_type == 0 or self.solution_type == 2:
        model_input = self._encoder_input
        model_output = self.decoder(self._encoder_output)
        self.model = Model(model_input, model_output, name="vae")

    # Solution with transfer of data from the Encoder to the Decoder output layer
    if self.solution_type == 1:
        enc_out = self.encoder(self._encoder_input)
        dc_reco, dc_mu, dc_var = self.decoder(enc_out)
        # We organize the output and the later association of cost functions and metrics via a dictionary
        mod_outputs = {'vae_out_main': dc_reco, 'vae_out_mu': dc_mu, 'vae_out_var': dc_var}
        self.model = Model(inputs=self._encoder_input, outputs=mod_outputs, name="vae")

Note that we keep the resulting model within the object of class MyVariationalAutoencoder. See the Jupyter cells in my next post.

The modification of the function compile_myVAE() is simple:

def compile_myVAE(self, learning_rate):

    # Optimizer
    # ~~~~~~~~~
    optimizer = Adam(learning_rate=learning_rate)
    # Save the learning rate for possible intermediate output to files
    self.learning_rate = learning_rate

    # Parameter "fact" will be used by the cost functions defined below
    # to scale the KL loss relative to the BCE loss
    fact = self.fact

    # Functions for solution_type == 1
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    @tf.function
    def mu_loss(y_true, y_pred):
        loss_mux = fact * tf.reduce_mean(tf.square(y_pred))
        return loss_mux

    @tf.function
    def logvar_loss(y_true, y_pred):
        loss_varx = -fact * tf.reduce_mean(1 + y_pred - tf.exp(y_pred))
        return loss_varx

    # Function for solution_type == 2
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # We follow an approach described at
    # https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer
    # NOTE: We can NOT use @tf.function here
    def get_kl_loss(mu, log_var):
        kl_loss = -fact * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))
        return kl_loss

    # Required operations for solution_type == 2 => model.add_loss()
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    res_kl = get_kl_loss(mu=self.mu, log_var=self.log_var)
    if self.solution_type == 2:
        self.model.add_loss(res_kl)
        self.model.add_metric(res_kl, name='kl', aggregation='mean')

    # Model compilation
    # ~~~~~~~~~~~~~~~~~~~~

    # Solutions with layer.add_loss() or model.add_loss()
    if self.solution_type == 0 or self.solution_type == 2:
        if self.loss_type == 0:
            self.model.compile(optimizer=optimizer, loss="binary_crossentropy",
                               metrics=[tf.keras.metrics.BinaryCrossentropy(name='bce')])
        if self.loss_type == 1:
            self.model.compile(optimizer=optimizer, loss="mse",
                               metrics=[tf.keras.metrics.MSE(name='mse')])

    # Solution with transfer of data from the Encoder to the Decoder output layer
    if self.solution_type == 1:
        if self.loss_type == 0:
            self.model.compile(optimizer=optimizer
                , loss={'vae_out_main':'binary_crossentropy', 'vae_out_mu':mu_loss, 'vae_out_var':logvar_loss}
                #, metrics={'vae_out_main':tf.keras.metrics.BinaryCrossentropy(name='bce'), 'vae_out_mu':mu_loss, 'vae_out_var': logvar_loss }
                )
        if self.loss_type == 1:
            self.model.compile(optimizer=optimizer
                , loss={'vae_out_main':'mse', 'vae_out_mu':mu_loss, 'vae_out_var':logvar_loss}
                #, metrics={'vae_out_main':tf.keras.metrics.MSE(name='mse'), 'vae_out_mu':mu_loss, 'vae_out_var': logvar_loss }
                )

    # Solution with train_step() and GradientTape(): Control is transferred to class VAE
    if self.solution_type == 3:
        self.model.compile(optimizer=optimizer)

Note the adaptions to the new parameter „loss_type“ which we have added to the __init__()-function!

It gets a bit more complicated for the function „train_myVAE()“. The reason is that we use the opportunity to include the output of so-called generators, which create small batches on the fly from disc or memory.

Such a generator is very useful if you have to handle datasets which do not fit into the VRAM of your video card. A typical case is the Celeb A dataset on an older graphics card like mine.

In such a case we provide a dataflow to the function. The batches in this dataflow are continuously created as needed and handed over to Tensorflow’s data processing on the graphics card. *So, „x_train“ as an input variable must not be taken literally in this case!* It is then replaced by the generator’s dataflow. See the code for the Jupyter cells in the next post.

In addition we prepare for cases where we have to provide target data which deviate from the input data „x_train“. Typical cases are the application of AEs/VAEs for denoising or recolorization.

# Function to initiate training
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def train_myVAE(self, x_train, x_target=None
            , b_use_generator   = False
            , b_target_ne_train = False
            , batch_size = 32
            , epochs = 2
            , initial_epoch = 0
            , t_mu=None, t_logvar=None
            ):
    '''
    @note: Sometimes x_target MUST be provided - e.g. for denoising, recolorization
    @note: x_train will come as a dataflow in case of a generator
    '''
    # cax = ProgbarLogger(count_mode='samples', stateful_metrics=None)

    class MyPrinterCallback(tf.keras.callbacks.Callback):

        # def on_train_batch_begin(self, batch, logs=None):
        #     # Do something at the begin of a training batch

        def on_epoch_end(self, epoch, logs=None):
            # Get an overview over the available keys
            #keys = list(logs.keys())
            print("\nEPOCH: {}, Total Loss: {:8.6f}, // reco loss: {:8.6f}, mu loss: {:8.6f}, logvar loss: {:8.6f}".format(
                epoch, logs['loss'], logs['decoder_loss'], logs['decoder_1_loss'], logs['decoder_2_loss']))
            print()
            #print('EPOCH: {}, Total Loss: {}'.format(epoch, logs['loss']))
            #print('EPOCH: {}, metrics: {}'.format(epoch, logs['metrics']))

        def on_epoch_begin(self, epoch, logs=None):
            print('-'*50)
            print('STARTING EPOCH: {}'.format(epoch))

    if not b_target_ne_train:
        x_target = x_train

    # Data are provided from tensors in the video RAM
    if not b_use_generator:

        # Solutions with layer.add_loss() or model.add_loss()
        # Solution with train_step() and GradientTape(): Control is transferred to class VAE
        if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3:
            self.model.fit(
                x_train
                , x_target
                , batch_size = batch_size
                , shuffle = True
                , epochs = epochs
                , initial_epoch = initial_epoch
                )

        # Solution with transfer of data from the Encoder to the Decoder output layer
        if self.solution_type == 1:
            self.model.fit(
                x_train
                , {'vae_out_main': x_target, 'vae_out_mu': t_mu, 'vae_out_var': t_logvar}
                # also working:
                # , [x_train, t_mu, t_logvar]   # we provide some dummy tensors here
                , batch_size = batch_size
                , shuffle = True
                , epochs = epochs
                , initial_epoch = initial_epoch
                #, verbose=1
                , callbacks=[MyPrinterCallback()]
                )

    # If data are provided as a batched dataflow from a generator - e.g. for Celeb A
    else:
        # Solution with transfer of data from the Encoder to the Decoder output layer
        if self.solution_type == 1:
            print("We have no solution yet for solution_type == 1 and generators!")
            sys.exit()

        # Solutions with layer.add_loss() or model.add_loss()
        # Solution with train_step() and GradientTape(): Control is transferred to class VAE
        if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3:
            self.model.fit(
                x_train   # coming as a batched dataflow from the outside generator - no batch size required here
                , shuffle = True
                , epochs = epochs
                , initial_epoch = initial_epoch
                )

As I have not yet tested a solution for solution_type == 1 together with generators, I leave writing a working code to the reader. Sorry, I did not find the time for experiments. Presently, I use generators only in combination with the add_loss()-based solutions and the solution based on train_step() and GradientTape().

Note also that if we use generators, they must take care of a flow of target data, too. As said: you must not take „x_train“ literally in the case of generators. It is then a continuously created „dataflow“ of batches, *both for the training’s input and target data*.
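The essence of such a generator can be sketched in plain Python/numpy (a simplified stand-in; Keras' ImageDataGenerator additionally reads and augments images from disk, but the contract towards model.fit() is the same kind of endless batch flow, including the target data):

```python
import numpy as np

def batch_generator(x_data, y_data, batch_size):
    """Endlessly yields small (input, target) batches from large arrays."""
    n = len(x_data)
    while True:                                # Keras generators loop indefinitely
        perm = np.random.permutation(n)        # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            yield x_data[idx], y_data[idx]     # one VRAM-sized batch at a time

# Hypothetical data: target == input, as for plain VAE training
x = np.arange(20, dtype=np.float32).reshape(10, 2)
flow = batch_generator(x, x, batch_size=4)

x_batch, y_batch = next(flow)
print(x_batch.shape)   # (4, 2)
```

For real image data only the source of the batches changes (files on disk instead of an in-memory array); the flow handed to model.fit() looks the same.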

In this post I have outlined how we can use the method **train_step()** and the tape context of Tensorflow’s **GradientTape()** to control loss contributions and their gradients. Though this was done for the specific case of the KL loss of a VAE, the general approach should have become clear.

I have added a new class to create a Keras model from a pre-constructed Encoder and Decoder. For convenience reasons we still create the layer structure with our old class „MyVariationalAutoencoder()“. But we then switch control to a new instance of a class which is a child of Keras‘ Model class. This class uses customized versions of train_step() and GradientTape().

I have added some more flexibility in addition: We can now include a dataflow generator for input data (e.g. images) which do not fit into the VRAM (video RAM) of our graphics card, but do fit into the PC’s standard RAM. We can also switch to MSE instead of BCE for the reconstruction loss.

The task of the KL loss is to compress the data distribution in the latent space and to normalize the distribution around certain feature centers there. In the next post we apply this to images of faces. We shall use the „**Celeb A**“ dataset for this purpose. We are going to see that the scaling factor of the KL loss in this case has to be chosen rather big in comparison to simple cases like MNIST. We will also see that choosing a high dimension of the latent space does not really help to create a reasonable face from statistically chosen points in the latent space.

**And before I forget it:**

*Ceterum Censeo:* The worst fascist and war criminal living today is the Putler. He must be isolated at all levels, be denazified and imprisoned. Long live a free and democratic Ukraine!



Our objective is to find solutions which avoid potential problems with the **eager execution mode** of present Tensorflow 2 implementations. Popular recipes of some teaching books on ML may lead to non-working code in present TF2 environments. We have already looked at two working alternatives.

In the last post we transferred the „mu“ and „log_var“ tensors from the Encoder to the Decoder and fed some Keras standard loss functions with these tensors. These functions could in turn be inserted into the model.compile() statement. The approach was a bit complex because it involved multi-input-output model definitions for the Encoder and Decoder.

The present article will discuss a third and lighter approach – namely using the Keras **add_loss()** mechanism on the level of a Keras model, i.e. **model.add_loss()**.

The advantage of this function is that its parameter interface is not reduced to the form of the standardized Keras cost function interfaces which I used in my last post. This gives us flexibility. A solution based on model.add_loss() is also easy to understand and realize on the programming level. It is, however, an approach which may under certain conditions reduce performance by a factor of roughly 1.3 to 1.5 – which is significant. I admit that I have not yet understood what the reasons are. But the concrete solution version I present below works well.

The way to use Keras‘ add_loss() functionality is described in the Keras documentation. I quote from the part of TF2’s documentation about the use of add_loss():

This method can also be called directly on a Functional Model during construction. In this case, any loss Tensors passed to this Model must be symbolic and be able to be traced back to the model’s Inputs. These losses become part of the model’s topology and are tracked in get_config.

The documentation also contains a simple example. The strategy is to first define a full VAE model with standard mu and log_var layers in the Encoder part – and afterwards add the KL-loss to this model. This is depicted in the following graphics:

We implement this strategy below via the Python class for a VAE setup which we have used already in the last 4 posts of this series. We control the Keras model setup and the layer construction by the parameter „solution_type“, which I have introduced in my last post.

The class method _build_enc(self, …) can remain as it was defined in the last post. We just have to change the condition for the layer setup as follows:

**Change to _build_enc(self, …)**

...  # see other posts
...

# The Encoder Model
# ~~~~~~~~~~~~~~~~~~~
# With extra KL layer or with vae.add_loss()
if solution_type == 0 or solution_type == 2:
    self.encoder = Model(self._encoder_input, self._encoder_output)

# Transfer solution => Multiple outputs
if solution_type == 1:
    self.encoder = Model(inputs=self._encoder_input,
                         outputs=[self._encoder_output, self.mu, self.log_var],
                         name="encoder")

Something similar holds for the Decoder part _build_decoder(…):

**Change to _build_dec(self, …)**

...  # see other posts
...

# The Decoder model
# solution_type == 0/2: Just the decoded input
if self.solution_type == 0 or self.solution_type == 2:
    self.decoder = Model(self._decoder_inp_z, self._decoder_output)

# solution_type == 1: The decoded tensor plus the transferred tensors mu and log_var
# for the variational distribution
if self.solution_type == 1:
    self.decoder = Model([self._decoder_inp_z, self._dec_inp_mu, self._dec_inp_var_log],
                         [self._decoder_output, self._dec_mu, self._dec_var_log],
                         name="decoder")

A similar change is done regarding the model definition in the method _build_VAE(self):

**Change to _build_VAE(self)**

solution_type = self.solution_type

if solution_type == 0 or solution_type == 2:
    model_input = self._encoder_input
    model_output = self.decoder(self._encoder_output)
    self.model = Model(model_input, model_output, name="vae")

...  # see other posts
...

More interesting is a function which we add inside the method compile_myVAE(self, learning_rate, …).

**Changes to compile_myVAE(self, learning_rate):**

# Function to compile the full VAE
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def compile_myVAE(self, learning_rate):

    # Optimizer
    # ~~~~~~~~~
    optimizer = Adam(learning_rate=learning_rate)
    # Save the learning rate for possible intermediate output to files
    self.learning_rate = learning_rate

    # Parameter "fact" will be used by the cost functions defined below
    # to scale the KL loss relative to the BCE loss
    fact = self.fact

    # Functions for solution_type == 1
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    @tf.function
    def mu_loss(y_true, y_pred):
        loss_mux = fact * tf.reduce_mean(tf.square(y_pred))
        return loss_mux

    @tf.function
    def logvar_loss(y_true, y_pred):
        loss_varx = -fact * tf.reduce_mean(1 + y_pred - tf.exp(y_pred))
        return loss_varx

    # Function for solution_type == 2
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # We follow an approach described at
    # https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer
    # NOTE: We can NOT use @tf.function here
    def get_kl_loss(mu, log_var):
        kl_loss = -fact * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))
        return kl_loss

    # Required operations for solution_type == 2 => model.add_loss()
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    res_kl = get_kl_loss(mu=self.mu, log_var=self.log_var)
    if self.solution_type == 2:
        self.model.add_loss(res_kl)
        self.model.add_metric(res_kl, name='kl', aggregation='mean')

    # Model compilation
    # ~~~~~~~~~~~~~~~~~~~~
    if self.solution_type == 0 or self.solution_type == 2:
        self.model.compile(optimizer=optimizer, loss="binary_crossentropy",
                           metrics=[tf.keras.metrics.BinaryCrossentropy(name='bce')])

    if self.solution_type == 1:
        self.model.compile(optimizer=optimizer
            , loss={'vae_out_main':'binary_crossentropy', 'vae_out_mu':mu_loss, 'vae_out_var':logvar_loss}
            #, metrics={'vae_out_main':tf.keras.metrics.BinaryCrossentropy(name='bce'), 'vae_out_mu':mu_loss, 'vae_out_var': logvar_loss }
            )

I have supplemented the function **get_kl_loss(mu, log_var)**. We explicitly provide the tensors „self.mu“ and „self.log_var“ via the function’s interface and thus follow one of our basic rules for the Keras add_loss() functionality (see post IV).

Note that this is a **MUST** to get a working solution for eager execution mode!

Interestingly, the flexibility of model.add_loss() has a price, too. We can NOT use a **@tf.function** indicator here – in contrast to the standard cost functions which we used in the last post.

Note also that I have added some metrics to get detailed information about the size of the crossentropy-loss and the KL-loss during training!
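The behavior of get_kl_loss() can be checked numerically with numpy (same formula as above, just with made-up tensor values instead of Keras tensors):

```python
import numpy as np

fact = 6.5e-4   # example scaling factor, as used in the MNIST cells below

def kl_loss(mu, log_var):
    # same formula as in get_kl_loss() above, with numpy instead of tf
    return -fact * np.mean(1.0 + log_var - np.square(mu) - np.exp(log_var))

# For the target distribution N(0,1), i.e. mu = 0 and log_var = 0, the loss vanishes ...
mu, log_var = np.zeros((8, 12)), np.zeros((8, 12))
print(kl_loss(mu, log_var))               # 0.0 (or -0.0)

# ... and any deviation of mu or log_var from it makes the loss positive
print(kl_loss(mu + 0.5, log_var) > 0.0)   # True
print(kl_loss(mu, log_var + 0.5) > 0.0)   # True
```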

Finally we must include solution_type == 2 in the method train_myVAE(self, x_train, batch_size, …):

**Changes to train_myVAE(self, x_train, batch_size,…)**

...  # see other posts
...

if self.solution_type == 0 or self.solution_type == 2:
    self.model.fit(
        x_train
        , x_train
        , batch_size = batch_size
        , shuffle = True
        , epochs = epochs
        , initial_epoch = initial_epoch
        )

if self.solution_type == 1:
    self.model.fit(
        x_train
        # working alternative:
        # , [x_train, t_mu, t_logvar]   # we provide some dummy tensors here
        # by dict:
        , {'vae_out_main': x_train, 'vae_out_mu': t_mu, 'vae_out_var': t_logvar}
        , batch_size = batch_size
        , shuffle = True
        , epochs = epochs
        , initial_epoch = initial_epoch
        #, verbose=1
        , callbacks=[MyPrinterCallback()]
        )

We can use a slightly adapted version of the Jupyter notebook cells discussed in post V:

**Cell 6:**

```python
from my_AE_code.models.MyVAE_2 import MyVariationalAutoencoder

z_dim = 12
solution_type = 2
fact = 6.5e-4

vae = MyVariationalAutoencoder(
    input_dim = (28,28,1)
    , encoder_conv_filters = [32,64,128]
    , encoder_conv_kernel_size = [3,3,3]
    , encoder_conv_strides = [1,2,2]
    , decoder_conv_t_filters = [64,32,1]
    , decoder_conv_t_kernel_size = [3,3,3]
    , decoder_conv_t_strides = [2,2,1]
    , z_dim = z_dim
    , solution_type = solution_type
    , act = 0
    , fact = fact
)
```

**Cell 11:**

```python
BATCH_SIZE = 256
EPOCHS = 37
PRINT_EVERY_N_BATCHES = 100
INITIAL_EPOCH = 0

if solution_type == 2:
    vae.train_myVAE(
        x_train[0:60000]
        , batch_size = BATCH_SIZE
        , epochs = EPOCHS
        , initial_epoch = INITIAL_EPOCH
    )
```

Note that I have changed the BATCH_SIZE to 256 this time; the performance got a bit better than on my old Nvidia 960 GTX:

```
Epoch 3/37
235/235 [==============================] - 10s 44ms/step - loss: 0.1135 - bce: 0.1091 - kl: 0.0044
Epoch 4/37
235/235 [==============================] - 10s 44ms/step - loss: 0.1114 - bce: 0.1070 - kl: 0.0044
Epoch 5/37
235/235 [==============================] - 10s 44ms/step - loss: 0.1098 - bce: 0.1055 - kl: 0.0044
Epoch 6/37
235/235 [==============================] - 10s 43ms/step - loss: 0.1085 - bce: 0.1041 - kl: 0.0044
```

This is comparable to data we got for our previous solution approaches. But see an additional section on performance below.

As in the last posts I show some results for the MNIST data without many comments. The first plot proves the reconstruction abilities of the VAE for a dimension z-dim=12 of the latent space.

**MNIST with z-dim=12 and fact=6.5e-4**

For z_dim=2 we get a reasonable data point distribution in the latent space due to the KL loss, but the reconstruction ability suffers, of course:

**MNIST with z-dim=2 and fact=6.5e-4 – train data distribution in the z-space**

For a dimension of z_dim=2 of the latent space and MNIST data we get the following reconstruction chart for data points in a region around the latent space’s origin:

I also tested a version of the model.add_loss() approach without encapsulating everything in a class, i.e. with the same definitions of the Encoder, the Decoder, the model, etc., but with all variables such as mu and log_var kept directly as data of the Jupyter notebook. Then a call

```python
n_epochs = 3
batch_size = 128
initial_epoch = 0

vae.fit(
    x_train[0:60000], x_train[0:60000],   # crucial!
    batch_size=batch_size,
    shuffle=True,
    epochs = n_epochs,
    initial_epoch = initial_epoch
)
```

reduced the performance by a factor of about 1.5. I experimented for quite a while, but at the moment I have no clue why this happens or how the effect can be avoided. I assume some strange data handling or data transfer between the Jupyter notebook and the graphics card. I can provide details if some developer is interested.

But as one should encapsulate functionality in classes anyway, I have not put effort into a detailed analysis.

In this article we have studied an approach to handle the Kullback-Leibler loss via the model.add_loss() functionality of Keras. We supplemented our growing VAE class with the respective methods. All in all, the approach is almost more convenient than the solution based on a special layer and layer.add_loss(); see post V.

However, there seems to exist some strange performance problem when you avoid a reasonable encapsulation in a class and do the model setup directly in Jupyter cells and for Jupyter variables.

In the next post

Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics

I shall have a look at the solution approach recommended by F. Chollet.

We must provide tensors explicitly to model.add_loss()

https://towardsdatascience.com/shared-models-and-custom-losses-in-tensorflow-2-keras-6776ecb3b3a9

Ceterum censeo: The worst fascist, war criminal and killer living today, who must be isolated, be denazified and imprisoned, is the Putler. Long live a free and democratic Ukraine!


In the last post we delegated the KL loss calculation to a special customized layer of the Encoder. The layer directly followed two Dense layers which produced the tensors for

- the mean values **mu**
- and the logarithms of the variances **log_var**

of statistical standard distributions for z-points in the latent space. (Keep in mind that we have one mu and log_var for each sample. The KL loss function has a compactification impact on the z-point distribution as a whole and a normalization effect regarding the distribution around each z-point.)
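For reference, the standard textbook form of the per-sample KL divergence behind this loss – a diagonal Gaussian measured against the standard normal, with log_var = log σ² – can be written as:

```latex
D_{KL}\left(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(0,1)\right)
  \;=\; -\frac{1}{2} \sum_{j=1}^{z\_dim} \left( 1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```

In the code of this series the factor 1/2 and the sum over latent components are replaced by a configurable scaling factor „fact“ and a mean over tensor elements.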

The layer centered approach for the KL loss proved to be both elegant and fast in combination with Tensorflow 2. And it fits very well to the way we build ANNs with Keras.

In the present post I focus on a different and more complicated strategy: We shall couple the Encoder and the Decoder by multi-output and multi-input interfaces to transfer mu and log_var tensors to the output side of our VAE model. And then we will calculate the KL loss by using a Keras standard mechanism for costs related to multiple output channels:

We can define a standardized customizable *cost function* per output channel (= per individual output tensor). Such a Keras cost function accepts two standard input variables: a predicted output tensor for the output channel and a related tensor with true values.

Such costs will *automatically* be added up to a total loss and will be subject to *automatic* error back propagation under eager execution conditions. However, using this mechanism requires transporting the KL-related tensors to the Decoder’s output side and splitting the KL loss into components.

The approach is a bit of an overkill for handling the KL loss. But it also sheds light on

- multi-input and multi-output models
- multi-loss models
- and the transfer of tensors between two co-operating neural nets.

Therefore the approach is interesting beyond VAE purposes.

Below I will first explain some more details of the present strategy. Afterward we need to find out how to handle standard customized Keras cost-functions for the KL loss contributions and the main loss. Furthermore we have to deal with reasonable output for the different loss terms during the training epochs. A performance comparison will show that the solution – though complicated – is a fast one.

First a general reminder: During the training of a Keras model we have to guarantee a correct calculation of partial derivatives of losses with respect to trainable parameters (weights) according to the chain rule. The losses and related tensors themselves depend on matrix operations involving the layers‘ weights and activation functions. So the chain rule has to be applied along all paths through the network. With *eager execution* all required operations and tensors must already be totally clear during a forward pass to the layers. We saw this already with the solution approach which we discussed in the previous post.

This means that relevant tensors must explicitly be available whenever derivatives shall be handled or pre-defined. This in turn means: When we want to calculate cost contributions *after* the definition of the full VAE model, then we must transfer all required tensors down the line. With the functional Keras API we could use them via a direct Python reference to a layer. The alternative is to use them as explicit *output* of our VAE-model.

The strategy of this post is basically guided by a general Keras rule:

A personally customized cost function which can effortlessly be used in the compile()-statement for a Keras model in an eager execution environment should have a standard interface given by

**cost_function( y_true, y_pred )**

With exactly these two tensors as parameters – and nothing else!

See https://keras.io/api/losses/#creating-custom-losses. Such a function can be used for each of the multiple outputs of a Keras model.

One reason for this strict rule is that with eager execution the dependence of any function on input variables (tensors) must explicitly be defined via the function’s interface. For a standardized interface of a customizable model’s cost function the necessary steps can be generalized. The advantage of invoking cost functions with standardized interfaces for multiple output channels is, of course, the ease of use.
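As a minimal sketch of this interface rule (the function name and factor are illustrative, not from the series’ class code; NumPy stands in for TF tensors here): any callable with exactly the signature (y_true, y_pred) qualifies, even if it ignores y_true completely – which is exactly what our KL terms will do later.

```python
import numpy as np

# Sketch of a custom loss with the standard Keras signature (y_true, y_pred).
# The name "mu_like_loss" and the factor 6.5e-4 are illustrative assumptions.
def mu_like_loss(y_true, y_pred, fact=6.5e-4):
    # y_true is accepted only to satisfy the interface - it is not used
    return fact * np.mean(np.square(y_pred))

y_pred = np.array([[1.0, -1.0], [2.0, 0.0]])
y_true = np.zeros_like(y_pred)        # dummy "true" values
loss = mu_like_loss(y_true, y_pred)   # 6.5e-4 * mean([1, 1, 4, 0]) = 9.75e-4
```

In a real model such a function would simply be passed as the loss for one output channel in the compile() statement.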

In the case of an Autoencoder the dominant *predicted* output is the (reconstructed) output tensor calculated from a z-point by the Decoder. By a comparison of this output tensor (e.g. a reconstructed image) with the original input tensor of the Encoder (e.g. an original image) a value for the binary crossentropy loss can be calculated. We extend this idea about output tensors of the VAE model now to the KL related tensors:

When you look at the KL loss definition in the previous posts with respect to mu and log_var tensors of the Encoder

kl_loss = -0.5e-4 * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))

you see that we can split it in log_var- and mu-dependent terms. If we could transfer the mu and log_var tensors from the Encoder part to the Decoder part of a VAE we could use these tensors as explicit output of the VAE-model and thus as input for the simple standardized Keras loss functions. Without having to take any further care of eager execution requirements …
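A small numerical sketch (NumPy instead of TF, with illustrative values for fact, mu and log_var) confirms that the full KL expression decomposes exactly into a mu-dependent and a log_var-dependent term:

```python
import numpy as np

fact = 6.5e-4   # illustrative scaling factor
mu      = np.array([[0.5, -0.3], [0.1, 0.2]])
log_var = np.array([[0.0, -0.1], [0.2, 0.05]])

# Full KL expression as used in the posts
kl_full = -fact * np.mean(1 + log_var - np.square(mu) - np.exp(log_var))

# Split into the two terms later handled by mu_loss() and logvar_loss()
kl_mu     =  fact * np.mean(np.square(mu))
kl_logvar = -fact * np.mean(1 + log_var - np.exp(log_var))

# The two partial losses add up to the full KL expression
assert np.isclose(kl_full, kl_mu + kl_logvar)
```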

So: Why not use

- a multiple-output model for the Encoder, which then provides z-points *plus* the mu and log_var tensors,
- a multiple-input, multiple-output model for the Decoder, which then accepts the multiple output tensors of the Encoder as input and provides a reconstruction tensor *plus* the mu and log_var tensors as multiple outputs,
- and simple customizable Keras cost functions in the compile() statement for the VAE-model, with each function handling one of the VAE’s (= Decoder’s) multiple outputs?

In the last post I have already described a class which handles all model setup operations. We keep the general structure of the class – but we now allow for options in various methods to realize a different solution based on our present strategy. We use the input variable „solution_type“ of the __init__() function to control the differences. The __init__() function itself can remain as it was defined in the last post.

We change the method to build the encoder of the class „MyVariationalAutoencoder“ in the following way:

```python
# Method to build the Encoder
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_enc(self, solution_type = -1, fact=-1.0):
    # Check whether "fact" and "solution_type" for the KL loss shall be overwritten
    if fact < 0:
        fact = self.fact
    if solution_type < 0:
        solution_type = self.solution_type
    else:
        self.solution_type = solution_type

    # Preparation: We later need a function to calculate the z-points in the latent space
    # The following function will be used by an eventual Lambda layer of the Encoder
    def z_point_sampling(args):
        '''
        A point in the latent space is calculated statistically
        around an optimized mu for each sample
        '''
        mu, log_var = args  # Note: These are 1D tensors !
        epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
        return mu + B.exp(log_var / 2) * epsilon

    # Input "layer"
    self._encoder_input = Input(shape=self.input_dim, name='encoder_input')

    # Initialization of a running variable x for individual layers
    x = self._encoder_input

    # Build the CNN-part with Conv2D layers
    # Note that stride >= 2 reduces spatial resolution without the help of pooling layers
    for i in range(self.n_layers_encoder):
        conv_layer = Conv2D(
            filters = self.encoder_conv_filters[i]
            , kernel_size = self.encoder_conv_kernel_size[i]
            , strides = self.encoder_conv_strides[i]
            , padding = 'same'  # Important ! Controls the shape of the layer tensors.
            , name = 'encoder_conv_' + str(i)
        )
        x = conv_layer(x)

        # The "normalization" should be done ahead of the "activation"
        if self.use_batch_norm:
            x = BatchNormalization()(x)

        # Selection of activation function (out of 3)
        if self.act == 0:
            x = LeakyReLU()(x)
        elif self.act == 1:
            x = ReLU()(x)
        elif self.act == 2:
            # RMO: Just use the Activation layer to use SELU with predefined (!) parameters
            x = Activation('selu')(x)

        # Fulfill some SELU requirements
        if self.use_dropout:
            if self.act == 2:
                x = AlphaDropout(rate = 0.25)(x)
            else:
                x = Dropout(rate = 0.25)(x)

    # Last multi-dim tensor shape - is later needed by the decoder
    self._shape_before_flattening = B.int_shape(x)[1:]

    # Flattened layer before calculating VAE-output (z-points) via 2 special layers
    x = Flatten()(x)

    # "Variational" part - create 2 Dense layers for a statistical distribution of z-points
    self.mu      = Dense(self.z_dim, name='mu')(x)
    self.log_var = Dense(self.z_dim, name='log_var')(x)

    if solution_type == 0:
        # Customized layer for the calculation of the KL loss based on mu, log_var data
        # We use a customized layer according to a class definition
        self.mu, self.log_var = My_KL_Layer()([self.mu, self.log_var], fact=fact)

    # Layer to provide a z_point in the latent space for each sample of the batch
    self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])

    # The Encoder Model
    # ~~~~~~~~~~~~~~~~~
    # With KL-layer
    if solution_type == 0:
        self.encoder = Model(self._encoder_input, self._encoder_output)

    # With transfer solution => Multiple outputs
    if solution_type == 1:
        self.encoder = Model(inputs=self._encoder_input,
                             outputs=[self._encoder_output, self.mu, self.log_var],
                             name="encoder")
        # Other option
        #self.enc_inputs = {'mod_ip': self._encoder_input}
        #self.encoder = Model(inputs=self.enc_inputs,
        #                     outputs=[self._encoder_output, self.mu, self.log_var],
        #                     name="encoder")
```

For our present approach those parts are relevant which depend on the condition „solution_type == 1“.

**Hint:** Note that we could have used a dictionary to describe the input to the Encoder. In more complex models this may be reasonable to achieve formal consistency with the multiple outputs of the VAE-model, which will often be described by a dictionary. In addition, the losses and metrics of the VAE-model will also be handled by dictionaries. By the way: the outputs as well as the respective cost and metric assignments of a Keras model must all be controlled by the same type of Python container.

The Encoder’s multi-output is described by a Python *list* of 3 tensors: The encoded z-point vectors (length: z_dim!), the mu- and the log_var 1D-tensors (length: z_dim!). (Note that the full shape of all tensors also depends on the batch-size during training where these tensors are of rank 2.) We can safely use a list here as we do not couple this output directly with VAE loss functions or metrics controlled by dictionaries. We use dictionaries only in the output definitions of the VAE model itself.

Now we must realize the transfer of the mu and log_var tensors to the Decoder. We have to change the Decoder into a multi-input model:

```python
# Method to build the Decoder
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_dec(self):
    # 1st Input layer - aligned to the shape of z-points in the latent space = output[0] of the Encoder
    self._decoder_inp_z = Input(shape=(self.z_dim,), name='decoder_input')

    # Additional Input layers for the KL tensors (mu, log_var) from the Encoder
    if self.solution_type == 1:
        self._dec_inp_mu      = Input(shape=(self.z_dim,), name='mu_input')
        self._dec_inp_var_log = Input(shape=(self.z_dim,), name='logvar_input')

        # We give the layers later used as output a name
        # Each of the Activation layers below just corresponds to an identity passed through
        self._dec_mu      = Activation('linear', name='dc_mu')(self._dec_inp_mu)
        self._dec_var_log = Activation('linear', name='dc_var')(self._dec_inp_var_log)

    # Next we use the tensor shape info from the Encoder
    x = Dense(np.prod(self._shape_before_flattening))(self._decoder_inp_z)
    x = Reshape(self._shape_before_flattening)(x)

    # The inverse CNN
    for i in range(self.n_layers_decoder):
        conv_t_layer = Conv2DTranspose(
            filters = self.decoder_conv_t_filters[i]
            , kernel_size = self.decoder_conv_t_kernel_size[i]
            , strides = self.decoder_conv_t_strides[i]
            , padding = 'same'  # Important ! Controls the shape of tensors during reconstruction.
                                # We want an image with the same resolution as the original input.
            , name = 'decoder_conv_t_' + str(i)
        )
        x = conv_t_layer(x)

        # Normalization and activation
        if i < self.n_layers_decoder - 1:
            # Also in the decoder: normalization before activation
            if self.use_batch_norm:
                x = BatchNormalization()(x)

            # Choice of activation function
            if self.act == 0:
                x = LeakyReLU()(x)
            elif self.act == 1:
                x = ReLU()(x)
            elif self.act == 2:
                #x = self.selu_scale * ELU(alpha=self.selu_alpha)(x)
                x = Activation('selu')(x)

            # Adaptions to SELU requirements
            if self.use_dropout:
                if self.act == 2:
                    x = AlphaDropout(rate = 0.25)(x)
                else:
                    x = Dropout(rate = 0.25)(x)
        # Last layer => Sigmoid output
        # => This requires scaled input => division of pixel values by 255
        else:
            x = Activation('sigmoid', name='dc_reco')(x)

    # Output tensor => a scaled image
    self._decoder_output = x

    # The Decoder model
    # solution_type == 0: Just the decoded input
    if self.solution_type == 0:
        self.decoder = Model(self._decoder_inp_z, self._decoder_output)

    # solution_type == 1: The decoded tensor
    # plus the transferred tensors mu and log_var for the variational distributions
    if self.solution_type == 1:
        self.decoder = Model([self._decoder_inp_z, self._dec_inp_mu, self._dec_inp_var_log],
                             [self._decoder_output, self._dec_mu, self._dec_var_log],
                             name="decoder")
```

You see that the Decoder has evolved into a „multi-input, multi-output model“ for „solution_type==1“.

Next we define the full VAE model. We want to organize its multiple outputs and align them with distinct loss functions and maybe also some metrics information. I find it clearer to do this via dictionaries, which refer to layer names in a concise way.

```python
# Function to build the full VAE
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_VAE(self):
    solution_type = self.solution_type

    if solution_type == 0:
        model_input  = self._encoder_input
        model_output = self.decoder(self._encoder_output)
        self.model = Model(model_input, model_output, name="vae")

    if solution_type == 1:
        enc_out = self.encoder(self._encoder_input)
        dc_reco, dc_mu, dc_var = self.decoder(enc_out)

        # We organize the output and the later association of cost functions
        # and metrics via a dictionary
        mod_outputs = {'vae_out_main': dc_reco, 'vae_out_mu': dc_mu, 'vae_out_var': dc_var}
        self.model = Model(inputs=self._encoder_input, outputs=mod_outputs, name="vae")

        # Another option if we had defined a dictionary for the encoder input
        #self.model = Model(inputs=self.enc_inputs, outputs=mod_outputs, name="vae")
```

The next logical step is to define our cost contributions. I am going to do this with the help of two sub-functions of the method that compiles the VAE-model.

```python
# Function to compile the full VAE
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def compile_myVAE(self, learning_rate):
    # Optimizer
    optimizer = Adam(learning_rate=learning_rate)
    # Save the learning rate for possible intermediate output to files
    self.learning_rate = learning_rate

    # Parameter "fact" will be used by the cost functions defined below
    # to scale the KL loss relative to the BCE loss
    fact = self.fact

    # mu-dependent cost contributions to the KL loss
    @tf.function
    def mu_loss(y_true, y_pred):
        loss_mux = fact * tf.reduce_mean(tf.square(y_pred))
        return loss_mux

    # log_var-dependent cost contributions to the KL loss
    @tf.function
    def logvar_loss(y_true, y_pred):
        loss_varx = -fact * tf.reduce_mean(1 + y_pred - tf.exp(y_pred))
        return loss_varx

    # Model compilation
    # ~~~~~~~~~~~~~~~~~
    if self.solution_type == 0:
        self.model.compile(optimizer=optimizer, loss="binary_crossentropy",
                           metrics=[tf.keras.metrics.BinaryCrossentropy(name='bce')])

    if self.solution_type == 1:
        self.model.compile(optimizer=optimizer
            , loss={'vae_out_main': 'binary_crossentropy',
                    'vae_out_mu': mu_loss,
                    'vae_out_var': logvar_loss}
            #, metrics={'vae_out_main': tf.keras.metrics.BinaryCrossentropy(name='bce'),
            #           'vae_out_mu': mu_loss, 'vae_out_var': logvar_loss}
        )
```

The first interesting thing is that the statements inside the two cost functions ignore „y_true“ completely. Unfortunately, a small test shows that we must nevertheless provide some reasonable dummy tensors here – „None“ does NOT work in this case.

The dictionary organizes the different costs and their relation to the three output channels of our VAE-model. I have included the metrics as a comment for the moment. It would only produce double output and consume a bit of performance.

To enable training we use the following function:

```python
def train_myVAE(self, x_train, batch_size, epochs, initial_epoch = 0,
                t_mu=None, t_logvar=None):

    if self.solution_type == 0:
        self.model.fit(
            x_train
            , x_train
            , batch_size = batch_size
            , shuffle = True
            , epochs = epochs
            , initial_epoch = initial_epoch
        )

    if self.solution_type == 1:
        self.model.fit(
            x_train
            # , [x_train, t_mu, t_logvar]  # we provide some dummy tensors here
            , {'vae_out_main': x_train, 'vae_out_mu': t_mu, 'vae_out_var': t_logvar}
            , batch_size = batch_size
            , shuffle = True
            , epochs = epochs
            , initial_epoch = initial_epoch
        )
```

You may wonder what „t_mu“ and „t_logvar“ stand for. These are the dummy tensors which we have to provide to the cost functions. The fit() function gets „x_train“ as the model’s input. The „y_true“ target tensors for the three loss functions are handed over by

{ 'vae_out_main': x_train, 'vae_out_mu': t_mu, 'vae_out_var':t_logvar}

Again, I have organized the correct association to each output and loss contribution via a dictionary.

We can use the same Jupyter notebook with almost the same cells as in my last post V. An adaption is only required for the cells starting the training.

I build a „vae“ object (which can later be used for the MNIST dataset) by

**Cell 6**

```python
from my_AE_code.models.MyVAE_2 import MyVariationalAutoencoder

z_dim = 2
vae = MyVariationalAutoencoder(
    input_dim = (28,28,1)
    , encoder_conv_filters = [32,64,128]
    , encoder_conv_kernel_size = [3,3,3]
    , encoder_conv_strides = [1,2,2]
    , decoder_conv_t_filters = [64,32,1]
    , decoder_conv_t_kernel_size = [3,3,3]
    , decoder_conv_t_strides = [2,2,1]
    , z_dim = z_dim
    , solution_type = 1   # now we must provide the solution type - here the solution with KL tensor transfer
    , act = 0
    , fact = 1.e-3
)
```

Afterwards I use the Jupyter cells presented in my last post to build the Encoder, the Decoder and then the full VAE-model. For z_dim = 2 the summary outputs for the models now look like:

**Encoder**

```
Model: "encoder"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 encoder_input (InputLayer)     [(None, 28, 28, 1)]  0           []
 encoder_conv_0 (Conv2D)        (None, 28, 28, 32)   320         ['encoder_input[0][0]']
 leaky_re_lu_15 (LeakyReLU)     (None, 28, 28, 32)   0           ['encoder_conv_0[0][0]']
 encoder_conv_1 (Conv2D)        (None, 14, 14, 64)   18496       ['leaky_re_lu_15[0][0]']
 leaky_re_lu_16 (LeakyReLU)     (None, 14, 14, 64)   0           ['encoder_conv_1[0][0]']
 encoder_conv_2 (Conv2D)        (None, 7, 7, 128)    73856       ['leaky_re_lu_16[0][0]']
 leaky_re_lu_17 (LeakyReLU)     (None, 7, 7, 128)    0           ['encoder_conv_2[0][0]']
 flatten_3 (Flatten)            (None, 6272)         0           ['leaky_re_lu_17[0][0]']
 mu (Dense)                     (None, 2)            12546       ['flatten_3[0][0]']
 log_var (Dense)                (None, 2)            12546       ['flatten_3[0][0]']
 encoder_output (Lambda)        (None, 2)            0           ['mu[0][0]', 'log_var[0][0]']
==================================================================================================
Total params: 117,764
Trainable params: 117,764
Non-trainable params: 0
__________________________________________________________________________________________________
```

**Decoder**

```
Model: "decoder"
__________________________________________________________________________________________________
 Layer (type)                        Output Shape        Param #  Connected to
==================================================================================================
 decoder_input (InputLayer)          [(None, 2)]         0        []
 dense_4 (Dense)                     (None, 6272)        18816    ['decoder_input[0][0]']
 reshape_4 (Reshape)                 (None, 7, 7, 128)   0        ['dense_4[0][0]']
 decoder_conv_t_0 (Conv2DTranspose)  (None, 14, 14, 64)  73792    ['reshape_4[0][0]']
 leaky_re_lu_23 (LeakyReLU)          (None, 14, 14, 64)  0        ['decoder_conv_t_0[0][0]']
 decoder_conv_t_1 (Conv2DTranspose)  (None, 28, 28, 32)  18464    ['leaky_re_lu_23[0][0]']
 leaky_re_lu_24 (LeakyReLU)          (None, 28, 28, 32)  0        ['decoder_conv_t_1[0][0]']
 decoder_conv_t_2 (Conv2DTranspose)  (None, 28, 28, 1)   289      ['leaky_re_lu_24[0][0]']
 mu_input (InputLayer)               [(None, 2)]         0        []
 logvar_input (InputLayer)           [(None, 2)]         0        []
 dc_reco (Activation)                (None, 28, 28, 1)   0        ['decoder_conv_t_2[0][0]']
 dc_mu (Activation)                  (None, 2)           0        ['mu_input[0][0]']
 dc_var (Activation)                 (None, 2)           0        ['logvar_input[0][0]']
==================================================================================================
Total params: 111,361
Trainable params: 111,361
Non-trainable params: 0
__________________________________________________________________________________________________
```

**VAE-model**

```
Model: "vae"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 encoder_input (InputLayer)     [(None, 28, 28, 1)]  0           []
 encoder (Functional)           [(None, 2),          117764      ['encoder_input[0][0]']
                                 (None, 2),
                                 (None, 2)]
 model_3 (Functional)           [(None, 28, 28, 1),  111361      ['encoder[0][0]',
                                 (None, 2),                       'encoder[0][1]',
                                 (None, 2)]                       'encoder[0][2]']
==================================================================================================
Total params: 229,125
Trainable params: 229,125
Non-trainable params: 0
__________________________________________________________________________________________________
```

We can use our modified class in a Jupyter notebook in the same way as I have discussed in the last post. Of course you have to adapt the cells slightly; the parameter solution_type must be set to 1:

Training can be started with some dummy tensors for „y_true“ handed over to our two special cost functions for the KL loss as:

**Cell 11**

```python
BATCH_SIZE = 128
EPOCHS = 6
PRINT_EVERY_N_BATCHES = 100
INITIAL_EPOCH = 0

# Dummy tensors
t_mu     = tf.convert_to_tensor(np.zeros((60000, z_dim), dtype='float32'))
t_logvar = tf.convert_to_tensor(np.ones((60000, z_dim), dtype='float32'))

vae.train_myVAE(
    x_train[0:60000]
    , batch_size = BATCH_SIZE
    , epochs = EPOCHS
    , initial_epoch = INITIAL_EPOCH
    , t_mu = t_mu
    , t_logvar = t_logvar
)
```

Note that I have provided dummy tensors whose first dimension fits the length of x_train (60,000) and whose second dimension equals z_dim! This, of course, costs some memory …
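A quick back-of-the-envelope check (plain Python arithmetic, float32 assumed) shows that for MNIST with z_dim = 2 such a dummy tensor is actually tiny; the memory cost only becomes noticeable for larger datasets and latent dimensions, since it grows linearly with both:

```python
# float32 => 4 bytes per element
n_samples = 60000
z_dim = 2
bytes_per_dummy = n_samples * z_dim * 4
print(bytes_per_dummy / 1024**2)   # roughly 0.46 MiB per dummy tensor
```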

As output we get:

```
Epoch 1/6
469/469 [==============================] - 14s 23ms/step - loss: 0.2625 - decoder_loss: 0.2575 - decoder_1_loss: 0.0017 - decoder_2_loss: 0.0032
Epoch 2/6
469/469 [==============================] - 12s 25ms/step - loss: 0.2205 - decoder_loss: 0.2159 - decoder_1_loss: 0.0013 - decoder_2_loss: 0.0032
Epoch 3/6
469/469 [==============================] - 11s 22ms/step - loss: 0.2137 - decoder_loss: 0.2089 - decoder_1_loss: 0.0014 - decoder_2_loss: 0.0034
Epoch 4/6
469/469 [==============================] - 11s 23ms/step - loss: 0.2100 - decoder_loss: 0.2050 - decoder_1_loss: 0.0013 - decoder_2_loss: 0.0037
Epoch 5/6
469/469 [==============================] - 10s 22ms/step - loss: 0.2072 - decoder_loss: 0.2021 - decoder_1_loss: 0.0013 - decoder_2_loss: 0.0039
Epoch 6/6
469/469 [==============================] - 10s 22ms/step - loss: 0.2049 - decoder_loss: 0.1996 - decoder_1_loss: 0.0013 - decoder_2_loss: 0.0041
```

Heureka, our complicated setup works!

And note: It is fast! Just compare the later epoch times to the ones we got in the last post: 10 s compared to 11 s per epoch!

One thing which is not convincing is the fact that Keras labels all losses with standard (non-descriptive) names. To make things clearer you could

- either define some loss related metrics for which you define understandable names
- or invoke a customized Callback and maybe stop the standard output.

With the metrics you will get double output – the losses with standard names and once again with your own names. And it will cost a bit of performance.

The standard output of Keras can be suppressed with the parameter „verbose=0“ of the fit()-function. However, this will remove the progress bar, too.

I have not found any simple solution for this problem of customizing the output so far. If you do not need a progress bar, then just set „verbose=0“ and use your own Callback to control the output. Note that you should look at the available keys for logged output in a test run first. Below I give you the code for your own experiments:

```python
def train_myVAE(self, x_train, batch_size, epochs, initial_epoch = 0,
                t_mu=None, t_logvar=None):

    class MyPrinterCallback(tf.keras.callbacks.Callback):

        # def on_train_batch_begin(self, batch, logs=None):
        #     # Do something at the beginning of a training batch

        def on_epoch_end(self, epoch, logs=None):
            # Get an overview over the available keys
            #keys = list(logs.keys())
            #print("End epoch {} of training; got log keys: {}".format(epoch, keys))
            print("\nEPOCH: {}, Total Loss: {:8.6f}, // reco loss: {:8.6f}, mu loss: {:8.6f}, logvar loss: {:8.6f}".format(
                epoch, logs['loss'], logs['decoder_loss'],
                logs['decoder_1_loss'], logs['decoder_2_loss']))
            print()

        def on_epoch_begin(self, epoch, logs=None):
            print('-'*50)
            print('STARTING EPOCH: {}'.format(epoch))

    if self.solution_type == 0:
        self.model.fit(
            x_train
            , x_train
            , batch_size = batch_size
            , shuffle = True
            , epochs = epochs
            , initial_epoch = initial_epoch
        )

    if self.solution_type == 1:
        self.model.fit(
            x_train
            , {'vae_out_main': x_train, 'vae_out_mu': t_mu, 'vae_out_var': t_logvar}
            , batch_size = batch_size
            , shuffle = True
            , epochs = epochs
            , initial_epoch = initial_epoch
            #, verbose=0
            , callbacks=[MyPrinterCallback()]
        )
```

Output example:

```
EPOCH: 2, Total Loss: 0.203891, // reco loss: 0.198510, mu loss: 0.001242, logvar loss: 0.004139
469/469 [==============================] - 11s 23ms/step - loss: 0.2039 - decoder_loss: 0.1985 - decoder_1_loss: 0.0012 - decoder_2_loss: 0.0041
```

Just to show that the VAE is doing what is expected, here is some output generated from points in the latent space:

In this post we have used a standard option of Keras to define (eager execution compatible) loss functions. We transferred the KL loss related tensors „mu“ and „logvar“ to the Decoder and used them as different output tensors of our VAE-model. We needed to provide some dummy „y_true“ tensors to the cost functions. The approach is a bit complicated, but it is working under eager execution conditions and it does not reduce performance.

It also provided us with some insights into coupled „multi-input/multi-output models“ and cost handling for each of the outputs.

Still, this interesting approach appears as an overkill for handling the KL loss. In the next post

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

I shall turn to a seemingly much lighter approach which will use the *model.add_loss()* functionality of Keras.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

The configuration of all involved clients and servers is a bit tricky – and all components have special settings to interact smoothly. So I was always happy when the upgrade processes of the servers respected my settings and things went smoothly. This was not always the case, but at least the main components survived the upgrades. But NOT this time.

The Leap 15.3 repositories no longer contain Cyrus packages! And I became aware of this only when it was too late. The SLES update repositories available after the upgrade did not contain any Cyrus packages either. After the upgrade the IMAP components of my mail servers were annihilated. Not funny at all!

Fortunately, I had backed up my VMs – and could restore them to bridge the time until I had solved the problem. Afterward I spent some hours trying to reconstruct a running Cyrus configuration on the upgraded Leap 15.3 version of the mail server VM.

I got a suitable version of a Cyrus package which works with Leap 15.3 from the following repository:

download.opensuse.org/repositories/server:/mail/15.3/

However, while the installation worked well locally after some changes to the configuration file, I could not get access to it from external clients. In KMail I got the message that the server did not support any security mechanisms. But STARTTLS should have worked! I checked the SSSD configuration, checked imapd.conf, nsswitch, ldap.conf and the certificate references. All OK.

I found the solution after having read some of my own old blog posts. The Leap upgrade had brutally deleted my carefully crafted PAM files „imap“ and „smtp“ in „/etc/pam.d/“. This has happened before. See:

Mail-server-upgrade to Opensuse Leap 15 – and some hours with authentication trouble …

So: Keep backups of your PAM configuration if you have some complicated TLS interactions between your openSUSE machines!

And start acquiring knowledge on Dovecot and the migration from Cyrus to Dovecot. Who knows when Cyrus will disappear from all SuSE repositories. And be prepared for problems with Cyrus and Leap 15.4, too.

I also find it frustrating that „https://doc.opensuse.org/release-notes/x86_64/openSUSE/Leap/15.3/“ does not explicitly state that the package „cyrus-imapd“ was removed. The information refers to changes in „cyrus-sasl“ – but this is a different package. Which, ironically, still exists (though modified) …

But I am too old to explode just because of the lack of important information …

Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

In the last post it became clear that it might be a good idea to delegate the KL loss calculation to a specific layer within the Encoder model. In this post I discuss the code for such a solution. I am going to encapsulate the construction of a suitable Keras model for the VAE in a class. The class will in further posts be supplemented by more methods for different approaches compatible with TF2.x and eager execution.

The code’s structure has been influenced by the work and books of several people whom I want to name explicitly: D. Foster, F. Chollet and Louis Tiao. See the references in the last section of this post.

For the data sets I later want to work with, both the Encoder and the Decoder parts of the VAE shall be based upon „convolutional networks“ [CNNs] and respective Keras layers. Based on suggestions of D. Foster and F. Chollet I use a class interface to provide the parameters of all invoked Conv2D and Conv2DTranspose layers. But in contrast to D. Foster I also indicate how to include different activation functions (e.g. SELU). In general I will also use the Keras functional API to define and add layers to the VAE model.

Below I discuss, step by step, parts of the code I put into a Python module to be used later in Jupyter notebooks. First we need to import some Python modules; note that you may have to add further statements which import personal modules from paths on your local machine:

import sys
import numpy as np
import os
import tensorflow as tf
from tensorflow.keras.layers import Layer, Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, AlphaDropout
from tensorflow.keras.models import Model
# to be consistent with my standard loading of the Keras backend in Jupyter notebooks:
from tensorflow.keras import backend as B
from tensorflow.keras.optimizers import Adam

Following the ideas discussed in my last post, I now add a class which later allows for the setup of a special *customized* Keras layer in the Encoder model. This layer will calculate the KL loss for us. To be able to do so, the implementation interface „call()“ receives a variable „inputs“ which contains references to the „mu“ and „log_var“ layers of the Encoder (see the two last posts in this series).

class My_KL_Layer(Layer):
    '''
    @note: Returns the input layers! Required to allow for z-point calculation
           in a final Lambda layer of the Encoder model
    '''
    # Standard initialization of layers
    def __init__(self, *args, **kwargs):
        self.is_placeholder = True
        super(My_KL_Layer, self).__init__(*args, **kwargs)

    # The implementation interface of the Layer
    def call(self, inputs, fact = 4.5e-4):
        mu      = inputs[0]
        log_var = inputs[1]
        # Note: from other analysis we know that the backend applies tf.math functions
        # "fact" must be adjusted - for MNIST reasonable values are in the range of 0.65e-4 to 6.5e-4
        kl_mean_batch = - fact * B.mean(1 + log_var - B.square(mu) - B.exp(log_var))
        # We add the loss via the layer's add_loss() - it will be added up to other losses of the model
        self.add_loss(kl_mean_batch, inputs=inputs)
        # We add the loss information to the metrics displayed during training
        self.add_metric(kl_mean_batch, name='kl', aggregation='mean')
        return inputs

An important point is that a layer based on this class must return its input – namely the „mu“ and „log_var“ layers – for the z-point calculations in the final Encoder layer.

Note that we not only add the loss to other losses of an eventual VAE model via the layer’s „add_loss()“ method, but that we also ensure to get some information about the size of the KL loss during training by adding it to the metrics.
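As a small plausibility check of the formula used in „My_KL_Layer“ (my own sketch, not part of the class code): the KL term vanishes exactly when the Encoder produces a standard normal distribution, i.e. mu = 0 and log_var = 0, and grows as the distribution deviates from it. A NumPy version of the expression makes this easy to verify:

```python
import numpy as np

def kl_mean_batch(mu, log_var, fact=4.5e-4):
    # Same formula as in My_KL_Layer.call(), expressed with NumPy:
    # scaled KL divergence of N(mu, var) against N(0, 1), averaged over the batch
    return -fact * np.mean(1.0 + log_var - np.square(mu) - np.exp(log_var))

# For mu = 0 and log_var = 0 (i.e. var = 1) the KL loss is exactly zero ...
mu      = np.zeros((128, 16))
log_var = np.zeros((128, 16))
print(kl_mean_batch(mu, log_var))                 # -> 0.0

# ... and it becomes positive as soon as mu moves away from zero
print(kl_mean_batch(mu + 2.0, log_var) > 0.0)     # -> True
```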

We now build a class to create the essential parts of a VAE. The class will provide the required flexibility and allow for future extensions comprising other TF2.x compatible solutions for KL loss calculations. (In this post we only use a customized layer to get the KL loss).

We start with the class’s „__init__()“ function, which basically saves the provided parameters into class variables.

# The Main class
# ~~~~~~~~~~~~~~
class MyVariationalAutoencoder():
    '''
    Coding suggestions of D. Foster and F. Chollet were modified and extended by RMO
    @version: V0.1, 25.04
    @change:  added b_build_all
    @version: V0.2, 08.05
    @change:  Handling of the KL-loss via functions (partially not working)
    @version: V0.3, 29.05
    @change:  Handling of the KL-loss function via a customized Encoder layer
    '''

    def __init__(self
        , input_dim                  # the shape of the input tensors (for MNIST (28,28,1))
        , encoder_conv_filters       # number of maps of the different Conv2D layers
        , encoder_conv_kernel_size   # kernel sizes of the Conv2D layers
        , encoder_conv_strides       # strides - here also used to reduce spatial resolution
                                     # used instead of Pooling layers
        , decoder_conv_t_filters     # number of maps in Conv2DTranspose layers
        , decoder_conv_t_kernel_size # kernel sizes of Conv2DTranspose layers
        , decoder_conv_t_strides     # strides for Conv2DTranspose layers - inverts spatial resolution
        , z_dim                      # dimension of the latent space; a good start is 16 or 24
        , solution_type  = 0         # which type of solution for the KL loss calculation?
        , act            = 0         # which type of activation function?
        , fact           = 0.65e-4   # factor for the KL loss (0.5e-4 < fact < 1.e-3 is reasonable)
        , use_batch_norm = False     # shall BatchNormalization be used after Conv2D layers?
        , use_dropout    = False     # shall statistical dropout layers be used for regularization purposes?
        , b_build_all    = False     # added by RMO - full model is built in 2 steps
        ):
        '''
        Input:
        The encoder_... and decoder_... variables are Python lists whose length defines
        the number of Conv2D and Conv2DTranspose layers

        input_dim                : shape/dimensions of the input tensor - for MNIST (28,28,1)
        encoder_conv_filters     : list with the number of maps/filters per Conv2D layer
        encoder_conv_kernel_size : list with the kernel sizes for the Conv layers
        encoder_conv_strides     : list with the strides used for the Conv layers

        act                      : determines the activation function to use (0: LeakyReLU, 1: ReLU, 2: SELU)
                                   !!! NOTE: If SELU is used then the weight kernel initialization
                                   and the dropout layer need to be special:
                                   AlphaDropout instead of Dropout + LecunNormal for the kernel initializer, see
                                   https://github.com/christianversloot/machine-learning-articles/blob/main/using-selu-with-tensorflow-and-keras.md

        z_dim                    : dimension of the "latent space"

        solution_type            : type of solution for the KL loss calculation
                                   (0: customized Encoder layer, 1: model.add_loss(),
                                    2: definition of a training step with GradientTape())

        use_batch_norm = False   : True => we use BatchNormalization
        use_dropout    = False   : True => we use dropout layers (rate = 0.25, see Encoder)
        b_build_all    = False   : True  => full VAE model is built in 1 step;
                                   False => Encoder, Decoder, VAE are built in separate steps
        '''
        self.name = 'variational_autoencoder'

        # Parameters for layers which define the Encoder and Decoder
        self.input_dim                  = input_dim
        self.encoder_conv_filters       = encoder_conv_filters
        self.encoder_conv_kernel_size   = encoder_conv_kernel_size
        self.encoder_conv_strides       = encoder_conv_strides
        self.decoder_conv_t_filters     = decoder_conv_t_filters
        self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
        self.decoder_conv_t_strides     = decoder_conv_t_strides
        self.z_dim = z_dim

        # Check param for activation function
        if act < 0 or act > 2:
            print("Range error: Parameter act = " + str(act) + " has an unknown value")
            sys.exit()
        else:
            self.act = act

        # Factor to scale the KL loss relative to the binary cross entropy loss
        self.fact = fact

        # Check param for solution approach
        if solution_type < 0 or solution_type > 2:
            print("Range error: Parameter solution_type = " + str(solution_type) + " has an unknown value")
            sys.exit()
        else:
            self.solution_type = solution_type

        self.use_batch_norm = use_batch_norm
        self.use_dropout    = use_dropout

        # Preparation of some variables to be filled later
        self._encoder_input  = None  # receives the Keras object for the Input layer of the Encoder
        self._encoder_output = None  # receives the Keras object for the Output layer of the Encoder
        self.shape_before_flattening = None  # info of the Encoder => is used by the Decoder
        self._decoder_input  = None  # receives the Keras object for the Input layer of the Decoder
        self._decoder_output = None  # receives the Keras object for the Output layer of the Decoder

        # Layers / tensors for the KL loss
        self.mu      = None  # receives the special Dense layer's tensor for the KL loss
        self.log_var = None  # receives the special Dense layer's tensor for the KL loss

        # Parameters for SELU - just in case we may need to use it somewhere
        # https://keras.io/api/layers/activations/ - see selu
        self.selu_scale = 1.05070098
        self.selu_alpha = 1.67326324

        # The number of Conv2D and Conv2DTranspose layers for the Encoder / Decoder
        self.n_layers_encoder = len(encoder_conv_filters)
        self.n_layers_decoder = len(decoder_conv_t_filters)

        self.num_epoch = 0  # initialization of the number of epochs

        # A matrix for the values of the losses
        self.std_loss = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)

        # We only build the whole VAE model if requested
        self.b_build_all = b_build_all
        if b_build_all:
            self._build_all()

Note that for the present post we (can) only use „solution_type = 0“ !

The class shall provide a method to build the Encoder – for our present purposes including a customized layer based on the class „My_KL_Layer“. This layer just returns its input – namely the layers „mu“ and „log_var“ for the variational calculation of z-points – but it also calculates the KL loss, which is added to other model losses.

# Method to build the Encoder
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_enc(self, solution_type = 0, fact=-1.0):
    ''' Encoder
    @summary: Method to build the Encoder part of the AE
              This will be a CNN defined by the parameters to __init__
    @note:    For self.solution_type = 0 we add an extra layer to calculate the KL loss
    @note:    The last layer uses a sigmoid activation to create the output
              This may not be compatible with some scalers applied to the input data (images)
    '''
    # Check whether "fact" for the KL loss shall be overwritten
    if fact < 0:
        fact = self.fact

    # Preparation: We later need a function to calculate the z-points in the latent space
    # This function will be used by an eventual Lambda layer of the Encoder
    def z_point_sampling(args):
        '''
        A point in the latent space is calculated statistically
        around an optimized mu for each sample
        '''
        mu, log_var = args  # Note: These are 1D tensors!
        epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
        return mu + B.exp(log_var / 2) * epsilon

    # Input "layer"
    self._encoder_input = Input(shape=self.input_dim, name='encoder_input')

    # Initialization of a running variable x for individual layers
    x = self._encoder_input

    # Build the CNN part with Conv2D layers
    # Note that stride >= 2 reduces spatial resolution without the help of pooling layers
    for i in range(self.n_layers_encoder):
        conv_layer = Conv2D(
            filters       = self.encoder_conv_filters[i]
            , kernel_size = self.encoder_conv_kernel_size[i]
            , strides     = self.encoder_conv_strides[i]
            , padding     = 'same'  # Important! Controls the shape of the layer tensors.
            , name        = 'encoder_conv_' + str(i)
            )
        x = conv_layer(x)

        # The "normalization" should be done ahead of the "activation"
        if self.use_batch_norm:
            x = BatchNormalization()(x)

        # Selection of activation function (out of 3)
        if self.act == 0:
            x = LeakyReLU()(x)
        elif self.act == 1:
            x = ReLU()(x)
        elif self.act == 2:
            # RMO: Just use the Activation layer to use SELU with predefined (!) parameters
            x = Activation('selu')(x)

        # Fulfill some SELU requirements
        if self.use_dropout:
            if self.act == 2:
                x = AlphaDropout(rate = 0.25)(x)
            else:
                x = Dropout(rate = 0.25)(x)

    # Last multi-dim tensor shape - is later needed by the Decoder
    self._shape_before_flattening = B.int_shape(x)[1:]

    # Flattened layer before calculating the VAE output (z-points) via 2 special layers
    x = Flatten()(x)

    # "Variational" part - create 2 Dense layers for a statistical distribution of z-points
    self.mu      = Dense(self.z_dim, name='mu')(x)
    self.log_var = Dense(self.z_dim, name='log_var')(x)

    if solution_type == 0:
        # Customized layer for the calculation of the KL loss based on mu, log_var data
        # We use a customized layer according to a class definition
        self.mu, self.log_var = My_KL_Layer()([self.mu, self.log_var], fact=fact)

    # Layer to provide a z_point in the latent space for each sample of the batch
    self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])

    # The Encoder model
    self.encoder = Model(self._encoder_input, self._encoder_output)

The following function should be self-evident; it reverses the Encoder’s operations and uses z-points of the latent space as input.

# Method to build the Decoder
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_dec(self):
    ''' Decoder
    @summary: Method to build the Decoder part of the AE
              Normally this will be a reverse CNN defined by the parameters to __init__
    '''
    # Input layer - aligned to the shape of the Encoder's output layer
    self._decoder_input = Input(shape=(self.z_dim,), name='decoder_input')

    # Here we use the tensor shape info from the Encoder
    x = Dense(np.prod(self._shape_before_flattening))(self._decoder_input)
    x = Reshape(self._shape_before_flattening)(x)

    # The inverse CNN
    for i in range(self.n_layers_decoder):
        conv_t_layer = Conv2DTranspose(
            filters       = self.decoder_conv_t_filters[i]
            , kernel_size = self.decoder_conv_t_kernel_size[i]
            , strides     = self.decoder_conv_t_strides[i]
            , padding     = 'same'  # Important! Controls the shape of tensors during reconstruction
                                    # We want an image with the same resolution as the original input
            , name        = 'decoder_conv_t_' + str(i)
            )
        x = conv_t_layer(x)

        # Normalization and activation
        if i < self.n_layers_decoder - 1:
            # Also in the Decoder: normalization before activation
            if self.use_batch_norm:
                x = BatchNormalization()(x)

            # Choice of activation function
            if self.act == 0:
                x = LeakyReLU()(x)
            elif self.act == 1:
                x = ReLU()(x)
            elif self.act == 2:
                #x = self.selu_scale * ELU(alpha=self.selu_alpha)(x)
                x = Activation('selu')(x)

            # Adaptions to SELU requirements
            if self.use_dropout:
                if self.act == 2:
                    x = AlphaDropout(rate = 0.25)(x)
                else:
                    x = Dropout(rate = 0.25)(x)

        # Last layer => sigmoid output
        # => This requires scaled input => division of pixel values by 255
        else:
            x = Activation('sigmoid')(x)

    # Output tensor => a scaled image
    self._decoder_output = x

    # The Decoder model
    self.decoder = Model(self._decoder_input, self._decoder_output)

Note that we do not include any loss calculations in the Decoder model. The main loss – namely the „**binary cross-entropy**“ – will later be added via the „**compile()**“ method of the full Keras based VAE model.

We have already created two Keras models for the Encoder and Decoder. We now combine them into the full VAE model and save this model in a variable of the object derived from our class.

# Function to build the full AE
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_VAE(self):
    model_input  = self._encoder_input
    model_output = self.decoder(self._encoder_output)
    self.model   = Model(model_input, model_output, name="vae")

# Function to build the full AE in one step if requested
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def _build_all(self):
    self._build_enc()
    self._build_dec()
    self._build_VAE()

For our present solution with the customized layer for the KL loss we now provide a matching „**compile()**“ function:

# Function to compile the VAE model with a KL layer in the Encoder
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def compile_for_KL_Layer(self, learning_rate):
    if self.solution_type != 0:
        print("The compile_for_KL_Layer() function is only compatible with solution_type = 0")
        sys.exit()
    self.learning_rate = learning_rate
    # Optimizer
    optimizer = Adam(learning_rate=learning_rate)
    self.model.compile(optimizer=optimizer,
                       loss="binary_crossentropy",
                       metrics=[tf.keras.metrics.BinaryCrossentropy(name='bce')])

This is the place where we include the main contribution to the loss – namely a „binary cross-entropy“ calculation with respect to the differences between the original input tensor to our model and its output tensor. We had to use the metric class BinaryCrossentropy(name=’bce‘) to be able to give the respective output during training a short name. All in all we expect an output during training comprising:

- the total loss
- the contribution from the binary_crossentropy
- the KL contribution
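Just to make explicit what the binary cross-entropy contribution measures per pixel (my own NumPy sketch, mirroring what Keras computes for scaled images):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1.e-7):
    # NumPy equivalent of a binary cross-entropy averaged over all features;
    # the eps clipping mimics the backend's protection against log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# A perfect reconstruction of a (scaled) image gives a loss close to zero ...
img = np.array([0.0, 1.0, 1.0, 0.0])
print(binary_crossentropy(img, img))

# ... while a mediocre reconstruction gives a noticeably larger value
reco = np.array([0.3, 0.7, 0.8, 0.1])
print(binary_crossentropy(img, reco))
```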

We are almost finished. We just need a matching method for starting the training via calling the „**fit()**„-function of our Keras based VAE model:

def train_model_with_KL_Layer(self, x_train, batch_size, epochs, initial_epoch = 0):
    self.model.fit(
        x_train, x_train
        , batch_size    = batch_size
        , shuffle       = True
        , epochs        = epochs
        , initial_epoch = initial_epoch
    )

Note that we pass the same „x_train“ batch of samples twice: The standard „y“ output „labels“ actually are the input samples (which is, of course, the core characteristic of AEs). We shuffle data during training.

Why use a special function of the class at all and not directly call fit() from Jupyter notebook cells?

Well, at this point we could include multiple other things, such as custom callbacks (e.g. for special output or model saving) and a scheduler. See e.g. the code of D. Foster at his GitHub site for variants. For the sake of brevity I skip these techniques in this post.
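As a small illustration of what such an extension could look like (a hypothetical sketch, not part of my class), a minimal custom callback recording the total loss per epoch could be passed to fit() via its „callbacks“ argument:

```python
import tensorflow as tf

class LossHistory(tf.keras.callbacks.Callback):
    '''Hypothetical example callback: records the total loss at the end of each epoch'''
    def __init__(self):
        super().__init__()
        self.losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))

# Usage inside a training method would then look like, e.g.:
#   history_cb = LossHistory()
#   self.model.fit(x_train, x_train, batch_size=batch_size, epochs=epochs,
#                  callbacks=[history_cb])
```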

Let us see how we can use our carefully crafted class with a Jupyter notebook. As I personally gather Python modules (via Eclipse PyDev) in some special folders, I first have to add a path:

**Cell 1**:

import sys
# !!! ADAPT to YOUR needs !!!
sys.path.append("/projects/GIT/ml_4/")
print(sys.path)

Of course, *you must adapt this path to your personal situation*.

The next cell contains module imports

**Cell 2**

import numpy as np
import time
import os
import sklearn   # could be used for scalers

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat

# tensorflow and keras
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import optimizers
from tensorflow.keras import metrics
from tensorflow.keras.datasets import mnist
from tensorflow.keras.optimizers import schedules
from tensorflow.keras.utils import to_categorical
from tensorflow.python.client import device_lib

# My VAE class
from my_AE_code.models.My_VAE import MyVariationalAutoencoder

I then suppress some warnings regarding my Nvidia card and list the available Cuda devices.

**Cell 3**

# Suppress some TF2 warnings on negative NUMA node numbers
# see https://www.programmerall.com/article/89182120793/
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any of {'0', '1', '2'}

tf.config.experimental.list_physical_devices()

We then control resource usage:

**Cell 4**

# Restrict to GPU and activate jit to accelerate
# IMPORTANT NOTE: To change any of the following values you MUST restart the notebook kernel!
b_tf_CPU_only      = False   # we want to work on a GPU
tf_limit_CPU_cores = 4
tf_limit_GPU_RAM   = 2048

if b_tf_CPU_only:
    tf.config.set_visible_devices([], 'GPU')   # no GPU, only CPU
    # Restrict number of CPU cores
    tf.config.threading.set_intra_op_parallelism_threads(tf_limit_CPU_cores)
    tf.config.threading.set_inter_op_parallelism_threads(tf_limit_CPU_cores)
else:
    gpus = tf.config.experimental.list_physical_devices('GPU')
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit = tf_limit_GPU_RAM)])

# JIT optimizer
tf.config.optimizer.set_jit(True)

Let us load MNIST for test purposes:

**Cell 5**

def load_mnist():
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.astype('float32') / 255.
    x_train = x_train.reshape(x_train.shape + (1,))
    x_test  = x_test.astype('float32') / 255.
    x_test  = x_test.reshape(x_test.shape + (1,))
    return (x_train, y_train), (x_test, y_test)

(x_train, y_train), (x_test, y_test) = load_mnist()

Provide the VAE setup variables to our class:

**Cell 6**

z_dim = 2
vae = MyVariationalAutoencoder(
    input_dim = (28,28,1)
    , encoder_conv_filters       = [32,64,128]
    , encoder_conv_kernel_size   = [3,3,3]
    , encoder_conv_strides       = [1,2,2]
    , decoder_conv_t_filters     = [64,32,1]
    , decoder_conv_t_kernel_size = [3,3,3]
    , decoder_conv_t_strides     = [2,2,1]
    , z_dim = z_dim
    , act   = 0
    , fact  = 5.e-4
)

Set up the Encoder:

**Cell 7**

# overwrite the KL fact from the class
fact = 2.e-4
vae._build_enc(fact=fact)
vae.encoder.summary()

Build the Decoder:

**Cell 8**

vae._build_dec()
vae.decoder.summary()

Build the VAE model:

**Cell 9**

vae._build_VAE()
vae.model.summary()

Compile

**Cell 10**

LEARNING_RATE = 0.0005
vae.compile_for_KL_Layer(LEARNING_RATE)

Train / fit the model to the training data

**Cell 11**

BATCH_SIZE = 128
EPOCHS = 6          # for real runs ca. 40
INITIAL_EPOCH = 0
vae.train_model_with_KL_Layer(
    x_train[0:60000]
    , batch_size    = BATCH_SIZE
    , epochs        = EPOCHS
    , initial_epoch = INITIAL_EPOCH
)

For the given parameters I got the following output on my old GTX 960:

Epoch 1/6
469/469 [==============================] - 12s 24ms/step - loss: 0.2613 - bce: 0.2589 - kl: 0.0024
Epoch 2/6
469/469 [==============================] - 12s 25ms/step - loss: 0.2174 - bce: 0.2159 - kl: 0.0015
Epoch 3/6
469/469 [==============================] - 11s 23ms/step - loss: 0.2100 - bce: 0.2085 - kl: 0.0015
Epoch 4/6
469/469 [==============================] - 11s 23ms/step - loss: 0.2057 - bce: 0.2042 - kl: 0.0015
Epoch 5/6
469/469 [==============================] - 11s 23ms/step - loss: 0.2034 - bce: 0.2019 - kl: 0.0015
Epoch 6/6
469/469 [==============================] - 11s 23ms/step - loss: 0.2019 - bce: 0.2004 - kl: 0.0015

So, 11 secs for an epoch of 60,000 samples with batch size 128 is a reference point. Note that this is obviously faster than what we got for the solution discussed in the last post.

Just to give you an impression of other results:

For z_dim = 2, fact = 2.e-4 and 60 epochs I got something like the following data point distribution in the latent space:

I shall discuss more results – also for other test data sets – in future posts in this blog.

In this post we have built a class to set up a VAE based on an Encoder and a Decoder model with Conv2D and Conv2DTranspose layers. We delegated the calculation of the KL loss to a customized layer of the Encoder, whilst the main loss contribution was defined in form of a binary-crossentropy evaluation via the compile()-function of the VAE model. All loss contributions were displayed as „metrics“ elements during training. The presented solution is fully compatible with Tensorflow 2.8 and eager execution. It is in my opinion also elegant and very Keras oriented, as all important operations are encapsulated in a continuous sequence of layers. We also found this to be a relatively fast solution.

In the next post of this series we are going to use our class to adapt an older suggestion of D. Foster to the requirements of TF 2.8.

**F. Chollet**, Deep Learning mit Python und Keras, 2018, 1-te dt. Auflage, mitp Verlags GmbH & Co.KG, Frechen

**D. Foster**, „Generatives Deep Learning“, 2020, 1-te dt. Auflage, dpunkt Verlag, Heidelberg in Kooperation mit Media Inc. O’Reilly, ISBN 978-3-960009-128-8. See Kap. 3 and the VAE code published at

https://github.com/davidADSP/GDL_code/

**Louis Tiao,** „Implementing Variational Autoencoders in Keras: Beyond the Quickstart Tutorial“, 2017, http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/

**Recommendation**: The article of L. Tiao is not only interesting regarding Keras modularity. I like it very much also for his mathematical depth. I highly recommend his article as a source of inspiration, especially with respect to alternative divergences. Please, also follow Tiao’s list of well selected literature references.

And before I forget it:

*Ceterum censeo:* The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

we have seen that it is a bit more difficult to set up a **Variational Autoencoder [VAE]** with Keras and Tensorflow 2.8 than a pure Autoencoder [AE]. One of the reasons is that we need to include extra layers for a statistical variation of z-points around mean values „mu“ with a variance „var“ for each sample. In addition a special loss – the Kullback-Leibler loss – must be taken into account besides a binary-crossentropy loss to optimize the „mu“ and „log_var“ values in parallel to a good reconstruction ability of the Decoder.

In the last post we also saw that a too conservative handling of the Kullback-Leibler divergence may lead to problems with the „**eager execution mode**“ of present Tensorflow 2 versions.

In this post I shall first show you how to remedy the specific problem presented in the last post. Sometimes solutions are easy to achieve … :-). But we should also understand the reason for the problem; some basic considerations will help. Afterward we have a brief look at the performance. Finally, we summarize our experiences in some simple rules.

The next statements are according to my present understanding:

When we designed layered structures of ANNs and related operations with TF 1.x and Keras, Tensorflow built a graph as an intermediate product. The graph contained all mathematical operations in a symbolic way – including the calculation of partial derivatives and gradients. The analysis of the graph by TF afterward led to a defined sequence of real numerical operations. It is clear that the full knowledge of the graph offers the chance for an optimization of the intended operations, e.g. for ANN training and error back propagation based on gradient components (= partial derivatives with respect to trainable variables of an ANN, mostly weights). Potential disadvantages of graphs are: Their analysis takes time, and it has to be completed before any numerical operations can be started in the background. This in turn means that we cannot test code directly within a sequence of Python statements.
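The contrast between the two execution models can be made concrete with a tiny sketch of my own (not part of the VAE code): in TF2 operations run eagerly by default, while the @tf.function decorator restores the TF1-style behavior of first tracing the Python function into a graph and then executing that graph.

```python
import tensorflow as tf

# Eager mode (TF2 default): the operations are executed immediately,
# and we can inspect the numerical result right away
t = tf.constant([1.0, 2.0, 3.0])
print(tf.reduce_sum(t * t).numpy())   # -> 14.0

# With @tf.function the Python function is traced into a graph on the
# first call; subsequent calls run the compiled graph instead of
# line-by-line Python
@tf.function
def squared_sum(x):
    return tf.reduce_sum(x * x)

print(squared_sum(t).numpy())         # -> 14.0 - same result, different execution model
```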

In an eager execution environment, operations are instead evaluated immediately as the related tensors occur and, in case of neural networks, as their relation to the (weight) variables of interest is properly defined. This includes the calculation of partial derivatives (see my post on error backward calculation for MLPs) with respect to these weights. A requirement is that the operations (= mathematical functions) on specific **tensors** (represented by matrices) must be well defined. Such operations can be defined by TF2 math operations directly applied to user defined tensors in a Python statement. But they can also be encapsulated in user or Keras defined functions and combined in complicated ways – provided that it is clear how the chain rule must be applied. As the relation between the trainable variables of neighboring Keras layers in a neural network is well defined, the gradient contribution of two neighbor layers to any loss function is also properly defined – *and* can be calculated already during the forward pass through a neural network. At least in principle we can get resulting tensor values directly or asap during forward propagation wherever possible.

As there are no graphs in eager execution, automatic differentiation based on a graph analysis is not possible without some help. Something has to track operations and functions applied to tensors and record resulting gradient components (i.e. partial derivative values) during a forward pass through a complicated network, such that the derivatives can be used during error back-propagation. The tool for this is **GradientTape()**.

A general interface to TF 2.0 like Keras has to incorporate and use GradientTape() internally. While trainable variables like those of a Keras layer can automatically be watched by GradientTape(), specific user defined operations have to be explicitly registered with GradientTape() if you cannot use some Keras model or Keras layer options. However, when you use Keras to define your models, gradient related calculations are done directly during the forward pass through a network. Whilst moving forward through a defined network’s layers, gradient contributions (partial derivatives) are evaluated obeying the chain rule across variables of previous layers, of course. The resulting gradient contributions can later be used and properly combined for error backward calculation.
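A minimal GradientTape() sketch (my own illustration, not part of the VAE code) shows the mechanism: a trainable tf.Variable is watched automatically, operations on it are recorded during the forward computation, and the tape can afterwards return the partial derivative.

```python
import tensorflow as tf

# A trainable variable is watched by the tape automatically
w = tf.Variable(3.0)
x = tf.constant(2.0)

with tf.GradientTape() as tape:
    loss = w * x + w**2     # d(loss)/dw = x + 2*w = 2 + 6 = 8

grad = tape.gradient(loss, w)
print(grad.numpy())          # -> 8.0
```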

Just as a reminder: In the last post I introduced a special layer to take care of the KL loss according to a recipe of F. Chollet in his book on Deep Learning of 2017 (see the precise reference at the end of my last post):

**Customized Keras layer class**:

class CustomVariationalLayer (Layer):

    def vae_loss(self, x_inp_img, z_reco_img):
        # The references to the layers are resolved outside the function
        x = B.flatten(x_inp_img)   # B: tensorflow.keras.backend
        z = B.flatten(z_reco_img)

        # reconstruction loss per sample
        # Note: this is averaged over all features (e.g. 784 for MNIST)
        reco_loss = tf.keras.metrics.binary_crossentropy(x, z)

        # KL loss per sample - we reduce it by a factor of 1.e-3
        # to make it comparable to the reco_loss
        kln_loss = -0.5e-4 * B.mean(1 + log_var - B.square(mu) - B.exp(log_var), axis=1)

        # mean per batch (axis = 0 is automatically assumed)
        return B.mean(reco_loss + kln_loss), B.mean(reco_loss), B.mean(kln_loss)

    def call(self, inputs):
        inp_img = inputs[0]
        out_img = inputs[1]

        total_loss, reco_loss, kln_loss = self.vae_loss(inp_img, out_img)

        # We add the loss from the layer
        self.add_loss(total_loss, inputs=inputs)
        self.add_metric(total_loss, name='total_loss', aggregation='mean')
        self.add_metric(reco_loss, name='reco_loss', aggregation='mean')
        self.add_metric(kln_loss, name='kl_loss', aggregation='mean')

        return out_img   # not really used in this approach

This layer was added on top of the sequence of Encoder and Decoder: Encoder => Decoder => KL_layer.

enc_output = encoder(encoder_input)
decoder_output = decoder(enc_output)
KL_layer = CustVariationalLayer()([encoder_input, decoder_output])
vae = Model(encoder_input, KL_layer, name="vae")

This led to an error.

Can we remedy the approach above by some simple means? Yes, we can. I first list the solution’s code, then discuss it:

# SOLUTION I: Custom Layer for total and KL loss
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
class CustomVariationalLayer (Layer):

    def vae_loss(self, mu, log_var, inp_img, out_img):
        bce = tf.keras.losses.BinaryCrossentropy()
        reco_loss = bce(inp_img, out_img)
        kln_loss = -0.5e-4 * B.mean(1 + log_var - B.square(mu) - B.exp(log_var), axis=1)  # mean per sample
        return B.mean(reco_loss + kln_loss), B.mean(reco_loss), B.mean(kln_loss)  # means per batch

    def call(self, inputs):
        mu = inputs[0]
        log_var = inputs[1]
        inp_img = inputs[2]
        out_img = inputs[3]
        total_loss, reco_loss, kln_loss = self.vae_loss(mu, log_var, inp_img, out_img)
        self.add_loss(total_loss, inputs=inputs)
        self.add_metric(total_loss, name='total_loss', aggregation='mean')
        self.add_metric(reco_loss,  name='reco_loss',  aggregation='mean')
        self.add_metric(kln_loss,   name='kl_loss',    aggregation='mean')
        return inputs[3]  # Not used

**What is the main difference?** Answer: We explicitly provided the tensors as input variables of the function vae_loss()!

Why does it help?

Well, TF2 has to prepare and calculate partial derivatives according to the *chain rule* of differential calculus. What would you yourself want to know on a mathematical level? You would write down any complicated function with further internal operations as a function of well defined arguments! So: We must tell TF2.x explicitly which variables, namely tensors, our loss function depends on.

In our original approach the function’s input was not defined. This obviously matters with TF2.x!

As a consequence the summary of our VAE model has become longer than in the last post:

For our solution we compile and train as follows:

vae.compile(optimizer=Adam(), loss=None)

n_epochs = 40
batch_size = 128
vae.fit(x=x_train, y=None, shuffle=True, epochs=n_epochs, batch_size=batch_size)

Note that we do not provide any „y“ to fit against. The costs are already fully defined by our special customized layer. If we had used the binary_crossentropy loss in the compile statement, however, we would have had to provide target tensors; see below.

On a Nvidia 960 GTX the calculation proceeds for some epochs like:

After 40 epochs we get with t-SNE well separated clusters for the test-data:

More interesting is the result for **z_dim = 2**, as we expect a more confined usage of the available z-space. And indeed, if we raise the factor in front of the KL loss e.g. to 6.5e-4, we get something like

With the exception of „6“-digits the samples use the space between -4 < y < 3.5 and -3 < x < 4.5 in z-space. This area is smaller by roughly a factor of 4 (i.e. 2 in each direction) than the space used by a standard Autoencoder (see the 1st post of this series). So, the KL loss shows an effect.

However, our new approach is not as fast as it could be. What can we do to optimize? First, we can get rid of the extra function in the layer and work directly on the tensors in the call() function. A further step would be to focus only on the KL loss: Why not let Keras organize the stuff for binary_crossentropy? But all this would not change our performance much.

The real problem in our case (suggested by the master, F. Chollet, himself in an older book) is an inefficient layer structure: We cannot deal directly with the partial derivatives where the tensors appear – namely in the Encoder. Thereby an otherwise possible sequence of linear algebra operations (matrix operations), which could be optimized for error back propagation, is interrupted in a complicated way at the special layers mu and log_var. So, it appears that a strategy which would encapsulate our KL loss calculation in a specific layer of the Encoder would boost performance. This is indeed the case. I will show the solution in my next post, but give you an idea of the performance gain, already:

Instead of **15 secs** as above per epoch we are going to need **only 10 to 11 secs**.
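To give you an idea of what such an Encoder-internal layer could look like, here is a hedged minimal sketch of my own (not yet the solution of the next post): a small custom layer which consumes the mu and log_var tensors, registers the KL loss via its own add_loss() method and passes the tensors through unchanged, so the Encoder structure stays intact:

```python
# Hedged sketch of a KL loss layer placed INSIDE the Encoder.
# The name "KLLossLayer" and the weighting factor are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras.layers import Layer

class KLLossLayer(Layer):
    def __init__(self, fact=0.5e-4, **kwargs):
        super().__init__(**kwargs)
        self.fact = fact   # weighting factor, as in the text above

    def call(self, inputs):
        mu, log_var = inputs
        # KL loss directly on the mu/log_var tensors, averaged over the batch
        kl = -self.fact * tf.reduce_mean(
            1.0 + log_var - tf.square(mu) - tf.exp(log_var))
        self.add_loss(kl)
        return inputs      # pass-through: the Decoder still receives mu, log_var
```

Such a layer would simply be inserted after the mu- and log_var-layers; Keras then collects the registered loss automatically during training.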

I see two basic rules which I personally was not aware of before:

- If you need to perform complex calculations based on layer-related tensors to get certain loss contributions, and if you want to use the result with pre-defined Keras functions such as „layer.add_loss()“ and „model.add_loss()“, then provide the *result* tensors explicitly as input variables to the Keras functions. You can use separate personal functions ahead to perform the required tensor operations, but these functions must also have all layer-based tensors as explicit input variables and an explicit tensor as output.
- If possible, apply your calculations within special layers *closely* following the layers which provide the tensors your loss contribution depends on – best before new trainable variables are introduced. Use the special layer’s *add_loss()* method. Try to verify that your operations fit into a layer-related sequence of matrix operations whose values are needed later for error backward propagation, but are calculated already during the forward pass.

The first rule can be symbolized by something like

# Model definition
...
layer1 = Keras_defined_layer()   # e.g. Dense()
...
layer2 = Keras_defined_layer()   # e.g. Activation()
...
model = Model(....)

# cost calculation
res_tensor_cost_contribution = complex_personal_function(layer1, layer2)
model.add_loss(res_tensor_cost_contribution)

An additional rule may be:

- Try if TF2 math tensor operations are faster than tensorflow.keras.backend operations. I do not think so, but …

In the following posts I am going to pursue three ways to handle the KL loss:

- We add a layer to the Encoder and perform the required KL loss calculation there. We have to take care of a proper output of such a layer not to disrupt the combination of the Encoder with the Decoder. This is in my opinion the most elegant and also the fastest option. It also fits perfectly into the Keras philosophy of defining models via layers. And we can use the Keras compile() and fit() functions seamlessly.
- We calculate the loss after combining the Encoder and Decoder to a VAE-model – and add the KL loss to our VAE model via its add_loss() method. This is a possible and well defined variant as it separates the loss operations from the VAE’s layer structure. Very similar to what we did above – but probably not the fastest method for VAEs.
- We use GradientTape() directly to define an individual training step for our Keras based VAE model. This method will prove to be fast and very flexible. But in a way it leaves the path of using only Keras layers to define and fit neural network models. Nevertheless: Although it requires a different view on the Keras interface to TF2.x, it is certainly the future we should get used to – even if we are no Keras and TF specialists.
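For the third option, a rough sketch of what a GradientTape-based train_step() may look like; the class and attribute names (MiniVAE, encoder, decoder, fact) are illustrative assumptions and not the classes developed in this series:

```python
# Hedged sketch: a Keras Model subclass whose train_step() combines the
# reconstruction loss and the KL loss under a GradientTape.
import tensorflow as tf
from tensorflow import keras

class MiniVAE(keras.Model):
    def __init__(self, encoder, decoder, fact=0.5e-4, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder     # assumed to return (mu, log_var, z)
        self.decoder = decoder
        self.fact = fact
        self.bce = keras.losses.BinaryCrossentropy()

    def train_step(self, data):
        with tf.GradientTape() as tape:
            mu, log_var, z = self.encoder(data)
            reco = self.decoder(z)
            reco_loss = self.bce(data, reco)
            kl_loss = -self.fact * tf.reduce_mean(
                1.0 + log_var - tf.square(mu) - tf.exp(log_var))
            total_loss = reco_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"total_loss": total_loss, "reco_loss": reco_loss,
                "kl_loss": kl_loss}
```

Such a model is compiled with an optimizer only (no loss argument) and fit with x-data alone, since both loss terms are computed inside train_step().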

In this post we saw that some old recipes for VAE design with Keras can still be used with some minor modifications. Two rules show us different ways to make Keras based VAE-ANNs work together with TF2.8. In the next post of this series we shall build a VAE with an Encoder layer to deal with the Kullback-Leibler loss.



I have discussed basics of Autoencoders. We have also set up a simple Autoencoder with the help of the *functional* Keras interface to Tensorflow 2. This worked flawlessly and we could apply our Autoencoder [AE] to the MNIST dataset. Thus we got a good reference point for further experiments. Now let us turn to the design of „Variational Autoencoders“ [VAEs].

In the present post I want to demonstrate that *some* simple classical recipes for the construction of VAEs may not work with Tensorflow [TF] > version 2.3 due to the „eager execution mode“, which is activated as the default environment for all command interpretation and execution. This includes gradient determination during the forward pass through the layers of an artificial neural network [ANN] – in contrast to the „graph mode“ of TF 1.x versions.

**Addendum 25.05.2022:** **This post had to be changed as its original version contained wrong statements.**

As we know already from the first post of this series, we need a special loss function to control the parameters of the distributions in the latent space. These distributions are used to calculate z-points for individual samples and must be „fit“ optimally. I list four methods to calculate such a loss. All methods are taken from introductory books on Machine Learning (see the book references in the last section of this post). I use one concrete and exemplary method to realize a VAE: We first extend the layers of the AE-Encoder by two layers („mu“, „log_var“) which give us the basis for the calculation of z-points on a statistical distribution. Then we use a special layer on top of the Decoder model to calculate the so called „**Kullback-Leibler loss**“ based on data of the „mu“ and „log_var“ layers. Our VAE Keras model will be based on the Encoder, the Decoder and the special layer. This approach will give us a typical error message to think about.

A VAE (as an AE) maps points/vectors of the „variable space“ to points/vectors in the low-dimensional „latent space“. However, a VAE does not calculate the „z-points“ directly. Instead it uses a statistical variable distribution around a mean value. This opens up further degrees of freedom, which become subject to the optimization process. These degrees of freedom are a mean value „mu“ of a distribution and a „standard deviation“. The latter is derived from a variance „**var**“, of which we take the logarithm „**log_var**“ for practical and numerical reasons.

**Note**: Our neural network parts of the VAE will decide themselves during training for which samples they use which **mu** and which **var_log** values. They optimize the required values via specific weights at special layers.

We first define a function whose meaning will become clear in a minute:

# Function to calculate a z-point based on a Gaussian standard distribution
def sampling(args):
    mu, log_var = args
    # A randomized value from a standard distribution
    epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
    # A point in a Gaussian distribution defined by mu and var, with log_var = log(var)
    return mu + B.exp(log_var / 2) * epsilon

This function will be used to calculate z-points from other variables, namely mu and log_var, of the Encoder.

The VAE Encoder looks almost the same as the Encoder of the AE which we build in the last post:

z_dim = 16

# The Encoder
# ************
encoder_input = Input(shape=(28,28,1))
x = encoder_input
x = Conv2D(filters = 32,  kernel_size = 3, strides = 1, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 64,  kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 128, kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
# some information we later need for the decoder - for MNIST and our layers (7, 7, 128)
shape_before_flattening = B.int_shape(x)[1:]   # B is the tensorflow.keras backend! See last post.
x = Flatten()(x)

# Differences to AE-models: the following layers are central elements of VAEs!
mu      = Dense(z_dim, name='mu')(x)
log_var = Dense(z_dim, name='log_var')(x)

# We calculate z-points/vectors in the latent space by a special function
# used by a Keras Lambda layer
enc_out = Lambda(sampling, name='enc_out_z')([mu, log_var])

# The Encoder model
encoder = Model(encoder_input, [enc_out], name="encoder")
encoder.summary()

The differences to the AE of the last post comprise two extra Dense layers: **mu** and **log_var**. Note that the dimension of the respective rank 1 tensor (a vector!) is equal to **z_dim**.

**mu** and **log_var** are (vector) variables which later shall be optimized. So, we have to treat them as trainable network variables. In the above code this is done via the weights of the two defined Dense layers which we integrated in the network. But note that we did not define any activation function for the layers! The calculation of „output“ of these layers is done in a more complex function which we encapsulated in the function „sampling()“.


Therefore, we have in addition defined a Lambda layer which applies the function „sampling()“ to the vectors mu and log_var. The Lambda layer thus delivers the required z-point (i.e. a vector) in the latent space for each sample.

How exactly do we calculate a z-point or vector in the latent space? And what does the KL loss, which we later shall use, punish? For details you need to read the literature. But in short terms:

Instead of calculating a z-point directly we use a Gaussian distribution depending on 2 parameters – our mean value „**mu**“ and a variance „**var**“ (which defines a standard deviation). We calculate a statistically random z-point within the range of this distribution. Note that we talk about a vector-distribution. The distribution „variables“ mu and log_var are therefore vectors.

If „x“ is the last layer of our conventional network part then

mu      = Dense(z_dim, name='mu')(x)
log_var = Dense(z_dim, name='log_var')(x)

Having a value for mu and log_var we use a randomized factor „epsilon“ to calculate a „variational“ z-point with some statistical fluctuation:

epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
z_point = mu + B.exp(log_var/2) * epsilon

This is the core of the „sampling()“-function which we defined a minute ago, already.

As you see, the randomization of epsilon *assumes* a standard Gaussian distribution of possible values around a mean of 0 with a standard deviation of 1. The *calculated* z-point is then placed in the vicinity of a fictitious z-point „mu“. The coordinates of **mu** and the value of **log_var** are subjects of optimization. More specifically, the distance of mu from the center of the latent space and too big values of the standard deviation around mu will be punished by the loss. Mathematically this is achieved by the KL loss (see below). It will help to get a compact arrangement of the z-points for the training samples in the z-space (= latent space).

To transform the result of the above calculation into an ordinary z-point output of the Encoder for a given input sample, we apply a Keras Lambda layer which takes the mu- and log_var-layers as input and applies the already defined function „**sampling()**“ to their tensors.
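The statistics behind this „reparameterization trick“ can be verified with plain NumPy: for many draws the sampled z-values reproduce the requested mean and standard deviation.

```python
# NumPy check of the reparameterization trick used in sampling():
# z = mu + exp(log_var/2) * epsilon with epsilon ~ N(0, 1)
import numpy as np

rng = np.random.default_rng(42)
mu, log_var = 1.5, np.log(0.25)        # target: mean 1.5, std sqrt(0.25) = 0.5
eps = rng.standard_normal(100_000)
z = mu + np.exp(log_var / 2.0) * eps
print(z.mean(), z.std())               # close to 1.5 and 0.5
```

The trick keeps the randomness in epsilon, so mu and log_var remain differentiable quantities for the gradient calculation.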

The Decoder is pretty much the same as the one which we have defined earlier for the Autoencoder.

dec_inp_z = Input(shape=(z_dim))
x = Dense(np.prod(shape_before_flattening))(dec_inp_z)
x = Reshape(shape_before_flattening)(x)
x = Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2DTranspose(filters=32, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2DTranspose(filters=1,  kernel_size=3, strides=1, padding='same')(x)
x = LeakyReLU()(x)
# Output
x = Activation('sigmoid')(x)
dec_out = x

decoder = Model([dec_inp_z], [dec_out], name="decoder")
decoder.summary()

Now, we must define a loss which helps to optimize the weights associated with the mu and var_log layers. We will choose the so called „Kullback-Leibler [KL] loss“ for this purpose. Mathematically this convex loss function is derived from a general distance metric (Kullback_Leibler divergence) for two probability distributions. See:

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence – and especially the examples section there

https://en.wikipedia.org/wiki/Divergence_(statistics)

In our case we pick a normal distribution defined by a specific set of vectors **mu** and **log_var** as the first distribution and compare it to a standard normal distribution with **mu = 0.0** and standard deviation **sqrt(exp(log_var)) = 1.0** – all for a specific z-point (corresponding to an input sample).


Written symbolically the KL loss is calculated by:

kl_loss = -0.5* mean(1 + log_var - square(mu) - exp(log_var))

The „mean()“ accounts for the fact that we deal with vectors. (At this point not with respect to batches.) Our loss punishes a „mu“ far off the origin of the „latent space“ by a quadratic term and a more complicated term dependent on log_var and var = square(sigma) with sigma being the standard deviation. The loss term for sigma is convex around sigma = 1.0. Regarding batches we shall later take a mean of the individual samples‘ losses.

This means that the KL loss tries to push the mu for the samples towards the center of the latent space’s origin and keep the variance around mu within acceptable limits. The mu-dependent term leads to an effective use of the latent space around the origin. Potential clusters of similar samples in the z-space will have centers not too far from the origin. The data shall not spread over vast regions of the latent space. The reduction of the spread around a mu-value keeps similar data-points in close vicinity – leading to rather confined clusters – as good as possible.
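A quick numerical check of the formula above (plain NumPy, no TF needed) confirms this behavior: the loss vanishes exactly for mu = 0 and var = 1 and grows when mu moves away from the origin or the variance deviates from 1.

```python
# Numerical check of: kl_loss = -0.5 * mean(1 + log_var - square(mu) - exp(log_var))
import numpy as np

def kl_loss(mu, log_var):
    return -0.5 * np.mean(1.0 + log_var - np.square(mu) - np.exp(log_var))

print(kl_loss(np.zeros(16), np.zeros(16)))              # zero: standard normal (mu=0, var=1)
print(kl_loss(np.full(16, 2.0), np.zeros(16)))          # 2.0: mu far off the origin is punished
print(kl_loss(np.zeros(16), np.log(np.full(16, 4.0))))  # positive: var = 4 is punished, too
```

One can also verify the convexity claim: any deviation of var from 1 in either direction raises the sigma-dependent term.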

What are the consequences for the calculation of the gradient components, i.e. partial derivatives with respect to the individual weights or other trainable parameters of the ANN?

Note that the eventual values of the mu and log_var vectors will potentially depend

- on the KL loss and its derivatives with respect to the elements of the mu and log_var tensors,
- on the dependence of the mu and log_var tensors on contributory sums, activations and weights of previous layers of the Encoder,
- and on a sequence of derivatives defined by
  - a loss function for differences between the reconstructed output of the Decoder and the Encoder’s input,
  - the Decoder’s layers‘ sum-like contributions, activation functions and weights,
  - the output layer (better: its weights and activation function) of the Encoder,
  - the sum-like contributions, weights and activation functions of the Encoder’s layers.

All partial derivatives are coupled by the „chain rule“ of differential calculus and followed through the layers – until a specific weight or trainable parameter in the layer sequence is reached. So we speak about a mix, more precisely of a sum of two loss contributions:

- **a reconstruction loss [reco_loss]:** a standard loss for the differences of the output of the VAE with respect to its input and its dependence on all layers‘ activations and weights. We have called this type of loss, which measures the reconstruction accuracy of samples from their z-data, the „reconstruction loss“ for an AE in the last post. We name the respective variable „reco_loss“.
- **a Kullback-Leibler loss [kl_loss]:** a special loss with respect to the values of the mu and log_var tensors and its dependence on previous layers‘ activations and weights. We name the respective variable „kl_loss“.

The two different losses can be combined using weight factors to balance the impact of good reconstruction of the original samples and a confined distribution of data points in the z-space.

total_loss = reco_loss + fact * kl_loss

I shall use the „binary_crossentropy“ loss for the reco_loss term as it leads to good and fast convergence during training.

reco_loss => binary_crossentropy

The total loss is a sum of two terms, but the corrections of the VAE’s network weights from either term by partial derivatives during error backward propagation affect different parts of the VAE: The optimization of the KL-loss directly affects only encoder weights, in our example of the mu-, the var_log- and the Conv2D-layers of the Encoder. The weights of the Decoder are only indirectly influenced by the KL-loss. The optimization of the „reconstruction loss“ instead has a direct impact on all weights of all layers.

A VAE, therefore, is an interesting example for loss contributions depending on certain layers only, or on specific parts of a complex, composite ANN model with sub-divisions. So, the consequences of a rigorous „eager execution mode“ of the present Tensorflow versions for gradient calculations are of real and general interest – also for other networks with customized loss contributions.

We could already define a (preliminary) keras model comprising the Encoder and the Decoder:

enc_output = encoder(encoder_input)
decoder_output = decoder(enc_output)
vae_pre = Model(encoder_input, decoder_output, name="vae_without_kl_loss")

This model would, however, not include the KL loss. We, therefore, must make changes to it. I found four different methods in introductory ML books, where most authors use the Keras functional interface to TF2 (in however early versions) to set up a layer structure similar to ours above. See the references to the books in the final section of this post. Most authors handle the KL loss by using the tensors of our two special (Dense) layers for „mu“ and „log_var“ in the Encoder somewhat later:

- **Method 1:** D. Foster first creates his VAE model and then uses separate functions to calculate the KL loss and an MSE loss regarding the difference between the output and input tensors. Afterwards he puts a function (e.g. total_loss()) which adds both contributions up to a total loss into the Keras compile function as a closure – as in „compile(optimizer=’adam‘, loss=total_loss)“.
- **Method 2:** F. Chollet in his (somewhat older) book on „Deep Learning with Keras and Python“ defines a special *customized* Keras layer in addition to the Encoder and Decoder models of the VAE. This layer receives an internal function to calculate the total loss. The layer is then used as a *final* layer to define the VAE model and invokes the loss by the **Layer.add_loss()** functionality.
- **Method 3:** A. Geron in his book on „Machine Learning with Scikit-Learn, Keras and Tensorflow“ also calculates the KL loss *after* the definition of the VAE model, but associates it with the (Keras) model by the „model.add_loss()“ functionality. He then calls the model’s compile statement with the inclusion of a „binary_crossentropy“ loss as the main loss component – as in **compile(optimizer=’adam‘, loss=’binary_crossentropy‘)**. Geron’s approach relies on the fact that a loss defined by *Model.add_loss()* automatically gets added to the loss defined in the compile statement behind the scenes. Geron directly refers to the mu and log_var layers‘ tensors when calculating the loss. He does not use an intermediate function.
- **Method 4:** R. Atienza structurally does something similar as Geron, but he calculates a total loss, adds it with model.add_loss() and then calls the compile statement without any loss – as in compile(optimizer=’adam‘). Also Atienza calculates the KL loss by directly operating on the layers‘ tensors.
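For methods 3 and 4, a hedged toy sketch of the model.add_loss() route; the two-unit „mu“/„log_var“ head and all layer sizes are illustrative assumptions only, not our VAE:

```python
# Toy sketch of Geron's route (method 3): KL loss computed directly on the
# layers' tensors - no intermediate function - and added via model.add_loss().
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as B

inp = keras.Input(shape=(8,))
mu = keras.layers.Dense(2, name='mu')(inp)
log_var = keras.layers.Dense(2, name='log_var')(inp)
out = keras.layers.Dense(8, activation='sigmoid')(mu)
model = keras.Model(inp, out)

# KL loss directly on the mu/log_var tensors
kl = -0.5e-4 * B.mean(1 + log_var - B.square(mu) - B.exp(log_var))
model.add_loss(kl)

# main loss in compile(); the added KL loss is summed in behind the scenes
model.compile(optimizer='adam', loss='binary_crossentropy')
```

Method 4 (Atienza) would instead add the *total* loss via model.add_loss() and compile without any loss argument.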

A code snippet to realize method 2 is given below. First we define a class for a „Customized keras Layer“.

**Customized Keras layer class**:

class CustVariationalLayer (Layer):

    def vae_loss(self, x_inp_img, z_reco_img):
        # The references to the layers mu and log_var are resolved outside the function
        x = B.flatten(x_inp_img)   # B: tensorflow.keras.backend
        z = B.flatten(z_reco_img)
        # reconstruction loss per sample
        # Note: this is averaged over all features (e.g. 784 for MNIST)
        reco_loss = tf.keras.metrics.binary_crossentropy(x, z)
        # KL loss per sample - we reduce it by a factor of 1.e-3
        # to make it comparable to the reco_loss
        kln_loss = -0.5e-4 * B.mean(1 + log_var - B.square(mu) - B.exp(log_var), axis=1)
        # mean per batch (axis = 0 is automatically assumed)
        return B.mean(reco_loss + kln_loss), B.mean(reco_loss), B.mean(kln_loss)

    def call(self, inputs):
        inp_img = inputs[0]
        out_img = inputs[1]
        total_loss, reco_loss, kln_loss = self.vae_loss(inp_img, out_img)
        self.add_loss(total_loss, inputs=inputs)
        self.add_metric(total_loss, name='total_loss', aggregation='mean')
        self.add_metric(reco_loss,  name='reco_loss',  aggregation='mean')
        self.add_metric(kln_loss,   name='kl_loss',    aggregation='mean')
        return out_img  # not really used in this approach

Now, we add a special layer based on the above class and use it in the definition of a VAE model. I follow the code in F. Chollet’s book; there the customized layer concludes the layer structure *after* the Encoder and the Decoder:

enc_output = encoder(encoder_input)
decoder_output = decoder(enc_output)

# add the custom layer to the model
fc = CustVariationalLayer()([encoder_input, decoder_output])
vae = Model(encoder_input, fc, name="vae")
vae.summary()

The output fits the expectations

Eventually we need to compile. F. Chollet in his book from 2017 defines the loss as „None“ because it is covered already by the layer.

vae.compile(optimizer=Adam(), loss=None)

We are ready to train – at least I thought so … We start training with scaled x_train images coming e.g. from the MNIST dataset by the following statements:

n_epochs = 3
batch_size = 128
vae.fit(x=x_train, y=None, shuffle=True, epochs=n_epochs, batch_size=batch_size)

Unfortunately we get the following typical

**Error message**:

TypeError: You are passing KerasTensor(type_spec=TensorSpec(shape=(), dtype=tf.float32, name=None), name='tf.math.reduce_sum_1/Sum:0', description="created by layer 'tf.math.reduce_sum_1'"), an intermediate Keras symbolic input/output, to a TF API that does not allow registering custom dispatchers, such as `tf.cond`, `tf.function`, gradient tapes, or `tf.map_fn`. Keras Functional model construction only supports TF API calls that *do* support dispatching, such as `tf.math.add` or `tf.reshape`. Other APIs cannot be called directly on symbolic Keras inputs/outputs. You can work around this limitation by putting the operation in a custom Keras layer `call` and calling that layer on this symbolic input/output.

The problem I experienced was related to the strict „eager mode execution“ implemented in Tensorflow ≥ 2.3. If we deactivate „eager execution“ by the following statements **before** we define any layers

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

then we get

You can safely ignore the warning which is due to a superfluous environment variable.

**Addendum 25.05.2022:** This paragraph had to be changed as its original version contained wrong statements.

I had and have code examples for all the four variants of implementing the KL loss listed above. Both methods 1 and 2 do NOT work with standard TF 2.7 or 2.8 and active „**eager execution**“ mode. Similar error messages as described for method 2 came up when I tried the original code of D. Foster, which you can download from his github repository. However, methods 3 and 4 do work – as long as you avoid any intermediate function to calculate the KL loss.

**Important remark about F. Chollet’s present solution:**

I should add that F. Chollet has, of course, meanwhile published a modern and working approach in the present online Keras documentation. I shall address his actual solution in a later post. His book is also the oldest of the four referenced.

In this post we have invested a lot of effort to realize a Keras model for a VAE – including the Kullback-Leibler loss. We have followed recipes of various books on Machine Learning. Unfortunately, we have to face a challenge:

Recipes of how to implement the KL loss for VAEs in books which are older than ca. 2 years may not work any longer with present versions of Tensorflow 2 and its „eager execution mode“.

In the next post of this series we shall have a closer look at the problem. I will try to derive a central rule which will be helpful for the design of various solutions that do work with TF 2.8.

**F. Chollet**, Deep Learning mit Python und Keras, 2018, 1-te dt. Auflage, mitp Verlags GmbH & Co.KG, Frechen

**D. Foster**, „Generatives Deep Learning“, 2020, 1-te dt. Auflage, dpunkt Verlag, Heidelberg, in Kooperation mit O’Reilly Media Inc., ISBN 978-3-960009-128-8. See Kap. 3 and the VAE code published at

https://github.com/davidADSP/GDL_code/

**A. Geron**, „Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow“, 2019, 2nd edition, O’Reilly, Sebastopol, CA, ISBN 978-1-492-03264-9. See chapter 17.

**R. Atienza**, „Advanced Deep Learning with Tensorflow 2 and Keras“, 2020, 2nd edition, Packt Publishing, Birmingham, UK, ISBN 978-1-83882-165-4. See chapter 8.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.


I have briefly discussed the basic elements of a Variational Autoencoder [VAE]. Among other things we identified an Encoder, which transforms sample data into z-points in a low dimensional „latent space“, and a Decoder which reconstructs objects in the original variable space from z-points.

In the present post I want to demonstrate that a simple **Autoencoder** [AE] works as expected with Tensorflow 2.8 [TF2] – in contrast to certain versions of *Variational Autoencoders*, which we shall test in the next post. For our AE we use the **„binary cross-entropy“** as a suitable loss to compare reconstructed MNIST images with the original ones.

I use the functional Keras API to set up the Autoencoder. Later on we shall encapsulate everything in a class, but let’s keep things simple for the time being. We design both the Encoder and the Decoder as CNNs and use properly configured Conv2D and Conv2DTranspose layers as their basic elements.

**Required Imports**

# tensorflow and keras
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.keras import backend as B
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import metrics
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, \
                                    AlphaDropout, Concatenate, Rescaling, ZeroPadding2D
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import schedules

**z_dim and the Encoder defined with the functional API**

We set the dimension of the „latent space“ to **z-dim = 16**.

This allows for a very good reconstruction of images in the MNIST case.

z_dim = 16

encoder_input = Input(shape=(28,28,1))   # input images' pixel values are scaled by 1./255.
x = encoder_input
x = Conv2D(filters = 32,  kernel_size = 3, strides = 1, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 64,  kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2D(filters = 128, kernel_size = 3, strides = 2, padding='same')(x)
x = LeakyReLU()(x)
shape_before_flattening = B.int_shape(x)[1:]   # B: Keras backend
x = Flatten()(x)
encoder_output = Dense(z_dim, name='encoder_output')(x)

encoder = Model([encoder_input], [encoder_output], name="encoder")

This is almost an overkill for something as simple as MNIST data. The Decoder reverses the operations. To do so it uses Conv2DTranspose layers.

**The Decoder**

dec_inp_z = Input(shape=(z_dim))
x = Dense(np.prod(shape_before_flattening))(dec_inp_z)
x = Reshape(shape_before_flattening)(x)
x = Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2DTranspose(filters=32, kernel_size=3, strides=2, padding='same')(x)
x = LeakyReLU()(x)
x = Conv2DTranspose(filters=1,  kernel_size=3, strides=1, padding='same')(x)
x = LeakyReLU()(x)
# Output into a region of 0 to 1 - requires a scaling of the
# input samples' pixel values to [0, 1]
x = Activation('sigmoid')(x)
decoder_output = x

decoder = Model([dec_inp_z], [decoder_output], name="decoder")

**The full AE-model**

```python
ae_input  = encoder_input
ae_output = decoder(encoder_output)
AE = Model(ae_input, ae_output, name="AE")
```

That was easy! Now we *compile* the model:

```python
AE.compile(optimizer=Adam(learning_rate = 0.0005), loss='binary_crossentropy')
```

I use the „binary_crossentropy“ loss function to evaluate and punish differences between the reconstructed images and the originals. Note that we need to scale the pixel values of the input images (such as MNIST images) down to the value range [0, 1] because of the sigmoid activation function at the output layer of the Decoder.
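The required scaling can be sketched as follows. This is a hedged sketch with random stand-in data; in a real run `x_train` would come from `keras.datasets.mnist.load_data()`:

```python
import numpy as np

# Hypothetical stand-in for the MNIST training images; in a real run:
# (x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = np.random.randint(0, 256, size=(8, 28, 28)).astype("float32")

# Scale the pixel values from [0, 255] down to [0, 1] to match the
# sigmoid output of the Decoder
x_train = x_train / 255.

# Add the channel dimension expected by the Conv2D input layer
x_train = x_train.reshape(-1, 28, 28, 1)

print(x_train.shape)   # → (8, 28, 28, 1)
```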

Using „binary cross-entropy“ leads to a better and faster convergence of the training process.

Eventually, we can train our model:

```python
n_epochs   = 40
batch_size = 128
AE.fit(x=x_train, y=x_train, shuffle=True, epochs=n_epochs, batch_size=batch_size)
```

After 40 training epochs we can visualize the resulting data clusters in z-space. To get a 2-dimensional plot out of 16-dimensional data requires the use of **t-SNE**. For 15,000 training samples (out of 60,000) we get:

Our Encoder CNN was able to separate the different classes quite well by extracting basic features of the handwritten digits with its Conv2D layers. The test data, which the AE had never seen before, are well separated, too:

Note that the different positions of the clusters in the 2-dim plots are due to transformations and arrangements performed by t-SNE and have no deeper meaning.
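A t-SNE projection of the kind used above can be sketched with Scikit-Learn. Here random stand-in data replace the real encoder predictions; in a real run `z_points` would be the output of `encoder.predict()` on a subset of the training data:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for z-points produced by encoder.predict(x_train_subset);
# in a real run these would be the 16-dim encoder outputs
rng = np.random.default_rng(42)
z_points = rng.normal(size=(200, 16))

# Project the 16-dim z-points onto 2 dimensions for plotting
tsne = TSNE(n_components=2, perplexity=20.0, init="random", random_state=42)
z_2d = tsne.fit_transform(z_points)

print(z_2d.shape)   # → (200, 2)
```

The 2-dim result can then be fed into a scatter plot, colorized by digit class.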

**What about the reconstruction quality?**

For z_dim = 16 it is almost perfect: Below you see a sequence of rows presenting original images and their reconstructed counterparts for selected MNIST test data, which the AE had not seen before:

When we set **z_dim = 2** the reconstruction quality of course suffers:

Regarding class related clusters we can directly plot the data distributions in the z-space:

We see a relatively vast spread of data points in clusters for „6“s and „0“s. We also recognize the typical problem zones for „3“s, „5“s, „8“s on one side and for „4“s, „7“s and „9“s on the other side. They are difficult to distinguish with only 2 dimensions.

Autoencoders are simple to set up and work as expected together with Keras and Tensorflow 2.8.

In the next post

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

we shall, however, see that VAEs and their KL loss may lead to severe problems.

Ceterum censeo: The worst living fascist and war criminal today, who must be isolated, denazified and imprisoned, is the Putler.

Actually, I had some two year old Python modules for VAEs available – all based on recommendations in the literature. My VAE model setup had been done with Keras, more precisely with the version integrated into Tensorflow 2 [TF2]. But I ran into severe trouble with my present Tensorflow versions, namely 2.7 and 2.8. The cause of the problems was my handling of the so called **Kullback-Leibler [KL] loss** in combination with the „eager execution mode“ of TF2.

Actually, the problems may already have arisen with earlier TF versions. With this post series I hope to save time for some readers who, as myself, are no ML professionals. In a first post I will briefly repeat some basics about Autoencoders [AEs] and Variational Autoencoders [VAEs]. In a second post we shall build a simple Autoencoder. In a third post I shall turn to VAEs and discuss some variants for dealing with the KL loss – all variants will be picked from recipes of ML teaching books. I shall demonstrate afterward that some of these recipes fail for TF 2.8. A fourth post will discuss the problem with „eager execution“ and derive a practical rule for calculations based on layer specific data. In three further posts I apply the rule in the form of three different variants for the KL loss of VAE models. I will set up the VAE models with the help of the functional Keras API. The suggested solutions all *do* work with Tensorflow 2.7 and 2.8.

If you are not interested in the failures of classical approaches to the KL loss presented in teaching books on ML and if you are not at all interested in theory you may skip the first four posts – and jump to the solutions.

I assume that the reader is already familiar with the concepts of „Variational Autoencoders“. I nevertheless discuss some elements below to get a common vocabulary.

I call the vector space which describes the input samples the "**variable space**". As AEs and VAEs are typically used for images this „variable space“ has many dimensions – from several hundred up to millions. I call the number of these dimensions "**n_dim**".

An input sample can either be represented as a vector in the variable space or by a multi-ranked tensor. An image with 28×28 pixels and gray „colors“ corresponds to a 784-dimensional vector in the variable space or to a „rank 2 tensor“ with the *shape* (28, 28). I use the term „**rank**“ to avoid any notional ambiguities regarding the term „dimensions“. Remember that people sometimes speak of 2-„dimensional“ Numpy matrices with 28 elements in each dimension.
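The distinction can be made explicit with a trivial Numpy sketch:

```python
import numpy as np

img = np.zeros((28, 28))   # a gray image: a tensor of rank 2 with shape (28, 28)
vec = img.reshape(-1)      # the same data as a rank 1 tensor, i.e. a vector
                           # with 784 dimensions in the "variable space"

print(img.ndim, vec.shape)   # → 2 (784,)
```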

An **Autoencoder** [AE] typically consists of two connected artificial neural networks [ANN]:

- **Encoder**: maps a point of the „variable space“ to a point in the so called „latent space“.
- **Decoder**: maps a point of the „latent space“ to a point in the „variable space“.

Both parts of an AE can e.g. be realized with Dense and/or Conv2D layers. See the next post for a simple example of a VAE layer-layout. With Keras both ANNs can be defined as individual „Keras models“ which we later connect.

Some more details:

**Encoder:**

The **Encoder** ANN creates a vector for each input sample (mostly images) in a low-dimensional vector space, often called the „**latent space**“. I will sometimes use the abbreviation „**z-space**“ for it and call a point in it a „**z-point**“. The number of dimensions of the „z-space“ is given by „**z_dim**“. A „z-point“ is therefore represented by a z_dim-dimensional vector (a special kind of tensor of rank 1). The Encoder realizes a non-linear transformation of n_dim-dimensional vectors to z_dim-dimensional vectors. Encoders, actually, are very effective data compressors. **z_dim** may take values between 2 and a few hundred – depending on the complexity of the data and objects represented by the input samples, whose own dimensions are much higher (some hundreds to millions).

**Decoder:**

The **Decoder** ANN instead takes arbitrary low-dimensional vectors from the „latent space“ as input and „**reconstructs**“ tensors with the same rank and dimensions as the input samples. These output samples can, of course, also be represented by n-dimensional vectors in the original variable space. The Decoder is a kind of inversion of the Encoder. This is also reflected in its structure, as we shall see later on. One objective of (V)AEs is that the reconstruction vector for a defined input sample should be pretty close to the input vector of the very same sample – with some metric to measure distances. In plain terms for images: An output image of the Decoder should be pretty similar to the input image of the Encoder when we feed the Decoder with the z-dim vector created by the Encoder.

A **Keras model for the (V)AE** can be built by connecting the models for the Encoder and the Decoder. The (V)AE model is trained as a whole to minimize differences between the input (e.g. an image) presented to the Encoder and the created output of the Decoder (a reconstructed image). A (V)AE is trained „unsupervised“ in the sense that the „label data“ for the training are identical to the Encoder input samples.

When you build the Encoder and the Decoder as „Convolutional Neural Networks“ [CNNs] they support data compression and reconstruction very effectively by a parallel extraction of basic „features“ of the input samples and by putting the related information into neural „maps“. (The neurons in such a „map“ react sensitively to correlation patterns of components of the tensors representing the input data samples.) For gray images such CNN networks would accept rank 2 tensors (plus a channel dimension, i.e. tensors of shape (28, 28, 1) in the MNIST case).

Note that the (V)AE’s knowledge about basic „features“ (or data correlations) in the input samples is encoded in the weights of the „Encoder“ and „Decoder“. The points in the „latent space“ which represent the input samples may show cluster patterns. The clusters could e.g. be used in classification tasks. However, without an appropriate Decoder with fitted weights you will not be able to reconstruct a complete image from a point in the z-space. Similar to encryption scenarios the Decoder is a kind of „key“ to retrieve original information (even with reduced detail quality).

*Variational* Autoencoders take care of a reasonable data organization in the latent space – in addition to data compression. The objective is to get well defined and segregated clusters there for „features“ or maybe even for classes of input samples. But the clusters should also be confined to a common, relatively small and compact region of the z-space.

To achieve this goal one judges the arrangement of data points, i.e. their distribution in the latent space, with the help of a special loss contribution. I.e. we use specially designed costs which punish vast extensions of the data distribution off the origin in the latent space.

The data distribution is controlled by parameters, typically a mean value „mu“ and a standard deviation or variance „var“. These parameters provide additional degrees of freedom during training. The special loss used to control the predicted distribution in comparison to an ideal one is the so called „**Kullback-Leibler**“ [KL] loss. It measures the „distance“ of the real data distribution in the latent space from a more confined ideal one.
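For Gaussian distributions with predicted mean values „mu“ and logarithmic variances „log_var“ per z-dimension, the KL divergence from a standard normal distribution reduces to a simple closed form. Below is a hedged Numpy sketch of that expression; the VAE classes of this series evaluate the same formula with Keras backend functions on layer tensors:

```python
import numpy as np

def kl_loss(mu, log_var):
    # KL divergence of N(mu, exp(log_var)) from the standard normal N(0, 1),
    # summed over the latent dimensions and averaged over the batch
    kl_batch = -0.5 * np.sum(1. + log_var - mu**2 - np.exp(log_var), axis=1)
    return np.mean(kl_batch)

# For mu = 0 and log_var = 0 the predicted distribution already is N(0, 1)
# and the KL loss vanishes
mu      = np.zeros((4, 16))
log_var = np.zeros((4, 16))
print(kl_loss(mu, log_var))   # → 0.0

# Any deviation from the standard normal distribution is punished
print(kl_loss(mu + 1., log_var) > 0.)   # → True
```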

You find more information about the KL-loss in the books of D. Foster and R. Atienza named in the last section of this post.

Why are VAEs and their „KL loss“ important? Basically, because VAEs help you to confine data in the z-space. Especially when the latent space has a very low number of dimensions. VAEs thereby also help to condense clusters which may be dominated by samples of a specific class or a label for some specific feature. Thus VAEs provide the option to define paths between such clusters and allow for „feature arithmetics“ via vector operations.
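Such „feature arithmetics“ boil down to simple vector operations on z-points. A sketch with made-up cluster means (in a real application the means would be computed from encoded samples with and without the respective feature):

```python
import numpy as np

z_dim = 16
rng = np.random.default_rng(1)

# Hypothetical mean z-points of two sample groups, e.g. encoded images
# with and without a certain feature (in reality: means over encoder outputs)
z_mean_with    = rng.normal(size=(z_dim,))
z_mean_without = rng.normal(size=(z_dim,))

# The difference vector approximately encodes the feature
z_feature = z_mean_with - z_mean_without

# "Feature arithmetics": add the feature to the z-point of some other sample
z_sample = rng.normal(size=(z_dim,))
z_new    = z_sample + z_feature

# A path between two z-points, e.g. for a smooth morphing via the Decoder
t = 0.5
z_on_path = (1. - t) * z_sample + t * z_new

print(z_new.shape, z_on_path.shape)   # → (16,) (16,)
```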

See the following images of MNIST data mapped into a 2-dimensional latent space by a standard Autoencoder:

Despite the fact that the AE’s sub-nets (CNNs) already deal very well with features, the resulting clusters for the digit classes are spread over a relatively large area of the latent space: [-7 ≤ x ≤ +7], [-9 ≤ y ≤ +6]. Had we used only the dense layers of an MLP, the spread of data points would have covered an even bigger area of the latent space.

The following three plots resulted from VAEs with different parameters. Watch the scale of the axes.

The data are now confined to a region of around [[-4,4], [-4,4]] with respect to [x,y]-coordinates in the „latent space“. The available space is more efficiently used. So, you want to avoid any problems with the KL-loss in VAEs due to some peculiar requirements of Tensorflow 2.x.

AEs are easy to understand; VAEs have subtle additional properties which are valuable for certain tasks. But also for reasons of principle we want to enable a Keras model of an ANN to calculate quantities which depend on the elements of one or a few given layers. One example of such a quantity is the KL loss.

In the next post

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

we shall build an Autoencoder and apply it to the MNIST dataset. This will give us a reference point for VAEs later.

Stay tuned ….



KMeans as a classifier for the WIFI and MNIST datasets – I – Cluster analysis of the WIFI example

KMeans as a classifier for the WIFI and MNIST datasets – II – PCA in combination with KMeans for the WIFI-example

KMeans as a classifier for the WIFI and MNIST datasets – III – KMeans as a classifier for the WIFI-example

KMeans as a classifier for the WIFI and MNIST datasets – IV – KMeans on PCA transformed data

we have so far studied the application of KMeans to the WIFI dataset from the UCI Irvine machine learning repository. We now apply the KMeans clustering algorithm to the MNIST dataset – also in an extended form, namely as a *classifier*. The MNIST dataset – a collection of 28x28px images of handwritten numbers – has already been discussed in other sections of this blog and is well documented on the Internet. I, therefore, do not describe its basic properties in this post. A typical image of the collection is

Due to the ease of use, I loaded the MNIST data samples via TF2 and the included Keras interface. Otherwise TF2 was not used for the following experiments. Instead the clustering algorithms were taken from „sklearn“.

Each MNIST image can be transformed into a one-dimensional array with dimension 784 (= 28 * 28). This means the MNIST feature space has a dimension of 784 – which is much more than the seven dimensions we dealt with when analyzing the WIFI data in the last post. All MNIST samples were shuffled for individual runs.

A good question is whether we should scale or normalize the sample data for clustering – and if so by what formula. I could not answer this question directly; instead I tested multiple methods. Previous experience with PCA and MNIST indicated that Sklearn’s „Normalizer“ would be helpful, but I did not take this as granted.

A simple scaling method is to just divide the pixel values by 255. This brings all 784 data array elements of each image into the value range [0,1]. Note that this scaling does not change relative length differences of the sample vectors in the feature space. Neither does it shift or change the width of the data distribution around its mean value. Other methods would be to *standardize* the data or to *normalize* them, e.g. by using respective algorithms from Scikit-Learn. Using either method in combination with a cluster analysis corresponds to a theory about the cluster distribution in the feature space. Normalization would mean that we assume that the clusters do not so much depend on the vector length in the feature space but mainly on the angle of the sample vectors. We shall later see what kind of scaling helps when we classify the MNIST data based on clusters.
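The three candidate methods can be compared directly with Scikit-Learn. A sketch on random stand-in data (real runs would use the flattened MNIST arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Stand-in for flattened image data with pixel values in [0, 255]
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 784)).astype("float64")

# 1) Simple scaling: preserves relative vector lengths and the shape
#    of the data distribution
X_scaled = X / 255.

# 2) Standardization: zero mean and unit variance per feature
X_std = StandardScaler().fit_transform(X)

# 3) Normalization: every sample vector gets unit length, i.e. only the
#    direction (angle) of a sample vector in the feature space is kept
X_norm = Normalizer().fit_transform(X)

print(np.allclose(np.linalg.norm(X_norm, axis=1), 1.0))   # → True
```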

In a first approach we leave the data as they are, i.e. unscaled.

All the following cluster calculations were done on 3 out of 8 available (hyperthreaded) CPU cores. For KMeans and MiniBatchKMeans we used:

```python
n_init       = 100     # number of initial cluster configurations to test
max_iter     = 100     # maximum number of iterations
tol          = 1.e-4   # final deviation of subsequent results (= stop condition)
random_state = 2       # a random state number for repeatable runs
mb_size      = 200     # size of minibatches (for MiniBatchKMeans)
```

The number of clusters „num_clus“ was defined individually for each run.

A naive approach to perform an elbow analysis, as we did for the WIFI-data, would be to apply KMeans of Sklearn directly to the MNIST data. But a test run on the CPU shows that such an endeavor would cost too much time. With 3 CPU cores and only a very limited number of clusters and iterations

```python
n_init   = 10     # only a few initial configurations
max_iter = 50
tol      = 1.e-3
num_clus = 25     # only a few clusters
```

a KMeans fit() run applied to 60,000 training samples [len(X_train) => 60,000]

```python
kmeans.fit(X_train)
```

requires around 42 secs. For 200 clusters the cluster analysis requires around 214 secs. Doing an elbow analysis would therefore require many hours of computational time.

To overcome this problem I had to use **MiniBatchKMeans**. It is faster by a factor of more than 80.

When we use the following setting for MiniBatchKMeans

```python
n_init   = 50
max_iter = 100
tol      = 1.e-4
mb_size  = 200
```

I could perform an elbow analysis for all cluster numbers 1 < k <= 250 in less than 20 minutes. The following graphic shows the resulting inertia curve vs. the cluster number:

The „elbow“ is not very pronounced. But I would say that by using a cluster number around 200 we are on the safe side. By the way: The shape of the curve does not change very much when we apply Sklearn’s Normalizer to the MNIST data ahead of the cluster analysis.
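The elbow analysis itself is just a loop over cluster numbers which collects the „inertia“ (the summed squared distances of the samples to their closest cluster centers). A compressed sketch on synthetic blob data; on MNIST the loop would run over 1 < k <= 250 with the settings given above:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with 10 well separated blobs
X, _ = make_blobs(n_samples=2000, centers=10, n_features=20, random_state=2)

inertias = []
k_values = range(2, 16)
for k in k_values:
    mbk = MiniBatchKMeans(n_clusters=k, n_init=10, max_iter=100,
                          tol=1.e-4, batch_size=200, random_state=2)
    mbk.fit(X)
    inertias.append(mbk.inertia_)

# The inertia drops with growing k; the "elbow" of the curve marks a
# reasonable cluster number
print(inertias[0] > inertias[-1])   # → True
```

Plotting `inertias` over `k_values` (e.g. with Matplotlib) then gives the elbow curve.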

We now let the fitted cluster algorithm predict the cluster membership of the training data for k=225 clusters:

```python
n_clu    = 225
mb_size  = 200
max_iter = 120
n_init   = 100
tol      = 1.e-4
```

Based on the resulting data we afterward apply the same type of algorithm which we used for the WIFI data to construct a „*classifier*“ based on clusters and a respective predictor function (see the last post of this series).
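The core idea of such a „cluster classifier“ can be sketched as follows: assign each cluster the digit class which dominates among its members, then classify a sample by the dominant class of its predicted cluster. The sketch below uses synthetic blobs instead of MNIST, but the mapping logic is the same:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for (X_train, y_train)
X, y = make_blobs(n_samples=2000, centers=10, n_features=20, random_state=2)

kmeans = MiniBatchKMeans(n_clusters=25, n_init=10, batch_size=200, random_state=2)
clusters = kmeans.fit_predict(X)

# Majority vote: map each cluster to the class dominating its members
cluster_to_class = np.array(
    [np.bincount(y[clusters == c], minlength=10).argmax() for c in range(25)]
)

# The predictor function: cluster membership -> dominant class
def predict_classes(X_new):
    return cluster_to_class[kmeans.predict(X_new)]

acc = np.mean(predict_classes(X) == y)
print(acc > 0.9)   # → True (the blobs are well separated)
```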

The data distribution for the 10 different digits of the training set was:

```
class 0 : 5905     class 1 : 6721     class 2 : 6031     class 3 : 6082     class 4 : 5845
class 5 : 5412     class 6 : 5917     class 7 : 6266     class 8 : 5860     class 9 : 5961
```

How well does the cluster membership of a sample define its digit class?

Well, out of 225 clusters there were only around 15 for which I got an „error“ above 40%, i.e. for which the relative fraction of data samples deviating from the dominant class of the cluster was above 40%. For the vast majority of clusters, however, samples of one specific digit class dominated the clusters members by more than 90%.

The resulting confusion matrix of our new „cluster classifier“ for the (unscaled) MNIST data looks like

```
[[5695    4   37   21    7   57   51   15   15    3]
 [   0 6609   33   21   11    2   15   17    2   11]
 [  62   45 5523  120   14   10   27  107  116    7]
 [  11   43  114 5362   15  153    8   60  267   49]
 [   5   60   62    2 4752    3   59   63    5  834]
 [  54   18  103  777   25 4158  126    9  110   32]
 [  49   20   56    4    6   38 5736    0    8    0]
 [   5   57   96    2   86    1    0 5774    7  238]
 [  30   76  109  416   51  152   39   35 4864   88]
 [  25   20   37   84  706   14    6  381   46 4642]]
```

This confusion matrix comes as no surprise: The digits „4“, „5“, „8“, „9“ are somewhat error prone. Actually, everybody familiar with MNIST images knows that sometimes „4“s and „9“s can be mixed up even by the human eye. The same is true for handwritten „5“s, „8“s and „3“s.

Another representation of the confusion matrix is:

The calculation of the matrix elements was done in the standard way – the sum over the percentages in a row gives 100% (slight deviations in the matrix are due to rounding). I.e., the off-diagonal elements of a row represent errors of the FN-type (False Negatives) with respect to the row’s class.
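The row-wise percentages can be obtained directly with Scikit-Learn; a sketch with made-up label arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted classes of a few samples
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

# normalize='true' divides each row by the number of samples of its
# true class, so every row sums to 1 (i.e. 100%)
cm = confusion_matrix(y_true, y_pred, normalize='true')
print(cm.sum(axis=1))   # → [1. 1. 1.]
```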

The confusion matrix for the remaining 10,000 **test** data samples is:

The relative errors we get for our classifier when applied to the training and test data are:

**rel_err_train = 0.115**,

**rel_err_test = 0.112**

All for *unscaled* MNIST data. Taking into account the crudeness of the whole approach this is a rather convincing result. It proves that it is worth the effort to perform a cluster analysis on high dimensional data:

- It provides a first impression whether the data are structured in the feature space such that we can find relatively good separable clusters with dominant members belonging to just one class.
- It also shows that a cluster based classification for many datasets cannot reach accuracy levels of CNNs, but that it may still deliver good results. Without any supervised training …

The second point also proves that the distance of the data points to the various cluster centers contains valuable information about the class membership. So, a MLP or CNN based classification could be performed on *transformed* MNIST data, namely distance vectors of sample datapoints to the different cluster centers. This corresponds to a dimension reduction of the classification problem. Actually, in a different part of this blog, I have already shown that such an approach delivers accuracy values beyond 98%.

For MNIST we can say that the samples define a relatively well separable cluster structure in the feature space. The granularity required to resolve the classes sufficiently well corresponds to a cluster number of around 200 < k < 250. Then we get an accuracy close to 90% for cluster based classification.

Can we somehow confirm this finding about a good cluster-class-relation independently? Well, in a limited way. The **t-SNE** algorithm, which can be used to „project“ multidimensional data onto a 2-dimensional plane, respects the vicinity of vectors in the original feature space whilst deriving a 2-dim representation. So, a rather well structured t-SNE diagram is an indication of clustering in the feature space. And indeed, for 10,000 randomly selected samples of the (shuffled) training data we get:

The colorization was done by classes, i.e. digits. We see a relatively good separation of major „clusters“ with data points belonging to a specific class. But we also can identify multiple problem zones, where data points belonging to different classes are intermixed. This explains the confusion matrix. It also explains why we need so many fine-grained clusters to get a reasonable resolution regarding a reliable class-cluster-relation.

Can we improve the accuracy of our cluster based classification a bit? This would, e.g., require some transformation which leads to a better cluster separation. To see the effect of two different **scalers** I tried Sklearn’s „**Normalizer**“ and also its „**StandardScaler**“. Actually, they work in opposite directions regarding accuracy:

The „Normalizer“ improves the accuracy by more than 1.5%, while the „StandardScaler“ reduces it by almost the same amount.

I only discuss results for „Normalization“ below. The confusion matrix for the **training data** becomes:

and for the **test data**:

The relative errors now become:

Error for training data:

**avg_err_train = 0.085** :: num_err_train = 5113

Error for test data:

**avg_err_test = 0.083** :: num_err_test = 832

So, the relative accuracy is now around **91.5%**.

The result depends a bit on the composition of the training and the test dataset after an initial shuffling. But the value remains consistently above **90%**.

Just for interest I also had a look at a very different approach to invoke clustering:

I first applied a simple CNN-based **AutoEncoder** [AE] to compress the MNIST data into a 25-dimensional space and applied our clustering methods afterwards.

I shall not discuss the technology of autoencoders in this post. The only relevant point in our context is that an autoencoder provides an efficient non-linear way of data compression and dimensionality reduction – among many other useful properties and abilities. Note: I did not use a „Variational Autoencoder“, which would have allowed for even better results. The loss function for the AE was a simple quadratic loss. The autoencoder was trained on 50,000 training samples and for 40 epochs.
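A minimal sketch of such a compressing AE with Keras – here with a made-up Dense layer layout for brevity, while the real model of this post used Conv2D layers and was trained on 50,000 samples:

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

latent_dim = 25   # the compressed feature space for the cluster analysis

# Encoder: 784 -> 25, Decoder: 25 -> 784 (Dense layers only, as a sketch)
inp = Input(shape=(784,))
z   = Dense(latent_dim, activation='relu', name='z_layer')(
          Dense(128, activation='relu')(inp))
out = Dense(784, activation='sigmoid')(
          Dense(128, activation='relu')(z))

ae      = Model(inp, out, name='AE')
encoder = Model(inp, z, name='encoder')

# A simple quadratic loss, as in the text
ae.compile(optimizer='adam', loss='mse')

# After training via ae.fit(x, x, ...) the encoder compresses the data
# into the 25-dim space on which the cluster analysis is then run
x = np.random.rand(4, 784).astype('float32')
x_compressed = encoder.predict(x, verbose=0)
print(x_compressed.shape)   # → (4, 25)
```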

A t-SNE based plot of the „clusters“ for test data in the 25-dimensional space looks like:

We see that the separation of the data belonging to different classes is somewhat better than before. Therefore, we expect a slightly better classification based on clusters, too. Without any scaling we get the following confusion data:

```
[[5817    7   10    3    1   14   15    2   27    1]
 [   3 6726   29    2    0    1   10    5   12   10]
 [  49   35 5704   35   14    4   10   61   87    7]
 [   8   78   48 5580   22  148    2   40  111   29]
 [  47   27   18    0 4967    0   44   38    3  673]
 [  32   20   10  150    8 5039   73    4   43   28]
 [  31   11   23    2    2   47 5746    0   15    1]
 [   6   35   35    6   32    0    1 5977    7  163]
 [  17   67   22   86   16  217   24   22 5365   52]
 [  35   32   11   92  184   15    1  172   33 5406]]
```

Error averaged over (all) clusters: 6.74

The resulting relative error for the test data was:

**avg_err_test = 0.0574** :: num_err_test = 574

**With normalization:**

Error for test data: **avg_err_test = 0.054**

So, after performing the autoencoder training on normalized data we consistently get

an accuracy of around **94%**.

This is not too much of a gain. But remember:

We performed a cluster analysis on a feature space with **only 25 dimensions** – which, of course, is much cheaper. However, we paid a price, namely the Autoencoder training, which lasted about 150 secs on my old Nvidia 960 GTX.

And note: Even with only 100 clusters we get above 92% on the AE-compressed data.

We have shown that a non-supervised cluster analysis of the MNIST data with around 225 clusters allows for classifying images with an accuracy above 90%. In combination with an Autoencoder compression we even reached values around **94%**. This is comparable with other non-optimized standard algorithms aside from neural networks.

This means that the MNIST data samples are organized in a well separable cluster structure in their feature space. A test run with *normalized* data showed that the clusters (and their centers) differ mostly by their direction relative to the origin of the feature space and not so much by their distance from the origin. With a relatively fine grained resolution we could establish a simple cluster-class-relation which allowed for **cluster based classification**.

The accuracy is, of course, below the values reachable with optimized MLPs (98%) and CNNs (above 99%). But clustering is a fast, reliable and non-supervised method. In addition, in combination with t-SNE, we can create plots which can easily be understood by customers. So, even for more complex data I would always recommend trying a cluster based classification approach if you need to provide plots and quick results. Sometimes the accuracy may even be sufficient for your customer’s purposes.
