Tensors in ML are not tensors used in theoretical physics

People who have a physics education and want to start with Machine Learning [ML] often stumble across the use of the term “tensor” in typical Machine Learning frameworks. For a physicist the description of a “tensor” in ML appears strange from the very beginning – and one asks oneself: Have I mist something? Is there a secret relation and mapping between the term in physics and the term in ML?

Guys, let me warn you: The expression “tensor” in theoretical and mathematical physics has a (very) different meaning than what ML people use as tensors in ML. To say it clearly:

Tensors used in physics (e.g. in General Relativity or QFT) are not the same as tensors in ML.

For historical reasons even could even say:
The central term “tensor” in ML (e.g. by Google in Tensorflow) represents a kind of misuse of the original term and related framework developed by Ricci.

I admit that the last time I have worked professionally in physics and astrophysics is 3 decades ago. But I shall describe the major properties of physical tensors below as good as I remember them from that time. For people interested in a quick refresher on the topic I recommend the book “Tensors made easy”, 6th edition, 2018, ISBN 978-1-326-29253-9.

Tensors in theoretical physics

Physicists will regard tensors as multi-linear forms defined on a multidimensional (metrical) vector-space. It is a multi-linear function where the rank is the number of arguments accepted. A tensor is a function which is linear in all its arguments. The number of arguments it takes is called the ‘rank.’ Tensors can accept a certain number of vectors or covectors (which are 1-forms that basically introduce a scalar product on the vector space) as arguments. Most importantly:

Tensor objects follow certain rules regarding their component transformation under a change of the base vectors of the vector space. Components of tensors transform either in a contravariant or covariant form with respect to the (linear) change of base vectors.

Tensor fields can be defined on differentiable and curved affine (metrical) manifolds defined over Rn. The description of curved (affinely connected) manifolds requires coordinate systems and terms of differential geometry. Tensor equations will keep their form in case of a transformation of the coordinate system (by differentiable transformation functions defining a Jacobi matrix for old and new coordinates at each point). In the case of differential equations this requires to introduce so called covariant derivatives. This allows to keep up certain Tensor relations (including those based on differential operations) during coordinate transformations.

So, tensors in mathematical physics are bound to transformatory properties regarding the change of the base vectors of underlying vector spaces and the change of coordinate systems in complex geometries. Tensors in physics describe the property of physical objects independently of the choice of a specific coordinate system in space-time or other finite or infinite vector spaces as e.g Hilbert-spaces. The reference of the tensor definition to multi-linear forms, vectors and co-vectors reveals their complex structure and properties.

Note: Physical tensors with rank ≥ 2 have n**rank components, where n is the dimension of the underlying vector space. The rank and the vector-space define the number of components.

Tensors in ML => ML-tensors

Tensors in Machine Learning are most of all variants of multidimensional arrays. To distinguish them clearly from tensors in mathematical physics I will call them below “ML-tensors”.

I quote from the documentation of tensorflow:

“Tensors are multi-dimensional arrays with a uniform type (called a dtype). You can see all supported dtypes at tf.dtypes.DType. If you’re familiar with NumPy, tensors are (kind of) like np.arrays. All tensors are immutable like Python numbers and strings: you can never update the contents of a tensor, only create a new one.”

(Highlighting was done by me.

An introductory book to PyTorch – “PyTorch for Deep Learning”, Ian Pointer, 2021, O’Reilly – expresses the following opinion (freely summarized from the German text version): A tensor is a container for numbers. It is also accompanied by a set of rules which define transformations between tensors. The easiest way for an understanding a tensor in ML is probably to think of it as a multidimensional array.

Arrays simply allow for a specific “multidimensional” organisation of data to describe some objects of interest by basically independent variables. The important point is that these variables get ordered with respect to a few central, often apparent aspects of the object. In case of a picture two such aspects might be the spatial dimensions in terms of width and height.

Tensors in ML thus have axes or dimensions of the array to describe the organization of its components in a kind multidimensional structure. Each axis symbolizes one major aspect. Along each axis we allow for a number of distinct, indexed positions fixing the “size” of this dimension. Actually a axis corresponds to an indexed list of discrete elements; it should not be confused with an axis describing a dimension of a continuous spatial geometry mapped to RN. The axes of an array define a discrete multidimensional lattice of points at defined distances.

The number of dimensions is the so called rank of a ML-tensor. All sizes for the individual dimensions are gathered in a tuple called shape.

A ML-tensor of rank 3 can be thought of an arrangement of numbers (components) in a 3-dimensional lattice. However, the shape of this lattice can show a different number of elements for each of the three dimensions. I.e. an array of shape (50,2 16) can be a ML-tensor with values for 50x2x16 =1600 logically independent variables.

Think of a color image with a rectangular form and a certain pixel resolution (100 x 50 px). We may arrange the RGB-pixel values in form of an array with shape (100x50x3). Here we would have 15000 logically independent variables to characterize our object.

A rank alone obviously does not define the number of elements of an ML-tensor – we need the shape in addition.

Note also that the organization of the elements of a ML-tensor can be changed if required – one can e.g. transform a tensor of rank 3 with 1600 elements into one of rank 1 with the same number of elements in a certain linear ordered way. Such an ordered reorganization of a tensor’s elements into a tensor of different rank corresponds in Numpy to a reshaping of an array.

ML-tensors can in the same way as arrays be objects to certain algebraic operations with other “Ml-tensor” objects. For such operations the two involved tensors must obey certain rules regarding their shapes. There is a partial overlap of such operations with those used in Linear Algebra for linear mappings between objects of vector spaces of (different) finite dimensions. So, its no wonder that libraries for linear algebra dominate the field of ML-calculations.

Some of the required operations can be performed very fast and distributed over a series of processing units (or cores) which can work in parallel.

Note that the layers of modern Artificial Neural networks [ANNs] transform ML-tensors to ML-tensors. A change of ranks and the number of elements is possible during the operations of certain layers of ANNs.

Conclusion

ML-tensors simply are a clever arrangement of data describing objects of interest in the form of a multidimensional array. ANNs transform such ML-tensors and a change of the rank during such transformations is very common.

Tensors in physics are complex objects corresponding to multi-linear forms defined on vector spaces. Tensors in physics must fulfill certain properties regarding their components and their relations in case of linear transformations of the base vectors and in case of tensor fields in affine and metrical geometries also regarding the change of coordinate systems. Tensors keep their rank during such basic transformations.

Summary: ML-tensor are very different objects compared to tensors in mathematical physics. One should not mix them up or confuse them.

If someone of my experienced friends in physics has found some reasonable mapping of ML-tensors to multi-linear forms, please send me an email.

Variational Autoencoder with Tensorflow – XIV – Change of variational model parameters at inference time

In my last post of this series I compared a Variational Autoencoder [VAE] with only a tiny amount of KL-loss to a standard Autoencoder [AE]. See the links at

Variational Autoencoder with Tensorflow – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?

for more information and preparational posts.

Both the Keras models for the VAE and the AE were trained on the CelebA dataset of images of human heads. We found a tight similarity regarding the clustering of predicted data points for the training and test data in the latent space. In addition the VAE with tiny KL loss failed to reconstruct reasonable human face images from arbitrarily chosen points in the latent space. Just as a standard AE does. In forthcoming posts we will continue to study the relation between VAEs and AEs.

But in this post I want to briefly point out an interesting technical problem which may arise when you start to tests predictions for certain data samples after a training phase. Your Encoder or Decoder models may include parameters which you want to experiment with when predicting results for interesting input data. This raises the question whether we can vary such parameters at inference time. Actually, this was not quite as easy as it seemed to be when I started with respective experiments. To perform them I had to learn two aspects of Keras models I had not been aware of before.

How to switch of the z-point variation at inference time?

In my particular case the starting point was the following consideration:

At inference time there is no real need for using the logvar-based variation around mu-values predicted by the Encoder.

The variation in z-point values in VAEs is done by adding a statistical value to mu-values. The added term is based on a log_var value multiplied by a statistically fluctuating factor “epsilon”, which comes from a normal Gaussian distribution around zero. mu, therefore, is the center of a distribution a specific input tensor is mapped to in latent space for consecutive predictions. The mu- and log_var-values depend on weights of two dense layers of the Encoder and thus indirectly on the optimization during training.

But while the variation is essential during training one may NOT regard it necessary for predictions. During inference we may in some experiments have good reasons to only refer to the central mu value when predicting a z-point in latent space. For test and analysis purposes it could be interesting to omit the log_var contribution.

The question then is: How we can switch off the log_var component for the Encoder’s predictions, i.e. predictions of our Keras based encoder model?

One idea is to include variables of a Python class hosting the Keras models for the Encoder, Decoder and the composed VAE in the function for the calculation of z-point vectors.

The mu-, logvar and sampling layers of the VAE’s Encoder model encapsulated in a Python class

During this post series we have encapsulated the code for Encoder, Decoder and resulting VAE models in a Python class. Remember that the Encoder produced its output, namely z-points in the latent space via two “dense” Keras layers and a sampling layer (based on a Keras Lambda-layer). The dense layers followed a series of convolutional layers and gave us mu and log_var values. The Lambda-layer produced the eventual z-point-vector including the variation. In my case the code fragments for the layers look similar to the following ones:

# .... Layer model of the Encoder part 
...
...     # Definition of an input layer and multiple Conv2D layers 
...
        # "Variational" part - 2 Dense layers for a statistical distribution of z-points  
        self.mu      = Dense(self.z_dim, name='mu')(x)
        self.log_var = Dense(self.z_dim, name='log_var')(x)
        # Layer to provide a z_point in the Latent Space for each sample of the batch 
        self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])
...
        # The Encoder model 
        self.encoder = Model(inputs=self._encoder_input, outputs=[self._encoder_output, self.mu, self.log_var], name="encoder")
...

The “self” refers to a class “MyVariationalAutoencoder” comprising the Encoder’s, Decoder’s and the VAE’s model and layer structures. See for details and explained code fragments of the class e.g. the posts around Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images.

The sampling is in my case done by a function “z_point_sampling”:

        def z_point_sampling(args):
            '''
            A point in the latent space is calculated statistically 
            around an optimized mu for each sample 
            '''
            mu, log_var = args # Note: These are 1D tensors !
            epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
            return mu + B.exp(log_var / 2.) * epsilon * self.enc_v_eps_factor

You see that this function uses a class member variable “self.enc_v_eps_factor”.

Switch the variation with log_var on and off for predictions?

Our objective is to switch the log_var contribution on or off for input certain images or batches of such images fed into the Encoder. For this purpose we could in principle use the variable “self.enc_v_eps_factor” as a kind of boolean switch with values of either 0.0 or 1.0. To set the variable I had defined two class methods:

    def set_enc_to_predict(self):
        self.enc_v_eps_factor = 0.0 
    
    def set_enc_to_train(self):  
        self.enc_v_eps_factor = 1.0 

The basic idea was that the sampling function would pick the value of enc_v_eps_factor given at the runtime of a prediction, i.e. at inference time. This assumption was, however, wrong. At least in a certain sense.

Is a class variable change impacting a layer output noted during consecutive predictions of a Keras model?

Let us assume that we have instantiated our class and assigned it to a Python variable MyVae. Let us further assume the the comprised Keras models are referenced by variables

  • MyVae.encoder (for the Encoder part),
  • MyVae.decoder (for the Decoder part)
  • and MyVae.model (for the full VAE-model).

We do not care about further details of the VAE (consisting of the Encoder, the Decoder and GradientTape based cost control). But we should not forget that all the models’ layers and their weights determine cost functions derivatives and are therefore targets of the optimization performed during training. All factors determining gradients and value calculation with given weights are encoded with the compilation of a Keras model – for training purposes [using model.fit() with a Keras model], but as well for predictions! Without a compiled Keras model we cannot use model.predict().

This means: As soon as you have a compiled Keras model almost and load the weight-values saved after a a sufficient number of training epochs everything is settled for inference and predictions. Including the present value of self.enc_v_eps_factor. At compile time?

Well, thinking about it a bit more in detail from a developer perspective tells us:
The compilation would in principle not prevent the use of a changed variable at the run-time of a prediction. But on the other hand side we also have the feeling that Keras must do something to make training (which also requires predictions in the sense of a forward pass) and later raw predictions for batches at inference time pretty fast. Intermediate functionality changes would hamper performance – if only for the reason that you have to watch out for such changes.

So, it is natural to assume that Keras would keep any factors in the Lambda-layer taken from a class variable constant after compilation and during training or inference, i.e. predictions.

If this assumption were true then chain of actions AFTER a training of a VAE model (!) like

Define a Keras based VAE-model with a sampling layer and a factor enc_v_eps_factor = 1   =>   compile it (including the sub-models for the Encoder and Decoder)   =>   load weight parameters which had been saved after training   =>   switch the value of the class variable enc_v_eps_factor to 0.0   =>   Load an image or image batch for prediction

would probably NOT work as expected. To be honest: This is “wisdom” derived after experiments which did not give me the naively expected results.

Indeed, some first simple experiments showed: The value of enc_v_eps_factor (e.g. enc_v_eps_factor = 1) which was given at compile-time was used during all following prediction calculations, e.g. for a particular image in tensorial form. So, a command sequence like

MyVae.set_enc_to_train()
MyVae.encoder.compile() 
...
# Load weights from a set path 
MyVae.model.load_weights(path_weights)
...

z_point, mu, log_var = MyVae.encoder.predict(img1)
print(z_point, mu)
MyVae.set_enc_to_predict(self)
z_point, mu, log_var = MyVae.encoder.predict(img1)

did not give different results. Note, that I did not change the value of enc_v_eps_factor between compile time and the first call for a prediction.

Let us look at the example in more detail.

A concrete example

After a full training of my VAE on CelebA for 24 epochs I checked for a maximum log_var-value. Non-negligible values may indeed occur for certain images and special z-vector components for some images despite the tiny KL-loss. And indeed such a value occurred for a very special (singular) image with two diagonal black border stripes on the left and right of the photographed person’s head. I do not show the image due to digital rights concerns. But let us look at the predicted z, mu and log_var values (for a specific z-point vector component) of this image. Before I compiled the VAE model I had set enc_v_eps_factor = 1.0 :

# Set the learning rate and COMPILE the model 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
learning_rate = 0.0005

# The following is only required for compatibility reasons
b_old_optimizer = True     

# Set enc_v_eps_factor to 1.0
MyVae.set_enc_to_train()

# Separate Encoder compilation
# - does not harm the compilation of the full VAE-model, but is useful to avoid later trace warnings after changes 
MyVae.encoder.compile()

# Compilation of the full VAE model (Encoder, Decoder, cost functions used by GradientTape) 
MyVae.compile_myVAE(learning_rate=learning_rate, b_old_optimizer = b_old_optimizer )
...

# Load weights from a set path 
MyVae.model.load_weights(path_weights)
...
# start predictions
...

with

    def compile_myVAE(self, learning_rate, b_old_optimizer = False):
        # Version 1.1 of 221212
        # Forced to special handling of the optimizer for data resulting from training before TF 11.2 , due to warnings:  
        #      ValueError: You are trying to restore a checkpoint from a legacy Keras optimizer into a v2.11+ Optimizer, 
        #      which can cause errors. Please update the optimizer referenced in your code to be an instance 
        #      of `tf.keras.optimizers.legacy.Optimizer`, e.g.: `tf.keras.optimizers.legacy.Adam`.

        # Optimizer handling
        # ~~~~~~~~~ 
        if b_old_optimizer: 
            optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=learning_rate)
        else:    
            optimizer = Adam(learning_rate=learning_rate)
        ....

        # Solution type with train_step() and GradientTape()  
        if self.solution_type == 3:
            self.model.compile(optimizer=optimizer)

Details of the compilation are not important – though you may be interested in the fact that training data saved after a training based on Python module version < 11.2 of TF 2 requires a legacy version of the optimizer when later using a module version ≥ 11.2 (corresponding to TF V2.11 and above).

However, the really important point is that the compilation is done given a certain value of enc_v_eps_factor = 1.

Then we load the image in form a prepared training batch with just one element set and provide it to the predict() function of the Keras model for the Encoder. We perform two predictions and ahead of the second one we change the value of enc_v_eps_factor to 0.0:

# Choose an image 
j = 123223
img = x_train[j] # this has already a tensor compatible format 
img_list = []
img_list.append(img)
tf_img = tf.convert_to_tensor(img_list)

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)

print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

# !!!! Set enc_v_eps_factor to 0.0 !!!!
MyVae.set_enc_to_predict()

# New Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)

print()
print()

print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

The result is:

...
2.637196
-0.141142
3.3761873
...
-0.2085279
-0.141142
3.3761873

The result depends on statistical variations for the factor epsilon (Gaussian statistics; see the sampling function above).

But the central point is not the deviation for the two different prediction calls. The real point is that we have used MyVae.set_enc_to_predict() ahead of the second prediction and, yet, the values for mu and log_var for the special z_point-component (230; out of 256 components) were NOT identical. I.e. the variable value enc_v_eps_factor = 1.0, which we set before the compilation, was used during both of our prediction calculations!

Can we just recompile between different calls to model.predict() ?

The experiment described above seems to indicate that the value for the class variable enc_v_eps_factor given at compile time is used during all consecutive predictions. We could, of course, enforce a zero variation for all predictions by using MyVae.set_enc_to_predict() ahead of the compilation of the Encoder model. But this would not give us no flexibility to switch the log_var contribution off ahead of predictions for some special images and then turn it on again for other images.

But the is simple – if we need not do this switching permanently: We just recompile the Encoder model!

Compilation does not take much time for Encoder models with only a few (convolutional and dense) layers. Let us test this by modifying the code above:

# Choose an image 
j = 123223
img = x_train[j] # this has already a tensor compatible format 
img_list = []
img_list.append(img)
tf_img = tf.convert_to_tensor(img_list)

# Set enc_v_eps_factor to 0.0
MyVae.set_enc_to_predict()
# !!!!
MyVae.encoder.compile() 

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
# Decoder prediction - just for fun 
reco_list = MyVae.decoder.predict(z_points) # just for fun 
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

print() 
print()

# Set enc_v_eps_factor back to 1.0
MyVae.set_enc_to_train()
MyVae.encoder.compile()

# New Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)

print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

We get

Shape of img_list =  (1, 96, 96, 3)
eps_fact =  0.0
1/1 [==============================] - 0s 280ms/step
1/1 [==============================] - 0s 18ms/step
Shape of reco_list =  (1, 96, 96, 3)

-0.141142
-0.141142
3.3761873


eps_fact =  1.0
1/1 [==============================] - 0s 288ms/step

-0.63676023
-0.141142
3.3761873

This is exactly what we want!

The function for the prediction step of a Keras model is cached at inference time …

The example above gave us the impression that it could be the compilation of a model which “settles” all of the functionality used during predictions, i.e. at inference time. Actually, this is no quite true.

The documentation on a Keras model helped me to get a better understanding. Near the section on the method “predict()” we find some other interesting functions. A look at the remarks on “predict_step()“, reveals (quotation)

The logic for one inference step.

This method can be overridden to support custom inference logic. This method is called by Model.make_predict_function.

This method should contain the mathematical logic for one step of inference. This typically includes the forward pass.

This leads us to the function “make_predict_function()” for Keras models. And there we find the following interesting remarks – I quote:

This method can be overridden to support custom inference logic. This method is called by Model.predict and Model.predict_on_batch.

Typically, this method directly controls tf.function and tf.distribute.Strategy settings, and delegates the actual evaluation logic to Model.predict_step.

This function is cached the first time Model.predict or Model.predict_on_batch is called. The cache is cleared whenever Model.compile is called. You can skip the cache and generate again the function with force=True.

Ah! The function predict_step() normally covers the forward pass through the network and “make_predict_function()” caches the resulting (function) object at the first invocation of model.predict(). And the respective cache is not cleared automatically.

So, what really may have hindered my changes of the sampling functionality at inference time is a cache filled at the first call to encoder.predict()!

Let us test this!

Changing the sampling parameters after compilation, but before the first call of encoder.predict()

If our suspicion is right we should be able to set up the model from scratch again, compile it, use MyVae.set_enc_to_predict() and afterward call MyVae.encoder.predict() – and get the same values for mu and z_point.

So we do something like

# Build encoder according to layer parameters 
MyVae._build_enc()
# Build decoder according to layer parameters 
MyVae._build_dec()
# Build the VAE-model 
MyVae._build_VAE()
...

# Set variable to 1.0
MyVae.set_enc_to_train()

# Compile 
learning_rate = 0.0005
b_old_optimizer = True     
MyVae.compile_myVAE(learning_rate=learning_rate, b_old_optimizer = b_old_optimizer )
MyVae.encoder.compile()   # used to prevent retracing delays - when later changing encoder variables 

...
# Load weights from a set path 
MyVae.model.load_weights(path_weights)
...

# preparation if the selected img. 
...
...

MyVae.set_enc_to_predict()
print("eps_fact = ", MyVae.enc_v_eps_factor)
# Note: NO recompilation is done !

# First prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
print()
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])
..

Note that the change of enc_v_eps_factor ahead of the first call of predict(). And, indeed:

Shape of img_list =  (1, 96, 96, 3)
eps_fact =  0.0
...
-0.141142
-0.141142
3.3761873

Use make_predict_function(force=True) to clear and refill the cache for predict_step() and its forward pass functionality

The other option the documentation indicates is to use the function make_predict_function(force=True).
This leads to yet another experiment:

# img preparation 
....

# Set enc_v_eps_factor to 1.0
MyVae.set_enc_to_train()
MyVae.encoder.compile() 
print("eps_fact = ", MyVae.enc_v_eps_factor)

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

print() 
print()

# Set enc_v_eps_factor to 0.0
MyVae.set_enc_to_predict()
# !!!!
MyVae.encoder.make_predict_function(
    force=True
)
print("eps_fact = ", MyVae.enc_v_eps_factor)

# Encoder prediction 
z_points, mu, logvar  = MyVae.encoder.predict(tf_img)
print(z_points[0][230])
print(mu[0][230])
print(logvar[0][230])

We get

...
eps_fact =  1.0
1/1 [==============================] - 0s 287ms/step

-5.5451365
-0.141142
3.3761873


eps_fact =  0.0
1/1 [==============================] - 0s 271ms/step

-0.141142
-0.141142
3.3761873

Yes, exactly as expected. This again shows us that it is the cache which counts after the first call of model.predict() – and not the compilation of the Keras model (for the Encoder) !

Other approaches?

The general question of changing parameters at inference time also triggers the question whether we may be able to deliver parameters to the function model.predict() and transfer them further to customized variants of predict_step(). I found a similar question at stack overflow
Passing parameters to model.predict in tf.keras.Model

However, the example there was rather special – and I did not apply the lines of thought explained there to my own case. But the information given in the answer may still be useful for other readers.

Conclusion

In this post we have seen that we can change parameters influencing the forward pass of a Keras model at inference time. We saw, however, that we have to clear and fill a cache to make the changes effective. This can be achieved by

  • either applying a recompilation of the model
  • or enforcing a clearance and refilling of the cache for the model’s function predict_step().

In the special case of a VAE this allows for deactivating and re-activating the logvar-dependent statistical variation of the z-points a specific image is mapped to by the Encoder model during predictions. This gives us the option to focus on the central mu-dependent position of certain images in the latent space during experiments at inference time.

In the next post of this series we shall have a closer look at the filamental structure of the latent space of a VAE with tiny KL loss in comparison to the z-space structure of a VAE with sufficiently high KL loss.

 

Ceterum censeo: The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Somebody who orders the systematic destruction of civilian infrastructure must be fought and defeated because he is a permanent danger to basic principles of humanity – not only in Europe. Long live a free and democratic Ukraine!

 

Autoencoders, latent space and the curse of high dimensionality – II – a view on fragments and filaments of the latent space for CelebA images

I continue with experiments regarding the structure which an Autoencoder [AE] builds in its latent space. In the last post of this series

Autoencoders, latent space and the curse of high dimensionality – I

we have trained an AE with images of the CelebA dataset. The Encoder and the Decoder of the AE consist of a series of convolutional layers. Such layers have the ability to extract characteristic patterns out of input (image) data and save related information in their so called feature maps. CelebA images show human heads against varying backgrounds. The AE was obviously able to learn the typical features of human faces, hair-styling, background etc. After a sufficient number of training epochs the AE’s Encoder produces “z-points” (vectors) in the latent space. The latent space is a vector space which has a relatively low number of dimension compared with the number of image pixels. The Decoder of the AE was able to reconstruct images from such z-points which resembled the original closely and with good quality.

We saw, however, that the latent space (or “z-space”) lacks an important property:

The latent space of an Autoencoder does not appear to be densely and uniformly populated by the z-points of the training data.

We saw that his makes the latent space of an Autoencoder almost unusable for creative and generative purposes. The z-points which gave us good reconstructions in the sense of recognizable human faces appeared to be arranged and positioned in a very special way within the latent space. Below I call a CelebA related z-point for which the Decoder produces a reconstruction image with a clearly visible face a “meaningful z-point“.

We could not reconstruct “meaningful” images from randomly chosen z-points in the latent space of an Autoencoder trained on CelebA data. Randomly in the sense of random positions. The Decoder could not re-construct images with recognizable human heads and faces from almost any randomly positioned z-point. We got the impression that many more non-meaningful z-points exist in latent space than meaningful z-points.

We would expect such a behavior if the z-points for our CelebA training samples were arranged in tiny fragments or thin (and curved) filaments inside the multidimensional latent space. Filaments could have the structure of

  • multi-dimensional manifolds with almost no extensions in some dimensions
  • or almost one-dimensional string-like manifolds.

The latter would basically be described by a (wiggled) thin curve in the latent space. Its extensions in other dimensions would be small.

It was therefore reasonable to assume that meaningful z-points are surrounded by areas from which no reasonable interpretable image with a clear human face can be (re-) constructed. Paths from a “meaningful” z-point would only in a very few distinct directions lead to another meaningful point. As it would be the case if you had to follow a path on a thin curved manifold in a multidimensional vector space.

So, we had some good reasons to speculate that meaningful data points in the latent space may be organized in a fragmented way or that they lie within thin and curved filaments. I gave my readers a link to a scientific study which supported this view. But without detailed data or some visual representations the experiments in my last post only provided indirect indications of such a complex z-point distribution. And if there were filaments we got no clue whether these were one- or multidimensional.

Important Addendum, 03/18/2023:

I have to correct this post regarding the basic line of thought: Even if we find that the z-points for CelebA images are arranged in filaments the failure we saw in the first post of this series may not have its direct cause in missing these filaments in latent space by randomly chosen z-points. It could also be that we miss a much larger, coherent region where meaningful points are located. The filaments then would correspond to a correlation of certain features, only, which may not be decisive for the reconstruction of a face. So, the investigation of the existence of filaments is interesting – but the explanation of the AE’s reconstruction failure may require a more thorough analysis. I have done the calculations already, but have not yet found the time to write about them. As soon as the posts are ready I am going to provide a link. See also an added comment at the end of this post.

Do we have a chance to get a more direct evidence about a fragmented or filamental population of the latent space? Yes, I think so. And this is the topic of this post.

However, the analysis is a bit complicated as we have to deal with a multidimensional space. In our case the number of dimensions of the latent space is z_dim = 256. No chance to plot any clusters or filaments directly! However, some other methods will help to reduce the dimensionality of the problem and still get some valid representations of the data point correlations. In the end we will have a very strong evidence for the existence of filaments in the AE’s z-space.

Methods to work with data distributions in many dimensions

Below I will use several methods to investigate the z-point distribution in the multidimensional latent space:

  • An analysis of the variation of the z-point number-density along coordinate axes and vs. radius values.
  • An application of t-SNE projections from the standard multidimensional coordinate system onto a 2-dimensional plane.
  • PCA analysis and subsequent t-SNE projections of the PCA-transformed z-point distribution and its most important PCA components down to a 2-dim plane. Note that such an approach corresponds to a sequence of projections:
    1) Linear projections onto PCA rotated coordinates.
    2) A non-linear SNE-projection which scales and represents data point correlations on different scales on a 2-dim plane.
  • A direct view on the data distribution projected onto flat planes formed by two selected coordinate axes in the PCA-coordinate system. This will directly reveal whether the data (despite projection effects exhibit filaments and voids on some (small ?) scales.
  • A direct view on the data distribution projected onto a flat plane formed by two coordinate axes of the original latent space.

The results of all methods combined strongly support the claim that the latent space is neither populated densely nor uniformly on (small) scales. Instead data points are distributed along certain filamental structures around voids.

Layer structure of the Autoencoder

Below you find the layer structure of the AE’s Encoder. It got four Conv2D layers. The Decoder has a corresponding reverse structure consisting of Conv2DTranspose layers. The full AE model was constructed with Keras. It was trained on CelebA for 24 epochs with a small step size. The original CelebA images were reduced to a size of 96×96 pixels.

Encoder

Decoder

Number density of z-points vs. coordinate values

Each z-point can be described by a vector, whose components are given by projections onto the 256 coordinate axes. We assume orthogonal axes. Let us first look at the variation of the z-point number density vs. reasonable values for each of the 256 vector-components.

Below I have plotted the number density of z-points vs. coordinate values along all 256 coordinate axes. Each curve shows the variation along one of the 256 axes. The data sampling was done on intervals with a width of 0.25:

Most curves look like typical Gaussians with a peak at the coordinate value 0.0 with a half-width of around 2.

You see, however, that there are some coordinates which dominate the spatial distribution in the latent vector-space. For the following components the number density distribution is relatively broad and peaks at a center different from the origin of the z-space. To pick a few of these coordinate axes:

 52, center:  5.0,  width: 8
 61; center;  1.0,  width: 3 
 73; center:  0.0,  width: 5.5  
 83; center: -0.5,  width: 5
 94; center:  0.0,  width: 4
116; center:  0.0,  width: 4
119; center:  1.0,  width: 3
130; center: -2.0,  width: 9
171; center:  0.7,  width: 5
188; center:  0.75, width: 2.75
200; center:  0.5,  width: 11
221; center: -1.0,  width: 8

The first number is just an index of the vector component and the related coordinate axis. The next plot shows the number density along some these specific coordinate axes:

What have we learned?
For most coordinate axes of the latent space the number density of the z-points peaks at 0.0. We see an approximate Gaussian form of the number density distribution. There are around 5 coordinate directions where the distribution has a peak significantly off the origin (52, 130, 171, 200, 221). Along the corresponding axes the distribution of z-points obviously has an elongated form.

If there were only one such special vector component then we would speak of an elongated, ellipsoidal and almost cigar like distribution with the thickest area at some position along the specific coordinate axis. For a combination of more axes with elongated distributions, each with with a center off the origin, we get instead diagonally oriented multidimensional and elongated shapes.

These findings show again that large regions of the latent space of an AE remain empty. To get an idea just imagine a three dimensional space with all data in x-direction culminating at a coordinate value of 5 with a half-width of lets say 8. In the other directions y and z we have our Gaussian distributions with a total half-width of 1 around the mean value 0. What do we get? A cigar like shape confined around the x-axis and stretching from -3 < x < 13. And the rest of the space: More or less empty. We have obviously found something similar at different angular directions of our multidimensional latent space. As the number of special coordinate directions is limited these findings tell us that a PCA analysis could be helpful. But let us first have a look at the variation of number density with the radius value of the z-points.

Number density of z-points vs. radius

We define a radius via an Euclidean L2 norm for our 256-dimensional latent space. Afterward we can reduce the visualization of the z-point distribution to a one dimensional problem. We can just plot the variation of the number density of z-points vs. the radius of the z-points.

In the first plot below the sampling of data was done on intervals of 0.5 .

The curve does not remain that smooth on smaller sampling intervals. See e.g. for intervals of width 0.05

Still, we find a pronounced peak at a radius of R=16.5. But do not get misguided: 16 appears to be a big value. But this is mainly due to the high number of dimensions!

How does the peak in the close vicinity of R=16 fit to the above number density data along the coordinate axes? Answer: Very well. If you assume a z-point vector with an average value of 1 per coordinate direction we actually get a radius of exactly R=16!

But what about Gaussian distributions along the coordinate axes? Then we have to look at resulting expectation values. Let us assume that we fill a vector of dimension 256 with numbers for each component picked statistically from a normal distribution with a width of 1. And let us repeat this process many times. Then what will the expectation value for each component be?

A coordinate value contributes with its square to the radius. The math, therefore, requires an evaluation of the integral integral[(x**2)*gaussian(x)] per coordinate. This integral gives us an expectation value for the contribution of each coordinate to the total vector length (on average). The integral indeed has a resulting value of 1.0. From this it follows that the expectation value for the distance according to an Euclidean L2-metric would be avg_radius = sqrt(256) = 16. Nice, isn’t it?

However, due to the fact that not all Gaussians along the coordinate axes peak at zero, we get, of course, some deviations and the flank of the number distribution on the side of larger radius values becomes relatively broad.

What do we learn from this? Regions very close to the origin of the z-space are not densely populated. And above a radius value of 32, we do not find z-points either.

t-SNE correlation analysis and projections onto a 2-dimensional plane

To get an impression of possible clustering effects in the latent space let us apply a t-SNE analysis. A non-standard parameter set for the sklearn-variant of t-SNE was chosen for the first analysis

tsne2 = TSNE(n_components, early_exaggeration=16, perplexity=10, n_iter=1000) 

The first plot shows the result for 20,000 randomly selected z-points corresponding to CelebA images

Also this plot indicates that the latent space is not populated with uniform density in all regions. Instead we see some fragmentation and clustering. But note that this might happened on different length scales. t-SNE arranges its projections such that correlations on different scales get clearly indicated. So the distances in this plot must not be confused with the real spatial distances in the original latent space. The axes of the t-SNE plot do not reflect any axes of the latent space and the plotted distribution is not the real data point distribution after a linear and orthogonal projection onto a plane. t-SNE works non-linearly.

However, the impression of clustering remains for a growing numbers of z-points. In contrast to the first plot the next plots for 80,000 and 165,000 z-points were calculated with standard t-SNE parameters.

We still see gaps everywhere between locally dense centers. At the center the size of the plotted points leads to overlapping. If one could zoom into some of the centers then gaps would again appear on smaller scales (see more plots below).

PCA analysis and t-SNE-plots of the z-point distribution in the (rotated) PCA coordinate system

The z-point distribution can be analyzed by a PCA algorithm. There is one dominant component and the importance smooths out to an almost constant value after the first 10 components.

This is consistent with the above findings. Most of the coordinates show rather similar Gaussian distributions and thus contribute in almost the same manner.

The PCA-analysis transforms our data to a rotated coordinate system with a its origin at a position such that the transformed z-point distribution gets centered around this new origin. The orthogonal axes of the new PCA-coordinates system show into the direction of the main components.

When the projection of all points onto planes formed by two selected PCA axes do not show a uniform distribution but a fragmented one, then we can safely assume that there really is some fragmentation going on.

t-SNE after PCA

Below you see t-SNE-plots for a growing number of leading PCA components up to 4. The filamental structure gets a bit smeared out, but it does not really disappear. Especially the elongated empty regions (voids) remain clearly visible.

t-SNE after PCA for the first 2 main components – 80,000 randomly selected z-points

t-SNE after PCA for the first 2 main components – 165,000 randomly selected z-points

t-SNE after PCA for the first 4 main PCA components – 165,000 randomly selected z-points

For 10 components t-SNE gets a presentation problem and the plots get closer to what we saw when we directly operated on the latent space.

But still the 10-dim space does not appear to be uniformly populated. Despite an expected smear out effect due to the non-linear projection the empty ares seem to be at least as many and as extended as the populated areas.

Direct view on the z-point distribution after PCA in the rotated and centered PCA coordinate system

t-SNE blows correlations up to make them clearly visible. Therefore, we should also answer the following question:

On what scales does the fragmentation really happen ?

For this purpose we can make a scatter plot of the projection of the z-points onto a plane formed by the leading two primary component axes. Let us start with an overview and relatively large limiting values along the two (PCA) axes:

Yeah, a PCA transformation obviously has centered the distribution. But now the latent space appears to be filled densely and uniformly around the new origin. Why?

Well, this is only a matter of the visualized length scales. Let us zoom in to a square of side-length 5 at the center:

Well, not so densely populated as we thought.

And yet a further zoom to smaller length scales:

And eventually a really small square around the origin of the PCA coordinate system:

z-point distribution at the center of a two-dim plane formed by the coordinate axes of the first 2 primary components
The chosen qsquare has its corners at (-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25).

Obviously, not a dense and neither a uniform distribution! After a PCA transformation we see the still see how thinly the latent space is populated and that the “meaningful” z-points from the CelebA data lie along curved and narrow lines or curves with some point-like intersections. Between such lines we see extended voids.

Let us see what happens when we look at the 2-dim pane defined by the first and the 18th axes of the PCA coordinate system:

Or the distribution resulting for the plane formed by the 8th and the 35th PCA axis:

We could look at other flat planes, but we do not get rid of he line like structures around void like areas. This is really a strong indication of filamental structures.

Interpretation of the line patterns:
The interesting thing is that we get lines for z-point projections onto multiple planes. What does this tell us about the structure of the filaments? In principle we have the two possibilities already named above: 1) Thin multidimensional manifolds or 2) thin and basically one-dimensional manifolds. If you think a bit about it, you will see that projections of multidimensional manifolds would not give us lines or curves on all projection planes. However curved string- or tube-like manifolds do appear as lines or line segments after a projection onto almost all flat planes. The prerequisite is that the extension of the string in other directions than its main one must really be small. The filament has to have a small diameter in all but one directions.

So, if the filaments really are one-dimensional string-like objects: Should we not see something similar in the original z-space? Let us for example look at the plane formed by axis 52 and axis 221 in the original z-space (without PCA transformation). You remember that these were axes where the distribution got elongated and had centers at -2 and 5, respectively. And indeed:

Again we see lines and voids. And this strengthens our idea about filaments as more or less one-dimensional manifolds.

The “meaningful” z-points for our CelebA data obviously get positioned on long, very thin and basically one-dimensional filaments which surround voids. And the voids are relatively large regarding their area/volume. (Reminds me of the galaxy distribution in simulations of the development of the early universe, by the way.)

Therefore: Whenever you chose a randomly positioned z-point the chance that you end up in an unpopulated region of the z-space or in a void and not on a filament is extremely big.

Conclusion

We have used a whole set of methods to analyze the z-point distribution of an AE trained on CelebA images. We found the the z-point distribution is dominated by the number density variation along a few coordinate axes. Elongated shapes in certain directions of the latent space are very plausible on larger scales.

We found that the number density distributions along most of the coordinate axes have a thin Gaussian form with a peak at the origin and a half-with of 1. We have no real explanation for this finding. But it may be related to the fact the some dominant features of human faces show Gaussian distributions around a mean value. With Gaussians given we could however explain why the number density vs. radius showed a peak close to R=16.

A PCA analysis finds primary directions in the multidimensional space and transforms the z-point distribution into a corresponding one for orthogonal primary components axes. For logical reason we can safely assume that the corresponding projections of the z-point distribution on the new axes would still reveal existing thin filamental structures. Actually, we found lines surrounding voids independently onto which flat plane we projected the data. This finding indicates thin, elongated and curved but basically one-dimensional filaments (like curved strings or tubes). We could see the same pattern of line-like structure in projections onto flat coordinate planes in the original latent space. The volume of the void areas is obviously much bigger than the volume occupied by the filaments.

Non-linear t-SNE projections onto a 2-dim flat hyperplanes which in addition reproduce and normalize correlations on multiple scales should make things a bit fuzzier, but still show empty regions between denser areas. Our t-SNE projections all showed signs of complex correlation patterns of the z-points with a lot of empty space between curved structures.

Important Addendum, 03/18/2023:
The following original conclusion is misleading and by parts wrong:

The experiments all in all indicate that z-points of the training data, for which we get good reconstructions, lie within thin filaments on characteristic small length scales. The areas/volumes of the voids between the filaments instead are relatively big. This explains why chances that randomly chosen points in the z-space falls into a void are very high.
The results of the last post are consistent with the interpretation that z-points in the voids do not lead to reconstructions by the Decoder which exhibit standard objects of the training images. in the case of CelebA such z-points do not produce images with clear face or head like patterns. Face like features obviously correspond to very special correlations of z-point coordinates in the latent space. These correlations correspond to thin manifolds consuming only a tiny fraction of the z-space with a volume close zero.

Due to a new analysis I would like to replace my original statemets with a question:

Do our findings of the existence of filaments and large surrounding voids really explain the results of the first post that randomly chosen z-points miss areas in the latent space which allow for a reconstruction of “faces”?

I am going to answer this question in another better prepared post series, soon. To make you a bit curious I leave you with the fact that the following picture shows a face reconstructed by an AE from a randomly selected point in the latent space – with some simple conditions applied: