This post focuses on a standard cookie-based access mechanism of X-clients to local Xorg-services on the very same Linux host. The basic idea is that the X-server grants access to its socket and services only if the X-client presents a fitting “secret” cookie. The required magic cookie value is defined by the X-server when it is started. The X-client gets information about X-related cookies from a user-specific file **~/.Xauthority** which may contain multiple entries. I will show that changing the name of the host, i.e. the *hostname*, may have an unexpected impact on the subsequent start of X-clients **and** that different X-clients may react in deviating ways to a hostname change.

The ambiguity in the startup process of an X-client after a hostname change – even if the correct cookie is in principle available – is demonstrated by different reactions of a standard KDE application, namely **kate**, and a **flatpak**-based application, namely Blender. Flatpak is of interest as there are many complaints on the Internet regarding an unexplained denial of X-server access, especially on Opensuse systems.

We first look a bit into basic settings for a standard MIT-MAGIC-COOKIE-1 based access mechanism. (For an in-depth introduction to X-server security readers will have to consult specific documentation available from Xorg.) Afterward I will show that the *rules* for selecting a relevant cookie entry in the file .Xauthority deviate between the applications named above. The analysis will show that there is an intimate relation between these rules, the available sources of the hostname and the .Xauthority entries for various hosts. Hostname changes during running X-sessions may, therefore, have an impact on the start of graphical applications. An unexpected negative impact may even occur after a restart of the DisplayManager.

In this post we look at the effects of hostname changes *during a running X-session*, e.g. a KDE-session. In further posts I will look a bit deeper into sequences of actions a user may perform following a hostname change – such as a simple logout from and a new login into a graphical session or a full restart of systemd’s graphical target. Even then the consequences of a hostname change may create confusion. If and when I find the time I may also look into aspects of Wayland access in combination with KDE applications and flatpak.

Some days ago, I came to a location where I had to use a friend’s WLAN router to get an Internet connection. On my laptop *NetworkManager* controls the WLAN access. NetworkManager has an option to submit a hostname to the router. As I do not want my standard hostname to be spread around I first changed it via YaST during a running KDE-session. Then I configured NetworkManager for the local WLAN and restarted the network. Access to the WLAN router and Internet worked as expected. However, when I tried to start a flatpak based Blender installation I got the message

**Invalid MIT-MAGIC-COOKIE-1 key Unable to open a display**

This appeared to be strange because the flatpak Blender installation had worked flawlessly before the change of the hostname and before WLAN access. The question that hits you as a normal Linux user is: What do a hostname change and a network restart have to do with the start of a flatpak application in an already running KDE session?

Then I changed the hostname again to a different string – and could afterward start flatpak Blender without any X-authorization problem. Confused? I was.

Then I combined the change of the hostname with an intermediate start of the display manager and/or a stop and restart of systemd’s “graphical target” – and got more disturbing results. So I thought: This is an interesting area worth looking into a bit deeper.

The error message, of course, indicated problems with an access to the display offered by the running X-server. Therefore, I wanted answers to the following questions:

- Why did the magic cookie fail during an already running X-session? Had the X-access conditions not been handled properly already when I logged myself into my KDE session?
- Why was the magic cookie information invalid the first time, but not during the second trial?
- What impact has the (changed) hostname on the cookie-based X-authorization mechanism?
- What can I check, and with which tools, regarding xauth-problems?
- What does flatpak exactly send to the X server when asking for access?
- Which rules govern the cookie based X-authorization mechanism for other applications than flatpak-based applications?

I had always assumed that X socket access would follow clear rules – independent of a specific X-client. But after some simple initial tests I started wondering whether the observed variations regarding X access had something to do with entries in the **~/.Xauthority** file of the user which I had used to log into my KDE-session. And I also wondered whether a standard KDE application like “kate” would react to hostname changes in really the same way as a flatpak application.

During a series of experiments I started to manipulate the contents of the *~/.Xauthority* file of the user of a running KDE session. Whilst doing so I compared the reaction of a typical KDE application like *kate* with the behavior of a (sandboxed) flatpak application, namely Blender.

You find the details of the experiments below. I have not yet analyzed the results with respect to potential security issues. But the variability in the selection and usage of .Xauthority entries by different applications appears a bit worrisome to me. In any case the deviating reactions at least have an impact upon whether an X-client application does start or does not start after one or multiple hostname changes. This makes an automatic handling of certain X-client starts by user scripts a bit more difficult than expected.

The basic problem is that a user can create a *transient* system status regarding a change of the static name of a host: A respective entry in the file */etc/hostname* may differ from values which other resources of the running graphical desktop session may provide to the user or programs. *Transient* is a somewhat problematic term. E.g. the command “hostnamectl” may show a *transient* hostname for some seconds when you change the contents of /etc/hostname directly – until a (systemd) background process settles the status. So what exactly do I mean by “transient”?

By “**transient**” I mean that a change of the static hostname in /etc/hostname may not be reflected by environment variables containing hostname information, e.g. by the variables HOST and HOSTNAME. Conflicting hostname information may occur as long as we do not restart the display manager and/or restart the “graphical target” of systemd after a change of the hostname.
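Such a mismatch can be detected with a tiny shell check. The following is only a sketch: the function name `hostname_state` is my own invention, and it takes the two name values as arguments so it can be tested in isolation (in a live session one would pass the contents of /etc/hostname and the value of $HOSTNAME):

```shell
# Compare the static hostname with the one the running session still uses.
# Prints "transient" if they differ, "consistent" otherwise.
# $1 = static hostname (e.g. contents of /etc/hostname)
# $2 = session hostname (e.g. value of $HOSTNAME)
hostname_state() {
    if [ "$1" != "$2" ]; then
        echo "transient"
    else
        echo "consistent"
    fi
}

# Typical call during a running session:
#   hostname_state "$(cat /etc/hostname)" "$HOSTNAME"
hostname_state "xmux" "xtux"   # prints "transient"
```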

To my astonishment I had to learn that it is not self-evident what will happen in such transient situations regarding access authorization to an Xorg-server via a so called “**magic MIT-1 cookie**“. This is partially due to the fact that the file *“~/.Xauthority”* may contain *multiple* entries. See below.

In this first post on the topics named above I will have a look at what happens

- when we change the static hostname *during a running X-session* (e.g. a KDE session),
- when we change entries in the ~/.Xauthority file during a running X-session, in particular with respect to hostname(s), and try to start an X-client afterwards.

The second point will help us to identify *rules* which X-client applications follow whilst choosing and picking up a cookie value among multiple entries in the file “~/.Xauthority”. We compare the behavior of *“kate”* as an example for a KDE application with the reaction of a sandboxed *flatpak* application (Blender). The astonishing result will be that the rules really differ. The rules for flatpak may prevent a start of a flatpak application although a valid entry may be present in .Xauthority.

The whole problem of an invalid cookie started with a hostname change ahead of a WLAN access. Why is a defined hostname important? How do we change it on an Opensuse system? What are potential sources on the system regarding information about the present hostname? And what should a consistent situation at the beginning of experiments with hostname changes look like?

In a typical structured LAN/WLAN there are multiple hosts wanting to interact with servers on the Internet, but also with each other or with servers in (sub-) networks. As an administrator you may define separate sub-nets behind a common gateway. Members of a sub-net may be subject to routing and firewall restrictions regarding the communication with other hosts in the same or other sub-nets or with servers on the Internet. As we humans operate with hostnames rather than IP-addresses we may give hosts *unique names* in a defined (sub-) network, use DHCP to statically or dynamically assign IP-addresses (and maybe even hostnames) and use DNS-services to translate hostnames and related FQDNs into IP-addresses. In a local LAN/WLAN you may have full control over all these ingredients and can design a consistent landscape of named and interacting hosts – plus a pre-designed network segregation with host-routers, gateways, gateway-/perimeter-firewalls, etc.

The situation may be different when you put your laptop into a foreign and potentially dangerous WLAN environment. In my case I use NetworkManager on Leap 15.3 (soon 15.4) to configure WLAN access. The WLAN routers in most networks I have to deal with offer both DHCP and DNS services. The IP-address is assigned dynamically. The router’s DNS server works as a forwarder with respect to the Internet. But regarding the local network, which the router controls, the DNS-service of the router may or may not respect your wishes regarding a hostname. And you do not know what other hosts creep around in the network you become a member of – and how they see you with respect to your host’s name.

There are three important things I want to achieve in such a situation:

- As soon as a WLAN connection to a router gets up I want to establish firewall rules blocking all incoming traffic and limit the outgoing traffic as much as possible.
- I do not want the WLAN routers to mingle with the hostname set by me, because this name plays a role in certain scripts and some host-internal virtual network environments.
- Other friendly hosts in the network may ping me under a certain defined hostname – which the DNS part of the WLAN router should be informed about.

All these things can be managed directly or indirectly by NetworkManager (and some additional scripts). In particular you can start a script that installs netfilter rules for a certain pre-defined hostname and does further things – e.g. delete or supplement entries in “/etc/hosts”.

However, the host’s name has to be defined somehow ahead of the network connection. There are multiple options to do this. One is to edit the file “/etc/hostname”. Another is offered by YaST on Opensuse systems. A third is a script of your own, which in addition may manage settings in /etc/hosts, e.g. regarding virtual networks controlled by yourself.

Under perfect circumstances you may achieve a status where you have a well defined static hostname *not* touched by the WLAN router, a local host-based firewall controlling all required connections *and* your chosen hostname has been accepted by the router and integrated into its own DNS domain.

Let us check this for the WLAN environment at one of my friends’ locations:

myself@xtux:~> hostname; cat /etc/hostname; echo $HOST; echo $HOSTNAME; echo $XAUTHLOCALHOSTNAME; echo $SESSION_MANAGER; hostnamectl
xtux
xtux
xtux
xtux
xtux
local/xtux:@/tmp/.ICE-unix/7289,unix/xtux:/tmp/.ICE-unix/7289
  Static hostname: xtux
        Icon name: computer-laptop
          Chassis: laptop
       Machine ID: ...... (COMMENT: long ID)
          Boot ID: b5d... (COMMENT: long ID)
 Operating System: openSUSE Leap 15.3
      CPE OS Name: cpe:/o:opensuse:leap:15.3
           Kernel: Linux 5.3.18-150300.59.101-default
     Architecture: x86-64
myself@xtux:~>

myself@xtux:~ # ping xtux
PING xtux.home (192.168.10.186) 56(84) bytes of data.
64 bytes from xtux.home (192.168.10.196): icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from xtux.home (192.168.10.196): icmp_seq=2 ttl=64 time=0.060 ms
^C
--- xtux.home ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2013ms

(The COMMENTs are not real output, but were added by me.)

You see that there is a whole variety of potential sources regarding information about the host’s name. In particular there are multiple environment variables. In the situation shown above all information sources agree about the host’s name, namely “xtux”. But you also see that the local domain name used by the WLAN router is a default one used by a certain router vendor. You can also get the present domain name by issuing the command “dnsdomainname” at a shell prompt.

An interesting question is: Which sources will register a hostname change during a running KDE session? Another interesting question is: Is the hostname used for local X-access authorization? The answers will be given by the experiments described below.

X-server access can be controlled by a variety of mechanisms. The one we focus on here is a cookie-based access. The theory is that an X-server when it starts up queries the hostname and creates a “secret” cookie plus a hash defining a file where the server saves the cookie. Afterward any X-client must provide this specific *magic* cookie when trying to get access to the X-server (more precisely to its socket). See e.g. a Wikipedia article for these basic principles.

Let us check what happens when we start an X-server. On an Opensuse system a variety of systemd-services is associated with a pseudo-state “3” of the host, which can be set by the command “init 3”. This “state” corresponds, of course, to a systemd target. Network and multiuser operations are provided. A command “init 5” then moves the system to the graphical target. This includes the start of the DisplayManager – in my case *sddm*. sddm in turn starts Xorg’s X-server. On an Opensuse system see the file **/etc/sysconfig/displaymanager** for respective settings.

Now let us get some information from the system about these points. First we look at the command which had been used to start the X-server. As root:

xtux:~ # pgrep -a X
16690 /usr/bin/X -nolisten tcp -auth /run/sddm/{0ca1db02-e253-4f2b-972f-9b124764a65f} -background none -noreset -displayfd 18 -seat seat0 vt7

The path after the -auth option gives us the location of the file containing the “magic cookie”. We can analyze its contents by the **xauth** command:

xtux:~ # xauth -f /run/sddm/\{0ca1db02-e253-4f2b-972f-9b124764a65f\} list
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0

(Hint: The tab-key will help you to avoid retyping the long hash). Ok, here we have our secret cookie for further authentication of applications wanting to get access to the X-server’s socket.
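For scripting purposes one may want to extract the path after the -auth option programmatically. A minimal sketch – the helper name `auth_file_of` is hypothetical – that parses a command line as printed by “pgrep -a X”:

```shell
# Extract the file path that follows the "-auth" option of an X-server
# command line, as printed by "pgrep -a X".
auth_file_of() {
    # print the word following "-auth"
    printf '%s\n' "$1" | awk '{for (i = 1; i < NF; i++) if ($i == "-auth") print $(i+1)}'
}

xline='16690 /usr/bin/X -nolisten tcp -auth /run/sddm/{0ca1db02-e253-4f2b-972f-9b124764a65f} -background none'
auth_file_of "$xline"
# One could then inspect the cookie with: xauth -f "$(auth_file_of "$xline")" list
```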

The other side of the story is the present *user* who opened an X-session – in my case a KDE session. Where do the X-clients, which the user later starts on the graphical desktop, get the knowledge about the required cookie from?

Answer: Whilst logging in via the DisplayManager an entry is written into or replaced inside the file **~/.Xauthority** (by a root controlled process). If this file does not exist it is created.

The .Xauthority file should normally not be edited manually as it contains binary, non-printable characters. But the command “**xauth list**” helps again to present the contents in readable form. It automatically picks the file ~/.Xauthority:

myself@xtux:~> xauth list
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0

We see the same cookie here as on the local X-server’s side. Any application with graphical output can and should use this information to get access to the X-server. E.g. “ssh -X” can use this information to rewrite cookie information set during interaction with a remote system to the present local X-server cookie. And there are of course the sandboxed flatpak applications in their namespaces. Note that the display used is by definition :0 on the local host.

Note further that the situation regarding .Xauthority is ideal: In our present situation it contains only one entry. On systems from which you work a lot on different remote hosts this file normally contains multiple entries, in particular if you have opened X-connections to other hosts in the past. Or if you have changed your hostname before …

When we start a flatpak Blender installation it will just open the Blender interface on the host’s graphical desktop screen. We can close Blender directly afterward again.

myself@xtux:~> flatpak run org.blender.Blender &
[1] 23145
myself@xtux:~> Saved session recovery to '/tmp/quit.blend'
Blender quit
[1]+  Fertig    flatpak run org.blender.Blender
myself@xtux:~>

We shall later see that flatpak maps a display number :99, used in the application’s namespace, to :0 when interacting with the locally running X-server. In this respect it seems to work similarly to “ssh -X”.

What happens if we disturb the initial consistent setup by changing the host’s name during a running X-session?

We have already identified potential sources of a mismatch regarding the hostname: /etc/hostname vs. a variety of environment variables vs. the DNS system.

The reader has certainly also noticed that the entry in the ~/.Xauthority file starts with the present name of the host. So here we have yet another potential source of a mismatch after a change of the hostname.
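For completeness: each line printed by “xauth list” follows the pattern <host>/unix:<display> <protocol> <hexkey>. A small sketch (the helper names are my own) that splits such a line into its parts:

```shell
# Split one "xauth list" line into hostname, display number and cookie.
# Entry format: <host>/unix:<display>  <protocol>  <hexkey>
entry_host()    { printf '%s\n' "$1" | sed 's|/unix:.*||'; }
entry_display() { printf '%s\n' "$1" | sed 's|.*/unix:\([0-9]*\).*|\1|'; }
entry_cookie()  { printf '%s\n' "$1" | awk '{print $NF}'; }

line='xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0'
entry_host "$line"      # xtux
entry_display "$line"   # 0
entry_cookie "$line"    # 650ade473bc07c2e981d6174871c2ad0
```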

There are obvious limitations: The present cookie value stored at the location /run/sddm/\{..} should not be overwritten. This may disturb running X-clients and the start of other application windows in sub-shells of the running KDE-session. Changing environment variables which contain the hostname, may be dangerous, too.

On the other side the entry in ~/.Xauthority may not fit the new hostname if the entry is not being adapted. Actually, changing the hostname via YaST on an Opensuse system leaves .Xauthority entries as they were. So how would a KDE application and a flatpak application react to such discrepancies?

There are multiple options to change the hostname. We could overwrite the contents of /etc/hostname directly as root (and wait for some systemd action to note the difference). But that does not automatically change other network settings. Let us therefore use YaST in the form of the graphical yast2.

We just ignore the warning and define a new hostname “xmux”. Then we let yast2 do its job to reconfigure the network settings. Note that this will not interrupt an already running NetworkManager-controlled WLAN connection. Afterward we check our bash environment:

myself@xtux:~> hostname; cat /etc/hostname; echo $HOST; echo $HOSTNAME; echo $XAUTHLOCALHOSTNAME; echo $SESSION_MANAGER; hostnamectl
xmux
xmux
xtux
xtux
xtux
local/xtux:@/tmp/.ICE-unix/7289,unix/xtux:/tmp/.ICE-unix/7289
  Static hostname: xmux
        Icon name: computer-laptop
          Chassis: laptop
       Machine ID: ..... (COMMENT: unchanged long string)
          Boot ID: 65c.. (COMMENT: changed long string)
 Operating System: openSUSE Leap 15.3
      CPE OS Name: cpe:/o:opensuse:leap:15.3
           Kernel: Linux 5.3.18-150300.59.101-default
     Architecture: x86-64
myself@xtux:~> xauth list
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0

We got a new Boot ID, but neither the environment variables nor the contents of ~/.Xauthority have been changed. Will our flatpak-based *Blender* run?

Answer: Yes, it will – without any warnings! Will *kate* run? Yes, it will, but with some warning:

myself@xtux:~> kate &
[1] 6685
myself@xtux:~> No protocol specified

The reaction of kate is a bit questionable. Obviously, it detects an unexpected discrepancy, but starts nevertheless.

There are two possible explanations regarding flatpak: 1) flatpak ignores the hostname in the .Xauthority entry and just reacts to the screen number. 2) flatpak remembers the last successful cookie and uses it.

How can we test this?

One way to test the reaction of applications to discrepancies between a changed hostname and cookie entries in .Xauthority is to *manually* change entries in this file. This is not as easy as it may seem, as the file contains binary characters. But it is possible with the help of kate or kwrite. You need an editor which can render most of the symbols in a sufficient way; vi is not adequate.

BUT: You have to be very careful when copying entry lines. Identifying the symbol sequence marking the beginning of an entry is a first important step. Note also: When you change the hostname of an entry you must use one of the *same* length!

AND: Keep a copy of the original .Xauthority-file somewhere safe during your experiments.

The file ~/.Xauthority can in principle have multiple entries for different hosts and different X-servers. Among other scenarios multiple entries are required to cover situations

- for direct access of a local X-client via a network to other hosts’ X-servers,
- for remote application access based on “ssh -X” and a display of the output on the local host’s X-server.

Furthermore: A restart of an X-server will lead to new *additional* entries in ~/.Xauthority, whilst existing entries are kept there.

Therefore, it is wise to work with multiple entry lines in ~/.Xauthority for further experiments. However: Multiple entries with a combination of hostnames and cookie values open up a new degree of freedom:

The decision whether an X-client application gets successfully started or not will not only depend on a cookie match at the X-server, but also on the selection of an entry in the file ~/.Xauthority.

Hopefully, the results of experiments, which mix old, changed and freely chosen hostnames with valid and invalid cookie values will give us answers to the question of how a hostname change affects running X-sessions and freshly started X-clients. We are especially interested in the rules that guide an application when it must select a particular entry in .Xauthority among others. If we are lucky we will also get an idea about how an intermediate restart of the X-server after a hostname change may influence the start of X-clients afterward.

During the following experiments I try to formulate and improve rules regarding kate and flatpak applications with respect to the selection and usage of entries in .Xauthority.

Please note:

In all forthcoming experiments we only consider local applications which try to gain access to a locally running X-server!

The cookie of an entry in ~/.Xauthority is considered to be “correct” if it matches the magic cookie of the running X-server. Otherwise we regard it as “incorrect” with respect to the running X-server.
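This distinction can be made mechanical. The following sketch – the function name `classify_entries` is hypothetical – marks each “xauth list”-style line on stdin as correct or incorrect with respect to a given server cookie:

```shell
# Mark each "xauth list"-style line on stdin as correct/incorrect with
# respect to a given server cookie (last field of each entry is the cookie).
classify_entries() {
    server_cookie=$1
    awk -v sc="$server_cookie" '{print $1, (($NF == sc) ? "correct" : "incorrect")}'
}

server=650ade473bc07c2e981d6174871c2ad0
printf '%s\n' \
  'xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44' \
  'xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0' \
| classify_entries "$server"
# xtux/unix:0 incorrect
# xmux/unix:0 correct
```

In a live session one would pipe the output of “xauth list” into the function and pass the cookie read from the X-server’s -auth file.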

I prepared a file /home/myself/.Xauthority (as root) during the running X-session with the following entries:

myself@xtux:~> xauth list
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
myself@xtux:~> kate
myself@xtux:~> flatpak run org.blender.Blender &
[1] 28187
myself@xtux:~> Invalid MIT-MAGIC-COOKIE-1 keyUnable to open a display
^C
[1]+  Exit 134    flatpak run org.blender.Blender
myself@xtux:~>

You see that I have changed the hostname (xtux) of our original entry (with the correct cookie) to the meanwhile changed hostname (xmux). This entry is associated with the presently valid magic cookie. And I have added a leading entry with the original hostname, but a modified and therefore invalid cookie value.

Now, let us check what a KDE application like *kate* would do afterward: The output above shows that it just started – without any warning. However, flatpak did and does NOT start:

myself@xtux:~> flatpak run org.blender.Blender &
[1] 16686
myself@xtux:~> Invalid MIT-MAGIC-COOKIE-1 keyUnable to open a display

Let us change the entries to:

myself@xtux:~> xauth list
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
myself@xtux:~> kate
Invalid MIT-MAGIC-COOKIE-1 keymyself@xtux:~> flatpak run org.blender.Blender &
[1] 28371
myself@xtux:~> Invalid MIT-MAGIC-COOKIE-1 keyUnable to open a display
^C
[1]+  Exit 134    flatpak run org.blender.Blender
myself@xtux:~>

It may seem that a line break is missing in the output and that kate did not start. But this is wrong:

- Kate actually DID start! But it produced an alarming warning!
- However flatpak did NOT start AND gave us a warning!

Meaning:

We got a clear indication that

- different entries are used by our two applications and
- that the potential discrepancies of the hostnames associated with the .Xauthority-entries in comparison to environment variables and /etc/hostname are handled differently by our two applications.

Why kate starts despite the clear warning is a question others have to answer. I see no direct security issue, but I have not really thought it through.

Let us now change the order of the cookie entries:

myself@xtux:~> xauth list
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
myself@xtux:~> kate
Invalid MIT-MAGIC-COOKIE-1 keymyself@xtux:~> flatpak run org.blender.Blender &
[1] 28571
myself@xtux:~> /run/user/1004/gvfs/ non-existent directory
Saved session recovery to '/tmp/quit.blend'
Blender quit
[1]+  Fertig    flatpak run org.blender.Blender
myself@xtux:~>

kate again gives us a warning, but starts.

And, oh wonder, flatpak now does start Blender in its namespace without any warning!

Let us switch the hostnames again in the given order of the cookie entries. This gives us the last variation for mixing the new and the old hostnames with valid/invalid cookies:

myself@xtux:~> xauth list
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
myself@xtux:~> kate
myself@xtux:~> flatpak run org.blender.Blender &
[1] 28936
myself@xtux:~> /run/user/1004/gvfs/ non-existent directory
Saved session recovery to '/tmp/quit.blend'
Blender quit
[1]+  Fertig    flatpak run org.blender.Blender
myself@xtux:~>

kate now starts without any warning. And also flatpak starts without warning.

How can we interpret the results above? So far the results are consistent with the following rules:

**kate:** Whenever the cookie associated with the present static hostname in .Xauthority matches the X-server’s cookie kate will start without warning. Otherwise it issues a warning, but starts nevertheless.

**flatpak:** Whenever the first entry in .Xauthority provides a cookie that matches the X-server’s cookie for the session, then flatpak starts an X-client program like Blender.

But things are more complicated than this. We also have to check what happens if we, for some reason, have entries in ~/.Xauthority that reflect a hostname neither present in /etc/hostname nor in environment variables. (Such an entry may have resulted from previous access trials to the X-servers of remote hosts or “ssh -X” connections.)

I will call such a hostname a locally “**unknown hostname**” below. I admit that this is not the best wording, but it is a short one.

Entries of the form

myself@xtux:~> xauth list
xfux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44

reflect such a situation.

The reactions of both kate and flatpak are negative in the given situation:

myself@xtux:~> xauth list
xfux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
myself@xtux:~> kate
Invalid MIT-MAGIC-COOKIE-1 keyNo protocol specified
qt.qpa.xcb: could not connect to display :0
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
Failed to create wl_display (No such file or directory)
qt.qpa.plugin: Could not load the Qt platform plugin "wayland" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: wayland-org.kde.kwin.qpa, eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, xcb.

Abgebrochen (Speicherabzug geschrieben)
myself@xtux:~>
myself@xtux:~> flatpak run org.blender.Blender &
[1] 30565
myself@xtux:~> Invalid MIT-MAGIC-COOKIE-1 keyUnable to open a display
^C
[1]+  Exit 134    flatpak run org.blender.Blender
myself@xtux:~>

Meaning:

*flatpak* reacts allergically to entries with unknown hostnames and to entries with known hostnames, but with a wrong cookie.

However:

myself@xtux:~> xauth list
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xfux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
myself@xtux:~> kate
myself@xtux:~> flatpak run org.blender.Blender &
[1] 30710
myself@xtux:~> /run/user/1004/gvfs/ non-existent directory
Saved session recovery to '/tmp/quit.blend'
Blender quit
[1]+  Fertig    flatpak run org.blender.Blender
myself@xtux:~>

Both applications start without warning!

The reaction of our applications changes again for the following settings:

myself@xtux:~> xauth list
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xfux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
myself@xtux:~> kate
No protocol specified
myself@xtux:~> flatpak run org.blender.Blender &
[1] 30859
myself@xtux:~> /run/user/1004/gvfs/ non-existent directory
Saved session recovery to '/tmp/quit.blend'
Blender quit
[1]+  Fertig    flatpak run org.blender.Blender
myself@xtux:~>

Meaning:

If the hostname associated with the right cookie is present in the environment variables, but does not correspond to the contents of /etc/hostname, then kate will start with some warning. Flatpak starts without warning.

Switching entries and renaming confirms previous results:

myself@xtux:~> xauth list
xfux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
xtux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
myself@xtux:~> kate
No protocol specified
myself@xtux:~> flatpak run org.blender.Blender &
[1] 31113
myself@xtux:~> /run/user/1004/gvfs/ non-existent directory
Saved session recovery to '/tmp/quit.blend'
Blender quit
[1]+  Fertig    flatpak run org.blender.Blender

And:

myself@xtux:~> xauth list
xfux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2a44
xmux/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
myself@xtux:~> kate
myself@xtux:~> flatpak run org.blender.Blender &
[1] 31348
myself@xtux:~> /run/user/1004/gvfs/ non-existent directory
Saved session recovery to '/tmp/quit.blend'
Blender quit

Let us now use unknown hostnames only:

myself@xtux:~> xauth list
xfuxi/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
xruxi/unix:0 MIT-MAGIC-COOKIE-1 650ade473bc07c2e981d6174871c2ad0
myself@xtux:~> kate
No protocol specified
No protocol specified
qt.qpa.xcb: could not connect to display :0
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
Failed to create wl_display (No such file or directory)
qt.qpa.plugin: Could not load the Qt platform plugin "wayland" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: wayland-org.kde.kwin.qpa, eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, xcb.

Abgebrochen (Speicherabzug geschrieben)
myself@xtux:~> flatpak run org.blender.Blender &
[1] 29115
myself@xtux:~> No protocol specified
Unable to open a display
^C
[1]+  Exit 134    flatpak run org.blender.Blender

So, having unknown hostnames only will lead to no X-access – neither for flatpak nor for kate.

So the rules for our two selected applications regarding the selection of an entry in ~/.Xauthority and the resulting X-server access are more like the following:

**kate:** If an entry in .Xauthority has a known hostname and fits the X-server’s cookie, kate is started – with a warning if the hostname does not match the present static hostname (in /etc/hostname). If there is an additional deviating entry for the present static hostname, but with an incorrect cookie, then the warning also states that the cookie is invalid – but kate starts nevertheless.

**flatpak Blender:** The first entry which matches a known hostname (among those available from environment variables or from /etc/hosts) and which matches screen :0 is used to pick the respective cookie for X-server access. The application (X-client) only starts if this cookie matches the X-server’s cookie.

**Both:** If ~/.Xauthority does not contain entries which match any of the known hostnames, then both programs fail regarding X-access. kate additionally checks for other possible sockets (e.g. for (X)Wayland) in this case.
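The two selection strategies can be mimicked by a small Python sketch. The helper functions below are purely illustrative models of the observed behavior – they are not the actual code of kate or flatpak; the cookie values are just the ones from the transcripts above:

```python
# Toy models of the two selection strategies -- NOT the real implementations.

def flatpak_style_cookie(entries, known_hosts, display=":0"):
    """Pick the cookie of the first .Xauthority entry whose hostname is
    known and whose display matches; return None if nothing matches."""
    for host_display, cookie in entries:
        host, _, disp = host_display.partition("/unix")
        if host in known_hosts and disp == display:
            return cookie
    return None

def kate_style_access(entries, known_hosts, server_cookie):
    """kate succeeds if ANY entry with a known hostname carries the
    X-server's cookie (possibly emitting a warning on the way)."""
    return any(host_display.partition("/unix")[0] in known_hosts
               and cookie == server_cookie
               for host_display, cookie in entries)

entries = [("xfux/unix:0", "650ade473bc07c2e981d6174871c2a44"),
           ("xtux/unix:0", "650ade473bc07c2e981d6174871c2ad0")]
known = {"xtux"}                                      # e.g. the present hostname
server_cookie = "650ade473bc07c2e981d6174871c2ad0"    # what the X-server expects

print(flatpak_style_cookie(entries, known) == server_cookie)  # → True: xfux is skipped
print(kate_style_access(entries, known, server_cookie))       # → True
print(flatpak_style_cookie(entries, {"xfuxi", "xruxi"}))      # → None: unknown hostnames only
```

The last call mirrors the transcript above with only unknown hostnames: no entry qualifies, so no cookie is offered and X-access fails.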

When evaluating these rules with respect to security issues one should always keep in mind that .Xauthority-entries like the ones we have artificially constructed may have been the result of a sequence of hostname changes followed by restarts of the X-server. This will become clearer in the next post.

By some simple experiments one can show that the access of an X-client application to an Xorg-server does not only depend on a cookie match, but also on the combination of hostnames and associated magic cookie values offered by multiple entries in the file ~/.Xauthority. The rules by which an X-client selects a specific entry depend on the application and may differ from the rules other applications follow. We have seen that at least a flatpak-based Blender installation follows other rules than e.g. KDE’s kate. Therefore, changes of the hostname during a running X-session may have an impact on the startup of applications – e.g. if .Xauthority already contains an entry for the new hostname.

The attentive reader has, of course, noticed that the experiments described above alone do not explain the disturbing reaction of flatpak to hostname changes described in the beginning. These reactions had to do with already existing entries for certain hostnames in .Xauthority. Additional entries may in general be the result of previous (successful) accesses to remote hosts’ X-servers or previous local hostname changes followed by X-server restarts. In the next post I will, therefore, extend the experiments to intermediate starts of both the X-server and the graphical target after hostname changes.

Opensuse Leap 15.3 documentation on xauth

Stackoverflow question on “How does X11 authorization work? (MIT Magic Cookie)”

Stackexchange question on “Invalid MIT-MAGIC-COOKIE-1 key – xhost: unable to open display ‘:0’”

Also see a comment of the 25th of January, 2019, in an archived Opensuse.org contribution

And before we who love democracy and freedom forget it:

The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Somebody who orders the systematic destruction of civilian infrastructure must be fought and defeated because he is a danger to mankind and principles of humanity. Long live a free and democratic Ukraine!

“We see too clearly Putin’s intention to terrorize an entire population indiscriminately, to let them freeze, to deprive them of their rights, even of their right to live. Also in view of the imperial delusion this man is obviously obsessed with, worse is unfortunately to be expected.”

and:

He would not manage to celebrate Christmas this year the way he had in earlier years, Gauck said.

“But, on the other hand, we must not let a cold-blooded war criminal ruin our way of life.”

Quoted according to a publication of the newspaper “Die Zeit” (a contribution to the Zeit-blog on the war in Ukraine).

Clear words – especially about the Putler. Nothing to add …

I wish all the people in Ukraine at least some peaceful hours during the next days. Our thoughts are with you – and I hope you will get in 2023 all the equipment you need to defend your *and* our freedom in a democratic Europe.


A flatpak installation of Blender had become necessary after I changed the laptop’s OS to Opensuse Leap 15.3. Reason: Opensuse did and does not offer any current version of Blender in its official repositories for Leap 15.3 (which in itself is a shame). (I have not yet checked whether the situation has changed with Leap 15.4.)

Blender’s present version available via flatpak is 3.3.1. I had Blender version 3.1 installed, so I upgraded to version 3.3.1. However, this upgrade step for Blender alone did not lead to a working installation (in contrast to a snap upgrade).

The sequence of commands I used to perform the update was:

flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo
flatpak update org.blender.Blender
flatpak list

The second command brought Blender V3.3.1 to my disk. However, when I tried to start my new Blender with

flatpak run org.blender.Blender &

the system complained about a non-working GLX and GL installation.

But my KDE desktop actually was running on the laptop’s Nvidia card. The laptop has an Optimus configuration. I use Opensuse Prime to switch between the Intel i915 driver for the graphics unit integrated in the processor and the Nvidia driver for the dedicated graphics card. And Nvidia was definitely running.

Flatpak requires the right interface for the Nvidia card AND the presently active GLX-environment to start OpenGL applications.

A “flatpak list” showed me that I had an “app” “nvidia-470-141-01” installed.

Name                  Application ID                                  Version   Branch  Installation
Blender               org.blender.Blender                             3.3.1     stable  system
Codecs                org.blender.Blender.Codecs                                stable  system
Freedesktop Platform  org.freedesktop.Platform                        21.08.16  21.08   system
Mesa                  org.freedesktop.Platform.GL.default             21.3.9    21.08   system
nvidia-470-141-01     org.freedesktop.Platform.GL.nvidia-470-141-01             1.4     system
Intel                 org.freedesktop.Platform.VAAPI.Intel                      21.08   system
ffmpeg-full           org.freedesktop.Platform.ffmpeg-full                      21.08   system
openh264              org.freedesktop.Platform.openh264               2.1.0     2.0     system

A quick look at nvidia-settings and YaST showed me, however, that the drivers and other components installed on Leap 15.3 were of version “nvidia-470-141-03”.

Then I tried

mytux:~ # flatpak install flathub org.freedesktop.Platform.GL.nvidia-470-141
Looking for matches…
Similar refs found for ‘org.freedesktop.Platform.GL.nvidia-470-141’ in remote ‘flathub’ (system):

   1) runtime/org.freedesktop.Platform.GL.nvidia-470-94/x86_64/1.4
   2) runtime/org.freedesktop.Platform.GL.nvidia-470-141-03/x86_64/1.4
   3) runtime/org.freedesktop.Platform.GL.nvidia-470-141-10/x86_64/1.4
   4) runtime/org.freedesktop.Platform.GL.nvidia-470-74/x86_64/1.4
   5) runtime/org.freedesktop.Platform.GL.nvidia-430-14/x86_64/1.4
   6) runtime/org.freedesktop.Platform.GL.nvidia-390-141/x86_64/1.4

Which do you want to use (0 to abort)? [0-6]: 2

This command offered me a list to select the required subversion from. In my case option 2 was the appropriate one.

And indeed after the installation I could start my new Blender version again.

By the way: Flatpak allows for multiple versions to be installed at the same time. Like:

Name                  Application ID                                  Version   Branch  Installation
Blender               org.blender.Blender                             3.3.1     stable  system
Codecs                org.blender.Blender.Codecs                                stable  system
Freedesktop Platform  org.freedesktop.Platform                        21.08.16  21.08   system
Mesa                  org.freedesktop.Platform.GL.default             21.3.9    21.08   system
nvidia-470-141-03     org.freedesktop.Platform.GL.nvidia-470-141-03             1.4     system
nvidia-470-141-10     org.freedesktop.Platform.GL.nvidia-470-141-10             1.4     system
Intel                 org.freedesktop.Platform.VAAPI.Intel                      21.08   system
ffmpeg-full           org.freedesktop.Platform.ffmpeg-full                      21.08   system
openh264              org.freedesktop.Platform.openh264               2.1.0     2.0     system

Your flatpak installation must provide a version of the ‘org.freedesktop.Platform.GL.nvidia-xxx-nnn-mm’ package which matches the present Nvidia driver installation on your Linux operating system.

Do not forget to upgrade the flatpak packages for Nvidia after having upgraded the Nvidia drivers on your Linux OS!

https://github.com/flathub/org.blender.Blender/issues/97

Replacing unstable Blender 2.82 on Leap 15.3 with flatpak or snap based Blender 3.1

Autoencoders, latent space and the curse of high dimensionality – I

We have trained an AE with images of the CelebA dataset. The Encoder and the Decoder of the AE consist of a series of convolutional layers. Such layers have the ability to extract characteristic patterns out of input (image) data and to save related information in their so-called *feature maps*. CelebA images show human heads against varying backgrounds. The AE was obviously able to learn the typical features of human faces, hair-styling, background etc. After a sufficient number of training epochs the AE’s Encoder produces “z-points” (vectors) in the *latent space*. The latent space is a vector space with a relatively low number of dimensions compared with the number of image pixels. The Decoder of the AE was able to reconstruct images from such z-points which resembled the originals closely and with good quality.

We saw, however, that the latent space (or “z-space”) lacks an important property:

The latent space of an Autoencoder does **not** appear to be densely and *uniformly* populated by the z-points of the training data.

We saw that this makes the latent space of an Autoencoder almost unusable for creative and generative purposes. The z-points which gave us good reconstructions in the sense of recognizable human faces appeared to be arranged and positioned in a very special way within the latent space. Below I call a CelebA-related z-point for which the Decoder produces a reconstruction image with a clearly visible face a “**meaningful z-point**”.

We could not reconstruct “meaningful” images from *randomly* chosen z-points in the latent space of an Autoencoder trained on CelebA data. Randomly in the sense of *random positions*. The Decoder could not re-construct images with recognizable human heads and faces from almost any randomly positioned z-point. We got the impression that many more non-meaningful z-points exist in latent space than meaningful z-points.

We would expect such a behavior if the z-points for our CelebA training samples were arranged in tiny fragments or thin (and curved) filaments inside the multidimensional latent space. Filaments could have the structure of

- multi-dimensional manifolds with almost no extensions in some dimensions
- or almost one-dimensional string-like manifolds.

The latter would basically be described by a (wiggled) thin *curve* in the latent space. Its extensions in other dimensions would be small.

It was therefore reasonable to assume that meaningful z-points are surrounded by areas from which no reasonable interpretable image with a clear human face can be (re-) constructed. Paths from a “meaningful” z-point would only in a very few distinct directions lead to another meaningful point. As it would be the case if you had to follow a path on a thin curved manifold in a multidimensional vector space.

So, we had some good reasons to speculate that meaningful data points in the latent space may be organized in a fragmented way or that they lie within thin and curved filaments. I gave my readers a link to a scientific study which supported this view. But without detailed data or some visual representations the experiments in my last post only provided indirect indications of such a complex z-point distribution. And if there were filaments we got no clue whether these were one- or multidimensional.

Do we have a chance to get a more direct evidence about a fragmented or filamental population of the latent space? Yes, I think so. And this is the topic of this post.

However, the analysis is a bit complicated as we have to deal with a multidimensional space. In our case the number of dimensions of the latent space is **z_dim = 256**. No chance to plot any clusters or filaments directly! However, some other methods will help to reduce the dimensionality of the problem and still get some valid representations of the data point correlations. In the end we will have a very strong evidence for the existence of filaments in the AE’s z-space.

Below I will use several methods to investigate the z-point distribution in the multidimensional latent space:

- An analysis of the variation of the z-point number-density along coordinate axes and vs. radius values.
- An application of t-SNE projections from the standard multidimensional coordinate system onto a 2-dimensional plane.
- A PCA analysis and subsequent t-SNE projections of the PCA-transformed z-point distribution and its most important PCA components down to a 2-dim plane. Note that such an approach corresponds to a sequence of projections: first a linear projection onto PCA rotated coordinates, then a non-linear t-SNE-projection which scales and represents data point correlations on different scales on a 2-dim plane.
- A direct view on the data distribution projected onto flat planes formed by two selected coordinate axes in the PCA coordinate system. This will directly reveal whether the data (despite projection effects) exhibit filaments and voids on some (small?) scales.
- A direct view on the data distribution projected onto a flat plane formed by two coordinate axes of the original latent space.

The results of all methods combined strongly support the claim that the latent space is neither populated densely nor uniformly on (small) scales. Instead data points are distributed along certain *filamental structures* around voids.

Below you find the layer structure of the AE’s Encoder. It got four Conv2D layers. The Decoder has a corresponding reverse structure consisting of Conv2DTranspose layers. The full AE model was constructed with Keras. It was trained on CelebA for 24 epochs with a small step size. The original CelebA images were reduced to a size of 96×96 pixels.

Each z-point can be described by a vector, whose components are given by projections onto the 256 coordinate axes. We assume orthogonal axes. Let us first look at the variation of the z-point number density vs. reasonable values for each of the 256 vector-components.

Below I have plotted the number density of z-points vs. coordinate values along all 256 coordinate axes. Each curve shows the variation along one of the 256 axes. The data sampling was done on intervals with a width of 0.25:

Most curves look like typical Gaussians with a peak at the coordinate value 0.0 with a half-width of around 2.
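A number-density curve of this kind can be reproduced with a few lines of numpy. The sketch below uses synthetic standard-normal values as a stand-in for the real z-point components (the actual analysis would of course use the Encoder’s output array):

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(loc=0.0, scale=1.0, size=(20000, 256))  # synthetic stand-in for real z-points

# number density along one coordinate axis, sampled on intervals of width 0.25
axis = 52
edges = np.arange(-8.0, 8.0 + 0.25, 0.25)
counts, edges = np.histogram(z[:, axis], bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])

# for a centered Gaussian the curve peaks close to the coordinate value 0.0
print(abs(centers[np.argmax(counts)]) < 0.5)  # → True
```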

You see, however, that there are some coordinates which dominate the spatial distribution in the latent vector-space. For the following components the number density distribution is relatively broad and peaks at a center *different from the origin* of the z-space. To pick a few of these coordinate axes:

 52:  center:  5.0,  width: 8
 61:  center:  1.0,  width: 3
 73:  center:  0.0,  width: 5.5
 83:  center: -0.5,  width: 5
 94:  center:  0.0,  width: 4
116:  center:  0.0,  width: 4
119:  center:  1.0,  width: 3
130:  center: -2.0,  width: 9
171:  center:  0.7,  width: 5
188:  center:  0.75, width: 2.75
200:  center:  0.5,  width: 11
221:  center: -1.0,  width: 8

The first number is just an index of the vector component and the related coordinate axis. The next plot shows the number density along some of these specific coordinate axes:

**What have we learned? **

For most coordinate axes of the latent space the number density of the z-points peaks at 0.0. We see an approximate Gaussian form of the number density distribution. There are around 5 coordinate directions where the distribution has a peak significantly off the origin (52, 130, 171, 200, 221). Along the corresponding axes the distribution of z-points obviously has an elongated form.

If there were only one such special vector component then we would speak of an elongated, ellipsoidal and almost cigar-like distribution with the thickest area at some position along the specific coordinate axis. For a combination of more axes with elongated distributions, each with a center off the origin, we get instead diagonally oriented multidimensional and elongated shapes.

These findings show again that large regions of the latent space of an AE remain empty. To get an idea just imagine a three-dimensional space with all data in x-direction culminating at a coordinate value of 5 with a half-width of, let’s say, 8. In the other directions y and z we have our Gaussian distributions with a total half-width of 1 around the mean value 0. What do we get? A cigar-like shape confined around the x-axis and stretching from -3 < x < 13. And the rest of the space: more or less empty. We have obviously found something similar at different angular directions of our multidimensional latent space.
As the number of special coordinate directions is limited these findings tell us that a PCA analysis could be helpful. But let us first have a look at the variation of number density with the *radius* value of the z-points.

We define a *radius* via an Euclidean L2 norm for our 256-dimensional latent space. Afterward we can reduce the visualization of the z-point distribution to a one dimensional problem. We can just plot the variation of the number density of z-points vs. the radius of the z-points.

In the first plot below the sampling of data was done on intervals of 0.5.

The curve does not remain that smooth on smaller sampling intervals; see e.g. the result for intervals of width 0.05:

Still, we find a pronounced peak at a radius of **R=16.5**. But do not be misled: 16 appears to be a big value, but this is mainly due to the high number of dimensions!

How does the peak in the close vicinity of R=16 fit to the above number density data along the coordinate axes? Answer: Very well. If you assume a z-point vector with an average value of 1 per coordinate direction we actually get a radius of exactly **R=16**!

But what about Gaussian distributions along the coordinate axes? Then we have to look at resulting expectation values. Let us assume that we fill a vector of dimension 256 with numbers for each component picked statistically from a normal distribution with a width of 1. And let us repeat this process many times. Then what will the expectation value for each component be?

A coordinate value contributes with its square to the radius. The math, therefore, requires an evaluation of the integral ∫ x²·g(x) dx per coordinate, with g(x) denoting the standard normal distribution. This integral gives us the expectation value for the contribution of each coordinate to the squared vector length. It indeed has a value of 1.0. From this it follows that the expectation value for the length according to a Euclidean L2-metric is **avg_radius = sqrt(256) = 16**. Nice, isn’t it?
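The sqrt(256) argument can be checked numerically. A quick numpy sketch, using synthetic standard-normal vectors rather than the real z-points:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=(10000, 256))       # components drawn from N(0, 1)

print(round(float((v**2).mean()), 2))   # E[x^2] per component: close to 1.0
radii = np.linalg.norm(v, axis=1)       # Euclidean L2 norm per vector
print(round(float(radii.mean()), 1))    # close to sqrt(256) = 16
```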

However, due to the fact that not all Gaussians along the coordinate axes peak at zero, we get, of course, some deviations and the flank of the number distribution on the side of larger radius values becomes relatively broad.

What do we learn from this? Regions very close to the origin of the z-space are not densely populated. And above a radius value of 32, we do not find z-points either.

To get an impression of possible clustering effects in the latent space let us apply a t-SNE analysis. A non-standard parameter set for the sklearn-variant of t-SNE was chosen for the first analysis

tsne2 = TSNE(n_components=2, early_exaggeration=16, perplexity=10, n_iter=1000)

The first plot shows the result for 20,000 randomly selected z-points corresponding to CelebA images

This plot, too, indicates that the latent space is not populated with uniform density in all regions. Instead we see some fragmentation and clustering. But note that this might have happened on different length scales. t-SNE arranges its projections such that correlations on *different* scales get clearly indicated. So the distances in this plot must not be confused with the real spatial distances in the original latent space. The axes of the t-SNE plot do not reflect any axes of the latent space and the plotted distribution is not the real data point distribution after a linear and orthogonal projection onto a plane. t-SNE works non-linearly.

However, the impression of clustering remains for a growing number of z-points. In contrast to the first plot the next plots for 80,000 and 165,000 z-points were calculated with standard t-SNE parameters.

We still see gaps everywhere between locally dense centers. At the center the size of the plotted points leads to overlapping. If one could zoom into some of the centers then gaps would again appear on smaller scales (see more plots below).

The z-point distribution can be analyzed by a PCA algorithm. There is one dominant component and the importance smooths out to an almost constant value after the first 10 components.

This is consistent with the above findings. Most of the coordinates show rather similar Gaussian distributions and thus contribute in almost the same manner.

The PCA analysis transforms our data to a rotated coordinate system with its origin at a position such that the transformed z-point distribution gets centered around this new origin. The orthogonal axes of the new PCA coordinate system point into the directions of the main components.
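Such a PCA transformation can be sketched with numpy’s SVD alone. The snippet below uses synthetic data with one artificially elongated direction as a stand-in for the real z-point array:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(5000, 256))
z[:, 52] = rng.normal(loc=5.0, scale=4.0, size=5000)  # one elongated direction off the origin

zc = z - z.mean(axis=0)                  # center the distribution around the new origin
U, S, Vt = np.linalg.svd(zc, full_matrices=False)
z_pca = zc @ Vt.T                        # z-point coordinates in the rotated PCA system

explained = S**2 / np.sum(S**2)          # variance fraction per primary component
print(explained[0] > 5 * explained[1])   # → True: one clearly dominant component
```

The rows of Vt are the primary component directions; projecting onto any two of them gives the flat-plane views discussed below.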

When the projections of all points onto planes formed by **two** selected PCA axes do not show a uniform distribution but a fragmented one, then we can safely assume that there really is some fragmentation going on.

Below you see t-SNE-plots for a growing number of leading PCA components up to 4. The filamental structure gets a bit smeared out, but it does not really disappear. Especially the elongated empty regions (voids) remain clearly visible.

**t-SNE after PCA for the first 2 main components – 80,000 randomly selected z-points **

**t-SNE after PCA for the first 2 main components – 165,000 randomly selected z-points **

**t-SNE after PCA for the first 4 main PCA components – 165,000 randomly selected z-points **

For 10 components t-SNE gets a presentation problem and the plots get closer to what we saw when we directly operated on the latent space.

But still the 10-dim space does not appear to be uniformly populated. Despite an expected smear-out effect due to the non-linear projection, the empty areas seem to be at least as many and as extended as the populated areas.

t-SNE blows correlations up to make them clearly visible. Therefore, we should also answer the following question:

On what scales does the fragmentation really happen ?

For this purpose we can make a scatter plot of the projection of the z-points onto a plane formed by the leading two primary component axes. Let us start with an overview and relatively large limiting values along the two (PCA) axes:

Yeah, a PCA transformation obviously has centered the distribution. But now the latent space appears to be filled densely and uniformly around the new origin. Why?

Well, this is only a matter of the visualized length scales. Let us zoom in to a square of side-length 5 at the center:

Well, not so densely populated as we thought.

And yet a further zoom to smaller length scales:

And eventually a really small square around the origin of the PCA coordinate system:

**z-point distribution at the center of a two-dim plane formed by the coordinate axes of the first 2 primary components **

The chosen square has its corners at (-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25).

Obviously, neither a dense nor a uniform distribution! Even after a PCA transformation we still see how thinly the latent space is populated and that the “meaningful” z-points from the CelebA data lie along **narrow and curved lines** with some point-like intersections. Between such lines we see extended voids.

Let us see what happens when we look at the 2-dim plane defined by the first and the 18th axes of the PCA coordinate system:

Or the distribution resulting for the plane formed by the 8th and the 35th PCA axis:

We could look at other flat planes, but we do not get rid of the line-like structures around void-like areas. This is really a strong indication of filamental structures.

**Interpretation of the line patterns:**

The interesting thing is that we get lines for z-point projections onto multiple planes. What does this tell us about the structure of the filaments? In principle we have the two possibilities already named above: 1) thin multidimensional manifolds or 2) thin and basically one-dimensional manifolds. If you think a bit about it, you will see that projections of multidimensional manifolds would not give us lines or curves on *all* projection planes. However, curved string- or tube-like manifolds do appear as lines or line segments after a projection onto almost all flat planes. The prerequisite is that the extension of the string in other directions than its main one must really be small – the filament has to have a small diameter in all but one direction.
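This projection argument can be illustrated numerically: a curved, essentially one-dimensional “string” embedded in a 256-dim space covers far fewer cells of a 2-dim grid than a cloud of comparable extent – on (almost) any projection plane. A numpy sketch with purely synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, dim = 20000, 256

# a curved one-dimensional "filament": smooth functions of a single parameter t
t = rng.uniform(0.0, 2.0 * np.pi, n)
freqs = rng.integers(1, 4, dim)
phases = rng.uniform(0.0, 2.0 * np.pi, dim)
filament = 5.0 * np.sin(np.outer(t, freqs) + phases)   # shape (n, dim)

cloud = rng.uniform(-5.0, 5.0, size=(n, dim))          # a uniform cloud of the same extent

def occupied_fraction(points_2d, n_bins=50):
    """Fraction of grid cells containing at least one projected point."""
    h, _, _ = np.histogram2d(points_2d[:, 0], points_2d[:, 1],
                             bins=n_bins, range=[[-5.0, 5.0], [-5.0, 5.0]])
    return np.count_nonzero(h) / h.size

# project both point sets onto the plane of two arbitrary coordinate axes
print(occupied_fraction(filament[:, [8, 35]]) < 0.5 * occupied_fraction(cloud[:, [8, 35]]))  # → True
```

The filament’s projection traces a thin curve through the grid, while the cloud fills almost every cell – exactly the line-versus-area contrast seen in the scatter plots.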

So, if the filaments really are one-dimensional string-like objects: Should we not see something similar in the original z-space? Let us for example look at the plane formed by axis 52 and axis 221 in the original z-space (without PCA transformation). You remember that these were axes along which the distribution got elongated, with centers off the origin. And indeed:

Again we see lines and voids. And this strengthens our idea about filaments as more or less one-dimensional manifolds.

The “meaningful” z-points for our CelebA data obviously get positioned on long, very thin and basically one-dimensional filaments which surround voids. And the voids are relatively large regarding their area/volume. (Reminds me of the galaxy distribution in simulations of the development of the early universe, by the way.)

Therefore: Whenever you chose a randomly positioned z-point the chance that you end up in an unpopulated region of the z-space or in a void and not on a filament is extremely big.

We have used a whole set of methods to analyze the z-point distribution of an AE trained on CelebA images. We found that the z-point distribution is dominated by the number density variation along a few coordinate axes. Elongated shapes in certain directions of the latent space are very plausible on larger scales.

We found that the number density distributions along most of the coordinate axes have a thin Gaussian form with a peak at the origin and a half-width of 1. We have no real explanation for this finding, but it may be related to the fact that some dominant features of human faces show Gaussian distributions around a mean value. With Gaussians given we could, however, explain why the number density vs. radius showed a peak close to R=16.

A PCA analysis finds primary directions in the multidimensional space and transforms the z-point distribution into a corresponding one for orthogonal primary component axes. For logical reasons we can safely assume that the corresponding projections of the z-point distribution onto the new axes would still reveal existing thin filamental structures. Actually, we found lines surrounding voids independently of which flat plane we projected the data onto. This finding indicates thin, elongated and curved, but basically one-dimensional filaments (like curved strings or tubes). We could see the same pattern of line-like structures in projections onto flat coordinate planes in the original latent space. The volume of the void areas is obviously much bigger than the volume occupied by the filaments.

Non-linear t-SNE projections onto a 2-dim flat plane, which in addition reproduce and normalize correlations on multiple scales, should make things a bit fuzzier, but still show empty regions between denser areas. Our t-SNE projections all showed signs of complex correlation patterns of the z-points with a lot of empty space between curved structures.

The experiments all in all indicate that “meaningful” z-points of the training data, for which we get good reconstructions, lie within thin filaments on characteristic small length scales. The areas/volumes of the voids between the filaments instead are relatively big. This explains why the chances that a randomly chosen point in the z-space falls into a void are very high.

The results of the last post are consistent with the interpretation that z-points in the voids do not lead to reconstructions by the Decoder which exhibit standard objects of the training images. In the case of CelebA such z-points do not produce images with clear face- or head-like patterns. Face-like features obviously correspond to very special correlations of z-point coordinates in the latent space. These correlations correspond to thin manifolds consuming only a tiny fraction of the z-space, with a volume close to zero.


First I started to copy with the help of KDE’s Dolphin. The transfer rates were really bad – around **5 MB/s**. I would have had to wait for hours.

Then I tried with just a standard **cp**-command. The rates rose to something like 20 MB/s to 25 MB/s. But this was not a reasonable transfer rate either.

A search on the Internet then gave me the following two commands

mytux:~ # echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes

mytux:~ # echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes

Note that setting values on these system parameters automatically sets the standard values of

mytux:~ # cat /proc/sys/vm/dirty_ratio
20
mytux:~ # cat /proc/sys/vm/dirty_background_ratio
10

to zero! And vice versa. You either have to use **dirty_bytes** and **dirty_background_bytes** *OR* the related ratio-parameters!

The effect of the above settings was remarkable: The total transfer rate (i.e. reading from one disk and writing to another; both on the same controller) rose to **130 MB/s**. Still far away from theoretical rates. But something I could live with.

It would be worthwhile to experiment more with these parameter values, but I was satisfied regarding my specific problem.

An efficient data transfer on Linux systems between two external USB-3 disks may require system parameter settings by the user.

I admit: The whole thing left me baffled. For discussions and hints please refer to the Internet. It seems that the whole performance reduction is worst for users who have a lot of RAM. And the problem exists since 2014. Unbelievable!

https://gist.github.com/2E0PGS/f63544f8abe69acc5caaa54f56efe52f

https://www.suse.com/support/kb/doc/?id=000017857

https://archived.forum.manjaro.org/t/decrease-dirty-bytes-for-more-reliable-usb-transfer/62513

https://unix.stackexchange.com/questions/567698/slow-transfer-write-speed-to-usb-3-flash-what-are-all-possible-solutions

https://unix.stackexchange.com/questions/23003/slow-performance-when-copying-files-to-and-from-usb-devices

https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

https://forum.manjaro.org/t/the-pernicious-usb-stick-stall-problem/52297

https://stackoverflow.com/questions/6834929/linux-virtual-memory-parameters


Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss

Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics

Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator

Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images

Variational Autoencoder with Tensorflow 2.8 – XI – image creation by a VAE trained on CelebA

Variational Autoencoder with Tensorflow 2.8 – XII – save some VRAM by an extra Dense layer in the Encoder

So far, most of the posts in this series have covered a variety of methods (provided by Tensorflow and Keras) to control the KL loss. One of the previous posts (XI) provided (indirect) evidence that GradientTape()-based methods for the KL-loss calculation also work as expected. In stark contrast to a standard Autoencoder [AE], our VAE trained on CelebA data proved its ability to reconstruct humanly interpretable images from random z-points (or *z-vectors*) in the **latent space**, provided that the z-points lie within a reasonable distance from the origin.

We could leave it at that. One of the basic motivations to work with VAEs is to use the latent space “creatively”. This requires that the data points for similar training images fill the latent space densely and without gaps between clusters or filaments. We have obviously achieved this objective. Now we could start to do funny things like combining reconstruction with vector arithmetic in the latent space.

But looking a bit deeper into the latent space may give us some new insights. The central point of the KL-loss is that it introduces a statistical element into the training of AEs. As a consequence a VAE fills the so-called “latent space” in a different way than a simple AE. The z-point distribution gets confined, and the areas around z-points for meaningful training images are forced to get broader and overlap. So two questions ask for an answer:

- Can we get more direct evidence of what the KL-loss does to the data distribution in latent space?
- Can we get some direct evidence supporting the assumption that most of the latent space of an AE is empty or only sparsely populated, in contrast to a VAE’s latent space?

Therefore, I thought it would be funny to compare the data organization in latent space caused by an AE with that of a VAE. But to get there we need some solid starting point. If you consider for a moment where you yourself would start with an AE vs. VAE comparison, you will probably come across the following additional and also interesting questions:

- Can one safely assume that a VAE with only a very tiny amount of KL-loss reproduces the same z-point distribution vs. radius which an AE would give us?
- In general: Can we really expect a VAE with a very tiny Kullback-Leibler loss to behave as a corresponding AE with the same structure of convolutional layers?

The answers to all these questions are the topics of this post and a forthcoming one. To get some answers I will compare a VAE with a very small KL-loss contribution with a similar AE. Both network types will consist of equivalent convolutional layers and will be trained on the **CelebA dataset**. We shall look at the resulting data point density distribution vs. radius, clustering properties and the ability to create images from statistical z-points.

This will give us a solid base to proceed to larger and more natural values of the KL-loss in further posts. I got some new insights along this path and hope the presented data will be interesting for the reader, too.

Below and in following posts I will sometimes call the target space of the Encoder also the “*z-space*“.

The training of an AE or a VAE occurs in a self-supervised manner. A VAE or an AE learns to create a point, a *z-point*, in the latent space for each of the training objects (e.g. CelebA images) in such a way that the Decoder can reconstruct an object (image) very close to the original from the z-point’s coordinate data. We will use the “**CelebA**” dataset to study the KL-impact on the z-point distribution. CelebA is more challenging for a VAE than MNIST. And the latent space requires a substantially higher number of dimensions than in the MNIST case for reasonable reconstructions. This makes things even more interesting.

The latent z-space filled by a trained AE or VAE is a multi-dimensional vector space. Meaning: Each z-point can be described by a vector defining a position in z-space. A vector in turn is defined by concrete values for as many vector components as the z-space has dimensions.

Of course, we would like to see some *direct* data visualizing the impact of the KL-loss on the z-point distribution which the Encoder creates for our training data. As we deal with a multidimensional vector space we cannot plot the data distribution. We have to simplify and somehow get rid of the many dimensions. A simple solution is to look at the data point distribution in latent space with respect to the *distance* of these points from the origin. Thereby we transform the problem into a one-dimensional one.

More precisely: I want to analyze the change in the number of z-points within “*radius*”-intervals. Of course, a “radius” has to be defined in a multidimensional vector space such as the z-space. But this can easily be achieved via a Euclidean L2-norm. As we expect the KL-loss to have a confining effect on the z-point distribution, it should reduce the average radius of the z-points. We shall later see that this is indeed the case.

Another simple method to reduce dimensions is to look at just one coordinate axis and the data distribution for the calculated values in this direction of the vector space. Therefore, I will also check the variation in the number of data points along each coordinate axis in one of the next posts.

A look at clustering via projections to a plane may be helpful, too.

Regarding the answers to the 3rd and 4th questions posed above your intuition tells you: Yes, you probably can bet on a similarity between a VAE with tiny KL-loss and an AE.

But when you look closer at the network architectures you may get a bit nervous. Why should a VAE network that has many more degrees of freedom than an AE not use both of its layers for “*mu*” and “*logvar*” to find a different distribution solution? A solution related to another minimum of the loss hyperplane in the weight configuration space? Especially as this weight-related space is significantly bigger than that of a corresponding AE with the same convolutional layers?

The whole point has to do with the following facts: In an AE’s Encoder the last flattening layer after the Conv2D-layers is connected to just one output layer. In a VAE, instead, the flattening layer feeds data into **two** parallel layers (for *mu* and *logvar*) across twice as many connections (with twice as many weight parameters to optimize).

In the last post of this series we dealt with this point from the perspective of VRAM consumption. Now the question is to what degree a VAE will be similar to an AE for a tiny KL-loss.

Why should the z-points found be determined only by *mu*-values and not also by *logvar*-values? And why should the *mu* values reproduce the same distribution as an AE? At least the architecture does not guarantee this by any obvious means …

Well, let us look at some data.

Our test AE contains the same simple sequence of four Conv2D layers in the Encoder and four Conv2DTranspose layers in the Decoder as our VAE. See the AE’s Encoder layer structure below.

A difference, however, will be that I will not use any *BatchNormalization* layers in the AE. But as a correctly implemented BatchNormalization should not affect the representational power of a VAE network for fundamental reasons, this should not influence the comparison of the final z-point distributions in a significant way.

I performed an AE training run for 170,000 CelebA training images over 24 epochs. The latent space has a dimension of **z_dim=256**. (This relatively low number of dimensions will make it easier for a VAE to confine z-points around the origin; see the discussion in previous posts.)

The resulting total loss of our AE became ca. **0.49 per pixel**. This translates into a total value of

AE total loss on Celeb A after 24 epochs (for a step size of 0.0005): **4515**

This value results from a summation over all geometric pixels of our CelebA images which were downsized to 96×96 px (see post IX). The given value can be compared to results measured by our GradientTape()-based VAE-model which delivers integrated values and not averages per pixel.

This value is significantly smaller than the values we would get for the total loss of a VAE with a reasonably big KL-loss contribution on the order of some percent of the reconstruction loss. Such a VAE produces values around 4800 up to 5000. Apparently, an AE’s Decoder reconstructs originals much better than a VAE with a significant KL-loss contribution to the total loss.

But what about a VAE with a very small KL-loss? You will get the answer in a minute.

We can not directly plot a data point distribution in a 256-dimensional vector-space. But we can look at the data point density variation with a calculated distance from the origin of the latent space.

The distance **R** from the origin to the z-point for each image can be measured in terms of an L2 (= Euclidean) norm on the latent vector space. Afterward it is easy to determine the number of images within consecutive radius intervals of e.g. length 0.5 in the range

**0 < R < 35 .**

We perform the following steps to get the respective numbers. We let the Encoder of our trained AE predict the z-points of all 170,000 training images:

z_points = AE.encoder.predict(data_flow)

*data_flow* was created by a Keras *ImageDataGenerator* to send batches of training data to the GPU (see the previous posts).

Radius values are then calculated as

print("NUM_Images_Train = ", NUM_IMAGES_TRAIN)
ay_rad_z = np.zeros((NUM_IMAGES_TRAIN,), dtype='float32')
for i in range(0, NUM_IMAGES_TRAIN):
    sq = np.square(z_points[i])
    sqrt_sum_sq = math.sqrt(sq.sum())
    ay_rad_z[i] = sqrt_sum_sq
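As a side note, the per-image loop can be replaced by a single vectorized NumPy call. A minimal sketch, using random stand-in data for the (NUM_IMAGES_TRAIN, z_dim) array which the Encoder would deliver:

```python
import numpy as np

# Random stand-in for the Encoder output; shape (num_images, z_dim)
z_points = np.random.normal(size=(1000, 256)).astype('float32')

# Euclidean (L2) radius of every z-point in one vectorized call
ay_rad_z = np.linalg.norm(z_points, axis=1)
```

For 170,000 z-points this is considerably faster than the explicit loop, and it produces the same radii.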

The numbers vs. radius relation then results from:

li_rad = []
li_num_rad = []
int_width = 0.5
for i in range(0, 70):
    low  = int_width * i
    high = int_width * (i+1)
    num = np.count_nonzero( (ay_rad_z >= low) & (ay_rad_z < high) )
    li_rad.append(0.5 * (low + high))
    li_num_rad.append(num)
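The same binning can be obtained directly from `np.histogram`. A sketch, with random stand-in values for the radii `ay_rad_z`:

```python
import numpy as np

# Stand-in radii; the real values come from the z-points of the Encoder
ay_rad_z = np.abs(np.random.normal(loc=16., scale=3., size=10000)).astype('float32')

# 70 bins of width 0.5 covering 0 <= R < 35, as in the loop above
li_num_rad, edges = np.histogram(ay_rad_z, bins=70, range=(0., 35.))
li_rad = 0.5 * (edges[:-1] + edges[1:])   # bin centers: 0.25, 0.75, ...
```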

The resulting curve is shown below:

There seems to be a peak around **R = 16.75**. So, yet another question arises:

> What is so special about the radius values of 16 or 17?

We shall return to this point in the next post. For now we take this result as god-given.

Another interesting question is: Do we get some clustering in the latent space? Will there be a difference between an AE and a VAE?

A standard method to find an indication of clustering is to look for an elbow in the so called “inertia” curve for different assumed numbers of clusters. Below you find an **inertia plot** retrieved from the z-point data with the help of **MiniBatchKMeans**.

This result was achieved for data taken at every second value of the number of clusters “num_clus” between 2 ≤ num_clus ≤ 80. Unfortunately, the result does not show a pronounced elbow. Instead the variation at some special cluster numbers is relatively high. But if we absolutely wanted to pick a value, then something between 38 and 42 appears to be reasonable. Up to that point the decline in inertia is relatively smooth. But do not let yourself be misguided – the data depend on statistics and initial cluster values. When you repeat the calculation you may get something like the following plot with more pronounced spikes:

This is always a sign that the clustering is not very clear – and that the clusters do not have a significant distance, at least not in all coordinate directions. Filament-like structures will not be honored well by KMeans.

Nevertheless: A value of 40 is reasonable as we have 40 labels coming with the CelebA data. I.e. 40 basic features in the face images are considered to be significant and were labeled by the creators of the dataset.
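An inertia curve of this kind can be computed along the following lines. This is only a sketch with random stand-in data instead of the real z-points; `MiniBatchKMeans` comes from scikit-learn:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Random stand-in for the z-points; shape (num_points, z_dim)
z_points = np.random.normal(size=(2000, 256)).astype('float32')

li_num_clus = []
li_inertia = []
# every second value of the number of clusters, 2 <= num_clus <= 80
for num_clus in range(2, 81, 2):
    km = MiniBatchKMeans(n_clusters=num_clus, n_init=3, random_state=1)
    km.fit(z_points)
    li_num_clus.append(num_clus)
    li_inertia.append(km.inertia_)

# li_num_clus vs. li_inertia can then be plotted to look for an elbow
```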

We can also have a look at a 2-dimensional t-SNE-projection of the z-point distribution. The plots below have been produced with different settings for *early exaggeration* and *perplexity* parameters. The first plot resulted from standard parameter values for sklearn’s t-SNE variant.

tsne = TSNE(n_components=2, early_exaggeration=12, perplexity=30, n_iter=1000)

Other plots were produced by the following setting:

tsne = TSNE(n_components=2, early_exaggeration=16, perplexity=10, n_iter=1000)

Below you find some plots of a t-SNE-analysis for different numbers of z-points and differently adjusted parameters for the resulting scatter plot. The number of statistically chosen z-points varies between 20,000 and 140,000.
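The scatter-plot data can be prepared along these lines. Again only a sketch with random stand-in data and a much smaller sample than in the plots, since t-SNE becomes slow for large point numbers:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the z-points of all training images
z_points = np.random.normal(size=(2000, 256)).astype('float32')

# statistically choose a subset of the z-points
rng = np.random.default_rng(2)
idx = rng.choice(len(z_points), size=300, replace=False)

# non-standard parameter values as quoted above
tsne = TSNE(n_components=2, early_exaggeration=16, perplexity=10)
proj = tsne.fit_transform(z_points[idx])   # 2D coordinates for a scatter plot
```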

**Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)**

Actually, we see some indication of clustering, though it is not very pronounced. The clusters in the projection are not separated by clear and broad gaps. Of course a 2-dimensional projection can *not* completely visualize the original separations in a 256-dim space. However, we get the impression that clusters are located rather close to each other. Remember: We already know that almost all points are located in a multidimensional sphere shell between 12 < R < 24. And more than 50% between 14 ≤ R ≤ 19.

However, what the actual distribution of meaningful z-points (in the sense of a recognizable face reconstruction) really looks like cannot be deduced from the above t-SNE analysis. The concentration of the z-points may still follow *thin* and maybe curved filaments in some directions of the multidimensional latent space on relatively small or varying scales. We shall get a much clearer picture of the fragmentation of the z-point distribution in an AE’s latent space in the next post of this series.

**Number of statistical z-points: 80,000**

For the higher number of selected z-points the space between some concentration centers appears to be filled in the projection. But remember: This may *only* be due to projection effects in the presently chosen coordinate system. Another calculation with the above non-standard values for perplexity and early_exaggeration gives us:

**Number of statistical z-points: 140,000**

Note that some islands appear. Obviously, there is at least *some* clustering going on. However, due to projection effects we cannot deduce much for the real structure of the point distribution between possible clusters. Even the clustering itself could appear due to overlapping two or more broader filaments along a projection line.

Whether correlations would get more pronounced and therefore could also be better handled by t-SNE in a *rotated* coordinate system based on a PCA-analysis remains to be seen. The next post will give an answer.

At least we have got a clear impression about the radial distribution of the z-points. And thereby gathered some data which we can compare to corresponding results of a VAE.

Our test VAE is parameterized to create only a very small KL-loss contribution to the total loss. With the Python classes we have developed in the course of this post series we can control the ratio between the KL-loss and a standard reconstruction loss as e.g. BCE (binary-crossentropy) by a parameter “**fact**“.

For BCE

fact = 1.0e-5

is a very small value. For a working VAE we would normally choose something like **fact=5** (see post XI).

A value like 1.0e-5 ensures a KL loss around 0.0178 compared to a reconstruction loss of 4550, which gives us a ratio below 4.e-6. Now, what is a VAE going to do, when the KL-loss is so small?

For the total loss the last epochs produced the following values:

VAE total loss on CelebA after 24 epochs (for a step size of 0.0005): **4,553**

Output of the last 6 of 24 epochs.

Epoch 1/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4557.1694 - reco_loss: 4557.1523 - kl_loss: 0.0179
Epoch 2/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.9111 - reco_loss: 4556.8940 - kl_loss: 0.0179
Epoch 3/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.6626 - reco_loss: 4556.6450 - kl_loss: 0.0179
Epoch 4/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4556.3862 - reco_loss: 4556.3682 - kl_loss: 0.0179
Epoch 5/6
1329/1329 [==============================] - 120s 90ms/step - total_loss: 4555.9595 - reco_loss: 4555.9395 - kl_loss: 0.0179
Epoch 6/6
1329/1329 [==============================] - 118s 89ms/step - total_loss: 4555.6641 - reco_loss: 4555.6426 - kl_loss: 0.0178

This is not too far away from the value of our AE. Other training runs confirmed this result. On four different runs the total loss value came to lie between

VAE total loss on Celeb A after 24 epochs: **4553 ≤ loss ≤ 4555** .

Below you find the plot for the variation of the number density of z-points vs. radius for our VAE:

Again, we get a maximum close to **R = rad = 16**. The maximum value lies a bit below the one found for a KL-loss-free AE. But all in all the form and width of the VAE’s distribution are very comparable to those of our test AE.

**Can this result be reproduced?**

Unfortunately not in 100% of the test runs performed. There are two main reasons:

- Firstly, we cannot be sure that a second minimum does not exist for a distribution of points at bigger radii. This may be the case both for the AE *and* the VAE!
- Secondly, we have a major factor of statistical fluctuation in our game: The *epsilon* value which scales the *logvar*-contribution to the z-point coordinates in the sampling layer of the Encoder may in very seldom cases jump to an unreasonably high value. A Gaussian is involved in the calculation of z-points by our VAE, and although the chances of producing an extreme value are pretty small, a Gaussian does cover such values.

Remember that the z-point coordinates are calculated via the mu and logvar tensors according to

z = mu + B.exp(log_var / 2.) * epsilon

See Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics for respective code elements of the Encoder.

So, a lot depends on epsilon which is calculated as a statistically fluctuating quantity, namely as

epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)

Is there a chance that the training process may sometimes drive the system to another corner of the weight-loss configuration space due to abrupt fluctuations? With the result for the z-point distribution vs. radius that it may significantly deviate from a distribution around R = 16? I think: Yes, this is possible!

From some other training runs I actually have an indication that there is a second minimum of the cost hyperplane with similar properties for higher average radius-values, namely for a distribution with an average radius at **R ≈ 19.75**. I got there after changing the initialization of the weights a bit.

This is another indication that the cost surface has a relatively rough structure and that extreme fluctuations of epsilon with a resulting gradient fluctuation can drive the position of the network in the weight configuration space to some strange corners. The weight values there can result in different z-point distributions at higher average radii. This actually happened during yet another training run: At epoch 22 the Adam optimizer suddenly directed the whole system to weight values resulting in a maximum of the density distribution at R = 66! This appeared totally crazy. At the same time the KL-loss also jumped to a much higher value.

When I afterward repeated the run from epoch 18 this did not happen again. Therefore, a statistical fluctuation must have been the reason for the described event. Such an erratic behavior can only be explained by sudden and extreme changes of z-point data enforcing a substantial change in size and direction of the loss gradient. And epsilon is a plausible candidate for this!

So far I had nothing in our Python classes which would limit the statistical variation of epsilon. The effects seen spoke for a code change such that we do not allow for extreme epsilon-values. I set limits in the respective part of the code for the sampling layer and its lambda function

# The following function will be used by a Lambda layer of the Encoder
def z_point_sampling(args):
    '''
    A point in the latent space is calculated statistically
    around an optimized mu for each sample
    '''
    mu, log_var = args  # Note: These are 1D tensors per sample!
    epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
    # Limit epsilon to the interval [-5, 5]; epsilon is a tensor, so we
    # clip component-wise instead of using a Python if-statement
    epsilon = B.clip(epsilon, -5., 5.)
    return mu + B.exp(log_var / 2.) * epsilon * self.enc_v_eps_factor

This stabilized everything. But even without these limitations, on average three out of four runs which I performed for the VAE ran into a cost minimum which was associated with a pronounced maximum of the z-point distribution around R ≈ 16. Below you see the plot for the fourth run:

So, there is some chance that the degrees of freedom associated with the logvar-layer and the statistical variation for epsilon may drive a VAE into other local minima or weight parameter ranges which do not lead to a z-point distribution around R = 16. But after the limitation of epsilon fluctuations all training runs found a loss minimum similar to the one of our simple AE – in the sense that it creates a z-point density distribution around R ≈ 16.

Our VAE gives the following variation of the inertia vs. the number of assumed clusters:

This also looks pretty similar to one of the plots shown for our AE above.

Below you find t-SNE plots for 20,000, 80,000 and 140,000 images:

**Number of statistical z-points: 20,000 (non-standard t-SNE-parameters)**

This is quite similar to the related image for the AE. You just have to rotate it.

**Number of statistical z-points: 80,000**

**Number of statistical z-points: 140,000**

All in all we get very similar indications as from our AE that some clustering is going on.

Besides reproducing a similar z-point distribution with respect to radius values, is there another indication that a VAE behaves similarly to an AE? What would be a clear sign that the similarity really exists on a deeper level of the layers and their weights?

The z-vector is calculated from the mu and logvar-vectors by:

z = mu + exp(logvar/2)*epsilon

with epsilon coming from a normal distribution. Please note that we are talking about vectors of size z_dim=256 per image.
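A tiny NumPy sketch illustrates what this formula implies for strongly negative logvar-components. The logvar value used here is only an illustrative constant, not the trained model's actual tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim = 256

mu = rng.normal(size=z_dim).astype('float32')
log_var = np.full(z_dim, -13.7, dtype='float32')   # illustrative, strongly negative
epsilon = rng.normal(size=z_dim).astype('float32')

# The sampling formula of the VAE's Encoder
z = mu + np.exp(log_var / 2.) * epsilon

# exp(-13.7 / 2) is about 1.e-3, so each z-component deviates from the
# corresponding mu-component only on the per-mille level
print(np.max(np.abs(z - mu)))
```

With logvar-values this negative, z collapses onto mu for all practical purposes.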

If a VAE with a tiny KL-loss really becomes similar to an AE, it should define and set its z-points basically by using *mu*-values only, and not by *logvar*-values. I.e. the VAE should become intelligent enough to ignore the degrees of freedom associated with the *logvar*-layer. Meaning that the z-point coordinates of a VAE with a very small KL-loss should in the end be almost identical to the mu-component values.

Ok, but to me it was not self-evident that a VAE during its training would learn

- to produce significant *mu*-related weight-values, only,
- and to keep the weight values for the connections to the logvar-layer so small that the logvar-impact on the z-space position gets negligible.

Before we speculate about reasons: Is there any evidence for a negligible logvar-contribution to the z-point coordinates or, equivalently, to the respective vector components?

To get some quantitative data on the logvar impact the following steps are appropriate:

- Get the size and algebraic sign of the logvar-values. Negative values *logvar* < -3 would be optimal.
- Measure the deviation between the mu- and z-point vector components. There should only be a few components which show significant values abs(mu – z) > 0.05.
- Compare the radius value determined by the z-components with the radius value derived from the mu-components only, and measure the absolute and relative deviations. The relative deviation should be very small on average.
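These three checks translate into a few NumPy lines. A sketch with random stand-in arrays; in reality `mu`, `logvar` and `z_points` would be predicted by the trained Encoder for all training images:

```python
import numpy as np

rng = np.random.default_rng(1)
num_images, z_dim = 1000, 256

# Stand-in arrays shaped like the Encoder predictions
mu = rng.normal(size=(num_images, z_dim)).astype('float32')
logvar = rng.normal(loc=-13.7, scale=2., size=(num_images, z_dim)).astype('float32')
z_points = mu + np.exp(logvar / 2.) * rng.normal(size=(num_images, z_dim)).astype('float32')

# 1) size and algebraic sign of the logvar-values
print("max logvar:", logvar.max(), " min:", logvar.min(), " avg:", logvar.mean())

# 2) component-wise deviation between mu and z_points
delta_mu_z = np.abs(z_points - mu)
print("max |z - mu|:", delta_mu_z.max())
print("components with |z - mu| > 0.05:", np.count_nonzero(delta_mu_z > 0.05))

# 3) radii from z-components vs. radii from mu-components
rad_z  = np.linalg.norm(z_points, axis=1)
rad_mu = np.linalg.norm(mu, axis=1)
rel_dev = np.abs(rad_z - rad_mu) / rad_z
print("max relative radius deviation:", rel_dev.max())
print("avg relative radius deviation:", rel_dev.mean())
```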

Regarding the maximum value of the logvar’s vector-components I found

3.4 ≥ max(logvar) ≥ -3.2. # for 1 up to 3 components out of a total of 43.52 million components

The first value may appear to be big for a component. But typically there are **only 2 (!) out of 170,000 x 256 = 43.52 million vector components** with a value in the interval [-3, 5]. On the component level I found the following minimum, maximum and average values:

Maximum value for logvar: -2.0205276
Minimum value for logvar: -24.660698
Average value for logvar: -13.682616

The average value of *logvar* is pretty pleasing: Such big negative values indeed render the *logvar*-impact on the position of our z-points negligible. So we should only find very small deviations of the mu-components from the z-point components. And, actually, the maximum of the deviation between a z_point component and a mu component was delta_mu_z = 0.26:

Maximum (z_points - mu) = delta_mu_z = 0.26 # on the component level

There were only 5 out of the 43.52 million components which showed an absolute deviation in the interval

0.05 < abs(delta_mu_z) < 2.

The rest was much, much smaller!

What about radius values? Here the situation looks, of course, even better:

max radius defined by z  : 33.10274
min radius defined by z  : 6.4961233
max radius defined by mu : 33.0972
min radius defined by mu : 6.494558
avg_z:  16.283989
avg_mu: 16.283972
max absolute difference : 0.018045425
avg absolute difference : 0.00094899256
max relative difference : 0.00072147313
avg relative difference : 6.1240215e-05

As expected, the relative deviations between z- and mu-based radius values became very small.

In another run (the one corresponding to the second density distribution curve above) I got the following values:

Maximum value for logvar: 3.3761873
Minimum value for logvar: -22.777826
Average value for logvar: -13.4265175
max radius z  : 35.51387
min radius z  : 7.209168
max radius mu : 35.515926
min radius mu : 7.2086616
avg_z:  17.37412
avg_mu: 17.374104
max delta rad relative : 0.012512478
avg delta rad relative : 6.5660715e-05

This tells us that the z-point distributions may vary a bit in their width, their precise center and their average values. But overall they appear to be similar, especially with respect to a negligible relative contribution of *logvar*-terms to the z-point position. The relative impact of *logvar* on the radius value of a z-point is of the order **6.e-5**, only.

All the above data confirm that a trained VAE with a very small KL-loss primarily uses *mu*-values to set the position of its z-points. During training the VAE moves along a path to an overall minimum on the loss hyperplane which leads to an area with weights that produce negligible *logvar* values.

So far we can summarize: Under normal conditions the VAE’s behavior is pretty close to that of a similar AE. The VAE produces only small *logvar* values. z-point coordinates are extremely close to just the mu-coordinates.

Can we find a plausible reason for this result? Looking at the cost-hyperplane with respect to the Encoder weights helps:

The cost surface of a VAE spans across a space of many more weight parameters than a corresponding AE. The reason is that we have weights for the connection to the logvar-layer in addition to the weights for the mu-layer (or a single output layer as in a corresponding AE). But if we look at the corner of the weight-vector-space where the *logvar*-related values are pretty small, then we would at least find a local (if not global) loss minimum there for the same values of the mu-related weight parameters as in the corresponding AE (with mu replacing the z-output).

So our question reduces to the closely related question whether the old minimum of an AE remains at least a local one when we shift to a VAE – and this is indeed the case for the basic reason that the KL-contributions to the height of the cost-hyperplane are negligibly small everywhere (!) – even for higher logvar-related values.

This tells us that a gradient descent algorithm should indeed be able to find a cost minimum for very small values of logvar-related weights and for weight-values related to the mu-layer very close to the AE’s weight-values for direct connections to its output layer. And, of course, with all other weight parameter of the VAE-Decoder being close to the values of the weights of a corresponding AE. At least under the condition that all variable quantities really change smoothly during training.

A last test to confirm that a VAE with a very small KL-loss operates like a comparable AE is a trial to create images with recognizable human faces from randomly chosen points in z-space. Such a trial should fail! I just show you three results – one for a normal distribution of the z-point components and two for equidistant distributions of the component values in the intervals [-2,2] and [-10,10]:

**z-point coordinates from normal distribution **

**z-point coordinates from equidistant distribution in [-2,2] **

**z-point coordinates from equidistant distribution in [-10,10] **

This reminds us very much about the behavior of an AE. See: Autoencoders, latent space and the curse of high dimensionality – I.

The z-point distribution in latent space of a VAE with a very small KL-loss obviously is as complicated as that of an AE. Neighboring points of a z-point which leads to a good image produce chaotic images. The transition path from good z-points to other meaningful z-points is confined to a very small filament-like volume.

A trained VAE with only a tiny KL-loss contribution will under normal circumstances behave similarly to an AE with the same hidden (convolutional) layers. It may, however, be necessary to limit the statistical variation of the epsilon factor in the z-point calculation based on *mu*- and *logvar*-values.

The similarity is based on very small logvar-values after training. The VAE creates a z-point distribution which shows the same dependency on the radius as an AE. We see similar indications and patterns of clustering. And the VAE fails to produce human faces from random z-points in the latent space – just as a comparable AE does.

We have found a plausible reason for this similarity by comparing the minimum of the AE’s loss hyperplane in its weight-loss parameter space with a corresponding minimum in the weight-loss space of the VAE – at a position with small weights for the connections to the logvar-layer.

The z-point density distribution shows a maximum at a radius between 16 and 17. The z-point distribution basically has a Gaussian form. In the next post we shall look a bit closer at these findings – and their origin in Gaussian distributions along the coordinate axes of the latent space. After an application of a PCA analysis we shall furthermore see that the z-point distribution in an AE’s latent vector space is indeed fragmented and shows filaments on certain length scales. A VAE with a tiny KL-loss will show the same fragmentation.

In a further forthcoming post we shall afterward investigate the confining and at the same time blurring impact of the KL-loss on the latent space, which will make it usable for creative purposes.

**And let us all who praise freedom not forget: **

The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!



After having successfully trained a VAE with CelebA data, we have shown that our VAE can afterward create images with human-like faces from statistically selected data points (z-points) in its latent space. We still have to analyze the confinement of the z-point distribution due to the KL-loss in a bit more depth. But before we turn to this topic I want to briefly discuss an option to reduce the VRAM requirements of the VAE’s Encoder.

In my opinion exploring the field of Machine Learning on a PC should not be limited to people who can afford a state of the art graphics card with a lot of VRAM. One could use Google’s Colab – but … I do not want to go into tax and personal data politics here. I really miss an EU-wide platform that offers services like Google Colab.

Anyway, a reduction of VRAM consumption may be decisive for being able to perform training runs for CNN-based VAEs on older graphics cards. This is not only a matter of VRAM limits but also of computation time: The less VRAM the weight parameters of our VAE models require, the bigger we can size the batches the GPU operates on and the more computation time we may potentially save. At least in principle. Therefore, we should consider the number of trainable parameters of a neural network model and reduce it if possible.

When you print out the layer structure and related parameters of a VAE (see below) you will find that the Encoder requires more parameters than the Decoder. Around twice as many. A closer look reveals:

It is the transition from the convolutional part of the Encoder to its Dense layers for *mu* **and** *logvar* which plays an important role for the number of weight parameters.

For a layer structure comprising 4 Conv2D layers, related filters=(32,64,128,256) and an input image size of (96,96,3) pixels we arrive at a *Flatten*-layer of 9216 neurons at the end of the convolutional part of the Encoder. For z_dim = 512 the direct connections from the Flatten-layer to both the mu- and logvar-layers lead to more than 9.4 million (float32) parameters for the Encoder. This is the absolutely dominant part of the 9.83 million parameters required by the Encoder in total. In contrast, the Decoder part requires a total of only 5.1 million parameters.

This is due to the fact that the flattened layer supplies input to *two* connected layers before output is created by yet another layer. In the Decoder, instead, only one layer, namely the input layer, is connected to the flattened layer ahead of the first Conv2DTranspose layer.

In the case of z_dim=256 we arrive at around half of the parameters, i.e. 4.9 million parameters for the Encoder and around 2.76 million for the Decoder.
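The dominant contribution of these Dense connections can be checked by simple arithmetic. The sketch below merely reproduces the numbers quoted above for the connections from the Flatten-layer (9216 neurons) to the mu- and logvar-layers:

```python
# Check of the dominant parameter contribution in the Encoder: the Dense
# connections from the Flatten-layer (9216 neurons) to BOTH the mu- and
# the logvar-layer.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + bias terms

flat = 9216                              # 6x6 maps x 256 filters
for z_dim in (512, 256):
    p = 2 * dense_params(flat, z_dim)    # mu- AND logvar-layer
    print(f"z_dim={z_dim}: {p:,} parameters")
# z_dim=512: 9,438,208 parameters -> "more than 9.4 million"
# z_dim=256: 4,719,104 parameters
```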

It is obvious that the existence of two layers for the variational parameters inside the Encoder is the source of the high parameter number on the Encoder side.

A reduction of Conv2D-layers in the Encoder would of course reduce the parameters for the weights between the convolutional layers. But turning to only three convolutional layers whilst keeping up a stride value of stride=2 for all filters would **raise** the already dominant number of parameters after the flattened layer by a **factor of 4**!

So, one has to find a delicate balance between the number of convolutional layers on the one hand and the eventual number of maps at the innermost layer and the size of these maps on the other. The latter two determine the number of neurons and related weights at the flattening layer:

From the perspective of a low total number of parameters you should consider higher stride values when reducing the number of Conv2D-layers.

On the other hand, using *more* than 4 convolutional layers would reduce the resolution of the maps of the innermost Conv2D layer below a usable threshold for reasonable *mu* and *logvar* values.

Off topic remark: All in all it seems to be reasonable also to think about ResNets of low depth instead of plain CNNs to keep weight numbers under control.

The reader who followed the posts in this series may have looked at the recipe which F. Chollet has discussed in his Keras documentation on VAEs. See:

https://keras.io/examples/generative/vae/

There is an element in Chollet’s Encoder structure which one can easily overlook at first sight. In his example for the MNIST dataset Chollet adds an *intermediate Dense layer* between the Flatten-layer and the layers for *mu* and *logvar*:

```python
...
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
...
```

In the special case of MNIST an intermediate layer seems appropriate for bridging the gap between an input dimension of 784 and z_dim = 2. You do not expect major problems to arise from such a measure.

But: This intermediate layer introduced by Chollet also has the advantage of reducing the total number of trainable parameters substantially.

We could try something similar for our network. But here we have to be a bit more careful, as we work in a latent space of much higher dimensions, typically with z_dim >= 256. Here we are in a dilemma, as we want to keep the intermediate dimension relatively high to use as much information as possible coming from the maps of the last Conv2D-layer. A fair compromise seems to be to use at least the dimension of the mu- and logvar-layers, namely z_dim.

For z_dim=256 an additional Dense layer of the same size would reduce the total number of Encoder parameters from 5.11 million to 2.88 million.

If we took the dimension of the intermediate layer to be 384 we would still reduce the total Encoder parameters to 4.13 million. So an additional Dense layer really saves us some VRAM.
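The savings can again be verified by a small parameter count. The sketch below compares the direct Flatten-to-mu/logvar connection with a variant using an intermediate Dense layer of dimension d (the helper function is illustrative, but the resulting numbers match those quoted above):

```python
# Compare parameter counts of the Encoder tail (Flatten -> mu/logvar) with
# and without an intermediate Dense layer of dimension d (z_dim = 256).
def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + bias terms

flat, z_dim = 9216, 256
direct = 2 * dense_params(flat, z_dim)               # 4,719,104
for d in (256, 384):
    with_extra = dense_params(flat, d) + 2 * dense_params(d, z_dim)
    print(f"d={d}: saves {direct - with_extra:,} parameters")
# d=256 saves ~2.23 million, d=384 saves ~0.98 million parameters -
# matching the reduction from 5.11 to 2.88 / 4.13 million quoted above.
```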

Will an additional dense layer have a negative impact on our VAE’s ability to create images from randomly chosen z-points in the latent space?

Let us try it out. Including an option for an additional Dense layer in the Encoder-related part of our class “MyVariationalAutoencoder()” is a pretty simple task. I leave this to the reader. Note that if we choose the dimension of the additional Dense layer to be exactly z_dim there is no need to change the reconstruction logic and layer structure of the Decoder. Also for other choices of the size of the Dense layer I would refrain from changing the Decoder.
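A minimal sketch of such an Encoder tail could look as follows. This is NOT the code of the class mentioned above; the function name and its arguments are illustrative only:

```python
# Illustrative sketch (not the actual class code): Encoder tail with an
# optional intermediate Dense layer between Flatten and the mu/logvar layers.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_tail(x, z_dim, use_extra_dense=True, d_extra=None):
    x = layers.Flatten()(x)
    if use_extra_dense:
        d_extra = d_extra or z_dim           # default: same size as z_dim
        x = layers.Dense(d_extra, activation="relu")(x)
    mu      = layers.Dense(z_dim, name="mu")(x)
    log_var = layers.Dense(z_dim, name="log_var")(x)
    return mu, log_var
```

With use_extra_dense=False the tail reduces to the direct connection discussed before; with the default d_extra=z_dim the Decoder can stay unchanged.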

I used z_dim=256 for the extra layer’s size. Then I repeated the experiments described in my last post. Some results for random z-points picked from a normal distribution in all coordinates are shown below:

So we see that from the generative point of view an extra Dense layer does not hurt too much.

First and foremost:

We found a simple method to reduce the VRAM consumption of the Encoder.

But I have to admit that this method did NOT save any GPU time during training as long as I kept the size of the image batches as big as before (128). The reason is:

Due to the extra layer more matrix operations have to be performed than before, although some of the matrices became smaller. On my old graphics card a full epoch with 170,000 (96×96) images takes around 120 secs – with or without an extra Encoder layer. Unfortunately, increasing the size of the batches the DataImageGenerator feeds into the GPU from 128 images to 256 did not change the required GPU time very much. More tests showed that a size of 128 already gave me an optimal turnaround time per epoch on my old graphics card (960 GTX).

An extra intermediate Dense layer between the Flatten-layer and the mu- and logvar-layers of the Encoder can help us to save some VRAM during the training of a VAE. Such a layer does not lead to a visible reduction of the quality of VAE-generated images from randomly selected points in the latent-space.

In the next post of this series

Variational Autoencoder with Tensorflow 2.8 – XIII – Does a VAE with tiny KL-loss behave like an AE? And if so, why?

we will compare a VAE with only a tiny contribution of KL-loss to the total loss with a corresponding AE. We shall investigate their similarity regarding their z-point distributions. This will give us a solid basis to investigate the impact of higher KL-loss values on the latent space in more detail afterwards.

Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss

Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics

Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator

Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images

VAEs fall into a section of ML which is called “*Generative Deep Learning*“. The reason is that we can use VAEs to create images which contain objects with features learned from training images. One interesting category of such objects are human faces – of different color, with individual expressions, features and hairstyles, seen from different perspectives. One dataset which contains such images is the CelebA dataset.

During the last posts we came so far that we could train a CNN-based Variational Autoencoder [VAE] with images of the CelebA dataset. Even on graphics cards with low VRAM. Our VAE was equipped with a **GradientTape()**-based method for KL-loss control. We still have to prove that this method works in the expected way:

The distribution of data points (z-points) created by the VAE’s Encoder for training input should be confined to a region around the origin in the latent space (z-space). And neighboring z-points up to a limited distance should result in similar output of the Decoder.

Therefore, we have to look a bit deeper into the results of some VAE-experiments with the CelebA dataset. I have already pointed out why creating rather complex images from arbitrarily chosen points in the latent space is a suitable and good test for a VAE. Please remember that our efforts regarding the KL-loss have to do with the following fact:

A standard Autoencoder will not create reasonable images/objects from *arbitrarily* chosen z-points in the latent space.

This eliminates the use of an AE for creative purposes. A VAE, however, should be able to solve this type of task – at least for z-points in a limited neighborhood of the latent space’s origin. Thus, by creating images from randomly selected z-points with the Decoder of a VAE which has been trained on the CelebA dataset, we cover two points:

- Test 1: We test the functionality of the VAE-class which we have developed and which includes the code for KL-loss handling via TF2’s GradientTape() and Keras’ train_step().
- Test 2: We test the ability of the VAE’s Decoder to create images with convincing human-like face and hairstyle features from *random* z-points within an area close to the origin of the latent space.

Most of the experiments discussed below follow the same prescription: We take our trained VAE, select some random points in the latent space, feed the z-point-data into the VAE’s Decoder for a prediction and plot the images created on the Decoder’s output side. The Encoder only plays a role when we want to test reconstruction abilities.

For a low dimension z_dim=256 of the latent space we will find that the generated images display human faces reasonably well. But the images appear a bit blurry or unsharp – as if not fully focused. So, we need to discuss what we can do about this point. I will also name some plausible causes for the loss of accuracy in the representation of details.

Afterwards I want to show you that a VAE Decoder reconstructs original images *relatively badly* from the z-points calculated by the Encoder. At least when one looks at details. A simple AE with a sufficiently high dimension of the latent space performs much better. One may feel disappointed about the reconstruction ability of a VAE. But actually it is the ability of a VAE to forget about details and instead to focus on general features which enables it (the VAE) to create something meaningful from randomly chosen z-points in the latent space.

In a last step in this post we are going to look at images created from z-points with a growing distance from the origin of the multidimensional latent space [z-space]. (Distance can be defined by a L2-Euclidean norm). We will see that most z-points which have some z-coordinates above a value of 3 produce fancy images where the face structures get dominated or distorted by background structures learned from CelebA images. This effect was to be expected as the KL-loss enforced a distribution of the z-points which is confined to a region relatively close to the origin. Ideally, this distribution would be characterized by a normal distribution in all coordinates with a sigma of only 1. So, the fact that z-points in the vicinity of the origin of the latent space lead to a construction of images which show recognizable human faces is an indirect proof of the confining impact of the KL-loss on the z-point distribution. In another post I shall deliver data which prove this more directly.

Below I will call the latent space of a (V)AE also *z-space*.

Our trained VAE with four Conv2D-layers in the Encoder and 4 corresponding Conv2DTranspose-Layers in the Decoder has the following basic characteristics:

(Encoder-CNN-) filters=(32,64,128,256), kernels=(3,3), stride=2,

reconstruction loss = BCE (binary crossentropy), fact=5.0, z_dim=256

The order of the filter- (= map-) numbers is, of course, reversed for the Decoder. The factor *fact* used to scale the KL-loss in comparison to the reconstruction loss was chosen to be *fact=5*, which led to a 3% contribution of the KL-loss to the total loss during training. The VAE was trained on 170,000 CelebA images over 24 epochs with a small epsilon=0.0005 and the Adam optimizer.

When you perform similar experiments on your own you may notice that the total loss values after around 24 epochs (> 5015) are significantly higher than those of comparable experiments with a standard AE (4850). This already is an indication that our VAE will not reproduce a similarly good match between an image reconstructed by the Decoder and the original input image fed into the Encoder.

The picture below shows some examples of generated face-images coming from randomly chosen z-points in the vicinity of the z-space’s origin. To calculate the coordinates of such z-points I applied a normal distribution:

```python
z_points = np.random.normal(size = (n_to_show, z_dim))   # n_to_show = 28
```

So, what do the results for **z_dim=256** look like?

Ok, we get reasonable images of human-like faces. The variations in perspective, face forms and hairstyles are also clearly visible and reflect part of the related variety in the training set. You will find more variations in more images below. So, we take this result as a success! In contrast to a pure AE we DO get something from random z-points which we clearly can interpret as human faces. The whole effort of confining z-points around the origin and at the same time of smearing out z-points with similar content over a region instead of a fixed point-mapping (as in an AE) has paid off. See for comparison:

Autoencoders, latent space and the curse of high dimensionality – I

Unfortunately, the images and their details appear a bit blurry and not very sharp. Personally, this reminded me of the times when the first CCD-chips with relatively low resolution were introduced in cameras and the raw image data looked disappointing as long as we did not apply some sharpening filters. The basic information to enhance details was there, but it had to be used explicitly to improve the plain raw data of the CCD.

The quality in details is about the same as what we see in the example images in the book of D. Foster on “Generative Deep Learning”, 2019, O’Reilly – despite the fact that Foster used a slightly higher resolution of the input images (128x128x3 pixels). The higher input resolution there also led to a higher resolution of the maps of the innermost convolutional layer. Regarding quality see also the images presented in:

https://datagen.tech/guides/image-datasets/celeba/

Just for fun, I took a screenshot of my result, saved it and applied two different sharpening filters from the ShowFoto program:

Much better! And we do not have the impression that we added some fake information to the images by our post-processing ….

I can already hear arguments saying that such an enhancement should not be done. Frankly, I do not see any reason against post-processing of images created by a VAE-algorithm.

Remember: This is NOT about reproduction quality with respect to originals or a close-to-reality show. This is about generating new images of human-like faces based on basic features which a VAE-algorithm hopefully has learned from training images. All of what we do with a VAE is creative. And it also comes close to a proof that ML-algorithms based on convolutional layers really can “learn” something about the basic features of objects presented to them. (In the Encoder’s case the learned features are e.g. saved in the sensitivity of the convolutional maps to typical patterns in the input images.)

And as in the case of raw-images of CCD or CMOS camera chips: Sometimes some post-processing is required to utilize the information optimally for sharpness.

Of course we do not want to produce images in a ML run, take screenshots and sharpen each image individually. We need some tool that fits into the ML process pipeline. The good old PIL library for Python offers sharpening as one of multiple enhancement options for images. The next examples are results from the application of a PIL enhancement procedure:

These images look quite OK, too. The basic code fragment I used for each individual image in the above grid:

```python
# reconst_new is the output from my VAE's Decoder
ay_img = reconst_new[i, :, :, :] * 255
ay_img = np.asarray(ay_img, dtype="uint8")
img_orig = Image.fromarray(ay_img)
img_shr_obj = ImageEnhance.Sharpness(img_orig)
sh_factor = 7        # specified factor for enhancing sharpness
img_sh = img_shr_obj.enhance(sh_factor)
```

The sharpening factor I chose was quite high, namely **sh_factor = 7**.

Just to further demonstrate the effect of different factors for sharpening by PIL you find some examples below for sh_factor = 0, 3, 6.

Obviously, the enhancement is important to get clearer and sharper images.

However, when you enlarge the images sufficiently enough you see some artifacts in the form of crossing lines. These artifacts are partially already existing in the Decoder’s output, but they are enhanced by the Sharpening mechanism used by PIL (unsharp masking). The artifacts become more pronounced with a growing sh_factor.

**Hint:** According to ML-literature the use of *Upsampling layers* instead of *Conv2DTranspose layers* in the Decoder may reduce such artefacts a bit. I have not yet tried it myself.
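If you want finer control over the unsharp masking than ImageEnhance.Sharpness offers, PIL also provides an explicit UnsharpMask filter with radius/percent/threshold parameters. The parameter values below are illustrative only, not tuned for VAE output:

```python
# Illustrative alternative: PIL's UnsharpMask filter with explicit control
# parameters (values chosen for demonstration, not tuned for VAE output).
from PIL import Image, ImageFilter

img = Image.new("RGB", (96, 96), (128, 128, 128))    # placeholder image
img_sh = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
```

A higher threshold suppresses sharpening in low-contrast areas, which may help to keep the line-shaped artifacts mentioned above under control.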

How do we assess the point of relatively unclear, unsharp images produced by our VAE? What are plausible reasons for the loss of details?

- Firstly, already AEs with a latent space dimension z_dim=256 do in general not reconstruct brilliant images from z-points in the latent space. In my experience, to get a good reconstruction quality even from an AE which does nothing else than compress and reconstruct images of size (96x96x3), z_dim-values > 1000 are required. More about this in another post in the future.
- A second important aspect is the following: Enforcing a compact distribution of similar images in the latent space via the KL-loss automatically introduces a loss of detail information. The KL-loss is designed to lead to a smear-out effect in z-space. Only basic concepts and features will be kept by the VAE to ensure a similarity of neighboring images. Details will be omitted and “smoothed” out. This has consequences also with respect to the sharpness of detail structures. A detail such as an eyebrow in a face is to be considered as an average of similar details found for images in the same region of the z-space. This alone brings some loss of clarity with it.
- Thirdly, a simple (V)AE based on some directly connected Conv2D-layers has limited capabilities in general. The reason is that we systematically reduce resolution whilst information is propagated from one Conv2D layer to the next neighboring one. Remember that we use a stride of 2 or pooling layers to cover filters on larger image scales. Due to this kind of information processing a convolutional network automatically suppresses details in its inner layers – their resolution shrinks with growing distance from the input layer. In later posts of this blog we shall see that using **ResNets** instead of CNNs in the Encoder and Decoder already helps a bit regarding the reconstruction of clearer images. The correlation between details and large-scale information is better kept up there than in CNNs.

Regarding the first point one may think of increasing z_dim. This may not be the best idea. It contradicts the whole idea of a VAE which at its core is a reduction of the degrees of freedom for z-points. For a higher dimensional space we may have to raise the ratio of KL-loss to reconstruction loss even further.

Regarding the third point: Of course it would also help to increase kernel sizes for the first two Conv2D layers and the number of maps there. A higher resolution of the input images would also be of advantage. Both methods may, however, conflict with your VRAM or GPU time limits.

If the second point were true, then a reduction of *fact* in our models, which controls the ratio of the KL-loss to the reconstruction loss, would lead to a better image quality. In this case we are doomed to find an optimal value for fact – satisfying both the need for generalization and clarity of details in our images. You cannot have both … here we see a basic problem related to VAEs and the creation of realistic images. Actually, I tried this out – the effect is there, but the gain is not worth the effort. And for too small values of fact we eventually lose the ability to create reasonable images from arbitrary z-points at all.

All in all post-processing appears to be a simple and effective method to get images with somewhat sharper details.

Hint: If you want to create images of artificially generated faces with a really high quality, you have to turn to GANs.

In this example you see that not all points give you good images of faces. The z-point of the middle image in the second to last row of the first illustration below has a relatively high distance from the origin. The higher the distance from the origin in z-space, the weirder the images get. We shall see this below in a more systematic way.

If I were not afraid of copyright and personal rights aspects of using CelebA images directly, I could now show you a comparison of the reconstruction ability of an AE with that of a VAE. You find such a comparison, though a limited one, by looking at some images in the book of D. Foster.

To avoid any problems I just tried to work with an image of myself. Which really gave me a funny result.

A plain Autoencoder with

- an extended latent space dimension of z_dim = 1600,
- a reasonable convolutional filter sequence of (64, 64, 128, 128)
- a stride value of stride=2
- and kernels ((5,5),(5,5),(3,3),(3,3))

is well able to reproduce many detailed features of one’s face after a training on 80,000 CelebA images. Below see the result for an image of myself after 24 training epochs of such an AE:

The left image is the original, the right one the reconstruction. The latter is not perfect, but many details have been reproduced. Please note that the trained AE never had seen an image of myself before. For biometric analysis the reproduction would probably be sufficient.

Ok, so much about an AE and a latent space with a relatively high dimension. But what does a VAE think of me?

With fact = 5.0, filters like (32,64,128,256), (3,3)-kernels, z_dim=256 and after 18 epochs with 170,000 training images of CelebA my image really got a good cure:

My wife just laughed and said: Well, now in the age of 64 at least an AI has found something soft and female in you … Well, had the CelebA included many faces of heavy metal figures the results would have looked differently. I bet …

So with generative VAEs we obviously pay a price: Details are neglected in favor of very general face features and hairstyle aspects. And we lose sharpness. Which is good if you have wrinkles. Good for me and the celebrities, too.

However, I recommend anybody who wants to study VAEs to check the reproduction quality for CelebA test images (not from the training set). You will see the *generalization* effect for a broader range of images. And, of course, a better reproduction with smaller values for the ratio of the KL-loss to the reconstruction loss. However, for too small values of *fact* you will not be able to create realistic face images at all from arbitrary z-points – even if you choose them to be relatively close to the origin of the latent space.

In another post in this blog I have discussed why we need VAEs at all if we want to reconstruct reasonable face images from randomly picked points in the latent space. See:

Autoencoders, latent space and the curse of high dimensionality – I

I think the reader is meanwhile convinced that VAEs do a reasonably good job to create images from randomly chosen z-points. But all of the above images were taken from z-points calculated with the help of a function assuming a normal distribution in the z-space coordinates. The width of the resulting distribution around the origin is of course rather limited. Most points lie within a 3 sigma distance around the origin. This is OK as we have put a lot of effort into the KL-loss to force the z-points to approach such a normal distribution around the origin of the latent space.

But what happens if and when we increase the distance of our random z-points from the origin? An easy way to investigate this is to create the z-points with a function that creates the coordinates randomly, but uniformly distributed in an interval [-limit, limit]. The chance that at least one of the coordinates gets a high absolute value is rather big then. This in turn ensures relatively high radius values (in terms of an L2-distance norm).

Below you find the results for z-points created by the function random.uniform:

```python
r_limit = 1.5
l_limit = -r_limit
znew = np.random.uniform(l_limit, r_limit, size = (n_to_show, z_dim))
```

r_limit is varied as indicated:

Well, this proves that we get reasonable images only up to a certain distance from the origin – and only in certain areas or pockets of the z-space at higher radii.

Another notable aspect is the fact that the background variations are completely smoothed out at low distances from the origin. But they get dominant in the outer regions of the z-space. This is consistent with the fact that we need more information to distinguish various background shapes, forms and colors than basic face patterns. Note also that the faces appear relatively homogeneous for r_limit = 0.5. The farther we move away from the origin, the larger the volumes become which cover and distinguish certain features of the training images.
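The typical L2 radii implied by the different r_limit values can be estimated numerically: for coordinates uniform in [-a, a] the expected radius is roughly a*sqrt(z_dim/3), i.e. about 9.2*a for z_dim = 256. An illustrative check:

```python
# Illustrative estimate of the typical L2 radius for z-points with
# coordinates uniform in [-r_limit, r_limit] (z_dim = 256):
import numpy as np

rng = np.random.default_rng(0)
for r_limit in (0.5, 1.0, 1.5, 2.0):
    z = rng.uniform(-r_limit, r_limit, size=(10000, 256))
    r = np.linalg.norm(z, axis=1)
    print(r_limit, round(float(r.mean()), 1))
# the mean radius grows roughly like r_limit * sqrt(256/3) ~ 9.2 * r_limit
```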

Our VAE with the GradientTape()-mechanism for the control of the KL-loss seems to do its job. In contrast to a pure AE the smear-out effect of the KL-loss now allows for the creation of images with interpretable contents from *arbitrary* z-points via the VAE’s Decoder – as long as the selected z-points are not too far away from the z-space’s origin. Thus, by indirect evidence we can conclude that the z-points for training images of the CelebA dataset were distributed and at the same time confined around the origin. The strongest indication came from the last series of images. But we pay a price: The reconstruction abilities of a VAE are far below those of AEs. A relatively low number of dimensions of the latent space helps with an effective confinement of the z-points. But it leads to a significant loss in detail sharpness of the generated images, too. However, part of this effect can be compensated by the application of standard procedures for image enhancement.

In the next post

Variational Autoencoder with Tensorflow 2.8 – XII – save some VRAM by an extra Dense layer in the Encoder

I will discuss a simple trick to reduce the VRAM consumption of the Encoder. In a further post we shall then analyze the confinement of the z-point distribution with the help of more explicit data.

**And let us all who praise freedom not forget:**

The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. An aggressor who orders the bombardment of civilian infrastructure, civilian buildings, schools and hospitals with drones bought from other anti-democrats and women oppressors puts himself in the darkest and most rotten corner of human history.


Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss

Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics

Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator

The last method discussed made use of Tensorflow’s **GradientTape()-class**. We still have to test this approach on a challenging dataset like CelebA. Our ultimate objective will be to pick up randomly chosen data points in the VAE’s latent space and create yet unseen but realistic face images by the trained Decoder’s abilities. This task falls into the category of **Generative Deep Learning**. It has nothing to do with classification or a simple reconstruction of images. Instead we let a trained Artificial Neural Network create something new.

The code fragments discussed in the last post of this series helped us to prepare images of CelebA for training purposes. We cut and downsized them. We saved them in their final form in Numpy arrays: Loading e.g. 170,000 training images from an SSD as a Numpy array is a matter of a few seconds. We also learned how to prepare a Keras **ImageDataGenerator** object to create a flow of batches with image data to the GPU.

We have also developed two Python classes “MyVariationalAutoencoder” and “VAE” for the setup of a CNN-based VAE. These classes allow us to control a VAE’s input parameters, its layer structure based on Conv2D- and Conv2DTranspose layers, and the handling of the Kullback-Leibler [KL-] loss. In this post I will give you Jupyter code fragments that will help you to apply these classes in combination with CelebA data.

The Encoder and Decoder CNNs of our VAE shall consist of 4 convolutional layers and 4 transpose convolutional layers, respectively. We control the KL loss by invoking **GradientTape()** and **train_step()**.

**Regarding the size of the KL-loss:**

Due to the “curse of dimensionality” we will have to choose the KL-loss contribution to the total loss large enough. We control the relative size of the KL-loss in comparison to the standard reconstruction loss by a parameter “**fact**“. To determine an optimal value requires some experiments. It also depends on the kind of reconstruction loss: Below I assume that we use a “Binary Crossentropy” loss. Then we must choose fact > 3.0 to get the KL-loss to become bigger than 3% of the total loss. Otherwise the confining and smoothing effect of the KL-loss on the data distribution in the latent space will not be big enough to force the VAE to learn *general* and not specific features of the training images.
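As a purely illustrative sketch of how such a weighting enters the total loss (the real computation happens inside the classes of the previous posts; the numeric magnitudes below are invented for illustration only):

```python
def total_loss(reco_loss, kl_loss, fact=5.0):
    # Reconstruction loss plus a scaled KL contribution;
    # 'fact' controls the relative weight of the KL term
    return reco_loss + fact * kl_loss

# Invented example magnitudes: a BCE-based reconstruction loss around 5000
# and an unscaled KL contribution around 170
loss = total_loss(5000.0, 170.0, fact=5.0)
kl_share = (5.0 * 170.0) / loss
print(loss, round(kl_share, 3))   # 5850.0 0.145
```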

Below I present Jupyter cells for the required imports and the GPU preparation without many comments. It’s all standard. I keep the Python file with the named classes in a folder “my_AE_code.models”. This folder must have been declared as part of the module search path “sys.path”.

**Jupyter Cell 1 – Imports**

```python
import os, sys, time, random
import math
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat
import PIL as PIL
from PIL import Image
from PIL import ImageFilter

# tensorflow and keras
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.keras import backend as B
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import metrics
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, \
                                    AlphaDropout, Concatenate, Rescaling, ZeroPadding2D, Layer
#from tensorflow.keras.utils import to_categorical
#from tensorflow.keras.optimizers import schedules
from tensorflow.keras.preprocessing.image import ImageDataGenerator

from my_AE_code.models.MyVAE_3 import MyVariationalAutoencoder
from my_AE_code.models.MyVAE_3 import VAE
```

**Jupyter Cell 2 – List available Cuda devices**

```python
# List Cuda devices
# Suppress some TF2 warnings on negative NUMA node number
# see https://www.programmerall.com/article/89182120793/
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any of {'0', '1', '2'}

tf.config.experimental.list_physical_devices()
```

**Jupyter Cell 3 – Use GPU and limit VRAM usage**

```python
# Restrict to GPU and activate jit to accelerate
# *************************************************
# NOTE: To change any of the following values you MUST restart the notebook kernel !
b_tf_CPU_only = False    # we need to work on a GPU
tf_limit_CPU_cores = 4
tf_limit_GPU_RAM = 2048

b_experiment = False  # Use only if you want to use the deprecated way of limiting CPU/GPU resources
                      # see the next cell

if not b_experiment:
    if b_tf_CPU_only:
        ...
    else:
        gpus = tf.config.experimental.list_physical_devices('GPU')
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit = tf_limit_GPU_RAM)]
        )

# JiT optimizer
tf.config.optimizer.set_jit(True)
```

You see that I limited the VRAM consumption drastically to leave some of the 4GB VRAM available on my old GPU for other purposes than ML.

The next cell defines some basic parameters – you know this already from my last post.

**Jupyter Cell 4 – basic parameters**

```python
# Some basic parameters
# ~~~~~~~~~~~~~~~~~~~~~~~~
INPUT_DIM  = (96, 96, 3)
BATCH_SIZE = 128

# The number of available images
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
num_imgs = 200000  # Check with notebook CelebA

# The number of images to use during training and for tests
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NUM_IMAGES_TRAIN = 170000   # The number of images to use in a training run
#NUM_IMAGES_TO_USE = 60000  # The number of images to use in a training run
NUM_IMAGES_TEST = 10000     # The number of images to use in a test run

# for historic compatibility reasons
N_ImagesToUse       = NUM_IMAGES_TRAIN
NUM_IMAGES          = NUM_IMAGES_TRAIN
NUM_IMAGES_TO_TRAIN = NUM_IMAGES_TRAIN  # The number of images to use in a training run
NUM_IMAGES_TO_TEST  = NUM_IMAGES_TEST   # The number of images to use in a test run

# Define some shapes for Numpy arrays with all images for training
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
shape_ay_imgs_train = (N_ImagesToUse, ) + INPUT_DIM
print("Assumed shape for Numpy array with train imgs: ", shape_ay_imgs_train)

shape_ay_imgs_test = (NUM_IMAGES_TO_TEST, ) + INPUT_DIM
print("Assumed shape for Numpy array with test imgs: ", shape_ay_imgs_test)
```

Also the next cells were already described in the last blog.

**Jupyter Cell 5 – fill Numpy arrays with image data from disk**

```python
# Load the Numpy arrays with scaled Celeb A directly from disk
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print("Started loop for train and test images")
start_time = time.perf_counter()

x_train = np.load(path_file_ay_train)
x_test  = np.load(path_file_ay_test)

end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print("CPU-time for loading Numpy arrays of CelebA imgs: ", cpu_time)
print("Shape of x_train: ", x_train.shape)
print("Shape of x_test:  ", x_test.shape)
```

The output is:

```
Started loop for train and test images

CPU-time for loading Numpy arrays of CelebA imgs:  2.7438277259999495
Shape of x_train:  (170000, 96, 96, 3)
Shape of x_test:   (10000, 96, 96, 3)
```

**Jupyter Cell 6 – create an ImageDataGenerator object**

```python
# Generator based on Numpy array of image data (in RAM)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_use_generator_ay = True

BATCH_SIZE    = 128
SOLUTION_TYPE = 3

if b_use_generator_ay:
    if SOLUTION_TYPE == 0:
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow(
            x_train
            , x_train
            , batch_size = BATCH_SIZE
            , shuffle = True
        )
    if SOLUTION_TYPE == 3:
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow(
            x_train
            , batch_size = BATCH_SIZE
            , shuffle = True
        )
```

In our case we work with SOLUTION_TYPE = 3. This specifies the use of **GradientTape()** to control the KL-loss. Note that we do NOT need to define label data in this case.

Next we set up the sequence of convolutional layers of the Encoder and Decoder of our VAE. For this objective we feed the required parameters into the __init__() function of our class “MyVariationalAutoencoder” whilst creating an object instance (MyVae).

**Jupyter Cell 7 – Parameters for the setup of VAE-layers**

```python
from my_AE_code.models.MyVAE_3 import MyVariationalAutoencoder
from my_AE_code.models.MyVAE_3 import VAE

z_dim = 256   # a first good guess to get a sufficient basic reconstruction quality
              # due to the KL-loss the general reconstruction quality will
              # nevertheless be poor in comp. to an AE

solution_type = SOLUTION_TYPE  # We test GradientTape => SOLUTION_TYPE = 3
loss_type     = 0              # Reconstruction loss => 0: BCE, 1: MSE
act           = 0              # standard leaky relu activation function

# Factor to scale the KL-loss in comparison to the reconstruction loss
fact = 5.0       # - for BCE, other working values 1.5, 2.25, 3.0
                 #   best: fact >= 3.0
# fact = 2.0e-2  # - for MSE, other working values 1.2e-2, 4.0e-2, 5.0e-2

use_batch_norm = True
use_dropout    = False
dropout_rate   = 0.1

n_ch = INPUT_DIM[2]   # number of channels
print("Number of channels = ", n_ch)
print()

# Instantiation of our main class
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MyVae = MyVariationalAutoencoder(
    input_dim = INPUT_DIM
    , encoder_conv_filters     = [32,64,128,256]
    , encoder_conv_kernel_size = [3,3,3,3]
    , encoder_conv_strides     = [2,2,2,2]
    , encoder_conv_padding     = ['same','same','same','same']
    , decoder_conv_t_filters     = [128,64,32,n_ch]
    , decoder_conv_t_kernel_size = [3,3,3,3]
    , decoder_conv_t_strides     = [2,2,2,2]
    , decoder_conv_t_padding     = ['same','same','same','same']
    , z_dim = z_dim
    , solution_type  = solution_type
    , act            = act
    , fact           = fact
    , loss_type      = loss_type
    , use_batch_norm = use_batch_norm
    , use_dropout    = use_dropout
    , dropout_rate   = dropout_rate
)
```

There are some noteworthy things:

Reasonable values of “fact” depend on the type of reconstruction loss we choose. In general the “Binary Cross-Entropy Loss” (BCE) has steep walls around a minimum. BCE, therefore, creates much larger loss values than a “Mean Square Error” loss (MSE). Our class can handle both types of reconstruction loss. For BCE some trials show that values “3.0 <= fact <= 6.0” produce z-point distributions which are well confined around the origin of the latent space. If you like to work with “MSE” for the reconstruction loss you must assign much lower values to fact – around fact = 0.01.
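A tiny numpy sketch (with made-up pixel values in [0, 1]) illustrates why BCE produces considerably larger numbers than MSE for the same deviation:

```python
import numpy as np

# Made-up "pixel" values in [0, 1]
y_true = np.array([0.9, 0.1, 0.5])
y_pred = np.array([0.7, 0.3, 0.5])

eps = 1e-7  # numerical safety for the logarithms
bce = -np.mean(y_true * np.log(y_pred + eps) + (1.0 - y_true) * np.log(1.0 - y_pred + eps))
mse = np.mean((y_true - y_pred)**2)

# BCE comes out roughly 20x larger than MSE here, which is why a much
# smaller 'fact' is needed when MSE is chosen as reconstruction loss
print(bce, mse)
```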

I use batch normalization layers in addition to the convolution layers. It helps a bit with a faster convergence, but produces GPU-time overhead during training. In my experience batch normalization is not an absolute necessity. But try it out by yourself. Drop-out layers in addition to a reasonable KL-loss size appear to me as an unnecessary double means to enforce generalization.

Four convolution layers allow for a reasonable coverage of patterns on different length scales. Four layers also make it easy to use a constant stride of 2 and a “same” padding on all levels. We use a kernel size of 3 for all layers. The numbers of maps of the layers are defined as 32, 64, 128 and 256.

All in all we use a standard approach to combine filters at different granularity levels. We also cover 3 color layers of a standard image, reflected in the input dimensions of the Encoder. The Decoder creates corresponding arrays with color information.

We now call the class’s methods to build the models for the Encoder and Decoder parts of the VAE.

**Jupyter Cell 8 – Creation of the Encoder model**

```python
# Build the Encoder
# ~~~~~~~~~~~~~~~~~~
MyVae._build_enc()
MyVae.encoder.summary()
```

You see that the KL-loss related layers dominate the number of parameters.

**Jupyter Cell 9 – Creation of the Decoder model**

```python
# Build the Decoder
# ~~~~~~~~~~~~~~~~~~~
MyVae._build_dec()
MyVae.decoder.summary()
```

Building and compiling the full VAE based on parameter solution_type = 3 is easy with our class:

**Jupyter Cell 10 – Creation and compilation of the VAE model**

```python
# Build the full VAE
# ~~~~~~~~~~~~~~~~~~~
MyVae._build_VAE()

# Compile the model
learning_rate = 0.0005
MyVae.compile_myVAE(learning_rate=learning_rate)
```

Note that internally an instance of class “VAE” is built which handles all loss calculations including the KL-contribution. Compilation and inclusion of an Adam optimizer is also handled internally. Our classes make our life easy …

Our initial learning_rate is relatively small. I followed recommendations of D. Foster’s book on “Generative Deep Learning” regarding this point. A value of 1.e-4 does not change much regarding the number of epochs for convergence.

Due to the chosen low dimension of the latent space the total number of trainable parameters is relatively moderate.

To save some precious computational time (and energy consumption) in the future we need a basic option to save and load model weight parameters. I only describe a direct method; I leave it up to the reader to define a related Callback.

**Jupyter Cell 11 – Paths to save or load weight parameters**

```python
path_model_save_dir = 'YOUR_PATH_TO_A_WEIGHT_SAVING_DIR'

dir_name = 'MyVAE3_sol3_act0_loss0_epo24_fact_5p0emin0_ba128_lay32-64-128-256/'
path_dir = path_model_save_dir + dir_name
if not os.path.isdir(path_dir):
    os.mkdir(path_dir, mode = 0o755)

dir_all_name = 'all/'
dir_enc_name = 'enc/'
dir_dec_name = 'dec/'

path_dir_all = path_dir + dir_all_name
if not os.path.isdir(path_dir_all):
    os.mkdir(path_dir_all, mode = 0o755)

path_dir_enc = path_dir + dir_enc_name
if not os.path.isdir(path_dir_enc):
    os.mkdir(path_dir_enc, mode = 0o755)

path_dir_dec = path_dir + dir_dec_name
if not os.path.isdir(path_dir_dec):
    os.mkdir(path_dir_dec, mode = 0o755)

name_all = 'all_weights.hd5'
name_enc = 'enc_weights.hd5'
name_dec = 'dec_weights.hd5'

# paths to save all weights
path_all = path_dir + dir_all_name + name_all
path_enc = path_dir + dir_enc_name + name_enc
path_dec = path_dir + dir_dec_name + name_dec
```

You see that I define separate files in “hd5” format to save parameters of both the full model as well as of its Encoder and Decoder parts.

If we really wanted to load saved weight parameters we could set the parameter “b_load_weight_parameters” in the next cell to “True” and execute the cell code:

**Jupyter Cell 12 – Load saved weight parameters into the VAE model**

```python
b_load_weight_parameters = False
if b_load_weight_parameters:
    MyVae.model.load_weights(path_all)
```

We are ready to perform a training run. For our 170,000 training images and the parameters set I needed a bit more than 18 epochs, namely 24. I did this in two steps – first 18 epochs and then another 6.

**Jupyter Cell 13 – Perform a training run**

```python
INITIAL_EPOCH = 0

#n_epochs = 18
n_epochs  = 6

MyVae.set_enc_to_train()
MyVae.train_myVAE(
    data_flow
    , b_use_generator = True
    , epochs = n_epochs
    , initial_epoch = INITIAL_EPOCH
)
```

The total loss starts in the beginning with a value above 6,900 and quickly closes in to something like 5,100 and below. The KL-loss during training rises continuously from something like 30 to 176 where it stays almost constant. The 6 epochs after epoch 18 gave the following result:

I stopped the calculation at this point – though a full convergence may need some more epochs.

You see that an epoch takes about **2 minutes** GPU time (on a GTX960; a modern graphics card will deliver far better values). **For 170,000 images the training really costs**. On the other hand you get a broader variation of face properties in the resulting artificial images later on.

After some epochs we may want to save the calculated weights. The next Jupyter cell shows how.

**Jupyter Cell 14 – Save weight parameters to disk**

```python
print(path_all)
MyVae.model.save_weights(path_all)
print("saving all weights is finished")
print()

# save enc weights
print(path_enc)
MyVae.encoder.save_weights(path_enc)
print("saving enc weights is finished")
print()

# save dec weights
print(path_dec)
MyVae.decoder.save_weights(path_dec)
print("saving dec weights is finished")
```

After training you may first want to test the reconstruction quality of the VAE’s Decoder with respect to training or test images. Unfortunately, I cannot show you original data of the Celeb A dataset. However, the following code cells will help you to do the test by yourself.

**Jupyter Cell 15 – Choose images and compare them to their reconstructed counterparts**

```python
# We choose 14 "random" images from the x_train dataset
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
# For another method to create reproducible "random numbers" see
# https://albertcthomas.github.io/good-practices-random-number-generators/

n_to_show = 7   # per row

# To really recover all data we must have one and the same input dataset per training run
l_seed = [33, 44]
#l_seed = [33, 44, 55, 66, 77, 88, 99]
num_exmpls = len(l_seed)
print(num_exmpls)

# lists to save the image rows
l_img_orig_rows = []
l_img_reco_rows = []

start_time = time.perf_counter()

# Set the Encoder to prediction = epsilon * 0.0
# MyVae.set_enc_to_predict()

for i in range(0, num_exmpls):
    # fixed random distribution
    rs1 = RandomState(MT19937( SeedSequence(l_seed[i]) ))

    # indices of example images selected from the train images
    #example_idx = np.random.choice(range(len(x_test)), n_to_show)
    example_idx    = rs1.randint(0, len(x_train), n_to_show)
    example_images = x_train[example_idx]

    # calc points in the latent space
    if solution_type == 3:
        z_points, mu, logvar = MyVae.encoder.predict(example_images)
    else:
        z_points = MyVae.encoder.predict(example_images)

    # Reconstruct the images - note that this results in an array of images
    reconst_images = MyVae.decoder.predict(z_points)

    # save images in a list
    l_img_orig_rows.append(example_images)
    l_img_reco_rows.append(reconst_images)

end_time = time.perf_counter()
cpu_time = end_time - start_time

# Reset the Encoder to prediction = epsilon * 1.00
# MyVae.set_enc_to_train()

print()
print("n_epochs : ", n_epochs, ":: CPU-time to reconstr. imgs: ", cpu_time)
```

We save the selected original images and the reconstructed images in Python lists.

We then display the original images in one row of a matrix and the reconstructed ones in a row below. We arrange 7 images per row.

**Jupyter Cell 16 – display original and reconstructed images in a matrix-like array**

```python
# Build an image mesh
# ~~~~~~~~~~~~~~~~~~~~
fig = plt.figure(figsize=(16, 8))
fig.subplots_adjust(hspace=0.2, wspace=0.2)

n_rows = num_exmpls * 2   # one row for the originals, one for the reconstructions

for j in range(num_exmpls):
    offset_orig = n_to_show * j * 2
    for i in range(n_to_show):
        img = l_img_orig_rows[j][i].squeeze()
        ax = fig.add_subplot(n_rows, n_to_show, offset_orig + i + 1)
        ax.axis('off')
        ax.imshow(img, cmap='gray_r')

    offset_reco = offset_orig + n_to_show
    for i in range(n_to_show):
        img = l_img_reco_rows[j][i].squeeze()
        ax = fig.add_subplot(n_rows, n_to_show, offset_reco + i + 1)
        ax.axis('off')
        ax.imshow(img, cmap='gray_r')
```

You will find that the reconstruction quality is rather limited – and not really convincing by any measure regarding details. Only the general shape of faces and their features is reproduced. But, actually, it is this lack of precision regarding details which helps us to create images from arbitrary z-points. I will discuss these points in more detail in a further post.

The technique to display images can also be used to display images reconstructed from arbitrary points in the latent space. I will show you various results in another post.

For now just enjoy the creation of images derived from z-points defined by a normal distribution around the center of the latent space:

Most of these images look quite convincing and crispy down to details. The sharpness results from some photo-processing with PIL functions after the creation by the VAE. But who said that this is not allowed?
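The creation step itself can be sketched in a few lines (a minimal sketch; `MyVae` and `z_dim` refer to the objects defined above, and the actual decoding line is left as a comment because it needs the trained model):

```python
import numpy as np

z_dim  = 256   # latent space dimension used in this post
n_imgs = 7

# Draw z-points from a standard normal distribution around the origin
# of the latent space ...
z_points = np.random.normal(loc=0.0, scale=1.0, size=(n_imgs, z_dim))

# ... and let the trained Decoder turn them into images:
# imgs = MyVae.decoder.predict(z_points)   # shape (n_imgs, 96, 96, 3)
```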

In this post I have presented Jupyter cells with code fragments which may help you to apply the VAE-classes created previously. With the VAE setup discussed above we control the KL-loss by a GradientTape() object.

Preliminary results show that the images created from arbitrarily chosen z-points really show heads with human-like faces and hair-dos – in contrast to what a simple AE would produce (see the post below):

Autoencoders, latent space and the curse of high dimensionality – I

In the next post

Variational Autoencoder with Tensorflow 2.8 – XI – image creation by a VAE trained on CelebA

I will have a look at the distribution of z-points corresponding to the CelebA data and discuss the delicate balance between the representation of details and the generalization of features. With VAEs you cannot get both.

**And let us all who praise freedom not forget: **

The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!


Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss

Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

Variational Autoencoder with Tensorflow 2.8 – VIII – TF 2 GradientTape(), KL loss and metrics

We still have to test the Python classes which we have so laboriously developed during the last posts. One of these classes, “VAE()”, supports a specific approach to control the KL-loss parameters during training and cost optimization by gradient descent: The class may use Tensorflow’s [TF 2] *GradientTape*-mechanism and the Keras function *train_step()* – instead of relying on Keras’ standard “add_loss()” functions.

Instead of recreating simple MNIST images of digits from points in a latent space I now want to train a VAE (with GradientTape-based loss control) to solve a more challenging task:

We want to create artificial images of naturally appearing human faces from randomly chosen points in the latent space of a VAE, which has been trained with images of real human faces.

Actually, we will train our VAE with images provided by the so called “**Celeb A**” dataset. This dataset contains around 200,000 images showing the heads of so called celebrities. Due to the number and size of its images this dataset forces me (due to my very limited hardware) to use a *Keras Image Data Generator*. A generator is a tool to transfer huge amounts of data in a continuous process and in form of small batches to the GPU during neural network training. The batches must be small enough such that the respective image data fit into the VRAM of the GPU. Our VAE classes have been designed to support a generator.

In this post I first explain why Celeb A poses a thorough test for a VAE. Afterwards I shall bring the Celeb A data into a form suitable for older graphics cards with small VRAM.

To answer the question we first have to ask ourselves why we need VAEs at all. Why do certain ML tasks require more than just a simple plain Autoencoder [AE]?

The answer to the latter question lies in the data distribution an AE creates in its latent space. An AE which is trained for the precise reconstruction of presented images will use a sufficiently broad area/volume of the latent space to place different points corresponding to different images with a sufficiently large distance between them. The position in an AE’s latent space (together with the Encoder’s and Decoder’s weights) encodes specific features of an image. A standard AE is not forced to generalize sufficiently during training for reconstruction tasks. On the contrary: A good reconstruction AE shall learn to encode as many details of input images as possible whilst filling the latent space.

However: The neural networks of a (V)AE correspond to (non-linear) mapping functions between multi-dimensional vector spaces, namely

- between the feature space of the input data objects and the AE’s latent space
- and also between the latent space and the reconstruction space (normally with the same dimension as the original feature space for the input data).

This poses some risks whenever some tasks require to use arbitrary points in the latent space. Let us, e.g., look at the case of images of certain real objects in front of varying backgrounds:

During the AE’s training we map points of a high-dimensional feature-space for the pixel values of (colored) images to points in the multi-dimensional latent space. The target region in the latent space stemming from regions in the original feature-space which correspond to “reasonable” images displaying *real* objects may cover only a relatively thin, wiggled manifold within the latent space (z-space). For points outside the curved boundaries of such regions in z-space the Decoder may not give you clear realistic and interpretable images.

The most important objectives of invoking the KL-loss as an additional optimization element by a VAE are

- to *confine* the data point distribution, which the VAE’s Encoder part produces in the multidimensional latent space, around the *origin* **O** of the z-space – as far as possible symmetrically and within a very limited distance from **O**,
- to normalize the data distribution around any z-point calculated during training. Whenever a real training object marks the center of a *limited area* in latent space then reconstructed data objects (e.g. images) within such an area should not be too different from the original training object.

I.e.: We force the VAE to generalize much more than a simple AE.

Both objectives are achieved via specific parameterized parts of the KL-loss. We optimize the KL-loss parameters – and thus the data distribution in the latent space – during training. After the training phase we want the VAE’s Decoder to behave well and smoothly for *neighboring* points in extended areas of the latent space:

The content of reconstructed objects (e.g. images) resulting from neighboring points within limited z-space areas (up to a certain distance from the origin) should vary only smoothly.

The KL loss provides the necessary *smear-out effect* for the data distribution in z-space.
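For reference, the confinement and smear-out effects stem from the standard KL divergence of a diagonal Gaussian N(mu, sigma²) from the standard normal N(0, 1); as a plain numpy sketch (the actual TF implementation lives in the VAE classes of the earlier posts):

```python
import numpy as np

def kl_loss(mu, logvar):
    # KL divergence of N(mu, exp(logvar)) from N(0, 1),
    # summed over the latent dimensions, averaged over the batch
    return np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))

z_dim = 256
# The loss vanishes when mu = 0 and sigma = 1 (logvar = 0) ...
print(kl_loss(np.zeros((4, z_dim)), np.zeros((4, z_dim))))       # 0.0
# ... and grows when z-points drift away from the origin:
print(kl_loss(np.full((4, z_dim), 2.0), np.zeros((4, z_dim))))   # 512.0
```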

During this series I have only shown you the effects of the KL-loss on **MNIST** data for a dimension of the latent space *z_dim = 2*. We saw the general confinement of z-points around the origin and also a confinement of points corresponding to different MNIST-numbers (= specific features of the original images) in limited areas. With some overlaps and transition regions for different numbers.

But note: The low dimension of the latent space in the MNIST case (between 2 and 16) simplifies the confinement task – close to the origin there are not many degrees of freedom and no big volume available for the VAE Encoder. Even a standard AE would be rather limited when trying to vastly distribute z-points resulting from MNIST images of different digits.

However, a more challenging task is posed by the data distribution, which a (V)AE creates e.g. of images showing human heads and faces with characteristic features in front of varying backgrounds. To get a reasonable image reconstruction we must assign a much higher number of dimensions to the latent space than in the MNIST case: **z_dim = 256** or **z_dim = 512** are reasonable values at the lower end!

Human faces or heads with different hair-dos are much more complex than digit figures. In addition the influence of details in the background of the faces must be handled – and for our objective be damped. As we have to deal with *many* more dimensions of the z-space than in the MNIST case a simple standard AE will run into trouble:

Without the confinement and local smear-out effect of the KL-loss only tiny and thin areas of the latent space will correspond to reconstructions of human-like “faces”. I have discussed this point in more detail also in the post

Autoencoders, latent space and the curse of high dimensionality – I

As a result a standard AE will **NOT** reconstruct human faces from randomly picked z-points in the latent space. So, an AE will fail on the challenge posed in the introduction of this post.

I recommend to get the Celeb A data from some trustworthy Kaggle contributor – and not from the original Chinese site. You may find cropped images e.g. here. Still check the image container and the images carefully for unwanted add-ons.

The Celeb A dataset contains around 200,000 images of the heads of celebrities with a resolution of 218×178 pixels. Each image shows a celebrity face in front of some partially complex background. The amount of data to be handled during VAE training is relatively big – even if you downscale the images. The whole set will not fit into the limited VRAM of older graphics cards like mine (GTX960 with only 4 GB). This post will show you how to deal with this problem.

You may wonder why the Celeb A dataset poses a problem as the original data only consume about 1.3 GByte on a hard disk. But do not forget that we need to provide floating point *tensors* of size (height x width x 3 x 32Bit) instead of compressed integer based jpg-information to the VAE algorithm. You can do the math on your own. In addition: Working with multiple screens and KDE on Linux may already consume more than 1 GB of our limited VRAM.
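Doing that math for the downscaled 96×96 px images and the 170,000 training images used later in this post (the figures are plain approximations):

```python
# Approximate RAM footprint of the training images as float32 tensors
num_train = 170000
h, w, c   = 96, 96, 3
bytes_per_float32 = 4

total_bytes = num_train * h * w * c * bytes_per_float32
total_gib   = total_bytes / 2**30
print(round(total_gib, 1))   # ~17.5 GiB for the training array alone
```

Add the test array on top and you land close to the "20 GB" of RAM mentioned below.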

We use three tricks to work reasonably fast with the Celeb A data on a Linux system with limited VRAM, but with around 32 GB or more standard RAM:

- We first crop and downscale the images – in my case to 96×96 pixels.
- We save a binary of a Numpy array of all images on a SSD and read it into the RAM during Jupyter experiments.
- We then apply a so called Keras *Image Data Generator* to transfer the images to the graphics card when required.

The first point reduces the amount of MBytes per image. For basic experiments we do not need the full resolution.

The second point above is due to performance reasons: (1) Each time we want to work with a Jupyter notebook on the data we want to keep the time to load the data small. (2) We need the array data already in the system’s RAM to transfer them efficiently and in portions to the GPU.

A “**generator**” is a Keras tool which allows us to deliver input data for the VAE training in form of a continuously replenished dataflow from the CPU environment to the GPU. The amount of data provided with each transfer step to the GPU is reduced to a batch of images. Of course, we have to choose a reasonable size for such a batch. It should be compatible with the training batch size defined in the VAE-model’s fit() function.

A batch alone will fit into the VRAM whereas the whole dataset may not. The control of the data stream costs some overhead time – but this is better than not to be able to work at all. The second point above helps to accelerate the transfer of data to the GPU significantly: A generator which sequentially picks data from a hard disk, transfers it to RAM and then to VRAM is too slow to get a convenient performance in the end.
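As a rough orientation for the batch handling (using the numbers of this series; purely illustrative arithmetic):

```python
import math

num_train  = 170000
batch_size = 128

# Number of generator steps needed to see every training image once
steps_per_epoch = math.ceil(num_train / batch_size)

# VRAM needed for a single batch of 96x96x3 float32 images
batch_mib = batch_size * 96 * 96 * 3 * 4 / 2**20

print(steps_per_epoch, round(batch_mib, 1))   # 1329 13.5
```

So a single batch only claims some 13–14 MiB of VRAM, while the full float32 dataset would not fit at all.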

Each time before we start VAE applications on the Jupyter side, we first fill the RAM with all image data in tensor-like form. From a SSD the totally required time should be small. The disadvantage of this approach is the amount of RAM we need. In my case close to 20 GB!

We first crop each of the original images to reduce background information and then resize the result to 96×96 px. D. Foster uses 128×128 px in his book on “Generative Deep Learning”. But for small VRAM 96×96 px is a bit more helpful.

I also wanted the images to have a quadratic shape because then one does not have to adjust the strides of the VAE’s CNN Encoder and Decoder kernels differently for the two geometrical dimensions. 96 px in each dimension is also a good number as it allows for exactly 4 layers in the VAE’s CNNs. Each of the layers then reduces the resolution of the analyzed patterns by a factor of 2. At the innermost layer of the Encoder we deal with e.g. 256 maps with an extension of 6×6.
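The halving chain can be checked with one line of arithmetic:

```python
# Four stride-2 layers halve the 96 px resolution four times
res_chain = [96 // 2**k for k in range(5)]
print(res_chain)   # [96, 48, 24, 12, 6]
```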

Cropping the original images is a bit risky as we may either cut some parts of the displayed heads/faces or the neck region. I decided to cut the upper part of the image. So I lost part of the hair-do in some cases, but this did not affect the ability to create realistic images of new heads or faces in the end. You may with good reason decide differently.

I set the edge points of the cropping region to

top = 40, bottom = 218, left = 0, right = 178 (in PIL’s box coordinates, i.e. only the upper 40 pixel rows are cut; see Jupyter cell 1 below).

This gave me quadratic pictures. But you may choose your own parameters, of course.

To prepare the pictures of the Celeb A dataset I used the PIL library.

import os, sys, time import numpy as np import scipy from glob import glob import PIL as PIL from PIL import Image from PIL import ImageFilter import matplotlib as mpl from matplotlib import pyplot as plt from matplotlib.colors import ListedColormap import matplotlib.patches as mpat

A Jupyter cell with a loop to deal with almost all CelebA images would then look like this:

**Jupyter cell 1**

```python
dir_path_orig = 'YOUR_PATH_TO_THE_ORIGINAL_CELEB A_IMAGES'
dir_path_save = 'YOUR_PATH_TO_THE_RESIZED_IMAGES'

num_imgs = 200000   # the number of images we use

print("Started loop for images")
start_time = time.perf_counter()

# cropping corner positions and new img size
left = 0;    top = 40
right = 178; bottom = 218
width_new  = 96
height_new = 96

# Cropping and resizing
for num in range(1, num_imgs):
    jpg_name = '{:0>6}'.format(num)
    jpg_orig_path = dir_path_orig + jpg_name + ".jpg"
    jpg_save_path = dir_path_save + jpg_name + ".jpg"
    im  = Image.open(jpg_orig_path)
    imc = im.crop((left, top, right, bottom))
    #imc = imc.resize((width_new, height_new), resample=PIL.Image.BICUBIC)
    imc = imc.resize((width_new, height_new), resample=PIL.Image.LANCZOS)
    imc.save(jpg_save_path, quality=95)   # we save with high quality
    im.close()

end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print("CPU-time: ", cpu_time)
```

Note that we save the images with high quality. Without the quality parameter PIL’s save function for a jpg target format would reduce the given quality unnecessarily and without having a positive impact on the RAM or VRAM consumption of the tensors we have to use in the end.

The whole process of cropping and resizing takes about 240 secs on my old PC – *without* any parallelized operations on the CPU. The data were read from a standard old hard disk, not an SSD. As we have to make this investment of CPU time only once, I did not care about optimization.

To prepare and save a huge Numpy array which contains all training images for our VAE we first need to define some parameters. I normally use 170,000 images for training purposes and around 10,000 for tests.

**Jupyter cell 2**

```python
# Some basic parameters
# ~~~~~~~~~~~~~~~~~~~~~~~~
INPUT_DIM  = (96, 96, 3)
BATCH_SIZE = 128

# The number of available images
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
num_imgs = 200000          # Check with notebook CelebA

# The number of images to use during training and for tests
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NUM_IMAGES_TRAIN = 170000  # The number of images to use in a training run
#NUM_IMAGES_TO_USE = 60000 # The number of images to use in a training run
NUM_IMAGES_TEST  = 10000   # The number of images to use in a test run

# for historic compatibility reasons with other code fragments (the reader may not care too much about it)
N_ImagesToUse       = NUM_IMAGES_TRAIN
NUM_IMAGES          = NUM_IMAGES_TRAIN
NUM_IMAGES_TO_TRAIN = NUM_IMAGES_TRAIN
NUM_IMAGES_TO_TEST  = NUM_IMAGES_TEST

# Define some shapes for Numpy arrays with all images for training
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
shape_ay_imgs_train = (N_ImagesToUse, ) + INPUT_DIM
print("Assumed shape for Numpy array with train imgs: ", shape_ay_imgs_train)

shape_ay_imgs_test = (NUM_IMAGES_TO_TEST, ) + INPUT_DIM
print("Assumed shape for Numpy array with test imgs: ", shape_ay_imgs_test)
```

We also need to define some parameters to control the following aspects:

- Do we directly load Numpy arrays with train and test data?
- Do we load image data and convert them into Numpy arrays?
- From where do we load image data?

The following Jupyter cells help us:

**Jupyter cell 3**

```python
# Set parameters where to get the image data from
# ************************************************

# Use the cropped 96x96 HIGH-quality images
b_load_HQ = True

# Load prepared Numpy arrays
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
b_load_ay_from_saved = False   # True: Load prepared x_train and x_test Numpy arrays

# Load from SSD
# ~~~~~~~~~~~~~
b_load_from_SSD = True

# Save newly calculated x_train, x_test arrays in binary format onto disk
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_save_to_disk = False

# Paths
# ******

# Images on SSD
# ~~~~~~~~~~~~~
if b_load_from_SSD:
    if b_load_HQ:
        dir_path_load = 'YOUR_PATH_TO_HQ_DATA_ON_SSD/'   # high quality
    else:
        dir_path_load = 'YOUR_PATH_TO_LQ_DATA_ON_SSD/'   # low quality

# Images on slow HD
# ~~~~~~~~~~~~~~~~~~
if not b_load_from_SSD:
    if b_load_HQ:
        dir_path_load = 'YOUR_PATH_TO_HQ_DATA_ON_HD/'    # high quality on slow HD
    else:
        dir_path_load = 'YOUR_PATH_TO_LQ_DATA_ON_HD/'    # low quality on slow HD

# x_train, x_test arrays on SSD
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if b_load_from_SSD:
    dir_path_ay = 'YOUR_PATH_TO_Numpy_ARRAY_DATA_ON_SSD/'
    if b_load_HQ:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_hq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_hq-x_test.npy"
    else:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_lq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_lq-x_test.npy"

# x_train, x_test arrays on slow HD
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if not b_load_from_SSD:
    dir_path_ay = 'YOUR_PATH_TO_Numpy_ARRAY_DATA_ON_HD/'
    if b_load_HQ:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_hq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_hq-x_test.npy"
    else:
        path_file_ay_train = dir_path_ay + "celeba_200tsd_norm255_lq-x_train.npy"
        path_file_ay_test  = dir_path_ay + "celeba_200tsd_norm255_lq-x_test.npy"
```

You must of course define your own paths and names.

Note that the ending “.npy” defines the standard binary format for Numpy data.
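As a quick illustration of why this format is convenient: shape and dtype survive a save/load round-trip unchanged. A minimal sketch (the small array and file name are mine, not from the notebooks above):

```python
import os
import tempfile
import numpy as np

# a small stand-in for the big x_train array (float32, scaled to [0, 1])
x_small = np.random.default_rng(42).random((10, 96, 96, 3), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "celeba_demo-x_train.npy")
    np.save(path, x_small)      # binary ".npy" format
    x_back = np.load(path)      # shape and dtype are restored exactly

print(x_back.dtype, x_back.shape)   # float32 (10, 96, 96, 3)
```

For arrays too big for RAM, np.load() also accepts mmap_mode="r" to memory-map the file instead of reading it completely.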

In case I want to *prepare* the Numpy arrays (instead of loading already prepared ones from a binary file) I make use of the following straightforward function:

**Jupyter cell 4**

```python
def load_and_scale_celeba_imgs(start_idx, num_imgs, shape_ay, dir_path_load):
    ay_imgs = np.ones(shape_ay, dtype='float32')
    end_idx = start_idx + num_imgs

    # We open the images and transform them into Numpy arrays
    for j in range(start_idx, end_idx):
        idx = j - start_idx
        jpg_name = '{:0>6}'.format(j)
        jpg_orig_path = dir_path_load + jpg_name + ".jpg"
        im = Image.open(jpg_orig_path)
        # transform the data into a Numpy array
        img_array = np.array(im)
        ay_imgs[idx] = img_array
        im.close()

    # scale the images
    ay_imgs = ay_imgs / 255.

    return ay_imgs
```

We call this function for training images as follows:

**Jupyter cell 5**

```python
# Load training images from SSD/HD and prepare Numpy float32 arrays
# - 18.1 GByte of RAM required for the float32 array!
# - takes around 30 to 35 secs
# ******************************************************************
if not b_load_ay_from_saved:

    # Prepare float32 Numpy array for the training images
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    start_idx_train = 1
    print("Started loop for training images")
    start_time = time.perf_counter()
    x_train = load_and_scale_celeba_imgs(start_idx=start_idx_train,
                                         num_imgs=NUM_IMAGES_TRAIN,
                                         shape_ay=shape_ay_imgs_train,
                                         dir_path_load=dir_path_load)
    end_time = time.perf_counter()
    cpu_time = end_time - start_time
    print()
    print("CPU-time for array of training images: ", cpu_time)
    print("Shape of x_train: ", x_train.shape)

    # Plot an example image
    plt.imshow(x_train[169999])
```

And for test images:

**Jupyter cell 6**

```python
# Load test images from SSD/HD and prepare Numpy float32 arrays
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
if not b_load_ay_from_saved:

    # Prepare float32 Numpy array for the test images
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    start_idx_test = NUM_IMAGES_TRAIN + 1
    print("Started loop for test images")
    start_time = time.perf_counter()
    x_test = load_and_scale_celeba_imgs(start_idx=start_idx_test,
                                        num_imgs=NUM_IMAGES_TEST,
                                        shape_ay=shape_ay_imgs_test,
                                        dir_path_load=dir_path_load)
    end_time = time.perf_counter()
    cpu_time = end_time - start_time
    print()
    print("CPU-time for array of test images: ", cpu_time)
    print("Shape of x_test: ", x_test.shape)

    # Plot an example image
    plt.imshow(x_test[27])
```

This takes about 35 secs in my case for the training images (170,000) and about 2 secs for the test images. Other people in the field use much smaller numbers of training images.
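A back-of-the-envelope check of the RAM footprint (my own calculation, consistent with the RAM numbers quoted in this post): a float32 array for 170,000 images of 96×96×3 pixels occupies

```python
# RAM footprint of the float32 training array (in decimal GB)
num_train = 170000
h, w, c = 96, 96, 3
bytes_per_float32 = 4

total_bytes = num_train * h * w * c * bytes_per_float32
print(round(total_bytes / 1e9, 1), "GB")   # 18.8 GB
```

The test array adds roughly another 1.1 GB on top.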

If you want to save the Numpy arrays to disk:

**Jupyter cell 7**

```python
# Save the newly calculated Numpy arrays in binary format to disk
# ****************************************************************
if not b_load_ay_from_saved and b_save_to_disk:
    print("Start saving arrays to disk ...")
    np.save(path_file_ay_train, x_train)
    print("Finished saving the train img array")
    np.save(path_file_ay_test, x_test)
    print("Finished saving the test img array")
```

If we wanted to load the Numpy arrays with training and test data from disk we would use the following code:

**Jupyter cell 8**

```python
# Load the Numpy arrays with scaled CelebA imgs directly from disk
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print("Started loading the Numpy arrays")
start_time = time.perf_counter()
x_train = np.load(path_file_ay_train)
x_test  = np.load(path_file_ay_test)
end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print("CPU-time for loading Numpy arrays of CelebA imgs: ", cpu_time)
print("Shape of x_train: ", x_train.shape)
print("Shape of x_test: ", x_test.shape)
```

This takes about 2 secs on my system, which has enough and fast RAM. So loading a prepared Numpy array for the CelebA data is no problem.

Easy introductions to Keras’ ImageDataGenerators, their purpose and usage can be found via the links in the last section of this post.

ImageDataGenerators can not only be used to create a flow of limited batches of images to the GPU, but also for parallel operations on the images coming from some source. The latter ability is e.g. very welcome when we want to create additional augmented image data. The sources of images can be some directory of image files or a Python data structure. Depending on the source, different ways of defining a generator object have to be chosen. The ImageDataGenerator-class and its methods can also be customized in very many details.

If we worked on a directory we might have to define our generator similar to the following code fragment:

```python
data_gen = ImageDataGenerator(rescale=1./255)  # if the image data are not scaled already for float arrays

# class_mode = 'input' is used for Autoencoders
# see https://vijayabhaskar96.medium.com/tutorial-image-classification-with-keras-flow-from-directory-and-generators-95f75ebe5720
data_flow = data_gen.flow_from_directory( directory = YOUR_PATH_TO_ORIGINAL IMAGE DATA
                                        #, target_size = INPUT_DIM[:2]
                                        , batch_size = BATCH_SIZE
                                        , shuffle = True
                                        , class_mode = 'input'
                                        , subset = "training"
                                        )
```

This would allow us to read in data from a prepared sub-directory “YOUR_PATH_TO_ORIGINAL IMAGE DATA/train/” of the file-system and to scale the pixel data to the interval [0.0, 1.0] at the same time. However, this approach is too slow for big amounts of data.

As we already have scaled image data available in RAM-based Numpy arrays, both the parameterization and the usage of the generator during training are very simple. And the performance with RAM-based data is much, much better!

So, what do our Jupyter cells for defining the generator look like?

**Jupyter cell 9**

```python
# Generator based on Numpy array for images in RAM
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_use_generator_ay = True

BATCH_SIZE    = 128
SOLUTION_TYPE = 3

if b_use_generator_ay:

    # SOLUTION_TYPE == 0 works with extra layers and add_loss() to control the KL loss
    # it requires the definition of "labels" - which are the original images
    if SOLUTION_TYPE == 0:
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow( x_train
                                 , x_train
                                 #, target_size = INPUT_DIM[:2]
                                 , batch_size = BATCH_SIZE
                                 , shuffle = True
                                 #, class_mode = 'input'   # Not working with this type of generator
                                 #, subset = "training"    # Not required
                                 )

    # if SOLUTION_TYPE == 1: ...   (analogous; left to the reader)
    # if SOLUTION_TYPE == 2: ...

    if SOLUTION_TYPE == 3:
        data_gen = ImageDataGenerator()
        data_flow = data_gen.flow( x_train
                                 #, x_train
                                 #, target_size = INPUT_DIM[:2]
                                 , batch_size = BATCH_SIZE
                                 , shuffle = True
                                 #, class_mode = 'input'   # Not working with this type of generator
                                 #, subset = "training"    # Not required
                                 )
```

Besides the method using extra layers with layer.add_loss() (SOLUTION_TYPE == 0) I have discussed other methods for the handling of the KL loss in previous posts. I leave it to the reader to fill in the correct statements for these cases. In our present study we want to use a GradientTape()-based method, i.e. SOLUTION_TYPE = 3. In this case we do NOT need to pass a label-array to the Generator. Our gradient_step() function is intelligent enough to handle the loss calculation on its own! (See the previous posts.)

So it is just

```python
data_gen = ImageDataGenerator()
data_flow = data_gen.flow( x_train
                         , batch_size = BATCH_SIZE
                         , shuffle = True
                         )
```

which does a perfect job for us.

In the end we only need the following call to train our VAE-model:

```python
MyVae.train_myVAE( data_flow
                 , b_use_generator = True
                 , epochs        = n_epochs
                 , initial_epoch = INITIAL_EPOCH
                 )
```

This class method in turn will internally call something like

```python
self.model.fit( data_flow              # coming as a batched dataflow from the outside generator
              , shuffle = True
              , epochs  = epochs
              , batch_size = batch_size  # best identical to the batch_size of data_flow
              , initial_epoch = initial_epoch
              )
```

But the setup of a reasonable VAE-model for CelebA images and its training will be the topic of the next post.

What have we achieved? Nothing yet regarding VAE results. However, we have prepared almost 200,000 CelebA images such that we can easily load them from disk into a Numpy float32 array within 2 seconds. Around 20 GB of conventional PC RAM is required. But this array can now easily be used as a source for VAE training.

Furthermore I have shown that the setup of a Keras “ImageDataGenerator” to provide the image data as a flow of batches fitting into the GPU’s VRAM is a piece of cake – at least for our VAE objectives. We are well prepared now to apply a VAE-algorithm to the CelebA data – even if we only have an old graphics card available with limited VRAM.

In the next post of this series,

Variational Autoencoder with Tensorflow 2.8 – X – VAE application to CelebA images,

I will show you the code for VAE-training with CelebA data. Afterward we will pick random points in the latent space and create artificial images of human faces.

People interested in data augmentation should have a closer look at the parameterization options of the ImageDataGenerator-class.

Celeb A

https://datagen.tech/guides/image-datasets/celeba/

Data generators

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

https://towardsdatascience.com/keras-data-generators-and-how-to-use-them-b69129ed779c

**And last but not least my standard statement as long as the war in Ukraine is going on:**

Ceterum censeo: The worst fascist, war criminal and killer living today is the Putler. He must be isolated at all levels, be denazified and sooner than later be imprisoned. Long live a free and democratic Ukraine!


In this series of posts I want to discuss this problem a bit as it illustrates why we need Variational Autoencoders for a systematic creation of faces with varying features from points and clusters in the latent space. But the problem also raises some fundamental and interesting questions

- about a certain “blindness” of neural networks during training in general, and
- about the way we save or conserve the knowledge which a neural network has gained about patterns in input data during training.

This post requires experience with the architecture and principles of Autoencoders.

For preparing my talk I worked with relatively simple Autoencoders. I used Convolutional Neural Networks [CNNs] with just 4 convolutional layers to create the Encoder and Decoder parts of the Autoencoder. As typical applications I chose the following:

- Effective image compression and reconstruction by using a latent space of relatively low dimensionality. The trained AEs were able to compress input images into latent vectors with only few components and reconstruct the original image from the compressed format.
- Denoising of images where the original data were obscured by the superposition of statistical noise and/or statistically dropped pixels. (This is my favorite task for AEs which they solve astonishingly well.)
- Recolorization of images: The trained AE in this case transforms images with only gray pixels into colorful images.

Such challenges for AEs are discussed in standard ML literature. In a first approach I applied my Autoencoders to the usual MNIST and Fashion MNIST datasets. For the task of recolorization I used the Cifar 10 dataset. But a bit later I turned to the **Celeb A** dataset with images of celebrity faces. Just to make all of the tasks a bit more challenging.
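For the denoising task the obscured training inputs can be produced by superimposing statistical noise and statistically dropping pixels. A minimal numpy sketch of such an obscuring step (function and parameter names are my own, not taken from my original code):

```python
import numpy as np

def obscure_images(imgs, noise_factor=0.5, drop_fraction=0.8, seed=123):
    """Superimpose Gaussian noise and statistically drop a fraction of all pixels."""
    rng = np.random.default_rng(seed)
    noisy = imgs + noise_factor * rng.normal(size=imgs.shape)
    # drop pixels: set a random fraction of (all channels of) the pixels to zero
    mask = rng.random(imgs.shape[:3]) < drop_fraction
    noisy[mask] = 0.0
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)

# usage on a small dummy batch of images scaled to [0, 1]
x = np.random.default_rng(0).random((4, 96, 96, 3), dtype=np.float32)
x_noisy = obscure_images(x)
```

During training the noisy array serves as input and the clean array as the reconstruction target.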

My Autoencoders excelled in all the tasks named above – for MNIST, CelebA and, regarding recolorization, CIFAR 10.

Regarding MNIST and Fashion MNIST, 4-layer CNNs for the Encoder and Decoder are almost overkill. For MNIST the dimension **z_dim** of the latent space can be chosen to be pretty small:

**z_dim = 12** gives a really good reconstruction quality of (test) images compressed to minimum information in the latent space. **z_dim = 4** still gave an acceptable quality, and even with z_dim = 2 most of the test images were reconstructed well enough. The same was true for the reconstruction of images superimposed with heavy statistical noise – such that the human eye could no longer guess the original information. For Fashion MNIST a dimension number 20 < z_dim < 40 gave good results.
Also for recolorization the results were very plausible. I shall present the results in other blog posts in the future.

Then I turned to the **Celeb A** dataset. By the way: I got interested in Celeb A when reading the books of David Foster on “Generative Deep Learning” and of Tariq Rashid, “Make Your First GANs with PyTorch” (see the complete references in the last section of this post).

The Celeb A dataset contains images of around 200,000 faces with varying contours, hairdos and very different, inhomogeneous backgrounds. And the faces are displayed from very different viewing angles.

For a good performance of image reconstruction in all of the named use cases one needs to raise the number of dimensions of the latent space *significantly*. Instead of 12 dimensions of the latent space as for MNIST we now talk about 200 up to 1200 dimensions for CELEB A – depending on the task the AE gets trained for and, of course, on the quality expectations. For reconstruction of normal images and for the reconstruction of clear images from noisy input images higher numbers of dimensions z_dim ≥ 512 gave visibly better results.

Actually, I was surprised by the impressive quality of the reconstruction of *test* images of faces which were almost totally obscured by the superposition of statistical noise or by a statistical removal of pixels – after a self-supervised training on around 100,000 images. (Totalitarian states and security agencies are certainly happy about the superb face reconstruction capabilities of even simple AEs.) Part of the explanation, of course, is that 20% un-obscured or un-blurred pixels out of 30,000 pixels still means 6,000 clear pixels. Obviously enough for the AE to choose the right pattern superposition to compose a plausible clear image.

Note that we are not talking about overfitting here – the Autoencoder handled *test images*, i.e. images which it had never seen before, very well. AEs based on CNNs just seem to extract and use patterns characteristic for faces extremely effectively.

But how is the target space of the Encoder, i.e. the latent space, filled for Celeb A data? Do *all* points in the latent space give us images with well recognizable faces in the end?

To answer the last question I trained an AE with 100,000 images of Celeb A for the reconstruction task named above. The dimension of the latent space was chosen to be z_dim = 200 for the results presented below. (Actually, I used a VAE with a tiny KL loss contribution – smaller by a factor of 1.e-6 than the standard Binary Cross-Entropy loss for reconstruction – to get at least a minimum confinement of the z-points in the latent space. But the results are basically similar to those of a pure AE.)

My somewhat reworked and centered Celeb A images had a dimension of 96×96 pixels. So the original feature space had a number of dimensions of 27,648 (almost 28,000). The challenge was to reproduce the original images from latent data points created from test images presented to the Encoder. To be more precise:

After a certain number of training epochs we feed the Encoder (with fixed weights) with test images the AE has never seen before. Then we get the components of the vectors from the origin to the resulting points in the latent space (**z-points**). After feeding these data into the Decoder we expect the reproduction of images close to the test input images.

With a balanced training controlled by an Adam optimizer I already got a good resemblance after 10 epochs. The reproduction got better and very acceptable also with respect to tiny details after 25 epochs for my AE. Due to possible copyright and personal rights violations I do not dare to present the results for general Celeb A images in a public blog. But you can write me a mail if you are interested.

Most of the data points in the latent space were created in a region of 0 < **x_i** < 20 with **x_i** meaning one of the vector components of a z-point in the latent space. I will provide more data on the z-point distribution produced by the Encoder in later posts in this blog.

Then I selected arbitrary data points in the latent space with randomly chosen and uniformly distributed components 0 < x_i < *boundary*. The values for *boundary* were systematically enlarged.

Note that most of the resulting points will have a tendency to be located in outer regions of the multidimensional cube with an extension in each direction given by *boundary*. This is due to the big chance that one of the components will get a relatively high value.
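This tendency is easy to verify numerically. The following sketch (my own, not part of the original experiment) draws uniformly distributed z-points for z_dim = 200 and checks how often the largest component lands close to *boundary* (the analytic probability is 1 − 0.95^200 ≈ 0.99997):

```python
import numpy as np

rng = np.random.default_rng(1)
z_dim, boundary, n_samples = 200, 10.0, 10000

# uniformly distributed components 0 < x_i < boundary
z = rng.random((n_samples, z_dim)) * boundary

# for almost every sample the largest component lies very close to the boundary
max_components = z.max(axis=1)
print(np.mean(max_components > 0.95 * boundary))   # close to 1.0
```

So a uniformly drawn point almost surely touches the hull region of the hypercube in at least one dimension.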

Then I fed these arbitrary z-points into the Decoder. Below you see the results after 10 training epochs of the AE; I selected only 10 of 100 data points created for each value of *boundary* (the images all look more or less the same regarding the absence or blurring of clear face contours):

This is more a collection of face hallucinations than of usable face images. (Interesting for artists, maybe? Seriously meant …).

So, most of the points in the latent space of an Autoencoder do NOT represent reasonable faces. Sometimes our random selection came close to a region in latent space where the results do resemble a face. See e.g. the central image for boundary = 10.

From the images above it becomes clear that some arbitrary path inside the latent space will contain more points which do NOT give you a reasonable face reproduction than points that result in plausible face images – despite a successful training of the Autoencoder.

This result supports the impression that the latent space of well trained Autoencoders is almost unusable for *creative* purposes. It also raises the interesting question of what the distribution of “**meaningful points**” in the latent space really looks like. I do not know whether this has been investigated in depth at all. Some links to publications which prove a certain scientific interest in this question are given in the last section of this post.

I also want to comment on an article published in the Quanta Magazine lately. See “Self-Taught AI Shows Similarities to How the Brain Works”. This article refers to “masked” Autoencoders and self-supervised learning. Reconstructing masked images, i.e. images with a superposition of a mask hiding/blurring pixels with a reasonably equipped Autoencoder indeed works very well. Regarding this point I totally agree. Also with the term “self-supervised learning”.

But to suggest that an Autoencoder with this (rather basic) capability reflects methods of the human brain is in my opinion a massive exaggeration. On the contrary, in my opinion an AE reflects a dumbness regarding the storage and usage of otherwise well extracted feature patterns. This is due to its construction and the nature of its mapping of image contents to the latent space. A child can, after some teaching, draw characteristic features of human faces – out of nothing on a plain white piece of paper. The Decoder part of a standard Autoencoder (in some contrast to a GAN) cannot – at least not without help to pick a *meaningful* point in the latent space. And this difference is a major one, in my opinion.

I think the reason why arbitrary points in the multi-dimensional latent space cannot be mapped to images with recognizable faces is yet another effect of the so called “curse of high dimensionality”. But this time also related to the latent space.

A normal Autoencoder (i.e. one without the Kullback-Leibler loss) uses the latent space in its vast extension to produce points where typical properties (features) of faces and background are encoded in a most unique way for each of the input pictures. But the distinct volume filled by such points is a pretty small one – compared to the extensions of the high dimensional latent space. The volume of data points resulting from a mapping-transformation of arbitrary points in the original feature space to points of the latent space is of course much bigger than the volume of points which correspond to images showing typical human faces.

This is due to the fact that there are many more images with arbitrary pixel values already in the original feature space of the input images (with, let's say, 30,000 dimensions for 100×100 color pixels) than images with reasonable values for faces in front of some background. The set of points in the feature space which correspond to reasonable images of faces (right colors and dominant pixel values for face features) is certainly small compared to the extension of the original feature space. Therefore: if you pick a random point in the latent space – even within a confined (but multidimensional) volume around the origin – the chance that this point lies outside the particular volume of points which make sense regarding face reproduction is big. I guess that for z_dim > 200 the probability is pretty close to 1.
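A classic geometric example illustrates how quickly "meaningful" sub-volumes shrink in high dimensions: the volume fraction of a d-dimensional ball inside its bounding hypercube collapses towards zero as d grows. A small sketch using the standard closed-form volume of a d-ball (my own illustration, not from the experiments above):

```python
import math

def sphere_to_cube_volume_ratio(d, r=1.0):
    """Volume of a d-ball of radius r relative to its bounding cube of edge 2r."""
    v_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d
    v_cube = (2.0 * r) ** d
    return v_ball / v_cube

for d in (2, 10, 50, 200):
    print(d, sphere_to_cube_volume_ratio(d))
```

For d = 2 the ratio is π/4 ≈ 0.785; for d = 200 it is astronomically small. Any fixed region of "meaningful" points suffers the same fate relative to the full space.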

In addition: as the mapping algorithm of a neural Encoder network such as a CNN is highly non-linear, we cannot say what the boundary hypersurfaces of the mapping regions for faces look like. They are probably complicated – but due to the enormous number of original images with arbitrary pixel values we can safely guess that they enclose a very tiny volume.

The manifold of data points in the z-space giving us recognizable faces in front of a reasonably separated background may follow a curved and wiggly “path” through the latent space. In principle there could even be isolated, unconnected regions separated by areas of “chaotic reconstructions”.

I think this line of argumentation holds for standard Autoencoders and for Variational Autoencoders with a very small KL loss in comparison to the reconstruction loss (BCE (binary cross-entropy) or MSE).

The first point is: VAEs reduce the total occupied volume of the latent space. Due to the mu-related term in the Kullback-Leibler loss the whole distribution of z-points gets condensed into a limited volume around the origin of the latent space.

The second reason is that the distribution of meaningful points gets smeared out by the logvar-related term of the Kullback-Leibler loss.
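Both mechanisms can be read off from the standard form of the Kullback-Leibler loss for a diagonal Gaussian (with μ_j and log σ_j² being the Encoder's "mu" and "log_var" outputs per latent dimension j) – a standard formula, written here in LaTeX notation:

```latex
\mathcal{L}_{KL} \;=\; -\frac{1}{2} \sum_{j=1}^{z\_dim}
    \left( 1 \,+\, \log\sigma_j^2 \,-\, \mu_j^2 \,-\, \sigma_j^2 \right)
```

The μ_j²-term penalizes z-points far away from the origin (condensation towards the origin), while the combination of the log σ_j²- and σ_j²-terms is minimized for σ_j = 1 (a smearing of each z-point into a Gaussian of width 1).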

Both effects enforce overlapping regions of meaningful standard Gaussian-like z-point distributions in the latent space. So VAEs significantly increase the probability to hit a meaningful z-point in the latent space – if you choose points around the origin within a distance of “1”. The distance has to be measured with some norm, e.g. the Euclidean one. Actually, we should get meaningful reconstructions even beyond a multidimensional sphere of radius “1”. Look at the series on the technical realization of VAEs in this blog. The last posts there prove the effects of the KL loss experimentally for Celeb A data. Below you find a selection of images created from randomly chosen points in the latent space of a Variational Autoencoder with z_dim = 200 after 10 epochs.

Enough for today. Whilst standard Autoencoders solve certain tasks very well, they seem to produce very specific data distributions in the latent space for the reconstruction of “meaningful” images with human faces. The origin of this problem already lies in the original feature space of the images: there, too, only a small minority of points represents humanly interpretable face images. This is even true if you use a dimensionality reduction method like PCA ahead.

From a first experiment the chance of hitting a data point in latent space which gives you a meaningful image seems to be small. This result appears to be a variant of the curse of high dimensionality – this time including the latent space.

In a forthcoming post

Autoencoders, latent space and the curse of high dimensionality – II – a view on fragments and filaments of the latent space for CelebA images

we will investigate the z-point distribution in latent space with a variety of tools. And find that this distribution is fragmented and that the z-points for CelebA images indeed are arranged in filament-like structures.

https://towardsdatascience.com/exploring-the-latent-space-of-your-convnet-classifier-b6eb862e9e55

Felix Leeb, Stefan Bauer, Michel Besserve, Bernhard Schölkopf, “Exploring the Latent Space of Autoencoders with Interventional Assays”, 2022, https://arxiv.org/abs/2106.16091v2 // https://arxiv.org/pdf/2106.16091.pdf

https://wiredspace.wits.ac.za/handle/10539/33094?show=full

https://www.elucidate.ai/post/exploring-deep-latent-spaces

**Books:**

T. Rashid, “GANs mit PyTorch selbst programmieren”, 2020, O’Reilly, dpunkt.verlag, Heidelberg, ISBN 978-3-96009-147-9

D. Foster, “Generatives Deep Learning”, 2019, O’Reilly, dpunkt.verlag, Heidelberg, ISBN 978-3-96009-128-8

Variational Autoencoder with Tensorflow 2.8 – I – some basics

Variational Autoencoder with Tensorflow 2.8 – II – an Autoencoder with binary-crossentropy loss

Variational Autoencoder with Tensorflow 2.8 – III – problems with the KL loss and eager execution

Variational Autoencoder with Tensorflow 2.8 – IV – simple rules to avoid problems with eager execution

Variational Autoencoder with Tensorflow 2.8 – V – a customized Encoder layer for the KL loss

Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output

Variational Autoencoder with Tensorflow 2.8 – VII – KL loss via model.add_loss()

Our objective is to avoid or circumvent potential problems with the **eager execution mode** of present Tensorflow 2 versions. I have already described three solutions based on standard Keras functionality:

- Either we add loss contributions via the function **layer.add_loss()** and a special layer of the Encoder part of the VAE,
- or we add a loss to the output of a full VAE-model via the function **model.add_loss()**,
- or we build a complex model which transports required KL-related tensors from the Encoder part of the VAE model to the Decoder’s output layer.

In all these cases we invoke *native* Keras functions to handle loss contributions and related operations. Keras controls the calculation of the gradient components of the KL related tensors “mu” and “log_var” in the background for us. This comprises partial derivatives with respect to trainable weight variables of *lower* Encoder layers and related operations. The same holds for partial derivatives of reconstruction tensors at the Decoder’s output layer with respect to trainable parameters of *all* layers of the VAE-model. Keras does most of the job

- of derivative calculation and the registration of related operation sequences during forward pass
- and the correct application of the registered operations and values in later weight corrections during backward propagation

for us *in the background* as long as we respect certain rules for eager mode execution.

But Tensorflow 2 [TF2] gives us a much more flexible and low-level option to control the calculation of gradients under the conditions of eager execution. This option requires that we inform the TF/Keras machinery, which processes the **training steps** of an epoch, about how exactly to calculate losses and their partial derivatives. Rules to determine and create metrics output must be provided in addition.

TF2 provides a context for *registering* operations for loss and derivative evaluations. This context is provided by an object called **GradientTape()**, used as a context manager. In addition we have to write an encapsulating function “**train_step()**” to control gradient calculations and output during training.
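To illustrate the mechanics independently of our VAE class: a minimal Model subclass with its own train_step() might look as follows. This is only a sketch under TF 2.x; the toy model and the plain MSE loss are my choices for illustration, not the VAE discussed below.

```python
import tensorflow as tf

class SimpleCustomModel(tf.keras.Model):
    """Toy Model subclass that takes over loss and gradient handling itself."""
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)               # forward pass recorded on the tape
            loss = tf.reduce_mean(tf.square(y - y_pred))  # toy MSE loss
        # partial derivatives of the loss w.r.t. all trainable weights
        grads = tape.gradient(loss, self.trainable_weights)
        # weight corrections applied by the compiled optimizer
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss}                             # shown as metrics during fit()

# usage sketch: a tiny functional model wrapped by the subclass
inputs  = tf.keras.Input(shape=(4,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = SimpleCustomModel(inputs, outputs)
model.compile(optimizer="adam")   # no "loss" argument - train_step() computes it
```

model.fit() then calls this train_step() for every batch and reports the returned dictionary as metrics output.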

In this post I will describe how we integrate such an approach with our **class “MyVariationalAutoencoder()”** for the setup of a VAE model based on convolutional layers. I have discussed the elements and methods of this class *MyVariationalAutoencoder()* in detail during the last posts.

Regarding the core of the technical solution for **train_step()** and **GradientTape()** I follow more or less the recommendations of one of the masters of Keras: **F. Chollet**. His original code for a TF2-compatible implementation of a VAE can be found here:

https://keras.io/examples/generative/vae/

However, in my opinion Chollet’s code contains a small problem, which I have allowed myself to correct.

The general recipe presented here can, of course, be extended to more complex tasks beyond the optimization of KL and reconstruction losses of VAEs. Therefore, a brief study of the methods to establish detailed loss control is really worth it for ML and VAE beginners. But TF2 and Keras experts will not learn anything new from this post.

I provide the pure code of the classes in this post. In the next post you will find Jupyter cell code for an application to the Celeb A dataset. To prove that the classes below do their job in the end I show you some faces which have been created from arbitrarily chosen points in the latent space after training.

These faces do not exist in reality. They are constructed by the VAE based on compressed and “normalized” data for face patterns and face attribute distributions in the latent space. Note that I used a latent space with a dimension of z_dim =200.
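Generating such faces amounts to feeding arbitrary latent vectors into the trained Decoder. Below is a hedged sketch of that step; the Dense-based decoder is a stand-in toy model (not the trained CelebA Decoder), and z_dim is reduced for brevity:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Reshape
from tensorflow.keras.models import Model

z_dim = 8  # the post uses z_dim = 200 for CelebA

# Toy stand-in decoder: latent vector -> small "image" tensor
inp = Input(shape=(z_dim,))
x   = Dense(28 * 28, activation='sigmoid')(inp)
out = Reshape((28, 28, 1))(x)
decoder = Model(inp, out)

# Sample arbitrary z-points from a standard normal distribution ...
z_points = np.random.normal(size=(5, z_dim)).astype(np.float32)
# ... and decode them into images
imgs = decoder.predict(z_points, verbose=0)
print(imgs.shape)  # (5, 28, 28, 1)
```

With the real trained Decoder of our VAE the call is identical; only the model and z_dim differ.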

We already have many of the required methods ready. In the last posts we used the flexible *functional interface of Keras* to set up Neural Network models for both Encoder and Decoder, each with sequences of (convolutional) layers. For our present purposes we will not change the elementary layer structure of the Encoder or Decoder. In particular the layers for the “mu” and “log_var” contributions to the KL loss and a subsequent sampling-layer of the Encoder will remain unchanged.

In the course of the last two posts I have already introduced a parameter “*solution_type*” to control specifics of our VAE model. We shall use it now to invoke a child class of Keras’ Model() which allows for detailed steps of loss and gradient evaluations.

The standard Keras method **Model.fit()** normally provides a convenient interface for Keras users. We do not have to think about calling the low-level functions at all if we do not want to or do not need to control gradient calculations in detail. In our present approach, however, we use the low level functionality of **GradientTape()** directly. This requires to overwrite a specific method of the standard Keras class Model() – namely the function **“train_step()”**.
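The principle of overriding train_step() can be shown with a deliberately tiny model. This is an illustrative sketch of mine (names, model and loss are made up, not the VAE code discussed below); Model.fit() calls the overridden method once per batch:

```python
import tensorflow as tf
from tensorflow import keras

class TinyModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = keras.layers.Dense(1)
        self.loss_tracker = keras.metrics.Mean(name="my_loss")

    def call(self, x):
        return self.dense(x)

    # Model.fit() invokes this method for every batch of an epoch
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = tf.reduce_mean(tf.square(y - y_pred))  # plain MSE
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.loss_tracker.update_state(loss)
        # The returned dict defines what fit() prints and logs
        return {"my_loss": self.loss_tracker.result()}

model = TinyModel()
model.compile(optimizer="adam")   # no loss argument - train_step() handles it
x = tf.random.normal((32, 4)); y = tf.random.normal((32, 1))
history = model.fit(x, y, epochs=1, verbose=0)
print(history.history.keys())     # contains 'my_loss'
```

Our class VAE() below follows exactly this pattern, just with the VAE-specific loss terms.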

If you have never worked with a self-defined **train_step()** and **GradientTape()** before then I recommend reading the following introductions first:

https://www.tensorflow.org/guide/autodiff

customizing what happens in fit() and the relation to train_step()

These articles contain valuable information about how to operate at low level with **train_step()** regarding losses, derivatives and metrics. This information will help you to better understand the methods of the new class VAE() which I am going to derive from Keras’ class Model() below.

Let us first briefly repeat some imports required.

**Imports**

# Imports
# ~~~~~~~~
import sys
import numpy as np
import os
import pickle
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import Layer, Input, Conv2D, Flatten, Dense, Conv2DTranspose, Reshape, Lambda, \
                                    Activation, BatchNormalization, ReLU, LeakyReLU, ELU, Dropout, AlphaDropout
from tensorflow.keras.models import Model
# to be consistent with my standard loading of the Keras backend in Jupyter notebooks:
from tensorflow.keras import backend as B
from tensorflow.keras import metrics
#from tensorflow.keras.backend import binary_crossentropy

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model
#from tensorflow.python.debug.lib.check_numerics_callback import _maybe_lookup_original_input_tensor

# Personal: The following works only if the path in the notebook is supplemented by the path to /projects/GIT/mlx
# The user has to organize his paths for modules to be referred to from Jupyter notebooks himself and
# replace these settings
from mynotebooks.my_AE_code.utils.callbacks import CustomCallback, VAE_CustomCallback, step_decay_schedule
from keras.callbacks import ProgbarLogger

Now we define a class VAE() which inherits basic functionality from the Keras class Model() and overwrite the method train_step(). We shall later create an instance of this new class within an object of class MyVariationalAutoencoder().

**New Class VAE**

from tensorflow.keras import metrics
...
...

# A child class of Model() to control train_step with GradientTape()
class VAE(keras.Model):

    # We use our self-defined __init__() to provide a reference MyVAE
    # to an object of type "MyVariationalAutoencoder"
    # This in turn allows us to address the Encoder and the Decoder
    def __init__(self, MyVAE, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.MyVAE   = MyVAE
        self.encoder = self.MyVAE.encoder
        self.decoder = self.MyVAE.decoder

        # A factor to control the ratio between the KL loss and the reconstruction loss
        self.fact = MyVAE.fact

        # A counter
        self.count = 0

        # A factor to scale the absolute values of the losses
        # e.g. by the number of pixels of an image
        self.scale_fact = 1.0  # no scaling
        # self.scale_fact = tf.constant(self.MyVAE.input_dim[0] * self.MyVAE.input_dim[1], dtype=tf.float32)
        self.f_scale    = 1. / self.scale_fact

        # loss type : 0: BCE, 1: MSE
        self.loss_type = self.MyVAE.loss_type

        # track loss development via metrics
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reco_loss_tracker  = keras.metrics.Mean(name="reco_loss")
        self.kl_loss_tracker    = keras.metrics.Mean(name="kl_loss")

    def call(self, inputs):
        x, z_m, z_var = self.encoder(inputs)
        return self.decoder(x)

    # Overwrite the metrics() of Model() - use getter mechanism
    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reco_loss_tracker,
            self.kl_loss_tracker
        ]

    # Core function to control all operations regarding eager differentiation operations,
    # i.e. the calculation of loss terms with respect to tensors and differentiation variables
    # and metrics data
    def train_step(self, data):
        # We use the GradientTape context to record differentiation operations/results
        #self.count += 1

        with tf.GradientTape() as tape:
            z, z_mean, z_log_var = self.encoder(data)
            reconstruction = self.decoder(z)
            #reco_shape = tf.shape(self.reconstruction)
            #print("reco_shape = ", reco_shape, self.reconstruction.shape, data.shape)

            # BCE loss (Binary Cross Entropy)
            if self.loss_type == 0:
                reconstruction_loss = tf.reduce_mean(
                    tf.reduce_sum(
                        keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                    )
                ) * self.f_scale

            # MSE loss (Mean Squared Error)
            if self.loss_type == 1:
                reconstruction_loss = tf.reduce_mean(
                    tf.reduce_sum(
                        keras.losses.mse(data, reconstruction), axis=(1, 2)
                    )
                ) * self.f_scale

            kl_loss = -0.5 * self.fact * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss

        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

        #if self.count == 1:

        self.total_loss_tracker.update_state(total_loss)
        self.reco_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "total_loss": self.total_loss_tracker.result(),
            "reco_loss": self.reco_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

    def compile_VAE(self, learning_rate):
        # Optimizer
        # ~~~~~~~~~
        optimizer = Adam(learning_rate=learning_rate)
        # save the learning rate for possible intermediate output to files
        self.learning_rate = learning_rate
        self.compile(optimizer=optimizer)

First, we need to import the additional module **tensorflow.keras.metrics**. Its functions, such as **Mean()**, will help us to print out intermediate data about various loss contributions during training – averaged over the batches of an epoch.

Then, we have added four central methods to class VAE:

- a function **__init__()**,
- a function **metrics()**, together with Python’s **getter**-mechanism,
- a function **call()**,
- and our central function **train_step()**.

All these functions override the defaults of the parent class Model(). Be careful to distinguish the range of batches which keras.metrics() and train_step() operate on:

- A “training step” covers just one batch, eventually provided to the training mechanism by the Model.fit()-function.
- Averaging performed by functions of keras.metrics, instead, works across *all* batches of an epoch.
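The difference can be illustrated with keras.metrics.Mean() alone. A toy illustration of mine with made-up per-batch loss values:

```python
import tensorflow as tf
from tensorflow import keras

tracker = keras.metrics.Mean(name="total_loss")

# Pretend these are per-batch loss values delivered by train_step()
for batch_loss in [4.0, 2.0, 6.0]:
    tracker.update_state(batch_loss)

# result() is the running average over ALL batches seen so far in the epoch;
# Keras resets this state automatically at the start of the next epoch
print(float(tracker.result()))   # (4 + 2 + 6) / 3 = 4.0
```

The tf.reduce_mean() calls inside train_step(), by contrast, average only over the samples of the one batch currently processed.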

In general we can use the standard interface of __init__(inputs, outputs, …) **or a call()-interface** to instantiate an object of class-type Model(). See

https://www.tensorflow.org/api_docs/python/tf/keras/Model

https://docs.w3cub.com/tensorflow~python/tf/keras/model.html

We have to be precise about the parameters of __init__() or the call()-interface if we intend to use properties of the standard *compile()*- and *fit()*-interfaces of a model – at least in application cases where we do not control everything regarding losses and gradients ourselves.

To define a complete model for the general case we therefore add the *call()*-method. At the same time we “misuse” the __init__() function of VAE() to provide a reference to our instance of class “MyVariationalAutoencoder”. Actually, providing “call()” is done only for the sake of flexibility in other use cases than the one discussed here. For our present purposes we could actually omit call().

The __init__()-function retrieves some parameters from MyVAE. You see the factor *“fact”* which controls the ratio of the KL-loss to the reconstruction loss. In addition I provided an option to scale the loss values by a division by the number of pixels of input images. You just have to un-comment the respective statement. Sorry, I have not yet made it controllable by a parameter of MyVariationalAutoencoder().

Finally, the parameter loss_type is evaluated; for a value of “1” we take MSE as a loss function instead of the standard BCE (Binary Cross-Entropy); see the Jupyter cells in the next post. This allows for some more flexibility in dealing with certain datasets.
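The arithmetic behind the two reconstruction loss variants in train_step() can be checked with plain NumPy. A sketch of mine with toy 2x2 “images”, a batch of 2 and no channel axis; the scaling factor f_scale is omitted:

```python
import numpy as np

data = np.array([[[0.9, 0.1], [0.8, 0.2]],
                 [[0.5, 0.5], [0.4, 0.6]]])            # shape (batch, H, W)
reco = np.array([[[0.8, 0.2], [0.7, 0.3]],
                 [[0.5, 0.5], [0.5, 0.5]]])

# MSE variant: square the per-pixel error, sum over the spatial
# axes (1, 2) to get one loss value per image, then average over the batch
per_pixel_se = (data - reco) ** 2
mse_loss = per_pixel_se.sum(axis=(1, 2)).mean()

# BCE variant: same reduction pattern, different per-pixel term
eps = 1e-7  # numerical guard against log(0)
per_pixel_bce = -(data * np.log(reco + eps) + (1 - data) * np.log(1 - reco + eps))
bce_loss = per_pixel_bce.sum(axis=(1, 2)).mean()

print(mse_loss, bce_loss)
```

This mirrors the nested tf.reduce_sum() / tf.reduce_mean() structure of the loss statements inside the GradientTape() section.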

With the function **metrics()** we are able to establish our own “tracking” of the evolution of the Model’s loss contributions during training. In our case we are particularly interested in the evolution of the “**reconstruction loss**” and the “**KL-loss**“.

Note that the **@property** decorator is added to the **metrics()**-function. This allows us to define its output via the **getter**-mechanism for Python classes. In our case the __init__()-function defines the mechanism to fill required variables:

The three “tracker”-variables there get their values from the function tensorflow.keras.metrics.Mean(). Note that the *names* given to the loss-trackers in __init__() are of importance for later output handling!

Note also that **keras.metrics.Mean()** calculates averages over values derived for *all* batches of an epoch. The **tf.reduce_mean()**-statements in the GradientTape() section of the code above, instead, refer to averages calculated over the samples of a *single* batch.

Updated loss output is later delivered during each training step by the method **update_state()**. You find a description of the methods of keras.metrics.Mean() in the Keras documentation.

The result of all this is that metrics() delivers loss values via the updated tracker-variables of our child class *VAE()*. Note that neither __init__() nor metrics() define what exactly is to be done to calculate each loss term; they only prepare the later output technically by formal class constructs. Note also that all the data defined by metrics() are updated and averaged per epoch *without* the requirement to call the function “reset_states()” (see the Keras docs). This is done automatically at the beginning of each epoch.

Let us turn to the necessary calculations which must be performed during each training step. In an eager environment we must watch the trainable variables, on which the different loss terms depend, to be able to calculate the partial derivatives and record related operations and intermediate results **already during forward pass**:

We must track the differentiation operations and resulting values to know exactly what has to be done in reverse during error backward propagation. To be able to do this TF2 offers us a recording mechanism called **GradientTape()**. Its results are kept in an object which often is called a “tape”.

You find more information about these topics at

https://debuggercafe.com/basics-of-tensorflow-gradienttape/

https://runebook.dev/de/docs/tensorflow/gradienttape

Within *train_step()* we need some tensors which are required for loss calculations in an explicit form. So, we must change the Keras model for the Encoder to give us the tensors for “mu” and “log_var” as additional outputs.

This is no problem for us. We have already made the output of the Encoder dependent on a variable “solution_type” and discussed a multi-output Encoder model already in the post Variational Autoencoder with Tensorflow 2.8 – VI – KL loss via tensor transfer and multiple output.

Therefore, we just have to add a new value 3 to the checks of “solution_type”. The same is true for the input control of the Decoder (see a section about the related methods of MyVariationalAutoencoder() below).
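A functional Keras model with multiple outputs, as our Encoder for solution_type 1 or 3, is a standard pattern. A minimal sketch of mine with made-up layer sizes (a Dense layer stands in for the sampling Lambda layer of the real Encoder):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Model

z_dim = 4
inp = Input(shape=(8, 8, 1), name='encoder_input')
x   = Flatten()(inp)

mu      = Dense(z_dim, name='mu')(x)
log_var = Dense(z_dim, name='log_var')(x)
z       = Dense(z_dim, name='encoder_output')(x)  # stand-in for the sampling layer

# Multiple outputs: calling the model returns a list of three tensors,
# which train_step() can unpack as  z, z_mean, z_log_var = self.encoder(data)
encoder = Model(inputs=inp, outputs=[z, mu, log_var], name="encoder")

out = encoder(tf.random.normal((2, 8, 8, 1)))
print(len(out))  # 3 tensors, each of shape (batch, z_dim)
```

The real Encoder built by _build_enc() below differs only in its convolutional stack and the statistical sampling layer.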

The statements within the section for **GradientTape()** deal with the calculation of loss terms and record the related operations. All the calculations should be familiar from previous posts of this series.

This includes an identification of the trainable_weights of the involved layers. Quote from

https://keras.io/guides/writing_a_training_loop_from_scratch/#using-the-gradienttape-a-first-endtoend-example:

Calling a model inside a GradientTape scope enables you to retrieve the gradients of the trainable weights of the layer with respect to a loss value. Using an optimizer instance, you can use these gradients to update these variables (which you can retrieve using model.trainable_weights).

In **train_step()** we need to register that the total loss depends on all trainable weights and that all related partial derivatives have to be taken into account during optimization. This is done by

grads = tape.gradient(total_loss, self.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

To get updated output during training we update the state of all tracked variables:

self.total_loss_tracker.update_state(total_loss)
self.reco_loss_tracker.update_state(reconstruction_loss)
self.kl_loss_tracker.update_state(kl_loss)

The careful reader may have noticed that my code of the function “train_step()” deviates from F. Chollet’s recommendations. Regarding the return statement I use

return {
    "total_loss": self.total_loss_tracker.result(),
    "reco_loss": self.reco_loss_tracker.result(),
    "kl_loss": self.kl_loss_tracker.result(),
}

whilst F. Chollet’s original code contains a statement like

return {
    "loss": self.total_loss_tracker.result(),  # here lies the main difference - a different "name" than defined in __init__!
    "reconstruction_loss": self.reconstruction_loss_tracker.result(),  # ignore my abbreviation to reco_loss
    "kl_loss": self.kl_loss_tracker.result(),
}

Chollet’s original code unfortunately gives *inconsistent* loss data: The sum of his “reconstruction loss” and the “KL (Kullback-Leibler) loss” does *not* add up to the (total) “loss”. This can be seen from the data of the first epochs of F. Chollet’s example in the tutorial at

keras.io/examples/generative/vae.

Some of my own result data for the MNIST example with this error look like:

Epoch 1/5
469/469 [==============================] - 7s 13ms/step - reconstruction_loss: 209.0115 - kl_loss: 3.4888 - loss: 258.9048
Epoch 2/5
469/469 [==============================] - 7s 14ms/step - reconstruction_loss: 173.7905 - kl_loss: 4.8220 - loss: 185.0963
Epoch 3/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 160.4016 - kl_loss: 5.7511 - loss: 167.3470
Epoch 4/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 155.5937 - kl_loss: 5.9947 - loss: 162.3994
Epoch 5/5
469/469 [==============================] - 6s 13ms/step - reconstruction_loss: 152.8330 - kl_loss: 6.1689 - loss: 159.5607

Things do get better from epoch to epoch – but we want a consistent output from the beginning: The averaged (total) loss should always be printed as equal to the sum of the averaged KL loss and the averaged reconstruction loss.

The deviation is surprising as we *seem* to use the right tracker-results in the code. And the name used in the return statement of the train_step()-function here should only be relevant for the printing …

However, the name “loss” is NOT consistent with the name defined in the statement Mean(name=”total_loss”) in the __init__() function of Chollet, where he defines his tracking mechanisms.

self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

This has consequences: The inconsistency triggers a different output than a consistent use of names. Just try it out on your own …

This is not only true for the deviation between “loss” in

return {
    "loss": self.total_loss_tracker.result(),
    ....
}

and “total_loss” in the __init__()-function

self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

(the fallback then delivers a value which lacks proper averaging),

but also for deviations in the names used for the other loss contributions. *In case of an inconsistency Keras seems to fall back to a default* here which does not reflect the standard linear averaging of Mean() over all values calculated for the batches of an epoch (without any special weights).

That there is some common default mechanism at work can be seen from the fact that wrong names for **all** individual losses (here the KL loss and the reconstruction loss) give us at least a consistent sum-value for the total amount again. But all the values derived by the fallback are much closer to the start values at the beginning of an epoch than to the mean values averaged over an epoch. You may test this yourself.

To get on the safe side we use the correct “names” defined in the __init__()-function of our code:

return {
    "total_loss": self.total_loss_tracker.result(),
    "reco_loss": self.reco_loss_tracker.result(),
    "kl_loss": self.kl_loss_tracker.result(),
}

For MNIST data fed into our VAE model we then get:

Epoch 1/5
469/469 [==============================] - 8s 13ms/step - reco_loss: 214.5662 - kl_loss: 2.6004 - total_loss: 217.1666
Epoch 2/5
469/469 [==============================] - 7s 14ms/step - reco_loss: 186.4745 - kl_loss: 3.2799 - total_loss: 189.7544
Epoch 3/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 181.9590 - kl_loss: 3.4186 - total_loss: 185.3774
Epoch 4/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 177.5216 - kl_loss: 3.9433 - total_loss: 181.4649
Epoch 5/5
469/469 [==============================] - 6s 13ms/step - reco_loss: 163.7209 - kl_loss: 5.5816 - total_loss: 169.3026

This is exactly what we want.

So, the general recipe is:

- Define what metric properties you are interested in. Create respective tracker-variables in the __init__() function.
- Use the getter mechanism to define your metrics() function and its output via references to the trackers.
- Define your own training step by a function train_step().
- Use Tensorflow’s GradientTape context to register statements which control the calculation of loss contributions from elementary tensors of your (functional) Keras model. Provide all layers there, e.g. by references to their models.
- Register gradient-operations of the total loss with respect to all trainable weights and updates of metrics data within function “train_step()”.

Actually, I have used the GradientTape() mechanism already in this blog when I played a bit with approaches to create so called DeepDream images. See

https://linux-blog.anracom.com/category/machine-learning/deep-dream/

for more information – there in a different context.

Where do we stand? We have defined a new class “*VAE()*” which modifies the original Keras Model() class. And we have our class “MyVariationalAutoencoder()” to control the setup of a VAE model.

Next we need to address the question of how we combine these two classes. If you have read my previous posts you may expect a major change to the method “**_build_VAE()**“. This is correct, but we also have to modify the conditions for the Encoder output construction in _build_enc() and the definition of the Decoder’s input in _build_dec(). Therefore I give you the modified code for these functions. For reasons of completeness I add the code for the __init__()-function:

def __init__(self
    , input_dim                  # the shape of the input tensors (for MNIST (28,28,1))
    , encoder_conv_filters       # number of maps of the different Conv2D layers
    , encoder_conv_kernel_size   # kernel sizes of the Conv2D layers
    , encoder_conv_strides       # strides - here also used to reduce spatial resolution; used instead of pooling layers
    , encoder_conv_padding       # padding - valid or same
    , decoder_conv_t_filters     # number of maps in Conv2DTranspose layers
    , decoder_conv_t_kernel_size # kernel sizes of Conv2DTranspose layers
    , decoder_conv_t_strides     # strides for Conv2DTranspose layers - inverts spatial resolution
    , decoder_conv_t_padding     # padding - valid or same
    , z_dim                      # A good start is 16 or 24
    , solution_type  = 0         # Which type of solution for the KL loss calculation?
    , act            = 0         # Which type of activation function?
    , fact           = 0.65e-4   # Factor for the KL loss (0.5e-4 < fact < 1.e-3 is reasonable)
    , loss_type      = 0         # 0: BCE, 1: MSE
    , use_batch_norm = False     # Shall BatchNormalization be used after Conv2D layers?
    , use_dropout    = False     # Shall statistical dropout layers be used for regularization purposes?
    , dropout_rate   = 0.25      # Rate for statistical dropout layer
    , b_build_all    = False     # Added by RMO - full Model is built in 2 steps
    ):
    '''
    Input:
    The encoder_... and decoder_... variables are Python lists, whose length defines
    the number of Conv2D and Conv2DTranspose layers

    input_dim                : Shape/dimensions of the input tensor - for MNIST (28,28,1)
    encoder_conv_filters     : List with the number of maps/filters per Conv2D layer
    encoder_conv_kernel_size : List with the kernel sizes for the Conv layers
    encoder_conv_strides     : List with the strides used for the Conv layers
    z_dim                    : dimension of the "latent_space"
    solution_type            : Type of solution for KL loss calculation
                               (0: Customized Encoder layer,
                                1: transfer of mu, var_log to Decoder,
                                2: model.add_loss(),
                                3: definition of training step with GradientTape())
    act                      : determines activation function to use (0: LeakyRELU, 1: RELU, 2: SELU)
        !!!! NOTE: !!!!
        If SELU is used then the weight kernel initialization and the dropout layer need to be special
        https://github.com/christianversloot/machine-learning-articles/blob/main/using-selu-with-tensorflow-and-keras.md
        AlphaDropout instead of Dropout + LeCunNormal for kernel initializer
    fact = 0.65e-4           : Factor to scale the KL loss relative to the reconstruction loss
                               Must be adapted to the way of calculation -
                               e.g. for solution_type == 3 the loss is not averaged over all pixels
                               => at least a factor of around 1000 bigger than normally
    loss_type = 0            : Defines the way we calculate a reconstruction loss
                               0: Binary Cross Entropy - recommended by many authors
                               1: Mean Square Error - recommended by some authors especially for "face arithmetics"
    use_batch_norm = False   # True: We use BatchNormalization
    use_dropout    = False   # True: We use dropout layers (rate = 0.25, see Encoder)
    b_build_all    = False   # True: Full VAE model is built in 1 step;
                             # False: Encoder, Decoder, VAE are built in separate steps
    '''

    self.name = 'variational_autoencoder'

    # Parameters for Layers which define the Encoder and Decoder
    self.input_dim                  = input_dim
    self.encoder_conv_filters       = encoder_conv_filters
    self.encoder_conv_kernel_size   = encoder_conv_kernel_size
    self.encoder_conv_strides       = encoder_conv_strides
    self.encoder_conv_padding       = encoder_conv_padding

    self.decoder_conv_t_filters     = decoder_conv_t_filters
    self.decoder_conv_t_kernel_size = decoder_conv_t_kernel_size
    self.decoder_conv_t_strides     = decoder_conv_t_strides
    self.decoder_conv_t_padding     = decoder_conv_t_padding

    self.z_dim = z_dim

    # Check param for activation function
    if act < 0 or act > 2:
        print("Range error: Parameter act = " + str(act) + " has unknown value ")
        sys.exit()
    else:
        self.act = act

    # Factor to scale the KL loss relative to the Binary Cross Entropy loss
    self.fact = fact

    # Type of loss - 0: BCE, 1: MSE
    self.loss_type = loss_type

    # Check param for solution approach
    if solution_type < 0 or solution_type > 3:
        print("Range error: Parameter solution_type = " + str(solution_type) + " has unknown value ")
        sys.exit()
    else:
        self.solution_type = solution_type

    self.use_batch_norm = use_batch_norm
    self.use_dropout    = use_dropout
    self.dropout_rate   = dropout_rate

    # Preparation of some variables to be filled later
    self._encoder_input  = None  # receives the Keras object for the Input Layer of the Encoder
    self._encoder_output = None  # receives the Keras object for the Output Layer of the Encoder
    self.shape_before_flattening = None  # info of the Encoder => is used by Decoder
    self._decoder_input  = None  # receives the Keras object for the Input Layer of the Decoder
    self._decoder_output = None  # receives the Keras object for the Output Layer of the Decoder

    # Layers / tensors for KL loss
    self.mu      = None  # receives special Dense Layer's tensor for KL-loss
    self.log_var = None  # receives special Dense Layer's tensor for KL-loss

    # Parameters for SELU - just in case we may need to use it somewhere
    # https://keras.io/api/layers/activations/ see selu
    self.selu_scale = 1.05070098
    self.selu_alpha = 1.67326324

    # The number of Conv2D and Conv2DTranspose layers for the Encoder / Decoder
    self.n_layers_encoder = len(encoder_conv_filters)
    self.n_layers_decoder = len(decoder_conv_t_filters)

    self.num_epoch = 0  # Initialization of the number of epochs

    # A matrix for the values of the losses
    self.std_loss = tf.TensorArray(tf.float32, size=0, dynamic_size=True, clear_after_read=False)

    # We only build the whole AE-model if requested
    self.b_build_all = b_build_all
    if b_build_all:
        self._build_all()

We just need to set the right options for the output tensors of the Encoder and the input tensors of the Decoder. The relevant code parts are controlled by the parameter “solution_type”.

**Modified code of _build_enc() of class MyVariationalAutoencoder**

def _build_enc(self, solution_type = -1, fact=-1.0):
    ''' Your documentation '''
    # Checking whether "fact" and "solution_type" for the KL loss shall be overwritten
    if fact < 0:
        fact = self.fact
    if solution_type < 0:
        solution_type = self.solution_type
    else:
        self.solution_type = solution_type

    # Preparation: We later need a function to calculate the z-points in the latent space
    # The following function will be used by an eventual Lambda layer of the Encoder
    def z_point_sampling(args):
        '''
        A point in the latent space is calculated statistically
        around an optimized mu for each sample
        '''
        mu, log_var = args  # Note: These are 1D tensors !
        epsilon = B.random_normal(shape=B.shape(mu), mean=0., stddev=1.)
        return mu + B.exp(log_var / 2) * epsilon

    # Input "layer"
    self._encoder_input = Input(shape=self.input_dim, name='encoder_input')

    # Initialization of a running variable x for individual layers
    x = self._encoder_input

    # Build the CNN-part with Conv2D layers
    # Note that stride>=2 reduces spatial resolution without the help of pooling layers
    for i in range(self.n_layers_encoder):
        conv_layer = Conv2D(
            filters = self.encoder_conv_filters[i]
            , kernel_size = self.encoder_conv_kernel_size[i]
            , strides = self.encoder_conv_strides[i]
            , padding = 'same'  # Important ! Controls the shape of the layer tensors.
            , name = 'encoder_conv_' + str(i)
            )
        x = conv_layer(x)

        # The "normalization" should be done ahead of the "activation"
        if self.use_batch_norm:
            x = BatchNormalization()(x)

        # Selection of activation function (out of 3)
        if self.act == 0:
            x = LeakyReLU()(x)
        elif self.act == 1:
            x = ReLU()(x)
        elif self.act == 2:
            # RMO: Just use the Activation layer to use SELU with predefined (!) parameters
            x = Activation('selu')(x)

        # Fulfill some SELU requirements
        if self.use_dropout:
            if self.act == 2:
                x = AlphaDropout(rate = 0.25)(x)
            else:
                x = Dropout(rate = 0.25)(x)

    # Last multi-dim tensor shape - is later needed by the decoder
    self._shape_before_flattening = B.int_shape(x)[1:]

    # Flattened layer before calculating VAE-output (z-points) via 2 special layers
    x = Flatten()(x)

    # "Variational" part - create 2 Dense layers for a statistical distribution of z-points
    self.mu      = Dense(self.z_dim, name='mu')(x)
    self.log_var = Dense(self.z_dim, name='log_var')(x)

    if solution_type == 0:
        # Customized layer for the calculation of the KL loss based on mu, var_log data
        # We use a customized layer according to a class definition
        self.mu, self.log_var = My_KL_Layer()([self.mu, self.log_var], fact=fact)

    # Layer to provide a z_point in the Latent Space for each sample of the batch
    self._encoder_output = Lambda(z_point_sampling, name='encoder_output')([self.mu, self.log_var])

    # The Encoder Model
    # ~~~~~~~~~~~~~~~~~~~
    # With extra KL layer or with vae.add_loss()
    if self.solution_type == 0 or self.solution_type == 2:
        self.encoder = Model(self._encoder_input, self._encoder_output, name="encoder")

    # Transfer solution => Multiple outputs
    if self.solution_type == 1 or self.solution_type == 3:
        self.encoder = Model(inputs=self._encoder_input,
                             outputs=[self._encoder_output, self.mu, self.log_var],
                             name="encoder")

The difference to the former version is the dependency of the Encoder’s output on “solution_type == 3”. For the Decoder we have:

**Modified code of _build_dec() of class MyVariationalAutoencoder**

def _build_dec(self):
    ''' Your documentation '''
    # Input layer - aligned to the shape of z-points in the latent space = output[0] of the Encoder
    self._decoder_inp_z = Input(shape=(self.z_dim,), name='decoder_input')

    # Additional Input layers for the KL tensors (mu, log_var) from the Encoder
    if self.solution_type == 1 or self.solution_type == 3:
        self._dec_inp_mu      = Input(shape=(self.z_dim), name='mu_input')
        self._dec_inp_var_log = Input(shape=(self.z_dim), name='logvar_input')

        # We give the layers later used as output a name
        # Each of the Activation layers below just corresponds to an identity passed through
        #self._dec_mu      = self._dec_inp_mu
        #self._dec_var_log = self._dec_inp_var_log
        self._dec_mu      = Activation('linear', name='dc_mu')(self._dec_inp_mu)
        self._dec_var_log = Activation('linear', name='dc_var')(self._dec_inp_var_log)

    # Here we use the tensor shape info from the Encoder
    x = Dense(np.prod(self._shape_before_flattening))(self._decoder_inp_z)
    x = Reshape(self._shape_before_flattening)(x)

    # The inverse CNN
    for i in range(self.n_layers_decoder):
        conv_t_layer = Conv2DTranspose(
            filters = self.decoder_conv_t_filters[i]
            , kernel_size = self.decoder_conv_t_kernel_size[i]
            , strides = self.decoder_conv_t_strides[i]
            , padding = 'same'  # Important ! Controls the shape of tensors during reconstruction
                                # we want an image with the same resolution as the original input
            , name = 'decoder_conv_t_' + str(i)
            )
        x = conv_t_layer(x)

        # Normalization and Activation
        if i < self.n_layers_decoder - 1:
            # Also in the decoder: normalization before activation
            if self.use_batch_norm:
                x = BatchNormalization()(x)

            # Choice of activation function
            if self.act == 0:
                x = LeakyReLU()(x)
            elif self.act == 1:
                x = ReLU()(x)
            elif self.act == 2:
                #x = self.selu_scale * ELU(alpha=self.selu_alpha)(x)
                x = Activation('selu')(x)

            # Adaptions to SELU requirements
            if self.use_dropout:
                if self.act == 2:
                    x = AlphaDropout(rate = 0.25)(x)
                else:
                    x = Dropout(rate = 0.25)(x)

        # Last layer => Sigmoid output
        # => This requires scaled input => Division of pixel values by 255
        else:
            x = Activation('sigmoid', name='dc_reco')(x)

    # Output tensor => a scaled image
    self._decoder_output = x

    # The Decoder model
    # solution_type == 0/2/3: Just the decoded input
    if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3:
        self.decoder = Model(self._decoder_inp_z, self._decoder_output, name="decoder")

    # solution_type == 1: The decoded tensor plus the transferred tensors mu and log_var
    # for the variational distribution
    if self.solution_type == 1:
        self.decoder = Model([self._decoder_inp_z, self._dec_inp_mu, self._dec_inp_var_log],
                             [self._decoder_output, self._dec_mu, self._dec_var_log],
                             name="decoder")

Our VAE model is now set up with the help of the __init__() method of our new class VAE. We just have to hand over the object created by MyVariationalAutoencoder.
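To make the idea more tangible, here is a deliberately tiny, hypothetical sketch of such a Model subclass with a customized train_step() and a GradientTape() context. The names (MiniVAE, fact) and the toy encoder/decoder are illustrative only, not the actual classes of this series:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Lambda

class MiniVAE(Model):
    # Hypothetical sketch: a subclass of Keras' Model which controls the
    # KL loss contribution itself inside train_step()
    def __init__(self, encoder, decoder, fact=6.5e-4, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder       # assumed to return [mu, log_var, z]
        self.decoder = decoder
        self.fact = fact             # scaling factor of the KL loss

    def call(self, x):
        mu, log_var, z = self.encoder(x)
        return self.decoder(z)

    def train_step(self, data):
        x = data                     # fit(x_train) delivers just the input batch here
        with tf.GradientTape() as tape:
            mu, log_var, z = self.encoder(x)
            reco = self.decoder(z)
            reco_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(x, reco))
            kl_loss = -self.fact * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))
            total_loss = reco_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"total_loss": total_loss, "reco_loss": reco_loss, "kl_loss": kl_loss}

# A toy encoder/decoder pair just to make the sketch executable
inp = Input(shape=(8,))
h = Dense(8, activation='relu')(inp)
mu = Dense(2, name='mu')(h)
log_var = Dense(2, name='log_var')(h)
z = Lambda(lambda t: t[0] + tf.exp(0.5 * t[1]) * tf.random.normal(tf.shape(t[0])))([mu, log_var])
encoder = Model(inp, [mu, log_var, z], name='toy_encoder')
z_inp = Input(shape=(2,))
decoder = Model(z_inp, Dense(8, activation='sigmoid')(z_inp), name='toy_decoder')

vae = MiniVAE(encoder, decoder)
vae.compile(optimizer=tf.keras.optimizers.Adam(1e-3))
history = vae.fit(np.random.rand(64, 8).astype('float32'), epochs=1, batch_size=16, verbose=0)
```

As the gradients of the summed loss are computed inside the tape context, the KL contribution is under our full control without any add_loss() call.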

**Modified code of _build_VAE() of class MyVariationalAutoencoder**

```python
def _build_VAE(self):
    ''' Your documentation '''
    # Solution with train_step() and GradientTape(): Control is transferred to class VAE
    if self.solution_type == 3:
        self.model = VAE(self)   # Here parameter "self" provides a reference to an instance of MyVariationalAutoencoder
        self.model.summary()

    # Solutions with layer.add_loss or model.add_loss()
    if self.solution_type == 0 or self.solution_type == 2:
        model_input  = self._encoder_input
        model_output = self.decoder(self._encoder_output)
        self.model = Model(model_input, model_output, name="vae")

    # Solution with transfer of data from the Encoder to the Decoder output layer
    if self.solution_type == 1:
        enc_out = self.encoder(self._encoder_input)
        dc_reco, dc_mu, dc_var = self.decoder(enc_out)
        # We organize the output and later association of cost functions and metrics via a dictionary
        mod_outputs = {'vae_out_main': dc_reco, 'vae_out_mu': dc_mu, 'vae_out_var': dc_var}
        self.model = Model(inputs=self._encoder_input, outputs=mod_outputs, name="vae")
```

Note that we keep the resulting model within the object of class MyVariationalAutoencoder. See the Jupyter cells in my next post.

The modification of the function compile_myVAE() is simple:

```python
def compile_myVAE(self, learning_rate):
    # Optimizer
    # ~~~~~~~~~
    optimizer = Adam(learning_rate=learning_rate)
    # save the learning rate for possible intermediate output to files
    self.learning_rate = learning_rate

    # Parameter "fact" will be used by the cost functions defined below
    # to scale the KL loss relative to the BCE loss
    fact = self.fact

    # Functions for solution_type == 1
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    @tf.function
    def mu_loss(y_true, y_pred):
        loss_mux = fact * tf.reduce_mean(tf.square(y_pred))
        return loss_mux

    @tf.function
    def logvar_loss(y_true, y_pred):
        loss_varx = -fact * tf.reduce_mean(1 + y_pred - tf.exp(y_pred))
        return loss_varx

    # Function for solution_type == 2
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # We follow an approach described at
    # https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer
    # NOTE: We can NOT use @tf.function here
    def get_kl_loss(mu, log_var):
        kl_loss = -fact * tf.reduce_mean(1 + log_var - tf.square(mu) - tf.exp(log_var))
        return kl_loss

    # Required operations for solution_type == 2 => model.add_loss()
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    res_kl = get_kl_loss(mu=self.mu, log_var=self.log_var)
    if self.solution_type == 2:
        self.model.add_loss(res_kl)
        self.model.add_metric(res_kl, name='kl', aggregation='mean')

    # Model compilation
    # ~~~~~~~~~~~~~~~~~~~~
    # Solutions with layer.add_loss or model.add_loss()
    if self.solution_type == 0 or self.solution_type == 2:
        if self.loss_type == 0:
            self.model.compile(optimizer=optimizer, loss="binary_crossentropy",
                               metrics=[tf.keras.metrics.BinaryCrossentropy(name='bce')])
        if self.loss_type == 1:
            self.model.compile(optimizer=optimizer, loss="mse",
                               metrics=[tf.keras.metrics.MeanSquaredError(name='mse')])

    # Solution with transfer of data from the Encoder to the Decoder output layer
    if self.solution_type == 1:
        if self.loss_type == 0:
            self.model.compile(optimizer=optimizer
                , loss={'vae_out_main': 'binary_crossentropy',
                        'vae_out_mu': mu_loss, 'vae_out_var': logvar_loss}
                #, metrics={'vae_out_main': tf.keras.metrics.BinaryCrossentropy(name='bce'),
                #           'vae_out_mu': mu_loss, 'vae_out_var': logvar_loss}
                )
        if self.loss_type == 1:
            self.model.compile(optimizer=optimizer
                , loss={'vae_out_main': 'mse',
                        'vae_out_mu': mu_loss, 'vae_out_var': logvar_loss}
                #, metrics={'vae_out_main': tf.keras.metrics.MeanSquaredError(name='mse'),
                #           'vae_out_mu': mu_loss, 'vae_out_var': logvar_loss}
                )

    # Solution with train_step() and GradientTape(): Control is transferred to class VAE
    if self.solution_type == 3:
        self.model.compile(optimizer=optimizer)
```

Note the adaptations to the new parameter “loss_type” which we have added to the __init__()-function!

It gets a bit more complicated for the function “train_myVAE()”. The reason is that we use the opportunity to include the output of so-called generators, which create batches of limited size on the fly from disk or memory.

Such a generator is very useful if you have to handle datasets which you cannot load into the VRAM of your video card as a whole. A typical case is the Celeb A dataset on older graphics cards like mine.

In such a case we provide a dataflow to the function. The batches in this dataflow are continuously created as needed and handed over to Tensorflow’s data processing on the graphics card. *So, “x_train” as an input variable must not be taken literally in this case*! It is then replaced by the generator’s dataflow. See the code for the Jupyter cells in the next post.
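The generator idea can be illustrated by a minimal, hypothetical sketch (the names batch_generator, items and load_fn are made up for this demonstration; the actual Celeb A setup follows in a later post). Only one batch resides in RAM at any time:

```python
import numpy as np

def batch_generator(items, batch_size, load_fn):
    """Minimal sketch of an endless batch generator.
    'items' could be a list of image file paths; 'load_fn' loads one item.
    For a plain VAE input and target coincide, so we yield (x, x)."""
    n = len(items)
    idx = 0
    while True:
        # wrap around at the end of the item list
        batch_items = [items[(idx + k) % n] for k in range(batch_size)]
        # load lazily, scale pixel values to [0, 1]
        x = np.stack([load_fn(it) for it in batch_items]).astype('float32') / 255.0
        idx = (idx + batch_size) % n
        yield (x, x)   # input batch and identical target batch

# Quick check with a dummy "loader" instead of real image files
gen = batch_generator(items=list(range(10)), batch_size=4,
                      load_fn=lambda i: np.full((2, 2), i, dtype=np.uint8))
x_batch, y_batch = next(gen)
print(x_batch.shape)   # → (4, 2, 2)
```

Such an object can be passed to model.fit() in place of “x_train”; no batch_size argument is required then, as the batches arrive pre-formed.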

In addition we prepare for cases where we have to provide target data which deviate from the input data “x_train”. Typical cases are the application of AEs/VAEs for denoising or recolorization.
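For a denoising case the preparation of such deviating target data could, as a hypothetical sketch, look like the following (the array shapes and the noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Clean images serve as the target, a noisy copy as the training input
x_target = rng.random((16, 28, 28, 1)).astype('float32')             # "clean" images in [0, 1]
noise    = rng.normal(0.0, 0.1, x_target.shape).astype('float32')    # Gaussian pixel noise
x_train  = np.clip(x_target + noise, 0.0, 1.0)                       # noisy inputs, clipped to [0, 1]

# Hypothetical call - note the flag signaling that target and input deviate:
# train_myVAE(x_train, x_target=x_target, b_target_ne_train=True, ...)
```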

```python
# Function to initiate training
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def train_myVAE(self, x_train, x_target=None
                , b_use_generator   = False
                , b_target_ne_train = False
                , batch_size    = 32
                , epochs        = 2
                , initial_epoch = 0
                , t_mu=None, t_logvar=None):
    '''
    @note: Sometimes x_target MUST be provided - e.g. for Denoising, Recolorization
    @note: x_train will come as a dataflow in case of a generator
    '''
    # cax = ProgbarLogger(count_mode='samples', stateful_metrics=None)

    class MyPrinterCallback(tf.keras.callbacks.Callback):
        # def on_train_batch_begin(self, batch, logs=None):
        #     # Do something on begin of training batch

        def on_epoch_end(self, epoch, logs=None):
            # Get overview over available keys
            #keys = list(logs.keys())
            print("\nEPOCH: {}, Total Loss: {:8.6f}, // reco loss: {:8.6f}, mu Loss: {:8.6f}, logvar loss: {:8.6f}".format(
                epoch, logs['loss'], logs['decoder_loss'], logs['decoder_1_loss'], logs['decoder_2_loss']))
            print()
            #print('EPOCH: {}, Total Loss: {}'.format(epoch, logs['loss']))
            #print('EPOCH: {}, metrics: {}'.format(epoch, logs['metrics']))

        def on_epoch_begin(self, epoch, logs=None):
            print('-' * 50)
            print('STARTING EPOCH: {}'.format(epoch))

    if not b_target_ne_train:
        x_target = x_train

    # Data are provided from tensors in the Video RAM
    if not b_use_generator:
        # Solutions with layer.add_loss or model.add_loss()
        # Solution with train_step() and GradientTape(): Control is transferred to class VAE
        if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3:
            self.model.fit( x_train
                          , x_target
                          , batch_size    = batch_size
                          , shuffle       = True
                          , epochs        = epochs
                          , initial_epoch = initial_epoch )

        # Solution with transfer of data from the Encoder to the Decoder output layer
        if self.solution_type == 1:
            self.model.fit( x_train
                          , {'vae_out_main': x_target, 'vae_out_mu': t_mu, 'vae_out_var': t_logvar}
                          # also working:
                          # , [x_train, t_mu, t_logvar]   # we provide some dummy tensors here
                          , batch_size    = batch_size
                          , shuffle       = True
                          , epochs        = epochs
                          , initial_epoch = initial_epoch
                          #, verbose=1
                          , callbacks=[MyPrinterCallback()] )

    # If data are provided as a batched dataflow from a generator - e.g. for Celeb A
    else:
        # Solution with transfer of data from the Encoder to the Decoder output layer
        if self.solution_type == 1:
            print("We have no solution yet for solution_type==1 and generators !")
            sys.exit()

        # Solutions with layer.add_loss or model.add_loss()
        # Solution with train_step() and GradientTape(): Control is transferred to class VAE
        if self.solution_type == 0 or self.solution_type == 2 or self.solution_type == 3:
            self.model.fit( x_train   # coming as a batched dataflow from the outside generator - no batch size required here
                          , shuffle       = True
                          , epochs        = epochs
                          , initial_epoch = initial_epoch )
```

As I have not yet tested a solution for solution_type==1 in combination with generators, I leave the writing of working code to the reader. Sorry, I did not find the time for experiments. Presently, I use generators only in combination with the add_loss() based solutions and the solution based on train_step() and GradientTape().

Note also that if we use generators they must take care of a flow of target data, too. As said: you must not take “x_train” literally in the case of generators. It is rather a continuously created “dataflow” of batches then – *both for the training input and the target data*.
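For deviating targets such a generator could, as a hypothetical sketch, wrap a flow of clean batches and emit pairs of noisy inputs and clean targets (the function name and the noise model are made up for illustration):

```python
import numpy as np

def denoise_batch_generator(clean_batches, noise_sigma=0.1, seed=0):
    """Hypothetical sketch: wrap an iterable of clean image batches and
    yield (noisy_input, clean_target) pairs - the dataflow then carries
    both the training input AND the target data."""
    rng = np.random.default_rng(seed)
    for x_clean in clean_batches:
        x_noisy = np.clip(x_clean + rng.normal(0.0, noise_sigma, x_clean.shape), 0.0, 1.0)
        yield (x_noisy.astype('float32'), x_clean.astype('float32'))

# Quick check with one dummy batch of constant "images"
clean = [np.full((2, 3, 3), 0.5)]
x_noisy, x_clean = next(denoise_batch_generator(clean))
```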

In this post I have outlined how we can use the method **train_step()** and the tape context of Tensorflow’s **GradientTape()** to control loss contributions and their gradients. Though done for the specific case of the KL loss of a VAE, the general approach should have become clear.
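The KL formula we used in get_kl_loss() can be checked numerically with a small NumPy re-implementation (for demonstration only, not part of the class code):

```python
import numpy as np

def kl_term(mu, log_var, fact=1.0):
    # Same formula as in get_kl_loss() above, re-implemented in NumPy
    return -fact * np.mean(1.0 + log_var - np.square(mu) - np.exp(log_var))

# For a perfect standard normal distribution (mu=0, log_var=0) the term vanishes;
# shifting mu away from the origin raises the loss and thus pulls z-points back
v0 = kl_term(np.zeros(4), np.zeros(4))      # vanishes
v1 = kl_term(np.full(4, 2.0), np.zeros(4))  # grows with the squared shift of mu
```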

I have added a new class to create a Keras model from a pre-constructed Encoder and Decoder. For convenience we still create the layer structure with our old class “MyVariationalAutoencoder”. But we then switch control to a new instance of a child class of Keras’ Model. This class uses customized versions of train_step() and GradientTape().

I have added some more flexibility in addition: We can now include a dataflow generator for input data (such as images) which do not fit into the VRAM (Video RAM) of our graphics card, but do fit into the PC’s standard RAM. We can also switch to MSE for the reconstruction loss instead of BCE.

The task of the KL-loss is to compress the data distribution in the latent space and normalize the distribution around certain feature centers there. In the next post

Variational Autoencoder with Tensorflow 2.8 – IX – taming Celeb A by resizing the images and using a generator

we apply this to images of faces. We shall use the “**Celeb A**” dataset for this purpose. We are going to see that the scaling factor of the KL loss in this case has to be chosen rather big in comparison to simple cases like MNIST. We will also see that choosing a high dimension of the latent space does not really help to create a reasonable face from statistically chosen points in the latent space.

**And before I forget it:**

*Ceterum Censeo:* The worst fascist and war criminal living today is the Putler. He must be isolated at all levels, be denazified and imprisoned. Long live a free and democratic Ukraine!
