A simple CNN for the MNIST dataset – VIII – filters and features – Python code to visualize patterns which activate a map strongly

Our series about a simple CNN trained on the MNIST data turns back to some coding.

A simple CNN for the MNIST dataset – VII – outline of steps to visualize image patterns which trigger filter maps
A simple CNN for the MNIST dataset – VI – classification by activation patterns and the role of the CNN’s MLP part
A simple CNN for the MNIST dataset – V – about the difference of activation patterns and features
A simple CNN for the MNIST dataset – IV – Visualizing the activation output of convolutional layers and maps
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST datasets – I – CNN basics

In the last article I discussed an optimization algorithm which should be able to create images of pixel patterns which trigger selected feature maps of a CNN strongly. In this article we shall focus on the required code elements. I again want to emphasize that I apply and modify some basic ideas which I read in a book of F. Chollet and in a contribution of a guy called Mohamed to a discussion at kaggle.com (see my last article for references). A careful reader will notice differences not only with respect to coding; there are major differences regarding the normalization of intermediate data and the construction of input images. To motivate the latter point I first want to point out that OIPs are idealized technical abstractions and that not all maps may react to purely statistical data on short length scales.

Images of OIPs which trigger feature maps are artificial technical abstractions!

In the last articles I made an explicit distinction between two types of patterns which we can analyze in the context of CNNs:

  • FCP: A pattern which emerges within and across activation maps of a chosen (deep) convolutional layer due to filter operations which the CNN applied to a specific input image.
  • OIP: A pattern which is present within the pixel value distribution of an input image and to which a CNN map reacts strongly.

Regarding OIPs there are some points which we should keep in mind:

  • We do something artificial when we create an image of an elementary OIP pattern to which a chosen feature map reacts. Such an OIP is already an abstraction in the sense that it reflects an idealized pattern – i.e. a specific geometrical correlation between pixel values of an input image which passes a very specific filter combinations. We forget about all other figurative elements of the input image which may trigger other maps.
  • There is an additional subtle point regarding our present approach to OIP-visualization:
    Our algorithm – if it works – will lead to OIP images which trigger a
    map’s neurons maximally on average. What does “average” mean with respect to the image area? A map always covers the whole input image area. Now let us assume that a filter combination of lower layers reacts to an elementary pattern limited in size and located somewhere on the input image. But some filters or filter combinations may be capable of detecting such a pattern at multiple locations of an input image.
    One example would be a crossing of two relatively thin lines. Such a crossing could appear at many places in an input image. In fact, a trained CNN has seen several thousand images of handwritten “4”s where the crossing of the horizontal and the vertical line actually appeared at many different locations and may have learned about this when establishing filters. Under such conditions it is obvious that a map gets optimally activated if a relatively small elementary pattern appears multiple times within the image which our algorithm artificially constructs out of initial random data.
    So our algorithm will with a high probability lead to OIP images which consist of a combination or a superposition of elementary patterns at multiple locations. In this sense an OIP image constructed due to the rule of a maximum average activation is another type of idealization. In a real MNIST image the re-occurrence of elementary patterns may not be present at all. Therefore, we should be careful and not directly associate the visualization of a pattern created by our algorithm with an elementary OIP or “feature”.
  • The last argument can in some cases also be reverted: There may be unique large scale patterns which can only be detected by filters of higher (i.e. deeper) convolutional levels which filter coarsely and with respect to relatively large areas of the image. In our approach such unique patterns may only appear in OIP images constructed for maps of the deepest convolutional layer.

Independence of the statistical data of the input image?

In the last article I showed you already some interesting patterns for certain maps which emerged from randomly varying pixel values in an input image. The fluctuations were forced into patterns by the discussed optimization loop. An example of the resulting evolution of the pixel values is shown below: With more and more steps of the optimization loop an OIP-pattern emerges out of the initial “chaos”.

Images were taken at optimization steps 0, 9, 18, 36, 72, 144, 288, 599 of a loop. Convergence is given by a change of the loss values between two different steps divided by the present loss value :
3.6/41 => 3.9/76 => 3.3/143 => 2.3/240 => 0.8/346 => 0.15/398 => 0.03/418

As we work with gradient ascent in OIP detection a lower loss means a lower activation of the map.

If we change the wavelength of the initial input fluctuations we get a somewhat, though not fundamentally different result (actually with a lower loss value of 381):

This gives us confidence in the general usability of the method. However, I would like to point out that during your own experiments you may also experience the contrary:

For some maps and for some initial statistical input data varying at short length scales, only, the optimization process will not converge. It will not even start to do so. Instead you may experience a zero activation of the selected map during all steps of the optimization for a given random input.

You should not be too surprised by this fact. Our CNN was optimized to react to patterns present in written digits. As digits have specific elements (features?) as straight lines, bows, circles, line-crossings, etc., we should expect that not all input will trigger the activation of a selected map which reacts on pixel value variations at relatively large length scales. Therefore, it is helpful to be able to vary the statistical input pattern at different length scales when you start your hunt for a nice visualization of an OIP and/or elementary feature.

All in all we cannot exclude a dependency on the statistical initial input image fluctuations. Our algorithm will find a maximum with respect to the specific input data fluctuations presented to him. Due to this point we should always be aware of the fact that the visualizations produced by our algorithm will probably mark a local maximum in the multidimensional parameter or representation space – not a global one. But also a local maximum may reveal significant sub-structures a specific filter combination is sensitive to.

Libraries

To build a suitable code we need to import some libraries, which you first must install into your virtual Python environment:

  
import numpy as np
import scipy
import time 
import sys 
import math

from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import schedules
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist

from tensorflow.python.client import device_lib

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat 

import os 
from os import path as path

 

Basic code elements to construct OIP patterns from statistical input image data

To develop some code for OIP visualizations we follow the outline of steps discussed in the last article. To encapsulate the required functionality we set up a Python class. The following code fragment gives an overview about some variables which we are going to use. In the comments and explanations I sometimes used the word “reference” to mark a difference between

  • addresses to some intermediate Tensorflow 2 [TF2] objects handled by a “watch()“-method of TF2’s GradientTape-object for eager calculation of gradients
  • and eventual return objects (of a function based on keras’ backend) filled with values which we can directly use during the optimization iterations for corrections.

This is only a reminder of complex internal TF2 background operations based on complex layer models; I do not want to stress any difference in the sense of pointers and objects. Actually, I do not even know the precise programming patterns used behind TF2’s watch()-mechanism; but according to the documentation it basically records all operations involving the “watched” objects. The objective is the ability to produce gradient values of defined functions with respect to any variable changes instantaneously in eager execution later on.

 
# class to produce images of OIPs for a chosen CNN-map
# ****************************
***************************
class My_OIP:
    '''
    Version 0.2, 01.09.2020
    ~~~~~~~~~~~~~~~~~~~~~~~~~
    This class allows for the creation and the display of OIP-patterns 
    to which a selected map of a CNN-model reacts   
    
    Functions:
    ~~~~~~~~~~
    1) _load_cnn_model()             => load cnn-model
    2) _build_oip_model()            => build an oip-model to create OIP-images
    3) _setup_gradient_tape()        => Implements TF2 GradientTape to watch input data
                                        for eager gradient calculation
    4) _oip_strat_0_optimization_loop():
                                     => Method implementing a simple strategy to create OIP-images 
                                        based on superposition of random data on large distance scales
    5) _oip_strat_1_optimization_loop():
       (NOT YET DEVELOPED)           => Method implementing a complex strategy to create OIP-images 
                                        based on partially evolved intermediate image 
                                        getting enriched by small scale fluctuations
    6) _derive_OIP():                => Method used externally to start the creation of 
                                        an OIP for a chosen map 
    7) _build_initial_img_data()     => Method to construct an initial image based on 
                                        a superposition by random date on 4 different length scales 
    
    
    Requirements / Preconditions:
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    In the present version 
    * a CNN-model is assumed which works with standardized (!) input images,
    * a CNN-Modell trained on MNIST data is assumed ,
    * exactly 4 length scales for random data fluctations are used
      to compose initial statistical image data 
      (roughly with a factor of 2 between them) 
      
    
    '''
    def __init__(self, cnn_model_file = 'cnn_best.h5', 
                 layer_name = 'Conv2D_3', map_index = 0, 
                 img_dim = 28, 
                 b_build_oip_model = True  
                ): 
        '''
        Input: 
        ~~~~~~
            cnn_model_file:     Name of a file containing  a full CNN-model
                                can later be overwritten by _load_cnn_model()
            layer_name:         We can define a layer name we are interested in already when starting; s. below
            map_index:          We may define a map we are interested in already when starting; s. below
            img_dim:            The dimension of the assumed quadratic images 
        
        Major internal variables:
        **************************
            _cnn_model:             A reference to the CNN model object
            _layer_name:            The name of convolutional layer 
                                    (can be overwritten by method _build_oip_model() ) 
            _map_index:             index of the map in the layer's output array 
                                    (can later be overwritten by other methods) 
            _r_cnn_inputs:          A reference to the input tensor of the CNN model (here: 1 image - NOT a batch of images)
            _layer_output:          Tensor with all maps of a certain layer
           
            _oip_submodel:          A new model connecting the input of the cnn-model with a certain map
            
            _tape:                  An instance of TF2's GradientTape-object 
                                    Watches input, output, loss of a model 
                                    and calculates gradients in TF2 eager mode 
            _r_oip_outputs:         A reference to the output of the new OIP-model 
            _r_oip_grads:           Reference to gradient tensors for the new OIP-model 
            _r_oip_loss:            Reference to a loss 
defined by operations on the OIP-output  
            _val_oip_loss:          Reference to a loss defined by operations on the OIP-output  
            
            _iterate:               Keras backend function to invoke the new OIP-model 
                                    and calculate both loss and gradient values (in TF2 eager mode) 
                                    This is the function to be used in the optimization loop for OIPs
            
            Parameters controlling the optimization loop:
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            _oip_strategy:          0, 1 - There are two strategies to evolve OIP patterns out of statistical data - only the first strategy is supported in this version  
                                    0: Simple superposition of fluctuations at different length scales
                                    1: Evolution over partially evolved images based on longer scale variations 
                                       enriched with fluctuations on shorter length scales 
                                    Both strategies can be combined with a precursor calculation 
            
            
            _b_oip_precursor:       True/False - Use a precursor analysis of long range variations 
                                    regarding loss => search for optimum variations for a given map
                                    (Some initial input images do not trigger a map at all or 
                                    sub-optimally => We test out multiple initial fluctuation patterns). 
            
            _ay_epochs:             A list of 4 optimization epochs to be used whilst 
                                    evolving the img data via strategy 1 and intermediate images 
            _n_epochs:              Number of optimization epochs to be used with strategy 0 
            
            _n_steps:               Defines at how many intermediate points we show images and report 
                                    on the optimization process 
            
            _epsilon:               Factor to control the amount of correction imposed by 
                                    the gradient values of the OIP-model 
            
            Input image data of the OIP-model and references to it 
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            _initial_precursor_img  The initial image to start a precursor optimization with.
                                    Would normally be an image of only long range fluctuations. 
            _precursor_image:       The evolved image updated during the precursor loop 
            
            _initial_inp_img_data:  A tensor representing the data of the input image 
            _inp_img_data:          A tensor representing the data of the input img during optimization  
            _img_dim:               We assume quadratic images to work with 
                                    with dimension _img_dim along each axis
                                    For the time being we only support MNIST images 
            
            Parameters controlling the composition of random initial image data 
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            _li_dim_steps:          A list of the intermediate dimensions for random data;
                                    these data are smoothly scaled to the image dimensions 
            _ay_facts:              A Numpy array of 4 factors to control the amount of 
                                    contribution of the statistical variations 
                                    on the 4 length scales to the initial image
                                   
        '''    
        
        # Input data and variable initializations
        # ****************************************
        
        # the model 
        self._cnn_model_file = cnn_model_file
        self._
cnn_model      = None 
        
        # the layer 
        self._layer_name = layer_name
        # the map 
        self._map_index  = map_index
        
        # References to objects and the OIP sub-model
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        self._r_cnn_inputs  = None # reference to input of the CNN_model
                                   # also used in the OIP-model  
        
        self._layer_output  = None
        self._oip_submodel  = None
        
        self._tape          = None # TF2 GradientTape variable
        # some "references"
        self._r_oip_outputs = None # output of the OIP-submodel to be watched 
        self._r_oip_grads   = None # gradients determined by GradientTape   
        self._r_oip_loss    = None 
        # respective values
        self._val_oip_grads = None
        self._val_oip_loss  = None
        
        # The Keras function to produce concrete outputs of the new OIP-model  
        self._iterate       = None
        
        # The strategy to produce an OIP pattern out of statistical input images 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~--------~~~~~~
        # 0: Simple superposition of fluctuations at different length scales 
        # 1: Move over 4 interediate images - partially optimized 
        self._oip_strategy = 0
        
        # Parameters controlling the OIP-optimization process 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~--------~~~~~~
        # Use a precursor analysis ? 
        self._b_oip_precursor = False
       
        # number of epochs for optimization strategy 1
        self._ay_epochs    = np.array((20, 40, 80, 400))
        len_epochs         = len(self._ay_epochs)

        # number of epochs for optimization strategy 0
        self._n_epochs     = self._ay_epochs[len_epochs-1]   
        self._n_steps      = 7   # divides the number of n_epochs into n_steps 
                                 # to produce intermediate outputs
        
        # size of corrections by gradients
        self._epsilon       = 0.01 # step-size for gradient correction
        
        
        # Input images and references to it 
        # ~~~~~~~~
        # precursor image
        self._initial_precursor_img = None
        self._precursor_img         = None
        # The evetually used input image - a superposition of initial random fluctuations
        self._initial_inp_img_data  = None  # The initial data constructed 
        self._inp_img_data          = None  # The data used and varied for optimization 
        # image dimension
        self._img_dim               = img_dim   # = 28 => MNIST images for the time being 
        
        
        # Parameters controlling the setup of an initial image 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~--------~~~~~~~~~~~~~~~~~~~
        # The length scales for initial input fluctuations
        self._li_dim_steps = ( (3, 3), (7,7), (14,14), (28,28) )
        # Parameters for fluctuations  - used both in strategy 0 and strategy 1  
        self._ay_facts     = np.array( (0.5, 0.5, 0.5, 0.5) )
        
        
        # ********************************************************
        # Model setup - load the cnn-model and build the oip-model
        # ************
        
        if path.isfile(self._cnn_model_file): 
            # We trigger the initial load of a model 
            self._load_cnn_model(file_of_cnn_model = self._cnn_model_file, b_print_cnn_model = True)
            # We trigger the build of a new sub-model based on the CNN model used for OIP search 
            self._build_oip_model(layer_name = self._layer_name, b_print_oip_model = True ) 
        else:
            print("<\nWarning: The standard file " + self._cnn_model_file + 
                  " for the cnn-model could not be found!\n " + 
                  " Please use method _load_cnn_model() to load 
a valid model")
            
        return
 

 
The purpose of most of the variables will become clearer as we walk though the class’s methods below.

Loading the original trained CNN model

Let us say we have a trained CNN-model with all final weight parameters for node-connections saved in some h5-file (see the 4th article of this series for more info). We then can load the CNN-model and derive sub-models from its layer elements. The following method performs the loading task for us:

 
    #
    # Method to load a specific CNN model
    # **********************************
    def _load_cnn_model(self, file_of_cnn_model='cnn_best.h5', b_print_cnn_model=True ):
        
        # Check existence of the file
        if not path.isfile(self._cnn_model_file): 
            print("<\nWarning: The file " + file_of_cnn_model + 
                  " for the cnn-model could not be found!\n" + 
                  "Please change the parameter \"file_of_cnn_model\"" + 
                  " to load a valid model")

        # load the CNN model 
        self._cnn_model_file = file_of_cnn_model
        self._cnn_model = models.load_model(self._cnn_model_file)

        # Inform about the model and its file file 7
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~
        print("Used file to load a ´ model = ", self._cnn_model_file)
        # we print out the models structure
        if b_print_cnn_model:
            print("Structure of the loaded CNN-model:\n")
            self._cnn_model.summary()

        # handle/references to the models input => more precise the input image 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        #    Note: As we only have one image instead of a batch 
        #    we pick only the first tensor element!
        #    The inputs will be needed for buildng the oip-model 
        self._r_cnn_inputs = self._cnn_model.inputs[0]  # !!! We have a btach with just ONE image 
        
        # print out the shape - it should be known fro the original cnn-model
        print("shape of cnn-model inputs = ", self._r_cnn_inputs.shape)
        
        return

 
Actually, I used this function already in the class’ “__init__()”-method – provided the standard file for the last CNN exists. (In a more advanced version you would in addition check that the name of the standard CNN-model meets your expectations.)

The code is straightforward. You see that the structure of the original CNN-model is printed out if requested by the user.

Note also that we assigned the first element of the input tensor of the CNN-model, i.e. a single image tensor, to the variable “self._r_cnn_inputs”. This tensor will become a major ingredient in a new Keras model which we are going to build in a minute and which we shall use to calculate gradient components of a loss function with respect to all pixel values of the input image. The gradient’s component values will in turn be used during gradient ascent to correct the pixel values. Repeated corrections should lead to a systematic approach of a maximum of our loss function, which describes the map’s activation. (Remember: Such a maximum may depend on the input image fluctuations).

Build a new Keras model based on the input tensor and a chosen layer

The next method is more interesting. We need to build a new Keras layer “model” based on the input layer and a deeper layer of the original CNN-model. (We already used the same kind of “trick” when we tried to visualize the activation output of a convolutional layer of the CNN.)

 
    #
    # Method to construct a model to optimize input for OIP-detection 
    # 
***************************************
    def _build_oip_model(self, layer_name = 'Conv2D_3', b_print_oip_model=True ): 
        '''
        We need a Conv layer to build a working model for input image optimization. 
        We get the Conv layer by the layer's name. 
        The new model connects the first input element of the CNN to 
        the output maps of the named Conv layer CNN 
        '''
        # free some RAM - hopefully 
        del self._oip_submodel
        
        self._layer_name = layer_name
        if self._cnn_model == None: 
            print("Error: cnn_model not yet defined.")
            sys.exit()
        # We build a new model based ion the model inputs and the output 
        self._layer_output = self._cnn_model.get_layer(self._layer_name).output
        
        # We do not acre at the moment about the composition of the input 
        # We trust in that we handle only one image - and not a batch
        model_name = "mod_oip__" + layer_name 
        self._oip_submodel = models.Model( [self._r_cnn_inputs], [self._layer_output], name = model_name)                                    
        
        # We print out the oip model structure
        if b_print_oip_model:
            print("Structure of the constructed OIP-sub-model:\n")
            self._oip_submodel.summary()
        return

 
We use the tensor for a single input image and the output of layer (= a collection of maps) of the original CNN-model as the definition elements of the new Keras model.

TF2’s GradientTape() – watch the change of variables which have an impact on model gradients and the loss function

What do we have so far? We have defined a new Keras model connecting input and output data of a layer of the original model. TF2 can determine related gradients by the node connections defined in the original model. However, we cannot rely on a graph analysis by Tensorflow as we were used to with TF1. TF2 uses eager mode – i.e. it calculates gradients directly. What does “directly” mean? Well – as soon as changes to variables occur which have an impact on the gradient values. This in turn means that “something” has to watch out for such changes. TF2 offers a special object for this purpose: tf.GradientTape. See:
https://www.tensorflow.org/guide/eager
https://www.tensorflow.org / api_docs/python/tf/GradientTape

So, as a next step, we set up a method to take care of “GradientTape()”.

    #
    # Method to watch gradients 
    # *************************
    def _setup_gradient_tape(self):
        '''
        For TF2 eager execution we need to watch input changes and trigger gradient evaluation
        '''   
        # Working with TF2 GradientTape
        self._tape = None
        
        # Watch out for input, output variables with respect to gradient chnages
        with tf.GradientTape() as self._tape: 
            # Input
            # ~~~~~~~
            self._tape.watch(self._r_cnn_inputs)
            # Output 
            # ~~~~~~~
            self._r_oip_outputs = self._oip_submodel(self._r_cnn_inputs)
            
            # Loss 
            # ~~~~~~~
            self._r_oip_loss = tf.reduce_mean(self._r_oip_outputs[0, :, :, self._map_index])
            print(self._r_oip_loss)
            print("shape of oip_loss = ", self._r_oip_loss.shape)

 
Note that the loss definition included in the code fragment is specific for a chosen map. This implies that we have to call this method every time we chose a different map for which we want to create OIP visualizations.

The advantage of the above code element is that “_tape()” can produce gradient values for the relation of the input data and loss data of a model automatically whenever we change the input image data. Gradient values can be called by

  self._r_oip_grads  = self._tape.gradient(self._r_oip_loss, self._r_cnn_inputs)

Gradient ascent

As already discussed in my last article we apply a gradient ascent method to our “loss” function whose outcome rises with the activation of the neurons of a map. The following code sets up a method which first calls “_setup_gradient_tape()” and afterwards applies a normalization to the gradient values which “_tape()” produces. It then defines a convenience function and eventually calls a method which runs the optimization loop.

    #        
    # Method to derive OIP from a given initial input image
    # ********************
    def _derive_OIP(self, map_index = 1, n_epochs = None, n_steps = 4, 
                          epsilon = 0.01, 
                          conv_criterion = 5.e-4, 
                          b_stop_with_convergence = False ):
        '''
        V0.3, 31.08.2020
        This method starts the process of producing an OIP of statistical input image data
        
        Requirements:    An initial input image with statistical fluctuations of pixel values 
        -------------    must have been created. 
        
        Restrictions:    This version only supports the most simple strategy - "strategy 0":  
        -------------    Optimize in one loop - starting from a superposition of fluctuations
                         No precursor, no intermediate evolution of input image data 
        
        Input:
        ------
        map_index:       We can chose a map here       (overwrites previous settings)
        n_epochs:        Number of optimization steps  (overwrites previous settings) 
        n_steps:         Defines number of intervals (with length n_epochs/ n_steps) for reporting
                         This number also sets a requirement for providing n_step external axis-frames 
                         to display intermediate images of the emerging OIP  
                         => see _oip_strat_0_optimization_loop()
        epsilon:         Size for corrections by gradient values
        conv_criterion:  A small threshold number for (difference of loss-values / present loss value )
        b_stop_with_convergence: 
                         Boolean which decides whether we stop a loop if the conv-criterion is fulfilled
                         
        '''
        self._map_index = map_index
        self._n_epochs  = n_epochs   
        self._n_steps   = n_steps
        self._epsilon   = epsilon
        
        # number of epochs for optimization strategy 0 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        if n_epochs == None:
            len_epochs = len(self._ay_epochs)
            self._n_epochs   = self._ay_epochs[len_epochs-1]
        else: 
            self._n_epochs = n_epochs
            
        # Rest some variables  
        self._val_oip_grads = None
        self._val_oip_loss  = None 
        self._iterate       = None 
        
        # Setup the TF2 GradientTape watch
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        self._setup_gradient_tape()
        print("GradientTape watch activated ")
    
        # Gradient handling - so far we only deal with addresses 
        # ~~~~~~~~~~~~~~~~~~
        self._r_oip_grads  = self._tape.gradient(self._r_oip_loss, self._r_cnn_inputs)
        #print("shape of grads = ", self._r_oip_grads.shape)
        
        # normalization of the gradient 
        self._r_oip_grads /= (B.sqrt(B.mean(B.
square(self._r_oip_grads))) + 1.e-7)
        #grads = tf.image.per_image_standardization(grads)
        
        # define an abstract recallable Keras function 
        # producing loss and gradients for corrected img data 
        # the first list of addresses points to the input data, the last to the output data 
        self._iterate = B.function( [self._r_cnn_inputs], [self._r_oip_loss, self._r_oip_grads] )
        
        # get the initial image into a variable for optimization 
        self._inp_img_data = None
        self._inp_img_data = self._initial_inp_img_data
        
        # Optimization loop for strategy 0 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        if self._oip_strategy == 0: 
            self._oip_strat_0_optimization_loop( conv_criterion = conv_criterion, 
                                                b_stop_with_convergence = b_stop_with_convergence )
    

 
Gradient value normalization is done here with respect to the L2-norm of the gradient. I.e., we adjust the length of the gradient to a unit length. Why does such a normalization help us with respect to convergence? You remember from a previous article series about MLPs that we had to care about a reasonable balance of an adaptive application of gradient values and a systematic reduction of the learning rate when approaching the global minimum of a cost function. The various available adaptive gradient methods care about speed and a proper deceleration of the steps by which we move across the cost hyperplane. When we approached a minimum we needed to care about overshooting. In our present gradient ascent case we have no such adaptive fine-tuning available. What we can do, however, is to control the length of the gradient vector such that changes remain of the same order as long as we do not change the step-size factor (epsilon). I.e.:

  • We do not wildly accelerate our path on the loss hyperplane when some local pixel areas want to drive us into a certain direction.
  • We reduce the chance of creating pixel values out of normal boundaries.

The last point fits well with a situation where the CNN has been trained on normalized or standardizes input images. Due to the normalization the size of pixel value corrections now depends significantly on the factor “epsilon”. We should choose it small enough with respect to the pixel values.

Another interesting statement in the above code fragment is

        self._iterate = B.function( [self._r_cnn_inputs], [self._r_oip_loss, self._r_oip_grads] )

Here we use the Keras Backend to define a convenience function which relates input data with dependent outputs, whose calculations we previously defined by suitable statements. The list which is used as the first parameter of this function “_iterate()” defines the input variables, the list used as a second parameter defines the output variables which will internally be calculated via the GradientTape-functionality defined before. The “_iterate()”-function makes it much easier for us to build the optimization loop.

The optimization loop for the construction of images that visualize OIPs and features

The optimization loop must systematically correct the pixel values to approach a maximum of the loss function. The following method “_oip_strat_0_optimization_loop()” does this job for us. (The “0” in the method’s name refers to a first simple approach.)

    #        
    # Method to optimize an emerging OIP out of statistical input image data 
    # (simple strategy - just optimize, no precursor, no intermediate pattern evolution 
    # ********************************
    def _oip_strat_0_optimization_loop(self, conv_criterion = 5.e-4, 
                                            b_stop_with_
convergence = False ):
        
        '''
        V 0.2, 28.08.2020 
        
        Purpose: 
        This method controls the optimization loop for OIP reconstruction of an initial 
        input image with a statistical distribution of pixel values. 
        It also provides intermediate output in the form of printed data and images.
        
        Parameter: 
        conv_criterion:  A small threshold number for (difference of loss-values / present loss value )
        b_stop_with_convergence: 
                         Booelan which decides whether we stop a loop if the conv-criterion is fulfilled
        
        
        This method produces some intermediate output during the optimization loop in form of images.
        It uses an external grid of plot frames and their axes-objects. The addresses of the 
        axis-objects must provided by an external list "li_axa[]".  
        We need a seqence of >= n_steps axis-frames length(li_axa) >= n_steps).    
        With Jupyter the grid for plots can externally be provided by statements like 
        
        # figure
        # -----------
        #sizing
        fig_size = plt.rcParams["figure.figsize"]
        fig_size[0] = 16
        fig_size[1] = 8
        fig_a = plt.figure()
        axa_1 = fig_a.add_subplot(241)
        axa_2 = fig_a.add_subplot(242)
        axa_3 = fig_a.add_subplot(243)
        axa_4 = fig_a.add_subplot(244)
        axa_5 = fig_a.add_subplot(245)
        axa_6 = fig_a.add_subplot(246)
        axa_7 = fig_a.add_subplot(247)
        axa_8 = fig_a.add_subplot(248)
        li_axa = [axa_1, axa_2, axa_3, axa_4, axa_5, axa_6, axa_7, axa_8]
        
        '''
        
        # Check that we really have an input image tensor
        if ( (self._inp_img_data == None) or 
             (self._inp_img_data.shape[1] != self._img_dim) or 
             (self._inp_img_data.shape[2] != self._img_dim) ) :
            print("There is no initial input image or it does not fit dimension requirements (28,28)")
            sys.exit()
        
        # Print some information
        print("*************\nStart of optimization loop\n*************")
        print("Strategy: Simple initial mixture of long and short range variations")
        print("Number of epochs = ", self._n_epochs)
        print("Epsilon =  ", self._epsilon)
        print("*************")
        
        # some initial value
        loss_old = 0.0
       
        # Preparation of intermediate reporting / img printing
        # --------------------------------------
        # Number of intermediate reporting points during the loop 
        steps = math.ceil(self._n_epochs / self._n_steps )
        # Logarithmic spacing of steps (most things happen initially)
        n_el = math.floor(self._n_epochs / 2**(self._n_steps) ) 
        li_int = []
        for j in range(self._n_steps):
            li_int.append(n_el*2**j)
        print("li_int = ", li_int)
        # A counter for intermediate reporting  
        n_rep = 0
        # Array for intermediate image data 
        li_imgs = np.zeros((self._img_dim, self._img_dim, 1), dtype=np.float32)
        
        # Convergence? - A list for steps meeting the convergence criterion
        # ~~~~~~~~~~~~
        li_conv = []
        
        
        # optimization loop 
        # *******************
        # A counter for steps with zero loss and gradient values 
        n_zeros = 0
        
        for j in range(self._n_epochs):
            
            # Get output values of our Keras iteration function 
            # ~~~~~~~~~~~~~~~~~~~
            self._val_oip_loss, self._val_oip_grads = self._iterate([self._inp_img_data])
            
            # loss difference to last step - shuold steadily become smaller 
            loss_diff = self._
val_oip_loss - loss_old 
            #print("loss_val = ", loss_val)
            #print("loss_diff = ", loss_diff)
            loss_old = self._val_oip_loss
            
            if j > 10 and (loss_diff/(self._val_oip_loss + 1.-7)) < conv_criterion:
                li_conv.append(j)
                lenc = len(li_conv)
                # print("conv - step = ", j)
                # stop only after the criterion has been met in 4 successive steps
                if b_stop_with_convergence and lenc > 5 and li_conv[-4] == j-4:
                    return
            
            grads_val     = self._val_oip_grads
            # the gradients average value 
            avg_grads_val = (tf.reduce_mean(grads_val)).numpy()
            
            # Check if our map reacts at all
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            if self._val_oip_loss == 0.0 and avg_grads_val == 0.0:
                print( "0-values, j= ", j, 
                       " loss = ", self._val_oip_loss, " avg_loss = ", avg_grads_val )
                n_zeros += 1
            
            if n_zeros > 10: 
                print("More than 10 times zero values - Try a different initial random distribution of pixel values")
                return
            
            # gradient ascent => Correction of the input image data 
            # ~~~~~~~~~~~~~~~
            self._inp_img_data += grads_val * self._epsilon
            
            # Standardize the corrected image - we won't get a convergence otherwise 
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            self._inp_img_data = tf.image.per_image_standardization(self._inp_img_data)
            
            # Some output at intermediate points 
            #     Note: We us logarithmic intervals because most changes 
            #           appear in the intial third of the loop's range  
            if (j == 0) or (j in li_int) or (j == self._n_epochs-1) :
                # print some info 
                print("\nstep " + str(j) + " finalized")
                #print("Shape of grads = ", grads_val.shape)
                print("present loss_val = ", self._val_oip_loss)
                print("loss_diff = ", loss_diff)
                # display the intermediate image data in an external grid 
                imgn = self._inp_img_data[0,:,:,0].numpy()
                #print("Shape of intermediate img = ", imgn.shape)
                li_axa[n_rep].imshow(imgn, cmap=plt.cm.viridis)
                # counter
                n_rep += 1
         
        return

 
The code is easy to understand: We use our convenience function “self._iterate()” to produce actual loss and gradient values within the loop. Then we change the pixel values of our input image and feed the changed image back into the loop. Note the positive sign of the correction! All complicated gradient calculations are automatically and “eagerly” done internally thanks to “GradientTape”.

We said above that we limited the gradient values. How big is the size of the resulting corrections compared to the image data? If and when we use standardized image data and scale our gradient to unit length size then the relative size of the changes are of the order of the step size “epsilon” for the correction. In our case we set epsilon to 5.e-4.

The careful reader has noticed that I standardized the image data after the correction with the (normalized) gradient. Well, this step proved to be absolutely necessary to get convergence in the sense that we really approach an extremum of the cost function. Reasons are:

  • My CNN was trained on standardized MNIST input images.
  • We did not include any normalization layers into our CNN.
  • Without counter-measures our normalized gradient values would eventually drive unlimited activation
    values.

The last point deserves some explanation:

We used the ReLU-function as the activation function of nodes in the inner layers of our CNN. For positive input values ReLU actually is a linear function. Now let us assume that we have found a rather optimal input pattern which via a succession of basically linear operations drives an activation pattern of a map’s neurons. What happens if we just add constant small values to each pixel per iteration step? The output values after the sequence of linear transformations will just grow! With our method and the ReLU activation function we walk around a surface until we reach a linear ramp and climb it up. Without compensatory steps we will not find a real maximum because there is none. The situation is very different at the convolutional layers than at the eventual output layer of the CNN’s MLP-part.

You may ask yourself why we experienced nothing of this during the classification training? First answer: We did not optimize input data but weights during the training. Second answer: During training we did NOT maximize potentially unbound activation values but minimized a cost function based on output values of the last a MLP-layer. And these values were produced by a sigmoid function! The sigmoid function limits any input to the range ]0, +1[. In addition, the cost function (categorial_crossentropy) is designed to be convex for deviations of limited calculated values from a limited target vector.

The consequence is that we have to limit the values of the (corrected) input data and the related gradients in our present optimization procedure at the same time! This is done by the standardization of the image data. Remember that the correction values are around of the relative order of 5.e-4. In the end this is the order of the fluctuations which are unavoidable in the final OIP image; but now we have a chance to converge to a related small region around a real maximum.

The last block in the code deals with intermediate output – not only printed data on the loss function but also in form of intermediate images of the hopefully emerging pattern. These images can be provided in an external grid of figures in e.g. a Jupyter environment. The important point is that we define a suitable number of Matplotlib’s axis-objects and deliver their addresses via an external array “li_axa[]”. I am well aware of that the plotting solution coded here is a very basic one and it requires some ahead planning of the user – feel free to program it in a better way.

Initial input image data – with variations on different length scales

We lack just one further ingredient: We need a method to construct an input image with statistical data. I have discussed already that it may be helpful to vary data on different length scales. A very simple approach to tackle this problem manually is the following one:

We use squares – each with a different small and limited number of cells; e.g. a (4×4)-, a (7×7)-, a (14×14)- and (28×28)-square. Note that (28×28) is the original size of the MNIST images. For other samples we may need different sizes of such squares and more of them. We then fill the cells with random numbers between [-1, 1]. We smoothly scale the variations of the smaller squares up to the real input image size; here: (28×28). We can do this by applying a bicubic interpolation to the data. In the end we add up all random data and normalize or standardize the resulting distribution of pixel values. See the code below for details:

    # 
    # Method to build an initial image from a superposition of random data on different length scales 
    # ***********************************
    def _build_initial_img_data( self, 
                                 strategy = 0, 
                                 li_epochs    = (20, 50, 100, 400), 
                                 li_
facts     = (0.5, 0.5, 0.5, 0.5),
                                 li_dim_steps = ( (3,3), (7,7), (14,14), (28,28) ), 
                                 b_smoothing = False):
        
        '''
        V0.2, 31.08.2020
        Purpose:
        ~~~~~~~~
        This method constructs an initial input image with a statistical distribution of pixel-values.
        We use 4 length scales to mix fluctuations with different "wave-length" by a simple  
        approach: 
        We fill four squares with a different number of cells below the number of pixels 
        in each dimension of the real input image; e.g. (4x4), (7x7, (14x14), (28,28) <= (28,28). 
        We fill the cells with random numbers in [-1.0, 1.]. We smootly scale the resulting pattern 
        up to (28,28) (or what ever the input image dimensions are) by bicubic interpolations 
        and eventually add up all values. As a final step we standardize the pixel value distribution.          
        
        Limitations
        ~~~~~~~~~~~
        This version works with 4 length scales. it only supports a simple strategy for 
        evolving OIP patterns. 
        '''

        self._oip_strategy = strategy
        self._ay_facts     = np.array(li_facts)
        self._ay_epochs    = np.array(li_epochs)
        
        self._li_dim_steps = li_dim_steps
        
        fluct_data = None

        
        # Strategy 0: Simple superposition of random patterns at 4 different wave-length
        # ~~~~~~~~~~
        if self._oip_strategy == 0:
            
            dim_1_1 = self._li_dim_steps[0][0] 
            dim_1_2 = self._li_dim_steps[0][1] 
            dim_2_1 = self._li_dim_steps[1][0] 
            dim_2_2 = self._li_dim_steps[1][1] 
            dim_3_1 = self._li_dim_steps[2][0] 
            dim_3_2 = self._li_dim_steps[2][1] 
            dim_4_1 = self._li_dim_steps[3][0] 
            dim_4_2 = self._li_dim_steps[3][1] 
            
            fact1 = self._ay_facts[0]
            fact2 = self._ay_facts[1]
            fact3 = self._ay_facts[2]
            fact4 = self._ay_facts[3]
            
            # print some parameter information
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            print("\nInitial image composition - strategy 0:\n Superposition of 4 different wavelength patterns")
            print("Parameters:\n", 
                 fact1, " => (" + str(dim_1_1) +", " + str(dim_1_2) + ") :: ", 
                 fact2, " => (" + str(dim_2_1) +", " + str(dim_2_2) + ") :: ", 
                 fact3, " => (" + str(dim_3_1) +", " + str(dim_3_2) + ") :: ", 
                 fact4, " => (" + str(dim_4_1) +", " + str(dim_4_2) + ")" 
                 )
            
            # fluctuations
            fluct1 =  2.0 * ( np.random.random((1, dim_1_1, dim_1_2, 1)) - 0.5 ) 
            fluct2 =  2.0 * ( np.random.random((1, dim_2_1, dim_2_2, 1)) - 0.5 ) 
            fluct3 =  2.0 * ( np.random.random((1, dim_3_1, dim_3_2, 1)) - 0.5 ) 
            fluct4 =  2.0 * ( np.random.random((1, dim_4_1, dim_4_2, 1)) - 0.5 ) 

            # Scaling with bicubic interpolation to the required image size
            fluct1_scale = tf.image.resize(fluct1, [28,28], method="bicubic", antialias=True)
            fluct2_scale = tf.image.resize(fluct2, [28,28], method="bicubic", antialias=True)
            fluct3_scale = tf.image.resize(fluct3, [28,28], method="bicubic", antialias=True)
            fluct4_scale = fluct4

            # superposition
            fluct_data = fact1*fluct1_scale + fact2*fluct2_scale + fact3*fluct3_scale + fact4*fluct4_scale
        
        
        # get the standardized plus smoothed and unsmoothed image 
        # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        #    TF2 provides a function performing standardization of image data function
r
        fluct_data_unsmoothed = tf.image.per_image_standardization(fluct_data) 
        fluct_data_smoothed   = tf.image.per_image_standardization(
                                    tf.image.resize( fluct_data, [28,28], 
                                                     method="bicubic", antialias=True) )

        if b_smoothing: 
            self._initial_inp_img_data = fluct_data_smoothed
        else:
            self._initial_inp_img_data = fluct_data_unsmoothed

        # There should be no difference 
        img_init_unsmoothed = fluct_data_unsmoothed[0,:,:,0].numpy()
        img_init_smoothed   = fluct_data_smoothed[0,:,:,0].numpy()
        
        ax1_1.imshow(img_init_unsmoothed, cmap=plt.cm.viridis)
        ax1_2.imshow(img_init_smoothed, cmap=plt.cm.viridis)

        print("Initial images plotted")
        
        return self._initial_inp_img_data    

 
The factors fact1, fact2 and fact3 determine the relative amplitudes of the fluctuations at the different length scales. Thus the user is e.g. able to suppress short-scale fluctuations completely.

I only took exactly four squares to simulate fluctuation on different length scales. A better code would make the number of squares and length scales parameterizable. Or it would work with a Fourier series right from the start. I was too lazy for such elaborated things. The plots again require the definition of some external Matplotlib figures with axis-objects. You can provide a suitable figure in a Jupyter cell.

Conclusion

In this article we have build a simple class to create OIPs for a specific CNN map out of an input image with a random distribution of pixel values. The algorithm should have made it clear that this a constructive work performed during iteration:

  • We start from the “detection” of slight traces of a pattern in the initial statistical pixel value distribution; the
    pattern actually is a statistical pixel correlation which triggers the chosen map,
  • then we amplify the recognized pattern elements
  • and suppress pixel values which are not relevant into a homogeneous background.

So the headline of this article is a bit misleading: We do not only “detect” a map related OIP; we also “create” it.

Our new Python class makes use of a given trained CNN-model and follows the outline of steps discussed in a previous article. The class has many limitations – in its present version it is only usable for MNIST images and the user has to know a lot about internals. However, I hope that my readers nevertheless have understood the basic ingredients. From there it is only a small step towards are more general and more capable version.

I have also underlined in this article that the images produced by the coded methods may only represent local maxima of the loss function for average map activation and idealized patterns composed of re-occuring elementary sub-structures. In the next article

A simple CNN for the MNIST dataset – IX – filter visualization at a convolutional layer

I am going to apply the code to most of the maps of the highest, i.e. inner-most convolutional layer of my trained CNN. We shall discover a whole zoo of simple and complex input image patterns. But we shall also confirm the suspicion that our optimization algorithm for finding an OIP for a specific map does not react to each and every kind of initial statistical pixel value fluctuation presented to the CNN.

 

A simple Python program for an ANN to cover the MNIST dataset – XIV – cluster detection in feature space

We extend our studies of a program for a Multilayer perceptron and gradient descent in combination with the MNIST dataset:

A simple Python program for an ANN to cover the MNIST dataset – XIII – the impact of regularization
A simple Python program for an ANN to cover the MNIST dataset – XII – accuracy evolution, learning rate, normalization
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix
A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests
A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests
A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation
A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles
A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“
A simple Python program for an ANN to cover the MNIST dataset – V – coding the loss function
A simple Python program for an ANN to cover the MNIST dataset – IV – the concept of a cost or loss function
A simple Python program for an ANN to cover the MNIST dataset – III – forward propagation
A simple Python program for an ANN to cover the MNIST dataset – II – initial random weight values
A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

In this article we shall work a bit on the following topic: How can we reduce the computational time required for gradient descent runs of our MLP?

Readers who followed my last articles will have noticed that I sometimes used 1800 epochs in a gradient descent run. The computational time including

  • costly intermediate print outs into Jupyter cells,
  • a full determination of the reached accuracy both on the full training and the test dataset at every epoch

lay in a region of 40 to 45 minutes for our MLP with two hidden layers and roughly 58000 weights. Using an Intel I7 standard CPU with OpenBlas
support. And I plan to work with bigger MLPs – not on MNIST but other data sets. Believe me: Everything beyond 10 minutes is a burden. So, I have a natural interest in accelerating things on a very basic level already before turning to GPUs or arrays of them.

Factors for CPU-time

This introductory question leads to another one: What basic factors beyond technical capabilities of our Linux system and badly written parts of my Python code influence the consumption of computational time? Four points come to my mind; you probably find even more:

  • One factor is certainly the extra forward propagation run which we apply to all samples of both the test and training data seat the end of each epoch. We perform this propagation to make predictions and to get data on the evolution of the accuracy, the total loss and the ratio of the regularization term to the real costs. We could do this in the future at every 2nd or 5th epoch to save some time. But this will reduce CPU-time only by less than 22%. 76% of the CPU-time of an epoch is spent in batch-handling with a dominant part in error backward propagation and weight corrections.
  • The learning rate has a direct impact on the number of required epochs. We could enlarge the learning rate in combination with input data normalization; see the last article. This could reduce the number of required epochs significantly. Depending on the parameter choices before by up to 40% or 50%. But it requires a bit of experimenting ….
  • Two other, more important factors are the frequent number of matrix operations during error back-propagation and the size of the involved matrices. These operations depend directly on the number of nodes involved. We could therefore reduce the number of nodes of our MLP to a minimum compatible with the required accuracy and precision. This leads directly to the next point.
  • The dominant weight matrix is of course the one which couples layer L0 and layer L1. In our case its shape is 784 x 70; it has almost 55000 elements. The matrix for the next pair of layers has only 70×30 = 2100 elements – it is much, much smaller. To reduce CPU time for forward propagation we should try to make this matrix smaller. During error back propagation we must perform multiple matrix multiplications; the matrix dimensions depend on the number of samples in a mini-batch AND on the number of nodes in the involved layers. The dimensions of the the result matrix correspond to the those of the weight matrix. So once again: A reduction of the nodes in the first 2 layers would be extremely helpful for the expensive backward propagation. See: The math behind EBP.

We shall mainly concentrate on the last point in this article.

Reduction of the dimensions of the dominant matrix”requires a reduction of input features

The following numbers show typical CPU times spend for matrix operations during error back propagation [EBP] between different layers of our MLP and for two different batches at the beginning of gradient descent:

Time_CPU for BW layer operations (to L2) 0.00029015699965384556
Time_CPU for BW layer operations (to L1) 0.0008645610000712622
Time_CPU for BW layer operations (to L0) 0.006551215999934357

Time_CPU for BW layer operations (to L2) 0.00029157400012991275
Time_CPU for BW layer operations (to L1) 0.0009575330000188842
Time_CPU for BW layer operations (to L0) 0.007488838999961445

The operations involving layer L0 cost a factor of 7 more CPU time than the other operations! Therefore, a key to the reduction of the number of mathematical operations is obviously the reduction of the number of nodes in the input layer! We cannot reduce the numbers in the hidden layers much, if we
do not want to hamper the accuracy properties of our MLP too much. So the basic question is

Can we reduce the number of input nodes somehow?

Yes, maybe we can! Input nodes correspond to “features“. In case of the MNIST dataset the relevant features are given by the gray-values for the 784 pixels of each image. A first idea is that there are many pixels within each MNIST image which are probably not used at all for classification – especially pixels at the outer image borders. So, it would be helpful to chop them off or to ignore them by some appropriate method. In addition, special significant pixel areas may exist to which the MLP, i.e. its weight optimization, reacts during training. For example: The digits 3, 5, 6, 8, 9 all have a bow within the lower 30% of an image, but in other regions, e.g. to the left and the right, they are rather different.

If we could identify suitable image areas in which dark pixels have a higher probability for certain digits then, maybe, we could use this information to discriminate the represented digits? But a “higher density of dark pixels in an image area” is nothing else than a description of a “cluster” of (dark) pixels in certain image areas. Can we use pixel clusters at numerous areas of an image to learn about the represented digits? Is the combination of (averaged) feature values in certain clusters of pixels representative for a handwritten digit in the MNIST dataset?

If the number of such pixel clusters could be reduced below lets say 100 then we could indeed reduce the number of input features significantly!

Cluster detection

To be able to use relevant “clusters” of pixels – if they exist in a usable form in MNIST images at all – we must first identify them. Cluster identification and discrimination is a major discipline of Machine Learning. This discipline works in general with unlabeled data. In the MNIST case we would not use the labels in the “y”-data at all to identify clusters; we would only use the “X”-data. A nice introduction to the mechanisms of cluster identification is given in the book of Paul Wilcott (see Machine Learning – book recommendations for the reference). The most fundamental method – called “kmeans” – iterates over 3 major steps [I simplify a bit :-)]:

  • We assume that K clusters exist and start with random initial positions of their centers (called “centroids”) in the multidimensional feature space
  • We measure the distance of all data points to he centroids and associate a point with that centroid to which the distance is smallest
  • We determine the “center of mass” (according to some distance metric) of the identified data point groups and assume it as a new position of the centroids and move the old positions (a bit) in this direction.

We iterate over these steps until the centroids’ positions hopefully get stable. Pretty simple. But there is a major drawback: You must make an assumption on the number “K” of clusters. To make such an assumption can become difficult in the complex case of a feature space with hundreds of dimensions.

You can compensate this by executing multiple cluster runs and comparing the results. By what? Regarding the closure or separation of clusters in terms of an appropriate norm. One such norm is called “cluster inertia“; it measures the mean squared distance to the center for all points of a cluster. The theory is that the sum of the inertias for all clusters drops significantly with the number of clusters until an optimal number is reached and the inertia curve flattens out. The point where this happens in a plot of inertia vs. number of clusters is called “elbow“.
Identifying this “elbow” is one of the means to find an optimal number of clusters. However, this recipe does not work under all circumstances. As the number of clusters get big we may be confronted with a smooth decline of the inertia sum.

What data do we use for gradient descent after cluster detection?

How could we measure whether an image shows certain clusters? We could e.g. measure distances (with some appropriate metric) of all image points to the clusters. The “fit_transform()”-method of KMeans and MiniBatchKMeans provide us with with some distance measure of each image to the identified clusters. This means our images are transformed into a new feature space – namely into a “cluster-distance space”. This is a quite complex space, too. But it has less dimensions than the original feature space!

Note: We would of course normalize the resulting distance data in the new feature space before applying gradient descent.

Application of “KMeansBatch” to MNIST

There are multiple variants of “KMeans”. We shall use one which is provided by SciKit-Learn and which is optimized for large datasets: “MiniBatchKMeans“. It operates batch-wise without loosing too much of accuracy and convergence properties in comparison to KMeans (or a comparison see here). “MiniBatchKMeans”has some parameters you can play with.

We could be tempted to use 10 clusters as there are 10 digits to discriminate between. But remember: A digit can be written in very many ways. So, it is much more probable that we need a significant larger number of clusters. But again: How to determine on which K-values we should invest a bit more time? “Kmeans” and methods alike offer another quantity called “silhouette” coefficient. It measures how well the data points are within, at or outside the borders of a cluster. See the book of Geron referenced at the link given above on more information.

Variation of CPU time, inertia and average silhouette coefficients with the number of clusters “K”

Let us first have a look at the evolution of CPU time, total inertia and averaged silhouette with the number of clusters “K” for two different runs. The following code for a Jupyter cell gives us the data:

    
# *********************************************************
# Pre-Clustering => Searching for the elbow 
# *********************************************************
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
X = np.concatenate((ANN._X_train, ANN._X_test), axis=0)
y = np.concatenate((ANN._y_train, ANN._y_test), axis=0)
print("X-shape = ", X.shape, "y-shape = ", y.shape)
num = X.shape[0]

li_n = []
li_inertia = []
li_CPU = []
li_sil1 = []

# Loop over the number "n" of assumed clusters 
rg_n = range(10,171,10)
for n in rg_n:
    print("\nNumber of clusters: ", n)
    start = time.perf_counter()
    kmeans = MiniBatchKMeans(n_clusters=n, n_init=500, max_iter=1000, batch_size=500 )  
    X_clustered = kmeans.fit_transform(X)
    sil1 = silhouette_score(X, kmeans.labels_)
    #sil2 = silhouette_score(X_clustered, kmeans.labels_)
    end = time.perf_counter()
    dtime = end - start
    print('Inertia = ', kmeans.inertia_)
    print('Time_CPU = ', dtime)
    print('sil1 score = ', sil1)
    li_n.append(n)    
    li_inertia.append(kmeans.inertia_)    
    li_CPU.append(dtime)    
    li_sil1.append(sil1)    

    
# Plots         
# ******
fig_size = plt.rcParams["figure.figsize"]
fig_size[
0] = 14
fig_size[1] = 5
fig1 = plt.figure(1)
fig2 = plt.figure(2)

ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)

ax1_1.plot(li_n, li_CPU)
ax1_1.set_xlabel("num clusters K")
ax1_1.set_ylabel("CPU time")

ax1_2.plot(li_n, li_inertia)
ax1_2.set_xlabel("num clusters K")
ax1_2.set_ylabel("inertia")

ax2_1 = fig2.add_subplot(121)
ax2_2 = fig2.add_subplot(122)

ax2_1.plot(li_n, li_sil1)
ax2_1.set_xlabel("num clusters K")
ax2_1.set_ylabel("silhoutte 1")

 
You see that I allowed for large numbers of initial centroid positions and iterations to be on the safe side. Before you try it yourself: Such runs for a broad variation of K-values are relatively costly. The CPU time rises from around 32 seconds for 30 clusters to a little less than 1 minute for 180 clusters. These times add up to a significant sum after a while …

Here are some plots:

The second run was executed with a higher resolution of K_(n+1) – K_n 5 = 5.

We see that the CPU time to determine the centroids’ positions varies fairly linear with “K”. And even for 170 clusters it does not take more than a minute! So, CPU-time for cluster identification is not a major limitation.

Unfortunately, we do not see a clear elbow in the inertia curve! What you regard as a reasonable choice for the number K depends a lot on where you say the curve starts to flatten. You could say that this happens around K = 60 to 90. But the results for the silhouette-quantity indicate for our parameter setting that K=40, K=70, K=90 are interesting points. We shall look at these points a bit closer with higher resolution later on.

Reduction of the regularization factor (for Ridge regularization)

Now, I want to discuss an important point which I did not find in the literature:
In my last article we saw that regularization plays a significant but also delicate role in reaching top accuracy values for the test dataset. We saw that Lambda2 = 0.2 was a good choice for a normalized input of the MNIST data. It corresponded to a certain ratio of the regularization term to average batch costs.
But when we reduce the number of input nodes we also reduce the number of total weights. So the weight values themselves will automatically become bigger if we want to get to similar good values at the second layer. But as the regularization term depends in a quadratic way on the weights we may assume that we roughly need a linear reduction of Lambda2. So, for K=100 clusters we may shrink Lambda2 to (0.2/784*100) = 0.025 instead of 0.2. In general:

Lambda2_cluster = Lambda2_std * K / (number of input nodes)

I applied this rule of a thumb successfully throughout experiments with clustering befor gradient descent.

Reference run without clustering

We saw at the end of article XII that we could reach an accuracy of around 0.975 after 500 epochs under optimal circumstances. But in the case I presented ten I was extremely lucky with the statistical initial weight distribution and the batch composition. In other runs with the same parameter setup I got smaller accuracy values. So, let us take an ad hoc run with the following parameters and results:
Parameters: learn_rate = 0.001, decrease_rate = 0.00001, mom_rate = 0.00005, n_size_mini_batch = 500, n_epochs = 600, Lambda2 = 0.2, weights at
all layers in [-2*1.0/sqrt(num_nodes_layer), 2*1.0/sqrt(num_nodes_layer)]
Results: acc_train: 0.9949 , acc_test: 0.9735, convergence after ca. 550-600 epochs

The next plot shows (from left to right and the down) the evolution of the costs per batch, the averaged error of the last mini-batch during an epoch, the ratio of regularization to batch costs and the total costs of the training set, respectively .

The following plot summarizes the evolution of the total costs of the traaining set (including the regularization contribution) and the evolution of the accuracy on the training and the test data sets (in orange and blue, respectively).

The required computational time for the 600 epochs was roughly 18,2 minutes.

Results of gradient descent based on a prior cluster identification

Before we go into a more detailed discussion of code adaption and test runs with things like clusters in unnormalized and normalized feature spaces, I want to show what we – without too much effort – can get out of using cluster detection ahead of gradient descent. The next plot shows the evolution of a run for K=70 clusters in combination with a special normalization:

and the total cost and accuracy evolution

The dotted line marks an accuracy of 97.8%! This is 0.5% bigger then our reference value of 97.3%. The total gain of %gt; 0.5% means however 18.5% of the remaining difference of 2.7% to 100% and we past a value of 97.8% already at epoch 600 of the run.

What were the required computational times?

If we just wanted 97.4% as accuracy we need around 150 epochs. And a total CPU time of 1.3 minutes to get to the same accuracy as our reference run. This is a factor of roughly 14 in required CPU time. For a stable 97.73% after epoch 350 we were still a factor of 5.6 better. For a stable accuracy beyond 97.8% we needed around 600 epochs – and still were by a factor of 3.3 faster than our reference run! So, clustering really brings some big advantages with it.

Conclusion

In
this article I discussed the idea of introducing cluster identification in the (unnormalized or normalized) feature space ahead of gradient descent as a possible means to save computational time. A preliminary trial run showed that we indeed can become significantly faster by at least a factor of 3 up to 5 and even more. This is just due to the point that we reduced the number of input nodes and thus the number of mathematical calculations during matrix operations.

In the next article we shall have a more detailed look at clustering techniques in combination with normalization.

A simple Python program for an ANN to cover the MNIST dataset – XIII – the impact of regularization

I continue with my growing series on a Multilayer perceptron and the MNIST dataset.

A simple Python program for an ANN to cover the MNIST dataset – XII – accuracy evolution, learning rate, normalization
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix
A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests
A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests
A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation
A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles
A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“
A simple Python program for an ANN to cover the MNIST dataset – V – coding the loss function
A simple Python program for an ANN to cover the MNIST dataset – IV – the concept of a cost or loss function
A simple Python program for an ANN to cover the MNIST dataset – III – forward propagation
A simple Python program for an ANN to cover the MNIST dataset – II – initial random weight values
A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

In the last article of the series we made some interesting experiences with the variation of the “leaning rate”. We also saw that a reasonable range for initial weight values should be chosen.

Even more fascinating was, however, the impact of a normalization of the input data on a smooth and fast gradient descent. We drew the conclusion that normalization is of major importance when we use the sigmoid function as the MLP’s activation function – especially for nodes in the first hidden layer and for input data which are on average relatively big. The reason for our concern were saturation effects of the sigmoid functions and other functions with a similar variation with their argument. In the meantime I have tried to make the importance of normalization even more plausible with the help of a a very minimalistic perceptron for which we can analyze saturation effects a bit more in depth; you get to the related article series via the following link:

A single neuron perceptron with sigmoid activation function – III – two ways of applying Normalizer

There we also have a look at other normalizers or feature scalers.

But back to our series on a multi-layer perceptron. You may have have asked yourself in the meantime: Why did he not check the impact of the regularization? Indeed: We kept the parameter Lambda2 for the quadratic regularization term constant in all experiments so far: Lambda2 = 0.2. So, the question about the impact of regularization e.g. on accuracy is a good one.

How big is the regularization term and how does it evolve during gradient decent training?

I add even one more question: How big is the relative contribution of the regularization term to the total loss or cost function? In our Python program for a MLP model we included a so called quadratic Ridge term:

Lambda2 * 0.5 * SUM[all weights**2], where bias nodes are excluded from the sum.

From various books on Machine Learning [ML] you just learn to choose the factor Lambda2 in the range between 0.01 and 0.1. But how big is the resulting term actually in comparison to the standard cost term, then, and how does the ratio between both terms evolve during gradient descent? What factors influence this ratio?

As we follow a training strategy based on mini-batches the regularization contribution was and is added up to the costs of each mini-batch. So its relative importance varies of course with the size of the mini-batches! Other factors which may also be of some importance – at least during the first epochs – could be the total number of weights in our network and the range of initial weight values.

Regarding the evolution during a converging gradient descent we know already that the total costs go down on the path to a cost minimum – whilst the weight values reach a stable level. So there is a (non-linear!) competition between the regularization term and the real costs of the “Log Loss” cost function! During convergence the relative importance of the regularization term may therefore become bigger until the ratio to the standard costs reaches an eventual constant level. But how dominant will the regularization term get in the end?

Let us do some experiments with the MNIST dataset again! We fix some common parameters and conditions for our test runs:
As we saw in the last article we should normalize the input data. So, all of our numerical experiments below (with the exception of the last one) are done with standardized input data (using Scikit-Learn’s StandardScaler). In addition initial weights are all set according to the sqrt(nodes)-rule for all layers in the interval [-0.5*sqrt(1/num_nodes), 0.5*sqrt(1/num_nodes)], with num_nodes meaning the number of nodes in a layer. Other parameters, which we keep constant, are:

Parameters: learn_rate = 0.001, decrease_rate = 0.00001, mom_rate = 0.00005, n_size_mini_batch = 500, n_epochs = 800.

I added some statements to the method for cost calculation in order to save the relative part of the regularization terms with respect to the total costs of each mini-batch in a Numpy array and plot the evolution in the end. The changes are so simple that I omit showing the modified code.

A first look at the evolution of the relative contribution of regularization to the total loss of a mini-batch

How does the outcome of gradient descent look for standardized input data and a Lambda2-value of 0.1?

Lambda2 = 0.1
Results: acc_train: 0.999 , acc_test: 0.9714, convergence after ca. 600 epochs

We see that the regularization term actually dominates the total loss of a mini-batch at convergence. At least with our present parameter setting. In comparisoin to the total loss of the full training set the contribution is of course much smaller and typically below 1%.

A small Lambda term

Let us reduce the regularization term via setting Lambda = 0.01. We expect its initial contribution to the costs of a batch to be smaller then, but this does NOT mean that the ratio to the standard costs of the batch automatically shrinks significantly, too:

Lambda2 = 0.01
Results: acc_train: 1.0 , acc_test: 0.9656, convergence after ca. 350 epochs

Note the absolute scale of the costs in the plots! We ended up at a much lower level of the total loss of a batch! But the relative dominance of regularization at the point of convergence actually increased! However, this did not help us with the accuracy of our MLP-algorithm on the test data set – although we perfectly fit the training set by a 100% accuracy.

In the end this is what regularization is all about. We do not want a total overfitting, a perfect adaption of the grid to the training set. It will not help in the sense of getting a better general accuracy on other input data. A Lambda2 of 0.01 is much too small in our case!

Slightly bigger regularization with Lambda2 = 0.2

So lets enlarge Lambda2 a bit:
Lambda2 = 0.2
Results: acc_train: 0.9946 , acc_test: 0.9728, convergence after ca. 700 epochs

We get an improved accuracy!

Two other cases with significantly bigger Lambda2

Lambda2 = 0.4
Results: acc_train: 0.9858 , acc_test: 0.9693, convergence after ca. 600 epochs

Lambda2 = 0.8
Results: acc_train: 0.9705 , acc_test: 0.9588, convergence after ca. 400 epochs

OK, but in both cases we see a significant and systematic trend towards reduced accuracy values on the test data set with growing Lambda2-values > 0.2 for our chosen mini-batch size (500 samples).

Conclusion

We learned a bit about the impact of regularization today. Whatever the exact Lambda2-value – in the end the contribution of a regularization term becomes a significant part of the total loss of a mini-batch when we approached the total cost minimum. However, the factor Lambda2 must be chosen with a reasonable size to get an impact of regularization on the final minimum position in the weight-space! But then it will help to improve accuracy on general input data in comparison to overfitted solutions!

But we also saw that there is some balance to take care of: For an optimum of generalization AND accuracy you should neither make Lambda2 too small nor too big. In our case Lambda2 = 0.2 seems to be a reasonable and good choice. Might be different with other datasets.

All in all studying the impact of a variation of achieved accuracy with the factor for a Ridge regularization term seems to be a good investment of time in ML projects. We shall come back to this point already in the next articles of this series.

In the next article

A simple Python program for an ANN to cover the MNIST dataset – XIV – cluster detection in feature space

we shall start to work on cluster detection in the feature space of the MNIST data before using gradient descent.

 

A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests

This brief article continues my series on a Python program for simple MLPs.

A simple program for an ANN to cover the Mnist dataset – VIII – coding Error Backward Propagation
A simple program for an ANN to cover the Mnist dataset – VII – EBP related topics and obstacles
A simple program for an ANN to cover the Mnist dataset – VI – the math behind the „error back-propagation“
A simple program for an ANN to cover the Mnist dataset – V – coding the loss function
A simple program for an ANN to cover the Mnist dataset – IV – the concept of a cost or loss function
A simple program for an ANN to cover the Mnist dataset – III – forward propagation
A simple program for an ANN to cover the Mnist dataset – II – initial random weight values
A simple program for an ANN to cover the Mnist dataset – I – a starting point

With the code for “Error Backward Propagation” we have come so far that we can perform first tests. As planned from the beginning we take the MNIST dataset as a test example. In a first approach we do not rebuild the mini-batches with each epoch. Neither do we vary the MLP setup.

What we are interested in is the question whether our artificial neural network converges with respect to final weight values during training. I.e. we want to see whether the training algorithm finds a reasonable global minimum on the cost hyperplane over the parameter space of all weights.

Test results

We use the following parameters

    ANN = myann.MyANN(my_data_set="mnist_keras", n_hidden_layers = 2, 
                 ay_nodes_layers = [0, 70, 30, 0], 
                 n_nodes_layer_out = 10,  

                 #my_loss_function = "MSE",
                 my_loss_function = "LogLoss",
                 n_size_mini_batch = 500,
     
                 n_epochs = 1500, 
                 n_max_batches = 2000,
                     
                 lambda2_reg = 0.1, 
                 lambda1_reg = 0.0,      

                 vect_mode = 'cols', 
                      
                 learn_rate = 0.0001,
                 decrease_const = 0.000001,
                 mom_rate   = 0.00005,  

                 figs_x1=12.0, figs_x2=8.0, 
                 legend_loc='upper right',
                 b_print_test_data = True
                 )

We use two hidden layers with 70 and 30 nodes. Note that we pick a rather small learning rate, which gets diminished even further. The number of epochs (1500) is quite high; with 4 CPU threads on an i7-6700K processor training takes around 40 minutes if started from a Jupyter notebook. (Our program is not yet optimized.)

I supplemented my code with some statements to print out data for the total costs and the averaged error for mini-batches. We get the
following series of data for the last mini-batch within every 50th epoch.

Starting epoch 1
total costs of mini_batch =  1757.1499806500506
avg total error of mini_batch =  0.17150838451718683
---------
Starting epoch 51
total costs of mini_batch =  532.5817607913532
avg total error of mini_batch =  0.034658195573307196
---------
Starting epoch 101
total costs of mini_batch =  436.67115522484687
avg total error of mini_batch =  0.023496458964699055
---------
Starting epoch 151
total costs of mini_batch =  402.381331415108
avg total error of mini_batch =  0.020836159866695597
---------
Starting epoch 201
total costs of mini_batch =  342.6296325512483
avg total error of mini_batch =  0.016565121882126693
---------
Starting epoch 251
total costs of mini_batch =  319.5995117831668
avg total error of mini_batch =  0.01533372596379799
---------
Starting epoch 301
total costs of mini_batch =  288.2201307002896
avg total error of mini_batch =  0.013799141451102469
---------
Starting epoch 351
total costs of mini_batch =  272.40526022720826
avg total error of mini_batch =  0.013499221607285198
---------
Starting epoch 401
total costs of mini_batch =  251.02417188663628
avg total error of mini_batch =  0.012696943309314687
---------
Starting epoch 451
total costs of mini_batch =  231.92274565746214
avg total error of mini_batch =  0.011152542115360705
---------
Starting epoch 501
total costs of mini_batch =  216.34658280101385
avg total error of mini_batch =  0.010692239864121407
---------
Starting epoch 551
total costs of mini_batch =  215.21791509166042
avg total error of mini_batch =  0.010999255316821901
---------
Starting epoch 601
total costs of mini_batch =  207.79645393570436
avg total error of mini_batch =  0.011123079894527222
---------
Starting epoch 651
total costs of mini_batch =  188.33965068903723
avg total error of mini_batch =  0.009868734062493835
---------
Starting epoch 701
total costs of mini_batch =  173.07625091642274
avg total error of mini_batch =  0.008942065167336382
---------
Starting epoch 751
total costs of mini_batch =  174.98264336120369
avg total error of mini_batch =  0.009714870291761567
---------
Starting epoch 801
total costs of mini_batch =  161.10229359519792
avg total error of mini_batch =  0.008844419847237179
---------
Starting epoch 851
total costs of mini_batch =  155.4186141788981
avg total error of mini_batch =  0.008244783820578621
---------
Starting epoch 901
total costs of mini_batch =  158.88876607392308
avg total error of mini_batch =  0.008970678691005138
---------
Starting epoch 951
total costs of mini_batch =  148.61870772570722
avg total error of mini_batch =  0.008124438423034456
---------
Starting epoch 1001
total costs of mini_batch =  152.16976618516264
avg total error of mini_batch =  0.009151413825781066
---------
Starting epoch 1051
total costs of mini_batch =  142.24802525081637
avg total error of mini_batch =  0.008297161160449798
---------
Starting epoch 1101
total costs of mini_batch =  137.3828515603569
avg total error of mini_batch =  0.007659755348989629
---------
Starting epoch 1151
total costs of mini_batch =  129.8472897084494
avg total error of mini_batch =  0.007254892176613871
---------
Starting epoch 1201
total costs of mini_batch =  139.30002497623792
avg total error of mini_batch =  0.007881199505625214
---------
Starting epoch 1251
total costs of mini_batch =  138.0323454321882
avg total error of mini_batch =  0.00807373439996105
---------
Starting epoch 1301
total costs of mini_batch =  117.95701570484076
avg total error of mini_batch =  0.006378071703153664
---------
Starting epoch 1351
total costs of mini_batch =  125.
71869046937177
avg total error of mini_batch =  0.0072716189968114265
---------
Starting epoch 1401
total costs of mini_batch =  117.3485602627176
avg total error of mini_batch =  0.006291182169676069
---------
Starting epoch 1451
total costs of mini_batch =  118.09317470010767
avg total error of mini_batch =  0.0066519021636054195
---------
Starting epoch 1491
total costs of mini_batch =  112.69566736699439
avg total error of mini_batch =  0.006151660466611035

 ------
Total training Time_CPU:  2430.0504785089997

 
We can display the results also graphically; a code fragemnt to do this, may look like follows in a Jupyter cell:

# Plotting 
# **********
num_epochs  = ANN._ay_costs.shape[0]
num_batches = ANN._ay_costs.shape[1]
num_tot = num_epochs * num_batches

ANN._ay_costs = ANN._ay_costs.reshape(num_tot)
ANN._ay_theta = ANN._ay_theta .reshape(num_tot)

#sizing
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 5

# Two figures 
# -----------
fig1 = plt.figure(1)
fig2 = plt.figure(2)

# first figure with two plot-areas with axes 
# --------------------------------------------
ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)

ax1_1.plot(range(len(ANN._ay_costs)), ANN._ay_costs)
ax1_1.set_xlim (0, num_tot+5)
ax1_1.set_ylim (0, 2000)
ax1_1.set_xlabel("epochs * batches (" + str(num_epochs) + " * " + str(num_batches) + " )")
ax1_1.set_ylabel("costs")

ax1_2.plot(range(len(ANN._ay_theta)), ANN._ay_theta)
ax1_2.set_xlim (0, num_tot+5)
ax1_2.set_ylim (0, 0.2)
ax1_2.set_xlabel("epochs * batches (" + str(num_epochs) + " * " + str(num_batches) + " )")
ax1_2.set_ylabel("averaged error")

 

We get:

This looks quite promising!
The vertical spread of both curves is due to the fact that we plotted cost and error data for each mini-batch. As we know the cost hyperplanes of the batches differ from each other and the hyperplane of the total costs for all training data. So do the cost and error values.

Secondary test: Rate of correctly and wrongly predicted values of the training and the test data sets

With the following code in a Jupyter cell we can check the relative percentage of correctly predicted MNIST numbers for the training data set and the test data set:

# ------ all training data 
# *************************
size_set = ANN._X_train.shape[0]

li_Z_in_layer_test  = [None] * ANN._n_total_layers
li_Z_in_layer_test[0] = ANN._X_train
# Transpose 
ay_Z_in_0T       = li_Z_in_layer_test[0].T
li_Z_in_layer_test[0] = ay_Z_in_0T
li_A_out_layer_test  = [None] * ANN._n_total_layers

ANN._fw_propagation(li_Z_in = li_Z_in_layer_test, li_A_out = li_A_out_layer_test, b_print = True) 
Result = np.argmax(li_A_out_layer_test[3], axis=0)
Error = ANN._y_train - Result 
acc_train = (np.sum(Error == 0)) / size_set
print ("total accuracy for training data = ", acc_train)

# ------ all test data 
# *************************
size_set = ANN._X_test.shape[0]

li_Z_in_layer_test  = [None] * ANN._n_total_layers
li_Z_in_layer_test[0] = ANN._X_test
# Transpose 
ay_Z_in_0T       = li_Z_in_layer_test[0].T
li_Z_in_layer_test[0] = ay_Z_in_0T
li_A_out_layer_test  = [None] * ANN._n_total_layers

ANN._fw_propagation(li_Z_in = li_Z_in_layer_test, li_A_out = li_A_out_layer_test, b_print = True) 
Result = np.argmax(li_A_out_layer_
test[3], axis=0)
Error = ANN._y_test - Result 
acc_test = (np.sum(Error == 0)) / size_set
print ("total accuracy for test data = ", acc_test)

 

“acc” stands for “accuracy”.

We get

total accuracy for training data = 0.9919
total accuracy for test data        = 0.9645

So, there is some overfitting – but not much.

Conclusion

Our training algorithm and the error backward propagation seem to work.

The real question is, whether we produced the accuracy values really efficiently: In our example case we needed to fix around 786*70 + 70*30 + 30*10 = 57420 weight values. This is close to the total amount of training data (60000). A smaller network with just one hidden layer would require much fewer values – and the training would be much faster regarding CPU time.

So, in the next article
A simple program for an ANN to cover the Mnist dataset – X – mini-batch-shuffling and some more tests
we shall extend our tests to different setups of the MLP.

A simple Python program for an ANN to cover the MNIST dataset – II – initial random weight values

In this article series we are going to build a relatively simple Python class for the simulation of a “Multilayer Perceptron” [MLP]. A MLP is a simple form of an “artificial neural network” [ANN] with multiple layers of neurons. It has three characteristic properties: (1) Only connections between nodes of neighboring layers are allowed. (2) Information transport occurs in one forward direction. We speak of a “forward propagation of input information” through the ANN. (3) Neighboring layers are densely connected; a node of layer L_n is connected to all nodes of layer L_(n+1).

The firsts two points mean simplifications: According to (1) we do not consider so called cascaded networks. According to (2) no loops occur in the information transport. The third point, however, implies a lot of mathematical operations – not only during forward propagation of information through the ANN, but also – as we shall see – during the training and optimization of the network where we will back propagate “errors” from the output to the input layers. But for the next articles we need to care about simpler things first.

We shall use our MLP for classification tasks. Our first objective is to apply it to the MNIST dataset. In my last article

A simple program for an ANN to cover the Mnist dataset – I – a starting point

I presented already some code for the “__init__”-function on some other methods of our Python class. It enabled us to import MNIST data and split them in to a set of training and a set of test samples. Note that this is a standard approach in Machine Learning: You train on one set of data samples, but you test the classification or regression abilities of your ANN on a separate disjunctive data set.

In the present article we shall extend the functionality of our class: First we shall equip our network layers with a defined number of nodes. Then we shall provide statistical initial values for the “weights” describing the forward transport of information along the connections between the layers of our network.

Present status of our function “__init__”

The present status of our __init__function is:

    def __init__(self, 
                 my_data_set = "mnist", 
                 n_hidden_layers = 1, 
                 ay_nodes_layers = [0, 100, 0], # array which should have as much elements as n_hidden + 2
                 n_nodes_layer_out = 10,  # expected number of nodes in output layer 
                                                  
                 my_activation_function = "sigmoid", 
                 my_out_function        = "sigmoid",   
                 
                 n_mini_batch = 1000,  # number of data elements in a mini-batch 
                 
                 vect_mode = 'cols', 
                 
                 figs_x1=12.0, figs_x2=8.0, 
                 legend_loc='upper right',
                 
                 b_print_test_data = True
                 
                 ):
        '''
        Initialization of MyANN
        Input: 
            data_set: type of dataset; so far only the "mnist", "mnist_784" datsets are known 
                      We use this information to prepare the input data and learn about the feature dimension. 
                      This info is used in preparing the size of the input layer.     
            n_hidden_layers = number of hidden layers => between input layer 0 and output layer n 

            ay_nodes_layers = [0, 100, 0 ] : We set the number of nodes in input layer_0 and the output_layer to zero 
                              Will be set to real number afterwards by infos from the input dataset. 
       
                       All other numbers are used for the node numbers of the hidden layers.
            n_nodes_out_layer = expected number of nodes in the output layer (is checked); 
                                this number corresponds to the number of categories NC = number of labels to be distinguished
            
            my_activation_function : name of the activation function to use 
            my_out_function : name of the "activation" function of the last layer which produces the output values 
            
            n_mini_batch : Number of elements/samples in a mini-batch of training dtaa 
            
            vect_mode: Are 1-dim data arrays (vctors) ordered by columns or rows ?

            figs_x1=12.0, figs_x2=8.0 : Standard sizing of plots , 
            legend_loc='upper right': Position of legends in the plots
            
            b_print_test_data: Boolean variable to control the print out of some tests data 
             
         '''
        
        # Array (Python list) of known input data sets 
        self._input_data_sets = ["mnist", "mnist_784", "mnist_keras"]  
        self._my_data_set = my_data_set
        
        # X, y, X_train, y_train, X_test, y_test  
            # will be set by analyze_input_data 
            # X: Input array (2D) - at present status of MNIST image data, only.    
            # y: result (=classification data) [digits represent categories in the case of Mnist]
        self._X       = None 
        self._X_train = None 
        self._X_test  = None   
        self._y       = None 
        self._y_train = None 
        self._y_test  = None
        
        # relevant dimensions 
        # from input data information;  will be set in handle_input_data()
        self._dim_sets     = 0  
        self._dim_features = 0  
        self._n_labels     = 0   # number of unique labels - will be extracted from y-data 
        
        # Img sizes 
        self._dim_img      = 0 # should be sqrt(dim_features) - we assume square like images  
        self._img_h        = 0 
        self._img_w        = 0 
        
        # Layers
        # ------
        # number of hidden layers 
        self._n_hidden_layers = n_hidden_layers
        # Number of total layers 
        self._n_total_layers = 2 + self._n_hidden_layers  
        # Nodes for hidden layers 
        self._ay_nodes_layers = np.array(ay_nodes_layers)
        # Number of nodes in output layer - will be checked against information from target arrays
        self._n_nodes_layer_out = n_nodes_layer_out
        
        
        # Weights 
        # --------
        # empty List for all weight-matrices for all layer-connections
        # Numbering : 
        # w[0] contains the weight matrix which connects layer 0 (input layer ) to hidden layer 1 
        # w[1] contains the weight matrix which connects layer 1 (input layer ) to (hidden?) layer 2 
        self._ay_w = []  
        
        # Known Randomizer methods ( 0: np.random.randint, 1: np.random.uniform )  
        # ------------------
        self.__ay_known_randomizers = [0, 1]

        # Types of activation functions and output functions 
        # ------------------
        self.__ay_activation_functions = ["sigmoid"] # later also relu 
        self.__ay_output_functions     = ["sigmoid"] # later also softmax 
        
        # the following dictionaries will be used for indirect function calls 
        self.__d_activation_funcs = {
            'sigmoid': self._sigmoid, 
            'relu':    self._relu
            }
        self.__d_output_funcs = { 
            'sigmoid': self._sigmoid, 
            'softmax': self._softmax
            }  
          
        # The following variables will later be set by _check_and set_activation_and_out_functions()            
        
self._my_act_func = my_activation_function
        self._my_out_func = my_out_function
        self._act_func = None    
        self._out_func = None    

        # number of data samples in a mini-batch 
        self._n_mini_batch = n_mini_batch


        # print some test data 
        self._b_print_test_data = b_print_test_data

        # Plot handling 
        # --------------
        # Alternatives to resize plots 
        # 1: just resize figure  2: resize plus create subplots() [figure + axes] 
        self._plot_resize_alternative = 1 
        # Plot-sizing
        self._figs_x1 = figs_x1
        self._figs_x2 = figs_x2
        self._fig = None
        self._ax  = None 
        # alternative 2 does resizing and (!) subplots() 
        self.initiate_and_resize_plot(self._plot_resize_alternative)        
        
        
        # ***********
        # operations 
        # ***********
        # check and handle input data 
        self._handle_input_data()
        
        print("\nStopping program regularily")
        sys.exit()

 
The kind reader may have noticed that this is not exactly what was presented in the last article. I have introduced two additional parameters and corresponding class attributes:
“b_print_test_data” and “n_mini_batch”.

The first of these parameters controls whether we print out some test data or not. The second parameter controls the number of test samples dealt with in parallel during propagation and optimization via the so called “mini-batch-approach” mentioned in the last article.

Setting node numbers of the layers

We said in the last article that we would provide the numbers of nodes of the ANN layers via a parameter list “ay_nodes_layers”. We set the number of nodes for the input and the output layer, i.e. the first number and the last number in the list, to “0” in this array because these numbers are determined by properties of the input data set – here the MNIST dataset.

All other numbers in the array determine the amount of nodes of the hidden layers in consecutive order between the input and the output layer. So, the number at ay_nodes_layers[1] is the number of nodes in the first hidden layer, i.e. the layer which follows after the input layer.

In the last article we have understood already that the number of nodes in the input layer should be equal to the full number of “features” of our input data set – 784 in our case. The number of nodes of the output layer must instead be determined from the number of categories in our data set. This is equivalent to the number of distinct labels in the set of training data represented by an array “_y_train” (in case of MNIST: 10).

We provide three methods to check the node numbers defined by the user, set the node numbers for the input and output layers and print the numbers.

    # Method which checks the number of nodes given for hidden layers 
    def _check_layer_and_node_numbers(self): 
        try: 
            if (self._n_total_layers != (self._n_hidden_layers + 2)): 
                raise ValueError
        except ValueError:
            print("The assumed total number of layers does not fit the number of hidden layers + 2") 
            sys.exit()   
        try: 
            if (len(self._ay_nodes_layers) != (self._n_hidden_layers + 2)): 
                raise ValueError
        except ValueError:
            print("The number of elements in the array for layer-nodes does not fit the number of hidden layers + 2") 
            sys.exit(1)   
    
    # Method which sets the number of nodes of the input and the layer 
    def _set_nodes_for_input_output_layers(self): 
        
        # Input layer: for the input layer we do 
NOT take into account a bias node 
        self._ay_nodes_layers[0] = self._dim_features 
                
        # Output layer: for the output layer we check the number of unique values in y_train
        try: 
            if ( self._n_labels != (self._n_nodes_layer_out) ): 
                raise ValueError
        except ValueError:
            print("The unique elements in target-array do not fit number of nodes in the output layer") 
            sys.exit(1)   
        self._ay_nodes_layers[self._n_total_layers - 1] = self._n_labels     


    # Method which prints the number of nodes of all layers 
    def _show_node_numbers(self): 
        print("\nThe node numbers for the " + str(self._n_total_layers) + " layers are : ")
        print(self._ay_nodes_layers)

 
The code should be easy to understand. self._dim_features was set in the method “_common_handling_of mnist()” discussed in the last article. It was derived from the shape of the input data array _X_train. The number of unique labels was evaluated by the method “_get_num_labels()” – also discussed in the last article.

Note: The initial node numbers DO NOT include a bias node, yet.

If we extend the final commands in the “__init__”-function by :

        # ***********
        # operations 
        # ***********
        # check and handle input data 
        self._handle_input_data()

        # check consistency of the node-number list with the number of hidden layers (n_hidden)
        self._check_layer_and_node_numbers()
        # set node numbers for the input layer and the output layer
        self._set_nodes_for_input_output_layers() 
        self._show_node_numbers() 
        
        print("\nStopping program regularily")
        sys.exit()

 

By testing our additional code in a Jupyter notebook we get a corresponding output for

ANN = myann.MyANN(my_data_set="mnist_keras", n_hidden_layers = 2, 
                     ay_nodes_layers = [0, 100, 50, 0], 
                     n_nodes_layer_out = 10,  
                     vect_mode = 'cols', 
                     figs_x1=12.0, figs_x2=8.0, 
                     legend_loc='upper right',
                     b_print_test_data = False
                     )

of:

The node numbers for the 4 defined layers are: 
[784 100  50  10]

Good!

Setting initial random numbers for the weights

Initial values for the ANN weights have to be given as matrices, i.e. 2-dim arrays. However, the randomizer functions provided by Numpy give you vectors as output. So, we need to reshape such vectors into the required form.

First we define a method to provide random floating point and integer numbers:

 
    # ---
    # method to create an array of randomized values
    def _create_vector_with_random_values(self, r_low=None, r_high=None, r_size=None, randomizer=0 ):   
        '''
        Method to create a vector of length "r_size" with "random values" in [r_low, r_high] 
        generated by method "randomizer" 
        Input: 
        ramdonizer : integer which sets randomizer method; presently only 
            0: np.random.uniform 
            1: np.randint     
        [r_low, r_high]: range of the random numbers to be created
        r_size: Size of output array 
        Output: A 1-dim numpy array of length rand-size - produced as a class member 
        '''
        # check parameters
        try: 
            if (r_low==None or r_high == None or r_size == None ): 
                raise ValueError
        except ValueError:
            print("One of 
the required parameters r_low, r_high, r_size has not been set") 
            sys.exit(1)   
        
        rmizer = int(randomizer)
        try: 
            if (rmizer not in self.__ay_known_randomizers): 
                raise ValueError
        except ValueError:
            print("randomizer not known") 
            sys.exit(1)   
        
        # 2 randomizers (so far)
        if (rmizer == 0): 
            ay_r_out = np.random.randint(int(r_low), int(r_high), int(r_size))
        if (rmizer == 1):
            ay_r_out = np.random.uniform(r_low, r_high, size=int(r_size))
            
        return ay_r_out     

 
Presently, we can only use two randomizer functions to be used – numpy.random.randint and numpy.random.uniform. The first one provides random integer values, the other one floating point values – both within a defined interval. The parameter “r_size” defines how many random numbers shall be created and put into an array. The code requires no further explanation.

Now, we define two methods to create the weight matrices for the connections

  • between the input layer L0 to the first hidden layer,
  • between further hidden layers and eventually between the last hidden layer and the output layer

As we allow for all possible connections between nodes the dimensions of the matrices are determined by the numbers of nodes in the connected neighboring layers. Each node of a layer L_n cam be connected to each node of layer L_(n+1). Our methods are:

 
    # Method to create the weight matrix between L0/L1
    # ------
    def _create_WM_Input(self):
        '''
        Method to create the input layer 
        The dimension will be taken from the structure of the input data 
        We need to fill self._w[0] with a matrix for conections of all nodes in L0 with all nodes in L1
        We fill the matrix with random numbers between [-1, 1] 
        '''
        # the num_nodes of layer 0 should already include the bias node 
        num_nodes_layer_0 = self._ay_nodes_layers[0]
        num_nodes_with_bias_layer_0 = num_nodes_layer_0 + 1 
        num_nodes_layer_1 = self._ay_nodes_layers[1] 
        
        # fill the matrix with random values 
        rand_low  = -1.0
        rand_high = 1.0
        rand_size = num_nodes_layer_1 * (num_nodes_with_bias_layer_0) 
        
        randomizer = 1 # method np.random.uniform   
        
        w0 = self._create_vector_with_random_values(rand_low, rand_high, rand_size, randomizer)
        w0 = w0.reshape(num_nodes_layer_1, num_nodes_with_bias_layer_0)
        
        # put the weight matrix into array of matrices 
        self._ay_w.append(w0.copy())
        print("\nShape of weight matrix between layers 0 and 1 " + str(self._ay_w[0].shape))


        
    # Method to create the weight-matrices for hidden layers 
    def _create_WM_Hidden(self):
        '''
        Method to create the weights of the hidden layers, i.e. between [L1, L2] and so on ... [L_n, L_out] 
        We fill the matrix with random numbers between [-1, 1] 
        '''
        
        # The "+1" is required due to range properties ! 
        rg_hidden_layers = range(1, self._n_hidden_layers + 1, 1)

        # for random operation 
        rand_low  = -1.0
        rand_high = 1.0
        
        for i in rg_hidden_layers: 
            print ("Creating weight matrix for layer " + str(i) + " to layer " + str(i+1) )
            
            num_nodes_layer = self._ay_nodes_layers[i] 
            num_nodes_with_bias_layer = num_nodes_layer + 1 
            
            # the number of the next layer is taken without the bias node!
            num_nodes_layer_
next = self._ay_nodes_layers[i+1]
            
            # assign random values  
            rand_size = num_nodes_layer_next * num_nodes_with_bias_layer   
            
            randomizer = 1 # np.random.uniform
            
            w_i_next = self._create_vector_with_random_values(rand_low, rand_high, rand_size, randomizer)   
            w_i_next = w_i_next.reshape(num_nodes_layer_next, num_nodes_with_bias_layer)
            
            # put the weight matrix into our array of matrices 
            self._ay_w.append(w_i_next.copy())
            print("Shape of weight matrix between layers " + str(i) + " and " + str(i+1) + " = " + str(self._ay_w[i].shape))

 
Three things may need explanation:

  • The first thing is the inclusion of a bias-node in all layers. A bias node only provides input to the next layer but does not receive input from a preceding layer. It gives an additional bias to the input a layer provides to a neighbor layer. (Actually, it provides an additional degree of freedom for the optimization process.) So, we include connections from a bias-node of a layer L_n to all nodes (without the bias-node) in layer L_(n+1).
  • The second thing is the reshaping of the vector of random numbers into a matrix. Of course, the dimension of the vector of random numbers must fit the product of the 2 matrix dimensions. Note that the first dimension always corresponds to number of nodes in layer L_(n+1) and the second dimension to the number of nodes in layer L_n (including the bias-node)!
    We need this special form to support the vectorized propagation properly later on.
  • The third thing is that we save our (Numpy) weight matrices as elements of a Python list.

In my opinion this makes the access to these matrices flexible and easy in the case of multiple hidden layers.

Setting the activation and output functions

We must set the activation and the output function. This is handled by a method “_check_and_set_activation_and_out_functions()”.

 
    def _check_and_set_activation_and_out_functions(self):
        # check for known activation function 
        try: 
            if (self._my_act_func not in self.__d_activation_funcs ): 
                raise ValueError
        except ValueError:
            print("The requested activation function " + self._my_act_func + " is not known!" )
            sys.exit()   
        # check for known output function 
        try: 
            if (self._my_out_func not in self.__d_output_funcs ): 
                raise ValueError
        except ValueError:
            print("The requested output function " + self._my_out_func + " is not known!" )
            sys.exit()   
             
        # set the function to variables for indirect addressing 
        self._act_func = self.__d_activation_funcs[self._my_act_func]
        self._out_func = self.__d_output_funcs[self._my_out_func]
        
        if self._b_print_test_data:
            z = 7.0
            print("\nThe activation function of the standard neurons was defined as \""  + self._my_act_func + '"') 
            print("The activation function gives for z=7.0:  " + str(self._act_func(z))) 
            print("\nThe output function of the neurons in the output layer was defined as \"" + self._my_out_func + "\"") 
            print("The output function gives for z=7.0:  " + str(self._out_func(z))) 
        

 
It requires not much of an explanation. For the time being we just rely on the given definition of the “__init__”-interface, which sets both to “sigmoid()”. The internal dictionaries __d_activation_funcs[] and __d_output_funcs[] provide the functions to
internal variables (indirect addressing).

Some test output

A test output for the enhanced code

 
        # ***********
        # operations 
        # ***********
        # check and handle input data 
        self._handle_input_data()

        # check consistency of the node-number list with the number of hidden layers (n_hidden)
        self._check_layer_and_node_numbers()
        # set node numbers for the input layer and the output layer
        self._set_nodes_for_input_output_layers() 
        self._show_node_numbers() 

        # create the weight matrix between input and first hidden layer 
        self._create_WM_Input() 
        # create weight matrices between the hidden layers and between tha last hidden and the output layer 
        self._create_WM_Hidden() 

        # check and set activation functions 
        self._check_and_set_activation_and_out_functions()

        print("\nStopping program regularily")
        sys.exit()

 

and a Jupyter cell

gives :

The shapes of the weight matrices correspond correctly to the numbers of nodes in the 4 layers defined. (Do not forget about the bias-nodes!).

Conclusion

We have reached a status where our ANN class can read in the MNIST dataset and set initial random values for the weights. This means that we can start to do some more interesting things. In the next article

A simple program for an ANN to cover the Mnist dataset – III – forward propagation

we shall program the “forward propagation”. We shall perform the propagation for a mini-batch of many data samples in one step. We shall see that this is a very simple task, which only requires a few lines of code.