A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests

Posted on 2. March 2020 by eremo

I continue my series on a Python code for a simple multi-layer perceptron [MLP]. During the course of the previous articles we have built a Python class “ANN” with methods to import the MNIST data set and handle forward propagation as well as error backward propagation [EBP]. We also had a deeper look at the mathematics of gradient descent and EBP for MLP training.

The code modifications in the last article enabled us to perform a first test on the MNIST dataset. This test gave us some confidence in our training algorithm: It seemed to converge and produce weights which did a relatively good job on analyzing the MNIST images.

We saw a slight tendency of overfitting. But an accuracy level of 96.5% on the test dataset showed that the MLP had indeed “learned” something during training. We needed around 1000 epochs to come to this point.

However, there are a lot of parameters controlling our grid structure and the learning behavior. Such parameters are often called “hyper-parameters“. To get a better understanding of our MLP we must start playing around with such parameters. In this article we shall concentrate on the parameter for (regression) regularization (called Lambda2 in the parameter interface of our class ANN) and then start varying the node numbers on the layers.

But before we start new test runs we add a statistical element to the training – namely the variation of the composition of our mini-batches (see the last article).

General hint: In all of the test runs below we used 4 CPU cores with libOpenBlas on a Linux system with an I7 6700K CPU.

Shuffling the contents of the mini-batches

Let us add some more parameters to the interface of class “ANN”:

shuffle_batches = True
print_period = 20

The first parameter
shall control whether we vary the composition of the mini-batches with each epoch. The second parameter controls for which period of the epochs we print out some intermediate data (costs, averaged error of last mini-batch).

    def __init__(self, 
                 my_data_set = "mnist", 
                 n_hidden_layers = 1, 
                 ay_nodes_layers = [0, 100, 0], # array which should have as much elements as n_hidden + 2
                 n_nodes_layer_out = 10,  # expected number of nodes in output layer 
                                                  
                 my_activation_function = "sigmoid", 
                 my_out_function        = "sigmoid",   
                 my_loss_function       = "LogLoss",   
                 
                 n_size_mini_batch = 50,  # number of data elements in a mini-batch 
                 
                 n_epochs      = 1,
                 n_max_batches = -1,  # number of mini-batches to use during epochs - > 0 only for testing 
                                      # a negative value uses all mini-batches 
                 
                 lambda2_reg = 0.1,     # factor for quadratic regularization term 
                 lambda1_reg = 0.0,     # factor for linear regularization term 
                 
                 vect_mode = 'cols', 
                 
                 learn_rate = 0.001,        # the learning rate (often called epsilon in textbooks) 
                 decrease_const = 0.00001,  # a factor for decreasing the learning rate with epochs
                 mom_rate   = 0.0005,       # a factor for momentum learning
                 
                 shuffle_batches = True,    # True: we mix the data for mini-batches in the X-train set at the start of each epoch
                 
                 print_period = 20,         # number of epochs for which to print the costs and the averaged error
                 
                 figs_x1=12.0, figs_x2=8.0, 
                 legend_loc='upper right',
                 
                 b_print_test_data = True
                 
                 ):
        '''
        Initialization of MyANN
        Input: 
            data_set: type of dataset; so far only the "mnist", "mnist_784" datsets are known 
                      We use this information to prepare the input data and learn about the feature dimension. 
                      This info is used in preparing the size of the input layer.     
            n_hidden_layers = number of hidden layers => between input layer 0 and output layer n 
            
            ay_nodes_layers = [0, 100, 0 ] : We set the number of nodes in input layer_0 and the output_layer to zero 
                              Will be set to real number afterwards by infos from the input dataset. 
                              All other numbers are used for the node numbers of the hidden layers.
            n_nodes_out_layer = expected number of nodes in the output layer (is checked); 
                                this number corresponds to the number of categories NC = number of labels to be distinguished
            
            my_activation_function : name of the activation function to use 
            my_out_function : name of the "activation" function of the last layer which produces the output values 
            my_loss_function : name of the "cost" or "loss" function used for optimization 
            
            n_size_mini_batch : Number of elements/samples in a mini-batch of training data 
                                The number of mini-batches will be calculated from this
            
            n_epochs : number of epochs to calculate during training
            n_max_batches : > 0: maximum of mini-batches to use during training 
                      
      < 0: use all mini-batches  
            
            lambda_reg2:    The factor for the quadartic regularization term 
            lambda_reg1:    The factor for the linear regularization term 
            
            vect_mode: Are 1-dim data arrays (vctors) ordered by columns or rows ?
            
            learn rate :     Learning rate - definies by how much we correct weights in the indicated direction of the gradient on the cost hyperplane.
            decrease_const:  Controls a systematic decrease of the learning rate with epoch number 
            mom_const:       Momentum rate. Controls a mixture of the last with the present weight corrections (momentum learning)
            
            shuffle_batches: True => vary composition of mini-batches with each epoch
            
            print_period:    number of periods between printing out some intermediate data 
                             on costs and the averaged error of the last mini-batch   
                       
            
            figs_x1=12.0, figs_x2=8.0 : Standard sizing of plots , 
            legend_loc='upper right': Position of legends in the plots
            
            b_print_test_data: Boolean variable to control the print out of some tests data 
             
         '''
        
        # Array (Python list) of known input data sets 
        self._input_data_sets = ["mnist", "mnist_784", "mnist_keras"]  
        self._my_data_set = my_data_set
        
        # X, y, X_train, y_train, X_test, y_test  
            # will be set by analyze_input_data 
            # X: Input array (2D) - at present status of MNIST image data, only.    
            # y: result (=classification data) [digits represent categories in the case of Mnist]
        self._X       = None 
        self._X_train = None 
        self._X_test  = None   
        self._y       = None 
        self._y_train = None 
        self._y_test  = None
        
        # relevant dimensions 
        # from input data information;  will be set in handle_input_data()
        self._dim_sets     = 0  
        self._dim_features = 0  
        self._n_labels     = 0   # number of unique labels - will be extracted from y-data 
        
        # Img sizes 
        self._dim_img      = 0 # should be sqrt(dim_features) - we assume square like images  
        self._img_h        = 0 
        self._img_w        = 0 
        
        # Layers
        # ------
        # number of hidden layers 
        self._n_hidden_layers = n_hidden_layers
        # Number of total layers 
        self._n_total_layers = 2 + self._n_hidden_layers  
        # Nodes for hidden layers 
        self._ay_nodes_layers = np.array(ay_nodes_layers)
        # Number of nodes in output layer - will be checked against information from target arrays
        self._n_nodes_layer_out = n_nodes_layer_out
        
        # Weights 
        # --------
        # empty List for all weight-matrices for all layer-connections
        # Numbering : 
        # w[0] contains the weight matrix which connects layer 0 (input layer ) to hidden layer 1 
        # w[1] contains the weight matrix which connects layer 1 (input layer ) to (hidden?) layer 2 
        self._li_w = []
        
        # Arrays for encoded output labels - will be set in _encode_all_mnist_labels()
        # -------------------------------
        self._ay_onehot = None
        self._ay_oneval = None
        
        # Known Randomizer methods ( 0: np.random.randint, 1: np.random.uniform )  
        # ------------------
        self.__ay_known_randomizers = [0, 1]

        # Types of activation functions and output functions 
        # ------------------
        self.__ay_activation_functions = ["sigmoid"] # later also relu 
        self.__ay_output_functions     = ["sigmoid"] # later also 
softmax 
        
        # Types of cost functions 
        # ------------------
        self.__ay_loss_functions = ["LogLoss", "MSE" ] # later also other types of cost/loss functions  


        # the following dictionaries will be used for indirect function calls 
        self.__d_activation_funcs = {
            'sigmoid': self._sigmoid, 
            'relu':    self._relu
            }
        self.__d_output_funcs = { 
            'sigmoid': self._sigmoid, 
            'softmax': self._softmax
            }  
        self.__d_loss_funcs = { 
            'LogLoss': self._loss_LogLoss, 
            'MSE': self._loss_MSE
            }  
        # Derivative functions 
        self.__d_D_activation_funcs = {
            'sigmoid': self._D_sigmoid, 
            'relu':    self._D_relu
            }
        self.__d_D_output_funcs = { 
            'sigmoid': self._D_sigmoid, 
            'softmax': self._D_softmax
            }  
        self.__d_D_loss_funcs = { 
            'LogLoss': self._D_loss_LogLoss, 
            'MSE': self._D_loss_MSE
            }  
        
        
        # The following variables will later be set by _check_and set_activation_and_out_functions()            
        
        self._my_act_func  = my_activation_function
        self._my_out_func  = my_out_function
        self._my_loss_func = my_loss_function
        self._act_func = None    
        self._out_func = None    
        self._loss_func = None    
        
        # number of data samples in a mini-batch 
        self._n_size_mini_batch = n_size_mini_batch
        self._n_mini_batches = None  # will be determined by _get_number_of_mini_batches()

        # maximum number of epochs - we set this number to an assumed maximum 
        # - as we shall build a backup and reload functionality for training, this should not be a major problem 
        self._n_epochs = n_epochs
        
        # maximum number of batches to handle ( if < 0 => all!) 
        self._n_max_batches = n_max_batches
        # actual number of batches 
        self._n_batches = None

        # regularization parameters
        self._lambda2_reg = lambda2_reg
        self._lambda1_reg = lambda1_reg
        
        # parameter for momentum learning 
        self._learn_rate = learn_rate
        self._decrease_const = decrease_const
        self._mom_rate   = mom_rate
        self._li_mom = [None] *  self._n_total_layers
        
        # shuffle data in X_train? 
        self._shuffle_batches = shuffle_batches
        
        # epoch period for printing 
        self._print_period = print_period
        
        # book-keeping for epochs and mini-batches 
        # -------------------------------
        # range for epochs - will be set by _prepare-epochs_and_batches() 
        self._rg_idx_epochs = None
        # range for mini-batches 
        self._rg_idx_batches = None
        # dimension of the numpy arrays for book-keeping - will be set in _prepare_epochs_and_batches() 
        self._shape_epochs_batches = None    # (n_epochs, n_batches, 1) 

        # list for error values at outermost layer for minibatches and epochs during training
        # we use a numpy array here because we can redimension it
        self._ay_theta = None
        # list for cost values of mini-batches during training 
        # The list will later be split into sections for epochs 
        self._ay_costs = None
        
        
        # Data elements for back propagation
        # ----------------------------------
        
        # 2-dim array of partial derivatives of the elements of an additive cost function 
        # The derivative is taken with respect to the output results a_j = ay_ANN_out[j]
        # The array dimensions account for nodes and sampls of a 
mini_batch. The array will be set in function 
        # self._initiate_bw_propagation()
        self._ay_delta_out_batch = None
        

        # parameter to allow printing of some test data 
        self._b_print_test_data = b_print_test_data

        # Plot handling 
        # --------------
        # Alternatives to resize plots 
        # 1: just resize figure  2: resize plus create subplots() [figure + axes] 
        self._plot_resize_alternative = 1 
        # Plot-sizing
        self._figs_x1 = figs_x1
        self._figs_x2 = figs_x2
        self._fig = None
        self._ax  = None 
        # alternative 2 does resizing and (!) subplots() 
        self.initiate_and_resize_plot(self._plot_resize_alternative)        
        
        
        # ***********
        # operations 
        # ***********
        
        # check and handle input data 
        self._handle_input_data()
        # set the ANN structure 
        self._set_ANN_structure()
        
        # Prepare epoch and batch-handling - sets ranges, limits num of mini-batches and initializes book-keeping arrays
        self._rg_idx_epochs, self._rg_idx_batches = self._prepare_epochs_and_batches()
        
        # perform training 
        start_c = time.perf_counter()
        self._fit(b_print=True, b_measure_batch_time=False)
        end_c = time.perf_counter()
        print('\n\n ------') 
        print('Total training Time_CPU: ', end_c - start_c) 
        print("\nStopping program regularily")

Both parameters affect our method “_fit()” in the following way :

    ''' -- Method to set the number of batches based on given batch size -- '''
    def _fit(self, b_print = False, b_measure_batch_time = False):
        '''
        Parameters: 
            b_print:                 Do we print intermediate results of the training at all? 
            b_print_period:          For which period of epochs do we print? 
            b_measure_batch_time:    Measure CPU-Time for a batch
        '''
        rg_idx_epochs  = self._rg_idx_epochs 
        rg_idx_batches = self._rg_idx_batches
        if (b_print):    
            print("\nnumber of epochs = " + str(len(rg_idx_epochs)))
            print("max number of batches = " + str(len(rg_idx_batches)))
       
        # loop over epochs
        for idxe in rg_idx_epochs:
            if (b_print and (idxe % self._print_period == 0) ):
                print("\n ---------")
                print("\nStarting epoch " + str(idxe+1))
                
            # sinmple adaption of the learning rate 
            self._learn_rate /= (1.0 + self._decrease_const * idxe)
            
            # shuffle indices for a variation of the mini-batches with each epoch
            if self._shuffle_batches:
                shuffled_index = np.random.permutation(self._dim_sets)
                self._X_train, self._y_train, self._ay_onehot = self._X_train[shuffled_index], self._y_train[shuffled_index], self._ay_onehot[:, shuffled_index]
                
            
            # loop over mini-batches
            for idxb in rg_idx_batches:
                if b_measure_batch_time: 
                    start_0 = time.perf_counter()
                # deal with a mini-batch
                self._handle_mini_batch(num_batch = idxb, num_epoch=idxe, b_print_y_vals = False, b_print = False)
                if b_measure_batch_time: 
                    end_0 = time.perf_counter()
                    print('Time_CPU for batch ' + str(idxb+1), end_0 - start_0) 
                
            if (b_print and (idxe % self._print_period == 0) ):
                print("\ntotal costs of mini_batch = ", self._ay_costs[idxe, idxb])
      
          print("avg total error of mini_batch = ", self._ay_theta[idxe, idxb])
                
        return None

Results for shuffling the contents of the mini-batches

With shuffling we expect a slightly broader variation of the costs and the averaged error. But the accuracy should no change too much in the end. We start a new test run with the following parameters:

     ay_nodes_layers = [0, 70, 30, 0], 
     n_nodes_layer_out = 10,  
     my_loss_function = "LogLoss",
     n_size_mini_batch = 500,
     n_epochs = 1500, 
     n_max_batches = 2000,  # small values only for test runs
     lambda2_reg = 0.1, 
     lambda1_reg = 0.0,      
     vect_mode = 'cols', 
     learn_rate = 0.0001,
     decrease_const = 0.000001,
     mom_rate   = 0.00005,  
     shuffle_batches = True,
     print_period = 20,         
...

If we look at the intermediate printout for the last mini-batch of some epochs and compare it to the results given in the last article, we see a stronger variation in the costs and averaged error. The reason is that the composition of last mini-batch of an epoch changes with every epoch.

number of epochs = 1500
max number of batches = 120
---------
Starting epoch 1
total costs of mini_batch =  1757.7650929607967
avg total error of mini_batch =  0.17086198431410954
---------
Starting epoch 61
total costs of mini_batch =  511.7001121819204
avg total error of mini_batch =  0.030287362041332373
---------
Starting epoch 121
total costs of mini_batch =  435.2513093033654
avg total error of mini_batch =  0.023445601362614754
----------
Starting epoch 181
total costs of mini_batch =  361.8665831722295
avg total error of mini_batch =  0.018540003201911136
---------
Starting epoch 241
total costs of mini_batch =  293.31230634431023
avg total error of mini_batch =  0.0138237366634751
---------
Starting epoch 301
total costs of mini_batch =  332.70394217467936
avg total error of mini_batch =  0.017697548541363246
---------
Starting epoch 361
total costs of mini_batch =  249.26400606039937
avg total error of mini_batch =  0.011765164578232358
---------
Starting epoch 421
total costs of mini_batch =  240.0503762160913
avg total error of mini_batch =  0.011650843329895542
---------
Starting epoch 481
total costs of mini_batch =  222.89422430417295
avg total error of mini_batch =  0.011503859412784031
---------
Starting epoch 541
total costs of mini_batch =  200.1195962051405
avg total error of mini_batch =  0.009962020519104173
---------
tarting epoch 601
total costs of mini_batch =  206.74753168607685
avg total error of mini_batch =  0.01067995191155135
---------
Starting epoch 661
total costs of mini_batch =  171.14090717705736
avg total error of mini_batch =  0.0077091934178393105
---------
Starting epoch 721
total costs of mini_batch =  158.44967190977957
avg total error of mini_batch =  0.0070760922760890735
---------
Starting epoch 781
total costs of mini_batch =  165.4047453537401
avg total error of mini_batch =  0.008622788115637027
---------
Starting epoch 841
total costs of mini_batch =  140.52762105883642
avg total error of mini_batch =  0.0067360505574077766
---------
Starting epoch 901
total costs of mini_batch =  163.9117184790982
avg total error of mini_batch =  0.007431666926365192
---------
Starting epoch 961
total costs of mini_batch =  126.05539161877512
avg total error of mini_batch =  0.005982378079899406
---------
Starting epoch 1021
total costs of mini_batch =  114.89943308334199
avg total error of mini_batch =  0.005122976288751798
---------
Starting epoch 1081
total costs of mini_batch =  117.22051220670932
avg total error of mini_batch 
=  0.005185936692097749
---------
Starting epoch 1141
total costs of mini_batch =  140.88969853048422
avg total error of mini_batch =  0.007665464508660714
---------
Starting epoch 1201
total costs of mini_batch =  113.27223303239667
avg total error of mini_batch =  0.0059791015452599705
---------
Starting epoch 1261
total costs of mini_batch =  105.55343407063131
avg total error of mini_batch =  0.005000503315756879
---------
Starting epoch 1321
total costs of mini_batch =  130.48116668827038
avg total error of mini_batch =  0.006287118265324945
---------
Starting epoch 1381
total costs of mini_batch =  109.04042315247389
avg total error of mini_batch =  0.005874339148860562
---------
Starting epoch 1441
total costs of mini_batch =  121.01379412127089
avg total error of mini_batch =  0.0065105907117289944
---------
Starting epoch 1461
total costs of mini_batch =  103.08774822996196
avg total error of mini_batch =  0.005299079778792264
---------
Starting epoch 1481
total costs of mini_batch =  106.21334182056928
avg total error of mini_batch =  0.005343967730134955
-------
Total training Time_CPU:  1963.8792177759988

Note that the averaged error values result from averaging of the absolute values of the errors of all records in a batch! The small numbers are not due to a cancelling of positive by negative deviations. A contribution to the error at an output node is given by the absloute value of the difference between the predicted real output value and the encoded target output value. We then first calculate an average over all output nodes (=10) per record and then average these values over all records of a batch. Such an “averaged error” gives us a first indication of the accuracy level reached.

Note that this averaged error is not becoming a constant. The last values in the above list indicate that we do not get much better with the error than 0.0055 on the training data. Our approached minimum points on the various cost hyperplanes of the mini-batches obviously hop around the global minimum on the hyperplane of the total costs. One of the reasons is the varying composition of the mini-batches; another reason is that the cost hyperplanes of the various mini-batches themselves are different from the hyperplane of the total costs of all records of the test data set. We see the effects of a mixture of “batch gradient descent” and “stochastic gradient descent” here; i.e., we do not get rid of stochastic elements even if we are close to a global minimum.

Still we observe an overall convergent behavior at around 1050 epochs. There our curves get relatively flat.

Accuracy values are:

total accuracy for training data = 0.9914
total accuracy for test data = 0.9611

So, this is pretty much the same as in our original run in the last article without data shuffling.

Dropping regularization

In the above run we had used a quadratic from of the regularization (often called Ridge regularization). In the next test run we shall drop regularization completely (Lambda2 = 0, Lambda1 = 0) and find out whether this hampers the generalization of our MLP and the resulting accuracy with respect to the test data set.

Resulting data for the last epochs of the test run are

Starting epoch 1001
total costs of mini_batch =  67.98542512352101
avg total error of mini_batch =  0.007449654093429594
---------
nStarting epoch 1051
total costs of mini_batch =  56.69195783294443
avg total error of mini_batch =  0.0063384571747725415
---------
Starting epoch 1101
total costs of mini_batch =  51.81035466850738
avg total error of mini_batch =  0.005939699354987233
---------
Starting epoch 1151
total costs of mini_batch =  52.23157716632318
avg total error of mini_batch =  0.006373981433882217
---------
Starting epoch 1201
total costs of mini_batch =  48.40298652277855
avg total error of mini_batch =  0.005653856253701317
---------
Starting epoch 1251
total costs of mini_batch =  45.00623540189525
avg total error of mini_batch =  0.005245339176038497
---------
Starting epoch 1301
total costs of mini_batch =  36.88409532579881
avg total error of mini_batch =  0.004600719544961844
---------
Starting epoch 1351
total costs of mini_batch =  36.53543045554845
avg total error of mini_batch =  0.003993852242709943
---------
Starting epoch 1401
total costs of mini_batch =  38.80422469954769
avg total error of mini_batch =  0.00464620714991714
---------
Starting epoch 1451
total costs of mini_batch =  42.39371261881638
avg total error of mini_batch =  0.005294796697150631
------
Total training Time_CPU:  2118.4527089519997

Note, by the way, that the absolute values of the costs depend on the regularization parameter; therefore we see somewhat lower values in the end than before. But the absolute cost values are not so important regarding the general convergence and the accuracy of the network reached.

We omit the plots and just give the accuracy values:

total accuracy for training data = 0.9874
total accuracy for test data = 0.9514

We get a slight drop in accuracy for the test data set – small (1%), but notable. It is interesting the even the accuracy on the training data became influenced.

Why might it be interesting to check the impact of the regularization?

We learn from literature that regularization helps with overfitting. Another aspect discussed e.g. by Jörg Frochte in his book “Maschinelles Lernen” is, whether we have enough training data to determine the vast amount of weights in complicated networks. He suggests on page 190 of his book to consider the number of weights in an MLP and compare it with the number of available data points.

He suggests that one may run into trouble if the difference between the number of weights (number of degrees of freedom) and the number of data records (number of independent information data) becomes too big. However, his test example was a rather limited one and for regression not classification. He also notes that if the data are well distributed may not be as big as assumed. If one thinks about it, one may also come to the question whether the real amount of data provided by the records is not by a factor of 10 larger – as we use 10 output values per record ….

Anyway, I think it is worthwhile to have a look at regularization.

Enlarging the regularization factor

We double the value of Lambda2: Lambda2 = 0.2.

Starting epoch 1251
total costs of mini_batch =  128.00827405482318
avg total error of mini_batch =  0.007276206815511017
---------
Starting epoch 1301
total costs of mini_batch =  107.62983581797556
avg total error of mini_batch =  0.005535653858885446
---------
Starting epoch 1351
total costs of mini_batch =  107.83630092292944
avg total error of mini_batch =  0.
005446805325519184
---------
Starting epoch 1401
total costs of mini_batch =  119.7648277329852
avg total error of mini_batch =  0.00729466852297802
---------
Starting epoch 1451
total costs of mini_batch =  106.74254206278933
avg total error of mini_batch =  0.005343124456075227

We get a slight improvement of the accuracy compared to our first run with data shuffling:

total accuracy for training data = 0.9950
total accuracy for test data = 0.964

So, regularization does have its advantages. I recommend to investigate the impact of this parameter closely, if you need to get the “last percentages” in generalization and accuracy for a given MLP-model.

Enlarging node numbers

We have 60000 data records in the training set. In our example case we needed to fix around 784*70 + 70*30 + 30*10 = 57280 weight values. This is pretty close to the total amount of training data (60000). What happens if we extend the number of weights beyond the number of training records?
E.g. 784*100 + 100*50 + 50*10 = 83900. Do we get some trouble?

The results are:

          
Starting epoch 1151
total costs of mini_batch =  109.77341617599176
avg total error of mini_batch =  0.005494982077591186
---------
Starting epoch 1201
total costs of mini_batch =  113.5293680548904
avg total error of mini_batch =  0.005352117137100675
---------
Starting epoch 1251
total costs of mini_batch =  116.26371170820423
avg total error of mini_batch =  0.0072335516486698
---------
Starting epoch 1301
total costs of mini_batch =  99.7268420386945
avg total error of mini_batch =  0.004850817052601995
---------
Starting epoch 1351
total costs of mini_batch =  101.16579732551999
avg total error of mini_batch =  0.004831600835072556
---------
Starting epoch 1401
total costs of mini_batch =  98.45208584213253
avg total error of mini_batch =  0.004796133492821962
---------
Starting epoch 1451
total costs of mini_batch =  99.279344780807
avg total error of mini_batch =  0.005289728162205425
------
Total training Time_CPU:  2159.5880855739997

Ooops, there appears a glitch in the data around epoch 1250. Such things happen! So, we should have a look at the graphs before we decide to take the weights of a special epoch for our MLP model!

But in the end, i.e. with the weights at epoch 1500 the accuracy values are:

total accuracy for training data = 0.9962
total accuracy for test data = 0.9632

So, we were NOT punished by extending our network, but we gained nothing worth the effort.

Now, let us go up with node numbers much more: 300 and 100 => 784*300 + 300*100 + 100*10 = 266200; ie. substantially more individual weights than training samples! First with Lambda2 = 0.2:

Starting epoch 1201
total costs of mini_batch =  104.4420759423322
avg total error of mini_batch =  0.0037985801450468246
---------
Starting epoch 1251
total costs of mini_batch =  102.80878647657674
avg total error of mini_batch =  0.003926855904089417
---------
Starting epoch 1301
total costs of mini_batch =  100.01189950545002
avg total error of mini_batch =  0.0037743225368465773
---------
Starting epoch 1351
total 
costs of mini_batch =  97.34294880936079
avg total error of mini_batch =  0.0035513092392408865
---------
Starting epoch 1401
total costs of mini_batch =  93.15432903284587
avg total error of mini_batch =  0.0032916082743134206
---------
Starting epoch 1451
total costs of mini_batch =  89.79127326241868
avg total error of mini_batch =  0.0033628384147655283
------
Total training Time_CPU:  4254.479082876998

total accuracy for training data = 0.9987
total accuracy for test data = 0.9630

So , much CPU-time for no gain!

Now, what happens if we set Lambda2 = 0? We get:

total accuracy for training data = 0.9955
total accuracy for test data = 0.9491

This is a small change around 1.4%! I would say:

In the special case of MNIST, a MLP-network with 2 hidden layers and a small learning rate we see neither a major problem regarding the amount of available data, nor a dramatic impact of the regularization. Regularization brings around a 1% gain in accuracy.

Reducing the node number on the hidden layers

Another experiment could be to reduce the number of nodes on the hidden layers. Let us go down to 40 nodes on L1 and 20 on L2, then to 30 and 20. Lambda2 is set to 0.2 in both runs. acc1 in the following listing means the accuracy for the training data, acc2 for the test data.

The results are:

40 nodes on L1, 20 nodes on L2, 1500 epochs => 1600 sec, acc1 = 0.9898, acc2 = 0.9578
30 nodes on L1, 20 nodes on L2, 1800 epochs => 1864 sec, acc1 = 0.9861, acc2 = 0.9513

We loose around 1% in accuracy!

The plot for 30, 20 nodes on layers L1, L2 shows that we got a convergence only beyond 1500 epochs.

Working with just one hidden layer

To get some more indications regarding efficiency let us now turn to networks with just one layer.
We investigate three situations with the following node numbers on the hidden layer: 100, 50, 30

The plot for 100 nodes on the hidden layer; we get convergence at around 1050 epochs.

Interestingly the CPU time for 100 nodes is with 1850 secs not smaller than for the network with 70 and 30 nodes on the hidden layers. As the dominant matrices are the ones connecting layer L0 and layer L1 this is quite understandable. (Note the CPU time also depends on the consumption of other jobs on the system.

The plots for 50 and 30 nodes on the hidden layer; we get convergence at around 1450 epochs. The
CPU time for 1500 epochs goes down to 1500 sec and XXX sec, respectively.

Plot for 50 nodes on the hidden layer:

Plot for 30 nodes:

We get the following accuracy values:

100 nodes, 1950 sec (1500 epochs), acc1 = 0.9947, acc2 = 0.9619,
50 nodes, 1600 sec (1500 epochs), acc1 = 0.9880, acc2 = 0.9566,
30 nodes, 1450 sec (1500 epochs), acc1 = 0.9780, acc2 = 0.9436

Well, now we see a drop in the accuracy by around 2% compared to our best cases. You have to decide yourself whether the gain in CPU time is worth it.

Note, by the way, that the accuracy value for 50 nodes is pretty close to the value S. Rashka got in his book “Python Machine Learning”. If you compare such values with your own runs be aware of the rather small learning rate (0.0001) and momentum rates (0.00005) I used. You can probably become faster with smaller learning rates. But then you may need another type of adaption for the learning rate compared to our simple linear one.

Conclusion

We saw that our original guess of a network with 2 hidden layers with 70 and 30 nodes was not a bad one. A network with just one layer with just 50 nodes or below does not give us the same accuracy. However, we neither saw a major improvement if we went to grids with 300 nodes on layer L1 and 100 nodes on layer L2. Despite some discrepancy between the number of weights in comparison to the number of test records we saw no significant loss in accuracy either – with or without regularization.
We also learned that we should use regularization (here of the quadratic type) to get the last 1% to 2% bit of accuracy on the test data in our special MNIST case.

In the next article

A simple program for an ANN to cover the Mnist dataset – XI – confusion matrix

we shall have a closer look at those MNIST number images where our MLP got problems.

A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests

Posted on 24. February 2020 by eremo

This brief article continues my series on a Python program for simple MLPs.

With the code for “Error Backward Propagation” we have come so far that we can perform first tests. As planned from the beginning we take the MNIST dataset as a test example. In a first approach we do not rebuild the mini-batches with each epoch. Neither do we vary the MLP setup.

What we are interested in is the question whether our artificial neural network converges with respect to final weight values during training. I.e. we want to see whether the training algorithm finds a reasonable global minimum on the cost hyperplane over the parameter space of all weights.

Test results

We use the following parameters

    ANN = myann.MyANN(my_data_set="mnist_keras", n_hidden_layers = 2, 
                 ay_nodes_layers = [0, 70, 30, 0], 
                 n_nodes_layer_out = 10,  

                 #my_loss_function = "MSE",
                 my_loss_function = "LogLoss",
                 n_size_mini_batch = 500,
     
                 n_epochs = 1500, 
                 n_max_batches = 2000,
                     
                 lambda2_reg = 0.1, 
                 lambda1_reg = 0.0,      

                 vect_mode = 'cols', 
                      
                 learn_rate = 0.0001,
                 decrease_const = 0.000001,
                 mom_rate   = 0.00005,  

                 figs_x1=12.0, figs_x2=8.0, 
                 legend_loc='upper right',
                 b_print_test_data = True
                 )

We use two hidden layers with 70 and 30 nodes. Note that we pick a rather small learning rate, which gets diminished even further. The number of epochs (1500) is quite high; with 4 CPU threads on an i7-6700K processor training takes around 40 minutes if started from a Jupyter notebook. (Our program is not yet optimized.)

I supplemented my code with some statements to print out data for the total costs and the averaged error for mini-batches. We get the
following series of data for the last mini-batch within every 50th epoch.

Starting epoch 1
total costs of mini_batch =  1757.1499806500506
avg total error of mini_batch =  0.17150838451718683
---------
Starting epoch 51
total costs of mini_batch =  532.5817607913532
avg total error of mini_batch =  0.034658195573307196
---------
Starting epoch 101
total costs of mini_batch =  436.67115522484687
avg total error of mini_batch =  0.023496458964699055
---------
Starting epoch 151
total costs of mini_batch =  402.381331415108
avg total error of mini_batch =  0.020836159866695597
---------
Starting epoch 201
total costs of mini_batch =  342.6296325512483
avg total error of mini_batch =  0.016565121882126693
---------
Starting epoch 251
total costs of mini_batch =  319.5995117831668
avg total error of mini_batch =  0.01533372596379799
---------
Starting epoch 301
total costs of mini_batch =  288.2201307002896
avg total error of mini_batch =  0.013799141451102469
---------
Starting epoch 351
total costs of mini_batch =  272.40526022720826
avg total error of mini_batch =  0.013499221607285198
---------
Starting epoch 401
total costs of mini_batch =  251.02417188663628
avg total error of mini_batch =  0.012696943309314687
---------
Starting epoch 451
total costs of mini_batch =  231.92274565746214
avg total error of mini_batch =  0.011152542115360705
---------
Starting epoch 501
total costs of mini_batch =  216.34658280101385
avg total error of mini_batch =  0.010692239864121407
---------
Starting epoch 551
total costs of mini_batch =  215.21791509166042
avg total error of mini_batch =  0.010999255316821901
---------
Starting epoch 601
total costs of mini_batch =  207.79645393570436
avg total error of mini_batch =  0.011123079894527222
---------
Starting epoch 651
total costs of mini_batch =  188.33965068903723
avg total error of mini_batch =  0.009868734062493835
---------
Starting epoch 701
total costs of mini_batch =  173.07625091642274
avg total error of mini_batch =  0.008942065167336382
---------
Starting epoch 751
total costs of mini_batch =  174.98264336120369
avg total error of mini_batch =  0.009714870291761567
---------
Starting epoch 801
total costs of mini_batch =  161.10229359519792
avg total error of mini_batch =  0.008844419847237179
---------
Starting epoch 851
total costs of mini_batch =  155.4186141788981
avg total error of mini_batch =  0.008244783820578621
---------
Starting epoch 901
total costs of mini_batch =  158.88876607392308
avg total error of mini_batch =  0.008970678691005138
---------
Starting epoch 951
total costs of mini_batch =  148.61870772570722
avg total error of mini_batch =  0.008124438423034456
---------
Starting epoch 1001
total costs of mini_batch =  152.16976618516264
avg total error of mini_batch =  0.009151413825781066
---------
Starting epoch 1051
total costs of mini_batch =  142.24802525081637
avg total error of mini_batch =  0.008297161160449798
---------
Starting epoch 1101
total costs of mini_batch =  137.3828515603569
avg total error of mini_batch =  0.007659755348989629
---------
Starting epoch 1151
total costs of mini_batch =  129.8472897084494
avg total error of mini_batch =  0.007254892176613871
---------
Starting epoch 1201
total costs of mini_batch =  139.30002497623792
avg total error of mini_batch =  0.007881199505625214
---------
Starting epoch 1251
total costs of mini_batch =  138.0323454321882
avg total error of mini_batch =  0.00807373439996105
---------
Starting epoch 1301
total costs of mini_batch =  117.95701570484076
avg total error of mini_batch =  0.006378071703153664
---------
Starting epoch 1351
total costs of mini_batch =  125.
71869046937177
avg total error of mini_batch =  0.0072716189968114265
---------
Starting epoch 1401
total costs of mini_batch =  117.3485602627176
avg total error of mini_batch =  0.006291182169676069
---------
Starting epoch 1451
total costs of mini_batch =  118.09317470010767
avg total error of mini_batch =  0.0066519021636054195
---------
Starting epoch 1491
total costs of mini_batch =  112.69566736699439
avg total error of mini_batch =  0.006151660466611035

 ------
Total training Time_CPU:  2430.0504785089997

We can display the results also graphically; a code fragemnt to do this, may look like follows in a Jupyter cell:

# Plotting 
# **********
num_epochs  = ANN._ay_costs.shape[0]
num_batches = ANN._ay_costs.shape[1]
num_tot = num_epochs * num_batches

ANN._ay_costs = ANN._ay_costs.reshape(num_tot)
ANN._ay_theta = ANN._ay_theta .reshape(num_tot)

#sizing
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 5

# Two figures 
# -----------
fig1 = plt.figure(1)
fig2 = plt.figure(2)

# first figure with two plot-areas with axes 
# --------------------------------------------
ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)

ax1_1.plot(range(len(ANN._ay_costs)), ANN._ay_costs)
ax1_1.set_xlim (0, num_tot+5)
ax1_1.set_ylim (0, 2000)
ax1_1.set_xlabel("epochs * batches (" + str(num_epochs) + " * " + str(num_batches) + " )")
ax1_1.set_ylabel("costs")

ax1_2.plot(range(len(ANN._ay_theta)), ANN._ay_theta)
ax1_2.set_xlim (0, num_tot+5)
ax1_2.set_ylim (0, 0.2)
ax1_2.set_xlabel("epochs * batches (" + str(num_epochs) + " * " + str(num_batches) + " )")
ax1_2.set_ylabel("averaged error")

We get:

This looks quite promising!
The vertical spread of both curves is due to the fact that we plotted cost and error data for each mini-batch. As we know the cost hyperplanes of the batches differ from each other and the hyperplane of the total costs for all training data. So do the cost and error values.

Secondary test: Rate of correctly and wrongly predicted values of the training and the test data sets

With the following code in a Jupyter cell we can check the relative percentage of correctly predicted MNIST numbers for the training data set and the test data set:

# ------ all training data 
# *************************
size_set = ANN._X_train.shape[0]

li_Z_in_layer_test  = [None] * ANN._n_total_layers
li_Z_in_layer_test[0] = ANN._X_train
# Transpose 
ay_Z_in_0T       = li_Z_in_layer_test[0].T
li_Z_in_layer_test[0] = ay_Z_in_0T
li_A_out_layer_test  = [None] * ANN._n_total_layers

ANN._fw_propagation(li_Z_in = li_Z_in_layer_test, li_A_out = li_A_out_layer_test, b_print = True) 
Result = np.argmax(li_A_out_layer_test[3], axis=0)
Error = ANN._y_train - Result 
acc_train = (np.sum(Error == 0)) / size_set
print ("total accuracy for training data = ", acc_train)

# ------ all test data 
# *************************
size_set = ANN._X_test.shape[0]

li_Z_in_layer_test  = [None] * ANN._n_total_layers
li_Z_in_layer_test[0] = ANN._X_test
# Transpose 
ay_Z_in_0T       = li_Z_in_layer_test[0].T
li_Z_in_layer_test[0] = ay_Z_in_0T
li_A_out_layer_test  = [None] * ANN._n_total_layers

ANN._fw_propagation(li_Z_in = li_Z_in_layer_test, li_A_out = li_A_out_layer_test, b_print = True) 
Result = np.argmax(li_A_out_layer_
test[3], axis=0)
Error = ANN._y_test - Result 
acc_test = (np.sum(Error == 0)) / size_set
print ("total accuracy for test data = ", acc_test)

“acc” stands for “accuracy”.

We get

total accuracy for training data = 0.9919
total accuracy for test data = 0.9645

So, there is some overfitting – but not much.

Conclusion

Our training algorithm and the error backward propagation seem to work.

The real question is, whether we produced the accuracy values really efficiently: In our example case we needed to fix around 786*70 + 70*30 + 30*10 = 57420 weight values. This is close to the total amount of training data (60000). A smaller network with just one hidden layer would require much fewer values – and the training would be much faster regarding CPU time.

So, in the next article
A simple program for an ANN to cover the Mnist dataset – X – mini-batch-shuffling and some more tests
we shall extend our tests to different setups of the MLP.

A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation

Posted on 23. February 2020 by eremo

I continue with my series on a Python program for coding small “Multilayer Perceptrons” [MLPs].

A simple program for an ANN to cover the Mnist dataset – VII – EBP related topics and obstacles
A simple program for an ANN to cover the Mnist dataset – VI – the math behind the „error back-propagation“
A simple program for an ANN to cover the Mnist dataset – V – coding the loss function
A simple program for an ANN to cover the Mnist dataset – IV – the concept of a cost or loss function
A simple program for an ANN to cover the Mnist dataset – III – forward propagation
A simple program for an ANN to cover the Mnist dataset – II – initial random weight values
A simple program for an ANN to cover the Mnist dataset – I – a starting point

After all the theoretical considerations of the last two articles we now start coding again. Our objective is to extend our methods for training the MLP on the MNIST dataset by methods which perform the “error back propagation” and the correction of the weights. The mathematical prescriptions were derived in the following PDF:
My PDF on “The math behind EBP”

When you study the new code fragments below remember a few things:
We are prepared to use mini-batches. Therefore, the cost functions will be calculated over the data records of each batch and all the matrix operations for back propagation will cover all batch-records in parallel. Training means to loop over epochs and mini-batches – in pseudo-code:

Loop over epochs
1. adjust learning rate,
2. check for convergence criteria,
3. Shuffle all data records in the test data set and build new mini-batches
Loop over mini-batches
1. Perform forward propagation for all records of the mini-batch
2. calculate and save the total cost value for each mini-batch
3. calculate and save an averaged error on the output layer for each mini-batch
4. perform error backward propagation on all records of the mini-batch to get the gradient of the cost function with respect to all weights
5. adjust all weights on all layers

As discussed in the last article: The cost hyperplane changes a bit with each mini-batch. If there is a good mixture of records in a batch then the form of its specific cost hyperplane will (hopefully) resemble the form of an overall cost hyperplane, but it will not be the same. By the second step in the outer loop we want to avoid that the same data records always get an influence on the gradients at the same position in the correction procedure. Both statistical elements help a bit to overcome dominant records and a non-equal distribution of test records. If we had only pictures for number 3 at the end of our MNIST data set we
may start learning “3” representations very well, but not other numbers. Statistical variation also helps to avoid side minima on the overall cost hyperplane for all data records of the test set.

We shall implement the second step and third step in the epoch loop in the next article – when we are sure that the training algorithm works as expected. So, at the moment we will stop our training only after a given number of epochs.

More input parameters

In the first articles we had build an __init__()-method to parameterize a training run. We have to include three more parameters to control the backward propagation.

learn_rate = 0.001, # the learning rate (often called epsilon in textbooks)
decrease_const = 0.00001, # a factor for decreasing the learning rate with epochs
mom_rate = 0.0005, # a factor for momentum learning

The first parameter controls by how much we change weights with the help of gradient values. See formula (93) in the PDF of article VI (you find the Link to the latest version in the last section of this article). The second parameter will give us an option to decrease the learning rate with the number of training epochs. Note that a constant decrease rate only makes sense, if we can be relatively sure that we do not end up in a side minimum of the cost function.

The third parameter is interesting: It will allow us to mix the presently calculated weight correction with the correction term from the last step. So to say: We extend the “momentum” of the last correction into the next correction. This helps us not to follow indicated direction changes on the cost hyperplanes too fast.

Some hygienic measures regarding variables

In the already written parts of the code we have used a prefix “ay_” for all variables which represent some vector or array like structure – including Python lists and Numpy arrays. For back propagation coding it will be more important to distinguish between lists and arrays. So, I changed the variable prefix for some important Python lists from “ay_” to “li_”. (I shall do it for all lists used in a further version). In addition I have changed the prefix for Python ranges to “rg_”. These changes will affect the contents and the interface of some methods. You will notice when we come to these methods.

The changed input()-method

Our modified __init__() function now looks like this:

    def __init__(self, 
                 my_data_set = "mnist", 
                 n_hidden_layers = 1, 
                 ay_nodes_layers = [0, 100, 0], # array which should have as much elements as n_hidden + 2
                 n_nodes_layer_out = 10,  # expected number of nodes in output layer 
                                                  
                 my_activation_function = "sigmoid", 
                 my_out_function        = "sigmoid",   
                 my_loss_function       = "LogLoss",   
                 
                 n_size_mini_batch = 50,  # number of data elements in a mini-batch 
                 
                 n_epochs      = 1,
                 n_max_batches = -1,  # number of mini-batches to use during epochs - > 0 only for testing 
                                      # a negative value uses all mini-batches 
                 
                 lambda2_reg = 0.1,     # factor for quadratic regularization term 
                 lambda1_reg = 0.0,     # factor for linear regularization term 
                 
                 vect_mode = 'cols', 
                 
                 learn_rate = 0.001,        # the learning rate (often called epsilon in textbooks) 
                 decrease_const = 0.00001,  # a factor for decreasing the learning rate with epochs
                 mom_rate  
 = 0.0005,       # a factor for momentum learning
                 
                 figs_x1=12.0, figs_x2=8.0, 
                 legend_loc='upper right',
                 
                 b_print_test_data = True
                 
                 ):
        '''
        Initialization of MyANN
        Input: 
            data_set: type of dataset; so far only the "mnist", "mnist_784" datsets are known 
                      We use this information to prepare the input data and learn about the feature dimension. 
                      This info is used in preparing the size of the input layer.     
            n_hidden_layers = number of hidden layers => between input layer 0 and output layer n 
            
            ay_nodes_layers = [0, 100, 0 ] : We set the number of nodes in input layer_0 and the output_layer to zero 
                              Will be set to real number afterwards by infos from the input dataset. 
                              All other numbers are used for the node numbers of the hidden layers.
            n_nodes_out_layer = expected number of nodes in the output layer (is checked); 
                                this number corresponds to the number of categories NC = number of labels to be distinguished
            
            my_activation_function : name of the activation function to use 
            my_out_function : name of the "activation" function of the last layer which produces the output values 
            my_loss_function : name of the "cost" or "loss" function used for optimization 
            
            n_size_mini_batch : Number of elements/samples in a mini-batch of training data 
                                The number of mini-batches will be calculated from this
            
            n_epochs : number of epochs to calculate during training
            n_max_batches : > 0: maximum of mini-batches to use during training 
                            < 0: use all mini-batches  
            
            lambda_reg2:    The factor for the quadartic regularization term 
            lambda_reg1:    The factor for the linear regularization term 
            
            vect_mode: Are 1-dim data arrays (vctors) ordered by columns or rows ?
            
            learn rate :     Learning rate - definies by how much we correct weights in the indicated direction of the gradient on the cost hyperplane.
            decrease_const:  Controls a systematic decrease of the learning rate with epoch number 
            mom_const:       Momentum rate. Controls a mixture of the last with the present weight corrections (momentum learning)
            
            figs_x1=12.0, figs_x2=8.0 : Standard sizing of plots , 
            legend_loc='upper right': Position of legends in the plots
            
            b_print_test_data: Boolean variable to control the print out of some tests data 
             
         '''
        
        # Array (Python list) of known input data sets 
        self._input_data_sets = ["mnist", "mnist_784", "mnist_keras"]  
        self._my_data_set = my_data_set
        
        # X, y, X_train, y_train, X_test, y_test  
            # will be set by analyze_input_data 
            # X: Input array (2D) - at present status of MNIST image data, only.    
            # y: result (=classification data) [digits represent categories in the case of Mnist]
        self._X       = None 
        self._X_train = None 
        self._X_test  = None   
        self._y       = None 
        self._y_train = None 
        self._y_test  = None
        
        # relevant dimensions 
        # from input data information;  will be set in handle_input_data()
        self._dim_sets     = 0  
        self._dim_features = 0  
        self._n_labels     = 0   # number of unique labels - will be extracted from y-data 
        
      
  # Img sizes 
        self._dim_img      = 0 # should be sqrt(dim_features) - we assume square like images  
        self._img_h        = 0 
        self._img_w        = 0 
        
        # Layers
        # ------
        # number of hidden layers 
        self._n_hidden_layers = n_hidden_layers
        # Number of total layers 
        self._n_total_layers = 2 + self._n_hidden_layers  
        # Nodes for hidden layers 
        self._ay_nodes_layers = np.array(ay_nodes_layers)
        # Number of nodes in output layer - will be checked against information from target arrays
        self._n_nodes_layer_out = n_nodes_layer_out
        
        # Weights 
        # --------
        # empty List for all weight-matrices for all layer-connections
        # Numbering : 
        # w[0] contains the weight matrix which connects layer 0 (input layer ) to hidden layer 1 
        # w[1] contains the weight matrix which connects layer 1 (input layer ) to (hidden?) layer 2 
        self._li_w = []
        
        # Arrays for encoded output labels - will be set in _encode_all_mnist_labels()
        # -------------------------------
        self._ay_onehot = None
        self._ay_oneval = None
        
        # Known Randomizer methods ( 0: np.random.randint, 1: np.random.uniform )  
        # ------------------
        self.__ay_known_randomizers = [0, 1]

        # Types of activation functions and output functions 
        # ------------------
        self.__ay_activation_functions = ["sigmoid"] # later also relu 
        self.__ay_output_functions     = ["sigmoid"] # later also softmax 
        
        # Types of cost functions 
        # ------------------
        self.__ay_loss_functions = ["LogLoss", "MSE" ] # later also othr types of cost/loss functions  


        # the following dictionaries will be used for indirect function calls 
        self.__d_activation_funcs = {
            'sigmoid': self._sigmoid, 
            'relu':    self._relu
            }
        self.__d_output_funcs = { 
            'sigmoid': self._sigmoid, 
            'softmax': self._softmax
            }  
        self.__d_loss_funcs = { 
            'LogLoss': self._loss_LogLoss, 
            'MSE': self._loss_MSE
            }  
        # Derivative functions 
        self.__d_D_activation_funcs = {
            'sigmoid': self._D_sigmoid, 
            'relu':    self._D_relu
            }
        self.__d_D_output_funcs = { 
            'sigmoid': self._D_sigmoid, 
            'softmax': self._D_softmax
            }  
        self.__d_D_loss_funcs = { 
            'LogLoss': self._D_loss_LogLoss, 
            'MSE': self._D_loss_MSE
            }  
        
        
        # The following variables will later be set by _check_and set_activation_and_out_functions()            
        
        self._my_act_func  = my_activation_function
        self._my_out_func  = my_out_function
        self._my_loss_func = my_loss_function
        self._act_func = None    
        self._out_func = None    
        self._loss_func = None    
        
        # number of data samples in a mini-batch 
        self._n_size_mini_batch = n_size_mini_batch
        self._n_mini_batches = None  # will be determined by _get_number_of_mini_batches()

        # maximum number of epochs - we set this number to an assumed maximum 
        # - as we shall build a backup and reload functionality for training, this should not be a major problem 
        self._n_epochs = n_epochs
        
        # maximum number of batches to handle ( if < 0 => all!) 
        self._n_max_batches = n_max_batches
        # actual number of batches 
        self._n_batches = None

        # regularization parameters
        self._lambda2_reg = lambda2_reg
        self._
lambda1_reg = lambda1_reg
        
        # parameter for momentum learning 
        self._learn_rate = learn_rate
        self._decrease_const = decrease_const
        self._mom_rate   = mom_rate
        self._li_mom = [None] *  self._n_total_layers
        
        # book-keeping for epochs and mini-batches 
        # -------------------------------
        # range for epochs - will be set by _prepare-epochs_and_batches() 
        self._rg_idx_epochs = None
        # range for mini-batches 
        self._rg_idx_batches = None
        # dimension of the numpy arrays for book-keeping - will be set in _prepare_epochs_and_batches() 
        self._shape_epochs_batches = None    # (n_epochs, n_batches, 1) 

        # list for error values at outermost layer for minibatches and epochs during training
        # we use a numpy array here because we can redimension it
        self._ay_theta = None
        # list for cost values of mini-batches during training 
        # The list will later be split into sections for epochs 
        self._ay_costs = None
        
        
        # Data elements for back propagation
        # ----------------------------------
        
        # 2-dim array of partial derivatives of the elements of an additive cost function 
        # The derivative is taken with respect to the output results a_j = ay_ANN_out[j]
        # The array dimensions account for nodes and sampls of a mini_batch. The array will be set in function 
        # self._initiate_bw_propagation()
        self._ay_delta_out_batch = None
        

        # parameter to allow printing of some test data 
        self._b_print_test_data = b_print_test_data

        # Plot handling 
        # --------------
        # Alternatives to resize plots 
        # 1: just resize figure  2: resize plus create subplots() [figure + axes] 
        self._plot_resize_alternative = 1 
        # Plot-sizing
        self._figs_x1 = figs_x1
        self._figs_x2 = figs_x2
        self._fig = None
        self._ax  = None 
        # alternative 2 does resizing and (!) subplots() 
        self.initiate_and_resize_plot(self._plot_resize_alternative)        
        
        
        # ***********
        # operations 
        # ***********
        
        # check and handle input data 
        self._handle_input_data()
        # set the ANN structure 
        self._set_ANN_structure()
        
        # Prepare epoch and batch-handling - sets ranges, limits num of mini-batches and initializes book-keeping arrays
        self._rg_idx_epochs, self._rg_idx_batches = self._prepare_epochs_and_batches()
        
        # perform training 
        start_c = time.perf_counter()
        self._fit(b_print=True, b_measure_batch_time=False)
        end_c = time.perf_counter()
        print('\n\n ------') 
        print('Total training Time_CPU: ', end_c - start_c) 
        print("\nStopping program regularily")
        sys.exit()

The extended method _set_ANN_structure()

I do not change method “_handle_input_data()”. However, I extend method “def _set_ANN_structure()” by a statement to initialize a list with momentum matrices for all layers.

    '''-- Method to set ANN structure --''' 
    def _set_ANN_structure(self):
        # check consistency of the node-number list with the number of hidden layers (n_hidden)
        self._check_layer_and_node_numbers()
        # set node numbers for the input layer and the output layer
        self._set_nodes_for_input_output_layers() 
        self._show_node_numbers() 
        
        # create the weight matrix between input and first hidden layer 
        self._create_WM_Input() 
        # create weight matrices between the 
hidden layers and between tha last hidden and the output layer 
        self._create_WM_Hidden() 
        
        # initialize momentum differences
        self._create_momentum_matrices()
        #print("\nLength of li_mom = ", str(len(self._li_mom)))
        
        # check and set activation functions 
        self._check_and_set_activation_and_out_functions()
        self._check_and_set_loss_function()
        
        return None

The following box shows the changed functions _create_WM_Input(), _create_WM_Hidden() and the new function _create_momentum_matrices():

    '''-- Method to create the weight matrix between L0/L1 --'''
    def _create_WM_Input(self):
        '''
        Method to create the input layer 
        The dimension will be taken from the structure of the input data 
        We need to fill self._w[0] with a matrix for conections of all nodes in L0 with all nodes in L1
        We fill the matrix with random numbers between [-1, 1] 
        '''
        # the num_nodes of layer 0 should already include the bias node 
        num_nodes_layer_0 = self._ay_nodes_layers[0]
        num_nodes_with_bias_layer_0 = num_nodes_layer_0 + 1 
        num_nodes_layer_1 = self._ay_nodes_layers[1] 
        
        # fill the matrix with random values 
        #rand_low  = -1.0
        #rand_high = 1.0
        rand_low  = -0.5
        rand_high = 0.5
        rand_size = num_nodes_layer_1 * (num_nodes_with_bias_layer_0) 
        
        randomizer = 1 # method np.random.uniform   
        
        w0 = self._create_vector_with_random_values(rand_low, rand_high, rand_size, randomizer)
        w0 = w0.reshape(num_nodes_layer_1, num_nodes_with_bias_layer_0)
        
        # put the weight matrix into array of matrices 
        self._li_w.append(w0)
        print("\nShape of weight matrix between layers 0 and 1 " + str(self._li_w[0].shape))
        
#
    '''-- Method to create the weight-matrices for hidden layers--''' 
    def _create_WM_Hidden(self):
        '''
        Method to create the weights of the hidden layers, i.e. between [L1, L2] and so on ... [L_n, L_out] 
        We fill the matrix with random numbers between [-1, 1] 
        '''
        
        # The "+1" is required due to range properties ! 
        rg_hidden_layers = range(1, self._n_hidden_layers + 1, 1)

        # for random operation 
        rand_low  = -1.0
        rand_high = 1.0
        
        for i in rg_hidden_layers: 
            print ("Creating weight matrix for layer " + str(i) + " to layer " + str(i+1) )
            
            num_nodes_layer = self._ay_nodes_layers[i] 
            num_nodes_with_bias_layer = num_nodes_layer + 1 
            
            # the number of the next layer is taken without the bias node!
            num_nodes_layer_next = self._ay_nodes_layers[i+1]
            
            # assign random values  
            rand_size = num_nodes_layer_next * num_nodes_with_bias_layer   
            
            randomizer = 1 # np.random.uniform
            
            w_i_next = self._create_vector_with_random_values(rand_low, rand_high, rand_size, randomizer)   
            w_i_next = w_i_next.reshape(num_nodes_layer_next, num_nodes_with_bias_layer)
            
            # put the weight matrix into our array of matrices 
            self._li_w.append(w_i_next)
            print("Shape of weight matrix between layers " + str(i) + " and " + str(i+1) + " = " + str(self._li_w[i].shape))
#
    '''-- Method to create and initialize matrices for momentum learning (differences) '''
    def _create_momentum_matrices(self):
        rg_layers = range(0, self._n_total_layers - 1)
        for i in rg_layers: 
r
            self._li_mom[i] = np.zeros(self._li_w[i].shape)
            #print("shape of li_mom[" + str(i) + "] = ", self._li_mom[i].shape)

The modified functions _fit() and _handle_mini_batch()

The _fit()-function is modified to include a systematic decrease of the learning rate.

    ''' -- Method to set the number of batches based on given batch size -- '''
    def _fit(self, b_print = False, b_measure_batch_time = False):
        
        rg_idx_epochs  = self._rg_idx_epochs 
        rg_idx_batches = self._rg_idx_batches
        if (b_print):    
            print("\nnumber of epochs = " + str(len(rg_idx_epochs)))
            print("max number of batches = " + str(len(rg_idx_batches)))
       
        # loop over epochs
        for idxe in rg_idx_epochs:
            if (b_print):
                print("\n ---------")
                print("\nStarting epoch " + str(idxe+1))
                
                self._learn_rate /= (1.0 + self._decrease_const * idxe)
            
            # loop over mini-batches
            for idxb in rg_idx_batches:
                if (b_print):
                    print("\n ---------")
                    print("\nDealing with mini-batch " + str(idxb+1))
                if b_measure_batch_time: 
                    start_0 = time.perf_counter()
                # deal with a mini-batch
                self._handle_mini_batch(num_batch = idxb, num_epoch=idxe, b_print_y_vals = False, b_print = False)
                if b_measure_batch_time: 
                    end_0 = time.perf_counter()
                    print('Time_CPU for batch ' + str(idxb+1), end_0 - start_0) 
                
                #if idxb == 100: 
                #    sys.exit() 
        
        return None

Note that the number of epochs is determined by an external parameter as an upper limit of the range “rg_idx_epochs”.

Method “_handle_mini_batch()” requires several changes: First we define lists which are required to save matrix data of the backward propagation. And, of course, we call a method to perform the BW propagation (see step 6 in the code). Some statements print shapes, if required. At step 7 of the code we correct the weights by using the learning rate and the calculated gradient of the loss function.

Note, that we mix the correction evaluated at the last batch-record with the correction evaluated for the present record! This corresponds to a simple form of momentum learning. We then have to save the present correction values, of course. Note that the list for momentum correction “li_mom” is, therefore, not deleted at the end of a mini-batch treatment !

In addition to saving the total costs per mini-batch we now also save a mean error at the output level. The average is done by the help of Numpy’s function numpy.average() for matrices. Remember, we build the average over errors at all output nodes and all records of the mini-batch.


    ''' -- Method to deal with a batch -- '''
    def _handle_mini_batch(self, num_batch = 0, num_epoch = 0, b_print_y_vals = False, b_print = False, b_keep_bw_matrices = True):
        '''
        For each batch we keep the input data array Z and the output data A (output of activation function!) 
        for all layers in Python lists
        We can use this as input variables in function calls - mutable variables are handled by reference values !
        We receive the A and Z data from propagation functions and proceed them to cost and gradient calculation functions
        
        As an initial step we define the Python lists li_Z_
in_layer and li_A_out_layer 
        and fill in the first input elements for layer L0  
        
        Forward propagation:
        --------------------
        li_Z_in_layer : List of layer-related 2-dim matrices for input values z at each node (rows) and all batch-samples (cols).
        li_A_out_layer: List of layer-related 2-dim matrices for output alues z at each node (rows) and all batch-samples (cols).
                        The output is created by Phi(z), where Phi represents an activation or output function 
        
        Note that the matrices in ay_A_out will be permanently extended by a row (over all samples) 
        to account for a bias node of each inner layer. This happens during FW propagation. 
        
        Note that the matrices ay_Z_in will be temporarily extended by a row (over all samples) 
        to account for a bias node of each inner layer. This happens during BW propagation.
        
        Backward propagation:
        -------------------- 
        li_delta_out:  Startup matrix for _out_delta-values at the outermost layer 
        li_grad_layer: List of layer-related matrices with gradient values for the correction of the weights 
        
        Depending on parameter "b_keep_bw_matrices" we keep 
            - a list of layer-related matrices D with values for the derivatives of the act./output functions
            - a list of layer-related matrices for the back propagated delta-values 
        in lists during back propagation. This can support error analysis. 
        
        All matrices in the lists are 2 dimensional with dimensions for nodes (rows) and training samples (cols) 
        All these lists be deleted at the end of the function to accelerate garbadge handling
        
        Input parameters: 
        ----------------
        num_epoch:     Number of present epoch
        num_batch:    Number of present mini-batch 
        '''
        # Layer-related lists to be filled with 2-dim Numpy matrices during FW propagation
        # ********************************************************************************
        li_Z_in_layer  = [None] * self._n_total_layers # List of matrices with z-input values for each layer; filled during FW-propagation
        li_A_out_layer = li_Z_in_layer.copy()          # List of matrices with results of activation/output-functions for each layer; filled during FW-propagation
        li_delta_out   = li_Z_in_layer.copy()          # Matrix with out_delta-values at the outermost layer 
        li_delta_layer = li_Z_in_layer.copy()          # List of the matrices for the BW propagated delta values 
        li_D_layer     = li_Z_in_layer.copy()          # List of the derivative matrices D containing partial derivatives of the activation/ouput functions 
        li_grad_layer  = li_Z_in_layer.copy()          # List of the matrices with gradient values for weight corrections
        
        if b_print: 
            len_lists = len(li_A_out_layer)
            print("\nnum_epoch = ", num_epoch, "  num_batch = ", num_batch )
            print("\nhandle_mini_batch(): length of lists = ", len_lists)
            self._info_point_print("handle_mini_batch: point 1")
        
        # Print some infos
        # ****************
        if b_print:
            self._print_batch_infos()
            self._info_point_print("handle_mini_batch: point 2")
        
        # Major steps for the mini-batch during one epoch iteration 
        # **********************************************************
        
        # Step 0: List of indices for data records in the present mini-batch
        # ******
        ay_idx_batch = self._ay_mini_batches[num_batch]
        
        # Step 1: Special preparation of the Z-input to the MLP's input Layer L0
        # ******
        # Layer L0: Fill in the input vector for the ANN's input layer L0 
        li_
Z_in_layer[0] = self._X_train[ay_idx_batch] # numpy arrays can be indexed by an array of integers
        if b_print:
            print("\nPropagation : Shape of X_in = li_Z_in_layer = ", li_Z_in_layer[0].shape)           
            #print("\nidx, expected y_value of Layer L0-input :")           
            #for idx in self._ay_mini_batches[num_batch]:
            #    print(str(idx) + ', ' + str(self._y_train[idx]) )
            self._info_point_print("handle_mini_batch: point 3")
        
        # Step 2: Layer L0: We need to transpose the data of the input layer 
        # *******
        ay_Z_in_0T       = li_Z_in_layer[0].T
        li_Z_in_layer[0] = ay_Z_in_0T
        if b_print:
            print("\nPropagation : Shape of transposed X_in = li_Z_in_layer = ", li_Z_in_layer[0].shape)           
            self._info_point_print("handle_mini_batch: point 4")
        
        # Step 3: Call forward propagation method for the present mini-batch of training records
        # *******
        # this function will fill the ay_Z_in- and ay_A_out-lists with matrices per layer
        self._fw_propagation(li_Z_in = li_Z_in_layer, li_A_out = li_A_out_layer, b_print = b_print) 
        
        if b_print:
            ilayer = range(0, self._n_total_layers)
            print("\n ---- ")
            print("\nAfter propagation through all " + str(self._n_total_layers) + " layers: ")
            for il in ilayer:
                print("Shape of Z_in of layer L" + str(il) + " = " + str(li_Z_in_layer[il].shape))
                print("Shape of A_out of layer L" + str(il) + " = " + str(li_A_out_layer[il].shape))
                if il < self._n_total_layers-1:
                    print("Shape of W of layer L" + str(il) + " = " + str(self._li_w[il].shape))
                    print("Shape of Mom of layer L" + str(il) + " = " + str(self._li_mom[il].shape))
            self._info_point_print("handle_mini_batch: point 5")
        
        
        # Step 4: Cost calculation for the mini-batch 
        # ********
        ay_y_enc = self._ay_onehot[:, ay_idx_batch]
        ay_ANN_out = li_A_out_layer[self._n_total_layers-1]
        # print("Shape of ay_ANN_out = " + str(ay_ANN_out.shape))
        
        total_costs_batch = self._calculate_loss_for_batch(ay_y_enc, ay_ANN_out, b_print = False)
        # we add the present cost value to the numpy array 
        self._ay_costs[num_epoch, num_batch] = total_costs_batch
        if b_print:
            print("\n total costs of mini_batch = ", self._ay_costs[num_epoch, num_batch])
            self._info_point_print("handle_mini_batch: point 6")
        print("\n total costs of mini_batch = ", self._ay_costs[num_epoch, num_batch])
        
        # Step 5: Avg-error for later plotting 
        # ********
        # mean "error" values - averaged over all nodes at outermost layer and all data sets of a mini-batch 
        ay_theta_out = ay_y_enc - ay_ANN_out
        if (b_print): 
            print("Shape of ay_theta_out = " + str(ay_theta_out.shape))
        ay_theta_avg = np.average(np.abs(ay_theta_out)) 
        self._ay_theta[num_epoch, num_batch] = ay_theta_avg 
        
        if b_print:
            print("\navg total error of mini_batch = ", self._ay_theta[num_epoch, num_batch])
            self._info_point_print("handle_mini_batch: point 7")
        print("avg total error of mini_batch = ", self._ay_theta[num_epoch, num_batch])
        
        
        # Step 6: Perform gradient calculation via back propagation of errors
        # ******* 
        self._bw_propagation( ay_y_enc = ay_y_enc, 
                              li_Z_in = li_Z_in_layer, 
                              li_A_out = li_A_out_layer, 
                              li_delta_out = li_delta_out, 
                              li_delta = li_delta_
layer,
                              li_D = li_D_layer, 
                              li_grad = li_grad_layer, 
                              b_print = b_print,
                              b_internal_timing = False 
                              ) 
        
        
        # Step 7: Adjustment of weights  
        # *******
        rg_layer=range(0, self._n_total_layers -1)
        for N in rg_layer:
            delta_w_N = self._learn_rate * li_grad_layer[N]
            self._li_w[N] -= ( delta_w_N + (self._mom_rate * self._li_mom[N]) )
            # save momentum
            self._li_mom[N] = delta_w_N
        
        # try to accelerate garbage handling
        # **************
        if len(li_Z_in_layer) > 0:
            del li_Z_in_layer
        if len(li_A_out_layer) > 0:
            del li_A_out_layer
        if len(li_delta_out) > 0:
            del li_delta_out
        if len(li_delta_layer) > 0:
            del li_delta_layer
        if len(li_D_layer) > 0:
            del li_D_layer
        if len(li_grad_layer) > 0:
            del li_grad_layer
            
        return None

Forward Propagation

The method for forward propagation remains unchanged in its structure. We only changed the prefix for the Python lists.

    ''' -- Method to handle FW propagation for a mini-batch --'''
    def _fw_propagation(self, li_Z_in, li_A_out, b_print= False):
        
        b_internal_timing = False
        
        # index range of layers 
        #    Note that we count from 0 (0=>L0) to E L(=>E) / 
        #    Careful: during BW-propgation we may need a correct indexing of lists filled during FW-propagation
        ilayer = range(0, self._n_total_layers-1)

        # propagation loop
        # ***************
        for il in ilayer:
            if b_internal_timing: start_0 = time.perf_counter()
            
            if b_print: 
                print("\nStarting propagation between L" + str(il) + " and L" + str(il+1))
                print("Shape of Z_in of layer L" + str(il) + " (without bias) = " + str(li_Z_in[il].shape))
            
            # Step 1: Take input of last layer and apply activation function 
            # ******
            if il == 0: 
                A_out_il = li_Z_in[il] # L0: activation function is identity 
            else: 
                A_out_il = self._act_func( li_Z_in[il] ) # use real activation function 
            
            # Step 2: Add bias node
            # ****** 
            A_out_il = self._add_bias_neuron_to_layer(A_out_il, 'row')
            # save in array     
            li_A_out[il] = A_out_il
            if b_print: 
                print("Shape of A_out of layer L" + str(il) + " (with bias) = " + str(li_A_out[il].shape))
            
            # Step 3: Propagate by matrix operation
            # ****** 
            Z_in_ilp1 = np.dot(self._li_w[il], A_out_il) 
            li_Z_in[il+1] = Z_in_ilp1
            
            if b_internal_timing: 
                end_0 = time.perf_counter()
                print('Time_CPU for layer propagation L' + str(il) + ' to L' + str(il+1), end_0 - start_0) 
        
        # treatment of the last layer 
        # ***************************
        il = il + 1
        if b_print:
            print("\nShape of Z_in of layer L" + str(il) + " = " + str(li_Z_in[il].shape))
        A_out_il = self._out_func( li_Z_in[il] ) # use the output function 
        li_A_out[il] = A_out_il
        if b_print:
            print("Shape of A_out of last layer L" + str(il) + " = " + str(li_A_out[il].shape))
        
        return None

nAddendum, 15.05.2020:
We shall later learn that the treatment of bias neurons can be done more efficiently. The present way of coding it reduces performance – especially at the input layer. See the article series starting with
MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation
for more information. At the present stage of our discussion we are, however, more interested in getting a working code first – and not so much in performance optimization.

Methods for Error Backward Propagation

In contrast to the recipe given in my PDF on the EBP-math we cannot calculate the matrices with the derivatives of the activation functions “ay_D” in advance for all layers. The reason was discussed in the last article VII: Some matrices have to be intermediately adjusted for a bias-neuron, which is ignored in the analysis of the PDF.

The resulting code of our method for EBP looks like given below:

 
    ''' -- Method to handle error BW propagation for a mini-batch --'''
    def _bw_propagation(self, 
                        ay_y_enc, li_Z_in, li_A_out, 
                        li_delta_out, li_delta, li_D, li_grad, 
                        b_print = True, b_internal_timing = False):
        
        # List initialization: All parameter lists or arrays are filled or to be filled by layers 
        # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch 
        
        # Initiate BW propagation - provide delta-matrices for outermost layer
        # *********************** 
        # Input Z at outermost layer E  (4 layers -> layer 3)
        ay_Z_E = li_Z_in[self._n_total_layers-1]
        # Output A at outermost layer E (was calculated by output function)
        ay_A_E = li_A_out[self._n_total_layers-1]
        
        # Calculate D-matrix (derivative of output function) at outmost the layer - presently only D_sigmoid 
        ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print )
        
        # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
        ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E, ay_D_E=ay_D_E, b_print=b_print) 
        
        # We check the shapes
        shape_theory = (self._n_nodes_layer_out, self._n_size_mini_batch)
        if (b_print and ay_delta_E.shape != shape_theory):
            print("\nError: Shape of ay_delta_E is wrong:")
            print("Shape = ", ay_delta_E.shape, "  ::  should be = ", shape_theory )
        if (b_print and ay_D_E.shape != shape_theory):
            print("\nError: Shape of ay_D_E is wrong:")
            print("Shape = ", ay_D_E.shape, "  ::  should be = ", shape_theory )
        
        # add the matrices to their lists ; li_delta_out gets only one element 
        idxE = self._n_total_layers - 1
        li_delta_out[idxE] = ay_delta_out_E # this happens only once
        li_delta[idxE]     = ay_delta_E
        li_D[idxE]         = ay_D_E
        li_grad[idxE]      = None    # On the outermost layer there is no gradient ! 
        
        if b_print:
            print("bw: Shape delta_E = ", li_delta[idxE].shape)
            print("bw: Shape D_E = ", ay_D_E.shape)
            self._info_point_print("bw_propagation: point bw_1")
        
        
        # Loop over all layers in reverse direction 
        # ******************************************
        # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2))
        if b_print:
            range_N_bw_layer_test = reversed(range(0,
 self._n_total_layers-1))   # must be -1 as the last element is not taken 
            rg_list = list(range_N_bw_layer_test) # Note this exhausts the range-object
            print("range_N_bw_layer = ", rg_list)
        
        range_N_bw_layer = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken 
        
        # loop over layers 
        for N in range_N_bw_layer:
            if b_print:
                print("\n N (layer) = " + str(N) +"\n")
            # start timer 
            if b_internal_timing: start_0 = time.perf_counter()
            
            # Back Propagation operations between layers N+1 and N 
            # *******************************************************
            # this method handles the special treatment of bias nodes in Z_in, too
            ay_delta_N, ay_D_N, ay_grad_N = self._bw_prop_Np1_to_N( N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False )
            
            if b_internal_timing: 
                end_0 = time.perf_counter()
                print('Time_CPU for BW layer operations ', end_0 - start_0) 
            
            # add matrices to their lists 
            li_delta[N] = ay_delta_N
            li_D[N]     = ay_D_N
            li_grad[N]= ay_grad_N
            #sys.exit()
        
        return

We first handle the necessary matrix evaluations for the outermost layer. We use two helper functions there to calculate the derivative of the output function with respect to the a-term [ _calculate_D_E() ] and to calculate the values for the “delta“-terms at all nodes and for all records [ _calculate_delta_E() ] according to the prescription in the PDF:

    
    ''' -- Method to calculate the matrix with the derivative values of the output function at outermost layer '''
    def _calculate_D_E(self, ay_Z_E, b_print= True):
        '''
        This method calculates and returns the D-matrix for the outermost layer
        The D matrix contains derivatives of the output function with respect to local input "z_j" at outermost nodes. 
        
        Returns
        ------
        ay_D_E:    Matrix with derivative values of the output function 
                   with respect to local z_j valus at the nodes of the outermost layer E
        Note: This is a 2-dim matrix over layer nodes and training samples of the mini-batch
        '''
        if self._my_out_func == 'sigmoid':
            ay_D_E = self._D_sigmoid(ay_Z=ay_Z_E)
        
        else:
            print("The derivative for output function " + self._my_out_func + " is not known yet!" )
            sys.exit()
        
        return ay_D_E

    ''' -- Method to calculate the delta_E matrix as a starting point of the backward propagation '''
    def _calculate_delta_E(self, ay_y_enc, ay_A_E, ay_D_E, b_print= False):
        '''
        This method calculates and returns the 2 delta-matrices for the outermost layer 
        
        Returns
        ------
        delta_E:     delta_matrix of the outermost layer (indicated by E)
        delta_out:   delta_out matrix => elements are local derivative values of the cost function 
                     with respect to the output "a_j" at an outermost node  
                     !!! delta_out will only be returned if calculable !!!
        
        Note: these are 2-dim matrices over layer nodes and training samples of the mini-batch
        '''
        
        if self._my_loss_func == 'LogLoss':
            # Calculate delta_S_E directly to avoid problems with zero denominators
            ay_delta_E = ay_A_E - ay_y_enc
            # delta_out is fetched but may be None 
            ay_delta_out, ay_D_
numerator, ay_D_denominator = self._D_loss_LogLoss(ay_y_enc, ay_A_E, b_print = False)
            
            # To be done: Analyze critical values in D_denominator 
            
            # Release variables explicitly 
            del ay_D_numerator
            del ay_D_denominator
            
        
        if self._my_loss_func == 'MSE':
            # First calculate delta_out and then the delta_E
            delta_out = self._D_loss_MSE(ay_y_enc, ay_A_E, b_print=False)
            # calculate delta_E via matrix multiplication 
            ay_delta_E = delta_out * ay_D_E
                    
        return ay_delta_E, ay_delta_out

Further required helper methods to calculate the cost functions and related derivatives are :

    ''' method to calculate the logistic regression loss function '''
    def _loss_LogLoss(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        Method which calculates LogReg loss function in a vectorized form on multidimensional Numpy arrays 
        '''
        b_test = False

        if b_print:
            print("From LogLoss: shape of ay_y_enc =  " + str(ay_y_enc.shape))
            print("From LogLoss: shape of ay_ANN_out =  " + str(ay_ANN_out.shape))
            print("LogLoss: ay_y_enc = ", ay_y_enc) 
            print("LogLoss: ANN_out = \n", ay_ANN_out) 
            print("LogLoss: log(ay_ANN_out) =  \n", np.log(ay_ANN_out) )

        # The following means an element-wise (!) operation between matrices of the same shape!
        Log1 = -ay_y_enc * (np.log(ay_ANN_out))
        # The following means an element-wise (!) operation between matrices of the same shape!
        Log2 = (1 - ay_y_enc) * np.log(1 - ay_ANN_out)
        
        # the next operation calculates the sum over all matrix elements 
        # - thus getting the total costs for all mini-batch elements 
        cost = np.sum(Log1 - Log2)
        
        #if b_print and b_test:
            # Log1_x = -ay_y_enc.dot((np.log(ay_ANN_out)).T)
            # print("From LogLoss: L1 =   " + str(L1))
            # print("From LogLoss: L1X =  " + str(L1X))
        
        if b_print: 
            print("From LogLoss: cost =  " + str(cost))
        
        # The total costs is just a number (scalar)
        return cost
#
    ''' method to calculate the derivative of the logistic regression loss function 
        with respect to the output values '''
    def _D_loss_LogLoss(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        This function returns the out_delta_S-matrix which is required to initialize the 
        BW propagation (EBP) 
        Note ANN_out is the A_out-list element ( a 2-dim matrix) for the outermost layer 
        In this case we have to take care of denominators = 0 
        '''
        D_numerator = ay_ANN_out - ay_y_enc
        D_denominator = -(ay_ANN_out - 1.0) * ay_ANN_out
        n_critical = np.count_nonzero(D_denominator < 1.0e-8)
        if n_critical > 0:
            delta_s_out = None
        else:
            delta_s_out = np.divide(D_numerator, D_denominator)
        return delta_s_out, D_numerator, D_denominator
#
    ''' method to calculate the MSE loss function '''
    def _loss_MSE(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        Method which calculates LogReg loss function in a vectorized form on multidimensional Numpy arrays 
        '''
        if b_print:
            print("From loss_MSE: shape of ay_y_enc =  " + str(ay_y_enc.shape))
            print("From loss_MSE: shape of ay_ANN_out =  " + str(ay_ANN_out.shape))
            #print("LogReg: ay_y_enc = ", ay_y_enc) 
            #print("LogReg: ANN_out = \n", ay_
ANN_out) 
            #print("LogReg: log(ay_ANN_out) =  \n", np.log(ay_ANN_out) )
        
        cost = 0.5 * np.sum( np.square( ay_y_enc - ay_ANN_out ) )

        if b_print: 
            print("From loss_MSE: cost =  " + str(cost))
        
        return cost
#
    ''' method to calculate the derivative of the MSE loss function 
        with respect to the output values '''
    def _D_loss_MSE(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        This function returns the out_delta_S - matrix which is required to initialize the 
        BW propagation (EBP) 
        Note ANN_out is the A_out-list element ( a 2-dim matrix) for the outermost layer
        In this case the output is harmless (no critical denominator) 
        '''
        delta_s_out = ay_ANN_out - ay_y_enc
        return delta_s_out

You see that we are a bit careful to avoid zero denominators for the Logarithmic loss function in all of our helper functions.

The check statements for shapes can be eliminated in a future version when we are sure that everything works correctly. Keeping the layer specific matrices during the handling of a mini-batch will be also good for potentially required error analysis in the beginning. In the end we only may keep the gradient-matrices and the layer specific matrices required to process the local calculations during back propagation.

Then we turn to loop over all other layers down to layer L0. The matrix operation to be done for all these layers are handled in a further method:

    
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
        '''
        BW-error-propagation bewtween layer N+1 and N 
        Inputs: 
            li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
            li_A_out: List of output A-matrices - values were calculated during FW-propagation
            li_delta: List of delta-matrices - values for outermost ölayer E to layer N+1 should exist 
        
        Returns: 
            ay_delta_N - delta-matrix of layer N (required in subsequent steps)
            ay_D_N     - derivative matrix for the activation function on layer N 
            ay_grad_N  - matrix with gradient elements of the cost fnction with respect to the weights on layer N 
        '''
        
        if b_print:
            print("Inside _bw_prop_Np1_to_N: N = " + str(N) )
        
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling betwen layer N and N+1 
        ay_W_N = self._li_w[N]
        shape_W_N   = ay_W_N.shape # due to bias node first dim is 1 bigger than Z-matrix 
        if b_print:
            print("shape of W_N = ", shape_W_N )

        # delta-matrix of layer N+1
        ay_delta_Np1 = li_delta[N+1]
        shape_delta_Np1 = ay_delta_Np1.shape 

        # !!! Add intermediate row (for bias) to Z_N !!!
        ay_Z_N = li_Z_in[N]
        shape_Z_N_orig = ay_Z_N.shape
        ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        shape_Z_N = ay_Z_N.shape # dimensions should fit now with W- and A-matrix 
        
        # Derivative matrix for the activation function (with extra bias node row)
        #    can only be calculated now as we need the z-values
        ay_D_N = self._calculate_D_N(ay_Z_N)
        shape_D_N = ay_D_N.shape 
        
        ay_A_N = li_A_out[N]
        shape_A_N = ay_A_N.shape
        
        # print shapes 
        if b_print:
            print("shape of W_N = ", shape_W_N)
            print("
shape of delta_(N+1) = ", shape_delta_Np1)
            print("shape of Z_N_orig = ", shape_Z_N_orig)
            print("shape of Z_N = ", shape_Z_N)
            print("shape of D_N = ", shape_D_N)
            print("shape of A_N = ", shape_A_N)
        
        
        # Propagate delta
        # **************
        if li_delta[N+1] is None:
            print("BW-Prop-error:\n No delta-matrix found for layer " + str(N+1) ) 
            sys.exit()
            
        # Check shapes for np.dot()-operation - here for element [0] of both shapes - as we operate with W.T !
        if ( shape_W_N[0] != shape_delta_Np1[0]): 
            print("BW-Prop-error:\n shape of W_N [", shape_W_N, "]) does not fit shape of delta_N+1 [", shape_delta_Np1, "]" )
            sys.exit() 
        
        # intermediate delta 
        # ~~~~~~~~~~~~~~~~~~
        ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
        shape_delta_w_N = ay_delta_w_N.shape
        
        # Check shapes for element wise *-operation !
        if ( shape_delta_w_N != shape_D_N ): 
            print("BW-Prop-error:\n shape of delta_w_N [", shape_delta_w_N, "]) does not fit shape of D_N [", shape_D_N, "]" )
            sys.exit() 
        
        # final delta 
        # ~~~~~~~~~~~
        ay_delta_N = ay_delta_w_N * ay_D_N
        # reduce dimension again 
        ay_delta_N = ay_delta_N[1:, :]
        shape_delta_N = ay_delta_N.shape
        
        # Check dimensions again - ay_delta_N.shape should fit shape_Z_in_orig
        if shape_delta_N != shape_Z_N_orig: 
            print("BW-Prop-error:\n shape of delta_N [", shape_delta_N, "]) does not fit original shape Z_in_N [", shape_Z_N_orig, "]" )
            sys.exit() 
        
        if N > 0:
            shape_W_Nm1 = self._li_w[N-1].shape 
            if shape_delta_N[0] != shape_W_Nm1[0] : 
                print("BW-Prop-error:\n shape of delta_N [", shape_delta_N, "]) does not fit shape of W_Nm1 [", shape_W_Nm1, "]" )
                sysexit() 
        
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        # check shapes 
        if shape_delta_Np1[1] != shape_A_N[1]:
            print("BW-Prop-error:\n shape of delta_Np1 [", shape_delta_Np1, "]) does not fit shape of A_N [", shape_A_N, "] for matrix multiplication" )
            sys.exit() 
        
        # calculate gradient             
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        ay_grad_N[:, 1:] += (self._li_w[N][:, 1:] * self._lambda2_reg + np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg) 
        
        #
        # Check shape 
        shape_grad_N = ay_grad_N.shape
        if shape_grad_N != shape_W_N:
            print("BW-Prop-error:\n shape of grad_N [", shape_grad_N, "]) does not fit shape of W_N [", shape_W_N, "]" )
            sys.exit() 
        
        # print shapes 
        if b_print:
            print("shape of delta_N = ", shape_delta_N)
            print("shape of grad_N = ", shape_grad_N)
            
            print(ay_grad_N)
        
        return ay_delta_N, ay_D_N, ay_grad_N

This function does more or less exactly what we have requested by our theoretical analysis in the last two articles. Note the intermediate handling of bias nodes! Note also that bias nodes are NOT regarded in regularization terms L1 and L2! The function to calculate the derivative of the activation function is:

   
#
    ''' -- Method to calculate the matrix with the derivative values of the output function at outermost layer '''

    def _calculate_D_N(self, ay_Z_N, b_print= False):
        '''
        This method calculates and returns the D-matrix for the outermost layer
        The D matrix contains derivatives of the output function with respect to local input "z_j" at outermost nodes. 
        
        Returns
        ------
        ay_D_E:    Matrix with derivative values of the output function 
                   with respect to local z_j valus at the nodes of the outermost layer E
        Note: This is a 2-dim matrix over layer nodes and training samples of the mini-batch
        '''
        if self._my_out_func == 'sigmoid':
            ay_D_E = self._D_sigmoid(ay_Z = ay_Z_N)
        
        else:
            print("The derivative for output function " + self._my_out_func + " is not known yet!" )
            sys.exit()
        
        return ay_D_E

The methods to calculate regularization terms for the loss function are:

   
#
    ''' method do calculate the quadratic regularization term for the loss function '''
    def _regularize_by_L2(self, b_print=False): 
        '''
        The L2 regularization term sums up all quadratic weights (without the weight for the bias) 
        over the input and all hidden layers (but not the output layer)
        The weight for the bias is in the first column (index 0) of the weight matrix - 
        as the bias node's output is in the first row of the output vector of the layer 
        '''
        ilayer = range(0, self._n_total_layers-1) # this excludes the last layer 
        L2 = 0.0
        for idx in ilayer:
            L2 += (np.sum( np.square(self._li_w[idx][:, 1:])) ) 
        L2 *= 0.5 * self._lambda2_reg
        if b_print: 
            print("\nL2: total L2 = " + str(L2) )
        return L2 
#
    ''' method do calculate the linear regularization term for the loss function '''
    def _regularize_by_L1(self, b_print=False): 
        '''
        The L1 regularization term sums up all weights (without the weight for the bias) 
        over the input and all hidden layers (but not the output layer
        The weight for the bias is in the first column (index 0) of the weight matrix - 
        as the bias node's output is in the first row of the output vector of the layer 
        '''
        ilayer = range(0, self._n_total_layers-1) # this excludes the last layer 
        L1 = 0.0
        for idx in ilayer:
            L1 += np.sum(np.abs( self._li_w[idx][:, 1:]))
        L1 *= 0.5 * self._lambda1_reg
        if b_print:
            print("\nL1: total L1 = " + str(L1))
        return L1

Addendum, 15.05.2020:
Also the BW-propagation code presented here will later be the target of optimization steps. We shall see that it – despite working correctly – can be criticized regarding efficiency at several points. See again the article series starting with
MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation.

Conclusion

We have extended our set of methods quite a bit. At the core of the operations we perform matrix operations which are supported by the Openblas library on a Linux system with multiple CPU cores. In the next article

A simple program for an ANN to cover the Mnist dataset – IX – First Tests

we shall test the convergence of our training for the MNIST data set. We shall see that a MLP with two hidden layers with 70 and 30 nodes can give us a convergence
of the averaged relative error down to 0.006 after 1000 epochs on the test data. However, we have to analyze such results for overfitting. Stay tuned …

Links

My PDF on “The math behind EBP”

A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles

Posted on 26. January 2020 by eremo

I continue with my series about a Python program to build simple MLPs:

A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“
A simple program for an ANN to cover the Mnist dataset – V – coding the loss function
A simple program for an ANN to cover the Mnist dataset – IV – the concept of a cost or loss function
A simple program for an ANN to cover the Mnist dataset – III – forward propagation
A simple program for an ANN to cover the Mnist dataset – II – initial random weight values
A simple program for an ANN to cover the Mnist dataset – I – a starting point

On our tour we have already learned a lot about multiple aspects of MLP usage. I name forward propagation, matrix operations, loss or cost functions. In the last article of this series
A simple program for an ANN to cover the Mnist dataset – VI – the math behind the „error back-propagation“
I tried to explain some of the math which governs “Error Back Propagation” [EBP]. See the PDF attached to the last article.

EBP is an algorithm which applies the “Gradient Descent” method for the optimization of the weights of a Multilayer Perceptron [MLP]. “Gradient Descent” itself is a method where we step-wise follow short tracks perpendicular to contour lines of a hyperplane in a multidimensional parameter space to hopefully approach a global minimum. A step means a change of of parameter values – in our context of weights. In our case the hyperplane is the surface formed by the cost function over the weights. If we have m weights we get a hyperplane in an (m+1) dimensional space.

To apply gradient descent we have to calculate partial derivatives of the cost function with respect to the weights. We have discussed this in detail in the last article. If you read the PDF you certainly have noted: Most of the time we shall execute matrix operations to provide the components of the weight gradient. Of course, we must guarantee that the matrices’ dimensions fit each other such that the required operations – as an element-wise multiplication and the numpy.dot(X,Y)-operation – become executable.

Unfortunately, there are some challenges regarding this point which we have not covered, yet. One objective of this article is to get prepared for these potential problems before we start coding EBP.

Another point worth discussing is: Is there really just one cost function when we use mini-batches in combination with gradient descent? Regarding the descriptions and the formulas in the PDF of the last article this was and is not fully clear. We only built sums there over cost contributions of all the records in a mini-batch. We did NOT use a loss function which assigned to costs to deviations of the predicted result (after forward propagation) from known values for all training data records.

This triggers the question what our
code in the end really does if and when it works with mini-batches during weight optimization … We start with this point.

In the following I try to keep the writing close to the quantity notations in the PDF. Sorry for a bad display of the δs in HTML.

Gradient descent and mini-batches – one or multiple cost functions?

Regarding the formulas given so far, we obviously handle costs and gradient descent batch-wise. I.e. each mini-batch has its own cost function – with fewer contributions than a cost function for all records would have. Each cost function has (hopefully) a defined position of a global minimum in the weights’ parameter space. Taking this into consideration the whole mini-batch approach is obviously based on some conceptually important assumptions:

The basic idea is that the positions of the global minima of all the cost-functions for the different batches do not deviate too much from each other in the basic parameter space.
If we additionally defined a cost function for all test data records (over all batches) then this cost function should display a global minimum positioned in between the ones of the batches’ cost functions.
This also means that there should be enough records in each batch with a really statistical distribution and no specialties associated with them.
Contour lines and gradients on the hyperplanes defined by the loss functions will differ from each other. On average over all mini-batches this should not hinder convergence into a common optimum.

To understand the last point let us assume that we have a batch for MNIST dataset where all records of handwritten digits show a tendency to be shifted to the left border of the basic 28×28 pixel frames. Then this batch would probably give us other weights than other batches.

To get a deeper understanding, let us take only two batches. By chance their cost functions may deviate a bit. In the plots below I have just simulated this by two assumed “cost” functions – each forming a hyperplane in 3 dimensions over only two parameter (=weight) dimensions x and y. You see that the “global” minima of the blue and the red curve deviate a bit in their position.

The next graph shows the sum, i.e. the full “cost function”, in green in comparison to the (vertically shifted and scaled) original functions.

Also here you clearly see the differences in the minimas’ positions. What does this mean for gradient descent?

Firstly, the contour lines on the total cost function would deviate from the ones on the cost function hyperplanes of our 2 batches. So would the directions of the different gradients at the point presently reached in the parameter space during optimization! Working with batches therefore means jumping around on the surface of the total cost function a bit erratically and not precisely along the direction of steepest descent there. By the way: This behavior can be quite helpful to overcome local minima.

Secondly, in our simplified example we would in the end not converge completely, but jump or circle around the minimum of the total cost function. Reason: Each batch forces the weight corrections for x,y into different directions namely those of
its own minimum. So, a weight correction induced by one bath would be countered by corrections imposed by the optimization for the other batch. (Regarding MNIST it would e.g. be interesting to run a batch with handwritten digits of Europeans against a batch with digits written by Americans and see how the weights differ after gradient descent has converged for each batch.)

This makes us understand multiple things:

Mini-batches should be built with a statistical distribution of records and their composition should be changed statistically from epoch to epoch.
We need a criterion to stop iterating over too many epochs senselessly.
We should investigate whether the number and thus the size of mini-batches influences the results of EBP.
At the end of an optimization run we could invest in some more iterations not for the batches, but for the full cost function of all training records and see if we can get a little deeper into the minimum of this total cost function.
We should analyze our batches – if we keep them up and do not create them statistically anew at the beginning of each epoch – for special data records whose properties are off of the normal – and maybe eliminate those data records.

Repetition: Why Back-propagation of 2 dimensional matrices and not vectors?

The step wise matrix operations of EBP are to be performed according to a scheme with the following structure:

On a given layer N apply a layer specific matrix “_NW.T” (depending on the weights there) by some operational rule on some matrix “_(N+1)δ_S“, which contains some data already calculated for layer (N+1).
Take the results and modify it properly by multiplying it element-wise with some other matrix _ND (containing derivative expressions for the activation function) until you get a new _Nδ_S.
Get partial derivatives of the cost function with respect to the weights on layer (N-1) by a further matrix operation of _Nδ_S on a matrix with output values _(N-1)A.T_S on layer (N-1).
Proceed to the next layer in backward direction.

The input into this process is a matrix of error-dependent quantities, which are defined at the output layer. These values are then back-propagated in parallel to the inner layers of our MLP.

Now, why do we propagate data matrices and not just data vectors? Why are we allowed to combine so many different multiplications and summations described in the last article when we deal with partial derivatives with respect to variables deep inside the network?

The answer to the first question is numerical efficiency. We operate on all data records of a mini-batch in parallel; see the PDF. The answer to the second question is 2-fold:

We are allowed to perform so many independent operations because of the linear structure of our cost-functions with respect to contributions coming from the records of a mini-batch and the fact that we just apply linear operations between layers during forward propagation. All contributions – however non-linear each may be in itself – are just summed up. And propagation itself between layers is defined to be linear.
The only non-linearity occurring – namely in the form of non-linear activation functions – is to be applied just on layers. And there it works only node-wise! We do not
couple values for nodes on one and the same layer.

In this sense MLPs are very simple by definition – although they may look complex! (By the way and if you wonder why MLPs are nevertheless so powerful: One reason has to do with the “Universal Approximation Theorem”; see the literature hint at the end.)

Consequence of the simplicity: We can deal with δ-values (see the PDF) for both all nodes of a layer and all records of a mini-batch in parallel.

Results derived in the last article would change dramatically if we had rules that coupled the Z- or A-values of different nodes! E.g. if the squared value at node 7 in layer X must always be the sum of squared values at nodes 5 an 6. Believe me: There are real networks in this world where such a type of node coupling occurs – not only in physics.

Note: As we have explained in the PDF, the nodes of a layer define one dimension of the _Nδ_S“-matrices,
the number of mini-batch records the other. The latter remains constant. So, during the process the δ-matrices change only one of their 2 dimensions.

Some possible pitfalls to tackle before EBP-coding

Now, my friends, we can happily start coding … Nope, there are actually some minor pitfalls, which we have to explain first.

Special cost-, activation- and output-functions

I refer to the PDF mentioned above and its formulas. The example explained there referred to the “Log Loss” function, which we took as an example cost function. In this case the _outδ_S and the ₃δ_S-terms at the nodes of the outermost layer turned out to be quite simple. See formula (21), (22), (26) and (27) in the PDF.

However, there may be other cost functions for which the derivative with respect to the output vector “a” at the outermost nodes is more complicated.

In addition we may have other output or activation functions than the sigmoid function discussed in the PDF’s example. Further, the output function may differ from the activation function at inner layers. Thus, we find that the partial derivatives of these functions with respect to their variables “z” must be calculated explicitly and as needed for each layer during back propagation; i.e., we have to provide separate and specific functions for the provision of the required derivatives.

At the outermost layer we apply the general formulas (84) to (88) with matrix _ED containing derivatives of the output-function _Eφ(z) with respect to the input z to find _Eδ_S with E marking the outermost layer. Afterwards, however, we apply formula (92) – but this time with D-elements referring to derivatives of the standard activation-function φ used at nodes of inner layers.

The special case of the Log Loss function and other loss functions with critical denominators in their derivative

Formula (21) shows something interesting for the quantity _outδ_S, which is a starting point for backward propagation: a denominator depending on critical factors, which directly involve output “a” at the outer nodes or “a” in a difference term. But in our one-hot-approach “a” may become zero or come close to it – during training by accident or by convergence! This is a dangerous thing; numerically we absolutely want to avoid any division by zero or by small numbers close to the numerical accuracy of a programming language.

What mathematically saves us in the special case of Log Loss are formulas (26) and (27), where due to some “magic” the dangerous denominator is cancelled by a corresponding factor in the numerator when we evaluate _Eδ_S.

In the general case, however,
we must investigate what numerical dangers the functional form of the derivative of the loss function may bring with it. In the end there are two things we should do:

Build a function to directly calculate _Eδ_S and put as much mathematical knowledge about the involved functions and operations into it as possible, before employing an explicit calculation of values of the cost function’s derivative.
Check the involved matrices, whose elements may appear in denominators, for elements which are either zero or close to it in the sense of the achievable accuracy.

For our program this means: Whether we calculate the derivative of a cost function to get values for “_outδ_S” will depend on the mathematical nature of the cost function. In case of Log Loss we shall avoid it. In case of MSE we shall perform the numerical operation.

Handling of bias nodes

A further complication of our aspired coding has its origin in the existence of bias nodes on every inner layer of the MLP. A bias node of a layer adds an additional degree of freedom whilst adjusting the layer’s weights; a bias node has no input, it produces only a constant output – but is connected with weights to all normal nodes of the next layer.

Some readers who are not so familiar with “artificial neural networks” may ask: Why do we need bias nodes at all?

Well, think about a simple matrix operation on a 2 dim-vector; it changes its direction and length. But if we want to approximate a function for regression or a separation hyperplanes for classification by a linear operation then we need another element which corresponds to a constant translation part in a linear transformation: z = w1*x1 + w2*x2 + const.. Take a simple function y=w*x + c. The “c” controls where the line crosses the y axis. We need such a parameter if our line should separate clusters of points separably distributed somewhere in the (x,y)-plane; the w is not sufficient to orientate and position the hyperplane in the (x,y)-plane.

This very basically what bias neurons are good for regarding the basically linear operation between two MLP-layers. They add a constant to an otherwise linear transformation.

Do we need a bias node on all layers? Definitely on the input layer. However, on the hidden layers a trained network could by learning evolve weights in such a way that a bias neuron comes about – with almost zero weights on one side. At least in principle; however, we make it easier for the MLP to converge by providing explicit “bias” neurons.

What did we do to account for bias nodes in our Python code so far? We extended the matrices describing the output arrays ay_A_out of the activation function (for input ay_Z_in) on the input and all hidden layers by elements of an additional row. This was done by the method “add_bias_neuron_to_layer()” – see the codes given in article III.

The important point is that our weight matrices already got a corresponding dimension when we built them; i.e. we defined weights for the bias nodes, too. Of course, during optimization we must calculate partial derivatives of the cost function with respect to these weights.

The problem is:

We need to back propagate a delta-matrix _Nδ for layer N via
( (_NW.T).dot(_Nδ) ). But then we can not apply a simple element-wise matrix multiplication with the _(N-1)D(z)-matrix at layer N-1. Reason: The dimensions do not fit, if we calculate the elements of D only for the existing Z-Values at layer N-1.

There are two solutions for coding:

We can add a row artificially and intermediately to the Z-matrix to calculate the D-matrix, then calculate _Nδ_S as
( (_NW.T).dot(_Nδ) ) * _{(N_1)}D
and eliminate the first artificial row appearing in _Nδ_S afterwards.

The other option is to reduce the weight-matrix (_NW) by a row intermediately and restore it again afterwards.

What we do is a matter of efficiency; in our coding we shall follow the first way and test the difference to the second way afterwards.

Check the matrix dimensions

As all steps to back-propagate and to circumvent the pitfalls require a bit of matrix wizardry we should at least check at every step during EBP backward-propagation that the dimensions of the involved matrices fit each other.

Outlook

Guys, after having explained some of the matrix math in the previous article of this series and the problems we have to tackle whilst programming the EBP-algorithm we are eventually well prepared to add EBP-methods to our Python class for MLP simulation. We are going to to this in the next article:
A simple program for an ANN to cover the Mnist dataset – VIII – coding Error Backward Propagation

Literature

“Machine Learning – An Applied Mathematics Introduction”, Paul Wilmott, 2019, Panda Ohana Publishing

Posted in Machine Learning, MLP | Tagged backpropagation, backward propagation, beginner, bias nodes, cost functions, critical denominators, error back propagation, global minimum, gradient descent, matrix dimensions, minima positions, MLP, Multilayer Perceptron

Upgrading Win 7 to Win 10 guests on Opensuse/Linux based VMware hosts – I – some experiences

Posted on 20. January 2020 by eremo

As my readers know I am not a fan of MS or any “Windows N” operating system – whatever the version number N. But some of you may be facing the same situation as me:

A customer or an employer enforces the use of MS products – as e.g. MS Office, clients for MS Exchange, Skype for Business, Sharepoint, components for effort booking and so on. For the fulfillment of most of your customer’s demands you can use browser based interfaces or Linux clients.

However, something that regularly leads to problems is the heavy use of MS Office programs or graphics tools in their latest versions. Despite other claims: A friction-less back and forth between Libreoffice and MS Office is still a dream. Crossover Office is nice – but the latest MS Office versions are often not yet covered when you need them. Another very reasonable field of using MS Windows guests on Linux is, by the way, training for pen-testing and security measures.

So, even Linux enthusiasts are sometimes forced to work with or within a native Windows environment. We would use a virtualized Windows guest machine then – on a Linux host with the help of VMware, KVM or Virtualbox. Regarding graphical performance, support of basic 3D features, Direct X and of the latest USB-versions in the emulated system environment I have a tendency to use VMware Workstation, despite its high price. Get me right: I practically never use VMware to virtualize Linux systems – for this purpose I use LXC containers or KVM. But for “Win 7” or “Win 10” VMware seemed to be a good choice – so far.

Upgrade to Win 10

During the last days of orchestrated panic regarding the transition from Windows 7 to Windows 10 I eventually gave in and upgraded some of my VMware-virtualized Windows 7 systems to Windows 10. More because of having some free time to get into this process than because assuming a sudden drop in security. (As if we ever trusted in the security of Windows system … I come back to security and privacy aspects in a second article.) However, on a perspective of some weeks or months the transition from Win 7 to Win 10 is probably unavoidable – if you cannot isolate your Windows machine completely from the Internet and/or from other external servers which bring a potential attack risk with them. The latter may even hold for servers of your clients.

I was a bit skeptical about the outcome of the upgrade procedure and the effort it would require on my side. A good friend of mine, who sells and administers Windows system professionally, had told me that he had experienced a whole variety of different problems – depending on the Win 7 setup, the amount and character of application SW installed, hardware drivers and the validity of licenses.

Well, my Windows 7 Pro clients were equipped with rather elementary SW: MS Office in different versions, MS Project, Lexware, Adobe Creative suite in an old version, some mind mapping SW, Adobe Reader, Anti malware SW. The “hardware” of the virtual machines is standard, partially emulated by VMware with appropriate drivers. So, no need to be especially nervous.

To be on the safe side I also ordered a VMware WS Pro upgrade to version 15.X. (I own WS 12.5.9 and WS 14 licenses.) Reason: I had read that only the WS 15.5 Pro supports the latest Win 10 versions fully. Well reading without thinking may lead to a waste of resources – see below.

Another rumor you often hear is that Windows 10 requires rather new hardware and is quite resource-demanding. MS itself recommends to buy a new PC or laptop on its web-sites – of course often followed by advertisement for MS notebook models on the very same web page. Yeah, money makes the world turn around. Well, regarding resources for my Windows guest systems I was/am rather restrictive:

Virtual machines for MS Win never get a lot of RAM from me – a maximum of 4 GB at most. This is enough for office purposes. (All really resource craving
things I do on Linux 🙂 ). Neither do my virtualized Win systems get a lot of disk space – typically < 60 GB. I mostly use vmdk-files to provide virtual hard disks – without full space allocation at startup, but dynamically added 4GB extents. vdmk files allow for an easy movement of virtual machines and simple backup procedures. And I usually give my virtual Win machines a maximum of 2 processor cores. So, these limitations contributed a bit to my skepticism. In addition I have 3D support on for my Win 7 guests in the virtual machine setup.

Meanwhile, I have successfully performed multiple upgrades on a rather old Linux host with an i7 950 CPU and newer hosts with I7 6700 K and modern i9 9900 processors. The operative system on all hosts run Opensuse Leap 15.1; I did not find the time to test my Debian hosts, yet.

I had some nice and some annoying experiences. I also found some aspects which you should take care of ahead of the Win 7 to Win 10 upgrade.

Make a backup!

As always with critical operations: Make a backup first! This is quite easy with a VMware virtual machine based on “vmdk”-files: Just copy the machines directory with all its files to some Linux formatted backup medium and keep up all the access rights during copying (=> cp -dpRv). In case of partition based virtual machines – make a copy of the partition with “dd”.

If you should need to restore the virtual machine in its old state again and to copy your backup files to their old places: VMware will notice this and will ask you whether you moved or copied the guest. Then answer “moved” (!) – which appears a bit paradox. But otherwise there is a very high probability that trouble with your Windows license will follow. VMware interprets a “copy”-operation as a duplication of a virtual machine and puts a related information somewhere (?) which Windows evaluates. Windows will almost certainly ask for a reactivation of your installation in case that your Win license was/is an individual one – as e.g. an OEM license.

Good news and potentially bad news regarding the upgrade to Win 10

The good news is:

Provided that you have valid licences for your Win 7 and for all SW components installed and provided that there is enough real and virtual disk space available, the Win 7 to Win 10 upgrade works smoothly. However, it takes a considerable amount of time.

I did not experience any performance problems after the upgrades – not even regarding transparency effects and other gimmicks in comparison to Windows 7. VMware’s 3D support for Win works – in WS 15 even for DirectX 10.

The requirement for time depends partially on the bandwidth of your Internet connection and partially on the performance of your disk access as well as your CPU and the available RAM. In my case I had to invest around 1 hr – in those cases when everything went straight through.

The potentially bad news comprises the following points:

The upgrade requires a considerable amount of free space on your virtual machine’s hard disk, which will be used temporarily. So, you should carefully check the available disk space – inside the virtual machine and – a bit surprising – also on the Linux filesystem keeping the vmdk-files. I ran into problems with limited space for multiple upgrades on both sides; see below. Whether you will experience something similar depends on your safety margin policies with respect to disk space in the guest and on the host.

A really annoying
aspect of the upgrade had to do with VMware’s development and market strategy. From advertisement you may conclude that it would be best to use VMware WS 14 or 15 to handle Windows 10. However, on older Intel based systems you should absolutely check whether the CPU is compatible with VMware WS 14 and 15. Check it, before you think upgrading a Vmware WS 12 license to anything higher. On my Intel i7 950 neither WS 14 nor WS 15 did work at all. Even if you get these WS versions working by a trick (see below) they perform badly.

Then there is a certain privacy aspect. As said, the upgrade takes a lot of time during which you are connected to the Internet and to Microsoft servers. This is only partially due to the fact that Win 10 SW has to be downloaded during the upgrade process; there are more phases of information exchange. It is also quite understandable that MS has to analyze and check your system on a full scale. But do we know what Big Brother [BB] MS is doing during this time and what information/data they transfer to their own systems? No, we do not. So, if you have any sensitive data files on your system – how to protect them? You cannot isolate your Windows 10 during the upgrade. And even worse: Later on you will be more or less forced to perform updates within certain periods. So, how to keep sensitive data inaccessible for BB during the upgrade and beyond?

I address the first two aspects below. The last point of privacy is an interesting but complicated one. I shall discuss it in a separate article.

Which VMware workstation version should I use?

Do not get misguided by reports or advertisement on the Internet that certain MS Win 10 require the latest version of VMware Workstation! WS 12 Pro was the first version which supported Win 10 in late 2015. Now VMware 15.X has arrived. And yes, there are articles that claim incompatibility of VMware WS 12, WS 14 and early subversions of WS 15 with some of the latest Win 10 builds and updates. See the following links and discussions therein:
https://communities.vmware.com/thread/608589
https://www.borncity.com/blog/2019/10/03/windows-10-update-kb4522015-breaks-vmware-workstation/
https://www.askwoody.com/forums/topic/vmware-12-and-newer-incompatible-with-windows-10-1903/

But read carefully: The statements on incompatibility refer mostly (if not only) to using a MS Win 10 system as a host for VMware! But we guys are using Linux systems as hosts.

Therefore the good message is:

Windows 10 as a VMware guest is already supported by VM WS 12.5.9 Pro, which runs also on older CPUs. For all practical purposes and 2D graphics a Win 10 guest installation works quite well on a Linux host with VMware 12.5.9.

At least, I have not yet noticed anything wrong on my hosts with Opensuse Leap 15.1 and VMware WS 12.5.9 PRO for a Win 10 guests. (Neither did I see problems with WS 14 or WS 15 on those hosts where I could use these versions).

The compatibility of WS 12.5 with Win 10 guest on Linux is more important than you may think if your host has an older CPU. If you really want to spend money and use WS 14 or WS 15 please note:

WS 14 Pro and WS 15 Pro require that your CPU provides Intel VT-x virtualization technology and EPT abilities.

So, the potentially bad message for you as the still proud owner of an older but capable CPU is:

The present VMware WS versions 14 and 15 which support Win 10 fully (as guest and host system) may not be compatible with your CPU!

Check
compatibility twice BEFORE you intend to upgrade VMware Workstation ahead of a “Win7 to Win 10”-upgrade. It would be a major waste of money if your CPU is not supported. And as stated: Win 12.5 does a good job with Win 10 guests.

VMware has deserved a lot of criticism with their decision to ignore older processors with WS Pro versions > 14. See
https://communities.vmware.com/thread/572931
https://vinfrastructure.it/2018/07/vmware-workstation-pro-14-issues-with-old-cpu/
https://www.heise.de/newsticker/meldung/VMware-Workstation-14-braucht-juengere-Prozessoren-3847372.html
For me this is a good reason to try a bit harder with KVM for the virtualization of Windows – and drop VMware wherever possible.

There is a small trick, though, to get WS 14 Pro running on an i7 950 and other older processors: In the file “/etc/vmware/config” you can add the setting

monitor.allowLegacyCPU = “true”

See https://communities.vmware.com/thread/572804.

But: I have tested this and found that a Win 7 start takes around 3 minutes! You really have to be very patient… This is crazy – and for me unacceptable. After you once are logged in, performance of Win 7 seems to be OK – maybe a bit sluggish. Still I cannot bear the waiting at boot time. So, I went back to WS 12 Pro on the machine with an i7 950.

Another problem for you may be that the installation of WS 12.5.9 on both Opensuse Leap 15.0 and 15.1 requires some special settings and tricks which I have written about in this blog. See:
Upgrade auf Opensuse Leap 15.0 – Probleme mit Nvidia-Treiber aus dem Repository und mit VMware WS 12.5.9
Upgrade Laptop to Opensuse 42.3, Probleme mit Bumblebee und VMware WS 12.5, Workarounds
The first article is relevant also for Opensuse 15.1.

Use the Windows Upgrade site and the Media Creation Tool page to save money

If you have a valid Win 7 license for all of your virtualized Win 7 installations it is not required to spend money on a new Win 10 license. Microsoft’s offer for a cost free upgrade to Win 10 still works. See e.g.:
https://www.cnet.com/how-to/windows-10-dont-wait-on-free-upgrade-because-windows-7-officially-done/
https://www.techbook.de/apps/kostenloses-update-windows-10
Follow the steps there – as I have done successfully myself.

Problems with disk space within the VMware Windows 7 guest during upgrade

My first Win7 to Win10 upgrade trial ran into trouble twice. The first problem occurred during the upgrade process and within the virtual machine:
I got a warning from the upgrade program at its start that I should free at least some 8.5 GByte.

Not so funny – as said, I am a bit picky about resources. The virtual guest machine had only a 60 GB C-disk. Fortunately, there were a lot of temporary files which could be deleted. Actually Gigabytes and partially years old – makes you wonder why Win 7 kept those files piled up. I also could move a bunch of data files to a D-disk. And I deinstalled some programs. All in all – it just worked out. The upgrade itself afterwards went friction-free and without

So one
message is:

Ensure that you have around 15 GB free on your virtual C-disk.

It is better to solve the problems with freeing C-disk space inside Win 7 without pressure – meaning: ahead of the upgrade to Win 10. If you run into the described problem it may be better to abort the Win 10 upgrade. I have tested this – and the Win 7 system was restored – apparently in good health. I got a strange message during reboot that the system was prepared for first use – but after everything was as before.

On another system I got a warning during the upgrade, when the “search for updates” began, that I should clear some 10 GByte of temporarily required disk space or attach an external drive (USB) to be used for temporary operations. The latter went OK in this case. But be careful the USB disk must be kept attached to the virtual machine over some reboots. Do not touch it until the upgrade has finalized.

So, a second message is:

Be prepared to have some external device with some free 20 GB ready if you have a complex installation with a lot of application SW and/or a complex virtual HW configuration.

I advise you to check your external USB drive, USB stick or whatever you use for filesystem errors before attaching it. And have your VMware window active whilst attaching the device! VMware will then warn you that the Linux host may claim access to the device and you just have to click the buttons in the dialog boxes to give the VMware guest full control instead of the host OS.

If you now should think about a general enlargement of the virtual disk(s) of your existing Win 7 installation please take into account the following:

On the one hand side an enlargement is of course possible and relatively easy to handle if you use vdmk files for disk virtualization and have free space on the Linux partition which hosts the vmdks. VMware supports the resizing process in the disk section of the virtual machine “settings”. On Win 7 you afterward can use the Win admin tools to extend the NTFS filesystem to the full extent of the newly configured disk.

But, on the other side, please, consider that Windows may react allergic to a change of the main C-disk and request a new activation due to major hardware changes. 🙁

This is one of the points why we do not like Windows ….
So, how you solve a potential free disk problem depends a bit on what you think is the bigger problem – reactivation or freeing disk space by deletions, movement of files or deinstallations.

Addendum: Also check old restore points Win 7 may have created over time! After a successful upgrade to Win 10 I stumbled across an option to release all restore information for old installations (in this case for Win 7 and its kept restore points). This will give you again many Gigabytes if you had not deleted “restore point” data for a long time in your Win 7. In my case I gained remarkable 17 GB! => Should have deleted some old restore points data already before the upgrade.

Problems with disk space on the Linux host

The second problem with disk space occurred after or during some upgrades to Win 10: I ran out of space in the Linux filesystem containing the vmdk files of my virtual machine. In one case the upgrade simply stopped. In another case the problem occurred a while after the upgrade – without me actually doing much on the new Win 10 installation. VMware suddenly issued a warning regarding the Linux file system and paused the virtual machine. I was first a bit surprised as I had not experienced this lack of space during normal usage of the previous Win 7 installation.

The explanation was simple: As said, I had set up the virtual disk such that the required space was not allocated at once, but as required. Due to the upgrade the VMware had created all 4GB-extends
to provide the full disk space the guest needed. In addition I had activated “Autoprotect Snapshots” on VMware (3 per day) – the first automatically created snapshot after the upgrade required a lot of additional space on the Linux file system – due to heavy changes on the hard disk.

My virtualized machines most often reside on specific (encrypted) LVM-based Linux partitions. And there it just got tight – when VMware stopped the virtual machine only 3.5 GB were left free. Not funny: You cannot kill snapshots on a paused virtual guest – the guest must be running or be shut down. And if you want to enlarge a Linux partition – which is possible if there is (neighboring) space free on your hard disk – then the filesystem should best be unmounted. Well, you can enlarge a GPT-partition with the ext4-filesystem in operation (e.g. with YaST) – but it gives you an uncomfortable feeling.

In my case I decided to brutally power down the virtual machines. In one case where this problem occurred I could at least eliminate one snapshot. I could start the virtual machine then again and let Windows check the NTFS filesystems for errors. Then I shut down the virtual machine again, deleted another snapshot and used the tools of VMware to defragment and compact the virtual disks. This gave me a considerable amount of free GBs. Good!
Afterwards I additionally reduced the number of protection snapshots – if this still seemed to be necessary.

On another system with a more important Win 7/10 installation I really extended the Linux partition and its ext4 filesystem by 20 GB – I had some spare space, fortunately – and then followed the steps just described.

So, there is a whole spectrum of options to regain disk space after the upgrade. See also:
thebackroomtech.com : reduce-size-virtual-machine-disk-vmware-workstation/

My third message is:

Ensure a reasonable amount of free space in the Linux filesystem – for required extents and snapshots!
After the backup of your old Win 7 installation, eliminate all VMware snapshots which you do not absolutely need – in the snapshot manager from the left to the right. Also use the VMware tools to defragment and compact your virtual disks ahead of the upgrade.

By the way: I hope that it is clear that snapshots do NOT replace backups. You should make a backup of your successfully upgraded Win 10 installation after you have tested the functionality of your applications and before you start working seriously with your new Win 10. You do not want to go through the upgrade procedure again ..

Addendum: Circumvent the enforcement of Windows 10 updates after your upgrade

Updates on Windows 7 have often lead to trouble in the past – and as an administrator you were happy to have some control over the download and installation points for updates in time. After reading a bit, I got the impression that the situation has not changed much: There have occurred some major problems related to updates of Win 10 since 2016. Yet, Windows 10 enforces updates more rigidly than Win 7.

I, therefore, generally recommend the following:

Delay or stop automatic updates on Win 10. Then use VMware’s snapshot mechanism before manual updates to be able to turn back to a running Win 10 guest version. In this order.

The first point is not so easy as it may seem – there are no basic and directly accessible options to only get informed about available updates as on Win 7. Win 10 enforces updates if you have enabled “Windows Update”; there is no “inform only” or “download only”. You have to either disable updates totally or to delay them. The latter only works for a maximum period of 35 days. How to deactivate updates completely is described here:

https://www.easeus.com/todo-backup-resource/how-to-stop-windows-10-from-automatically-update.html
https://www.t-online.de/digital/software/id_77429674/windows-10-automatische-updates-deaktivieren-so-geht-s.html

There is also a description on “Upgrade” values for a related registry entry:
www.deskmodder.de/wiki/index.php/Automatische-Updates-deaktivieren-oder-auf-manuell-setzen-Windows-10#Windows_10_1607.2C-1703-Pro-Updates-auf-manuell-setzen-oder-deaktivieren

I am not sure whether this works on Win 10 Pro build 1909 – we shall see.

Conclusion

Win 7 and Win 10 can be run on VMware WS Pro versions 12.5 up to 15.5 on Linux hosts. Before you upgrade VMware WS check for compatibility with your CPU! An upgrade of a Win 7 Pro installation on a VMware virtual machine to Win 10 Pro basically works smoothly – but you should take care of providing enough disk space within the virtual machine and also on the host’s filesystem containing the vdmk-files for the virtual disks.

It is not necessary to change the quality of the virtualized hardware configuration. Win 10 appears to be running with at least the same performance as the old Win 7 on a given virtual machine.

In the next article I will discuss some privacy aspects during the upgrade and after. The main question there will be: What can we do to prevent the transfer of sensitive data files from a Win 10 installation?

Posted in Erfahrungsberichte, Leap 15.0, Leap 15.1, VMware Workstation | Tagged Upgrade to Windows 10, virtual disk extents, virtual disk space, vmdk files, VMware guests on Linux, VMware WS 12.5.9 Pro, VMware WS 15 Pro, Windows 10, Windows 7, Windows guest systems

Post navigation

← Older posts

Newer posts →

March 2026

M T W T F S S

1

2 3 4 5 6 7 8

9 10 11 12 13 14 15

16 17 18 19 20 21 22

23 24 25 26 27 28 29

30 31

« Jan

Recent Posts

New year – new operative system for desktops and laptops?

Leap 15.6, Nvidia driver 580 – resume from S3 deep sleep state after suspend to RAM?

Leap 15.6, Nvidia-driver – problems due to dependency on original kernel-default-devel package

KDE Plasma users – reconfigure your desktop from scratch when you experience startup delays after system upgrades

S3Dlib, Matplotlib, 3D rendering – spheres in front of a surface

Categories

Allgemeines / Meinungen

Blender

Cups, Druck

Debian

Debian 8

Debian 9

Kali Linux

Erfahrungsberichte

Firewall Netfilter Iptables

Hardware, Treiber

Nvidia

KDE

Kontact – Kmail

LAMP / Webentwicklung

Apache

Eclipse für PHP

GIT

Javascript / JQuery / Ajax

MySQL

PHP

phpMyAdmin

SVN

vsftp

LDAP

LibreOffice, OpenOffice

Linux 3D Desktop

Linux-Container

Machine Learning

Autoencoder

Cluster Algorithms

CNNs

DeepDream

GPT

Keras

Mathematics

ML environment

MLP

Perceptron

SVM- Support Vector Machines

Tensorflow 2

Text analysis with ML

Text pre-processing for ML

Variational Autoencoder

Netzwerk / networking

Open Source

Open-Xchange

Open-Xchange 5

Open-Xchange 6

Opensuse

Leap 15.0

Leap 15.1

Leap 15.2

Leap 15.3

Leap 15.4

Leap 15.5

Leap 15.6

Leap 42.1

Leap 42.2

Leap 42.3

Opensuse 12.1

Opensuse 12.2

Opensuse 12.3

Opensuse 13.1

Opensuse 13.2

Postfix, Cyrus IMAP, Kmail

Python and Eclipse PyDev

matplotlib

Numpy

Security und Security-Tools

Sound

Strato-V-Server

systemd und udev

Verschlüsselung – SSH – SFTP – TLS – E-Mails

Virtualisierung, KVM, Xen

VMware Workstation

Web – Browser, HTTP, HTML, CSS

Wordpress

Blogroll

machine-learning@anracon

Personal Blog

homepages

anracon.de

anracom.com

Shuffling the contents of the mini-batches

Results for shuffling the contents of the mini-batches

Dropping regularization

Why might it be interesting to check the impact of the regularization?

Enlarging the regularization factor

Enlarging node numbers

Reducing the node number on the hidden layers

Working with just one hidden layer

Conclusion

Test results

Secondary test: Rate of correctly and wrongly predicted values of the training and the test data sets

Conclusion

More input parameters

Some hygienic measures regarding variables

The changed __input__()-method

The extended method _set_ANN_structure()

The modified functions _fit() and _handle_mini_batch()

Forward Propagation

Methods for Error Backward Propagation

Conclusion

Links

Gradient descent and mini-batches – one or multiple cost functions?

Repetition: Why Back-propagation of 2 dimensional matrices and not vectors?

Some possible pitfalls to tackle before EBP-coding

Special cost-, activation- and output-functions

The special case of the Log Loss function and other loss functions with critical denominators in their derivative

Handling of bias nodes

Check the matrix dimensions

Outlook

Literature

Upgrade to Win 10

Make a backup!

Good news and potentially bad news regarding the upgrade to Win 10

Which VMware workstation version should I use?

Use the Windows Upgrade site and the Media Creation Tool page to save money

Problems with disk space within the VMware Windows 7 guest during upgrade

Problems with disk space on the Linux host

Addendum: Circumvent the enforcement of Windows 10 updates after your upgrade

Conclusion

The changed input()-method