MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation

In my last article in this blog I wrote a bit about some steps to get Keras running with Tensorflow 2 [TF2] and Cuda 10.2 on Opensuse Leap 15.1. One objective of these efforts was a performance comparison between two similar Multilayer Perceptrons [MLP] :

  • my own MLP programmed with Python and Numpy; I have discuss this program in another article series;
  • an MLP with a similar setup based on Keras and TF2

Not for reasons of a competition, but to learn a bit about differences. When and for what parameters do Keras/TF2 offer a better performance?
Another objective is to test TF-alternatives to Numpy functions and possible performance gains.

For the Python code of my own MLP see the article series starting with the following post:

A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

But I will discuss relevant code fragments also here when needed.

I think, performance is always an interesting topic - especially for dummies as me regarding Python. After some trials and errors I decided to discuss some of my experiences with MLP performance and optimization options in a separate series of the section "Machine learning" in this blog. This articles starts with two simple measures.

A factor of 6 turns turns into a factor below 2

Well, what did a first comparison give me? Regarding CPU time I got a factor of 6 on the MNIST dataset for a batch-size of 500. Of course, Keras with TF2 was faster 🙂 . Devastating? Not at all ... After years of dealing with databases and factors of up to 100 by changes of SQL-statements and indexing a factor of 6 cannot shock or surprise me.

The Python code was the product of an unpaid hobby activity in my scarce free time. And I am still a beginner in Python. The code was also totally unoptimized, yet - both regarding technical aspects and the general handling of forward and backward propagation. It also contained and still contains a lot of superfluous statements for testing. Actually, I had expected an even bigger factor.

In addition, some things between Keras and my Python programs are not directly comparable as I only use 4 CPU cores for Openblas - this gave me an optimum for Python/Numpy programs in a Jupyter environment. Keras and TF2 instead seem to use all available CPU threads (successfully) despite limiting threading with TF-statements. (By the way: This is an interesting point in itself. If OpenBlas cannot give them advantages what else do they do?)

A very surprising point was, however, that using a GPU did not make the factor much bigger - despite the fact that TF2 should be able to accelerate certain operations on a GPU by at least by a factor of 2 up to 5 as independent tests on matrix operations showed me. And a factor of > 2 between my GPU and the CPU is what I remember from TF1-times last year. So, either the CPU is better supported now or the GPU-support of TF2 has become worse compared to TF1. An interesting point, too, for further investigations ...

An even bigger surprise was that I could reduce the factor for the given batch-size down to 2 by just two major, butsimple code changes! However, further testing also showed a huge dependency on the batch sizechosen for training - which is another interesting point. Simple tests show that we may even be able to reduce the performance factor further by

  • by using directly coupled matrix operations - if logically possible
  • by using the basic low-level Python API for some operations

Hope, this sounds interesting for you.

The reference model based on Keras

I used the following model as a reference in a Jupyter environment executed on Firefox:

Jupyter Cell 1

 
# compact version 
# ****************
import time 
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras import regularizers
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA ) only 
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
# The following can only be done once - all CPU cores are used otherwise  
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
    
# if not yet done elsewhere 
#tf.compat.v1.disable_eager_execution()
#tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 0 # 0: cpu, 1: gpu

# function for training 
def train(train_images, train_labels, epochs, batch_size, shuffle):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, shuffle=shuffle)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(70, activation='sigmoid', input_shape=(28*28,), kernel_regularizer=regularizers.l2(0.01)))
#network.add(layers.Dense(80, activation='sigmoid'))
#network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST 
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

 

Jupyter Cell 2

# run it 
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_g = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_c = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)  

# test accuracy 
print('Acc:: ', test_acc)

Typical output - first run:

 
Epoch 1/35
60000/60000 [==============================] - 1s 16us/step - loss: 2.6700 - accuracy: 0.1939
Epoch 2/35
60000/60000 [==============================] - 0s 5us/step - loss: 2.2814 - accuracy: 0.3489
Epoch 3/35
60000/60000 [==============================] - 0s 5us/step - loss: 2.1386 - accuracy: 0.3848
Epoch 4/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.9996 - accuracy: 0.3957
Epoch 5/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.8941 - accuracy: 0.4115
Epoch 6/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.8143 - accuracy: 0.4257
Epoch 7/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.7556 - accuracy: 0.4392
Epoch 8/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.7086 - accuracy: 0.4542
Epoch 9/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6726 - accuracy: 0.4664
Epoch 10/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6412 - accuracy: 0.4767
Epoch 11/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6156 - accuracy: 0.4869
Epoch 12/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5933 - accuracy: 0.4968
Epoch 13/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5732 - accuracy: 0.5078
Epoch 14/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5556 - accuracy: 0.5180
Epoch 15/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5400 - accuracy: 0.5269
Epoch 16/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5244 - accuracy: 0.5373
Epoch 17/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5106 - accuracy: 0.5494
Epoch 18/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4969 - accuracy: 0.5613
Epoch 19/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4834 - accuracy: 0.5809
Epoch 20/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4648 - accuracy: 0.6112
Epoch 21/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4369 - accuracy: 0.6520
Epoch 22/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3976 - accuracy: 0.6821
Epoch 23/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3602 - accuracy: 0.6984
Epoch 24/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3275 - accuracy: 0.7084
Epoch 25/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3011 - accuracy: 0.7147
Epoch 26/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2777 - accuracy: 0.7199
Epoch 27/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2581 - accuracy: 0.7261
Epoch 28/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2411 - accuracy: 0.7265
Epoch 29/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2259 - accuracy: 0.7306
Epoch 30/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2140 - accuracy: 0.7329
Epoch 31/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2003 - accuracy: 0.7355
Epoch 32/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1890 - accuracy: 0.7378
Epoch 33/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1783 - accuracy: 0.7410
Epoch 34/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1700 - accuracy: 0.7425
Epoch 35/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1605 - accuracy: 0.7449
10000/10000 [==============================] - 0s 37us/step
Time_CPU:  11.055424336002034
Acc::  0.7436000108718872

 
A second run was a bit faster: 10.8 secs. Accuracy around: 0.7449.
The relatively low accuracy is mainly due to the regularization (and reasonable to avoid overfitting). Without regularization we would already have passed the 0.9 border.

My own unoptimized MLP-program was executed with the following parameter setting:

 

             my_data_set="mnist_keras", 
             n_hidden_layers = 2, 
             ay_nodes_layers = [0, 70, 30, 0], 
             n_nodes_layer_out = 10,
             num_test_records = 10000, # number of test data
             
             # Normalizing - you should play with scaler1 only for the time being      
             scaler1 = 1,   # 1: StandardScaler (full set), 1: Normalizer (per sample)        
             scaler2 = 0,   # 0: StandardScaler (full set), 1: MinMaxScaler (full set)       
             b_normalize_X_before_preproc = False,     
             b_normalize_X_after_preproc  = True,     

             my_loss_function = "LogLoss",
 
             n_size_mini_batch = 500,
             n_epochs = 35, 
             lambda2_reg = 0.01,  

             learn_rate = 0.001,
             decrease_const = 0.000001, 

             init_weight_meth_L0 = "sqrt_nodes",  # method to init weights in an interval defined by  =>"sqrt_nodes" or a constant interval  "const"
             init_weight_meth_Ln = "sqrt_nodes",  # sqrt_nodes", "const"
             init_weight_intervals = [(-0.5, 0.5), (-0.5, 0.5), (-0.5, 0.5)],   # in case of a constant interval
             init_weight_fact = 2.0,              # extends the interval 
             mom_rate   = 0.00005,

             b_shuffle_batches = True,    # shuffling the batches at the start of each epoch 
             b_predictions_train = True,  # test accuracy by  predictions for ALL samples of the training set (MNIST: 60000) at the start of each epoch
             b_predictions_test  = False,  
             prediction_train_period = 1, # 1: each and every epoch is used for accuracy tests on the full training set
             prediction_test_period = 1,  # 1: each and every epoch is used for accuracy tests on the full test dataset

 

People familiar with my other article series on the MLP program know the parameters. But I think their names and comments are clear enough.

With a measurement of accuracy based on a forward propagation of the complete training set after each and every epoch (with the adjusted weights) I got a run time of 60 secs.

With accuracy measurements based on error tracking for batches and averaging over all batches, I get 49.5 secs (on 4 CPU threads). So, this is the mentioned factor between 5 and 6.

(By the way: The test indicates some space for improvement on the "Forward Propagation" 🙂 We shall take care of this in the next article of this series - promised).

So, these were the references or baselines for improvements.

Two measures - and a significant acceleration

Well, let us look at the results after two major code changes. With a test of accuracy performed on the full training set of 60000 samples at the start of each epoch I get the following result :

------------------
Starting epoch 35

Time_CPU for epoch 35 0.5518779030026053
relative CPU time portions: shuffle: 0.05  batch loop: 0.58  prediction:  0.37
Total CPU-time:  19.065050211000198

learning rate =  0.0009994051838157095

total costs of training set   =  5843.522
rel. reg. contrib. to total costs =  0.0013737131

total costs of last mini_batch   =  56.300297
rel. reg. contrib. to batch costs =  0.14256112

mean abs weight at L0 :  0.06393985
mean abs weight at L1 :  0.37341583
mean abs weight at L2 :  1.302389

avg total error of last mini_batch =  0.00709
presently reached train accuracy   =  0.99072

-------------------
Total training Time_CPU:  19.04528829299714

With accuracy taken only from the error of a batch:

avg total error of last mini_batch =  0.00806
presently reached train accuracy   =  0.99194
-------------------
Total training Time_CPU:  11.331006342999899

Isn't this good news? A time of 11.3 secs is pretty close to what Keras provides us with! (Well, at least for a batch size of 500). And with a better result regarding accuracy on my side - but this has to do with a probably different handling of learning rates and the precise translation of the L2-regularization parameter for batches.

Plots:

How did I get to this point? As said: Two measures were sufficient.

A big leap in performance by turning to float32 precision

So far I have never cared too much for defining the level of precision by which Numpy handles arrays with floating point numbers. In the context of Machine Learning this is a profound mistake. on a 64bit CPU many time consuming operations can gain almost a factor of 2 in performance when using float 32 precision - if the programmers tweaked everything. And I assume the Numpy guys did it.

So: Just use "dtype=np.float32" (np means "numpy" which I always import as "np") whenever you initialize numpy arrays!

For the readers following my other series: You should look at multiple methods performing some kind of initialization of my "MyANN"-class. Here is a list:

 
    def _handle_input_data(self): 
        .....
            self._y = np.array([int(i) for i in self._y], dtype=np.float32)
        .....
        self._X = self._X.astype(np.float32)
        self._y = self._y.astype(np.int32)
        .....
    def _encode_all_y_labels(self, b_print=True):
        .....
        self._ay_onehot = np.zeros((self._n_labels, self._y_train.shape[0]), dtype=np.float32)
        self._ay_oneval = np.zeros((self._n_labels, self._y_train.shape[0], 2), dtype=np.float32)
   
        .....
    def _create_WM_Input(self):
        .....
        w0 = w0.astype(dtype=np.float32)
        .....
    def _create_WM_Hidden(self):
        .....
            w_i_next = w_i_next.astype(dtype=np.float32)
        .....
    def _create_momentum_matrices(self):
        .....
            self._li_mom[i] = np.zeros(self._li_w[i].shape, dtype=np.float32)
        .....
    def _prepare_epochs_and_batches(self, b_print = True):
        .....
        self._ay_theta = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        self._ay_costs = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        self._ay_reg_cost_contrib = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        .....
        self._ay_period_test_epoch     = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_acc_test_epoch        = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_err_test_epoch        = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_period_train_epoch    = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_acc_train_epoch       = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_err_train_epoch       = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_tot_costs_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_rel_reg_train_epoch   = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        .....
        self._ay_mean_abs_weight = -10 * np.ones(shape_weights, dtype=np.float32) 
        .....
    def _add_bias_neuron_to_layer(self, A, how='column'):
        .....
            A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
        .....
            A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
    .....

 

After I applied these changes the factor in comparison to Keras went down to 3.1 - for a batch size of 500. Good news after a first simple step!

Reducing the CPU time once more

The next step required a bit more thinking. When I went through further more detailed tests of CPU consumption for various steps during training I found that the error back propagation through the network required significantly more time than the forward propagation.

At first sight this seems to be logical. There are more operations to be done between layers - real matrix multiplications with np.dot() (or np.matmul()) and element-wise multiplications with the "*"-operation. See also my PDF on the basic math:
Back_Propagation_1.0_200216.

But this is wrong assumption: When I measured CPU times in detail I saw that such operations took most time when network layer L0 - i.e. the input layer of the MLP - got involved. This also seemed to be reasonable: the weight matrix is biggest there; the input layer of all layers has most neuron nodes.

But when I went through the code I saw that I just had been too lazy whilst coding back propagation:

 
    ''' -- Method to handle error BW propagation for a mini-batch --'''
    def _bw_propagation(self, 
                        ay_y_enc, li_Z_in, li_A_out, 
                        li_delta_out, li_delta, li_D, li_grad, 
                        b_print = True, b_internal_timing = False):
        
        # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch 
        
        # Initiate BW propagation - provide delta-matrices for outermost layer
        # *********************** 
        # Input Z at outermost layer E  (4 layers -> layer 3)
        ay_Z_E = li_Z_in[self._n_total_layers-1]
        # Output A at outermost layer E (was calculated by output function)
        ay_A_E = li_A_out[self._n_total_layers-1]
        
        # Calculate D-matrix (derivative of output function) at outmost the layer - presently only D_sigmoid 
        ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print )
        
        # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
        ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E, ay_D_E=ay_D_E, b_print=b_print) 
        
        # add the matrices at the outermost layer to their lists ; li_delta_out gets only one element 
        idxE = self._n_total_layers - 1
        li_delta_out[idxE] = ay_delta_out_E # this happens only once
        li_delta[idxE]     = ay_delta_E
        li_D[idxE]         = ay_D_E
        li_grad[idxE]      = None    # On the outermost layer there is no gradient ! 
        
        # Loop over all layers in reverse direction 
        # ******************************************
        # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2))
        range_N_bw_layer = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken 
        
        # loop over layers 
        for N in range_N_bw_layer:
            
            # Back Propagation operations between layers N+1 and N 
            # *******************************************************
            # this method handles the special treatment of bias nodes in Z_in, too
            ay_delta_N, ay_D_N, ay_grad_N = self._bw_prop_Np1_to_N( N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False )
            
            # add matrices to their lists 
            li_delta[N] = ay_delta_N
            li_D[N]     = ay_D_N
            li_grad[N]= ay_grad_N
       
        return

 
with the following key function:

 
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta):
        '''
        BW-error-propagation between layer N+1 and N 
        Inputs: 
            li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
            li_A_out: List of output A-matrices - values were calculated during FW-propagation
            li_delta: List of delta-matrices - values for outermost ölayer E to layer N+1 should exist 
        
        Returns: 
            ay_delta_N - delta-matrix of layer N (required in subsequent steps)
            ay_D_N     - derivative matrix for the activation function on layer N 
            ay_grad_N  - matrix with gradient elements of the cost fnction with respect to the weights on layer N 
        '''
        
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling between layers N and N+1 
        ay_W_N = self._li_w[N]
        # delta-matrix of layer N+1
        ay_delta_Np1 = li_delta[N+1]

        # !!! Add row (for bias) to Z_N intermediately !!!
        ay_Z_N = li_Z_in[N]
        ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        
        # Derivative matrix for the activation function (with extra bias node row)
        ay_D_N = self._calculate_D_N(ay_Z_N)
        
        # fetch output value saved during FW propagation 
        ay_A_N = li_A_out[N]
        
        # Propagate delta
        # **************
        # intermediate delta 
        ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
        # final delta 
        ay_delta_N = ay_delta_w_N * ay_D_N
        # reduce dimension again (bias row)
        ay_delta_N = ay_delta_N[1:, :]
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        ay_grad_N[:, 1:] += (self._li_w[N][:, 1:] * self._lambda2_reg + np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg) 
        
        return ay_delta_N, ay_D_N, ay_grad_N

 

Now, look at the eventual code:

 
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
        '''
        BW-error-propagation between layer N+1 and N 
        .... 
        '''
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling between layers N and N+1 
        ay_W_N = self._li_w[N]
        ay_delta_Np1 = li_delta[N+1]

        # fetch output value saved during FW propagation 
        ay_A_N = li_A_out[N]

        # Optimization ! 
        if N > 0: 
            ay_Z_N = li_Z_in[N]
            # !!! Add intermediate row (for bias) to Z_N !!!
            ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        
            # Derivative matrix for the activation function (with extra bias node row)
            ay_D_N = self._calculate_D_N(ay_Z_N)
        
            # Propagate delta
            # **************
            # intermediate delta 
            ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
            # final delta 
            ay_delta_N = ay_delta_w_N * ay_D_N
            # reduce dimension again 
            ay_delta_N = ay_delta_N[1:, :]
            
        else: 
            ay_delta_N = None
            ay_D_N = None
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        if self._lambda2_reg > 0.0: 
            ay_grad_N[:, 1:] += self._li_w[N][:, 1:] * self._lambda2_reg 
        if self._lambda1_reg > 0.0: 
            ay_grad_N[:, 1:] += np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg 
        
        return ay_delta_N, ay_D_N, ay_grad_N

 

You have, of course, detected the most important change:

We do not need to propagate any delta-matrices (originally coming from the error deviation at the output layer) down to layer 1!

This is due to the somewhat staggered nature of error back propagation - see the PDF on the math again. Between the first hidden layer L1 and the input layer L0 we only need to fetch the output matrix A at L0 to be able to calculate the gradient components for the weights in the weight matrix connecting L0 and L1. This saves us from the biggest matrix multiplication - and thus reduces computational time significantly.

Another bit of CPU time can be saved by calculating only the regularization terms really asked for; for my simple densely populated network I almost never use Lasso regularization; so L1 = 0.

These changes got me down to the values mentioned above. And, note: The CPU time for backward propagation then drops to the level of forward propagation. So: Be somewhat skeptical about your coding if backward propagation takes much more CPU time than forward propagation!

Dependency on the batch size

I should remark that TF2 still brings some major and remarkable advantages with it. Its strength becomes clear when we go to much bigger batch sizes than 500:
When we e.g. take a size of 10000 samples in a batch, the required time of Keras and TF2 goes down to 6.4 secs. This is again a factor of roughly 1.75 faster.
I do not see any such acceleration with batch size in case of my own program!

More detailed tests showed that I do not gain speed with a batch size over 1000; the CPU time increases linearly from that point on. This actually seems to be a limitation of Numpy and OpenBlas on my system.

Because , I have some reasons to believe that TF2 also uses some basic OpenBlas routines, this is an indication that we need to put more brain into further optimization.

Conclusion

We saw in this article that ML programs based on Python and Numpy may gain a boost by using only dtype=float32 and the related accuracy for Numpy arrays. In addition we saw that avoiding unnecessary propagation steps between the first hidden and at the input layer helps a lot.

In the next article of this series we shall look a bit at the performance of forward propagation - especially during accuracy tests on the training and test data set.

Further articles in this series

MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation
MLP, Numpy, TF2 – performance issues – Step II – bias neurons, F- or C- contiguous arrays and performance
MLP, Numpy, TF2 – performance issues – Step III – a correction to BW propagation

Getting a Keras based MLP to run with Cuda 10.2, Cudnn 7.6 and TensorFlow 2.0 on an Opensuse Leap 15.1 system

During last weekend I wanted to compare the performance of an old 20-line Keras setup for a simple MLP with the performance of a self-programmed Python- and Numpy-based MLP regarding training epochs on the MNIST dataset. The Keras code was set up in a Jupyter notebook last autumn - at that time for TensorFlow 1 and Cuda 10.0 for my Nvidia graphics card. I thought it might be a good time to move everything to Tensorflow 2 and the latest Cuda libraries on my Opensuse Leap 15.1 system. This was more work than expected - and for some problems I was forced to apply some dirty workarounds. I got it running. Maybe the necessary steps, which are not really obvious, are helpful for others, too.

Install Cuda 10.2 and Cudnn on an Opensuse Leap 15.1

Before you want to use TensorFlow [TF] on a Nvidia graphics card you must install Cuda. The present version is Cuda 10.2. I was a bit naive to assume that this should be the right version - as it has been available for some time already. Wrong! Afterwards I read somewhere that TensorFlow2 [TF2] is working with Cuda 10.1, only, and not yet fully compatible with Cuda 10.2. Well, at least for my purposes [MLP training] it seemed to work nevertheless - with some "dirty" additional library links.

There is a central Cuda repository available at this address: cuda10.2. Actually, the repo offers both cuda10.0, cuda10.1 and cuda10.2 (plus some nvidia drivers). I selected some central cuda10.2 packages for installation - just to find out where the related files were placed in the filesystem. I then ran into a major chain of packet dependencies, which I had to resolve during many tedious steps . Some packages may not have been necessary for a basic installation. In the end I was too lazy to restrict the libs to what is really required for Keras. The bill came afterwards: Cuda 10.2 is huge! If you do not know exactly what you need: Be prepared to invest up to 3 GB on your hard disk.

The Cuda 10.2 RPM packets install most of the the required "*.so"-shared library files and many other additional files in a directory "/usr/local/cuda-10.2/". To make changes between different versions of Cuda possible we also find a link "/usr/local/cuda" pointing to
"/usr/local/cuda-10.2/" after the installation. Ok, reasonable - we could change the link to point to "/usr/local/cuda-10.0/". This makes you assume that the Tensorflow 2 libraries and other modules in your virtual Python3 and Jupyter environment would look for required Cuda files in the central directory "/usr/local/cuda" - i.e. without special version attributes of the files. Unfortunately, this was another wrong assumption. 🙁 See below.

In addition to the Cuda packages you must install the present "cudnn" libraries from Nvidia - more precisely: The runtime and the development package. You get the RPMs from here. Be prepared to give Nvidia your private data. 🙁

I should add that I ignored and ignore the Nvidia drivers from the Cuda repository, i.e. I never installed them. Instead, I took those from the standard Nvidia community repository. They worked and work well so far - and make my update life on Opensuse easier.

Installation of Tensorflow2 modules in your (virtual) Python3 environment

I use a virtual Python3 environment and update it regularly via "pip". Regarding TF2 an update via the command "pip install --upgrade tensorflow" should be sufficient - it will resolve dependencies. (By the way: If you want to bring all Python libs to their present version you can also use "pip-review --auto". Note that under certain circumstances you may need the "--force" option for special upgrades. I cannot go into details in this article.)

Multiple complaints about missing libraries ...

Unfortunately, the next time I started my virtual Python environment I got the warning that the dynamic library "libnvinfer.so.6" could not be found, but was required in case I planned to use TensorRT. What? Well, you may find some information here
https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/
https://developer.nvidia.com/tensorrt

I leave it up to you whether you really need TensorRT. You can ignore this message - TF will run for standard purposes without it. But, dear TF-developers: a clear message in the warning would in my opinion have been helpful. Then I checked whether some version of the Nvidia related library "libnvinfer.so" came with Cuda or Cudnn onto my system. Yeah, it did - unfortunately version 7 instead of 6. :-(.
So, we are confronted with a dependency on a specific file version which is older than the present one. I do not like this style of development. Normally, it should be the other way round: If a newer version is required due to new capabilities you warn the user. But under normal circumstances a backward compatibility of libs should be established. You would assume such a backward compatibility and that TF would search for the present version via looking for files "libnvinfer.so" and "libnvinfer_plugin.so" which do exist and point to the latest versions. But, no, in this case they want it explicitly to be version 6 ... Makes you wonder whether the old Cudnn version is still available. I did not check it. Ok, ok - backward compatibility is not always possible ....

Just to see how good the internal checking of the alleged dependency is, I did something you normally do not do: I created a link "libnvinfer.so.6" in "/usr/lib64" to "libnvinfer.7.so". Had to do the same for "libnvinfer_plugin.so.6". Effect: I got rid of the warning - so much about dependency checking. I left the linking. You see I trust in coming developments sometimes and run some risks ....

Then came the next surprise. I had read a bit about changed statements in TF2 (compared to TF1) - and thought I was prepared for this. But, when I tried to execute some initial commands to configure TF2 from a Jupyter cell as e.g.

 
import time 
import tensorflow as tf
from tensorflow import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from tensorflow.python.client import device_lib

import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

tf.config.optimizer.set_jit(True)
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)
tf.debugging.set_log_device_placement(True)

device_lib.list_local_devices()  

I at once got a complaint in the shell from which I had started the Jupyter notebook - saying that a lib called "libcudart.so.10.1" was missing. Again - an explicit version dependency 🙁 . On purpose or just a glitch? Just one out of many files version dependent? Without a clear information? If this becomes the standard in the interaction between TF2 and Cuda - well, no fun any longer. In my opinion the TF2 developers should not use a search for files via version specific names - but do an analysis of headers and warn explicitly that the present version requires a specific Cuda version. Would be much more convenient for the user and compatible with the link mechanism described above.

Whilst a bunch of other dynamic libs was loaded by their name without a version in this case TF2 asks for a very specific version - although there is a corresponding lib available in the directory "/usr/lib/cuda-10.2".... Nevertheless with full trust again in a better future I offered TF2 a softlink "libcudart.so.10.1" in "/usr/lib64/" pointing to the "/usr/local/cuda-10.2/lib64/libcudart.so". It cleared my way to the next hurdle. And my Keras MLP worked in the end ...

Missing "./bin" directory ... and other path related problems

When I tried to run specific Keras commands, which TF2 wanted to compile as XLA-supported statements, I again got complaints that files in a local directory "./bin" were missing. This was a first clear indication that Cuda paths were totally ignored in my Python/Jupyter environment. But what directory did the "./" refer to? Some experiments revealed:

I had to link an artificial subdirectory "./bin" in the directory where I kept my Jupyter notebooks to "/usr/local/Cuda-10.2/bin".

But the next problems with other directories waited directly around the corner. Actually many ... To make a long story short - the installation of TF2 in combination with Cuda 1.2 does not evaluate paths or ask for paths when used in a Python3/Jupyter environment. We have to provide and export them as shell environment variables. See below.

Warnings and errors regarding XLA capabilities

Another thing which drove me nuts was that TF2 required information about XLA-flags. It took me a while to find out that this also could be handled via environment variables.

All in all I now start the shell from which I launch my virtual Python environment and Jupyter notebooks with the following command sequence:

myself@mytux:/projekte/GIT/....../ml> export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
myself@mytux:/projekte/GIT/....../ml> export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
myself@mytux:/projekte/GIT/....../ml_1> export OPENBLAS_NUM_THREADS=4              
myself@mytux:/projekte/GIT/....../ml_1> source bin/activate
(ml) myself@mytux:/projekte/GIT/....../ml_1> jupyter notebook 

The first two commands did the magic regarding the path-problems! TF2 worked afterwards both for XLA-capable CPUs and Nvidia GPUs. So, a specific version may or may not have advantages - I do not know - but at least you can get TF2 running with Cuda 10.2.

Changed commands to control threading and memory consumption

Without the use of explicit compatibility commands TF2 does not support commands like

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}, 
                        log_device_placement=True

                       )
config.gpu_options.per_process_gpu_memory_fraction=0.4
config.gpu_options.force_gpu_compatible = True  

any longer. But as with TF1 you probably do not want to pick all the memory from your graphics card and you do not want to use all cores of a CPU in TF2. You can circumvent the lack of a "ConfigProto" property in TF2 by the following commands:

# configure use of just in time compiler 
tf.config.optimizer.set_jit(True) 
# limit use of parallel threads 
tf.config.threading.set_intra_op_parallelism_threads(4) 
tf.config.threading.set_inter_op_parallelism_threads(4)
# Not required in TF2: tf.enable_eager_execution()
# print out use of certain device (at first run)
tf.debugging.set_log_device_placement(True)
#limit use of graphics card memory 
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
# Not required in TF2: tf.enable_eager_execution()
# print out a list of usable devices
device_lib.list_local_devices()   

Addendum, 15.05.2020:
Well, this actually proved to be correct for the limitation of the GPU memory, only. The limitations on the CPU cores do NOT work. At least not on my system. See also:
tensorflow issues 34415.

I give you a workaround below.

Test run with MNIST

Afterwards the following simple Keras MLP ran without problems and with the expected performance on a GPU and a multicore CPU:

Jupyter cell 1

import time 
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA ) only 
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
    
# if not yet done elsewhere 

tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 1 # 0: cpu, 1: gpu

# The following can only be done once - all CPU cores are used otherwise  
#if use_cpu_or_gpu == 0:
#    tf.config.threading.set_intra_op_parallelism_threads(4)
#    tf.config.threading.set_inter_op_parallelism_threads(6)


# function for training 
def train(train_images, train_labels, epochs, batch_size):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(200, activation='sigmoid', input_shape=(28*28,)))
network.add(layers.Dense(100, activation='sigmoid'))
network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid'))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST 
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

Jupyter cell 2

# run it 
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_g = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_c = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)  

# test accuracy 
print('Acc: ', test_acc)

Switch to force Tensorflow to use the CPU, only

Another culprit is that - depending on the exact version of TF 2 - you may need to use the following statement to run (parts of) your code on the CPU only:

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

in the beginning. Otherwise Tensorflow 2.0 and version 2.1 will choose execution on the GPU even if you use a statement like

with tf.device("/CPU:0"):

(which worked in TF1).
It seems that this problem was solved with TF 2.2 (tested it on 15.05.2020)! But you may have to check it yourself.
You can watch the involvement of the GPU e.g. with "watch -n0.1 nvidia-smi" on a terminal. Another possibility is to set

tf.debugging.set_log_device_placement(True)  

and get messages in the shell of your virtual Python environment or in the presently used Jupyter cell.

Addendum 16.05.2020: Limiting the number of CPU cores for Tensorflow 2.0 on Linux

After several trials and tests I think that both TF2 and the Keras version delivered with handle the above given TF2 statements to limit the number of CPU cores to use inefficiently. I addition the behavior of TF2/Keras has changed with the TF2 versions 2.0, 2.1 and now 2.2.

Strange things also happen, when you combine statements of the TF1 compat layer with pure TF2 restriction statements. You should refrain from mixing them.

So, it is either

Option 1: CPU only and limited number of cores

from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=4, inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config))    
...

OR
Option 2: Mixture of GPU (with limited memory) and CPU (limited core number) with TF2 statements

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

tf.config.threading.set_intra_op_parallelism_threads(6)
tf.config.threading.set_inter_op_parallelism_threads(1)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], 
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

OR
Option 3: Mixture of GPU (limited memory) and CPU (limited core numbers) with TF1 compat statements

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

#gpu = False 
gpu = True
if gpu: 
    GPU = True;  CPU = False; num_GPU = 1; num_CPU = 1
else: 
    GPU = False; CPU = True;  num_CPU = 1; num_GPU = 0

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                        inter_op_parallelism_threads=1, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}, 
                        log_device_placement=True

                       )
config.gpu_options.per_process_gpu_memory_fraction=0.35
config.gpu_options.force_gpu_compatible = True
B.set_session(tf.compat.v1.Session(config=config))

Hint 1:

If you want to test some code parts on the GPU and others on the CPU in the same session, I strongly recommend to use the compat statements in the form given by Option 3 above

The reason is that it - strangely enough - gives you a faster performance on a multicore CPU by more than 25% in comparison to the pure TF2 statements .

Afterwards you can use statements like:

batch_size=64
epochs=5

if use_cpu_or_gpu == 0:
    start_g = time.perf_counter()
    with tf.device("/GPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_g = time.perf_counter()
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_c = time.perf_counter()
    print('Time_CPU: ', end_c - start_c)  

Hint 2:
If you check the limitations on CPU cores (threads) via watching the CPU load on tools like "gkrellm" or "ksysguard", it may appear that all cores are used in parallel. You have to set the update period of these tools to 0.1 sec to see that each core is only used intermittently. In gkrellm you should also see a variation of the average CPU load value with a variation of the parameter "intra_op_parallelism_threads=n".

Hint 3:
In my case with a Quadcore CPU with hyperthreading the following settings seem to be optimal for a variety of Keras CNN models - whenever I want to train them on the CPU only:

...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6, inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config)) 
...

Hint 4:
If you want to switch settings in a Jupyter session it is best to stop and restart the respective kernel. You can do this via the commands under "kernel" in the Jupyter interface.

Conclusion

Well, my friends the above steps where what I had to do to get Keras working in combination with TF2, Cuda 10.2 and the present version of Cudnn. I regard this not as a straightforward procedure - to say it mildly.

In addition after some tests I might also say that the performance seems to be worse than with Tensorflow 1. Especially when using Keras - whether the Keras included with Tensorflow 2 or Keras in form of separate Python lib. Especially the performance on a GPU is astonishingly bad with Keras for small networks.

This impression of sluggishness stands in a strange contrast to elementary tests were I saw a factor of 5 difference for a series of typical matrix multiplications executed directly with tf.matmul() on a GPU vs. a CPU. But this another story .....

Links

tensorflow-running-version-with-cuda-on-cpu-only

 

A simple Python program for an ANN to cover the MNIST dataset – XIV – cluster detection in feature space

We extend our studies of a program for a Multilayer perceptron and gradient descent in combination with the MNIST dataset:

A simple Python program for an ANN to cover the MNIST dataset – XIII – the impact of regularization
A simple Python program for an ANN to cover the MNIST dataset – XII – accuracy evolution, learning rate, normalization
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix
A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests
A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests
A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation
A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles
A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“
A simple Python program for an ANN to cover the MNIST dataset – V – coding the loss function
A simple Python program for an ANN to cover the MNIST dataset – IV – the concept of a cost or loss function
A simple Python program for an ANN to cover the MNIST dataset – III – forward propagation
A simple Python program for an ANN to cover the MNIST dataset – II - initial random weight values
A simple Python program for an ANN to cover the MNIST dataset – I - a starting point

In this article we shall work a bit on the following topic: How can we reduce the computational time required for gradient descent runs of our MLP?

Readers who followed my last articles will have noticed that I sometimes used 1800 epochs in a gradient descent run. The computational time including

  • costly intermediate print outs into Jupyter cells,
  • a full determination of the reached accuracy both on the full training and the test dataset at every epoch

lay in a region of 40 to 45 minutes for our MLP with two hidden layers and roughly 58000 weights. Using an Intel I7 standard CPU with OpenBlas support. And I plan to work with bigger MLPs - not on MNIST but other data sets. Believe me: Everything beyond 10 minutes is a burden. So, I have a natural interest in accelerating things on a very basic level already before turning to GPUs or arrays of them.

Factors for CPU-time

This introductory question leads to another one: What basic factors beyond technical capabilities of our Linux system and badly written parts of my Python code influence the consumption of computational time? Four points come to my mind; you probably find even more:

  • One factor is certainly the extra forward propagation run which we apply to all samples of both the test and training data seat the end of each epoch. We perform this propagation to make predictions and to get data on the evolution of the accuracy, the total loss and the ratio of the regularization term to the real costs. We could do this in the future at every 2nd or 5th epoch to save some time. But this will reduce CPU-time only by less than 22%. 76% of the CPU-time of an epoch is spent in batch-handling with a dominant part in error backward propagation and weight corrections.
  • The learning rate has a direct impact on the number of required epochs. We could enlarge the learning rate in combination with input data normalization; see the last article. This could reduce the number of required epochs significantly. Depending on the parameter choices before by up to 40% or 50%. But it requires a bit of experimenting ....
  • Two other, more important factors are the frequent number of matrix operations during error back-propagation and the size of the involved matrices. These operations depend directly on the number of nodes involved. We could therefore reduce the number of nodes of our MLP to a minimum compatible with the required accuracy and precision. This leads directly to the next point.
  • The dominant weight matrix is of course the one which couples layer L0 and layer L1. In our case its shape is 784 x 70; it has almost 55000 elements. The matrix for the next pair of layers has only 70x30 = 2100 elements - it is much, much smaller. To reduce CPU time for forward propagation we should try to make this matrix smaller. During error back propagation we must perform multiple matrix multiplications; the matrix dimensions depend on the number of samples in a mini-batch AND on the number of nodes in the involved layers. The dimensions of the the result matrix correspond to the those of the weight matrix. So once again: A reduction of the nodes in the first 2 layers would be extremely helpful for the expensive backward propagation. See: The math behind EBP.

We shall mainly concentrate on the last point in this article.

Reduction of the dimensions of the dominant matrix"requires a reduction of input features

The following numbers show typical CPU times spend for matrix operations during error back propagation [EBP] between different layers of our MLP and for two different batches at the beginning of gradient descent:

Time_CPU for BW layer operations (to L2) 0.00029015699965384556
Time_CPU for BW layer operations (to L1) 0.0008645610000712622
Time_CPU for BW layer operations (to L0) 0.006551215999934357

Time_CPU for BW layer operations (to L2) 0.00029157400012991275
Time_CPU for BW layer operations (to L1) 0.0009575330000188842
Time_CPU for BW layer operations (to L0) 0.007488838999961445

The operations involving layer L0 cost a factor of 7 more CPU time than the other operations! Therefore, a key to the reduction of the number of mathematical operations is obviously the reduction of the number of nodes in the input layer! We cannot reduce the numbers in the hidden layers much, if we do not want to hamper the accuracy properties of our MLP too much. So the basic question is

Can we reduce the number of input nodes somehow?

Yes, maybe we can! Input nodes correspond to "features". In case of the MNIST dataset the relevant features are given by the gray-values for the 784 pixels of each image. A first idea is that there are many pixels within each MNIST image which are probably not used at all for classification - especially pixels at the outer image borders. So, it would be helpful to chop them off or to ignore them by some appropriate method. In addition, special significant pixel areas may exist to which the MLP, i.e. its weight optimization, reacts during training. For example: The digits 3, 5, 6, 8, 9 all have a bow within the lower 30% of an image, but in other regions, e.g. to the left and the right, they are rather different.

If we could identify suitable image areas in which dark pixels have a higher probability for certain digits then, maybe, we could use this information to discriminate the represented digits? But a "higher density of dark pixels in an image area" is nothing else than a description of a "cluster" of (dark) pixels in certain image areas. Can we use pixel clusters at numerous areas of an image to learn about the represented digits? Is the combination of (averaged) feature values in certain clusters of pixels representative for a handwritten digit in the MNIST dataset?

If the number of such pixel clusters could be reduced below lets say 100 then we could indeed reduce the number of input features significantly!

Cluster detection

To be able to use relevant "clusters" of pixels - if they exist in a usable form in MNIST images at all - we must first identify them. Cluster identification and discrimination is a major discipline of Machine Learning. This discipline works in general with unlabeled data. In the MNIST case we would not use the labels in the "y"-data at all to identify clusters; we would only use the "X"-data. A nice introduction to the mechanisms of cluster identification is given in the book of Paul Wilcott (see Machine Learning – book recommendations for the reference). The most fundamental method - called "kmeans" - iterates over 3 major steps [I simplify a bit :-)]:

  • We assume that K clusters exist and start with random initial positions of their centers (called "centroids") in the multidimensional feature space
  • We measure the distance of all data points to he centroids and associate a point with that centroid to which the distance is smallest
  • We determine the "center of mass" (according to some distance metric) of the identified data point groups and assume it as a new position of the centroids and move the old positions (a bit) in this direction.

We iterate over these steps until the centroids' positions hopefully get stable. Pretty simple. But there is a major drawback: You must make an assumption on the number "K" of clusters. To make such an assumption can become difficult in the complex case of a feature space with hundreds of dimensions.

You can compensate this by executing multiple cluster runs and comparing the results. By what? Regarding the closure or separation of clusters in terms of an appropriate norm. One such norm is called "cluster inertia"; it measures the mean squared distance to the center for all points of a cluster. The theory is that the sum of the inertias for all clusters drops significantly with the number of clusters until an optimal number is reached and the inertia curve flattens out. The point where this happens in a plot of inertia vs. number of clusters is called "elbow". Identifying this "elbow" is one of the means to find an optimal number of clusters. However, this recipe does not work under all circumstances. As the number of clusters get big we may be confronted with a smooth decline of the inertia sum.

What data do we use for gradient descent after cluster detection?

How could we measure whether an image shows certain clusters? We could e.g. measure distances (with some appropriate metric) of all image points to the clusters. The "fit_transform()"-method of KMeans and MiniBatchKMeans provide us with with some distance measure of each image to the identified clusters. This means our images are transformed into a new feature space - namely into a "cluster-distance space". This is a quite complex space, too. But it has less dimensions than the original feature space!

Note: We would of course normalize the resulting distance data in the new feature space before applying gradient descent.

Application of "KMeansBatch" to MNIST

There are multiple variants of "KMeans". We shall use one which is provided by SciKit-Learn and which is optimized for large datasets: "MiniBatchKMeans". It operates batch-wise without loosing too much of accuracy and convergence properties in comparison to KMeans (or a comparison see here). "MiniBatchKMeans"has some parameters you can play with.

We could be tempted to use 10 clusters as there are 10 digits to discriminate between. But remember: A digit can be written in very many ways. So, it is much more probable that we need a significant larger number of clusters. But again: How to determine on which K-values we should invest a bit more time? "Kmeans" and methods alike offer another quantity called "silhouette" coefficient. It measures how well the data points are within, at or outside the borders of a cluster. See the book of Geron referenced at the link given above on more information.

Variation of CPU time, inertia and average silhouette coefficients with the number of clusters "K"

Let us first have a look at the evolution of CPU time, total inertia and averaged silhouette with the number of clusters "K" for two different runs. The following code for a Jupyter cell gives us the data:

    
# *********************************************************
# Pre-Clustering => Searching for the elbow 
# *********************************************************
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
X = np.concatenate((ANN._X_train, ANN._X_test), axis=0)
y = np.concatenate((ANN._y_train, ANN._y_test), axis=0)
print("X-shape = ", X.shape, "y-shape = ", y.shape)
num = X.shape[0]

li_n = []
li_inertia = []
li_CPU = []
li_sil1 = []

# Loop over the number "n" of assumed clusters 
rg_n = range(10,171,10)
for n in rg_n:
    print("\nNumber of clusters: ", n)
    start = time.perf_counter()
    kmeans = MiniBatchKMeans(n_clusters=n, n_init=500, max_iter=1000, batch_size=500 )  
    X_clustered = kmeans.fit_transform(X)
    sil1 = silhouette_score(X, kmeans.labels_)
    #sil2 = silhouette_score(X_clustered, kmeans.labels_)
    end = time.perf_counter()
    dtime = end - start
    print('Inertia = ', kmeans.inertia_)
    print('Time_CPU = ', dtime)
    print('sil1 score = ', sil1)
    li_n.append(n)    
    li_inertia.append(kmeans.inertia_)    
    li_CPU.append(dtime)    
    li_sil1.append(sil1)    

    
# Plots         
# ******
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 5
fig1 = plt.figure(1)
fig2 = plt.figure(2)

ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)

ax1_1.plot(li_n, li_CPU)
ax1_1.set_xlabel("num clusters K")
ax1_1.set_ylabel("CPU time")

ax1_2.plot(li_n, li_inertia)
ax1_2.set_xlabel("num clusters K")
ax1_2.set_ylabel("inertia")

ax2_1 = fig2.add_subplot(121)
ax2_2 = fig2.add_subplot(122)

ax2_1.plot(li_n, li_sil1)
ax2_1.set_xlabel("num clusters K")
ax2_1.set_ylabel("silhoutte 1")

 
You see that I allowed for large numbers of initial centroid positions and iterations to be on the safe side. Before you try it yourself: Such runs for a broad variation of K-values are relatively costly. The CPU time rises from around 32 seconds for 30 clusters to a little less than 1 minute for 180 clusters. These times add up to a significant sum after a while ...

Here are some plots:

The second run was executed with a higher resolution of K_(n+1) - K_n 5 = 5.

We see that the CPU time to determine the centroids' positions varies fairly linear with "K". And even for 170 clusters it does not take more than a minute! So, CPU-time for cluster identification is not a major limitation.

Unfortunately, we do not see a clear elbow in the inertia curve! What you regard as a reasonable choice for the number K depends a lot on where you say the curve starts to flatten. You could say that this happens around K = 60 to 90. But the results for the silhouette-quantity indicate for our parameter setting that K=40, K=70, K=90 are interesting points. We shall look at these points a bit closer with higher resolution later on.

Reduction of the regularization factor (for Ridge regularization)

Now, I want to discuss an important point which I did not find in the literature:
In my last article we saw that regularization plays a significant but also delicate role in reaching top accuracy values for the test dataset. We saw that Lambda2 = 0.2 was a good choice for a normalized input of the MNIST data. It corresponded to a certain ratio of the regularization term to average batch costs.
But when we reduce the number of input nodes we also reduce the number of total weights. So the weight values themselves will automatically become bigger if we want to get to similar good values at the second layer. But as the regularization term depends in a quadratic way on the weights we may assume that we roughly need a linear reduction of Lambda2. So, for K=100 clusters we may shrink Lambda2 to (0.2/784*100) = 0.025 instead of 0.2. In general:

Lambda2_cluster = Lambda2_std * K / (number of input nodes)

I applied this rule of a thumb successfully throughout experiments with clustering befor gradient descent.

Reference run without clustering

We saw at the end of article XII that we could reach an accuracy of around 0.975 after 500 epochs under optimal circumstances. But in the case I presented ten I was extremely lucky with the statistical initial weight distribution and the batch composition. In other runs with the same parameter setup I got smaller accuracy values. So, let us take an ad hoc run with the following parameters and results:
Parameters: learn_rate = 0.001, decrease_rate = 0.00001, mom_rate = 0.00005, n_size_mini_batch = 500, n_epochs = 600, Lambda2 = 0.2, weights at all layers in [-2*1.0/sqrt(num_nodes_layer), 2*1.0/sqrt(num_nodes_layer)]
Results: acc_train: 0.9949 , acc_test: 0.9735, convergence after ca. 550-600 epochs

The next plot shows (from left to right and the down) the evolution of the costs per batch, the averaged error of the last mini-batch during an epoch, the ratio of regularization to batch costs and the total costs of the training set, respectively .

The following plot summarizes the evolution of the total costs of the traaining set (including the regularization contribution) and the evolution of the accuracy on the training and the test data sets (in orange and blue, respectively).

The required computational time for the 600 epochs was roughly 18,2 minutes.

Results of gradient descent based on a prior cluster identification

Before we go into a more detailed discussion of code adaption and test runs with things like clusters in unnormalized and normalized feature spaces, I want to show what we - without too much effort - can get out of using cluster detection ahead of gradient descent. The next plot shows the evolution of a run for K=70 clusters in combination with a special normalization:

and the total cost and accuracy evolution

The dotted line marks an accuracy of 97.8%! This is 0.5% bigger then our reference value of 97.3%. The total gain of %gt; 0.5% means however 18.5% of the remaining difference of 2.7% to 100% and we past a value of 97.8% already at epoch 600 of the run.

What were the required computational times?

If we just wanted 97.4% as accuracy we need around 150 epochs. And a total CPU time of 1.3 minutes to get to the same accuracy as our reference run. This is a factor of roughly 14 in required CPU time. For a stable 97.73% after epoch 350 we were still a factor of 5.6 better. For a stable accuracy beyond 97.8% we needed around 600 epochs - and still were by a factor of 3.3 faster than our reference run! So, clustering really brings some big advantages with it.

Conclusion

In this article I discussed the idea of introducing cluster identification in the (unnormalized or normalized) feature space ahead of gradient descent as a possible means to save computational time. A preliminary trial run showed that we indeed can become significantly faster by at least a factor of 3 up to 5 and even more. This is just due to the point that we reduced the number of input nodes and thus the number of mathematical calculations during matrix operations.

In the next article we shall have a more detailed look at clustering techniques in combination with normalization.