MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation

In my last article in this blog I wrote a bit about some steps to get Keras running with Tensorflow 2 [TF2] and Cuda 10.2 on Opensuse Leap 15.1. One objective of these efforts was a performance comparison between two similar Multilayer Perceptrons [MLP] :

  • my own MLP programmed with Python and Numpy; I have discuss this program in another article series;
  • an MLP with a similar setup based on Keras and TF2

Not for reasons of a competition, but to learn a bit about differences. When and for what parameters do Keras/TF2 offer a better performance?
Another objective is to test TF-alternatives to Numpy functions and possible performance gains.

For the Python code of my own MLP see the article series starting with the following post:

A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

But I will discuss relevant code fragments also here when needed.

I think, performance is always an interesting topic – especially for dummies as me regarding Python. After some trials and errors I decided to discuss some of my experiences with MLP performance and optimization options in a separate series of the section “Machine learning” in this blog. This articles starts with two simple measures.

A factor of 6 turns turns into a factor below 2

Well, what did a first comparison give me? Regarding CPU time I got a factor of 6 on the MNIST dataset for a batch-size of 500. Of course, Keras with TF2 was faster 🙂 . Devastating? Not at all … After years of dealing with databases and factors of up to 100 by changes of SQL-statements and indexing a factor of 6 cannot shock or surprise me.

The Python code was the product of an unpaid hobby activity in my scarce free time. And I am still a beginner in Python. The code was also totally unoptimized, yet – both regarding technical aspects and the general handling of forward and backward propagation. It also contained and still contains a lot of superfluous statements for testing. Actually, I had expected an even bigger factor.

In addition, some things between Keras and my Python programs are not directly comparable as I only use 4 CPU cores for Openblas – this gave me an optimum for Python/Numpy programs in a Jupyter environment. Keras and TF2 instead seem to use all available CPU threads (successfully) despite limiting threading with TF-statements. (By the way: This is an interesting point in itself. If OpenBlas cannot give them advantages what else do they do?)

A very surprising point was, however, that using a GPU did not make the factor much bigger – despite the fact that TF2 should be able to accelerate certain operations on a GPU by at least by a factor of 2 up to 5 as independent tests on matrix operations showed me. And a factor of > 2 between my GPU and the CPU is what I remember from TF1-times last year. So, either the CPU is better supported now or the GPU-support of TF2 has become worse compared to TF1. An interesting point, too, for further investigations …

An even bigger surprise was that I could reduce the factor for the given batch-size down to 2 by just two major, butsimple code changes! However, further testing also showed a huge dependency on the batch sizechosen for training – which is another interesting point. Simple tests show that we may even be able to reduce the performance factor further by

  • by using directly coupled matrix operations – if logically possible
  • by using the basic low-level Python API for some operations

Hope, this sounds interesting for you.

The reference model based on Keras

I used the following model as a reference
in a Jupyter environment executed on Firefox:

Jupyter Cell 1

 
# compact version 
# ****************
import time 
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras import regularizers
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA ) only 
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
# The following can only be done once - all CPU cores are used otherwise  
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
    
# if not yet done elsewhere 
#tf.compat.v1.disable_eager_execution()
#tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 0 # 0: cpu, 1: gpu

# function for training 
def train(train_images, train_labels, epochs, batch_size, shuffle):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, shuffle=shuffle)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(70, activation='sigmoid', input_shape=(28*28,), kernel_regularizer=regularizers.l2(0.01)))
#network.add(layers.Dense(80, activation='sigmoid'))
#network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST 
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

 

Jupyter Cell 2

# run it 
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_g = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_c = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)  

# test accuracy 
print('Acc:: ', test_acc)

Typical output – first run:

 
Epoch 1/35
60000/60000 [==============================] - 1s 16us/step - loss: 2.6700 - accuracy: 0.1939
Epoch 2/35
60000/60000 [==============================] - 0s 5us/step - loss: 2.2814 - accuracy: 0.3489
Epoch 3/35
60000/60000 [==============================] - 0s 5us/step - loss: 2.1386 - accuracy: 0.3848
Epoch 4/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.9996 - accuracy: 0.3957
Epoch 5/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.8941 - accuracy: 0.4115
Epoch 6/35
60000/60000 [==============================] - 
0s 5us/step - loss: 1.8143 - accuracy: 0.4257
Epoch 7/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.7556 - accuracy: 0.4392
Epoch 8/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.7086 - accuracy: 0.4542
Epoch 9/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6726 - accuracy: 0.4664
Epoch 10/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6412 - accuracy: 0.4767
Epoch 11/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6156 - accuracy: 0.4869
Epoch 12/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5933 - accuracy: 0.4968
Epoch 13/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5732 - accuracy: 0.5078
Epoch 14/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5556 - accuracy: 0.5180
Epoch 15/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5400 - accuracy: 0.5269
Epoch 16/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5244 - accuracy: 0.5373
Epoch 17/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5106 - accuracy: 0.5494
Epoch 18/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4969 - accuracy: 0.5613
Epoch 19/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4834 - accuracy: 0.5809
Epoch 20/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4648 - accuracy: 0.6112
Epoch 21/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4369 - accuracy: 0.6520
Epoch 22/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3976 - accuracy: 0.6821
Epoch 23/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3602 - accuracy: 0.6984
Epoch 24/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3275 - accuracy: 0.7084
Epoch 25/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3011 - accuracy: 0.7147
Epoch 26/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2777 - accuracy: 0.7199
Epoch 27/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2581 - accuracy: 0.7261
Epoch 28/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2411 - accuracy: 0.7265
Epoch 29/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2259 - accuracy: 0.7306
Epoch 30/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2140 - accuracy: 0.7329
Epoch 31/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2003 - accuracy: 0.7355
Epoch 32/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1890 - accuracy: 0.7378
Epoch 33/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1783 - accuracy: 0.7410
Epoch 34/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1700 - accuracy: 0.7425
Epoch 35/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1605 - accuracy: 0.7449
10000/10000 [==============================] - 0s 37us/step
Time_CPU:  11.055424336002034
Acc::  0.7436000108718872

 
A second run was a bit faster: 10.8 secs. Accuracy around: 0.7449.
The relatively low accuracy is mainly due to the regularization (and reasonable to avoid overfitting). Without regularization we would already have passed the 0.9 border.

My own unoptimized MLP-program was executed with the following parameter setting:

 

             my_data_set="mnist_keras", 
             n_hidden_layers = 2, 
             ay_nodes_layers = [0, 70, 30, 0], 
             n_nodes_layer_out = 10,
             num_test_records = 10000, # 
number of test data
             
             # Normalizing - you should play with scaler1 only for the time being      
             scaler1 = 1,   # 1: StandardScaler (full set), 1: Normalizer (per sample)        
             scaler2 = 0,   # 0: StandardScaler (full set), 1: MinMaxScaler (full set)       
             b_normalize_X_before_preproc = False,     
             b_normalize_X_after_preproc  = True,     

             my_loss_function = "LogLoss",
 
             n_size_mini_batch = 500,
             n_epochs = 35, 
             lambda2_reg = 0.01,  

             learn_rate = 0.001,
             decrease_const = 0.000001, 

             init_weight_meth_L0 = "sqrt_nodes",  # method to init weights in an interval defined by  =>"sqrt_nodes" or a constant interval  "const"
             init_weight_meth_Ln = "sqrt_nodes",  # sqrt_nodes", "const"
             init_weight_intervals = [(-0.5, 0.5), (-0.5, 0.5), (-0.5, 0.5)],   # in case of a constant interval
             init_weight_fact = 2.0,              # extends the interval 
             mom_rate   = 0.00005,

             b_shuffle_batches = True,    # shuffling the batches at the start of each epoch 
             b_predictions_train = True,  # test accuracy by  predictions for ALL samples of the training set (MNIST: 60000) at the start of each epoch
             b_predictions_test  = False,  
             prediction_train_period = 1, # 1: each and every epoch is used for accuracy tests on the full training set
             prediction_test_period = 1,  # 1: each and every epoch is used for accuracy tests on the full test dataset

 

People familiar with my other article series on the MLP program know the parameters. But I think their names and comments are clear enough.

With a measurement of accuracy based on a forward propagation of the complete training set after each and every epoch (with the adjusted weights) I got a run time of 60 secs.

With accuracy measurements based on error tracking for batches and averaging over all batches, I get 49.5 secs (on 4 CPU threads). So, this is the mentioned factor between 5 and 6.

(By the way: The test indicates some space for improvement on the “Forward Propagation” 🙂 We shall take care of this in the next article of this series – promised).

So, these were the references or baselines for improvements.

Two measures – and a significant acceleration

Well, let us look at the results after two major code changes. With a test of accuracy performed on the full training set of 60000 samples at the start of each epoch I get the following result :

------------------
Starting epoch 35

Time_CPU for epoch 35 0.5518779030026053
relative CPU time portions: shuffle: 0.05  batch loop: 0.58  prediction:  0.37
Total CPU-time:  19.065050211000198

learning rate =  0.0009994051838157095

total costs of training set   =  5843.522
rel. reg. contrib. to total costs =  0.0013737131

total costs of last mini_batch   =  56.300297
rel. reg. contrib. to batch costs =  0.14256112

mean abs weight at L0 :  0.06393985
mean abs weight at L1 :  0.37341583
mean abs weight at L2 :  1.302389

avg total error of last mini_batch =  0.00709
presently reached train accuracy   =  0.99072

-------------------
Total training Time_CPU:  19.04528829299714

With accuracy taken only from the error of a batch:

avg total error of last mini_batch =  0.00806
presently reached train accuracy   =  0.99194
-------------------
Total training Time_CPU:  11.331006342999899

Isn’t this good news? A time of 11.3 secs is pretty close to what Keras provides us with! (Well, at least for a batch size of 500). And with a better result regarding accuracy on my side – but this has to do with a probably different
handling of learning rates and the precise translation of the L2-regularization parameter for batches.

Plots:

How did I get to this point? As said: Two measures were sufficient.

A big leap in performance by turning to float32 precision

So far I have never cared too much for defining the level of precision by which Numpy handles arrays with floating point numbers. In the context of Machine Learning this is a profound mistake. on a 64bit CPU many time consuming operations can gain almost a factor of 2 in performance when using float 32 precision – if the programmers tweaked everything. And I assume the Numpy guys did it.

So: Just use “dtype=np.float32” (np means “numpy” which I always import as “np”) whenever you initialize numpy arrays!

For the readers following my other series: You should look at multiple methods performing some kind of initialization of my “MyANN”-class. Here is a list:

 
    def _handle_input_data(self): 
        .....
            self._y = np.array([int(i) for i in self._y], dtype=np.float32)
        .....
        self._X = self._X.astype(np.float32)
        self._y = self._y.astype(np.int32)
        .....
    def _encode_all_y_labels(self, b_print=True):
        .....
        self._ay_onehot = np.zeros((self._n_labels, self._y_train.shape[0]), dtype=np.float32)
        self._ay_oneval = np.zeros((self._n_labels, self._y_train.shape[0], 2), dtype=np.float32)
   
        .....
    def _create_WM_Input(self):
        .....
        w0 = w0.astype(dtype=np.float32)
        .....
    def _create_WM_Hidden(self):
        .....
            w_i_next = w_i_next.astype(dtype=np.float32)
        .....
    def _create_momentum_matrices(self):
        .....
            self._li_mom[i] = np.zeros(self._li_w[i].shape, dtype=np.float32)
        .....
    def _prepare_epochs_and_batches(self, b_print = True):
        .....
        self._ay_theta = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        self._ay_costs = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        self._ay_reg_cost_contrib = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        .....
        self._ay_period_test_epoch     = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_acc_test_epoch        = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_err_test_epoch        = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_period_train_epoch    = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_acc_train_epoch       = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_err_train_epoch       = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_tot_costs_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_rel_reg_train_epoch   = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        .....
        self._ay_mean_abs_weight = -10 * np.ones(shape_weights, dtype=np.float32) 
        .....
 
   def _add_bias_neuron_to_layer(self, A, how='column'):
        .....
            A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
        .....
            A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
    .....

 

After I applied these changes the factor in comparison to Keras went down to 3.1 – for a batch size of 500. Good news after a first simple step!

Reducing the CPU time once more

The next step required a bit more thinking. When I went through further more detailed tests of CPU consumption for various steps during training I found that the error back propagation through the network required significantly more time than the forward propagation.

At first sight this seems to be logical. There are more operations to be done between layers – real matrix multiplications with np.dot() (or np.matmul()) and element-wise multiplications with the “*”-operation. See also my PDF on the basic math:
Back_Propagation_1.0_200216.

But this is wrong assumption: When I measured CPU times in detail I saw that such operations took most time when network layer L0 – i.e. the input layer of the MLP – got involved. This also seemed to be reasonable: the weight matrix is biggest there; the input layer of all layers has most neuron nodes.

But when I went through the code I saw that I just had been too lazy whilst coding back propagation:

 
    ''' -- Method to handle error BW propagation for a mini-batch --'''
    def _bw_propagation(self, 
                        ay_y_enc, li_Z_in, li_A_out, 
                        li_delta_out, li_delta, li_D, li_grad, 
                        b_print = True, b_internal_timing = False):
        
        # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch 
        
        # Initiate BW propagation - provide delta-matrices for outermost layer
        # *********************** 
        # Input Z at outermost layer E  (4 layers -> layer 3)
        ay_Z_E = li_Z_in[self._n_total_layers-1]
        # Output A at outermost layer E (was calculated by output function)
        ay_A_E = li_A_out[self._n_total_layers-1]
        
        # Calculate D-matrix (derivative of output function) at outmost the layer - presently only D_sigmoid 
        ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print )
        
        # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
        ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E, ay_D_E=ay_D_E, b_print=b_print) 
        
        # add the matrices at the outermost layer to their lists ; li_delta_out gets only one element 
        idxE = self._n_total_layers - 1
        li_delta_out[idxE] = ay_delta_out_E # this happens only once
        li_delta[idxE]     = ay_delta_E
        li_D[idxE]         = ay_D_E
        li_grad[idxE]      = None    # On the outermost layer there is no gradient ! 
        
        # Loop over all layers in reverse direction 
        # ******************************************
        # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2))
        range_N_bw_layer = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken 
        
        # loop over layers 
        for N in range_N_bw_layer:
            
            # Back Propagation operations between layers N+1 and N 
            # *******************************************************
            # this method handles the special treatment of bias nodes in Z_in, too
            ay_delta_N, ay_D_N, ay_grad_
N = self._bw_prop_Np1_to_N( N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False )
            
            # add matrices to their lists 
            li_delta[N] = ay_delta_N
            li_D[N]     = ay_D_N
            li_grad[N]= ay_grad_N
       
        return

 
with the following key function:

 
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta):
        '''
        BW-error-propagation between layer N+1 and N 
        Inputs: 
            li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
            li_A_out: List of output A-matrices - values were calculated during FW-propagation
            li_delta: List of delta-matrices - values for outermost ölayer E to layer N+1 should exist 
        
        Returns: 
            ay_delta_N - delta-matrix of layer N (required in subsequent steps)
            ay_D_N     - derivative matrix for the activation function on layer N 
            ay_grad_N  - matrix with gradient elements of the cost fnction with respect to the weights on layer N 
        '''
        
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling between layers N and N+1 
        ay_W_N = self._li_w[N]
        # delta-matrix of layer N+1
        ay_delta_Np1 = li_delta[N+1]

        # !!! Add row (for bias) to Z_N intermediately !!!
        ay_Z_N = li_Z_in[N]
        ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        
        # Derivative matrix for the activation function (with extra bias node row)
        ay_D_N = self._calculate_D_N(ay_Z_N)
        
        # fetch output value saved during FW propagation 
        ay_A_N = li_A_out[N]
        
        # Propagate delta
        # **************
        # intermediate delta 
        ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
        # final delta 
        ay_delta_N = ay_delta_w_N * ay_D_N
        # reduce dimension again (bias row)
        ay_delta_N = ay_delta_N[1:, :]
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        ay_grad_N[:, 1:] += (self._li_w[N][:, 1:] * self._lambda2_reg + np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg) 
        
        return ay_delta_N, ay_D_N, ay_grad_N

 

Now, look at the eventual code:

 
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
        '''
        BW-error-propagation between layer N+1 and N 
        .... 
        '''
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling between layers N and N+1 
        ay_W_N = self._li_w[N]
        ay_delta_Np1 = li_delta[N+1]

        # fetch output value saved during FW propagation 
        ay_A_N = li_A_out[N]

        # Optimization ! 
        if N > 0: 
            ay_Z_N = li_Z_in[N]
            # !!! Add intermediate row (for bias) to Z_N !!!
            ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        
            # Derivative matrix for the activation function (with extra bias node 
row)
            ay_D_N = self._calculate_D_N(ay_Z_N)
        
            # Propagate delta
            # **************
            # intermediate delta 
            ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
            # final delta 
            ay_delta_N = ay_delta_w_N * ay_D_N
            # reduce dimension again 
            ay_delta_N = ay_delta_N[1:, :]
            
        else: 
            ay_delta_N = None
            ay_D_N = None
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        if self._lambda2_reg > 0.0: 
            ay_grad_N[:, 1:] += self._li_w[N][:, 1:] * self._lambda2_reg 
        if self._lambda1_reg > 0.0: 
            ay_grad_N[:, 1:] += np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg 
        
        return ay_delta_N, ay_D_N, ay_grad_N

 

You have, of course, detected the most important change:

We do not need to propagate any delta-matrices (originally coming from the error deviation at the output layer) down to layer 1!

This is due to the somewhat staggered nature of error back propagation – see the PDF on the math again. Between the first hidden layer L1 and the input layer L0 we only need to fetch the output matrix A at L0 to be able to calculate the gradient components for the weights in the weight matrix connecting L0 and L1. This saves us from the biggest matrix multiplication – and thus reduces computational time significantly.

Another bit of CPU time can be saved by calculating only the regularization terms really asked for; for my simple densely populated network I almost never use Lasso regularization; so L1 = 0.

These changes got me down to the values mentioned above. And, note: The CPU time for backward propagation then drops to the level of forward propagation. So: Be somewhat skeptical about your coding if backward propagation takes much more CPU time than forward propagation!

Dependency on the batch size

I should remark that TF2 still brings some major and remarkable advantages with it. Its strength becomes clear when we go to much bigger batch sizes than 500:
When we e.g. take a size of 10000 samples in a batch, the required time of Keras and TF2 goes down to 6.4 secs. This is again a factor of roughly 1.75 faster.
I do not see any such acceleration with batch size in case of my own program!

More detailed tests showed that I do not gain speed with a batch size over 1000; the CPU time increases linearly from that point on. This actually seems to be a limitation of Numpy and OpenBlas on my system.

Because , I have some reasons to believe that TF2 also uses some basic OpenBlas routines, this is an indication that we need to put more brain into further optimization.

Conclusion

We saw in this article that ML programs based on Python and Numpy may gain a boost by using only dtype=float32 and the related accuracy for Numpy arrays. In addition we saw that avoiding unnecessary propagation steps between the first hidden and at the input layer helps a lot.

In the next article of this series we shall look a bit at the performance of forward propagation – especially during accuracy tests on the training and test data set.

Further articles in this series

MLP, Numpy, TF2 – performance issues – Step II – bias neurons,
F- or C- contiguous arrays and performance

MLP, Numpy, TF2 – performance issues – Step III – a correction to BW propagation

Getting a Keras based MLP to run with Cuda 10.2, Cudnn 7.6 and TensorFlow 2.0 on an Opensuse Leap 15.1 system

During last weekend I wanted to compare the performance of an old 20-line Keras setup for a simple MLP with the performance of a self-programmed Python- and Numpy-based MLP regarding training epochs on the MNIST dataset. The Keras code was set up in a Jupyter notebook last autumn – at that time for TensorFlow 1 and Cuda 10.0 for my Nvidia graphics card. I thought it might be a good time to move everything to Tensorflow 2 and the latest Cuda libraries on my Opensuse Leap 15.1 system. This was more work than expected – and for some problems I was forced to apply some dirty workarounds. I got it running. Maybe the necessary steps, which are not really obvious, are helpful for others, too.

Install Cuda 10.2 and Cudnn on an Opensuse Leap 15.1

Before you want to use TensorFlow [TF] on a Nvidia graphics card you must install Cuda. The present version is Cuda 10.2. I was a bit naive to assume that this should be the right version – as it has been available for some time already. Wrong! Afterwards I read somewhere that TensorFlow2 [TF2] is working with Cuda 10.1, only, and not yet fully compatible with Cuda 10.2. Well, at least for my purposes [MLP training] it seemed to work nevertheless – with some “dirty” additional library links.

There is a central Cuda repository available at this address: cuda10.2. Actually, the repo offers both cuda10.0, cuda10.1 and cuda10.2 (plus some nvidia drivers). I selected some central cuda10.2 packages for installation – just to find out where the related files were placed in the filesystem. I then ran into a major chain of packet dependencies, which I had to resolve during many tedious steps . Some packages may not have been necessary for a basic installation. In the end I was too lazy to restrict the libs to what is really required for Keras. The bill came afterwards: Cuda 10.2 is huge! If you do not know exactly what you need: Be prepared to invest up to 3 GB on your hard disk.

The Cuda 10.2 RPM packets install most of the the required “*.so”-shared library files and many other additional files in a directory “/usr/local/cuda-10.2/”. To make changes between different versions of Cuda possible we also find a link “/usr/local/cuda” pointing to
“/usr/local/cuda-10.2/” after the installation. Ok, reasonable – we could change the link to point to “/usr/local/cuda-10.0/”. This makes you assume that the Tensorflow 2 libraries and other modules in your virtual Python3 and Jupyter environment would look for required Cuda files in the central directory “/usr/local/cuda” – i.e. without special version attributes of the files. Unfortunately, this was another wrong assumption. 🙁 See below.

In addition to the Cuda packages you must install the present “cudnn” libraries from Nvidia – more precisely: The runtime and the development package. You get the RPMs from here. Be prepared to give Nvidia your private data. 🙁

I should add that I ignored and ignore the Nvidia drivers from the Cuda repository, i.e. I never installed them. Instead, I took those from the standard Nvidia community repository. They worked and work well so far – and make my update life on Opensuse easier.

Installation of Tensorflow2 modules in your (virtual) Python3 environment

I use a virtual Python3 environment and update it regularly via “pip”. Regarding TF2 an update via the command “pip install –upgrade tensorflow” should be sufficient – it will resolve dependencies. (By the way: If you want to bring all Python libs to their present version you can also use “pip-review –auto”. Note that under certain circumstances you may need the “–force” option for special upgrades. I cannot go into
details in this article.)

Multiple complaints about missing libraries …

Unfortunately, the next time I started my virtual Python environment I got the warning that the dynamic library “libnvinfer.so.6” could not be found, but was required in case I planned to use TensorRT. What? Well, you may find some information here
https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/
https://developer.nvidia.com/tensorrt

I leave it up to you whether you really need TensorRT. You can ignore this message – TF will run for standard purposes without it. But, dear TF-developers: a clear message in the warning would in my opinion have been helpful. Then I checked whether some version of the Nvidia related library “libnvinfer.so” came with Cuda or Cudnn onto my system. Yeah, it did – unfortunately version 7 instead of 6. :-(.
So, we are confronted with a dependency on a specific file version which is older than the present one. I do not like this style of development. Normally, it should be the other way round: If a newer version is required due to new capabilities you warn the user. But under normal circumstances a backward compatibility of libs should be established. You would assume such a backward compatibility and that TF would search for the present version via looking for files “libnvinfer.so” and “libnvinfer_plugin.so” which do exist and point to the latest versions. But, no, in this case they want it explicitly to be version 6 … Makes you wonder whether the old Cudnn version is still available. I did not check it. Ok, ok – backward compatibility is not always possible ….

Just to see how good the internal checking of the alleged dependency is, I did something you normally do not do: I created a link “libnvinfer.so.6” in “/usr/lib64” to “libnvinfer.7.so”. Had to do the same for “libnvinfer_plugin.so.6”. Effect: I got rid of the warning – so much about dependency checking. I left the linking. You see I trust in coming developments sometimes and run some risks ….

Then came the next surprise. I had read a bit about changed statements in TF2 (compared to TF1) – and thought I was prepared for this. But, when I tried to execute some initial commands to configure TF2 from a Jupyter cell as e.g.

 
import time 
import tensorflow as tf
from tensorflow import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from tensorflow.python.client import device_lib

import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

tf.config.optimizer.set_jit(True)
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)
tf.debugging.set_log_device_placement(True)

device_lib.list_local_devices()  

I at once got a complaint in the shell from which I had started the Jupyter notebook – saying that a lib called “libcudart.so.10.1” was missing. Again – an explicit version dependency 🙁 . On purpose or just a glitch? Just one out of many files version dependent? Without a clear information? If this becomes the standard in the interaction between TF2 and Cuda – well, no fun any longer. In my opinion the TF2 developers should not use a search for files via version specific names – but do an analysis of headers and warn explicitly that the present version requires a specific Cuda version. Would be much more convenient for the user and compatible with the link mechanism described above.

Whilst a bunch of other dynamic libs was loaded by their name without a version in this case TF2 asks for a very specific version – although there is a corresponding lib available in the directory “/usr/lib/cuda-10.2″…. Nevertheless with full trust again in a better future
I offered TF2 a softlink “libcudart.so.10.1” in “/usr/lib64/” pointing to the “/usr/local/cuda-10.2/lib64/libcudart.so”. It cleared my way to the next hurdle. And my Keras MLP worked in the end …

Missing “./bin” directory … and other path related problems

When I tried to run specific Keras commands, which TF2 wanted to compile as XLA-supported statements, I again got complaints that files in a local directory “./bin” were missing. This was a first clear indication that Cuda paths were totally ignored in my Python/Jupyter environment. But what directory did the “./” refer to? Some experiments revealed:

I had to link an artificial subdirectory “./bin” in the directory where I kept my Jupyter notebooks to “/usr/local/Cuda-10.2/bin”.

But the next problems with other directories waited directly around the corner. Actually many … To make a long story short – the installation of TF2 in combination with Cuda 1.2 does not evaluate paths or ask for paths when used in a Python3/Jupyter environment. We have to provide and export them as shell environment variables. See below.

Warnings and errors regarding XLA capabilities

Another thing which drove me nuts was that TF2 required information about XLA-flags. It took me a while to find out that this also could be handled via environment variables.

All in all I now start the shell from which I launch my virtual Python environment and Jupyter notebooks with the following command sequence:

myself@mytux:/projekte/GIT/....../ml> export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
myself@mytux:/projekte/GIT/....../ml> export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
myself@mytux:/projekte/GIT/....../ml_1> export OPENBLAS_NUM_THREADS=4              
myself@mytux:/projekte/GIT/....../ml_1> source bin/activate
(ml) myself@mytux:/projekte/GIT/....../ml_1> jupyter notebook 

The first two commands did the magic regarding the path-problems! TF2 worked afterwards both for XLA-capable CPUs and Nvidia GPUs. So, a specific version may or may not have advantages – I do not know – but at least you can get TF2 running with Cuda 10.2.

Changed commands to control threading and memory consumption

Without the use of explicit compatibility commands TF2 does not support commands like

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}, 
                        log_device_placement=True

                       )
config.gpu_options.per_process_gpu_memory_fraction=0.4
config.gpu_options.force_gpu_compatible = True  

any longer. But as with TF1 you probably do not want to pick all the memory from your graphics card and you do not want to use all cores of a CPU in TF2. You can circumvent the lack of a “ConfigProto” property in TF2 by the following commands:

# configure use of just in time compiler 
tf.config.optimizer.set_jit(True) 
# limit use of parallel threads 
tf.config.threading.set_intra_op_parallelism_threads(4) 
tf.config.threading.set_inter_op_parallelism_threads(4)
# Not required in TF2: tf.enable_eager_execution()
# print out use of certain device (at first run)
tf.debugging.set_log_device_placement(True)
#limit use of graphics card memory 
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
# Not required in TF2: tf.enable_eager_execution()
# print out a list of usable devices
device_lib.list_local_devices()   

Addendum, 15.
05.2020:

Well, this actually proved to be correct for the limitation of the GPU memory, only. The limitations on the CPU cores do NOT work. At least not on my system. See also:
tensorflow issues 34415.

I give you a workaround below.

Test run with MNIST

Afterwards the following simple Keras MLP ran without problems and with the expected performance on a GPU and a multicore CPU:

Jupyter cell 1

import time 
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA ) only 
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
    
# if not yet done elsewhere 

tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 1 # 0: cpu, 1: gpu

# The following can only be done once - all CPU cores are used otherwise  
#if use_cpu_or_gpu == 0:
#    tf.config.threading.set_intra_op_parallelism_threads(4)
#    tf.config.threading.set_inter_op_parallelism_threads(6)


# function for training 
def train(train_images, train_labels, epochs, batch_size):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(200, activation='sigmoid', input_shape=(28*28,)))
network.add(layers.Dense(100, activation='sigmoid'))
network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid'))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST 
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

Jupyter cell 2

# run it 
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_g = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_c = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)  

# test accuracy 
print('Acc: ', test_acc)

Switch to force Tensorflow to use the CPU, only

Another culprit is that – depending on the exact version of TF 2 – you may need to use the following statement to run (parts of) your code on the CPU only:

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

in the beginning. Otherwise Tensorflow 2.0 and version 2.1 will choose execution on the GPU even if you use a statement like

with tf.device("/CPU:0"):

(which worked in TF1).
It seems that this problem was solved with TF 2.2 (tested it on 15.05.2020)! But you may have to check it yourself.
You can
watch the involvement of the GPU e.g. with “watch -n0.1 nvidia-smi” on a terminal. Another possibility is to set

tf.debugging.set_log_device_placement(True)  

and get messages in the shell of your virtual Python environment or in the presently used Jupyter cell.

Addendum 16.05.2020: Limiting the number of CPU cores for Tensorflow 2.0 on Linux

After several trials and tests I think that both TF2 and the Keras version delivered with handle the above given TF2 statements to limit the number of CPU cores to use inefficiently. I addition the behavior of TF2/Keras has changed with the TF2 versions 2.0, 2.1 and now 2.2.

Strange things also happen, when you combine statements of the TF1 compat layer with pure TF2 restriction statements. You should refrain from mixing them.

So, it is either

Option 1: CPU only and limited number of cores

from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=4, inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config))    
...

OR
Option 2: Mixture of GPU (with limited memory) and CPU (limited core number) with TF2 statements

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

tf.config.threading.set_intra_op_parallelism_threads(6)
tf.config.threading.set_inter_op_parallelism_threads(1)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], 
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

OR
Option 3: Mixture of GPU (limited memory) and CPU (limited core numbers) with TF1 compat statements

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

#gpu = False 
gpu = True
if gpu: 
    GPU = True;  CPU = False; num_GPU = 1; num_CPU = 1
else: 
    GPU = False; CPU = True;  num_CPU = 1; num_GPU = 0

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                        inter_op_parallelism_threads=1, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}, 
                        log_device_placement=True

                       )
config.gpu_options.per_process_gpu_memory_fraction=0.35
config.gpu_options.force_gpu_compatible = True
B.set_session(tf.compat.v1.Session(config=config))

Hint 1:

If you want to test some code parts on the GPU and others on the CPU in the same session, I strongly recommend to use the compat statements in the form given by Option 3 above

The reason is that it – strangely enough – gives you a faster performance on a multicore CPU by more than 25% in comparison to the pure TF2 statements .

Afterwards you can use statements like:

batch_size=64
epochs=5

if use_cpu_or_gpu == 0:
    start_g = time.perf_counter()
    with tf.device("/GPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_g = time.perf_
counter()
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_c = time.perf_counter()
    print('Time_CPU: ', end_c - start_c)  

Hint 2:
If you check the limitations on CPU cores (threads) via watching the CPU load on tools like “gkrellm” or “ksysguard”, it may appear that all cores are used in parallel. You have to set the update period of these tools to 0.1 sec to see that each core is only used intermittently. In gkrellm you should also see a variation of the average CPU load value with a variation of the parameter “intra_op_parallelism_threads=n”.

Hint 3:
In my case with a Quadcore CPU with hyperthreading the following settings seem to be optimal for a variety of Keras CNN models – whenever I want to train them on the CPU only:

...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6, inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config)) 
...

Hint 4:
If you want to switch settings in a Jupyter session it is best to stop and restart the respective kernel. You can do this via the commands under “kernel” in the Jupyter interface.

Conclusion

Well, my friends the above steps where what I had to do to get Keras working in combination with TF2, Cuda 10.2 and the present version of Cudnn. I regard this not as a straightforward procedure – to say it mildly.

In addition after some tests I might also say that the performance seems to be worse than with Tensorflow 1. Especially when using Keras – whether the Keras included with Tensorflow 2 or Keras in form of separate Python lib. Especially the performance on a GPU is astonishingly bad with Keras for small networks.

This impression of sluggishness stands in a strange contrast to elementary tests were I saw a factor of 5 difference for a series of typical matrix multiplications executed directly with tf.matmul() on a GPU vs. a CPU. But this another story …..

Links

tensorflow-running-version-with-cuda-on-cpu-only