MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation

In my last article in this blog I wrote a bit about some steps to get Keras running with Tensorflow 2 [TF2] and Cuda 10.2 on Opensuse Leap 15.1. One objective of these efforts was a performance comparison between two similar Multilayer Perceptrons [MLP] :

  • my own MLP programmed with Python and Numpy; I have discuss this program in another article series;
  • an MLP with a similar setup based on Keras and TF2

Not for reasons of a competition, but to learn a bit about differences. When and for what parameters do Keras/TF2 offer a better performance?
Another objective is to test TF-alternatives to Numpy functions and possible performance gains.

For the Python code of my own MLP see the article series starting with the following post:

A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

But I will discuss relevant code fragments also here when needed.

I think, performance is always an interesting topic - especially for dummies as me regarding Python. After some trials and errors I decided to discuss some of my experiences with MLP performance and optimization options in a separate series of the section "Machine learning" in this blog. This articles starts with two simple measures.

A factor of 6 turns turns into a factor below 2

Well, what did a first comparison give me? Regarding CPU time I got a factor of 6 on the MNIST dataset for a batch-size of 500. Of course, Keras with TF2 was faster 🙂 . Devastating? Not at all ... After years of dealing with databases and factors of up to 100 by changes of SQL-statements and indexing a factor of 6 cannot shock or surprise me.

The Python code was the product of an unpaid hobby activity in my scarce free time. And I am still a beginner in Python. The code was also totally unoptimized, yet - both regarding technical aspects and the general handling of forward and backward propagation. It also contained and still contains a lot of superfluous statements for testing. Actually, I had expected an even bigger factor.

In addition, some things between Keras and my Python programs are not directly comparable as I only use 4 CPU cores for Openblas - this gave me an optimum for Python/Numpy programs in a Jupyter environment. Keras and TF2 instead seem to use all available CPU threads (successfully) despite limiting threading with TF-statements. (By the way: This is an interesting point in itself. If OpenBlas cannot give them advantages what else do they do?)

A very surprising point was, however, that using a GPU did not make the factor much bigger - despite the fact that TF2 should be able to accelerate certain operations on a GPU by at least by a factor of 2 up to 5 as independent tests on matrix operations showed me. And a factor of > 2 between my GPU and the CPU is what I remember from TF1-times last year. So, either the CPU is better supported now or the GPU-support of TF2 has become worse compared to TF1. An interesting point, too, for further investigations ...

An even bigger surprise was that I could reduce the factor for the given batch-size down to 2 by just two major, butsimple code changes! However, further testing also showed a huge dependency on the batch sizechosen for training - which is another interesting point. Simple tests show that we may even be able to reduce the performance factor further by

  • by using directly coupled matrix operations - if logically possible
  • by using the basic low-level Python API for some operations

Hope, this sounds interesting for you.

The reference model based on Keras

I used the following model as a reference in a Jupyter environment executed on Firefox:

Jupyter Cell 1

 
# compact version 
# ****************
import time 
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras import regularizers
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA ) only 
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
# The following can only be done once - all CPU cores are used otherwise  
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
    
# if not yet done elsewhere 
#tf.compat.v1.disable_eager_execution()
#tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 0 # 0: cpu, 1: gpu

# function for training 
def train(train_images, train_labels, epochs, batch_size, shuffle):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, shuffle=shuffle)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(70, activation='sigmoid', input_shape=(28*28,), kernel_regularizer=regularizers.l2(0.01)))
#network.add(layers.Dense(80, activation='sigmoid'))
#network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST 
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

 

Jupyter Cell 2

# run it 
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_g = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_c = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)  

# test accuracy 
print('Acc:: ', test_acc)

Typical output - first run:

 
Epoch 1/35
60000/60000 [==============================] - 1s 16us/step - loss: 2.6700 - accuracy: 0.1939
Epoch 2/35
60000/60000 [==============================] - 0s 5us/step - loss: 2.2814 - accuracy: 0.3489
Epoch 3/35
60000/60000 [==============================] - 0s 5us/step - loss: 2.1386 - accuracy: 0.3848
Epoch 4/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.9996 - accuracy: 0.3957
Epoch 5/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.8941 - accuracy: 0.4115
Epoch 6/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.8143 - accuracy: 0.4257
Epoch 7/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.7556 - accuracy: 0.4392
Epoch 8/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.7086 - accuracy: 0.4542
Epoch 9/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6726 - accuracy: 0.4664
Epoch 10/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6412 - accuracy: 0.4767
Epoch 11/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.6156 - accuracy: 0.4869
Epoch 12/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5933 - accuracy: 0.4968
Epoch 13/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5732 - accuracy: 0.5078
Epoch 14/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5556 - accuracy: 0.5180
Epoch 15/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5400 - accuracy: 0.5269
Epoch 16/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5244 - accuracy: 0.5373
Epoch 17/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.5106 - accuracy: 0.5494
Epoch 18/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4969 - accuracy: 0.5613
Epoch 19/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4834 - accuracy: 0.5809
Epoch 20/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4648 - accuracy: 0.6112
Epoch 21/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.4369 - accuracy: 0.6520
Epoch 22/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3976 - accuracy: 0.6821
Epoch 23/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3602 - accuracy: 0.6984
Epoch 24/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3275 - accuracy: 0.7084
Epoch 25/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.3011 - accuracy: 0.7147
Epoch 26/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2777 - accuracy: 0.7199
Epoch 27/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2581 - accuracy: 0.7261
Epoch 28/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2411 - accuracy: 0.7265
Epoch 29/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2259 - accuracy: 0.7306
Epoch 30/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2140 - accuracy: 0.7329
Epoch 31/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.2003 - accuracy: 0.7355
Epoch 32/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1890 - accuracy: 0.7378
Epoch 33/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1783 - accuracy: 0.7410
Epoch 34/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1700 - accuracy: 0.7425
Epoch 35/35
60000/60000 [==============================] - 0s 5us/step - loss: 1.1605 - accuracy: 0.7449
10000/10000 [==============================] - 0s 37us/step
Time_CPU:  11.055424336002034
Acc::  0.7436000108718872

 
A second run was a bit faster: 10.8 secs. Accuracy around: 0.7449.
The relatively low accuracy is mainly due to the regularization (and reasonable to avoid overfitting). Without regularization we would already have passed the 0.9 border.

My own unoptimized MLP-program was executed with the following parameter setting:

 

             my_data_set="mnist_keras", 
             n_hidden_layers = 2, 
             ay_nodes_layers = [0, 70, 30, 0], 
             n_nodes_layer_out = 10,
             num_test_records = 10000, # number of test data
             
             # Normalizing - you should play with scaler1 only for the time being      
             scaler1 = 1,   # 1: StandardScaler (full set), 1: Normalizer (per sample)        
             scaler2 = 0,   # 0: StandardScaler (full set), 1: MinMaxScaler (full set)       
             b_normalize_X_before_preproc = False,     
             b_normalize_X_after_preproc  = True,     

             my_loss_function = "LogLoss",
 
             n_size_mini_batch = 500,
             n_epochs = 35, 
             lambda2_reg = 0.01,  

             learn_rate = 0.001,
             decrease_const = 0.000001, 

             init_weight_meth_L0 = "sqrt_nodes",  # method to init weights in an interval defined by  =>"sqrt_nodes" or a constant interval  "const"
             init_weight_meth_Ln = "sqrt_nodes",  # sqrt_nodes", "const"
             init_weight_intervals = [(-0.5, 0.5), (-0.5, 0.5), (-0.5, 0.5)],   # in case of a constant interval
             init_weight_fact = 2.0,              # extends the interval 
             mom_rate   = 0.00005,

             b_shuffle_batches = True,    # shuffling the batches at the start of each epoch 
             b_predictions_train = True,  # test accuracy by  predictions for ALL samples of the training set (MNIST: 60000) at the start of each epoch
             b_predictions_test  = False,  
             prediction_train_period = 1, # 1: each and every epoch is used for accuracy tests on the full training set
             prediction_test_period = 1,  # 1: each and every epoch is used for accuracy tests on the full test dataset

 

People familiar with my other article series on the MLP program know the parameters. But I think their names and comments are clear enough.

With a measurement of accuracy based on a forward propagation of the complete training set after each and every epoch (with the adjusted weights) I got a run time of 60 secs.

With accuracy measurements based on error tracking for batches and averaging over all batches, I get 49.5 secs (on 4 CPU threads). So, this is the mentioned factor between 5 and 6.

(By the way: The test indicates some space for improvement on the "Forward Propagation" 🙂 We shall take care of this in the next article of this series - promised).

So, these were the references or baselines for improvements.

Two measures - and a significant acceleration

Well, let us look at the results after two major code changes. With a test of accuracy performed on the full training set of 60000 samples at the start of each epoch I get the following result :

------------------
Starting epoch 35

Time_CPU for epoch 35 0.5518779030026053
relative CPU time portions: shuffle: 0.05  batch loop: 0.58  prediction:  0.37
Total CPU-time:  19.065050211000198

learning rate =  0.0009994051838157095

total costs of training set   =  5843.522
rel. reg. contrib. to total costs =  0.0013737131

total costs of last mini_batch   =  56.300297
rel. reg. contrib. to batch costs =  0.14256112

mean abs weight at L0 :  0.06393985
mean abs weight at L1 :  0.37341583
mean abs weight at L2 :  1.302389

avg total error of last mini_batch =  0.00709
presently reached train accuracy   =  0.99072

-------------------
Total training Time_CPU:  19.04528829299714

With accuracy taken only from the error of a batch:

avg total error of last mini_batch =  0.00806
presently reached train accuracy   =  0.99194
-------------------
Total training Time_CPU:  11.331006342999899

Isn't this good news? A time of 11.3 secs is pretty close to what Keras provides us with! (Well, at least for a batch size of 500). And with a better result regarding accuracy on my side - but this has to do with a probably different handling of learning rates and the precise translation of the L2-regularization parameter for batches.

Plots:

How did I get to this point? As said: Two measures were sufficient.

A big leap in performance by turning to float32 precision

So far I have never cared too much for defining the level of precision by which Numpy handles arrays with floating point numbers. In the context of Machine Learning this is a profound mistake. on a 64bit CPU many time consuming operations can gain almost a factor of 2 in performance when using float 32 precision - if the programmers tweaked everything. And I assume the Numpy guys did it.

So: Just use "dtype=np.float32" (np means "numpy" which I always import as "np") whenever you initialize numpy arrays!

For the readers following my other series: You should look at multiple methods performing some kind of initialization of my "MyANN"-class. Here is a list:

 
    def _handle_input_data(self): 
        .....
            self._y = np.array([int(i) for i in self._y], dtype=np.float32)
        .....
        self._X = self._X.astype(np.float32)
        self._y = self._y.astype(np.int32)
        .....
    def _encode_all_y_labels(self, b_print=True):
        .....
        self._ay_onehot = np.zeros((self._n_labels, self._y_train.shape[0]), dtype=np.float32)
        self._ay_oneval = np.zeros((self._n_labels, self._y_train.shape[0], 2), dtype=np.float32)
   
        .....
    def _create_WM_Input(self):
        .....
        w0 = w0.astype(dtype=np.float32)
        .....
    def _create_WM_Hidden(self):
        .....
            w_i_next = w_i_next.astype(dtype=np.float32)
        .....
    def _create_momentum_matrices(self):
        .....
            self._li_mom[i] = np.zeros(self._li_w[i].shape, dtype=np.float32)
        .....
    def _prepare_epochs_and_batches(self, b_print = True):
        .....
        self._ay_theta = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        self._ay_costs = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        self._ay_reg_cost_contrib = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32) 
        .....
        self._ay_period_test_epoch     = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_acc_test_epoch        = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_err_test_epoch        = -1 * np.ones(shape_test_epochs, dtype=np.float32) 
        self._ay_period_train_epoch    = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_acc_train_epoch       = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_err_train_epoch       = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_tot_costs_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        self._ay_rel_reg_train_epoch   = -1 * np.ones(shape_train_epochs, dtype=np.float32) 
        .....
        self._ay_mean_abs_weight = -10 * np.ones(shape_weights, dtype=np.float32) 
        .....
    def _add_bias_neuron_to_layer(self, A, how='column'):
        .....
            A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
        .....
            A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
    .....

 

After I applied these changes the factor in comparison to Keras went down to 3.1 - for a batch size of 500. Good news after a first simple step!

Reducing the CPU time once more

The next step required a bit more thinking. When I went through further more detailed tests of CPU consumption for various steps during training I found that the error back propagation through the network required significantly more time than the forward propagation.

At first sight this seems to be logical. There are more operations to be done between layers - real matrix multiplications with np.dot() (or np.matmul()) and element-wise multiplications with the "*"-operation. See also my PDF on the basic math:
Back_Propagation_1.0_200216.

But this is wrong assumption: When I measured CPU times in detail I saw that such operations took most time when network layer L0 - i.e. the input layer of the MLP - got involved. This also seemed to be reasonable: the weight matrix is biggest there; the input layer of all layers has most neuron nodes.

But when I went through the code I saw that I just had been too lazy whilst coding back propagation:

 
    ''' -- Method to handle error BW propagation for a mini-batch --'''
    def _bw_propagation(self, 
                        ay_y_enc, li_Z_in, li_A_out, 
                        li_delta_out, li_delta, li_D, li_grad, 
                        b_print = True, b_internal_timing = False):
        
        # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch 
        
        # Initiate BW propagation - provide delta-matrices for outermost layer
        # *********************** 
        # Input Z at outermost layer E  (4 layers -> layer 3)
        ay_Z_E = li_Z_in[self._n_total_layers-1]
        # Output A at outermost layer E (was calculated by output function)
        ay_A_E = li_A_out[self._n_total_layers-1]
        
        # Calculate D-matrix (derivative of output function) at outmost the layer - presently only D_sigmoid 
        ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print )
        
        # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
        ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E, ay_D_E=ay_D_E, b_print=b_print) 
        
        # add the matrices at the outermost layer to their lists ; li_delta_out gets only one element 
        idxE = self._n_total_layers - 1
        li_delta_out[idxE] = ay_delta_out_E # this happens only once
        li_delta[idxE]     = ay_delta_E
        li_D[idxE]         = ay_D_E
        li_grad[idxE]      = None    # On the outermost layer there is no gradient ! 
        
        # Loop over all layers in reverse direction 
        # ******************************************
        # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2))
        range_N_bw_layer = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken 
        
        # loop over layers 
        for N in range_N_bw_layer:
            
            # Back Propagation operations between layers N+1 and N 
            # *******************************************************
            # this method handles the special treatment of bias nodes in Z_in, too
            ay_delta_N, ay_D_N, ay_grad_N = self._bw_prop_Np1_to_N( N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False )
            
            # add matrices to their lists 
            li_delta[N] = ay_delta_N
            li_D[N]     = ay_D_N
            li_grad[N]= ay_grad_N
       
        return

 
with the following key function:

 
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta):
        '''
        BW-error-propagation between layer N+1 and N 
        Inputs: 
            li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
            li_A_out: List of output A-matrices - values were calculated during FW-propagation
            li_delta: List of delta-matrices - values for outermost ölayer E to layer N+1 should exist 
        
        Returns: 
            ay_delta_N - delta-matrix of layer N (required in subsequent steps)
            ay_D_N     - derivative matrix for the activation function on layer N 
            ay_grad_N  - matrix with gradient elements of the cost fnction with respect to the weights on layer N 
        '''
        
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling between layers N and N+1 
        ay_W_N = self._li_w[N]
        # delta-matrix of layer N+1
        ay_delta_Np1 = li_delta[N+1]

        # !!! Add row (for bias) to Z_N intermediately !!!
        ay_Z_N = li_Z_in[N]
        ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        
        # Derivative matrix for the activation function (with extra bias node row)
        ay_D_N = self._calculate_D_N(ay_Z_N)
        
        # fetch output value saved during FW propagation 
        ay_A_N = li_A_out[N]
        
        # Propagate delta
        # **************
        # intermediate delta 
        ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
        # final delta 
        ay_delta_N = ay_delta_w_N * ay_D_N
        # reduce dimension again (bias row)
        ay_delta_N = ay_delta_N[1:, :]
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        ay_grad_N[:, 1:] += (self._li_w[N][:, 1:] * self._lambda2_reg + np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg) 
        
        return ay_delta_N, ay_D_N, ay_grad_N

 

Now, look at the eventual code:

 
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
        '''
        BW-error-propagation between layer N+1 and N 
        .... 
        '''
        # Prepare required quantities - and add bias neuron to ay_Z_in 
        # ****************************
        
        # Weight matrix meddling between layers N and N+1 
        ay_W_N = self._li_w[N]
        ay_delta_Np1 = li_delta[N+1]

        # fetch output value saved during FW propagation 
        ay_A_N = li_A_out[N]

        # Optimization ! 
        if N > 0: 
            ay_Z_N = li_Z_in[N]
            # !!! Add intermediate row (for bias) to Z_N !!!
            ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
        
            # Derivative matrix for the activation function (with extra bias node row)
            ay_D_N = self._calculate_D_N(ay_Z_N)
        
            # Propagate delta
            # **************
            # intermediate delta 
            ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
            # final delta 
            ay_delta_N = ay_delta_w_N * ay_D_N
            # reduce dimension again 
            ay_delta_N = ay_delta_N[1:, :]
            
        else: 
            ay_delta_N = None
            ay_D_N = None
        
        # Calculate gradient
        # ********************
        #     required for all layers down to 0 
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        
        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums) 
        if self._lambda2_reg > 0.0: 
            ay_grad_N[:, 1:] += self._li_w[N][:, 1:] * self._lambda2_reg 
        if self._lambda1_reg > 0.0: 
            ay_grad_N[:, 1:] += np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg 
        
        return ay_delta_N, ay_D_N, ay_grad_N

 

You have, of course, detected the most important change:

We do not need to propagate any delta-matrices (originally coming from the error deviation at the output layer) down to layer 1!

This is due to the somewhat staggered nature of error back propagation - see the PDF on the math again. Between the first hidden layer L1 and the input layer L0 we only need to fetch the output matrix A at L0 to be able to calculate the gradient components for the weights in the weight matrix connecting L0 and L1. This saves us from the biggest matrix multiplication - and thus reduces computational time significantly.

Another bit of CPU time can be saved by calculating only the regularization terms really asked for; for my simple densely populated network I almost never use Lasso regularization; so L1 = 0.

These changes got me down to the values mentioned above. And, note: The CPU time for backward propagation then drops to the level of forward propagation. So: Be somewhat skeptical about your coding if backward propagation takes much more CPU time than forward propagation!

Dependency on the batch size

I should remark that TF2 still brings some major and remarkable advantages with it. Its strength becomes clear when we go to much bigger batch sizes than 500:
When we e.g. take a size of 10000 samples in a batch, the required time of Keras and TF2 goes down to 6.4 secs. This is again a factor of roughly 1.75 faster.
I do not see any such acceleration with batch size in case of my own program!

More detailed tests showed that I do not gain speed with a batch size over 1000; the CPU time increases linearly from that point on. This actually seems to be a limitation of Numpy and OpenBlas on my system.

Because , I have some reasons to believe that TF2 also uses some basic OpenBlas routines, this is an indication that we need to put more brain into further optimization.

Conclusion

We saw in this article that ML programs based on Python and Numpy may gain a boost by using only dtype=float32 and the related accuracy for Numpy arrays. In addition we saw that avoiding unnecessary propagation steps between the first hidden and at the input layer helps a lot.

In the next article of this series we shall look a bit at the performance of forward propagation - especially during accuracy tests on the training and test data set.

Further articles in this series

MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation
MLP, Numpy, TF2 – performance issues – Step II – bias neurons, F- or C- contiguous arrays and performance
MLP, Numpy, TF2 – performance issues – Step III – a correction to BW propagation

Getting a Keras based MLP to run with Cuda 10.2, Cudnn 7.6 and TensorFlow 2.0 on an Opensuse Leap 15.1 system

During last weekend I wanted to compare the performance of an old 20-line Keras setup for a simple MLP with the performance of a self-programmed Python- and Numpy-based MLP regarding training epochs on the MNIST dataset. The Keras code was set up in a Jupyter notebook last autumn - at that time for TensorFlow 1 and Cuda 10.0 for my Nvidia graphics card. I thought it might be a good time to move everything to Tensorflow 2 and the latest Cuda libraries on my Opensuse Leap 15.1 system. This was more work than expected - and for some problems I was forced to apply some dirty workarounds. I got it running. Maybe the necessary steps, which are not really obvious, are helpful for others, too.

Install Cuda 10.2 and Cudnn on an Opensuse Leap 15.1

Before you want to use TensorFlow [TF] on a Nvidia graphics card you must install Cuda. The present version is Cuda 10.2. I was a bit naive to assume that this should be the right version - as it has been available for some time already. Wrong! Afterwards I read somewhere that TensorFlow2 [TF2] is working with Cuda 10.1, only, and not yet fully compatible with Cuda 10.2. Well, at least for my purposes [MLP training] it seemed to work nevertheless - with some "dirty" additional library links.

There is a central Cuda repository available at this address: cuda10.2. Actually, the repo offers both cuda10.0, cuda10.1 and cuda10.2 (plus some nvidia drivers). I selected some central cuda10.2 packages for installation - just to find out where the related files were placed in the filesystem. I then ran into a major chain of packet dependencies, which I had to resolve during many tedious steps . Some packages may not have been necessary for a basic installation. In the end I was too lazy to restrict the libs to what is really required for Keras. The bill came afterwards: Cuda 10.2 is huge! If you do not know exactly what you need: Be prepared to invest up to 3 GB on your hard disk.

The Cuda 10.2 RPM packets install most of the the required "*.so"-shared library files and many other additional files in a directory "/usr/local/cuda-10.2/". To make changes between different versions of Cuda possible we also find a link "/usr/local/cuda" pointing to
"/usr/local/cuda-10.2/" after the installation. Ok, reasonable - we could change the link to point to "/usr/local/cuda-10.0/". This makes you assume that the Tensorflow 2 libraries and other modules in your virtual Python3 and Jupyter environment would look for required Cuda files in the central directory "/usr/local/cuda" - i.e. without special version attributes of the files. Unfortunately, this was another wrong assumption. 🙁 See below.

In addition to the Cuda packages you must install the present "cudnn" libraries from Nvidia - more precisely: The runtime and the development package. You get the RPMs from here. Be prepared to give Nvidia your private data. 🙁

I should add that I ignored and ignore the Nvidia drivers from the Cuda repository, i.e. I never installed them. Instead, I took those from the standard Nvidia community repository. They worked and work well so far - and make my update life on Opensuse easier.

Installation of Tensorflow2 modules in your (virtual) Python3 environment

I use a virtual Python3 environment and update it regularly via "pip". Regarding TF2 an update via the command "pip install --upgrade tensorflow" should be sufficient - it will resolve dependencies. (By the way: If you want to bring all Python libs to their present version you can also use "pip-review --auto". Note that under certain circumstances you may need the "--force" option for special upgrades. I cannot go into details in this article.)

Multiple complaints about missing libraries ...

Unfortunately, the next time I started my virtual Python environment I got the warning that the dynamic library "libnvinfer.so.6" could not be found, but was required in case I planned to use TensorRT. What? Well, you may find some information here
https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/
https://developer.nvidia.com/tensorrt

I leave it up to you whether you really need TensorRT. You can ignore this message - TF will run for standard purposes without it. But, dear TF-developers: a clear message in the warning would in my opinion have been helpful. Then I checked whether some version of the Nvidia related library "libnvinfer.so" came with Cuda or Cudnn onto my system. Yeah, it did - unfortunately version 7 instead of 6. :-(.
So, we are confronted with a dependency on a specific file version which is older than the present one. I do not like this style of development. Normally, it should be the other way round: If a newer version is required due to new capabilities you warn the user. But under normal circumstances a backward compatibility of libs should be established. You would assume such a backward compatibility and that TF would search for the present version via looking for files "libnvinfer.so" and "libnvinfer_plugin.so" which do exist and point to the latest versions. But, no, in this case they want it explicitly to be version 6 ... Makes you wonder whether the old Cudnn version is still available. I did not check it. Ok, ok - backward compatibility is not always possible ....

Just to see how good the internal checking of the alleged dependency is, I did something you normally do not do: I created a link "libnvinfer.so.6" in "/usr/lib64" to "libnvinfer.7.so". Had to do the same for "libnvinfer_plugin.so.6". Effect: I got rid of the warning - so much about dependency checking. I left the linking. You see I trust in coming developments sometimes and run some risks ....

Then came the next surprise. I had read a bit about changed statements in TF2 (compared to TF1) - and thought I was prepared for this. But, when I tried to execute some initial commands to configure TF2 from a Jupyter cell as e.g.

 
import time 
import tensorflow as tf
from tensorflow import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from tensorflow.python.client import device_lib

import os
#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

tf.config.optimizer.set_jit(True)
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)
tf.debugging.set_log_device_placement(True)

device_lib.list_local_devices()  

I at once got a complaint in the shell from which I had started the Jupyter notebook - saying that a lib called "libcudart.so.10.1" was missing. Again - an explicit version dependency 🙁 . On purpose or just a glitch? Just one out of many files version dependent? Without a clear information? If this becomes the standard in the interaction between TF2 and Cuda - well, no fun any longer. In my opinion the TF2 developers should not use a search for files via version specific names - but do an analysis of headers and warn explicitly that the present version requires a specific Cuda version. Would be much more convenient for the user and compatible with the link mechanism described above.

Whilst a bunch of other dynamic libs was loaded by their name without a version in this case TF2 asks for a very specific version - although there is a corresponding lib available in the directory "/usr/lib/cuda-10.2".... Nevertheless with full trust again in a better future I offered TF2 a softlink "libcudart.so.10.1" in "/usr/lib64/" pointing to the "/usr/local/cuda-10.2/lib64/libcudart.so". It cleared my way to the next hurdle. And my Keras MLP worked in the end ...

Missing "./bin" directory ... and other path related problems

When I tried to run specific Keras commands, which TF2 wanted to compile as XLA-supported statements, I again got complaints that files in a local directory "./bin" were missing. This was a first clear indication that Cuda paths were totally ignored in my Python/Jupyter environment. But what directory did the "./" refer to? Some experiments revealed:

I had to link an artificial subdirectory "./bin" in the directory where I kept my Jupyter notebooks to "/usr/local/Cuda-10.2/bin".

But the next problems with other directories waited directly around the corner. Actually many ... To make a long story short - the installation of TF2 in combination with Cuda 1.2 does not evaluate paths or ask for paths when used in a Python3/Jupyter environment. We have to provide and export them as shell environment variables. See below.

Warnings and errors regarding XLA capabilities

Another thing which drove me nuts was that TF2 required information about XLA-flags. It took me a while to find out that this also could be handled via environment variables.

All in all I now start the shell from which I launch my virtual Python environment and Jupyter notebooks with the following command sequence:

myself@mytux:/projekte/GIT/....../ml> export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
myself@mytux:/projekte/GIT/....../ml> export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
myself@mytux:/projekte/GIT/....../ml_1> export OPENBLAS_NUM_THREADS=4              
myself@mytux:/projekte/GIT/....../ml_1> source bin/activate
(ml) myself@mytux:/projekte/GIT/....../ml_1> jupyter notebook 

The first two commands did the magic regarding the path-problems! TF2 worked afterwards both for XLA-capable CPUs and Nvidia GPUs. So, a specific version may or may not have advantages - I do not know - but at least you can get TF2 running with Cuda 10.2.

Changed commands to control threading and memory consumption

Without the use of explicit compatibility commands TF2 does not support commands like

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}, 
                        log_device_placement=True

                       )
config.gpu_options.per_process_gpu_memory_fraction=0.4
config.gpu_options.force_gpu_compatible = True  

any longer. But as with TF1 you probably do not want to pick all the memory from your graphics card and you do not want to use all cores of a CPU in TF2. You can circumvent the lack of a "ConfigProto" property in TF2 by the following commands:

# configure use of just in time compiler 
tf.config.optimizer.set_jit(True) 
# limit use of parallel threads 
tf.config.threading.set_intra_op_parallelism_threads(4) 
tf.config.threading.set_inter_op_parallelism_threads(4)
# Not required in TF2: tf.enable_eager_execution()
# print out use of certain device (at first run)
tf.debugging.set_log_device_placement(True)
#limit use of graphics card memory 
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
# Not required in TF2: tf.enable_eager_execution()
# print out a list of usable devices
device_lib.list_local_devices()   

Addendum, 15.05.2020:
Well, this actually proved to be correct for the limitation of the GPU memory, only. The limitations on the CPU cores do NOT work. At least not on my system. See also:
tensorflow issues 34415.

I give you a workaround below.

Test run with MNIST

Afterwards the following simple Keras MLP ran without problems and with the expected performance on a GPU and a multicore CPU:

Jupyter cell 1

import time 
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA ) only 
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], 
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)
    
# if not yet done elsewhere 

tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 1 # 0: cpu, 1: gpu

# The following can only be done once - all CPU cores are used otherwise  
#if use_cpu_or_gpu == 0:
#    tf.config.threading.set_intra_op_parallelism_threads(4)
#    tf.config.threading.set_inter_op_parallelism_threads(6)


# function for training 
def train(train_images, train_labels, epochs, batch_size):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(200, activation='sigmoid', input_shape=(28*28,)))
network.add(layers.Dense(100, activation='sigmoid'))
network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid'))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST 
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

Jupyter cell 2

# run it 
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_g = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_c = time.perf_counter()
    test_loss, test_acc= network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)  

# test accuracy 
print('Acc: ', test_acc)

Switch to force Tensorflow to use the CPU, only

Another culprit is that - depending on the exact version of TF 2 - you may need to use the following statement to run (parts of) your code on the CPU only:

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

in the beginning. Otherwise Tensorflow 2.0 and version 2.1 will choose execution on the GPU even if you use a statement like

with tf.device("/CPU:0"):

(which worked in TF1).
It seems that this problem was solved with TF 2.2 (tested it on 15.05.2020)! But you may have to check it yourself.
You can watch the involvement of the GPU e.g. with "watch -n0.1 nvidia-smi" on a terminal. Another possibility is to set

tf.debugging.set_log_device_placement(True)  

and get messages in the shell of your virtual Python environment or in the presently used Jupyter cell.

Addendum 16.05.2020: Limiting the number of CPU cores for Tensorflow 2.0 on Linux

After several trials and tests I think that both TF2 and the Keras version delivered with handle the above given TF2 statements to limit the number of CPU cores to use inefficiently. I addition the behavior of TF2/Keras has changed with the TF2 versions 2.0, 2.1 and now 2.2.

Strange things also happen, when you combine statements of the TF1 compat layer with pure TF2 restriction statements. You should refrain from mixing them.

So, it is either

Option 1: CPU only and limited number of cores

from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=4, inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config))    
...

OR
Option 2: Mixture of GPU (with limited memory) and CPU (limited core number) with TF2 statements

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

tf.config.threading.set_intra_op_parallelism_threads(6)
tf.config.threading.set_inter_op_parallelism_threads(1)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], 
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

OR
Option 3: Mixture of GPU (limited memory) and CPU (limited core numbers) with TF1 compat statements

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B 
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

#gpu = False 
gpu = True
if gpu: 
    GPU = True;  CPU = False; num_GPU = 1; num_CPU = 1
else: 
    GPU = False; CPU = True;  num_CPU = 1; num_GPU = 0

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                        inter_op_parallelism_threads=1, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}, 
                        log_device_placement=True

                       )
config.gpu_options.per_process_gpu_memory_fraction=0.35
config.gpu_options.force_gpu_compatible = True
B.set_session(tf.compat.v1.Session(config=config))

Hint 1:

If you want to test some code parts on the GPU and others on the CPU in the same session, I strongly recommend to use the compat statements in the form given by Option 3 above

The reason is that it - strangely enough - gives you a faster performance on a multicore CPU by more than 25% in comparison to the pure TF2 statements .

Afterwards you can use statements like:

batch_size=64
epochs=5

if use_cpu_or_gpu == 0:
    start_g = time.perf_counter()
    with tf.device("/GPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_g = time.perf_counter()
    print('Time_GPU: ', end_g - start_g)  
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_c = time.perf_counter()
    print('Time_CPU: ', end_c - start_c)  

Hint 2:
If you check the limitations on CPU cores (threads) via watching the CPU load on tools like "gkrellm" or "ksysguard", it may appear that all cores are used in parallel. You have to set the update period of these tools to 0.1 sec to see that each core is only used intermittently. In gkrellm you should also see a variation of the average CPU load value with a variation of the parameter "intra_op_parallelism_threads=n".

Hint 3:
In my case with a Quadcore CPU with hyperthreading the following settings seem to be optimal for a variety of Keras CNN models - whenever I want to train them on the CPU only:

...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6, inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config)) 
...

Hint 4:
If you want to switch settings in a Jupyter session it is best to stop and restart the respective kernel. You can do this via the commands under "kernel" in the Jupyter interface.

Conclusion

Well, my friends the above steps where what I had to do to get Keras working in combination with TF2, Cuda 10.2 and the present version of Cudnn. I regard this not as a straightforward procedure - to say it mildly.

In addition after some tests I might also say that the performance seems to be worse than with Tensorflow 1. Especially when using Keras - whether the Keras included with Tensorflow 2 or Keras in form of separate Python lib. Especially the performance on a GPU is astonishingly bad with Keras for small networks.

This impression of sluggishness stands in a strange contrast to elementary tests were I saw a factor of 5 difference for a series of typical matrix multiplications executed directly with tf.matmul() on a GPU vs. a CPU. But this another story .....

Links

tensorflow-running-version-with-cuda-on-cpu-only

 

A single neuron perceptron with sigmoid activation function – III – two ways of applying Normalizer

In this article series on a perceptron with only one computing neuron we saw that saturation effects of the sigmoid activation function can hamper gradient descent if input data on some features become too big and/or the initial weight distribution is not adapted to the number of input features. See:

A single neuron perceptron with sigmoid activation function – I – failure of gradient descent due to saturation

We can remedy the first point by applying a normalization transformation to the input data before starting gradient descent. I showed the positive result of such a transformation for our perceptron with a rather specific set of input data in the last article:

A single neuron perceptron with sigmoid activation function – II – normalization to overcome saturation

At that time we used the "StandardScaler" provided by Scikit-Learn. In this article we shall instead use an instance of the "Normalizer" class for scaling. With "Normalizer" you have to be a bit careful how you use its interface. We shall apply "Normalizer in two different ways. Besides having some fun with the outcome, we will also learn that the shape of the clusters in which the input samples may be arranged in feature space should be taken into account before normalizing ahead of classification tasks. Which may be difficult in multiple dimensions ... but it brings us to the general idea of identifying a method of cluster identification ahead of classification training with gradient descent.

How does the "Normalizer" work?

Let us offer a "Normalizer"-instance an input array "ay_in" with 2 rows and 4 columns for each row. The shape of "ay_in" is (2,4). The first row "s1" shall have elements like s1=[4, 1, 2, 2]. Then Normalizer will then calculate a L2-norm value for the column data of our specific row as

L2([4, 1, 2, 2]) = sqrt(4**2 + 1**2 + 2**2 + 2**2) = 5
=> s1_trafo = [4/L2, 1/L2, 2/L2, 2/L2] = [0.8, 0.2, 0.4, 0.4].

I.e., all columns in one row are multiplied by one common factor determined as the L2-norm of the column data of the sample. Note again: Each row is treated separately. So, an array as

[
  [1, 3, 9, 3],
  [5, 7, 5, 1]
]

will be transformed to

  [0.1,, 0.3, 0.9, 0.3],
  [0.5, 0.7, 0.5, 0.1]
]

How can we make use of this for our perceptron samples?

Standard scaling per feature with Normalizer

A first idea is that we could scale the data of all samples for our perceptron separately per feature; i.e. we collect the data-values of all M samples for "feature 1" in an array and offer it as the first row of an array to Normalizer, plus a row with all the data values for "feature 2", .... and so on.

If we had M samples and N features we would present an array with shape (N, M) to "Normalizer". In our simple perceptron experiment this is equivalent to scaling data of an array where the two rows are defined by our K1 and K2-input arrays => ay_K = [ li_K1, li_K2 ].

What would the outcome of such a scaling be?

A constant factor per feature determined by the L2-norm of all samples' values for the chosen feature brings all values safely down into an interval of [-1, 1]. But this also means that the maximum value of all samples for a specific feature determines the scale.

Then so called "outliers", i.e. samples whose values are far away from the average values of the samples, would have a major impact. So "Normalizer"-Scaling is especially helpful, if the values per feature are limited by principle reasons. Note that this is e.g. the case with RGB-color or gray-scale values! Note also that the possible impact of outliers is also relevant for other normalizers as the "MinMaxNormalizer" of SciKit-Learn.

Although the scaling factors will be different per feature I would like to point out another aspect of scaling by a constant factor per feature over all samples: Such a transformation keeps up at least some structural similarity of the sample distribution in the feature space.

Scaling features per sample with Normalizer (?)

A different way of applying "Normalizer" would be to use the transformed array "ay_K.T" as input: For M samples and N features we would then present an array with shape (M, N) to a Normalizer instance. Its algorithm would then scale across the features of each sample. If we interpret a specific sample as a vector in the feature space then the L2-norm corresponds naturally to the length of this vector. Meaning: Normalizer would scale each sample by its vector length.

Two questions before experimenting

The two possible application methods for Normalizer lead directly to two questions for our simple test setup in a 2-dim feature space:

  • How will lines of equal cost values (i.e. cost or loss contours) for our sigmoid-based loss function look like in the {K1, K2}-space after scaling a bunch of N (K1, K2)-datapoints with Normalizer per feature? I.e., if and when should we present an array of feature values with shape (N,M)?
  • What would happen instead if we scaled each input sample individually across its features? I.e., what happens in a situation with M samples and N features and we feed "Normalizer" with an array (of the same feature values) which has a shape (N,M) instead of (M,N)?

I guess a "natural talent" on numbers as Mr Trump could give the answers without hesitation 🙂 . As we certainly are below the standards of the "genius" Mr Trump (his own words on multiple occasions) we shall pick the answers from plots below before we even try a deeper reasoning.

Application of "Normalizer" separately to the feature data of all batch samples

As you remember from the first article of this series our input batch contained samples (K1, K2) with values for K1 and K2 given by two 1-dim arrays :

li_K1 = [200.0,   1.0, 160.0,  11.0, 220.0,  11.0, 120.0,  22.0, 195.0,  15.0, 130.0,   5.0, 185.0,  16.0]
li_K2 = [ 14.0, 107.0,  10.0, 193.0,  32.0, 178.0,   2.0, 210.0,  12.0, 134.0,  15.0, 167.0,  10.0, 229.0] 

The standard scaling application of Normalizer can be coded explicitly as (see the code given in the last article):

    rg_idx = range(num_samples)
    if scale_method == 0:      
        shape_input = (2, num_samples)
        ay_K = np.zeros(shape_input)
        for idx in rg_idx:
            ay_K[0][idx] = li_K1[idx] 
            ay_K[1][idx] = li_K2[idx] 
        scaler = Normalizer()
        ay_K = scaler.fit_transform(ay_K)
        for idx in rg_idx:
            ay_K1[idx] = ay_K[0][idx]   
            ay_K2[idx] = ay_K[1][idx]
        scaling_fact_K1 = ay_K1[0] / li_K1[0]
        scaling_fact_K2 = ay_K2[0] / li_K2[0]
        print(ay_K1)
        print("\n")
        print(ay_K2)

However, a much faster form, which avoids the explicit Python loop, is given by:

ay_K= np.vstack( (li_K1, li_K2) )
ay_K = scaler.fit_transform(ay_K)
ay_K1, ay_K2 = ay_K
scaling_fact_K1 = ay_K1[0] / li_K1[0]
scaling_fact_K2 = ay_K2[0] / li_K2[0]

Here OpenBlas helps 🙂 .

In contrast to other scalers we need to save and keep the factors by which we transform the various feature data by ourselves somewhere. (This is clear as "Normalizer" calculates a different factor for each feature.) So, we change our Jupyter cell code for scaling to:

# ********
# Scaling
# ********

b_scale = True
scale_method = 0
# 0: Normalizer (standard), 1: StandardScaler, 2. By factor, 3: Normalizer per pair 
# 4: Min_Max, 5: Identity (no transformation) - just there for convenience  

shape_ay = (num_samples,)
ay_K1 = np.zeros(shape_ay)
ay_K2 = np.zeros(shape_ay)

# apply scaling
if b_scale:
    # shape_input = (num_samples,2)
    rg_idx = range(num_samples)
    if scale_method == 0:      
        ay_K = np.vstack( (li_K1, li_K2) )
        print("ay_k.shape = ", ay_K.shape)
        scaler = Normalizer()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K
        scaling_fact_K1 = ay_K1[0] / li_K1[0]
        scaling_fact_K2 = ay_K2[0] / li_K2[0]
        print("\nay_K1 = \n", ay_K1)
        print("\nay_K2 = \n", ay_K2)
        print("\nscaling_fact_K1: ", scaling_fact_K1, ", scaling_fact_K2: ", scaling_fact_K2)
       
    elif scale_method == 1: 
        ay_K = np.column_stack((li_K1, li_K2))
        scaler = StandardScaler()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K.T    
            
    elif scale_method == 2:
        dmax = max(li_K1.max() - li_K1.min(), li_K2.max() - li_K2.min())
        ay_K1 = 1.0/dmax * li_K1
        ay_K2 = 1.0/dmax * li_K2
        scaling_fact_K1 = ay_K1[0] / li_K1[0]
        scaling_fact_K2 = ay_K2[0] / li_K2[0]
    
    elif scale_method == 3:
        ay_K = np.column_stack((li_K1, li_K2))
        scaler = Normalizer()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K.T    
    
    elif scale_method == 4:
        ay_K = np.column_stack((li_K1, li_K2))
        scaler = MinMaxScaler()
        ay_K = scaler.fit_transform(ay_K)
        ay_K1, ay_K2 = ay_K.T    
    
    elif scale_method == 5:
        ay_K1 = li_K1
        ay_K2 = li_K2
            
            
# Get overview over costs on weight-mesh
#wm1 = np.arange(-5.0,5.0,0.002)
#wm2 = np.arange(-5.0,5.0,0.002)
wm1 = np.arange(-5.5,5.5,0.002)
wm2 = np.arange(-5.5,5.5,0.002)
W1, W2 = np.meshgrid(wm1, wm2) 
C, li_C_sgl = costs_mesh(num_samples = num_samples, W1=W1, W2=W2, li_K1 = ay_K1, li_K2 = ay_K2, \
                               li_a_tgt = li_a_tgt)


C_min = np.amin(C)
print("\nC_min = ", C_min)
IDX = np.argwhere(C==C_min)
print ("Coordinates: ", IDX)
# print(IDX.shape)
# print(IDX[0][0])
wmin1 = W1[IDX[0][0]][IDX[0][1]] 
wmin2 = W2[IDX[0][0]][IDX[0][1]]
print("Weight values at cost minimum:",  wmin1, wmin2)

# Plots
# ******
fig_size = plt.rcParams["figure.figsize"]
#print(fig_size)
fig_size[0] = 16; fig_size[1] = 16

fig3 = plt.figure(3); fig4 = plt.figure(4)

ax3 = fig3.gca(projection='3d')
ax3.get_proj = lambda: np.dot(Axes3D.get_proj(ax3), np.diag([1.0, 1.0, 1, 1]))
ax3.view_init(20,135)
ax3.set_xlabel('w1', fontsize=16)
ax3.set_ylabel('w2', fontsize=16)
ax3.set_zlabel('Total costs', fontsize=16)
ax3.plot_wireframe(W1, W2, 1.2*C, colors=('green'))


ax4 = fig4.gca(projection='3d')
ax4.get_proj = lambda: np.dot(Axes3D.get_proj(ax4), np.diag([1.0, 1.0, 1, 1]))
ax4.view_init(25,135)
ax4.set_xlabel('w1', fontsize=16)
ax4.set_ylabel('w2', fontsize=16)
ax4.set_zlabel('Single costs', fontsize=16)
ax4.plot_wireframe(W1, W2, li_C_sgl[0], colors=('blue'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[1], colors=('red'))
ax4.plot_wireframe(W1, W2, li_C_sgl[5], colors=('orange'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[6], colors=('yellow'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[9], colors=('magenta'))
#ax4.plot_wireframe(W1, W2, li_C_sgl[12], colors=('green'))

plt.show()

 

Ok, lets apply the "Normalizer" to our input samples. We get:

ay_K1 = 
 [0.42786745 0.00213934 0.34229396 0.02353271 0.47065419 0.02353271
 0.25672047 0.02995072 0.41717076 0.03209006 0.27811384 0.01069669
 0.39577739 0.0342294 ]

ay_K2 = 
 [0.02955501 0.22588473 0.02111072 0.40743694 0.06755431 0.37577085
 0.00422214 0.44332516 0.02533287 0.28288368 0.01477751 0.35254906
 0.02111072 0.48343554]

scaling_fact_K1:  0.0021393372268854655 , scaling_fact_K2:  0.0021110722130092533

How do the transformed data points look like in the {K1, K2}-feature-space? See the plot:

Structurally very like the original; but with values reduced to [0,1]. This was to be expected.

The cost hyperplane for the data normalized "per feature"

After the transformation of the sample data the cost hyperplane over the {w1, w2}-space looks as follows:

We see a clear minimum; it does, however, not appear as pronounced as for the StandardScaler, which we applied in the last article.

But: There are no side valleys with small gradients at the end of the steep slope area. This means that a path into a minimum will probably look a bit different compared to a path on the hyperplane we got with the "StandardScaler".

Our mesh in the {w1, w2}-space indicates the following position of the minimum:

C_min =  0.0006350159045771724
Coordinates:  [[3949 1542]]
Weight values at cost minimum: -2.4160000000003397 2.39799999999913

Gradient descent results after normalization per feature with "Normalizer"

With our gradient descent method and the following run-parameters

w1_start = -0.20, w2_start = 0.25 eta = 0.2, decrease_rate = 0.00000001, num_steps = 2500

we get the following result of a run which explores both stochastic and batch gradient descent:

Stoachastic Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.427867  0.029555  200.0   14.0  0.3  0.276365  0.078783
1   0.002139  0.225885    1.0  107.0  0.7  0.630971  0.098613
2   0.342294  0.021111  160.0   10.0  0.3  0.315156  0.050519
3   0.023533  0.407437   11.0  193.0  0.7  0.715038  0.021483
4   0.470654  0.067554  220.0   32.0  0.3  0.273924  0.086920
5   0.023533  0.375771   11.0  178.0  0.7  0.699320  0.000971
6   0.256720  0.004222  120.0    2.0  0.3  0.352075  0.173584
7   0.029951  0.443325   14.0  210.0  0.7  0.729191  0.041701
8   0.417171  0.025333  195.0   12.0  0.3  0.279519  0.068271
9   0.032090  0.282884   15.0  134.0  0.7  0.645816  0.077405
10  0.278114  0.014778  130.0    7.0  0.3  0.346085  0.153615
11  0.010697  0.352549    5.0  167.0  0.7  0.694107  0.008418
12  0.395777  0.021111  185.0   10.0  0.3  0.287962  0.040126
13  0.034229  0.483436   16.0  229.0  0.7  0.745803  0.065432

Batch Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.427867  0.029555  200.0   14.0  0.3  0.276360  0.078799
1   0.002139  0.225885    1.0  107.0  0.7  0.630976  0.098606
2   0.342294  0.021111  160.0   10.0  0.3  0.315152  0.050505
3   0.023533  0.407437   11.0  193.0  0.7  0.715045  0.021493
4   0.470654  0.067554  220.0   32.0  0.3  0.273919  0.086935
5   0.023533  0.375771   11.0  178.0  0.7  0.699326  0.000962
6   0.256720  0.004222  120.0    2.0  0.3  0.352072  0.173572
7   0.029951  0.443325   14.0  210.0  0.7  0.729198  0.041711
8   0.417171  0.025333  195.0   12.0  0.3  0.279514  0.068287
9   0.032090  0.282884   15.0  134.0  0.7  0.645821  0.077398
10  0.278114  0.014778  130.0    7.0  0.3  0.346081  0.153603
11  0.010697  0.352549    5.0  167.0  0.7  0.694113  0.008410
12  0.395777  0.021111  185.0   10.0  0.3  0.287957  0.040142
13  0.034229  0.483436   16.0  229.0  0.7  0.745810  0.065443

Total error stoch descent:  0.06898872490256348
Total error batch descent:  0.06899042421795792

Good! Seemingly we got some convergence in both cases. The overall "accuracy" achieved on the training set is even a bit better than for the "StandardScaler". And:

Final (w1,w2)-values stoch : ( -2.4151 ,  2.3977 )
Final (w1,w2)-values batch : ( -2.4153 ,  2.3976 )

This fits very well to the data we got from our mesh analysis of the cost hyperplane!

Regarding the evolution of the costs and the weights we see a slightly different picture than with the "StandardScaler":

Cost and weight evolution during stochastic gradient descent

and:

Cost and weight evolution during batch gradient descent

From the evolution of the weight parameters we can assume that gradient descent moved along a direct path into the cost minimum. This fits to the different shape of the cost hyperplane in comparison with the hyperplane we got after the application of the "StandardScaler".

Predicted contour and separation lines in the {K1, K2}-plane after feature-scaling with "Normalizer"

We compute the contour lines of the output A of our solitary neuron (see article 1 of this series) with the following code:

 
# ***********
# Contours 
# ***********
from matplotlib import ticker, cm

# Take w1/w2-vals from above w1f, w2f
w1_len = len(li_w1_ba)
w2_len = len(li_w1_ba)
w1f = li_w1_ba[w1_len -1]
w2f = li_w2_ba[w2_len -1]

def A_mesh(w1,w2, Km1, Km2):
    kshape = Km1.shape
    A = np.zeros(kshape) 
    
    Km1V = Km1.reshape(kshape[0]*kshape[1], )
    Km2V = Km2.reshape(kshape[0]*kshape[1], )
    print("km1V.shape = ", Km1V.shape, "\nkm1V.shape = ", Km2V.shape )
    
    # scaling trafo
    if scale_method == 0: 
        Km1V = scaling_fact_K1 * Km1V
        Km2V = scaling_fact_K2 * Km2V
        KmV = np.vstack( (Km1V, Km2V) )
        KmT = KmV.T
    else: 
        KmV = np.column_stack((Km1V, Km2V))
        KmT = scaler.transform(KmV)
    
    Km1T, Km2T = KmT.T
    Km1TR = Km1T.reshape(kshape)
    Km2TR = Km2T.reshape(kshape)
    print("km1TR.shape = ", Km1TR.shape, "\nkm2TR.shape = ", Km2TR.shape )
    
    
    rg_idx = range(num_samples)
    Z      = w1 * Km1TR + w2 * Km2TR
    A = expit(Z)
    return A

#Build K1/K2-mesh 
minK1, maxK1 = li_K1.min()-20, li_K1.max()+20 
minK2, maxK2 = li_K2.min()-20, li_K2.max()+20
resolution = 0.1
Km1, Km2 = np.meshgrid( np.arange(minK1, maxK1, resolution), 
                        np.arange(minK2, maxK2, resolution))

A = A_mesh(w1f, w2f, Km1, Km2 )
print("A.shape = ", A.shape)

fig_size = plt.rcParams["figure.figsize"]
#print(fig_size)
fig_size[0] = 14
fig_size[1] = 11
fig, ax = plt.subplots()
cmap=cm.PuBu_r
cmap=cm.RdYlBu
#cs = plt.contourf(X, Y, Z1, levels=25, alpha=1.0, cmap=cm.PuBu_r)
cs = ax.contourf(Km1, Km2, A, levels=25, alpha=1.0, cmap=cmap)
cbar = fig.colorbar(cs)
N = 14
r0 = 0.6
x = li_K1
y = li_K2
area = 6*np.sqrt(x ** 2 + y ** 2)  # 0 to 10 point radii
c = np.sqrt(area)
r = np.sqrt(x ** 2 + y ** 2)
area1 = np.ma.masked_where(x < 100, area)
area2 = np.ma.masked_where(x >= 100, area)
ax.scatter(x, y, s=area1, marker='^', c=c)
ax.scatter(x, y, s=area2, marker='o', c=c)
# Show the boundary between the regions:
ax.set_xlabel("K1", fontsize=16)
ax.set_ylabel("K2", fontsize=16)

 

Please note the differences in how we handle the creation of the array "KmT" with the transformed data for "scale_method=0", i.e. "Normalizer", in comparison to other methods.

Here is the result:

Looks very similar to our plot for the StandardScaler in the last article - but with a slight shift on the K1-axis. So, the answer to our first question is: The contour lines are straight diagonal lines!

This is a direct result of the equations

expit(z) = E_z = const. => z = const. => w1*f1*K1 + w2*f2*K2 = C_z =>
K2 = C_k -fact*K1

The last one is nothing but an equation for a straight line. As "factor" is a constant, the angle α with the K1-axis remains the same for different E_z and C_k, i.e. we get parallel lines. If "fact = "-w1*f1/w2*f2 ≈ 1 = tan(α)" we get almost a 45°ree;-angle α. Let us see in our case : w1 = -2.4151 , w2 = 2.3977, f1 = 0.00214, f2 = 0.00211 => fact = 1.0215. This explains our plot.

"Normalizer" used per sample

Now we scale the (K1, K2) coordinates in feature space of each single sample with the Normalizer. I.e. we scale K1 and K2 for each individual sample by a common factor 1/sqrt(K1**2 + K2**2). Meaning: No scaling with a common factor per feature over all samples; instead scaling of the features per sample. As already said: If we regard K1 and K2 as coordinates of a vector then we scale the distance of vectors end point radially to the origin of the coordinate system down to a length of 1.

Thus: After this normalization transformation we expect that our points are located on a unit circle! Note, however, that our transformation keeps up the angular distance of all data points. By "angular distance" for two selected points we mean the difference of the angles of these data points with e.g. the K1-axis.

Let us look at the transformed sample points in the {K1, K2}-plane:

Ok, our transformation has done a more pronounced "clustering" for us. Our transformed clusters are even more clearly separated from each other than before!

What does this mean for our cost hyperplane in the {w1, w2}-space? Well, here is a mesh-plot:

Cost hyperplane of the data scaled per sample by "Normalizer" in the {w1, w2}-space

According to our mesh the minimum is located at:

C_min =  2.2726781812937556e-05
Coordinates:  [[3200 2296]]
Weight values at cost minimum: -0.9080000000005057 0.8999999999992951

Comparison of the cost hyperplane with center of the original hyperplane for the unscaled batch data

Now comes a really funny point: Do you remember that we have gotten a similar plot before? Actually, we did when we looked at a tiny surroundings of the center of the cost hyperplane of the original unscaled data in the first article of this series:

Cost hyperplane at the center of the original unscaled input data in the {w1, w2}-space?

A somewhat different viewing angle - but the similarity is obvious. Note however the very different scales of the (w1, w2)-values compared to the version of the scaled data.

How do we explain this similarity? Part of the answer lies in the fact that the total costs of the batch are dominated by those samples who have the biggest coordinate values, i.e. of those points where either K1 or K2 is biggest. Now, these points were very close to each other in the original data set. Now, for such points a centric stretch by a factor of around 1/200 would require a centric stretch (but now an expansion!) for the (w1, w2)-data with a reciprocate factor if we wanted to reproduce the same cost values. Reason: Linear coupling w1*K1+w2*K2! You compensate a constant factor in the {K1,K2}-space by its reciprocate one in the {w1, W2}-space!

But that is more or less what we have done by our somewhat strange application of the "Normalizer"! At least almost ... Fun, isn't it?

Gradient descent after sample-wise (!) normalization by the "Normalizer"

The clearer separation of the clusters in the {K1, K2}-space after separation and a well formed cost hyperplane over the {w1, w2}-space should help us a bit with our gradient descent. We set the parameters of a gradient descent run to

w1_start = -0.20, w2_start = 0.25 eta = 0.2, decrease_rate = 0.00000001, num_steps = 1000

and get:

Stoachastic Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.997559  0.069829  200.0   14.0  0.3  0.300715  0.002383
1   0.009345  0.999956    1.0  107.0  0.7  0.709386  0.013408
2   0.998053  0.062378  160.0   10.0  0.3  0.299211  0.002629
3   0.056902  0.998380   11.0  193.0  0.7  0.700095  0.000136
4   0.989586  0.143940  220.0   32.0  0.3  0.316505  0.055018
5   0.061680  0.998096   11.0  178.0  0.7  0.699129  0.001244
6   0.999861  0.016664  120.0    2.0  0.3  0.290309  0.032305
7   0.066519  0.997785   14.0  210.0  0.7  0.698144  0.002652
8   0.998112  0.061422  195.0   12.0  0.3  0.299019  0.003269
9   0.111245  0.993793   15.0  134.0  0.7  0.688737  0.016090
10  0.998553  0.053768  130.0    7.0  0.3  0.297492  0.008360
11  0.029927  0.999552    5.0  167.0  0.7  0.705438  0.007769
12  0.998542  0.053975  185.0   10.0  0.3  0.297533  0.008223
13  0.069699  0.997568   16.0  229.0  0.7  0.697493  0.003581

Batch Descent
          Kt1       Kt2     K1     K2  Tgt       Res       Err
0   0.997559  0.069829  200.0   14.0  0.3  0.300723  0.002409
1   0.009345  0.999956    1.0  107.0  0.7  0.709388  0.013411
2   0.998053  0.062378  160.0   10.0  0.3  0.299219  0.002604
3   0.056902  0.998380   11.0  193.0  0.7  0.700097  0.000139
4   0.989586  0.143940  220.0   32.0  0.3  0.316513  0.055044
5   0.061680  0.998096   11.0  178.0  0.7  0.699131  0.001241
6   0.999861  0.016664  120.0    2.0  0.3  0.290316  0.032280
7   0.066519  0.997785   14.0  210.0  0.7  0.698146  0.002649
8   0.998112  0.061422  195.0   12.0  0.3  0.299027  0.003244
9   0.111245  0.993793   15.0  134.0  0.7  0.688739  0.016087
10  0.998553  0.053768  130.0    7.0  0.3  0.297500  0.008335
11  0.029927  0.999552    5.0  167.0  0.7  0.705440  0.007771
12  0.998542  0.053975  185.0   10.0  0.3  0.297541  0.008198
13  0.069699  0.997568   16.0  229.0  0.7  0.697495  0.003578

Total error stoch descent:  0.011219103621660675
Total error batch descent:  0.01121352661948904         

Well, this is a almost perfect result on the training set; just between 1% and 3% deviation from the aspired output values. We have obviously found something new! Before, we always had deviations up to 15% or even 20% in the prediction for some of the data samples in our training set.

The final values of the weights become:

Final (w1,w2)-values stoch : ( -0.9093 ,  0.9009 )
Final (w1,w2)-values batch : ( -0.9090 ,  0.9009 )

Also very perfect. You should not forget - we worked with just 14 samples and 1 neuron.

The evolution data look like:

Cost and weight evolution during stochastic gradient descent

and:

Cost and weight evolution during batch gradient descent

Smooth development; fast convergence!

Separation lines in the {K1, K2}-space after "per sample"-normalization with "Normalizer"

Now we turn to the answer to the second question we asked above: What changes regarding the separation or contour lines of the output values of our solitary neuron? Well as in our last article, we are interested in the output of our neuron after the normalization transformation of the data. I.e. we are on the search for contour lines, which we get for those points in the original {K1, K2}-space for which the sigmoid function produces a constant after transformation.

Here is the plot:

Ooops, now we get a real difference. The contour curves are straight lines, but now directed radially outwards from the origin into the {K1, K2}-space! You see in addition that most of the data points are located very close to the lines for the set values A=0.3 and A=0.7!

We also get a very clear separation line close to diagonal at 45°ree;. A few comments on this finding:

The subdivision of the {K1, K2}-plane into sectors is very appropriate for clusters with data which show a tendency of a constant ration between the K1 and K2 values or clusters with a narrow extension in both directions. Note, however, that if we had two clusters at different radial distances but at roughly the same angle our present Normalizer transformation per sample would not have been helpful but disastrous regarding separation. So: The application of special normalization procedures ahead of classification training must be done with a feeling or insight into the clustering structure in the feature space.

Why radial contour lines?

What are the contour lines in the original {K1, K2}-space which produce the same output A for the transformed data? If we name the transformed (K1, K2) values by (k1, k2) we get in our case

k1 = K1/(K1**2 + K2**2)
k2 = K2/(K1**2 + K2**2)

So, we are looking for points in the {K1, K2}-space for which the equation

expit(w1*k1+ w2*k2) = const.

We now have to show that this is fulfilled for lines that have the property K2/K1 = tan(alpha) with alpha = const.. The proof is a small algebraic exercise, which I leave to my readers. Of course a genius like Mr Trump would give a direct answer based on the transformation properties itself: We just eliminated the radial distance to the origin as a feature! I leave it up to you which way of reasoning you want to go.

Clustering ahead of gradient descent?

Our very specific way of using the "Normalizer" has led us to a clearer clustering after the scaling transformation. This gives rise to a fundamental idea:

What if you could use some method to detect clusters in the distribution of datapoints in feature space ahead of gradient decent?

But, on basis of what input or feature data then? Well, we could use some norm (as L2) to describe the distance of the data points from the centers of the different identified clusters as the new features! If we knew the centers of the clusters such an approach could have a potential advantage: It would set the the number of the new features to the number of the identified clusters. And this number could be substantially smaller than the number of originally features Why? Because in general not all features may be independent of each other and not all may be of major importance for the classification and cluster membership.

We shall follow this idea in my other series on a real MLP and MNIST in more detail.

Conclusion

In this article we studied the application of the "Normalizer" offered by Scikit-Learn in two different ways to a training scenario for a one neuron perceptron and data with two input features (only). Normally we would apply "Normalizer" such that we would scale the data of all samples for each feature separately. And use the found stretching factors later on on new data points for which we want to make a classification prediction.

We saw that such a transformation roughly kept up the structure of the datapoint distribution in the {K1, K2}-fature-space. Scaling into an interval [-1, 1] had a major and healthy impact on the structure of the cost hyperplane in the {w1, w2}-weight-space. This helped us to perform a smooth gradient descent calculation.

Then we performed an application of "Normalizer" per sample. This corresponded to a radial stretch of all datapoints down to a unit cycle, whilst keeping up the values of the angles. We got a more structured cost hyperplane afterwards and a stronger clustering effect in the special case of our transformed data distribution in feature space. This helped gradient descent quite a lot: We could classify our data much better according to our discrimination prescription A=0.3 vs. A=0.7.

Our transformation also had the interesting effect of sub-dividing the feature space into radial sectors instead of parallel stripes. This would be helpful in case of data clusters with a certain radial elongation in the feature space but a clear difference and separation in angle. Such data do indeed exist - just think of the distribution of stars or microwave radiation clusters on the nightly sky sphere. At least in the latter case the radial distance of the sources may be of minor importance: You do not need radial distance information to note a concentration in a region which we call "milky way"!

What we actually did with our special normalization was to indirectly eliminate the radial distance information hidden in our (K1, K2)-data. We could also have calculated the angle (or a function of it) directly and thrown away all other information. If we had done so, we would have reduced our 2-dim the feature space to just one dimension! We saw this directly on the plot of the contour lines! Thus: It would have been much more intelligent, if we had used our transformation in a slightly modified form, determined just the angle of our data-points directly and uses these data as the only feature guiding gradient descent.

This led us to the idea that a clear identification of clusters by some appropriate method before we start a gradient descent analysis might be helpful for classification tasks.

This in turn triggers the idea of a cluster detection in feature space - which itself actually is a major discipline of Machine Learning. An advantage of using cluster detection ahead of gradient descent would be the possible reduction of the number of input features for the artificial neural network. Take a look at a forthcoming article in my other series on a Multilayer Perceptron [MLP] in this blog for an application in combination with a MLP and the MNIST daset.

In the next article of this series on a minimalistic perceptron we shall add a bias neuron to the input layer and investigate the impact.