A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer

A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test

A simple CNN for the MNIST datasets – I – CNN basics

In the previous articles we invested some work into building the layers and into the parameterization of a training run. Our reward comprised a high accuracy value of around 99.35% and the possibility to watch interactive plots during training.

But a CNN offers much more information which is both instructive and worth *looking* at. In the first article I talked a bit about *feature detection*, which happens via the "convolution" of filters with the original image data or with the data produced in the feature maps of previous layers. What if we could see what different filters do to the underlying data? Can we have a *look* at the output which selected "feature maps" produce for a specific input image?
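To make the notion of "convolution of a filter with image data" concrete, here is a minimal numpy sketch. It is not the Keras implementation, and the 3x3 vertical-edge kernel is hand-picked rather than learned; but it shows how sliding a small filter over an image produces a "feature map" that reacts exactly where the image contains the feature the filter encodes.

```python
import numpy as np

def conv2d_single_filter(img, kernel):
    """Valid 2D cross-correlation of one filter with one grayscale image -
    the per-filter operation performed inside a Conv2D layer."""
    kh, kw = kernel.shape
    oh = img.shape[0] - kh + 1
    ow = img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # elementwise product of the filter with the current image patch
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

# a tiny "image": a vertical bright bar on a dark background
img = np.zeros((6, 6))
img[:, 2] = 1.0

# a simple hand-made vertical-edge filter
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

fmap = conv2d_single_filter(img, kernel)
print(fmap.shape)   # (4, 4) - a 6x6 image shrinks to 4x4 under a 3x3 filter
```

The feature map responds with strong positive and negative values on the two sides of the bar and with zero elsewhere; a subsequent relu would keep only the positive responses. This is the kind of map output we want to visualize below for our trained CNN.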

Yes, we can. And it is intriguing! The objective of this article is to plot images of the feature map output at a chosen convolutional or pooling layer of our CNN. This is accompanied by the hope to better understand the concept of abstract features extracted from an input image.

I follow an original idea published by F. Chollet (in his book "Deep Learning mit Python und Keras", mitp Verlag) and adapt it to the code discussed in the previous articles.

So far we have dealt with a complete CNN with a multitude of layers that produce intermediate tensors and a "one-hot"-encoded output to indicate the prediction for a hand-written digit represented by a MNIST image. The CNN itself was handled by Keras in the form of a *sequential* model of defined convolutional and pooling layers plus layers of a multi-layer perceptron [MLP]. Via the definition of such a "model" Keras does all the work required for forward and backward propagation steps in the background. After training we can "**predict**" the outcome for any new digit image which we feed into the CNN: We just have to fetch the data from the output layer (at the end of the MLP) after a forward propagation with the weights optimized during training.

But now, we need something else:

We need a **model** which gives us the output, i.e. a 2-dimensional tensor, of a specific map of an *intermediate* Conv-layer as a prediction for a given input image.

I.e. we want the output of a sub-model of our CNN containing only a part of the layers. How can we define such an (additional) *model* based on the layers of our complete original CNN-model?

Well, with Keras we can build a general model based on any (partial) graph of connected layers which somebody has set up. The input of such a model must follow rules appropriate to the receiving layer, and the output can be that of a defined subsequent layer or map. Setting up layers and models can, on a very basic level, be done with the so-called "**Functional API of Keras**". This API enables us to directly refer to methods of the classes "Layer", "Model", "Input" and "Output".

A *model* - as an instance of the Model class - can be called like a function on its input (in tensor form) and returns its output (in tensor form). As we deal with classes you will not be surprised that we can refer to the *input layer* of a general model via the model's instance name - let us say "cnnx" - and an instance attribute. A model has a unique input layer which later is fed with tensor input data; we can refer to this input layer via the attribute "input" of the model object. So, e.g., "cnnx.input" gives us a clear, unique reference to the input layer. With the attribute "output" we can analogously refer to the model's output.

But how can we refer to the output of a *specific layer or map* of a CNN model? If you look it up in the Keras documentation you will find that we can give each layer of a model a specific "*name*". And a Keras model, of course, has a method to retrieve a reference to a layer via its name:

cnnx.get_layer(layer_name)

Each convolutional layer of our CNN is an instance of the class "Conv2D" with an attribute "output" - this comprises the multidimensional tensor delivered by the **activation** function of the layer's nodes (or "units" in Keras slang). For images such a tensor in general has 4 axes:

sample-number of the batch, px width, px height, filter number

The "filter number" identifies a map of the Conv2D layer. To get the "image" data provided by a specific map (identified by "map-number") we have to address the layer's output tensor as

cnnx.get_layer(layer_name).output[sample-number, :, :, map-number]

We know already that these data are values in a certain range (here above 0, due to our choice of the activation function as "relu").
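The 4-axis layout and the relu value range can be illustrated with a small numpy stand-in for such an activation tensor (the shape and variable names below are just illustrative assumptions, not data from our CNN):

```python
import numpy as np

# Hypothetical stand-in for the activation output of a Conv2D layer:
# 1 sample in the batch, 26x26 pixels, 32 maps (filters) -
# the 4-axis layout described above.
lay_activation = np.random.rand(1, 26, 26, 32)   # relu-like output: values >= 0

sample_number, map_number = 0, 5
# slicing out the 2-dimensional "image" of one specific map
map_img = lay_activation[sample_number, :, :, map_number]

print(map_img.shape)          # (26, 26) - one quadratic "image" per map
print(map_img.min() >= 0.0)   # True - relu never produces negative values
```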

**Hint regarding wording:** F. Chollet calls the output of the activation functions of the nodes of a layer or map the "*activation*" of the layer or map, respectively. We shall use this wording in the code we are going to build.

It may be necessary later on to depict a chosen input image for our analysis - e.g. a MNIST image of the test data set. How can we do this? We just fill a new Jupyter cell with the following code:

ay_img = test_imgs[7:8]
plt.imshow(ay_img[0,:,:,0], cmap=plt.cm.binary)

These code lines plot the eighth sample image of the already shuffled test data set.
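A side remark on the slicing `[7:8]`: in contrast to plain indexing, slicing preserves the batch axis, which Keras models expect for their input. A small numpy sketch (with a random stand-in for `test_imgs`):

```python
import numpy as np

# Hypothetical stand-in for the MNIST test tensor: 10 images, 28x28 px, 1 channel
test_imgs = np.random.rand(10, 28, 28, 1)

ay_img = test_imgs[7:8]      # slicing keeps the batch axis ...
print(ay_img.shape)          # (1, 28, 28, 1)
print(test_imgs[7].shape)    # (28, 28, 1) - plain indexing would drop it
```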

We first must extend our previously defined functions to be able to deal with layer names. We change the code in our Jupyter Cell 8 (see the last article) in the following way:

**Jupyter Cell 8:** Setting up a training run

```python
# Perform a training run
# **********************

# Prepare the plotting
# The really important command for interactive (= intermediate) plot updating
%matplotlib notebook
plt.ion()

# sizing
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 8
fig_size[1] = 3

# One figure
# -----------
fig1 = plt.figure(1)
#fig2 = plt.figure(2)

# first figure with two plot-areas with axes
# --------------------------------------------
ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)
fig1.canvas.draw()

# second figure with just one plot area with axes
# -------------------------------------------------
#ax2 = fig2.add_subplot(121)
#ax2_1 = fig2.add_subplot(121)
#ax2_2 = fig2.add_subplot(122)
#fig2.canvas.draw()

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Parameterization of the training run
#build = False
build = True
if cnn == None:
    build = True
    x_optimizer = None
batch_size = 64
epochs = 80
reset = False
#reset = True    # we want training to start again with the initial weights

my_loss    = 'categorical_crossentropy'
my_metrics = ['accuracy']

my_regularizer = None
my_regularizer = 'l2'
my_reg_param_l2 = 0.001
#my_reg_param_l2 = 0.01
my_reg_param_l1 = 0.01

my_optimizer = 'rmsprop'      # Present alternatives: rmsprop, nadam, adamax
my_momentum  = 0.5            # momentum value
my_lr_sched  = 'powerSched'   # Present alternatives: None, powerSched, exponential
#my_lr_sched = None
my_lr_init        = 0.001     # initial learning rate
my_lr_decay_steps = 1         # decay steps = 1
my_lr_decay_rate  = 0.001     # decay rate

li_conv_1    = [32,  (3,3), 0]
li_conv_2    = [64,  (3,3), 0]
li_conv_3    = [128, (3,3), 0]
li_Conv      = [li_conv_1, li_conv_2, li_conv_3]
li_Conv_Name = ["Conv2D_1", "Conv2D_2", "Conv2D_3"]
li_pool_1    = [(2,2)]
li_pool_2    = [(2,2)]
li_Pool      = [li_pool_1, li_pool_2]
li_Pool_Name = ["Max_Pool_1", "Max_Pool_2", "Max_Pool_3"]
li_dense_1   = [100, 0]
#li_dense_2  = [30, 0]
li_dense_3   = [10, 0]
#li_MLP      = [li_dense_1, li_dense_2, li_dense_3]
li_MLP       = [li_dense_1, li_dense_3]
input_shape  = (28,28,1)

try:
    if gpu:
        with tf.device("/GPU:0"):
            cnn, fit_time, history, x_optimizer = train( cnn, build, train_imgs, train_labels,
                            li_Conv, li_Conv_Name, li_Pool, li_Pool_Name, li_MLP, input_shape,
                            reset, epochs, batch_size,
                            my_loss=my_loss, my_metrics=my_metrics,
                            my_regularizer=my_regularizer,
                            my_reg_param_l2=my_reg_param_l2, my_reg_param_l1=my_reg_param_l1,
                            my_optimizer=my_optimizer, my_momentum=my_momentum,
                            my_lr_sched=my_lr_sched, my_lr_init=my_lr_init,
                            my_lr_decay_steps=my_lr_decay_steps, my_lr_decay_rate=my_lr_decay_rate,
                            fig1=fig1, ax1_1=ax1_1, ax1_2=ax1_2
                            )
        print('Time_GPU: ', fit_time)
    else:
        with tf.device("/CPU:0"):
            cnn, fit_time, history, x_optimizer = train( cnn, build, train_imgs, train_labels,
                            li_Conv, li_Conv_Name, li_Pool, li_Pool_Name, li_MLP, input_shape,
                            reset, epochs, batch_size,
                            my_loss=my_loss, my_metrics=my_metrics,
                            my_regularizer=my_regularizer,
                            my_reg_param_l2=my_reg_param_l2, my_reg_param_l1=my_reg_param_l1,
                            my_optimizer=my_optimizer, my_momentum=my_momentum,
                            my_lr_sched=my_lr_sched, my_lr_init=my_lr_init,
                            my_lr_decay_steps=my_lr_decay_steps, my_lr_decay_rate=my_lr_decay_rate,
                            fig1=fig1, ax1_1=ax1_1, ax1_2=ax1_2
                            )
        print('Time_CPU: ', fit_time)
except SystemExit:
    print("stopped due to exception")
```

You see that I added a list

li_Conv_Name = ["Conv2D_1", "Conv2D_2", "Conv2D_3"]

...

li_Pool_Name = ["Max_Pool_1", "Max_Pool_2", "Max_Pool_3"]

which provide *names* for the (presently three) defined convolutional and (presently two) pooling layers. The interface of the training function has, of course, to be extended to accept these lists. The function "**train()**" in Jupyter cell 7 (see the last article) is modified accordingly:

**Jupyter cell 7: Trigger (re-) building and training of the CNN**

```python
# Training 2 - with test data integrated
# *****************************************
def train( cnn, build, train_imgs, train_labels,
           li_Conv, li_Conv_Name, li_Pool, li_Pool_Name, li_MLP, input_shape,
           reset=True, epochs=5, batch_size=64,
           my_loss='categorical_crossentropy', my_metrics=['accuracy'],
           my_regularizer=None, my_reg_param_l2=0.01, my_reg_param_l1=0.01,
           my_optimizer='rmsprop', my_momentum=0.0,
           my_lr_sched=None, my_lr_init=0.001, my_lr_decay_steps=1, my_lr_decay_rate=0.00001,
           fig1=None, ax1_1=None, ax1_2=None
         ):

    if build:
        # build cnn layers - now with regularizer - 200603 rm
        cnn = build_cnn_simple( li_Conv, li_Conv_Name, li_Pool, li_Pool_Name, li_MLP, input_shape,
                                my_regularizer = my_regularizer,
                                my_reg_param_l2 = my_reg_param_l2, my_reg_param_l1 = my_reg_param_l1)

        # compile - now with lr_scheduler - 200603
        cnn = my_compile(cnn=cnn,
                         my_loss=my_loss, my_metrics=my_metrics,
                         my_optimizer=my_optimizer, my_momentum=my_momentum,
                         my_lr_sched=my_lr_sched, my_lr_init=my_lr_init,
                         my_lr_decay_steps=my_lr_decay_steps, my_lr_decay_rate=my_lr_decay_rate)

        # save the initial (!) weights to be able to restore them
        cnn.save_weights('cnn_weights.h5')   # save the initial weights

    # reset weights (standard)
    if reset:
        cnn.load_weights('cnn_weights.h5')

    # Callback list
    # ~~~~~~~~~~~~~
    use_scheduler = True
    if my_lr_sched == None:
        use_scheduler = False
    lr_history = LrHistory(use_scheduler)
    callbacks_list = [lr_history]
    if fig1 != None:
        epoch_plot = EpochPlot(epochs, fig1, ax1_1, ax1_2)
        callbacks_list.append(epoch_plot)

    start_t = time.perf_counter()
    if reset:
        history = cnn.fit(train_imgs, train_labels, initial_epoch=0, epochs=epochs,
                          batch_size=batch_size, verbose=1, shuffle=True,
                          validation_data=(test_imgs, test_labels), callbacks=callbacks_list)
    else:
        history = cnn.fit(train_imgs, train_labels, epochs=epochs,
                          batch_size=batch_size, verbose=1, shuffle=True,
                          validation_data=(test_imgs, test_labels), callbacks=callbacks_list)
    end_t = time.perf_counter()
    fit_t = end_t - start_t

    # save the model
    cnn.save('cnn.h5')

    return cnn, fit_t, history, x_optimizer   # we return cnn to be able to use it in other Jupyter cells
```

We transfer the name-lists further on to the function "**build_cnn_simple()**":

**Jupyter Cell 4:** Build a simple CNN

```python
# Sequential layer model of our CNN
# ***********************************

# important !!
# ~~~~~~~~~~~
cnn = None
x_optimizer = None

# function to build the CNN
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def build_cnn_simple(li_Conv, li_Conv_Name, li_Pool, li_Pool_Name, li_MLP, input_shape,
                     my_regularizer=None,
                     my_reg_param_l2=0.01, my_reg_param_l1=0.01):

    use_regularizer = True
    if my_regularizer == None:
        use_regularizer = False

    # activation functions to be used in Conv-layers
    li_conv_act_funcs = ['relu', 'sigmoid', 'elu', 'tanh']
    # activation functions to be used in MLP hidden layers
    li_mlp_h_act_funcs = ['relu', 'sigmoid', 'tanh']
    # activation functions to be used in MLP output layers
    li_mlp_o_act_funcs = ['softmax', 'sigmoid']

    # dictionary for regularizer functions
    d_reg = { 'l2': regularizers.l2, 'l1': regularizers.l1 }
    if use_regularizer:
        if my_regularizer not in d_reg:
            print("regularizer " + my_regularizer + " not known!")
            sys.exit()
        else:
            regul = d_reg[my_regularizer]
        if my_regularizer == 'l2':
            reg_param = my_reg_param_l2
        elif my_regularizer == 'l1':
            reg_param = my_reg_param_l1

    # Build the Conv part of the CNN
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    num_conv_layers = len(li_Conv)
    num_pool_layers = len(li_Pool)
    if num_pool_layers != num_conv_layers - 1:
        print("\nNumber of pool layers does not fit to number of Conv-layers")
        sys.exit()
    rg_il = range(num_conv_layers)

    # Define a sequential CNN model
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    cnn = models.Sequential()

    # in our simple model each Conv2D layer is followed by a pooling layer
    # (with the exception of the last one)
    for il in rg_il:
        # add the convolutional layer
        num_filters  = li_Conv[il][0]
        t_fkern_size = li_Conv[il][1]
        cact         = li_conv_act_funcs[li_Conv[il][2]]
        cname        = li_Conv_Name[il]
        if il == 0:
            cnn.add(layers.Conv2D(num_filters, t_fkern_size, activation=cact, name=cname,
                                  input_shape=input_shape))
        else:
            cnn.add(layers.Conv2D(num_filters, t_fkern_size, activation=cact, name=cname))

        # add the pooling layer
        if il < num_pool_layers:
            t_pkern_size = li_Pool[il][0]
            pname        = li_Pool_Name[il]
            cnn.add(layers.MaxPooling2D(t_pkern_size, name=pname))

    # Build the MLP part of the CNN
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    num_mlp_layers = len(li_MLP)
    rg_im = range(num_mlp_layers)

    cnn.add(layers.Flatten())

    for im in rg_im:
        # add the dense layer
        n_nodes = li_MLP[im][0]
        if im < num_mlp_layers - 1:   # hidden layer
            m_act = li_mlp_h_act_funcs[li_MLP[im][1]]
        else:                         # output layer
            m_act = li_mlp_o_act_funcs[li_MLP[im][1]]
        if use_regularizer:
            cnn.add(layers.Dense(n_nodes, activation=m_act, kernel_regularizer=regul(reg_param)))
        else:
            cnn.add(layers.Dense(n_nodes, activation=m_act))

    return cnn
```

The layer names are transferred to Keras via the parameter "name" of the layer constructor when we add a layer with the model's method "**model.add()**", e.g.:

cnn.add(layers.Conv2D(num_filters, t_fkern_size, activation=cact, name=cname))

Note that all other Jupyter cells remain unchanged.

Predictions of a neural network require a forward propagation of an input and thus a precise definition of layers and weights. In the last article we have already seen how we save and reload the weight data of a model. However, weights are only a part of the information defining a model in a certain state. To see the activation of certain maps of a trained model we would like to be able to reload the *full* model in its trained status. Keras offers a very simple method to save and reload the complete set of data for a given model state:

cnn.save('filename.h5')

cnnx = models.load_model('filename.h5')

The first statement creates a file with the name "filename.h5" in the h5 format (for large, hierarchically organized data) in our Jupyter environment. You would of course replace "filename" by a more appropriate name to characterize your saved model state. In my combined Eclipse-Jupyter environment the standard path for such files points to the directory where I keep my notebooks. We included a corresponding statement at the end of the function "train()". The attentive reader has certainly noticed this already.

We now build a new function to do the plotting of the outputs of all maps of a layer.

**Jupyter Cell 9** - filling a grid with output-images of all maps of a layer

```python
# Function to plot the activations of a layer
# -------------------------------------------
# Adaption of a method originally designed by F. Chollet

def img_grid_of_layer_activation(d_img_sets, model_fname='cnn.h5', layer_name='',
                                 img_set="test_imgs", num_img=8, scale_img_vals=False):
    '''
    Input parameter:
    -----------------
    d_img_sets:     dictionary with available img_sets, which contain img tensors
                    (presently: train_imgs, test_imgs)
    model_fname:    name of the file containing the model's data
    layer_name:     name of the layer for which we plot the activation;
                    the name must be known to the Keras model (string)
    img_set:        the set of images we pick a specific image from (string)
    num_img:        the sample number of the image in the chosen set (integer)
    scale_img_vals: False: do NOT scale (standardize) and clip (!) the pixel values
                    True:  standardize the values (Boolean)

    Hints:
    -----------------
    We assume quadratic images
    '''

    # Load a model
    cnnx = models.load_model(model_fname)

    # get the output of a certain named layer - this includes all maps
    # https://keras.io/getting_started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer-feature-extraction
    cnnx_layer_output = cnnx.get_layer(layer_name).output

    # build a new model for input "cnnx.input" and output "cnnx_layer_output"
    # ~~~~~~~~~~~~~~~~~
    # Keras knows the required connections and intermediate layers from its tensorflow
    # graphs - otherwise we would get an error.
    # The new model can make predictions for a suitable input in the required tensor form
    mod_lay = models.Model(inputs=cnnx.input, outputs=cnnx_layer_output)

    # Pick the input image from a set of respective tensors
    if img_set not in d_img_sets:
        print("img set " + img_set + " is not known!")
        sys.exit()
    # slicing to get the right tensor
    ay_img = d_img_sets[img_set][num_img:(num_img+1)]

    # Use the tensor data as input for a prediction of model "mod_lay"
    lay_activation = mod_lay.predict(ay_img)
    print("shape of layer " + layer_name + " : ", lay_activation.shape)

    # number of maps of the selected layer
    n_maps = lay_activation.shape[-1]
    # size of an image - we assume quadratic images
    img_size = lay_activation.shape[1]

    # Only for testing: plot an image for a selected map
    # map_nr = 1
    # plt.matshow(lay_activation[0,:,:,map_nr], cmap='viridis')

    # We work with a grid of images for all maps
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # the grid is built top-down (!) with num_cols and num_rows
    # dimensions for the grid
    num_imgs_per_row = 8
    num_cols = num_imgs_per_row
    num_rows = n_maps // num_imgs_per_row
    #print("img_size = ", img_size, " num_cols = ", num_cols, " num_rows = ", num_rows)

    # grid
    dim_hor = num_imgs_per_row * img_size
    dim_ver = num_rows * img_size
    img_grid = np.zeros( (dim_ver, dim_hor) )   # vertical, horizontal matrix
    print(img_grid.shape)

    # double loop to fill the grid
    n = 0
    for row in range(num_rows):
        for col in range(num_cols):
            n += 1
            #print("n = ", n, "row = ", row, " col = ", col)
            present_img = lay_activation[0, :, :, row*num_imgs_per_row + col]

            # standardization and clipping of the img data
            if scale_img_vals:
                present_img -= present_img.mean()
                if present_img.std() != 0.0:   # standard deviation
                    present_img /= present_img.std()
                    #present_img /= (present_img.std() + 1.e-8)
                present_img *= 64
                present_img += 128
                present_img = np.clip(present_img, 0, 255).astype('uint8')   # limit values to 255

            # place the img-data at the right position in the grid
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # the following is only used if we had reversed vertical direction by accident
            #img_grid[row*img_size:(row+1)*(img_size), col*img_size:(col+1)*(img_size)] = np.flip(present_img, 0)
            img_grid[row*img_size:(row+1)*(img_size), col*img_size:(col+1)*(img_size)] = present_img

    return img_grid, img_size, dim_hor, dim_ver
```

I explain the core parts of this code in the next two sections.

In a first step of the function "img_grid_of_layer_activation()" we load a CNN model saved at the end of a previous training run:

cnnx = models.load_model(model_fname)

The file name is passed via the parameter "model_fname".

With the lines

cnnx_layer_output = cnnx.get_layer(layer_name).output

mod_lay = models.Model(inputs=cnnx.input, outputs=cnnx_layer_output)

we define a **new model** "mod_lay" comprising all layers of the loaded model "cnnx" in between "cnnx.input" and "cnnx_layer_output". "cnnx_layer_output" serves as the output layer of this new model. This model - like every working CNN model - can make predictions for a given input tensor. The output of such a prediction is the tensor produced by "cnnx_layer_output"; a typical shape of this tensor is:

shape of layer Conv2D_1 : (1, 26, 26, 32)

From this tensor we can retrieve the size of the comprised quadratic image data.

Matplotlib can plot a grid of equally sized images. We use such a grid to collect the activation data produced by all maps of a chosen layer, which was given by its name as an input parameter.

The first statements define the number of images in a row of the grid, i.e. the number of columns of the grid. Together with the number of layer maps this in turn determines the required number of rows of the grid. From the number of pixel data in the tensor we can then derive the grid dimensions in terms of pixels. The double loop eventually fills in the image data extracted from the tensors produced by the layer maps.
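The grid arithmetic described above can be sketched in isolation with a random stand-in for the activation tensor (the shapes below are chosen arbitrarily for illustration):

```python
import numpy as np

# Hypothetical activation tensor: 1 sample, 24x24 px, 16 maps
lay_activation = np.random.rand(1, 24, 24, 16)
n_maps   = lay_activation.shape[-1]   # 16
img_size = lay_activation.shape[1]    # 24

# columns are fixed; the rows follow from the number of maps
num_imgs_per_row = 8
num_cols = num_imgs_per_row
num_rows = n_maps // num_imgs_per_row   # 16 // 8 = 2

# grid dimensions in pixels and the double loop filling the grid
img_grid = np.zeros((num_rows * img_size, num_cols * img_size))
for row in range(num_rows):
    for col in range(num_cols):
        present_img = lay_activation[0, :, :, row * num_imgs_per_row + col]
        img_grid[row*img_size:(row+1)*img_size,
                 col*img_size:(col+1)*img_size] = present_img

print(img_grid.shape)   # (48, 192): 2 rows x 8 columns of 24x24 map images
```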

If requested via the function parameter "scale_img_vals=True" we standardize the image data and limit the pixel values to a maximum of 255 (clipping). In some cases this can be useful to get a better graphical representation of the activation data with certain color maps.
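The standardization and clipping step, taken in isolation, amounts to the following numpy operations (the input array is just a random stand-in for one map's activation image):

```python
import numpy as np

# Hypothetical raw activation image of one map
present_img = np.random.rand(26, 26) * 3.0

# shift to mean 0, scale to standard deviation 1 (guarding against std == 0)
present_img = present_img - present_img.mean()
if present_img.std() != 0.0:
    present_img = present_img / present_img.std()

# map the standardized values onto a 0..255 gray range and clip the extremes
present_img = present_img * 64 + 128
present_img = np.clip(present_img, 0, 255).astype('uint8')

print(present_img.dtype)   # uint8 - all values now lie within 0..255
```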

Our function "img_grid_of_layer_activation()" returns the grid and the dimensional data.

Note that the grid is oriented from its top downwards and from the left to the right side.

In a further Jupyter cell we prepare and perform a call of our new function. Afterwards we plot the resulting information in two figures.

**Jupyter Cell 10** - plotting the activations of a layer

```python
# Plot the img grid of a layer's activation
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# global dict for the image sets
d_img_sets = {'train_imgs': train_imgs, 'test_imgs': test_imgs}

# layer - pick one of the names which you defined for your model
layer_name = "Conv2D_1"

# choose an image set and an img number
img_set = "test_imgs"
num_img = 19

# Two figures
# -----------
fig1 = plt.figure(1)   # figure for the input img
fig2 = plt.figure(2)   # figure for the activation outputs of the maps

ay_img = test_imgs[num_img:num_img+1]
plt.imshow(ay_img[0,:,:,0], cmap=plt.cm.binary)

# getting the img grid
img_grid, img_size, dim_hor, dim_ver = img_grid_of_layer_activation(
                                            d_img_sets, model_fname='cnn.h5',
                                            layer_name=layer_name,
                                            img_set=img_set, num_img=num_img,
                                            scale_img_vals=False)

# Define reasonable figure dimensions by scaling the grid-size
scale = 1.6 / (img_size)
fig2 = plt.figure( figsize=(scale * dim_hor, scale * dim_ver) )

# axes
ax = fig2.gca()
ax.set_xlim(-0, dim_hor-1.0)
ax.set_ylim(dim_ver-1.0, 0)    # the grid is oriented top-down
#ax.set_ylim(-0, dim_ver-1.0)  # normally wrong

# setting labels - tick positions and grid lines
ax.set_xticks(np.arange(img_size-0.5, dim_hor, img_size))
ax.set_yticks(np.arange(img_size-0.5, dim_ver, img_size))
ax.set_xticklabels([])   # no labels should be printed
ax.set_yticklabels([])

# preparing the grid
plt.grid(b=True, linestyle='-', linewidth='.5', color='#ddd', alpha=0.7)

# color-map
#cmap = 'viridis'
#cmap = 'inferno'
#cmap = 'jet'
cmap = 'magma'

plt.imshow(img_grid, aspect='auto', cmap=cmap)
```

The first figure contains the original MNIST image. The second figure will contain the grid with the images of the maps' output. The code is straightforward; the corrections of the dimensions have to do with the display of the separating lines between the different images. Statements like "ax.set_xticklabels([])" set the tick-mark texts to empty strings. At the end of the code we choose a color map.

Note that I avoided standardizing the image data. Clipping suppresses extreme values; however, the map-related filters react exactly to these values. So, let us keep the full value spectrum for a while ...

I performed a training run with the following setting and saved the last model:

```python
build = True
if cnn == None:
    build = True
    x_optimizer = None
batch_size = 64
epochs = 80
reset = False
#reset = True    # we want training to start again with the initial weights

my_loss    = 'categorical_crossentropy'
my_metrics = ['accuracy']

my_regularizer = None
my_regularizer = 'l2'
my_reg_param_l2 = 0.001
#my_reg_param_l2 = 0.01
my_reg_param_l1 = 0.01

my_optimizer = 'rmsprop'      # Present alternatives: rmsprop, nadam, adamax
my_momentum  = 0.5            # momentum value
my_lr_sched  = 'powerSched'   # Present alternatives: None, powerSched, exponential
#my_lr_sched = None
my_lr_init        = 0.001     # initial learning rate
my_lr_decay_steps = 1         # decay steps = 1
my_lr_decay_rate  = 0.001     # decay rate

li_conv_1    = [32,  (3,3), 0]
li_conv_2    = [64,  (3,3), 0]
li_conv_3    = [128, (3,3), 0]
li_Conv      = [li_conv_1, li_conv_2, li_conv_3]
li_Conv_Name = ["Conv2D_1", "Conv2D_2", "Conv2D_3"]
li_pool_1    = [(2,2)]
li_pool_2    = [(2,2)]
li_Pool      = [li_pool_1, li_pool_2]
li_Pool_Name = ["Max_Pool_1", "Max_Pool_2", "Max_Pool_3"]
li_dense_1   = [100, 0]
#li_dense_2  = [30, 0]
li_dense_3   = [10, 0]
#li_MLP      = [li_dense_1, li_dense_2, li_dense_3]
li_MLP       = [li_dense_1, li_dense_3]
input_shape  = (28,28,1)
```

This run gives us the following results:


```
Epoch 80/80
933/938 [============================>.] - ETA: 0s - loss: 0.0030 - accuracy: 0.9998
present lr:  1.31509732e-05
present iteration: 75040
938/938 [==============================] - 4s 5ms/step - loss: 0.0030 - accuracy: 0.9998 - val_loss: 0.0267 - val_accuracy: 0.9944
```

Ok, let us test the code to plot the maps' output. For the input data

```python
# layer - pick one of the names which you defined for your model
layer_name = "Conv2D_1"

# choose an image set and an img number
img_set = "test_imgs"
num_img = 19
```

we get the following results:

**Layer "Conv2D_1"**

**Layer "Conv2D_2"**

**Layer "Conv2D_3"**

Keras' flexibility regarding model definitions allows for the definition of new models based on parts of the original CNN. The output layer of these new models can be set to any of the convolutional or pooling layers. With predictions for an input image we can extract the activation results of all maps of a layer. These data can be visualized in the form of a grid that shows the reaction of a layer to the input image. A first test shows that the representations of the input get more and more abstract with higher convolutional layers.

In the next article we shall have a closer look at what these abstractions may mean for the classification of certain digit images.

I have now had the German Corona-App installed on my Android smartphone for a week and am getting increasingly annoyed. Reading a reader discussion on the topic in "Die Zeit" yesterday added to my irritation; see the debate on the article

https://www.zeit.de/digital/mobil/2020-06/tracing-app-corona-datenschutz-standortdaten-bluetooth-virus-bekaempfung.

There the debaters accuse each other of ignorance ["keine Ahnung", "Unwissenheit", ...] and some are proud that they themselves (the smart alecks) know that access rights can be configured (to a limited extent) per app (at the risk, by the way, that the affected app then no longer works). Sham battles among would-be gurus ... and, in my view, a failure to address the real issue.

The real issue is that activating the Corona-App requires the user to enable location determination in a fundamental and blanket way, although the Corona-App itself does not use the data that can thereby be determined at all.

The ostensible reason for the required activation: opaque Android specifics introduced with Android version 6.0, which were softened again from version 8 onward. A blanket activation of location determination is, of course, a windfall for many other apps installed on an Android smartphone (among them, prominently, several of Google's standard apps).

After installing the Corona-App one would therefore have to configure the permissions of all other applications one by one. But the Corona-App does not even point the user to this necessary procedure! And I doubt that every user would check the permission settings of every installed app.

Activating the Corona-App thus brings (economic) advantages for Google and other app makers which reach far beyond the development money for the Corona-App. One rubs one's mask-chafed cheeks in astonishment ...

Over the past week I asked several people (8) - among them 4 IT professionals - whether they were aware that activating the Corona-App on their phone had led to enabling location determination. Seven knew nothing about it. Three replied that this was precisely *not* the case with the Corona-App. One had read something during the installation, but then clicked on ...

Truly time for a "Standortbestimmung" [a taking stock of one's position] regarding the Corona-App and the related advance information provided by government and press. Precisely because I am no Android expert myself, I want to address a few inconsistencies ...

Before I have to listen - like some commenters on the Zeit article - to the objection that I pass plenty of data to Google and Co. through my smartphone usage anyway, let me present my view of things:

**The Corona perspective:** For professional reasons I generally travel a lot by train. Last week, for instance, I was on the road the whole week, criss-crossing Germany, in quite well-filled trains which DB itself indicated as more than 50% occupied. Because of my age I unfortunately already belong to the group of people with an elevated Corona risk. The Corona-App naturally does not protect me from these risks, but it would be an important piece of the puzzle in the effort to contact a doctor *in time* if a certain infection risk is rated as highly probable on the basis of the collected data. The app would also offer the chance to warn my older partner, who for various reasons belongs to the high-risk group, and to take suitable measures. That is why I installed the German Corona-App on my smartphone on Tuesday of last week.

**The smartphone perspective:** My private Android smartphone is relatively old and runs Android 6. It has a good hardware base, plenty of memory, a fast encrypted additional storage card, and performs better than some Android 8 smartphones of other family members. I use my smartphone primarily for reading newspapers (via the FF browser with an activated VPN solution). VLC lets me listen to music while doing so. Communication with fellow citizens takes place exclusively via Signal. Location determination, GPS and Bluetooth are normally switched off; internet access via varying VPN servers is regularly activated. Removable apps have been removed. The access permissions of apps are reduced to a minimum (as far as possible). I use neither Google Chrome nor Google's search engine on the phone for standard searches in the browser; Noscript blocks Javascript until I allow it. I naturally do not use Facebook or Whatsapp (after a thorough four-month study), and Youtube is uninteresting terrain for me on the smartphone and deactivated.

I therefore believe I have done what one can do without rooting the smartphone in order to be at least not directly monitorable by Samsung and Google. Yes, I can already hear it: that is an illusion. True, but I reduce direct access as best I can. My wife therefore considers my phone unusable - a good sign. My basic motto is: I have no friends on the internet. And so far Google has certainly not met me as a friend, but as a commercial enterprise with a capitalistically underpinned interest in data about my person.

**The perspective on our state in the current crisis:** I harbor no fundamental distrust of our state and its institutions; rather of individual politicians and certain parties. I am very glad to live in Germany and appreciate our constitution more with every year of my life. I found the measures of our federal government and of most state governments regarding the Corona pandemic correct. (When judging these measures, by the way, it does not hurt to do a little calculating oneself on the basis of the known figures. But the ability to "calculate", let alone a command of a little statistics, no longer seems to be part of the basic equipment of many fellow citizens, TV hosts and politicians ... except for Dr. Merkel.) I welcomed the federal government's efforts to get an app off the ground, and also the - unfortunately late - step toward decentralized data storage and open source.

Es war klar, dass die Corona-App Bluetooth (Low Energy) nutzen würde. Die generelle Umsetzung von Bluetooth auf Smartphones durch die jeweiligen Hersteller ist bislang leider immer wieder durch viele Sicherheitslücken aufgefallen und weniger als solide Kommunikationstechnologie. Normalerweise aktiviert kein vernünftiger Mensch Bluetooth in einem dicht gefüllten Zugabteil. Aber in Coronazeiten muss man hier leider Kompromisse schließen. Das war - wie gesagt - von Anfang an klar und ist aus meiner Sicht auch hinreichend dargestellt und debattiert worden.

Dass man auf Android Smartphones aber pauschal die Standortbestimmung (inkl. GPS) aktivieren muss, das war * nicht* ausgemacht. In der öffentlichen Darstellung wurde im Vorfeld auch eher das Gegenteil betont. So traute ich meinen Augen nicht, als ich im Zuge der Installation und Aktivierung der Corona-App auf Seite 2 einer Popup-Meldung lesen musste:

"For these functions [Bluetooth was meant] the following permissions are required:

**Device location**

This setting is required so that Bluetooth devices in your vicinity can be found. For notifications about possible encounters with persons infected with Covid-19, however, the device location is not used. Other apps with permission to determine the location have access to the device location."

I almost skipped past it. The word "device location" did catch my eye; but because of the reports on TV and in the press I expected merely to be informed that location determination was not necessary. Only when I was supposed to press "Activate" once more did I hesitate - the app or something else? And I began to read ...

I told an IT colleague about this on the train last Wednesday. He had installed the app himself, but had clicked quickly past the two explanatory popup pages at the beginning. At first he did not want to believe my account. Until I showed him a warning message - at that time still in English (!) - that came up when you explicitly deactivated the location services (incl. GPS) that the Corona app had switched on. Meanwhile (app version 1.0.4) the message appears in German, but at least on my device it cannot be read in full in the notification area. You then have to look it up in your own **"Google" settings** on the smartphone:

"Notifications about possible encounters with persons infected with Covid-19. This function is deactivated."

Well, that is unambiguous. No activation of location services => no activation of the Corona app's notifications either. The same happened on my colleague's (much more recent) phone. He only really believed it, though, when I showed him the description of the technical prerequisites under the app's "**privacy information**".

Interestingly, you find nothing about location services under "Frequently asked questions", but under "Privacy information", far down under **point 7 "Technical prerequisites"** - point b) Android smartphones:

"Location determination on your smartphone must be activated so that your device searches for Bluetooth signals of other smartphones. Location data is, however, not collected."

You get an almost identical message, by the way, when you deactivate the Corona app via its own switch and then reactivate it.

Does activating location services affect other apps? Oh yes! The easiest way to test this is with Google Maps or compass applications. They no longer ask whether, e.g., GPS should be activated; in the course of activating the Corona app you have in fact granted this across the board (unless you have imposed other restrictions elsewhere). Which does not mean that the Corona app itself uses location data. More on that below.

So much for the facts and for what the Corona app itself tells you.

Now one can do a little research on the internet. You will find that you were not mistaken: on Android devices the blanket release of location services is required first, even though the Corona app itself uses no location data ... And what else does one learn along the way?

**Point 1:** Google factually and technically couples "location services" (incl. the general release of GPS for other apps) to the use of "Bluetooth Low Energy". This was and is intentional! At least in Android 6.x and 7.x. See:

stackoverflow.com/questions/33045581/location-needs-to-be-enabled-for-bluetooth-low-energy-scanning-on-android-6-0

https://www.mobiflip.de/shortnews/corona-warn-app-geraetestandort-unter-android/

You can activate Bluetooth separately (incl. the Low Energy functionality [LE]), but it does you no good unless you simultaneously allow location determination. Reason: Bluetooth LE would otherwise allegedly not even start scanning for other Bluetooth devices. Unfortunately, as a Corona app user you are not shown in the Android status bar that location services have been activated. You have to discover that yourself ...

Interesting in this context is the realization that these two technically separate functionalities - activating Bluetooth LE and effectively releasing location determination - have not been coupled in Apple's iOS. The Bluetooth chip is a separate unit on Android phones too, and has technically nothing directly to do with a coarse location fix via Wifi, GPS or other geolocated Bluetooth receiver devices. Nevertheless Google introduced the coupling at the operating-system level (!). Shame on him who thinks evil of it ....

**Point 2:** The almost sheepish answer peddled in various places on the internet is that this is handled this way because Bluetooth, too, could be used to determine a location. I quote from https://developer.android.com/guide/topics/connectivity/bluetooth-le:

"Because discoverable devices might reveal information about the user's location, the device discovery process requires location access. If your app is being used on a device that runs Android 8.0 (API level 26) or higher, use the Companion Device Manager API. This API performs device discovery on your app's behalf, so your app doesn't need to request location permissions."

That is strange logic. Because activating a technical transmitter could, indirectly and in combination with other technical functionalities or other devices, make a location fix possible, I have to release location determination on my own device *across the board* for all other apps? But from Android 8 onward not any more? If I want to be taken for a ride, I can really manage that better on my own ....

See also:

https://android.stackexchange.com/questions/160479/why-do-i-need-to-turn-on-location-services-to-pair-with-a-bluetooth-device

https://www.t-online.de/digital/id_88069644/corona-warn-app-android-standort-freigeben-so-verhindern-sie-google-tracking.html

https://www.smartdroid.de/corona-warn-app-standortermittlung-kommt-von-google-ein-erklaerungsversuch/

Well, Google, why was I never made aware of this before when using Bluetooth LE? However you turn it: there is no logical justification for being forced to release location determination (incl. GPS use) for all other apps just because Bluetooth may make a location fix possible.

**Point 3:** Still, it must be noted: yes, it is true, someone (another Bluetooth user or another Bluetooth device) can indirectly determine your phone's location via Bluetooth and, e.g., Wifi. Among other things via the beacon functionality. Google, too, could presumably always determine your location (with Bluetooth activated on your device) via third parties nearby with activated Bluetooth, GPS or WiFi.

**Point 4:** The question remains why Google has coupled two actually completely independent functions (namely Bluetooth LE and location determination) so tightly since Android 6. It could have been done differently: you activate Bluetooth and receive a warning that this also makes location fixes possible - together with a description of the possible mechanisms. But that is not how it was set up ... If, however, you consider reasons beyond purely technical ones for the coupling, the whole thing suddenly makes a lot of sense; not technical sense, though, but economic sense. Coincidence?

**Point 5:** Bluetooth and Bluetooth LE ... Do I as a user actually have control over which app uses what exactly? And whether an app does not, if needed, activate the location release in the background upon some query? If I switch on (normal) Bluetooth without explicitly activating location release and GPS, I currently see 10 devices in our apartment building with activated Bluetooth. For the Android layman there are thus four ways to assess the situation:

(a) Either activating location services is not technically required at all for Bluetooth and the detection of other devices (contrary to Google's own statement for Bluetooth LE ...). Consistent with this variant: the Bluetooth settings offer the option of actively making yourself visible.

(b) Or the location release is activated in the background - without me as a user being informed.

(c) A third variant is that the location release is used for actively scanning for other Bluetooth devices only in the case of "Bluetooth **Low Energy**" [BLE], as used for the Corona app - but is technically unavoidable in that case.

(d) I prefer not to think further about a considerably worse variant.

Dear reader: can we agree that, without deeper technical insight into Android, this is at the very least an extremely dubious state of affairs? One that now indirectly benefits Google considerably?

**Point 6:** Now, the Corona app itself does not use the location data that could in principle be determined. Here I trust those who have (hopefully) already combed through the app's code in detail. But what might Google do by other means with the activated "coarse" location determination? From discussions in newspapers (e.g. the SZ, Zeit or FAZ) one learns no more than Google's *assurance* that it does not collect the data in the context of the Corona app. Should I believe that? Hmm, at least it is subject to criminal penalties ....

**Point 7:** But wait: the Corona app's own notice contained that small closing sentence: "*Other apps* with permission to determine the location have access to the device location." Apparently those other apps may then also use the location data. After all, the user has explicitly released location determination. To protect himself against Corona ...

Aha! Who might be pleased? Google! 10 million downloads => 10 million blanket activations of location services for apps, delivered free of charge - through activation of the Corona app! No, the location determination is not done by the Corona app itself - but (potentially) by other apps which, via the activation of the Corona app, have now been granted use of location services. That is so ingenious that one almost becomes envious.

**Point 8:** Now some smart alecks will say you can restrict the rights of the (many) other apps via the "application manager". True. You can. Unfortunately, the Corona app does not point this out at all. And have you, smart aleck, actually ever done that? Opened every application in the application manager and restricted its rights? It is, by the way, a very instructive exercise that I urgently recommend to every Android user ... For the Corona app itself, incidentally, the only permission you can switch off this way is the use of the camera (used when transmitting a test result) ...

**Point 9:** A little digging shows: a few settings for the Corona app can also be made in the context of the Google settings (open the Android "Settings" application and then "Google"!). There, by the way, you also learn definitively that the notification about contact with infected persons has been deactivated if you have meanwhile - for whatever reason - switched off the location release. And there my brow furrows again, considerably. The Corona app itself does not show you this; it simply keeps ticking along. A hint that the developers assumed the release of location services would never be deactivated again?

I switched off location services last Wednesday afternoon. At that time an English notice about the deactivation of the infection-risk notification still appeared. The app itself reported nothing at all about its sorry state. It kept running for at least three days, told me every day that it was active (contrary to the statement under my Google settings) and that my risk was still low. Today - Monday morning - it then told me that no risk could be determined because the app had been unable to "update" for 3 days. ????

Test: release location services again. Corona notification reactivated in the app. App runs => low risk. Location determination explicitly deactivated => Corona app keeps running as if nothing had happened => "Low risk. Active on 6 of 14 days!". But the message under the Google settings: this function (meaning the risk notification) is deactivated!

I would call that an inconsistency at the very least ... actually it is a massive **bug**.

Asking other users of the Corona app revealed, by the way, that some of those questioned had indeed simply deactivated location services again: namely as soon as they discovered by chance that they were active. Two thought they had probably forgotten to switch location services off again after using Google Maps. They had ignored the message, which was not fully readable anyway. Nor could they directly attribute it to the Corona app. After all, the app was still "running" and showing a low risk .... Dear me ...

I think the whole thing is a difficult, even nasty mess. Initially, this is not the fault of the Corona app itself and certainly not of our government. It lies solely with Google - and its coupling of the use of Bluetooth LE to the release of location services - for all other apps. This coupling clearly plays into Google's hands economically, and not just now; it massively supports Google's business model.

Yes, after installing the Corona app you can configure the rights of all other apps. But honestly: who does that? The average user does not even know where and how to manage access rights. Unfortunately, the Corona app does not draw attention to this either.

Add to this that the Corona app displays or reports nothing when the supposedly indispensable release of location services is switched off afterwards. That is either negligent - or the location release is, contrary to all assurances, not needed after all. Neither alternative is good.

What would I have expected?

- First of all, clear and unambiguous information about the required activation of location services ahead of the app's release - from the government and also from Google. Later, clear information from the app itself. Plus information for those interested as to why two technically distinct functions were coupled at all from Android 6 onward.
- I would have expected the blanket activation of location services to be indicated by a distinct symbol in Android's upper status bar in the course of activating the Corona app. Ideally in a signal color.
- I would have expected detailed information that, after installing and activating the Corona app, one must readjust the access rights of other apps - and, for the average consumer, information on how to do that.
- I would fundamentally have expected Google to offer a technical solution that allows a blanket release of location services to be avoided. That has long been overdue. Why did the client (= the government) not exert more pressure here?
- I would at least have expected the Corona app to indicate that it no longer works properly when I switch off location services for whatever reason.
- I would at the very least have expected an assessment from the CCC (Chaos Computer Club) of the blanket release of location services for other apps under Android. Instead: https://www.zdf.de/nachrichten/politik/corona-app-launch-100.html. Certainly much was exemplary ... but open source alone is not sufficient as a seal of quality ...

**Conclusion:** I am not satisfied with the app. What remains is the stale feeling that Google has once again taken us all for a ride, even if the Corona app itself is sensible. My lesson from all this: it is high time to develop an alternative to Android at the European level. But I will probably not live to see it ...

Have I switched the Corona app off now? No. But I have further reduced the rights of other apps - with interesting consequences. Google has now sent me a friendly mail asking me to please complete my Google account settings for the use of personalized advertising ... Shame on him who thinks evil of it ....

A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test

A simple CNN for the MNIST datasets – I – CNN basics

we saw that we could easily create a *Convolutional Neural Network* [CNN] with Keras and apply it to the MNIST dataset. But this was only a quick win. Readers who followed me through the MLP-oriented article series may have noticed that we are still far from having full control over the variety of things we would use to optimize the classification algorithm. After our experience with MLPs you may rightfully argue that

- both a systematic reduction of the learning rate in the vicinity of a (hopefully) global minimum of the multidimensional loss function hyperplane in the {weights, costs}-space
- and the use of a L2- or L1-regularizer

could help to squeeze out the last percentage of accuracy of a CNN. Actually, in our first test run we did not even care about which learning rate the RMSProp algorithm used or what its value(s) may have been. A learning rate "lr" has not been a parameter of our CNN setup so far. The same is true for the use and parameterization of a regularizer.

I find it interesting that some of the introductory books on "Deep Learning" with Keras do not discuss the handling of *learning rates* beyond providing a fixed (!) rate parameter on rare occasions to a SGD- or RMSProp-optimizer. Folks out there seem to rely more on the type and basic settings of an optimizer than on a variable learning rate. Within the list of books I appended to the last article, only the book of A. Geron gives you some really useful hints. So, we better find information within the online documentation of the Keras framework to get control over learning rates and regularizers.

Against my original plans, the implementation of schedulers for the learning rate and the use of an L2-regularizer will be the main topic of this article. But I shall - as a side-step - also cover the use of "*callbacks*" for some *interactive* plotting during a training run. This is a bit more interesting than just plotting key data (loss, accuracy) after training - as we did for MLPs.

A systematic reduction of the learning rate or the momentum can be handled with Keras via the parameterization of a *scheduler*. Such a scheduler can e.g. be invoked by a chosen optimizer. As the optimizer is handed over to our CNN when we compile the model, this is also the natural place to provide a scheduler.

The optimizer for the gradient calculation directly accepts parameters for an (initial and constant) learning rate (parameter "learning_rate") and a momentum (parameter "momentum"). An example with the SGD-optimizer would be:

```python
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.5)
cnn.compile(optimizer=optimizer, ...
```

If you just provide a parameter "learning_rate=0.001" then the optimizer will use a **constant** learning rate. However, if we provide a scheduler object - an instance of one of the scheduler classes Keras offers - the learning rate will be varied systematically during training. See:

https://keras.io/api/optimizers/learning_rate_schedules/

A simple scheduler which allows for the "power scheduling" which we implemented for our MLPs is the scheduler "**InverseTimeDecay**". It can be found under the library branch **optimizers.schedules**. We take this scheduler as an example in this article.

But how do we deliver the scheduler object to the optimizer? With tensorflow.keras and TF > 2.0 this is simply done by providing it as the parameter (object) for the learning_rate of the optimizer, as e.g. in the following code example:

```python
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import schedules
...
scheduled_lr_rate = schedules.InverseTimeDecay(lr_init, lr_decay_steps, lr_decay_rate)
optimizer = optimizers.RMSprop(learning_rate=scheduled_lr_rate, momentum=my_momentum)
...
```

Here "lr_init" defines an initial learning rate (e.g. 0.001), "lr_decay_steps" a number of steps after which the rate is changed, and "lr_decay_rate" a decay rate to be applied in the scheduler's formula. Note that different schedulers take different parameters - so look them up before applying them. Also check that your optimizer accepts a scheduler object as a parameter ...

The scheduler instance works like a function returning an output - a learning rate - for an input which corresponds to a number of *steps*:

**learning_rate = scheduler(steps)**

Now what do we mean by "steps"? Epochs? No; a *step* here practically corresponds to one iteration over the *batches*: each optimization "step" in the sense of a weight adjustment after a gradient analysis for a mini-batch also counts as a step for the scheduler. If you want to change the learning rate on the level of epochs only, then you must either adjust the "decay_steps" parameter accordingly or write a so-called "callback" function which is invoked after each epoch and controls the learning rate yourself.
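To make the relation between steps and epochs concrete, here is a small pure-Python sketch. The sample count, batch size and decay parameters below are merely illustrative assumptions; the formula is the "InverseTimeDecay" formula as documented by Keras:

```python
# One scheduler "step" = one mini-batch weight update, not one epoch.
num_samples = 60000   # assumed size of the training set
batch_size  = 64      # assumed mini-batch size
steps_per_epoch = -(-num_samples // batch_size)  # ceiling division => 938 steps

# InverseTimeDecay: lr = lr_init / (1 + decay_rate * step / decay_steps)
lr_init, decay_steps, decay_rate = 0.001, 1, 0.001
lr_after_one_epoch = lr_init / (1.0 + decay_rate * steps_per_epoch / decay_steps)
```

With these assumed parameters the learning rate would drop from 0.001 to about 0.00052 within a single epoch - a hint that "decay_steps" and "decay_rate" must be chosen with the number of batches per epoch in mind.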

**Momentum**

Note, by the way, that in addition to the scheduler we also provided a value for the *momentum* parameter, which an optimizer may use during the gradient evolution via some specific adaptive formula. How the real weight changes are calculated from gradient development and momentum parameters is of course specific to each optimizer.

We use the Jupyter cells of the code we built in the last article as far as possible. We need to perform some more imports to work with schedulers, regularizers and plotting. You can exchange the contents of the first Jupyter cell with the following statements:

**Jupyter Cell 1:**

```python
import numpy as np
import scipy
import time
import sys

from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import optimizers
from tensorflow.keras.optimizers import schedules
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist
from tensorflow.python.client import device_lib

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat

import os
```

Then we have two unchanged cells:

**Jupyter Cell 2:**

```python
#gpu = False
gpu = True
if gpu:
    GPU = True;  CPU = False; num_GPU = 1; num_CPU = 1
else:
    GPU = False; CPU = True;  num_CPU = 1; num_GPU = 0

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                                  inter_op_parallelism_threads=1,
                                  allow_soft_placement=True,
                                  device_count={'CPU': num_CPU, 'GPU': num_GPU},
                                  log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.35
config.gpu_options.force_gpu_compatible = True
B.set_session(tf.compat.v1.Session(config=config))
```

and

**Jupyter Cell 3:**

```python
# load MNIST
# **********
#@tf.function
def load_Mnist():
    mnist = K.datasets.mnist
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    #print(X_train.shape)
    #print(X_test.shape)

    # preprocess - flatten
    len_train = X_train.shape[0]
    len_test  = X_test.shape[0]
    X_train = X_train.reshape(len_train, 28*28)
    X_test  = X_test.reshape(len_test, 28*28)

    # concatenate
    _X = np.concatenate((X_train, X_test), axis=0)
    _y = np.concatenate((y_train, y_test), axis=0)
    _dim_X = _X.shape[0]

    # 32-bit
    _X = _X.astype(np.float32)
    _y = _y.astype(np.int32)

    # normalize
    scaler = StandardScaler()
    _X = scaler.fit_transform(_X)

    # mixing the training indices - MUST happen BEFORE encoding
    shuffled_index = np.random.permutation(_dim_X)
    _X, _y = _X[shuffled_index], _y[shuffled_index]

    # split again
    num_test  = 10000
    num_train = _dim_X - num_test
    X_train, X_test, y_train, y_test = _X[:num_train], _X[num_train:], _y[:num_train], _y[num_train:]

    # reshape to Keras tensor requirements
    train_imgs = X_train.reshape((num_train, 28, 28, 1))
    test_imgs  = X_test.reshape((num_test, 28, 28, 1))
    #print(train_imgs.shape)
    #print(test_imgs.shape)

    # one-hot-encoding
    train_labels = to_categorical(y_train)
    test_labels  = to_categorical(y_test)
    #print(test_labels[4])

    return train_imgs, test_imgs, train_labels, test_labels


if gpu:
    with tf.device("/GPU:0"):
        train_imgs, test_imgs, train_labels, test_labels = load_Mnist()
else:
    with tf.device("/CPU:0"):
        train_imgs, test_imgs, train_labels, test_labels = load_Mnist()
```

A "regularizer" modifies the loss function by adding a sum of contributions, each delivered by a (common) function of the weights of a layer. In Keras this sum is split up into the contributions of the different layers, which we define via the "model.add()"-function. You can build layers with or without regularization contributions. Already for our present simple CNN case this is very useful:
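As a plain-Python illustration (this is not the actual Keras implementation, just the quantity it computes), the contribution an "l2"-regularizer with parameter 0.01 adds to the loss for one layer is:

```python
def l2_penalty(weights, reg_param=0.01):
    # one layer's additive contribution to the total loss:
    # reg_param times the sum of the squared weights of that layer
    return reg_param * sum(w * w for w in weights)

# total loss = data loss + sum of the penalties of all regularized layers
penalty = l2_penalty([0.5, -1.0, 2.0])  # 0.01 * (0.25 + 1.0 + 4.0) = 0.0525
```

An "l1"-regularizer works the same way, but sums the absolute values of the weights instead of their squares.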

In a first trial we only want to add weight regularization to the hidden and the output layers of the concluding MLP-part of our CNN. To do this we have to provide a parameter "**kernel_regularizer**" to "model.add()", which specifies the type of regularizer to use. We restrict the regularizers in our example to an "l2"- and an "l1"-regularizer, for which Keras provides predefined functions/objects. This leads us to the following change of the function "build_cnn_simple()", which we used in the last article:

**Jupyter Cell 4:**

```python
# Sequential layer model of our CNN
# ***********************************

# just for illustration - the real parameters are fed later
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
li_conv_1  = [32, (3,3), 0]
li_conv_2  = [64, (3,3), 0]
li_conv_3  = [64, (3,3), 0]
li_Conv    = [li_conv_1, li_conv_2, li_conv_3]
li_pool_1  = [(2,2)]
li_pool_2  = [(2,2)]
li_Pool    = [li_pool_1, li_pool_2]
li_dense_1 = [70, 0]
li_dense_2 = [10, 0]
li_MLP     = [li_dense_1, li_dense_2]
input_shape = (28,28,1)

# important !!
# ~~~~~~~~~~~
cnn = None
x_optimizers = None

def build_cnn_simple(li_Conv, li_Pool, li_MLP, input_shape,
                     my_regularizer=None,
                     my_reg_param_l2=0.01, my_reg_param_l1=0.01):

    use_regularizer = True
    if my_regularizer == None:
        use_regularizer = False

    # activation functions to be used in Conv-layers
    li_conv_act_funcs = ['relu', 'sigmoid', 'elu', 'tanh']
    # activation functions to be used in MLP hidden layers
    li_mlp_h_act_funcs = ['relu', 'sigmoid', 'tanh']
    # activation functions to be used in MLP output layers
    li_mlp_o_act_funcs = ['softmax', 'sigmoid']

    # dictionary for regularizer functions
    d_reg = {
        'l2': regularizers.l2,
        'l1': regularizers.l1
    }
    if use_regularizer:
        if my_regularizer not in d_reg:
            print("regularizer " + my_regularizer + " not known!")
            sys.exit()
        else:
            regul = d_reg[my_regularizer]
        if my_regularizer == 'l2':
            reg_param = my_reg_param_l2
        elif my_regularizer == 'l1':
            reg_param = my_reg_param_l1

    # Build the Conv part of the CNN
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    num_conv_layers = len(li_Conv)
    num_pool_layers = len(li_Pool)
    if num_pool_layers != num_conv_layers - 1:
        print("\nNumber of pool layers does not fit to number of Conv-layers")
        sys.exit()
    rg_il = range(num_conv_layers)

    # Define a sequential model
    # ~~~~~~~~~~~~~~~~~~~~~~~~~
    cnn = models.Sequential()

    for il in rg_il:
        # add the convolutional layer
        num_filters  = li_Conv[il][0]
        t_fkern_size = li_Conv[il][1]
        cact         = li_conv_act_funcs[li_Conv[il][2]]
        if il == 0:
            cnn.add(layers.Conv2D(num_filters, t_fkern_size,
                                  activation=cact, input_shape=input_shape))
        else:
            cnn.add(layers.Conv2D(num_filters, t_fkern_size, activation=cact))

        # add the pooling layer
        if il < num_pool_layers:
            t_pkern_size = li_Pool[il][0]
            cnn.add(layers.MaxPooling2D(t_pkern_size))

    # Build the MLP part of the CNN
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    num_mlp_layers = len(li_MLP)
    rg_im = range(num_mlp_layers)

    cnn.add(layers.Flatten())

    for im in rg_im:
        # add the dense layer
        n_nodes = li_MLP[im][0]
        if im < num_mlp_layers - 1:
            m_act = li_mlp_h_act_funcs[li_MLP[im][1]]
        else:
            m_act = li_mlp_o_act_funcs[li_MLP[im][1]]
        if use_regularizer:
            cnn.add(layers.Dense(n_nodes, activation=m_act,
                                 kernel_regularizer=regul(reg_param)))
        else:
            cnn.add(layers.Dense(n_nodes, activation=m_act))

    return cnn
```

I have used a dictionary to enable an indirect call of the regularizer object. The invocation of a regularizer happens in the statements:

```python
cnn.add(layers.Dense(n_nodes, activation=m_act, kernel_regularizer=regul(reg_param)))
```

We only use a regularizer for MLP-layers in our example.

Note that I used *predefined* regularizer objects above. If you want to use a regularizer defined and customized by yourself, you can either define a simple regularizer function (which accepts the weights as a parameter) or define a child class of the "keras.regularizers.Regularizer"-class. You find hints on how to do this at https://keras.io/api/layers/regularizers/ and in the book of Aurelien Geron (see the literature list at the end of the last article).

Although I have limited the use of a regularizer in the code above to the MLP layers you are of course free to apply a regularizer also to convolutional layers. (The book of A. Geron gives some hints how to avoid code redundancies in chapter 11).

As we saw above a scheduler for the learning rate can be provided as a parameter to the optimizer of a CNN. As the optimizer objects are invoked via the function "model.compile()", we prepare our *own* function "**my_compile()**" with an appropriate interface, which then calls "model.compile()" with the necessary parameters. For indirect calls of predefined scheduler objects we again use a dictionary:

**Jupyter Cell 5:**

```python
# compilation function for further parameters
def my_compile(cnn,
               my_loss='categorical_crossentropy', my_metrics=['accuracy'],
               my_optimizer='rmsprop', my_momentum=0.0,
               my_lr_sched='powerSched',
               my_lr_init=0.01, my_lr_decay_steps=1, my_lr_decay_rate=0.00001):

    # dictionary for the indirect call of optimizers
    d_optimizer = {
        'rmsprop': optimizers.RMSprop,
        'nadam':   optimizers.Nadam,
        'adamax':  optimizers.Adamax
    }
    if my_optimizer not in d_optimizer:
        print("my_optimizer " + my_optimizer + " not known!")
        sys.exit()
    else:
        optim = d_optimizer[my_optimizer]

    use_scheduler = True
    if my_lr_sched == None:
        use_scheduler = False
        print("\n No scheduler will be used")

    # dictionary for the indirect call of scheduler functions
    d_sched = {
        'powerSched':  schedules.InverseTimeDecay,
        'exponential': schedules.ExponentialDecay
    }
    if use_scheduler:
        if my_lr_sched not in d_sched:
            print("lr scheduler " + my_lr_sched + " not known!")
            sys.exit()
        else:
            sched = d_sched[my_lr_sched]

    if use_scheduler:
        print("\n lr_init = ", my_lr_init,
              " decay_steps = ", my_lr_decay_steps,
              " decay_rate = ", my_lr_decay_rate)
        scheduled_lr_rate = sched(my_lr_init, my_lr_decay_steps, my_lr_decay_rate)
        optimizer = optim(learning_rate=scheduled_lr_rate, momentum=my_momentum)
    else:
        print("\n lr_init = ", my_lr_init)
        # use the chosen optimizer also for a constant learning rate
        optimizer = optim(learning_rate=my_lr_init, momentum=my_momentum)

    cnn.compile(optimizer=optimizer, loss=my_loss, metrics=my_metrics)
    return cnn
```

You see that, for the time being, I only offered the choice between the predefined schedulers "**InverseTimeDecay**" and "**ExponentialDecay**". If no scheduler name is provided then only a constant learning rate is delivered to the chosen optimizer.
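To my understanding of the Keras documentation, the two schedulers compute the learning rate per optimizer step roughly as follows (a plain-Python sketch of the documented formulas in their continuous, non-staircase variants, not the Keras implementation itself):

```python
# Decay formulas behind the two schedulers, sketched in plain Python
# (continuous variants, as documented for keras.optimizers.schedules).
def inverse_time_decay(lr_init, step, decay_steps, decay_rate):
    # 'powerSched' in our dictionary: lr falls off like 1 / (1 + r*t)
    return lr_init / (1.0 + decay_rate * step / decay_steps)

def exponential_decay(lr_init, step, decay_steps, decay_rate):
    # 'exponential': lr falls off geometrically
    return lr_init * decay_rate ** (step / decay_steps)

# With my_lr_init=0.001, decay_steps=1, decay_rate=0.001 (values used below),
# InverseTimeDecay halves the learning rate after 1000 optimizer steps:
lr_1000 = inverse_time_decay(0.001, 1000, 1, 0.001)   # 0.0005
```

The "step" here is the optimizer's iteration counter, i.e. the number of mini-batches treated so far.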

You find some basic information on optimizers and schedulers here:

https://keras.io/api/optimizers/

https://keras.io/api/optimizers/learning_rate_schedules/

Note that you can build your own scheduler object by defining a child class of "keras.callbacks.LearningRateScheduler" and providing it within a list of other callbacks to the "**model.fit()**"-function; see

https://keras.io/api/callbacks/learning_rate_scheduler/

or the book of Aurelien Geron for more information.

As I am interested in the changes of the learning rate with the steps during an epoch I want to print them after each epoch during training. In addition I want to plot key data produced during the training of a CNN model.

For both purposes we can use a convenient mechanism Keras offers - namely "callbacks", which can either be invoked at the beginning/end of the treatment of a mini-batch or at the beginning/end of each epoch.

Basic information is provided here:

https://keras.io/api/callbacks/

https://keras.io/guides/writing_your_own_callbacks/

It is best to just look at examples to get the basic points; below we invoke two callback objects which provide data for us at the end of each epoch.

**Jupyter Cell 6:**

# Preparing some callbacks 
# **************************

# Callback class to print information on the iteration and the present learning rate 
class LrHistory(K.callbacks.Callback):
    def __init__(self, use_scheduler):
        self.use_scheduler = use_scheduler

    def on_train_begin(self, logs={}):
        self.lr = []

    def on_epoch_end(self, epoch, logs={}):
        optimizer = self.model.optimizer
        iterations = optimizer.iterations
        if self.use_scheduler:
            present_lr = optimizer.lr(iterations)
        else:
            present_lr = optimizer.lr
        self.lr.append(present_lr)
        #print("\npresent lr:", present_lr)
        tf.print("\npresent lr: ", present_lr)
        tf.print("present iteration:", iterations)


# Callback class to support interactive plotting 
class EpochPlot(K.callbacks.Callback):
    def __init__(self, epochs, fig1, ax1_1, ax1_2):
        self.fig1  = fig1
        self.ax1_1 = ax1_1
        self.ax1_2 = ax1_2
        self.epochs = epochs
        rg_i = range(epochs)
        self.lin92 = []
        for i in rg_i:
            self.lin92.append(0.992)

    def on_train_begin(self, logs={}):
        self.train_loss = []
        self.train_acc  = []
        self.val_loss   = []
        self.val_acc    = []

    def on_epoch_end(self, epoch, logs={}):
        if epoch == 0:
            for key in logs.keys():
                print(key)
        self.train_loss.append(logs.get('loss'))
        self.train_acc.append(logs.get('accuracy'))
        self.val_loss.append(logs.get('val_loss'))
        self.val_acc.append(logs.get('val_accuracy'))

        if len(self.train_acc) > 0:
            self.ax1_1.clear()
            self.ax1_1.set_xlim(0, self.epochs+1)
            self.ax1_1.set_ylim(0.97, 1.001)
            self.ax1_1.plot(self.train_acc, 'g')
            self.ax1_1.plot(self.val_acc, 'r')
            self.ax1_1.plot(self.lin92, 'b', linestyle='--')

            self.ax1_2.clear()
            self.ax1_2.set_xlim(0, self.epochs+1)
            self.ax1_2.set_ylim(0.0, 0.2)
            self.ax1_2.plot(self.train_loss, 'g')
            self.ax1_2.plot(self.val_loss, 'r')
            self.fig1.canvas.draw()

From looking at the code we see that a callback can be defined as a child class of a base class "keras.callbacks.Callback" or of some specialized predefined callback classes listed under "https://keras.io/api/callbacks/". They are all useful, but perhaps the classes "ModelCheckpoint", "LearningRateScheduler" (see above) and "EarlyStopping" are of direct practical interest for people who want to move beyond the scope of this article. In the above example I, however, only used the base class "callbacks.Callback".

**Printing steps, iterations and the learning rate**

Let us look at the first callback, which provides a printout of the learning rate. This is a bit intricate: Regarding the learning rate I have already mentioned that a scheduler changes it after each operation on a batch; such a *step* corresponds to a variable "**iterations**" which is counted up by the optimizer during training after the treatment of each mini-batch. [Readers from my series on MLPs remember that gradient descent can be based on a gradient evaluation over the samples of mini-batches - instead of using a gradient evaluation based on *all* samples (batch gradient descent or full gradient descent) or of each individual sample (stochastic gradient descent).]

As we defined a batch size = 64 for the fit()-method of our CNN in the last article the number of optimizer *iterations* (=steps) per epoch is quite big: 60000/64 => 938 (with a smaller last batch).
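This step count per epoch is quickly verified (a trivial sketch; 60000 training samples, batch size 64):

```python
import math

# Number of optimizer iterations (steps) per epoch:
# one step per mini-batch, with a smaller last batch.
train_samples = 60000
batch_size = 64
steps_per_epoch = math.ceil(train_samples / batch_size)   # ceil(937.5) = 938

# Cross-check against the training protocols further below:
# the iteration counter at the end of epoch 13 should read 13 * 938 = 12194.
iters_after_epoch_13 = 13 * steps_per_epoch
```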

Normally, the constant initial learning rate of an optimizer could be retrieved in a callback by a property "lr" as in "self.model.optimizer.lr". However, in our example this attribute now points to a method "schedule()" of the *scheduler* object. (See schedule() method of the LearningRateScheduler class). We must provide the number of "iterations" (= number of steps) as a parameter to this schedule()-function and take the returned result as the present lr-value.

Our own callback class "**LrHistory**(K.callbacks.Callback)" gets three methods which are called at different types of "**events**" (JavaScript developers should feel at home):

- At initialization (**__init__()**) we provide a parameter which defines whether we use a scheduler at all or a constant learning rate.
- At the beginning of the training (**on_train_begin()**) we just instantiate a list, which we later fill with retrieved lr-values; we could use this list for some evaluation, e.g. at the end of the training.
- At the end of each epoch (**on_epoch_end()**) we determine the present learning rate by calling the scheduler (behind the "lr" attribute) - if we used one - for the number of iterations passed.

This explains the following statements:

optimizer = self.model.optimizer
iterations = optimizer.iterations
present_lr = optimizer.lr(iterations)

Otherwise we just print the *constant* learning rate used by the optimizer (e.g. a default value).

Note that we use the "**tf.print()**" method to do the printing and not the Python "print()" function. This is basically done for convenience reasons: We thus avoid a special evaluation of the tensor and a manual transformation to a Numpy value. Do not forget: All data in Keras/TF are basically *tensors* and, therefore, the Python print() function would print extra information besides the expected numerical value!
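To see when these methods fire, here is a stripped-down, framework-free sketch of how a fit-loop drives callback hooks. This is my own mock, not Keras internals; it only illustrates the calling order of on_train_begin() and on_epoch_end():

```python
# Minimal mock of how a training loop drives callbacks like our LrHistory:
# on_train_begin() fires once, on_epoch_end() fires after every epoch.
class MockLrHistory:
    def on_train_begin(self):
        self.lr = []
    def on_epoch_end(self, epoch, logs):
        self.lr.append(logs['lr'])

def mock_fit(epochs, callbacks):
    for cb in callbacks:
        cb.on_train_begin()
    for epoch in range(epochs):
        # ... here Keras would loop over all mini-batches of the epoch ...
        logs = {'lr': 0.001 / (1 + epoch)}   # fake scheduled learning rate
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs)

history_cb = MockLrHistory()
mock_fit(3, [history_cb])
# history_cb.lr now holds one lr value per epoch
```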

With our MLP we never used interactive plots in our Jupyter notebooks. We always plotted *after* training, based on arrays or lists which we filled during the training iterations with intermediately achieved values of accuracy and loss. But seeing the evolution of key metric data during training is a bit more rewarding: we get a feeling for what happens and can stop a training run with problematic hyperparameters before we waste too much valuable GPU or CPU time. So, how do we update plots during training in our Jupyter environment?

The first point is that we need to prepare the environment of our Jupyter notebook; we do this before we start the training run. Below you find the initial statements of the respective Jupyter cell; the really important statement is the call of the ion-function: "plt.ion()". For some more information see

https://linux-blog.anracom.com/2019/12/26/matplotlib-jupyter-and-updating-multiple-interactive-plots/

and the links at the end of this article.

# Perform a training run 
# ********************
# Prepare the plotting 
# The really important commands for interactive (=intermediate) plot updating 
%matplotlib notebook
plt.ion()

#sizing
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 8
fig_size[1] = 3

# One figure 
# -----------
fig1 = plt.figure(1)
#fig2 = plt.figure(2)

# first figure with two plot-areas with axes 
# --------------------------------------------
ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)
fig1.canvas.draw()

The interesting statements are the first two. The statements for setting up the plot figures should be familiar to my readers already. The statement "fig1.canvas.draw()" leads to the display of two basic coordinate systems at the beginning of the run.

The real plotting is, however, done with the help of our second callback "**EpochPlot()**" (see the code given above for Jupyter cell 6). We initialize our class with the number of epochs for which our training run shall be performed. We also provide the addresses of the plot figures and their two coordinate systems (ax1_1 and ax1_2). At the beginning of the training we provide empty lists for some key data which we later (after each epoch) fill with calculated values.

Which key or metric data can the training loop of Keras provide us with?

Keras fills a dictionary "**logs**" with metric data and loss values. Which metric data appear depends on the loss function. At the following link you find a discussion of the metrics you can use with respect to different kinds of ML problems: https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/.

For categorical_crossentropy and our metrics list ['accuracy'] we get the following keys

loss, accuracy, val_loss, val_accuracy

as we also provided validation data. We print this information out at epoch == 0. (Note, however, that what is counted internally as epoch 0 is printed out elsewhere as epoch 1. So, the counting inside Keras differs a bit from what is displayed.)

Then have a look at the parameterization of the method **on_epoch_end(self, epoch, logs={})**. The default value logs={} should not disturb you; the dictionary is only empty at the very beginning, later on a filled logs-dictionary is provided automatically.

As you see from the code, we retrieve the data from the logs-dictionary at the end of each epoch. The rest are plain Matplotlib commands: the "clear()"-statements empty the coordinate areas, and the statement "self.fig1.canvas.draw()" triggers a redrawing.

To do things properly we now need to extend our training function which invokes the other already defined functions as needed:

**Jupyter Cell 7:**

# Training 2 - with test data integrated 
# *****************************************
def train(cnn, build, train_imgs, train_labels, 
          li_Conv, li_Pool, li_MLP, input_shape, 
          reset=True, epochs=5, batch_size=64, 
          my_loss='categorical_crossentropy', my_metrics=['accuracy'], 
          my_regularizer=None, my_reg_param_l2=0.01, my_reg_param_l1=0.01, 
          my_optimizer='rmsprop', my_momentum=0.0, 
          my_lr_sched=None, my_lr_init=0.001, 
          my_lr_decay_steps=1, my_lr_decay_rate=0.00001, 
          fig1=None, ax1_1=None, ax1_2=None):

    if build:
        # build cnn layers - now with regularizer - 200603 by ralph
        cnn = build_cnn_simple(li_Conv, li_Pool, li_MLP, input_shape, 
                               my_regularizer=my_regularizer, 
                               my_reg_param_l2=my_reg_param_l2, 
                               my_reg_param_l1=my_reg_param_l1)

        # compile - now with lr_scheduler - 200603 by ralph
        cnn = my_compile(cnn=cnn, 
                         my_loss=my_loss, my_metrics=my_metrics, 
                         my_optimizer=my_optimizer, my_momentum=my_momentum, 
                         my_lr_sched=my_lr_sched, my_lr_init=my_lr_init, 
                         my_lr_decay_steps=my_lr_decay_steps, 
                         my_lr_decay_rate=my_lr_decay_rate)

        # save the initial (!) weights to be able to restore them
        cnn.save_weights('cnn_weights.h5')   # save the initial weights

    # reset weights (standard)
    if reset and not build:
        cnn.load_weights('cnn_weights.h5')

    # Callback list 
    # ~~~~~~~~~~~~~
    use_scheduler = True
    if my_lr_sched is None:
        use_scheduler = False
    lr_history = LrHistory(use_scheduler)
    callbacks_list = [lr_history]
    if fig1 is not None:
        epoch_plot = EpochPlot(epochs, fig1, ax1_1, ax1_2)
        callbacks_list.append(epoch_plot)

    start_t = time.perf_counter()
    if reset:
        history = cnn.fit(train_imgs, train_labels, 
                          initial_epoch=0, epochs=epochs, batch_size=batch_size, 
                          verbose=1, shuffle=True, 
                          validation_data=(test_imgs, test_labels), 
                          callbacks=callbacks_list)
    else:
        history = cnn.fit(train_imgs, train_labels, 
                          epochs=epochs, batch_size=batch_size, 
                          verbose=1, shuffle=True, 
                          validation_data=(test_imgs, test_labels), 
                          callbacks=callbacks_list)
    end_t = time.perf_counter()
    fit_t = end_t - start_t

    x_optimizer = cnn.optimizer   # expose the optimizer for later inspection
    # we return cnn to be able to use it in other functions of the Jupyter notebook later on
    return cnn, fit_t, history, x_optimizer

Note how big our interface has become; there are a lot of parameters for the control of our training. Our set of configuration parameters is now very similar to what we used for MLP training runs before.

Note also how we provided a list of our callbacks to the "model.fit()"-function.

Now we are only a small step away from testing our modified CNN setup. We need just one further cell:

**Jupyter Cell 8:**

# Perform a training run 
# ********************
# Prepare the plotting 
# The really important commands for interactive (=intermediate) plot updating 
%matplotlib notebook
plt.ion()

#sizing
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 8
fig_size[1] = 3

# One figure 
# -----------
fig1 = plt.figure(1)
#fig2 = plt.figure(2)

# first figure with two plot-areas with axes 
# --------------------------------------------
ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)
fig1.canvas.draw()

# second figure with just one plot area with axes 
# -------------------------------------------------
#ax2 = fig2.add_subplot(121)
#ax2_1 = fig2.add_subplot(121)
#ax2_2 = fig2.add_subplot(122)
#fig2.canvas.draw()

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Parameterization of the training run
build = False
build = True
if cnn == None:
    build = True
x_optimizer = None

batch_size = 64
epochs = 80
reset = False   # we want training to start again with the initial weights

my_loss    = 'categorical_crossentropy'
my_metrics = ['accuracy']

my_regularizer = None
my_regularizer = 'l2'
my_reg_param_l2 = 0.0009
my_reg_param_l1 = 0.01

my_optimizer = 'rmsprop'    # Present alternatives: rmsprop, nadam, adamax
my_momentum  = 0.6          # momentum value

my_lr_sched = 'powerSched'  # Present alternatives: None, powerSched, exponential
#my_lr_sched = None         # Present alternatives: None, powerSched, exponential
my_lr_init = 0.001          # initial learning rate
my_lr_decay_steps = 1       # decay steps = 1
my_lr_decay_rate  = 0.001   # decay rate

li_conv_1 = [64, (3,3), 0]
li_conv_2 = [64, (3,3), 0]
li_conv_3 = [128, (3,3), 0]
li_Conv = [li_conv_1, li_conv_2, li_conv_3]
li_pool_1 = [(2,2)]
li_pool_2 = [(2,2)]
li_Pool = [li_pool_1, li_pool_2]
li_dense_1 = [120, 0]
#li_dense_2 = [30, 0]
li_dense_3 = [10, 0]
li_MLP = [li_dense_1, li_dense_3]
input_shape = (28,28,1)

try:
    if gpu:
        with tf.device("/GPU:0"):
            cnn, fit_time, history, x_optimizer = train(cnn, build, 
                 train_imgs, train_labels, 
                 li_Conv, li_Pool, li_MLP, input_shape, 
                 reset, epochs, batch_size, 
                 my_loss=my_loss, my_metrics=my_metrics, 
                 my_regularizer=my_regularizer, 
                 my_reg_param_l2=my_reg_param_l2, my_reg_param_l1=my_reg_param_l1, 
                 my_optimizer=my_optimizer, my_momentum=my_momentum, 
                 my_lr_sched=my_lr_sched, my_lr_init=my_lr_init, 
                 my_lr_decay_steps=my_lr_decay_steps, my_lr_decay_rate=my_lr_decay_rate, 
                 fig1=fig1, ax1_1=ax1_1, ax1_2=ax1_2)
        print('Time_GPU: ', fit_time)
    else:
        with tf.device("/CPU:0"):
            cnn, fit_time, history, x_optimizer = train(cnn, build, 
                 train_imgs, train_labels, 
                 li_Conv, li_Pool, li_MLP, input_shape, 
                 reset, epochs, batch_size, 
                 my_loss=my_loss, my_metrics=my_metrics, 
                 my_regularizer=my_regularizer, 
                 my_reg_param_l2=my_reg_param_l2, my_reg_param_l1=my_reg_param_l1, 
                 my_optimizer=my_optimizer, my_momentum=my_momentum, 
                 my_lr_sched=my_lr_sched, my_lr_init=my_lr_init, 
                 my_lr_decay_steps=my_lr_decay_steps, my_lr_decay_rate=my_lr_decay_rate, 
                 fig1=fig1, ax1_1=ax1_1, ax1_2=ax1_2)
        print('Time_CPU: ', fit_time)
except SystemExit:
    print("stopped due to exception")

Below I show screenshots taken during training for the parameters defined above.

The left plot shows the achieved accuracy; red for the validation set (around 99.32%). The right plot shows the decline of the loss function relative to the original value. The green lines are for the training data, the red for the validation data.

Besides the effect of seeing the data change and the plots evolve during training, we can also take the result with us that we have already improved the accuracy from 99.0% to **99.32%**. When you play around with the available hyperparameters a bit you may find that 99.25% is a reproducible accuracy. But in some cases you may reach 99.4% as a best value.

The following plots have best values around 99.4% with an average above 99.35%.

Epoch 13/80
935/938 [============================>.] - ETA: 0s - loss: 0.0096 - accuracy: 0.9986
present lr:  7.57920279e-05
present iteration: 12194
938/938 [==============================] - 6s 6ms/step - loss: 0.0096 - accuracy: 0.9986 - val_loss: 0.0238 - val_accuracy: 0.9944

Epoch 19/80
937/938 [============================>.] - ETA: 0s - loss: 0.0064 - accuracy: 0.9993
present lr:  5.3129319e-05
present iteration: 17822
938/938 [==============================] - 6s 6ms/step - loss: 0.0064 - accuracy: 0.9993 - val_loss: 0.0245 - val_accuracy: 0.9942

Epoch 22/80
930/938 [============================>.] - ETA: 0s - loss: 0.0056 - accuracy: 0.9994
present lr:  4.6219262e-05
present iteration: 20636
938/938 [==============================] - 6s 6ms/step - loss: 0.0056 - accuracy: 0.9994 - val_loss: 0.0238 - val_accuracy: 0.9942

Epoch 30/80
937/938 [============================>.] - ETA: 0s - loss: 0.0043 - accuracy: 0.9996
present lr:  3.43170905e-05
present iteration: 28140
938/938 [==============================] - 6s 6ms/step - loss: 0.0043 - accuracy: 0.9997 - val_loss: 0.0239 - val_accuracy: 0.9941

Epoch 55/80
937/938 [============================>.] - ETA: 0s - loss: 0.0028 - accuracy: 0.9998
present lr:  1.90150222e-05
present iteration: 51590
938/938 [==============================] - 6s 6ms/step - loss: 0.0028 - accuracy: 0.9998 - val_loss: 0.0240 - val_accuracy: 0.9940

Epoch 69/80
936/938 [============================>.] - ETA: 0s - loss: 0.0025 - accuracy: 0.9999
present lr:  1.52156063e-05
present iteration: 64722
938/938 [==============================] - 6s 6ms/step - loss: 0.0025 - accuracy: 0.9999 - val_loss: 0.0245 - val_accuracy: 0.9940

Epoch 70/80
933/938 [============================>.] - ETA: 0s - loss: 0.0024 - accuracy: 0.9999
present lr:  1.50015e-05
present iteration: 65660
938/938 [==============================] - 6s 6ms/step - loss: 0.0024 - accuracy: 0.9999 - val_loss: 0.0245 - val_accuracy: 0.9940

Epoch 76/80
937/938 [============================>.] - ETA: 0s - loss: 0.0024 - accuracy: 0.9999
present lr:  1.38335554e-05
present iteration: 71288
938/938 [==============================] - 6s 6ms/step - loss: 0.0024 - accuracy: 0.9999 - val_loss: 0.0237 - val_accuracy: 0.9943

Parameters of the last run were

build = True
if cnn == None:
    build = True
x_optimizer = None

batch_size = 64
epochs = 80
reset = True

my_loss    = 'categorical_crossentropy'
my_metrics = ['accuracy']

my_regularizer = None
my_regularizer = 'l2'
my_reg_param_l2 = 0.0008
my_reg_param_l1 = 0.01

my_optimizer = 'rmsprop'    # Present alternatives: rmsprop, nadam, adamax
my_momentum  = 0.5          # momentum value

my_lr_sched = 'powerSched'  # Present alternatives: None, powerSched, exponential
#my_lr_sched = None         # Present alternatives: None, powerSched, exponential
my_lr_init = 0.001          # initial learning rate
my_lr_decay_steps = 1       # decay steps = 1
my_lr_decay_rate  = 0.001   # decay rate

li_conv_1 = [64, (3,3), 0]
li_conv_2 = [64, (3,3), 0]
li_conv_3 = [128, (3,3), 0]
li_Conv = [li_conv_1, li_conv_2, li_conv_3]
li_pool_1 = [(2,2)]
li_pool_2 = [(2,2)]
li_Pool = [li_pool_1, li_pool_2]
li_dense_1 = [120, 0]
#li_dense_2 = [30, 0]
li_dense_3 = [10, 0]
li_MLP = [li_dense_1, li_dense_3]
input_shape = (28,28,1)

It is interesting that RMSprop only requires small values of the L2-regularizer for a sufficient stabilization - such that the loss curve for the validation data does not rise again substantially. Instead we observe an oscillation around a minimum after the learning rate has decreased sufficiently.

An attentive reader has found out that I have cheated a bit: I have used 64 maps at the first convolutional layer instead of 32 according to the setup in the previous article. Yes, sorry. To compensate for this I include the plot of a run with 32 maps at the first convolution. The blue line marks an accuracy of 99.35%. You see that we can get above it with 32 maps, too.

The parameters were:

build = True
if cnn == None:
    build = True
x_optimizer = None

batch_size = 64
epochs = 80
reset = False   # reset the initial weight values to those saved?

my_loss    = 'categorical_crossentropy'
my_metrics = ['accuracy']

my_regularizer = None
my_regularizer = 'l2'
my_reg_param_l2 = 0.001
#my_reg_param_l2 = 0.01
my_reg_param_l1 = 0.01

my_optimizer = 'rmsprop'    # Present alternatives: rmsprop, nadam, adamax
my_momentum  = 0.9          # momentum value

my_lr_sched = 'powerSched'  # Present alternatives: None, powerSched, exponential
#my_lr_sched = None         # Present alternatives: None, powerSched, exponential
my_lr_init = 0.001          # initial learning rate
my_lr_decay_steps = 1       # decay steps = 1
my_lr_decay_rate  = 0.001   # decay rate

li_conv_1 = [32, (3,3), 0]
li_conv_2 = [64, (3,3), 0]
li_conv_3 = [128, (3,3), 0]
li_Conv = [li_conv_1, li_conv_2, li_conv_3]
li_Conv_Name = ["Conv2D_1", "Conv2D_2", "Conv2D_3"]
li_pool_1 = [(2,2)]
li_pool_2 = [(2,2)]
li_Pool = [li_pool_1, li_pool_2]
li_Pool_Name = ["Max_Pool_1", "Max_Pool_2", "Max_Pool_3"]
li_dense_1 = [120, 0]
#li_dense_2 = [30, 0]
li_dense_3 = [10, 0]
li_MLP = [li_dense_1, li_dense_3]
input_shape = (28,28,1)

Epoch 15/80
926/938 [============================>.] - ETA: 0s - loss: 0.0095 - accuracy: 0.9988
present lr:  6.6357e-05
present iteration: 14070
938/938 [==============================] - 4s 5ms/step - loss: 0.0095 - accuracy: 0.9988 - val_loss: 0.0268 - val_accuracy: 0.9940

Epoch 23/80
935/938 [============================>.] - ETA: 0s - loss: 0.0067 - accuracy: 0.9995
present lr:  4.42987512e-05
present iteration: 21574
938/938 [==============================] - 4s 5ms/step - loss: 0.0066 - accuracy: 0.9995 - val_loss: 0.0254 - val_accuracy: 0.9943

Epoch 35/80
936/938 [============================>.] - ETA: 0s - loss: 0.0049 - accuracy: 0.9996
present lr:  2.95595619e-05
present iteration: 32830
938/938 [==============================] - 4s 5ms/step - loss: 0.0049 - accuracy: 0.9996 - val_loss: 0.0251 - val_accuracy: 0.9945

Another stupid thing, which I should have mentioned:

I have not yet found a simple way to explicitly reset just the number of iterations,

- without a recompilation
- or reloading a fully saved model at iteration 0 instead of reloading only the weights
- or writing my own scheduler class based on epochs or batches.

With the present code you would have to (re-) build the model to avoid starting with a large number of "iterations" - and thus a small learning-rate in a second training run. My "reset"-parameter alone does not help.
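The point can be illustrated with a tiny mock (my own conceptual illustration, not TF code): reloading weights restores the model parameters, but the optimizer object keeps its internal step counter, so a scheduler would continue from there.

```python
# Conceptual illustration: weight reloading does not touch the optimizer state.
class MockOptimizer:
    def __init__(self):
        self.iterations = 0   # counted up once per mini-batch

class MockModel:
    def __init__(self):
        self.weights = [0.1, 0.2]
        self.optimizer = MockOptimizer()
    def fit_one_epoch(self, steps):
        self.optimizer.iterations += steps
        self.weights = [w * 0.9 for w in self.weights]   # fake training
    def load_weights(self, saved):
        self.weights = list(saved)   # weights restored ...
        # ... but self.optimizer.iterations is NOT reset!

m = MockModel()
saved = list(m.weights)
m.fit_one_epoch(938)     # one epoch of 938 steps
m.load_weights(saved)    # restore initial weights
# m.optimizer.iterations is still 938 - a scheduler would continue
# with the correspondingly small learning rate.
```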

Some of my readers may be tempted to have a look at the weight tensors. This is possible via the optimizer and a callback. Then they may wonder about the dimensions, which are logically a bit different from the weight matrices I have used in my code for MLPs in another article series.

In the last article of this CNN series I mentioned that Keras takes care of configuring the weight matrices itself as soon as it has all information about the layers and the layers' nodes (or units). Now, this still leaves some freedom in handling the enumeration of layers and matrices; it is plausible that the logic of the layer and weight association must follow a *convention*. E.g., you can associate a weight matrix with the receiving layer in forward propagation direction. This is what Keras and TF2 do, and it is *different* from what I did in my MLP code. Even if you have fixed this point, you can still discuss the row/column layout (shape) of the weight matrix:

Keras and TF2 require the "input_shape", i.e. the shape of the tensor which is fed into the present layer. Regarding the weight matrix it is the number of nodes (units) of the previous layer (in FW direction) which is interesting; normally this is given by the 2nd dimension of the tensor's shape. This has to be combined with the number of units in the present layer. The convention is that TF2 and Keras define the weight matrix between two adjacent layers L_N and L_(N+1) in FW direction as follows:

L_N with **A nodes** (e.g. 6 units) and L_(N+1) with **B nodes** (e.g. 4 units) give a weight matrix of shape **(A, B)** [here: (6, 4)].
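A quick plain-Python check of this convention: with A = 6 units in L_N and B = 4 units in L_(N+1), forward propagation multiplies the activation vector of L_N from the left against a (6, 4) matrix. This is only a sketch of the matrix algebra, with no Keras involved:

```python
# Forward propagation with the TF2/Keras weight-shape convention (A, B):
# the activations of layer L_N (A = 6 values) times W of shape (6, 4)
# yield the 4 input values of layer L_(N+1).
def matvec(a, W):
    rows, cols = len(W), len(W[0])
    return [sum(a[i] * W[i][j] for i in range(rows)) for j in range(cols)]

a_N = [1.0] * 6                       # 6 units in L_N
W   = [[0.5] * 4 for _ in range(6)]   # weight matrix of shape (A, B) = (6, 4)
a_next = matvec(a_N, W)               # 4 values for L_(N+1)
```

In my own MLP code the matrix was laid out the transposed way, (B, A); the information content is identical.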

See the references at the bottom of the article.

Note that this is NOT what we would expect from our handling of the MLP. However, the amount of information kept in the matrix is, of course, the same. It is just a matter of convention and array transposition for the required math operations during forward and error backward propagation. If you (for reasons of handling gradients) concentrate on BW propagation then the TF2/Keras convention is quite reasonable.

The Keras framework gives you access to a variety of hyperparameters for controlling the use of a regularizer, the optimizer, the decline of the learning rate with batches/epochs, and the momentum. The handling is a bit unconventional, but one gets used to it pretty quickly.

We once again learned that it pays to invest a bit in the variation of such parameters when you want to fight for the last tenths of a percent beyond 99.0% accuracy. In the case of MNIST we can drive the accuracy to 99.35%.

With the next article of this series

A simple CNN for the MNIST dataset – IV – Visualizing the output of convolutional layers and maps

we turn more to the question of visualizing some data at the convolutional layers to better understand their working.

**Layers and the shape of the weight matrices**

https://keras.io/guides/sequential_model/

https://stackoverflow.com/questions/44791014/understanding-keras-weight-matrix-of-each-layers

https://keras.io/guides/making_new_layers_and_models_via_subclassing/

**Learning rate and adaptive optimizers**

https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1

https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/

https://faroit.com/keras-docs/2.0.8/optimizers/

https://keras.io/api/optimizers/

https://keras.io/api/optimizers/learning_rate_schedules/

https://www.jeremyjordan.me/nn-learning-rate/

**Keras Callbacks**

https://keras.io/api/callbacks/

https://keras.io/guides/writing_your_own_callbacks/

https://keras.io/api/callbacks/learning_rate_scheduler/

**Interactive plotting**

https://stackoverflow.com/questions/39428347/interactive-matplotlib-plots-in-jupyter-notebook

https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.ion.html

https://matplotlib.org/3.1.3/tutorials/introductory/usage.html

After the discussion in the article A simple CNN for the MNIST datasets – I,

we are well prepared to build a very simple CNN with **Keras**. By *simple* I mean simple enough to handle the MNIST digit images. The Keras API for creating CNN models, layers and activation functions is very convenient; a simple CNN does not require much code. So, the Jupyter environment is sufficient for our first experiment.

An interesting topic is the use of a GPU. After a somewhat frustrating experience with an MLP on the GPU of an NV 960 GTX in comparison to an i7 6700K CPU, I am eager to see whether we get a reasonable GPU acceleration for a CNN. So, we should prepare our code to use the GPU. This requires a bit of preparation.

We should also ask a subtle question: What do we expect from a CNN in comparison to an MLP regarding the MNIST data? An MLP with 2 hidden layers (with 70 and 30 nodes) provided over 99.5% accuracy on the training data and almost 98% accuracy on a test dataset after some tweaking. Even with basic settings for our MLP we arrived at a value over 97.7% after 35 epochs - in below 8 secs. Well, a CNN is probably better at feature recognition than a cluster detection algorithm. But we are talking about the last 2% of remaining accuracy. I admit that I did not know what to expect ...

At the end of the last article I had discussed a very simple layer structure of convolutional and pooling layers:

- **Layer 0:** *Input* layer (tensor of original image data, 3 layers for color channels or one layer for a gray channel)
- **Layer 1:** *Conv* layer (small 3x3 kernel, stride 1, 32 filters => 32 maps (26x26), overlapping filter areas)
- **Layer 2:** *Pooling* layer (2x2 max pooling => 32 (13x13) maps, a map node covers 4x4 non-overlapping areas per node on the original image)
- **Layer 3:** *Conv* layer (3x3 kernel, stride 1, 64 filters => 64 maps (11x11), a map node covers 8x8 overlapping areas on the original image (total effective stride 2))
- **Layer 4:** *Pooling* layer (2x2 max pooling => 64 maps (5x5), a map node covers 10x10 areas per node on the original image (total effective stride 5), some border info lost)
- **Layer 5:** *Conv* layer (3x3 kernel, stride 1, 64 filters => 64 maps (3x3), a map node covers 18x18 areas per node (effective stride 5), some border info lost)
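The map sizes in this list follow from two simple rules: a 3x3 convolution with stride 1 and no padding shrinks a map from n x n to (n-2) x (n-2), and a 2x2 max pooling halves the size (dropping a border line for odd n). A quick sketch to verify the chain 28 → 26 → 13 → 11 → 5 → 3:

```python
# Verify the feature-map sizes of the layer stack above.
def conv3x3(n):    # 3x3 kernel, stride 1, no padding
    return n - 2

def pool2x2(n):    # 2x2 max pooling, stride 2
    return n // 2  # odd sizes lose one border line

sizes = []
n = 28             # MNIST input
for op in (conv3x3, pool2x2, conv3x3, pool2x2, conv3x3):
    n = op(n)
    sizes.append(n)
# sizes == [26, 13, 11, 5, 3]
```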

This is the CNN structure we are going to use in the near future. (Actually, I followed a suggestion of *Francois Chollet*; see the literature list in the last article.) Let us assume that we somehow have established all these convolution and pooling layers for a CNN. Each layer produces some "*feature*"-related output, structured in the form of *tensors*. This led to an open question at the end of the last article:

Where and by what do we get a classification of the resulting data with respect to the 10 digit categories of the MNIST images?

Applying filters and extracting "feature hierarchies" of an image alone does not help without a "learned" judgement about these data. But the answer is very simple:

Use a MLP after the last Conv layer and feed it with data from this Conv layer!

When we think in terms of nodes and artificial neurons, we could say: We just have to connect the "nodes" of the feature maps of layer 5 in our special CNN with the nodes of an input layer of a MLP. As a MLP has a *flat* input layer we need to prepare 9x64 = **576 receiving "nodes"** there. We would use weights with a value of "1.0" along these special connections.

Mathematically, this approach can be expressed in terms of a "*flattening*" operation on the tensor data produced by the last *Conv* layer. In Numpy terms: We need to reshape the multidimensional tensor containing the values across the stack of maps at the last Conv2D layer into a long 1D array (= a vector).
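In plain Python terms the flattening step looks like this (a sketch with a nested-list stand-in for the real tensor and made-up zero data; in the Keras model the same is done by a Flatten layer or a reshape):

```python
# Flattening the (3, 3, 64) output of the last Conv layer into a 576-vector.
maps = [[[0.0 for ch in range(64)] for col in range(3)] for row in range(3)]

flat = [maps[row][col][ch]
        for row in range(3)
        for col in range(3)
        for ch in range(64)]
# len(flat) == 3 * 3 * 64 == 576 -> the input nodes of the MLP
```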

From a more general perspective we could say: Feeding the output of the Conv part of our CNN into a MLP for classification is quite similar to what we did when we pre-processed the MNIST data by an unsupervised *cluster detection algorithm*; also there we used the resulting data as input to an MLP. There is one big difference, however:

The optimization of the network's weights during training requires a BW propagation of error terms (more precisely: derivatives of the CNN's loss function) across the MLP *AND* the convolutional and pooling layers. Error BW propagation must not stop at the MLP's input layer: It has to move from the output layer of the MLP back to the MLP's input layer - and from there onward through all convolutional and pooling layers.

If you read my PDF on the error back propagation for a MLP

PDF on the math behind Error Back_Propagation

and think a bit about its basic recipes and results, you quickly see that the "input layer" of the MLP is no barrier to error back propagation: The "deltas" discussed in the PDF can be back-propagated right down to the MLP's input layer. Then we just apply the chain rule again. The partial derivatives at the nodes of the input layer with respect to their input values are just "1", because the activation function there is the identity function. The "weights" between the last Conv layer and the input layer of the MLP are no free parameters - we do not need to care about them. And then everything goes its normal way: we apply chain rule after chain rule for all nodes of the maps to determine the gradients of the CNN's loss function with respect to the weights there. But you need not think about the details - Keras and TF2 will take proper care of everything ...

But, you should always keep the following in mind: Whatever we discuss in terms of layers and nodes - in a CNN these are only fictitious interpretations of a series of mathematical operations on tensor data. Not less, not more ... Nodes and layers are just very helpful (!) illustrations of non-cyclic graphs of mathematical operations. AI on the level of my present discussion (MLPs, CNNs) "just" corresponds to algorithms which emerge out of a specific deterministic approach to solve an optimization problem.

Let us now turn to coding. To be able to use a Nvidia GPU we need a Cuda/Cudnn installation and a working Tensorflow backend for Keras. I have already described the installation of CUDA 10.2 and CUDNN on an Opensuse Linux system in some detail in the article Getting a Keras based MLP to run with Cuda 10.2, Cudnn 7.6 and TensorFlow 2.0 on an Opensuse Leap 15.1 system. You can follow the hints there. In case you run into trouble on your Linux distribution try everything with Cuda 10.1.

Some more hints: TF2 in version 2.2 can be installed by the Pip-mechanism in your virtual Python environment ("pip install --upgrade tensorflow"). TF2 *contains* already a special Keras version - which is the one we are going to use in our upcoming experiment. So, there is no need to install Keras separately with "pip". Note also that, in contrast to TF1, it is NOT necessary to install a separate package "tensorflow-gpu". If all these things are new to you: You find some articles on creating an adequate ML test and development environment based on Python/PyDev/Jupyter somewhere else in this blog.

We shall use a Jupyter notebook to perform the basic experiments; but I recommend strongly to consolidate your code in Python files of an Eclipse/PyDev environment afterwards. Before you start your virtual Python environment from a Linux shell you should set the following environment variables:

$> export OPENBLAS_NUM_THREADS=4   # or whatever is reasonable for your CPU (but do not use all CPU cores and/or hyper-threads)
$> export OMP_NUM_THREADS=4
$> export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
$> export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
$> source bin/activate
(ml_1) $> jupyter notebook

The following commands in a first Jupyter cell perform the required library imports:

import numpy as np
import scipy
import time
import sys
import os

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
from sklearn.preprocessing import StandardScaler

Do not ignore the statement "**from tensorflow.python.keras import backend as B**"; we need it later.

The "StandardScaler" of Scikit-Learn will help us to "standardize" the MNIST input data. This is a step which you should know already from MLPs ... You can, of course, also experiment with different normalization procedures. But in my opinion using the "StandardScaler" is just convenient. ( I assume that you already have installed *scikit-learn* in your virtual Python environment).

With TF2 the switching between CPU and GPU is a bit of a mess. Not all new parameters and their settings work as expected. As I have explained in the article on the Cuda installation named above, I therefore prefer an old-school but *reliable* TF1 approach and use the compatibility interface:

#gpu = False
gpu = True
if gpu:
    GPU = True;  CPU = False; num_GPU = 1; num_CPU = 1
else:
    GPU = False; CPU = True;  num_CPU = 1; num_GPU = 0

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                                  inter_op_parallelism_threads=1,
                                  allow_soft_placement=True,
                                  device_count = {'CPU' : num_CPU, 'GPU' : num_GPU},
                                  log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.35
config.gpu_options.force_gpu_compatible = True
B.set_session(tf.compat.v1.Session(config=config))

We are brave and try our first runs directly on a GPU. The parameter "log_device_placement" will help us to get information about which device - CPU or GPU - is actually used.

We prepare a function which loads and prepares the MNIST data for us. The statements reflect more or less what we did with the MNIST data when we used them for MLPs.

# load MNIST
# **********
def load_Mnist():
    mnist = K.datasets.mnist
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    #print(X_train.shape)
    #print(X_test.shape)

    # preprocess - flatten
    len_train = X_train.shape[0]
    len_test  = X_test.shape[0]
    X_train = X_train.reshape(len_train, 28*28)
    X_test  = X_test.reshape(len_test, 28*28)

    # concatenate
    _X = np.concatenate((X_train, X_test), axis=0)
    _y = np.concatenate((y_train, y_test), axis=0)
    _dim_X = _X.shape[0]

    # 32-bit
    _X = _X.astype(np.float32)
    _y = _y.astype(np.int32)

    # normalize
    scaler = StandardScaler()
    _X = scaler.fit_transform(_X)

    # mixing the training indices - MUST happen BEFORE encoding
    shuffled_index = np.random.permutation(_dim_X)
    _X, _y = _X[shuffled_index], _y[shuffled_index]

    # split again
    num_test  = 10000
    num_train = _dim_X - num_test
    X_train, X_test, y_train, y_test = _X[:num_train], _X[num_train:], _y[:num_train], _y[num_train:]

    # reshape to Keras tensor requirements
    train_imgs = X_train.reshape((num_train, 28, 28, 1))
    test_imgs  = X_test.reshape((num_test, 28, 28, 1))
    #print(train_imgs.shape)
    #print(test_imgs.shape)

    # one-hot-encoding
    train_labels = to_categorical(y_train)
    test_labels  = to_categorical(y_test)
    #print(test_labels[4])

    return train_imgs, test_imgs, train_labels, test_labels

if gpu:
    with tf.device("/GPU:0"):
        train_imgs, test_imgs, train_labels, test_labels = load_Mnist()
else:
    with tf.device("/CPU:0"):
        train_imgs, test_imgs, train_labels, test_labels = load_Mnist()

**Some comments:**

- **Normalization and shuffling:** The "StandardScaler" is used for data normalization. I also shuffled the data to avoid any pre-ordered sequences. We know these steps already from the MLP code we built in another article series.
- **Image data in tensor form:** Something which is different from working with MLPs is that we have to fulfill some requirements regarding the form of input data. From the last article we know already that our data should have a *tensor compatible form*; Keras expects data from us which have a certain shape. So, **no flattening** of the data into a vector here as we were used to with MLPs. For images we instead need the width and the height of our images in terms of pixels and also the depth (here 1 for gray-scale images). In addition the data *samples* are to be indexed *along the first tensor axis*. This means that we need to deliver a 4-dimensional array corresponding to a TF tensor of rank 4. Keras/TF2 will do the necessary transformation from a Numpy array to a TF2 tensor automatically for us. The corresponding Numpy shape of the required array is: **(samples, height, width, depth)**. Some people also use the term "channels" instead of "depth". In the case of MNIST we reshape the input array "train_imgs" to **(num_train, 28, 28, 1)**, with "num_train" being the number of samples.
- **One-hot-encoding:** The use of the function "to_categorical()", more precisely "tf.keras.utils.to_categorical()", corresponds to a **one-hot-encoding** of the target data. All these concepts are well known from our study of MLPs and MNIST. Keras makes life easy regarding this point ...
- **Device placement:** The statements "**with tf.device("/GPU:0"):**" and "**with tf.device("/CPU:0"):**" delegate the execution of (suitable) code to the GPU or the CPU. Note that due to the Python/Jupyter environment some code will of course also be executed on the CPU - even if you delegated execution to the GPU.

If you activate the print statements the resulting output should be:

(60000, 28, 28)
(10000, 28, 28)
(60000, 28, 28, 1)
(10000, 28, 28, 1)
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]

The last line proves the one-hot-encoding.
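For illustration, here is a pure-Numpy sketch (with made-up data and sample count) of the two preparation steps discussed above - reshaping to tensor form and the effect of a one-hot-encoding:

```python
import numpy as np

# Hypothetical flat data for 5 samples, as an MLP would consume them
num_samples = 5
X = np.random.rand(num_samples, 28*28).astype(np.float32)
y = np.array([3, 1, 4, 1, 5])     # invented class labels

# reshape to the rank-4 tensor form (samples, height, width, depth)
imgs = X.reshape(num_samples, 28, 28, 1)

# what to_categorical() effectively does for 10 classes
labels = np.zeros((num_samples, 10), dtype=np.float32)
labels[np.arange(num_samples), y] = 1.0

print(imgs.shape)    # (5, 28, 28, 1)
print(labels[0])     # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```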

Now, we come to a central point: We need to build the 5 central layers of our CNN-architecture. When we built our own MLP code, we used a special method to build the different weight arrays, which represented the number of nodes via the array dimensions. A simple method was sufficient as we had no major differences between layers. But with CNNs we have to work with substantially different types of layers. So, how are layers to be handled with Keras?

Well, Keras provides a full layer API with different *classes* for a *variety* of layers. You find substantial information on this API and different types of layers at

https://keras.io/api/layers/.

The first section which is interesting for our experiment is https://keras.io/api/layers/convolution_layers/convolution2d/.

You do not need to read much to understand that this is exactly what we need for the "convolutional layers" of our simple CNN model. But how do we instantiate the **Conv2D** class such that the output works seamlessly together with other layers?

Keras makes our life easy again. All layers are to be used in a purely *sequential order*. (There are far more complicated layer topologies you can build with Keras! Oh, yes ...). Well, you guess it: Keras offers you a *model API*; see:

https://keras.io/api/models/.

And there we find a class for a "**sequential model**" - see https://keras.io/api/models/sequential/. This class offers us a method "**add()**" to add layers (and create an instance of the related layer class).

The only missing ingredient is a class for a "pooling" layer. Well, you find it in the layer API documentation, too. The following image depicts the basic structure of our CNN (see the left side of the drawing), as we designed it (see the list above):

The convolutional part of the CNN can be set up by the following commands:

**Convolutional part of the CNN**

# Sequential layer model of our CNN
# ***********************************

# Build the Conv part of the CNN
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Choose the activation function for the Conv2D layers
conv_act_func = 1
li_conv_act_funcs = ['sigmoid', 'relu', 'elu', 'tanh']
cact = li_conv_act_funcs[conv_act_func]

# Build the Conv2D layers
cnn = models.Sequential()
cnn.add(layers.Conv2D(32, (3,3), activation=cact, input_shape=(28,28,1)))
cnn.add(layers.MaxPooling2D((2,2)))
cnn.add(layers.Conv2D(64, (3,3), activation=cact))
cnn.add(layers.MaxPooling2D((2,2)))
cnn.add(layers.Conv2D(64, (3,3), activation=cact))

Easy, isn't it? The nice thing about Keras is that it cares about the required tensor ranks and shapes itself; in a sequential model it evaluates the output of an already defined layer to guess the shape of the tensor data entering the next layer. Thus we have to define an "input_shape" only for data entering the first Conv2D layer!

The first Conv2D layer requires, of course, a shape for the input data. We must also tell the layer interface how many filters and "feature maps" we want to use. In our case we produce 32 maps by the first Conv2D layer and 64 by each of the other two Conv2D layers. The (3x3)-parameter defines the filter area size to be covered by the filter kernel: 3x3 pixels. We define no "stride", so a stride of 1 is automatically used; all 3x3 areas lie close to each other and overlap each other. These parameters result in 32 maps of size 26x26 after the first convolution. The sizes of the maps of the other layers are given in the layer list at the beginning of this article.
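The quoted map sizes follow from the standard formula for convolution without padding. A quick sanity check in plain Python (the helper names are mine):

```python
# Feature map sizes for our layer stack, using the standard formulas
# conv: out = (in - kernel) // stride + 1   (no padding, stride 1 here)
# pool: out = in // pool_size
def conv_size(n, k=3, s=1):
    return (n - k) // s + 1

def pool_size(n, p=2):
    return n // p

n = 28
n = conv_size(n)   # Conv2D(32, (3,3))  -> 26
n = pool_size(n)   # MaxPooling2D(2,2)  -> 13
n = conv_size(n)   # Conv2D(64, (3,3))  -> 11
n = pool_size(n)   # MaxPooling2D(2,2)  -> 5
n = conv_size(n)   # Conv2D(64, (3,3))  -> 3
print(n, n * n * 64)   # 3 576
```

The 576 values of the last Conv layer are exactly what the Flatten layer later hands over to the MLP.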

In addition you saw from the code that we chose an activation function via an index of a Python list of reasonable alternatives. You find an explanation of all the different activation functions in the literature. (See also: wikipedia: Activation function). The "sigmoid" function should be well known to you already from my other article series on a MLP.

Now, we have to care about the MLP part of the CNN. The code is simple:

**MLP part of the CNN**

# Build the MLP part of the CNN
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Choose the activation function for the hidden layers of the MLP
mlp_h_act_func = 0
li_mlp_h_act_funcs = ['relu', 'sigmoid', 'tanh']
mhact = li_mlp_h_act_funcs[mlp_h_act_func]

# Choose the output function for the output layer of the MLP
mlp_o_act_func = 0
li_mlp_o_act_funcs = ['softmax', 'sigmoid']
moact = li_mlp_o_act_funcs[mlp_o_act_func]

# Build the MLP layers
cnn.add(layers.Flatten())
cnn.add(layers.Dense(70, activation=mhact))
#cnn.add(layers.Dense(30, activation=mhact))
cnn.add(layers.Dense(10, activation=moact))

This is all very straightforward (with the exception of the last statement). The **"Flatten"-layer** corresponds to the MLP's input layer. It just transforms the tensor output of the last Conv2D layer into the flat form usable for the first "Dense" layer of the MLP. The first and only "**Dense layer**" (MLP hidden layer) builds up connections from the flat MLP "input layer" and associates it with weights. Actually, it prepares a weight-tensor for a tensor-operation on the output data of the feeding layer. *Dense* means that all "nodes" of the previous layer are connected to the present layer's own "nodes" - meaning: setting the right dimensions of the weight tensor (a matrix in our case). As a first trial we work with just one hidden layer. (We shall later see that more layers will not improve accuracy.)

I choose the output function (if you will: the activation function of the output layer) as "**softmax**". This gives us a probability distribution across the classification categories. Note that this is a different approach compared to what we have done so far with MLPs. I shall comment on the differences in a separate article when I find the time for it. At this point I just want to indicate that softmax combined with the "categorical cross entropy loss" is a generalized version of the combination "sigmoid" with "log loss" as we used it for our MLP.
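For illustration, a pure-Numpy sketch of what softmax and the categorical cross entropy loss compute (function names are mine, the numbers are invented):

```python
import numpy as np

def softmax(z):
    # shift by the maximum for numerical stability; result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_crossentropy(p, t):
    # t is a one-hot target vector; only the "true" class contributes to the loss
    return -np.sum(t * np.log(p + 1e-12))

z = np.array([2.0, 1.0, 0.1])   # hypothetical output-layer activations
t = np.array([1.0, 0.0, 0.0])   # one-hot target

p = softmax(z)
loss = categorical_crossentropy(p, t)
print(p.sum())   # 1.0 (within float precision)
```

The softmax output is a probability distribution across the classification categories, which is what the loss above expects.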

The above code for creating the CNN would work. However, we want to be able to parameterize our simple CNN. So we include the above statements in a function for which we provide parameters for all layers. A quick solution is to define layer parameters as elements of a Python list - we then get one list per layer. (If you are a friend of clean code design I recommend to choose a more elaborated approach; inject just one parameter object containing all parameters in a structured way. I leave this exercise to you.)

We now combine the statements for layer construction in a function:

# Sequential layer model of our CNN
# ***********************************

# just for illustration - the real parameters are fed later
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
li_conv_1 = [32, (3,3), 0]
li_conv_2 = [64, (3,3), 0]
li_conv_3 = [64, (3,3), 0]
li_Conv   = [li_conv_1, li_conv_2, li_conv_3]
li_pool_1 = [(2,2)]
li_pool_2 = [(2,2)]
li_Pool   = [li_pool_1, li_pool_2]
li_dense_1 = [70, 0]
li_dense_2 = [10, 0]
li_MLP     = [li_dense_1, li_dense_2]
input_shape = (28,28,1)

# important !!
# ~~~~~~~~~~~
cnn = None

def build_cnn_simple(li_Conv, li_Pool, li_MLP, input_shape):

    # activation functions to be used in Conv-layers
    li_conv_act_funcs = ['relu', 'sigmoid', 'elu', 'tanh']
    # activation functions to be used in MLP hidden layers
    li_mlp_h_act_funcs = ['relu', 'sigmoid', 'tanh']
    # activation functions to be used in MLP output layers
    li_mlp_o_act_funcs = ['softmax', 'sigmoid']

    # Build the Conv part of the CNN
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    num_conv_layers = len(li_Conv)
    num_pool_layers = len(li_Pool)
    if num_pool_layers != num_conv_layers - 1:
        print("\nNumber of pool layers does not fit to number of Conv-layers")
        sys.exit()
    rg_il = range(num_conv_layers)

    # Define a sequential model
    cnn = models.Sequential()

    for il in rg_il:
        # add the convolutional layer
        num_filters  = li_Conv[il][0]
        t_fkern_size = li_Conv[il][1]
        cact         = li_conv_act_funcs[li_Conv[il][2]]
        if il == 0:
            cnn.add(layers.Conv2D(num_filters, t_fkern_size, activation=cact, input_shape=input_shape))
        else:
            cnn.add(layers.Conv2D(num_filters, t_fkern_size, activation=cact))

        # add the pooling layer
        if il < num_pool_layers:
            t_pkern_size = li_Pool[il][0]
            cnn.add(layers.MaxPooling2D(t_pkern_size))

    # Build the MLP part of the CNN
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    num_mlp_layers = len(li_MLP)
    rg_im = range(num_mlp_layers)

    cnn.add(layers.Flatten())

    for im in rg_im:
        # add the dense layer
        n_nodes = li_MLP[im][0]
        if im < num_mlp_layers - 1:
            m_act = li_mlp_h_act_funcs[li_MLP[im][1]]
        else:
            m_act = li_mlp_o_act_funcs[li_MLP[im][1]]
        cnn.add(layers.Dense(n_nodes, activation=m_act))

    return cnn

We return the model "cnn" to be able to use it afterwards.

The layers contribute with the following numbers of weight parameters:

- Layer 1: (32 x (3x3)) + 32 = 320
- Layer 3: 32 x 64 x (3x3) + 64 = 18496
- Layer 5: 64 x 64 x (3x3) + 64 = 36928
- MLP : (576 + 1) x 70 + (70 + 1) x 10 = 41100

Making a total of **96844** weight parameters. Our standard MLP discussed in another article series had (784+1) x 70 + (70 + 1) x 30 + (30 +1 ) x 10 = 57390 weights. So, our CNN is bigger and the CPU time to follow and calculate all the partial derivatives will be significantly higher. So, we should definitely expect some better classification data, shouldn't we?
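The counts above can be verified with a few lines of arithmetic; the "+ 32", "+ 64" and "+ 1" terms account for the bias weights:

```python
# Recomputing the weight-parameter counts of our CNN (kernel weights + biases)
conv1 = 32 * (3*3*1)  + 32               # 1 input channel  -> 320
conv2 = 64 * (3*3*32) + 64               # 32 input maps    -> 18496
conv3 = 64 * (3*3*64) + 64               # 64 input maps    -> 36928
mlp   = (576 + 1) * 70 + (70 + 1) * 10   # flattened input + hidden + output -> 41100
total = conv1 + conv2 + conv3 + mlp
print(total)   # 96844
```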

Now comes a step which is necessary for all Keras models: We have not yet defined the **loss function**, the **optimizer** or a **learning rate**. For the latter Keras can choose a proper value itself - as soon as it knows the loss function. But we should give it a reasonable loss function and a suitable optimizer for gradient descent. This is the main purpose of the "**compile()**"-function.

cnn.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

Although TF2 can already analyze the graph of tensor operations for partial derivatives, it cannot guess the starting point of the chain rule sequence - the loss function. We have to define it explicitly.

As we have multiple categories "categorical_crossentropy" is a good choice for the loss function. We should also define which optimized gradient descent method is used; we choose "rmsprop" - as this method works well in most cases. A nice introduction is given here: towardsdatascience: understanding-rmsprop-faster-neural-network-learning-62e116fcf29a. But see the books mentioned in the last article on "rmsprop", too.
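For illustration, a strongly simplified sketch of the RMSprop update rule on a toy quadratic loss. This is my own minimal version of the idea (a running average of squared gradients scales each step), not the Keras implementation:

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.1, rho=0.9, eps=1e-8):
    # keep an exponentially decaying average of squared gradients ...
    avg_sq = rho * avg_sq + (1.0 - rho) * grad**2
    # ... and normalize the step by its square root
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# toy quadratic loss L(w) = w^2 with gradient 2w; minimum at w = 0
w, avg_sq = 5.0, 0.0
for _ in range(200):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
print(abs(w) < 1.0)   # True - the weight has moved close to the minimum
```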

Regarding the use of different metrics for different tasks see machinelearningmastery.com/custom-metrics-deep-learning-keras-python/. In case of a classification problem, we are interested in the categorical "accuracy". A metric can be monitored during training and will be recorded (besides other data). We can use it for plotting information on the training process (a topic of the next article).

Training is done by a function model.fit() - here: cnn.fit(). This function accepts a variety of parameters explained here: https://keras.io/api/models/model_training_apis/#fit-method.

We now can combine compilation and training in one function:

# Training
def train(cnn, build, train_imgs, train_labels, reset, epochs, batch_size,
          optimizer, loss, metrics, li_Conv, li_Pool, li_MLP, input_shape):

    if build:
        cnn = build_cnn_simple(li_Conv, li_Pool, li_MLP, input_shape)
        cnn.compile(optimizer=optimizer, loss=loss, metrics=metrics)
        cnn.save_weights('cnn_weights.h5')   # save the initial weights

    # reset weights
    if reset and not build:
        cnn.load_weights('cnn_weights.h5')

    start_t = time.perf_counter()
    cnn.fit(train_imgs, train_labels, epochs=epochs, batch_size=batch_size, verbose=1, shuffle=True)
    end_t = time.perf_counter()
    fit_t = end_t - start_t

    return cnn, fit_t   # we return cnn to be able to use it by other functions

Note that we **save** the initial weights to be able to load them again for a new training - otherwise Keras saves the weights as other model data after training and continues with the last weights found. The latter can be reasonable if you want to continue training in defined steps. However, in our forthcoming tests we repeat the training from scratch.

Keras offers a "save"-model and methods to transfer data of a CNN model to files (in two specific formats). For saving weights the given lines are sufficient. Hint: When I specify no path to the file "cnn_weights.h5" the data are - at least in my virtual Python environment - saved in the directory where the notebook is located.

In a further Jupyter cell we place the following code for a test run:

# Perform a training run
# **********************
build = False
if cnn == None:
    build = True

batch_size = 64
epochs = 5
reset = True   # we want training to start again with the initial weights

optimizer = 'rmsprop'
loss = 'categorical_crossentropy'
metrics = ['accuracy']

li_conv_1 = [32, (3,3), 0]
li_conv_2 = [64, (3,3), 0]
li_conv_3 = [64, (3,3), 0]
li_Conv   = [li_conv_1, li_conv_2, li_conv_3]
li_pool_1 = [(2,2)]
li_pool_2 = [(2,2)]
li_Pool   = [li_pool_1, li_pool_2]
li_dense_1 = [70, 0]
li_dense_2 = [10, 0]
li_MLP     = [li_dense_1, li_dense_2]
input_shape = (28,28,1)

try:
    if gpu:
        with tf.device("/GPU:0"):
            cnn, fit_time = train(cnn, build, train_imgs, train_labels, reset,
                                  epochs, batch_size, optimizer, loss, metrics,
                                  li_Conv, li_Pool, li_MLP, input_shape)
        print('Time_GPU: ', fit_time)
    else:
        with tf.device("/CPU:0"):
            cnn, fit_time = train(cnn, build, train_imgs, train_labels, reset,
                                  epochs, batch_size, optimizer, loss, metrics,
                                  li_Conv, li_Pool, li_MLP, input_shape)
        print('Time_CPU: ', fit_time)
except SystemExit:
    print("stopped due to exception")

You recognize the parameterization of our train()-function. What results do we get?

Epoch 1/5
60000/60000 [==============================] - 4s 69us/step - loss: 0.1551 - accuracy: 0.9520
Epoch 2/5
60000/60000 [==============================] - 4s 69us/step - loss: 0.0438 - accuracy: 0.9868
Epoch 3/5
60000/60000 [==============================] - 4s 68us/step - loss: 0.0305 - accuracy: 0.9907
Epoch 4/5
60000/60000 [==============================] - 4s 69us/step - loss: 0.0227 - accuracy: 0.9931
Epoch 5/5
60000/60000 [==============================] - 4s 69us/step - loss: 0.0184 - accuracy: 0.9948
Time_GPU:  20.610678611003095

And a successive check on the test data gives us:

We can ask Keras also for a description of the model:

We got an accuracy on the test data of **99%**! With 5 epochs in 20 seconds - on my old GPU.
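The test-set check itself is a call to cnn.evaluate(test_imgs, test_labels). The accuracy metric behind it boils down to a simple comparison; here a pure-Numpy sketch with invented numbers:

```python
import numpy as np

# What the "accuracy" metric computes for one-hot encoded labels:
# compare the index of the highest output probability with the true class
preds  = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.3, 0.5]])   # hypothetical network outputs
labels = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 1]])         # one-hot targets

acc = np.mean(np.argmax(preds, axis=1) == np.argmax(labels, axis=1))
print(acc)   # 0.666... -> 2 of 3 samples classified correctly
```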

This leaves us a very good impression - on first sight ...

We saw today that it is easy to set up a CNN. We used a simple MLP to solve the problem of classification; the data to its input layer are provided by the output of the last convolutional layer. The tensor there has just to be "flattened".

The level of accuracy reached is impressive. Well, it is also a bit frustrating when we think about the efforts we put into our MLP, but we also get a sense for the power and capabilities of CNNs.

In the next article

A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer

we will care a bit about plotting. We at least want to see the same accuracy and loss data which we used to plot at the end of our MLP tests.


A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

we have played with a Python/Numpy code, which created a configurable and trainable "Multilayer Perceptron" [MLP].

MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation

for ongoing code and performance optimization.

An MLP program is useful to study multiple topics in Machine Learning [ML] on a basic level. However, MLPs with dense layers are certainly not at the forefront of ML technology - though they still are fundamental bricks in other more complicated architectures of "Artificial Neural Networks" [**ANNs**]. During my MLP experiments I became sufficiently acquainted with Python, Jupyter and matplotlib to make some curious first steps into another field of Machine Learning [**ML**] now: "Convolutional Neural Networks" [**CNNs**].

CNNs on my level as an interested IT-affine person are most of all fun. Nevertheless, I quickly found out that a somewhat systematic approach is helpful - especially if you later on want to use Tensorflow's API and not only **Keras**. When I now write about some experiments I did and do, I summarize my own biased insights and sometimes surprises. Probably there are other hobbyists like me out there who also fight with elementary points in the literature and practical experiments. Books alone are not enough ... I hope to deliver some practical hints for this audience. The present articles are, however, NOT intended for ML and CNN experts. Experts will almost certainly find nothing new here.

Although I address CNN-beginners I assume that people who stumble across this article and want to follow me through some experiments have read something about CNNs already. You should know fundamentals about *filters, strides and the basic principles of convolution*. I shall comment on all these points but I shall not repeat the very basics. I recommend reading the relevant chapters in one of the books I recommend at the end of this article. You should in addition have some knowledge regarding the basic structure and functionality of an MLP as well as of "gradient descent" as an optimization technique.

The objective of this introductory mini-series is to build a first simple CNN, to apply it to the MNIST dataset and to *visualize* some of the elementary "*features*" the CNN detects in the images of handwritten digits. We shall use **Keras** (with the Tensorflow 2.2 backend and CUDA 10.2) for this purpose. And, of course, a bit of matplotlib and Python/Numpy, too. We are working with MNIST images in the first place - although CNNs can be used to analyze other types of input data. After we have covered the simple standard MNIST image set, we shall also work a bit with the so-called "MNIST fashion" set.

But in this article I start with some introductory words on the structure of CNNs and the task of its layers. We shall use the information later on as a reference. In the second article we shall set up and test a simple version of a CNN. Further articles will then concentrate on visualizing what a trained CNN reacts to and how it modifies and analyzes the input data on its layers.

When we studied an MLP in combination with the basic MNIST dataset of handwritten digits we found that we got an improvement in accuracy (for the same setup of dense layers) when we pre-processed the data to find "*clusters*" in the image data *before* training. Such a process corresponds to detecting parts of an MNIST image with certain gray-white pixel constellations. We used Scikit-Learn's "MiniBatchKMeans" for this purpose.

We saw that the identification of 40 to 70 cluster areas in the images helped the MLP algorithm to analyze the MNIST data faster and better than before. Obviously, training the MLP with respect to combinations of characteristic *sub-structures* of the different images helped us to improve the accuracy of classification.

What if we could combine the detection of sub-structures in an image with the training process of an ANN?

CNNs are the answer! They are designed to detect elementary structures or features in image data (and other data) *systematically*. In addition they are enabled to learn something about characteristic *compositions* of such elementary *features* during training. I.e., they detect more abstract and composite features specific for the appearance of certain objects within an image. We speak of a "*feature hierarchy*".

While an MLP must learn about pixel constellations and their relations on the *whole* image area, CNNs are much more flexible and even reusable. They identify and *remember* elementary sub-structures *independent of the exact position* of such features within an image. They furthermore learn "abstract concepts" about depicted objects via identifying characteristic and complex *composite* features on a higher level.

This simplified description of the astonishing capabilities of a CNN indicates that its training and learning is basically a **two-fold** process:

- Detecting elementary structures in an image (or other structured data sets) by *filtering* and extracting patterns within relatively small image areas. We shall call these areas "filter areas".
- Constructing abstract characteristic features out of the elementary filtered structural elements. This corresponds to building a "*hierarchy*" of significant features for the classification of images *or* of distinguished *objects* *or* of the positions of such objects within an image.

Now, if you think about the MNIST digit data we understand intuitively that written digits represent some abstract concepts like certain combinations of straight vertical and horizontal line elements, bows and line crossings. The recognition of certain feature combinations of such elementary structures would of course be helpful to recognize and classify written digits better - especially when the recognition of the combination of such features is independent of their exact position on an image.

An important concept behind CNNs is the systematic application of (various) *filters* (described and defined by so called "kernels").

A "*filter*" defines a kind of *masking* pixel area of limited small size (e.g. 3x3 pixels). A *filter* combines weighted output values at *neighboring* nodes of an input layer in a specific defined way. It processes the offered information in a defined area always in the same fixed way - independent of where the filter area is exactly placed on the (bigger) image (or a processed version of it). We call a processed version of an image a "**map**".

A specific type of CNN layer, called a "*Convolution Layer*" [Conv layer], and a related operational algorithm let a series of such small masking areas cover the complete surface of an image (or a map). The first Conv layer of a CNN filters the information of the original image via a multitude of such masking areas. The masks can be arranged overlapping, i.e. they can be shifted against each other by some distance along their axes. Think of the masking filter areas as a bunch of overlapping *tiles* covering the image. The shift is called *stride*.

The "filter" mechanism (better: the mathematical recipe) of a *specific* filter remains the *same* for all of its small masking areas covering the image. A specific filter emphasizes certain parts of the original information and suppresses other parts in a defined way. If you combine the information of all masks you get a new (filtered) representation of the image - we speak of a "*feature map*" - sometimes with a somewhat smaller size than the original image (or map) the filter is applied to. The blending of the original data with a filtering mask is what the term "*convolution*" refers to.

The picture below sketches the basic principle of a 3x3-filter which is applied with a constant *stride* of 2 along each axis of the image:

*Convolution* is not as complicated as it sounds. It means: You multiply the original data values in the covered small area by factors defined in the filter's *kernel* and add the resulting values up to get a distinct value at a defined position inside the map. In the given example with a stride of 2 we get a resulting feature map of 4x4 out of an original 9x9 (image or map).
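The recipe can be written down directly in Numpy. The following sketch (function name and kernel values are my own choices for illustration) reproduces the 9x9 → 4x4 reduction for a 3x3 kernel with a stride of 2:

```python
import numpy as np

# Naive implementation of the convolution operation described above:
# slide a kernel over the input, multiply element-wise and add up
def convolve2d(img, kernel, stride=2):
    k = kernel.shape[0]
    out_dim = (img.shape[0] - k) // stride + 1
    out = np.zeros((out_dim, out_dim))
    for i in range(out_dim):
        for j in range(out_dim):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)   # multiply and add up
    return out

img = np.arange(81, dtype=np.float32).reshape(9, 9)   # hypothetical 9x9 "image"
kernel = np.array([[0., 1., 0.],
                   [1., 0., 1.],
                   [0., 1., 0.]])                     # an arbitrary example kernel

fmap = convolve2d(img, kernel)
print(fmap.shape)   # (4, 4) - the feature map from the drawing above
```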

Note that a filter need not be defined as a *square*. It can have a rectangular (n x m) shape with (n, m) being integers. (In principle we could also think of other tile forms as e.g. hexagons - as long as they can seamlessly cover a defined plane. Interesting, although I have not seen a hexagon based CNN in the literature, yet).

A filter's *kernel* defines the numerical factors by which the covered data values are multiplied and summed up during the convolution operation.

Note also that filters may have a "depth" property when they shall be applied to three-dimensional data sets; we may need a depth when we cover colored images (which require 3 input layers). But let us keep to flat filters in this introductory discussion ...

Now we come to a central question: Does a CNN Conv layer use just *one* filter? The answer is: No!

A Conv layer of a CNN allows for the construction of *multiple different filters*. Thus we have to deal with a whole bunch of filters *per* convolutional layer. E.g. 32 filters for the first convolutional layer, 64 for the second and 128 for the third. The outcome of the respective filter operations is the creation of equally many so called "*feature maps*" (one for each filter) per convolutional layer. With 32 different filters the first Conv layer thus produces 32 maps.

This means: A Conv layer has a multitude of sub-layers called "feature maps" which result from the application of different filters to previous image or map data.

You may have guessed already that the next step of abstraction is:

You can apply filters also to the "feature maps" of previous filters, i.e. you can **chain convolutions**. Thus, feature maps are either connected to the image (1st *Conv* layer) or to the feature maps of a previous layer.

By using a sequence of multiple *Conv* layers you cover growing areas of the original image. Everything clear? Probably not ...

When I was first confronted with the concept of filters, I got confused because many authors only describe the basic *technical* details of the "convolution" mechanism. They explain with many words how a filter and its kernel work when the filtering area is "*moved*" across the surface of an image. They give you pretty concrete filter examples; very popular are straight lines and crosses indicated by "ones" as factors in the filter's kernel and zeros otherwise. And then you get an additional lecture on strides and padding. You have certainly read various related passages in books about ML and/or CNNs. A pretty good example of this kind of "explanation" is the (otherwise interesting and helpful!) book of Deru and Ndiaye (see the bottom of this article). I refer to the introductory chapter 3.5.1 on CNN architectures.

Well, the technical procedure is pretty easy to understand from drawings as given above - the real question that nags in your brain is:

"Where the hell do all the different filter definitions come from?"

What many authors forget is a central introductory sentence for beginners:

A filter is *not* given a priori. Filters (and their kernels) are systematically constructed and built up during the training of a CNN; filters are the *end products* of a learning and optimization process every CNN must go through.

This means: For a given problem or dataset you do *not* know in advance what the "filters" (and their defining kernels) will look like after training (aside of their pixel dimensions already fixed by the CNN's layer definitions). The "factors" of a filter used in the convolution operation are actually adjustable variables of the training process.

Another critical point is the somewhat misleading analogy of "moving" a filter across an image's or map's pixel surface. Nothing is ever actually "moved" in a CNN's algorithm. All masks are already in place when the convolution operations are performed:

Every element of a specific e.g. 3x3 kernel corresponds to "factors" for the convolution operation. What are these *factors*? Again: They are nothing else but *weights* - in exactly the same sense as we used them in MLPs. A filter kernel represents a **set** of *weight*-values to be multiplied with original output values at the "nodes" in other layers or maps feeding input to the nodes of the present map.

Things become much clearer if you imagine a *feature map* as a bunch of arranged "nodes". Each node of a map is connected to (n x m) nodes of a previous set of nodes on a map or layer delivering input to the *Conv* layer's maps.

Let us look at an example. The following drawing shows the connections from "nodes" of a feature map "**m**" of a *Conv* layer **L_(N+1)** to nodes of two *different* maps "1" and "2" of *Conv* layer **L_N**. The *stride* for the kernels is assumed to be just 1.

In the example the related weights are described by **two** *different* (3x3) kernels. Note, however, that each node of a specific map uses the same weights for connections to another *specific* map or sub-layer of the previous (input) layer. This explains the total number of weights between two sequential *Conv* layers - one with 32 maps and the next with 64 maps - as (64 x 32 x 9) + 64 = **18496**. The 64 extra weights account for bias values per map on layer L_(N+1). (As all nodes of a map use *fixed* bunches of weights, we only need exactly *one* bias value per map).
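This weight count is easy to verify with a tiny helper function - my own sketch of the counting formula, not Keras code (although Keras' model.summary() would report the same number for such a layer):

```python
def conv_weights(n_maps_in: int, n_maps_out: int, kernel=(3, 3)) -> int:
    """Weights between two Conv layers: one (kh x kw) kernel per
    (input map, output map) pair, plus one bias value per output map."""
    kh, kw = kernel
    return n_maps_out * n_maps_in * kh * kw + n_maps_out

print(conv_weights(32, 64))  # 18496
```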

Note also that a *stride* is defined for the whole layer and not per map. Thus we enforce the same size of all maps in a layer. The convolutions between a distinct map and all maps of the previous layer L_N can be thought of as operations performed on a *column* of stacked filter areas at the same position - one above the other across all maps of L_N. See the illustration below:

The *weights* of a *specific* kernel work together as an *ensemble*: They condense the original 3x3 pixel information in the filtered area of the connected input layer or map to a value at one *node* of the filter-specific feature map.

A CNN *learns* the appropriate weights (= the filter definitions) for a given bunch of images via training and is guided by the optimization of a *loss function*. You know these concepts already from MLPs ...

The difference is that the ANN now learns about appropriate "weight ensembles" - eventually (!) working together as a defined convolutional filter between different maps of neighboring Conv (and/or sampling) layers. (For sampling see a separate paragraph below.)

The next picture illustrates the column-like convolution of information across the identically positioned filter areas of multiple maps of a previous convolution layer:

The fact that the weight ensemble of a specific filter between maps is always the same explains, by the way, the *relatively* (!) small number of weight parameters in deep CNNs.

**Intermediate summary:** The *weights*, which represent the factors used by a specific filter operation called convolution, are defined during a training process. The filter, its kernel and the respective weight values are the outcome of a mathematical optimization process - mostly guided by *gradient descent*.

As in MLPs each *Conv* layer has an associated "activation function" which is applied at each node of all maps after the resulting values of the convolution have been calculated as the node's input. The output then feeds the connections to the next layer. In CNNs for image handling often "*Relu*" or "*Selu*" are used as activation functions - and not "sigmoid", which we applied in our personal MLP code.

The above drawings indicate already that we need to arrange the data (of an image) and also the resulting map data in an organized way to be able to apply the required convolutional multiplications and summations the right way.

A colored image is basically a regular 3-dimensional structure with a width "*w*" (number of pixels along the x-axis), a height "*h*" (number of pixels along the y-axis) and a (color) depth "*d*" (d=3 for RGB colors).

If you represent the color value at each pixel and RGB-layer by a float you get a bunch of *w x h x d* float values which we can organize and index in a 3-dimensional Numpy array. Mathematically, such well organized arrays with a defined number of axes are called **tensors**.

Now the important point is: The output data of *Conv*-layers and their feature maps also represent tensors. A bunch of 32 maps with a defined width and height defines data of a 3D-tensor.

You can imagine each value of such a tensor as the input or output given at a specific *node* in a layer with a 3-dimensional sub-structure. (For data sets even more complex than images we would work with other *multi-dimensional data structures*.) The weights of a filter kernel describe the connections of the nodes of a feature map on a layer L_N to a specific map of a previous layer. The weights, actually, also define elements of a tensor.

The forward- and backward-propagation operations performed throughout such a complex net during training thus correspond to certain tensor-operations - i.e. generalized versions of the np.dot()-product we got to know in MLPs.

You understood already that e.g. strides are important. But you do not need to care about details - Keras and Tensorflow will do the job for you! If you want to read a bit, look at the documentation of the TF function "tf.nn.conv2d()".

When we later on train with mini-batches of input data (i.e. batches of images) we get yet another dimension of our tensors. This batch dimension can - quite similar to MLPs - be used to optimize the tensor operations in a *vectorized* way. See my series on MLPs.
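In Numpy terms the relevant shapes look like this (a sketch assuming the channels-last ordering which Keras/TF use by default):

```python
import numpy as np

# One gray MNIST image as a 3D tensor: (height, width, depth)
image = np.zeros((28, 28, 1))

# A mini-batch of 500 such images adds a leading batch axis:
batch = np.zeros((500, 28, 28, 1))

# The output of a first Conv layer with 32 filters (3x3 kernel, stride 1)
# would be a 4D tensor: one (26x26) feature map per filter and image.
conv_out = np.zeros((500, 26, 26, 32))

print(batch.ndim, conv_out.ndim)  # 4 4
```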

Two sections above I characterized the training of a CNN as a two-fold procedure. From the first drawing it is relatively easy to understand how we get to grasp tiny sub-structures of an image: Just use filters with *small* kernel sizes!

Fine, but there is probably a second question already arising in your mind:

By what mechanism does a CNN find or recognize a *hierarchy of features*?

One part of the answer is: Chain convolutions!

Let us assume a first convolutional layer with filters having a stride of 1 and a (3x3) kernel. We get maps with a shape of (26, 26) on this layer. The next Conv layer shall use a (4x4) kernel and also a stride of 1; then we get maps with a shape of (23, 23). A node on the second layer covers a (6x6)-array on the original image. Two neighboring nodes cover a total area of (7x7). The individual (6x6)-areas of course overlap.
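The growth of the covered ("receptive") area can be computed recursively with a standard bookkeeping formula - sketched here for the stride-1 example above (rf is the covered side length, jump the distance of neighboring nodes measured in image pixels):

```python
def receptive_field(layers):
    """Receptive field on the input image after a sequence of
    (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the covered area
        jump *= stride              # and may widen the node spacing
    return rf, jump

rf, jump = receptive_field([(3, 1), (4, 1)])  # (3x3) layer, then (4x4) layer
print(rf)         # 6  -> a node covers a (6x6) area of the image
print(rf + jump)  # 7  -> two neighboring nodes cover (7x7)
```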

With a stride of 2 on each Conv-layer the corresponding areas on the original image are (7x7) and (11x11).

So a stack of consecutive (sequential) Conv-layers covers growing areas on the original image. This supports the detection of a feature hierarchy.

**However:** Small strides require a relatively big number of sequential Conv-layers: for 3x3 kernels and a stride of 1 we need at least 13 layers to eventually cover the full image area.

Even if we did not enlarge the number of maps beyond 128 with growing layer number, we would get

(32 x 9 + 32) + (64 x 32 x 9 + 64) + (128 x 64 x 9 + 128) + 10 x (128 x 128 x 9 + 128) = 320 + 18496 + 73856 + 10 x 147584 = 1.568 million weight parameters

to take care of!
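We can reproduce this sum with the same kind of counting (a sketch assuming one gray input channel and 3x3 kernels throughout):

```python
def conv_weights(n_in, n_out, k=3):
    """Kernels per (input map, output map) pair plus one bias per output map."""
    return n_out * n_in * k * k + n_out

total = (conv_weights(1, 32)            # first Conv layer on a gray image
         + conv_weights(32, 64)         # second Conv layer
         + conv_weights(64, 128)        # third Conv layer
         + 10 * conv_weights(128, 128)) # ten more 128-map layers
print(total)  # 1568512
```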

This number has to be multiplied by the number of images in a mini-batch - e.g. 500. And - as we know from MLPs - we have to keep all intermediate output results in RAM to accelerate the backward propagation for the determination of gradients. Too much data and too many parameters for the analysis of small 28x28 images!

Big strides, however, would affect the spatial resolution of the first layers in a CNN. What is the way out?

The famous VGG16 CNN uses pairs and triples of convolution chains in its architecture. How does such a network get control over the number of weight parameters and the RAM requirement for all the output data at all the layers?

To get information in the sense of a feature hierarchy the CNN clearly should not only look at details and related small sub-fields of the image. It must cover step-wise growing (!) areas of the original image, too. How do we combine these seemingly contradictory objectives in one training algorithm which does *not* lead to an exploding number of parameters, RAM and CPU time? Well, guys, this is the point where we should pay due respect to all the creative inventors of CNNs:

The answer is: We must accumulate or *sample* information across larger image or map areas. This is the (underestimated?) task of so-called **pooling** or **sampling layers**.

For me it was just another confusing point in the beginning - until one grasps the real magic behind it. At first sight a layer like a typical "maxpooling" layer seems to only *reduce* information; see the next picture:

The drawing explains that we "sample" the information over multiple pixels e.g. by

- either calculating an average over pixels (or map node values)
- or by just picking the maximum value of pixels or map node values (thereby stressing the most important information)

in a certain defined sub-area of an image or map.

The shift or stride used as a *default* in a pooling layer is exactly the side length of the pooling area. We thus cover the image by adjacent, non-overlapping tiles! This leads to a substantial *decrease* of the dimensions of the resulting map! With a (2x2) pooling size each side length shrinks by an effective factor of 2. (You can change the default pooling stride - but think about the consequences!)
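Max pooling is even simpler to sketch than convolution (again a naive Numpy illustration, not the optimized Keras/TF implementation):

```python
import numpy as np

def max_pool(fmap: np.ndarray, size: int = 2) -> np.ndarray:
    """(size x size) max pooling over non-overlapping tiles; border
    rows/columns which do not fill a complete tile are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

m = np.random.rand(26, 26)   # e.g. a (26x26) feature map
print(max_pool(m).shape)     # (13, 13)
```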

Of course, averaging or picking a max value corresponds to information reduction.

**However:** What the CNN also does in a subsequent *Conv* layer is to invest in further weights for the combination of information (features) in and of *substantially larger* areas of the original image! Pooling followed by an additional convolution obviously supports hierarchy building of information on different scales of image areas!

After we first have concentrated on small scale features (like with a magnifying glass) we now - in a figurative sense - make a step backwards and look at larger scales of the image again.

The trick is to evaluate large scale information by sampling layers in addition to the small scale information already extracted by the previous convolutions. Yes, we drop resolution information - but by introducing a suitable mix of convolutions and sampling layers we also force the network systematically to concentrate on combined large scale features, which in the end are really important for the image classification as a whole!

As sampling counterbalances an explosion of parameters we can invest into a growing number of feature maps with growing scales of covered image areas. I.e. we add more and new filters reacting to combinations of larger scale information.

Look at the second to last illustration: Assume that the 32 maps on layer L_N depicted there are the result of a sampling operation. The next convolution gathers new knowledge about more, namely 64 different *combinations* of filtered structures over a whole vertical stack of small filter areas located at the *same* position on the 32 maps of layer N. In the course of training the new information is condensed into 64 weight ensembles for the 64 maps on layer N+1.

We can think of multiple ways of combining Conv layers and pooling layers. A simple recipe for small images could be

**Layer 0:** Input layer (tensor of original image data, 3 color layers or one gray layer)
**Layer 1:** Conv layer (small 3x3 kernel, stride 1, 32 filters => 32 maps (26x26), analyzes 3x3 overlapping areas)
**Layer 2:** Pooling layer (2x2 max pooling => 32 maps (13x13), a node covers 4x4 non-overlapping areas on the original image)
**Layer 3:** Conv layer (3x3 kernel, stride 1, 64 filters => 64 maps (11x11), a node covers 8x8 overlapping areas on the original image (total effective stride 2))
**Layer 4:** Pooling layer (2x2 max pooling => 64 maps (5x5), a node covers 10x10 areas on the original image (total effective stride 5), some border info lost)
**Layer 5:** Conv layer (3x3 kernel, stride 1, 64 filters => 64 maps (3x3), a node covers 18x18 areas on the original image (effective stride 5), some border info lost)
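The sequence of map sizes in this recipe can be checked with a little bookkeeping (a sketch assuming "valid" convolutions without padding, as Keras would apply by default):

```python
def conv_size(n, kernel=3, stride=1):
    return (n - kernel) // stride + 1   # no padding ("valid" convolution)

def pool_size(n, size=2):
    return n // size                    # border info beyond full tiles is lost

n = 28                                  # MNIST image side length
sizes = []
for op in ["conv", "pool", "conv", "pool", "conv"]:
    n = conv_size(n) if op == "conv" else pool_size(n)
    sizes.append(n)
print(sizes)  # [26, 13, 11, 5, 3]
```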

The following picture illustrates the resulting successive combinations of nodes along one axis of a 28x28 image.

Note that I only indicated the connections to border nodes of the *Conv* filter areas.

The kernel size decides on the smallest structures we look at - especially via the first convolution. The sampling decides on the sequence of steadily growing areas which we then analyze for specific combinations of smaller structures.

Again: It is most of all the (down-) sampling which allows for an effective hierarchical information building over growing image areas! Actually we do not really drop information by sampling - instead we give the network a chance to collect and code new information on a higher, more abstract level (via a whole bunch of new weights).

The big advantages of the sampling layers get obvious:

- They reduce the numbers of required weights
- They reduce the amount of required memory - not only for weights but also for the output data, which must be saved for every layer, map and node.
- They reduce the CPU load for FW and BW propagation
- They also limit the risk of overfitting as some detail information is dropped.

Of course there are many other sequences of layers one could think about. E.g., we could combine 2 to 3 Conv layers before we apply a pooling layer. Such a layer sequence is characteristic of the VGG nets.

Just like MLPs, a CNN represents an *acyclic* graph, where the maps contain increasingly fewer nodes but where the number of maps per layer increases on average.

An interesting question, which seldom is answered in introductory books, is whether two totally independent training runs for a given CNN-architecture applied on the same input data will produce the same filters in the same order. We shall investigate this point in the forthcoming articles.

Another interesting point is: What does a CNN *see* at which convolution layer? What do the "features" (= basic structural elements) in an image which trigger a specific filter, look like?

If we could look into the output at some maps we could possibly see what filters do with the original image. And if we found a way to *construct* a structured image which triggers a specific filter then we could better understand what *patterns* the CNN reacts to. Examples for these different types of visualizations with respect to convolution in a CNN are objectives of this article series.

Today we covered a lot of "theory" on some aspects of CNNs. But we have a sufficiently solid basis regarding the structure and architecture now.

CNNs obviously have a much more complex structure than MLPs: They are deep in the sense of many sequential layers. And each convolutional layer has a complex structure in form of many parallel sub-layers (feature maps) itself. Feature maps are associated with filters, whose parameters (weights) get learned during the training. A map results from covering the original image or a map of a previous layer with small (overlapping) tiles of small filtering areas.

A mix of convolution and pooling layers allows for a look at detail features of the image in small areas in lower layers, whilst later layers can focus on feature combinations of larger image areas. The involved filters thus allow for the "awareness" of a hierarchy of features with translational invariance.

Pooling layers are important because they help to control the amount of weight parameters - and they enhance the effectiveness of detecting the most important feature correlations on larger image scales.

All nice and convincing - but the attentive reader will ask: **Where and how do we do the classification?**

Try to answer this question yourself first.

In the next article we shall build a concrete CNN and apply it to the MNIST dataset of images of handwritten digits. And whilst we do it I deliver the answer to the question posed above. Stay tuned ...

"Advanced Machine Learning with Python", John Hearty, 2016, Packt Publishing - See chapter 4.

"Deep Learning mit Python und Keras", Francois Chollet, 2018, mitp Verlag - See chapter 5.

"Hands-On Machine learning with SciKit-Learn, Keras & Tensorflow", 2nd edition, Aurelien Geron, 2019, O'Reilly - See chapter 14.

"Deep Learning mit Tensorflow, keras und Tensorflow.js", Matthieu Deru, Alassane Ndiaye, 2019, Rheinwerk Verlag, Bonn - see chapter 3

Of course we want to use the SMB protocol in a modern version, i.e. version 3.x (SMB3) over TCP/IP for this purpose (port 445). In addition we need some mechanism to detect SMB servers. In the old days NetBIOS was used for the latter. On the Linux side we had the nmbd-daemon for it - and we could set up a special Samba server as a WINS server.

Microsoft - via updates and new builds of Windows 10 - has during the last year followed a consistent policy of deactivating the use of SMBV1.0 systematically. This, however, led to problems - not only between Windows PCs, but also between Win 10 instances and Samba 4 servers. This article addresses one of these problems: the missing list of available Samba servers in the Windows Explorer.

There are many contributions on the Internet describing this problem and some even say that you only can solve it by restoring SMBV1 capabilities in Win 10 again. In this article I want to recommend two different solutions:

- Ignore the problem of Samba server detection and use your Samba shares on Win 10 with the SMB3 protocol as *network drives*.
- If you absolutely want to see and list your Samba servers in the Windows Explorer of a Win 10 client, use the "*Web-Service-Discovery*" service via a WSD-daemon provided by a Python script of Steffen Christgau.

I should say that I got on the right track of solving the named problem by an article of a guy called "Stilez". His article is the first one listed under the section "Links" below. I strongly recommend reading it; it is Stilez who deserves all credit for pointing out both the problem and the solution. I just applied his insight to my own situation with virtualized Samba servers based on Opensuse Leap 15.1.

SMB, especially version SMB1.0, is well known for security problems. Even MS has understood this - especially after the Wannacry disaster. See e.g. the links in the section "Links" => "Warnings of SMBV1" at the end of this article. MS has deactivated SMBV1 in the background via some updates of Win 8 and Win 10.

One of the resulting problems is that we do not see Samba servers in the Windows Explorer any longer. In the section "Network" of the Explorer you normally should see a list of servers which are members of a Workgroup and offer shares.

Two years ago it was clear that we would use NetBIOS's discovery protocol and a WINS server to get this information. Unfortunately, the NetBIOS service detection ability depends on SMB1 features. The stupid thing is that for a long while now we have had a relatively secure SMB2/3, but NetBIOS discovery only worked with SMBV1 enabled on the Windows client. Deactivating SMBV1 means deactivating NetBIOS at the same time - and if you watch your firewall logs for incoming packets from the Win 10 clients you will notice that exactly this happened on Win 10 clients.

This actually means that you can have a full featured Samba/NetBIOS setup on the Linux side, have opened the right ports on the firewalls for your Samba/WINS server and client systems, but still you will not get any display of available Samba servers on a Win 10's Explorer.

Having understood this leads to the key question for our problem:

By what did MS replace the detection features of NetBIOS in combination with SMB-services?

When you google a bit regarding the problem of a missing list of network servers in the Windows Explorer you find many hints regarding settings by which you activate network "discovery" functionalities via two Windows services. See

https://www.wintips.org/fix-windows-10-network-computers-not-showing/

https://winaero.com/blog/network-computers-not-visible-windows-10-version-1803/

You can follow these recommendations. If you want to see your own PC and other Windows systems in the Explorer's list of network resources you *must* have activated them (see below). However, on my Win 10 client the recommended settings were already activated - with the exception of SMBV1, which I do not wish to reactivate again. The "discovery" settings may directly help with other Windows systems, but they do not enable a listing of Samba 4 servers without additional measures.

Then we find another category of hints, which in my opinion are counter-productive regarding security. See https://devanswers.co/network-error-problem-windows-cannot-access-hostname-samba/

Why activate an insecure setting? Especially, as such a setting does not help with our special problem?

A last set of hints concerns the settings on the Samba server. I find it especially nice when the recommendations come from Microsoft. See: http://woshub.com/cannot-access-smb-network-shares-windows-10-1709/

[global]
    server min protocol = SMB2_10
    client max protocol = SMB3
    client min protocol = SMB2_10
    encrypt passwords = true
    restrict anonymous = 2

Well, these are kind hints. Thx MS - we Linux users were too stupid up to now to understand that we should *not* use SMBV1 .... But, actually, these hints are insufficient regarding the Explorer problem ...

Once you have understood that NetBIOS and SMBV1 still have an intimate relation (at least on the Windows systems) you may get the idea that there might exist an option to reactivate SMBV1 again on the Win 10 system. This is indeed possible. See here:

https://community.nethserver.org/t/windows-10-not-showing-servers-shares-in-network-browser/14263/4

https://www.wintips.org/fix-windows-10-network-computers-not-showing/

If you follow the advice of the authors and in addition re-open the standard ports for NetBIOS (UDP) 137, 138, (TCP) 139 on your firewalls between the Win 10 machine and your Samba servers you will - almost at once - get up the list of your accessible Samba servers in the Network section of the Win 10 Explorer. (Maybe you have to restart the smb and nmb services on your Linux machines).

**But:** You should not do this! SMBV1 should definitely become history!

Fortunately, we will find out that a re-activation of SMBV1 on a Win 10 system is NOT required to mount Samba shares on Win 10 and that it is not even necessary to get a list of Samba servers in the Explorer.

There are two service settings which are required to see other servers and also your own Win10 PC itself in the list of network hosts in the Windows explorer:

Start services.msc (Windows key + R => Enter "services.msc" in the dialog / or start it via the Control Panel => System and Security => Services)

- Look for "Function Discovery Provider Host" => Set : Startup Type => Automatic
- Look for "Function Discovery Resource Publication" => Set : Startup Type => Automatic (Delayed Start) !!

I noticed that on my VMware Win 10 guests the second setting appeared to be crucial to get the Win 10 PC itself listed among the network servers.

As you as a Linux user meanwhile have probably replaced all your virtualized Win 7 guests, you should use the following settings in the **[global]** section of the configuration file "**/etc/samba/smb.conf**" of your Samba servers:

[global]
    ...
    protocol = SMB3
    ...

That is what Win 10 supports; you need SMB2_10 only for some builds of Win 8 (???). Remember also that port 445 must be open on a firewall between the Win 10 client and your Samba server.

For Linux requirements to use SMB3 see

https://wiki.samba.org: SMB3 kernel status

For "SMB Direct" (RDMA) you normally need a kernel version > 4.16. On Opensuse Leap 15.1 most of the required kernel features have been backported. In Win 10 SMB Direct is normally activated; you find it in the "Windows Features" settings (https://www.windowscentral.com/how-manage-optional-features-windows-10).

Not seeing the Samba servers in the Win 10 Explorer - because the NetBIOS detection is defunct - does not mean that you cannot work with a Samba share on a Win 10 system. You can just "mount" it on Windows as a "**network drive**":

Open a Windows Explorer, choose **"This PC"** on the left side, then click **"Map network drive"** in the upper area of the window and follow the instructions:

You choose a free drive letter and provide the Samba server name and its share in the usual MS form as "\\SERVERNAME\SHARE".

Afterwards, you must activate the option "Connect using different credentials" in the dialog on the Win 10 side, if your Win 10 user for security reasons has a different UID and Password on the Samba server than on Win 10. Needless to say that this is a setting I strongly recommend - and of course we do not allow any direct anonymous or guest access to our Samba server without credentials delivered from a Windows machine (at least not without any central authentication systems).

So, you eventually must provide a valid Samba user name on your Samba server and the password - and there you happily go and use your resources on the Samba share from your Win 10 client.

I assumed of course that you have allowed access from the Win 10 host and the user by respective settings of "hosts allow" and "valid users" for the share in your Samba configuration.

Note: You need not mark the option for reconnecting the share in the Windows dialog for network drives if you only use the Samba exchange shares temporarily.

On an Opensuse system this works perfectly with the protocol settings for SMB3 on the server. So, you can use your shares even without seeing the samba server in the Explorer: You just have to know what your shares are named and on which Samba servers they are located. No problem for a Linux admin.

In my opinion this approach is the most secure one among all "peer to peer"-approaches which have to work without a central network wide authentication service. It only requires to open port 445 for the time of a Samba session to a specific Samba server. Otherwise you do not provide any information for free to the Win 10 system and its "users". (Well, an open question is what MS really does with the provided Samba credentials. But that is another story ....)

If you allow for some information sharing between your virtualized Win 10 and other KVM based virtual Samba machines in your LAN - and are not afraid of Microsoft or Antivirus companies on the Windows system to collect respective information - then there is a working option to get a stable list of the available Samba servers in the Windows Explorer - without the use of SMBV1.0.

Windows 10 implements web service detection via multiple mechanisms; among them: Multicast messages over the ports 3702 (UDP), 5357 (TCP) and 1900 (UDP). For a detection of Samba services you "only" need the ports 3702 (UDP) and 5357 (TCP). The general service detection port 1900 can remain closed in the firewalls between your Win 10 instances and your Linux world for our specific purpose. See

https://www.speedguide.net/port.php?port=5357

https://www.speedguide.net/port.php?port=3702

https://techcommunity.microsoft.com/t5/ask-the-performance-team/ws2008-the-wsd-port-monitor/ba-p/372760

https://en.wikipedia.org/wiki/Simple_Service_Discovery_Protocol

The mechanism using the ports 3702 and 5357 is called **"Web Service Discovery"** and was introduced by MS to cover the detection of printers and other devices in networks. In combination with SMB2 and SMB3 it is used today to detect SMB-services, too.
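For illustration: a WS-Discovery "Probe" is just a small SOAP message sent via UDP to the multicast group 239.255.255.250 on port 3702. A minimal Python sketch of the idea (strongly simplified; a real implementation like Steffen Christgau's wsdd script handles the answers, scopes, message types and much more):

```python
import socket
import uuid

WSD_MCAST_ADDR = "239.255.255.250"   # WS-Discovery multicast group
WSD_PORT = 3702                      # UDP port used by WS-Discovery

def build_probe() -> bytes:
    """Build a minimal WS-Discovery SOAP 'Probe' message (simplified)."""
    xml = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<soap:Envelope '
        'xmlns:soap="http://www.w3.org/2003/05/soap-envelope" '
        'xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" '
        'xmlns:wsd="http://schemas.xmlsoap.org/ws/2005/04/discovery">'
        '<soap:Header>'
        '<wsa:To>urn:schemas-xmlsoap-org:ws:2005:04:discovery</wsa:To>'
        '<wsa:Action>http://schemas.xmlsoap.org/ws/2005/04/discovery/Probe</wsa:Action>'
        f'<wsa:MessageID>urn:uuid:{uuid.uuid4()}</wsa:MessageID>'
        '</soap:Header>'
        '<soap:Body><wsd:Probe/></soap:Body>'
        '</soap:Envelope>'
    )
    return xml.encode("utf-8")

def send_probe():
    """Send the probe to the WS-Discovery multicast group via UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(build_probe(), (WSD_MCAST_ADDR, WSD_PORT))
    sock.close()
```

Listening hosts (like a Win 10 client or a WSDD daemon) answer such probes with "ProbeMatch" messages - this is how the Explorer's server list gets populated without NetBIOS.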

OK, do we have something like a counter-part available on a Linux system? Obviously, such a service is not (yet?) included in Samba 4 - at least not in the 4.9 version on my Opensuse system. WSD is not (yet?) a part of Samba - maybe for good reasons. See link.

One can understand the reservations and hesitation to include it as WSD also serves other purposes than just the detection of SMB services.

Fortunately, a guy named Steffen Christgau, has written an (interesting) Python 3 script, which offers you the basic WSD functionality. See https://github.com/christgau/wsdd.

You can use the script in form of a daemon process on a Linux system - hence we speak of WSD**D**.

Using YaST I quickly found out that a WSDD RPM package is actually included in my "Opensuse Leap 15.1 Update" repository. People with other Linux distros may download the present WSDD version from GitHub.

On Opensuse it comes with an associated systemd service-file which you find in the directory "/usr/lib/systemd/system".

[Unit]
Description=Web Services Dynamic Discovery host daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
AmbientCapabilities=CAP_SYS_CHROOT
PermissionsStartOnly=true
Environment=WSDD_ARGS=-p
ExecStartPre=/usr/lib/wsdd/wsdd-init.sh
EnvironmentFile=-/run/sysconfig/wsdd
ExecStart=/usr/sbin/wsdd --shortlog -c /run/wsdd $WSDD_ARGS
ExecStartPost=/usr/bin/rm /run/sysconfig/wsdd
User=wsdd
Group=wsdd

[Install]
WantedBy=multi-user.target

Reading the documentation you find out that the daemon runs chrooted - which is a reasonable security measure.

And, nicely, Opensuse provides an elementary configuration file in "**/etc/sysconfig/wsdd**".

I used the parameter

WSDD_WORKGROUP="MYWORKGROUP"

there to announce the right Workgroup for my (virtualized) Samba server.

So, I had everything ready to start WSDD by "rcwsdd start" (or by "systemctl start wsdd.service") on my Samba server.

On a local firewall *at the server* I opened

- port 445 (TCP) for SMB(3), in/out for the server and from/to the Win-10 client,
- port 3702 (UDP) for incoming packets to the server and outgoing packets from the server to the multicast address **239.255.255.250**,
- port 5357 (TCP), in/out for the server and from/to the Win 10 client.

**And I closed all NetBIOS ports (UDP 137, 138 / TCP 139) and stopped the "nmbd" service on the Samba server!**

Nevertheless, within a second or so, my Samba 4 server appeared in the Windows 10 Explorer!

**Further hints:**

As port 3702 is used with the UDP protocol, it should be regarded as potentially dangerous. See: https://blogs.akamai.com/sitr/2019/09/new-ddos-vector-observed-in-the-wild-wsd-attacks-hitting-35gbps.html

The port 1900 which appeared in the firewall logs does not seem to be important. I blocked it.

So far, so good. However, when I **refreshed** the list in the Win 10 Explorer, my Samba server disappeared again.

It took me a while to find out that the origin of this problem had to do with the fact that my virtualized server and my Win 10 client have (multiple) network interfaces on virtualized bridges (without loops in the network). It seems, however, that multiple broadcasts arrive at the server via the KVM bridge and are answered - and thus multiple return messages appear at the Win 10 client during a refresh - which Win 10 does not like (see the discussion in the following link).

https://github.com/christgau/wsdd/issues/8

When I restricted the answer of the server to exactly one bridged interface via the "/etc/sysconfig/wsdd"-configuration file with the parameter "WSDD_INTERFACES" everything went fine. Refreshes now lead to an immediate update including the Samba server.
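For reference, the relevant part of "/etc/sysconfig/wsdd" then looked roughly like the following sketch. The variable names are the ones discussed above; the interface name "br0" is just a placeholder for whatever bridged interface your setup uses:

```shell
# /etc/sysconfig/wsdd (excerpt)
# announce the correct workgroup of the Samba server
WSDD_WORKGROUP="MYWORKGROUP"
# restrict WSDD to exactly one (bridged) interface - "br0" is only an example
WSDD_INTERFACES="br0"
```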

So, be a little careful, when you have some complicated bridge structures associated with your virtualized VMware or KVM guests. The WSDD service should be limited to exactly one interface of the server.

Note: As we do not need NetBIOS any longer, block ports 137, 138 (UDP) and 139 (TCP) in your firewalls now! It made me feel better instantaneously.

The "end" of SMBV1 on Win 10 is a reasonable step. However, it undermines the visibility of Samba servers in the Windows Explorers. The reason is that NetBIOS requires SMB1.0 features on Windows. NetBIOS is/was therefore consistently deactivated on Win 10, too. The service detection on the network is replaced by the WSD service which was originally introduced for printer detection (and possibly other devices). Activating it on the Win 10 system may help with the detection of other Windows (8 and 10) systems on the network, but not with Samba 4 servers. Samba servers presently only serve NetBIOS requests of Win clients to allow for server and share detection. Therefore they will not be displayed in the Windows Explorer of a regular Win 10 client.

This does, however, not restrict the *usage* of Samba shares on the Win 10 client via the SMB3 protocol. They can be used as "network drives" just as before. Not distributing name and device information on a network has its advantages regarding security.

If you absolutely must see your Samba servers in the Win 10 Explorer, install and configure the WSDD package of Steffen Christgau. You can use it as a systemd service. You should restrict the interfaces WSDD gets attached to - especially if you have your servers on virtualization bridges (Linux bridges or VMware bridges).

So:

- Disable SMBV1 in Windows 10 if an update has not yet done it for you!
- Set the protocol in the Samba servers to SMBV3!
- Try to work with "network drives" on your Win 10 guests, only!
- Install, configure and use WSDD, if you really need to see your Samba servers in the Windows Explorer.
- Open port 445 (TCP, in/out between the Win 10 client and the server), port 3702 (UDP, out from the server and the Win 10 client to 239.255.255.250, in to the server from the Win 10 client / in to the Win 10 client from the server; rule details depend on the firewall location) and port 5357 (TCP, in/out between the Samba server and the Win 10 client) on your firewalls between the Samba server and the Win 10 system.
- Close the NetBIOS ports in your firewalls!
- You should also take care of stopping multicast messages leaving perimeter firewalls; normally packets to multicast addresses should not be routed, but blocking them explicitly for certain interfaces does no harm, either.

Of course you must repeat the WSDD and firewall setup for all your Samba servers. But as a Linux admin you have your tools for distributing common configuration files or copying virtualization setups.

**The real story**


https://forums.linuxmint.com/viewtopic.php?p=1799875

https://devanswers.co/discover-ubuntu-machines-samba-shares-windows-10-network/

https://bugs.launchpad.net/ubuntu/+source/samba/+bug/1831441

https://forums.opensuse.org/showthread.php/540083-Samba-Network-Device-Type-for-Windows-10

https://kofler.info/zugriff-auf-netzwerkverzeichnisse-mit-nautilus/

**WSDD and its problems**

https://github.com/christgau/wsdd

https://github.com/christgau/wsdd/issues/8

https://forums.opensuse.org/showthread.php/540083-Samba-Network-Device-Type-for-Windows-10

**Warnings of SMBV1**

https://docs.microsoft.com/de-de/windows-server/storage/file-server/troubleshoot/detect-enable-and-disable-smbv1-v2-v3

https://blog.malwarebytes.com/101/2018/12/how-threat-actors-are-using-smb-vulnerabilities/

https://securityboulevard.com/2018/12/whats-the-problem-with-smb-1-and-should-you-worry-about-smb-2-and-3/

https://techcommunity.microsoft.com/t5/storage-at-microsoft/stop-using-smb1/ba-p/425858

https://www.cubespotter.de/cubespotter/wannacry-nsa-exploits-und-das-maerchen-von-smbv1/

**Problems with Win 10 and shares**

https://social.technet.microsoft.com/Forums/en-US/cannot-connect-to-cifs-smb-samba-network-shares-amp-shared-folders-in-windows-10-after-update?forum=win10itpronetworking

**RDMA and SMB Direct**

https://searchstorage.techtarget.com/definition/Remote-Direct-Memory-Access

**Other settings in the SMB/Samba environment of minor relevance**

http://woshub.com/cannot-access-smb-network-shares-windows-10-1709/

https://superuser.com/questions/1466968/unable-to-connect-to-a-linux-samba-server-via-hostname-on-windows-10

https://superuser.com/questions/1522896/windows-10-cannot-connect-to-linux-samba-shares-except-from-smb1-cifs

https://www.reddit.com/r/techsupport/comments/3yevip/windows_10_cant_see_samba_shares/

https://devanswers.co/network-error-problem-windows-cannot-access-hostname-samba/


MLP, Numpy, TF2 – performance issues – Step II – bias neurons, F- or C- contiguous arrays and performance

MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation

In the last article we looked at the FW-propagation of the MLP code which I discussed in another article series. We found that the treatment of bias neurons in the input layer was technically inefficient due to a collision of C- and F-contiguous arrays. By circumventing the problem we could accelerate the FW-propagation of big batches (as the complete training or test data set) by more than a factor of 2.
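As a quick reminder of what C- vs. F-contiguity means, here is a minimal Numpy sketch (the array shape is an arbitrary example, not the exact MNIST dimensions):

```python
import numpy as np

# A C-contiguous array stores rows consecutively in memory,
# an F-contiguous array stores columns consecutively.
X = np.zeros((10000, 785), dtype=np.float32)
print(X.flags['C_CONTIGUOUS'])   # True - Numpy's default layout

# A transposed view flips the layout flags without copying any data
XT = X.T
print(XT.flags['F_CONTIGUOUS'])  # True

# Forcing a copy into the other layout is an explicit (and costly) operation
XF = np.asfortranarray(X)
print(XF.flags['F_CONTIGUOUS'])  # True
```

Operations which mix both layouts on big arrays can trigger exactly the kind of hidden overhead discussed in the last article.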

In this article I want to turn to the BW propagation and do some analysis regarding CPU consumption there. We will find a simple (and stupid) calculation step there which we shall replace. This will give us another 15% to 22% performance improvement in comparison to what we have reached in the last article for MNIST data:

- **9.6 secs** for 35 epochs and a batch-size of **500**, and
- **8.7 secs** for a batch-size of **20000**.

The central training of mini-batches is performed by the method "_handle_mini_batch()".

```python
    ''' -- Method to deal with a batch -- '''
    def _handle_mini_batch(self, num_batch=0, num_epoch=0, b_print_y_vals=False,
                           b_print=False, b_keep_bw_matrices=True):
        ''' .... '''
        # Layer-related lists to be filled with 2-dim Numpy matrices during FW propagation
        # ********************************************************************************
        li_Z_in_layer  = [None] * self._n_total_layers  # matrices with z-input values per layer; filled during FW-propagation
        li_A_out_layer = li_Z_in_layer.copy()  # matrices with results of the activation/output-functions per layer; filled during FW-propagation
        li_delta_out   = li_Z_in_layer.copy()  # matrix with out_delta-values at the outermost layer
        li_delta_layer = li_Z_in_layer.copy()  # matrices for the BW-propagated delta values
        li_D_layer     = li_Z_in_layer.copy()  # derivative matrices D with partial derivatives of the activation/output functions
        li_grad_layer  = li_Z_in_layer.copy()  # matrices with gradient values for weight corrections

        # Major steps for the mini-batch during one epoch iteration
        # **********************************************************
        #ts = time.perf_counter()

        # Step 0: List of indices for data records in the present mini-batch
        # ******
        ay_idx_batch = self._ay_mini_batches[num_batch]

        # Step 1: Special preparation of the Z-input to the MLP's input layer L0
        # ******
        #ts = time.perf_counter()
        # slicing
        li_Z_in_layer[0] = self._X_train[ay_idx_batch]  # numpy arrays can be indexed by an array of integers
        # transposition
        #~~~~~~~~~~~~~~
        li_Z_in_layer[0] = li_Z_in_layer[0].T
        #te = time.perf_counter(); t_batch = te - ts
        #print("\nti - transposed inputbatch =", t_batch)

        # Step 2: Call forward propagation method for the present mini-batch of training records
        # *******
        #tsa = time.perf_counter()
        self._fw_propagation(li_Z_in=li_Z_in_layer, li_A_out=li_A_out_layer)
        #tea = time.perf_counter(); ta = tea - tsa; print("ta - FW-propagation", "%10.8f"%ta)

        # Step 3: Cost calculation for the mini-batch
        # ********
        #tsb = time.perf_counter()
        ay_y_enc = self._ay_onehot[:, ay_idx_batch]
        ay_ANN_out = li_A_out_layer[self._n_total_layers - 1]
        total_costs_batch, rel_reg_contrib = self._calculate_loss_for_batch(ay_y_enc, ay_ANN_out, b_print=False)
        # we add the present cost value to the numpy array
        self._ay_costs[num_epoch, num_batch] = total_costs_batch
        self._ay_reg_cost_contrib[num_epoch, num_batch] = rel_reg_contrib
        #teb = time.perf_counter(); tb = teb - tsb; print("tb - cost calculation", "%10.8f"%tb)

        # Step 4: Avg-error for later plotting
        # ********
        #tsc = time.perf_counter()
        # mean "error" values - averaged over all nodes at the outermost layer and all data sets of a mini-batch
        ay_theta_out = ay_y_enc - ay_ANN_out
        ay_theta_avg = np.average(np.abs(ay_theta_out))
        self._ay_theta[num_epoch, num_batch] = ay_theta_avg
        #tec = time.perf_counter(); tc = tec - tsc; print("tc - error", "%10.8f"%tc)

        # Step 5: Perform gradient calculation via back propagation of errors
        # *******
        #tsd = time.perf_counter()
        self._bw_propagation(ay_y_enc=ay_y_enc,
                             li_Z_in=li_Z_in_layer,
                             li_A_out=li_A_out_layer,
                             li_delta_out=li_delta_out,
                             li_delta=li_delta_layer,
                             li_D=li_D_layer,
                             li_grad=li_grad_layer,
                             b_print=b_print,
                             b_internal_timing=False)
        #ted = time.perf_counter(); td = ted - tsd; print("td - BW propagation", "%10.8f"%td)

        # Step 6: Adjustment of weights
        # *******
        #tsf = time.perf_counter()
        rg_layer = range(0, self._n_total_layers - 1)
        for N in rg_layer:
            delta_w_N = self._learn_rate * li_grad_layer[N]
            self._li_w[N] -= (delta_w_N + (self._mom_rate * self._li_mom[N]))
            # save momentum
            self._li_mom[N] = delta_w_N
        #tef = time.perf_counter(); tf = tef - tsf; print("tf - weight correction", "%10.8f"%tf)

        return None
```

I took some time measurements there:

```
ti - transposed inputbatch = 0.0001785
ta - FW-propagation    0.00080975
tb - cost calculation  0.00030705
tc - error             0.00016182
td - BW propagation    0.00112558
tf - weight correction 0.00020079

ti - transposed inputbatch = 0.00018144
ta - FW-propagation    0.00082022
tb - cost calculation  0.00031284
tc - error             0.00016652
td - BW propagation    0.00106464
tf - weight correction 0.00019576
```

You see that the FW-propagation is a bit faster than the BW-propagation. This is a bit strange, as the FW-propagation is meanwhile dominated by a really expensive operation which we cannot accelerate (without choosing a new activation function): the calculation of the sigmoid values for the inputs at layer L1.

So let us look into the BW-propagation; the code for it is momentarily:

```python
    ''' -- Method to handle error BW propagation for a mini-batch -- '''
    def _bw_propagation(self, ay_y_enc, li_Z_in, li_A_out,
                        li_delta_out, li_delta, li_D, li_grad,
                        b_print=True, b_internal_timing=False):

        # List initialization: All parameter lists or arrays are filled or to be filled by layer operations
        # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch

        # Initiate BW propagation - provide delta-matrices for the outermost layer
        # ***********************
        tsa = time.perf_counter()
        # Input Z at outermost layer E (4 layers -> layer 3)
        ay_Z_E = li_Z_in[self._n_total_layers - 1]
        # Output A at outermost layer E (was calculated by the output function)
        ay_A_E = li_A_out[self._n_total_layers - 1]

        # Calculate D-matrix (derivative of the output function) at the outermost layer - presently only D_sigmoid
        ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print)
        #ay_D_E = ay_A_E * (1.0 - ay_A_E)

        # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
        ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E,
                                                             ay_D_E=ay_D_E, b_print=b_print)

        # add the matrices to their lists; li_delta_out gets only one element
        idxE = self._n_total_layers - 1
        li_delta_out[idxE] = ay_delta_out_E  # this happens only once
        li_delta[idxE] = ay_delta_E
        li_D[idxE] = ay_D_E
        li_grad[idxE] = None  # On the outermost layer there is no gradient!
        tea = time.perf_counter(); ta = tea - tsa; print("\nta-bp", "%10.8f"%ta)

        # Loop over all layers in reverse direction
        # ******************************************
        # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2)
        range_N_bw_layer = reversed(range(0, self._n_total_layers - 1))  # must be -1 as the last element is not taken

        # loop over layers
        tsb = time.perf_counter()
        for N in range_N_bw_layer:
            # Back propagation operations between layers N+1 and N
            # *******************************************************
            # this method handles the special treatment of bias nodes in Z_in, too
            tsib = time.perf_counter()
            ay_delta_N, ay_D_N, ay_grad_N = self._bw_prop_Np1_to_N(
                N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False
            )
            teib = time.perf_counter(); tib = teib - tsib; print("N = ", N, " tib-bp", "%10.8f"%tib)

            # add matrices to their lists
            #tsic = time.perf_counter()
            li_delta[N] = ay_delta_N
            li_D[N] = ay_D_N
            li_grad[N] = ay_grad_N
            #teic = time.perf_counter(); tic = teic - tsic; print("\nN = ", N, " tic = ", "%10.8f"%tic)
        teb = time.perf_counter(); tb = teb - tsb; print("tb-bp", "%10.8f"%tb)

        return
```

Typical timing results are:

```
ta-bp 0.00007112
N =  2  tib-bp 0.00025399
N =  1  tib-bp 0.00051683
N =  0  tib-bp 0.00035941
tb-bp 0.00126436

ta-bp 0.00007492
N =  2  tib-bp 0.00027644
N =  1  tib-bp 0.00090043
N =  0  tib-bp 0.00036728
tb-bp 0.00168378
```

We see that the CPU consumption of "_bw_prop_Np1_to_N()" should be analyzed in detail. It is relatively time consuming at every layer, but especially at layer L1. (The list adds are insignificant.)

What does this method presently look like?

```python
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
        '''
        BW-error-propagation between layer N+1 and N
        Version 1.5 - partially accelerated

        Inputs:
            li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
            li_A_out: List of output A-matrices - values were calculated during FW-propagation
            li_delta: List of delta-matrices - values for the outermost layer E down to layer N+1 should exist

        Returns:
            ay_delta_N - delta-matrix of layer N (required in subsequent steps)
            ay_D_N     - derivative matrix for the activation function on layer N
            ay_grad_N  - matrix with gradient elements of the cost function with respect to the weights on layer N
        '''

        # Prepare required quantities - and add a bias neuron to ay_Z_in
        # ****************************
        # Weight matrix mediating between layers N and N+1
        ay_W_N = self._li_w[N]
        # delta-matrix of layer N+1
        ay_delta_Np1 = li_delta[N+1]
        # fetch output value saved during FW propagation
        ay_A_N = li_A_out[N]

        # Optimization V1.5!
        if N > 0:
            #ts = time.perf_counter()
            ay_Z_N = li_Z_in[N]
            # !!! Add intermediate row (for bias) to Z_N !!!
            ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')
            #te = time.perf_counter(); t1 = te - ts; print("\nBW t1 = ", t1, " N = ", N)

            # Derivative matrix for the activation function (with extra bias node row)
            # ********************
            # can only be calculated now as we need the z-values
            #ts = time.perf_counter()
            ay_D_N = self._calculate_D_N(ay_Z_N)
            #te = time.perf_counter(); t2 = te - ts; print("\nBW t2 = ", t2, " N = ", N)

            # Propagate delta
            # **************
            # intermediate delta
            # ~~~~~~~~~~~~~~~~~~
            #ts = time.perf_counter()
            ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
            #te = time.perf_counter(); t3 = te - ts; print("\nBW t3 = ", t3)

            # final delta
            # ~~~~~~~~~~~
            #ts = time.perf_counter()
            ay_delta_N = ay_delta_w_N * ay_D_N
            # reduce dimension again
            # ****************************
            ay_delta_N = ay_delta_N[1:, :]
            #te = time.perf_counter(); t4 = te - ts; print("\nBW t4 = ", t4)
        else:
            ay_delta_N = None
            ay_D_N = None

        # Calculate gradient
        # ********************
        #ts = time.perf_counter()
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        #te = time.perf_counter(); t5 = te - ts; print("\nBW t5 = ", t5)

        # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums)
        #ts = time.perf_counter()
        if self._lambda2_reg > 0:
            ay_grad_N[:, 1:] += self._li_w[N][:, 1:] * self._lambda2_reg
        if self._lambda1_reg > 0:
            ay_grad_N[:, 1:] += np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg
        #te = time.perf_counter(); t6 = te - ts; print("\nBW t6 = ", t6)

        return ay_delta_N, ay_D_N, ay_grad_N
```

Timing data for a batch-size of **500** are:

```
N =  2  BW t1 =  0.0001169009999557602
N =  2  BW t2 =  0.00035331499998392246
N =  2  BW t3 =  0.00018078099992635543
        BW t4 =  0.00010234199999104021
        BW t5 =  9.928200006470433e-05
        BW t6 =  2.4267000071631628e-05
N =  2  tib-bp 0.00124414
N =  1  BW t1 =  0.0004323499999827618
N =  1  BW t2 =  0.000781415999881574
N =  1  BW t3 =  4.2077999978573644e-05
        BW t4 =  0.00022921000004316738
        BW t5 =  9.376399998473062e-05
        BW t6 =  0.00012183700005152787
N =  1  tib-bp 0.00216281
N =  0  BW t5 =  0.0004289769999559212
        BW t6 =  0.00015404999999191205
N =  0  tib-bp 0.00075249
....
N =  2  BW t1 =  0.00012802800006284087
N =  2  BW t2 =  0.00034988200013685855
N =  2  BW t3 =  0.0001854429999639251
        BW t4 =  0.00010359299994888715
        BW t5 =  0.00010210400000687514
        BW t6 =  2.4010999823076418e-05
N =  2  tib-bp 0.00125854
N =  1  BW t1 =  0.0004407169999467442
N =  1  BW t2 =  0.0007845899999665562
N =  1  BW t3 =  0.00025684100000944454
        BW t4 =  0.00012409999999363208
        BW t5 =  0.00010345399982725212
        BW t6 =  0.00012994100006835652
N =  1  tib-bp 0.00221321
N =  0  BW t5 =  0.00044504700008474174
        BW t6 =  0.00016473000005134963
N =  0  tib-bp 0.00071442
....
N =  2  BW t1 =  0.000292730999944979
N =  2  BW t2 =  0.001102525000078458
N =  2  BW t3 =  2.9429999813146424e-05
        BW t4 =  8.547999868824263e-06
        BW t5 =  3.554099998837046e-05
        BW t6 =  2.5041999833774753e-05
N =  2  tib-bp 0.00178565
N =  1  BW t1 =  3.143399999316898e-05
N =  1  BW t2 =  0.0006720640001276479
N =  1  BW t3 =  5.4785999964224175e-05
        BW t4 =  9.756200006449944e-05
        BW t5 =  0.0001605449999715347
        BW t6 =  1.8391000139672542e-05
N =  1  tib-bp 0.00147566
N =  0  BW t5 =  0.0003641810001226986
        BW t6 =  6.338999992294703e-05
N =  0  tib-bp 0.00046542
```

It seems that we should care about t1, t2, t3 for hidden layers and maybe about t5 at layers L1/L0.

However, for a **batch-size of 15000** things look a bit different:

```
N =  2  BW t1 =  0.0005776280000304723
N =  2  BW t2 =  0.004995969999981753
N =  2  BW t3 =  0.0003165199999557444
        BW t4 =  0.0005244750000201748
        BW t5 =  0.000518499999998312
        BW t6 =  2.2458999978880456e-05
N =  2  tib-bp 0.00736144
N =  1  BW t1 =  0.0010120430000029046
N =  1  BW t2 =  0.010797029000002567
N =  1  BW t3 =  0.0005006920000028003
        BW t4 =  0.0008704929999794331
        BW t5 =  0.0010805200000163495
        BW t6 =  3.0326000000968634e-05
N =  1  tib-bp 0.01463436
N =  0  BW t5 =  0.006987539000022025
        BW t6 =  0.00023552499999368592
N =  0  tib-bp 0.00730959

N =  2  BW t1 =  0.0006299790000525718
N =  2  BW t2 =  0.005081416999985322
N =  2  BW t3 =  0.00018547400003399162
        BW t4 =  0.0005970070000103078
        BW t5 =  0.000564008000026206
        BW t6 =  2.3311000006742688e-05
N =  2  tib-bp 0.00737899
N =  1  BW t1 =  0.0009376909999900818
N =  1  BW t2 =  0.010650266999959968
N =  1  BW t3 =  0.0005232729999988806
        BW t4 =  0.0009100700000317374
        BW t5 =  0.0011237720000281115
        BW t6 =  0.00016643800000792908
N =  1  tib-bp 0.01466144
N =  0  BW t5 =  0.006987463000029948
        BW t6 =  0.00023978600000873485
N =  0  tib-bp 0.00734308
```

For big batch-sizes **"t2"** dominates everything. It seems that we have found another code area which causes the trouble with big batch-sizes which we already observed before!

To keep an overview without looking into the code again, I briefly summarize which operations cause which of the measured time differences:

- "
**t1**" - which contributes for small batch-sizes stands for adding a bias neuron to the input data Z_in at each layer. - "
**t2**" - which is by far dominant for big batch sizes stands for calculating the derivative of the output/activation function (in our case of the sigmoid function) at the various layers. - "
**t3**" - which contributes at some layers stands for a dot()-matrix multiplication with the transposed weight-matrix, - "
**t4**" - covers an element-wise matrix-multiplication, - "
**t5**" - contributes at the BW-transition from layer L1 to L0 and covers the matrix multiplication there (including the full output matrix with the bias neurons at L0)

Why does the calculation of the derivative of the sigmoid function take so much time? Answer: Because I coded it stupidly! Just look at it:

```python
    ''' -- Method to calculate the matrix with the derivative values of the output function at the outermost layer '''
    def _calculate_D_N(self, ay_Z_N, b_print=False):
        '''
        This method calculates and returns the D-matrix for the outermost layer.
        The D matrix contains derivatives of the output function with respect to the local input "z_j"
        at the outermost nodes.

        Returns
        ------
        ay_D_E: Matrix with derivative values of the output function with respect to the local z_j values
                at the nodes of the outermost layer E
        Note: This is a 2-dim matrix over layer nodes and training samples of the mini-batch
        '''
        if self._my_out_func == 'sigmoid':
            ay_D_E = self._D_sigmoid(ay_Z=ay_Z_N)
        else:
            print("The derivative for output function " + self._my_out_func + " is not known yet!")
            sys.exit()
        return ay_D_E

    ''' -- method for the derivative of the sigmoid function -- '''
    def _D_sigmoid(self, ay_Z):
        '''
        Derivative of the sigmoid function with respect to Z-values
        - works via expit element-wise on matrices
        Input:  Z - matrix with input values for the activation function Phi() = sigmoid()
        Output: D - matrix with derivative values
        '''
        S_Z = self._sigmoid(ay_Z)
        return S_Z * (1.0 - S_Z)
```

We first call an intermediate function which then directs us to the right function for a chosen activation function. Well meant: So far we use only the sigmoid function, but it could e.g. also be the relu()- or tanh()-function. So, we did what we did for the sake of generalization. But we did it badly for two reasons:

- We did not keep up a function call pattern which we introduced in the FW-propagation.
- The calculation of the derivative is inefficient.

The first point is a minor one: During FW-propagation we called the right (!) activation function, i.e. the one we choose by input parameters to our ANN-object, by an *indirect* call. Why not do it the same way here? We would avoid an intermediate function call and keep up a pattern. Actually, we prepared the necessary definitions already in the __init__()-function.

The second point is relevant for performance: The derivative function produces the correct results for a given "ay_Z", but this is totally inefficient in our BW-situation. The code repeats a really expensive operation which we have already performed during FW-propagation: calling sigmoid(ay_Z) to get "A_out"-values per layer then. We even put the A_out-values [=sigmoid(ay_Z_in)] per layer and batch (!) with some foresight into a list in "li_A_out[]" at that point of the code (see the FW-propagation code discussed in the last article).

So, of course, we should use these "A_out"-values now in the BW-steps! No further comment .... you see what we need to do.

**Hint:** Actually, other activation functions "*act*(Z)", like e.g. the "tanh()"-function, also have derivatives which depend on "A = act(Z)" only. So, we should provide both Z and A via an interface to the derivative function and let the respective function take what it needs.
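Such an interface could, as a sketch, look like the following stand-alone functions (the names are illustrative, not the exact method names of my class). Both derivatives can be verified against a direct computation from Z:

```python
import numpy as np

def d_sigmoid_from_A(ay_Z, ay_A):
    # sigmoid'(z) = s(z) * (1 - s(z)) - only the saved A = sigmoid(Z) is needed
    return ay_A * (1.0 - ay_A)

def d_tanh_from_A(ay_Z, ay_A):
    # tanh'(z) = 1 - tanh(z)**2 - again only A = tanh(Z) is needed
    return 1.0 - ay_A * ay_A

# quick check against derivatives computed directly from Z
ay_Z = np.linspace(-4.0, 4.0, 9)
ay_A_sig = 1.0 / (1.0 + np.exp(-ay_Z))
ay_A_tanh = np.tanh(ay_Z)

assert np.allclose(d_sigmoid_from_A(ay_Z, ay_A_sig),
                   np.exp(-ay_Z) / (1.0 + np.exp(-ay_Z))**2)
assert np.allclose(d_tanh_from_A(ay_Z, ay_A_tanh),
                   1.0 - np.tanh(ay_Z)**2)
```

Passing Z alongside A keeps the interface uniform for activation functions whose derivatives really do require Z.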

But, my insight into my own dumbness gets worse.

Why did we need a bias-neuron operation? Answer: We do not need it! It was only introduced due to insufficient cleverness. In the article

I have already indicated that we use the function for adding a row of bias-neurons again *only* to compensate one deficit: The matrix of the derivative values did not fit the shape of the weight matrix for the required element-wise operations. However, I also said: There probably is an alternative.

Well, let me make a long story short: The steps behind t1 up to t4 to calculate "ay_delta_N" for the present layer L_N (with N>=1) can be compressed into two relatively simple lines:

```python
ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
ay_delta_N = ay_delta_w_N[1:, :] * ay_A_N[1:, :] * (1.0 - ay_A_N[1:, :]); ay_D_N = None
```

No bias back and forth corrections! Instead we use simple *slicing* to compensate for our weight matrices with a shape covering an extra row of bias node output. No Z-based derivative calculation; no sigmoid(Z)-call. The last statement is only required to support the present output interface. Think it through in detail; the shortcut does not cause any harm.
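To convince yourself, here is a small toy check with random matrices; the dimensions are arbitrary, and I assume, as in the code above, that the bias rows contain ones and that the activation is the sigmoid function:

```python
import numpy as np

rng = np.random.default_rng(42)
n_Np1, n_N, batch = 3, 4, 5                      # node numbers (excl. bias) and batch size

ay_W_N = rng.standard_normal((n_Np1, n_N + 1))   # weights incl. the bias column
ay_delta_Np1 = rng.standard_normal((n_Np1, batch))
ay_Z_N = rng.standard_normal((n_N, batch))
# A-values as saved during FW-propagation - with a bias row of ones on top
ay_A_N = np.vstack([np.ones((1, batch)), 1.0 / (1.0 + np.exp(-ay_Z_N))])

# old way: add a bias row to Z, build the full D-matrix, multiply, strip the bias row again
ay_Z_full = np.vstack([np.ones((1, batch)), ay_Z_N])
S = 1.0 / (1.0 + np.exp(-ay_Z_full))
ay_D_full = S * (1.0 - S)
ay_delta_old = (ay_W_N.T.dot(ay_delta_Np1) * ay_D_full)[1:, :]

# new way: slice first, derive D directly from the saved A-values
ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
ay_delta_new = ay_delta_w_N[1:, :] * ay_A_N[1:, :] * (1.0 - ay_A_N[1:, :])

assert np.allclose(ay_delta_old, ay_delta_new)
```

The stripped bias row of the old variant never enters the result, which is why the sliced version is numerically identical while avoiding both the bias-row construction and the sigmoid(Z) recalculation.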

Before we bring the code into a new consolidated form with re-coded methods let us see what we gain by just changing the code to the two lines given above in terms of CPU time and performance. Our function "_bw_prop_Np1_to_N()" then gets reduced to the following lines:

```python
    ''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
    def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):

        # Weight matrix mediating between layers N and N+1
        ay_W_N = self._li_w[N]
        ay_delta_Np1 = li_delta[N+1]
        # fetch output value saved during FW propagation
        ay_A_N = li_A_out[N]

        # Optimization from previous version
        if N > 0:
            #ts = time.perf_counter()
            ay_Z_N = li_Z_in[N]

            # Propagate delta
            # ~~~~~~~~~~~~~~~~~
            ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
            ay_delta_N = ay_delta_w_N[1:, :] * ay_A_N[1:, :] * (1.0 - ay_A_N[1:, :])
            ay_D_N = None
        else:
            ay_delta_N = None
            ay_D_N = None

        # Calculate gradient
        # ********************
        ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)
        if self._lambda2_reg > 0:
            ay_grad_N[:, 1:] += self._li_w[N][:, 1:] * self._lambda2_reg
        if self._lambda1_reg > 0:
            ay_grad_N[:, 1:] += np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg

        return ay_delta_N, ay_D_N, ay_grad_N
```

What run times do we get with this setting? We perform our typical test runs over 35 epochs - but this time for two different batch-sizes:

**Batch-size = 500**

```
------------------
Starting epoch 35
Time_CPU for epoch 35   0.2169024469985743
Total CPU-time:         7.52385053600301

learning rate = 0.0009994051838157095

total costs of training set       = -1.0
rel. reg. contrib. to total costs = -1.0

total costs of last mini_batch    = 65.43618
rel. reg. contrib. to batch costs = 0.12302863

mean abs weight at L0 : -10.0
mean abs weight at L1 : -10.0
mean abs weight at L2 : -10.0

avg total error of last mini_batch = 0.00758
presently batch averaged accuracy  = 0.99272
-------------------
Total training Time_CPU: 7.5257336139984545
```

Not bad! We became **faster by around 2 secs** compared to the results of the last article! This is close to an improvement of **20%**.

But what about big batch sizes? Here is the result for a relatively big batch size:

**Batch-size = 20000**

```
------------------
Starting epoch 35
Time_CPU for epoch 35   0.2019189490019926
Total CPU-time:         6.716679593999288

learning rate = 9.994051838157101e-05

total costs of training set       = -1.0
rel. reg. contrib. to total costs = -1.0

total costs of last mini_batch    = 13028.141
rel. reg. contrib. to batch costs = 0.00021923862

mean abs weight at L0 : -10.0
mean abs weight at L1 : -10.0
mean abs weight at L2 : -10.0

avg total error of last mini_batch = 0.04389
presently batch averaged accuracy  = 0.95602
-------------------
Total training Time_CPU: 6.716954112998792
```

Again an acceleration by roughly **2 secs** - corresponding to an improvement of **22%**!

In both cases I took the best result out of three runs.

Enough for today! We have made a major step with regard to performance optimization in the BW-propagation, too. It remains to re-code the derivative calculation in a form which uses indirect function calls to remain flexible. I shall give you the code in the next article.

What we learned today is that we should, of course, reuse the results of the FW-propagation, and that it is indeed a good investment to save the output data per layer in a Python list or some other suitable structure during FW-propagation. We also saw again that the bias neurons can be treated more efficiently than provisioned so far.

All in all we have meanwhile gained **more than a factor of 6.5** in performance since we started with optimization. Our new standard values are **7.3 secs** and **6.8 secs** for 35 epochs on MNIST data and batch sizes of **500** and **20000**, respectively.

We have reached the order of what Keras and TF2 can deliver on a CPU for big batch sizes. For small batch sizes we are already faster. This indicates that we have done no bad job so far ...

In the next article we shall look a bit at the matrix operations and evaluate further optimization options.

A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

In the last article we achieved a first performance boost by two simple measures:

- We set all major arrays to the "numpy.float32" data type instead of the default "float64".
- In addition we eliminated superfluous parts of the backward [BW] propagation between the first hidden layer and the input layer.

This brought us already down to around

**11 secs** for 35 epochs on the MNIST dataset, a batch-size of 500 and an accuracy around 99 % on the training set

This was close to what Keras (and TF2) delivered for the same batch size. It marks the baseline for further performance improvements of our MLP code.

Can we get better than 11 secs for 35 epochs? The answer is: Yes, we can - but only in small steps. So, do not expect any gigantic performance leaps for the training loop itself. But there was and is also our observation that there is no significant acceleration for batch sizes growing beyond 1000 - whereas with Keras we saw such an acceleration.

In this article I shall briefly discuss why we should care about big batch sizes - at least in combination with FW-propagation. Afterwards I want to draw your attention to a specific code segment of our MLP. We shall see that an astonishingly simple array operation dominates the CPU time of our straightforwardly coded FW-propagation - especially for big batch sizes!

Actually, it is an operation I would never have guessed to be such an obstacle to efficiency if somebody had asked me. As a naive Python beginner I had to learn that the arrangement of arrays in the computer's memory sometimes has an impact - especially when big arrays are involved. To get to this generally useful insight we will have to invest some effort into performance tests of some specific Numpy operations on arrays. The results give us some options for possible performance improvements; but in the end we shall *circumvent* the impediment altogether.

The discussion will indicate that we should change our treatment of bias neurons fundamentally. We shall only go a preliminary step in this direction here. This step alone will give us a **15%** improvement of the training time. But even more important: it will reward us with a *significant* speed-up of the FW-propagation of really big batches.

"np." abbreviates the "Numpy" library below. I shall sometimes speak of our 2-dimensional Numpy "*arrays*" as "*matrices*" in an almost synonymous way. See, however, one of the links at the bottom of the article for the subtle differences of the related data types. For the time being we can safely ignore the mathematical differences between matrices, stacks of matrices and tensors. But we should have a clear understanding of the profound difference between the element-wise "*"-operation and the "**np.dot()**"-operation on 2-dimensional arrays.
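To make the distinction concrete, here is a tiny (made-up) example of both operations on 2-dimensional float32 arrays:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]], dtype=np.float32)
B = np.array([[10., 20.], [30., 40.]], dtype=np.float32)

# "*" works element-wise: each entry is multiplied with its counterpart
elementwise = A * B        # [[10., 40.], [90., 160.]]

# np.dot() performs a true matrix multiplication (rows of A times columns of B)
matmul = np.dot(A, B)      # [[70., 100.], [150., 220.]]
```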

There are several reasons why we should care about an efficient treatment of big batches. I name a few, only:

- Numpy operations on bigger matrices may become more efficient on systems with multiple CPUs, CPU cores or multiple GPUs.
- Big batch sizes together with a relatively small learning rate will lead to a smoother descent path on the cost hyperplane. This could become important in some intricate real-life scenarios beyond MNIST.
- We should test the achieved accuracy on evaluation and test datasets during training. These datasets may have a much bigger size than the training batches.

The last point addresses the problem of overfitting: We may approach a minimum of the loss function of the *training* data set, but may leave the minimum of the cost function (and of related errors) of the *test* data set at some point. Therefore, we should check the accuracy on evaluation and test data sets already during the training phase. This requires the FW-propagation of such sets - preferably in one sweep. I.e. we talk about the propagation of really big batches with 10000 samples or more.

How do we measure the accuracy? Regarding the *training set* we gather averaged errors of batches during the training run and determine the related accuracy at the end of every printout period via an average over all batches: The average is taken over the absolute values of the difference between the sigmoidal output and the one-hot encoded target values of the batch samples. Note that this will give us slightly different values than tests where Numpy.argmax() is applied to the output first.

We can verify the accuracy also on the complete training and test datasets; often we will do so after each and every epoch. In that case we involve argmax() to get numbers in terms of correctly classified samples.
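A small sketch - with made-up output and target arrays, not real MNIST data - illustrates why the two measures give slightly different values:

```python
import numpy as np

# made-up one-hot targets and sigmoidal outputs for 3 samples and 4 classes
T = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=np.float32)
A_out = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.2, 0.7, 0.1, 0.0],
                  [0.1, 0.0, 0.2, 0.8]], dtype=np.float32)

# measure 1: averaged absolute deviation between sigmoidal output and one-hot targets
avg_err = float(np.mean(np.abs(A_out - T)))
acc_err_based = 1.0 - avg_err     # roughly 0.89 for this toy data

# measure 2: fraction of samples whose argmax matches the target class
acc_argmax = float(np.mean(np.argmax(A_out, axis=1) == np.argmax(T, axis=1)))
```

Here all three samples are classified correctly by argmax() (accuracy 1.0), while the error-based measure stays below 1.0 because the sigmoidal outputs never match the one-hot targets exactly.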

We saw that the forward [FW] propagation of the complete training dataset "X_train" in one sweep requires a substantial (!) amount of CPU time in the present state of our code. When we perform such a test at each and every epoch on the training set the pure training time is prolonged **by roughly a factor of 1.75**. As said: In real-life scenarios we would rather - or in addition - perform full accuracy tests on prepared evaluation and test datasets; but they are big "batches" as well.

So, one relevant question is: Can we reduce the time required for a forward [FW] propagation of complete training and test data sets in one vectorized sweep?

The present code for the FW-propagation of a mini-batch through my MLP comprises the following statements - enriched below by some lines to measure the required CPU-time:

```python
''' -- Method to handle FW propagation for a mini-batch -- '''
def _fw_propagation(self, li_Z_in, li_A_out):
    '''
    Parameters:
        li_Z_in :  list of input values at all layers - li_Z_in[0] is already filled -
                   other elements are to be filled during FW-propagation
        li_A_out:  list of output values at all layers - to be filled during FW-propagation
    '''
    # index range for all layers
    # Note that we count from 0 (0 => L0) to E (E => LE)
    # Careful: during BW-propagation we need a clear indexing of the lists filled during FW-propagation
    ilayer = range(0, self._n_total_layers-1)

    # propagation loop
    # ***************
    for il in ilayer:
        # Step 1: Take input of last layer and apply activation function
        # ******
        ts = time.perf_counter()
        if il == 0:
            A_out_il = li_Z_in[il]                    # L0: activation function is identity !!!
        else:
            A_out_il = self._act_func( li_Z_in[il] )  # use defined activation function (e.g. sigmoid)
        te = time.perf_counter(); ta = te - ts
        print("\nta = ", ta, " shape = ", A_out_il.shape, " type = ", A_out_il.dtype,
              " A_out flags = ", A_out_il.flags)

        # Step 2: Add bias node
        # ******
        ts = time.perf_counter()
        A_out_il = self._add_bias_neuron_to_layer(A_out_il, 'row')
        li_A_out[il] = A_out_il
        te = time.perf_counter(); tb = te - ts
        print("tb = ", tb, " shape = ", A_out_il.shape, " type = ", A_out_il.dtype)

        # Step 3: Propagate by matrix operation
        # ******
        ts = time.perf_counter()
        Z_in_ilp1 = np.dot(self._li_w[il], A_out_il)
        li_Z_in[il+1] = Z_in_ilp1
        te = time.perf_counter(); tc = te - ts
        print("tc = ", tc, " shape = ", li_Z_in[il+1].shape, " type = ", li_Z_in[il+1].dtype)

    # treatment of the last layer
    # ***************************
    ts = time.perf_counter()
    il = il + 1
    A_out_il = self._out_func( li_Z_in[il] )    # use the defined output function (e.g. sigmoid)
    li_A_out[il] = A_out_il
    te = time.perf_counter(); tf = te - ts
    print("\ntf = ", tf)
    return None
```

The attentive reader notices that I also included statements to print out information about the shape and the so-called "flags" of the involved arrays.

I give you some typical CPU times for the MNIST dataset first. Characteristics of the test runs were:

- data were taken during the first two epochs;
- the batch-size was **10000**; i.e. we processed 6 batches per epoch;
- "ta, tb, tc, tf" are representative data for a *single* batch comprising 10000 MNIST samples.

Averaged timing results for such batches are:

```
Layer L0:  ta = 2.6999987312592566e-07   tb = 0.013209896002081223    tc = 0.004847299001994543
Layer L1:  ta = 0.005858420001459308     tb = 0.0005839099976583384   tc = 0.00040631899901200086
Layer L2:  ta = 0.0025550600003043655    tb = 0.00026626299950294197  tc = 0.00022965300013311207
Layer L3:  tf = 0.0008438359982392285
```

Such CPU time data vary of course a bit (2%) with the background activity on my machine and with the present batch, but the basic message remains the same. When I first saw it I could not believe it:

Adding a bias-neuron to the input layer obviously dominated the CPU-consumption during forward propagation. Not the matrix multiplication at the input layer L0!

I should add at this point that the problem increases with growing batch size! (We shall see this later in elementary tests, too.) This means that propagating the complete training or test dataset for an accuracy check at each epoch will cost us an enormous amount of CPU time - as we have indeed seen in the last article. Performing a full propagation for an accuracy test at the end of each and every epoch increased the total CPU time roughly **by a factor of 1.68** (19 secs vs. 11.33 secs for 35 epochs; see the last article).

I first wanted to know, of course, whether my specific method of adding a bias neuron to the A-output matrix at each layer really was so expensive. My naive approach - following a suggestion in a book of S. Raschka, by the way - was:

```python
def add_bias_neuron_to_layer(A, how='column'):
    if how == 'column':
        A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
        A_new[:, 1:] = A
    elif how == 'row':
        A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
        A_new[1:, :] = A
    return A_new
```

What we do here is to create a new array which is bigger by one row (or column) and fit the original array into it. It seemed to be a clever approach at the time of coding - and actually it is faster than using np.vstack or np.hstack. The operation is different from directly appending a row to the existing input array explicitly, but it still requires a lot of row operations.
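For the record, here is a minimal cross-check of that claim: both variants below deliver identical (785, 10000) arrays, and the timings (which of course vary by machine) can be compared directly.

```python
import numpy as np
import time

A = np.random.random_sample((784, 10000)).astype(np.float32)

# variant 1: preallocate an array of ones with one extra row and copy A into it
ts = time.perf_counter()
A1 = np.ones((A.shape[0] + 1, A.shape[1]), dtype=np.float32)
A1[1:, :] = A
t_prealloc = time.perf_counter() - ts

# variant 2: stack a row of ones on top via np.vstack
ts = time.perf_counter()
A2 = np.vstack((np.ones((1, A.shape[1]), dtype=np.float32), A))
t_vstack = time.perf_counter() - ts

print("prealloc: %10.8f   vstack: %10.8f" % (t_prealloc, t_vstack))
```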

As we have seen I call this function in "_fw_propagation()" by

A_out_il = self._add_bias_neuron_to_layer(A_out_il, 'row')

"A_out_il" is the *transposition* of a slice of the original X_train array. The slice in our test case for MNIST had a shape of (10000, 784).

This means that we talk about a matrix with shape (784, 10000) in the case of the MNIST dataset *before* adding the bias neuron and a shape of (785, 10000) *after*. I.e. we add a row with 10000 constant entries at the beginning of our transposed slice. Note also that the function returns a **new** array in memory.

Thus, our approach contains two possibly costly operations. Why did we do such a strange thing in the first place?

Well, when we coded the MLP it seemed to be a good idea to include the fact that we have bias neurons directly in the definition of the weight matrices and their shapes. So, we need(ed) to fit our input matrices at the layers to the defined shape of the weight matrices. As we see it now, this is a questionable strategy regarding performance. But, well, let us not attack something at the very center of the MLP code *for all layers* (except the output layer) at this point in time. We shall do this in a forthcoming article.

To understand my performance problem a bit better, I did the following test in a Jupyter cell:

```python
''' Method to add values for a bias neuron to A_out - all with C-cont. arrays '''
def add_bias_neuron_to_layer_C(A, how='column'):
    if how == 'column':
        A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
        A_new[:, 1:] = A
    elif how == 'row':
        A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
        A_new[1:, :] = A
    return A_new

input_shape = (784, 10000)
ay_inpC = np.array(np.random.random_sample(input_shape)*2.0, dtype=np.float32)
li_A = []   # the list used below must exist in the cell

tx = time.perf_counter()
ay_inpCb = add_bias_neuron_to_layer_C(ay_inpC, 'row')
li_A.append(ay_inpCb)
ty = time.perf_counter(); t_biasC = ty - tx
print("\n bias time = ", "%10.8f"%t_biasC)
print("shape_biased = ", ay_inpCb.shape)
```

to get:

bias time = 0.00423444

Same batch-size, but **substantially faster** - by roughly a factor of 3! - compared to what my MLP code delivered. Actually the timing data varied a bit between 0.0038 and 0.0045 (with an average of 0.0042) when repeating the run. To exclude any problems with calling the function from within a Python class I repeated the same test inside the class "MyANN" during FW-propagation - with the same result (as it should be; see the first link at the end of this article).

So: Applying one and the same function on a randomly filled array was much faster than applying it on my Numpy (input) array "A_out_il" (with the same shape). **????**

It took me a while to find the reason: "A_out_il" is the result of a matrix *transposition*. In Numpy this corresponds to a certain view on the original array data - but this still has major consequences for the handling of the data:

A 2-dimensional array or matrix is an *ordered*, addressable sequence of data in the computer's memory. Now, if you yourself had to program an array representation in memory on a basic level you would - for performance reasons - make a choice whether you arrange data row-wise or column-wise. And you would program functions for array operations with your chosen "order" in mind!

Actually, if you google a bit you find that the two ways of arranging array or matrix data are both well established. In connection with Numpy we speak of either a **C-contiguous order** or a **F-contiguous order** of the array data. In the first case (C) data are stored and addressed row by row and can be read efficiently this way, in the other (F) case data are arranged column by column. By the way: The "C" refers to the C-language, the "F" to Fortran.

On a Linux system Numpy normally creates and operates with C-contiguous arrays - except when you ask Numpy explicitly to work differently. Quite many array related functions, therefore, have a parameter "order", which you can set to either 'C' or 'F'.
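A short check in a Python shell confirms both statements: np.ones() delivers C-order by default, the "order" parameter switches to F-order, and a transposition just flips the flags of the resulting view:

```python
import numpy as np

a = np.ones((3, 4), dtype=np.float32)       # Numpy's default is order='C'
print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])      # True False

at = a.T                                    # a view: no data are moved in memory
print(at.flags['C_CONTIGUOUS'], at.flags['F_CONTIGUOUS'])    # False True

f = np.ones((3, 4), order='F', dtype=np.float32)
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])      # False True
```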

Now, let us assume that we have a C-contiguous array. What happens when we transpose it - or look at it in a transposed way? Well, logically it then becomes F-contiguous! Then our "A_out_il" would be seen as F-contiguous. Could this in turn have an impact on performance? Well, I create "A_out_il" in method "_handle_mini_batch()" of my MyANN-class via

```python
# Step 0: List of indices for data records in the present mini-batch
# ******
ay_idx_batch = self._ay_mini_batches[num_batch]

# Step 1: Special preparation of the Z-input to the MLP's input Layer L0
# ******
# Layer L0: Fill in the input vector for the ANN's input layer L0
li_Z_in_layer[0] = self._X_train[ay_idx_batch]   # numpy arrays can be indexed by an array of integers
li_Z_in_layer[0] = li_Z_in_layer[0].T
...
```

Hm, pretty simple. But then, what happens if we perform our rather special adding of the bias-neuron row-wise, as we logically are forced to? Remember, the array originally had a shape of (10000, 784) and after transposing a shape of (784, 10000), i.e. the columns then represent the samples of the mini-batch. Well, instead of inserting a row of 10000 data contiguously into memory in one sweep into a C-contiguous array, we must hop to the end of each contiguous column of the F-contiguous array "A_out_il" in memory and add one element there. Even if you optimized this, there would be many more addresses and steps involved. It can't become efficient ...
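The "strides" of an array make this hopping visible: they tell us how many bytes Numpy must jump in memory to reach the next element along each axis. For a float32 batch array (itemsize 4 bytes):

```python
import numpy as np

a = np.zeros((10000, 784), dtype=np.float32)   # C-contiguous, like a mini-batch slice
print(a.strides)       # (3136, 4): the 784 elements of one row lie 4 bytes apart

at = a.T               # the transposed view, shape (784, 10000)
print(at.strides)      # (4, 3136): neighbours within one "row" are 3136 bytes apart
```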

How can we see which order an array or a view onto it follows? We just have to print its "**flags**". And I indeed got:

```
flags of li_Z_in[0] =
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
```

Let us extend the tests of our function in the Jupyter cell in the following way to cover a variety of options related to our method of adding bias neurons:

```python
# The bias neuron problem
# ************************
import numpy as np
import scipy
from scipy.special import expit
import time

''' Method to add values for a bias neuron to A_out - by creating a new C-cont. array '''
def add_bias_neuron_to_layer_C(A, how='column'):
    if how == 'column':
        A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
        A_new[:, 1:] = A
    elif how == 'row':
        A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
        A_new[1:, :] = A
    return A_new

''' Method to add values for a bias neuron to A_out - by creating a new F-cont. array '''
def add_bias_neuron_to_layer_F(A, how='column'):
    if how == 'column':
        A_new = np.ones((A.shape[0], A.shape[1]+1), order='F', dtype=np.float32)
        A_new[:, 1:] = A
    elif how == 'row':
        A_new = np.ones((A.shape[0]+1, A.shape[1]), order='F', dtype=np.float32)
        A_new[1:, :] = A
    return A_new

rg_j = range(50)
li_A = []
t_1 = 0.0; t_2 = 0.0; t_3 = 0.0; t_4 = 0.0; t_5 = 0.0; t_6 = 0.0; t_7 = 0.0; t_8 = 0.0

# two types of input shapes
input_shape1 = (784, 10000)
input_shape2 = (10000, 784)

for j in rg_j:
    # For test 1: C-cont. array with shape (784, 10000)
    # in a MLP program delivering X_train as (10000, 784) we would have to (re-)create it
    # explicitly with the C-order (np.copy or np.asarray)
    ay_inpC = np.array(np.random.random_sample(input_shape1)*2.0, order='C', dtype=np.float32)

    # For test 2: C-cont. array with shape (10000, 784) as it typically is given by a slice
    # of the original X_train
    ay_inpC2 = np.array(np.random.random_sample(input_shape2)*2.0, order='C', dtype=np.float32)

    # For tests 3 and 4: transposition - this corresponds to the MLP code
    ay_inpF = ay_inpC2.T

    # For test 5: if the original X_train or mini-batch data are somehow given in F-cont. form,
    # then inpF3 below would hopefully be in C-cont. form
    ay_inpF2 = np.array(np.random.random_sample(input_shape2)*2.0, order='F', dtype=np.float32)

    # For test 6
    ay_inpF3 = ay_inpF2.T

    # Test 1: C-cont. input to add_bias_neuron_to_layer_C - with a shape that fits already
    # ******
    tx = time.perf_counter()
    ay_Cb = add_bias_neuron_to_layer_C(ay_inpC, 'row')
    li_A.append(ay_Cb)
    ty = time.perf_counter(); t_1 += ty - tx

    # Test 2: Standard C-cont. input to add_bias_neuron_to_layer_C - but col.-operation due to shape
    # ******
    tx = time.perf_counter()
    ay_C2b = add_bias_neuron_to_layer_C(ay_inpC2, 'column')
    li_A.append(ay_C2b)
    ty = time.perf_counter(); t_2 += ty - tx

    # Test 3: F-cont. input to add_bias_neuron_to_layer_C (!) - but row-operation due to shape
    # ****** will give us a C-cont. output array which later is used in np.dot() on the left side
    tx = time.perf_counter()
    ay_C3b = add_bias_neuron_to_layer_C(ay_inpF, 'row')
    li_A.append(ay_C3b)
    ty = time.perf_counter(); t_3 += ty - tx

    # Test 4: F-cont. input to add_bias_neuron_to_layer_F (!) - but row-operation due to shape
    # ****** will give us a F-cont. output array which later is used in np.dot() on the left side
    tx = time.perf_counter()
    ay_F4b = add_bias_neuron_to_layer_F(ay_inpF, 'row')
    li_A.append(ay_F4b)
    ty = time.perf_counter(); t_4 += ty - tx

    # Test 5: F-cont. input to add_bias_neuron_to_layer_F (!) - but col-operation due to shape
    # ****** will give us a F-cont. output array with wrong shape for weight matrix
    tx = time.perf_counter()
    ay_F5b = add_bias_neuron_to_layer_F(ay_inpF2, 'column')
    li_A.append(ay_F5b)
    ty = time.perf_counter(); t_5 += ty - tx

    # Test 6: C-cont. input to add_bias_neuron_to_layer_C (!) - row-operation due to shape
    # ****** will give us a C-cont. output array with wrong shape for weight matrix
    tx = time.perf_counter()
    ay_C6b = add_bias_neuron_to_layer_C(ay_inpF3, 'row')
    li_A.append(ay_C6b)
    ty = time.perf_counter(); t_6 += ty - tx

    # Test 7: C-cont. input to add_bias_neuron_to_layer_F (!) - col-operation due to shape
    # ****** will give us a F-cont. output array with wrong shape for weight matrix
    tx = time.perf_counter()
    ay_F7b = add_bias_neuron_to_layer_F(ay_inpC2, 'column')
    li_A.append(ay_F7b)
    ty = time.perf_counter(); t_7 += ty - tx

print("\nTest 1: bias time C-cont./row with add_.._C() => ", "%10.8f"%t_1)
print("shape of ay_Cb  = ", ay_Cb.shape,  " flags = \n", ay_Cb.flags)
print("\nTest 2: bias time C-cont./col with add_.._C() => ", "%10.8f"%t_2)
print("shape of ay_C2b = ", ay_C2b.shape, " flags = \n", ay_C2b.flags)
print("\nTest 3: bias time F-cont./row with add_.._C() => ", "%10.8f"%t_3)
print("shape of ay_C3b = ", ay_C3b.shape, " flags = \n", ay_C3b.flags)
print("\nTest 4: bias time F-cont./row with add_.._F() => ", "%10.8f"%t_4)
print("shape of ay_F4b = ", ay_F4b.shape, " flags = \n", ay_F4b.flags)
print("\nTest 5: bias time F-cont./col with add_.._F() => ", "%10.8f"%t_5)
print("shape of ay_F5b = ", ay_F5b.shape, " flags = \n", ay_F5b.flags)
print("\nTest 6: bias time C-cont./row with add_.._C() => ", "%10.8f"%t_6)
print("shape of ay_C6b = ", ay_C6b.shape, " flags = \n", ay_C6b.flags)
print("\nTest 7: bias time C-cont./col with add_.._F() => ", "%10.8f"%t_7)
print("shape of ay_F7b = ", ay_F7b.shape, " flags = \n", ay_F7b.flags)
```

You noticed that I defined *two different ways* of creating the bigger array into which we place the original one.

Results are:

```
Test 1: bias time C-cont./row with add_.._C() =>  0.20854935
  shape of ay_Cb  = (785, 10000)   flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True

Test 2: bias time C-cont./col with add_.._C() =>  0.25661559
  shape of ay_C2b = (10000, 785)   flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True

Test 3: bias time F-cont./row with add_.._C() =>  0.67718296
  shape of ay_C3b = (785, 10000)   flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True

Test 4: bias time F-cont./row with add_.._F() =>  0.25958392
  shape of ay_F4b = (785, 10000)   flags: C_CONTIGUOUS: False, F_CONTIGUOUS: True,  OWNDATA: True

Test 5: bias time F-cont./col with add_.._F() =>  0.20990409
  shape of ay_F5b = (10000, 785)   flags: C_CONTIGUOUS: False, F_CONTIGUOUS: True,  OWNDATA: True

Test 6: bias time C-cont./row with add_.._C() =>  0.22129941
  shape of ay_C6b = (785, 10000)   flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True

Test 7: bias time C-cont./col with add_.._F() =>  0.67642328
  shape of ay_F7b = (10000, 785)   flags: C_CONTIGUOUS: False, F_CONTIGUOUS: True,  OWNDATA: True
```

(In all seven cases the remaining flags were WRITEABLE: True, ALIGNED: True, WRITEBACKIFCOPY: False, UPDATEIFCOPY: False.)

These results

- confirm that it is a bad idea to place an F-contiguous array (or an F-contiguous view on an array) into a C-contiguous one the way we presently do it;
- confirm that we should at least create the surrounding array with the *same order* as the input array which we place into it.

The best combinations are

- either to put an original C-contiguous array with fitting shape into a C-contiguous one with one more row,
- or to place an original F-contiguous array with suitable shape into a F-contiguous one with one more column.

By the way: Some systematic tests also showed that the time difference between the first and the third operation grows with batch size:

```
bs = 60000, rep. =   30  =>  t1 = 0.70,  t3 = 2.91,  fact = 4.16
bs = 50000, rep. =   30  =>  t1 = 0.58,  t3 = 2.34,  fact = 4.03
bs = 40000, rep. =   50  =>  t1 = 0.78,  t3 = 3.07,  fact = 3.91
bs = 30000, rep. =   50  =>  t1 = 0.60,  t3 = 2.21,  fact = 3.68
bs = 20000, rep. =   60  =>  t1 = 0.49,  t3 = 1.63,  fact = 3.35
bs = 10000, rep. =   60  =>  t1 = 0.26,  t3 = 0.82,  fact = 3.20
bs =  5000, rep. =   60  =>  t1 = 0.11,  t3 = 0.35,  fact = 3.24
bs =  2000, rep. =   60  =>  t1 = 0.04,  t3 = 0.10,  fact = 2.41
bs =  1000, rep. =  200  =>  t1 = 0.17,  t3 = 0.38,  fact = 2.21
bs =   500, rep. = 1000  =>  t1 = 0.15,  t3 = 0.32,  fact = 2.17
bs =   500, rep. =  200  =>  t1 = 0.03,  t3 = 0.06,  fact = 2.15
bs =   100, rep. = 1500  =>  t1 = 0.04,  t3 = 0.07,  fact = 1.92
```

"rep" is the loop range (repetitions), "fact" is the factor between the fastest operation (test 1: C-cont. into C-cont.) and the slowest (test 3: F-cont. into C-cont.). (The best results were selected among multiple runs with different repetitions for the table above.)

We clearly see that our problem gets worse with batch sizes above bs=1000!

Okay, let us assume we wanted to go either of the two optimization paths indicated above. Then we would need to prepare the input array in a suitable form. But how does such an approach fit to the present initialization of the input data and the shuffling of "X_train" at the beginning of each epoch?

If we keep up our policy of adding a bias neuron to the input layer by the present mechanism, we either have to get the transposed view into C-contiguous form **or** at least create the new array (including the extra row) in F-contiguous form. (The latter will not hamper the later np.dot()-multiplication with the weight matrix, as we shall see below.) Or we must circumvent the bias neuron problem at the input layer in a different way.

Actually, there are two fast shuffling options - and both are designed to work efficiently with rows, only. Another point is that the result is *always* C-contiguous. Let us look at some tests:

```python
# Shuffling
# **********
dim1 = 60000
input_shapeX = (dim1, 784)
input_shapeY = (dim1, )
ay_X  = np.array(np.random.random_sample(input_shapeX)*2.0, order='C', dtype=np.float32)
ay_Y  = np.array(np.random.random_sample(input_shapeY)*2.0, order='C', dtype=np.float32)
ay_X2 = np.array(np.random.random_sample(input_shapeX)*2.0, order='C', dtype=np.float32)
ay_Y2 = np.array(np.random.random_sample(input_shapeY)*2.0, order='C', dtype=np.float32)

# Test 1: Shuffling of C-cont. array by np.random.shuffle
tx = time.perf_counter()
np.random.shuffle(ay_X)
np.random.shuffle(ay_Y)
ty = time.perf_counter(); t_1 = ty - tx
print("\nShuffle Test 1: time C-cont. => t = ", "%10.8f"%t_1)
print("shape of ay_X = ", ay_X.shape, " flags = \n", ay_X.flags)
print("shape of ay_Y = ", ay_Y.shape, " flags = \n", ay_Y.flags)

# Test 2: Shuffling of C-cont. array by random index permutation
# as we have coded it for the beginning of each epoch
tx = time.perf_counter()
shuffled_index = np.random.permutation(dim1)
ay_X2, ay_Y2 = ay_X2[shuffled_index], ay_Y2[shuffled_index]
ty = time.perf_counter(); t_2 = ty - tx
print("\nShuffle Test 2: time C-cont. => t = ", "%10.8f"%t_2)
print("shape of ay_X2 = ", ay_X2.shape, " flags = \n", ay_X2.flags)
print("shape of ay_Y2 = ", ay_Y2.shape, " flags = \n", ay_Y2.flags)

# Test 3: Copy time for writing the whole X-array into 'F' ordered form
# such that slices transposed get C-order
ay_X3x = np.array(np.random.random_sample(input_shapeX)*2.0, order='C', dtype=np.float32)
tx = time.perf_counter()
ay_X3 = np.copy(ay_X3x, order='F')
ty = time.perf_counter(); t_3 = ty - tx
print("\nTest 3: time to copy to F-cont. array => t = ", "%10.8f"%t_3)
print("shape of ay_X3 = ", ay_X3.shape, " flags = \n", ay_X3.flags)

# Test 4: Shuffling of rows in F-cont. array => the result is C-contiguous!
tx = time.perf_counter()
shuffled_index = np.random.permutation(dim1)
ay_X3, ay_Y2 = ay_X3[shuffled_index], ay_Y2[shuffled_index]
ty = time.perf_counter(); t_4 = ty - tx
print("\nTest 4: Shuffle rows of F-cont. array => t = ", "%10.8f"%t_4)
print("shape of ay_X3 = ", ay_X3.shape, " flags = \n", ay_X3.flags)

# Test 5: Transposing and copying afterwards => F-contiguous with changed shape
tx = time.perf_counter()
ay_X5 = np.copy(ay_X.T)
ty = time.perf_counter(); t_5 = ty - tx
print("\nCopy Test 5: time copy to F-cont. => t = ", "%10.8f"%t_5)
print("shape of ay_X5 = ", ay_X5.shape, " flags = \n", ay_X5.flags)

# Test 6: Shuffling columns in F-cont. array by using the transposed view
tx = time.perf_counter()
shuffled_index = np.random.permutation(dim1)
ay_X6 = (ay_X5.T[shuffled_index]).T
ty = time.perf_counter(); t_6 = ty - tx
print("\nCopy Test 6: shuffling F-cont. array in columns => t = ", "%10.8f"%t_6)
print("shape of ay_X6 = ", ay_X6.shape, " flags = \n", ay_X6.flags)
```

Results are:

```
Shuffle Test 1: time C-cont. => t = 0.08650427
  shape of ay_X  = (60000, 784)  flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True
  shape of ay_Y  = (60000,)      flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: True,  OWNDATA: True

Shuffle Test 2: time C-cont. => t = 0.02296818
  shape of ay_X2 = (60000, 784)  flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True
  shape of ay_Y2 = (60000,)      flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: True,  OWNDATA: True

Test 3: time to copy to F-cont. array => t = 0.09333340
  shape of ay_X3 = (60000, 784)  flags: C_CONTIGUOUS: False, F_CONTIGUOUS: True,  OWNDATA: True

Test 4: Shuffle rows of F-cont. array => t = 0.25790425
  shape of ay_X3 = (60000, 784)  flags: C_CONTIGUOUS: True,  F_CONTIGUOUS: False, OWNDATA: True

Copy Test 5: time copy to F-cont. => t = 0.02146052
  shape of ay_X5 = (784, 60000)  flags: C_CONTIGUOUS: False, F_CONTIGUOUS: True,  OWNDATA: True

Copy Test 6: shuffling F-cont. array in columns by using the transposed view => t = 0.02402249
  shape of ay_X6 = (784, 60000)  flags: C_CONTIGUOUS: False, F_CONTIGUOUS: True,  OWNDATA: False
```

The results reveal three points:

- Applying a random permutation of an index is faster than using np.random.shuffle() on the array.
- The result is C-contiguous in both cases.
- Shuffling of columns can be done in a fast way by shuffling rows of the transposed array.
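A tiny toy example - with made-up data, not the MNIST arrays - shows the pattern of both index-permutation variants and verifies that samples and labels stay aligned:

```python
import numpy as np

n = 6
X = np.arange(n * 2).reshape(n, 2)     # 6 "samples" with 2 features: row i = [2i, 2i+1]
Y = np.arange(n)                       # matching "labels"

# one permutation applied to both arrays keeps samples and labels aligned
shuffled_index = np.random.permutation(n)
Xs, Ys = X[shuffled_index], Y[shuffled_index]
# fancy indexing returns a new C-contiguous array

# columns of an F-contiguous (features, samples) array can be shuffled
# by shuffling the rows of the transposed view and transposing back
Xf = np.asfortranarray(X.T)             # shape (2, 6), F-contiguous
Xf_shuffled = (Xf.T[shuffled_index]).T  # still shape (2, 6)
```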

So, at the beginning of each epoch we are in any case confronted with a C-contiguous array of shape (batch_size, 784). Comparing this with the test data further above seems to leave us with three choices:

**Approach 1:** At the beginning of each epoch we **copy** the input array into an F-contiguous one, such that the required transposed array afterwards is C-contiguous and our present version of "_add_bias_neuron_to_layer()" works fast when adding a row of bias nodes. The result would be a *C-contiguous* array with shape (785, size_batch).

**Approach 2:** We define a new method "_add_bias_neuron_to_layer_F()" which creates an F-contiguous array with an extra row into which we fit the existing (transposed) array "A_out_il". The result would be an *F-contiguous* array with shape (785, size_batch).

**Approach 3:** We skip adding a row for bias neurons altogether.

The first method has the disadvantage that the copy process itself requires time at the beginning of each epoch. But according to the test data the total gain would be bigger than the loss (6 batches!). The second approach has a small disadvantage because "_add_bias_neuron_to_layer_F()" is slightly slower than its row-oriented counterpart - but this will be compensated by a slightly faster matrix dot()-multiplication. All in all the second option seems to be the better one - in case we do not find a completely different approach. Just wait a minute ...

As we have come so far: How does np.dot() react to C- or F-contiguous arrays? The first two optimization approaches would end in different situations regarding the matrix multiplication. Let us cover all 4 possible combinations by a test:

```python
# A simple test of np.dot() on C-contiguous and F-contiguous matrices
# *******************************************************************
# Is the dot() multiplication faster for certain combinations of C- and F-contiguous matrices?
input_shape = (800, 20000)
ay_inpC1 = np.array(np.random.random_sample(input_shape)*2.0, dtype=np.float32)
#print("shape of ay_inpC1 = ", ay_inpC1.shape, " flags = ", ay_inpC1.flags)
ay_inpC2 = np.array(np.random.random_sample(input_shape)*2.0, dtype=np.float32)
#print("shape of ay_inpC2 = ", ay_inpC2.shape, " flags = ", ay_inpC2.flags)
ay_inpC3 = np.array(np.random.random_sample(input_shape)*2.0, dtype=np.float32)
print("shape of ay_inpC3 = ", ay_inpC3.shape, " flags = ", ay_inpC3.flags)

ay_inpF1 = np.copy(ay_inpC1, order='F')
ay_inpF2 = np.copy(ay_inpC2, order='F')
ay_inpF3 = np.copy(ay_inpC3, order='F')
print("shape of ay_inpF3 = ", ay_inpF3.shape, " flags = ", ay_inpF3.flags)

weight_shape = (101, 800)
weightC = np.array(np.random.random_sample(weight_shape)*0.5, dtype=np.float32)
print("shape of weightC = ", weightC.shape, " flags = ", weightC.flags)
weightF = np.copy(weightC, order='F')
print("shape of weightF = ", weightF.shape, " flags = ", weightF.flags)

rg_j = range(300)

ts = time.perf_counter()
for j in rg_j:
    resCC1 = np.dot(weightC, ay_inpC1)
    resCC2 = np.dot(weightC, ay_inpC2)
    resCC3 = np.dot(weightC, ay_inpC3)
    resCC1 = np.dot(weightC, ay_inpC1)
    resCC2 = np.dot(weightC, ay_inpC2)
    resCC3 = np.dot(weightC, ay_inpC3)
te = time.perf_counter(); tcc = te - ts
print("\n dot tCC time = ", "%10.8f"%tcc)

ts = time.perf_counter()
for j in rg_j:
    resCF1 = np.dot(weightC, ay_inpF1)
    resCF2 = np.dot(weightC, ay_inpF2)
    resCF3 = np.dot(weightC, ay_inpF3)
    resCF1 = np.dot(weightC, ay_inpF1)
    resCF2 = np.dot(weightC, ay_inpF2)
    resCF3 = np.dot(weightC, ay_inpF3)
te = time.perf_counter(); tcf = te - ts
print("\n dot tCF time = ", "%10.8f"%tcf)

ts = time.perf_counter()
for j in rg_j:
    resFC1 = np.dot(weightF, ay_inpC1)
    resFC2 = np.dot(weightF, ay_inpC2)
    resFC3 = np.dot(weightF, ay_inpC3)
    resFC1 = np.dot(weightF, ay_inpC1)
    resFC2 = np.dot(weightF, ay_inpC2)
    resFC3 = np.dot(weightF, ay_inpC3)
te = time.perf_counter(); tfc = te - ts
print("\n dot tFC time = ", "%10.8f"%tfc)

ts = time.perf_counter()
for j in rg_j:
    resFF1 = np.dot(weightF, ay_inpF1)
    resFF2 = np.dot(weightF, ay_inpF2)
    resFF3 = np.dot(weightF, ay_inpF3)
    resFF1 = np.dot(weightF, ay_inpF1)
    resFF2 = np.dot(weightF, ay_inpF2)
    resFF3 = np.dot(weightF, ay_inpF3)
te = time.perf_counter(); tff = te - ts
print("\n dot tFF time = ", "%10.8f"%tff)
```

The results show some differences - but they are relatively small:

shape of ay_inpC3 =  (800, 20000)   flags =
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

shape of ay_inpF3 =  (800, 20000)   flags =
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

shape of weightC =  (101, 800)   flags =
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

shape of weightF =  (101, 800)   flags =
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

 dot tCC time =  21.77729867
 dot tCF time =  20.68745600
 dot tFC time =  21.42704156
 dot tFF time =  20.65543837

Actually, most of the tiny differences come from putting the matrices into a fitting order - something np.dot() performs automatically; see the documentation. The multiplication is fastest when the second matrix is in F-order, but the differences are nothing to worry about at our present discussion level.
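For readers who want to experiment themselves: the memory order of a Numpy array can be inspected and converted explicitly. A minimal sketch (the variable names are illustrative only):

```python
import numpy as np

# A freshly created array is C-contiguous (row-major) by default
a_c = np.arange(12, dtype=np.float32).reshape(3, 4)
# np.asfortranarray() makes an F-ordered (column-major) copy with the same values
a_f = np.asfortranarray(a_c)
print(a_c.flags['C_CONTIGUOUS'], a_f.flags['F_CONTIGUOUS'])
# np.ascontiguousarray() converts back to C-order if needed
a_back = np.ascontiguousarray(a_f)
```

The values are identical in all three arrays; only the memory layout differs - which is exactly what the timing test above probes.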

We could now apply one of the two strategies to improve our mechanism for dealing with the bias nodes at the input layer; we would notice a significant acceleration there. But why would we leave the other layers unchanged?

The reason is quite simple: The matrix multiplications with the weight matrix - done by "np.dot()" - produce C-contiguous arrays with the required shapes at the later layers, e.g. an input array of the suitable shape (70, 10000) at layer L1. So, for the moment, we can leave everything at the hidden layers and at the output layer untouched.
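We can verify this claim directly: even when both operands of np.dot() are F-ordered, the freshly allocated result comes out C-contiguous (a small sketch with arbitrary shapes matching the ones above):

```python
import numpy as np

# F-ordered operands, as in the timing test above
W = np.asfortranarray(np.ones((101, 800), dtype=np.float32))
A = np.asfortranarray(np.ones((800, 500), dtype=np.float32))

# np.dot() allocates its result array in default (C) order
Z = np.dot(W, A)
print(Z.shape, Z.flags['C_CONTIGUOUS'])
```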

However, the discussion above made one thing clear: our whole technical treatment of bias nodes deserves criticism. Can we at least go another way at the input layer?

Yes, we can - without touching the weight matrix connecting the layers L0 and L1. We need to get rid of unnecessary or inefficient operations in the training loop, but we can afford some bigger operations during the setup of the input data. What if we added the required bias values to the input data array right from the start?

This would require a column operation on a transposition of the whole dataset "X". But we need to perform this operation *only once* - and before splitting the dataset into training and test sets! As an MLP generally works with flattened data, such an approach should work for other datasets, too.
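The idea itself can be sketched in a few lines (variable names are illustrative, not the class's own; whether the constant column comes first or last depends on the weight matrix convention):

```python
import numpy as np

# hypothetical input: 6 flattened samples with 4 features each
X = np.random.random_sample((6, 4)).astype(np.float32)

# append ONE constant column of 1.0 for the bias neuron - done only once,
# before splitting into training and test sets
X_b = np.hstack((X, np.ones((X.shape[0], 1), dtype=np.float32)))
print(X_b.shape)
```

After this step every sample already carries its bias value, so no per-batch bias handling is needed at the input layer any longer.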

Measurements show that adding a bias column costs us between 0.030 and 0.035 secs. A worthwhile one-time investment! Think about it: We would not need to touch our already fast methods of shuffling and slicing to get the batches - and even the transposed matrix would already have the preferred F-contiguous order for np.dot()! The required code changes are minimal; we just need to adapt our methods "_handle_input_data()" and "_fw_propagation()" by two or three lines:

''' -- Method to handle different types of input data sets
       Currently only different MNIST sets are supported
       We can also IMPORT a preprocessed MNIST data set
--'''
def _handle_input_data(self):
    '''
    Method to deal with the input data:
    - check if we have a known data set ("mnist" so far)
    - reshape as required
    - analyze dimensions and extract the feature dimension(s)
    '''
    # check for known dataset
    try:
        if (self._my_data_set not in self._input_data_sets):
            raise ValueError
    except ValueError:
        print("The requested input data set " + self._my_data_set + " is not known!")
        sys.exit()

    # MNIST datasets
    # **************
    # handle the mnist original dataset - is not supported any more
    if (self._my_data_set == "mnist"):
        mnist = fetch_mldata('MNIST original')
        self._X, self._y = mnist["data"], mnist["target"]
        print("Input data for dataset " + self._my_data_set + " : \n" +
              "Original shape of X = " + str(self._X.shape) + "\n" +
              "Original shape of y = " + str(self._y.shape))
    #
    # handle the mnist_784 dataset
    if (self._my_data_set == "mnist_784"):
        mnist2 = fetch_openml('mnist_784', version=1, cache=True, data_home='~/scikit_learn_data')
        self._X, self._y = mnist2["data"], mnist2["target"]
        print("data fetched")
        # the target categories are given as strings not integers
        self._y = np.array([int(i) for i in self._y], dtype=np.float32)
        print("data modified")
        print("Input data for dataset " + self._my_data_set + " : \n" +
              "Original shape of X = " + str(self._X.shape) + "\n" +
              "Original shape of y = " + str(self._y.shape))

    # handle the mnist_keras dataset - PREFERRED
    if (self._my_data_set == "mnist_keras"):
        (X_train, y_train), (X_test, y_test) = kmnist.load_data()
        len_train = X_train.shape[0]
        len_test  = X_test.shape[0]
        X_train = X_train.reshape(len_train, 28*28)
        X_test  = X_test.reshape(len_test, 28*28)
        # Concatenation required due to possible later normalization of all data
        self._X = np.concatenate((X_train, X_test), axis=0)
        self._y = np.concatenate((y_train, y_test), axis=0)
        print("Input data for dataset " + self._my_data_set + " : \n" +
              "Original shape of X = " + str(self._X.shape) + "\n" +
              "Original shape of y = " + str(self._y.shape))
    #
    # common MNIST handling
    if (self._my_data_set == "mnist" or self._my_data_set == "mnist_784" or self._my_data_set == "mnist_keras"):
        self._common_handling_of_mnist()

    # handle IMPORTED MNIST datasets (could in later versions also be used for other datasets)
    # ****************************
    # Note: Imported sets are e.g. useful for testing some new preprocessing in a
    #       Jupyter environment before implementing related new methods
    if (self._my_data_set == "imported"):
        if (self._X_import is not None) and (self._y_import is not None):
            self._X = self._X_import
            self._y = self._y_import
        else:
            print("Shall handle imported datasets - but they are not defined")
            sys.exit()
    #
    # number of total records in X, y
    self._dim_X = self._X.shape[0]

    # ************************
    # Common dataset handling
    # ************************

    # transform to 32 bit
    # ~~~~~~~~~~~~~~~~~~~
    self._X = self._X.astype(np.float32)
    self._y = self._y.astype(np.int32)

    # Give control to preprocessing - Note: preproc. includes also normalization
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    self._preprocess_input_data()   # scaling, PCA, cluster detection ....

    # ADDING A COLUMN FOR BIAS NEURONS
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    self._X = self._add_bias_neuron_to_layer(self._X, 'column')
    print("type of self._X = ", self._X.dtype, " flags = ", self._X.flags)
    print("type of self._y = ", self._y.dtype)

    # mixing the training indices - MUST happen BEFORE encoding
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~
    shuffled_index = np.random.permutation(self._dim_X)
    self._X, self._y = self._X[shuffled_index], self._y[shuffled_index]

    # Splitting into training and test datasets
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    if self._num_test_records > 0.25 * self._dim_X:
        print("\nNumber of test records bigger than 25% of available data. Too big, we stop.")
        sys.exit()
    else:
        num_sep = self._dim_X - self._num_test_records
        self._X_train, self._X_test, self._y_train, self._y_test = \
            self._X[:num_sep], self._X[num_sep:], self._y[:num_sep], self._y[num_sep:]

    # numbers, dimensions
    # *******************
    self._dim_sets = self._y_train.shape[0]
    self._dim_features = self._X_train.shape[1]
    print("\nFinal dimensions of training and test datasets of type " + self._my_data_set + " : \n" +
          "Shape of X_train = " + str(self._X_train.shape) + "\n" +
          "Shape of y_train = " + str(self._y_train.shape) + "\n" +
          "Shape of X_test = " + str(self._X_test.shape) + "\n" +
          "Shape of y_test = " + str(self._y_test.shape))
    print("\nWe have " + str(self._dim_sets) + " data records for training")
    print("Feature dimension is " + str(self._dim_features))

    # Encode the y-target labels = categories // MUST happen AFTER shuffling
    # *****************************************
    self._get_num_labels()
    self._encode_all_y_labels(self._b_print_test_data)
    #
    return None

.....
.....

''' -- Method to handle FW propagation for a mini-batch --'''
def _fw_propagation(self, li_Z_in, li_A_out):
    '''
    Parameter:
    li_Z_in :  list of input values at all layers - li_Z_in[0] is already filled -
               other elements of this list are to be filled during FW-propagation
    li_A_out:  list of output values at all layers - to be filled during FW-propagation
    '''

    # index range for all layers
    # Note that we count from 0 (=> L0) to E (=> the last layer)
    # Careful: during BW-propagation we need a clear indexing of the lists filled during FW-propagation
    ilayer = range(0, self._n_total_layers-1)

    # do not change if you use vstack - shape may vary for predictions - cannot take self._no_ones yet
    # np_bias = np.ones((1, li_Z_in[0].shape[1]))

    # propagation loop
    # ****************
    for il in ilayer:
        # Step 1: Take input of last layer and apply activation function
        # ******
        #ts = time.perf_counter()
        if il == 0:
            A_out_il = li_Z_in[il]                   # L0: activation function is identity !!!
        else:
            A_out_il = self._act_func(li_Z_in[il])   # use real activation function

        # Step 2: Add bias node
        # ******
        # As we have taken care of this for the input layer already at data setup
        # we perform this only for hidden layers
        if il > 0:
            A_out_il = self._add_bias_neuron_to_layer(A_out_il, 'row')
        li_A_out[il] = A_out_il    # save data for the BW propagation

        # Step 3: Propagate by matrix operation
        # ******
        Z_in_ilp1 = np.dot(self._li_w[il], A_out_il)
        li_Z_in[il+1] = Z_in_ilp1

    # treatment of the last layer
    # ***************************
    il = il + 1
    A_out_il = self._out_func(li_Z_in[il])   # use the output function
    li_A_out[il] = A_out_il                  # save data for the BW propagation

    return None

The required change of the first method consists of adding just one effective line:

self._X = self._add_bias_neuron_to_layer(self._X, 'column')

Note that I added the column for the bias values *after* pre-processing. The bias neurons - more precisely, their constant values - should not be included in clustering, PCA, normalization or whatever else we do ahead of training.

In the second method we just had to eliminate a statement and add a condition, which excludes the input layer from an (additional) bias neuron treatment. That is all we need to do.

How much of an improvement can we expect? Assuming that the forward propagation consumes around 40% of the total computational time of an epoch, and taking the introductory numbers, we should gain something like 0.40 * 0.43 * 100%, i.e. about 17.2%. However, this is too optimistic, as the basic effect of our change varies non-linearly with the batch size.

So, something around a 15% reduction of the CPU time for a training run with 35 epochs and a batch size of only 500 would be great.

However, we should expect a much bigger effect on the FW-propagation of the complete training set (though the test data set may be more interesting otherwise). OK, let us do 2 test runs - the first without a special verification of the accuracy on the training set, the second with a verification of the accuracy via propagating the training set at the end of each and every epoch.

Results of the first run:

------------------
Starting epoch 35

Time_CPU for epoch 35            0.2717692229998647
Total CPU-time:                  9.625694645001204

learning rate =                  0.0009994051838157095

total costs of training set =    -1.0
rel. reg. contrib. to total costs = -1.0

total costs of last mini_batch = 65.10513
rel. reg. contrib. to batch costs = 0.121494114

mean abs weight at L0 : -10.0
mean abs weight at L1 : -10.0
mean abs weight at L2 : -10.0

avg total error of last mini_batch = 0.00805
presently batch averaged accuracy  = 0.99247

-------------------
Total training Time_CPU: 9.625974849001068

And the second run gives us :

------------------
Starting epoch 35

Time_CPU for epoch 35            0.37750117799805594
Total CPU-time:                  13.164013020999846

learning rate =                  0.0009994051838157095

total costs of training set =    5929.9297
rel. reg. contrib. to total costs = 0.0013557569

total costs of last mini_batch = 50.148125
rel. reg. contrib. to batch costs = 0.16029811

mean abs weight at L0 : 0.064023666
mean abs weight at L1 : 0.38064405
mean abs weight at L2 : 1.320015

avg total error of last mini_batch = 0.00626
presently reached train accuracy   = 0.99045
presently batch averaged accuracy  = 0.99267

-------------------
Total training Time_CPU: 13.16432525900018

The small deviation of the accuracy values determined by error averaging over batches vs. the test on the complete training set stems from slightly different measurement methods as discussed in the first sections of this article.

What do our results mean with respect to performance?

Well, we went down from 11.33 secs to **9.63** secs for the CPU time of the training run. This is a fair **15% improvement**. But remember that we came from something like 50 secs at the beginning of our optimization, so all in all we have gained an improvement by a factor of 5 already!

In our last article we found a factor of 1.68 between the runs with a full propagation of the complete training set at each and every epoch for accuracy evaluation. Such a run lasted roughly for 19 secs. We now went down to 13.16 secs. Meaning: Instead of 7.7 secs we only consumed 3.5 secs for propagating all 60000 samples 35 times in one sweep.

**We reduced the CPU time for the FW propagation of the training set (plus error evaluation) by 54%**, i.e. by more than a factor of 2! Meaning: We have really achieved something for the FW-propagation of big batches!

By the way: Checking accuracy on the test dataset instead on the training dataset after each and every epoch requires 10.15 secs.

------------------
Starting epoch 35

Time_CPU for epoch 35            0.29742689200065797
Total CPU-time:                  10.150781942997128

learning rate =                  0.0009994051838157095

total costs of training set =    -1.0
rel. reg. contrib. to total costs = -1.0

total costs of last mini_batch = 73.17834
rel. reg. contrib. to batch costs = 0.10932728

mean abs weight at L0 : -10.0
mean abs weight at L1 : -10.0
mean abs weight at L2 : -10.0

avg total error of last mini_batch = 0.00804
presently reached test accuracy    = 0.96290
presently batch averaged accuracy  = 0.99269

-------------------
Total training Time_CPU: 10.1510079389991

You see the variation in the accuracy values.

Finally, I give you the run times for 35 epochs of the MLP for larger batch sizes:

bs = 500    =>  t(35) = 9.63 secs
bs = 5000   =>  t(35) = 8.75 secs
bs = 10000  =>  t(35) = 8.55 secs
bs = 20000  =>  t(35) = 8.68 secs
bs = 30000  =>  t(35) = 8.65 secs

So, we do not get below a certain value - despite the fact that FW-propagation gets faster with batch size. Apparently there are further batch-size dependent impediments in the BW-propagation, too, which compensate our gains.

Just to show that our modified program still produces reasonable results after 650 training epochs - here are the result data:

------------------
Starting epoch 651
....
....
avg total error of last mini_batch = 0.00878
presently reached train accuracy   = 0.99498
presently reached test accuracy    = 0.97740
presently batch averaged accuracy  = 0.99214

-------------------
Total training Time_CPU: 257.541123711002

The total time was to be expected, as we checked accuracy values at each and every epoch **both** for the complete training set and the complete test set.

This was a funny ride today. We found a major impediment for a fast FW-propagation and traced its cause to the inefficient combination of two differently *ordered* matrices, which we had used to account for bias nodes in the input layer. We investigated some optimization options for our present approach regarding bias neurons at layer L0. But it was much more reasonable to circumvent the whole problem by adding bias values to the input array itself. This gave us a significant improvement for the FW-propagation of big batches - roughly by a factor of **2.5** for the complete training dataset as an extreme example. Also testing accuracy on the full test dataset at each and every epoch is no major performance factor any longer.

However, our whole analysis showed that we must put a big question mark behind our present approach to bias neurons. But before we attack this problem, we shall take a closer look at BW-propagation in the next article:

MLP, Numpy, TF2 – performance issues – Step III – a correction to BW propagation

And there we shall replace another stupid time wasting part of the code, too. It will give us another improvement of around 15% to 20%. Stay tuned ...

**Performance of class methods vs. pure Python functions**

stackoverflow : how-much-slower-python-classes-are-compared-to-their-equivalent-functions

**Shuffle columns?**

stackoverflow: shuffle-columns-of-an-array-with-numpy

**Numpy arrays or matrices?**

stackoverflow : numpy-np-array-versus-np-matrix-performance

In this series I compare the performance of two MLP implementations:

- my own MLP programmed with Python and Numpy; I have discussed this program in another article series;
- an MLP with a similar setup based on Keras and TF2

Not for reasons of a competition, but to learn a bit about differences. When and for what parameters do Keras/TF2 offer a better performance?

Another objective is to test TF-alternatives to Numpy functions and possible performance gains.

For the Python code of my own MLP see the article series starting with the following post:

A simple Python program for an ANN to cover the MNIST dataset – I – a starting point

But I will discuss relevant code fragments also here when needed.

I think performance is always an interesting topic - especially for Python dummies like me. After some trial and error I decided to discuss some of my experiences with MLP performance and optimization options in a separate series of the section "Machine learning" in this blog. This article starts with two simple measures.

Well, what did a first comparison give me? Regarding CPU time I got a **factor of 6** on the MNIST dataset for a batch size of 500. Of course, Keras with TF2 was faster. Devastating? Not at all ... After years of dealing with databases - and factors of up to 100 achieved by changes to SQL statements and indexing - a factor of 6 cannot shock or surprise me.

The Python code was the product of an unpaid hobby activity in my scarce free time - and I am still a beginner in Python. The code was also not yet optimized at all, both regarding technical aspects and the general handling of forward and backward propagation. It also contained, and still contains, a lot of superfluous statements for testing. Actually, I had expected an even bigger factor.

In addition, some things between Keras and my Python program are not directly comparable, as I only use 4 CPU cores for OpenBlas - this gave me an optimum for Python/Numpy programs in a Jupyter environment. Keras and TF2 instead seem to use all available CPU threads (successfully), despite limiting threading with TF statements. (By the way: This is an interesting point in itself. If OpenBlas cannot give them advantages, what else do they do?)
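For reference, the way to limit OpenBlas threading for Numpy is via environment variables, and they must be set *before* Numpy is imported for the first time in the process (a sketch; which variable is honored depends on the BLAS build behind your Numpy):

```python
import os

# must happen BEFORE the first "import numpy" in this process -
# changing these variables later has no effect on an already loaded BLAS
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np   # Numpy's BLAS now starts with at most 4 threads
```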

A very surprising point was, however, that using a GPU did not make the factor much bigger - despite the fact that TF2 should be able to accelerate certain operations on a GPU by a factor of at least 2 and up to 5, as independent tests on matrix operations showed me. And a factor of > 2 between my GPU and the CPU is what I remember from TF1 times last year. So, either the CPU is better supported now or the GPU support of TF2 has become worse compared to TF1. An interesting point for further investigations, too ...

An even bigger surprise was that I could reduce the factor for the given batch size down to 2 by just two major, but simple code changes! However, further testing also showed a huge dependency on the *batch size* chosen for training - which is another interesting point. Simple tests show that we may even be able to reduce the performance factor further

- by using directly coupled matrix operations - if logically possible
- by using the basic low-level Python API for some operations

Hope, this sounds interesting for you.

I used the following model as a reference in a Jupyter environment executed on Firefox:

**Jupyter Cell 1**

# compact version
# ***************
import time
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras import regularizers
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA) only
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# The following can only be done once - all CPU cores are used otherwise
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

# if not yet done elsewhere
#tf.compat.v1.disable_eager_execution()
#tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 0   # 0: cpu, 1: gpu

# function for training
def train(train_images, train_labels, epochs, batch_size, shuffle):
    network.fit(train_images, train_labels,
                epochs=epochs, batch_size=batch_size, shuffle=shuffle)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(70, activation='sigmoid', input_shape=(28*28,),
            kernel_regularizer=regularizers.l2(0.01)))
#network.add(layers.Dense(80, activation='sigmoid'))
#network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid',
            kernel_regularizer=regularizers.l2(0.01)))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# load MNIST
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

**Jupyter Cell 2**

# run it
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_g = time.perf_counter()
    test_loss, test_acc = network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_images, train_labels, epochs=35, batch_size=500, shuffle=True)
    end_c = time.perf_counter()
    test_loss, test_acc = network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)

# test accuracy
print('Acc:: ', test_acc)

**Typical output - first run:**

Epoch  1/35 - 1s 16us/step - loss: 2.6700 - accuracy: 0.1939
Epoch  2/35 - 0s  5us/step - loss: 2.2814 - accuracy: 0.3489
Epoch  3/35 - 0s  5us/step - loss: 2.1386 - accuracy: 0.3848
Epoch  4/35 - 0s  5us/step - loss: 1.9996 - accuracy: 0.3957
Epoch  5/35 - 0s  5us/step - loss: 1.8941 - accuracy: 0.4115
Epoch  6/35 - 0s  5us/step - loss: 1.8143 - accuracy: 0.4257
Epoch  7/35 - 0s  5us/step - loss: 1.7556 - accuracy: 0.4392
Epoch  8/35 - 0s  5us/step - loss: 1.7086 - accuracy: 0.4542
Epoch  9/35 - 0s  5us/step - loss: 1.6726 - accuracy: 0.4664
Epoch 10/35 - 0s  5us/step - loss: 1.6412 - accuracy: 0.4767
Epoch 11/35 - 0s  5us/step - loss: 1.6156 - accuracy: 0.4869
Epoch 12/35 - 0s  5us/step - loss: 1.5933 - accuracy: 0.4968
Epoch 13/35 - 0s  5us/step - loss: 1.5732 - accuracy: 0.5078
Epoch 14/35 - 0s  5us/step - loss: 1.5556 - accuracy: 0.5180
Epoch 15/35 - 0s  5us/step - loss: 1.5400 - accuracy: 0.5269
Epoch 16/35 - 0s  5us/step - loss: 1.5244 - accuracy: 0.5373
Epoch 17/35 - 0s  5us/step - loss: 1.5106 - accuracy: 0.5494
Epoch 18/35 - 0s  5us/step - loss: 1.4969 - accuracy: 0.5613
Epoch 19/35 - 0s  5us/step - loss: 1.4834 - accuracy: 0.5809
Epoch 20/35 - 0s  5us/step - loss: 1.4648 - accuracy: 0.6112
Epoch 21/35 - 0s  5us/step - loss: 1.4369 - accuracy: 0.6520
Epoch 22/35 - 0s  5us/step - loss: 1.3976 - accuracy: 0.6821
Epoch 23/35 - 0s  5us/step - loss: 1.3602 - accuracy: 0.6984
Epoch 24/35 - 0s  5us/step - loss: 1.3275 - accuracy: 0.7084
Epoch 25/35 - 0s  5us/step - loss: 1.3011 - accuracy: 0.7147
Epoch 26/35 - 0s  5us/step - loss: 1.2777 - accuracy: 0.7199
Epoch 27/35 - 0s  5us/step - loss: 1.2581 - accuracy: 0.7261
Epoch 28/35 - 0s  5us/step - loss: 1.2411 - accuracy: 0.7265
Epoch 29/35 - 0s  5us/step - loss: 1.2259 - accuracy: 0.7306
Epoch 30/35 - 0s  5us/step - loss: 1.2140 - accuracy: 0.7329
Epoch 31/35 - 0s  5us/step - loss: 1.2003 - accuracy: 0.7355
Epoch 32/35 - 0s  5us/step - loss: 1.1890 - accuracy: 0.7378
Epoch 33/35 - 0s  5us/step - loss: 1.1783 - accuracy: 0.7410
Epoch 34/35 - 0s  5us/step - loss: 1.1700 - accuracy: 0.7425
Epoch 35/35 - 0s  5us/step - loss: 1.1605 - accuracy: 0.7449

10000/10000 - 0s 37us/step
Time_CPU:  11.055424336002034
Acc::  0.7436000108718872

A second run was a bit faster: 10.8 secs. Accuracy around: 0.7449.

The relatively low accuracy is mainly due to the regularization (and reasonable to avoid overfitting). Without regularization we would already have passed the 0.9 border.

My own **unoptimized** MLP-program was executed with the following parameter setting:

my_data_set = "mnist_keras",
n_hidden_layers = 2,
ay_nodes_layers = [0, 70, 30, 0],
n_nodes_layer_out = 10,
num_test_records = 10000,      # number of test data

# Normalizing - you should play with scaler1 only for the time being
scaler1 = 1,    # 0: StandardScaler (full set), 1: Normalizer (per sample)
scaler2 = 0,    # 0: StandardScaler (full set), 1: MinMaxScaler (full set)
b_normalize_X_before_preproc = False,
b_normalize_X_after_preproc = True,

my_loss_function = "LogLoss",
n_size_mini_batch = 500,
n_epochs = 35,
lambda2_reg = 0.01,
learn_rate = 0.001,
decrease_const = 0.000001,

init_weight_meth_L0 = "sqrt_nodes",   # method to init weights in an interval defined by "sqrt_nodes" or a constant interval "const"
init_weight_meth_Ln = "sqrt_nodes",   # "sqrt_nodes", "const"
init_weight_intervals = [(-0.5, 0.5), (-0.5, 0.5), (-0.5, 0.5)],   # in case of a constant interval
init_weight_fact = 2.0,               # extends the interval
mom_rate = 0.00005,

b_shuffle_batches = True,      # shuffling the batches at the start of each epoch
b_predictions_train = True,    # test accuracy by predictions for ALL samples of the training set (MNIST: 60000) at the start of each epoch
b_predictions_test = False,
prediction_train_period = 1,   # 1: each and every epoch is used for accuracy tests on the full training set
prediction_test_period = 1,    # 1: each and every epoch is used for accuracy tests on the full test dataset

People familiar with my other article series on the MLP program know the parameters. But I think their names and comments are clear enough.

With a measurement of accuracy based on a forward propagation of the complete training set after each and every epoch (with the adjusted weights) I got a run time of 60 secs.

With accuracy measurements based on error tracking for batches and averaging over all batches, I get 49.5 secs (on 4 CPU threads). So, this is the mentioned factor between 5 and 6.

(By the way: The test indicates some space for improvement in the forward propagation. We shall take care of this in the next article of this series - promised.)

So, these were the references or baselines for improvements.

Well, let us look at the results after two major code changes. With a test of accuracy performed on the full training set of 60000 samples at the start of each epoch I get the following result :

------------------
Starting epoch 35

Time_CPU for epoch 35            0.5518779030026053
relative CPU time portions: shuffle: 0.05  batch loop: 0.58  prediction: 0.37
Total CPU-time:                  19.065050211000198

learning rate =                  0.0009994051838157095

total costs of training set =    5843.522
rel. reg. contrib. to total costs = 0.0013737131

total costs of last mini_batch = 56.300297
rel. reg. contrib. to batch costs = 0.14256112

mean abs weight at L0 : 0.06393985
mean abs weight at L1 : 0.37341583
mean abs weight at L2 : 1.302389

avg total error of last mini_batch = 0.00709
presently reached train accuracy   = 0.99072

-------------------
Total training Time_CPU: 19.04528829299714

With accuracy taken only from the error of a batch:

avg total error of last mini_batch = 0.00806
presently reached train accuracy   = 0.99194

-------------------
Total training Time_CPU: 11.331006342999899

Isn't this good news? A time of **11.3** secs is pretty close to what Keras provides us with! (Well, at least for a batch size of 500.) And with a better result regarding accuracy on my side - which probably has to do with a different handling of learning rates and the precise translation of the L2 regularization parameter for batches.

How did I get to this point? As said: Two measures were sufficient.

So far I had never cared much about defining the level of precision by which Numpy handles arrays of floating point numbers. In the context of Machine Learning this is a profound mistake. On a 64-bit CPU many time-consuming operations can gain almost a factor of 2 in performance when using float32 precision - if the programmers tweaked everything. And I assume the Numpy guys did.

So: Just use "dtype=np.float32" (np means "numpy" which I always import as "np") whenever you initialize numpy arrays!

For the readers following my other series: You should look at multiple methods performing some kind of initialization of my "MyANN"-class. Here is a list:

def _handle_input_data(self):
    .....
    self._y = np.array([int(i) for i in self._y], dtype=np.float32)
    .....
    self._X = self._X.astype(np.float32)
    self._y = self._y.astype(np.int32)
    .....

def _encode_all_y_labels(self, b_print=True):
    .....
    self._ay_onehot = np.zeros((self._n_labels, self._y_train.shape[0]), dtype=np.float32)
    self._ay_oneval = np.zeros((self._n_labels, self._y_train.shape[0], 2), dtype=np.float32)
    .....

def _create_WM_Input(self):
    .....
    w0 = w0.astype(dtype=np.float32)
    .....

def _create_WM_Hidden(self):
    .....
    w_i_next = w_i_next.astype(dtype=np.float32)
    .....

def _create_momentum_matrices(self):
    .....
    self._li_mom[i] = np.zeros(self._li_w[i].shape, dtype=np.float32)
    .....

def _prepare_epochs_and_batches(self, b_print = True):
    .....
    self._ay_theta = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32)
    self._ay_costs = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32)
    self._ay_reg_cost_contrib = -1 * np.ones(self._shape_epochs_batches, dtype=np.float32)
    .....
    self._ay_period_test_epoch = -1 * np.ones(shape_test_epochs, dtype=np.float32)
    self._ay_acc_test_epoch = -1 * np.ones(shape_test_epochs, dtype=np.float32)
    self._ay_err_test_epoch = -1 * np.ones(shape_test_epochs, dtype=np.float32)
    self._ay_period_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32)
    self._ay_acc_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32)
    self._ay_err_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32)
    self._ay_tot_costs_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32)
    self._ay_rel_reg_train_epoch = -1 * np.ones(shape_train_epochs, dtype=np.float32)
    .....
    self._ay_mean_abs_weight = -10 * np.ones(shape_weights, dtype=np.float32)
    .....

def _add_bias_neuron_to_layer(self, A, how='column'):
    .....
    A_new = np.ones((A.shape[0], A.shape[1]+1), dtype=np.float32)
    .....
    A_new = np.ones((A.shape[0]+1, A.shape[1]), dtype=np.float32)
    .....

After I applied these changes the factor in comparison to Keras went down to **3.1** - for a batch size of 500. Good news after a first simple step!
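The gain from switching to float32 can be probed in isolation with a small timing sketch. The array sizes below are hypothetical (chosen to resemble the weight matrix between L0 and L1); the exact speed-up factor depends on your BLAS backend:

```python
import time
import numpy as np

# Hypothetical sizes, similar to the weight matrix between L0 and L1
W = np.random.rand(784, 70)     # Numpy default dtype: float64
A = np.random.rand(784, 500)    # a batch of 500 samples

start = time.perf_counter()
for _ in range(100):
    C64 = np.dot(W.T, A)
t64 = time.perf_counter() - start

# the same operation after conversion to float32
W32 = W.astype(np.float32)
A32 = A.astype(np.float32)
start = time.perf_counter()
for _ in range(100):
    C32 = np.dot(W32.T, A32)
t32 = time.perf_counter() - start

print("float64:", t64, "secs  float32:", t32, "secs")
```

On most OpenBlas-backed systems t32 comes out clearly smaller than t64.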

The next step required a bit more thinking. When I went through further, more detailed tests of CPU consumption for various steps during training I found that the error back propagation through the network required *significantly* more time than the forward propagation.

At first sight this seems to be logical. There are more operations to be done between layers - real matrix multiplications with np.dot() (or np.matmul()) and element-wise multiplications with the "*"-operation. See also my PDF on the basic math:

Back_Propagation_1.0_200216.

But this was a wrong assumption: When I measured CPU times in detail I saw that such operations took most time when network layer L0 - i.e. the input layer of the MLP - got involved. This, too, seemed reasonable: the weight matrix is biggest there; of all layers the input layer has the most neuron nodes.

But when I went through the code I saw that I just had been too lazy whilst coding back propagation:

''' -- Method to handle error BW propagation for a mini-batch --'''
def _bw_propagation(self, ay_y_enc, li_Z_in, li_A_out,
                    li_delta_out, li_delta, li_D, li_grad,
                    b_print = True, b_internal_timing = False):

    # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch

    # Initiate BW propagation - provide delta-matrices for outermost layer
    # ***********************
    # Input Z at outermost layer E (4 layers -> layer 3)
    ay_Z_E = li_Z_in[self._n_total_layers-1]
    # Output A at outermost layer E (was calculated by output function)
    ay_A_E = li_A_out[self._n_total_layers-1]

    # Calculate D-matrix (derivative of output function) at the outermost layer - presently only D_sigmoid
    ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print)

    # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
    ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E, ay_D_E=ay_D_E, b_print=b_print)

    # add the matrices at the outermost layer to their lists; li_delta_out gets only one element
    idxE = self._n_total_layers - 1
    li_delta_out[idxE] = ay_delta_out_E   # this happens only once
    li_delta[idxE] = ay_delta_E
    li_D[idxE] = ay_D_E
    li_grad[idxE] = None                  # On the outermost layer there is no gradient!

    # Loop over all layers in reverse direction
    # ******************************************
    # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2)
    range_N_bw_layer = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken

    # loop over layers
    for N in range_N_bw_layer:
        # Back Propagation operations between layers N+1 and N
        # *******************************************************
        # this method handles the special treatment of bias nodes in Z_in, too
        ay_delta_N, ay_D_N, ay_grad_N = self._bw_prop_Np1_to_N(
            N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False)

        # add matrices to their lists
        li_delta[N] = ay_delta_N
        li_D[N] = ay_D_N
        li_grad[N] = ay_grad_N

    return

with the following key function:

''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta):
    '''
    BW-error-propagation between layer N+1 and N
    Inputs:
        li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
        li_A_out: List of output A-matrices - values were calculated during FW-propagation
        li_delta: List of delta-matrices - values for the outermost layer E down to layer N+1 should exist
    Returns:
        ay_delta_N - delta-matrix of layer N (required in subsequent steps)
        ay_D_N     - derivative matrix for the activation function on layer N
        ay_grad_N  - matrix with gradient elements of the cost function with respect to the weights on layer N
    '''

    # Prepare required quantities - and add bias neuron to ay_Z_in
    # ****************************

    # Weight matrix mediating between layers N and N+1
    ay_W_N = self._li_w[N]
    # delta-matrix of layer N+1
    ay_delta_Np1 = li_delta[N+1]

    # !!! Add row (for bias) to Z_N intermediately !!!
    ay_Z_N = li_Z_in[N]
    ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')

    # Derivative matrix for the activation function (with extra bias node row)
    ay_D_N = self._calculate_D_N(ay_Z_N)

    # fetch output value saved during FW propagation
    ay_A_N = li_A_out[N]

    # Propagate delta
    # **************
    # intermediate delta
    ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
    # final delta
    ay_delta_N = ay_delta_w_N * ay_D_N
    # reduce dimension again (bias row)
    ay_delta_N = ay_delta_N[1:, :]

    # Calculate gradient
    # ********************
    # required for all layers down to 0
    ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)

    # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums)
    ay_grad_N[:, 1:] += (self._li_w[N][:, 1:] * self._lambda2_reg
                         + np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg)

    return ay_delta_N, ay_D_N, ay_grad_N

Now, look at the final, optimized code:

''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
    '''
    BW-error-propagation between layer N+1 and N
    ....
    '''

    # Prepare required quantities - and add bias neuron to ay_Z_in
    # ****************************

    # Weight matrix mediating between layers N and N+1
    ay_W_N = self._li_w[N]
    ay_delta_Np1 = li_delta[N+1]

    # fetch output value saved during FW propagation
    ay_A_N = li_A_out[N]

    # Optimization: no delta-propagation is required down to the input layer L0!
    if N > 0:
        ay_Z_N = li_Z_in[N]
        # !!! Add intermediate row (for bias) to Z_N !!!
        ay_Z_N = self._add_bias_neuron_to_layer(ay_Z_N, 'row')

        # Derivative matrix for the activation function (with extra bias node row)
        ay_D_N = self._calculate_D_N(ay_Z_N)

        # Propagate delta
        # **************
        # intermediate delta
        ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
        # final delta
        ay_delta_N = ay_delta_w_N * ay_D_N
        # reduce dimension again (bias row)
        ay_delta_N = ay_delta_N[1:, :]
    else:
        ay_delta_N = None
        ay_D_N = None

    # Calculate gradient
    # ********************
    # required for all layers down to 0
    ay_grad_N = np.dot(ay_delta_Np1, ay_A_N.T)

    # regularize gradient (!!!! without adding bias nodes in the L1, L2 sums)
    if self._lambda2_reg > 0.0:
        ay_grad_N[:, 1:] += self._li_w[N][:, 1:] * self._lambda2_reg
    if self._lambda1_reg > 0.0:
        ay_grad_N[:, 1:] += np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg

    return ay_delta_N, ay_D_N, ay_grad_N

You have, of course, detected the most important change:

We do not need to propagate any delta-matrices (originally coming from the error deviation at the output layer) down to layer 1!

This is due to the somewhat staggered nature of error back propagation - see the PDF on the math again. Between the first hidden layer L1 and the input layer L0 we only need to fetch the output matrix A at L0 to be able to calculate the gradient components for the weights in the weight matrix connecting L0 and L1. This saves us from the biggest matrix multiplication - and thus reduces computational time significantly.
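The decisive point can be illustrated with a few Numpy lines (the dimensions below are hypothetical, chosen to resemble the MLP of this series): the gradient for the weight matrix between L0 and L1 only requires the delta-matrix already available at L1 and the output matrix A of L0 - no delta-matrix at L0 is ever needed.

```python
import numpy as np

# hypothetical dimensions: 70 nodes on L1, 784 input nodes + 1 bias node, batch of 500
ay_delta_L1 = np.random.rand(70, 500).astype(np.float32)   # delta propagated down to L1
ay_A_L0     = np.random.rand(785, 500).astype(np.float32)  # output of L0 incl. bias row

# gradient for the weights between L0 and L1 - no delta at L0 required
ay_grad_0 = np.dot(ay_delta_L1, ay_A_L0.T)
print(ay_grad_0.shape)   # (70, 785) - the shape of the weight matrix
```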

Another bit of CPU time can be saved by calculating only the regularization terms which are really asked for. For my simple, densely populated network I almost never use Lasso regularization - so L1 = 0.

These changes got me down to the values mentioned above. And, note: The CPU time for backward propagation then drops to the level of forward propagation. So: Be somewhat skeptical about your coding if backward propagation takes much more CPU time than forward propagation!
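To compare the two phases in your own code, a tiny helper around time.perf_counter() is sufficient. The helper name "timed" and the method names in the usage comment are hypothetical; adapt them to your own class:

```python
import time

def timed(label, func, *args, **kwargs):
    """Run func once, print the elapsed wall-clock time and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.6f} secs")
    return result

# hypothetical usage inside the mini-batch loop:
#   timed("FW propagation", self._fw_propagation, ...)
#   timed("BW propagation", self._bw_propagation, ...)
```

If the "BW propagation" number is a multiple of the "FW propagation" number, the measurement points you directly at the candidate for optimization.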

I should remark that TF2 still brings some major and remarkable advantages with it. Its strength becomes clear when we go to much bigger batch sizes than 500:

When we e.g. take a size of 10000 samples in a batch, the required time of Keras and TF2 goes down to 6.4 secs. This is again a factor of roughly 1.75 faster.

I do not see any such acceleration with batch size in case of my own program!

More detailed tests showed that I do not gain speed with a batch size over 1000; the CPU time increases linearly from that point on. This actually seems to be a limitation of Numpy and OpenBlas on my system.
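You can check this scaling behavior of Numpy/OpenBlas on your own system with a simple probe like the following (hypothetical matrix dimensions; set OPENBLAS_NUM_THREADS beforehand to mimic your training setup):

```python
import time
import numpy as np

# hypothetical weight matrix L0 -> L1 incl. bias column
W = np.random.rand(70, 785).astype(np.float32)

for batch_size in (500, 1000, 2000, 4000):
    A = np.random.rand(785, batch_size).astype(np.float32)
    start = time.perf_counter()
    for _ in range(20):
        Z = np.dot(W, A)
    # when the time per sample stops shrinking, you have hit the plateau
    print(batch_size, (time.perf_counter() - start) / batch_size)
```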

Because I have some reasons to believe that TF2 also uses some basic OpenBlas routines, this is an indication that we need to put more brainpower into further optimization.

We saw in this article that ML programs based on Python and Numpy may gain a boost by using only dtype=float32 and the related accuracy for Numpy arrays. In addition we saw that avoiding unnecessary propagation steps between the first hidden layer and the input layer helps a lot.

In the next article of this series we shall look a bit at the performance of forward propagation - especially during accuracy tests on the training and test data set.

MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation

MLP, Numpy, TF2 – performance issues – Step II – bias neurons, F- or C- contiguous arrays and performance

MLP, Numpy, TF2 – performance issues – Step III – a correction to BW propagation

Before you want to use TensorFlow [TF] on a Nvidia graphics card you must install Cuda. The present version is Cuda 10.2. I was a bit naive to assume that this should be the right version - as it has been available for some time already. Wrong! Afterwards I read somewhere that TensorFlow2 [TF2] is working with Cuda 10.1, only, and not yet fully compatible with Cuda 10.2. Well, at least for my purposes [MLP training] it seemed to work nevertheless - with some "dirty" additional library links.

There is a central Cuda repository available at this address: cuda10.2. Actually, the repo offers cuda10.0, cuda10.1 and cuda10.2 (plus some Nvidia drivers). I selected some central cuda10.2 packages for installation - just to find out where the related files were placed in the filesystem. I then ran into a major chain of package dependencies, which I had to resolve in many tedious steps. Some packages may not have been necessary for a basic installation, but in the end I was too lazy to restrict the libs to what is really required for Keras. The bill came afterwards: Cuda 10.2 is huge! If you do not know exactly what you need: Be prepared to invest up to 3 GB on your hard disk.

The Cuda 10.2 RPM packages install most of the required "*.so"-shared library files and many other additional files in a directory "/usr/local/cuda-10.2/". To make changes between different versions of Cuda possible we also find a link "**/usr/local/cuda**" pointing to "/usr/local/cuda-10.2/" after the installation. Ok, reasonable - we could change the link to point to "/usr/local/cuda-10.0/". This makes you assume that the Tensorflow 2 libraries and other modules in your virtual Python3 and Jupyter environment would look for required Cuda files in the central directory "**/usr/local/cuda**" - i.e. without special version attributes of the files. Unfortunately, this was another wrong assumption. See below.

In addition to the Cuda packages you must install the present "cudnn" libraries from Nvidia - more precisely: the runtime *and* the development package. You get the RPMs from here. Be prepared to give Nvidia your private data.

I should add that I ignored and ignore the Nvidia drivers from the Cuda repository, i.e. I never installed them. Instead, I took those from the standard Nvidia community repository. They worked and work well so far - and make my update life on Opensuse easier.

I use a virtual Python3 environment and update it regularly via "pip". Regarding TF2 an update via the command "pip install --upgrade tensorflow" should be sufficient - it will resolve dependencies. (By the way: If you want to bring all Python libs to their present version you can also use "pip-review --auto". Note that under certain circumstances you may need the "--force" option for special upgrades. I cannot go into details in this article.)

Unfortunately, the next time I started my virtual Python environment I got the warning that the dynamic library "**libnvinfer.so.6**" could not be found, but was required in case I planned to use *TensorRT*. What? Well, you may find some information here

https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/

https://developer.nvidia.com/tensorrt

I leave it up to you whether you really need TensorRT. You can ignore this message - TF will run for standard purposes without it. But, dear TF-developers: a clear message in the warning would in my opinion have been helpful. Then I checked whether some version of the Nvidia related library "libnvinfer.so" came with Cuda or Cudnn onto my system. Yeah, it did - unfortunately version 7 instead of 6. :-(.

So, we are confronted with a dependency on a specific file version which is older than the present one. I do not like this style of development. Normally, it should be the other way round: If a *newer* version is required due to new capabilities you warn the user. But under normal circumstances a backward compatibility of libs should be established. You would assume such a backward compatibility and that TF would search for the present version via looking for files "libnvinfer.so" and "libnvinfer_plugin.so" which do exist and point to the latest versions. But, no, in this case they want it explicitly to be version 6 ... Makes you wonder whether the old Cudnn version is still available. I did not check it. Ok, ok - backward compatibility is not always possible ....

Just to see how good the internal checking of the alleged dependency is, I did something you normally do not do: I created a link "libnvinfer.so.6" in "/usr/lib64" pointing to "libnvinfer.so.7". I had to do the same for "libnvinfer_plugin.so.6". Effect: I got rid of the warning - so much about dependency checking. I left the links in place. You see, I sometimes trust in coming developments and run some risks ....

Then came the next surprise. I had read a bit about changed statements in TF2 (compared to TF1) - and thought I was prepared for this. But, when I tried to execute some initial commands to configure TF2 from a Jupyter cell as e.g.

import time
import tensorflow as tf
from tensorflow import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from tensorflow.python.client import device_lib
import os

#os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
tf.config.optimizer.set_jit(True)
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)
tf.debugging.set_log_device_placement(True)
device_lib.list_local_devices()

I at once got a complaint in the shell from which I had started the Jupyter notebook - saying that a lib called "**libcudart.so.10.1**" was missing. Again - an explicit version dependency. On purpose or just a glitch? Just one out of many version-dependent files? Without clear information? If this becomes the standard in the interaction between TF2 and Cuda - well, no fun any longer. In my opinion the TF2 developers should not search for files via version specific names - but do an analysis of headers and warn explicitly that the present version requires a specific Cuda version. That would be much more convenient for the user and compatible with the link mechanism described above.

Whilst a bunch of other dynamic libs was loaded by name without a version, in this case TF2 asks for a very specific version - although there is a corresponding lib available in the directory "/usr/local/cuda-10.2" .... Nevertheless, with full trust in a better future again, I offered TF2 a softlink "libcudart.so.10.1" in "/usr/lib64/" pointing to "/usr/local/cuda-10.2/lib64/libcudart.so". It cleared my way to the next hurdle. And my Keras MLP worked in the end ...

When I tried to run specific Keras commands, which TF2 wanted to compile as XLA-supported statements, I again got complaints that files in a local directory "./bin" were missing. This was a first clear indication that Cuda paths were totally ignored in my Python/Jupyter environment. But what directory did the "./" refer to? Some experiments revealed:

I had to link an artificial subdirectory "./bin" in the directory where I kept my Jupyter notebooks to "/usr/local/cuda-10.2/bin".

But the next problems with other directories waited directly around the corner. Actually many ... To make a long story short: the installation of TF2 in combination with Cuda 10.2 does not evaluate paths or ask for paths when used in a Python3/Jupyter environment. We have to provide and export them as shell environment variables. See below.

Another thing which drove me nuts was that TF2 required information about XLA-flags. It took me a while to find out that this also could be handled via environment variables.

All in all I now start the shell from which I launch my virtual Python environment and Jupyter notebooks with the following command sequence:

myself@mytux:/projekte/GIT/....../ml> export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
myself@mytux:/projekte/GIT/....../ml> export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
myself@mytux:/projekte/GIT/....../ml_1> export OPENBLAS_NUM_THREADS=4
myself@mytux:/projekte/GIT/....../ml_1> source bin/activate
(ml) myself@mytux:/projekte/GIT/....../ml_1> jupyter notebook

The first two commands did the magic regarding the path-problems! TF2 worked afterwards both for XLA-capable CPUs and Nvidia GPUs. So, a specific version may or may not have advantages - I do not know - but at least you can get TF2 running with Cuda 10.2.

Without the use of explicit compatibility commands TF2 does not support commands like

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores,
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU, 'GPU' : num_GPU},
                        log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction=0.4
config.gpu_options.force_gpu_compatible = True

any longer. But as with TF1 you probably do not want to pick all the memory from your graphics card and you do not want to use all cores of a CPU in TF2. You can circumvent the lack of a "ConfigProto" property in TF2 by the following commands:

# configure use of just in time compiler
tf.config.optimizer.set_jit(True)

# limit use of parallel threads
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(4)

# Not required in TF2: tf.enable_eager_execution()

# print out use of certain device (at first run)
tf.debugging.set_log_device_placement(True)

# limit use of graphics card memory
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

# print out a list of usable devices
device_lib.list_local_devices()

**Addendum, 15.05.2020:**

Well, this actually proved to be correct for the limitation of the GPU memory, only. The limitations on the CPU cores do **NOT** work. At least not on my system. See also:

tensorflow issues 34415.

I give you a workaround below.

Afterwards the following simple Keras MLP ran without problems and with the expected performance on a GPU and a multicore CPU:

**Jupyter cell 1**

import time
import tensorflow as tf
#from tensorflow import keras as K
import keras as K
from keras.datasets import mnist
from keras import models
from keras import layers
from keras.utils import to_categorical
from tensorflow.python.client import device_lib
import os

# use to work with CPU (CPU XLA) only
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

# if not yet done elsewhere
tf.config.optimizer.set_jit(True)
tf.debugging.set_log_device_placement(True)

use_cpu_or_gpu = 1 # 0: cpu, 1: gpu

# The following can only be done once - all CPU cores are used otherwise
#if use_cpu_or_gpu == 0:
#    tf.config.threading.set_intra_op_parallelism_threads(4)
#    tf.config.threading.set_inter_op_parallelism_threads(6)

# function for training
def train(train_images, train_labels, epochs, batch_size):
    network.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size)

# setup of the MLP
network = models.Sequential()
network.add(layers.Dense(200, activation='sigmoid', input_shape=(28*28,)))
network.add(layers.Dense(100, activation='sigmoid'))
network.add(layers.Dense(50, activation='sigmoid'))
network.add(layers.Dense(30, activation='sigmoid'))
network.add(layers.Dense(10, activation='sigmoid'))
network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# load MNIST
mnist = K.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# simple normalization
train_images = X_train.reshape((60000, 28*28))
train_images = train_images.astype('float32') / 255
test_images = X_test.reshape((10000, 28*28))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)

**Jupyter cell 2**

# run it
if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_g = time.perf_counter()
    test_loss, test_acc = network.evaluate(test_images, test_labels)
    print('Time_GPU: ', end_g - start_g)
else:
    start_c = time.perf_counter()
    train(train_images, train_labels, epochs=45, batch_size=1500)
    end_c = time.perf_counter()
    test_loss, test_acc = network.evaluate(test_images, test_labels)
    print('Time_CPU: ', end_c - start_c)

# test accuracy
print('Acc: ', test_acc)

Another pitfall is that - depending on the exact version of TF2 - you may need to use the following statement to run (parts of) your code on the CPU **only**:

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

in the beginning. Otherwise Tensorflow 2.0 and version 2.1 will choose execution on the GPU even if you use a statement like

with tf.device("/CPU:0"):

(which worked in TF1).

It seems that this problem was solved with TF 2.2 (tested it on 15.05.2020)! But you may have to check it yourself.

You can watch the involvement of the GPU e.g. with "watch -n0.1 nvidia-smi" on a terminal. Another possibility is to set

tf.debugging.set_log_device_placement(True)

and get messages in the shell of your virtual Python environment or in the presently used Jupyter cell.

After several trials and tests I think that both TF2 and the Keras version delivered with it handle the TF2 statements given above for limiting the number of CPU cores inefficiently. In addition, the behavior of TF2/Keras has changed across the TF2 versions 2.0, 2.1 and now 2.2.

Strange things also happen, when you combine statements of the TF1 compat layer with pure TF2 restriction statements. You should refrain from mixing them.

So, it is either

**Option 1: CPU only and limited number of cores**

from tensorflow import keras as K
from tensorflow.python.keras import backend as B
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=4,
                                  inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config))
...

OR

**Option 2: Mixture of GPU (with limited memory) and CPU (limited core number) with TF2 statements**

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

tf.config.threading.set_intra_op_parallelism_threads(6)
tf.config.threading.set_inter_op_parallelism_threads(1)

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        print(e)

OR

**Option 3: Mixture of GPU (limited memory) and CPU (limited core numbers) with TF1 compat statements**

import tensorflow as tf
from tensorflow import keras as K
from tensorflow.python.keras import backend as B
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist
from tensorflow.python.client import device_lib
import os

#gpu = False
gpu = True
if gpu:
    GPU = True;  CPU = False; num_GPU = 1; num_CPU = 1
else:
    GPU = False; CPU = True;  num_CPU = 1; num_GPU = 0

config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                                  inter_op_parallelism_threads=1,
                                  allow_soft_placement=True,
                                  device_count = {'CPU' : num_CPU, 'GPU' : num_GPU},
                                  log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction=0.35
config.gpu_options.force_gpu_compatible = True
B.set_session(tf.compat.v1.Session(config=config))

**Hint 1:**

If you want to test some code parts on the GPU and others on the CPU in the same session, I strongly recommend using the compat statements in the form given by Option 3 above.

The reason is that it - strangely enough - gives you a faster performance on a multicore CPU by more than 25% in comparison to the pure TF2 statements.

Afterwards you can use statements like:

batch_size = 64
epochs = 5

if use_cpu_or_gpu == 1:
    start_g = time.perf_counter()
    with tf.device("/GPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_g = time.perf_counter()
    print('Time_GPU: ', end_g - start_g)
else:
    start_c = time.perf_counter()
    with tf.device("/CPU:0"):
        train(train_imgs, train_labels, epochs, batch_size)
    end_c = time.perf_counter()
    print('Time_CPU: ', end_c - start_c)

**Hint 2:**

If you check the limitations on CPU cores (threads) via watching the CPU load on tools like "gkrellm" or "ksysguard", it may appear that all cores are used in parallel. You have to set the update period of these tools to 0.1 sec to see that each core is only used intermittently. In gkrellm you should also see a variation of the average CPU load value with a variation of the parameter "intra_op_parallelism_threads=n".

**Hint 3:**

In my case with a Quadcore CPU with hyperthreading the following settings seem to be optimal for a variety of Keras CNN models - whenever I want to train them on the CPU only:

...
config = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=6,
                                  inter_op_parallelism_threads=1)
B.set_session(tf.compat.v1.Session(config=config))
...

**Hint 4:**

If you want to switch settings in a Jupyter session it is best to stop and *restart* the respective kernel. You can do this via the commands under "kernel" in the Jupyter interface.

Well, my friends, the above steps were what I had to do to get Keras working in combination with TF2, Cuda 10.2 and the present version of Cudnn. I do not regard this as a straightforward procedure - to say it mildly.

In addition, after some tests I might also say that the performance seems to be worse than with Tensorflow 1 - especially when using Keras, whether the Keras included in Tensorflow 2 or Keras in form of a separate Python lib. The performance on a GPU is astonishingly bad with Keras for small networks.

This impression of sluggishness stands in a strange contrast to elementary tests where I saw a factor of 5 difference for a series of typical matrix multiplications executed directly with tf.matmul() on a GPU vs. a CPU. But this is another story .....

tensorflow-running-version-with-cuda-on-cpu-only


A simple Python program for an ANN to cover the MNIST dataset – XIII – the impact of regularization

A simple Python program for an ANN to cover the MNIST dataset – XII – accuracy evolution, learning rate, normalization

A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix

A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests

A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests

A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation

A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles

A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“

A simple Python program for an ANN to cover the MNIST dataset – V – coding the loss function

A simple Python program for an ANN to cover the MNIST dataset – IV – the concept of a cost or loss function

A simple Python program for an ANN to cover the MNIST dataset – III – forward propagation

A simple Python program for an ANN to cover the MNIST dataset – II - initial random weight values

A simple Python program for an ANN to cover the MNIST dataset – I - a starting point

In this article we shall work a bit on the following topic: How can we reduce the computational time required for gradient descent runs of our MLP?

Readers who followed my last articles will have noticed that I sometimes used 1800 epochs in a gradient descent run. The computational time including

- costly intermediate print outs into Jupyter cells,
- a full determination of the reached accuracy both on the full training and the test dataset at every epoch

lay in a region of 40 to 45 minutes for our MLP with two hidden layers and roughly 58000 weights - using an Intel i7 standard CPU with OpenBlas support. And I plan to work with bigger MLPs - not on MNIST but on other data sets. Believe me: Everything beyond 10 minutes is a burden. So, I have a natural interest in accelerating things on a very basic level before turning to GPUs or arrays of them.

This introductory question leads to another one: What basic factors beyond technical capabilities of our Linux system and badly written parts of my Python code influence the consumption of computational time? Four points come to my mind; you probably find even more:

- One factor is certainly the extra forward propagation run which we apply to all samples of both the test and training data sets at the end of each epoch. We perform this propagation to make predictions and to get data on the evolution of the accuracy, the total loss and the ratio of the regularization term to the real costs. We could do this in the future only at every 2nd or 5th epoch to save some time. But this will reduce CPU time by less than 22%: 76% of the CPU time of an epoch is spent in batch-handling, with a dominant part in error backward propagation and weight corrections.
- The *learning rate* has a direct impact on the number of required epochs. We could enlarge the learning rate in combination with input data normalization; see the last article. This could reduce the number of required epochs significantly - depending on the parameter choices before, by up to 40% or 50%. But it requires a bit of experimenting ....
- Two other, more important factors are the frequent number of matrix operations during error back-propagation and the size of the involved matrices. These operations depend directly on the number of nodes involved. We could therefore reduce the number of nodes of our MLP to a minimum compatible with the required accuracy and precision. This leads directly to the next point.
- The dominant weight matrix is of course the one which couples layer L0 and layer L1. In our case its shape is 784 x 70; it has almost 55000 elements. The matrix for the next pair of layers has only 70 x 30 = 2100 elements - it is much, much smaller. To reduce CPU time for forward propagation we should try to make this matrix smaller. During *error back propagation* we must perform multiple matrix multiplications; the matrix dimensions depend on the number of samples in a mini-batch *AND* on the **number of nodes in the involved layers**. The dimensions of the result matrix correspond to those of the weight matrix. So once again: A reduction of the nodes in the first 2 layers would be extremely helpful for the expensive backward propagation. See: The math behind EBP.
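A quick count of the weight matrix elements for our 784-70-30-10 MLP underpins this point (bias nodes ignored here for simplicity):

```python
# number of weight matrix elements per layer pair, bias nodes ignored
layers = [784, 70, 30, 10]
for i in range(len(layers) - 1):
    n = layers[i] * layers[i + 1]
    print(f"W(L{i} -> L{i+1}): {layers[i]} x {layers[i+1]} = {n} elements")
# W(L0 -> L1) alone contributes 54880 of the 57280 elements counted this way
```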

We shall mainly concentrate on the last point in this article.

The following numbers show typical CPU times spent for matrix operations during error back propagation [EBP] between different layers of our MLP - for two different batches at the beginning of gradient descent:

```
Time_CPU for BW layer operations (to L2)  0.00029015699965384556
Time_CPU for BW layer operations (to L1)  0.0008645610000712622
Time_CPU for BW layer operations (to L0)  0.006551215999934357

Time_CPU for BW layer operations (to L2)  0.00029157400012991275
Time_CPU for BW layer operations (to L1)  0.0009575330000188842
Time_CPU for BW layer operations (to L0)  0.007488838999961445
```

The operations involving layer L0 cost roughly a factor of 7 more CPU time than the operations for the other layers! Therefore, a key to the reduction of the number of mathematical operations is obviously the reduction of the number of nodes in the input layer! We cannot reduce the number of nodes in the hidden layers much if we do not want to hamper the accuracy properties of our MLP too much. So the basic question is:

Can we reduce the number of input nodes somehow?

Yes, maybe we can! Input nodes correspond to "*features*". In case of the MNIST dataset the relevant features are given by the gray-values for the 784 pixels of each image. A first idea is that there are many pixels within each MNIST image which are probably not used at all for classification - especially pixels at the outer image borders. So, it would be helpful to chop them off or to ignore them by some appropriate method. In addition, special significant pixel areas may exist to which the MLP, i.e. its weight optimization, reacts during training. For example: The digits 3, 5, 6, 8, 9 all have a bow within the lower 30% of an image, but in other regions, e.g. to the left and the right, they are rather different.

If we could identify suitable image areas in which dark pixels have a higher probability for certain digits then, maybe, we could use this information to discriminate the represented digits? But a "higher density of dark pixels in an image area" is nothing else than a description of a "**cluster**" of (dark) pixels in certain image areas. Can we use *pixel clusters* at numerous areas of an image to learn about the represented *digits*? Is the *combination* of (averaged) feature values in certain clusters of pixels representative for a handwritten digit in the MNIST dataset?

If the number of such pixel clusters could be reduced below, let's say, 100, then we could indeed reduce the number of input features significantly!

To be able to use relevant "clusters" of pixels - if they exist in a usable form in MNIST images at all - we must first identify them. Cluster identification and discrimination is a major discipline of Machine Learning. This discipline works in general with *unlabeled* data. In the MNIST case we would not use the labels in the "y"-data at all to identify clusters; we would only use the "X"-data. A nice introduction to the mechanisms of cluster identification is given in the book of Paul Wilcott (see Machine Learning – book recommendations for the reference). The most fundamental method - called "kmeans" - iterates over 3 major steps [I simplify a bit :-)]:

- We assume that K clusters exist and start with random initial positions of their centers (called "centroids") in the multidimensional feature space
- We measure the distance of all data points to the centroids and associate each point with the centroid to which its distance is smallest
- We determine the "center of mass" (according to some distance metric) of each identified group of data points, take it as a new candidate position for the respective centroid and move the old centroid position (at least a bit) in its direction.

We iterate over these steps until the centroids' positions hopefully get stable. Pretty simple. But there is a major drawback: You must make an assumption on the number "K" of clusters. To make such an assumption can become difficult in the complex case of a feature space with hundreds of dimensions.
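The three steps above can be sketched in a few lines of plain NumPy. This is only a minimal illustration of the principle, not the optimized Scikit-Learn implementation; the function name, the initialization via random samples and the convergence test are my own choices:

```python
import numpy as np

def kmeans_sketch(X, K, n_iter=100, seed=42):
    """Minimal KMeans sketch: X has shape (num_samples, num_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial centroid positions (here: K random samples)
    centroids = X[rng.choice(X.shape[0], K, replace=False)]
    for _ in range(n_iter):
        # Step 2: associate each point with its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the center of mass of its points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids' positions have become stable
        centroids = new_centroids
    return centroids, labels
```

For well separated point groups this sketch converges within a few iterations; for high-dimensional data like MNIST one would of course use the Scikit-Learn implementations instead.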

You can compensate this by executing multiple cluster runs and comparing the results. By what criterion? By the compactness or separation of the clusters in terms of an appropriate norm. One such norm is the "**cluster inertia**"; it measures the mean squared distance to the center for all points of a cluster. The theory is that the sum of the inertias for all clusters drops significantly with the number of clusters *until* an optimal number is reached, after which the inertia curve flattens out. The point where this happens in a plot of inertia vs. number of clusters is called the "**elbow**". Identifying this "elbow" is one of the means to find an optimal number of clusters. However, this recipe does not work under all circumstances. As the number of clusters gets big we may be confronted with a smooth decline of the inertia sum.

How could we measure how close an image is to certain clusters? We could e.g. measure the distances (with some appropriate metric) of all images - as points in the feature space - to the cluster centers. The "fit_transform()"-method of KMeans and MiniBatchKMeans provides us with such a distance measure of each image to the identified clusters. This means our images are transformed into a **new feature space** - namely into a "cluster-distance space". This is a quite complex space, too. But it has fewer dimensions than the original feature space!

Note: We would of course normalize the resulting distance data in the new feature space before applying gradient descent.

There are multiple variants of "KMeans". We shall use one which is provided by Scikit-Learn and which is optimized for large datasets: "MiniBatchKMeans". It operates batch-wise without losing too much accuracy and convergence quality in comparison to KMeans (for a comparison see here). "MiniBatchKMeans" has some parameters you can play with.
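The transformation into the cluster-distance space, including the normalization mentioned in the note above, can be sketched as follows. The random input array only stands in for the real (number of samples, 784) MNIST data, and the choice of 70 clusters is just an example value:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for the real (num_samples, 784) MNIST feature array
X = np.random.rand(1000, 784)

kmeans = MiniBatchKMeans(n_clusters=70, batch_size=500, n_init=3)
X_dist = kmeans.fit_transform(X)   # distance of each sample to each centroid
print(X_dist.shape)                # (1000, 70) - the reduced feature space

# Normalize the distance data before applying gradient descent
X_in = StandardScaler().fit_transform(X_dist)
```

Instead of 784 pixel features each image is now described by 70 cluster distances, which directly shrinks the first weight matrix of the MLP.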

We could be tempted to use 10 clusters as there are 10 digits to discriminate between. But remember: A digit can be written in very many ways. So, it is much more probable that we need a significantly larger number of clusters. But again: How do we determine in which K-values we should invest a bit more time? "KMeans" and methods alike offer another quantity called the "silhouette" coefficient. It measures how well the data points fit within, at or outside the borders of a cluster. See the book of Geron referenced at the link given above for more information.

Let us first have a look at the evolution of CPU time, total inertia and averaged silhouette with the number of clusters "K" for two different runs. The following code for a Jupyter cell gives us the data:

```python
# *********************************************************
# Pre-Clustering => Searching for the elbow
# *********************************************************
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

X = np.concatenate((ANN._X_train, ANN._X_test), axis=0)
y = np.concatenate((ANN._y_train, ANN._y_test), axis=0)
print("X-shape = ", X.shape, "y-shape = ", y.shape)
num = X.shape[0]

li_n       = []
li_inertia = []
li_CPU     = []
li_sil1    = []

# Loop over the number "n" of assumed clusters
rg_n = range(10, 171, 10)
for n in rg_n:
    print("\nNumber of clusters: ", n)
    start = time.perf_counter()
    kmeans = MiniBatchKMeans(n_clusters=n, n_init=500,
                             max_iter=1000, batch_size=500)
    X_clustered = kmeans.fit_transform(X)
    sil1 = silhouette_score(X, kmeans.labels_)
    #sil2 = silhouette_score(X_clustered, kmeans.labels_)
    end = time.perf_counter()
    dtime = end - start
    print('Inertia = ', kmeans.inertia_)
    print('Time_CPU = ', dtime)
    print('sil1 score = ', sil1)
    li_n.append(n)
    li_inertia.append(kmeans.inertia_)
    li_CPU.append(dtime)
    li_sil1.append(sil1)

# Plots
# ******
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 14
fig_size[1] = 5
fig1 = plt.figure(1)
fig2 = plt.figure(2)
ax1_1 = fig1.add_subplot(121)
ax1_2 = fig1.add_subplot(122)
ax1_1.plot(li_n, li_CPU)
ax1_1.set_xlabel("num clusters K")
ax1_1.set_ylabel("CPU time")
ax1_2.plot(li_n, li_inertia)
ax1_2.set_xlabel("num clusters K")
ax1_2.set_ylabel("inertia")
ax2_1 = fig2.add_subplot(121)
ax2_2 = fig2.add_subplot(122)
ax2_1.plot(li_n, li_sil1)
ax2_1.set_xlabel("num clusters K")
ax2_1.set_ylabel("silhouette 1")
```

You see that I allowed for large numbers of initial centroid positions and iterations to be on the safe side. Before you try it yourself: Such runs for a broad variation of K-values are relatively costly. The CPU time rises from around 32 seconds for 30 clusters to a little less than 1 minute for 180 clusters. These times add up to a significant sum after a while ...

Here are some plots:

The second run was executed with a higher resolution of K_(n+1) - K_n = 5.

We see that the CPU time to determine the centroids' positions varies fairly linearly with "K". And even for 170 clusters it does not take more than a minute! So, CPU time for cluster identification is not a major limitation.

Unfortunately, we do not see a clear elbow in the inertia curve! What you regard as a reasonable choice for the number K depends a lot on where you say the curve starts to flatten. You could say that this happens around K = 60 to 90. But the results for the silhouette quantity indicate, for our parameter setting, that K = 40, K = 70 and K = 90 are interesting points. We shall look at these points more closely with higher resolution later on.

Now, I want to discuss an important point which I did not find in the literature:

In my last article we saw that regularization plays a significant but also delicate role in reaching top accuracy values for the test dataset. We saw that Lambda2 = 0.2 was a good choice for a normalized input of the MNIST data. It corresponded to a certain ratio of the regularization term to average batch costs.

But when we reduce the number of input nodes we also reduce the total number of weights. So the individual weight values will automatically become bigger if we want to produce similarly good values at the second layer. But as the regularization term depends in a quadratic way on the weights, we may assume that we roughly need a linear reduction of Lambda2. So, for K = 100 clusters we may shrink Lambda2 to 0.2 * 100/784 ≈ 0.0255 instead of 0.2. In general:

Lambda2_cluster = Lambda2_std * K / (number of input nodes)
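As a quick sketch in code (the function name is my own; 784 is the original number of MNIST input features):

```python
def lambda2_cluster(lambda2_std, k_clusters, n_input_nodes=784):
    """Scale the Ridge factor Lambda2 linearly with the reduced input size."""
    return lambda2_std * k_clusters / n_input_nodes

# K = 100 clusters instead of 784 pixel features:
print(lambda2_cluster(0.2, 100))   # ~0.0255
```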

I applied this rule of thumb successfully throughout my experiments with clustering before gradient descent.

We saw at the end of article XII that we could reach an accuracy of around 0.975 after 500 epochs under optimal circumstances. But in the case I presented then I was extremely lucky with the statistical initial weight distribution and the batch composition. In other runs with the same parameter setup I got smaller accuracy values. So, let us take an ad hoc run with the following parameters and results as our reference:

**Parameters:** learn_rate = **0.001**, decrease_rate = 0.00001, mom_rate = 0.00005, n_size_mini_batch = **500**, n_epochs = 600, Lambda2 = 0.2, weights at all layers in [-2*1.0/sqrt(num_nodes_layer), 2*1.0/sqrt(num_nodes_layer)]

**Results:** acc_train: 0.9949 , **acc_test: 0.9735**, convergence after ca. 550-600 epochs

The next plot shows (from left to right and then down) the evolution of the costs per batch, the averaged error of the last mini-batch during an epoch, the ratio of regularization to batch costs and the total costs of the training set, respectively.

The following plot summarizes the evolution of the total costs of the training set (including the regularization contribution) and the evolution of the accuracy on the training and the test data sets (in orange and blue, respectively).

The required computational time for the 600 epochs was roughly 18.2 minutes.

Before we go into a more detailed discussion of code adaption and test runs with things like clusters in unnormalized and normalized feature spaces, I want to show what we - without too much effort - can get out of using cluster detection ahead of gradient descent. The next plot shows the evolution of a run for **K=70** clusters in combination with a special normalization:

and the total cost and accuracy evolution

The dotted line marks an **accuracy of 97.8%**! This is 0.5% bigger than our reference value of 97.3%. The total gain of > 0.5% corresponds, however, to 18.5% of the remaining difference of 2.7% to 100% - and we passed a value of 97.8% already at epoch 600 of the run.

What were the required computational times?

If we just wanted 97.4% accuracy we would need around 150 epochs - and a total CPU time of only 1.3 minutes to get to the same accuracy as our reference run. This is a **factor of roughly 14** in required CPU time. For a stable 97.73% after epoch 350 we were still **a factor of 5.6 faster**. For a stable accuracy beyond 97.8% we needed around 600 epochs - and were still a **factor of 3.3 faster** than our reference run! So, clustering really brings some big advantages with it.

In this article I discussed the idea of introducing cluster identification in the (unnormalized or normalized) feature space ahead of gradient descent as a possible means to save computational time. A preliminary trial run showed that we can indeed become significantly **faster - by at least a factor of 3 up to 5** and even more. This is simply due to the fact that we reduced the number of input nodes and thus the number of mathematical calculations during matrix operations.

In the next article we shall have a more detailed look at clustering techniques in combination with normalization.

A simple Python program for an ANN to cover the MNIST dataset – XII – accuracy evolution, learning rate, normalization

A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix

A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests

A simple Python program for an ANN to cover the MNIST dataset – IX – First Tests

A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation

A simple Python program for an ANN to cover the MNIST dataset – VII – EBP related topics and obstacles

A simple Python program for an ANN to cover the MNIST dataset – VI – the math behind the „error back-propagation“

A simple Python program for an ANN to cover the MNIST dataset – V – coding the loss function

A simple Python program for an ANN to cover the MNIST dataset – IV – the concept of a cost or loss function

A simple Python program for an ANN to cover the MNIST dataset – III – forward propagation

A simple Python program for an ANN to cover the MNIST dataset – II - initial random weight values

A simple Python program for an ANN to cover the MNIST dataset – I - a starting point

In the last article of the series we made some interesting experiences with the variation of the "learning rate". We also saw that a reasonable range for the initial weight values should be chosen.

Even more fascinating was, however, the impact of a normalization of the input data on a smooth and fast gradient descent. We drew the conclusion that normalization is of major importance when we use the sigmoid function as the MLP's activation function - especially for nodes in the first hidden layer and for input data which are on average relatively big. The reason for our concern was saturation effects of the sigmoid function and of other functions with a similar variation with their argument. In the meantime I have tried to make the importance of normalization even more plausible with the help of a very minimalistic perceptron, for which we can analyze saturation effects a bit more in depth; you get to the related article series via the following link:

A single neuron perceptron with sigmoid activation function – III – two ways of applying Normalizer

There we also have a look at other *normalizers* or *feature scalers*.

But back to our series on a multi-layer perceptron. You may have asked yourself in the meantime: Why did he not check the impact of the *regularization*? Indeed: We kept the parameter Lambda2 for the quadratic regularization term constant in all experiments so far: Lambda2 = 0.2. So, the question about the impact of regularization, e.g. on accuracy, is a good one.

I add one more question: How big is the relative contribution of the regularization term to the total loss or cost function? In our Python program for an MLP model we included a so-called quadratic Ridge term:

**Lambda2 * 0.5 * SUM[ w_i^2 ]**, where the sum runs over all weights w_i and bias nodes are excluded from the sum.
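Per mini-batch this contribution can be computed from the layer weight matrices along the following lines. This is only a sketch: the list `li_weights` is an assumed container of the weight matrices, with bias-related weights assumed to be already excluded:

```python
import numpy as np

def ridge_term(li_weights, lambda2):
    """Quadratic (L2/Ridge) regularization contribution to the batch costs."""
    return lambda2 * 0.5 * sum(np.sum(W**2) for W in li_weights)

# Example with two small all-ones weight matrices
W1 = np.ones((4, 3))   # 12 weights, each contributing 1 when squared
W2 = np.ones((3, 2))   # 6 weights
print(ridge_term([W1, W2], 0.2))   # 0.2 * 0.5 * (12 + 6) = 1.8
```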

From various books on Machine Learning [ML] you just learn to choose the factor Lambda2 in the range between 0.01 and 0.1. But how big is the resulting term actually in comparison to the standard cost term, then, and how does the ratio between both terms evolve during gradient descent? What factors influence this ratio?

As we follow a training strategy based on mini-batches the regularization contribution was and is added up to the costs of each mini-batch. So its relative importance varies of course with the size of the mini-batches! Other factors which may also be of some importance - at least during the first epochs - could be the total number of weights in our network and the range of initial weight values.

Regarding the evolution during a converging gradient descent we know already that the total costs go down on the path to a cost minimum - whilst the weight values reach a stable level. So there is a (non-linear!) competition between the regularization term and the real costs of the "Log Loss" cost function! During convergence the relative importance of the regularization term may therefore become bigger until the ratio to the standard costs reaches an eventual constant level. But how dominant will the regularization term get in the end?

Let us do some experiments with the MNIST dataset again! We fix some **common parameters and conditions** for our test runs:

As we saw in the last article we should normalize the input data. So, all of our numerical experiments below (with the exception of the last one) are done with standardized input data (using Scikit-Learn's StandardScaler). In addition, initial weights are all set according to the sqrt(nodes)-rule for all layers, i.e. in the interval [-0.5*sqrt(1/num_nodes), 0.5*sqrt(1/num_nodes)], with num_nodes meaning the number of nodes in a layer. Other parameters, which we keep constant, are:

**Parameters:** learn_rate = 0.001, decrease_rate = 0.00001, mom_rate = 0.00005, n_size_mini_batch = 500, n_epochs = 800.
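The standardization and the weight-initialization rule described above can be sketched as follows. The helper function is my own illustration, and I interpret num_nodes here as the number of nodes feeding a layer:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for raw MNIST gray values in [0, 255]
X = np.random.rand(1000, 784) * 255.0
X_std = StandardScaler().fit_transform(X)   # mean 0, std 1 per feature

def init_weights(n_nodes_in, n_nodes_out, rng=np.random.default_rng()):
    """Uniform weights in [-0.5*sqrt(1/n_nodes_in), 0.5*sqrt(1/n_nodes_in)]."""
    w_max = 0.5 * np.sqrt(1.0 / n_nodes_in)
    return rng.uniform(-w_max, w_max, (n_nodes_in, n_nodes_out))

W1 = init_weights(784, 70)   # weights coupling layer L0 and layer L1
```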

I added some statements to the method for cost calculation in order to save the relative part of the regularization terms with respect to the total costs of each mini-batch in a Numpy array and plot the evolution in the end. The changes are so simple that I omit showing the modified code.

How does the outcome of gradient descent look for standardized input data and a Lambda2-value of 0.1?

**Lambda2 = 0.1**

**Results:** acc_train: 0.999 , acc_test: **0.9714**, convergence after ca. 600 epochs

We see that the regularization term actually dominates the total loss of a mini-batch at convergence - at least with our present parameter setting. In comparison to the total loss of the full training set the contribution is of course much smaller and typically below 1%.

Let us reduce the regularization term via setting **Lambda = 0.01**. We expect its initial contribution to the costs of a batch to be smaller then, but this does NOT mean that the ratio to the standard costs of the batch automatically shrinks significantly, too:

**Lambda2 = 0.01**

**Results:** acc_train: **1.0** , acc_test: **0.9656**, convergence after ca. 350 epochs

Note the absolute scale of the costs in the plots! We ended up at a much lower level of the total loss of a batch! But the relative dominance of regularization at the point of convergence actually increased! However, this did not help us with the accuracy of our MLP-algorithm on the test data set - *although* we fit the training set perfectly with 100% accuracy.

In the end this is what regularization is all about. We do not want total overfitting - a perfect adaption of the network to the training set. It will not help in the sense of getting a better *general* accuracy on other input data. A Lambda2 of 0.01 is much too small in our case!

So let's enlarge Lambda2 a bit:

**Lambda2 = 0.2**

**Results:** acc_train: **0.9946** , acc_test: **0.9728**, convergence after ca. 700 epochs

We get an improved accuracy!

**Lambda2 = 0.4**

**Results:** acc_train: **0.9858** , acc_test: **0.9693**, convergence after ca. 600 epochs

**Lambda2 = 0.8**

**Results:** acc_train: **0.9705** , acc_test: **0.9588**, convergence after ca. 400 epochs

OK, but in both cases we see a significant and systematic trend towards reduced accuracy values on the test data set with growing Lambda2-values > 0.2 for our chosen mini-batch size (500 samples).

We learned a bit about the impact of regularization today. Whatever the exact Lambda2-value - in the end the contribution of the regularization term becomes a significant part of the total loss of a mini-batch when we approach the total cost minimum. However, the factor Lambda2 must be chosen with a reasonable size to get an impact of regularization on the final minimum position in weight-space! Then it will help to improve accuracy on general input data in comparison to overfitted solutions!

But we also saw that there is a balance to take care of: For an optimum of generalization AND accuracy you should neither make Lambda2 too small nor too big. In our case Lambda2 = 0.2 seems to be a reasonable and good choice. This might be different for other datasets.

All in all, studying the variation of achieved accuracy with the factor of a Ridge regularization term seems to be a good investment of time in ML projects. We shall come back to this point in the next articles of this series.

In the next article we shall start to work on cluster detection in the feature space of the MNIST data before using gradient descent.
