# A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix

Welcome back to my readers who followed me through the (painful?) process of writing a Python class to simulate a “Multilayer Perceptron” [MLP]. The pain in my case resulted from the fact that I am still a beginner in Machine Learning [ML] and Python. Nevertheless, I hope that we have meanwhile acquired some basic understanding of how an MLP works and “learns”. During the course of the last articles we had a close look at such nice things as “forward propagation”, “gradient descent”, “mini-batches” and “error backward propagation”. For the latter I gave you a mathematical description to grasp the background of the matrix operations involved.

Where do we stand after 10 articles and a PDF on the math?

We have a working code

• with some parameters to control layers and node numbers, learning and momentum rates and regularization,
• with many dummy parts for other output and activation functions than the sigmoid function we used so far,
• with prepared code fragments for applying MSE instead of “Log Loss” as a cost function,
• and with dummy parts for handling different input datasets than the MNIST example.

The code is not yet optimized; it includes e.g. many statements for tests which we should eliminate or comment out. A completely open conceptual aspect is the adaptation of the learning rate; it is very primitive so far. We also need an export/import functionality to be able to perform training with a series of limited epoch numbers per run.
We should also save the weights and accuracy data after a fixed epoch interval to be able to analyze a bit more after training. Another idea – though probably costly – is to perform intermediate runs on the test data set and get some information on the development of the averaged error on the test data set.

Despite all these deficits, which we need to cover in some more articles, we are already able to perform an insightful task – namely to find out which digits and corresponding images of the MNIST data set our MLP has problems with. This leads us to the topics of a confusion matrix and other measures for the accuracy of our algorithm.

However, before we look at these topics, we first create some useful code, which we can save inside cells of the Jupyter notebook we maintain for testing our class “MyANN”.

# Some functions to evaluate the prediction capability of our ANN after training

For further analysis we shall apply the following functions later on:

```
# ------ predict results for all test data
# *************************
def predict_all_test_data():
    size_set = ANN._X_test.shape[0]

    li_Z_in_layer_test = [None] * ANN._n_total_layers
    li_Z_in_layer_test[0] = ANN._X_test

    # Transpose input data matrix
    ay_Z_in_0T = li_Z_in_layer_test[0].T
    li_Z_in_layer_test[0] = ay_Z_in_0T
    li_A_out_layer_test = [None] * ANN._n_total_layers

    # prediction by forward propagation of the whole test set
    ANN._fw_propagation(li_Z_in = li_Z_in_layer_test, li_A_out = li_A_out_layer_test, b_print = False)
    ay_predictions_test = np.argmax(li_A_out_layer_test[ANN._n_total_layers-1], axis=0)

    # accuracy
    ay_errors_test = ANN._y_test - ay_predictions_test
    acc = (np.sum(ay_errors_test == 0)) / size_set
    print ("total acc for test data = ", acc)

    # return the predictions so that later notebook cells can use them
    return ay_predictions_test

def predict_all_train_data():
    size_set = ANN._X_train.shape[0]

    li_Z_in_layer_test = [None] * ANN._n_total_layers
    li_Z_in_layer_test[0] = ANN._X_train
    # Transpose
    ay_Z_in_0T = li_Z_in_layer_test[0].T
    li_Z_in_layer_test[0] = ay_Z_in_0T
    li_A_out_layer_test = [None] * ANN._n_total_layers

    ANN._fw_propagation(li_Z_in = li_Z_in_layer_test, li_A_out = li_A_out_layer_test, b_print = False)
    Result = np.argmax(li_A_out_layer_test[ANN._n_total_layers-1], axis=0)
    Error = ANN._y_train - Result
    acc = (np.sum(Error == 0)) / size_set
    print ("total acc for train data = ", acc)

# Plot confusion matrix
# originally from Runqi Yang;
# see https://gist.github.com/hitvoice/36cf44689065ca9b927431546381a3f7
def cm_analysis(y_true, y_pred, filename, labels, ymap=None, figsize=(10,10)):
    """
    Generate matrix plot of confusion matrix with pretty annotations.
    The plot image is saved to disk.
    args:
      y_true:    true label of the data, with shape (nsamples,)
      y_pred:    prediction of the data, with shape (nsamples,)
      filename:  filename of figure file to save
      labels:    string array, name the order of class labels in the confusion matrix.
                 use `clf.classes_` if using scikit-learn models.
                 with shape (nclass,).
      ymap:      dict: any -> string, length == nclass.
                 if not None, map the labels & ys to more understandable strings.
                 Caution: original y_true, y_pred and labels must align.
      figsize:   the size of the figure plotted.
    """
    if ymap is not None:
        y_pred = [ymap[yi] for yi in y_pred]
        y_true = [ymap[yi] for yi in y_true]
        labels = [ymap[yi] for yi in labels]
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    cm_sum = np.sum(cm, axis=1, keepdims=True)       # row sums = actual records per class
    cm_perc = cm / cm_sum.astype(float) * 100        # percentages relative to the actual class
    annot = np.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                s = cm_sum[i, 0]                     # scalar row sum
                annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
            elif c == 0:
                annot[i, j] = ''
            else:
                annot[i, j] = '%.1f%%\n%d' % (p, c)
    cm = pd.DataFrame(cm, index=labels, columns=labels)
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = plt.subplots(figsize=figsize)
    ax=sns.heatmap(cm, annot=annot, fmt='')
    #plt.savefig(filename)    # saving to disk is disabled here

#
# Plotting
# **********
def plot_ANN_results():
    num_epochs  = ANN._n_epochs
    num_batches = ANN._n_batches
    num_tot = num_epochs * num_batches

    cshape = ANN._ay_costs.shape
    print("n_epochs = ", num_epochs, " n_batches = ", num_batches, "  cshape = ", cshape )
    tshape = ANN._ay_theta.shape
    print("n_epochs = ", num_epochs, " n_batches = ", num_batches, "  tshape = ", tshape )

    #sizing
    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = 12
    fig_size[1] = 5

    # Two figures
    # -----------
    fig1 = plt.figure(1)
    fig2 = plt.figure(2)

    # first figure with two plot-areas with axes
    # --------------------------------------------
    # NOTE: the creation of the two axes is missing in this listing; the next
    # two lines are an assumption (two plot areas side by side in fig1)
    ax1_1 = fig1.add_subplot(121)
    ax1_2 = fig1.add_subplot(122)

    ax1_1.plot(range(len(ANN._ay_costs)), ANN._ay_costs)
    ax1_1.set_xlim (0, num_tot+5)
    ax1_1.set_ylim (0, 1500)
    ax1_1.set_xlabel("epochs * batches (" + str(num_epochs) + " * " + str(num_batches) + " )")
    ax1_1.set_ylabel("costs")

    ax1_2.plot(range(len(ANN._ay_theta)), ANN._ay_theta)
    ax1_2.set_xlim (0, num_tot+5)
    ax1_2.set_ylim (0, 0.15)
    ax1_2.set_xlabel("epochs * batches (" + str(num_epochs) + " * " + str(num_batches) + " )")
    ax1_2.set_ylabel("averaged error")

```

The first function “predict_all_test_data()” allows us to create an array with the predicted values for all test data. This is based on a forward propagation of the full set of test data; so we handle some relatively big matrices here. The second function delivers prediction values for all training data; the operations of the propagation algorithm involve even bigger matrices here. You will nevertheless find that the calculations are performed very quickly. Prediction is much faster than training!
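The core of both prediction functions is the step from the output layer's values to a predicted digit and an accuracy value. The following minimal sketch isolates this step. The small output matrix and the labels are made up for illustration; only the column-wise layout (one column per sample) mirrors the functions above.

```python
import numpy as np

# Hypothetical output of the last layer for 4 samples and 3 categories.
# Rows = output nodes (categories), columns = samples,
# mirroring the transposed layout used in predict_all_test_data().
ay_A_out = np.array([
    [0.90, 0.10, 0.20, 0.30],
    [0.05, 0.80, 0.30, 0.60],
    [0.05, 0.10, 0.50, 0.10],
])
ay_y_true = np.array([0, 1, 2, 0])   # made-up labels

# index of the strongest output node per column = predicted category
ay_pred = np.argmax(ay_A_out, axis=0)

# fraction of samples whose prediction equals the label
acc = np.mean(ay_pred == ay_y_true)

print("predictions:", ay_pred)
print("accuracy   :", acc)
```

Comparing with “==” and averaging gives the same number as counting the zeros of the difference array, as the functions above do.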

The third function “cm_analysis()” is not mine; it is taken from GitHub Gist (see below). The fourth function “plot_ANN_results()” creates plots of the evolution of the cost function and the averaged error after training. We come back to these functions below.

To be able to use these functions we need to perform some more imports first. The full list of statements which we should place in the first Jupyter cell of our test notebook now reads:

```
import numpy as np
import numpy.random as npr
import math
import sys
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import confusion_matrix
from scipy.special import expit
import seaborn as sns
import matplotlib as mpl   # needed for "mpl.cm.binary" in plot_digits() further below
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpat
import time
import imp   # deprecated and removed in Python 3.12; "importlib" is the replacement
from mycode import myann
```

Note the new lines for the import of the “pandas” and “seaborn” libraries. Please inform yourself about the purpose of each library on the Internet.

# Limited Accuracy

In the last article we performed some tests which showed that our MLP behaves robustly on the MNIST dataset. There was some slight overfitting, but playing around with hyper-parameters showed no extraordinary jump in “accuracy”, which we defined to be the percentage of correctly predicted records in the test dataset.

In general we can say that an accuracy level of 95% is what we could achieve within the range of parameters we played around with. Regression regularization (Lambda2 > 0) had some positive impact. A structural change to an MLP with just one hidden layer did NOT give us a real breakthrough regarding CPU-time consumption, but when going down to 50 or 30 nodes in the intermediate layer we saw at least some reduction by up to 25%. But then our accuracy started to become worse.

Whilst we did our tests we measured the ANN’s “accuracy” by comparing the number of records for which our ANN made a correct prediction with the total number of records in the test data set. This is a global measure of accuracy; it averages over all 10 digits, i.e. all 10 classification categories. However, if we want to look a bit deeper into the prediction errors our MLP obviously produces, it is useful to introduce some more quantities and other measures of accuracy, which can be applied on the level of each output category.

# Measures of accuracy, related quantities and classification errors for a specific category

The following quantities and basic concepts are often used in the context of ML algorithms for classification tasks. Predictions of our ANN will not be error free and thus we get an accuracy less than 100%. There are different reasons for this – and they couple different output categories. In the case of MNIST the output categories correspond to the digits 0 to 9. Let us take a specific output category, namely the digit “5”. Then there are two basic types of errors:

• The network may have predicted a “3” for a MNIST image record, which actually represents a “5” (according to the “y_test”-value for this record). With respect to the category “5” this error case is called a “False Negative”.
• The network may have predicted a “5” for a MNIST image record, which actually represents a “3” according to its “y_test”-value. With respect to the category “5” this error case is called a “False Positive”.

Both cases mark some difference between an actual and a predicted number value for a MNIST test record. Technically, “actual” refers to the number value given by the related record in our array “ANN._y_test”. “Predicted” refers to the related record in an array “ay_predictions_test”, which our function “predict_all_test_data()” returns (see the code above).

Regarding our example digit “5” we obviously can distinguish between the following quantities:

• AN : The total number of all records in the test data set which actually correspond to our digit “5”.
• TP: The number of “True Positives”, i.e. the number of those cases correctly detected as “5”s.
• FP: The number of “False Positives”, i.e. the number of those cases where our ANN falsely predicts a “5”.
• FN: The number of “False Negatives”, i.e. the number of those cases where our ANN falsely predicts another digit than “5”, but where it actually should predict a “5”.

Then we can calculate the following ratios which all somehow measure “accuracy” for a specific output category:

• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• Accuracy: TP / AN
• F1: TP / (TP + 0.5 * (FN + FP))

A careful reader will (rightly) guess that the quantity “recall” corresponds to what we would naively define as “accuracy” – namely the ratio TP/AN, because AN = TP + FN.
From its definition it follows that the quantity “F1” combines the measures “precision” and “recall”: it is their harmonic mean, 2 * precision * recall / (precision + recall).
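To make the definitions concrete, here is a small sketch which counts AN, TP, FP and FN for one category directly from two label arrays with boolean masks. The arrays and the helper name “metrics_for_category()” are invented for this illustration; they are not part of our class or of the functions shown in this article.

```python
import numpy as np

# Made-up labels and predictions for a tiny example with the digits 3, 5 and 8
ay_y_true = np.array([5, 5, 5, 5, 3, 3, 8, 8, 8, 5])
ay_y_pred = np.array([5, 5, 3, 8, 3, 5, 8, 8, 5, 5])

def metrics_for_category(ay_true, ay_pred, cat):
    """Return AN, TP, FP, FN, precision, recall and F1 for one category."""
    is_cat_true = (ay_true == cat)
    is_cat_pred = (ay_pred == cat)
    an = int(np.sum(is_cat_true))                  # actual records of this category
    tp = int(np.sum(is_cat_true & is_cat_pred))    # correctly detected
    fp = int(np.sum(~is_cat_true & is_cat_pred))   # falsely predicted as cat
    fn = int(np.sum(is_cat_true & ~is_cat_pred))   # missed records of cat
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = tp / (tp + 0.5 * (fn + fp)) if tp > 0 else 0.0
    return an, tp, fp, fn, precision, recall, f1

an, tp, fp, fn, precision, recall, f1 = metrics_for_category(ay_y_true, ay_y_pred, 5)
print(an, tp, fp, fn, precision, recall, f1)
```

The guards against a zero denominator are a design choice of this sketch: a category which is never predicted simply gets a precision of 0.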

How can we get these numbers for all 10 categories from our MLP after training?

# Confusion matrix

When we want to analyze our basic error types per category we need to look at the discrepancy between predicted and actual data. This suggests a presentation in the form of a matrix with all possible category values both in x- and y-direction. The cells of such a matrix – e.g. a cell for an actual “5” and a predicted “3” – could e.g. be filled with the corresponding FN-number.

We will later on develop our own code to solve the task of creating and displaying such a matrix. But there is a nice guy called Runqi Yang who shared some code for precisely this purpose on GitHub Gist; see https://gist.github.com/hitvoice/36c…
We can use his suggested code as it is in our context. We have already presented it above in form of the function “cm_analysis()“, which uses the pandas and seaborn libraries.

After a training run with the following parameters

```
try:
    ANN = myann.MyANN(my_data_set="mnist_keras", n_hidden_layers = 2,
                      ay_nodes_layers = [0, 70, 30, 0],
                      n_nodes_layer_out = 10,
                      my_loss_function = "LogLoss",
                      n_size_mini_batch = 500,
                      n_epochs = 1800,
                      n_max_batches = 2000,  # small values only for test runs
                      lambda2_reg = 0.2,
                      lambda1_reg = 0.0,
                      vect_mode = 'cols',
                      learn_rate = 0.0001,
                      decrease_const = 0.000001,
                      mom_rate   = 0.00005,
                      shuffle_batches = True,
                      print_period = 50,
                      figs_x1=12.0, figs_x2=8.0,
                      legend_loc='upper right',
                      b_print_test_data = True
                      )
except SystemExit:
    print("stopped")
```

we get three figures (not reproduced here); the last of them is the plot of the confusion matrix.

When I studied the last plot for a while I found it really instructive. Each of its cells outside the diagonal obviously contains the number of “False Negative” records for two specific category values – counted with respect to the actual value.

What more do we learn from the matrix? Well, the numbers in the cells on the diagonal, in a row and in a column are related to our quantities TP, FN and FP:

• Cells on the diagonal: On the diagonal we should find large “True Positive” numbers compared to the actual numbers of the respective MNIST digits (at least if all digits are reasonably distributed across the MNIST dataset). We see that this indeed is the case. The ratio of “True Positives” to “Actual Positives” is given as a percentage, together with the related numbers, inside the respective cells on the diagonal.
• Cells of a row: The values in the cells of a row (without the cell on the diagonal) of the displayed matrix give us the numbers/ratios for “False Negatives” – with respect to the actual value. If you sum up the individual FN-numbers you get the total number of “False Negatives”, which of course is the difference between the total number AN and the number TP for the actual category.
• Cells of a column: The column cells (again without the cell on the diagonal) contain the numbers/ratios for “False Positives” – with respect to the predicted value. If you sum up the individual FP-numbers you get the total number of “False Positives” with respect to the predicted column value.

So, be a bit careful: A FN value with respect to an actual row value is a FP value with respect to the predicted column value – if the cell is one outside the diagonal!

All ratios are calculated with respect to the total actual numbers of data records for a specific category, i.e. a digit.
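The relations between cells, rows and columns can be written down in a few lines of Numpy. The following sketch uses an invented 3x3 matrix (rows = actual category, columns = predicted category); it is meant as an illustration of the bookkeeping, not as a result of our MLP.

```python
import numpy as np

# Made-up 3x3 confusion matrix: rows = actual category, columns = predicted category
cf = np.array([
    [50,  2,  3],
    [ 4, 40,  6],
    [ 1,  5, 44],
])

tp = np.diag(cf)              # diagonal cells
an = cf.sum(axis=1)           # row sums = actual records per category
fn = an - tp                  # rest of each row
fp = cf.sum(axis=0) - tp      # rest of each column

precision = tp / (tp + fp)
recall = tp / an

print("TP:", tp, " FN:", fn, " FP:", fp)
print("precision:", np.round(precision, 3))
print("recall   :", np.round(recall, 3))
```

Summed over all categories the FN- and FP-numbers are equal, because every cell outside the diagonal is counted once as a FN (for its row) and once as a FP (for its column).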

Looking closely we detect that our code obviously has some problems with distinguishing pictures of “5”s from pictures of “3”s, “6”s and “8”s. The same is true for “8”s and “3”s or “2”s. Also the distinction between “9”s, “3”s and “4”s seems to be difficult sometimes.

# Does the confusion matrix change due to random initial weight values and mini-batch-shuffling?

We have seen already that statistical variations have no big impact on the eventual accuracy when training converges to points in the parameter-space close to the point for the minimum of the overall cost-function. Statistical effects between two training runs stem in our case from randomly chosen initial values of the weights and the changes to our mini-batch composition between epochs. But as long as our training converges (and ends up in a global minimum) we should not see any big impact on the confusion matrix. And indeed a second run leads to:

The values are pretty close to those of the first run.

# Precision, Recall values per digit category and our own confusion matrix

Ok, we now can look at the nice confusion matrix plot and sum up all the values in a row of the confusion matrix (outside the diagonal) to get the total FN-number for the related actual digit value. Or sum up the entries in a column to get the total FP-number. But we want to calculate these values from the ANN’s prediction results without looking at a plot and without manual summation. In addition we want to get the data of the confusion matrix in our own Numpy matrix array independently of foreign code. The following box displays the code for two functions, which are well suited for this task:

```
# A class to print in color and bold
class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

def acc_values(ay_pred_test, ay_y_test):
    ay_x = ay_pred_test
    ay_y = ay_y_test
    # -----
    #- dictionary for all false positives for all 10 digits
    fp = {}
    fpnum = {}
    irg = range(10)
    for i in irg:
        key = str(i)
        xfpi = np.where(ay_x==i)[0]             # indices of records predicted as i
        fpi = np.zeros((10000, 3), np.int64)    # 10000 = size of the MNIST test set

        n = 0
        for j in xfpi:
            if ay_y[j] != i:
                row = np.array([j, ay_x[j], ay_y[j]])   # [index, predicted, actual]
                fpi[n] = row
                n+=1

        fpi_real   = fpi[0:n]
        fp[key]    = fpi_real
        fpnum[key] = fp[key].shape[0]

    #- dictionary for all false negatives for all 10 digits
    fn = {}
    fnnum = {}
    irg = range(10)
    for i in irg:
        key = str(i)
        yfni = np.where(ay_y==i)[0]             # indices of records which actually are i
        fni = np.zeros((10000, 3), np.int64)

        n = 0
        for j in yfni:
            if ay_x[j] != i:
                row = np.array([j, ay_x[j], ay_y[j]])
                fni[n] = row
                n+=1

        fni_real = fni[0:n]
        fn[key] = fni_real
        fnnum[key] = fn[key].shape[0]

    #- dictionary for all true positives for all 10 digits
    tp = {}
    tpnum = {}
    actnum = {}
    irg = range(10)
    for i in irg:
        key = str(i)
        ytpi = np.where(ay_y==i)[0]
        actnum[key] = ytpi.shape[0]
        tpi = np.zeros((10000, 3), np.int64)

        n = 0
        for j in ytpi:
            if ay_x[j] == i:
                row = np.array([j, ay_x[j], ay_y[j]])
                tpi[n] = row
                n+=1

        tpi_real = tpi[0:n]
        tp[key] = tpi_real
        tpnum[key] = tp[key].shape[0]

    #- We create an array for the precision values of all 10 digits
    #  (float type; an integer array would truncate the ratios to 0)
    ay_prec_rec_f1 = np.zeros((10, 9), np.float64)
    print(color.BOLD + "Precision, Recall, F1, Accuracy, TP, FP, FN, AN" + color.END +"\n")
    print(color.BOLD + "i  ", "prec  ", "recall  ", "acc    ", "F1       ", "TP    ",
          "FP    ", "FN    ", "AN" + color.END)
    for i in irg:
        key = str(i)
        tpn = tpnum[key]
        fpn = fpnum[key]
        fnn = fnnum[key]
        an  = actnum[key]
        precision = tpn / (tpn + fpn)
        prec = format(precision, '7.3f')
        recall = tpn / (tpn + fnn)
        rec = format(recall, '7.3f')
        accuracy = tpn / an
        acc = format(accuracy, '7.3f')
        f1 = tpn / ( tpn + 0.5 * (fnn+fpn) )
        F1 = format(f1, '7.3f')
        TP = format(tpn, '6.0f')
        FP = format(fpn, '6.0f')
        FN = format(fnn, '6.0f')
        AN = format(an,  '6.0f')

        row = np.array([i, precision, recall, accuracy, f1, tpn, fpn, fnn, an])
        ay_prec_rec_f1[i] = row
        print (i, prec, rec, acc, F1, TP, FP, FN, AN)

    return tp, tpnum, fp, fpnum, fn, fnnum, ay_prec_rec_f1

def create_cf(ay_fn, ay_tpnum):
    ''' fn: array with false negatives row = np.array([j, x[j], y[j]])
    '''
    cf = np.zeros((10, 10), np.int64)
    rgi = range(10)
    rgj = range(10)
    for i in rgi:
        key = str(i)
        fn_i = ay_fn[key][ay_fn[key][:,2] == i]
        for j in rgj:
            if j!= i:
                fn_ij = fn_i[fn_i[:,1] == j]
                #print(i, j, fn_ij)
                num_fn_ij = fn_ij.shape[0]
                cf[i,j] = num_fn_ij
            if j==i:
                cf[i,j] = ay_tpnum[key]

    cols=["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    df = pd.DataFrame(cf, columns=cols, index=cols)
    # print( "\n", df, "\n")
    # df.style

    return cf, df

```

The first function takes an array with prediction values (later on provided externally by our “ay_predictions_test”) and compares its values with those of a y_test array which contains the actual values (later provided externally by our “ANN._y_test”). Then it uses array-slicing to create new arrays with information on all error records, related indices and the confused category values. Eventually, the function determines the numbers for AN, TP, FP and FN (per digit category) and prints the gathered information. It also returns arrays with information on records which are “True Positives”, “False Positives”, “False Negatives” and the various numbers.

The second function uses array-slicing of the array which contains all information on the “False Negatives” to reproduce the confusion matrix. It involves Pandas to produce a styled output for the matrix.
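The explicit loops in the two functions above are easy to follow, but Numpy can do the same counting in a vectorized way. The following sketch is an alternative, not the code used in this article: the helper “confusion_counts()” and the random stand-in data are assumptions for illustration. It builds the matrix with “np.add.at” and finds the indices of the “False Negatives” of one digit with a boolean mask.

```python
import numpy as np

def confusion_counts(ay_y_true, ay_y_pred, n_cat=10):
    """Count (actual, predicted) pairs: rows = actual, columns = predicted."""
    cf = np.zeros((n_cat, n_cat), dtype=np.int64)
    # unbuffered in-place add; repeated index pairs are accumulated correctly
    np.add.at(cf, (ay_y_true, ay_y_pred), 1)
    return cf

# Made-up stand-ins for ANN._y_test and ay_predictions_test
rng = np.random.default_rng(0)
ay_y_true = rng.integers(0, 10, size=1000)
ay_y_pred = ay_y_true.copy()
idx_wrong = rng.choice(1000, size=50, replace=False)     # corrupt 50 records
ay_y_pred[idx_wrong] = (ay_y_pred[idx_wrong] + 1) % 10   # shift to a wrong digit

cf = confusion_counts(ay_y_true, ay_y_pred)

# indices of the "False Negative" records of digit 4, found with a boolean mask
idx_fn_4 = np.flatnonzero((ay_y_true == 4) & (ay_y_pred != 4))

print(cf)
print("FN indices for digit 4:", idx_fn_4)
```

A plain “cf[ay_y_true, ay_y_pred] += 1” would not work here, because fancy-indexed assignment counts repeated index pairs only once.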

Now you can run the above code and the following one in Jupyter cells – of course, only after you have completed a training and a prediction run:

For my last run I got the following data:

We again see that especially “5”s and “9”s have a problem with FNs. When you compare the values of the last printed matrix with those in the plot of the confusion matrix above, you will see that our code produces the right FN/FP/TP-values. We have succeeded in producing our own confusion matrix – and we have all values directly available in our own Numpy arrays.

# Some images of “4”-digits with errors

We can use the arrays which we created with the functions above to get a look at the images. We use the function “plot_digits()” of Aurélien Géron (handson-ml2, chapter 03 on classification) to plot several images in a series of rows and columns. The code is pretty easy to understand; at its center we find the matplotlib-function “imshow()”, which we have already used in other ML articles.

We again perform some array-slicing of the arrays our function “acc_values()” (see above) produces to identify the indices of images in the “X_test”-dataset we want to look at. We collect the first 50 examples of “true positive” images of the “4”-digit, then we take the “false positives” of the 4-digit and eventually the “false negative” cases. We then plot the images in this order:

```
def plot_digits(instances, images_per_row=10, **options):
    # requires "import matplotlib as mpl" (for mpl.cm.binary)
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

ay_tp, ay_tpnum, ay_fp, ay_fpnum, ay_fn, ay_fnnum, ay_prec_rec_f1 = \
    acc_values(ay_pred_test = ay_predictions_test, ay_y_test = ANN._y_test)

idx_act = str(4)

# fetching the true positives
num_tp = ay_tpnum[idx_act]
idx_tp = ay_tp[idx_act][:,[0]]
idx_tp = idx_tp[:,0]
X_test_tp = ANN._X_test[idx_tp]

# fetching the false positives
num_fp = ay_fpnum[idx_act]
idx_fp = ay_fp[idx_act][:,[0]]
idx_fp = idx_fp[:,0]
X_test_fp = ANN._X_test[idx_fp]

# fetching the false negatives
num_fn = ay_fnnum[idx_act]
idx_fn = ay_fn[idx_act][:,[0]]
idx_fn = idx_fn[:,0]
X_test_fn = ANN._X_test[idx_fn]

# plotting
# +++++++++++
plt.figure(figsize=(12,12))

# plotting the true positives
# --------------------------
plt.subplot(321)
plot_digits(X_test_tp[0:25], images_per_row=5 )
plt.subplot(322)
plot_digits(X_test_tp[25:50], images_per_row=5 )

# plotting the false positives
# --------------------------
plt.subplot(323)
plot_digits(X_test_fp[0:25], images_per_row=5 )
plt.subplot(324)
plot_digits(X_test_fp[25:], images_per_row=5 )

# plotting the false negatives
# ------------------------------
plt.subplot(325)
plot_digits(X_test_fn[0:25], images_per_row=5 )
plt.subplot(326)
plot_digits(X_test_fn[25:], images_per_row=5 )

```

The first row of the plot shows the (first) 50 “True Positives” for the “4”-digit images in the MNIST test data set. The second row shows the “False Positives”, the third row the “False Negatives”.
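The only slightly tricky part of “plot_digits()” is the arithmetic for the grid: the number of rows and the blank padding of the last row. The sketch below reproduces this arithmetic with pure Numpy reshaping. The function “tile_images()” is a hypothetical variant written for this illustration (it is not Géron's code), and the “images” are just constant arrays.

```python
import numpy as np

def tile_images(instances, images_per_row=5, size=28):
    """Arrange flattened size*size images in a grid; pad the last row with blanks."""
    n = len(instances)
    n_rows = (n - 1) // images_per_row + 1           # ceiling division
    n_empty = n_rows * images_per_row - n            # blank tiles needed
    padded = np.concatenate(
        [instances, np.zeros((n_empty, size * size))], axis=0)
    grid = padded.reshape(n_rows, images_per_row, size, size)
    # bring the row axis of each image next to the grid-row axis, then merge
    return grid.transpose(0, 2, 1, 3).reshape(n_rows * size, images_per_row * size)

# 7 made-up "images" filled with the constant values 1..7
ay_imgs = np.stack([np.full(28 * 28, k, dtype=float) for k in range(1, 8)])
image = tile_images(ay_imgs, images_per_row=5)
print(image.shape)
```

The resulting 2D array could be handed to “imshow()” in the same way as the concatenated image in “plot_digits()”.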

Very often you can guess why our MLP makes a mistake. However, in some cases we just have to acknowledge that the human brain is a much better pattern recognition machine than a stupid MLP 🙂 .

# Conclusion

With the help of a “confusion matrix” it is easy to find out for which MNIST digit-images our algorithm has major problems. A confusion matrix gives us, per digit, the numbers of images which the MLP classifies wrongly, i.e. the “False Positives” and “False Negatives”.

We have also seen that there are three quantities – precision, recall, F1 – which are useful to describe the accuracy of a classification algorithm per classification category.

We have written some code to collect all necessary information about “confused” images into our own Numpy arrays after training. Slicing of Numpy arrays proved to be useful, and matplotlib helped us to visualize examples of the wrongly classified MNIST digit-images.

In the next article, “A simple program for an ANN to cover the Mnist dataset – XII – accuracy evolution, learning rate, normalization”, we shall extract some more information on the evolution of accuracy during training. We shall also make use of a “clustering” technique to reduce the number of input nodes.

# A simple Python program for an ANN to cover the MNIST dataset – X – mini-batch-shuffling and some more tests

I continue my series on a Python code for a simple multi-layer perceptron [MLP]. During the course of the previous articles we have built a Python class “MyANN” with methods to import the MNIST data set and handle forward propagation as well as error backward propagation [EBP]. We also had a deeper look at the mathematics of gradient descent and EBP for MLP training.

The code modifications in the last article enabled us to perform a first test on the MNIST dataset. This test gave us some confidence in our training algorithm: It seemed to converge and produce weights which did a relatively good job on analyzing the MNIST images.

We saw a slight tendency of overfitting. But an accuracy level of 96.5% on the test dataset showed that the MLP had indeed “learned” something during training. We needed around 1000 epochs to come to this point.

However, there are a lot of parameters controlling our grid structure and the learning behavior. Such parameters are often called “hyper-parameters”. To get a better understanding of our MLP we must start playing around with such parameters. In this article we shall concentrate on the parameter for (regression) regularization (called “lambda2_reg” in the parameter interface of our class “MyANN”) and then start varying the node numbers on the layers.

But before we start new test runs we add a statistical element to the training – namely the variation of the composition of our mini-batches (see the last article).

General hint: In all of the test runs below we used 4 CPU cores with libOpenBlas on a Linux system with an I7 6700K CPU.

# Shuffling the contents of the mini-batches

Let us add some more parameters to the interface of class “MyANN”:

shuffle_batches = True
print_period = 20

The first parameter shall control whether we vary the composition of the mini-batches with each epoch. The second parameter controls for which period of the epochs we print out some intermediate data (costs, averaged error of last mini-batch).
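What “varying the composition of the mini-batches” means can be sketched independently of our class. The function “make_mini_batches()” below is a hypothetical helper for this illustration; the class itself may organize its index arrays differently. The idea is simply to permute the record indices at the start of an epoch before splitting them into batches.

```python
import numpy as np

def make_mini_batches(n_samples, n_size_mini_batch, shuffle_batches, rng):
    """Return a list of index arrays, one per mini-batch."""
    idx = np.arange(n_samples)
    if shuffle_batches:
        idx = rng.permutation(n_samples)     # new random order of all records
    n_batches = n_samples // n_size_mini_batch
    # np.array_split tolerates a remainder; the batches then differ by at most one record
    return np.array_split(idx, n_batches)

rng = np.random.default_rng(42)
for epoch in range(2):
    batches = make_mini_batches(n_samples=20, n_size_mini_batch=5,
                                shuffle_batches=True, rng=rng)
    print("epoch", epoch, "first batch:", batches[0])
```

Each record still appears exactly once per epoch; only its assignment to a mini-batch changes from epoch to epoch.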

```
    def __init__(self,
                 my_data_set = "mnist",
                 n_hidden_layers = 1,
                 ay_nodes_layers = [0, 100, 0], # array which should have as many elements as n_hidden + 2
                 n_nodes_layer_out = 10,  # expected number of nodes in output layer

                 my_activation_function = "sigmoid",
                 my_out_function        = "sigmoid",
                 my_loss_function       = "LogLoss",

                 n_size_mini_batch = 50,  # number of data elements in a mini-batch

                 n_epochs      = 1,
                 n_max_batches = -1,  # number of mini-batches to use during epochs - > 0 only for testing
                                      # a negative value uses all mini-batches

                 lambda2_reg = 0.1,     # factor for quadratic regularization term
                 lambda1_reg = 0.0,     # factor for linear regularization term

                 vect_mode = 'cols',

                 learn_rate = 0.001,        # the learning rate (often called epsilon in textbooks)
                 decrease_const = 0.00001,  # a factor for decreasing the learning rate with epochs
                 mom_rate   = 0.0005,       # a factor for momentum learning

                 shuffle_batches = True,    # True: we mix the data for mini-batches in the X-train set at the start of each epoch

                 print_period = 20,         # number of epochs for which to print the costs and the averaged error

                 figs_x1=12.0, figs_x2=8.0,
                 legend_loc='upper right',

                 b_print_test_data = True

                 ):
        '''
        Initialization of MyANN
        Input:
            my_data_set: type of dataset; so far only the "mnist", "mnist_784" and "mnist_keras" datasets are known
                         We use this information to prepare the input data and learn about the feature dimension.
                         This info is used in preparing the size of the input layer.
            n_hidden_layers = number of hidden layers => between input layer 0 and output layer n

            ay_nodes_layers = [0, 100, 0 ] : We set the number of nodes in input layer_0 and the output_layer to zero
                              Will be set to real number afterwards by infos from the input dataset.
                              All other numbers are used for the node numbers of the hidden layers.
            n_nodes_layer_out = expected number of nodes in the output layer (is checked);
                                this number corresponds to the number of categories NC = number of labels to be distinguished

            my_activation_function : name of the activation function to use
            my_out_function : name of the "activation" function of the last layer which produces the output values
            my_loss_function : name of the "cost" or "loss" function used for optimization

            n_size_mini_batch : Number of elements/samples in a mini-batch of training data
                                The number of mini-batches will be calculated from this

            n_epochs : number of epochs to calculate during training
            n_max_batches : > 0: maximum of mini-batches to use during training
                            < 0: use all mini-batches

            lambda2_reg:    The factor for the quadratic regularization term
            lambda1_reg:    The factor for the linear regularization term

            vect_mode: Are 1-dim data arrays (vectors) ordered by columns or rows ?

            learn_rate :     Learning rate - defines by how much we correct weights in the indicated direction of the gradient on the cost hyperplane.
            decrease_const:  Controls a systematic decrease of the learning rate with epoch number
            mom_rate:        Momentum rate. Controls a mixture of the last with the present weight corrections (momentum learning)

            shuffle_batches: True => vary composition of mini-batches with each epoch

            print_period:    number of epochs between printing out some intermediate data
                             on costs and the averaged error of the last mini-batch

            figs_x1=12.0, figs_x2=8.0 : Standard sizing of plots ,
            legend_loc='upper right': Position of legends in the plots

            b_print_test_data: Boolean variable to control the print out of some tests data

        '''

# Array (Python list) of known input data sets
self._input_data_sets = ["mnist", "mnist_784", "mnist_keras"]
self._my_data_set = my_data_set

# X, y, X_train, y_train, X_test, y_test
# will be set by analyze_input_data
# X: Input array (2D) - at present status of MNIST image data, only.
# y: result (=classification data) [digits represent categories in the case of Mnist]
self._X       = None
self._X_train = None
self._X_test  = None
self._y       = None
self._y_train = None
self._y_test  = None

# relevant dimensions
# from input data information;  will be set in handle_input_data()
self._dim_sets     = 0
self._dim_features = 0
self._n_labels     = 0   # number of unique labels - will be extracted from y-data

# Img sizes
self._dim_img      = 0 # should be sqrt(dim_features) - we assume square like images
self._img_h        = 0
self._img_w        = 0

# Layers
# ------
# number of hidden layers
self._n_hidden_layers = n_hidden_layers
# Number of total layers
self._n_total_layers = 2 + self._n_hidden_layers
# Nodes for hidden layers
self._ay_nodes_layers = np.array(ay_nodes_layers)
# Number of nodes in output layer - will be checked against information from target arrays
self._n_nodes_layer_out = n_nodes_layer_out

# Weights
# --------
# empty List for all weight-matrices for all layer-connections
# Numbering :
# w[0] contains the weight matrix which connects layer 0 (input layer ) to hidden layer 1
# w[1] contains the weight matrix which connects layer 1 (hidden layer 1) to layer 2 (hidden or output layer)
self._li_w = []

# Arrays for encoded output labels - will be set in _encode_all_mnist_labels()
# -------------------------------
self._ay_onehot = None
self._ay_oneval = None

# Known Randomizer methods ( 0: np.random.randint, 1: np.random.uniform )
# ------------------
self.__ay_known_randomizers = [0, 1]

# Types of activation functions and output functions
# ------------------
self.__ay_activation_functions = ["sigmoid"] # later also relu
self.__ay_output_functions     = ["sigmoid"] # later also softmax

# Types of cost functions
# ------------------
self.__ay_loss_functions = ["LogLoss", "MSE" ] # later also other types of cost/loss functions

# the following dictionaries will be used for indirect function calls
self.__d_activation_funcs = {
'sigmoid': self._sigmoid,
'relu':    self._relu
}
self.__d_output_funcs = {
'sigmoid': self._sigmoid,
'softmax': self._softmax
}
self.__d_loss_funcs = {
'LogLoss': self._loss_LogLoss,
'MSE': self._loss_MSE
}
# Derivative functions
self.__d_D_activation_funcs = {
'sigmoid': self._D_sigmoid,
'relu':    self._D_relu
}
self.__d_D_output_funcs = {
'sigmoid': self._D_sigmoid,
'softmax': self._D_softmax
}
self.__d_D_loss_funcs = {
'LogLoss': self._D_loss_LogLoss,
'MSE': self._D_loss_MSE
}

# The following variables will later be set by _check_and set_activation_and_out_functions()

self._my_act_func  = my_activation_function
self._my_out_func  = my_out_function
self._my_loss_func = my_loss_function
self._act_func = None
self._out_func = None
self._loss_func = None

# number of data samples in a mini-batch
self._n_size_mini_batch = n_size_mini_batch
self._n_mini_batches = None  # will be determined by _get_number_of_mini_batches()

# maximum number of epochs - we set this number to an assumed maximum
# - as we shall build a backup and reload functionality for training, this should not be a major problem
self._n_epochs = n_epochs

# maximum number of batches to handle ( if < 0 => all!)
self._n_max_batches = n_max_batches
# actual number of batches
self._n_batches = None

# regularization parameters
self._lambda2_reg = lambda2_reg
self._lambda1_reg = lambda1_reg

# parameter for momentum learning
self._learn_rate = learn_rate
self._decrease_const = decrease_const
self._mom_rate   = mom_rate
self._li_mom = [None] *  self._n_total_layers

# shuffle data in X_train?
self._shuffle_batches = shuffle_batches

# epoch period for printing
self._print_period = print_period

# book-keeping for epochs and mini-batches
# -------------------------------
# range for epochs - will be set by _prepare-epochs_and_batches()
self._rg_idx_epochs = None
# range for mini-batches
self._rg_idx_batches = None
# dimension of the numpy arrays for book-keeping - will be set in _prepare_epochs_and_batches()
self._shape_epochs_batches = None    # (n_epochs, n_batches, 1)

# list for error values at outermost layer for minibatches and epochs during training
# we use a numpy array here because we can redimension it
self._ay_theta = None
# list for cost values of mini-batches during training
# The list will later be split into sections for epochs
self._ay_costs = None

# Data elements for back propagation
# ----------------------------------

# 2-dim array of partial derivatives of the elements of an additive cost function
# The derivative is taken with respect to the output results a_j = ay_ANN_out[j]
# The array dimensions account for nodes and samples of a mini_batch. The array will be set in function
# self._initiate_bw_propagation()
self._ay_delta_out_batch = None

# parameter to allow printing of some test data
self._b_print_test_data = b_print_test_data

# Plot handling
# --------------
# Alternatives to resize plots
# 1: just resize figure  2: resize plus create subplots() [figure + axes]
self._plot_resize_alternative = 1
# Plot-sizing
self._figs_x1 = figs_x1
self._figs_x2 = figs_x2
self._fig = None
self._ax  = None
# alternative 2 does resizing and (!) subplots()
self.initiate_and_resize_plot(self._plot_resize_alternative)

# ***********
# operations
# ***********

# check and handle input data
self._handle_input_data()
# set the ANN structure
self._set_ANN_structure()

# Prepare epoch and batch-handling - sets ranges, limits num of mini-batches and initializes book-keeping arrays
self._rg_idx_epochs, self._rg_idx_batches = self._prepare_epochs_and_batches()

# perform training
start_c = time.perf_counter()
self._fit(b_print=True, b_measure_batch_time=False)
end_c = time.perf_counter()
print('\n\n ------')
print('Total training Time_CPU: ', end_c - start_c)
print("\nStopping program regularly")

```

Both parameters affect our method “_fit()” in the following way :

```    ''' -- Method to train the MLP over epochs and mini-batches -- '''
def _fit(self, b_print = False, b_measure_batch_time = False):
'''
Parameters:
b_print:                 Do we print intermediate results of the training at all?
b_print_period:          For which period of epochs do we print?
b_measure_batch_time:    Measure CPU-Time for a batch
'''
rg_idx_epochs  = self._rg_idx_epochs
rg_idx_batches = self._rg_idx_batches
if (b_print):
print("\nnumber of epochs = " + str(len(rg_idx_epochs)))
print("max number of batches = " + str(len(rg_idx_batches)))

# loop over epochs
for idxe in rg_idx_epochs:
if (b_print and (idxe % self._print_period == 0) ):
print("\n ---------")
print("\nStarting epoch " + str(idxe+1))

# simple adaptation of the learning rate
self._learn_rate /= (1.0 + self._decrease_const * idxe)

# shuffle indices for a variation of the mini-batches with each epoch
if self._shuffle_batches:
shuffled_index = np.random.permutation(self._dim_sets)
self._X_train, self._y_train, self._ay_onehot = self._X_train[shuffled_index], self._y_train[shuffled_index], self._ay_onehot[:, shuffled_index]

# loop over mini-batches
for idxb in rg_idx_batches:
if b_measure_batch_time:
start_0 = time.perf_counter()
# deal with a mini-batch
self._handle_mini_batch(num_batch = idxb, num_epoch=idxe, b_print_y_vals = False, b_print = False)
if b_measure_batch_time:
end_0 = time.perf_counter()
print('Time_CPU for batch ' + str(idxb+1), end_0 - start_0)

if (b_print and (idxe % self._print_period == 0) ):
print("\ntotal costs of mini_batch = ", self._ay_costs[idxe, idxb])

print("avg total error of mini_batch = ", self._ay_theta[idxe, idxb])

return None

```
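The shuffling step only works if one and the same permutation is applied to all three arrays. A minimal sketch with toy data, assuming (as the indexing in `_fit()` suggests) that records are rows of `X_train` and columns of the one-hot array; the toy arrays themselves are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seeded only to make the sketch reproducible

n_records, n_features, n_labels = 6, 4, 3
X_train = np.arange(n_records * n_features).reshape(n_records, n_features)
y_train = np.array([0, 1, 2, 0, 1, 2])
# one-hot labels, one column per record: shape (n_labels, n_records)
ay_onehot = np.eye(n_labels, dtype=int)[:, y_train]

shuffled_index = rng.permutation(n_records)
X_shuf = X_train[shuffled_index]             # reorder rows
y_shuf = y_train[shuffled_index]             # reorder labels the same way
onehot_shuf = ay_onehot[:, shuffled_index]   # reorder columns

# each shuffled record still carries its own label
print(np.array_equal(onehot_shuf.argmax(axis=0), y_shuf))
```

If only one of the arrays were permuted, images and labels would no longer match and training would silently degrade.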

# Results for shuffling the contents of the mini-batches

With shuffling we expect a slightly broader variation of the costs and the averaged error. But the accuracy should not change too much in the end. We start a new test run with the following parameters:

```     ay_nodes_layers = [0, 70, 30, 0],
n_nodes_layer_out = 10,
my_loss_function = "LogLoss",
n_size_mini_batch = 500,
n_epochs = 1500,
n_max_batches = 2000,  # small values only for test runs
lambda2_reg = 0.1,
lambda1_reg = 0.0,
vect_mode = 'cols',
learn_rate = 0.0001,
decrease_const = 0.000001,
mom_rate   = 0.00005,
shuffle_batches = True,
print_period = 20,
...
```

If we look at the intermediate printout for the last mini-batch of some epochs and compare it to the results given in the last article, we see a stronger variation in the costs and averaged error. The reason is that the composition of the last mini-batch of an epoch changes with every epoch.

```number of epochs = 1500
max number of batches = 120
---------
Starting epoch 1
total costs of mini_batch =  1757.7650929607967
avg total error of mini_batch =  0.17086198431410954
---------
Starting epoch 61
total costs of mini_batch =  511.7001121819204
avg total error of mini_batch =  0.030287362041332373
---------
Starting epoch 121
total costs of mini_batch =  435.2513093033654
avg total error of mini_batch =  0.023445601362614754
----------
Starting epoch 181
total costs of mini_batch =  361.8665831722295
avg total error of mini_batch =  0.018540003201911136
---------
Starting epoch 241
total costs of mini_batch =  293.31230634431023
avg total error of mini_batch =  0.0138237366634751
---------
Starting epoch 301
total costs of mini_batch =  332.70394217467936
avg total error of mini_batch =  0.017697548541363246
---------
Starting epoch 361
total costs of mini_batch =  249.26400606039937
avg total error of mini_batch =  0.011765164578232358
---------
Starting epoch 421
total costs of mini_batch =  240.0503762160913
avg total error of mini_batch =  0.011650843329895542
---------
Starting epoch 481
total costs of mini_batch =  222.89422430417295
avg total error of mini_batch =  0.011503859412784031
---------
Starting epoch 541
total costs of mini_batch =  200.1195962051405
avg total error of mini_batch =  0.009962020519104173
---------
Starting epoch 601
total costs of mini_batch =  206.74753168607685
avg total error of mini_batch =  0.01067995191155135
---------
Starting epoch 661
total costs of mini_batch =  171.14090717705736
avg total error of mini_batch =  0.0077091934178393105
---------
Starting epoch 721
total costs of mini_batch =  158.44967190977957
avg total error of mini_batch =  0.0070760922760890735
---------
Starting epoch 781
total costs of mini_batch =  165.4047453537401
avg total error of mini_batch =  0.008622788115637027
---------
Starting epoch 841
total costs of mini_batch =  140.52762105883642
avg total error of mini_batch =  0.0067360505574077766
---------
Starting epoch 901
total costs of mini_batch =  163.9117184790982
avg total error of mini_batch =  0.007431666926365192
---------
Starting epoch 961
total costs of mini_batch =  126.05539161877512
avg total error of mini_batch =  0.005982378079899406
---------
Starting epoch 1021
total costs of mini_batch =  114.89943308334199
avg total error of mini_batch =  0.005122976288751798
---------
Starting epoch 1081
total costs of mini_batch =  117.22051220670932
avg total error of mini_batch =  0.005185936692097749
---------
Starting epoch 1141
total costs of mini_batch =  140.88969853048422
avg total error of mini_batch =  0.007665464508660714
---------
Starting epoch 1201
total costs of mini_batch =  113.27223303239667
avg total error of mini_batch =  0.0059791015452599705
---------
Starting epoch 1261
total costs of mini_batch =  105.55343407063131
avg total error of mini_batch =  0.005000503315756879
---------
Starting epoch 1321
total costs of mini_batch =  130.48116668827038
avg total error of mini_batch =  0.006287118265324945
---------
Starting epoch 1381
total costs of mini_batch =  109.04042315247389
avg total error of mini_batch =  0.005874339148860562
---------
Starting epoch 1441
total costs of mini_batch =  121.01379412127089
avg total error of mini_batch =  0.0065105907117289944
---------
Starting epoch 1461
total costs of mini_batch =  103.08774822996196
avg total error of mini_batch =  0.005299079778792264
---------
Starting epoch 1481
total costs of mini_batch =  106.21334182056928
avg total error of mini_batch =  0.005343967730134955
-------
Total training Time_CPU:  1963.8792177759988
```

Note that the averaged error values result from averaging the absolute values of the errors of all records in a batch! The small numbers are not due to a cancelling of positive by negative deviations. A contribution to the error at an output node is given by the absolute value of the difference between the predicted real output value and the encoded target output value. We first calculate an average over all output nodes (=10) per record and then average these values over all records of a batch. Such an “averaged error” gives us a first indication of the accuracy level reached.
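The two averaging steps can be written down in a few lines. This is a minimal sketch with made-up numbers for 3 records and 4 output nodes; the column-per-record layout is an assumption and the code is not taken from the class:

```python
import numpy as np

# toy output of an MLP for a mini-batch of 3 records and 4 output nodes
# (layout assumed here: one column per record)
ay_ANN_out = np.array([[0.9, 0.1, 0.2],
                       [0.1, 0.8, 0.1],
                       [0.0, 0.1, 0.6],
                       [0.0, 0.0, 0.1]])
ay_onehot = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1],
                      [0, 0, 0]])

abs_err = np.abs(ay_ANN_out - ay_onehot)   # absolute deviation per node and record
err_per_record = abs_err.mean(axis=0)      # average over the output nodes
avg_error = err_per_record.mean()          # average over the records of the batch
print(avg_error)
```

Because absolute values are taken before averaging, a deviation of +0.2 at one node cannot cancel a deviation of -0.2 at another.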

Note that this averaged error does not become constant. The last values in the above list indicate that we do not get much better with the error than 0.0055 on the training data. The minimum points we approach on the various cost hyperplanes of the mini-batches obviously hop around the global minimum on the hyperplane of the total costs. One of the reasons is the varying composition of the mini-batches; another reason is that the cost hyperplanes of the various mini-batches themselves are different from the hyperplane of the total costs of all records of the training data set. We see the effects of a mixture of “batch gradient descent” and “stochastic gradient descent” here; i.e., we do not get rid of stochastic elements even if we are close to a global minimum.

Still we observe an overall convergent behavior at around 1050 epochs. There our curves get relatively flat.

Accuracy values are:

total accuracy for training data = 0.9914
total accuracy for test data        = 0.9611

So, this is pretty much the same as in our original run in the last article without data shuffling.
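For reference, accuracy values like the ones above are the fraction of records whose predicted digit equals the label. A minimal sketch, assuming the prediction is taken as the output node with the largest value; the array names and toy numbers are made up, and the accuracy method of the class itself is not shown in this section:

```python
import numpy as np

# toy outputs for 4 records and 3 categories (one column per record)
ay_out = np.array([[0.9, 0.2, 0.1, 0.3],
                   [0.1, 0.7, 0.2, 0.6],
                   [0.0, 0.1, 0.7, 0.1]])
y_true = np.array([0, 1, 2, 0])

y_pred = ay_out.argmax(axis=0)          # index of the strongest output node per record
accuracy = np.mean(y_pred == y_true)    # fraction of correctly classified records
print(y_pred, accuracy)
```

The last toy record is misclassified (node 1 wins although the label is 0), so 3 of the 4 records count as correct.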

# Dropping regularization

In the above run we had used a quadratic form of the regularization (often called Ridge regularization). In the next test run we shall drop regularization completely (Lambda2 = 0, Lambda1 = 0) and find out whether this hampers the generalization of our MLP and the resulting accuracy with respect to the test data set.

Resulting data for the last epochs of the test run are

```Starting epoch 1001
total costs of mini_batch =  67.98542512352101
avg total error of mini_batch =  0.007449654093429594
---------
Starting epoch 1051
total costs of mini_batch =  56.69195783294443
avg total error of mini_batch =  0.0063384571747725415
---------
Starting epoch 1101
total costs of mini_batch =  51.81035466850738
avg total error of mini_batch =  0.005939699354987233
---------
Starting epoch 1151
total costs of mini_batch =  52.23157716632318
avg total error of mini_batch =  0.006373981433882217
---------
Starting epoch 1201
total costs of mini_batch =  48.40298652277855
avg total error of mini_batch =  0.005653856253701317
---------
Starting epoch 1251
total costs of mini_batch =  45.00623540189525
avg total error of mini_batch =  0.005245339176038497
---------
Starting epoch 1301
total costs of mini_batch =  36.88409532579881
avg total error of mini_batch =  0.004600719544961844
---------
Starting epoch 1351
total costs of mini_batch =  36.53543045554845
avg total error of mini_batch =  0.003993852242709943
---------
Starting epoch 1401
total costs of mini_batch =  38.80422469954769
avg total error of mini_batch =  0.00464620714991714
---------
Starting epoch 1451
total costs of mini_batch =  42.39371261881638
avg total error of mini_batch =  0.005294796697150631
------
Total training Time_CPU:  2118.4527089519997
```

Note, by the way, that the absolute values of the costs depend on the regularization parameter; therefore we see somewhat lower values in the end than before. But the absolute cost values are not so important regarding the general convergence and the accuracy of the network reached.
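The dependence of the cost values on the regularization parameter can be illustrated with a small sketch. It assumes the common quadratic penalty 0.5 * Lambda2 * (sum of all squared weights); the exact prefactor used in the class is not shown in this section, and the random weight matrices are mere placeholders with the shapes of the 70/30-node network:

```python
import numpy as np

def ridge_penalty(li_w, lambda2_reg):
    """Quadratic (L2) penalty summed over all weight matrices.
    The prefactor 0.5 is a common convention and an assumption here."""
    return 0.5 * lambda2_reg * sum(np.sum(w ** 2) for w in li_w)

rng = np.random.default_rng(0)
# placeholder weights; shapes (nodes of next layer, nodes of previous layer) are assumed
li_w = [rng.uniform(-0.5, 0.5, size=(70, 784)),
        rng.uniform(-0.5, 0.5, size=(30, 70)),
        rng.uniform(-0.5, 0.5, size=(10, 30))]

for lam in (0.0, 0.1, 0.2):
    print(lam, ridge_penalty(li_w, lam))
```

The penalty grows linearly with Lambda2 and vanishes for Lambda2 = 0, which is why cost values from runs with different regularization parameters should not be compared directly.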

We omit the plots and just give the accuracy values:

total accuracy for training data = 0.9874
total accuracy for test data        = 0.9514

We get a slight drop in accuracy for the test data set – small (1%), but notable. It is interesting that even the accuracy on the training data was affected.

# Why might it be interesting to check the impact of the regularization?

We learn from literature that regularization helps with overfitting. Another aspect discussed e.g. by Jörg Frochte in his book “Maschinelles Lernen” is, whether we have enough training data to determine the vast amount of weights in complicated networks. He suggests on page 190 of his book to consider the number of weights in an MLP and compare it with the number of available data points.

He suggests that one may run into trouble if the difference between the number of weights (number of degrees of freedom) and the number of data records (number of independent information data) becomes too big. However, his test example was a rather limited one and for regression, not classification. He also notes that if the data are well distributed the problem may not be as big as assumed. If one thinks about it, one may also ask whether the real amount of data provided by the records is not larger by a factor of 10, as we use 10 output values per record …

Anyway, I think it is worthwhile to have a look at regularization.

# Enlarging the regularization factor

We double the value of Lambda2: Lambda2 = 0.2.

```Starting epoch 1251
total costs of mini_batch =  128.00827405482318
avg total error of mini_batch =  0.007276206815511017
---------
Starting epoch 1301
total costs of mini_batch =  107.62983581797556
avg total error of mini_batch =  0.005535653858885446
---------
Starting epoch 1351
total costs of mini_batch =  107.83630092292944
avg total error of mini_batch =  0.005446805325519184
---------
Starting epoch 1401
total costs of mini_batch =  119.7648277329852
avg total error of mini_batch =  0.00729466852297802
---------
Starting epoch 1451
total costs of mini_batch =  106.74254206278933
avg total error of mini_batch =  0.005343124456075227
```

We get a slight improvement of the accuracy compared to our first run with data shuffling:

total accuracy for training data = 0.9950
total accuracy for test data        = 0.964

So, regularization does have its advantages. I recommend investigating the impact of this parameter closely if you need to get the “last percentages” in generalization and accuracy for a given MLP model.

# Enlarging node numbers

We have 60000 data records in the training set. In our example case we needed to fix around 784*70 + 70*30 + 30*10 = 57280 weight values. This is pretty close to the total amount of training data (60000). What happens if we extend the number of weights beyond the number of training records?
E.g. 784*100 + 100*50 + 50*10 = 83900. Do we get some trouble?

The results are:

```
Starting epoch 1151
total costs of mini_batch =  109.77341617599176
avg total error of mini_batch =  0.005494982077591186
---------
Starting epoch 1201
total costs of mini_batch =  113.5293680548904
avg total error of mini_batch =  0.005352117137100675
---------
Starting epoch 1251
total costs of mini_batch =  116.26371170820423
avg total error of mini_batch =  0.0072335516486698
---------
Starting epoch 1301
total costs of mini_batch =  99.7268420386945
avg total error of mini_batch =  0.004850817052601995
---------
Starting epoch 1351
total costs of mini_batch =  101.16579732551999
avg total error of mini_batch =  0.004831600835072556
---------
Starting epoch 1401
total costs of mini_batch =  98.45208584213253
avg total error of mini_batch =  0.004796133492821962
---------
Starting epoch 1451
total costs of mini_batch =  99.279344780807
avg total error of mini_batch =  0.005289728162205425
------
Total training Time_CPU:  2159.5880855739997
```

Ooops, there appears a glitch in the data around epoch 1250. Such things happen! So, we should have a look at the graphs before we decide to take the weights of a special epoch for our MLP model!

But in the end, i.e. with the weights at epoch 1500 the accuracy values are:

total accuracy for training data = 0.9962
total accuracy for test data        = 0.9632

So, we were NOT punished by extending our network, but we gained nothing worth the effort.
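The weight counts quoted in this section are easy to reproduce. A small sketch with a hypothetical helper `count_weights()`; like the numbers in the text, it counts only the connections between consecutive layers and ignores bias nodes:

```python
def count_weights(layer_sizes):
    """Number of weights between consecutive layers (bias nodes ignored)."""
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

for sizes in ([784, 70, 30, 10], [784, 100, 50, 10], [784, 300, 100, 10]):
    print(sizes, count_weights(sizes))
```

In all three configurations the first matrix (784 input nodes times the nodes of the first hidden layer) dominates the total.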

Now, let us go up with node numbers much more: 300 and 100 => 784*300 + 300*100 + 100*10 = 266200; i.e. substantially more individual weights than training samples! First with Lambda2 = 0.2:

```Starting epoch 1201
total costs of mini_batch =  104.4420759423322
avg total error of mini_batch =  0.0037985801450468246
---------
Starting epoch 1251
total costs of mini_batch =  102.80878647657674
avg total error of mini_batch =  0.003926855904089417
---------
Starting epoch 1301
total costs of mini_batch =  100.01189950545002
avg total error of mini_batch =  0.0037743225368465773
---------
Starting epoch 1351
total costs of mini_batch =  97.34294880936079
avg total error of mini_batch =  0.0035513092392408865
---------
Starting epoch 1401
total costs of mini_batch =  93.15432903284587
avg total error of mini_batch =  0.0032916082743134206
---------
Starting epoch 1451
total costs of mini_batch =  89.79127326241868
avg total error of mini_batch =  0.0033628384147655283
------
Total training Time_CPU:  4254.479082876998
```

total accuracy for training data = 0.9987
total accuracy for test data        = 0.9630

So, much CPU time for no gain!

Now, what happens if we set Lambda2 = 0? We get:

total accuracy for training data = 0.9955
total accuracy for test data        = 0.9491

This is a small change of around 1.4%! I would say:

In the special case of MNIST, an MLP network with 2 hidden layers and a small learning rate, we see neither a major problem regarding the amount of available data, nor a dramatic impact of the regularization. Regularization brings a gain in accuracy of around 1%.

# Reducing the node number on the hidden layers

Another experiment could be to reduce the number of nodes on the hidden layers. Let us go down to 40 nodes on L1 and 20 on L2, then to 30 and 20. Lambda2 is set to 0.2 in both runs. acc1 in the following listing means the accuracy for the training data, acc2 for the test data.

The results are:

40 nodes on L1, 20 nodes on L2, 1500 epochs => 1600 sec, acc1 = 0.9898, acc2 = 0.9578
30 nodes on L1, 20 nodes on L2, 1800 epochs => 1864 sec, acc1 = 0.9861, acc2 = 0.9513

We lose around 1% in accuracy!

The plot for 30, 20 nodes on layers L1, L2 shows that we got a convergence only beyond 1500 epochs.

# Working with just one hidden layer

To get some more indications regarding efficiency let us now turn to networks with just one hidden layer.
We investigate three situations with the following node numbers on the hidden layer: 100, 50, 30

The plot for 100 nodes on the hidden layer; we get convergence at around 1050 epochs.

Interestingly, the CPU time for 100 nodes is, with 1850 secs, not smaller than for the network with 70 and 30 nodes on the hidden layers. As the dominant matrices are the ones connecting layer L0 and layer L1 this is quite understandable. (Note that the CPU time also depends on the consumption of other jobs on the system.)

The plots for 50 and 30 nodes on the hidden layer; we get convergence at around 1450 epochs. The CPU time for 1500 epochs goes down to 1500 sec and XXX sec, respectively.

Plot for 50 nodes on the hidden layer:

Plot for 30 nodes:

We get the following accuracy values:

100 nodes, 1950 sec (1500 epochs), acc1 = 0.9947, acc2 = 0.9619,
50 nodes, 1600 sec (1500 epochs), acc1 = 0.9880, acc2 = 0.9566,
30 nodes, 1450 sec (1500 epochs), acc1 = 0.9780, acc2 = 0.9436

Well, now we see a drop in the accuracy by around 2% compared to our best cases. You have to decide yourself whether the gain in CPU time is worth it.

Note, by the way, that the accuracy value for 50 nodes is pretty close to the value S. Raschka got in his book “Python Machine Learning”. If you compare such values with your own runs be aware of the rather small learning rate (0.0001) and momentum rate (0.00005) I used. You can probably become faster with larger learning rates. But then you may need another type of adaption for the learning rate compared to our simple linear one.

# Conclusion

We saw that our original guess of a network with 2 hidden layers with 70 and 30 nodes was not a bad one. A network with just one hidden layer with 50 nodes or fewer does not give us the same accuracy. However, we did not see a major improvement either when we went to networks with 300 nodes on layer L1 and 100 nodes on layer L2. Despite the discrepancy between the number of weights and the number of training records we saw no significant loss in accuracy either – with or without regularization.
We also learned that we should use regularization (here of the quadratic type) to get the last 1% to 2% of accuracy on the test data in our special MNIST case.

In the next article

A simple program for an ANN to cover the Mnist dataset – XI – confusion matrix

we shall have a closer look at those MNIST number images where our MLP got problems.

# A simple Python program for an ANN to cover the MNIST dataset – VIII – coding Error Backward Propagation

I continue with my series on a Python program for coding small “Multilayer Perceptrons” [MLPs].

After all the theoretical considerations of the last two articles we now start coding again. Our objective is to extend our methods for training the MLP on the MNIST dataset by methods which perform the “error back propagation” and the correction of the weights. The mathematical prescriptions were derived in the following PDF:
My PDF on “The math behind EBP”

When you study the new code fragments below remember a few things:
We are prepared to use mini-batches. Therefore, the cost functions will be calculated over the data records of each batch and all the matrix operations for back propagation will cover all batch-records in parallel. Training means to loop over epochs and mini-batches – in pseudo-code:

• Loop over epochs
2. check for convergence criteria,
3. Shuffle all data records in the training data set and build new mini-batches
• Loop over mini-batches
1. Perform forward propagation for all records of the mini-batch
2. calculate and save the total cost value for each mini-batch
3. calculate and save an averaged error on the output layer for each mini-batch
4. perform error backward propagation on all records of the mini-batch to get the gradient of the cost function with respect to all weights
5. adjust all weights on all layers
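The loop structure above can be sketched in a self-contained way. To keep it short, the sketch replaces the MLP by a single sigmoid output layer trained with log loss on made-up data; the names `X`, `Y`, `W` and all parameter values are illustrative assumptions and not the code of the class:

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up data: 200 records, 5 features, binary target (one column per record)
n_records, n_features = 200, 5
X = rng.normal(size=(n_records, n_features))
true_w = rng.normal(size=(n_features,))
Y = (X @ true_w > 0).astype(float).reshape(1, n_records)

W = np.zeros((1, n_features))                  # weights of the single layer
learn_rate, n_epochs, batch_size = 0.1, 30, 20
n_batches = n_records // batch_size
ay_costs = np.zeros((n_epochs, n_batches))     # cost per epoch and mini-batch
ay_theta = np.zeros((n_epochs, n_batches))     # averaged error per epoch and mini-batch

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for idxe in range(n_epochs):                   # loop over epochs
    idx = rng.permutation(n_records)           # shuffle records, build new mini-batches
    X, Y = X[idx], Y[:, idx]
    for idxb in range(n_batches):              # loop over mini-batches
        sl = slice(idxb * batch_size, (idxb + 1) * batch_size)
        Xb, Yb = X[sl].T, Y[:, sl]             # Xb has shape (features, records)
        A = sigmoid(W @ Xb)                    # 1. forward propagation
        eps = 1.0e-12                          # guards against log(0)
        ay_costs[idxe, idxb] = -np.sum(        # 2. total log loss of the mini-batch
            Yb * np.log(A + eps) + (1.0 - Yb) * np.log(1.0 - A + eps))
        ay_theta[idxe, idxb] = np.mean(np.abs(A - Yb))   # 3. averaged error
        grad = (A - Yb) @ Xb.T                 # 4. gradient of the log loss w.r.t. W
        W -= learn_rate * grad / batch_size    # 5. adjust the weights

print(ay_costs[0].mean(), ay_costs[-1].mean())
```

For a real MLP, steps 1 and 4 become the forward and backward propagation through all layers; the book-keeping per epoch and mini-batch stays the same.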

As discussed in the last article: The cost hyperplane changes a bit with each mini-batch. If there is a good mixture of records in a batch then the form of its specific cost hyperplane will (hopefully) resemble the form of an overall cost hyperplane, but it will not be the same. By the second step in the outer loop we want to avoid that the same data records always get an influence on the gradients at the same position in the correction procedure. Both statistical elements help a bit to overcome dominant records and a non-equal distribution of training records. If we had only pictures for number 3 at the end of our MNIST data set we might start learning “3” representations very well, but not other numbers. Statistical variation also helps to avoid side minima on the overall cost hyperplane for all data records of the training set.

We shall implement the second step and third step in the epoch loop in the next article – when we are sure that the training algorithm works as expected. So, at the moment we will stop our training only after a given number of epochs.

# More input parameters

In the first articles we had built an __init__()-method to parameterize a training run. We have to include three more parameters to control the backward propagation.

learn_rate = 0.001, # the learning rate (often called epsilon in textbooks)
decrease_const = 0.00001, # a factor for decreasing the learning rate with epochs
mom_rate = 0.0005, # a factor for momentum learning

The first parameter controls by how much we change weights with the help of gradient values. See formula (93) in the PDF of article VI (you find the Link to the latest version in the last section of this article). The second parameter will give us an option to decrease the learning rate with the number of training epochs. Note that a constant decrease rate only makes sense, if we can be relatively sure that we do not end up in a side minimum of the cost function.
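How strongly `decrease_const` acts depends on how the decrease is applied. The following sketch assumes the form used in the `_fit()` method of this series, where the current learning rate is divided in place by (1 + decrease_const * epoch index) at the start of every epoch; the helper name `learn_rate_schedule()` and the 1500 epochs are chosen for illustration only:

```python
import numpy as np

def learn_rate_schedule(learn_rate, decrease_const, n_epochs):
    """Learning rate per epoch when it is divided in place by
    (1 + decrease_const * epoch_index) at the start of every epoch."""
    rates = np.empty(n_epochs)
    for idxe in range(n_epochs):
        learn_rate /= (1.0 + decrease_const * idxe)
        rates[idxe] = learn_rate
    return rates

rates = learn_rate_schedule(learn_rate=0.001, decrease_const=0.00001, n_epochs=1500)
for idxe in (0, 100, 500, 1499):
    print(idxe + 1, rates[idxe])
```

Because the division acts on the already reduced rate, the reductions compound from epoch to epoch. The rate therefore falls faster than with a schedule that divides the initial rate by (1 + decrease_const * epoch index), which is worth keeping in mind when choosing `decrease_const` for long runs.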

The third parameter is interesting: It will allow us to mix the presently calculated weight correction with the correction term from the last step. So to say: We extend the “momentum” of the last correction into the next correction. This helps us not to follow indicated direction changes on the cost hyperplanes too fast.
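A minimal sketch of such a mixture, assuming the common form "new correction = gradient step + mom_rate * last correction". The exact expression used in the class is not shown in this section, and the function `momentum_step()` and the toy cost are hypothetical:

```python
import numpy as np

def momentum_step(w, grad, last_delta, learn_rate, mom_rate):
    """One weight update that mixes the present gradient step
    with the correction applied in the previous step."""
    delta = -learn_rate * grad + mom_rate * last_delta
    return w + delta, delta

# toy cost C(w) = 0.5 * sum(w**2), whose gradient is simply w
w = np.array([1.0, -2.0])
last_delta = np.zeros_like(w)
for step in range(5):
    w, last_delta = momentum_step(w, grad=w, last_delta=last_delta,
                                  learn_rate=0.1, mom_rate=0.5)
    print(step, w)
```

With `mom_rate = 0` the update reduces to plain gradient descent; larger values let the previous correction carry over, which damps abrupt changes of direction.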

# Some hygienic measures regarding variables

In the already written parts of the code we have used a prefix “ay_” for all variables which represent some vector or array like structure – including Python lists and Numpy arrays. For back propagation coding it will be more important to distinguish between lists and arrays. So, I changed the variable prefix for some important Python lists from “ay_” to “li_”. (I shall do it for all lists used in a further version). In addition I have changed the prefix for Python ranges to “rg_”. These changes will affect the contents and the interface of some methods. You will notice when we come to these methods.

# The changed __init__()-method

Our modified __init__() function now looks like this:

```    def __init__(self,
my_data_set = "mnist",
n_hidden_layers = 1,
ay_nodes_layers = [0, 100, 0], # array which should have as many elements as n_hidden + 2
n_nodes_layer_out = 10,  # expected number of nodes in output layer

my_activation_function = "sigmoid",
my_out_function        = "sigmoid",
my_loss_function       = "LogLoss",

n_size_mini_batch = 50,  # number of data elements in a mini-batch

n_epochs      = 1,
n_max_batches = -1,  # number of mini-batches to use during epochs - > 0 only for testing
# a negative value uses all mini-batches

lambda2_reg = 0.1,     # factor for quadratic regularization term
lambda1_reg = 0.0,     # factor for linear regularization term

vect_mode = 'cols',

learn_rate = 0.001,        # the learning rate (often called epsilon in textbooks)
decrease_const = 0.00001,  # a factor for decreasing the learning rate with epochs
mom_rate = 0.0005,         # a factor for momentum learning

figs_x1=12.0, figs_x2=8.0,
legend_loc='upper right',

b_print_test_data = True

):
'''
Initialization of MyANN
Input:
data_set: type of dataset; so far only the "mnist", "mnist_784" datasets are known
We use this information to prepare the input data and learn about the feature dimension.
This info is used in preparing the size of the input layer.
n_hidden_layers = number of hidden layers => between input layer 0 and output layer n

ay_nodes_layers = [0, 100, 0 ] : We set the number of nodes in input layer_0 and the output_layer to zero
Will be set to real number afterwards by infos from the input dataset.
All other numbers are used for the node numbers of the hidden layers.
n_nodes_out_layer = expected number of nodes in the output layer (is checked);
this number corresponds to the number of categories NC = number of labels to be distinguished

my_activation_function : name of the activation function to use
my_out_function : name of the "activation" function of the last layer which produces the output values
my_loss_function : name of the "cost" or "loss" function used for optimization

n_size_mini_batch : Number of elements/samples in a mini-batch of training data
The number of mini-batches will be calculated from this

n_epochs : number of epochs to calculate during training
n_max_batches : > 0: maximum of mini-batches to use during training
< 0: use all mini-batches

lambda2_reg:    The factor for the quadratic regularization term
lambda1_reg:    The factor for the linear regularization term

vect_mode: Are 1-dim data arrays (vectors) ordered by columns or rows ?

learn_rate :     Learning rate - defines by how much we correct weights in the indicated direction of the gradient on the cost hyperplane.
decrease_const:  Controls a systematic decrease of the learning rate with epoch number
mom_rate:        Momentum rate. Controls a mixture of the last with the present weight corrections (momentum learning)

figs_x1=12.0, figs_x2=8.0 : Standard sizing of plots ,
legend_loc='upper right': Position of legends in the plots

b_print_test_data: Boolean variable to control the print out of some tests data

'''

# Array (Python list) of known input data sets
self._input_data_sets = ["mnist", "mnist_784", "mnist_keras"]
self._my_data_set = my_data_set

# X, y, X_train, y_train, X_test, y_test
# will be set by analyze_input_data
# X: Input array (2D) - at present status of MNIST image data, only.
# y: result (=classification data) [digits represent categories in the case of Mnist]
self._X       = None
self._X_train = None
self._X_test  = None
self._y       = None
self._y_train = None
self._y_test  = None

# relevant dimensions
# from input data information;  will be set in handle_input_data()
self._dim_sets     = 0
self._dim_features = 0
self._n_labels     = 0   # number of unique labels - will be extracted from y-data

# Img sizes
self._dim_img      = 0 # should be sqrt(dim_features) - we assume square like images
self._img_h        = 0
self._img_w        = 0

# Layers
# ------
# number of hidden layers
self._n_hidden_layers = n_hidden_layers
# Number of total layers
self._n_total_layers = 2 + self._n_hidden_layers
# Nodes for hidden layers
self._ay_nodes_layers = np.array(ay_nodes_layers)
# Number of nodes in output layer - will be checked against information from target arrays
self._n_nodes_layer_out = n_nodes_layer_out

# Weights
# --------
# empty List for all weight-matrices for all layer-connections
# Numbering :
# w[0] contains the weight matrix which connects layer 0 (input layer ) to hidden layer 1
# w[1] contains the weight matrix which connects layer 1 (hidden layer) to layer 2 (hidden or output layer)
self._li_w = []

# Arrays for encoded output labels - will be set in _encode_all_mnist_labels()
# -------------------------------
self._ay_onehot = None
self._ay_oneval = None

# Known Randomizer methods ( 0: np.random.randint, 1: np.random.uniform )
# ------------------
self.__ay_known_randomizers = [0, 1]

# Types of activation functions and output functions
# ------------------
self.__ay_activation_functions = ["sigmoid"] # later also relu
self.__ay_output_functions     = ["sigmoid"] # later also softmax

# Types of cost functions
# ------------------
self.__ay_loss_functions = ["LogLoss", "MSE" ] # later also other types of cost/loss functions

# the following dictionaries will be used for indirect function calls
self.__d_activation_funcs = {
'sigmoid': self._sigmoid,
'relu':    self._relu
}
self.__d_output_funcs = {
'sigmoid': self._sigmoid,
'softmax': self._softmax
}
self.__d_loss_funcs = {
'LogLoss': self._loss_LogLoss,
'MSE': self._loss_MSE
}
# Derivative functions
self.__d_D_activation_funcs = {
'sigmoid': self._D_sigmoid,
'relu':    self._D_relu
}
self.__d_D_output_funcs = {
'sigmoid': self._D_sigmoid,
'softmax': self._D_softmax
}
self.__d_D_loss_funcs = {
'LogLoss': self._D_loss_LogLoss,
'MSE': self._D_loss_MSE
}

# The following variables will later be set by _check_and_set_activation_and_out_functions()

self._my_act_func  = my_activation_function
self._my_out_func  = my_out_function
self._my_loss_func = my_loss_function
self._act_func = None
self._out_func = None
self._loss_func = None

# number of data samples in a mini-batch
self._n_size_mini_batch = n_size_mini_batch
self._n_mini_batches = None  # will be determined by _get_number_of_mini_batches()

# maximum number of epochs - we set this number to an assumed maximum
# - as we shall build a backup and reload functionality for training, this should not be a major problem
self._n_epochs = n_epochs

# maximum number of batches to handle ( if < 0 => all!)
self._n_max_batches = n_max_batches
# actual number of batches
self._n_batches = None

# regularization parameters
self._lambda2_reg = lambda2_reg
self._lambda1_reg = lambda1_reg

# parameter for momentum learning
self._learn_rate = learn_rate
self._decrease_const = decrease_const
self._mom_rate   = mom_rate
self._li_mom = [None] *  self._n_total_layers

# book-keeping for epochs and mini-batches
# -------------------------------
# range for epochs - will be set by _prepare_epochs_and_batches()
self._rg_idx_epochs = None
# range for mini-batches
self._rg_idx_batches = None
# dimension of the numpy arrays for book-keeping - will be set in _prepare_epochs_and_batches()
self._shape_epochs_batches = None    # (n_epochs, n_batches, 1)

# list for error values at outermost layer for minibatches and epochs during training
# we use a numpy array here because we can redimension it
self._ay_theta = None
# list for cost values of mini-batches during training
# The list will later be split into sections for epochs
self._ay_costs = None

# Data elements for back propagation
# ----------------------------------

# 2-dim array of partial derivatives of the elements of an additive cost function
# The derivative is taken with respect to the output results a_j = ay_ANN_out[j]
# The array dimensions account for nodes and samples of a mini_batch. The array will be set in function
# self._initiate_bw_propagation()
self._ay_delta_out_batch = None

# parameter to allow printing of some test data
self._b_print_test_data = b_print_test_data

# Plot handling
# --------------
# Alternatives to resize plots
# 1: just resize figure  2: resize plus create subplots() [figure + axes]
self._plot_resize_alternative = 1
# Plot-sizing
self._figs_x1 = figs_x1
self._figs_x2 = figs_x2
self._fig = None
self._ax  = None
# alternative 2 does resizing and (!) subplots()
self.initiate_and_resize_plot(self._plot_resize_alternative)

# ***********
# operations
# ***********

# check and handle input data
self._handle_input_data()
# set the ANN structure
self._set_ANN_structure()

# Prepare epoch and batch-handling - sets ranges, limits num of mini-batches and initializes book-keeping arrays
self._rg_idx_epochs, self._rg_idx_batches = self._prepare_epochs_and_batches()

# perform training
start_c = time.perf_counter()
self._fit(b_print=True, b_measure_batch_time=False)
end_c = time.perf_counter()
print('\n\n ------')
print('Total training Time_CPU: ', end_c - start_c)
print("\nStopping program regularly")
sys.exit()

```

# The extended method _set_ANN_structure()

I do not change the method “_handle_input_data()”. However, I extend the method “_set_ANN_structure()” with a statement that initializes a list of momentum matrices for all layers.

```
    '''-- Method to set ANN structure --'''
    def _set_ANN_structure(self):
        # check consistency of the node-number list with the number of hidden layers (n_hidden)
        self._check_layer_and_node_numbers()
        # set node numbers for the input layer and the output layer
        self._set_nodes_for_input_output_layers()
        self._show_node_numbers()

        # create the weight matrix between input and first hidden layer
        self._create_WM_Input()
        # create weight matrices between the hidden layers and between the last hidden and the output layer
        self._create_WM_Hidden()

        # initialize momentum differences
        self._create_momentum_matrices()
        #print("\nLength of li_mom = ", str(len(self._li_mom)))

        # check and set activation functions
        self._check_and_set_activation_and_out_functions()
        self._check_and_set_loss_function()

        return None

```

The following box shows the changed functions _create_WM_Input(), _create_WM_Hidden() and the new function _create_momentum_matrices():

```
    '''-- Method to create the weight matrix between L0/L1 --'''
    def _create_WM_Input(self):
        '''
        Method to create the weight matrix between the input layer L0 and layer L1
        The dimension will be taken from the structure of the input data
        We need to fill self._li_w[0] with a matrix for connections of all nodes in L0 with all nodes in L1
        We fill the matrix with random numbers between [-0.5, 0.5]
        '''
        # the bias node is not yet included in the node number of layer 0
        num_nodes_layer_0 = self._ay_nodes_layers[0]
        num_nodes_with_bias_layer_0 = num_nodes_layer_0 + 1
        num_nodes_layer_1 = self._ay_nodes_layers[1]

        # fill the matrix with random values
        #rand_low  = -1.0
        #rand_high = 1.0
        rand_low  = -0.5
        rand_high = 0.5
        rand_size = num_nodes_layer_1 * (num_nodes_with_bias_layer_0)

        randomizer = 1 # method np.random.uniform

        w0 = self._create_vector_with_random_values(rand_low, rand_high, rand_size, randomizer)
        w0 = w0.reshape(num_nodes_layer_1, num_nodes_with_bias_layer_0)

        # put the weight matrix into array of matrices
        self._li_w.append(w0)
        print("\nShape of weight matrix between layers 0 and 1 " + str(self._li_w[0].shape))

    #
    '''-- Method to create the weight-matrices for hidden layers--'''
    def _create_WM_Hidden(self):
        '''
        Method to create the weights of the hidden layers, i.e. between [L1, L2] and so on ... [L_n, L_out]
        We fill the matrix with random numbers between [-1, 1]
        '''

        # The "+1" is required due to range properties !
        rg_hidden_layers = range(1, self._n_hidden_layers + 1, 1)

        # for random operation
        rand_low  = -1.0
        rand_high = 1.0

        for i in rg_hidden_layers:
            print ("Creating weight matrix for layer " + str(i) + " to layer " + str(i+1) )

            num_nodes_layer = self._ay_nodes_layers[i]
            num_nodes_with_bias_layer = num_nodes_layer + 1

            # the number of the next layer is taken without the bias node!
            num_nodes_layer_next = self._ay_nodes_layers[i+1]

            # assign random values
            rand_size = num_nodes_layer_next * num_nodes_with_bias_layer

            randomizer = 1 # np.random.uniform

            w_i_next = self._create_vector_with_random_values(rand_low, rand_high, rand_size, randomizer)
            w_i_next = w_i_next.reshape(num_nodes_layer_next, num_nodes_with_bias_layer)

            # put the weight matrix into our array of matrices
            self._li_w.append(w_i_next)
            print("Shape of weight matrix between layers " + str(i) + " and " + str(i+1) + " = " + str(self._li_w[i].shape))
    #
    '''-- Method to create and initialize matrices for momentum learning (differences) '''
    def _create_momentum_matrices(self):
        rg_layers = range(0, self._n_total_layers - 1)
        for i in rg_layers:
            self._li_mom[i] = np.zeros(self._li_w[i].shape)
            #print("shape of li_mom[" + str(i) + "] = ", self._li_mom[i].shape)
```
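
To make the shapes tangible, here is a minimal standalone sketch of the same idea outside the class. The function name create_weight_matrices, the layer layout [784, 100, 10] and the use of a single value range for all layers are assumptions made for illustration; the class methods above use different ranges for the input and the hidden layers.

```python
import numpy as np

def create_weight_matrices(nodes_per_layer, rand_low=-0.5, rand_high=0.5, seed=0):
    """Return one uniformly initialized weight matrix per pair of adjacent layers.

    Each matrix has the shape (nodes of next layer, nodes of current layer + 1);
    the extra column belongs to the bias node of the current layer.
    """
    rng = np.random.default_rng(seed)
    li_w = []
    for n_cur, n_next in zip(nodes_per_layer[:-1], nodes_per_layer[1:]):
        li_w.append(rng.uniform(rand_low, rand_high, size=(n_next, n_cur + 1)))
    return li_w

nodes = [784, 100, 10]                        # hypothetical MNIST-like layout
li_w = create_weight_matrices(nodes)
li_mom = [np.zeros(w.shape) for w in li_w]    # momentum matrices start at zero

for i, (w, m) in enumerate(zip(li_w, li_mom)):
    print(f"W[{i}]: {w.shape}   mom[{i}]: {m.shape}")
```

Each momentum matrix has the same shape as its weight matrix, because it stores the last correction applied to exactly these weights.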

# The modified functions _fit() and _handle_mini_batch()

The _fit()-function is modified to include a systematic decrease of the learning rate.

```
    ''' -- Method to perform training in epochs for defined mini-batches -- '''
    def _fit(self, b_print = False, b_measure_batch_time = False):

        rg_idx_epochs  = self._rg_idx_epochs
        rg_idx_batches = self._rg_idx_batches
        if (b_print):
            print("\nnumber of epochs = " + str(len(rg_idx_epochs)))
            print("max number of batches = " + str(len(rg_idx_batches)))

        # loop over epochs
        for idxe in rg_idx_epochs:
            if (b_print):
                print("\n ---------")
                print("\nStarting epoch " + str(idxe+1))

            # systematic decrease of the learning rate with the epoch number
            self._learn_rate /= (1.0 + self._decrease_const * idxe)

            # loop over mini-batches
            for idxb in rg_idx_batches:
                if (b_print):
                    print("\n ---------")
                    print("\nDealing with mini-batch " + str(idxb+1))
                if b_measure_batch_time:
                    start_0 = time.perf_counter()
                # deal with a mini-batch
                self._handle_mini_batch(num_batch = idxb, num_epoch=idxe, b_print_y_vals = False, b_print = False)
                if b_measure_batch_time:
                    end_0 = time.perf_counter()
                    print('Time_CPU for batch ' + str(idxb+1), end_0 - start_0)

                #if idxb == 100:
                #    sys.exit()

        return None
```

Note that the number of epochs is determined by an external parameter as an upper limit of the range “rg_idx_epochs”.
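
The effect of the decrease statement in _fit() is easy to underestimate: the division is applied to the already reduced learning rate, so the reductions compound from epoch to epoch. The following sketch replays just that statement; the function name learn_rate_schedule and the numbers are hypothetical and chosen only for illustration.

```python
def learn_rate_schedule(learn_rate, decrease_const, n_epochs):
    """Replay the per-epoch update 'learn_rate /= (1 + decrease_const * idxe)'."""
    rates = []
    for idxe in range(n_epochs):
        learn_rate /= (1.0 + decrease_const * idxe)
        rates.append(learn_rate)
    return rates

# hypothetical values, not a recommendation
rates = learn_rate_schedule(learn_rate=0.001, decrease_const=0.01, n_epochs=5)
for idxe, rate in enumerate(rates):
    print(f"epoch {idxe + 1}: learn_rate = {rate:.8f}")
```

In the first epoch (idxe = 0) the rate stays unchanged; afterwards it is divided by 1.01, then additionally by 1.02, and so on.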

Method “_handle_mini_batch()” requires several changes: First we define lists which are required to save matrix data of the backward propagation. And, of course, we call a method to perform the BW propagation (see step 6 in the code). Some statements print shapes, if required. At step 7 of the code we correct the weights by using the learning rate and the calculated gradient of the loss function.

Note that we mix the correction evaluated for the previous mini-batch with the correction evaluated for the present mini-batch. This corresponds to a simple form of momentum learning. We therefore have to save the present correction values, and the list for the momentum correction, “li_mom”, is not deleted at the end of a mini-batch treatment.
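
A minimal sketch of this update rule on a toy problem may help. It assumes that the present correction is the learning rate times the gradient; the function momentum_step, the quadratic toy cost and all numbers are hypothetical.

```python
import numpy as np

def momentum_step(w, grad, mom, learn_rate, mom_rate):
    """One weight update: present correction plus a fraction of the previous one."""
    delta_w = learn_rate * grad               # present correction (assumed form)
    w_new = w - (delta_w + mom_rate * mom)    # mix with the saved previous correction
    return w_new, delta_w                     # delta_w becomes the new momentum

# toy cost C(w) = 0.5 * |w|^2, whose gradient is simply w
w = np.array([1.0, -2.0])
mom = np.zeros_like(w)
for step in range(3):
    grad = w
    w, mom = momentum_step(w, grad, mom, learn_rate=0.1, mom_rate=0.5)
    print(f"step {step + 1}: w = {w}")
```

In the first step the momentum term is zero, so the update reduces to plain gradient descent; from the second step on the previous correction pushes the weights further in the same direction.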

In addition to saving the total costs per mini-batch we now also save a mean error at the output level. The average is computed with Numpy’s function numpy.average(), applied to the matrix of absolute error values. Remember that we average over the errors at all output nodes and over all records of the mini-batch.
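
As a small illustration with made-up numbers (3 output nodes, 2 samples; the values are hypothetical), the averaged error is obtained like this:

```python
import numpy as np

# rows = output nodes, columns = samples of the mini-batch
ay_y_enc = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])      # one-hot encoded targets
ay_ANN_out = np.array([[0.8, 0.1],
                       [0.1, 0.7],
                       [0.2, 0.1]])    # hypothetical network output

ay_theta_out = ay_y_enc - ay_ANN_out
# without an axis argument np.average() averages over all matrix elements
ay_theta_avg = np.average(np.abs(ay_theta_out))
print(ay_theta_avg)
```

The six absolute deviations add up to 1.0, so the printed average is roughly 0.167.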

```
    ''' -- Method to deal with a batch -- '''
    def _handle_mini_batch(self, num_batch = 0, num_epoch = 0, b_print_y_vals = False, b_print = False, b_keep_bw_matrices = True):
        '''
        For each batch we keep the input data array Z and the output data A (output of activation function!)
        for all layers in Python lists
        We can use this as input variables in function calls - mutable variables are handled by reference values !
        We receive the A and Z data from propagation functions and pass them on to cost and gradient calculation functions

        As an initial step we define the Python lists li_Z_in_layer and li_A_out_layer
        and fill in the first input elements for layer L0

        Forward propagation:
        --------------------
        li_Z_in_layer : List of layer-related 2-dim matrices for input values z at each node (rows) and all batch-samples (cols).
        li_A_out_layer: List of layer-related 2-dim matrices for output values a at each node (rows) and all batch-samples (cols).
                        The output is created by Phi(z), where Phi represents an activation or output function

        Note that the matrices in li_A_out_layer will be permanently extended by a row (over all samples)
        to account for a bias node of each inner layer. This happens during FW propagation.

        Note that the matrices in li_Z_in_layer will be temporarily extended by a row (over all samples)
        to account for a bias node of each inner layer. This happens during BW propagation.

        Backward propagation:
        --------------------
        li_delta_out:  Startup matrix for _out_delta-values at the outermost layer
        li_grad_layer: List of layer-related matrices with gradient values for the correction of the weights

        Depending on parameter "b_keep_bw_matrices" we keep
            - a list of layer-related matrices D with values for the derivatives of the act./output functions
            - a list of layer-related matrices for the back propagated delta-values
        in lists during back propagation. This can support error analysis.

        All matrices in the lists are 2 dimensional with dimensions for nodes (rows) and training samples (cols)
        All these lists will be deleted at the end of the function to accelerate garbage handling

        Input parameters:
        ----------------
        num_epoch:    Number of present epoch
        num_batch:    Number of present mini-batch
        '''
        # Layer-related lists to be filled with 2-dim Numpy matrices during FW propagation
        # ********************************************************************************
        li_Z_in_layer  = [None] * self._n_total_layers # List of matrices with z-input values for each layer; filled during FW-propagation
        li_A_out_layer = li_Z_in_layer.copy()          # List of matrices with results of activation/output-functions for each layer; filled during FW-propagation
        li_delta_out   = li_Z_in_layer.copy()          # Matrix with out_delta-values at the outermost layer
        li_delta_layer = li_Z_in_layer.copy()          # List of the matrices for the BW propagated delta values
        li_D_layer     = li_Z_in_layer.copy()          # List of the derivative matrices D containing partial derivatives of the activation/output functions
        li_grad_layer  = li_Z_in_layer.copy()          # List of the matrices with gradient values for weight corrections

        if b_print:
            len_lists = len(li_A_out_layer)
            print("\nnum_epoch = ", num_epoch, "  num_batch = ", num_batch )
            print("\nhandle_mini_batch(): length of lists = ", len_lists)
            self._info_point_print("handle_mini_batch: point 1")

        # Print some infos
        # ****************
        if b_print:
            self._print_batch_infos()
            self._info_point_print("handle_mini_batch: point 2")

        # Major steps for the mini-batch during one epoch iteration
        # **********************************************************

        # Step 0: List of indices for data records in the present mini-batch
        # ******
        ay_idx_batch = self._ay_mini_batches[num_batch]

        # Step 1: Special preparation of the Z-input to the MLP's input Layer L0
        # ******
        # Layer L0: Fill in the input vector for the ANN's input layer L0
        li_Z_in_layer[0] = self._X_train[ay_idx_batch] # numpy arrays can be indexed by an array of integers
        if b_print:
            print("\nPropagation : Shape of X_in = li_Z_in_layer = ", li_Z_in_layer[0].shape)
            #print("\nidx, expected y_value of Layer L0-input :")
            #for idx in self._ay_mini_batches[num_batch]:
            #    print(str(idx) + ', ' + str(self._y_train[idx]) )
            self._info_point_print("handle_mini_batch: point 3")

        # Step 2: Layer L0: We need to transpose the data of the input layer
        # *******
        ay_Z_in_0T       = li_Z_in_layer[0].T
        li_Z_in_layer[0] = ay_Z_in_0T
        if b_print:
            print("\nPropagation : Shape of transposed X_in = li_Z_in_layer = ", li_Z_in_layer[0].shape)
            self._info_point_print("handle_mini_batch: point 4")

        # Step 3: Call forward propagation method for the present mini-batch of training records
        # *******
        # this function will fill the li_Z_in- and li_A_out-lists with matrices per layer
        self._fw_propagation(li_Z_in = li_Z_in_layer, li_A_out = li_A_out_layer, b_print = b_print)

        if b_print:
            ilayer = range(0, self._n_total_layers)
            print("\n ---- ")
            print("\nAfter propagation through all " + str(self._n_total_layers) + " layers: ")
            for il in ilayer:
                print("Shape of Z_in of layer L" + str(il) + " = " + str(li_Z_in_layer[il].shape))
                print("Shape of A_out of layer L" + str(il) + " = " + str(li_A_out_layer[il].shape))
                if il < self._n_total_layers-1:
                    print("Shape of W of layer L" + str(il) + " = " + str(self._li_w[il].shape))
                    print("Shape of Mom of layer L" + str(il) + " = " + str(self._li_mom[il].shape))
            self._info_point_print("handle_mini_batch: point 5")

        # Step 4: Cost calculation for the mini-batch
        # ********
        ay_y_enc = self._ay_onehot[:, ay_idx_batch]
        ay_ANN_out = li_A_out_layer[self._n_total_layers-1]
        # print("Shape of ay_ANN_out = " + str(ay_ANN_out.shape))

        total_costs_batch = self._calculate_loss_for_batch(ay_y_enc, ay_ANN_out, b_print = False)
        # we add the present cost value to the numpy array
        self._ay_costs[num_epoch, num_batch] = total_costs_batch
        if b_print:
            print("\n total costs of mini_batch = ", self._ay_costs[num_epoch, num_batch])
            self._info_point_print("handle_mini_batch: point 6")
        print("\n total costs of mini_batch = ", self._ay_costs[num_epoch, num_batch])

        # Step 5: Avg-error for later plotting
        # ********
        # mean "error" values - averaged over all nodes at outermost layer and all data sets of a mini-batch
        ay_theta_out = ay_y_enc - ay_ANN_out
        if (b_print):
            print("Shape of ay_theta_out = " + str(ay_theta_out.shape))
        ay_theta_avg = np.average(np.abs(ay_theta_out))
        self._ay_theta[num_epoch, num_batch] = ay_theta_avg

        if b_print:
            print("\navg total error of mini_batch = ", self._ay_theta[num_epoch, num_batch])
            self._info_point_print("handle_mini_batch: point 7")
        print("avg total error of mini_batch = ", self._ay_theta[num_epoch, num_batch])

        # Step 6: Perform gradient calculation via back propagation of errors
        # *******
        self._bw_propagation( ay_y_enc = ay_y_enc,
                              li_Z_in = li_Z_in_layer,
                              li_A_out = li_A_out_layer,
                              li_delta_out = li_delta_out,
                              li_delta = li_delta_layer,
                              li_D = li_D_layer,
                              li_grad = li_grad_layer,
                              b_print = b_print,
                              b_internal_timing = False
                              )

        # Step 7: Adjustment of weights
        # *******
        rg_layer=range(0, self._n_total_layers -1)
        for N in rg_layer:
            # present correction: learning rate times gradient of the loss function
            delta_w_N = self._learn_rate * li_grad_layer[N]
            self._li_w[N] -= ( delta_w_N + (self._mom_rate * self._li_mom[N]) )
            # save momentum
            self._li_mom[N] = delta_w_N

        # try to accelerate garbage handling
        # **************
        if len(li_Z_in_layer) > 0:
            del li_Z_in_layer
        if len(li_A_out_layer) > 0:
            del li_A_out_layer
        if len(li_delta_out) > 0:
            del li_delta_out
        if len(li_delta_layer) > 0:
            del li_delta_layer
        if len(li_D_layer) > 0:
            del li_D_layer

        return None
```

# Forward Propagation

The method for forward propagation remains unchanged in its structure. We only changed the prefix for the Python lists.

```
    ''' -- Method to handle FW propagation for a mini-batch --'''
    def _fw_propagation(self, li_Z_in, li_A_out, b_print= False):

        b_internal_timing = False

        # index range of layers
        #    Note that we count from 0 (0 => L0) to E (E => L_E)
        #    Careful: during BW-propagation we may need a correct indexing of lists filled during FW-propagation
        ilayer = range(0, self._n_total_layers-1)

        # propagation loop
        # ***************
        for il in ilayer:
            if b_internal_timing: start_0 = time.perf_counter()

            if b_print:
                print("\nStarting propagation between L" + str(il) + " and L" + str(il+1))
                print("Shape of Z_in of layer L" + str(il) + " (without bias) = " + str(li_Z_in[il].shape))

            # Step 1: Take input of last layer and apply activation function
            # ******
            if il == 0:
                A_out_il = li_Z_in[il] # L0: activation function is identity
            else:
                A_out_il = self._act_func( li_Z_in[il] ) # use real activation function

            # Step 2: Add bias node
            # ******
            # save in array
            li_A_out[il] = A_out_il
            if b_print:
                print("Shape of A_out of layer L" + str(il) + " (with bias) = " + str(li_A_out[il].shape))

            # Step 3: Propagate by matrix operation
            # ******
            Z_in_ilp1 = np.dot(self._li_w[il], A_out_il)
            li_Z_in[il+1] = Z_in_ilp1

            if b_internal_timing:
                end_0 = time.perf_counter()
                print('Time_CPU for layer propagation L' + str(il) + ' to L' + str(il+1), end_0 - start_0)

        # treatment of the last layer
        # ***************************
        il = il + 1
        if b_print:
            print("\nShape of Z_in of layer L" + str(il) + " = " + str(li_Z_in[il].shape))
        A_out_il = self._out_func( li_Z_in[il] ) # use the output function
        li_A_out[il] = A_out_il
        if b_print:
            print("Shape of A_out of last layer L" + str(il) + " = " + str(li_A_out[il].shape))

        return None

```

We shall later learn that the treatment of bias neurons can be done more efficiently. The present way of coding it reduces performance – especially at the input layer. See the article series starting with “MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation” for more information. At the present stage of our discussion we are, however, more interested in getting a working code first – and not so much in performance optimization.
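
To illustrate what the bias handling means for the matrix shapes, here is a minimal standalone sketch of one propagation step. The helper add_bias_row, the placement of the bias node in the first row and all sizes are assumptions made for illustration; they are not taken from the class code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias_row(A):
    """Hypothetical helper: prepend a row of ones (the bias node) to matrix A."""
    return np.vstack((np.ones((1, A.shape[1])), A))

rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 4, 3, 5

A_0 = rng.uniform(0.0, 1.0, size=(n_in, n_samples))       # L0: activation is the identity
W_0 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_in + 1))   # extra column for the bias node

A_0_bias = add_bias_row(A_0)      # shape (n_in + 1, n_samples)
Z_1 = np.dot(W_0, A_0_bias)       # shape (n_hidden, n_samples)
A_1 = sigmoid(Z_1)                # output of layer L1 (before its own bias row is added)

print(A_0_bias.shape, Z_1.shape, A_1.shape)
```

Building a new, larger array for every layer and every mini-batch is what costs time, most of all at the input layer with its many nodes.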

# Methods for Error Backward Propagation

In contrast to the recipe given in my PDF on the EBP-math, we cannot calculate the matrices “ay_D” with the derivatives of the activation functions in advance for all layers. The reason was discussed in the last article: some matrices have to be adjusted intermediately for a bias neuron, which is ignored in the analysis of the PDF.

The resulting code of our EBP method looks as follows:

```
    ''' -- Method to handle error BW propagation for a mini-batch --'''
    def _bw_propagation(self,
                        ay_y_enc, li_Z_in, li_A_out,
                        li_delta_out, li_delta, li_D, li_grad,
                        b_print = True, b_internal_timing = False):

        # List initialization: All parameter lists or arrays are filled or to be filled by layers
        # Note: the lists li_Z_in, li_A_out were already filled by _fw_propagation() for the present batch

        # Initiate BW propagation - provide delta-matrices for outermost layer
        # ***********************
        # Input Z at outermost layer E  (4 layers -> layer 3)
        ay_Z_E = li_Z_in[self._n_total_layers-1]
        # Output A at outermost layer E (was calculated by output function)
        ay_A_E = li_A_out[self._n_total_layers-1]

        # Calculate D-matrix (derivative of output function) at the outermost layer - presently only D_sigmoid
        ay_D_E = self._calculate_D_E(ay_Z_E=ay_Z_E, b_print=b_print )

        # Get the 2 delta matrices for the outermost layer (only layer E has 2 delta-matrices)
        ay_delta_E, ay_delta_out_E = self._calculate_delta_E(ay_y_enc=ay_y_enc, ay_A_E=ay_A_E, ay_D_E=ay_D_E, b_print=b_print)

        # We check the shapes
        shape_theory = (self._n_nodes_layer_out, self._n_size_mini_batch)
        if (b_print and ay_delta_E.shape != shape_theory):
            print("\nError: Shape of ay_delta_E is wrong:")
            print("Shape = ", ay_delta_E.shape, "  ::  should be = ", shape_theory )
        if (b_print and ay_D_E.shape != shape_theory):
            print("\nError: Shape of ay_D_E is wrong:")
            print("Shape = ", ay_D_E.shape, "  ::  should be = ", shape_theory )

        # add the matrices to their lists ; li_delta_out gets only one element
        idxE = self._n_total_layers - 1
        li_delta_out[idxE] = ay_delta_out_E # this happens only once
        li_delta[idxE]     = ay_delta_E
        li_D[idxE]         = ay_D_E
        li_grad[idxE]      = None    # On the outermost layer there is no gradient !

        if b_print:
            print("bw: Shape delta_E = ", li_delta[idxE].shape)
            print("bw: Shape D_E = ", ay_D_E.shape)
            self._info_point_print("bw_propagation: point bw_1")

        # Loop over all layers in reverse direction
        # ******************************************
        # index range of target layers N in BW direction (starting with E-1 => 4 layers -> layer 2)
        if b_print:
            range_N_bw_layer_test = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken
            rg_list = list(range_N_bw_layer_test) # Note this exhausts the reversed iterator
            print("range_N_bw_layer = ", rg_list)

        range_N_bw_layer = reversed(range(0, self._n_total_layers-1))   # must be -1 as the last element is not taken

        # loop over layers
        for N in range_N_bw_layer:
            if b_print:
                print("\n N (layer) = " + str(N) +"\n")
            # start timer
            if b_internal_timing: start_0 = time.perf_counter()

            # Back Propagation operations between layers N+1 and N
            # *******************************************************
            # this method handles the special treatment of bias nodes in Z_in, too
            ay_delta_N, ay_D_N, ay_grad_N = self._bw_prop_Np1_to_N( N=N, li_Z_in=li_Z_in, li_A_out=li_A_out, li_delta=li_delta, b_print=False )

            if b_internal_timing:
                end_0 = time.perf_counter()
                print('Time_CPU for BW layer operations ', end_0 - start_0)

            # add matrices to their lists
            li_delta[N] = ay_delta_N
            li_D[N]     = ay_D_N
            li_grad[N]  = ay_grad_N
            #sys.exit()

        return

```

We first handle the necessary matrix evaluations for the outermost layer. There we use two helper functions: _calculate_D_E() calculates the derivative of the output function with respect to the local input z, and _calculate_delta_E() calculates the values of the “delta” terms at all nodes and for all records according to the prescription in the PDF:

```
    ''' -- Method to calculate the matrix with the derivative values of the output function at outermost layer '''
    def _calculate_D_E(self, ay_Z_E, b_print= True):
        '''
        This method calculates and returns the D-matrix for the outermost layer
        The D matrix contains derivatives of the output function with respect to local input "z_j" at outermost nodes.

        Returns
        ------
        ay_D_E:    Matrix with derivative values of the output function
                   with respect to local z_j values at the nodes of the outermost layer E
        Note: This is a 2-dim matrix over layer nodes and training samples of the mini-batch
        '''
        if self._my_out_func == 'sigmoid':
            ay_D_E = self._D_sigmoid(ay_Z=ay_Z_E)

        else:
            print("The derivative for output function " + self._my_out_func + " is not known yet!" )
            sys.exit()

        return ay_D_E

    ''' -- Method to calculate the delta_E matrix as a starting point of the backward propagation '''
    def _calculate_delta_E(self, ay_y_enc, ay_A_E, ay_D_E, b_print= False):
        '''
        This method calculates and returns the 2 delta-matrices for the outermost layer

        Returns
        ------
        delta_E:     delta_matrix of the outermost layer (indicated by E)
        delta_out:   delta_out matrix => elements are local derivative values of the cost function
                     with respect to the output "a_j" at an outermost node
                     !!! delta_out will only be returned if calculable !!!

        Note: these are 2-dim matrices over layer nodes and training samples of the mini-batch
        '''

        if self._my_loss_func == 'LogLoss':
            # Calculate delta_E directly to avoid problems with zero denominators
            ay_delta_E = ay_A_E - ay_y_enc
            # delta_out is fetched but may be None
            ay_delta_out, ay_D_numerator, ay_D_denominator = self._D_loss_LogLoss(ay_y_enc, ay_A_E, b_print = False)

            # To be done: Analyze critical values in D_denominator

            # Release variables explicitly
            del ay_D_numerator
            del ay_D_denominator

        if self._my_loss_func == 'MSE':
            # First calculate delta_out and then the delta_E
            ay_delta_out = self._D_loss_MSE(ay_y_enc, ay_A_E, b_print=False)
            # calculate delta_E via element-wise multiplication
            ay_delta_E = ay_delta_out * ay_D_E

        return ay_delta_E, ay_delta_out

```

Further helper methods are required to calculate the cost functions and their derivatives:

```
    ''' method to calculate the logistic regression loss function '''
    def _loss_LogLoss(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        Method which calculates LogReg loss function in a vectorized form on multidimensional Numpy arrays
        '''
        b_test = False

        if b_print:
            print("From LogLoss: shape of ay_y_enc =  " + str(ay_y_enc.shape))
            print("From LogLoss: shape of ay_ANN_out =  " + str(ay_ANN_out.shape))
            print("LogLoss: ay_y_enc = ", ay_y_enc)
            print("LogLoss: ANN_out = \n", ay_ANN_out)
            print("LogLoss: log(ay_ANN_out) =  \n", np.log(ay_ANN_out) )

        # The following means an element-wise (!) operation between matrices of the same shape!
        Log1 = -ay_y_enc * (np.log(ay_ANN_out))
        # The following means an element-wise (!) operation between matrices of the same shape!
        Log2 = (1 - ay_y_enc) * np.log(1 - ay_ANN_out)

        # the next operation calculates the sum over all matrix elements
        # - thus getting the total costs for all mini-batch elements
        cost = np.sum(Log1 - Log2)

        #if b_print and b_test:
            # Log1_x = -ay_y_enc.dot((np.log(ay_ANN_out)).T)
            # print("From LogLoss: L1 =   " + str(L1))
            # print("From LogLoss: L1X =  " + str(L1X))

        if b_print:
            print("From LogLoss: cost =  " + str(cost))

        # The total costs is just a number (scalar)
        return cost
    #
    ''' method to calculate the derivative of the logistic regression loss function
        with respect to the output values '''
    def _D_loss_LogLoss(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        This function returns the out_delta_S-matrix which is required to initialize the
        BW propagation (EBP)
        Note ANN_out is the A_out-list element ( a 2-dim matrix) for the outermost layer
        In this case we have to take care of denominators = 0
        '''
        D_numerator = ay_ANN_out - ay_y_enc
        D_denominator = -(ay_ANN_out - 1.0) * ay_ANN_out
        n_critical = np.count_nonzero(D_denominator < 1.0e-8)
        if n_critical > 0:
            delta_s_out = None
        else:
            delta_s_out = np.divide(D_numerator, D_denominator)
        return delta_s_out, D_numerator, D_denominator
    #
    ''' method to calculate the MSE loss function '''
    def _loss_MSE(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        Method which calculates the MSE loss function in a vectorized form on multidimensional Numpy arrays
        '''
        if b_print:
            print("From loss_MSE: shape of ay_y_enc =  " + str(ay_y_enc.shape))
            print("From loss_MSE: shape of ay_ANN_out =  " + str(ay_ANN_out.shape))
            #print("LogReg: ay_y_enc = ", ay_y_enc)
            #print("LogReg: ANN_out = \n", ay_ANN_out)
            #print("LogReg: log(ay_ANN_out) =  \n", np.log(ay_ANN_out) )

        cost = 0.5 * np.sum( np.square( ay_y_enc - ay_ANN_out ) )

        if b_print:
            print("From loss_MSE: cost =  " + str(cost))

        return cost
    #
    ''' method to calculate the derivative of the MSE loss function
        with respect to the output values '''
    def _D_loss_MSE(self, ay_y_enc, ay_ANN_out, b_print = False):
        '''
        This function returns the out_delta_S - matrix which is required to initialize the
        BW propagation (EBP)
        Note ANN_out is the A_out-list element ( a 2-dim matrix) for the outermost layer
        In this case the output is harmless (no critical denominator)
        '''
        delta_s_out = ay_ANN_out - ay_y_enc
        return delta_s_out
```

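As you can see, we take some care to avoid zero denominators for the logarithmic loss function: _D_loss_LogLoss() returns None for delta_out if a denominator becomes critically small, and _calculate_delta_E() computes delta_E directly as the difference between the output and the target values.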
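
Why the direct calculation is legitimate for the combination of a sigmoid output function with the Log Loss can be checked numerically: the critical denominator a(1-a) of the loss derivative cancels against the sigmoid derivative a(1-a). The following standalone sketch, with hypothetical random data and helper names of my own, compares the product form with the direct form and with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, a):
    return np.sum(-y * np.log(a) - (1.0 - y) * np.log(1.0 - a))

rng = np.random.default_rng(1)
z = rng.normal(size=(3, 4))                   # 3 output nodes, 4 samples
y = (rng.uniform(size=(3, 4)) > 0.5) * 1.0    # hypothetical 0/1 target values
a = sigmoid(z)

delta_out = (a - y) / (a * (1.0 - a))         # derivative of the cost with respect to a
D = a * (1.0 - a)                             # derivative of the sigmoid with respect to z
delta_E_product = delta_out * D               # chain rule, element-wise
delta_E_direct = a - y                        # denominator cancelled analytically

# finite-difference estimate of dC/dz for a single matrix element
eps = 1.0e-6
z_plus, z_minus = z.copy(), z.copy()
z_plus[0, 0] += eps
z_minus[0, 0] -= eps
num_grad = (log_loss(y, sigmoid(z_plus)) - log_loss(y, sigmoid(z_minus))) / (2.0 * eps)

print(np.allclose(delta_E_product, delta_E_direct))
print(num_grad, delta_E_direct[0, 0])
```

The direct form never divides by a(1-a) and therefore remains well defined even when an output value saturates at 0 or 1.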

The check statements for shapes can be eliminated in a future version, once we are sure that everything works correctly. Keeping the layer-specific matrices during the handling of a mini-batch also helps with error analysis in the beginning. In the end we may keep only the gradient matrices and the layer-specific matrices required for the local calculations during back propagation.

We then loop over all other layers down to layer L0. The matrix operations required for these layers are handled in a further method:

```
''' -- Method to calculate the BW-propagated delta-matrix and the gradient matrix to/for layer N '''
def _bw_prop_Np1_to_N(self, N, li_Z_in, li_A_out, li_delta, b_print=False):
'''
BW-error-propagation between layer N+1 and N
Inputs:
li_Z_in:  List of input Z-matrices on all layers - values were calculated during FW-propagation
li_A_out: List of output A-matrices - values were calculated during FW-propagation
li_delta: List of delta-matrices - values for outermost layer E to layer N+1 should exist

Returns:
ay_delta_N - delta-matrix of layer N (required in subsequent steps)
ay_D_N     - derivative matrix for the activation function on layer N
ay_grad_N  - matrix with gradient elements of the cost function with respect to the weights on layer N
'''

if b_print:
print("Inside _bw_prop_Np1_to_N: N = " + str(N) )

# Prepare required quantities - and add bias neuron to ay_Z_in
# ****************************

# Weight matrix meddling betwen layer N and N+1
ay_W_N = self._li_w[N]
shape_W_N   = ay_W_N.shape # due to bias node first dim is 1 bigger than Z-matrix
if b_print:
print("shape of W_N = ", shape_W_N )

# delta-matrix of layer N+1
ay_delta_Np1 = li_delta[N+1]
shape_delta_Np1 = ay_delta_Np1.shape

# !!! Add intermediate row (for bias) to Z_N !!!
ay_Z_N = li_Z_in[N]
shape_Z_N_orig = ay_Z_N.shape
shape_Z_N = ay_Z_N.shape # dimensions should fit now with W- and A-matrix

# Derivative matrix for the activation function (with extra bias node row)
#    can only be calculated now as we need the z-values
ay_D_N = self._calculate_D_N(ay_Z_N)
shape_D_N = ay_D_N.shape

ay_A_N = li_A_out[N]
shape_A_N = ay_A_N.shape

# print shapes
if b_print:
print("shape of W_N = ", shape_W_N)
print("
shape of delta_(N+1) = ", shape_delta_Np1)
print("shape of Z_N_orig = ", shape_Z_N_orig)
print("shape of Z_N = ", shape_Z_N)
print("shape of D_N = ", shape_D_N)
print("shape of A_N = ", shape_A_N)

# Propagate delta
# **************
if li_delta[N+1] is None:
print("BW-Prop-error:\n No delta-matrix found for layer " + str(N+1) )
sys.exit()

# Check shapes for np.dot()-operation - here for element [0] of both shapes - as we operate with W.T !
if ( shape_W_N[0] != shape_delta_Np1[0]):
print("BW-Prop-error:\n shape of W_N [", shape_W_N, "]) does not fit shape of delta_N+1 [", shape_delta_Np1, "]" )
sys.exit()

# intermediate delta
# ~~~~~~~~~~~~~~~~~~
ay_delta_w_N = ay_W_N.T.dot(ay_delta_Np1)
shape_delta_w_N = ay_delta_w_N.shape

# Check shapes for element wise *-operation !
if ( shape_delta_w_N != shape_D_N ):
print("BW-Prop-error:\n shape of delta_w_N [", shape_delta_w_N, "]) does not fit shape of D_N [", shape_D_N, "]" )
sys.exit()

# final delta
# ~~~~~~~~~~~
ay_delta_N = ay_delta_w_N * ay_D_N
# reduce dimension again
ay_delta_N = ay_delta_N[1:, :]
shape_delta_N = ay_delta_N.shape

# Check dimensions again - ay_delta_N.shape should fit shape_Z_in_orig
if shape_delta_N != shape_Z_N_orig:
print("BW-Prop-error:\n shape of delta_N [", shape_delta_N, "]) does not fit original shape Z_in_N [", shape_Z_N_orig, "]" )
sys.exit()

if N > 0:
shape_W_Nm1 = self._li_w[N-1].shape
if shape_delta_N[0] != shape_W_Nm1[0] :
print("BW-Prop-error:\n shape of delta_N [", shape_delta_N, "]) does not fit shape of W_Nm1 [", shape_W_Nm1, "]" )
sysexit()

# ********************
#     required for all layers down to 0
# check shapes
if shape_delta_Np1[1] != shape_A_N[1]:
print("BW-Prop-error:\n shape of delta_Np1 [", shape_delta_Np1, "]) does not fit shape of A_N [", shape_A_N, "] for matrix multiplication" )
sys.exit()

# regularize gradient (!!!! without adding bias nodes in the L1, L2 sums)
ay_grad_N[:, 1:] += (self._li_w[N][:, 1:] * self._lambda2_reg + np.sign(self._li_w[N][:, 1:]) * self._lambda1_reg)

#
# Check shape
print("BW-Prop-error:\n shape of grad_N [", shape_grad_N, "]) does not fit shape of W_N [", shape_W_N, "]" )
sys.exit()

# print shapes
if b_print:
print("shape of delta_N = ", shape_delta_N)

```

This function does more or less exactly what our theoretical analysis in the last two articles requires. Note the intermediate handling of bias nodes! Note also that bias nodes are NOT included in the regularization terms L1 and L2! The function to calculate the derivative of the activation function is:

```
#
''' -- Method to calculate the matrix with the derivative values of the activation function at layer N '''

def _calculate_D_N(self, ay_Z_N, b_print= False):
    '''
    This method calculates and returns the D-matrix for layer N
    The D matrix contains derivatives of the activation function with respect to local input "z_j" at the nodes of layer N.

    Returns
    ------
    ay_D_E:    Matrix with derivative values of the activation function
               with respect to local z_j values at the nodes of layer N
               Note: This is a 2-dim matrix over layer nodes and training samples of the mini-batch
    '''
    if self._my_out_func == 'sigmoid':
        ay_D_E = self._D_sigmoid(ay_Z = ay_Z_N)

    else:
        print("The derivative for output function " + self._my_out_func + " is not known yet!" )
        sys.exit()

    return ay_D_E

```
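To see the two methods above working together without the rest of the class, the following standalone sketch performs one backward step for a single layer on random data. It is a simplified illustration, not our class code: the function `bw_step`, its parameter names and the assumption that the bias node always outputs 1 are choices made for this example only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    """Derivative of the logistic function with respect to z."""
    s = sigmoid(z)
    return s * (1.0 - s)

def bw_step(W_N, delta_Np1, Z_N, A_N, lambda2=0.0, lambda1=0.0):
    """One backward step from layer N+1 to layer N (hypothetical helper).

    W_N:       (n_Np1, n_N + 1)   weights, column 0 belongs to the bias node
    delta_Np1: (n_Np1, n_batch)   delta-matrix of layer N+1
    Z_N:       (n_N, n_batch)     inputs of layer N, without bias row
    A_N:       (n_N + 1, n_batch) outputs of layer N, row 0 is the bias output
    """
    # Add a bias row so that the shapes fit W_N.T @ delta_Np1
    Z_N_b = np.vstack((np.ones((1, Z_N.shape[1])), Z_N))
    D_N = d_sigmoid(Z_N_b)
    # Propagate delta and drop the bias row again
    delta_N = ((W_N.T @ delta_Np1) * D_N)[1:, :]
    # Gradient with respect to the weights between layer N and N+1
    grad_N = delta_Np1 @ A_N.T
    # Regularization contribution; the bias column 0 is left out
    grad_N[:, 1:] += lambda2 * W_N[:, 1:] + lambda1 * np.sign(W_N[:, 1:])
    return delta_N, grad_N

rng = np.random.default_rng(0)
n_N, n_Np1, n_batch = 4, 3, 5
W_N = rng.normal(size=(n_Np1, n_N + 1))
Z_N = rng.normal(size=(n_N, n_batch))
A_N = np.vstack((np.ones((1, n_batch)), sigmoid(Z_N)))
delta_Np1 = rng.normal(size=(n_Np1, n_batch))

delta_N, grad_N = bw_step(W_N, delta_Np1, Z_N, A_N)
delta_N_reg, grad_N_reg = bw_step(W_N, delta_Np1, Z_N, A_N, lambda2=0.1)

print(delta_N.shape, grad_N.shape)   # (4, 5) (3, 5)
```

The delta-matrix ends up with the shape of `Z_N` and the gradient with the shape of `W_N`, which are exactly the conditions the shape checks in `_bw_prop_Np1_to_N` test for. Comparing `grad_N_reg` with `grad_N` shows that the bias column is not touched by the regularization.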

The methods to calculate regularization terms for the loss function are:

```
#
''' method to calculate the quadratic regularization term for the loss function '''
def _regularize_by_L2(self, b_print=False):
    '''
    The L2 regularization term sums up all quadratic weights (without the weight for the bias)
    over the input and all hidden layers (but not the output layer)
    The weight for the bias is in the first column (index 0) of the weight matrix -
    as the bias node's output is in the first row of the output vector of the layer
    '''
    ilayer = range(0, self._n_total_layers-1) # this excludes the last layer
    L2 = 0.0
    for idx in ilayer:
        L2 += (np.sum( np.square(self._li_w[idx][:, 1:])) )
    L2 *= 0.5 * self._lambda2_reg
    if b_print:
        print("\nL2: total L2 = " + str(L2) )
    return L2
#
''' method to calculate the linear regularization term for the loss function '''
def _regularize_by_L1(self, b_print=False):
    '''
    The L1 regularization term sums up the absolute values of all weights (without the weight for the bias)
    over the input and all hidden layers (but not the output layer)
    The weight for the bias is in the first column (index 0) of the weight matrix -
    as the bias node's output is in the first row of the output vector of the layer
    '''
    ilayer = range(0, self._n_total_layers-1) # this excludes the last layer
    L1 = 0.0
    for idx in ilayer:
        L1 += np.sum(np.abs( self._li_w[idx][:, 1:]))
    # Note: the matching gradient term in _bw_prop_Np1_to_N is lambda1_reg * sign(w),
    #       i.e. without the factor 0.5 used here
    L1 *= 0.5 * self._lambda1_reg
    if b_print:
        print("\nL1: total L1 = " + str(L1))
    return L1
```
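The relation between the L2 term of the loss and the corresponding term added to the gradient during back propagation can be checked numerically. The sketch below does this for a single small weight matrix; the function names, the example weights and the value of `lambda2` are assumptions for this illustration and do not come from our class.

```python
import numpy as np

def reg_L2(li_w, lambda2):
    """L2 term: 0.5 * lambda2 * sum of squared weights, bias column 0 excluded."""
    return 0.5 * lambda2 * sum(np.sum(np.square(w[:, 1:])) for w in li_w)

def reg_L2_grad(w, lambda2):
    """Analytic derivative of the L2 term with respect to the weights of one layer."""
    g = np.zeros_like(w)
    g[:, 1:] = lambda2 * w[:, 1:]
    return g

def numeric_derivative(w, i, j, lambda2, h=1.0e-6):
    """Central finite difference of the L2 term with respect to w[i, j]."""
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i, j] += h
    w_minus[i, j] -= h
    return (reg_L2([w_plus], lambda2) - reg_L2([w_minus], lambda2)) / (2.0 * h)

lambda2 = 0.1
w = np.array([[0.5, -1.0, 2.0],
              [0.1,  0.3, -0.7]])   # column 0 holds the bias weights

L2 = reg_L2([w], lambda2)                          # 0.5 * 0.1 * 5.58 = 0.279
grad = reg_L2_grad(w, lambda2)
num_weight = numeric_derivative(w, 0, 2, lambda2)  # ordinary weight
num_bias = numeric_derivative(w, 0, 0, lambda2)    # bias weight

print(L2, grad[0, 2], num_weight, num_bias)
```

The numerical derivative for an ordinary weight agrees with `lambda2 * w`, while the derivative for a bias weight is zero. This is why the gradient in the back propagation code is only corrected in the columns with index 1 and higher.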

The BW-propagation code presented here will also be the target of later optimization steps. We shall see that, despite working correctly, it can be criticized for its efficiency at several points. See again the article series starting with
MLP, Numpy, TF2 – performance issues – Step I – float32, reduction of back propagation.

# Conclusion

We have extended our set of methods quite a bit. At their core are matrix operations, which are supported by the OpenBLAS library on a Linux system with multiple CPU cores. In the next article

A simple program for an ANN to cover the Mnist dataset – IX – First Tests

we shall test the convergence of our training for the MNIST data set. We shall see that an MLP with two hidden layers with 70 and 30 nodes can give us a convergence of the averaged relative error down to 0.006 after 1000 epochs on the test data. However, we have to analyze such results for overfitting. Stay tuned …