A simple CNN for the MNIST dataset – VI – classification by abstract features and the role of the CNN’s MLP part

In my last article of this series on a CNN

A simple CNN for the MNIST dataset – V – activations and abstraction of features
A simple CNN for the MNIST dataset – IV – Visualizing the output of convolutional layers and maps
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST datasets – I – CNN basics

I discussed the following points: The series of convolutional transformations which a CNN applies to its input leads to a manifold of abstract representations in an eventually very low-dimensional parameter space at the last convolutional layer. We saw that the transformations do NOT produce results there which are interpretable in the sense of figurative elements of depicted numbers, such as straight lines, circles or bows. Instead, due to the pooling layers, lines and curved line elements dissolve quickly during propagation through the various Conv layers. Whilst the first Conv layer still gives fair representations of e.g. a "4", line-like structures already become unclear at the second Conv layer. We saw that the common elements of the maps of multiple images of a handwritten "4" are more point-like in the low-dimensional parameter space for maps on the output side of the last convolutional layer.

We got, however, the impression that the eventual abstractions contained enough unique patterns to allow for a classification of a "4". But is it really this simple? Can we already discriminate between different digits just by looking for visible patterns in the activation output of the last convolutional layer?

In this article I want to show that this is NOT the case. To demonstrate this we shall look at images of a "4" which could almost be classified as a "9". We shall see

  • that the detection of clear unique patterns becomes really difficult when we look at the representations of "4"s which almost resemble a "9" - at least from a human point of view;
  • that directly visible patterns at the last convolutional layer may not contain sufficiently clear information for a classification;
  • that the MLP part of our CNN nevertheless detects patterns after a linear transformation - i.e. after a linear combination of the outputs of the last Conv layer - which are not directly evident for human eyes. These "hidden" patterns do, however, allow for a rather solid classification.

What have "4"s in common after three convolutional transformations?

As in the last article I took three clear "4" images

and compared the activation output after three convolutional transformations - i.e. at the output side of the last Conv layer which we named "Conv2D_3":

The red circles indicate common points in the resulting 128 maps which we do not find in representations of three clear "9"s (see below). The yellow circles indicate common patterns which, however, appear in some representations of a "9".

What have "9"s in common after three convolutional transformations?

Now let us look at the same for three clear "9"s:

 

A comparison gives the following common features of "9"s on the third Conv2D layer:

We again get the impression that enough unique features seem to exist on the maps for "4"s and "9"s, respectively, to distinguish between images of these numbers. But is it really so simple?

Intermezzo: Some useful steps to reproduce results

You certainly do not want to repeat a full training run every time you want to analyze predictions at certain layers for some selected MNIST images. And you may also need the same "X_train", "X_test" sets to identify one and the same image by a defined number. Remember: In the Python code which I presented in a previous article for the setup of the data samples, an image does not get a reproducible index number because the data are shuffled initially.

Thus, you may need to perform a training run and then save the model as well as your X_train, y_train and X_test, y_test datasets. Note that we have already transformed the data into a tensor form which Keras expects. We had also already used one-hot labels. The transformed sets were called "train_imgs", "test_imgs", "train_labels", "test_labels", "y_train", "y_test".

The following code saves the model (here "cnn") at the end of a training and loads it again:

# save a full model 
cnn.save('cnn.h5')

#load a full model  
cnnx = models.load_model('cnn.h5')        

A relative file name like 'cnn.h5' is resolved against the current working directory; for a Jupyter notebook this is typically the directory where you keep the notebook.

The following statements save the sets of tensor-like image data in NumPy's binary file format (.npy):

# Save the data

from numpy import save
save('train_imgs.npy', train_imgs) 
save('test_imgs.npy', test_imgs) 
save('train_labels.npy', train_labels) 
save('test_labels.npy', test_labels) 
save('y_train.npy', y_train) 
save('y_test.npy', y_test) 

We reload the data by

# Load train, test image data (in tensor form) 

from numpy import load
train_imgs   = load('train_imgs.npy')
test_imgs    = load('test_imgs.npy')
train_labels = load('train_labels.npy')
test_labels  = load('test_labels.npy')
y_train      = load('y_train.npy')
y_test       = load('y_test.npy')

Be careful to save only once - and not to set up and save your training and test data again in a pure analysis session! I recommend using different notebooks for training and later analysis. If you put all your code in just one notebook you may accidentally re-run Jupyter cells which you do not want to run during analysis sessions.
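One way to guard against accidental overwriting is to refuse to save when the file already exists. The following is only a minimal sketch: the helper name "save_once" is hypothetical (it is not part of the code of the previous articles), and the demo uses dummy arrays in a temporary directory instead of the real MNIST tensors.

```python
import os
import tempfile
import numpy as np

def save_once(path, array):
    """Save array to path unless the file already exists. Returns True if written."""
    if os.path.exists(path):
        print("skipping " + path + " (file already exists)")
        return False
    np.save(path, array)
    return True

# Demo with dummy data in a temporary directory (not the real MNIST tensors)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "test_imgs.npy")
first = np.zeros((2, 28, 28, 1), dtype=np.float32)
second = np.ones((2, 28, 28, 1), dtype=np.float32)

wrote_first = save_once(demo_path, first)     # writes the file
wrote_second = save_once(demo_path, second)   # refuses to overwrite it
reloaded = np.load(demo_path)
print(wrote_first, wrote_second, float(reloaded.sum()))
```

With such a guard a re-executed "save" cell leaves the data of the original training session untouched.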

What happens for unclear representations/images of a "4"?

When we trained a pure MLP on the MNIST dataset we had a look at the confusion matrix:
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix.
We saw that the MLP e.g. confused "5"s with "9"s, "9"s with "4"s, "2"s with "7"s, "8"s with "5"s - and vice versa. We got the highest confusion numbers for the misjudgement of badly written "4"s and "9"s.

Let us look at a regular "4" and two "4"s which, with some good will, could also be interpreted as representations of a "9". The first one has a closed upper area - and there are indeed some representations of "9"s in the MNIST dataset which look similar. The second "4" is, in my view, even closer to a "9":

 

Now, if we wanted to look for the previously discussed "unique" features of "4"s and "9"s we would get a bit lost:

The first image is for a clear "4". The last two are the abstractions for our two newly chosen unclear "4"s in the order given above.

You see: Many of our seemingly "unique features" for a "4" on the third Conv-level are no longer or not fully present for our second "4"; so we would be rather unsure if we had to judge the abstraction as a viable pattern for a "4". We would expect that this "human" uncertainty also shows up in a probability distribution at the output layer of our CNN.

But our CNN (including its MLP-part) has little doubt about the classification of the last sample as a "4". We just need to look at the prediction output of our model:

# Predict for a single image 
# ****************************
num_img = 1302
ay_sgl_img = test_imgs[num_img:num_img+1]
print(ay_sgl_img.shape)
# load last cell for the next statement to work 
#prob = cnn_pred.predict_proba(ay_sgl_img, batch_size=1)
#print(prob) 
prob1 = cnn_pred.predict(ay_sgl_img, batch_size=1)
print(prob1) 

[[3.61540742e-07 1.04205284e-07 1.69877489e-06 1.15337198e-08
  9.35641170e-01 3.53500056e-08 1.29525617e-07 2.28584581e-03
  2.59062881e-06 6.20680153e-02]]

About 93.6% probability for a "4", compared to about 6.2% for a "9"! A very clear discrimination! How can that be, given the - at first sight - seemingly unclear pattern situation at the third activation layer for our strange "4"?
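To read such an output vector more comfortably we can rank the digits by their probability. The following sketch simply copies the ten values printed above into a plain Python list; nothing else is assumed.

```python
# Values copied from the printed prediction above (index = digit)
prob = [3.61540742e-07, 1.04205284e-07, 1.69877489e-06, 1.15337198e-08,
        9.35641170e-01, 3.53500056e-08, 1.29525617e-07, 2.28584581e-03,
        2.59062881e-06, 6.20680153e-02]

# Digits sorted by descending probability
ranked = sorted(range(len(prob)), key=lambda d: prob[d], reverse=True)
for digit in ranked[:3]:
    print("digit %d: %.2f%%" % (digit, 100.0 * prob[digit]))

# The ten values should add up to (almost) one
total = sum(prob)
print("sum of all probabilities: %.6f" % total)
```

The "9" ends up as the only serious competitor of the "4"; all other digits are negligible.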

The MLP-part of the CNN "sees" things we humans do not see directly

We shall not forget that the MLP-part of the CNN plays an important role in our game. It reduces the information of the last 128 maps (3x3x128 = 1152 values) down to 100 node values with the help of 115200 distinct weights for the related connections. This means there is a lot of fine-tuned information extraction and information compaction going on at the border of the CNN's MLP part - a transformation step which is too complex to grasp directly.

It is the transformation of all the 128x3x3 map data into all 100 nodes via a linear combination which makes things difficult to understand. 115200 optimized weights leave enough degrees of freedom to detect combined patterns in the activation data which are more complex and less obvious than the point-like structures we encircled in the images of the maps.
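What this step does numerically can be sketched with NumPy. Note the assumptions: the maps and the weights below are random stand-ins (not the trained values of our model), and ReLU is only assumed as the activation function of the first dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: random maps and random weights, NOT the trained model
maps = rng.random((3, 3, 128))            # output of the last Conv layer for one image
W = rng.normal(0.0, 0.05, (1152, 100))    # 1152 x 100 = 115200 weights
b = np.zeros(100)                         # one bias per node

x = maps.reshape(-1)                      # flatten: 3*3*128 = 1152 values
z = x @ W + b                             # each node combines values of ALL 128 maps
a = np.maximum(z, 0.0)                    # ReLU (assumed activation function)

print(x.shape, W.size, a.shape)
```

Every one of the 100 node values depends on all 1152 map values at once, which is why single encircled points on individual maps tell us so little.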

So, it is interesting to visualize how the MLP part of our CNN reacts to the activations of the last convolutional layer. Maybe we find some more intriguing patterns there, which discriminate "4"s from "9"s and explain the rather clear probability evaluation.

Visualization of the output of the dense layers of the CNN's MLP-part

We need to modify some parts of our code for creating images of the activation outputs of convolutional layers to be able to produce equally reasonable images for the output of the dense MLP layers, too. These modifications are simple. We distinguish between the types of layers by their names: When the name contains "dense" we execute a slightly different code. The changes affect just the function "img_grid_of_layer_activation()" previously discussed as the contents of a Jupyter "cell 9":

  
# Function to plot the activations of a layer 
# -------------------------------------------
# Adaption of a method originally designed by F.Chollet 

def img_grid_of_layer_activation(d_img_sets, model_fname='cnn.h5', layer_name='', img_set="test_imgs", num_img=8, 
                                 scale_img_vals=False):
    '''
    Input parameter: 
    -----------------
    d_img_sets: dictionary with available img_sets, which contain img tensors (presently: train_imgs, test_imgs)  
    model_fname: Name of the file containing the models data 
    layer_name: name of the layer for which we plot the activation; the name must be known to the Keras model (string) 
    img_set: The set of images we pick a specific image from (string)
    num_img: The sample number of the image in the chosen set (integer) 
    scale_img_vals: False: Do NOT scale (standardize) and clip (!) the pixel values. True: Standardize the values. (Boolean)
        
    Hints: 
    -----------------
    We assume quadratic images - in case of dense layers we assume a size of 1 
    '''
    
    # Load a model 
    cnnx = models.load_model(model_fname)
    
    # get the output of a certain named layer - this includes all maps
    # https://keras.io/getting_started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer-feature-extraction
    cnnx_layer_output = cnnx.get_layer(layer_name).output

    # build a new model for input "cnnx.input" and output "output_of_layer"
    # ~~~~~~~~~~~~~~~~~
    # Keras knows the required connections and intermediate layers from its tensorflow graphs - otherwise we get an error 
    # The new model can make predictions for a suitable input in the required tensor form   
    mod_lay = models.Model(inputs=cnnx.input, outputs=cnnx_layer_output)
    
    # Pick the input image from a set of respective tensors 
    if img_set not in d_img_sets:
        # raise instead of sys.exit(): no extra import needed, and the notebook kernel keeps running
        raise ValueError("img set " + img_set + " is not known!")
    # slicing to get the right tensor 
    ay_img = d_img_sets[img_set][num_img:(num_img+1)]
    
    # Use the tensor data as input for a prediction of model "mod_lay" 
    lay_activation = mod_lay.predict(ay_img) 
    print("shape of layer " + layer_name + " : ", lay_activation.shape )
    
    # number of maps of the selected layer 
    n_maps   = lay_activation.shape[-1]
    print("n_maps = ", n_maps)

    # size of an image - we assume quadratic images 
    # in the case  of "dense" layers we assume that the img size is just 1 (1 node)    
    if "dense" in layer_name:
        img_size = 1
    else: 
        img_size = lay_activation.shape[1]
    print("img_size = ", img_size)

    # Only for testing: plot an image for a selected  
    # map_nr = 1 
    #plt.matshow(lay_activation[0,:,:,map_nr], cmap='viridis')

    # We work with a grid of images for all maps  
    # ~~~~~~~~~~~~~~~----------------------------
    # the grid is built top-down (!) with num_cols and num_rows
    # dimensions for the grid 
    num_imgs_per_row = 8 
    num_cols = num_imgs_per_row
    # integer division: maps/nodes beyond num_rows * num_imgs_per_row are not drawn 
    num_rows = n_maps // num_imgs_per_row
    #print("img_size = ", img_size, " num_cols = ", num_cols, " num_rows = ", num_rows)

    # grid 
    dim_hor = num_imgs_per_row * img_size
    dim_ver = num_rows * img_size
    img_grid = np.zeros( (dim_ver, dim_hor) )   # horizontal, vertical matrix  
    print("shape of img grid = ", img_grid.shape)

    # double loop to fill the grid 
    n = 0
    for row in range(num_rows):
        for col in range(num_cols):
            n += 1
            #print("n = ", n, "row = ", row, " col = ", col)
            # in case of a dense layer the shape of the tensor like output 
            # is different in comparison to Conv2D layers  
            if "dense" in layer_name:
                present_img = lay_activation[ :, row*num_imgs_per_row + col]
            else: 
                present_img = lay_activation[0, :, :, row*num_imgs_per_row + col]
            
            # standardization and clipping of the img data  
            if scale_img_vals:
                present_img -= present_img.mean()
                if present_img.std() != 0.0: # standard deviation
                    present_img /= present_img.std()
                    #present_img /= (present_img.std() +1.e-8)
                    present_img *= 64
                    present_img += 128
                present_img = np.clip(present_img, 0, 255).astype('uint8') # limit values to 255

            # place the img-data at the right space and position in the grid 
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # the following is only used if we had reversed vertical direction by accident  
            #img_grid[row*img_size:(row+1)*(img_size), col*img_size:(col+1)*(img_size)] = np.flip(present_img, 0)
            img_grid[row*img_size:(row+1)*(img_size), col*img_size:(col+1)*(img_size)] = present_img
 
    return img_grid, img_size, dim_hor, dim_ver 

 

You will certainly spot the two small changes in comparison to the code for Jupyter cell 9 of the article
A simple CNN for the MNIST dataset – IV – Visualizing the output of convolutional layers and maps.
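One detail of the function above is worth knowing: the number of grid rows is computed by integer division. For 100 nodes and 8 images per row we get 12 rows, i.e. only 96 nodes are drawn and the last 4 are dropped. If you want to see all nodes, you could pad the last row. The following is only a sketch of that idea; the helper "dense_grid" is hypothetical and not part of the original cell 9.

```python
import math
import numpy as np

def dense_grid(values, per_row=8, fill=np.nan):
    """Arrange a 1D vector of node activations in a 2D grid, padding the last row."""
    values = np.asarray(values, dtype=float).reshape(-1)
    n_rows = math.ceil(values.size / per_row)
    grid = np.full(n_rows * per_row, fill, dtype=float)
    grid[:values.size] = values
    return grid.reshape(n_rows, per_row)

activations = np.arange(100)       # dummy activations of 100 nodes
rows_by_floor_division = 100 // 8  # rows drawn by the original integer division
grid = dense_grid(activations)
print(rows_by_floor_division, grid.shape)
```

The padded cells contain NaN, which matplotlib's imshow() normally leaves blank, so they should not be mistaken for nodes with zero activation.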

However, there remains one open question: We were too lazy in the coding discussed in previous articles to create our own names for the dense layers. This is, however, no major problem: Keras creates its own names if we do not define our own layer names when constructing a CNN model. Where do we get these default names from? Well, from the model's summary:

cnn_pred.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
Conv2D_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
Max_Pool_1 (MaxPooling2D)    (None, 13, 13, 32)        0         
_________________________________________________________________
Conv2D_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
Max_Pool_2 (MaxPooling2D)    (None, 5, 5, 64)          0         
_________________________________________________________________
Conv2D_3 (Conv2D)            (None, 3, 3, 128)         73856     
_________________________________________________________________
flatten_7 (Flatten)          (None, 1152)              0         
_________________________________________________________________
dense_14 (Dense)             (None, 100)               115300    
_________________________________________________________________
dense_15 (Dense)             (None, 10)                1010      
=================================================================
Total params: 208,982
Trainable params: 208,982
Non-trainable params: 0
_________________________________________________________________

Our first MLP layer with 100 nodes obviously got the name "dense_14".
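The parameter numbers in the summary can be checked by hand. The sketch below assumes 3x3 kernels for all Conv2D layers; this is consistent with the shrinking of the output shapes by 2 pixels per convolution (e.g. 28 to 26), but it is not stated in the summary itself.

```python
def conv2d_params(kernel, in_maps, out_maps):
    # one kernel x kernel filter per input map, plus one bias, for each output map
    return (kernel * kernel * in_maps + 1) * out_maps

def dense_params(n_in, n_out):
    # full weight matrix plus one bias per node
    return n_in * n_out + n_out

counts = {
    "Conv2D_1": conv2d_params(3, 1, 32),
    "Conv2D_2": conv2d_params(3, 32, 64),
    "Conv2D_3": conv2d_params(3, 64, 128),
    "dense_14": dense_params(3 * 3 * 128, 100),
    "dense_15": dense_params(100, 10),
}
for name, n in counts.items():
    print(name, n)

total = sum(counts.values())
print("total", total)
```

The 115300 parameters of "dense_14" are thus the 115200 weights mentioned above plus 100 bias values.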

With our modification and the given name we can now call Jupyter "cell 10" as before:

  
# Plot the img grid of a layers activation 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# global dict for the image sets 
d_img_sets= {'train_imgs':train_imgs, 'test_imgs':test_imgs}

# layer - pick one of the names which you defined for your model 
layer_name = "dense_14"

# choose a image_set and an img number 
img_set = "test_imgs"

# clear 4 
num_img = 1816

#unclear 4
#num_img = 1270
#num_img = 1302

#clear 9 
#num_img = 1249
#num_img = 1410
#num_img = 1858


# Two figures 
# -----------
# figure for the input img (the figure for the activation outputs is created further below)
fig1 = plt.figure( figsize=(5,5) )
ay_img = test_imgs[num_img:num_img+1]
#plt.imshow(ay_img[0,:,:,0], cmap=plt.cm.binary)
plt.imshow(ay_img[0,:,:,0], cmap=plt.cm.jet)


# getting the img grid 
img_grid, img_size, dim_hor, dim_ver = img_grid_of_layer_activation(
                                        d_img_sets, model_fname='cnn.h5', layer_name=layer_name, 
                                        img_set=img_set, num_img=num_img, 
                                        scale_img_vals=False)
# Define reasonable figure dimensions by scaling the grid-size  
scale = 1.6 / (img_size)
fig2 = plt.figure( figsize=(scale * dim_hor, scale * dim_ver) )
#axes 
ax = fig2.gca()
ax.set_xlim(-0.5,dim_hor-1.0)
ax.set_ylim(dim_ver-1.0, -0.5)  # the grid is oriented top-down 
#ax.set_ylim(-0,dim_ver-1.0) # normally wrong

# setting labels - tick positions and grid lines  
ax.set_xticks(np.arange(img_size-0.5, dim_hor, img_size))
ax.set_yticks(np.arange(img_size-0.5, dim_ver, img_size))
ax.set_xticklabels([]) # no labels should be printed 
ax.set_yticklabels([])

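# preparing the grid 
# (the switch is passed positionally because newer Matplotlib versions renamed keyword "b" to "visible")
plt.grid(True, linestyle='-', linewidth=0.5, color='#ddd', alpha=0.7)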

# color-map 
#cmap = 'viridis'
#cmap = 'inferno'
cmap = 'jet'
#cmap = 'magma'

plt.imshow(img_grid, aspect='auto', cmap=cmap)

 

In the output picture each node will be represented by a colored rectangle.

Visualization of the output for clear "4"s at the first dense MLP-layer

The following picture displays the activation values for three clear "4"s at the first dense MLP layer:

I encircled again some of the nodes which carry some seemingly "unique" information for representations of the digit "4".

For clear "9"s we instead get:

Hey, there are some clear differences: Especially, the diagonal pattern (vertically a bit below the middle and horizontally a bit to the left) and the activation at the first node (upper left) seem to be typical for representations of a "9".

Our unclear "4" representations at the first MLP layer

Now, what do we get for our two unclear "4"s?

I think that we would guess with confidence that our first image clearly corresponds to a "4". With the second one we would be a bit more careful - but the lack of the mentioned diagonal structure with sufficiently high values (orange to yellow on the "jet"-colormap) would guide us to a "4". Plus the presence of a relatively high value at a node at the lower right which appears nowhere in the "9" representations. Plus values which are too small at the upper left corner. Plus some other aspects - some nodes have a value where none of the clear "9"s shows anything.

We should not forget that there are 1000 weights again to emphasize some combinations and suppress others on the way to the output layer of the CNN's MLP part.
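This last step from the 100 nodes to the 10 output probabilities can be sketched as follows. The hidden activations and the weights are again random stand-ins, and softmax is assumed as the activation function of the output layer (which fits the fact that the printed probabilities above add up to one).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the maximum for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
hidden = rng.random(100)                   # hypothetical activations of the 100 nodes
W_out = rng.normal(0.0, 0.3, (100, 10))    # 100 x 10 = 1000 weights
b_out = np.zeros(10)                       # one bias per output node

probs = softmax(hidden @ W_out + b_out)
print(probs.shape, round(float(probs.sum()), 6), int(np.argmax(probs)))
```

Each output probability is a weighted vote of all 100 nodes, so a few strongly weighted nodes can dominate the decision.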

Conclusion

Information which is still confusing at the last convolutional layer - at least from a human visual perspective - can be "clarified" by a combination of the information across all (128) maps. This is done by the MLP transformations (linear matrix plus non-linear activation function) which produce the output of the 1st dense layer.

Thus, and of course, the dense layers of the MLP-part of a CNN play an important role in the classification process: The MLP may detect patterns in the combined information of all available maps at the last convolutional layer which the human eye may have difficulties with.

In the sense of a critical review of the results of our last article we can probably say: It was NOT the individual points, which we marked in the images of the maps at the last convolutional layer, that did the classification trick; it was the MLP analysis of the interplay of the information across all maps which in the end led the CNN to an obviously correct classification.

Common features in calculated maps for MNIST images are nice, but without an analysis by an MLP across all maps they are not sufficient to solve the classification problem. So: Do not underestimate the MLP part of a CNN!

In the next article we shall try to visualize the patterns or structures to which a certain CNN map specifically reacts after training on the MNIST dataset.

A simple CNN for the MNIST dataset – V – activations and abstraction of features

In my last article of my introductory series on "Convolutional Neural Networks" [CNNs]

A simple CNN for the MNIST dataset – IV – Visualizing the output of convolutional layers and maps
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST datasets – I – CNN basics

I described how we can visualize the output of different maps at convolutional (or pooling) layers of a CNN. We are now well equipped to look a bit closer at the way "feature detection" is handled by a CNN after training to support classification tasks.

Actually, I want to point out that the terms "abstraction" and "features" should be used in a rather restricted and careful way. In many textbooks these terms imply a kind of "intelligent" visual and figurative pattern detection and comparison. I think this is misleading. In my opinion the misconception becomes rather obvious in the case of the MNIST dataset - i.e. for the analysis and classification of images which display a figurative representation of numbers.

When you read some high-level books on AI, a lot of authors read more into the term "features" than is justified. Think for example about a "4": What in an image of a "4" makes it a representation of the number "4"? You would say that it is a certain arrangement of more or less vertical and horizontal lines. Obviously, you have a generative and constructive concept in mind - and when you try to interpret an image of a "4" you look for footprints of the creation rules for a figurative "4"-representation within the image. (Of course, with some degrees of freedom in mind - e.g. regarding the precise angles of the lines.)

Textbooks about CNNs often imply that the pattern detection of a CNN during training reflects something of this "human" thought process. You hear of the detection of "line crossings" and "bows" and their combinations in a figure during training. I think this is a dangerous over-interpretation of what a CNN actually does; in my opinion a CNN does NOT detect any kind of such "humanly interpretable" patterns or line elements of a figurative digit representation in MNIST images. A pure classification CNN works on a more abstract and at the same time more stupid level - far off any generative concept of the contents of an image.

I want to make this plausible by some basic considerations and then by looking more closely at the activation output. My intention is to warn against any mis- and over-interpretation of the "intelligence" in AI algorithms. Purely passively applied, non-generative CNN algorithms are very stupid in the end. This does not diminish the effectiveness of such classification algorithms 🙂 .

Theoretical considerations

My basic argument against an interpretation of "pattern detection" by CNNs in the sense of the detection of an "abstract concept of rules to construct a figurative pattern representing some object" is 4-fold:

  • The analytical process during training is totally passive. In contrast to Autoencoders or GANs (Generative Adversarial Networks) there is no creational, generative or (re-)constructive element involved.
  • Concepts of crossing (straight or curved) lines or line elements require by definition a certain level of resolution. But due to pooling layers the image analysis during CNN training drops more and more high resolution information - whilst gaining other relational information in rather coarse parameter (=representation) spaces.
  • Filters are established during training by an analysis over data points of many vertically arranged maps at the same time (see the first article of this series). A filter on a higher convolutional layer subsumes information of many different views on pixel distributions. The higher the layer number, the more diffuse the localized geometrical aspects of the original image become.
  • As soon as the training algorithm has found a stable solution it represents a fully deterministic set of transformation rules where an MLP analyzes combinations of individual values in a very limited input vector.

What a CNN objectively does in its convolutional part after training is the following:

  • It performs a sequence of transformations of the input defined in a high-dimensional parameter or representation space into parameter spaces of lower and lower dimensions.
  • It feeds the data of the maps of the last convolutional layer into an MLP for classification.

At the last Conv2D-layer the dimension may even be too small to apply any figurative descriptions of the results at all. The transformation parameters were established by mathematical rules for optimizing a cost function for classification errors at the output side of the MLP - no matter whether the detected features in the eventual coarse parameter space are congruent with a human generative or constructive rule for, or a human idea about, abstract digit representations.
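How coarse the spatial grids become can be followed with a few lines of Python. The sketch assumes 3x3 kernels without padding and non-overlapping 2x2 max pooling, with 32, 64 and 128 maps on the three Conv2D layers; these are assumptions about the layer setup of our CNN.

```python
def conv_valid(size, kernel=3):
    return size - kernel + 1        # convolution without padding, stride 1

def pool(size, window=2):
    return size // window           # non-overlapping pooling, remainder dropped

size = 28                           # MNIST images have 28 x 28 pixels
trace = [("input", size, 1)]
for name, n_maps in (("Conv2D_1", 32), ("Conv2D_2", 64), ("Conv2D_3", 128)):
    size = conv_valid(size)
    trace.append((name, size, n_maps))
    if name != "Conv2D_3":          # no pooling after the last Conv layer
        size = pool(size)
        trace.append(("Max_Pool after " + name, size, n_maps))

for name, s, m in trace:
    print("%-24s %2d x %2d x %3d = %5d values" % (name, s, s, m, s * s * m))
```

The grid shrinks from 28x28 to 3x3 while the number of maps grows: the spatial resolution is traded against a growing number of different "views" on the image.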

In my opinion a CNN learns something about relations of pixel clusters on coarser and coarser scales and growing areas of an image. It learns something about the distribution of active pixels by transforming them into coarser and more and more abstract parameter spaces. Abstraction is done in the sense of dropping detailed information and analyzing broader and broader areas of an image in a relatively large number of ways - the number depending on the number of maps of the last convolutional layer. But this is NOT an abstraction in the sense of a constructive concept.

Even if some filters on lower convolutional layers indicate the "detection" of line based patterns - these "patterns" are not really used on the eventual convolutional level, when due to previous convolution and pooling vastly extended areas of an image are overlaid and "analyzed" in the sense of minimizing the cost function.

The "feature abstraction" during the learning process is more a guided abstraction of relations of different areas of an image after some useful transformations. The whole process rather resembles something which we saw already when we applied an unsupervised "cluster analysis" to the pixel distributions in MNIST images and then fed the detected lower dimensional cluster information into an MLP (see
A simple Python program for an ANN to cover the MNIST dataset – XIV – cluster detection in feature space).
We saw already there that guided transformations to lower dimensional representation spaces can support classification.

In the end the so-called "abstraction" only leads to the use of highly individual elements of an input vector to an MLP in the following sense: "If lamp 10 and lamp 138 and lamp 765 ... of all lamps 1 to 1152 blink, then we have a high probability of having a "4"." This is what the MLP on top of the convolutional layers "learns" in the end. Convolution raises the probability of finding unique indicators in different representations of "4"s, but the algorithm in the end is stupid and knows nothing about patterns in the sense of an abstract concept regarding the arrangement of lines. A CNN has no "idea" about such an abstract concept of how number representations must be constructed from line elements.

Levels of "abstractions"

Let us take an MNIST image which represents something which a European would consider to be a clear representation of a "4".

In the second image I used the "jet"-color map; i.e. dark blue indicates a low intensity value while colors from light blue to green to yellow and red indicate growing intensity values.

The first Conv2D-layer ("Conv2D_1") produces the following 32 maps of my chosen "4"-image after training:

We see that the filters which were established during training emphasize general contours but also focus on certain image regions. However, the "4" is still clearly visible on very many maps as the convolution does not yet reduce resolution too much.

The second Conv2D-layer already combines information of larger areas of the image - as a max (!) pooling layer was applied before. As we use 64 convolutional maps this allows for a multitude of different new filters to mark "features".

As the max-condition of the pooling layer was applied first and because larger areas are now analyzed we are not too astonished to see that the filters dissolve the original "4"-shape and indicate more general geometrical patterns.
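The effect of max pooling on thin line elements can be illustrated with toy images. The sketch below is a plain NumPy implementation of non-overlapping 2x2 max pooling (assumed here to correspond to our pooling layers); the 8x8 images are artificial and not taken from MNIST.

```python
import numpy as np

def max_pool_2x2(img):
    """Non-overlapping 2x2 max pooling; an odd trailing row/column is dropped."""
    h = img.shape[0] // 2 * 2
    w = img.shape[1] // 2 * 2
    blocks = img[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

thin = np.zeros((8, 8))
thin[:, 2] = 1.0                  # vertical line, 1 pixel wide
thick = np.zeros((8, 8))
thick[:, 2:4] = 1.0               # vertical line, 2 pixels wide
diagonal = np.eye(8)              # diagonal line, 1 pixel wide

pooled_thin = max_pool_2x2(thin)
pooled_thick = max_pool_2x2(thick)
pooled_diag = max_pool_2x2(diagonal)
print(pooled_thin)
print(np.array_equal(pooled_thin, pooled_thick))
```

After pooling, the thin and the thick line give exactly the same 4x4 result: information about line width and precise position within each 2x2 block is lost, and only the coarse location of "something active" survives.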

I find it interesting that our "4" triggers more horizontally sensitive filters than vertical ones. (We shall see later that a new training process may lead to filters which have a different sensitivity.) But this also has a bit to do with standardization and the level of pixel intensity in the case of my special image. See below.

The third convolutional layer applies filters which now cover almost the full original image and combine and mix at the same time information from the already rather abstract results of layer 2 - and of all the 64 maps there in parallel.
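How large the image area is which a single point of a map "sees" can be estimated by a receptive-field calculation. The sketch assumes 3x3 kernels with stride 1 for the Conv2D layers and 2x2 pooling with stride 2, in the layer order of our CNN.

```python
# (name, kernel size, stride) in the order the layers are applied
layers = [("Conv2D_1", 3, 1), ("Max_Pool_1", 2, 2),
          ("Conv2D_2", 3, 1), ("Max_Pool_2", 2, 2),
          ("Conv2D_3", 3, 1)]

rf = 1       # receptive field: side length of the input area seen by one point
jump = 1     # distance (in input pixels) between neighboring points of a map
fields = {}
for name, kernel, stride in layers:
    rf += (kernel - 1) * jump
    jump *= stride
    fields[name] = rf
    print("%-10s one point sees %2d x %2d input pixels" % (name, rf, rf))
```

Under these assumptions a single point of a 3x3 map on the third Conv layer depends on an 18x18 area of the 28x28 image; the nine points of a map together span nearly the whole image.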

We again see a dominance of horizontal patterns. We see clearly that on this level any reference to something like an arrangement of parallel vertical lines crossed by a horizontal line is almost totally lost. Instead the CNN has transformed the original distribution of black (dark grey) pixels into an abstract configuration space only coarsely reflecting the original image area - by 3x3 grids; i.e. with a very poor resolution. What we see here are "relations" of filtered and transformed original pixel clusters over relatively large areas, but no concept of certain line arrangements.

Now, if this were the level of "features" which the MLP-part of the CNN uses to determine that we have a "4" then we would bet that such abstracted "features" or patterns (active points on 3x3 grids) appear in a similar way on the maps of the 3rd Conv layer for other MNIST images of a "4", too.

Well, how similar do different map representations of "4"s look on the 3rd Conv2D-layer?

What makes a four a four in the eyes of the CNN?

The last question corresponds to the question of what activation outputs of "4"s really have in common. Let us take 3 different images of a "4":

The same with the "jet"-color-map:

 

Already with our eyes we see that there are similarities but also quite a lot of differences.

"4"-representation on the 2nd Conv-layer

Below we see a comparison of the 64 maps on the 2nd Conv-layer for our three "4"-images.

Now, you may say: Well, I still recognize some common line patterns - e.g. parallel lines at an angle of roughly 75 degrees on the 11x11 grids. Yes, but these lines are almost dissolved by the next pooling step:

Now, consider in addition that the next (3rd) convolution combines 3x3-data of all of the displayed 5x5-maps. Then, probably, we can hardly speak of a concept of abstract line configurations any more ...
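To make the dissolution of lines by pooling a bit more tangible, here is a minimal sketch. It is not taken from the CNN of this series: the 11x11 map is synthetic, and I assume 2x2 max pooling with stride 2, which is consistent with the reduction from 11x11 to 5x5 grids mentioned above. The helper name max_pool_2x2 is hypothetical.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    fmap = np.asarray(fmap, dtype=float)
    rows, cols = fmap.shape[0] // 2, fmap.shape[1] // 2
    trimmed = fmap[:rows * 2, :cols * 2]
    # Group the pixels into 2x2 blocks and keep the maximum of each block
    return trimmed.reshape(rows, 2, cols, 2).max(axis=(1, 3))

# Synthetic 11x11 map with a steep, one-pixel-wide line (NOT real CNN output)
fmap = np.zeros((11, 11))
for r in range(11):
    fmap[r, 7 - r // 4] = 1.0

pooled = max_pool_2x2(fmap)
print(pooled.shape)
print(pooled.astype(int))
```

In this toy case only five active points survive on the 5x5 grid, four of them stacked in one column. The inclination of the original line is hardly recognizable any more, and the following 3x3 convolution then mixes such points across all maps.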

"4"-representations on the third Conv-layer

Below you find the abstract activation outputs on the 3rd Conv2D-layer for our three different "4"-images:

When we look at details we see that prominent "features" in one map of a specific 4-image do NOT appear in a fully comparable way in the eventual convolutional maps for another image of a "4". Some of the maps (i.e. filters after 4 transformations) produce really different results for our three images.

But there are common elements: I have marked only some of the points which show a significant intensity in all of the maps. But does this mean these individual common points are decisive for a classification of a "4"? We cannot be sure about it - probably it is their combination which is relevant.
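I marked the common points by eye. A rough, automated version of such a search could look like the following sketch. It assumes that the activations of the last Conv layer for several images are available as one NumPy array of shape (images, rows, columns, maps); the helper name common_active_points and the relative threshold of 0.5 are my own hypothetical choices, and the demo data are random numbers, not real activations.

```python
import numpy as np

def common_active_points(acts, rel_threshold=0.5):
    """Return (row, col, map) indices that are strongly active for every image.

    acts: array of shape (n_images, rows, cols, n_maps) with non-negative
          activations, e.g. (3, 3, 3, 128) for three images.
    rel_threshold: hypothetical cut-off; a point counts as "significant" if it
          reaches this fraction of the strongest activation of its image.
    """
    acts = np.asarray(acts, dtype=float)
    # Strongest activation per image, kept broadcastable against acts
    peak = acts.max(axis=(1, 2, 3), keepdims=True)
    # Avoid a division by zero for an image without any activation
    peak = np.where(peak > 0, peak, 1.0)
    significant = (acts / peak) >= rel_threshold
    # A point is "common" only if it is significant in all images
    common = significant.all(axis=0)
    return np.argwhere(common)

# Synthetic demo data (NOT real activations of the CNN)
rng = np.random.default_rng(0)
demo = rng.uniform(0.0, 0.2, size=(3, 3, 3, 128))
demo[:, 0, 1, 7] = 1.0    # a point shared by all three images
demo[0, 2, 2, 50] = 1.0   # a point that is active for the first image only
points = common_active_points(demo)
print(points)
```

Such a list only tells us where the images agree. It does not tell us whether these points, or rather their combination, are decisive for the classification.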

In the eighth row and third column we see an abstract three-point combination - in a shape like ┌ - but whether this indicates a part of the lines constituting a "4" on this coarse representation level of the original image is more than questionable.

So, what we ended up with is that we find some common points or some common point-relations in a few of the 128 "3x3"-maps of our three images of handwritten "4"s.

Conclusion

The maps of a CNN are created by an effective and guided optimization process. The results reflect indeed a process of the detection of very abstract features. But such "features" should not be associated with figurative elements and not in the sense of a concept of how to draw lines to construct a representation of a number. At least not in pure CNNs ... Things may be a bit different in convolutional "autoencoders" (combinations of convolutional encoders and decoders), but this is another story we will come back to in this blog.

The process of convolutional filtering (= transformations) and pooling (= down-sampling via the max-condition) "maps" the original pixel distribution onto abstract relation patterns on very coarse grids covering information stemming from extended regions of the original image. In the end the MLP decides on the appearance and relation of rather few common "point"-like elements in a multitude of maps - i.e. on point-like elements in low dimensional representation spaces.
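How coarse the grids become, and how extended the regions behind a single point are, can be estimated by simple arithmetic. The following sketch assumes 28x28 MNIST input, 3x3 convolutions without padding and 2x2 max pooling with stride 2; these assumptions reproduce the 11x11, 5x5 and 3x3 grids quoted in this article. The function name trace_layers and the layer labels are hypothetical.

```python
def trace_layers(input_size, layers):
    """Track map size and receptive field through a stack of layers.

    layers: list of (name, kernel, stride) tuples; no padding is assumed.
    Returns a list of (name, map_size, receptive_field) tuples.
    """
    size, rf, jump = input_size, 1, 1
    report = []
    for name, kernel, stride in layers:
        size = (size - kernel) // stride + 1   # output size without padding
        rf = rf + (kernel - 1) * jump          # growth of the receptive field
        jump = jump * stride                   # neighbour distance in input pixels
        report.append((name, size, rf))
    return report

# Assumed layer stack: 3x3 convolutions, 2x2 max pooling with stride 2
layers = [("conv1", 3, 1), ("pool1", 2, 2),
          ("conv2", 3, 1), ("pool2", 2, 2),
          ("conv3", 3, 1)]
report = trace_layers(28, layers)
for name, size, rf in report:
    print(f"{name}: {size}x{size} map, each point sees {rf}x{rf} input pixels")
```

Under these assumptions each point of a final 3x3 map depends on an 18x18 pixel window of the 28x28 input image, i.e. on a large part of the area in which a digit is drawn.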

This is very different from a conscious consideration process and weighing of alternatives which a human brain performs when it looks at sketches of numbers. Our brain tries to find signs consistent with a construction process defined for writing down a "4", i.e. signs of a certain arrangement of straight and curved lines. A human brain, thus, would refer to arrangements of line elements - but not to relations of individual points in an extremely coarse and abstract representation space after some mathematical transformations.

A CNN training algorithm tests many, many ways of how to filter information and to combine filtered information to detect unique patterns or point-like combinations on very coarse abstract feature grids, where original lines are completely dissolved. The last convolutional layer plays an important role in a CNN structure as it feeds the "classifying" MLP. As soon as the training algorithm has found a stable solution it represents a fully deterministic set of rules weighing common relations of point-like activations in a rather abstract input vector for an MLP.

In the next article

A simple CNN for the MNIST dataset – VI – classification by abstract features and the role of the CNN’s MLP part

we shall look at the whole procedure again, but then we compare common elements of a "4" with those of a "9" on the 3rd convolutional layer. Then the key question will be: What do "4"s have in common on the last convolutional maps which corresponding activations of "9"s do not show - and vice versa?

This will become especially interesting in cases for which a distinction was difficult for pure MLPs. You remember the confusion matrix for the MNIST dataset? See:
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix
We saw at that point in time that pure MLPs had some difficulty distinguishing badly written "4"s from "9"s.
We will see that the better distinction abilities of CNNs in the end depend on very few point-like elements of the eventual activation on the last layer before the MLP.