In my last article of my introductory series on "Convolutional Neural Networks" [CNNs]
A simple CNN for the MNIST dataset – IV – Visualizing the output of convolutional layers and maps
A simple CNN for the MNIST dataset – III – inclusion of a learning-rate scheduler, momentum and a L2-regularizer
A simple CNN for the MNIST datasets – II – building the CNN with Keras and a first test
A simple CNN for the MNIST datasets – I – CNN basics
I described how we can visualize the output of different maps at convolutional (or pooling) layers of a CNN. We are now well equipped to look a bit closer at the way "feature detection" is handled by a CNN after training to support classification tasks.
Actually, I want to point out that the terms "abstraction" and "features" should be used in a rather restricted and careful way. In many textbooks these terms imply a kind of "intelligent" visual and figurative pattern detection and comparison. I think this is misleading. In my opinion the misconception becomes rather obvious in the case of the MNIST dataset - i.e. for the analysis and classification of images which display a figurative representation of numbers.
When you read some high level books on AI you find that a lot of authors read more into the term "features" than is justified. Think, for example, about a "4": What in an image of a "4" makes it a representation of the number "4"? You would say that it is a certain arrangement of more or less vertical and horizontal lines. Obviously, you have a generative and constructive concept in mind - and when you try to interpret an image of a "4" you look out for footprints of the creation rules for a figurative "4"-representation within the image. (Of course, with some degrees of freedom in mind - e.g. regarding the precise angles of the lines.)
Textbooks about CNNs often imply that pattern detection of a CNN during training reflects something of this "human" thought process. You hear of the detection of "line crossings" and "bows" and their combinations in a figure during training. I think this is a dangerous over-interpretation of what a CNN actually does; in my opinion a CNN does NOT detect any kind of such "humanly interpretable" pattern or line elements of a figurative digit representation in MNIST images. A pure classification CNN works on a more abstract and at the same time more stupid level - far off any generative concept of the contents of an image.
I want to make this plausible by some basic considerations and then by looking more closely at the activation output. My intention is to warn against any mis- and over-interpretation of the "intelligence" in AI algorithms. Purely passively applied, non-generative CNN algorithms are very stupid in the end. This does not diminish the effectiveness of such classification algorithms 🙂 .
My basic argument against an interpretation of "pattern detection" by CNNs in the sense of the detection of an "abstract concept of rules to construct a figurative pattern representing some object" is 4-fold:
- The analytical process during training is totally passive. In contrast to Autoencoders or GANs (Generative Adversarial Networks) there is no creational, generative or (re-)constructive element included.
- Concepts of crossing (straight or curved) lines or line elements require by definition a certain level of resolution. But due to pooling layers the image analysis during CNN training drops more and more high resolution information - whilst gaining other relational information in rather coarse parameter (=representation) spaces.
- Filters are established during training by an analysis over data points of many vertically arranged maps at the same time (see the first article of this series). A filter on a higher convolutional layer subsumes information of many different views on pixel distributions. The higher the layer number, the more diffuse the localized geometrical aspects of the original image become.
- As soon as the training algorithm has found a stable solution it represents a fully deterministic set of transformation rules where an MLP analyzes combinations of individual values in a very limited input vector.
What a CNN objectively does in its convolutional part after training is the following:
- It performs a sequence of transformations of the input defined in a high-dimensional parameter or representation space into parameter spaces of lower and lower dimensions.
- It feeds the data of the maps of the last convolutional layer into an MLP for classification.
At the last Conv2D-layer the dimension may even be too small to apply any figurative descriptions of the results at all. The transformation parameters were established by mathematical rules for optimizing a cost function for classification errors at the output side of the MLP - no matter whether the features detected in the final coarse parameter space are congruent with a human generative or constructive rule for abstract digit representations, or with a human idea about them.
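The shrinking of the maps can be made concrete with a bit of arithmetic. The following is only a sketch under assumptions which are not stated explicitly here: 3x3 kernels without padding and 2x2 max pooling with stride 2. These settings do, however, reproduce the map sizes (11x11, 5x5, 3x3) and the 1152 input values of the MLP mentioned further below. The helper names are made up for illustration.

```python
# Sketch: side length of the maps after each layer of the convolutional part.
# Assumed: 3x3 kernels, stride 1, no padding; 2x2 max pooling with stride 2.

def conv_out(size, kernel=3):
    """Side length after a convolution with stride 1 and no padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Side length after non-overlapping pooling (a remainder is dropped)."""
    return size // pool

# (layer name, kind, number of maps); the map counts 32/64/128 are taken from the text
LAYERS = [
    ("Conv2D_1",  "conv", 32),
    ("MaxPool_1", "pool", 32),
    ("Conv2D_2",  "conv", 64),
    ("MaxPool_2", "pool", 64),
    ("Conv2D_3",  "conv", 128),
]

def trace_shapes(input_size=28):
    """Return a list of (layer name, side length, number of maps)."""
    size = input_size
    shapes = []
    for name, kind, n_maps in LAYERS:
        size = conv_out(size) if kind == "conv" else pool_out(size)
        shapes.append((name, size, n_maps))
    return shapes

shapes = trace_shapes()
for name, size, n_maps in shapes:
    print(f"{name:10s} {size:2d}x{size:<2d} grid, {n_maps:3d} maps")

# Number of values handed over to the MLP after flattening the last layer
last_name, last_size, last_maps = shapes[-1]
mlp_input_length = last_size * last_size * last_maps
print("Length of the MLP input vector:", mlp_input_length)
```

Under these assumptions the spatial grid shrinks from 28x28 to 3x3, whilst the number of maps grows from 32 to 128.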
In my opinion a CNN learns something about relations of pixel clusters on coarser and coarser scales and growing areas of an image. It learns something about the distribution of active pixels by transforming them into coarser and more and more abstract parameter spaces. Abstraction is done in the sense of dropping detailed information and analyzing broader and broader areas of an image in a relatively large number of ways - the number depending on the number of maps of the last convolutional layer. But this is NOT an abstraction in the sense of a constructive concept.
Even if some filters on lower convolutional layers indicate the "detection" of line based patterns - these "patterns" are not really used on the final convolutional level, where due to previous convolution and pooling vast extended areas of an image are overlaid and "analyzed" in the sense of minimizing the cost function.
The "feature abstraction" during the learning process is more a guided abstraction of relations of different areas of an image after some useful transformations. The whole process rather resembles something which we saw already when we applied an unsupervised "cluster analysis" to the pixel distributions in MNIST images and then fed the detected lower dimensional cluster information into an MLP (see
A simple Python program for an ANN to cover the MNIST dataset – XIV – cluster detection in feature space).
We saw already there that guided transformations to lower dimensional representation spaces can support classification.
In the end the so-called "abstraction" only leads to the use of highly individual elements of an input vector to an MLP in the following sense: "If lamp 10 and lamp 138 and lamp 765 ... of all lamps 1 to 1152 blink, then we have a high probability of having a "4"." This is what the MLP on top of the convolutional layers "learns" in the end. Convolution raises the probability of finding unique indicators in different representations of "4"s, but the algorithm in the end is stupid and knows nothing about patterns in the sense of an abstract concept regarding the arrangement of lines. A CNN has no "idea" about such an abstract concept of how number representations must be constructed from line elements.
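To make the "lamp" picture a bit more tangible: each of the 1152 lamps is just one point of one of the 128 maps with a 3x3 grid. The following sketch translates a lamp number back into a map and a grid position. It assumes the Keras default "channels_last" layout and a plain row-major flattening of the last layer; whether the network discussed here uses exactly this layout is an assumption, and the function name is hypothetical.

```python
# Sketch: which map and which grid point is hidden behind "lamp" number n?
# Assumed: maps of shape (3, 3, 128) in "channels_last" layout, flattened
# in row-major order; lamps are counted from 1 as in the text.
ROWS, COLS, N_MAPS = 3, 3, 128

def lamp_to_position(lamp):
    """Return (map index, row, column), all 0-based, for a 1-based lamp number."""
    if not 1 <= lamp <= ROWS * COLS * N_MAPS:
        raise ValueError("lamp number out of range")
    # In channels_last layout the map index varies fastest in the flat vector
    cell, map_idx = divmod(lamp - 1, N_MAPS)
    row, col = divmod(cell, COLS)
    return map_idx, row, col

for lamp in (10, 138, 765):
    map_idx, row, col = lamp_to_position(lamp)
    print(f"lamp {lamp:4d} -> map {map_idx:3d}, row {row}, column {col}")
```

The MLP itself never sees this structure; for it the input is just a flat vector of 1152 numbers.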
Levels of "abstractions"
Let us take an MNIST image which represents something which a European would consider to be a clear representation of a "4".
In the second image I used the "jet"-color map; i.e. dark blue indicates a low intensity value while colors from light blue to green to yellow and red indicate growing intensity values.
The first conv2D-layer ("Conv2d_1") produces the following 32 maps of my chosen "4"-image after training:
We see that the filters which were established during training emphasize general contours but also focus on certain image regions. However, the "4" is still clearly visible on very many maps as the convolution does not yet reduce resolution too much.
The second Conv2D-layer already combines information of larger areas of the image - as a max (!) pooling layer was applied before. As we use 64 convolutional maps this allows for a multitude of different new filters to mark "features".
As the max-condition of the pooling layer was applied first and because larger areas are now analyzed we are not too astonished to see that the filters dissolve the original "4"-shape and indicate more general geometrical patterns.
I find it interesting that our "4" triggers more horizontally sensitive filters than vertical ones. (We shall see later that a new training process may lead to filters which have another sensitivity). But this has also a bit to do with standardization and the level of pixel intensity in case of my special image. See below.
The third convolutional layer applies filters which now cover almost the full original image and combine and mix at the same time information from the already rather abstract results of layer 2 - and of all the 64 maps there in parallel.
We again see a dominance of horizontal patterns. We see clearly that on this level any reference to something like an arrangement of parallel vertical lines crossed by a horizontal line is almost totally lost. Instead the CNN has transformed the original distribution of black (dark grey) pixels into an abstract configuration space only coarsely reflecting the original image area - by 3x3 grids; i.e. with a very poor resolution. What we see here are "relations" of filtered and transformed original pixel clusters over relatively large areas. But no concept of certain line arrangements.
Now, if this were the level of "features" which the MLP-part of the CNN uses to determine that we have a "4" then we would bet that such abstracted "features" or patterns (active points on 3x3 grids) appear in a similar way on the maps of the 3rd Conv layer for other MNIST images of a "4", too.
Well, how similar do different map representations of "4"s look on the 3rd Conv2D-layer?
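How large is the image area on which a single point of a map on the third layer depends? A rough estimate can be sketched as follows. It assumes 3x3 kernels with stride 1 and no padding and 2x2 max pooling with stride 2; these settings are not spelled out here and are therefore assumptions. The function name is made up for illustration.

```python
# Sketch: receptive field of one point on the 3rd Conv2D layer.
# Assumed: 3x3 kernels with stride 1 and no padding,
# 2x2 max pooling with stride 2.

def receptive_field(stack):
    """Return (window, step) for a stack of (kernel, stride) layers.

    window: side length of the input window one output point depends on
    step:   distance in input pixels between neighbouring output points
    """
    window, step = 1, 1
    for kernel, stride in stack:
        window += (kernel - 1) * step
        step *= stride
    return window, step

# (kernel, stride) for Conv2D_1, MaxPool_1, Conv2D_2, MaxPool_2, Conv2D_3
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]

window, step = receptive_field(stack)
points_per_side = 3                                # 3x3 grid on the last layer
covered = window + (points_per_side - 1) * step    # pixels covered per side

print(f"one point depends on a {window}x{window} window of the 28x28 image")
print(f"all points of a 3x3 map together cover {covered}x{covered} pixels")
```

Under these assumptions the windows of neighbouring points overlap strongly, which fits the picture of "relations" over large areas rather than localized line elements.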
What makes a four a four in the eyes of the CNN?
The last question corresponds to the question of what activation outputs of "4"s really have in common. Let us take 3 different images of a "4":
The same with the "jet"-color-map:
Already with our eyes we see that there are similarities but also quite a lot of differences.
"4"-representation on the 2nd Conv-layer
Below we see a comparison of the 64 maps on the 2nd Conv-layer for our three "4"-images.
Now, you may say: Well, I still recognize some common line patterns - e.g. parallel lines at an angle of about 75 degrees on the 11x11 grids. Yes, but these lines are almost dissolved by the next pooling step:
Now, consider in addition that the next (3rd) convolution combines 3x3-data of all of the displayed 5x5-maps. Then, probably, we can hardly speak of a concept of abstract line configurations any more ...
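How quickly max pooling blurs such line patterns can be tried out on a toy example. The 11x11 grid below is artificial (NOT a real activation map): it just contains two parallel steep lines. The pooling is assumed to be a 2x2 max pooling with stride 2, which turns an 11x11 grid into a 5x5 grid.

```python
# Sketch: 2x2 max pooling applied to an artificial 11x11 map with two steep lines.

def max_pool_2x2(grid):
    """2x2 max pooling with stride 2; an odd last row/column is dropped."""
    n_rows, n_cols = len(grid) // 2, len(grid[0]) // 2
    return [
        [max(grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
             grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1])
         for c in range(n_cols)]
        for r in range(n_rows)
    ]

def show(grid):
    """Print active points as '#' and inactive points as '.'."""
    for row in grid:
        print("".join("#" if value > 0 else "." for value in row))
    print()

# Toy data: two parallel lines, one column shift per 4 rows (roughly 75 degrees)
SIZE = 11
toy_map = [[0.0] * SIZE for _ in range(SIZE)]
for row in range(SIZE):
    col = 7 - row // 4
    toy_map[row][col] = 1.0
    toy_map[row][col - 3] = 1.0

pooled = max_pool_2x2(toy_map)
show(toy_map)
show(pooled)
```

On the 5x5 grid the two lines are reduced to a handful of active points which partly touch each other; their direction can hardly be read off any more.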
"4"-representations on the third Conv-layer
Below you find the abstract activation outputs on the 3rd Conv2D-layer for our three different "4"-images:
When we look at details we see that prominent "features" in one map of a specific 4-image do NOT appear in a fully comparable way in the eventual convolutional maps for another image of a "4". Some of the maps (i.e. filters after 4 transformations) produce really different results for our three images.
But there are common elements: I have marked only some of the points which show a significant intensity in all of the maps. But does this mean these individual common points are decisive for a classification of a "4"? We cannot be sure about it - probably it is their combination which is relevant.
In the eighth row and third column we see an abstract combination of three points - in a shape like ┌ - but whether this indicates a part of the lines constituting a "4" on this coarse representation level of the original image is more than questionable.
So, what we ended up with is that we find some common points or some common point-relations in a few of the 128 "3x3"-maps of our three images of handwritten "4"s.
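The search for such common points need not be done by eye. A minimal sketch: take the flattened activation vectors of several images and keep only those positions at which all of them show a significant value. The threshold rule (a fraction of each vector's own maximum) is an assumption chosen for illustration, and the short vectors below are made-up toy values, NOT real activations.

```python
# Sketch: find positions which are "significantly active" in all given vectors.

def common_active_points(vectors, rel_threshold=0.5):
    """Indices at which every vector exceeds rel_threshold times its own maximum."""
    limits = [rel_threshold * max(vec) for vec in vectors]
    length = len(vectors[0])
    return [
        idx for idx in range(length)
        if all(vec[idx] > limit for vec, limit in zip(vectors, limits))
    ]

# Made-up toy vectors standing in for the flattened maps of three "4"-images
img_a = [0.0, 0.9, 0.1, 0.8, 0.0, 0.7, 0.0, 0.2]
img_b = [0.6, 1.0, 0.0, 0.1, 0.0, 0.9, 0.3, 0.0]
img_c = [0.0, 0.8, 0.5, 0.0, 0.1, 1.0, 0.0, 0.6]

common = common_active_points([img_a, img_b, img_c])
print("positions active in all three vectors:", common)
```

For real data the vectors would have 1152 entries each. Note that such a list only shows which points are shared; it says nothing about which combination of them the MLP actually relies on.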
The maps of a CNN are created by an effective and guided optimization process. The results reflect indeed a process of the detection of very abstract features. But such "features" should not be associated with figurative elements and not in the sense of a concept of how to draw lines to construct a representation of a number. At least not in pure CNNs ... Things may be a bit different in convolutional "autoencoders" (combinations of convolutional encoders and decoders), but this is another story we will come back to in this blog.
The process of convolutional filtering (= transformations) and pooling (= coarsening; here by taking maxima) "maps" the original pixel distribution onto abstract relation patterns on very coarse grids covering information stemming from extended regions of the original image. In the end the MLP decides on the appearance and relation of rather few common "point"-like elements in a multitude of maps - i.e. on point-like elements in low dimensional representation spaces.
This is very different from a conscious consideration process and weighing of alternatives which a human brain performs when it looks at sketches of numbers. Our brain tries to find signs consistent with a construction process defined for writing down a "4", i.e. signs of a certain arrangement of straight and curved lines. A human brain, thus, would refer to arrangements of line elements - but not to relations of individual points in an extremely coarse and abstract representation space after some mathematical transformations.
A CNN training algorithm tests many, many ways of how to filter information and to combine filtered information to detect unique patterns or point-like combinations on very coarse abstract feature grids, where original lines are completely dissolved. The last convolutional layer plays an important role in a CNN structure as it feeds the "classifying" MLP. As soon as the training algorithm has found a stable solution it represents a fully deterministic set of rules weighing common relations of point-like activations in a rather abstract input vector for an MLP.
In the next article
we shall look at the whole procedure again, but then we compare common elements of a "4" with those of a "9" on the 3rd convolutional layer. Then the key question will be: What do "4"s have in common on the last convolutional maps which corresponding activations of "9"s do not show - and vice versa?
This will become especially interesting in cases for which a distinction was difficult for pure MLPs. You remember the confusion matrix for the MNIST dataset? See:
A simple Python program for an ANN to cover the MNIST dataset – XI – confusion matrix
We saw at that point in time that pure MLPs had some difficulties distinguishing badly written "4"s from "9"s.
We will see that the better distinction abilities of CNNs in the end depend on very few point-like elements of the final activation on the last layer before the MLP.