TinyCA2 as a replacement for YaST’s CA-tools on Opensuse Leap servers with TLS/SSL – I

Today server services should offer network connectivity for clients with encryption. On Linux StartTLS based services are common - for LDAP, email/groupware servers as well as web servers. To set up SSL/TLS/StartTLS based services we need certificates and encryption keys issued by a central CA - which we trust. Administering your own local CA and server certificates can be a bit challenging without graphical tools - even in smaller networks with a dozen server instances.

In our networks with mainly Opensuse and Debian servers I had used YaST's CA-module to create a CA and server certificates signed by this CA. The stupid thing is that the required "yast2-ca"-module and its RPM have been missing since Opensuse Leap 15.0. This was not a major problem so far; the update processes respected existing certificates, of course. However, some days ago two of my central server certificates - namely the one for my LDAP-server and one for an Apache2-server - expired. This in turn led to a breakdown of several other services on other (virtual) machines: SSSD, IMAP, Postfix (SMTP), because these services use the LDAP server, among other things, as a backend for user authentication. (SSSD itself provides a TLS connection to LDAP.)

The Opensuse documentation cha.security.yast_ca.html is really misleading because it claims to be valid for Opensuse Leap 15.1 - which it is not, as there still is no yast-ca-module available. For me this kind of policy of Opensuse is unbelievable; doesn't Leap provide the basic platform for SLES? How shall SLES admins in smaller companies tackle the resulting problems? Buy a PKI tool? Everybody talks about security .... but SuSE (???)

I wanted some cost free alternative for my own network - and as a first trial I went for "TinyCA2".

This became more of an adventure than expected. Part of the hurdles were due to Opensuse specific settings - but also due to the very many different configuration files which had to be adapted for the certificate of my new CA - which came in addition to my old one. (I did not yet want to give up the old CA as some (virtual) servers still have valid server certificates from it.) Another obstacle appeared when Opensuse deleted any new files in "/etc/ssl/certs" after a system restart. Also the GUI of TinyCA2 has some strange "features" regarding default values, which I did not become aware of during my first trials. In addition it seemed to be necessary to replace SHA1 by SHA256. And in the end I got e.g. Apache running with its new server certificate, but not e.g. the slapd.service - due to an access rights problem which was difficult to see.

In this article I shall describe most of the required steps for switching to a TinyCA2 CA and adjusting server settings. I shall concentrate on some simple services as examples. But I hope the general pattern of how to proceed will become clear and help others, who work with Opensuse, to save a bit of time.

Installing and patching TinyCA2

Both Opensuse Leap 15.0 and Opensuse Leap 15.1 provide RPMs for TinyCA2 version 0.7.5, which is from 2015. If you have a look at GitHub (see the link in the last section of this article) you may also find some (important) patches. On an Opensuse system you install the RPM easily with the help of YaST (yast2). After the installation you find the Perl files of TinyCA2 in the directory "/usr/share/TinyCA2/lib".

When you start TinyCA2 via the command "tinyca2 &" the first thing you may stumble across is the fact that (among other digests) MD5 and SHA1 are offered as hashing algorithms. Look at the bottom part of the following screenshot:

(By the way: The layout - especially the icons - may look different on your system. It depends on your graphical desktop and your settings for GTK applications)

You see that we get a variety of hashing algorithms offered under the category "Digest". Most of them are regarded as insecure today. So, even in a semiprofessional environment you would like to see something better - e.g. SHA256. Fortunately, another guy (Bill Thorsteinson) had the same problem and he has created a patch for TinyCA2 which enables SHA256. You find the patch at
https://www.systemajik.com/tinyca-sha2/.

Let us try this out; on my ssh session to my central server "myserv" (this is the one with LDAP):

myserv:~ # mkdir -p /central/Updates/tinyca
myserv:~ # wget https://www.systemajik.com/wp-uploads/2014/10/tinyca_sha256.patch_.txt -O /central/Updates/tinyca/tinyca_sha256.patch_.txt
--2019-07-20 12:44:57--  https://www.systemajik.com/wp-uploads/2014/10/tinyca_sha256.patch_.txt
Resolving www.systemajik.com (www.systemajik.com)... 206.47.13.3, 2001:470:1f11:b22::8
Connecting to www.systemajik.com (www.systemajik.com)|206.47.13.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4863 (4.7K) [text/plain]
Saving to: ‘/central/Updates/tinyca/tinyca_sha256.patch_.txt’

/central/Updates/tinyca/tinyca_sha256.pat 100%[=================>]   4.75K  --.-KB/s    in 0s      

2019-07-20 12:44:58 (138 MB/s) - ‘/central/Updates/tinyca/tinyca_sha256.patch_.txt’ saved [4863/4863]

myserv:~ # 
myserv:~ # cd /usr/share/TinyCA2/lib
myserv:/usr/share/TinyCA2/lib # cp /central/Updates/tinyca/tinyca_sha256.patch_.txt .
myserv:/usr/share/TinyCA2/lib # patch --verbose -p1 < tinyca_sha256.patch_.txt
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|From e5e25e55f8da2b4d2bad584f2145ca0ff6b3a92a Mon Sep 17 00:00:00 2001
|From: Bill Thorsteinson <bill.git@systemajik.com>
|Date: Thu, 30 Oct 2014 22:26:47 -0400
|Subject: [PATCH] Apply changes
|
|---
...
...
|--- a/REQ.pm
|+++ b/REQ.pm
--------------------------
patching file REQ.pm
Using Plan A...
Hunk #1 succeeded at 59.
Hunk #2 succeeded at 426.
Hmm...  Ignoring the trailing garbage.
done

Note: The "-p1" in "patch --verbose -p1 < tinyca_sha256.patch_.txt" reads "-pONE" and not "-pL" with a small L-letter.

A "tinyca2" command now produces:

Much better !

Importing the old CA from Opensuse?

If you play around with the menus of TinyCA2 you find an option to import other CAs. Could this work with my old YaST-CA? To make a long story short - I did not succeed with this. The reasons are still unclear to me .... TinyCA could not read the relevant information.

So, I really was forced to set up a new CA - with all the consequences, such as issuing and deploying new server certificates and the (trusted) CA certificate on my servers (and the CA cert also on my client machines). Which even in my little network (12 servers - thanks to virtualization) is painstaking ...

Creating a new TinyCA2 based CA

Let us create a new CA with TinyCA2. We have some freedom regarding the "common name". I choose a reference to my main internal domain "anraconc.de" - so my common name is: "anraconc-CA". (It only looks like an official Internet domain; but actually it is an internal domain, only, and my DNS server is configured accordingly.)

Important hint:
Change the settings for Keylength and Digest by explicitly clicking first on other values and then the real choice again! If you do not change anything explicitly you may get a surprise regarding default values. They may not be what is indicated. Seems to be a bug. Do not disregard this hint if you want to save time ....

Now, a click on the "OK"-button gives us:

We set the keyUsage to "critical" (this certificate extension is used by some applications). And we eventually get all the information about our CA certificate:

The data - and especially the private key - can be found in the directory "/root/.TinyCA/anraconc-CA/". TinyCA2 creates such a directory for every main CA. (If you use sub-CAs you will find respective directories below it).

myserv:~/.TinyCA # cd anraconc-CA/
myserv:~/.TinyCA/anraconc-CA # la
total 44
drwx------ 7 root root 4096 Jul 20 13:13 .
drwx------ 5 root root 4096 Jul 20 13:12 ..
-rw------- 1 root root 3311 Jul 20 13:13 cacert.key
-rw------- 1 root root 2504 Jul 20 13:13 cacert.pem
drwx------ 2 root root 4096 Jul 20 13:12 certs
drwx------ 2 root root 4096 Jul 20 13:13 crl
-rw------- 1 root root    0 Jul 20 13:12 index.txt
drwx------ 2 root root 4096 Jul 20 13:12 keys
drwx------ 2 root root 4096 Jul 20 13:12 newcerts
-rw------- 1 root root 3872 Jul 20 13:13 openssl.cnf
drwx------ 2 root root 4096 Jul 20 13:12 req
-rw------- 1 root root    2 Jul 20 13:12 serial

Hint: You should make a backup of the CA directories on a periodic basis.

Now, you can export the CA certificate in form of a standard pem-file to some intermediate place where you gather your own certificates and keys - in my case this is a directory "/etc/certs" - which so far survived any Opensuse upgrades. Depending on what else you intend to save there (private keys?), you should make this place accessible to root only! We click on the second to last icon in the icon row of TinyCA2:

Note: In general you are free to give the exported file any name you like. However, it is a good policy to use the "common name" which you gave to the CA certificate. See below for the reason.

Place the CA certificate at a central location for trusted CAs

We can now copy this certificate file, which contains public information only, to our servers - into directories where we gather the public certificates (keys) of all trusted CAs. Of course we need to do this on the server "myserv", too, as some services may refer to it. At my age my first guess is "/etc/ssl/certs"; an old habit from a decade ago when this directory was used more frequently.

myserv:~/.TinyCA/anraconc-CA # cp /etc/certs/anraconc-CA.pem /etc/ssl/certs
myserv:~/.TinyCA/anraconc-CA # chmod 640 /etc/ssl/certs/anraconc-CA.pem 
myserv:~/.TinyCA/anraconc-CA # 

A wrong decision in the end - see below. But for our present session this will work.

Note: If we created Sub-CAs we would have to export all respective pem files to such a central location - the whole CA-chain must be reflected there for the verification of a service whose "server certificate" has been issued by a sub-CA. I do not use Sub-CAs in this article - but it may be necessary in your organization!

Create a server certificate

Now, we need to create "server certificates" or even service specific certificates. It depends on your policy of how far you want to discriminate services.

In this article I follow the path of a server wide central "server certificate" for all the services implemented there. As examples we shall later have a look at a local OpenLDAP service and a local Apache web server. My central server "myserv" with OpenLDAP has a FQDN of "myserv.anraconc.de".

Important note: You must use the FQDN as a "common name" in server certificates - consistent with DNS settings. Otherwise you may risk warnings of security aware applications that the server certificate does not fit the server!

In our TinyCA2 window we click on the tab "Certificates" and then on the empty sheet icon:

We fill in the required data. As keys protected by a password may cause trouble for services during automated system startups we try to leave the password-fields empty:

But this approach is not accepted!

So we type in some lengthy password - and set the options for Digest and Algorithm again explicitly - by clicking a bit around first (see above). Then we click on "OK" and get:

Think a bit about the validity! Actually, the length of the validity period should be somewhat shorter than the period for your CA! E.g. 5 years. Otherwise you will get a warning. Eventually:

If you click on the tab "Keys" you will see a related (private) key, too.
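The validity rule mentioned above can also be checked outside the GUI. The following is only a minimal Python sketch, assuming you have read the two "notAfter" strings yourself (e.g. from the output of "openssl x509 -noout -enddate"); the function name and the example dates are hypothetical and not part of TinyCA2.

```python
import ssl

def expires_before(server_not_after: str, ca_not_after: str) -> bool:
    """Return True if the server certificate expires before the CA certificate.

    Both arguments are 'notAfter' strings in the format printed by OpenSSL,
    e.g. 'Jul 20 13:13:00 2024 GMT'.
    """
    server_end = ssl.cert_time_to_seconds(server_not_after)
    ca_end = ssl.cert_time_to_seconds(ca_not_after)
    return server_end < ca_end

# Hypothetical dates: a server certificate that ends well before its CA.
print(expires_before("Jul 20 13:20:00 2024 GMT", "Jul 17 13:13:00 2029 GMT"))
```

If the function returns False, the server certificate would outlive the CA certificate - the situation TinyCA2 warns about.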

Important Note:
We have taken a shortcut here. You could have started in a different way - namely via a certificate "request". In a first step you would then have issued such a "request" under the tab "Requests" and filled out an initial form there. Afterwards, you explicitly need to sign the requested certificate with the CA's signature. You get the option for signing by right-clicking on the request entry after its creation. This approach also leads to a valid certificate.

Exporting the server certificate and the key to a central location on the Opensuse system

We need to export the certificate and the (private) key to some safe location on our server "myserv" - only accessible to root (and maybe read accessible to some special system user). On an Opensuse Leap system the location for server certificates should be "/etc/ssl/servercerts". For exporting the certificate we right-click on the entry:

Then we switch to the tab "Keys" and do the same there:


Important note:

At this point we get an option to export the key without a password. I have chosen this option. This entails security risks - your exported private key is protected by nothing afterwards. So be very careful where you save it and with which access rights. On the other hand such a key will allow for automated service starts - otherwise someone would have to provide the password during startup. I do not want to go deeper into this discussion here. But be careful with unprotected private keys!

You saw that I exported into my intermediate directory "/etc/certs". There we change rights for security reasons to:

myserv:/etc/ssl # chmod 600 /etc/certs/myserv-anraconc-key.pem 

Note:
If you instead export your key with the password there is a way to get rid of it afterwards:

myserv:/etc/certs # openssl rsa -in /etc/certs/myserv-anraconc-key.pem -out /etc/certs/myserv-anraconc-key_new.pem
Enter pass phrase for /etc/certs/myserv-anraconc-key.pem:
writing RSA key
myserv:/etc/certs # cp /etc/certs/myserv-anraconc-key_new.pem /etc/certs/myserv-anraconc-key.pem

This creates a key without a password! Actually I recommend using this approach - we do not know the details of TinyCA2's export procedure, but "openssl" will always create the key in the format required for further processing.

Now, before any further steps, I make a backup of everything existing in the folder "/etc/ssl/servercerts". This is important! If you lose your previous certificates and keys they are gone and you have no chance to get services running with them.

Then we overwrite the existing entries:

 
myserv:/etc/ssl # cp /etc/certs/myserv-anraconc-cert.pem /etc/ssl/servercerts/servercert.pem 
myserv:/etc/ssl # cp /etc/certs/myserv-anraconc-key.pem /etc/ssl/servercerts/serverkey.pem 
myserv:/etc/ssl # chmod 600 /etc/ssl/servercerts/serverkey.pem 

Note: The last step is of fundamental importance due to security reasons! See the discussion below if this leads to trouble for some services and the related system user.

You may try "644" in the beginning to avoid any problems with system users running special services. But if you do this then DO NOT FORGET to restrict the read rights again in the end and after your tests.
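Whether a key file really is closed to group and others again after such tests can also be checked by a small script. The following is a minimal Python sketch; the function name "is_private" is hypothetical, and the demonstration uses a temporary file instead of the real key so that it can be run anywhere on a Linux system.

```python
import os
import stat
import tempfile

def is_private(path: str) -> bool:
    """Return True if neither group nor others have any access rights."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Demonstration on a temporary file (stand-in for a key file).
with tempfile.NamedTemporaryFile() as tmp:
    os.chmod(tmp.name, 0o644)      # the relaxed test setting
    print(is_private(tmp.name))    # False
    os.chmod(tmp.name, 0o600)      # the restricted setting
    print(is_private(tmp.name))    # True
```

On the server you would call the function as root with the path of the key, e.g. "/etc/ssl/servercerts/serverkey.pem".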

Note that replacing the contents of "/etc/ssl/servercerts" will probably lead to a breakdown of all services which base their TLS/SSL functionality on these files. In most cases the reason for this will be that the configuration refers to a wrong CA-certificate. Therefore, you must reconfigure your local services step by step.

Conclusion

Enough for today. We have seen how TinyCA2 can be patched for SHA256, how it can be used to create a CA and server certificates. In the next article

TinyCA2 as a replacement for YaST’s CA-tools on Opensuse Leap servers with TLS/SSL – II

we shall reconfigure an Apache and an LDAP service to work with the new server certificate. And I shall show how we can make the CA-certificate permanently available in "/etc/ssl/certs" - across any server reboot. Stay tuned !

Links

CAs and Certificates - general information
https://wiki.ubuntuusers.de/CA/
stackexchange.com: what-role-do-hashes-play-in-tls-ssl-certificate-validation
stackexchange.com how-does-ssl-tls-work
robpol86.com root certificate authority
https://roll.urown.net/ca/ca_cert.html

TinyCA2
https://github.com/glennie/tinyca2
linux-magazin 2008: eigene zertifikatsstelle mit tinyca

TinyCA2 patches
https://www.systemajik.com/tinyca-sha2/.

Critical
stackexchange.com: which-properties-of-a-x-509-certificate-should-be-critical-and-which-not

 

The moons dataset and decision surface graphics in a Jupyter environment – IV – plotting the decision surface

In this article series

The moons dataset and decision surface graphics in a Jupyter environment – I
The moons dataset and decision surface graphics in a Jupyter environment – II – contourplots
The moons dataset and decision surface graphics in a Jupyter environment – III – Scatter-plots and LinearSVC

we used the moons data set to build up some basic knowledge about using a Jupyter notebook for experiments, Pipelines and SVM-algorithms of SciKit-Sklearn and plotting functionality of matplotlib.

Our ultimate goal is to write some code for plotting a decision surface between the moon shaped data clusters. The ability to visualize data sets and related decision surfaces is a key to quickly testing the quality of different SVM-approaches. Otherwise, you would have to run some kind of analysis code to get an impression of what is going on and possible deficits of the determined separation surface.

In most cases, a visual impression of the separation surface for complexly shaped data sets will give you much clearer information. With just one look you get answers to the following questions:

  • How well does the decision surface really separate the data points of the clusters? Are there data points which are placed on the wrong side of the decision surface?
  • How reasonable does the decision surface look? How does it extend into regions of the representation space not covered by the data points of the training set?
  • Which parameters of our SVM-approach influence what regarding the shape of the surface?

In the second article of this series we saw how we would create contour-plots. The motivation behind this was that a decision surface is something like the border between different areas of data points in an (x1,x2)-plane for which we get different distinct Z(x1,x2)-values. I.e., a contour line separating contour areas is an example of a decision surface in a 2-dimensional plane.

During the third article we learned in addition how we could visualize the various distinct data points of a training set via a scatter-plot.

We then applied some analysis tools to analyze the moons data - namely the "LinearSVC" algorithm together with "PolynomialFeatures" to cover non-linearity by polynomial extensions of the input data.

We did this in form of a Sklearn Pipeline for a step-wise transformation of our data set plus the definition of a predictor algorithm. Our LinearSVC-algorithm was trained with 3000 iterations (for a polynomial degree of 3) - and we could predict values for new data points.

In this article we shall combine all previous insights to produce a visual impression of the decision surface determined by LinearSVC. We shall put part of our code into a wrapper function. This will help us to efficiently visualize the results of further classification experiments.

Predicting Z-values for a contour plot in the (x1,x2) representation space of the moons dataset

To allow for the necessary interpolations done during contour plotting we need to cover the visible (x1,x2)-area relatively densely and systematically by data points. We then evaluate Z-values for all these points - in our case distinct values, namely 0 and 1. To achieve this we build a mesh of data points both in x1- and x2-direction. We saw already in the second article how numpy's meshgrid() function can help us with this objective:

resolution = 0.02
x1_min, x1_max = X[:, 0].min()  - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min()  - 1, X[:, 1].max() + 1
xm1, xm2 = np.meshgrid( np.arange(x1_min, x1_max, resolution), 
                        np.arange(x2_min, x2_max, resolution))

We extend our area quite a bit beyond the defined limits of (x1,x2) coordinates in our data set. Note that xm1 and xm2 are 2-dim arrays (!) of the same shape covering the surface with repeated values in either coordinate! We shall need this shape later on in our predicted Z-array.

To get a better understanding of the structure of the meshgrid data we start our Jupyter notebook (see the last article), and, of course, first run the cell with the import statements

import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from matplotlib import ticker, cm
from mpl_toolkits import mplot3d

from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

Then we run the cell that creates the moons data set to get the X-array of [x1,x2] coordinates plus the target values y:

X, y = make_moons(noise=0.1, random_state=5)
#X, y = make_moons(noise=0.18, random_state=5)
print('X.shape: ' + str(X.shape))
print('y.shape: ' + str(y.shape))
print("\nX-array: ")
print(X)
print("\ny-array: ")
print(y)

Now we can apply the "meshgrid()"-function in a new cell:

You see the 2-dim structure of the xm1- and xm2-arrays.
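The structure is easier to see on a tiny grid than on the real mesh. The following sketch uses a deliberately coarse, made-up grid (3 values in x1-direction, 2 values in x2-direction); the numbers are arbitrary and have nothing to do with the moons data.

```python
import numpy as np

# A deliberately coarse grid so that the arrays stay readable.
x1_vals = np.arange(0.0, 3.0, 1.0)     # 3 values in x1-direction
x2_vals = np.arange(10.0, 12.0, 1.0)   # 2 values in x2-direction
xm1, xm2 = np.meshgrid(x1_vals, x2_vals)

print(xm1.shape, xm2.shape)   # both arrays have the same 2-dim shape
print(xm1)                    # every row repeats the full list of x1-values
print(xm2)                    # every row holds one constant x2-value
```

Each position [i, j] in the two arrays together describes one mesh point (xm1[i, j], xm2[i, j]).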

Rearranging the mesh data for predictions
How do we predict data values? In the last article we did this in the following form

z_test = polynomial_svm_clf.predict([[x1_test_1, x2_test_1], 
                                     [x1_test_2, x2_test_2], 
                                     [x1_test_3, x2_test_3],
                                     [x1_test_3, x2_test_3]
                                    ])      

"polynomial_svm_clf" was the trained predictor we got by our pipeline with the LinearSVC algorithm and a subsequent training.

The "predict()"-function requires its input values as a list-like sequence of points, i.e. effectively a 2-dim array of shape (number of points, 2), where each element provides a (x1, x2)-pair of coordinate values. But how do we get such pairs from our strange 2-dimensional xm1- and xm2-arrays?

We need a bit of array- or matrix-wizardry here:

Numpy gives us the function "ravel()" which flattens a 2-dim array into a 1-dim array AND numpy also gives us the possibility to transpose a matrix (swap its axes) via the attribute "T" of an array, e.g. "np.array(...).T". (Have a look at the numpy-documentation e.g. at https://docs.scipy.org/doc/).

We can use these options in the following way - see the test example:
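A small stand-alone sketch of such a test, again on a made-up mini-grid (the values are arbitrary), could look like this:

```python
import numpy as np

xm1, xm2 = np.meshgrid(np.array([0.0, 1.0, 2.0]), np.array([10.0, 11.0]))

flat1 = xm1.ravel()                   # 1-dim array with all x1-values
flat2 = xm2.ravel()                   # 1-dim array with all x2-values
stacked = np.array([flat1, flat2])    # 2 rows: one for x1, one for x2
points = stacked.T                    # one row per mesh point: [x1, x2]

print(stacked.shape)
print(points.shape)
print(points)
```

After the transposition every row of "points" is one (x1, x2)-pair, which is the form a predictor expects.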

The involved logic should be clear by now. So, the next step should be something like

Z = polynomial_svm_clf.predict( np.array([xm1.ravel(), xm2.ravel()] ).T)

However, in the second article we already learned that we need Z in the same shape as the 2-dim mesh coordinate-arrays to create a contour-plot with contourf(). We, therefore, need to reshape the Z-array; this is easy - numpy contains a method reshape() for numpy-array-objects : Z = Z.reshape(xm1.shape). It is sufficient to use xm1 - it has the same shape as xm2.
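The whole round trip (mesh, flattened points, prediction, reshaped Z) can be tried without a trained model. In the following sketch the function "toy_predict()" is a purely hypothetical stand-in for "polynomial_svm_clf.predict()"; it simply assigns class 1 to all points with x1 > x2.

```python
import numpy as np

def toy_predict(points):
    """Hypothetical stand-in for a predictor: class 1 if x1 > x2, else 0."""
    return (points[:, 0] > points[:, 1]).astype(int)

xm1, xm2 = np.meshgrid(np.arange(0.0, 3.0, 1.0), np.arange(0.0, 2.0, 1.0))
mesh_points = np.array([xm1.ravel(), xm2.ravel()]).T

Z = toy_predict(mesh_points)
print(Z.shape)               # 1-dim: one value per mesh point

Z = Z.reshape(xm1.shape)     # back to the 2-dim shape of the mesh
print(Z.shape)
print(Z)
```

Only the reshaped Z has the same shape as xm1 and xm2 and can therefore be handed over to contourf().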

Applying contourf()

To distinguish contour areas we need a color map for our y-target-values. Later on we will also need different optical markers for the data points. So, for the contour-plot we add some statements like

markers = ('s', 'x', 'o', '^', 'v')
colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
# fetch unique values of y into array and associate with colors  
cmap = ListedColormap(colors[:len(np.unique(y))])

Z = Z.reshape(xm1.shape)

# see article 2 for the use of contourf()
plt.contourf(xm1, xm2, Z, alpha=0.4, cmap=cmap)  

Let us put all this together; as the statements to create a plot obviously are many, we first define a function "plot_decision_surface()" in a notebook cell and run the cell contents:

Now, let us test - with a new cell that repeats some of our code of the last article for training:
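The original cell is not reproduced here. A minimal sketch of the training part, assuming the setup described above (PolynomialFeatures with degree 3, a StandardScaler, LinearSVC with 3000 iterations), might look as follows. The step names, the value of C, the random_state and the two test points are assumptions made for this sketch only.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(noise=0.1, random_state=5)

# Step names, C and random_state are assumptions made for this sketch.
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, max_iter=3000, random_state=42)),
])
polynomial_svm_clf.fit(X, y)

# Predict the class for two arbitrary test points.
preds = polynomial_svm_clf.predict([[1.5, -0.5], [-0.5, 0.5]])
print(preds)
```

With the trained pipeline at hand, a call like plot_decision_surface(X, y, polynomial_svm_clf) in the notebook then draws the contour areas.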

Yeah - we eventually got our decision surface!

But this result still is not really satisfactory - we need the data set points in addition to see how well the 2 clusters are separated. But with the insights of the last article this is now a piece of cake; we extend our function and run the definition cell

def plot_decision_surface(X, y, predictor, ax_delta=1.0, mesh_res = 0.01, alpha=0.4, bscatter=1,  
                          figs_x1=12.0, figs_x2=8.0, x1_lbl='x1', x2_lbl='x2', 
                          legend_loc='upper right'):

    # some arrays and colormap
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot size  
    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = figs_x1 
    fig_size[1] = figs_x2
    plt.rcParams["figure.figsize"] = fig_size

    # mesh points 
    resolution = mesh_res
    x1_min, x1_max = X[:, 0].min() - ax_delta, X[:, 0].max() + ax_delta
    x2_min, x2_max = X[:, 1].min() - ax_delta, X[:, 1].max() + ax_delta
    xm1, xm2 = np.meshgrid( np.arange(x1_min, x1_max, resolution), 
                            np.arange(x2_min, x2_max, resolution))
    mesh_points = np.array([xm1.ravel(), xm2.ravel()]).T

    # predicted vals 
    Z = predictor.predict(mesh_points)
    Z = Z.reshape(xm1.shape)

    # plot contour areas 
    plt.contourf(xm1, xm2, Z, alpha=alpha, cmap=cmap)

    # add a scatter plot of the data points 
    if (bscatter == 1): 
        alpha2 = alpha + 0.4 
        if (alpha2 > 1.0 ):
            alpha2 = 1.0
        for idx, yv in enumerate(np.unique(y)): 
            plt.scatter(x=X[y==yv, 0], y=X[y==yv, 1], 
                        alpha=alpha2, c=[cmap(idx)], marker=markers[idx], label=yv)
            
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.xlabel(x1_lbl)
    plt.ylabel(x2_lbl)
    if (bscatter == 1):
        plt.legend(loc=legend_loc)

Now we get:

So far, so good ! We see that our specific model of the moons data separates the (x1,x2)-plane into two areas. The border between them has two wiggles near our data points, but otherwise asymptotically approaches almost a diagonal.

Hmmm, one could bet that this is model specific. Therefore, let us do a quick test for a polynomial_degree=4 and max_iterations=6000. We get

Surprise, surprise ... We have already 2 different models fitting our data.

Which one do you believe to be "better" for extrapolations into the big (x1,x2)-plane? Even in the vicinity of the leftmost and rightmost points in x1-direction we would get different predictions of our models for some points. We see that our knowledge is insufficient - i.e. the test data do not provide enough information to really distinguish between different models.

Conclusion

After some organization of our data we had success with our approach of using a contour plot to visualize a decision surface in the 2-dimensional space (x1,x2) of input data X for our moon clusters. A simple wrapper function for surface plotting equips us now for further fast experiments with other algorithms.

To become better organized, we should save this plot-function for decision surfaces as well as a simpler function for pure scatter plots in a Python class and import the functionality later on.

We shall create such a class within Eclipse PyDev as a first step in the next article:

The moons dataset and decision surface graphics in a Jupyter environment - V - a class for plots and some experiments

Afterward we shall look at other SVM algorithms - such as the "polynomial kernel" and the "Gaussian kernel". We shall also have a look at the impact of some of the parameters of the algorithms. Stay tuned ...

The moons dataset and decision surface graphics in a Jupyter environment – III – Scatter-plots and LinearSVC

During this article series we use the moons dataset to acquire basic knowledge on Python based tools for machine learning [ML] - in this case for a classification task. The first article

The moons dataset and decision surface graphics in a Jupyter environment – I

provided us with some general information about the moons dataset. The second article

The moons dataset and decision surface graphics in a Jupyter environment – II – contourplots

explained how to use a Jupyter notebook for performing ML-experiments. We also had a look at some functions of "matplotlib" which enabled us to create contour plots. We will need the latter to eventually visualize a decision surface between the two moon-like shaped clusters in the 2-dimensional representation space of the moons data points.

In this article we extend our plotting knowledge to the creation of a scatter-plot for visualizing data points of the moons data set. Then we will have a look at the "pipeline" feature of SciKit for a sequence of tasks, namely

  • to prepare the moons data set,
  • to analyze it
  • and to train a selected SVM-algorithm.

In this article we shall use a specific algorithm - namely LinearSVC - to predict the cluster association for some new data points.

Starting our Jupyter notebook, extending imports and loading the moons data set

At the end of the last session you have certainly found out how to close the Jupyter notebook on a Linux system. Three steps were involved:

  1. Logout via the button at the top-right corner of the web-page
  2. Ctrl-C in your terminal window
  3. Closing the tabs in the browser.

For today's session we start the notebook again from our dedicated Python "virtualenv" by

myself@mytux:/projekte/GIT/ai/ml1> source bin/activate
(ml1) myself@mytux:/projekte/GIT/ai/ml1> cd mynotebooks/
(ml1) myself@mytux:/projekte/GIT/ai/ml1/mynotebooks> jupyter notebook

We open "moons1.ipynb" from the list of available notebooks. (Note the move to the directory mynotebooks above; the Jupyter start page lists the notebooks in its present directory, which is used as a kind of "/"-directory for navigation. If you want the whole directory structure of the virtualenv accessible, you should choose a directory level higher as a starting point.)

For the work of today's session we need some more modules/classes from "sklearn" and "matplotlib". If you have not yet imported some of the most important ML-packages you should do so now. Probably, you need a second terminal - as the prompt of the first one is blocked by Jupyter:

myself@mytux:/projekte/GIT/ai/ml1> source bin/activate 
(ml1) myself@mytux:/projekte/GIT/ai/ml1> pip3 install --upgrade matplotlib numpy pandas scipy scikit-learn
Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/57/4f/dd381ecf6c6ab9bcdaa8ea912e866dedc6e696756156d8ecc087e20817e2/matplotlib-3.1.1-cp36-cp36m-manylinux1_x86_64.whl (13.1MB)
.....

The nice people from SciKit/SkLearn have already prepared data and functionality for the setup of the moons data set; we find the relevant function in sklearn.datasets. Later on we will also need some colormap functionality for scatter-plotting. And for doing the real work (training, SVM-analysis, ...) we need some special classes of sklearn.

So, as a first step, we extend the import statements inside the first cell of our Jupyter notebook and run it:

Then we move to the end of our notebook to prepare new cells. (We can rerun already defined cell code at any time.)

We enter the following code that creates the moons data-points with some "noise", i.e. with a spread in the coordinates around a perfect moon-like line. You see the relevant function below; for a beginning it is wise to keep the spread limited - to avoid too many overlapping points of the data clusters. I added some print-statements to get an impression of the data structure.

It is common practice to assign an uppercase letter "X" to the input data points and a lowercase letter to the array with the classification information (per data point) - i.e. the target vector "y".

The function "make_moons()" creates such an input array "X" of 2-dim data points and an associated target array "y" with classification information for the data points. In our case the classification is binary, only; so we get an array with "0"s or "1"s for each point.

This basic (X,y)-structure of data is very common in classification tasks of ML - at its core it represents the information reduction: "multiple features" => "member of a class".

Scatter-plots: Plotting the raw data in 2D and 3D

We want to create a visual representation of the data points in their 2-dim feature space. We name the two elements of a data point array "x1" and "x2".

For a 2D-plot we need some symbols or "markers" to distinguish the different data points of our 2 classes. And we need at least 2 related colors to assign to the data points.

To work efficiently with colors, we create a list-like ColorMap-object from given color names (or RGB-values); see ListedColormap. We can access the RGBA-values of a ListedColormap by treating it like a "list", i.e. by calling it with an integer index:

from matplotlib.colors import ListedColormap

colors = ('red', 'green', 'yellow')
cmap = ListedColormap(colors)
print(cmap(1))  # gives: (0.0, 0.5019607843137255, 0.0, 1.0)

All RGBA-values are normalized between 0.0 and 1.0. The last value defines an alpha-opacity. Note that "green" in matplotlib is not the fully saturated green: as in HTML/CSS the name stands for "#008000", whereas "#00ff00" is called "lime".

Let us try it for a list ('red', 'blue', 'green', 'gray', 'yellow', '#00ff00'):
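A minimal sketch of such a test cell; the loop and the print format are my own choice:

```python
from matplotlib.colors import ListedColormap

colors = ('red', 'blue', 'green', 'gray', 'yellow', '#00ff00')
cmap = ListedColormap(colors)

# Calling the colormap with an integer index returns the RGBA tuple
# of the list entry at that position
for i, name in enumerate(colors):
    print(i, name, cmap(i))
```

Comparing the entries for 'green' and '#00ff00' shows the difference between the two greens directly.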

The lower and upper limits of the two axes must be given. Note that this sets the size of the region in our representation space which we want to analyze or get predictions for later on. We shall make the region big enough to deliberately cover points outside the defined clusters. It will be interesting to see how an algorithm extrapolates its knowledge learned by training on the input data to regions beyond the training area.

For the purpose of defining the length of the axes we can use the plot functions pyplot.xlim() and pyplot.ylim().

The central function, which we shall use for plotting data points in the defined area of the (x1,x2)-plane, is "matplotlib.pyplot.scatter()"; see the documentation scatter() for parameters.

Regarding the following code, please note that we plot all points of each of the two moon-like clusters in one step. Therefore, we call scatter() exactly two times with the for-loop defined below:
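A sketch of such a plotting cell. The axis limits, markers and moons parameters are assumed values; the data are recreated here only to keep the example self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons

# assumed parameters; in the notebook X, y already exist
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

colors = ('red', 'blue')
markers = ('o', 's')
cmap = ListedColormap(colors)

fig, ax = plt.subplots(figsize=(8, 6))
# axis limits (assumed values), deliberately larger than the data region
plt.xlim(-2.0, 3.0)
plt.ylim(-1.5, 2.0)

# one scatter() call per class: the loop body runs exactly twice
for idx, cls in enumerate(np.unique(y)):
    # list comprehensions collect the coordinates of all points of class "cls"
    x1 = [X[i, 0] for i in range(len(y)) if y[i] == cls]
    x2 = [X[i, 1] for i in range(len(y)) if y[i] == cls]
    plt.scatter(x1, x2, color=cmap(idx), marker=markers[idx], label=f"class {cls}")

plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
```

In a Jupyter cell the figure is rendered automatically; in a plain Python script you would add "plt.show()" at the end.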

In the code you may stumble across lists which are defined by expressions inside the brackets. These are examples of so called Python "list comprehensions". You find an elementary introduction here.

As we have come so far, let's try a 3D-scatter-plot, too. This is not required to achieve our objectives, but it is fun and it extends our knowledge base:
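A sketch of a 3D-variant, in which the class label is used as the z-coordinate. The viewing angles and the moons parameters are assumed values:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # needed for projection="3d" on older matplotlib versions
from sklearn.datasets import make_moons

# assumed parameters; in the notebook X, y already exist
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")

for cls, color, marker in zip(np.unique(y), ('red', 'blue'), ('o', 's')):
    mask = (y == cls)
    # the class label (0 or 1) serves as the z-coordinate
    ax.scatter(X[mask, 0], X[mask, 1], y[mask], color=color, marker=marker)

ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
ax.view_init(20, -60)   # elevation and azimuth in degrees (assumed values)
```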

Of course all points of a class are placed on the same level (0 or 1) in z-direction. When we change the last statement to "ax.view_init(90, 0)", we look straight down the z-axis and see the 2D-distribution of the points again.

As expected 🙂 .

Analyzing the data with the help of a "pipeline" and "LinearSVC" as an SVM classifier

Sklearn provides us with a very nice tool (actually a class) named "Pipeline":

Pipeline([]) allows us

  • to define a series of transformation operations which are successively applied to a data set
  • and to define exactly one predictor algorithm (e.g. a regression or classifier algorithm), which creates a model of the data and which is optimized later on.

Transformers and predictors are also called "estimators".

"Transformers" and "predictors" are defined by Python classes in Sklearn. All transformer classes must provide a method "fit_transform()" which operates on the (X,y)-data; the predictor class of a pipeline provides a method "fit()".

The "Pipeline([])" is defined via a list of tuples, one per step; each tuple comprises a freely chosen name for the step and an instance of the respective transformer or predictor class. The resulting pipeline object also offers the method "fit()" (related to the predictor algorithm).

Thus a pipeline prepares a data set (X,y) via a chain of operational steps for training.

This sounds complicated, but is actually pretty easy to use. What does such a pipeline look like for our moons dataset? One possible answer is:

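# imports (already available if you extended the first notebook cell)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

polynomial_svm_clf = Pipeline([
  ("poly_features", PolynomialFeatures(degree=3)),   # transformer 1
  ("scaler", StandardScaler()),                      # transformer 2
  ("svm_clf", LinearSVC(C=18, loss="hinge", max_iter=3000))   # predictor
])
polynomial_svm_clf.fit(X, y)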

The transformers obviously are "PolynomialFeatures" and "StandardScaler", the predictor is "LinearSVC" which is a special linear SVM method, trying to find a linear separation channel between the data in their representation space.

The last statement

polynomial_svm_clf.fit(X, y)

starts the training based on our pipeline - with its algorithm.

PolynomialFeatures

What is "PolynomialFeatures" in the first step of our Pipeline good for? Well, looking at the moons data plotted above, it becomes quite clear that in the conventional 2-dim space for the data points in the (x1, x2)-plane there is no linear decision surface. Still, we obviously want to use a linear classification algorithm .... Isn't this a contradiction? What can be done about the problem of non-linearity?

In the first article of this series I briefly discussed an approach where data, which are apparently not linearly separable in their original representation space, can be placed into an extended feature space. For each data point we add new "features" by defining additional variables consisting of polynomial combinations of the point's basic X-coordinates. We do this up to a maximum degree, namely the order of a polynomial function - e.g. T(x1,x2) = x1**3 + a*x1**2*x2 + b*x1*x2**2 + c*x1*x2 + x2**3.

Thereby, the dimensionality of the original X(x1,x2) set is extended by multiple further dimensions. Each data point is positioned in the extended feature space by a defined transformation T.

Our hope is that we can find a linear separation ("decision") surface in the new extended multi-dimensional feature space.

The first step of our Pipeline enhances our X by additional and artificial polynomial "features" (up to a degree of 3 in our example). We do not need to care for details - they are handled by the class "PolynomialFeatures". The choice of a polynomial of order 3 is a bit arbitrary at the moment; we shall play around with the polynomial degree in a future article.
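To get an impression of what the transformation does, here is a small sketch which applies "PolynomialFeatures" to a single, arbitrarily chosen point (x1=2, x2=3); the point is my own example and not part of the moons data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

point = np.array([[2.0, 3.0]])      # one data point with x1=2, x2=3
poly = PolynomialFeatures(degree=3)
extended = poly.fit_transform(point)

print(extended.shape)   # (1, 10) : 2 original features become 10
print(extended)         # [[ 1.  2.  3.  4.  6.  9.  8. 12. 18. 27.]]
# each row of powers_ holds the exponents of (x1, x2) for one new feature
print(poly.powers_)
```

The 10 columns correspond to 1, x1, x2, x1**2, x1*x2, x2**2, x1**3, x1**2*x2, x1*x2**2 and x2**3. The linear classifier later on works in this 10-dimensional space instead of the original (x1,x2)-plane.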

StandardScaler

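The second step in the Pipeline is a simple one: StandardScaler.fit_transform() rescales each feature such that it has a mean of zero and a standard deviation of one. This helps both for e.g. linear regression- and SVM-analysis, because features with very different value ranges would otherwise contribute very unequally.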
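A small sketch with made-up numbers shows the effect on two features with very different value ranges:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: the second feature is 100 times larger than the first
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

scaled = StandardScaler().fit_transform(data)

print(scaled)
print(scaled.mean(axis=0))   # approximately [0. 0.]
print(scaled.std(axis=0))    # approximately [1. 1.]
```

After scaling both columns contain the same values; the original difference in magnitude is gone.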

The predictor LinearSVC

The third step assigns a predictor - in our example a simple linear SVM-like algorithm. It is provided by the class LinearSVC (a linear soft margin classifier). See e.g.
support-vector-machine-algorithm/,
LinearSVC vs SVC,
www.quora.com : What-is-the-difference-between-Linear-SVMs-and-Logistic-Regression.

The basic parameters of LinearSVC, such as the maximum number of iterations (3000) to find an optimal solution and the parameter "C", which controls the width of the separation channel by penalizing margin violations, will also be a subject of further experiments.

Analyzing the moons data and fitting the LinearSVC algorithm

Let us apply our pipeline and predict for some data points outside the X-region whether they belong to the "red" or the "blue" cluster. But, how do we predict?

We are not surprised that we find a method predict() in the documentation for our classifier algorithm; see LinearSVC.

So:
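A self-contained sketch of such a prediction cell. The moons parameters are assumed values, so the predicted classes may differ from the results listed below; a convergence warning of LinearSVC may also appear, depending on the data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# assumed parameters; in the notebook X, y and the fitted pipeline already exist
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=18, loss="hinge", max_iter=3000)),
])
polynomial_svm_clf.fit(X, y)

# test points outside the region covered by the training data
test_points = np.array([[1.50, 1.0], [1.92, 0.8], [1.94, 0.8], [2.20, 1.0]])
predictions = polynomial_svm_clf.predict(test_points)

for (x1, x2), z in zip(test_points, predictions):
    print(f"[x1={x1:.2f}, x2={x2:.1f}] => {z}")
```

Note that predict() of the pipeline sends the test points through the same transformers (polynomial features, scaling) before the classifier is applied.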

We get for the different test points

[x1=1.50, x2=1.0] => 0  
[x1=1.92, x2=0.8] => 0
[x1=1.94, x2=0.8] => 1
[x1=2.20, x2=1.0] => 1               

Looking at our scatter plot above we can assume that the decision line predicted by LinearSVC passes through the upper right corner of the (x1,x2)-space.

Of course, looking at some test data points is not enough to check the quality of our approach to find a decision surface. We absolutely need to plot the decision surface throughout the selected region of our (x1,x2)-plane.

Conclusion

But enough for today's session. We have seen how we can produce a scatter plot for our moons data. We have also learned a bit about Sklearn's "pipelines". And we have used the classes "PolynomialFeatures" and "LinearSVC" to try to separate our two data clusters.

By now, we have gathered so much knowledge that we should be able to use our predictor to create a contour plot - with just 2 contour areas in our representation space. We just have to apply the function contourf() discussed in the second article of this series to our data:

If we cover the (x1,x2)-plane densely and associate the predicted values of 0 or 1 with colors we should clearly see the contour line, i.e. the decision surface, separating the two areas in our contour plot. And hopefully all data points of our original (X,y) set fall into the right region. This is the topic of the next article:

The moons dataset and decision surface graphics in a Jupyter environment – IV – plotting the decision surface

Stay tuned.

Links

Understanding Support Vector Machine algorithm from examples (along with code) by Sunil Ray
Stackoverflow - What is exactly sklearn-pipeline?
LinearSVC