Performance of data retrieval from a simple wordlist in a Pandas dataframe with a string based index – II

In my last post I set up a Pandas dataframe with a column containing a (German) wordlist of around 2.2 million words. We created a unique string based index for the dataframe from a column with the “lower case” writing of the words. My eventual objectives are

  1. to find out whether a string-like token out of some millions of tokens is a member of the wordlist or not,
  2. to compare n-grams of characters, i.e. sub-strings, of millions of given strange string tokens with the n-grams of each word in the wordlist.

In the first case a kind of “existence query” on the wordlist is of major importance. We could work with a condition on a row value or somehow use the string based index itself. For the second objective we need requests on column values with “OR” conditions, or again a kind of existence query on individual columns which we turn into index structures beforehand.
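Just to make the two kinds of requests concrete, here is a tiny, purely hypothetical sketch. The toy dataframe and its n-gram columns “gram_1” and “gram_2” are invented for illustration only; they do not exist in the wordlist dataframe discussed below.

import pandas as pd

# Toy frame standing in for the wordlist; the n-gram columns 'gram_1', 'gram_2'
# are hypothetical illustrations and not part of the real dataframe
df = pd.DataFrame({'word':   ['HAUS', 'MAUS'],
                   'gram_1': ['hau', 'mau'],
                   'gram_2': ['aus', 'aus']},
                  index=['haus', 'maus'])

token_exists = 'haus' in df.index                              # existence query on the index
hits = df[(df['gram_1'] == 'hau') | (df['gram_2'] == 'aus')]   # OR-condition on columns
print(token_exists, len(hits))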

I found it interesting and a bit frustrating that a lot of introductory articles on the Internet and even books do not comment on performance. In this article we, therefore, compare the performance of different forms of simple data requests on a Pandas dataframe. To learn a bit more about Pandas’ response times, we extend the data retrieval requests a bit beyond the objectives listed above: We are going to look for rows where conditions for multiple words are fulfilled.

For the time being we restrict our experiments to a dataframe with just one UNIQUE index. I.e. we do not, yet, work with a multi-index. However, at the end of this article, I am going to look a bit at a dataframe with a NON-UNIQUE index, too.

Characteristics of the dataframe and the “query”

We work on a Pandas dataframe “dfw_smallx” with the following characteristics:

pdx_shape = dfw_smallx.shape
print("shape of dfw_smallx = ", pdx_shape)
pdx_rows = pdx_shape[0]
pdx_cols = pdx_shape[1]
print("rows of dfw_smallx = ", pdx_rows)
print("cols of dfw_smallx = ", pdx_cols)
print("column names", dfw_smallx.columns)
print("index", dfw_smallx.index)
print("index is unique: ", dfw_smallx.index.is_unique)
print('')
print(dfw_smallx.loc[['aachener']])
shape of dfw_smallx =  (2188246, 3)
rows of dfw_smallx =  2188246
cols of dfw_smallx =  3
column names Index(['lower', 'word', 'len'], dtype='object')
index Index(['aachener', 'aachenerin', 'aachenerinnen', 'aachenern', 'aacheners',
       'aachens', 'aal', 'aale', 'aalen', 'aales',
       ...
       'zynisches', 'zynischste', 'zynischsten', 'zynismus', 'zypern',
       'zyperns', 'zypresse', 'zypressen', 'zyste', 'zysten'],
      dtype='object', name='indw', length=2188246)
index is unique:  True

             lower      word  len
indw                             
aachener  aachener  AACHENER    8

The only difference to the dataframe created in the last article is the additional column “lower”, repeating the index. As said, the string based index of this dataframe is unique (checked by dfw_smallx.index.is_unique). At the end of this post we shall also have a look at a similar dataframe with a non-unique index.

Query: For a comparison we look at different methods to answer the following question: Are there entries for the words “null”, “mann” and “frau” in the list?

We apply each method a hundred times to get some statistics. I did the experiments on a CPU (i7-6700K). The response time depends a bit on the background load – I took the best result out of three runs.

Unique index: CPU Time for checking the existence of an entry within the index itself

There is a very simple answer to the question of how one can check the existence of a value in a (string based) index of a Pandas dataframe. We just use

“(‘STRING-VALUE‘ in df.index)”!

Let us apply this for three values

b1 = 0; b2=0; b3=0;  
v_start_time = time.perf_counter()
for i in range(0, 100):
    if 'null' in dfw_smallx.index:
        b1 = 1
    if 'mann' in dfw_smallx.index:
        b2=1 
    if 'frau' in dfw_smallx.index:
        b3=1
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
b4 = 'gamling' in dfw_smallx.index 
print(b1, b2, b3, b4)
Total CPU time  0.00020675300038419664
1 1 1 False

This gives me a total time on my old PC of about 2.1e-4 secs – which, as we are going to see, is a pretty good value for Pandas!

Total time 2.1e-4 secs.    Per query: 6.9e-7 secs.

Unique index: CPU Time for checking the existence of an entry with a Python dictionary

It is interesting to compare the query time required for a simple dictionary.

We first create a usable dictionary with the lower case word strings as the index:

ay_voc = dfw_smallx.to_numpy()
print(ay_voc.shape)
print(ay_voc[0:2, :])

ay_lower = ay_voc[:,0].copy()
d_lower = dict(enumerate(ay_lower))          # row number => lower case word
d_low   = {y:x for x,y in d_lower.items()}   # lower case word => row number
print(d_lower[0])
print(d_low['aachener'])
(2188246, 3)
[['aachener' 'AACHENER' 8]
 ['aachenerin' 'AACHENERIN' 10]]
aachener
0

And then:

b1 = 0; b2 = 0; b3 = 0
v_start_time = time.perf_counter()
for i in range(0, 100):
    if 'null' in d_low:
        b1 = 1
    if 'mann' in d_low:
        b2=1 
    if 'frau' in d_low:
        b3=1    
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

print(b1, b2, b3)
print(d_low['mann'], d_low['frau'])
Total CPU time  8.626400085631758e-05
1 1 1
1179968 612385

Total time 8.6e-5 secs.    Per query: 2.9e-7 secs.

Regarding the verification of the existence of a given string value, a dictionary with string based keys is almost a factor of 2.4 faster than the corresponding query on an indexed Pandas series or dataframe!

Unique index: CPU Time for direct “at”-queries by providing individual index values

Now, let us start with repeated query calls on our dataframe with the “at”-operator:

v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.at['null', 'word']
    wordy = dfw_smallx.at['mann', 'word']
    wordz = dfw_smallx.at['frau', 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  0.0013257559985504486

Total time: 1.3e-3 secs.    Per query: 4.4e-6 secs
This approach is slower than the fastest Pandas solution and the dictionary by factors of 6.5 and 15, respectively.

Unique index: CPU Time for direct “loc”-queries by providing individual index values

We now compare this with the same queries – but with the “loc”-operator:

v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc['null', 'word']
    wordy = dfw_smallx.loc['mann', 'word']
    wordz = dfw_smallx.loc['frau', 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  0.0021894429992244113

Total time: 2.2e-3 secs.    Per query: 7.3e-6 secs
More than a factor of 10.6 and 25.4 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: CPU Time for a query based on a list of index values

Now, let us use a list of index values – something which would be typical for a programmer who wants to save some typing time:

# !!!! Much longer CPU time for a list of index values  
inf = ['null', 'mann', 'frau']
v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc[inf, 'word'] # a Pandas series 
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print(wordx)
Total CPU time  0.037733839999418706
indw
null    NULL
mann    MANN
frau    FRAU
Name: word, dtype: object

Total time: 3.8e-2 secs.    Per query: 1.3e-4 secs
More than a factor of 182 and 437 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: CPU Time for a query based on a list of index values with index.isin()

We now try a small variation by computing a boolean mask with index.isin() once, outside the timing loop:

ix = dfw_smallx.index.isin(['null', 'mann', 'frau'])
v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc[ix, 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  0.058356915000331355

Total time: 5.8e-2 secs.    Per query: 1.9e-4 secs
We are losing ground again.
More than a factor of 282 and 667 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: CPU Time for a query based on a list of index values with index.isin() within loc()

Yet another seemingly small variation:

inf = ['null', 'mann', 'frau']
v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc[dfw_smallx.index.isin(inf), 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  6.466159620998951

OOOOPs!
Total time: 6.47 secs.    Per query: 2.2e-2 secs
More than a factor of 31000 and 75000 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: Query by values for a column and ignoring the index

What happens if we ignore the index and query by values of a column?

v_start_time = time.perf_counter()
for i in range(0, 100):
    pdq = dfw_smallx.query('indw == "null" | indw == "mann" | indw == "frau" ')
    #print(pdq)
    w1 = pdq.iloc[0,0]
    w2 = pdq.iloc[1,0]
    w3 = pdq.iloc[2,0]
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  16.747838952000166

Well, well – the worst result so far!
Total time: 16.75 secs.    Per query: 5.6e-2 secs
More than a factor of 81000 and 194000 slower than the fastest Pandas solution and the dictionary, respectively.

The following takes even longer:

v_start_time = time.perf_counter()
for i in range(0, 100):
    word3 = dfw_smallx.loc[dfw_smallx['word'] == 'NULL', 'word']
    word4 = dfw_smallx.loc[dfw_smallx['word'] == 'MANN', 'word']
    word5 = dfw_smallx.loc[dfw_smallx['word'] == 'FRAU', 'word']
    w4 = word3.iloc[0]
    w5 = word4.iloc[0]
    w6 = word5.iloc[0]
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  22.6538158809999

Total time: 22.65 secs.    Per query: 7.6e-2 secs
More than a factor of 109000 and 262000 slower than the fastest Pandas solution and the dictionary, respectively.

However:

v_start_time = time.perf_counter()
for i in range(0, 100):
    pdl = dfw_smallx.loc[dfw_smallx['word'].isin(['MANN', 'FRAU', 'NULL' ]), 'word']
    w6 = pdl.iloc[0]
    w7 = pdl.iloc[1]
    w8 = pdl.iloc[2]
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

This gives us 6.576 secs again.

Unique index: Query by values of a dictionary and ignoring the index

Here a comparison to a dictionary is interesting again:

v_start_time = time.perf_counter()
b1 = 0; b2 = 0; b3 = 0
for i in range(0, 100):
    if 'null' in d_lower.values():
        b1 = 1
    if 'mann' in d_lower.values():
        b2=1 
    if 'frau' in d_lower.values():
        b3=1    
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print(b1, b2, b3)
Total CPU time  4.572028649003187
1 1 1

So a dictionary is faster than Pandas – even if we ignore the index and query for certain values!

Intermediate conclusions

What do the above results tell us about the handling of Pandas dataframes or series with strings as elements and a unique index containing strings, too?

The first thing is:

If possible, do not use a Pandas dataframe at all! Turn the (two) required (string) columns of the dataframe into a string indexed dictionary!

In our case a simple solution was to turn a column with lower case writings of the words into a dictionary index and to use an enumerated column as the values.

A dictionary with string based keys will give you by far the fastest solution if you are only interested in the existence of a certain key-value.
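As a minimal sketch – assuming the dataframe dfw_smallx from above – such a dictionary can also be built directly from the string index:

# Build the lookup dictionary directly from the string index:
# each lower case word maps to its row number
d_low = {word: i for i, word in enumerate(dfw_smallx.index)}
print('mann' in d_low)   # fast existence check without any Pandas query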

Now, if you want to use Pandas to check the existence of a certain string in a unique index or column the following rules should be followed:

  • Whenever possible use a (unique) index based on the strings whose existence you are interested in.
  • For pure existence checks of a string in the index use a query of the type
    “if ‘STRING‘ in df.index”.
    This will give you the fastest solution with Pandas.
  • Whenever possible use a series of simple queries – each for exactly one index value and one or multiple column labels instead of providing multiple index values in a list.
  • If you want to use the “at”- or “loc”-operators, prefer the “at”-operator for a unique index! The form should be
    result = df.at[‘IndexValue’, ‘ColLabel’].
    The loc-operator
    result = df.loc[‘IndexValue’, [‘colLabel1’, ‘colLabel2’, …]] is somewhat slower, but the right choice if you want to retrieve multiple columns for a single row index value (see the short example after this list).
  • Avoid results which themselves become Pandas dataframes or series – i.e. results which contain a multitude of rows and column values.
  • Avoid queries on column-values! The CPU times to produce results depending on conditions for a column may vary; factors between 6 and 200,000 in comparison to the fastest solution for single values are possible.
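The following short example, using the dataframe dfw_smallx from above, illustrates the recommended access patterns:

if 'mann' in dfw_smallx.index:                       # fastest pure existence check
    word = dfw_smallx.at['mann', 'word']             # single cell => prefer at[]
    row  = dfw_smallx.loc['mann', ['word', 'len']]   # several columns of one row => loc[]
    print(word, row['len'])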

Using a string based index in a Pandas dataframe is pretty fast because Pandas then uses a hash-function and a hashtable to index datarows. Very much like Python handles dictionaries.
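A small illustration of this point: the hash-based index also provides a fast mapping from an index label to the row position.

pos = dfw_smallx.index.get_loc('mann')   # integer position of the row for 'mann'
print(pos, dfw_smallx.iloc[pos, 1])      # column 1 is 'word'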

Dataframes with a non-unique string index

Let us now quickly check what happens if we turn to a vocabulary with a non-unique string based index. We can easily get this from the standard spell-checked German wordlist provided by T. Brischalle.

dfw_fullx = pd.read_csv('/py/projects/CA22/catch22/Wortlisten/word_list_german_spell_checked.txt', dtype='str', na_filter=False)
dfw_fullx.columns = ['word']
dfw_fullx['indw'] = dfw_fullx['word']

pdfx_shape = dfw_fullx.shape
print('')
print("shape of dfw_fullx = ", pdfx_shape)
pdfx_rows = pdfx_shape[0]
pdfx_cols = pdfx_shape[1]
print("rows of dfw_fullx = ", pdfx_rows)
print("cols of dfw_fullx = ", pdfx_cols)

giving:

shape of dfw_fullx =  (2243546, 2)
rows of dfw_fullx =  2243546
cols of dfw_fullx =  2

We set an index as before by lowercase word values:

dfw_fullx['indw'] = dfw_fullx['word'].str.lower()
dfw_fullx = dfw_fullx.set_index('indw')

word_null = dfw_fullx.loc['null', 'word']
word_mann = dfw_fullx.loc['mann', 'word']
word_frau = dfw_fullx.loc['frau', 'word']

print('')
print(word_null)
print('')
print(word_mann)
print('')
print(word_frau)

Giving:

indw
null    null
null    Null
null    NULL
Name: word, dtype: object

indw
mann    Mann
mann    MANN
Name: word, dtype: object

indw
frau    Frau
frau    FRAU
Name: word, dtype: object

You see directly that the index is not unique.

Non-unique index: CPU Time for direct queries with single index values

We repeat our first experiment from above:

v_start_time = time.perf_counter()
for i in range(0, 100):
    pds1 = dfw_fullx.loc['null', 'word']
    pds2 = dfw_fullx.loc['mann', 'word']
    pds3 = dfw_fullx.loc['frau', 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print(pds1) 

This results in:

indw
null    null
null    Null
null    NULL
Name: word, dtype: object
Total CPU time  7.232291821999752

7.2 secs! Not funny!

The reason is that the whole index must be checked for entries of the given value.

Why the whole dataset? Well, Pandas cannot be sure that the string based index is sorted in any way!

Non-unique index: CPU Time for direct queries by single index values and a sorted index

We remedy the above problem by sorting the index:

dfw_fullx = dfw_fullx.sort_index(axis=0)

And afterwards we try our test from above again – this time giving us:

indw
null    Null
null    null
null    NULL
Name: word, dtype: object
Total CPU time  0.04120599599991692

0.041 secs – much faster. On average the response time now depends on log(N) because Pandas now can use a binary search – with N being the number of elements in the index.
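As a quick sanity-check sketch, one can verify that the sorted index is what enables the faster lookup:

# After sort_index() the index is monotonic, so loc[] can use a binary search
print(dfw_fullx.index.is_monotonic_increasing)   # should now print True
pds1 = dfw_fullx.loc['null', 'word']             # answered in ~log(N) steps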

Conclusions

For me as a beginner with Pandas the multitude of options to retrieve values from a Pandas series or dataframe was confusing and their relation to index involvement and the construction of the “result set” was not always obvious. Even more surprising, however, was the impact on performance:

As soon as you hit multiple rows by a “query” a Pandas series or dataframe is constructed which contains the results. This may have advantages regarding the presentation of the result data and advantages regarding a convenient post-query handling of the results. It is however a costly procedure in terms of CPU time. In database language:

For Pandas building the result set can become more costly than the query itself – even for short result sets. This was in a way a bit disappointing.

If you are interested in just the existence of certain string values in a list of unique strings, just create a
dictionary – indexed by your string values. Then check the existence of a string value by “(‘STRING-VALUE‘ in dict)”.
You do not need Pandas for this task!

When you want to use an (indexed) Pandas dataframe for existence checks, then query like “if ‘STRING-VALUE‘ in df.index” to check the existence of a string value in a properly created string based index.

When you need a bunch of values in various columns for given indices go for the “at”-operator and individual queries for each index value – and not a list.

When retrieving values you should, in general, use a direct form of “selecting” by just one index value and (if possible) one column label with the “at”-operator (df.at[‘indexValue’, ‘columnLabel’]). The loc-operator is up to a factor of 1.7 slower – but relevant if you query for more than one column.

Non-unique indices cost time, too. If you have no chance to associate your dataframe with a unique index that suits your objectives, then at least sort the non-unique index.

Performance of data retrieval from a simple wordlist in a Pandas dataframe with a string based index – I

When preparing a bunch of texts for Machine Learning [ML] there may come a point where you need to eliminate probable junk words or simply wrongly written words from the texts. This is especially true for scanned texts. Let us assume that you have already applied a tokenizer to your texts and that you have created a “bag of words” [BoW] for each individual text or even a global one for all of your texts.

Now, you may want to compare each word in your bag with a checked list of words – a “reference vocabulary” – which you assume to comprise the most relevant words of a language. If you do not find a specific word of your bag in your reference “vocabulary” you may want to put this word into a second bag for a later, more detailed analysis. Such an analysis may be based on a table where the vocabulary words are split into n-grams of characters. These n-grams will be stored in additional columns added to your wordlist, thus turning it into a 2-dimensional array of data.

Such tasks require a tool which – among other things –

  • is able to load 1-, 2-dimensional and sometimes 3-dimensional data structures in a fast way from CSV-files into series-, table- or cube-like data structures in RAM,
  • provides tools to select, filter, retrieve and manipulate data from rows, columns and cells,
  • provides tools to operate on a multitude of rows or columns,
  • provides tools to create some statistics on the data.

All of it pretty fast – which means that the tool must support the creation of an index or indices and/or support vectorized operations (mostly on columns).

I had read in ML books that Pandas is such a tool for a Python environment. A way to accomplish our task would be to load the reference vocabulary into a Pandas structure and check the words of the BoW(s) against it. This means that you try to find the word in the reference list and evaluate the result (positive or negative). And when you are done with this challenge you may want to retrieve additional information from a 2-dimensional Pandas data structure.

This article is about the performance of some data retrieval experiments I recently did on a wordlist of around 2 million words. The objective was to check the existence of tokenized words of some 200.000 texts, each with around 2000 tokens, against this wordlist being embedded in a Pandas dataframe and also to retrieve additional information from other columns of the dataframe.

As we talk about scanned texts and OCR treatment it is very probable that the number of tokens you have to compare with your vocabulary is well above 10 million. It is clear that there is a requirement for performance if you want to work with a standard Linux PC.

Multiple ways to retrieve or query information from a Pandas series or dataframe

When I started to really use Pandas some days ago I became a bit overwhelmed by the documentation – and the differences in comparison to databases. After having used a database like MySQL for years I had a certain vision about the handling of “table”-like data and related performance. Well, I had to swallow some camels!

And when I started to really care about performance I also realized that there were very many ways to “query” a Pandas dataframe – and not all of them will give you the same speed in data retrieval.

This article, therefore, dives a bit below the glittering surface of Pandas and looks at different methods to retrieve rows and certain cell values out of a simple “Pandas dataframe”. To work with some practical data I used a reference vocabulary for the German language based on Wikipedia articles.

The first objective was very simple: Verify that a certain word is an element in the reference vocabulary.
The second objective was a natural extension: Retrieve
rows (with multiple columns) for fitting entries – sometimes multiple entries with different word writings.

I was somewhat astonished to see factors between at least 16 and 10.000 for real data retrieval, in comparison with the fastest solution. Just checking the existence of a word in the wordlist proved to be much faster after having created a suitable index – and not using any data columns at all.

The response times of Pandas depended strongly on the “query” method and the usage of an index.

I hope the information given below and in the next article is useful for other beginners with Pandas. I shall speak of a “query” when I want to select data from a Pandas dataframe and a “resultset” when addressing one or a collection of data rows as the result of a query. Can’t forget my time with databases …

I assume that you already have a valid Pandas installation in a Python 3 environment on your Linux PC. I did my simple experiments with a Jupyter notebook, but, of course, other tools can be used, too.

Loading an example wordlist into a Pandas dataframe

For my small “query” experiments I first loaded a simple list with around 2.2 million words from a text file into a Pandas data structure. This operation created a so called “Pandas series” and also produced a unique index of integers, which marked each row of the data with a specific integer.

Then I created two additional columns: The first one with all words written in lower case letters. The second one containing the number of characters of the word’s string. By these operations I created a real 2-dim object – a so called Pandas “dataframe”.

Let us follow this line of operations as a first step. So, where do we get a wordlist from?

A friendly engineer (Torsten Brischalle) has provided a German word-list based on Wikipedia which we can use as an example.
See: http://www.aaabbb.de/WordList/WordList.php

We first import the “uppercase”-wordlist; you can download it via the link provided there. On your Linux PC you expand the 7zip archive by standard Linux tools.

This “uppercase” list has the advantage that an index which we will later base on the lowercase writing of the words will (hopefully) be unique. The more extensive wordlist also provided by Brischalle instead comprises multiple writings for some words. The related index would, therefore, not be unique. We shall see that this has a major impact on the response time of the resulting Pandas dataframe.

The wordlists, after 7zip-expansion, all are very simple text-files: Each line contains just one word.

We shall nevertheless work with a 2-dim general Pandas “dataframe” instead of a “series”. A reason is that in a real data analysis environment we may want to add multiple columns with more information later on. E.g. columns for n-grams of character sequences constituting the word or for other information as frequencies, consonant to vocal ratio, etc. And then we would work on 2-dim data structures.

Loading the data into a Pandas dataframe and creating an index based on lowercase word representation

Let us import the wordlist data by the help of some Python code in a Jupyter cell (in my case from a directory “/py/projects/CA22/catch22/Wortlisten/”):

import os
import time
import pandas as pd
import numpy as np

dfw_smallx = pd.read_csv('/py/projects/CA22/catch22/Wortlisten/word_list_german_uppercase_spell_checked.txt', dtype='str', na_filter=False)
dfw_smallx.columns = ['word']
dfw_smallx['indw'] = dfw_smallx['word']

pdx_shape = dfw_smallx.shape
print("shape of dfw_smallx = ", pdx_shape)
pdx_rows = pdx_shape[0]
pdx_cols = pdx_shape[1]
print("rows of dfw_smallx = ", pdx_rows)
print("cols 
of dfw_smallx = ", pdx_cols)

dfw_smallx.head(8)

You see that we need to import the Pandas module besides other standard modules. Then you find that Pandas obviously provides a function “read_csv()” to import CSV like text files. You find more about it in the Pandas documentation here.
The CSV import should in our case be a matter of a few seconds, only.

A column name or column names can be added to a Pandas series or Pandas dataframe, respectively, afterward.

Why did I use the parameter “na_filter“? Well, this was done to handle a special value in the wordlist, namely “NULL”. This string is one of the values which Pandas’ read_csv() by default interprets as a NaN value. Without the named parameter we would therefore get an empty entry in the dataframe for this input value. You find more information on this topic in the Pandas documentation on the “read_csv()”-function.
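A tiny, self-contained demonstration (with toy data, not the real wordlist) of what na_filter=False changes for the word “NULL”:

import io
import pandas as pd

csv_data = "word\nNULL\nMANN\n"
df_default = pd.read_csv(io.StringIO(csv_data))                   # "NULL" becomes NaN
df_raw     = pd.read_csv(io.StringIO(csv_data), na_filter=False)  # "NULL" stays a string
print(df_default.loc[0, 'word'])   # nan
print(df_raw.loc[0, 'word'])       # NULL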

The reader also notices that I just named the single data column (resulting from the import) ‘word’ and then copied this column to another new column called ‘indw’. I shall use the latter column as an index in a minute. I then print out some information on the dataframe:

shape of dfw_smallx =  (2188246, 2)
rows of dfw_smallx =  2188246
cols of dfw_smallx =  2

	word 			indw
0 	AACHENER 		AACHENER
1 	AACHENERIN 		AACHENERIN
2 	AACHENERINNEN 	AACHENERINNEN
3 	AACHENERN 		AACHENERN
4 	AACHENERS 		AACHENERS
5 	AACHENS 		AACHENS
6 	AAL 			AAL
7 	AALE			AALE

Almost 2.2 million words. OK, I do not like uppercase. I want a lowercase representation to be used as an index later on. This gives me the opportunity to apply an operation to a whole column with 2.2 million words.

The creation of our string based index can be achieved by the “set_index()” function:

dfw_smallx['indw'] = dfw_smallx['word'].str.lower()
dfw_smallx = dfw_smallx.set_index('indw')
dfw_smallx.head(5)

Leading after less than 0.5 secs (!) to:

 				word
indw 	
aachener 		AACHENER
aachenerin 		AACHENERIN
aachenerinnen 	AACHENERINNEN
aachenern 		AACHENERN
aacheners 		AACHENERS

Now, let us add one more column containing the length information on the word(s).

This can be done by two methods

  • dfw_smallx[‘len’] = dfw_smallx[‘word’].str.len()
  • dfw_smallx[‘len’] = dfw_smallx[‘word’].apply(len)

The second method is a bit faster (it needs only around 70% of the time of the first one), but does not work on NaN cells of a column. In our case this is no problem; we get the following (a timing of the str.len() variant is shown after the output below):

# Add a column for len information 
v_start_time = time.perf_counter()
dfw_smallx['len'] = dfw_smallx['word'].apply(len)
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
dfw_smallx.head(3)

Total CPU time  0.3626117290004913

			word 			len
indw 		
aachener 		AACHENER 		8
aachenerin 		AACHENERIN 		10
aachenerinnen 	AACHENERINNEN 	13
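For comparison, a timing of the str.len() variant from the list above can be done in the same way:

v_start_time = time.perf_counter()
dfw_smallx['len'] = dfw_smallx['word'].str.len()
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)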

Basics of addressing data in a Pandas dataframe

Ok, we have loaded our reference list of words into a dataframe. A Pandas “dataframe” basically is a 2-dimensional data structure based on Numpy array technology for the columns. Now, we want to address data in specific rows or cells. Below I repeat some basics for the retrieval of single values from a dataframe:

Each “cell” has a two dimensional integer-“index” – a tuple [i,j], with “i” identifying a row and “j” a column. You can use respective integer values by the “iloc[]“-operator. E.g. dfw_smallx.iloc[2,1] will give you the value “13”.

The “loc[]“-operator instead works with “labels” given to the rows and columns; in the most primitive form as :

dataframe.loc[row label, column label], e.g. dfw_smallx.loc[‘aachenerinnen’, ‘len’].

Labels have to be defined. For columns you may define names (often already during construction of the dataframe). For rows you may define an index – as we actually did above. If you want to compare this with databases: You define a primary key (sometimes based on column-combinations).

Other almost equivalent methods

  • iat[] – operator ,
  • at[] – operator,
  • array like usage of the column label + row-index
  • and the so called dot-notation

for the retrieval of single values are presented in the following code snippet:

print(dfw_smallx.iloc[2,1])
print(dfw_smallx.iat[2,1])
print(dfw_smallx['len'][2]) 
print(dfw_smallx.loc['aachenerinnen', 'len'])
print(dfw_smallx.at['aachenerinnen', 'len'])
print(dfw_smallx.len.aachenerinnen)

13
13
13
13
13
13

Note that the “iat[]” and “at[]” operators can only be used for cells, so both row and column values have to be provided; the other methods can be used for more general slicing of columns.

Slicing

Slicing is in general supported by the “:” notation – just as in NumPy. So, with the notation “labelvalue1 : labelvalue2” one can define slices. This works even for string label values:

words = dfw_smallx.loc['alt':'altersschwach', 'word':'len']
print(words)
                          word  len
indw                               
alt                        ALT    3
altaachener        ALTAACHENER   11
altablage            ALTABLAGE    9
altablagen          ALTABLAGEN   10
altablagerung    ALTABLAGERUNG   13
...                        ...  ...
altersschnitt    ALTERSSCHNITT   13
altersschnitts  ALTERSSCHNITTS   14
altersschrift    ALTERSSCHRIFT   13
altersschutz      ALTERSSCHUTZ   12
altersschwach    ALTERSSCHWACH   13

[3231 rows x 2 columns]

Queries with conditions on column values – and Pandas objects containing multiple results

Now let us look at some queries with conditions on columns and the form of the “result sets” when more than just a single value is returned in a Pandas response. Multiple return values may mean multiple rows (with one or more column values) or just one row with multiple column values. Two points are noteworthy:

  1. Pandas produces a new dataframe or series with multiple rows if multiple values are returned. Whenever we get a Pandas “object” with an internal structure as a Pandas response, we need to narrow down the result to the particular value we want to see.
  2. To grasp a certain value you need to include some special methods already in the “query” or to apply a method to the result series or dataframe.

An interesting type of “query” for a Pandas dataframe is provided by the “query()“-function: it allows us to retrieve rows or single values by conditions on column entries. But conditions can also be supplied when using the “loc[]” operator:

w1 = dfw_smallx.loc['null', 'word']
pd_w2 = dfw_smallx.loc['null'] # resulting in a series 
w2 = pd_w2[0]
pd_w3 = dfw_smallx.loc[dfw_smallx['word'] == 'NULL', 'word']
w3 = pd_w3[0]
pd_w4 = dfw_smallx.query('word == "NULL"')
w4 = pd_w4.iloc[0,0]
w5 = dfw_smallx.query('word == "NULL"').iloc[0,0]
w6 = dfw_smallx.query('word == "NULL"').word.item()
print("w1 = ", w1)
print("pd_w2 = ", pd_w2)
print("w2 = ", w2)
print("pd_wd3 = ", pd_w3)
print("w3 = ", w3)
print("w4 = ", w4)
print("w5 = ", w5)
print("w6 = ", w6)

I have added a prefix “pd_” to some variables where I expected a Pandas dataframe to be the answer. And really:

w1 =  NULL
pd_w2 =  word    NULL
len        4
Name: null, dtype: object
w2 =  NULL
pd_wd3 =  indw
null    NULL
Name: word, dtype: object
w3 =  NULL
w4 =  NULL
w5 =  NULL
w6 =  NULL

Noteworthy: For loc[] (in contrast to iloc[]) the last value of the slice definition is included in the result set.
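A small example of this difference, using the index values shown earlier:

# loc-slices include the end label, iloc-slices exclude the end position
print(dfw_smallx.loc['aal':'aale', 'word'])   # rows 'aal' AND 'aale'
print(dfw_smallx.iloc[6:8, 0])                # positions 6 and 7 only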

Retrieving data by a list of index values

As soon as you dig a bit deeper into the Pandas documentation you will certainly find the following way to retrieve multiple rows by providing a list of index values:

# Retrieving col values by a list of index values 
inf = ['null', 'mann', 'frau']
wordx = dfw_smallx.loc[inf, 'word']
wx = wordx.iloc[0:3] # resulting in a Pandas series 
print(wx.iloc[0])
print(wx.iloc[1])
print(wx.iloc[2])
NULL
MANN
FRAU

Intermediate conclusion

The variety of options even in our very simple scenario to retrieve values from a wordlist (with an additional column) is almost overwhelming. They all serve their purpose – depending on the structure of the dataframe and your knowledge on the data positions.

But actually, in our scenario of analyzing BoWs, we have a very simple task ahead of us: We just want to check whether a word or a list of words exists in the list, i.e. if there is an entry for a word (written in small letters) in the list. What about the performance of the different methods for this task?

Actually, there is a very simple answer for the existence check – giving you maximum performance.

But to learn a bit more about the performance of different forms of Pandas queries we also shall look at methods performing some real data retrieval from the columns of a row addressed by some (string) index value.

These will be the topics of the next article. Stay tuned …

Links

Various ways of “querying” Pandas dataframes
https://www.sharpsightlabs.com/blog/pandas-loc/
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
https://cmsdk.com/python/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.html
https://pythonexamples.org/pandas-dataframe-query/

The book “Mastering Pandas” of Ashish Kumar, 2nd, edition, 2019, Packt Publishing Ltd. may be of help – though it does not really comment on performance issues on this level.

NULL values
https://stackoverflow.com/questions/50683765/how-to-treat-null-as-a-normal-string-with-pandas

Walking a directory tree of text files and detect languages with Python, langdetect and fastText

Whilst preparing text data for ML we sometimes are confronted with a large amount of separate text-files organized in a directory hierarchy. In the last article in this category

Walking a directory tree with Python to eliminate unwanted files

I have shown how we “walk” through such a directory tree of files with the help of Python and os.walk(). We can, for example, use the os.walk() procedure to eliminate unwanted files in the tree.

As an example I discussed a collection of around 200.000 scanned paper pages with typewriter text; the collection contained jpg-files and related OCR-based txt-files. With os.walk() I could easily get rid of the jpg-files.

A subsequent task which awaited me regarding my special text collection was the elimination of those text-files which, with a relatively high probability, were not written in the German language. To solve such a problem with Python one needs a module that guesses the language of a given text correctly – in my case despite scanning and OCR errors. In this article I discuss the application of two methods – one based on the standard Python “langdetect” module and the other one based on a special variant of Facebook’s “fastText”-algorithm. The latter is by far the faster approach – which is of relevance given the relatively big number of files to analyze.

“langdetect” and its “detect()”-function

To get the “langdetect”-module, install it into your virtual Python environment via “pip install langdetect“. To test its functionality just present a text written in French and another one in German from respective txt-files to the “detect()”-function of this module – like e.g. in the following Jupyter cell:

from langdetect import detect
import os
import time

ftxt1 = open('/py/projects/CA22/catch22/0034_010.txt', 'r').read()
print(detect(ftxt1))
ftxt2 = open('/py/projects/CA22/catch22/0388_001.txt', 'r').read()
print(detect(ftxt2))

In my test case this gave the result:

fr
de

So, detect() is pretty easy to use.

What about the performance of langdetect.detect()?
For this I just ran os.walk() over the directory tree of my collection and printed out some information for all files not classified as German.

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

n = 0
m = 0
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    subdirs.sort()
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        ftext = open(filepath, 'r').read()
        lang = detect(ftext) 
        if lang != 'de':
            print(filepath, ' ', lang)
            m += 1 
        n += 1
    if n > 9999: 
        break

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

print("num of files checked: ", n)
print("num of non German files: ", m)

Resulting numbers were:

Total CPU time  72.65049991299747
num of files checked:  10033
num of non German files:  1465

Well, printing in a browser to a Jupyter cell takes some time itself. However, we shall see that this aspect is not really relevant in this example. (The situation would be different in real ML applications running primarily on a GPU, where printing would force context switches between GPU and CPU operations.)
In addition we are only interested in relative (CPU) performance when comparing the number later to the result of
fastText.

I measured 74 secs for a standard hard disk (no SSD) with printing – observing that everything just ran on one core of the CPU. Without printing the run required 72 secs. All measured with disc caching active and taking an average of three runs on cached data.

So, for my whole collection of 200.000 txt-files I would have to wait more than 25 minutes – even if the files had been opened, read and cached previously. Deleting the files from the disk would cost a bit of extra time in addition.

Without caching – i.e. for an original run I would probably have to wait for around 3 hrs. langdetect.detect() is somewhat slow.

Fasttext

FastText was developed by Facebook. It is used to create word-vector models – which can be embedded in neural networks via special layers. See
https://en.wikipedia.org/wiki/FastText and https://fasttext.cc/.

An original publication on the fastText algorithm can be found at https://arxiv.org/abs/1607.04606. A somewhat less scientific but understandable introduction is presented at:
https://amitness.com/2020/06/fasttext-embeddings/.

Besides supporting a variety of real ML-tasks, fastText also offers a language detection functionality. And Python is supported, too:

In your virtual Python environment just use “pip install fasttext”. In addition, on a Linux CLI, download a pretrained model for language detection, e.g. with wget:

wget -O /py/projects/fasttext/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

FastText for language detection can be used together with os.walk() as follows:

import fasttext
PRETRAINED_MODEL_PATH = '/py/projects/fasttext/lid.176.bin'
model = fasttext.load_model(PRETRAINED_MODEL_PATH)

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

n = 0
m = 0
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    subdirs.sort()
    #print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        ftext = open(filepath, 'r').read()
        ftext = ftext.replace("\n", " ")
        tupel_lang = model.predict(ftext)   # returns a tuple (labels, probabilities)
        #print(tupel_lang)
        lang = tupel_lang[0][0][-2:]        # labels look like '__label__de' => keep the last two chars
        #print(lang)
        if lang != 'de':
            #print(filepath, ' ', lang) 
            m += 1 
        n += 1
    if n > 9999: 
        break

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print("num of files checked: ", n)
print("num of non German files: ", m)

Result numbers:

Total CPU time  1.895524049999949
num of files checked:  10033
num of non German files:  1355

Regarding performance, fastText is almost a factor of 38 faster than langdetect.detect()!

Eliminating the non-German files from the collection of 208.000 files took only 42 secs – once the files had been cached. An original delete run without caching took around 320 secs – which is still much faster than a detect()-run on cached data. As said, this was on an (older) HD. This is very acceptable; SSDs would give us an even better performance.

Accuracy of the language detection

The reader certainly has noticed that the number of non-German files detected by fastText is somewhat smaller than what langdetect.detect() gave me. That there is a discrepancy is no wonder: Some of the text files are filled with so many scan and OCR errors that even a human
would have problems to detect the original language. I just checked 10 of the files where the different methods showed a discrepancy for the guessed language. In all of these 10 cases the guess of fastText was better.

Conclusion

Although I am not a friend of Facebook they have produced something really useful with their fastText algorithm. Regarding language detection fastText provides a really convincing super-fast and reliable method to guess the main language used in text pieces. It supports Python environments and allows the analysis of a huge amount of text files even on a PC.

Walking a directory tree with Python to eliminate unwanted files

Recently I got an enormous amount of text and related image files of scanned texts (around 210.000). The whole dataset had a size of around 25 GB. The text data should be analyzed with some machine learning [ML] algorithms. One of the first things to do in such a situation is to get rid of the jpg-files. Such files consume most of the disk space.

In my case I also got the data from a Mac machine. Hidden “._”-files were created on the Mac when the original data were downloaded from the Internet. These files contain Mac-specific metadata. I had to eliminate these files, too.

Due to the doubling of files and the additional “.”-files the total number of files was around 830.000. The number of files really required was much smaller. Eliminating files which one does not need is an exercise which is often required in a Machine Learning context for text files.

In such a situation the function “os.walk()” helps in a Python environment.

os.walk

os.walk() allows us to walk recursively through a directory tree. We get a tuple back containing

  1. the path to the present directory,
  2. a list of all sub-directories,
  3. a list of all files (by their names) in the directory.

For typical applications this is enough information to perform analysis and file operations within the directories.
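A minimal illustration of the tuple structure, using the directory path from the application below:

import os

for dirname, subdirs, filesdir in os.walk("/py/projects/CA22/catch22/"):
    print(dirname, len(subdirs), "sub-directories,", len(filesdir), "files")
    break   # just look at the top-level directory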

Application

In my case the usage was very simple. In a Jupyter cell the following code helped:

import os
import time

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        if filename.endswith('.jpg') or filename.endswith('.db') or filename.endswith('-GS.txt'):
            os.remove(filepath) 
            
# extra loop as there are hidden "."-files also for ".jpg"-files      
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        if filename.startswith('._'):
            os.remove(filepath) 

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

If you do not print out file-paths this should be a matter of seconds only, on a SSD below a second.

Counting the remaining files

We can also use os.walk() to count the number of remaining files:

n = 0 
v_start_time = time.perf_counter()
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    n += len(filesdir)
v_end_time = time.perf_counter()
print('Number of files = ', n )
print("Total CPU time ", v_end_time - v_start_time)

On a Linux system you could also use

mytux:~ # find /py/projects/CA22/catch22/ -type f | wc -l
208123

Thus I could bring down the total size to 780 MB and the number of txt-files to be processed down to around 208.000.

Deep Dreams of a CNN trained on MNIST data – III – catching dream patterns at smaller length scales

In the first two articles of this series on Convolutional Neural Networks [CNNs] and “Deep Dream” pictures

Deep Dreams of a CNN trained on MNIST data – II – some code for pattern carving
Deep Dreams of a CNN trained on MNIST data – I – a first approach based on one selected map of a convolutional layer

I introduced the concept of CNN-based “pattern carving” and a related algorithm; it consists of a small number of iterations over a sequence of image manipulation steps (downscaling to a resolution the CNN can handle, detection and amplification of a map triggering pattern by a CNN, upscaling to the original size and normalization). The included pattern detection and amplification algorithm is nothing else than the OIP-algorithm, which I discussed in another article series on CNNs in this blog. The OIP-algorithm is able to create a pattern, which triggers a chosen CNN map optimally, out of a chaotic fluctuation background. The difference is that we apply such an OIP-algorithm now to a structured input image – the basis for a “dream”. The “carving algorithm” is just a simplifying variation of more advanced Deep Dream algorithms; it can be combined with simple CNNs trained on gray and low-resolution images.

In the last article I provided some code which creates dream like pictures with the help of carving by a CNN trained on MNIST data. Nice, but … In the original theory of “Deep Dreaming” people from Google and others applied their algorithms to a cascade of so called “octaves“: Octaves represent the structures within an image at different levels of resolution. Unfortunately, at first sight, such an approach seems to be beyond the capabilities of our MNIST CNN, because the latter works on a fixed and very coarse resolution scale of 28×28 pixels, only.

As we cannot work on different scales of resolution directly: Can we handle pattern detection and amplification on different length scales in a different way? Can we somehow extend the carving process to smaller length scales and a possible detection of smaller map-triggering patterns within the input image?
The answer in short is: Yes, we can.
We “just” have to replace working with “octaves” by looking at sub-segments of the images – and apply our “carving” algorithm with up and down-scaling to these sub-segments as well as to the full image during an iteration process. We shall test the effects of such an approach in this article for a single selected sub-area of the input image, before we apply it more thoroughly to different input images in further articles.

A first step towards Deep Dreams “dreaming” detail structures of a chosen image …

We again use the image of a bunch of roses, which we worked with in the last articles, as a starting point for the creation of a Deep Dream picture. The easiest way to define sub-areas of our (square) input image is to divide it into 4 (square) adjacent sub-segments or sub-images, each with half of the side length of the original image. We thus get two rows, each with two neighboring sub-areas of the size 280×280 px. Then we could apply the carving algorithm to one selected or to all of the sub-segments. But how do we combine such an approach with an overall treatment of the full image? The answer is that we have to work in parallel on both length-scales involved. We can do this within each cycle of down- and up-scaling according to the following scheme:

  • Step 1: Pick the input image IMG at the full working size (here 560 x 560 px) – and reshape it into a tensor suitable for Tensorflow 2 [TF2] functions.
  • Step 2: Down- and upscale the input image “IMG” (560×560 => 28×28 => 560×560) – with information loss
  • Step 3: Calculate the difference between the re-up-scaled image to the original image as a tensor for later detail corrections.
  • Step 4: Subdivide IMG into 4 quadrants (each with a size of 280×280 px). Save the quadrants as sub-images S_IMG_1, …, S_IMG_4 – and create respective tensors.
  • Step 5: For all sub-images: Down- and up-scale the sub-images S_IMG_m (280×280 => 28×28 => 280×280) – with information loss.
  • Step 6: For all sub-images: Determine the differences between the re-upscaled sub-images and the original sub-images as tensors for later detail corrections.
  • Step 7: Loop A [around 4 to 6 iterations]
    • Step LA-1: Pick the down-scaled image IMG_d (of size 28×28 px) as the present input image for the CNN and the OIP-analysis
    • Step LA-2: Apply the OIP-algorithm for N epochs (20 < N < 40) to the downscaled image IMG_d (28x28)
    • Step LA-3: Upscale the resulting changed version of image IMG_d to the original IMG- size by bicubic interpolation => IMG_u
    • Step LA-4: Add the (constant) correction tensor for details to IMG_u.
    • (Step LA-5: Loop B over Sub-Images)
      • Step LB-1: Cut out the area corresponding to sub-image S_IMG_m from the changed full image IMG_u. Use it as Sub_IMG_m during the following steps.
      • Step LB-2: Downscale the present sub-image Sub_IMG_m to the size suitable for the CNN (here: 28×28) by bicubic interpolation => Sub_IMG_M_d.
      • Step LB-3: Apply the OIP-algorithm for N epochs (20 < N < 40) to the downscaled sub-image (28x28) Sub_IMG_m_d
      • Step LB-4: Upscale the changed Sub_IMG_m_d to the original size of Sub_Img_m by bicubic interpolation => Sub_IMG_m_u
      • Step LB-5: Add the (constant) correction tensor to the tensor for the upscaled sub-image Sub_IMG_m_u
      • Step LB-6: Replace the sub-image region of the upscaled full image IMG_u with the (corrected) sub-image Sub_IMG_m_u
      • Step LB-7: Downscale the new full image IMG_u again (to 28×28 => IMG_d) and use the resulting IMG_d for the next iteration of Loop A

The Loop B has been placed in brackets because we are going to apply the suggested technique only to a single one of the 4 sub-image quadrants in this article.
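Before we look at the actual code, here is a condensed sketch of the scheme above – not the code used for the pictures below, just an outline. The function “oip_optimize()” is only a placeholder for the OIP-algorithm of the previous articles, and the tensors t_picc_g, t_picc_g_wk_size_corr and t_picc_g_4_corr are assumed to be prepared as in the next code section.

import tensorflow as tf

# oip_optimize(): placeholder for the OIP-algorithm of the earlier articles
img_wk_size, half = 560, 280
t_img = t_picc_g                                                    # full image tensor (1,560,560,1)
for it in range(5):                                                 # Loop A
    t_d = tf.image.resize(t_img, [28, 28], method="bicubic")        # LA-1/LA-2: downscale
    t_d = oip_optimize(t_d, n_epochs=30)                            # LA-2: carve the pattern
    t_u = tf.image.resize(t_d, [img_wk_size, img_wk_size], method="bicubic")   # LA-3: upscale
    t_u = t_u + t_picc_g_wk_size_corr                               # LA-4: re-add lost details
    # Loop B - here only for the lower right quadrant (no. 4)
    t_s   = t_u[:, half:, half:, :]                                 # LB-1: cut out the quadrant
    t_s_d = tf.image.resize(t_s, [28, 28], method="bicubic")        # LB-2: downscale
    t_s_d = oip_optimize(t_s_d, n_epochs=30)                        # LB-3: carve the pattern
    t_s_u = tf.image.resize(t_s_d, [half, half], method="bicubic") + t_picc_g_4_corr  # LB-4/5
    ay_u  = t_u.numpy()
    ay_u[0, half:, half:, 0] = t_s_u.numpy()[0, :, :, 0]            # LB-6: paste the quadrant back
    t_img = tf.convert_to_tensor(ay_u, dtype=tf.float32)            # LB-7: basis for the next cycle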

Code fragments for the sub-image preparation

The following code for a Jupyter cell prepares the quadrants of the original image IMG – here of a bunch of roses. The code is straightforward and easy to understand.

# ****************************
# Work with sub-Images  
# ************************

%matplotlib inline

import PIL
from PIL import Image, ImageOps

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 16
fig_size[1] = 20
fig_3 = plt.figure(3)
ax3_1_1 = fig_3.add_subplot(541)
ax3_1_2 = fig_3.add_subplot(542)
ax3_1_3 = fig_3.add_subplot(543)
ax3_1_4 = fig_3.add_subplot(544)
ax3_2_1 = fig_3.add_subplot(545)
ax3_2_2 = fig_3.add_subplot(546)
ax3_2_3 = fig_3.add_subplot(547)
ax3_2_4 = fig_3.add_subplot(548)
ax3_3_1 = fig_3.add_subplot(5,4,9)
ax3_3_2 = fig_3.add_subplot(5,4,10)
ax3_3_3 = fig_3.add_subplot(5,4,11)
ax3_3_4 = fig_3.add_subplot(5,4,12)
ax3_4_1 = fig_3.add_subplot(5,4,13)
ax3_4_2 = fig_3.add_subplot(5,4,14)
ax3_4_3 = fig_3.add_subplot(5,4,15)
ax3_4_4 = fig_3.add_subplot(5,4,16)
ax3_5_1 = fig_3.add_subplot(5,4,17)
ax3_5_2 = fig_3.add_subplot(5,4,18)
ax3_5_3 = fig_3.add_subplot(5,4,19)
ax3_5_4 = fig_3.add_subplot(5,4,20)

# size to work with 
# ******************
img_wk_size = 560 

# bring the orig img down to (560, 560) 
# ***************************************
imgvc = Image.open("rosen_orig_farbe.jpg")
imgvc_wk_size = imgvc.resize((img_wk_size,img_wk_size), resample=PIL.Image.BICUBIC)
# Change to np arrays 
ay_picc = np.array(imgvc_wk_size)
print("ay_picc.shape = ", ay_picc.shape)
print("r = ", ay_picc[0,0,0],    " g = ", ay_picc[0,0,1],    " b = " , ay_picc[0,0,2] )
print("r = ", ay_picc[200,0,0],  " g = ", ay_picc[200,0,1],  " b = " , ay_picc[200,0,2] )

# Turn color to gray 
#Red * 0.3 + Green * 0.59 + Blue * 0.11
#Red * 0.2126 + Green * 0.7152 + Blue * 0.0722
#Red * 0.299 + Green * 0.587 + Blue * 0.114

ay_picc_g = ( 0.299  * ay_picc[:,:,0] + 0.587  * ay_picc[:,:,1] + 0.114  * ay_picc[:,:,2] )  
ay_picc_g = ay_picc_g.astype('float32') 
t_picc_g  = ay_picc_g.reshape((1, img_wk_size, img_wk_size, 1))
t_picc_g  = tf.image.per_image_standardization(t_picc_g)

# downsize to (28,28)  
t_picc_g_28 = tf.image.resize(t_picc_g, [28,28], method="bicubic", antialias=True)
t_picc_g_28 = tf.image.per_image_standardization(t_picc_g_28)

# get the correction for the full image 
t_picc_g_wk_size = tf.image.resize(t_picc_g_28, [img_wk_size,img_wk_size], method="bicubic", antialias=True)
t_picc_g_wk_size = tf.image.per_image_standardization(t_picc_g_wk_size)

t_picc_g_wk_size_corr = t_picc_g - t_picc_g_wk_size
t_picc_g_wk_size_re   = t_picc_g_wk_size + t_picc_g_wk_size_corr


# Display wk_size orig images  
ax3_1_1.imshow(imgvc_wk_size)
ax3_1_2.imshow(t_picc_g[0,:,:,0], cmap=plt.cm.gray)
ax3_1_3.imshow(t_picc_g_28[0,:,:,0], cmap=plt.cm.gray)
ax3_1_4.imshow(t_picc_g_wk_size_re[0,:,:,0], cmap=plt.cm.gray)


# Split in 4 sub-images 
# ***********************

half_wk_size  = int(img_wk_size / 2)  

t_picc_g_1  = t_picc_g[:, 0:half_wk_size, 0:half_wk_size, :]
t_picc_g_2  = t_picc_g[:, 0:half_wk_size, half_wk_size:img_wk_size, :]
t_picc_g_3  = t_picc_g[:, half_wk_size:img_wk_size, 0:half_wk_size, :]
t_picc_g_4  = t_picc_g[:, half_wk_size:img_wk_size, half_wk_size:img_wk_size, :]

# Display wk_size orig images  
ax3_2_1.imshow(t_picc_g_1[0,:,:,0], cmap=plt.cm.gray)
ax3_2_2.imshow(t_picc_g_2[0,:,:,0], cmap=plt.cm.gray)
ax3_2_3.imshow(t_picc_g_3[0,:,:,0], cmap=plt.cm.gray)
ax3_2_4.imshow(t_picc_g_4[0,:,:,0], cmap=plt.cm.gray)


# Downscale sub-images 
t_picc_g_1_28 = tf.image.resize(t_picc_g_1, [28,28], method="bicubic", antialias=True)
t_picc_g_2_28 = tf.image.resize(t_picc_g_2, [28,28], method="bicubic", antialias=True)
t_picc_g_3_28 = tf.image.resize(t_picc_g_3, [28,28], method="bicubic", antialias=True)
t_picc_g_4_28 = tf.image.resize(t_picc_g_4, [28,28], method="bicubic", antialias=True)


# Display downscales sub-images  
ax3_3_1.imshow(t_picc_g_1_28[0,:,:,0], cmap=plt.cm.gray)
ax3_3_2.imshow(t_picc_g_2_28[0,:,:,0], cmap=plt.cm.gray)
ax3_3_3.imshow(t_picc_g_3_28[0,:,:,0], cmap=plt.cm.gray)
ax3_3_4.imshow(t_picc_g_4_28[0,:,:,0], cmap=plt.cm.gray)


# get correction values for upsizing 
t_picc_g_1_wk_half = tf.image.resize(t_picc_g_1_28, [half_wk_size,half_wk_size], method="bicubic", antialias=True)
t_picc_g_1_wk_half = tf.image.per_image_standardization(t_picc_g_1_wk_half)
t_picc_g_2_wk_half = tf.image.resize(t_picc_g_2_28, [half_wk_size,half_wk_size], method="bicubic", antialias=True)
t_picc_g_2_wk_half = tf.image.per_image_standardization(t_picc_g_2_wk_half)
t_picc_g_3_wk_half = tf.image.resize(t_picc_g_3_28, [half_wk_size,half_wk_size], method="bicubic", antialias=True)
t_picc_g_3_wk_half = tf.image.per_image_standardization(t_picc_g_3_wk_half)
t_picc_g_4_wk_half = tf.image.resize(t_picc_g_4_28, [half_wk_size,half_wk_size], method="bicubic", antialias=True)
t_picc_g_4_wk_half = tf.image.per_image_standardization(t_picc_g_4_wk_half)
 
    
t_picc_g_1_corr =  t_picc_g_1 - t_picc_g_1_wk_half   
t_picc_g_2_corr =  t_picc_g_2 - t_picc_g_2_wk_half   
t_picc_g_3_corr =  t_picc_g_3 - t_picc_g_3_wk_half   
t_picc_g_4_corr =  t_picc_g_4 - t_picc_g_4_wk_half   

t_picc_g_1_re   = t_picc_g_1_wk_half + t_picc_g_1_corr 
t_picc_g_2_re   = t_picc_g_2_wk_half + t_picc_g_2_corr 
t_picc_g_3_re   = t_picc_g_3_wk_half + t_picc_g_3_corr 
t_picc_g_4_re   = t_picc_g_4_wk_half + t_picc_g_4_corr 

print(t_picc_g_1_re.shape)

# Display downscales sub-images  
ax3_4_1.imshow(t_picc_g_1_re[0,:,:,0], cmap=plt.cm.gray)
ax3_4_2.imshow(t_picc_g_2_re[0,:,:,0], cmap=plt.cm.gray)
ax3_4_3.imshow(t_picc_g_3_re[0,:,:,0], cmap=plt.cm.gray)
ax3_4_4.imshow(t_picc_g_4_re[0,:,:,0], cmap=plt.cm.gray)

ay_img_comp = np.zeros((img_wk_size,img_wk_size))
ay_t_comp   = ay_img_comp.reshape((1,img_wk_size, img_wk_size,1))
t_pic_comp  = ay_t_comp
t_pic_comp[0, 0:half_wk_size, 0:half_wk_size, 0]                       = t_picc_g_1_re[0, :, :, 0]
t_pic_comp[0, 0:half_wk_size, half_wk_size:img_wk_size, 0]             = t_picc_g_2_re[0, :, :, 0]
t_pic_comp[0, half_wk_size:img_wk_size, 0:half_wk_size, 0]             = t_picc_g_3_re[0, :, :, 0]
t_pic_comp[0, half_wk_size:img_wk_size, half_wk_size:img_wk_size, 0]   = t_picc_g_4_re[0, :, :, 0]

ax3_5_1.imshow(t_picc_g[0,:,:,0], cmap=plt.cm.gray)
ax3_5_2.imshow(t_pic_comp[0,:,:,0], cmap=plt.cm.gray)

 
Note the computation of the correction tensors: for the full image and for each quadrant we subtract the re-upscaled version of the downscaled image from the original, thereby capturing the details lost by the scaling. We demonstrate their effectiveness below by displaying the respective results for the full image, for the cut-out, re-upscaled and corrected quadrants, and for the re-composed image.
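To condense the idea, here is a minimal sketch of my own (not part of the original cell); it omits the intermediate per_image_standardization steps and uses a hypothetical helper "detail_correction" to show that the bicubic down-/up-scaling loses high-frequency details, that the correction tensor stores exactly this difference, and that adding it back reproduces the original image within numerical precision:

import tensorflow as tf

def detail_correction(t_img, low_res=28):
    # return the re-upscaled image and the correction tensor for a (1, H, W, 1) tensor
    h, w = t_img.shape[1], t_img.shape[2]
    t_small = tf.image.resize(t_img, [low_res, low_res], method="bicubic", antialias=True)
    t_up    = tf.image.resize(t_small, [h, w], method="bicubic", antialias=True)
    t_corr  = t_img - t_up            # details lost by the down-/up-scaling
    return t_up, t_corr

t_up, t_corr = detail_correction(t_picc_g)
print("mean abs correction      : ", tf.reduce_mean(tf.abs(t_corr)).numpy())
print("max reconstruction error : ", tf.reduce_max(tf.abs((t_up + t_corr) - t_picc_g)).numpy())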

The results in the prepared image frames look like:

Good!

Some code for the carving algorithm – applied to the full image AND a single selected sub-image

Now, for a first test, we concentrate on the sub-image in the lower-right corner – and apply the above algorithm. A corresponding Jupyter cell code is given below:

# *************************************************************************
# OIP analysis on sub-image tiles (to be used after the previous cell)
# **************************************************************************
# Note: To be applied after previous cell !!!
# ******

#interactive plotting - will be used in a future version when we deal with "octaves" 
#%matplotlib notebook 
#plt.ion()

%matplotlib inline

# preparation of figure frames 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 12
fig_4 = plt.figure(4)
ax4_1_1 = fig_4.add_subplot(441)
ax4_1_2 = fig_4.add_subplot(442)
ax4_1_3 = fig_4.add_subplot(443)
ax4_1_4 = fig_4.add_subplot(444)

ax4_2_1 = fig_4.add_subplot(445)
ax4_2_2 = fig_4.add_subplot(446)
ax4_2_3 = fig_4.add_subplot(447)
ax4_2_4 = fig_4.add_subplot(448)

ax4_3_1 = fig_4.add_subplot(4,4,9)
ax4_3_2 = fig_4.add_subplot(4,4,10)
ax4_3_3 = fig_4.add_subplot(4,4,11)
ax4_3_4 = fig_4.add_subplot(4,4,12)

ax4_4_1 = fig_4.add_subplot(4,4,13)
ax4_4_2 = fig_4.add_subplot(4,4,14)
ax4_4_3 = fig_4.add_subplot(4,4,15)
ax4_4_4 = fig_4.add_subplot(4,4,16)

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 6
fig_5 = plt.figure(5)
axa5_1 = fig_5.add_subplot(241)
axa5_2 = fig_5.add_subplot(242)
axa5_3 = fig_5.add_subplot(243)
axa5_4 = fig_5.add_subplot(244)
axa5_5 = fig_5.add_subplot(245)
axa5_6 = fig_5.add_subplot(246)
axa5_7 = fig_5.add_subplot(247)
axa5_8 = fig_5.add_subplot(248)
li_axa5 = [axa5_1, axa5_2, axa5_3, axa5_4, axa5_5, axa5_6, axa5_7, axa5_8]

# Some general OIP run parameters 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
n_epochs  = 30          # should be divisible by 5  
n_steps   = 6           # number of intermediate reports 
epsilon   = 0.01        # step size for gradient correction  
conv_criterion = 2.e-4  # criterion for a potential stop of optimization 

# image width parameters
# ~~~~~~~~~~~~~~~~~~~~~
img_wk_size   = 560    # must be identical to the last cell   
half_wk_size  = int(img_wk_size / 2) 

# min / max values of the input image 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Note : to be used for contrast enhancement 
min1 = tf.reduce_min(t_picc_g)
max1 = tf.reduce_max(t_picc_g)
# parameters to deal with the spectral distribution of gray values
spect_dist = min(abs(min1), abs(max1))
spect_fact   = 0.85
height_fact  = 1.1

# Set the gray downscaled input image as a starting point for the iteration loop 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MyOIP._initial_inp_img_data = t_picc_g_28

# ************
# Main Loop 
# ***********

num_main_iterate = 4
# map_index_main = -1       # take a cost function for a full layer 
map_index_main = 56         # map-index to be used for the OIP analysis of the full image  
map_index_sub  = 56         # map-index to be used for the OIP analysis of the sub-images   


for j in range(num_main_iterate):
   
    # ******************************************************
    # deal with the full (downscaled) input image 
    # ******************************************************
    
    # Apply OIP-algorithm to the whole downscaled image  
    # ~~~~~~~~~~~~~~~~~~~-----------~~~~~~~~~~~~~~~~~~~~~~~
    map_index_main_oip = map_index_main   # map-index we are interested in 

    # Perform the OIP analysis 
    MyOIP._derive_OIP(map_index = map_index_main_oip, 
                      n_epochs = n_epochs, n_steps = n_steps, 
                      epsilon = epsilon , conv_criterion = conv_criterion, 
                      b_stop_with_convergence=False,
                      b_print=False,
                      li_axa = li_axa5,
                      ax1_1 = ax4_1_1, ax1_2 = ax4_1_2)

    # display the modified downscaled image  (without and with contrast)
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_oip_g_28  = MyOIP._inp_img_data
    ay_oip_g_28 = t_oip_g_28[0,:,:,0].numpy()
    ay_oip_g_28_cont = MyOIP._transform_tensor_to_img(T_img=t_oip_g_28[0,:,:,0], centre_move=0.33, fact=1.0)
    ax4_1_3.imshow(ay_oip_g_28_cont, cmap=plt.cm.gray)
    ax4_1_4.imshow(t_picc_g_28[0, :, :, 0], cmap=plt.cm.gray)
    
    # rescale to 560 and re-add details via the correction tensor  
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_oip_g_wk_size        = tf.image.resize(t_oip_g_28, [img_wk_size,img_wk_size], 
                                             method="bicubic", antialias=True)
    t_oip_g_wk_size_re     = t_oip_g_wk_size + t_picc_g_wk_size_corr 
    
    # standardize to get an intermediate image 
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # t_oip_g_loop           = t_oip_g_wk_size_re.numpy()
    # t_oip_g_wk_size_re_std = tf.image.per_image_standardization(t_oip_g_wk_size_re)
    t_oip_g_wk_size_re_std = tf.image.per_image_standardization(t_oip_g_wk_size_re)
    t_oip_g_loop           = t_oip_g_wk_size_re_std.numpy()
    
    # contrast enhancement required, as the reduction of irrelevant pixels has smoothed out the gray scale beneath 
    # ~~~~~~~~~~~~~~~~~
    # the added details at pattern areas = high level of a whitened socket + relatively small detail variation    
    t_oip_g_wk_size_re_std_plt = tf.clip_by_value(height_fact*t_oip_g_wk_size_re_std, -spect_dist*spect_fact, spect_dist*spect_fact)
    ax4_2_1.imshow(t_oip_g_wk_size[0,:,:,0], cmap=plt.cm.gray)
    ax4_2_2.imshow(t_oip_g_wk_size_re_std[0,:,:,0], cmap=plt.cm.gray)
    ax4_2_3.imshow(t_oip_g_wk_size_re_std_plt[0,:,:,0], cmap=plt.cm.gray)
    ax4_2_4.imshow(t_picc_g[0,:,:,0], cmap=plt.cm.gray)
   

    # *************************************************************
    # deal with a chosen sub-image - in this version: "t_oip_4_g"
    # **************************************************************
   
    # Cut out and downscale a sub-image region 
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_oip_4_g     = t_oip_g_loop[:, half_wk_size:img_wk_size, half_wk_size:img_wk_size, :]
    t_oip_4_g_28  = tf.image.resize(t_oip_4_g, [28,28], method="bicubic", antialias=True)
    
    # use the downscaled sub-image as an input image to the OIP-algorithm 
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    MyOIP._initial_inp_img_data = t_oip_4_g_28
    
    
    # Perform the OIP analysis on the sub-image
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    map_index_sub_oip  = map_index_sub   # we may vary this in later versions 
    
    MyOIP._derive_OIP(map_index = map_index_sub_oip, 
                      n_epochs = n_epochs, n_steps = n_steps, 
                      epsilon = epsilon , conv_criterion = conv_criterion, 
                      b_stop_with_convergence=False,
                      b_print=False,
                      li_axa = li_axa5,
                      ax1_1 = ax4_3_1, ax1_2 = ax4_3_2)
    
    
    # display the modified sub-image without and with contrast  
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_oip_4_g_28  = MyOIP._inp_img_data
    ay_oip_4_g_28 = t_oip_4_g_28[0,:,:,0].numpy()
    ay_oip_4_g_28_cont = MyOIP._transform_tensor_to_img(T_img=t_oip_4_g_28[0,:,:,0], centre_move=0.33, fact=1.0)
    ax4_3_3.imshow(ay_oip_4_g_28_cont, cmap=plt.cm.gray)
    ax4_3_4.imshow(t_oip_4_g[0, :, :, 0], cmap=plt.cm.gray)
    
    # upscaling of the present sub-image 
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_pic_4_g_wk_half = tf.image.resize(t_oip_4_g_28, [half_wk_size,half_wk_size], method="bicubic", antialias=True)
    
    # add the detail correction to the manipulated upscaled sub-image
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_pic_4_g_wk_half = t_pic_4_g_wk_half + t_picc_g_4_corr
    #t_pic_4_g_wk_half = tf.image.per_image_standardization(t_pic_4_g_wk_half)
    ax4_4_1.imshow(t_pic_4_g_wk_half[0, :, :, 0], cmap=plt.cm.gray)
    ax4_4_2.imshow(t_oip_4_g[0, :, :, 0], cmap=plt.cm.gray)
    
    # Overwrite the related region in the full image with the new sub-image  
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    t_oip_g_loop[0, half_wk_size:img_wk_size, half_wk_size:img_wk_size, 0]  = t_pic_4_g_wk_half[0, :, :, 0]
    t_oip_g_loop_std = tf.image.per_image_standardization(t_oip_g_loop)
    
    #ax4_4_3.imshow(t_oip_g_loop[0, :, :, 0], cmap=plt.cm.gray)   
    ax4_4_3.imshow(t_oip_g_loop_std[0, :, :, 0], cmap=plt.cm.gray)   
    ay_oip_g_loop_std_plt = tf.clip_by_value(height_fact*t_oip_g_loop_std, -spect_dist*spect_fact, spect_dist*spect_fact)
    ax4_4_4.imshow(ay_oip_g_loop_std_plt[0, :, :, 0], cmap=plt.cm.gray)
    
    # Downscale the resulting new full image and feed it into the next iteration of the main loop
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    MyOIP._initial_inp_img_data = tf.image.resize(t_oip_g_loop_std, [28,28], method="bicubic", antialias=True)
   

 

The result of these operations after 4 iterations is displayed in the following image:

We recognize again the worm-like shape resulting for map 56, which we have already seen in the last article. (By the way: our CNN’s map 56 is originally triggered strongly by the shapes of handwritten 9-digits and plays a significant role in classifying respective images correctly.) But a new feature has appeared in the lower right part of the image – a kind of wheel, or two closely neighboring 9-like shapes.

The following code compares the final result with the original image:

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 12
fig_7X = plt.figure(7)
ax1_7X_1 = fig_7X.add_subplot(221)
ax1_7X_2 = fig_7X.add_subplot(222)
ax1_7X_3 = fig_7X.add_subplot(223)
ax1_7X_4 = fig_7X.add_subplot(224)

ay_pic_full_cont = MyOIP._transform_tensor_to_img(T_img=t_oip_g_loop_std[0, :, :, 0], centre_move=0.46, fact=0.8)

ax1_7X_1.imshow(imgvc_wk_size)
ax1_7X_2.imshow(t_picc_g[0,:,:,0], cmap=plt.cm.gray)
ax1_7X_3.imshow(ay_oip_g_loop_std_plt[0, :, :, 0], cmap=plt.cm.gray)
ax1_7X_4.imshow(ay_pic_full_cont, cmap=plt.cm.gray)

Giving:

Those who do not like the relatively strong contrast may experiment with other parameters for the contrast enhancement.
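As a starting point, a small experiment cell of my own (reusing "t_oip_g_loop_std" and "spect_dist" from the loop above) could compare a few clipping parameter combinations side by side:

import matplotlib.pyplot as plt
import tensorflow as tf

fig_c = plt.figure(8, figsize=(12, 4))
for i, (h_fact, s_fact) in enumerate([(1.0, 1.0), (1.1, 0.85), (1.3, 0.6)]):
    # clip the standardized gray values at a fraction of the spectral range
    t_plt = tf.clip_by_value(h_fact * t_oip_g_loop_std,
                             -spect_dist * s_fact, spect_dist * s_fact)
    ax = fig_c.add_subplot(1, 3, i + 1)
    ax.imshow(t_plt[0, :, :, 0], cmap=plt.cm.gray)
    ax.set_title("height_fact = " + str(h_fact) + ", spect_fact = " + str(s_fact))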

You see: simplifying prejudices about the world, once manifested in inflexible neural networks (or brains?), may lead to strange, if not bad, dreams. May some politicians learn from this, too … hmmm, just joking, as that requires both openness and a capacity for learning … can’t wait for Jan 21st …

Conclusion

Our “carving algorithm”, which corresponds to an amplification of traces of map-related patterns that a CNN detects in an input image, can also be applied to sub-image areas and their pixel information. We can therefore use even very limited CNNs trained on low-resolution MNIST images to create a “Deep MNIST Dream” out of an arbitrarily chosen input image. At least in principle.

Addendum 06.01.2021: There is, however, a pitfall – our CNN (for MNIST or other image samples) may have been trained on, and may operate on, standardized image/pixel data only. Even if the overall input image has been standardized before we operate on it in the sense of the algorithm discussed in this article, we should take into account that a sub-image area may contain pixel data which are far from a standardized distribution. So, our corrections for details may require more than just the calculation of a difference term; we may have to reverse intermediate standardization steps, too. We shall take care of this in forthcoming code changes.
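Just to indicate the direction such a fix could take, here is a minimal sketch with hypothetical helpers (not yet part of MyOIP): we record the mean and standard deviation of a sub-image before standardization, so that a standardized result can later be mapped back to the original value range.

import tensorflow as tf

def standardize_with_stats(t_img):
    # standardize a (1, H, W, 1) tensor and return the stats needed to undo it
    mean = tf.reduce_mean(t_img)
    std  = tf.math.reduce_std(t_img)
    return (t_img - mean) / (std + 1.e-8), mean, std

def destandardize(t_std, mean, std):
    # map a standardized tensor back to its original value distribution
    return t_std * (std + 1.e-8) + mean

t_sub = t_picc_g_4                      # e.g. the lower-right quadrant from above
t_sub_std, mu, sigma = standardize_with_stats(t_sub)
t_sub_back = destandardize(t_sub_std, mu, sigma)
print("max round-trip error: ", tf.reduce_max(tf.abs(t_sub_back - t_sub)).numpy())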

In the next article we are going to extend our algorithm to multiple cascaded sub-areas of our image – and we will vary the maps used for different sub-areas at the same time.