Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – IV

In the last posts of this mini-series we have studied if and how we can use three 3-char-grams at defined positions of a string token to identify matching words in a reference vocabulary. We have seen that we should choose some distance between the char-grams and that we should use the word's length information to keep the list of possible hits small.

Such a search may be interesting if there is only fragmented information available about some words of a text or if one cannot trust the whole token to be written correctly. There may be other applications. Note: This has so far nothing to do with text analysis based on machine learning procedures. I would put the whole topic more in the field of text preparation or text rebuilding. But I think that one can combine our simple identification of fitting words by 3-char-grams with ML-methods which evaluate the similarity or distance of a (possibly misspelled) token to vocabulary words: When we get a long hit-list we could invoke ML-methods to determine the best fitting word.

We saw that we can do 100,000 search runs with 3-char-grams on a decent vocabulary of around 2 million words in a Pandas dataframe in under 1.3 minutes on one CPU core of an older PC. In this concluding article I want to look a bit at the idea of multiprocessing the search with up to 4 CPU cores.

Points to take into account when using multiprocessing – do not expect too much

Pandas normally uses just one CPU core to do its job. And not all operations on a Pandas dataframe are well suited for multiprocessing. Readers who have followed the code fragments in this series so far will probably and rightly assume that our search process for words – or at least major parts of it – can indeed be separated reasonably well.

But even then there is always some overhead to expect from splitting a Pandas dataframe into segments (or “partitions”) for separate operations on different CPU cores. Overhead is also expected from the task of correctly combining the partial results from the different processor cores into one data unit (here: a dataframe) again at the end of a multiprocessed run.

A bottleneck for multiprocessing may also arise if multiple processes have to access certain distinct objects in memory at the same time. In our case this point is to be expected for the access of and search within distinct sub-dataframes of the vocabulary containing words of a specific length.

Due to overhead and bottlenecks we do not expect that a certain problem scales directly and linearly with the number of CPU cores. Another point is that the Linux OS may recognize a hyperthreading physical core of an Intel processor as two cores, but it may not be able to use such virtual cores in a given context as if they were real separate physical cores.
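
If you want to check this on your own system, the following small sketch compares the logical core count reported by the standard library with the number of physical cores. It assumes that the third-party package “psutil” is installed – it is not required for the rest of the code in this post.

import multiprocessing
import psutil   # third-party package; can distinguish physical from logical cores

logical_cores  = multiprocessing.cpu_count()       # includes hyperthreading "cores"
physical_cores = psutil.cpu_count(logical=False)   # real physical cores only
print("logical cores  : ", logical_cores)
print("physical cores : ", physical_cores)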

Code to invoke multiple processor cores

In this article I just use the standard Python “multiprocessing” module. (I did not test Ray yet – a first trial gave me trouble in some preparatory code segments of my Jupyter notebooks, and I did not have time to solve the problems there.)

Following some advice on the Internet I handled parallelization in the following way:

import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import cpu_count, Pool

#cores = cpu_count() # Number of logical CPU cores reported by the OS
cores = 4
partitions = cores # But actually you can define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split), copy=False)
    pool.close()
    pool.join()
    return data

The basic function – corresponding to the parameter “func” of the function “parallelize()” – which shall be executed in our case is structurally well known from the last posts of this article series:

We perform a search via putting conditions on columns (of the vocabulary-dataframe) containing 3-char-grams at different positions. The search is done on sub-dataframes of the vocabulary containing only words with a given length. The respective addresses are controlled by a Python dictionary “d_df”; see the last post for its creation. We then build a list of indices of fitting words. The dataframe containing the test tokens – in our case a random selection of real vocabulary words – will be called “dfw” inside the function “func() => getlen()” (see below). To understand the code you should be aware of the fact that the original dataframe is split into (4) partitions.

We only return the length of the list of hits and not the list of indices for each token itself.

# Function for parallelized operation 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def getlen(dfw):
    # Note 1: The dfw passed is a segment ("partition") of the original dataframe  
    # Note 2: We use a dict d_lilen which was defined outside  
    #         and is under the control of the parallelization manager
    
    num_rows = len(dfw)
    for i in range(0, num_rows):
        len_w = dfw.iat[i,0]
        idx = dfw.iat[i,33]
        
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]

        j_m = math.floor(len_w/2)+1
        j_l = 2
        j_r = len_w -1
        col_l = 'gram_' + str(j_l)
        col_m = 'gram_' + str(j_m)
        col_r = 'gram_' + str(j_r)
        val_l = dfw.iat[i, j_l+2]
        val_m = dfw.iat[i, j_m+2]
        val_r = dfw.iat[i, j_r+2]
        li_ind = df_.index[   (df_[col_r]==val_r) 
                            & (df_[col_m]==val_m)
                            & (df_[col_l]==val_l)
                            ]
        d_lilen[idx] = len(li_ind)

    # The dataframe must be returned - otherwise it will not be concatenated after parallelization 
    return dfw

While the processes work on different segments of our input dataframe we write results to a Python dictionary “d_lilen” which is under the control of the “parallelization manager” (see below). A dictionary is appropriate as we might otherwise lose control over the dataframe indices during the subsequent processes.

A reduced dataframe containing randomly selected “tokens”

To make things a bit easier we first create a “token”-dataframe “dfw_shorter3” based on a random selection of 100,000 indices from a dataframe containing long vocabulary words (length ≥ 10). We can derive it from our reference vocabulary. I have called the latter dataframe “dfw_short3” in the last post (because we use three 3-char-grams for longer tokens). “dfw_short3” contains all words of our vocabulary with a length of “10 ≤ length ≤ 30”.

# Prepare a sub-dataframe of 100,000 randomly selected words 
# ******************************
num_w = 100000
len_dfw = len(dfw_short3)

# select 100,000 random rows 
random.seed()
# Note: random.sample does not repeat values 
li_ind_p_w = random.sample(range(0, len_dfw), num_w)
len_li_p_w = len(li_ind_p_w)

dfw_shorter3 = dfw_short3.iloc[li_ind_p_w, :].copy() 
dfw_shorter3['lx'] = 0
dfw_shorter3['idx'] = dfw_shorter3.index
dfw_shorter3.head(5)

The resulting dataframe “dfw_shorter3” looks like:

You see that the index varies randomly and is not in ascending order! This is the reason why we must pick up the index-information during our parallelized operations!

Code for executing parallelized run

The following code enforces a parallelized execution:

manager = multiprocessing.Manager()
d_lilen = manager.dict()
print(len(d_lilen))

v_start_time = time.perf_counter()
dfw_res = parallelize(dfw_shorter3, getlen)
v_end_time = time.perf_counter()
cpu_time   = v_end_time - v_start_time
print("cpu : ", cpu_time)

print(len(d_lilen))
mean_length  = sum(d_lilen.values()) / len(d_lilen)
print(mean_length)

The parallelized run takes about 29.5 seconds.

cpu :  29.46206265499968
100000
1.25008

How does cpu-time vary with the number of cores of my (hyperthreading) CPU?

The cpu-time does not improve much when the number of cores gets bigger than the number of real physical cores:

1 core : 90.5 secs       
2 cores: 47.6 secs  
3 cores: 35.1 secs 
4 cores: 29.4 secs 
5 cores: 28.2 secs 
6 cores: 26.9 secs 
7 cores: 26.0 secs 
8 cores: 25.5 secs
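
The numbers above can be reproduced by simply repeating the measurement for different values of the variable “cores”. A sketch of such a loop – assuming the objects “parallelize()”, “getlen()” and “dfw_shorter3” defined in this post – might look like this:

import time
import multiprocessing

manager = multiprocessing.Manager()

for n_cores in range(1, 9):
    cores      = n_cores          # global variable read by parallelize()
    partitions = n_cores          # one partition per core
    d_lilen    = manager.dict()   # fresh result dict for each run

    v_start = time.perf_counter()
    dfw_res = parallelize(dfw_shorter3, getlen)
    v_end   = time.perf_counter()
    print(n_cores, " cores => cpu-time : ", v_end - v_start)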

My readers know about this effect already from ML experiments with CUDA and libcublas:

As long as we use physical processor cores we see a substantial improvement; beyond that no real gain in performance is observed on hyperthreading CPUs.

Compared to a run with just one CPU core we seem to gain a factor of almost 3 by parallelization. But, actually, this is no fair comparison: My readers have certainly noticed that the CPU-time for the run with one CPU core is significantly longer than that of comparable runs which I described in my last post. At that time we found a CPU-time of only around 75 secs. So, we have a basic deficit of about 15 secs – without real parallelization!

Overhead and RAM consumption of multiprocessing

Why does the run with just one CPU core take so long? Is it functional overhead for organizing and controlling multiprocessing – which may occur despite using just one core and just one “partition” of the dataframe (i.e. the full dataframe)? Well, we can test this easily by reconstructing the runs of my last post a bit:

# Reformulate Run just for cpu-time comparisons 
# **********************************************
b_test = True 

# Function  
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def getleng(dfw, d_lileng):
    # Note 1: The dfw passed is a segment of the original dataframe  
    # Note 2: We use a plain dict d_lileng which is passed in and filled with 
    #         the results - no parallelization manager is involved in this test
    
    num_rows = len(dfw)
    #print(num_rows)
    for i in range(0, num_rows):
        len_w = dfw.iat[i,0]
        idx = dfw.iat[i,33]
        
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]

        j_m = math.floor(len_w/2)+1
        j_l = 2
        j_r = len_w -1
        col_l = 'gram_' + str(j_l)
        col_m = 'gram_' + str(j_m)
        col_r = 'gram_' + str(j_r)
        val_l = dfw.iat[i, j_l+2]
        val_m = dfw.iat[i, j_m+2]
        val_r = dfw.iat[i, j_r+2]
        li_ind = df_.index[   (df_[col_r]==val_r) 
                            & (df_[col_m]==val_m)
                            & (df_[col_l]==val_l)
                            ]
        leng = len(li_ind)
        d_lileng[idx] = leng

    return d_lileng


if b_test: 
    num_w = 100000
    len_dfw = len(dfw_short3)

    # select 100,000 random rows 
    random.seed()
    # Note: random.sample does not repeat values 
    li_ind_p_w = random.sample(range(0, len_dfw), num_w)
    len_li_p_w = len(li_ind_p_w)

    dfw_shortx = dfw_short3.iloc[li_ind_p_w, :].copy() 
    dfw_shortx['lx']  = 0
    dfw_shortx['idx'] = dfw_shortx.index

    d_lileng = {} #

    v_start_time = time.perf_counter()
    d_lileng = getleng(dfw_shortx, d_lileng)
    v_end_time = time.perf_counter()
    cpu_time   = v_end_time - v_start_time
    print("cpu : ", cpu_time)
    print(len(d_lileng))
    mean_length = sum(d_lileng.values()) / len(d_lileng)
    print(mean_length)
    
    dfw_shortx.head(3)

 
How long does such a run take?

cpu :  77.96989408900026
100000
1.25666

Just 78 secs! This is pretty close to the number of 75 secs we got in our last post’s efforts! So, we see that turning to multiprocessing leads to significant functional overhead! The gain in performance, therefore, is less than the factor 3 observed above:

We (only) get a gain in performance by a factor of roughly 2.5 – when using 4 physical CPU cores.

I admit that I have no broad or detailed experience with Python multiprocessing. So, if somebody sees a problem in my code, please, send me a mail.

RAM is not released completely

Another negative side effect was the RAM consumption in my case. Whereas the above test run without multiprocessing – including all required steps and the copying of parts of the loaded dataframe with all 3-char-grams – consumed just 2.2 GB of RAM, I saw a monstrous rise in memory during the parallelized runs:

Starting from a level of 2.4 GB, memory rose to 12.5 GB during the run and then fell back to 4.5 GB. So, there are copying processes involved and memory is not completely released again in the end – despite having everything encapsulated in functions. Repeating the multiprocessed runs even led to a systematic increase in memory of about 150 MB per run.

So, when working with the “multiprocessing” module and big Pandas dataframes you should be a bit careful about the actual RAM consumption during the runs.
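
To keep an eye on it you can log the resident memory of the main process before and after a run. A minimal sketch, again assuming the third-party package “psutil”:

import os
import psutil

proc = psutil.Process(os.getpid())

def print_rss(label):
    # resident set size of the current (main) process in GB
    print(label, " : ", round(proc.memory_info().rss / 1024**3, 2), " GB")

print_rss("RSS before run")
dfw_res = parallelize(dfw_shorter3, getlen)   # the multiprocessed run from above
print_rss("RSS after run")

Note that the Pool’s worker processes allocate additional memory of their own, which such a simple measurement of the main process does not capture.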

Conclusion

This series about finding words in a vocabulary by using two or three 3-char-grams may have appeared a bit “academic” – as one of my readers told me. Why the hell should someone use only a few 3-char-grams to identify words?

Well, I have tried to give some answers to this question: Under certain conditions you may only have fragments of words available; think of text transcribed from a recorded, but distorted communication with Skype or think of physically damaged written text documents. A similar situation may occur when you cannot trust a written string token to be a correctly written word – due to misspelling or other reasons (bad OCR SW or bad document conditions for scans combined with OCR).

In addition: character-grams are actually used as a basis for multiple ML methods for text-analysis tasks, e.g. in Facebook’s Fasttext. They give a solid base for an embedded word vector space which can help to find and measure similarities between correctly written words, but also between correctly written words and fantasy words or misspelled words. Looking a bit at the question of how much a few 3-char-grams help to identify a word is helpful to understand their power in other contexts, too.

We have seen that only three 3-char-grams can identify matching words quite well – even if the words are long words (up to 30 characters). The list of matching words can be kept surprisingly small if and when

  • we use available or reasonable length information about the words we want to find,
  • we define positions for the 3-char-grams inside the words,
  • we put some positional distance between the location of the chosen 3-char-grams inside the words.

For 100,000 random cases with correctly written 3-char-grams the average length of the hit list was below 2 – if the distance between the 3-char-grams was reasonably large compared to the token-length. Similar results were found for using only two 3-char-grams for short words.

We have also covered some very practical aspects regarding search operations on relatively big Pandas dataframes:

The CPU-time for identifying words in a Pandas dataframe by using 3-char-grams is small enough to allow for experiments with around 100,000 tokens even on PCs – within minutes or a quarter of an hour, not hours. As using 3-char-grams corresponds to putting conditions on two or three columns of a dataframe, this result can be generalized to other similar problems with string comparisons on dataframe columns.

The basic RAM consumption of dataframes containing up to fifty-five 3-char-grams per word can be efficiently controlled by using the dtype “category” for the respective columns.
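
For completeness, a minimal sketch of such a conversion – assuming, as in this series, that the 3-char-gram columns of “dfw_uml” are named “gram_0”, “gram_1”, …:

# convert all 3-char-gram columns to the memory-saving dtype "category"
gram_cols = [col for col in dfw_uml.columns if col.startswith('gram_')]
for col in gram_cols:
    dfw_uml[col] = dfw_uml[col].astype('category')
print(dfw_uml.memory_usage(deep=True).sum() / 1024**2, " MB")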

Regarding cpu-time we saw that working with many searches may get a performance boost by a factor well above 2 by using simple multiprocessing techniques based on Python’s “multiprocessing” module. However, this comes with an unpleasant side effect of enormous RAM consumption – at least temporarily.

I hope you had some fun with this series of posts. In a forthcoming series I will apply these results to the task of error correction. Stay tuned.

Links

https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a
https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-

 

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – II

In my last post

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – I

I have discussed some properties of 3-char-grams of words in a German word list. (See the named post for the related structure of a Pandas dataframe (“dfw_uml”) which hosts both the word list and all corresponding 3-char-grams.) In particular I presented the distribution of the maximum and mean number of words per unique 3-char-gram against the position of the 3-char-grams inside the words of my vocabulary.

In the present post I want to use the very same Pandas dataframe to find German words which match two or three 3-char-grams defined at different positions inside some given strings or “tokens” of a text to be analyzed by a computer. One question in such a context is: How do we choose the 3-char-gram-positions to make the selection process effective in the sense of a short list of possible hits?

The dataframe has in my case 2.7 million rows for individual words and up to 55 columns for the values of 3-char-grams at 55 positions. In the case of short words the unused columns are filled with the artificial 3-char-gram “###”.

My objective and a naive approach

Let us assume that we have a string (or “token”) of e.g. 15 characters for a (German) word. The token contains some error in the sense of a wrongly written or omitted letter. Unfortunately, our text-analysis program does not know which letter of the string is wrongly written. So it wants to find words which may fit the general character structure. We therefore pick a few 3-grams at given positions of our token. We then want to find words which match two or three 3-char-grams at different positions of the string – hoping that we chose 3-char-grams which do not contain any error. If we get no match we try a different combination of 3-gram-positions.

In such a brute-force comparison process you would like to quickly pin down the number of matching words with a very limited bunch of 3-grams of the test token. The grams’ positions should be chosen such that the hit list contains a minimum of fitting words. We, therefore, can pose this problem in a different way:

Which positions or positional distances of two or three 3-char-grams inside a string token reduce the list of matching words from a vocabulary to a minimum?

Maybe there is a theoretically well founded solution for this problem. Personally, I am too old and too lazy to analyze such problems with solid mathematical statistics. I take a shortcut and trust my guts. It seems reasonable to me that the selected 3-char-grams should be distributed across the test string with a maximum distance between them. Let us see how far we get with this naive approach.

For the experiments discussed below I use

  • three 3-char-grams for tokens longer than 9 characters.
  • two 3-char-grams for tokens with 9 or fewer letters.

For our first tests we pick correctly written 3-char-grams of test words. This means that we take correctly written words as our test tokens. The handling of tokens with wrongly written characters will be the topic of future articles.

Position combinations of two 3-char-grams for relatively short words

To get some idea about the problem’s structure I first pick a test-word like “eisenbahn”. As it is a relatively short word we start working with only two 3-char-grams. My test-word is an interesting one as it is a compound of two individual words “eisen” and “bahn”. There are many other words in the German language which either contain the first or the second word. And in German we can add even more words to get even longer compounds. So, we would guess with some confidence that there are many hits if we chose two 3-char-grams overlapping each other or being located too close to each other. In addition we would also expect that we should use the length information about the token (or the sought words) during the selection process.

With a stride of 1 we have exactly seven 3-char-grams which reside completely inside our test-word. This gives us 21 options to use two 3-char-grams to find matching words.
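
The number 21 is just the binomial coefficient “7 over 2”; you can verify it quickly:

from itertools import combinations

word    = "eisenbahn"
n_grams = len(word) - 2                            # 7 complete 3-char-grams inside the word
pairs   = list(combinations(range(n_grams), 2))
print(len(pairs))                                  # => 21 position combinations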

To raise the chance for a bunch of alternative results we first look at words with up to 12 characters in our vocabulary and create a respective shortened slice of our dataframe “dfw_uml”:

# Reduce the vocab to strings < max_len => Build dfw_short
#*********************************
#b_exact_length = False
b_exact_length = True

min_len = 4
max_len = 12
length  = 9

mil = min_len - 1 
mal = max_len + 1

if b_exact_length: 
    dfw_short = dfw_uml.loc[(dfw_uml.lower.str.len() == length)]
else:     
    dfw_short = dfw_uml.loc[(dfw_uml.lower.str.len() > mil) & (dfw_uml.lower.str.len() < mal)]
dfw_short = dfw_short.iloc[:, 2:26]
print(len(dfw_short))
dfw_short.head(5)

The above code allows us to choose whether we shorten the vocabulary to words with a length inside an interval or to words with a defined exact length. A quick and dirty code fragment to evaluate some statistics for all possible 21 position combinations for two 3-char-grams is the following:

# Hits for two 3-grams distributed over 9-letter and shorter words
# *****************************************************************
b_full_vocab  = False # operate on the reduced vocabulary 
#b_full_vocab  = True # operate on the full vocabulary 

word  = "eisenbahn"
word  = "löwenzahn"
word  = "kellertür"
word  = "nashorn"
word  = "vogelart"

d_col = { "col_0": "gram_2", "col_1": "gram_3", "col_2": "gram_4", "col_3": "gram_5",
          "col_4": "gram_6", "col_5": "gram_7", "col_6": "gram_8" 
        }
d_val = {}
for i in range(0,7):
    key_val  = "val_" + str(i)
    sl_start = i
    sl_stop  = sl_start + 3
    val = word[sl_start:sl_stop] 
    d_val[key_val] = val
print(d_val)

li_cols = [0] # list of cols to display in a final dataframe 

d_num = {}
# find matching words for all position combinations
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
upper_num = len(word) - 2 
for i in range(0,upper_num): 
    col_name1 = "col_" + str(i)
    val_name1 = "val_"  + str(i)
    col1 = d_col[col_name1]
    val1 = d_val[val_name1]
    col_name2 = ''
    val_name2 = ''
    for j in range(0,upper_num):
        if j <= i : 
            continue 
        else:
            col_name2 = "col_" + str(j)
            val_name2 = "val_"  + str(j)
            col2 = d_col[col_name2]
            val2 = d_val[val_name2]
            
            # matches ?
            if b_full_vocab:
                li_ind = dfw_uml.index[  (dfw_uml[col1]==val1) 
                                    &    (dfw_uml[col2]==val2)
                                      ].tolist()
            else: 
                li_ind = dfw_short.index[(dfw_short[col1]==val1) 
                                    &    (dfw_short[col2]==val2)
                                        ].tolist()
                
            num = len(li_ind)
            key = str(i)+':'+str(j)
            d_num[key] = num
#print("length of d_num = ", len(d_num))
print(d_num)

# bar diagram 
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 6
names  = list(d_num.keys())
values = list(d_num.values())
plt.bar(range(len(d_num)), values, tick_label=names)
plt.xlabel("positions of the chosen two 3-grams", fontsize=14, labelpad=18)
plt.ylabel("number of matching words", fontsize=14, labelpad=18)
font_weight = 'bold' 
font_weight = 'normal' 
if b_full_vocab: 
    add_title = "\n(full vocabulary)"
elif  (not b_full_vocab and not b_exact_length):
    add_title = "\n(reduced vocabulary)"
else:
    add_title = "\n(only words with length = 9)"
    
plt.title("Number of words for different position combinations of two 3-char-grams" + add_title, 
          fontsize=16, fontweight=font_weight, pad=18) 
plt.show()

 
You see that I prepared three different 9-letter test words plus two shorter ones. And we can choose whether we want to find matching words in the full or in the shortened dataframe.

The code, of course, imposes conditions on two columns of the dataframe. As we are only interested in the number of resulting words we use these conditions together with the “index”-attribute of the Pandas dataframe.

Number of matching relatively short words against position combinations for two 3-char-grams

For the full vocabulary we get the following statistics for the test-word “eisenbahn”:

{'val_0': 'eis', 'val_1': 'ise', 'val_2': 'sen', 'val_3': 'enb', 'val_4': 'nba', 'val_5': 'bah', 'val_6': 'ahn'}
{'0:1': 5938, '0:2': 5899, '0:3': 2910, '0:4': 2570, '0:5': 2494, '0:6': 2500, '1:2': 5901, '1:3': 2910, '1:4': 2570, '1:5': 2494, '1:6': 2500, '2:3': 3465, '2:4': 2683, '2:5': 2498, '2:6': 2509, '3:4': 4326, '3:5': 2681, '3:6': 2678, '4:5': 2836, '4:6': 2832, '5:6': 3857}

Note: The first and leftmost 3-char-gram is located at position “0”, i.e. we count positions from zero. Then the last position is at position “word-length – 3”.

The absolute numbers are much too big. But this plot already gives a clear indication that larger distances between the two 3-char-grams are better to limit the size of the result set. When we use the reduced vocabulary slice (with words shorter than 13 letters) we get

{'0:1': 1305, '0:2': 1277, '0:3': 143, '0:4': 48, '0:5': 20, '0:6': 24, '1:2': 1279, '1:3': 143, '1:4': 48, '1:5': 20, '1:6': 24, '2:3': 450, '2:4': 125, '2:5': 23, '2:6': 31, '3:4': 634, '3:5': 58, '3:6': 55, '4:5': 76, '4:6': 72, '5:6': 263}

For some combinations the resulting hit list is much shorter (< 50)! And the effect of some distance between the chosen char-grams gets much more pronounced.

Corresponding data for the words “löwenzahn” and “kellertür” confirm the tendency:

Test-word “löwenzahn”

Watch the lower numbers along the y-scale!

Test-token “kellertür”

Using the information about the word length for optimization

On average the above numbers are still too big for a later detailed comparative analysis with our test token – even on the reduced vocabulary. We expect an improvement by including the length information. What numbers do we get when we use a list with words having exactly the same length as the test-word?

You find the results below:

Test-token “eisenbahn”

{'0:1': 158, '0:2': 155, '0:3': 16, '0:4': 6, '0:5': 1, '0:6': 3, '1:2': 155, '1:3': 16, '1:4': 6, '1:5': 1, '1:6': 3, '2:3': 83, '2:4': 37, '2:5': 3, '2:6': 9, '3:4': 182, '3:5': 17, '3:6': 17, '4:5': 22, '4:6': 22, '5:6': 109}

Test-token “löwenzahn”

{'0:1': 94, '0:2': 94, '0:3': 3, '0:4': 2, '0:5': 2, '0:6': 1, '1:2': 94, '1:3': 3, '1:4': 2, '1:5': 2, '1:6': 1, '2:3': 3, '2:4': 2, '2:5': 2, '2:6': 1, '3:4': 54, '3:5': 43, '3:6': 13, '4:5': 59, '4:6': 14, '5:6': 46}

Test-token “kellertür”

{'0:1': 14, '0:2': 13, '0:3': 13, '0:4': 5, '0:5': 1, '0:6': 1, '1:2': 61, '1:3': 24, '1:4': 5, '1:5': 1, '1:6': 2, '2:3': 36, '2:4': 8, '2:5': 1, '2:6': 3, '3:4': 12, '3:5': 1, '3:6': 1, '4:5': 17, '4:6': 17, '5:6': 17}

For even shorter words like “vogelart” and “nashorn” two 3-char-grams cover almost all of the word. But even here the number of hits is largest for neighboring 3-char-grams:

Test-word “vogelart” (8 letters)

{'val_0': 'vog', 'val_1': 'oge', 'val_2': 'gel', 'val_3': 'ela', 'val_4': 'lar', 'val_5': 'art', 'val_6': 'rt'}
{'0:1': 22, '0:2': 22, '0:3': 1, '0:4': 1, '0:5': 1, '1:2': 23, '1:3': 1, '1:4': 1, '1:5': 2, '2:3': 10, '2:4': 6, '2:5': 5, '3:4': 19, '3:5': 15, '4:5': 24}

Test-word “nashorn” (7 letters)

{'val_0': 'nas', 'val_1': 'ash', 'val_2': 'sho', 'val_3': 'hor', 'val_4': 'orn', 'val_5': 'rn', 'val_6': 'n'}
{'0:1': 1, '0:2': 1, '0:3': 1, '0:4': 1, '1:2': 1, '1:3': 1, '1:4': 1, '2:3': 3, '2:4': 2, '3:4': 26}

So, as an intermediate result I would say:

  • Our naive idea about using 3-char-grams with some distance between them is pretty well confirmed for relatively short words with a length of up to 9 letters and two 3-char-grams.
  • We should use the length information about a test-word or token in addition to diminish the list of reasonably matching words!

Code to investigate 3-char-gram combinations for words with more than 9 letters

Let us now turn to longer words. Here we face a problem: The number of possibilities to choose three 3-char-grams at different positions explodes with word-length (simple combinatorics leading to the binomial coefficient). It even gets difficult to present results graphically. Therefore, I had to restrict myself to gram-combinations with some reasonable distance between them right from the start.
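
How quickly the number of combinations of three positions grows with the word length can be seen with a few lines (math.comb requires Python >= 3.8):

import math

for len_w in (10, 14, 18, 23, 30):
    n_grams = len_w - 2                    # complete 3-char-grams inside the word
    n_combi = math.comb(n_grams, 3)        # number of combinations of three positions
    print(len_w, " letters => ", n_combi, " combinations")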

The following code does not exclude anything and leads to problematic plots:

# Hits for three 3-grams distributed over longer words
# ******************************************************
b_full_vocab  = False # operate on the reduced vocabulary 
#b_full_vocab  = True # operate on the full vocabulary 

#word  = "nachtwache"             # 10
#word  = "morgennebel"            # 11
#word  = "generalmajor"           # 12
#word  = "gebirgskette"           # 12
#word  = "fussballfans"           # 12
#word  = "naturforscher"          # 13
#word  = "frühjahrsputz"          # 13 
#word  = "marinetaucher"          # 13
#word  = "autobahnkreuz"          # 13 
word  = "generaldebatte"         # 14
#word  = "eiskunstläufer"         # 14
#word  = "gastwirtschaft"         # 14
#word  = "vergnügungspark"        # 15 
#word  = "zauberkuenstler"        # 15
#word  = "abfallentsorgung"       # 16 
#word  = "musikveranstaltung"     # 18  
#word  = "sicherheitsexperte"     # 18
#word  = "literaturwissenschaft"  # 21 
#word  = "veranstaltungskalender" # 23

len_w = len(word)
print(len_w, math.floor(len_w/2))

d_col = { "col_0": "gram_2",   "col_1": "gram_3",   "col_2": "gram_4",   "col_3": "gram_5",
          "col_4": "gram_6",   "col_5": "gram_7",   "col_6": "gram_8",   "col_7": "gram_9", 
          "col_8": "gram_10",  "col_9": "gram_11",  "col_10": "gram_12", "col_11": "gram_13", 
          "col_12": "gram_14", "col_13": "gram_15", "col_14": "gram_16", "col_15": "gram_17", 
          "col_16": "gram_18", "col_17": "gram_19", "col_18": "gram_20", "col_19": "gram_21" 
        }
d_val = {}

ind_max = len_w - 2

for i in range(0,ind_max):
    key_val  = "val_" + str(i)
    sl_start = i
    sl_stop  = sl_start + 3
    val = word[sl_start:sl_stop] 
    d_val[key_val] = val
print(d_val)

li_cols = [0] # list of cols to display in a final dataframe 

d_num = {}
li_permut = []

# prepare dfw_short
length  = len_w
mil = min_len - 1 
mal = max_len + 1
b_exact_length = True
if b_exact_length: 
    dfw_short = dfw_uml.loc[(dfw_uml.lower.str.len() == length)]
else:     
    dfw_short = dfw_uml.loc[(dfw_uml.lower.str.len() > mil) & (dfw_uml.lower.str.len() < mal)]
dfw_short = dfw_short.iloc[:, 2:26]
print(len(dfw_short))


# find matching words for all position combinations
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
for i in range(0,ind_max): 
    for j in range(0,ind_max):
        for k in range(0,ind_max):
            if (i,j,k) in li_permut or (i==j or j==k or i==k):
                continue
            else: 
                col_name1 = "col_" + str(i)
                val_name1 = "val_" + str(i)
                col1 = d_col[col_name1]
                val1 = d_val[val_name1]
                col_name2 = "col_" + str(j)
                val_name2 = "val_" + str(j)
                col2 = d_col[col_name2]
                val2 = d_val[val_name2]
                col_name3 = "col_" + str(k)
                val_name3 = "val_" + str(k)
                col3 = d_col[col_name3]
                val3 = d_val[val_name3]
                li_akt_permut = list(itertools.permutations([i, j, k]))
                li_permut = li_permut + li_akt_permut
                #print("i,j,k = ", i, ":", j, ":", k)
                #print(len(li_permut))
                
                # matches ?
                if b_full_vocab:
                    li_ind = dfw_uml.index[  (dfw_uml[col1]==val1) 
                                        &    (dfw_uml[col2]==val2)
                                        &    (dfw_uml[col3]==val3)
                                          ].tolist()
                else: 
                    li_ind = dfw_short.index[(dfw_short[col1]==val1) 
                                        &    (dfw_short[col2]==val2)
                                        &    (dfw_short[col3]==val3)
                                            ].tolist()

                num = len(li_ind)
                key = str(i)+':'+str(j)+':'+str(k)
                d_num[key] = num
print("length of d_num = ", len(d_num))
print(d_num)

# bar diagram 
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 15
fig_size[1] = 6
names  = list(d_num.keys())
values = list(d_num.values())
plt.bar(range(len(d_num)), values, tick_label=names)
plt.xlabel("positions of the chosen two 3-grams", fontsize=14, labelpad=18)
plt.ylabel("number of matching words", fontsize=14, labelpad=18)
font_weight = 'bold' 
font_weight = 'normal' 
if b_full_vocab: 
    add_title = "\n(full vocabulary)"
elif  (not b_full_vocab and not b_exact_length):
    add_title = "\n(reduced vocabulary)"
else:
    add_title = "\n(only words with length = " + str(len_w) + ")"
    
plt.title("Number of words for different position combinations of two 3-char-grams" + add_title, 
          fontsize=16, fontweight=font_weight, pad=18) 
plt.show()

 

An example for the word “generaldebatte” (14 letters) gives:

Supplemental code that significantly reduces the set of gram position combinations to those with larger distances could look like this:

# Analysis for 3-char-gram combinations with larger positional distance
# ********************************************************************

hf = math.floor(len_w/2)

d_l={}
for i in range (2,26):
    d_l[i] = {}

for key, value in d_num.items():
    li_key = key.split(':')
    # print(len(li_key))
    i = int(li_key[0])
    j = int(li_key[1])
    k = int(li_key[2])
    l1 = int(li_key[1]) - int(li_key[0])
    l2 = int(li_key[2]) - int(li_key[1])
    le = l1 + l2 
    # print(le)
    if (len_w < 12): 
        bed1 = (l1<=1 or l2<=1)
        bed2 = (l1 <=2 or l2 <=2)
        bed3 = (((i < hf and j< hf and k< hf) or (i > hf and j> hf and k > hf)))
    elif (len_w < 15): 
        bed1 = (l1<=2 or l2<=2)
        bed2 = (l1 <=3 or l2 <=3)
        bed3 = (((i < hf and j< hf and k< hf) or (i > hf and j> hf and k > hf)))
    elif (len_w <18): 
        bed1 = (l1<=3 or l2<=3)
        bed2 = (l1 <=4 or l2 <=4)
        bed3 = (((i < hf and j< hf and k< hf) or (i > hf and j> hf and k > hf)))
    else: 
        bed1 = (l1<=3 or l2<=3)
        bed2 = (l1 <=4 or l2 <=4)
        bed3 = (((i < hf and j< hf and k< hf) or (i > hf and j> hf and k > hf)))
        
    for j in range(2,26): 
        if le == j:
            if value == 0 or bed1 or ( bed2 and bed3) : 
                continue
            else:
                d_l[j][key] = value

sum_len = 0 
n_p = len_w -2
for j in range(2,n_p):
    num = len(d_l[j])
    print("len = ", j, " : ", "num = ", num) 
    
print()
print("len_w = ", len_w, " half = ", hf)    

if (len_w <= 12):
    p_start = hf 
elif (len_w < 15):
    p_start = hf + 1
elif len_w < 18: 
    p_start = hf + 2 
else: 
    p_start = hf + 2 

    
# Plotting 
# ***********
li_axa = []
m = 0
for i in range(p_start,n_p):
    if len(d_l[i]) == 0:
        continue
    else:
        m+=1
print(m)
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = m * 5
fig_b  = plt.figure(2)

for j in range(0, m):
    li_axa.append(fig_b.add_subplot(m,1,j+1))

m = 0
for i in range(p_start,n_p):
    if len(d_l[i]) == 0:
        continue
    # bar diagram 
    names  = list(d_l[i].keys())
    values = list(d_l[i].values())
    li_axa[m].bar(range(len(d_l[i])), values, tick_label=names)
    li_axa[m].set_xlabel("positions of the 3-grams", fontsize=14, labelpad=12) 
    li_axa[m].set_ylabel("num matching words", fontsize=14, labelpad=12) 
    li_axa[m].set_xticklabels(names, fontsize=12, rotation='vertical')
    #font_weight = 'bold' 
    font_weight = 'normal' 
    if b_full_vocab: 
        add_title = " (full vocabulary)"
    elif  (not b_full_vocab and not b_exact_length):
        add_title = " (reduced vocabulary)"against position-combinations for <em>three</em> 3-char-grams</h1>
    else:
        add_title = " (word length = " + str(len_w) + ")" 

    li_axa[m].set_title("total distance = " + str(i) + add_title, 
              fontsize=16, fontweight=font_weight, pad=16) 
    m += 1
    
plt.subplots_adjust( hspace=0.7 )
fig_b.suptitle("word :  " + word +" (" + str(len_w) +")", fontsize=24, 
              fontweight='bold', y=0.91) 
plt.show()

 

What are the restrictions? Basically

  • we eliminate combinations with two neighboring 3-char-grams,
  • we eliminate 3-char-gram combinations where all 3-grams are placed only on one side of the word – the left or the right one,
  • we pick only 3-char-gram combinations where the sum of the positional distances between the 3-char-grams is somewhat longer than half of the token’s length.

We vary these criteria a bit with the word length. If our basic approach has some value, these criteria should only produce plots which show that the number of hits is reasonably small.

Number of matching words with more than 9 letters against position-combinations for three 3-char-grams

The following plots cover words of growing lengths for dataframes reduced to words with exactly the same length as the chosen token. Not too surprisingly, all of the test words are compound words.

**************************

Test-token “nachtwache”

Test-token “morgennebel”

Test-token “generalmajor”

Test-token “gebirgskette”

Test-token “fussballfans”

Test-token “naturforscher”

Test-token “frühjahrsputz”

Test-token “marinetaucher”

Test-token “autobahnkreuz”

Test-token “generaldebatte”

Test-token “eiskunstläufer”

Test-token “gastwirtschaft”

Test-token “vergnügungspark”

Test-token “zauberkuenstler”

Test-token “abfallentsorgung”

Test-token “musikveranstaltung”

Test-token “sicherheitsexperte”

Test-token “literaturwissenschaft”

Test-token “veranstaltungskalender”

**************************

What we see is that whenever we choose 3-char-gram combinations with a relatively big positional distance between them and a sum of the two distances ≥ word-length / 2 + 2, the number of matching words of the vocabulary is smaller than 10, very often even smaller than 5. The examples “prove” at least that choosing three (correctly written) 3-char-grams with a relatively big distance within a token leads to small numbers of matching vocabulary words.

Conclusion

One can use a few 3-char-grams within string tokens to find matching vocabulary words via a comparison of the char-grams at their respective positions. In this article we have studied how we should choose two or three 3-char-grams within string tokens of length ≤ 9 letters or > 9 letters, respectively, if and when we want to find matching vocabulary words effectively. We found strong indications that the 3-char-grams should be chosen with a relatively big positional distance between them. Using neighboring 3-char-grams will lead to hit numbers which are too big for a detailed analysis.

In the next post I will have a closer look at the CPU-time required for word searches in a vocabulary based on 3-char-gram comparisons for 100,000 string tokens.

 
 

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – I

Words or strings can be segmented into so-called “n-character-grams” or “n-char-grams“. An n-char-gram is a defined sequence of “n” letters, i.e. a special string of length “n”. Such a defined letter sequence – if short enough – can be found at various positions within many words of a vocabulary. Words, or technically speaking “strings”, can e.g. be thought of as being composed of a sequence of defined “2-char-grams” or “3-char-grams”. “n-char-grams” are useful for text-analysis and/or machine-learning methods applied to texts.

Let us assume you have a string representing a test word – but unfortunately with one or two wrong characters or two transposed characters at certain positions inside it. You may nevertheless want to find words in a German vocabulary which match most of the correct letters. One naive approach could be to compare the characters of the string position-wise with corresponding characters of all words in the vocabulary and then pick the word with most matches. As you can trust neither the first nor the last character you quickly understand that a quick and efficient way of raising the probability of finding reasonably fitting words requires comparing not only single letters but also bunches of them, i.e. sub-strings of sequential letters or “n-char-grams”.

This defines the problem of comparing n-char-grams at certain positions inside string “tokens” extracted from unknown texts with n-char-grams of words in a vocabulary. I call a “token” an unchecked distinct letter sequence, i.e. a string, identified by some “Tokenizer”-algorithm which was applied to a text. A tokenizer typically identifies word-separator characters to do its job. A “token” might or might not be a regular word of a language.
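
Just to illustrate the term: a very crude tokenizer can be built with a regular expression. The sketch below has, of course, nothing to do with the Keras Tokenizer mentioned further down; it only shows what splitting a text into tokens means.

import re

text = "Die Eisenbahn fährt am Abend durch das Gebirge."
# split on any character which is not a letter (umlauts included)
tokens = [t.lower() for t in re.split(r"[^a-zA-ZäöüÄÖÜß]+", text) if t]
print(tokens)
# => ['die', 'eisenbahn', 'fährt', 'am', 'abend', 'durch', 'das', 'gebirge']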

This mini-series looks a bit at using “3-character-grams” of words in a German vocabulary residing in a Pandas dataframe. Providing and using 3-grams of a huge vocabulary in a suitable form as input for Python functions working on a Pandas dataframe can, however, be a costly business:

  • RAM: First of all Pandas dataframes containing strings in most of the columns require memory. Using the dtype “category” helps a lot to limit the memory consumption for a dataframe comprising all 3-char-grams of a reasonable vocabulary with some million words. See my last post on this topic.
  • CPU-time: Another critical aspect is the CPU-time required to determine all dataframe rows, i.e. vocabulary words, which contain some given 3-char-grams at defined positions.
  • It is not at all clear how many 3-char-grams are required to narrow down the spectrum of fitting words (of the vocabulary) for a given string to a small amount which can be handled by further detailed analysis modules.

In this article I, therefore, look at “queries” on a Pandas dataframe containing vocabulary words plus their 3-char-grams at defined positions inside the words. Each column contains 3-char-grams at a defined position in the word strings. Our queries apply conditions to multiple selected columns. I first discuss how 3-char-grams split the vocabulary into groups. I present some graphs of how the number of words in such 3-char-gram based groups varies with the 3-gram-position. Then the question of how many 3-char-grams at different positions allow for an identification of a reasonably small bunch of fitting words in the vocabulary will be answered by some elementary experiments. We also look at the CPU-times required for related queries and I discuss some elementary optimization steps. A final short excursion into multiprocessing reveals that we, indeed, can gain a bit of performance.

As a basis for my investigations I use a “vocabulary” based on the work of Torsten Brischalle. See http://www.aaabbb.de/WordList/WordList.php. I have supplemented his word-list by words with different writings for umlauts. The word list contains around 2.8 million German words. Regarding the positional shift of the 3-char-grams of a word against each other I use the term “stride” as explained in my last post
Pandas and 3-char-grams of a vocabulary – reduce memory consumption by datatype „category“.
In addition I use some “padding” and fill up 3-char-grams at and beyond word boundaries with special characters (see the named post for it). In some plots I abbreviated “3-char-grams” to “3-grams”.

Why do I care about CPU-time on Pandas dataframes with 3-char-grams?

CPU-time is important if you want to correct misspelled words in huge bunches of texts with the help of 3-char-gram segmentation. Misspelled words are not only the result of wrong writing, but also of bad scans of old and unclear texts. I have a collection of over 200,000 such scans of German texts. The application of the Keras Tokenizer produced around 1.9 million string tokens.

Around 50% of the most frequent 100,000 tokens in my scanned texts appear to have “errors” as they are no members of the (limited) vocabulary. The following plot shows the percentage of hits in the vocabulary against the absolute number of the most frequent words within the text collection:

The “errors” comprise a variety of (partially legitimate) compound words outside the vocabulary, but there are also wrong letters at different positions and omitted letters due to a bad OCR-quality of the scans. Correcting at least some of the simple errors (such as one or two wrong characters) could improve the quality of the scan results significantly. To perform an analysis based on 3-char-grams we have to compare tens up to hundreds of thousands of tokens with some million vocabulary words. CPU-time matters – especially when using Pandas as a kind of database.

As the capabilities of my Linux workstation are limited I was interested in whether an analysis of, let’s say, 100,000 misspelled tokens based on comparisons of 3-char-grams is within reach on a reasonably equipped PC.

Major Objective: Reduce the amount of vocabulary words matching a few 3-char-grams at different string positions to a minimum

The analysis of possible errors of a scanned word is more difficult than one may think. The errors may be of different nature and may have different consequences for the length and structure of the resulting error-containing word in comparison with the originally intended word. Different error types may appear in combination and the consequences may interfere within a word (or identified token).

What you want to do is to find words in the vocabulary which are comparable to your token – at least in some major parts. The list of such words would be those which with some probability might contain the originally intended word. Then you might apply a detailed and error specific analysis to this bunch of interesting words. Such an analysis may be complemented by an additional analysis on (embedded) word-vector spaces created by ML-trained neural networks to predict words at the end of a sequence of other words. A detailed analysis on a list of words and their character composition in comparison to a token may be CPU-time intensive in itself as it typically comprises string operations.

In addition it is required to do the job a bit differently for certain error types and you also have to make some assumptions regarding the error’s impact on the word-length. But even under simplifying assumptions regarding the number of wrong letters and the correct total amount of letters in a token, you are confronted with a basic problem of error-correction:

You do not know where exactly a mistake may have occurred during scanning or wrong writing.

As a direct consequence you may have to compare 3-char-grams at various positions within the token with corresponding 3-char-grams of vocabulary words. But more queries mean more CPU-time ….

In any case one major objective must be to quickly reduce the amount of words of the vocabulary which you want to use in the detailed error analysis down to a minimum below 10 words with only a few Pandas queries. Therefore, two points are of interest here:

  • How does the number of 3-char-grams for vocabulary words vary with the position?
  • How many correct 3-char-grams define a word in the vocabulary on average?

The two aspects may, of course, be intertwined.

Structure of the Pandas dataframe containing the vocabulary and its 3-char-grams

The image below displays the basic structure of the vocabulary I use in a Pandas dataframe (called “dfw_uml”):

The column “len” contains the length of a word. The column “indw” is identical to “lower”. “indw” allows for a quick change of the index from integers to the word itself. Each column with “3-char-gram” in the title corresponds to a defined position of 3-char-grams.

The stride between adjacent 3-char-grams is obviously 1. I used a “left-padding” of 2. This means that the first 3-char-grams were supplemented by the artificial letter “%” to the left. The first 3-char-gram with all letters residing within the word is called “gram_2” in my case – with its leftmost letter being at position 0 of the word-string and the rightmost letter at position 2. On the right-most side of the word we use the letter “#” to create 3-char-grams reaching outside the word boundary. You see that we get many “###” 3-char-grams for short words at the right side of the dataframe.
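
A small sketch of how such padded 3-char-grams of a single word can be produced (stride 1, left-padding of 2, right-padding with “#”) – consistent with the scheme described above, but not the code I actually used to fill the dataframe:

def gram_list(word, max_pos=55):
    # two '%' on the left, enough '#' on the right for the highest gram position
    padded = 2 * '%' + word + (max_pos + 1) * '#'
    # gram_i starts at padded position i; gram_2 covers word positions 0..2
    return [padded[i:i+3] for i in range(0, max_pos + 1)]

grams = gram_list("angular")
print(grams[:9])
# => ['%%a', '%an', 'ang', 'ngu', 'gul', 'ula', 'lar', 'ar#', 'r##']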

Below I actually use two dataframes: one with 21 3-char-grams up to position 21 and another one with (55) 3-char-grams up to position 55.

Variation of the number of vocabulary words against their length

With growing word-length there are more 3-char-grams to look at. Therefore we should have an idea about the distribution of the number of words with respect to word-length. The following plot shows how many different words we find with growing word-length in our vocabulary:

The Python code for the plot above is :

x1 = []
y1 = []
col_name = 'len'
df_col_grp_len = dfw_uml.groupby(col_name)['indw'].count()
d_len_voc = df_col_grp_len.to_dict()
#print (df_col_grp_len)
#print(d_len_voc) 

len_d = len(d_len_voc)
for key,value in d_len_voc.items():
    x1.append(key)
    y1.append(value)

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 6    
plt.plot(x1,y1, color='darkgreen', linewidth=5)
#plt.xticks(x)
plt.xlabel("length of word", fontsize=14, labelpad=18)
plt.ylabel("number of words ", fontsize=14, labelpad=18)
plt.title("Number of different words against length ") 
plt.show()

 

So, the word-length interval between 2 and 30 covers most of the words. This is consistent with the information provided by Pandas’ “describe()”-function applied to the column “len”:

How does the number of different 3-char-grams vary with the 3-char-gram position?

Technically a 3-char-gram can be called “unique” if it has a specific letter-sequence at a specific defined position. So we would call the 3-char-gram “ena” at position 5 and the 3-char-gram “ena” at position 12 two different unique 3-char-grams despite their matching sequence of letters.

There is only a limited amount of different 3-char-grams at a given position within the words of a given vocabulary. Each 3-char-gram column of our dataframe can thus be divided into multiple “categories” or groups of words containing the same specific 3-char-gram at the position associated with the column. A priori it was not at all clear to me how many vocabulary words we would typically find for a given 3-char-gram at a defined position. I wanted an overview. So let us first look at the number of different 3-char-grams against position.

So how does the distribution of the number of unique 3-char-grams against position look like?

To answer this question we use the Pandas function nunique() in the following way:

# Determine number of unique values in columns (i.e. against 3-char-gram position)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
unique_vals = dfw_uml.nunique()
luv = len(unique_vals)
print(unique_vals)

and get

.....
.....
gram_0          29
gram_1         459
gram_2        3068
gram_3        4797
gram_4        8076
gram_5        8687
gram_6        8743
gram_7        8839
gram_8        8732
gram_9        8625
gram_10       8544
gram_11       8249
gram_12       7829
gram_13       7465
gram_14       7047
gram_15       6700
gram_16       6292
gram_17       5821
gram_18       5413
gram_19       4944
gram_20       4452
gram_21       3989

Already in my last post we saw that the different 3-char-grams at a defined position divide the vocabulary into a relatively small number of groups. For my vocabulary the maximum number of different 3-char-grams is around 8,800 at position 7 (for a stride of 1). This is very small compared to the total number of about 2.7 million words.

Above I looked at the 3-char-grams at the first 21 positions (including left-padding 3-char-grams). We can get a plot by applying the following code:

# Plot for the distribution of categories (i.e. different 3-char-grams) against position
# **************************************
li_x = []
li_y = []
sum = 0 

for i in range(0, luv-4):
    li_x.append(i)
    name = 'gram_' + str(i)
    n_diff_grams = unique_vals[name] 
    li_y.append(n_diff_grams)
    sum += n_diff_grams
print(sum)

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 6
plt.plot(li_x,li_y, color='darkblue', linewidth=5)
plt.xlim(1, 22)
plt.xticks(li_x)
plt.xlabel("3-gram position (3rd character)", fontsize=14, labelpad=18)
plt.ylabel("number of different 3-grams", fontsize=14, labelpad=18)
plt.show()

The plot is:

We see a sharp rise of the number of different 3-char-grams at position 2 (i.e. with the 1st real character of the word) and a systematic decline after position 11. The total sum of all unique 3-char-grams over all positions up to 21 is 136,800. (The number includes padding-left and padding-right 3-char-grams.)

When we extend the number of positions of 3-char-grams from 0 to 55 we get:

The total sum of unique 3-char-grams then becomes 161,259.

Maximum number of words per unique 3-char-gram with position

In a very similar way we can get the maximum number of rows, i.e. of different vocabulary words, appearing for a specific 3-char-gram at a certain position. This specific 3-char-gram defines the largest category or word group at the defined position. The following code creates a plot for the variation of this maximum against the 3-char-gram-position:

# Determine max number of different rows per category
# ***********************************************
x = []
y = []
i_min = 0; i_max = 56
for j in range(i_min, i_max):
    col_name = 'gram_' + str(j)
    maxel = dfw_uml.groupby(col_name)['indw'].count().max()
    x.append(j)
    y.append(maxel)

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 6    
plt.plot(x,y, color='darkred', linewidth=5)
plt.xticks(x)
plt.xlabel("3-gram position (3rd character)", fontsize=14, labelpad=18)
plt.ylabel("max number of words per 3-gram", fontsize=14, labelpad=18)
plt.show()

The result is:

The fact that there are fewer and fewer words with growing length in the vocabulary explains the growing maximum number of words for 3-char-grams at a defined late position. The maximum there corresponds to words matching the artificial 3-char-gram “###”. Also the left-padding 3-char-grams have many fitting words.

Consistent with the number of different categories we get relatively small numbers between positions 3 and 9:

Note that above we looked at the maximum, only. The various 3-char-grams defined at a certain position may have very different numbers of words being consistent with the 3-char-gram.

Mean number of words with 3-char-gram position and variation at a certain position

Another view at the number of words per unique 3-char-gram is given by the average number of words for the 3-char-grams with position. The following graphs were produced by replacing the max()-function in the code above by the mean()-function:

Mean number of words per 3-char-gram category against positions 0 to 55:

Mean number of words per 3-char-gram category against positions 0 to 45:

We see that there is a significant slope after position 40. Going down to lower positions we see a more modest variation.

There is some variation, but the total numbers are much smaller than the maximum numbers. This means that there is only a relatively small number of 3-char-grams which produce really big numbers.

This can also be seen from the following plots where I have ordered the 3-char-grams according to the rising number of matching words for the 3-char-grams at position 5 and at position 10:

Watch the different y-scales! When we limit the number of ordered grams to 8000 the variation is much more comparable:

Conclusion

A quick overview over a vocabulary with the help of Pandas functions shows that the maximum and the mean number of matching words for 3-char-grams at defined positions inside the vocabulary words vary strongly with position and thereby also with word-length.

In the position range from 4 to 11 the mean number of words per unique 3-char-gram is pretty small – around 320. In the position range between 4 and 30 (covering most of the words) the mean number of different words per 3-char-gram is still below 1000.

This gives us some hope for reducing the number of words matching a few 3-char-grams at different positions down to numbers we can handle when applying a detailed analysis. The reason is that we then are interested in the intersection of multiple matching word-groups at the different positions. Respective queries, hit rates and CPU-Times are the topic of the next article:

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – II

Stay tuned …

 

Pandas and 3-char-grams of a vocabulary – reduce memory consumption by datatype “category”

I sit in front of my old laptop and want to pre-process data of a pool of scanned texts for an analysis with ML and conventional algorithms. One of the tasks will be to correct at least some wrongly scanned words by “brute force” methods. A straightforward approach is to compare “3-character-gram” segments of the texts’ distinct words (around 1.9 million) with the 3-char-gram patterns of the words of a reference vocabulary. The vocabulary I use contains around 2.7 million German words.

I started today with a 3-char-gram segmentation of the vocabulary – just to learn that tackling this problem with Pandas and Python pretty soon leads to a significant amount of RAM consumption. RAM is limited on my laptop (16 GB), so I have to keep memory requirements low. In this post I discuss some elementary Pandas tricks which helped me reduce memory consumption.

The task

I load my vocabulary from a CSV file into a Pandas dataframe named “dfw_uml“. The structure of the data is as follows:

The “indw”-column is identical to the “lower”-column. “indw” allows me to quickly use the “lower” version of the words as a (unique) index instead of an integer index. (This is a very useful option for some kinds of analysis. Note: As long as the string-based index is unique a hash function is used internally, which makes operations based on such a string index very fast.)

For all the words in the vocabulary I want to get all their individual 3-char-gram segments. “All” needs to be qualified: When you split a word in 3-char-grams you can do this with an overlap of the segments or without. Similar to filter kernels of CNNs I call the character-shift of consecutive 3-char-grams against each other “stride“.

Let us look at a short word like “angular” (with a length “len” = 7 characters). How many 3-char-grams do we get with a stride of 1? This depends on the padding around the word’s edges with special characters. Let us say we allow for a left-padding with 2 characters “%” and a right-padding with 2 characters “#”. (We assume that these characters are not parts of the words themselves.) Then, with a stride of 1, the 3-char-grams are:

‘%%a’, ‘%an’, ‘ang’, ‘ngu’, ‘gul’, ‘ula’, ‘lar’, ‘ar#’, ‘r##’

I.e., we get len+2 (=9) different 3-char-grams.

However, with a stride of 3 and a left-padding of 0 we get:

‘ang’, ‘ula’, ‘r##’

I.e., len/3 + 1 (=3) different 3-char-grams. (Whether we need the additional 3-char-gram depends on the division rest len%3.) On the right-hand side of the word we have to allow for filling the rightmost 3-char-gram with our extra character “#”.
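
For illustration, a small hypothetical helper function (not the vectorized Pandas approach presented below) which reproduces the two examples for a single word:

# Hypothetical helper - 3-char-grams of a single word for given stride and left-padding
def grams_of_word(word, stride=1, padding=2, char_start='%', char_end='#'):
    padded = char_start * padding + word + char_end * 2   # pad left and right
    grams = []
    pos = 0
    while pos + 3 <= len(padded):
        grams.append(padded[pos:pos+3])
        pos += stride
    return grams

print(grams_of_word('angular', stride=1, padding=2))
# ['%%a', '%an', 'ang', 'ngu', 'gul', 'ula', 'lar', 'ar#', 'r##']
print(grams_of_word('angular', stride=3, padding=0))
# ['ang', 'ula', 'r##']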

The difference in the total number of 3-char-grams is substantial, and it grows linearly with the word length. In a German vocabulary many quite long compound words (composita) may appear. In my vocabulary the longest word has 58 characters:

“telekommunikationsnetzgeschaeftsfuehrungsgesellschaft”

(with umlauts ä and ü written as ‘ae’ and ‘ue’, respectively). So, we talk about 60 (stride 1, padding 2) or around 20 (stride 3, no padding) additional columns required for “all” 3-char-grams.

So, choosing a suitable stride is an important factor to control memory consumption. But for some kinds of analysis you may just want to limit the number (num_grams) of 3-char-grams. E.g. you may set num_grams = 20.

When working with a Pandas table-like data structure it seems logical to arrange all of the 3-char-grams in the form of different columns. Let us take a number of 20 columns for different 3-char-grams as an objective for this article. We can create such 3-char-grams for all vocabulary words either with “stride = 3” or with “stride = 1” and “num_grams = 20”. I pick the latter option.

Which padding and stride values are reasonable?

Padding on the right side of a word is in my opinion always reasonable when creating the 3-char-grams. You will see from the code in the next section how one creates the right-most 3-char-grams of the vocabulary words efficiently. On the left side of a word padding may depend on what you want to analyze. The following stride and left-padding combinations seem reasonable to me for 3-char-grams:

  • stride = 3, left-padding = 0
  • stride = 2, left-padding = 0
  • stride = 2, left-padding = 2
  • stride = 1, left-padding = 2
  • stride = 1, left-padding = 1
  • stride = 1, left-padding = 0

Code to create 3-char-grams

The following function builds the 3-char-grams for the different combinations.

def create_3grams_of_voc(dfw_words, num_grams=20, 
                         padding=2, stride=1, 
                         char_start='%', char_end='#', b_cpu_time=True):
    
    cpu_time = 0.0
    if b_cpu_time:
        v_start_time = time.perf_counter()
    
    # Some checks 
    if stride > 3:
        print('stride > 3 cannot be handled by this function for 3-char-grams')
        return dfw_words, cpu_time
    if stride == 3 and padding > 0:
        print('stride == 3 should be used with padding=0 ')
        return dfw_words, cpu_time 
    if stride == 2 and padding == 1: 
        print('stride == 2 should be used with padding=0, 2 - only')
        return dfw_words, cpu_time 

    st1 = char_start
    st2 = 2*char_start
    
    # begin: starting index for loop below   
    begin = 0 
    if stride == 3:
        begin = 0
    if stride == 2 and padding == 2:
        dfw_words['gram_0'] = st2 + dfw_words['lower'].str.slice(start=0, stop=1)
        begin = 1
    if stride == 2 and padding == 0:
        begin = 0
    if stride == 1 and padding == 2:
        dfw_words['gram_0'] = st2 + dfw_words['lower'].str.slice(start=0, stop=1)
        dfw_words['gram_1'] = st1 + dfw_words['lower'].str.slice(start=0, stop=2)
        begin = 2
    if stride == 1 and padding == 1:    
        dfw_words['gram_0'] = st1 + dfw_words['lower'].str.slice(start=0, stop=2)
        begin = 1
    if stride == 1 and padding == 0:    
        begin = 0
        
    # for num_grams == 20 we have to create elements up to and including gram_21 (range -> 22)
        
    # Note that the operations in the loop occur column-wise, i.e. vectorized
    # => You cannot make them row-dependent
    # We are lucky that slice returns '' 
    for i in range(begin, num_grams+2):
        col_name = 'gram_' + str(i)
        
        sl_start = i*stride - padding
        sl_stop  = sl_start + 3
        
        dfw_words[col_name] = dfw_words['lower'].str.slice(start=sl_start, stop=sl_stop) 
        dfw_words[col_name] = dfw_words[col_name].str.ljust(3, '#')
    
    # We are lucky that nothing happens if not required to fill up  
    #for i in range(begin, num_grams+2):
    #    col_name = 'gram_' + str(i)
    #    dfw_words[col_name] = dfw_words[col_name].str.ljust(3, '#')

    if b_cpu_time:
        v_end_time = time.perf_counter()
        cpu_time   = v_end_time - v_start_time
        
    return dfw_words, cpu_time

The only noticeable thing about this code is the vectorized handling of the columns. (The whole setup of the 3-char-gram columns still requires around 51 secs on my laptop).

We call the function above for stride=1, padding=2, num_grams=20 by the following code in a Jupyter cell:

num_grams = 20; stride = 1; padding = 2
dfw_uml, cpu_time = create_3grams_of_voc(dfw_uml, num_grams=num_grams, padding=padding, stride=stride)
print("cpu_time = ", cpu_time)
print()
dfw_uml.head(3)

RAM consumption

Let us see what the memory consumption looks like. After having loaded all required libraries and some functions my Jupyter plugin “jupyter-resource-usage” for memory consumption shows: “Memory: 208.3 MB“.

When I fill the Pandas dataframe “dfw_uml” with the vocabulary data this number changes to: “Memory: 915 MB“.

Then I create the 3-char-gram-columns for “num_grams = 20; stride = 1; padding = 2” and get:

The memory jumped to “Memory: 4.5 GB“. The OS with some started servers on the laptop takes around 2.6 GB. So, we have already consumed around 45% of the available RAM.

Looking at details by

dfw_uml.info(memory_usage='deep')

shows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   indw     object
 1   word     object
 2   len      int64 
 3   lower    object
 4   gram_0   object
 5   gram_1   object
 6   gram_2   object
 7   gram_3   object
 8   gram_4   object
 9   gram_5   object
 10  gram_6   object
 11  gram_7   object
 12  gram_8   object
 13  gram_9   object
 14  gram_10  object
 15  gram_11  object
 16  gram_12  object
 17  gram_13  object
 18  gram_14  object
 19  gram_15  object
 20  gram_16  object
 21  gram_17  object
 22  gram_18  object
 23  gram_19  object
 24  gram_20  object
 25  gram_21  object
dtypes: int64(1), object(25)
memory usage: 4.0 GB

The memory consumption due to our expanded dataframe is huge. No wonder with around 59.4 million string-like entries in the dataframe! With Pandas we have no direct option of telling the columns to use a fixed 3-character string type. For strings Pandas instead uses the flexible datatype “object“.

Reducing memory consumption by using datatype “category”

Looking at the data we get the impression that one should be able to reduce the amount of required memory because the entries in all of the 3-char-gram-columns are non-unique. Actually, the 3-char-grams mark major groups of words (probably in a typical way for a given western language).

We can get the number of unique 3-char-grams in a column with the following code snippet:

li_unique = []
for i in range(2,22):
    col_name     = 'gram_' + str(i)
    count_unique = dfw_uml[col_name].nunique() 
    li_unique.append(count_unique)
print(li_unique)         

Giving for the 20 columns gram_2 to gram_21:

[3068, 4797, 8076, 8687, 8743, 8839, 8732, 8625, 8544, 8249, 7829, 7465, 7047, 6700, 6292, 5821, 5413, 4944, 4452, 3989]

Compared to 2.7 million rows these numbers are relatively small. This is where the datatype (dtype) “category” comes in handy. We can transform the dtype of the dataframe columns by

for i in range(0,22):
    col_name     = 'gram_' + str(i)
    dfw_uml[col_name] = dfw_uml[col_name].astype('category')

“dfw_uml.info(memory_usage=’deep’)” afterwards gives us:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype   
---  ------   -----   
 0   indw     object  
 1   word     object  
 2   len      int64   
 3   lower    object  
 4   gram_0   category
 5   gram_1   category
 6   gram_2   category
 7   gram_3   category
 8   gram_4   category
 9   gram_5   category
 10  gram_6   category
 11  gram_7   category
 12  gram_8   category
 13  gram_9   category
 14  gram_10  category
 15  gram_11  category
 16  gram_12  category
 17  gram_13  category
 18  gram_14  category
 19  gram_15  category
 20  gram_16  category
 21  gram_17  category
 22  gram_18  category
 23  gram_19  category
 24  gram_20  category
 25  gram_21  category
dtypes: category(22), int64(1), object(3)
memory usage: 739.9 MB

Just 740 MB!
Hey, we have reduced the required memory for the dataframe by more than a factor of 5!
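
If you want to see where the savings come from you can ask Pandas for the memory consumption per column. A quick check (assuming the converted dataframe dfw_uml from above):

# Memory consumption per column in MB (deep=True also accounts for the string contents)
mem_per_col = dfw_uml.memory_usage(deep=True) / (1024**2)
print(mem_per_col)
print("total [MB] = ", mem_per_col.sum())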

Read in data from CSV with already reduced memory

We can now save the result of our efforts in a CSV-file by

# the following statement just helps to avoid an unnamed column during export
dfw_uml = dfw_uml.set_index('indw') 
# export to csv-file
export_path_voc_grams = '/YOUR_PATH/voc_uml_grams.csv'
dfw_uml.to_csv(export_path_voc_grams)

For the reverse process of importing the data from a CSV-file the following question comes up:
How can we enforce that the data are read into dataframe columns with dtype “category”, such that no unnecessary memory is used during the read-in process? The answer is simple:

Pandas allows the definition of the columns’ dtype in form of a dictionary which can be provided as a parameter to the function “read_csv()“.

We define two functions to prepare data import accordingly:


# Function to create a dictionary with dtype information for columns
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def create_type_dict_for_gram_cols(num_grams=20):
    # Expected structure:
    # {'indw': str, 'word': str, 'len': np.int16, 'lower': str, 'gram_0' ... 'gram_21': 'category'}
    
    gram_col_dict = {}
    gram_col_dict['indw']  = 'str'
    gram_col_dict['word']  = 'str'
    gram_col_dict['len']   = np.int16
    gram_col_dict['lower'] = 'str'
    
    for i in range(0,num_grams+2):
        col_name     = 'gram_' + str(i)
        gram_col_dict[col_name] = 'category'
    
    return gram_col_dict

# Function to read in vocabulary with prepared grams 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def readin_voc_with_grams(import_path='', num_grams = 20, b_cpu_time = True):
    if import_path == '':
        import_path = '/YOUR_PATH/voc_uml_grams.csv'
    
    cpu_time = 0.0 
    
    if b_cpu_time:
        v_start_time = time.perf_counter()

    # create dictionary with dtype-settings for the columns
    d_gram_cols = create_type_dict_for_gram_cols(num_grams = num_grams )
    df = pd.read_csv(import_path, dtype=d_gram_cols, na_filter=False)
    
    if b_cpu_time:
        v_end_time = time.perf_counter()
        cpu_time   = v_end_time - v_start_time
   
    return df, cpu_time

With these functions we can read in the CSV file. We restart the kernel of our Jupyter notebook to clear all memory and give it back to the OS.

After having loaded libraries and function we get: “Memory: 208.9 MB”. Now we fill a new Jupyter cell with:

import_path_voc_grams = '/YOUR_PATH/voc_uml_grams.csv'

print("Starting read-in of vocabulary with 3-char-grams")
dfw_uml, cpu_time = readin_voc_with_grams( import_path=import_path_voc_grams,
                                           num_grams = 20)
print()
print("cpu time for df-creation = ", cpu_time)

We run this code and get:

Starting read-in of vocabulary with 3-char-grams

cpu time for df-creation =  16.770479727001657

and : “Memory: 1.4 GB”

“dfw_uml.info(memory_usage=’deep’)” indeed shows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype   
---  ------   -----   
 0   indw     object  
 1   word     object  
 2   len      int16   
 3   lower    object  
 4   gram_0   category
 5   gram_1   category
 6   gram_2   category
 7   gram_3   category
 8   gram_4   category
 9   gram_5   category
 10  gram_6   category
 11  gram_7   category
 12  gram_8   category
 13  gram_9   category
 14  gram_10  category
 15  gram_11  category
 16  gram_12  category
 17  gram_13  category
 18  gram_14  category
 19  gram_15  category
 20  gram_16  category
 21  gram_17  category
 22  gram_18  category
 23  gram_19  category
 24  gram_20  category
 25  gram_21  category
dtypes: category(22), int16(1), object(3)
memory usage: 724.4 MB

Obviously, we save some bytes by “int16” as dtype for len. But Pandas seems to use around 400 MB memory in the background for data handling during the read-in process.

Nevertheless: instead of using 4.5 GB we now consume only 1.4 GB.

Conclusion

Working with huge vocabularies and creating 3-char-gram-segments for each word in the vocabulary is a memory consuming process with Pandas. Using the dtype ‘category’ helps a lot to save memory. For a typical German vocabulary a memory reduction by a factor of 5 is within reach.
When importing data from a CSV-file with already prepared 3-char-gram columns we can enforce the use of dtype ‘category’ for the columns of a dataframe by providing a suitable dictionary to the function “read_csv()”.

Performance of data retrieval from a simple wordlist in a Pandas dataframe with a string based index – I

When preparing a bunch of texts for Machine Learning [ML] there may come a point where you need to eliminate probable junk words or simply wrongly written words from the texts. This is especially true for scanned texts. Let us assume that you have already applied a tokenizer to your texts and that you have created a “bag of words” [BoW] for each individual text or even a global one for all of your texts.

Now, you may want to compare each word in your bag with a checked list of words – a “reference vocabulary” – which you assume to comprise the most relevant words of a language. If you do not find a specific word of your bag in your reference “vocabulary” you may want to put this word into a second bag for a later, more detailed analysis. Such an analysis may be based on a table where the vocabulary words are split into n-grams of characters. These n-grams will be stored in additional columns added to your wordlist, thus turning it into a 2-dimensional array of data.

Such tasks require a tool which – among other things –

  • is able to load 1-, 2-dimensional and sometimes 3-dimensional data structures in a fast way from CSV-files into series-, table- or cube-like data structures in RAM,
  • provides tools to select, filter, retrieve and manipulate data from rows, columns and cells,
  • provides tools to operate on a multitude of rows or columns,
  • provides tools to create some statistics on the data.

All of it pretty fast – which means that the tool must support the creation of an index or indices and/or support vectorized operations (mostly on columns).
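
To illustrate the point about vectorized operations: a column-wise operation in Pandas is typically much faster than an explicit Python loop over the rows. A small sketch with a synthetic integer column (the numbers are only an assumption about the order of magnitude of our data):

import time
import numpy as np
import pandas as pd

# synthetic dataframe with 2 million integer "length" values
df = pd.DataFrame({'len': np.random.randint(1, 30, size=2000000)})

v_start = time.perf_counter()
res_loop = [l + 2 for l in df['len']]     # explicit Python loop over rows
t_loop = time.perf_counter() - v_start

v_start = time.perf_counter()
res_vec = df['len'] + 2                   # vectorized column operation
t_vec = time.perf_counter() - v_start

print("loop: ", t_loop, "   vectorized: ", t_vec)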

I had read in ML books that Pandas is such a tool for a Python environment. A way to accomplish our task would be to load the reference vocabulary into a Pandas structure and check the words of the BoW(s) against it. This means that you try to find the word in the reference list and evaluate the result (positive or negative). And when you are done with this challenge you may want to retrieve additional information from a 2-dimensional Pandas data structure.

This article is about the performance of some data retrieval experiments I recently did on a wordlist of around 2 million words. The objective was to check the existence of tokenized words of some 200,000 texts, each with around 2000 tokens, against this wordlist being embedded in a Pandas dataframe and also to retrieve additional information from other columns of the dataframe.

As we talk about scanned texts and OCR treatment it is very probable that the number of tokens you have to compare with your vocabulary is well above 10 million. It is clear that there is a requirement for performance if you want to work with a standard Linux PC.

Multiple ways to retrieve or query information from a Pandas series or dataframe

When I started to really use Pandas some days ago I became a bit overwhelmed by the documentation – and the differences in comparison to databases. After having used a database like MySQL for years I had a certain vision about the handling of “table”-like data and related performance. Well, I had to swallow some camels!

And when I started to really care about performance I also realized that there were very many ways to “query” a Pandas dataframe – and not all of them will give you the same speed in data retrieval.

This article, therefore, dives a bit below the glittering surface of Pandas and looks at different methods to retrieve rows and certain cell values out of a simple “Pandas dataframe”. To work with some practical data I used a reference vocabulary for the German language based on Wikipedia articles.

The first objective was very simple: Verify that a certain word is an element in the reference vocabulary.
The second objective was a natural extension: Retrieve rows (with multiple columns) for fitting entries – sometimes multiple entries with different word writings.

I was somewhat astonished to see factors between at least 16 and 10,000 for real data retrieval, in comparison with the fastest solution. Just checking the existence of a word in the wordlist proved to be extremely fast after having created a suitable index – without using any data columns at all.

The response times of Pandas depended strongly on the “query” method and the usage of an index.

I hope the information given below and in the next article is useful for other beginners with Pandas. I shall speak of a “query” when I want to select data from a Pandas dataframe and a “resultset” when addressing one or a collection of data rows as the result of a query. Can’t forget my time with databases …

I assume that you already have a valid Pandas installation in a Python 3 environment on your Linux PC. I did my simple experiments with a Jupyter notebook, but, of course, other tools can be used, too.

Loading an example wordlist into a Pandas dataframe

For my small “query” experiments I first loaded a simple list with around 2.1 million words from a text file into a Pandas data structure. This operation created a so called “Pandas series” and also produced a unique integer index which marks each row of the data.

Then I created two additional columns: The first one with all words written in lower case letters. The second one containing the number of characters of the word’s string. By these operations I created a real 2-dim object – a so called Pandas “dataframe”.

Let us follow this line of operations as a first step. So, where do we get a wordlist from?

A friendly engineer (Torsten Brischalle) has provided a German word-list based on Wikipedia which we can use as an example.
See: http://www.aaabbb.de/WordList/WordList.php

We first import the “uppercase”-wordlist. You can download it from this link. On your Linux PC you expand the 7zip archive by standard Linux tools.

This “uppercase” list has the advantage that an index which we will later base on the lowercase writing of the words will (hopefully) be unique. The more extensive wordlist also provided by Brischalle instead comprises multiple writings for some words. The related index would, therefore, not be unique. We shall see that this has a major impact on the response time of the resulting Pandas dataframe.
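
Whether the lowercase writing really leads to a unique index can be checked directly on the Pandas index. A small sketch, assuming the dataframe dfw_smallx which we build in the following sections:

# Check uniqueness of the string-based index (after set_index('indw'), see below)
print(dfw_smallx.index.is_unique)        # hopefully True for the "uppercase" list
print(dfw_smallx.index.has_duplicates)   # ... and False here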

The wordlists, after 7zip-expansion, all are very simple text-files: Each line contains just one word.

We shall nevertheless work with a 2-dim general Pandas “dataframe” instead of a “series”. A reason is that in a real data analysis environment we may want to add multiple columns with more information later on. E.g. columns for n-grams of character sequences constituting the word or for other information such as frequencies, consonant to vowel ratio, etc. And then we would work on 2-dim data structures anyway.

Loading the data into a Pandas dataframe and creating an index based on lowercase word representation

Let us import the wordlist data by the help of some Python code in a Jupyter cell (in my case from a directory “/py/projects/CA22/catch22/Wortlisten/”):

import os
import time
import pandas as pd
import numpy as np

dfw_smallx = pd.read_csv('/py/projects/CA22/catch22/Wortlisten/word_list_german_uppercase_spell_checked.txt', dtype='str', na_filter=False)
dfw_smallx.columns = ['word']
dfw_smallx['indw'] = dfw_smallx['word']

pdx_shape = dfw_smallx.shape
print("shape of dfw_smallx = ", pdx_shape)
pdx_rows = pdx_shape[0]
pdx_cols = pdx_shape[1]
print("rows of dfw_smallx = ", pdx_rows)
print("cols 
of dfw_smallx = ", pdx_cols)

dfw_smallx.head(8)

You see that we need to import the Pandas module besides other standard modules. Then you find that Pandas obviously provides a function “read_csv()” to import CSV like text files. You find more about it in the Pandas documentation here.
The CSV import should in our case be a matter of a few seconds, only.

A column name or column names can be added to a Pandas series or Pandas dataframe, respectively, afterward.

Why did I use the parameter “na_filter“? Well, this was done to handle a special value in the wordlist, namely the word “NULL”. By default read_csv() interprets strings like “NULL” as missing values; without the named parameter we would therefore get an empty (NaN) entry in the dataframe for this input value. You find more information on this topic in the Pandas documentation on the “read_csv()”-function.
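
A minimal illustration of the effect with a small inline CSV (just an assumed toy example, not our real wordlist):

import io
import pandas as pd

csv_text = "word\nAAL\nNULL\nMANN\n"

df_default = pd.read_csv(io.StringIO(csv_text))                   # 'NULL' becomes NaN
df_kept    = pd.read_csv(io.StringIO(csv_text), na_filter=False)  # 'NULL' stays a string

print(df_default['word'].tolist())   # ['AAL', nan, 'MANN']
print(df_kept['word'].tolist())      # ['AAL', 'NULL', 'MANN']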

The reader also notices that I just named the single data column (resulting from the import) ‘word’ and then copied this column to another new column called ‘indw’. I shall use the latter column as an index in a minute. I then print out some information on the dataframe:

shape of dfw_smallx =  (2188246, 2)
rows of dfw_smallx =  2188246
cols of dfw_smallx =  2

	word 			indw
0 	AACHENER 		AACHENER
1 	AACHENERIN 		AACHENERIN
2 	AACHENERINNEN 	AACHENERINNEN
3 	AACHENERN 		AACHENERN
4 	AACHENERS 		AACHENERS
5 	AACHENS 		AACHENS
6 	AAL 			AAL
7 	AALE			AALE

Almost 2.2 million words. OK, I do not like uppercase. I want a lowercase representation to be used as an index later on. This gives me the opportunity to apply an operation to a whole column with 2.2 mio words.

The creation of our string based index can be achieved by the “set_index()” function:

dfw_smallx['indw'] = dfw_smallx['word'].str.lower()
dfw_smallx = dfw_smallx.set_index('indw')
dfw_smallx.head(5)

Leading after less than 0.5 secs (!) to:

 				word
indw 	
aachener 		AACHENER
aachenerin 		AACHENERIN
aachenerinnen 	AACHENERINNEN
aachenern 		AACHENERN
aacheners 		AACHENERS

Now, let us add one more column containing the length information on the word(s).

This can be done by two methods

  • dfw_smallx[‘len’] = dfw_smallx[‘word’].str.len()
  • dfw_smallx[‘len’] = dfw_smallx[‘word’].apply(len)

The second method is a bit faster (its runtime is roughly 0.7 times that of the first one), but it does not work on NaN cells of a column. In our case this is no problem; we get:

# Add a column with len information 
v_start_time = time.perf_counter()
dfw_smallx['len'] = dfw_smallx['word'].apply(len)
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
dfw_smallx.head(3)

Total CPU time  0.3626117290004913

			word 			len
indw 		
aachener 		AACHENER 		8
aachenerin 		AACHENERIN 		10
aachenerinnen 	AACHENERINNEN 	13

Basics of addressing data in a Pandas dataframe

Ok, we have loaded our reference list of words into a dataframe. A Pandas “dataframe” basically is a 2-dimensional data structure based on Numpy array technology for the columns. Now, we want to address data in specific rows or cells. Below I repeat some basics for the retrieval of single values from a dataframe:

Each “cell” has a two dimensional integer-“index” – a tuple [i,j], with “i” identifying a row and “j” a column. You can use respective integer values by the “iloc[]“-operator. E.g. dfw_smallx.iloc[2,1] will give you the value “13”.

The “loc[]“-operator instead works with “labels” given to the rows and columns; in the most primitive form as :

dataframe.loc[row label, column label], e.g. dfw_smallx.loc[‘aachenerinnen’, ‘len’].

Labels have to be defined. For columns you may define names (often already during construction of the dataframe). For rows you may define an index – as we actually did above. If you want to compare this with databases: You define a primary key (sometimes based on column-combinations).

Other almost equivalent methods

  • the iat[]-operator,
  • the at[]-operator,
  • array like usage of the column label + row-index
  • and the so called dot-notation

for the retrieval of single values are presented in the following code snippet:

print(dfw_smallx.iloc[2,1])
print(dfw_smallx.iat[2,1])
print(dfw_smallx['len'][2]) 
print(dfw_smallx.loc['aachenerinnen', 'len'])
print(dfw_smallx.at['aachenerinnen', 'len'])
print(dfw_smallx.len.aachenerinnen)

13
13
13
13
13
13

Note that the “iat[]” and “at[]” operators can only be used for cells, so both row and column values have to be provided; the other methods can be used for more general slicing of columns.

Slicing

Slicing is in general supported by the “:” notation – just as in NumPy. So, with the notation “labelvalue1 : labelvalue2” one can define slices. This works even for string label values:

words = dfw_smallx.loc['alt':'altersschwach', 'word':'len']
print(words)
                          word  len
indw                               
alt                        ALT    3
altaachener        ALTAACHENER   11
altablage            ALTABLAGE    9
altablagen          ALTABLAGEN   10
altablagerung    ALTABLAGERUNG   13
...                        ...  ...
altersschnitt    ALTERSSCHNITT   13
altersschnitts  ALTERSSCHNITTS   14
altersschrift    ALTERSSCHRIFT   13
altersschutz      ALTERSSCHUTZ   12
altersschwach    ALTERSSCHWACH   13

[3231 rows x 2 columns]

Queries with conditions on column values – and Pandas objects containing multiple results

Now let us look at some queries with conditions on columns and the form of the “result sets” when more than just a single value is returned in a Pandas response. Multiple return values may mean multiple rows (with one or more column values) or just one row with multiple column values. Two points are noteworthy:

  1. Pandas produces a new dataframe or series with multiple rows if multiple values are returned. Whenever we get a Pandas “object” with an internal structure as a Pandas response, we need to narrow down the result to the particular value we want to see.
  2. To grasp a certain value you need to include some special methods already in the “query” or to apply a method to the result series or dataframe.

An interesting type of “query” for a Pandas dataframe is provided by the “query()“-function: it allows us to retrieve rows or single values by conditions on column entries. But conditions can also be supplied when using the “loc[]” operator:

w1 = dfw_smallx.loc['null', 'word']
pd_w2 = dfw_smallx.loc['null'] # resulting in a series 
w2 = pd_w2[0]
pd_w3 = dfw_smallx.loc[dfw_smallx['word'] == 'NULL', 'word']
w3 = pd_w3[0]
pd_w4 = dfw_smallx.query('word == "NULL"')
w4 = pd_w4.iloc[0,0]
w5 = dfw_smallx.query('word == "NULL"').iloc[0,0]
w6 = dfw_smallx.query('word == "NULL"').word.item()
print("w1 = ", w1)
print("pd_w2 = ", pd_w2)
print("w2 = ", w2)
print("pd_wd3 = ", pd_w3)
print("w3 = ", w3)
print("w4 = ", w4)
print("w5 = ", w5)
print("w6 = ", w6)

I have added a prefix “pd_” to some variables where I expected a Pandas dataframe to be the answer. And really:

w1 =  NULL
pd_w2 =  word    NULL
len        4
Name: null, dtype: object
w2 =  NULL
pd_wd3 =  indw
null    NULL
Name: word, dtype: object
w3 =  NULL
w4 =  NULL
w5 =  NULL
w6 =  NULL

Noteworthy: For loc[] (in contrast to iloc[]) the last value of the slice definition is included in the result set.
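
A small example of this difference (using the dfw_smallx dataframe from above):

# loc[]: label-based slicing - the end label is included
print(dfw_smallx.loc['aachener':'aachenerinnen', 'word'])   # 3 rows

# iloc[]: position-based slicing - the end position is excluded
print(dfw_smallx.iloc[0:2, 0])                              # 2 rows only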

Retrieving data by a list of index values

As soon as you dig a bit deeper into the Pandas documentation you will certainly find the following way to retrieve multiple rows by providing a list of index values:

# Retrieving col values by a list of index values 
inf = ['null', 'mann', 'frau']
wordx = dfw_smallx.loc[inf, 'word']
wx = wordx.iloc[0:3] # resulting in a Pandas series 
print(wx.iloc[0])
print(wx.iloc[1])
print(wx.iloc[2])
NULL
MANN
FRAU

Intermediate conclusion

The variety of options even in our very simple scenario to retrieve values from a wordlist (with an additional column) is almost overwhelming. They all serve their purpose – depending on the structure of the dataframe and your knowledge on the data positions.

But actually, in our scenario for analyzing BoWs, we have a very simple task ahead of us: We just want to check whether a word or a list of words exists in the list, i.e. if there is an entry for a word (written in small letters) in the list. What about the performance of the different methods for this task?

Actually, there is a very simple answer for the existence check – giving you maximum performance.
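
Without spoiling the next article too much: one obvious candidate for such a pure existence check is a simple membership test against the string-based index – a sketch of the idea:

# Pure existence check of a (lowercase) token against the string-based index
token = 'mann'
b_exists = token in dfw_smallx.index
print(b_exists)    # True, if the word is part of the reference vocabulary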

But to learn a bit more about the performance of different forms of Pandas queries we also shall look at methods performing some real data retrieval from the columns of a row addressed by some (string) index value.

These will be the topics of the next article. Stay tuned …

Links

Various ways of “querying” Pandas dataframes
https://www.sharpsightlabs.com/blog/pandas-loc/
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
https://cmsdk.com/python/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.html
https://pythonexamples.org/pandas-dataframe-query/

The book “Mastering Pandas” by Ashish Kumar, 2nd edition, 2019, Packt Publishing Ltd. may be of help – though it does not really comment on performance issues on this level.

NULL values
https://stackoverflow.com/questions/50683765/how-to-treat-null-as-a-normal-string-with-pandas