Pandas and 3-char-grams of a vocabulary – reduce memory consumption by datatype “category”

I sit in front of my old laptop and want to pre-process data of a pool of scanned texts for an analysis with ML and conventional algorithms. One of the tasks will be to correct at least some wrongly scanned words by “brute force” methods. A straightforward approach is to compare “3-character-gram” segments of the texts’ distinct words (around 1.9 million) with the 3-char-gram patterns of the words of a reference vocabulary. The vocabulary I use contains around 2.7 million German words.

I started today with a 3-char-gram segmentation of the vocabulary – just to learn that tackling this problem with Pandas and Python pretty soon leads to a significant amount of RAM consumption. RAM is limited on my laptop (16 GB), so I have to keep memory requirements low. In this post I discuss some elementary Pandas tricks which helped me reduce memory consumption.

The task

I load my vocabulary from a CSV file into a Pandas dataframe named “dfw_uml“. The dataframe has four columns: “indw”, “word”, “len” (the word length) and “lower” (the lowercase version of the word).

The “indw”-column is identical to the “lower”-column. “indw” allows me to quickly use the “lower” version of the words as a (unique) index instead of an integer index. (This is a very useful option for some kinds of analysis. Note: As long as the string-based index is unique, Pandas uses a hash table internally, which makes operations based on a string index very fast.)

For all the words in the vocabulary I want to get all their individual 3-char-gram segments. “All” needs to be qualified: When you split a word into 3-char-grams you can do this with or without an overlap of the segments. In analogy to the filter kernels of CNNs I call the character-shift between consecutive 3-char-grams the “stride“.

Let us look at a short word like “angular” (with a length “len” = 7 characters). How many 3-char-grams do we get with a stride of 1? This depends on the padding of the word’s edges with special characters. Let us say we allow for a padding of 2 characters “%” on the left side of the word and 2 characters “#” on the right side (assuming that these characters are not part of the words themselves). Then, with a stride of “1”, the 3-char-grams are:

‘%%a’, ‘%an’, ‘ang’, ‘ngu’, ‘gul’, ‘ula’, ‘lar’, ‘ar#’, ‘r##’

I.e., we get len+2 (=9) different 3-char-grams.

However, with a stride of 3 and a left-padding of 0 we get :

‘ang’, ‘ula’, ‘r##’

I.e., len/3 + 1 (=3) different 3-char-grams. (Whether we need the additional 3-char-gram depends on the remainder len%3.) On the right-hand side of the word we have to allow for filling up the rightmost 3-char-gram with our extra character “#”.
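
Just to make the counting above concrete, here is a minimal, non-vectorized sketch for a single word. The function name “grams_of_word” is my own choice for illustration only; the vectorized Pandas version for the whole vocabulary follows in a later section.

def grams_of_word(word, stride=1, padding=2, char_start='%', char_end='#'):
    # left-pad the word and cut out 3-character windows with the given stride;
    # short windows at the right edge are filled up with the end character
    w = char_start * padding + word
    return [w[i:i+3].ljust(3, char_end) for i in range(0, len(w), stride)]

print(grams_of_word('angular', stride=1, padding=2))
# ['%%a', '%an', 'ang', 'ngu', 'gul', 'ula', 'lar', 'ar#', 'r##']
print(grams_of_word('angular', stride=3, padding=0))
# ['ang', 'ula', 'r##']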

The difference in the total number of 3-char-grams is substantial. And it grows linearly with the word length.
In a German vocabulary many rather long words (compounds, “Komposita”) appear. In my vocabulary the longest word has 58 characters:

“telekommunikationsnetzgeschaeftsfuehrungsgesellschaft”

(with the umlauts ä and ü written as ‘ae’ and ‘ue’, respectively). So, we talk about 60 additional columns (stride = 1) or 20 additional columns (stride = 3) required for “all” 3-char-grams.

So, choosing a suitable stride is an important factor to control memory consumption. But for some kinds of analysis you may in addition want to limit the number of 3-char-grams (num_grams); e.g. you may set num_grams = 20.

When working with a Pandas table-like data structure it seems logical to arrange all of the 3-char-grams in the form of different columns. Let us take 20 columns for different 3-char-grams as the objective of this article. We can create such 3-char-grams for all vocabulary words either with “stride = 3” or with “stride = 1” and “num_grams = 20”. I pick the latter option.

Which padding and stride values are reasonable?

Padding on the right side of a word is in my opinion always reasonable when creating the 3-char-grams. You will see from the code in the next section how one creates the right-most 3-char-grams of the vocabulary words efficiently. On the left side of a word padding may depend on what you want to analyze. The following stride and left-padding combinations seem reasonable to me for 3-char-grams:

  • stride = 3, left-padding = 0
  • stride = 2, left-padding = 0
  • stride = 2, left-padding = 2
  • stride = 1, left-padding = 2
  • stride = 1, left-padding = 1
  • stride = 1, left-padding = 0

Code to create 3-char-grams

The following function builds the 3-char-grams for the different combinations.

import time

def create_3grams_of_voc(dfw_words, num_grams=20, 
                         padding=2, stride=1, 
                         char_start='%', char_end='#', b_cpu_time=True):
    
    cpu_time = 0.0
    if b_cpu_time:
        v_start_time = time.perf_counter()
    
    # Some checks 
    if stride > 3:
        print('stride > 3 cannot be handled by this function for 3-char-grams')
        return dfw_words, cpu_time
    if stride == 3 and padding > 0:
        print('stride == 3 should be used with padding=0')
        return dfw_words, cpu_time 
    if stride == 2 and padding == 1: 
        print('stride == 2 should be used with padding=0 or padding=2 only')
        return dfw_words, cpu_time 

    st1 = char_start
    st2 = 2*char_start
    
    # begin: starting index for the loop below   
    begin = 0 
    if stride == 3:
        begin = 0
    if stride == 2 and padding == 2:
        dfw_words['gram_0'] = st2 + dfw_words['lower'].str.slice(start=0, stop=1)
        begin = 1
    if stride == 2 and padding == 0:
        begin = 0
    if stride == 1 and padding == 2:
        dfw_words['gram_0'] = st2 + dfw_words['lower'].str.slice(start=0, stop=1)
        dfw_words['gram_1'] = st1 + dfw_words['lower'].str.slice(start=0, stop=2)
        begin = 2
    if stride == 1 and padding == 1:    
        dfw_words['gram_0'] = st1 + dfw_words['lower'].str.slice(start=0, stop=2)
        begin = 1
    if stride == 1 and padding == 0:    
        begin = 0
        
    # For num_grams == 20 we have to create columns up to and including gram_21 (range -> 22).
    # Note that the operations in the loop work column-wise, i.e. vectorized
    # => you cannot make them row-dependent.
    # We rely on the fact that str.slice() returns '' for out-of-range positions,
    # so ljust() fills the remaining places of short words with the end character.
    for i in range(begin, num_grams+2):
        col_name = 'gram_' + str(i)
        
        sl_start = i*stride - padding
        sl_stop  = sl_start + 3
        
        dfw_words[col_name] = dfw_words['lower'].str.slice(start=sl_start, stop=sl_stop) 
        dfw_words[col_name] = dfw_words[col_name].str.ljust(3, char_end)

    if b_cpu_time:
        v_end_time = time.perf_counter()
        cpu_time   = v_end_time - v_start_time
        
    return dfw_words, cpu_time

The only noticeable thing about this code is the vectorized handling of the columns. (The whole setup of the 3-char-gram columns still requires around 51 secs on my laptop).

We call the function above for stride=1, padding=2, num_grams=20 by the following code in a Jupyter cell:

num_grams = 20; stride = 1; padding = 2
dfw_uml, cpu_time = create_3grams_of_voc(dfw_uml, num_grams=num_grams, padding=padding, stride=stride)
print("cpu_time = ", cpu_time)
print()
dfw_uml.head(3)

RAM consumption

Let us see what the memory consumption looks like. After having loaded all required libraries and some functions, my Jupyter plugin “jupyter-resource-usage” shows: “Memory: 208.3 MB“.

When I fill the Pandas dataframe “dfw_uml” with the vocabulary data this number changes to: “Memory: 915 MB“.

Then I create the 3-char-gram columns for “num_grams = 20; stride = 1; padding = 2”.

The memory jumps to “Memory: 4.5 GB“. The OS with some started servers on the laptop takes around 2.6 GB. So, we have already consumed around 45% of the available RAM.

Looking at details by

dfw_uml.info(memory_usage='deep')

shows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   indw     object
 1   word     object
 2   len      int64 
 3   lower    object
 4   gram_0   object
 5   gram_1   object
 6   gram_2   object
 7   gram_3   object
 8   gram_4   object
 9   gram_5   object
 10  gram_6   object
 11  gram_7   object
 12  gram_8   object
 13  gram_9   object
 14  gram_10  object
 15  gram_11  object
 16  gram_12  object
 17  gram_13  object
 18  gram_14  object
 19  gram_15  object
 20  gram_16  object
 21  gram_17  object
 22  gram_18  object
 23  gram_19  object
 24  gram_20  object
 25  gram_21  object
dtypes: int64(1), object(25)
memory usage: 4.0 GB

The memory consumption of our expanded dataframe is huge. No wonder with around 59.4 million string-like entries in the dataframe! With Pandas we have no direct option to tell a column to use a fixed-length 3-character string type. For strings Pandas instead uses the flexible datatype “object“.
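
If you want to quantify this for an individual column, the deep memory usage of a Series can be queried directly (a quick check; the column “gram_2” is just an arbitrary example):

# Deep memory usage of one 3-char-gram column with dtype "object":
# every cell holds a full Python string object, i.e. tens of bytes per row
mem_bytes = dfw_uml['gram_2'].memory_usage(deep=True)
print(round(mem_bytes / 1024**2, 1), "MB")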

Reducing memory consumption by using datatype “category”

Looking at the data we get the impression that one should be able to reduce the amount of required memory because the entries in all of the 3-char-gram columns are far from unique. Actually, the 3-char-grams mark major groups of words (probably in a way which is typical for a given western language).

We can get the number of unique 3-char-grams in a column with the following code snippet:

li_unique = []
for i in range(2,22):
    col_name     = 'gram_' + str(i)
    count_unique = dfw_uml[col_name].nunique() 
    li_unique.append(count_unique)
print(li_unique)         

This gives us, for the 20 columns gram_2 to gram_21:

[3068, 4797, 8076, 8687, 8743, 8839, 8732, 8625, 8544, 8249, 7829, 7465, 7047, 6700, 6292, 5821, 5413, 4944, 4452, 3989]

Compared to 2.7 million rows these numbers are relatively small. This is where the datatype (dtype) “category” comes in handy. We can transform the dtype of the gram-columns of the dataframe by:

for i in range(0,22):
    col_name     = 'gram_' + str(i)
    dfw_uml[col_name] = dfw_uml[col_name].astype('category')

“dfw_uml.info(memory_usage=’deep’)” afterwards gives us:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype   
---  ------   -----   
 0   indw     object  
 1   word     object  
 2   len      int64   
 3   lower    object  
 4   gram_0   category
 5   gram_1   category
 6   gram_2   category
 7   gram_3   category
 8   gram_4   category
 9   gram_5   category
 10  gram_6   category
 11  gram_7   category
 12  gram_8   category
 13  gram_9   category
 14  gram_10  category
 15  gram_11  category
 16  gram_12  category
 17  gram_13  category
 18  gram_14  category
 19  gram_15  category
 20  gram_16  category
 21  gram_17  category
 22  gram_18  category
 23  gram_19  category
 24  gram_20  category
 25  gram_21  category
dtypes: category(22), int64(1), object(3)
memory usage: 739.9 MB

Just 740 MB!
Hey, we have reduced the required memory for the dataframe by more than a factor of 4!
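
The reason for the saving: a “category” column stores each distinct 3-char-gram only once (in its list of categories) and keeps just a small integer code per row. The following sketch shows how to look behind the curtain; “gram_2” again is just an example column:

col = dfw_uml['gram_2']
print(col.dtype)                   # category
print(len(col.cat.categories))     # number of distinct 3-char-grams (a few thousand)
print(col.cat.codes.dtype)         # small integer type (e.g. int16 => 2 bytes per row)
print(round(col.memory_usage(deep=True) / 1024**2, 1), "MB")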

Read in data from CSV with already reduced memory

We can now save the result of our efforts in a CSV-file by

# the following statement just helps to avoid an unnamed column during export
dfw_uml = dfw_uml.set_index('indw') 
# export to csv-file
export_path_voc_grams = '/YOUR_PATH/voc_uml_grams.csv'
dfw_uml.to_csv(export_path_voc_grams)

For the reverse process of importing the data from a CSV-file the following question comes up:
How can we enforce that the data are read into dataframe columns of dtype “category”, such that no unnecessary memory is used during the read-in process? The answer is simple:

Pandas allows the definition of the columns’ dtype in form of a dictionary which can be provided as a parameter to the function “read_csv()“.

We define two functions to prepare data import accordingly:


import time
import numpy as np
import pandas as pd

# Function to create a dictionary with dtype information for columns
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def create_type_dict_for_gram_cols(num_grams=20):
    # Expected structure:
    # {indw: str, word: str, len: np.int16, lower: str, gram_0 ... gram_21: 'category'}
    
    gram_col_dict = {}
    gram_col_dict['indw']  = 'str'
    gram_col_dict['word']  = 'str'
    gram_col_dict['len']   = np.int16
    gram_col_dict['lower'] = 'str'
    
    for i in range(0,num_grams+2):
        col_name     = 'gram_' + str(i)
        gram_col_dict[col_name] = 'category'
    
    return gram_col_dict

# Function to read in vocabulary with prepared grams 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def readin_voc_with_grams(import_path='', num_grams = 20, b_cpu_time = True):
    if import_path == '':
        import_path = '/YOUR_PATH/voc_uml_grams.csv'
    
    cpu_time = 0.0 
    
    if b_cpu_time:
        v_start_time = time.perf_counter()

    # create a dictionary with dtype-settings for the columns
    d_gram_cols = create_type_dict_for_gram_cols(num_grams = num_grams )
    df = pd.read_csv(import_path, dtype=d_gram_cols, na_filter=False)
    
    if b_cpu_time:
        v_end_time = time.perf_counter()
        cpu_time   = v_end_time - v_start_time
   
    return df, cpu_time

With these functions we can read in the CSV file. We restart the kernel of our Jupyter notebook to clear all memory and give it back to the OS.

After having loaded the libraries and functions we get: “Memory: 208.9 MB”. Now we fill a new Jupyter cell with:

import_path_voc_grams = '/YOUR_PATH/voc_uml_grams.csv'

print("Starting read-in of vocabulary with 3-char-grams")
dfw_uml, cpu_time = readin_voc_with_grams( import_path=import_path_voc_grams,
                                           num_grams = 20)
print()
print("cpu time for df-creation = ", cpu_time)

We run this code and get :

Starting read-in of vocabulary with 3-char-grams

cpu time for df-creation =  16.770479727001657

and : “Memory: 1.4 GB”

“dfw_uml.info(memory_usage=’deep’)” indeed shows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype   
---  ------   -----   
 0   indw     object  
 1   word     object  
 2   len      int16   
 3   lower    object  
 4   gram_0   category
 5   gram_1   category
 6   gram_2   category
 7   gram_3   category
 8   gram_4   category
 9   gram_5   category
 10  gram_6   category
 11  gram_7   category
 12  gram_8   category
 13  gram_9   category
 14  gram_10  category
 15  gram_11  category
 16  gram_12  category
 17  gram_13  category
 18  gram_14  category
 19  gram_15  category
 20  gram_16  category
 21  gram_17  category
 22  gram_18  category
 23  gram_19  category
 24  gram_20  category
 25  gram_21  category
dtypes: category(22), int16(1), object(3)
memory usage: 724.4 MB

Obviously, we save some bytes by using “int16” as the dtype for “len”. But Pandas seems to use around 400 MB of memory in the background for data handling during the read-in process.

Nevertheless: instead of using 4.5 GB we now consume only 1.4 GB.

Conclusion

Working with huge vocabularies and creating 3-char-gram segments for each word in the vocabulary is a memory consuming process with Pandas. Using the dtype ‘category’ helps a lot to save memory. For a typical German vocabulary a memory reduction by a factor of 4 is within reach.
When importing data from a CSV-file with already prepared 3-char-gram columns we can enforce the use of dtype ‘category’ for the columns of a dataframe by providing a suitable dictionary to the function “read_csv()”.

Performance of data retrieval from a simple wordlist in a Pandas dataframe with a string based index – I

When preparing a bunch of texts for Machine Learning [ML] there may come a point where you need to eliminate probable junk words or simply wrongly written words from the texts. This is especially true for scanned texts. Let us assume that you have already applied a tokenizer to your texts and that you have created a “bag of words” [BoW] for each individual text or even a global one for all of your texts.

Now, you may want to compare each word in your bag with a checked list of words – a “reference vocabulary” – which you assume to comprise the most relevant words of a language. If you do not find a specific word of your bag in your reference “vocabulary” you may want to put this word into a second bag for a later, more detailed analysis. Such an analysis may be based on a table in which the vocabulary words are split into n-grams of characters. These n-grams will be stored in additional columns added to your wordlist, thus turning it into a 2-dimensional array of data.

Such tasks require a tool which – among other things –

  • is able to load 1-, 2-dimensional and sometimes 3-dimensional data structures in a fast way from CSV-files into series-, table- or cube-like data structures in RAM,
  • provides tools to select, filter, retrieve and manipulate data from rows, columns and cells,
  • provides tools to operate on a multitude of rows or columns,
  • provides tools to create some statistics on the data.

All of it pretty fast – which means that the tool must support the creation of an index or indices and/or support vectorized operations (mostly on columns).

I had read in ML books that Pandas is such a tool for a Python environment. A way to accomplish our task would be to load the reference vocabulary into a Pandas structure and check the words of the BoW(s) against it. This means that you try to find the word in the reference list and evaluate the result (positive or negative). And when you are done with this challenge you may want to retrieve additional information from a 2-dimensional Pandas data structure.

This article is about the performance of some data retrieval experiments I recently did on a wordlist of around 2 million words. The objective was to check the existence of the tokenized words of some 200,000 texts, each with around 2,000 tokens, against this wordlist embedded in a Pandas dataframe – and also to retrieve additional information from other columns of the dataframe.

As we talk about scanned texts and OCR treatment it is very probable that the number of tokens you have to compare with your vocabulary is well above 10 million. It is clear that there is a requirement for performance if you want to work with a standard Linux PC.

Multiple ways to retrieve or query information from a Pandas series or dataframe

When I started to really use Pandas some days ago I became a bit overwhelmed by the documentation – and by the differences in comparison to databases. After having used a database like MySQL for years I had a certain vision of how “table”-like data are handled and what performance to expect. Well, I had to swallow some camels!

And when I started to really care about performance I also realized that there were very many ways to “query” a Pandas dataframe – and not all of them give you the same speed of data retrieval.

This article, therefore, dives a bit below the glittering surface of Pandas and looks at different methods to retrieve rows and certain cell values out of a simple “Pandas dataframe”. To work with some practical data I used a reference vocabulary for the German language based on Wikipedia articles.

The first objective was very simple: Verify that a certain word is an element of the reference vocabulary.
The second objective was a natural extension: Retrieve rows (with multiple columns) for fitting entries – sometimes multiple entries with different writings of a word.

I was somewhat astonished to see factors between at least 16 and 10,000 for real data retrieval, in comparison with the fastest solution. Just checking the existence of a word in the wordlist proved to be very much faster after a suitable index had been created – without using any data columns at all.

The response times of Pandas depended strongly on the “query” method and the usage of an index.

I hope the information given below and in the next article is useful for other beginners with Pandas. I shall speak of a “query” when I want to select data from a Pandas dataframe and of a “resultset” when addressing one or a collection of data rows as the result of a query. I just can’t forget my time with databases …

I assume that you already have a valid Pandas installation in a Python 3 environment on your Linux PC. I did my simple experiments with a Jupyter notebook, but, of course, other tools can be used, too.

Loading an example wordlist into a Pandas dataframe

For my small “query” experiments I first loaded a simple list with around 2.1 million words from a text file into a Pandas data structure. This operation created a so-called “Pandas series” and also produced a unique integer index which marks each row of the data.

Then I created two additional columns: the first one with all words written in lowercase letters, the second one containing the number of characters of the word’s string. By these operations I created a real 2-dim object – a so-called Pandas “dataframe”.

Let us follow this line of operations as a first step. So, where do we get a wordlist from?

A friendly engineer (Torsten Brischalle) has provided a German word-list based on Wikipedia which we can use as an example.
See: http://www.aaabbb.de/WordList/WordList.php

We first import the “uppercase”-wordlist, which you can download from the page linked above. On your Linux PC you expand the 7zip archive with standard tools.

This “uppercase” list has the advantage that an index which we will later base on the lowercase writing of the words will (hopefully) be unique. The more extensive wordlist also provided by Brischalle instead comprises multiple writings for some words. The related index would, therefore, not be unique. We shall see that this has a major impact on the response time of the resulting Pandas dataframe.

The wordlists, after 7zip-expansion, are all very simple text files: Each line contains just one word.

We shall nevertheless work with a 2-dim general Pandas “dataframe” instead of a “series”. A reason is that in a real data analysis environment we may want to add multiple columns with more information later on, e.g. columns for n-grams of the character sequences constituting a word, or for other information such as frequencies, consonant-to-vowel ratio, etc. And then we would work on 2-dim data structures.

Loading the data into a Pandas dataframe and creating an index based on lowercase word representation

Let us import the wordlist data by the help of some Python code in a Jupyter cell (in my case from a directory “/py/projects/CA22/catch22/Wortlisten/”):

import os
import time
import pandas as pd
import numpy as np

dfw_smallx = pd.read_csv('/py/projects/CA22/catch22/Wortlisten/word_list_german_uppercase_spell_checked.txt', dtype='str', na_filter=False)
dfw_smallx.columns = ['word']
dfw_smallx['indw'] = dfw_smallx['word']

pdx_shape = dfw_smallx.shape
print("shape of dfw_smallx = ", pdx_shape)
pdx_rows = pdx_shape[0]
pdx_cols = pdx_shape[1]
print("rows of dfw_smallx = ", pdx_rows)
print("cols 
of dfw_smallx = ", pdx_cols)

dfw_smallx.head(8)

You see that we need to import the Pandas module besides other standard modules. Then you find that Pandas provides a function “read_csv()” to import CSV-like text files. You find more about it in the Pandas documentation.
The CSV import should in our case be a matter of a few seconds, only.

A column name or column names can be added to a Pandas series or Pandas dataframe, respectively, afterward.

Why did I use the parameter “na_filter“? Well, this was done to handle a special value in the wordlist, namely “NULL”. By default Pandas interprets the string “NULL” as a NaN-value, so without this parameter we would get an empty entry in the dataframe for this input value. You find more information on this topic in the Pandas documentation of the “read_csv()”-function.
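
A minimal illustration of the effect with a small in-memory CSV (a sketch, not part of the original code):

import io
import pandas as pd

csv_text = "word\nNULL\nMANN\n"
df_default  = pd.read_csv(io.StringIO(csv_text))                    # "NULL" becomes NaN
df_nofilter = pd.read_csv(io.StringIO(csv_text), na_filter=False)   # "NULL" stays a string
print(df_default.loc[0, 'word'])    # nan
print(df_nofilter.loc[0, 'word'])   # NULL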

The reader also notices that I just named the single data column (resulting from the import) ‘word’ and then copied this column to another new column called ‘indw’. I shall use the latter column as an index in a minute. I then print out some information on the dataframe:

shape of dfw_smallx =  (2188246, 2)
rows of dfw_smallx =  2188246
cols of dfw_smallx =  2

	word 			indw
0 	AACHENER 		AACHENER
1 	AACHENERIN 		AACHENERIN
2 	AACHENERINNEN 	AACHENERINNEN
3 	AACHENERN 		AACHENERN
4 	AACHENERS 		AACHENERS
5 	AACHENS 		AACHENS
6 	AAL 			AAL
7 	AALE			AALE

Almost 2.2 million words. OK, I do not like uppercase; I want a lowercase representation to be used as an index later on. This gives me the opportunity to apply an operation to a whole column with 2.2 million words.

The creation of our string based index can be achieved by the “set_index()” function:

dfw_smallx['indw'] = dfw_smallx['word'].str.lower()
dfw_smallx = dfw_smallx.set_index('indw')
dfw_smallx.head(5)

Leading after less than 0.5 secs (!) to:

 				word
indw 	
aachener 		AACHENER
aachenerin 		AACHENERIN
aachenerinnen 	AACHENERINNEN
aachenern 		AACHENERN
aacheners 		AACHENERS

Now, let us add one more column containing the length information on the word(s).

This can be done by one of two methods:

  • dfw_smallx[‘len’] = dfw_smallx[‘word’].str.len()
  • dfw_smallx[‘len’] = dfw_smallx[‘word’].apply(len)

The second method is a bit faster (roughly by a factor of 0.7 in runtime), but it does not work on NaN cells of a column. In our case this is no problem; we get:

# Add a column with len information 
v_start_time = time.perf_counter()
dfw_smallx['len'] = dfw_smallx['word'].apply(len)
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
dfw_smallx.head(3)

Total CPU time  0.3626117290004913

			word 			len
indw 		
aachener 		AACHENER 		8
aachenerin 		AACHENERIN 		10
aachenerinnen 	AACHENERINNEN 	13

Basics of addressing data in a Pandas dataframe

Ok, we have loaded our reference list of words into a dataframe. A Pandas “dataframe” basically is a 2-dimensional data structure based on Numpy array technology for the columns. Now, we want to address data in specific rows or cells. Below I repeat some basics for the retrieval of single values from a dataframe:

Each “cell” has a two-dimensional integer “index” – a tuple [i,j], with “i” identifying a row and “j” a column. You can use the respective integer values with the “iloc[]“-operator. E.g. dfw_smallx.iloc[2,1] will give you the value “13”.

The “loc[]“-operator instead works with “labels” given to the rows and columns; in the most primitive form as :

dataframe.loc[row label, column label], e.g. dfw_smallx.loc[‘aachenerinnen’, ‘len’].

Labels have to be defined. For columns you may define names (often already during construction of the dataframe). For rows you may define an index – as we actually did above. If you want to compare this with databases: You define a primary key (sometimes based on column-combinations).

Other almost equivalent methods

  • the iat[]-operator,
  • the at[]-operator,
  • array-like usage of the column label + row index,
  • and the so-called dot-notation

for the retrieval of single values are presented in the following code snippet:

print(dfw_smallx.iloc[2,1])
print(dfw_smallx.iat[2,1])
print(dfw_smallx['len'][2]) 
print(dfw_smallx.loc['aachenerinnen', 'len'])
print(dfw_smallx.at['aachenerinnen', 'len'])
print(dfw_smallx.len.aachenerinnen)

13
13
13
13
13
13

Note that the “iat[]” and “at[]” operators can only be used for cells, so both row and column values have to be provided; the other methods can be used for more general slicing of columns.

Slicing

Slicing is in general supported by the “:” notation – just as in NumPy. So, with the notation “labelvalue1 : labelvalue2” one can define slices. This works even for string label values:

words = dfw_smallx.loc['alt':'altersschwach', 'word':'len']
print(words)
                          word  len
indw                               
alt                        ALT    3
altaachener        ALTAACHENER   11
altablage            ALTABLAGE    9
altablagen          ALTABLAGEN   10
altablagerung    ALTABLAGERUNG   13
...                        ...  ...
altersschnitt    ALTERSSCHNITT   13
altersschnitts  ALTERSSCHNITTS   14
altersschrift    ALTERSSCHRIFT   13
altersschutz      ALTERSSCHUTZ   12
altersschwach    ALTERSSCHWACH   13

[3231 rows x 2 columns]

Queries with conditions on column values – and Pandas objects containing multiple results

Now let us look at some queries with conditions on columns and the form of the “result sets” when more than just a single value is returned in a Pandas response. Multiple return values may mean multiple rows (with one or more column values) or just one row with multiple column values. Two points are noteworthy:

  1. Pandas produces a new dataframe or series with multiple rows if multiple values are returned. Whenever we get a Pandas “object” with an internal structure as a Pandas response, we need to narrow down the result to the particular value we want to see.
  2. To grasp a certain value you need to include some special methods already in the “query” or to apply a method to the result series or dataframe.

An interesting type of “query” for a Pandas dataframe is provided by the “query()“-function: it allows us to retrieve rows or single values by conditions on column entries. But conditions can also be supplied when using the “loc[]” operator:

w1 = dfw_smallx.loc['null', 'word']
pd_w2 = dfw_smallx.loc['null'] # resulting in a series 
w2 = pd_w2[0]
pd_w3 = dfw_smallx.loc[dfw_smallx['word'] == 'NULL', 'word']
w3 = pd_w3[0]
pd_w4 = dfw_smallx.query('word == "NULL"')
w4 = pd_w4.iloc[0,0]
w5 = dfw_smallx.query('word == "NULL"').iloc[0,0]
w6 = dfw_smallx.query('word == "NULL"').word.item()
print("w1 = ", w1)
print("pd_w2 = ", pd_w2)
print("w2 = ", w2)
print("pd_wd3 = ", pd_w3)
print("w3 = ", w3)
print("w4 = ", w4)
print("w5 = ", w5)
print("w6 = ", w6)

I have added a prefix “pd_” to some variables where I expected a Pandas dataframe to be the answer. And really:

w1 =  NULL
pd_w2 =  word    NULL
len        4
Name: null, dtype: object
w2 =  NULL
pd_wd3 =  indw
null    NULL
Name: word, dtype: object
w3 =  NULL
w4 =  NULL
w5 =  NULL
w6 =  NULL

Noteworthy: For loc[] (in contrast to iloc[]) the last value of the slice definition is included in the result set.
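
A small comparison demonstrates this (a sketch using the example dataframe from above):

# loc[] works with labels and includes the end label of a slice
print(dfw_smallx.loc['aachener':'aachenerinnen', 'len'])
# => three rows: aachener (8), aachenerin (10), aachenerinnen (13)

# iloc[] works with integer positions and excludes the end position
print(dfw_smallx.iloc[0:3, 1])
# => the same three rows, because position 3 is NOT included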

Retrieving data by a list of index values

As soon as you dig a bit deeper into the Pandas documentation you will certainly find the following way to retrieve multiple rows by providing a list of index values:

# Retrieving col values by a list of index values 
inf = ['null', 'mann', 'frau']
wordx = dfw_smallx.loc[inf, 'word']
wx = wordx.iloc[0:3] # resulting in a Pandas series 
print(wx.iloc[0])
print(wx.iloc[1])
print(wx.iloc[2])
NULL
MANN
FRAU

Intermediate conclusion

The variety of options to retrieve values from a wordlist (with an additional column) is almost overwhelming, even in our very simple scenario. They all serve their purpose – depending on the structure of the dataframe and your knowledge of the data positions.

But actually, in our scenario for analyzing BoWs, we have a very simple task ahead of us: We just want to check whether a word or a list of words exists in the list, i.e. whether there is an entry for a word (written in small letters) in the list. What about the performance of the different methods for this task?

Actually, there is a very simple answer for the existence check – giving you maximum performance.

But to learn a bit more about the performance of different forms of Pandas queries we also shall look at methods performing some real data retrieval from the columns of a row addressed by some (string) index value.

These will be the topics of the next article. Stay tuned …

Links

Various ways of “querying” Pandas dataframes
https://www.sharpsightlabs.com/blog/pandas-loc/
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
https://cmsdk.com/python/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.html
https://pythonexamples.org/pandas-dataframe-query/

The book “Mastering Pandas” by Ashish Kumar, 2nd edition, 2019, Packt Publishing Ltd., may be of help – though it does not really comment on performance issues at this level.

NULL values
https://stackoverflow.com/questions/50683765/how-to-treat-null-as-a-normal-string-with-pandas