Performance of data retrieval from a simple wordlist in a Pandas dataframe with a string based index – II

In my last post I set up a Pandas dataframe with a column containing a (German) wordlist of around 2.2 million words. We created a unique string based index for the dataframe from a column wither "lower case" writing of the words. My eventual objectives are

  1. to find out whether a string like token out of some millions of tokens is a member of the wordlist or not,
  2. to compare n-grams of characters, i.e. (sub-strings) of millions of given strange string tokens with the n-grams of each word in the wordlist.

In the first case a kind of "existence-query" on the wordlist is of major importance. We could work with a condition on a row-value or somehow use the string based index itself. For the second objective we need a requests on column values with "OR" conditions or again a kind of existence-queries on individual columns, which we turn into index structures before.

It found it interesting and a bit frustrating that a lot of introductory articles on the Internet and even books do not comment on performance. In this article we, therefore, compare the performance of different forms of simple data requests on a Pandas dataframe. To learn a bit more about Pandas' response times, we extend the data retrieval requests a bit beyond the objectives listed above: We are going to look for rows where conditions for multiple words are fulfilled.

For the time being we restrict our experiments to a dataframe with just one UNIQUE index. I.e. we do not, yet, work with a multi-index. However, at the end of this article, I am going to look a bit at a dataframe with a NON-UNIQUE index, too.

Characteristics of the dataframe and the "query"

We work on a Pandas dataframe "dfw_smallx" with the following characteristics:

pdx_shape = dfw_smallx.shape
print("shape of dfw_smallx = ", pdx_shape)
pdx_rows = pdx_shape[0]
pdx_cols = pdx_shape[1]
print("rows of dfw_smallx = ", pdx_rows)
print("cols of dfw_smallx = ", pdx_cols)
print("column names", dfw_smallx.columns)
print("index", dfw_smallx.index)
print("index is unique: ", dfw_smallx.index.is_unique)
print('')
print(dfw_smallx.loc[['aachener']])
shape of dfw_smallx =  (2188246, 3)
rows of dfw_smallx =  2188246
cols of dfw_smallx =  3
column names Index(['lower', 'word', 'len'], dtype='object')
index Index(['aachener', 'aachenerin', 'aachenerinnen', 'aachenern', 'aacheners',
       'aachens', 'aal', 'aale', 'aalen', 'aales',
       ...
       'zynisches', 'zynischste', 'zynischsten', 'zynismus', 'zypern',
       'zyperns', 'zypresse', 'zypressen', 'zyste', 'zysten'],
      dtype='object', name='indw', length=2188246)
index is unique:  True

             lower      word  len
indw                             
aachener  aachener  AACHENER    8

The only difference to the dataframe created in the last article is the additional column "lower", repeating the index. As said, the string based index of this dataframe is unique (checked by dfw_smallx.index.is_unique). At the end of this post we shall also have a look at a similar dataframe with a non-unique index.

Query: For a comparison we look at different methods to answer the following question: Are there entries for the words "null", "mann" and "frau" in the list?

We apply each methods a hundred times to get some statistics. I did the experiments on a CPU (i7-6700K). The response time depends a bit on the background load - I took the best result out of three runs.

Unique index: CPU Time for checking the existence of an entry within the index itself

There is a very simple answer to the question, of how one can check the existence of a value in a (string based) index of a Pandas dataframe. We just use

"('STRING-VALUE' in df.index)" !

Let us apply this for three values

b1 = 0; b2=0; b3=0;  
v_start_time = time.perf_counter()
for i in range(0, 100):
    if 'null' in dfw_smallx.index:
        b1 = 1
    if 'mann' in dfw_smallx.index:
        b2=1 
    if 'frau' in dfw_smallx.index:
        b3=1
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
b4 = 'gamling' in dfw_smallx.index 
print(b1, b2, b3, b4)
Total CPU time  0.00020675300038419664
NULL
1 1 1 False

Giving me a total time on my old PC of about 2.1e-4 secs. Which - as we are going to see is a pretty good value - for Pandas!

Total time 2.1e-4 secs.    Per query: 6.9e-7 secs.

Unique index: CPU Time for checking the existence of an entry with a Python dictionary

It is interesting to compare the query time required for a simple dictionary.

We first create a usable dictionary with the lower case word strings as the index:

ay_voc = dfw_smallx.to_numpy()
print(ay_voc.shape)
print(ay_voc[0:2, :])

ay_lower = ay_voc[:,0].copy()
d_lower = dict(enumerate(ay_lower))
d_low   = {y:x for x,y in d_lower.items()}
print(d_lower[0])
print(d_low['aachener'])
(2188246, 3)
[['aachener' 'AACHENER' 8]
 ['aachenerin' 'AACHENERIN' 10]]
aachener
0

And then:

b1 = 0; b2 = 0; b3 = 0
v_start_time = time.perf_counter()
for i in range(0, 100):
    if 'null' in d_low:
        b1 = 1
    if 'mann' in d_low:
        b2=1 
    if 'frau' in d_low:
        b3=1    
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

print(b1, b2, b3)
print(d_low['mann'], d_low['frau'])
Total CPU time  8.626400085631758e-05
1 1 1
1179968 612385

Total time 8.6e-5 secs.    Per query: 2.9e-7 secs.

A dictionary is by almost a factor of 2.4 faster regarding a verification of the existence of a given string value in a string based index than related queries on an indexed Pandas series or dataframes!

Unique index: CPU Time for direct "at"-queries by providing individual index values

Now, let us start with repeated query calls on a our dataframe with the "at"-operator:

v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.at['null', 'word']
    wordy = dfw_smallx.at['mann', 'word']
    wordz = dfw_smallx.at['frau', 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  0.0013257559985504486

Total time: 1.3e-3 secs.    Per query: 4.4e-6 secs
This approach is by factors 6.5 and 15 slower than the fastest solutions for Pandas and the dictionary, respectively

Unique index: CPU Time for direct "loc"-queries by providing individual index values

We now compare this with the same queries - but with the "loc"-operator:

v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc['null', 'word']
    wordy = dfw_smallx.loc['mann', 'word']
    wordz = dfw_smallx.loc['frau', 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  0.0021894429992244113
NULL

Total time: 2.2e-3 secs.    Per query: 7.3e-6 secs
More than a factor of 10.6 and 25.4 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: CPU Time for a query based on a list of index values

Now, let us use a list of index values - something which would be typical for a programmer who wants to save some typing time:

# !!!! Much longer CPU time for a list of index values  
inf = ['null', 'mann', 'frau']
v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc[inf, 'word'] # a Pandas series 
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print(wordx)
Total CPU time  0.037733839999418706
indw
null    NULL
mann    MANN
frau    FRAU
Name: word, dtype: object

Total time: 3.8e-2 secs.    Per query: 1.3e-4 secs
More than a factor of 182 and 437 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: CPU Time for a query based on a list of index values with index.isin()

We now try a small variation by

ix = dfw_smallx.index.isin(['null', 'mann', 'frau'])
v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc[ix, 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  0.058356915000331355

Total time: 5.8e-2 secs.    Per query: 1.9e-4 secs
We are loosing ground, again.
More than a factor of 282 and 667 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: CPU Time for a query based on a list of index values with index.isin() within loc()

Yet, another seemingly small variation

inf = ['null', 'mann', 'frau']
v_start_time = time.perf_counter()
for i in range(0, 100):
    wordx = dfw_smallx.loc[dfw_smallx.index.isin(inf), 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  6.466159620998951

OOOOPs!
Total time: 6.48 secs.    Per query: 2.2e-2 secs
More than a factor of 31000 and 75000 slower than the fastest Pandas solution and the dictionary, respectively.

Unique index: Query by values for a column and ignoring the index

What happens if we ignore the index and query by values of a column?

v_start_time = time.perf_counter()
for i in range(0, 100):
    pdq = dfw_smallx.query('indw == "null" | indw == "mann" | indw == "frau" ')
    #print(pdq)
    w1 = pdq.iloc[0,0]
    w2 = pdq.iloc[1,0]
    w3 = pdq.iloc[2,0]
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  16.747838952000166

Well, well - the worst result so far!
Total time: 16.75 secs.    Per query: 5.6e-2 secs
More than a factor of 81000 and 194000 slower than the fastest Pandas solution and the dictionary, respectively.

The following takes even longer:

v_start_time = time.perf_counter()
for i in range(0, 100):
    word3 = dfw_smallx.loc[dfw_smallx['word'] == 'NULL', 'word']
    word4 = dfw_smallx.loc[dfw_smallx['word'] == 'MANN', 'word']
    word5 = dfw_smallx.loc[dfw_smallx['word'] == 'FRAU', 'word']
    w4 = word3.iloc[0]
    w5 = word4.iloc[0]
    w6 = word4.iloc[0]
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
Total CPU time  22.6538158809999

Total time: 22.76 secs.    Per query: 7.6e-2 secs
More than a factor of 109000 and 262000 slower than the fastest Pandas solution and the dictionary, respectively.

However:

v_start_time = time.perf_counter()
for i in range(0, 100):
    pdl = dfw_smallx.loc[dfw_smallx['word'].isin(['MANN', 'FRAU', 'NULL' ]), 'word']
    w6 = pdl.iloc[0]
    w7 = pdl.iloc[1]
    w8 = pdl.iloc[2]
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

This gives us 6.576 secs again.

Unique index: Query by values of a dictionary and ignoring the index

Here a comparison to a dictionary is interesting again:

v_start_time = time.perf_counter()
b1 = 0; b2 = 0; b3 = 0
for i in range(0, 100):
    if 'null' in d_lower.values():
        b1 = 1
    if 'mann' in d_lower.values():
        b2=1 
    if 'frau' in d_lower.values():
        b3=1    
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print(b1, b2, b3)
Total CPU time  4.572028649003187
1 1 1

So a dictionary is faster than Pandas - even if we ignore the index and query for certain values!

Intermediate conclusions

What do the above results tell us about the handling of Pandas dataframes or series with strings as elements and an unique index containing strings, too?

The first thing is:

If possible do not use a Pandas dataframe at all! Turn (two) required (string) columns of the dataframe into a string indexed dictionary!

In our case a simple solution was to turn a column with lower case writings of the words into a dictionary index and using an enumerated column as values.

A dictionary with string based keys will give you by far the fastest solution if you are only interested in the existence of a certain key-value.

Now, if you want to use Pandas to check the existence of a certain string in a unique index or column the following rules should be followed:

  • Whenever possible use a (unique) index based on the strings whose existence you are interested in.
  • For pure existence checks of a string in the index use a query of the type
    "if 'STRING' in df.index".
    This will give you the fastest solution with Pandas.
  • Whenever possible use a series of simple queries - each for exactly one index value and one or multiple column labels instead of providing multiple index values in a list.
  • If you want to use the "at" or "loc"-operators, prefer the "at"-operator for a unique index! The form should be
    result = df.at['IndexValue', 'ColLabel'].
    The loc-operator
    result = df.loc['IndexValue', 'colLabel1', 'colLabel2', ...] is somewhat slower, but the right choice if you want to retrieve multiple columns for a single row index value.
  • Avoid results which themselves become Pandas dataframes or series - i.e. results which contain a multitude of rows and column values
  • Avoid queries on column-values! The CPU times to produce results depending on conditions for a column may vary; factors between 6 and 200,000 in comparison to the fastest solution for single values are possible.

Using a string based index in a Pandas dataframe is pretty fast because Pandas then uses a hash-function and a hashtable to index datarows. Very much like Python handles dictionaries.

Dataframes with a non-unique string index

Let us now quickly check what happens if we turn to a vocabulary with a non-unique string based index. We can easily get this from the standard checked German wordlist provided by T.Brischalle.

dfw_fullx = pd.read_csv('/py/projects/CA22/catch22/Wortlisten/word_list_german_spell_checked.txt', dtype='str', na_filter=False)
dfw_fullx.columns = ['word']
dfw_fullx['indw'] = dfw_fullx['word']

pdfx_shape = dfw_fullx.shape
print('')
print("shape of dfw_fullx = ", pdfx_shape)
pdfx_rows = pdfx_shape[0]
pdfx_cols = pdfx_shape[1]
print("rows of dfw_fullx = ", pdfx_rows)
print("cols of dfw_fullx = ", pdfx_cols)

giving:

shape of dfw_fullx =  (2243546, 2)
rows of dfw_fullx =  2243546
cols of dfw_fullx =  2

We set an index as before by lowercase word values:

dfw_fullx['indw'] = dfw_fullx['word'].str.lower()
dfw_fullx = dfw_fullx.set_index('indw')

word_null = dfw_fullx.loc['null', 'word']
word_mann = dfw_fullx.loc['mann', 'word']
word_frau = dfw_fullx.loc['frau', 'word']

print('')
print(word_null)
print('')
print(word_mann)
print('')
print(word_frau)

Giving:

indw
null    null
null    Null
null    NULL
Name: word, dtype: object

indw
mann    Mann
mann    MANN
Name: word, dtype: object

indw
frau    Frau
frau    FRAU
Name: word, dtype: object

You see directly that the index is not unique.

Non-unique index: CPU Time for direct queries by with single index values

We repeat our first experiment from above :

v_start_time = time.perf_counter()
for i in range(0, 100):
    pds1 = dfw_fullx.loc['null', 'word']
    pds2 = dfw_fullx.loc['mann', 'word']
    pds3 = dfw_fullx.loc['frau', 'word']
v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print(pds1) 

This results in:

indw
null    null
null    Null
null    NULL
Name: word, dtype: object
TotalCPU time  7.232291821999752

7.2 secs! Not funny!

The reason is that the whole index must be checked for entries of the given value.

Why the whole dataset? Well, Pandas cannot be sure that the string based index is sorted in a way!

Non-unique index: CPU Time for direct queries by single index values and a sorted index

We remedy the above problem by sorting the index:

dfw_fullx = dfw_fullx.sort_index(axis=0)

And afterwards we try our test from above again - this time giving us :

indw
null    Null
null    null
null    NULL
Name: word, dtype: object
Total CPU time  0.04120599599991692

0.041 secs - much faster. On average the response time now depends on log(N) because Pandas now can use a binary search - with N being the number of elements in the index.

Conclusions

For me as a beginner with Pandas the multitude of options to retrieve values from a Pandas series or dataframe was confusing and their relation to index involvement and the construction of the "result set" was not always obvious. Even more surprising, however, was the impact on performance:

As soon as you hit multiple rows by a "query" a Pandas series or dataframe is constructed which contains the results. This may have advantages regarding the presentation of the result data and advantages regarding a convenient post-query handling of the results. It is however a costly procedure in terms of CPU time. In database language:

For Pandas building the result set can become more costly than the query itself - even for short result sets. This was in a way a bit disappointing.

If you are interested in just the existence of a certain string values in a list of unique strings just create a dictionary - indexed by your string values. Then check the existence of a string value by "('STRING-VALUE' in dict)".
You do not need Pandas or this task!

When you want to use an (indexed) Pandas for existence checks then query like "if 'STRING-VALUE' in df.index" to check the existence of a string value in the properly created string based index.

When you need a bunch of values in various columns for given indices go for the "at"-operator and individual queries for each index value - and not a list.

When retrieving values you should, in general, use a direct form of "selecting" by just one index value and (if possible) one column value with the "at"-operator (df.at['indexValue', 'columnLabel']). The loc-operator is up to a factor 1,7 slower - but relevant if you query for more than one columns.

Non-unique indices cost time, too. If you have no chance to associate your dataframe with unique index that suits your objectives then sort the non-unique index at least.