Pandas – Extending a vocabulary or simple dataframe relatively fast

During some work for a ML project on a large text corpus I needed to extend a personally used reference vocabulary by some complex ad unusual German compounds and very branch specific technical terms. I kept my vocabulary data in a Pandas dataframe. Each “word” there had some additional information associated with it in some extra columns of the dataframe – as e.g. the length of a word or a stem or a list of constituting tri-char-grams. I was looking for a fast method to extend the dataframe in a quick procedure with a list of hundreds or thousands of new words.

I tried the df.append() method first and got disappointed with its rather bad performance. I also experimented with the incorporation of some lists or dictionaries. In the end a procedure based on csv-data was the by far most convenient and fastest approach. I list up the basic steps below.

In my case I used the lower case character version of the vocabulary words as an index of the dataframe. This is a very natural step. It requires some small intermediate column copies in the step sequence below, which may not be necessary for other use-cases. For the sake of completeness the following list contains many steps which have to be performed only once and which later on are superfluous for a routine workflow.

  1. Step1: Collect your extension data, i.e. a huge bunch of words, in a Libreoffice Calc-file in ods-format or (if you absolutely must) in an MS Excel-file. One of the columns of your datasheet should contain data which you later want to use as a (unique) index of your dataframe – in my case a column “lower” (containing the low letter representation of a word).
  2. Step 2: Avoid any operations for creating additional column information which you later can create by Python functions working on information already contained in some dataframe columns. Fill in dummy values into respective columns. (Or control the filling of a dataframe with special data during the data import below)
  3. Step 3: Create a CSV-File containing the collected extension data with all required field information in columns which correspond to respective columns of the dataframe to be extended.
  4. Step 4:Create a backup copy of your original dataframe which you want to extend. Just as a precaution ….
  5. Step 5: Copy the contents of the index of your existing dataframe to a specific dataframe column consistent with step 1. In my case I copied the words’ lower case version into a new data column “lower”.
  6. Step 6: Delete the existing index of the original dataframe and create a new basic integer based index.
  7. Step 7: Import the CSV-file into a new and separate intermediate Pandas dataframe with the help of the method pd.read_csv(). Map the data columns and the data formats properly by supplying respective (list-like) information to the parameter list of read_csv(). Control the filling of possibly empty row-fields. Check for fields containing “null” as string and handle these by the parameter “na_filter” if possible (in my case by “na_filter=False”)
  8. Step 8: Work on the freshly created dataframe and create required information in special columns by applying row-specific Python operations with a function and the df.apply()-method. For the sake of performance: Watch out for naturally vectorizable operations whilst doing so and separate them from other operations, if possible.
  9. Step 9: Check for completeness of all information in
    your intermediate dataframe. verify that the column structure matches the columns of the original dataframe to be extend.
  10. Step 10: Concatenate the original Pandas dataframe (for your vocabulary) with the new dataframe containing the extension data by using the df.concat() or (simpler) by df.append() methods.
  11. Step 11: Drop the index in the extended dataframe by the method pd.reset_index(). Afterward recreate a new index by pd.set_index() and using a special column containing the data – in my case the column “lower”
  12. Step 12: Check the new index for uniqueness – if required.
  13. Step 13: If uniqueness is not given but required:
    Apply df = df[~df.index.duplicated(keep=’first’)] to keep only the first occurrence of rows for identical indices. But be careful and verify that this operation really fits your needs.
  14. Step 14: Resort your index (and extended dataframe) if necessary by applying df.sort_index(inplace=True)

Some steps in the list above are of course specific for a dataframe with a vocabulary. But the general scheme should also be applicable for other cases.

From the description you have certainly realized which steps must only be performed once in the beginning to establish a much shorter standard pipeline for dataframe extensions. Some operations regarding the index-recreation and re-sorting can also be automatized by some simple Python function.

Have fun with Pandas!

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – IV

In the last posts of this mini-series we have studied if and how we can use three 3-char-grams at defined positions of a string token to identify matching words in a reference vocabulary. We have seen that we should choose some distance between the char-grams and that we should use the words length information to keep the list of possible hits small.

Such a search may be interesting if there is only fragmented information available about some words of a text or if one cannot trust the whole token to be written correctly. There may be other applications. Note: This has so far nothing to do with text analysis based on machine learning procedures. I would put the whole topic more in the field of text preparation or text rebuilding. But, I think that one can combine our simple identification of fitting words by 3-char-grams with ML-methods which evaluate the similarity or distance of a (possibly misspelled) token with vocabulary words: When we get a long hit-list we could invoke ML-methods to to determine the best fitting word.

We saw that we can do a 100,000 search runs with 3-char-grams on a decent vocabulary of around 2 million words in a Pandas dataframe below a 1.3 minutes on one CPU core of an older PC. In this concluding article I want to look a bit at the idea of multiprocessing the search with up to 4 CPU cores.

Points to take into account when using multiprocessing – do not expect too much

Pandas normally just involves one CPU core to do its job. And not all operations on a Pandas dataframe may be well suited for multiprocessing. Readers who have followed the code fragments in this series so far will probably and rightly assume that there is indeed a chance for reasonably separating our search process for words or at least major parts of it.

But even then – there is always some overhead to expect from splitting a Pandas dataframe into segments (or “partitions”) for a separate operations on different CPU cores. Overhead is also expected from the task to correctly to combine the particular results from the different processor cores to a data unity (here: dataframe) again at the end of a multiprocessed run.

A bottleneck for multiprocessing may also arise if multiple processes have to access certain distinct objects in memory at the same time. In our case we this point is to be expected for the access of and search within distinct sub-dataframes of the vocabulary containing words of a specific length.

Due to overhead and bottlenecks we do not expect that a certain problem scales directly and linearly with the number of CPU cores. Another point is that although the Linux OS may recognize a hyperthreading physical core of an Intel processor as two cores – but it may not be able to use such virtual cores in a given context as if they were real separate physical cores.

Code to invoke multiple processor cores

In this article I just use the standard Python “multiprocessing” module. (I did not test Ray yet – as a first trial gave me trouble in some preparing code-segments of my Jupyter notebooks. I did not have time to solve the problems there.)

Following some advice on the Internet I handled parallelization in the following way:

import multiprocessing
from multiprocessing import cpu_count, Pool

#cores = cpu_count() # Number of physical CPU cores on your system
cores = 4
partitions = cores # But actually you can define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split), copy=False)
    pool.close()
    pool.join()
    return data

The basic function, corresponding to the parameter “func” of function “parallelize”, which shall be executed in our case is structurally well known from the last posts of this article series:

We perform a search via
putting conditions on columns (of the vocabulary-dataframe) containing 3-char-grams at different positions. The search is done on sub-dataframes of the vocabulary containing only words with a given length. The respective addresses are controlled by a Python dictionary “d_df”; see the last post for its creation. We then build a list of indices of fitting words. The dataframe containing the test tokens – in our case a random selection of real vocabulary words – will be called “dfw” inside the function “func() => getlen()” (see below). To understand the code you should be aware of the fact that the original dataframe is split into (4) partitions.

We only return the length of the list of hits and not the list of indices for each token itself.

# Function for parallelized operation 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def getlen(dfw):
    # Note 1: The dfw passed is a segment ("partition") of the original dataframe  
    # Note 2: We use a dict d_lilen which was defined outside  
    #         and is under the control of the parallelization manager
    
    num_rows = len(dfw)
    for i in range(0, num_rows):
        len_w = dfw.iat[i,0]
        idx = dfw.iat[i,33]
        
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]

        j_m = math.floor(len_w/2)+1
        j_l = 2
        j_r = len_w -1
        col_l = 'gram_' + str(j_l)
        col_m = 'gram_' + str(j_m)
        col_r = 'gram_' + str(j_r)
        val_l = dfw.iat[i, j_l+2]
        val_m = dfw.iat[i, j_m+2]
        val_r = dfw.iat[i, j_r+2]
        li_ind = df_.index[   (df_[col_r]==val_r) 
                            & (df_[col_m]==val_m)
                            & (df_[col_l]==val_l)
                            ]
        d_lilen[idx] = len(li_ind)

    # The dataframe must be returned - otherwise it will not be concatenated after parallelization 
    return dfw

While the processes work on different segments of our input dataframe we write results to a Python dictionaryd_lilen” which is under the control of the “parallelization manager” (see below). A dictionary is appropriate as we might otherwise loose control over the dataframe-indices during the following processes.

A reduced dataframe containing randomly selected “tokens”

To make things a bit easier we first create a “token”-dataframe “dfw_shorter3” based on a random selection of 100,000 indices from a dataframe containing long vocabulary words (length ≥ 10). We can derive it from our reference vocabulary. I have called the latter dataframe “dfw_short3” in the last post (because we use three 3-char-grams for longer tokens). “dfw_short3” contains all words of our vocabulary with a length of “10 ≤ length ≤ 30”.

# Prepare a sub-dataframe for of the random 100,000 words 
# ******************************
num_w = 100000
len_dfw = len(dfw_short3)

# select a 100,000 random rows 
random.seed()
# Note: random.sample does not repeat values 
li_ind_p_w = random.sample(range(0, len_dfw), num_w)
len_li_p_w = len(li_ind_p_w)

dfw_shorter3 = dfw_short3.iloc[li_ind_p_w, :].copy() 
dfw_shorter3['lx'] = 0
dfw_shorter3['idx'] = dfw_shorter3.index
dfw_shorter3.head(5)

The resulting dataframe “dfw_shorter3” looks like :


nYou see that the index varies randomly and is not in ascending order! This is the reason why we must pick up the index-information during our parallelized operations!

Code for executing parallelized run

The following code enforces a parallelized execution:

manager = multiprocessing.Manager()
d_lilen = manager.dict()
print(len(d_lilen))

v_start_time = time.perf_counter()
dfw_res = parallelize(dfw_shorter3, getlen)
v_end_time = time.perf_counter()
cpu_time   = v_end_time - v_start_time
print("cpu : ", cpu_time)

print(len(d_lilen))
mean_length  = sum(d_lilen.values()) / len(d_lilen)
print(mean_length)

The parallelized run takes about 29.5 seconds.

cpu :  29.46206265499968
100000
1.25008

How does cpu-time vary with the number of cores of my (hyperthreading) CPU?

The cpu-time does not improve much when the number of cores gets bigger than the number of real physical cores:

1 core : 90.5 secs       
2 cores: 47.6 secs  
3 cores: 35.1 secs 
4 cores: 29.4 secs 
5 cores: 28.2 secs 
6 cores: 26.9 secs 
7 cores: 26.0 secs 
8 cores: 25.5 secs

My readers know about this effect already from ML experiments with CUDA and libcublas:

As long a s we use physical processor cores we see substantial improvement, beyond that no real gain in performance is observed on hyperthreading CPUs.

Compared to a run with just one CPU core we seem to gain a factor of almost 3 by parallelization. But, actually, this is no fair comparison: My readers have certainly seen that the CPU-time for the run with one CPU-Core is significantly slower than comparable runs which I described in my last post. At that time we found a cpu-time of around 75 secs, only. So, we have a basic deficit of about 15 secs – without real parallelization!

Overhead and RAM consumption of multiprocessing

Why does run with just one CPU core take so long time? Is it functional overhead for organizing and controlling multiprocessing – which may occur despite using just one core and just one “partition” of the dataframe (i.e. the full dataframe)? Well, we can test this easily by reconstructing the runs of my last post a bit:

# Reformulate Run just for cpu-time comparisons 
# **********************************************
b_test = True 

# Function  
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def getleng(dfw, d_lileng):
    # Note 1: The dfw passed is a segment of the original dataframe  
    # Note 2: We use a list l_lilen which was outside defined 
    #         and is under the control of the prallelization manager
    
    num_rows = len(dfw)
    #print(num_rows)
    for i in range(0, num_rows):
        len_w = dfw.iat[i,0]
        idx = dfw.iat[i,33]
        
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]

        j_m = math.floor(len_w/2)+1
        j_l = 2
        j_r = len_w -1
        col_l = 'gram_' + str(j_l)
        col_m = 'gram_' + str(j_m)
        col_r = 'gram_' + str(j_r)
        val_l = dfw.iat[i, j_l+2]
        val_m = dfw.iat[i, j_m+2]
        val_r = dfw.iat[i, j_r+2]
        li_ind = df_.index[   (df_[col_r]==val_r) 
                            & (df_[col_m]==val_m)
                            & (df_[col_l]==val_l)
                            ]
        leng = len(li_ind)
        d_lileng[idx] = leng

    return d_lileng


if b_test: 
    num_w = 100000
    len_dfw = len(dfw_short3)

    # select a 100,000 random rows 
    random.seed()
    # Note: random.sample does not repeat values 
    li_ind_p_w = random.sample(range(0, len_dfw), num_w)
    len_li_p_w = len(li_ind_p_w)

    dfw_shortx = dfw_short3.iloc[li_ind_p_
w, :].copy() 
    dfw_shortx['lx']  = 0
    dfw_shortx['idx'] = dfw_shortx.index

    d_lileng = {} #

    v_start_time = time.perf_counter()
    d_lileng = getleng(dfw_shortx, d_lileng)
    v_end_time = time.perf_counter()
    cpu_time   = v_end_time - v_start_time
    print("cpu : ", cpu_time)
    print(len(d_lileng))
    mean_length = sum(d_lileng.values()) / len(d_lileng)
    print(mean_length)
    
    dfw_shortx.head(3)

 
How long does such a run take?

cpu :  77.96989408900026
100000
1.25666

Just 78 secs! This is pretty close to the number of 75 secs we got in our last post’s efforts! So, we see that turning to multiprocessing leads to significant functional overhead! The gain in performance, therefore, is less than the factor 3 observed above:

We (only) get a gain in performance by a factor of roughly 2.5 – when using 4 physical CPU cores.

I admit that I have no broad or detailed experience with Python multiprocessing. So, if somebody sees a problem in my code, please, send me a mail.

RAM is not released completely
Another negative side effect was the use of RAM in my case. Whereas we just get 2.2 GB RAM consumption with all required steps and copying parts of the loaded dataframe with all 3-char-grams in the above test run without multiprocessing, I saw a monstrous rise in memory during the parallelized runs:

Starting from a level of 2.4 GB, memory rose to 12.5 GB during the run and then fell back to 4.5 GB. So, there are copying processes and memory is not completely released again in the end – despite having all and everything encapsulated in functions. Repeating the multiprocessed runs even lead to a systematic increase in memory by about 150 MB per run.

So, when working with the “multiprocessing module” and big Pandas dataframes you should be a bit careful about the actual RAM consumption during the runs.

Conclusion

This series about finding words in a vocabulary by using two or three 3-char-grams may have appeared a bit “academical” – as one of my readers told me. Why the hell should someone use only a few 3-char-grams to identify words?

Well, I have tried to give some answers to this question: Under certain conditions you may only have fragments of words available; think of text transcribed from a recorded, but distorted communication with Skype or think of physically damaged written text documents. A similar situation may occur when you cannot trust a written string token to be a correctly written word – due to misspelling or other reasons (bad OCR SW or bad document conditions for scans combined with OCR).

In addition: character-grams are actually used as a basis for multiple ML methods for text-analysis tasks, e.g. in Facebook’s Fasttext. They give a solid base for an embedded word vector space which can help to find and measure similarities between correctly written words, but also between correctly written words and fantasy words or misspelled words. Looking a bit at the question of how much a few 3-char-grams help to identify a word is helpful to understand their power in other contexts, too.

We have seen that only three 3-char-grams can identify matching words quite well – even if the words are long words (up to 30 characters). The list of matching words can be kept surprisingly small if and when

  • we use available or reasonable length information about the words we want to find,
  • we define positions for the 3-char-grams inside the words,
  • we put some positional distance between the location of the chosen 3-char-grams inside the words.

For a 100,000 random cases with correctly written 3-char-grams the average length of the hit list was below 2 – if the distance between the 3-char-grams was
reasonably large compared to the token-length. Similar results were found for using only two 3-char-grams for short words.

We have also covered some very practical aspects regarding search operation on relatively big Pandas dataframes :

The CPU-time for identifying words in a Pandas dataframe by using 3-char-grams is reasonably small to allow for experiments with around 100,000 tokens even on PCs within minutes or quarters of an hour – but it does not take hours. As using 3-char-grams corresponds to putting conditions on two or three columns of a dataframe this result can be generalized to other similar problems with string comparisons on dataframe columns.

The basic RAM consumption of dataframes containing up to fifty-five 3-char-grams per word can be efficiently controlled by using the dtype “category” for the respective columns.

Regarding cpu-time we saw that working with many searches may get a performance boost by a factor well above 2 by using simple multiprocessing techniques based on Python’s “multiprocessing” module. However, this comes with an unpleasant side effect of enormous RAM consumption – at least temporarily.

I hope you had some fun with this series of posts. In a forthcoming series I will apply these results to the task of error correction. Stay tuned.

Links

https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a
https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-