Pandas and 3-char-grams of a vocabulary – reduce memory consumption by datatype „category“

I sit in front of my old laptop and want to pre-process data of a pool of scanned texts for an analysis with ML and conventional algorithms. One of the tasks will be to correct at least some wrongly scanned words by "brute force" methods. A straightforward approach is to compare "3-character-gram" segments of the texts' distinct words (around 1.9 million) with the 3-char-gram patterns of the words of a reference vocabulary. The vocabulary I use contains around 2.7 million German words.

I started today with a 3-char-gram segmentation of the vocabulary - just to learn that tackling this problem with Pandas and Python pretty soon leads to a significant amount of RAM consumption. RAM is limited on my laptop (16 GB), so I have to keep memory requirements low. In this post I discuss some elementary Pandas tricks which helped me reduce memory consumption.

The task

I load my vocabulary from a CSV file into a Pandas dataframe named "dfw_uml". The dataframe has four columns: "indw", "word", "len" (the word length) and "lower" (the lower-case version of the word).

The "indw" column is identical to the "lower" column. "indw" allows me to use the lower-case version of the words as a (unique) index instead of an integer index. (This is a very useful option for some kinds of analysis. Note: As long as the string-based index is unique, a hash function is used to make operations using a string-based index very fast.)
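
A minimal sketch of such a string-based index, using a toy dataframe with made-up words instead of my real vocabulary (the name "df_by_word" is just an illustration):

```python
import pandas as pd

# Toy vocabulary; the real dataframe "dfw_uml" has about 2.7 million rows.
df = pd.DataFrame({"word": ["Haus", "Baum", "Netz"]})
df["lower"] = df["word"].str.lower()
df["indw"] = df["lower"]

# Use "indw" as the index, but keep it as a regular column as well.
df_by_word = df.set_index("indw", drop=False)

# Uniqueness of the index is the precondition for fast label-based lookups.
is_unique = df_by_word.index.is_unique

# Label-based access by the lower-case word.
row = df_by_word.loc["baum"]
print(is_unique, row["word"])
```

With a unique index ".loc" returns a single row (a Series); with duplicate labels it would return a dataframe of all matching rows.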

For all the words in the vocabulary I want to get all their individual 3-char-gram segments. "All" needs to be qualified: When you split a word into 3-char-grams you can do this with an overlap of the segments or without. Similar to filter kernels of CNNs I call the character-shift of consecutive 3-char-grams against each other "stride".

Let us look at a short word like "angular" (with a length "len" = 7 characters). How many 3-char-grams do we get with a stride of 1? This depends on the padding around the word's edges with special characters. Let us say we allow for a padding of 2 characters "%" on the left side of the word and 2 characters "#" on the right side (assuming that these characters are not part of the words themselves). Then, with a stride of 1, the 3-char-grams are:

'%%a', '%an', 'ang', 'ngu', 'gul', 'ula', 'lar', 'ar#', 'r##'

I.e., we get len+2 (=9) different 3-char-grams.

However, with a stride of 3 and a left-padding of 0 we get :

'ang', 'ula', 'r##'

I.e., len/3 + 1 (=3) different 3-char-grams, with "/" meaning integer division. (Whether we need the additional 3-char-gram depends on the remainder len%3.) On the right-hand side of the word we have to fill up the rightmost 3-char-gram with our extra character "#".
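
The two splittings of "angular" can be reproduced with a few lines of plain Python. This is only a sketch for a single word; the function name "char_grams" and its parameter names are my own choice for illustration and do not appear in the Pandas code further below.

```python
def char_grams(word, stride=1, left_pad=2, pad_start="%", pad_end="#"):
    """Split a word into 3-char-grams for a given stride and left-padding."""
    padded = pad_start * left_pad + word
    grams = []
    for start in range(0, len(padded), stride):
        # Segments at the right edge are filled up with the end character.
        grams.append(padded[start:start + 3].ljust(3, pad_end))
    return grams

grams_s1 = char_grams("angular", stride=1, left_pad=2)
grams_s3 = char_grams("angular", stride=3, left_pad=0)
print(grams_s1)
print(grams_s3)
```

The first call gives the nine overlapping segments listed above, the second one the three segments without overlap.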

The difference in the total number of 3-char-grams is substantial. And it grows linearly with the word length.
In a German vocabulary many quite long words (compounds) may appear. In my vocabulary the longest word has 58 characters:

"telekommunikationsnetzgeschaeftsfuehrungsgesellschaft"

(with umlauts ä and ü written as 'ae' and 'ue', respectively). So, we talk about 60 or 20 additional columns required for "all" 3-char-grams.
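
The column counts can be checked with a small helper. It is a sketch under my own reading of the counting rule: every start position inside the left-padded word opens one 3-char-gram, so the count is the rounded-up quotient of (len + left-padding) and the stride. The function name is hypothetical.

```python
import math

def num_grams_needed(word_len, stride=1, left_pad=2):
    """Number of 3-char-grams for a word of given length (right side padded as needed)."""
    return math.ceil((word_len + left_pad) / stride)

short_s1 = num_grams_needed(7, stride=1, left_pad=2)    # "angular", stride 1
short_s3 = num_grams_needed(7, stride=3, left_pad=0)    # "angular", stride 3
long_s1 = num_grams_needed(58, stride=1, left_pad=2)    # 58 characters, stride 1
long_s3 = num_grams_needed(58, stride=3, left_pad=0)    # 58 characters, stride 3
print(short_s1, short_s3, long_s1, long_s3)
```

Note that for word lengths divisible by 3 this rule gives one segment less than len/3 + 1, because no extra, partly filled segment is required at the right edge.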

So, choosing a suitable stride is an important factor to control memory consumption. But for some kinds of analysis you may just want to limit the number (num_grams) of 3-char-grams. E.g. you may set num_grams = 20.

When working with a Pandas table-like data structure it seems logical to arrange the 3-char-grams in the form of separate columns. Let us take 20 columns for different 3-char-grams as an objective for this article. We can create such 3-char-grams for all vocabulary words either with "stride = 3" or with "stride = 1" combined with "num_grams = 20". I pick the latter option.

Which padding and stride values are reasonable?

Padding on the right side of a word is in my opinion always reasonable when creating the 3-char-grams. You will see from the code in the next section how one creates the right-most 3-char-grams of the vocabulary words efficiently. On the left side of a word padding may depend on what you want to analyze. The following stride and left-padding combinations seem reasonable to me for 3-char-grams:

  • stride = 3, left-padding = 0
  • stride = 2, left-padding = 0
  • stride = 2, left-padding = 2
  • stride = 1, left-padding = 2
  • stride = 1, left-padding = 1
  • stride = 1, left-padding = 0

Code to create 3-char-grams

The following function builds the 3-char-grams for the different combinations.

import time

def create_3grams_of_voc(dfw_words, num_grams=20, 
                         padding=2, stride=1, 
                         char_start='%', char_end='#', b_cpu_time=True):
    
    cpu_time = 0.0
    if b_cpu_time:
        v_start_time = time.perf_counter()
    
    # Some checks 
    if stride > 3:
        print('stride > 3 cannot be handled by this function for 3-char-grams')
        return dfw_words, cpu_time
    if stride == 3 and padding > 0:
        print('stride == 3 should be used with padding=0 ')
        return dfw_words, cpu_time 
    if stride == 2 and padding == 1: 
        print('stride == 2 should be used with padding=0, 2 - only')
        return dfw_words, cpu_time 

    st1 = char_start
    st2 = 2*char_start
    
    # begin: starting index for loop below   
    begin = 0 
    if stride == 3:
        begin = 0
    if stride == 2 and padding == 2:
        dfw_words['gram_0'] = st2 + dfw_words['lower'].str.slice(start=0, stop=1)
        begin = 1
    if stride == 2 and padding == 0:
        begin = 0
    if stride == 1 and padding == 2:
        dfw_words['gram_0'] = st2 + dfw_words['lower'].str.slice(start=0, stop=1)
        dfw_words['gram_1'] = st1 + dfw_words['lower'].str.slice(start=0, stop=2)
        begin = 2
    if stride == 1 and padding == 1:    
        dfw_words['gram_0'] = st1 + dfw_words['lower'].str.slice(start=0, stop=2)
        begin = 1
    if stride == 1 and padding == 0:    
        begin = 0
        
    # for num_grams == 20 we have to create elements up to and including gram_21 (range -> 22)
        
    # Note that the operations in the loop occur column-wise, i.e. vectorized
    # => You cannot make them row-dependent
    # We are lucky that slice returns '' if the start index lies beyond the end of a word
    for i in range(begin, num_grams+2):
        col_name = 'gram_' + str(i)
        
        sl_start = i*stride - padding
        sl_stop  = sl_start + 3
        
        dfw_words[col_name] = dfw_words['lower'].str.slice(start=sl_start, stop=sl_stop) 
        dfw_words[col_name] = dfw_words[col_name].str.ljust(3, '#')
    
    # We are lucky that nothing happens if not required to fill up  
    #for i in range(begin, num_grams+2):
    #    col_name = 'gram_' + str(i)
    #    dfw_words[col_name] = dfw_words[col_name].str.ljust(3, '#')

    if b_cpu_time:
        v_end_time = time.perf_counter()
        cpu_time   = v_end_time - v_start_time
        
    return dfw_words, cpu_time

The only noteworthy thing about this code is the vectorized handling of the columns. (Setting up all of the 3-char-gram columns still requires around 51 secs on my laptop.)
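
What the column-wise slicing does can be seen on a toy dataframe. The sketch below uses three made-up words, no left-padding and only three gram columns; it is not the function above, just its central loop in isolation.

```python
import pandas as pd

df = pd.DataFrame({"lower": ["angular", "ab", "haus"]})

stride, padding = 1, 0
for i in range(3):
    start = i * stride - padding
    # One vectorized slice per column: all rows are handled at once.
    col = df["lower"].str.slice(start=start, stop=start + 3)
    # Short or empty slices are filled up to 3 characters with '#'.
    df[f"gram_{i}"] = col.str.ljust(3, "#")

grams_ab = df.loc[1, ["gram_0", "gram_1", "gram_2"]].tolist()
print(df)
```

For the short word "ab" the slices beyond the end of the word are empty and end up as '###'. A negative start index would not work this way (in Python 'angular'[-2:1] is an empty string), which explains why the left-padded columns are built separately before the loop.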

We call the function above for stride=1, padding=2, num_grams=20 by the following code in a Jupyter cell:

num_grams = 20; stride = 1; padding = 2
dfw_uml, cpu_time = create_3grams_of_voc(dfw_uml, num_grams=num_grams, padding=padding, stride=stride)
print("cpu_time = ", cpu_time)
print()
dfw_uml.head(3)

RAM consumption

Let us see what the memory consumption looks like. After having loaded all required libraries and some functions, my Jupyter plugin "jupyter-resource-usage" for memory consumption shows: "Memory: 208.3 MB".

When I fill the Pandas dataframe "dfw_uml" with the vocabulary data this number changes to: "Memory: 915 MB".

Then I create the 3-char-gram columns for "num_grams = 20; stride = 1; padding = 2".

The memory jumps to "Memory: 4.5 GB". The OS with some started servers on the laptop takes around 2.6 GB. So, we have already consumed around 45% of the available RAM.

Looking at details by

dfw_uml.info(memory_usage='deep')

shows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   indw     object
 1   word     object
 2   len      int64 
 3   lower    object
 4   gram_0   object
 5   gram_1   object
 6   gram_2   object
 7   gram_3   object
 8   gram_4   object
 9   gram_5   object
 10  gram_6   object
 11  gram_7   object
 12  gram_8   object
 13  gram_9   object
 14  gram_10  object
 15  gram_11  object
 16  gram_12  object
 17  gram_13  object
 18  gram_14  object
 19  gram_15  object
 20  gram_16  object
 21  gram_17  object
 22  gram_18  object
 23  gram_19  object
 24  gram_20  object
 25  gram_21  object
dtypes: int64(1), object(25)
memory usage: 4.0 GB

The memory consumption due to our expanded dataframe is huge. No wonder with around 59.4 million string-like entries in the dataframe! Pandas gives us no direct option to declare a column as a column of fixed-width 3-character strings. For strings Pandas instead uses the flexible datatype "object".
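
A rough plausibility check for this number, not a measurement: with dtype "object" every cell holds a reference to a full Python string object. The sketch below assumes a 64-bit CPython (8 bytes per reference) and simply multiplies the size of one short string object by the number of 3-char-gram entries.

```python
import sys

n_rows, n_gram_cols = 2_700_936, 22
n_entries = n_rows * n_gram_cols          # about 59.4 million entries

bytes_per_str = sys.getsizeof("ang")      # size of one 3-character string object
bytes_per_ref = 8                         # assumption: 64-bit references

estimate_gb = n_entries * (bytes_per_str + bytes_per_ref) / 1024**3
print(f"{n_entries:,} entries, roughly {estimate_gb:.1f} GB")
```

The exact value depends on the Python version, but the order of magnitude fits the 4.0 GB reported by info(): a few bytes of text per entry are buried under the overhead of a full string object.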

Reducing memory consumption by using datatype "category"

Looking at the data we get the impression that one should be able to reduce the amount of required memory because the entries in all of the 3-char-gram-columns are non-unique. Actually, the 3-char-grams mark major groups of words (probably in a typical way for a given western language).

We can get the number of unique 3-char-grams in a column with the following code snippet:

li_unique = []
for i in range(2,22):
    col_name     = 'gram_' + str(i)
    count_unique = dfw_uml[col_name].nunique() 
    li_unique.append(count_unique)
print(li_unique)         

Giving for the 20 columns "gram_2" to "gram_21":

[3068, 4797, 8076, 8687, 8743, 8839, 8732, 8625, 8544, 8249, 7829, 7465, 7047, 6700, 6292, 5821, 5413, 4944, 4452, 3989]

Compared to 2.7 million rows these numbers are relatively small. This is where the datatype (dtype) "category" comes in handy. We can transform the dtype of the dataframe columns by

for i in range(0,22):
    col_name     = 'gram_' + str(i)
    dfw_uml[col_name] = dfw_uml[col_name].astype('category')

"dfw_uml.info(memory_usage='deep')" afterwards gives us:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype   
---  ------   -----   
 0   indw     object  
 1   word     object  
 2   len      int64   
 3   lower    object  
 4   gram_0   category
 5   gram_1   category
 6   gram_2   category
 7   gram_3   category
 8   gram_4   category
 9   gram_5   category
 10  gram_6   category
 11  gram_7   category
 12  gram_8   category
 13  gram_9   category
 14  gram_10  category
 15  gram_11  category
 16  gram_12  category
 17  gram_13  category
 18  gram_14  category
 19  gram_15  category
 20  gram_16  category
 21  gram_17  category
 22  gram_18  category
 23  gram_19  category
 24  gram_20  category
 25  gram_21  category
dtypes: category(22), int64(1), object(3)
memory usage: 739.9 MB

Just 740 MB!
Hey, we have reduced the required memory for the dataframe by more than a factor of 4!
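
Where the saving comes from can be seen on synthetic data. A categorical column stores each distinct string only once and keeps a small integer code per row. The sketch below uses six made-up 3-char-grams and 100,000 rows; the numbers are of course not comparable to my vocabulary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
grams = np.array(["ang", "ung", "sch", "ein", "che", "###"], dtype=object)

# A column of Python strings (dtype "object") with few distinct values.
s_obj = pd.Series(rng.choice(grams, size=100_000), dtype=object)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
codes_dtype = str(s_cat.cat.codes.dtype)
print(obj_bytes, cat_bytes, codes_dtype)
```

With only six categories the codes fit into "int8". For columns with several thousand distinct 3-char-grams, as in my vocabulary, I would expect "int16" codes, i.e. 2 bytes per row.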

Read in data from CSV with already reduced memory

We can now save the result of our efforts in a CSV-file by

# the following statement just helps to avoid an unnamed column during export
dfw_uml = dfw_uml.set_index('indw') 
# export to csv-file
export_path_voc_grams = '/YOUR_PATH/voc_uml_grams.csv'
dfw_uml.to_csv(export_path_voc_grams)

For the reverse process of importing the data from a CSV-file the following question comes up:
How can we enforce that the data are read into dataframe columns with dtype "category", such that no unnecessary memory is used during the read-in process? The answer is simple:

Pandas allows the definition of the columns' dtype in form of a dictionary which can be provided as a parameter to the function "read_csv()".
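
The mechanism can be tried out on a tiny in-memory CSV before applying it to the real file. The two rows and the reduced number of gram columns in this sketch are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

csv_text = (
    "indw,word,len,lower,gram_0,gram_1\n"
    "haus,Haus,4,haus,hau,aus\n"
    "baum,Baum,4,baum,bau,aum\n"
)

# dtype per column name; the gram columns are read as "category" right away.
dtypes = {"indw": "str", "word": "str", "len": np.int16, "lower": "str",
          "gram_0": "category", "gram_1": "category"}

df_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, na_filter=False)
print(df_small.dtypes)
```

Columns which are not named in the dictionary keep the dtype that Pandas infers for them.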

We define two functions to prepare data import accordingly:


import time
import numpy as np
import pandas as pd

# Function to create a dictionary with dtype information for columns
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def create_type_dict_for_gram_cols(num_grams=20):
    # Expected structure:
    # {indw: str, word: str, len: np.int16, lower: str, gram_0 ... gram_21: 'category'}
    
    gram_col_dict = {}
    gram_col_dict['indw']  = 'str'
    gram_col_dict['word']  = 'str'
    gram_col_dict['len']   = np.int16
    gram_col_dict['lower'] = 'str'
    
    for i in range(0,num_grams+2):
        col_name     = 'gram_' + str(i)
        gram_col_dict[col_name] = 'category'
    
    return gram_col_dict

# Function to read in vocabulary with prepared grams 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def readin_voc_with_grams(import_path='', num_grams = 20, b_cpu_time = True):
    if import_path == '':
        import_path = '/YOUR_PATH/voc_uml_grams.csv'
    
    cpu_time = 0.0 
    
    if b_cpu_time:
        v_start_time = time.perf_counter()

    # create dictionary with dtype-settings for the columns
    d_gram_cols = create_type_dict_for_gram_cols(num_grams = num_grams )
    df = pd.read_csv(import_path, dtype=d_gram_cols, na_filter=False)
    
    if b_cpu_time:
        v_end_time = time.perf_counter()
        cpu_time   = v_end_time - v_start_time
   
    return df, cpu_time

With these functions we can read in the CSV file. We restart the kernel of our Jupyter notebook to clear all memory and give it back to the OS.

After having loaded libraries and functions we get: "Memory: 208.9 MB". Now we fill a new Jupyter cell with:

import_path_voc_grams = '/YOUR_PATH/voc_uml_grams.csv'

print("Starting read-in of vocabulary with 3-char-grams")
dfw_uml, cpu_time = readin_voc_with_grams( import_path=import_path_voc_grams,
                                           num_grams = 20)
print()
print("cpu time for df-creation = ", cpu_time)

We run this code and get :

Starting read-in of vocabulary with 3-char-grams

cpu time for df-creation =  16.770479727001657

and : "Memory: 1.4 GB"

"dfw_uml.info(memory_usage='deep')" indeed shows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700936 entries, 0 to 2700935
Data columns (total 26 columns):
 #   Column   Dtype   
---  ------   -----   
 0   indw     object  
 1   word     object  
 2   len      int16   
 3   lower    object  
 4   gram_0   category
 5   gram_1   category
 6   gram_2   category
 7   gram_3   category
 8   gram_4   category
 9   gram_5   category
 10  gram_6   category
 11  gram_7   category
 12  gram_8   category
 13  gram_9   category
 14  gram_10  category
 15  gram_11  category
 16  gram_12  category
 17  gram_13  category
 18  gram_14  category
 19  gram_15  category
 20  gram_16  category
 21  gram_17  category
 22  gram_18  category
 23  gram_19  category
 24  gram_20  category
 25  gram_21  category
dtypes: category(22), int16(1), object(3)
memory usage: 724.4 MB

Obviously, we save some bytes by "int16" as dtype for len. But Pandas seems to use around 400 MB memory in the background for data handling during the read-in process.

Nevertheless: instead of using 4.5 GB we now consume only 1.4 GB.

Conclusion

Working with huge vocabularies and creating 3-char-gram segments for each word in the vocabulary is a memory-consuming process with Pandas. Using the dtype 'category' helps a lot to save memory. For a typical German vocabulary a memory reduction by a factor of 4 is within reach.
When importing data from a CSV-file with already prepared 3-char-gram columns we can enforce the use of dtype 'category' for columns of a dataframe by providing a suitable dictionary to the function "read_csv()".

Walking a directory tree of text files and detecting languages with Python, langdetect and fastText

Whilst preparing text data for ML we sometimes are confronted with a large amount of separate text-files organized in a directory hierarchy. In the last article in this category

Walking a directory tree with Python to eliminate unwanted files

I have shown how we "walk" through such a directory tree of files with the help of Python and os.walk(). We can for example use the os.walk() procedure to eliminate unwanted files in the tree.

As an example I discussed a collection of around 200,000 scanned paper pages with typewriter text; the collection contained jpg-files and related OCR-based txt-files. With os.walk() I could easily get rid of the jpg-files.

A subsequent task which awaited me regarding my special text collection was the elimination of those text-files which, with a relatively high probability, were not written in German. To solve such a problem with Python one needs a module that guesses the language of a given text correctly - in my case despite scanning and OCR errors. In this article I discuss the application of two methods - one based on the Python module "langdetect" and the other one based on a special variant of Facebook's "fastText"-algorithm. The latter is by far the faster approach - which is of relevance given the relatively big number of files to analyze.

"langdetect" and its "detect()"-function

To get the "langdetect" module, install it into your virtual Python environment via "pip install langdetect". To test its functionality, just present a text written in French and another one written in German from respective txt-files to the "detect()"-function of this module - like e.g. in the following Jupyter cell:

from langdetect import detect
import os
import time

ftxt1 = open('/py/projects/CA22/catch22/0034_010.txt', 'r').read()
print(detect(ftxt1))
ftxt2 = open('/py/projects/CA22/catch22/0388_001.txt', 'r').read()
print(detect(ftxt2))

In my test case this gave the result:

fr
de

So, detect() is pretty easy to use.

What about the performance of langdetect.detect()?
Let us look at the performance. For this I just ran os.walk() over the directory tree of my collection and printed out some information for all files not written in German.

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

n = 0
m = 0
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    subdirs.sort()
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        ftext = open(filepath, 'r').read()
        lang = detect(ftext) 
        if lang != 'de':
            print(filepath, ' ', lang)
            m += 1 
        n += 1
    if n > 9999: 
        break

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

print("num of files checked: ", n)
print("num of non German files: ", m)

Resulting numbers were:

Total CPU time  72.65049991299747
num of files checked:  10033
num of non German files:  1465

Well, printing in a browser to a Jupyter cell takes some time itself. However, we shall see that this aspect is not really relevant in this example. (The situation would be different if we had to switch contexts between GPU operations and CPU operations whilst printing in real ML applications run on a GPU, primarily.)
In addition we are only interested in relative (CPU) performance when comparing the number later to the result of fastText.

I measured 74 secs for a standard hard disk (no SSD) with printing - observing that everything just ran on one core of the CPU. Without printing the run required 72 secs. All measured with disc caching active and taking an average of three runs on cached data.

So, for my whole collection of 200,000 txt-files I would have to wait roughly 25 minutes - even if the files had been opened, read and cached previously. Deleting the files from the disk would cost a bit of extra time in addition.

Without caching, i.e. for an original run, I would probably have to wait for around 3 hrs. langdetect.detect() is somewhat slow.
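
For a run over the full collection I would want the loop to survive unreadable bytes and texts for which the detector gives up. The following sketch wraps the walk in a function which accepts any detector as a callable. The function name, the encoding "utf-8" and the stand-in detector "fake_detect" are assumptions made for illustration; in a real run one would pass langdetect's detect() instead.

```python
import os
import tempfile

def count_non_german(dir_path, detect_fn, max_files=None):
    """Return (files checked, non-German files, detection failures)."""
    n_checked = n_other = n_failed = 0
    for dirname, subdirs, filenames in os.walk(dir_path):
        subdirs.sort()                       # fixed traversal order
        for filename in sorted(filenames):
            if max_files is not None and n_checked >= max_files:
                return n_checked, n_other, n_failed
            filepath = os.path.join(dirname, filename)
            # errors="replace" keeps the loop alive on undecodable bytes
            with open(filepath, "r", encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            n_checked += 1
            try:
                lang = detect_fn(text)
            except Exception:                # e.g. a file without any text
                n_failed += 1
                continue
            if lang != "de":
                n_other += 1
    return n_checked, n_other, n_failed

def fake_detect(text):
    """Stand-in detector for this demo only; not a real language model."""
    if not text.strip():
        raise ValueError("no text")
    return "de" if "und" in text.split() else "fr"

with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.txt", "Haus und Hof"), ("b.txt", "le chat et le chien"), ("c.txt", "")]
    for name, content in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(content)
    result = count_non_german(tmp, fake_detect)

print(result)
```

One more point for real runs: the README of langdetect mentions that results may vary between runs unless "DetectorFactory.seed" is set to a fixed value before calling detect().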

fastText

FastText was developed by Facebook. It is used to create word-vector models - which can be embedded in neural networks via special layers. See
https://en.wikipedia.org/wiki/FastText and https://fasttext.cc/.

An original publication on the fastText algorithm can be found at https://arxiv.org/abs/1607.04606. A somewhat less scientific, but more accessible introduction is presented at:
https://amitness.com/2020/06/fasttext-embeddings/.

Besides supporting a variety of real ML-tasks, fastText also offers a language detection functionality. And Python is supported, too:

In your virtual Python environment just use "pip install fasttext". In addition, download a pretrained model for language detection on a Linux CLI, e.g. with wget:

wget -O /py/projects/fasttext/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

FastText for language detection can be used together with os.walk() as follows:

import fasttext
PRETRAINED_MODEL_PATH = '/py/projects/fasttext/lid.176.bin'
model = fasttext.load_model(PRETRAINED_MODEL_PATH)

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

n = 0
m = 0
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    subdirs.sort()
    #print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        ftext = open(filepath, 'r').read()
        ftext = ftext.replace("\n", " ")
        tupel_lang = model.predict(ftext) 
        #print(tupel_lang)
        lang = tupel_lang[0][0][-2:]
        #print(lang)
        if lang != 'de':
            #print(filepath, ' ', lang) 
            m += 1 
        n += 1
    if n > 9999: 
        break

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print("num of files checked: ", n)
print("num of non German files: ", m)

Result numbers:

Total CPU time  1.895524049999949
num of files checked:  10033
num of non German files:  1355

Regarding performance, fastText is faster than langdetect.detect() by a factor of about 38!
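
A side note on the label handling in the loop above: the model returns labels of the form "__label__de", and taking the last two characters works for the check against 'de'. A slightly more robust variant strips the prefix instead, which also keeps language codes with three letters intact. The sketch below works on plain values shaped like the result of model.predict() for a single text; the helper names and the threshold parameter "min_prob" are my own additions.

```python
LABEL_PREFIX = "__label__"

def label_to_lang(label):
    """Turn '__label__de' into 'de'; leave other strings unchanged."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label

def is_german(labels, probs, min_prob=0.0):
    """labels and probs as returned by model.predict(text) for one text."""
    return label_to_lang(labels[0]) == "de" and float(probs[0]) >= min_prob

code_de = label_to_lang("__label__de")
code_als = label_to_lang("__label__als")
sure = is_german(("__label__de",), [0.98], min_prob=0.5)
unsure = is_german(("__label__de",), [0.30], min_prob=0.5)
print(code_de, code_als, sure, unsure)
```

A threshold on the probability is one possible way to treat heavily garbled OCR texts separately instead of sorting them into "German" or "non-German" right away.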

Eliminating non-German files from the collection of 208,000 files took only 42 secs - once the files had been cached. An original delete run without caching took around 320 secs - which is still much faster than a detect()-run on cached data. As said, all of this was measured on an (older) HD. This is very acceptable. SSDs would give us an even better performance.

Accuracy of the language detection

The reader certainly has noticed that the number of non-German files detected by fastText is somewhat smaller than what langdetect.detect() gave me. That there is a discrepancy is no wonder: Some of the text files are filled with so many scan and OCR errors that even a human would have problems detecting the original language. I just checked 10 of the files where the different methods showed a discrepancy for the guessed language. In all of these 10 cases the guess of fastText was better.

Conclusion

Although I am not a friend of Facebook they have produced something really useful with their fastText algorithm. Regarding language detection fastText provides a really convincing super-fast and reliable method to guess the main language used in text pieces. It supports Python environments and allows the analysis of a huge amount of text files even on a PC.

Walking a directory tree with Python to eliminate unwanted files

Recently I got an enormous amount of text and related image files of scanned texts (around 210,000). The whole dataset had a size of around 25 GB. The text data should be analyzed with some machine learning [ML] algorithms. One of the first things to do in such a situation is to get rid of the jpg-files. Such files consume most of the disk space.

In my case I also got the data from a Mac machine. Hidden "._"-files were created on the Mac when the original data were downloaded from the Internet. These files control Mac security operations. I had to eliminate these files, too.

Due to the doubling of files and the additional "._"-files the total number of files was around 830,000. The number of files really required was much smaller. Eliminating files which one does not need is an exercise that is often required when preparing text files in a Machine Learning context.

In such a situation the function "os.walk()" is helpful in a Python environment.

os.walk

os.walk() allows us to walk recursively through a directory tree. For each directory we get a tuple back containing

  1. the path to the present directory,
  2. a list of all sub-directories,
  3. a list of all files (by their names) in the directory.

For typical applications this is enough information to perform analysis and file operations within the directories.
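
The three elements of the tuple can be inspected on a small throw-away tree. The directory and file names in this sketch are made up; the tree is created in a temporary directory and removed again afterwards.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub_a"))
    os.makedirs(os.path.join(root, "sub_b"))
    for rel in ["top.txt",
                os.path.join("sub_a", "page1.txt"),
                os.path.join("sub_a", "page1.jpg")]:
        open(os.path.join(root, rel), "w").close()

    visited = []
    for dirname, subdirs, filenames in os.walk(root):
        # Sorting in place fixes the order in which sub-directories are visited.
        subdirs.sort()
        visited.append((os.path.relpath(dirname, root), list(subdirs), sorted(filenames)))

for entry in visited:
    print(entry)
```

The file list contains bare names only, which is why the examples in this article join them with the directory path via os.path.join().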

Application

In my case the usage was very simple. In a Jupyter cell the following code helped:

import os
import time

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        if filename.endswith('.jpg') or filename.endswith('.db') or filename.endswith('-GS.txt'):
            os.remove(filepath) 
            
# extra loop as there are hidden "."-files also for ".jpg"-files      
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        if filename.startswith('._'):
            os.remove(filepath) 

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

If you do not print out file-paths this should be a matter of seconds only, on an SSD below a second.
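
Before deleting anything from a big collection a dry run is a good idea. The sketch below first collects the candidates in one pass and removes them afterwards. It uses the same filename patterns as above; the helper names are hypothetical, and the demo works on a throw-away directory with made-up files.

```python
import os
import tempfile

UNWANTED_SUFFIXES = (".jpg", ".db", "-GS.txt")

def is_unwanted(filename):
    # str.endswith() accepts a tuple of suffixes
    return filename.startswith("._") or filename.endswith(UNWANTED_SUFFIXES)

def collect_unwanted(dir_path):
    """Dry run: return the paths that would be deleted, without touching them."""
    hits = []
    for dirname, subdirs, filenames in os.walk(dir_path):
        for filename in filenames:
            if is_unwanted(filename):
                hits.append(os.path.join(dirname, filename))
    return sorted(hits)

with tempfile.TemporaryDirectory() as root:
    names = ["0001.txt", "0001.jpg", "._0001.jpg", "._0001.txt", "0001-GS.txt", "Thumbs.db"]
    for name in names:
        open(os.path.join(root, name), "w").close()

    to_delete = collect_unwanted(root)       # nothing is removed yet
    n_hits = len(to_delete)
    for path in to_delete:
        os.remove(path)
    remaining = sorted(os.listdir(root))

print(n_hits, remaining)
```

For the patterns used here a single pass should be equivalent to the two loops above, as a hidden "._"-file is matched by its prefix independently of its suffix.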

Counting the remaining files

We can also use os.walk() to count the number of remaining files:

n = 0 
v_start_time = time.perf_counter()
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    n += len(filesdir)
v_end_time = time.perf_counter()
print('Number of files = ', n )
print("Total CPU time ", v_end_time - v_start_time)

On a Linux system you could also use

mytux:~ # find /py/projects/CA22/catch22/ -type f | wc -l
208123

Thus I could bring down the total size to 780 MB and the number of txt-files to be processed to around 208,000.