Walking a directory tree of text files and detecting languages with Python, langdetect and fastText

Whilst preparing text data for ML we are sometimes confronted with a large number of separate text files organized in a directory hierarchy. In the last article in this category

Walking a directory tree with Python to eliminate unwanted files

I have shown how we "walk" through such a directory tree of files with the help of Python and os.walk(). We can, for example, use os.walk() to eliminate unwanted files in the tree.

As an example I discussed a collection of around 200.000 scanned paper pages with typewriter text; the collection contained jpg-files and related OCR-based txt-files. With os.walk() I could easily get rid of the jpg-files.

A subsequent task regarding my special text collection was the elimination of those text files which, with a relatively high probability, were not written in German. To solve such a problem with Python one needs a module that guesses the language of a given text correctly - in my case despite scanning and OCR errors. In this article I discuss the application of two methods - one based on the Python package "langdetect" (a third-party package, not part of the standard library) and the other one based on a pretrained language identification model for Facebook's "fastText" library. The latter is by far the faster approach - which is of relevance given the relatively large number of files to analyze.

"langdetect" and its "detect()"-function

To get the "langdetect" module, install it into your virtual Python environment via "pip install langdetect". To test its functionality, just read a text written in French and another one written in German from respective txt-files and pass them to the "detect()" function of this module - e.g. as in the following Jupyter cell:

from langdetect import detect
import os
import time

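# read one French and one German sample file and print the guessed language codes
with open('/py/projects/CA22/catch22/0034_010.txt', 'r') as f:
    ftxt1 = f.read()
print(detect(ftxt1))
with open('/py/projects/CA22/catch22/0388_001.txt', 'r') as f:
    ftxt2 = f.read()
print(detect(ftxt2))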

In my test case this gave the result:

fr
de

So, detect() is pretty easy to use.
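
Two practical details are worth keeping in mind when detect() is applied to noisy OCR output. The langdetect documentation notes that the algorithm is non-deterministic unless a seed is set via DetectorFactory.seed, and detect() raises a LangDetectException when a text contains no usable features (e.g. an empty file). The following is a minimal sketch of a defensive wrapper. The helper name safe_detect and the fallback value "unknown" are my own assumptions for illustration; the detector is passed in as an argument so that the sketch also runs without langdetect being installed.

```python
def safe_detect(text, detect_fn, fallback="unknown"):
    """Guess the language of `text` with `detect_fn`; return `fallback` on failure.

    `detect_fn` is any callable mapping a string to a language code,
    e.g. langdetect.detect. (safe_detect itself is a hypothetical helper.)
    """
    if not text.strip():
        # Nothing to analyze: skip the detector entirely
        return fallback
    try:
        return detect_fn(text)
    except Exception:
        # langdetect raises LangDetectException for texts without features
        # (e.g. only digits or punctuation)
        return fallback


# Usage with langdetect (assumes the package is installed):
#   from langdetect import DetectorFactory, detect
#   DetectorFactory.seed = 0        # make repeated runs reproducible
#   print(safe_detect(ftxt1, detect))
```

Catching the broad Exception class keeps the sketch independent of the package; in real code one would probably catch LangDetectException only.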

What about the performance of langdetect.detect()?
Let us look at the performance. For this I just ran os.walk() over the directory tree of my collection and printed out some information for all files not being of German language.

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

n = 0
m = 0
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    subdirs.sort()
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        # the with-statement closes each file again after reading
        with open(filepath, 'r') as fh:
            ftext = fh.read()
        lang = detect(ftext) 
        if lang != 'de':
            print(filepath, ' ', lang)
            m += 1 
        n += 1
    if n > 9999: 
        break

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

print("num of files checked: ", n)
print("num of non German files: ", m)

Resulting numbers were:

Total CPU time  72.65049991299747
num of files checked:  10033
num of non German files:  1465
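
Note that the check "n > 9999" sits at the level of the directory loop, so the run only stops after a directory has been processed completely; this is presumably why 10033 rather than exactly 10000 files were checked. If an exact limit is wanted, one option is to wrap os.walk() in a generator and cut it off with itertools.islice. A minimal sketch; iter_files and first_n_files are hypothetical helper names:

```python
import itertools
import os


def iter_files(root):
    """Yield the full path of every file below `root`, directory by directory."""
    for dirname, subdirs, filenames in os.walk(root):
        subdirs.sort()          # in-place sort => deterministic traversal order
        for filename in sorted(filenames):
            yield os.path.join(dirname, filename)


def first_n_files(root, limit):
    """Return at most `limit` file paths from the tree below `root`."""
    return list(itertools.islice(iter_files(root), limit))
```

As the generator is lazy, the directory tree is only walked as far as needed to deliver the requested number of paths.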

Printing to a Jupyter cell in a browser takes some time itself. However, we shall see that this aspect is not really relevant in this example. (The situation would be different if we had to switch contexts between GPU operations and CPU operations whilst printing in real ML applications run primarily on a GPU.)
In addition, we are only interested in relative performance when comparing the number later to the result of fastText. Note that time.perf_counter() measures elapsed wall-clock time, so the label "Total CPU time" in the output really stands for elapsed time.

I measured 74 secs on a standard hard disk (no SSD) with printing, and observed that everything ran on just one core of the CPU. Without printing the run required 72 secs. All values were measured with disk caching active and are averages over three runs on cached data.

So, for my whole collection of roughly 200.000 txt-files I would have to wait around 25 minutes - even if the files had been opened, read and cached previously. Deleting the files from the disk would cost a bit of extra time in addition.
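
The estimate is a simple linear extrapolation of the measured sample, under the assumption that the time per file stays roughly constant across the whole collection. As a quick check with the numbers from above:

```python
# Measured above: 72.65 s for 10033 files (cached data)
secs_per_file = 72.65 / 10033

# Extrapolate to the rounded and to the exact size of the collection
for total_files in (200_000, 208_000):
    est_minutes = secs_per_file * total_files / 60
    print(f"{total_files} files: about {est_minutes:.1f} minutes")
```

This gives roughly 24 to 25 minutes, depending on whether one counts 200.000 or 208.000 files.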

Without caching, i.e. for an original run, I would probably have to wait for around 3 hrs. langdetect.detect() is somewhat slow.

fastText

FastText was developed by Facebook. It is used to create word-vector models - which can be embedded in neural networks via special layers. See
https://en.wikipedia.org/wiki/FastText and https://fasttext.cc/.

An original publication on the fastText algorithm can be found at https://arxiv.org/abs/1607.04606. A somewhat less scientific and more accessible introduction is presented at:
https://amitness.com/2020/06/fasttext-embeddings/.

Besides supporting a variety of real ML-tasks, fastText also offers a language detection functionality. And Python is supported, too:

In your virtual Python environment just use "pip install fasttext". In addition, download a pretrained model for language detection, e.g. with wget on a Linux CLI:

wget -O /py/projects/fasttext/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

FastText for language detection can be used together with os.walk() as follows:

import fasttext
PRETRAINED_MODEL_PATH = '/py/projects/fasttext/lid.176.bin'
model = fasttext.load_model(PRETRAINED_MODEL_PATH)

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

n = 0
m = 0
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    subdirs.sort()
    #print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
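        with open(filepath, 'r') as fh:
            ftext = fh.read()
        # predict() handles one line at a time, so newlines must be removed
        ftext = ftext.replace("\n", " ")
        tupel_lang = model.predict(ftext) 
        #print(tupel_lang)
        # labels look like '__label__de'; strip the prefix to get the code
        lang = tupel_lang[0][0].replace('__label__', '')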
        #print(lang)
        if lang != 'de':
            #print(filepath, ' ', lang) 
            m += 1 
        n += 1
    if n > 9999: 
        break

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
print("num of files checked: ", n)
print("num of non German files: ", m)

Result numbers:

Total CPU time  1.895524049999949
num of files checked:  10033
num of non German files:  1355

Regarding performance, fastText is faster than langdetect.detect() by a factor of about 38!
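
A remark on reading the result: model.predict() returns a tuple of labels together with a matching array of probabilities. The labels of the lid.176 model carry the prefix "__label__", and some of the language codes have three letters. A slightly more defensive way of evaluating the prediction is sketched below. The helper name parse_prediction and the threshold of 0.5 are assumptions for illustration and not part of fastText:

```python
LABEL_PREFIX = "__label__"


def parse_prediction(prediction, min_prob=0.5):
    """Return (language_code, probability) for a fastText top-1 prediction.

    `prediction` is the (labels, probabilities) pair returned by
    model.predict(text). If the probability is below `min_prob`
    (an arbitrary threshold), the code is reported as 'unknown'.
    """
    labels, probs = prediction
    prob = float(probs[0])
    label = labels[0]
    # Strip the prefix instead of slicing, so 3-letter codes survive intact
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    return (code if prob >= min_prob else "unknown"), prob


# Usage with the model loaded above:
#   lang, prob = parse_prediction(model.predict(ftext))
```

With very garbled OCR texts a low probability may be a hint that the file should be checked by hand instead of being deleted automatically.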

Eliminating the non-German files from the collection of 208.000 files took only 42 secs once the files had been cached. An original delete run without caching took around 320 secs - which is still much faster than a detect()-run on cached data. As said, this was measured on an (older) hard disk. This is very acceptable. SSDs would give us an even better performance.
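
The delete run itself is not listed here. A minimal sketch of how such a run could look is given below; it is not necessarily the code I used. The function name remove_matching_files and the dry_run switch are assumptions for illustration. A dry run only collects the candidates, which is a cheap safety net before files are really removed.

```python
import os


def remove_matching_files(root, should_remove, dry_run=True):
    """Walk `root` and delete every file for which should_remove(path) is True.

    With dry_run=True nothing is deleted; the candidates are only collected.
    Returns the list of affected paths.
    """
    affected = []
    for dirname, subdirs, filenames in os.walk(root):
        for filename in filenames:
            filepath = os.path.join(dirname, filename)
            if should_remove(filepath):
                affected.append(filepath)
                if not dry_run:
                    os.remove(filepath)
    return affected


# Example predicate (hypothetical), reusing `model` from above:
#   def is_not_german(path):
#       with open(path, 'r') as fh:
#           text = fh.read().replace("\n", " ")
#       return model.predict(text)[0][0] != '__label__de'
#
#   candidates = remove_matching_files(dir_path, is_not_german)   # dry run
```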

Accuracy of the language detection

The reader has certainly noticed that the number of non-German files detected by fastText (1355) is somewhat smaller than what langdetect.detect() gave me (1465). That there is a discrepancy is no wonder: Some of the text files are filled with so many scan and OCR errors that even a human would have problems detecting the original language. I just checked 10 of the files for which the two methods guessed different languages. In all of these 10 cases the guess of fastText was better. This is, of course, only a very small sample.
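
To pick such files more systematically one could store the guesses of both methods per file path during the two runs and then list the disagreements. A sketch; the two dictionaries and the function name find_disagreements are hypothetical, as the loops above only count files:

```python
def find_disagreements(langs_a, langs_b):
    """Return {path: (lang_a, lang_b)} for files where the two guesses differ.

    Both arguments map file paths to language codes; only paths present
    in both mappings are compared.
    """
    return {
        path: (langs_a[path], langs_b[path])
        for path in sorted(langs_a.keys() & langs_b.keys())
        if langs_a[path] != langs_b[path]
    }


# Usage, assuming the dictionaries were filled inside the two loops:
#   for path, (l_detect, l_fast) in find_disagreements(res_detect, res_fast).items():
#       print(path, l_detect, l_fast)
```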

Conclusion

Although I am not a friend of Facebook, they have produced something really useful with their fastText algorithm. Regarding language detection, fastText provides a really convincing, super-fast and reliable method to guess the main language used in text pieces. It supports Python environments and allows the analysis of a huge number of text files even on a PC.

Walking a directory tree with Python to eliminate unwanted files

Recently I got an enormous amount of text files and related image files of scanned texts (around 210.000). The whole dataset had a size of around 25 GB. The text data should be analyzed with some machine learning [ML] algorithms. One of the first things to do in such a situation is to get rid of the jpg-files. Such files consume most of the disk space.

In my case I also got the data from a Mac machine. Hidden "._"-files were created on the Mac when the original data were downloaded from the Internet. These files store macOS metadata (extended attributes, which among other things include download security flags) for the respective original files. I had to eliminate these files, too.

Due to the doubling of files and additional "."-files the total number of files was around 830.000. The number of files really required was much smaller. Eliminating files which one does not need is an exercise that is often required when preparing text data in a Machine Learning context.

In such a situation the function "os.walk()" is very helpful in a Python environment.

os.walk

os.walk() allows us to walk recursively through a directory tree. For each directory visited we get a tuple back containing

  1. the path to the present directory,
  2. a list of all sub-directories,
  3. a list of all files (by their names) in the directory.

For typical applications this is enough information to perform analysis and file operations within the directories.
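
To see these tuples in action without touching real data, the following sketch builds a tiny throw-away tree in a temporary directory and walks it. The file and directory names are made up for illustration:

```python
import os
import tempfile

# Build a tiny throw-away tree: <root>/a.txt and <root>/sub/b.txt
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for relpath in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, relpath), "w") as fh:
            fh.write("dummy")

    # Each iteration yields (current directory, sub-directory names, file names)
    walked = [
        (os.path.relpath(dirname, root), subdirs, filenames)
        for dirname, subdirs, filenames in os.walk(root)
    ]

for entry in walked:
    print(entry)
```

The top directory comes first (shown as "." relative to the root), followed by its sub-directory with its own file list.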

Application

In my case the usage was very simple. In a Jupyter cell the following code helped:

import os
import time

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        if filename.endswith('.jpg') or filename.endswith('.db') or filename.endswith('-GS.txt'):
            os.remove(filepath) 
            
# extra loop as there are hidden "."-files also for ".jpg"-files      
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename) 
        #print(filepath)
        if filename.startswith('._'):
            os.remove(filepath) 

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)

If you do not print out file-paths this should be a matter of seconds only, on an SSD even below a second.
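
Since str.endswith() also accepts a tuple of suffixes, the two loops could be folded into a single pass over the tree. A sketch under the assumption that the same suffixes and the same "._" prefix as above are to be removed; is_unwanted and clean_tree are hypothetical names:

```python
import os

UNWANTED_SUFFIXES = ('.jpg', '.db', '-GS.txt')


def is_unwanted(filename):
    """True for jpg/db/'-GS.txt' files and for macOS '._' companion files."""
    return filename.startswith('._') or filename.endswith(UNWANTED_SUFFIXES)


def clean_tree(root):
    """Delete all unwanted files below `root` in one pass; return their count."""
    removed = 0
    for dirname, subdirs, filenames in os.walk(root):
        for filename in filenames:
            if is_unwanted(filename):
                os.remove(os.path.join(dirname, filename))
                removed += 1
    return removed
```

Keeping the decision in a small function of its own makes it easy to test the rule on a few file names before anything is deleted.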

Counting the remaining files

We can also use os.walk() to count the number of remaining files:

n = 0 
v_start_time = time.perf_counter()
for (dirname, subdirs, filesdir) in os.walk(dir_path): 
    print('[' + dirname + ']')
    n += len(filesdir)
v_end_time = time.perf_counter()
print('Number of files = ', n )
print("Total CPU time ", v_end_time - v_start_time)
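
If one wants to verify that only txt-files are left, the count can be broken down by file extension. A small sketch with collections.Counter; the function name count_by_extension is an assumption for illustration:

```python
import os
from collections import Counter


def count_by_extension(root):
    """Return a Counter mapping lower-cased file extensions to file counts."""
    counts = Counter()
    for dirname, subdirs, filenames in os.walk(root):
        counts.update(os.path.splitext(name)[1].lower() for name in filenames)
    return counts


# Usage with the directory from above:
#   for ext, num in count_by_extension(dir_path).most_common():
#       print(ext, num)
```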

On a Linux system you could also use

mytux:~ # find /py/projects/CA22/catch22/ -type f | wc -l
208123

Thus I could bring down the total size to 780 MB and the number of txt-files to be processed down to around 208.000.