When preparing text data for ML we are sometimes confronted with a large number of separate text files organized in a directory hierarchy. In the last article in this category
I showed how to "walk" through such a directory tree of files with the help of Python and os.walk(). We can, for example, use os.walk() to eliminate unwanted files in the tree.
As an example I discussed a collection of around 200,000 scanned paper pages with typewriter text; the collection contained jpg-files and related OCR-based txt-files. With os.walk() I could easily get rid of the jpg-files.
A subsequent task for my text collection was the elimination of those text files which, with a relatively high probability, were not written in German. To solve such a problem with Python one needs a module that guesses the language of a given text correctly - in my case despite scanning and OCR errors. In this article I discuss the application of two methods - one based on the popular Python package "langdetect" (a third-party package installed via pip, not part of the standard library) and the other one based on a pretrained language-identification model for Facebook's "fastText". The latter is by far the faster approach - which is relevant given the relatively large number of files to analyze.
"langdetect" and its "detect()"-function
To get the "langdetect" module, install it into your virtual Python environment via "pip install langdetect". To test its functionality, just pass a text written in French and another one written in German from respective txt-files to the "detect()" function of this module - e.g. as in the following Jupyter cell:
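    from langdetect import detect
    import os
    import time

    # read the first sample file and print the guessed language code
    with open('/py/projects/CA22/catch22/0034_010.txt', 'r') as f:
        ftxt1 = f.read()
    print(detect(ftxt1))

    # same for the second sample file
    with open('/py/projects/CA22/catch22/0388_001.txt', 'r') as f:
        ftxt2 = f.read()
    print(detect(ftxt2))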
In my test case this gave the result:
So, detect() is pretty easy to use.
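Two practical details are worth knowing before running detect() over many OCR files: langdetect's results are not deterministic unless a seed is set, and detect() raises an exception for texts without usable features (e.g. empty files). The following is only a minimal sketch of a defensive wrapper. The names `make_langdetect_guesser`, `guess_language` and the `min_chars` threshold are hypothetical helpers introduced for illustration, not part of langdetect.

```python
def make_langdetect_guesser(seed=0):
    """Return a function text -> language code (or None), backed by langdetect."""
    # Imported lazily so the rest of this sketch runs without the package.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is randomized; a fixed seed makes repeated runs agree.
    DetectorFactory.seed = seed

    def guess(text):
        try:
            return detect(text)
        except LangDetectException:
            # raised e.g. for empty text or text made of digits/punctuation only
            return None

    return guess


def guess_language(text, guesser, min_chars=20):
    """Return None for texts too short to judge; otherwise ask `guesser`.

    `min_chars` is an arbitrary, assumed threshold.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None
    return guesser(stripped)
```

In a notebook one might call `guess_language(ftxt1, make_langdetect_guesser())`; a result of `None` then marks a file that needs a manual look rather than a file in a foreign language.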
What about the performance of langdetect.detect()?
Let us look at the performance. For this I just ran os.walk() over the directory tree of my collection and printed out some information for all files that were not detected as German.
    dir_path = "/py/projects/CA22/catch22/"

    # use os.walk to recursively run through the directory tree
    # (time.perf_counter() measures elapsed wall-clock time)
    v_start_time = time.perf_counter()
    n = 0   # number of files checked
    m = 0   # number of files not detected as German
    for (dirname, subdirs, filesdir) in os.walk(dir_path):
        subdirs.sort()
        print('[' + dirname + ']')
        for filename in filesdir:
            filepath = os.path.join(dirname, filename)
            #print(filepath)
            with open(filepath, 'r') as f:
                ftext = f.read()
            lang = detect(ftext)
            if lang != 'de':
                print(filepath, '  ', lang)
                m += 1
            n += 1
        # stop once the directory in which the count passed 9999 is finished
        if n > 9999:
            break
    v_end_time = time.perf_counter()
    print("Total CPU time ", v_end_time - v_start_time)
    print("num of files checked: ", n)
    print("num of non German files: ", m)
Resulting numbers were:
    Total CPU time  72.65049991299747
    num of files checked:  10033
    num of non German files:  1465
Well, printing to a Jupyter cell in a browser takes some time itself. However, we shall see that this aspect is not really relevant in this example. (The situation would be different if printing forced us to switch contexts between GPU and CPU operations in real ML applications that run primarily on a GPU.)
In addition we are only interested in relative (CPU) performance when comparing the number later to the result of fastText.
I measured 74 secs on a standard hard disk (no SSD) with printing - and observed that everything ran on just one core of the CPU. Without printing the run required 72 secs. All values were measured with disk caching active and are averages of three runs on cached data.
So, for my whole collection of roughly 200,000 txt-files I would have to wait around 25 minutes - even if the files had been opened, read and cached previously. Deleting the files from the disk would cost a bit of extra time in addition.
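The estimate is a simple linear extrapolation from the timed run. As a quick check, the arithmetic can be written down as follows; the helper name `projected_minutes` is made up for this sketch, the measured values are the ones printed above, and 208,000 is the full collection size quoted later in this article.

```python
# Measured values from the timed langdetect run (cached data).
seconds_per_batch = 72.65   # elapsed seconds
files_in_batch = 10033      # files checked in that run


def projected_minutes(total_files):
    """Linear extrapolation: cost per file times number of files, in minutes."""
    per_file = seconds_per_batch / files_in_batch
    return per_file * total_files / 60.0


for total in (200_000, 208_000):
    print(f"{total} files: about {projected_minutes(total):.1f} min")
```

This gives between 24 and a bit more than 25 minutes, depending on which of the two collection sizes is used, and it assumes that the average file length in the first 10,033 files is representative of the rest.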
Without caching - i.e. for a first run on uncached data - I would probably have to wait for around 3 hrs. langdetect.detect() is somewhat slow.
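Because detection and deletion are separate costs, one option is to collect the candidate files first and delete them in a second step, which also keeps printing out of the timed loop. The following is only a sketch under assumptions: `find_foreign_files` and `guesser` are hypothetical names, and the ".txt" filter and the UTF-8 encoding are guesses about the collection.

```python
import os


def find_foreign_files(root, guesser, wanted="de", encoding="utf-8"):
    """Walk `root` and return (path, guess) pairs for txt-files not in `wanted`.

    `guesser` is any function mapping a text to a language code (or None).
    """
    foreign = []
    for dirname, subdirs, filenames in os.walk(root):
        subdirs.sort()  # deterministic traversal order
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(dirname, filename)
            # errors="replace" keeps odd bytes in OCR output from aborting the run
            with open(path, "r", encoding=encoding, errors="replace") as fh:
                text = fh.read()
            lang = guesser(text)
            if lang != wanted:
                foreign.append((path, lang))
    return foreign
```

With langdetect one could pass `detect` (or a wrapper that catches its exception for empty texts) as the `guesser`. Reviewing the returned list before calling `os.remove()` on each path guards against deleting misclassified files.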
FastText was developed by Facebook. It is used to create word-vector models - which can be embedded in neural networks via special layers. See
https://en.wikipedia.org/wiki/FastText and https://fasttext.cc/.
An original publication on the fastText algorithm can be found at https://arxiv.org/abs/1607.04606. A somewhat less scientific and more accessible introduction is presented at:
Besides supporting a variety of real ML tasks, fastText also offers language detection. And Python is supported, too:
In your virtual Python environment just use "pip install fasttext". In addition, download a pretrained model for language detection on a Linux CLI, e.g. with wget:
wget -O /py/projects/fasttext/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
FastText for language detection can be used together with os.walk() as follows:
    import os
    import time
    import fasttext

    PRETRAINED_MODEL_PATH = '/py/projects/fasttext/lid.176.bin'
    model = fasttext.load_model(PRETRAINED_MODEL_PATH)
    dir_path = "/py/projects/CA22/catch22/"

    # use os.walk to recursively run through the directory tree
    v_start_time = time.perf_counter()
    n = 0   # number of files checked
    m = 0   # number of files not detected as German
    for (dirname, subdirs, filesdir) in os.walk(dir_path):
        subdirs.sort()
        #print('[' + dirname + ']')
        for filename in filesdir:
            filepath = os.path.join(dirname, filename)
            #print(filepath)
            with open(filepath, 'r') as f:
                ftext = f.read()
            # predict() expects a single line of text
            ftext = ftext.replace("\n", " ")
            # predict() returns (labels, probabilities), e.g. labels = ('__label__de',)
            tupel_lang = model.predict(ftext)
            #print(tupel_lang)
            lang = tupel_lang[0][0].replace('__label__', '')
            #print(lang)
            if lang != 'de':
                #print(filepath, '  ', lang)
                m += 1
            n += 1
        if n > 9999:
            break
    v_end_time = time.perf_counter()
    print("Total CPU time ", v_end_time - v_start_time)
    print("num of files checked: ", n)
    print("num of non German files: ", m)
    Total CPU time  1.895524049999949
    num of files checked:  10033
    num of non German files:  1355
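The value returned by model.predict() is a pair (labels, probabilities), where each label is a string of the form "__label__de". Extracting the language code by stripping the prefix is more robust than slicing the last two characters, because some labels of the lid.176 model are longer than two letters. The sketch below shows one way to do this and to apply an optional confidence threshold; `top_language`, `one_line` and `min_confidence` are hypothetical names, and the example prediction is made up so that the snippet runs without the model file.

```python
LABEL_PREFIX = "__label__"


def top_language(prediction, min_confidence=0.0):
    """Return the language code from a fastText-style (labels, probabilities) pair.

    Returns None if there is no label or if the top probability is below
    `min_confidence`.
    """
    labels, probabilities = prediction
    if len(labels) == 0 or float(probabilities[0]) < min_confidence:
        return None
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label


def one_line(text):
    """predict() works on one line at a time, so fold all whitespace to spaces."""
    return " ".join(text.split())


# Stand-in for model.predict(one_line(text)); the values are invented.
example = (("__label__de",), [0.97])
print(top_language(example))          # prints: de
print(top_language(example, 0.99))    # prints: None
```

A threshold would let heavily garbled OCR files fall into a separate "uncertain" group instead of being counted as non-German; which value is sensible for a given collection would have to be found by experiment.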
Regarding performance, fastText is faster than langdetect.detect() by a factor of roughly 38!
Eliminating the non-German files from the collection of 208,000 files took only 42 secs - once the files had been cached. An original delete run without caching took around 320 secs - which is still much faster than a detect() run on cached data. As said, all of this was measured on an (older) HD. This is very acceptable. SSDs would give us even better performance.
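The speed-up factor and the plausibility of the 42 secs can be checked with a few lines of arithmetic. The sketch below uses only the two measured times printed above; the variable names are chosen for illustration.

```python
# Elapsed times of the two timed runs over the same 10,033 cached files.
langdetect_seconds = 72.65049991299747
fasttext_seconds = 1.895524049999949
files_in_batch = 10033

# Ratio of the two run times.
speedup = langdetect_seconds / fasttext_seconds

# Linear extrapolation of the pure detection time to the full collection.
projected_seconds = fasttext_seconds / files_in_batch * 208_000

print(f"speed-up factor: {speedup:.1f}")
print(f"projected detection time for 208,000 files: {projected_seconds:.0f} s")
```

The projection for pure detection comes out slightly below 40 secs, which is consistent with the measured 42 secs for a run that also deletes files.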
Accuracy of the language detection
The reader has certainly noticed that the number of non-German files detected by fastText (1355) is somewhat smaller than what langdetect.detect() gave me (1465). That there is a discrepancy is no wonder: Some of the text files are filled with so many scan and OCR errors that even a human would have trouble identifying the original language. I checked just 10 of the files for which the two methods disagreed on the guessed language. In all of these 10 cases the guess of fastText was better.
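A spot check of this kind can be organized by recording the guess of each method per file and then listing the files on which the two methods disagree. The following is a sketch only: `disagreements` and `sample_for_review` are hypothetical helpers, and the file names and language codes in the example are invented.

```python
import random


def disagreements(guesses_a, guesses_b):
    """Return sorted (path, guess_a, guess_b) for files where two detectors differ."""
    shared = guesses_a.keys() & guesses_b.keys()
    return sorted(
        (path, guesses_a[path], guesses_b[path])
        for path in shared
        if guesses_a[path] != guesses_b[path]
    )


def sample_for_review(rows, k=10, seed=0):
    """Pick up to k rows at random (reproducibly) for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Invented example data: path -> guessed language code.
by_langdetect = {"a.txt": "de", "b.txt": "fr", "c.txt": "en"}
by_fasttext = {"a.txt": "de", "b.txt": "de", "c.txt": "nl"}

rows = disagreements(by_langdetect, by_fasttext)
print(rows)   # prints: [('b.txt', 'fr', 'de'), ('c.txt', 'en', 'nl')]
```

Drawing the sample at random rather than taking the first few files avoids judging the methods on a single directory of the collection; a sample of 10 still gives only a rough impression.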
Although I am not a friend of Facebook, they have produced something really useful with their fastText algorithm. For language detection, fastText provides a convincing, very fast and reliable method to guess the main language used in pieces of text. It supports Python environments and allows the analysis of a huge number of text files even on a PC.