TF-IDF – which formula to take in combination with the Keras Tokenizer? And when to calculate TF-IDF by your own code …

When performing computer-based text analysis we sometimes need to shorten our texts by some criterion before we apply machine learning algorithms. One of the reasons could be that a classical vectorization process applied to the original texts would lead to matrices or tensors which are beyond our PC’s memory capabilities.

Another reason for shortening might be that we want to focus our analysis upon words or tokens which are “significant” for the text documents we are dealing with. The individual texts we work with typically are members of a limited collection of texts – a so-called text corpus.

What does “significance” mean in this context? Well, words which are significant for a specific text should single out this text among all other texts of the corpus – or vice versa. There should be a strong and specific correlation between the text and its “significant” tokens. Such a distinguishing correlation could be: The significant tokens may appear only in the selected document, or especially often there – always in comparison to other texts in the corpus.

It is clear that we need some “measure” for the significance of a token with respect to individual texts. And we somehow need to compare the frequency by which a word/token appears in a text with the frequencies by which the token appears in other texts of our corpus.

For some analyses we might in the end only keep “significant” words which distinguish the texts of the corpus from each other. Note that such a shortening procedure would reduce the full vocabulary, i.e. the set of all unique words appearing in the corpus’ texts. And after shortening, the statistical basis for “significance” may have changed.

Due to the impact the choice of a “measure” of a token’s significance may have on our eventual analysis results, we must be careful and precise when we discuss our results and their presuppositions. We should name the formula used for the measure of significance.

This leads to the question: Do all authors in the field of Machine Learning [ML] and NLP use the same formula for the “significance” of words or tokens? Are there differences which can be confusing? Oh yes, there are … The purpose of this post is to remind beginners in the field of NLP or text analysis of this fact and to give an overview of the most common approaches. In addition I will discuss some practical aspects and give some snippets of code which reproduce the TF-IDF values the fast Keras Tokenizer would give you.

For large corpora the differences between the formulas used for measuring the significance of tokens may be minor and may not change fundamental conclusions. But in my opinion the differences should be checked and at least be named.

TF-IDF values as a measure of a token’s significance – dependency on token, text AND corpus

In NLP a measure of a word’s significance with respect to a specific text of a defined corpus is given by a quantity called “TF-IDF” – “term frequency – inverse document frequency” (see below). TF-IDF values are specific for a word (or token) and a selected text (out of the corpus). We will discuss elements of the formulas in a minute.

Note: Significance values as TF-IDF values will in general depend on the corpus, too.

This is due to the fact that “significance” is based on correlations. As said above: We need to compare token frequencies within a specific text with the token’s frequencies in other texts of the given corpus. Significance is thus rooted in a singular text and the collection of other texts in a specific corpus. Keep a specific text, but change the corpus (e.g. by eliminating some texts) – and you will change the significance of a token for the selected text.

If you have somehow calculated “TF-IDF”-values for all the words used in a specific text (of the corpus collection), a simple method to shorten the selected text would be to use a “TF-IDF”-threshold: We keep words which have a “TF-IDF”-value above the defined threshold and omit others from our specific text.

Such a shortening procedure would depend on word- and text-specific values. Another way of shortening could be based on an averaged TF-IDF value of each token evaluated over all documents. We then would get a corpus-specific significance value <TF-IDF> for each of our tokens. Such averaged TF-IDF values for our tokens together with a threshold could also be used for text-shortening.
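
As a side remark: once TF-IDF values are available for the tokens of a text, such a threshold-based shortening is only a few lines of Python. The following is a purely illustrative sketch; the names “text_tokens”, “d_tfidf” and “threshold” are hypothetical:

# Hypothetical example: keep only tokens whose TF-IDF value for this text
# exceeds a chosen threshold.
text_tokens = ["the", "experiment", "showed", "the", "anomaly"]
d_tfidf = {"experiment": 0.8, "showed": 0.3, "anomaly": 1.2, "the": 0.01}

threshold = 0.1
shortened = [tok for tok in text_tokens if d_tfidf.get(tok, 0.0) > threshold]
# -> ['experiment', 'showed', 'anomaly']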

Whatever method for shortening you choose: “TF-IDF”-values require a statistical analysis over the given ensemble of texts, i.e. the text corpus. The basic statistical data are often collected during the application of a tokenizer to the texts of a given corpus. A tokenizer identifies unique tokens appearing in the texts of a corpus and collects them in a long vector. Such a vector represents the tokenizer’s (and the corpus’) “vocabulary”. The tokenizer vocabulary often is sorted by the total frequency of the tokens in the corpus. In a Python environment the tokenizer’s vocabulary often will be represented by one or more Python dictionary objects.

Technical obstacles may require the explicit calculation of TF-IDF values by a coded formula in your programs

NLP frameworks in most cases provide specific objects and methods to automatically calculate TF-IDF values for the tokens and texts of a corpus during certain analysis runs. But things can become problematic because some tokenizers provide “TF-IDF”-data only during a vectorization procedure. By vectorization we mean a digital encoding of the texts with respect to the tokenizer’s vocabulary in a common way for all texts.

An example is one-hot encoding: Each text can be represented by numbers 0, 1, …, n at positions of a long vector in which each position represents a specific token. The number 0, 1, … or n then means: In this text the token appears 0, 1 or n times. Such a kind of encoding is especially useful for the training of neural networks.
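
A tiny, purely illustrative example (toy vocabulary and toy text chosen by me) shows the idea of such a count-based encoding:

from collections import Counter

vocab = ["cat", "dog", "fish"]        # hypothetical ordered vocabulary
text  = "cat cat dog"
counts = Counter(text.split())
vector = [counts[w] for w in vocab]   # -> [2, 1, 0]: "cat" twice, "dog" once, "fish" never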

A tool which gives us TF-IDF values after some vectorization is the Keras Tokenizer. I prefer the Keras Tokenizer in my projects because it really is super fast.

But: You see at once that we may run into severe trouble if we need to feed all of the tokens of a really big corpus into a TF-IDF analysis based on vectorization. For a collection of texts the number of unique tokens may lie in the range of several millions. This in turn leads to very long vectors. The number of vectors to look at is given by the number of texts which in itself may be hundreds of thousands or even millions. You can do the math for RAM requirements with a 16- or 32 Bit-representation yourself. As a consequence you may have to accept that you can do vectorization only batch-wise. And this may be time-consuming and may require intermediate manual interaction with your programs – especially when working with Python code and Jupyter notebooks (see below).
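
Just as an illustration, with the (hypothetical but not unrealistic) numbers mentioned in this post – 200,000 texts and a vocabulary of 3 million tokens – a dense matrix would require terabytes of RAM:

n_texts  = 200_000        # number of documents (hypothetical)
n_tokens = 3_000_000      # number of unique tokens (hypothetical)
bytes_per_value = 4       # 32-bit representation

ram_bytes = n_texts * n_tokens * bytes_per_value
print(ram_bytes / 1e12, "TB")   # => 2.4 TB for a dense matrix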

For the analysis of huge text corpora the snake of tokenizing, TF-IDF calculation, reasonable shortening and vectorization for ML may bite its own tail:
We need TF-IDF values to shorten texts in a reasonable way and to avoid later memory problems during vectorization for ML tasks. But sometimes our tool-kit provides “tf-idf”-data only after a first vectorization, which may not be feasible due to the size of our corpus and the size of the resulting vocabulary of the tokenizer.

A typical example is given by the Keras tokenizer. If you have a big corpus with millions of texts and tokens, vectorization may not be a good recipe to get your TF-IDF values. Especially in the case of post-OCR applications you may not be allowed to throw away any of the identified tokens before you have corrected them for possible OCR errors. And your ML-mechanism for such a correction may depend on the results of some kind of TF-IDF analysis.

In such a situation one must invest some (limited) effort into a “manual” calculation of TF-IDF values. You then have to pick some formula from a text book on NLP and/or ML-based text analysis. But having done so you may soon find out that your (text-book) formula for a “TF-IDF”-calculation does not reproduce the values the tokenizer of a selected NLP framework would have given you after a vectorization of your texts. Note that you can always check for such differences by using an artificially and drastically reduced corpus.
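
Such a cross-check on a drastically reduced corpus could, for example, look like the following sketch. The toy corpus and the name “m_own” are just placeholders; texts_to_matrix() in tfidf-mode delivers the Keras reference values:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

toy_corpus = ["a small text about cats",
              "another small text about dogs",
              "cats and dogs and birds"]

tok = Tokenizer()
tok.fit_on_texts(toy_corpus)
m_keras = tok.texts_to_matrix(toy_corpus, mode='tfidf')   # Keras reference values

# m_own = ...   # matrix produced by your own TF-IDF formula for the same corpus
# print(np.allclose(m_keras, m_own))   # True only if both formulas agree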

A formula for the TF-IDF calculation performed by the Keras tokenizer is one of the central topics of this post. For convenience I sometimes omit the hyphen in TF-IDF below.

Is it worth the effort to calculate TF-IDF values by your own code?

Tokenizers of an NLP tool-kit do not only identify individual tokens in a text, but are also capable of vectorizing texts. Vectorization leads to the representation of a text by an (ordered) series of integer or float numbers, i.e. a vector, which in a unique way refers to the words of a vocabulary extracted from the text collection:

The indexed position in the vector refers to a specific word in the vocabulary of the text corpus, while the value given at this position describes the word’s (statistical) appearance in a text in some way.

We saw already that “one-hot”-encoding for vectorization may result in a “bag-of-words”-model: A word appearing in a text is marked by a “one” (or an integer) in an indexed vector referring to tokens of the ordered vocabulary derived by a tokenizer. But vectorization can be provided in different modes, too: The “ones” (1) in a simple “one-hot-encoded” vector can e.g. be replaced by TF-IDF values (floats) of the words (tfidf-mode). This is the case for the Keras Tokenizer: By using the respective Keras tokenizer functions you may get the desired TF-IDF values for reducing the texts during a vectorization run. However, all one-hot like encodings of texts come with a major disadvantage:

The length of the word vectors depends on the number of words the tokenizer has identified over all texts in a collection for the vocabulary.

If you have extracted 3 million words (tokens) out of hundreds of thousands of texts you may run into major trouble with the RAM of your PC (and with CPU-time). Most tokenizers allow for a (manual) sequential vectorization approach for a limited number of texts to overcome memory problems under such circumstances. What does this mean practically when working with Jupyter notebooks?

Well, if you work with notebooks on a PC with a small amount of RAM – and small may mean 128 GB (!) in some cases – you may have to perform a sequence of vectorization runs, each with maximally some hundred or thousand texts. Then you may have to export your results and afterward manually reset your notebook kernel to get rid of the RAM consumption – just because the standard garbage collection of your Python environment may not work fast enough.

I recently had this problem in a project with 200,000 texts, the Keras tokenizer and a vocabulary of 2.4 million words (where words with less than 4 characters were already omitted). The Keras tokenizer produces almost all relevant data for a “manual” calculation of TF-IDF values after it has been applied to a corpus. In my case the CPU-time required to tokenize the 200,000 texts and to build a vocabulary was only about 20 secs. A manual and sequential approach to create all TF-IDF values via vectorization, however, required about an hour. This was due both to the time the vectorization and tf-idf calculation needed and the time required for resetting the notebook kernel.

After I had decided to implement the TF-IDF calculation on my own I could work on the full corpus and get the values within a minute. So, if one has to work on a big corpus multiple times in some iterative process (e.g. in post-OCR procedures), we may talk of a performance difference in terms of hours.

Therefore, I would in general recommend performing TF-IDF calculations with your own code segments when confronted with big corpora – and not relying on the (vectorizing) tools of a framework. Besides performance aspects, another reason for TF-IDF functions programmed by yourself is that you afterward know exactly what kind of TF-IDF formula you have used.

TF-IDF formulas: The “IDF”-term – and what does the Keras tokenizer use for it?

The TF-IDF data describe the statistical overabundance of a token in a specific text by some formula measuring the token’s frequency in the selected text in comparison to the frequency over all texts in a weighted and normalized way.

During my own “TF-IDF”-calculations based on some Python code and basic tokenizer data, I, of course, wanted to reproduce the values the Keras tokenizer gave me during my previous vectorization approach. To my surprise it was rather difficult to achieve this goal. Just using a reasonable “TF-IDF”-formula taken from some NLP text-book simply failed.

The reason was that “TF-IDF”-data can be and are indeed calculated in different ways, both in the NLP literature and in NLP frameworks. The Keras tokenizer does it differently than the tokenizer of SciKit – actually for both the TF- and the IDF-part. There is a basic common structure behind a normalized TFIDF-value; there are, however, major differences in the details. Let’s look at both points.

Everybody who has once in his/her life programmed a search engine knows that the significance of a word for a specific text (of an ensemble) depends on the number of occurrences of the word inside the specific text, but also on the frequency of the very same word in all the other texts of a given text collection:

If a word also appears very often in all other texts of a text ensemble then the word is not very significant for the specific text we are looking at.

Examples are typical “stop-words” – like “this” or “that” or “and”. Such words appear in very many texts of a corpus of English texts. Therefore, stop-words are not significant for a specific text.

Thus we expect that a measure of the statistical overabundance of a word in a selected text is a combination of the abundance in this specific text and a measure of the frequency in all the other texts of the corpus. The “TF-IDF” quantity follows this recipe. It is a combination of the so-called “term frequency” [tf(t)] with the “inverse document frequency” [idf(t)], with “t” representing a special token or term:

tfidf(t)   =   tf(t)   *   idf(t)

While the term frequency measures the occurrence or frequency of a word within a selected text, the “idf” factor measures the frequency of a word in multiple texts of the collection. To get some weighting and normalization into this formula, the “idf”-term is typically based on the natural logarithm of a fraction formed

  • by the number of texts NT contained in the corpus as the numerator
  • and the number of documents ND(t) in which a special word or term appears as the denominator.
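
In its most basic (unsmoothed) form this amounts to

idf(t)   =   log( NT / ND(t) )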

Once again: A TF-IDF value is always characteristic of a word or term and the specific text we look at. But, the “idf”-term is calculated in different manners in various text-books on text-analysis. Most variants try to avoid the idf-term becoming negative or want to avoid a division by zero; typical examples are:

  1. idf(t) = log( NT / (ND + 1) )

  2. idf(t) = log( (1 + NT) / (ND + 1) )

  3. idf(t) = log( 1 + NT / (ND + 1) )

  4. idf(t) = log( 1 + NT / ND )

  5. idf(t) = log( (1 + NT) / (ND + 1) ) + 1

Note: log() represents the natural logarithm above.
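
The following small helper (not taken from any framework, just an illustration) compares the five variants numerically; NT is the number of texts in the corpus, ND the number of documents containing the term:

import numpy as np

def idf_variants(NT, ND):
    # variants 1-5 as listed above; log() is the natural logarithm
    return {
        1: np.log(NT / (ND + 1.0)),
        2: np.log((1.0 + NT) / (ND + 1.0)),
        3: np.log(1.0 + NT / (ND + 1.0)),   # the variant used by the Keras Tokenizer (see below)
        4: np.log(1.0 + NT / ND),
        5: np.log((1.0 + NT) / (ND + 1.0)) + 1.0,
    }

print(idf_variants(NT=200000, ND=50))   # the differences are small for large NT, but not zero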

Who uses which IDF version?
The second variant appears e.g. in a book of S. Raschka (see below) on “Python Machine Learning” (2016, Packt Publishing).

The fourth version in the list above is used in Scikit according to https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html

This is insofar consistent with Raschka’s version as he himself characterizes the SciKit “TF-IDF” version as:

tfidf(t) = tf(t) * [ idf(t) + 1 ]

Keras: The third variant is the one you find in the source code of the Keras tokenizer. The strange thing is that you also find a reference in the Keras code which points to a section in a Wikipedia article that actually reflects the fourth form (!) given in the list above.

Source code excerpt of the Keras Tokenizer (as of 10/2021):

.....
.....
elif mode == 'tfidf':
                    # Use weighting scheme 2 in
                    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
                    tf = 1 + np.log(c)
                    idf = np.log(1 + self.document_count /
                                 (1 + self.index_docs.get(j, 0)))
                    x[i][j] = tf * idf
.....
.....

What we learn from this is that there are several variants of the “IDF”-term out there. So, if you want to reproduce the TF-IDF numbers of a certain NLP framework, you had better look into the code of your framework’s classes and functions – if possible.

Variants of the “term frequency” TF? Yes, they do exist!

While I had already become aware of the existence of different IDF-variants, I did not at all know that there are even differences regarding the term frequency “tf(t)“. Normally, one would think that it is just the number describing how often a certain word or term appears in a specific text, i.e. the token’s frequency for the selected text.

Let us, for example, assume that we have turned a specific text via a tokenizer function into a “sequence” of numbers. An entry in this sequence refers to a unique number assigned to a word of a somehow sorted tokenizer vocabulary. A tokenizer vocabulary is typically represented by a Python dictionary where the key is the word itself (or a hash of it) and the value corresponds to a unique number for the word. (Hint: In my applications for texts I always create a supplementary dictionary, which I call “switched_vocab”, with keys and values switched (number => word). Such a dictionary is useful for a lot of analysis steps)

A sequence can typically be represented by a Python list of numbers “li_seq”: the position in the list corresponds to the word’s position in the text (marked by separators), the number given corresponds to the word’s unique index number in the vocabulary.

Then, with Python 3, a straightforward code snippet to get simple tf-values (i.e. the number of occurrences of a word’s index number in the sequence) would be

from collections import Counter

ind_w = li_seq[i]    # with "i" selecting a specific point or word in the sequence 
d_count  = Counter(li_seq)
tf = d_count[ind_w]

This code creates a dictionary “d_count” which maps each unique word number appearing in the original sequence to the number of its occurrences in the text’s sequence – i.e. in the text we are looking at.

Does the Keras tokenizer calculate and use the tf-term in this manner when it vectorizes texts in tfidf-mode? No, it does not! And this was a major factor for the differences in comparison to the TFIDF-values I naively produced for my texts.

With the terms above the Keras tokenizer instead uses a logarithmic value for tf (= TF):

from collections import Counter
from math import log

ind_w = li_seq[i]    # i selecting a specific point or word in the sequence 
d_count  = Counter(li_seq)
tf = log( 1 + d_count[ind_w] )

This makes a significant difference in the derived “TF-IDF” values – even if one had gotten the “IDF”-term right!
Please note that all of the variants used for the TF- and the IDF-terms have their advantages and disadvantages. You should at least know exactly which formula you use in your analysis. In my project the Keras way of doing TF-IDF was useful, but there may be cases where another choice is appropriate.

Quick and dirty Python code to calculate TF-IDF values manually for a list of texts with the Keras tokenizer

For reasons of completeness, I outline some code fragments below, which may help readers to calculate “TF-IDF”-values, which are consistent with those produced during “sequences to matrix”-vectorization runs with the Keras tokenizer (as of 10/2021).

I assume that you already have a working Keras implementation using either CPU or GPU. I further assume that you have gathered a collection of texts (cleansed by some Regex operations) in a column “txt” of a Pandas dataframe “df_rex”. We first extract all the texts into a list (representing the corpus) and then apply the Keras tokenizer:

from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 1800000    # or whatever number of words you want to be taken into account from the vocabulary  

li_txts = df_rex['txt'].to_list()
tokenizer = Tokenizer(num_words=num_words, lower=True) # converts tokens to lower-case 
tokenizer.fit_on_texts(li_txts)    

vocab   = tokenizer.word_index
w_count = tokenizer.word_counts
w_docs  = tokenizer.word_docs
num_tot_vocab_words = len(vocab) 
    
# Switch vocab - key <> value 
# ****************************
switched_vocab = dict([(value, key) for key, value in vocab.items()])

Tokenizing should be a matter of seconds or a few tens of seconds, depending on the number and the length of the texts. In my case, with 200,000 texts of on average 2000 words each, it took 25 secs and produced a vocabulary of about 2.4 million words.

In a next step we create “integer sequences” from all texts:

li_seq_full  = tokenizer.texts_to_sequences(li_txts)
leng_li_seq_full = len(li_seq_full)

Now, we are able to create a super-list of lists – including a list of tf-idf-values per text:

import numpy as np                 # np.log() is used below
from collections import Counter

li_all_txts = []

j_end = leng_li_seq_full
for j in range(0, j_end):
    li_text = []
    li_text.append(j)

    leng_seq = len(li_seq_full[j])
    li_seq     = []
    li_tfidf   = []
    li_words   = []
    d_count    = {}

    d_count  = Counter(li_seq_full[j])
    for i in range(0,leng_seq):
        ind_w    = li_seq_full[j][i] 
        word     = switched_vocab[ind_w]
        
        # calculation of tf-idf
        # ~~~~~~~~~~~~~~~~~~~~~
        # https://github.com/keras-team/keras-preprocessing/blob/1.1.2/keras_preprocessing/text.py#L372-L383
        # Use weighting scheme 2 in https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        dfreq    = w_docs[word] # document frequency 
        idf      = np.log( 1.0 + (leng_li_seq_full)  / (dfreq + 1.0) )
        tf_basic = d_count[ind_w]
        tf       = 1.0 + np.log(tf_basic)
        tfidf    = tf * idf 
                
        li_seq.append(ind_w) 
        li_tfidf.append(tfidf) 
        li_words.append(word) 

    li_text.append(li_seq)
    li_text.append(li_tfidf)
    li_text.append(li_words)

    li_all_txts.append(li_text)

leng_li_all_txts = len(li_all_txts)

This last run took about 1 minute in my case. Getting the same numbers with a sequential approach – calculating Keras vectorization matrices in tf-idf mode for around 6000 texts in each run, with in-between memory cleansing – took me around an hour and required continuous manual system interactions. Well, such a run provides the encoded text vectors as one of its major products, but, actually, we do not need these vectors for just evaluating TF-IDF values.

Conclusion

In this article I have demonstrated that “TF-IDF”-values can be calculated almost directly from the output of a tokenizer like the Keras Tokenizer. Such a “manual” calculation by one’s own coded instructions is preferable in comparison to e.g. a Keras based vectorization run in “tf-idf”-mode when both the number of texts and the vocabulary of your text corpus are huge. “tf-idf”-word-vectors may easily get a length of more than a million words with reasonably complex text ensembles. This poses RAM problems on many PC-based systems.

With TF-IDF-values calculated by your code functions you will get a measure for the significance of words or tokens within a reasonable CPU time and without any RAM problems. The evaluated “TF-IDF”- values may afterward help you to shorten your texts in a well founded and reasonable way before you vectorize your texts, i.e. ahead of applying advanced ML-algorithms.

Formulas for TF-IDF values used in the literature and in various NLP tool-kits or frameworks do differ with respect to both the TF-terms and the IDF-terms. You should be aware of this fact and choose one of the formulas given above carefully. You should also specify the formula used during your text analysis work when presenting results to your customers.

 

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – IV

In the last posts of this mini-series we have studied if and how we can use three 3-char-grams at defined positions of a string token to identify matching words in a reference vocabulary. We have seen that we should choose some distance between the char-grams and that we should use the word’s length information to keep the list of possible hits small.

Such a search may be interesting if there is only fragmented information available about some words of a text or if one cannot trust the whole token to be written correctly. There may be other applications. Note: This has so far nothing to do with text analysis based on machine learning procedures. I would put the whole topic more in the field of text preparation or text rebuilding. But I think that one can combine our simple identification of fitting words by 3-char-grams with ML-methods which evaluate the similarity or distance of a (possibly misspelled) token to vocabulary words: When we get a long hit-list we could invoke ML-methods to determine the best fitting word.

We saw that we can do 100,000 search runs with 3-char-grams on a decent vocabulary of around 2 million words in a Pandas dataframe in below 1.3 minutes on one CPU core of an older PC. In this concluding article I want to look a bit at the idea of multiprocessing the search with up to 4 CPU cores.

Points to take into account when using multiprocessing – do not expect too much

Pandas normally just involves one CPU core to do its job. And not all operations on a Pandas dataframe may be well suited for multiprocessing. Readers who have followed the code fragments in this series so far will probably and rightly assume that there is indeed a chance for reasonably separating our search process for words or at least major parts of it.

But even then – there is always some overhead to expect from splitting a Pandas dataframe into segments (or “partitions”) for separate operations on different CPU cores. Overhead is also expected from the task of correctly combining the partial results from the different processor cores into one data object (here: a dataframe) again at the end of a multiprocessed run.

A bottleneck for multiprocessing may also arise if multiple processes have to access certain distinct objects in memory at the same time. In our case this point is to be expected for the access to and search within distinct sub-dataframes of the vocabulary containing words of a specific length.

Due to overhead and bottlenecks we do not expect that a certain problem scales directly and linearly with the number of CPU cores. Another point is that the Linux OS may recognize a hyperthreading physical core of an Intel processor as two cores – but it may not be able to use such virtual cores in a given context as if they were real, separate physical cores.

Code to invoke multiple processor cores

In this article I just use the standard Python “multiprocessing” module. (I did not test Ray yet – as a first trial gave me trouble in some preparing code-segments of my Jupyter notebooks. I did not have time to solve the problems there.)

Following some advice on the Internet I handled parallelization in the following way:

import math                 # used by getlen() below
import multiprocessing
from multiprocessing import cpu_count, Pool

import numpy as np
import pandas as pd

#cores = cpu_count() # Number of physical CPU cores on your system
cores = 4
partitions = cores # But actually you can define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split), copy=False)
    pool.close()
    pool.join()
    return data

The basic function, corresponding to the parameter “func” of function “parallelize”, which shall be executed in our case is structurally well known from the last posts of this article series:

We perform a search via putting conditions on columns (of the vocabulary-dataframe) containing 3-char-grams at different positions. The search is done on sub-dataframes of the vocabulary containing only words with a given length. The respective addresses are controlled by a Python dictionary “d_df”; see the last post for its creation. We then build a list of indices of fitting words. The dataframe containing the test tokens – in our case a random selection of real vocabulary words – will be called “dfw” inside the function “func() => getlen()” (see below). To understand the code you should be aware of the fact that the original dataframe is split into (4) partitions.

We only return the length of the list of hits and not the list of indices for each token itself.

# Function for parallelized operation 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def getlen(dfw):
    # Note 1: The dfw passed is a segment ("partition") of the original dataframe  
    # Note 2: We use a dict d_lilen which was defined outside  
    #         and is under the control of the parallelization manager
    
    num_rows = len(dfw)
    for i in range(0, num_rows):
        len_w = dfw.iat[i,0]
        idx = dfw.iat[i,33]
        
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]

        j_m = math.floor(len_w/2)+1
        j_l = 2
        j_r = len_w -1
        col_l = 'gram_' + str(j_l)
        col_m = 'gram_' + str(j_m)
        col_r = 'gram_' + str(j_r)
        val_l = dfw.iat[i, j_l+2]
        val_m = dfw.iat[i, j_m+2]
        val_r = dfw.iat[i, j_r+2]
        li_ind = df_.index[   (df_[col_r]==val_r) 
                            & (df_[col_m]==val_m)
                            & (df_[col_l]==val_l)
                            ]
        d_lilen[idx] = len(li_ind)

    # The dataframe must be returned - otherwise it will not be concatenated after parallelization 
    return dfw

While the processes work on different segments of our input dataframe we write results to a Python dictionary “d_lilen” which is under the control of the “parallelization manager” (see below). A dictionary is appropriate as we might otherwise lose control over the dataframe-indices during the subsequent processes.

A reduced dataframe containing randomly selected “tokens”

To make things a bit easier we first create a “token”-dataframe “dfw_shorter3” based on a random selection of 100,000 indices from a dataframe containing long vocabulary words (length ≥ 10). We can derive it from our reference vocabulary. I have called the latter dataframe “dfw_short3” in the last post (because we use three 3-char-grams for longer tokens). “dfw_short3” contains all words of our vocabulary with a length of “10 ≤ length ≤ 30”.

# Prepare a sub-dataframe for the 100,000 random words 
# ******************************
num_w = 100000
len_dfw = len(dfw_short3)

# select 100,000 random rows 
random.seed()
# Note: random.sample does not repeat values 
li_ind_p_w = random.sample(range(0, len_dfw), num_w)
len_li_p_w = len(li_ind_p_w)

dfw_shorter3 = dfw_short3.iloc[li_ind_p_w, :].copy() 
dfw_shorter3['lx'] = 0
dfw_shorter3['idx'] = dfw_shorter3.index
dfw_shorter3.head(5)

The resulting dataframe “dfw_shorter3” looks like:


You see that the index varies randomly and is not in ascending order! This is the reason why we must pick up the index-information during our parallelized operations!

Code for executing parallelized run

The following code enforces a parallelized execution:

manager = multiprocessing.Manager()
d_lilen = manager.dict()
print(len(d_lilen))

v_start_time = time.perf_counter()
dfw_res = parallelize(dfw_shorter3, getlen)
v_end_time = time.perf_counter()
cpu_time   = v_end_time - v_start_time
print("cpu : ", cpu_time)

print(len(d_lilen))
mean_length  = sum(d_lilen.values()) / len(d_lilen)
print(mean_length)

The parallelized run takes about 29.5 seconds.

cpu :  29.46206265499968
100000
1.25008

How does cpu-time vary with the number of cores of my (hyperthreading) CPU?

The cpu-time does not improve much when the number of cores gets bigger than the number of real physical cores:

1 core : 90.5 secs       
2 cores: 47.6 secs  
3 cores: 35.1 secs 
4 cores: 29.4 secs 
5 cores: 28.2 secs 
6 cores: 26.9 secs 
7 cores: 26.0 secs 
8 cores: 25.5 secs

My readers know about this effect already from ML experiments with CUDA and libcublas:

As long as we use physical processor cores we see substantial improvement; beyond that no real gain in performance is observed on hyperthreading CPUs.

Compared to a run with just one CPU core we seem to gain a factor of almost 3 by parallelization. But, actually, this is no fair comparison: My readers have certainly seen that the CPU-time for the run with one CPU core is significantly longer than that of comparable runs which I described in my last post. At that time we found a cpu-time of only around 75 secs. So, we have a basic deficit of about 15 secs – without real parallelization!

Overhead and RAM consumption of multiprocessing

Why does the run with just one CPU core take so long? Is it functional overhead for organizing and controlling multiprocessing – which may occur despite using just one core and just one “partition” of the dataframe (i.e. the full dataframe)? Well, we can test this easily by reconstructing the runs of my last post a bit:

# Reformulate Run just for cpu-time comparisons 
# **********************************************
b_test = True 

# Function  
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def getleng(dfw, d_lileng):
    # Note 1: The dfw passed is a segment of the original dataframe  
    # Note 2: We pass in and fill a plain Python dict d_lileng 
    #         (no parallelization manager is involved in this comparison run)
    
    num_rows = len(dfw)
    #print(num_rows)
    for i in range(0, num_rows):
        len_w = dfw.iat[i,0]
        idx = dfw.iat[i,33]
        
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]

        j_m = math.floor(len_w/2)+1
        j_l = 2
        j_r = len_w -1
        col_l = 'gram_' + str(j_l)
        col_m = 'gram_' + str(j_m)
        col_r = 'gram_' + str(j_r)
        val_l = dfw.iat[i, j_l+2]
        val_m = dfw.iat[i, j_m+2]
        val_r = dfw.iat[i, j_r+2]
        li_ind = df_.index[   (df_[col_r]==val_r) 
                            & (df_[col_m]==val_m)
                            & (df_[col_l]==val_l)
                            ]
        leng = len(li_ind)
        d_lileng[idx] = leng

    return d_lileng


if b_test: 
    num_w = 100000
    len_dfw = len(dfw_short3)

    # select a 100,000 random rows 
    random.seed()
    # Note: random.sample does not repeat values 
    li_ind_p_w = random.sample(range(0, len_dfw), num_w)
    len_li_p_w = len(li_ind_p_w)

    dfw_shortx = dfw_short3.iloc[li_ind_p_w, :].copy() 
    dfw_shortx['lx']  = 0
    dfw_shortx['idx'] = dfw_shortx.index

    d_lileng = {} #

    v_start_time = time.perf_counter()
    d_lileng = getleng(dfw_shortx, d_lileng)
    v_end_time = time.perf_counter()
    cpu_time   = v_end_time - v_start_time
    print("cpu : ", cpu_time)
    print(len(d_lileng))
    mean_length = sum(d_lileng.values()) / len(d_lileng)
    print(mean_length)
    
    dfw_shortx.head(3)

 
How long does such a run take?

cpu :  77.96989408900026
100000
1.25666

Just 78 secs! This is pretty close to the number of 75 secs we got in our last post’s efforts! So, we see that turning to multiprocessing leads to significant functional overhead! The gain in performance, therefore, is less than the factor 3 observed above:

We (only) get a gain in performance by a factor of roughly 2.5 – when using 4 physical CPU cores.

I admit that I have no broad or detailed experience with Python multiprocessing. So, if somebody sees a problem in my code, please, send me a mail.

RAM is not released completely
Another negative side effect was the use of RAM in my case. Whereas the above test run without multiprocessing – including all required steps and the copying of parts of the loaded dataframe with all 3-char-grams – consumed just 2.2 GB of RAM, I saw a monstrous rise in memory during the parallelized runs:

Starting from a level of 2.4 GB, memory rose to 12.5 GB during the run and then fell back to 4.5 GB. So, there are copying processes and memory is not completely released again in the end – despite having each and every step encapsulated in functions. Repeating the multiprocessed runs even led to a systematic increase in memory by about 150 MB per run.

So, when working with the “multiprocessing module” and big Pandas dataframes you should be a bit careful about the actual RAM consumption during the runs.
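
If you want to keep an eye on the actual RAM consumption and trigger garbage collection explicitly between runs, a few lines like the following may help. This is just a sketch and assumes that the optional “psutil” package is installed:

import gc
import os
import psutil

def print_rss(label=""):
    # resident set size of the current (notebook) process in GB
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(label, round(rss / 1024**3, 2), "GB")

print_rss("before run:")
# dfw_res = parallelize(dfw_shorter3, getlen)
gc.collect()                     # ask Python explicitly to release what it can
print_rss("after run: ")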

Conclusion

This series about finding words in a vocabulary by using two or three 3-char-grams may have appeared a bit “academic” – as one of my readers told me. Why the hell should someone use only a few 3-char-grams to identify words?

Well, I have tried to give some answers to this question: Under certain conditions you may only have fragments of words available; think of text transcribed from a recorded, but distorted communication with Skype or think of physically damaged written text documents. A similar situation may occur when you cannot trust a written string token to be a correctly written word – due to misspelling or other reasons (bad OCR SW or bad document conditions for scans combined with OCR).

In addition: character-grams are actually used as a basis for multiple ML methods for text-analysis tasks, e.g. in Facebook’s Fasttext. They give a solid base for an embedded word vector space which can help to find and measure similarities between correctly written words, but also between correctly written words and fantasy words or misspelled words. Looking a bit at the question of how much a few 3-char-grams help to identify a word is helpful to understand their power in other contexts, too.

We have seen that only three 3-char-grams can identify matching words quite well – even if the words are long words (up to 30 characters). The list of matching words can be kept surprisingly small if and when

  • we use available or reasonable length information about the words we want to find,
  • we define positions for the 3-char-grams inside the words,
  • we put some positional distance between the location of the chosen 3-char-grams inside the words.

For 100,000 random cases with correctly written 3-char-grams the average length of the hit list was below 2 – if the distance between the 3-char-grams was reasonably large compared to the token-length. Similar results were found for using only two 3-char-grams for short words.

We have also covered some very practical aspects regarding search operations on relatively big Pandas dataframes:

The CPU-time for identifying words in a Pandas dataframe by using 3-char-grams is small enough to allow for experiments with around 100,000 tokens even on PCs – within minutes or quarters of an hour, not hours. As using 3-char-grams corresponds to putting conditions on two or three columns of a dataframe, this result can be generalized to other similar problems with string comparisons on dataframe columns.

The basic RAM consumption of dataframes containing up to fifty-five 3-char-grams per word can be efficiently controlled by using the dtype “category” for the respective columns.
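
For completeness, a minimal sketch of such a conversion – assuming the 3-char-gram columns of the vocabulary dataframe “dfw_uml” follow the “gram_<i>” naming used in this series:

gram_cols = [col for col in dfw_uml.columns if col.startswith('gram_')]
for col in gram_cols:
    dfw_uml[col] = dfw_uml[col].astype('category')
print(dfw_uml.memory_usage(deep=True).sum() / 1024**2, "MB")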

Regarding cpu-time we saw that working with many searches may get a performance boost by a factor well above 2 by using simple multiprocessing techniques based on Python’s “multiprocessing” module. However, this comes with an unpleasant side effect of enormous RAM consumption – at least temporarily.

I hope you had some fun with this series of posts. In a forthcoming series I will apply these results to the task of error correction. Stay tuned.

Links

https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a
https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-

 

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – III

Welcome back to this mini-series of posts on how we can search words in a vocabulary with the help of a few 3-char-grams. The sought words should fulfill the condition that they fit two or three selected 3-char-grams at certain positions of a given string-token:

Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – I
Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – II

In the first post we looked at general properties of a representative German vocabulary with respect to the distribution of 3-char-grams against their position in words. In my last post we learned from some experiments that we should use 3-char-grams with some positional distance between them. This will reduce the number of matching vocabulary words to a relatively small value – mostly below 10, often even below 5. Such a small number allows for a detailed analysis of the words. The analysis for selecting the best match may, among other more complicated things, involve a character-by-character comparison with the original string token or a distance measure in some word vector space.

My vocabulary resides in a Pandas dataframe. Pandas is often used as a RAM based data container in the context of text analysis tasks or data preparation for machine learning. In the present article I focus on the CPU-time required to find matching vocabulary words for 100,000 different tokens with the help of two or three selected 3-char-grams. So, this is basically about the CPU-time for requests which put conditions on a few columns of a medium sized Pandas dataframe.

I will distinguish between searches for words with a length ≤ 9 characters and searches for longer words. Whilst processing the data I will also calculate the resulting average number of words in the hit list of matching words.

A simplifying approach

  • As test-tokens I pick 100,000 randomly distributed words out of my alphabetically sorted vocabulary or 100,000 words out of certain regions of the vocabulary,
  • I select two or three 3-char-grams out of each of these words,
  • I search for matching words in the vocabulary with the same 3-char-grams at their given positions within the respective word string.
  • So, our 3-char-grams for comparison are correctly written. In real data analysis experiments for string tokens of a given text collection the situation may be different – just wait for future posts. You may then have to vary the 3-char-gram positions to get a hit list at all. But even for correct 3-grams we already know from previous experiments that the hit list, understandably, often enough contains more than just one word.

    For words ≤ 9 letters we use two 3-char-grams, for longer words three 3-char-grams. We process 7 runs in each case. The runs are different
    regarding the choice of the 3-char-grams’ positions within the tokens; see the code in the sections below for the differences in the positions.

    My selections of the positions of the 3-char-grams within the word follow mainly the strategy of relatively big distances between the 3-char-grams. This strategy was the main result of the last post. We also follow another insight which we got there:
    For each token we use the length information, i.e. we work on a pre-defined slice of the dataframe containing only words of the same length as the token. (In the case of real life tokens you may have to vary the length parameters for different search attempts if you have reason to assume that the token is misspelled.)

    I perform all test runs on a relatively old i7-6700K CPU.

    Predefined slices of the vocabulary for words with a given length

    We first create slices for words of a certain length and put the addresses into a dictionary:

    # Create vocab slices for all word-lengths  
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~--------------
    b_exact_len = True
    
    li_min = []
    li_df = []
    d_df  = {}
    for i in range(4,57): 
        li_min.append(i)
    
    len_li = len(li_min)
    for i in range(0, len_li-1):
        mil = li_min[i]
        if b_exact_len: 
            df_x = dfw_uml.loc[(dfw_uml['len'] == mil)]
            df_x = df_x.iloc[:, 2:]
            li_df.append(df_x)
            key = "df_" + str(mil)
            d_df[key] = df_x
        else: 
            mal = li_min[i+1]
            df_x = dfw_uml.loc[(dfw_uml['len'] >= mil) & (dfw_uml['len']< mal)]
            df_x = df_x.iloc[:, 2:]
            li_df.append(df_x)
            key = "df_" + str(mil) + str(mal -1)
            d_df[key] = df_x
    print("Fertig: len(li_df) = ", len(li_df), " : len(d_df) = ", len(d_df))
    li_df[12].head(5)
    

    Giving e.g.:

    Dataframe with words longer than 9 letters

    We then create a sub-dataframe containing words with “10 ≤ word-length < 30“. Reason: We know from a previous post that this selection covers most of the longer words in the vocabulary.

     
    #******************************************************
    # Reduce the vocab to strings in a certain length range 
    # => Build dfw_short3 for long words and dfw_short2 for short words 
    #******************************************************
    # we produce two dfw_short frames: 
    # - one for words with length >= 10  => three 3-char-grams
    # - one for words with length <= 9   => two 3-char-grams 
    
    # Parameters
    # ~~~~~~~~~~~
    min_3_len = 10
    max_3_len = 30
    
    min_2_len = 4
    max_2_len = 9
    
    mil_3 = min_3_len - 1 
    mal_3 = max_3_len + 1
    max_3_col = max_3_len + 4
    dfw_short3 = dfw_uml.loc[(dfw_uml.lower.str.len() > mil_3) & (dfw_uml.lower.str.len() < mal_3)]
    dfw_short3 = dfw_short3.iloc[:, 2:max_3_col]
    
    mil_2 = min_2_len - 1 
    mal_2 = max_2_len + 1
    max_2_col = max_2_len + 4
    dfw_short2 = dfw_uml.loc[(dfw_uml.lower.str.len() > mil_2) & (dfw_uml.lower.str.len() < mal_2)]
    dfw_short2 = dfw_short2.iloc[:, 2:max_2_col]
    
    print(len(dfw_short3))
    print()
    dfw_short3.head(8)
    
    

    This gives us a list of around 2.5 million words (out of 2.7 million) in “dfw_short3”. The columns are “len” (containing the length), lower (containing the lower case version of a word) and columns for 3-char-grams from position 0 to 29:

    The first 3-char-gram residing completely within the word is at column “gram_2”. We have used left- and right-padding 3-char-grams; see a previous post for this point.
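
    As a reminder, padded 3-char-grams for a word could be built along the following lines. This is only a sketch; the actual preparation was described in an earlier post, and the padding character “%” is just an assumption here:

    def grams_of(word, pad_char='%'):
        # pad with two characters on each side, then slide a window of width 3
        padded = pad_char * 2 + word.lower() + pad_char * 2
        return [padded[i:i+3] for i in range(len(padded) - 2)]

    print(grams_of("apfel"))
    # ['%%a', '%ap', 'apf', 'pfe', 'fel', 'el%', 'l%%']  ->  "apf" would sit in column "gram_2"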

    The corresponding “dfw_short2” for words with a length below 10 characters is much shorter; it contains only around 186,000 words.

    A function to get a hit list of words matching two or three 3-char-grams

    For our experiment I use the following (quick and dirty) function get_fit_words_3_grams() to select the right slice of the vocabulary and perform the search for words matching three 3-char-grams of longer string tokens:

    def get_fit_words_3_grams(dfw, len_w, j, pos_l=-1, pos_m=-1, pos_r=-1, b_std_pos = True):
        # dfw: source df for the tokens
        # j: row position of token in dfw (not index-label)
        
        b_analysis = False
            
        try:
            dfw
        except NameError:
            print("dfw not defined ")
        
        # get token length 
        #len_w = dfw.iat[j,0]
        #word  = dfw.iat[j, 1]
        
        # get the right slice of the vocabulary with words corresponding to the length
        df_name = "df_" + str(len_w)
        df_ = d_df[df_name]
        
        if b_std_pos:
            j_l  = 2
            j_m  = math.floor(len_w/2)+1
            j_r  = len_w - 1 
            j_rm = j_m + 2 
        else:
            if pos_l==-1 or pos_m == -1 or pos_r == -1 or pos_m >= pos_r: 
                print("one or all of the positions is not defined or pos_m >= pos_r")
                sys.exit()
            j_l = pos_l
            j_m = pos_m
            j_r = pos_r
            if pos_m >= len_w+1 or pos_r >= len_w+2:
                print("Positions exceed defined positions of 3-char-grams for the token (len= ", len_w, ")") 
                sys.exit()
    
        col_l  = 'gram_' + str(j_l);  val_l  = dfw.iat[j, j_l+2]
        col_m  = 'gram_' + str(j_m);  val_m  = dfw.iat[j, j_m+2]
        col_r  = 'gram_' + str(j_r);  val_r  = dfw.iat[j, j_r+2]
        #print(len_w, ":", word, ":", j_l, ":", j_m, ":", j_r, ":", val_l, ":", val_m, ":", val_r )
    
        li_ind = df_.index[  (df_[col_r]==val_r) 
                           #& (df_[col_rm]==val_rm) 
                           & (df_[col_m]==val_m)
                           & (df_[col_l]==val_l)
                          ].to_list()
        
        if b_analysis:
            leng_li = len(li_ind)
            if leng_li >90:
                print("!!!!")
                for m in range(0, leng_li):
                    print(df_.loc[li_ind[m], 'lower'])
                print("!!!!")
            
        #print(word, ":", leng_li, ":", len_w, ":", j_l, ":", j_m, ":", j_r, ":", val_l, ":", val_m, ":", val_r)
        return len(li_ind), len_w
    
    

     
    For “b_std_pos == True” all 3-char-grams reside completely within the word with a maximum distance to each other.

    An analogous function “get_fit_words_2_grams(dfw, len_w, j, pos_l=-1, pos_r=-1, b_std_pos = True)” basically does the same but for a chosen left and a right positioned 3-char-gram, only. The latter function is to be applied for words with a length ≤ 9.
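
    For readers who want to reproduce the runs, here is a minimal sketch of how such a 2-gram variant might look (not the original code; it just mirrors the structure of get_fit_words_3_grams() above):

    def get_fit_words_2_grams(dfw, len_w, j, pos_l=-1, pos_r=-1, b_std_pos = True):
        # pick the vocabulary slice for words of this length
        df_ = d_df["df_" + str(len_w)]
        if b_std_pos:
            j_l = 2            # first 3-char-gram completely inside the word
            j_r = len_w - 1    # last 3-char-gram completely inside the word
        else:
            j_l = pos_l; j_r = pos_r
        col_l = 'gram_' + str(j_l);  val_l = dfw.iat[j, j_l+2]
        col_r = 'gram_' + str(j_r);  val_r = dfw.iat[j, j_r+2]
        li_ind = df_.index[ (df_[col_r]==val_r) & (df_[col_l]==val_l) ].to_list()
        return len(li_ind), len_w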

    Function to perform the test runs

    A quick and dirty function to perform the planned different test runs is

    # Check for 100,000 words, how long the index list is for conditions on three 3-gram_cols or two 3-grams 
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    #Three 3-char-grams or two 3-char-grams? 
    b_3 = True
    
    # parameter
    num_w   = 100000
    #num_w   = 50000
    n_start = 0
    n_end   = n_start + num_w 
    
    # run type 
    b_random = True
    pos_type = 0 
    #pos_type = 1 
    #pos_type = 2 
    #pos_type = 3 
    #pos_type = 4 
    #pos_type = 5
    #pos_type = 6
    #pos_type = 7
    
    if b_3: 
        len_dfw = len(dfw_short3)
    else:
        len_dfw = len(dfw_short2)
        print("len dfw_short2 = ", len_dfw)
        
    if b_random: 
        random.seed()
        li_ind_w = random.sample(range(0, len_dfw), num_w)
    else: 
        li_ind_w = list(range(n_start, n_end, 1))
        
    #print(li_ind_w) 
    
    if n_start+num_w > len_dfw:
        print("Error: wrong choice of params ")
        sys.exit()
    
    ay_inter_lilen = np.zeros((num_w,), dtype=np.int16)
    ay_inter_wolen = np.zeros((num_w,), dtype=np.int16)
    
    v_start_time = time.perf_counter()
    n = 0 
    for i in range(0, num_w):
        ind = li_ind_w[i]
        if b_3:
            leng_w = dfw_short3.iat[ind,0]
        else:
            leng_w = dfw_short2.iat[ind,0]
            
        #print(ind, leng_w)
        
        # adapt pos_l, pos_m, pos_r
        # ************************
        if pos_type == 1:
            pos_l = 3
            pos_m = math.floor(leng_w/2)+1
            pos_r = leng_w - 1
        elif pos_type == 2:
            pos_l = 2
            pos_m = math.floor(leng_w/2)+1
            pos_r = leng_w - 2
        elif pos_type == 3:
            pos_l = 4
            pos_m = math.floor(leng_w/2)+2
            pos_r = leng_w - 1
        elif pos_type == 4:
            pos_l = 2
            pos_m = math.floor(leng_w/2)
            pos_r = leng_w - 3
        elif pos_type == 5:
            pos_l = 5
            pos_m = math.floor(leng_w/2)+2
            pos_r = leng_w - 1
        elif pos_type == 6:
            pos_l = 2
            pos_m = math.floor(leng_w/2)
            pos_r = leng_w - 4
        elif pos_type == 7:
            pos_l = 3
            pos_m = math.floor(leng_w/2)
            pos_r = leng_w - 2
       
        # 3-gram check 
        if b_3:
            if pos_type == 0: 
                leng, lenw = get_fit_words_3_grams(dfw_short3, leng_w, ind, 0, 0, 0, True)
            else: 
                leng, lenw = get_fit_words_3_grams(dfw_short3, leng_w, ind, pos_l, pos_m, pos_r, False)
        else:
            if pos_type == 0: 
                leng, lenw = get_fit_words_2_grams(dfw_short2, leng_w, ind, 0, 0, True)
            else: 
                leng, lenw = get_fit_words_2_grams(dfw_short2, leng_w, ind, pos_l, pos_r, False)
        
        
        ay_inter_lilen[n] = leng
        ay_inter_wolen[n] = lenw
        #print (leng)
        n += 1
    v_end_time = time.perf_counter()
    
    cpu_time   = v_end_time - v_start_time
    num_tokens = len(ay_inter_lilen)
    mean_hits  = ay_inter_lilen.mean()
    max_hits   = ay_inter_lilen.max()
    
    if b_random:
        print("cpu : ", "{:.2f}".format(cpu_time), " :: tokens =", num_tokens, 
              " :: mean =", "{:.2f}".format(mean_hits), ":: max =", "{:.2f}".format(max_hits) )
    else:
        print("n_start =", n_start, " :: cpu : ", "{:.2f}".format(cpu_time), ":: tokens =", num_tokens, 
          ":: mean =", "{:.2f}".format(mean_hits), ":: max =", "{:.2f}".format(max_hits) )
    print()
    print(ay_inter_lilen)
    

     

    Test runs for words with a length ≥ 10 and three 3-char-grams

    Pandas runs per default on just one CPU core. Typical run times are around 76 secs, depending a bit on the background load on my Linux PC. Outputs for 3 consecutive runs with “b_random = True” and different “pos_type”-values are:

    “b_random = True” and “pos_type = 0”

         
    cpu :  75.82  :: tokens = 100000  :: mean = 1.25 :: max = 91.00
    cpu :  75.40  :: tokens = 100000  :: mean = 1.25 :: max = 91.00
    cpu :  75.43  :: tokens = 100000  :: mean = 1.25 :: max = 91.00
    

    The average value “mean” for the length of the hit list is quite small. But there obviously are a few tokens for which the hit list is quite long (max-value > 90). We shall see below that the surprisingly large value of the maximum is only due to words in two specific regions of the vocabulary.

    The next section for “pos_type = 1” shows a better behavior:

    “b_random = True” and “pos_type = 1”

    cpu :  75.23  :: tokens = 100000  :: mean = 1.18 :: max = 27.00
    cpu :  76.39  :: tokens = 100000  :: mean = 1.18 :: max = 24.00
    cpu :  75.95  :: tokens = 100000  :: mean = 1.17 :: max = 27.00
    

    The next position variation again suffers from words in the same regions of the vocabulary where we got problems already for pos_type = 0:

    “b_random = True” and “pos_type = 2”

    cpu :  75.07  :: tokens = 100000  :: mean = 1.28 :: max = 52.00
    cpu :  75.57  :: tokens = 100000  :: mean = 1.28 :: max = 52.00
    cpu :  75.78  :: tokens = 100000  :: mean = 1.28 :: max = 52.00
    

    The next positional variation shows a much lower max-value; the mean value is convincing:

    “b_random = True” and “pos_type = 3”

    cpu :  74.70  :: tokens = 100000  :: mean = 1.21 :: max = 23.00
    cpu :  74.78  :: tokens = 100000  :: mean = 1.22 :: max = 23.00
    cpu :  74.48  :: tokens = 100000  :: mean = 1.22 :: max = 24.00
    
    

    “b_random = True” and “pos_type = 4”

    cpu :  75.18  :: tokens = 100000  :: mean = 1.27 :: max = 52.00
    cpu :  75.45  :: tokens = 100000  :: mean = 1.26 :: max = 52.00
    cpu :  74.65  :: tokens = 100000  :: mean = 1.27 :: max = 52.00
    

    For “pos_type = 5” we get again worse results for the average values:

    “b_random = True” and “pos_type = 5”

    cpu :  74.21  :: tokens = 100000  :: mean = 1.70 :: max = 49.00
    cpu :  74.95  :: tokens = 100000  :: mean = 1.71 :: max = 49.00
    cpu :  74.28  :: tokens = 100000  :: mean = 1.70 :: max = 49.00
    

    “b_random = True” and “pos_type = 6”

    cpu :  74.21  :: tokens = 100000  :: mean = 1.49 :: max = 31.00
    cpu :  74.16  :: tokens = 100000  :: mean = 1.49 :: max = 28.00
    cpu :  74.21  :: tokens = 100000  :: mean = 1.50 :: max = 31.00
    

    “b_random = True” and “pos_type = 7”

    cpu :  75.02  :: tokens = 100000  :: mean = 1.28 :: max = 34.00
    cpu :  74.19  :: tokens = 100000  :: mean = 1.28 :: max = 34.00
    cpu :  73.56  :: tokens = 100000  :: mean = 1.28 :: max = 34.00
    

    The data for the mean number of matching words are overall consistent with our general considerations and observations in the previous post of this article series. The CPU-times are very reasonable – even if we had to perform 5 different 3-char-gram requests per token we could do this within 6.5 to 7 minutes.

    A bit worrying is the result for the maximum of the hit-list length. The next section will show that the max-values above stem from some words in two distinct sections of the vocabulary.

    Data for certain regions of the vocabulary

    It is always reasonable to look a bit closer at different regions of the vocabulary. Therefore, we repeat some runs – this time not for random data, but for consecutive segments of 50,000 tokens, each starting at a position “n_start” in the alphabetically sorted vocabulary. The loop structure of such segment-wise runs is sketched below; the measured values per segment follow afterwards.
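    The following sketch only illustrates the loop structure of these segment-wise runs. The toy vocabulary, the tiny segment size and the simplified hit-list function (counting words that share the first, middle and last 3-char-gram) are placeholders and not the original test code.

    import numpy as np

    # Toy vocabulary standing in for the alphabetically sorted word list.
    vocab_sorted = sorted([
        "verbindungskosten", "verwaltungsposten", "versorgungsketten",
        "bergwanderung", "bergsteiger", "blumenwiese", "blumenladen",
    ])

    def three_grams(word):
        # first, middle and last 3-char-gram - a simplified positional scheme
        mid = (len(word) - 3) // 2
        return (word[:3], word[mid:mid + 3], word[-3:])

    def hit_list_length(word, vocab):
        # number of vocabulary words sharing all three char-grams
        target = three_grams(word)
        return sum(1 for w in vocab if three_grams(w) == target)

    num_w = 3                  # 50,000 in the real runs; tiny for the toy vocabulary
    for n_start in range(0, len(vocab_sorted), num_w):
        segment = vocab_sorted[n_start:n_start + num_w]
        lens = np.array([hit_list_length(w, vocab_sorted) for w in segment])
        print(f"n_start = {n_start:<3} :: tokens = {len(segment)} "
              f":: mean = {lens.mean():.2f} :: max = {lens.max()}")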

    “b_random = False” and “pos_type = 0” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.10 :: max = 10
    n_start = 50000   :: tokens = 50000 :: mean = 1.15 :: max = 14
    n_start = 100000  :: tokens = 50000 :: mean = 1.46 :: max = 26
    n_start = 150000  :: tokens = 50000 :: mean = 1.25 :: max = 26
    n_start = 200000  :: tokens = 50000 :: mean = 1.30 :: max = 14
    n_start = 250000  :: tokens = 50000 :: mean = 1.15 :: max = 20
    n_start = 300000  :: tokens = 50000 :: mean = 1.10 :: max = 13
    n_start = 350000  :: tokens = 50000 :: mean = 1.07 :: max = 6
    n_start = 400000  :: tokens = 50000 :: mean = 1.11 :: max = 12
    n_start = 450000  :: tokens = 50000 :: mean = 1.28 :: max = 14
    n_start = 500000  :: tokens = 50000 :: mean = 1.38 :: max = 20
    n_start = 550000  :: tokens = 50000 :: mean = 1.12 :: max = 15
    n_start = 600000  :: tokens = 50000 :: mean = 1.11 :: max = 11
    n_start = 650000  :: tokens = 50000 :: mean = 1.18 :: max = 16
    n_start = 700000  :: tokens = 50000 :: mean = 1.12 :: max = 17
    n_start = 750000  :: tokens = 50000 :: mean = 1.20 :: max = 19
    n_start = 800000  :: tokens = 50000 :: mean = 1.32 :: max = 21
    n_start = 850000  :: tokens = 50000 :: mean = 1.13 :: max = 13
    n_start = 900000  :: tokens = 50000 :: mean = 1.11 :: max = 9
    n_start = 950000  :: tokens = 50000 :: mean = 1.15 :: max = 14
    n_start = 1000000 :: tokens = 50000 :: mean = 1.21 :: max = 25
    n_start = 1050000 :: tokens = 50000 :: mean = 1.08 :: max = 7
    n_start = 1100000 :: tokens = 50000 :: mean = 1.08 :: max = 10
    n_start = 1150000 :: tokens = 50000 :: mean = 1.32 :: max = 20
    n_start = 1200000 :: tokens = 50000 :: mean = 1.14 :: max = 18
    n_start = 1250000 :: tokens = 50000 :: mean = 1.15 :: max = 14
    n_start = 1300000 :: tokens = 50000 :: mean = 1.10 :: max = 12
    n_start = 1350000 :: tokens = 50000 :: mean = 1.14 :: max = 13
    n_start = 1400000 :: tokens = 50000 :: mean = 1.09 :: max = 11
    n_start = 1450000 :: tokens = 50000 :: mean = 1.12 :: max = 12
    n_start = 1500000 :: tokens = 50000 :: mean = 1.15 :: max = 33
    n_start = 1550000 :: tokens = 50000 :: mean = 1.15 :: max = 19
    n_start = 1600000 :: tokens = 50000 :: mean = 1.27 :: max = 28
    n_start = 1650000 :: tokens = 50000 :: mean = 1.10 :: max = 11
    n_start = 1700000 :: tokens = 50000 :: mean = 1.13 :: max = 15
    n_start = 1750000 :: tokens = 50000 :: mean = 1.23 :: max = 57
    n_start = 1800000 :: tokens = 50000 :: mean = 1.79 :: max = 57
    n_start = 1850000 :: tokens = 50000 :: mean = 1.44 :: max = 57
    n_start = 1900000 :: tokens = 50000 :: mean = 1.17 :: max = 20
    n_start = 1950000 :: tokens = 50000 :: mean = 1.24 :: max = 19
    n_start = 2000000 :: tokens = 50000 :: mean = 1.31 :: max = 19
    n_start = 2050000 :: tokens = 50000 :: mean = 1.08 :: max = 19
    n_start = 2100000 :: tokens = 50000 :: mean = 1.12 :: max = 17
    n_start = 2150000 :: tokens = 50000 :: mean = 1.24 :: max = 27
    n_start = 2200000 :: tokens = 50000 :: mean = 2.39 :: max = 91
    n_start = 2250000 :: tokens = 50000 :: mean = 2.76 :: max = 91
    n_start = 2300000 :: tokens = 50000 :: mean = 1.14 :: max = 10
    n_start = 2350000 :: tokens = 50000 :: mean = 1.17 :: max = 12
    n_start = 2400000 :: tokens = 50000 :: mean = 1.18 :: max = 21
    n_start = 2450000 :: tokens = 50000 :: mean = 1.16 :: max = 24
    

     
    These data are pretty consistent with those of the random approach. We see that there are some intervals where the hit list gets bigger – but on average not bigger than 3.

    However, we learn something important here:

    In all segments of the vocabulary there are a few words for which our recipe of distanced 3-char-grams nevertheless leads to long hit lists.

    This is also reflected by the data for other positional distributions of the 3-char-grams:

    “b_random = False” and “pos_type = 1” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.08 :: max = 10
    n_start = 50000   :: tokens = 50000 :: mean = 1.14 :: max = 14
    n_start = 100000  :: tokens = 50000 :: mean = 1.16 :: max = 13
    n_start = 150000  :: tokens = 50000 :: mean = 1.17 :: max = 16
    n_start = 200000  :: tokens = 50000 :: mean = 1.24 :: max = 15
    n_start = 250000  :: tokens = 50000 :: mean = 1.15 :: max = 20
    n_start = 300000  :: tokens = 50000 :: mean = 1.12 :: max = 12
    n_start = 350000  :: tokens = 50000 :: mean = 1.13 :: max = 13
    n_start = 400000  :: tokens = 50000 :: mean = 1.13 :: max = 18
    n_start = 450000  :: tokens = 50000 :: mean = 1.12 :: max = 10
    n_start = 500000  :: tokens = 50000 :: mean = 1.20 :: max = 18
    n_start = 550000  :: tokens = 50000 :: mean = 1.15 :: max = 19
    n_start = 600000  :: tokens = 50000 :: mean = 1.13 :: max = 14
    n_start = 650000  :: tokens = 50000 :: mean = 1.17 :: max = 18
    n_start = 700000  :: tokens = 50000 :: mean = 1.15 :: max = 12
    n_start = 750000  :: tokens = 50000 :: mean = 1.20 :: max = 16
    n_start = 800000  :: tokens = 50000 :: mean = 1.30 :: max = 21
    n_start = 850000  :: tokens = 50000 :: mean = 1.13 :: max = 13
    n_start = 900000  :: tokens = 50000 :: mean = 1.14 :: max = 13
    n_start = 950000  :: tokens = 50000 :: mean = 1.16 :: max = 14
    n_start = 1000000 :: tokens = 50000 :: mean = 1.22 :: max = 25
    n_start = 1050000 :: tokens = 50000 :: mean = 1.12 :: max = 14
    n_start = 1100000 :: tokens = 50000 :: mean = 1.11 :: max = 12
    n_start = 1150000 :: tokens = 50000 :: mean = 1.24 :: max = 16
    n_start = 1200000 :: tokens = 50000 :: mean = 1.14 :: max = 18
    n_start = 1250000 :: tokens = 50000 :: mean = 1.25 :: max = 15
    n_start = 1300000 :: tokens = 50000 :: mean = 1.16 :: max = 15
    n_start = 1350000 :: tokens = 50000 :: mean = 1.17 :: max = 14
    n_start = 1400000 :: tokens = 50000 :: mean = 1.10 :: max = 10
    n_start = 1450000 :: tokens = 50000 :: mean = 1.16 :: max = 21
    n_start = 1500000 :: tokens = 50000 :: mean = 1.18 :: max = 33
    n_start = 1550000 :: tokens = 50000 :: mean = 1.17 :: max = 20
    n_start = 1600000 :: tokens = 50000 :: mean = 1.15 :: max = 14
    n_start = 1650000 :: tokens = 50000 :: mean = 1.16 :: max = 12
    n_start = 1700000 :: tokens = 50000 :: mean = 1.17 :: max = 15
    n_start = 1750000 :: tokens = 50000 :: mean = 1.16 :: max = 12
    n_start = 1800000 :: tokens = 50000 :: mean = 1.20 :: max = 14
    n_start = 1850000 :: tokens = 50000 :: mean = 1.17 :: max = 13
    n_start = 1900000 :: tokens = 50000 :: mean = 1.17 :: max = 20
    n_start = 1950000 :: tokens = 50000 :: mean = 1.07 :: max = 11
    n_start = 2000000 :: tokens = 50000 :: mean = 1.13 :: max = 15
    n_start = 2050000 :: tokens = 50000 :: mean = 1.10 :: max = 8
    n_start = 2100000 :: tokens = 50000 :: mean = 1.15 :: max = 17
    n_start = 2150000 :: tokens = 50000 :: mean = 1.27 :: max = 27
    n_start = 2200000 :: tokens = 50000 :: mean = 1.47 :: max = 24
    n_start = 2250000 :: tokens = 50000 :: mean = 1.34 :: max = 22
    n_start = 2300000 :: tokens = 50000 :: mean = 1.18 :: max = 12
    n_start = 2350000 :: tokens = 50000 :: mean = 1.19 :: max = 14
    n_start = 2400000 :: tokens = 50000 :: mean = 1.25 :: max = 21
    n_start = 2450000 :: tokens = 50000 :: mean = 1.17 :: max = 25
    

     

    “b_random = False” and “pos_type = 2” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.25 :: max = 11
    n_start = 50000   :: tokens = 50000 :: mean = 1.25 :: max = 8
    n_start = 100000  :: tokens = 50000 :: mean = 1.50 :: max = 18
    n_start = 150000  :: tokens = 50000 :: mean = 1.25 :: max = 18
    n_start = 200000  :: tokens = 50000 :: mean = 1.36 :: max = 15
    n_start = 250000  :: tokens = 50000 :: mean = 1.19 :: max = 13
    n_start = 300000  :: tokens = 50000 :: mean = 1.15 :: max = 7
    n_start = 350000  :: tokens = 50000 :: mean = 1.15 :: max = 6
    n_start = 400000  :: tokens = 50000 :: mean = 1.18 :: max = 9
    n_start = 450000  :: tokens = 50000 :: mean = 1.36 :: max = 15
    n_start = 500000  :: tokens = 50000 :: mean = 1.39 :: max = 14
    n_start = 550000  :: tokens = 50000 :: mean = 1.20 :: max = 15
    n_start = 600000  :: tokens = 50000 :: mean = 1.16 :: max = 6
    n_start = 650000  :: tokens = 50000 :: mean = 1.21 :: max = 8
    n_start = 700000  :: tokens = 50000 :: mean = 1.18 :: max = 8
    n_start = 750000  :: tokens = 50000 :: mean = 1.27 :: max = 12
    n_start = 800000  :: tokens = 50000 :: mean = 1.32 :: max = 13
    n_start = 850000  :: tokens = 50000 :: mean = 1.18 :: max = 8
    n_start = 900000  :: tokens = 50000 :: mean = 1.17 :: max = 8
    n_start = 950000  :: tokens = 50000 :: mean = 1.25 :: max = 10
    n_start = 1000000 :: tokens = 50000 :: mean = 1.22 :: max = 11
    n_start = 1050000 :: tokens = 50000 :: mean = 1.15 :: max = 8
    n_start = 1100000 :: tokens = 50000 :: mean = 1.15 :: max = 6
    n_start = 1150000 :: tokens = 50000 :: mean = 1.29 :: max = 15
    n_start = 1200000 :: tokens = 50000 :: mean = 1.17 :: max = 7
    n_start = 1250000 :: tokens = 50000 :: mean = 1.17 :: max = 8
    n_start = 1300000 :: tokens = 50000 :: mean = 1.16 :: max = 9
    n_start = 1350000 :: tokens = 50000 :: mean = 1.18 :: max = 8
    n_start = 1400000 :: tokens = 50000 :: mean = 1.17 :: max = 8
    n_start = 1450000 :: tokens = 50000 :: mean = 1.17 :: max = 7
    n_start = 1500000 :: tokens = 50000 :: mean = 1.17 :: max = 9
    n_start = 1550000 :: tokens = 50000 :: mean = 1.17 :: max = 7
    n_start = 1600000 :: tokens = 50000 :: mean = 1.31 :: max = 24
    n_start = 1650000 :: tokens = 50000 :: mean = 1.18 :: max = 9
    n_start = 1700000 :: tokens = 50000 :: mean = 1.17 :: max = 13
    n_start = 1750000 :: tokens = 50000 :: mean = 1.26 :: max = 21
    n_start = 1800000 :: tokens = 50000 :: mean = 1.70 :: max = 21
    n_start = 1850000 :: tokens = 50000 :: mean = 1.43 :: max = 21
    n_start = 1900000 :: tokens = 50000 :: mean = 1.19 :: max = 10
    n_start = 1950000 :: tokens = 50000 :: mean = 1.30 :: max = 11
    n_start = 2000000 :: tokens = 50000 :: mean = 1.33 :: max = 11
    n_start = 2050000 :: tokens = 50000 :: mean = 1.16 :: max = 8
    n_start = 2100000 :: tokens = 50000 :: mean = 1.17 :: max = 9
    n_start = 2150000 :: tokens = 50000 :: mean = 1.41 :: max = 20
    n_start = 2200000 :: tokens = 50000 :: mean = 2.08 :: max = 52
    n_start = 2250000 :: tokens = 50000 :: mean = 2.27 :: max = 52
    n_start = 2300000 :: tokens = 50000 :: mean = 1.21 :: max = 11
    n_start = 2350000 :: tokens = 50000 :: mean = 1.21 :: max = 10
    n_start = 2400000 :: tokens = 50000 :: mean = 1.21 :: max = 9
    n_start = 2450000 :: tokens = 50000 :: mean = 1.30 :: max = 18
    

     

    “b_random = False” and “pos_type = 3” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.23 :: max = 23
    n_start = 50000   :: tokens = 50000 :: mean = 1.25 :: max = 17
    n_start = 100000  :: tokens = 50000 :: mean = 1.16 :: max = 17
    n_start = 150000  :: tokens = 50000 :: mean = 1.22 :: max = 15
    n_start = 200000  :: tokens = 50000 :: mean = 1.22 :: max = 17
    n_start = 250000  :: tokens = 50000 :: mean = 1.18 :: max = 11
    n_start = 300000  :: tokens = 50000 :: mean = 1.27 :: max = 23
    n_start = 350000  :: tokens = 50000 :: mean = 1.29 :: max = 23
    n_start = 400000  :: tokens = 50000 :: mean = 1.14 :: max = 11
    n_start = 450000  :: tokens = 50000 :: mean = 1.18 :: max = 17
    n_start = 500000  :: tokens = 50000 :: mean = 1.16 :: max = 15
    n_start = 550000  :: tokens = 50000 :: mean = 1.26 :: max = 17
    n_start = 600000  :: tokens = 50000 :: mean = 1.20 :: max = 13
    n_start = 650000  :: tokens = 50000 :: mean = 1.10 :: max = 9
    n_start = 700000  :: tokens = 50000 :: mean = 1.20 :: max = 17
    n_start = 750000  :: tokens = 50000 :: mean = 1.17 :: max = 17
    n_start = 800000  :: tokens = 50000 :: mean = 1.28 :: max = 19
    n_start = 850000  :: tokens = 50000 :: mean = 1.15 :: max = 15
    n_start = 900000  :: tokens = 50000 :: mean = 1.19 :: max = 11
    n_start = 950000  :: tokens = 50000 :: mean = 1.19 :: max = 13
    n_start = 1000000 :: tokens = 50000 :: mean = 1.24 :: max = 24
    n_start = 1050000 :: tokens = 50000 :: mean = 1.17 :: max = 10
    n_start = 1100000 :: tokens = 50000 :: mean = 1.29 :: max = 23
    n_start = 1150000 :: tokens = 50000 :: mean = 1.18 :: max = 13
    n_start = 1200000 :: tokens = 50000 :: mean = 1.18 :: max = 16
    n_start = 1250000 :: tokens = 50000 :: mean = 1.38 :: max = 23
    n_start = 1300000 :: tokens = 50000 :: mean = 1.30 :: max = 23
    n_start = 1350000 :: tokens = 50000 :: mean = 1.21 :: max = 15
    n_start = 1400000 :: tokens = 50000 :: mean = 1.21 :: max = 23
    n_start = 1450000 :: tokens = 50000 :: mean = 1.23 :: max = 12
    n_start = 1500000 :: tokens = 50000 :: mean = 1.21 :: max = 13
    n_start = 1550000 :: tokens = 50000 :: mean = 1.22 :: max = 12
    n_start = 1600000 :: tokens = 50000 :: mean = 1.12 :: max = 13
    n_start = 1650000 :: tokens = 50000 :: mean = 1.27 :: max = 16
    n_start = 1700000 :: tokens = 50000 :: mean = 1.23 :: max = 15
    n_start = 1750000 :: tokens = 50000 :: mean = 1.26 :: max = 11
    n_start = 1800000 :: tokens = 50000 :: mean = 1.08 :: max = 7
    n_start = 1850000 :: tokens = 50000 :: mean = 1.11 :: max = 12
    n_start = 1900000 :: tokens = 50000 :: mean = 1.26 :: max = 23
    n_start = 1950000 :: tokens = 50000 :: mean = 1.06 :: max = 9
    n_start = 2000000 :: tokens = 50000 :: mean = 1.11 :: max = 15
    n_start = 2050000 :: tokens = 50000 :: mean = 1.16 :: max = 16
    n_start = 2100000 :: tokens = 50000 :: mean = 1.17 :: max = 13
    n_start = 2150000 :: tokens = 50000 :: mean = 1.33 :: max = 16
    n_start = 2200000 :: tokens = 50000 :: mean = 1.29 :: max = 24
    n_start = 2250000 :: tokens = 50000 :: mean = 1.20 :: max = 17
    n_start = 2300000 :: tokens = 50000 :: mean = 1.35 :: max = 17
    n_start = 2350000 :: tokens = 50000 :: mean = 1.25 :: max = 12
    n_start = 2400000 :: tokens = 50000 :: mean = 1.26 :: max = 16
    n_start = 2450000 :: tokens = 50000 :: mean = 1.29 :: max = 13
    

     

    “b_random = False” and “pos_type = 4” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.25 :: max = 6
    n_start = 50000   :: tokens = 50000 :: mean = 1.27 :: max = 9
    n_start = 100000  :: tokens = 50000 :: mean = 1.43 :: max = 19
    n_start = 150000  :: tokens = 50000 :: mean = 1.22 :: max = 19
    n_start = 200000  :: tokens = 50000 :: mean = 1.33 :: max = 12
    n_start = 250000  :: tokens = 50000 :: mean = 1.22 :: max = 7
    n_start = 300000  :: tokens = 50000 :: mean = 1.17 :: max = 7
    n_start = 350000  :: tokens = 50000 :: mean = 1.17 :: max = 8
    n_start = 400000  :: tokens = 50000 :: mean = 1.21 :: max = 8
    n_start = 450000  :: tokens = 50000 :: mean = 1.32 :: max = 12
    n_start = 500000  :: tokens = 50000 :: mean = 1.36 :: max = 14
    n_start = 550000  :: tokens = 50000 :: mean = 1.22 :: max = 8
    n_start = 600000  :: tokens = 50000 :: mean = 1.18 :: max = 6
    n_start = 650000  :: tokens = 50000 :: mean = 1.23 :: max = 8
    n_start = 700000  :: tokens = 50000 :: mean = 1.21 :: max = 14
    n_start = 750000  :: tokens = 50000 :: mean = 1.29 :: max = 14
    n_start = 800000  :: tokens = 50000 :: mean = 1.31 :: max = 13
    n_start = 850000  :: tokens = 50000 :: mean = 1.19 :: max = 13
    n_start = 900000  :: tokens = 50000 :: mean = 1.17 :: max = 7
    n_start = 950000  :: tokens = 50000 :: mean = 1.26 :: max = 8
    n_start = 1000000 :: tokens = 50000 :: mean = 1.24 :: max = 11
    n_start = 1050000 :: tokens = 50000 :: mean = 1.18 :: max = 9
    n_start = 1100000 :: tokens = 50000 :: mean = 1.19 :: max = 7
    n_start = 1150000 :: tokens = 50000 :: mean = 1.27 :: max = 10
    n_start = 1200000 :: tokens = 50000 :: mean = 1.20 :: max = 7
    n_start = 1250000 :: tokens = 50000 :: mean = 1.18 :: max = 13
    n_start = 1300000 :: tokens = 50000 :: mean = 1.19 :: max = 9
    n_start = 1350000 :: tokens = 50000 :: mean = 1.20 :: max = 9
    n_start = 1400000 :: tokens = 50000 :: mean = 1.20 :: max = 8
    n_start = 1450000 :: tokens = 50000 :: mean = 1.20 :: max = 9
    n_start = 1500000 :: tokens = 50000 :: mean = 1.19 :: max = 14
    n_start = 1550000 :: tokens = 50000 :: mean = 1.20 :: max = 11
    n_start = 1600000 :: tokens = 50000 :: mean = 1.29 :: max = 11
    n_start = 1650000 :: tokens = 50000 :: mean = 1.19 :: max = 6
    n_start = 1700000 :: tokens = 50000 :: mean = 1.18 :: max = 8
    n_start = 1750000 :: tokens = 50000 :: mean = 1.21 :: max = 22
    n_start = 1800000 :: tokens = 50000 :: mean = 1.42 :: max = 33
    n_start = 1850000 :: tokens = 50000 :: mean = 1.32 :: max = 33
    n_start = 1900000 :: tokens = 50000 :: mean = 1.23 :: max = 15
    n_start = 1950000 :: tokens = 50000 :: mean = 1.25 :: max = 9
    n_start = 2000000 :: tokens = 50000 :: mean = 1.27 :: max = 10
    n_start = 2050000 :: tokens = 50000 :: mean = 1.17 :: max = 10
    n_start = 2100000 :: tokens = 50000 :: mean = 1.19 :: max = 9
    n_start = 2150000 :: tokens = 50000 :: mean = 1.40 :: max = 16
    n_start = 2200000 :: tokens = 50000 :: mean = 1.82 :: max = 52
    n_start = 2250000 :: tokens = 50000 :: mean = 1.94 :: max = 52
    n_start = 2300000 :: tokens = 50000 :: mean = 1.21 :: max = 9
    n_start = 2350000 :: tokens = 50000 :: mean = 1.20 :: max = 7
    n_start = 2400000 :: tokens = 50000 :: mean = 1.24 :: max = 7
    n_start = 2450000 :: tokens = 50000 :: mean = 1.31 :: max = 16
    

     

    “b_random = False” and “pos_type = 5” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.73 :: max = 49
    n_start = 50000   :: tokens = 50000 :: mean = 1.59 :: max = 49
    n_start = 100000  :: tokens = 50000 :: mean = 1.91 :: max = 49
    n_start = 150000  :: tokens = 50000 :: mean = 1.99 :: max = 49
    n_start = 200000  :: tokens = 50000 :: mean = 1.46 :: max = 44
    n_start = 250000  :: tokens = 50000 :: mean = 1.74 :: max = 49
    n_start = 300000  :: tokens = 50000 :: mean = 1.94 :: max = 49
    n_start = 350000  :: tokens = 50000 :: mean = 2.00 :: max = 49
    n_start = 400000  :: tokens = 50000 :: mean = 1.47 :: max = 49
    n_start = 450000  :: tokens = 50000 :: mean = 2.04 :: max = 49
    n_start = 500000  :: tokens = 50000 :: mean = 1.80 :: max = 49
    n_start = 550000  :: tokens = 50000 :: mean = 1.76 :: max = 49
    n_start = 600000  :: tokens = 50000 :: mean = 1.83 :: max = 44
    n_start = 650000  :: tokens = 50000 :: mean = 1.43 :: max = 44
    n_start = 700000  :: tokens = 50000 :: mean = 1.77 :: max = 49
    n_start = 750000  :: tokens = 50000 :: mean = 1.43 :: max = 49
    n_start = 800000  :: tokens = 50000 :: mean = 1.50 :: max = 32
    n_start = 850000  :: tokens = 50000 :: mean = 1.71 :: max = 44
    n_start = 900000  :: tokens = 50000 :: mean = 1.68 :: max = 40
    n_start = 950000  :: tokens = 50000 :: mean = 1.74 :: max = 49
    n_start = 1000000 :: tokens = 50000 :: mean = 1.98 :: max = 49
    n_start = 1050000 :: tokens = 50000 :: mean = 1.73 :: max = 40
    n_start = 1100000 :: tokens = 50000 :: mean = 1.71 :: max = 30
    n_start = 1150000 :: tokens = 50000 :: mean = 1.32 :: max = 30
    n_start = 1200000 :: tokens = 50000 :: mean = 1.49 :: max = 49
    n_start = 1250000 :: tokens = 50000 :: mean = 1.93 :: max = 40
    n_start = 1300000 :: tokens = 50000 :: mean = 1.94 :: max = 49
    n_start = 1350000 :: tokens = 50000 :: mean = 1.67 :: max = 44
    n_start = 1400000 :: tokens = 50000 :: mean = 1.61 :: max = 37
    n_start = 1450000 :: tokens = 50000 :: mean = 1.86 :: max = 49
    n_start = 1500000 :: tokens = 50000 :: mean = 2.04 :: max = 49
    n_start = 1550000 :: tokens = 50000 :: mean = 1.60 :: max = 49
    n_start = 1600000 :: tokens = 50000 :: mean = 1.38 :: max = 34
    n_start = 1650000 :: tokens = 50000 :: mean = 1.77 :: max = 49
    n_start = 1700000 :: tokens = 50000 :: mean = 1.77 :: max = 44
    n_start = 1750000 :: tokens = 50000 :: mean = 1.79 :: max = 49
    n_start = 1800000 :: tokens = 50000 :: mean = 1.08 :: max = 16
    n_start = 1850000 :: tokens = 50000 :: mean = 1.46 :: max = 49
    n_start = 1900000 :: tokens = 50000 :: mean = 1.51 :: max = 49
    n_start = 1950000 :: tokens = 50000 :: mean = 1.31 :: max = 24
    n_start = 2000000 :: tokens = 50000 :: mean = 1.24 :: max = 29
    n_start = 2050000 :: tokens = 50000 :: mean = 1.85 :: max = 49
    n_start = 2100000 :: tokens = 50000 :: mean = 1.96 :: max = 49
    n_start = 2150000 :: tokens = 50000 :: mean = 1.66 :: max = 49
    n_start = 2200000 :: tokens = 50000 :: mean = 1.45 :: max = 40
    n_start = 2250000 :: tokens = 50000 :: mean = 1.51 :: max = 49
    n_start = 2300000 :: tokens = 50000 :: mean = 2.07 :: max = 49
    n_start = 2350000 :: tokens = 50000 :: mean = 2.01 :: max = 34
    n_start = 2400000 :: tokens = 50000 :: mean = 1.94 :: max = 34
    n_start = 2450000 :: tokens = 50000 :: mean = 1.85 :: max = 49
    

     

    pos_type = 5 shows on average larger maximum values; this is consistent with the relatively high average values for the hit list length.

    “b_random = False” and “pos_type = 6” and num_w = 50,000

    n_start = 0       :: tokens = 50000 :: mean = 1.38 :: max = 9
    n_start = 50000   :: tokens = 50000 :: mean = 1.44 :: max = 22
    n_start = 100000  :: tokens = 50000 :: mean = 1.58 :: max = 14
    n_start = 150000  :: tokens = 50000 :: mean = 1.41 :: max = 20
    n_start = 200000  :: tokens = 50000 :: mean = 1.51 :: max = 16
    n_start = 250000  :: tokens = 50000 :: mean = 1.43 :: max = 17
    n_start = 300000  :: tokens = 50000 :: mean = 1.41 :: max = 20
    n_start = 350000  :: tokens = 50000 :: mean = 1.34 :: max = 17
    n_start = 400000  :: tokens = 50000 :: mean = 1.47 :: max = 21
    n_start = 450000  :: tokens = 50000 :: mean = 1.56 :: max = 18
    n_start = 500000  :: tokens = 50000 :: mean = 1.54 :: max = 21
    n_start = 550000  :: tokens = 50000 :: mean = 1.40 :: max = 22
    n_start = 600000  :: tokens = 50000 :: mean = 1.41 :: max = 22
    n_start = 650000  :: tokens = 50000 :: mean = 1.47 :: max = 21
    n_start = 700000  :: tokens = 50000 :: mean = 1.47 :: max = 19
    n_start = 750000  :: tokens = 50000 :: mean = 1.51 :: max = 21
    n_start = 800000  :: tokens = 50000 :: mean = 1.51 :: max = 17
    n_start = 850000  :: tokens = 50000 :: mean = 1.36 :: max = 15
    n_start = 900000  :: tokens = 50000 :: mean = 1.39 :: max = 27
    n_start = 950000  :: tokens = 50000 :: mean = 1.53 :: max = 22
    n_start = 1000000 :: tokens = 50000 :: mean = 1.45 :: max = 22
    n_start = 1050000 :: tokens = 50000 :: mean = 1.45 :: max = 16
    n_start = 1100000 :: tokens = 50000 :: mean = 1.49 :: max = 31
    n_start = 1150000 :: tokens = 50000 :: mean = 1.46 :: max = 31
    n_start = 1200000 :: tokens = 50000 :: mean = 1.55 :: max = 20
    n_start = 1250000 :: tokens = 50000 :: mean = 1.33 :: max = 14
    n_start = 1300000 :: tokens = 50000 :: mean = 1.44 :: max = 27
    n_start = 1350000 :: tokens = 50000 :: mean = 1.41 :: max = 16
    n_start = 1400000 :: tokens = 50000 :: mean = 1.43 :: max = 19
    n_start = 1450000 :: tokens = 50000 :: mean = 1.46 :: max = 20
    n_start = 1500000 :: tokens = 50000 :: mean = 1.32 :: max = 15
    n_start = 1550000 :: tokens = 50000 :: mean = 1.39 :: max = 18
    n_start = 1600000 :: tokens = 50000 :: mean = 1.52 :: max = 20
    n_start = 1650000 :: tokens = 50000 :: mean = 1.36 :: max = 17
    n_start = 1700000 :: tokens = 50000 :: mean = 1.41 :: max = 17
    n_start = 1750000 :: tokens = 50000 :: mean = 1.38 :: max = 19
    n_start = 1800000 :: tokens = 50000 :: mean = 1.80 :: max = 20
    n_start = 1850000 :: tokens = 50000 :: mean = 1.63 :: max = 25
    n_start = 1900000 :: tokens = 50000 :: mean = 1.52 :: max = 21
    n_start = 1950000 :: tokens = 50000 :: mean = 1.52 :: max = 22
    n_start = 2000000 :: tokens = 50000 :: mean = 1.53 :: max = 25
    n_start = 2050000 :: tokens = 50000 :: mean = 1.33 :: max = 14
    n_start = 2100000 :: tokens = 50000 :: mean = 1.41 :: max = 23
    n_start = 2150000 :: tokens = 50000 :: mean = 1.61 :: max = 19
    n_start = 2200000 :: tokens = 50000 :: mean = 2.03 :: max = 28
    n_start = 2250000 :: tokens = 50000 :: mean = 2.12 :: max = 28
    n_start = 2300000 :: tokens = 50000 :: mean = 1.47 :: max = 26
    n_start = 2350000 :: tokens = 50000 :: mean = 1.42 :: max = 21
    n_start = 2400000 :: tokens = 50000 :: mean = 1.50 :: max = 21
    n_start = 2450000 :: tokens = 50000 :: mean = 1.49 :: max = 22
    

     

    For pos_type = 0, typical examples of words with many hits are members of the following collection. You can see the common 3-char-grams at the beginning, in the middle and at the end of the words:

    verbindungsbauten, verbindungsfesten, verbindungskanten, verbindungskarten, verbindungskasten,
    verbindungsketten, verbindungsknoten, verbindungskosten, verbindungsleuten, verbindungslisten,
    verbindungsmasten, verbindungspisten, verbindungsrouten, verbindungsweiten, verbindungszeiten,
    verfassungsraeten, verfassungstexten, verfassungswerten, verfolgungslisten, verfolgungsnoeten, 
    verfolgungstexten, verfolgungszeiten, verführungsküsten, vergnügungsbauten, vergnügungsbooten,
    vergnügungsfesten, vergnügungsgarten, vergnügungsgärten, verguetungskosten, verletzungsnoeten, 
    vermehrungsbauten, vermehrungsbeeten, vermehrungsgarten, vermessungsbooten, vermessungskarten,
    vermessungsketten, vermessungskosten, vermessungslatten, vermessungsposten, vermessungsseiten,
    vermietungslisten, verordnungstexten, verpackungskisten, verpackungskosten, verpackungsresten,
    verpackungstexten, versorgungsbauten, versorgungsbooten, versorgungsgarten, versorgungsgärten,
    versorgungshütten, versorgungskarten, versorgungsketten, versorgungskisten, versorgungsknoten,
    versorgungskosten, versorgungslasten, versorgungslisten, versorgungsposten, versorgungsquoten,
    versorgungsrenten, versorgungsrouten, versorgungszeiten, verteilungseliten, verteilungskarten,
    verteilungskosten, verteilungslisten, verteilungsposten, verteilungswerten, vertretungskosten,
    vertretungswerten, vertretungszeiten, vertretungsärzten, verwaltungsbauten, verwaltungseliten,
    verwaltungskarten, verwaltungsketten, verwaltungsknoten, verwaltungskonten, verwaltungskosten, 
    verwaltungslasten, verwaltungsleuten, verwaltungsposten, verwaltungsraeten, verwaltungstexten,
    verwaltungsärzten, verwendungszeiten, verwertungseliten, verwertungsketten, verwertungskosten,
    verwertungsquoten
    

    For pos_type = 5 we get the following example words with many hits:

    almbereich, altbereich, armbereich, astbereich, barbereich, baubereich, 
    biobereich, bobbereich, boxbereich, busbereich, bußbereich, dombereich,
    eckbereich, eisbereich, endbereich, erdbereich, essbereich, fußbereich,
    gasbereich, genbereich, hofbereich, hubbereich, hutbereich, hörbereich,
    kurbereich, lötbereich, nahbereich, oelbereich, ohrbereich, ostbereich,
    radbereich, rotbereich, seebereich, sehbereich, skibereich, subbereich,
    südbereich, tatbereich, tonbereich, topbereich, torbereich, totbereich,
    türbereich, vorbereich, webbereich, wegbereich, zoobereich, zugbereich,
    ökobereich
    

    Intermediate conclusion for tokens longer than 9 letters

    From what we found above, something like “0 <= pos_type <= 4” and “pos_type = 7” are preferable choices for the positions of the 3-char-grams in longer words. But even if we have to vary the positions a bit more, we get reasonably short hit lists on average.

    It seems, however, that we must live with relatively long hit lists for some words (mostly compounds in certain regions of the vocabulary).
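    To keep an eye on such outliers it is useful to list the tokens responsible for the max-values. A small sketch – the token array and the hit-list lengths in “ay_inter_lilen” are toy values here; in reality they come from the selection runs:

    import numpy as np

    # Toy data standing in for one test segment: tokens and their hit-list lengths.
    tokens = np.array(["verbindungskosten", "verwaltungsposten", "bergwanderung", "blumenwiese"])
    ay_inter_lilen = np.array([91, 52, 2, 1])

    worst = np.argsort(ay_inter_lilen)[::-1][:10]   # indices of the 10 longest hit lists
    for i in worst:
        print(f"{tokens[i]:<20} :: hits = {ay_inter_lilen[i]}")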

    Test runs for words with a length ≤ 9 and two 3-char-grams

    The list of words with fewer than 10 characters comprises only around 185,869 entries. So, the required CPU time should become smaller.
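    Splitting the vocabulary by token length is trivial with Pandas; the following sketch uses a toy DataFrame with an assumed column “word” and is not the original code:

    import pandas as pd

    # Toy vocabulary; "word" is an assumed column name.
    df_voc = pd.DataFrame({"word": ["verbindungskosten", "verwaltungsposten", "garten", "haus"]})
    df_len = df_voc["word"].str.len()

    df_long  = df_voc[df_len >= 10]   # handled with three 3-char-grams
    df_short = df_voc[df_len <= 9]    # handled with two 3-char-grams only
    print(len(df_long), len(df_short))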

    Here are some result data for runs for words with a length ≤ 9 characters:

    “b_random = True” and “pos_type = 0”

         
    cpu :  42.69  :: tokens = 100000  :: mean = 2.07 :: max = 78.00
    

    “b_random = True” and “pos_type = 1”

    cpu :  43.76  :: tokens = 100000  :: mean = 1.84 :: max = 40.00
    

    “b_random = True” and “pos_type = 2”

    cpu :  43.18  :: tokens = 100000  :: mean = 1.76 :: max = 30.00
    

    “b_random = True” and “pos_type = 3”

    cpu :  43.91  :: tokens = 100000  :: mean = 2.66 :: max = 46.00
    

    “b_random = True” and “pos_type = 4”

    cpu :  43.64  :: tokens = 100000  :: mean = 2.09 :: max = 30.00
    

    “b_random = True” and “pos_type = 5”

    cpu :  44.00  :: tokens = 100000  :: mean = 9.38 :: max = 265.00
    

    “b_random = True” and “pos_type = 6”

    cpu :  43.59  :: tokens = 100000  :: mean = 5.71 :: max = 102.00
    

    “b_random = True” and “pos_type = 7”

    cpu :  43.50  :: tokens = 100000  :: mean = 2.07 :: max = 30.00
    

    You see that we should not shift the first or the last 3-char-gram too far towards the middle of the word. For short tokens such a shift can lead to a full overlap of the 3-char-grams – and this obviously reduces our chances of narrowing down the list of hits.
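    The amount of overlap is easy to quantify. The following small helper is added here for illustration only and is not part of the test code; it returns the number of character positions shared by two 3-char-grams starting at positions p1 and p2:

    # Number of character positions shared by two 3-char-grams starting at p1 and p2.
    def overlap(p1: int, p2: int, n: int = 3) -> int:
        return max(0, n - abs(p1 - p2))

    # 9-letter word: grams at positions 0 and 6 do not overlap; shifting them
    # towards the middle (e.g. to 2 and 4) already gives an overlap of 1 character,
    # and identical positions mean a full overlap of 3 characters.
    print(overlap(0, 6), overlap(2, 4), overlap(3, 3))   # -> 0 1 3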

    Conclusion

    In this post we continued our experiments on selecting words from a vocabulary which match some 3-char-grams at different positions of the token. We found the following:

    • The measured CPU-times for 100,000 tokens allow for multiple word searches with different positions of two or three 3-char-grams, even on a PC.
    • While we, on average, get hit lists with a length below 2 matching words, there are always a few compounds which lead to significantly larger hit lists with tens of words.
    • For tokens with a length of at most 9 characters, we can work with two 3-char-grams – but we should avoid too big an overlap of the char-grams.

    These results give us some hope that we can select a reasonably short list of vocabulary words which match parts of misspelled tokens – e.g. tokens with one or sometimes two wrongly written letters. Before we turn to the challenge of correcting such tokens in a new article series, we close the present series with yet another post on the effect of multiprocessing on our word selection processes.