TF-IDF – which formula to use in combination with the Keras Tokenizer?

When performing computer-based text analysis we sometimes need to shorten our texts by some criteria before we apply machine learning algorithms. One of the reasons could be that a classical vectorization process applied to the original texts would lead to matrices or tensors which are beyond our PC's memory capabilities.

The individual texts we deal with are mostly members of a text collection (i.e. a text corpus). Then one criterion for the reduction of the texts could be the significance of the words for each individual text in which they appear. We only keep significant words.

A measure of a word's significance is given by a quantity called "tf-idf", short for "term frequency - inverse document frequency" (see below). If you have "tf-idf"-values for all the words used in a specific text (of the collection), a simple method to shorten the text for further analysis is to use a "tf-idf"-threshold: We keep words which have a "tf-idf"-value above the defined threshold and omit the others.

"tf-idf"-values require a statistical analysis over a text ensemble. The basic statistical data are often collected during the application of a tokenizer to the text ensemble. And here things can become problematic, as some tokenizers provide "tf-idf"-data only during vectorization. Then the snake bites its own tail: We need tf-idf to shorten texts reasonably and to avoid memory problems during vectorization, but the tool set provides "tf-idf"-data only through vectorization.

A typical example is given by the Keras tokenizer. In such a situation one must invest some (limited) effort into a "manual" calculation of tf-idf values. But then you may find that your (text-book) formula for a "tf-idf"-calculation does not reproduce the values your tokenizer would have given you by a "tfidf"-vectorization of your texts. A reasonable formula for the tf-idf calculation with the help of the Keras tokenizer is the topic of this post. For convenience I sometimes omit the hyphen in tf-idf below.

Vectorization of texts in tfidf-mode and the problem of one-hot like encodings

Most frameworks for text analysis or NLP provide a tokenizer, of course. Often the tokenizer object does not only identify individual tokens in a text; it is also capable of vectorizing texts. Vectorization represents a text by an (ordered) series of integer or float numbers which refer in a unique way to the words of a vocabulary extracted from the text collection. The indexed position in the vector refers to a specific word in the vocabulary of the text ensemble, whereas the value given at this position describes the word's (statistical) appearance in the text in some way.

A typical and basic vectorization approach is a "one-hot"-encoding, resulting in a "bag-of-words"-model: A word appearing in a text is marked by a "one" in an indexed vector whose positions refer to the words of the text collection in an (ordered) fashion.
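As a minimal sketch of this idea, the following lines build such a vector by hand for a toy vocabulary. The vocabulary, the helper name "to_binary_vector" and the example words are made up for illustration; index 0 is left unused because Keras word indices start at 1.

```python
# Toy vocabulary (word => index); hypothetical data, index 0 stays unused
vocab = {"cat": 1, "dog": 2, "sat": 3, "mat": 4}

def to_binary_vector(words, vocab):
    """Mark every vocabulary word that appears in the text with a 1."""
    vec = [0] * (len(vocab) + 1)
    for w in words:
        idx = vocab.get(w)
        if idx is not None:      # words outside the vocabulary are ignored
            vec[idx] = 1
    return vec

vec = to_binary_vector(["cat", "sat", "cat"], vocab)
print(vec)
```

Note that the vector length is fixed by the vocabulary size, not by the length of the text, and that the repeated word "cat" still yields just a single "one".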

But vectorization can be provided in different modes, too: The "ones" (1) in a simple "one-hot-encoded" vector can e.g. be replaced by the tf-idf values of the words (tfidf-mode). So, by using the respective tokenizer functions you may get the desired "tf-idf"-values for reducing the texts during a vectorization run. The tf-idf data describe the statistical overabundance of a word in a specific text by some formula measuring the word's appearance in a specific text and over all texts in a weighted and normalized way.

However, all one-hot like encodings of texts come with a major disadvantage:

The length of the word vectors depends on the number of words the tokenizer has identified over all texts in a collection for the vocabulary.

If you have extracted 2 million words out of hundreds of thousands of texts you may run into major trouble with the RAM of your PC (and with CPU-time). There are cases where you cannot or do not want to restrict the number of vocabulary words taken into account for analysis purposes.
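A rough back-of-the-envelope estimate shows the scale of the problem. The sketch below assumes a dense matrix with one row per text, one column per vocabulary word and 8 bytes per value (a 64-bit float); these are assumptions for illustration, as are the concrete numbers and the helper name "dense_matrix_gib".

```python
def dense_matrix_gib(n_texts, vocab_size, bytes_per_value=8):
    """Memory in GiB for a dense (n_texts x vocab_size) matrix."""
    return n_texts * vocab_size * bytes_per_value / 2**30

# Illustrative numbers: 100,000 texts and a vocabulary of 2 million words
full = dense_matrix_gib(100_000, 2_000_000)
print(f"Dense matrix for all texts: about {full:,.0f} GiB")
```

Under these assumptions the full matrix would need well over a terabyte, which is why vectorization can only be done for a limited number of texts at a time.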

Most tokenizers allow for a (manual) sequential approach for a limited number of texts to overcome memory problems under such circumstances. But often enough you may instead want to calculate "tf-idf"-values on your own - just to save time. And here we may talk about a difference of hours!

I recently had this problem with 200,000 texts, the pretty fast Keras tokenizer and a vocabulary of 1.7 million words (of which I wanted to use at least a million entries). The Keras tokenizer itself offers almost all relevant data for a calculation of the tf-idf-values after it has been applied to a list of texts. In my case tokenizing the 200,000 texts and building the vocabulary took only 25 secs of CPU-time. A manual and sequential approach to create all tf-idf values via vectorization required about an hour.

TF-IDF formulas: The "idf"-term

During my own "tf-idf"-calculation, based on some Python code for a tfidf-formula and basic tokenizer-data, I of course wanted to reproduce the values the Keras tokenizer gave me during my previous vectorization approach. Achieving this goal was a bit more difficult than expected. Just using a reasonable "tf-idf"-formula taken from some NLP text-book failed. The reason was that "tf-idf"-data can be and indeed are calculated in different ways. The Keras tokenizer does it differently than SciKit, actually for both the tf- and the idf-part. There is a basic structure behind a normalized tfidf-value; however, there are differences in the details. Let's look at both points.

Everybody who has once in his/her life programmed a search engine knows that the significance of a word for a specific text (of an ensemble) depends on the number of occurrences of the word inside the specific text, but also on the occurrence of the very same word in all the other texts of a given text collection:

If a word appears too often in (other) texts of a text ensemble then it is not very significant for the specific text we are looking at.

Examples are typical "stop-words" - like "this" or "that" or "and". Such words appear in very many texts.

Thus we expect that a measure of the statistical overabundance of a word in a specific text (of a collection of texts) combines the word's abundance in the chosen text with a measure of its occurrence across multiple texts. The "tf-idf" quantity follows this recipe: It is a combination of the so-called "term frequency" [tf(t)] with the "inverse document frequency" [idf(t)], with "t" representing a specific word or term:

tfidf(t)   =   tf(t)   *   idf(t)

While the term frequency measures the occurrence of a word within a selected text, the "idf" factor measures the occurrence of a word in different texts of the collection. To get some weighting and normalization into this formula, the "idf"-term is typically based on the natural logarithm of the fraction

  • of the number of texts NT in a collection (numerator)
  • and the number of texts ND(t) in which a specific word or term appears (denominator)

A tf-idf therefore is always characteristic of a word or term and the specific text we look at. (This is one reason, why it actually can be used in text vectorization).

But, the "idf"-term is calculated in various manners in different text-books on text-analysis. Some variants avoid the idf-term becoming negative or avoid a division by zero; typical examples are:

  1. idf(t) = log( NT / (ND + 1) )

  2. idf(t) = log( (1 + NT) / (ND + 1) )

  3. idf(t) = log( 1 + NT / (ND + 1) )

  4. idf(t) = log( 1 + NT / ND )

  5. idf(t) = log( (1 + NT) / (ND + 1) ) + 1

Note: log() represents the natural logarithm above.
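To get a feeling for how much the five variants differ, they can be evaluated side by side. The following sketch is only an illustration; the function name "idf_variants" and the sample values for NT and ND are my own assumptions, and ND is assumed to be at least 1 (variant 4 would otherwise divide by zero).

```python
import math

def idf_variants(nt, nd):
    """Return the five idf variants listed above for NT texts and ND(t) = nd."""
    return {
        1: math.log(nt / (nd + 1)),
        2: math.log((1 + nt) / (nd + 1)),
        3: math.log(1 + nt / (nd + 1)),
        4: math.log(1 + nt / nd),
        5: math.log((1 + nt) / (nd + 1)) + 1,
    }

nt = 1000                        # illustrative number of texts
for nd in (1, 100, 1000):        # rare word, common word, word in every text
    values = idf_variants(nt, nd)
    row = "  ".join(f"[{k}] {v:7.4f}" for k, v in values.items())
    print(f"ND = {nd:4d}:  {row}")
```

For a word that appears in every text (ND = NT) the first variant becomes slightly negative and the second becomes zero, whereas variants 3 to 5 stay positive. Variant 5 is always exactly variant 2 plus one.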

I have e.g. taken the second variant from a book of S. Raschka (see below) on "Python Machine Learning" (2016, Packt Publishing). The last one in the list above is used in SciKit according to https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html

This is consistent with Raschka's version insofar as he defines the SciKit "tf-idf" as:

tfidf(t) = tf(t) * [ idf(t) + 1 ]

The third variant is the one you find in the source code of the Keras tokenizer, despite the reference there to a point in a Wikipedia article which reflects the fourth form (!).

Source code excerpt of the Keras Tokenizer:

.....
.....
elif mode == 'tfidf':
    # Use weighting scheme 2 in
    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
    tf = 1 + np.log(c)
    idf = np.log(1 + self.document_count /
                 (1 + self.index_docs.get(j, 0)))
    x[i][j] = tf * idf
.....
.....

What we learn from this is that there are multiple variants of the "idf"-term out there. So, if you want to reproduce tfidf-numbers, you had better look into the code of your framework's objects or functions, if possible.

Variants of the "term frequency"? Yes, they do exist!

While I was already aware of different idf-variants, I did not know at all that there are even differences regarding the term-frequency "tf(t)". Normally, one would think that it is just the number describing how often a certain word or term appears in a specific text.

Let us, for example, assume that we have turned a specific text via a tokenizer function into a "sequence" of numbers. An entry in this sequence refers to a unique number assigned to a word of a somehow sorted vocabulary. A tokenizer vocabulary is often represented by a Python dictionary where the key is the word itself (or a hash of it) and the value corresponds to a unique number for the word. In my applications I always create a supplementary dictionary, which I call "switched_vocab", with keys and values switched (number => word). A sequence then is typically represented by a Python list of numbers "li_seq": the position in the list corresponds to the word's position in the text (marked by separators), the number given corresponds to the word's unique number in the vocabulary.

Then, with Python 3, a straight-forward method to get simple tf-values (as the sum of the number's occurrences in the sequence) would be

from collections import Counter

ind_w = li_seq[i]    # with "i" selecting a specific point or word in the sequence 
d_count  = Counter(li_seq)    # word number => number of occurrences in this text
tf = d_count[ind_w]

This code snippet creates a dictionary "d_count" which maps each unique word number appearing in the original sequence to the number of its occurrences in the text's sequence - i.e. in the text we are looking at.
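A toy run may make this more concrete. The sequence below is made up for illustration and stands for a short text of six words, three of which are the same.

```python
from collections import Counter

li_seq = [4, 7, 4, 12, 4, 7]     # hypothetical integer sequence of one text
d_count = Counter(li_seq)        # word number => occurrences in this text

i = 0                            # look at the first word of the text
ind_w = li_seq[i]
tf = d_count[ind_w]
print(ind_w, tf)
```

Here the word with number 4 appears three times, so its simple term frequency is 3, independent of the position "i" at which we picked it.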

Does the Keras tokenizer calculate and use tf in this manner when vectorizing texts in tfidf-mode? No, it does not! And this was a major factor for differences in tfidf-values I naively produced for my texts.

With the terms above the Keras tokenizer instead uses a logarithmic value for tf:

ind_w = li_seq[i] # i selecting a specific point or word in the sequence 
d_count  = Counter(li_seq)
tf = 1 + np.log( d_count[ind_w] )    # same form as in the Keras source: tf = 1 + np.log(c)

This in the end makes a significant difference in the derived "tf-idf" values in comparison to the naive approach - even if you had gotten the "idf"-term right!
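The size of the effect can be illustrated with a few numbers. The sketch below uses the tf- and idf-expressions from the Keras source excerpt shown earlier; the function names and the sample numbers (200,000 texts, a word found in 500 of them) are assumptions chosen for illustration only.

```python
import math

def keras_style_tfidf(count, n_texts, doc_freq):
    """tf = 1 + log(count), idf = log(1 + n_texts / (1 + doc_freq))."""
    tf = 1.0 + math.log(count)
    idf = math.log(1.0 + n_texts / (1.0 + doc_freq))
    return tf * idf

def naive_tfidf(count, n_texts, doc_freq):
    """Same idf-term, but the raw count is used as tf."""
    idf = math.log(1.0 + n_texts / (1.0 + doc_freq))
    return count * idf

for count in (1, 5, 50):
    k = keras_style_tfidf(count, 200_000, 500)
    n = naive_tfidf(count, 200_000, 500)
    print(f"count = {count:2d}:  keras-style = {k:8.3f}   naive = {n:8.3f}")
```

For a word that appears only once in a text both formulas agree, because log(1) = 0. The more often a word is repeated within a text, the further the naive value runs away from the logarithmically damped one.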

Quick and dirty Python code to calculate tfidf values manually for a list of texts with the Keras tokenizer

For reasons of completeness, I outline some code fragments below, which may help readers to calculate "tf-idf"-values, which are consistent with those produced during "sequences to matrix"-vectorization calculations with the Keras tokenizer. I assume that you already have a working Keras implementation using either CPU or GPU.

I further assume that you have gathered a collection of texts (cleansed by some Regex operations) in a column "txt" of a dataframe "df_rex". We first extract the texts into a list and apply the Keras tokenizer:

import numpy as np
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 1800000    # or whatever number of words you want to be taken into account from the vocabulary  

li_txts = df_rex['txt'].to_list()
tokenizer = Tokenizer(num_words=num_words, lower=True) # converts tokens to lower-case 
tokenizer.fit_on_texts(li_txts)    

vocab   = tokenizer.word_index
w_count = tokenizer.word_counts
w_docs  = tokenizer.word_docs
num_tot_vocab_words = len(vocab) 
    
# Switch vocab - key <> value (number => word)
# ****************************
switched_vocab = {value: key for key, value in vocab.items()}

Tokenizing should be a matter of seconds or a few tens of seconds, depending on the number and the length of the texts. In my case with 200,000 texts, each with 2000 words on average, it took 25 secs and produced a vocabulary of about 1.8 million words.

In a next step we create "integer sequences" from all texts:

li_seq_full  = tokenizer.texts_to_sequences(li_txts)
leng_li_seq_full = len(li_seq_full)

Now, we are able to create a super-list of lists - including a list of tf-idf-values per text:

li_all_txts = []

j_end = leng_li_seq_full
for j in range(0, j_end):
    li_text = []
    li_text.append(j)

    leng_seq = len(li_seq_full[j])
    li_seq     = []
    li_tfidf   = []
    li_words   = []

    # number of occurrences of each word number within this text
    d_count  = Counter(li_seq_full[j])
    for i in range(0,leng_seq):
        ind_w    = li_seq_full[j][i] 
        word     = switched_vocab[ind_w]
        
        # calculation of tf-idf
        # ~~~~~~~~~~~~~~~~~~~~~
        # https://github.com/keras-team/keras-preprocessing/blob/1.1.2/keras_preprocessing/text.py#L372-L383
        # Use weighting scheme 2 in https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        dfreq    = w_docs[word] # document frequency 
        idf      = np.log( 1.0 + (leng_li_seq_full)  / (dfreq + 1.0) )
        tf_basic = d_count[ind_w]
        tf       = 1.0 + np.log(tf_basic)
        tfidf    = tf * idf 
                
        li_seq.append(ind_w) 
        li_tfidf.append(tfidf) 
        li_words.append(word) 

    li_text.append(li_seq)
    li_text.append(li_tfidf)
    li_text.append(li_words)

    li_all_txts.append(li_text)

leng_li_all_txts = len(li_all_txts)

This last run took around 4 minutes in my case. Getting the same numbers with a sequential approach, i.e. by calculating Keras vectorization matrices in tf-idf mode for around 6000 texts at a time with in-between memory cleansing, took me around an hour with continuous manual system interactions.
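The inner loop above recomputes the same tf-idf value for every repetition of a word within a text. A possible variation, sketched here with plain Python and toy data, calculates the value only once per distinct word number and then maps it back onto the sequence. The function name "tfidf_for_sequence" and the toy dictionaries are hypothetical; with real data you would pass in "switched_vocab", "w_docs" and the number of texts from the tokenizer run.

```python
import math
from collections import Counter

def tfidf_for_sequence(seq, index_to_word, word_docs, n_texts):
    """Return one tf-idf value per position of the integer sequence 'seq'."""
    counts = Counter(seq)                      # word number => occurrences
    tfidf_by_index = {}
    for ind_w, c in counts.items():            # once per distinct word only
        dfreq = word_docs.get(index_to_word[ind_w], 0)
        idf = math.log(1.0 + n_texts / (1.0 + dfreq))
        tfidf_by_index[ind_w] = (1.0 + math.log(c)) * idf
    return [tfidf_by_index[ind_w] for ind_w in seq]

# Toy data standing in for switched_vocab and w_docs (made up)
index_to_word = {1: "the", 2: "reactor", 3: "core"}
word_docs = {"the": 3, "reactor": 1, "core": 1}

values = tfidf_for_sequence([1, 2, 1, 3], index_to_word, word_docs, n_texts=3)
print([round(v, 4) for v in values])
```

The formulas are the same as in the loop above, so the results should be identical; whether the saving is noticeable depends on how often words are repeated within your texts.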

Conclusion

In this article I have demonstrated that "tf-idf"-values can be calculated almost directly from the output of a tokenizer like the Keras Tokenizer. Such a "manual" calculation is preferable to a vectorization run in "tf-idf"-mode when the number of texts and the vocabulary of your texts are big or huge. "tf-idf"-word-vectors may easily reach a length of more than a million words with a reasonably complex text ensemble. This poses memory problems on many PC-based systems.

With directly calculated tf-idf-values you get a measure for the significance of words in a text. Therefore, the "tf-idf"-values may help you to shorten texts reasonably before you vectorize your texts, i.e. ahead of applying advanced ML-algorithms.
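The shortening step itself then is a simple filter. The following sketch assumes that you have, for one text, a list of words and a parallel list of tf-idf values; the helper name "shorten_text", the threshold and the numbers are made up for illustration.

```python
def shorten_text(words, tfidf_values, threshold):
    """Keep only the words whose tf-idf value lies above the threshold."""
    return [w for w, v in zip(words, tfidf_values) if v > threshold]

words = ["the", "reactor", "was", "shut", "down"]
tfidf_values = [0.1, 7.3, 0.2, 3.9, 1.1]    # made-up values
shortened = shorten_text(words, tfidf_values, threshold=1.0)
print(shortened)
```

A suitable threshold depends on your text collection; looking at the distribution of the tf-idf values over some sample texts is probably a reasonable starting point.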