Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions

After two stressful months at my job I used part of my Christmas holidays to play a bit around with Blender. For me Blender has always been a fascinating tool to perform optical experiments with mirrors, lenses, light emitting gases, etc. It's real fun ... like in a virtual lab. See e.g. the image of two half-transparent cubes intersecting each other in an asymmetric way; the cubes were filled with volumetric gas emitting red and green light:

The idea for the experiment illustrated above arose in a discussion with a German artist (Michael Grossmann) about different kinds of color mixtures. The human eye and the neural networks behind it interpret a dense mixture of green and red light rays as yellow. This is true only for active light emitters, but not for passive reflective particles as used in painting. Think of pixels emitting light on a TV-screen: There neighboring red and green pixels create an impression of yellow.

In this present article series, however, I want to describe two experiments for highly reflective mirroring surfaces. I got the ideas from two real art installations. All credit must be given to the artists behind the original art objects:

One object is located in a sculpture park in Norway called "Kistefoss museum". See https://www.kistefosmuseum.com/sculpture/the-sculpture-park. I warn you - a visit to this park is really expensive in my opinion (18 Euros per person + 8 Euros or more for parking. OK, the fact that enjoying modern art is a kind of luxury had to be expected in the richest country outside the EU. And, of course, Kistefoss is run by a private investor .... See https://www.kistefos.no/. Some things obviously never change during the history of capitalism. Presently the impression of art in original "nature" of an originally beautiful river valley is spoiled by a huge construction site for a 4 or 6 track autobahn bridge. Well, well - so much about the relation between art and capitalism.

Nevertheless - there are some really nice installations at Kistefoss to look at.

The object I refer to is named “S-Curve” and was made by the well known contemporary artist Anish Kapoor. See "https://www.kistefosmuseum.com/sculptur/s-curve" or google for "s-curve anish kapoor" and look at the images. The installation consists of a bent elongated rectangular metallic surface looking like a twisted curved band. Curved in two directions: The curvature on the left side of the s-curved band is concave, on the right side it is convex. I found this idea of combining reflective concave and convex mirror surfaces breathtakingly simple and impressive at the same time. The whole installation makes you think about the reality behind visual images triggered by some conceptional network in your brain. It reminded me of the old Platonic idea that our relation to the real world must be compared to a man sitting in a cave where he only sees shadows of a real world on a wall. Now, imagine a world where our cave walls where made of curved mirrors - how could we get a clue about the reality behind the strange reflections? Well, such questions seemingly trigger something inside physicists ...

The second public installation is located in the Norwegian city of Drammen at a river bank. There you are confronted with highly reflective outer surfaces of two spheres on each side of the river. However, these spheres also have a kind of spherical indentation on their outer sides. Sometimes when the sun is at the right position you can see strange ring like reflections of light in these indentations - with rater sharp edges. Similar disturbing effects can be seen when you put some intensive LED lamp outside and inside the sphere. Makes you wonder what kind of images an open half-sphere would create.

The question what a camera placed at different positions in front or inside a reflective open half-sphere is for many reasons interesting if you start thinking about it with some recovered school knowledge about optics. Three major points are:

  • Things may appear to be located in front of the mirror surface.
  • There is inversion. Reflected things appear upside down and left-right mirrored.
  • In addition multiple reflections have to be taken into account in a half-sphere. Which is by the way the reason for multiple ring like reflections of distant bright light sources.

There is a phantastic video on Youtube "What Does It Look Like INSIDE a Spherical Mirror?" (https://www.youtube.com/watch?v=Y8c7TZx8HeY) which gives you a live impression of the strange things a concave mirror surface can do with light rays. Well, not everybody has the means to make or get a perfect mirroring half sphere. Or get some huge metal plates and deform them as a S-shaped cylindrical band. For us normal people Blender will have to do a job with virtual objects.

To raise your appetite I first want to present two preliminary rendered images from Blender. One just shows the reflections of some objects (cylinders, spheres, cones) placed in front of two plain cylindrical surfaces attached to each other. This is a first simplified approximation to the S-curve. But it already reveals some of the properties which we can expect to find on a twisted continuous metal structure like the S-curve of Mr. Kapoor.

The other image shows the reflections of some relatively small objects (again a sphere, a cone and a cylinder) positioned deep inside a concave half-sphere. This second picture indicates the complexity which multiple reflections within an open reflective half-sphere can create. We shall later enhance the artificial scenes displayed below by additional mirrors (flat, cylindrical and spherical) behind the camera. Another goal is a movie with the camera slowly moving in and out of an open half-sphere.

It took me a while to create the pictures below as I needed to adapt to some changes in the Blender tool set and to master some specific tasks of modeling. For the present project I use version 2.82 of Blender which came together with Opensuse Leap 15.3 on my laptop. The last time I worked with blender I had a version around 2.0. Especially the problem how to construct perfect reflecting surfaces of cones, cylinders and spheres required some investigation. Also we need to choose the "cycles" rendering engine to get satisfactory results. Note also that the "metallicity" parameter used these days was set to 1 in the pictures below. This gives you an unrealistic loss free reflection. I shall discuss these points in detail in forthcoming articles.

Two cylindrical surfaces with some objects in front of them

The following image from Blender's viewport interface clearly reveals the shape and form of the cylindrical surfaces. Also the objects creating the reflections are shown. The reader will also find some point like light sources spread around in the scene.

And with a simple background the rendered result of ray tracing looks like:

What we can see here on the left side is that concave mirror surfaces can create some illusionary, rather deceptive images with shapes very different from the original object from which the light rays are emitted (by a first reflection of light from the surroundings).

The half sphere with some objects in it

The first two pictures below show the basic spatial setup of the next scene with blender.

The rendered result looks like:

You see the distinct ring-like reflection zones in the outer parts? This is the result of multiple reflections - we can count at least 9 reflections before the reflection zones become indistinguishable. The image also displays the rich complexity reflections by the inner zones of the half-sphere and multiple reflections between the objects themselves can create for a viewer outside the sphere. In forthcoming experiments we shall create pictures also for positions of the camera inside the sphere.

Stay tuned. And a happy new year 2022 to everybody!

Pandas – Extending a vocabulary or simple dataframe relatively fast

During some work for a ML project on a large text corpus I needed to extend a personally used reference vocabulary by some complex ad unusual German compounds and very branch specific technical terms. I kept my vocabulary data in a Pandas dataframe. Each "word" there had some additional information associated with it in some extra columns of the dataframe - as e.g. the length of a word or a stem or a list of constituting tri-char-grams. I was looking for a fast method to extend the dataframe in a quick procedure with a list of hundreds or thousands of new words.

I tried the df.append() method first and got disappointed with its rather bad performance. I also experimented with the incorporation of some lists or dictionaries. In the end a procedure based on csv-data was the by far most convenient and fastest approach. I list up the basic steps below.

In my case I used the lower case character version of the vocabulary words as an index of the dataframe. This is a very natural step. It requires some small intermediate column copies in the step sequence below, which may not be necessary for other use-cases. For the sake of completeness the following list contains many steps which have to be performed only once and which later on are superfluous for a routine workflow.

  1. Step1: Collect your extension data, i.e. a huge bunch of words, in a Libreoffice Calc-file in ods-format or (if you absolutely must) in an MS Excel-file. One of the columns of your datasheet should contain data which you later want to use as a (unique) index of your dataframe - in my case a column "lower" (containing the low letter representation of a word).
  2. Step 2: Avoid any operations for creating additional column information which you later can create by Python functions working on information already contained in some dataframe columns. Fill in dummy values into respective columns. (Or control the filling of a dataframe with special data during the data import below)
  3. Step 3: Create a CSV-File containing the collected extension data with all required field information in columns which correspond to respective columns of the dataframe to be extended.
  4. Step 4:Create a backup copy of your original dataframe which you want to extend. Just as a precaution ....
  5. Step 5: Copy the contents of the index of your existing dataframe to a specific dataframe column consistent with step 1. In my case I copied the words' lower case version into a new data column "lower".
  6. Step 6: Delete the existing index of the original dataframe and create a new basic integer based index.
  7. Step 7: Import the CSV-file into a new and separate intermediate Pandas dataframe with the help of the method pd.read_csv(). Map the data columns and the data formats properly by supplying respective (list-like) information to the parameter list of read_csv(). Control the filling of possibly empty row-fields. Check for fields containing "null" as string and handle these by the parameter "na_filter" if possible (in my case by "na_filter=False")
  8. Step 8: Work on the freshly created dataframe and create required information in special columns by applying row-specific Python operations with a function and the df.apply()-method. For the sake of performance: Watch out for naturally vectorizable operations whilst doing so and separate them from other operations, if possible.
  9. Step 9: Check for completeness of all information in your intermediate dataframe. verify that the column structure matches the columns of the original dataframe to be extend.
  10. Step 10: Concatenate the original Pandas dataframe (for your vocabulary) with the new dataframe containing the extension data by using the df.concat() or (simpler) by df.append() methods.
  11. Step 11: Drop the index in the extended dataframe by the method pd.reset_index(). Afterward recreate a new index by pd.set_index() and using a special column containing the data - in my case the column "lower"
  12. Step 12: Check the new index for uniqueness - if required.
  13. Step 13: If uniqueness is not given but required:
    Apply df = df[~df.index.duplicated(keep='first')] to keep only the first occurrence of rows for identical indices. But be careful and verify that this operation really fits your needs.
  14. Step 14: Resort your index (and extended dataframe) if necessary by applying df.sort_index(inplace=True)

Some steps in the list above are of course specific for a dataframe with a vocabulary. But the general scheme should also be applicable for other cases.

From the description you have certainly realized which steps must only be performed once in the beginning to establish a much shorter standard pipeline for dataframe extensions. Some operations regarding the index-recreation and re-sorting can also be automatized by some simple Python function.

Have fun with Pandas!

TF-IDF – which formula to take in combination with the Keras Tokenizer?

When performing Computer based text analysis we sometimes need to shorten our texts by some criteria before we apply machine learning algorithms. One of the reasons could be that a classical vectorization process applied to the original texts would lead to matrices or tensors which are beyond our PC memory capabilities.

The individual texts we deal with mostly are members of a text collection (ie.a text corpus). Then one criterion for the reduction of the texts could be the significance of the words for each individual text in which they appear. We only keep significant words.

A measure of a word's significance is given by a a quantity called "tf-idf" - "term frequency - inverse document frequency" (see below). If you have "tf-idf"-values for all the words used in a specific text (of the collection), a simple method to shorten the text for further analysis is to use a "tf-idf"-threshold: We keep words which have a "tf-idf"-value above the defined threshold and omit others.

"tf-idf"-values require a statistical analysis over a text ensemble. The basic statistical data are often collected during the application of a tokenizer to the text ensemble. And here things can become problematic as some tokenizers provide "tf-idf"-data during vectorization, only. Then the snake bites in its tail: We need tf-idf to to shorten texts reasonably and to avoid memory problems during vectorization, but sometimes the tool set provides "tf-idf"-data by vectorization.

A typical example is given by the Keras tokenizer. In such a situation one must invest some (limited) effort into a "manual" calculation of tf-idf values. But the you may find that your (text-book) formula for a "tf-idf"-calculation does not reproduce the values your tokenizer would have given you by a "tfidf"-vectorization of your texts. A reasonable formula for the tf-idf alculation with the help of the Keras tokenizer is the topic of this post. I omit the hyphen in tf-idf below sometimes for convenience reasons.

Vectorization of texts in tfidf-mode and the problem of one-hot like encodings

Most frameworks for text analysis or NLP, of course, provide a Tokenizer. Often, the Tokenizer object does not only identify individual tokens in a text, but the tokenizer is, in addition, capable to vectorize texts. Vectorization leads to the representation of a text by an (ordered) series of integer or float numbers, which in a unique way refer to the words of a vocabulary extracted from the text collection. The indexed position in the vector refers to a specific word in the vocabulary of the text ensemble, the value given at this position instead describes the word's (statistical) appearance in a text in some way.

A typical and basic vectorization approach is a "one-hot"-encoding, resulting in a "bag-of words"-model: A word appearing in a text is marked by a "one" in an indexed vector referring to words appearing in the text collection in an (ordered) fashion.

But vectorization can be provided in different modes, too: The "ones" (1) in a simple "one-hot-encoded" vector can e.g. be replaced by tf-idf values of the words (tfidf-mode). So, by using respective tokenizer functions you may get the aspired "tf-idf"-values for reducing the texts during a vectorization run. The tf-idf data describe the statistical overabundance of a word in a specific text by some formula measuring the word's appearance in a specific text and over all texts in a weighted and normalized way.

However, all one-hot like encodings of texts come with a major disadvantage:

The length of the word vectors depends on the number of words the tokenizer has identified over all texts in a collection for the vocabulary.

If you have extracted 2 million words out of hundreds of thousands of texts you may run into major trouble with the RAM of PC (and CPU-time). There are cases where you cannot or do not want to restrict the number of vocabulary words taken into account for analysis purposes.

Most tokenizers allow for a (manual) sequential approach for a limited number of texts to overcome memory problems under such circumstances. But often enough you may instead want to calculate "tf-idf"-values on your own - just to save time. And here we may talk about a difference of hours!

I recently had this problem with 200,000 texts, the pretty fast Keras tokenizer and a vocabulary of 1.7 million words (of which I wanted to use at least a million entries). The Keras tokenizer itself offers almost all relevant data for a calculation of the tf-idf-values after it has been applied to a list of text. In my case the CPU-time required to tokenize and build a vocabulary for the 200,000 texts took 25 secs, only. A manual and sequential approach to create all tf-idf values via vectorization required about an hour's time.

TF-IDF formulas: The "idf"-term

During my own "tf-idf"-calculation based on some Python code for a tfidf-formula and basic tokenizer-data I, of course, wanted to reproduce the values the Keras tokenizer gave me during my previous vectorization approach. To achieve this goal was a bit more difficult than expected. Just using a reasonable "tf-idf"-formula taken from some NLP text-book failed. The reason was that "tf-idf"-data can be and are indeed calculated in different ways. The Keras tokenizer does it differently than SciKit - actually for both the tf and the idf-part. There is a basic structure behind a normalized tfidf-value; however there are differences in the details. Lets look at both points.

Everybody who has once in his/her life programmed a search engine knows that the significance of a word for a specific text (of an ensemble) depends on the number of occurrences of the word inside the specific text, but also on the occurrence of the very same word in all the other texts of a given text collection:

If a word appears too often in (other) texts of a text ensemble then it is not very significant for the specific text we are looking at.

Examples are typical "stop-words" - like "this" or "that" or "and". Such words appear in very many texts.

Thus we expect that a measure of the statistical overabundance of a word in a specific text (of a collection of texts) is a combination of the abundance in the chosen text and a measure of the occurrence in multiple of texts. The "tf-idf" quantity follows this recipe: It is a combination of the so so called "term frequency" [tf(t)] with the "inverse document frequency [idf(t)], with "t" representing a special word or term:

tfidf(t)   =   tf(t)   *   idf(t)

While the term frequency measures the occurrence of a word within a selected text, the "idf" factor measures the occurrence of a word in different texts of the collection. To get some weighing and normalization into this formula, the "idf"-term is typically based on the natural logarithm of the fraction

  • of the number of texts NT in a collection (nominator)
  • and the number of documents ND(t) in which a special word or term appears (denominator)

A tf-idf therefore is always characteristic of a word or term and the specific text we look at. (This is one reason, why it actually can be used in text vectorization).

But, the "idf"-term is calculated in various manners in different text-books on text-analysis. Some variants avoid the idf-term becoming negative or avoid a division by zero; typical examples are:

  1. idf(t) = log( NT / (ND + 1) )

  2. idf(t) = log( (1 + NT) / (ND + 1) )

  3. idf(t) = log( 1 + NT / (ND + 1) )

  4. idf(t) = log( 1 + NT / ND )

  5. idf(t) = log( (1 + NT) / (ND + 1) ) + 1

Note: log() represents the natural logarithm above.

I have e.g. taken he second variant from a book of S. Raschka (see below) on "Python Machine Learning" (2016, Packt Publishing). The last one in the list above is used in Sci-Kit according to https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html

This is in so far consistent to Raschka's version as he defines the SciKit "tf-idf" as:

tfidf(t) = tf(t) * [ idf(t) + 1 ]

The third variant is the one you find in the source code of the Keras tokenizer, despite the reference there to a point in a Wikipedia article which reflects the fourth form (!).

Source code excerpt of the Keras Tokenizer:

.....
.....
elif mode == 'tfidf':
                    # Use weighting scheme 2 in
                    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
                    tf = 1 + np.log(c)
                    idf = np.log(1 + self.document_count /
                                 (1 + self.index_docs.get(j, 0)))
                    x[i][j] = tf * idf
.....
.....

What we learn from this is that there are multiple variants of the "idf"-term out there. So, if you want to reproduce tfidf-numbers you should better look into the code of your framework objects or functions if possible.

Variants of the "term frequency"? Yes, they do exist!

While I was already aware of different idf-variants, I did not at all know that here are even differences regarding the term-frequency "tf(t)". Normally, one would think that it is just the number describing how often a certain word or term appears in a specific text.

Let us, for example, assume that we have turned a specific text via a tokenizer function into a "sequence" of numbers. An entry in this sequence refers to a unique number assigned to a word of a somehow sorted vocabulary. A tokenizer vocabulary is often represented by a Python dictionary where the key is the word itself (or a hash of it) and the value corresponds to a unique number for the word. In my applications I always create a supplementary dictionary, which I call "switched_vocab", with keys and values switched (number => word). A sequence then is typically represented by a Python list of numbers "li_seq": the position in the list corresponds to the word's position in the text (marked by separators), the number given corresponds to the words unique number in the vocabulary.

Then, with Python 3, a straight-forward method to get simple tf-values (as he sum of the number's occurence in the sequence) would be

ind_w = li_seq[i]    # with "i" selecting a specific point or word in the sequence 
d_count  = Counter(li_seq)
tf = d_count[ind_w]

This code snippet creates a dictionary "d_count" with the word's unique number appearing in the original sequence and the sum of occurrences of this specific number in the text's sequence - i.e. in the text we are looking at.

Does the Keras tokenizer calculate and use tf in this manner when vectorizing texts in tfidf-mode? No, it does not! And this was a major factor for differences in tfidf-values I naively produced for my texts.

With the terms above the Keras tokenizer instead uses a logarithmic value for tf:

ind_w = li_seq[i] # i selecting a specific point or word in the sequence 
d_count  = Counter(li_seq)
tf = log( 1 + d_count[ind_w] )

This in the end makes a significant difference in the derived "tf-idf" values in comparison to naive approach - even if you had gotten the "idf"-term right!

Quick and dirty Python code to calculate tfidf values manually for a list of texts with the Keras tokenizer

For reasons of completeness, I outline some code fragments below, which may help readers to calculate "tf-idf"-values, which are consistent with those produced during "sequences to matrix"-vectorization calculations with the Keras tokenizer. I assume that you already have a working Keras implementation using either CPU or GPU.

I further assume that you have gathered a collection of texts (cleansed by some Regex operations) in a column "txt" of a dataframe "df_rex". We first extract the texts into a list and apply the Keras tokenizer:

from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 1800000    # or whatever number of words you want to be taken into account from the vocabulary  

li_txts = df_rex['txt'].to_list()
tokenizer = Tokenizer(num_words=num_words, lower=True) # converts tokens to lower-case 
tokenizer.fit_on_texts(li_txts)    

vocab   = tokenizer.word_index
w_count = tokenizer.word_counts
w_docs  = tokenizer.word_docs
num_tot_vocab_words = len(vocab) 
    
# Switch vocab - key <> value 
# ****************************
switched_vocab = dict([(value, key) for key, value in vocab.items()])

Tokenizing should be a matter of seconds or a few ten-seconds depending on the number of texts and the length of the texts. In my case with 200,000 texts, on average each with 2000 words, it took 25 secs and produced a vocabulary of about 1.8 million words.

In a next step we create "integer sequences" from all texts:

li_seq_full  = tokenizer.texts_to_sequences(li_txts)
leng_li_seq_full = len(li_seq_full)

Now, we are able to create a super-list of lists - including a list of tf-idf-values per text:

li_all_txts = []

j_end = leng_li_seq_full
for j in range(0, j_end):
    li_text = []
    li_text.append(j)

    leng_seq = len(li_seq_full[j])
    li_seq     = []
    li_tfidf   = []
    li_words   = []
    d_count    = {}

    d_count  = Counter(li_seq_full[j])
    for i in range(0,leng_seq):
        ind_w    = li_seq_full[j][i] 
        word     = switched_vocab[ind_w]
        
        # calculation of tf-idf
        # ~~~~~~~~~~~~~~~~~~~~~
        # https://github.com/keras-team/keras-preprocessing/blob/1.1.2/keras_preprocessing/text.py#L372-L383
        # Use weighting scheme 2 in https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        dfreq    = w_docs[word] # document frequency 
        idf      = np.log( 1.0 + (leng_li_seq_full)  / (dfreq + 1.0) )
        tf_basic = d_count[ind_w]
        tf       = 1.0 + np.log(tf_basic)
        tfidf    = tf * idf 
                
        li_seq.append(ind_w) 
        li_tfidf.append(tfidf) 
        li_words.append(word) 

    li_text.append(li_seq)
    li_text.append(li_tfidf)
    li_text.append(li_words)

    li_all_txts.append(li_text)

leng_li_all_txts = len(li_all_txts)

This last run took around 4 minutes in my case. When getting the same numbers with a sequential approach calculating Keras vectorization matrices in tf-idf mode for around 6000 texts with in-between memory cleansing it took me around an hour with continuous manual system interactions.

Conclusion

In this article I have demonstrated that "tf-idf"-values can be calculated almost directly from the output of a tokenizer like the Keras Tokenizer. Such a "manual" calculation is preferable in comparison to a vectorization run in "tf-idf"-mode when the number of texts and the vocabulary of your texts is big or huge. "tf-idf"-word-vectors may easily get a length of more than a million words with a reasonably complex text ensembles. This poses memory problems on many PC-based systems.

With directly calculated tf-idf-values you get a measure for the significance of words in a text. Therefore, the "tf-idf"- values may help you to shorten texts reasonably before you vectorize your texts, i.e. ahead of applying advanced ML-algorithms.