Blender – complexity inside spherical and concave cylindrical mirrors – III – a second step towards the S-curve

I continue with my mini-series on how to (re-) build something like the S-curve of Mr. Kapoor in Blender. See :

Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions

In my last post

Blender – complexity inside spherical and concave cylindrical mirrors – II – a step towards the S-curve

I discussed how we can add a smooth transition between convex and concave curvature around the y-axis of an originally flat rectangular Blender mesh positioned in the (y,z)-plane. The rectangle had its longer side in y-direction. Starting from a flat area around the vertical middle axis (in z-direction) of the object, we bent the wings to the left and right around the y-axis with systematically growing curvature, i.e. with shrinking curvature radius, in y-direction. The curvature around the y-axis left of the central z-axis got a different sign than the curvature right of it. At the outer edges of both wings we approximated the form of a half cylinder. So curvature became a function of the position along the y-axis.

To create a really smooth surface with differentiable gradient and curvature we had to apply a modifier called “Subdivision Surface”. The trick to make this modifier work sufficiently well with only a few data points in z-direction was to keep the curvature almost constant in z-direction for any given y-position along the horizontal axis. We achieved this by putting the vertices of our mesh on the central rotation axis at the same z-(height-)values as the vertices on the circles at the outer edges. In the end we had established a surface curvature around the y-axis that varied smoothly with the y-coordinate, while the partial derivative of this curvature in z-direction was close to zero for any fixed y-position.

In this post I want to add an “S-curvature” of the object in y-direction. Meaning: We are now going to bend the object along an S-shaped path in the (x,y)-plane. Physically, we are adding curvature in x-direction, more or less constant around two imaginary vertical axes positioned at some distance in y-direction from the central z-axis – and with different signs of the curvature. So we are creating a superposition of a growing curvature around the y-axis with a constant curvature around two z-axes, one for each of the wings left and right of the central rotational z-axis of our object. Eventually, we build something like shown below:
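This superposition can be sketched numerically. The following pure-Python sketch uses hypothetical profile functions (a tanh for the S-offset, a linear ramp for the curvature); they only illustrate the symmetry properties described above, not the exact shapes Blender will produce:

```python
import math

def s_path_x(y, a=0.5, w=1.5):
    # Hypothetical antisymmetric S-offset in x as a function of y:
    # zero at the center, opposite signs on the two wings.
    return a * math.tanh(y / w)

def kappa_around_y(y, half_len=3.0, k_max=1.0):
    # Hypothetical curvature around the y-axis: zero at the central
    # z-axis, growing toward the outer edges, with opposite signs
    # left and right of the center.
    if y == 0.0:
        return 0.0
    return math.copysign(k_max * min(abs(y) / half_len, 1.0), y)

# Antisymmetry of the S-offset and sign flip of the curvature:
assert abs(s_path_x(2.0) + s_path_x(-2.0)) < 1e-12
assert kappa_around_y(2.9) > 0.0 > kappa_around_y(-2.9)
assert kappa_around_y(0.0) == 0.0
```

Both functions vanish at y=0, which reflects the flat, uncurved region at the central rotational axis.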

When we look at images of Mr. Kapoor’s real S-curve we see that he keeps the curvature at zero at the central rotational axis, both in z-direction and along a diagonal x/y-direction – due to the “S”. The same is true for our object. But in comparison to the real S-curve of Mr. Kapoor our object is more extreme:

The ratio of height in z-direction to length in y-direction is bigger in our case: the object is shorter in y-direction and thus relatively higher in z-direction. The curvature radii around the y-axis are significantly smaller. And our surface approximates full half-circle curves at the outer edges, whereas the real S-curve only approximates a shifted, cut-off section of a cylinder there.

We could, of course, adapt our S-curve in Blender a bit more to the real S-curve. However: Our more extreme bending around the y-axis at the outer left and right edges of the object allows for multiple reflections of light falling in along the x-direction onto the concave sides of the object. This is not the case for the real S-curve.

Steps to give the surface an S-shape

How do we get to the surfaces presented in the above figures in Blender? The following steps comprise building a suitable, symmetric and very fine-grained Bezier path and the use of a Curve modifier. To achieve a relatively smooth bending in x-direction we must first subdivide our original object into a sufficient number of sections in y-direction, in addition to the already existing subdivisions in z-direction.

Step 1: Sub-dividing our object in y-direction

The final object mesh constructed in my last post is depicted below:

We “apply” (via menu “Object” > “Apply”) any rotations we may have done so far. The object’s center should now coincide with the world center. We position the camera some distance apart from our object in x-direction, but at y=z=0. The following image shows the camera perspective.

Remember that our object consists of two meshes which we have joined at the central axis. We are now going to subdivide each half into 32 sections in y-direction.

Go to Edit mode, position the cursor in one half and press Ctrl-R:

Turn your mouse wheel until you have created 32 subdivisions. (The number is shown at the bottom left of your viewport.) Left-click twice to fix the subdivision lines at their present positions; do NOT move the mouse in between the clicks.

Doing this on both halves eventually gives us:

We check that the y- and x-distances of the vertices have equal absolute values. In addition, we check the z-heights of selected vertices and compare them to the heights of the vertices on the outer circles.

Step 2: Design an S-path

We move our object, which has dimensions of 6m in y-direction and 2m in z- and x-direction, respectively, to the left (y=-6m) – just to get some space at the world origin. We now add a Bezier curve to our scene. Its origin is located at the world center and it is stretched along the x-axis. Rotate the curve around the z-axis by 90 degrees such that it stretches along the y-axis. Choose a top-view position and rotate the viewport such that the y-axis points to the right.

Stretch the curve to a y-dimension of 6m. Select the curve and go to Edit mode. We now rotate the rightmost and the leftmost tangent bars such that we get the following curve:

You fix the required symmetry by watching and adjusting the position information of the tangent bars’ end vertices in the sidebar of the viewport. Note that a tangent bar has 2 associated handles and vertices. By changing the vertex positions systematically to [(x=-0.5, y=-3.5), (x=0.5, y=-2.5)] and [(x=-0.5, y=2.5), (x=0.5, y=3.5)] on the left and the right side, respectively, you get a seemingly nice flat S-curve.
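With these handle positions, the middle segment of the curve is exactly point-symmetric about the origin, which is what makes the S look balanced. A small plain-Python sketch (using the vertex values from above as the control points of a single cubic Bezier segment) confirms this:

```python
def bezier(p0, p1, p2, p3, t):
    # Evaluate a cubic Bezier segment at parameter t in [0, 1].
    u = 1.0 - t
    return tuple(u**3 * a + 3*u**2*t * b + 3*u*t**2 * c + t**3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

# Control points of the middle segment: the two curve vertices at
# (0, -3) and (0, 3) plus their inner handles as adjusted above.
P0, P1, P2, P3 = (0.0, -3.0), (0.5, -2.5), (-0.5, 2.5), (0.0, 3.0)

# Point symmetry: B(t) == -B(1 - t) for every t.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    x1, y1 = bezier(P0, P1, P2, P3, t)
    x2, y2 = bezier(P0, P1, P2, P3, 1.0 - t)
    assert abs(x1 + x2) < 1e-12 and abs(y1 + y2) < 1e-12
```

The symmetry holds because P0 = -P3 and P1 = -P2; any later uniform rescaling of the x-dimension preserves it.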

Go to Object mode and change the dimension of the curve in x-direction to 2m. You then clearly see that the path is not a very smooth one, but consists of distinct linear segments. This would be a major problem later on. In addition, the curvature of the S seems a bit extreme. We change the x-dimension to just 1m. Then we subdivide the curve into further segments in Edit mode – multiple times.

And we get a smoothly curved and well dimensioned path:

Eventually, we end up with the following situation:

Step 3: Apply a curve modifier to our object

Now comes an important point: If you have assigned the origin of the curve to its midpoint, you should do the same with your object; see the Internet for the appropriate Blender operations:

We then move our object to y=3.171m and z=1m, i.e. a bit beyond the rightmost end-point of the path and above the world’s central plane.

Now, we add a curve-modifier to our object, select the Bezier curve (our path) and get

We center our view and choose a top position. We adjust the y-location of the object until its present center coincides with the world center.

And there we have our personal S-curve – more extreme than Kapoor’s real S-curve, but this is only a question of dimension adjustments AND the right choice of how to cut the limiting circle mesh in the beginning (see the last post). Our S-curve will show multiple reflections on its concave side(s).

Apply the modifier “Subdivision Surface”

The rest is routine for us already. We apply the subdivision surface modifier to get a smooth surface.

When we modify the world’s sky texture a bit, with a sun just at the horizon, we get images like the following – taken from just above the horizon line and from different camera perspectives with different focal lengths (wide-angle shots).

We clearly see multiple reflections in vertical direction on the left concave side of our object.

Complexity out of simple things … Real fun …

Conclusion

Rebuilding something like the S-curve of Mr. Kapoor was hard work for me, as I use Blender just as a hobby tool. But it was worth the effort. In my next post,

Blender – complexity inside spherical and concave cylindrical mirrors – IV – reflective images of a Blender variant of Mr Kapoor’s S-curve

I am going to show what happens when we place objects in front of our S-curve. This will give us a first impression of what might happen with a totally concave surface such as an open half-sphere.
Stay tuned …

 

Blender – complexity inside spherical and concave cylindrical mirrors – II – a step towards the S-curve

In my last post

Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions

I briefly discussed some interesting sculptures and optical experiments in reality. The basic ideas are worth some optical experiments in the virtual ray-tracing world of Blender. In this post I start with my attempt to reconstruct something like the so-called “S-curve” of the artist Anish Kapoor with Blender meshes.

If you looked at the link I gave in my last article or googled for other pictures of the S-curve you certainly saw that the metallic surface the artist placed at the Kistefoss museum is not just a simple combination of mirrored cylindrical surfaces. It is much more elegant:

The first point is that it consists of one continuous, coherent piece of metal. The surface is deformed and changes its curvature continuously. It shows symmetry and rotational axes. When my wife and I first saw it, we stood at a rather orthogonal position opposite of it. We only became aware of the different cylindrical deformations on the left and right side. We wondered what Kapoor had done at the middle vertical axis, as we expected a gap there. Later we went to another position – and there was no gap at all, but a smooth variation of curvatures along the main axes of the object.

The second point is the combination of different curvatures: a cylindrical curvature in vertical direction (mirrored in left/right direction) plus the elegant S-like curvature in horizontal direction. The curvature in vertical direction grows with horizontal distance from the center – it is zero at the central vertical axis. The left and right parts of the object are identical – they reflect a 180° rotation (not a mirroring process) around the central vertical axis. Actually, the gradient disappears at the central rotational axis and along the horizontal symmetry axes. And there is no curvature at all at the central vertical axis.
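The difference between a 180° rotation and a mirroring is easy to state in coordinates. A minimal sketch, assuming the vertical axis is the z-axis and the mirror plane is the (y,z)-plane:

```python
def rotate_180_z(p):
    # 180-degree rotation around the vertical z-axis: (x, y, z) -> (-x, -y, z)
    x, y, z = p
    return (-x, -y, z)

def mirror_x(p):
    # Mirroring at the (y,z)-plane: (x, y, z) -> (-x, y, z)
    x, y, z = p
    return (-x, y, z)

p = (1.0, 2.0, 3.0)
print(rotate_180_z(p))  # (-1.0, -2.0, 3.0)
print(mirror_x(p))      # (-1.0, 2.0, 3.0)
```

Both operations flip the x-coordinate, but the rotation additionally flips y while the mirroring does not – which is why the two halves of the S-curve are identical copies of each other rather than mirror images.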

All in all a lot of different symmetries and smooth curvature transitions! The artist plays with the appeal of symmetries to the human brain. But, at the same time, he breaks symmetry strongly in the visual impression of the viewer with the help of the rules of optics. Wonderful!

In this article I want to tackle the problem of a smooth transition between two cylindrically deformed surfaces in Blender first. The S-curvature is the topic of the next post.

The result first

I first show you what we want to achieve:

We get an impression of the mirroring effects in “viewport shading mode” by adding a sky texture to the world background and a simple textured plane:

The reader may have noticed small dips (indentations) at the centers of the upper and lower edge. I come back to this point later on. Compared to the real S-curve, a major difference in vertical direction is that Kapoor did not use the full curvature of a half cylinder at the outermost left and right ends. He may have taken only a cut-off part of a half circle there. But which part of a half circle you use in the end is a minor point regarding the general construction of such a surface in Blender.

How to get there?

As I use Blender only seldom, I really wondered how to create a surface like the one shown above. Mesh-based or NURBS-based? And how to get a really smooth surface? Regarding the latter point you may think of simple subdivision, but this is the wrong approach, as a plain subdivision of a mesh only splits the linear connections between vertices. Therefore, if you applied simple subdivision to the object, you would create points not residing on a circle/cylinder/surface – which in the end would disturb the optics with visible lines and flat planes, even if you added a smoothing modifier afterward.

The solution in the end was simple and mesh based. There is one important point to note which has to do with rules for object creation in Blender:

You must define the resolution of the mesh(es) we are going to construct right at the beginning!

As we need to edit some vertex positions manually, the resolution in first experiments should be rather limited. For a continuous surface we shall apply a surface-smoothing modifier anyway. This modifier rounds off edges a bit – which leads to the “dips” I mentioned. They become smaller the higher you choose the meshes’ resolution, but that is something for a final, polished version.

Constructional steps

All in all there are many steps to follow. I only give a basic outline; read the Blender manual for details.
Note: I added the application of a modifier in the middle of the steps for illustration purposes. You should skip this step and apply the modifier only at the end. I sometimes experienced strange effects when applying and deleting the modifier while working with vertices.

Step 1: You first create a mesh-based circle. You now decide on the number of mesh nodes and the basic resolution of the later surface. This is done via the tool menu that opens (in Blender version 2.82) in the lower left of the viewport. Let’s keep to the standard value of 32 mesh points (vertices). This means that a half circle will later contain 17 vertices. All vertices of our first mesh reside exactly on the circle line. The circle’s center resides at the global world center. You also see that 4 points of the circle sit on the world axes X and Y. Leave the circle exactly where it is. Do not apply any translation. (It would be hard to realign it with the world axes later on.)

Step 2: Change to Edit mode and remove one side of the circle (left of the X-axis) by eliminating the superfluous vertices. Do it such that the end points of the remaining half circle reside exactly on the X-axis of the world mesh. Keep the origin of the mesh where it is. Do NOT close the circle mesh on the X-axis, i.e. do not create a closed loop of vertices!

Step 3: You then add a line mesh in Object mode. This can e.g. be achieved by first creating a path. Move it along the world Y-axis to get some distance from the half circle (-3m). Select the path by right-clicking and convert it to a mesh via the corresponding menu entry. Go to Edit mode again and eliminate vertices (or add some by subdividing) until the resulting line mesh has exactly the same number of vertices (17) as your half circle, including the end points. In Object mode set the origin to the mesh’s geometry, i.e. its center. Move the line mesh to X=0. Change its X-dimension to the same value the half circle has (2m).

Step 4: Rotate the half-circle by 90 degrees around the X-axis to get a basic scene like in the picture below. Join the two meshes to one object.

Step 5: Go to Edit mode and provide missing edges to connect the line segment with the half-circle.

Step 6: Add faces by selecting all vertices and choosing menu point “Face > Grid Fill”.

Hey, this was a major step. Save your results – and make a backup copy for later experiments.

Step 7: Add a Sky Texture to the world. Activate the Cycles renderer. Rotate the object by 90 degrees around the Y-axis. Choose viewport shading mode.

Step 8: Move the object to Z=1m. Right-click on your object in Object mode and choose “Shade Smooth”.

Just to find that you still see the edges of the faces. Smooth is not really smooth, unfortunately.

Step 9: Skip this step in your own experiment and perform it at the end of the construction. Just to illustrate that the flat surfaces can be eliminated later on, I add a modifier to our object – namely the modifier “Subdivision Surface” – which offers a more intelligent algorithm than “Shade Smooth”. Just for testing I give it the following parameters:

We get:

Much more convincing! You see, e.g. at the left side, that the corners have been rounded off – this will later lead to the dips I mentioned.

Intermediate consideration
We could now duplicate our object, rotate the duplicate and join it with the original. But before we do this, we change the height values of the vertices along the left edge (actually a line segment). From our construction it is clear that corresponding vertices on the half circle and on the left edge cannot have the same Z-coordinates – they reside at different heights above the ground. The Catmull-Clark algorithm of our modifier therefore creates a surface with gradients and curvature varying in all coordinate directions. There is no real problem with this. However, it reduces the chance for certain caustics and cascades of multiple reflections on the concave side of the final surface. Cylindrical surfaces (i.e. surfaces with constant curvature) give rise to sharp reflective caustics. To retain a bit of this and keep the curvature rather constant in Z-direction (whilst varying in X-direction), we are going to adjust the heights of the vertices along the straight left edge to the heights of the vertices along the half circle.

Step 10: Go to Edit mode. Do NOT move the vertices of the half circle! Check the Z-value of each vertex of the half circle by selecting them one by one and looking at the information in the sidebar of the Blender interface (View > Sidebar). Change the Z-coordinate of the corresponding vertex on the left straight edge to the very same value. Repeat this process for all vertices of the half circle and their counterparts on the straight edge.
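Instead of reading off each Z-value one by one, the target heights can also be computed. A small sketch, assuming 17 angle-equidistant vertices on a half circle of radius 1m whose center sits at height 1m (the dimensions used in this construction):

```python
import math

def half_circle_heights(n=17, r=1.0, c=1.0):
    # Z-heights of n angle-equidistant vertices on a half circle of
    # radius r, rotated into a vertical plane, with its center at height c.
    return [c + r * math.cos(math.pi * k / (n - 1)) for k in range(n)]

heights = half_circle_heights()
# Copy these values to the corresponding vertices on the straight edge.
print(heights[0], heights[8], heights[-1])  # top, middle, bottom
```

Because the vertices are equidistant in angle, their Z-heights are NOT equidistant – which is exactly the non-uniform spacing you will see on the left edge after this step.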

You see that the vertices are now distributed non-equidistantly along the Z-axis on the left side! This already gives us a slightly different shading in the lower part.

Step 11: Important! Remove the modifier if you applied it. Then move the object such that all vertices on the left edge are at Y=0 and X=0. For Y=0 you can just adjust the median of the vertices. Check also that the corner vertices of the half circle have X=0 and Y=3. All vertices of the half circle should have Y=3.

Then snap the cursor to the grid at X=0, Y=0, Z=1. Afterward snap the origin of the object to the cursor. The object’s coordinates should now be X=0, Y=0, Z=1.

Step 12: In Object mode: Duplicate the object by SHIFT+D followed by Enter. Do not move the mouse in between; don’t touch it. Rotate the active duplicate around the Z-axis by 180 degrees.

Check the coordinates of the vertices of the rotated duplicate. If its right vertices reside at Y=0 and its left ones at Y=-3, then join the two objects into one. Note: At the middle there are still two rows of vertices. But these vertices should coincide exactly in their X-, Y- and Z-values. If not, you will see it later as distortions in the optics.

Step 13: Add a metallic material

Place the camera at the position shown below and add the modifier again with the settings given above. Render with the help of the material preview:

Step 14: Add a sun at almost 180 degrees and play a bit with the sky

We get in full viewport shading:

Watch the sharp edges created by multiple reflections on the left concave side of the object. We got these thanks to our laborious adjustment of the Z-coordinates of our central vertices.

Save your result for later purposes!

Adding some elements to the scene

After having created such an object, we can move and rotate it as we like. In the following images I mirrored it (via 2 rotations!). The concave curvature is now on the right side. Then I added a plane with a minimal texture containing some disturbances. Eventually, I added some objects and extended light sources, plus a change of the sun’s color to the red side of the spectrum. (Hint: When moving spacious light sources around relatively close to the object, the reflections should not show any straight-line disturbances. It’s a way to test the smoothness of the surface created by the modifier.)

Yeah, one piece of metal with growing cylindrical concave and convex curvatures to the left and the right. We are getting closer to a reconstruction of the S-curve. And have a look at the nice deformations of the reflected images of a red cylinder, a green cone and a blue sphere, which I placed relatively close to the concave surface on the right side. Physics and Blender are fun! But all respect and tribute again to Anish Kapoor for his original idea!

In the next post

Blender – complexity inside spherical and concave cylindrical mirrors – III – a second step towards the S-curve

I have a look at an additional S-curvature in horizontal direction. Stay tuned …

 

Blender – complexity inside spherical and concave cylindrical mirrors – I – some impressions

After two stressful months at my job I used part of my Christmas holidays to play a bit around with Blender. For me Blender has always been a fascinating tool to perform optical experiments with mirrors, lenses, light emitting gases, etc. It’s real fun … like in a virtual lab. See e.g. the image of two half-transparent cubes intersecting each other in an asymmetric way; the cubes were filled with volumetric gas emitting red and green light:

The idea for the experiment illustrated above arose in a discussion with a German artist (Michael Grossmann) about different kinds of color mixtures. The human eye and the neural networks behind it interpret a dense mixture of green and red light rays as yellow. This is true only for active light emitters, but not for passive reflective particles as used in painting. Think of pixels emitting light on a TV-screen: There neighboring red and green pixels create an impression of yellow.

In this present article series, however, I want to describe two experiments for highly reflective mirroring surfaces. I got the ideas from two real art installations. All credit must be given to the artists behind the original art objects:

One object is located in a sculpture park in Norway called the “Kistefoss museum”. See https://www.kistefosmuseum.com/sculpture/the-sculpture-park. I warn you – a visit to this park is really expensive in my opinion (18 Euros per person, plus 8 Euros or more for parking). OK, the fact that enjoying modern art is a kind of luxury had to be expected in the richest country outside the EU. And, of course, Kistefoss is run by a private investor … See https://www.kistefos.no/. Some things obviously never change during the history of capitalism. Presently the impression of art in the original “nature” of a once beautiful river valley is spoiled by a huge construction site for a 4- or 6-lane autobahn bridge. Well, well – so much about the relation between art and capitalism.

Nevertheless – there are some really nice installations at Kistefoss to look at.

The object I refer to is named “S-Curve” and was made by the well-known contemporary artist Anish Kapoor. See https://www.kistefosmuseum.com/sculptur/s-curve or google for “s-curve anish kapoor” and look at the images. The installation consists of a bent, elongated, rectangular metallic surface looking like a twisted curved band. Curved in two directions: The curvature on the left side of the S-curved band is concave, on the right side it is convex. I found this idea of combining reflective concave and convex mirror surfaces breathtakingly simple and impressive at the same time. The whole installation makes you think about the reality behind visual images triggered by some conceptional network in your brain. It reminded me of the old Platonic idea that our relation to the real world must be compared to a man sitting in a cave where he only sees shadows of a real world on a wall. Now, imagine a world where our cave walls were made of curved mirrors – how could we get a clue about the reality behind the strange reflections? Well, such questions seemingly trigger something inside physicists …

The second public installation is located in the Norwegian city of Drammen at a river bank. There you are confronted with the highly reflective outer surfaces of two spheres, one on each side of the river. However, these spheres also have a kind of spherical indentation on their outer sides. Sometimes, when the sun is at the right position, you can see strange ring-like reflections of light in these indentations – with rather sharp edges. Similar disturbing effects can be seen when you place some intense LED lamp outside and inside the sphere. It makes you wonder what kind of images an open half-sphere would create.

The question of what a camera placed at different positions in front of or inside a reflective open half-sphere would record is interesting for many reasons, if you start thinking about it with some recovered school knowledge of optics. Three major points are:

  • Things may appear to be located in front of the mirror surface.
  • There is inversion. Reflected things appear upside down and left-right mirrored.
  • In addition, multiple reflections have to be taken into account in a half-sphere. This, by the way, is the reason for the multiple ring-like reflections of distant bright light sources.
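The first two points follow directly from the mirror equation 1/f = 1/d_o + 1/d_i with focal length f = R/2 for a spherical mirror (paraxial approximation). A quick numerical check in Python, using the common sign convention that a positive image distance means a real image in front of the mirror:

```python
def image_distance(R, d_o):
    # Mirror equation 1/f = 1/d_o + 1/d_i with focal length f = R/2.
    f = R / 2.0
    return 1.0 / (1.0 / f - 1.0 / d_o)

def magnification(d_o, d_i):
    # Negative magnification means the image is upside down.
    return -d_i / d_o

# Object at the center of curvature of a concave mirror (R = 2, f = 1):
d_i = image_distance(2.0, 2.0)
print(d_i, magnification(2.0, d_i))   # 2.0 -1.0  -> real, inverted image

# Object inside the focal length: negative d_i = virtual, upright image.
print(image_distance(2.0, 0.5))       # -1.0
```

A real image forming in front of the mirror is exactly the effect that makes reflected things "appear to be located in front of the mirror surface"; the negative magnification covers the inversion.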

There is a fantastic video on Youtube, “What Does It Look Like INSIDE a Spherical Mirror?” (https://www.youtube.com/watch?v=Y8c7TZx8HeY), which gives you a live impression of the strange things a concave mirror surface can do with light rays. Well, not everybody has the means to make or get a perfect mirroring half-sphere, or to get some huge metal plates and deform them into an S-shaped cylindrical band. For us normal people, Blender will have to do the job with virtual objects.

To whet your appetite, I first want to present two preliminary rendered images from Blender. One just shows the reflections of some objects (cylinders, spheres, cones) placed in front of two plain cylindrical surfaces attached to each other. This is a first, simplified approximation of the S-curve. But it already reveals some of the properties which we can expect to find on a twisted continuous metal structure like the S-curve of Mr. Kapoor.

The other image shows the reflections of some relatively small objects (again a sphere, a cone and a cylinder) positioned deep inside a concave half-sphere. This second picture indicates the complexity which multiple reflections within an open reflective half-sphere can create. We shall later enhance the artificial scenes displayed below by additional mirrors (flat, cylindrical and spherical) behind the camera. Another goal is a movie with the camera slowly moving in and out of an open half-sphere.

It took me a while to create the pictures below, as I needed to adapt to some changes in the Blender tool set and to master some specific modeling tasks. For the present project I use version 2.82 of Blender, which came together with Opensuse Leap 15.3 on my laptop. The last time I worked with Blender I had a version around 2.0. Especially the problem of how to construct perfectly reflecting surfaces of cones, cylinders and spheres required some investigation. Also, we need to choose the Cycles rendering engine to get satisfactory results. Note also that the “Metallic” parameter used these days was set to 1 in the pictures below. This gives you an unrealistic loss-free reflection. I shall discuss these points in detail in forthcoming articles.

Two cylindrical surfaces with some objects in front of them

The following image from Blender’s viewport interface clearly reveals the shape and form of the cylindrical surfaces. Also the objects creating the reflections are shown. The reader will also find some point like light sources spread around in the scene.

And with a simple background the rendered result of ray tracing looks like:

What we can see here on the left side is that concave mirror surfaces can create some illusionary, rather deceptive images with shapes very different from the original object from which the light rays are emitted (by a first reflection of light from the surroundings).

The half sphere with some objects in it

The first two pictures below show the basic spatial setup of the next scene with blender.

The rendered result looks like:

Do you see the distinct ring-like reflection zones in the outer parts? They are the result of multiple reflections – we can count at least 9 reflections before the reflection zones become indistinguishable. The image also displays the rich complexity which reflections in the inner zones of the half-sphere, together with multiple reflections between the objects themselves, can create for a viewer outside the sphere. In forthcoming experiments we shall also create pictures for positions of the camera inside the sphere.

In my next post
Blender – complexity inside spherical and concave cylindrical mirrors – II – a step towards the S-curve
I will make a first step to reconstruct something like the real smooth S-curve within Blender. Stay tuned. And a happy new year 2022 to everybody!

Pandas – Extending a vocabulary or simple dataframe relatively fast

During some work for an ML project on a large text corpus I needed to extend a personally maintained reference vocabulary by some complex and unusual German compounds and very branch-specific technical terms. I kept my vocabulary data in a Pandas dataframe. Each “word” there had some additional information associated with it in some extra columns of the dataframe – e.g. the length of a word, a stem, or a list of constituting tri-char-grams. I was looking for a fast method to extend the dataframe in a quick procedure by a list of hundreds or thousands of new words.

I tried the df.append() method first and got disappointed by its rather bad performance. I also experimented with the incorporation of some lists or dictionaries. In the end, a procedure based on CSV data was by far the most convenient and fastest approach. I list the basic steps below.

In my case I used the lower case character version of the vocabulary words as an index of the dataframe. This is a very natural step. It requires some small intermediate column copies in the step sequence below, which may not be necessary for other use-cases. For the sake of completeness the following list contains many steps which have to be performed only once and which later on are superfluous for a routine workflow.

  1. Step 1: Collect your extension data, i.e. a huge bunch of words, in a Libreoffice Calc file in ods-format or (if you absolutely must) in an MS Excel file. One of the columns of your datasheet should contain data which you later want to use as a (unique) index of your dataframe – in my case a column “lower” (containing the lower-case representation of a word).
  2. Step 2: Avoid any operations for creating additional column information which you later can create by Python functions working on information already contained in some dataframe columns. Fill in dummy values into respective columns. (Or control the filling of a dataframe with special data during the data import below)
  3. Step 3: Create a CSV-File containing the collected extension data with all required field information in columns which correspond to respective columns of the dataframe to be extended.
  4. Step 4: Create a backup copy of the original dataframe which you want to extend. Just as a precaution …
  5. Step 5: Copy the contents of the index of your existing dataframe to a specific dataframe column consistent with step 1. In my case I copied the words’ lower case version into a new data column “lower”.
  6. Step 6: Delete the existing index of the original dataframe and create a new basic integer based index.
  7. Step 7: Import the CSV-file into a new and separate intermediate Pandas dataframe with the help of the method pd.read_csv(). Map the data columns and the data formats properly by supplying respective (list-like) information to the parameter list of read_csv(). Control the filling of possibly empty row-fields. Check for fields containing “null” as string and handle these by the parameter “na_filter” if possible (in my case by “na_filter=False”)
  8. Step 8: Work on the freshly created dataframe and create required information in special columns by applying row-specific Python operations with a function and the df.apply()-method. For the sake of performance: Watch out for naturally vectorizable operations whilst doing so and separate them from other operations, if possible.
  9. Step 9: Check for completeness of all information in your intermediate dataframe. Verify that the column structure matches the columns of the original dataframe to be extended.
  10. Step 10: Concatenate the original Pandas dataframe (for your vocabulary) with the new dataframe containing the extension data by using pd.concat() or (simpler, but deprecated in newer Pandas versions) the df.append() method.
  11. Step 11: Drop the index in the extended dataframe by the method df.reset_index(). Afterward recreate a new index by df.set_index(), using a special column containing the data – in my case the column “lower”.
  12. Step 12: Check the new index for uniqueness – if required.
  13. Step 13: If uniqueness is not given but required: Apply df = df[~df.index.duplicated(keep='first')] to keep only the first occurrence of rows for identical indices. But be careful and verify that this operation really fits your needs.
  14. Step 14: Re-sort your index (and the extended dataframe) if necessary by applying df.sort_index(inplace=True).
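The core of steps 5/6 and 10 to 14 can be condensed into a few lines of Pandas code. The following is a minimal sketch with invented toy data and hypothetical column names (“word”, “lower”); in a real workflow df_ext would come from pd.read_csv() as described in step 7:

```python
import pandas as pd

# Toy stand-in for the original vocabulary dataframe (index column "lower")
df_orig = pd.DataFrame({'word': ['Apple', 'Berry']},
                       index=pd.Index(['apple', 'berry'], name='lower'))

# Steps 5/6: copy the index into a column, then fall back to an integer index
df_orig['lower'] = df_orig.index
df_orig = df_orig.reset_index(drop=True)

# Extension data; 'BERRY' creates a duplicate index entry on purpose
df_ext = pd.DataFrame({'word': ['Cherry', 'BERRY'],
                       'lower': ['cherry', 'berry']})

# Step 10: concatenate (df.append() is deprecated in newer Pandas versions)
df_new = pd.concat([df_orig, df_ext], ignore_index=True)

# Steps 11-14: recreate the index, drop duplicate indices, re-sort
df_new = df_new.set_index('lower')
if not df_new.index.is_unique:
    df_new = df_new[~df_new.index.duplicated(keep='first')]
df_new = df_new.sort_index()
```

Note how the duplicate entry for “berry” is resolved in favor of the row from the original dataframe, because it comes first after concatenation.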

Some steps in the list above are of course specific for a dataframe with a vocabulary. But the general scheme should also be applicable for other cases.

From the description you have certainly realized which steps must be performed only once in the beginning to establish a much shorter standard pipeline for dataframe extensions. Some operations regarding the index recreation and re-sorting can also be automated by a simple Python function.

Have fun with Pandas!

TF-IDF – which formula to take in combination with the Keras Tokenizer? And when to calculate TF-IDF by your own code …

When performing computer-based text analysis we sometimes need to shorten our texts by some criterion before we apply machine learning algorithms. One of the reasons could be that a classical vectorization process applied to the original texts would lead to matrices or tensors which are beyond our PC’s memory capabilities.

Another reason for shortening might be that we want to focus our analysis upon words or tokens which are “significant” for the text documents we are dealing with. The individual texts we work with are typically members of a limited collection of texts – a so-called text corpus.

What does “significance” mean in this context? Well, words which are significant for a specific text should single out this text among all other texts of the corpus – or vice versa. There should be a strong and specific correlation between the text and its “significant” tokens. Such a kind of distinguishing correlation could be: The significant tokens may appear in the selected specific document, only, or especially often – always in comparison to other texts in the corpus.

It is clear that we need some “measure” for the significance of a token with respect to individual texts. And we somehow need to compare the frequency by which a word/token appears in a text with the frequencies by which the token appears in other texts of our corpus.

For some analysis we might in the end only keep “significant” words which distinguish the texts of the corpus from each other. Note that such a shortening procedure would reduce the full vocabulary, i.e. the set of all unique words appearing in the corpus’ texts. And after shortening the statistical basis for “significance” may have changed.

Due to the impact the choice of a “measure” of a token’s significance may have on our eventual analysis results, we must be careful and precise when we discuss our results and their presuppositions. We should name the formula used for the measure of significance.

This leads to the question: Do all authors in the field of Machine Learning [ML] and NLP use the same formula for the “significance” of words or tokens? Are there differences which can be confusing? Oh yes, there are … The purpose of this post is to remind beginners in the field of NLP or text analysis of this fact and to give an overview over the most common approaches. In addition I will discuss some practical aspects and provide code snippets which reproduce the TF-IDF values the fast Keras Tokenizer would give you.

For large corpora the differences between the formulas for measuring the significance of tokens may be minor and not change fundamental conclusions. But in my opinion the differences should be checked and at least be named.

TF-IDF values as a measure of a token’s significance – dependency on token, text AND corpus

In NLP a measure of a word’s significance with respect to a specific text of a defined corpus is given by a quantity called “TF-IDF” – “term frequency – inverse document frequency” (see below). TF-IDF values are specific for a word (or token) and a selected text (out of the corpus). We will discuss elements of the formulas in a minute.

Note: Significance values as TF-IDF values will in general depend on the corpus, too.

This is due to the fact that “significance” is based on correlations. As said above: We need to compare token frequencies within a specific text with the token’s frequencies in other texts of the given corpus. Significance is thus rooted in a singular text and the collection of other texts in a specific corpus. Keep a specific text, but change the corpus (e.g. by eliminating some texts) – and you will change the significance of a token for the selected text.

If you have somehow calculated “TF-IDF”-values for all the words used in a specific text (of the corpus collection), a simple method to shorten the selected text would be to use a “TF-IDF”-threshold: We keep words which have a “TF-IDF”-value above the defined threshold and omit others from our specific text.

Such a shortening procedure would depend on word- and text-specific values. Another way of shortening could be based on an averaged TF-IDF value of each token evaluated over all documents. We then would get a corpus-specific significance value <TF-IDF> for each of our tokens. Such averaged TF-IDF values for our tokens together with a threshold could also be used for text-shortening.
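As a tiny illustration, shortening by a TF-IDF threshold is just a filter operation. The words and values below are invented for the example; in practice the dictionary would hold the TF-IDF values computed for the words of the selected text:

```python
# Hypothetical TF-IDF values of the words of one text (invented numbers)
d_tfidf = {'the': 0.10, 'telescope': 2.70, 'and': 0.05, 'lens': 1.90}

threshold = 1.0
text_tokens = ['the', 'telescope', 'and', 'lens']

# Keep only words whose TF-IDF value exceeds the threshold
short_text = [w for w in text_tokens if d_tfidf[w] > threshold]
```

With these invented values only “telescope” and “lens” would survive the shortening.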

Whatever method for shortening you choose: “TF-IDF”-values require a statistical analysis over the given ensemble of texts, i.e. the text corpus. The basic statistical data are often collected during the application of a tokenizer to the texts of a given corpus. A tokenizer identifies unique tokens appearing in the texts of a corpus and collects them in a long vector. Such a vector represents the tokenizer’s (and the corpus’) “vocabulary”. The tokenizer vocabulary often is sorted by the total frequency of the tokens in the corpus. In a Python environment the tokenizer’s vocabulary often will be represented by one or more Python dictionary objects.

Technical obstacles may require the explicit calculation of TF-IDF values by a coded formula in your programs

NLP frameworks in most cases provide specific objects and methods to automatically calculate TF-IDF values for the tokens and texts of a corpus during certain analysis runs. But things can become problematic because some tokenizers provide “TF-IDF”-data during a vectorization procedure, only. By vectorization we mean a digital encoding of the texts with respect to the tokenizer’s vocabulary in a common way for all texts.

An example is one-hot encoding: Each text can be represented by numbers 0,1..,n at positions in a long vector in which each position represents a specific token. 0,1,..n would then mean: In this text the token appears 0,1 or n times. Such a kind of encoding is especially useful for the training of neural networks.
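A minimal sketch of such a count-based encoding, with a tiny invented vocabulary, might look like this:

```python
# Tiny invented vocabulary: token -> position in the encoding vector
vocab = {'the': 0, 'cat': 1, 'sat': 2, 'mat': 3}

def encode(text, vocab):
    """Encode a text as a vector of token counts over the vocabulary."""
    vec = [0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:           # unknown tokens are simply ignored here
            vec[vocab[token]] += 1
    return vec

vec = encode('The cat sat on the mat', vocab)   # [2, 1, 1, 1]
```

In a real corpus the vector has one position per vocabulary token, which is exactly what makes these vectors so long.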

A tool which gives us TF-IDF values after some vectorization is the Keras Tokenizer. I prefer the Keras Tokenizer in my projects because it really is super fast.

But: You see at once that we may run into severe trouble if we need to feed all of the tokens of a really big corpus into a TF-IDF analysis based on vectorization. For a collection of texts the number of unique tokens may lie in the range of several millions. This in turn leads to very long vectors. The number of vectors to look at is given by the number of texts, which in itself may be hundreds of thousands or even millions. You can do the math for RAM requirements with a 16- or 32-bit representation yourself. As a consequence you may have to accept that you can do vectorization only batch-wise. And this may be time-consuming and may require intermediate manual interaction with your programs – especially when working with Python code and Jupyter notebooks (see below).

For the analysis of huge text corpora the snake of tokenizing, TF-IDF calculation, reasonable shortening and vectorization for ML may bite its own tail:
We need TF-IDF to shorten texts in a reasonable way and to avoid later memory problems during vectorization for ML tasks. But sometimes our tool-kit provides “tf-idf”-data only after a first vectorization, which may not be feasible due to the size of our corpus and the size of the resulting vocabulary of the tokenizer.

A typical example is given by the Keras tokenizer. If you have a big corpus with millions of texts and tokens vectorization may not be a good recipe to get your TF-IDF values. Especially in the case of post-OCR applications you may not be allowed to throw away any of the identified tokens before you have corrected them for possible OCR errors. And your ML-mechanism for such correction may depend on the results of some kind of TF-IDF analysis.

In such a situation one must invest some (limited) effort into a “manual” calculation of TF-IDF values. You then have to pick some formula from a text book on NLP and/or ML-based text analysis. But having done so you may soon find out that your (text-book) formula for a “TF-IDF”-calculation does not reproduce the values the tokenizer of a selected NLP framework would have given you after a vectorization of your texts. Note that you can always check for such differences by using an artificially and drastically reduced corpus.

A formula for the TF-IDF calculation performed by the Keras tokenizer is one of the central topics of this post. For convenience I sometimes omit the hyphen in TF-IDF below.

Is it worth the effort to calculate TF-IDF values by your own code?

Tokenizers of an NLP tool-kit do not only identify individual tokens in a text, but are also capable of vectorizing texts. Vectorization leads to the representation of a text by an (ordered) series of integer or float numbers, i.e. a vector, which in a unique way refers to the words of a vocabulary extracted from the text collection:

The indexed position in the vector refers to a specific word in the vocabulary of the text corpus, the value given at this position instead describes the word’s (statistical) appearance in a text in some way.

We saw already that “one-hot”-encoding for vectorization may result in a “bag-of-words”-model: A word appearing in a text is marked by a “one” (or integer) in an indexed vector referring to tokens of the ordered vocabulary derived by a tokenizer. But vectorization can be performed in different modes, too: The “ones” (1) in a simple “one-hot-encoded” vector can e.g. be replaced by TF-IDF values (floats) of the words (tfidf-mode). This is the case for the Keras Tokenizer: By using the respective Keras tokenizer functions you may get the aspired TF-IDF values for reducing the texts during a vectorization run. However, all one-hot-like encodings of texts come with a major disadvantage:

The length of the word vectors depends on the number of words the tokenizer has identified over all texts in a collection for the vocabulary.

If you have extracted 3 million words (tokens) out of hundreds of thousands of texts you may run into major trouble with the RAM of your PC (and with CPU-time). Most tokenizers allow for a (manual) sequential vectorization approach for a limited number of texts to overcome memory problems under such circumstances. What does this mean practically when working with Jupyter notebooks?

Well, if you work with notebooks on a PC with a small amount of RAM – and small may mean 128 GB (!) in some cases – you may have to perform a sequence of vectorization runs, each with maximally some hundred or thousand texts. Then you may have to export your results and afterward manually reset your notebook kernel to get rid of the RAM consumption – just because the standard garbage collection of your Python environment may not work fast enough.

I recently had this problem in a project with 200,000 texts, the Keras tokenizer and a vocabulary of 2.4 million words (where words with less than 4 characters were already omitted). The Keras tokenizer produces almost all relevant data for a “manual” calculation of TF-IDF values after it has been applied to a corpus. In my case the CPU-time required to tokenize and build a vocabulary for the 200,000 texts was only around 20 secs. A manual and sequential approach to create all TF-IDF values via vectorization, however, required about an hour. This was due both to the time the vectorization and tf-idf calculation needed and the time required for resetting the notebook kernel.

After I had decided to implement the TFIDF calculation on my own in my codes I could work on the full corpus and get the values within a minute. So, if one has to work on a big corpus multiple times with some iterative processes (e.g. in post-OCR procedures) we may talk of a performance difference in terms of hours.

Therefore, I would in general recommend performing TF-IDF calculations with your own code segments when confronted with big corpora – and not relying on the (vectorizing) tools of a framework. Besides performance aspects, another reason for TF-IDF functions programmed by yourself is that you afterward know exactly what kind of TF-IDF formula you have used.

TF-IDF formulas: The “IDF”-term – and what does the Keras tokenizer use for it?

The TF-IDF data describe the statistical overabundance of a token in a specific text by some formula measuring the token’s frequency in the selected text in comparison to the frequency over all texts in a weighted and normalized way.

During my own “TF-IDF”-calculations based on some Python code and basic tokenizer data, I, of course, wanted to reproduce the values the Keras tokenizer gave me during my previous vectorization approach. To my surprise it was rather difficult to achieve this goal. Just using a reasonable “TF-IDF”-formula taken from some NLP text-book simply failed.

The reason was that “TF-IDF”-data can be and indeed are calculated in different ways, both in NLP literature and in NLP frameworks. The Keras tokenizer does it differently than the tokenizer of Scikit-learn – actually for both the TF- and the IDF-part. There is a basic common structure behind a normalized TF-IDF value; there are, however, major differences in the details. Let's look at both points.

Everybody who has once in his/her life programmed a search engine knows that the significance of a word for a specific text (of an ensemble) depends on the number of occurrences of the word inside the specific text, but also on the frequency of the very same word in all the other texts of a given text collection:

If a word also appears very often in all other texts of a text ensemble then the word is not very significant for the specific text we are looking at.

Examples are typical “stop-words” – like “this” or “that” or “and”. Such words appear in very many texts of a corpus of English texts. Therefore, stop-words are not significant for a specific text.

Thus we expect that a measure of the statistical overabundance of a word in a selected text is a combination of the abundance in this specific text and a measure of the frequency in all the other texts of the corpus. The “TF-IDF” quantity follows this recipe. It is a combination of the so-called “term frequency” [tf(t)] with the “inverse document frequency” [idf(t)], with “t” representing a special token or term:

tfidf(t)   =   tf(t)   *   idf(t)

While the term frequency measures the occurrence or frequency of a word within a selected text, the “idf”-factor measures the frequency of a word in multiple texts of the collection. To get some weighting and normalization into this formula, the “idf”-term is typically based on the natural logarithm of a fraction formed

  • by the number of texts NT comprised by a corpus as the numerator
  • and the number of documents ND(t) in which a special word or term appears as the denominator.

Once again: A TF-IDF value is always characteristic of a word or term and the specific text we look at. But, the “idf”-term is calculated in different manners in various text-books on text-analysis. Most variants try to avoid the idf-term becoming negative or want to avoid a division by zero; typical examples are:

  1. idf(t) = log( NT / (ND + 1) )

  2. idf(t) = log( (1 + NT) / (ND + 1) )

  3. idf(t) = log( 1 + NT / (ND + 1) )

  4. idf(t) = log( 1 + NT / ND )

  5. idf(t) = log( (1 + NT) / (ND + 1) ) + 1

Note: log() represents the natural logarithm above.
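To see how much the variants actually differ, one can code them side by side. The numbers NT = 1000 and ND = 10 below are arbitrary example values:

```python
import numpy as np

# The five IDF variants from the list above; NT = number of texts in the
# corpus, ND = number of documents in which the term appears
idf_variants = [
    lambda NT, ND: np.log(NT / (ND + 1)),            # 1.
    lambda NT, ND: np.log((1 + NT) / (ND + 1)),      # 2. (Raschka)
    lambda NT, ND: np.log(1 + NT / (ND + 1)),        # 3. (Keras)
    lambda NT, ND: np.log(1 + NT / ND),              # 4. (Scikit-learn)
    lambda NT, ND: np.log((1 + NT) / (ND + 1)) + 1,  # 5.
]

for k, idf in enumerate(idf_variants, 1):
    print(f"variant {k}: idf = {idf(1000, 10):.4f}")
```

For such values the first four variants differ only in the second or third decimal, while the fifth variant shifts everything by 1. For small corpora or very frequent terms the spread becomes much larger.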

Who uses which IDF version?
The second variant appears e.g. in a book of S. Raschka (see below) on “Python Machine Learning” (2016, Packt Publishing).

The fourth version in the list above is used in Scikit-learn according to https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html

This is consistent with Raschka’s version in so far as he himself characterizes the Scikit-learn “TF-IDF” version as:

tfidf(t) = tf(t) * [ idf(t) + 1 ]

Keras: The third variant is the one you find in the source code of the Keras tokenizer. The strange thing is that you also find a reference in the Keras code which points to a section in a Wikipedia article that actually reflects the fourth form (!) given in the list above.

Source code excerpt of the Keras Tokenizer (as of 10/2021):

.....
.....
elif mode == 'tfidf':
    # Use weighting scheme 2 in
    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
    tf = 1 + np.log(c)
    idf = np.log(1 + self.document_count /
                 (1 + self.index_docs.get(j, 0)))
    x[i][j] = tf * idf
.....
.....

What we learn from this is that there are several variants of the “IDF”-term out there. So, if you want to reproduce the TF-IDF numbers of a certain NLP framework, you had better look into the code of your framework’s classes and functions – if possible.

Variants of the “term frequency” TF? Yes, they do exist!

While I had already become aware of the existence of different IDF-variants, I did not know at all that there were and are even differences regarding the term frequency “tf(t)“. Normally, one would think that it is just the number describing how often a certain word or term appears in a specific text, i.e. the token’s frequency for the selected text.

Let us, for example, assume that we have turned a specific text via a tokenizer function into a “sequence” of numbers. An entry in this sequence refers to a unique number assigned to a word of a somehow sorted tokenizer vocabulary. A tokenizer vocabulary is typically represented by a Python dictionary where the key is the word itself (or a hash of it) and the value corresponds to a unique number for the word. (Hint: In my applications for texts I always create a supplementary dictionary, which I call “switched_vocab”, with keys and values switched (number => word). Such a dictionary is useful for a lot of analysis steps)

A sequence can typically be represented by a Python list of numbers “li_seq”: the position in the list corresponds to the word’s position in the text (marked by separators), the number given corresponds to the word’s unique index number in the vocabulary.
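For illustration, here is a toy vocabulary and the sequence it produces (words and index numbers are invented for the example):

```python
# Toy vocabulary: word -> unique index number (invented for illustration)
vocab = {'cat': 1, 'sat': 2, 'mat': 3, 'the': 4}

text   = 'the cat sat the mat'
li_seq = [vocab[w] for w in text.split()]   # -> [4, 1, 2, 4, 3]
```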

Then, with Python 3, a straight-forward code snippet to get simple tf-values (as the count of a number’s occurrences in the sequence) would be

from collections import Counter

ind_w   = li_seq[i]    # with "i" selecting a specific point or word in the sequence 
d_count = Counter(li_seq)
tf      = d_count[ind_w]

This code creates a Counter dictionary “d_count” which maps each unique word number appearing in the original sequence to the number of its occurrences in the text’s sequence – i.e. in the text we are looking at.

Does the Keras tokenizer calculate and use the tf-term in this manner when it vectorizes texts in tfidf-mode? No, it does not! And this was a major factor for the differences in comparison to the TFIDF-values I naively produced for my texts.

With the terms above the Keras tokenizer instead uses a logarithmic value for tf (= TF):

import numpy as np
from collections import Counter

ind_w   = li_seq[i]    # i selecting a specific point or word in the sequence 
d_count = Counter(li_seq)
tf      = 1.0 + np.log(d_count[ind_w])

This makes a significant difference in the derived “TF-IDF” values – even if one had gotten the “IDF”-term right!
Please note that all of the variants used for the TF- and the IDF-terms have their advantages and disadvantages. You should at least know exactly which formula you use in your analysis. In my project the Keras way of doing TF-IDF was useful, but there may be cases where another choice is appropriate.
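The following toy comparison shows how far the two TF variants can diverge for a frequent word (the index sequence is invented):

```python
import numpy as np
from collections import Counter

li_seq  = [7, 3, 7, 7, 2, 3, 7]       # invented index sequence of one text
d_count = Counter(li_seq)

tf_raw   = d_count[7]                 # simple count: 4
tf_keras = 1.0 + np.log(d_count[7])   # Keras-style: 1 + ln(4)
```

The log-scaled variant dampens the influence of very frequent words: a word appearing four times gets a TF of only about 2.39 instead of 4.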

Quick and dirty Python code to calculate TF-IDF values manually for a list of texts with the Keras tokenizer

For reasons of completeness, I outline some code fragments below, which may help readers to calculate “TF-IDF”-values, which are consistent with those produced during “sequences to matrix”-vectorization runs with the Keras tokenizer (as of 10/2021).

I assume that you already have a working Keras implementation using either CPU or GPU. I further assume that you have gathered a collection of texts (cleansed by some Regex operations) in a column “txt” of a Pandas dataframe “df_rex”. We first extract all the texts into a list (representing the corpus) and then apply the Keras tokenizer:

import numpy as np
from collections import Counter

from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 1800000    # or whatever number of words you want to be taken into account from the vocabulary  

li_txts = df_rex['txt'].to_list()
tokenizer = Tokenizer(num_words=num_words, lower=True) # converts tokens to lower-case 
tokenizer.fit_on_texts(li_txts)    

vocab   = tokenizer.word_index
w_count = tokenizer.word_counts
w_docs  = tokenizer.word_docs
num_tot_vocab_words = len(vocab) 
    
# Switch vocab - key <> value 
# ****************************
switched_vocab = dict([(value, key) for key, value in vocab.items()])

Tokenizing should be a matter of seconds or a few tens of seconds, depending on the number and the length of the texts. In my case with 200,000 texts, on average each with 2000 words, it took 25 secs and produced a vocabulary of about 2.4 million words.

In a next step we create “integer sequences” from all texts:

li_seq_full  = tokenizer.texts_to_sequences(li_txts)
leng_li_seq_full = len(li_seq_full)

Now, we are able to create a super-list of lists – including a list of tf-idf-values per text:

li_all_txts = []

j_end = leng_li_seq_full
for j in range(0, j_end):
    li_text = []
    li_text.append(j)

    leng_seq = len(li_seq_full[j])
    li_seq     = []
    li_tfidf   = []
    li_words   = []
    d_count    = {}

    d_count  = Counter(li_seq_full[j])
    for i in range(0,leng_seq):
        ind_w    = li_seq_full[j][i] 
        word     = switched_vocab[ind_w]
        
        # calculation of tf-idf
        # ~~~~~~~~~~~~~~~~~~~~~
        # https://github.com/keras-team/keras-preprocessing/blob/1.1.2/keras_preprocessing/text.py#L372-L383
        # Use weighting scheme 2 in https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        dfreq    = w_docs[word] # document frequency 
        idf      = np.log( 1.0 + (leng_li_seq_full)  / (dfreq + 1.0) )
        tf_basic = d_count[ind_w]
        tf       = 1.0 + np.log(tf_basic)
        tfidf    = tf * idf 
                
        li_seq.append(ind_w) 
        li_tfidf.append(tfidf) 
        li_words.append(word) 

    li_text.append(li_seq)
    li_text.append(li_tfidf)
    li_text.append(li_words)

    li_all_txts.append(li_text)

leng_li_all_txts = len(li_all_txts)

This last run took about 1 minute in my case. Getting the same numbers with a sequential approach – calculating Keras vectorization matrices in tf-idf mode for around 6000 texts in each run, with in-between memory cleansing – took me around an hour and required continuous manual system interactions. Admittedly, such a run provides the encoded text vectors as one of its major products, but we actually do not need these vectors just to evaluate TF-IDF values.

Conclusion

In this article I have demonstrated that “TF-IDF”-values can be calculated almost directly from the output of a tokenizer like the Keras Tokenizer. Such a “manual” calculation by one’s own coded instructions is preferable to e.g. a Keras-based vectorization run in “tf-idf”-mode when both the number of texts and the vocabulary of your text corpus are huge. “tf-idf”-word-vectors may easily reach a length of more than a million words for reasonably complex text ensembles. This poses RAM problems on many PC-based systems.

With TF-IDF values calculated by your own code functions you will get a measure for the significance of words or tokens within a reasonable CPU time and without any RAM problems. The evaluated “TF-IDF”-values may afterward help you to shorten your texts in a well-founded and reasonable way before you vectorize them, i.e. ahead of applying advanced ML-algorithms.

Formulas for TF-IDF values used in the literature and in various NLP tool-kits or frameworks do differ with respect to both the TF- and the IDF-terms. You should be aware of this fact and choose one of the formulas given above carefully. You should also specify the formula used during your text analysis work when presenting results to your customers.