More fun with veth and Linux network namespaces – VII – two namespaces connected by a veth based VLAN

In the last post of this series we started to look at VLANs based on veths and Linux bridges. I presented multiple options for configuring a veth endpoint in a network namespace so that it can communicate with other namespaces via tagged Ethernet packets.

We saw that we can start splitting network traffic already within a multi-homed namespace which is connected to different VLAN segments. However, a veth endpoint with multiple interfaces also poses a basic ambiguity problem: ARP and ICMP requests must be directed into the right one of the attached (V)LAN segments. From the results of previous experiments we would assume that the Linux kernel resolves this ambiguity by following routes.

In this post we will study the simplest VLAN configuration one can think of: We connect two network namespaces directly with a veth based VLAN line, i.e. we use a veth connection to transmit tagged packets. As this would be a bit boring, we add some pepper to the scenario by allowing for untagged packets, too.

To achieve this we reduce “option 4” discussed in the last post to a one-armed solution: We allow for only one VLAN interface per veth endpoint (see a sketch of the scenario below). We start a series of experiments by assigning an IP address only to the veth’s main interface. The setting for the source validation kernel parameter rp_filter will be relaxed.
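A setup of this kind could be built roughly as follows. This is only a sketch under assumptions: the namespace and device names, the VLAN ID 30, the addresses and the choice of `rp_filter=2` (loose mode) as the "relaxed" setting are all illustrative. By default the script just prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: names, VLAN ID, addresses and rp_filter value are assumptions.
# Prints the commands by default; run with DRY_RUN=0 as root to apply them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_vlan_line() {
  local vid=30 n dev
  # Two namespaces joined directly by one veth pair
  run ip netns add netns1
  run ip netns add netns2
  run ip link add veth1V netns netns1 type veth peer name veth2V netns netns2

  for n in 1 2; do
    dev="veth${n}V"
    # Exactly one VLAN interface per veth endpoint (the one-armed variant)
    run ip netns exec "netns$n" ip link add link "$dev" name "$dev.$vid" type vlan id "$vid"
    # IP address on the main (trunk) interface only
    run ip netns exec "netns$n" ip addr add "192.168.5.$n/24" dev "$dev"
    run ip netns exec "netns$n" ip link set "$dev" up
    run ip netns exec "netns$n" ip link set "$dev.$vid" up
    # Relax reverse-path filtering inside the namespace (2 = loose mode)
    run ip netns exec "netns$n" sysctl -w net.ipv4.conf.all.rp_filter=2
  done
}

setup_vlan_line
```

Printing the commands first makes it easy to review them before touching the network configuration of a real system.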

In the experiments of this post we focus on

  • some claimed aspects regarding the role of the main trunk interface of a veth endpoint,
  • the potential impact of routes on the ARP communication,
  • the fact that working ARP does not mean working ICMP, for a variety of different reasons.

Regarding the first point we will see that tagged packets just traverse the trunk interfaces on their way from and to the VLAN interface. Regarding route settings we will look at 36 possible variants. We will see that under the given conditions asymmetric route settings may or may not disable communication – already on the ARP level. But even if ARP works and even if we had symmetric routes in the namespaces, ICMP may not function. I will try to isolate the causes of positive and negative results for ARP and ICMP requests.

I recommend that readers who want to perform these experiments on their own watch the system log in parallel.


Leap/SLES 15.5 – strange compatibility problem between tcpdump, libpcap and arping from iputils

My readers know that I am presently working with virtual networks again. A part of my studies is related to ARP and routes on veth devices with VLAN interfaces. I follow the packet transfer across VLANs with tcpdump, which itself depends on libpcap. On Leap 15 the relevant package is libpcap1. ARP requests were generated with the arping command, which becomes available after an installation of the package “iputils”.

This worked perfectly on a laptop. Now, for a presentation, I had to use another system with the same kernel version and virtual networking support. There I got strange messages for ARP packets passing a veth endpoint’s main device (in my example: veth2V) on their way to a VLAN interface (veth2V.30) on the same veth endpoint:

netns2:~ # tcpdump -n -e -i any -v
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

15:15:36.334692 veth2V B   ifindex 2 46:b9:81:00:00:1e ethertype ARP (0x0806), length 52: Unknown Hardware (36461) (len 0), Unknown Protocol (0x0000) (len 1), Unknown (2048) 
        0x0000:  8e6d 0000 0001 0800 0604 0001 46b9 8b4b  .m..........F..K
        0x0010:  8e6d c0a8 0501 ffff ffff ffff c0a8 0502  .m..............
15:15:36.334692 veth2V.30 B   ifindex 4 46:b9:8b:4b:8e:6d ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.5.2 (ff:ff:ff:ff:ff:ff) tell 192.168.5.1, length 28
15:15:36.334720 veth2V.30 Out ifindex 4 d2:85:39:c8:43:fc ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Reply 192.168.5.2 is-at d2:85:39:c8:43:fc, length 28
15:15:36.334721 veth2V Out ifindex 2 d2:85:81:00:00:1e ethertype ARP (0x0806), length 52: Unknown Hardware (17404) (len 0), Unknown Protocol (0x0000) (len 1), Unknown (2048) 
        0x0000:  43fc 0000 0001 0800 0604 0002 d285 39c8  C.............9.
        0x0010:  43fc c0a8 0502 46b9 8b4b 8e6d c0a8 0501  C.....F..K.m....

I had never seen such messages in comparable experiments with veths before, in particular not the ones about “Unknown Hardware”. In addition, the length reported for the Ethernet packets on the main device was wrong (52 instead of 48). I did not get such errors on my laptop, where I had prepared the setups of the virtual VLANs.
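The odd bytes in the broken lines can be decoded by hand. The address field printed for veth2V ends in `81:00:00:1e`: `0x8100` is the 802.1Q TPID, and the lower 12 bits of `0x001e` give VLAN ID 30, which matches veth2V.30. A tag of 4 bytes would also fit the length of 52 instead of 48. This is only a reading of the bytes in the output above, not a verified diagnosis of the library. The function names in this small sketch are made up for illustration:

```shell
#!/usr/bin/env bash
# Decode an 802.1Q tag control field (TCI) given as hex, e.g. 0x001e.
# Function names are illustrative; only shell arithmetic is used.

vlan_id()  { echo $(( $1 & 0x0fff )); }        # lower 12 bits: VLAN ID
vlan_pcp() { echo $(( ($1 >> 13) & 0x7 )); }   # upper 3 bits: priority

# Last four bytes of the address printed for veth2V: 81:00:00:1e
tpid=0x8100
tci=0x001e

printf 'TPID 0x%04x, VLAN ID %d, priority %d\n' \
  "$tpid" "$(vlan_id "$tci")" "$(vlan_pcp "$tci")"

# In the hex dump the ARP header (0001 0800 0604 ...) starts 4 bytes late.
# The first two bytes, 8e6d, are therefore read as the ARP hardware type:
printf 'hardware type read from 0x8e6d: %d\n' 0x8e6d
```

The second value, 36461, is exactly the number shown in the “Unknown Hardware (36461)” message.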

It took me some time to find the difference between the systems: iputils as well as tcpdump on both systems came from the Network:Utilities repository
“https://download.opensuse.org/repositories/network:/utilities/15.5/”.

However, libpcap1 on the presentation system came from the main SLES 15.5 OSS repository in version 1.10.1-150400.1.7. On my laptop I had instead fetched the library from the Network:Utilities repository, too.

Changing libpcap1 to the present version 1.10.4-lp155.92.1 from the Network:Utilities repository led to correct tcpdump information:

netns2:~ # tcpdump -n -e -i any -v
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes

15:38:44.899645 veth2V B   ifindex 2 f2:5b:23:ba:16:8a ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.5.2 (ff:ff:ff:ff:ff:ff) tell 192.168.5.1, length 28
15:38:44.899645 veth2V.30 B   ifindex 4 f2:5b:23:ba:16:8a ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.5.2 (ff:ff:ff:ff:ff:ff) tell 192.168.5.1, length 28
15:38:44.899667 veth2V.30 Out ifindex 4 9a:88:24:0c:f9:99 ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Reply 192.168.5.2 is-at 9a:88:24:0c:f9:99, length 28
15:38:44.899669 veth2V Out ifindex 2 9a:88:24:0c:f9:99 ethertype ARP (0x0806), length 48: Ethernet (len 6), IPv4 (len 4), Reply 192.168.5.2 is-at 9a:88:24:0c:f9:99, length 28

Interestingly, the problem also comes up if one switches all three packages (tcpdump, iputils and libpcap1) to the versions from the main Leap/SLES 15.5 repository.

So, there seems to be something severely wrong with libpcap1 of the main Leap/SLES 15.5 repositories.
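A mismatch like this is easier to spot when the installed versions of the three packages are listed side by side. The following is a minimal sketch, assuming an rpm-based system. The helper `older_of` is a made-up name and relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Sketch: show installed version and vendor of the three packages involved.
# Assumes an rpm-based system (Leap/SLES); does nothing where rpm is missing.

# Print the older of two VERSION-RELEASE strings (uses GNU "sort -V").
older_of() { printf '%s\n' "$1" "$2" | sort -V | head -n 1; }

if command -v rpm >/dev/null 2>&1; then
  for pkg in libpcap1 tcpdump iputils; do
    rpm -q --queryformat '%{NAME} %{VERSION}-%{RELEASE} vendor=%{VENDOR}\n' "$pkg" || true
  done
fi

# The two libpcap1 versions named in this post
older_of 1.10.4-lp155.92.1 1.10.1-150400.1.7
```

On Leap/SLES, `zypper info libpcap1` typically also shows the repository a package belongs to.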

 

WordPress 6.5, a bug in plugin Kadence Importer, defunct FooGallery pages – and some lessons for WP error finding

My wife still administers websites for customers. Some of them are based on WordPress. After an upgrade to WP 6.5 some pages of a customer website went down in the middle of last week.

In addition, the page preview for authors no longer worked on these specific pages.

The standard WP message about a critical error on the few affected pages, and messages from the browser developer tools that the pages were found to be in “Quirks mode”, were not really helpful, especially as the headers of the pages in question were consistently correct!

The common property of all these pages was:

They included relatively large FooGalleries with image numbers beyond 35.

As the customer had to participate in a fair pretty soon and wanted to prepare even more pages with large FooGalleries, he and my wife came under time pressure. Both asked me for help. Well, it took me some time to find the culprit. But I was stupid enough not to follow three simple and reasonable rules for such cases:

  • Check for PHP errors first!
  • Systematically deactivate or even delete unnecessary plugins afterward.
  • Never perform several changes in parallel when trying to isolate faulty SW and errors.

Instead I was misguided by experimenting with the FooGalleries first.

As a first measure we performed all the steps recommended by FooGallery for such problems. No effect.

Then I found that if one emptied the thumbnail cache completely, the web pages with the large galleries could be loaded just once (either by a page preview or by a call from a browser via a VPN). The second call, however, crashed the page. Because of this experience I focused on a caching problem first, which obviously existed …

Afterward I had a look at a reference test site with WP version 6.0.3. There we also had some problem in the beginning.

Concentrating on potential cache handling problems, we thought after a while that we had a major problem with a certain Statify method for counting visitors in combination with caching plugins. Since the time when the website owner had installed the W3TC plugin, Statify had been set to a JavaScript based evaluation of site visits without nonce check. Meanwhile W3TC had been removed due to some major problems with this plugin during WP and PHP upgrades, but the Statify settings had remained the same. On our test site changing them appeared to help immediately. But I had overlooked that my wife had in the meantime also reduced the number of plugins to a minimum.

Again a substantial mistake in a process of finding errors! Never let several people work on different investigation fronts in parallel without coordinating the changes step by step! And do not let pressure from customers affect you.

Correcting the Statify method for counting visitors on the main web-site did not prove to be a remedy in the long run. Some test pages which had not worked before now ran flawlessly. The stupid thing was that for other major pages a cache problem remained and after some time some pages became defunct again. But only some!

On the customer’s website the following plugins were installed: Kadence theme, Statify, FooGallery, FooBox Image Lightbox, Pinnacle Toolkit, Kadence blocks, Kadence Importer.

To make a long story short: Kadence Importer (as of the present version V2.1.1) has a major PHP bug. Had I run the website in debug mode from the beginning, I would have found it and saved myself hours of testing. And I am now pretty sure that the problem had existed undetected for a longer time. So even returning to a backup from before the upgrade would not have helped.

Deactivating and deleting this plugin, which is not needed on production WP sites anyway (!), removed all problems. I cannot go into details here why this stupid plugin created caching problems. It is bad, so just do not use it!

Lessons learned

  1. Do not let the customer play with WP-plugins. Mistrust any plugins which basically are made for advertisement.
  2. Strictly cling to a step-wise strategy for isolating faulty SW after upgrades. Do not let the pressure of a customer make you deviate from your planned sequence of steps.
  3. First look for PHP errors in WP debug mode.
  4. Have a test site available and compare the plugin status and settings in detail.
  5. Deactivate and remove all unnecessary plugins in a systematic manner.
  6. Keep track of special settings of WP plugins, in particular with respect to other caching plugins. And do not let the customer fiddle with such parameters without documenting the changes.
  7. And, of course: Make a full backup of your site before upgrades. Deactivate any automatic upgrade.
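The rule to look for PHP errors first can be supported by a small script. WordPress writes PHP errors to a log file when the constants `WP_DEBUG` and `WP_DEBUG_LOG` are set to `true` in `wp-config.php`. The following sketch counts PHP errors per plugin directory in such a log. It assumes the default log location `wp-content/debug.log`, and the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count PHP errors per plugin directory in a WordPress debug log.
# Assumes WP_DEBUG and WP_DEBUG_LOG are enabled in wp-config.php; the log
# path below is the WordPress default and may differ on your site.

php_errors_by_plugin() {
  local log="$1"
  # Keep only PHP error lines, extract the plugin directory, count per plugin
  grep -E 'PHP (Fatal error|Parse error|Warning)' "$log" \
    | grep -oE 'wp-content/plugins/[^/]+' \
    | sort | uniq -c | sort -rn
}

log="${1:-wp-content/debug.log}"
if [ -r "$log" ]; then
  php_errors_by_plugin "$log"
else
  echo "no readable debug log at $log" >&2
fi
```

A plugin that shows up at the top of this list is a good first candidate for deactivation on the test site.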

I hope this helps other WP admins.