Performance of Linux md raid-10 arrays – negative impact of the intel_pstate CPU governor for HWP?

Last November I performed some tests with "fio" on Raid arrays with SSDs (4 Samsung EVO 850). The test system ran on Opensuse Leap 42.1 with a Linux kernel of version 4.1. It had an onboard Intel Sunrise Point controller and an i7 6700k CPU.

For md-raid arrays of type Raid10, i.e. for Linux SW Raid10 arrays created via the mdadm command, I was quite pleased with the results. Both for N2 and F2 layouts and especially for situations where the Read/Write load was created by several jobs running in parallel. Depending on the size of the data packets you may, e.g., well reach a Random Read [RR] performance between 1.0 Gbyte/sec and 1.5 GByte/sec for packet sizes ≥ 512 KB and %gt; 1024k, respectively. Even for one job only the RR-performance for such packet sizes lay between 790 MByte/sec and 950 Mbyte/sec - i.e. well beyond the performance of a single SSD.

Significant drop in md-raid performance on Opensuse Leap 42.2 with kernel 4.4

Then I upgraded to Opensuse Leap 42.2 with kernel 4.4. Same system, same HW, same controller, same CPU, same SSDs and raid-10 setup.
I repeated some of my raid-array tests. Unfortunately, I then experienced a significant drop in performance - up to 25% depending on the packet size. With absolute differences in the range of 60 MByte/sec to over 200 Mbyte/sec, again depending on the chosen data packet sizes.

In addition I saw irregular ups and downs in the performance (large spread) for repeated tests with the same fio parameters. Up to some days ago I never found a convincing reason for this strange performance variation behavior of the md-Raid arrays for different kernels and OS versions.

Impact of the CPU governor ?!

Three days ago I read several articles about CPU governors. On my system the "intel_pstate" driver is relevant for CPU power saving. It offers exactly two active governor modes: "powersave" and "performance" (see e.g.: https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pstate.html).

The chosen CPU governor standard for Opensuse Leap 42.2 is "powersave".

Just for fun I repeated some simple tests for the raid array again - one time with "powersave" active on all CPU cores/threads and a second time with "performance" active on all cores/threads. The discrepancy was striking:


Test setup

HW: CPU i7 6700K, Asus Z170 Extreme 7 with Intel Sunrise Point-H Sata3 controller

Raid-Array: md-raid-10, Layout: N2, Chunk Size: 32k, bitmap: none

SSDs: 4 Samsung Evo 850

Fio parameters:

size=500m
directory=/mnt2
direct=1
ioengine=sync
bs=512k
; bs=1024k
iodepth=1
numjobs=1
[read]
rw=randread
[write]
stonewall
rw=randwrite


Test results:

Fio test case 1 - bs=512k, Random Read [RR] / Random Write [RW]:

Leap 42.1, Kernel 4.1:
RR: 780 MByte/sec - RW: 754 MByte/sec,
RR Spread around 35 Mbyte/sec

Leap 42.2, Kernel 4.4, CPU governor powersave :
RR: 669 MByte/sec - RW: 605 MByte/sec,
RR Spread around 50 Mbyte/sec

Leap 42.2, Kernel 4.4, CPU governor: performance :
RR: 780 MByte/sec - RW: 750 MByte/sec,

RR Spread around 35 Mbyte/sec


Fio test case 2 - bs=1024k, Random Read [RR] / Random Write [RW]:

Leap 42.1, Kernel 4.1:
RR: 860 MByte/sec - RW: 800 MByte/sec
RR Spread around 30 Mbyte/sec

Leap 42.2, Kernel 4.4, CPU governor: powersave :
RR: 735 MByte/sec - RW: 660 MByte/sec
RR Spread > 50 Mbyte/sec

Leap 42.2, Kernel 4.4, CPU governor: performance :
RR: 877 MByte/sec - RW: 792 MByte/sec
RR Spread around 25 Mbyte/sec

Interpretation

The differences are so significant that one begins to worry. The data seem to indicate two points:

  • It seems that a high CPU frequency is required for optimum performance of md-raid arrays with SSDs.
  • It seems that the Leap 42.1 with kernel 4.1 reacts differently to load requests from fio test runs - with resulting CPU frequencies closer to the maximum - than Leap 42.2 with kernel 4.4 in powersave mode. This could be a matter of different CPU governors or major changes in drivers ...

With md-raid arrays active, the CPU governor should react very directly to a high I/O load and a related short increase of CPU consumption. However, the md-raid-modules seem to be so well programmed that the rise in CPU load is on average below any thresholds for the "powersave"-governor to react adequately. At least on Leap 42.2 with kernel 4.4 - for whatever reasons. Maybe the time structure of I/O and CPU load is not analyzed precisely enough or there is in general no reaction to I/O. Or there is a bug in the related version of intel_pstate ...

Anyway, I never saw a rise of the CPU frequency above 2300 Mhz during the tests on Leap 42.2 - but short spikes to half of the maximum frequency are quite normal on a desktop system with some applications active. Never, however, was the top level of 4200 Mhz reached. I tested also with 2000 MB to be read/written in total - then we talk already of time intervals around 3 to 4 secs for the I/O load to occur.

Questions

When reading a bit more, I got the impression, that the admins possibilities to influence the behavior of the "intel_pstate" governors are very limited. So, some questions arise directly:

  1. What is the major difference between OS Leap 42.1 with kernel 4.1 compared to Opensuse 42.2 with kernel 4.4? Does Leap 41.1 use the same CPU governor as Leap 42.2?
  2. Is the use of Intel's standard governor mode "powersave" in general a bad choice on systems with a md-raid for SSDs?
  3. Are there any kernel or intel-pstate parameters that would change the md-raid-performance to the better again?

If somebody knows the answers, please contact me. The whole topic is also discussed here: https://bugzilla.kernel.org/show_bug.cgi?id=191881. At least I hope so ...

Addendum, 19.06.2017:
Answer to question 1 - I checked which CPU governor runs on Opensuse Leap 42.1:
It is indeed the good old (ACPI-based) "ondemand" governor - and not the "new" intel_pstate based "powersave" governor for Intel's HWP. So, the findings described above raise some questions regarding the behavior of the "intel_pstate based "powersave" governor vs. comparable older CPU_FREQ solutions like the "ondemand" governor: The intel_pstate "powersave" governor does not seem to support md-raid as well as the old "ondemand" governor did!

I do not want to speculate too much. It is hard to measure details of the CPU frequency changes on small timescales, and it is difficult to separate the cpu consumption of fio and the md-modules (which probably work multithreaded). But it might be that the "ondemand" governor provides on average higher CPU frequencies to the involved processes over the timescale of a fio run.

Anyway, I would recommend all system admins who use SSD-Raid-arrays under the control of mdadm to check and test for a possible performance dependency of their Raid installation on CPU governors!