In my last article
Linux SSD partition alignment – problems with external USB-to-SATA controllers – I
I wrote about different partition alignments parted and YaST's Partitioner (Opensuse) applied when one and the same SSD (a Samsung 850 Pro) was attached
- to external USB-to-SATA controllers and the USB bus of a Linux system (alignment to multiples of 65535 logical 512-byte blocks)
- or directly to an internal SATA-III controller of a Linux system (1 MiB alignment: start sectors are multiples of 2048 logical 512-byte blocks).
We saw that different controllers led to the detection of different disk topology parameters (I/O limits) by the Linux system via the libblkid library. For one and the same SSD different values were reported by different controllers e.g. for the following parameters (I/O Limits data) :
physical_block_size, minimum_io_size, optimal_io_size
We saw in addition that even different Linux disk tools may report different values; e.g. fdisk showed a different value [512 byte] for the "optimal_io_size" for the SSD on the SATA bus than e.g. lsblk and parted .
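The values the kernel itself reports can be read directly from sysfs, independently of any partitioning tool. The following is a minimal sketch; the helper name show_io_limits and the device name "sda" in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Print the I/O limits the kernel exports for a block device queue.
# Argument: a sysfs queue directory, e.g. /sys/block/sda/queue
show_io_limits() {
    queue_dir=$1
    for name in logical_block_size physical_block_size minimum_io_size optimal_io_size; do
        if [ -r "$queue_dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$queue_dir/$name")"
        else
            # File missing or unreadable (e.g. device not present)
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# Usage (the device name "sda" is an assumption):
#   show_io_limits /sys/block/sda/queue
#   lsblk -t /dev/sda    # topology view with PHY-SEC, LOG-SEC, MIN-IO, OPT-IO
```

Comparing this output for the same SSD on both controllers shows which values come from the controller rather than from the disk.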
Guided by a Red Hat article https://people.redhat.com/msnitzer/docs/io-limits.txt we came to the conclusion that at least parted and YaST's Partitioner use heuristic rules for their alignment decisions. The rules take into account the values of the disk's "I/O Limits" parameters. They are consistent with a default of "optimal" for the alignment parameter of parted and provide a decision when the value of "optimal_io_size" is found to be zero. By applying these rules we could explain why we got different partition offsets and alignments for one and the same disk when it was attached to different controllers.
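The rule can be sketched in a few lines of shell. This is only my paraphrase of the heuristic as I understand it from the Red Hat text, not parted's actual code; the function name guess_alignment_sectors and the "undecided" fallback are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the alignment heuristic (paraphrased, not parted's code).
# Arguments: optimal_io_size, minimum_io_size, alignment_offset (all in bytes).
# Prints the alignment granularity in 512-byte sectors.
guess_alignment_sectors() {
    opt_io=$1; min_io=$2; align_off=$3
    if [ "$opt_io" -gt 0 ]; then
        # optimal_io_size is defined: align on its boundary
        echo $((opt_io / 512))
    elif [ "$align_off" -eq 0 ] && [ "$min_io" -gt 0 ] \
         && [ $((min_io & (min_io - 1))) -eq 0 ]; then
        # optimal_io_size is 0, no offset, minimum_io_size is a power of 2:
        # fall back to the 1 MiB default
        echo $((1048576 / 512))
    else
        # Case not covered by this sketch
        echo "undecided"
        return 1
    fi
}

guess_alignment_sectors 33553920 4096 0   # prints 65535
guess_alignment_sectors 0 512 0           # prints 2048
```

With an optimal_io_size of 33553920 bytes the sketch yields the 65535-sector grid, and with an optimal_io_size of zero it yields the 2048-sector (1 MiB) grid.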
But this insight left us in an uncomfortable situation:
- Should we cling to the chosen settings when we use the SSD on external controllers, only? Can we partition SSDs on an external USB-to-SATA controller and move them later directly to a SATA-bus without adjusting partition borders? We saw that "parted" would complain about misalignment when SSD partitions were prepared on a different controller.
- As many people discuss the importance of partition alignment for SSD performance - will we see a noticeable drop in performance when we read/write to "misaligned" partitions?
- We saw that at least a JMicron controller indicated a bundling of 8 logical 512 byte blocks into a 4096 byte (fake?) "physical block". Another question is therefore what happens when, after an installation, Grub writes something to the first sector of a disk with GPT layout, perhaps assuming a wrong disk topology. This is not as far-fetched as one may think; see the third link at the bottom for a disaster with an MBR.
I cannot answer all these questions in general. But in this article I will at least look a bit into performance issues and answer the last question for my test situation.
Could I boot when I transferred the disk from an external USB enclosure to an internal SATA controller?
Alarmed by the discussion in https://superuser.com/questions/679725/how-to-correct-512-byte-sector-mbr-on-a-4096-byte-sector-disk I tested what happened in the course of the following steps:
- I installed a bootable Opensuse Leap 15 on my SSD when it was attached to an external JMicron USB-to-SATA controller
- and afterwards attached the same SSD to my laptop's internal SATA-III controller.
The laptop is of the last generation with a BIOS system. I used Grub2 as boot manager - which operates on a legacy BIOS system via the first sector of my SSD with GPT layout and a BIOS boot partition. The root filesystem (with the "/boot" directory) resided inside an LVM volume of another partition which was used as a Volume Group. An additional LVM volume for SWAP was added. I even used fully LUKS-encrypted logical volumes (LUKS on LVM). All the underlying partitions were created when the SSD was in the external USB enclosure (with the JMicron controller). Thus these partitions were seen as misaligned when the disk was moved to the internal SATA bus interface of the laptop (see my first article). Could I boot Opensuse Leap 15 on the laptop from this SSD at the SATA bus?
Yes, I could. At least in my case the physical block size (of 8 logical blocks) "faked" by the JMicron controller and the strange alignment to multiples of 33553920 bytes (65535 blocks) had no effect on booting. Booting was also possible when I manually adjusted partition borders to a 1 MiB alignment from the beginning.
Why did I not experience the kind of disaster which was described in the article named above? Well, I do not know. But the guy there used a native MBR disk layout, whereas I used a GPT disk layout with a protective MBR and a BIOS boot partition. But the enclosure used by the other guy may have done some really funny things with addressing ...
A look at performance
At least in the early years of SSDs partition alignment to multiples of 2048 logical blocks was a major topic. And some sellers of disk tools still make a big fuss about it. Linux users are mollified by the statement that all present Linux partition tools do the alignment correctly (meaning a 1 MiB alignment) - and that you as a user do not have to think about it. My last article showed, however, that this mollification is sometimes based on heuristic decisions - and depends on disk property data which may be faked by external controllers.
So I got interested in the question whether I would notice consequences of a misalignment of a partition with respect to ... well, to what? Do we really know the internal page size SSD vendors use for a specific SSD? Do we know anything about the internal block mapping of the SSD controllers - such as the MEX controller of Samsung? But almost every relevant article on the Internet assumes that an alignment according to the 1 MiB rule leads to superior performance on Linux systems ... One of the arguments I read in a forum for the 1 MiB alignment was the following:
The SSDs are produced for the mass market which is dominated by Microsoft. And MS finds an alignment to 1 MiB (2048 logical blocks) sufficient. So, the vendors would not do anything that would lead to a disadvantage in their competition with other producers.
Oh my .. . So, Linux uses the 1 MiB alignment, too. At least when the disk topology fits to "heuristic" rules... But, how big are the performance deficits today really? How big are they for my Samsung 850 Pro (512GB) SSD - which is relatively new?
In my last article I had created a partition "/dev/sda5" which was misaligned on the SATA bus due to the fact that the JMicron controller had suggested an optimal_io_size of "33553920 bytes" (i.e. 65535 logical blocks). I had also created a partition "/dev/sda6" - aligned to the MiB rule; it was explicitly qualified as aligned by parted's align-check tool. Both partitions had a capacity of 5 GiB and were formatted with ext4. So, what do performance tests tell us for both partitions?
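A little arithmetic shows why a 65535-sector grid counts as "misaligned". The sketch below assumes 512-byte logical sectors and takes 4096 bytes as the reference unit (the typical filesystem block size; whether it matches the SSD's internal page size is unknown). The helper name remainder_bytes is hypothetical.

```shell
#!/bin/sh
# Byte remainder of a start sector (512-byte units) with respect to a unit size in bytes
remainder_bytes() {
    echo $(( ($1 * 512) % $2 ))
}

echo $((65535 * 512))          # prints 33553920 (one step of the 65535-sector grid)
echo $((2048 * 512))           # prints 1048576  (1 MiB)
remainder_bytes 65535 4096     # prints 3584: not on a 4096-byte boundary
remainder_bytes 2048 4096      # prints 0:    on a 4096-byte boundary
remainder_bytes 524280 4096    # 8 * 65535 sectors, prints 0
```

Since 65535 is odd, only every eighth multiple of it lands on a 4096-byte boundary, whereas every multiple of 2048 sectors does.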
Test with hdparm - and with SSD on the SATA-III bus
"hdparm -t" delivers results which are comparable to a sequential read. The reproducible results on my laptop with the disk on the SATA-III-controller were (after TRIM):
mylux:~ # hdparm -t /dev/sda5
/dev/sda5:
 Timing buffered disk reads: 1582 MB in 3.00 seconds = 539.22 MB/sec
mylux:~ # hdparm -t /dev/sda6
/dev/sda6:
 Timing buffered disk reads: 1624 MB in 3.00 seconds = 539.40 MB/sec
Tests with dd - and with SSD on the SATA-III bus
I first used the following commands on both partitions which were each mounted on /mnt (one after the other of course):
echo 3 | tee /proc/sys/vm/drop_caches
dd if=/dev/zero of=tmpfile bs=1M count=2048 conv=fdatasync,notrunc
echo 3 | tee /proc/sys/vm/drop_caches
dd if=tmpfile of=/dev/null bs=1M count=2048
The modification of "/proc/sys/vm/drop_caches" is required to empty the cache for read tests. I got the following results:
/dev/sda5   WRITE: 452 MB/s - 483 MB/s   READ: 545 MB/s - 547 MB/s
/dev/sda6   WRITE: 457 MB/s - 484 MB/s   READ: 545 MB/s - 546 MB/s
The second value for WRITE stems from a direct repetition of the write command without emptying the cache and with overwriting the "tmpfile"; in this case buffering probably has some effect. The second value for READ comes from repetition with emptying the cache.
Again: Basically no difference!
Then I changed to bs=4K, count=400000:
/dev/sda5   WRITE: 387 MB/s - 436 MB/s   READ: 542 MB/s - 543 MB/s
/dev/sda6   WRITE: 391 MB/s - 436 MB/s   READ: 545 MB/s - 546 MB/s
Almost no difference.
Tests with fio - and with SSD on the SATA-III-bus
I wrote/read 1.2 GiB and, as a control, 2.4 GiB of data. Files were deleted after each fio run. The bs parameter was varied to account for transfers of different block sizes (k in fio stands for KiB). The ioengine was psync. The runs were repeated 6 times; the fio files were explicitly deleted and recreated after 3 runs. The values given below are averages. TRIM commands were issued before each run.
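For readers who want to repeat such runs, the following sketch prints one fio command line per combination of access pattern and block size. It is not the exact job definition used for the numbers below: the helper fio_cmd, the file name /mnt/fio.tmp and the size of 1200m (as an approximation of 1.2 GiB) are assumptions; only bs, the access patterns and ioengine=psync are taken from the description above.

```shell
#!/bin/sh
# Print a fio command line for one run; pipe the output to sh to execute it.
# Arguments: access pattern (read|write|randread|randwrite), block size, test file.
fio_cmd() {
    printf 'fio --name=%s-%s --filename=%s --rw=%s --bs=%s --size=1200m --ioengine=psync\n' \
        "$1" "$2" "$3" "$1" "$2"
}

# One command per block size and access pattern (12 lines in total)
for bs in 2048k 64k 4k; do
    for rw in read write randread randwrite; do
        fio_cmd "$rw" "$bs" /mnt/fio.tmp
    done
done
```

Printing the commands first makes it easy to review them before anything is written to the partition under test.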
/dev/sda5:
bs=2048k : Sequential READ: 506 MB/s - sequential WRITE: 498 MB/s
bs=64k   : Sequential READ: 432 MB/s - sequential WRITE: 395 MB/s
bs=4k    : Sequential READ: 156 MB/s - sequential WRITE: 137 MB/s
bs=2048k : Random READ: 496 MB/s - random WRITE: 485 MB/s
bs=64k   : Random READ: 295 MB/s - random WRITE: 389 MB/s
bs=4k    : Random READ: 50 MB/s - random WRITE: 117 MB/s
/dev/sda6:
bs=2048k : Sequential READ: 510 MB/s - sequential WRITE: 495 MB/s
bs=64k   : Sequential READ: 437 MB/s - sequential WRITE: 405 MB/s
bs=4k    : Sequential READ: 156 MB/s - sequential WRITE: 142 MB/s
bs=2048k : Random READ: 504 MB/s - random WRITE: 501 MB/s
bs=64k   : Random READ: 300 MB/s - random WRITE: 398 MB/s
bs=4k    : Random READ: 50 MB/s - random WRITE: 135 MB/s
The spread in variation per run type is relatively big - around +/- 8 MB/s. So, one would need some more runs to reduce the impact of temporary loads in the OS.
Nevertheless, as far as I could see, there seems to be a systematic advantage in the range of roughly 2.5% up to 14% for the 1 MiB-aligned "/dev/sda6" for random write operations with relatively small data blocks.
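The relative advantage can be recomputed from the averaged random-write values in the table. The small helper below (the name advantage_pct is hypothetical) gives about 2.3% for bs=64k and about 15.4% for bs=4k, which, given the spread of around +/- 8 MB/s, is in the same range as the figures quoted above.

```shell
#!/bin/sh
# Relative advantage in percent of value b (second argument) over value a (first argument)
advantage_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

advantage_pct 389 398   # random WRITE, bs=64k: prints 2.3
advantage_pct 117 135   # random WRITE, bs=4k:  prints 15.4
```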
Test with hdparm - and with SSD in USB enclosure at USB-3 bus
The following results show deficits of the USB bus attachment:
mylux:~ # hdparm -t /dev/sda5
/dev/sda5:
 Timing buffered disk reads: 1582 MB in 3.00 seconds = 432.95 MB/sec
mylux:~ # hdparm -t /dev/sda6
/dev/sda6:
 Timing buffered disk reads: 1624 MB in 3.00 seconds = 432.92 MB/sec
But again - no differences between the partitions.
Tests with dd - and with SSD in USB enclosure at USB-3 bus
For "bs=1M, count=2048" I got the following results:
/dev/sdh5   WRITE: 399 MB/s - 424 MB/s   READ: 454 MB/s - 455 MB/s
/dev/sdh6   WRITE: 402 MB/s - 425 MB/s   READ: 454 MB/s - 455 MB/s
For "bs=4K, count=400000":
/dev/sdh5   WRITE: 353 MB/s - 372 MB/s   READ: 454 MB/s - 455 MB/s
/dev/sdh6   WRITE: 351 MB/s - 371 MB/s   READ: 454 MB/s - 455 MB/s
So, basically no difference!
Tests with fio - and with SSD in USB enclosure at USB-3 bus
Would fio now also reveal a difference for small block-sizes?
/dev/sdh5:
bs=2048k : Random READ: 430 MB/s - random WRITE: 435 MB/s
bs=4k    : Random READ: 28.6 MB/s - random WRITE: 50.8 MB/s
/dev/sdh6:
bs=2048k : Random READ: 433 MB/s - random WRITE: 440 MB/s
bs=4k    : Random READ: 28.8 MB/s - random WRITE: 50.9 MB/s
No difference! The chain of the USB-3 bridge controller on my PC, the JMicron controller in the enclosure and the SSD's MEX controller seems to eliminate any differences.
Looking at all data we come to the following result:
The Samsung SSD 850 Pro shows a remarkable indifference to alignment/misalignment (multiples of 2048 vs. 65535 logical blocks) for sequential reads/writes and different block sizes transferred to the disk.
With respect to random reads/writes we see no difference either when the SSD is attached to an external JMicron USB-to-SATA controller.
However, when the SSD is attached to a SATA-III controller, there is a noticeably higher performance for random write operations on a partition with a "1 MiB alignment" (start blocks/sectors = multiples of 2048) and small transferred block sizes (from 64 KiB down to 4096 bytes):
bs=64 kiB => 2.5% faster with "1 MiB"-alignment.
bs=4096 bytes => 14% faster with "1 MiB"-alignment.
Final Conclusion for part I and part II
The decisions of parted and some other Linux partitioning tools for partition offsets and alignment depend on disk topology properties known as "I/O-Limits" - e.g. physical_block_size, minimum_io_size, optimal_io_size. The respective values libblkid detects do not only depend on the SSD, but primarily on the controller to which the disk is attached. External USB-to-SATA controllers may give you quite different values than the SSD itself when directly attached to a SATA-III controller.
For a Samsung 850 Pro we got an expected 1 MiB alignment (multiples of 2048 logical blocks) on the SATA-III bus - but we noted an unexpected alignment to multiples of 33553920 bytes (65535 blocks) for the same disk in a USB enclosure with a USB-to-SATA controller. These incompatible alignments could be explained by heuristic rules Linux still seems to use for deriving offsets from I/O-limits data (which may be "manipulated" by controllers). An especially important factor for alignment decisions is the value of "optimal_io_size".
We found that external SSDs with GPT layout (and protective MBR) can be prepared with partitions and LUKS-encrypted LVM volumes on an external USB-to-SATA-controller. A bootable and Grub2 dependent Opensuse Leap 15 installation was still bootable when the SSD later on was transferred to the internal SATA controller of a laptop.
The impact of the alignment or misalignment of partitions was negligible when the disk was attached to the external USB-to-SATA controller. A significant advantage of a standard "1MiB-alignment" could be seen for random write operations and data of small block sizes between 4096 byte and 64 kiB - but only when the disk was directly attached to a SATA-III-controller. This is, however, the most common use case. The advantage may increase from 2.5% for 64kiB up to 14% for 4096 Bytes. The latter is important as most filesystems operate with 4096 byte blocks.
However, the performance impact of misalignment was by far not as big as expected - at least not for the Samsung 850 Pro. (I have e.g. seen a much (!) larger impact of the CPU governor on SSD Raid configurations built with mdadm for Intel SATA RAID controllers.) Still the performance difference indicates superfluous write operations at internal page borders of the SSD.
Recommendations: Check the I/O limit values with the help of lsblk when you use external USB enclosures and controllers. These data may explain an unexpected alignment. If you plan to use the SSD later on a SATA bus (i.e. attached to a SATA controller), it is better to set partition borders manually to fit a "1 MiB" alignment (multiples of 2048 blocks). This means that you may have to define start and end blocks explicitly when you create the partitions, instead of just defining partition sizes. (This would e.g. be the case for YaST's Partitioner.) Use e.g. fdisk or gdisk to check the start and end block/sector numbers of your partitions after you have created them. Verify that the start sector number and the size in sectors of each partition can be divided by 2048 without a remainder.
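The divisibility check can be scripted. The sketch below reads the start sector and the size (both in 512-byte units) from sysfs; the helper names is_mib_aligned and check_partition as well as the partition name "sda6" in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Succeeds if start sector and size (both in 512-byte sectors) are multiples of 2048
is_mib_aligned() {
    [ $(( $1 % 2048 )) -eq 0 ] && [ $(( $2 % 2048 )) -eq 0 ]
}

# Check one partition via its sysfs directory, e.g. /sys/class/block/sda6
check_partition() {
    start=$(cat "$1/start")
    size=$(cat "$1/size")
    if is_mib_aligned "$start" "$size"; then
        echo "$1: aligned (start=$start size=$size)"
    else
        echo "$1: NOT aligned (start=$start size=$size)"
    fi
}

# Usage (the partition name "sda6" is an assumption):
#   check_partition /sys/class/block/sda6
#   parted /dev/sda align-check optimal 6
```

Note that parted's align-check judges alignment against the I/O limits reported by the current controller, whereas the sysfs check above always tests against the fixed 2048-sector grid.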
By the way: "gdisk" seems to work with 1MiB-alignment even when the disk is attached to an external USB-to-SATA-controller.
Final opinions at the end: The whole matter that different controllers may provoke different alignments is confusing, and not only for end users. The "heuristic" used by parted is not documented well enough (I found the Red Hat text by chance and had to reconcile it with parted's default alignment parameter "optimal"). The fact that different Linux disk tools may lead to different alignments for the same controller and the same SSD is disturbing.
Others have/had the same problem ...
Importance of alignment for SSD performance