Linux SSD partition alignment – problems with external USB-to-SATA controllers – I

When you try to solve one problem, another one suddenly appears - and you find yourself confronted with something you never worried about before. For my article series on full laptop encryption with openSUSE Leap 15 I wanted to prepare a new SSD on my Linux desktop. This SSD was later to become the main disk of the laptop. I used an external disk enclosure with a USB-to-SATA controller from JMicron to attach the disk to the USB 3 bus of my desktop system - and stumbled across a really strange partitioning scheme that YaST's "Partitioner" set up on my SSD during installation.

For an SSD you normally expect Linux to choose a so-called "1 MiB alignment". Among other things this alignment leads to a minimum offset of the first partition of 2048 logical sectors/blocks of 512 bytes each. The math is 2048 * 512 = 1024 * 1024 bytes = 1 MiB = 256 * 4096 bytes. This start sector is well suited for so-called AF (Advanced Format) disks with native 4k support, but also for disks with conventional 512-byte physical blocks. Note that 1024 * 1024 can be divided by both 512 and 4096 without remainder.
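The arithmetic behind the 1 MiB alignment can be checked with a few lines of Python. This is only an illustrative sketch; the constant and variable names are my own choice and do not come from any partitioning tool.

```python
# Sanity check of the 1 MiB alignment arithmetic (all names are illustrative).
LOGICAL_SECTOR = 512      # bytes per logical sector
MIB = 1024 * 1024         # bytes per MiB

# Byte offset of a partition that starts at logical sector 2048
offset_bytes = 2048 * LOGICAL_SECTOR

print(offset_bytes == MIB)        # is the offset exactly 1 MiB?
print(offset_bytes % 512 == 0)    # does it fit 512-byte physical blocks?
print(offset_bytes % 4096 == 0)   # does it fit 4096-byte physical blocks?
print(offset_bytes // 4096)       # number of 4 KiB blocks in the offset
```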

From previous experience with SSDs attached to SATA-III controllers I would have said: Any partition on an SSD shows a logical start block number which is a multiple of 2048 on a Linux system. And I would have bet that this was true for the size of a partition expressed in blocks, too. I had never questioned this before ... In addition there are many articles on the Internet that tell users not to worry about alignment because all Linux tools today handle alignment correctly.

However, in my test situation I got something very different - and consequently some (!) Linux partitioning tools on the laptop later complained about misalignment. This worried me. Googling led me to the following unanswered question at
https://superuser.com/questions/1291467/trouble-figuring-out-optimal-alignment-of-multiple-partitions-on-a-931-5-gib-ext,
which describes more or less my problem.

I have tried to figure out what was/is behind this "misalignment" problem - and the result shocked me a bit. It raises questions about manufacturers of disk controllers and the way Linux handles disk property information. In addition I had a brief look at the performance of an aligned and an unaligned partition. The results puzzled me again. I find it difficult to get a consistent picture out of my findings for partition alignment on SSDs at different controllers. Write me a mail if you know better ...

Anyway, I hope this article, which digs into areas of Linux a bit off the beaten track of standard usage, will be of interest to some other Linux fans, too ....

I shall use the words "blocks" (OS perspective) and "sectors" (atomic disk unit) on the level of partitioning interchangeably - which is not quite correct semantically. But you find this mix of wording also in disk tools :-).

The problem

I used an external box for my SSD which contained a JMicron USB-to-SATA controller. When my SSD - a Samsung 850 Pro - was attached to my Linux desktop via an external USB interface, the physical sector size was reported to be 4096 bytes by lsblk, parted and fdisk. This stood a bit in contrast to other reports for this SSD type on the Internet; but SSD technology changes so often that it did not make me suspicious. I first added a BIOS boot partition and several other partitions to the disk with YaST's Partitioner. YaST chose 65535 as the start sector of the first partition.

mytux:~ # fdisk -l /dev/sdh 
Disk /dev/sdh: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 33553920 bytes
Disklabel type: gpt
Disk identifier: ....

Device         Start       End   Sectors  Size Type
/dev/sdh1      65535    131069     65535   32M BIOS boot
/dev/sdh2     131070   1376234   1245165  608M EFI System
/dev/sdh3    1376235  37027274  35651040   17G Linux swap
...

OK, I had noticed before that some current partitioning tools choose a 32 MiB offset for the first partition these days. However, the number 65535 is NOT a multiple of 2048 and thus gives 31.99951171875 MiB. An offset of 65536 logical blocks would, however, correspond exactly to 32 MiB.
Also all other partitions on the external SSD were "aligned" (?) to start sector numbers which were multiples of 65535.
Addendum 05.12.2018: I retested this with the tool parted - without defining a special alignment type, i.e. using defaults. Same result.
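To make the difference between 65535 and 65536 tangible, here is a minimal Python sketch that converts a start sector into an offset in MiB and checks it against the 2048-sector grid. The helper name offset_in_mib is hypothetical and only serves this illustration.

```python
# Compare the observed start sector 65535 with the "expected" 65536.
LOGICAL_SECTOR = 512
MIB = 1024 * 1024

def offset_in_mib(start_sector, sector_size=LOGICAL_SECTOR):
    """Byte offset of a start sector, expressed in MiB."""
    return start_sector * sector_size / MIB

for start in (65535, 65536):
    # Print the sector, its offset in MiB and the remainder on the 2048 grid
    print(start, offset_in_mib(start), start % 2048)
```

A remainder of 0 in the last column means that the start sector lies on a 1 MiB boundary.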

65535 is compatible with a physical block size of 512 bytes, but NOT 4096 bytes. Interestingly, 65535 also fits an erase block size of 1536 MiB, which a lot of people assume to be used by Samsung for its SSDs. Wherever these people got this information from ... see e.g. https://www.phoronix.com/forums/forum/hardware/general-hardware/1030306-samsung-970-evo-nvme-ssd-benchmarks-on-ubuntu-linux.

But: I checked against partition boundaries on another Linux system with the same type of SSD directly attached to the internal (Intel) SATA controller. There the alignment fitted exactly the theory of boundaries as multiples of 1 MiB (=> multiples of 2048 logical blocks as start sector numbers). I also checked for systems with other internal SSDs. 1 MiB alignment ...

Then I built the disk into my laptop and attached it directly to the SATA interface there - and got a plain "1 MiB"-alignment (both for YaST's Partitioner and parted with defaults).

mylux:~ # fdisk -l /dev/sda 
Disk /dev/sda: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ....

Device         Start       End   Sectors  Size Type
/dev/sda1      2048      67583     65536   32M BIOS boot
...

If one had added more partitions the pattern would have repeated itself: With the SSD on the SATA controller the start sector numbers became multiples of 2048; on the external JMicron USB-to-SATA controller the start sector numbers became multiples of 65535.

Where did this discrepancy come from?

Why should we care?

One motivation for me is, of course, to get a basic idea of what happens. Why do we get a different alignment on different controllers for one and the same disk?

The other point is that alignment is discussed as a requirement for optimal performance and avoiding unnecessary writes at page boundaries. See e.g. here:
https://wiki.ubuntuusers.de/SSD/Alignment/
https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/ssd-partition-alignment-tech-brief.pdf
http://www.linux-magazine.com/Issues/2016/187/SSD-tuning
https://blog.helmutkarger.de/linux-und-moderne-ssd-speicher/

A third point is that a misalignment on partition borders would probably be accompanied by a misalignment of other types of "offsets" - e.g. within LVM volumes or encrypted LUKS partitions/volumes. Both for LVM and LUKS the disk area for data payload shows an offset against headers and management data areas at the beginning of a volume. One would assume that a misalignment there also would reduce the overall performance.

In my case another aspect worried me: I used the external USB attachment to prepare an SSD which was later to become the main disk of a Linux laptop - but then directly attached to the laptop's internal SATA controller. So - even if the suspected misalignment on the USB controller had no consequences - it would probably lead to problems on the SATA bus.

Does the strange alignment also occur for other external USB-to-SATA-controllers?

Yes, it does. My first external disk box contained a JMicron USB-to-SATA controller (JMS561). But I also tested another popular controller from JMicron (JMS578) and controllers from ASMedia (ASMedia Technology Inc. ASM1051E SATA 6Gb/s bridge, ASM1053E SATA 6Gb/s bridge, ASM1153 SATA 3Gb/s bridge). Same result: Strange partition alignment on start sector numbers which were multiples of 65535.

Disk properties collected and reported by Linux

From the described problem we may assume that the Linux system decides on different partition alignments in situations

  • where a SATA disk is attached via a USB-to-SATA controller to an external USB interface of the Linux host,
  • where a SATA disk is directly attached to an internal SATA (III) controller of the Linux host.

What could such a decision be based upon? Probably on information about disk properties. Where could we find such information? The "/sys" directory with its sub-directories is a natural place for it. And indeed - some googling will give you the information that disk topology information is offered to userspace by files below /sys/block/<disk>/ and is gathered from there by functions of the library "libblkid". See e.g.:
https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/v2.21/libblkid-docs/
https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/v2.21/libblkid-docs/libblkid-Topology-information.html

The partitioning tool "parted", YaST's Partitioner and also LVM use information gathered by libblkid. (For YaST one can conclude this from the dependency tree of the package "yast2-storage-ng". For parted see
https://people.redhat.com/msnitzer/docs/io-limits.txt.)
The latter text is, by the way, a relevant source of information for our topic. It directs our attention to the following disk properties and related information at the sysfs interface:

/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size

These data are also called "I/O-Limits" data.
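The sysfs files can also be read directly from a script. The following Python sketch collects the disk-level values listed above; the function name read_io_limits and the parameter sysfs_root are my own assumptions for illustration, and files that are missing or unreadable simply yield None.

```python
from pathlib import Path

# Disk-level I/O limit files, relative to /sys/block/<disk>/
IO_LIMIT_FILES = {
    "alignment_offset": "alignment_offset",
    "physical_block_size": "queue/physical_block_size",
    "logical_block_size": "queue/logical_block_size",
    "minimum_io_size": "queue/minimum_io_size",
    "optimal_io_size": "queue/optimal_io_size",
}

def read_io_limits(disk, sysfs_root="/sys/block"):
    """Return the I/O limits of a disk as a dict of ints (None if unavailable)."""
    base = Path(sysfs_root) / disk
    limits = {}
    for name, relative_path in IO_LIMIT_FILES.items():
        try:
            limits[name] = int((base / relative_path).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable or not a plain integer
            limits[name] = None
    return limits
```

Calling read_io_limits("sda") on a Linux host returns a dictionary with the five values for /dev/sda; the sysfs_root parameter only exists so that the function can be tried out against a copy of the directory tree.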

How else can we detect information on these disk properties in userspace? One can look it up via "lsblk". "lsblk --help" will give you information on several columns for related information.

mytux:~ # lsblk -o  NAME,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC  /dev/sdh
NAME         ALIGNMENT MIN-IO   OPT-IO PHY-SEC LOG-SEC
sdh                  0    512 33553920     512     512
├─sdh1               0    512 33553920     512     512
├─sdh2               0    512 33553920     512     512
├─sdh3               0    512 33553920     512     512
│ └─vgs-lvs1         0    512 33553920     512     512
├─sdh4               0    512 33553920     512     512
│ ├─vga-lva1         0    512 33553920     512     512
│ └─vga-lva2         0    512 33553920     512     512
└─sdh5               0    512 33553920     512     512

Another possibility is to use "lsblk -t" (with "t" for "topology"). fdisk is helpful, too; on an external ASMedia controller I get:

mytux:~ # fdisk -l -u /dev/sdh 
Disk /dev/sdh: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 33553920 bytes
Disklabel type: gpt
....

Well, this output was for the ASMedia controller. Will another external USB-to-SATA controller give us the same information? Actually, NO!
Here is the output for the same disk in an external box with a JMicron controller:

mytux:~ # lsblk -o  NAME,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC  /dev/sdi
NAME         ALIGNMENT MIN-IO   OPT-IO PHY-SEC LOG-SEC
sdi                  0   4096 33553920    4096     512
├─sdi1               0   4096 33553920    4096     512
├─sdi2               0   4096 33553920    4096     512
├─sdi3               0   4096 33553920    4096     512
│ └─vgs-lvs1        -1   4096        0    4096     512
├─sdi4               0   4096 33553920    4096     512
│ ├─vga-lva1        -1   4096        0    4096     512
│ └─vga-lva2        -1   4096        0    4096     512
└─sdi5            3072   4096 33553920    4096     512

(Ignore the negative offsets for the LVM volumes for a while - although they indicate something strange.)

Hmmm, we get a different minimum_io_size and a different physical_block_size! Note that the information on the physical block size indicates that the JMicron controller internally combines 8 logical blocks into a bigger cluster of 4096 bytes. This is interesting and worrisome at the same time, because the optimal_io_size cannot be divided by 4096 without remainder.
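The mismatch between the reported optimal_io_size and the reported physical block size is easy to verify. A short sketch using only the values from the lsblk output above (the constant names are illustrative):

```python
# Values as reported by lsblk for the disk behind the JMicron controller
LOGICAL_BLOCK = 512
PHYSICAL_BLOCK = 4096
OPTIMAL_IO_SIZE = 33553920

# How many logical blocks make up one reported physical block?
blocks_per_physical = PHYSICAL_BLOCK // LOGICAL_BLOCK

# Does optimal_io_size consist of whole physical blocks?
remainder = OPTIMAL_IO_SIZE % PHYSICAL_BLOCK

print(blocks_per_physical)
print(OPTIMAL_IO_SIZE / PHYSICAL_BLOCK)
print(remainder)
```

A non-zero remainder means that an I/O of "optimal" size would end in the middle of a 4096-byte block.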

These results make you really wonder what we would get for the SSD device on a SATA controller. Well, here is the output for the very same SSD built into the laptop and attached to the SATA interface there:

mylux:~ # lsblk -o  NAME,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC  /dev/sda
NAME         ALIGNMENT MIN-IO   OPT-IO PHY-SEC LOG-SEC
sda                  0    512        0     512     512
├─sda1               0    512        0     512     512
├─sda2               0    512        0     512     512
├─sda3               0    512        0     512     512
│ └─vgs-lvs1         0    512        0     512     512
├─sda4               0    512        0     512     512
│ ├─vga-lva1         0    512        0     512     512
│ └─vga-lva2         0    512        0     512     512
└─sda5               0    512        0     512     512

Now, we again find a different info on optimal_io_size! You think this is confusing? Well, let us have a look at the alignment-check of parted for the existing partitions on the SSD.

Which of the existing partitions are aligned?

On the SATA-bus of the laptop we perform the commands fdisk, parted and parted's sub-command "align-check" for partitions 3, 4, 5:

mylux:~ # fdisk -l /dev/sda 
Disk /dev/sda: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ....

Device         Start       End   Sectors  Size Type
/dev/sda1      65536    131071     65536   32M BIOS boot
/dev/sda2     131072   1400831   1269760  620M EFI System
/dev/sda3    1400832  37052415  35651584   17G Linux LVM
/dev/sda4   37052416 529883135 492830720  235G Linux LVM
/dev/sda5  529916010 540401609  10485600    5G Linux filesystem

Note: I had in the meantime manually adjusted the first 4 partitions to fit to a "natural" 1 MiB alignment.

First noteworthy thing: fdisk produces yet another value for optimal_io_size !!
However:

mylux:~ # cat /sys/block/sda/queue/optimal_io_size 
0 

Where does fdisk get its version of optimal_io_size from? Frankly, I do not know ...

Also note the gap in blocks between the end sector of partition 4 and the start sector of partition 5. As said, partition 4 was created manually so that its start sector is a multiple of 2048. However, partition 5 was created by YaST's Partitioner with the disk on the external JMicron USB-to-SATA controller. (Yeah, I moved the SSD back and forth a lot between external USB disk boxes and the laptop ...)
The partition creation on the JMicron controller led directly to the gap - and the alignment was done for a start sector which fits 512-byte blocks and is a multiple of 65535 (the factor is 8086) - but is NOT a multiple of 2048 (the factor is 258748.0517578125).
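The size of the gap can be reproduced by rounding the first free sector after partition 4 up to the next multiple of the alignment granularity. Whether the partitioning libraries round in exactly this way is an assumption on my side; the sketch below (with the hypothetical helper round_up) merely shows that such a rounding yields the observed start sector.

```python
def round_up(sector, grain):
    """Smallest multiple of grain that is greater than or equal to sector."""
    return -(-sector // grain) * grain

END_OF_PARTITION_4 = 529883135          # from the fdisk output above
first_free = END_OF_PARTITION_4 + 1

start_usb = round_up(first_free, 65535)  # granularity seen on the USB controller
start_mib = round_up(first_free, 2048)   # standard 1 MiB granularity
gap = start_usb - first_free             # unused sectors in front of partition 5

print(start_usb, start_usb // 65535, gap)
print(start_mib)
```

With a 1 MiB granularity no gap would have been necessary, because the first free sector already lies on a multiple of 2048.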

Now, what does parted on the laptop with the disk attached to the SATA controller think about the alignment?

mylux:~ # parted /dev/sda
GNU Parted 3.2
Using /dev/sda
Welcome ....
(parted) align-check
alignment type (min/op) [optimal]/minimal?
Partition number? 3
3 aligned 
... 
Partition number? 4
4 aligned 
.... 
Partition number? 5
5 not aligned 
.... 

Again: Partition 5 was created on the desktop with YaST's Partitioner (which internally uses parted-libraries!) when the disk was attached by the JMicron controller. When the very same SSD gets attached directly to a SATA controller of the laptop "/dev/sda5" is characterized as misaligned.

Now, let us create yet another new partition with the disk on the laptop's SATA bus and with YaST's Partitioner (of Leap 15). We get:

mylux:~ # fdisk -l /dev/sda 
Disk /dev/sda: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ....

Device         Start       End   Sectors  Size Type
/dev/sda1      65536    131071     65536   32M BIOS boot
/dev/sda2     131072   1400831   1269760  620M EFI System
/dev/sda3    1400832  37052415  35651584   17G Linux LVM
/dev/sda4   37052416 529883135 492830720  235G Linux LVM
/dev/sda5  529916010 540401609  10485600    5G Linux filesystem
/dev/sda6  540401664 550887423  10485760    5G Linux filesystem

Now, this ended in an alignment according to the standard 1MiB recipe. You can do the math on your own. align-check of parted consequently tells us for /dev/sda6:

Partition number? 6
6 aligned 

Ok ....

Now, what do we get when the disk is attached to the JMicron controller in one of my USB enclosures?
You guessed it: Just the opposite!
Partition 5 is then classified as "aligned" and our new partition 6 as "not aligned". In addition fdisk then complains that "Partition 5 does not start on physical sector boundary", because the JMicron controller signals a physical block size of 4096 bytes ....

Disturbing ... Different controllers, same SSD => different alignments.
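The contradictory verdicts can be reproduced with a strongly simplified model of an alignment check: a partition counts as aligned if the byte offset of its start sector is a multiple of the granularity in use. This is my own sketch, not parted's actual implementation (which, among other things, also takes alignment_offset into account); the function name starts_on is hypothetical.

```python
SECTOR = 512
MIB = 1024 * 1024

def starts_on(start_sector, grain_bytes, sector_size=SECTOR):
    """True if the partition start lies on a multiple of grain_bytes."""
    return (start_sector * sector_size) % grain_bytes == 0

# Start sectors of partitions 5 and 6 from the fdisk output
PARTITIONS = {5: 529916010, 6: 540401664}

# Granularities: 1 MiB default on SATA, optimal_io_size and physical
# block size as reported via the JMicron controller
GRAINS = {
    "SATA, 1 MiB": MIB,
    "JMicron, optimal_io_size": 33553920,
    "JMicron, physical block": 4096,
}

for label, grain in GRAINS.items():
    for number, start in PARTITIONS.items():
        verdict = "aligned" if starts_on(start, grain) else "not aligned"
        print(f"{label}: partition {number} {verdict}")
```

The third granularity corresponds to the fdisk complaint about the physical sector boundary.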

The cause of the different alignment decisions

Some Linux partitioning tools - such as parted - use the output of libblkid. But Linux appears to use a "heuristic" to decide upon the alignment based on the disk property data. The rules of this heuristic for alignment decisions are described e.g. here: https://people.redhat.com/msnitzer/docs/io-limits.txt and http://fibrevillage.com/storage/563-storage-i-o-alignment-and-size. I quote:

"The heuristic parted uses is:
1)  Always use the reported 'alignment_offset' as the offset for the
    start of the first primary partition.
2a) If 'optimal_io_size' is defined (not 0) align all partitions on an
    'optimal_io_size' boundary.
2b) If 'optimal_io_size' is undefined (0) and 'alignment_offset' is 0
    and 'minimum_io_size' is a power of 2: use a 1MB default alignment.
    - as you can see this is the catch all for "legacy" devices which
      don't appear to provide "I/O hints"; so in the default case all
      partitions will align on a 1MB boundary.
    - NOTE: we can't distinguish between a "legacy" device and modern
      device that provides "I/O hints" with alignment_offset=0 and
      optimal_io_size=0.  Such a device might be a single SAS 4K device.
      So worst case we lose < 1MB of space at the start of the disk."

(Note that this ruleset is consistent with a default alignment parameter "optimal" for parted ("-a" command option; see the discussion below).)
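The quoted rules 2a) and 2b) can be written down as a small decision function. This is a sketch of the rules as quoted, not parted's source code: the function name alignment_grain is hypothetical, the "1MB" of the quote is interpreted as 1 MiB, rule 1) is not modelled, and cases not covered by the quoted text return None.

```python
MIB = 1024 * 1024

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and n & (n - 1) == 0

def alignment_grain(alignment_offset, minimum_io_size, optimal_io_size):
    """Alignment granularity in bytes according to the quoted rules, or None."""
    if optimal_io_size != 0:
        # Rule 2a: align all partitions on an optimal_io_size boundary
        return optimal_io_size
    if alignment_offset == 0 and is_power_of_two(minimum_io_size):
        # Rule 2b: no I/O hints available, fall back to 1 MiB
        return MIB
    # Situation not covered by the quoted rules
    return None

# Values reported via the JMicron controller and via the internal SATA controller
print(alignment_grain(0, 4096, 33553920) // 512)
print(alignment_grain(0, 512, 0) // 512)
```

The two printed numbers are the resulting granularities in 512-byte sectors for the two attachment types.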

Rule 2a) is obviously the reason for what happens when the disk is attached to one of the named USB-to-SATA controllers: We have a non zero value for optimal_io_size then:

optimal_io_size: 33553920 (bytes)

To see what this means in logical blocks we have to divide by 512 : 33553920 : 512 = 65535 ! This is exactly what we find as a factor without remainder for
/dev/sda5: 529916010 : 65535 = 8086.

I double-checked: I deleted the 1st partition and created it again with the SSD attached to the JMicron controller:

mytux:~ # fdisk -l /dev/sdh 
Disk /dev/sdh: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 512 bytes / 33553920 bytes
Disklabel type: gpt
Disk identifier: ....

Device         Start       End   Sectors  Size Type
/dev/sdh1      65535    131070     65536   32M BIOS boot
/dev/sdh2     131072   1400831   1269760  620M EFI System
/dev/sdh3    1400832  37052415  35651584   17G Linux LVM
/dev/sdh4   37052416 529883135 492830720  235G Linux LVM
/dev/sdh5  529916010 540401609  10485600    5G Linux filesystem
/dev/sdh6  540401664 550887423  10485760    5G Linux filesystem

The first partition again gets an offset of exactly 65535 512-byte-blocks/sectors!

Thus:

Result 1: It is the combined effect of the "fake" optimal_io_size values which some external USB-to-SATA controllers provide AND the heuristic which parted and its libraries use that leads to different alignment decisions:
With the disk directly attached to a system's SATA (III) controller you will get a 1 MiB alignment. When attached to external USB-to-SATA controllers you may get something very different, e.g. an alignment to 33553920 bytes (which is not compatible with a 1 MiB alignment).

Getting nervous? Never thought such things could happen on Linux systems? Well, read the following bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1420935
and look at its sad history!

What happens when optimal_io_size=0 ?

In this case it is rule 2b) of the above heuristic which governs the alignment: You would get a 1 MiB alignment!
As all Samsung SSDs I know (850 PRO 1TB/512GB, EVO 850 500GB, 840 PRO 512GB, EVO 840 500GB) provide such a 0-value on a SATA bus, we always get a standard 1 MiB alignment on internal SATA controllers - despite the reported information logical block size = physical block size = 512 bytes. This latter information on the physical block size is also disturbing, as the SSD internally uses larger page sizes. It is speculated that Samsung uses a 512 byte value to be compatible with older Windows systems. Anyway, the 0-value for optimal_io_size explains my previous experience that SSDs were "always" treated with a 1 MiB alignment on Linux systems.

But what about the alignment type parameter "-a" of "parted"?

Addendum 05.08.2018: Parted offers a command parameter "-a", which one can use to define a certain alignment (none, cylinder, minimal, optimal). Let us therefore assume that our finding is NOT due to the described "heuristic". Let us instead assume that YaST's Partitioner and parted use a default value for optimization. As far as I know parted may use "optimal" as the default since version 2.3. With the data given above this would also explain the strange alignment in a natural way for the tested USB-to-SATA-controllers.
Note, however, 2 things:

  • A default alignment setting of "optimal" does not contradict the quoted heuristic rules!
  • With a value of "0" for "optimal_io_size" and a default of "optimal" parted still needs a kind of decision tree. And in this case the heuristic is probably used.

Note also that neither a default of "minimal" nor a default of "none" would explain the above findings consistently.

Why is a high optimal_io_size value of almost 32 MiB used on USB-to-SATA controllers?

Almost 32 MiB seems to be a quite high value for optimal_io_size. I can only speculate. But I assume it has to do with caching/buffering on the controller.

Do all partitioning tools work like parted and YaST's Partitioner?

I did not test this in detail. However, gdisk seems to work with a "1MiB"-alignment even when the disk is used in an external USB enclosure with a JMicron controller.

Intermediate conclusion

Partition alignment on SSDs with parted - and other partitioning tools which use the same libraries - is based on a heuristic which itself relies on disk property data retrieved by libblkid. External USB-to-SATA controllers may provide different property data for one and the same disk, e.g. different values for physical_block_size, minimum_io_size and optimal_io_size. These values may be very different from the data the disk itself reveals when it is directly attached to an (internal) SATA controller of a Linux system.

Especially the optimal_io_size "faked" by popular external USB-to-SATA controllers may lead to a different alignment decision of the Linux system compared to a situation where the SSD is directly attached to a SATA controller.

The fact that external controllers provide "fake" data makes the heuristic of Linux partitioning tools a bit questionable in my opinion.

So much for today. In the next article
Linux SSD partition alignment – problems with external USB-to-SATA controllers – II
I shall have a look at some performance data - which may also surprise you.

Links

https://bugzilla.redhat.com/show_bug.cgi?id=1420935
https://docplayer.net/27672377-Linux-advanced-storage-interfaces.html