From: Michal Soltys on
Mikael Abrahamsson wrote:
> On Mon, 8 Mar 2010, Tejun Heo wrote:
>
>> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> Excellent summary.
>
>> C-2. Windows XP depends on the traditional partition layout.
>
> Is this really true? WD ships their EARS drives with an alignment tool
> that as far as I can understand, moves the partition so
> it's aligned to 4KiB:
>

XP SP2 (or later) can boot from any location, including logical partitions
(tested that recently). The most important thing is "hidden sectors" (recent
chain.c32 can set that automatically through the ntldr and/or sethidden
options). No idea about pre-SP2; Win 2000 will not boot from a "misaligned"
(with reference to cylinder boundary) partition.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Tokarev on
Karel Zak wrote:
> On Tue, Mar 09, 2010 at 09:53:37AM +0300, Michael Tokarev wrote:
[]
>> Think of a raid5 array - with all the mentioned good stuff in place
>> fdisk should figure out to align partitions on the array stripe
>> boundary, and should do that automatically. And this should be
>
> Yes. For userspace there is not a difference between RAID and non-RAID
> device -- the topology support in kernel provides unified API to all
> devices. It means we needn't any extra support for RAIDs in
> fdisk/parted. The userspace tools follow topology data from kernel.
>
> The good thing with 1MiB default alignment is that it is usable for
> usual stripe sizes (for sizes greater than 1MiB we use optimal I/O
> size).

No, it's not that simple. For raid5 (and I specifically mentioned raid5
above), raid4 and raid6, 1MiB is only good when the number of devices
is 2^N+1 (for raid[45]) or 2^N+2 (for raid6), i.e. when the number of
data disks is a power of two. For raid5 that means 3, 5, 9, 17, ... disks.
In all other cases the alignment (which should match the full stripe size,
i.e. data disks x chunk size) will not be a power of two. For example, for
a 4-disk raid5 array with 1MiB chunk size the partitions should be aligned
at 3MiB boundaries. For 6-disk raid5 with 256KiB chunk size it is
5x256=1280KiB. And so on.
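
To make the arithmetic above concrete, here is a minimal sketch in Python.
The function names are made up for illustration and are not part of mdadm,
fdisk or the kernel; the only assumption is that a full stripe is the chunk
size times the number of data disks (one parity disk for raid4/5, two for
raid6).

```python
KiB = 1024
MiB = 1024 * KiB

# Parity disks per RAID level (assumption: standard md levels only).
PARITY_DISKS = {0: 0, 4: 1, 5: 1, 6: 2}


def full_stripe_bytes(level, disks, chunk_bytes):
    """Data bytes in one full stripe: chunk size times number of data disks."""
    data_disks = disks - PARITY_DISKS[level]
    if data_disks < 1:
        raise ValueError("not enough disks for this RAID level")
    return data_disks * chunk_bytes


def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and n & (n - 1) == 0


def one_mib_is_stripe_aligned(level, disks, chunk_bytes):
    """True if every 1MiB boundary is also a full-stripe boundary."""
    return MiB % full_stripe_bytes(level, disks, chunk_bytes) == 0


if __name__ == "__main__":
    for level, disks, chunk in [(5, 4, MiB), (5, 6, 256 * KiB), (5, 5, 256 * KiB)]:
        stripe = full_stripe_bytes(level, disks, chunk)
        print("raid%d, %d disks: stripe %d KiB, power of two: %s, 1MiB ok: %s" % (
            level, disks, stripe // KiB, is_power_of_two(stripe),
            one_mib_is_stripe_aligned(level, disks, chunk)))
```

The 5-disk raid5 case with 256KiB chunks is the only one of the three where
a 1MiB default lands on a full-stripe boundary.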

Yes it has little to do with the $subject (4KiB sectors), but it is
closely related still.

>> most easy to debug/test, since the whole thing is controllable
>> by kernel.
>
> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> It works as expected.

Actually, for raid0, the alignment is questionable. Should it be a
multiple of the chunk size or of the whole stripe size? I'm not sure; both
ways have good and bad sides. But if it is the latter, the same issue
pops up again: build a 3-disk raid0 and you'll have to align to 3*2^N.
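
As an illustration of the two choices, the sketch below rounds a requested
partition start up to either boundary. The 64KiB chunk size and the 1MiB
requested start are assumptions picked for the example, and align_up is a
hypothetical helper, not something fdisk or parted exposes.

```python
KiB = 1024
MiB = 1024 * KiB


def align_up(offset, boundary):
    """Round offset up to the next multiple of boundary (both in bytes)."""
    if boundary <= 0:
        raise ValueError("boundary must be positive")
    return -(-offset // boundary) * boundary


if __name__ == "__main__":
    chunk = 64 * KiB          # assumed chunk size
    ndisks = 3                # 3-disk raid0: every disk holds data
    stripe = ndisks * chunk   # full stripe, 3*2^N bytes

    requested = 1 * MiB
    print("chunk-aligned start: ", align_up(requested, chunk))
    print("stripe-aligned start:", align_up(requested, stripe))
```

With these numbers the chunk-aligned start stays at 1MiB, while the
stripe-aligned start has to move to the next multiple of 192KiB.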

[]
> Disk /dev/sdb: 2621 MB, 2621440000 bytes
> 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 32768 bytes

Good.

> # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}

That's a stripe of 3 data disks with the default 64KiB chunk size, which
makes 3x64=192KiB - the number to which everything should be aligned.

> # fdisk -lcu /dev/md8
>
> Disk /dev/md8: 1572 MB, 1572667392 bytes
> 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 65536 bytes / 65536 bytes

And here we go: fdisk does not see the right number: nothing
is divisible by 3.

[]
> # cat /sys/block/md8/md8p{1,2}/alignment_offset
> 0
> 0

And that's where the issue is. md does not {sup,re}port all
this stuff yet.
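
For anyone who wants to check their own array, here is a rough Python
sketch that reads the topology values from sysfs and tests whether the
optimal I/O size equals the full stripe. The sysfs file names are the ones
the kernel exports under /sys/block; read_topology and reports_full_stripe
are hypothetical helpers, and treating minimum_io_size as the chunk size is
an assumption that matches the md8 output quoted above.

```python
from pathlib import Path

# Topology attributes, relative to /sys/block/<dev>/.
TOPOLOGY_FILES = {
    "logical_block_size": "queue/logical_block_size",
    "physical_block_size": "queue/physical_block_size",
    "minimum_io_size": "queue/minimum_io_size",
    "optimal_io_size": "queue/optimal_io_size",
    "alignment_offset": "alignment_offset",
}


def read_topology(dev, sysfs_root="/sys/block"):
    """Return the topology values for dev as ints (None if a file is missing)."""
    base = Path(sysfs_root) / dev
    values = {}
    for name, rel in TOPOLOGY_FILES.items():
        path = base / rel
        values[name] = int(path.read_text().strip()) if path.exists() else None
    return values


def reports_full_stripe(topology, data_disks):
    """True if optimal_io_size equals data_disks times minimum_io_size.

    Assumption: minimum_io_size carries the chunk size, so the product is
    the full stripe a striped array would be expected to report.
    """
    min_io = topology.get("minimum_io_size")
    opt_io = topology.get("optimal_io_size")
    return bool(min_io) and opt_io == data_disks * min_io
```

Something like reports_full_stripe(read_topology("md8"), 3) would return
False for the 65536/65536 values shown in the fdisk output above.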

This is what I'm talking about.

Thanks!

/mjt
From: Dave Chinner on
On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> > It works as expected.
>
> Actually, for raid0, the alignment is questionable. Should it be a
> multiple of the chunk size or of the whole stripe size? I'm not sure; both
> ways have good and bad sides. But if it is the latter, the same issue
> pops up again: build a 3-disk raid0 and you'll have to align to 3*2^N.

Yes, alignment is still needed, especially for filesystems that can
do stripe unit aligned allocation like XFS. If you don't align the
filesystem properly, all the data IO will be mis-aligned to the
underlying disks and stripe unit sized IO will hit multiple disks
rather than just one....
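
A small sketch of that effect, assuming the simplest raid0 layout where
chunk k lives on member disk k % ndisks. The 64KiB chunk size and the
63-sector offset (the traditional DOS partition start) are assumptions
for the example, and disks_touched is a hypothetical helper.

```python
KiB = 1024
SECTOR = 512


def disks_touched(offset, length, chunk_bytes, ndisks):
    """Return the set of raid0 member indices that an I/O touches.

    offset and length are in bytes, relative to the start of the array.
    Assumes chunk k is stored on disk k % ndisks.
    """
    if length <= 0:
        return set()
    first = offset // chunk_bytes
    last = (offset + length - 1) // chunk_bytes
    return {k % ndisks for k in range(first, last + 1)}


if __name__ == "__main__":
    chunk = 64 * KiB
    # Stripe-unit-sized I/O starting on a chunk boundary.
    print(sorted(disks_touched(0, chunk, chunk, 3)))
    # The same I/O when the filesystem starts 63 sectors into the array.
    print(sorted(disks_touched(63 * SECTOR, chunk, chunk, 3)))
```

The aligned request stays on one disk; the shifted one straddles a chunk
boundary and lands on two.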

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Michael Tokarev on
Dave Chinner wrote:
> On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
>> Karel Zak wrote:
>>> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
>>> It works as expected.
>> Actually, for raid0, the alignment is questionable. Should it be a
>> multiple of the chunk size or of the whole stripe size? I'm not sure; both
>> ways have good and bad sides. But if it is the latter, the same issue
>> pops up again: build a 3-disk raid0 and you'll have to align to 3*2^N.
>
> Yes, alignment is still needed, especially for filesystems that can
> do stripe unit aligned allocation like XFS. If you don't align the
> filesystem properly, all the data IO will be mis-aligned to the
> underlying disks and stripe unit sized IO will hit multiple disks
> rather than just one....

I understand alignment is needed; the question is whether the alignment
should be to the chunk size or to the full-stripe size. In neither case
will it be bad for the underlying disks.

/mjt
From: Karel Zak on
On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
>
> That's a stripe of 3 data disks with the default 64KiB chunk size, which
> makes 3x64=192KiB - the number to which everything should be aligned.
>
> > # fdisk -lcu /dev/md8
> >
> > Disk /dev/md8: 1572 MB, 1572667392 bytes
> > 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> > Units = sectors of 1 * 512 = 512 bytes
> > Sector size (logical/physical): 512 bytes / 4096 bytes
> > I/O size (minimum/optimal): 65536 bytes / 65536 bytes
>
> And here we go: fdisk does not see the right number: nothing
> is divisible by 3.
>
> []
> > # cat /sys/block/md8/md8p{1,2}/alignment_offset
> > 0
> > 0
>
> And that's where the issue is. md does not {sup,re}port all
> this stuff yet.
>
> This is what I'm talking about.

Note that I have the 2.6.31.12-174.2.22.fc12.x86_64 kernel on my laptop.
For serious tests it would be better to use 2.6.33.

Karel

--
Karel Zak <kzak(a)redhat.com>