From: Karel Zak on
On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> Karel Zak wrote:
> > # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}
>
> That's a 3-disk stripe with the default 64KiB chunk size, which makes
> 3x64=192KiB - the number to which everything should be aligned.
>
> > # fdisk -lcu /dev/md8
> >
> > Disk /dev/md8: 1572 MB, 1572667392 bytes
> > 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
> > Units = sectors of 1 * 512 = 512 bytes
> > Sector size (logical/physical): 512 bytes / 4096 bytes
> > I/O size (minimum/optimal): 65536 bytes / 65536 bytes
>
> And here we go: fdisk does not see the right number: nothing
> is divisible by 3.

Well, the same setup with 2.6.34-0.9.rc0.git13.fc14.x86_64:

# fdisk -luc /dev/sdb

Disk /dev/sdb: 2621 MB, 2621440000 bytes
255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 32768 bytes
Disk identifier: 0x77fbab55

Device Boot Start End Blocks Id System
/dev/sdb1 2048 1026047 512000 83 Linux
/dev/sdb2 1026048 2050047 512000 83 Linux
/dev/sdb3 2050048 3074047 512000 83 Linux
/dev/sdb4 3074048 4098047 512000 83 Linux


# mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}


# fdisk -luc /dev/md8

Disk /dev/md8: 1572 MB, 1572667392 bytes
2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 65536 bytes / 65536 bytes


# cat /sys/block/md8/queue/{minimum,optimal}_io_size
65536
65536
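
For reference, the full-stripe arithmetic discussed above can be sketched
in shell. This is only an illustration: full_stripe_bytes is a hypothetical
helper, not part of mdadm or fdisk, and it assumes the 64KiB chunk size
mentioned in the quoted text and one parity chunk per stripe for RAID5.

```shell
#!/bin/sh
# Hypothetical helper (not an mdadm or fdisk feature).
# Usage: full_stripe_bytes LEVEL NDEVS CHUNK_KIB
# Prints the number of data bytes in one full stripe.
full_stripe_bytes() {
    level=$1 ndevs=$2 chunk_kib=$3
    case "$level" in
        0) parity=0 ;;   # RAID0: every member holds data
        5) parity=1 ;;   # RAID5: one chunk per stripe is parity
        6) parity=2 ;;   # RAID6: two chunks per stripe are parity
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    echo $(( (ndevs - parity) * chunk_kib * 1024 ))
}

# The array from this thread: RAID5, 4 devices, assumed 64KiB chunks.
expected=$(full_stripe_bytes 5 4 64)
reported=65536   # optimal_io_size taken from the sysfs output above
echo "expected full stripe:     $expected bytes"
echo "reported optimal_io_size: $reported bytes"
```

Under these assumptions the full stripe is 3 x 64KiB, while the value
reported for md8 equals a single chunk.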

> > # cat /sys/block/md8/md8p{1,2}/alignment_offset
> > 0
> > 0
>
> And that's where the issue is. md does not {sup,re}port all
> this stuff yet.

Hmm...

Karel

--
Karel Zak <kzak(a)redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Dave Chinner on
On Tue, Mar 09, 2010 at 02:38:57PM +0300, Michael Tokarev wrote:
> Dave Chinner wrote:
> > On Tue, Mar 09, 2010 at 01:16:01PM +0300, Michael Tokarev wrote:
> >> Karel Zak wrote:
> >>> I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
> >>> It works as expected.
> >> Actually, for raid0, the alignment is questionable. Should it be a
> >> multiple of chunk size or whole stripe size? I'm not sure, both ways
> >> have bad and good sides. But if it is the latter, the same issue
> >> pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.
> >
> > Yes, alignment is still needed, especially for filesystems that can
> > do stripe unit aligned allocation like XFS. If you don't align the
> > filesystem properly, all the data IO will be mis-aligned to the
> > underlying disks and stripe unit sized IO will hit multiple disks
> > rather than just one....
>
> I understand alignment is needed, the question is if the alignment
> should be to chunk size or full-stripe size. In neither case it
> will be bad for underlying disks.

Depends on the RAID implementation. High end RAID arrays often have
cache bypass features that are triggered by stripe width aligned and
sized IOs. When receiving well formed IO they can more than double
write performance because they are not limited by internal cache
mirroring bandwidth (e.g. the controller magically switches to
write-through for those well formed IOs instead of writeback).

So from that perspective, alignment needs to be to stripe width,
not stripe unit. Similarly for RAID5/6 alignment needs to be to
stripe width, so that a well formed IO issued by the filesystem
only hits one RAID5/6 stripe.
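
A rough sketch of what "well formed" means in the sense used here: an IO
whose offset and length are both multiples of the stripe width. The
function names are hypothetical and the 196608-byte width in the usage
lines is only an example (3 data chunks of 64KiB); nothing below queries
a real device.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# is_well_formed OFFSET_BYTES LENGTH_BYTES STRIPE_WIDTH_BYTES
# Succeeds when the IO starts on a stripe boundary and covers whole stripes.
is_well_formed() {
    offset=$1 length=$2 width=$3
    [ "$length" -gt 0 ] &&
    [ $(( offset % width )) -eq 0 ] &&
    [ $(( length % width )) -eq 0 ]
}

# stripes_touched OFFSET_BYTES LENGTH_BYTES STRIPE_WIDTH_BYTES
# Prints how many stripes the byte range [offset, offset+length) overlaps.
stripes_touched() {
    offset=$1 length=$2 width=$3
    first=$(( offset / width ))
    last=$(( (offset + length - 1) / width ))
    echo $(( last - first + 1 ))
}

width=196608   # example only: 3 data chunks x 64KiB
is_well_formed 0 "$width" "$width" && echo "aligned IO is well formed"
is_well_formed 65536 "$width" "$width" || echo "chunk-aligned IO is not"
echo "stripes hit by the chunk-aligned IO: $(stripes_touched 65536 "$width" "$width")"
```

A stripe-width sized IO that is aligned only to the chunk size straddles
two stripes, which is the case the paragraph above warns about.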

FWIW, XFS takes great care to ensure that it doesn't place all its
allocation group headers on the same stripe unit. Failing to
distribute the AG headers across all the stripe units evenly loads
the disks/luns in the stripe unevenly. As soon as you have uneven
load on a stripe the performance tanks, as a stripe is only as fast
as its slowest member.

Also, while XFS prefers to align to stripe unit, there are mount
options to change the default allocation alignment to be stripe
width based. Hence if you have large files and applications that are
doing well formed IO, stripe width alignment of the filesystem to
the underlying block device is critical to achieving deterministic
throughput close to the maximum the hardware can support.....
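
The message does not name the mount options. As an assumption, the sketch
below uses the XFS sunit/swidth options (expressed in 512-byte units) and
swalloc, which rounds allocations to stripe width. The helper name, the
64KiB chunk, the 3 data disks and the /mnt/test mount point are all
illustrative, and the commands are only printed, never run.

```shell
#!/bin/sh
# Hypothetical helper: derive XFS stripe geometry from the RAID layout.
# Usage: xfs_stripe_opts CHUNK_KIB DATA_DISKS
# sunit and swidth are given in 512-byte units.
xfs_stripe_opts() {
    chunk_kib=$1 data_disks=$2
    sunit=$(( chunk_kib * 1024 / 512 ))
    swidth=$(( sunit * data_disks ))
    echo "sunit=${sunit},swidth=${swidth}"
}

opts=$(xfs_stripe_opts 64 3)   # assumed: 64KiB chunks, 3 data disks

# Printed for inspection only; /mnt/test is a placeholder mount point.
echo "mkfs.xfs -d su=64k,sw=3 /dev/md8"
echo "mount -o ${opts},swalloc /dev/md8 /mnt/test"
```

Whether swalloc helps depends on the workload; it is aimed at the large
file, well formed IO case described above.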

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Mark Lord on
On 03/07/10 22:48, Tejun Heo wrote:
...
> Please note that hdparm is misreporting the alignment offset. It
> should be reporting 512 instead of 256 for offset-by-one drives.
...

That issue was fixed quite a while ago.
Upgrade your elderly copy of hdparm.

:)
From: Daniel Taylor on

hpa> I would very much like a reference for a platform which has
hpa> firmware which can successfully boot from 4K-logical media. It
hpa> would be very useful for bootloader testing.


I am told that the Mac UEFI platform will boot from 4K logical/physical
drives.

Now I have to scrounge one of the old drives to test it.
From: Greg Freemyer on
<snip>
>
> As far as partitioning... I believe we should be using GPT partition tables
> where possible. Even on non-EFI systems, it's simply a much better
> partition table format.
>
>        -hpa

GPT cannot be used for boot disks in non-EFI systems, right?

Greg