From: Theodore Tso on
On Mar 10, 2010, at 11:19 AM, Damian Lukowski wrote:
>
> I have practically no knowledge of Linux' block device drivers,
> but is this really a partitioning issue? I think the problem is
> with the filesystems when clustering multiple blocks without
> knowledge of the sector alignment and sector size of the underlying
> block device. Maybe it is a better solution to adapt the filesystem
> buffer routine which reads/writes data from/to the block device?

No, it's really a partitioning issue. If the paging subsystem wants a 4k block to fill a particular page, we need to read that 4k block into memory. If we need to swap out that 4k block, we need to write that 4k block to swap space, or to the memory segment's backing store. If the partition is misaligned by 512 bytes, this is simply not possible. The file system has to do what is requested of it by its users, and the reality is that we need to do 4k aligned reads and writes with respect to the beginning of the partition, far more often than not.

Hence, if we want the best performance on 4k sector drives, the partition needs to be aligned with respect to what is most desirable for the device in question.
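
To make the arithmetic concrete, here is a minimal Python sketch. It assumes 512-byte logical sectors on a drive with 4096-byte physical sectors. The function name and the example start sectors (64 and 63) are illustrative assumptions, not something taken from a real tool.

```python
LOGICAL = 512    # bytes per logical sector (assumed)
PHYSICAL = 4096  # bytes per physical sector (assumed)

def physical_sectors_touched(part_start_lba, fs_block, fs_block_size=4096):
    """Return the range of physical sectors covered by one filesystem block.

    part_start_lba is the partition's first logical sector; fs_block is the
    block number counted from the beginning of the partition.
    """
    start_byte = part_start_lba * LOGICAL + fs_block * fs_block_size
    end_byte = start_byte + fs_block_size - 1
    return range(start_byte // PHYSICAL, end_byte // PHYSICAL + 1)

# Partition starting at LBA 64: its byte offset is a multiple of 4096.
print(len(physical_sectors_touched(64, 0)))
# Partition starting at LBA 63: it begins 512 bytes short of a boundary.
print(len(physical_sectors_touched(63, 0)))
```

With the aligned start, every 4k block lands on exactly one physical sector. With the start that is off by 512 bytes, every 4k block straddles two physical sectors, so a 4k write cannot be done as one whole-sector write on the device.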

Best regards,

-- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nikanth Karthikesan on
On Thursday 11 March 2010 18:34:56 Theodore Tso wrote:
> On Mar 10, 2010, at 11:19 AM, Damian Lukowski wrote:
> > I have practically no knowledge of Linux' block device drivers,
> > but is this really a partitioning issue? I think the problem is
> > with the filesystems when clustering multiple blocks without
> > knowledge of the sector alignment and sector size of the underlying
> > block device. Maybe it is a better solution to adapt the filesystem
> > buffer routine which reads/writes data from/to the block device?
>
> No, it's really a partitioning issue. If the paging subsystem wants a 4k
> block to fill a particular page, we need to read that 4k block into
> memory. If we need to swap out that 4k block, we need to write that 4k
> block to swap space, or to the memory segment's backing store. If the
> partition is misaligned by 512 bytes, this is simply not possible. The
> file system has to do what is requested of it by its users, and the
> reality is that we need to do 4k aligned reads and writes with respect to
> the beginning of the partition, far more often than not.
>
> Hence, if we want the best performance on 4k sector drives, the partition
> needs to be aligned with respect to what is most desirable for the device
> in question.
>

I guess what he meant was to keep filesystem blocks aligned even if the
partition is not. Say the partition is misaligned by 512 bytes: let the
filesystem waste 4k - 512 bytes and keep its blocks aligned. But that might be
a case of over-engineering, possibly requiring a disk format change.
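
As a sketch of the padding this would imply, assuming 4096-byte physical sectors (the function name is hypothetical, and no filesystem is claimed to work this way):

```python
def fs_padding(part_start_bytes, physical=4096):
    """Bytes a filesystem would have to skip at the start of the partition
    so that its block 0 begins on a physical-sector boundary."""
    return (-part_start_bytes) % physical

# A partition that starts 512 bytes past a 4k boundary.
print(fs_padding(8 * 4096 + 512))
# A partition that already starts on a 4k boundary needs no padding.
print(fs_padding(8 * 4096))
```

For the partition that starts 512 bytes past a boundary, the result is 3584, which is the 4k - 512 bytes mentioned above.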

Thanks
Nikanth
From: Theodore Tso on

On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>
> I guess what he meant was to keep filesystem blocks aligned even if the
> partition is not. Say the partition is misaligned by 512 bytes: let the
> filesystem waste 4k - 512 bytes and keep its blocks aligned. But that might be
> a case of over-engineering, possibly requiring a disk format change.

Ah, yes, I agree with you; that's probably what he meant.

Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.

It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.
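
A minimal sketch of what "the right thing" could look like in a partitioning tool, assuming 512-byte logical and 4096-byte physical sectors and a drive whose LBA 0 sits on a physical-sector boundary. The function name is hypothetical and this is not the logic of any particular tool.

```python
def aligned_start_lba(requested_lba, logical=512, physical=4096):
    """Return the smallest LBA >= requested_lba whose byte offset is a
    multiple of the physical sector size."""
    start = requested_lba * logical
    remainder = start % physical
    if remainder:
        # Round up to the next physical-sector boundary.
        start += physical - remainder
    return start // logical

# A requested start at sector 63 is pushed forward to sector 64.
print(aligned_start_lba(63))
# A start that is already aligned is left alone.
print(aligned_start_lba(2048))
```

The cost of rounding up is at most one physical sector of unused space per partition, and no filesystem has to change.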

-- Ted

From: Mike Snitzer on
On Thu, Mar 11, 2010 at 9:28 AM, Theodore Tso <tytso(a)mit.edu> wrote:
>
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>>
>> I guess what he meant was to keep filesystem blocks aligned even if the
>> partition is not. Say the partition is misaligned by 512 bytes: let the
>> filesystem waste 4k - 512 bytes and keep its blocks aligned. But that might be
>> a case of over-engineering, possibly requiring a disk format change.
>
> Ah, yes, I agree with you; that's probably what he meant.
>
> Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.
>
> It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.

Yes, the currently supported approach is to rely on the partitioning tools
(parted, fdisk) or LVM to account for 'alignment_offset'.

This avoids having a filesystem add its own padding (a format change).
But e2fsprogs at least warns if the device it is asked to format has an
alignment_offset != 0.
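
For illustration, a small Python sketch of a similar check done from user space. It assumes the kernel exposes the value as an 'alignment_offset' attribute under /sys/class/block/<name>/; the function names and the sysfs_root parameter are hypothetical helpers, and this is not how e2fsprogs itself does it.

```python
from pathlib import Path

def alignment_offset(dev_name, sysfs_root="/sys/class/block"):
    """Return the reported alignment_offset in bytes for a disk or partition
    name such as "sda1", or None if the attribute cannot be read."""
    attr = Path(sysfs_root) / dev_name / "alignment_offset"
    try:
        return int(attr.read_text().strip())
    except (OSError, ValueError):
        return None

def is_misaligned(dev_name, sysfs_root="/sys/class/block"):
    """True only when an offset was read and it is not zero."""
    offset = alignment_offset(dev_name, sysfs_root)
    return offset is not None and offset != 0
```

A caller could print a warning when is_misaligned("sda1") returns True before formatting that partition. The sysfs_root parameter exists only so the helper can be pointed at a test directory.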

Mike
From: James Bottomley on
On Thu, 2010-03-11 at 09:28 -0500, Theodore Tso wrote:
> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
> >
> > I guess what he meant was to keep filesystem blocks aligned even if the
> > partition is not. Say the partition is misaligned by 512 bytes: let the
> > filesystem waste 4k - 512 bytes and keep its blocks aligned. But that might be
> > a case of over-engineering, possibly requiring a disk format change.
>
> Ah, yes, I agree with you; that's probably what he meant.
>
> Sure, that's theoretically possible, but it would mean changing every
> single filesystem, and it would require a file system format change
> --- or at least a file system format extension.
>
> It would seem to be way easier to simply fix the partitioning tools to
> do the right thing, though.

Actually, it's a layering violation. The filesystem shouldn't need to
probe the device layout ... particularly when there are complexities
such as whether the 512 bytes is the logical or the physical sector size
and, for logical 512 on 4k, whether there is an offset exponent or not.

We can transmit certain abstractions of information up the stack (like
stripe width for RAID arrays, which should be the fs optimal write size),
but for this type of alignment, which can be completely solved at the
partition layer, the information should really stay there and the
filesystem should "just work".

James

