From: H. Peter Anvin on
On 03/08/2010 07:41 AM, Martin K. Petersen wrote:
>>>>>> "Martin" == Martin K Petersen <martin.petersen(a)oracle.com> writes:
>
>>>>>> "Martin" == Martin K Petersen <martin.petersen(a)oracle.com> writes:
> Martin> There are 4 KB LBS SSDs out there but in general the industry is
> Martin> sticking to ATA for local boot.
>
> Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
> Martin> people stick to the tried-and-true 512.
>
> *sigh* I haven't had my breakfast tea yet...
>
> What I meant to say was that I know ATA supports 4 KB LBS and that
> nobody appears to care about it.
>

Well, apparently Western Digital are looking at it for USB drives due to
XP compatibility requirements -- those presumably are ATA internally and
use a USB-ATA bridge.

On the flipside, though, there really is very little net benefit to 4K
as opposed to 512 byte logical sectors: the additional protocol overhead
is relatively minimal, and as long as writes are aligned full blocks,
there shouldn't be any additional overhead on either the OS or the drive
side. On the plus side, you get full compatibility with the existing
software stack. The equation really seems rather simple.

-hpa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on
On Mon, 2010-03-08 at 10:50 -0800, H. Peter Anvin wrote:
> On 03/08/2010 07:41 AM, Martin K. Petersen wrote:
> >>>>>> "Martin" == Martin K Petersen <martin.petersen(a)oracle.com> writes:
> >
> >>>>>> "Martin" == Martin K Petersen <martin.petersen(a)oracle.com> writes:
> > Martin> There are 4 KB LBS SSDs out there but in general the industry is
> > Martin> sticking to ATA for local boot.
> >
> > Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
> > Martin> people stick to the tried-and-true 512.
> >
> > *sigh* I haven't had my breakfast tea yet...
> >
> > What I meant to say was that I know ATA supports 4 KB LBS and that
> > nobody appears to care about it.
> >
>
> Well, apparently Western Digital are looking at it for USB drives due to
> XP compatibility requirements -- those presumably are ATA internally and
> use a USB-ATA bridge.
>
> On the flipside, though, there really is very little net benefit to 4K
> as opposed to 512 byte logical sectors: the additional protocol overhead
> is relatively minimal, and as long as writes are aligned full blocks,
> there shouldn't be any additional overhead on either the OS or the drive
> side. On the plus side, you get full compatibility with the existing
> software stack. The equation really seems rather simple.

There's another problem that afflicts 4k drives emulating 512b: they
have to do a read modify write for any isolated 512b write ... that
leads to potential corruption of adjacent 512b blocks if power is lost
at the moment the write is being done. Since most Linux filesystems are
4k sectors, misalignment really hammers this, plus most journal writes
seem to be done in 512 byte increments. I suppose for USB this could be
regarded as flakey as usual, though.
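
As a rough illustration of the read-modify-write case described above, the sketch below lists which physical sectors a given write only partly covers. It assumes 4096-byte physical sectors and that logical block 0 sits on a physical sector boundary; the function name `rmw_sectors` is hypothetical and not taken from any kernel or drive firmware code.

```python
def rmw_sectors(offset, length, physical=4096):
    """Return the indices of physical sectors that the byte range
    [offset, offset + length) covers only partially.

    Each such sector would need a read-modify-write on a drive that
    emulates 512-byte logical sectors on top of larger physical ones.
    Assumption: physical sector 0 starts at byte 0 (no alignment offset).
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // physical
    last = (end - 1) // physical
    partial = []
    if offset % physical:
        # The write starts in the middle of a physical sector.
        partial.append(first)
    if end % physical and last not in partial:
        # The write ends in the middle of a physical sector.
        partial.append(last)
    return partial


# An isolated 512-byte write touches one physical sector partially.
print(rmw_sectors(512, 512))
# An aligned 4 KiB write needs no read-modify-write at all.
print(rmw_sectors(0, 4096))
# A 4 KiB write misaligned by 2 KiB hits two physical sectors partially.
print(rmw_sectors(2048, 4096))
```

In this model, the other 512-byte logical blocks that share a partially written physical sector are the ones exposed if power is lost mid-write.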

James


From: H. Peter Anvin on
On 03/08/2010 10:58 AM, James Bottomley wrote:
>>
>> On the flipside, though, there really is very little net benefit to 4K
>> as opposed to 512 byte logical sectors: the additional protocol overhead
>> is relatively minimal, and as long as writes are aligned full blocks,
>> there shouldn't be any additional overhead on either the OS or the drive
>> side. On the plus side, you get full compatibility with the existing
>> software stack. The equation really seems rather simple.
>
> There's another problem that afflicts 4k drives emulating 512b: they
> have to do a read modify write for any isolated 512b write ... that
> leads to potential corruption of adjacent 512b blocks if power is lost
> at the moment the write is being done. Since most Linux filesystems are
> 4k sectors, misalignment really hammers this, plus most journal writes
> seem to be done in 512 byte increments. I suppose for USB this could be
> regarded as flakey as usual, though.
>

Misalignment sucks in general. This is nothing new - the RAID and flash
people have had these problems for a long time now. It's clear we need
to align our filesystems, period.

As to the read-modify-write issue: to some degree there is very little
you can do about it other than a big enough capacitor. If you can't
write a sector atomically and have it stick, you're screwed no matter what.

-hpa
From: Mike Snitzer on
On Mon, Mar 8, 2010 at 10:18 AM, Martin K. Petersen
<martin.petersen(a)oracle.com> wrote:
>>>>>> "Tejun" == Tejun Heo <tj(a)kernel.org> writes:
>
> Tejun> The [Windows Vista/7] partitioner seems to be using 1M as the
> Tejun> basic alignment unit and offsetting from there if explicitly
> Tejun> requested by the drive
>
> Yep.
>
>
> Tejun> Please note that hdparm is misreporting the alignment offset. It
> Tejun> should be reporting 512 instead of 256 for offset-by-one drives.
>
> Already fixed. Your hdparm must be old.
>
>
>
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
>
> I don't think we take the partition type into account. Karel?
>
>
> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't
> Tejun> done properly for drives with 4 KiB physical sectors. 4 KiB
> Tejun> logical sector support is broken in both the kernel
>
> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.
>
>
> Tejun> (need more details and probably a whole section on partitioner
> Tejun> behaviors)
>
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively. Karel, Jim: The full
> writeup is here:
>
> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> It'd be great if you guys could share what you have been doing to the
> tooling.

I've been keeping track of all the pieces in play, have coordinated
with kzak and jim, and have a summary that offers some amount of macro
detail (at the end I touch on parted and fdisk):

http://people.redhat.com/msnitzer/docs/io-limits.txt
From: Karel Zak on
On Mon, Mar 08, 2010 at 10:18:27AM -0500, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <tj(a)kernel.org> writes:
> Tejun> Partitioners maybe should only align partitions which will be
> Tejun> used by Linux and default to the traditional layout for others
> Tejun> while allowing explicit override.
>
> I don't think we take the partition type into account. Karel?

Yes, you're right.

(IMHO our goal should be to minimize the number of places where anything
depends on partition type.)

> Tejun> Reportedly, commonly used partitioners aren't ready to handle
> Tejun> drives larger than 2 TiB in any configuration and alignment isn't

The limit is specific to the DOS partition table (with 512-byte logical
sectors); GPT, for example, uses 64-bit LBAs. I believe that our
partitioning tools don't introduce any other restrictions.
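
The arithmetic behind the commonly quoted limit can be sketched as follows. It relies on the DOS partition table storing the start LBA and the sector count of each partition in 32-bit fields; the helper name `dos_table_limit_bytes` is made up for illustration, and the figure is the usual rule of thumb rather than an exact bound for every layout.

```python
TIB = 2**40


def dos_table_limit_bytes(logical_sector_size=512):
    """Rule-of-thumb capacity addressable through a 32-bit LBA field.

    Assumption: the limit is taken as 2**32 logical sectors, which is
    the figure usually quoted for the DOS partition table.
    """
    return 2**32 * logical_sector_size


# 512-byte logical sectors give the familiar 2 TiB ceiling.
print(dos_table_limit_bytes(512) // TIB, "TiB")
# With 4 KiB logical sectors the same 32-bit field reaches 16 TiB.
print(dos_table_limit_bytes(4096) // TIB, "TiB")
```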

> Tejun> done properly for drives with 4 KiB physical sectors. 4 KiB
> Tejun> logical sector support is broken in both the kernel
>
> Huh, what? My homedir is on a 4KiB LBS/PBS drive and has been for ~2
> years.
>
>
> Tejun> (need more details and probably a whole section on partitioner
> Tejun> behaviors)
>
> I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
> alignment work for fdisk and parted respectively. Karel, Jim: The full
> writeup is here:
>
> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> It'd be great if you guys could share what you have been doing to the
> tooling.

small summary:

- libblkid provides a unified API to topology information; it supports:
    - ioctls (kernel >= 2.6.32)
    - sysfs (kernel >= 2.6.31)
    - stripe chunk size and stripe width for DM, MD, LVM and evms on
      old kernels
- libparted and fdisk are linked against libblkid

- fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
- fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
- fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
  and alignment_offset for all partitions in non-DOS mode
  (util-linux-ng >= 2.17.1)

- parted supports 4KiB physical sector size
- parted uses 1MiB alignment for disks with unknown topology, disks
  with topology information are aligned to optimal (or minimum) I/O
  size (parted >= 2.1)

- EFI GPT code in the kernel has been updated to work properly with
  4KiB sectors (kernel >= 2.6.33)

- mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
  topology information, mkfs.{ext,xfs} are linked against libblkid
  for compatibility with old kernel (for stripe chunk size / width)

- Fedora-13/RHEL6 installer uses libparted with 4KiB support

- alignment_offset & 4KiB support is planned for LUKS (cryptsetup)
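
For readers who want to look at the raw topology values without libblkid, a minimal sketch of the sysfs route mentioned in the summary might look like this. It assumes the attribute files exported by kernels >= 2.6.31 under `/sys/block/<dev>/`; the function name `read_topology` and the `sysfs_root` parameter are hypothetical, the latter added only so the code can be pointed at a test directory.

```python
from pathlib import Path


def read_topology(dev, sysfs_root="/sys/block"):
    """Read I/O topology values for a block device from sysfs.

    All values are in bytes. Assumption: the kernel exports these
    attributes (2.6.31 or later); older kernels will raise
    FileNotFoundError here.
    """
    base = Path(sysfs_root) / dev

    def read_int(rel):
        return int((base / rel).read_text().strip())

    return {
        "logical_block_size": read_int("queue/logical_block_size"),
        "physical_block_size": read_int("queue/physical_block_size"),
        "minimum_io_size": read_int("queue/minimum_io_size"),
        "optimal_io_size": read_int("queue/optimal_io_size"),
        "alignment_offset": read_int("alignment_offset"),
    }


if __name__ == "__main__":
    import sys

    # Example: python topology.py sda
    if len(sys.argv) > 1:
        for key, value in read_topology(sys.argv[1]).items():
            print(f"{key}: {value}")
```

A drive with 4 KiB physical sectors and 512-byte emulation would be expected to report 512 for `logical_block_size` and 4096 for `physical_block_size`.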

> Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
> Tejun> or logical too, is looking fairly ugly. Hopefully, a reasonable
> Tejun> solution can be reached in not too distant future but even with
> Tejun> all the software side updated, it looks like it's gonna cause
> Tejun> significant amount of confusion and frustration.
>
> With regards to XP compatibility I don't think we should go too much out
> of our way to accommodate it. XP has been disowned by its master and I
> think virtualization will take care of the rest.
>
> FWIW, recent fdisk has a command line flag that will enable/disable DOS
> compatible layout.

yes, util-linux-ng 2.17.1, fdisk -c

Note that non-DOS mode will be default in the next major
util-linux-ng release.
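
The non-DOS alignment rule from the summary above (1MiB, or more if the optimal I/O size is bigger, plus alignment_offset) can be sketched roughly as below. This is only an illustration of the rule as stated in this mail, not fdisk's actual code; the function name `aligned_start` and its parameters are assumptions.

```python
MIB = 1024 * 1024


def aligned_start(min_start, optimal_io_size=0, alignment_offset=0, grain=MIB):
    """Return the smallest byte offset >= min_start that has the form
    k * grain + alignment_offset.

    The grain is 1MiB unless the device reports a larger optimal I/O
    size. All arguments are in bytes.
    """
    grain = max(grain, optimal_io_size)
    # Ceiling division, clamped so that k never goes negative.
    k = max(0, -(-(min_start - alignment_offset) // grain))
    return k * grain + alignment_offset


# First partition on a plain disk: start at 1MiB (LBA 2048 with
# 512-byte logical sectors).
print(aligned_start(512) // 512)
# A device reporting a 4MiB optimal I/O size pushes the grain up.
print(aligned_start(512, optimal_io_size=4 * MIB) // MIB, "MiB")
```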

Karel

--
Karel Zak <kzak(a)redhat.com>